Welcome to EnviroDIY, a community for do-it-yourself environmental science and monitoring. EnviroDIY is part of WikiWatershed, an initiative of Stroud Water Research Center designed to help people advance knowledge and stewardship of fresh water.

Stability Testing ~ how to do it?


    • #14976
      neilh20
      Participant

        I’m just wondering who might be doing system stability/reliability testing, and what their setup is.

        By system stability testing I mean a test setup that represents real-world conditions and exercises a majority of the core code.

        Traditionally, in open source this is such a squishy area that it requires a community of supporters to exercise it.  In commercial organizations, this is typically performed by a quality assurance group that reports to a different level of management than the software designers.

        My core stability test setup consists of a Mayfly 0.5b with a Digi XB2B WiFi module, an AM2320 temperature/humidity sensor, a 4Ah battery, a uSD card, and a modified FTDI monitoring cable.  My 2nd stability system consists of the same but with a Digi LTE modem.

        The modified FTDI monitoring cable provides non-intrusive monitoring of system status.

        Mayfly 0.5b is the latest, and for want of any other definition, only the latest revision of hardware should be used.

        Digi WiFi XB2B: a stable local WiFi network may provide the most reliable access to the internet, and WiFi gives the ability to manage gateway error conditions. A big advantage of WiFi is the unmetered connection.

        4AHr LiIon Adafruit-354: charging temperature 0~45C, discharge -20~60C.  I’ve been having problems with 2Ah batteries in cold temperatures that I can’t explain.

        Pimoroni AM2320 (Adafruit-3721): included so that the I2C subsystem is exercised.

        By real-world conditions I mean varying equipment temperature (0C, or down to -5C, up to 40C is OK), varying wireless connectivity, and varied solar, including no solar until the battery is exhausted.

        This generally looks like this – https://github.com/neilh10/ModularSensors/wiki/Testing-overview

        The other part of a stable/reliable system is having a clear set of test objectives with a repeatable build process to produce software, which I believe I have. I always start from no libs and let the build pull in everything.

        I am building off 0.27.0. While ModularSensors is amazing, it has a diverse ecosystem of supporting modules, and a reliable reference system is the basis of any reliability discussion.

         

        I had been hoping to get a stability test for 0.27.0 done on my standard system over the holidays, but after a week of running 24-48 hour tests, finding some issues, tweaking, and trying again, all I can say is that I am running into issues on the basic stability system.

        I did have some pretty good stability tests in the past (0.25?), so now I’m questioning those, and thinking about how I might check them out again. One big difference from my last tests is that it’s a mild winter, with temperatures dropping closer to 0C than before.

        The issues are that the XB2B hybrid is not always responding, and I’m getting some unexpected RESETs (reboots).

        The XB2B hybrid seems to get busy, and a series of “+++” do not result in a response. I’ve solved some of them with some strategic delay() calls, but I think I need to explore TinyGSM 0.10.9.
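A minimal sketch of retrying the “+++” escape with quiet time around it, written as portable C++ so it can be tried on a desktop. XBee modules only accept “+++” when the serial line has been silent for a guard time before and after it (1 second by default) and then answer “OK” followed by a carriage return. The `ModemPort` struct, `enterCommandMode`, and the 1100 ms value are hypothetical names and settings for illustration; this is not how ModularSensors or TinyGSM implement it.

```cpp
#include <functional>
#include <string>

// Hypothetical serial hooks; on a Mayfly these would wrap the modem UART.
struct ModemPort {
    std::function<void(unsigned ms)> quiet;               // keep the line silent
    std::function<void(const std::string&)> send;         // write bytes to the modem
    std::function<std::string(unsigned timeoutMs)> read;  // reply, or "" on timeout
};

// Try to enter command mode. Returns the attempt number that succeeded
// (1..maxTries), or 0 if the modem never answered "OK\r".
int enterCommandMode(ModemPort& port, int maxTries, unsigned guardMs = 1100) {
    for (int attempt = 1; attempt <= maxTries; ++attempt) {
        port.quiet(guardMs);   // guard time before the escape sequence
        port.send("+++");      // no carriage return after "+++"
        // Nothing is sent while waiting, so this also covers the trailing guard time.
        if (port.read(guardMs + 1000) == "OK\r") {
            return attempt;
        }
    }
    return 0;
}
```

Logging the returned attempt number over a long soak test would show whether the modem is merely slow to answer or has stopped answering altogether.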

        I’m getting periodic resets. At this point I can only think it is the watchdog kicking in. However, since it doesn’t advertise when it’s going to bite, I need to do some work on this, and also on how to figure out what caused the RESET (read MCUSR early). However, after a reset the XB2B hybrid is sometimes not able to find the WiFi SSID for a couple of hours. If I turn off the power and then power it back up, it connects immediately.

        The link to data.monitormywatershed.org often doesn’t get a response over many POSTs. I’m sort of assuming this may mostly be on the MMW side, but it could also be the XB2B links not always coming up, so I’m trying to figure out a reliable configuration. There may be something happening with the MMW response that then makes the XB2B busy, such that it doesn’t respond to “+++”.

        So really just putting this out there to see if anybody else on a defined release has got a setup that they are getting some good long-term results with?  🙂
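One way to act on “read MCUSR early” is to copy the register at the very start of setup(), clear it, and decode the copy later. The sketch below shows only the decoding step, as portable C++ so it can be run on a desktop. The bit positions follow the ATmega1284P datasheet; `describeResetCause` and the label strings are hypothetical names. On the Mayfly itself the value would come from something like `uint8_t saved = MCUSR; MCUSR = 0;`, and a bootloader may already have cleared the register before the sketch runs, which is worth checking.

```cpp
#include <cstdint>
#include <string>

// Reset-flag bit positions in the AVR MCUSR register (ATmega1284P).
constexpr uint8_t kPORF  = 1u << 0;  // power-on reset
constexpr uint8_t kEXTRF = 1u << 1;  // external reset pin
constexpr uint8_t kBORF  = 1u << 2;  // brown-out reset
constexpr uint8_t kWDRF  = 1u << 3;  // watchdog reset
constexpr uint8_t kJTRF  = 1u << 4;  // JTAG reset

// Hypothetical helper: turn a saved MCUSR value into a readable label.
// Several flags can be set at once, so the names are joined with '+'.
std::string describeResetCause(uint8_t mcusr) {
    std::string out;
    auto add = [&out](const char* name) {
        if (!out.empty()) out += "+";
        out += name;
    };
    if (mcusr & kPORF)  add("POWER_ON");
    if (mcusr & kEXTRF) add("EXTERNAL");
    if (mcusr & kBORF)  add("BROWN_OUT");
    if (mcusr & kWDRF)  add("WATCHDOG");
    if (mcusr & kJTRF)  add("JTAG");
    if (out.empty()) out = "UNKNOWN";  // no flag set, e.g. already cleared
    return out;
}
```

Printing this label on every boot makes it possible to tell a watchdog bite from a brown-out caused by a sagging battery.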

         

      • #14995
        neilh20
        Participant

          Some status – the periodic Mayfly RESETs turned out, I think, to be happening when polling the Insitu LT500 gauge over SDI12. https://github.com/EnviroDIY/ModularSensors/issues/344

          The new code breaks the polling of the Insitu LT500; I’m debugging under various scenarios to try to understand it.
          https://github.com/EnviroDIY/ModularSensors/issues/346

          I’m seeing instability on the WiFi S6B hybrid that I didn’t see on the 0.25.0 release. It is repeatable, but doesn’t make a lot of sense to me at this point, so maybe it’s driver error. https://github.com/EnviroDIY/ModularSensors/issues/347
          My next step is to try and do a poweroff of the WiFi S6B instead of Sleep using the LTE Bee Adapter.

        • #15009
          neilh20
          Participant

            Well, after running for some time with the WiFi S6B, and still having problems after it has been running for a couple of hours,
            I’ve switched to one of my target field internet comms options, the Digi XBee3 LTE-M over Verizon.

            Very interesting: the Digikey RevXsystems https://dataplans.digikey.com (thanks to @mbarney for the suggestion) has a new low cost 50M/month CAT-M1 plan.
            Wow, this is excellent! Based on some recent past experience with a marginal/failing system, this is going to be worthwhile.

            Since this was a new XBee3 LTE, I upgraded the XBee3 to the latest software using the Digi “TH Development” board. Then, with it on the LTE adapter on the Mayfly, I plugged in the battery. At its core the sw is @srgdamiano’s hard sweat in DigiXBeeCellularTransparent.cpp, though I have modified it to show the connecting process to the cell network.

            Since connecting for the first time over LTE and Verizon is such an occasional event, I thought I would paste in the trace.

            It connected first time… (whew!)

            Attempting to connect to the internet and synchronize RTC with NIST
            This may take up to two minutes!
            Lte internet comms with Digi XBee3 Cellular LTE-M IMEI OK HwVer 4B48 FwVer 11417
            Loop=Sec] rx db : Status ‘ Operator ‘ #Polled Cell Status every 1sec
            0=7.89] 0:0x22 ‘OK’
            1=8.91] 0:0x22 ‘OK’
            2=9.93] 0:0xff ‘OK’
            WATCHDOG ISR barksUntilReset 149 <–WatchDogAVR
            3=10.95] 0:0xff ‘OK’
            4=11.97] 0:0xff ‘OK’
            5=12.99] 0:0xff ‘OK’
            6=14.01] 0:0xff ‘OK’
            7=15.04] 0:0x22 ‘OK’
            8=16.06] 0:0x22 ‘OK’
            9=17.08] 0:0x22 ‘OK’
            10=18.10] 0:0x22 ‘OK’
            Try +CREG ‘
            11=19.13] 0:0xe ’22’
            WATCHDOG ISR barksUntilReset 148 <–WatchDogAVR
            12=20.15] 0:0x0 ’22’ Cnt=1
            13=21.17] 0:0x0 ’22’ Cnt=2
            14=22.19] 0:0x0 ’22’ Cnt=3
            Digi Xbee3 setup Sucess. Registration ‘ 0 ‘
            mdmIP[ 1 / 16 ] ‘ 0.0.0.0 ‘= 7
            mdmIP[ 2 / 16 ] ‘ 0.0.0.0 ‘= 7
            WATCHDOG ISR barksUntilReset 147 <–WatchDogAVR
            mdmIP[ 3 / 16 ] ‘ 0.0.0.0 ‘= 7
            mdmIP[ 4 / 16 ] ‘ 0.0.0.0 ‘= 7
            mdmIP[ 5 / 16 ] ‘ 100.104.156.99 ‘= 14
            XbeeWLTE IP# [ 100.104.156.99 ]
            0 ] Connect time.nist.gov
            WATCHDOG ISR barksUntilReset 146 <–WatchDogAVR
            WATCHDOG ISR barksUntilReset 145 <–WatchDogAVR
            1 ] Connect time.nist.gov
            WATCHDOG ISR barksUntilReset 144 <–WatchDogAVR
            NIST responded after 2562 ms
            Internal Clock within 5 seconds of NIST.
            Putting modem to sleep

          • #15011
            neilh20
            Participant

              I’ve received more “LTE Bee Adapter” cards, and I’m looking to use them to investigate why the XBee WiFi S6B is not reliably connecting to the local WiFi network. https://github.com/neilh10/ModularSensors/issues/21

              The LTE Bee Adapter card provides power directly from the LiIon battery, control of the Xbee reset, and potentially also an Xbee power OFF capability.

              The WiFi S6B is specified for 3.14 to 3.46V so this isn’t a good long term solution. The LiIon Battery can be up to 4.2V.

              Part of my testing is to let the LiIon battery discharge, as might be expected in the field with little sun, and then incrementally charge it back up as solar is available, possibly over days. Slowly varying power availability is one of the most difficult parts of powering (and testing).  This appears to be causing some unreliability, and I haven’t yet been able to identify if it is something in my setup or something else.

              Part of the LiIon battery discharge characteristic is that its voltage drops and its internal impedance rises. The priority is to keep the Mayfly running, with good traceable wall time, taking sensor readings (with wall time) and then transmitting (when power is available) to the internet. The rate of voltage drop, and of impedance rise, depends on the capacity of the LiIon battery.  I’m standardizing on a 4AHR outdoor (-10C?) rated battery. LiIon impedance also rises as temperature drops. So there is a narrow window in which the battery, as measured by its voltage, can support the highest dynamic power demand, which is typically when using RF power.  In the real world, discharging a 4AHR battery can take a week, which is normally a good thing, but for testing I’m having to be creative.

              So the first part of the test with the WiFi S6B/LTE Bee Adapter was to see if it would get into the state of not connecting to the WiFi network, and if it did, whether the RESET would bring it out.  However, over an overnight/24hr run it has connected to the WiFi every time as expected. So that’s a good thing. (Though MMW POSTs gave me a “201” in 500mS about 1-in-5 times, with the more typical no-response timeout being 3000mS.)

              So I’m going to go back to standard powering of the WiFi, but with logic sensors on the WiFi S6B, and also add 0.1uF ceramic decoupling capacitance directly to the WiFi module pins, which will allow me to monitor the Vcc as well as the logic sensors.

               

            • #15017
              neilh20
              Participant

                I’m testing the stability of the Mayfly with the Digi WiFi S6B, with software using the 0.27.5 base, running off the LiPo battery.

                Some of the WiFi S6B hybrids initially connect to the WiFi network, get NIST time, and then later, after a couple of cycles of sleeping/waking, will no longer connect to the WiFi. I’ve connected a Saleae Logic Analyzer (8 channels) to a number of pins on the WiFi S6B, and it’s showing problems with the power rail. The S6B hybrid has a tight specification for Vcc of 3.14 to 3.46V.

                With power supplied from the USB +5V rail (500mA) + LiIon, the Saleae analog channel shows the Vcc at 3.266V, and when sleep req is activated, there is a 20uS glitch to 3.118V.
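As a worked check of those readings against the 3.14 to 3.46V window, here is a small portable C++ sketch. The struct and function names are hypothetical; the numbers are the ones quoted above.

```cpp
// Supply-voltage window for the module, in volts.
struct VccWindow {
    double minV;
    double maxV;
};

// S6B Vcc limits as quoted in the post above.
constexpr VccWindow kS6bVcc{3.14, 3.46};

// True when a measured voltage lies inside the window (limits included).
bool inSpec(double volts, VccWindow w) {
    return volts >= w.minV && volts <= w.maxV;
}

// Headroom above the lower limit in volts; negative means below spec.
double marginToLowLimit(double volts, VccWindow w) {
    return volts - w.minV;
}
```

With these numbers the steady 3.266V reading sits about 126mV above the lower limit, while the 3.118V glitch is about 22mV below it, so the glitch is out of specification even though it only lasts 20uS.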

                RF devices require good decoupling, and the LTE adapter has this decoupling. I’ve modified an LTE adapter to take power from the Mayfly Vcc (nominally 3.3V) and feed it into the LTE adapter’s power socket. I’ve also added a large 680uF capacitor with a low ESR of 68mOhms to the 2-pin power socket (Wurth 860080274013).

                I expect this to smooth any power surges from the S6B.

                With LiIon battery power, nominally 4.2V, and this carrier board, on power-up after reset there are some extended power dips of 4mS, from 3.229V to 3.06V.

                 

                After running for over 8 hours on a speedy soak test cycle of sleeping/waking, taking readings and POSTing to MMW every 2 minutes successfully, it starts failing to connect to the local WiFi. It was left running for the next 48 hours over the weekend, and failed to reconnect.

                Just wondering if there are any suggestions?

                 

                 

                 

              • #15051
                neilh20
                Participant

                  The WiFi S6B Vcc spec is very tight at  3.14-3.46V and the earlier trace showed that when the WiFi comes out of sleep, the current demand on the Vcc could be pulling it out of specification. In order to check if the Vcc is causing the Xbee S6B WiFi modem a problem, I’ve put together a separate regulator based on the TCR3DF335,LM which regulates to 3.35V and can pulse to 400mA, with a normal spec of 300mA.

                  SMT is so nice when you have the right parts. The TCR3D fits on an SC74 prototyping board with 1uF decoupling, and can fit between the LiIon battery at 4.2V and the LTE Adapter board that carries the S6B and plugs into the Mayfly.

                  Looking at the trace, the Vcc power is much smoother and now well within specification. The Saleae Vcc probe on the S6B Vcc measures close to 3.35 (3.26V), then when the S6B turns on it drops to 3.31V. When the S6B draws power, it drops to 3.28V for up to 1.5mS. All with good headroom above the lower limit of 3.14V.

                  [Image: trace of the S6B initialization sequence]

                  So now just need to let it run for a few days to see if it makes a difference.

                   

                • #15062
                  Sara Damiano
                  Moderator

                    Did the extra power smoothing work for you?  I’ve noticed the issues with the WiFi XBees, but I wasn’t having them drop that frequently and we don’t have any “production” loggers deployed with WiFi, so I never bothered to try and fix anything.

                  • #15066
                    neilh20
                    Participant

                      Hi Sara, the short answer is no ~ which is good.

                      I’ve had some other urgent issues come up, and so have had to leave it for a while.   In my test bench trial above, over a couple of days with a 2-minute sleep/wake cycle, 3 runs from reset all failed (#1 after ~5hrs, #2 after 4.5+7.5hrs, #3 after 4hrs), so something is happening after 100+ sleep/wake events.

                      It seems like it must be software tickling the S6B in some way that it doesn’t like, and something changed, but I’m not sure when.  I’m still thinking about it.

                      I do have a system that I want to deploy using WiFi but it’s not a high priority.

                      At the same time another system on 0.27.5, with fixed SDI-12, using Verizon/LTE CAT-M1, has been stable. It’s using 15-minute sensor reporting and a 2-hour update schedule.

                    • #15198
                      neilh20
                      Participant

                        A status for testing with 0.28.01: after integrating to this release with the SDI-12 bug fixed and then leaving it running for two weeks, the system using Verizon LTE CAT-M1 has worked very well.  A deliberate characteristic of this test setup was to have a very limited solar aspect, with a small charge of at most 0.5A in the morning, but the overall power usage has been pretty low.

                        Any usage of the Digi WiFi with 0.28.01 soon hangs, and I plan to look at it.
                        On my fork I also have a BatteryManagementSubsystem combined with a reliable delivery that have worked in combination very well.
                        The following picture summarizes my testing.
                        [Image: Stability testing for 14 days]

                      • #15372
                        neilh20
                        Participant

                          I’ve created a low cost monitoring console using a Raspberry Pi and FTDI cable. This allows a release debug OUTPUT to be monitored just as if it was on the console and compared to https://monitormywatershed.org/

                          https://github.com/neilh10/ModularSensors/wiki/Test-monitoring-host

                           

                           

                        • #15449
                          neilh20
                          Participant

                            An update on my stability testing; writing this partly makes me collect the dates and status.

                            A test system “tu_rc_EC” standalone EC “Stream Disconnect”  monitor built on 0.25.0 has been running since the beginning of Oct very well. I plan on describing this and haven’t done so yet.

                            The “TUCA-NA13” remote Verizon wireless system in the wilds, measuring stream depth with two gauges (Keller and LT500) and built on 0.25.0, stopped recording on MMW on March 28. It started on Jan 29th after a previous outage, so it ran for 2 months. It is very remote and appears to go through periods when the Verizon network has low signal or MMW is not responding, but it has always recovered. A site visit next week might restore it.

                            An early beta “tu_rc_test06” is in my yard, and is a duplicate of the TUCA-NA13 running version 0.28.3. This is from my fork, with extra features, but based on 0.28.3. It stopped running, and I have a terminal on it that caught what happened.

                            So tu_rc_test06 started Apr 1st (it survived Apr 1st), and froze on Apr 19th.  Looking at the log, the Mayfly awoke at

                            … zzzZZ Awake @ 2021-04-19T16:07:00-08:00

                            then POSTED to MMW successfully
                            — Response Code — 201 waited 2107 mS Timeout 5000
                            Going to sleep. Ram( 6127 )  ZZzzz…
                            Watchdog disabled. barksUntilReset 150 <–WatchDogAVR

                            then never woke up.
                            My guess, as a hypothesis, is that the RTC never woke it up.

                            @srgdamiano
                            I wonder if you’ve seen anything like this?

                            Looking at the Sodaq_DS3231 RTC it reinitializes every sleep cycle. In another life, working on a large product, we had some very occasional reliability issues with the I2C bus. When there was an issue it was spectacular, and once happened before a very visible customer. We came up with a workaround.

                            The I2C hardware protocol is not a guaranteed transaction, and could have noise on the line.    So I’m trying a modification that does a read of the Sodaq_DS3231  registers to verify that they have been set correctly. It is a long shot, and happy to take any suggestions.
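A minimal sketch of the write-then-read-back idea, in portable C++ with the bus hidden behind callbacks so the logic can be exercised on a desktop. `RegisterBus` and `writeVerified` are hypothetical names, not part of the Sodaq_DS3231 library; on a Mayfly the callbacks would wrap the Wire calls. It assumes the register reads back exactly what was written, which would not hold for a register containing status flags that the chip changes on its own.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical bus abstraction: each callback returns false on a bus error.
struct RegisterBus {
    std::function<bool(uint8_t reg, uint8_t value)> write;
    std::function<bool(uint8_t reg, uint8_t& value)> read;
};

// Write one register, read it back, and retry on any error or mismatch.
// Returns the attempt number that verified (1..maxAttempts), or 0 if the
// value could never be confirmed.
int writeVerified(RegisterBus& bus, uint8_t reg, uint8_t value, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        uint8_t readBack = 0;
        if (bus.write(reg, value) && bus.read(reg, readBack) && readBack == value) {
            return attempt;
        }
    }
    return 0;
}
```

Logging any result other than 1 would give direct evidence of whether bus noise is actually corrupting the alarm setup before sleep.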

                          • #15450
                            Sara Damiano
                            Moderator

                              No, I don’t think I’ve seen that.

                            • #15451
                              neilh20
                              Participant

                                Ok thanks.

                              • #15460
                                neilh20
                                Participant

                                  My test06 system froze for a 2nd time. This time I pressed the User Button, which is also tied to an interrupt, and it started up again.   I have made the updates described here: https://github.com/neilh10/ModularSensors/issues/34 and am restarting the testing.

                                   

                                • #15887
                                  neilh20
                                  Participant

                                    I’m using Release 0.30.0 for stability regression testing. It’s gone well, and I captured a visual description of it here: 0.30.0 testing

                                    • #16081
                                      neilh20
                                      Participant

                                        For my fork https://github.com/neilh10/ModularSensors/releases/tag/v0.30.0.release1_211023
                                        I got the WiFi S6B communicating reliably, and it’s been under test for over a couple of weeks.
                                        It seems that on going to sleep, it really wasn’t doing what was necessary to be able to sleep. It seemed to be leaving TCP/IP links set up, which then, depending on network timers, might still be there when it wakes up. If the sleep time was short enough it would still have the link available. Since it used to work, probably what happened was that a timer somewhere else (MMW) was reduced. The cure in the end was, on sleep, to change the destination IP to local:, and then sw reset the device. Then when it wakes up, it reconnects to the SSID, and then sets up the TCP/IP link to the remote MMW.
                                        https://github.com/EnviroDIY/ModularSensors/issues/347 describes the issue

                                    • #16216
                                      neilh20
                                      Participant

                                        In the early part of this year, 2021, I did some regression testing using an Insitu LT500/SDI12 with a modbus board and telecom LTE, and I ran into a hard-to-find issue on the Mayfly 0.5b: the UART Rx line is unterminated.

                                        When extending features, it’s not unusual that problems are uncovered by new ways of exercising the software.

                                        As part of a terminal-based way to set the date on a simple Mayfly, I introduced a command line interface.

                                        The objective was to be able to set the date and time on a Mayfly when it’s been installed or is about to be installed. Typically a connected Mayfly gets the date when it polls NIST, or from a special date-setting program.

                                        A command line can be very useful in other ways as I brought out here

                                        https://www.envirodiy.org/topic/how-to-dump-contents-of-file-on-sd-card-to-serial/#post-15830

                                        For the initial command line interface I used String.

                                        Now, it turns out that String has some downsides, and it’s valuable to understand these downsides so as not to trip over them like I did.

                                        https://cpp4arduino.com/2018/11/06/what-is-heap-fragmentation.html

                                        https://cpp4arduino.com/2018/11/21/eight-tips-to-use-the-string-class-efficiently.html

                                         

                                        So I created a potential issue in the software: incoming characters on the UART RX were appended to a String. The hardware unfortunately can generate a lot of random characters, and this can cause the String to use up a lot of memory.

                                        In the process of debugging this I wrote a utility to dump the ram, showing how much stack and heap is used. I called it dumpFreeRam(). This gives a periodic visual of the ram allocation, and what might be using it.

                                        The critical issue is monitoring during integration testing how much free ram is available, and what might be using it.

                                        I did find one bug this way when a lot of ram got used up with the above usage of String. I’ve changed it to a more traditional fast circular buffer.
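For comparison with an ever-growing String, here is a minimal fixed-size circular buffer in portable C++. It is a sketch of the general technique, not the code in the fork: `RingBuffer` and its members are hypothetical names, and dropping new bytes when full is an assumed policy. The point is that memory use is fixed at compile time, so a burst of random characters on an unterminated Rx line can cost at most N bytes.

```cpp
#include <cstddef>

// Fixed-capacity byte queue: no heap allocation, so no fragmentation.
template <size_t N>
class RingBuffer {
public:
    // Store one byte. When the buffer is full the new byte is dropped,
    // counted, and false is returned.
    bool push(char c) {
        if (count_ == N) {
            ++dropped_;
            return false;
        }
        data_[head_] = c;
        head_ = (head_ + 1) % N;
        ++count_;
        return true;
    }

    // Remove the oldest byte into c. Returns false when empty.
    bool pop(char& c) {
        if (count_ == 0) return false;
        c = data_[tail_];
        tail_ = (tail_ + 1) % N;
        --count_;
        return true;
    }

    size_t size() const { return count_; }
    size_t dropped() const { return dropped_; }  // bytes lost to overflow

private:
    char data_[N] = {};
    size_t head_ = 0;     // next write position
    size_t tail_ = 0;     // next read position
    size_t count_ = 0;    // bytes currently stored
    size_t dropped_ = 0;  // overflow counter, useful in debug output
};
```

Reporting `dropped()` alongside the free-ram figure would show when line noise is arriving, without it costing any extra memory.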

                                        This utility at the end of each invocation calculates a summary “Free ram never allocated”. However this by itself doesn’t indicate when the ram actually got allocated, so the debug listing needs to be evaluated for what event caused any ram to be used.
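One common way to find ram that has never been allocated is to fill the gap between heap and stack with a known pattern at startup and later scan for bytes that still hold it. The sketch below simulates that on an ordinary array in portable C++. It is an assumption about the general technique, not a description of how dumpFreeRam() works; `kPaint`, `paintRegion`, and `longestUntouchedRun` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint8_t kPaint = 0xA5;  // arbitrary fill pattern (assumption)

// Fill a region with the paint pattern; done once, early at startup.
void paintRegion(uint8_t* mem, size_t len) {
    for (size_t i = 0; i < len; ++i) mem[i] = kPaint;
}

// Length of the longest run of bytes that still hold the paint pattern.
// A used byte that happens to equal kPaint is counted as untouched, so
// the result is an upper estimate.
size_t longestUntouchedRun(const uint8_t* mem, size_t len) {
    size_t best = 0;
    size_t run = 0;
    for (size_t i = 0; i < len; ++i) {
        run = (mem[i] == kPaint) ? run + 1 : 0;
        if (run > best) best = run;
    }
    return best;
}
```

Because painted bytes never get repainted, the figure can only shrink over time, which matches the point above that the summary shows how much ram was used but not when it was used.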

                                        The example below shows that a program used up to 273 bytes after running for a couple of hours: 6161-5888=273 bytes. Fortunately, longer testing doesn’t show more ram leakage.

                                        [2021-08-01 16:22:14.881] Free ram never allocated between (bytes dec) 6144 and 6161

                                        and after two hours

                                        [2021-08-01 18:20:15.013] Free ram never allocated between (bytes dec) 5888 and 5912

                                         

                                        [Images: Tera Term terminal output (using a Nanolevel/RS485) at the start and after running for two hours]

                                         

                                        The utility is in my fork, search for dumpFreeRam() in following

                                        https://github.com/neilh10/ModularSensors/commit/842eab880fdf6548d8b6d14951e4f0ad45727ecd
