MMW Data Outage? – EnviroDIY

This topic has 24 replies, 8 voices, and was last updated 2020-08-10 at 9:17 PM by neilh20.

Author

Posts
- 2020-07-12 at 9:05 PM #14311
  Robert S
  Participant
  Is there a problem with the data age not updating on the MMW site or an outage of cellular coverage?
  
  I am seeing a lot of sites that haven’t been updated in the last 7 hours (at the time of this posting).
  
  Robert
  Attachments:
  MMW_data.jpg
- 2020-07-12 at 9:55 PM #14313
  Cal
  Participant
  I know Hologram is having trouble. None of my devices could connect for the past 7 hours.
- 2020-07-13 at 5:36 AM #14314
  Robert S
  Participant
  Thanks.
  
  I see some sites came back online @ 7:15 UTC but some are still reporting out.
- 2020-07-13 at 7:58 AM #14315
  Cal
  Participant
  All my devices came back online after 3:30 am ET – outage was around 15 hours by Hologram. I’m going to look at alternate providers if this happens again.
- 2020-07-13 at 10:04 AM #14318
  Heather Brooks
  Keymaster
  Hi folks, chiming in here to let you know that we now have a dedicated Monitor My Watershed forum! You can find it in the dropdown options under Forums in the main menu. I’ve moved this topic over; it should not affect the existing conversation in any way but should make it more discoverable when members are looking for help with MonitorMW.
- 2020-07-13 at 11:30 AM #14319
  Jim Moore
  Participant
  I reposted this here from my post this morning on “infrastructure and equipment” forum:
  
  “I noticed what appears to be a general shutdown of the 2G network in the N. Chester County area, which occurred Sunday, 7/12, at 13:45 EDT. I have 8 stations in the Great Marsh Institute’s network. Three of these woke up this morning but the others are still mute.
  
  According to Hologram it looks like 2G is on borrowed time
  
  Does anyone have any details on the time line for remaining 2G support?
  
  @shicks
  
  Does Stroud have any large scale upgrade plans? I understand that 4G requires new modems; if so, are the 2G modems I just purchased a few months ago now scrap!? Maybe we could put in a group order for the 4G modems to at least get a quantity discount and sell the 2G modems on eBay if they have any intrinsic value.”
- 2020-07-13 at 12:20 PM #14321
  Shannon Hicks
  Moderator
  There was a major worldwide outage with Hologram yesterday that is still being resolved. You can read more about the details here on their status page: https://status.hologram.io/
  
  So it’s not a MonitorMyWatershed outage; it has only affected EnviroDIY stations with 2G modems, because none of our stations with 4G modems were affected. T-Mobile is the only provider for 2G right now, and they are planning to decommission their 2G network at the end of 2020, so we have been making plans to upgrade all of our existing stations from 2G to 4G by the end of the year. This 2G outage gives us a glimpse of what it would look like in January when all the 2G stations will drop offline. Upgrading from 2G to 4G involves visiting each station, reprogramming the Mayfly, replacing the modem and antenna, and adding an LTEbee adapter. So it requires about $100 worth of hardware and a visit to each station, of which we have about 40, so it’s a time-consuming process but will be necessary to keep everything online.
  
  So any stations currently offline right now are likely using 2G hardware, and Hologram says things should be returning to normal soon. The owners of these stations should start to consider upgrading their hardware before the end of the year in order to prevent the permanent 2G outage that is inevitable. And if your 2G station doesn’t come back online in the next few days after everything returns to normal, you might visit it and check the battery. Sometimes cell outages stress the battery by causing excessive connection times every 5 minutes, which can drain the battery faster, especially in areas that are shaded by the full leaf canopy this time of year.
- 2020-07-13 at 2:16 PM #14322
  Jim Moore
  Participant
  Thanks for the update, Shannon. Are there any plans to buy a quantity of 4G hardware in the expectation of a discount? I will need at least 8 and would be glad to help out where needed.
  - 2020-07-13 at 3:07 PM #14326
    Shannon Hicks
    Moderator
    This is an equipment question and not related to MMW, so I’ll answer that in your other forum thread.
- 2020-07-13 at 2:25 PM #14323
  Jim Moore
  Participant
  @heather
  
  Since it’s not a MMW issue should I move my technical questions back to my original post on “infrastructure and equipment” forum?
- 2020-07-13 at 2:47 PM #14324
  Robert S
  Participant
  @heather
  
  My bad for starting this post in the wrong category. I should have paid more attention!
  
  @shicks
  
  Forgive my ignorance about the technical details of cellular service but …
  As for “any stations currently offline right now are likely using 2G hardware…”, SL168 (Punches Run) was upgraded to 4G in 2019 but was still down. If the logger cannot connect to 4G, will it try to connect in other ways? I’ll have to read up a little more to understand it.
  
  Robert
  - 2020-07-13 at 3:07 PM #14325
    Shannon Hicks
    Moderator
    The station at Punches Run has been back online for several hours now. It has a 4G board on it. The cellular hardware is either 4G only or 2G only. The connectivity issues are occurring somewhere in Hologram’s system behind the scenes and I don’t know any of the details, other than all of our 2G stations are still offline, and some of our 4G stations were offline yesterday, but most (if not all) are back online, some were not affected at all. There have been several outage problems like this in the past 5 years that we’ve been deploying cellular-equipped loggers, and we usually just have to be patient and wait for the carriers and service providers to fix their issues. That’s why the Mayfly loggers have redundant on-board memory cards for storing sensor data. So no data has been lost, and owners will just have to visit their stations to retrieve the memory cards to fill in the data gaps from the periods of missing cellular data.
- 2020-07-13 at 3:14 PM #14328
  Robert S
  Participant
  @shicks
  
  Okay. I misunderstood.
  
  I thought you were saying that ONLY 2G stations were affected. I did see that the Punches Run station was back up this morning.
  
  Thanks for the clarification.
  
  Robert
- 2020-07-14 at 5:52 PM #14332
  Shannon Hicks
  Moderator
  The Hologram network was back up and running early this morning, and there have been no further issues with their network today. All of the stations that lost connectivity on Sunday are functioning normally.
- 2020-07-15 at 12:54 PM #14333
  Anthony Aufdenkampe
  Participant
  Thanks Robert, Cal, Jim and others for this thread, and to @shicks for providing all those updates on the Hologram situation.
  
  I would like to add that our team overseeing the Monitor My Watershed portal has also noticed some issues on our end, which are separate from the Hologram issues, and we are working on improving MMW services.
  
  Many of these issues have been apparent since the COVID-19 work-at-home orders were put in place. Since then, we’ve noticed intermittent 502 & 504 gateway error responses from our server (either when browsing via the web or when posting data from a monitoring device) and we’ve noticed a general slowness when browsing the portal.
  
  One of the issues is that since COVID-19, the web servers at LimnoTech that host MMW are getting a lot more internet traffic as LimnoTech staff use our VPN to work from home and for other reasons. To that end, LimnoTech has done a number of upgrades to its network that resulted in several planned 1-2 hour outages from time to time. Although we’ve seen some improvements, this hasn’t solved all the issues. We’re continuing to work on optimizing our network.
  
  The other issue is that Monitor My Watershed is running on an aging software stack, which could probably benefit from a not-so-trivial round of updates.
  
  The Stroud Water Research Center and LimnoTech are committed to long-term maintenance and development of the Monitor My Watershed data sharing portal. We have developed a roadmap for the next phase of development, and are presently exploring funding options to get started. If you have any leads for potential funding, please contact us! Every drop counts!
  
  Our development roadmap includes hosting MMW on Amazon Web Services (AWS) for enterprise-class up-time. It also includes addressing various “tech-debt” items that naturally accrue as a software system ages, along with many other items that we’ve listed in our Release 0.12 – Tech Debt / Refactor Code milestone on GitHub.
- 2020-07-15 at 3:22 PM #14334
  Matt Barney
  Participant
  Thanks, Anthony, for this informative post! It speaks to some questions we at Trout Unlimited have had about MMW, its current performance, and future direction, as we look to expand our Mayfly deployments.
  
  Matt
  
  (cc @jlemontu-org)
- 2020-07-15 at 3:41 PM #14335
  Anthony Aufdenkampe
  Participant
  Matt, I’m glad you found the update helpful! I would be interested in connecting with you to hear more of your perspectives and long-term needs, if that’s of interest to you.
- 2020-07-23 at 9:57 AM #14376
  Anthony Aufdenkampe
  Participant
  Hey All, we found an issue that cropped up in late June on the database server for Monitor My Watershed. We’ll be doing a planned maintenance shutdown today at around 12:30 ET to fix the issue, and it will likely take about an hour. We’ll let you know when it is back up and running.
- 2020-08-06 at 2:15 PM #14453
  Anthony Aufdenkampe
  Participant
  We think we found and fixed the major issue with slowness and 502/504 Gateway errors that have been increasing since this spring!
  
  Please let us know if you see any slowdowns from the web-browser or gateway errors from devices trying to post data, just in case there’s yet another issue we didn’t notice.
  
  @tslawecki, @htaolimno and LimnoTech’s IT team have done numerous fixes, tweaks and optimizations to our servers, virtual machines, network, routers, and firewall over the last 2-3 weeks. For example, we had some internal DNS issues and some internal API calls were going out to the global internet rather than staying inside our firewall, slowing things down. There were other similar issues. A lot of those tweaks appeared to make differences, but the outage this weekend showed us that we hadn’t found the root cause.
  
  We now believe that a major factor in the outages was that the database holding the measured values grew to be bigger than the RAM we allocated for the database virtual machine, which caused a major performance slowdown when combined with the other issues. We increased the database server virtual machine RAM to 60 GB (our max available), which gives us about a year of breathing room given the current database size of 43 GB (177 million data points!) and our current growth rate of almost 8 million data points per month.
  
  The long-term solution is to optimize the software stack so that we’re not so constrained by server RAM, which we know is possible. We would want to do this in combination with work on the first 4-7 issues in our Release 0.12 – Tech Debt / Refactor Code milestone, and we are actively looking for funding to do this work, as I mentioned above.
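The sizing figures quoted in the post above (43 GB, 177 million data points, almost 8 million new points per month, 60 GB of RAM) can be turned into a rough headroom estimate. This is only a back-of-the-envelope sketch: it assumes linear growth and a constant average storage size per data point, neither of which the post states, and the function name is made up for illustration.

```python
# Figures quoted in the post; everything else here is an assumption.
DB_SIZE_GB = 43.0                 # current database size
DATA_POINTS = 177e6               # data points currently stored
GROWTH_POINTS_PER_MONTH = 8e6     # "almost 8 million" per month
RAM_GB = 60.0                     # RAM now allocated to the database VM


def months_of_headroom(db_size_gb, data_points, growth_points_per_month, ram_gb):
    """Months until the database outgrows RAM.

    Assumes linear growth and a constant average size per data point
    (hypothetical simplifications, not stated in the thread).
    """
    gb_per_point = db_size_gb / data_points
    growth_gb_per_month = gb_per_point * growth_points_per_month
    return (ram_gb - db_size_gb) / growth_gb_per_month


months = months_of_headroom(DB_SIZE_GB, DATA_POINTS, GROWTH_POINTS_PER_MONTH, RAM_GB)
print(f"Estimated headroom: {months:.1f} months")
```

Under these assumptions the estimate comes out at roughly nine months, which is in the same range as the "about a year" quoted in the post. The real figure depends on how much of the database actually has to stay in RAM.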
- 2020-08-07 at 11:06 AM #14458
  Matt Barney
  Participant
  Great news, @aufdenkampe! Thanks to you and your team.
  
  I ran a test overnight, with a Mayfly sampling every 5 minutes. Out of 179 samples sent to MMW, only one received Response Code 504, at 04:30 MST, Aug 7th. All other POST messages received successful response codes (201). The 04:30 point did not get saved to the MMW database. What I’ve observed in the past was that the ‘504’ points still got saved in the database. In any case, this appears to be an improvement compared to my previous tracking of 504 errors.
  
  There were 4 other sample points during my test which the Mayfly saved to the SD card but apparently never attempted to send to MMW, as there were no “Sending data” nor “POST” messages in the log at those times. I believe this is a Mayfly/Xbee3 issue, not a MMW issue. I’ve only seen it when using an LTE modem, not when using WiFi.
  
  Best,
  
  Matt
  
  Trout Unlimited
- 2020-08-07 at 1:08 PM #14462
  neilh20
  Participant
  Hey good to hear.
  
  I haven’t been able to do a lot of testing, and I was out yesterday, but I set up a laptop to monitor one beta system that is using Verizon overnight, starting at 9pm PST (though I forgot to plug in the laptop’s power cord and it turned off after 2 hours!!). It’s sampling every 15 minutes, taking 8 readings, and pushing the 8 updates every 2 hours, at an offset of 7 minutes; that is, at 23:07, 01:07, 03:07. The POST timeout is tighter at 5 seconds; if it doesn’t get a response it records it as a 504. I’ve created a POSTLOG.TXT on the uSD that records all POST attempts. If it doesn’t get a 201, it queues the readings and then retries on the next successful 201.
  
  Looking at the POSTLOG.txt this morning, it has mostly got 504’s, with a few 201s.
  
  The Debug Log that I got from a POST of 8 readings at PST 23:07pm (2020-08-07T07:07:00-08:00 ) were all 504
  
  Downloading the .csv file from MMW this morning and looking at the records, a good number of readings (24) didn’t make it to the database, but those that did, all made it.
  
  I’ll set up some more testing later today.
  
  https://github.com/ODM2/ODM2DataSharingPortal/issues/483
  
  https://github.com/EnviroDIY/ModularSensors/issues/194
- 2020-08-09 at 3:11 PM #14463
  neilh20
  Participant
  Over the last couple of days I’m getting very good response when using WiFi, and the response time for the 201 ack is under 1 second.
  
  This is a fast check with 2-minute sampling time and SendX=2, so delivery every 4 minutes. I get a response typically under 0.5 seconds, and occasionally ~0.6 seconds. So for 1250 messages, all have been delivered, on the 1st attempt or on subsequent retries.
  
  For the beta Verizon system, with sampling at 15 minutes and SendX=8, that is a wireless connection every 2 hours, and a timeout of 5 seconds, there are bursts of successful delivery with ack 201. The ack time is sometimes about 1.4 seconds, but mostly, when successful, about 4.5 seconds. So I’m guessing this is something to do with Verizon’s network. I’m going to have to change the timeout back to 10 seconds for better characterization.
  
  Thanks to the MMW team for finding the issues and getting it responding!!!
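The relationship between sampling interval, SendX, and delivery interval described in the post above can be sketched as follows. The function names are hypothetical and the start date is arbitrary; only the 2 min / SendX=2 and 15 min / SendX=8 settings and the 23:07 start time come from the thread.

```python
from datetime import datetime, timedelta


def delivery_interval_minutes(sample_interval_min, send_x):
    """Minutes between uploads when readings are batched send_x at a time."""
    return sample_interval_min * send_x


def post_times(first_post, sample_interval_min, send_x, count):
    """Return the first `count` upload times, starting at `first_post`."""
    step = timedelta(minutes=delivery_interval_minutes(sample_interval_min, send_x))
    return [first_post + i * step for i in range(count)]


# WiFi check from the thread: 2-minute sampling, SendX=2.
wifi_interval = delivery_interval_minutes(2, 2)

# Verizon beta system: 15-minute sampling, SendX=8, first upload at 23:07.
# The calendar date below is an arbitrary placeholder.
verizon_interval = delivery_interval_minutes(15, 8)
verizon_posts = [
    t.strftime("%H:%M")
    for t in post_times(datetime(2020, 8, 6, 23, 7), 15, 8, 3)
]
print(wifi_interval, verizon_interval, verizon_posts)
```

With these inputs the WiFi setup uploads every 4 minutes and the Verizon setup every 120 minutes, at 23:07, 01:07 and 03:07, matching the schedule described earlier in the thread.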
- 2020-08-10 at 1:29 PM #14464
  Matt Barney
  Participant
  I repeated my test for another ~48 hour run, sampling every 5 minutes, but this time using WiFi instead of XBee3 cellular. All of my 548 sample points made it to the MMW database, and all POST messages sent by the Mayfly received successful response code 201. So data upload via cellular appears to be significantly less reliable, even when cell signal is good.
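The two test runs reported in this thread can be compared as simple delivery rates. This is a sketch under one stated assumption: the 4 samples that were never sent during the cellular run are treated as additional to the 179 POSTs, which the posts do not say explicitly. The function name is illustrative only.

```python
def delivery_rate(delivered, sampled):
    """Fraction of sampled points that reached the database."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    return delivered / sampled


# Cellular run: 179 POSTs, one got a 504 and was not saved.
# Assumption: the 4 never-sent samples are in addition to those 179.
cell_sampled = 179 + 4
cell_delivered = 179 - 1
cell_rate = delivery_rate(cell_delivered, cell_sampled)

# WiFi run: all 548 points were saved.
wifi_rate = delivery_rate(548, 548)

print(f"cellular: {cell_rate:.1%}, WiFi: {wifi_rate:.1%}")
```

Two runs of different length are a small sample, so the gap between the rates is indicative rather than conclusive.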
- 2020-08-10 at 1:42 PM #14465
  Anthony Aufdenkampe
  Participant
  Neil & Matt, thanks for that very good news from your testing!
  
  I’m really glad to hear that everything seems to be working well again!!!
- 2020-08-10 at 9:17 PM #14466
  neilh20
  Participant
  The response is great.
  
  For my WiFi/Xbee S6 accelerated updates (2-minute sampling, update every 4 minutes), the ACK time over 700 POSTs is between 200 ms and 774 ms, with all POSTs succeeding on the 1st attempt.
  
  For my Verizon/Xbee LTE at 15-minute sampling, the ACK time is typically 5 sec, very occasionally about 1.5 sec, and also 7 sec. For this test it delivered the outstanding readings that weren’t delivered previously, and then the new readings.
  
  I am working on a new feature, Reliable Delivery, as cellular wireless range can vary and be unreliable. Often, though, there are periods of greater reliability (wind in the right direction). So if the first POST attempt doesn’t succeed, it is serialized to a QUExx.txt file to be retried when there is a connection.
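The queue-and-retry idea described in the post above could look something like the following. This is a minimal Python sketch, not the actual Mayfly/ModularSensors implementation: the `deliver` function, the JSON-lines file format, the `queue_demo.txt` file name, and the rule of stopping after the first failed POST are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def deliver(readings, post, queue_file):
    """POST previously queued readings first, then the new ones.

    `post` is any callable that takes a reading and returns an HTTP status
    code. Anything not acknowledged with 201 is written back to `queue_file`.
    After the first failure no further POSTs are attempted in this call
    (an assumed design choice that preserves ordering).
    Returns the number of readings delivered.
    """
    pending = []
    if queue_file.exists():
        pending = [json.loads(line)
                   for line in queue_file.read_text().splitlines() if line]
    pending.extend(readings)

    unsent = []
    link_ok = True
    for reading in pending:
        if link_ok and post(reading) == 201:
            continue
        link_ok = False
        unsent.append(reading)

    queue_file.write_text("".join(json.dumps(r) + "\n" for r in unsent))
    return len(pending) - len(unsent)


def scripted_post(responses):
    """Fake transport for the demo: returns the given status codes in order."""
    codes = iter(responses)
    return lambda reading: next(codes)


with tempfile.TemporaryDirectory() as tmp:
    queue_file = Path(tmp) / "queue_demo.txt"  # hypothetical file name
    # First attempt times out (recorded as 504): both readings are queued.
    first = deliver([{"t": 1}, {"t": 2}], scripted_post([504]), queue_file)
    queued_after_first = len(queue_file.read_text().splitlines())
    # Next connection works: queued readings go out before the new one.
    second = deliver([{"t": 3}], scripted_post([201, 201, 201]), queue_file)
    queued_after_second = len(queue_file.read_text().splitlines())

print(first, queued_after_first, second, queued_after_second)
```

Stopping at the first failure keeps readings in their original order and avoids repeated timeouts on a dead link, at the cost of waiting until the next upload window to try again.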