We think we found and fixed the major issue with slowness and 502/504 Gateway errors that have been increasing since this spring!
Please let us know if you see any slowdowns in the web browser or gateway errors from devices trying to post data, just in case there’s yet another issue we didn’t notice.
@tslawecki, @htaolimno and LimnoTech’s IT team have made numerous fixes, tweaks and optimizations to our servers, virtual machines, network, routers, and firewall over the last 2-3 weeks. For example, we had some internal DNS issues, and some internal API calls were going out to the global internet rather than staying inside our firewall, which slowed things down. There were other similar issues. Many of those tweaks appeared to help, but the outage this weekend showed us that we hadn’t found the root cause.
We now believe that a major factor in the outages was that the database holding the measured values grew to be bigger than the RAM we allocated for the database virtual machine, which caused a major performance slowdown when combined with the other issues. We increased the database server virtual machine RAM to 60 GB (our max available), which gives us about a year of breathing room given the current database size of 43 GB (177 million data points!) and our current growth rate of almost 8 million data points per month.
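For anyone who wants to check the headroom estimate, here is a rough back-of-the-envelope sketch in Python using the figures above. It assumes linear growth, a constant average storage size per data point, and that the whole database needs to fit in RAM; the function name and parameters are just illustrative, and with these simplifications it comes out at roughly nine months, in the same ballpark as the estimate above.

```python
def months_of_headroom(ram_gb, db_size_gb, points_stored, points_per_month):
    """Estimate months until the database outgrows RAM.

    Assumes linear growth and a constant average size per data point,
    which is a simplification of how a real database grows.
    """
    gb_per_point = db_size_gb / points_stored           # average storage per data point
    spare_points = (ram_gb - db_size_gb) / gb_per_point  # points that fit in the remaining RAM
    return spare_points / points_per_month


# Figures quoted in the post: 60 GB RAM, 43 GB database,
# 177 million data points, almost 8 million new points per month.
months = months_of_headroom(
    ram_gb=60,
    db_size_gb=43,
    points_stored=177e6,
    points_per_month=8e6,
)
print(f"Approximate headroom: {months:.1f} months")
```

If the growth rate keeps climbing as more stations come online, the real headroom would be shorter than this linear estimate suggests.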
The long-term solution is to optimize our software stack so that we’re not so constrained by server RAM, which we know is possible. We would want to do this in combination with work on the first 4-7 issues in our Release 0.12 – Tech Debt / Refactor Code milestone, and we are actively looking for funding to do this work, as I mentioned above.