Welcome to EnviroDIY, a community for do-it-yourself environmental science and monitoring. EnviroDIY is part of WikiWatershed, an initiative of Stroud Water Research Center designed to help people advance knowledge and stewardship of fresh water.

Response code 504 from data.envirodiy.org


    • #13736
      Matt Barney
      Participant

        Today I’ve been intermittently getting a 504 HTTP response when POSTing to the data portal:

        This is from a Mayfly on my desk, using an XBee3 LTE and connected to the serial monitor in PlatformIO/VSCode. So far I’ve been unable to generate any error when posting the same messages via Postman from my laptop; it always seems to return a 201, and the data correctly shows up on MMW.

      • #13738
        Sara Damiano
        Moderator

          Huh.  I’m not aware of any particular reason today would be worse than any other day.  There was an update to MonitorMW released on Thursday (1/23), but it mostly dealt with Leaf Pack data and shouldn’t be affecting anything today.

        • #13739
          neilh20
          Participant

            I’m seeing it as well. It starts off as 201, and then after 10~2 POSTs it starts printing 504. It has been happening since late last week, but the data is being recorded on MMW. I wasn’t monitoring anything early last week.

          • #13742
            Matt Barney
            Participant

              In my case, any time I get a 504 response, the corresponding data do not show up on MMW. Pasting the same JSON message body into Postman and sending it always succeeds with a 201.

            • #13744
              Sara Damiano
              Moderator

                @aufdenkampe – any thoughts?

                Postman automatically tacks on more standard HTTP headers than I include in the POST from ModularSensors. I wonder if one of them is making a difference. In Postman, after making a request you can click the “Headers” tab and you’ll see 8 or 9 “temporary” headers that it used in addition to your token header. ModularSensors only includes host, content-type, and content-length (and, of course, token). If you open EnviroDIYPublisher.cpp you can find and re-activate lines to add the cache control and connection headers (a few places needed: lines 24-25, 163-164, and 229-233). Maybe one of them will make a difference. If neither of those, I suppose it could be one of the accept headers, which I’ve never added.
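To make the header comparison concrete, here is a minimal Python sketch that builds the raw request text with only the headers ModularSensors is described as sending, and again with the two optional headers added. The header values `no-cache` and `close` and the placeholder token are assumptions for illustration; the thread does not say what values ModularSensors or Postman actually use.

```python
def build_post_request(host, token, body, extra_headers=None):
    """Return the raw HTTP/1.1 request text for a data-stream POST."""
    headers = [
        ("Host", host),
        ("TOKEN", token),
        # Content-Length counts bytes, not characters.
        ("Content-Length", str(len(body.encode("utf-8")))),
        ("Content-Type", "application/json"),
    ]
    headers.extend((extra_headers or {}).items())
    lines = ["POST /api/data-stream/ HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers]
    # A blank line separates the headers from the body.
    return "\r\n".join(lines) + "\r\n\r\n" + body


# Hypothetical values for the optional headers (assumed, not from the thread).
OPTIONAL_HEADERS = {"Cache-Control": "no-cache", "Connection": "close"}

minimal = build_post_request("data.envirodiy.org", "my-token", '{"x": 1}')
extended = build_post_request(
    "data.envirodiy.org", "my-token", '{"x": 1}', OPTIONAL_HEADERS
)
print(minimal)
print(extended)
```

Comparing the two outputs against what Postman shows in its “Headers” tab is one way to narrow down which header, if any, changes the server's behavior.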

              • #13745
                Anthony Aufdenkampe
                Participant

                  Hmm. I wonder if it is somehow connected to our last release on Thu. Jan. 23. As Sara mentioned, we didn’t touch the code related to the HTTP post request, but at that time we did make some changes to our virtual machine host OS and our router.

                  Does it make a difference if you post to monitormywatershed.org, which is now our primary hostname, rather than data.envirodiy.org? I wonder if the rerouting might contribute.

                • #13746
                  Matt Barney
                  Participant

                    Looks like it will still happen when posting to monitormywatershed.org. It’s intermittent, as before. Log file attached.

                    To do this test, I modified line 21 in EnviroDIYPublisher.cpp as follows:

                    Matt

                  • #13760
                    Matt Barney
                    Participant

                      I just found issue #303 on the ODM2DataSharingPortal site that may be related to my issue: I have a blank UUID in my example request, posted above. I’ll retest and report back.

                    • #13761
                      Matt Barney
                      Participant

                        OK, looks like I still get 504 errors intermittently, even when all UUIDs are specified properly.

                      • #14006
                        Matt Barney
                        Participant

                          I’ve been getting a lot of 504 (Gateway Timeout) HTTP responses to the POST messages sent to MMW. I ran a test today with my Mayfly sampling every 2 minutes, and out of 111 samples, I got 34 successful (201) response codes and 76 response codes of 504. I also got one 502-Bad Gateway, and that datapoint didn’t make it up to MMW; the rest did. My log file is attached.

                          Is anyone else getting these, and do we know what causes them?

                        • #14008
                          neilh20
                          Participant

                            I’m seeing something similar. Out of 15 posts sent two minutes apart, the first 6 were 201, and the rest were a mix of 504 and 201.
                            The site is https://monitormywatershed.org/sites/TU-RC-Test03a/
                            The post structure looks good:

                            Sending data to [ 0 ] data.envirodiy.org
                            POST /api/data-stream/ HTTP/1.1
                            Host: data.envirodiy.org
                            TOKEN: f4c00cb8-91bd-4ba3-8229-10f68e99605e
                            Content-Length: 410
                            Content-Type: application/json

                            {"sampling_feature":"275b362b-86ab-4079-bce0-2ae5c4e96350","timestamp":"2020-04-01T14:25:02-08:00","6288baaa-d291-4a82-a0b6-7b28b6faa0df":15,"6e433b80-fa12-41c1-952c-bda827c1b2fb":3.957,"1f2b4122-75f1-4e5b-b6df-16ec6f4aa30e":0.2129,"5ba31d7b-9ce7-4621-b97b-0c72f9ab414e":18.6,"1cc06ba7-b0ec-4df5-8986-c529fae578a2":16.81,"33faa79b-04fd-4277-ab0a-24f6dbaaa931":0.0010,"3705167b-9cb6-49bd-bfee-dc49c2a99a97":134}

                            -- Response Code --
                            504

                            When put through a JSON pretty-printer, it looks good:
                            {
                              "sampling_feature": "275b362b-86ab-4079-bce0-2ae5c4e96350",
                              "timestamp": "2020-04-01T14:20:02-08:00",
                              "6288baaa-d291-4a82-a0b6-7b28b6faa0df": 13,
                              "6e433b80-fa12-41c1-952c-bda827c1b2fb": 3.964,
                              "1f2b4122-75f1-4e5b-b6df-16ec6f4aa30e": 0.1935,
                              "5ba31d7b-9ce7-4621-b97b-0c72f9ab414e": 18.4,
                              "1cc06ba7-b0ec-4df5-8986-c529fae578a2": 16.81,
                              "33faa79b-04fd-4277-ab0a-24f6dbaaa931": 0.0009,
                              "3705167b-9cb6-49bd-bfee-dc49c2a99a97": 134
                            }
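For anyone counting response codes by hand from a serial-monitor capture, a small Python sketch like the following could do the tally. It assumes the log has the shape shown in the post above: a marker line containing "Response Code" followed by the numeric code on the next non-blank line. The function name and the sample log are illustrative only.

```python
from collections import Counter


def tally_response_codes(log_text):
    """Count the codes that follow each 'Response Code' marker line."""
    counts = Counter()
    expecting_code = False
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between marker and code
        if "Response Code" in line:
            expecting_code = True
        elif expecting_code:
            if line.isdigit():
                counts[int(line)] += 1
            expecting_code = False
    return counts


# Made-up sample in the assumed log format.
sample_log = """
-- Response Code --
201

-- Response Code --
504

-- Response Code --
504
"""
print(dict(tally_response_codes(sample_log)))
```

If a capture uses a different marker or prints the code on the same line, the matching condition would need to change accordingly.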

                          • #14010
                            Matt Barney
                            Participant

                              Thank you, Neil!

                              Here is more information, from a test I ran overnight using a 2 minute sampling interval, from 04-01 15:32 to 04-02 08:08 MST (UTC-7:00):

                              • 498 total samples taken, of which 7 failed to get inserted into the MMW database.
                              • 55 sample events received a Response Code 504.
                              • 1 event received RC 400.
                              • The remainder received RC 201 (Successfully created).

                              Here are the timestamps, in MST, of the 7 missing sample events, along with their corresponding Response Codes or error messages:

                              • 16:00 (RC 504)
                              • 18:28 (GPRS connection failed.)
                              • 18:30 (RC 504)
                              • 23:22 (RC 504)
                              • 23:48 (RC 504)
                              • 03:48 (RC 400)
                              • 07:48 (RC 504)

                              I’ll attach my code and the log file from this test run. The site name on MMW is TU_BOISE.

                              I’m curious whether anyone else may have received 504 responses from the server at similar times, and whether others are seeing missing data points in MMW. It’s difficult (or impossible) to detect these 504 errors at the Mayfly, unless it’s connected to Serial Monitor, as mine is during testing, but you might see missing points on MMW if you download a csv of your data.
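As a rough illustration of spotting missing points in a downloaded csv, the sketch below flags any gap between consecutive timestamps that is noticeably longer than the sampling interval. It works on already-parsed `datetime` values, because the column layout of the MMW export is not described in this thread; the function name and the 50% tolerance are assumptions.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, interval, tolerance=0.5):
    """Return (last_seen, next_seen) pairs where samples appear to be missing.

    A gap is reported when two consecutive timestamps are more than
    interval * (1 + tolerance) apart, which allows for a few seconds of jitter.
    """
    ordered = sorted(timestamps)
    limit = interval * (1 + tolerance)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > limit
    ]


# Example: 2-minute sampling starting at 15:32 with the 16:00 sample missing.
start = datetime(2020, 4, 1, 15, 32)
step = timedelta(minutes=2)
received = [start + i * step for i in range(20) if i != 14]

for earlier, later in find_gaps(received, step):
    print("gap between", earlier.time(), "and", later.time())
```

Comparing consecutive differences, rather than matching against an exact time grid, keeps the check usable when timestamps carry a few seconds of offset.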

                              Or perhaps there’s logging on the MMW server that could correlate to these gaps? Thanks for any insights that anyone can provide!

                              Matt

                            • #14012
                              neilh20
                              Participant

                                Hi Matt
                                You’re doing some good characterization work. Thank you for sharing it.
                                A basic issue to keep in mind is that all communications can be unreliable, as there is complex telecoms infrastructure in between.

                                The reliability of an arbitrary wireless placement is difficult to quantify: it depends on distance to the cell head, direction of wind, fog/rain, etc. A wireless network location needs to be characterized before it can be treated as reliable.

                                I use the office-based WiFi to attempt to have a reliable reference point. (And I’m still seeing 504 timeouts with data delivery, and also some posts not being inserted.)

                                From an industry context, communication protocol retries are the only way to provide for reliable data delivery, and I raised this feature request.
                                https://github.com/EnviroDIY/ModularSensors/issues/194

                                I haven’t managed to implement it in my EnviroDIY systems yet either https://github.com/neilh10/ModularSensors, but it’s high on my list, and periodically I give myself a kick in the butt for not having got it done. Like now. Ideally this type of reliable delivery is implemented early so that overall reliability can be characterized. I implemented another system, and even though networks in rural areas could be down for weeks, the data would eventually ALL be pushed to the server.
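The retry idea can be sketched in a few lines of Python. This is only an illustration of store-and-forward with retries, not the design proposed in the linked feature request and not anything ModularSensors does today; the class name, the `send` callable, and the queue size are all hypothetical.

```python
from collections import deque


class RetryQueue:
    """Hold readings until a send attempt reports success (HTTP 201)."""

    def __init__(self, send, max_pending=1000):
        # send: callable taking a payload and returning an HTTP status code;
        # it may raise OSError when the network is down (assumed interface).
        self._send = send
        # When the queue is full, the oldest reading is dropped.
        self._pending = deque(maxlen=max_pending)

    def add(self, payload):
        self._pending.append(payload)

    def flush(self):
        """Send queued readings oldest first; stop at the first failure."""
        delivered = 0
        while self._pending:
            try:
                status = self._send(self._pending[0])
            except OSError:
                break
            if status != 201:
                break
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._pending)


# Demo with a fake sender that answers 201, 504, 201, 201 in turn.
responses = iter([201, 504, 201, 201])
queue = RetryQueue(lambda payload: next(responses))
for reading in ("a", "b", "c"):
    queue.add(reading)

print(queue.flush(), len(queue))  # first pass stops at the 504
print(queue.flush(), len(queue))  # second pass delivers the rest
```

Two caveats a real implementation would have to handle: reports in this thread suggest that a POST answered with 504 is sometimes stored anyway, so blind retries could create duplicates unless the server rejects or merges them; and a permanently rejected payload (for example a 400) would block this simple queue forever.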

                                I have implemented a Sequence number (EnviroDIY_Mayfly_SampleNum) which increments on the Mayfly side, and then on the monitorMyWatershed it can be viewed, and any missing numbers detected.

                                So for a field system the sequence number appears to be continuous from Oct/24/2019, https://monitormywatershed.org/tsa/?sitecode=TU-RC-01&variablecode=EnviroDIY_Mayfly_SampleNum&view=visualization&plot=true
                                So for this system, running over (I think) an ATT CAT-M1 modem, there are only occasional packet losses of 1; see the graph generated from the .xls file.
                                [Figure: Packet losses]
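Checking a sequence-number series for holes is easy to automate. The Python sketch below assumes the counter only ever increases (a logger reset that restarts the count would need separate handling) and that the values have already been extracted from the downloaded data; the function name and sample numbers are illustrative.

```python
def missing_sequence_numbers(received):
    """Return the sequence numbers absent between the smallest and largest seen."""
    seen = set(received)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


# Made-up example: samples 103, 106 and 107 never arrived.
print(missing_sequence_numbers([101, 102, 104, 105, 108]))
```

Unlike a timestamp check, this does not depend on the sampling interval, so it keeps working if the interval is changed in the field.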

                                That said I have been seeing a lot of (504) from my desk workstations – and haven’t dug into why.

                              • #14015
                                Matt Barney
                                Participant

                                  Hi Neil,

                                  Thanks for your good explanations and ideas. As you’ve described, the Mayfly->MMW system in its current form would need additional engineering effort to be able to support a high-reliability communication system with built-in buffering, retries, etc.

                                  At the moment, I’m trying to understand why our Mayfly systems’ reliability has dropped recently, apparently since we’ve implemented code changes. We have loggers that have been in the field for 9 months, sampling every 5 minutes, with no data missing from MMW. Beginning in January, we loaded new firmware onto a handful of those boards (or have simply swapped new, reprogrammed boards  in their place) in order to take advantage of improvements that were made to ModularSensors since the boards were first deployed last summer/fall. Since installing that new firmware, a number of these Mayfly stations have begun to have multiple missing datapoints per day, and in some cases, missing data for hours before resuming.

                                  So what I’ve attempted to do is to reestablish a known, working, baseline state to see if I can identify any bug that I might have introduced. I began by cleaning up my development environment (removing all PlatformIO libraries from global storage), creating a new PlatformIO project from the logging_to_MMW example code, and making minimal changes (setting UUIDs, station identifier, etc.). The errors and missing data I’ve described have been the outcome of this testing so far. So I’m a bit stumped as to why I’ve been unable to get back to the higher reliability that we seem to have had previously.

                                  It seems to me that, in the case of the 504 response codes, since the MMW REST endpoint is returning a response, it is receiving the messages but, perhaps for reasons internal to the server, failing to save them to its database.

                                  Thanks for reading. I’m kind of a one-man show as far as writing and testing this code, so I’m grateful for any and all suggestions!

                                  Best,

                                  Matt

                                  P.S. I have to mention the irony that, when I went to have a look at your ModularSensors repo, GitHub spun for a while and then eventually returned a 504-Gateway Time-out page. 🙂 It has since recovered.

                                • #14016
                                  Anthony Aufdenkampe
                                  Participant

                                    Hey Matt and Neil, we’ve had a bunch of server-side issues in the last few days, so I’m guessing that’s more the problem than anything on your device. Part of the issue was a 5.5 hour internet outage to our servers caused by a Comcast fiber-optic line issue on March 31 in the afternoon. But we’ve had other issues that we’re trying to track down.

                                    If this happens to you again, please run a traceroute or tracert command from your computer and send us the results (maybe in an email to me). See https://en.wikipedia.org/wiki/Traceroute
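If it helps to script this, the following Python sketch picks the right command name for the operating system and captures its output so it can be pasted into an email. The helper names and the 300-second timeout are assumptions; the tool itself must already be installed on the machine.

```python
import platform
import subprocess


def traceroute_command(host, system=None):
    """Return the command list: 'tracert' on Windows, 'traceroute' elsewhere."""
    system = system or platform.system()
    tool = "tracert" if system == "Windows" else "traceroute"
    return [tool, host]


def run_traceroute(host):
    """Run the trace and return its text output (may take a minute or more)."""
    result = subprocess.run(
        traceroute_command(host),
        capture_output=True,
        text=True,
        timeout=300,
    )
    return result.stdout
```

Calling `run_traceroute("monitormywatershed.org")` at the moment the 504s appear would give a result that can be compared against one taken when posts are succeeding.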

                                    I’m connecting this to the related issue on GitHub: https://github.com/ODM2/ODM2DataSharingPortal/issues/477

                                  • #14017
                                    neilh20
                                    Participant

                                      Hi Anthony ~ thanks for the heads-up. For today, on my desk proto system running at 15-minute intervals over WiFi, I haven’t seen any timeouts; all 201s in the last 5 hours.

                                      Matt, sorry to hear about the reconstruction issues. Welcome to software development. I’ve got lots of silver hairs from it.

                                      The process of proving or verifying software functionality, and improving reliability (however it’s defined), is a tough and sometimes expensive process. For overall reliability and repeatability, you could start another forum thread discussing best practices. I’d be happy to share my best practice.

                                      For one software consideration, you might read this for future methods: https://github.com/neilh10/ModularSensors/wiki/Release-downloads

                                      I had one job interview early in my career where a small business owner had had a student write a program for him, and then the student moved on. The interview question was: could I reconstruct the program from paper listings, and then make some modifications to it?!

                                      At least we have GitHub now, and can make copies of the .git. 🙂

                                      I have my desktop development environment, and do accelerated testing (run it at 2-minute intervals over WiFi). Then when I have a stable program, I both label it and uniquely copy and label the firmware.hex. Then I move that out to an outside, more realistic “beta” test site (e.g. 15 minutes over CAT-M1) and let the software run under realistic conditions for some time.

                                      Then, with no known issues, I move it to the field. That is what has happened here: https://monitormywatershed.org/sites/TU-RC-01/

                                      However, for this site, I’m getting anomalies about every month which was a bit beyond my initial testing.

                                      Oh well! Got to work on a local test environment to shadow the field one.

                                       

                                    • #14020
                                      Matt Barney
                                      Participant

                                        Hi Anthony, Thanks for that, and will keep the tracert in mind!

                                        Hi Neil, Yep, that’s software development; I’ve got the grey hairs too! I am interested in learning more about your thoughts and approaches to reproducibility, so will start a thread on that soon. Your progressive testing approach sounds great, and is what I’m trying to work toward as our organization prepares to ramp up deployments. Just trying to find some stable baseline at the moment, and I’m not there yet. I’ll be posting a summary of my latest MMW testing shortly.

                                        Matt

                                      • #14021
                                        Matt Barney
                                        Participant

                                          Hi again.

                                          I ran another test overnight, again using a 2 minute sampling interval, with no code changes, but I removed the build flags for modem debugging from the platformio.ini. Here are the results:

                                          • 427 total samples taken
                                          • 52 sample events received a Response Code 504.
                                          • 246 sample events received a Response Code 201.
                                          • 155 of the samples are absent from the MMW database.

                                          The new wrinkle here is that sometimes the Mayfly apparently didn’t attempt to send to MMW; this happened 129 times. In the log file, one of these occurrences looks like this:

                                          … whereas a “normal” interval across consecutive samples, where the Mayfly does attempt to send, looks like this:

                                          Any idea why the Mayfly didn’t send data at these times?

                                          I’ll attach the log file and a time-series chart of when data was sent/not sent.

                                          Thanks,

                                          Matt

                                          (Edit Apr6: Renaming my logfile so that it will successfully upload here.)

                                        • #14046
                                          Matt Barney
                                          Participant

                                            Today I re-tested, using new hardware: Mayfly, modem, antenna, and LTEBee adapter. I’m still seeing the same problems: both Response Code 504 from MMW, as well as what I described above, where there was apparently(?) no attempt by the Mayfly to upload data after saving it to the SD card based on the messages it logged. I’ll attach the log file. Here is a summary (all times in MST):

                                            • Ran from 04-08T11:48 to 04-08T16:30
                                            • 142 samples recorded: ‘Line Saved to SD Card’
                                            • 136 samples made it into the MMW database
                                            • 14 received RC 504
                                            • 123 received RC 201
                                            • Timestamps of missing points (in MST):
                                              • 11:52 – no attempt to send
                                              • 14:14 – RC 504
                                              • 15:04 – no attempt to send
                                              • 15:10 – no attempt to send
                                              • 15:18 – no attempt to send
                                              • 15:30 – no attempt to send
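The counts reported in these tests can be cross-checked with a little arithmetic, sketched here in Python. The calculation assumes that every sample answered with 201 was stored, so any absent sample either got no logged 201/504 response or got a 504; the function and key names are made up for illustration.

```python
def reconcile(total, rc_201, rc_504, absent):
    """Split the samples absent from MMW by likely cause (assumes all 201s were stored)."""
    no_201_or_504 = total - rc_201 - rc_504          # no attempt, or some other error
    absent_after_504 = absent - no_201_or_504        # absent even though a 504 came back
    stored_despite_504 = rc_504 - absent_after_504   # 504 returned, yet data is on MMW
    return {
        "no_201_or_504": no_201_or_504,
        "absent_after_504": absent_after_504,
        "stored_despite_504": stored_despite_504,
    }


# April 8 test: 142 recorded, 136 on MMW (6 absent), 123 x 201, 14 x 504.
print(reconcile(total=142, rc_201=123, rc_504=14, absent=142 - 136))

# Earlier overnight test: 427 samples, 246 x 201, 52 x 504, 155 absent.
print(reconcile(total=427, rc_201=246, rc_504=52, absent=155))
```

For the April 8 figures this gives 5 samples with no 201 or 504 and 1 sample lost after a 504, matching the list of missing timestamps, and it implies that 13 of the 14 posts answered with 504 were stored anyway. For the earlier run it reproduces the 129 intervals with no send attempt and suggests the 52 posts answered with 504 split evenly into 26 stored and 26 lost.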

                                            Thanks for any insights.

                                            Matt
