Emonhub failure

Hi

I got an issue with Emonhub in the past days.

I have got a RPI+HDD running Emonhub + EmonCMS (V8.1.2, apt version)
Emonhub is forwarding data to the local EmonCMS server and to a remote server (Emoncms.org)

Firstly, data in both emoncms servers were being updated at a very low rate (every 100-200 seconds instead of every tens seconds). 12 hours later, I got no more update in both EmonCMS server. Everything is back to work now after having restarted the emonhub service. See attached graph. It's a power feed which is created in emoncms from a Wh feed sent by emonhub.

It seems to be a problem coming from the local emoncms that replied "sent!ok" to emonhub.
Please find attached the emonhub log file saved just before restarting the emonhub service.

Eric

 

EDIT : I just see a strange line at the end of the log file. It's the last line before emonhub stopped forwarding data :

2014-07-28 22:48:01,653 WARNING Discarded RX frame 'non-numerical content' : ['3', '146', '0', '124', '250', '36', '0', '250', '255', '', '1', '184', '1', '0', '0', '205', '1', '0', '0', '73', '100', '0', '0']

 

pb66's picture

Re: Emonhub failure

Hi Eric

Sorry to hear you've had a problem, I've had a look at the files and see the last line you refer to, the line previous to that is

2014-07-28 22:48:01,651 INFO  NEW FRAME :  3 146 0 124 250 36 0 250 255  1 184 1 0 0 205 1 0 0 73 100 0 0

Which shows in the frame received from node 3, the 9th byte has been omited, why? I couldn't say but the "NEW FRAME" log is prior to any emonHub processing and is for showing the frame in it's original form, so it is probable the frame was received by emonHub that way. This is just one frame and it's occured sometime after the main issue so I wouldn't be concerned about it just yet.

The last good localhost receipt and first occurrence of the rogue "sent!ok" in your logs was at 18:42 on the 28th

2014-07-28 18:42:23,184 DEBUG Receipt acknowledged with 'ok' from http://localhost/emoncms

2014-07-28 18:42:44,592 WARNING Send failure: wanted 'ok' but got Message sent!ok

at this time the updates would of dropped significantly as emonHub would of been stuck trying to deliver the frame for a while then looped again, reading a single frame and getting stuck. so with 5 nodes each of these would get read every 5 post attempts - so very slow for both servers

its not clear when the actual first occurrence was, the post body belonging to the first occurrence in your logs contains 100 frames which is a batch limit set in emonhub so from that I can only tell there were at least 100 frames undelivered at that point in time.

This particular situation should not occur with newer revisions of emonhub as threading is implemented so emonHub is no longer at the mercy of each target server being contactable and able to respond correctly. This does not explain why it stopped reporting all together though and unfortunately your logs do not extend into the period of no updates.

When emonHub was restarted all the buffers were cleared so normal intervals resumed, did you restart emoncms or just emonhub? 

I really don't know how or why the acknowledgement message would change without outside influence, this would appear to be the cause.

Did the data slow and top at the same times on each server or did the emoncms.org server continue to log data at a slow rate beyond the last update on the local server? if they are the same then the local emoncms was being updated and the "incorrect" response of  "sent! ok" could be catered for in emonhub by expecting the response to "contain" ok rather than "be equal to" ok, but that seems like a work around rather than a fix.

Paul
 

pb66's picture

Re: Emonhub failure

PS another concern is the actual value reported, your graph shows a significant drop in power across the slow updates period. the wh accumulator is designed to avoid this, so that should also be looked into also as a network outage or sketchy network connection could present data the same way.

Paul

Eric_AMANN's picture

Re: Emonhub failure

Hi Paul,

Thank for being always ready !

This does not explain why it stopped reporting all together though and unfortunately your logs do not extend into the period of no updates.

>> I got no logs during the period with no updates. The next lines in my emonhub.log file are about the emonhub restart that I did on the 30th of july

When emonHub was restarted all the buffers were cleared so normal intervals resumed, did you restart emoncms or just emonhub?

>> I just restarted the emonhub service. I did nothing on the local emoncms. Since I have restarted emonhub, there is no more "sent!ok" message in the logs.

Did the data slow and top at the same times on each server or did the emoncms.org server continue to log data at a slow rate beyond the last update on the local server?

>> Please find a graph to understand what happened on both server. It's the same feed which represents the total number of kWh elapsed. (There is an offset but don't look at it). The node 4 sends the Wh elapsed every ten seconds and then the total number of kWh elapsed is calculated in emoncms with a accumulator process.

One can see that during the slow update period, the total number of kWh was increasing at a slow rateonemoncms.org. That's ok. But during that period, the total number of kWh increased at an abnormal rate (4-5 times more that expected). IT looks like data were posted several times.

Hope this help to understand what happened.

Eric

pb66's picture

Re: Emonhub failure

Okay, so the graphs clearly show the data was being delivered to the local emoncms, therefore had the unexplained "sent!ok" confirmation been sent or accepted as "ok" then emonhub would of deleted the sent data from the buffer and moved on, issue 1 is where or why did the message change?

and the repeated posting caused the wh accumulators to continue accumulating, which led to exaggerated totals I don't think that should happen so that's another issue.

It sounds like emonHub just stopped after a while, maybe it crashed what size are your buffers? did you have your stats php running? could it of been a RAM shortage?

I think we need to look at the confirmation process, emonhub has changed alot over the last couple of months, can you post the confirmation code from emonhub_dispatcher.py so we know exactly what you have? the current version has 

reply = self._send_post(post_url, post_body)
        if reply == 'ok':
            self._log.debug("Receipt acknowledged with '" + reply + "' from " + self._settings['url'])
            return True
        else:
            self._log.warning("Send failure: wanted 'ok' but got "+reply)

When the confirmation is "ok" returning "True" tells emonHub it's ok to delete the sent data from the buffer, the _send_post() code is  

def _send_post(self, post_url, post_body=None):

        reply = ""
        request = urllib2.Request(post_url, post_body)
        try:
            response = urllib2.urlopen(request, timeout=60)
        except urllib2.HTTPError as e:
            self._log.warning("Couldn't send to server, HTTPError: " +
                              str(e.code))
        except urllib2.URLError as e:
            self._log.warning("Couldn't send to server, URLError: " +
                              str(e.reason))
        except httplib.HTTPException:
            self._log.warning("Couldn't send to server, HTTPException")
        except Exception:
            import traceback
            self._log.warning("Couldn't send to server, Exception: " +
                              traceback.format_exc())
        else:
            reply = response.read()
        finally:
            return reply

 

Trying to track the origin, I have searched the python modules for "sent!ok" and not found it, nor is it in emonhub or emoncms. I did however find the only occurrences of "sent!" with an exclamation mark are in emoncms locale files "Message successfully sent! and user module "Password recovery email sent!" did you change local or reset password in emoncms recently? Even if you did this doesn't explain how the 2 could get lumped together though.

Eric_AMANN's picture

Re: Emonhub failure

I didn't modify the default configuration : bufferMethod = memory and bufferSize = 1000.

Regarding the RPI's stat, I was only logging the RAM used (ram used + ram cached + buffers) and the CPU temp. It shows that it seems not be a RAM issue (see attached).

My version of the emonhub_dispatcher.py file seems to be older as I have no _send_post() function
Here is my confirmation code :

        else:
            reply = response.read()
            if reply == 'ok':
                self._log.debug("Receipt acknowledged with '" + reply + "' from " + self._settings['url'])
                return True
            else:
                self._log.warning("Send failure: wanted 'ok' but got "+reply)

Attached is a copy of my emonhub_dispatcher.py file.

I didn't change local or reset password in emoncms recently.

Eric

Eric_AMANN's picture

Re: Emonhub failure

PS : in my logs, you can also find such message :

WARNING Send failure: wanted 'ok' but got Message sent!Message sent!ok

 

pb66's picture

Re: Emonhub failure

Do you still have the events module installed? and do you have any events configured ? I recall you did install it and That is the one and only place I can find "Message sent!" 

https://github.com/emoncms/event/blob/186539af82a78e6cff844365618443bc1d95bec9/event_model.php#L302

I've been over the emonHub code and can see no other place it could come from other than emoncms, whether it's possible for the events module to interfere with the request response I cannot say, but it certainly seems the most feasible source .

Looking at your graphs, it doesn't look like a RAM issue but is it possible the heat was the final straw? there is a massive increase in temp in-line with the constant attempts to post which just suddenly drops and is probably where the posting finally stops. 

Although I didn't find a definitive answer about the temp range of a Pi, I think general consensus is above 50 deg C is not good. This document  says it may be possible to run at over 50 deg C but also states this

When the system was pushed up to 55°C (131°F), the web browser stalled and the system went
comatose. Temperature of the SOC/RAM was 75°C (167°F). (Since the SOC is under the RAM no direct
temperature measurements of the SOC can be made.) The network part was running at 73°C (163° F).
When the system was shut down and rebooted, it would not operate until the temperature of the
SOC/RAM parts dropped to 50°C. 

Which is cause enough to suspect heat could of been the final straw since you have recorded temps nearing 57 deg C at the GPU, plus the hdd is also in the same compartment and we don't know what temp that was at.

Paul

pb66's picture

Re: Emonhub failure

i just looked at some of your previous graphs and maybe the temp isn't so extreme, it's still high but you have run at those temps before,

You have changed the graph on your last post so there is more detail now and I can see Pi still running after where you marked emonhub shut down, presumably the log files stopped at that time, Does the pi have the ability to end a process if its causing temp issues? Just wondering what changed after 12hrs of consistently trying to post the same data,

Eric_AMANN's picture

Re: Emonhub failure

Hi,

Sorry for my bad english. When I marked "emonhub halt" or "emonhub shutdown", I am talking about the emonhub service and not the RPI itself. At any moment, the RPI has stopped. I guess that emonhub service has stopped as there were no more logs until I restarted that service.

I managed to run a RPI many hours at 65°C (see here) without any problem. So, I'm not sure it is a temp issue. The temp is high (55°C) but stable. According to me, the temp is a consequence and not a cause of  emonhub's state.

Regarding the event module, you must be on the right way as I'm using it on my local emoncms ! I'm not sure there is link but I was facing some problems with it. When a specific condition became TRUE, it sent me the corresponding alert by email. But, when the condition went back to FALSE, the event module kept on sending me email until I deleted the corresponding alert.

To prevent any other emonhub crash, I will temporary delete all alerts.

Eric

 

 

pb66's picture

Re: Emonhub failure

Hi,

There's no issue with your english at all Eric,

I totally agree the high temp is a consequence of emonHub working so hard continually trying to the post data and get a valid response from emoncms, I didn't intend to imply otherwise, I was speculating as to what may have caused emonhub to eventually stop, we may never know!

I'm quietly confident the cause was from within the events module, your confirmation of a issue with it combined with it being the only place i can find "Message sent!" used is too coincidental to be ignored.

Paul

Eric_AMANN's picture

Re: Emonhub failure

Hi,

I had another emonhub failure few days ago with the configuration described above :

  - One RPI (+HDD running) running  Emonhub + EmonCMS (V8.1.2, apt version)
  - Emonhub is forwarding data to the local EmonCMS server and to a remote server (Emoncms.org)

Firstly, data in both emoncms servers were being updated at a very low rate (every 100-200 seconds instead of every tens seconds). some hours later, I got no more update in both EmonCMS server. The root file system of the RPI was full :

pi@raspberrypi ~ $ df
Filesystem     1K-blocks    Used Available Use% Mounted on
rootfs           2721336 2563696         0 100% /
/dev/root        2721336 2563696         0 100% /
devtmpfs          215824       0    215824   0% /dev
tmpfs              44820     240     44580   1% /run
tmpfs               5120       0      5120   0% /run/lock
tmpfs              89620       0     89620   0% /run/shm
/dev/mmcblk0p1     57288   19064     38224  34% /boot

I discovered that I got some very long and repetitive log in /var/log/apache2/error.log (see attached).  I moved them to another machine and then deleted them, but it was growing immediately again and again. I had to stop my local emoncms.

In the emonhub log file, one can see that emonhub got a wrong reply from my local emoncms.

Again just before emonhub failure, I got the same message than above : "WARNING Discarded RX frame 'non-numerical content'"

Eric

 

pb66's picture

Re: Emonhub failure

Hi Eric

I've had a browse through the logs and cannot explain the emoncms entries in the apache log, I'm not familiar enough with emoncms or apache. 

I couldn't find the "WARNING Discarded RX frame 'non-numerical content'" message but if it came at the end prior to failing it wouldn't of been the root cause, the emoncms "unpack .. require 4 got 0" messages date back to 16th august which coincides with some phpmyadmin log entries did you trigger something then ?

Whilst emoncms was not returning "ok" this would cause emonhub to "gum up" as the early version you are using is single thread, regardless of the cause of this issue I would strongly recommend upgrading emonhub as multi-threading was introduced to prevent emonhub from being effected by no or slow responses from servers. Although the ever increasing log sizes would eventually stop anything working.

I didn't notice anything sinister in the emonhub.log beyond the massive "reply" from emoncms and it seems emonhub resumed operation and posted the data as the next reply from emoncms is "ok".

Maybe Trystan or someone well versed in emoncms or the data engines can shed some light on the apache logs for us.

Paul

Eric_AMANN's picture

Re: Emonhub failure

The "WARNING Discarded RX frame 'non-numerical content'" message is at the very end of the emonhub.log file. It's the last message before emonhub stops sendind data. I agree with you that the cause of the failure seems to come from my local emoncms.

BTW, I will move to the last version of emonhub.

Eric

 

Eric_AMANN's picture

Re: Emonhub failure

Hi

I encountered another Emonhub failure (third one...) with the June-apt-version. (I have to update it, I know)

Unlike the two first failures, I didn't get any message starting by "Send failure: wanted 'ok' but got ..."
Like the two first failures, the last message in the logs (attached) is :

2014-09-19 23:47:31,846 INFO  NEW FRAME :  2 36 255 219 255  1 36 3 0 0 7 0 0 0 61 99 0 0
2014-09-19 23:47:31,849 WARNING Discarded RX frame 'non-numerical content' : ['2', '36', '255', '219', '255', '', '1', '36', '3', '0', '0', '7', '0', '0', '0', '61', '99', '0', '0']

Every thing was fine until that message (normal update in Emoncms, no send failure , ...)
So this message seems to be the unique cause of the three failures.
It looks like that a byte has been omited.

Some questions :

   - how is it possible to loose a byte inside Emonhub ?
   - how to prevent Emonhub to halt when such an event occur ?

 

Another strange thing in my logs. The buffer was always growing although I got "Receipt acknowledged with 'ok' from http://emoncms.org".  Now everything is working fine after restarting the Emonhub service. The buffer is not growing all the time and data are sent to EMonCMS.

Eric

pb66's picture

Re: Emonhub failure

Hi Eric, sorry to hear you're having troubles again.

I encountered another Emonhub failure (third one...) with the June-apt-version. (I have to update it, I know)

I really do appreciate your early involvement with emonhub but many things have been changed in emonhub to improve it's functioning as well as improve it's error handling and reporting, if there is a problem with emonhub it may of already been resolved or not even be relative to the latest build. It is also difficult to support the early version you are using since it was "mid development" prior to any versioning so I cannot be sure what code you have so please update !

Unlike the two first failures, I didn't get any message starting by "Send failure: wanted 'ok' but got ..."

Unlike the first two occasions which logged an unexpected response from your local emoncms, you are not sending to a local emoncms for it to return an unexpected response.

So this message seems to be the unique cause of the three failures.

No, this message seems to commonly be the last thing emonhub logs before the "failure". In both the previous occasions we have found emonhub was not the cause and we have no reason to suspect anything different at this time.

how is it possible to possible to loose a byte inside Emonhub ?

Since the first thing emonhub does is log the raw string to the log file and the byte is already missing at that point it is likely to of been missing on arrival to emonhub, perhaps emonhub is slower to be effected by the imminent failure and the serial port skips a beat or something (for example).

how to prevent Emonhub to halt when such an event occur ?

emonhub will not fail just because of a missing byte, if you try posting to emonhub from a sketch with a deliberately omitted value it would run indefinitely and fill the log file with the "discarded RX frame messages"

Another strange thing in my logs. The buffer was always growing although I got "Receipt acknowledged with 'ok'

As discussed in our PM dated 1/7/2014, this was due to a small bug that was resolved back in june where only a single frame was deleted from the buffer rather than the number of frames successfully posted.

I think it would be unlikely that the larger buffer size would of been able to get to a size that could "choke" the Pi's RAM since it can only be 1000 items max and I doubt if emoncms would of failed due to duplicated data, so I'm currently at a loss as to the actual cause.

But I am absolutely sure updating emonhub would have a positive impact regardless.

I'm always happy to try and help, but it would be easier for us both if you updated emonhub.

Do you have any other info?

Paul

Eric_AMANN's picture

Re: Emonhub failure

Hi Paul,

sorry to hear you're having troubles again.

Make sure that I was not complaining at all ! My goal was just to share the bugs I had.

It is also difficult to support the early version you are using since it was "mid development" prior to any versioning so I cannot be sure what code you have so please update !

I haven't update yet my version because I can't physically access to the RPI but only by SSH. I'm afraid of losing the control on the RPI during the update if something goes wrong. I would rather have a version that that makes me loose 1 or 2 days of data each month than loosing the RPI. Maybe, I'm too paranoiac ...

If my version is very out of date, I agree that it doesn't make any sense to share the bugs I had .

I'm always happy to try and help, but it would be easier for us both if you updated emonhub.

I totally understand your point. If I were you I would say the same (in a less polite way).

Do you have any other info?

No but one more remark :  it's unclear for me if emonhub is released or not and what is the last version number.

EDIT : For both, Emonhub and EmonCMS, it would be great to have release notes explaining the new features, the corrected bugs, the installation procedure, ... I know it's a long work.

Eric

 

pb66's picture

Re: Emonhub failure

Hi Eric

i understand now why you haven't updated, I didn't realize it was that same project. however that confuses me a little as I'm sure we confirmed you had updated the system after I did that bug fix.

If you want we can just edit the dispatcher code to fix that bug, which may well ease the load enough to prevent further issues. But there again I guess you only have a week or so for that project to run now, so probably not worth doing. 

it's unclear for me if emonhub is released or not

Me too :-)

Our original intention was to release emonhub as a debian package, but as that looks  less likely now and it's already in general use  I guess it is already unofficially launched. I gave the version included on the latest SD card a version ID of rc1.0 and I have rc1.1 in testing at the moment, minor stuff really just settings validation, default settings and automatic baud selection for rfm type devices. There is a couple of things that need to be decided about installer scripts and the repo structure (to ease upgrading) before they are set in stone but aside from that and the lack of documentation I think v1.0 is good to go!

Paul

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.