emonGLCD - Not responding.

Is anyone else noticing that the unit freezes on the new code? (This appeared when the icons were added.)

The display is on but it's frozen in time: the LEDs are not working (or are stuck on), the time is not updating, the temperature is not being sent back to http://emoncms.org/, and the buttons don't work. Only removing power and reapplying it makes it work again. It seems to happen after 24 to 36 hours, but I'm still checking whether that holds.

Thanks

Rob

tontonsam's picture

Re: emonGLCD - Not responding.

I have the same issue,

my newbie status doesn't help, cannot find the issue.

buttons don't work...

Unit freezes at random times

I ordered a second unit for a friend and the same issue appears... so I guess it's the code, as you said!

Anyone else had this issue?

Petrik's picture

Re: emonGLCD - Not responding.

I added a watchdog: first activation in the setup routine, and then a reset towards the end of the ccoid function. It has now been running for several days without problems - fingers crossed.

mad_dad's picture

Re: emonGLCD - Not responding.

I had similar issues. I added a flashing heartbeat icon to the display to show that the s/w is still running.

It would randomly freeze.

I have modified my SolarPV sketch and it now runs continuously. I also created a new sketch to display 3 solar water panel temps and 2 hot tank temps.

I will have a look at what the changes were when I get home.

tontonsam's picture

Re: emonGLCD - Not responding.

Could you elaborate on what exactly you did? :-)

thx

Robert Wall's picture

Re: emonGLCD - Not responding.

What is "the ccoid function"? File & line no?

Petrik's picture

Re: emonGLCD - Not responding.

void() + ipad keyboard -> ccoid ;-)

 

 

PaulOckenden's picture

Re: emonGLCD - Not responding.

Where's the like button ;)

P.

Robert Wall's picture

Re: emonGLCD - Not responding.

"Where's the like button ;)"

Probably next to the 'any' key.

mad_dad's picture

Re: emonGLCD - Not responding.

If I remember correctly, I added a few lines of code to report received data on the serial port and the sketch size got near the 32,256 byte max.

In the end I changed the code to remove the history page, as I never used it; the sketch size reduced and the problem went away. I also removed the serial reporting later.

Added my own code to display 3 heartbeat pulses on the LCD:

1. every second to show it still running

2. when data received from emontx

3. when time received from NanodeRF

This code necessitated removing the history page

mad_dad's picture

Re: emonGLCD - Not responding.

I also had problems with emonTx RF transmissions repeatedly failing (I have two senders on different channels).

This was caused by sending debug messages to the serial port and then sending RF data. I assume the serial transmission uses some form of interrupt-driven device driver that returns control before the data has been completely sent, allowing interrupts to interfere with the RF call. So this could also affect the LCD's RF transmission if serial output is used beforehand.

Correct me if I am wrong.

Robert Wall's picture

Re: emonGLCD - Not responding.

I took the same basic sketch, but modified it somewhat (making the LEDs and backlight track ambient lighting a bit better to give a more constant perceived brightness) and in doing so I too ran out of memory. I recovered sufficient simply by tidying up the code, introducing 'for' loops rather than repeating code, etc, - all simple little things that added up.

Taking out all Serial commands helps enormously.

I loaded the standard sketch this evening and it locked up after an hour. I didn't see any problems with my slimmed-down version while I had it on test for several days continuously (it's built for a friend, so not actually in use yet).  It says:
Binary sketch size: 32,064 bytes (of a 32,256 byte maximum).

On the RF matter, I believe I read somewhere in JeeLabs that there is a limit on the channel occupancy by any one transmitter, and this is enforced in the library. It's possible that you have hit this. I don't know how the limit is implemented - my guess is it blocks until the appropriate period has elapsed based on the length of the last transmission.
 

mad_dad's picture

Re: emonGLCD - Not responding.

I have 2 GLCDs running, and both had been running for over a week until last night, when the Solar PV one locked up.

Strangely this is the one with the code size very near the maximum.

fluppie007's picture

Re: emonGLCD - Not responding.

Maybe one can code a 'daily reboot' function for the emonGLCD? To avoid lock-ups?

Robert Wall's picture

Re: emonGLCD - Not responding.

"Strangely this is the one with the code size very near the maximum."

I don't think it's all that strange, IIRC the stack grows downwards towards the top of program space, and if there isn't room, something's got to give. The compiler doesn't warn you.

Try slimming the code down a bit! (That's the best suggestion I can make at the moment).

Take out all Serial.***** statements unless you absolutely need them for debugging.
Take out the SolarPV_type that you don't use.
Reduce the size of the str[ ]  array where possible.
Take out the 5 lines or so that write the time, and are repeated several times, and make them a function and call that instead.
Do the same with writing Power. (If the position changes, pass it in as a parameter).
Take out constants like

 int MINTEMP = -15;

and make them preprocessor directives instead (on the basis that the compiler is usually better at optimising than you are!).
Replace repeated strings (like "n days ago") with a loop.
Pre-calculate any maths (e.g  * 0.2 / 3600000  becomes  * 5.555556e-8).

And anything else you can find like the above. Compile and check the size each time you make a change, just to make sure you're going in the right direction.
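Two of the suggestions above (constants as preprocessor directives, and pre-calculated maths) can be sketched in plain C++. This is only an illustration: the function names `clampTemp`, `scaleOriginal` and `scalePrecalc` and the macro `SCALE_PRECALC` are made up here and are not taken from the emonGLCD sketch, and whether each change really shrinks the binary has to be checked by compiling, as described above.

```cpp
#include <stdint.h>

// Preprocessor constant in place of the global 'int MINTEMP = -15;'.
#define MINTEMP (-15)

// 0.2 / 3600000 folded by hand into one constant.
#define SCALE_PRECALC 5.555556e-8

// Original form: a multiply followed by a divide on every call.
// The compiler generally cannot merge the two steps itself, because
// floating-point arithmetic is not associative.
double scaleOriginal(double x) { return x * 0.2 / 3600000; }

// Pre-calculated form: a single multiply.
double scalePrecalc(double x) { return x * SCALE_PRECALC; }

// Illustrative use of the constant: clamp a temperature to the lower limit.
int clampTemp(int t) { return (t < MINTEMP) ? MINTEMP : t; }
```

The two scaling functions agree to about seven significant figures, which is the precision of the rounded constant quoted above.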

 

"Maybe one can code a 'daily reboot' function for the emonGLCD? To avoid lock-ups?"

Surely it is better to remove the cause of the problem rather than fix with a fiddle? And how will the reboot work if it locks up before the 24 hours? And how do you deal with the history, write it to EEPROM before you reboot, I presume? But you can't do that inside the watchdog, and if you write the EEPROM too frequently, it will wear out.

PaulOckenden's picture

Re: emonGLCD - Not responding.

Also, are the people with units locking up running V1.3 or 1.4 boards? Just in case it makes a difference?

 

P.

glyn.hudson's picture

Re: emonGLCD - Not responding.

Hi guys,

Thanks for debugging this, it seems like it's a case of the ATmega running out of memory. For the time being I have disabled the 'history' page on the default emonGLCD home energy monitor and solar PV examples. Please could you give the examples a go and let me know if this increases stability.

I will further look into the problem when I'm back from holiday. 

Cheers! 

Robert Wall's picture

Re: emonGLCD - Not responding.

I have been meaning to understand how the icons get drawn - I suspect it might be possible to save some bytes there too.

(Glyn: If the Solar_PV switch was done on a preprocessor #ifdef / #ifndef , then the unused one isn't even compiled, of course)
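A compile-time switch of the kind suggested here could look like the sketch below. The macro name `SOLAR_PV_TYPE`, the function `consumption` and both formulas are placeholders invented for illustration; they are not taken from the emonGLCD sketch.

```cpp
// Hypothetical compile-time selector: set to 1 or 2 before building.
#define SOLAR_PV_TYPE 2

#if SOLAR_PV_TYPE == 1
// Compiled only for type 1 (placeholder formula): consumption is read directly.
int consumption(int ct1, int ct2) {
    (void)ct2;  // unused in this variant
    return ct1;
}
#else
// Compiled only for type 2 (placeholder formula): derived from both readings.
int consumption(int ct1, int ct2) {
    return ct1 + ct2;
}
#endif
```

Only one of the two function bodies reaches the compiler, so the unused variant costs no flash at all, whereas a run-time `if` on a variable may leave both branches in the binary.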

mad_dad's picture

Re: emonGLCD - Not responding.

Both mine are  V1.4 models.

One has a code size of 24K and has never frozen.

The other, the SolarPV one, is 31K+ and occasionally freezes.

I have now removed history page and pared it down to 30.8K. Will try this for a week.

alpav's picture

Re: emonGLCD - Not responding.

Hi everybody. I've built up an emonGLCD v1.4 that has been running with the sketch got from github (just translated to Italian) for the first 6 days of its life.

On Sunday morning I have found it stuck, not responding to anything, time frozen at 01:04. Rebooted and still up and running. During the night the emoncms has been updated.

I then found this thread and spent some time optimizing the code following the suggestions above. I didn't want to remove any screens, so I haven't gained much: my sketch is now a little over 31 kB. Better than before, anyway.

After 25 hours it froze again, tonight.

At the same time, I got an error from emoncms.org (I was checking my dashboards). The ADSL was unstable for a few minutes. The base failed to get the time and I found the emonGLCD frozen with the baseFail error string displayed.

I am not sure that this is related to the problem, but I am pretty sure that the baseFail error also appeared before the first freeze. This evening I had to restart the base (OKG with W5200) too because despite the two leds were blinking apparently in the normal way, the time on the GLCD was not updated.

After that everything is working again, but I've noticed a weird thing on emoncms: all the data already there has been shifted back by one day. For example, the GLCD froze at 20:04. Until then, the PV power logged for the 19th of November was 6.61 kWh. After the base reboot, the PV power for the 19th of November reads zero and the figure for the 18th of November is 6.61 kWh. All the previous data is likewise shifted back by one day.

Hope it helps.

Alberto

robw's picture

Re: emonGLCD - Not responding.

Well, I'm still getting the crashing even with the new code.

I've tried removing the icons as a trial, and so far this has given me 3 days and counting without a crash.

I've been trialling the use of this code from JeeLabs: http://jeelabs.org/2011/05/22/atmega-memory-use/

Will let you know how I get on.

Robert Wall's picture

Re: emonGLCD - Not responding.

alpav

"This evening I had to restart the base (OKG with W5200) too because despite the two leds were blinking apparently in the normal way, the time on the GLCD was not updated."

It is possible for the Wiznet module to fail to initialise. On that basis, I think it is quite probable that it can crash, and leave the OKG still running. Glyn and I have developed a modification to allow the Wiznet to be reset under software control. If you do that, you will be able to reset and reconnect if, for example, you do not receive an acknowledgement from the server.

The details are here: http://wiki.openenergymonitor.org/index.php?title=Open_Kontrol_Gateway#W...

(My fix was slightly different, so the software is different but the result is identical).

alpav's picture

Re: emonGLCD - Not responding.

Hi Robert,

I've been away for a bit (sickness + business travel) and I've had to leave my investigation.

I've definitely built the modification as per the detailed instructions at http://wiki.openenergymonitor.org/index.php?title=Open_Kontrol_Gateway#W... and I am running the sketch from github. I don't fully understand if something peculiar is needed in order "to allow the Wiznet to be reset under software control". I mean: does the sketch need to be modified?

There has been no need to reset the base anymore since my last post (Nov 25) and looking at the emoncms logs (I am using the emoncms.org) it seems it didn't crash during this period. Still experiencing some difference between what is logged by the emoncms and what is logged by the emonGLCD. I suppose that this is due to a different success rate in getting the readings from the emonTx: it can be either the emonbase getting less samples than the emonGLCD (I don't think so as they're located in the very same place) or an high failure rate of the posting activity. Moreover, at morning - every morning - the emonGLCD displays the basefail error even though I can see both the display and the emoncms.org showing the correct data. The error disappears around 10 o'clock. I couldn't collect more information so far so I'll postpone the analysis.

What I am more concerned about is the emonGLCD freezing quite regularly. It works for approx. 7 days, then it locks up completely: display, leds, everything. It needs to be powered off and on. Then it goes for another week. I would like to avoid the solution of putting an automatic reset on a timer, as I would rather understand the problem and solve it.

I explored the idea that the size of the executable was the cause, so I worked on making it smaller, first by optimizing the code (going from around 32k to 31k), then by cutting out almost everything. My temporary sketch now shows just the first template, with everything else removed, and comes to 21k. It still freezes after 7 days.

So I came to the conclusion that the size of the executable is not the problem. I suspect that the issue might be related to RAM usage and the size of the heap and stack. I'm afraid that a memory leak may corrupt the sketch at run time and make the GLCD crash. But I don't really know how to investigate this.

Any ideas? Am I off track?

Thanks.

Robert Wall's picture

Re: emonGLCD - Not responding.

It does indeed sound like a memory leak or an overflow somewhere. There's a library function that Trystan posted about here http://openenergymonitor.org/emon/node/1370#comment-7422 that might help. I guess you'd need to call it regularly and see if the number changed. I think that would tell you if you have a leak. But if you do have a leak, finding where it is might be an entirely different matter!

Do you still have the History page? That runs on a weekly cycle. Something that happens every 10 s is close to overflowing an unsigned integer after 7½ days. Those are the kinds of things that I would look for.
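The arithmetic behind that 7½-day figure can be checked with a short sketch. It assumes a hypothetical 16-bit unsigned counter that is incremented once every 10 s; nothing in this thread confirms that the emonGLCD sketch contains such a counter, and the names `secondsToWrap` and `nextTick` are invented for the example.

```cpp
#include <stdint.h>

// Number of increments a 16-bit unsigned counter takes to wrap back to 0.
const uint32_t TICKS_TO_WRAP = 65536UL;

// Seconds until a 16-bit counter wraps if it is incremented once per period.
uint32_t secondsToWrap(uint32_t periodSeconds) {
    return TICKS_TO_WRAP * periodSeconds;
}

// The wrap itself: 65535 + 1 becomes 0 when stored back into a uint16_t.
uint16_t nextTick(uint16_t tick) {
    return (uint16_t)(tick + 1);
}
```

With a 10 s period, `secondsToWrap(10)` gives 655360 s, which is a little over 7.5 days (86400 s per day), so a counter like this would wrap at roughly the interval at which the freezes are reported.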

Going back to the first question, now that you have the Wiznet reset on an IO pin, you can trigger a reset if (say) you get no response from the server after n retries, or whatever you wish. What I did was reset the Wiznet during startup after the Arduino had settled, then I enabled the watchdog and arranged it so that the watchdog did not get reset if network comms failed. That way, the watchdog would reset the OKG sketch,  the sketch would reset the Wiznet inside setup( ) as it restarted. But if you can determine that you only need to reset the Wiznet module, then it's possible to do only that by driving the pin it's connected to low for a short period. Copy the lines from the standard sketch:

  pinMode(WizResetPin, OUTPUT);        // Reset the Wiznet module
  digitalWrite(WizResetPin, LOW);               
  delay(5);
  digitalWrite(WizResetPin, HIGH);

 

fluppie007's picture

Re: emonGLCD - Not responding.

Same here, will try to reduce the code to avoid lock-ups.

Ian Eagland's picture

Re: emonGLCD - Not responding.

Hi

I have an emonGLCD. I have tried most of the example sketches over the last month or so. Every one apart from the tester sooner or later freezes. I have removed all references to serial but it makes no difference. Has any one a sketch they are willing to share (that includes radio function) that runs continuously which I could use as a base for my requirement which is very simple. I want to display the temperature from just one of my temperature nodes.

MartinR's picture

Re: emonGLCD - Not responding.

I have a theory that emonGLCD locks up due to problems with the JeeLib library.

Transmission systems, particularly radio ones with multiple transmitters, are very tricky to code for because you can never guarantee what will be received. This is made worse in this case by the high bit rate used (another bugbear of mine!) and by the fact that there is nothing to prevent 2 nodes transmitting at the same time.

The code needs to be written defensively to take account of all possible transmission errors, and I'm not sure the JeeLib code is.

One bit that concerns me is in the interrupt routine...

if (rxstate == TXRECV) {
    uint8_t in = rf12_xferSlow(RF_RX_FIFO_READ);

    if (rxfill == 0 && group != 0)
        rf12_buf[rxfill++] = group;

    rf12_buf[rxfill++] = in;
    rf12_crc = _crc16_update(rf12_crc, in);

    if (rxfill >= rf12_len + 5 || rxfill >= RF_MAX)
        rf12_xfer(RF_IDLE_MODE);
} else {....

This bit of code puts received characters into a buffer until rf12_len bytes are received or the buffer fills.

The problem is that rf12_len is one of the received characters in the buffer, which may itself be corrupted during transmission (remember this is before the CRC check) or some characters may simply be lost and there aren't enough received to ever reach rf12_len.

I think that in this situation the code will simply wait forever or until enough bytes come along to reach the rf12_len count, but then the next message will be corrupted too and so on.

Happy to be proved wrong about this but it doesn't look ideal to me.

Robert Wall's picture

Re: emonGLCD - Not responding.

I haven't studied the library, so I didn't want to sling mud without justification, however I'd go along with the notion that the problem is more likely to be in a library. However, I'm not sure you are correct with "there is nothing to prevent 2 nodes transmitting at the same time". I understood there was - transmission only happens if the channel is 'clear' - but that of course depends on both transmitters being able to hear one another!

The big problem as I see it is that the library cannot do much else because it tries to be totally universal and keep to a minimum size of packet. And in doing so some robustness has been sacrificed.

MartinR's picture

Re: emonGLCD - Not responding.

There is a test to see if anything is being received before transmission is initiated as you say but sooner or later you're going to get 2 nodes that both see a clear channel at the same time and consequently both transmit at the same time.

I don't think this is the main issue though, I think the more common problem will be bytes being lost or corrupted from a single transmitter.

I don't agree that there's not much the library can do though. A simple timeout in the receiver could prevent locking up. I have one in my own code and I haven't had any lockups.
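A receiver timeout of the kind described could be sketched roughly as follows. This is not MartinR's code and not part of JeeLib: the `RxState` structure, the functions `onByte` and `checkTimeout`, and the 50 ms value are all assumptions made for illustration. The time is passed in as a parameter (on an Arduino it would come from `millis()`).

```cpp
#include <stdint.h>

// Hypothetical receiver state; not JeeLib's data structures.
struct RxState {
    bool     receiving;   // true while a packet is partly received
    uint8_t  fill;        // bytes stored so far
    uint32_t lastByteMs;  // time at which the most recent byte arrived
};

// Assumed value; it would need tuning to the bit rate and packet length.
const uint32_t RX_TIMEOUT_MS = 50;

// Call whenever a byte arrives.
void onByte(RxState &rx, uint32_t nowMs) {
    rx.receiving = true;
    rx.fill++;
    rx.lastByteMs = nowMs;
}

// Call regularly from the main loop.
// Returns true if a stalled packet was abandoned and the state reset.
bool checkTimeout(RxState &rx, uint32_t nowMs) {
    // Unsigned subtraction stays correct when the millisecond clock wraps.
    if (rx.receiving && (uint32_t)(nowMs - rx.lastByteMs) > RX_TIMEOUT_MS) {
        rx.receiving = false;
        rx.fill = 0;
        return true;
    }
    return false;
}
```

If the bytes really are written from an interrupt routine, the shared fields would also need to be `volatile` and read with interrupts briefly disabled; that detail is left out of the sketch.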

Robert Wall's picture

Re: emonGLCD - Not responding.

In that case, would a fork of the library be our best option? Or should the suggestion be put to JeeLib?

(I agree with the timeout - I meant there's little the library can do if it can't detect the other transmission).

Ian Eagland's picture

Re: emonGLCD - Not responding.

This is all fairly new to me. I understood  that the library is used in nearly all emon nodes but the only one I have any trouble with is emonGLCD. Is there something unique about the way the emonGLCD sketch uses the library? Or something different about the way the sketch is coded?

MartinR - Are you willing to share the code you are using. I would really like to get my emonGLCD working reliably.

fluppie007's picture

Re: emonGLCD - Not responding.

That's indeed a good remark! My emonTx always works and sends data every 10s and I have almost never a lost packet.
I think that this happens because emonGLCD does sending and receiving. The emonTx is only sending.

EDIT: is the JeeLib based on this lib: http://www.das-labor.org/wiki/RFM12_library/en or is it a library written completely from scratch?

PaulOckenden's picture

Re: emonGLCD - Not responding.

Don't forget that there's loads of other Jeenode based projects out there (not just openenergymonitor), working successfully. 

I've got a little project running on a JeeLink which sends and receives, and which has been running for a few weeks now.

P.

JBecker's picture

Re: emonGLCD - Not responding.

Quote: There is a test to see if anything is being received before transmission.....

From a discussion in another forum it is clear that this test does not work (this was at least true of the JeeLib library as of 1 week ago). The library uses a byte read function for accessing the RFM12B chip status which does not assert chip select. Even when this is changed to the working word read, the DRSSI bit in the RFM12 reacts so slowly that the information is not very useful.

But I do not think that this causes the problem with the emonGLCD.

The stack (in RAM) has nothing to do with program size (FLASH), so it is also not very likely that increasing program size will lead to stack problems.

BR, Jörg.

 

MartinR's picture

Re: emonGLCD - Not responding.

Ian Eagland - my code won't be much use to you as it stands as it's written for my 3-phase system and uses a different bit rate, data format etc.

fluppie007 - that library looks interesting but it isn't written for Arduino, just generic AVR. It is different from the JeeLabs one but may need some work before it can be used as a substitute

What this really needs is for someone to run a hardware debugger on an emonGLCD that regularly locks up to see where the code gets stuck.

Ian Eagland's picture

Re: emonGLCD - Not responding.

I just realised the NanodeRF also sends and receives. I have one and as far as I know it has never hung up. The main difference in the code I can see with my limited knowledge is that it has a watchdog enabled. Is adding this to the emonGLCD a way of solving the freezing problem?

JBecker's picture

Re: emonGLCD - Not responding.

Brownout detection and watchdog should always be enabled. Watchdog especially so if while loops without timeout are used in the code. And the Jeelib library code for RFM12 uses while loops in some places. 

MartinR's picture

Re: emonGLCD - Not responding.

A watchdog should only really be used to recover from catastrophes like power glitches or atomic particle hits. They shouldn't be used to cover up fixable bugs.

I'd be more inclined to turn the watchdog off on the NanodeRF and then see if that locks up. 

Avontech's picture

Re: emonGLCD - Not responding.

Just built a script (Sunday) stripped down to the bare minimum, single screen reads and displays output from two emonTX's each with 3 CT's on them, it has now been running for 2 days without a hitch - that's 1 1/2 days longer than it's ever run before :)

One emonTX is monitoring consumed power and generated power (PV); the other is monitoring the output from each of the three Solar PV inverters (14kW system :) ). I've included the LED colour change bit of the script.

emonGLCD no longer freezes, however occasionally it loses / doesn't get / ignores the input from the 2nd emonTX monitoring the individual inverters, - sometimes for up to 8 hours, rest is running fine, and then suddenly it gets a reading and starts updating regularly.

So, a small script (22k) appears to be 'stable' though still not reliably reading multiple emonTX's

Petrik's picture

Re: emonGLCD - Not responding.

With wdt my emonglcd has been running for over a month without hanging. Currently 3xEmonTX (including several 1-wire sensors) and 2xFunky are measured. This did not remove the potential problems, but it improved usability.

Avontech's picture

Re: emonGLCD - Not responding.

Spoke too soon, stopped / hung / froze 2 hours later ....

@Petrik, how have you got / kept yours running? Would you be prepared to share your script?
Also what are the dates of the libraries?
What do you mean by 'With wdt'?

Thanks

fluppie007's picture

Re: emonGLCD - Not responding.

I think he means the watchdog timer. I'm also interested in the modified code.

JBecker's picture

Re: emonGLCD - Not responding.

Quote: A watchdog should only really be used to recover from catastrophes like power glitches or atomic particle hits. They shouldn't be used to cover up fixable bugs.

OK, then I just have a very different view on the use of a watchdog (even if I am absolutely sure I will get up in time every morning, I can use an additional alarm clock. If I then wake up due to the ringing, I know that something 'strange' has happened). I think it is not very clever to not use a safeguard if it is available (clearly not meant to offend you, Martin!). The reset cause can be read out after startup of the program, and a watchdog reset can be flagged to the user or stored in EEPROM for later analysis. Why not use this additional method to find a possible reason for the freezes?
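Decoding the reset cause could be sketched as below. The bit positions are those of the ATmega328P's MCUSR register; the helper names `wasWatchdogReset` and `resetCauseName` are invented for this example. The functions take the register value as a parameter so that they can be tried off the device.

```cpp
#include <stdint.h>

// MCUSR flag positions on the ATmega328P.
const uint8_t FLAG_PORF  = 1 << 0;  // power-on reset
const uint8_t FLAG_EXTRF = 1 << 1;  // external reset pin
const uint8_t FLAG_BORF  = 1 << 2;  // brown-out reset
const uint8_t FLAG_WDRF  = 1 << 3;  // watchdog reset

// True if the saved MCUSR value shows that the watchdog caused the reset.
bool wasWatchdogReset(uint8_t mcusr) {
    return (mcusr & FLAG_WDRF) != 0;
}

// Short label for the reset cause, suitable for the display or a log.
const char *resetCauseName(uint8_t mcusr) {
    if (mcusr & FLAG_WDRF)  return "watchdog";
    if (mcusr & FLAG_BORF)  return "brown-out";
    if (mcusr & FLAG_EXTRF) return "external";
    if (mcusr & FLAG_PORF)  return "power-on";
    return "unknown";
}
```

On the device the value would be captured at the very start of `setup()`, by copying `MCUSR` into a variable and then clearing the register. One caveat: a bootloader may already have cleared `MCUSR` before the sketch runs, in which case the sketch sees 0 and this approach reports "unknown".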

 

PaulOckenden's picture

Re: emonGLCD - Not responding.

I bet it turns out to be an if (a=b) problem (as opposed to ==). Seems to be the main cause of most of my bugs ;)

 

P.

MartinR's picture

Re: emonGLCD - Not responding.

I wasn't trying to say you shouldn't use a watchdog Jörg. What I was trying to say was that if you've got a design fault in either software or hardware, which I think we clearly have here, then masking it by resetting the processor every time it occurs isn't the answer. Although I can fully understand why people want to use it so that their particular system keeps going.

Your example of using the watchdog to track down a problem is fine, you should use every tool you have available. (I wasn't offended at all by the way).

fluppie007's picture

Re: emonGLCD - Not responding.

Something stupid probably, but could it be that the LCD controller gives 'bad' signals to the Atmel MCU and as a result, it freezes?

glyn.hudson's picture

Re: emonGLCD - Not responding.

Hi guys, I've just dipped back into this thread. Thanks for continuing to debug this. It's only recently since the addition of the icons that the freezing seems to have started. I have had an emonGLCD running for 6 months or so in my house without a reset before the icons were introduced. Since running the sketch with the icons I too have experienced lockups every week or so. This is a shame since the icons are very nice! 

 

It sounds like we have hit the upper ceiling of the atmega328. In the future maybe a more powerful micro could be used. In the meantime I would be happy to sacrifice the history page and power view pages in return for having the icons; hopefully this will be stable. I'll implement it on my home emonGLCD when I get home next week and see how it goes. Maybe it would be possible to optimise the way the icons are implemented or simplify them. Expect an update in the new year. Have a good Christmas!

Petrik's picture

Re: emonGLCD - Not responding.

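Here is a simple way to add the WDT to the emonGLCD: only the three lines marked "added" below need to be added...

 

#include <JeeLib.h>
#include <GLCD_ST7565.h>
#include <avr/wdt.h>            // added: AVR watchdog timer support

....

void setup()
{
wdt_enable(WDTO_8S);            // added: reset the MCU if not serviced for 8 s

....

void loop()
{

....

wdt_reset();                    // added: service the watchdog on every pass of loop()
if ((millis()-slow_update)>10000)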
  {

...

 

 

Robert Wall's picture

Re: emonGLCD - Not responding.

Yes, but as MartinR says, it is better to solve the problem than to cover it up. If you have daily totals stored and displayed in your emonGLCD, resetting via the watchdog destroys those.

robw's picture

Re: emonGLCD - Not responding.

I've been playing around for the last few weeks with my GLCD. It seems that if you disable the LDR / PWM LEDs it stays up for longer. I still get lockups, just not as many.

I did experience these lockups when I first tested the PWM of the LEDs. I wonder if others could test this.

I'm still running the extra page and the icons in my version at the mo. I will try to remove more later to see if I can get it to never crash.

fluppie007's picture

Re: emonGLCD - Not responding.

Last night I took Glyn's advice and removed the icons from the functions and also removed the icons.ino file. I'm curious now how long it will stay up :-).

fluppie007's picture

Re: emonGLCD - Not responding.

Sadly, it still locks up. I'm going to add the Watchdog timer in the code.

EDIT: Watchdog code added, and I discovered something weird. When I touch the emonGLCD's crystal or the 2 ceramic capacitors next to the crystal, the Arduino resets or shows really strange behaviour (display artefacts, backlight drops out). Could this be the reason why the emonGLCD locks up from time to time?

Avontech's picture

Re: emonGLCD - Not responding.

I've now got a small code running that monitors two emonTX's each with 3 CT's monitoring a solar PV installation (3 inverters), it has still been locking up, so I've just removed the LED code as well. (single screen, no history or power screens)

No Watchdog running as yet.

My emonGLCD was one of those where I let the two capacitors reconfigure themselves as I had a 'fading' LCD.

When I reprogram it, it won't reboot properly by powering down and up. Doing that via the USB power, it starts up and then hangs after 10 seconds (I've got a small counter in the sketch). To restart it properly I have to attach the programming header (whether it is connected to the PC's USB port or not doesn't matter): as I attach it, it reboots, and when I remove the header it then runs OK (save for the locking up after about 24 - 36 hours). I'll leave it running to see how long it goes without 'hanging'.

 

 

Brian D's picture

Re: emonGLCD - Not responding.

I have two emonGLCD. Unit A has been in continuous operation for a couple of weeks and has never crashed. Unit B was assembled on 29/12/2012 and crashed overnight. It was power cycled and within one hour had crashed again.

These units are identical build states running the same software which is the standard code with nothing removed but some stuff added. The only difference I could see was in the power supply. Unit A was using an old switch mode Alcatel phone charger that gave an output on load of 5.09V. Unit B was using an Emerson switch mode PSU from OEM shop which gave an output of 5.1V on load.

Checking the noise on the 5V rail the Alcatel has a very small noise floor which a ‘scope at 50mV/div could not resolve. The OEM PSU has a 50mV 500Hz component which really should not be a problem but on the other hand should not be there.

Looking at the emonGLCD schematic there is no large input cap on the 5V rail. Checking the emonTx it is the same circuit.

The obvious test was to run unit A on unit B’s PSU. This was done last night and this morning unit A had crashed.

As an experiment I added a 47uF electrolytic to the 5V rail on both units and measured the noise level. It is now a flat line using either PSU.

Fortunately next to the mini USB connector there is a very small space with available pads for a small electrolytic.  I don’t know if this is the answer to the GLCD crashing problem and I only have a sample of two units to play with but if anyone else wishes to try this I would be interested to hear the result.

 

fluppie007's picture

Re: emonGLCD - Not responding.

That's very interesting! I have to say, with the watchdog code it's running 4 days in a row now. But as stated, it resets the kwh from that day. So I'm very keen to try out the 47µF cap on the 5V rail. A bigger cap (100/470 µF) is no problem I guess?

Brian D's picture

Re: emonGLCD - Not responding.

Fluppie said:

A bigger cap (100/470 µF) is no problem I guess?

No problem electrically but you may have a problem finding room for it.

I thought my findings worth posting but we have yet to prove if this is really the problem. Your contribution will be a great help.

Both units still OK here.

fluppie007's picture

Re: emonGLCD - Not responding.

It appears that I have about 50 47µF caps left over from a DIY amplifier a few years ago. The mod was easy for me because I had soldered female headers to the JeePort and the PWR/SER/I2C 'port'. So I just soldered the 47µF cap to a male header and plugged it into the emonGLCD :-). I'm very curious whether it will solve the freezes!

robw's picture

Re: emonGLCD - Not responding.

I'm not sure the cap is going to solve it. I've added one to mine (I only had a 25V 10uF, so this may not be enough) just to test, but mine is powered by a lipo battery and still crashes.

I've added some delays in places to see if this helps. Will report back later.

 

Brian D's picture

Re: emonGLCD - Not responding.

My guess is that the cap will not fix the problem  but now we have four units running the same test which is good.

Let's see what tomorrow brings.

Avontech's picture

Re: emonGLCD - Not responding.

I've just added the 47µF capacitor across the power and ground connections on the JeePort, - we'll see.

Brian D's picture

Re: emonGLCD - Not responding.

Last night my wife's unit (unit A) crashed twice and was locked up again this morning. The new unit was still working this morning.

Given that my wife's unit never crashed before she blames the little blue thing that I added :(

Unit A was running on the Alcatel PSU and unit B on the Emerson PSU.

I now have the two units in the same room  running from the same wall socket but there is no evidence so far to support the 5V noise idea. However, unit A went for about two weeks without ever crashing and now it crashed three times in 24 hours. If it continues to do that I have no choice but to remove the cap.

MartinR's picture

Re: emonGLCD - Not responding.

If one of your units crashes again it might be worth trying switching the other one off and then seeing if it happens again just to see if it's transmission related as I suggested above.

Are both your emonGLCDs still on the same node ID?

oh - and Happy New Year everyone :)

robw's picture

Re: emonGLCD - Not responding.

My unit crashed at 0:59, as reported on the LCD.

It's still running from a battery (LiPo) and has the added 25 V 10 µF cap.

PaulOckenden's picture

Re: emonGLCD - Not responding.

Is it worth trying one with a WRONG group ID (i.e. so it gets no data)?

This would prove whether it's related to the processing of the data received.

P.

robw's picture

Re: emonGLCD - Not responding.

Just crashed again. Less than 10 mins of use.

What type of solar has everyone got?

E.g. type 1 or type 2. This may help if it's related to one set of code.

I'm type 2. No time/date being sent to the GLCD, just power from the emonTx.

Brian D's picture

Re: emonGLCD - Not responding.

Type 2 here.

Both GLCD running same code so I guess same node ID. I will look into that later today.

I ran an un-calibrated noise injection test this morning. Both GLCD plugged into an extension socket with about 1m of cable. The extension socket plugged into a 30m extension socket and a 12V Weller soldering iron (which has a transformer) was plugged into the long extension lead. The mains side of the Weller was then repeatedly switched on and off.

This test has been known to bring susceptible equipment to its knees fairly quickly, but in this case neither GLCD crashed, although there did appear to be the very occasional loss of comms.

I think the PSU idea is a red herring but I need to get more data on how often either or both units are crashing before I make further changes otherwise I won't know what the effect is.

Incidentally, when the crashes occur I check that the backlight control is not working, LED control is not working, LCD is frozen and buttons are not responded to.

Avontech's picture

Re: emonGLCD - Not responding.

I'm running 'Type 1' solar, - I've also got three Inverters, so run two emonTX's though the code I'm running is stripped right down - single screen only display total gen, total consumption, calculated difference, plus the output from each of the three inverters.

If you want to have a look at the code, it's here: http://worcesterrenewables.com/openenergymonitor/HomeEnergyMonitor_868_d...

(On the Display, CT3 on Node 10 currently displays the 'difference' )

Been going for 8 hours so far, though that is not uncommon..

Avontech's picture

Re: emonGLCD - Not responding.

Here's an interesting phenomenon. One of the pieces of code in my script is a timer that simply starts from zero when rebooted. Before I installed the capacitor, I had occasionally rebooted the unit by plugging and unplugging the USB programming adaptor on the emonGLCD (it didn't need to be connected to the PC, just the bare adaptor). On some occasions this brought it back to life without resetting the timer; on others it did reset.

Also if I just power it up from the USB power supply it ALWAYS gets to 10 seconds and then hangs, I either have to start it connected to the PC and then swap supplies 'live' or power it up and reboot as above.

I saw a similar effect by putting the capacitor across the programming header before soldering it to the JeePort, however it still needs the reboot with the soldered in capacitor....

MartinR's picture

Re: emonGLCD - Not responding.

That may be a different problem as they don't normally hang after exactly 10s. Should be a lot easier to debug though if it is repeatable.

Are you transmitting temperature every 10 seconds as in the standard code?  If so I would think that is the first place to look.

Brian D's picture

Re: emonGLCD - Not responding.

Martin said:

If one of your units crashes again it might be worth trying switching the other one off and then seeing if it happens again just to see if it's transmission related as I suggested above.

This test (one switched off) has now been running for four hours without a problem but more importantly I realise now that I have only ever seen a failure when two units are running. When one of the pair crashes the other seems to keep going OK.

I think Martin could be on to something here. - Test continues.

fluppie007's picture

Re: emonGLCD - Not responding.

Mine is still running. approx. 30 hours now. My RFM2Pi base station doesn't send out time. So my emonGLCD only transmits the temperature and receives from emonTx.

Standard power+temperature sketch, group id 210 and node id 20.

Avontech's picture

Re: emonGLCD - Not responding.

The hang at 10 secs on start-up is 'fixed' temporarily by commenting out the transmission of temperature from the emonGLCD; however, it still doesn't find / read the data from the emonTxs when started up that way. It doesn't matter which power supply I use, so long as it has the USB programming interface dongle and powers via the programming header, i.e. I can use the usual USB power supply that way and it starts up properly; with a straight-through cable it doesn't....

Still going strong 4 hours after reboot; we'll see in the morning..

Brian D's picture

Re: emonGLCD - Not responding.

I have a ‘reliable’ failure condition. If both GLCD units are running then one, but never both, will fail within a few hours. The GLCDs now have different node IDs (20 & 25).

Yesterday I removed the temperature transmission code from both GLCD. Now both units have run for 10 hours without failing. They have not done that before.

 

MartinR's picture

Re: emonGLCD - Not responding.

Everything I've heard so far still seems to point to a problem with the JeeLib library. As fluppie007 has confirmed you can still get problems with a single transmission, as it can still be corrupted, it's just more likely with multiple transmitters. I have a test running which will hopefully prove this one way or the other.

Avontech - your problem sounds like a reset issue since the reset line is connected to the header but not the USB socket. It may be that the fuses in the ATmega328 aren't programmed correctly for the internal reset.

PaulOckenden's picture

Re: emonGLCD - Not responding.

Is anyone friends with Jean-Claude Wippler? If JeeLib is the current prime suspect it might be worth getting his input.

P.

JBecker's picture

Re: emonGLCD - Not responding.

Quote: When I touch the emonGLCD's crystal or the 2 ceramic capacitors next to the crystal, the Arduino resets or shows really strange behaviour.

I do not know how the fuses are normally set for arduino hardware, but such a behavior is typical for ATmegas if the fuses are not set to 'full swing' crystal (which will increase the drive and therefore amplitude of the crystal oscillations).

But this has nothing to do with this thread's main problem.

Ian Eagland's picture

Re: emonGLCD - Not responding.

Just to add my experience. I have tried everything suggested in this thread except adding the capacitor. My emonGLCD always freezes after random periods. Three days ago I commented out the transmit code in the GLCD firmware. It is still running without any problems and I think this is the longest it has ever run. Way back, I think Glyn or Trystan said they had an earlier version of the code working for months prior to adding the icons. Does anyone know if this earlier version included the transmit code?

DaveLloyd's picture

Re: emonGLCD - Not responding.

Just to confirm these observations. My two emonGLCDs (one v1.3 and one V1.4) had been working reliably for many months. I updated them to new sketches including the icons and templates and now they are both suffering from this freeze issue. If I don't restart the frozen emonGLCD the other emonGLCD will run for several days. They have different node IDs.

robw's picture

Re: emonGLCD - Not responding.

Right, I've now managed to get mine to run for 3+ days straight. A record: before this it managed a max of 1.5 days, and that was only once.

I tried playing around with the icons, as for me that seems to be when it lost stability. No real success though, even removing them. But looking at the code there are lots of int variables where most could be byte (well, I think they can, e.g. why are the hour and minute an int when a byte (0-255) should be fine?). So I've changed them and removed some duplicate code; see below for an example.

All changes are commented (I hope) with // RW ADD should you just want to look.

eg

    if (SolarPV_type==1){
        usekwh += (emontx.power1 * 0.2) / 3600000;
        //genkwh += (emontx.power2 * 0.2) / 3600000;       //See Below
    }

    if (SolarPV_type==2){
        usekwh += ((emontx.power1 + emontx.power2) * 0.2) / 3600000;
        //genkwh += (emontx.power2 * 0.2) / 3600000;       //See Below
    }

    genkwh += (emontx.power2 * 0.2) / 3600000;         // moved from solarpv type

Anyway, if anyone else wants to take a look or even try my hack at it, please do (you will need to change the solar type and maxgen).

https://github.com/rob-walker/EmonGLCD/tree/master/emonGLCD_SolarPV

Please ignore display.ino and serial.ino. I'm not great at GitHub and can't work out how to delete files..

rob

JBecker's picture

Re: emonGLCD - Not responding.

Rob, does this modification really decrease the code size for you? This certainly is redundant code, but the compiler should recognize this and optimize it.

I think that this code is written for clarity and is not really optimal.

It does not make sense to test SolarPV_type for equality with different values, you would normally use an else if instead (a define would be even better). 

Then the multiplication by 0.2 and subsequent division by 3600000 could be replaced by a single division by (5*3600000). This should really decrease code size.

Then a few lines further down there are two for loops which can be replaced by one.

And, and, and. But does this really help to increase the stability of the code?

There are at least two real 'bugs' in the code (even if they might not be very serious):

- After commenting the code for page 4, the variable page should only be increased to three when a button is pressed.

- After reception of a message from node 15, emontx still points to the receive buffer and if evaluated (e.g. emontx.power1) can give completely wrong results.

But all this should not really be the reason for freezing.

BR, Jörg.

 

fluppie007's picture

Re: emonGLCD - Not responding.

Sadly the 47µF doesn't solve the problem. So far only adding the watchdog timer worked. So maybe as stated above, it could be due to a change in JeeLib.

robw's picture

Re: emonGLCD - Not responding.

Yes it does, but not by much. I'm not sure reducing the size of the code is the problem: if it fits on the chip, it's fine, as it would not upload otherwise.

It was more a case of going through it looking for variables of type int to change to byte to reduce RAM use, because if it runs out of RAM you get some very weird results. Looking at http://jeelabs.org/2011/05/22/atmega-memory-use/ showed we're close to using it all, so I gave it a try.

In this section we're using 18 bytes of RAM, whereas my version uses 10... I THINK.!!!!

/*#define emonGLCDV1.3              // un-comment if using older V1.3 emonGLCD PCB - enables required internal pull up resistors. Not needed for V1.4 onwards
const int SolarPV_type=2;           // Select solar PV wiring type - Type 1 is when use and gen can be monitored separately. Type 2 is when gen and use can only be monitored together, see solar PV application documentation for more info
const int maxgen=1800;              // peak output of solar PV system in W - used to calculate when to change cloud icon to a sun
const int PV_gen_offset=5;          // When generation drops below this level generation will be set to zero - used to force generation level to zero at night
const int greenLED=6;               // Green tri-color LED
const int redLED=9;                 // Red tri-color LED
const int LDRpin=4;                 // analog pin of onboard lightsensor
const int switch1=15;               // Push switch digital pins (active low for V1.3, active high for V1.4)
const int switch2=16;
const int switch3=19;
*/
// Changed most from Int to Byte.  RW ADD
//#define emonGLCDV1.3              // un-comment if using older V1.3 emonGLCD PCB - enables required internal pull up resistors. Not needed for V1.4 onwards
const byte SolarPV_type=2;          // Select solar PV wiring type - Type 1 is when use and gen can be monitored separately. Type 2 is when gen and use can only be monitored together, see solar PV application documentation for more info
const int maxgen=1800;              // peak output of solar PV system in W - used to calculate when to change cloud icon to a sun
const byte PV_gen_offset=5;         // When generation drops below this level generation will be set to zero - used to force generation level to zero at night
const byte greenLED=6;              // Green tri-color LED
const byte redLED=9;                // Red tri-color LED
const byte LDRpin=4;                // analog pin of onboard lightsensor
const byte switch1=15;              // Push switch digital pins (active low for V1.3, active high for V1.4)
const byte switch2=16;
const byte switch3=19;

I'm not a programmer so that's why I'm asking. There may well be good reasons to use ints, but so far the code runs and works, for me at least, and has not hung. Running 4 days now. That is real progress for me.

I've also changed variables in other places, but again it would be best to have someone else look it over.

PaulOckenden's picture

Re: emonGLCD - Not responding.

If it was a code size issue I'd normally expect the crash timing to be repeatable, not random.

P.

robw's picture

Re: emonGLCD - Not responding.

Paul, I agree: if it was code size it would either not upload or crash at the same point every time. But with RAM you really don't know what's going on.

 

Ian Eagland's picture

Re: emonGLCD - Not responding.

Hi

I am sure it is not the firmware code size (although it may be lack of RAM) . I am using a very small sketch as I am only displaying outside temperature from one emonTX. It was always locking up. I removed transmit code from the GLCD firmware as mentioned a couple of days ago. Still running.

 

JBecker's picture

Re: emonGLCD - Not responding.

@ Rob: All these values are declared as const, so should not normally use RAM but Flash!

This might be the reason why your code size decreased.

 

PaulOckenden's picture

Re: emonGLCD - Not responding.

The transmit code is very kludgy, and it calls sendStart() when it shouldn't because of a horrible hack loop detection. This is something that Brian D pointed out to me the other day.

Perhaps if someone sorted this mess out the problems might go away?

P.

JBecker's picture

Re: emonGLCD - Not responding.

Paul,

are you talking about the cansend() function? This can (and should finally!) be solved by commenting out the RFM12 byte read. But this does not have any deeper impact on memory usage and should not freeze the unit!?!

robw's picture

Re: emonGLCD - Not responding.

@JBecker

Ahh thanks. That would explain the code size..

I've also changed some in the main code, so these must be for the RAM.

//int hour = 0, minute = 0;
byte hour = 0, minute = 0;          // RW ADD

//int node_id = (rf12_hdr & 0x1F);
byte node_id = (rf12_hdr & 0x1F);      // RW ADD

//int i; for (i=6; i>0; i--) gen_history[i] = gen_history[i-1];
byte i; for (i=6; i>0; i--) gen_history[i] = gen_history[i-1];    // RW ADD

//int LDRbacklight=map(LDR, 0, 1023, 25, 250);
byte LDRbacklight=map(LDR, 0, 1023, 25, 250);

There are others as well but maybe this is just enough to get it stable..

Robert Wall's picture

Re: emonGLCD - Not responding.

A long while ago, I read somewhere that the size of the stack isn't checked, and it grows downwards from the top of RAM towards the heap/global data. Now if the stack overwrites the data on the heap, you get junk data. If the data overwrites the stack, methinks a crash is inevitable.

I've just hit Google with the problem. This http://blog.wickeddevice.com/?p=359 isn't the original source, but it clearly explains the potential problem and it might help.

JBecker's picture

Re: emonGLCD - Not responding.

The whole thing definitely 'smells' like a stack problem. There are some suggestions on the Arduino site to check the available RAM space at runtime. This should give a clear hint.

Passing loads of parameters to functions like in draw_solar_page() eats up a lot of stack space (here more than 40 bytes). It would help to use only the variable size needed (hours and minutes fit into bytes and do not need to be float). Better still to pass a pointer to a structure with all the values.

In draw_solar_page() and the other template functions, a char array with 50 bytes (char str[50]) is reserved for the string functions. This is also much more than needed ( I think) and could also be decreased (not clear if it helps, depends a bit on how the compiler handles this). 

 

Robert Wall's picture

Re: emonGLCD - Not responding.

(char str[50]) ... is also much more than needed. Correct. It seems as if 50 has been used throughout simply because it is "safe" and there's no risk of a string overflow. 30 is enough in most cases, even less in some others. All that needs to be done is find the maximum number of characters written in each function and add one. Even better, as many of these are purely temporary and only for display formatting, it doesn't need to be a local string anyway and one common global temporary string could be used, given a little care.

Avontech's picture

Re: emonGLCD - Not responding.

Added the capacitor AND removed the temperature transmit data code, been running constantly now for just under 5 days with no lockups.

[However (and I'll start another thread on this) I can't get my OKG with ENC28J60 base to work at all now I've got two emonTxs running consistently. Stop one of them and it runs fine... Base code is the NanodeRF_multinode ]

Brian D's picture

Re: emonGLCD - Not responding.

This emonGLCD bug has proved difficult to trace because as MartinR indicated way back in the thread we really required someone to put a hardware analyser on and establish exactly what happens. The popular speculation is that the processor crashes and the proposed reason is a stack overflow. This could be true or there may simply be a routine that fails to return if called in the wrong way. My guess is that the latter is true.

To test all of this I built a test platform that was capable of demonstrating problems associated with the use of the rfm12. This led me to realise that the frequency at which errors occurred was related to the combined frequency at which my units are programmed to transmit. Let me explain.

My emonTx is set to transmit once every 3 sec and my two GLCD transmit temperature once every 10 sec. If only one GLCD is running then there are no lockup problems. If both GLCD and the emonTx are running then one or other of the GLCD will lockup fairly quickly but after that the second GLCD will continue without lockup.

Using the above timings on my test platform I have demonstrated that the two GLCD will eventually become aligned with their transmission times. They do not synchronise but they simply drift into alignment because they are running at the same repetition frequency and there is a small difference in their components. When this happens the call to rf12_canSend() will not return true until the two units have drifted apart again and this can take minutes.

This should not be a problem because at first glance the code is written to deal with this. But is it? This is the significant line:

int i = 0; while (!rf12_canSend() && i<10) {rf12_recvDone(); i++;}  // if ready to send + exit loop if it gets stuck as it seems to

Clearly the original author realised there was some sort of problem here and put in the i++ kludge but instead of providing a break the code is allowed to fall through to the next line which is:

rf12_sendStart(0, &emonglcd, sizeof emonglcd);                      // send emonglcd data

Now we are in trouble because we are violating the following JeeLabs statement:

The rf12_sendStart() function may only be called in two very specific situations:

- right after rf12_recvDone() returns true - used for sending replies / acknowledgements
- right after rf12_canSend() returns true - used to send requests out

As a test I changed the frequency of transmission for my three units to reduce the chance of repetitive alignment of transmissions. The best way to achieve this is to set each unit to a different prime number so I set the emonTx to 3, GLCD a to 5 and GLCD b to 7.

Last night using those numbers my system ran all night with no problems for the first time and with temperature sending restored.
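A very idealised sketch of why the prime intervals help, assuming perfect clocks, whole-second ticks and both units starting at t = 0. It ignores drift and the length of a transmission, so it only shows the trend, and the function name is made up:

```cpp
// Counts the whole-second ticks in [0, windowSec) at which two periodic
// transmitters, both started at t = 0, are scheduled to send together.
int countCoincidentSlots(int periodA, int periodB, int windowSec) {
    int hits = 0;
    for (int t = 0; t < windowSec; ++t) {
        if (t % periodA == 0 && t % periodB == 0) {
            ++hits;
        }
    }
    return hits;
}
```

Under those assumptions, two units both on 10 s coincide on every one of their 360 sends in an hour, while 5 s and 7 s coincide only every 35 s (103 times in the hour) and then move straight apart again instead of staying aligned.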

This is good convincing stuff but has not dealt with the rf12_recvDone(); kludge above.

I decided to stop here because if someone else can confirm that this is indeed the correct explanation of the problem then it will make sense for the kludge to be dealt with properly and an up issue of the code made.

It would make sense for the code to have a #define that will allow for multiple GLCDs and allocate node IDs and transmission intervals according to GLCD identity. This may be tricky as the base and the emonTx rates should be considered.

Sorry it's a bit long.

PaulOckenden's picture

Re: emonGLCD - Not responding.

Great work Brian.

Even with your prime numbers you are still going to get occasional clashes, although the frequency (and thus time between hangs) should be greatly reduced.

As for the kludge, I suggest:

int i = 0; while (!rf12_canSend() && i<10) {rf12_recvDone(); i++;}
rf12_sendStart(0, &emonglcd, sizeof emonglcd);
rf12_sendWait(0);

Should become:

int i = 0; 
while (!rf12_canSend() && i<10) {
      rf12_recvDone();
      i++; }

if (i<10) {
    rf12_sendStart(0, &emonglcd, sizeof emonglcd);
    rf12_sendWait(0); }

 

That way the send only happens if the 'kludge' has exited normally.

This should even (hopefully) stop the crashes even without the prime numbers, although this will of course result in data being lost because of the send clashes.

P.

JBecker's picture

Re: emonGLCD - Not responding.

Good findings, Brian!

So the major problem seems to be that two nodes try to or are really transmitting simultaneously....

One of the unsolved problems of the jeelib RFM12 code is, that rf12_canSend() does not really work as intended. It should actually look for RF 'in the air' and prevent that a transmission is started when another node is already transmitting. But this test simply does not work (this is known since quite some time but has not yet been solved or cannot be solved easily). So rf12_canSend() does not really test if another node is already sending and therefore does not prevent two nodes from sending simultaneously (which will at least destroy both messages with very high probability).

One of the big questions is still why this situation will result in freezing one of the two nodes code execution!?!

What you and Paul have done is a good workaround, but it would still be interesting to know the underlying reason for the freeze.

 

Robert Wall's picture

Re: emonGLCD - Not responding.

I've not experimented nor studied the protocol in detail, but Brian, it sounds as if you've got a good explanation there.

Putting my systems engineer's hat on, I think there's a case for retaining the existing protocol for "simple" (and battery powered) systems, but I think a completely new approach is called for when you have a "complex" system.

The problem isn't a new one. One technique that is generally used in situations like this is to delay a small random interval and then retry.  However, this depends on every unit that transmits being able to hear every other one. If this is not the case, then whatever you do, the problem will remain.
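A minimal sketch of the "delay a small random interval and then retry" technique. The Radio interface here is hypothetical; on real hardware its methods would presumably wrap rf12_recvDone(), rf12_canSend(), rf12_sendStart()/rf12_sendWait() and a delay, but nothing below is taken from JeeLib itself:

```cpp
#include <cstdlib>

// Hypothetical radio interface, used so the retry logic can be shown
// (and tried out) without the real hardware.
struct Radio {
    virtual void poll() = 0;                                // service the receiver
    virtual bool canSend() = 0;                             // is it OK to transmit now?
    virtual void send(const void *data, unsigned len) = 0;  // start and finish a send
    virtual void waitMs(unsigned ms) = 0;                   // delay
    virtual ~Radio() {}
};

// Tries up to maxTries times. The send is only ever reached right after
// canSend() has returned true; otherwise we back off for a random
// 1..maxBackoffMs milliseconds and try again.
bool sendWithBackoff(Radio &radio, const void *data, unsigned len,
                     int maxTries, unsigned maxBackoffMs) {
    unsigned span = maxBackoffMs ? maxBackoffMs : 1;   // avoid modulo by zero
    for (int attempt = 0; attempt < maxTries; ++attempt) {
        radio.poll();
        if (radio.canSend()) {
            radio.send(data, len);
            return true;
        }
        radio.waitMs(1 + std::rand() % span);
    }
    return false;   // give up this time; the caller can try on its next cycle
}
```

As noted above, this only helps if each unit can actually detect that another one is transmitting.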

The only solution that I know of introduces a master controller in the system, which every node can hear, and you must then turn the whole protocol upside down and have the controller (which might be a base or a GLCD) poll the transmitters to request data. This of course raises many more problems: battery operation of the emonTx becomes impractical because it must either remain listening at all times, or turn on and listen when it expects to be polled; and having two incompatible protocols ("simple" and "complex") and two sets of sketches will introduce all manner of confusion and complication.

PaulOckenden's picture

Re: emonGLCD - Not responding.

it would still be interesting to know the underlying reason for the freeze.

Surely calling a function in a way that JeeLabs specifically say not to do is a good pointer. I wouldn't be at all surprised if this was the cause of the lockups.

P.

JBecker's picture

Re: emonGLCD - Not responding.

Surely calling a function in a way that JeeLabs specifically say not to do is a good pointer.

Yes, absolutely agreed! Now that you write it, I can recall that this is said 'expressis verbis' in the jeelib documentation. Just forgot about that :-(.

Seems it is possible to mix up the complete receive/transmit state machine if the correct sequence is not adhered to. Would still be nice if this situation would not result in a freeze, but using functions in a wrong way is clearly a bug that can be (and has to be) avoided.

Brian D's picture

Re: emonGLCD - Not responding.

Paul

 I agree with that although we can probably take it a step further.

Flow control through the RFM12B is achieved via the Jeelabs library call.  Presumably a call to rf12_canSend() simply results in the library routine testing a software flag which is handled by an interrupt routine.

My guess is that rf12_canSend() will return very quickly, and this is probably also true for rf12_recvDone(). Consequently the improved kludge will rattle through very quickly indeed compared to the time it will take for the rfm12 to send whatever is in its buffer or (more likely) for the conflicting station to finish whatever it is doing.

What I am getting at here is that if 'i' ever becomes >0 then it will probably become > 10 because you are waiting for some slow process to finish so you may as well not use the kludge and simply have:

if (!rf12_canSend() )

 {
    rf12_sendStart(0, &emonglcd, sizeof emonglcd);
    rf12_sendWait(0);

 }

If the transmissions are not aligned then a send should occur OK on the next call. If alignment continues because the units timers are harmonically related then transmission will fail for a while but the unit will not lockup.

When my current test has completed 24 Hours I shall try the above.

 

PaulOckenden's picture

Re: emonGLCD - Not responding.

Do we know who wrote this original code (the i loop kludge, I mean?). Might be good to know if there was a particular reason for coding it this way...

 

P.

JBecker's picture

Re: emonGLCD - Not responding.

Although I did not write the emonGLCD code, I came to a very similar solution for another project.

There are three reasons, why rf12_canSend() will return with false:

- the state machine is not in TXRECV state

- the receive buffer is not empty

- there is RF in the air, means another node is transmitting

The first two cases depend on the internal state machine and will be true after calling rf12_recvDone() (similar to what Brian described, it will sort of 'fall through'). But the RF in the air depends on something that is not influenced by these calls. It might take some milliseconds until the other node stops sending. And the 'i-loop' is a timeout for waiting until this happens.

The problem is that, due to the bug in rf12_canSend(), the RF check will always show 'no RF', so this will never block rf12_canSend().

EDIT: correction: rf12_recvDone() will not fall through, if a reception of a message from another node is just in progress. So this is another reason for an 'i-loop' timeout!

EDIT2: ok Brian and Paul, now I finally understand what you mean. The 'i-loop' as in the emonGLCD code will not work as planned (?), because it will timeout within a few hundred processor cycles and this will in most cases not be enough time for any ongoing transmission from other nodes to complete. This is the difference to my code, where I use a 1ms wait  inside the loop (waiting for a 1ms flag supplied by an interrupt routine)!

But I still do not understand Brian's proposal:

if (!rf12_canSend() )

 {
    rf12_sendStart(0, &emonglcd, sizeof emonglcd);
    rf12_sendWait(0);

 }

wouldn't it make more sense to do:

rf12_recvDone();

if (rf12_canSend() )

 {
    rf12_sendStart(0, &emonglcd, sizeof emonglcd);
    rf12_sendWait(0);

 }

 

 

Brian D's picture

Re: emonGLCD - Not responding.

JBecker >

wouldn't it make more sense to do:

rf12_recvDone();

if (rf12_canSend() )

 {
    rf12_sendStart(0, &emonglcd, sizeof emonglcd);
    rf12_sendWait(0);

 }

Yes, well done. Silly me.

 

 

robw's picture

Re: emonGLCD - Not responding.

Good work guys.. Not that I'm one to complain.

But why has this suddenly happened to the code on the GLCD?
Surely if this was a problem with the RFM12B code we would have seen this with the old code also?

I've not changed anything to do with the RFM12B code. I have, however, changed the main code, and I've managed to get it stable: now running 5 days, when it never made it past 36 hrs before (since the icons).

Rob,

(PS: not saying it is or it isn't, just asking questions and trying to understand myself.)

Brian D's picture

Re: emonGLCD - Not responding.

robw said:

But why has this suddenly happened to the code on the GLCD? 

Good question Rob. My test code gives a good demonstration of the extreme marginality of this problem. If anyone wishes to replicate the tests that I ran I will put the sketch somewhere accessible.

What surprised me was the fact that I could alter the test script in seemingly insignificant ways and yet change the behaviour from failing regularly to failing extremely rarely. In the case of the real software the kludge required 10 consecutive errors for a possible lockup and a small software change could prevent that.

This is a difficult problem to solve and there is always the danger that a fix may merely hide the real problem so it's right to question the validity of a proposed solution.

fluppie007's picture

Re: emonGLCD - Not responding.

Something else, does anyone have the old non-locking sketch? Then we all could load that one, run it for a time and then start to add code to "replicate" the current sketch. Or am I talking crap?

glyn.hudson's picture

Re: emonGLCD - Not responding.

Hi Guys, 

I've just dipped back into this thread. It looks like some good discoveries have been made. The insight into how the RF12 lib function calls work is very interesting. Trystan added in the i++ loop a while back after we discovered it was getting stuck. 

It seems I was wrong about the icons causing this problem, that may have been a contributor but a few of you have realized even taking the icons out does not solve the issue. I had no luck with my cut down example, it did crash after a few days. 

I've gone through the thread as best I can and have compiled most suggestions to optimise the code into an example. It's up on github, called SolarPV_LowMem_Dev: https://github.com/openenergymonitor/EmonGLCD/tree/master/SolarPV_lowMem_Dev

I've implemented the following 

  • Removed all pages apart from the main solar PV page
  • Use byte instead of int where values are under 255
  • Removed duplicated code in the solarPV type section and moved the gen calculations outside it, as suggested by robw
  • Converted the kWh calculation to a single multiplication instead of a multiplication and a division, as suggested by JBecker
  • Changed RF12 Tx code as suggested by JBecker

For me it compiles to 28.8KB on Arduino 1.02, Ubuntu 12.10. If you can see any other code optimisations or bug fixes we can make, please send me a GitHub pull request and I'll merge the changes.

The code is untested so far. When I get home tomorrow I'll set it running on an emonGLCD. If anyone has any other suggestions, I have several spare emonGLCDs that I can set up running different code to evaluate stability. Once we have a stable setup it would be a good idea to implement a watchdog timer as a backup. I agree that a watchdog should not be used to fix a solvable problem with the code; besides, it would not be great to have a random reboot, as the kWh/d totals would be lost.

Thanks again for helping to solve this. Stability is very important and should come before running ahead to add new functions. Once we have a stable example I shall make a stable 'release' branch on github and make another 'dev' branch where new features (yet to be tested for stability) can be implemented. 

JBecker's picture

Re: emonGLCD - Not responding.

Glyn,

there are another two or three (or more) things that could be done:

- after receiving a DateTime message from node 15, emontx still points to the receive buffer and will give completely wrong values if evaluated (every 200ms) before another message from node 10 arrives

- LCDbacklight does not have to be evaluated between hours 22 and 5

- genkwh += (emontx.power2 * 0.2) / 3600000;       still has this unneeded divide/multiply

The first one is really nasty (and much harder to correct :-) )
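
A minimal sketch of the third point, assuming the 200 ms loop and a power reading in watts as in the line quoted above. The names KWH_PER_WATT_TICK and accumulateKwh are made up for illustration and are not in the emonGLCD sketch. The idea is simply that the compiler folds the constant once, so the loop is left with a single multiplication:

```cpp
// Hypothetical helper, not taken from the emonGLCD sketch.
// 0.2 s per loop / 3600 s per hour / 1000 W per kW
//   = kWh added per watt for each 200 ms pass.
const double KWH_PER_WATT_TICK = 0.2 / 3600000.0;

// Adds one 200 ms slice of energy (in kWh) for a power reading in watts.
double accumulateKwh(double totalKwh, int powerWatts) {
    return totalKwh + powerWatts * KWH_PER_WATT_TICK;
}
```

Called as genkwh = accumulateKwh(genkwh, emontx.power2); it should give the same totals as the original expression.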

BR, Jörg.

robw's picture

Re: emonGLCD - Not responding.

Glyn

Nice work. There are a few more doubles or ints you could change as well. See below or the git pull request.

Main file....

Line 103

double temp;             // Needs full resolution
byte maxtemp;            // Doesn't need full resolution; we only show whole numbers
int mintemp;             // May be negative

Line 119

//Serial.begin(9600);        // We're not using it, so why bother...

Line 199. We can do it all in one, so we don't need an int variable for the LDR.

//int LDR = analogRead(LDRpin);                     // Read the LDR value so we can work out the light level in the room.
byte LDRbacklight = map(analogRead(LDRpin), 0, 1023, 50, 250);    // Map the LDR reading from 0-1023 (max seen 1000) to the GLCD brightness min/max

Templates

Line 41

-void draw_temperature_time_footer(double temp, double mintemp, double maxtemp, double hour, double minute)
+void draw_temperature_time_footer(double temp, byte mintemp, byte maxtemp, byte hour, byte minute)

Line 78

-void draw_solar_page(double use, double usekwh, double gen, double maxgen, double genkwh, double temp, double mintemp, double maxtemp, double hour, double minute, unsigned long last_emontx, unsigned long last_emonbase) 
+void draw_solar_page(double use, double usekwh, double gen, double maxgen, double genkwh, double temp, byte mintemp, byte maxtemp, byte hour, byte minute, unsigned long last_emontx, unsigned long last_emonbase)
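
A quick way to sanity-check the byte changes above on a desktop compiler. This is only a sketch: mapRange reproduces the integer arithmetic of Arduino's map() and asByte is a made-up helper, neither is part of the emonGLCD code. It suggests the 50 to 250 backlight range fits a byte, but also that a negative mintemp would wrap if it is passed through a byte parameter, which may be worth double-checking in the template signatures:

```cpp
#include <stdint.h>

// Same integer arithmetic as Arduino's map(), reproduced here so the check
// can run on a desktop compiler.
long mapRange(long x, long inMin, long inMax, long outMin, long outMax) {
    return (x - inMin) * (outMax - outMin) / (inMax - inMin) + outMin;
}

// What a value turns into when it is passed through a byte (uint8_t) parameter.
uint8_t asByte(int value) {
    return static_cast<uint8_t>(value);
}
```

For example, asByte(-3) comes back as 253, so a frosty minimum temperature would show up as a very hot one.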
 

Rob

MartinR's picture

Re: emonGLCD - Not responding.

It's good that the code is being tidied up but there is a risk here that the problem is going to get buried again and forgotten about only to come back to bite in the future.

It would be better at this stage to stick with the code that fails most often and track down the real cause of the locking up.

I have one more snippet of information that may help in this regard. When locked up, the interrupt signal from the RF12B is permanently high. This means that the RF12B cannot be in its normal idle state, because if it were then the interrupt would go low the next time the emonTx transmitted.

PaulOckenden's picture

Re: emonGLCD - Not responding.

It would be better at this stage to stick with the code that fails most often and track down the real cause of the locking up.

That's exactly what Brian D did - see above.

P.

JBecker's picture

Re: emonGLCD - Not responding.

It would be better at this stage to stick with the code that fails most often and track down the real cause of the locking up.

For those guys who want to locate and fix the bug, yes. But there might also be a number of people around who just want a working solution in the meantime. And testing of the new code can be done in parallel by this (majority?) of users. If we have even one freeze with the modified code, this would be a very strong indication that there is still something wrong.
 

Brian D's picture

Re: emonGLCD - Not responding.

Martin R said:

It's good that the code is being tidied up but there is a risk here that the problem is going to get buried again and forgotten about only to come back to bite in the future.

I think it already has!

Let me clarify my findings:

1. The fundamental reason for the lockup is a violation of the Jeelabs rule:

The rf12_sendStart() function may only be called in two very specific situations:

·         right after rf12_recvDone() returns true - used for sending replies / acknowledgements

·         right after rf12_canSend() returns true - used to send requests out

2. The reason the standard code violates this rule is a kludge in this line of the standard code:

int i = 0; while (!rf12_canSend() && i<10) {rf12_recvDone(); i++;}  // if ready to send + exit loop if it gets stuck as it seems to

3. The condition that initiates the failure sequence is when transmissions  from multiple units become aligned.

The fix is not one silver bullet and, as Robert Wall explained above, this is a complex problem. However, my personally acceptable solution is:

(a)    Avoid transmission alignment by setting transmission intervals on different units to different prime numbers

(b)   Fix the kludge with this:

  rf12_recvDone();
  if (rf12_canSend())
  {
    rf12_sendStart(0, &emonglcd, sizeof emonglcd);
    rf12_sendWait(0);
  }

Implementing the above works for me (one emonTx, two emonGLCD). If it does not work in all cases then I have got something wrong, so if anyone can show where it fails I will be interested to hear about it. Alternatively, if this is considered an acceptable solution then hearing a few more reports confirming that it works would be a good idea before further changes are made.

glyn.hudson's picture

Re: emonGLCD - Not responding.

@robw thanks for the pull request, it's been merged. 

@MartinR - I've taken the 'back to basics' approach. This should give us a stable example as soon as possible, which I think is best for the project. General users can then use this stable example. Advanced users can then set about building back in all the features that were removed, like the multi-page views and the history page, and experimenting on a 'dev' GitHub branch, which will be labelled as untested for stability.

@Brian D - Thanks a lot for your insight, your 'fix' has been incorporated into the latest build. See below. 

As of 8pm today (8/1/13) I have set my home emonGLCD running the LowMem Dev optimised example. I've created a GitHub 'tag' on the emonGLCD repo to mark this point. You can be sure that you're running the same code as me by downloading tag V01 from https://github.com/openenergymonitor/EmonGLCD/tags even if the example is updated in the meantime. The latest code can be viewed here: https://github.com/openenergymonitor/EmonGLCD/tree/master/SolarPV_lowMem_Dev. I'm still learning how to use GitHub and am continually amazed by how powerful it is. This is the first time I've used the tag feature and I'm impressed. I think we'll try to make more use of it in the future to mark strategic points in software development.

Now all that's left is to sit back. It will hopefully be several months before we know whether we've been successful. It will probably be a case of "A watched emonGLCD never crashes"!

john-h's picture

Re: emonGLCD - Not responding.

I've been following this thread with interest ever since I built my GLCD (mid October) and found it was hanging every few days. I really want to log the daily usage and generation figures, so obviously am interested in a solution -- and the discovery of the problem with the JeeLib calls seems very relevant. I will shortly make the suggested change.

But I've also noticed something else that I believe will cause a lock-up in the GLCD and the Nanode. The value returned by millis() is an unsigned long, and according to the documentation rolls over about every 50 days. This means that code such as

if ((millis()-fast_update)>200)
  {
    fast_update = millis();

will cease working if the roll-over does not occur between these two lines of code.

There is also a possible loop timing error: if the if is not executed until more than 200ms have elapsed, the additional milliseconds will be ignored. This will make the accumulation of energy slightly inaccurate as that assumes 200ms loops. Replacing the

    fast_update = millis();

with

    fast_update += 200;

should fix that, but is still problematical at the roll-over. (fast_update will now roll over in the same way as millis(), but the roll-overs will happen at slightly different times so the comparison will go wrong). I cannot yet think of a clean way of overcoming this.
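
For what it's worth, unsigned subtraction in C wraps modulo 2^32, so an elapsed-time test written as a difference should still give the right answer across the roll-over, provided the interval is far shorter than the roll-over period. Below is a minimal sketch for checking that on a desktop compiler. intervalElapsed is a made-up helper, and uint32_t stands in for the 32-bit unsigned long that millis() returns on the ATmega328:

```cpp
#include <stdint.h>

// Hypothetical stand-in for the sketch's timer check.
// Returns true when at least intervalMs have passed since *last, and advances
// *last by exactly one interval so a late loop does not lose milliseconds.
bool intervalElapsed(uint32_t now, uint32_t *last, uint32_t intervalMs) {
    if ((uint32_t)(now - *last) >= intervalMs) {   // modular subtraction
        *last += intervalMs;                        // wraps the same way as now
        return true;
    }
    return false;
}
```

Starting with *last 100 ms before the counter wraps and feeding in values of now from just after the wrap is an easy way to see whether the comparison misbehaves.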

Petrik's picture

Re: emonGLCD - Not responding.

 

If the lockup is caused by receive errors, I would like to ask how much background radio traffic you have. Using the JeeNode RF12 test sketch, I was surprised to notice that some of the houses nearby seem to have systems transmitting a lot of data at 868 MHz.

 

glyn.hudson's picture

Re: emonGLCD - Not responding.

@Petrik good point, it would be good to get someone else running https://github.com/openenergymonitor/EmonGLCD/tree/master/SolarPV_lowMem_Dev tag V01 in a place with lots of RF noise on the ISM bands. My house is probably not very noisy, the nearest neighbor is about three miles away! 

All is well so far, my emonGLCD running the LowMem_Dev example has been running nicely since 10pm yesterday :-)

It's too early to draw any conclusions though. 

Ian Eagland's picture

Re: emonGLCD - Not responding.

Hi

One thing puzzles me in all this discussion of the problem being in the rf12 transmit (Although I agree, since removing the transmit code which I do not need in my particular setup, I have not had a single lock up).

I have 3 emonTx that are battery powered and all transmit, plus a mains-powered emonTx that also transmits. As far as I know they do not lock up. The transmit code all looks very similar to me, but I am no expert. So can someone explain the difference between an emonTx transmitting and an emonGLCD transmitting? Or perhaps explain where I am being foolish!

One thing I just realised is that the emonTx never receive, only transmit. Also I confirm I have no watch dog on the battery powered emonTx.

 

Pcunha's picture

Re: emonGLCD - Not responding.

I think I can test it. I'm in a very, very high noise environment at 433 MHz. I've temporarily solved this issue by implementing a watchdog on the emonGLCD (ugly solution). But that resets my data every time it restarts (roughly every 6 hours to 1 day).

I'll try implementing these changes in the code and see if it continues to reset. Since the reset rate is very high here, I'll probably report back soon.

glyn.hudson's picture

Re: emonGLCD - Not responding.

My test emonGLCD running the v01 lowMem Dev code (see above) has crashed after less than a day. Last night I changed the solar PV type to type 2 (to match my setup); it had been running for 24hrs without a crash as type 1. By the morning after changing to type 2 it had frozen. I'm not sure if this was just a coincidence.

The v01 code was highly optimised and incorporated most of the suggestions on this thread. Has anyone else had any luck with v01, or any further insight? I'm busy for the next few days; I'll pick this up next week.

Avontech's picture

Re: emonGLCD - Not responding.

Type 1 solar PV, I tried the low mem dev code and it kept either continuously power cycling or froze... back to my own code with the transmit rem'd out and it appears stable even with LEDs though no history.

Brian D's picture

Re: emonGLCD - Not responding.

Has this thread gone quiet because the suggested mod's have fixed (or hidden) the problem for everyone? My two GLCD have been running without lockup now for about 10 days. The only 'fix' changes are the two that I identified above.

Has anyone else a result to share?

Avontech has added weight to the theory that the lockup relates to the transmit code so maybe he/she could try the two mod's that I have mentioned.

robw's picture

Re: emonGLCD - Not responding.

Hi Brian

I've been working on trying to figure out why it's still happening on the old sketch.

I thought I had it sorted as it had been working fine for 9 days. Then I've had 2 lockups in the last 24hrs without changing a thing. CAN YOU BELIEVE IT.. So back to the drawing board.

Rob

glyn.hudson's picture

Re: emonGLCD - Not responding.

Hi Brian,

I implemented your changes in the lowMem Dev example: https://github.com/openenergymonitor/EmonGLCD/blob/master/SolarPV_lowMem_Dev/SolarPV_lowMem_Dev.ino but I have still experienced lockups, often several in 24hrs, which is worse than before! Do you think you could post exactly the code you're currently running so we can see how you've implemented the changes?

Do you think you could have a look over the LowMem Dev example and try it out for yourself?

Cheers, 

Brian D's picture

Re: emonGLCD - Not responding.

I have attached the script that I am running plus the associated template and icon files which are special.

Looking at your LowMem example you have the transmit delay set to 10s:

if ((millis()-slow_update)>10000)

This is not a prime number. The second change that I am using is to set all system transmission intervals to different prime numbers. This is more of a test than a fix. My experience suggests that lockups only occur when transmissions become aligned and that cannot happen if all units use different primes for their transmission intervals.
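
To put rough numbers on the prime idea, here is a minimal sketch. It assumes idealised clocks with no drift and units that start together, and the interval values in the usage note are only examples. It works out how often two free-running transmit timers fire at the same instant, which is the least common multiple of the two intervals:

```cpp
#include <stdint.h>

// Greatest common divisor (Euclid).
uint32_t gcd(uint32_t a, uint32_t b) {
    while (b != 0) {
        uint32_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Seconds between moments when two transmit timers that started together
// fire at the same instant again (least common multiple of the intervals).
uint32_t secondsBetweenAlignments(uint32_t intervalA, uint32_t intervalB) {
    return intervalA / gcd(intervalA, intervalB) * intervalB;
}
```

Under those assumptions, two units both on 10 s would coincide on every transmission once lined up, whereas 17 s and 13 s would coincide only once every 221 s, so a run of consecutive clashes becomes much less likely rather than strictly impossible.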

Even if the use of primes stops the lockups there is still a problem but at least we have another clue to the cause.

There is not much point in me trying the lowmem sketch as anything I try without the primes and the kludge removed produces lockups.

There is another fix that seemed to work for me and that was to remove all the rfm12 code and use Martin Roberts code which does not use the library. This proved to be completely stable which was interesting but it's a bit radical and hopefully will not be necessary.

 

MartinR's picture

Re: emonGLCD - Not responding.

I haven't had any lockups with my code either Brian but until the real cause of these lockups is found I can't really be sure that the same issue isn't present, but just much less likely. That's why I keep suggesting that the right way to debug this is to stick with the code that is most likely to fail and make a concerted effort to find the problem. I lost interest in this thread once it was clear that the consensus was different.

I have a similar function to your prime number solution (which is very neat BTW) in my own system. I use a series of timeslots, where the emonTx is the master, which all the other nodes listen to, and then they each transmit in their allocated timeslot 100ms apart. In addition to the emonTx I have 2 emonGLCDs and 2 temperature nodes on my hot water cylinders. This does rely on all nodes being able to receive the emonTx transmission though, which your idea doesn't.
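
As a rough illustration of the timeslot arrangement (only a sketch: slotDelayMs and the slot numbering are made up here, and the real code may well differ), each node would wait its slot number times 100 ms after hearing the master before transmitting:

```cpp
#include <stdint.h>

const uint16_t SLOT_MS = 100;  // slot spacing mentioned in the post

// Hypothetical helper: delay, in ms after hearing the master's packet,
// before the node holding slot number `slot` (1, 2, 3, ...) transmits.
uint16_t slotDelayMs(uint8_t slot) {
    return static_cast<uint16_t>(slot * SLOT_MS);
}
```

With four listening nodes in slots 1 to 4, the last one would transmit 400 ms after the emonTx packet.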

Robert Wall's picture

Re: emonGLCD - Not responding.

I agree with you there, Martin. With any bug like this, the objective is to get it to fail on command with the absolute minimal sketch, and then it's all but problem solved.

Brian D's picture

Re: emonGLCD - Not responding.

Martin said:

> That's why I keep suggesting that the right way to debug this is to stick with the code that is most likely to fail and make a concerted effort to find the problem.

I entirely agree - my problem is that having reached my own conclusion regarding the area where the problem is I don't really know what the next step should be.

What I am sure of is that after commenting out the rf12 send commands the lockups stopped. Also, after adding the two mods that I have identified above the lockups stopped.

The difficulty is the library code is used in lots of applications and yet others have not reported the same problem. I know that you referred to a potential library problem way back in this thread so that should be worth looking at.

It's worth mentioning that I don't have a base station (yet), so like you this GLCD transmit stuff is redundant as far as I am concerned, but when I realised that the original code was violating the JeeLabs library rule I thought a fix would be easy, so I tried it and it seemed to work.

If we look at Glyn's comments he has implemented the kludge removal change but not the prime number change and he continues to get problems. Unfortunately he has a bucket load of other changes as well which does not help but my guess is that the use of prime numbers as transmit intervals prevents the error condition and the kludge removal improves or changes the susceptibility but neither are true fixes.

I am happy to try some more experiments because I now have a stable sketch that I can return to so if you have any suggestions on what to try next I am all ears.

Incidentally, the neat prime number idea was Paul Ockenden's

MartinR's picture

Re: emonGLCD - Not responding.

Thanks Robert & Brian - it's good to know I'm not a lone voice!

I do have a base station Brian, in fact I have 2, a Raspberry Pi running emonCMS and an OKG running as a very simple server but I didn't mention them above because neither of them transmit.

I did make an attempt at cracking this when I built my second emonGLCD. I set it running the standard SolarPV code and listening to an emonTx separate from my normal system (they don't interfere because my normal system runs a non-standard baud). My first problem was that it took 3 days before it locked up but when it eventually did I ascertained that there was no activity on any ATMEGA pin other than the backlight (which is toggled by hardware). I also noticed that the RFM12 interrupt was permanently high so I concluded that it wasn't in receive mode.

My next step was to try to determine where the code was when it locked up and my first suspicion was the rf12_sendWait function in RF12.cpp because it has a while loop - while (rxstate != TXIDLE)...

My strategy was to toggle an unused bit while in this loop so I'd know if it was stuck there. This time it took 4 days before it stopped but when it did the test output wasn't toggling so I'm pretty sure it doesn't get stuck there. This is the point at which I lost interest.

The ATMEGA328 does support DebugWire and I do have an AVR Dragon debugger, although I've only played with it once to see if it worked. It looks like a great bit of kit and good value but I'm not sure I want to lock my PC up for 4+ days waiting for GLCD to crash!

edited to say: sorry Paul O - still a neat idea though!

JBecker's picture

Re: emonGLCD - Not responding.

The difficulty is the library code is used in lots of applications and yet others have not reported the same problem. 

I am not using the emonGLCD at all (sorry, but I do not have an application for it), but I had a very similar problem in one of my projects. This uses an MSP430, RFM12B and (more or less, see below) the same Jeelib library. I thought that it might help me and others (with emonGLCD) if I dig deeper into the subject. Last Sunday, I put a lot of time into this (plus logic analyzer and oscilloscope). I prepared two identical nodes with different node addresses and set them up to send a message every 200ms. One of them always stopped after maximum 1 minute runtime. To make it short, there was an additional while loop in rf12_control(), which made the units freeze. This while loop is not there in the original Jeelibs library (and I don't know who added it to the MSP430 version of the library).

After removing this while loop I had the two nodes running for >20 hours without any freeze (still at 200ms send interval). What do I want to tell you with this little story? I just want to say that the Jeelib code itself seems to be ok (yes, different hardware, different timing, different code, I know).

Wouldn't it be an idea to set two emonGLCD with a 'freezing' version of the code to a much faster send interval and try to catch the freeze with a logic analyzer? This has at least helped me a lot to fix it.

 

   

Pcunha's picture

Re: emonGLCD - Not responding.

I just fixed my code with:

rf12_recvDone();
if (rf12_canSend())
{
    rf12_sendStart(0, &emonglcd, sizeof emonglcd);
    rf12_sendWait(0);
}

and it's now been working without a crash for almost one week. I think at least this issue is solved for me.

TheBluProject's picture

Re: emonGLCD - Not responding.

Hi @ll.

I've been experiencing random  lock-ups as well (I've built my GLCD @ the beginning of Dec).

Around 2.5 weeks ago I decided to add ACK support to my system (1xEmonTX + Martin's Funky with DHT22, and RasPi base station), and it has been working for 13 days straight without any problems at all ..

Here's the modified data sending proc, if someone is interested.

https://github.com/TheBluProject/EmonGLCD/blob/Devel/HEMFunky/senddata_a...

This is  not a fully finished and polished code yet, but hey ... my EmonGLCD has never been working for almost 2 weeks without a lockup before ;)

mharizanov's picture

Re: emonGLCD - Not responding.

These screens look cool, I will try your code :)

glyn.hudson's picture

Re: emonGLCD - Not responding.

This afternoon at about 1pm I set my emonGLCD up running the lowMem_Dev code (https://github.com/openenergymonitor/EmonGLCD/tree/master/SolarPV_lowMem...) with Brian D's ingenious prime number transmission delay. I went for a delay between temperature and light level transmissions of 17s. 

It's now 11:00pm and the emonGLCD is still running...time will tell. 

Thanks Brian for this simple but rather clever idea. 

Brian D's picture

Re: emonGLCD - Not responding.

Two weeks after implementing my GLCD fix one unit has locked up. This suggests that as predicted the mod dramatically reduces the occasions when the failure condition arises but  it's still a workaround not a fix. It's possible that the various devices on 433MHz around here have something to do with the problem but again the fault lies with the GLCD in its susceptibility to whatever causes the lock-up.

Martin identified a lack of interrupts from the RFM12 when the lock-up condition occurs. I can confirm that was also true on my locked-up unit today.

The summary of my experiments is that I know how to make the problem worse and I know how to improve it. What I don't know is how to completely fix it.

My next test will be to remove the GLCD temperature transmission from both units as I don't use it and it does not represent room temperature anyway. The last time I tried this test it appeared to be successful although now I think it will take many weeks running without error to be totally convincing.
 
 

MartinR's picture

Re: emonGLCD - Not responding.

Still useful information Brian. As you say, it's consistent with the theory that the problem is connected with transmission collisions and you have simply reduced the likelihood of that happening.

One thing I noticed while looking through the ATmega328 data sheet is that you can set the watchdog timer to call an interrupt routine instead of resetting the CPU. Using this it should be possible to write a watchdog interrupt handler which displays various system variables on the LCD and then halts. This may give some clues as to where & why lockup occurs.

If you are really ambitious you could also trace back through the stack to see what the CPU was doing when the lockup occurred.

glyn.hudson's picture

Re: emonGLCD - Not responding.

I'm afraid to report that even with Brian's prime number idea implemented my emonGLCD running the low_mem dev code still crashes in a few hours of use :-(

Like Brian, I think the next thing to do is to remove the transmitting code. I will set a test running tonight.

Series530's picture

Re: emonGLCD - Not responding.

If it's of any use, I can vouch for the fact that my fusion sketch seems to be rock solid with my GLCD. When I was developing the code initially I had issues with weird characters being displayed and an occasional freeze. I opted for a different USB power supply and this addressed the freezing. Cutting down the code with compile directives and removing redundancy addressed the weird character issue. My code has been running reliably for many weeks now.

It will expect a certain payload content, so it really needs to be run with the emonTx sketch to be properly tried out. If nothing else, take a look and see how it is put together.

Brian D's picture

Re: emonGLCD - Not responding.

Series 530:

If it's of any use, I can vouch for the fact that my fusion sketch seems to be rock solid with my GLCD.

But do you have two GLCD units?

My experience is that the lockup only happens when I have two GLCD units running. There is no base station and one emonTx. I think the problem arises when the two GLCD units align transmission.

robw's picture

Re: emonGLCD - Not responding.

Brian

I've had lockups with one GLCD, a Pi as a base station and the TX. So I'm not sure the two GLCDs are a cause, or maybe this accelerates the lockups?

I can rule out power supply as I'm using a lipo battery and still get lockups.

Rob

JBecker's picture

Re: emonGLCD - Not responding.

Is everybody absolutely sure that this is not a power supply problem?

There is still this nasty setting in the original jeelib RFM12 initialization, which sets the battery threshold comparator to 3.1V and allows battery-low interrupts. A small dip in supply voltage (LEDs on, backlight on, transmission running) could be enough to trip the threshold and lock up the transmit state machine.

This problem should be known to everyone using the original jeelib code and a battery supply!? I don't know why this setting has never been 'corrected' (AFAIK). But maybe this leads in the wrong direction.

It would perhaps still be worthwhile trying to modify the jeelib code (disable battery-low interrupts or set the threshold to 2.2V):

// RF12 command codes
#ifdef USE_BATT_THRESH
#define RF_RECEIVER_ON         0x82DD
#define RF_XMITTER_ON          0x823D
#define RF_IDLE_MODE           0x820D
#define RF_SLEEP_MODE          0x8205
#define RF_WAKEUP_MODE         0x8207
#else
#define RF_RECEIVER_ON         0x82D9
#define RF_XMITTER_ON          0x8239
#define RF_IDLE_MODE           0x8209
#define RF_SLEEP_MODE          0x8201
#define RF_WAKEUP_MODE         0x8203
#endif

and/or change:

    rf12_xfer(0xC049); // 1.66MHz,3.1V

to:

    rf12_xfer(0xC040); // 1.66MHz,2.2V

(in rf12.cpp)
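
For anyone wanting to see what those hex values change, here is a small decoding sketch. Both helper names are made up. The threshold formula is an assumption inferred from the "3.1V" and "2.2V" comments above (low four bits, 0.1 V per step above a 2.2 V base), and treating bit 2 of the 0x82xx words as the detector enable should be confirmed against the RFM12B datasheet:

```cpp
#include <stdint.h>

// Hypothetical decoder for the 0xC0xx command word quoted above.
// Assumption: the low four bits select the threshold in 0.1 V steps above a
// 2.2 V base, which is what the "3.1V" and "2.2V" comments imply.
uint16_t battThresholdMillivolts(uint16_t command) {
    return static_cast<uint16_t>(2200 + 100 * (command & 0x0F));
}

// True when a 0x82xx power-management word has bit 2 set. That bit is the
// only difference between the two sets of #defines above.
bool battDetectorEnabled(uint16_t command) {
    return (command & 0x0004) != 0;
}
```

On that reading, 0xC049 decodes to 3100 mV and 0xC040 to 2200 mV, and the USE_BATT_THRESH set of defines is the one with the detector bit switched on.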
 

robw's picture

Re: emonGLCD - Not responding.

JBecker

You may be correct..

The original code I did to dim the LEDs based on power import / export did cause lockups, but at the time I put this down to me not knowing what I was doing, or thought it was something to do with reading the LDR.

So I'm not one to rule it out, just saying I can rule out a dodgy power supply that's not able to keep up. I get lockups even with a full battery, from 4.2V all the way down to 3.2V.

 

 

Brian D's picture

Re: emonGLCD - Not responding.

I like Martin’s idea of using the watchdog as a trap instead of simply resetting the CPU.

Rather than getting into heavy stuff like analysing the stack content I have simply used a variable to identify each function call so that the content of the variable can be displayed in the watchdog trap.

Example:

function_number = 1;
this_function();
function_number = 2;
that_function();

display function_number

Testing  is easy enough with one or two of these:          while(digitalRead(switch1));

If I press and hold switch1 for more than a few seconds I can test that the trap is working. The LEDs are used to confirm that the ISR(WDT_vect) is invoked correctly. There is a while(1); at the end of the trap.
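
The checkpoint idea can be tried out on a desktop compiler before it goes anywhere near the GLCD. This is only a sketch: simulatedHangIn and the early returns are test aids invented here to stand in for a call that never comes back, and on the AVR the variable would be read inside ISR(WDT_vect) instead:

```cpp
#include <stdint.h>

// Last checkpoint reached; volatile because the real trap reads it from an ISR.
volatile uint8_t function_number = 0;

// Test aid (hypothetical): which call "never returns", 0 meaning none.
uint8_t simulatedHangIn = 0;

// Stand-ins for the real work; they return false to simulate a lockup.
bool this_function() { return simulatedHangIn != 1; }
bool that_function() { return simulatedHangIn != 2; }

// One pass of the main loop with a checkpoint set before each call.
void loopOnce() {
    function_number = 1;
    if (!this_function()) return;   // stands in for hanging inside the call
    function_number = 2;
    if (!that_function()) return;
    function_number = 0;            // the whole pass completed
}
```

After a simulated hang, function_number holds the number set just before the call that got stuck, which is the value the trap would display.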

Now all I have to do is output the content of function_number somehow and I can see which is the last function to be called. Serial.print can’t be used in an interrupt service routine so I have tried using the LCD display but without success. I used draw_solar_page but although it returns OK nothing changes on the screen.

Is there an obvious way to drive the LCD from inside an interrupt service routine?

 

MartinR's picture

Re: emonGLCD - Not responding.

good work Brian.

Did you call glcd.refresh() ?

draw_solar_page() just writes to the frame buffer

Brian D's picture

Re: emonGLCD - Not responding.

Thanks Martin - that works.

Test now running.

fluppie007's picture

Re: emonGLCD - Not responding.

Brian, if your test is working, would you mind sharing your sketch? Then more people can load it to their emonGLCD's :-).

Brian D's picture

Re: emonGLCD - Not responding.

A zip file is attached but be aware that my sketch expects a special emonTx. All three files are special.

I expected a lock up within a few hours but have not yet seen one after nearly 3 days!

 

o_cee's picture

Re: emonGLCD - Not responding.

If this is a problem with RAM, wouldn't it be an alternative to start looking at driving the display from the base instead? Has anyone tried that? 

It seems to me that you would be able to be a lot more flexible in configuring the display, as well as using all available data in emoncms. It looks like glcdlib got support for this kind of setup as well (glcd_proxy.cpp).

JBecker's picture

Re: emonGLCD - Not responding.

I think I have seen the same problem (node freezing during transmission) now also with an emontx. I am using my own sketch with timer 1 interrupts there and the original jeelibs library for RFM12B. The emontx transmits power values exactly every 10 seconds. A second node with an MSP430 and an RFM12B driver very similar to jeelibs sends temperature values every ~27 seconds (using the voltage and temperature dependent MSP430 internal low frequency oscillator).

All these power and temperature data is recorded using emoncms on a raspberry pi.

The last power values from the emontx recorded in the database were at Unix time 1359883034. The next temperature data from the MSP430 node came in at 1359883044. This is exactly the time when the next transmission from the emontx was expected (ok, only with 1 second accuracy). The emontx stopped during transmission (LED was burning and this LED is switched on directly before starting transmission and switched off after transmission has completed).

Does this mean that the freezing has nothing to do with the particular emonGLCD code but is a generic 'feature' of the jeelibs driver? Has anybody else seen this happen on other nodes than the emonGLCD?

 

Brian D's picture

Re: emonGLCD - Not responding.

The argument against a jeelib library error has been that this problem will have occurred elsewhere and be reported. This may or may not be true.

My tests with two emonGLCD and one emonTx have shown that the problem is extremely variable in its behaviour and tests to trap an occurrence to provide new evidence have so far failed to deliver. The emonTx code that I am running was written by Martin and does not use the jeelib library. Interestingly this code does not appear to suffer from the lockup problem. Also, when I modified the emonGLCD to use Martin's code that also failed to lockup.

If you dig way back into this thread you will see that Martin suggested the following:

*******************************************************************************************

I have a theory that emonGLCD locks up due to problems with the JeeLib library.

Transmission systems, particularly radio ones with multiple transmitters, are very tricky to code for because you can never guarantee what will be received. This is made worse in this case because of the high bit rate used (another bugbear of mine!) and the way there is nothing to prevent 2 nodes transmitting at the same time.

The code needs to be written defensively to take account of  all possible transmission errors, I'm not sure the JeeLib code is.

One bit that concerns me is in the interrupt routine...

if (rxstate == TXRECV) {
        uint8_t in = rf12_xferSlow(RF_RX_FIFO_READ);

        if (rxfill == 0 && group != 0)
            rf12_buf[rxfill++] = group;
           
        rf12_buf[rxfill++] = in;
        rf12_crc = _crc16_update(rf12_crc, in);

        if (rxfill >= rf12_len + 5 || rxfill >= RF_MAX)
            rf12_xfer(RF_IDLE_MODE);
    } else {....

This bit of code puts received characters into a buffer until rf12_len bytes are received or the buffer fills.

The problem is that rf12_len is one of the received characters in the buffer, which may itself be corrupted during transmission (remember this is before the CRC check) or some characters may simply be lost and there aren't enough received to ever reach rf12_len.

I think that in this situation the code will simply wait forever or until enough bytes come along to reach the rf12_len count, but then the next message will be corrupted too and so on.

Happy to be proved wrong about this but it doesn't look ideal to me.

********************************************************************************************
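For anyone who wants to see what "written defensively" could look like, here is a minimal host-side sketch (plain C++, not JeeLib code). It assumes a packet layout of two header bytes, with the second being the payload length, followed by the payload and two CRC bytes. The names `Receiver`, `onByte`, `abandonIfStalled`, `kMaxPacket` and `kByteTimeoutMs` are all made up for illustration. The idea is simply to cap whatever length arrives and to give up on a partial packet if the bytes stop coming.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical constants, not taken from JeeLib.
constexpr std::size_t   kMaxPacket     = 72; // assumed buffer size
constexpr std::uint32_t kByteTimeoutMs = 5;  // assumed longest gap between bytes

struct Receiver {
    std::uint8_t  buf[kMaxPacket];
    std::size_t   fill       = 0;
    std::uint32_t lastByteMs = 0;
    bool          done       = false;
};

// Store one received byte. buf[1] is treated as the (possibly corrupted)
// payload length, so the wanted byte count is always capped at kMaxPacket.
void onByte(Receiver& rx, std::uint8_t in, std::uint32_t nowMs) {
    if (rx.done) return;
    rx.buf[rx.fill++] = in;
    rx.lastByteMs = nowMs;

    std::size_t wanted = kMaxPacket;
    if (rx.fill >= 2) {
        wanted = rx.buf[1] + 4u;               // 2 header + payload + 2 CRC
        if (wanted > kMaxPacket) wanted = kMaxPacket;
    }
    if (rx.fill >= wanted) rx.done = true;
}

// Called regularly from the main loop: drop a partial packet that has stalled,
// so a lost byte cannot leave the receiver waiting indefinitely.
bool abandonIfStalled(Receiver& rx, std::uint32_t nowMs) {
    if (!rx.done && rx.fill > 0 && nowMs - rx.lastByteMs > kByteTimeoutMs) {
        rx.fill = 0;
        return true;
    }
    return false;
}
```

Note that the interrupt routine quoted above already caps the count with RF_MAX, so the length cap is not new; the part this sketch adds is the stall timeout.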

We all appear to have ignored Martin's suggestion so maybe it's time to have a closer look at what he has proposed.

JBecker's software skills are better than mine so if you are happy to lead the way - what do you think?

 

DaveF's picture

Re: emonGLCD - Not responding.

I have the normal set up plus a LP emontx temp node and a battery operated JNode - data recorded on Raspberry Pi x2. The Raspberry Pis have been faultless to date - emoncms works fine

The JNode has locked up 3 times in the last 30 days and I also note the pressure reading flips its sign +/- 30,000 units and I am wondering why?

I have an old emonGLCD v1.2 with old software this has not locked in the last 40 days of continuous use.

Hope this helps

Dave

 

JBecker's picture

Re: emonGLCD - Not responding.

We all appear to have ignored Martin's suggestion so maybe it's time to have a closer look at what he has proposed.

Brian, Martin,

I have not really ignored Martin's remark, I can clearly follow him and I also think that this part of the code makes the receive routine vulnerable to transmission and reception errors. But I still think that the freeze always (?) happens during transmission. This is at least what I have found until now.

Hmmm, next thing I will try will be to set up two Arduino compatible nodes with 'original' jeelibs code to transmit at very short intervals and wait for lockups. If this happens often enough (at least every few minutes), then it should be possible to catch the event with a logic analyzer. (I have already tried this with two of my MSP430 nodes but did not get any lockups)

BR, Jörg. 

o_cee's picture

Re: emonGLCD - Not responding.

Looking at http://jeelabs.org/2011/05/07/rf12-skeleton-sketch/ the code for emonGLCD does things a bit different. Has anyone tried adapting the "official" way, including checking rf12_len and using memcpy?

Lloyd's picture

Re: emonGLCD - Not responding.

Apologies if this is totally irrelevant, but my emonGLCD (I think it is v1.3) has run continuously since about July.  It receives from 2 x emonTX and 1 x emonBase, and I have done nothing to tweak when each transmits.  But I have disabled the transmit from the emonGLCD, and I can't remember why I did that.  I have modified the code somewhat to suit the data I want to display, but nothing fundamental in the way the rf library is called.

Hope this helps,

Lloyd

JBecker's picture

Re: emonGLCD - Not responding.

Hmmm, next thing I will try will be to set up two Arduino compatible nodes with 'original' jeelibs code to transmit at very short intervals and wait for lockups.

Just to keep you informed, I did what I said above. One node sending every 400ms, the other every ~2s. No lockups during >24 hours so far!!! This is using stripped down code, so it seems to have something to do with the parts omitted !?!

BR, Jörg.

Brian D's picture

Re: emonGLCD - Not responding.

Jörg said:

Just to keep you informed, I did what I said above. One node sending every 400ms, the other every ~2s. No lockups during >24 hours so far!!! This is using stripped down code, so it seems to have something to do with the parts omitted !?!

I have made the mistake of jumping to conclusions with this problem. It usually comes back to bite you !

The test I have been running for the last week or so uses code that used to fail regularly but has the watchdog trap (not reset) added and has not yet failed.

I have noticed before that sometimes you add some code and the problem appears to change - then it comes back again.

I suggest you leave it for a couple of weeks before you reach a conclusion.

Barry Broom's picture

Re: emonGLCD - Not responding.

Hi all. I have a similar problem. My emonGLCD v1.4 freezes up approximately every five days. Prior to the firmware update with four screens and icons it ran faultlessly. I have loaded Glyn's new low memory sketch and I will feedback results (hopefully not for a couple of weeks of problem free running!)

Brian D's picture

Re: emonGLCD - Not responding.

Good news!

The test that I have been running since 24 Jan finally produced a lockup today and the watchdog trap activated where it dumped the function code on the LCD.

I have attached a screen shot of the locked up display plus the sketch that was running on both emonGLCD units.

The important bit of information is that the function code showing on the LCD is 6 and the important section from the sketch is this:

    emonglcd.temperature = (int) (temp * 100);   // set emonglcd payload
    function=5;
    rf12_recvDone();
    function=6;

    if (rf12_canSend())
    {
      function=7;
      rf12_sendStart(0, &emonglcd, sizeof emonglcd);
      function=8;
      rf12_sendWait(0);

You can see that the function which failed to return was rf12_canSend() because this comes next:

void loop()
{
   wdt_reset();  // Service the dog
   function=1; // This is the function call identifier
  if (rf12_recvDone())

I think this is what we were looking for - so what do we do now?

Brian

Brian D's picture

Re: emonGLCD - Not responding.

I should add that the function code is displayed in place of Voltage and appears in the bottom left corner of the LCD screen.

Brian D's picture

Re: emonGLCD - Not responding.

I have only just noticed - if you look at the diverted power quadrant (bottom right) you will see that I am sending the real time and maximum temperature of the triac used to control the dump load. The maximum figure looks OK but the real time figure is 0.2 which is wrong and I have never seen that before. This could be a coincidence but if so it's surprising.

 

PaulOckenden's picture

Re: emonGLCD - Not responding.

Hmmm.... The hang could be in canSend, but unless I'm mistaken it could equally be in the start of sendStart, up until the point where the display interrupt gets called. 

Unless I'm being thick.

P.

Brian D's picture

Re: emonGLCD - Not responding.

Paul O:

Hmmm.... The hang could be in canSend, but unless I'm mistaken it could equally be in the start of sendStart, up until the point where the display interrupt gets called. 

I think this is true:

If rf12_canSend() returns true function will = 7

if rf12_canSend() returns false function will = 1

if rf12_canSend() fails to return function will = 6

Hopefully one of the software gurus will check this as it's crucial to the test.

MartinR's picture

Re: emonGLCD - Not responding.

Well done Brian!

That's pretty conclusive that the problem is in rf12_canSend(). Now all we need to do is work out where. There are some unchecked while loops on that path (e.g. in rf12_byte()).

Paul - it can't be in sendStart. Once the variable 'function' is set to 7 then that value would be displayed once the watchdog tripped.
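To make the reasoning about the function codes concrete, here is a small host-side model of Brian's breadcrumb scheme (plain C++, not the sketch itself). `HangAt` and `breadcrumbWhenStuck` are invented names. A "hang" is modelled by returning early with whatever value `function` holds at that moment, which is what the watchdog trap would put on the LCD.

```cpp
// Which call (if any) is assumed never to return in this simulated pass.
enum class HangAt { None, RecvDone, CanSend, SendStart, SendWait };

// Returns the value of 'function' that the watchdog trap would display.
// canSendResult is what rf12_canSend() would return if it did return.
int breadcrumbWhenStuck(HangAt hang, bool canSendResult) {
    int function = 5;                              // set just before rf12_recvDone()
    if (hang == HangAt::RecvDone) return function;

    function = 6;                                  // set just before rf12_canSend()
    if (hang == HangAt::CanSend) return function;

    if (canSendResult) {
        function = 7;                              // set just before rf12_sendStart()
        if (hang == HangAt::SendStart) return function;
        function = 8;                              // set just before rf12_sendWait()
        if (hang == HangAt::SendWait) return function;
    }

    function = 1;                                  // top of the next loop() pass
    return function;
}
```

In this model a displayed 6 can only come from rf12_canSend() failing to return; a hang inside rf12_sendStart() would show 7 instead.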

Barry Broom's picture

Re: emonGLCD - Not responding.

Just following up from my post on Tuesday. I trialed Glyn's low memory GLCD sketch and found that it crashed consistently in under two hours. Unlike the standard emonGLCD sketch with icons, the low memory sketch blanks the screen and switches off the LEDs. I have moved back to the standard sketch with icons which typically freezes up after a week. I can confirm that prior to the icons change, the old emonGLCD ran reliably for weeks.

glyn.hudson's picture

Re: emonGLCD - Not responding.

Yes, well done Brian! That's a good bit of debugging. We now have more of a focus. 

MartinR's picture

Re: emonGLCD - Not responding.

I just had a quick look at rf12_byte(), which is called from rf12_canSend() and it does indeed look vulnerable.

This is what it looks like without all the #ifdefs...

static uint8_t rf12_byte (uint8_t out) {
    SPDR = out;
    // this loop spins 4 usec with a 2 MHz SPI clock
    while (!(SPSR & _BV(SPIF))) ;
    return SPDR; }

All it does is write the value 'out' to the SPI and then wait for the SPIF bit to be set in the SPSR register, which happens when the SPI transfer is complete. It then reads the SPI data register, SPDR, which clears the SPIF bit.

The immediate problem I can see is that this function is also called by the RFM12 interrupt routine so if an RFM12 interrupt occurs when this function is waiting in the while loop then the SPIF will be cleared in the interrupt routine and the outer loop will wait forever for it to be set. As the comment in the code says this window is only a few microseconds which could explain why it doesn't happen very often.

There may also be other areas in the code where this can happen too.
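The race described above can be acted out on a PC with a toy model of the SPI registers. This is only an illustration under simplifying assumptions: `FakeSpi`, `fakeIsr` and `transferCompletes` are invented names, the "transfer" completes instantly, and the wait loop is given a spin limit so the program ends instead of hanging like the real thing.

```cpp
#include <cstdint>

// Toy stand-in for the AVR SPI registers; no real hardware is touched.
struct FakeSpi {
    bool         spif = false;  // models the transfer-complete flag
    std::uint8_t data = 0;      // models the data register

    void writeData(std::uint8_t out) { data = out; spif = true; } // transfer "finishes" at once
    std::uint8_t readData() { spif = false; return data; }        // reading clears the flag
};

// What the radio interrupt does in this model: a complete transfer of its
// own, which ends with the flag cleared.
void fakeIsr(FakeSpi& spi) {
    spi.writeData(0x00);
    while (!spi.spif) {}
    (void)spi.readData();
}

// Same shape as rf12_byte(): write, then poll the flag. Returns false where
// the real code would spin forever.
bool transferCompletes(FakeSpi& spi, bool isrFiresInWindow, int maxSpins) {
    spi.writeData(0x00);
    if (isrFiresInWindow) fakeIsr(spi);   // interrupt lands between write and poll
    for (int i = 0; i < maxSpins; ++i) {
        if (spi.spif) { (void)spi.readData(); return true; }
    }
    return false;
}
```

With no interrupt in the window the transfer completes; with one, the poll never sees the flag, which matches the lock-up described above.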

 

Brian D's picture

Re: emonGLCD - Not responding.

That is looking good Martin.

I guess that the next step is a tiny piece of code that focuses on this vulnerability and quickly produces 'reliable' errors. If we can reach that stage then the problem can be passed to jeelabs for the necessary changes to be made.

I am game for more testing. Who is going to write this 'nasty' bit of code? :)

 

 

o_cee's picture

Re: emonGLCD - Not responding.

Brian: Why is there a rf12_recvDone() before rf12_canSend(), which isn't being checked for its return value? It seems unnecessary and as far as I know, the examples from JCW don't do this: http://jeelabs.org/2011/05/07/rf12-skeleton-sketch/

No clue if it would be possible for it to cause hangs, but you never know (although Martin's suggestion is probably more likely).

PaulOckenden's picture

Re: emonGLCD - Not responding.

Why is there a rf12_recvDone() before rf12_canSend()

It's what JeeLabs recommend. There's a couple of blog posts on their website about this.

Incidentally, does anyone here know JCW? It would be good to get him involved now we're suspecting bugs in his code.

P.

o_cee's picture

Re: emonGLCD - Not responding.

It's what JeeLabs recommend. There's a couple of blog posts on their website about this.

Care to point out one for me? Can't find anything about doing things like that, when there already is one loop polling rf12_recvDone() like here.

PaulOckenden's picture

Re: emonGLCD - Not responding.

Here: http://oldred.jeelabs.net/projects/cafe/wiki/Rf12_canSend()

I found a blog post which went into this in more detail earlier, but I can't find it now.

P.

 

o_cee's picture

Re: emonGLCD - Not responding.

Note that even if you only want to send out packets, you still have to call rf12_recvDone() periodically, because it keeps the RF12 logic going.

My point exactly, we are not only sending packets, we are sending and receiving. In Brian's code, there is already a call to rf12_recvDone() at the beginning of loop() which keeps the logic going. The one before rf12_canSend() could return true without the code acting appropriately, but I'm not sure it would have any ill effects other than missing a packet(?).

Petrik's picture

Re: emonGLCD - Not responding.

Just out of curiosity: does everyone who has a hang up problem have a 1-wire sensor connected ?... or is there someone with this problem without a 1-wire sensor ?

fluppie007's picture

Re: emonGLCD - Not responding.

Cool that we're heading in a certain direction. But what I still don't understand is, why does my emonTx never lock up and my emonGLCD within 24h? EmonTx uses the same function calls, right?

o_cee's picture

Re: emonGLCD - Not responding.

Because of RX and TX, not just TX?

o_cee's picture

Re: emonGLCD - Not responding.

Another interesting thing to try: http://jeelabs.net/boards/6/topics/167 Look at the link back to the old forum also. Seems like a partly rewritten RFM12b driver, maybe it addresses the potential issues Martin saw? Haven't had time to compare the code myself yet.

JBecker's picture

Re: emonGLCD - Not responding.

I just had a quick look at rf12_byte(), which is called from rf12_canSend() and it does indeed look vulnerable.

Yes, and it does not only look vulnerable, it is also completely useless! I think I am repeating myself, but the rf12_byte() call does not assert the chip select of the RFM12B, so it will never read valid data from the chip. So the function call is at best useless, and I think it might be dangerous too (see Martin's explanation above. Could it be that asserting CS of the RFM12B will keep it from generating interrupts?).

First thing should be to either remove this call or replace by the (correct) call to rf12_xfer(). Somebody out there who wants to try that (I do not own an emonGLCD)?

MartinR's picture

Re: emonGLCD - Not responding.

First thing should be to either remove this call or replace by the (correct) call to rf12_xfer(). Somebody out there who wants to try that (I do not own an emonGLCD)?

I think we should try and prove this really is the problem area first.

How about this....

Make a copy of rf12_byte() and call it something like rf12_byte_copy(). Then change rf12_canSend() to call this new version.

In the new version add a bit of delay before the while statement. This will create a much longer window where SPIF may get cleared by an interrupt. You could also toggle an unused pin or LED in the while loop so you could prove that is where the CPU is stuck. Something like this should do it...

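static uint8_t rf12_byte_copy (uint8_t out) {
    SPDR = out;
    // test only: widen the window before polling
    // (the original loop spins 4 usec with a 2 MHz SPI clock)
    delay(1);

    // toggle TESTPIN while waiting so a scope or LED shows if the CPU is stuck here
    while (!(SPSR & _BV(SPIF))) digitalWrite(TESTPIN, digitalRead(TESTPIN) ? LOW : HIGH);
    return SPDR; }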

JBecker's picture

Re: emonGLCD - Not responding.

This is a good idea to find if this is the critical point.

You can also set a flag (volatile variable) in the ISR which is cleared entering your modified rf12_byte_copy() routine and checked just after the delay(1).

 

Brian D's picture

Re: emonGLCD - Not responding.

Martin

Thanks for that. I shall have a look at this tonight. My current test platform is ideal for this because the lock-up happens so rarely and if your idea seriously aggravates the  problem then we are getting much closer.

 

Brian D's picture

Re: emonGLCD - Not responding.

I believe in small increments so the only change I have made initially is this:

static uint8_t rf12_byte (uint8_t out) {
#ifdef SPDR
    SPDR = out;
    // this loop spins 4 usec with a 2 MHz SPI clock
    delay(1); // BD test code
    while (!(SPSR & _BV(SPIF)))
        ;
    return SPDR;

The result is the LCD display is initialised correctly and the simple test of covering the light sensor continues to control the backlight but no data appears to be received from the emonTx. This continues for about 1 minute and then the watchdog trap activates with function code 7 displayed.

This is the code 7 part of the sketch:

     function=7;
     rf12_sendStart(0, &emonglcd, sizeof emonglcd);
     function=8;
     rf12_sendWait(0);

Clearly 1 ms is too long so I tried this:

static uint8_t rf12_byte (uint8_t out) {
#ifdef SPDR
    SPDR = out;
    // this loop spins 4 usec with a 2 MHz SPI clock

    for(int test_delay = 0; test_delay < 100; test_delay++)
    {
        __asm__ __volatile__ ("nop\n\t");
    }

    while (!(SPSR & _BV(SPIF)))
        ;
    return SPDR;

That was also too long but the loop reduced to 50 is now running on both GLCD:

    for(int test_delay = 0; test_delay < 50; test_delay++)
    {
        __asm__ __volatile__ ("nop\n\t");
    }

 

o_cee's picture

Re: emonGLCD - Not responding.

Brian: remember that rf12_byte is called from other functions as well, so do as Martin wrote and create a copy of it which is only called from rf12_canSend.

MartinR's picture

Re: emonGLCD - Not responding.

Did you change the rf12_byte function in the library Brian, rather than make a copy?

The reason I suggested a copy is because the function is also called from the interrupt routine and we don't want to disturb the timing there.

If the code is running with the delay you've added though it should still fail slightly more often than before if this is indeed the problem.

MartinR's picture

Re: emonGLCD - Not responding.

I'm too slow :)

Brian D's picture

Re: emonGLCD - Not responding.

Ahh! more to this than I thought!

I have now added a copy --- static uint8_t rf12_byte_copy (uint8_t out) {

Which is only called from rf12_canSend()

The delay is now delay(1); as suggested.

Both running at the moment - thanks guys.

Brian D's picture

Re: emonGLCD - Not responding.

Running with the delay at 1 ms on both units there were no problems after 1 hour so I put the delay up to 100 ms on one unit only. The longer delay trapped after 20 minutes with function code 6 displayed but prior to the trap everything was working normally.

I could add the testpin stuff etc but it's not entirely convenient and it probably will not change things much. I was hoping that when we are reasonably convinced this is the problem area we can cook up a tiny bit of code that jeelabs can run to see the problem for themselves. Alternatively is this enough?

Anyhow this is just an update and the test is running again. I shall set the delay to 10 ms for the overnight run.

 

 

Brian D's picture

Re: emonGLCD - Not responding.

OK, both units - one with delay(1); and the other delay(100); are locking up reliably now with function code 6.

The thing that really puzzles me is that the temperature display is always shown as 0.2. This occurs in this bit of code:

 // Display the current and maximum temperature of the Triac
 
  dtostrf((controllertemp/100),0,1,str);

  glcd.drawString_P(68,58,PSTR("Tp "));
  glcd.drawString(78,58,str);
 
 // Now display maximum temperature
 
  if(controllertemp>maxcontrollertemp)
  {
    maxcontrollertemp=controllertemp;
  }
  dtostrf((maxcontrollertemp/100),0,1,str);

  glcd.drawString_P(100,58,PSTR("Mx "));
  glcd.drawString(110,58,str);

I don't understand how that can happen.

o_cee's picture

Re: emonGLCD - Not responding.

I'm not even sure we need test code for this, it's quite obvious that this can go wrong when the interrupt comes at the wrong time. The rf12mods branch seems to handle things a bit differently, it might already be addressed there.

It's also interesting to see the comments in the code on rf12_canSend(): "no need to test with interrupts disabled" Seems like there has been a check earlier, but then removed. Can't see any history of this on GitHub though. Disabling interrupts there should fix this problem as far as I understand, not sure what it could mess up though.

Another interesting part in the code: rf12_control() disables interrupts "to avoid clashes on the SPI bus". The function calls rf12_xfer(), which calls rf12_byte().

Brian D's picture

Re: emonGLCD - Not responding.

Brian said:

The thing that really puzzles me is that the temperature display is always shown as 0.2. 

OK you can all stop worrying now :)

It's a trivial error on my part which I have fixed and  is not worth explaining.

Brian

o_cee's picture

Re: emonGLCD - Not responding.

Brian: Could you try to add cli() in the beginning of rf12_byte copy, and sei() at the end? Even with the delay, that should work if our hypothesis is correct.
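One detail worth keeping in mind with a cli()/sei() pair is that sei() switches interrupts on unconditionally, even if the caller had them off. The toy model below (plain C++ on a PC; `FakeCpu`, `guardedUnconditional` and `guardedRestoring` are invented names) compares that with a save-and-restore guard. On a real AVR the restoring pattern is what avr-libc's ATOMIC_BLOCK(ATOMIC_RESTORESTATE) from <util/atomic.h> provides. For a copy that is only ever called from rf12_canSend() in the main loop the difference should not matter, so this is a side note rather than a correction.

```cpp
// Toy model of the global interrupt-enable flag (the I bit in SREG on an AVR).
struct FakeCpu {
    bool irqEnabled = true;
};

// Pattern A: cli() at the start, sei() at the end.
void guardedUnconditional(FakeCpu& cpu) {
    cpu.irqEnabled = false;   // models cli()
    // ... the SPI transfer would happen here ...
    cpu.irqEnabled = true;    // models sei(): always enables, whatever the caller had
}

// Pattern B: remember the previous state and put it back afterwards.
void guardedRestoring(FakeCpu& cpu) {
    const bool saved = cpu.irqEnabled;
    cpu.irqEnabled = false;
    // ... the SPI transfer would happen here ...
    cpu.irqEnabled = saved;
}
```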

jcw's picture

Re: emonGLCD - Not responding.

Whoa, great detective work - thx Oscar for the email.

I'm considering the following change to rf12.cpp - could someone here verify that it does indeed solve the problem?

/// rf12_recvDone() periodically, because it keeps the RFM12B logic going. If
/// you don't, rf12_canSend() will never return true.
uint8_t rf12_canSend () {
-    // no need to test with interrupts disabled: state TXRECV is only reached
-    // outside of ISR and we don't care if rxfill jumps from 0 to 1 here
+    // need interrupts off to avoid a race (and enable the RFM12B, thx Jorg!)
+    // see http://openenergymonitor.org/emon/node/1051?page=3
     if (rxstate == TXRECV && rxfill == 0 &&
-            (rf12_byte(0x00) & (RF_RSSI_BIT >> 8)) == 0) {
+            (rf12_control(0x0000) & RF_RSSI_BIT) == 0) {
         rf12_xfer(RF_IDLE_MODE); // stop receiver
-        //XXX just in case, don't know whether these RF12 reads are needed!
-        // rf12_xfer(0x0000); // status register
-        // rf12_xfer(RF_RX_FIFO_READ); // fifo read
         rxstate = TXIDLE;
         return 1;
     }

-jcw

PS. Now added as issues on the JeeLib issue tracker - https://github.com/jcw/jeelib/issues
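The diff changes two things on the same line: the call (rf12_byte to rf12_control) and the mask (RF_RSSI_BIT >> 8 to RF_RSSI_BIT). The mask change appears to follow from the return width: rf12_byte() returns 8 bits, while the new mask implies rf12_control() hands back the full 16-bit status word. The host-side check below assumes RF_RSSI_BIT is 0x0100 (an assumption, it is not quoted in this thread) and uses invented helper names to show that both forms test the same bit.

```cpp
#include <cstdint>

constexpr std::uint16_t kRssiBit = 0x0100;  // assumed value of RF_RSSI_BIT

// Old form: only the high status byte is available, so the mask is shifted down.
bool rssiClearFromHighByte(std::uint8_t highByte) {
    return (highByte & (kRssiBit >> 8)) == 0;
}

// New form: the whole 16-bit status word is available, so the mask is used as-is.
bool rssiClearFromWord(std::uint16_t status) {
    return (status & kRssiBit) == 0;
}

// Walks every possible status word and compares the two forms.
bool formsAgreeForAllStatusWords() {
    for (std::uint32_t s = 0; s <= 0xFFFF; ++s) {
        const auto word = static_cast<std::uint16_t>(s);
        const auto high = static_cast<std::uint8_t>(word >> 8);
        if (rssiClearFromHighByte(high) != rssiClearFromWord(word)) return false;
    }
    return true;
}
```

So the behavioural change in the patch is in how the status is read (chip selected, radio interrupt held off), not in which bit is tested.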

MartinR's picture

Re: emonGLCD - Not responding.

It's a trivial error on my part which I have fixed and  is not worth explaining.

That's a relief, didn't fancy another treasure hunt - interesting journey though it was!

I agree with o_cee's comment above, disabling interrupts in rf12_byte_copy would be the final proof if any were needed. After proving this you could then remove the delay and you should have a temporary solution with no more lockups if this is the only problem.

Well done for sticking with the problem Brian and not getting distracted by temporary fixes.

Brian D's picture

Re: emonGLCD - Not responding.

Jean-Claude Wippler

Your suggested mod is now running here on two units but the test does contain the delay(1); described above otherwise it will take weeks to see a result. This way if no lock-up is seen within say 24 Hours then things will look promising.

I shall update the group tomorrow or before if it fails.

JBecker's picture

Re: emonGLCD - Not responding.

JC,

would still be interesting to know if rf12_control() (with interrupts disabled) is really needed or if rf12_xfer() would also work due to asserting the CS (which might keep the RFM12B from generating an interrupt while being 'selected') !?!

Do you (or anyone else) know this from your experience with the RFM12B?

BR, Jörg.

jcw's picture

Re: emonGLCD - Not responding.

I know of no docs saying that the RFM12B won't generate an interrupt while selected. Disabling just the RFM12B interrupt, as rf12_control does, seems like a safe bet. If it's easy to test, we could try without and use rf12_xfer to see what happens.

PS. Thanks for testing, Brian!

JBecker's picture

Re: emonGLCD - Not responding.

The reason why I asked is the following:

If rf12_xfer() is generally not safe outside interrupts, then the

         rf12_xfer(RF_IDLE_MODE); // stop receiver

would also not be safe (and should be replaced by rf12_control())!

And then generally rf12_control() should be used everywhere outside interrupts.

Or am I wrong there?

 

jcw's picture

Re: emonGLCD - Not responding.

Yes, in general that's indeed a good idea (rf12_control was added later on). In some cases it can be avoided by looking at what states the driver can be in at that point. Here, I think you're right - I've added the change to commit as soon as the previous change has been "somewhat" confirmed. Thanks.

Brian D's picture

Re: emonGLCD - Not responding.

Brian said:
I shall update the group tomorrow or before if it fails.

Here is the update but first a reminder. The two emonGLCD test platforms have been running code with Martin's additional delay(1); and the result was a fairly consistent one or two lock-ups every 90 minutes.

The test was to add JC's fix but leave in Martin's delay.

The result after 24 Hours is no lock-up on either unit. Not conclusive but extremely encouraging.

The best thing to do now is a beta release and then all the other users who have this problem can simply re-compile and we will have several tests running in parallel.

On a different tack - this exercise has been my first experience of such a collaborative effort and I am very impressed with the quality of the contributions and have enjoyed the ride so well done everyone!

Brian

jcw's picture

Re: emonGLCD - Not responding.

Yep - the power of open source: everybody wins!

I've checked in new changes for the RF12 driver, see https://github.com/jcw/jeelib/issues/33

Also includes a new rf12_sendNow() wrapper around rf12_canSend() and rf12_sendStart(), with updated docs over here: http://jeelabs.net/pub/docs/jeelib/RF12_8cpp.html#func-members

If you comment on GitHub, I'll get an email ping right away.

Allow me to chime in and thank everyone as well - this sort of deep testing helps a lot because then the issue becomes clear enough to be able to fix it.
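As a rough picture of what a wrapper around rf12_canSend() and rf12_sendStart() amounts to, here is a host-side sketch with a stand-in radio. `FakeRadio` and `sendNowSketch` are invented for illustration and this is not the library source; check the linked docs for the real rf12_sendNow() behaviour.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the radio driver, so the loop can run on a PC.
struct FakeRadio {
    int busyPolls;            // how many times canSend() says "not yet"
    int recvDoneCalls = 0;
    int sent = 0;

    explicit FakeRadio(int busy) : busyPolls(busy) {}

    bool canSend() {
        if (busyPolls > 0) { --busyPolls; return false; }
        return true;
    }
    void recvDone() { ++recvDoneCalls; }
    void sendStart(std::uint8_t, const void*, std::size_t) { ++sent; }
};

// Wait until the radio is free, keeping the receive logic going meanwhile,
// then start the transmission.
void sendNowSketch(FakeRadio& radio, std::uint8_t hdr, const void* data, std::size_t len) {
    while (!radio.canSend())
        radio.recvDone();
    radio.sendStart(hdr, data, len);
}
```

Unlike a retry loop with a counter, this version waits as long as it takes, so it relies on canSend() eventually returning true.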

mad_dad's picture

Re: emonGLCD - Not responding.

So is there a final solution to this for the emonglcd code

A Summary of the last 4 pages perhaps

Brian D's picture

Re: emonGLCD - Not responding.

Yes there is a solution. A vulnerability was found in the jeelib code.

The jeelib problem has been fixed and a release has been made - https://github.com/jcw/jeelib/issues/33

That's it.

 

Avontech's picture

Re: emonGLCD - Not responding.

Awesome work guys, many thanks, I've been following avidly, though unfortunately due to working away for the last 4 weeks, I've not been able to contribute much.

So is there a simple summary of what I should do to get my system back up and working - what code changes to do or libraries to download - any links would be most useful.

An idiot's step-by-step guide to implementing the fixes?

Many thx

Brian D's picture

Re: emonGLCD - Not responding.

Click here.

Then download the zip file and replace the jeelib library in your library area with the new one. I had to change the name on my system.

 

robw's picture

Re: emonGLCD - Not responding.

Hi Brian

Is that all that's needed... There is no need to change any of the code for the GLCD sketch at all? I ask because I thought there was a problem with the sending as well. (kludge?)

int i = 0; while (!rf12_canSend() && i<10) {rf12_recvDone(); i++;}  // if ready to send + exit loop if it gets stuck as it seems to
    rf12_sendStart(0, &emonglcd, sizeof emonglcd);                      // send emonglcd data
    rf12_sendWait(0);

Just i have my folks that use the history page a lot (saves them going into the roof to look at the inverter and under the stairs every day, Yes they are still new to the solar thing and still check everyday)

Thanks.

Brian D's picture

Re: emonGLCD - Not responding.

I got rid of the kludge a long while ago so have not tried putting it back in.

As for the history stuff - I have never tried it.

 

Barry Broom's picture

Re: emonGLCD - Not responding.

Hi all. I have applied the JeeLib fix to the standard SolarPV emonGLCD build. It sometimes freezes up, but less frequently now. Is anyone still having issues with emonGLCD freezing up still?

gb095666's picture

Re: emonGLCD - Not responding.

Yes, I am still getting lockups, but I still have the history screen active so was wondering if that had anything to do with it.

Barry Broom's picture

Re: emonGLCD - Not responding.

Same here, I am running the history screen. I am dealing with my third R-Pi emonCMS rebuild after two SD card failures so don't have time to look into this issue at the moment. On the positive side, my emonTx has run fault free for months!

robw's picture

Re: emonGLCD - Not responding.

Just for info.  I've got a v1.4 that still seems to lock up but now at weekly intervals not hourly or daily. 

But my parents have a v1.3 that locks up daily some times still hourly. Running from the same code as the v1.4. (My parents would have no idea how to program one so I go round with the same laptop to re program theirs). The v1.3 has been modded to run the screen at 3.3v not the usual 5v as shipped. 

Don't know how this could affect anything though. Just throwing it into the mix. 

 

 

matt-carbon-coop's picture

Re: emonGLCD - Not responding.

Yeah, I'm getting freeze-ups about every 2-4 days on all 15 emonGLCD v1.4 units I'm running. But they're loaded with a standard HomeEnergyMonitor example with a DHT22 humidity/temp sensor in place of the DS18B20, rather than the Solar example, so it's probably a different problem. I've copied the ino below in case anyone has any ideas that might help. I'm using the latest Jeelib library.

I'm putting in the wdt code for now where I can, until I've time to troubleshoot.

 

glyn.hudson's picture

Re: emonGLCD - Not responding.

Hi Guys, 

Great work tracking down and getting this fix pushed to the JeeLib library. This really is an amazing example of open source at its best!

I have updated the emonGLCD SolarPV_lowMem_Dev example to use the new rf12_sendNow wrapper and have updated my JeeLib to the latest version from JCW. As of today I've started a test running here in the lab and in my house using the updated code. It would be great if anyone else wants to run the same code and report back: https://github.com/openenergymonitor/EmonGLCD/tree/master/SolarPV_lowMem_Dev

Fingers crossed!

Thanks again.  

Barry Broom's picture

Re: emonGLCD - Not responding.

Hi Glyn,

Thanks for the update on this. I have loaded your new low-mem code to my v1.3 emonGLCD using the updated JeeLib library. Mine last crashed 6 days ago, so it may be a while before I feedback. Fingers crossed!

Cheers,
Barry

Barry Broom's picture

Re: emonGLCD - Not responding.

Results from my tests:

The low-mem sketch was disastrous for me. It always locks up after 9 minutes of running. I did this four times. I had problems trying the low-mem sketch before the JeeLabs code change.

I did apply the JeeLabs code changes to the current SolarPV sketch (without history page) and it has been running fine for the last four days.

fluppie007's picture

Re: emonGLCD - Not responding.

Barry, I experience the same problems with the firmware. Locks up in less than 20 minutes.

EDIT: Loading the regular SolarPV sketch like Barry seems to work just fine :-).

Barry Broom's picture

Re: emonGLCD - Not responding.

I have been running the SolarPV sketch (without history page) with Glyn's code changes to bring in the JeeLib fix. It has run perfectly for over a week now, so I consider this issue as resolved. Thank you very much! I'll give the code with the history screen a go, as I am sure this will work now.

fluppie007's picture

Re: emonGLCD - Not responding.

Barry, did you use this code without changes or did you modify something in the code: https://github.com/openenergymonitor/EmonGLCD/tree/master/SolarPV

Because mine locked up 5 hours ago. I'm running Arduino IDE 1.5.2 with the latest onewire, dallastemp, emonlib and jeelib libraries.

EDIT:
I changed the the following lines at the end of the SolarPV sketch:

rf12_sendNow(0, &emonglcd, sizeof emonglcd);
rf12_sendWait(2);

casestudies's picture

Re: emonGLCD - Not responding.

Last December, there was some discussion about stability issues with the EmonGLCD. The one I have set up continues to lock up after about 12-24 hours of running. Was this issue resolved in the latest updates on GitHub? I built the unit for my parents, who live a few hundred miles away, so I can't keep updating the firmware for testing.

I'm not using it for monitoring a PV setup, just monitoring home power usage with CT clamps. I monitor two temperatures as well: one on the EmonGLCD and one on the EmonTX. If there is a fix, which file should I upload for monitoring home power usage? A link would be great, as I'm rather lost. Also, I'm using a Raspberry Pi for the base station, so I'd like to send the GLCD the current time from the Pi.

The EmonTX I have set up has been running for about 3.5 months and hasn't caused any problems (i.e. the Raspberry Pi is still able to read data from it). It'd be great to get the display working so my parents can look at current usage (easier than talking them through how to use the emoncms website).

Thanks in advance!

Brian D's picture

Re: emonGLCD - Not responding.

Glyn recently published this link to the fix. Downloading only the *.ino file will not fix the problem. You also need to update JeeLib with the fix, which can be found via a link in Glyn's post.

Barry Broom's picture

Re: emonGLCD - Not responding.

Hi fluppie007. Sorry for the delay, I've been busy!

Like you, I have used the main EmonGLCD SolarPV branch at https://github.com/openenergymonitor/EmonGLCD/tree/master/SolarPV

I have removed the last three lines:

    int i = 0; while (!rf12_canSend() && i<10) {rf12_recvDone(); i++;} // if ready to send + exit loop if it gets stuck as it seems too
    rf12_sendStart(0, &emonglcd, sizeof emonglcd); // send emonglcd data
    rf12_sendWait(0);  

and replaced them with the changes Glyn made to the 'low_mem' branch;

    rf12_sendNow(0, &emonglcd, sizeof emonglcd);                     //send temperature data via RFM12B using rf12_sendNow wrapper - https://github.com/jcw/jeelib/issues/33
    rf12_sendWait(2);

With the correct version of the JeeLib libraries, EmonGLCD reliability is massively improved. I would like to see these changes included in the main SolarPV branch, ideally with the history feature reinstated. Glyn, if you are reading this, I can thoroughly recommend the change to help new installations work correctly!
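For anyone wondering what actually differs between the two snippets, here is a minimal sketch of the control flow only. MockRadio and both function names are made up so it can be compiled off-target; it is not JeeLib code, and whether sending while the driver is still busy is what caused the lockups is an assumption on my part, not something proven in this thread.

```cpp
#include <cstdint>

// Hypothetical stand-in for the radio driver (NOT the JeeLib API).
// busyPolls = number of recvDone() calls needed before sending is allowed.
struct MockRadio {
    int busyPolls;
    bool sent = false;
    bool sentWhileBusy = false;

    explicit MockRadio(int polls) : busyPolls(polls) {}

    bool canSend() const { return busyPolls <= 0; }
    void recvDone() { if (busyPolls > 0) --busyPolls; }
    void sendStart() { sent = true; sentWhileBusy = (busyPolls > 0); }
};

// Old pattern from the sketch: poll at most 10 times, then send regardless.
void sendOldStyle(MockRadio &r) {
    int i = 0;
    while (!r.canSend() && i < 10) { r.recvDone(); i++; }
    r.sendStart();  // runs even if the loop gave up early
}

// sendNow-style pattern: keep servicing the receiver until sending is allowed.
void sendNowStyle(MockRadio &r) {
    while (!r.canSend()) r.recvDone();
    r.sendStart();
}
```

With a radio that stays busy for more than 10 polls, the old pattern sends anyway, while the sendNow-style loop waits. The trade-off is that the second loop has no timeout, which is one reason some people pair it with a watchdog.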

glyn.hudson's picture

Re: emonGLCD - Not responding.

Hi guys, 

As of the 14th March I have updated the Solar PV example on GitHub to use the new sendNow() wrapper. Sorry for the delay in posting; I wanted to prove to myself that it has actually fixed the issue. I have been running the SolarPV example in my own home for over two weeks now and it has not crashed. This is a big improvement: before, it was crashing several times a day.

The current Solar PV example on github consists of the Solar PV overview page and a press of the top button reveals a week view history page. I have found this to be very useful as a quick check of how the week is looking solar PV generation wise. 

I have also mirrored the change in the Home Energy Monitor Example.

I will post if I find any issues but so far so good!

Solar PV example: https://github.com/openenergymonitor/EmonGLCD/tree/master/SolarPV

Home energy monitor example: https://github.com/openenergymonitor/EmonGLCD/tree/master/HomeEnergyMonitor

Thanks a lot to everyone who has helped solve this issue. It's been a real example of the power of open source software development and testing. Onwards and upwards.

glyn.hudson's picture

Re: emonGLCD - Not responding.

I have moved the LowMemDev example to the /old folder. Please don't use it; it actually seemed to crash the worst, even with the fix. I'm not sure why this is, but it doesn't really matter as the main example seems to have been proved stable.

johny5_uk's picture

Re: emonGLCD - Not responding.

I have just updated my GLCD as above but also enabled all 4 pages, so I will see how it goes.

My display used to lock up once or twice a day before. I was also pleasantly surprised that, after updating emoncms the other day, the time started to sync up with the display (I must have missed a thread on this one).

I am just running a Raspberry Pi with an emonTx.

glyn.hudson's picture

Re: emonGLCD - Not responding.

I'm happy to report that my emonGLCD has been totally stable for months now running the current emonGLCD Solar PV code 'as it is' from GitHub. Happy days!

It's great to finally (I hope!) be able to draw a line under this thread, 7 months after this issue was first raised!

I would be interested to hear from anyone else who was previously plagued by lockups and has now had success. I'm guessing from how quiet this thread has gone that everyone is now enjoying their rock solid stable emonGLCDs!

Thanks again to everyone who helped. 

robw's picture

Re: emonGLCD - Not responding.

Hi Glyn

The quick answer is yes and no.

My V1.4 is fine with the new code and has not crashed once.

My parents' V1.3 still crashes once every 1-2 days. It's had the 3.3 V mod done for the LCD.
Both are running the same code, loaded from the same laptop, bar the change for the switches.

The only difference is that mine runs from a LiPo and theirs runs from the USB supply from the shop.

Avontech's picture

Re: emonGLCD - Not responding.

Hi Rob, did you resolve the V1.3 crashing?

I've got a V1.4 that still hangs every couple of days. It runs custom code to display solar PV generation:

1 x emonTx with 3 x CT's for Import, Generation and Consumption plus temp
1 x emonTx with 3 x CT's one of each of the 3 individual inverters plus temp (Yes 3 :) )

on top of that there are:

1 x low power emonTx monitoring temperature of the hot water cylinder
1 x low power emonTx monitoring room conditions
1 x emonTx monitoring room conditions

then
4 x JeeNodes monitoring various environmental parameters

The OKG emonBase runs just fine and collates all the data to emoncms.org, but the emonGLCD hangs every couple of days.
 

robw's picture

Re: emonGLCD - Not responding.

Hi, no I did not.

I've tried a lot, and removing some code does seem to help, e.g. it now lasts 3 days, but that's still a long way off mine. It's so weird.

Well, glad to see someone else using JeeNodes too. The wife thinks I'm mad as I've got them all round the house.

Will let you know if I do find it.

Robert Wall's picture

Re: emonGLCD - Not responding.

Avontech,

You say the OKG base runs faultlessly, so I'm wondering whether the GLCD is seeing data collisions due to its location that maybe the base isn't. Is that a possibility? Also, do you have enough spare RAM in the GLCD?

Robw,

Your parents' V1.3: I'm just wondering whether the occasional burst of rubbish coming down the supply might be the problem. Have you tried extra smoothing on the 5 V or 3.3 V rails?

Avontech's picture

Re: emonGLCD - Not responding.

Hi Robert Wall

The GLCD and OKG base are about 600 mm from each other, so both should be 'seeing' the same data and getting the same collisions (if they occur). There may be collisions; maybe the OKG code (NanodeRF_multinode_bulksend) is more resilient?

The GLCD code is pretty small as I'm not using multiple pages, just an instantaneous display from the two emonTx's with CTs, based on the SolarPV code. Code and screenshot attached for info:
 

 

robw's picture

Re: emonGLCD - Not responding.

Robert

Both units are now on LiPos (for the last 3 weeks in my parents' case, mine since build) and are still getting lockups. I'm also getting lockups at my house with their kit, so it can't be the location, e.g. something in the air (radio).

I've tried a cap on the 5 V input (USB power, when we were using that) and now also on the 3.3 V input.

I've used a JeeLink to log all data on the same band and it looks clean to me.

I'm just clean out of ideas.

 

Robert Wall's picture

Re: emonGLCD - Not responding.

Robw: "I'm just clean out of ideas." I was clutching at straws too... Does the 1.3 have the same bootloader?

Avontech:

Isn't the OKG code (as far as RF receiving is concerned) almost, if not exactly, the same? I'd need to check but from memory it looks very close.

Is it worth giving this a try: Can you modify your OKG code to resend a composite package of the data (that it has correctly received) and modify the GLCD to accept the data only from the OKG and ignore the original data from the emonTx's? I can't think of a good reason why this should work when how you have it now doesn't, because the GLCD must receive the packet to extract the node ID to be able to know to discard it...  Unless it's tied up with signal strength and an incomplete package not being discarded properly - which was one line of enquiry earlier.
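A minimal sketch of the GLCD-side filter, assuming packets are broadcast (header 0, as in the send calls earlier in the thread) so that the low 5 bits of the header byte are the sender's node ID. The mask value matches JeeLib's RF12_HDR_MASK (0x1F); the base node ID of 15 and the function names are hypothetical, and the constants are redefined here only so it compiles off-target.

```cpp
#include <cstdint>

// Low 5 bits of the RF12 header byte carry the node ID (RF12_HDR_MASK in JeeLib).
constexpr uint8_t kHdrMask = 0x1F;

// Hypothetical node ID of the OKG base; use whatever yours is configured as.
constexpr uint8_t kBaseNodeId = 15;

// Extract the node ID from a received header byte.
inline uint8_t nodeIdFromHeader(uint8_t hdr) {
    return hdr & kHdrMask;
}

// Accept a packet only if the CRC is good and it came from the base.
// On the GLCD, crc would be rf12_crc and hdr would be rf12_hdr.
inline bool acceptPacket(uint16_t crc, uint8_t hdr) {
    return crc == 0 && nodeIdFromHeader(hdr) == kBaseNodeId;
}
```

Inside the rf12_recvDone() branch on the GLCD this would replace the existing per-emonTx node checks, so anything not from the base is dropped before the payload is touched.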

Avontech's picture

Re: emonGLCD - Not responding.

@Robert Wall, I see what you mean :) I'll give that a go - transmit the data (just what the emonGLCD wants ) at the same time as sending the data to emoncms .... It'll take me a couple of days to get my head into that space (it's fried in this weather! )

Robert Wall's picture

Re: emonGLCD - Not responding.

" ... my head ... (it's fried in this weather! )"  Me too. Why do you think I'm posting at 00:30 local time? ;-)

What I think you need to do: Make a new struct (like the emonTx code) containing the data destined for the GLCD, then inside the (rf12_recvDone()) branch, you extract that data like you do now in the GLCD, copy it into the new structure and send that in exactly the same way as the emonTx does. (You'll need to send the data twice - each time half old from one emonTx and half new from the other emonTx, and your new struct must be static or the display will flicker.) In the GLCD, you change the node number it tests for to the OKG's, then carry on much as you have it.
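Roughly, the base-side part could look like the sketch below. All struct layouts, field names and node IDs are hypothetical (I don't have Avontech's actual payloads), so treat it as an illustration of the "half old, half new" merge rather than drop-in code.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical payload sent by each emonTx (layout is an assumption).
struct TxPayload {
    int16_t power1, power2, power3;
    int16_t temperature;
};

// Hypothetical composite packet the base would re-send to the GLCD.
struct GlcdPayload {
    TxPayload mains;      // import / generation / consumption emonTx
    TxPayload inverters;  // per-inverter emonTx
};

// Hypothetical node IDs for the two emonTx units.
constexpr uint8_t kMainsNode = 10;
constexpr uint8_t kInverterNode = 9;

// Copy a received packet into the matching half of the composite, leaving
// the other half at its last known values.
// Returns true if the composite changed and should be re-sent.
inline bool mergeIntoComposite(GlcdPayload &out, uint8_t nodeId,
                               const uint8_t *data, uint8_t len) {
    if (len != sizeof(TxPayload)) return false;  // ignore unexpected sizes
    if (nodeId == kMainsNode) {
        std::memcpy(&out.mains, data, sizeof(TxPayload));
        return true;
    }
    if (nodeId == kInverterNode) {
        std::memcpy(&out.inverters, data, sizeof(TxPayload));
        return true;
    }
    return false;  // not a node we relay
}
```

In the base sketch the composite would be declared static (or global) so the untouched half survives between packets, and whenever the function returns true it would be sent with the same rf12_sendNow(0, &composite, sizeof composite) call used elsewhere in this thread.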

Don't blame me if it doesn't work - as I said, I can't see why it should. But if it does, it surely raises some interesting questions.
