Data loss due to RF packets getting corrupted

Hi there,

I have the emonTX v3.4, 2 CTs for PV and Grid, and a Raspberry Pi with an RFM69Pi uploading to emoncms.org. I thought it was all working pretty well, until one day the graphs in emoncms showed flat lines indicating identical values for far too long. After a bit of digging, I found that I was losing data packets (which appear as flat lines due to the "skip missing" setting).

So I have narrowed down the problem: it appears to be the RF packets from the emonTX to the RPi that are affected, but I haven't been able to solve it, so I am hoping for some help.

Using the USB/UART cable I connected to the emonTX, and could see that it was generating data every 10 seconds as expected, but the LED on the RFM69 on the Pi was not flashing every time, indicating some packets are not being received by the RPi. It looks like I am losing about 40% of packet transmissions.

I set logging level to Debug in emonhub.conf, and then also set quiet = false, and then I could see lots of packets arriving at the Pi from different node ID's, but I only have one node - my emonTX, so this can't be my data. I also saw that some packets that look like my data (node ID =10) were being discarded due to "unreliable content", and so I presume this is my real data that is being dropped.

Initially the RSSI was quite poor - around -85, so I tried moving the RPi closer to the emonTX and the RSSI improved, but didn't fix the data loss. Yesterday I had the RPi 60cm away from the emonTX, RSSI was -34, but still the data loss occurs.

Last night I removed and reseated the RFM69CW module, but it has made no difference. I have tried setting the group to 0, as mentioned in this post, to see if another group ID would be better, but I couldn't tell which byte was showing the group, so I'm not sure how to interpret the results.

Any thoughts?

I am wondering whether the other packets are legitimate and are swamping the packets from my emonTX, or whether the emonTX itself is generating additional rogue packets, or maybe the RFM69Pi is creating the spurious packets due to a hardware problem? Could I use the emonTX to 'listen' for RF packets, to narrow down whether I really am surrounded by other RF traffic, or whether my emonTX or RFM69Pi is generating the noise?

I am using the stock discrete sampling firmware on the emonTX, modified only to change the VCAL as part of calibration. The Pi is running the emonSD-13-03-15 image. I don't know if this is related, but I am getting a constant value of 297 on Key 12 in emoncms, which I believe is supposed to be T6, yet I have no temperature sensors at all, so this is more rogue data. I don't log it to a feed, so I can't tell when it started. I have tried deleting the key from the Inputs tab, but it reappears.

sample from  emonhub.log attached (hopefully)

pb66's picture

Re: Data loss due to RF packets getting corrupted

I do start to wonder if the super sensitivity of the RFM69 may be a bit of a hindrance in high-traffic areas - I wonder if it can be attenuated???

As you can see in emonhub, all the good packets start with "OK" followed by a node id; bad packets start with a question mark. But I believe all of these are seen as belonging to the group you are operating in. If you set the group id to 0 you should see a change in format that begins with the group id, identified by a "G" prefix (I believe). Those packets will not pass through emonHub as it isn't geared up for that format, but you may be able to pick out a small range of groups between g1 and g212 (or g255, depending on model and firmware etc.) that is getting less traffic, and set up camp in the middle of it.

It's not an exact science (well it probably is, just not to mere mortals like myself) but it's the best I can think of, bar trial and error.

I'm surprised to see incomplete "node 10" frames with an RSSI of -57. I would have thought that once the data was in transit it would complete unless it was very weak or knocked out by a very strong signal, which could be happening but just not be visible if that strong signal isn't passing the CRC or group checks (I'm speculating a bit here as I'm not sure of the inner workings). But I do believe totally missing packets may be caused by the transmitter not seeing an opportunity to send: as I understand it, the sender will pause sending if it detects a "busy network"; how long it will wait before giving up I don't know, nor what happens when the next packet is ready to send.

You can load the RFM2Pi firmware onto an emonTx (check your pin numbers etc). Whether that will help I don't know - the high-gain antenna may pull in even more, but maybe that would shed some light?? You may find it easier to work with the serial output directly rather than via emonhub, as the frames print more clearly rather than getting lost in the logfile; that is a little easier to do with a USB FTDI programmer and an emonTx, I guess.

Paul

Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

Paul,

Thanks for your response. I did think about changing the group ID, but if the "problem" is able to corrupt my valid node 10 packets, I suspect it would persist even if I changed the group ID.

I've just thought of one simple test I can do to see if my emonTX is generating the bad packets - turn off the emonTX and see what the RPi receives! So I did this, and with the emonTX powered off, the RPi is still receiving lots of packets from other sources, so I can rule out the emonTX as being the source of the noise.

I was then looking at options for doing the reverse - putting the emonTX into receive mode to see if the same amount of noise exists (and if not, it indicates the RPi as the source of the noise). This led me to read some documents on the Jeelib site, and they seem to confirm what you and others have written: setting the group ID to 0 should enable promiscuous mode, and allow me to see which groups the other devices are using, and then choose a group that isn't in use.

So I have been playing with this, and hitting more problems. The RF12demo wiki page (http://jeelabs.net/projects/jeelib/wiki/RF12demo) says "Set the net group used by the radio - only nodes in the same group can see each other." If this is true, then all the noise must be in the same group as my RPi (210)? What are the chances that someone nearby is using the same group?

So, I set the group to zero on the RPi to see what other groups are in use, and the log would suggest there are loads of groups in use, as the first byte in the packet is almost always different - although I am not convinced that the log is actually showing the group ID as the first byte, due to the sheer number of variations, and, perhaps more worryingly, my emonTX, which is using group 210, never appears in the log when the RPi is in promiscuous mode (see file emonhub_group0.txt).

I also tried this using minicom in case the emonhub code was modifying the log output, but this also showed group was always 0, and my group 210 packets never appeared. (see file emonhub_group0_minicom.txt)

So now I don't know whether to trust the output from promiscuous mode?

Anyway, I tried changing the emonTX and RPi to use group 200, but it made no difference - still high packet loss where the log shows the packet arrives at the RPi but is marked as invalid usually being a couple of bytes short.

I've tried turning off electrical items in case they are flooding the ether with noise (Fridge Freezer, Low energy lights, Powerline network adaptor) but still no good.

It's late and time to sleep - any ideas from anyone would be appreciated.

Ian

pb66's picture

Re: Data loss due to RF packets getting corrupted

Ok that's bizarre! The received packets should not all start with zero, If you take a look at the code

if (config.group == 0) {        // promiscuous mode: also print the received group byte
    showString(PSTR(" G"));
    showByte(rf12_grp);         // rf12_grp is the group byte of the received packet
}

When group 0 is set, the if (config.group == 0) must be true, because you see the " G". That means rf12_grp is 0, which cannot be the case, as rf12_grp refers directly to the "group id" byte of the received packet - and if it really were the case that no packets had a group id of 210 because they are all 0, you would not see any data when the group is set to 210.

That doesn't make sense to me, maybe someone can jump in and confirm my logic as I'm not overly confident and maybe missing something. All the current rfm2pi, rfm69pi and jeelab rf12demo sketches (except emonPi) seem to share that same code.

You could try updating the rfm69pi firmware, as the hex file was recently recompiled using the updated JeeLib library.

Paul

Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

Paul,

I also looked at that code, and I saw that rf12_grp should be the group ID, but I couldn't see it defined anywhere in the program. I assume it's probably defined in a library somewhere, but I haven't proved that yet.

I'll try the new firmware for the rfm69pi next

 

And thanks for your support with this ;-)

Ian

Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

I've updated the rfm69cw firmware in the RPi, but it doesn't seem to have changed anything, although I'm not sure how I can confirm that the new firmware was installed. I followed these instructions, which worked perfectly; I'm just not certain it pulled updated code, as the version is still RF12demo.12.

I have noticed that when in promiscuous mode, the first byte (which is supposed to be group ID) is always less than 32, so it really looks like it's the node-ID, and not the group ID. So maybe somehow the groupID is just not being displayed? Although even if true, it still doesn't explain why I never see my real packets from nodeID 10 / groupID 210 when I am in promiscuous mode. (unless groupID of zero isn't working as promiscuous mode as expected, and it's only showing packets that have groupID set as 0 ?)

I've tried changing the group to other values in Minicom, e.g. 1, 2, 209, 208, and there is always data arriving, so it almost looks like every group is in use - is this realistic? I live on the edge of a small town, with a handful of houses in the cul-de-sac and farmers' fields behind the house. I would be surprised if I was surrounded by real devices all using 433MHz.

I still wonder if I have a hardware problem with my RPi or RFM69pi. I bought the RFM69pi from the shop, along with the emonTX and CT's, but I already had the RPi, so I just added the RFM69pi. I have removed and reseated it, but this made no difference. I have noticed that where the aerial is connected to the RFM69, the solder on the underside of the board protrudes and is very close to the GPIO pin on the RPi. I can't tell if it's touching, but it's very close. Could this be generating 'noise'?

I think I need to get the emonTX running as a receiver, and see if it shows the same amount of noise - if it doesn't it points to a problem with my RPi

stuart's picture

Re: Data loss due to RF packets getting corrupted

It may well be noise from either a bad electrical connection or problems with the aerial.

Is the aerial the correct length (exactly?) I've had problems with this in the past.

I also had issues when the sender and receiver were very close to each other.

 

pb66's picture

Re: Data loss due to RF packets getting corrupted

If it is still reporting version 12 I suspect it hasn't worked. Although it is still essentially version 12, Glyn bumped the version number to 13 so it was distinguishable as the recompiled v12 - see "RFM12PI receiver goes hard down sometimes" for details.

"I have noticed that when in promiscuous mode, the first byte (which is supposed to be group ID) is always less than 32, "

I'm not seeing that, your logs show lines starting

? G0 135 69 .....
? G0 130 224 160 .......
? G0 10 50 21 .....

The G0 should be the group number and the next value the node id so a genuine packet from group 210 node 10 should look like 

OK G210 10 9 0 9 255 0 0 0 0 147 92 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (-53)

IMO it's very unlikely (but not impossible) to be the Pi, but if the RFM board is close enough to cause concern you should do something about it to be sure: either trim the protruding solder joint back slightly, or, as a temporary measure just to rule it out, put a thin piece of plastic between the board and the GPIO pins in question (cut a small piece from something like plastic packaging or a bottle), or even just slide that end of the board up the pins a millimetre or two.

Since the code that is directly related to the issue you are experiencing isn't working correctly, it may be a little too soon to assume its root is elsewhere.

Looking a little closer at the 58 frames in your minicom log, some observations (and a loose conclusion or two):

54 of them have the same length (which is 4 bytes short of a correct packet).

Of those 54, only 3 have an RSSI and node id that match the good packet.

There are also 54 packets in total in the RSSI range -74 to -83 and only 4 in the range -52 to -55.

So I suspect most of the rogue data is coming from a single origin.

There is only 1 "OK", but there are 6 node 10s, 4 of which also match the rogue length,

and of those 4, 3 have a similar value signature to the good packet, just cut short by 4 bytes.

But 2 node 10s have the rogue RSSI and very different values; if they made it to emoncms they would contaminate your data with incorrect values.

Can you see any difference in the speed or quantity of packets when selecting group 0? Is it possible exactly the same data is being shown but with "G0" added? if so it is possible promiscuous mode isn't even being initiated and the other groups are not being shown

Paul

Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

I've updated the RFM69 firmware to version 13 (thanks for the link Paul) but it hasn't changed the output in emonhub.log or via minicom, and some packets from my emonTX are still flagged as corrupt.
> 0v
[RF12demo.13] O i15 g0 @ 433 MHz
 ? G0 26 169 226 12 133 196 130 131 223 50 161 100 162 131 70 179 115 48 40 110 27 (-78)
 ? G0 0 97 154 160 46 6 134 100 77 219 63 194 39 23 10 12 25 77 98 45 240 (-78)
 ? G0 8 36 214 100 20 138 108 50 23 140 139 2 169 227 13 36 197 22 227 137 141 (-82)
 ? G0 30 88 133 206 221 43 29 203 225 139 102 92 158 169 3 141 108 220 140 (-81)
 ? G0 28 198 0 187 183 113 194 88 32 122 33 8 220 248 165 248 77 47 (-80)
 ? G0 2 216 11 247 68 244 235 135 77 19 39 83 197 56 200 2 120 188 105 75 10 (-78)
 ? G0 26 96 1 69 6 222 107 76 213 4 72 192 160 252 139 118 102 193 204 161 85 (-80)
 ? G0 17 11 103 140 97 61 150 102 47 148 145 221 202 139 129 180 132 241 72 129 24 (-78)
 ? G0 13 135 200 187 93 142 0 150 (-79)
 ? G0 29 4 40 42 36 132 17 47 93 35 206 182 16 31 154 70 229 8 76 185 212 (-81)
 ? G0 20 210 47 239 233 14 173 169 148 229 25 214 175 154 204 4 4 216 66 12 55 (-77)
 ? G0 29 70 71 102 51 74 190 120 235 72 37 62 5 201 243 61 225 158 209 32 108 (-74)
 ? G0 8 106 75 41 66 76 60 231 162 62 91 210 66 97 68 149 53 64 56 35 159 (-78)
 ? G0 11 124 47 142 101 184 213 253 16 94 186 41 145 159 95 72 158 193 195 116 58 (-78)
 ? G0 9 208 185 244 223 180 236 68 11 221 143 5 216 118 155 158 72 5 63 227 198 (-77)
 ? G0 10 102 246 62 46 200 83 131 13 192 10 184 224 11 178 150 154 15 130 156 88 (-80)
 ? G0 14 174 160 215 208 20 225 176 7 126 20 131 242 127 35 86 215 246 228 252 214 (-78)
 ? G0 16 174 87 56 126 210 72 157 153 143 26 166 213 128 21 2 222 53 249 212 41 (-79)
 ? G0 12 112 231 55 162 241 128 151 241 128 158 112 96 200 248 205 77 169 250 124 71 (-79)
 ? G0 24 61 173 197 152 84 138 214 152 214 30 27 48 71 56 214 143 240 102 128 80 (-81)
 ? G0 4 209 131 136 40 53 192 90 104 35 244 83 236 92 169 15 32 55 236 22 184 (-79)
 ? G0 4 25 197 24 75 169 144 34 194 74 120 228 58 245 66 248 52 194 68 248 55 (-80)
 ? G0 0 167 2 185 145 143 208 196 64 104 126 132 49 214 153 99 167 79 184 199 10 (-76)
 ? G0 7 80 (-79)
 ? G0 3 192 193 24 129 235 140 251 174 164 52 147 188 72 26 16 242 74 108 16 213 (-78)
 ? G0 10 24 47 137 221 18 133 19 144 38 115 35 192 240 176 80 123 237 61 122 32 (-76)
 ? G0 20 183 16 47 36 171 24 66 105 71 63 25 149 119 109 91 87 196 81 168 225 (-79)
 ? G0 27 123 53 176 134 165 194 169 239 136 194 217 69 87 7 22 202 94 31 218 71 (-79)
 ? G0 31 115 233 1 180 205 98 55 68 201 29 81 168 97 138 205 89 193 169 47 65 (-76)
 ? G0 9 25 11 51 249 58 177 53 237 (-81)
 ? G0 3 55 206 144 0 43 238 3 84 78 212 174 70 8 131 0 39 206 166 69 129 (-78)

@Paul, you were right that some of the entries my previous attached files showed a first byte with a value greater than 31, but today, all of the output I have looked at shows a value of 31 or less. So maybe my first attempt at updating the firmware did "something". However, even with V13, promiscuous mode still doesn't show me my packets from my emonTX (node 10), or anything with an RSSI close to my emonTX (now -51)

Can you see any difference in the speed or quantity of packets when selecting group 0? Is it possible exactly the same data is being shown but with "G0" added? if so it is possible promiscuous mode isn't even being initiated and the other groups are not being shown

The quantity and speed of packets is about the same when using group 210, group 0, or any group number at all for that matter, and the RSSI values are the same, usually between -74 and -82. So I agree with you - it looks like group 0 is not having the effect we are expecting. About the only difference I can notice is that when I select group 210, I can then see the data from the emonTX as well as lots of other data. If I change the group, I see lots of other data except group 210. It almost looks to me as though group 0 is only showing me data that does not have a groupID (hence it excludes my group 210 traffic), but when I set group 210, I get my group 210 as well as everything else.

 

I have tried lifting the RFM69 slightly to leave more gap between the RFM69 and the GPIO pins, and also inserted a piece of plastic as suggested, but the MiniCom output looks about the same. I restarted the RPi but this made no difference.

Is the aerial the correct length (exactly?) I've had problems with this in the past.

It looks like it's about 15.5 - 16cm - slightly hard to be accurate as it's slightly curved where it is soldered to the RFM69, and I don't want to risk damaging the connection. Is this close enough to the 164.7mm  or do I need to measure more accurately?

pb66's picture

Re: Data loss due to RF packets getting corrupted

At some point over the weekend I will try and set something up to run some proper tests, but I just tried a jeelink that has the rf69pi sketch on it and found the same problem with the group 0 setting.

I do not really have an RF issue, but I do lose more emonTH packets than I would like, although most do get through.

via a serial monitor with standard settings I could see my usual packets and nothing unusual.

selected 0g and it was like the aerial had been cut off, nothing received at all (so my known good packets do not get through in promiscuous mode)

selected 210g and it sprang back into life!! tried it a few times in case it was a fluke, it wasn't!!

back to standard settings I tried 0q and found as expected a few rogue entries in addition to the valid frames.

Selected 0g while 0q was still in place and the flood gates opened to loads of rogue entries all of which showed "G0" and not a single valid packet in sight!

So it seems like promiscuous mode blocks valid packets or packets with any group defined (it can't block group 210 as when set to 0g there is no reference to 210)

Promiscuous mode does however, remove the "filter by group" as the flow of invalid packets increases significantly when switching to 0g.

I will need to digest that some more and run some tests but I think we can say with some certainty "it ain't right!!!"

Paul


pb66's picture

Re: Data loss due to RF packets getting corrupted

Attached is about 45-60 mins of log, still no good packets getting through though. My "live" emonHub base and my emonPi are still receiving the same packets so the good packets are getting lost in the jeelink firmware not in the air, apparently due to setting 0g.

Paul

emjay's picture

Re: Data loss due to RF packets getting corrupted

My interpretation of the logs is that there is a relatively weak RF noise source that is causing most of the problem.

Here is the logic:

The RFM69 series has a more sensitive receiver than the  RFM12B.  Its state machine is designed to work on complete packets, rather than the previous byte-level host interaction.  The packet engine is triggered by some criteria that happen very early in the packet reception, namely some preamble arriving, the "snapshot" RSSI > the settable threshold and a valid SYNC pattern matched.  This pattern is a magic byte (0x2D) concatenated with the group byte.

A noise source sitting somewhere in the receive channel is likely to decode as a random bit stream continuously once the RSSI threshold is exceeded.  At some small probability (~ 16th power of 0.5), that stream will look like the beginning of a valid packet by matching the magic byte and group number. The packet engine will start stuffing the buffer with junk...  The CRC naturally fails and the "packet" is trashed.  However, the Rx section is essentially "blinded" during this time to the arrival of a valid packet start - the state engine is already past the packet preamble recognition stage and will not look again until the junk packet reception is terminated and the Rx section put back into Rx ready.
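
To put a rough number on that "~ 16th power of 0.5", here is a quick stand-alone C++ illustration (not part of any of the sketches discussed here) of the odds of random noise matching the two sync bytes - the 0x2D magic byte plus the group byte (0xD2 for the default group 210):

#include <cstdio>

int main() {
    // noise has to reproduce 16 specific sync bits before the packet engine starts
    double pFalseSync = 1.0 / 65536.0;   // 0.5 to the 16th power, one chance in 2^16
    std::printf("false sync match probability ~ %.2e per candidate\n", pFalseSync);  // ~1.53e-05
    return 0;
}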

This can be verified by tweaking the RSSI threshold to a value higher than reported by that stream of junk packets. The good packets are way stronger, so there is considerable wiggle room to try. 

In RF69.cpp there is an initialisation section early on that looks something like this:

0x25, 0x80, // DioMapping1 = SyncAddress (Rx, Packet Mode)
0x26, 0x07, // DioMapping2 = Disable ClkOut, POR = Fxtal/32
0x29, 0xB4, // RSSI threshold    
0x2E, 0x88, // SyncConfig = sync on, sync size = 2    
0x2F, 0x2D, // SyncValue1 = 0x2D    
  // 0x30, 0x05, // SyncValue2 set at runtime

 

Just for the diagnosis, change 0x29 to a less sensitive level (the threshold in dBm is -[register value]/2, so the register holds -2 x the desired threshold). E.g. 0x80 (-64 dBm) would allow the good packets through, but not trigger on the noise.
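
To make that arithmetic concrete, here is a small stand-alone helper (plain C++, not part of RF69.cpp) that converts a desired trigger level in dBm into the register value, on the assumption stated above that RegRssiThresh holds -2 x the threshold:

#include <cstdio>
#include <cstdint>

// RegRssiThresh (0x29) encoding: register value = -2 x threshold in dBm
static uint8_t rssiThreshReg(int thresholdDbm) {
    return static_cast<uint8_t>(-2 * thresholdDbm);
}

int main() {
    std::printf("0x%02X\n", rssiThreshReg(-64));   // 0x80, the less sensitive value suggested above
    std::printf("0x%02X\n", rssiThreshReg(-90));   // 0xB4, the default shown in the init table
    return 0;
}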


Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

Hi Emjay

I follow what you are saying, and it could explain the large amount of "noise", but why would it also affect real packets from a valid node? My logs show that some of the real data packets are received, but discarded, and these are usually 4 bytes short, so presumably fail the checksum test. If the RX process is filling the buffer for a valid packet, why wouldn't all the bytes get loaded ?

I'll try to work out how to make the change to the RF69.cpp file you suggested

Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

I've just been looking at RF69.cpp, and notice the following at line 153:

void RF69::configure_compat () {
    initRadio(configRegs_compat);
    // FIXME doesn't seem to work, nothing comes in but noise for group 0
    // writeReg(REG_SYNCCONFIG, group ? 0x88 : 0x80);
    writeReg(REG_SYNCVALUE2, group);

So maybe this is a known problem?

pb66's picture

Re: Data loss due to RF packets getting corrupted

Thanks for jumping in emjay.

So basically, if I understand correctly, the receiver commits itself to processing what it thought was a good packet before realizing it has only fluked the initial checks, and it cannot move on to another packet until it has fully processed and disposed of the bad one.

When it's ready for another packet, any good packet already in progress will be seen as a partial packet with the all-important first part (including the group) missing, so it also fails. Only the good packets that happen to arrive after a packet has been completely processed and the receiver is ready for them, but before another (bad) packet turns up, will make it through.

And by setting promiscuous mode, the flood of bad packets pretty much eliminates any chance of that happening.

"Just for the diagnosis, change 0x29 to a less sensitive level, (it is - RssiThreshold / 2 dBm) E.g. 0x80 would allow the good packets through, but not trigger on the noise." by this are you suggesting reducing the noise so that the stronger packets are able to be seen to identify a quieter group?

presumably with less "noise" or junk the good packets will show the correct group number? that's the part that really threw me and made the code "appear at fault" 

In my first post on this thread I said "I do start to wonder if the super sensitivity of the rfm69 maybe a bit of a hindrance in high traffic areas, I wonder if it can be attenuated??? " is it feasible to tone down the reception using the rssi threshold in high traffic on a more permanent basis by making it a user setting?

Paul

 

pb66's picture

Re: Data loss due to RF packets getting corrupted

Interesting find Ian,

emjay's picture

Re: Data loss due to RF packets getting corrupted

Ah yes, the dangers of a quick FIXME LATER patch.  The required fix is actually an adaptive RSSI threshold, then to be really confusing, that two line patch introduces a new bug!  The code is fine for a non-zero  group . The state machine uses a two byte match to the pattern of magic number + group as expected.

Setting group to zero unfortunately leaves the byte match count at two, with pattern of magic number + 0.   Not good, the only "packets" that can match are noise packets, hence never seeing a "good" packet in the last promiscuous log.

This secondary bug is easily fixed by uncommenting the writeReg(REG_SYNCCONFIG, group ? 0x88 : 0x80) but the background EMI versus the RSSI threshold remains as the likely root cause here.
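
For clarity, this is roughly what the corrected routine would look like, assuming the register names and structure of the RF69.cpp snippet Ian quoted above (a sketch only, not verified against any particular JeeLib release):

void RF69::configure_compat () {
    initRadio(configRegs_compat);
    // 0x88 = sync on, 2 sync bytes (0x2D + group); 0x80 = sync on, 1 sync byte (0x2D only),
    // so group 0 no longer needs a second sync byte of zero to match
    writeReg(REG_SYNCCONFIG, group ? 0x88 : 0x80);
    writeReg(REG_SYNCVALUE2, group);
    // ... remainder of the routine unchanged
}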

  > is it feasible to tone down the reception using the rssi threshold ?

Yes indeed - IMHO this should be tied to Tx level back off.  If your received signals are 20 - 30 dB above the noise floor, then that is more than enough for error-free bit decoding with the standard packet size; higher suggests turning down the Tx level, saving power and being neighbourly in the ISM band.   This needs a dynamic evaluation of the noise floor (which may well vary by time of day etc), setting the RSSI threshold sensibly above that, then adding a protocol to say "your last n sent packets were well above the level I need to decode"   Feasible, but perhaps a bit onerous for what is meant to be a lightweight driver, perhaps best left for a userspace implementation with some assist from the driver.

pb66's picture

Re: Data loss due to RF packets getting corrupted

How did you get on with this Ian, any joy?

Paul

Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

Hi Paul,

Not yet - last week I was trying to work out how to change the code (I am a newbie with Arduino and the IDE and C++), and eventually decided to get a Jeelink to give me another option for testing the changes. The Jeelink arrived yesterday, and tonight I will start playing with it. I'll update if I make any progress.

Ian

Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

An update.

I have been testing changes to the RSSI Threshold as suggested by emjay, but it's not having the effect I was expecting. To avoid the testing affecting my data that is processed by the RPi, I bought a Jeelink, which is connected to my Windows 7 laptop, and is now working (after the fun and games described here).

So, I modified RF69.cpp and changed the line 

// 0x29, 0xDC, // RssiThresh ...

to

0x29, 0x40, // RssiThresh - changed to exclude noise

and then compiled and uploaded just fine, but no data appeared in the Serial Monitor from the Jeelink. I know there is data being transmitted, as my RPi is receiving the data from the emonTX. I re-read and spotted that emjay said the value is divided by 2, so 0x40 would be 64 / 2 = 32, i.e. a -32 dBm threshold, too strict for my signal at around -50. So I played around with the threshold value, increasing it from 0x40 to 0xC0 at various intervals. Once I reached 0x70, data started to appear in the monitor, but none was from my emonTX, and the RSSI was around -79 to -84.

When I changed the value to 0xA0, I finally got to see the data from my emonTX, with RSSI around -59, but I still see the other traffic with lower RSSI (i.e. -92 to -98). I thought the purpose of the threshold was to exclude the weaker signals, but in fact it seems to be excluding the stronger signals - the opposite of what I thought we were trying to achieve?

I couldn't actually see anything in RF69.cpp that compares the RSSI of a packet to the constant, and rejects or allows the packet, so I guess this check is done in some other code somewhere?

I attach the sample results, in case I am interpreting it incorrectly. 

emjay's picture

Re: Data loss due to RF packets getting corrupted

The RSSI threshold value is used internally by the RFM69 state engine as the very first criteria for deciding if incoming reception is worth processing further.  The reported RSSI value is taken later in the process after the AGC and AFC adjustments are made.   For a real packet, on frequency, the initial and final value will agree within a few db.

The results suggest there is a noise source in the channel that  has a short duration - it is strong enough to trigger the decode process, but has faded away by the time the reported RSSI value is evaluated. The packet contents are as you might expect, just junk.

Another suggestion from the results is that the emonTx and the monitoring JeeLink do not agree on what is 433 MHz i.e. the Rx uses the AFC to correct a static error between the two crystals involved.  If you have a recent rfm12 demo loaded, there is an O command for introducing an offset to the carrier frequency.  You can use this to sweep across a range (say +/- 200 KHz) around the default carrier frequency and see the transition from almost no emonTx packets - good reception - almost no packets.  The midpoint of that is a good approximation to the carrier offset.


AK_emon's picture

Re: Data loss due to RF packets getting corrupted

I am a new emon user. Just hooked everything up this evening. I have the same setup as Ian was describing and I am having the same problem. I was wondering if Ian (or anyone else) was able to figure out the issue?

Some good RX frames, but mostly bad RX frames.
Screen shot of emonhub.log attached.

emjay's picture

Re: Data loss due to RF packets getting corrupted

@AK,

Could you capture a few more of the failing packets that have the RSSI value ~54 ?  The contents appear reasonable, but the CRC has clearly failed.  Are the 'missing' four bytes of payload just an artifact of the debug output?

The other failing packets appear radically different - junk contents and much lower received signal strength ~90. This suggests an interferer on or close to the selected channel.


pb66's picture

Re: Data loss due to RF packets getting corrupted

Hi emjay,

I think the "missing" 4 bytes may be quite significant, as I have seen the same characteristics.

IMO it is unlikely to be a debug characteristic as most of the rf "base station" sketches are based on the RFdemo.12 sketch and the "raw" packet is logged unprocessed in emonhub. 

As you may have noticed in my observations of Ian's logs above, he has many weaker frames that are 4 bytes smaller, but more importantly he has some apparently good, strong packets that even look to be carrying similar values, yet they are both 4 bytes short and rejected.

Ian is using an RFM69Pi add-on board, which is an RFM69 and a 328 on a GPIO-connected board running "RFM69CW_RF12_Demo_ATmega328.ino", with "stock" emonhub v1.2 on the Pi, whereas AK is running a significantly different "emon-pi" variant of emonhub, which means he may also be using the emonPi RFdemo.12-based sketches (or possibly an RFM2Pi or RFM69Pi).

Confirmation of the HW plus some longer logs from AK would be helpful.

I wish I had more time (and the knowledge) to look into this myself but have not been able to; I am now losing a significant number of packets too. I did try "Adding a trace_mode to rfm2pi firmware" - aside from the same "lack of full promiscuous mode" limitation it seems to work, but it involves continuously switching groups to get a better picture, and it's not really sufficient for monitoring individual channels (groups).

Paul

 

emjay's picture

Re: Data loss due to RF packets getting corrupted

@Paul,

Thanks for the definitive links.  That nails the truncated CRC error case I think:

  (Line 652): showString(PSTR(" ?"));

  (Line 653): if (n > 20) n = 20; // print at most 20 bytes if crc is wrong

Some longer logging with this 20 limit commented out would be useful.

Also, for testing, is it relatively simple to stuff a couple of 0xAA's spaced out towards the end of the packet overwriting a couple of those 0x00 bytes?  The long string of constant zeros is putting a strain on the Rx clock recovery.


pb66's picture

Re: Data loss due to RF packets getting corrupted

Ahh OK! That makes sense - the packets that fail CRC are truncated TO 20 bytes (data only), not BY 4 bytes; that is just coincidence due to the same (or similar) remote sketch being used, which happens to have a 24-byte payload.

That explains the output, but not why they fail the CRC checks; I had thought the checks failed due to the packets being incomplete.

Perhaps Ian can also try commenting out line 653 and 654 of the RFM69Pi sketch and provide a log too as his errors seem to be quite consistent.

Paul

emjay's picture

Re: Data loss due to RF packets getting corrupted

Hmm - overriding n to 25 or 60 something for the failing CRC case might be safer...

 

pb66's picture

Re: Data loss due to RF packets getting corrupted

I had similar thoughts but assumed there would be a 66byte cut-off somewhere, would it not be better to be sure we are seeing all of the payload rather than a truncated one, to hopefully spot why it fails the crc?

Perhaps replace those 2 lines with something like

if (n > 70) {
    showString(PSTR(" Only 20 of "));
    showByte(n);
    showString(PSTR(" bytes shown"));
    n = 20;
}

to prevent runaway or grossly oversize packets while retaining scope for up to a full 66-byte payload plus a few bytes.

Paul

EDIT - I'm thinking more of a permanent change to the firmware for the benefit of future users rather than just a temporary change for this 24-byte case; not all users have the set-up for playing with sketches on the rfm2pi's, so G&T provide compiled hex files (plus it's a pain in the backside to keep changing sketches).

Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

Hi guys, sorry I have been quiet on this lately, just been too busy. However, I will try and find time to do the testing suggested and will post back. 

Oh, I forgot to say that I did try emjay's suggestion of changing the frequency offset, but it didn't really help. I can post the log from the test if required, otherwise I'll try and focus on the code change suggested today.

Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

OK guys, I managed to do some testing, but I'm not sure it helps much. However, it's quite late, and I might have made some simple mistake, so feel free to groan if you spot anything simple...

As a reminder, I am doing my testing on my Jeelink connected into my windows 7 laptop, to avoid affecting my RPi. My laptop / jeelink is in the same room as the RPi so distance to emonTX is similar.

I started by using the existing RF12demo.12 that I last used in the Jeelink, and commenting out the two lines that would truncate bad packets to 20 bytes. The results are in filename = rf12demo with crc truncation bypassed.txt

There are a few things interesting in this test - 1) the size of some of the bad packets is huge. 2) There are no bad packets from my emonTX (which is node 10), and 3) in most of the bad packets, there is a common string towards the end of the packet "0 45 1 0 0 232 3 0 0 0 0 0 0 197 0 196 0 192 0 193 0 194 0 198 0 1 0 0 58 62". It can't be coincidence, but equally it might not be significant. Also, the Jeelink is picking up fewer packets (good or bad) than the RPi, but I guess this could be due to differences between the two devices as emjay mentioned previously.

However, Paul asked me to use the new RF69 sketch, so I downloaded, compiled, and loaded into the Jeelink. The results are in filename = RFM69CW_RF12_Demo_ATmega328_1.txt This also showed no bad packets for my node 10, and again not many packets at all compared to what the RPi was seeing during the same time period.

I then commented out the two lines, to mirror the test I had done with my original rf12demo sketch, and the results look very similar. However, there are two packets that look like valid node 10 packets except the last but one byte has changed from 115 to 230.

? 10 9 0 43 255 0 0 0 0 27 94 0 0 0 0 0 0 0 0 0 0 0 0 230 0 (-61)

OK 10 10 0 45 255 0 0 0 0 37 94 0 0 0 0 0 0 0 0 0 0 0 0 115 0 (-60) 

? 10 10 0 110 255 0 0 0 0 4 94 0 0 0 0 0 0 0 0 0 0 0 0 230 0 (-56) 

OK 10 9 0 57 255 0 0 0 0 233 93 0 0 0 0 0 0 0 0 0 0 0 0 115 0 (-57) 

(RSSI is different as I had to move the laptop to plug in to power supply!)

Now, 230 is exactly double 115, and I don't believe in coincidences !

This also got me wondering - what is this 115 value? I logged in to emoncms, and this value is showing up on the Input screen at key 12 - which I believe is supposed to be a temperature sensor, but I don't have any such sensors!

I've run out of time to try Paul's mod to the bad-packet printing, but let me know if you think it will help, and I'll get to it.

Ian

P.S. I've just looked back at the console, and am seeing a few more of the rejected packets where the "115" has become "230":

OK 10 11 0 73 255 0 0 0 0 106 94 0 0 0 0 0 0 0 0 0 0 0 0 115 0 (-51) 
 ? 10 10 0 53 255 0 0 0 0 104 94 0 0 0 0 0 0 0 0 0 0 0 0 230 1 (-51) 

? 10 12 0 70 255 0 0 0 0 77 94 0 0 0 0 0 0 0 0 0 0 0 0 230 1 (-51) 

 

emjay's picture

Re: Data loss due to RF packets getting corrupted

@Ian,

Well spotted ref the 115/230, that is the key to resolving one issue. A quick diversion into how FSK receivers work.

On the transmit side, you define how fast the stream of bits is stamped on the carrier - in this case, the Tx baud rate. How precise is that? The clock used is based on some divider ratio off the main on-module crystal, itself accurate to ~20 ppm.  The corresponding RF is launched through the air and processed by a Rx section.  Extracting the bit stream requires a matching data clock, derived from the different on-module crystal - so a potential mismatch of ~2 x 20 ppm if the tolerances are in opposite senses.  Not too bad, but even with exactly matching clocks, we are forgetting the phase - is the Rx clock leading or lagging the Tx clock? 

This classic problem is solved in the Rx bit slicer section by using the actual data bit transitions to drag the two clocks into pseudo synch. For various reasons, there is considerable jitter on the bit transitions, so this is done by some phase lock loop equivalent that starts with the data clock estimate, then tweaks it slightly faster or slower.  Some timing slop is acceptable as long as it is less that 0.5 x bit time anywhere during the packet reception.

Note that this relies on the packet contents having frequent bit transitions for the bit slicer to lock on to - guaranteed at a packet start due to the preamble (alternating bits) and then various flags/headers.  But not guaranteed in the packet payload as you see from that string of null bytes after 37, 94.

From the trace, we see that the sampling clock is running a bit slow - since MSB is sent first, when the error exceeds 0.5 x bit time, this has the effect of shifting the bit pattern one position left - effectively promoting the perceived value in that byte by x 2.  The 115 becomes 230.

So how about the next byte? How does 0 get promoted to 1?  Remember that there are two further bytes in the packet that are not put into the Rx buffer - the CRC 16bit check value.  Well, by chance the MSB of the first CRC byte is a one - this gets promoted to be the LSB in the previous byte, the last byte of the payload.

Now the picture should be clearer - the mismatched clock is messing up the bytes right at the end of the packet by a single bit shift left, including the hidden CRC value.  The packet has to fail the CRC check.
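
A small stand-alone worked example of that single-bit slip (plain C++, illustrative only; the first CRC byte's MSB being 1 is taken from the reasoning above):

#include <cstdio>
#include <cstdint>

// each received byte ends up shifted left by one, with its new LSB taken from the
// MSB of the byte that follows it in the air
static uint8_t slipped(uint8_t b, uint8_t next) {
    return static_cast<uint8_t>((b << 1) | (next >> 7));
}

int main() {
    uint8_t lastButOne = 115, last = 0, firstCrcByte = 0x80;   // 0x80: MSB is 1
    std::printf("%d %d\n", slipped(lastButOne, last), slipped(last, firstCrcByte));
    // prints "230 1" - exactly the corrupted tail seen in the rejected node 10 frames
    return 0;
}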

Can this be verified?  Sure - give the bit slicer a fair chance to do its job by inserting some bit transitions in that 'empty' part of the payload.  Ideally a couple of 0xAA spaced out, replacing a couple of nulls - but if this is inconvenient on the application side, then at least replace several 0x00 by 0xFF and filter them out as flags later.

If we get this issue wrapped up, then the logs will clean up and the symptoms should be clearer - there is at least one more issue to resolved.


emjay's picture

Re: Data loss due to RF packets getting corrupted

@Ian,

  • There are a few things interesting in this test - 1) the size of some of the bad packets is huge

This is just an artifact - remember there are two classes of bad packets.  The junk packets and the almost correct node 10 packets.  A bit circular, but my reasoning is that if we can trust the buffer contents at all when CRC is failing, then we can trust the received length. NodeID and early bytes are matching a known good packet, so it is reasonable to print out based on the received n.

The converse is true for the junk packets - the contents are all noise, including the received n value.  Printing out beyond the buffer length is not useful and indeed may show some consistent pattern since this is dragging data from some undefined memory area. Luckily it is a read operation, so no harm done.


pb66's picture

Re: Data loss due to RF packets getting corrupted

@Ian

I have altered a copy of the latest emonTxV3_4_DiscreteSampling.ino v1.8 sketch to use "out of scope" fall-back values, so the temperature values should only ever read zero when the temperature really is 0°C. (attached)

MartinR's sketches use this concept so it is obvious when a sensor develops a fault or isn't connected.

The current emonTx sketches tend to set the unused ct readings to zero and some users even use a "if less than" algorithm to zero the solar inverter ct overnight. I think this conditioning should be done at emoncms as sometimes the info is useful (eg if the inverter isn't showing 10w overnight maybe there's a problem). If this resolves this RF issue (which I expect it will) maybe we should try and avoid forcing zero values.

This probably won't affect you if you are not using the temp sensors, but the 300/301 'statuses' can be ignored by using "-300", "allow neg", "+300" processes in the input processlist. This could be better than the current method in that it will not record a 0 to a feed, so if we were to unplug a sensor for an hour and plug it back in, the data recorded would reflect the temperature then and now, rather than an incorrect 0°C throughout the hour.
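
To illustrate the idea (the names below are hypothetical, not the identifiers actually used in the attached sketch - see the attachment for the real change):

const int UNUSED_SENSOR = 3010;      // 301.0 degC, impossible for a real DS18B20
const byte MAX_TEMPS = 6;

struct PayloadTX { int power1, power2, vrms, temp[MAX_TEMPS]; };
PayloadTX emontx;

int readTemperatureTenths(byte i);   // placeholder for the actual sensor read

void fillTemperatures(byte numSensors) {
    for (byte i = 0; i < MAX_TEMPS; i++)
        emontx.temp[i] = (i < numSensors) ? readTemperatureTenths(i)   // real reading, 1/10 degC
                                          : UNUSED_SENSOR;             // out-of-scope marker, also keeps bit transitions in the payload
}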

 

alexr's picture

Re: Data loss due to RF packets getting corrupted

Hi all,

I too had noticed this behaviour on my recently purchased emonPi + emonTx setup, and had had a similar hunch to what @emjay explained above about FSK and clock synchronisation. When my programmer arrived, I changed the emonTx sketch to hardcode some temperature values, and can confirm that this greatly increased the success rate of packets being received by the emonPi.

Setup:
emonTx 3.4 + emonPi about 30cm from each other, emonTx powered from 9V AC adapter and other sensors connected. Recently ordered so I'm presuming running latest versions.

Before / After:
With stock firmware the success rate of packets getting through was about 1 in 3. Hardcoding the temperatures transmitted by emonTx to (randomly picked) 0, 100, 200, 300, 400, and 500 (/10 degrees) changed the success rate of packets getting through to about 1240 in 1250 (logged over 20 mins).

What remains:
I managed to snapshot a log of one of the remaining failed packets (attached). Looking at the binary equivalents it seems that there was an off-by-one-bit type error before the voltage got transmitted (I had no current sensors connected, hence a long series of zeros for the current readings).

Thoughts:
I've heard of techniques (I think they are called line coding) which exist to get round this sort of problem. Afaik they encode data in a particular way such that there are no long streams of 0s or 1s in the transmitted signal, and then decode on the other end to recover the original data. A software implementation of one of these may be useful in this case. Alternatively, stuffing in 0xAA bytes suitably frequently would also work, albeit arguably less elegantly.

emjay's picture

Re: Data loss due to RF packets getting corrupted

@Paul, thanks - just inserting the non-zero temperatures should be enough bit transitions to knock the Rx bit clock drift on the head.   The next logs should be illuminating.

A reasonable question is why was this phenomenon not seen with the RFM12B Rx section?  Turns out the RFM69 bit slicer claims a much shorter lock in time - the downside is that increased sensitivity allows it to drift off faster.

The better solution is to turn on data whitening, then there is no payload content dependency.  Unfortunately, a change of RF packet format would be disruptive to the installed base (and no, I don't think it is possible to detect dynamically in the general case).  
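
Purely for illustration, enabling the hardware whitening would be a register tweak along these lines (register layout taken from the RFM69 datasheet; readReg/writeReg are assumed to be the same helpers quoted earlier, and as noted this changes the over-air format so every node would need it):

// RegPacketConfig1 is 0x37; its DcFree field (bits 6:5) selects 00 = none,
// 01 = Manchester, 10 = whitening
void enableWhitening() {
    uint8_t cfg = readReg(0x37);     // current packet config
    cfg = (cfg & ~0x60) | 0x40;      // set DcFree = 10 (whitening), leave the rest alone
    writeReg(0x37, cfg);
}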

pb66's picture

Re: Data loss due to RF packets getting corrupted

Thanks Alex, it's always reassuring when independent debugging arrives at the same place :-)

I don't think we can implement a change in encoding (easily), but I see we have 2 easy options: one is to just alternate the "rarely zero" and "may be zeroed" values to avoid the long strings (4 CTs = 8 bytes = 64 zeros), which isn't "organised" (or pretty); the other is to just not zero the results. The main reason they get zeroed is so that users who do not have CTs connected see only zeros rather than noise. If there is a CT connected but not "active" there will be noise, and that noise could prevent this RF issue without affecting use of those CT channels - the ones not being used can just be ignored.

The same "out of scope" method won't work for ct's as they can use the full range.

@emjay - yes, I'm quite keen to see what we have left too

"A reasonable question is why was this phenomenon not seen with the RFM12B Rx section?" another contributing factor maybe that the 6x unused temperature values (96 consecutive bits) were not previously included by default and far more temp channels go unused than ct channels so that alters the odds too.

Paul

PS Do you think there is likely to be any fix to the 0g issue in JeeLib? Or can you think of a workaround?

update - I've just tried lifting some of the "if ct connected" logic in that sketch and even with all the channels reading but not connected I cannot get any noise (sods law), so that knocks that idea on the head!!

To be fair, it's quite uncommon to have an emonTx running without any CTs at all.

Robert Wall's picture

Re: Data loss due to RF packets getting corrupted

CTs:

The present emonTx design grounds the input when no plug is present, this 0 count input is detected and disables the processing on that input. However, the data structure is always initialised to zeros, and that's where the zero values come from.

If you disable the disabling 'if', and process the input just the same, then the filters adjust down from their initial state of 512 counts and output zero after about 5 blocks of 200 ms, so both ways you get zero output.

A quick and easy work-round would be to define dummy bytes (or words) in known positions in the data structure, stuff them with 0x0A or 0xAA, and that would provide something to synchronise to, and could easily be ignored in emonCMS.
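
Something along these lines, perhaps (field names are hypothetical, not the actual emonTx payload structure):

struct PayloadTX {
    int power1;
    unsigned int pad1;   // dummy word, always 0xAAAA - gives the bit slicer transitions
    int power2;
    unsigned int pad2;   // dummy word, always 0xAAAA
    int vrms;
};
PayloadTX emontx;

void setup() {
    emontx.pad1 = 0xAAAA;   // set once, never carries data; ignored in emonCMS
    emontx.pad2 = 0xAAAA;
}

void loop() {
    // take readings, fill the real fields and send emontx as usual
}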

Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

@Paul, I have uploaded your modified sketch, and will run overnight, and check in the morning. But very early indications look good. Thanks to emjay Alex, and Robert for your support so far. 

 

Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

I've just been having a look at the data from overnight, and there are still quite a lot of null entries in emoncms. I'll need more data to determine how much improvement we have made, and will update later - might be tomorrow as I have a busy day today.

Ian

Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

Initial analysis of the performance shows a big improvement. The updated code (which fills the unused temperature values with "3010") has now been running in the emonTX for half a day, so I compared the 12 hours before the update with the 12 hours after it. The number of null packets reported by emoncms has reduced from 29% to 17% - a good reduction, but I hope we can still improve further. I'll pull some logs from the RPi next, and upload soon.

 

pb66's picture

Re: Data loss due to RF packets getting corrupted

Glad you're making progress Ian. How are you counting "null packets"? I'm not sure counting the null datapoints on a fixed-interval feed is accurate unless the two independent intervals are synced somehow.

Try adding an accumulating phptimeseries feed to one of the temp sensor input processing chains (before the filter); when you look at the graph you should see a regular stepped increment, and any missed packets will stand out as a longer horizontal between verticals.

Paul

Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

Hi Paul,

I am counting the packets within emoncms. I am using the Input panel, selecting my CT1, and the Data Viewer then shows a graph. I then set the start and end time to show a 12 hour period, untick "skip missing" and then hit the "Show CSV Output" button to display the values. I then copy & paste into excel and use a Filter to count the entries with null. I have now attached the file. I expect this will provide the same results as your suggestion, but I will try that as well, in case it shows any differences.

Ian

emjay's picture

Re: Data loss due to RF packets getting corrupted

@Paul Ian, no packet dump?  I'm a low level guy, need feeding a bit stream to function ;-)

BTW, what is the power arrangement on the Pi ?  There is a solid RF ground reference for the negative rail? Assuming that the power brick will do this for you can be fallacious.

 

pb66's picture

Re: Data loss due to RF packets getting corrupted

@Ian, I would assume that method is pretty accurate and certainly gives you a good overall 'end to end' success rate, which is what you are trying to improve overall, The results indicate that 17% of the datapoints recorded by emoncms were null for one reason or another and that 83% will have recorded a value due to being updated at least once during the 10s interval.

The 17% could indeed reflect only dropped packets, but could also include packets lost in transit or filtered for some reason or just delayed before being issued a timestamp, possibly leaving a +10s interval.

Possibly, none of these will have an impact, I just suggested using the phptimeseries to be sure to show every packet recorded by emoncms with a timestamp that may (or may not) reveal any time patterns.

@emjay, a serial console is currently the easiest place to see a stream of data. I have previously had emonHub outputting all packets to a csv file which could be opened directly in Excel; I found Excel extremely useful for byte-level debugging, and it is so much easier on the eye too when data is displayed in columns.

I have never looked that closely at a Pi in that way, and have never thought to question the power/ground plane relationship. The Pis get powered by a wide variety of power sources (as do emonTx's). I recall some users have added ground plane "tentacles", but I wouldn't comment on the science behind it beyond accepting that it is normal practice for "CB" and "ham radio" antennae to have ground plane radials.

Ian's comment about the signal strength of the Pi + RFM2Pi compared to the PC + JeeLink might suggest his particular Pi set up is reasonably well matched?

(Although it's a long way from tuning "SWR" and the like, the visual "trace_mode" I added to the RFM2Pi firmware could help "tune" reception.)

Paul


emjay's picture

Re: Data loss due to RF packets getting corrupted

@Paul,

I'm assuming we can move on to the next stage of debugging soon viz. what is that disruptive noise source popping up in the channel?   The RSSI values for the wanted signal are good - that's coupling in to the Rx section just fine.  But the noise floor is not so good - could be just location or (the reason for the GND question) unexpected very local hash.   Once you have switching power bricks involved, some attention is needed to what is defining the RF ground.
 

pb66's picture

Re: Data loss due to RF packets getting corrupted

"what is that disruptive noise source popping up in the channel?" Which data are you looking at?

and what do you refer to as the "noise floor"? the consistent background noise (minus any spurious peaks) or the max level of any unwanted signal or the average ?

Paul

edit - It would be nice to see a fresh "minicom output" from Ian to see what's left.

emjay's picture

Re: Data loss due to RF packets getting corrupted

@Paul,

Sorry, I'm being less than clear.   Assuming the mis-decoded real packets are now better behaved, that leaves the frequent trash packets that are interleaving (and sometimes keeping the Rx section busy when a correct packet is due in).

Their reported behaviour when the RSSI threshold was swept across a reasonable range was odd, appearing when the reported RSSI value was well below the threshold.   Definitely needs some more packet level trace to get to the bottom of this.

RSSI threshold is as per the spec sheet - basically the trigger level for the packet engine to do any processing on what may turn out to be a packet.

I'm using 'Noise floor' as the general background hash seen by the Rx section - it is a combination of energy coming in between the ANT pin and RF GND + internally generated noise (mostly LNA).   Better than -90 dBm is typical for a reasonably "quiet" environment.

This noise is typically following a Gaussian distribution (so there is a significant probability of short lived peaks) - best seen on a spectrum analyzer as a distinct lower band of constant activity.  Once an FSK signal is about 20db above this fuzz, it will decode correctly for the packet lengths we are looking at, when the important knobs are set reasonably (RxBw to match Tx deviation etc).


Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

Hi guys, I've been looking at the results tonight, and trying not to get too excited... but I think you've fixed the RF problem !

I know emjay wanted some proper debug output, and as Paul noted, the results I provided from emonCMS was the end-to-end view, and data could have been getting lost after it arrived at the RPi, so I thought I would start by looking at the emonhub log from the RPi to get a feel for the data, and the numbers of error packets, and frequency of errors. As I was reviewing the log file from today, I just wasn't seeing the gaps in valid data that I could normally spot by reviewing the logs.

So I went to emoncms and took a look at the phptimeseries that Paul had suggested I setup to track the dummy data provided from the unused temperature sensor, and the graph was a perfectly smooth line. But as I zoomed in and with "skip missing" unticked, I eventually saw gaps in the graph, as confirmed by the CSV data, which showed null entries.

But these null entries didn't quite tie up with the emonhub.log. So I took a different approach. As the emonhub.log file gets rolled automatically (looks like when it gets to 5MB?) I copied the backup log (emonhub.log.1) off the RPi to analyse in Excel. The log covered 6 hours 18 minutes and 47 seconds from 15:08 to 21:27 today.

Reviewing the log showed the following: all of the entries from my Node 10 were complete - no errors! Although there were entries that appeared to be from node 10, when I checked them, they were clearly random noise, as the other bytes did not match my node 10 packets (e.g. CT3 & CT4 were not zero, and the Temperature sensor bytes were not 194 and 11). So I have 2107 valid looking packets from 6Hr 18 min 47secs of log. At one packet every 10 seconds I would expect 2272 packets, but the emonhub log shows that the interval between packets is between 10 and 11 seconds - in fact it's more often 11 seconds between each packet. At an 11 second interval, I would expect 2066 packets, so the 2107 is nicely between the 10 and 11 second interval.

I have *just* worked out how I can process this file further, and look for any gaps greater than 11 seconds, but that will have to wait until tomorrow. 

So, this seems to be great news on the RF problem. Oh, and to test this, I reloaded the normal sketch into the emonTX and watched the Terminal output, and saw packets being received by the RPi with some flagged as corrupt. I reloaded Paul's sketch with dummy temperature values, and those corruptions stopped. Result!!!!

The only snag is that it doesn't yet explain why emonCMS is still showing null packets. I was thinking it could be due to the interval not matching the 10 or 11 seconds seen at emonhub, and I was trying to understand that, but I decided to document and share the great news first, and come back to emonCMS tomorrow with fresh eyes.

I think you guys have solved the RF problem, and it was fascinating as you explained how the Rx timing works, and how clever and subtle the processing is! It's been a wonderful education learning from you! 

Right, I'm off to bed. Night

Ian

 

pb66's picture

Re: Data loss due to RF packets getting corrupted

It would be useful to see the latest minicom output, Ian. The "gaps" in the timeseries feed may be due to the node 10 packets without a zero CT3 & CT4 value; perhaps the "255 0 0 0 0" part of a valid frame is also a problem for the syncing on occasions, as eight 1's will have the same effect as eight 0's. I would hazard a guess that hardcoding a chosen value (21845 would be good) into the sketch for CT3 & CT4 may make a difference: if there is a full '0 1' cycle after the eight 1's, it may pull back into sync if it's less than half a bit out at that point. Something along these lines is what I have in mind below.
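
Only a sketch of the idea - the field names are indicative, so check them against your copy of emonTxV3_4_DiscreteSampling.ino:

    // 21845 = 0x5555 = 0101010101010101 in binary, so an unused CT slot
    // transmits a constant 0/1 alternation for the bit-slicer to lock on to.
    if (!CT3) emontx.power3 = 21845;
    if (!CT4) emontx.power4 = 21845;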

Paul

 

Ian Davies's picture

Re: Data loss due to RF packets getting corrupted

Hi Paul,

As requested, here is some Minicom output for about 4 minutes, but I don't think it shows any problems. From another analysis of the emonhub.log.1 file, which covers just over 6 hours, and by comparing the timestamp between each successive packet, I see 22 missing packets out of 2070 - just under 1% error. That is a massive improvement, hence I think you have fixed the RF problem with your updated program that fills the empty temp bytes. I have now turned quiet mode back on, so the emonhub.log should hopefully capture a full 24 hours of traffic, and I'll check the data loss again.

The remaining problem that I was referring to in my previous post is that although the RPi is now successfully receiving 99% of the packets, the number of null entries in the emoncms DB seems much higher, even though the emonhub log shows that all of the uploads to emoncms.org are successful. I'm not sure if I was clear last time that I am using emoncms.org and not a local emoncms.

So to try and understand where this problem might be appearing, I tried to compare the data from emonhub.log with the data in emoncms.org, using the timestamp to match and compare. One point that was immediately visible is that the interval between packets arriving at the RPi is often 11 seconds, but as emoncms.org is expecting 10 seconds, I wondered if the nulls displayed in emoncms are just a mismatch caused by emoncms expecting a 10 second interval while the data is sometimes delivered every 11 seconds, leading to "gaps" appearing in the graphs?

I am also aware that I started this topic to discuss the RF packet loss, and this thread is already pretty big, so I suggest I should start a separate thread to discuss this part of the problem. After all, assuming you are happy that the RF problem is fixed, maybe there should be discussion about adopting or developing this fix to help others?

Ian

pb66's picture

Re: Data loss due to RF packets getting corrupted

That's good to see. The log entries show a strong difference (+20 dB) between the background noise/interference and the valid signals: all the strong packets pass and all the weaker ones fail. I would say that looks pretty successful, and the credit for that should go to emjay - his help has been invaluable.

The ~11 sec interval is not an uncommon thing; most looping sketches either "sleep n secs" or use an "if x > n secs" test, so the loop duration plays a part (see the illustration below). But 11 secs vs 10 secs should not cause that many nulls in emoncms (in theory) - as the two cycles rotate, it would only rarely cause a whole 10 sec interval to pass without an update.
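
Purely to illustrate the timing (this is not the actual emonTx code, and the function names are made up), the usual pattern is something like:

    void loop() {
      takeReadings();   // sampling the CTs and AC voltage takes roughly a second
      sendPayload();    // then transmit the payload over RF
      delay(10000);     // then "sleep 10 secs" - so the real interval is nearer 11 secs
    }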

That is, as you say, a different discussion, but at least you can now do some real tests with a fairly consistent post rate. Try putting a phptimeseries and a phpfina feed one after the other on a spare input (temp?) and compare after a few days, perhaps.

I will submit a pull request and see how G&T feel about the modded sketch

Paul

321liftoff's picture

Re: Data loss due to RF packets getting corrupted

Just read this thread (Paul, thanks for the reference) and I'm having the same issue.  It appears that there is a solution, but it has not been merged on GitHub yet.  Can you confirm that using the modified emonTxV3_4_DiscreteSampling.ino in the post above on July 24 solves this?

I don't have a USB to UART cable, but it looks like that needs to go on the wishlist!

Any reason why this has not been posted to GitHub?

pb66's picture

Re: Data loss due to RF packets getting corrupted

To my knowledge, this was the only "known" case (since diagnosed), so it may get implemented if there is a demand. In this instance, Ian wasn't using the temp sensors, so there was no feedback on using the "positive error reporting" - he just needed them to not be zero.

I haven't looked at the sketch recently, so off-hand I do not know whether we need to re-introduce the mods into an up-to-date sketch, or whether the attached is still valid. Either way, it will certainly prove the point, and the mods are no big deal to add into a sketch.

The mod to the rfm2pi firmware is not so straightforward unless you have the environment set up for it. The rfm2pi can be programmed directly from the Pi if you can access the Arduino IDE on a GUI (screen and keyboard, or remote desktop); otherwise you will need to compile on another machine and transfer the hex to the Pi to upload to the rfm2pi. Either way, you will need the modified avrdude to upload to the rfm2pi - see the rfm2pi wiki for details.

It sounds a lot worse than it actually is, if you want to try updating the rfm2pi firmware while you wait for the USB programmer. In fact, once you have the environment set up for the rfm2pi, you could also use it for the emonTx if you have a couple of link wires to go from the GPIO to the 6-pin FTDI header on the emonTx.

Paul

EDIT - just added a link to the previous posts for completeness (emonTH Unreliable reading timing?)

emjay's picture

Re: Data loss due to RF packets getting corrupted

Yes, looks like a mild case of the same issue - the bit-slicer is drifting out of sync due to that long string of null bytes in the packet.  The bit failure(s) are mostly hidden since they are to the right of where a bad-CRC packet gets truncated.

The fairly frequent but much weaker bad packets are essentially noise.   I'd like to work on this area too (it probably affects more users than the bit-sync slip), but it requires a driver change on the Rx side. I appreciate not all are set up to rebuild that image.  Anyone game to try?

 

321liftoff's picture

Re: Data loss due to RF packets getting corrupted

Thanks for the responses.  It will be a few weeks until I get a programmer, but at that point I'd be willing to help improve the driver of the rfm2pi.

Paul, to what changes to the rfm2pi are you referring?  I don't recall reading any suggested changes to the rfm2pi in this thread, other than modifying the encoding scheme to ensure that a string of zeros is never possible.

The latest emonTx sketches just have some updates to the pulse counter code, so I'll give your sketch a go and see how it improves things.  That will be another datapoint to help determine whether this should be implemented on GitHub.  I, too, am not using the temperature sensors, so I expect to see similar positive results.

djh95's picture

Re: Data loss due to RF packets getting corrupted

Hi, this is my first post.  I've just received my first emonTX and RFM69Pi and have spent a good part of the weekend trying to get them working together with an old Raspberry Pi, and with limited success!

The data loss described here is plaguing me too.   I'm a newbie, so this thread has been fascinating reading - I see I have lots to learn! 

Out of interest - am I correct in thinking that a quick fix might be to purchase/install a temperature sensor on the emonTX?

Anyway, big thanks to everyone who is working on this issue!  This is a great open source project.

 

Robert Wall's picture

Re: Data loss due to RF packets getting corrupted

That might provide a solution. To my way of thinking, if you have a programmer, the far better solution would be to edit your emonTx sketch (and the RPi to suit, if necessary) so that the RF payload consists of only the quantities that are actually being used - so the bit stream transmitted is not filled with a (useless to you) string of zero bits.
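
A sketch of the idea only (field names are illustrative, and the payload structure must be changed identically at both ends - the emonTx sketch and whatever decodes it on the Pi):

    // Trimmed payload: only the quantities actually in use, so the
    // transmitted bit stream never contains a long run of zeros.
    typedef struct {
      int power1;   // grid CT
      int power2;   // PV CT
      int Vrms;     // AC rms voltage
    } PayloadTX;
    PayloadTX emontx;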

pb66's picture

Re: Data loss due to RF packets getting corrupted

Yes, adding a single sensor might only reduce the issue: since the positions get populated in order, one sensor reduces the 6 positions × 2 bytes × 8 bits (96 all-zero bits) to 5 unused positions (80 bits). You could be fairly confident that 2 or 3 sensors would reduce the run to a manageable level, since the case of all 4 CTs being unused (64 bits) is not known to cause an issue (see the layout below).

...and should the sensor(s) happen to be in a location that is ever around 0°C, the issue could recur.
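
To put numbers to that, the stock payload is laid out roughly like this (field names indicative only, not the exact sketch):

    typedef struct {
      int power1, power2, power3, power4;   // 4 CT channels, 8 bytes
      int Vrms;                             // AC voltage, 2 bytes
      int temp[6];                          // 6 temperature slots, 12 bytes
                                            //   = 96 consecutive zero bits when unused
      unsigned long pulseCount;             // pulse counter, 4 bytes
    } PayloadTX;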

Paul

 

321liftoff's picture

Re: Data loss due to RF packets getting corrupted

I finally bought a USB-UART cable, got my environment set up, and tested uploading code to the emonTx. To my surprise, the GitHub code was updated on Oct 24 to change the default temp values from 0 to 3000, so my test upload also fixed the issues I was having!
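
For anyone else finding this thread, the change is (as I understand it - paraphrasing, not the exact diff) simply initialising the unused temperature slots to an impossible value rather than zero, along the lines of:

    // 3000 represents 300.0 degC - clearly impossible as a reading, and
    // never serialises to a long run of zero bits.
    for (byte i = 0; i < 6; i++) emontx.temp[i] = 3000;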

Thanks for your help, Paul.
