

peterhinch commented on July 19, 2024

That is a very interesting observation. In ~48 hours of running I too experienced one lost message; I hadn't spotted this cause.

However I did spot an additional possible cause: an aaagh! moment. I figured that the cause might be an outage which was long enough to swallow a message but too short to be detected. I still think this is a possibility, to which the only solution I can see is to use your suggestion of ACK responses. I'm currently working on a branch which does this.

My initial reluctance was because I had seen some quite complex code in communications drivers and thought I had a simpler way. But given that we have an efficient dedupe I don't think it's too bad.

Getting back to the problem you identified. It strikes me that there are other scenarios in which messages may be received in quick succession, and we're missing an interlock in ._reader:

                if not mid or isnew(mid):
                    # Read succeeded: flag .readline
                    while self._evread.is_set():
                        await asyncio.sleep_ms(self._tim_short)
                    self._evread.set(line[2:].decode())

(this includes the fix for the message 0 problem you identified). This has survived a quick test with the qos demo.
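The interlock pattern above can be sketched in standard CPython asyncio (the library itself runs on MicroPython's uasyncio, where the Event used here carries a value; the ValueEvent class below is a hypothetical stand-in, not the library's actual class):

```python
import asyncio

class ValueEvent:
    """Hypothetical stand-in for the value-carrying Event used in the
    library; CPython's asyncio.Event does not carry a value itself."""
    def __init__(self):
        self._evt = asyncio.Event()
        self.value = None

    def set(self, value):
        self.value = value
        self._evt.set()

    def clear(self):
        self._evt.clear()

    def is_set(self):
        return self._evt.is_set()

    async def wait(self):
        await self._evt.wait()

async def reader(evread, lines, tim_short=0.01):
    for line in lines:
        # Interlock: don't overwrite an unread message; wait until the
        # consumer has cleared the event before setting the next value.
        while evread.is_set():
            await asyncio.sleep(tim_short)
        evread.set(line)

async def consumer(evread, out, n):
    for _ in range(n):
        await evread.wait()
        out.append(evread.value)
        evread.clear()

async def main():
    evread = ValueEvent()
    out = []
    await asyncio.gather(reader(evread, ["msg0", "msg1", "msg2"]),
                         consumer(evread, out, 3))
    return out

received = asyncio.run(main())
```

Without the while loop, a second set() issued before the consumer runs would silently overwrite the first value.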

I've pushed a (hopeful) fix for the two problems you identified, but no ACKs yet.

Sorry for my slow response: my laptop croaked. New battery ordered and fingers crossed...

from micropython-iot.

kevinkk525 commented on July 19, 2024

Oh, if you think it's possible to miss a message even though the connection seems stable, then ACKs are the only possibility. Although I would be quite happy to see ACKs in this library, I'm not convinced that messages can be lost unnoticed while the connection is stable.
I did implement ACKs in a different approach, but implementing them in my header-support branch would be quite easy. I have to pull your latest changes first.

Your quick-succession fix will make things better, but it buffers at most a few messages before the socket starts dropping them. Lowering the evread yield time could help too.
But this is always the downside of a non-callback approach without (larger) buffering.


peterhinch commented on July 19, 2024

Try

        self._evread = Event()

A well-written application should have a single coro which is (as near as possible) permanently reading.

Issuing messages in quick succession has the potential to crash the ESP8266 so apps should avoid it by limiting the number of coros doing independent writes. With the above fix, do we have a real problem for apps conforming to the above guidelines?

I'm in a similar dilemma over ACKs. The code is written but untested. It just sends 'xx\n' as an ACK, where xx is the ID of the message being acknowledged. Alas, I don't know if such brief outages are possible, or whether the socket can cope with them if they occur. I can't think of any way to test this, and I don't want to add code to fix a nonexistent problem. Maybe the one missed message in 48 hours was down to the problem you identified rather than my (possibly imaginary) issue.
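The 'xx\n' framing described can be sketched as follows; the two-hex-digit encoding and the helper names are assumptions for illustration, not necessarily the library's actual wire format:

```python
# Minimal ACK framing sketch: two characters of message ID plus a newline.
# The hex encoding and the helper names are illustrative assumptions.

def make_ack(mid: int) -> bytes:
    """Build an ACK line for message ID mid (0..255)."""
    if not 0 <= mid < 256:
        raise ValueError("mid must fit in two hex characters")
    return "{:02x}\n".format(mid).encode()

def parse_ack(line: bytes) -> int:
    """Recover the message ID from a received ACK line."""
    return int(line.strip().decode(), 16)

ack = make_ack(0x2A)        # b"2a\n"
```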

This is the difficulty with this (and MQTT). So much empirical testing and so little deep theoretical understanding.

In testing this (and MQTT) I use four approaches to creating outages: disabling the AP, putting the unit under test with its power source in the microwave, walking it beyond WiFi range and back, and leaving it running for very long periods. These have sometimes produced different outcomes, with the last being the most critical. But I can't think of a way to repeatedly generate outages << 1 sec.


kevinkk525 commented on July 19, 2024

That would definitely help with the problem.
However, I "fixed" the problem server-side by ensuring a delay of at least 20 ms between writes. So if the client uses an Event with a 20 ms yield time (qos included, keepalives not) and processes the messages quickly, it won't lose any. I'm aware that this limits the throughput, but the main goal here is a relatively fast and reliable solution. We could still later implement sending a throughput-throttling configuration on connect to support different scenarios/devices.

Your fix of waiting for the evread event to be cleared basically just trades one loss for another: now new messages are dropped in favour of old ones, whereas before the fix old messages were dropped in favour of new ones. So it comes down to which behaviour you prefer; personally I think dropping old messages is better than dropping new ones, although dropping anything is of course never a good thing in normal operation.
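The trade-off can be made concrete with a toy single-slot buffer; the policy names here are illustrative, not library identifiers:

```python
# Toy model of the two single-slot policies discussed above:
# "drop_new": the fix's behaviour - an occupied slot rejects new messages;
# "drop_old": the pre-fix behaviour - a new message overwrites the old one.

def run(policy, messages):
    slot = None
    dropped = []
    for m in messages:
        if slot is None:
            slot = m                 # slot free: accept the message
        elif policy == "drop_new":
            dropped.append(m)        # slot busy: discard the new message
        else:                        # "drop_old"
            dropped.append(slot)     # slot busy: discard the old message
            slot = m
    return slot, dropped

kept, lost = run("drop_new", [1, 2, 3])    # keeps 1, loses [2, 3]
kept2, lost2 = run("drop_old", [1, 2, 3])  # keeps 3, loses [1, 2]
```

Either way something is lost once the slot is occupied; the policies only differ in which end of the stream pays.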

The way the MQTT client handles received messages, by executing a callback, ensures that no messages get lost; the user has to keep the callbacks short or the device crashes from OOM. This is more transparent for the user than messages being silently dropped, as the user might otherwise wonder where those messages went and blame the library, the hardware, or similar. I'm not saying it's better, it's just different.
A typical developer, however, should be aware of the implications of each approach and should have no problem writing an app that conforms to the guidelines. A typical end user will just use the library (like the MQTT client) and does not have to worry about losing messages or the inner workings of the library.

Brief outages may occur if the WiFi is used heavily and transmissions clash, but this is typically handled by the WiFi implementation. It should not be a case the library needs to handle.

So basically if we have reasonable doubt that the current approach might not cover every outage, wouldn't it be better to just stick to the ACK approach that has been tested very well in mqtt instead of testing a different approach?


peterhinch commented on July 19, 2024

I take your point, but I have tested our current approach fairly extensively. My optimistic view is that the one failure I witnessed was down to the bug you identified.

I'm loath to add complexity unless there is a real problem to be fixed. I don't know if outages << 1s are possible, and, if they are, whether the socket would notice them.

The orthodox solution is indeed to use ACKs. But the problem which MQTT fixes is different from ours. The internet can suffer huge latencies, so a sequence of messages must be sent before even the first ACK is received; the alternative of waiting for each ACK would reduce throughput to a hopeless degree.

With a wired server we can detect outages quickly. Further the ESP8266 will crash if it pumps messages into a dead socket; we have hardware limitations which preclude high throughput. The best we can do is detect outages fast and design apps to limit the number of messages sent in that period. So I (for the moment) retain my pig-headed liking for this simpler approach which I think matches the advantages and limitations of our hardware environment. As always I'm willing to be convinced by empirical evidence or by theoretical insight.


kevinkk525 commented on July 19, 2024

We can work with an optimistic view, but if we want to be sure about outage detection, we should either ask someone who knows this for certain or use the orthodox solution.

I understand you want to keep the library as small as possible. It's already very close to the compilation limit; just a few changes of mine made it necessary for me to use a precompiled file.
I'm not sure using ACKs would be much more complex. I might try it and compare the sizes.


kevinkk525 commented on July 19, 2024

Experience using the library in a real application:
Another reason why I prefer ACKs is that the current approach spawns a lot of additional coroutines for _do_qos. I now have to increase my uasyncio waiting queue for the first time, just to prevent my program (pysmartnode) from crashing at startup when publishing multiple messages, because awaiting a write no longer means that the resources are actually freed.
If I wanted to make sure all resources are freed before sending another message, I'd have to do the same as _do_qos: wait for the timeout and check the connection.
For me that's not actually an improvement. I didn't think of it as a big deal when I saw your code and tested it, but now, after implementing it in my pysmartnode project for comparison, I'm actually not that happy with the solution.
So I would either have to increase complexity in my application to stay within resource limits, also resulting in lower throughput, or increase the uasyncio waiting queue. That's a doable workaround, no doubt, but the library makes it a lot easier to get a queue overflow with this _do_qos implementation.

Comparison between library qos and ACKs:
Current implementation
If you think about throughput: with one coroutine sending, you can send at most 15 messages during the timeout timeframe; then the 16 uasyncio waitq slots are full. Of course almost no program is designed to have only one main coroutine, but sending 15 messages during 1.5 seconds seems possible.
That would be a throughput of 10 messages per second. Increasing the uasyncio waitq easily increases the throughput, but also the RAM usage; in a common scenario, however, you would not increase the waitq size. As sending is cheap (time-wise), there is no real limit on sending messages if you don't expect any answer.
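The arithmetic above can be checked with a quick sketch (the 16-slot waitq, the single sending coroutine, and the 1.5 s timeout are the figures assumed in the comment):

```python
# Back-of-envelope throughput estimate for the _do_qos approach,
# using the figures quoted above (assumptions, not measured values).
WAITQ_SLOTS = 16        # uasyncio wait queue size quoted above
SENDING_COROS = 1       # the single coroutine doing the writes
QOS_TIMEOUT_S = 1.5     # period each _do_qos task occupies a slot

max_in_flight = WAITQ_SLOTS - SENDING_COROS   # 15 pending _do_qos tasks
throughput = max_in_flight / QOS_TIMEOUT_S    # 10 messages per second
```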

Using ACKs
Using ACKs has no theoretical throughput limit in this scenario as far as uasyncio waitq slots are concerned.
The real throttling comes from the ESP itself: I get a latency of about 120-200 ms for sending a message (including headers) and receiving an actual answer (~50 bytes), including converting it (after lowering the Events' yield times). So with ACKs the throughput would be about 5-8 messages per second. That won't win an award either, but it's not so bad. Using a simple ACK instead of a full answer, and testing without header support, might lower that latency a bit further, but I don't have ACKs implemented yet, only a system that answers requests (actually an MQTT proxy running on the server).
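The 5-8 messages per second figure follows directly from the quoted round-trip latency; a one-line check:

```python
# Converting the measured 120-200 ms round-trip latency into throughput.
latencies_s = (0.120, 0.200)                           # quoted range
rates = tuple(round(1.0 / t, 1) for t in latencies_s)  # (8.3, 5.0) msg/s
```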

Conclusion
A direct throughput comparison between the two approaches is not really fair, though: having one main coroutine send all the messages is not common, and it does not account for RAM usage, where the ACK approach would probably do better despite the library being a little more complex (depending on how big the data you are sending is).
Therefore, in a real application with about 6 coroutines used by the application, only 10 messages can be sent during the timeout interval, resulting in only 6-7 messages per second, just like using ACKs.

What to do?
In my opinion, changing the qos approach could be a good move, as the current implementation is actually not the more resource-saving one, although it might look like it when only reading the code or sending messages very rarely.
The current approach is also not suitable for high throughput and puts the obligation of keeping resources in check on the user, who can't just send messages as fast as possible without risking a queue overflow.
Using ACKs, however, has a constant resource usage, making it easier to use. The user only has to think about the resources his own program adds and can send messages as fast as he wants.

I might add a comparison of complexity/filesizes tomorrow when I replace the current qos implementation with ACKs to see the difference.


peterhinch commented on July 19, 2024

I'm convinced, not least because I lost a message in each direction overnight. My approach doesn't work and I'm at a loss to explain why.

I will implement my minimal ACK scheme. It also has the potential advantage of enabling flow control, which would prevent the receiver-overrun issue you identified: transmission could be paused until outstanding ACKs were received, trading throughput for robustness.

I'm impressed and surprised that you're getting that throughput without crashing the ESP8266. I never anticipated more than a very few concurrent write coros, with only a handful of messages being sent in the timeout period.


kevinkk525 commented on July 19, 2024

Sounds good to me. Then we have a direction. I'm a bit sad though that your approach for qos does not work as expected.

Oh, I did not actually test whether the ESP was happy with 10 messages per second under the current implementation. That was just a theoretical thought based on uasyncio in an optimistic scenario.


peterhinch commented on July 19, 2024

There are some interesting design decisions in implementing ACKs.

One aim is to eliminate out-of-order (OO) messages. Initially I plan to do this with strict flow control: an attempt to send a message will block until the previous message has received its ACK (possibly after retransmissions). But this may unduly limit throughput (in the absence of an outage).
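The strict flow control described can be sketched in CPython asyncio (class and method names here are illustrative, not the library's API; retransmission on timeout is omitted):

```python
import asyncio

class StopAndWaitSender:
    """Sketch of stop-and-wait flow control: write() blocks until the
    previously sent message has been acknowledged."""
    def __init__(self, transport):
        self._transport = transport   # async callable sending one line
        self._pending = None          # mid awaiting its ACK, if any
        self._acked = asyncio.Event()
        self._acked.set()             # nothing outstanding at start

    async def write(self, mid, line):
        await self._acked.wait()      # block until the previous ACK
        self._acked.clear()
        self._pending = mid
        await self._transport(line)

    def got_ack(self, mid):
        # Called by the receive path when an ACK line arrives.
        if mid == self._pending:
            self._pending = None
            self._acked.set()

async def demo():
    sent = []
    async def transport(line):
        sent.append(line)
    s = StopAndWaitSender(transport)
    await s.write(1, "first\n")
    s.got_ack(1)                      # simulate the peer's ACK
    await s.write(2, "second\n")
    s.got_ack(2)
    return sent

order = asyncio.run(demo())
```

A real implementation would re-send the pending line if got_ack() is not called within the timeout; with only one message in flight, out-of-order delivery cannot occur.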

I have in mind relaxing this so messages can overlap. My proposed solution would avoid spawning multiple do_qos tasks, with a single continuously running do_qos coro handling a list of messages awaiting an ACK. But I don't think this is possible while retaining the no-OO restriction:

You send messages x and x + 1. x doesn't get its ACK but x + 1 does. So x is re-sent and is received OO. I don't see how to avoid this without rigorous flow control. Re-ordering on the receiver implies latency for message x (in the event that its ACK was lost). It's also resource-heavy on an ESP8266. A can of worms.

Implementing a qos==0 option has me foxed at the moment, as you need to stop the receiver from processing an ACK which may never arrive. So, in the presence of lost ACKs, you end up with a lengthening ignore list. I guess a timeout will be needed.
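The timeout idea could look something like this (an illustrative sketch, not library code; the class name and TTL value are assumptions):

```python
import time

class IgnoreList:
    """Remembers message IDs whose eventual ACK should be discarded,
    but only for a bounded time, so lost ACKs can't grow the list."""
    def __init__(self, ttl_s=60.0, clock=time.monotonic):
        self._ttl = ttl_s
        self._clock = clock
        self._expiry = {}                    # mid -> forget-after time

    def add(self, mid):
        self._expiry[mid] = self._clock() + self._ttl

    def should_ignore(self, mid):
        self._prune()
        return self._expiry.pop(mid, None) is not None

    def _prune(self):
        now = self._clock()
        self._expiry = {m: t for m, t in self._expiry.items() if t > now}

# Deterministic demo with a fake clock instead of time.monotonic.
now = [0.0]
il = IgnoreList(ttl_s=60.0, clock=lambda: now[0])
il.add(7)
in_time = il.should_ignore(7)     # ACK arrives within the TTL: True
il.add(8)
now[0] = 120.0                    # entry for 8 expires before its ACK
too_late = il.should_ignore(8)    # False: the entry was pruned
```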

At the moment I have copious comments in the code for you to study. I'll post when I have something that works at this basic level.

I did think ACKs would be hard to do...


kevinkk525 commented on July 19, 2024

I'm implementing it too currently so we can compare which approaches are best. Done for today though, head hurts.

Eliminating OOO messages by waiting for the ACK is the best way to do it. Throughput is no more limited than with the old approach. Having messages overlap would make the code significantly more complex and is in my opinion not necessary. If you want high throughput, just use qos 0 and write as fast as you want.

Suppressing the receiver from sending ACKs is no problem in my case, because I'm implementing it on top of my header support, so I already have the structure for sending the mid, qos, and message_type information.

