Describe the bug ijson.items(file, pref

Thanks <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-u

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Aha! In this case, we just need to add an option to use <a href="htt

ijson.items(file, prefix) waits for EOF about ijson HOT 8 CLOSED

green-green-avk commented on August 23, 2024

ijson.items(file, prefix) waits for EOF

from ijson.

Comments (8)

green-green-avk commented on August 23, 2024 1

Uh. Huh. It's about streams, not files at all.

Please, consider this one-liner repro case:

python3 -c "$(echo -en 'import time\nwhile(True): print("{\"test\":1}", flush=True);time.sleep(1)')" | python3 -c "$(echo -en 'import sys\nimport ijson\nfor v in ijson.items(sys.stdin.buffer, "", multiple_values=True): print(v)')"

from ijson.

rtobar commented on August 23, 2024

Thanks @green-green-avk for the interesting report. Do you have an example file that can be used for testing this? In particular, I'd like to understand/confirm that you really don't get any values yielded until you hit the EOF. How are 100% certain of that? The example code looks correct of course, so I suppose your report is based on observations of some kind.

If f is an actual file on disk, and you opened it via open, you could try adding a print(f.tell()) in your for loop to see how much the file has been advanced every time you get a value.

from ijson.

rtobar commented on August 23, 2024

Also: does this happen with other backends? Just noticed this was reported for the python backend, so maybe (hopefully!) ithe issue, if there's one, is specific to this backend.

from ijson.

green-green-avk commented on August 23, 2024

Ouch, it seems, I missed buf_size=1 stanza so bad here...

It works:

python3 -c "$(echo -en 'import time\nwhile(True): print("{\"test\":1}", flush=True);time.sleep(1)')" | python3 -c "$(echo -en 'import sys\nimport ijson\nfor v in ijson.items(sys.stdin.buffer, "", multiple_values=True, buf_size=1): print(v)')"

However, I think, the exact semantics of the buf_size option deserves to be documented.

from ijson.

rtobar commented on August 23, 2024

@green-green-avk thanks for that reproducer and the clarification on it being a continuous, potentially infinite stream of data (you did mention it in the original description, and I was a bit puzzled about what you exactly meant, but didn't ask further). In any case, it being a "stream" or a file on disk is irrelevant: ijson is presented with a file object regardless, and the f.tell() test should still be possible (haven't tested myself though, don't have a computer at hand ATM).

Am I understanding correctly that there is actually no issue, and that the actual problem was that your stream was generating data in chunks smaller than the default buf_size of 64kB? And not only that, but also that your streams in total were smaller than 64kB, since you were getting values only after writing your whole stream? If that's the case, could you please go ahead and close this bug report to keep things tidy? If I'm misinterpreting your last comment and there is a bug, then you'll need to wait until I get my hands on a computer with some time to check this.

To clarify: my understanding is that ijson was blocking when reading data from stdin via sys.stdin.buffer.read(64k). The buffer object is a io.BufferedIOBase whose read method tries to read as much data as requested (or until EOF). That's not ijson's fault, just the behaviour of that particular file object, although I understand it could be misleading and confusing.

On the matter of the documentation of buf_size, I'd be interested in hearing some thoughts on how to improve it. I would have said it was good enough, but it clearly isn't since it's leading to confusions. Maybe the confusion arose mostly around the particulars of reading from sys.stdin.buffer as explained above, so maybe some words of caution could be thrown in the documentation.

from ijson.

green-green-avk commented on August 23, 2024

Aha!

In this case, we just need to add an option to use read1() instead. This way we could avoid the performance impact of buf_size=1 and preserve an ability to work with data streaming.

By data streams, I mean cases such as a network socket with small portions of data passing each several seconds. ijson must be able yield parsed objects as soon as they are available, without waiting for the buffer to be full.

from ijson.

rtobar commented on August 23, 2024

Or try with sys.stdin.buffer.raw (if that exists). Or choose a buf_size that better matches your situation.

I'd try to exhaust your options before jumping in and trying to implement a new feature that seems to cover only a very minor corner-case, but would take some effort to get right (and thus I'd be a bit unwilling to implement myself, although I'd be happy to review PRs). If you really want to go down that route, please let's track that on a separate issue to keep things separate.

Also, could I ask again if you could address the questions I posed in my previous comment? I want to confirm there's no actual issue in ijson ATM, and that this is mostly about managing expectations when using stdin.

from ijson.

rtobar commented on August 23, 2024

Haven't heard back in a week for further feedback, and the issue wasn't really a problem with ijson itself, so I'm closing this one.

from ijson.

ijson.items(file, prefix) waits for EOF about ijson HOT 8 CLOSED

Comments (8)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent