Comments (8)

green-green-avk commented on August 23, 2024

Uh. Huh. It's about streams, not files at all.

Please, consider this one-liner repro case:

python3 -c "$(echo -en 'import time\nwhile(True): print("{\"test\":1}", flush=True);time.sleep(1)')" | python3 -c "$(echo -en 'import sys\nimport ijson\nfor v in ijson.items(sys.stdin.buffer, "", multiple_values=True): print(v)')"

from ijson.

rtobar commented on August 23, 2024

Thanks @green-green-avk for the interesting report. Do you have an example file that can be used for testing this? In particular, I'd like to understand/confirm that you really don't get any values yielded until you hit the EOF. How are you 100% certain of that? The example code looks correct of course, so I suppose your report is based on observations of some kind.

If f is an actual file on disk, and you opened it via open, you could try adding a print(f.tell()) in your for loop to see how much the file has been advanced every time you get a value.

rtobar commented on August 23, 2024

Also: does this happen with other backends? Just noticed this was reported for the python backend, so maybe (hopefully!) the issue, if there is one, is specific to this backend.

green-green-avk commented on August 23, 2024

Ouch, it seems I badly missed the buf_size=1 option here...

It works:

python3 -c "$(echo -en 'import time\nwhile(True): print("{\"test\":1}", flush=True);time.sleep(1)')" | python3 -c "$(echo -en 'import sys\nimport ijson\nfor v in ijson.items(sys.stdin.buffer, "", multiple_values=True, buf_size=1): print(v)')"

However, I think the exact semantics of the buf_size option deserve to be documented.

rtobar commented on August 23, 2024

@green-green-avk thanks for that reproducer and the clarification on it being a continuous, potentially infinite stream of data (you did mention it in the original description, and I was a bit puzzled about what you exactly meant, but didn't ask further). In any case, it being a "stream" or a file on disk is irrelevant: ijson is presented with a file object regardless, and the f.tell() test should still be possible (haven't tested myself though, don't have a computer at hand ATM).

Am I understanding correctly that there is actually no issue, and that the actual problem was that your stream was generating data in chunks smaller than the default buf_size of 64kB? And not only that, but also that your streams in total were smaller than 64kB, since you were getting values only after writing your whole stream? If that's the case, could you please go ahead and close this bug report to keep things tidy? If I'm misinterpreting your last comment and there is a bug, then you'll need to wait until I get my hands on a computer with some time to check this.

To clarify: my understanding is that ijson was blocking when reading data from stdin via sys.stdin.buffer.read(64k). The buffer object is an io.BufferedIOBase whose read method tries to read as much data as requested (or until EOF). That's not ijson's fault, just the behaviour of that particular file object, although I understand it could be misleading and confusing.
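The blocking behaviour described above can be reproduced without ijson. The following is a minimal sketch that uses an OS pipe as a stand-in for stdin (an assumption made for illustration; the 65536 figure mirrors the 64kB default mentioned earlier). It shows that read() on a buffered reader keeps waiting for the full amount or EOF, while read1() returns whatever a single underlying read delivers.

```python
import os
import threading

# A pipe stands in for stdin; open() in "rb" mode gives an io.BufferedReader,
# the same kind of object as sys.stdin.buffer.
r_fd, w_fd = os.pipe()
reader = open(r_fd, "rb")

os.write(w_fd, b'{"test":1}\n')

# read1() performs at most one raw read and returns what is available.
first = reader.read1(65536)

os.write(w_fd, b'{"test":2}\n')

# read() waits for 65536 bytes or EOF, so run it in a thread to observe it.
result = []
worker = threading.Thread(target=lambda: result.append(reader.read(65536)))
worker.start()
worker.join(timeout=0.5)
still_blocked = worker.is_alive()  # the data is there, but read() has not returned

os.close(w_fd)  # EOF finally lets read() return
worker.join()
reader.close()

print(first, still_blocked, result)
```

The second write is only handed back once the writer closes the pipe, which matches the "no values until EOF" observation in the original report.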

On the matter of the documentation of buf_size, I'd be interested in hearing some thoughts on how to improve it. I would have said it was good enough, but it clearly isn't since it's leading to confusion. Maybe the confusion arose mostly around the particulars of reading from sys.stdin.buffer as explained above, so maybe some words of caution could be added to the documentation.

green-green-avk commented on August 23, 2024

Aha!

In this case, we just need to add an option to use read1() instead. This way we could avoid the performance impact of buf_size=1 and preserve the ability to work with streaming data.

By data streams, I mean cases such as a network socket with small portions of data arriving every several seconds. ijson must be able to yield parsed objects as soon as they are available, without waiting for the buffer to be full.
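Until such an option exists, the same effect might be approximated on the caller's side. This is only a sketch: Read1Adapter is a hypothetical name, not part of ijson, and it assumes the consumer only ever calls read(size) on the object it is given.

```python
import os


class Read1Adapter:
    """Hypothetical wrapper exposing read() but delegating to read1()."""

    def __init__(self, buffered):
        self._buffered = buffered

    def read(self, size=-1):
        # read1() returns as soon as one underlying read yields data,
        # instead of waiting for `size` bytes or EOF.
        return self._buffered.read1(size)


# A pipe stands in for a slow producer such as stdin or a socket.
r_fd, w_fd = os.pipe()
buffered = open(r_fd, "rb")
stream = Read1Adapter(buffered)

os.write(w_fd, b'{"test":1}\n')
chunk = stream.read(65536)  # returns the short chunk without waiting for 64kB

os.close(w_fd)
tail = stream.read(65536)   # empty bytes signal EOF, as with read()
buffered.close()

print(chunk, tail)
```

Assuming ijson only needs a read() method, an object like Read1Adapter(sys.stdin.buffer) could be passed in place of sys.stdin.buffer while keeping the default buf_size; this has not been verified against every backend.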

rtobar commented on August 23, 2024

Or try with sys.stdin.buffer.raw (if that exists). Or choose a buf_size that better matches your situation.
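As a rough illustration of the .raw suggestion: for a buffered reader created by open(), the raw attribute is an unbuffered io.FileIO whose read(n) issues a single system call and returns up to n bytes. The pipe below is a stand-in for stdin, and whether sys.stdin.buffer.raw exists in a given environment is not checked here.

```python
import io
import os

r_fd, w_fd = os.pipe()
buffered = open(r_fd, "rb")  # io.BufferedReader, like sys.stdin.buffer
raw = buffered.raw           # the underlying unbuffered io.FileIO

os.write(w_fd, b'{"test":1}\n')

# A single raw read returns what is currently available, up to 65536 bytes.
chunk = raw.read(65536)

os.close(w_fd)
eof = raw.read(65536)        # empty bytes at EOF
is_fileio = isinstance(raw, io.FileIO)
buffered.close()

print(chunk, eof, is_fileio)
```

Reads on the raw object bypass the buffered reader, so the two should not be mixed on the same stream: data already pulled into the buffer would be skipped.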

I'd try to exhaust your options before jumping in and trying to implement a new feature that seems to cover only a very minor corner case, but would take some effort to get right (and thus I'd be a bit unwilling to implement it myself, although I'd be happy to review PRs). If you really want to go down that route, please let's track that in a separate issue to keep things tidy.

Also, could I ask again if you could address the questions I posed in my previous comment? I want to confirm there's no actual issue in ijson ATM, and that this is mostly about managing expectations when using stdin.

rtobar commented on August 23, 2024

Haven't heard back in a week for further feedback, and the issue wasn't really a problem with ijson itself, so I'm closing this one.
