Output items sequence as original bytes (ijson issue, closed)

icrar commented on June 23, 2024

Output items sequence as original bytes

Comments (2)

rtobar commented on June 23, 2024

Thanks @skeggse for the interesting proposal, and sorry for the delay on this initial response, busy days.

First things first, let me rephrase your idea to make sure I'm understanding it correctly. You basically want something like this:

data = b'{}{}'
# new_method_you_want is a placeholder name; ijson has no such function
for raw_json in ijson.new_method_you_want(data):
    ...  # raw_json is b'{}' each time

Is that a fair depiction of what you're looking for?

As you mentioned, a way to currently achieve this is doing something like:

import json
import ijson

data = b'{}{}'
for raw_json in map(json.dumps, ijson.items(data, '', multiple_values=True)):
    print(raw_json)  # raw_json is '{}' each time

The drawback is that this indeed builds each document fully as a Python object just to dump it back into its string form. In the process you might also lose some information (but not necessarily).
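To illustrate the kind of information that can get lost, here is a small sketch that uses only the standard library `json` module as a stand-in for the parse-then-dump round trip (ijson itself is not involved, and its backends may differ in details such as number handling). The helper name `round_trip` and the sample inputs are made up for this illustration:

```python
import json


def round_trip(text: str) -> str:
    """Parse a JSON document and serialize it again (hypothetical helper)."""
    return json.dumps(json.loads(text))


samples = [
    '{ "x" : 1 }',        # insignificant whitespace
    '1.50',               # number formatting
    '{"k": 1, "k": 2}',   # duplicate keys
    '"\\u0041"',          # escape sequences
]

for original in samples:
    # Each printed pair shows the original text next to its round-tripped form.
    print(repr(original), '->', repr(round_trip(original)))
```

The values are equivalent as far as most consumers care, but the bytes are not: whitespace, number spelling, duplicate keys and escapes are all normalized away.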

From the point of view of ijson and its inner workings here are some thoughts:

- In the example above, and the one you mention in your original comment, the top-level documents in the single stream consist of JSON objects. Note however that in general they could be any JSON value; e.g. {} [], [] [], 1 2, true {}, etc.
- The above means that ijson cannot simply look for a starting/ending bracket, brace or the like. Instead, and to ensure correct behavior, the original document must actually be parsed; in other words, there are no shortcuts. In particular, also note that although it might work most of the time, using newlines to determine the end of a JSON value is not fully reliable (e.g., {\n} is a valid, single JSON value).
- To produce individual documents consisting of verbatim copies of the original bytes we then need to fully parse the document, while keeping track of the bytes the parser consumed in the process (this is the key). To begin with, none of the ijson routines is "low-level" enough to offer this information; we need to go to the parser technologies that power our backends.
- Of those we currently have a few: our own pure python parser, the yajl library (versions 1 and 2), and a not-yet-on-the-master-branch boost json parser.
  - We could change our own python parser to keep track of input bytes
  - From memory the boost json library might keep track of this information already
  - But the yajl parser (in either version) doesn't, so for most of our backends it would be simply impossible to provide this information.
- Moreover, even if all underlying parsers exposed this information, it would still require some non-trivial amount of work to add your desired functionality on top of that.
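As an illustration of "parsing while keeping track of the input consumed", here is a minimal sketch outside of ijson, built on the standard library's `json.JSONDecoder.raw_decode`, which returns the index at which each parsed value ended. The function name `split_raw_documents` is made up for this example. It assumes the whole input fits in memory and is valid UTF-8, and it still builds (and discards) each Python object, so it is not a streaming solution:

```python
import json


def split_raw_documents(data: bytes):
    """Yield each top-level JSON value in `data` as its original bytes.

    Hypothetical helper: not part of ijson, not streaming.
    """
    text = data.decode("utf-8")
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos] in " \t\r\n":
            pos += 1
        if pos == end:
            return
        # The parsed object is thrown away; only the end index is used.
        _, stop = decoder.raw_decode(text, pos)
        # Re-encoding a slice of valid UTF-8 text gives back the same bytes.
        yield text[pos:stop].encode("utf-8")
        pos = stop


stream = b'{\n} [1,\n 2] 1 true "a b"'
for raw in split_raw_documents(stream):
    print(raw)
```

By contrast, `stream.split(b"\n")` would cut the first value into `b'{'` and the start of the next piece, which is why newlines cannot be relied on as delimiters.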

In summary, I think this is not really possible because of the restrictions imposed by the underlying parsing technologies we use, and even if it were, at least for some of the backends, it would be too much effort for little gain.

skeggse commented on June 23, 2024

Okay, I might just pull in a parser backend. It seems like a relatively simple parser atop a tokenizer would be sufficient. Thanks for your consideration!
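For what it's worth, a rough sketch of what such a boundary-finding scanner might look like, under strong assumptions: the input is a complete, well-formed `bytes` object held in memory, nothing is validated, and all names (`iter_raw_values`, `_skip_string`) are invented for this example rather than taken from ijson or any other library. It only tracks string state and nesting depth, so no Python objects are built:

```python
WHITESPACE = b" \t\r\n"
QUOTE, BACKSLASH = 0x22, 0x5C


def _skip_string(data: bytes, i: int) -> int:
    """Return the index just past the string whose opening quote is at i."""
    i += 1
    while data[i] != QUOTE:
        # A backslash escapes the next byte, so step over both.
        i += 2 if data[i] == BACKSLASH else 1
    return i + 1


def iter_raw_values(data: bytes):
    """Yield the original bytes of each top-level JSON value in `data`."""
    i, n = 0, len(data)
    while i < n:
        if data[i] in WHITESPACE:
            i += 1
            continue
        start = i
        if data[i] == QUOTE:
            i = _skip_string(data, i)
        elif data[i] in b"{[":
            # Container: scan until the nesting depth returns to zero,
            # ignoring any brackets that appear inside strings.
            depth = 0
            while True:
                c = data[i]
                if c == QUOTE:
                    i = _skip_string(data, i)
                    continue
                if c in b"{[":
                    depth += 1
                elif c in b"}]":
                    depth -= 1
                i += 1
                if depth == 0:
                    break
        else:
            # Number or literal: runs until whitespace or the next value starts.
            while i < n and data[i] not in b' \t\r\n{["':
                i += 1
        yield data[start:i]


for raw in iter_raw_values(b'{"k": "}{"} [1, [2]]\n1.5e3 true'):
    print(raw)
```

A streaming version would additionally need to carry the string/depth state across chunk boundaries, and a robust one would have to reject malformed input instead of assuming it away.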
