rtobar commented on June 23, 2024

Thanks @skeggse for the interesting proposal, and sorry for the delay on this initial response, busy days.

First things first, let me rephrase your idea to make sure I'm understanding it correctly. You basically want something like this:

data = b'{}{}'
for raw_json in ijson.new_method_you_want(data):  # hypothetical method, does not exist
    ...  # raw_json is b'{}' each time

Is that a fair depiction of what you're looking for?

As you mentioned, a way to currently achieve this is doing something like:

import json
import ijson

data = b'{}{}'
for raw_json in map(json.dumps, ijson.items(data, '', multiple_values=True)):
    ...  # raw_json is '{}' each time

The drawback is that this indeed builds each document fully as a Python object just to dump it back into its string form. In the process you might also lose some information (but not necessarily).
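
To make the "lose some information" point concrete, here is a minimal sketch using only the standard library `json` module (not ijson; the sample input is made up for illustration). A parse-then-dump round trip keeps the value but not the original text:

```python
import json

# Made-up input: compact separators, exponent notation, and odd whitespace.
original = '{"a":1.0e2,  "b":[ ]}'

# Build the Python object, then serialize it again.
round_tripped = json.dumps(json.loads(original))

print(original)       # {"a":1.0e2,  "b":[ ]}
print(round_tripped)  # {"a": 100.0, "b": []}
```

Both strings decode to the same value, but the whitespace and the number notation of the original are gone, so the result is not a verbatim copy.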

From the point of view of ijson and its inner workings, here are some thoughts:

  • In the example above, and the one you mention in your original comment, the top-level documents in the single stream consist of JSON objects. Note however that in general they could be any JSON value; e.g. {} [], [] [], 1 2, true {"a": 2}, etc.
  • The above means that ijson cannot simply look for a starting/ending bracket, brace or the like. Instead, and to ensure correct behavior, parsing of the original document must be done; in other words, there are no shortcuts. In particular, also note that although it might work most of the time, using newlines to determine the end of a JSON value is not fully reliable (e.g., {\n} is a valid, single JSON value).
  • To produce individual documents consisting of verbatim copies of the original bytes we then need to fully parse the document, while keeping track of the bytes the parser considered in the process (this is the key). To begin with, none of the ijson routines is "low-level" enough to offer this information; we need to go to the parser technologies that power our backends.
  • Of those we currently have a few: our own pure Python parser, the yajl library (versions 1 and 2), and a not-yet-on-the-master-branch Boost.JSON parser.
    • We could change our own python parser to keep track of input bytes
    • From memory the boost json library might keep track of this information already
    • But neither version of the yajl parser does, so for most of our backends it would be simply impossible to provide this information.
  • Moreover, even if all underlying parsers exposed this information, it would still require some non-trivial amount of work to add your desired functionality on top of that.
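
The key idea in the list above, parsing fully while recording how much input the parser consumed, can be sketched outside ijson with the standard library's `json.JSONDecoder.raw_decode`, which returns both the parsed value and the index where parsing stopped. This is only an illustration under some assumptions: the helper name `split_json_values` is hypothetical, the whole input must already be in memory as a `str` (so it is not incremental), and each value is still built as a Python object before being discarded.

```python
import json
from typing import Iterator


def split_json_values(text: str) -> Iterator[str]:
    """Yield the verbatim source text of each top-level JSON value in `text`.

    Hypothetical helper, not part of ijson or the json module.
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not accept leading whitespace, so skip it here.
        while pos < end and text[pos] in " \t\r\n":
            pos += 1
        if pos == end:
            return
        # Parse exactly one value starting at `pos`; `stop` is the index
        # just past the last character the parser consumed.
        _, stop = decoder.raw_decode(text, pos)
        yield text[pos:stop]
        pos = stop


stream = '{}{} [1, 2]\n{\n} true 1 2'
for raw in split_json_values(stream):
    print(repr(raw))
```

Because the slice boundaries come from the parser itself, a value such as `{\n}` that spans a newline is returned whole, and non-object values like `true` or `1` are handled as well. Malformed input raises `json.JSONDecodeError`.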

In summary, I think this is not really possible because of the restrictions imposed by the underlying parsing technologies we use, and even if it were, at least for some of the backends, it would be too much effort for little gain.

from ijson.

skeggse commented on June 23, 2024

Okay, I might just pull in a parser backend. It seems like a relatively simple parser atop a tokenizer would be sufficient. Thanks for your consideration!