Comments (2)
Thanks @skeggse for the interesting proposal, and sorry for the delay on this initial response, busy days.
First things first, let me rephrase your idea to make sure I'm understanding it correctly. You basically want something like this:
import ijson

data = b'{}{}'
for raw_json in ijson.new_method_you_want(data):  # hypothetical method, does not exist today
    print(raw_json)  # raw_json is b'{}' each time
Is that a fair depiction of what you're looking for?
As you mentioned, a way to currently achieve this is doing something like:
import json
import ijson

data = b'{}{}'
for raw_json in map(json.dumps, ijson.items(data, '', multiple_values=True)):
    print(raw_json)  # raw_json is '{}' each time
The drawback is that this indeed builds each document fully as a Python object just to dump it back into its string form. In the process you might also lose some information (but not necessarily).
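To make the information loss concrete, here is a minimal sketch of a parse-then-dump round trip. It uses the standard library's json.loads in place of ijson so that it runs without extra dependencies (that substitution is an assumption made for the example); the helper name round_trip is made up for illustration.

```python
import json


def round_trip(raw: str) -> str:
    """Parse a JSON document into Python objects and serialize it again."""
    return json.dumps(json.loads(raw))


# The input uses exponent notation, irregular spacing and a duplicated key.
raw = '{"n":1e2,  "k":1, "k":2}'
print(round_trip(raw))  # {"n": 100.0, "k": 2}
```

The result is equivalent JSON for most purposes, but the number spelling, the whitespace and the first of the duplicated keys are gone, so it is not a verbatim copy of the input.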
From the point of view of ijson and its inner workings, here are some thoughts:
- In the example above, and in the one from your original comment, the top-level documents in the single stream are JSON objects. Note, however, that in general they could be any JSON value; e.g. {} []; [] []; 1 2; true {2}; etc.
- The above means that ijson cannot simply look for a starting/ending bracket, parenthesis or the like. Instead, to ensure correct behavior, the original document must be parsed; in other words, there are no shortcuts. In particular, note that although it might work most of the time, using newlines to determine the end of a JSON value is not fully reliable (e.g., {\n} is a valid, single JSON value).
- To produce individual documents consisting of verbatim copies of the original bytes we then need to fully parse the document while keeping track of the bytes the parser consumed in the process (this is the key). To begin with, none of the ijson routines is "low-level" enough to offer this information -- we need to go to the parser technologies that power our backends.
- Of those we currently have a few: our own pure Python parser, the yajl library (versions 1 and 2), and a not-yet-on-the-master-branch boost json parser.
  - We could change our own Python parser to keep track of input bytes.
  - From memory, the boost json library might keep track of this information already.
  - But the yajl parser (in either version) doesn't, so for most of our backends it would simply be impossible to provide this information.
- Moreover, even if all underlying parsers exposed this information, it would still require some non-trivial amount of work to add your desired functionality on top of that.
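For illustration only, this is roughly what "parse fully while tracking the consumed input" looks like with a parser that does report where each value ended. The sketch below uses the standard library's json.JSONDecoder.raw_decode, not any ijson backend, and it assumes the whole input is already in memory as a str, so it is not a streaming solution. The function name iter_raw_values is made up for the example.

```python
import json
from typing import Iterator, Tuple

WHITESPACE = " \t\r\n"


def iter_raw_values(text: str) -> Iterator[Tuple[str, object]]:
    """Yield (verbatim_text, parsed_value) for each top-level JSON value."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos] in WHITESPACE:
            pos += 1
        if pos == end:
            return
        # raw_decode returns the value plus the offset (relative to the
        # string it was given) at which that value ended.
        value, length = decoder.raw_decode(text[pos:])
        yield text[pos:pos + length], value
        pos += length


for raw, value in iter_raw_values('{"a": 1.0}  [1,2]\n{\n}'):
    print(repr(raw), value)
```

Note that this still builds every document as a Python object, and slicing text[pos:] copies the remaining input on each step; both are acceptable for a sketch but not for large streams.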
In summary, I think this is simply not possible in general because of the restrictions imposed by the underlying parsing technologies we use, and even if it were, at least for some of the backends, it would be too much effort for little gain.
Okay, I might just pull in a parser backend. It seems like a relatively simple parser atop a tokenizer would be sufficient. Thanks for your consideration!
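A rough sketch of what such a boundary scanner could look like, under the assumption that the input is trusted and fully buffered: it does not validate JSON and only tracks nesting depth and string state, which is exactly the kind of shortcut the reply above warns about for untrusted input. The name split_top_level is made up for the example.

```python
from typing import Iterator, Optional

WHITESPACE = b" \t\r\n"
OPENERS = b"{["
CLOSERS = b"}]"


def split_top_level(data: bytes) -> Iterator[bytes]:
    """Yield verbatim byte slices, one per top-level JSON value (non-validating)."""
    start: Optional[int] = None  # offset where the current value began
    depth = 0                    # nesting depth of {} and []
    in_string = False
    escaped = False              # previous byte in a string was a backslash

    for i in range(len(data)):
        ch = data[i:i + 1]
        if in_string:
            if escaped:
                escaped = False
            elif ch == b"\\":
                escaped = True
            elif ch == b'"':
                in_string = False
                if depth == 0:   # a top-level string just ended
                    yield data[start:i + 1]
                    start = None
            continue
        if ch in WHITESPACE:
            if depth == 0 and start is not None:
                yield data[start:i]  # whitespace ends a top-level scalar
                start = None
            continue
        if ch == b'"' or ch in OPENERS:
            if depth == 0 and start is not None:
                yield data[start:i]  # a scalar directly followed by a new value
                start = None
            if start is None:
                start = i
            if ch == b'"':
                in_string = True
            else:
                depth += 1
            continue
        if ch in CLOSERS:
            depth -= 1
            if depth == 0:
                yield data[start:i + 1]
                start = None
            continue
        if start is None:            # first byte of a number or literal
            start = i
    if start is not None and depth == 0 and not in_string:
        yield data[start:]           # trailing top-level scalar


print(list(split_top_level(b'{"a": "}"} [1, 2]\ntrue')))
```

Unlike the parse-then-dump approach, this never builds Python objects and returns the original bytes untouched, at the cost of accepting malformed input without complaint.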