Comments (7)
@nolaexe good to know you have found this library helpful.
I'm afraid there's not much we can do at the moment at the `ijson` level to resume streams. Allowing such a feature would require us to implement a number of things that would make the code much more difficult to maintain: offsets from the stream start, keeping the generators' state between invocations, etc. If you really need to be able to "resume" from a faulty stream, my suggestion would be to implement this at your level: wrap your original stream/file object in a file-object class that performs any corrections when its `read` method is invoked. That way you are in full control of how many bytes were read, and of how to restart the stream from the place it failed at.
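A minimal sketch of such a wrapper, under some assumptions: the class name `ResumableReader`, the `open_at` callback (which must return a binary file-like object positioned at a given byte offset) and the retry policy are all hypothetical, and a failed read is assumed to surface as `OSError`. It is not part of ijson.

```python
class ResumableReader:
    """Hypothetical file-like wrapper that reopens the source after a failed read."""

    def __init__(self, open_at, max_retries=3):
        # open_at: callable taking a byte offset and returning a binary
        # file-like object positioned at that offset (an assumption).
        self._open_at = open_at
        self._max_retries = max_retries
        self._raw = open_at(0)
        self.offset = 0  # bytes handed to the consumer so far

    def read(self, size=-1):
        for attempt in range(self._max_retries + 1):
            try:
                chunk = self._raw.read(size)
                break
            except OSError:
                if attempt == self._max_retries:
                    raise
                # Restart the stream from the place it failed at.
                self._raw = self._open_at(self.offset)
        self.offset += len(chunk)
        return chunk
```

Since the wrapper exposes a `read` method, an instance can be handed to a parser in place of the original stream. Any byte-level corrections would go into `read` before the chunk is returned.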
Regarding efficiency (more details about most of these in the project README file):
- There's no multithreading/multiprocessing support at the moment. I'd have to think harder about this, but on the surface it seems this wouldn't be possible to implement anyway, as the underlying parsers all carry some state that depends on the input, which is serial.
- Using the `yajl2_c` backend is indeed the fastest way to go.
- If feasible, set the `use_float` option to `True` in your invocation.
- Make sure you open your stream/file in binary mode too, to avoid an unnecessary encoding/decoding roundtrip of your data.
- I have a working prototype of `ijson` using a new `boost` backend (using the Boost.JSON library) that can be twice as fast depending on the use case. If you are honestly interested in trying this out we could coordinate something. I'm fairly confident the code works correctly, but I want to have the code a bit cleaner before publishing it on GitHub; I also want to be able to build binary wheels with the new backend once I release a new version.
- Finally: is `ijson` the bottleneck of your processing? Unless it is, you might be trying to optimise where no optimisation is needed (yet).
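The backend, `use_float` and binary-mode tips can be combined roughly as follows. This is only a sketch: the helper names `pick_backend` and `iter_items` are made up for illustration, the `items.item` prefix assumes a top-level `items` array, and the fallback covers installations where the C extension is not available.

```python
try:
    import ijson  # third-party; assumed to be installed
except ImportError:
    ijson = None


def pick_backend():
    """Prefer the yajl2_c backend; fall back to ijson's default backend."""
    if ijson is None:
        raise RuntimeError("ijson is not installed")
    try:
        return ijson.get_backend("yajl2_c")
    except ImportError:
        # C extension not built for this platform/interpreter.
        return ijson


def iter_items(stream, prefix="items.item"):
    """Yield the objects under `prefix` from a binary stream.

    use_float=True asks ijson for float instead of Decimal values.
    """
    return pick_backend().items(stream, prefix, use_float=True)
```

A call would look something like `with open("huge.json", "rb") as f: for obj in iter_items(f): ...`, where the file name is a placeholder and `"rb"` is what keeps the stream in binary mode.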
from ijson.
Interesting to hear about the new Boost backend. Have you also looked at https://simdjson.org ?
@jpmckinney I know of simdjson, but haven't given it a try. Does it support stream parsing? That wasn't clear from a quick glance at the documentation.
I added the `boost` backend a while back, when Boost.JSON wasn't even included in Boost (and helped find a couple of issues they had with their UTF-8 sequence handling, plus requested a then-missing feature required by `ijson`). As I said, the code is working, but a couple of issues are holding me back from releasing an `ijson` version with it.
Oh, right, they added an "on demand API", but SAX-style parsing (simdjson/simdjson#670) is not yet supported, nor are files larger than 4GB (simdjson/simdjson#128); the two are likely related. These issues are now scheduled for 2.0 (simdjson is not yet 1.0).
Thank you for the response!
Currently I don't have the time to test Boost.JSON, but I'll try your other tips.
And yes, currently parsing the JSON is the bottleneck. The process is to stream the huge JSON file from a cloud provider, split the `items` array into objects, and write them to my database.
All the objects have a shared part, so I'll try to figure out a way to speed up the processing with this knowledge.
@nolaexe I'll be closing this issue as I think all the original questions have been answered.
I came across a use case for this today, where the input JSON has predictable errors (quotation characters and control characters are not escaped inside strings). Right now I'm parsing the JSON, getting the error position from the exception, replacing the unescaped characters with their escaped versions, and then re-parsing the JSON from the beginning. Obviously, this changes the running time from (size of input) to (size of input) x (number of errors). I don't expect ijson or any other JSON parser to support this use case, but I figured I'd note it here.
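To illustrate the repair-and-reparse loop and why its cost grows with the number of errors, here is a rough sketch. It uses the standard library `json` module rather than ijson, works on an in-memory string, and only handles raw control characters inside strings (unescaped quotes are ambiguous and left out). The function name `loads_with_repairs` and the `max_repairs` limit are invented for the example.

```python
import json


def loads_with_repairs(text, max_repairs=1000):
    """Parse `text`, escaping one raw control character per failed attempt.

    Returns (value, passes), where `passes` counts full parse attempts,
    each of which restarts from the beginning of the input.
    """
    passes = 0
    while True:
        passes += 1
        try:
            return json.loads(text), passes
        except json.JSONDecodeError as err:
            if "Invalid control character" not in err.msg or passes > max_repairs:
                raise
            pos = err.pos
            # Depending on the decoder implementation, the reported position
            # may be the offending character or the one just past it.
            if not (pos < len(text) and text[pos] < " "):
                pos -= 1
            escaped = "\\u%04x" % ord(text[pos])
            text = text[:pos] + escaped + text[pos + 1:]
```

Every repair triggers another pass over the whole input, which is where the (size of input) x (number of errors) behaviour comes from.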