Comments (16)
Thanks a lot @rtobar!
I checked out the branch and it seems to work fine for my use case as well.
I'd vote for including this already, and leaving an issue open to improve speed for other backends if needed.
from ijson.
New 2.6.0 released today, available in PyPI, includes this now.
from ijson.
@ltalirz that's a situation that indeed has been experienced by a number of users now, so yes, it makes sense to add support for iterating over object members in addition to being able to iterate over full objects.
Instead of adding a new method I would go for a different approach: the prefix could contain a special final character (e.g., ':', '|' or even '.') to signal that iteration should be over the object's members, not the objects themselves. So in your case you could use ijson.items(f, 'my_big_data|') (or one of the other final characters), and the iteration would yield two-tuples with the individual keys and values.
I'm not sure yet how much effort would be required to implement this, or even if it's possible (there might be some aspect of the problem I'm not seeing yet?), but I'll keep it in mind and try to experiment with it.
from ijson.
Hm... a function that changes return type depending on the prefix string?
Are you sure this is more intuitive than just using a different wrapper?
Anyhow, I had a quick go and came up with this minor generalization of the example in isagalaev#62 (comment), put into the form of the objects function in the codebase.
Only tested for my use case; will test more later.
import ijson
from ijson.common import ObjectBuilder

def objects(prefixed_events, prefix):
    '''
    An iterator returning native Python objects constructed from the events
    under a given prefix.
    '''
    prefixed_events = iter(prefixed_events)
    try:
        key = '-'
        while True:
            current, event, value = next(prefixed_events)
            if current == prefix and event == 'map_key':  # found new object at prefix
                key = value
                builder = ObjectBuilder()
            elif current.startswith(prefix + '.' + key):  # while at this key, build the object
                builder.event(event, value)
                if event == 'end_map':  # found end of object at current key, yield
                    yield key, builder.value
    except StopIteration:
        pass

def kviter(file, prefix):
    return objects(ijson.parse(file), prefix)

with open('data.json', 'rb') as f:
    for k, v in kviter(f, 'my_big_data'):
        print(k, v)
        break
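The snippet above yields only when it sees an end_map event, so it relies on every value under the prefix being a flat object. A minimal sketch of one way to lift that restriction is to track the open containers explicitly and yield as soon as the value of the current key is complete. This is not the code that ended up in ijson: kvitems_sketch and the EVENTS list are hypothetical names used for illustration, and the events are hand-written in the (prefix, event, value) shape that ijson.parse produces, so the example runs without ijson installed.

```python
def kvitems_sketch(prefixed_events, prefix):
    """Yield (key, value) pairs for the members of objects found at `prefix`.

    Illustrative only: assumes well-formed (prefix, event, value) tuples.
    """
    key = None    # member key at `prefix` whose value is being built
    stack = []    # containers currently open inside that value, innermost last
    names = []    # pending keys of nested dicts that still await their value
    for current, event, value in prefixed_events:
        if key is None:
            # Not building anything: wait for the next member key at `prefix`.
            if current == prefix and event == 'map_key':
                key = value
            continue
        if event == 'map_key':
            names.append(value)
            continue
        if event == 'start_map':
            stack.append({})
            continue
        if event == 'start_array':
            stack.append([])
            continue
        # A closing event completes a container, anything else is a scalar.
        item = stack.pop() if event in ('end_map', 'end_array') else value
        if not stack:
            # The value for `key` is complete, whatever its type.
            yield key, item
            key = None
        elif isinstance(stack[-1], list):
            stack[-1].append(item)
        else:
            stack[-1][names.pop()] = item


# Hand-written events for:
# {"my_big_data": {"a": 1, "b": {"c": [1, 2]}, "d": "x"}, "other": 5}
EVENTS = [
    ('', 'start_map', None),
    ('', 'map_key', 'my_big_data'),
    ('my_big_data', 'start_map', None),
    ('my_big_data', 'map_key', 'a'),
    ('my_big_data.a', 'number', 1),
    ('my_big_data', 'map_key', 'b'),
    ('my_big_data.b', 'start_map', None),
    ('my_big_data.b', 'map_key', 'c'),
    ('my_big_data.b.c', 'start_array', None),
    ('my_big_data.b.c.item', 'number', 1),
    ('my_big_data.b.c.item', 'number', 2),
    ('my_big_data.b.c', 'end_array', None),
    ('my_big_data.b', 'end_map', None),
    ('my_big_data', 'map_key', 'd'),
    ('my_big_data.d', 'string', 'x'),
    ('my_big_data', 'end_map', None),
    ('', 'map_key', 'other'),
    ('other', 'number', 5),
    ('', 'end_map', None),
]

for k, v in kvitems_sketch(EVENTS, 'my_big_data'):
    print(k, v)
```

Because the sketch switches into "building" mode right after a key at the prefix, it never has to compose child prefixes by string concatenation, which also lets it work with the empty prefix for top-level members.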
from ijson.
Good point regarding different return types. I hadn't thought of that, and I think it's a good reason to require a new function, so let's go for that. Regarding its name, as you point out, the more natural name items is sadly already taken (and items could probably have been called objects in hindsight), but inspired by your code snippet it could be kvitems to keep names aligned.
Would you be willing to submit a PR adding this new functionality to all backends? Unit tests would be required together with the code though. Mind you that the C backend doesn't use the code under ijson.common, so it needs to be implemented separately (and excluded from the tests until it's implemented), but I can take care of that one if that's an issue.
If that's not possible then I can take your code and add the missing bits, but it might take a bit longer depending on other things I have on my plate.
from ijson.
Hi @rtobar, I'm very busy this week but I could perhaps make a PR for the Python implementation if you let me know where/how to add tests.
Somewhere here, in this style as well?
Lines 258 to 262 in 87c4a0e
from ijson.
I would try to add tests that demonstrate this working over a simple prefix (like in your example), a prefix including array elements (e.g., docs.item when used against the JSON global value in tests.py), and a prefix producing no results.
Unit tests, as you saw, should go into the Parse class in tests.py, and then they are run for each supported backend. To exclude a backend you'll need to do something like this:
Lines 299 to 307 in 87c4a0e
or
Lines 218 to 225 in 87c4a0e
I also realized that one could actually (I think) offer an implementation of kvitems for the C backend that uses the code under ijson.common; it's just that it might not be as fast as possible. So maybe give it a go, and if it works then it'd be better to include it than not.
from ijson.
Hi @ltalirz, I actually went ahead and gave this a try myself -- adding tests and all. I started with your code, but had to change it a bit to work properly in a few cases. Could you give this a try and see if it works for your example as well? Changes are in the kvitems branch.
from ijson.
Just as one performance data point:
On my SSD, it takes me 20s to iterate over 100k key-value pairs in a file containing ~1.4M key-value pairs (2GB in size), i.e. 0.2ms per pair.
This compares to 60s for parsing the entire file using json.load (0.04ms/pair) or 30s using ujson.load (0.02ms/pair).
While there's probably still room for improvement, I think that's already not too bad.
Edit: I was a bit surprised to have only a factor of 10x with respect to ujson, and indeed I overlooked that there were other top-level keys in the file.
After removing those, the appropriate comparison is against 13s for json.load (0.01ms/pair) and 10s for ujson.load (0.007ms/pair).
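The per-pair figures above follow from dividing each wall-clock time by the number of pairs processed. A small sketch of that arithmetic, assuming the file still holds roughly 1.4M pairs after the other top-level keys were removed (the helper name ms_per_pair is made up for illustration):

```python
def ms_per_pair(seconds, pairs):
    """Average time per key-value pair, in milliseconds."""
    return seconds * 1000.0 / pairs


TOTAL_PAIRS = 1_400_000  # approximate figure quoted in the comment

timings = {
    'ijson (first 100k pairs)': ms_per_pair(20, 100_000),
    'json.load (whole file)': ms_per_pair(13, TOTAL_PAIRS),
    'ujson.load (whole file)': ms_per_pair(10, TOTAL_PAIRS),
}

baseline = timings['ujson.load (whole file)']
for name, ms in timings.items():
    # Show each timing and how it compares with the fastest loader.
    print(f'{name}: {ms:.4f} ms/pair ({ms / baseline:.1f}x ujson)')
```

With the corrected numbers the gap to ujson.load comes out at roughly 28x rather than 10x.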
from ijson.
Good to see nice performance going on.
Shouldn't the last two measurements be 0.6ms/pair and 0.3ms/pair though, given the times you report? Or are you otherwise scaling the overall times for json and ujson to account only for the pair construction somehow?
(I had missed the fact that only 100k pairs is what takes 20s, all clear now.)
It would also be good to double-check maximum memory usage in each case (via /usr/bin/time -v on Linux, or similar).
In any case, performance could indeed go up once the kvitems logic is implemented in C for the C backend. Right now it's implemented in Python, and hence it could see a boost. I'll create a new issue for that bit of work though, which can be handled separately.
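For the memory check mentioned above, an alternative to /usr/bin/time -v is to read the peak resident set size from inside the Python process with the standard resource module (Unix only). This is a rough sketch; peak_rss_mib is a hypothetical helper name, and the list allocation stands in for whatever parsing code is being measured.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == 'darwin' else 1024
    return peak / divisor


before = peak_rss_mib()
# Placeholder workload: replace with the json.load or ijson loop under test.
data = [str(i) for i in range(200_000)]
after = peak_rss_mib()
print(f'peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB')
```

Since the value is a high-water mark for the whole process, each variant is best measured in a fresh interpreter.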
from ijson.
@ltalirz I just ran the benchmark.py tool (after enhancing it a bit to allow for this), and could see a difference of ~3x to 4x in the speed of items compared to that of kvitems for the C backend, while for the other backends they yield similar times. This would indicate that this sort of speedup factor should be possible once kvitems is implemented in C.
from ijson.
Sounds good!
Let's put that benchmark info in the new issue as well.
from ijson.
Changes merged to the master branch, closing this issue now.
from ijson.
Great! I would say this warrants a new release :-)
from ijson.
Yes, I'll try to push 2.6.0 out as time allows, hopefully no later than the end of the week.
from ijson.
Thanks for this!
from ijson.