Comments (11)
@raeldor unfortunately, without a code example and the document (or part of it) you are trying to load, there's little help that can be provided. Please provide more details; it could be that the memory explosion is happening somewhere else.
from ijson.
@raeldor to answer your original question: no, there's no dictionary construction or anything like that going on at the level of ijson.parse, and in principle it shouldn't do any accumulation of data in memory (if it did it would be a bug). But again, only seeing some code will help answer your question more precisely.
from ijson.
Thanks for the reply. Not sure code will help, since the line is literally just...
parser = ijson.parse(cellset_string)
I suspect you would need to have the same data to replicate.
from ijson.
After reading the FAQ, I suspect this is happening...
However if a text-mode file object is given then the library will automatically encode the strings into UTF-8 bytes
from ijson.
Regardless of the memory taken, even just running through the parser like...
parser = ijson.parse(cellset_string)
for prefix, event, value in parser:
    pass
Takes about 10 minutes for my roughly 320MB string of JSON. I feel like I'm doing something wrong here.
from ijson.
@raeldor thanks for giving more details. Even though the code was simple, it actually helped me figure out what's going on.
When a str instance is given as input, ijson uses an io.StringIO object to wrap it to make it look like a file. I always assumed io.StringIO wouldn't internally copy the input data, but it actually does:
$> python
Python 3.9.5 (default, May 11 2021, 08:20:37)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tracemalloc
>>> import io
>>> tracemalloc.start()
>>> tracemalloc.get_traced_memory()
(14106, 36242)
>>> x = ' ' * 10**6
>>> len(x)
1000000
>>> tracemalloc.get_traced_memory()
(1014943, 1024853)
>>> i = io.StringIO(x)
>>> tracemalloc.get_traced_memory()
(5015725, 5025635)
We could certainly simplify this in ijson by using a simpler file-like object wrapper that doesn't require copying the input string. I'll create an issue to remember to do that.
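For illustration only, such a wrapper could look roughly like the sketch below. StringReader is a hypothetical name and this is not ijson's actual implementation; it assumes the consumer only ever calls read(size), and it encodes one slice at a time so the whole string is never duplicated.

```python
class StringReader:
    """Hypothetical read-only, file-like view over a str.

    Only the requested slice is encoded on each read() call, so no
    full-size copy of the input string is ever made.
    """

    def __init__(self, text):
        self._text = text
        self._pos = 0  # position in characters, not bytes

    def read(self, size=-1):
        # A negative or missing size means "read everything that is left".
        if size is None or size < 0:
            end = len(self._text)
        else:
            end = self._pos + size
        chunk = self._text[self._pos:end]
        self._pos += len(chunk)
        # size counts characters, so the returned bytes may be longer
        # than size when the text contains non-ASCII characters.
        return chunk.encode("utf-8")


reader = StringReader('{"name": "héllo"}')
print(reader.read(8))   # first 8 characters, UTF-8 encoded
print(reader.read())    # the rest of the document
```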
As you found out in the FAQ, the best input you can give ijson is binary data, not textual data. Also, where is your in-memory string coming from? You must be loading it from a file, the network, or some other external source. In that case it's always better to just give a file object to ijson so it reads the data for you, instead of you reading the whole data and giving it to ijson.
from ijson.
Thanks for the quick response. It's the response.text from an HTTP call. Is there a way to wrap the string to prevent the conversion and improve the performance? For some reason the performance is also very slow. Or maybe it's because I'm in debug mode?
from ijson.
@raeldor It seems you're using the requests library? I'm no expert in it, so take this with a grain of salt.
If you are using the requests library, you can access the response body as bytes via response.content. That should already give slightly better performance because you'll save a round trip of encoding/decoding. Internally ijson will then use an io.BytesIO object, which apparently doesn't suffer from the extra memory usage problem that io.StringIO does.
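The difference between the two wrappers can be checked with a small tracemalloc comparison like the one below. This is a sketch that assumes CPython; wrapper_overhead is a helper name made up for this example, and the exact numbers will vary between Python versions.

```python
import io
import tracemalloc


def wrapper_overhead(make_wrapper, payload):
    """Return how many traced bytes are allocated by wrapping payload."""
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        wrapper = make_wrapper(payload)  # kept alive until after measuring
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


text = " " * 10**6
data = text.encode("utf-8")

string_overhead = wrapper_overhead(io.StringIO, text)
bytes_overhead = wrapper_overhead(io.BytesIO, data)

print("io.StringIO overhead:", string_overhead, "bytes")
print("io.BytesIO overhead: ", bytes_overhead, "bytes")
```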
However, the best would be to find out how to use the requests library to get a file-like object that you can pass to ijson directly. In that case your memory usage should stay really low, as you would never need to load the whole response in memory. Like I said, I'm no expert on requests, but it would seem like creating something around Response.iter_content would work.
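One possible shape for "something around Response.iter_content" is a small read-only stream over any iterator of bytes chunks. The sketch below uses only the standard library; IterStream and as_file are hypothetical names, and the requests/ijson usage at the bottom is left as a comment because it is an untested assumption about how the pieces would be combined.

```python
import io


class IterStream(io.RawIOBase):
    """Hypothetical read-only raw stream over an iterable of bytes chunks."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # Pull the next non-empty chunk once the previous one is used up.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # signals end of stream
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def as_file(chunks, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """Wrap an iterable of bytes chunks in a buffered, file-like reader."""
    return io.BufferedReader(IterStream(chunks), buffer_size=buffer_size)


# Stand-in for the chunks a streamed HTTP response would yield.
f = as_file([b'{"a": ', b"", b"[1, 2", b", 3]}"])
print(f.read())

# Assumed usage with requests and ijson (not run here):
#   response = requests.get(url, stream=True)
#   f = as_file(response.iter_content(chunk_size=65536))
#   for prefix, event, value in ijson.parse(f):
#       ...
```

With this approach only one chunk plus the reader's buffer is held in memory at a time, instead of the full response body.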
"Takes about 10 minutes for my roughly 320MB string of JSON. I feel like I'm doing something wrong here."
Are those 10 minutes spent only in ijson? Please check the performance section of the documentation. In particular make sure you have a fast backend available. 320 MB of JSON shouldn't take that long to parse (but who knows, maybe you have a particularly difficult JSON document to parse...)
from ijson.
Thank you. I'll investigate the response options further. Writing the string out to a physical file as binary and opening removed the memory issue as suspected.
Turning debugging off brought the time down from 10 minutes to 1 minute for a read through the parser with no action. It's using the yajl2_c backend. Not sure how that stacks up performance-wise with what you'd expect.
from ijson.
Yes, 1 minute for reading and parsing sounds much better (still a bit high, but probably because of the extra I/O to disk). Note that you should be able to skip writing to a file, though; just pass response.content to ijson, and it will internally use io.BytesIO, which unlike its sibling io.StringIO doesn't require more memory.
In any case, please close this issue if you're happy with the responses. I'll additionally deal with replacing our internal usage of io.StringIO to avoid unexpected memory increases.
from ijson.
Using response.content worked great without impacting memory. Really appreciate your fast assistance, thank you.
from ijson.