Comments (5)
@TomGeoDK I think the key in your description is that ijson.items seems to not have an end, but you haven't been able to prove this. OTOH your file is quite large. How long did you wait before giving up? Also, which backend are you using? Depending on this, your bottleneck might be disk I/O or CPU.
For performance tips see the ijson FAQ. In particular, make sure you are using a version of ijson that has the yajl2_c backend; also open your file in rb mode instead of text mode, otherwise if your input file is already UTF-8 (it should be, as JSON content should be UTF-8 encoded) you are doing an unnecessary decode/encode roundtrip.
I took your JSON example content and put it in a 77.json file. I also wrote a 77.py file with your create_content function (without the SQL bits, just some prints), plus the code that feeds the JSON file into it:
import ijson

def create_content(objects):
    ''' simple helper function to get content
    for every object list from the JSON file'''
    item_lst = []
    item_count = 0
    while True:
        try:
            item = next(objects)
            item_lst.append(item)
            if len(item_lst) == 10000:
                item_count += len(item_lst)
                item_lst.clear()
        except StopIteration:
            break
        except Exception as e:
            print(e)
            break
    if len(item_lst) > 0:
        item_count += len(item_lst)
        item_lst.clear()
    return item_count

with open("77.json", "rb") as f:
    print(create_content(ijson.items(f, "GrundList.item")))
with open("77.json", "rb") as f:
    grund_objects = ijson.items(f, 'GrundList.item')
    print(len(list(grund_objects)))
This works and prints 2 and then 2 again.
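As a side note, the "collect 10000 items, flush, repeat" pattern in create_content can be written more compactly with itertools.islice. This is only a sketch using the standard library; the names batches and count_in_batches are invented for illustration, and the database write is left as a comment.

```python
from itertools import islice


def batches(iterable, size):
    """Yield lists of up to `size` items until the iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def count_in_batches(objects, size=10000):
    """Consume `objects` batch by batch and return the total item count."""
    total = 0
    for batch in batches(objects, size):
        # A real program would write `batch` to the database here.
        total += len(batch)
    return total


print(count_in_batches(iter(range(25)), size=10))
```

The last, shorter batch is handled by the same loop, so no separate "leftover" branch is needed after it.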
@rtobar , since data is written into the database in steps of 10.000 I do not understand why the final 4471 items in the generator would create a problem to leave the try - except block. At one point I simply gave up and went to bed, while the program was still running. So I would say I waited fairly long. :-)
With respect to the print(len(list(grund_objects)))... I believe the initialization of the grund_objects generator, does not need much time, still the print does not deliver anything. There should not be as many items in the grund_objects than there are in the other generator of 1.384.471. How can I find out how many items the generator contains?
OK, there are a few different things thrown in there...
> @rtobar , since data is written into the database in steps of 10.000 I do not understand why the final 4471 items in the generator would create a problem to leave the try - except block.
I don't know what these "final 4471" items you are referring to are, but....
It seems like you have an idea of how many items are in the "GrundList" list, is that correct? And maybe you are expecting ijson to stop after that list finishes in the JSON stream, give up reading the rest of the file, and exit? If so, then you have the wrong expectations: ijson.items always parses through the whole stream until it ends (i.e., it reads the entire file). So while you'd expect ijson to finish rather quickly, it will still take a long time to go through the entire file. This is also why I asked about the backend you are using: the fastest is orders of magnitude faster than the slowest, so you want to make sure you're using the best one available if you are dealing with big files.
There are reasons for having to read streams in their entirety, but it's mostly because JSON doesn't enforce keys to be unique -- so there could be a second/third/fourth/.... "GrundList" list in your JSON document which would match your prefix.
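To see the duplicate-key point concretely, here is a small illustration using only the standard library json module (not ijson). The document content is made up for the example.

```python
import json

# A made-up document in which the "GrundList" key appears twice.
doc = '{"GrundList": [1, 2], "Other": 0, "GrundList": [3]}'

# A plain dict keeps only the last value for a repeated key...
as_dict = json.loads(doc)
print(as_dict)

# ...but the parser accepts both occurrences, as the raw pairs show.
as_pairs = json.loads(doc, object_pairs_hook=list)
print(as_pairs)
```

A streaming parser that stopped after the first "GrundList" would never see the second one, which sits at the very end of the document.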
If you want to iterate over the "GrundList" and quickly exit after that then you'll have to do that yourself, resorting to the ijson.parse routine instead of ijson.items. Here you'll be able to detect when the "GrundList" starts, but also when it finishes, in which case you can break from your loop.
> With respect to the print(len(list(grund_objects)))... I believe the initialization of the grund_objects generator, does not need much time, still the print does not deliver anything
Instantiating a generator in Python takes basically no effort; it is iterating over it that consumes time. In this case the list() construction is what iterates over it. The reason print doesn't print anything is that list() hasn't finished iterating over the generator yet.
> There should not be as many items in the grund_objects than there are in the other generator of 1.384.471
Again, you are mentioning this mysterious 1.384.471 number, and I don't know where it comes from...
> How can I find out how many items the generator contains?
In general you can't tell in advance; the only way is to consume the generator and count how many items it yields (basically what you're doing already).
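If only the count matters, it can be obtained without building a list in memory. The sketch below uses a stand-in generator (numbers, invented for the example) in place of the one returned by ijson.items; it also shows that a generator can only be consumed once.

```python
def count_items(iterable):
    """Count items by consuming the iterable, without storing them."""
    return sum(1 for _ in iterable)


def numbers(limit):
    """Stand-in generator; an ijson.items(...) generator behaves the same way."""
    for n in range(limit):
        yield n


gen = numbers(1000)          # creating the generator does no work yet
first = count_items(gen)     # all the work happens while iterating
second = count_items(gen)    # the generator is now exhausted
print(first, second)
```

With a real file the counting pass still has to parse everything, so it takes as long as any other full iteration.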
This may not always be appropriate, but I have used signals in the past to stop an ijson loop if it doesn't find an item after a certain period of time.
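import signal

import ijson

class TimeoutException(Exception):
    pass

def timeout_handler(signum, frame):
    raise TimeoutException

# Note: signal.SIGALRM and signal.alarm() are only available on Unix.
signal.signal(signal.SIGALRM, timeout_handler)

json_data = []
stop_after = 10
with open("C:/Path/To/JSON/Test.json", "rb") as json_file:
    grund_objects = ijson.items(json_file, 'GrundList.item')
    try:
        # Arm the alarm before the first item so an initial stall also times out.
        signal.alarm(stop_after)
        for o in grund_objects:
            json_data.append(o)
            # Re-arm: the alarm fires only if no item arrives for stop_after seconds.
            signal.alarm(stop_after)
    except TimeoutException:
        print(f'No more objects found after {stop_after} seconds')
    finally:
        # Cancel any pending alarm so it cannot fire after the loop has ended.
        signal.alarm(0)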
Closing as this issue is stale and no further info was provided. It seems it's most likely a problem of having wrong expectations rather than a bug. If still a problem please reopen with more details.