icrar / ijson
This project is forked from isagalaev/ijson
Iterative JSON parser with Pythonic interfaces
Home Page: http://pypi.python.org/pypi/ijson/
License: Other
Hi,
We've run into https://travis-ci.org/github/frictionlessdata/tabulator-py/jobs/678168760 on
items = ijson.items(self.__bytes, path)
for row_number, item in enumerate(items, start=1):
which fails on ijson 3.0.2 but works with 3.0.1.
Input json file:
{
    "0.1": {
        "123": "ok"
    }
}
test file:
import ijson

filename = "input.json"
with open(filename, 'r') as f:
    objects = ijson.items(f, '0.1')
    cities = (o for o in objects)
    for city in cities:
        print(city)
The test file prints nothing.
If I change the key to "01" and the prefix to "01", then {"123":"ok"} prints as expected.
Any help would be appreciated!
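For what it's worth, dumping the raw parse events shows that the dotted key produces a prefix that is textually identical to a nested "0" -> "1" path, which I suspect is where the prefix matching gets confused. A minimal inspection sketch:

import io
import ijson

data = io.BytesIO(b'{"0.1": {"123": "ok"}}')
for prefix, event, value in ijson.parse(data):
    print(repr(prefix), event, value)
# The map under the "0.1" key is reported with prefix '0.1', the same
# string a {"0": {"1": ...}} document would produce.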
If a user doesn't have YAJL installed, I think YAJLImportError is raised instead of ImportError.
So I have a JSON file, roughly 330 MB.
The content looks like this:
{
    "locations": [ {
        "timestampMs": "1231313131313",
        "latitudeE7": 111111111,
        "longitudeE7": 123123131,
        "accuracy": 36,
        "activity": [ {
            "timestampMs": "1211211121121",
            "activity": [ {
                "type": "STILL",
                "confidence": 75
            }, {
                "type": "ON_FOOT",
                "confidence": 10
            }, {
                "type": "IN_VEHICLE",
                "confidence": 5
            }, {
                "type": "ON_BICYCLE",
                "confidence": 5
            }, {
                "type": "UNKNOWN",
                "confidence": 5
            }, {
                "type": "WALKING",
                "confidence": 5
            }, {
                "type": "RUNNING",
                "confidence": 5
            } ]
        } ]
    }, {........
Meaning an array of locations.
If I run this through json.load and then iterate over the result, pulling out the two map keys I want, it takes about 20 seconds. That is doable.
But I cannot load the whole thing into memory anymore; it is too big for my infrastructure, so I found this library. But when I run, for example,
locations = ijson.kvitems(json_file, 'locations.item')
timestampMsObjects = (v for k, v in locations if k == 'timestampMs')
timestampMs = list(timestampMsObjects)
it takes many, many minutes. I don't know exactly how long, because I quit it every time it runs too far.
Why is this? I'm just trying to get the length of that list, to see how many points I'm working with.
Afterwards I want to pull out three map keys and combine them into a smaller object. But first I need to make sure this software is fast enough.
Anyone with some insight on this?
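For reference, here is roughly the shape of what I ultimately want to build, sketched with ijson.items instead of kvitems (the filename is hypothetical; the field names are the ones from the sample above):

import ijson

# Sketch: build one small record per location, keeping only three keys.
with open('locations.json', 'rb') as json_file:  # hypothetical filename
    points = [
        {
            'timestampMs': loc.get('timestampMs'),
            'latitudeE7': loc.get('latitudeE7'),
            'longitudeE7': loc.get('longitudeE7'),
        }
        for loc in ijson.items(json_file, 'locations.item')
    ]
print(len(points))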
Hi,
Thank you very much for the ijson library.
I think I may have found a memory leak when using the yajl2_c backend. I've reused code similar to that found in a previous issue:
https://gist.github.com/mhugo/dec469223e578ea7ec94946edcd43e6f
With yajl2_c:
using backend yajl2_c
using ijson version 3.1.1
starting memory usage: 11.736 MB
spent time: 5.27
memory usage after ijson calls: 204.432 MB
memory usage after garbage collection: 204.432 MB
With yajl2_cffi:
using backend yajl2_cffi
using ijson version 3.1.1
starting memory usage: 18.556 MB
spent time: 16.25
memory usage after ijson calls: 18.556 MB
memory usage after garbage collection: 18.568 MB
Reading data that is formatted as a sort of preamble followed by a stream of similar items in an array, I currently use a combination of ijson.parse and ijson.items: I read the preamble as prefixed events, then process the array items. Something like
{"interesting": "preamble",
"another": "key",
"actual_results": [{"object": 1}, {"object": 2}]}
def read_preamble(events):
    preamble = None
    for prefix, etype, value in events:
        if prefix == 'preamble':
            preamble = value
        if prefix == 'actual_results':
            return preamble

events = ijson.parse(data)
preamble = read_preamble(events)
for result in ijson.items(events, 'actual_results.item'):
    process_result(result)
As of ijson 3.0, items no longer accepts an event generator and requires raw input, which I can't recreate (the actual use case is a sizable HTTP response that I wouldn't want to request twice).
I guess it's possible to use just ijson.parse and mimic the old ijson.items with an ObjectBuilder, but that seems like a step back in usability to me. Is there a better way to accomplish this / can we get an items-like API that enables this?
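For concreteness, this is the kind of ObjectBuilder-based fallback I mean; a rough sketch, assuming ijson.common.ObjectBuilder still exposes event() and value as it did when items() consumed event streams:

from ijson.common import ObjectBuilder

def items_from_events(events, prefix):
    # Rebuild whole objects from (prefix, event, value) triples, mimicking
    # the pre-3.0 pattern of feeding parse() output into items().
    builder = None
    for current, event, value in events:
        if builder is None:
            if current == prefix and event in ('start_map', 'start_array'):
                builder = ObjectBuilder()
                builder.event(event, value)
        else:
            builder.event(event, value)
            if current == prefix and event in ('end_map', 'end_array'):
                yield builder.value
                builder = None

# usage, continuing the example above:
# for result in items_from_events(events, 'actual_results.item'):
#     process_result(result)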
I'm using asyncio.
I make asynchronous requests to a child process that may be connected through stdin or a socket (if on a remote machine), then I wait for JSON responses through the stdin pipe or the socket.
Since the requests are asynchronous, I could get multiple JSON responses one after the other, consecutively.
What I want to achieve is an asynchronous streamer of multiple JSON chunks. For instance, I could get:
{"x":
... 3}
... {"y
... ":4}
or
{"x":3}{"y
... ":4}
For these two cases my code works perfectly, because once I receive the first complete JSON chunk (that is, {"x":3}) I reset the items_coro coroutine, but not the events list.
The problem arises when I get {"x":3}{"y":4} all at once. The library can't parse that, because it is not valid JSON, of course.
But then, how do I stream multiple JSON chunks correctly?! Is that possible?
I cannot even wrap the chunks in a JSON list, because then nothing would get parsed until I receive the closing ], which I won't receive until the child process closes, for instance.
Is there a way to parse JSON chunks by putting them inside a list? Is that the solution? Thanks.
This is my code:
import typing as t
import asyncio
import aioconsole
import ijson

class JSONStreamer:
    BUFSIZE = 4

    def __init__(self, asyncio_file, on_json=None, on_error=None):
        self.input = asyncio_file
        self.jsonEvents = None
        self.jsonCoro = None
        self.active = True
        self.onJSONCallback = on_json
        self.onErrorCallback = on_error

    def __repr__(self):
        return "<JSONStreamer object>"

    def onJSON(self, py_dict: dict[str, t.Any]) -> None:
        if self.onJSONCallback is not None:
            self.onJSONCallback(py_dict)

    def onError(self, exc: Exception) -> None:
        if self.onErrorCallback is not None:
            self.onErrorCallback(exc)

    def close(self) -> None:
        if self.jsonEvents is not None:
            self.jsonEvents = None
        if self.jsonCoro is not None:
            try:
                self.jsonCoro.close()
            except Exception:
                pass
            finally:
                self.jsonCoro = None

    def reset(self) -> None:
        self.close()
        self.jsonEvents = ijson.sendable_list()
        self.jsonCoro = ijson.items_coro(self.jsonEvents, '')

    async def run(self) -> None:
        # RESET the coroutine state.
        if self.jsonEvents is None:
            self.reset()
        # FOREVER.
        while self.active:
            try:
                # READ binary json data. (E.g. from socket, or stdin)
                data = await self.input.read(self.BUFSIZE)
            except asyncio.CancelledError:
                # CLOSE on cancel.
                self.close()
                self.active = False
                return
            # EOF! Close and return.
            if not data:
                self.close()
                self.active = False
                return
            elif len(data) > 0:
                # STRIP unwanted bytes terminators.
                data = data.strip(b"\n\r\t")
                # CHECK if we have some data after stripping.
                if len(data) > 0:
                    try:
                        # SEND this (partial?) JSON data to the coroutine.
                        self.jsonCoro.send(data)
                    except Exception as exc:
                        # ERROR: Unable to parse JSON correctly, maybe malformed.
                        # CALL onError callback.
                        self.onError(exc)
                        # RESET the coroutine state.
                        self.reset()
                        # CONTINUE to grasp JSON.
                        continue
                    must_reset = False
                    # FOREACH correctly parsed JSONs.
                    for py_obj in self.jsonEvents:
                        print("O IS:", py_obj, type(py_obj))
                        # CALL the callback to signal we got a correctly parsed JSON chunk,
                        # that has been already converted to a python object (list or dict).
                        self.onJSON(py_obj)
                        must_reset = True
                    if must_reset:
                        self.reset()
        self.close()

    def _clean_data(self, data):
        return data.strip(b"\n\r\t")

def on_json(py_dict):
    print("ON JSON!!", py_dict)

def on_error(exc):
    print("ON ERROR!!", exc, exc.__class__.__name__)

async def main2():
    stdin, _ = await aioconsole.get_standard_streams()
    print("STDIN TYPE:", type(stdin))
    json_streamer = JSONStreamer(stdin, on_json=on_json, on_error=on_error)
    await json_streamer.run()

if __name__ == "__main__":
    try:
        asyncio.run(main2())
    except BaseException as e:
        print(e.__class__.__name__, e)
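One angle I'm now exploring (unverified): ijson has a multiple_values option, and if the coroutine entry points honour it like the generator ones do, the back-to-back case might just work without any resets. A minimal sketch:

import ijson

events = ijson.sendable_list()
coro = ijson.items_coro(events, '', multiple_values=True)
coro.send(b'{"x":3}{"y":4}')  # two documents in one chunk
coro.close()
print(list(events))  # expected: [{'x': 3}, {'y': 4}]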
RFC 8259 allows non-UTF-8 data in "a closed ecosystem".
I am using ijson to iteratively read JSON from stdin, and I don't presently know a way to change its encoding without either causing an error in ijson or having to buffer the entire input. Among other attempts, I tried monkey-patching b2s in ijson.compat to use a different encoding, but it led to a different error than UnicodeDecodeError.
Is there a way (or a desire) to parse non-UTF-8 data?
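The closest workaround I can think of (a minimal sketch, assuming the data is actually latin-1): decode the byte stream with the real encoding and hand ijson the resulting text stream, letting it re-encode to UTF-8 internally.

import io
import sys
import ijson

# Hypothetical: stdin carries latin-1 encoded JSON.
text_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='latin-1')
for obj in ijson.items(text_stream, 'item'):
    print(obj)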
Opening a 1 GB JSON file and iterating over it using ijson.items takes up a large amount of memory which cannot be reclaimed by running the garbage collector.
Software & environment versions: ijson==3.0.1, Python 3.8, Debian buster.
Included are a Dockerfile and a Python file which fully demonstrate the issue.
# Dockerfile
FROM python:3.8-buster
RUN wget https://usbuildingdata.blob.core.windows.net/usbuildings-v1-1/NewYork.zip
RUN unzip NewYork.zip -d /data
RUN rm NewYork.zip
RUN apt update && apt install -y \
libyajl2
RUN pip3 install ijson==3.0.1 cffi
COPY main.py /
ENTRYPOINT ["python", "/main.py"]
# main.py
import resource
import gc

import ijson.backends.yajl2_c as ijson
# import ijson.backends.python as ijson
# import ijson.backends.yajl2_cffi as ijson

def memusage():
    """memory usage in GB. getrusage defaults to KB on Linux."""
    return str(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)

def iter_features(filename):
    with open(filename, 'rb') as f:
        yield from ijson.items(f, "features.item")

def main():
    print("using backend", ijson.backend)
    print("starting memory usage:", memusage(), 'GB')
    for feature in iter_features('/data/NewYork.geojson'):
        pass
    print("memory usage after reading file:", memusage(), 'GB')
    gc.collect()
    print("memory usage after garbage collection:", memusage(), 'GB')

if __name__ == '__main__':
    main()
Create a new directory with the Dockerfile and main.py and then run:
$ docker build -t test-ijson .
$ docker run -it test-ijson
using backend yajl2_c
starting memory usage: 0.010604 GB
memory after reading file: 5.824332 GB
memory usage after garbage collection: 5.824332 GB
If you change main.py to use the python backend, there is no issue:
using backend python
starting memory usage: 0.01056 GB
memory after reading file: 0.01582 GB
memory usage after garbage collection: 0.01582 GB
Same with yajl2_cffi:
using backend yajl2_cffi
starting memory usage: 0.015032 GB
memory after reading file: 0.016292 GB
memory usage after garbage collection: 0.016292 GB
Though they are, of course, much slower to test than yajl2_c.
Please help. What am I doing wrong?
import asyncio
import ijson  # ijson==3.1.2.post0
from aiofile import AIOFile

data = """
[
"a"
]
"""

async def main():
    with open("test.json", "w") as f1:
        f1.write(data)

    with open("test.json", "r") as f2:
        for obj in ijson.items(f2, prefix="item", use_float=True):
            print(obj)

    # ijson.common.IncompleteJSONError: parse error: trailing garbage
    #           [ "a" ]
    #      (right here) ------^
    async with AIOFile("test.json", "r") as fp:
        async for obj in ijson.items_async(fp, prefix="item", use_float=True):
            print(obj)

asyncio.run(main())
The C backend doesn't respect the multiple_values flag; it always behaves as if it were switched on.
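A minimal reproduction, for reference:

import io
import ijson.backends.yajl2_c as ijson

# With multiple_values left at its default of False, the second top-level
# value should be rejected, yet the C backend happily yields both objects.
for obj in ijson.items(io.BytesIO(b'{"a": 1}{"a": 2}'), ''):
    print(obj)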
If I have a JSON array as the file root, items(f, '') returns a generator with one element (the array itself) instead of a generator over its items.
Example:
import io, ijson

it = ijson.items(io.BytesIO(b'[1, 2, 3]'), '')
for el in it:
    print(el)
# Prints [1, 2, 3] instead of the items 1, 2, and 3
Hey,
I've been playing around with ijson for the last couple of days and it's quite helpful.
I'm having a hard time figuring out the best way to resume a stream process.
The file I'm processing is a 20 GB JSON document. I'm processing it from cloud-based storage, and a stream error of some kind might occur during the processing.
The structure of my JSON is:
{
    "fileVersion": number,
    "id": uuid,
    "items": a very big array (10000s of items) of small-to-medium objects (up to 100 lines each)
}
Is there a recommended way to resume processing partway through, for example with kvitems? I'm using the yajl2_c backend. Thanks for the support!
I am trying to implement the Intercepting Events pattern from https://github.com/ICRAR/ijson#id13 to consume an aiohttp response. When using non-async sources, everything works as expected.
Running Python 3.9.4 (on Kubuntu 20.04), ijson 3.1.4, aiohttp 3.7.4.post0. For the sake of testing all backends I also installed cffi 1.14.5 and the OS package libyajl2:amd64 2.1.0-3. The precise versions do not seem crucial.
The code below uses a JSON file from the web; the specific JSON data is not important. The path specified in the code is not important either.
import asyncio
import traceback

import aiohttp
import ijson

url = 'https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json'

async def run():
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            parse_events = ijson.parse_async(response.content)  # 1 <----
            async for prefix, event, value in parse_events:
                print(prefix, event, value)

        async with session.get(url) as response:
            async for i in ijson.items_async(response.content, "quiz.maths.q2.options.item"):  # 2 <----
                print(i)

        async with session.get(url) as response:
            body = await response.read()
            parse_events = ijson.parse(body)
            for i in ijson.items(parse_events, "quiz.maths.q2.options.item"):  # 3 <----
                print(i)

        for backend in ['yajl2_c', 'yajl2_cffi', 'yajl2', 'python']:
            try:
                ijson_backend = ijson.get_backend(backend)
                async with session.get(url) as response:
                    parse_events = ijson_backend.parse_async(response.content)
                    async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):  # 4 <----
                        print(i)
            except Exception as e:
                print(f"{backend}\n\n\n{traceback.format_exc()}\n\n\n")

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(run())
1, 2, and 3 work fine.
4 raises various exceptions depending on the backend:
yajl2_c
Traceback (most recent call last):
File "/home/federico/python-tests/test.py", line 23, in run
async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 20, in _get_read
if type(await f.read(0)) == compat.bytetype:
AttributeError: '_yajl2._parse_async' object has no attribute 'read'
yajl2_cffi
Traceback (most recent call last):
File "/home/federico/python-tests/test.py", line 23, in run
async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 48, in __anext__
self.read = await _get_read(self.f)
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 20, in _get_read
if type(await f.read(0)) == compat.bytetype:
TypeError: 'NoneType' object is not callable
Exception ignored in: <generator object basic_parse_basecoro at 0x7fb0f35bfdd0>
Traceback (most recent call last):
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2_cffi.py", line 225, in basic_parse_basecoro
yajl_parse(handle, buffer)
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2_cffi.py", line 196, in yajl_parse
raise exception(error)
ijson.common.IncompleteJSONError: parse error: premature EOF
(right here) ------^
yajl2
Traceback (most recent call last):
File "/home/federico/python-tests/test.py", line 23, in run
async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 48, in __anext__
self.read = await _get_read(self.f)
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 20, in _get_read
if type(await f.read(0)) == compat.bytetype:
TypeError: 'NoneType' object is not callable
Exception ignored in: <generator object basic_parse_basecoro at 0x7fb0f35bfcf0>
Traceback (most recent call last):
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2_cffi.py", line 225, in basic_parse_basecoro
yajl_parse(handle, buffer)
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2_cffi.py", line 196, in yajl_parse
raise exception(error)
ijson.common.IncompleteJSONError: parse error: premature EOF
(right here) ------^
Exception ignored in: <generator object basic_parse_basecoro at 0x7fb0f35bf970>
Traceback (most recent call last):
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2.py", line 50, in basic_parse_basecoro
raise exception(error)
ijson.common.IncompleteJSONError: parse error: premature EOF
(right here) ------^
python
Traceback (most recent call last):
File "/home/federico/python-tests/test.py", line 23, in run
async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 48, in __anext__
self.read = await _get_read(self.f)
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 20, in _get_read
if type(await f.read(0)) == compat.bytetype:
TypeError: 'NoneType' object is not callable
Exception ignored in: <generator object basic_parse_basecoro at 0x7fb0f35bfe40>
Traceback (most recent call last):
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2.py", line 50, in basic_parse_basecoro
raise exception(error)
ijson.common.IncompleteJSONError: parse error: premature EOF
(right here) ------^
Exception ignored in: <generator object utf8_encoder at 0x7fb0f3587200>
Traceback (most recent call last):
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 46, in utf8_encoder
target.close()
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 116, in Lexer
target.send(EOF)
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 161, in parse_value
raise common.IncompleteJSONError('Incomplete JSON content')
ijson.common.IncompleteJSONError: Incomplete JSON content
Exception ignored in: <generator object utf8_encoder at 0x7fb0f35870b0>
Traceback (most recent call last):
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 46, in utf8_encoder
target.close()
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 116, in Lexer
target.send(EOF)
File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 161, in parse_value
raise common.IncompleteJSONError('Incomplete JSON content')
ijson.common.IncompleteJSONError: Incomplete JSON content
I tried a few combinations of parse_async/parse/items_async/items/async for/for, but without luck.
Am I doing something wrong, or is there an issue?
There are several syntax errors in some of the example code. Here's a version that works:
import io
import ijson

fo = open("foo.txt", "rb")
print("Name of the file: ", fo.name)
line = fo.read(10)
fo.close()

parse_events = ijson.parse(io.BytesIO(b'["skip", {"a": 1}, {"b": 2}, {"c": 3}]'))
while True:
    prefix, event, value = next(parse_events)
    if value == "skip":
        break
for obj in ijson.items(parse_events, 'item'):
    print(obj)
Hi.
I think I've found a memory leak in the asyncio interface.
I'm using the latest ijson==3.0.4 with Python 3.7 on macOS Mojave.
Here's an example of the leak:
import asyncio
import ijson.backends.yajl2_c as ijson

class AsyncReaderWrapper:
    def __init__(self, stream):
        self._stream = stream

    async def read(self, value: int):
        if value == 0:
            return b""
        return self._stream.read()

async def parse_json_async(json_fp):
    async for objects in ijson.parse_async(AsyncReaderWrapper(json_fp)):
        yield objects

async def amain():
    events = 0
    with open("100mb.json", "rb") as json_fp:
        async for prefix, event, value in parse_json_async(json_fp):
            events += 1
    print(f"Got {events}")
    print("Press any key...")
    input()

if __name__ == "__main__":
    asyncio.run(amain())
And a sync version for comparison:
import ijson.backends.yajl2_c as ijson

def main():
    events = 0
    with open("100mb.json", "rb") as json_fp:
        for prefix, event, value in ijson.parse(json_fp):
            events += 1
    print(f"Got {events}")
    print("Press any key...")
    input()

if __name__ == "__main__":
    main()
I've used memory_profiler on both and got the following results:
[memory_profiler plot for the async version]
It looks like, when I'm using the async interface, memory is released only after the whole file has been processed.
I tried all backends and all have comparable results.
There is no ijson 3.1.2.post0 sdist on PyPI. Could you please publish it?
Thanks.
I may have misunderstood the library, but is there any way to loop over structures without any prefixes? I've been working in a memory-constrained environment where I wish to replace something like the following:
data = json.load(srcfile)
for value in data.values():
    do_something(value)
i.e. converting some JSON into CSV.
To do this with ijson I've had to drop down to basic_parse(), where I manually handle start_array and end_array, concatenate values, and ignore other events. But it feels like there should be another way.
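What I was hoping for is something like kvitems with an empty prefix; if it indeed yields top-level (key, value) pairs one at a time, it would mirror the json.load version above. An untested sketch:

import io
import ijson

srcfile = io.BytesIO(b'{"row1": {"x": 1}, "row2": {"x": 2}}')
for key, value in ijson.kvitems(srcfile, ''):
    print(key, value)  # each top-level value arrives without loading the rest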
Running Python 3.8.2 (on Linux), ijson 3.1.post0, backend yajl2_c, aiofiles 0.5.0.
When trying to asynchronously get items from a simple JSON file, a TypeError: _get_read() missing 1 required positional argument: 'f' is raised.
Code:
import aiofiles
import asyncio
import ijson

ijson_backend = ijson.get_backend("yajl2_c")

async def main():
    async with aiofiles.open("test.json", "r", encoding="utf-8") as buff:
        async for i in ijson_backend.items_async(buff, "item"):
            print(i)

asyncio.run(main())
test.json:
["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]
Output:
Traceback (most recent call last):
File "test_ijson.py", line 12, in <module>
asyncio.run(main())
File "/usr/lib/python3.8/asyncio/runners.py", line 43, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "ignore_test_ijson.py", line 9, in main
async for t in ijson_backend.items_async(buff, "item"):
TypeError: _get_read() missing 1 required positional argument: 'f'
I would expect normal operation, printing all items from the test.json file.
Synchronous code works correctly:
import ijson

ijson_backend = ijson.get_backend("yajl2_c")

def noasync_main():
    with open("test.json", "r", encoding="utf-8") as buff:
        for i in ijson_backend.items(buff, "item"):
            print(i)

noasync_main()
Output:
1
2
3
4
5
6
7
8
9
10
Update: the yajl2_cffi backend works correctly.
Code:
import aiofiles
import asyncio
import ijson

ijson_backend = ijson.get_backend("yajl2_cffi")

async def main():
    async with aiofiles.open("test.json", "r", encoding="utf-8") as buff:
        async for i in ijson_backend.items_async(buff, "item"):
            print(i)

asyncio.run(main())
Output:
1
2
3
4
5
6
7
8
9
10
If I understand correctly, there is the built-in items wrapper for iterating over items in a list, but there isn't one for iterating over keys in a dictionary.
I've seen the solution for the special case when the keys are at the top level of the JSON (isagalaev#62 (comment)), but what if the large collection of keys is not at the top level? E.g.
{
    "my_big_data": {
        "1": 1,
        "2": 2
    }
}
Would it be difficult to add a function analogous to items, where one can specify the prefix of the dictionary to iterate over and which returns the keys and values? A sketch of the call shape I have in mind follows below.
I guess besides the implementation, there is also the question of what to call it. It is perhaps a bit unfortunate that in Python 3 the natural name for a dictionary iterator returning keys and values would be items, but I guess that is already taken ;-)
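For illustration (using the kvitems name that later releases adopted, and assuming the signature mirrors items):

import io
import ijson

f = io.BytesIO(b'{"my_big_data": {"1": 1, "2": 2}}')
for key, value in ijson.kvitems(f, 'my_big_data'):
    print(key, value)  # "1" 1, then "2" 2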
Hi,
I was trying to use ijson with a JSON stream coming from a zip archive through a libarchive binding. Unfortunately, the package I tried first exposes only a generator for getting the file bytes out of the zip:
https://github.com/Changaco/python-libarchive-c/blob/master/libarchive/entry.py#L48-L56
This is apparently not currently supported by ijson? At least I was getting very strange errors (internal C errors with the default C backend, "too many values to unpack" with the python backend using .items()), which I eventually narrowed down to the generator when using .basic_parse(). Would it make sense to support generators as a source as well, or is that somehow fundamentally incompatible?
(Meanwhile I've switched to the other python libarchive binding, which does offer a file-like interface for reading from the archive.)
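A workaround that seems plausible (an untested sketch): wrap the generator in a minimal file-like object, so ijson gets the read() method it expects.

class GeneratorReader:
    """Adapt a generator of bytes chunks to the read(n) interface."""

    def __init__(self, gen):
        self._gen = gen
        self._buf = b''

    def read(self, n=-1):
        # Pull chunks until the size hint is satisfied or the generator ends.
        while n < 0 or len(self._buf) < n:
            try:
                self._buf += next(self._gen)
            except StopIteration:
                break
        if n < 0:
            data, self._buf = self._buf, b''
        else:
            data, self._buf = self._buf[:n], self._buf[n:]
        return data

# usage: ijson.basic_parse(GeneratorReader(bytes_generator))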
Is your feature request related to a problem? Please describe.
It'd be great if there were a way to rapidly break a large stream of multiple JSON values (i.e. the multiple_values option) into its constituent values. For use cases where you just need to know e.g. the number of JSON values in a stream, or need to multiplex an incoming stream across threads, or simply substring-match the entire raw JSON value without first interpreting it, this is a pretty useful feature. As a point of reference, some JSON libraries, like Golang's, support this out of the box: there, you can decode a JSON-containing byte array into a json.RawMessage, which just copies the byte array.
Describe the solution you'd like
I'd like some equivalent to ijson.items that simply produces the original bytes (possibly copied) instead of parsing the items themselves.
Describe alternatives you've considered
If I had full control over the production of these JSON streams, I could require that the output be newline-delimited. At present, this is not the case.
I think the current workaround is to run jq -cM in a subprocess and pipe the stream into jq, which forces sequences like {}{} to be produced as {}\n{}\n. I could try to reserialize the original items, but that doesn't always result in the desired behavior (and would probably be slower than the jq equivalent): reserializing mangles the original bytes, which may not be acceptable when searching for item-level substring matches.
Description
I want to parse an array using ijson.items, stop on StopIteration, and then proceed to parse the trailing content using the underlying ijson.parse object. In the code below, the final line fails. If I replace while True with for i in range(3), then I parse all of arr without a StopIteration, and I can go on to parse the trailing content. This doesn't help me, though, because I don't know how long arr is...
Thank you for your help.
import io
import ijson

parse_events = ijson.parse(io.BytesIO(b'''
{
    "leading": "hi",
    "arr": [ 1, 2, 3 ],
    "trailing": "bye"
}
'''))

# not shown: iterate over parse_events to process "leading"

arr_iter = ijson.items(parse_events, 'arr.item')
try:
    while True:
        print(next(arr_iter))
except StopIteration:
    print('Caught StopIteration')

next(parse_events)  # I want to read "trailing" now but get StopIteration
With the increase of Arm CPUs in datacenters and the upcoming Apple migration to Arm, the use of Python on these platforms is growing. However, installing Python modules without wheels often fails or is very slow, and the error messages users see do not clearly identify the problem as a missing build dependency. Publishing a wheel is typically low effort, just a few lines in the build script, and for users it saves significant time by avoiding troubleshooting and waiting for builds to finish.
ijson uses cibuildwheel via GitHub Actions. If you're open to it, the easiest way to create arm64 wheels would be to move to cibuildwheel on Travis CI. I would be happy to try and create a pull request in that direction.
Me again, sorry. I'm trying to stream data from a web call directly into ijson. However, if I use...
parser = ijson.parse(cellset_response.iter_lines())
for prefix, event, value in parser:
    pass
I get a 'too many values to unpack (expected 2)' error. If I use iter_content() instead of iter_lines(), I get 'not enough values to unpack (expected 2, got 1)'.
It seems I can't win. :D
I have dots, spaces and parentheses in the keys of my JSON: how do I reference them in the prefix? Escaping them with a backslash doesn't seem to work.
Hi guys,
we are using this library to parse incomplete JSON, specifically its Python backend.
We noticed that sometimes it raises UnexpectedSymbol instead of IncompleteJSONError, and after some investigation we found that the problem is JSON ending with an incomplete null, true or false value, e.g. {"a": n. It seems that the lexer's state machine does not recognize it.
Do you think this is a valid case, and if so, can it be fixed?
Thank you for your response.
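A minimal reproduction, for reference:

import io
import ijson.backends.python as ijson

# Input truncated inside the null literal: this raises UnexpectedSymbol,
# where we would have expected IncompleteJSONError.
list(ijson.items(io.BytesIO(b'{"a": n'), ''))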
The python backend will not throw an IncompleteJSONError when parsing a JSON document that has leading 0s in a numeric value. When using the yajl2_c parser, the error is thrown as expected.
For example, when given this invalid JSON, both backends should throw the error:
{"should_fail": 001}
However, as the output shows, only the yajl2 backend fails, while the python parser parses 001 as 1:
>>> import ijson.backends.python as ijson
>>> print ([x for x in ijson.items('{"should_fail": 001}', '')])
[{'should_fail': 1}]
>>> import ijson.backends.yajl2_c as ijson
>>> print ([x for x in ijson.items('{"should_fail": 001}', '')])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
ijson.common.IncompleteJSONError: parse error: after key and value, inside map, I expect ',' or '}'
{"should_fail": 001}
(right here) ------^
Python Version: 3.6.12
ijson Version: 3.1.1
I've hit another snag with my libarchive experiment.
(Currently using https://github.com/smartfile/python-libarchive, working with this file-like object: https://github.com/smartfile/python-libarchive/blob/master/libarchive/__init__.py#L183)
When using either the python or the yajl2_c backend, I hit some error condition at the end of the file. It works correctly with the yajl2_cffi or yajl2 backends.
Here's what happens when using yajl2_c:
count = 0
with libarchive.Archive('myarchive.zip') as archive:
    for entry in archive:
        if entry.pathname == 'myjson.json':
            jsonstream = archive.readstream(entry.size)
            objects = ijson.items(jsonstream, 'somekey.item')
            for o in objects:
                count += 1
print(count)

[prints the correct number of things, so everything seems to be parsed right to the end]
Traceback (most recent call last):
  File "myfile.py", line 60, in <module>
    for o in objects:
TypeError: a bytes-like object is required, not 'NoneType'
With the python backend (same code):
[same correct number again]
Traceback (most recent call last):
  File "myfile.py", line 60, in <module>
    for o in objects:
  File "/usr/lib/python3.9/site-packages/ijson/utils.py", line 55, in coros2gen
    f.send(value)
  File "/usr/lib/python3.9/site-packages/ijson/backends/python.py", line 36, in utf8_encoder
    sdata = decode(bdata, final)
  File "/usr/lib/python3.9/codecs.py", line 321, in decode
    data = self.buffer + input
TypeError: can't concat NoneType to bytes
I'm not sure if I'm doing something wrong in calling ijson, or if my file stream from libarchive is somehow "weird" (but I wouldn't know how, really). The odd thing is that it works with the cffi backend. Oh, and it also works with all backends if I .read() the full jsonstream into memory (but this defeats the point of using ijson, of course).
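Judging by the two tracebacks, my guess is that readstream's read() returns None at EOF instead of b''. If so, a thin wrapper that normalises that (untested sketch) might sidestep the problem:

class NoneSafeReader:
    """Wrap a stream whose read() may return None at EOF."""

    def __init__(self, stream):
        self._stream = stream

    def read(self, n=-1):
        data = self._stream.read(n)
        return data if data is not None else b''

# usage: objects = ijson.items(NoneSafeReader(jsonstream), 'somekey.item')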
It would be nice if ijson had PyInstaller hooks, so we can bundle it as part of a standalone application.
This code from __init__.py causes problems when building with PyInstaller, because dynamically loaded modules are not included:
return importlib.import_module('ijson.backends.' + backend)
For now I'm getting around this by specifying the following in the .spec file:
hiddenimports=['ijson.backends.yajl2_c']
I have code that re-orders JSON keys into a standardized order using OrderedDict.move_to_end. I want to use ijson to read the input iteratively. Presently, I think I would need to convert the dict that ijson returns into an OrderedDict, but my data has deep JSON objects, so this would be a fairly expensive operation. It would be faster to parse the data into an OrderedDict directly.
Is there any interest in adding this feature?
Currently the new kvitems method is implemented in python, which is what all backends use. The C backend, however, should see performance benefits from having this method implemented in C, with an expected speedup of around 3x-4x depending on the use case.
See #18 (comment) and #18 (comment) for reference.
Hi,
I am using this module because parsing with json.loads() makes the memory usage of my already large JSON string (about 900 MB) go up by about 10x (over 9 GB). I was expecting to be able to parse the JSON line by line. It works, but I was a little surprised that when I call ijson.parse() it grabs about 3 GB of memory. May I ask why the memory usage is so large? More conversion to dictionaries behind the scenes?
Thanks,
Ray
I don't know if this is too narrow a use case for this library, or if there is another way to do this.
I work with large remote JSON objects. If I know a JSON key occurs near the start of the file, I might download only a few kilobytes and then use ijson.items to read that key. ijson will not raise an IncompleteJSONError, because it never had to read to the end of the file. So far, so good.
While most of these JSON objects are standardized, some are not. For example, a publisher might wrap a standard object inside another object, like {"result": { standardized content } }. My data pipeline currently handles this by being able to set the prefix for a given remote file (in this case result). A pipeline step returns the data at that prefix, using ijson.items.
When I combine these two tactics, I run into trouble. ijson.items(data, 'result') tries to read the entire result key, but the JSON is incomplete, so it raises an error.
One solution is to collapse the two steps into one, i.e. index to result.some_key in one step. In my case, this is undesirable, because then I'll have multiple pipelines using similar code, which is more work to maintain.
Another solution might be a method that just returns the bytes at the given prefix, without parsing them. (Since the prefix might match one or more items, I guess the return value would be strings of bytes.) In that case, I could then use ijson.items as usual on one of the return values.
The recent commit to the Python backend does this; it'd be useful to do the same in the other backends.
Code example:
test_ijson.py
import json
import ijson
import codecs

with codecs.open('test.json', encoding="utf-8") as json_file:
    form = ijson.items(json_file, 'menu.test_items.item')
    print(form)
    forms = (o for o in form)
    print(forms)
    for objects in forms:
        pass
test.json
{"menu": {"header": "SVG Viewer","test_items": [{"id": "Open"},{"id": "OpenNew", "label": "Open New"},{"id": "ZoomIn", "label": "Zoom In"},{"id": "ZoomOut", "label": "Zoom Out"},{"id": "OriginalView", "label": "Original View"},{"id": "Quality"},{"id": "Pause"},{"id": "Mute"},{"id": "Find", "label": "Find..."}]}}
Result:
<_yajl2.items object at 0x7fb21010f650>
<generator object at 0x7fb2100e0c00>
Traceback (most recent call last):
File "test_ijson.py", line 11, in
for objects in forms:
File "test_ijson.py", line 9, in
forms = (o for o in form)
TypeError: expected bytes, str found
If I import the python backend as ijson, everything works fine.
Is there a solution for parsing JSON files with syntax problems?
Can this type of error be skipped?
IncompleteJSONError: lexical error: invalid char in json text.
type":"Point","coordinates":[NaN,NaN],"crs":{"type":"link","
(right here) ------^
This is the actual pipeline being applied:
def get_generator(gen, export_path, batch_size):
    records = []
    for i, ob in enumerate(gen):
        records.append(get_record(ob))
        if len(records) == batch_size:
            records = save_parquet(records, i, export_path)
            print('saved:', str(i+1))
    if len(records) > 0:
        records = save_parquet(records, i, export_path)
        print('saved:', str(i+1))

def read_json_parquet(file_name, export_path, batch_size):
    Path(export_path).mkdir(parents=False, exist_ok=False)
    with open(file_name, 'rb') as file:
        obj = ijson.items(file, 'features.item')
        obj = get_generator(obj, export_path=export_path, batch_size=batch_size)
Hi, thanks for maintaining this package. I'm just curious: the PyPI package links to this repo, but links from the original repo point to the PyPI package. Have you started publishing the package now?
Describe the bug
ijson.kvitems(prefix) iterates through items even after the prefix was found and closed.
How to reproduce
I first generated an example.json file with the following code:
import json

def generate_example(n):
    with open("example.json", "w") as f:
        json.dump({"a": {"foo": 13}, "b": {"bar": list(range(n))}}, f)
This gives the following JSON for n=5
{"a": {"foo": 13}, "b": {"bar": [0, 1, 2, 3, 4]}}
Then I tried to load only the items under the prefix a with the two following functions:
import ijson

def g():
    with open("example.json", "r") as f:
        result = list(ijson.kvitems(f, prefix="a", use_float=True))
    print(result)

def h():
    with open("example.json", "r") as f:
        under_prefix = False
        result = []
        for prefix, event, value in ijson.parse(f):
            if prefix == "a" and event == "start_map":
                under_prefix = True
            if under_prefix:
                result.append((prefix, event, value))
            if prefix == "a" and event == "end_map":
                break
    print(result)
To compare the two functions, I created a big example.json file with n=10000000.
The function h took 0.43 s to execute, while the function g took 2.60 s. Thus it seems that ijson.kvitems(prefix) iterates through items even after the prefix was found and closed (i.e. the end_map event was seen).
Expected behavior
ijson.kvitems(prefix) should stop iterating through items once the prefix was found and closed.
Execution information:
Thanks!
Dealing with asyncio streams, I was wondering how realistic it would be to make this awesome library accept a push pattern.
I'm thinking about something like:
import ijson

def on_event(event, value):
    ...  # Do something

async def collect_json(source):
    parser = ijson.Parser(on_event)
    async for chunk in source:
        parser.feed(chunk)
    parser.feed_eof()
The current system only supports blocking I/O, and there is no way to emulate the above without using threads and pipes/queues, unfortunately.
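For what it's worth, the coroutine entry points used elsewhere in this tracker come close to this push pattern already. A rough sketch, assuming basic_parse_coro and sendable_list behave as they do in those examples:

import ijson

events = ijson.sendable_list()
coro = ijson.basic_parse_coro(events)
for chunk in (b'{"x"', b': 3}'):
    coro.send(chunk)  # push bytes as they arrive
coro.close()          # signal EOF
print(events)         # accumulated (event, value) pairs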
ijson.backends.python.Lexer() has a main loop which looks for a number or a single-character lexeme and enters a simple decision tree. If the lexeme starts a string, the rest of the string is read in, with buffer updates as necessary, and then yielded out. If it does not start a string, the Lexer always attempts to extend the lexeme. In general this isn't an issue, but if the file stream is wrapped around a socket, it can lead to significant parser lag and handshake stalemates, as both parties wait for the other to transmit another chunk of data.
(See ijson/backends/python.py, lines 34 to 63 in d754f9e.)
My guess is that the yajl backends do not share this issue, but I've not pulled up my Linux machine to check.
The current default backend users get when importing ijson is the pure-python one. On the plus side, this will always import correctly, but the downside is that it's the one with the worst performance. On a typical installation, importing the other backends wouldn't be an issue, so we can probably offer a better backend by default by iterating over the alternatives, importing them, and returning the first one that imports.
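A sketch of that idea, assuming the backend module names stay as they are:

import importlib

def _default_backend():
    # Try the fastest backends first, falling back to pure python.
    for name in ('yajl2_c', 'yajl2_cffi', 'yajl2', 'python'):
        try:
            return importlib.import_module('ijson.backends.' + name)
        except ImportError:
            continue
    raise ImportError('no ijson backend available')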
I have come across this error when using ijson to parse a big JSON file:
IncompleteJSONError: lexical error: invalid char in json text.
{ "_id" : ObjectId("5e5d193d8cf3fe97fa488
(right here) ------^
My source code is as follows:
with codecs.open('lagou.json', 'rb') as f:
    objects = ijson.items(f, 'item')
    print(objects.__next__())
It really confuses me why this would be an invalid char.
Is your feature request related to a problem? Please describe.
I'm working on packaging this project for Void Linux, which doesn't have YAJL 1.x in its repos. I want to skip/remove this backend, but the only way I've found to do this is to manually remove ijson/backends/yajl.py and patch out the line that generates tests for the yajl backend.
Describe the solution you'd like
Some sort of build flag would be nice (I'm not familiar with the Python module ecosystem), or a way to detect whether YAJL 1.x is available and exclude the backend based on that.
The patch works for now if not.
Many people online seem lost on how to use the prefix to select objects from an ijson.items call. Documenting the prefix syntax would clear things up.
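For instance, a short snippet showing the prefixes that parse reports, which are what items matches against, would go a long way (expected output sketched in comments; numbers arrive as Decimal by default):

import io
import ijson

doc = io.BytesIO(b'{"a": {"b": [1, 2]}}')
for prefix, event, value in ijson.parse(doc):
    print(repr(prefix), event, value)
# ''         start_map   None
# ''         map_key     a
# 'a'        start_map   None
# 'a'        map_key     b
# 'a.b'      start_array None
# 'a.b.item' number      1
# 'a.b.item' number      2
# 'a.b'      end_array   None
# 'a'        end_map     None
# ''         end_map     None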
Hi, it seems there is a breaking change in version 2.5 which should not be breaking according to SemVer:
items = ijson.items(self.__chars, path)
E TypeError: expected bytes, str found
It breaks our whole stack https://github.com/frictionlessdata for all new installations (because of the tabulator-py dependency).
Is it possible to handle this change differently? For example, deprecating the previous behavior while supporting it until the next major version, or something like this.
I have a huge stream (if saved: a 3.5 GB JSON file, usually received from a unix pipe), which is processed with ijson (Python 3.7, conda):
ijson.version.__version__ == '3.0'
cat file.json | ./process.py
A simple code sample (of course I process the source object, but this is enough to trigger the problem):
#!/usr/bin/env python3
# process.py
import ijson
import sys

json_objects = ijson.items(sys.stdin, 'item._source')
for source in json_objects:
    continue
Exception (around the 360,000th object):
cat x.json | ./process.py
Traceback (most recent call last):
  File "./process.py", line 8, in <module>
    for source in json_objects:
  File "/home/tobi/miniconda3/lib/python3.7/site-packages/ijson/compat.py", line 31, in read
    return self.str_reader.read(n).encode('utf-8')
  File "/home/tobi/miniconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 47886: invalid start byte
My question is: how do I correctly handle this kind of stream byte and/or encoding error with ijson? 99% of the stream is OK, but sometimes there is a problem; how do I handle stream encoding and formatting errors?
I couldn't find any solution to put into a try...except, because the error happens while iterating over the objects that ijson generates, so it would have to be handled there...
Thank you.
Tamas
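The closest thing to a workaround I've found so far (a sketch, not sure it's idiomatic): decode the byte stream myself with errors='replace' and hand ijson the resulting text stream, so stray invalid bytes are substituted instead of aborting the parse (note this silently alters affected string values):

#!/usr/bin/env python3
import io
import sys
import ijson

# Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
raw = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8', errors='replace')
for source in ijson.items(raw, 'item._source'):
    continue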
See the discussion in #61. The idea is to have a new stop_after_first or similar flag in kvitems that will allow it to skip processing the rest of the input data after yielding the key-value pairs from the first object matching the prefix.