
ijson's Issues

ijson.items prefix doesn't work if key contains "."

Input json file:

{
    "0.1": {
      "123":"ok"
    }
}

test file:

import ijson
filename = "input.json"
with open(filename, 'r') as f:
    objects = ijson.items(f, '0.1')
    cities = (o for o in objects)
    for city in cities:
        print(city)

The test script prints nothing.
If I change the key to be "01", and prefix to be "01" then {"123":"ok"} prints as expected.

Any help would be appreciated!

This is very very slow on my computer

So I have a JSON file, roughly 330 MB.
The content looks like this:

{
  "locations" : [ {
    "timestampMs" : "1231313131313",
    "latitudeE7" : 111111111,
    "longitudeE7" : 123123131,
    "accuracy" : 36,
    "activity" : [ {
      "timestampMs" : "1211211121121",
      "activity" : [ {
        "type" : "STILL",
        "confidence" : 75
      }, {
        "type" : "ON_FOOT",
        "confidence" : 10
      }, {
        "type" : "IN_VEHICLE",
        "confidence" : 5
      }, {
        "type" : "ON_BICYCLE",
        "confidence" : 5
      }, {
        "type" : "UNKNOWN",
        "confidence" : 5
      }, {
        "type" : "WALKING",
        "confidence" : 5
      }, {
        "type" : "RUNNING",
        "confidence" : 5
      } ]
    } ]
  }, {........

Meaning an array of locations.
If I run this through json.load, then iterate over the result and pull out the two map keys I want, it takes about 20 seconds. That is doable.

But I cannot load the whole thing into memory anymore, it is too big for my infrastructure, so I found this library. But when I run, for example:

    locations = ijson.kvitems(json_file, 'locations.item')
    timestampMsObjects = (v for k, v in locations if k == 'timestampMs')
    timestampMs = list(timestampMsObjects)

It takes many, many minutes. I don't know how long exactly, because I quit it every time it runs too long.

Why is this? I'm just trying to get the length of that list,
to see how many points I'm working with.

Afterwards I want to pull out 3 map keys and combine them into a smaller object. But I first need to make sure this software is fast enough.

Anyone with some insight on this?
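
A minimal sketch of how one could first narrow this down (the file name is assumed): open the file in binary mode and ask ijson.items for just the one field via its full prefix, and make sure a fast backend such as yajl2_c is installed, since the pure-python fallback is dramatically slower.

import ijson

# File name assumed; open in binary mode so ijson works on bytes directly.
with open('location_history.json', 'rb') as json_file:
    # Pull only the timestamps via the full prefix, instead of filtering kvitems pairs.
    timestamps = ijson.items(json_file, 'locations.item.timestampMs')
    count = sum(1 for _ in timestamps)
    print(count)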

Memory leak with yajl2_c backend

Hi,

Thank you very much for the ijson library.

I think I may have found a memory leak when using the yajl2_c backend. I've reused code similar to the one found in a previous issue:

https://gist.github.com/mhugo/dec469223e578ea7ec94946edcd43e6f

With yajl2_c:

using backend yajl2_c
using ijson version 3.1.1
starting memory usage: 11.736 MB
spent time: 5.27
memory usage after ijson calls: 204.432 MB
memory usage after garbage collection: 204.432 MB

With yajl2_cffi:

using backend yajl2_cffi
using ijson version 3.1.1
starting memory usage: 18.556 MB
spent time: 16.25
memory usage after ijson calls: 18.556 MB
memory usage after garbage collection: 18.568 MB

ijson.common.items removed in 3.0 [was: Unable to use items with event generator]

Reading data that is formatted as a sort of preamble followed by a stream of similar items in an array, I am currently using a combination of ijson.parse and ijson.items: reading the preamble as prefixed events, then processing the array items. Something like:

{"interesting": "preamble",
 "another": "key",
 "actual_results": [{"object": 1}, {"object": 2}]}
def read_preamble(events):
    preamble = None
    for prefix, etype, value in events:
        if prefix == 'preamble':
            preamble = value
        if prefix == 'actual_results':
            return preamble

events = ijson.parse(data)
preamble = read_preamble(events)
for result in ijson.items(events, 'actual_results.item'):
    process_result(result)

As of ijson 3.0, items no longer accepts an event generator and requires a raw input, which I can't recreate (actual use case is a sizable HTTP response that I wouldn't want to request twice).

I guess it's possible to use just ijson.parse and mimic the old ijson.items with an ObjectBuilder, but that seems a step back in usability to me. Is there a better way to accomplish this / can we get an items-like API that enables this?
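
For reference, a rough sketch of the ObjectBuilder-based workaround mentioned above; this is only an approximation of the old ijson.common.items behaviour for map/array values, not the library's own implementation.

from ijson.common import ObjectBuilder

def items_from_events(events, prefix):
    """Rebuild whole objects found at `prefix` from (prefix, event, value) tuples."""
    builder = None
    for current, event, value in events:
        if builder is None:
            if current == prefix and event in ('start_map', 'start_array'):
                builder = ObjectBuilder()
                builder.event(event, value)
        else:
            builder.event(event, value)
            if current == prefix and event in ('end_map', 'end_array'):
                yield builder.value
                builder = None

# Usage with the snippet above:
# events = ijson.parse(data)
# preamble = read_preamble(events)
# for result in items_from_events(events, 'actual_results.item'):
#     process_result(result)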

How to correctly stream *multiple* chunks of JSON?

I'm using asyncio.
I do asynchronous requests to a child process that may be connected through stdin or a socket (if in a remote machine), then I wait for JSON responses through the stdin pipe or the socket.
Since the requests are asynchronous, I could get multiple JSON responses one after the other, consecutively.

What I want to achieve is an asynchronous streamer of multiple JSON chunks, for instance i could get:
{"x": ... 3} ... {"y ... ":4}
or
{"x":3}{"y ... ":4}

For these 2 cases my code works perfectly, because once I receive the first complete JSON chunk (that is, {"x":3}) I reset the items_coro coroutine, but not the events list.

The problem arises when I get {"x":3}{"y":4} all at once. The library can't parse that because it is not valid JSON, of course.
But then, how can I stream multiple JSON chunks correctly?! Is that possible?

I cannot even wrap the chunks in a JSON list, because if I put all those JSON chunks inside a list, it will never get parsed until I receive the closing ], which I won't receive until the child process closes, for instance.
Is there a way to parse JSON chunks by having them inside a list? Is that the solution? Thanks.
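
For reference, a short sketch of ijson's multiple_values option (available in 3.x and also accepted by items_coro), which targets exactly this back-to-back case; whether it fits the asyncio setup below is an assumption.

import io
import ijson

# multiple_values=True lets the parser accept several top-level JSON values in a row.
data = io.BytesIO(b'{"x": 3}{"y": 4}')
for obj in ijson.items(data, '', multiple_values=True):
    print(obj)  # {'x': 3}, then {'y': 4}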

This is my code:

import typing as t
import asyncio
import aioconsole
import ijson

class JSONStreamer:
    BUFSIZE = 4

    def __init__(self, asyncio_file, on_json=None, on_error=None):
        self.input = asyncio_file
        self.jsonEvents = None
        self.jsonCoro = None
        self.active = True
        self.onJSONCallback = on_json
        self.onErrorCallback = on_error

    def __repr__(self):
        return "<JSONStreamer object>"

    def onJSON(self, py_dict: dict[str, t.Any]) -> None:
        if self.onJSONCallback is not None:
            self.onJSONCallback(py_dict)

    def onError(self, exc: Exception) -> None:
        if self.onErrorCallback is not None:
            self.onErrorCallback(exc)

    def close(self) -> None:
        if self.jsonEvents is not None:
            self.jsonEvents = None
        if self.jsonCoro is not None:
            try:
                self.jsonCoro.close()
            except Exception:
                pass
            finally:
                self.jsonCoro = None

    def reset(self) -> None:
        self.close()
        self.jsonEvents = ijson.sendable_list()
        self.jsonCoro = ijson.items_coro(self.jsonEvents, '')

    async def run(self) -> None:
        # RESET the coroutine state.
        if self.jsonEvents is None:
            self.reset()
        # FOREVER.
        while self.active:
            try:
                # READ binary json data. (E.g. from socket, or stdin)
                data = await self.input.read(self.BUFSIZE)
            except asyncio.CancelledError:
                # CLOSE on cancel.
                self.close()
                self.active = False
                return
            # EOF! Close and return.
            if not data:
                self.close()
                self.active = False
                return
            elif len(data) > 0:
                # STRIP unwanted bytes terminators.
                data = data.strip(b"\n\r\t")
                # CHECK if we have some data after stripping.
                if len(data) > 0:
                    try:
                        # SEND this (partial?) JSON data to the coroutine.
                        self.jsonCoro.send(data)
                    except Exception as exc:
                        # ERROR: Unable to parse JSON correctly, maybe malformed.
                        # CALL onError callback.
                        self.onError(exc)
                        # RESET the coroutine state.
                        self.reset()
                        # CONTINUE to grasp JSON.
                        continue
                    must_reset = False
                    # FOREACH correctly parsed JSONs.
                    for py_obj in self.jsonEvents:
                        print("O IS:", py_obj, type(py_obj))
                        # CALL the callback to signal we got a correctly parsed JSON chunk,
                        # that has been already converted to a python object (list or dict).
                        self.onJSON(py_obj)
                        must_reset = True
                    if must_reset:
                        self.reset()
        self.close()

    def _clean_data(self, data):
        return data.strip(b"\n\r\t")


def on_json(py_dict):
    print("ON JSON!!", py_dict)

def on_error(exc):
    print("ON ERROR!!", exc, exc.__class__.__name__)

async def main2():
    stdin, _ = await aioconsole.get_standard_streams()
    print("STDIN TYPE:", type(stdin))
    json_streamer = JSONStreamer(stdin, on_json=on_json, on_error=on_error)
    await json_streamer.run()


if __name__ == "__main__":
    try:
        asyncio.run(main2())
    except BaseException as e:
        print(e.__class__.__name__, e)

Parsing non-UTF-8 data

RFC 8259 allows non-UTF-8 data in "a closed ecosystem".

I am using ijson to iteratively read JSON from stdin, and I don't presently know a way to change its encoding, without either causing an error in ijson or having to buffer the entire input. Among other attempts, I tried monkey-patching b2s in ijson.compat to use a different encoding, but it led to a different error than UnicodeDecodeError.

Is there a way (or a desire) to parse non-UTF-8 data?
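
One possible workaround, sketched under the assumption that the input's real encoding is known (latin-1 is used here purely as an example): wrap the byte stream in a small object whose read() transcodes to UTF-8 before ijson ever sees the data.

import codecs
import sys

import ijson

class TranscodingReader:
    """Re-encode a byte stream from a known source encoding to UTF-8 on the fly."""

    def __init__(self, f, encoding):
        self._f = f
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def read(self, n):
        chunk = self._f.read(n)
        # Only finalize the decoder when a real read returns nothing (true EOF).
        text = self._decoder.decode(chunk, final=(n > 0 and not chunk))
        return text.encode('utf-8')

for prefix, event, value in ijson.parse(TranscodingReader(sys.stdin.buffer, 'latin-1')):
    ...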

Memory leak in yajl2_c backend

Opening a 1 GB JSON file and iterating over it using ijson.items takes up a large amount of memory which cannot be reclaimed by running the garbage collector.

Software & environment versions:
ijson==3.0.1, Python 3.8, Debian buster.

Included are a Dockerfile and Python file which fully demonstrate the issue.


# Dockerfile
FROM python:3.8-buster

RUN wget https://usbuildingdata.blob.core.windows.net/usbuildings-v1-1/NewYork.zip

RUN unzip NewYork.zip -d /data
RUN rm NewYork.zip

RUN apt update && apt install -y \
    libyajl2

RUN pip3 install ijson==3.0.1 cffi

COPY main.py /

ENTRYPOINT ["python", "/main.py"]

# main.py
import resource
import gc

import ijson.backends.yajl2_c as ijson
# import ijson.backends.python as ijson
# import ijson.backends.yajl2_cffi as ijson


def memusage():
    """memory usage in GB. getrusage defaults to KB on Linux."""
    return str(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)


def iter_features(filename):
    with open(filename, 'rb') as f:
        yield from ijson.items(f, "features.item")


def main():
    print("using backend", ijson.backend)
    print("starting memory usage:", memusage(), 'GB')

    for feature in iter_features('/data/NewYork.geojson'):
        pass

    print("memory usage after reading file:", memusage(), 'GB')
    gc.collect()
    print("memory usage after garbage collection:", memusage(), 'GB')


if __name__ == '__main__':
    main()

Create a new directory with the Dockerfile and main.py and then run:

$ docker build -t test-ijson .
$ docker run -it test-ijson
using backend yajl2_c
starting memory usage: 0.010604 GB
memory after reading file: 5.824332 GB
memory usage after garbage collection: 5.824332 GB

if you change main.py to use the python backend, there is no issue:

using backend python
starting memory usage: 0.01056 GB
memory after reading file: 0.01582 GB
memory usage after garbage collection: 0.01582 GB

Same with yajl2_cffi:

using backend yajl2_cffi
starting memory usage: 0.015032 GB
memory after reading file: 0.016292 GB
memory usage after garbage collection: 0.016292 GB

Though they are, of course, much slower to test than yajl2_c.

It is not possible to parse a file with newline (\n etc.)

Please help. What am I doing wrong?

import asyncio

import ijson
# ijson==3.1.2.post0

from aiofile import AIOFile

data = """
[
 "a"
]
"""


async def main():
    with open("test.json", "w") as f1:
        f1.write(data)

    with open("test.json", "r") as f2:
        for obj in ijson.items(f2, prefix="item", use_float=True):
            print(obj)

    # ijson.common.IncompleteJSONError: parse error: trailing garbage
    #                                        [  "a" ]
    #                      (right here) ------^
    async with AIOFile("test.json",  "r") as fp:
        async for obj in ijson.items_async(fp, prefix="item", use_float=True):
            print(obj)

asyncio.run(main())

Get items iterator from root

If I have a JSON array as the file root, using items(f, '') returns a generator with 1 item (the array itself) instead of a generator over its elements.

Example:

import io, ijson
it = ijson.items(io.BytesIO(b'[1, 2, 3]'), '')
for el in it:
  print(el)
# Prints [1, 2, 3] instead of the items 1, 2, and 3
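
For comparison, a short example using the 'item' prefix, which is how ijson addresses the elements of a top-level array:

import io
import ijson

# 'item' selects each element of the root array individually.
for el in ijson.items(io.BytesIO(b'[1, 2, 3]'), 'item'):
    print(el)  # 1, then 2, then 3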

Resuming stream processing

Hey,

I've been playing around with ijson for the last couple days and it's quite helpful.
I'm having a hard time figuring out what's the best way to resume a stream process.
The file I'm processing is a 20 GB JSON file. I'm processing it from cloud-based storage, and a stream error of some kind might occur during processing.
The structure of my json is:
{
  "fileVersion": number,
  "id": uuid,
  "items": a very big array (10000s of items) of small-to-medium objects (up to 100 lines each)
}

  1. If the process is interrupted after 18 GB, how can I resume at that point?
  2. Is there a more efficient way to process the json other than a single-threaded kvitems? I'm using the yajl2_c backend.

Thanks for the support!

Event interception not available on async functions

I am trying to implement the Intercepting Events pattern from https://github.com/ICRAR/ijson#id13 to consume an aiohttp response. When using non-async sources everything works as expected.

Running Python 3.9.4 (on Kubuntu 20.04), ijson 3.1.4, aiohttp 3.7.4.post0. For the sake of testing all backends I also installed cffi 1.14.5 and the OS package libyajl2:amd64 2.1.0-3. The precise versions do not seem crucial.
The code below uses a json file from the web, the specific json data is not important. The path specified in the code is not important either.

import asyncio
import traceback

import aiohttp
import ijson

url = 'https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json'

async def run():
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            parse_events = ijson.parse_async(response.content)  # 1 <----
            async for prefix, event, value in parse_events:
                print(prefix, event, value)
        async with session.get(url) as response:
            async for i in ijson.items_async(response.content, "quiz.maths.q2.options.item"):  # 2 <----
                print(i)
        async with session.get(url) as response:
            body = await response.read()
            parse_events = ijson.parse(body)
            for i in ijson.items(parse_events, "quiz.maths.q2.options.item"):  # 3 <----
                print(i)
        for backend in ['yajl2_c', 'yajl2_cffi', 'yajl2', 'python']:
            try:
                ijson_backend = ijson.get_backend(backend)
                async with session.get(url) as response:
                    parse_events = ijson_backend.parse_async(response.content)
                    async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):   # 4 <----
                        print(i)
            except Exception as e:
                print(f"{backend}\n\n\n{traceback.format_exc()}\n\n\n")


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(run())

1, 2, and 3 work fine.
4 raises various exceptions depending on the backend:

yajl2_c


Traceback (most recent call last):
  File "/home/federico/python-tests/test.py", line 23, in run
    async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 20, in _get_read
    if type(await f.read(0)) == compat.bytetype:
AttributeError: '_yajl2._parse_async' object has no attribute 'read'




yajl2_cffi


Traceback (most recent call last):
  File "/home/federico/python-tests/test.py", line 23, in run
    async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 48, in __anext__
    self.read = await _get_read(self.f)
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 20, in _get_read
    if type(await f.read(0)) == compat.bytetype:
TypeError: 'NoneType' object is not callable




Exception ignored in: <generator object basic_parse_basecoro at 0x7fb0f35bfdd0>
Traceback (most recent call last):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2_cffi.py", line 225, in basic_parse_basecoro
    yajl_parse(handle, buffer)
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2_cffi.py", line 196, in yajl_parse
    raise exception(error)
ijson.common.IncompleteJSONError: parse error: premature EOF
                                       
                     (right here) ------^

yajl2


Traceback (most recent call last):
  File "/home/federico/python-tests/test.py", line 23, in run
    async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 48, in __anext__
    self.read = await _get_read(self.f)
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 20, in _get_read
    if type(await f.read(0)) == compat.bytetype:
TypeError: 'NoneType' object is not callable




Exception ignored in: <generator object basic_parse_basecoro at 0x7fb0f35bfcf0>
Traceback (most recent call last):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2_cffi.py", line 225, in basic_parse_basecoro
    yajl_parse(handle, buffer)
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2_cffi.py", line 196, in yajl_parse
    raise exception(error)
ijson.common.IncompleteJSONError: parse error: premature EOF
                                       
                     (right here) ------^

Exception ignored in: <generator object basic_parse_basecoro at 0x7fb0f35bf970>
Traceback (most recent call last):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2.py", line 50, in basic_parse_basecoro
    raise exception(error)
ijson.common.IncompleteJSONError: parse error: premature EOF
                                       
                     (right here) ------^

python


Traceback (most recent call last):
  File "/home/federico/python-tests/test.py", line 23, in run
    async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 48, in __anext__
    self.read = await _get_read(self.f)
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 20, in _get_read
    if type(await f.read(0)) == compat.bytetype:
TypeError: 'NoneType' object is not callable




Exception ignored in: <generator object basic_parse_basecoro at 0x7fb0f35bfe40>
Traceback (most recent call last):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2.py", line 50, in basic_parse_basecoro
    raise exception(error)
ijson.common.IncompleteJSONError: parse error: premature EOF
                                       
                     (right here) ------^

Exception ignored in: <generator object utf8_encoder at 0x7fb0f3587200>
Traceback (most recent call last):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 46, in utf8_encoder
    target.close()
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 116, in Lexer
    target.send(EOF)
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 161, in parse_value
    raise common.IncompleteJSONError('Incomplete JSON content')
ijson.common.IncompleteJSONError: Incomplete JSON content
Exception ignored in: <generator object utf8_encoder at 0x7fb0f35870b0>
Traceback (most recent call last):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 46, in utf8_encoder
    target.close()
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 116, in Lexer
    target.send(EOF)
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 161, in parse_value
    raise common.IncompleteJSONError('Incomplete JSON content')
ijson.common.IncompleteJSONError: Incomplete JSON content

I tried a few combinations of parse_async/parse/items_async/items/async for/for, but without luck.

Am i doing something wrong or is there an issue?

Example code issue

There are several syntax errors in some of the example code. Here's a version that works:

import io
import ijson

# Open a file
fo = open("foo.txt", "rb")
print("Name of the file: ", fo.name)

line = fo.read(10)

# Close opened file
fo.close()

parse_events = ijson.parse(io.BytesIO(b'["skip", {"a": 1}, {"b": 2}, {"c": 3}]'))
while True:
    prefix, event, value = next(parse_events)
    if value == "skip":
        break
for obj in ijson.items(parse_events, 'item'):
    print(obj)

Memory leak in asyncio interface

Hi.
I think I've found a memory leak in the asyncio interface.
I'm using the latest ijson==3.0.4 with Python 3.7 on macOS Mojave.

Here's an example of the leak:

import asyncio
import ijson.backends.yajl2_c as ijson


class AsyncReaderWrapper:
    def __init__(self, stream):
        self._stream = stream

    async def read(self, value: int):
        if value == 0:
            return b""
        return self._stream.read()


async def parse_json_async(json_fp):
    async for objects in ijson.parse_async(AsyncReaderWrapper(json_fp)):
        yield objects


async def amain():
    events = 0
    with open("100mb.json", "rb") as json_fp:
        async for prefix, event, value in parse_json_async(json_fp):
            events += 1

    print(f"Got {events}")
    print("Press any key...")
    input()


if __name__ == "__main__":
    asyncio.run(amain())

And a sync version for comparison:

import ijson.backends.yajl2_c as ijson


def main():
    events = 0
    with open("100mb.json", "rb") as json_fp:
        for prefix, event, value in ijson.parse(json_fp):
            events += 1

    print(f"Got {events}")
    print("Press any key...")
    input()


if __name__ == "__main__":
    main()

I've used memory_profiler on both and got following results:

Async version: (memory_profiler output screenshot omitted)

Sync version: (memory_profiler output screenshot omitted)

It looks like when I'm using the async interface, memory is released only after the whole file has been processed.
I tried all backends and all show comparable results.

Working without prefixes?

I may have misunderstood the library, but is there any way to loop over structures without any prefixes? I've been working in a memory-constrained environment where I wish to replace something like the following

data = json.load(srcfile)
for value in data.values():
    do_something(value)

i.e. converting some JSON into CSV.

To do this with ijson I've had to drop down to basic_parse(), where I manually handle start_array and end_array, concatenate values, and ignore other events. But it feels like there should be another way.
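
A minimal sketch of one alternative, assuming ijson 3.0 or later where kvitems is available: it streams the top-level object's key/value pairs (an empty prefix means the root), so the whole dict never has to be built in memory.

import ijson

# File name assumed; do_something is the same processing function as in the snippet above.
with open('src.json', 'rb') as srcfile:
    for key, value in ijson.kvitems(srcfile, ''):
        do_something(value)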

(yajl2_c) TypeError when reading from an aiofiles file

Running Python 3.8.2 (on Linux), ijson 3.1.post0, backend yajl2_c, aiofiles 0.5.0.

When trying to asynchronously get items from a simple json file, a TypeError: _get_read() missing 1 required positional argument: 'f' is raised.

Code:

import aiofiles
import asyncio
import ijson

ijson_backend = ijson.get_backend("yajl2_c")

async def main():
    async with aiofiles.open("test.json", "r", encoding="utf-8") as buff:
        async for i in ijson_backend.items_async(buff, "item"):
            print(i)

asyncio.run(main())

test.json:

["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]

Output:

Traceback (most recent call last):
  File "test_ijson.py", line 12, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.8/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "ignore_test_ijson.py", line 9, in main
    async for t in ijson_backend.items_async(buff, "item"):
TypeError: _get_read() missing 1 required positional argument: 'f'

I would expect normal operation that would print all items from the test.json file.


Synchronous code works correctly:

import ijson

ijson_backend = ijson.get_backend("yajl2_c")


def noasync_main():
    with open("test.json", "r", encoding="utf-8") as buff:
        for i in ijson_backend.items(buff, "item"):
            print(i)


noasync_main()

Output:

1
2
3
4
5
6
7
8
9
10

Update: the yajl2_cffi backend works correctly:

Code:

import aiofiles
import asyncio
import ijson

ijson_backend = ijson.get_backend("yajl2_cffi")

async def main():
    async with aiofiles.open("test.json", "r", encoding="utf-8") as buff:
        async for i in ijson_backend.items_async(buff, "item"):
            print(i)

asyncio.run(main())

Output:

1
2
3
4
5
6
7
8
9
10

add support for iterating over key/value pairs?

If I understand correctly, there is the built-in items wrapper for iterating over items in a list, but there isn't one for iterating over keys in a dictionary.

I've seen the solution for the special case when the keys are at the top level of the JSON isagalaev#62 (comment) but what if the large list of keys is not at the top level? E.g.

{
 "my_big_data":
  { 
    "1": 1,
    "2": 2
  }
}

Would it be difficult to add a function analogous to items where one can specify the prefix of the dictionary to iterate over, and which returns the keys and values?

I guess besides the implementation, there is also the question what to call it.
It is perhaps a bit unfortunate that in python 3 the natural name for the dictionary iterator returning keys and values would actually be items but I guess that is already taken ;-)
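
For the record, this is essentially what ijson.kvitems (added in ijson 3.0) does; a short sketch against the example above:

import io
import ijson

data = io.BytesIO(b'{"my_big_data": {"1": 1, "2": 2}}')

# kvitems yields (key, value) pairs of the object found at the given prefix.
for key, value in ijson.kvitems(data, 'my_big_data'):
    print(key, value)  # "1" 1, then "2" 2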

work with generator source

Hi,
I was trying to use ijson with a JSON stream coming from a zip archive through a libarchive binding. Unfortunately, the package I tried first exposes only a generator for getting the file bytes out of the zip:

https://github.com/Changaco/python-libarchive-c/blob/master/libarchive/entry.py#L48-L56

This is apparently not currently supported by ijson? At least I was getting very strange errors (internal C errors with the default C backend, "too many values to unpack" with the python backend using .items()), which I could eventually narrow down to the generator when using .basic_parse(). Would it make sense to support generators as a source as well, or is that somehow fundamentally incompatible?

(Meanwhile I've switched to using the other python libarchive binding which does offer a file-like interface for reading from the archive.)
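
One workaround that seems to fit here, sketched as an assumption rather than a supported ijson feature: adapt the byte-chunk generator into a minimal file-like object with a read() method, which ijson does accept.

class GeneratorReader:
    """Minimal file-like wrapper over a generator that yields byte chunks."""

    def __init__(self, gen):
        self._gen = gen
        self._buf = b""

    def read(self, n):
        # Accumulate chunks until we can satisfy the request or the generator ends.
        while len(self._buf) < n:
            try:
                self._buf += next(self._gen)
            except StopIteration:
                break
        chunk, self._buf = self._buf[:n], self._buf[n:]
        return chunk

# objects = ijson.items(GeneratorReader(byte_chunk_generator), 'prefix.item')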

Output items sequence as original bytes

Is your feature request related to a problem? Please describe.

It'd be great if there were a way to rapidly break a large stream of multiple JSON values (i.e. the multiple_values option) into its constituent values. For use-cases where you just need to know e.g. the number of JSON values in a stream, or need to multiplex an incoming stream across threads, or simply substring match the entire raw JSON value without first interpreting it, this is a pretty useful feature. As a point of reference, some JSON libraries like Golang's support this out of the box: in that case, you can decode a JSON-containing byte array into a json.RawMessage, which just copies the byte array.

Describe the solution you'd like

I'd like some equivalent to ijson.items that simply produces the original bytes (possibly copied) instead of parsing the items themselves.

Describe alternatives you've considered

If I had full control over the production of these JSON streams, I could require that the output were newline-delimited. At present, this is not the case.

I think the current workaround is to run jq -cM in a subprocess and pipe the stream into jq, which will force sequences like {}{} to get produced as {}\n{}\n. I could try to reserialize the original items, but that doesn't always result in the desired behavior (and would probably be slower than the jq equivalent). This is an imperfect solution because it'll mangle the original bytes, which may not be the desired behavior when searching for item-level substring matches.

Continue parsing after using ijson.items on an array

Description

I want to parse an array using ijson.items, stop on StopIteration, and then proceed to parse trailing content using the underlying ijson.parse object. In the code below, the final line fails. If I replace while True with for i in range(3) then I parse all of arr without a StopIteration and I can go on to parse the trailing content. This doesn't help me though, because I don't know how long arr is...

Thank you for your help.

import io
import ijson

parse_events = ijson.parse(io.BytesIO(b'''
{
  "leading": "hi",
  "arr": [ 1, 2, 3 ],
  "trailing": "bye"
}
'''))

# not shown: iterate over parse_events to process "leading"

arr_iter = ijson.items(parse_events, 'arr.item')

try:
    while True:
        print(next(arr_iter))
except StopIteration:
    print('Caught StopIteration')

next(parse_events) # I want to read "trailing" now but get StopIteration

Arm64 wheel

With the increase of Arm CPUs in datacenters and the upcoming Apple migration to Arm, the use of Python on these platforms is growing. However, installing Python modules without wheels often fails or is very slow, and the error messages users see do not clearly identify the problem as a missing build dependency. Publishing a wheel is typically low effort, just a few lines in the build script. For users this saves significant time by avoiding troubleshooting and not having to wait for build processes to finish.

ijson uses cibuildwheel via GitHub Actions. If you're open to it, the easiest way to create arm64 wheels would be to move to cibuildwheel on travis.com. I would be happy to try and create a pull request in that direction.

ijson.parse iter_lines() returns error too many values to unpack (expected 2)

Me again, sorry. I'm trying to stream data from a web call directly into ijson. However, if I use...

parser = ijson.parse(cellset_response.iter_lines())
for prefix, event, value in parser:
    pass

I get a 'too many values to unpack (expected 2)' error. If I use iter_content() instead of iter_lines(), I get 'not enough values to unpack (expected 2, got 1)'.

It seems I can't win. :D
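
Assuming the response comes from the requests library with stream=True, one sketch that avoids this is to hand ijson the underlying file-like object rather than iter_lines()/iter_content(), which return generators that ijson does not accept (see the "work with generator source" issue above):

import ijson

# cellset_response is assumed to come from requests.get(..., stream=True);
# response.raw exposes read(), which ijson can consume directly.
# Note that response.raw does not decode gzip/deflate transfer encodings by default.
for prefix, event, value in ijson.parse(cellset_response.raw):
    pass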

IncompleteJSONError is not recognized for null and boolean values

Hi guys,

we are using this library to be able to parse incomplete JSON, specifically its Python backend.

We noticed that sometimes it raises UnexpectedSymbol instead of IncompleteJSONError, and after some investigation we found that the problem is JSON that ends with an incomplete null, true, or false value, e.g. {"a": n. It seems that the lexer's state machine does not recognize this case.

Do you think this is a valid case and if so, can it be fixed?

Thank you for your response.

Python Backend Allows Leading 0s in Numbers

The python backend will not throw an IncompleteJSONError when parsing a JSON file that has leading 0s in the numeric value. When using the yajl2_c parser, the error is thrown as expected.

For example, when given this invalid JSON, both backends should throw the error.

{"should_fail": 001}

However, as the output shows, only the yajl2_c backend fails, while the python parser parses 001 as 1:

>>> import ijson.backends.python as ijson
>>> print ([x for x in ijson.items('{"should_fail": 001}', '')])
[{'should_fail': 1}]
>>> import ijson.backends.yajl2_c as ijson
>>> print ([x for x in ijson.items('{"should_fail": 001}', '')])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
ijson.common.IncompleteJSONError: parse error: after key and value, inside map, I expect ',' or '}'
                       {"should_fail": 001}
                     (right here) ------^

Python Version: 3.6.12
ijson Version: 3.1.1

errors in python and yajl2_c backends

I've hit another snag with my libarchive experiment:

(Currently using https://github.com/smartfile/python-libarchive, working with this file-like object: https://github.com/smartfile/python-libarchive/blob/master/libarchive/__init__.py#L183)

When using either the python or the yajl2_c backend I hit some error condition at the end of the file. It works correctly with the yajl2_cffi or yajl2 backends.

Here's what happens when using yajl2_c:

count = 0
with libarchive.Archive('myarchive.zip') as archive:
    for entry in archive:
        if entry.pathname == 'myjson.json':

            jsonstream = archive.readstream(entry.size)
            objects = ijson.items(jsonstream, 'somekey.item')
            for o in objects:
                count +=1
                print(count)
[prints the correct number of things, so everything is parsed right to the end it seems]
Traceback (most recent call last):
  File "myfile.py", line 60, in <module>
    for o in objects:
TypeError: a bytes-like object is required, not 'NoneType'

With the python backend (same code):

[same correct number again]
Traceback (most recent call last):
  File myfile.py", line 60, in <module>
    for o in objects:
  File "/usr/lib/python3.9/site-packages/ijson/utils.py", line 55, in coros2gen
    f.send(value)
  File "/usr/lib/python3.9/site-packages/ijson/backends/python.py", line 36, in utf8_encoder
    sdata = decode(bdata, final)
  File "/usr/lib/python3.9/codecs.py", line 321, in decode
    data = self.buffer + input
TypeError: can't concat NoneType to bytes

I'm not sure if I'm doing something wrong when calling ijson, or whether my file stream from libarchive is somehow "weird" (though I wouldn't know how, really). The strange thing is that it works with the cffi backend. It also works with all backends if I .read() the full jsonstream into memory (but that defeats the point of using ijson, of course).
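
Reading the two tracebacks, a guess (only a guess) is that this archive stream's read() returns None at end of file instead of an empty bytes object, which the cffi backend happens to tolerate. A thin wrapper that normalizes the EOF value would test that theory:

class NoneSafeReader:
    """Wrap a file-like object whose read() may return None at end of file."""

    def __init__(self, f):
        self._f = f

    def read(self, n):
        # ijson expects b"" at EOF, never None.
        return self._f.read(n) or b""

# objects = ijson.items(NoneSafeReader(jsonstream), 'somekey.item')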

PyInstaller Hooks

It would be nice if ijson had PyInstaller hooks so we can bundle it as part of a standalone application.

This code from __init__.py causes problems when building with PyInstaller because dynamically loaded modules are not included:

return importlib.import_module('ijson.backends.' + backend)

For now I'm getting around this by specifying the below in the .spec file
hiddenimports=['ijson.backends.yajl2_c']
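
A sketch of what such a hook might look like (untested; collect_submodules is PyInstaller's standard helper for pulling in dynamically imported packages):

# hook-ijson.py -- place in a directory passed to PyInstaller via --additional-hooks-dir
from PyInstaller.utils.hooks import collect_submodules

# Collect every backend module so the dynamic import in ijson's __init__.py keeps working.
hiddenimports = collect_submodules('ijson.backends')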

Add support for user-specified mapping type [was: Parsing into OrderedDict]

I have code that re-orders JSON keys into a standardized order using OrderedDict.move_to_end. I want to use ijson to read the input iteratively. Presently, I think I would need to convert the dict that ijson returns into an OrderedDict, but my data has deep JSON objects, so this would be a fairly expensive operation. It would be faster to parse the data into an OrderedDict directly.

Is there an interest in adding this feature?

Memory Explosion Using Parser

Hi,

I am using this module because parsing with json.loads() makes memory usage for my already large JSON string (about 900 MB) go up by about 10x (over 9 GB). I was expecting to be able to parse the JSON line by line. It works, but I was a little surprised that when I call ijson.parse() it grabs about 3 GB of memory. May I ask why the memory usage is so large? More conversion to dictionaries behind the scenes?

Thanks
Ray

Feature request: Return bytes at prefix

I don't know if this is too narrow a use case for this library, or if there is another way to do this.

I work with large remote JSON objects. If I know a JSON key occurs near the start of the file, I might download only a few kilobytes, and then use ijson.items to read that key. ijson will not raise an IncompleteJSONError, because it never had to read to the end of the file. So far, so good.

While most of these JSON objects are standardized, some are not. For example, a publisher might wrap a standard object inside another object, like {"result": { standardized content } }. My data pipeline currently handles this, by being able to set the prefix for a given remote file (in this case result). A pipeline step returns the data at that prefix, using ijson.items.

When I combine these two tactics, I run into trouble. ijson.items(data, 'result') tries to read entire result key, but the JSON is incomplete, so it raises an error.

One solution is to collapse the two steps into one, i.e. index to result.some_key in one step. In my case, this is undesirable, because then I'll have multiple pipelines using similar code, which is more work to maintain.

Another solution might be to have a method that just returns the bytes from the given prefix, without parsing them. (Since the prefix might include one or more item, I guess the return value would be a string of bytes.) In that case, I could then use ijson.items as usual on one of the return values.

Error with 2.5

Code Example:

test_ijson.py

import json
import ijson
import codecs

with codecs.open('test.json', encoding="utf-8") as json_file:
    form = ijson.items(json_file, 'menu.test_items.item')
    print(form)
    forms = (o for o in form)
    print(forms)
    for objects in forms:
        pass

test.json

{"menu": {"header": "SVG Viewer","test_items": [{"id": "Open"},{"id": "OpenNew", "label": "Open New"},{"id": "ZoomIn", "label": "Zoom In"},{"id": "ZoomOut", "label": "Zoom Out"},{"id": "OriginalView", "label": "Original View"},{"id": "Quality"},{"id": "Pause"},{"id": "Mute"},{"id": "Find", "label": "Find..."}]}}

Result:

<_yajl2.items object at 0x7fb21010f650>
<generator object at 0x7fb2100e0c00>
Traceback (most recent call last):
  File "test_ijson.py", line 11, in
    for objects in forms:
  File "test_ijson.py", line 9, in
    forms = (o for o in form)
TypeError: expected bytes, str found

If I import the python backend as ijson, everything works fine.
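
One likely fix, sketched here under the assumption that the file is plain UTF-8 on disk: open the file in binary mode so the yajl2_c backend receives bytes rather than str.

import ijson

# Binary mode ('rb') gives the C backend the bytes it expects.
with open('test.json', 'rb') as json_file:
    for item in ijson.items(json_file, 'menu.test_items.item'):
        print(item)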

IncompleteJSONError: lexical error: invalid char in json text

Is there a solution for parsing json files with syntax problems?

Can this type of error be skipped?

IncompleteJSONError: lexical error: invalid char in json text.
          type":"Point","coordinates":[NaN,NaN],"crs":{"type":"link","
                     (right here) ------^

This is the actual pipeline being applied:

def get_generator(gen, export_path, batch_size):
  records = []
  for i, ob in enumerate(gen):
    records.append(get_record(ob))
    if len(records)==batch_size:
      records = save_parquet(records, i, export_path)
      print('saved:', str(i+1))
  if len(records) > 0:
    records = save_parquet(records, i, export_path)
    print('saved:', str(i+1))
    
def read_json_parquet(file_name, export_path, batch_size):
  Path(export_path).mkdir(parents=False, exist_ok=False)
  with open(file_name, 'rb') as file:
    obj = ijson.items(file, 'features.item')
    obj = get_generator(obj, export_path=export_path, batch_size=batch_size)

Which repo does current pypi release originate from?

Hi, thanks for maintaining this package. I'm just curious, since the PyPI package links to this repo but links from the original repo point to the PyPI package. Have you started publishing the package now?

ijson.kvitems(prefix) iterates through items even after the prefix was found and closed

Describe the bug
ijson.kvitems(prefix) iterates through items even after the prefix was found and closed.

How to reproduce

I first generated an example.json file with the following code.

import json


def generate_example(n):
    with open("example.json", "w") as f:
        json.dump({"a": {"foo": 13}, "b": {"bar": list(range(n))}}, f)

This gives the following JSON for n=5

{"a": {"foo": 13}, "b": {"bar": [0, 1, 2, 3, 4]}}

Then I tried to load only the items under the prefix a with the two following functions:

import ijson


def g():
    with open("example.json", "r") as f:
        result = list(ijson.kvitems(f, prefix="a", use_float=True))

    print(result)


def h():
    with open("example.json", "r") as f:
        under_prefix = False
        result = []
        for prefix, event, value in ijson.parse(f):

            if prefix == "a" and event == "start_map":
                under_prefix = True

            if under_prefix:
                result.append((prefix, event, value))

            if prefix == "a" and event == "end_map":
                break

    print(result)

To compare the two functions, I created a big example.json file with n=10000000.

The function h took 0.43s to execute while the function g took 2.60s. Thus it seems that ijson.kvitems(prefix) iterates through items even after the prefix was found and closed (i.e. the end_map event is found).

Expected behavior
ijson.kvitems(prefix) should stop iterating through items once the prefix was found and closed.

Execution information:

  • Python version: 3.7.4
  • ijson version: 3.1.4
  • ijson backend: I don't know
  • ijson installation method: pip (with poetry 1.1.5)
  • OS: Ubuntu 16.04

Thanks !

Expose push / sax-like interface

Dealing with asyncio streams I was wondering how realistic it would be to make this awesome library into something that accepts a push pattern.

Thinking about something like:

import ijson

def on_event(event, value):
    ...  # Do something

async def collect_json(source):
    parser = ijson.Parser(on_event)
    async for chunk in source:
        parser.feed(chunk)
    parser.feed_eof()

The current system only supports blocking I/O and there is no way to emulate the above without using threads and pipes/queues unfortunately.
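
For comparison, a sketch of how close ijson's existing coroutine functions (parse_coro plus sendable_list, available in ijson 3.x) already come to this push pattern; whether they cover every asyncio use case is left open.

import ijson

async def collect_json(source):
    events = ijson.sendable_list()   # collects (prefix, event, value) tuples
    coro = ijson.parse_coro(events)  # push-style parser targeting that list
    async for chunk in source:
        coro.send(chunk)             # feed raw bytes as they arrive
        for prefix, event, value in events:
            ...                      # handle each event
        del events[:]                # clear before the next chunk
    coro.close()                     # flush and detect premature EOF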

Python Lexer Is Excessively Greedy

ijson.backends.python.Lexer() has a main loop which looks for a number or a single-character lexeme and enters a simple decision tree. If the lexeme starts a string, the rest of the string is read in, with buffer updates as necessary, and then yielded out. If it does not start a string, the Lexer always attempts to extend the lexeme. In general, this isn't an issue, but if the file stream is wrapped around a socket, this can lead to significant parser lag and handshake stalemates as both parties wait for the other to transmit another chunk of data.

if lexeme == '"':
    pos = match.start()
    start = pos + 1
    while True:
        try:
            end = buf.index('"', start)
            escpos = end - 1
            while buf[escpos] == '\\':
                escpos -= 1
            if (end - escpos) % 2 == 0:
                start = end + 1
            else:
                break
        except ValueError:
            data = f.read(buf_size)
            if not data:
                raise common.IncompleteJSONError('Incomplete string lexeme')
            buf += data
    yield discarded + pos, buf[pos:end + 1]
    pos = end + 1
else:
    while match.end() == len(buf):
        data = f.read(buf_size)
        if not data:
            break
        buf += data
        match = LEXEME_RE.search(buf, pos)
        lexeme = match.group()
    yield discarded + match.start(), lexeme
    pos = match.end()

My guess is that the yajl backends do not share this issue, but I've not pulled up my Linux machine to check.

Better default backend

The current default backend users get when importing ijson is the pure-python one. On the plus side this will always import correctly, but it has the downside that it is the one with the worst performance. On a typical installation importing other backends wouldn't be an issue though, so we can probably offer a better backend by default by iterating over the alternatives, importing them, and returning the first one that imports successfully.
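
A sketch of the selection loop described above (names are illustrative, not ijson's actual code):

import importlib

# Fastest candidates first, pure-python last as the guaranteed fallback.
_CANDIDATES = ('yajl2_c', 'yajl2_cffi', 'yajl2', 'python')

def best_available_backend():
    for name in _CANDIDATES:
        try:
            return importlib.import_module('ijson.backends.' + name)
        except ImportError:
            continue
    raise ImportError('no ijson backend could be imported')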

How to fix IncompleteJSONError

I have come across this error when using ijson to parse a big JSON file:

IncompleteJSONError: lexical error: invalid char in json text.
                        {      "_id" : ObjectId("5e5d193d8cf3fe97fa488
                     (right here) ------^

My source code is as followed:

with codecs.open('lagou.json','rb') as f:
    objects = ijson.items(f,'item')
    print(objects.__next__())

It really confuses me why the character 'b' would be an invalid char.

Build without yajl 1.x

Is your feature request related to a problem? Please describe.
I'm working on packaging this project for Void Linux, which doesn't have YAJL 1.x in its repos. I want to skip/remove this backend, but the only way I've found to do this is to manually remove ijson/backends/yajl.py and patch out the line that generates tests for the yajl backend.

Describe the solution you'd like
Some sort of build flag (I'm not familiar with the Python module ecosystem) would be nice, or a way to detect whether yajl 1.x is available and not include the backend based on that.

The patch works for now if not.

Document items' prefix specification

Many people online seem lost on how to use the prefix to select objects from an ijson.items call. Documenting the prefix syntax would clear things up.
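
As a starting point, a small example of the convention the documentation could spell out: prefixes are the dot-joined key path from the root, with array elements addressed by the literal word item.

import io
import ijson

data = io.BytesIO(b'{"a": {"b": [{"x": 1}, {"x": 2}]}}')

# The objects inside the array at a -> b live under the prefix 'a.b.item'.
for obj in ijson.items(data, 'a.b.item'):
    print(obj)  # {'x': 1}, then {'x': 2}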

Breaking change in version 2.5

Overview

Hi, it seems there is a breaking change in version 2.5 which should not be breaking according to SemVer:

items = ijson.items(self.__chars, path)
E       TypeError: expected bytes, str found

It breaks our whole stack https://github.com/frictionlessdata for all new installations (because of the tabulator-py dependency).

Is it possible to handle this change differently? For example, deprecating the previous behavior while supporting it until the next major version or something like this.

Handle processing errors?

I have a huge stream (if saved: a 3.5 GB JSON file, usually received from a unix pipe), which I process with ijson (Python 3.7, conda):

ijson.version.__version__ '3.0'

cat file.json | ./process.py

A simple code sample (of course I actually process the source object, but this is enough to trigger the problem):

#!/usr/bin/env python3
#process.py

import ijson
import sys

json_objects = ijson.items(sys.stdin,'item._source')
for source in json_objects:
    continue

Exception (around the 360,000th object):

cat x.json | ./process.py 
Traceback (most recent call last):
  File "./process.py", line 8, in <module>
    for source in json_objects:
  File "/home/tobi/miniconda3/lib/python3.7/site-packages/ijson/compat.py", line 31, in read
    return self.str_reader.read(n).encode('utf-8')
  File "/home/tobi/miniconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 47886: invalid start byte

My question is: how do I correctly handle this kind of stream byte and/or encoding error with ijson? 99% of the stream is OK, but sometimes there is a problem, so how should stream encoding and formatting errors be handled?
I couldn't find anywhere to put a try...except, because the error happens while iterating over the objects that ijson generates, so it can't easily be handled there...

Thank you.
Tamas
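
One possible approach, sketched with the assumption that replacing the offending bytes is acceptable for this data: decode the stdin bytes with errors='replace' and let ijson consume the resulting text stream (it re-encodes str input to UTF-8 internally, as the traceback above shows).

#!/usr/bin/env python3
import codecs
import sys

import ijson

# Invalid byte sequences become U+FFFD replacement characters instead of raising.
reader = codecs.getreader('utf-8')(sys.stdin.buffer, errors='replace')

for source in ijson.items(reader, 'item._source'):
    continue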
