
ijson's Introduction


ijson

Ijson is an iterative JSON parser with standard Python iterator interfaces.

Ijson is hosted on PyPI, so you should be able to install it via pip:

pip install ijson

Binary wheels are provided for major platforms and Python versions. These are built and published automatically using cibuildwheel via GitHub Actions.

All usage examples will use a JSON document describing geographical objects:

{
  "earth": {
    "europe": [
      {"name": "Paris", "type": "city", "info": { ... }},
      {"name": "Thames", "type": "river", "info": { ... }},
      // ...
    ],
    "america": [
      {"name": "Texas", "type": "state", "info": { ... }},
      // ...
    ]
  }
}

The most common usage is having ijson yield native Python objects out of a JSON stream located under a prefix. This is done using the items function. Here's how to process all European cities:

import ijson
from urllib.request import urlopen

f = urlopen('http://.../')
objects = ijson.items(f, 'earth.europe.item')
cities = (o for o in objects if o['type'] == 'city')
for city in cities:
    do_something_with(city)

For details on how to build a prefix, see the prefix section below.

Other times it might be useful to iterate over object members rather than objects themselves (e.g., when objects are too big). In that case one can use the kvitems function instead:

import ijson
from urllib.request import urlopen

f = urlopen('http://.../')
european_places = ijson.kvitems(f, 'earth.europe.item')
names = (v for k, v in european_places if k == 'name')
for name in names:
    do_something_with(name)

Sometimes, when dealing with a particularly large JSON payload, it may be worth not even constructing individual Python objects and instead reacting to individual events immediately, producing some result. This is achieved using the parse function:

import ijson
from urllib.request import urlopen

# stream is assumed to be a pre-opened, writable file-like object
parser = ijson.parse(urlopen('http://.../'))
stream.write('<geo>')
for prefix, event, value in parser:
    if (prefix, event) == ('earth', 'map_key'):
        stream.write('<%s>' % value)
        continent = value
    elif prefix.endswith('.name'):
        stream.write('<object name="%s"/>' % value)
    elif (prefix, event) == ('earth.%s' % continent, 'end_map'):
        stream.write('</%s>' % continent)
stream.write('</geo>')

Even more bare-bones is the ability to react to individual events without even calculating a prefix, using the basic_parse function:

import ijson
from urllib.request import urlopen

events = ijson.basic_parse(urlopen('http://.../'))
num_names = sum(1 for event, value in events
                if event == 'map_key' and value == 'name')

A command line utility is included with ijson to help visualise the output of each of the routines above. It reads JSON from standard input and prints the results of the parsing method chosen by the user to standard output.

The tool is available by running the ijson.dump module. For example:

$> echo '{"A": 0, "B": [1, 2, 3, 4]}' | python -m ijson.dump -m parse
#: path, name, value
--------------------
0: , start_map, None
1: , map_key, A
2: A, number, 0
3: , map_key, B
4: B, start_array, None
5: B.item, number, 1
6: B.item, number, 2
7: B.item, number, 3
8: B.item, number, 4
9: B, end_array, None
10: , end_map, None

Using -h/--help will show all available options.

A command line utility is included with ijson to help benchmark the different methods offered by the package. It offers some built-in example inputs that try to mimic different scenarios but, more importantly, it also supports user-provided inputs. You can also specify which backends to time, the number of iterations, and more.

The tool is available by running the ijson.benchmark module. For example:

$> python -m ijson.benchmark my/json/file.json -m items -p values.item

Using -h/--help will show all available options.

Although not usually how they are meant to be run, all the functions above also accept bytes and str objects directly as inputs. These are then internally wrapped into a file object and processed further. This is useful for testing and prototyping, but probably not extremely useful in real-life scenarios.
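
For instance, a quick prototype can feed a bytes literal straight into items (a minimal sketch; the data is made up for the example):

import ijson

data = b'{"earth": {"europe": [{"name": "Paris", "type": "city"}]}}'
for city in ijson.items(data, 'earth.europe.item'):
    print(city['name'])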

All of the methods above also work on asynchronous file-like objects, so they can be iterated asynchronously. In other words, something like this:

import asyncio
import ijson

async def run():
    f = await async_urlopen('http://..../')
    async for obj in ijson.items(f, 'earth.europe.item'):
        if obj['type'] == 'city':
            do_something_with(obj)
asyncio.run(run())

An explicit set of *_async functions also exists, offering the same functionality except that they will fail if anything other than an asynchronous file-like object is given to them (so the example above can also be written using ijson.items_async). In fact, in ijson version 3.0 this was the only way to access the asyncio support.

The four routines shown above internally chain against each other: tuples generated by basic_parse are the input for parse, whose results are the input to kvitems and items.

Normally users don't see this interaction, as they only care about the final output of the function they invoked, but there are occasions when tapping into this invocation chain can be handy. This is supported by passing the output of one function (i.e., an iterable of events, usually a generator) as the input of another, opening the door to user event filtering or injection.

For instance if one wants to skip some content before full item parsing:

import io
import ijson

parse_events = ijson.parse(io.BytesIO(b'["skip", {"a": 1}, {"b": 2}, {"c": 3}]'))
while True:
    prefix, event, value = next(parse_events)
    if value == "skip":
        break
for obj in ijson.items(parse_events, 'item'):
    print(obj)

Note that this interception only makes sense for the basic_parse -> parse, parse -> items and parse -> kvitems interactions.
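
For example, here is a minimal sketch of the basic_parse -> parse interaction, feeding the events of one function into the other:

import io
import ijson

basic_events = ijson.basic_parse(io.BytesIO(b'{"a": [1, 2]}'))
for prefix, event, value in ijson.parse(basic_events):
    print(prefix, event, value)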

Note also that event interception is currently not supported by the async functions.

All examples above use a file-like object as the data input (both in the normal case, and for asyncio support), and hence are "pull" interfaces, with the library reading data as necessary. If for whatever reason it's not possible to use such a method, you can still push data through a different interface: coroutines (via generators, not asyncio coroutines). Coroutines effectively allow users to send data at any point in time, with a final target coroutine-like object receiving the results.

In the following example the user is doing the reading instead of letting the library do it:

import functools
import ijson
from urllib.request import urlopen

@ijson.coroutine
def print_cities():
    while True:
        obj = (yield)
        if obj['type'] != 'city':
            continue
        print(obj)

coro = ijson.items_coro(print_cities(), 'earth.europe.item')
f = urlopen('http://.../')
buf_size = 64 * 1024
for chunk in iter(functools.partial(f.read, buf_size), b''):
    coro.send(chunk)
coro.close()

All four ijson iterators have a *_coro counterpart that works by pushing data into them. Instead of receiving a file-like object and an optional buffer size as arguments, they receive a single target argument, which should be a coroutine-like object (anything implementing a send method) through which results will be published.

An alternative to providing a coroutine is to use ijson.sendable_list to accumulate results, provided the list is cleared after each parsing iteration, like this:

import functools
import ijson
from urllib.request import urlopen

events = ijson.sendable_list()
coro = ijson.items_coro(events, 'earth.europe.item')
f = urlopen('http://.../')
buf_size = 64 * 1024
for chunk in iter(functools.partial(f.read, buf_size), b''):
    coro.send(chunk)
    process_accumulated_events(events)
    del events[:]
coro.close()
process_accumulated_events(events)

Additional options are supported by all ijson functions to give users more fine-grained control over certain operations:

  • The use_float option (defaults to False) controls how non-integer values are returned to the user. If set to True users receive float() values; otherwise Decimal values are constructed. Note that building float values is usually faster, but there might be some loss of precision (which most applications will not care about) and an exception will be raised when overflow occurs (e.g., if 1e400 is encountered). This option also has the side-effect that integer numbers bigger than 2^64 (but sometimes 2^32, see backends) will also raise an overflow error, for similar reasons. Future versions of ijson might change the default value of this option to True.
  • The multiple_values option (defaults to False) controls whether multiple top-level values are supported. JSON content should contain a single top-level value (see the JSON Grammar). However there are plenty of JSON files out in the wild that contain multiple top-level values, often separated by newlines. By default ijson will fail to process these with a "parse error: trailing garbage" error, unless multiple_values=True is specified.
  • Similarly the allow_comments option (defaults to False) controls whether C-style comments (e.g., /* a comment */), which are not supported by the JSON standard, are allowed in the content or not.
  • For functions taking a file-like object, an additional buf_size option (defaults to 65536, i.e. 64KB) specifies the number of bytes the library should attempt to read each time.
  • The items and kvitems functions, and all their variants, have an optional map_type argument (defaults to dict) used to construct objects from the JSON stream. This should be a dict-like type supporting item assignment.
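
As an illustration, here is a minimal sketch combining several of these options on a small in-memory, multi-value document (the data itself is made up for the example):

import io
from collections import OrderedDict

import ijson

# Two top-level values separated by a newline, hence multiple_values=True
data = io.BytesIO(b'{"a": 1.5}\n{"b": 2.5}')
for obj in ijson.items(data, '', multiple_values=True, use_float=True,
                       map_type=OrderedDict):
    print(type(obj).__name__, obj)  # prints two OrderedDicts, with float values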

When using the lower-level ijson.parse function, three-element tuples are generated containing a prefix, an event name, and a value. Events will be one of the following:

  • start_map and end_map indicate the beginning and end of a JSON object, respectively. They carry a None as their value.
  • start_array and end_array indicate the beginning and end of a JSON array, respectively. They also carry a None as their value.
  • map_key indicates the name of a field in a JSON object. Its associated value is the name itself.
  • null, boolean, integer, double, number and string all indicate actual content, which is stored in the associated value.

A prefix represents the context within a JSON document where an event originates. It works as follows:

  • It starts as an empty string.
  • A <name> part is appended when the parser starts parsing the contents of a JSON object member called name, and removed once the content finishes.
  • A literal item part is appended when the parser is parsing elements of a JSON array, and removed when the array ends.
  • Parts are separated by a dot (.).

When using the ijson.items function, the prefix acts as the selection for which objects should be automatically built and returned by ijson.
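
To make this concrete, here is a minimal sketch printing the events, and their prefixes, generated for a reduced version of the geographical document above:

import io

import ijson

doc = io.BytesIO(b'{"earth": {"europe": [{"name": "Paris"}]}}')
for prefix, event, value in ijson.parse(doc):
    print(prefix, event, value)

# Among others, this prints:
#   earth.europe start_array None
#   earth.europe.item start_map None
#   earth.europe.item.name string Paris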

Ijson provides several implementations of the actual parsing in the form of backends located in ijson/backends:

  • yajl2_c: a C extension using YAJL 2.x. This is the fastest, but might require a compiler and the YAJL development files to be present when installing this package. Binary wheel distributions exist for major platforms/architectures to spare users from having to compile the package.
  • yajl2_cffi: wrapper around YAJL 2.x using CFFI.
  • yajl2: wrapper around YAJL 2.x using ctypes, for when you can't use CFFI for some reason.
  • yajl: deprecated YAJL 1.x + ctypes wrapper, for even older systems.
  • python: a pure Python parser, good to use with PyPy.

This list of backend names is available under the ijson.ALL_BACKENDS constant.

You can import a specific backend and use it in the same way as the top level library:

import ijson.backends.yajl2_cffi as ijson

for item in ijson.items(...):
    # ...

Importing the top level library as import ijson uses the first available backend in the order of the list above, and its name is recorded under ijson.backend. If the IJSON_BACKEND environment variable is set, its value takes precedence and is used to select the default backend.
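
For example, assuming a hypothetical script my_script.py, the default backend can be overridden like this:

$> IJSON_BACKEND=yajl2_cffi python my_script.py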

You can also use the ijson.get_backend function to get a specific backend based on a name:

backend = ijson.get_backend('yajl2_c')
for item in backend.items(...):
    # ...

In more-or-less decreasing order of impact, these are the most common actions you can take to ensure you get the most performance out of ijson:

  • Make sure you use the fastest backend available. See backends for details.
  • If you know your JSON data contains only numbers that are "well behaved" consider turning on the use_float option. See options for details.
  • Make sure you feed ijson with binary data instead of text data. See faq #1 for details.
  • Play with the buf_size option, as depending on your data source and your system a value different from the default might show better performance. See options for details.

The benchmarking tool should help with trying some of these options and observing their effect on your input files.

  1. Q: Does ijson work with bytes or str values?

    A: In short: both are accepted as input, outputs are only str.

    All ijson functions expecting a file-like object should ideally be given one opened in binary mode (i.e., its read function returns bytes objects, not str). However, if a text-mode file object is given, the library will automatically encode the strings into UTF-8 bytes. A warning is currently issued (but not visible by default) alerting users about this automatic conversion.

    On the other hand, ijson always returns text data (JSON string values, object member names, event names, etc.) as str objects. This mimics the behavior of the standard json module.
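
    A minimal sketch illustrating both input types (the data literals are made up for the example):

    import ijson

    print(list(ijson.items(b'["caf\xc3\xa9"]', 'item')))  # bytes in -> ['café']
    print(list(ijson.items('["café"]', 'item')))          # str in   -> ['café'], with a hidden warning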

  2. Q: How are numbers dealt with?

    A: ijson returns int values for integers and decimal.Decimal values for floating-point numbers. This is mostly for historical reasons. Since 3.1 a new use_float option (defaults to False) is available to return float values instead. See the options section for details.
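
    A minimal sketch of the difference:

    import ijson

    print(list(ijson.items(b'[1, 2.5]', 'item')))                  # [1, Decimal('2.5')]
    print(list(ijson.items(b'[1, 2.5]', 'item', use_float=True)))  # [1, 2.5]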

  3. Q: I'm getting a UnicodeDecodeError, or an IncompleteJSONError with no message

    A: This error is caused by byte sequences that are not valid in UTF-8. In other words, the data given to ijson is not really UTF-8 encoded, or at least not properly.

    Depending on where the data comes from you have different options:

    • If you have control over the source of the data, fix it.
    • If you have a way to intercept the data flow, do so and pass it through a "byte corrector". For instance, if you have a shell pipeline feeding data through stdin into your process you can add something like ... | iconv -f utf8 -t utf8 -c | ... in between to correct invalid byte sequences.
    • If you are working purely in Python, you can create a lenient UTF-8 decoder using the codecs module's incremental decoder machinery to decode your bytes into strings, and feed those strings (using a file-like class) into ijson (see our string_reader_async internal class for some inspiration, and the sketch below).

    In the future ijson might offer something out of the box to deal with invalid UTF-8 byte sequences.
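
    Here is a minimal sketch of that last approach; the LenientUTF8Reader name is made up for illustration:

    import codecs
    import io

    import ijson

    class LenientUTF8Reader:
        """Wraps a binary file, exposing a read() that leniently decodes to str."""
        def __init__(self, f):
            self.f = f
            self.decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')

        def read(self, n):
            return self.decoder.decode(self.f.read(n))

    data = io.BytesIO(b'["ok", "bad \xa8 byte"]')
    print(list(ijson.items(LenientUTF8Reader(data), 'item')))  # the invalid byte becomes U+FFFD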

  4. Q: I'm getting parse error: trailing garbage or Additional data found errors

    A: This error signals that the input contains more data than the top-level JSON value it's meant to contain. This is usually caused by JSON data sources containing multiple values, and is usually solved by passing multiple_values=True to the ijson function in use. See the options section for details.

  5. Q: Are there any differences between the backends?

    A: Apart from their performance, all backends are designed to support the same capabilities. There are however some small known differences:

    • The yajl backend doesn't support multiple_values=True. It also doesn't complain about additional data found after the end of the top-level JSON object. When using use_float=True it also doesn't properly support values greater than 2^32 on 32-bit platforms or on Windows. Numbers with leading zeros are not reported as invalid (although they are invalid JSON numbers). Incomplete JSON tokens at the end of an incomplete document (e.g., {"a": fals) are not reported as IncompleteJSONError.
    • The python backend doesn't support allow_comments=True. It also internally works with str objects, not bytes, but this is an internal detail that users shouldn't need to worry about, and it might change in the future.

ijson was originally developed and actively maintained until 2016 by Ivan Sagalaev. In 2019 he handed over the maintenance of the project and the PyPI ownership.

The pure Python parser in ijson is relatively simple thanks to Douglas Crockford, who invented a strict, easy-to-parse syntax.

The YAJL library by Lloyd Hilaiel is the most popular and efficient way to parse JSON in an iterative fashion.

Ijson was inspired by the yajl-py wrapper by Hatem Nassrat. Though ijson borrows almost nothing from the actual yajl-py code, it was used as an example of integration with yajl using ctypes.

ijson's People

Contributors

acrisci, dav1dde, davidfischer, explodingcabbage, isagalaev, jayvdb, jpmckinney, kianmeng, martin-molinero, matiasg, meggycal, mgorny, pydsigner, radhermit, rtobar, selik, signalpillar, simonw, zjuchenyuan

ijson's Issues

Document items' prefix specification

Many people online seem lost on how to use the prefix to select objects from an ijson.items call. Documenting the prefix syntax would clear things up.

(yajl2_c) TypeError when reading from an aiofiles file

Running Python 3.8.2 (on Linux), ijson 3.1.post0, backend yajl2_c, aiofiles 0.5.0.

When trying to asynchronously get items from a simple json file, a TypeError: _get_read() missing 1 required positional argument: 'f' is raised.

Code:

import aiofiles
import asyncio
import ijson

ijson_backend = ijson.get_backend("yajl2_c")

async def main():
    async with aiofiles.open("test.json", "r", encoding="utf-8") as buff:
        async for i in ijson_backend.items_async(buff, "item"):
            print(i)

asyncio.run(main())

test.json:

["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]

Output:

Traceback (most recent call last):
  File "test_ijson.py", line 12, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.8/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "ignore_test_ijson.py", line 9, in main
    async for t in ijson_backend.items_async(buff, "item"):
TypeError: _get_read() missing 1 required positional argument: 'f'

I would expect normal operation, printing all items from the test.json file.


Synchronous code works correctly:

import ijson

ijson_backend = ijson.get_backend("yajl2_c")


def noasync_main():
    with open("test.json", "r", encoding="utf-8") as buff:
        for i in ijson_backend.items(buff, "item"):
            print(i)


noasync_main()

Output:

1
2
3
4
5
6
7
8
9
10

Update: the yajl2_cffi backend works correctly:

Code:

import aiofiles
import asyncio
import ijson

ijson_backend = ijson.get_backend("yajl2_cffi")

async def main():
    async with aiofiles.open("test.json", "r", encoding="utf-8") as buff:
        async for i in ijson_backend.items_async(buff, "item"):
            print(i)

asyncio.run(main())

Output:

1
2
3
4
5
6
7
8
9
10

IncompleteJSONError: lexical error: invalid char in json text

Is there a solution for parsing json files with syntax problems?

Can this type of error be skipped?

IncompleteJSONError: lexical error: invalid char in json text.
          type":"Point","coordinates":[NaN,NaN],"crs":{"type":"link","
                     (right here) ------^

This is the actual pipeline being applied:

from pathlib import Path

import ijson

def get_generator(gen, export_path, batch_size):
  records = []
  for i, ob in enumerate(gen):
    records.append(get_record(ob))
    if len(records)==batch_size:
      records = save_parquet(records, i, export_path)
      print('saved:', str(i+1))
  if len(records) > 0:
    records = save_parquet(records, i, export_path)
    print('saved:', str(i+1))
    
def read_json_parquet(file_name, export_path, batch_size):
  Path(export_path).mkdir(parents=False, exist_ok=False)
  with open(file_name, 'rb') as file:
    obj = ijson.items(file, 'features.item')
    obj = get_generator(obj, export_path=export_path, batch_size=batch_size)

How to correctly stream *multiple* chunks of JSON?

I'm using asyncio.
I do asynchronous requests to a child process that may be connected through stdin or a socket (if on a remote machine), then I wait for JSON responses through the stdin pipe or the socket.
Since the requests are asynchronous, I could get multiple JSON responses one after the other, consecutively.

What I want to achieve is an asynchronous streamer of multiple JSON chunks, for instance i could get:
{"x": ... 3} ... {"y ... ":4}
or
{"x":3}{"y ... ":4}

And for these 2 cases my code works perfectly, because once I receive the first correct JSON chunk (that is, {"x":3}) I reset the items_coro coroutine, but not the events list.

The problem arises when I get {"x":3}{"y":4} all at once. The library can't parse that because it is not valid JSON, of course.
But then, how do I stream multiple JSON chunks correctly? Is that possible?

I cannot even use a list of JSONs for the chunks, because if I try to put all those JSON chunks inside a list, it will never get parsed until I receive the last ], which I won't receive until the child process closes, for instance.
Is there a way to parse JSON chunks by putting them inside a list? Is that the solution? Thanks.

This is my code:

import typing as t
import asyncio
import aioconsole
import ijson

class JSONStreamer:
    BUFSIZE = 4

    def __init__(self, asyncio_file, on_json=None, on_error=None):
        self.input = asyncio_file
        self.jsonEvents = None
        self.jsonCoro = None
        self.active = True
        self.onJSONCallback = on_json
        self.onErrorCallback = on_error


    def __repr__(self):
        return "<JSONStreamer object>"


    def onJSON(self, py_dict: dict[str, t.Any]) -> None:
        if self.onJSONCallback is not None:
            self.onJSONCallback(py_dict)


    def onError(self, exc: Exception) -> None:
        if self.onErrorCallback is not None:
            self.onErrorCallback(exc)


    def close(self) -> None:
        if self.jsonEvents is not None:
            self.jsonEvents = None
        if self.jsonCoro is not None:
            try:
                self.jsonCoro.close()
            except Exception:
                pass
            finally:
                self.jsonCoro = None


    def reset(self) -> None:
        self.close()
        self.jsonEvents = ijson.sendable_list()
        self.jsonCoro = ijson.items_coro(self.jsonEvents, '')


    async def run(self) -> None:
        # RESET the coroutine state.
        if self.jsonEvents is None:
            self.reset()
        # FOREVER.
        while self.active:
            try:
                # READ binary json data. (E.g. from socket, or stdin)
                data = await self.input.read(self.BUFSIZE)
            except asyncio.CancelledError:
                # CLOSE on cancel.
                self.close()
                self.active = False
                return
            # EOF! Close and return.
            if not data:
                self.close()
                self.active = False
                return
            elif len(data) > 0:
                # STRIP unwanted bytes terminators.
                data = data.strip(b"\n\r\t")
                # CHECK if we have some data after stripping.
                if len(data) > 0:
                    try:
                        # SEND this (partial?) JSON data to the coroutine.
                        self.jsonCoro.send(data)
                    except Exception as exc:
                        # ERROR: Unable to parse JSON correctly, maybe malformed.
                        # CALL onError callback.
                        self.onError(exc)
                        # RESET the coroutine state.
                        self.reset()
                        # CONTINUE to grasp JSON.
                        continue
                    must_reset = False
                    # FOREACH correctly parsed JSONs.
                    for py_obj in self.jsonEvents:
                        print("O IS:", py_obj, type(py_obj))
                        # CALL the callback to signal we got a correctly parsed JSON chunk,
                        # that has been already converted to a python object (list or dict).
                        self.onJSON(py_obj)
                        must_reset = True
                    if must_reset:
                        self.reset()
        self.close()

    def _clean_data(self, data):
        return data.strip(b"\n\r\t")


def on_json(py_dict):
    print("ON JSON!!", py_dict)

def on_error(exc):
    print("ON ERROR!!", exc, exc.__class__.__name__)

async def main2():
    stdin, _ = await aioconsole.get_standard_streams()
    print("STDIN TYPE:", type(stdin))
    json_streamer = JSONStreamer(stdin, on_json=on_json, on_error=on_error)
    await json_streamer.run()


if __name__ == "__main__":
    try:
        asyncio.run(main2())
    except BaseException as e:
        print(e.__class__.__name__, e)

Parsing non-UTF-8 data

RFC 8259 allows non-UTF-8 data in "a closed ecosystem".

I am using ijson to iteratively read JSON from stdin, and I don't presently know a way to change its encoding, without either causing an error in ijson or having to buffer the entire input. Among other attempts, I tried monkey-patching b2s in ijson.compat to use a different encoding, but it led to a different error than UnicodeDecodeError.

Is there a way (or a desire) to parse non-UTF-8 data?

How to fix IncompleteJSONError

I have come across this error when using ijson to parse a big JSON file:

IncompleteJSONError: lexical error: invalid char in json text.
                        {      "_id" : ObjectId("5e5d193d8cf3fe97fa488
                     (right here) ------^

My source code is as follows:

with codecs.open('lagou.json','rb') as f:
    objects = ijson.items(f,'item')
    print(objects.__next__())

It really confuses me why the character 'b' would be an invalid char.

ijson.items prefix doesn't work if key contains "."

Input json file:

{
    "0.1": {
      "123":"ok"
    }
}

test file:

import ijson
filename = "input.json"
with open(filename, 'r') as f:
    objects = ijson.items(f, '0.1')
    cities = (o for o in objects)
    for city in cities:
        print(city)

The test file prints nothing.
If I change the key to "01" and the prefix to "01", then {"123":"ok"} prints as expected.

Any help would be appreciated!

Memory Explosion Using Parser

Hi,

I am using this module because parsing my already large JSON string (about 900MB) with json.loads() makes memory usage go up by about 10x (over 9GB). I was expecting to be able to parse the JSON line by line. It works, but I was a little surprised that when I call ijson.parse() it grabs about 3GB of memory. May I ask why the memory usage is so large? More conversion to dictionaries behind the scenes?

Thanks
Ray

ijson.parse iter_lines() returns error too many values to unpack (expected 2)

Me again, sorry. I'm trying to stream data from a web call directly into ijson. However, if I use...

parser = ijson.parse(cellset_response.iter_lines())
for prefix, event, value in parser:
    pass

I get a 'too many values to unpack (expected 2)' error. If I use iter_content() instead of iter_lines(), I get 'not enough values to unpack (expected 2, got 1)'.

It seems I can't win. :D

Which repo does current pypi release originate from?

Hi, thanks for maintaining this package. I'm just curious, since the PyPI package links to this repo, but links from the original repo point to the PyPI package. Have you started publishing the package now?

Memory leak in yajl2_c backend

Opening a 1GB JSON file and iterating over it using ijson.items takes up a large amount of memory which can not be reclaimed by running the garbage collector.

Software & Environment version:
ijson==3.0.1, Python3.8, Debian buster.

Included are a Dockerfile and Python file which fully demonstrate the issue.


# Dockerfile
FROM python:3.8-buster

RUN wget https://usbuildingdata.blob.core.windows.net/usbuildings-v1-1/NewYork.zip

RUN unzip NewYork.zip -d /data
RUN rm NewYork.zip

RUN apt update && apt install -y \
    libyajl2

RUN pip3 install ijson==3.0.1 cffi

COPY main.py /

ENTRYPOINT ["python", "/main.py"]

# main.py
import resource
import gc

import ijson.backends.yajl2_c as ijson
# import ijson.backends.python as ijson
# import ijson.backends.yajl2_cffi as ijson


def memusage():
    """memory usage in GB. getrusage defaults to KB on Linux."""
    return str(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)


def iter_features(filename):
    with open(filename, 'rb') as f:
        yield from ijson.items(f, "features.item")


def main():
    print("using backend", ijson.backend)
    print("starting memory usage:", memusage(), 'GB')

    for feature in iter_features('/data/NewYork.geojson'):
        pass

    print("memory usage after reading file:", memusage(), 'GB')
    gc.collect()
    print("memory usage after garbage collection:", memusage(), 'GB')


if __name__ == '__main__':
    main()

Create a new directory with the Dockerfile and main.py and then run:

$ docker build -t test-ijson .
$ docker run -it test-ijson
using backend yajl2_c
starting memory usage: 0.010604 GB
memory after reading file: 5.824332 GB
memory usage after garbage collection: 5.824332 GB

if you change main.py to use the python backend, there is no issue:

using backend python
starting memory usage: 0.01056 GB
memory after reading file: 0.01582 GB
memory usage after garbage collection: 0.01582 GB

Same with yajl2_cffi:

using backend yajl2_cffi
starting memory usage: 0.015032 GB
memory after reading file: 0.016292 GB
memory usage after garbage collection: 0.016292 GB

Though they are, of course, much slower to test than yajl2_c.

Memory leak in asyncio interface

Hi.
I think I've found a memory leak in the asyncio interface.
I'm using the latest ijson==3.0.4 with Python 3.7 on macOS Mojave.

Here's an example of the leak:

import asyncio
import ijson.backends.yajl2_c as ijson


class AsyncReaderWrapper:
    def __init__(self, stream):
        self._stream = stream

    async def read(self, value: int):
        if value == 0:
            return b""
        return self._stream.read()


async def parse_json_async(json_fp):
    async for objects in ijson.parse_async(AsyncReaderWrapper(json_fp)):
        yield objects


async def amain():
    events = 0
    with open("100mb.json", "rb") as json_fp:
        async for prefix, event, value in parse_json_async(json_fp):
            events += 1

    print(f"Got {events}")
    print("Press any key...")
    input()


if __name__ == "__main__":
    asyncio.run(amain())

And a sync version for comparison:

import ijson.backends.yajl2_c as ijson


def main():
    events = 0
    with open("100mb.json", "rb") as json_fp:
        for prefix, event, value in ijson.parse(json_fp):
            events += 1

    print(f"Got {events}")
    print("Press any key...")
    input()


if __name__ == "__main__":
    main()

I've used memory_profiler on both and got the following results:

Async version: [memory_profiler output screenshot omitted]

Sync version: [memory_profiler output screenshot omitted]

It looks like when I'm using the async interface, memory is released only after the whole file has been processed.
I tried all backends and all have comparable results.

errors in python and yajl2_c backends

I've hit another snag with my libarchive experiment:

(Currently using https://github.com/smartfile/python-libarchive, working with this file-like object: https://github.com/smartfile/python-libarchive/blob/master/libarchive/__init__.py#L183)

When using either the python or the yajl2_c backends I hit some error condition at the end of the file. It works correctly with the yajl2_cffi or yajl2 backends.

Here's what happens when using yajl2_c:

count = 0
with libarchive.Archive('myarchive.zip') as archive:
    for entry in archive:
        if entry.pathname == 'myjson.json':

            jsonstream = archive.readstream(entry.size)
            objects = ijson.items(jsonstream, 'somekey.item')
            for o in objects:
                count +=1
                print(count)
[prints the correct number of things, so everything is parsed right to the end it seems]
Traceback (most recent call last):
  File "myfile.py", line 60, in <module>
    for o in objects:
TypeError: a bytes-like object is required, not 'NoneType'

With the python backend (same code):

[same correct number again]
Traceback (most recent call last):
  File myfile.py", line 60, in <module>
    for o in objects:
  File "/usr/lib/python3.9/site-packages/ijson/utils.py", line 55, in coros2gen
    f.send(value)
  File "/usr/lib/python3.9/site-packages/ijson/backends/python.py", line 36, in utf8_encoder
    sdata = decode(bdata, final)
  File "/usr/lib/python3.9/codecs.py", line 321, in decode
    data = self.buffer + input
TypeError: can't concat NoneType to bytes

I'm not sure if I'm doing something wrong when calling ijson, or whether my file-stream from libarchive is somehow "weird" (but I wouldn't know how, really). The weird thing is that it works with the cffi backend. Oh, and it also works with all backends if I .read() the full jsonstream into memory (but this defeats the point of using ijson, of course).

add support for iterating over key/value pairs?

If I understand correctly, there is the built-in items wrapper for iterating over items in a list, but there isn't one for iterating over the keys of a dictionary.

I've seen the solution for the special case where the keys are at the top level of the JSON (isagalaev#62 (comment)), but what if the large list of keys is not at the top level? E.g.

{
 "my_big_data":
  { 
    "1": 1,
    "2": 2
  }
}

Would it be difficult to add a function analogous to items where one can specify the prefix of the dictionary to iterate over, and which returns the keys and values?

I guess besides the implementation, there is also the question of what to call it.
It is perhaps a bit unfortunate that in Python 3 the natural name for the dictionary iterator returning keys and values would actually be items, but I guess that is already taken ;-)

PyInstaller Hooks

It would be nice if ijson had PyInstaller hooks so we can bundle it as part of a standalone application.

This code from __init__.py causes problems when building with PyInstaller, because dynamically loaded modules are not included:
return importlib.import_module('ijson.backends.' + backend)

For now I'm getting around this by specifying the below in the .spec file:
hiddenimports=['ijson.backends.yajl2_c']

ijson.kvitems(prefix) iterates through items even after the prefix was found and closed

Describe the bug
ijson.kvitems(prefix) iterates through items even after the prefix was found and closed.

How to reproduce

I first generated an example.json file with the following code.

import json


def generate_example(n):
    with open("example.json", "w") as f:
        json.dump({"a": {"foo": 13}, "b": {"bar": list(range(n))}}, f)

This gives the following JSON for n=5

{"a": {"foo": 13}, "b": {"bar": [0, 1, 2, 3, 4]}}

Then I tried to load only the items under the prefix a with the two following functions:

import ijson


def g():
    with open("example.json", "r") as f:
        result = list(ijson.kvitems(f, prefix="a", use_float=True))

    print(result)


def h():
    with open("example.json", "r") as f:
        under_prefix = False
        result = []
        for prefix, event, value in ijson.parse(f):

            if prefix == "a" and event == "start_map":
                under_prefix = True

            if under_prefix:
                result.append((prefix, event, value))

            if prefix == "a" and event == "end_map":
                break

    print(result)

To compare the two functions, I created a big example.json file with n=10000000.

The function h took 0.43s to execute while the function g took 2.60s. Thus it seems that ijson.kvitems(prefix) iterates through items even after the prefix was found and closed (i.e. after the end_map event is found).

Expected behavior
ijson.kvitems(prefix) should stop iterating through items once the prefix was found and closed.

Execution information:

  • Python version: 3.7.4
  • ijson version: 3.1.4
  • ijson backend: I don't know
  • ijson installation method: pip (with poetry 1.1.5)
  • OS: Ubuntu 16.04

Thanks !

Breaking change in version 2.5

Overview

Hi, it seems there is a breaking change in version 2.5 which should not be breaking according to SemVer:

items = ijson.items(self.__chars, path)
E       TypeError: expected bytes, str found

It breaks all our stack (https://github.com/frictionlessdata) for all new installations (because of the tabulator-py dependency).

Is it possible to handle this change differently? For example, deprecating the previous behavior while still supporting it until the next major version, or something like this.

Get items iterator from root

If I have a JSON array as the file root, using items(f, '') returns a generator with 1 item (the array itself) instead of a generator over its items.

Example:

import io, ijson
it = ijson.items(io.BytesIO(b'[1, 2, 3]'), '')
for el in it:
  print(el)
# Prints [1, 2, 3] instead of the items 1, 2, and 3

Continue parsing after using ijson.items on an array

Description

I want to parse an array using ijson.items, stop on StopIteration, and then proceed to parse trailing content using the underlying ijson.parse object. In the code below, the final line fails. If I replace while True with for i in range(3) then I parse all of arr without a StopIteration and I can go on to parse the trailing content. This doesn't help me though, because I don't know how long arr is...

Thank you for your help.

import io
import ijson

parse_events = ijson.parse(io.BytesIO(b'''
{
  "leading": "hi",
  "arr": [ 1, 2, 3 ],
  "trailing": "bye"
}
'''))

# not shown: iterate over parse_events to process "leading"

arr_iter = ijson.items(parse_events, 'arr.item')

try:
    while True:
        print(next(arr_iter))
except StopIteration:
    print('Caught StopIteration')

next(parse_events) # I want to read "trailing" now but get StopIteration

Build without yajl 1.x

Is your feature request related to a problem? Please describe.
I'm working on packaging this project for Void Linux, which doesn't have YAJL 1.x in its repos. I want to skip/remove this backend, but the only way I've found to do this is to manually remove ijson/backends/yajl.py and patch out the line that generates tests for the yajl backend.

Describe the solution you'd like
Some sort of build flag (I'm not familiar with the Python module ecosystem) would be nice, or a way to detect whether YAJL 1.x is available and not include the backend based on that.

The patch works for now if not.

Better default backend

The current default backend users get when importing ijson is the pure Python one. On the plus side this will always import correctly, but it has the downside that it's the one exhibiting the worst performance. On a typical installation importing other backends wouldn't be an issue though, so we can probably try to offer a better backend by default by iterating over the alternatives, importing them, and returning the first one that imports.

Expose push / sax-like interface

Dealing with asyncio streams, I was wondering how realistic it would be to turn this awesome library into something that accepts a push pattern.

Thinking about something like:

import ijson

def on_event(event, value):
    ...  # Do something

async def collect_json(source):
    parser = ijson.Parser(on_event)
    async for chunk in source:
        parser.feed(chunk)
    parser.feed_eof()

The current system only supports blocking I/O, and there is no way to emulate the above without using threads and pipes/queues, unfortunately.

ijson.common.items removed in 3.0 [was: Unable to use items with event generator]

Reading data that is formatted as a sort of preamble followed by a stream of similar items in an array, I am currently using a combination of ijson.parse and ijson.items, reading the preamble as prefixed events, followed by processing the array items. Something like:

{"interesting": "preamble",
 "another": "key",
 "actual_results": [{"object": 1}, {"object": 2}]}
def read_preamble(events):
    preamble = None
    for prefix, etype, value in events:
        if prefix == 'preamble':
            preamble = value
        if prefix == 'actual_results':
            return preamble

events = ijson.parse(data)
preamble = read_preamble(events)
for result in ijson.items(events, 'actual_results.item'):
    process_result(result)

As of ijson 3.0, items no longer accepts an event generator and requires a raw input, which I can't recreate (the actual use case is a sizable HTTP response that I wouldn't want to request twice).

I guess it's possible to use just ijson.parse and mimic the old ijson.items with an ObjectBuilder, but that seems a step back in usability to me. Is there a better way to accomplish this / can we get an items-like API that enables this?

Memory leak with yajl2_c backend

Hi,

Thank you very much for the ijson library.

I think I may have found a memory leak when using the yajl2_c backend. I've reused code similar to the one found in a previous issue:

https://gist.github.com/mhugo/dec469223e578ea7ec94946edcd43e6f

With yajl2_c:

using backend yajl2_c
using ijson version 3.1.1
starting memory usage: 11.736 MB
spent time: 5.27
memory usage after ijson calls: 204.432 MB
memory usage after garbage collection: 204.432 MB

With yajl2_cffi:

using backend yajl2_cffi
using ijson version 3.1.1
starting memory usage: 18.556 MB
spent time: 16.25
memory usage after ijson calls: 18.556 MB
memory usage after garbage collection: 18.568 MB

Output items sequence as original bytes

Is your feature request related to a problem? Please describe.

It'd be great if there were a way to rapidly break a large stream of multiple JSON values (i.e. the multiple_values option) into its constituent values. For use-cases where you just need to know e.g. the number of JSON values in a stream, or need to multiplex an incoming stream across threads, or simply substring match the entire raw JSON value without first interpreting it, this is a pretty useful feature. As a point of reference, some JSON libraries like Golang's support this out of the box: in that case, you can decode a JSON-containing byte array into a json.RawMessage, which just copies the byte array.

Describe the solution you'd like

I'd like some equivalent to ijson.items that simply produces the original bytes (possibly copied) instead of parsing the items themselves.

Describe alternatives you've considered

If I had full control over the production of these JSON streams, I could require that the output were newline-delimited. At present, this is not the case.

I think the current workaround is to run jq -cM in a subprocess and pipe the stream into jq, which will force sequences like {}{} to get produced as {}\n{}\n. I could try to reserialize the original items, but that doesn't always result in the desired behavior (and would probably be slower than the jq equivalent). This is an imperfect solution because it'll mangle the original bytes, which may not be the desired behavior when searching for item-level substring matches.

Python Backend Allows Leading 0s in Numbers

The python backend will not throw an IncompleteJSONError when parsing a JSON file that has leading 0s in the numeric value. When using the yajl2_c parser, the error is thrown as expected.

For example, when given this invalid JSON, both backends should throw the error.

{"should_fail": 001}

However, as the output shows, only the yajl2 backend fails, while the python parser parses 001 as 1

>>> import ijson.backends.python as ijson
>>> print ([x for x in ijson.items('{"should_fail": 001}', '')])
[{'should_fail': 1}]
>>> import ijson.backends.yajl2_c as ijson
>>> print ([x for x in ijson.items('{"should_fail": 001}', '')])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
ijson.common.IncompleteJSONError: parse error: after key and value, inside map, I expect ',' or '}'
                       {"should_fail": 001}
                     (right here) ------^

Python Version: 3.6.12
ijson Version: 3.1.1

Handle processing errors?

I have a huge stream (if saved: a 3.5GB JSON file, usually received from a unix pipe), which is processed with ijson (Python 3.7, conda):

ijson.version.__version__ '3.0'

cat file.json | ./process.py

a simple code sample (of course I process the source object, but this is enough to trigger the problem):

#!/usr/bin/env python3
#process.py

import ijson
import sys

json_objects = ijson.items(sys.stdin,'item._source')
for source in json_objects:
    continue

Exception (around the 360k-th object):

cat x.json | ./process.py 
Traceback (most recent call last):
  File "./process.py", line 8, in <module>
    for source in json_objects:
  File "/home/tobi/miniconda3/lib/python3.7/site-packages/ijson/compat.py", line 31, in read
    return self.str_reader.read(n).encode('utf-8')
  File "/home/tobi/miniconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 47886: invalid start byte

My question is: how do I correctly handle this kind of stream byte and/or encoding error with ijson? 99% of the stream is OK, but sometimes there is a problem; how do I handle stream encoding and formatting errors?
I couldn't find any place to put a try...except, because the error occurs while iterating over the objects that ijson generates, so it can't easily be handled there...

Thank you.
Tamas

Add support for user-specified mapping type [was: Parsing into OrderedDict]

I have code that re-orders JSON keys into a standardized order using OrderedDict.move_to_end. I want to use ijson to read the input iteratively. Presently, I think I would need to convert the dict that ijson returns into an OrderedDict, but my data has deep JSON objects, so this would be a fairly expensive operation. It would be faster to parse the data into an OrderedDict directly.

Is there an interest in adding this feature?

Arm64 wheel

With the increase of Arm CPUs in datacenters and the upcoming Apple migration to Arm, the use of Python on these platforms is growing. However, installing Python modules without wheels often fails or is very slow. The error messages users see do not clearly identify the problem as a missing build dependency. Publishing a wheel is typically low effort, just a few lines in the build script. For users this saves significant time by avoiding troubleshooting and not having to wait for build processes to finish.

Ijson uses cibuildwheel via GitHub actions. If you're open to it - the easiest way to create arm64 wheels would be to move to cibuildwheel on travis.com. I would be happy to try and create a pull request in that direction.

This is very very slow on my computer

So I have a JSON file, 330-ish MB.
The content is like this:

{
  "locations" : [ {
    "timestampMs" : "1231313131313",
    "latitudeE7" : 111111111,
    "longitudeE7" : 123123131,
    "accuracy" : 36,
    "activity" : [ {
      "timestampMs" : "1211211121121",
      "activity" : [ {
        "type" : "STILL",
        "confidence" : 75
      }, {
        "type" : "ON_FOOT",
        "confidence" : 10
      }, {
        "type" : "IN_VEHICLE",
        "confidence" : 5
      }, {
        "type" : "ON_BICYCLE",
        "confidence" : 5
      }, {
        "type" : "UNKNOWN",
        "confidence" : 5
      }, {
        "type" : "WALKING",
        "confidence" : 5
      }, {
        "type" : "RUNNING",
        "confidence" : 5
      } ]
    } ]
  }, {........

Meaning an array of locations.
If I run this through json.load, then iterate over the result and pull out the two map_keys I want, it takes about 20 seconds. That is doable.

But I cannot load the whole thing in memory anymore, it is too big for my infrastructure, so I found this lib. But when I run, for example:

    locations = ijson.kvitems(json_file, 'locations.item')
    timestampMsObjects = (v for k, v in locations if k == 'timestampMs')
    timestampMs = list(timestampMsObjects)

It takes many, many minutes. I don't know how long exactly, because I quit it every time it runs too long.

Why is this? I'm just trying to get the length of that list, to see how many points I'm working with.

Afterwards I want to pull out 3 map_keys and combine them into a smaller object. But first I need to make sure this software is fast enough.

Does anyone have some insight on this?

IncompleteJSONError is not recognized for null and boolean values

Hi guys,

we are using this library to be able to parse incomplete JSON, specifically its Python backend.

We noticed that sometimes it raises UnexpectedSymbol instead of IncompleteJSONError, and after some investigation we found out that the problem is that the given JSON ended with an incomplete null, true or false value, e.g. {"a": n. It seems that the lexer's state machine does not recognize this.

Do you think this is a valid case and if so, can it be fixed?

Thank you for your response.

Working without prefixes?

I may have misunderstood the library, but is there any way to loop over structures without any prefixes? I've been working in a memory-constrained environment where I wish to replace something like the following:

data = json.load(srcfile)
for value in data.values():
       do_something(value)

i.e. converting some JSON into CSV.

To do this with ijson I've had to drop down to basic_parse(), where I manually handle start_array and end_array, concatenate values and ignore other events. But it feels like there should be another way.

It is not possible to parse a file with newlines (\n etc.)

Please help. What am I doing wrong?

import asyncio

import ijson
# ijson==3.1.2.post0

from aiofile import AIOFile

data = """
[
 "a"
]
"""


async def main():
    with open("test.json", "w") as f1:
        f1.write(data)

    with open("test.json", "r") as f2:
        for obj in ijson.items(f2, prefix="item", use_float=True):
            print(obj)

    # ijson.common.IncompleteJSONError: parse error: trailing garbage
    #                                        [  "a" ]
    #                      (right here) ------^
    async with AIOFile("test.json",  "r") as fp:
        async for obj in ijson.items_async(fp, prefix="item", use_float=True):
            print(obj)

asyncio.run(main())

work with generator source

Hi,
I was trying to use ijson with a JSON stream coming from a zip archive through a libarchive binding. Unfortunately the package I tried first exposed only a generator for getting the file bytes out of the zip:

https://github.com/Changaco/python-libarchive-c/blob/master/libarchive/entry.py#L48-L56

This is apparently not currently supported by ijson? At least I was getting very strange errors (internal C errors with the default C backend, "too many values to unpack" with the python backend using .items()), which I eventually narrowed down to the generator when using .basic_parse(). Would it make sense to support generators as a source as well, or is that somehow fundamentally incompatible?

(Meanwhile I've switched to using the other python libarchive binding which does offer a file-like interface for reading from the archive.)

Python Lexer Is Excessively Greedy

ijson.backends.python.Lexer() has a main loop which looks for a number or a single-character lexeme and enters a simple decision tree. If the lexeme starts a string, the rest of the string is read in, with buffer updates as necessary, and then yielded out. If it does not start a string, the Lexer always attempts to extend the lexeme. In general, this isn't an issue, but if the file stream is wrapped around a socket, this can lead to significant parser lag and handshake stalemates as both parties wait for the other to transmit another chunk of data.

if lexeme == '"':
    pos = match.start()
    start = pos + 1
    while True:
        try:
            end = buf.index('"', start)
            escpos = end - 1
            while buf[escpos] == '\\':
                escpos -= 1
            if (end - escpos) % 2 == 0:
                start = end + 1
            else:
                break
        except ValueError:
            data = f.read(buf_size)
            if not data:
                raise common.IncompleteJSONError('Incomplete string lexeme')
            buf += data
    yield discarded + pos, buf[pos:end + 1]
    pos = end + 1
else:
    while match.end() == len(buf):
        data = f.read(buf_size)
        if not data:
            break
        buf += data
        match = LEXEME_RE.search(buf, pos)
        lexeme = match.group()
    yield discarded + match.start(), lexeme
    pos = match.end()

My guess is that the yajl backends do not share this issue, but I've not pulled up my Linux machine to check.

Error with 2.5

Code Example:

test_ijson.py

import json
import ijson
import codecs

with codecs.open('test.json', encoding="utf-8") as json_file:
    form = ijson.items(json_file, 'menu.test_items.item')
    print(form)
    forms = (o for o in form)
    print(forms)
    for objects in forms:
        pass

test.json

{"menu": {"header": "SVG Viewer","test_items": [{"id": "Open"},{"id": "OpenNew", "label": "Open New"},{"id": "ZoomIn", "label": "Zoom In"},{"id": "ZoomOut", "label": "Zoom Out"},{"id": "OriginalView", "label": "Original View"},{"id": "Quality"},{"id": "Pause"},{"id": "Mute"},{"id": "Find", "label": "Find..."}]}}

Result:

<_yajl2.items object at 0x7fb21010f650>
<generator object at 0x7fb2100e0c00>
Traceback (most recent call last):
  File "test_ijson.py", line 11, in <module>
    for objects in forms:
  File "test_ijson.py", line 9, in <genexpr>
    forms = (o for o in form)
TypeError: expected bytes, str found

If I import the python backend as ijson, everything works fine.

Event interception not available on async functions

I am trying to implement the Intercepting Events pattern from https://github.com/ICRAR/ijson#id13 to consume an aiohttp response. When using non-async sources everything works as expected.

Running Python 3.9.4 (on Kubuntu 20.04), ijson 3.1.4, aiohttp 3.7.4.post0. For the sake of testing all backends I also installed cffi 1.14.5 and the OS package libyajl2:amd64 2.1.0-3. The precise versions do not seem crucial.
The code below uses a JSON file from the web; the specific JSON data is not important. The path specified in the code is not important either.

import asyncio
import traceback

import aiohttp
import ijson

url ='https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json'

async def run():
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            parse_events = ijson.parse_async(response.content)  # 1 <----
            async for prefix, event, value in parse_events:
                print(prefix, event, value)
        async with session.get(url) as response:
            async for i in ijson.items_async(response.content, "quiz.maths.q2.options.item"):  # 2 <----
                print(i)
        async with session.get(url) as response:
            body = await response.read()
            parse_events = ijson.parse(body)
            for i in ijson.items(parse_events, "quiz.maths.q2.options.item"):  # 3 <----
                print(i)
        for backend in ['yajl2_c', 'yajl2_cffi', 'yajl2', 'python']:
            try:
                ijson_backend = ijson.get_backend(backend)
                async with session.get(url) as response:
                    parse_events = ijson_backend.parse_async(response.content)
                    async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):   # 4 <----
                        print(i)
            except Exception as e:
                print(f"{backend}\n\n\n{traceback.format_exc()}\n\n\n")


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(run())

Cases 1, 2, and 3 work fine.
Case 4 raises various exceptions depending on the backend:

yajl2_c


Traceback (most recent call last):
  File "/home/federico/python-tests/test.py", line 23, in run
    async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 20, in _get_read
    if type(await f.read(0)) == compat.bytetype:
AttributeError: '_yajl2._parse_async' object has no attribute 'read'




yajl2_cffi


Traceback (most recent call last):
  File "/home/federico/python-tests/test.py", line 23, in run
    async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 48, in __anext__
    self.read = await _get_read(self.f)
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 20, in _get_read
    if type(await f.read(0)) == compat.bytetype:
TypeError: 'NoneType' object is not callable




Exception ignored in: <generator object basic_parse_basecoro at 0x7fb0f35bfdd0>
Traceback (most recent call last):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2_cffi.py", line 225, in basic_parse_basecoro
    yajl_parse(handle, buffer)
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2_cffi.py", line 196, in yajl_parse
    raise exception(error)
ijson.common.IncompleteJSONError: parse error: premature EOF
                                       
                     (right here) ------^

yajl2


Traceback (most recent call last):
  File "/home/federico/python-tests/test.py", line 23, in run
    async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 48, in __anext__
    self.read = await _get_read(self.f)
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 20, in _get_read
    if type(await f.read(0)) == compat.bytetype:
TypeError: 'NoneType' object is not callable




Exception ignored in: <generator object basic_parse_basecoro at 0x7fb0f35bfcf0>
Traceback (most recent call last):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2_cffi.py", line 225, in basic_parse_basecoro
    yajl_parse(handle, buffer)
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2_cffi.py", line 196, in yajl_parse
    raise exception(error)
ijson.common.IncompleteJSONError: parse error: premature EOF
                                       
                     (right here) ------^

Exception ignored in: <generator object basic_parse_basecoro at 0x7fb0f35bf970>
Traceback (most recent call last):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2.py", line 50, in basic_parse_basecoro
    raise exception(error)
ijson.common.IncompleteJSONError: parse error: premature EOF
                                       
                     (right here) ------^

python


Traceback (most recent call last):
  File "/home/federico/python-tests/test.py", line 23, in run
    async for i in ijson_backend.items_async(parse_events, "quiz.maths.q2.options.item"):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 48, in __anext__
    self.read = await _get_read(self.f)
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/utils35.py", line 20, in _get_read
    if type(await f.read(0)) == compat.bytetype:
TypeError: 'NoneType' object is not callable




Exception ignored in: <generator object basic_parse_basecoro at 0x7fb0f35bfe40>
Traceback (most recent call last):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/yajl2.py", line 50, in basic_parse_basecoro
    raise exception(error)
ijson.common.IncompleteJSONError: parse error: premature EOF
                                       
                     (right here) ------^

Exception ignored in: <generator object utf8_encoder at 0x7fb0f3587200>
Traceback (most recent call last):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 46, in utf8_encoder
    target.close()
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 116, in Lexer
    target.send(EOF)
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 161, in parse_value
    raise common.IncompleteJSONError('Incomplete JSON content')
ijson.common.IncompleteJSONError: Incomplete JSON content
Exception ignored in: <generator object utf8_encoder at 0x7fb0f35870b0>
Traceback (most recent call last):
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 46, in utf8_encoder
    target.close()
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 116, in Lexer
    target.send(EOF)
  File "/home/federico/python-tests/.venv/lib/python3.9/site-packages/ijson/backends/python.py", line 161, in parse_value
    raise common.IncompleteJSONError('Incomplete JSON content')
ijson.common.IncompleteJSONError: Incomplete JSON content

I tried a few combinations of parse_async/parse/items_async/items/async for/for, but had no luck.

Am I doing something wrong, or is there an issue?
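For what it's worth, a hedged workaround sketch that generalises case 3 above (the one that works on every backend): buffer the whole response first, then run the synchronous intercepting-events pattern. The trade-off is that the payload is fully buffered in memory, so this is only a stop-gap where streaming matters. It reuses the url and prefix from the reproduction script.

import asyncio

import aiohttp
import ijson

url = 'https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json'

async def run_workaround():
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            body = await response.read()  # buffer the full payload in memory
    # Intercept events synchronously, per backend, exactly as in case 3
    for backend in ['yajl2_c', 'yajl2_cffi', 'yajl2', 'python']:
        ijson_backend = ijson.get_backend(backend)
        parse_events = ijson_backend.parse(body)
        for item in ijson_backend.items(parse_events, "quiz.maths.q2.options.item"):
            print(item)

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(run_workaround())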

Resuming stream processing

Hey,

I've been playing around with ijson for the last couple of days and it's quite helpful.
I'm having a hard time figuring out the best way to resume stream processing.
The file I'm processing is a 20 GB JSON document. I'm processing it from cloud-based storage, and a stream error of some kind might occur during processing.
The structure of my JSON is:

{
  "fileVersion": <number>,
  "id": <uuid>,
  "items": [ ...a very big array (tens of thousands) of small-to-medium objects, up to ~100 lines each... ]
}

  1. If the stream fails after 18 GB, how can I resume processing from that point?
  2. Is there a more efficient way to process the JSON than a single-threaded kvitems? I'm using the yajl2_c backend.

Thanks for the support!
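One hedged approach to question 1 (a sketch, not an ijson feature; ResumingReader and the URL are illustrative): wrap the remote source in a file-like object that reconnects with an HTTP Range request after a stream error, so ijson sees one uninterrupted byte stream and no parser state ever has to be saved or restored. This assumes the storage service honours Range headers.

# A hedged sketch: a file-like wrapper that resumes the download from
# the last good byte offset; partial-read handling is deliberately
# simplified
import http.client
import urllib.request

import ijson

class ResumingReader:
    def __init__(self, url):
        self.url = url
        self.offset = 0  # bytes successfully consumed so far
        self._connect()

    def _connect(self):
        request = urllib.request.Request(
            self.url, headers={'Range': 'bytes=%d-' % self.offset})
        self.response = urllib.request.urlopen(request)

    def read(self, size=-1):
        while True:
            try:
                chunk = self.response.read(size)
                self.offset += len(chunk)
                return chunk
            except (OSError, http.client.HTTPException):
                self._connect()  # reconnect and retry from the last offset

# 'items.item' follows the structure described above; the URL is illustrative
for item in ijson.items(ResumingReader('https://example.com/big.json'), 'items.item'):
    print(item)

As for question 2, the event parsing itself is inherently sequential, so any parallelism would have to happen on the consumer side, e.g. handing completed items off to worker processes.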

Example code issue

There are several syntax errors in some of the example code. Here's a version that works:

import io
import ijson

# Open a file
fo = open("foo.txt", "rb")
print("Name of the file: ", fo.name)

line = fo.read(10)

# Close opened file
fo.close()

# Skip events until the "skip" value has been seen, then build objects
# from the remaining events (the intercepting-events pattern)
parse_events = ijson.parse(io.BytesIO(b'["skip", {"a": 1}, {"b": 2}, {"c": 3}]'))
while True:
    prefix, event, value = next(parse_events)
    if value == "skip":
        break
for obj in ijson.items(parse_events, 'item'):
    print(obj)

Feature request: Return bytes at prefix

I don't know if this is too narrow a use case for this library, or if there is another way to do this.

I work with large remote JSON objects. If I know a JSON key occurs near the start of the file, I might download only a few kilobytes, and then use ijson.items to read that key. ijson will not raise an IncompleteJSONError, because it never had to read to the end of the file. So far, so good.

While most of these JSON objects are standardized, some are not. For example, a publisher might wrap a standard object inside another object, like {"result": { standardized content } }. My data pipeline currently handles this, by being able to set the prefix for a given remote file (in this case result). A pipeline step returns the data at that prefix, using ijson.items.

When I combine these two tactics, I run into trouble. ijson.items(data, 'result') tries to read the entire result key, but the JSON is incomplete, so it raises an error.

One solution is to collapse the two steps into one, i.e. index to result.some_key in one step. In my case, this is undesirable, because then I'll have multiple pipelines using similar code, which is more work to maintain.

Another solution might be to have a method that just returns the bytes from the given prefix, without parsing them. (Since the prefix might match one or more items, I guess the return value would be a sequence of byte strings.) In that case, I could then use ijson.items as usual on one of the return values.
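In the meantime, a hedged workaround sketch (strip_prefix is hypothetical, not part of ijson, and it assumes ijson.items accepts any iterable of (prefix, event, value) tuples, as in the intercepting-events pattern): instead of extracting raw bytes at the prefix, remap the event prefixes so the wrapper disappears, then feed the remapped events to the downstream step unchanged. Because the consumer can stop early, a truncated download only fails if the truncation lands before the wanted key completes.

# A hedged sketch: make the wrapped document look unwrapped at the
# event level, so downstream code keeps using its usual prefix
import io

import ijson

def strip_prefix(events, wrapper):
    # Re-yield only events at or under `wrapper`, with the wrapper removed
    dotted = wrapper + '.'
    for prefix, event, value in events:
        if prefix == wrapper:
            yield '', event, value
        elif prefix.startswith(dotted):
            yield prefix[len(dotted):], event, value

wrapped = io.BytesIO(b'{"result": {"some_key": [1, 2, 3]}}')
events = ijson.parse(wrapped)
for obj in ijson.items(strip_prefix(events, 'result'), 'some_key'):
    print(obj)   # -> [1, 2, 3]
    break        # stopping early avoids reading to the (possibly missing) end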
