I have a huge stream (if saved: 3.5GB json file - usually received from unix pipe), wh


Handle processing errors? about ijson HOT 6 CLOSED

icrar commented on July 21, 2024

Handle processing errors?

from ijson.

Comments (6)

rtobar commented on July 21, 2024

Hi @tatobi,

The very short answer to your final question is: there is no way to deal with erroneous individual bytes in ijson, at least yet, but in your example we couldn't even if we supported it.

A longer answer now: if I'm reading the error correctly, your program would fail even without ijson in the picture, because the decoding error happens while sys.stdin reads the data and decodes the utf8 bytes into strings (sys.stdin is a text file in python3). In other words, I think this should also fail with the same error:

cat file.json | python3 just_read.py

# just_read.py
import sys
import functools

# Read stdin in 64 KiB chunks until read() returns '' (end of input)
n = 64 * 1024
reader = functools.partial(sys.stdin.read, n)
for data in iter(reader, ''):
    pass

This also means you are doing extra work when passing your data to ijson: sys.stdin automatically decodes the utf8 bytes into strings, which ijson then encodes back into utf8 bytes. You should have received a warning on the console about this, did you not? To bypass that extra decoding+encoding overhead you can pass sys.stdin.buffer, which is the raw byte stream, to ijson instead (see the note in the link above).

Apart from performance, the other difference that passing sys.stdin.buffer would make is that ijson would now see the original error, which isn't possible with the original code (the error happens within sys.stdin, so there's no way ijson can properly recover from it). This would at least give ijson a chance to deal with it, but again, this is not something we currently support. If you have any control over the producer of your data, then your best chance is to get that fixed so it emits valid utf8-encoded bytes.
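To illustrate where the error originates, here is a minimal sketch using only the standard library. It assumes a made-up payload containing one stray 0xa8 byte, and it uses io.TextIOWrapper and io.BytesIO as stand-ins for sys.stdin and sys.stdin.buffer; no ijson call is involved.

```python
import io

# Hypothetical payload: JSON-looking bytes with one stray 0xa8 byte inside.
raw = b'{"name": "caf\xa8"}'

# Text layer (how sys.stdin behaves): decoding happens inside read(),
# so the failure is raised before any consumer sees the data.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
try:
    text_stream.read()
    text_error = None
except UnicodeDecodeError as exc:
    text_error = exc

# Binary layer (how sys.stdin.buffer behaves): bytes pass through untouched,
# so the consumer receives the invalid byte and gets to decide what to do.
binary_stream = io.BytesIO(raw)
data = binary_stream.read()

print("text layer raised:", type(text_error).__name__)
print("binary layer returned:", data)
```

The text layer fails on its own, which is why the traceback points into codecs.py rather than into the parser.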

rtobar commented on July 21, 2024

You should have received a warning on the console about this, did you not?

I actually just realized DeprecationWarnings (the one I'm issuing) are not printed by default by most interpreters, unless users actively turn them on. If you run with python -Wall or similar switches you will see it though.
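For anyone who prefers to turn the warning on from inside the script rather than with -W, a minimal sketch using the standard warnings module follows. The function legacy_call is hypothetical and only stands in for a library call that emits a DeprecationWarning; it is not ijson's code.

```python
import warnings


def legacy_call():
    # Hypothetical stand-in for a library function that warns about str input.
    warnings.warn("str input is deprecated, pass bytes", DeprecationWarning, stacklevel=2)
    return 42


# Record warnings so the effect of the filter can be inspected afterwards.
with warnings.catch_warnings(record=True) as caught:
    # "always" reports every DeprecationWarning, much like -Wall does for all categories.
    warnings.simplefilter("always", DeprecationWarning)
    result = legacy_call()

for w in caught:
    print(w.category.__name__, "-", w.message)
```

In a real script the simplefilter call alone, placed before the library is used, should be enough to make the warning appear on the console.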

tatobi commented on July 21, 2024

Hi @rtobar,

Thank you!!
Tested your code, you are right.
This also raises the same error:

n = 64 * 1024
reader = functools.partial(sys.stdin.read, n)
for data in iter(reader, ''):
    pass

sys.exit(0)

cat x.json | python -Wall process.py
Traceback (most recent call last):
  File "process.py", line 12, in <module>
    for data in iter(reader, ''):
  File "/home/tobi/miniconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 47849: invalid start byte

The stream is generated by an external tool (tshark) and, as far as I can see from its options, we cannot correct it: we have no control over the producer's encoding, so the question is still open. Probably sed... I need to filter out or drop these bytes somehow. Performance is critical, as we are dealing with a huge amount of data.

rtobar commented on July 21, 2024

@tatobi good point, you could probably add iconv -f utf8 -t utf8 -c to your shell pipeline to clean up the content before giving it to ijson. And try using sys.stdin.buffer instead of sys.stdin; that will hopefully give you some performance gain.
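If changing the shell pipeline is not an option, a similar cleanup can be approximated in Python. The sketch below is an assumption-laden illustration, not an ijson feature: drop_invalid_utf8 is a hypothetical helper that uses the standard library's incremental UTF-8 decoder with errors="ignore" to skip undecodable bytes. It re-introduces a decode+encode pass, so it is likely slower than iconv on large streams.

```python
import codecs
import io


def drop_invalid_utf8(stream, chunk_size=64 * 1024):
    """Yield UTF-8 encoded chunks from a binary stream, skipping invalid bytes."""
    # The incremental decoder keeps partial multi-byte sequences between chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    while True:
        chunk = stream.read(chunk_size)
        # On the final (empty) read, final=True flushes any dangling partial sequence.
        text = decoder.decode(chunk, final=not chunk)
        if text:
            yield text.encode("utf-8")
        if not chunk:
            break


# Made-up input: valid JSON text with a stray 0xa8 byte after the 'é'.
dirty = io.BytesIO('{"k": "é'.encode("utf-8") + b"\xa8" + b'"}')
# A tiny chunk size forces the two-byte 'é' to straddle a chunk boundary.
cleaned = b"".join(drop_invalid_utf8(dirty, chunk_size=4))
print(cleaned.decode("utf-8"))
```

In a real pipeline the stream argument would be sys.stdin.buffer. Feeding the cleaned chunks to a parser would still need a small file-like wrapper, which is left out here.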

tatobi commented on July 21, 2024

@rtobar : Thank you!
The iconv -f utf8 -t utf8 -c helped. Processing ran without issue, even with the first script.

cat x.json | iconv -f utf8 -t utf8 -c | python -Wall process.py

I can see the warning now as well.
Because it is not really an ijson issue, I suggest closing it. Nevertheless, it should be helpful to anyone who is facing stream errors.
Thanks again!

rtobar commented on July 21, 2024

Thanks for letting me know that this solved it. This raised a couple of mental notes for myself too: a) I'd like the str->bytes conversion warning to be more visible by default, b) I've been thinking of adding a FAQ section to the documentation and this would be a good addition, and c) I'll eventually study whether and how invalid utf8 bytes could be handled within ijson.
