rtobar avatar rtobar commented on July 21, 2024

Hi @tatobi,

The very short answer to your final question is: there is no way to deal with erroneous individual bytes in ijson, at least not yet, but in your example we couldn't do so even if we supported it.

A longer answer now: if I'm reading the error correctly, your program would fail even without ijson in the picture, because the decoding error happens when data is read from sys.stdin and decoded from utf8 bytes into strings (sys.stdin is a text file in python3). In other words, I think this should also fail with the same error:

cat file.json | python3 just_read.py

# just_read.py
import functools
import sys

# Read stdin in 64 KiB chunks; n must be defined before it is bound into reader.
n = 64 * 1024
reader = functools.partial(sys.stdin.read, n)
for data in iter(reader, ''):
    pass

This also means you are doing extra work when passing your data to ijson -- on the one hand sys.stdin is automatically decoding utf8 bytes into strings, which ijson is then encoding back into utf8 bytes. You should have received a warning on the console about this, did you not? To bypass that extra decoding+encoding overhead you can pass down sys.stdin.buffer to ijson instead (see the note in the link above), which is the raw byte stream.

Apart from performance, the other difference that passing sys.stdin.buffer would make is that ijson would now see the original error, which isn't possible with the original code (the error happens within sys.stdin, so there's no way ijson can properly recover from it). This would at least give ijson a chance to deal with it, but again, this is not something we currently support. If you have any control over the producer of your data, then your best chance is to get that fixed so it emits valid utf8-encoded bytes.
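To illustrate the point about where the error originates, here is a minimal, stdlib-only sketch. It assumes the text layer uses UTF-8 (as in the traceback reported in this issue) and uses io.TextIOWrapper and io.BytesIO as stand-ins for sys.stdin and sys.stdin.buffer; the payload and the helper names are hypothetical.

```python
import io

# Hypothetical payload: JSON-looking bytes with one stray 0xa8 byte,
# which can never start a UTF-8 sequence.
raw = b'{"name": "caf\xa8"}'


def read_as_text(data: bytes) -> str:
    """Read through a UTF-8 text layer, similar to what sys.stdin does."""
    text_stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
    return text_stream.read()


def read_as_bytes(data: bytes) -> bytes:
    """Read the binary stream directly, similar to sys.stdin.buffer."""
    return io.BytesIO(data).read()


try:
    read_as_text(raw)
    text_failed = False
except UnicodeDecodeError as exc:
    # The failure happens in the text layer, before any parser sees the data.
    text_failed = True
    print("text layer failed:", exc.reason)

# The binary layer hands over the bytes untouched, invalid byte included.
bytes_seen = read_as_bytes(raw)
print("binary layer returned", len(bytes_seen), "bytes")
```

The text read fails before a consumer such as ijson gets any data, while the binary read delivers the offending byte as-is, so only in the second case could a parser even notice the problem.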

from ijson.

rtobar avatar rtobar commented on July 21, 2024

You should have received a warning on the console about this, did you not?

I actually just realized DeprecationWarnings (the one I'm issuing) are not printed by default by most interpreters, unless users actively turn them on. If you run with python -Wall or similar switches you will see it though.
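As a rough illustration of that behaviour, the sketch below uses only the standard warnings module. legacy_call is a hypothetical stand-in for any function that issues a DeprecationWarning (it is not ijson's actual code), and the "always" filter is assumed to approximate what -Wall enables.

```python
import warnings


def legacy_call():
    # Hypothetical function standing in for an API that warns about deprecated usage.
    warnings.warn("str input is deprecated, pass bytes", DeprecationWarning, stacklevel=2)
    return "done"


# Record warnings with every category enabled, roughly what `python -Wall` does.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = legacy_call()

for w in caught:
    print(f"{w.category.__name__}: {w.message}")
```

Without the filter change (or the command-line switch), a DeprecationWarning raised from inside library code is typically ignored by the default filters, which is why it is easy to miss.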

tatobi avatar tatobi commented on July 21, 2024

Hi @rtobar,

Thank you!!
Tested your code, you are right.
This also raises the same error:

import functools
import sys

n = 64 * 1024
reader = functools.partial(sys.stdin.read, n)
for data in iter(reader, ''):
    pass

sys.exit(0)

cat x.json | python -Wall process.py
Traceback (most recent call last):
  File "process.py", line 12, in <module>
    for data in iter(reader, ''):
  File "/home/tobi/miniconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 47849: invalid start byte

The stream is generated by an external tool (tshark) and, as far as I can see from its options, we cannot correct it there: we have no control over the producer's encoding, so the question is still open. Probably sed... I need to filter out or drop these bytes somehow. Performance is critical, as we are dealing with a huge amount of data.

rtobar avatar rtobar commented on July 21, 2024

@tatobi good point, you could probably add iconv -f utf8 -t utf8 -c to your shell pipeline to clean up the content before giving it to ijson. Also try using sys.stdin.buffer instead of sys.stdin; that will hopefully give you some performance gain.
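If iconv is not available, a similar cleanup can be sketched in Python with the standard codecs module. This is only an approximation of what iconv -c does, not a drop-in replacement: drop_invalid_utf8 is a hypothetical helper name, the sample bytes are made up, and it has not been benchmarked, so it may well be slower than the shell pipeline.

```python
import codecs
import io


def drop_invalid_utf8(src, dst, chunk_size=64 * 1024):
    """Copy src to dst (both binary streams), skipping bytes that are not valid UTF-8."""
    # An incremental decoder keeps multi-byte sequences intact across chunk boundaries.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(decoder.decode(chunk).encode("utf-8"))
    # Flush the decoder; an incomplete trailing sequence is dropped.
    dst.write(decoder.decode(b"", final=True).encode("utf-8"))


# Made-up input: one stray 0xa8 byte plus a valid two-byte sequence (0xc3 0xa9).
dirty = io.BytesIO(b'{"a": "x\xa8y", "b": "\xc3\xa9"}')
clean = io.BytesIO()
# A tiny chunk size forces the valid sequence to be split across reads.
drop_invalid_utf8(dirty, clean, chunk_size=4)
print(clean.getvalue())
```

In a real script the same function could be called as drop_invalid_utf8(sys.stdin.buffer, sys.stdout.buffer) and placed in the pipeline where iconv would otherwise go.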

tatobi avatar tatobi commented on July 21, 2024

@rtobar : Thank you!
The iconv -f utf8 -t utf8 -c helped. Processing ran without issues, even with the first one.

cat x.json | iconv -f utf8 -t utf8 -c | python -Wall process.py

I can see the warning now as well.
Because it is not really an ijson issue, I suggest closing it. Nevertheless, it should be helpful to anyone who is facing stream errors.
Thanks again!

rtobar avatar rtobar commented on July 21, 2024

Thanks for letting me know that this solved it. This raised a couple of mental notes for myself too: a) I'd like the str->bytes conversion warning to be more visible by default, b) I've been thinking of adding a FAQ section to the documentation and this would be a good addition, and c) I'll eventually study whether and how invalid utf8 bytes could be handled within ijson.
