Comments (6)
Hi @tatobi,
The very short answer to your final question is: there is currently no way to deal with erroneous individual bytes in ijson, and in your example we couldn't handle them even if ijson supported it.
A longer answer now: if I'm reading the error correctly, your program would fail even without ijson in the picture, because the decoding error happens when data is read from sys.stdin and decoded from utf8 bytes into strings (sys.stdin is a text file in python3). In other words, I think this should also fail with the same error:
cat file.json | python3 just_read.py
# just_read.py
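import sys
import functools
# Read in 64 KiB chunks; n must be defined before it is bound into reader
n = 64 * 1024
reader = functools.partial(sys.stdin.read, n)
# sys.stdin.read returns '' at EOF, which stops the iteration
for data in iter(reader, ''):
    pass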
This also means you are doing extra work when passing your data to ijson: sys.stdin automatically decodes utf8 bytes into strings, which ijson then encodes back into utf8 bytes. You should have received a warning on the console about this, did you not? To bypass that extra decoding+encoding overhead you can pass sys.stdin.buffer, which is the raw byte stream, to ijson instead (see the note in the link above).
Apart from performance, the other difference that passing sys.stdin.buffer would make is that ijson would now see the original error, which isn't possible with the original code (the error happens within sys.stdin, so there's no way ijson can properly recover from it). This would at least give ijson a chance to deal with it, but again, this is not something we currently support. If you have any control over the producer of your data, then your best chance is to get that fixed so it emits valid utf8-encoded bytes.
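To illustrate the difference between the text and binary streams, here is a minimal sketch that uses in-memory stand-ins rather than the real sys.stdin. The RAW payload and the read_all helper are made up for this example; io.TextIOWrapper plays the role of sys.stdin and io.BytesIO the role of sys.stdin.buffer.

```python
import functools
import io

# Hypothetical payload: JSON-like bytes with one stray 0xa8 byte in a string.
RAW = b'{"name": "abc\xa8def"}'


def read_all(stream, sentinel, chunk_size=64 * 1024):
    """Drain a stream in fixed-size chunks and return the list of chunks."""
    reader = functools.partial(stream.read, chunk_size)
    return list(iter(reader, sentinel))


# Text mode: like sys.stdin, bytes are decoded to str while reading,
# so the invalid byte raises before any consumer sees the data.
text_stream = io.TextIOWrapper(io.BytesIO(RAW), encoding="utf-8")
try:
    read_all(text_stream, "")
    text_error = None
except UnicodeDecodeError as exc:
    text_error = exc

# Binary mode: like sys.stdin.buffer, the raw bytes are handed over untouched,
# so the consumer gets to see (and potentially handle) the bad byte itself.
binary_chunks = read_all(io.BytesIO(RAW), b"")
```

In a real script the binary stream would be sys.stdin.buffer, passed to ijson in place of sys.stdin.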
from ijson.
You should have received a warning on the console about this, did you not?
I actually just realized that DeprecationWarnings (the kind I'm issuing) are not printed by default by most interpreters unless users actively turn them on. If you run with python -Wall or similar switches you will see it though.
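As a rough sketch of the same idea from inside a program, the standard warnings module can be told to surface every warning, which is similar in effect to the -Wall switch. The legacy_call function and its message are hypothetical stand-ins for any API that emits a DeprecationWarning; they are not ijson code.

```python
import warnings


def legacy_call():
    # Hypothetical function standing in for an API that warns about str input.
    warnings.warn(
        "str input is deprecated, pass bytes instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return "done"


# Record warnings instead of printing them, and make sure none are filtered out.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = legacy_call()
```

Outside of a catch_warnings block, calling warnings.simplefilter("always") early in a script has a comparable effect, with warnings printed to stderr instead of being recorded.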
Hi @rtobar,
Thank you!!
Tested your code, you are right.
This also throws the same error:
n = 64 * 1024
reader = functools.partial(sys.stdin.read, n)
for data in iter(reader, ''):
    pass
sys.exit(0)
cat x.json | python -Wall process.py
Traceback (most recent call last):
  File "process.py", line 12, in <module>
    for data in iter(reader, ''):
  File "/home/tobi/miniconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 47849: invalid start byte
The stream is generated by an external tool (tshark) and, as far as I can see from its options, we cannot correct it: we have no control over the producer's encoding, so the question is still open. Probably sed ... I need to filter out or drop these bytes somehow. Performance is critical; we are dealing with a huge amount of data.
@tatobi good point, you could probably add iconv -f utf8 -t utf8 -c to your shell pipeline to clean up the content before giving it to ijson. And try using sys.stdin.buffer instead of sys.stdin; that will hopefully give you some performance gain.
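If adding iconv to the pipeline is not an option, a roughly similar cleanup can be sketched in Python with the standard codecs module. This is only an illustration under assumptions: drop_invalid_utf8 is a made-up helper, not part of ijson, it is not guaranteed to match iconv -c byte for byte, and the decode-then-encode round trip has a cost of its own.

```python
import codecs
import io


def drop_invalid_utf8(stream, chunk_size=64 * 1024):
    """Yield UTF-8 bytes from a binary stream, dropping invalid sequences."""
    # The incremental decoder keeps partial multi-byte characters between
    # chunks, so a character split across a chunk boundary is not lost.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # Flush whatever is still buffered at end of input.
            tail = decoder.decode(b"", final=True)
            if tail:
                yield tail.encode("utf-8")
            break
        text = decoder.decode(chunk)
        if text:
            yield text.encode("utf-8")


# Hypothetical input: valid text, one stray 0xa8 byte, more valid text.
# A tiny chunk size forces the two-byte "é" to straddle a chunk boundary.
raw = "caf\u00e9 ".encode("utf-8") + b"\xa8" + b"ok"
cleaned = b"".join(drop_invalid_utf8(io.BytesIO(raw), chunk_size=2))
```

In a real script the input stream would be sys.stdin.buffer rather than the in-memory io.BytesIO used here.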
@rtobar : Thank you!
The iconv -f utf8 -t utf8 -c filter helped. Processing ran without issue, even with the first script.
cat x.json | iconv -f utf8 -t utf8 -c | python -Wall process.py
I can see the warning now as well.
Because it is not really an ijson issue, I suggest closing it. Nevertheless, it should be helpful to anyone who is facing stream errors.
Thanks again!
Thanks for letting me know that this solved it. This raised a couple of mental notes for myself too: a) I'd like the str->bytes conversion warning to be more visible by default, b) I've been thinking of adding a FAQ section to the documentation, and this would be a good addition, and c) I'll eventually study whether and how invalid utf8 bytes could be handled within ijson.