Comments (8)
Is there a solution for parsing json files with syntax problems?
@carlosg-m saldy, no.
Can this type of error be skipped?
Sadly, not either.
Addressing both problems would require a fair amount of work. In particular, all the yajl
-based backends (almost all) do not get access to the last valid byte in the buffer when errors are found, and therefore there's nothing ijson can work with to fix or skip the problem, or to let the user do it.
The best you can hope for is to make sure the data you are parsing is well defined before you give it to ijson. Obviously the best place to begin with is the code that writes them, if you have access to it. If you don't you can used sed
or any similar streaming mechainsm (even within your own program) you to change those NaN
s into some proper JSON content (e.g., sed 's/NaN,NaN/"nan","nan"/g' you-file > fixed-file
) before it's given to ijson.
from ijson.
I think nan support is something that should be seriously reconsidered. While it's not part of standard json, nans are extremely common in real life and real data. This issue is blocking me from making use of ijson. I suppose I'll try stream manipulation in the meantime, but I'm interested in finding a way to solve this.
(or perhaps just the python backend could get experimental support for this feature?)
Is the diff as simple as:
diff --git a/ijson/backends/python.py b/ijson/backends/python.py
index 8efda79..1726ef0 100644
--- a/ijson/backends/python.py
+++ b/ijson/backends/python.py
@@ -8,7 +8,7 @@ from ijson import common, utils
import codecs
-LEXEME_RE = re.compile(r'[a-z0-9eE\.\+-]+|\S')
+LEXEME_RE = re.compile(r'[a-z0-9eNE\.\+-]+|\S')
UNARY_LEXEMES = set('[]{},')
EOF = -1, None
@@ -184,6 +184,9 @@ def parse_value(target, multivalue, use_float):
elif symbol == 'false':
send(('boolean', False))
pop()
+ elif symbol == 'NaN':
+ send(('number', float('nan')))
+ pop()
elif symbol[0] == '"':
send(('string', parse_string(symbol)))
pop()
@@ -226,7 +229,7 @@ def parse_value(target, multivalue, use_float):
if number == inf:
raise common.JSONError("float overflow: %s" % (symbol,))
except:
- if 'true'.startswith(symbol) or 'false'.startswith(symbol) or 'null'.startswith(symbol):
+ if 'true'.startswith(symbol) or 'false'.startswith(symbol) or 'null'.startswith(symbol) or 'NaN'.startswith(symbol):
raise common.IncompleteJSONError('Incomplete JSON content')
raise UnexpectedSymbol(symbol, pos)
else:
?
from ijson.
@Erotemic I'm not sure if that diff you posted above would be all that's necessary (I don't mean this as in "there is stuff missing!", I'm literally not sure because I'd have to invest some time to think thoroughly about it). In any case it would only cover the python backend -- which is definitely not what you want to be using in a production environment, when the yajl2_c
backend is about ~50x times faster.
The rest of the backends are not that easily modifiable though (or not at all, some just load compiled externally-provided shared libraries), so I'd be hesitant to add support for a non-standard JSON feature that is only supported by a single backends, which happens to be the slowest.
So sorry, but I won't be adding support for nans in ijson. Even if they are a common in real life and real data, they should not be common in JSON documents. I'm sorry to hear this blocks you from using ijson, but I'd expect this to block you from using a few other JSON libraries too that probably won't handle nans either. If you have control over your data sources, then you can try modifying it so it generates valid JSON; if you don't have that control then your best bet is to tap into the stream and perform the necessary corrections to obtain valid JSON content.
from ijson.
can hope for is to make sure the data you are parsing is well defined before you give it to ijson. Obviously the best place to begin with is the code that writes them, if you have access to it. If you don't you can used
sed
or any similar streaming mechainsm
Thank you for the information it helps a lot. We are going to try to fix the issue directly on the source, that in this case is a custom made legacy geographic information system.
from ijson.
which is definitely not what you want to be using in a production environment
Actually it is. When I have a 12 GB json file, where I know there is one entry at the very top that is only a few bytes, taking a 50x slowdown to read a few bytes is more than worth it. Specifically, I have a COCO json file that represents an object detection dataset. The basic structure is {"info": List[dict], "categories": List[dict], "images": List[dict], "annotations": List[dict]}
. There are thousands of image and annotation entries, which make the file quite large, and in this case I'm only interested in parsing the items in "info" at the very top because they contain some metadata I'm interested in.
It's not uncommon for some backends to include features that are unsupported by faster more specialized backends, especially when they are experimental. There are very real production use-cases where using a slower, more feature-rich backend is an overall benefit to the system. It is nice for all backends to have feature parity, but I think could be more room for flexibility than your initial impressions would suggest.
they should not be common in JSON documents
This is an opinion, and I respectfully disagree. My opinion is that it was shortsighted for the rfc8259 standard not to include NaN. A NaN and null value have subtly different meanings, and they are not interchangeable.
Of course you aren't obliged to make or accept changes to this library, and even less obliged to change your opinion. There's not anything objectively wrong with it, and there is value in adhering to specifications, but I want to suggest that your notion of correctness is conflated with compliance. The opposing opinion is simply non-compliant, not incorrect. Food for thought.
Fortunately this library is extremely well written and I can simply copy the python backend into a new module, modify it, and use it with the regular ijson core. For anyone else that needs the functionality, here is the code that I'm using: https://gist.github.com/Erotemic/60ec1b4ffc9961fa34a1821bb139f70f
from ijson.
I'm confused by this issue. ijson is a JSON parser, and NaN is not part of JSON...
from ijson.
True, yet support for values like NaN and Inf are common in json parsers (including ujson and the stdlib json library). There is no specification for NaN in RFC8259. It's not unreasonable to deny support for it, but it's also not unreasonable to request support for it, or fork a project to add support for it.
from ijson.
@Erotemic thanks for the constructive discussion, and for raising your points with respect and clarity -- that's hugely appreciated. Adding that link to your solution is also a good idea for people in the future who might find it useful.
"which is definitely not what you want to be using in a production environment"
Actually it is
Sorry, I meant this in the more general case, and was just guessing at your particular use case, where you indeed don't care much about performance.
"they [nands] should not be common in JSON documents"
This is an opinion, and I respectfully disagree
My apologies, that was poor writing from my side, I meant that phrase as an observation, not as an opinion. The observation is that this package is downloaded millions of time per month and yet the issue of NaNs appearing in JSON documents has been brought up only two times in multiple years. My assumption then is that NaNs do not appear in JSON documents that often.
I'm not sure what the story behind explicitly not supporting NaNs in the RFC is. My impression is that they didn't want to bind themselves to the IEEE 754 format, given also how they allow arbitrarily big or precise numbers (e.g., "1e400" or "3.141592653589793238462643383279") to be used, but that's just a guess. I agree with that that this feels like an obvious miss, but this is thanks to the benefit of hindsight; the original specification is 16 years old, when JSON wasn't as ubiquitous as it is today.
About the support for NaNs in ijson, and to finish my intervention: probably if ijson consisted only on the python backend I wouldn't mind too much about adding this support. But the fact that there are other backends that I don't control as much (plus another one on the works) I'm much more hesitant to add more maintenance burden on myself for a feature that isn't standard. If I merge this in it's a matter of time before people request support for inf
/infinite
, then for pi
, then to support these in the other backends, etc.
from ijson.
Related Issues (20)
- yajl2_c backend crashes on PyPy3 HOT 19
- Is there a way to recursively iterate the key? HOT 4
- ijson.items(file, prefix) waits for EOF HOT 8
- Wheels for Python 3.12 with yajl2_c backend HOT 4
- Include array index HOT 2
- Iterate over more than one prefix? HOT 2
- HighLevelAPI: Raise an error if the prefix does not exist HOT 2
- Is it possible to use multiple prefix HOT 8
- yajl2_c backend for lambda function HOT 2
- How to use ijson to covert string to dict? HOT 3
- How to read json records in chunks using ijson? HOT 4
- Question: is it possible that returing bytes instead of str could speedup parsing? HOT 3
- Thread safety HOT 9
- Full support for byte stream generator HOT 9
- Allow to use ijson package by a relative import HOT 4
- How can I most-efficiently check for a key in the top-level of a json object? HOT 3
- Python3.12 compilation error: ‘PyGenObject’ has no member named ‘gi_code’ HOT 5
- Is it possible to use isjon with Jsonl, ndjson ? HOT 5
- Memory leak on exception handling with yajl2_c backend HOT 6
- _yajl2 backend broken with Python 3.12 HOT 9
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from ijson.