
Comments (8)

rtobar commented on July 21, 2024

Is there a solution for parsing json files with syntax problems?

@carlosg-m sadly, no.

Can this type of error be skipped?

Sadly, not that either.

Addressing both problems would require a fair amount of work. In particular, all the yajl-based backends (almost all) do not get access to the last valid byte in the buffer when errors are found, and therefore there's nothing ijson can work with to fix or skip the problem, or to let the user do it.

The best you can hope for is to make sure the data you are parsing is well formed before you give it to ijson. Obviously the best place to start is the code that writes it, if you have access to it. If you don't, you can use sed or any similar streaming mechanism (even within your own program) to change those NaNs into some proper JSON content (e.g., sed 's/NaN,NaN/"nan","nan"/g' your-file > fixed-file) before it's given to ijson.
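A minimal sketch of the "within your own program" variant, assuming the input is a binary file-like object. `NaNReplacingReader` is a hypothetical helper, not part of ijson. It mirrors the sed suggestion by replacing every `NaN` byte sequence with `"nan"`, and like sed it is a plain byte substitution, so a literal NaN inside a JSON string would be rewritten too.

```python
import io
import json


class NaNReplacingReader:
    """File-like wrapper that rewrites NaN tokens while streaming (hypothetical helper)."""

    def __init__(self, raw, old=b"NaN", new=b'"nan"', chunk_size=64 * 1024):
        # Assumption: `new` does not end with a prefix of `old`.
        self._raw = raw
        self._old = old
        self._new = new
        self._chunk_size = chunk_size
        self._pending = b""  # tail that may be the start of a split token
        self._ready = b""    # substituted bytes waiting to be handed out
        self._eof = False

    def read(self, size=-1):
        # Refill until the request can be served or the source is exhausted.
        while not self._eof and (size < 0 or len(self._ready) < size):
            chunk = self._raw.read(self._chunk_size)
            if not chunk:
                self._eof = True
                self._ready += self._pending  # incomplete token, pass through
                self._pending = b""
                break
            data = (self._pending + chunk).replace(self._old, self._new)
            # Hold back a trailing partial token (e.g. b"N" or b"Na") so that
            # a NaN split across two chunks is still replaced.
            keep = 0
            for k in range(len(self._old) - 1, 0, -1):
                if data.endswith(self._old[:k]):
                    keep = k
                    break
            self._ready += data[:len(data) - keep]
            self._pending = data[len(data) - keep:]
        if size < 0:
            out, self._ready = self._ready, b""
        else:
            out, self._ready = self._ready[:size], self._ready[size:]
        return out


# Tiny chunk size to exercise tokens that straddle chunk boundaries.
source = io.BytesIO(b'{"a": [NaN, 1.5, NaN], "b": NaN}')
fixed = NaNReplacingReader(source, chunk_size=4).read()
parsed = json.loads(fixed)
```

Since the wrapper only exposes `read`, it could presumably be handed to ijson in place of the original file object, for example `ijson.items(NaNReplacingReader(f), "some.prefix")`; that usage is untested here.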

from ijson.

Erotemic commented on July 21, 2024

I think nan support is something that should be seriously reconsidered. While it's not part of standard json, nans are extremely common in real life and real data. This issue is blocking me from making use of ijson. I suppose I'll try stream manipulation in the meantime, but I'm interested in finding a way to solve this.

(or perhaps just the python backend could get experimental support for this feature?)

Is the diff as simple as:

diff --git a/ijson/backends/python.py b/ijson/backends/python.py
index 8efda79..1726ef0 100644
--- a/ijson/backends/python.py
+++ b/ijson/backends/python.py
@@ -8,7 +8,7 @@ from ijson import common, utils
 import codecs
 
 
-LEXEME_RE = re.compile(r'[a-z0-9eE\.\+-]+|\S')
+LEXEME_RE = re.compile(r'[a-z0-9eNE\.\+-]+|\S')
 UNARY_LEXEMES = set('[]{},')
 EOF = -1, None
 
@@ -184,6 +184,9 @@ def parse_value(target, multivalue, use_float):
             elif symbol == 'false':
                 send(('boolean', False))
                 pop()
+            elif symbol == 'NaN':
+                send(('number', float('nan')))
+                pop()
             elif symbol[0] == '"':
                 send(('string', parse_string(symbol)))
                 pop()
@@ -226,7 +229,7 @@ def parse_value(target, multivalue, use_float):
                     if number == inf:
                         raise common.JSONError("float overflow: %s" % (symbol,))
                 except:
-                    if 'true'.startswith(symbol) or 'false'.startswith(symbol) or 'null'.startswith(symbol):
+                    if 'true'.startswith(symbol) or 'false'.startswith(symbol) or 'null'.startswith(symbol) or 'NaN'.startswith(symbol):
                         raise common.IncompleteJSONError('Incomplete JSON content')
                     raise UnexpectedSymbol(symbol, pos)
                 else:

?
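To illustrate what the first hunk of the diff changes, here is a small sketch that runs both regular expressions, copied from the diff, over a sample input. The names `ORIGINAL_LEXEME_RE`, `PATCHED_LEXEME_RE` and `sample` are made up for this example. It only shows the tokenizing step, not whether the rest of the backend handles the new lexeme correctly.

```python
import re

# Both patterns are taken verbatim from the diff above.
ORIGINAL_LEXEME_RE = re.compile(r'[a-z0-9eE\.\+-]+|\S')
PATCHED_LEXEME_RE = re.compile(r'[a-z0-9eNE\.\+-]+|\S')

sample = '[1.5, NaN, true]'

# Without "N" in the character class, NaN is split into three lexemes,
# because "N" only matches the single-character \S fallback.
original_lexemes = ORIGINAL_LEXEME_RE.findall(sample)

# With "N" added, NaN comes out as one lexeme that parse_value can compare.
patched_lexemes = PATCHED_LEXEME_RE.findall(sample)

print(original_lexemes)
print(patched_lexemes)
```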


rtobar commented on July 21, 2024

@Erotemic I'm not sure if that diff you posted above would be all that's necessary (I don't mean this as in "there is stuff missing!", I'm literally not sure because I'd have to invest some time to think thoroughly about it). In any case it would only cover the python backend -- which is definitely not what you want to be using in a production environment, when the yajl2_c backend is about 50x faster.

The rest of the backends are not that easily modifiable though (or not at all, some just load compiled externally-provided shared libraries), so I'd be hesitant to add support for a non-standard JSON feature that is only supported by a single backend, which happens to be the slowest.

So sorry, but I won't be adding support for nans in ijson. Even if they are common in real life and real data, they should not be common in JSON documents. I'm sorry to hear this blocks you from using ijson, but I'd expect this to block you from using a few other JSON libraries too that probably won't handle nans either. If you have control over your data sources, then you can try modifying them so they generate valid JSON; if you don't have that control then your best bet is to tap into the stream and perform the necessary corrections to obtain valid JSON content.


carlosg-m commented on July 21, 2024

can hope for is to make sure the data you are parsing is well defined before you give it to ijson. Obviously the best place to begin with is the code that writes them, if you have access to it. If you don't you can use sed or any similar streaming mechanism

Thank you for the information, it helps a lot. We are going to try to fix the issue directly at the source, which in this case is a custom-made legacy geographic information system.


Erotemic commented on July 21, 2024

which is definitely not what you want to be using in a production environment

Actually it is. When I have a 12 GB json file, where I know there is one entry at the very top that is only a few bytes, taking a 50x slowdown to read a few bytes is more than worth it. Specifically, I have a COCO json file that represents an object detection dataset. The basic structure is {"info": List[dict], "categories": List[dict], "images": List[dict], "annotations": List[dict]}. There are thousands of image and annotation entries, which make the file quite large, and in this case I'm only interested in parsing the items in "info" at the very top because they contain some metadata I'm interested in.

It's not uncommon for some backends to include features that are unsupported by faster, more specialized backends, especially when they are experimental. There are very real production use-cases where using a slower, more feature-rich backend is an overall benefit to the system. It is nice for all backends to have feature parity, but I think there could be more room for flexibility than your initial impressions would suggest.

they should not be common in JSON documents

This is an opinion, and I respectfully disagree. My opinion is that it was shortsighted for the RFC 8259 standard not to include NaN. NaN and null values have subtly different meanings, and they are not interchangeable.

Of course you aren't obliged to make or accept changes to this library, and even less obliged to change your opinion. There's not anything objectively wrong with it, and there is value in adhering to specifications, but I want to suggest that your notion of correctness is conflated with compliance. The opposing opinion is simply non-compliant, not incorrect. Food for thought.

Fortunately this library is extremely well written and I can simply copy the python backend into a new module, modify it, and use it with the regular ijson core. For anyone else that needs the functionality, here is the code that I'm using: https://gist.github.com/Erotemic/60ec1b4ffc9961fa34a1821bb139f70f


jpmckinney commented on July 21, 2024

I'm confused by this issue. ijson is a JSON parser, and NaN is not part of JSON...


Erotemic commented on July 21, 2024

True, yet support for values like NaN and Inf is common in JSON parsers (including ujson and the stdlib json library). There is no specification for NaN in RFC 8259. It's not unreasonable to deny support for it, but it's also not unreasonable to request support for it, or fork a project to add support for it.
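For reference, a short sketch of how the stdlib json module treats these values. It accepts and emits them by default, and it offers opt-in strictness through `parse_constant` and `allow_nan`. The helper name `reject_constant` is made up for this example.

```python
import json
import math

# By default the stdlib parser accepts the non-standard constants.
values = json.loads('[NaN, Infinity, -Infinity]')
nan_parsed = math.isnan(values[0])

# By default the encoder also emits them.
encoded = json.dumps([float('nan'), float('inf')])


def reject_constant(name):
    """Hypothetical hook: refuse NaN, Infinity and -Infinity while parsing."""
    raise ValueError("non-standard JSON constant: %s" % name)


# Strict parsing: parse_constant is called for each of the three constants.
try:
    json.loads('[NaN]', parse_constant=reject_constant)
    strict_parse_failed = False
except ValueError:
    strict_parse_failed = True

# Strict encoding: allow_nan=False makes the encoder raise ValueError.
try:
    json.dumps([float('nan')], allow_nan=False)
    strict_dump_failed = False
except ValueError:
    strict_dump_failed = True
```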


rtobar commented on July 21, 2024

@Erotemic thanks for the constructive discussion, and for raising your points with respect and clarity -- that's hugely appreciated. Adding that link to your solution is also a good idea for people in the future who might find it useful.

"which is definitely not what you want to be using in a production environment"
Actually it is

Sorry, I meant this in the more general case, and was just guessing at your particular use case, where you indeed don't care much about performance.

"they [NaNs] should not be common in JSON documents"
This is an opinion, and I respectfully disagree

My apologies, that was poor writing on my side; I meant that phrase as an observation, not as an opinion. The observation is that this package is downloaded millions of times per month and yet the issue of NaNs appearing in JSON documents has been brought up only two times in multiple years. My assumption then is that NaNs do not appear in JSON documents that often.

I'm not sure what the story behind explicitly not supporting NaNs in the RFC is. My impression is that they didn't want to bind themselves to the IEEE 754 format, given also how they allow arbitrarily big or precise numbers (e.g., "1e400" or "3.141592653589793238462643383279") to be used, but that's just a guess. I agree that this feels like an obvious miss, but this is thanks to the benefit of hindsight; the original specification is 16 years old, from a time when JSON wasn't as ubiquitous as it is today.
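As a small illustration of the two example numbers, using the stdlib json module rather than ijson: parsed as IEEE 754 doubles, the first overflows and the second loses digits, while a non-float type such as `Decimal` keeps both intact. The variable names are made up for this sketch. The diff quoted earlier in the thread suggests the python backend raises a "float overflow" error in the first case instead of returning infinity.

```python
import json
from decimal import Decimal

big = "1e400"
precise = "3.141592653589793238462643383279"

# As IEEE 754 doubles: the first value overflows to infinity and the
# second is rounded to the nearest representable double.
float_big = json.loads(big)
float_precise = json.loads(precise)

# With parse_float=Decimal both values survive without loss.
decimal_big = json.loads(big, parse_float=Decimal)
decimal_precise = json.loads(precise, parse_float=Decimal)

print(float_big, float_precise)
print(decimal_big, decimal_precise)
```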

About the support for NaNs in ijson, and to finish my intervention: if ijson consisted only of the python backend I probably wouldn't mind too much about adding this support. But given that there are other backends that I don't control as much (plus another one in the works), I'm much more hesitant to add more maintenance burden on myself for a feature that isn't standard. If I merge this in it's a matter of time before people request support for inf/infinite, then for pi, then to support these in the other backends, etc.

