simdjson / simdjson Goto Github PK

Parsing gigabytes of JSON per second : used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks

Home Page: https://simdjson.org

License: Apache License 2.0

Makefile 0.01% C++ 97.54% C 1.14% Shell 0.26% JavaScript 0.01% Python 0.40% CMake 0.65% Ruby 0.01% Dockerfile 0.01%

json json-parser simd avx2 sse42 neon aarch64 arm64 arm vs2019

simdjson's Introduction

simdjson : Parsing gigabytes of JSON per second

JSON is everywhere on the Internet. Servers spend a *lot* of time parsing it. We need a fresh approach. The simdjson library uses commonly available SIMD instructions and microparallel algorithms to parse JSON 4x faster than RapidJSON and 25x faster than JSON for Modern C++.

Fast: Over 4x faster than commonly used production-grade JSON parsers.
Record Breaking Features: Minify JSON at 6 GB/s, validate UTF-8 at 13 GB/s, NDJSON at 3.5 GB/s.
Easy: First-class, easy to use and carefully documented APIs.
Strict: Full JSON and UTF-8 validation, lossless parsing. Performance with no compromises.
Automatic: Selects a CPU-tailored parser at runtime. No configuration needed.
Reliable: From memory allocation to error handling, simdjson's design avoids surprises.
Peer Reviewed: Our research appears in venues like VLDB Journal, Software: Practice and Experience.

This library is part of the Awesome Modern C++ list.

Real-world usage
Quick Start
Documentation
Godbolt
Performance results
Bindings and Ports of simdjson
About simdjson
Funding
Contributing to simdjson
License

Real-world usage

If you are planning to use simdjson in a product, please work from one of our releases.

Quick Start

The simdjson library is easily consumable with a single .h and .cpp file.

Prerequisites: g++ (version 7 or better) or clang++ (version 6 or better), and a 64-bit system with a command-line shell (e.g., Linux, macOS, freeBSD). We also support programming environments like Visual Studio and Xcode, but different steps are needed. Users of clang++ may need to specify the C++ version (e.g., c++ -std=c++17) since clang++ tends to default on C++98.

Pull simdjson.h and simdjson.cpp into a directory, along with the sample file twitter.json. You can download them with the wget utility:

wget https://raw.githubusercontent.com/simdjson/simdjson/master/singleheader/simdjson.h https://raw.githubusercontent.com/simdjson/simdjson/master/singleheader/simdjson.cpp https://raw.githubusercontent.com/simdjson/simdjson/master/jsonexamples/twitter.json

Create quickstart.cpp:

#include <iostream>
#include "simdjson.h"
using namespace simdjson;
int main(void) {
    ondemand::parser parser;
    padded_string json = padded_string::load("twitter.json");
    ondemand::document tweets = parser.iterate(json);
    std::cout << uint64_t(tweets["search_metadata"]["count"]) << " results." << std::endl;
}

c++ -o quickstart quickstart.cpp simdjson.cpp
./quickstart

 100 results.

Documentation

Usage documentation is available:

Basics is an overview of how to use simdjson and its APIs.
Performance shows some more advanced scenarios and how to tune for them.
Implementation Selection describes runtime CPU detection and how you can work with it.
API contains the automatically generated API documentation.

Godbolt

Some users may want to browse code along with the compiled assembly. You want to check out the following lists of examples:

Performance results

The simdjson library uses three-quarters less instructions than state-of-the-art parser RapidJSON. To our knowledge, simdjson is the first fully-validating JSON parser to run at gigabytes per second (GB/s) on commodity processors. It can parse millions of JSON documents per second on a single core.

The following figure represents parsing speed in GB/s for parsing various files on an Intel Skylake processor (3.4 GHz) using the GNU GCC 10 compiler (with the -O3 flag). We compare against the best and fastest C++ libraries on benchmarks that load and process the data. The simdjson library offers full unicode (UTF-8) validation and exact number parsing.

The simdjson library offers high speed whether it processes tiny files (e.g., 300 bytes) or larger files (e.g., 3MB). The following plot presents parsing speed for synthetic files over various sizes generated with a script on a 3.4 GHz Skylake processor (GNU GCC 9, -O3).

All our experiments are reproducible.

For NDJSON files, we can exceed 3 GB/s with our multithreaded parsing functions.

Bindings and Ports of simdjson

We distinguish between "bindings" (which just wrap the C++ code) and a port to another programming language (which reimplements everything).

ZippyJSON: Swift bindings for the simdjson project.
libpy_simdjson: high-speed Python bindings for simdjson using libpy.
pysimdjson: Python bindings for the simdjson project.
cysimdjson: high-speed Python bindings for the simdjson project.
simdjson-rs: Rust port.
simdjson-rust: Rust wrapper (bindings).
SimdJsonSharp: C# version for .NET Core (bindings and full port).
simdjson_nodejs: Node.js bindings for the simdjson project.
simdjson_php: PHP bindings for the simdjson project.
simdjson_ruby: Ruby bindings for the simdjson project.
fast_jsonparser: Ruby bindings for the simdjson project.
simdjson-go: Go port using Golang assembly.
rcppsimdjson: R bindings.
simdjson_erlang: erlang bindings.
simdjsone: erlang bindings.
lua-simdjson: lua bindings.
hermes-json: haskell bindings.
simdjzon: zig port.
JSON-Simd: Raku bindings.
JSON::SIMD: Perl bindings; fully-featured JSON module that uses simdjson for decoding.
gemmaJSON: Nim JSON parser based on simdjson bindings.
simdjson-java: Java port.

About simdjson

The simdjson library takes advantage of modern microarchitectures, parallelizing with SIMD vector instructions, reducing branch misprediction, and reducing data dependency to take advantage of each CPU's multiple execution cores.

Our default front-end is called On Demand, and we wrote a paper about it:

John Keiser, Daniel Lemire, On-Demand JSON: A Better Way to Parse Documents?, Software: Practice and Experience (to appear)

Some people enjoy reading the first (2019) simdjson paper: A description of the design and implementation of simdjson is in our research article:

Geoff Langdale, Daniel Lemire, Parsing Gigabytes of JSON per Second, VLDB Journal 28 (6), 2019.

We have an in-depth paper focused on the UTF-8 validation:

John Keiser, Daniel Lemire, Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice & Experience 51 (5), 2021.

We also have an informal blog post providing some background and context.

For the video inclined,

(It was the best voted talk, we're kinda proud of it.)

Funding

The work is supported by the Natural Sciences and Engineering Research Council of Canada under grant number RGPIN-2017-03910.

Contributing to simdjson

Head over to CONTRIBUTING.md for information on contributing to simdjson, and HACKING.md for information on source, building, and architecture/design.

License

This code is made available under the Apache License 2.0.

Under Windows, we build some tools using the windows/dirent_portable.h file (which is outside our library code): it is under the liberal (business-friendly) MIT license.

For compilers that do not support C++17, we bundle the string-view library which is published under the Boost license. Like the Apache license, the Boost license is a permissive license allowing commercial redistribution.

For efficient number serialization, we bundle Florian Loitsch's implementation of the Grisu2 algorithm for binary to decimal floating-point numbers. The implementation was slightly modified by JSON for Modern C++ library. Both Florian Loitsch's implementation and JSON for Modern C++ are provided under the MIT license.

For runtime dispatching, we use some code from the PyTorch project licensed under 3-clause BSD.

simdjson's People

Contributors

Stargazers

Watchers

Forkers

amoshappy lpereira j-martin nawazzeeshan niqmk tzuryby utahdave leochencipher vdt rurban leijurv d12santosh sqlcode smorin xcbat tchen0123 longjohncoder briangibb tisma kaybarax johnsmith1212 venseer dhilip89 tomdyson omgasm sujeshthekkepatt whlook sahwar claudiouzelac hariyawan rogersguedes rjammala muharremokutan stjordanis lisprez rlalance shannonyu mznsolucoes iamsingularity marcomontaltomonella anishkny sergeymakeev dbeattie71 databasesworks rouzier stephengroat todda templeblock hhy5277 0xflotus lihuawu cashlalala alenwesker crazyguitar wuyonggang rahulsoibam jay-johnson cuulee superzyl shnwang awesome-archive rockystevejobs fuath cprakashagr tailin fabgithub b08514 waytc baruch2008 tianyuzhang1214 suyuancheng13 thaneacheron drcwr bedri deimsdeutsch abalkin ezhangle tktech googol-lab hoaianhkhang zhanglei jay87682 iquanhe newproggie lifeisstrange olayemi12 gragonvlad egorbo raymondseger skyformat99 sphinxorunixinskie jacktang chenshuguang roki1988 silky dorianzheng bbhunter doytsujin soyoo wojciechmula

simdjson's Issues

consider cleaning out Geoff's 'personalized types'

I am fine with u8, i64 and so forth as type names, but it is hard to justify using so many typedefs when uint8_t and int64_t are perfectly go standards.

The over-reliance on u8 as opposed to char is also a bit of a problem.

This is mostly a "social" problem. The code looks a bit alien. It could look more "standard".

C# Version

Hi ：Is there a version of C#？thanks.

add cmake testing to our circleci setup

We are currently not testing using cmake with circleci...
https://github.com/lemire/simdjson/blob/master/.circleci/config.yml

shovel_machine should probably not forcibly go through all depths...

The shovel_machine function has code that looks like this...

  for (u32 i = 0; i < MAX_DEPTH; i++) {
    ....
  }

My understanding of the code is that if a level is empty (start_loc == end_loc), then all other levels are going to be empty. So the loop should end once this stage is reached.

(I am aware that this is not a performance issue.)

What is "white space"? and is the current approach to detect them best?

@geofflangdale 's code seems to "define" characters to be one of '\t', '\n', '\r', ' ' from reading of the code.

One issue to contend with is that the definition of "white space" depends on the character encoding, so UTF-8 has other white space characters.

Anyhow, let us look at whether what Geoff does is optimally efficient...

The code to detect white space looks like this...


    __m256i v_lo = _mm256_and_si256(
        _mm256_shuffle_epi8(low_nibble_mask, input_lo),
        _mm256_shuffle_epi8(high_nibble_mask,
                            _mm256_and_si256(_mm256_srli_epi32(input_lo, 4),
                                             _mm256_set1_epi8(0x7f))));
    __m256i tmp_ws_lo = _mm256_cmpeq_epi8(
        _mm256_and_si256(v_lo, whitespace_shufti_mask), _mm256_set1_epi8(0));

So AND, SHUF, SHUF, AND, SHIFT, CMP, AND...
And then I guess you have to negate the result...

It sure is complicated!!!

I think you can do it more cheaply... (OR, CMP, SHUF, ADDS) and detect all of the five ASCII white space characters (tab, line feed, line tabulation, form feed, carriage return, space):

__m128i mask_20 = _mm_set1_epi8( 0x20 );// c==32
__m128i mask_70 = _mm_set1_epi8( 0x70 );// adding 0x70 does not check low 4-bits
 // but moves any value >= 16 above 128

  //for 9 <= c <= 13:
 __m128i lut_cntrl = _mm_setr_epi8(
		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
		0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00);

__m128i v = ... // your data

__m128i bytemask = _mm_or_si128(_mm_cmpeq_epi8(mask_20, v),
			_mm_shuffle_epi8(lut_cntrl,  _mm_adds_epu8(mask_70, v)));
// bytemask has 0xFF on ASCII white space

Lower performance on small files

I have checked on small messages with size 10Kb - 15Kb and the benchmark shows only 0.3 - 0.5 Gb per second, while for original files it shows more than 2 Gb per second.

Also, when messages grow in size up to 10Mb then benchmark shows ~1.5 Gb.

Sources of small files are here.

Bellow is a full output:

andriy@notebook:~/Projects/com/github/lemire/simdjson$ ./parse jsonexamples/jsoniter-scala/che-1.geo.json 
number of bytes 11481 number of structural chars 3310 ratio 0.288
mem alloc instructions:       1346 cycles:       3165 (4.38 %) ins/cycles: 0.43 mis. branches:          2 (cycles/mis.branch 1526.95) cache accesses:        136 (failure          0)
 mem alloc runs at 0.28 cycles per input byte.
stage 1 instructions:      47601 cycles:      14322 (19.80 %) ins/cycles: 3.32 mis. branches:          3 (cycles/mis.branch 4227.40) cache accesses:        136 (failure          0)
 stage 1 runs at 1.25 cycles per input byte.
stage 2 instructions:     190120 cycles:      54860 (75.83 %) ins/cycles: 3.47 mis. branches:        126  (cycles/mis.branch 434.83)  cache accesses:        338 (failure          0)
 stage 2 runs at 4.78 cycles per input byte and 16.57 cycles per structural character.
 all stages: 6.30 cycles per input byte.
Min:  3.7839e-05 bytes read: 11481 Gigabytes/second: 0.303417
andriy@notebook:~/Projects/com/github/lemire/simdjson$ ./parse jsonexamples/jsoniter-scala/twitter_api_response.json 
number of bytes 15253 number of structural chars 1440 ratio 0.094
mem alloc instructions:       1346 cycles:       3010 (10.41 %) ins/cycles: 0.45 mis. branches:          1 (cycles/mis.branch 2886.06) cache accesses:        207 (failure          0)
 mem alloc runs at 0.20 cycles per input byte.
stage 1 instructions:      41378 cycles:      12579 (43.50 %) ins/cycles: 3.29 mis. branches:          1 (cycles/mis.branch 9256.61) cache accesses:        207 (failure          0)
 stage 1 runs at 0.82 cycles per input byte.
stage 2 instructions:      42454 cycles:      13330 (46.09 %) ins/cycles: 3.18 mis. branches:         14  (cycles/mis.branch 930.40)  cache accesses:        294 (failure          0)
 stage 2 runs at 0.87 cycles per input byte and 9.26 cycles per structural character.
 all stages: 1.90 cycles per input byte.
Min:  2.8586e-05 bytes read: 15253 Gigabytes/second: 0.533583
andriy@notebook:~/Projects/com/github/lemire/simdjson$ ./parse jsonexamples/twitter.json
number of bytes 631514 number of structural chars 55264 ratio 0.088
mem alloc instructions:       1191 cycles:        792 (0.07 %) ins/cycles: 1.50 mis. branches:          1 (cycles/mis.branch 616.13) cache accesses:      30067 (failure          0)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:    1825757 cycles:     575452 (51.51 %) ins/cycles: 3.17 mis. branches:       1506 (cycles/mis.branch 382.07) cache accesses:      30067 (failure         27)
 stage 1 runs at 0.91 cycles per input byte.
stage 2 instructions:    1639867 cycles:     540975 (48.42 %) ins/cycles: 3.03 mis. branches:       1406  (cycles/mis.branch 384.52)  cache accesses:      52092 (failure         50)
 stage 2 runs at 0.86 cycles per input byte and 9.79 cycles per structural character.
 all stages: 1.77 cycles per input byte.
Min:  0.000309023 bytes read: 631514 Gigabytes/second: 2.04358
andriy@notebook:~/Projects/com/github/lemire/simdjson$ ./parse jsonexamples/jsoniter-scala/twitter_api_response_10Mb.json 
number of bytes 9606023 number of structural chars 905350 ratio 0.094
mem alloc instructions:       1443 cycles:       7052 (0.04 %) ins/cycles: 0.20 mis. branches:         13 (cycles/mis.branch 526.28) cache accesses:     337396 (failure        250)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:   25956521 cycles:    8704923 (48.30 %) ins/cycles: 2.98 mis. branches:      10742 (cycles/mis.branch 810.34) cache accesses:     337396 (failure     237093)
 stage 1 runs at 0.91 cycles per input byte.
stage 2 instructions:   26562775 cycles:    9310078 (51.66 %) ins/cycles: 2.85 mis. branches:       7822  (cycles/mis.branch 1190.20)  cache accesses:     653145 (failure     415533)
 stage 2 runs at 0.97 cycles per input byte and 10.28 cycles per structural character.
 all stages: 1.88 cycles per input byte.
Min:  0.00619737 bytes read: 9606023 Gigabytes/second: 1.55002

#warning isn't portable, causing errors on MSVC (and others)

#warning is used to tell the end-user that AVX2 isn't available, however on all versions of Visual Studio #warning isn't available and will cause a cryptic error if building without AVX2 support. #warning isn't standard and isn't guaranteed to be available.

[cmake] add soversion to the resulting shared library

Adding a SOVERSION to the resulting shared object would help indicating API/ABI changes in this library. This would greatly benefit linux distribution packagers. Please consider adding the following patch to master:

diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index ea48953..dd7f2a4 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -32,7 +32,8 @@ if(NOT MSVC)
 ## We output the library at the root of the current directory where cmake is invoked
 ## This is handy but Visual Studio will happily ignore us
 set_target_properties(${SIMDJSON_LIB_NAME} PROPERTIES
-  LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
+  LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}
+  SOVERSION 0)
 MESSAGE( STATUS "Library output directory (does not apply to Visual Studio): " ${CMAKE_BINARY_DIR})
 endif()

write fast function that can check whether a document has been minified

That seems like a nice little complementary exercise.

Why not have a "thicker tape"?

Current code dedicates 24 bits to index the JSON document, using 8 bits to track the character.

We have 64-bit machines, why use a 32-bit tape? This would allow supporting enormous JSON documents and could even allow us to move up and down the tree instead of just down (or not).

(There might be good reasons for using a 32-bit tape... )

When maxdepth is exceeded, the program crashes (expectedly)

We need to do better.

benchmark against rust de facto standard lib: serde

https://github.com/serde-rs/json

package as a bona fide library

Currently, the library is only usable to run benchmarks. It does not build an actual library file that can be installed and integrated into other software.

Explore lazy materialization, for greater speed

The current code validates to a point values (nulls, and so forth) but does not materialize (store) the values.

This could be done. For example, use 64-bit per value, using some kind of union where the string case is a handled as a pointer to the in-situ string that you have unescaped.

You need to have some way to map the tape to the values. Or maybe you want to write the values directly on the tape, somehow.

What are sane and interesting benchmarks?

As far as research prototypes go, this could be rather limited, but if we are going to write benchmarks, we need some stable programming interface so that we can solve problems.

The Mison paper pretends to answer queries from a paper by Adabi...

SELECT DISTINCT “user.id” FROM tweets;

SELECT SUM(retweet count) FROM tweets GROUP BY “user.id”;

SELECT “user.id” FROM tweets t1, deletes d1, deletes d2 WHERE t1.id str = d1.“delete.status.id_str” AND d1.“delete.status.user_id” = d2.“delete.status.user_id” AND t1.“user.lang” = ‘msa’;

SELECT t1.“user.screen name”, t2.“user.screen name” FROM tweets t1, tweets t2, tweets t3 WHERE t1.“user.screen name”=t3.“user.screen name”AND t1.“user.screen name” = t2.in reply to screen name AND t2.“user.screen name” = t3.in reply to screen name;

I don't think they actually answer these queries. This requires an actual processing engine. I think that they do things like this...

Find all user/id values in tweets (where user/id is to be interpreted as a path)
Find all pairs user/id, retweet_count in tweets
Find all pairs user/id, user/lang in tweets
Find all pairs "in_reply_to_screen_name" and user/id

I am not sure I fully understand what the Mison paper tested, http://www.vldb.org/pvldb/vol10/p1118-li.pdf, though I am sure we can figure it out...

My guess as to what is a good and generic benchmark is to start with a JSON document and extract some kind of tabular data out of it. So turn the twitter JSON document into a table.

Add fuzz testing

I ran AFL and it almost immediately produces a segmentation fault.

Can be reproduced by using the benchmark parse:

#0  find_structural_bits (buf=buf@entry=0x55555558f100 "\n", len=len@entry=1, pj=...) at src/stage1_find_marks.cpp:437
#1  0x000055555555823f in find_structural_bits (pj=..., len=1, buf=0x55555558f100 "\n") at include/simdjson/stage1_find_marks.h:12
#2  main (argc=<optimized out>, argv=<optimized out>) at benchmark/parse.cpp:154

id:000000,sig:11,src:000000,op:havoc,rep:128.zip

Support AVX-512

This is currently a low priority but it seems worthwhile to ask whether AVX-512 helps, and by how much. Of course, the work should be completed on AVX2 first.

compare number conversion with state-of-the-art Google double conv

Google double-conv

When dumping json, the numbers are garbagy

Parsing numbers appear fine, but the json printing function prints garbage.

nodejs or/and webassembly binary ?

Is it plan to export as nodejs or/and webassembly binary ?

Improve accuracy of get_double

When I parse -65.619720000000029 value I get '-65.619720000000044' (other parsers return correct value for it)

Two headers have CRLF line terminators

This is a minor... Only two files (singleheader/simdjson.h and include/simdjson/common_defs.h) have CRLF line terminators (DOS).

The first one looks auto generated, so only include/simdjson/common_defs.h is problematic.

simdjson should use a namespace

Right now all functions and classes are defined in global namespace. This might potentially lead to clashes. Instead simdjson should put everything into its own namespace.

add tests for JSON minify function and write scalar equivalent

The copy_without_useless_spaces function should be tested. This is low priority but should be done.

A dedicated scalar minifier code should be written.

Error masks should probably not be needlessly converted to booleans

There are many checks to see if the error masks are non-zero, and then work on booleans. I am guessing that this is not generally a problem due to branch predictions, but it seems unwarranted.

Better errors in the high-level interface.

The high-level interface (such as json_parse) only return True or False, but can fail under numerous different conditions.

As someone building on top of simdjson, I either need to hijack stderr when calling these methods and search the string to figure out what really went wrong, or I need to re-implement them entirely.

It would be preferable to instead return an error code, using 1 for success and negative values for the various errors.

Do we want to allow the user to set limits on the maximum depth of nesting.

The JSON spec states...

An implementation may set limits on the maximum depth of nesting.

De facto, the current code goes up to 256... which is perfectly reasonable (it may even be a feature).

It seems that one benchmark the Mison paper uses is to parse all (but only) the "root-level" fields. Depending on your terminology that might be level 2 or 3.

In one email, Geoff points out that stage 3 and 4 could be greatly accelerated if we limited the depth. And this seems like a useful and attractive proposition. Suppose that you know, as the user, that you are not going to need to go very deep. Then it seems you could get much of the benefits of the Mison approach, simply by specifying the maximal depth.

Of course, you would not validate the lower levels, where they might be junk hiding... but the result would be well-defined, at least. (That is, you could define the result.)

Pick a software license

I usually default on a business-friendly APL 2.

benchmark against sajson

https://github.com/chadaustin/sajson

WebAssembly Compile Target

Hi there! I'm very limited in my WebAssembly and C++, C knowledge, however if this was something that could be run via WebAssembly and allow JavaScript application's and processes to yield similar (even if slightly degraded results), it would be pretty rad!

I'm not sure what work would be involved, so at the least, I'd love to request this as a feature so that tools running on NodeJS, or the Web can leverage this awesome piece of work.

Provide a streaming API

Without much effort, we could support streaming processing without materializing JSON documents objects as in-memory tapes.

Odd warning about freeing through stringview

`In file included from /home/geoff/git/simdjson/include/simdjson/common_defs.h:4:0,
from /home/geoff/git/simdjson/benchmark/parse.cpp:33:

In function ‘void aligned_free(void*)’,
inlined from ‘int main(int, char**)’ at /home/geoff/git/simdjson/benchmark/parse.cpp:262:15:

/home/geoff/git/simdjson/include/simdjson/portability.h:123:9: warning: attempt to free a non-heap object [-Wfree-nonheap-object]
free(memblock);
~~~~^~~~~~~~~~
`
gcc 7.2.0 doesn't like us trying to look through string_view to free our data. I don't understand this warning or what the analysis thinks it's doing.

"states" array is written to but never used

I don't understand the purpose of the "states" array, though I am sure it was explained to me. From the code, all we have is this...

  states[depth] = trans[states[depth]][c];

When is this used for anything?

Consider making this into a single-header library

At a minimum, amalgamation should be an option.

typo in readme, should be ryzen or zen

the architecture is called zen

consider fusing the string and main tapes

The intuition behind having two separate tapes is that it keeps the main tape tight where there are many large strings. This may or may not be an important consideration.

ape often returns false (with many/most JSON documents)

In ape, the following code seems to almost always return false:

    for (u32 i = 0; i < MAX_DEPTH; i++) {
        if (states[i] == 0) {
            return false;
        }
    }

numberparsingcheck crashes

Hi,

'make test' on my Fedora 29 is crashing while 'numberparsingcheck'.
https://gist.github.com/szydell/c2a9c01aadced506b1bfb16445a15bd1

Kernel: 4.20.10-200.fc29.x86_64
gcc-c++: Version 8.2.1, Architecture : x86_64

$ ldd numberparsingcheck
linux-vdso.so.1 (0x00007fff60ff7000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007ff7b42d5000)
libm.so.6 => /lib64/libm.so.6 (0x00007ff7b4151000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007ff7b4136000)
libc.so.6 => /lib64/libc.so.6 (0x00007ff7b3f70000)
/lib64/ld-linux-x86-64.so.2 (0x00007ff7b449c000)

CPU:
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
stepping : 10
microcode : 0x96
cpu MHz : 800.005
cache size : 9216 KB
physical id : 0
siblings : 12
core id : 5
cpu cores : 6
apicid : 11
initial apicid : 11
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf

Add clear error when AVX2 is not supported on the current arch

I tried running the benchmarks on my machine but got an error when running the parser. I cloned the repo, ran cmake ., make, then make test to get this result. See screenfetch at bottom for specs of my machine.

Test project /Users/speleo/Downloads/simdjson
    Start 1: jsoncheck
1/1 Test #1: jsoncheck ........................***Exception: Illegal  0.01 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   0.01 sec

The following tests FAILED:
	  1 - jsoncheck (ILLEGAL)
Errors while running CTest
make: *** [test] Error 8

                    'c.          [email protected]
                 ,xNMM.          -----------------
               .OMMMMo           OS: macOS 10.14 18A391 x86_64
               OMMM0,            Host: MacBookPro10,1
     .;loddo:' loolloddol;.      Kernel: 18.0.0
   cKMMMMMMMMMMNWMMMMMMMMMM0:    Uptime: 7 days, 12 hours, 19 mins
 .KMMMMMMMMMMMMMMMMMMMMMMMWd.    Packages: 273
 XMMMMMMMMMMMMMMMMMMMMMMMX.      Shell: zsh 5.3
;MMMMMMMMMMMMMMMMMMMMMMMM:       Resolution: 1440x900
:MMMMMMMMMMMMMMMMMMMMMMMM:       DE: Aqua
.MMMMMMMMMMMMMMMMMMMMMMMMX.      WM: Kwm
 kMMMMMMMMMMMMMMMMMMMMMMMMWd.    Terminal: iTerm2
 .XMMMMMMMMMMMMMMMMMMMMMMMMMMk   CPU: Intel i7-3635QM (8) @ 2.40GHz
  .XMMMMMMMMMMMMMMMMMMMMMMMMK.   GPU: Intel HD Graphics 4000, NVIDIA GeForce GT 650M
    kMMMMMMMMMMMMMMMMMMMMMMd     Memory: 3678MiB / 16384MiB
     ;KMMMMMMMWXXWMMMMMMMk.
       .cooc,.    .,coo:.

remove the using namespace std in the header only version

stage1_find_marks.cpp and stage2_build_tape.cpp both starts with a using namespace std line. which are agglomerated for the header-only version.

Since the code is not namespaced, they while be visible in the unit including them, wich can be a problem.

easy fix:

namespace simdjson
{
#include <simdjson.h>
#include <simdjson.cpp>
}

Code

https://github.com/hanxu317317/city_pickers/blob/65478dd09823862607d56ee72d34b8892ddfd2e1/lib/modal/result.dart#L9

Support null characters in strings

There seem to be some strong opinions out there that NUL chars in JSON strings should be supported. This would suggest that we maintain strings by length - possibly keep offset+length on the tape? Or start each string with a length field.

Replace usage of string_view by a custom string class

From the README:

std::string_view p = get_corpus(filename);
...
// You can safely delete the string content
free((void*)p.data());

This is a misuse of string_view's semantics, and runs afoul of various rules that will be flagged by static analyzers, and we don't want students seeing this and copying it without understanding that it's not recommended (string_view is not just a blanket replacement for char*!).

For the sake of the example, it'd probably be better to just use an idiomatic std::string and illustrate that the parsed document can still be used after that string goes out of scope.

Build a "normalize/stringify/minify" JSON function

A useful task is to trim all useless spaces from a JSON document. This is intrinsically useful. There is an argument that this should not be necessary, but I think it is bogus. In many applications, you can't expect the JSON to be minified.

This could be a problem almost orthogonal to JSON parsing.

Am I correct in thinking that your previous approach to structural elements was better suited to this task (the one with clmul in it)?

My current thinking is that it might offer one concrete test for this work, and even if it does not go through all stages, that would still be an interesting test.

Support Unsigned 64-bit integer

As I know number field in JSON is not limited by the number of digits,
but parser is limited by their own language specification.

In terms of the datatype, uint64 is frequently used for identifier, hash value...
and it is natural that many of Cpp based project uses unsigned long.

My opinion is that it would be great if the parser supports both uint64 and int64.

I am really impressed with this artwork. thanks a lot =)

For large files, the tapes collide

The current tape is just one giant tape with arbitrary segments. The segments can overwrite themselves.

We can do better.

To reproduce:

./parse -d scripts/javascript/large.json |grep size

Example of output:

 tape section i 32   (START)  from: 4161536 to: 4161538  size: 2
 tape section i 33  (NORMAL)  from: 4291584 to: 5291566  size: 999982
 tape section i 34  (NORMAL)  from: 4421632 to: 13421452  size: 8999820
 tape section i 35  (NORMAL)  from: 4551680 to: 15051470  size: 10499790
 tape section i 36  (NORMAL)  from: 4681728 to: 7681668  size: 2999940

This is scary bad. 👎