Git Product home page Git Product logo

Ubuntu 20.04 CI Fuzzing Status

Doxygen Documentation

simdjson : Parsing gigabytes of JSON per second

JSON is everywhere on the Internet. Servers spend a *lot* of time parsing it. We need a fresh approach. The simdjson library uses commonly available SIMD instructions and microparallel algorithms to parse JSON 4x faster than RapidJSON and 25x faster than JSON for Modern C++.
  • Fast: Over 4x faster than commonly used production-grade JSON parsers.
  • Record Breaking Features: Minify JSON at 6 GB/s, validate UTF-8 at 13 GB/s, NDJSON at 3.5 GB/s.
  • Easy: First-class, easy to use and carefully documented APIs.
  • Strict: Full JSON and UTF-8 validation, lossless parsing. Performance with no compromises.
  • Automatic: Selects a CPU-tailored parser at runtime. No configuration needed.
  • Reliable: From memory allocation to error handling, simdjson's design avoids surprises.
  • Peer Reviewed: Our research appears in venues like VLDB Journal, Software: Practice and Experience.

This library is part of the Awesome Modern C++ list.

Table of Contents

Real-world usage

If you are planning to use simdjson in a product, please work from one of our releases.

Quick Start

The simdjson library is easily consumable with a single .h and .cpp file.

  1. Prerequisites: g++ (version 7 or better) or clang++ (version 6 or better), and a 64-bit system with a command-line shell (e.g., Linux, macOS, freeBSD). We also support programming environments like Visual Studio and Xcode, but different steps are needed. Users of clang++ may need to specify the C++ version (e.g., c++ -std=c++17) since clang++ tends to default on C++98.

  2. Pull simdjson.h and simdjson.cpp into a directory, along with the sample file twitter.json. You can download them with the wget utility:

    wget https://raw.githubusercontent.com/simdjson/simdjson/master/singleheader/simdjson.h https://raw.githubusercontent.com/simdjson/simdjson/master/singleheader/simdjson.cpp https://raw.githubusercontent.com/simdjson/simdjson/master/jsonexamples/twitter.json
    
  3. Create quickstart.cpp:

#include <iostream>
#include "simdjson.h"
using namespace simdjson;
int main(void) {
    ondemand::parser parser;
    padded_string json = padded_string::load("twitter.json");
    ondemand::document tweets = parser.iterate(json);
    std::cout << uint64_t(tweets["search_metadata"]["count"]) << " results." << std::endl;
}
  1. c++ -o quickstart quickstart.cpp simdjson.cpp
  2. ./quickstart
 100 results.

Documentation

Usage documentation is available:

  • Basics is an overview of how to use simdjson and its APIs.
  • Performance shows some more advanced scenarios and how to tune for them.
  • Implementation Selection describes runtime CPU detection and how you can work with it.
  • API contains the automatically generated API documentation.

Godbolt

Some users may want to browse code along with the compiled assembly. You want to check out the following lists of examples:

Performance results

The simdjson library uses three-quarters less instructions than state-of-the-art parser RapidJSON. To our knowledge, simdjson is the first fully-validating JSON parser to run at gigabytes per second (GB/s) on commodity processors. It can parse millions of JSON documents per second on a single core.

The following figure represents parsing speed in GB/s for parsing various files on an Intel Skylake processor (3.4 GHz) using the GNU GCC 10 compiler (with the -O3 flag). We compare against the best and fastest C++ libraries on benchmarks that load and process the data. The simdjson library offers full unicode (UTF-8) validation and exact number parsing.

The simdjson library offers high speed whether it processes tiny files (e.g., 300 bytes) or larger files (e.g., 3MB). The following plot presents parsing speed for synthetic files over various sizes generated with a script on a 3.4 GHz Skylake processor (GNU GCC 9, -O3).

All our experiments are reproducible.

For NDJSON files, we can exceed 3 GB/s with our multithreaded parsing functions.

Bindings and Ports of simdjson

We distinguish between "bindings" (which just wrap the C++ code) and a port to another programming language (which reimplements everything).

About simdjson

The simdjson library takes advantage of modern microarchitectures, parallelizing with SIMD vector instructions, reducing branch misprediction, and reducing data dependency to take advantage of each CPU's multiple execution cores.

Our default front-end is called On Demand, and we wrote a paper about it:

Some people enjoy reading the first (2019) simdjson paper: A description of the design and implementation of simdjson is in our research article:

We have an in-depth paper focused on the UTF-8 validation:

We also have an informal blog post providing some background and context.

For the video inclined,
simdjson at QCon San Francisco 2019
(It was the best voted talk, we're kinda proud of it.)

Funding

The work is supported by the Natural Sciences and Engineering Research Council of Canada under grant number RGPIN-2017-03910.

Contributing to simdjson

Head over to CONTRIBUTING.md for information on contributing to simdjson, and HACKING.md for information on source, building, and architecture/design.

License

This code is made available under the Apache License 2.0.

Under Windows, we build some tools using the windows/dirent_portable.h file (which is outside our library code): it is under the liberal (business-friendly) MIT license.

For compilers that do not support C++17, we bundle the string-view library which is published under the Boost license. Like the Apache license, the Boost license is a permissive license allowing commercial redistribution.

For efficient number serialization, we bundle Florian Loitsch's implementation of the Grisu2 algorithm for binary to decimal floating-point numbers. The implementation was slightly modified by JSON for Modern C++ library. Both Florian Loitsch's implementation and JSON for Modern C++ are provided under the MIT license.

For runtime dispatching, we use some code from the PyTorch project licensed under 3-clause BSD.

simdjson's Projects

cmakedemo icon cmakedemo

Simple example on how to use simdjson with CMake using a git submodule

debian-ppa icon debian-ppa

Personal Package Archive repository for software used in the simdjson organization

simdjson icon simdjson

Parsing gigabytes of JSON per second : used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks

simdjson-java icon simdjson-java

A Java version of simdjson, a high-performance JSON parser utilizing SIMD instructions

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.