saka1 / simdjson_ruby
Ruby bindings for simdjson
License: MIT License
I've run into a problem where a UTF-8 encoded string is parsed by Simdjson.parse
and one of the resulting strings is encoded in ASCII-8BIT. I can reproduce this like so:
# run with ruby --encoding=UTF-8 if UTF-8 isn't your system default.
x = '{"m":" – "}' # note the non-ASCII character in the value
p x.encoding # => #<Encoding:UTF-8>
y = Simdjson.parse(x)
p y['m'].encoding # => #<Encoding:ASCII-8BIT>
It seems like the encoding of the output strings should remain the same as the encoding of the input string, right? I'm not sure if this is an issue that belongs here or in the main simdjson repository but I appreciate you taking a look either way.
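Until the extension tags its output strings itself, one possible workaround on the caller's side is to re-tag the parsed strings as UTF-8. The sketch below is only an illustration: `utf8_strings` is a hypothetical helper, not part of simdjson_ruby, and it assumes the bytes are already valid UTF-8 (JSON text normally is) and only carry the wrong encoding label.

```ruby
# Hypothetical helper: walk a parsed JSON structure and re-tag every
# binary (ASCII-8BIT) string as UTF-8. The bytes are not changed,
# only the encoding label on a copy of each string.
def utf8_strings(obj)
  case obj
  when String
    if obj.encoding == Encoding::BINARY
      obj.dup.force_encoding(Encoding::UTF_8)
    else
      obj
    end
  when Array
    obj.map { |v| utf8_strings(v) }
  when Hash
    obj.each_with_object({}) do |(k, v), out|
      out[utf8_strings(k)] = utf8_strings(v)
    end
  else
    obj # numbers, true, false, nil
  end
end

# Usage, with a hand-built stand-in for what Simdjson.parse returned above:
parsed = { "m" => " \u2013 ".b }
fixed = utf8_strings(parsed)
puts fixed["m"].encoding
```

This costs an extra pass over the result, so it is a stopgap rather than a fix.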
What is the current status of the package?
Highlights
New features:
Performance:
System support:
The current implementation has two steps:
According to the roadmap (simdjson/simdjson#997), a SAJ API will be available someday.
With SAJ we could bypass the tape construction, which should improve efficiency.
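To make the idea concrete, here is a rough sketch of how an event-driven (SAJ/SAX-style) parser could build Ruby objects directly as events arrive, with no intermediate tape. Everything in it is hypothetical: the class name `DirectBuilder` and the event names (`start_object`, `start_array`, `key`, `value`, `end_container`) are made up for illustration and are not simdjson's API.

```ruby
# Hypothetical event-driven builder. A parser would call these methods
# as it scans the input; the Ruby object graph is assembled on the fly.
class DirectBuilder
  attr_reader :result

  def initialize
    @stack = []        # open containers, innermost last
    @pending_keys = [] # pending key per open container (nil for arrays)
    @result = nil
  end

  def start_object
    open_container({})
  end

  def start_array
    open_container([])
  end

  def key(name)
    @pending_keys[-1] = name
  end

  def value(v)
    add(v)
  end

  def end_container
    @stack.pop
    @pending_keys.pop
  end

  private

  # Attach the new container to its parent, then make it the current one.
  def open_container(container)
    add(container)
    @stack.push(container)
    @pending_keys.push(nil)
  end

  def add(v)
    parent = @stack.last
    case parent
    when nil   then @result = v
    when Hash  then parent[@pending_keys.last] = v
    when Array then parent << v
    end
  end
end

# Events a parser might emit for {"m": [1, {"k": "v"}]}:
b = DirectBuilder.new
b.start_object
b.key("m")
b.start_array
b.value(1)
b.start_object
b.key("k")
b.value("v")
b.end_container
b.end_container
b.end_container
p b.result
```

In the real extension this logic would live in C++ next to the parser; the Ruby version only shows the shape of the approach.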
I was looking to contribute to the gem, but it blows up after cloning when trying to get up and running... A bit of docs on dev setup or what might be missing to build from scratch would be helpful.
bundle exec rake
Copy singleheader files to ext/simdjson...
rake aborted!
Errno::ENOENT: No such file or directory @ rb_sysopen - /Users/danmayer/projects/simdjson_ruby/vendor/simdjson/singleheader/simdjson.h
/Users/danmayer/projects/simdjson_ruby/Rakefile:22:in `block in <top (required)>'
/Users/danmayer/.rvm/gems/ruby-2.6.2/gems/rake-13.0.1/exe/rake:27:in `<top (required)>'
/Users/danmayer/.rvm/gems/ruby-2.6.2/bin/ruby_executable_hooks:24:in `eval'
/Users/danmayer/.rvm/gems/ruby-2.6.2/bin/ruby_executable_hooks:24:in `<main>'
Tasks: TOP => default => compile => before_compile
(See full trace by running task with --trace)
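The trace shows that the `before_compile` task expects `vendor/simdjson/singleheader/simdjson.h` to exist before it copies files into `ext/simdjson`. A small preflight check like the one below can confirm whether the vendored sources are present in a fresh clone. This is only a sketch: `missing_vendor_files` is a hypothetical helper, the file list contains just the one path named in the error, and the hint about git submodules assumes `vendor/simdjson` is a submodule, which the trace does not confirm.

```ruby
# Hypothetical preflight check: report which vendored files the build
# needs but cannot find. The default list holds only the path that
# appears in the error message above.
def missing_vendor_files(root, required = ["vendor/simdjson/singleheader/simdjson.h"])
  required.reject { |rel| File.exist?(File.join(root, rel)) }
end

missing = missing_vendor_files(Dir.pwd)
if missing.empty?
  puts "vendored simdjson sources found"
else
  puts "missing: #{missing.join(', ')}"
  # Assumption: vendor/simdjson is a git submodule.
  puts "if vendor/simdjson is a git submodule, try: git submodule update --init --recursive"
end
```

Run it from the repository root before `bundle exec rake`.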
I recently wrote up a post about how much faster simdjson_ruby can be than Oj and other options.
While that is all true, the benchmark falls apart when you need symbol keys because upstream code expects them. I was looking to add support for creating symbols while building up the hash, rather than having to convert the keys afterwards.
My benchmark updated to have symbolized keys for all implementations...
require 'benchmark/ips'
require 'json'
require 'oj'
require 'simdjson'
require 'memory_profiler' # not used in this excerpt
require 'rails'           # provides Hash#deep_symbolize_keys via ActiveSupport

json = File.read("./json_data.json")

# Sanity check: all three parsers must produce the same symbol-keyed result.
puts "ensure these match"
oj_result   = Oj.load(json.dup, symbol_keys: true)
simd_result = Simdjson.parse(json.dup).deep_symbolize_keys
json_result = JSON.parse(json.dup, symbolize_names: true)
puts oj_result == simd_result && simd_result == json_result

Benchmark.ips do |x|
  x.config(:time => 15, :warmup => 3)
  # Oj and stdlib JSON symbolize keys while parsing; simdjson needs a second pass.
  x.report("oj parse") { Oj.load(json.dup, symbol_keys: true) }
  x.report("simdjson parse") { Simdjson.parse(json.dup).deep_symbolize_keys }
  x.report("stdlib JSON parse") { JSON.parse(json.dup, symbolize_names: true) }
  x.compare!
end
The resulting output shows that all the perf improvements of the parser are lost to having to do a second pass for symbolizing, at least in the case of large JSON files.
ensure these match
true
Warming up --------------------------------------
oj parse 101.000 i/100ms
simdjson parse 44.000 i/100ms
stdlib JSON parse 58.000 i/100ms
Calculating -------------------------------------
oj parse 1.016k (± 4.9%) i/s - 15.251k in 15.051368s
simdjson parse 420.256 (± 6.7%) i/s - 6.292k in 15.052436s
stdlib JSON parse 503.879 (±11.1%) i/s - 7.482k in 15.037979s
Comparison:
oj parse: 1016.2 i/s
stdlib JSON parse: 503.9 i/s - 2.02x (± 0.00) slower
simdjson parse: 420.3 i/s - 2.42x (± 0.00) slower
I haven't written a C extension for Ruby in years, but I'm happy to help if I can get the full build/test cycle working. Or, if this all makes sense to you, I'm happy to review a PR if you think this is a good idea and know how to tackle it.
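For reference, the second pass that costs simdjson its lead looks roughly like the plain-Ruby function below. It is a simplified stand-in for ActiveSupport's `deep_symbolize_keys`, written here only for illustration (`deep_symbolize` is a made-up name). It rebuilds every hash and array in the result, which is the work a native symbol-keys option in the extension would avoid.

```ruby
# Simplified stand-in for a deep key-symbolizing pass: rebuild every
# Hash with Symbol keys and recurse into nested arrays and hashes.
def deep_symbolize(obj)
  case obj
  when Hash
    obj.each_with_object({}) do |(k, v), out|
      out[k.to_sym] = deep_symbolize(v)
    end
  when Array
    obj.map { |v| deep_symbolize(v) }
  else
    obj # scalars are returned unchanged
  end
end

# Usage with a hand-built stand-in for a parsed document:
parsed = { "user" => { "tags" => [{ "name" => "a" }] } }
p deep_symbolize(parsed)
```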