
Comments (7)

lemire avatar lemire commented on May 4, 2024

The Mison paper claims to extract "top-level fields". I don't know what a top-level field is... but I am pretty sure it is a bit like my idea of extracting tabular data out of a JSON document.

from simdjson.

lemire avatar lemire commented on May 4, 2024

We'd like some level of abstraction so we don't work directly on the tape... maybe a zero-cost abstraction.


lemire avatar lemire commented on May 4, 2024

Looks like people are happy to just measure the "parsing time": https://github.com/chadaustin/sajson/blob/master/benchmark/benchmark.cpp

That's not entirely satisfying, though we can play this game as well...


geofflangdale avatar geofflangdale commented on May 4, 2024

The 'parsing time' is not entirely satisfactory as a benchmark outcome in situations where we want to do something like the Mison use case. However, I'm not entirely convinced at this stage that this partial parsing use case is well-defined. The fundamental flaw of papers analysing such queries is that they tend towards the "ask yourself a question and answer it" school of thought. Parsing everything down to the level that the zero-cost abstraction you describe can traverse it is a known outcome.

It might be a bit churlish, but one approach is to try to achieve a full parse faster than Mison (or similar systems) can answer queries; this isn't a sustainable position to hold in the long run, but might be briefly amusing.

I agree that working directly with the tapes is awkward, but once number and string parsing are properly in place I think you'll find that post-stage 4 they are largely a friendly data structure fundamentally, even if they cosmetically resemble the droppings of an ape.

They don't contain 'luxury' pointers (i.e. 'up' pointers) but I think it's possible to do most traversals on them. I think a shallow layer on top of them (the "zero-cost abstraction" idea) should do the trick.


lemire avatar lemire commented on May 4, 2024

> Parsing everything down to the level that the zero-cost abstraction you describe can traverse it is a known outcome.

Fair enough.

> I agree that working directly with the tapes is awkward, but once number and string parsing are properly in place I think you'll find that post-stage 4 they are largely a friendly data structure fundamentally, even if they cosmetically resemble the droppings of an ape.

I agree that it is a nice data structure for this problem. My thinking is more that we need some convenience functions so that the queries don't unnecessarily depend on the underlying data structure. I imagine that something like 20 or 30 lines of code would be enough to have a decent wrapper so that a programmer with little knowledge of the tapes can be productive.
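To make the "20 or 30 lines" wrapper idea concrete, here is a minimal sketch of what such a layer could look like. Everything in it is an assumption made for illustration: the tape layout (type tag in the top 8 bits, payload in the low 56 bits, scope openers pointing just past their matching close, small integers stored inline under the tag 'l'), and the names `make_word`, `TapeCursor`, and `count_children` are hypothetical and not simdjson's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tape layout (an assumption of this sketch, not the real format):
// each 64-bit word holds a type tag in the top 8 bits and a payload in the low
// 56 bits. For '{' and '[' the payload is the index just past the matching
// close, so a whole subtree can be skipped in one step.
constexpr std::uint64_t kPayloadMask = 0x00FFFFFFFFFFFFFFULL;

constexpr std::uint64_t make_word(char type, std::uint64_t payload) {
  return (static_cast<std::uint64_t>(static_cast<unsigned char>(type)) << 56) |
         (payload & kPayloadMask);
}

// Thin, non-owning view over one tape position. All methods are small enough
// to inline, which is what keeps the abstraction close to zero-cost.
class TapeCursor {
public:
  explicit TapeCursor(const std::vector<std::uint64_t> &tape, std::size_t pos = 0)
      : tape_(&tape), pos_(pos) {}

  char type() const { return static_cast<char>((*tape_)[pos_] >> 56); }
  std::uint64_t payload() const { return (*tape_)[pos_] & kPayloadMask; }
  bool is_scope() const { return type() == '{' || type() == '['; }
  bool at_close() const { return type() == '}' || type() == ']'; }

  // Step into the first child of an object or array.
  TapeCursor down() const {
    assert(is_scope());
    return TapeCursor(*tape_, pos_ + 1);
  }

  // Move to the next sibling, jumping over a nested scope if there is one.
  TapeCursor next() const {
    return TapeCursor(*tape_, is_scope() ? static_cast<std::size_t>(payload())
                                         : pos_ + 1);
  }

private:
  const std::vector<std::uint64_t> *tape_;
  std::size_t pos_;
};

// Example query written only against the wrapper: count direct children.
inline std::size_t count_children(TapeCursor scope) {
  std::size_t n = 0;
  for (TapeCursor c = scope.down(); !c.at_close(); c = c.next()) ++n;
  return n;
}
```

A query such as `count_children` never touches the raw words, so the tape layout could change without rewriting the query. Note that there is no "up" operation: callers that need the parent have to keep the earlier cursor around.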


geofflangdale avatar geofflangdale commented on May 4, 2024

Completely agreed. Definitely well worth doing. I had deferred thinking about this because the structure of the tapes has been constantly changing, but it's worth getting it done. Especially because in my experience it's only when you actually try to write real code that you discover that it's harder than you thought!


lemire avatar lemire commented on May 4, 2024

The following has been implemented:

SELECT DISTINCT "user.id" FROM tweets;

We now have a prototypical API so I am going to close this.
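For readers unfamiliar with the query, the sketch below shows only what it computes, not how the prototype computes it. It assumes the tweets have already been extracted into plain structs; `User`, `Tweet`, and `distinct_user_ids` are hypothetical names chosen for this illustration and do not come from the library.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical, already-extracted records: only the field the query reads.
struct User {
  std::int64_t id;
};
struct Tweet {
  User user;
};

// Equivalent of: SELECT DISTINCT "user.id" FROM tweets;
// std::set drops duplicates; the sorted order is a side effect of the
// container, not something the query itself asks for.
inline std::set<std::int64_t> distinct_user_ids(const std::vector<Tweet> &tweets) {
  std::set<std::int64_t> ids;
  for (const Tweet &t : tweets) {
    ids.insert(t.user.id);
  }
  return ids;
}
```

The point of the benchmark discussed above is that the interesting cost is in getting from raw JSON bytes to the `user.id` values; the deduplication step itself is cheap by comparison.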

