rilldata / rill

Rill is a tool for effortlessly transforming data sets into powerful, opinionated dashboards using SQL. BI-as-code.

License: Apache License 2.0

JavaScript 0.37% HTML 0.57% Svelte 22.23% CSS 0.29% TypeScript 34.51% Shell 0.31% Dockerfile 0.02% Makefile 0.03% Go 41.68%
duckdb svelte sveltejs sveltekit dataviz csv parquet parquet-tools golang s3

rill's People

Contributors

adityahegde, akshayjain3450, aljazerzen, andrewrtsao, bcolloran, begelundmuller, briangregoryholmes, cohenscottr, cwarden, djbarnwal, domoritz, egor-ryashin, ericpgreen2, esevastyanov, godzig, hamilton, himadrisingh, k-anshul, kaspersjo, katie-rilldata, kavinjsir, larcedis, magorlick, medriscoll, mindspank, nishantmonu51, pjain1, rakeshsharma14317, rohithreddykota, skokenes

rill's Issues

make a cardiplora but for parquet files

Nishant showed me cardiplora, which was an internal tool used to assess the contribution of a column to a rollup dataset's size. He mentioned that it could take a while to calculate for a million row dataset.

  • cardinality of all other dims – he said this was calculated with a GROUP BY over all dimension columns other than the target column
  • impact – the total additional cardinality this column imposes on the dataset
  • impact / total cardinality – what percentage of the total cardinality does this column contribute? (see the sketch after this list)
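A minimal sketch of how these three numbers could be computed, assuming the duckdb node bindings; the table and column handling is illustrative, not the internal tool's implementation:

import duckdb from "duckdb";

const db = new duckdb.Database(":memory:");

// count the distinct combinations of a set of columns
function countDistinct(table: string, columns: string[]): Promise<number> {
  if (columns.length === 0) return Promise.resolve(1); // no other dims -> a single group
  const sql = `SELECT COUNT(*) AS c FROM (SELECT DISTINCT ${columns.join(", ")} FROM ${table})`;
  return new Promise((resolve, reject) =>
    db.all(sql, (err, rows) => (err ? reject(err) : resolve(Number(rows[0].c))))
  );
}

export async function columnImpact(table: string, dimensions: string[], target: string) {
  const others = dimensions.filter((d) => d !== target);
  const otherCardinality = await countDistinct(table, others);    // cardinality of all other dims
  const totalCardinality = await countDistinct(table, dimensions);
  const impact = totalCardinality - otherCardinality;             // added cardinality this column imposes
  return { otherCardinality, impact, impactShare: impact / totalCardinality };
}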

[meta] Data Modeler Concern Areas

These concerns are from a meeting with Nishant and Mike on 2022-01-07, along with my own ideation about how we might tackle these areas.

concepts and terminology

staging, sources, & perimeters

  • supported formats: for April, we will provide support for ndjson, csv, parquet, and possibly Kafka.
  • concept flow: #52
  • ingestion: #39
  • remote sources: #53

performance concerns

Can duckdb do the kinds of transforms we want it to do?
Can we actually create a smooth-feeling experience w/ duckdb as it is now?

subquery inspection

writing queries is hard. How do I inspect my CTEs as I'm going so that I know what I'm doing wrong?

fast query execution

we want queries to feel like they're being inspected, not run. To do this we need preview result sets to return as quickly as possible.

clear, expected, and achievable outputs

I think that we need to sharpen what we have in mind for the data modeler output.

dialect / engine support

what engines and dialects do we need to support? Do we need to natively run an engine to make this a reality, or can we transpile a query from one SQL dialect to duckdb?

export artifacts

what artifacts do we actually want to export out of the data modeler? What is the target here?

  • ingestion spec for druid?
  • are we modeling things into the kafka sink?

relationship between data modeler and cloud

  • when do we pair the data modeler with orchestration?

pipeline

when do we prioritize pipeline buildouts?

stateful vs stateless

  • stateful transforms are expensive. What metrics should we show?
  • vcpu calculations

extracting and inspecting a subquery

goal

we want to be able to run individual subqueries and inspect them as if they were their own datasets.

Methods investigated

AST parser to inspect the tables

Utilize a pre-existing SQL AST parser of some kind to decompose a query and get source-table information. Almost all the AST parsers fall short: many of them support postgres syntax, but they tend to fail on duckdb-specific syntax (which we'd want to support).

In theory, if we could get the parse tree from duckdb, we could easily inspect and reconstruct anything we need.

duckdb profiler

Get the core source tables for any complex query via scan defs.

the duckdb profiler does give information about sequential and parquet scans. While the sequential scans show the table names, the parquet scans do not. I filed this issue to add the source file name to the parquet scans.

Ultimately the profiler does not give us the additional information we need, so this is unfortunately a dead end.

encouraging dbt-core style templating

We could skip subquery parsing altogether and simply encourage a number of models instead of CTEs. While this would enable each query to be inspected in the traditional way, it breaks the model of dbt (and of CTEs).

hacked-together subquery parser

I've written a subquery extractor that takes in the query & the cursor location. If the location is on a viable subquery, then it returns the subquery as well as the index bounds of the subquery.

I've also written a CTE extractor. In theory, one could reconstruct everything a subquery needs by scanning for all subqueries & the FROM statement of the selected query, then pulling in the CTEs it uses. This would be enough to reconstruct a viable query that can be run to inspect the resultset of the subquery.

The final piece missing to reconstruct subquery inspection would be to extract the source table from the FROM statement, reverse-engineer the tables or CTEs that are referenced, and then construct a query w/ CTEs that would return the result of that subquery.
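A minimal sketch of the cursor-based extraction step described above (a simplified stand-in, not the actual extractor in this repo; it matches parentheses and ignores string literals):

interface SubqueryMatch {
  text: string;
  start: number; // index just after the opening paren
  end: number;   // index of the closing paren
}

export function extractSubqueryAtCursor(query: string, cursor: number): SubqueryMatch | null {
  // collect every parenthesized span that contains the cursor
  const stack: number[] = [];
  const spans: Array<[number, number]> = [];
  for (let i = 0; i < query.length; i++) {
    if (query[i] === "(") stack.push(i);
    else if (query[i] === ")") {
      const open = stack.pop();
      if (open !== undefined && open < cursor && cursor <= i) spans.push([open, i]);
    }
  }
  // prefer the innermost span that actually looks like a subquery
  spans.sort((a, b) => (a[1] - a[0]) - (b[1] - b[0]));
  for (const [open, close] of spans) {
    const body = query.slice(open + 1, close).trim();
    if (/^select\b/i.test(body)) return { text: body, start: open + 1, end: close };
  }
  return null;
}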

apache calcite

Nishant recommended looking into Apache Calcite. https://calcite.apache.org/docs/

improve the caching checks for source datasets

DM works primarily as demoware, especially our source dataset scanning feature.

To make this more robust, we'll need to make sure the modeler doesn't rescan source files every time there's a new websocket connection. It will also need to skip fields of a dataset that have already been analyzed.
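One way to do this, as a minimal sketch (the function names and cache shape are assumptions, not the repo's actual API): key a profile cache on the file path and modification time, and only recompute when the file has changed.

import { statSync } from "fs";

const profileCache = new Map<string, { mtimeMs: number; profile: unknown }>();

export function profileSource(path: string, computeProfile: (path: string) => unknown): unknown {
  const mtimeMs = statSync(path).mtimeMs;
  const cached = profileCache.get(path);
  if (cached && cached.mtimeMs === mtimeMs) {
    // unchanged since the last scan (e.g. a new websocket connection): reuse the profile
    return cached.profile;
  }
  const profile = computeProfile(path);
  profileCache.set(path, { mtimeMs, profile });
  return profile;
}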

[meta] backend problems to solve

  • how do we want to handle adding new sources? For now, we'll just scan the root dir for all parquet files, but we will need to do something different in the future.
  • how do we connect the queries directly to DBT?

switch to tailwind

As much as I love hand-writing css, not everyone agrees with this approach. Moving to Tailwind will probably ensure greater productivity for others and will undoubtedly help standardize my prototype styling.

migrate to websockets

We will want to be doing some background processing down the line to improve the user experience. Given that, and given that this will be a pretty localhost-centric experience going forward, it makes the most sense to switch to web sockets. This will enable us to do a few things:

  1. capture state server-side, which enables us to perform background operations on the app state (sketched after this list)
  2. make the frontend a bit more resilient to change, since the backend is the source of truth
  3. remove the need for localhost to maintain the core app state
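A minimal sketch of the server-side state loop, assuming the ws package (the message shapes and action names are illustrative, not the repo's actual protocol):

import { WebSocketServer, WebSocket } from "ws";

type AppState = { queries: Array<{ id: string; sql: string }> };
let state: AppState = { queries: [] };

const wss = new WebSocketServer({ port: 8081 });

function broadcast() {
  const payload = JSON.stringify({ type: "app-state", state });
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) client.send(payload);
  }
}

wss.on("connection", (socket) => {
  // a new client immediately receives the current state
  socket.send(JSON.stringify({ type: "app-state", state }));
  socket.on("message", (raw) => {
    const action = JSON.parse(raw.toString());
    if (action.type === "update-query") {
      state.queries = state.queries.map((q) =>
        q.id === action.id ? { ...q, sql: action.sql } : q
      );
      broadcast(); // every state change is pushed back out, so the frontend is just a view
    }
  });
});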

implement type information into the server

The most crucial place for type information and documentation right now is the server. Given that the frontend store is just a wrapper over the server store, a single interface that describes the structure would suffice.
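As a minimal sketch, with field names that are assumptions inferred from the other issues here rather than the actual store shape, the shared interface might look like:

export interface ColumnProfile {
  name: string;
  type: string;
  cardinality?: number;
}

export interface SourceDataset {
  path: string; // e.g. a parquet or csv file on disk
  profile?: ColumnProfile[];
}

export interface Query {
  id: string;
  sql: string;
  error?: string;
}

export interface AppState {
  sources: SourceDataset[];
  queries: Query[];
}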

migrate better-sql to only operate on parquet files

Now that duckdb/duckdb#2818 is merged, we can migrate entirely to parquet input files. The solution should look like:

  • parquet file in
  • constructed duckdb db as a staging db
  • parquet file out

Before doing all that, let's start just by running queries against parquet files. Does this work well?
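A minimal sketch of that flow with the duckdb node bindings (file names are placeholders):

import duckdb from "duckdb";

const db = new duckdb.Database(":memory:");

// parquet file in: load into a staging table inside duckdb
db.run(`CREATE TABLE staging AS SELECT * FROM 'input.parquet'`, (err) => {
  if (err) throw err;
  // ...transforms against the staging table would run here...
  // parquet file out: write the staged table back to a parquet file
  db.run(`COPY staging TO 'output.parquet' (FORMAT PARQUET)`, (copyErr) => {
    if (copyErr) throw copyErr;
    console.log("wrote output.parquet");
  });
});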

December cleanup

Going to spend this morning cleaning up the repository.

  • typescript is supported but not really implemented
  • the app state is fractured. We need to handle the inspector in the store, and it should probably mostly live on the server & run remotely
  • we should bring in tailwind to simplify app development

bringing parquet & csv files into the perimeter

goal

  1. Figure out the processing time required to bring a (parquet, csv) file into a duckdb perimeter
  2. Look at the time savings in running a standard top-k against the resulting table (vs. the original parquet / csv file)

Filing this issue to capture a conversation with Nishant and Mike D on 2022-01-07.

Currently, the data modeler prototype scans its root directory for the presence of parquet files. Users can then write queries directly against these parquet files as part of the modeling process.

We know this is not the long-term solution. Ultimately, converting source files to duckdb tables directly has a number of benefits.

improved performance – duckdb's columnar engine does a great job of optimizing query plans by producing indices on all the available columns. Even in the case of calculating a top-k query against a parquet file, duckdb's optimizations reduce the runtime by about half.

consistency of experience – even though a user can directly query a parquet file from the data modeler, similar performance & functionality does not exist with other more common data formats such as CSV. We can, for instance, directly query a CSV to calculate the top-k of a field, but for a 5 million row dataset, this takes approximately 10 seconds to run (still pretty good compared to writing a script to calculate this against the CSV directly). Loading into duckdb, however, will take ~ 10 seconds, but after that, similar queries will drop to a low-latency level.

There are a few ways we might do this:

  1. manually adding a data file to duckdb, e.g. dropping it into the window and letting duckdb do the rest
  2. scanning a chosen directory for source files and automatically ingesting these files into duckdb

The point of this issue is to assess the general viability of ingesting from a performance perspective.

method

I wrote a node script that hits the duckdb engine to run these queries.
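A rough reconstruction of that script (not the original; the duckdb node bindings are assumed and the top-k column is a placeholder):

import duckdb from "duckdb";

const db = new duckdb.Database(":memory:");

function run(label: string, sql: string): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    console.time(label); // prints e.g. "adding sf311.parquet to perimeter: 5.926s"
    db.all(sql, (err) => {
      console.timeEnd(label);
      err ? reject(err) : resolve();
    });
  });
}

async function main() {
  await run(
    "adding sf311.parquet to perimeter",
    `CREATE TABLE sf311 AS SELECT * FROM 'sf311.parquet'`
  );
  await run(
    "topk on the parquet file",
    `SELECT category, COUNT(*) AS c FROM 'sf311.parquet' GROUP BY category ORDER BY c DESC LIMIT 10`
  );
  await run(
    "topk on the perimeter table",
    `SELECT category, COUNT(*) AS c FROM sf311 GROUP BY category ORDER BY c DESC LIMIT 10`
  );
}

main().catch(console.error);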

outcome

---------- SF 311 Dataset ----------


~~~ running w/ the parquet file (5mil rows, 40ish fields, ~ 786 MB) ~~~

adding sf311.parquet to perimeter: 5.926s
topk on the parquet file: 79.787ms
topk on the perimeter table: 43.817ms


~~~ running against sf-311-cases-dec-2021.csv (same size, 2.1G) ~~~

adding the sf311 csv to perimeter: 15.618s
topk on the csv file: 10.163s
topk on the perimeter table: 45.133ms

Takeaways

  • moving parquet to duckdb cost ~6 seconds
  • the top-k querying took 44ms vs. 80ms without, which is almost half the time
  • moving csv to duckdb cost ~ 15.6 seconds
  • a top-k query on a csv takes 10.2 seconds total
  • a top-k on the resulting table instead takes around 45ms (comparable to the parquet-to-duckdb ingestion case)
  • Moving a data file into the duckdb perimeter will likely have time savings vs. querying the regular parquet file
  • loading csvs into the duckdb perimeter is a necessity; we should require it. Anything we can do to speed this up is a good move
  • it's important to note that duckdb's own storage format is not quite at 1.0 yet, so parquet files are the better option for persistence between app sessions for now. This means we can't get instant startup until that format settles a bit. Another alternative is to throw caution to the wind, just go for it, and be ready to blow up a user's persisted perimeter for the time being.

Next Steps

key on sanitized query

a query should probably be keyed on a sanitized version of its text: unneeded punctuation removed, comments stripped, all \n and double spaces collapsed, etc.
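A minimal sketch of that sanitization (the exact rules, like case folding, are assumptions):

export function sanitizedQueryKey(sql: string): string {
  return sql
    .replace(/--[^\n]*/g, " ")         // strip line comments
    .replace(/\/\*[\s\S]*?\*\//g, " ") // strip block comments
    .replace(/;\s*$/g, "")             // drop a trailing semicolon
    .replace(/\s+/g, " ")              // collapse \n, tabs, and double spaces
    .trim()
    .toLowerCase();                    // assumes case-insensitive keys are acceptable
}

// both "SELECT *\nFROM pages -- preview" and "select * from   pages" produce the same key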

show query errors somewhere

there are two options:

1.) the easiest – show the error in a toast-like notification element.
2.) the better one – underline the error in Codemirror.

add MD IO into this repository

It's unclear if we'll stick with this font in the long run, but while developing right now, it's been a great choice. At any rate, we should make sure the dev experience is uniform.

modularize the transform actions

these include ones like moveQueryUp, moveQueryDown, etc.

  • move over the transform actions
  • add test coverage for the transforms

For now, we're going to hold off on renaming query to transform I think.

Switch to DuckDB for prototype

DuckDB is likely much better-suited than sqlite3 for the task at hand. Let's see if it can provide the same needs:

  • a way to run the query (this is obviously easy)
  • a way to get info about the query, such as an execution plan that alludes to the source tables

investigate the duckdb histogram function

We know we can effortlessly do top k in duckdb. Can we generate an equal-width histogram?

This is close but not quite there:

WITH minCTE AS (
  -- min value and total range (width) of createdAt
  SELECT
    MIN(createdAt) AS minValue,
    MAX(createdAt) - MIN(createdAt) AS width
  FROM pages
),
binCTE AS (
  -- assign each row to one of 5 equal-width bins, labeled by the bin's left edge
  SELECT
    floor(
      (5 * CAST(createdAt - (SELECT minValue FROM minCTE) AS float))
      / (SELECT width FROM minCTE)
    )
    * ((SELECT width FROM minCTE) / 5)
    + (SELECT minValue FROM minCTE) AS histogram_bin,
    createdAt
  FROM pages
)
SELECT count(*) AS c, histogram_bin, first(createdAt)
FROM binCTE
GROUP BY histogram_bin
ORDER BY histogram_bin ASC;

make sources easier to work with

currently, the sources have to be known to the user rather than "added" or scanned via a directory.

The solution, whether it's independent scanning or whatever, is to create an availableSources array that we can pass to the global store.
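A minimal sketch of building that array by scanning a directory (the directory argument, file filters, and shape are assumptions, not the repo's actual API):

import { readdirSync } from "fs";
import { extname, join } from "path";

export interface AvailableSource {
  name: string;
  path: string;
  format: "parquet" | "csv";
}

export function scanAvailableSources(dir: string): AvailableSource[] {
  return readdirSync(dir)
    .filter((f) => [".parquet", ".csv"].includes(extname(f).toLowerCase()))
    .map((f): AvailableSource => ({
      name: f.slice(0, f.length - extname(f).length),
      path: join(dir, f),
      format: extname(f).toLowerCase() === ".csv" ? "csv" : "parquet",
    }));
}

// e.g. pass scanAvailableSources("./data") into the global store on startup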

add a more accurate README

I don't think the readme is a reflection of the build steps atm ~ let's add this to the phase 1 goals

create a notification loop

This requires

  • having a notification message interceptable by the frontend
  • having a standard "notification store" which triggers when the store receives a message and then fades out after n seconds (see the sketch below)
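A minimal sketch of such a store, assuming a Svelte writable store (names and the n-second TTL are placeholders):

import { writable } from "svelte/store";

export interface Notification {
  message: string;
  level: "info" | "error";
}

const TTL_MS = 4000; // "n seconds"

function createNotificationStore() {
  const { subscribe, set } = writable<Notification | null>(null);
  let timeout: ReturnType<typeof setTimeout> | undefined;

  return {
    subscribe,
    notify(notification: Notification) {
      set(notification);
      if (timeout) clearTimeout(timeout);
      timeout = setTimeout(() => set(null), TTL_MS); // fade out after the TTL
    },
  };
}

export const notifications = createNotificationStore();

// e.g. when a message arrives from the server:
// notifications.notify({ message: "source imported", level: "info" });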

remove unused packages from `package.json`

In the coming months this repository will build up a lot of unused packages that were discarded for other options. Before the end of this current milestone, we should make sure to get rid of all unused packages.

make autolimiting smarter

Currently, if a user sets LIMIT k themselves, the query will break, since we are already enforcing a limit on the query side to prevent massive resultsets in the preview. We should just check for the presence of a LIMIT clause before appending ours.
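A minimal sketch of that check (the preview cap and the deliberately naive regex are assumptions; the regex would also match a LIMIT inside a subquery or string literal):

const PREVIEW_LIMIT = 1000; // assumed preview cap

export function withPreviewLimit(sql: string): string {
  const trimmed = sql.trim().replace(/;+$/, "");
  const hasLimit = /\blimit\s+\d+\b/i.test(trimmed);
  return hasLimit ? trimmed : `${trimmed} LIMIT ${PREVIEW_LIMIT}`;
}

// withPreviewLimit("select * from pages")          -> "select * from pages LIMIT 1000"
// withPreviewLimit("select * from pages limit 50") -> unchanged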
