rilldata / rill

Rill is a tool for effortlessly transforming data sets into powerful, opinionated dashboards using SQL. BI-as-code.

License: Apache License 2.0

JavaScript 0.37% HTML 0.57% Svelte 22.23% CSS 0.29% TypeScript 34.51% Shell 0.31% Dockerfile 0.02% Makefile 0.03% Go 41.68%
duckdb svelte sveltejs sveltekit dataviz csv parquet parquet-tools golang s3

rill's People

Contributors

adityahegde, akshayjain3450, aljazerzen, andrewrtsao, bcolloran, begelundmuller, briangregoryholmes, cohenscottr, cwarden, djbarnwal, domoritz, egor-ryashin, ericpgreen2, esevastyanov, godzig, hamilton, himadrisingh, k-anshul, kaspersjo, katie-rilldata, kavinjsir, larcedis, magorlick, medriscoll, mindspank, nishantmonu51, pjain1, rakeshsharma14317, rohithreddykota, skokenes

rill's Issues

make a cardiplora but for parquet files

Nishant showed me cardiplora, which was an internal tool used to assess the contribution of a column to a rollup dataset's size. He mentioned that it could take a while to calculate for a million row dataset.

  • cardinality of all other dims – he said this was calculated with a GROUP BY over all dimension columns other than the target column
  • impact – the total additional cardinality this column imposes on the dataset
  • impact / total cardinality – what percentage of the total cardinality does this column contribute? (see the sketch after this list)
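A minimal sketch of how these three numbers could be computed, assuming the duckdb node bindings; the table and column handling is illustrative, not the internal tool's implementation:

import duckdb from "duckdb";

const db = new duckdb.Database(":memory:");

// count the distinct combinations of a set of columns
function countDistinct(table: string, columns: string[]): Promise<number> {
  if (columns.length === 0) return Promise.resolve(1); // no other dims -> a single group
  const sql = `SELECT COUNT(*) AS c FROM (SELECT DISTINCT ${columns.join(", ")} FROM ${table})`;
  return new Promise((resolve, reject) =>
    db.all(sql, (err, rows) => (err ? reject(err) : resolve(Number(rows[0].c))))
  );
}

export async function columnImpact(table: string, dimensions: string[], target: string) {
  const others = dimensions.filter((d) => d !== target);
  const otherCardinality = await countDistinct(table, others);    // cardinality of all other dims
  const totalCardinality = await countDistinct(table, dimensions);
  const impact = totalCardinality - otherCardinality;             // added cardinality this column imposes
  return { otherCardinality, impact, impactShare: impact / totalCardinality };
}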

[meta] Data Modeler Concern Areas

These concerns are from a meeting with Nishant and Mike on 2022-01-07, along with my own ideation about how we might tackle these areas.

concepts and terminology

staging, sources, & perimeters

  • supported formats: for April, we will provide support for ndjson, csv, parquet, and possibly Kafka.
  • concept flow: #52
  • ingestion: #39
  • remote sources: #53

performance concerns

Can duckdb do the kinds of transforms we want it to do?
Can we actually create a smooth-feeling experience w/ duckdb as it is now?

subquery inspection

writing queries is hard. How do I inspect my CTEs as I'm going so that I know what I'm doing wrong?

fast query execution

we want queries to feel like they're being inspected, not run. To do this we need preview result sets to return as quickly as possible.

clear, expected, and achievable outputs

I think that we need to sharpen what we have in mind for the data modeler output.

dialect / engine support

what engines and dialects do we need to support? Do we need to natively run an engine to make this a reality, or can we transpile a query from one SQL dialect to duckdb?

export artifacts

what artifacts do we actually want to export out of the data modeler? What is the target here?

  • ingestion spec for druid?
  • are we modeling things into the kafka sink?

relationship between data modeler and cloud

  • when do we pair the data modeler with orchestration?

pipeline

when do we prioritize pipeline buildouts?

stateful vs stateless

  • stateful transforms are expensive. What metrics should we show?
  • vcpu calculations

extracting and inspecting a subquery

goal

we want to be able to run individual subqueries and inspect them as if they were their own datasets.

Methods investigated

AST parser to inspect the tables

Utilize a pre-existing SQL AST parser of some kind to decompose a query and get source-table information. Almost all the AST parsers fall short: many of them support postgres syntax, but they tend to fail on duckdb-specific syntax (which we'd want to support).

In theory, if we could get the parse tree from duckdb, we could easily inspect and reconstruct anything we need.

duckdb profiler

Get the core source tables for any complex query via scan defs.

the duckdb profiler does give information about sequential and parquet scans. While the sequential scans show the table names, the parquet scans do not. I filed this issue to add the source file name to the parquet scans.

Ultimately the profiler does not give us the additional information we need, so this is unfortunately a dead end.

encouraging dbt-core style templating

We could skip subquery parsing altogether and simply encourage a number of models instead of CTEs. While this would enable each query to be inspected in the traditional way, it breaks the model of dbt (and of CTEs).

hacked-together subquery parser

I've written a subquery extractor that takes in the query & the cursor location. If the location is on a viable subquery, then it returns the subquery as well as the index bounds of the subquery.

I've also written a CTE extractor. In theory, one could reconstruct everything a subquery needs by scanning for all subqueries & the FROM statement of the selected query, then pulling in the CTEs it uses. This would be enough to reconstruct a viable query that can be run to inspect the resultset of the subquery.

The final piece missing to reconstruct subquery inspection would be to extract the source table from the FROM statement, reverse-engineer the tables or CTEs that are referenced, and then construct a query w/ CTEs that would return the result of that subquery.
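A minimal sketch of the cursor-based extraction step described above (a simplified stand-in, not the actual extractor in this repo; it matches parentheses and ignores string literals):

interface SubqueryMatch {
  text: string;
  start: number; // index just after the opening paren
  end: number;   // index of the closing paren
}

export function extractSubqueryAtCursor(query: string, cursor: number): SubqueryMatch | null {
  // collect every parenthesized span that contains the cursor
  const stack: number[] = [];
  const spans: Array<[number, number]> = [];
  for (let i = 0; i < query.length; i++) {
    if (query[i] === "(") stack.push(i);
    else if (query[i] === ")") {
      const open = stack.pop();
      if (open !== undefined && open < cursor && cursor <= i) spans.push([open, i]);
    }
  }
  // prefer the innermost span that actually looks like a subquery
  spans.sort((a, b) => (a[1] - a[0]) - (b[1] - b[0]));
  for (const [open, close] of spans) {
    const body = query.slice(open + 1, close).trim();
    if (/^select\b/i.test(body)) return { text: body, start: open + 1, end: close };
  }
  return null;
}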

apache calcite

Nishant recommended looking into Apache Calcite. https://calcite.apache.org/docs/

improve the caching checks for source datasets

DM works primarily as demoware, especially our source dataset scanning feature.

To make this more robust, we'll need to make sure the modeler doesn't rescan source files every time there's a new websocket connection. It will also need to skip fields of a dataset that have already been analyzed.
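One way to do this, as a minimal sketch (the function names and cache shape are assumptions, not the repo's actual API): key a profile cache on the file path and modification time, and only recompute when the file has changed.

import { statSync } from "fs";

const profileCache = new Map<string, { mtimeMs: number; profile: unknown }>();

export function profileSource(path: string, computeProfile: (path: string) => unknown): unknown {
  const mtimeMs = statSync(path).mtimeMs;
  const cached = profileCache.get(path);
  if (cached && cached.mtimeMs === mtimeMs) {
    // unchanged since the last scan (e.g. a new websocket connection): reuse the profile
    return cached.profile;
  }
  const profile = computeProfile(path);
  profileCache.set(path, { mtimeMs, profile });
  return profile;
}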

[meta] backend problems to solve

  • how do we want to handle adding new sources? For now, we'll just scan the root dir for all parquet files, but we will need to do something different in the future.
  • how do we connect the queries directly to DBT?

switch to tailwind

As much as I love hand-writing css, not everyone agrees with this approach. Moving to Tailwind will probably ensure greater productivity for others and will undoubtedly help standardize my prototype styling.

migrate to websockets

We will want to be doing some background processing down the line to improve the user experience. Given that, and given that this will be a pretty localhost-centric experience going forward, it makes the most sense to switch to web sockets. This will enable us to do a few things:

  1. capture state server-side, which enables us to perform background operations on the app state (sketched after this list)
  2. make the frontend a bit more resilient to change, since the backend is the source of truth
  3. remove the need for localhost to maintain the core app state
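A minimal sketch of the server-side state loop, assuming the ws package (the message shapes and action names are illustrative, not the repo's actual protocol):

import { WebSocketServer, WebSocket } from "ws";

type AppState = { queries: Array<{ id: string; sql: string }> };
let state: AppState = { queries: [] };

const wss = new WebSocketServer({ port: 8081 });

function broadcast() {
  const payload = JSON.stringify({ type: "app-state", state });
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) client.send(payload);
  }
}

wss.on("connection", (socket) => {
  // a new client immediately receives the current state
  socket.send(JSON.stringify({ type: "app-state", state }));
  socket.on("message", (raw) => {
    const action = JSON.parse(raw.toString());
    if (action.type === "update-query") {
      state.queries = state.queries.map((q) =>
        q.id === action.id ? { ...q, sql: action.sql } : q
      );
      broadcast(); // every state change is pushed back out, so the frontend is just a view
    }
  });
});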

implement type information into the server

The most crucial place for type information and documentation right now is the server. Given that the frontend store is just a wrapper over the server store, a single interface that describes the structure would suffice.
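As a minimal sketch, with field names that are assumptions inferred from the other issues here rather than the actual store shape, the shared interface might look like:

export interface ColumnProfile {
  name: string;
  type: string;
  cardinality?: number;
}

export interface SourceDataset {
  path: string; // e.g. a parquet or csv file on disk
  profile?: ColumnProfile[];
}

export interface Query {
  id: string;
  sql: string;
  error?: string;
}

export interface AppState {
  sources: SourceDataset[];
  queries: Query[];
}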

migrate better-sql to only operate on parquet files

Now that duckdb/duckdb#2818 is merged, we can migrate entirely to parquet input files. The solution should look like:

  • parquet file in
  • constructed duckdb db as a staging db
  • parquet file out

Before doing all that, let's start just by running queries against parquet files. Does this work well?
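A minimal sketch of that flow with the duckdb node bindings (file names are placeholders):

import duckdb from "duckdb";

const db = new duckdb.Database(":memory:");

// parquet file in: load into a staging table inside duckdb
db.run(`CREATE TABLE staging AS SELECT * FROM 'input.parquet'`, (err) => {
  if (err) throw err;
  // ...transforms against the staging table would run here...
  // parquet file out: write the staged table back to a parquet file
  db.run(`COPY staging TO 'output.parquet' (FORMAT PARQUET)`, (copyErr) => {
    if (copyErr) throw copyErr;
    console.log("wrote output.parquet");
  });
});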

December cleanup

Going to spend this morning cleaning up the repository.

  • typescript is supported but not really implemented
  • the app state is fractured. We need to handle the inspector in the store, and it should probably mostly live on the server & run remotely
  • we should bring in tailwind to simplify app development

bringing parquet & csv files into the perimeter

goal

  1. Figure out the processing time required to bring a (parquet, csv) file into a duckdb perimeter
  2. Look at the time savings in running a standard top-k against the resulting table (vs. the original parquet / csv file)

Filing this issue to capture a conversation with Nishant and Mike D on 2022-01-07.

Currently, the data modeler prototype scans its root directory for the presence of parquet files. Users can then write queries directly against these parquet files as part of the modeling process.

We know this is not the long-term solution. Ultimately, converting source files to duckdb tables directly has a number of benefits.

improved performance – duckdb's columnar engine does a great job of optimizing query plans by producing indices on all the available columns. Even in the case of calculating a top-k query against a parquet file, duckdb's optimizations reduce the runtime by about half.

consistency of experience – even though a user can directly query a parquet file from the data modeler, similar performance & functionality does not exist with other more common data formats such as CSV. We can, for instance, directly query a CSV to calculate the top-k of a field, but for a 5 million row dataset, this takes approximately 10 seconds to run (still pretty good compared to writing a script to calculate this against the CSV directly). Loading into duckdb, however, will take ~ 10 seconds, but after that, similar queries will drop to a low-latency level.

There are a few ways we might do this:

  1. manually adding a data file to duckdb, e.g. dropping it into the window and letting duckdb do the rest
  2. scanning a chosen directory for source files and automatically ingesting these files into duckdb

The point of this issue is to assess the general viability of ingesting from a performance perspective.

method

I wrote a node script that hits the duckdb engine to run these queries.
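A rough reconstruction of that script (not the original; the duckdb node bindings are assumed and the top-k column is a placeholder):

import duckdb from "duckdb";

const db = new duckdb.Database(":memory:");

function run(label: string, sql: string): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    console.time(label); // prints e.g. "adding sf311.parquet to perimeter: 5.926s"
    db.all(sql, (err) => {
      console.timeEnd(label);
      err ? reject(err) : resolve();
    });
  });
}

async function main() {
  await run(
    "adding sf311.parquet to perimeter",
    `CREATE TABLE sf311 AS SELECT * FROM 'sf311.parquet'`
  );
  await run(
    "topk on the parquet file",
    `SELECT category, COUNT(*) AS c FROM 'sf311.parquet' GROUP BY category ORDER BY c DESC LIMIT 10`
  );
  await run(
    "topk on the perimeter table",
    `SELECT category, COUNT(*) AS c FROM sf311 GROUP BY category ORDER BY c DESC LIMIT 10`
  );
}

main().catch(console.error);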

outcome

---------- SF 311 Dataset ----------


~~~ running w/ the parquet file (5mil rows, 40ish fields, ~ 786 MB) ~~~

adding sf311.parquet to perimeter: 5.926s
topk on the parquet file: 79.787ms
topk on the perimeter table: 43.817ms


~~~ running against sf-311-cases-dec-2021.csv (same size, 2.1G) ~~~

adding the sf311 csv to perimeter: 15.618s
topk on the csv file: 10.163s
topk on the perimeter table: 45.133ms

Takeaways

  • moving parquet to duckdb cost ~6 seconds
  • the top-k querying took 44ms vs. 80ms without, which is almost half the time
  • moving csv to duckdb cost ~ 15.6 seconds
  • a top-k query on a csv takes 10.2 seconds total
  • a top-k on the resulting table instead takes around 45ms (comparable to the parquet-to-duckdb ingestion case)
  • Moving a data file into the duckdb perimeter will likely have time savings vs. querying the regular parquet file
  • loading csvs into the duckdb perimeter is a necessity; we should require it. Anything we can do to speed this up is a good move
  • it's important to note that duckdb's own storage format is not quite at 1.0 yet, so parquet files are the better option for persistence between app sessions for now. This means we can't get instant startup until that format settles a bit. Another alternative is to throw caution to the wind, just go for it, and be ready to blow up a user's persisted perimeter for the time being.

Next Steps

key on sanitized query

a query should probably be keyed on a sanitized version of its text: unneeded punctuation removed, comments stripped, all \n and double spaces collapsed, etc.
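A minimal sketch of that sanitization (the exact rules, like case folding, are assumptions):

export function sanitizedQueryKey(sql: string): string {
  return sql
    .replace(/--[^\n]*/g, " ")         // strip line comments
    .replace(/\/\*[\s\S]*?\*\//g, " ") // strip block comments
    .replace(/;\s*$/g, "")             // drop a trailing semicolon
    .replace(/\s+/g, " ")              // collapse \n, tabs, and double spaces
    .trim()
    .toLowerCase();                    // assumes case-insensitive keys are acceptable
}

// both "SELECT *\nFROM pages -- preview" and "select * from   pages" produce the same key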

show query errors somewhere

there are two options:

1.) the easiest – show the error in a toast-like notification element.
2.) the better one – underline the error in Codemirror.

add MD IO into this repository

It's unclear if we'll stick with this font in the long run, but while developing right now, it's been a great choice. At any rate, we should make sure the dev experience is uniform.

modularize the transform actions

these include ones like moveQueryUp, moveQueryDown, etc.

  • move over the transform actions
  • add test coverage for the transforms

For now, we're going to hold off on renaming query to transform I think.

Switch to DuckDB for prototype

DuckDB is likely much better-suited than sqlite3 for the task at hand. Let's see if it can provide the same needs:

  • a way to run the query (this is obviously easy)
  • a way to get info about the query, such as an execution plan that alludes to the source tables

investigate the duckdb histogram function

We know we can effortlessly do top k in duckdb. Can we generate an equal-width histogram?

This is close but not quite there:

WITH minCTE AS (
  -- min value and total range (width) of createdAt
  SELECT
    MIN(createdAt) AS minValue,
    MAX(createdAt) - MIN(createdAt) AS width
  FROM pages
),
binCTE AS (
  -- assign each row to one of 5 equal-width bins, labeled by the bin's left edge
  SELECT
    floor(
      (5 * CAST(createdAt - (SELECT minValue FROM minCTE) AS float))
      / (SELECT width FROM minCTE)
    )
    * ((SELECT width FROM minCTE) / 5)
    + (SELECT minValue FROM minCTE) AS histogram_bin,
    createdAt
  FROM pages
)
SELECT count(*) AS c, histogram_bin, first(createdAt)
FROM binCTE
GROUP BY histogram_bin
ORDER BY histogram_bin ASC;

make sources easier to work with

currently, the sources have to be known to the user rather than "added" or scanned via a directory.

The solution, whether it's independent scanning or whatever, is to create an availableSources array that we can pass to the global store.
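A minimal sketch of building that array by scanning a directory (the directory argument, file filters, and shape are assumptions, not the repo's actual API):

import { readdirSync } from "fs";
import { extname, join } from "path";

export interface AvailableSource {
  name: string;
  path: string;
  format: "parquet" | "csv";
}

export function scanAvailableSources(dir: string): AvailableSource[] {
  return readdirSync(dir)
    .filter((f) => [".parquet", ".csv"].includes(extname(f).toLowerCase()))
    .map((f): AvailableSource => ({
      name: f.slice(0, f.length - extname(f).length),
      path: join(dir, f),
      format: extname(f).toLowerCase() === ".csv" ? "csv" : "parquet",
    }));
}

// e.g. pass scanAvailableSources("./data") into the global store on startup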

add a more accurate README

I don't think the readme is a reflection of the build steps atm ~ let's add this to the phase 1 goals

create a notification loop

This requires

  • having a notification message interceptable by the frontend
  • having a standard "notification store" which triggers when the store receives a message and then fades out after n seconds (see the sketch below)
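A minimal sketch of such a store, assuming a Svelte writable store (names and the n-second TTL are placeholders):

import { writable } from "svelte/store";

export interface Notification {
  message: string;
  level: "info" | "error";
}

const TTL_MS = 4000; // "n seconds"

function createNotificationStore() {
  const { subscribe, set } = writable<Notification | null>(null);
  let timeout: ReturnType<typeof setTimeout> | undefined;

  return {
    subscribe,
    notify(notification: Notification) {
      set(notification);
      if (timeout) clearTimeout(timeout);
      timeout = setTimeout(() => set(null), TTL_MS); // fade out after the TTL
    },
  };
}

export const notifications = createNotificationStore();

// e.g. when a message arrives from the server:
// notifications.notify({ message: "source imported", level: "info" });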

remove unused packages from `package.json`

In the coming months this repository will build up a lot of unused packages that were discarded for other options. Before the end of this current milestone, we should make sure to get rid of all unused packages.

make autolimiting smarter

Currently, if a user sets LIMIT k themselves, the query will break, since we are already enforcing a limit on the query side to prevent massive resultsets in the preview. We should just check for the presence of a LIMIT clause before appending ours.
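A minimal sketch of that check (the preview cap and the deliberately naive regex are assumptions; the regex would also match a LIMIT inside a subquery or string literal):

const PREVIEW_LIMIT = 1000; // assumed preview cap

export function withPreviewLimit(sql: string): string {
  const trimmed = sql.trim().replace(/;+$/, "");
  const hasLimit = /\blimit\s+\d+\b/i.test(trimmed);
  return hasLimit ? trimmed : `${trimmed} LIMIT ${PREVIEW_LIMIT}`;
}

// withPreviewLimit("select * from pages")          -> "select * from pages LIMIT 1000"
// withPreviewLimit("select * from pages limit 50") -> unchanged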
