rilldata / rill
Rill is a tool for effortlessly transforming data sets into powerful, opinionated dashboards using SQL. BI-as-code.
License: Apache License 2.0
Nishant showed me cardiplora, which was an internal tool used to assess the contribution of a column to a rollup dataset's size. He mentioned that it could take a while to calculate for a million row dataset.
These concerns are from a meeting with Nishant and Mike on 2022-01-07, along with my own ideation about how we might tackle these areas.
Can duckdb do the kinds of transforms we want it to do?
Can we actually create an experience that feels smooth w/ duckdb as it is now?
writing queries is hard. How do I inspect my CTEs as I'm going so that I know what I'm doing wrong?
we want queries to feel like they're being inspected, not run. To do this we need preview result sets to return as quickly as possible.
I think that we need to sharpen what we have in mind for the data modeler output.
what engines and dialects do we need to support? Do we need to natively run an engine to make this a reality, or can we transpile a query in one SQL dialect to duckdb?
what artifacts do we actually want to export out of the data modeler? What is the target here?
when do we prioritize pipeline buildouts?
we want to be able to run parts of subqueries to inspect them as if they were their own datasets.
Utilize a pre-existing SQL AST parser of some kind to decompose a query and get source-table information. Almost all of the AST parsers fall short: many support the postgres syntax, but they tend to fail on duckdb-specific syntax (which we'd want to support).
In theory, if we could get the parse tree from duckdb, we could easily inspect and reconstruct anything we need.
Get the core source tables for any complex query via scan defs.
the duckdb profile does give information about sequence and parquet scans. While the sequence scans show the table names, the parquet scans do not. I filed this issue to add the source file name to the parquet scans.
Ultimately the profiler does not give us any more information we need, so this is unfortunately a dead-end.
We could skip out on subquery parsing altogether and simply encourage a number of models instead of CTEs. While this would enable each query to be inspected in the traditional way, it breaks the model of dbt (and the CTEs).
I've written a subquery extractor that takes in the query & the cursor location. If the location is on a viable subquery, then it returns the subquery as well as the index bounds of the subquery.
I've also written a CTE extractor. In theory, one could reconstruct all the calls needed for a subquery by scanning for all subqueries & the FROM statement of a selected query and selecting the utilized CTEs. This would be enough to reconstruct a viable query to run to inspect the result set of the subquery.
The final piece missing to reconstruct subquery inspection would be to extract the source table from the FROM statement, reverse-engineer the tables or CTEs that are referenced, and then construct a query w/ CTEs that would return the result of that subquery.
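As a rough sketch of how that reconstruction could fit together (the extractor signatures and the LIMIT of 25 below are assumptions for illustration, not the actual extractor API, and this version doesn't chase CTE-to-CTE dependencies):

interface CTE { name: string; text: string }
interface SubqueryMatch { text: string; start: number; end: number }

type ExtractCTEs = (query: string) => CTE[];
type ExtractSubquery = (query: string, cursor: number) => SubqueryMatch | null;

function buildInspectionQuery(
  query: string,
  cursor: number,
  extractCTEs: ExtractCTEs,
  extractSubquery: ExtractSubquery
): string | null {
  const sub = extractSubquery(query, cursor);
  if (!sub) return null;
  // keep only the CTEs the selected subquery actually references by name
  const needed = extractCTEs(query).filter((cte) =>
    new RegExp(`\\b${cte.name}\\b`, 'i').test(sub.text)
  );
  const withClause = needed.length
    ? `WITH ${needed.map((c) => `${c.name} AS (${c.text})`).join(', ')} `
    : '';
  // wrap the subquery so its result set can be previewed on its own
  return `${withClause}SELECT * FROM (${sub.text}) AS inspected LIMIT 25`;
}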
Nishant recommended looking into Apache Calcite. https://calcite.apache.org/docs/
DM works primarily as demoware, especially our source dataset scanning feature.
To make this more robust, we'll need to make sure the modeler doesn't constantly scan source files every time there's a new websocket connection. It will also need to skip re-profiling fields of a dataset that have already been analyzed.
Some progress has been made via diagramming
Based on conversations, will generate a document that codifies this information.
As much as I love hand-writing css, not everyone agrees with this approach. Moving to Tailwind will probably ensure greater productivity for others and will undoubtedly help standardize my prototype styling.
We probably don't need a separate server in place of sveltekit.
We will want to be doing some background processing down the line to improve the user experience. Given that, and given that this will be a pretty localhost-centric experience going forward, it makes the most sense to switch to web sockets. This will enable us to do a few things:
This repository will eventually have CI. I am a fan of also requiring commit hooks to catch commits that might introduce code quality problems before other humans even see them. Husky is a sensible option for this.
The most crucial place for type information and documentation right now is the server. Given that the frontend store is just a wrapper over the server store, a single interface that describes the structure would suffice.
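Something like this rough interface sketch might be enough (the field names here are illustrative guesses, not the current store's actual shape):

// illustrative only; the real store shape may differ
interface Source {
  path: string;        // e.g. a parquet or csv file on disk
  profiled: boolean;   // whether its fields have been analyzed yet
}

interface QueryModel {
  id: string;
  name: string;
  query: string;
  error?: string;
  preview?: Record<string, unknown>[]; // small result set for the preview pane
}

// the shape shared by the server store and the frontend wrapper around it
interface DataModelerState {
  sources: Source[];
  queries: QueryModel[];
  activeQueryId?: string;
}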
Now that duckdb/duckdb#2818 is merged, we can migrate entirely to parquet input files. The solution should look like:
Before doing all that, let's start just by running queries against parquet files. Does this work well?
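As a quick sanity check, something like this sketch with the node duckdb bindings should be enough to try it (the parquet path is a made-up example):

import * as duckdb from 'duckdb';

// in-memory database; the parquet path below is hypothetical
const db = new duckdb.Database(':memory:');

db.all(
  "SELECT count(*) AS n FROM parquet_scan('data/source.parquet')",
  (err, rows) => {
    if (err) throw err;
    console.log(rows); // row count of the parquet file
  }
);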
Going to spend this morning cleaning up the repository.
Filing this issue to capture a conversation had w/ Nishant and Mike D on 2022-01-07.
Currently, the data modeler prototype scans its root directory for the presence of parquet files. Users can then write queries directly against these parquet files as part of the modeling process.
We know this is not the long-term solution. Ultimately, converting source files to duckdb tables directly has a number of benefits.
improved performance – duckdb's columnar engine does a great job of optimizing query plans by producing indices on all the available columns. Even in the case of calculating a top-k query against a parquet file, duckdb's optimizations reduce the runtime by about half.
consistency of experience – even though a user can directly query a parquet file from the data modeler, similar performance & functionality does not exist for other, more common data formats such as CSV. We can, for instance, directly query a CSV to calculate the top-k of a field, but for a 5-million-row dataset this takes approximately 10 seconds to run (still pretty good compared to writing a script to calculate this against the CSV directly). Loading into duckdb will also take ~10 seconds, but after that, similar queries drop to a low-latency level.
There are a few ways we might do this:
1. manually adding a data file to duckdb, e.g. dropping it into the window and letting duckdb do the rest
2. scanning a chosen directory for source files and automatically ingesting these files into duckdb
The point of this issue is to assess the general viability of ingesting from a performance perspective.
I wrote a node script that hits the duckdb engine to run these queries.
---------- SF 311 Dataset ----------
~~~ running w/ the parquet file (5mil rows, 40ish fields, ~ 786 MB) ~~~
adding sf311.parquet to perimeter: 5.926s
topk on the parquet file: 79.787ms
topk on the perimeter table: 43.817ms
~~~ running against sf-311-cases-dec-2021.csv (same size, 2.1G) ~~~
adding the sf311 csv to perimeter: 15.618s
topk on the csv file: 10.163s
topk on the perimeter table: 45.133ms
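The script boiled down to something like the following reconstruction (a sketch rather than the exact script; the file, table, and column names are placeholders):

import * as duckdb from 'duckdb';

const db = new duckdb.Database(':memory:');
const con = db.connect();

// small promise wrapper so the statements (and their timings) run one after another
function timed(label: string, sql: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    con.all(sql, (err) => {
      if (err) return reject(err);
      console.log(`${label}: ${Date.now() - start}ms`);
      resolve();
    });
  });
}

// file, table, and column names here are placeholders
const topk = (rel: string) =>
  `SELECT category, count(*) AS c FROM ${rel} GROUP BY category ORDER BY c DESC LIMIT 10`;

async function main() {
  await timed('topk on the parquet file', topk("parquet_scan('sf311.parquet')"));
  await timed('ingest parquet into a table',
    "CREATE TABLE sf311 AS SELECT * FROM parquet_scan('sf311.parquet')");
  await timed('topk on the table', topk('sf311'));
}

main();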
Eventually this application can / should be bundled via electron and distributed as a desktop app.
a query should probably be keyed on the query text minus unneeded punctuation, with comments removed and all \n, double spaces, etc. collapsed.
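A rough normalization like this could serve as the cache key (a sketch; the comment-stripping regexes are simplistic and ignore string literals):

// rough cache key for a query: strip comments, collapse whitespace, drop trailing punctuation
function queryKey(query: string): string {
  return query
    .replace(/--[^\n]*/g, ' ')         // strip line comments
    .replace(/\/\*[\s\S]*?\*\//g, ' ') // strip block comments
    .replace(/\s+/g, ' ')              // collapse \n, tabs, double spaces
    .replace(/;\s*$/, '')              // drop a trailing semicolon
    .trim();
}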
The inspector (and future asset navigator) panes should be user-expandable.
there are two options:
1.) the easiest – show the error in a toast-like notification element.
2.) the better one – underline the error in CodeMirror.
It's unclear if we'll stick with this font in the long run, but it's been a great choice while developing right now. At any rate, we should make sure the dev experience is uniform.
nullity would be good as well
these include ones like moveQueryUp, moveQueryDown, etc.
For now, we're going to hold off on renaming query to transform, I think.
DuckDB is likely much better-suited than sqlite3 for the task at hand. Let's see if it can provide the same needs:
We know we can effortlessly do top k in duckdb. Can we generate an equal-width histogram?
This is close but not quite there:
WITH minCTE AS (
  SELECT
    MIN(createdAt) AS minValue,
    MAX(createdAt) - MIN(createdAt) AS width
  FROM pages
),
binCTE AS (
  SELECT
    -- map each value to the lower bound of its (width / 5) bin
    FLOOR(
      (5 * CAST(createdAt - (SELECT minValue FROM minCTE) AS FLOAT))
      / (SELECT width FROM minCTE)
    )
    * ((SELECT width FROM minCTE) / 5)
    + (SELECT minValue FROM minCTE) AS histogram_bin,
    createdAt
  FROM pages
)
-- note: the maximum value lands in a sixth, single-value bin (floor(5) = 5),
-- which is part of why this isn't quite right yet
SELECT COUNT(*) AS c, histogram_bin, FIRST(createdAt)
FROM binCTE
GROUP BY histogram_bin
ORDER BY histogram_bin ASC;
currently, the sources have to be known to the user rather than "added" or scanned via a directory.
The solution, whether it's independent scanning or whatever, is to create an availableSources array that we can pass to the global store.
I don't think the readme is a reflection of the build steps atm ~ let's add this to the phase 1 goals
This requires
We'll start with these:
In the coming months this repository will build up a lot of unused packages that were discarded for other options. Before the end of this current milestone, we should make sure to get rid of all unused packages.
Currently if a user sets LIMIT k, the query will break since we are enforcing a limit on the query side right now to prevent massive result sets in the preview. We should just check for the presence of a LIMIT clause.
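A naive version of that check might look like this sketch (it doesn't account for LIMIT appearing inside string literals or comments, and the 1000-row preview limit is just an example value):

// only append our preview LIMIT when the user hasn't written one
function withPreviewLimit(query: string, previewLimit = 1000): string {
  const trimmed = query.trim().replace(/;\s*$/, '');
  const hasLimit = /\blimit\s+\d+/i.test(trimmed);
  return hasLimit ? trimmed : `${trimmed} LIMIT ${previewLimit}`;
}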
This will help us catch buggy API calls.