Git Product home page Git Product logo

mosaic's Introduction

Mosaic: An Extensible Framework for Linking Databases and Interactive Views

  • 📈 Explore massive datasets
    Visualize, select, and filter datasets with millions or billions of records.
  • 🚀 Flexible deployment
    Build data-driven web apps, or interact with data directly in Jupyter notebooks.
  • 🛠️ Interoperable & extensible
    Create new components that seamlessly integrate across selections and datasets.
  • 🦆 Powered by DuckDB
    Mosaic pushes computation to DuckDB, both server-side and in your browser via WebAssembly.

Mosaic is an extensible architecture for linking data visualizations, tables, input widgets, and other data-driven components, leveraging a backing database for scalable processing of both static and interactive views. With Mosaic, you can visualize and explore millions and even billions of data points at interactive rates.

The key idea is to have interface components "publish" their data needs as declarative queries that can be managed, optimized, and cross-filtered by a coordinator that proxies access to DuckDB.

Learn more about Mosaic at the documentation site, or read the Mosaic research paper.

If referencing Mosaic, please use the following citation:

@article{heer2024mosaic,
  title={Mosaic: An Architecture for Scalable \& Interoperable Data Views},
  author={Heer, Jeffrey and Moritz, Dominik},
  journal={IEEE Transactions on Visualization and Computer Graphics},
  year={2024},
  volume={30},
  number={1},
  pages={436-446},
  doi={10.1109/TVCG.2023.3327189}
}

Repository Structure

This repository contains a set of related packages.

Note: For convenience, the vgplot package re-exports much of the mosaic-core, mosaic-sql, mosaic-plot, and mosaic-inputs packages. For most applications, it is sufficient to either import @uwdata/vgplot alone or in conjunction with @uwdata/mosaic-spec.

Core Components

  • mosaic-core: The core Mosaic components. A central coordinator, parameters and selections for linking scalar values or query predicates (respectively) across Mosaic clients, and filter groups with optimized index management. The Mosaic coordinator can send queries either over the network to a backing server (socket and rest clients) or to a client-side DuckDB-WASM instance (wasm client).
  • mosaic-sql: An API for convenient construction and analysis of SQL queries. Query objects then coerce to SQL query strings.
  • mosaic-inputs: Standalone data-driven components such as input menus, text search boxes, and sortable, load-on-scroll data tables.
  • mosaic-plot: An interactive grammar of graphics implemented on top of Observable Plot. Marks (plot layers) serve as individual Mosaic clients. These marks can push data processing (binning, hex binning, regression) and optimizations (such as M4 for line/area charts) down to the database. This package also provides interactors for linked selection, filtering, and highlighting using Mosaic Params and Selections.

Applications

  • vgplot: A visualization grammar API for building interactive Mosaic-powered visualizations and dashboards. This package provides convenient, composable methods that combine multiple Mosaic packages (core, inputs, plot, etc.) in an integrated API. This API re-exports much of the mosaic-core, mosaic-sql, mosaic-plot, and mosaic-inputs packages, enabling use in a stand-alone fashion.
  • mosaic-spec: Declarative specification of Mosaic-powered applications as JSON or YAML files. This package provides a parser and code generation framework for reading specifications in a JSON format and generating live Mosaic visualizations and dashboards using the vgplot API.
  • duckdb-server: A Python-based server that runs a local DuckDB instance and support queries over Web Sockets or HTTP, returning data in either Apache Arrow or JSON format.
  • widget: A Jupyter widget for Mosaic. Given a declarative specification, will generate web-based visualizations while leveraging DuckDB in the Jupyter kernel. Create interactive Mosaic plots over Pandas and Polars data frames or DuckDB connections.

Miscellaneous

  • mosaic-duckdb: A Promise-based Node.js API to DuckDB, along with a data server that supports transfer of Apache Arrow and JSON data over either Web Sockets or HTTP. Due to persistent quality issues involving the Node.js DuckDB client and Arrow extension, we recommend using the Python-based duckdb-server package instead. However, we retain this package for both backwards compatibility and potential future use as quality issues improve.
  • vega-example: A proof-of-concept example integrating Vega-Lite with Mosaic for data management and cross-view linking.

Build and Usage Instructions

To build and develop Mosaic locally:

  • Clone https://github.com/uwdata/mosaic.
  • Run npm i to install dependencies.
  • Run npm test to run the test suite.
  • Run npm run build to build client-side bundles.

To run local interactive examples:

  • Run npm run dev to launch a local web server and view examples. By default, the examples use DuckDB-WASM in the browser. For greater performance, launch and connect to a local DuckDB server as described below below.

To launch a local DuckDB server:

  • Install hatch, if not already present.
  • Run npm run server to launch the duckdb-server. This runs the server in development mode, so the server will restart if you change its code.

To use Mosaic with DuckDB Python in Jupyter Notebooks:

To use Mosaic with DuckDB-WASM in Observable Notebooks:

To use Mosaic and DuckDB in Observable Framework:

mosaic's People

Contributors

dependabot[bot] avatar domoritz avatar eknowles avatar frtennis1 avatar jheer avatar kwonoh avatar manzt avatar mbostock avatar mingfang avatar rickiesmooth avatar spren9er avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

mosaic's Issues

Custom aggregate expressions break data cube indexer

From @frtennis1:


In DataCubeIndexer.js::getIndexColumns

    switch (aggregate?.toUpperCase()) {

fails on

TypeError: aggregate.toUpperCase is not a function

where aggregate = true. It seems there's a difference between how vg.agg tags aggregate (as a boolean true) and how AggregateFunction tags aggregates (as a string with the aggregate name).

Is it safe to have the {aggregate: True} aggregates fall to the default case in this switch? Or would we prefer tagging vg.agg with something like {aggregate: "CUSTOM"} or changing this switch statement to gracefully handle bools?


I believe a correct initial fix here is to avoid assuming that the aggregate value is a string. If it is not a string (and hence not a single aggregate function) we should fall through the default switch case, and thereby skip attempts at indexing.

In the future one could imagine parsing aggregate expressions to make more nuanced determinations, but that is much more involved and can possibly be taken up at a later date.

Exception is raised when intervalXY is used on cells

The underlying issue is that the invert.js code expects a scale.invert function. But cell marks use d3.scaleBand, which doesn't have a invert function. Here's a quick monkeypatch/proof-of-concept fix (I don't make any adjustments for pixelSize; also note that this won't work if the domain isn't sorted low to high):

invert.js

export function invert(value, scale, pixelSize = 1) {
  if (scale.type == 'band') {
    return Math.round(( (value - scale.range[0]) / (scale.range[1] - scale.range[0]) ) * (scale.domain.slice(-1) - scale.domain[0]));
  } else {
    return scale.invert(pixelSize * Math.floor(value / pixelSize));
  }
}

Enable the `castTimestampToDate` option for duckdb-wasm?

When I try to use mosaic with duckdb-wasm using vg.wasmConnector, dates aren't treated as dates and my charts don't use date axes. It seems like DuckDB added an option to cast dates, but it's false by default:

But this doesn't look like it's toggled on for Mosaic, which causes time series data to be wonky when run directly in a browser.

Data is cached even after table is replaced unless `.configure` is called

I made a full reproduction repo: https://github.com/mhkeller/mosaic-cache-bug

I noticed that if I reuse a table name when creating a table from a parquet file using CREATE OR REPLACE, the data is not replaced.

I thought the issue was due to Mosaic cacheing the results so I set the following, which fixes the issue:

await vg.coordinator().configure({
  cache: false
})

However, it doesn't matter what the value of cache here is. The bug is fixed as long as this configure call is made. For example, setting it to the defaults fixes it:

await vg.coordinator().configure({
  cache: true,
  index: true
})

This also fixes it:

await vg.coordinator().configure({})

Expected

expected.mov

Actual

actual.mov

Arrow extension error

When running the below command for the first time after cloning and building the code,

npm run server

getting the below error:

[Error: HTTP Error: Failed to download extension "arrow" at URL "http://extensions.duckdb.org/v0.8.1/windows_amd64/arrow.duckdb_extension.gz"

Candidate extensions: "parquet"] {
  errno: -1,
  code: 'DUCKDB_NODEJS_ERROR',
  errorType: 'HTTP',
  statusCode: 403,
  response: '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>0PPQ3MBV0ZF6S049</RequestId><HostId>vtgA1i3BYRWWN3uagdtM2uxLv5xfAz4th9XajMjrtTwtuxqK+EJ9n2IH4bZkjqyXH1kplh66mhk=</HostId></Error>',
  reason: 'Forbidden',
  headers: {
    Connection: 'close',
    'Content-Type': 'application/xml',
    Date: 'Tue, 18 Jul 2023 09:04:38 GMT',
    Server: 'AmazonS3',
    'Transfer-Encoding': 'chunked',
    Via: '1.1 dc76840562be68c0220fc72e79c212b8.cloudfront.net (CloudFront)',
    'X-Amz-Cf-Id': 'tcxRBgZSrHjye30qgyTF13OJPDt8cjQB2YgvciWh632LuM3irYPGKw==',
    'X-Amz-Cf-Pop': 'HYD50-C1',
    'X-Cache': 'Error from cloudfront'
  }
}

Jupyter widget breaks on non-localhost http jupyters

If you run Jupyter either over https or on localhost, you don't get this bug. But when you're on a remote using http (or on your localhost, but aliased), the widget breaks with,

Uncaught (in promise) TypeError: globalThis.crypto.randomUUID is not a function
    at s (5ca03b59-285e-488c-b59a-ef82c3b57b2c:37017:33)
    at 5ca03b59-285e-488c-b59a-ef82c3b57b2c:37022:36
    at new Promise (<anonymous>)
    at Object.query (5ca03b59-285e-488c-b59a-ef82c3b57b2c:37022:14)
    at u (5ca03b59-285e-488c-b59a-ef82c3b57b2c:8429:48)
    at a (5ca03b59-285e-488c-b59a-ef82c3b57b2c:8409:9)
    at c (5ca03b59-285e-488c-b59a-ef82c3b57b2c:8414:21)
    at Object.add (5ca03b59-285e-488c-b59a-ef82c3b57b2c:8209:134)
    at Object.request (5ca03b59-285e-488c-b59a-ef82c3b57b2c:8450:20)
    at sL.query (5ca03b59-285e-488c-b59a-ef82c3b57b2c:8527:25)

Some research suggests that browsers do not give access to the crypto api outside of "secure contexts" https://stackoverflow.com/questions/105034/how-do-i-create-a-guid-uuid,

crypto.randomUUID() is now standard on all modern browsers and JS runtimes. However, because new browser APIs are restricted to secure contexts, this method is only available to pages served locally (localhost or 127.0.0.1) or over HTTPS.

and perhaps a recommended fix would be to replace the single callsite with the recommended uuid module from that stackoverflow answer.

make data-server usable as a websocket library

Currently the data-server websocker standalone server is standalone.
Sometimes that's not desirable because you already have your own server, e.g. Express.
To make data-server usable in Express, we need to export queryHandler and socketResponse.

Support of registered files (DuckDB-WASM) in vgplot

Data loading helpers create DuckDB tables/views from parquet, JSON or CSV files. Unfortunately, this process can take a while when data is big. Sometimes — in DuckDB-WASM — it is nicer to register files and run queries directly on them (see registerFileURL). Query performance on DuckDB tables is optimal, but queries on parquet files are fast, too.

Here, we try to use a registered CSV file (via HTTP) in vgplot:

import * as vg from "@uwdata/vgplot";

import { DuckDBDataProtocol } from '@duckdb/duckdb-wasm';

const wasm = await vg.wasmConnector();

wasm.db.registerFileURL(
  'athletes.csv',
  'https://uwdata.github.io/mosaic-datasets/data/athletes.csv',
  DuckDBDataProtocol.HTTP,
  true,
);

vg.coordinator().databaseConnector(wasm);

const $query = vg.Selection.intersect();

export default vg.vconcat(
  vg.menu({
    label: 'Sport',
    as: $query,
    from: 'athletes.csv',
    column: 'sport',
    value: 'aquatics'
  }),
  vg.vspace(10),
  vg.plot(
    vg.barX(vg.from('athletes.csv', { filterBy: $query }), {
      x: vg.sum('gold'),
      y: 'nationality',
      fill: 'steelblue',
      sort: { y: '-x', limit: 10 },
    }),
    vg.xLabel('Gold Medals'),
    vg.yLabel('Nationality'),
    vg.yLabelAnchor('top'),
    vg.marginTop(15)
  )
);

Note, that references of mosaic clients point now to registered file athletes.csv, not a DuckDB table anymore.

This works! However, currently not all marks from vgplot are supported with registered files. Connected marks (e.g. area, line) don't work. When replacing above mark barX by

vg.lineY(vg.from('athletes.csv', { filterBy: $query }), {
  x: 'nationality',
  y: vg.sum('gold')
})

an error is raised:

Coordinator.js:145 Uncaught (in promise) Error: Catalog Error: Table with name csv does not exist!

I think the reason is probably due to some preprocessing of table name, see here. This logic conflicts with registered file name. I tried changing name of registered file athletes.csv, but if I omit .csv suffix, then DuckDB-WASM interprets reference as table.

It would be great if vgplot fully works with registered files (via HTTP(S), S3, local storage, etc.).

Please add type `TIMESTAMP WITH TIME ZONE`

type TIMESTAMP WITH TIME ZONE is the same as TIMESTAMPZ

case 'TIMESTAMPTZ':

I'm using Mosaic to query Postgres using the DuchDB postgres_scanner.
It works well except for tables with column time TIMESTAMP WITH TIME ZONE.

js-type.js:34 Uncaught (in promise) Error: Unsupported type: TIMESTAMP WITH TIME ZONE
    at jsType (js-type.js:34:13)
    at getTableInfo (Catalog.js:75:13)
    at async resolveFields (Catalog.js:85:21)
    at async Catalog.queryFields (Catalog.js:57:18)
    at async Coordinator.connect (Coordinator.js:130:24)

Cannot specify arbitrary aggregations in json

vg.sql and vg.agg are close cousins. They each allow arbitrary sql expressions; vg.agg just sets an extra aggregation property. However, best I can tell, the json spec only supports arbitrary sql expressions via vg.sql if you set the "expr" key. This makes it impossible to generate some charts via json. As an example of something I don't believe can be currently specified via json,

x = np.arange(100)
y = x + np.random.normal(scale=5, size=len(x))
df = pd.DataFrame({'x_val': x, 'y': y})

MosaicWidget(
    {
        "plot": [
            {
                "mark": "ruleX",
                "y1": {"agg": "avg(y) - stddev(y)/count(y)"},
                "y2": {"agg": "avg(y) + stddev(y)/count(y)"},
                "x": {"bin": "x_val"},
                "marker": "tick",
                "data": {"from": "df"},
            }
        ]
    },
    data={'df': df},
)

My proposal would be to special-case the "agg" key in the same way the "expr" key is special-cased.

Create example Vega + Mosaic integration.

Potential ideas for a Mosaic / Vega-Lite component:

  1. Take a Vega-Lite specification as input, analyze transforms/encodings to extract SQL queries.
  2. Instantiate Mosaic clients for data sources.
  3. Feed Mosaic-provided query data to resulting (simplified) Vega-Lite visualizations.
  4. Add listeners for param/selection changes and propagate those to Mosaic (we would need to "disable" downstream selection updates, so step 1 above might involve rewriting to the VL spec in order to leave filtering/domain binding up to Mosaic).

Achieving all the steps above might be a big lift, but even piecemeal integrations might be useful to showcase.

I view this as a stretch goal – not a requirement – for an initial release.

Let interactors update more than one parameter

Here is an example of something that I think is currently impossible: say you have a dataset like,

df = pd.DataFrame(itertools.product(np.arange(.2, 1.2, .01), np.arange(0, 50)), columns=['alpha', 'x'])
df['alpha'] = df['alpha'].round(2)
df['x_pow_a'] = np.power(df['x'], df['alpha'])

>>> df
      alpha   x     x_pow_a
0      0.20   0    0.000000
1      0.20   1    1.000000
...     ...  ..         ...
4998   1.19  48  100.156023
4999   1.19  49  102.643945

[5000 rows x 3 columns]

and want to make two line plots: one of (x, x_pow_a) and one of (alpha, x_pow_a) cross-filtered each with a Nearest selector that maps to some global filter call it $filt. This so far is possible. But then if you want to use the selector to make a visual representation of what is being selected (like a "ruleX" element, as in the "Stocks" Mosaic example) it becomes impossible because,

(1) You can't have two Nearest selectors on the same plot because they overwrite some event listener.
(2) You can't have a Nearest selector simultaneously update two selections (one local to the chart to draw the rule and one global for the cross-filter).

A visual representation of the plot described in words:

image

My suggestion is to change the selection member of interactors to optionally be an array. I have implemented a proof of concept for Nearest and it seems to work well. Wanted to get people's thoughts on this before I took it any further.

cannot parse/access json response of query: Bad escaped character

Hi, I'm trying to access the result of a query via json but it seems the output isn't propery escaped.
my table contains a regex but that's irrelevant, I think, it's just a VARCHAR.

my failed attempt:

const q = await vg.coordinator().query('FROM allregexes;', {type: "json", cache: false});
JSON.parse(q); // fails: Bad escaped character in JSON at position 32
console.log(q.toString()); // because the \d should be escaped as \\d {"regex": "^(?P<DD>0[1-9]|[1-2]\d|3[0-1])...

is this the correct way to access the query results via JSON, or did i find a bug?

curently, my workaround looks like this but would prefer json

const q = await vg.coordinator().query('FROM allregexes;', {type: "arrow", cache: false});
for (const entry of q) {
	console.log(entry.regex)
}

is this the right approach?

thanks!

Why Node.js ?

I am just wondering why the decision was made to wrap DuckDB around a node.js server.

If duckdb is a cpu process, wouldn't that block the Node.js event loop?

Update to latest version of duckdb

As a general note, I'm not sure if this is a bug in Mosaic, Electron or DuckDB but I made a reproduction here: https://github.com/mhkeller/duckdb-electron-mosaic-crash

When trying to start up the data server from Electron, it crashes with a SIGSEGV error. The line that triggers it is the call tovg.plot(args). If you launch the server via child_process.exec, it works. I'm not sure how to get more info on this so posting here in case anyone has any ideas.

I'm loading @uwdata/mosaic-duckdb via an await import command because Electron doesn't support CommonJS but maybe that esm/commonjs mismatch is still causing problems.

It could also be likely that this has nothing to do with Mosaic and is a DuckDB/Electron problem. If there is a way to isolate better which DuckDB commands are being executed, that would be helpful to narrow down the culprit.

Thanks for any help.

edit: for what it's worth, the top lines in the macos crash report are below. My inclination is this is a duckdb bug and it would be helpful to figure out what commands mosaic is sending to duckdb that are causing the crash.

0   duckdb.node                   	       0x16757b2ff node_duckdb::RunPreparedTask::Callback() + 3295
1   duckdb.node                   	       0x167565943 node_duckdb::Task::DoCallback() + 67
2   duckdb.node                   	       0x1675618f7 node_duckdb::TaskCompleteCallback(napi_env__*, napi_status, void*) + 167

Examples

Maybe with a way to run examples with your own backend.

Feature request: confidence intervals for bar charts

We have been using Mosaic for reporting on hundreds of thousands of hospital prices (demos here: https://beta.payless.health/examples/stlukes-bethlehem.html & https://beta.payless.health/examples/mount-sinai.html).

These prices are often listed according to minimum and maximum negotiated rates across several insurance products.

To accurately relay this information in a visualization, confidence intervals are necessary.

Would this be possible in Mosaic?

(In the docs I only found confidence intervals mentioned for the regressionY mark here: https://uwdata.github.io/mosaic/vgplot/#connected-marks)

Consolidate queries

Databases can perform more efficient queries when we can combine multiple requests. For example, we can request statistics in a single query with a single pass over the data path than one query per column. Similarly, we may be able to optimize the index creation using e.g. DuckDB's grouping sets. The way we could implement this is that a coordinator takes multiple queries from a client, packs them up Ito fewer SQL queries, and then unpacks the results when it receives the result from the database before returning the individual results to the clients. Clients don't have to change. So I think we can contain this functionality entirely in the client.

This could also give us more parity with some of the optimizations Falcon is doing. For example, Falcon creates filter masks that it then unions to filter tables. Right now, databases cannot use similar tricks since we send individual queries. Having said that, I am not sure yet how to do this particular optimization since we apply different filters to different aggregates. But at least we could combine the sets of queries with the same filter but different aggregates.

This is already touched upon in the paper but we could implement a few simple optimizations at least.

Add http2 server

HTTP2 is more efficient than HTTP1 and well supported by browsers.

Selection of `NULL` values does not work

When clicking on a bar representing the total amount of records where a specific column is NULL, filtering does not work.

Example:

data:
  penguins: { file: data/penguins.csv }
params:
  query: { select: crossfilter }
vconcat:
- name: total
  plot:
  - mark: barX
    data: { from: penguins, filterBy: $query }
    x: { count: '' }
  xLabel: Total →
  xDomain: Fixed
- vspace: 20
- name: sex
  plot:
  - mark: barY
    data: { from: penguins, filterBy: $query }
    x: sex
    y: { count: '' }
  - select: toggleX
    as: $query
  xLabel: ''
  yLabel: Total
  xDomain: Fixed
  yDomain: Fixed

Bildschirmfoto 2023-08-03 um 09 09 10

In above example, clicking on the right bar of lower bar chart shows a total of 0 in upper bar chart.

This bug was first mentioned in #144. Note from @jheer:

..., the issue with the NULL is a bug, due to the toggle interactors generating equality (=) comparisons rather than IS NOT DISTINCT FROM (which is null safe).

Highlight interactor does not work properly with data cube indexing

I have a simple vgplot specification with cross filter selection and two bar charts.
Bars should be highlighted when selected.

data:
  athletes: { file: data/athletes.csv }
params:
  query: { select: crossfilter }
  selectNationality: { select: single }
  selectSport: { select: single }
vconcat:
  - plot:
    - mark: barX
      data: { from: athletes, filterBy: $query }
      x: { count: }
      y: nationality
      order: nationality
      sort: {
        y: '-x',
        limit: 10
      }
    - select: highlight
      by: $selectNationality
    - select: toggleY
      as: $selectNationality
    - select: toggleY
      as: $query
      width: 600
      height: 400
    marginLeft: 80
  - vspace: 10
  - plot:
    - mark: barX
      data: { from: athletes, filterBy: $query }
      x: { count: }
      y: sport
      order: sport
      sort: {
        y: '-x',
        limit: 10
      }
    - select: highlight
      by: $selectSport
    - select: toggleY
      as: $selectSport
    - select: toggleY
      as: $query
      width: 600
      height: 400
    marginLeft: 80

Highlighting doesn't work like expected.
There'll be bars, which aren't highlighted anymore at a certain point (mark indexes from __data__ seem to get out of sync) or bars, where highlighting appears on other bars.

Reproducible examples for above specification:

  • Clicking USA on upper chart, and then clicking one of the first three bars on lower chart, bar USA will lose highlighting.
  • Clicking athletics on lower chart, and then clicking on USA, POL or UKR on upper chart, won't highlight country.

I tried different settings (with and without order attribute), but that didn't help.

When disabling cube indexing in coordinator (indexes = false), highlighting works properly.

Local parquet file not found Quarto-Observable

Hi!

I'm trying to implement mosaic in my quarto notebook via OJS (Observable) code cell. The notebook does work seamlessly via url https://uwdata.github.io/mosaic-datasets/data/flights-10m.parquet, but does not upon local file specification.

Notebook in attachment (did not get .qmd uploaded, replaced qmd by txt extension - should replaced again)
example_mosaic.txt

Returned error msg
error

In same dir where I run quarto preview example_mosaic.qmd the flights-10m.parquet file is located.

quarto version 1.4.176

any thoughts what I'm doing wrong here? Much appreciated!

rectY Mark is not updating bin step size as data is filtered

I have a data set with long min/max tails that results in step=100,000 (from the bins() function in bin.js). As I filter data with a shared Selection, the min/max tails disappear and the resulting stats should reduce to step=10,000 or less. However, the Mark retains its original min/max stats for this channel and re-uses these in bin.js, always resulting in the same (now inappropriate) step=100,000. My bar plot therefore becomes a single large bar that covers my entire filtered data set.

What I expect to happen:
As data is filtered, the min and max statistics for that channel on that Mark should be recalculated.

How do I set the Mark to recalculate min/max for its channel as a result of filtering on the Selection?

So far I've tried updating Mark class with a filterIndexable function that always returns true, but this has no effect.

Feature request: box-and-whisker plots

I am not sure whether this is feasible with Mosaic, but it would be very helpful for comparing prices across hospitals. Examples of Mosaic visualizations here:

https://www.payless.health/hospital/lehigh-valley-st-lukes-comparison
https://www.payless.health/hospital/indiana-university
https://www.payless.health/hospital/parkview-health

Box-and-whisker plots would enable easier visual comparison across different procedures, medications, treatments, and insurance products across different regions of the country!

Window functions do not work in query with aggregate

This is half bug half feature request. The following snippet does not currently work,

x = np.random.normal(scale=50, size=100)
y = x + np.random.normal(scale=5, size=len(x))
df = pd.DataFrame({'x_val': x, 'y': y})

MosaicWidget(
    {
        "plot": [
            {
                "mark": "ruleX",
                "y1": {"agg": "avg(y) - stddev(y)/sqrt(count(y))"},
                "y2": {"agg": "avg(y) + stddev(y)/sqrt(count(y))"},
                "x": {"expr": "ntile(10) over (order by x_val)"},
                "marker": "tick",
                "data": {"from": "df"},
            }
        ]
    },
    data={'df': df},
)

That's because the generated sql is problematic,

> SELECT avg(y) - stddev(y)/sqrt(count(y)) AS "y1", avg(y) + stddev(y)/sqrt(count(y)) AS "y2", ntile(10) over (order by x_val) AS "x" FROM "df" AS "source" GROUP BY "x"
Binder Error: GROUP BY clause cannot contain window functions!

It'd be pretty nice if the above chart snippet could work because the intent seems unambiguous to me.

A suggestion for making it work would be to modify the query generation to move the non-aggregate clause computation into a "WITH" statement where there are no restrictions on window functions, i.e.

WITH "__mosaicTemp" AS (SELECT *, ntile(10) over (order by x_val) AS "x" FROM "df") SELECT avg(y) - stddev(y)/sqrt(count(y)) AS "y1", avg(y) + stddev(y)/sqrt(count(y)) AS "y2", "x" FROM "__mosaicTemp" GROUP BY "x"

which I think should not be a perf cost in cases where it's not needed because duckdb will optimize it away (?).

How do you filter data in a particular plot?

From the documentation, I can see examples of filtering the entire dataset up-front, like in the example: https://uwdata.github.io/mosaic/examples/ in which we provide { where: "Symbol = 'AAPL'" } to filter just for AAPL.

I can't find an example or documentation of loading a larger dataset and showing subsets of it. Like, if in that example, we wanted a page in which we had AAPL, MSFT, and a few other stocks or whatever - in separate plots, how would that work? We probably wouldn't want to set up the executor again?

There's documentation for a query builder, but I don't see how this hooks up to charts: https://uwdata.github.io/mosaic/sql/

There's documentation for methods that seem like they create filtered sub-tables from parquet files: https://uwdata.github.io/mosaic/api/sql/data-loading.html

Is there a way to show a chart of data from a parquet file, but not all the data in the parquet file?

Highlight interactor sometimes highlights the wrong thing for large datasets

I haven't figured out the source of the bug, but I can trigger it pretty reliably now, so maybe someone here will have ideas. I've run into it sporadically with "real" datasets, but I can trigger it intentionally by,

  1. Making two of the bar values equal to each other.
  2. Having a large dataset (10k rows doesn't trigger, but 100k rows does)

These are needed in conjunction (i.e. equal counts small dataset is fine; large dataset unequal counts also fails but much less consistently). When I trigger the bug, it remains persistent for the chart that triggers it. Here is an example that works for me locally,

d = {
    "hconcat": [
        {"plot": [
            {"mark": "barX", "data": {"from": "df"}, "x": {"sum": "y"}, "y": "cat1"}, 
            {"select": "toggleY", "as": "$filt1"},
            {"select": "highlight", "by": "$filt1"},
        ]},
        {"plot": [
            {"mark": "barX", "data": {"from": "df", "filterBy": "$filt1"}, "x": {"sum": "y"}, "y": "cat1"},
        ]}
    ]
}
df = pd.DataFrame({'cat1': ['a', 'b', 'c'], 'y': [1., 2., 2.]})
MosaicWidget(d, data={'df': pd.concat([df]*100_000)})

When I click on the "a" bar on the LHS chart, it erroneously highlights the "c" column
image

The RHS chart is not needed for the bug to be triggered, but it shows that the "toggleY" filter is working properly (i.e. I clicked "a" and it filters only to "a"), which is why I think the bug is in the highlight interactor.

Examples have hidden setup

If I pull up the index example: https://uwdata.github.io/mosaic/examples/

I see code which looks complete, and to me suggests that if I have a file at "data/stocks.parquet", then it should work if I paste it into an HTML file and do the proper importing of the library.

This doesn't work, though, because all of the examples have this init method:

ready = wasmConnector().then(wasm => {
coordinator().logger(null);
coordinator().databaseConnector(wasm);
});

So right now if you copy an example, it'll give quite different behavior than what it has on the web - instead of working with a file, it'll try to connect to a coordinator server and fail. This seems like a stumbling block for folks trying to learn how to use this.

Trying to reproduce flights-10m.js example using MDX

This library is amazing and we are starting to use it across @onefact. Thank you!! While we could get an Observable example working (https://observablehq.com/@onefact/mosaic-cross-filter-flights-10m), @raghavbansal2005 and I have been struggling to get it working on the frontend using the following steps:

  1. Add an mdx file with the following contents: https://github.com/uwdata/mosaic/blob/main/docs/public/specs/esm/flights-10m.js, for example to pages/flights-10m.mdx as in this repo: https://github.com/onefact/nextra/blob/main/docs/pages/flights-10m.mdx and run pnpm install to install dependencies:
import * as vg from "@uwdata/vgplot";

await vg.coordinator().exec(
  `CREATE TEMP TABLE IF NOT EXISTS flights10m AS SELECT GREATEST(-60, LEAST(ARR_DELAY, 180))::DOUBLE AS delay, DISTANCE AS distance, DEP_TIME AS time FROM 'https://uwdata.github.io/mosaic-datasets/data/flights-10m.parquet'`
);

const $brush = vg.Selection.crossfilter();

export default vg.vconcat(
  vg.plot(
    vg.rectY(
      vg.from("flights10m", { filterBy: $brush }),
      { x: vg.bin("delay"), y: vg.count(), fill: "steelblue", inset: 0.5 }
    ),
    vg.intervalX({ as: $brush }),
    vg.xDomain(vg.Fixed),
    vg.marginLeft(75),
    vg.width(600),
    vg.height(200)
  ),
  vg.plot(
    vg.rectY(
      vg.from("flights10m", { filterBy: $brush }),
      { x: vg.bin("time"), y: vg.count(), fill: "steelblue", inset: 0.5 }
    ),
    vg.intervalX({ as: $brush }),
    vg.xDomain(vg.Fixed),
    vg.marginLeft(75),
    vg.width(600),
    vg.height(200)
  ),
  vg.plot(
    vg.rectY(
      vg.from("flights10m", { filterBy: $brush }),
      { x: vg.bin("distance"), y: vg.count(), fill: "steelblue", inset: 0.5 }
    ),
    vg.intervalX({ as: $brush }),
    vg.xDomain(vg.Fixed),
    vg.marginLeft(75),
    vg.width(600),
    vg.height(200)
  )
);
  1. Run pnpm dev
  2. View the error:
- error pages/flights-10m.mdx (15:12) @ $brush
- error Error [ReferenceError]: $brush is not defined
    at eval (webpack-internal:///./pages/flights-10m.mdx:38:15) {
  digest: undefined
}
  13 | import * as vg from "@uwdata/vgplot";
  14 | const MDXLayout = vg.vconcat(vg.plot(vg.rectY(vg.from("flights10m", {
> 15 |   filterBy: $brush
     |            ^
  16 | }), {
  17 |   x: vg.bin("delay"),
  18 |   y: vg.count(),

Most likely this issue is due to my being a novice at all of these frameworks and libraries! (e.g. Dominik had to explain how to run a JavaScript server instead of an index.html file like I was trying to do 🤦 ) Thank you!!

Document how to put widgets in arbitrary positions in the DOM

The vgplot DSL is great for creating a self-contained dashboard. It could be especially great for new APIs that generate it (e.g. from Python). However, many developers may want to put Mosaic widgets in an existing application or implement custom widgets in React/Svelte etc. Let's add documentation or examples for how to do that.

Docs don't make it obvious that `vgplot` exports everything

Reading the docs, a newcomer learns that the repository is subdivided into packages. This makes sense, but usually in such a setup, the packages are kept relatively independent. In Mosaic's case, vgplot reexports the modules from all the other packages, allowing the user to only use the vgplot package.

Now, one can question this approach's merits, but I am not monorepo guru and I don't know enough about your context to judge. Rather, I would propose that if the team is committed to the approach, the documentation would make it clear that this is the case.

I suspect most users will just need one package in that case, making the API reference somewhat confusing (especially in relation to the examples, which all import from vgplot only).

Need to refresh page to update Mosaic plot - non-deterministic behavior

Not sure how to solve this bug, but Mosaic visualizations are needing to be refreshed after navigating to another page on the website.

For example, clicking on https://www.payless.health/hospital/houston-methodist shows this:

image

But then navigating to this page (https://www.payless.health/hospital/parkview-health) shows this:

image

And refreshing the page shows the actual Mosaic visualization:

image

The Mosaic specs:

https://github.com/onefact/payless.health/blob/main/docs/public/specs/yaml/houston-methodist-sorted-bars.yaml
https://github.com/onefact/payless.health/blob/main/docs/public/specs/yaml/parkview-health.yaml

From the console, it looks like the first click leads to a Binder error, but then when refreshing there are no errors - so I'm confused about how to replicate this bug:

Console log from the first click:

Error: Binder Error: Referenced column "name_minimum" not found in FROM clause!
Candidate bindings: "charges.payer"
LINE 1: SELECT DISTINCT "name_minimum" AS "value" FROM "charges...
                        ^
    runQuery https://cdn.jsdelivr.net/npm/@duckdb/[email protected]/dist/duckdb-browser-eh.worker.js:11
    onMessage https://cdn.jsdelivr.net/npm/@duckdb/[email protected]/dist/duckdb-browser-eh.worker.js:10
    onmessage https://cdn.jsdelivr.net/npm/@duckdb/[email protected]/dist/duckdb-browser-eh.worker.js:24
[parse-spec.e96b9020.js:4:76051](https://www.payless.health/assets/chunks/parse-spec.e96b9020.js)
Error: Binder Error: Referenced column "min_charge" not found in FROM clause!
Candidate bindings: "source.payer"
LINE 1: SELECT "min_charge", "max_charge", "name_minim...
               ^
    runQuery https://cdn.jsdelivr.net/npm/@duckdb/[email protected]/dist/duckdb-browser-eh.worker.js:11
    onMessage https://cdn.jsdelivr.net/npm/@duckdb/[email protected]/dist/duckdb-browser-eh.worker.js:10
    onmessage https://cdn.jsdelivr.net/npm/@duckdb/[email protected]/dist/duckdb-browser-eh.worker.js:24
[parse-spec.e96b9020.js:4:76051](https://www.payless.health/assets/chunks/parse-spec.e96b9020.js)
Error: Binder Error: Referenced column "min_charge" not found in FROM clause!
Candidate bindings: "source.payer"
    runQuery https://cdn.jsdelivr.net/npm/@duckdb/[email protected]/dist/duckdb-browser-eh.worker.js:11
    onMessage https://cdn.jsdelivr.net/npm/@duckdb/[email protected]/dist/duckdb-browser-eh.worker.js:10
    onmessage https://cdn.jsdelivr.net/npm/@duckdb/[email protected]/dist/duckdb-browser-eh.worker.js:24

Open to advice, thank you!!

Review/streamline SQL helper library

  • Add agg (or similar) template literal to create custom aggregate expressions.
  • Perform code review and streamline/consolidate abstractions as needed.
  • Add more comprehensive tests.

Add Panel example

HoloViz Panel is a powerful framework for data exploration and web apps. Bokeh, Panel and IpyWidgets integrate with each other, i.e. they can use each others widgets. This includes AnyWidget and MosaicWidget.

Please document the integration to Panel. If you are interested, I would be willing to make a PR. Let me know.

Example

works-with-mosaic.mp4
# pip install panel pandas ipywidgets_bokeh mosaic-widget
import pandas as pd
from mosaic_widget import MosaicWidget

weather = pd.read_csv(
    "https://uwdata.github.io/mosaic-datasets/data/seattle-weather.csv",
    parse_dates=["date"],
)

specification = {
    "meta": {
        "title": "Seattle Weather",
        "description": "An interactive view of Seattle’s weather, including maximum temperature, amount of precipitation, and type of weather. By dragging on the scatter plot, you can see the proportion of days in that range that have sun, fog, drizzle, rain, or snow.\n",
        "credit": "Based on a [Vega-Lite/Altair example](https://vega.github.io/vega-lite/examples/interactive_seattle_weather.html) by Jake Vanderplas.",
    },
    "params": {
        "click": {"select": "single"},
        "domain": ["sun", "fog", "drizzle", "rain", "snow"],
        "colors": ["#e7ba52", "#a7a7a7", "#aec7e8", "#1f77b4", "#9467bd"],
    },
    "vconcat": [
        {
            "hconcat": [
                {
                    "plot": [
                        {
                            "mark": "dot",
                            "data": {"from": "weather", "filterBy": "$click"},
                            "x": {"dateMonthDay": "date"},
                            "y": "temp_max",
                            "fill": "weather",
                            "r": "precipitation",
                            "fillOpacity": 0.7,
                        },
                        {
                            "select": "intervalX",
                            "as": "$range",
                            "brush": {"fill": "none", "stroke": "#888"},
                        },
                        {
                            "select": "highlight",
                            "by": "$range",
                            "fill": "#ccc",
                            "fillOpacity": 0.2,
                        },
                        {"legend": "color", "as": "$click", "columns": 1},
                    ],
                    "xyDomain": "Fixed",
                    "xTickFormat": "%b",
                    "colorDomain": "$domain",
                    "colorRange": "$colors",
                    "rDomain": "Fixed",
                    "rRange": [2, 10],
                    "width": 680,
                    "height": 300,
                }
            ]
        },
        {
            "plot": [
                {
                    "mark": "barX",
                    "data": {"from": "weather"},
                    "x": {"count": None},
                    "y": "weather",
                    "fill": "#ccc",
                    "fillOpacity": 0.2,
                },
                {
                    "mark": "barX",
                    "data": {"from": "weather", "filterBy": "$range"},
                    "x": {"count": None},
                    "y": "weather",
                    "fill": "weather",
                    "order": "weather",
                },
                {"select": "toggleY", "as": "$click"},
                {"select": "highlight", "by": "$click"},
            ],
            "xDomain": "Fixed",
            "yDomain": "$domain",
            "yLabel": None,
            "colorDomain": "$domain",
            "colorRange": "$colors",
            "width": 680,
        },
    ],
}
mosaic = MosaicWidget(specification, data={"weather": weather})

# THE PANEL APP

import panel as pn

pn.extension("ipywidgets")

component = pn.panel(mosaic, sizing_mode="stretch_width")

pn.template.FastListTemplate(
    logo="https://panel.holoviz.org/_static/logo_horizontal_dark_theme.png",
    title="Works with MosaicWidget",
    main=[component],
).servable()

DuckDB - can it be replaced with other databases at the server?

This is a very useful package. The ReadMe says, "Explore massive datasets", but given that DuckDB is just a single file-based DB at the server (as a persistent data source), would it be able to handle large storage volumes? For example, continuously growing OLAP records every hour.

  • Wondering how important DuckDB is for this whole thing to work. Is it tightly integrated with front-end? or can it be replaced with other large-scale databases?

For example, how easy / difficult it would be to make this work with, say Clickhouse at the backend (as the data source)?

Observable Plot option `sort` not available / Static filter support

Consider a simple mosaic dashboard based on data from data/athletes.csv with a single filter (selection) and some plots. Assume one plot should show a ranking of top ten nationalities w.r.t. number of gold medals from female athletes.

import * as vg from "@uwdata/vgplot";

await vg.coordinator().exec(
  vg.loadParquet("athletes", "data/athletes.parquet")
);

const selection = Selection.intersect();

export default vg.vconcat(
  vg.menu({
    label: 'Sport',
    from: 'athletes',
    column: 'sport',
    value: 'athletics',
    as: selection
  }),
  vg.plot(
    vg.barX(vg.from('athletes', { filterBy: selection }), {
      x: vg.sum('gold'),
      y: 'nationality',
      sort: {
        y: '-x',
        limit: 10,
      },
    }),
    vg.width(600),
    vg.height(400)
  ),
  // other plots with filter `selection`...
)

Issue 1: Above code fails. Bars should be sorted (nationality with most gold medals of female athletes on top). However, option sort (see Observable Plot documentation) is not detected as Observable Plot mark option.

How to achieve sorting (and limiting) in vgplot?

There seems to be an order attribute, which is used on query level, but Observable Plot bars are sorted naturally w.r.t. lexicographical order by default. Using vg.yDomain is not an alternative here as domain depends on selection.

Issue 2: Above plot is missing a filter. For this specific plot only female athletes should be considered. Underlying SQL should look like

SELECT nationality, SUM(gold) AS gold_medals
FROM 'athletes'
WHERE sex = 'female'       /* static filter (local) */
AND sport = 'athletics'    /* from selection (global) */
GROUP BY nationality
ORDER BY gold_medals DESC
LIMIT 10

Is there a mosaic way to express this in vgplot?

I don't want to bypass coordinator logic (run a manual query and pass a data object). Failed with cloning selection, extend it and pass extended version to filterBy.

Add struct/list support for load query options.

Discussed in #181

Originally posted by fundef1 September 21, 2023
Hi, I'm trying to load a CSV and manually specify the columns.
it's a bit of a degenerate case as I only have a single column, but I can't figure out the correct format:

const csvopts = {
	auto_detect: false, 
	all_varchar: true,
	columns: "{'line': 'VARCHAR'}", 
	header: false
};

vg.coordinator().exec(vg.loadCSV("lines", `${window.location}data/lines.csv`,csvopts));

results in

Uncaught (in promise) Error: Parser Error: syntax error at or near "line"
LINE 1: ..._size=-1, all_varchar=true, columns='{'line': 'VARCHAR'}', header=false,...

the quoting of columns looks off and I've tried different variants of quoting the columns option but just can't get it to work.

any hints? thanks!

We should expand the load* methods to properly handle object (-> struct) and array (-> list) option values.

Axis scale includes null value counts

When plotting a binned bar chart, I noticed that the y axis scale is affected by the presence of nulls. Is this expected?

For example, the following chart does not include null values and the y axis has a max of 1:

await vg.coordinator().exec(vg.loadObjects("testData", [{ colA: 1 }, { colA: 2 }]));

vg.plot(
  vg.rectY(vg.from("testData"), {
    x: vg.bin("colA"),
    y: vg.count(),
    inset: 0.5,
  }),
  vg.height(200)
);
Screenshot 2023-11-29 at 6 19 58 PM

Whereas this data has null values and the y axis max is 6

 await vg.coordinator()
    .exec(
      vg.loadObjects("testData", [
        { colA: 1 },
        { colA: 2 },
        { colA: null },
        { colA: null },
        { colA: null },
        { colA: null },
        { colA: null },
        { colA: null },
        { colA: null },
      ])
    );

 
vg.plot(
  vg.rectY(vg.from("testData"), {
    x: vg.bin("colA"),
    y: vg.count(),
    inset: 0.5,
  }),
  vg.height(200)
);
Screenshot 2023-11-29 at 6 19 29 PM

I would expect these two plots to look the same, however it seems the presence of nulls is affecting the axis scale. One workaround I've found is to first create a view that filters out nulls before plotting but curious if this should not be the default behavior?

Support web-standards platforms for the data server

It'd be nice to be able to run the mosaic data server on Cloudflare Workers or Deno, and it seems kind of feasible from how it's set up currently - the only big change would be the native Node.js duckdb module, which would have to be swapped for the WASM port.

Feature request: export csv/Excel from a filter/search operation run against a Mosaic table?

What would be the easiest way to implement export of a csv or Excel file from a Mosaic filter result against a WASM duckdb database in a parquet file?

We have been using Mosaic to display hospital prices.

For example these 106k prices at a large hospital in NYC:

https://beta.payless.health/examples/mount-sinai.html (code: https://github.com/onefact/beta.payless.health/blob/main/docs/public/specs/yaml/mount-sinai.yaml; code to generate parquet file here)

We've had several requests from journalists and policymakers who are not able to use python.

Specifically, they need to export a csv or use Excel to be able to make visualizations based off a query in Mosaic or a filtering result in a table. After this export, they usually use Excel or R (a very small number use python) to make visualizations before publishing a story.

For example, all of the prices related to C sections is a common query that a journalist or policymaker would want exported into a CSV:

image

(This demo is based off of the awesome Mosaic example here: https://github.com/uwdata/mosaic/blob/1590456b08bcb0bf4dd28ab8ac909b05a9ec0a69/docs/public/specs/yaml/athletes.yaml)

This potential feature would go a long way toward helping non-specialist users of Mosaic visualizations and dashboards analyze data further without the need to learn duckdb, python, etc.

This would enable data-driven journalism based off Mosaic filtering operations for a large number of newsrooms/journalists in our network @onefact who currently require training in Jupyter notebooks such as this one before being able to tell stories.

How to add a `y=x` line?

At @onefact we are building on Mosaic for dashboards of hospital prices, like this one: https://beta.payless.health/examples/mount-sinai.html

YAML that generates this: https://github.com/onefact/beta.payless.health/blob/main/docs/public/specs/yaml/mount-sinai.yaml

I've read through the docs and can't figure out how to add a line for y = x (on the diagonal) to illustrate what an efficient market should look like.

I see the RuleX and RuleY and line in the docs: https://github.com/search?q=repo%3Auwdata%2Fmosaic%20rulex&type=code but not sure how to enter a formula for display with the identity function.

Thank you for enabling this dashboard in the first place!! It's helping us communicate how to reduce the price of health care toward a more efficient, accessible market 🙏

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.