nomic-ai / deepscatter

Zoomable, animated scatterplots in the browser that scale to over a billion points

License: Other

JavaScript 3.23% HTML 7.85% GLSL 14.95% TypeScript 73.94% Shell 0.03%

data-visualization visualization webgl

deepscatter's Issues

Upgrade to Apache Arrow v8

Hopefully won't be as big a deal as the 5-7 conversion was.

Invalid NPM package.json

Exciting work!

I was trying to load this module via ESM, but there seem to be some broken links in package.json:

https://unpkg.com/browse/[email protected]/package.json lists "./dist/deepscatter.es.js" as module.

However, the "dist" directory only contains "deepscatter.js". See https://unpkg.com/browse/[email protected]/dist/

Switch to valid typescript throughout codebase

Although I started re-writing most of this as typescript several months ago, there are large sections of the code base that do not have type annotations or for which the type annotations are invalid. This is a big chore; I would guess there are on the order of 500 ts errors at the moment. Hopefully the new linting can concentrate that a bit.

Inline plots for Jupyter and R

Would be nice to be able to render these directly inside Jupyter and RStudio.

Leland McInnes suggests https://panel.holoviz.org/index.html, which seems like a good idea.

Although it would be possible to write out the tile tree structure, I suspect the easiest way would be to just pass dataframes from pandas/tidyverse to Arrow memory and draw on that, which means this issue is effectively blocked by #27

How to use for geospatial data on maps

Hi, thank you for this great project. I am wondering whether it can be used for geospatial data with GPS coordinates on a map layer such as Mapbox.

Fail gracefully on corrupted feather files or files without metadata.

Roadmap for 3D support?

Hi, I'm currently evaluating deepscatter as a potential tool to use to display brain cell data. The library is solid. We were able to get our own data displaying in the viewer, which was very exciting!

However, I noticed one of the principles was "This is a 2d library. No fake 3d.", but I wasn't quite sure what "fake 3D" meant. As part of my search for a suitable tool, I was hoping to find a library that could support both 2D and 3D datasets. Is there a roadmap for supporting 3D datasets in the future?

Publish types

We're generating type annotations in deepscatter that would be useful in applications using it, but they're not actually published in a usable location that I can see. Should be fixed.

Questions I have about webgl

Can you programmatically generate new shaders, and then insert them into an existing scene without updating the data?
Would it be feasible to have a couple dozen BufferGeometries drawn at the same time, or is it vital to pull all drawing of similar items into the same geometries?

Gracefully remove old tooltips

I'm not sure if this is expected behavior, but the tooltip currently sticks when you stop hovering on a point which is a bit distracting. I see the behavior on Firefox and Chrome on the observable notebook and https://all-of-us.benschmidt.org/

BUG: labels are incorrectly offset if x/y domain is 0-1 (fixed when, say, changed to 0-100)

For example, "Love Lagoon" should be where "loving a man" is. The offset is the same across all labels (i.e., a uniform shift).

Allow setting `tooltip_html` programmatically

Inverse of #11. Currently you can only set the tooltip as a function on Scatterplot, but not assign it through the JSON api as a string. This can reuse whatever code we come up with on click_function.

How to integrate DuckDB for text-based search & filtering

Adding my Slack convo with @bmschmidt here for public reference

Ben's general guidance:

What I would do is create a new column that is a float of ‘1’ or ‘0’ in duckdb for every tile in the dataset.

That happens by applying a transformation to the data, but you hopefully won’t need to do it manually.
But there are a couple ways: if the percentage of matches in the dataset is pretty low, you could use add_sparse_identifiers to add a set of matching ids.

You could also manually define a transformation to do it. add_tiled_column does something quite similar: it defines a new transformation on the Dataset that consumes a tile. As with that function, you probably don’t want to run a different duckdb search for every tile; instead, run a single search, and then write some logic that creates an object called records, which maps each tile identifier to a JS native Float32Array (usually not too hard to pull off an Arrow float32):

    this.transformations[field_name] = function (tile) {
      const { key } = tile;
      const length = tile.record_batch.numRows;
      // an object you could create called `records` which keys from the tile identifier to a JS native Float32 Array
      const array = records[key];
      return bind_column(tile.record_batch, field_name, array);
    };

This is a new part of the codebase and not documented yet, so let me know if that’s not enough to go on.

The point of defining transformations at the tile level like this is that it staggers the load of updating the WebGL buffers across multiple animation ticks; it would also be possible to pull back the abstraction a little bit and simply update the record batches in place by adding a column for each one all at the same time when the duckdb results came back. For less than a million rows, it’s probably about the same--but at higher numbers of points, it would get painful waiting for transformations to be applied way down the chain.

All this requires no changes to the core deepscatter code. But integrating duckdb might require getting a little into the weeds with Arrow, since on the JS side the core unit in deepscatter is basically not the ‘point’ but actually the tile, which corresponds to a single Arrow record batch.

Only in the last couple months have I settled on a strategy for altering these record batches, which is this new ‘transformations’ binding.
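
To make the "run a single search, then build `records`" step above more concrete, here is a minimal sketch. It is not deepscatter code: the function name and the `idsByTile` input (row ids per tile, in row order) are hypothetical, and it assumes the search result has already been reduced to a set of matching ids.

```typescript
// Hypothetical helper, not part of deepscatter: builds the `records` lookup
// from one search result instead of running a search per tile.
type TileKey = string;

function buildMatchRecords(
  idsByTile: Map<TileKey, readonly string[]>, // assumed: row ids per tile, in row order
  matches: ReadonlySet<string> // assumed: ids returned by a single search
): Record<TileKey, Float32Array> {
  const records: Record<TileKey, Float32Array> = {};
  idsByTile.forEach((ids, key) => {
    const column = new Float32Array(ids.length); // zero-filled by default
    ids.forEach((id, i) => {
      if (matches.has(id)) column[i] = 1; // 1 marks a match, 0 a non-match
    });
    records[key] = column;
  });
  return records;
}
```

Each value in the returned object is the per-tile array that a transformation like the one quoted above could look up by `key`.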

scatterplot.tooltip_html settings overwritten in first plot

Currently scatterplot.tooltip_html can only be set after a first draw. There needs to be some way to prevent the first call to scatterplot.plotAPI from overwriting those settings.

Currently this can only be set by awaiting promise resolution on the initial plotAPI.

Allow destruction of charts

Simply deleting resources may leave GPU resources allocated.

Restore polygon drawing behind points.

I did some Augean stables cleaning on the codebase in the last 4 months focused mostly on increasing type safety and removing half-baked features that I had experimented with. In the future, these will be more likely to live on branches.

One of the half-baked features I do want to restore was the integration with some basic mapping capabilities to add geojson (re-encoded as feather using this library.) There are a lot of cases where it's nice to add a pre-defined background polygon behind points for visualization--many geographic, but others possibly not.

Tiles with multiple record batches are silently unplotted

Feather tiles generated through something other than the recommended quadfeather repo, or edited and rewritten after generation, may contain more than one Arrow record batch; if so, the other record batches are silently dropped leading to the appearance of missing tiles. Multi-record-batch tiles should, at a minimum, throw a warning in the console.
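
A minimal sketch of the "at a minimum, throw a warning" behavior, assuming only that the parsed tile exposes a `batches` array. The `TableLike` interface and the function name are stand-ins for illustration, not deepscatter's actual types.

```typescript
// Minimal stand-in for a parsed Arrow table; only `batches` is assumed here.
interface TableLike<B> {
  batches: readonly B[];
}

// Returns the first record batch and warns when others would be dropped.
function firstBatchWithWarning<B>(
  table: TableLike<B>,
  tileKey: string,
  warn: (message: string) => void = console.warn
): B | undefined {
  if (table.batches.length > 1) {
    warn(
      `Tile ${tileKey} contains ${table.batches.length} record batches; ` +
        `only the first will be plotted.`
    );
  }
  return table.batches[0];
}
```

Passing `warn` as a parameter keeps the check testable and lets a caller escalate it to an error later.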

Regional Labels

Support overlay of labels on regions.

Support animation between points with x1 and y1 channels.

I've from time to time supported x1 and y1 aesthetic channels that allow animations to interpolate between two points. This can be nice for showing, e.g., flow of migrants between countries; flow of dollars between donors.

Most of the pieces are there to restore it.

Allow passing functions to Scatterplot.click_function

It should be possible to assign a function to Scatterplot that dictates click events as a function of the point being clicked.

Currently, you can only assign to the click function with a string passed through Scatterplot.plotAPI

Solution:

Add get click_function and set click_function on Scatterplot that invoke and set Renderer.click_function.
Have the Renderer getter-setter methods do a type check: if they’re being given a function instead of a string, put dummy values in this.scatterplot.prefs.click_function and this._current_click_function_string and assign the function directly to this._current_click_function.

https://github.com/CreatingData/deepscatter/blob/d1f899dd46d7d5530c7d92bfe4e3b6f97850b862/src/rendering.js#L89-L92
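
The type check proposed in the solution could look roughly like the sketch below. The class name, the placeholder string, and the use of `new Function` to compile string bodies are all assumptions made for illustration; only the `_current_click_function` and `_current_click_function_string` field names come from the proposal.

```typescript
type Datum = Record<string, unknown>;
type ClickHandler = (datum: Datum) => void;

// Hypothetical stand-in for the Renderer's click handling.
class ClickFunctionHolder {
  private _current_click_function: ClickHandler = () => undefined;
  private _current_click_function_string = "";

  get click_function(): ClickHandler {
    return this._current_click_function;
  }

  set click_function(value: ClickHandler | string) {
    if (typeof value === "function") {
      // A real function was passed: store a dummy string and keep the function.
      this._current_click_function_string = "[function]";
      this._current_click_function = value;
    } else {
      // A string was passed: treat it as a function body with one `datum` argument.
      this._current_click_function_string = value;
      this._current_click_function = new Function("datum", value) as ClickHandler;
    }
  }

  get click_function_string(): string {
    return this._current_click_function_string;
  }
}
```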

Properly handle nulls

Right now, any data marked as NULL in the arrow file is being plotted with a value (usually 0). Short of just manually setting things to -999, a really easy solution to unpacking the bitmasks is not clear to me.
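
For reference, unpacking an Arrow validity bitmap by hand is fairly short: bit `i` lives in byte `i >> 3` at position `i & 7` (least-significant bit first), and a set bit means the value is valid. The sketch below works on plain typed arrays and replaces nulls with NaN; whether NaN is a workable sentinel on the shader side is an open assumption, and the function names are illustrative.

```typescript
// True when the validity bitmap marks row `index` as non-null.
function isValid(validity: Uint8Array, index: number): boolean {
  const byte = validity[index >> 3] ?? 0;
  return (byte & (1 << (index & 7))) !== 0;
}

// Returns a copy of `values` with null rows replaced by NaN.
// A null bitmap means the column has no nulls at all.
function maskNulls(values: Float32Array, validity: Uint8Array | null): Float32Array {
  const out = new Float32Array(values); // copies the input
  if (validity === null) return out;
  for (let i = 0; i < out.length; i++) {
    if (!isValid(validity, i)) out[i] = NaN;
  }
  return out;
}
```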

Programmatically request tooltips

It would be nice to be able to spec some tooltips inside the JSON API, so that a narrative could call out one or more points without requiring mouseover.

Support dictionary values on non-quadtree datasets.

The easiest way to do this is to stop converting dictionaries to float on the shaders and do it when pushing to the regl buffers.

Benchmarking, I can convert 10,000 typed arrays of 2**16 values each from Uint32 to Float32 in 500 ms. This is close enough to "no time at all" that I think it's safe to call doing this on a worker thread a premature optimization.
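
The conversion being benchmarked is essentially a one-liner on the main thread. A minimal sketch (the function name is illustrative, not from the codebase):

```typescript
// Converts Arrow dictionary codes (unsigned ints) to floats for a GPU buffer.
function dictionaryCodesToFloat32(codes: Uint32Array): Float32Array {
  return Float32Array.from(codes);
}
```

One caveat worth keeping in mind: Float32 represents integers exactly only up to 2**24 (16,777,216), so this is lossless for small dictionaries but not for arbitrary Uint32 values.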

Filter lambda expressions failing to update filters correctly in some cases

Overall, behaviour is somewhat consistently wrong, with some blips, hinting at perhaps some sort of inconsistent state update or race condition.

Example:

The following updates are executed in order for this codebase https://github.com/davidnmora/lyric-viz which is deployed via GitHub pages here (reproduce easily using the categorical dropdown menu):

lambda: 'generic_genre => generic_genre === "HIP_HOP"' filters correctly
'generic_genre => generic_genre === "HIP_HOP" || generic_genre === "ROCK"' fails to trigger any update to the chart
'generic_genre => generic_genre === "HIP_HOP" || generic_genre === "ROCK" || generic_genre === "UNCATEGORIZED"' filters correctly

Note: I've observed this with both passing in functions as well as passing in strings to be converted to functions.

In other cases, toggling between filtering on 1 vs 2 categories triggers a visible re-painting of the scatter plot, but the plot is stuck on a 2-category filter state regardless. Below: red & blue categories are present, despite the current state being set to just show the blue category:

Refactor to allow non-quadtree arrow datasets

The visualization vocabulary here is better than most other webgl scatterplot libraries, and most datasets don't have more than a few million points anyway.

So there should be a subclass of Tile that accepts a buffer of Arrow IPC as an argument and returns a set of tiles corresponding to the record batches. This would not be a single tile representation because the fastest and most memory-efficient approach would be to use the chunks already existing on the arrow dataframe.

For now users would be responsible for serializing an arrow table themselves; I can make an observable notebook showing how to do this from a CSV using arquero, but I don't want to have to shove arquero into the already-large deepscatter library.

Chore: Rename master branch to main

Posting as an FYI. I'll probably do this as part of the next non-trivial version bump.

Allow resizing window

Tracking: #81

Add bound parameters to mouseover and click_functions.

Currently 'click' lets you define a function on datum, but 'mouseover' just creates a tooltip that is unaware of the local styling environment.

mouseover should instead accept a function-defining string with the arguments (datum, event, $), as in arquero, where $ refers to a bindable context on plot to allow safe (non-eval) materialization of the strings, and event is the d3 event context.

plot.params.select = d3.select;
plot.params.make_tooltips = function (datum, selector) {
  const values = [...Object.entries(datum)];
  const dl = d3.select(selector).selectAll("dl").data([1]);
  // ... yadda yadda yadda ... populate <dt> and <dd> items from `values`.
};
"mouseover": '$.select(".tooltip").style("transform", "translate(${datum.x}, ${datum.y}"); $.make_tooltips('

Change tile naming conventions

I believe the old code here uses d3.zoom levels as tile names. But it should use [0, 1, 2, 3...] as the levels like a normal map tile. The new tiler writes out that way.
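
As an illustration of map-style level naming, the sketch below assumes tile keys of the form `z/x/y` with integer levels starting at 0. The key layout and function names are assumptions borrowed from the usual map-tile convention, not a statement of what the tiler writes.

```typescript
interface TileCoord {
  z: number;
  x: number;
  y: number;
}

// Parses a "z/x/y" key, rejecting anything that is not three integers.
function parseTileKey(key: string): TileCoord {
  const parts = key.split("/").map(Number);
  const [z, x, y] = parts;
  if (
    parts.length !== 3 ||
    z === undefined ||
    x === undefined ||
    y === undefined ||
    !parts.every((p) => Number.isInteger(p))
  ) {
    throw new Error(`Not a z/x/y tile key: ${key}`);
  }
  return { z, x, y };
}

// The four children sit one level down, at doubled x and y plus an offset of 0 or 1.
function childKeys(key: string): string[] {
  const { z, x, y } = parseTileKey(key);
  const children: string[] = [];
  for (const dy of [0, 1]) {
    for (const dx of [0, 1]) {
      children.push(`${z + 1}/${2 * x + dx}/${2 * y + dy}`);
    }
  }
  return children;
}
```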

.points() should include sidecar columns

Currently calling .points() does not return any columns that are loaded as sidecars

Investigate colorpicking failures

The code to identify points on mouseover seems to be breaking right now at points with especially high index numbers. So while mouseover works well for the first 4,000,000 or so points in a plot, points beyond that are not always selectable.

I suspect the problem is that the unique identifiers being generated by painting to the color canvas aren't unique enough, and that offscreen points are being identified instead. Need to check, though.
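
One way to test the uniqueness hypothesis is to round-trip ids through the same byte packing the picking pass uses. The sketch below shows a generic 32-bit id packed into four 8-bit color channels; it is not necessarily how deepscatter encodes ids, and the function names are illustrative.

```typescript
// Packs an unsigned 32-bit id into four bytes (r = lowest byte).
function idToRgba(id: number): [number, number, number, number] {
  return [id & 0xff, (id >>> 8) & 0xff, (id >>> 16) & 0xff, (id >>> 24) & 0xff];
}

// Reassembles the id from four bytes read back from the picking buffer.
function rgbaToId(r: number, g: number, b: number, a: number): number {
  return (r | (g << 8) | (b << 16) | (a << 24)) >>> 0;
}
```

If only three channels are available, the same scheme tops out at 2**24 distinct ids, so any encoding that passes through floats or drops a channel is worth checking against the largest index in the plot.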

Spots disappear when zoomed out

Thanks for open sourcing, so far works great. Have noticed a small artefact, see:

artefact_vid.mov

Notice 3 spots that disappear when we zoom out and appear when zoomed in. There's roughly 10M points there. Default quadfeather flags. Latest deepscatter.

Allow resizing parent container to change chart size.

It should be possible to resize the parent container and have the chart resize. This will involve refactoring a lot of the transform matrix code.

plot.dim('color').scale undefined with certain color ranges.

plot.dim("color").scale should be defined for all supported colormaps. But while it works for viridis and plasma, it doesn't for puor. I suspect this has to do with the weird uppercase/lowercase logic for color brewer scales.

Add opacity to aesthetic channels.

Currently opacity can only be set globally for all points, but there are cases where you want to have a subset of points at full opacity, e.g. #35. This is pretty straightforward to implement.

Choose a JS testing framework that works well with vite.

Does this have enough UI elements that it needs storybook? What are the cool people using nowadays? Who are the cool people nowadays, anyway?

Switch file format from '.feather' to '.arrow'

I've used '.feather' as the file format here, but over the last couple years the community has standardized around '.arrow' instead, to the point where the mailing list is now discussing deprecating the 'feather' methods entirely and reminding me that ".arrow is the official registered extension."

This will only get worse as time goes on, so will be changed in the next version bump. This is going to cause some breakage and will probably need an ability to specify, in plots, use of the '.feather' extension.

SVG dot doesn't include jitter applied in WebGL

How to reproduce:

Encode aesthetics x or y to a given field & render the scatterplot. Hover works properly.
Update to a different field. Do this by directly setting a new x or y property (using x0 or y0 doesn't make a difference).
Now hover over a point. Notice that the SVG dot renders at the previous coordinates, not the new ones based on the new field.

The underlying issue:

Imagine we change from encoding y as y_field_1 to y_field_2.

After the update, we'd see that the y aesthetic instance seems to remain out of date:

deepscatter/src/interaction.ts

Line 179 in 195e844

const y_aes = renderer.aes.dim('y').current;

...

deepscatter/src/interaction.ts

Lines 229 to 230 in 195e844

 .attr('cx', (datum) => x_(x_aes.value_for(datum))) 

 .attr('cy', (datum) => y_(y_aes.value_for(datum))),

where the value of y_aes.field === 'y_field_1' still.

This is surprising because if you check during execution field is being correctly updated here:

deepscatter/src/Aesthetic.ts

Line 451 in 195e844

this.field = encoding.field;

TODO: look further into why that update isn't showing up later when we're setting the SVG circle position.

Images of an example of updating `y`:

Note in the second image, I'm hovering the new location of the purple dot, as indicated by a correctly positioned tooltip (which uses mouse position, I think)

Prettier formatting

Switching over to prettier formatting.

Example request: 3d scatter plot

Is it possible?

Use deterministic randomness for 'shufbow' scheme
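
A minimal sketch of what deterministic randomness could look like here, assuming 'shufbow' amounts to shuffling a list of colors. It uses a small seeded generator (the widely circulated mulberry32) with a Fisher-Yates shuffle; the function names and the choice of generator are illustrative, not what the library does today.

```typescript
// Small seeded PRNG (mulberry32); returns floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded generator; does not mutate the input.
function seededShuffle<T>(items: readonly T[], seed: number): T[] {
  const out = items.slice();
  const random = mulberry32(seed);
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}
```

With a fixed seed, the same color order comes back on every page load, so a category keeps its color between sessions.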

Minify build?

Question from @thatandromeda.

The lib is currently pretty big. I haven't really worried about it because the whole program is such a data hog--but it would be nice to shrink the bundle when being loaded directly.

Vite's docs state that "Note the build.minify option is not available when using the 'es' format in lib mode." I'm not sure if this is simply because they haven't gotten around to it yet, or a matter of principle--i.e., that minification is something that should happen in a downstream application in a bundle process rather than in the distributed bundle. If so, it's possible that it might be worth shipping a minified copy as well.

Ordinarily I'd want the minified copy to also be transpiled into es2015 or something equivalent, but my impression is that the web workers involved in this make that more or less of a non-starter.

Stop logging in prod mode.

This thing is way too chatty for debugging purposes; could try to keep clearing it out, but better would be to switch entirely to a module that only logs routine events when running in dev mode.

Noted by @thatandromeda
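
A minimal sketch of a logger that stays quiet outside dev mode. The `makeLogger` name and its shape are hypothetical; how `devMode` gets set is left to the caller (in a Vite build, `import.meta.env.DEV` would be one candidate).

```typescript
type Sink = (...args: unknown[]) => void;

// Routine events go through `debug` and are dropped unless devMode is true;
// warnings are always emitted.
function makeLogger(devMode: boolean, sink: Sink = console.log) {
  return {
    debug: (...args: unknown[]): void => {
      if (devMode) sink(...args);
    },
    warn: (...args: unknown[]): void => {
      sink(...args);
    },
  };
}
```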

Points Visibility

I have a question regarding points visibility. I've implemented a filter and I would like to make points that match this filter visible. These points do appear when I zoom in, but I would prefer if they were visible at the default zoom level.

Based on issue #77, I suspect that the visibility of these points may be associated with the ix value. I am wondering if adjusting the ix value would make the points visible at the default zoom level. Could you please advise if this is the appropriate approach for achieving this?

For additional context, I'm using the Deepscatter version available in this example: https://observablehq.com/d/cae8e4a3a8b7d4db#deepscatter. Additionally, I have all of the tiles prefetched using plot._root.root_tile.download_to_depth.

I look forward to your guidance.

Initial test suite.

Let's start small. There needs to be a hello-world test that at least ensures some points get drawn to the screen.

I settled on playwright for testing here: #18.

The first tests will work on Arrow Datasets, not quadtree Datasets (which are slightly more complicated) so this is blocked by #27.

Refactor

This code is a mess; hastily translated from D3 V3 to V4 while staying in a closure. I want it to be in ES6.

The basic classes would be:

PointSet: A wrapper for a collection.
- area_search; for pointover events, etc.
- filter: return a filtered set. This might create a new PointSet, or might just yield an iterator over points. Probably the former, though, because we don't want to have to run complicated filters inside a render loop. Child point sets might sometimes inherit things like quadtrees from their parent.
- labels: Bundle the labelling algorithm.
ViewController: Handle D3 zoom related stuff, which can be a PITA, and translate it into things like a threejs camera angle.
- render(PointSet, Renderer): the actual drawing loop.
- Potentially rendering aesthetics type stuff also lives here.
Renderer: Handle the actual rendering, probably over a list of points or maybe an iterator off of the PointSet.
- Method for explanatory popups.
- Might be several different ones; an SVG renderer to do high level labels, a threejs one for points.
DataTile: Manage data tiles on the server. Data tiles and quadtrees are basically the same thing with a data node of 1,000 points. It might make sense to fully replicate d3 quadtree attributes so that we can do things like 'visit' or 'search' within them.
- fetch: pull a given datatile from the server.

Selection or lasso tool?

Hi again,

I wanted to see if there are any plans for selecting multiple points either through a lasso tool or a simple bounding box.

If not, do you think that is something that a third party could easily build on top of deepscatter?
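
For a sense of what a third-party lasso would involve, the core test is a point-in-polygon check over whatever point coordinates are available. The sketch below uses the standard ray-casting test on plain `[x, y]` pairs; how the points and the lasso outline are obtained from the plot is outside its scope, and the function names are illustrative.

```typescript
type Point = readonly [number, number];

// Ray casting: counts how many polygon edges a horizontal ray from the point crosses.
function pointInPolygon(point: Point, polygon: readonly Point[]): boolean {
  const [x, y] = point;
  let inside = false;
  for (let i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
    const a = polygon[i];
    const b = polygon[j];
    if (!a || !b) continue;
    const [xi, yi] = a;
    const [xj, yj] = b;
    const crosses = yi > y !== yj > y && x < ((xj - xi) * (y - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// Returns the indices of all points that fall inside the lasso outline.
function selectInLasso(points: readonly Point[], lasso: readonly Point[]): number[] {
  const selected: number[] = [];
  points.forEach((p, i) => {
    if (pointInPolygon(p, lasso)) selected.push(i);
  });
  return selected;
}
```

A bounding-box selection is the same idea with a four-vertex polygon.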

Improve passing colormaps.

Currently the handling of colorscales for categorical variables sucks. @AndriyMulyar was asking about this on the slack more delicately than I'll put it here, but:

If you want to colorize by a categorical field called 'classification,' you have to do this:

encoding: {
   color: {
      field: "classification",
      range: "category10",
      domain: [-2047, 2047]
   }
}

There are a few problems here:

1. Oh my god domain: [-2047, 2047] is terrible. This reflects the underlying encoding of arrow dictionary codes. But there's no reason any user should care about this, because the actual dictionary codes are going to be [0, 4095] and I only cast it down for reasons involving half-precision floating point performance that should be completely hidden. It's probably best to let this be fully implicit; but it would be better to have a string shortcut or something. Also, it's not even right: there are only 4095 integers in that range, one fewer than the 4096 dictionary codes, so something around the 2048th will be duplicated.
2. There's no defined way to access the scale client-side. Wanting to get at the scale for legends, etc. is a super-common use case.
3. The order of dictionary fields cannot be defined, because quadfeather may shuffle them around. So if you have six values called ['worst', 'bad', 'fine', 'good', 'better', 'best'] it's not possible to guarantee colors for them in that order.
4. If someone wants to create their own colormap, it's very hard to do so. I think you can pass [r, g, b] values in; but it's actually fairly hard.

So a better solution would be to allow either of the following syntaxes.

encoding: {
   color: {
      field: "classification",
      range: "category10" 
    }
}

Where the fact that 'classification' is categorical implicitly triggers a domain over the range [-2047, 2046] in a way that never mentions those numbers.

encoding: {
   color: {
      field: "classification",
      range: ['#FFEEBB', '#226600', ...],
      domain: ['comedy', 'sci-fi', ....]
   }
}

@AndriyMulyar suggests passing a map or a d3-scale to the argument, but I'm inclined to only do it this way, which is more grammar-of-graphics standard; it's pretty easy to cast to separate domain and range from either of those.
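
To illustrate the "pretty easy to cast" point, here is a sketch of splitting a map into the separate domain and range arrays proposed above, plus a plain lookup that a legend could reuse. Both function names are hypothetical, and the fallback color is an arbitrary choice.

```typescript
// Splits a category-to-color map into parallel domain and range arrays.
// Map iteration follows insertion order, so the caller controls the ordering.
function splitColorMap(map: ReadonlyMap<string, string>): { domain: string[]; range: string[] } {
  return { domain: Array.from(map.keys()), range: Array.from(map.values()) };
}

// Builds a categorical scale; the range is recycled if it is shorter than the domain.
function categoricalColor(
  domain: readonly string[],
  range: readonly string[],
  fallback = "#cccccc"
): (value: string) => string {
  const lookup = new Map<string, string>();
  domain.forEach((d, i) => {
    const color = range.length > 0 ? range[i % range.length] : undefined;
    lookup.set(d, color ?? fallback);
  });
  return (value) => lookup.get(value) ?? fallback;
}
```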

Add data description to tiler.js

The new tiler program runs in node.js and is generally smarter and simpler than the python one. Rather than try to work across the whole tree at once, it dumps to local overflow files well down the tree once it has too many open files, which means it can pick up work in a smart way. It should scale to billions of points comfortably, unless I don't understand how to close files in node. It just needs to write out the same metadata and it should be good to go.

Add Foreground/background aesthetic

In selection contexts, it's desirable to have selected points appear with the default encoding, with a background layer behind those points being gray with high opacity and not (probably) supporting interaction.

The best way to accomplish this is probably through two successive draw calls to ensure the front layer points are never occluded. Transitions between these two states may be initially ugly, though.
