
datamodel's Introduction


What is DataModel?

DataModel is an in-browser representation of tabular data. It uses WebAssembly for high performance and works seamlessly with any JavaScript library. It supports Relational Algebra operators which enable you to run select, group, sort (and many more) operations on the data.

The current version performs all data operations, like filtering, aggregation, etc., in WebAssembly, which gives a 10x performance boost compared to the old JavaScript version.

It is written in Rust to handle computation-intensive data operations and compiled to WebAssembly, thereby providing native-like performance for data operations.

DataModel can be used whenever you need an in-browser tabular data store for data analysis, visualization, or general-purpose data handling.

Features

  • 🎉 Supports Relational Algebra operators, e.g. selection, projection, group, calculateVariable, sort, etc., out of the box.

  • 💎 Every operation creates an immutable DataModel instance and builds a Directed Acyclic Graph (DAG), which establishes automatic interactivity.

  • 🚀 Uses WebAssembly for handling huge datasets and for better performance.

  • ⛺ Also works in a Node.js environment out of the box.

Installation

CDN

Insert the DataModel build into the <head>:

<script src="https://cdn.jsdelivr.net/npm/@chartshq/datamodel@3.0.0/dist/browser/datamodel.js" type="text/javascript"></script>

NPM

Install DataModel from NPM:

$ npm install --save @chartshq/datamodel

As we're using a Web Worker internally, the worker-loader also needs to be installed:

$ npm install worker-loader --save-dev

Then, within your webpack configuration object, add worker-loader to the list of module rules, like so:

module.exports = {
  module: {
    rules: [
      // Add the following object to your module `rules` list.
      {
        test: /\.worker/,
        include: /datamodel/,
        loader: 'worker-loader',
        options: {
          inline: false, // If you want to make it inline, set to true.
          fallback: true
        },
      },
    ],
  }
};

You can also check out our datamodel-app-template to try out DataModel quickly through a boilerplate app.

Getting Started

Once the installation is done, please follow the steps below:

  1. Prepare the data and the corresponding schema:
// Prepare the schema for data.
const schema = [
  {
    name: 'Name',
    type: 'dimension'
  },
  {
    name: 'Maker',
    type: 'dimension'
  },
  {
    name: 'Horsepower',
    type: 'measure',
    defAggFn: 'avg'
  },
  {
    name: 'Origin',
    type: 'dimension'
  }
]

// Prepare the data.
const data = [
   {
    "Name": "chevrolet chevelle malibu",
    "Maker": "chevrolet",
    "Horsepower": 130,
    "Origin": "USA"
  },
  {
    "Name": "buick skylark 320",
    "Maker": "buick",
    "Horsepower": 165,
    "Origin": "USA"
  },
  {
    "Name": "datsun pl510",
    "Maker": "datsun",
    "Horsepower": 88,
    "Origin": "Japan"
  }
]
  2. Import DataModel as follows:

If you are using the npm package, import the package as below:

import Engine from '@chartshq/datamodel';

If you are using it in Node.js, require it as below:

const Engine = require('@chartshq/datamodel').default;

If you are using CDN, then use it as follows:

const Engine = window.DataModel;
  3. Load the DataModel engine, pass the data and schema to the DataModel constructor, and create a new DataModel instance:
// As the DataModel APIs are asynchronous, we need to
// use async-await syntax.
async function myAsyncFn() {
  // Load the DataModel module.
  const DataModel = await Engine.onReady();

  // Converts the raw data into a format
  // which DataModel can consume.
  const formattedData = await DataModel.loadData(data, schema);

  // Create a new DataModel instance with
  // the formatted data.
  const dm = new DataModel(formattedData);

  console.log(dm.getData().data);
  // Output:
  //  [
  //     ["chevrolet chevelle malibu", "chevrolet", 130, "USA"],
  //     ["buick skylark 320", "buick", 165, "USA"],
  //     ["datsun pl510", "datsun", 88, "Japan"]
  //  ]

  // Perform the selection operation.
  const selectDm = dm.select({ field: 'Origin', value: 'USA', operator: DataModel.ComparisonOperators.EQUAL });
  console.log(selectDm.getData().data);
  // Output:
  //  [
  //     ["chevrolet chevelle malibu", "chevrolet", 130, "USA],
  //     ["buick skylark 320", "buick", 165, "USA]
  //  ]

  // Perform the projection operation.
  const projectDm = dm.project(["Origin", "Maker"]);
  console.log(projectDm.getData().data);
  // Output:
  //  [
  //     ["USA", "chevrolet"],
  //     ["USA", "buick"],
  //     ["Japan", "datsun"]
  //  ]

  console.log(projectDm.getData().schema);
  // Output:
  //  [
  //     {"name": "Origin","type": "dimension"},
  //     {"name": "Maker","type": "dimension"}
  //  ]
}

myAsyncFn()
  .catch(console.error.bind(console));
  4. Finally, dispose of the DataModel instance when it is no longer needed:
// This also disposes of all the DataModels created from it.
dm.dispose();

Documentation

Find the detailed documentation and API reference here.

What has changed?

DataModel 3.0.0 has its core written in Rust and compiled to WebAssembly, bringing a huge performance improvement over the previous version in terms of both data size and computing speed. The JavaScript version is deprecated and will see no active development, but critical bugs, if raised, will be fixed and released on GitHub only.

You can visit the deprecated JavaScript version here: https://github.com/chartshq/datamodel-deprecated

Migrating from previous versions of DataModel

DataModel is now asynchronous, as opposed to the synchronous API of the previous JavaScript version.

import Engine from '@chartshq/datamodel';

(async () => {
  // Load the DataModel module.
  const DataModel = await Engine.onReady();

  // Converts the raw data into a format
  // which DataModel can consume.
  const formattedData = await DataModel.loadData(data, schema);

  // Create a new DataModel instance with
  // the formatted data.
  const dm = new DataModel(formattedData);
})();

Changed APIs

  • select

    DataModel deprecated version:

    dm.select((fields) => {
      return fields.Origin.value === 'USA';
    });

    Latest version:

    dm.select({
      field: 'Origin',
      operator: DataModel.ComparisonOperators.EQUAL,
      value: 'USA'
    });
  • groupBy

    DataModel deprecated version:

    dm.groupBy(['Origin'], {
      Acceleration: 'avg'
    });

    Latest version:

    dm.groupBy(['Origin'], [{
      aggn: DataModel.AggregationFunctions.AVG,
      field: 'Acceleration'
    }]);

Supported data operations (a usage sketch follows this list):

  • select
  • project
  • calculateVariable
  • sort
  • groupBy
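
A minimal sketch combining these operations, assuming the data and schema from the Getting Started section and the groupBy signature from the Changed APIs section; the sort signature follows the example shown in the issues below and is an assumption about the released API:

// Usage sketch; field names come from the Getting Started example.
async function operationsSketch() {
  const DataModel = await Engine.onReady();
  const dm = new DataModel(await DataModel.loadData(data, schema));

  // Sort by Horsepower in descending order.
  const sortedDm = dm.sort([['Horsepower', 'desc']]);

  // Group by Origin, averaging Horsepower.
  const groupedDm = dm.groupBy(['Origin'], [{
    aggn: DataModel.AggregationFunctions.AVG,
    field: 'Horsepower'
  }]);

  console.log(sortedDm.getData().data);
  console.log(groupedDm.getData().data);
}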

Upcoming data operations:

  • join
  • bin
  • compose
  • union
  • difference
  • ... many more ...

For more details on APIs visit our docs.

License

Custom License (Free to use)

datamodel's People

Contributors

adarshlilha, adotg, mridulmeh, ranajitbanerjee, rousan, sandeep1995, subhash-halder, sushrut141, ud-ud


datamodel's Issues

Compose API consistency

Compose should support sort and calculateVariable as well. Any operator which takes one operand should be supported.

Load Data from Remote URL

DataModel should provide an option to load data from a remote URL or an API that serves data in JSON or CSV format.
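
As a sketch of what this could look like today, remote JSON can already be fetched and fed into the existing loadData API; the helper name, the URL, and the JSON shape below are assumptions:

// A minimal sketch, assuming the endpoint returns an array of row objects.
// loadFromUrl is illustrative; only Engine.onReady and DataModel.loadData
// are existing APIs.
async function loadFromUrl(url, schema) {
  const DataModel = await Engine.onReady();
  const response = await fetch(url);
  const rows = await response.json();
  const formattedData = await DataModel.loadData(rows, schema);
  return new DataModel(formattedData);
}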

API for serialization of DataModel

Serialize a DataModel to JSON, CSV, or a 2D array.

If no parameter is passed, use the format of the user's input data automatically.

Currently, DataModel only has getData, which generates a data format meant for internal consumption.
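
Until a serialization API exists, a rough workaround can be built on getData(), whose { schema, data } output shape is shown in the Getting Started section. The toCsv helper below is purely illustrative and not part of DataModel:

// Illustrative only: derive a CSV string from getData()'s { schema, data }
// output. Assumes values contain no commas or newlines.
function toCsv(dm) {
  const { schema, data } = dm.getData();
  const header = schema.map(field => field.name).join(',');
  const rows = data.map(row => row.join(','));
  return [header, ...rows].join('\n');
}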

Correct bin function

bin does not work the way it should.

Binning should be

  • uniform
  • non-uniform

Uniform binning is just a special case of non-uniform binning.

Uniform binning can be configured by

either mentioning binSize

bin({
    binSize: /* size of each bin */
    start: /* starting of a bin, if not given the lower end of data domain */
})

or mentioning binsCount

bin({
    binsCount: /* total number of bin */
    start: /* starting of a bin, if not given the lower end of data domain */
    end: /* ending of a bin, if not given the upper end of data domain */
})

Non-uniform binning can be defined by specifying the bin boundaries themselves:

bin({
    config: [10, 20, 30, 40, 50, 100, 110]
    start: /* starting of a bin, if not given the lower end of data domain */
    end: /* ending of a bin, if not given the upper end of data domain */
})

For non-uniform binning

if start is not mentioned
    and if config[0] is greater than lower end of data domain then
         [lower_end_of_data_domain, 10, 20, ..., 110]
    else
        [10, 20, ..., 110]
else
    if start is greater than lower end of data domain then
        start = lower_end_of_data_domain
    if config[0] is greater than start then
        [start, 10, 20, ..., 110]
    else 
       [10, 20, ..., 110]

The converse applies to end and the upper end of the data domain.

This operator internally creates a binned dimension field.
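
For example, assuming a hypothetical data domain of [5, 120]: with config set to [10, 20, 30, 40, 50, 100, 110] and neither start nor end given, the pseudocode above would yield the boundaries [5, 10, 20, 30, 40, 50, 100, 110, 120], since both ends of the domain lie outside the configured bins.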

Add capability to split DataModel row-wise and column-wise given particular conditions

ROW WISE:
By adding the capability to split DataModels, faceting can be done directly through it, without forming intermediate DataModels.

Splitting will be done by a particular set of dimensions (faceting), plus an additional reducer function.

It should also support the compose operation, where all the split DataModels apply the next function in the compose chain.

Examples:
Pseudo Code:

newSetOfDataModels = DataModel.splitByRow(dimensions, reducerFn)

  1. Split By Dimension -
    Assume, DataModel has field "Country" with values "USA", "Japan" and "India"
    then
    newSetOfDataModels = DataModel.splitByRow(['Country'])

  2. Custom Function -
    newSetOfDataModels = DataModel.splitByRow(['Country'],(fields)=>{ return fields.Country.value !== "India" })

  3. Another example:
    If the data has state as well:
    newSetOfDataModels = DataModel.splitByRow(['Country', 'State'],(fields)=>{ return fields.Country.value !== "India" && fields.State.value !== 'Texas' })

COLUMN WISE

Similarly, column-wise splitting would split the DataModel by columns, with each resulting DataModel containing the common set of fields plus one of the unique sets of fields, as given below:

newSetOfDataModels = DataModel.splitByColumn(commonFields, uniqueFields)

For example:
newSetOfDataModels = DataModel.splitByColumn(['Origin'], [['Horsepower'], ['Acceleration']])

This will give two DataModels: one with the columns Origin and Horsepower, and the other with Origin and Acceleration.

Empty / null value representation

DataModel does not handle empty (undefined), null, or non-parsable values consistently.

  • For a categorical variable, if a value is empty (undefined) or null, represent it as null by default. A user might choose to treat undefined and null separately; for example, an undefined value could be represented as 'NA' while an explicit null could have a NULL type.
  • For a temporal variable, all of the above points apply, with an additional step for non-parsable date strings.
  • For a measure, undefined, null, and non-parsable values can be treated as above.
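
A sketch of the proposed default treatment for a categorical value, with the optional separate handling of undefined; the helper and its option are illustrative, not part of DataModel:

// Illustrative sketch: by default both empty (undefined) and null become
// null; optionally undefined can be represented separately, e.g. as 'NA'.
function normalizeCategorical(value, { undefinedAs = null } = {}) {
  if (value === undefined || value === '') return undefinedAs;
  return value; // an explicit null stays null
}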

CalculatedVariable re-calculate during groupBy

When the groupBy operator is applied, a calculated variable is not re-calculated from its dependent fields; DataModel applies the regular aggregation to the calculated variable as well.

If a variable is dependent on any other variable, then:

calculatedVariable({
    /* schema goes here */
    /* no aggregation function required */
}, [dep1, dep2, (valDep1, valDep2) => {
    /* implementation */
}])

Then, when aggregation happens, the calculated variable is re-calculated after the dependent variables have been aggregated.

calculatedVariable({
    /* schema goes here */
    /* aggregation function required here */
}, [() => {
    /* implementation */
}])

If the variable does not depend on any other variable, then the aggregation function is applied.

Increase test coverage

Currently, DataModel's test coverage is below the minimum threshold (80%); write enough test cases to reach that number.

Provision for complex calculation when applying operator

For a schema like [Gender, Exercise, count], there is currently no way to compute, for example, the percentage of males who exercise daily.

For select, calculateVariable, and aggregation functions, pass a clone (a detached root) of the current DataModel instance and a store to save values:

(value, i, clonedDM, store) => {
    ...
}
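
A hypothetical usage of this proposal, computing each row's count as a share of the total; the argument order, the clonedDM/store semantics, and the column lookup are all assumptions about the proposed (not existing) API:

// Hypothetical: compute 'count' as a percentage of the overall total,
// caching the total in the proposed store object.
dm.calculateVariable(
  { name: 'countPct', type: 'measure' },
  ['count', (count, i, clonedDM, store) => {
    if (store.total === undefined) {
      const { schema, data } = clonedDM.getData();
      const countIdx = schema.findIndex(f => f.name === 'count');
      store.total = data.reduce((acc, row) => acc + row[countIdx], 0);
    }
    return (count / store.total) * 100;
  }]
);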

Select operator fails when provided parsed value

The select operator fails when a date field is selected by passing the parsed string value.

The raw value of the temporal field has to be used for the operation to succeed.

It would be better if it also worked with the parsed value.
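
For illustration, assuming a temporal field named Date (the field, its raw representation, and the date string below are hypothetical), the difference is roughly:

// Works today: select by the raw stored value of the temporal field.
dm.select({ field: 'Date', operator: DataModel.ComparisonOperators.EQUAL, value: rawDateValue }); // rawDateValue: placeholder for the field's raw value

// Fails today, but should also work: select by the parsed/display value.
dm.select({ field: 'Date', operator: DataModel.ComparisonOperators.EQUAL, value: '2019-03-25' });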

Not giving type in schema forces incorrect groupBy behaviour

    {
      "name": "Exercise"
    },
    {
      "name": "Gender",
      "type": "dimension"
    },
    {
      "name": "SexualOrientation"
    }

When groupBy is performed on Exercise or SexualOrientation, column identifiers are not generated properly, leading to incorrect behaviour when the DataModel is pushed to a canvas.

Remove multi dimensional array supports from group-by functions

In the current implementation, the group-by functions (e.g. sum, avg) support a multi-dimensional array as input, as follows:

// for sum function
sum([[1, 2, 3], [4, 5, 6]])

This support needs to be removed so that each group-by function expects a one-dimensional array as input:

// for 1D array
sum([1, 2, 3]) // returns 6

// for multi-dimensional array
sum([[1, 2, 3], [1, 2]]) // returns NaN

"unreachable" error when trying to load data with a dimension of all empty/null values

When trying to load data that has a dimension column containing only empty string values, e.g.

import Engine from '@chartshq/datamodel' // used version v3.0.0

async function f () {
  const DataModel = await Engine.onReady()
  const fd = await DataModel.loadData(
    [ { d1: '' } ], // d1: null also fails
    [ { name: 'd1', type: 'dimension' as any } ],
    {}
  )
  const dm = new DataModel(fd)
}

... an error is thrown on load:

    RuntimeError: unreachable
        at wasm-function[167]:0x242f5
        at wasm-function[188]:0x24886
        at wasm-function[192]:0x2493f
        at wasm-function[182]:0x24738
        at wasm-function[27]:0x15eab
        at wasm-function[20]:0x12a24
        at wasm-function[19]:0x12760

      at T.add_field (node_modules/@chartshq/datamodel/dist/node/2.datamodel.js:1:50456)
      at Rt (node_modules/@chartshq/datamodel/dist/node/2.datamodel.js:1:24606)
      at t.createField (node_modules/@chartshq/datamodel/dist/node/2.datamodel.js:1:30596)

Throws error when sorting field containing invalid data

When a field contains some invalid values and sorting is applied on that field, an error is thrown.
Example:

const data = [
  { origin: "USA", Acceleration: 11 },
  { origin: null, Acceleration: 11 },
  { origin: "Japan", Acceleration: null },
];

const dm = new DataModel(/* data and schema */);
const sortedDm = dm.sort([["origin", "asc" ]]);
// Throws error

Restructure field hierarchy

Currently, the different kinds of fields are created from a wrong / incomplete conceptual hierarchy.
For example, PartialField has provisions for Categorical, Temporal, etc., but no such hierarchy is present for a concrete Field, making the inheritance inconsistent.

There are two options:

  • Create a serial inheritance: PartialField <- PartialMeasure <- Continuous (same for dimensions).
  • Create dimensions and measures as mixins, and use Field and PartialField to mix with the mixins.
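
A minimal sketch of the first option as serial inheritance; the class names are taken from the description above, while the members and the dimension branch are placeholders:

// Option 1: serial inheritance (names illustrative, bodies omitted).
class PartialField {
  constructor(name, schema) {
    this.name = name;
    this.schema = schema;
  }
}
class PartialMeasure extends PartialField {}
class Continuous extends PartialMeasure {}

class PartialDimension extends PartialField {}
class Categorical extends PartialDimension {}
class Temporal extends PartialDimension {}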

Sorting should persist the parent-child relationship.

Currently, the parent-child relationship is not maintained while sorting.

Fixed:

  • persist the parent-child relationship while sorting.
  • do not create multiple children when multiple sorts are performed on the same DataModel.

API for detached root

Create an API on DataModel which creates a detached root from the current instance of DataModel.
dm.detachedRoot()

Detaching a root creates a DataModel in isolation; this DataModel has no parent or children linked to it.

Allow renaming a column or changing its type in DataModel

Do you want to request a feature or report a bug?

  • feature

What is the current behavior?
After a DataModel instance is created, there is no way to rename a column or change its type.
This feature would be very helpful in cases where we want to let the user change the type of a field
or rename it through a user interface.
It is possible to achieve the same by creating a new instance of DataModel with modified data and
schema, but for reasonably large datasets that would be computationally expensive.

What is the expected behavior?
Allow for renaming a column or changing its type.

Which versions of MuzeJS, and which browser/OS are affected by this issue? Did this work in previous versions of MuzeJS?

  • latest
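
A hypothetical shape for such an API, purely illustrative (neither method exists in DataModel today):

// Hypothetical methods, shown only to illustrate the request.
const renamedDm = dm.renameField('Maker', 'Manufacturer');
const retypedDm = dm.changeFieldType('Horsepower', 'dimension');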

Ability to use multi-sort while retaining the order of particular fields

Do you want to request a feature or report a bug?

bug/feature

What is the current behavior?

Currently, if we want to perform a multi-sort on, say, two fields, retaining the sorting order of the first field and then, within that order, sorting the second field, it is not possible.

Even applying custom functions to the sort does not achieve this result and instead gives an error.
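
For illustration, the kind of call that should retain the first field's order while ordering the second within it, using the sort signature shown elsewhere in this document, would be:

// Desired behaviour: keep the ordering produced by Origin and, within each
// Origin group, order by Horsepower.
const multiSortedDm = dm.sort([
  ['Origin', 'asc'],
  ['Horsepower', 'desc']
]);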


Sorting after groupBy changes the order of data

Sorting creates an ordering of tuples. If a sort is applied and is then followed by a groupBy, the order of the data is currently lost.
DataModel should keep the order preserved even after the groupBy.

Cases

  • If groupBy is applied after sorting, re-apply the sort on the DataModel created by the groupBy.
  • Decide what happens if a sort is applied and the groupBy removes a field that the sort was using.
