
datamodel's Introduction


What is DataModel?

DataModel is an in-browser representation of tabular data. It uses WebAssembly for high performance and works seamlessly with any JavaScript library. It supports Relational Algebra operators which enable you to run select, group, sort (and many more) operations on the data.

The current version performs all data operations, like filtering, aggregation, etc., in WebAssembly, which gives a 10x performance boost compared to the old JavaScript version.

It is written in Rust to handle computation-intensive data operations and compiled to WebAssembly, thereby providing native-like performance for data operations.

DataModel can be used whenever you need an in-browser tabular data store for data analysis, visualization, or general-purpose data handling.

Features

  • 🎉 Supports Relational Algebra operators, e.g. selection, projection, group, calculateVariable, sort, etc., out of the box.

  • 💎 Every operation creates an immutable DataModel instance and builds a Directed Acyclic Graph (DAG), which establishes automatic interactivity.

  • 🚀 Uses WebAssembly for handling huge datasets and for better performance.

  • ⛺ Also works in a Node.js environment out of the box.

Installation

CDN

Insert the DataModel build into the <head>:

<script src="https://cdn.jsdelivr.net/npm/@chartshq/datamodel@3.0.0/dist/browser/datamodel.js" type="text/javascript"></script>

NPM

Install DataModel from NPM:

$ npm install --save @chartshq/datamodel

As we're using a Web Worker internally, the worker-loader also needs to be installed:

$ npm install worker-loader --save-dev

Then, within your webpack configuration object, add worker-loader to the list of module rules, like so:

module.exports = {
  module: {
    rules: [
      // Add the following object to your module `rules` list.
      {
        test: /\.worker/,
        include: /datamodel/,
        loader: 'worker-loader',
        options: {
          inline: false, // If you want to make it inline, set to true.
          fallback: true
        },
      },
    ],
  }
};

You can also check out our datamodel-app-template to try out DataModel quickly through a boilerplate app.

Getting Started

Once the installation is done, please follow the steps below:

  1. Prepare the data and the corresponding schema:
// Prepare the schema for data.
const schema = [
  {
    name: 'Name',
    type: 'dimension'
  },
  {
    name: 'Maker',
    type: 'dimension'
  },
  {
    name: 'Horsepower',
    type: 'measure',
    defAggFn: 'avg'
  },
  {
    name: 'Origin',
    type: 'dimension'
  }
]

// Prepare the data.
const data = [
   {
    "Name": "chevrolet chevelle malibu",
    "Maker": "chevrolet",
    "Horsepower": 130,
    "Origin": "USA"
  },
  {
    "Name": "buick skylark 320",
    "Maker": "buick",
    "Horsepower": 165,
    "Origin": "USA"
  },
  {
    "Name": "datsun pl510",
    "Maker": "datsun",
    "Horsepower": 88,
    "Origin": "Japan"
  }
]
  2. Import DataModel as follows:

If you are using the npm package, import the package as below:

import Engine from '@chartshq/datamodel';

If you are using it in Node.js, require it as below:

const Engine = require('@chartshq/datamodel').default;

If you are using CDN, then use it as follows:

const Engine = window.DataModel;
  3. Load the DataModel engine, pass the data and schema to the DataModel constructor, and create a new DataModel instance:
// As the DataModel APIs are asynchronous, we need to
// use async-await syntax.
async function myAsyncFn() {
  // Load the DataModel module.
  const DataModel = await Engine.onReady();

  // Converts the raw data into a format
  // which DataModel can consume.
  const formattedData = await DataModel.loadData(data, schema);

  // Create a new DataModel instance with
  // the formatted data.
  const dm = new DataModel(formattedData);

  console.log(dm.getData().data);
  // Output:
  //  [
  //     ["chevrolet chevelle malibu", "chevrolet", 130, "USA"],
  //     ["buick skylark 320", "buick", 165, "USA"],
  //     ["datsun pl510", "datsun", 88, "Japan"]
  //  ]

  // Perform the selection operation.
  const selectDm = dm.select({ field: 'Origin', value: 'USA', operator: DataModel.ComparisonOperators.EQUAL });
  console.log(selectDm.getData().data);
  // Output:
  //  [
  //     ["chevrolet chevelle malibu", "chevrolet", 130, "USA],
  //     ["buick skylark 320", "buick", 165, "USA]
  //  ]

  // Perform the projection operation.
  const projectDm = dm.project(["Origin", "Maker"]);
  console.log(projectDm.getData().data);
  // Output:
  //  [
  //     ["USA", "chevrolet"],
  //     ["USA", "buick"],
  //     ["Japan", "datsun"]
  //  ]

  console.log(projectDm.getData().schema);
  // Output:
  //  [
  //     {"name": "Origin","type": "dimension"},
  //     {"name": "Maker","type": "dimension"}
  //  ]
}

myAsyncFn()
  .catch(console.error.bind(console));
  4. Finally, dispose of the DataModel instance when it is no longer needed:
// This also disposes of all the DataModels created from it.
dm.dispose();

Documentation

Find the detailed documentation and API reference here.

What has changed?

DataModel 3.0.0 has its core written in Rust and compiled to WebAssembly, bringing a huge performance improvement over the previous version in terms of both data size and computing speed. The JavaScript version is deprecated and will see no active development, but critical bugs, if raised, will be fixed and released on GitHub only.

You can visit the deprecated JavaScript version here: https://github.com/chartshq/datamodel-deprecated

Migrating from previous versions of DataModel

DataModel is now asynchronous, as opposed to the synchronous API of the previous JavaScript version.

import Engine from '@chartshq/datamodel';

(async () => {
  // Load the DataModel module.
  const DataModel = await Engine.onReady();

  // Converts the raw data into a format
  // which DataModel can consume.
  const formattedData = await DataModel.loadData(data, schema);

  // Create a new DataModel instance with
  // the formatted data.
  const dm = new DataModel(formattedData);
})();

Changed APIs

  • select

    DataModel deprecated version:

    dm.select((fields) => {
      return fields.Origin.value === 'USA';
    });

    Latest version:

    dm.select({
      field: 'Origin',
      operator: DataModel.ComparisonOperators.EQUAL,
      value: 'USA'
    });
  • groupBy

    DataModel deprecated version:

    dm.groupBy(['Origin'], {
      Acceleration: 'avg'
    });

    Latest version:

    dm.groupBy(['Origin'], [{
      aggn: DataModel.AggregationFunctions.AVG,
      field: 'Acceleration'
    }]);

Supported data operations (a usage sketch follows this list):

  • select
  • project
  • calculateVariable
  • sort
  • groupBy
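
A minimal sketch combining these operations, assuming the data and schema from the Getting Started section and the groupBy signature from the Changed APIs section; the sort signature follows the example shown in the issues below and is an assumption about the released API:

// Usage sketch; field names come from the Getting Started example.
async function operationsSketch() {
  const DataModel = await Engine.onReady();
  const dm = new DataModel(await DataModel.loadData(data, schema));

  // Sort by Horsepower in descending order.
  const sortedDm = dm.sort([['Horsepower', 'desc']]);

  // Group by Origin, averaging Horsepower.
  const groupedDm = dm.groupBy(['Origin'], [{
    aggn: DataModel.AggregationFunctions.AVG,
    field: 'Horsepower'
  }]);

  console.log(sortedDm.getData().data);
  console.log(groupedDm.getData().data);
}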

Upcoming data operations:

  • join
  • bin
  • compose
  • union
  • difference
  • ... many more ...

For more details on APIs visit our docs.

License

Custom License (Free to use)

datamodel's People

Contributors

adarshlilha, adotg, mridulmeh, ranajitbanerjee, rousan, sandeep1995, subhash-halder, sushrut141, ud-ud


datamodel's Issues

Compose API consistency

Compose should support sort and calculateVariable as well. Any operator which takes one operand should be supported.

Load Data from Remote URL

DataModel should provide an option to load data from a remote URL or an API that serves data in JSON or CSV format.
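
As a sketch of what this could look like today, remote JSON can already be fetched and fed into the existing loadData API; the helper name, the URL, and the JSON shape below are assumptions:

// A minimal sketch, assuming the endpoint returns an array of row objects.
// loadFromUrl is illustrative; only Engine.onReady and DataModel.loadData
// are existing APIs.
async function loadFromUrl(url, schema) {
  const DataModel = await Engine.onReady();
  const response = await fetch(url);
  const rows = await response.json();
  const formattedData = await DataModel.loadData(rows, schema);
  return new DataModel(formattedData);
}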

API for serialization of DataModel

Serialize a DataModel to JSON, CSV, or a 2D array.

If no parameter is passed, use the format of the user's input data automatically.

Currently, DataModel only has getData, which generates a data format meant for internal consumption.
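
Until a serialization API exists, a rough workaround can be built on getData(), whose { schema, data } output shape is shown in the Getting Started section. The toCsv helper below is purely illustrative and not part of DataModel:

// Illustrative only: derive a CSV string from getData()'s { schema, data }
// output. Assumes values contain no commas or newlines.
function toCsv(dm) {
  const { schema, data } = dm.getData();
  const header = schema.map(field => field.name).join(',');
  const rows = data.map(row => row.join(','));
  return [header, ...rows].join('\n');
}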

Correct bin function

bin does not work the way it should.

Binning should be

  • uniform
  • non-uniform

Uniform binning is just a special case of non-uniform binning.

Uniform binning can be configured by

either mentioning binSize

bin({
    binSize: /* size of each bin */
    start: /* starting of a bin, if not given the lower end of data domain */
})

or mentioning binsCount

bin({
    binsCount: /* total number of bin */
    start: /* starting of a bin, if not given the lower end of data domain */
    end: /* ending of a bin, if not given the upper end of data domain */
})

Non-uniform binning can be defined by specifying the bin boundaries themselves:

bin({
    config: [10, 20, 30, 40, 50, 100, 110]
    start: /* starting of a bin, if not given the lower end of data domain */
    end: /* ending of a bin, if not given the upper end of data domain */
})

For non-uniform binning

if start is not mentioned
    and if config[0] is greater than lower end of data domain then
         [lower_end_of_data_domain, 10, 20, ..., 110]
    else
        [10, 20, ..., 110]
else
    if start is greater than lower end of data domain then
        start = lower_end_of_data_domain
    if config[0] is greater than start then
        [start, 10, 20, ..., 110]
    else 
       [10, 20, ..., 110]

The converse applies to end and the upper end of the data domain.

This operator internally creates a binned dimension field.
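
For example, assuming a hypothetical data domain of [5, 120]: with config set to [10, 20, 30, 40, 50, 100, 110] and neither start nor end given, the pseudocode above would yield the boundaries [5, 10, 20, 30, 40, 50, 100, 110, 120], since both ends of the domain lie outside the configured bins.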

Add capability to split DataModel row-wise and column-wise given particular conditions

ROW WISE:
By adding the capability to split DataModels, faceting can be done directly through it, without forming intermediate DataModels.

Splitting will be done by a particular set of dimensions (faceting), plus an additional reducer function.

It should also support the compose operation, where all the split DataModels apply the next function in the compose chain.

Examples:
Pseudo Code:

newSetOfDataModels = DataModel.splitByRow(dimensions, reducerFn)

  1. Split By Dimension -
    Assume, DataModel has field "Country" with values "USA", "Japan" and "India"
    then
    newSetOfDataModels = DataModel.splitByRow(['Country'])

  2. Custom Function -
    newSetOfDataModels = DataModel.splitByRow(['Country'],(fields)=>{ return fields.Country.value !== "India" })

  3. Another example:
    If the data has state as well:
    newSetOfDataModels = DataModel.splitByRow(['Country', 'State'],(fields)=>{ return fields.Country.value !== "India" && fields.State.value !== 'Texas' })

COLUMN WISE

Similarly, column-wise splitting would split the DataModel by columns, with each resulting DataModel containing the common set of fields plus one of the unique sets of fields, as given below:

newSetOfDataModels = DataModel.splitByColumn(commonFields, uniqueFields)

For example:
newSetOfDataModels = DataModel.splitByColumn(['Origin'], [['Horsepower'], ['Acceleration']])

This will give two DataModels: one with the columns Origin and Horsepower, and the other with Origin and Acceleration.

Empty / null value representation

DataModel does not handle empty (undefined), null, or non-parsable values consistently.

  • For a categorical variable, if a value is empty (undefined) or null, represent it as null by default. A user might choose to treat undefined and null separately; for example, an undefined value could be represented as 'NA' while an explicit null could have a NULL type.
  • For a temporal variable, all of the above points apply, with an additional step for non-parsable date strings.
  • For a measure, undefined, null, and non-parsable values can be treated as above.
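
A sketch of the proposed default treatment for a categorical value, with the optional separate handling of undefined; the helper and its option are illustrative, not part of DataModel:

// Illustrative sketch: by default both empty (undefined) and null become
// null; optionally undefined can be represented separately, e.g. as 'NA'.
function normalizeCategorical(value, { undefinedAs = null } = {}) {
  if (value === undefined || value === '') return undefinedAs;
  return value; // an explicit null stays null
}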

CalculatedVariable re-calculate during groupBy

When the groupBy operator is applied, a calculated variable is not re-calculated from its dependent fields; DataModel applies the regular aggregation to the calculated variable as well.

If a variable is dependent on any other variable, then:

calculatedVariable({
    /* schema goes here */
    /* no aggregation function required */
}, [dep1, dep2, (valDep1, valDep2) => {
    /* implementation */
}])

Then, when aggregation happens, the calculated variable is re-calculated after the dependent variables have been aggregated.

calculatedVariable({
    /* schema goes here */
    /* aggregation function required here */
}, [() => {
    /* implementation */
}])

If the variable does not depend on any other variable, then the aggregation function is applied.

Increase test coverage

Currently, DataModel's test coverage is below the minimum threshold (80%); write enough test cases to reach that number.

Provision for complex calculation when applying operator

For a schema like [Gender, Exercise, count], there is currently no way to compute, for example, the percentage of males who exercise daily.

For select, calculateVariable, and aggregation functions, pass a clone (a detached root) of the current DataModel instance and a store to save values:

(value, i, clonedDM, store) => {
    ...
}
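
A hypothetical usage of this proposal, computing each row's count as a share of the total; the argument order, the clonedDM/store semantics, and the column lookup are all assumptions about the proposed (not existing) API:

// Hypothetical: compute 'count' as a percentage of the overall total,
// caching the total in the proposed store object.
dm.calculateVariable(
  { name: 'countPct', type: 'measure' },
  ['count', (count, i, clonedDM, store) => {
    if (store.total === undefined) {
      const { schema, data } = clonedDM.getData();
      const countIdx = schema.findIndex(f => f.name === 'count');
      store.total = data.reduce((acc, row) => acc + row[countIdx], 0);
    }
    return (count / store.total) * 100;
  }]
);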

Select operator fails when provided parsed value

The select operator fails when a date field is selected by passing the parsed string value.

The raw value of the temporal field has to be used for the operation to succeed.

It would be better if it also worked with the parsed value.
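
For illustration, assuming a temporal field named Date (the field, its raw representation, and the date string below are hypothetical), the difference is roughly:

// Works today: select by the raw stored value of the temporal field.
dm.select({ field: 'Date', operator: DataModel.ComparisonOperators.EQUAL, value: rawDateValue }); // rawDateValue: placeholder for the field's raw value

// Fails today, but should also work: select by the parsed/display value.
dm.select({ field: 'Date', operator: DataModel.ComparisonOperators.EQUAL, value: '2019-03-25' });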

Not giving type in schema forces incorrect groupBy behaviour

    {
      "name": "Exercise"
    },
    {
      "name": "Gender",
      "type": "dimension"
    },
    {
      "name": "SexualOrientation"
    }

When groupBy is performed on Exercise or SexualOrientation, column identifiers are not generated properly, leading to incorrect behaviour when the DataModel is pushed to a canvas.

Remove multi dimensional array supports from group-by functions

In the current implementation, the group-by functions (e.g. sum, avg) support a multi-dimensional array as input, as follows:

// for sum function
sum([[1, 2, 3], [4, 5, 6]])

This support needs to be removed so that each group-by function expects a one-dimensional array as input:

// for 1D array
sum([1, 2, 3]) // returns 6

// for multi-dimensional array
sum([[1, 2, 3], [1, 2]]) // returns NaN

"unreachable" error when trying to load data with a dimension of all empty/null values

When trying to load data that has a dimension column containing only empty string values, e.g.

import Engine from '@chartshq/datamodel' // used version v3.0.0

async function f () {
  const DataModel = await Engine.onReady()
  const fd = await DataModel.loadData(
    [ { d1: '' } ], // d1: null also fails
    [ { name: 'd1', type: 'dimension' as any } ],
    {}
  )
  const dm = new DataModel(fd)
}

... an error is thrown on load:

    RuntimeError: unreachable
        at wasm-function[167]:0x242f5
        at wasm-function[188]:0x24886
        at wasm-function[192]:0x2493f
        at wasm-function[182]:0x24738
        at wasm-function[27]:0x15eab
        at wasm-function[20]:0x12a24
        at wasm-function[19]:0x12760

      at T.add_field (node_modules/@chartshq/datamodel/dist/node/2.datamodel.js:1:50456)
      at Rt (node_modules/@chartshq/datamodel/dist/node/2.datamodel.js:1:24606)
      at t.createField (node_modules/@chartshq/datamodel/dist/node/2.datamodel.js:1:30596)

Throws error when sorting field containing invalid data

When a field contains some invalid values and sorting is applied on that field, an error is thrown.
Example:

const data = [
  { origin: "USA", Acceleration: 11 },
  { origin: null, Acceleration: 11 },
  { origin: "Japan", Acceleration: null },
];

const dm = new DataModel(/* data and schema */);
const sortedDm = dm.sort([["origin", "asc" ]]);
// Throws error

Restructure field hierarchy

Currently, the different kinds of fields are created from a wrong / incomplete conceptual hierarchy.
For example, PartialField has provisions for Categorical, Temporal, etc., but no such hierarchy is present for a concrete Field, making the inheritance inconsistent.

There are two options:

  • Create a serial inheritance: PartialField <- PartialMeasure <- Continuous (same for dimensions).
  • Create dimensions and measures as mixins, and use Field and PartialField to mix with the mixins.
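
A minimal sketch of the first option as serial inheritance; the class names are taken from the description above, while the members and the dimension branch are placeholders:

// Option 1: serial inheritance (names illustrative, bodies omitted).
class PartialField {
  constructor(name, schema) {
    this.name = name;
    this.schema = schema;
  }
}
class PartialMeasure extends PartialField {}
class Continuous extends PartialMeasure {}

class PartialDimension extends PartialField {}
class Categorical extends PartialDimension {}
class Temporal extends PartialDimension {}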

Sorting should persist the parent-child relationship.

Currently, the parent-child relationship is not maintained while sorting.

Fixed:

  • persist the parent-child relationship while sorting.
  • do not create multiple children when multiple sorts are performed on the same DataModel.

API for detached root

Create an API on DataModel which creates a detached root from the current instance of DataModel.
dm.detachedRoot()

Detaching a root creates a DataModel in isolation; this DataModel has no parent or children linked to it.

Allow renaming a column or changing its type in DataModel

Do you want to request a feature or report a bug?

  • feature

What is the current behavior?
After a DataModel instance is created, there is no way to rename a column or change its type.
This feature would be very helpful in cases where we want to let the user change the type of a field
or rename it through a user interface.
It is possible to achieve the same by creating a new instance of DataModel with modified data and
schema, but for reasonably large datasets that would be computationally expensive.

What is the expected behavior?
Allow for renaming a column or changing its type.

Which versions of MuzeJS, and which browser/OS are affected by this issue? Did this work in previous versions of MuzeJS?

  • latest
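
A hypothetical shape for such an API, purely illustrative (neither method exists in DataModel today):

// Hypothetical methods, shown only to illustrate the request.
const renamedDm = dm.renameField('Maker', 'Manufacturer');
const retypedDm = dm.changeFieldType('Horsepower', 'dimension');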

Ability to use multi-sort while retaining the order of particular fields

Do you want to request a feature or report a bug?

bug/feature

What is the current behavior?

Currently, if we want to perform a multi-sort on, say, two fields, retaining the sorting order of the first field and then, within that order, sorting the second field, it is not possible.

Even applying custom functions to the sort does not achieve this result and instead gives an error.
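
For illustration, the kind of call that should retain the first field's order while ordering the second within it, using the sort signature shown elsewhere in this document, would be:

// Desired behaviour: keep the ordering produced by Origin and, within each
// Origin group, order by Horsepower.
const multiSortedDm = dm.sort([
  ['Origin', 'asc'],
  ['Horsepower', 'desc']
]);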


Sorting after groupBy changes the order of data

Sorting creates an ordering of tuples. If a sort is applied and is then followed by a groupBy, the order of the data is currently lost.
DataModel should keep the order preserved even after the groupBy.

Cases

  • If groupBy is applied after sorting, re-apply the sort on the DataModel created by the groupBy.
  • Decide what happens if a sort is applied and the groupBy removes a field that the sort was using.
