toucantoco / weaverbird Goto Github PK

A visual data pipeline builder with various backends

Home Page: https://weaverbird.toucantoco.dev

License: BSD 3-Clause "New" or "Revised" License

JavaScript 0.16% HTML 0.19% Vue 20.67% TypeScript 50.16% SCSS 0.13% Makefile 0.05% Python 28.57% Dockerfile 0.04% Shell 0.04%

mongodb pandas vuejs mysql postgresql redshift snowflake sql data-transformation

weaverbird's People

Contributors

Stargazers

Watchers

Forkers

testinnplayin arthurzenika sermonis stantoxt pouc charlesrngrd chrismeyersfsu davinov jeremyjacquemont ankurseth82 arnoldligtvoet pifauc dakshinka arooba-git

weaverbird's Issues

Keep SASS style scoped in .vue component

We wanted to add the possibility to override some style variables (e.g: active-color), we thus had to remove the scope attributes on style block, that prevent to overwrite a class not on purpose.

So now, we've got style in a separate file that prevents the style to be computed before we can override it in another project.

I think a better solution can be found in order to follow Vue convention 'Single File Component' , but I failed to find it.

New column based on conditions on existing ones

An important pain for query editors in Toucan Toco today is to express basic "if...then...else" conditions to populate a new column based on conditions on existing ones.

This point was not obvious in our steps usage analysis as it was not identified as an aggregation operator in the MongoDB aggregation pipeline ($cond), and it is not implement in Toucan Toco postprocess python utils.

Nevertheless, it is widely used and we should give it some priority.

A possible approach would be to use a dedicated formula in our formula editor accessed when creating a new column. A basic formula such as the excel one should do the trick:
IF( [COLUMN_1] < [COLUMN_2] AND [COLUMN_3] = "TCTC", "OK", [COLUMN_4])

Another, more graphical approach (and more explanatory to the user), would be to have a dropdown appearing when clicking on "New column" to select between "Formula" (for very basic arithmetic operations between column) and "Based on condition" (opening a modal to graphically express conditions and output)

Implement the 2-panes view

The left-pane will host a graphic rendering of the pipeline. The right-pane (720px min ?) will host the dataset viewer.

Step creation (+ edition) form

Each step is created (or edited) via a specific form.

We have yet to decide if these forms are built manually or with something like https://github.com/fagbokforlaget/vue-jsonschema-form

Highlight candidate columns on operation:hover

As a query designer, when I hover an operation, the candidate columns for this operations are highlighted.

'filter' step to Mongo: complex filtering (AND/OR)

TO DO

Basic filtering (single conditions)
Complex filtering (combination of conditions)

This 'filter' step config in our VQB "language"...

{
  name: 'filter',
  and: [
    {
      column: 'column_1'
      value: ['foo', 'bar']
      operator: 'in'
    }
    {
      column: 'column_2'
      value: ['toucan', 'toco']
      operator: 'nin'
    }
    {
      or: [
        {
          column: 'column_3'
          value: 'toto'
          operator: 'eq'
        }
        {
          and: [
           {
              column: 'column_4'
              value: 42
              operator: 'lte'
           }
           {
              column: 'column_4'
              value: 0
              operator: 'gt'
           }
          ]
        }
      ]
    }
  ]
}

... should yield the following mongo aggregation pipeline step:

{
  $match:
    column_1: $in: ["foo", "bar"]
    column_2: $nin: ["toucan", "toco"] # an 'and' between conditions on different columns does not need an explicit '$and' operator
    $or: [
      {
        column_3: $eq: "toto" # '$eq' is optional as it is implicit in Mongo
      }
      {
        $and: [ # an 'and' between conditions on the same column do need an explicit '$and' operator
          {
             column_4: $lte: 42
          }
          {
             column_4: $gt: 0
          }
        ]
      }
    ]
}

Toy backend for easier testing

As a developer, it would be nice to test the editor in real conditions with real data (e.g. CSV files or a toy mongo database).

One could imagime something like:

git clone https://github.com/ToucanToco/vue-query-builder/issues/new
python examples/toy-backend.py my-config-file.yml
yarn start-demo

'unpivot' step to Mongo

Example 1: with 'dropna' parameter to true

This 'unpivot' step in our VQB "language"...

{
    name: 'unpivot'
    keep: ["MARCHE", "CANAL"]
    unpivot: ["NB_CLIENTS_TOTAL", "NB_ROWS"]
    unpivot_column_name: "KPI"
    value_column_name: "VALUE"
    dropna: true # default
}

... should yield the following mongo aggregation steps:

{
  $project:
    MARCHE: "$MARCHE"
    CANAL: "$CANAL"
    dimensions:
      $objectToArray: {
        NB_CLIENTS_TOTAL: "$NB_CLIENTS_TOTAL"
        NB_ROWS: "$NB_ROWS"
      }
}
{
  $unwind: "$dimensions"
}
{
  $project:
    MARCHE: 1
    CANAL: 1
    KPI: "$dimensions.k"
    VALUE: "$dimensions.v"
}
{
  $match:
    VALUE: $ne: null # dropna here
}

Example 2: with 'dropna' parameter to false

This 'unpivot' step in our VQB "language"...

{
    name: 'unpivot'
    keep: ["MARCHE", "CANAL"]
    unpivot: ["NB_CLIENTS_TOTAL", "NB_ROWS"]
    unpivot_column_name: "KPI"
    value_column_name: "VALUE"
    dropna: false
}

... should yield the following mongo aggregation steps:

{
  $project:
    MARCHE: "$MARCHE"
    CANAL: "$CANAL"
    dimensions:
      $objectToArray: {
        NB_CLIENTS_TOTAL: "$NB_CLIENTS_TOTAL"
        NB_ROWS: "$NB_ROWS"
      }
}
{
  $unwind: "$dimensions"
}
{
  $project:
    MARCHE: 1
    CANAL: 1
    KPI: "$dimensions.k"
    VALUE: "$dimensions.v"
}

Example of query 1

This pipeline of steps in our VQB "language"...

[
  {
      name: 'domain',
      domain: 'focus_media_campaigns_levers_objectives'
  }
  {
      name: 'filter',
      column: 'period',
      value: "Monthly View"
      operator: 'eq'
  }
  {
      name: 'filter',
      column: 'period',
      value: ["2017", "2018"]
      operator: 'in'
  }
  {
      name: 'group',
      dimensions: ['day', "Campaign name"]
      measures: [
        {
           column: "Clicks"
           agg_function: 'avg'
        }
        {
           column: "Budget"
           agg_function: 'sum'
        }
        {
           column: "Impressions"
           agg_function: 'sum'
        }
      ]
  }
  {
     name: 'new_column_formula',
     new_column_name: 'CPM'
     formula: "[Budget] / [Impressions] * 1000"
  }
  {
    name: 'unpivot'
    id: ["day", "campaign"]
    value: ["Budget", "Impressions", "CPM"]
    dropna: true
  }
]

... should yield the following query + postprocess pipelines:

query: [
  $match:
    domain: "focus_media_campaigns_levers_objectives"
    period: 'Monthly View'
    year: $in: ["2017", "2018"]
,
  $group:
    _id: 
      day: "$day"
      campaign: "$Campaign name"
    Clicks:
      $avg: '$Clicks'
    Budget:
      $sum: '$Budget'
    Impressions:
      $sum: "$Impressions"
,
  $project:
    day: "$_id.day"
    campaign: "$_id.campaign"
    Clicks: 1
    Budget: 1
    Impressions: 1
    CPM:
      $multiply: [ $divide: [ "$Budget", "$Impressions"] , 1000 ]
]
postprocess: [
  melt:
    id: ["day", "campaign"]
    value: ["Budget", "Impressions", "CPM"]
    dropna: true
]

Notes:

Our "group" step should generate both a {$group} step as well as a {$project} step in mongo to project both the measures and the "_id. The latter is poorly named and may include several levels as in our example. We want to get readable column names rather than the default document-oriented "_id" of mongo, so that we can keep a consistent tabular-ready format. If following steps generate project, we are then able to merge the projects as in our example. To illustrate our point, the mongo "$project" of our example is the result of merging the following "$project" mongo deriving from 2 different stpes in our pipeline:

query: [
  ...
,
  # Derived from our group step
  $project:
    day: "$_id.day"
    campaign: "$_id.campaign"
    Clicks: 1
    Budget: 1
    Impressions: 1
,
  # Derived from our formula step
  $project:
    CPM:
      $multiply: [ $divide: [ "$Budget", "$Impressions"] , 1000 ]
]
...

In our formula step, we propose the convention of escaping column names between brackets ("[column_1]")

[table] display the data

In the right pane, we need a table that displays the dataset. The table should be scrollable horizontally.

'replace' step to Mongo

Exemple 1: replace more than one value in the same column, inplace

These 'replace' steps config in our VQB "language"...

{
   name: 'replace',
   search_column: "column_1"
   new_column: "column_1" # inplace by default
   oldvalue: 'foo'
   newvalue: 'bar'
}
{
   name: 'replace',
   search_column: "column_1"
   new_column: "column_1" # inplace by default
   oldvalue: 'old',
   newvalue: 'new'
}

... should yield the following mongo aggregation pipeline step:

{
  $project:
    <all other columns>: 1
    column_1: $cond: [{$eq: ["$column_1", "foo"]}, "bar", "$column_1"] # No, there is not a more elegant way of doing it in he mongo aggregation pipeline :) But it does have the advantage to work for any data type
}
{
  # We need a distinct step for the second replace (if we decide that it is more readable than nested $cond in one step...). We cannot merge the $project steps as they would include non unique keys as we want the replacement to be performed inplace, so only one replacement would be retained by Mongo.
  $project:
    <all other columns>: 1
    column_1: $cond: [{$eq: ["$column_1", "old"]}, "new", "$column_1"]
}

Exemple 2: replace more than one value in the same column, in a new column

These 'replace' steps config in our VQB "language"...

{
   name: 'replace',
   search_column: "column_1"
   new_column: "new_column"
   oldvalue: 'foo'
   newvalue: 'bar'
}
{
   name: 'replace',
   search_column: "new_column"
   oldvalue: 'old',
   newvalue: 'new'
}

... should yield the following mongo aggregation pipeline step:

{
  $project:
    <all other columns>: 1
    column_1: 1 # We do want to keep the source column
    new_column: $cond: [{$eq: ["$column_1", "foo"]}, "bar", "$column_1"]
}
{
  # We need a distinct step for the second replace (if we decide that it is more readable than nested $cond in one step...). We cannot merge the $project steps as they would include non unique keys as we want the replacement to be performed inplace, so only one replacement would be retained by Mongo.
  $project:
    <all other columns>: 1
    column_1: 1 # We do want to keep the source column
    new_column: $cond: [{$eq: ["$new_column", "old"]}, "new", "$column_1"] # Here we base our replacement on the 'new_column' from step 1
}

Exemple 3: replace one value in different columns

These 'replace' steps config in our VQB "language"...

{
   name: 'replace',
   search_column: "column_1"
   new_column: "column_1" # inplace by default
   oldvalue: 'foo'
   newvalue: 'bar'
}
{
   name: 'replace',
   search_column: "column_2"
   new_column: "new_column"
   oldvalue: 'old',
   newvalue: 'new'
}

... should yield the following mongo aggregation pipeline step:

{
  $project:
    <all other columns>: 1
    column_2: 1
    column_1: $cond: [{$eq: ["$column_1", "foo"]}, "bar", "$column_1"]
    new_column: $cond: [{$eq: ["$column_2", "old"]}, "new", "$column_1"]
}

dataset rowcount badge

The number of rows in my dataset is visible in a badge on the bottom left hand corner of the dataset viewer.

Add a spinner component

We'll need a spinner component for any long-processing operation

Configure step forms by clicking on cells / columns

Some of the step edition forms can be updated by simply clicking on columns or cells.

E.g. if a step applies on a specific column, I can edit the step. If I want to change the processed column, I simply have to click on the one that I want. Ideally, if this kind of step doesn't apply to a given column, it's disabled and I can't click on it.

Typecast popover

As a query designer, I can click on the "type" label in a column header and change it.

The corresponding cast step is generated in the pipeline.

'formula' step to Mongo

This 'formula' step config in our VQB "language"...

{
   name: 'formula'
   new_column: 'new_col' # new_column can be an existing column to be updated with the formula result
   formula: '(column_1 + column_2) / column_3 - column_4 * 100'
}

... should yield the following mongo aggregation pipeline step:

{
  $project:
    <all other columns>: 1
    new_col:
      $subtract: [
        {
          $divide: [
            {
              $add: [
                "$column_1"
                "$column_2"
              ]
            }
            "$column_3"
          ]
        }
        {
          $multiply: [
            "$column_4"
            100
           ]
        }
      ]
}

'fillna' step

This 'fillna' step config in our VQB "language"...

{
   name: 'fillna',
   column: "foo"
   value: "bar"
}

... should yield the following mongo aggregation pipeline step:

{
  $project:
    <all other columns>: 1
    foo: $cond: [{$eq: ["$foo", null]}, "bar", "$foo"]
}

keyboard shortcuts

As an expert designer, I can use keyboard shortcuts to navigate in my dataset viewer, search for operations, edit the pipeline, update the preview, etc.

if-then-else step

As a query designer, I can configure a if-then-else step easiliy

'top' step to Mongo

Exemple 1: With grouping

This 'top' step config in our VQB "language"...

{
  name: 'top',
  groups: ['foo']
  value: 'bar',
  sort: 'desc' # or 'asc'
  limit: 10
}

... should yield the following mongo aggregation pipeline step:

{
  $sort:
    bar: -1 # or 1 when 'asc'
}
{
  $group:
    _id:
      foo: "$foo"
    array:
      $push: "$$ROOT"
{
{
  $project:
    top_elems: $slice: ["$array", 10]
}
  $unwind: "$top_elems"
}
{
  $project:
    <all_columns>: "$top_elems.<all_columns>" # We want to keep all columns
}
]

Exemple 2: Without grouping

This 'top' steps config in our VQB "language"...

{
  name: 'top',
  value: 'bar',
  sort: 'desc' # or 'asc'
  limit: 10
}

... should yield the following mongo aggregation pipeline step:

{
  $sort:
    bar: -1 # or 1 when 'asc'
}
{
  $group:
    _id: null
    array:
      $push: "$$ROOT"
{
{
  $project:
    top_elems: $slice: ["$array", 10]
}
  $unwind: "$top_elems"
}
{
  $project:
    <all_columns>: "$top_elems.<all_columns>" # We want to keep all columns
}
]

get rid of webpack

We use rollup to build and ship an ES6 module. We should therefore get rid of webpack since we don't want to keep two distinct package bundlers.

Visualize the generated query

As a query designer, I am able to see the query that corresponds to the current transformation pipeline.

At this point, the goal is mainly to ease the debugging process, there's no need to design it thoroughly.

Complex filter operations

As a query designer, the UI allows me to configure combinations (or / and) of filters.

'percentage' step to Mongo

Exemple 1: with groups and result in new column

This 'percentage' step config in our VQB "language"...

{
   name: 'percentage'
   new_column: 'new_col'
   column: 'bar'
   group: ['foo']
}

... should yield the following mongo aggregation pipeline step:

{
  $group:
    _id:
      foo: "$foo"
    array:
      $push: "$$ROOT"
    total_denum: $sum: "$bar"
}
{
  $unwind: "$array"
}
{
  $project:
    new_col: $cond: [$eq: ["$total_denum", 0], null, $divide: ["$array.bar", "$total_denum"]] # we need to explicitely manage the case where '$total_denum' is null otherwise the query may just fail
    <all_other_columns>: "$array.<all_other_columns>"
}

Exemple 2: without groups and result in existing column

This 'percentage' step config in our VQB "language"...

{
   name: 'percentage'
   column: 'bar'
}

... should yield the following mongo aggregation pipeline step:

{
  $group:
    _id: null
    array:
      $push: "$$ROOT"
    total_denum: $sum: "$bar"
}
{
  $unwind: "$array"
}
{
  $project:
    bar: $cond: [$eq: ["$total_denum", 0], null, $divide: ["$array.bar", "$total_denum"]] # we need to explicitely manage the case where '$total_denum' is null otherwise the query may just fail
    <all_other_columns>: "$array.<all_other_columns>"
}

make it responsive !

Define a few responsive behaviours (to be refined):

the "search field" disappear, only the magnifying glass icon remains,
the left pane shrinks to keep the right one as big as possible
…

'sort' step

Ex 1: sort on one column

This 'sort' step config in our VQB "language"...

{
   name: 'sort',
   columns: ["foo"]
   order: ["asc"] # 'asc' by default
}

... should yield the following mongo aggregation pipeline step:

{
  $sort:
    foo: 1
}

Ex 2: sort on multiple columns

This 'sort' step config in our VQB "language"...

{
   name: 'sort',
   columns: ["foo", "bar"]
   order: ["asc", "desc"]
}

... should yield the following mongo aggregation pipeline step:

{
  $sort:
    foo: 1
    bar: -1
}

'rename' step to Mongo

These 'rename' steps config in our VQB "language"...

{
   name: 'rename',
   oldname: 'foo',
   newname: 'bar'
}
{
   name: 'rename',
   oldname: 'old',
   newname: 'new'
}

... should yield the following mongo aggregation pipeline step:

{
  $project:
    <all other columns>: 1
    bar: "$foo"
    new: "$old"
}

'pivot' step to Mongo

This 'pivot' step config in our VQB "language"...

{
   name: 'pivot',
   index: ['column_1', 'column_2'],
   column_to_pivot: 'column_3'
   value_column: 'column_4'
   agg_function: 'sum' # 'mean' by default
}

... should yield the following mongo aggregation pipeline step:

{
  $group:
    _id:
      column_1: "$column_1"
      column_2: "$column_2"
      column_3: "$column_3"
    column_4: $sum: "$column_4" # first we perform the aggregation ('$mean' by default), with the _id including the column to pivot
}
{
  $group:
    _id:
      column_1: "$_id.column_1"
      column_2: "$_id.column_2"
    array:
      $addToSet:
        column_3: "$_id.column_3"
        column_4: "$column_4" # then we group with with index columns as _id and we push documents as an array of sets including a column for the column to pivot and a column for the corresponding value
}
{
  $project:
    tmp_object:
      $arrayToObject:
        $zip:
          inputs: ["$array.column_3", "$array.column_4"] # then we project a tmp key to get an object from the array of couples [column_to_pivot, corresponding_value]
}
{
  $addFields:
    "tmp_object.column_1": "$_id.column_1"
    "tmp_object.column_2": "$_id.column_2" # Then we include back in every document created in the previous step the index columns (still accessible in the _id object as we did not explicitly specified '_id: 0' in the previous step)
}
{
  $replaceRoot:
    newRoot: "$tmp_object" # then we replace the root of the document tree to get our columns ready for our needed table-like format !
}
]

Delete a column

As a query designer, I can click on a column and choose to delete it.

This will generate a DeleteStep in the pipeline.

Delete a step

As a query designer, in the pipeline viewer, on a step, I can click on "trash" button that will display a popover with a confirmation message "Are you sure you want to delete this step ?". If I say "yes", the step is removed from the pipeline and the dataset is recomputed.

dataset pagination

As a query designer, I can browser my paginated dataset. The default is to limit the viewport to N (i.e. 50) but I can choose to configure this N visually and choose to jump to the next page.

http://allenfang.github.io/react-bootstrap-table/index.html might be a source of inspiration.

Handle simple translation of aggregation step → mongo 3.6

Implement the aggregation step translation described in #30 with the following restrictions:

no postprocess, mongo 3.x only,
handle aggregation functions that have direct mongo 3.x mapping (i.e. no count_distinct)

Definie a minimal dataset API

We'll need at least the following information:

{
   length: 12345,
   offset: 100,
   limit: 50,
   columns: [
      {
        name: 'my-col',
        type: 'integer',
        // possibly later some extra information such as `unique_values`, `min`, `max`, ...
      }
   ]
}

Column types will be useful for a number of usecases (e.g. detect which operations apply and wich don't)

Choose backend

As a query designer, I can choose to generate queries on different backends. We could imagine the following ones:

mongo 3.6 + python postprocess
mongo 4 + python postprocess
pandas only
sql + python postprocess

We don't want here to implement all those backends, just to prepare the code (and the UI) architecturally to support multiple backends.

"preview" mode

As a query designer, when I create / edit a transformation step, I have access to a "preview" button that will ask the host to recompute the dataset so I can preview the result.

The preview is also accessible when I click on a specific step in the pipeline pane.

display a list of unique values in the column

As a query designer, when I click on a column header, I see the list of unique values to be able to filter on it (or just to get some insights).

If the dataset is paginated, when I first click, I see only the unique values that are currently displayed. I can click on a label "see all unique values" that will display all the unique values in this column, whether visible or not.

[table] select a column

It must be possible to select (and highlight) a column in the table. It should also be possible to select (and highlight) more than one column at once with ctrl+click

Dropdown operation buttons

Each transformation step has an associated category. For now, we envision the following categories:

filter
compute
text
date
aggregate
reshape

For each of these categories, we should have a dropdown button that will popover the list of related transformation steps.

Filter by value

As a query designer, I can click on a column, click on the "filter" step and pick the value to filter on either by clicking on a cell or manually entering the value.

This will generate a FilterStep in the pipeline

Dropdown icon on hover only

In a column header, there is a small "call-to-action dropdown" icon to incite user to click.

To make the UI as light as possible, This dropdown icon should only be visible when the user hovers the column header.

time travel

As a query designer, in the pipeline viewer, I can click on the small circle beyond each step. This will disable the steps below the one I've just clicked on. Visually, the disabled steps will be slightly faded and the dataset viewer will update to the newly generated query.

Search operation

As a query designer, I can enter some text in the search field to find operations matching this text.

I can see a list of suggested operations as a type. Ideally, the search is fuzzy and if there's no match in a given category, the corresponding buttons fade out.

Mutiple datasets

As a query designer, I'm able to generate multiple queries on multiple datasets and combine the results.

[table] select a cell

In the dataset viewer, it should be possible to select (and higlight) a specific cell. It should also be possible to select (and highlight) multiple cells using ctrl+click.

popover component

We need a generic popover component that will be used both on category buttons and column headers.

Limit number of rows to N

If the component is initialized with a large dataset, we won't be able to display all
the lines. For the first version, we'll fix an viewport limit to N (50 ?) rows.

A future version will do better.

'Aggregate' step to Mongo :: Support count distinct

TO DO

Support basic aggregation functions with straightforward implementation
Support count_distinct

Exemple 1: Several aggregations on several dimensions

This 'group' step config in our VQB "language"...

 {
      name: 'group',
      on: ['dimension1', "dimension2"]
      aggregations: [
        {
           name: "sum_value1" # default name: '<function>_<column>'
           column: "value1"
           agg_function: 'sum'
        }
        {
           name: "avg_value2"
           column: "value2"
           agg_function: 'avg'
        }
        {
           name: "MIN"
           column: "value3"
           agg_function: 'min'
        }
        {
           name: "MAX"
           column: "value4"
           agg_function: 'max'
        }
        {
           name: "count_value5"
           column: "value5"
           agg_function: 'count'
        }
        {
           name: "value6"
           column: "value6"
           agg_function: 'first'
        }
        {
           name: "value7"
           column: "value7"
           agg_function: 'last'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1"
      dimension2: "$dimension2"
    sum_value1:
      $sum: '$value1'
    avg_value2:
      $avg: '$value2'
    MIN:
      $min: "$value3"
    MAX:
      $max: "$value4"
    count_value5:
      $sum: 1 # Careful here: there is no '$count' operator, we '$sum' a series of 1 to count
    value6:
      $first: "$value6"
    value7:
      $last: "$value7"
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    sum_value1: 1
    avg_value2: 1
    MIN: 1
    MAX: 1
    count_value5: 1
    value6: 1
    value7: 1
]

Exemple 2: We keep the same generic approach for aggregations on one dimension

This 'group' step config in our VQB "language"...

 {
      name: 'group',
      on: ['dimension1']
      dimensions: [
        {
           name: "sum_value1"
           column: "value1"
           agg_function: 'sum'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1" # not "_id: '$dimension'" as it may be possible but we want to keep the approach as generic as possible so we use a "levelled" definition of "_id" as it generalises to grouping on several dimensions
    sum_value1:
      $sum: '$value1'
,
  $project:
    dimension1: "$_id.dimension1"
    sum_value1: 1
]

Exemple 3: The unobvious 'count_distinct' in Mongo....

This 'group' step config in our VQB "language"...

 {
      name: 'group',
      on: ['dimension1', 'dimension2']
      aggregations: [
        {
           name: "count_distinct_client"
           column: "client"
           agg_function: 'count_distinct'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1"
      dimension2: "$dimension2"
      client: "$client" # This first '$group' builds unique groups and therefore make every client unique in every combination of dimension1 x dimension2 x client
,
  $group:
    _id: 
      dimension1: "$_id.dimension1"
      dimension2: "$_id.dimension2"
    count_distinct_client:
      $sum: 1 # Count the the unique groups and therefore the clients for every combination of dimension1 x dimension2
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    count_distinct_client: 1
]

Exemple 4: ... And Mongo is even more tricky when 'count_distinct' needs to be combined with other aggregations !

Important notice: We show here how we can perform such aggregations in the mongo aggregation pipeline, but the query gets quite complicated and we do not even cover the challenge of performing several count_distinct at the same time... In such a case, the approach detailed below would need to be nested a number of times equal to the number of count_distinct aggregation needed !!! It quite quickly become unacceptable both in terms of readability and performance.

=> So we believe that we should switch to the potsprocess option as soon as the aggregation contains a 'count_distinct' combined with at least one other aggregation.

This 'group' step config in our VQB "language"...

 {
      name: 'group',
      on: ['dimension1', 'dimension2']
      aggregations: [
        {
           name: "sum_value"
           column: "value"
           agg_function: 'sum'
        }
        {
           name: "count_distinct_client"
           column: "client"
           agg_function: 'count_distinct'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1"
      dimension2: "$dimension2"
      client: "$client"
    array_detail:
      $push: '$$ROOT' # To keep track of the details and be able to perform every aggregation, we push every document ('$$ROOT' is a mongo variable similar to a cursor representing the current document being read) in arrays, with one array per group of [dimensions + the column for which we need to get the number of unique values]
,
$group:
    _id: 
      dimension1: "$_id.dimension1"
      dimension2: "$_id.dimension2"
    client:
      $sum: 1 # here we count the unique clients as there are unique as per the group _id definition of the previous step
    array_group:
      $push: '$$ROOT' # Same thing here than in the step above: we need to keep track of every document to then be able to perform the other aggregations (see steps below)
,
  $unwind: '$array_group' # The unwind step "destructure" the array to get a root document per document found in the array => in tabular terms, we get 1 row per document found in the array
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    client: 1
    array: "$array_group.array_detail" # We get the array with the most granular level of documents here to unwind it in the following step to get a row per document
,
  $unwind: "$array"
,
  $group:
    _id:
    dimension1: "$dimension1"
    dimension2: "$dimension2"
    sum_value:
      $sum: "$array.value" # Now we can perform the other aggregations based on the most granular level
    count_distinct_client:
      $first: "$client" # And we need to keep the first value of count distinct found per group as this value was duplicated after the '$unwind' operations
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    sum_value: 1
    count_distinct_client: 1
]

Exemple 5: Several aggregations on the same column

This 'group' step config in our VQB "language"...

 {
      name: 'group',
      on: ['dimension1', "dimension2"]
      aggregations: [
        {
           name: "sum_value1" # default name: '<function>_<column>'
           column: "value1"
           agg_function: 'sum'
        }
        {
           name: "my_name"
           column: "value1"
           agg_function: 'count'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1"
      dimension2: "$dimension2"
    sum_value1:
      $sum: '$value1'
    count_value1:
      $count: '$value1'
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    sum_value1: 1
    my_name: 1
]

Replace value with another

As a query designer, I can click on a column and choose the "replace value" transformation step. The step creation form is displayed and I'm asked to select values (either by manually entering values or by clicking on a cell) and choose the corresponding replacement value.

This will generate a ReplaceStep in the pipeline.

table columns should be resizable

As a query designer, I can make resize column width to be able to read long labels / values.

'argmax' / 'argmin' step

Example 1: with grouping

This 'argmax' step config in our VQB "language"...

{
   name: 'argmax',
   groups: ['foo'],
   column: 'bar' # column in which to search for the max value
}

... should yield the following mongo aggregation pipeline step:

{
  $group:
    _id:
      foo: "$foo"
    array:
      $push: "$$ROOT"
    value_to_compare: $max: "$bar" # $min for argmin
}
{
  $unwind: "$array"
}
{
  $project:
    value_to_compare: 1
    <all_other_columns>: "$array.<all_other_columns> # We want to keep all columns
}
  $redact: # shortcut operator to avoid to firstly create a boolean column via $project and then filter on 'true' rows via $match
    $cond: [
      {
        $eq: ["$foo", "$value_to_compare"]
      }
      "$$KEEP"
      "$$PRUNE" # Magic happens here with the mongo variable "$$KEEP" (for... keeping rows matching the condition) and "$$PRUNE" (for excluding rows not matching the condition)
    ]
]

Example 2: without grouping

This 'argmin' step config in our VQB "language"...

{
   name: 'argmin',
   column: 'bar'
}

... should yield the following mongo aggregation pipeline step:

{
  $group:
    _id: null
    array:
      $push: "$$ROOT"
    value_to_compare: $min: "$bar"
}
{
  $unwind: "$array"
}
{
  $project:
    value_to_compare: 1
    <all_other_columns>: "$array.<all_other_columns> # We want to keep all columns
}
  $redact: # shortcut operator to avoid to firstly create a boolean column via $project and then filter on 'true' rows via $match
    $cond: [
      {
        $eq: ["$foo", "$value_to_compare"]
      }
      "$$KEEP"
      "$$PRUNE" # Magic happens here with the mongo variable "$$KEEP" (for... keeping rows matching the condition) and "$$PRUNE" (for excluding rows not matching the condition)
    ]
]

toucantoco / weaverbird Goto Github PK

weaverbird's People

Contributors

Stargazers

Watchers

Forkers

weaverbird's Issues

TO DO

Example 1: with 'dropna' parameter to true

Example 2: with 'dropna' parameter to false

Notes:

Exemple 1: replace more than one value in the same column, inplace

Exemple 2: replace more than one value in the same column, in a new column

Exemple 3: replace one value in different columns

Exemple 1: With grouping

Exemple 2: Without grouping

Exemple 1: with groups and result in new column

Exemple 2: without groups and result in existing column

Ex 1: sort on one column

Ex 2: sort on multiple columns

TO DO

Exemple 1: Several aggregations on several dimensions

Exemple 2: We keep the same generic approach for aggregations on one dimension

Exemple 3: The unobvious 'count_distinct' in Mongo....

Exemple 4: ... And Mongo is even more tricky when 'count_distinct' needs to be combined with other aggregations !

Exemple 5: Several aggregations on the same column

Example 1: with grouping

Example 2: without grouping

Recommend Projects

Recommend Topics

Recommend Org