hi-primus / bumblebee

🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)

Home Page: https://hi-bumblebee.com/

License: Apache License 2.0

JavaScript 32.32% CSS 8.75% Vue 38.74% SCSS 6.90% Dockerfile 0.43% Batchfile 0.52% Shell 0.56% Python 2.74% TypeScript 9.05%
data-profiling data-cleaning bumblebee gui data-preparation python dask optimus gpu cudf

bumblebee's People

Contributors

argenisleon, dependabot[bot], edogp, jorgeparraandrade, luis11011, luisboitas, rafaelmoreno21

bumblebee's Issues

[FEA] Add options to date format transformation

For date datatypes, we should create an easy way for the user to select the format they want to match. There are hundreds of combinations of days, months, years, minutes, and seconds. @luis11011 any idea how we can present this to the users?
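
One possible way to narrow the choices is to rank a short list of formats by how many sampled values each one parses, and offer the best matches first. This is only a sketch using the standard library: `CANDIDATE_FORMATS` and `rank_date_formats` are hypothetical names, not part of Optimus or Bumblebee.

```python
from datetime import datetime

# Hypothetical shortlist of formats the UI could offer (an assumption,
# not a list taken from Optimus).
CANDIDATE_FORMATS = [
    "%Y-%m-%d",
    "%d/%m/%Y",
    "%m/%d/%Y",
    "%Y-%m-%d %H:%M:%S",
]


def rank_date_formats(samples, formats=CANDIDATE_FORMATS):
    """Return (format, match_count) pairs, best match first."""
    counts = []
    for fmt in formats:
        matched = 0
        for value in samples:
            try:
                datetime.strptime(value, fmt)
                matched += 1
            except ValueError:
                # The value does not follow this format.
                pass
        counts.append((fmt, matched))
    # sorted() is stable, so ties keep the order of the shortlist.
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


samples = ["2020-01-31", "2020-02-15", "15/02/2020"]
ranking = rank_date_formats(samples)
```

With a ranking like this, the UI could show two or three suggested formats and keep the full list behind an "other format" option.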

UI/UX fixes

Load file

  • Rename "Upload" to "Preview"
  • Move "limit" inside "infer"
  • Rename "infer" to "More options"
  • Move the url load option to "More options"

Operations

  • Make the filter operation case-insensitive. Use the ignore_case param in the replace function. See hi-primus/optimus@f3702ce
  • If an operation fails, do not add it to the notebook
  • The column search box is not working: it does not filter any columns

Handle tables data columns using MongoDB

Right now, Bumblebee receives data via sockets from Optimus. This causes problems because all the sampled data must be loaded into the browser, which hurts performance. Using MongoDB, we can read only the rows needed to render the table.

Optimus should send the profiled data to MongoDB.
Bumblebee must load only the data needed to show in the table.

Load only visible columns

Right now, when loading data into the data window, we load only the visible rows. We need to do the same for columns. This way, we can reduce latency and improve performance.
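
A minimal sketch of the idea, assuming rows arrive as a list of dicts and the table knows which row and column ranges are on screen. `visible_window` and its parameters are hypothetical names used only for illustration.

```python
def visible_window(rows, columns, row_start, row_count, col_start, col_count):
    """Return only the cells inside the visible rows and columns."""
    visible_columns = columns[col_start:col_start + col_count]
    visible_rows = rows[row_start:row_start + row_count]
    return [
        {name: row[name] for name in visible_columns}
        for row in visible_rows
    ]


columns = ["id", "name", "city", "zip"]
rows = [
    {"id": i, "name": f"user{i}", "city": "NYC", "zip": "10001"}
    for i in range(100)
]

# Only 2 rows x 2 columns are sent, instead of 100 rows x 4 columns.
window = visible_window(
    rows, columns, row_start=10, row_count=2, col_start=1, col_count=2
)
```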

Queue Implementation

We need to receive the profiler data from a queue. This way, we can receive data from any source: local installations, on-premise deployments, or a third party like Databricks.
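
The pattern can be sketched with Python's in-process `queue.Queue` standing in for a real message broker, which is an assumption made only to keep the example self-contained. `publish` and `consume` are hypothetical names, and the payloads are placeholders.

```python
import queue
import threading

profiler_queue = queue.Queue()


def publish(source, payload):
    """Any source (local, on-premise, third party) pushes the same message shape."""
    profiler_queue.put({"source": source, "payload": payload})


def consume(results, expected):
    """Read `expected` messages from the queue, regardless of where they came from."""
    for _ in range(expected):
        message = profiler_queue.get()
        results.append(message)
        profiler_queue.task_done()


results = []
consumer = threading.Thread(target=consume, args=(results, 2))
consumer.start()

publish("local", {"columns": {}})
publish("databricks", {"columns": {}})

consumer.join()
sources = [message["source"] for message in results]
```

The consumer never needs to know which installation produced a message, which is the property the issue is after.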

[FEA] Add preview when loading data

The user should be able to get a file preview so they can pick the best params, like sep or charset, to get the required output. We can do this easily using the n_rows param when loading the data.
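
To illustrate why a preview helps, here is a sketch with the standard `csv` module instead of Optimus. `preview_rows` is a hypothetical helper, and its `n_rows` argument is named after the param mentioned above without claiming to match its behavior.

```python
import csv
import io
from itertools import islice


def preview_rows(text_stream, sep=",", n_rows=5):
    """Parse only the first `n_rows` lines with the given separator."""
    reader = csv.reader(text_stream, delimiter=sep)
    return list(islice(reader, n_rows))


raw = "id;name\n1;Ana\n2;Luis\n3;Eva\n"

# With the wrong separator each line collapses into a single column,
# which the user can spot immediately in the preview.
wrong = preview_rows(io.StringIO(raw), sep=",", n_rows=2)
right = preview_rows(io.StringIO(raw), sep=";", n_rows=2)
```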

[FEA] Create live onboarding

We should teach the user the basic Bumblebee functionality through a live onboarding.
We could use the foo.csv file in /example/data/
Some basic functions we can show are:

  • Load Data
  • Clean accents and special chars
  • Unnest
  • Nest
  • Filter
  • String clustering
  • Outliers
  • Save data to a file

Use profiler datatype instead of pandas datatypes

Optimus now detects these datatypes:

  • INT
  • DECIMAL
  • STRING
  • BOOLEAN
  • DATE
  • ARRAY
  • OBJECT
  • GENDER
  • IP
  • URL
  • EMAIL
  • CREDIT_CARD_NUMBER
  • ZIP_CODE
  • CATEGORICAL

They are sent in the profiler information and can be found in the profiler_dtype key:

{'columns': {'INCIDENT_NUMBER': {'stats': {'mismatch': 0,
    'missing': 0,
    'match': 319073,
    'profiler_dtype': 'string',
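
Reading the datatype per column could look like the sketch below. The payload is a shortened, hypothetical one that follows the shape of the excerpt above: the second column name is made up, and the real profiler output carries more keys.

```python
# Hypothetical, shortened profiler payload (assumed shape, not real output).
profile = {
    "columns": {
        "INCIDENT_NUMBER": {
            "stats": {
                "mismatch": 0,
                "missing": 0,
                "match": 319073,
                "profiler_dtype": "string",
            }
        },
        "EXAMPLE_DATE": {
            "stats": {
                "mismatch": 0,
                "missing": 0,
                "match": 319073,
                "profiler_dtype": "date",
            }
        },
    }
}


def column_dtypes(profile):
    """Map each column name to the datatype detected by the profiler."""
    return {
        name: info["stats"]["profiler_dtype"]
        for name, info in profile["columns"].items()
    }


dtypes = column_dtypes(profile)
```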

[FEA] Merge data from different sources

The user can already load data from different sources, but the data cannot be easily handled or merged. We can make some UI/UX improvements to make it work.

Improve the UX when dropping duplicates

  • If there are duplicated values, show only the rows that match
  • Show how many duplicated values were found
  • Make clear to the user which columns are being operated on
  • If no matches were found, give feedback to the user

Add group by and aggregation functions

Optimus now accepts the following for pandas and Dask:

# Each output column maps to {source column: aggregation}
_agg = {
    "std_col": {"Quantity": "std"},
    "min_col": {"Quantity": "min"},
    "count_col": {"ProdCategory": "count"},
}

df.cols.groupby(by="State", agg=_agg).rows.sort("count_col", "desc").ext.display()

The aggregations supported are:

  • count
  • sum
  • mean
  • std
  • min
  • max
  • first
  • last

@luis11011 The user should be able to select a column and add multiple aggregations, specifying for each one the output column name, the column to aggregate, and the aggregation.
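
For reference, the same output-name / column / aggregation triples can be expressed with plain pandas named aggregation. This is a sketch of an equivalent, not how Optimus implements it: `to_named_agg` is a hypothetical helper and the sample data is made up.

```python
import pandas as pd

# Same shape as the Optimus example: output column -> {source column: aggregation}
_agg = {
    "std_col": {"Quantity": "std"},
    "min_col": {"Quantity": "min"},
    "count_col": {"ProdCategory": "count"},
}


def to_named_agg(agg):
    """Convert the dict above into pandas named-aggregation keyword arguments."""
    named = {}
    for output_name, spec in agg.items():
        # Each spec is expected to hold exactly one column/aggregation pair.
        ((column, func),) = spec.items()
        named[output_name] = (column, func)
    return named


df = pd.DataFrame(
    {
        "State": ["NY", "NY", "CA", "CA", "CA"],
        "Quantity": [1, 3, 2, 4, 6],
        "ProdCategory": ["a", "b", "a", "a", "b"],
    }
)

result = (
    df.groupby("State")
    .agg(**to_named_agg(_agg))
    .sort_values("count_col", ascending=False)
    .reset_index()
)
```

Each entry the user adds in the UI would become one key in `_agg`, so the form only needs three fields per aggregation.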

Improve the UX when an operation highlights rows

These are the operations that highlight rows:

Toolbar

  • Drop duplicates
  • Drop nulls
  • Filter

Charts

  • Select bars in charts
  • Select bar in frequency

Quality bar

  • Missings
  • Mismatches
  • If there are matches, show only the rows that match
  • Make clear to the user which columns are being operated on
  • Show how many matches were found
  • If no matches were found, give feedback to the user

[FEA] Save data to databases

Right now, Bumblebee can save data to files. It would be useful if the user could also save data to databases. Optimus already has these features, so we can implement them easily.

[FEA] Realtime preview operations

After making a full operation over the whole dataset, we could make a realtime preview.

We could start with:

  • Replace
  • Nest
  • Unnest
  • Filter
  • Rename
  • Duplicate

Improve UI/UX for data source handling

Menu "add data source"

  • Add from file
  • Add from database
  • Add from a loaded data source
  • Manage data sources

Menu "save"

  • Download
  • Save to database

UX

  • Add data source call to action when opening a new tab
  • Open the load dataset dialog when opening a new tab
  • Show more recent data sources
  • Drag and drop from file

Add drop nan

The user should be able to drop rows with NaN values.
In dask:
df.rows.drop_na(how="all").compute()

Options for the user

  • all
  • any

Preview for pandas
pdf["__match__"] = pdf.isnull().any(axis=1)
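
The pandas preview above corresponds to the "any" option. A sketch covering both options could look like this, where `mark_na_rows` is a hypothetical helper name and the sample frame is made up:

```python
import pandas as pd


def mark_na_rows(pdf, how="any"):
    """Flag the rows that would be dropped, without dropping them."""
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    nulls = pdf.isnull()
    # "all": every value in the row is null; "any": at least one is.
    mask = nulls.all(axis=1) if how == "all" else nulls.any(axis=1)
    preview = pdf.copy()
    preview["__match__"] = mask
    return preview


pdf = pd.DataFrame({"a": [1.0, None, None], "b": ["x", "y", None]})
any_preview = mark_na_rows(pdf, how="any")
all_preview = mark_na_rows(pdf, how="all")
```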

Redesign the notebook view

The notebook view seems overkill for most users. A plain English description could be enough for the user to have an overview of all the operations.

@luis11011 as we discussed some weeks ago, we could just pick the operation and let the user modify the params
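
A rough sketch of how plain English descriptions could be generated from the stored operations. The operation records, `TEMPLATES`, and `describe` are all hypothetical; Bumblebee's real operation format may differ.

```python
# Hypothetical sentence templates, keyed by operation name.
TEMPLATES = {
    "rename": "Rename column '{column}' to '{new_name}'",
    "drop_duplicates": "Drop duplicate rows, keeping the {keep} occurrence",
    "replace": "Replace '{search}' with '{replace_by}' in column '{column}'",
}


def describe(operation):
    """Turn one operation record into a short English sentence."""
    template = TEMPLATES.get(operation["name"])
    if template is None:
        # Fall back to a generic sentence for operations without a template.
        return f"Run '{operation['name']}'"
    return template.format(**operation["params"])


operations = [
    {"name": "rename", "params": {"column": "zip", "new_name": "zip_code"}},
    {"name": "drop_duplicates", "params": {"keep": "first"}},
    {"name": "unnest", "params": {"column": "address"}},
]

summary = [describe(operation) for operation in operations]
```

Because each sentence is built from the params, clicking a sentence could reopen the same operation form with those params filled in.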

Group ML related functions

Put these functions under an ML icon in the toolbar

  • Sampling
  • Prepare
  • Impute
  • Standard
  • Min Max
  • Max abs
  • Values to Columns
  • String to Index
  • Indices to strings
  • Outliers

Add drop duplicates

The user should be able to drop duplicate rows.

options

keep

  • first
  • last

Code for preview

pdf["__match__"] = pdf.duplicated(keep="first")  # or keep="last", per the user's choice

Code for dask

df.rows.drop_duplicates(keep="first").compute()  # or keep="last"
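
A small pandas sketch of how the two `keep` options change the preview. `mark_duplicates` is a hypothetical helper and the sample frame is made up; only `DataFrame.duplicated` and `DataFrame.drop_duplicates` are real pandas calls.

```python
import pandas as pd


def mark_duplicates(pdf, keep="first"):
    """Flag the rows that would be dropped as duplicates."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    preview = pdf.copy()
    # True marks a duplicate that would be removed; the kept row stays False.
    preview["__match__"] = pdf.duplicated(keep=keep)
    return preview


pdf = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Ana", "Ana"],
        "city": ["NY", "LA", "NY", "NY"],
    }
)

first_preview = mark_duplicates(pdf, keep="first")
last_preview = mark_duplicates(pdf, keep="last")
deduplicated = pdf.drop_duplicates(keep="first")
duplicates_found = int(first_preview["__match__"].sum())
```

The `duplicates_found` count is the kind of number the UI could show as feedback before the user confirms the operation.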
