hi-primus / bumblebee

🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)

License: Apache License 2.0

JavaScript 32.32% CSS 8.75% Vue 38.74% SCSS 6.90% Dockerfile 0.43% Batchfile 0.52% Shell 0.56% Python 2.74% TypeScript 9.05%

data-profiling data-cleaning bumblebee gui data-preparation python dask optimus gpu cudf

bumblebee's Issues

[BUG] Search restart after typing

See the gif
https://recordit.co/Ek4Y6CCrf3

Reorganize columns search, sort and filter operations

None of the users interviewed use the column operations (search, sort, filter). We could consider moving them to the right side of the toolbox.

[FEA] Add options to date format transformation

For date datatypes, we should create an easy way for the user to select the format they want to match. There are hundreds of combinations of days, months, years, minutes and seconds. @luis11011 any idea how we can present this to the users?
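One possible way to narrow the choice, as a rough sketch: render a sample date in a short list of preset formats, and filter the presets down to those that actually parse a value from the column. The preset list and the function names here are hypothetical, not part of Bumblebee or Optimus.

```python
from datetime import datetime

# Hypothetical preset list; a real picker would likely offer more entries.
DATE_FORMAT_PRESETS = [
    "%Y-%m-%d",
    "%d/%m/%Y",
    "%m/%d/%Y",
    "%Y-%m-%d %H:%M:%S",
    "%d %b %Y",
]


def preview_formats(sample, presets=DATE_FORMAT_PRESETS):
    """Return (format, rendered sample) pairs to display in a picker."""
    return [(fmt, sample.strftime(fmt)) for fmt in presets]


def matching_formats(value, presets=DATE_FORMAT_PRESETS):
    """Return the presets that can parse a raw string from the column."""
    matches = []
    for fmt in presets:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue  # this preset does not fit the value
        matches.append(fmt)
    return matches
```

A value such as 03/04/2020 matches both the day-first and the month-first preset, so the UI would still need to ask the user in ambiguous cases.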

UI/UX fixes

Load file

Rename "Upload" to "Preview"
Move "limit" inside "infer"
Rename "infer" to "More options"
Move the url load option to "More options"

Operations

Make the filter operation case-insensitive. Use the ignore_case param in the replace function. See hi-primus/optimus@f3702ce
If an operation fails, do not add it to the notebook
The column search box is not working. It is not filtering any columns

Add undo button

Handle tables data columns using MongoDB

Right now, Bumblebee receives data via sockets from Optimus. This causes problems because all the sampled data must be loaded into the browser, which hurts performance badly. Using MongoDB, we can read only the rows needed to render the table.

Optimus should send the profiled data to MongoDB.
Bumblebee must load only the data needed to show in the table.
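
A minimal sketch of the arithmetic behind "only the data needed": turn the scroll position into a (skip, limit) pair that could then be passed to a paginated query. The function name, its parameters and the overscan margin are assumptions for illustration; nothing here is taken from the Bumblebee code.

```python
def visible_row_window(scroll_top, viewport_height, row_height, total_rows, overscan=5):
    """Return (skip, limit) for the rows that should be fetched.

    All sizes are in pixels. `overscan` is a hypothetical margin of extra
    rows fetched above and below the viewport to make scrolling smoother.
    """
    first_visible = scroll_top // row_height
    visible_count = -(-viewport_height // row_height)  # ceiling division
    start = max(first_visible - overscan, 0)
    end = min(first_visible + visible_count + overscan, total_rows)
    return start, max(end - start, 0)
```

With a MongoDB collection, the pair would map onto the cursor's skip and limit calls. The same calculation applies horizontally if the columns are windowed as well.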

Create dockerfile

It would be great to create a Dockerfile so users can easily install Bumblebee.
Here are the installation instructions: https://blog.hi-bumblebee.com/install-bumblebee/

Load only visible columns

Right now, when loading data into the data window, we load only the visible rows. We need to do this for columns too. This way, we can reduce latency and improve performance.

[BUG] Phantom placeholder appearing when opening datasets with different sizes

Getting some visual information in the Readme

Hi, I think we should post some visual information in the Readme so people have an idea of what it looks like now. Some screenshots or gifs would work.

Let me know your thoughts on this @argenisleon @JorgeParraAndrade

Queue Implementation

It's necessary to receive the profiler data from a queue. This way, we can receive data from any source: local installations, on-premise deployments or a third party like Databricks.

[FEA] User can download the data in csv format after executing operations

[FEA] Add preview when loading data

The user should be able to get a file preview in order to choose the params, like sep or charset, that produce the required output. We can do it easily using the n_rows param when loading the data.
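
To illustrate the idea without depending on Optimus, here is a sketch using only the Python standard library: parse just the first few rows with a candidate sep and charset, so the user can see whether the result looks right. The function name is hypothetical; the parameter names mirror the ones mentioned above.

```python
import csv
import io
from itertools import islice


def preview_rows(raw_bytes, sep=",", charset="utf-8", n_rows=5):
    """Decode a file's bytes and parse only the first n_rows rows."""
    text = io.StringIO(raw_bytes.decode(charset), newline="")
    reader = csv.reader(text, delimiter=sep)
    return list(islice(reader, n_rows))
```

A wrong sep shows up immediately in the preview, because every row comes back as a single unsplit cell.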

[FEA] Create live onboarding

We should teach the user the basic Bumblebee functionality through a live onboarding.
We could use the foo.csv file in /example/data/
Some basic functions we can show are:

Load Data
Clean accents and special chars
Unnest
Nest
Filter
String clustering
Outliers
Save data to a file

Installation link isn't working

I tried opening the link https://blog.hi-bumblebee.com/install-bumblebee/ but it shows that the connection has timed out.

Hint the user with rows number

This is a reference for google docs

By default ignore case in replace

Use the newly added Optimus param ignore_case=True to ignore case when searching and replacing
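
For reference, the behavior being asked for can be sketched in plain Python with the re module. This only illustrates case-insensitive literal replacement. It is not how Optimus implements ignore_case, and the function name is made up.

```python
import re


def replace_ignore_case(value, search, replace_by, ignore_case=True):
    """Replace every literal occurrence of `search`, optionally ignoring case."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps `search` literal; the lambda keeps `replace_by` literal too.
    return re.sub(re.escape(search), lambda _match: replace_by, value, flags=flags)
```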

Use profiler datatype instead of pandas datatypes

Optimus now detects these datatypes:

INT
DECIMAL
STRING
BOOLEAN
DATE
ARRAY
OBJECT
GENDER
IP
URL
EMAIL
CREDIT_CARD_NUMBER
ZIP_CODE
CATEGORICAL

They are sent in the profiler information and can be found in the profiler_dtype key

{'columns': {'INCIDENT_NUMBER': {'stats': {'mismatch': 0,
    'missing': 0,
    'match': 319073,
    'profiler_dtype': 'string',
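
Reading the datatype per column could look roughly like this. The sketch assumes profiler_dtype sits inside each column's stats dict, as the truncated fragment above suggests; the helper name is hypothetical and the real payload may be nested differently.

```python
def column_dtypes(profile):
    """Map each column name to its profiler_dtype (None when it is missing)."""
    dtypes = {}
    for name, info in profile.get("columns", {}).items():
        # Assumption: the key lives under "stats", following the fragment above.
        dtypes[name] = info.get("stats", {}).get("profiler_dtype")
    return dtypes
```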

Add set, update and delete operation over histograms and frequency charts

Create Windows 10 installation tutorial

Add string grouper clustering algorithm

Right now Optimus handles fingerprint, n-gram fingerprint and Levenshtein distance to cluster strings. Using the string grouper library could give a boost to the clustering. Related to hi-primus/optimus#919
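
For context, fingerprint (key collision) clustering can be sketched as follows. This is a generic version of the technique, written with the standard library only. It is not the Optimus implementation, and the normalization steps are an assumption.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Normalize a string into a key: lowercase, no accents, sorted unique tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster_by_fingerprint(values):
    """Group values that share a fingerprint; keep groups with more than one spelling."""
    clusters = defaultdict(list)
    for value in values:
        clusters[fingerprint(value)].append(value)
    return [group for group in clusters.values() if len(set(group)) > 1]
```

Values like "Acme Inc.", "acme inc" and "Inc Acme" all collapse to the same key, so they end up in one cluster.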

Migrate to Jupyter Enterprise Gateway

We are using Jupyter Kernel Gateway, but the project seems buggy (WebSocket echo and random disconnections) and no longer maintained.
We should migrate to https://github.com/jupyter/enterprise_gateway

[FEA] Merge data from different sources

The user can already load data from different sources, but the data cannot be easily handled or merged. We can make some UI/UX improvements to make it work.

[BUG] Highlight is wrong when using replace

It also seems to be adding some kind of HTML element at the end of the string

[FEA] Evaluate using shortcuts

Add trigonometric and math functions to numeric columns

These functions have been added to the Optimus development branch.
def round(self, input_cols, decimals, output_cols=None):
def floor(self, input_cols, output_cols=None):
def ceil(self, input_cols, output_cols=None):

Improve UI/UX for data joining

@luis11011 here are some of the joining features:

The user can make a join operation with any tab open in the workspace.
Show the key icon only on hover. Like the visibility button in the list view for column selection.
I am exploring how we can visually show the user how the two datasets work together. I am trying light mixing: https://www.figma.com/file/OaOQviw4uzfpHl7Uwn8uyK/Join-color-scheme-test?node-id=0%3A1. It is a little bright right now, but it is a start.

Show patterns in detail windows

This can be useful to get an overview of the string structure of a column
def patterns(self, input_cols, output_cols=None, mode=0):

See https://github.com/ironmussa/Optimus/blob/develop-3.0/optimus/engines/base/columns.py#L153 for more info about the param
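
As a rough idea of what a pattern overview could show, here is a standalone sketch. The replacement codes (U, l, #) are picked for illustration and may not match what Optimus emits for any given mode; the function names are hypothetical.

```python
from collections import Counter


def pattern(value):
    """Collapse a string into its shape: U = uppercase, l = lowercase, # = digit."""
    shape = []
    for ch in value:
        if ch.isdigit():
            shape.append("#")
        elif ch.isalpha():
            shape.append("U" if ch.isupper() else "l")
        else:
            shape.append(ch)  # punctuation and spaces are kept as-is
    return "".join(shape)


def pattern_counts(values):
    """Count how often each shape occurs, most frequent first."""
    return Counter(pattern(value) for value in values).most_common()
```

The detail window could then list the most frequent shapes, which makes outliers in an otherwise regular column easy to spot.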

Create histograms for string type columns

[FEA] Users can create accounts

Let users create accounts using an email

[FEA] User should be able to upload files to their account

Users should be able to upload files in csv, json, parquet and avro formats

Drag and drop data files to a workspace

I think this would open a new tab and try to preview the file. The user would need to confirm the operation to finish loading the data.

Add frequency or histogram chart at the top of the column

Visualize string clustering

Improve the UX when dropping duplicates

If there are duplicated values show only the rows that match
Show how many duplicated values were found
Make clear to the user which columns are being operated on
If no matches were found, give feedback to the user
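
The points above could be backed by a small report like this one. It is a sketch over plain Python dicts, with a hypothetical function name and message wording, just to show what information the UI would need.

```python
from collections import Counter


def duplicate_report(rows, subset):
    """Summarize duplicates over the columns in `subset` for UI feedback."""
    keys = [tuple(row[col] for col in subset) for row in rows]
    counts = Counter(keys)
    # Indices of every row whose key appears more than once.
    matching = [i for i, key in enumerate(keys) if counts[key] > 1]
    if matching:
        message = f"{len(matching)} duplicated rows found in {', '.join(subset)}"
    else:
        message = f"No duplicates found in {', '.join(subset)}"
    return {"columns": list(subset), "rows": matching, "message": message}
```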

span element inside cell

@luis11011 It seems there are two extra span elements in every cell. Could we get rid of them? I think this could be hurting performance.

Filter columns by regex

Add group by and aggregation functions

Optimus now accepts this for pandas and Dask:

new_col = "std_col"
new_col_1 = "min_col"
new_col_2 = "count_col"
_agg = {new_col:{"Quantity":"std"},new_col_1:{"Quantity":"min"},new_col_2:{"ProdCategory":"count"}}

df.cols.groupby(by="State", agg = _agg).rows.sort("count_col", "desc").ext.display()

The aggregations supported are:

count
sum
mean
std
min
max
first
last
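
To make the shape of the _agg dict concrete ({output column: {input column: aggregation}}), here is a pure-Python sketch that interprets it over a list of row dicts. It mimics the structure only; it is not Optimus code, and the helper names are made up.

```python
import statistics
from collections import defaultdict


def _std(values):
    # Sample standard deviation; undefined for fewer than two values.
    return statistics.stdev(values) if len(values) > 1 else None


AGGREGATIONS = {
    "count": len,
    "sum": sum,
    "mean": statistics.mean,
    "std": _std,
    "min": min,
    "max": max,
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
}


def groupby(rows, by, agg):
    """Group row dicts by `by` and apply {output_col: {input_col: agg_name}}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    result = []
    for key, members in groups.items():
        out = {by: key}
        for output_col, spec in agg.items():
            for input_col, agg_name in spec.items():
                values = [member[input_col] for member in members]
                out[output_col] = AGGREGATIONS[agg_name](values)
        result.append(out)
    return result
```

A UI for this would need three inputs per aggregation, matching the three parts of each entry: the output name, the source column and the function.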

@luis11011 The user should be able to select a column and add multiple aggregations, specifying for each one the name of the output column, the column to aggregate and the aggregation function

Improve the UX when an operation highlights any row

These are the operations that highlight rows:

Toolbar

Drop duplicates
Drop nulls
Filter

Charts

Select bars in charts
Select bar in frequency

Quality bar

Missings
Mismatches

If there are matches, show only the rows that match
Make clear to the user which columns are being operated on
Show how many matches were found
If no matches were found, give feedback to the user

[FEA] Save data to databases

Right now Bumblebee can save data to files. It would be useful if the user could also save data to databases. Optimus already has these features, so we can implement them easily.