koaning / bulk

A Simple Bulk Labelling Tool

License: MIT License

Python 74.86% Makefile 1.58% JavaScript 23.56%

bulk's Introduction

🙂 Vincent D. Warmerdam
┣━━ 📦 Open Source Packages
┃   ┣━━ bulk              - simple bulk labelling interface
┃   ┣━━ embetter          - embeddings ready for sklearn
┃   ┣━━ doubtlab          - suite of tools to help find bad labels
┃   ┣━━ drawdata          - draw datasets in jupyter
┃   ┣━━ scikit-lego       - lego bricks for sklearn
┃   ┣━━ scikit-partial    - partial_fit() pipelines for sklearn
┃   ┣━━ scikit-bloom      - bloom transformers for sklearn
┃   ┣━━ human-learn       - rule-based components for sklearn
┃   ┣━━ sentence-models   - a different take on textcat
┃   ┣━━ mktestdocs        - turn markdown files into pytest tests
┃   ┣━━ lazylines         - lightweight utils for .jsonl wrangling
┃   ┣━━ cluestar          - inspiration for your first text labels
┃   ┣━━ durations         - pytest duration insights
┃   ┣━━ tuilwindcss       - tailwindcss for textual tui apps
┃   ┣━━ memo              - saves a whole log of time
┃   ┣━━ skedulord         - makes cron a bit more fun
┃   ┣━━ icepickle         - cool and safe storage for linear models
┃   ┗━━ evol              - grammar for genetic heuristics
┣━━ 👍 Project Contributions
┃   ┣━━ fairlearn         - contributed the CorrelationFilter
┃   ┣━━ polars            - contributed the .pipe() method
┃   ┗━━ BERTopic          - added lightweight sklearn pipeline support
┣━━ ⭐ Online Projects
┃   ┣━━ calmcode.io       - intermediate developer education
┃   ┣━━ koaning.io        - personal blog
┃   ┗━━ dearme.email      - reflection via a 30 day delay
┣━━ 🎙️ Popular Talks
┃   ┣━━ Natural Intelligence is All You Need
┃   ┣━━ Group-by statements that save the day
┃   ┣━━ Tools to Improve Training Data
┃   ┣━━ Optimal on Paper, Broken in Reality
┃   ┣━━ Playing by the Rules-Based-Systems
┃   ┣━━ How to Constrain Artificial Stupidity
┃   ┣━━ The Profession of Solving the Wrong Problem
┃   ┣━━ Winning with Simple, even Linear, Models
┃   ┗━━ Untitled12.ipynb
┣━━ 🔬 Random Experiments
┃   ┣━━ scikit-prune   - prune scikit learn pipelines
┃   ┣━━ gitlit         - tracking github action times across open source
┃   ┣━━ sentimany      - many sentiment models, one repo
┃   ┣━━ tokenwiser     - sklearn token tricks
┃   ┣━━ clumper        - functional API for lists of dicts
┃   ┗━━ whatlies       - exploration tools for word embeddings
┗━━ 👨‍💻 Employer
    ┣━━ 🎲 :probabl.   - scikit-learn and friends
    ┃   ┣━━ scikit-churn      - safety rails for churn work
    ┃   ┗━━ scikit-playtime   - rethinking pipelines
    ┣━━ 💥 Explosion   - developer tools for nlp
    ┃   ┣━━ prodigy-hf        - Prodigy integration for the HuggingFace stack
    ┃   ┣━━ prodigy-pdf       - Annotate PDFs via Prodigy
    ┃   ┣━━ prodigy-ann       - ANN techniques to find relevant subsets
    ┃   ┣━━ prodigy-segment   - Prodigy integration for Segment Anything
    ┃   ┣━━ prodigy-lunr      - Search techniques to find relevant subsets
    ┃   ┣━━ prodigy-whisper   - Transcribe audio with OpenAI's whisper models
    ┃   ┣━━ prodigy-tui       - Prodigy from the terminal
    ┃   ┗━━ cluestar          - inspiration for your first text labels
    ┗━━ 🤖 Rasa        - conversational software provider
        ┣━━ nlu examples      - custom nlu components for Rasa
        ┣━━ taipo             - data augmentation tools
        ┗━━ algo whiteboard   - nlp education

Follow me on twitter @fishnets88

bulk's Issues

I noticed that saving doesn't work yet when using the packaged version: it just creates an empty CSV file. I tested this on two systems. It does work when running bokeh serve scripts/bulk_text.py --show.

(PS 1: note that the bulk_text.py file still says "meant to be ran via: bokeh serve scripts/main.py --show".)

(PS 2: the README still refers to "original.csv", which I believe should now be cluestarred.csv.)

Request: Save all columns from selected rows, not just text.

I'd rather preserve the rest of the columns from my df along with my subset of selections saved. I have other options to request too - what if you had a config yaml file for these?

Bulk Images Update

I have not posted here in a while, but I have been expanding the images version of bulk over the past month. It now has quite a few features, as well as code that wraps bulk around docarray, umap, and hdbscan to process images in multiple directories, embed them, flatten them, isolate clusters, and assign labels.

On the bokeh server side, it lets you hover over images and isolate different labels in a multiselect menu. (I am debugging a few things with the multiselect now so that it communicates with the scatter plot and DataTable.)

If you are interested, let me know and I can post a bit more.

P.S. I just saw your message on the PR. I am so sorry I missed your messages. I have been a bit busy with vacation/getting used to Croatia. I am back now and more than happy to help out with this, if you like. There is also a researcher at USHMM who has been working on his own version of this.

bokeh error does not render UMAP plot

Hi, thanks for developing this tool. Due to the following error, UMAP plots are not rendered. Not sure, but may be an easy fix.

Thanks again.

python -m bulk text ready.csv                                 
About to serve `bulk` over at http://localhost:5006/.
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name. This could either be due to a misspelling or typo, or due to an expected column being missing. : key "x" value "x", key "y" value "y" [renderer: GlyphRenderer(id='1049', ...)]

Request: Set Port

I know I can modify the code quickly to do this, but could you add a port option so I can run it remotely on a VM and control which port it uses?
Thanks!
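
A minimal sketch of what such an option could look like, using only the standard library. The `--port` flag and the `bulk-sketch` program name are hypothetical and are not existing bulk options; the default of 5006 is taken from the server logs quoted in other issues.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI: `--port` is the requested option, not an existing bulk flag.
    parser = argparse.ArgumentParser(prog="bulk-sketch")
    parser.add_argument("path", help="CSV file to serve")
    parser.add_argument(
        "--port",
        type=int,
        default=5006,  # the port seen in the quoted server logs
        help="port for the local server",
    )
    return parser


# Parse an example command line instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["ready.csv", "--port", "8080"])
print(args.path, args.port)
```

The parsed value would then be handed to whatever starts the server; how bulk wires that up internally is not shown in this thread.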

Add extras with UMAP alternatives.

minisom and https://pymde.org/ come to mind.

Add import statement to tutorial on text

Hi @koaning, really nice idea. I noticed that your first tutorial is missing a "from umap import UMAP" line.

Getting an error when using '--keywords' option in bulk

Hello @koaning thanks for such a great library!

I am getting an error when I use the '--keywords' option in bulk.
It works great if I don't use the '--keywords' option.
I wonder what the cause is.

python -m bulk text ready2.csv --keywords "frozen"

About to serve bulk over at http://localhost:5006/.
Uncaught exception GET / (::1)
HTTPServerRequest(protocol='http', host='localhost:5006', method='GET', uri='/', version='HTTP/1.1', remote_ip='::1')
Traceback (most recent call last):
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\tornado\web.py", line 1713, in _execute
    result = await result
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\server\views\doc_handler.py", line 54, in get
    session = await self.get_session()
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\server\views\session_handler.py", line 144, in get_session
    session = await self.application_context.create_session_if_needed(session_id, self.request, token)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\server\contexts.py", line 243, in create_session_if_needed
    self._application.initialize_document(doc)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\application\application.py", line 194, in initialize_document
    h.modify_document(doc)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\application\handlers\function.py", line 143, in modify_document
    self._func(doc)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bulk\text.py", line 27, in bkapp
    mapper, df = get_color_mapping(df)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bulk\utils.py", line 25, in get_color_mapping
    palette=Category10[len(all_values)],
KeyError: 1
500 GET / (::1) 28.95ms

Request: More efficient usage of browser space.

It would be great to fully use the space available in the browser window.

I.e., the number of rows/columns of images and the size of the graph should respond to the available resolution.

This is the goal of the interface.

something relies on X-display - not sure what

When I start the bulk server without an X display, it stops progressing. This was with text. When an X display is available, everything works smoothly.
I'm not sure where the dependency is. I'll add more information if I find any.
I had a similar issue in the past, which was solved by something like: mpl.use('Agg')

Rethink `--keywords`

The keywords mechanic is cool, but it does suck to have to reload the server. It still seems fine to be able to pass keywords at the start, but it does seem more user-friendly to be able to change them on the fly in the interface. Possibly, we might even allow for regex stuff.
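
As a rough illustration of the regex idea, the labelling step could be a pure function that is re-run whenever the patterns change, so no server restart is needed. This is only a sketch: `keyword_labels` is a hypothetical helper, not part of bulk, and the "none" fallback label is an assumption.

```python
import re
from typing import Iterable, List


def keyword_labels(texts: Iterable[str], patterns: List[str]) -> List[str]:
    """Return, for each text, the first pattern that matches, or 'none'."""
    compiled = [(p, re.compile(p, flags=re.IGNORECASE)) for p in patterns]
    labels = []
    for text in texts:
        # Take the first pattern whose regex is found anywhere in the text.
        hit = next((p for p, rx in compiled if rx.search(text)), "none")
        labels.append(hit)
    return labels


texts = ["Frozen pizza was great", "The GPU ran hot", "Nothing to see"]
print(keyword_labels(texts, ["frozen", r"gpu|cpu"]))
# → ['frozen', 'gpu|cpu', 'none']
```

Because every text always gets a label, a pattern that matches nothing simply yields only "none" values rather than a missing column.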

Next version

Some notes just for Vincent.

import altair as alt
import pandas as pd
import numpy as np

rand = np.random.RandomState(42)

df = pd.DataFrame({
    'xval': range(100),
    'yval': rand.randn(100).cumsum()
})

slider = alt.binding_range(min=0, max=100, step=1)
brush = alt.selection_interval(
    encodings=["x", "y"],
    on="[mousedown[event.shiftKey], mouseup] > mousemove",
    translate="[mousedown[event.shiftKey], mouseup] > mousemove!",
    zoom="wheel![event.shiftKey]",
)

interaction = alt.selection_interval(
    bind="scales",
    on="[mousedown[!event.shiftKey], mouseup] > mousemove",
    translate="[mousedown[!event.shiftKey], mouseup] > mousemove!",
    zoom="wheel![!event.shiftKey]",
)

chart = alt.Chart(df).mark_point().encode(
    x='xval',
    y='yval',
).add_params(
    interaction, brush
)

jchart = alt.JupyterChart(chart)
jchart

from bulk import SelectionWidget, EmbeddedTextInput, TextPreview, ImagePreview


widget = SelectionWidget(dataf, color)
query = EmbeddedTextInput(embed_func, decomp)
widget.set_preview(text_preview)
widget.set_input(text_input)

from bulk import SubsetSlider

slider = SubsetSlider(jsonl, display_fn, similarity_column)

Request: Save as HTML

Hi,

would be great if I could save the figure as .html file and send it to others. Is this possible?

Best

Request: Allow size dimensions on the umap plot

Another potential for simple command line, or yaml: what should the dimensions of the umap plot display be?

if a keyword is not there (in any of the datapoints) - exception

I looked for "GPU" in some text (just playing with 20newsgroups). Apparently it does not occur in any of the texts, and I got an "index not found" error (while it was looking up the color, or something like that).

Load json and jsonl files

Finding .csv files a bit limiting.

from pathlib import Path

import pandas as pd


def read_any(file: Path) -> pd.DataFrame:
    """Read a .jsonl, .json, or .csv file into a DataFrame, based on its suffix."""
    if file.suffix == ".jsonl":
        return pd.read_json(file, lines=True)
    if file.suffix == ".json":
        return pd.read_json(file)
    if file.suffix == ".csv":
        return pd.read_csv(file)
    raise ValueError(f"Can't read {file}.")

Help creating pipeline

Thanks for the great tool. I'm just getting started and got stuck making the pipeline. I've installed embetter and imported SentenceEncoder from embetter.text.
I've also installed sentence-transformers.
I'm running into an error that mentions 'pip install embetter[sbert]', even though I'm able to create a model using:
model = SentenceTransformer('all-MiniLM-L6-v2') (screenshot attached):

Can you help me - thanks.

Add 'bulk utils to-phrases'

I can probably port the to-phrases cli from gli to this project. Would make for an awesome blogpost later on.

Request for a new feature: Option to choose the size of the UMAP 2D fig

Description:
Would it be possible to add an option to set the size of the UMAP 2D figure? Even though the bokeh app uses "scale_both" for "sizing_mode", the size of the left part of the app (i.e. the 2D figure) is fixed in the code: plot_width = 300 (plot_width = 350 with color) and plot_height = 300.

Option:

As an example, the run command could be python -m bulk text [my_file.csv] -fig_size 700. This would set both plot_width and plot_height to 700.

Benefit:

When the data points are very close, it becomes difficult to make a selection. Enlarging the figures allows us to better distinguish the points in order to select the ones we want.

Clusterization capabilities

Hi.
Perhaps the tool could support not only manual labelling but also clustering algorithms, whose output could then be fixed manually with this tool.
If you agree, I would love to implement optional use of clustering techniques from sklearn, such as kmeans and dbscan.
I would also love to participate in other activities.

Add `info` command to help folks debug.

It should list the Python version, OS, Bokeh version and maybe embetter?
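
A minimal sketch of what such a command might collect, using only the standard library. `collect_info` is a hypothetical helper name, and the package list simply mirrors the ones mentioned above.

```python
import platform
import sys
from importlib import metadata


def collect_info(packages=("bulk", "bokeh", "embetter")) -> dict:
    """Gather version details that are useful in a bug report."""
    info = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Optional dependencies may be absent; report that instead of failing.
            info[name] = "not installed"
    return info


for key, value in collect_info().items():
    print(f"{key}: {value}")
```

Reporting "not installed" rather than raising keeps the command usable even when the optional embetter dependency is missing.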

warn when column is missing

I am currently having an issue with the text not appearing when selecting a cluster of data in the web app. However, the rest of the data seems to appear in the plot. I have followed the video and written tutorials closely but keep getting the same problem with any CSV files used.

Also, when saving the data to a new CSV file from the app, the content text is displayed correctly. This leads me to believe that it is primarily an issue with the text rendering, not the input data.

I've tried with multiple browsers (Chrome, Safari, and Edge) as well as multiple devices (both MacOS and Windows) but I'm still getting the same error.

I would greatly appreciate your help. Thank you!

Python version: 3.9.13
Bokeh version: 2.4.3
Bulk version: 0.2.0

Request: Inverse selection

It would be neat to be able to select points, drop them, and then save the rest. More like "sculpting" a class than selecting it, if it makes sense.
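
The "sculpting" idea boils down to taking the complement of the selected row positions. A small sketch, assuming the selection is available as a list of integer positions; `invert_selection` is a hypothetical name, not an existing bulk function.

```python
from typing import List, Sequence


def invert_selection(n_rows: int, selected: Sequence[int]) -> List[int]:
    """Return the positions of all rows that are NOT in `selected`."""
    chosen = set(selected)
    return [i for i in range(n_rows) if i not in chosen]


rows = ["a", "b", "c", "d", "e"]
dropped = [1, 3]  # positions the user selected and wants to drop
keep = invert_selection(len(rows), dropped)
print([rows[i] for i in keep])  # → ['a', 'c', 'e']
```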

Request: Color the "already saved" dots in the umap display

When someone isn't using a color column or color by args, could there be an option to color the items "already selected and saved"? So you can sort of work your way thru sections you haven't visited yet?

README missing sentences variable

Missing sentences variable in README code; Something like sentences = df["text"] was probably meant.

Add a "demo mode"

I was giving a PyData Eindhoven talk when I couldn't use my own laptop due to an AV issue. It would've been amazing if there'd been a live demo that I could work with.

So maybe ... we should host some examples and also make it so that you can add a --download flag to replace the save button with a download one. It'd be nice to host some of the standard examples as well as some of the community contributions.

Stuff like: #41

Add `bulk embed` command.

It should be possible, with embetter as an optional dependency, to add a bulk embed command to this project. Maybe use it like:

# For text
python -m bulk embed text file-in.jsonl file-out.jsonl --pipeline sentence-tfm --model LaBSE

# For images
python -m bulk embed image file-in.jsonl file-out.jsonl --pipeline timm --model VGG16

Things to think about:

how to keep the default pipelines lightweight
maybe also allow functions to be imported from a notebook
think about progress bars, maybe with rich

missing the demo .csv file

Hi, love the tool!
I am not able to run the demo though. Can you help please?

bulk image takes a while to load - write docs

Hi,

I tried to run bulk image on a dataset which was created according to: https://github.com/explosion/prodigy-recipes/blob/master/tutorials/bulk-images/make_pets.py

Unfortunately, running the command "python -m bulk image <my_csv>" only shows a blank bokeh server page on port 5006 without any errors or additional information in the terminal... Only output visible is: "About to serve bulk over at http://localhost:5006/."

Has anyone encountered this issue? Tried locally on macOS with bulk versions 0.1.0 - 0.1.3 and python 3.8-3.10 as well as within a docker-container, both with the same results (blank bokeh server page)

Would greatly appreciate any indications - thanks a lot!

Not working with Bokeh 3.0.0

Hello,

I was using bulk today for Twitter data. It worked great and I was blown away by the results, but I needed to uninstall Bokeh 3.0.0 and downgrade to Bokeh 2.4.3. Once I did this, everything worked well. I'm writing to ask if you plan on updating the package to Bokeh 3.0.0 and, if not, whether you could add a note about this version issue to the GH repo.

Apologies if this is the wrong place to bring this up and thanks so much for this package!

Thanks!

New release version to pypi

Hi, could you please make a new release? I would like to get #64 and install this project as dependency from pypi. Thank you.

Request: Add a 'label' column to the saved subset data

Along with the option to have more than just the text column saved, what if we could add a label column to the saved subset, to make it easier to manage the data and files. A little input box on the form...
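
A sketch of how the two requests (keep every column, add a label) could combine, written with the standard `csv` module so it runs without extra dependencies. `write_subset` is a hypothetical helper, the rows are made-up sample data, and the function assumes at least one row exists.

```python
import csv
import io
from typing import Dict, List, Sequence, TextIO


def write_subset(rows: List[Dict[str, str]], selected: Sequence[int],
                 label: str, out: TextIO) -> None:
    """Write the selected rows with every original column plus a `label` column."""
    fieldnames = list(rows[0].keys()) + ["label"]
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for i in selected:
        writer.writerow({**rows[i], "label": label})


# Made-up sample data standing in for the loaded file.
rows = [
    {"text": "hello", "x": "0.1", "y": "0.2"},
    {"text": "world", "x": "0.3", "y": "0.4"},
    {"text": "again", "x": "0.5", "y": "0.6"},
]
buffer = io.StringIO()
write_subset(rows, selected=[0, 2], label="greeting", out=buffer)
print(buffer.getvalue())
```

Writing to a file-like object keeps the helper easy to test; in practice `out` would be an opened file.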

Color mapping crashes if too many classes defined

Looks like I crashed the color mapping with too many classes.

The palette could be upped to Category20, though this would still limit the number of classes a user can visualize. Whatever the limit is, there should be an error message with a graceful exit if the user tries to define more. I'd be happy to make a PR!

  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/tornado/web.py", line 1713, in _execute
    result = await result
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/server/views/doc_handler.py", line 54, in get
    session = await self.get_session()
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/server/views/session_handler.py", line 144, in get_session
    session = await self.application_context.create_session_if_needed(session_id, self.request, token)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/server/contexts.py", line 243, in create_session_if_needed
    self._application.initialize_document(doc)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/application/application.py", line 194, in initialize_document
    h.modify_document(doc)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/application/handlers/function.py", line 143, in modify_document
    self._func(doc)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bulk/text.py", line 27, in bkapp
    mapper, df = get_color_mapping(df)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bulk/utils.py", line 25, in get_color_mapping
    palette=Category10[len(all_values)],
KeyError: 16
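
This traceback and the earlier `--keywords` report fail on the same line, `Category10[len(all_values)]`. Bokeh's `Category10` is a dict keyed by palette size, and as far as I know it only offers sizes 3 through 10, which would explain both `KeyError: 1` and `KeyError: 16`. The sketch below shows one possible guard with a readable error. It is not bulk's code: `pick_palette` and `CATEGORY10` are hypothetical names, and the colours are hard-coded so the example runs without bokeh.

```python
# The ten colours of the standard "category10" scheme, hard-coded for this sketch.
CATEGORY10 = [
    "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
    "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf",
]


def pick_palette(n_values: int, colours=CATEGORY10) -> list:
    """Return one colour per class, or fail with a readable message."""
    if n_values < 1:
        raise ValueError("Need at least one class to colour.")
    if n_values > len(colours):
        raise ValueError(
            f"Got {n_values} classes but the palette only has {len(colours)} colours."
        )
    # Slicing works for any size from 1 up to the palette length.
    return colours[:n_values]


print(pick_palette(1))
try:
    pick_palette(16)
except ValueError as err:
    print(err)
```

Slicing a flat list avoids the lookup by size entirely, so a single class works, and anything beyond the palette length produces a message instead of a bare `KeyError`.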

bulk text not rendering

Hi @koaning I was running bulk on text with my data (using the code snippet as in embed.py). It produces the dataframe (ready.csv) but it does not open the bokeh UI after running python -m bulk text ready.csv. It doesn't run any local server on the terminal and also does not show any error.

JS code for update

I am new to Bokeh. Could you provide JavaScript code for the update callback below, so that the code can also be used in standalone HTML/JS?

def update(attr, old, new):
    """Callback used for plot update when lasso selecting"""
    global highlighted_idx
    subset = df.iloc[new]
    highlighted_idx = new
    subset = subset.iloc[np.random.permutation(len(subset))]
    source.data = subset

Thank you.
