koaning / bulk

A Simple Bulk Labelling Tool

License: MIT License

Python 74.86% Makefile 1.58% JavaScript 23.56%

bulk's Issues

Request: Add a 'label' column to the saved subset data

Along with the option to have more than just the text column saved, what if we could add a label column to the saved subset, to make it easier to manage the data and files. A little input box on the form...

Add `bulk embed` command.

It should be possible, with embetter as an optional dependency, to add a bulk embed command to this project. Maybe use it like:

# For text
python -m bulk embed text file-in.jsonl file-out.jsonl --pipeline sentence-tfm --model LaBSE

# For images
python -m bulk embed image file-in.jsonl file-out.jsonl --pipeline timm --model VGG16

Things to think about:

how to keep the default pipelines lightweight
maybe also allow functions to be imported from a notebook
think about progress bars, maybe with rich

JS code for update

I am new to Bokeh. Can you provide JavaScript code for the update function callback, so that the code can be used for standalone HTML/JS too?
def update(attr, old, new):
    """Callback used for plot update when lasso selecting"""
    global highlighted_idx
    subset = df.iloc[new]
    highlighted_idx = new
    subset = subset.iloc[np.random.permutation(len(subset))]
    source.data = subset

Thank you.

This is the goal of the interface.

Getting an error when using '--keywords' option in bulk

Hello @koaning thanks for such a great library!

I am getting an error when I am using '--keywords' option in the bulk.
It works great, if I don't use '--keywords' option ..
I wonder what the cause is ..

python -m bulk text ready2.csv --keywords "frozen"

About to serve bulk over at http://localhost:5006/.
Uncaught exception GET / (::1)
HTTPServerRequest(protocol='http', host='localhost:5006', method='GET', uri='/', version='HTTP/1.1', remote_ip='::1')
Traceback (most recent call last):
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\tornado\web.py", line 1713, in _execute
    result = await result
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\server\views\doc_handler.py", line 54, in get
    session = await self.get_session()
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\server\views\session_handler.py", line 144, in get_session
    session = await self.application_context.create_session_if_needed(session_id, self.request, token)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\server\contexts.py", line 243, in create_session_if_needed
    self._application.initialize_document(doc)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\application\application.py", line 194, in initialize_document
    h.modify_document(doc)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\application\handlers\function.py", line 143, in modify_document
    self._func(doc)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bulk\text.py", line 27, in bkapp
    mapper, df = get_color_mapping(df)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bulk\utils.py", line 25, in get_color_mapping
    palette=Category10[len(all_values)],
KeyError: 1
500 GET / (::1) 28.95ms
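
The `KeyError: 1` at the bottom of the traceback comes from indexing `Category10` with the number of distinct values. In Bokeh, `Category10` is a dict keyed by palette sizes 3 through 10, so a column with only one (or two) distinct values has no entry. Below is a minimal sketch of a guard. The helper name `pick_palette` is hypothetical and not part of bulk, and the ten colors are hard-coded only so the example runs without Bokeh installed.

```python
# The ten Category10 colors, hard-coded so this sketch is self-contained.
CATEGORY10 = [
    "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
    "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf",
]


def pick_palette(n_values: int) -> list:
    """Return n_values colors, including for sizes 1 and 2 that have no dict key."""
    if n_values < 1:
        raise ValueError("need at least one distinct value to color")
    if n_values > len(CATEGORY10):
        raise ValueError(
            f"at most {len(CATEGORY10)} distinct values are supported, got {n_values}"
        )
    # Slicing a list works for any size from 1 to 10, unlike the dict lookup.
    return CATEGORY10[:n_values]
```

With a guard like this, a single keyword such as "frozen" that produces one distinct value would get one color instead of raising.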

Add 'bulk utils to-phrases'

I can probably port the to-phrases cli from gli to this project. Would make for an awesome blogpost later on.

Clusterization capabilities

Hi.
Beyond manual labeling, clustering algorithms could be applied first and their output then fixed manually with this tool.
If you agree, I would love to implement optional use of clustering techniques from sklearn, such as KMeans and DBSCAN, in this tool.
I would also love to participate in other activities.

bulk text not rendering

Hi @koaning I was running bulk on text with my data (using the code snippet as in embed.py). It produces the dataframe (ready.csv) but it does not open the bokeh UI after running python -m bulk text ready.csv. It doesn't run any local server on the terminal and also does not show any error.

if a keyword is not there (in any of the datapoints) - exception

I looked for "GPU" in some text bodies (just playing with 20newsgroups). Apparently it does not appear in any of the texts, and the result was an index-not-found error (while looking up a color or so).

Request: Save as HTML

Hi,

would be great if I could save the figure as .html file and send it to others. Is this possible?

Best

New release version to pypi

Hi, could you please make a new release? I would like to get #64 and install this project as dependency from pypi. Thank you.

Add extras with UMAP alternatives.

minisom and https://pymde.org/ come to mind.

something relies on X-display - not sure what

When I start the bulk server, something is not progressing. This was with text. If I have an X display available, everything works like a breeze.
I am not sure where the problem is. I'll add more information if I find any.
I had a similar issue in the past, this was solved by something like: mpl.use('Agg')

warn when column is missing

I am currently having an issue with the text not appearing when selecting a cluster of data in the web app. However, the rest of the data seems to appear in the plot. I have followed the video and written tutorials closely but keep getting the same problem with any CSV files used.

Also, when saving the data to a new CSV file from the app, the content text is displayed correctly. This leads me to believe that this is primarily an issue with the text rendering, not the input data.

I've tried with multiple browsers (Chrome, Safari, and Edge) as well as multiple devices (both MacOS and Windows) but I'm still getting the same error.

I would greatly appreciate your help. Thank you!

Python version: 3.9.13
Bokeh version: 2.4.3
Bulk version: 0.2.0

Add `info` command to help folks debug.

It should list the Python version, OS, Bokeh version and maybe embetter?
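
A minimal sketch of what such a command could collect, using only the standard library. The function name `collect_info` is hypothetical, and the distribution names `bokeh`, `embetter` and `bulk` are assumptions about how the packages are registered; a package that is not installed is reported as such rather than raising.

```python
import platform
import sys
from importlib import metadata


def collect_info(packages=("bokeh", "embetter", "bulk")):
    """Gather version details that are useful in a bug report."""
    info = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Optional dependencies may be absent; report that instead of failing.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_info().items():
        print(f"{key}: {value}")
```

Printing plain `key: value` lines keeps the output easy to paste into an issue.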

Request: Save all columns from selected rows, not just text.

I'd rather preserve the rest of the columns from my df along with my subset of selections saved. I have other options to request too - what if you had a config yaml file for these?
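
One way this could look, sketched with the standard `csv` module: every column of the selected rows is written out, and an optional label column is appended. The function `save_subset`, its parameters, and the column name `label` are hypothetical and only illustrate the request; this is not how bulk saves data today.

```python
import csv


def save_subset(rows, selected_idx, path, label=None):
    """Write the selected rows with all of their columns, plus an optional label column.

    `rows` is a list of dicts (one per data point); returns the number of rows written.
    """
    # Copy each row so the caller's data is not modified when the label is added.
    subset = [dict(rows[i]) for i in selected_idx]
    if label is not None:
        for row in subset:
            row["label"] = label
    if not subset:
        return 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(subset[0].keys()))
        writer.writeheader()
        writer.writerows(subset)
    return len(subset)
```

Returning early on an empty selection avoids creating a file that contains nothing.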

Help creating pipeline

Thanks for the great tool. I'm just getting started and got stuck making the pipeline. I've installed embetter.text and imported SentenceEncoder.
I've also installed sentence-transformers.
I'm running into an error with `pip install embetter[sbert]`, though I'm able to create a model using:
model = SentenceTransformer('all-MiniLM-L6-v2')
Screenshot attached:

Can you help me - thanks.

saving

great stuff, very useful thanks!

I noticed that saving doesn't work for me yet when using the packaged version: it just creates an empty CSV file. I tested this on 2 systems. It does work when running bokeh serve scripts/bulk_text.py --show

(ps1. note that the bulk_text.py file still refers to "meant to be ran via: bokeh serve scripts/main.py --show")

(ps2. there is still a reference in readme to "original.csv" which i believe now refers to cluestarred.csv)

Request: Set Port

I know I can mod the code quickly to do this, but can you allow setting a port, so I can run it remotely on a VM and control which port it uses?
Thanks!

Next version

Some notes just for Vincent.

import altair as alt
import pandas as pd
import numpy as np

rand = np.random.RandomState(42)

df = pd.DataFrame({
    'xval': range(100),
    'yval': rand.randn(100).cumsum()
})

slider = alt.binding_range(min=0, max=100, step=1)
brush = alt.selection_interval(
    encodings=["x", "y"],
    on="[mousedown[event.shiftKey], mouseup] > mousemove",
    translate="[mousedown[event.shiftKey], mouseup] > mousemove!",
    zoom="wheel![event.shiftKey]",
)

interaction = alt.selection_interval(
    bind="scales",
    on="[mousedown[!event.shiftKey], mouseup] > mousemove",
    translate="[mousedown[!event.shiftKey], mouseup] > mousemove!",
    zoom="wheel![!event.shiftKey]",
)

chart = alt.Chart(df).mark_point().encode(
    x='xval',
    y='yval',
).add_params(
    interaction, brush
)

jchart = alt.JupyterChart(chart)
jchart

from bulk import SelectionWidget, EmbeddedTextInput, TextPreview, ImagePreview


widget = SelectionWidget(dataf, color)
query = EmbeddedTextInput(embed_func, decomp)
widget.set_preview(text_preview)
widget.set_input(text_input)

from bulk import SubsetSlider

slider = SubsetSlider(jsonl, display_fn, similarity_column)

Load json and jsonl files

Finding .csv files a bit limiting.

from pathlib import Path

import pandas as pd


def read_any(file: Path) -> pd.DataFrame:
    """Read a .jsonl, .json or .csv file into a DataFrame, based on its suffix."""
    if file.suffix == ".jsonl":
        return pd.read_json(file, lines=True)
    if file.suffix == ".json":
        return pd.read_json(file)
    if file.suffix == ".csv":
        return pd.read_csv(file)
    raise ValueError(f"Can't read {file}.")

Bulk Images Update

I have not posted here in a while, but I have been expanding the images version of bulk over the past month. It has quite a few features as well as the code to wrap bulk around docarray, umap, and hdbscan for processing images in multiple directories, embed them, flatten them, and isolate clusters and assign labels.

On the bokeh server side, it allows you to hover over images as well as isolate different labels in a multiselect menu. (I am debugging a few things with multiselect now so that it communicates with the scatter plot and DataTable.)

If you are interested, let me know and I can post a bit more.

P.S. I just saw your message on the PR. I am so sorry I missed your messages. I have been a bit busy with vacation/getting used to Croatia. I am back now and more than happy to help out with this, if you like. There is also a researcher at USHMM who has been working on his own version of this.

Add a "demo mode"

I was giving a PyData Eindhoven talk when I couldn't use my own laptop due to an AV issue. It would've been amazing if there'd been a live demo that I could work with.

So maybe ... we should host some examples and also make it so that you can add a --download flag to replace the save button with a download one. It'd be nice to host some of the standard examples as well as some of the community contributions.

Stuff like: #41

Request: Allow size dimensions on the umap plot

Another potential option for the command line or a YAML config: what should the dimensions of the umap plot display be?

missing the demo .csv file

Hi, love the tool!
I am not able to run the demo though. Can you help please?

Request: Inverse selection

It would be neat to be able to select points, drop them, and then save the rest. More like "sculpting" a class than selecting it, if it makes sense.
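
The "sculpting" idea amounts to taking the complement of the selected indices. A minimal sketch, assuming the selection arrives as a list of row positions (as in a lasso callback); the helper name `inverse_selection` is hypothetical.

```python
def inverse_selection(n_rows, selected_idx):
    """Return the positions of every row that was NOT selected, in original order."""
    dropped = set(selected_idx)  # set lookup keeps this linear in n_rows
    return [i for i in range(n_rows) if i not in dropped]


# Selecting rows 1 and 3 out of 5 leaves rows 0, 2 and 4 to be saved.
remaining = inverse_selection(5, [1, 3])
print(remaining)  # [0, 2, 4]
```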

Not working with Bokeh 3.0.0

Hello,

I was using bulk today for twitter data. It worked great and I was blown away by the results, but I needed to uninstall Bokeh 3.0.0 and downgrade to Bokeh 2.4.3. Once I did this, everything worked well. I'm writing to ask if you plan on updating the package to Bokeh 3.0.0 and, if you're not, if you could make a note about this version issue in the GH repo.

Apologies if this is the wrong place to bring this up and thanks so much for this package!

Thanks!

bokeh error does not render UMAP plot

Hi, thanks for developing this tool. Due to the following error, UMAP plots are not rendered. Not sure, but may be an easy fix.

Thanks again.

python -m bulk text ready.csv                                 
About to serve `bulk` over at http://localhost:5006/.
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name. This could either be due to a misspelling or typo, or due to an expected column being missing. : key "x" value "x", key "y" value "y" [renderer: GlyphRenderer(id='1049', ...)]

README missing sentences variable

Missing sentences variable in README code; Something like sentences = df["text"] was probably meant.

segmentation fault

Awesome project, I've been meaning to check it out for a while.

I ran into this error when running python prep-data.py and wondered if anyone else encountered this issue.

I reduced the dataset to ~200 sentences in case it was a memory issue.

[1]    32890 segmentation fault  python prep-data.py
/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I'm using Poetry for package mgmt.

[tool.poetry.dependencies]
python = "^3.11"
embetter = "^0.3.8"
pandas = "2.0.0"
umap-learn = "^0.5.3"

Any tips or advice greatly appreciated.

Rethink `--keywords`

The keywords mechanic is cool, but it does suck to have to reload the server. It still seems fine to be able to pass keywords at the start, but it does seem more user-friendly to be able to change them on the fly in the interface. Possibly, we might even allow for regex stuff.
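
A minimal sketch of the regex idea, assuming the labels are recomputed from the raw texts whenever the pattern changes (for example inside a text-input callback). The function `label_by_pattern` and the fallback label `"none"` are hypothetical and not bulk's current behaviour.

```python
import re


def label_by_pattern(texts, pattern, no_match="none"):
    """Label each text with its first regex match (lowercased), or `no_match`."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    labels = []
    for text in texts:
        found = regex.search(text)
        labels.append(found.group(0).lower() if found else no_match)
    return labels


labels = label_by_pattern(["Frozen pizza", "fresh bread"], r"frozen|gpu")
print(labels)  # ['frozen', 'none']
```

When the pattern matches nothing at all, every label is `"none"`, so whatever colors the points needs to cope with a single distinct value.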

Add import statement to tutorial on text

Hi @koaning, really nice idea. I noticed that your first tutorial is lacking a "from umap import UMAP"

Request: More efficient usage of browser space.

It would be great to fully use the floorspace available in the browser window.

I.e., the number of rows/columns of images and the size of the graph should respond to the available resolution.

bulk image takes a while to load - write docs

Hi,

I tried to run bulk image on a dataset which was created according to: https://github.com/explosion/prodigy-recipes/blob/master/tutorials/bulk-images/make_pets.py

Unfortunately, running the command "python -m bulk image <my_csv>" only shows a blank bokeh server page on port 5006 without any errors or additional information in the terminal... Only output visible is: "About to serve bulk over at http://localhost:5006/."

Has anyone encountered this issue? Tried locally on macOS with bulk versions 0.1.0 - 0.1.3 and python 3.8-3.10 as well as within a docker-container, both with the same results (blank bokeh server page)

Would greatly appreciate any indications - thanks a lot!

Request for a new feature: Option to choose the size of the UMAP 2D fig

Description:
Could it be possible to add an option in order to fix the size of the UMAP 2D fig? Indeed, even if the bokeh app is in "scale_both" mode for "sizing_mode", the size of the left part of the app (i.e. the 2D fig) is fixed: plot_width = 300 (plot_width = 350 with color) and plot_height = 300. See in code

Option:

As an example, the run command could be python -m bulk text [my_file.csv] -fig_size 700. It will fix plot_width and plot_height to 700.

Benefit:

When the data points are very close, it becomes difficult to make a selection. Enlarging the figures allows us to better distinguish the points in order to select the ones we want.

Color mapping crashes if too many classes defined

Looks like I crashed the color mapping with too many classes.

The palette could be upped to Category20, though this would still limit the number of classes a user can visualize. Whatever the limit is, there should be an error message with a graceful exit if the user tries to define more. I'd be happy to make a PR!

  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/tornado/web.py", line 1713, in _execute
    result = await result
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/server/views/doc_handler.py", line 54, in get
    session = await self.get_session()
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/server/views/session_handler.py", line 144, in get_session
    session = await self.application_context.create_session_if_needed(session_id, self.request, token)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/server/contexts.py", line 243, in create_session_if_needed
    self._application.initialize_document(doc)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/application/application.py", line 194, in initialize_document
    h.modify_document(doc)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/application/handlers/function.py", line 143, in modify_document
    self._func(doc)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bulk/text.py", line 27, in bkapp
    mapper, df = get_color_mapping(df)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bulk/utils.py", line 25, in get_color_mapping
    palette=Category10[len(all_values)],
KeyError: 16
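
A minimal sketch of the graceful exit proposed above: count the distinct classes before any palette lookup and raise a readable error when there are too many. The helper `check_class_count` is hypothetical, and the default limit of 20 assumes the palette is upped to Category20 as suggested.

```python
def check_class_count(values, limit=20):
    """Return the sorted distinct values, or raise a readable error when there are too many."""
    distinct = sorted(set(values))
    if len(distinct) > limit:
        raise ValueError(
            f"Found {len(distinct)} distinct classes, but the palette supports "
            f"at most {limit}. Merge some classes or drop the color column."
        )
    return distinct
```

The caller can catch the `ValueError`, print the message and exit, instead of surfacing a bare `KeyError` from the palette lookup.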

Request: Color the "already saved" dots in the umap display

When someone isn't using a color column or color by args, could there be an option to color the items "already selected and saved"? So you can sort of work your way thru sections you haven't visited yet?
