koaning / bulk
A Simple Bulk Labelling Tool
License: MIT License
Along with the option to have more than just the text column saved, what if we could add a label column to the saved subset, to make it easier to manage the data and files? A little input box on the form...
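A minimal sketch of what saving a labelled subset could look like, assuming the selected rows are available as plain dicts and the label comes from an input box. The function name `write_labelled_subset` and the `label_column` parameter are hypothetical and not part of bulk; the sketch only uses the standard library `csv` module.

```python
import csv
import io


def write_labelled_subset(rows, selected_idx, label, out, label_column="label"):
    """Write the selected rows with all their columns plus a label column.

    `rows` is a list of dicts, `selected_idx` the positions picked in the UI,
    and `out` any writable text file object. Returns the number of rows written.
    """
    subset = [rows[i] for i in selected_idx]
    if not subset:
        return 0
    fieldnames = list(subset[0].keys())
    if label_column not in fieldnames:
        fieldnames.append(label_column)
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in subset:
        # Every original column is preserved; the label is added (or overwritten).
        writer.writerow({**row, label_column: label})
    return len(subset)


rows = [
    {"text": "frozen pizza", "x": "0.1", "y": "0.4"},
    {"text": "gpu drivers", "x": "0.9", "y": "0.2"},
    {"text": "frozen peas", "x": "0.2", "y": "0.5"},
]
buffer = io.StringIO()
n_written = write_labelled_subset(rows, [0, 2], "food", buffer)
saved = buffer.getvalue()
```

Keeping every input column and appending the label as the last one means the saved file can be joined back to the original data without extra bookkeeping.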
It should be possible, with embetter as an optional dependency, to add a bulk embed command to this project. Maybe use it like:
# For text
python -m bulk embed text file-in.jsonl file-out.jsonl --pipeline sentence-tfm --model LaBSE
# For images
python -m bulk embed image file-in.jsonl file-out.jsonl --pipeline timm --model VGG16
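The proposed invocation could be parsed roughly as follows. This is only a sketch of the argument layout using the standard library `argparse`; the project's real CLI may be built with a different framework, and the `embed` subcommand, its positional arguments, and the `--pipeline`/`--model` flags are taken from the proposal above rather than from any existing bulk release.

```python
import argparse


def build_parser():
    """Build a parser for the proposed `bulk embed` subcommand (hypothetical)."""
    parser = argparse.ArgumentParser(prog="bulk")
    subparsers = parser.add_subparsers(dest="command", required=True)

    embed = subparsers.add_parser("embed", help="Embed a .jsonl file of texts or images.")
    embed.add_argument("kind", choices=["text", "image"], help="Type of data to embed.")
    embed.add_argument("file_in", help="Input .jsonl file.")
    embed.add_argument("file_out", help="Output .jsonl file with embeddings added.")
    embed.add_argument("--pipeline", required=True, help="Embedding backend, e.g. sentence-tfm or timm.")
    embed.add_argument("--model", required=True, help="Model name passed to the backend.")
    return parser


# Parse the text example from the proposal without touching sys.argv.
args = build_parser().parse_args(
    ["embed", "text", "file-in.jsonl", "file-out.jsonl",
     "--pipeline", "sentence-tfm", "--model", "LaBSE"]
)
```

Restricting `kind` to `text` and `image` lets the parser reject unsupported data types before any model is loaded.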
Things to think about:
I am new to Bokeh. Could you provide JavaScript code for the update callback, so that the code can also be used in standalone HTML/JS?
def update(attr, old, new):
    """Callback used for plot update when lasso selecting"""
    global highlighted_idx
    subset = df.iloc[new]
    highlighted_idx = new
    subset = subset.iloc[np.random.permutation(len(subset))]
    source.data = subset
Thank you.
Hello @koaning thanks for such a great library!
I am getting an error when I use the '--keywords' option in bulk.
It works great if I don't use the '--keywords' option.
I wonder what the cause is.
python -m bulk text ready2.csv --keywords "frozen"
About to serve bulk over at http://localhost:5006/.
Uncaught exception GET / (::1)
HTTPServerRequest(protocol='http', host='localhost:5006', method='GET', uri='/', version='HTTP/1.1', remote_ip='::1')
Traceback (most recent call last):
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\tornado\web.py", line 1713, in _execute
    result = await result
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\server\views\doc_handler.py", line 54, in get
    session = await self.get_session()
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\server\views\session_handler.py", line 144, in get_session
    session = await self.application_context.create_session_if_needed(session_id, self.request, token)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\server\contexts.py", line 243, in create_session_if_needed
    self._application.initialize_document(doc)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\application\application.py", line 194, in initialize_document
    h.modify_document(doc)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\application\handlers\function.py", line 143, in modify_document
    self._func(doc)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bulk\text.py", line 27, in bkapp
    mapper, df = get_color_mapping(df)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bulk\utils.py", line 25, in get_color_mapping
    palette=Category10[len(all_values)],
KeyError: 1
500 GET / (::1) 28.95ms
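The traceback points at `Category10[len(all_values)]`, and `KeyError: 1` suggests that the palette lookup has no entry for a single distinct value. In the Bokeh versions I am aware of, `Category10` is a dict keyed by palette size starting at 3, which would explain the failure. Below is a sketch of a lookup that tolerates small counts. `PALETTE_BY_SIZE` is a stand-in with the same shape so the example runs without Bokeh, and `pick_palette` is a hypothetical helper, not a function in bulk.

```python
# Stand-in for a size-keyed palette dict such as bokeh.palettes.Category10
# (assumption: keys run from 3 to 10, each mapping to a tuple of hex colors).
_TAB10 = (
    "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
    "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf",
)
PALETTE_BY_SIZE = {n: _TAB10[:n] for n in range(3, 11)}


def pick_palette(n_values, palette_by_size=PALETTE_BY_SIZE):
    """Return exactly `n_values` colors, even when n_values is below the smallest key."""
    smallest = min(palette_by_size)
    # Look up the smallest available palette for 0, 1 or 2 values, then trim it.
    size = max(n_values, smallest)
    return list(palette_by_size[size][:n_values])


one_color = pick_palette(1)
two_colors = pick_palette(2)
ten_colors = pick_palette(10)
```

This sketch only addresses the lower bound; asking for more colors than the largest palette offers still raises a `KeyError`.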
I can probably port the to-phrases cli from gli to this project. Would make for an awesome blogpost later on.
Hi.
Besides manual labelling, clustering algorithms could be applied first and the results then fixed manually with this tool.
If you agree, I would love to implement optional use of clustering techniques from sklearn, such as k-means and DBSCAN, in this tool.
I would also love to participate in other activities.
Hi @koaning I was running bulk on text with my data (using the code snippet as in embed.py). It produces the dataframe (ready.csv) but it does not open the bokeh UI after running python -m bulk text ready.csv. It doesn't run any local server on the terminal and also does not show any error.
Looked for "GPU" in some text body (just playing with 20newsgroups). Apparently not in any of the texts. Index not found (searching for color or so).
Hi,
It would be great if I could save the figure as an .html file and send it to others. Is this possible?
Best
Hi, could you please make a new release? I would like to get #64 and install this project as dependency from pypi. Thank you.
minisom and https://pymde.org/ come to mind.
When I start the bulk server, something is not progressing. This was with text. If I have X display available, everything is like a breeze.
Not sure where the problem is. I'll add more information if I find any.
I had a similar issue in the past, this was solved by something like: mpl.use('Agg')
I am currently having an issue with the text not appearing when selecting a cluster of data in the web app. However, the rest of the data seems to appear in the plot. I have followed the video and written tutorials closely but keep getting the same problem with any CSV files used.
Also, when saving the data to a new CSV file from the app, the content text is displayed correctly. This leads me to believe that this is primarily an issue with the text rendering, not the input data.
I've tried with multiple browsers (Chrome, Safari, and Edge) as well as multiple devices (both MacOS and Windows) but I'm still getting the same error.
I would greatly appreciate your help. Thank you!
Python version: 3.9.13
Bokeh version: 2.4.3
Bulk version: 0.2.0
It should list the Python version, OS, Bokeh version and maybe embetter?
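One way such a report could be assembled, as a sketch using only the standard library. The helper name `collect_versions` and the default package list are assumptions; packages that are not installed are reported as such rather than raising.

```python
import platform
from importlib import metadata


def collect_versions(packages=("bokeh", "bulk", "embetter")):
    """Return a dict with the Python version, OS, and installed package versions."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Optional dependencies such as embetter may legitimately be absent.
            info[name] = "not installed"
    return info


report = collect_versions()
for key, value in report.items():
    print(f"{key}: {value}")
```

Printing this at the top of a bug report would cover the details that issues like the one above list by hand.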
I'd rather preserve the rest of the columns from my df along with my subset of selections saved. I have other options to request too - what if you had a config yaml file for these?
Thanks for the great tool. I'm just getting started and got stuck making the pipeline. I've installed embetter.text and imported SentenceEncoder.
I've also installed sentence-transformers.
I'm running into an error with 'pip install embetter[sbert]', though I'm able to create a model using:
model = SentenceTransformer('all-MiniLM-L6-v2') (screenshot attached):
Can you help me - thanks.
great stuff, very useful thanks!
I noticed that saving doesn't work for me yet when using the packaged version: it just creates an empty CSV file. I tested this on 2 systems. It does work when running bokeh serve scripts/bulk_text.py --show
(ps1. note that the bulk_text.py file still says "meant to be ran via: bokeh serve scripts/main.py --show")
(ps2. there is still a reference in the readme to "original.csv", which I believe now refers to cluestarred.csv)
I know I can modify the code quickly to do this, but could you add a port option so that I can run it remotely on a VM and control which port it uses?
Thanks!
Some notes just for Vincent.
import altair as alt
import pandas as pd
import numpy as np
rand = np.random.RandomState(42)
df = pd.DataFrame({
    'xval': range(100),
    'yval': rand.randn(100).cumsum()
})
slider = alt.binding_range(min=0, max=100, step=1)
brush = alt.selection_interval(
    encodings=["x", "y"],
    on="[mousedown[event.shiftKey], mouseup] > mousemove",
    translate="[mousedown[event.shiftKey], mouseup] > mousemove!",
    zoom="wheel![event.shiftKey]",
)
interaction = alt.selection_interval(
    bind="scales",
    on="[mousedown[!event.shiftKey], mouseup] > mousemove",
    translate="[mousedown[!event.shiftKey], mouseup] > mousemove!",
    zoom="wheel![!event.shiftKey]",
)
chart = alt.Chart(df).mark_point().encode(
    x='xval',
    y='yval',
).add_params(
    interaction, brush
)
jchart = alt.JupyterChart(chart)
jchart
from bulk import SelectionWidget, EmbeddedTextInput, TextPreview, ImagePreview
widget = SelectionWidget(dataf, color)
query = EmbeddedTextInput(embed_func, decomp)
widget.set_preview(text_preview)
widget.set_input(text_input)
from bulk import SubsetSlider
slider = SubsetSlider(jsonl, display_fn, similarity_column)
Finding .csv files a bit limiting.
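from pathlib import Path

import pandas as pd


def read_any(file: Path) -> pd.DataFrame:
    """Read a .jsonl, .json, or .csv file into a DataFrame based on its extension."""
    suffix = file.suffix.lower()
    if suffix == ".jsonl":
        return pd.read_json(file, lines=True)
    if suffix == ".json":
        return pd.read_json(file)
    if suffix == ".csv":
        return pd.read_csv(file)
    raise ValueError(f"Can't read {file}.")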
I have not posted here in a while, but I have been expanding the images version of bulk over the past month. It has quite a few features as well as the code to wrap bulk around docarray, umap, and hdbscan for processing images in multiple directories, embed them, flatten them, and isolate clusters and assign labels.
On the bokeh server side, it allows you to hover over images as well as isolate different labels in a multiselect menu. (I am debugging a few things with multiselect now so that it communicates with the scatter plot and DataTable.)
If you are interested, let me know and I can post a bit more.
P.S. I just saw your message on the PR. I am so sorry I missed your messages. I have been a bit busy with vacation/getting used to Croatia. I am back now and more than happy to help out with this, if you like. There is also a researcher at USHMM who has been working on his own version of this.
I was giving a PyData Eindhoven talk when I couldn't use my own laptop due to an AV issue. It would've been amazing if there'd been a live demo that I could work with.
So maybe ... we should host some examples and also make it so that you can add a --download flag to replace the save button with a download one. It'd be nice to host some of the standard examples as well as some of the community contributions.
Stuff like: #41
Another potential option for the command line or a YAML config: what should the dimensions of the umap plot display be?
Hi, love the tool!
I am not able to run the demo though. Can you help please?
It would be neat to be able to select points, drop them, and then save the rest. More like "sculpting" a class than selecting it, if it makes sense.
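The "sculpting" idea could be sketched as below, assuming the data is a list of rows and each lasso selection yields positions in the original data. The `Sculptor` class is hypothetical and not part of bulk. It tracks dropped positions against the original order so that repeated selections stay consistent.

```python
class Sculptor:
    """Keep track of rows dropped across several selections (illustrative only)."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.dropped = set()

    def drop(self, selected_idx):
        """Mark the selected original positions as dropped."""
        self.dropped.update(selected_idx)

    def remaining(self):
        """Return the rows that were never dropped, in their original order."""
        return [row for i, row in enumerate(self.rows) if i not in self.dropped]


sculptor = Sculptor(["a", "b", "c", "d", "e"])
sculptor.drop([1, 3])   # first lasso selection
sculptor.drop([0])      # second lasso selection
kept = sculptor.remaining()
```

Saving `kept` instead of the selection itself gives the "save the rest" behaviour described above.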
Hello,
I was using bulk today for twitter data. It worked great and I was blown away by the results, but I needed to uninstall Bokeh 3.0.0 and downgrade to Bokeh 2.4.3. Once I did this, everything worked well. I'm writing to ask if you plan on updating the package to Bokeh 3.0.0 and, if you're not, if you could make a note about this version issue in the GH repo.
Apologies if this is the wrong place to bring this up and thanks so much for this package!
Thanks!
Hi, thanks for developing this tool. Due to the following error, UMAP plots are not rendered. Not sure, but may be an easy fix.
Thanks again.
python -m bulk text ready.csv
About to serve `bulk` over at http://localhost:5006/.
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name. This could either be due to a misspelling or typo, or due to an expected column being missing. : key "x" value "x", key "y" value "y" [renderer: GlyphRenderer(id='1049', ...)]
Missing sentences variable in README code; something like sentences = df["text"] was probably meant.
Awesome project, I've been meaning to check it out for a while.
I ran into this error when running python prep-data.py and wondered if anyone else encountered this issue.
I reduced the dataset to ~200 sentences in case it was a memory issue.
[1] 32890 segmentation fault python prep-data.py
/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
I'm using Poetry for package mgmt.
[tool.poetry.dependencies]
python = "^3.11"
embetter = "^0.3.8"
pandas = "2.0.0"
umap-learn = "^0.5.3"
Any tips or advice greatly appreciated.
The keywords mechanic is cool, but it does suck to have to reload the server. It still seems fine to be able to pass keywords in the beginning, but it does seem more user-friendly to be able to change them on the fly in the interface. Possibly, we might even allow for regex stuff.
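A sketch of how keyword matching with optional regex support could work once the keywords come from an input box instead of the command line. The function names are hypothetical, and case-insensitive matching is a choice made for this example; it is not necessarily how bulk matches keywords today.

```python
import re


def compile_keywords(keywords, use_regex=False):
    """Combine keywords into one pattern; plain keywords are escaped first."""
    parts = [k if use_regex else re.escape(k) for k in keywords if k]
    if not parts:
        return None
    return re.compile("|".join(f"(?:{p})" for p in parts), flags=re.IGNORECASE)


def highlight_flags(texts, pattern):
    """Return one boolean per text saying whether any keyword matches."""
    if pattern is None:
        return [False] * len(texts)
    return [bool(pattern.search(text)) for text in texts]


texts = ["Frozen pizza", "GPU drivers", "fresh peas"]
plain_flags = highlight_flags(texts, compile_keywords(["frozen"]))
regex_flags = highlight_flags(texts, compile_keywords([r"fr\w+"], use_regex=True))
```

Recomputing the flags inside a widget callback would let the highlighting change without restarting the server.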
Hi @koaning, really nice idea. I noticed that your first tutorial is missing a "from umap import UMAP" import.
It would be great to fully use the floor space available in the browser window.
I.e., the number of rows/columns of images and the size of the graph should respond to the available resolution.
Hi,
I tried to run bulk image on a dataset which was created according to: https://github.com/explosion/prodigy-recipes/blob/master/tutorials/bulk-images/make_pets.py
Unfortunately, running the command "python -m bulk image <my_csv>" only shows a blank bokeh server page on port 5006 without any errors or additional information in the terminal... Only output visible is: "About to serve bulk over at http://localhost:5006/."
Has anyone encountered this issue? Tried locally on macOS with bulk versions 0.1.0 - 0.1.3 and python 3.8-3.10 as well as within a docker-container, both with the same results (blank bokeh server page)
Would greatly appreciate any indications - thanks a lot!
Description:
Would it be possible to add an option to set the size of the UMAP 2D fig? Even if the bokeh app is in "scale_both" mode for "sizing_mode", the size of the left part of the app (i.e. the 2D fig) is fixed: plot_width = 300 (plot_width = 350 with color) and plot_height = 300. See in code
Option:
python -m bulk text [my_file.csv] -fig_size 700. It will fix plot_width and plot_height to 700.
Benefit:
Looks like I crashed the color mapping with too many classes.
The palette could be upped to Category20, though this would still limit the number of classes a user can visualize. Whatever the limit is, there should be an error message with a graceful exit if the user tries to define more. I'd be happy to make a PR!
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/tornado/web.py", line 1713, in _execute
    result = await result
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/server/views/doc_handler.py", line 54, in get
    session = await self.get_session()
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/server/views/session_handler.py", line 144, in get_session
    session = await self.application_context.create_session_if_needed(session_id, self.request, token)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/server/contexts.py", line 243, in create_session_if_needed
    self._application.initialize_document(doc)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/application/application.py", line 194, in initialize_document
    h.modify_document(doc)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/application/handlers/function.py", line 143, in modify_document
    self._func(doc)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bulk/text.py", line 27, in bkapp
    mapper, df = get_color_mapping(df)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bulk/utils.py", line 25, in get_color_mapping
    palette=Category10[len(all_values)],
KeyError: 16
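The graceful exit suggested above could look roughly like this: count the distinct color values up front and raise a readable error before any palette lookup happens. The names `TooManyClassesError` and `validate_color_classes` are hypothetical, and the default limit of 20 assumes a switch to a 20-color palette such as Category20.

```python
class TooManyClassesError(ValueError):
    """Raised when the color column has more distinct values than the palette supports."""


def validate_color_classes(values, max_classes=20):
    """Return the sorted distinct values, or raise a readable error if there are too many."""
    distinct = sorted(set(values))
    if len(distinct) > max_classes:
        raise TooManyClassesError(
            f"Found {len(distinct)} distinct color values, but at most "
            f"{max_classes} can be shown. Consider merging rare classes "
            f"or leaving out the color column."
        )
    return distinct


# 16 classes, as in the traceback above: fine with a limit of 20, rejected with 10.
labels = [f"class_{i:02d}" for i in range(16)]
accepted = validate_color_classes(labels, max_classes=20)
try:
    validate_color_classes(labels, max_classes=10)
    message = ""
except TooManyClassesError as err:
    message = str(err)
```

A CLI entry point could catch `TooManyClassesError` and pass the message to `sys.exit`, so the user sees one line of explanation instead of a server traceback.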
When someone isn't using a color column or color by args, could there be an option to color the items "already selected and saved"? So you can sort of work your way thru sections you haven't visited yet?