
bulk's Introduction

🙂 Vincent D. Warmerdam
┣━━ 📦 Open Source Packages
┃   ┣━━ bulk              - simple bulk labelling interface
┃   ┣━━ embetter          - embeddings ready for sklearn
┃   ┣━━ doubtlab          - suite of tools to help find bad labels
┃   ┣━━ drawdata          - draw datasets in jupyter
┃   ┣━━ scikit-lego       - lego bricks for sklearn
┃   ┣━━ scikit-partial    - partial_fit() pipelines for sklearn
┃   ┣━━ scikit-bloom      - bloom transformers for sklearn
┃   ┣━━ human-learn       - rule-based components for sklearn
┃   ┣━━ sentence-models   - a different take on textcat
┃   ┣━━ mktestdocs        - turn markdown files into pytest tests
┃   ┣━━ lazylines         - lightweight utils for .jsonl wrangling
┃   ┣━━ cluestar          - inspiration for your first text labels
┃   ┣━━ durations         - pytest duration insights
┃   ┣━━ tuilwindcss       - tailwindcss for textual tui apps
┃   ┣━━ memo              - saves a whole log of time
┃   ┣━━ skedulord         - makes cron a bit more fun
┃   ┣━━ icepickle         - cool and safe storage for linear models
┃   ┗━━ evol              - grammar for genetic heuristics
┣━━ 👏 Project Contributions
┃   ┣━━ fairlearn         - contributed the CorrelationFilter
┃   ┣━━ polars            - contributed the .pipe() method
┃   ┗━━ BERTopic          - added lightweight sklearn pipeline support
┣━━ ⭐ Online Projects
┃   ┣━━ calmcode.io       - intermediate developer education
┃   ┣━━ koaning.io        - personal blog
┃   ┗━━ dearme.email      - reflection via a 30 day delay
┣━━ 🎙️ Popular Talks
┃   ┣━━ Natural Intelligence is All You Need
┃   ┣━━ Group-by statements that save the day
┃   ┣━━ Tools to Improve Training Data
┃   ┣━━ Optimal on Paper, Broken in Reality
┃   ┣━━ Playing by the Rules-Based-Systems
┃   ┣━━ How to Constrain Artificial Stupidity
┃   ┣━━ The Profession of Solving the Wrong Problem
┃   ┣━━ Winning with Simple, even Linear, Models
┃   ┗━━ Untitled12.ipynb
┣━━ 🔬 Random Experiments
┃   ┣━━ scikit-prune   - prune scikit learn pipelines
┃   ┣━━ gitlit         - tracking github action times across open source
┃   ┣━━ sentimany      - many sentiment models, one repo
┃   ┣━━ tokenwiser     - sklearn token tricks
┃   ┣━━ clumper        - functional API for lists of dicts
┃   ┗━━ whatlies       - exploration tools for word embeddings
┗━━ 👨‍💻 Employer
    ┣━━ 🎲 :probabl.   - scikit-learn and friends
    ┃   ┣━━ scikit-churn      - safety rails for churn work
    ┃   ┗━━ scikit-playtime   - rethinking pipelines
    ┣━━ 💥 Explosion   - developer tools for nlp
    ┃   ┣━━ prodigy-hf        - Prodigy integration for the HuggingFace stack
    ┃   ┣━━ prodigy-pdf       - Annotate PDFs via Prodigy
    ┃   ┣━━ prodigy-ann       - ANN techniques to find relevant subsets
    ┃   ┣━━ prodigy-segment   - Prodigy integration for Segment Anything
    ┃   ┣━━ prodigy-lunr      - Search techniques to find relevant subsets
    ┃   ┣━━ prodigy-whisper   - Transcribe audio with OpenAI's whisper models
    ┃   ┣━━ prodigy-tui       - Prodigy from the terminal
    ┃   ┗━━ cluestar          - inspiration for your first text labels
    ┗━━ 🤖 Rasa        - conversational software provider
        ┣━━ nlu examples      - custom nlu components for Rasa
        ┣━━ taipo             - data augmentation tools
        ┗━━ algo whiteboard   - nlp education

Follow me on twitter @fishnets88

bulk's People

Contributors

brunogomescoelho, cgcooke, jefromyers, julesbelveze, kevin-m-smith, koaning, koernerfelicia, mpp-larsen, ondraz, zbenmo

bulk's Issues

segmentation fault

Awesome project, I've been meaning to check it out for a while.

I ran into this error when running python prep-data.py and wondered if anyone else encountered this issue.

I reduced the dataset to ~200 sentences in case it was a memory issue.

[1]    32890 segmentation fault  python prep-data.py
/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I'm using Poetry for package mgmt.

[tool.poetry.dependencies]
python = "^3.11"
embetter = "^0.3.8"
pandas = "2.0.0"
umap-learn = "^0.5.3"

Any tips or advice greatly appreciated.

saving

great stuff, very useful thanks!

I noticed that saving doesn't work for me yet when using the packaged version: it just creates an empty CSV file. I tested this on two systems. It does work when running bokeh serve scripts/bulk_text.py --show.

(PS 1: the bulk_text.py file still says "meant to be ran via: bokeh serve scripts/main.py --show".)

(PS 2: the readme still references "original.csv", which I believe should now be cluestarred.csv.)

Bulk Images Update

I have not posted here in a while, but I have been expanding the images version of bulk over the past month. It has quite a few features, as well as the code to wrap bulk around docarray, umap, and hdbscan for processing images in multiple directories: embedding them, flattening them, isolating clusters, and assigning labels.

On the bokeh server side, it allows you to hover over images as well as isolate different labels in a multiselect menu. (I am debugging a few things with multiselect now so that it communicates with the scatter plot and DataTable.)

If you are interested, let me know and I can post a bit more.

P.S. I just saw your message on the PR. I am so sorry I missed your messages. I have been a bit busy with vacation/getting used to Croatia. I am back now and more than happy to help out with this, if you like. There is also a researcher at USHMM who has been working on his own version of this.

bokeh error does not render UMAP plot

Hi, thanks for developing this tool. Due to the following error, UMAP plots are not rendered. Not sure, but may be an easy fix.

Thanks again.

python -m bulk text ready.csv                                 
About to serve `bulk` over at http://localhost:5006/.
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name. This could either be due to a misspelling or typo, or due to an expected column being missing. : key "x" value "x", key "y" value "y" [renderer: GlyphRenderer(id='1049', ...)]

Request: Set Port

I know I can modify the code quickly to do this, but could you add a port option so I can run it remotely on a VM and control which port it uses?
Thanks!
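
A minimal sketch of what such an option could look like, assuming an argparse-based command line. The `--port` flag and `build_parser` are hypothetical names used for illustration, not part of bulk's actual CLI; the default of 5006 is the port shown in the startup messages quoted elsewhere on this page.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI parser; bulk's real CLI may be structured differently."""
    parser = argparse.ArgumentParser(prog="bulk")
    parser.add_argument("kind", choices=["text", "image"], help="type of data to label")
    parser.add_argument("path", help="CSV file to load")
    # Assumed flag: 5006 is the port that appears in bulk's startup message.
    parser.add_argument("--port", type=int, default=5006, help="port to serve on")
    return parser


args = build_parser().parse_args(["text", "ready.csv", "--port", "8080"])
print(f"About to serve on http://localhost:{args.port}/")
```

The parsed value would then need to be handed to whatever starts the server, which is the part that requires a change in bulk itself.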

Getting an error when using '--keywords' option in bulk

Hello @koaning thanks for such a great library!

I am getting an error when I use the '--keywords' option in bulk.
It works great if I don't use the '--keywords' option.
I wonder what the cause is.

python -m bulk text ready2.csv --keywords "frozen"

About to serve bulk over at http://localhost:5006/.
Uncaught exception GET / (::1)
HTTPServerRequest(protocol='http', host='localhost:5006', method='GET', uri='/', version='HTTP/1.1', remote_ip='::1')
Traceback (most recent call last):
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\tornado\web.py", line 1713, in _execute
    result = await result
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\server\views\doc_handler.py", line 54, in get
    session = await self.get_session()
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\server\views\session_handler.py", line 144, in get_session
    session = await self.application_context.create_session_if_needed(session_id, self.request, token)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\server\contexts.py", line 243, in create_session_if_needed
    self._application.initialize_document(doc)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\application\application.py", line 194, in initialize_document
    h.modify_document(doc)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bokeh\application\handlers\function.py", line 143, in modify_document
    self._func(doc)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bulk\text.py", line 27, in bkapp
    mapper, df = get_color_mapping(df)
  File "C:\Users\kiyi2001\Miniconda3\envs\SemanticMatching\lib\site-packages\bulk\utils.py", line 25, in get_color_mapping
    palette=Category10[len(all_values)],
KeyError: 1
500 GET / (::1) 28.95ms

Request: More efficient usage of browser space.

It would be great to fully use the floor space available in the browser window.

I.e., the number of rows/columns of images and the size of the graph should respond to the available resolution.

something relies on X-display - not sure what

When I start the bulk server without an X display, something does not progress. This was with text. If I have an X display available, everything runs like a breeze.
I'm not sure where the dependency is. I'll add more information if I find any.
I had a similar issue in the past, which was solved by something like: mpl.use('Agg')
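
A hedged sketch of the workaround hinted at above. It has not been confirmed that matplotlib is what needs the display here, so this is only something to try. `headless_env` is a hypothetical helper; `MPLBACKEND` is the environment variable matplotlib consults for its backend, and it has to be set before matplotlib is first imported.

```python
import os
from typing import Dict, Mapping


def headless_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Copy `env`, selecting the non-interactive Agg backend when DISPLAY is unset."""
    result = dict(env)
    if not result.get("DISPLAY"):
        # Only fill in a backend if the user has not chosen one already.
        result.setdefault("MPLBACKEND", "Agg")
    return result


# Apply to the current process before anything imports matplotlib.
os.environ.update(headless_env(os.environ))
```

Exporting `MPLBACKEND=Agg` in the shell before starting the server should have the same effect as calling `mpl.use('Agg')` in code.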

Rethink `--keywords`

The keywords mechanic is cool, but it does suck to have to reload the server. It still seems fine to be able to pass keywords at the start, but it does seem more user-friendly to be able to change them on the fly in the interface. Possibly, we might even allow for regex stuff.
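
To make the regex idea concrete, here is a small sketch of the matching step only, written with the standard library. `highlight_mask` is a hypothetical helper, not an existing bulk function; wiring it to a text input in the interface is left out.

```python
import re
from typing import Iterable, List


def highlight_mask(texts: Iterable[str], patterns: Iterable[str]) -> List[bool]:
    """Return True for each text matching at least one pattern (case-insensitive)."""
    compiled = [re.compile(p, flags=re.IGNORECASE) for p in patterns]
    return [any(rx.search(text) for rx in compiled) for text in texts]


texts = ["Frozen pizza is on sale", "Fresh bread daily", "freezer bags, 20 pack"]
mask = highlight_mask(texts, [r"\bfroz\w*", r"\bfreez\w*"])
print(mask)
```

Because the mask is recomputed from plain strings, it could be re-run whenever the user edits the keywords, without restarting the server.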

Next version

Some notes just for Vincent.

import altair as alt
import pandas as pd
import numpy as np

rand = np.random.RandomState(42)

df = pd.DataFrame({
    'xval': range(100),
    'yval': rand.randn(100).cumsum()
})

slider = alt.binding_range(min=0, max=100, step=1)
brush = alt.selection_interval(
    encodings=["x", "y"],
    on="[mousedown[event.shiftKey], mouseup] > mousemove",
    translate="[mousedown[event.shiftKey], mouseup] > mousemove!",
    zoom="wheel![event.shiftKey]",
)

interaction = alt.selection_interval(
    bind="scales",
    on="[mousedown[!event.shiftKey], mouseup] > mousemove",
    translate="[mousedown[!event.shiftKey], mouseup] > mousemove!",
    zoom="wheel![!event.shiftKey]",
)

chart = alt.Chart(df).mark_point().encode(
    x='xval',
    y='yval',
).add_params(
    interaction, brush
)

jchart = alt.JupyterChart(chart)
jchart

from bulk import SelectionWidget, EmbeddedTextInput, TextPreview, ImagePreview

widget = SelectionWidget(dataf, color)
query = EmbeddedTextInput(embed_func, decomp)
widget.set_preview(text_preview)
widget.set_input(text_input)

from bulk import SubsetSlider

slider = SubsetSlider(jsonl, display_fn, similarity_column)

Request: Save as HTML

Hi,

would be great if I could save the figure as .html file and send it to others. Is this possible?

Best

Load json and jsonl files

Finding .csv files a bit limiting.

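from pathlib import Path

import pandas as pd


def read_any(file: Path) -> pd.DataFrame:
    """Load a .jsonl, .json or .csv file into a DataFrame based on its suffix."""
    suffix = file.suffix.lower()
    if suffix == ".jsonl":
        return pd.read_json(file, lines=True)
    if suffix == ".json":
        return pd.read_json(file)
    if suffix == ".csv":
        return pd.read_csv(file)
    raise ValueError(f"Can't read {file}.")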

Help creating pipeline

Thanks for the great tool. I'm just getting started and got stuck making the pipeline. I've installed embetter and imported SentenceEncoder from embetter.text.
I've also installed sentence-transformers.
I'm running into an error with 'pip install embetter[sbert]', though I'm able to create a model using:
model = SentenceTransformer('all-MiniLM-L6-v2')

(screenshot attached)

Can you help me - thanks.

Add 'bulk utils to-phrases'

I can probably port the to-phrases cli from gli to this project. Would make for an awesome blogpost later on.

Request for a new feature: Option to choose the size of the UMAP 2D fig

Description:
Could it be possible to add an option in order to fix the size of the UMAP 2D fig? Indeed, even if the bokeh app is in "scale_both" mode for "sizing_mode", the size of the left part of the app (i.e. the 2D fig) is fixed: plot_width = 300 (plot_width = 350 with color) and plot_height = 300. See in code

Option:

  • As an example, the run command could be python -m bulk text [my_file.csv] -fig_size 700. It will fix plot_width and plot_height to 700.

Benefit:

  • When the data points are very close, it becomes difficult to make a selection. Enlarging the figures allows us to better distinguish the points in order to select the ones we want.

Clusterization capabilities

Hi.
Probably not only manual labeling: certain clustering algorithms could be applied first and the results then fixed manually with this tool.
If you agree, I would love to implement optional use of clustering techniques from sklearn, such as KMeans and DBSCAN.
I would also love to participate in other activities.

warn when column is missing

I am currently having an issue with the text not appearing when selecting a cluster of data in the web app. However, the rest of the data seems to appear in the plot. I have followed the video and written tutorials closely but keep getting the same problem with any CSV files used.

Also, when saving the data to a new CSV file from the app, the content text is displayed correctly. This leads me to believe that is primarily an issue with the text rendering not the input data.

I've tried with multiple browsers (Chrome, Safari, and Edge) as well as multiple devices (both MacOS and Windows) but I'm still getting the same error.

I would greatly appreciate your help. Thank you!

Python version: 3.9.13
Bokeh version: 2.4.3
Bulk version: 0.2.0
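
A minimal sketch of the kind of check the issue title asks for, using only the standard library. `missing_columns` is a hypothetical helper, and the default list of required columns (`x`, `y`, `text`) is an assumption about what the text interface expects; adjust it to the columns the app actually reads.

```python
import csv
import io
from typing import Iterable, List, TextIO


def missing_columns(handle: TextIO, required: Iterable[str] = ("x", "y", "text")) -> List[str]:
    """Return the required column names that are absent from a CSV header."""
    header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Example file whose text lives in a differently named column.
sample = io.StringIO("x,y,content\n0.1,0.2,hello\n")
missing = missing_columns(sample)
if missing:
    print(f"Warning: missing column(s): {', '.join(missing)}")
```

Running such a check before the server starts would turn a silently empty table into an explicit warning.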

Request: Inverse selection

It would be neat to be able to select points, drop them, and then save the rest. More like "sculpting" a class than selecting it, if it makes sense.
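
The "sculpting" idea boils down to taking the complement of the selected row indices. A small sketch, where `invert_selection` is a hypothetical helper rather than an existing bulk function:

```python
from typing import Iterable, List


def invert_selection(n_rows: int, selected: Iterable[int]) -> List[int]:
    """Return the row indices that were NOT selected, in ascending order."""
    chosen = set(selected)
    return [i for i in range(n_rows) if i not in chosen]


# Six rows in total; rows 1 and 4 were lasso-selected and should be dropped.
keep = invert_selection(6, [1, 4])
print(keep)
```

The resulting indices could then be used to pick the rows to save, for example with `df.iloc[keep]` on a pandas DataFrame.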

Add a "demo mode"

I was giving a PyData Eindhoven talk when I couldn't use my own laptop due to an AV issue. It would've been amazing if there'd been a live demo that I could work with.

So maybe ... we should host some examples and also make it so that you can add a --download flag to replace the save button with a download one. It'd be nice to host some of the standard examples as well as some of the community contributions.

Stuff like: #41

Add `bulk embed` command.

It should be possible, with embetter as an optional dependency, to add a bulk embed command to this project. Maybe use it like:

# For text
python -m bulk embed text file-in.jsonl file-out.jsonl --pipeline sentence-tfm --model LaBSE

# For images
python -m bulk embed image file-in.jsonl file-out.jsonl --pipeline timm --model VGG16

Things to think about:

  • how to keep the default pipelines lightweight
  • maybe also allow functions to be imported from a notebook
  • think about progress bars, maybe with rich

bulk image takes a while to load - write docs

Hi,

I tried to run bulk image on a dataset which was created according to: https://github.com/explosion/prodigy-recipes/blob/master/tutorials/bulk-images/make_pets.py

Unfortunately, running the command "python -m bulk image <my_csv>" only shows a blank bokeh server page on port 5006 without any errors or additional information in the terminal... Only output visible is: "About to serve bulk over at http://localhost:5006/."

Has anyone encountered this issue? Tried locally on macOS with bulk versions 0.1.0 - 0.1.3 and python 3.8-3.10 as well as within a docker-container, both with the same results (blank bokeh server page)

Would greatly appreciate any indications - thanks a lot!

Not working with Bokeh 3.0.0

Hello,

I was using bulk today for Twitter data. It worked great and I was blown away by the results, but I needed to uninstall Bokeh 3.0.0 and downgrade to Bokeh 2.4.3. Once I did this, everything worked well. I'm writing to ask if you plan on updating the package to Bokeh 3.0.0 and, if you're not, if you could make a note about this version issue in the GH repo.

Apologies if this is the wrong place to bring this up and thanks so much for this package!

Thanks!

New release version to pypi

Hi, could you please make a new release? I would like to get #64 and install this project as dependency from pypi. Thank you.

Color mapping crashes if too many classes defined

Looks like I crashed the color mapping with too many classes.

The palette could be upped to Category20, though this would still limit the number of classes a user can visualize. Whatever the limit is, there should be an error message with a graceful exit if the user tries to define more. I'd be happy to make a PR!

  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/tornado/web.py", line 1713, in _execute
    result = await result
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/server/views/doc_handler.py", line 54, in get
    session = await self.get_session()
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/server/views/session_handler.py", line 144, in get_session
    session = await self.application_context.create_session_if_needed(session_id, self.request, token)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/server/contexts.py", line 243, in create_session_if_needed
    self._application.initialize_document(doc)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/application/application.py", line 194, in initialize_document
    h.modify_document(doc)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bokeh/application/handlers/function.py", line 143, in modify_document
    self._func(doc)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bulk/text.py", line 27, in bkapp
    mapper, df = get_color_mapping(df)
  File ".pyenv/versions/rules-env/lib/python3.8/site-packages/bulk/utils.py", line 25, in get_color_mapping
    palette=Category10[len(all_values)],
KeyError: 16
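
The traceback above shows the palette being looked up by size (`Category10[len(all_values)]`), which fails for 16 classes here and for a single class in the `--keywords` report elsewhere on this page. Below is a sketch of one way to avoid the size-keyed lookup and exit gracefully, under stated assumptions: `pick_colors` is a hypothetical helper, and the ten d3 "category10" colours are hard-coded so the example runs without Bokeh installed.

```python
from typing import List, Sequence

# The ten d3 "category10" colours, hard-coded so the sketch runs without Bokeh.
CATEGORY10: Sequence[str] = (
    "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
    "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf",
)


def pick_colors(n_classes: int, palette: Sequence[str] = CATEGORY10) -> List[str]:
    """Return one colour per class, or raise a readable error if that is impossible."""
    if n_classes < 1:
        raise ValueError("need at least one class to colour")
    if n_classes > len(palette):
        raise ValueError(
            f"{n_classes} classes requested, but the palette only has "
            f"{len(palette)} colours; reduce the number of classes."
        )
    # Slicing works for any size from 1 up to the palette length.
    return list(palette[:n_classes])


print(pick_colors(1))
```

Passing a longer colour list as `palette` raises the limit, but as the issue notes, some upper bound remains, so the explicit error is still needed.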

bulk text not rendering

Hi @koaning I was running bulk on text with my data (using the code snippet as in embed.py). It produces the dataframe (ready.csv) but it does not open the bokeh UI after running python -m bulk text ready.csv. It doesn't run any local server on the terminal and also does not show any error.

JS code for update

I am new to Bokeh. Can you provide JavaScript code for the update function callback, so that the code can be used for standalone HTML/JS too?

def update(attr, old, new):
    """Callback used for plot update when lasso selecting"""
    global highlighted_idx
    subset = df.iloc[new]
    highlighted_idx = new
    subset = subset.iloc[np.random.permutation(len(subset))]
    source.data = subset

Thank you.
