toloka / crowd-kit
Control the quality of your labeled data with the Python tools you already know.
Home Page: https://crowd-kit.readthedocs.io/
License: Other
We have implemented the learning subpackage and need to update the README.
https://github.com/Toloka/crowd-kit/blob/main/README.md
It looks like there should be TASK_LABEL_SCORES, not TASK_LABEL_PROBAS. Scores definitely would not sum up to 1.
Is it possible to support aggregation of ordinal labels as part of this toolkit via this reduction algorithm?
import crowdkit
# ...
mmsr = crowdkit.aggregation.classification.m_msr.MMSR(
    n_iter=10000,
    tol=1e-10,
    n_workers=len(worker_to_id),
    n_tasks=len(st2_int),
    n_labels=2,  # Assuming binary responses
    workers_mapping=worker_to_id,
    tasks_mapping=task_to_id,
    labels_mapping=label_to_id,
)
Exception has occurred: AttributeError
module 'crowdkit' has no attribute 'aggregation'
File "/Users/athundt/Documents/m3c/analyze_survey_results.py", line 62, in assess_worker_responses
mmsr = crowdkit.aggregation.classification.m_msr.MMSR(
^^^^^^^^^^^^^^^^^^^^
File "/Users/athundt/Documents/m3c/analyze_survey_results.py", line 120, in statistical_analysis
worker_skills = assess_worker_responses(binary_rank_df)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/athundt/Documents/m3c/analyze_survey_results.py", line 378, in main
aggregated_df = statistical_analysis(combined_df, args.network_models)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/athundt/Documents/m3c/analyze_survey_results.py", line 381, in <module>
main()
AttributeError: module 'crowdkit' has no attribute 'aggregation'
bugreport.py:
import crowdkit
def test_mmsr():
    try:
        mmsr = crowdkit.aggregation.classification.m_msr.MMSR
    except AttributeError as e:
        print(f"An error occurred: {e}")
    else:
        # Reached only when the attribute lookup succeeded
        print('it worked!')

test_mmsr()
I expected the MMSR constructor to be called.
Note this is how it is literally specified on the website, which should work if copied:
https://toloka.ai/docs/crowd-kit/reference/crowdkit.aggregation.classification.m_msr.MMSR/
MMSR
crowdkit.aggregation.classification.m_msr.MMSR | [Source code](https://github.com/Toloka/crowd-kit/blob/v1.2.1/crowdkit/aggregation/classification/m_msr.py#L17)
MMSR(
    self,
    n_iter: int = 10000,
    tol: float = 1e-10,
    random_state: Optional[int] = 0,
    observation_matrix: ... = _Nothing.NOTHING,
    covariation_matrix: ... = _Nothing.NOTHING,
    n_common_tasks: ... = _Nothing.NOTHING,
    n_workers: int = 0,
    n_tasks: int = 0,
    n_labels: int = 0,
    labels_mapping: Dict[Any, int] = _Nothing.NOTHING,
    workers_mapping: Dict[Any, int] = _Nothing.NOTHING,
    tasks_mapping: Dict[Any, int] = _Nothing.NOTHING
)
The following does work, but the code in the bug report should work too!
from crowdkit.aggregation import MMSR
def test_mmsr():
    try:
        mmsr = MMSR
    except AttributeError as e:
        print(f"An error occurred: {e}")
    else:
        # Reached only when the name resolved
        print('it worked!')

test_mmsr()
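A likely explanation, offered as an assumption rather than a confirmed diagnosis of crowd-kit 1.2.1: in Python, importing a top-level package does not bind its subpackages as attributes unless the package's __init__.py imports them itself. The sketch below reproduces that behaviour with a throwaway package written to a temporary directory. The names demo_pkg, sub and VALUE are hypothetical and exist only for this illustration.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway package on disk: demo_pkg/ with a subpackage demo_pkg/sub/.
# demo_pkg/__init__.py is empty, so it does not import its subpackage.
root = tempfile.mkdtemp()
sub_dir = os.path.join(root, "demo_pkg", "sub")
os.makedirs(sub_dir)
with open(os.path.join(root, "demo_pkg", "__init__.py"), "w") as f:
    f.write("")
with open(os.path.join(sub_dir, "__init__.py"), "w") as f:
    f.write("VALUE = 42\n")

sys.path.insert(0, root)
importlib.invalidate_caches()

demo_pkg = importlib.import_module("demo_pkg")

# Importing only the top-level package leaves the subpackage unbound.
attribute_before = hasattr(demo_pkg, "sub")

# Importing the subpackage explicitly binds it as an attribute of its parent.
importlib.import_module("demo_pkg.sub")
attribute_after = hasattr(demo_pkg, "sub")

print(attribute_before, attribute_after)
```

By the same rule, running import crowdkit.aggregation.classification.m_msr before the lookup should make the fully qualified name from the report resolvable, though that has not been checked against crowd-kit here.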
Thanks for giving this a look!
Python version: 3.11
Crowd-Kit version: 1.2.1
athundt@MacBook-Pro m3c % pip freeze
aiohttp==3.8.6
aiohttp-retry==2.8.3
aiosignal==1.3.1
amqp==5.1.1
annotated-types==0.6.0
antlr4-python3-runtime==4.9.3
appdirs==1.4.4
async-timeout==4.0.3
asyncssh==2.14.0
atpublic==4.0
attrs==23.1.0
billiard==4.1.0
blinker==1.7.0
boto3==1.28.82
botocore==1.31.82
celery==5.3.4
certifi==2023.7.22
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.3.0
colorama==0.4.6
configobj==5.0.8
crowd-kit==1.2.1
cryptography==41.0.5
dictdiffer==0.9.0
diskcache==5.6.3
distro==1.8.0
docopt==0.6.2
dpath==2.1.6
dulwich==0.21.6
dvc==3.28.0
dvc-data==2.20.0
dvc-http==2.30.2
dvc-objects==1.1.0
dvc-render==0.6.0
dvc-studio-client==0.15.0
dvc-task==0.3.0
dvclive==3.2.0
entrypoints==0.4
filelock==3.13.1
Flask==3.0.0
flatten-dict==0.4.2
flufl.lock==7.1.1
frozenlist==1.4.0
fsspec==2023.10.0
funcy==2.0
gitdb==4.0.11
GitPython==3.1.40
grandalf==0.8
gto==1.5.0
huggingface-hub==0.17.3
hydra-core==1.3.2
idna==3.4
iterative-telemetry==0.0.8
itsdangerous==2.1.2
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.3.2
kombu==5.3.2
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
networkx==3.2.1
nltk==3.8.1
numpy==1.26.1
omegaconf==2.3.0
orjson==3.9.10
packaging==23.2
pandas==2.1.2
pathspec==0.11.2
pipreqs==0.4.13
platformdirs==3.11.0
prompt-toolkit==3.0.39
psutil==5.9.6
pycparser==2.21
pydantic==2.4.2
pydantic_core==2.10.1
pydot==1.4.2
pygit2==1.13.2
Pygments==2.16.1
pygtrie==2.5.0
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
rich==13.6.0
ruamel.yaml==0.18.5
ruamel.yaml.clib==0.2.8
s3transfer==0.7.0
safetensors==0.4.0
scikit-learn==1.3.2
scipy==1.11.3
scmrepo==1.4.1
semver==3.0.2
shortuuid==1.0.11
shtab==1.6.4
six==1.16.0
smmap==5.0.1
sqltrie==0.8.0
sympy==1.12
tabulate==0.9.0
threadpoolctl==3.2.0
tokenizers==0.14.1
tomlkit==0.12.2
torch==2.1.0
tqdm==4.66.1
transformers==4.35.0
typer==0.9.0
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.0.7
vine==5.0.0
voluptuous==0.13.1
wcwidth==0.2.9
Werkzeug==3.0.1
yarg==0.1.9
yarl==1.9.2
zc.lockfile==3.0.post1
import crowdkit

def test_mmsr():
    try:
        mmsr = crowdkit.aggregation.classification.m_msr.MMSR
    except AttributeError as e:
        print(f"An error occurred: {e}")

test_mmsr()
An error occurred: module 'crowdkit' has no attribute 'aggregation'
I was using the CrowdSpeech dataset from crowd-kit and wanted to try some aggregation methods. I found that fit_predict works only with columns named 'task', 'worker', 'output', but in this dataset the columns are named 'task', 'performer', 'text'. So I got an error saying that worker and output were not in the index.
Nikita Pavlichenko suggested that I create this issue to rename the columns in the dataset.
Python version: 3.7
Crowd-Kit version: 1.0.0
from crowdkit.datasets import load_dataset
from crowdkit.aggregation import TextHRRASA
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer('all-mpnet-base-v2')
hrrasa = TextHRRASA(encoder=encoder.encode)
df, gt = load_dataset('crowdspeech-test-clean')
df['text'] = df['text'].apply(lambda s: s.lower())
result = hrrasa.fit_predict(df)
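Until the dataset's column names match what the aggregators expect, one possible workaround is to rename the columns before calling fit_predict. This is a minimal sketch on a hand-made frame. The rows are invented stand-ins for the dataset, and the target names 'worker' and 'output' are taken from the error described in this report.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by load_dataset
df = pd.DataFrame(
    {
        "task": ["t1", "t1"],
        "performer": ["w1", "w2"],
        "text": ["Hello World", "hello world"],
    }
)

# Map the dataset's column names to the names the aggregator reportedly expects
column_map = {"performer": "worker", "text": "output"}
renamed = df.rename(columns=column_map)

print(list(renamed.columns))
```

rename returns a new frame and leaves df untouched, so the original column names remain available for methods that still expect them.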
I am working on a dataset of ATP (Association of Tennis Professionals, men only) tennis matches spanning several years. I want to predict match outcomes, and one way to do that is a Bradley-Terry model, which is a probability model. What feature selection, feature engineering (not the domain-knowledge kind), or preprocessing must be applied before training the model?
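For orientation, a plain Bradley-Terry model takes only pairwise outcomes (who beat whom) as input, so the preprocessing it strictly needs is a list of winner and loser pairs. The sketch below fits player strengths with the standard minorization-maximization update. The function name fit_bradley_terry and the toy matches are hypothetical, and this is not crowd-kit's implementation.

```python
from collections import defaultdict


def fit_bradley_terry(matches, n_iter=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Assumes every player has at least one win; otherwise that player's
    strength collapses to zero.
    """
    wins = defaultdict(int)         # total wins per player
    pair_counts = defaultdict(int)  # number of games per unordered pair
    players = set()
    for winner, loser in matches:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        players.update((winner, loser))

    strength = {p: 1.0 for p in players}
    for _ in range(n_iter):
        updated = {}
        for i in players:
            # MM update: wins_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in players
                if j != i
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {p: s / total for p, s in updated.items()}
    return strength


# Toy data: A beats B 3 of 4, B beats C 2 of 3, A beats C 2 of 3
matches = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 2 + [("C", "B")]
    + [("A", "C")] * 2 + [("C", "A")]
)
strengths = fit_bradley_terry(matches)
print(sorted(strengths, key=strengths.get, reverse=True))
```

Under the model, the probability that player i beats player j is strength[i] / (strength[i] + strength[j]). Any additional features such as surface or ranking would require an extended model and are outside this sketch.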
When I use a matrix like
data = pd.DataFrame(
    [
        [1, 1, 0],
        [1, 2, 1],
        [1, 4, 1],
        [1, 5, 0],
        [2, 1, 1],
        [2, 2, 1],
        [2, 3, 1],
        [2, 4, 0],
        [2, 5, 0],
        [3, 1, 1],
        [3, 2, 0],
        [3, 3, 0],
        [3, 4, 1],
        [3, 5, 0],
        [4, 1, 1],
        [4, 2, 1],
        [4, 3, 1],
        [4, 4, 1],
        [4, 5, 1],
        [5, 1, 1],
        [5, 2, 0],
        [5, 3, 0],
        [5, 4, 0],
        [5, 5, 0],
    ],
    columns=['task', 'performer', 'label']
)
and try
DawidSkene(n_iter=100).fit_predict(data)
then I get
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "Redacted:\crowdkit\aggregation\dawid_skene.py", line 112, in fit_predict
return self.fit(data).labels_
File "Redacted:\crowdkit\aggregation\dawid_skene.py", line 94, in fit
probas = self._e_step(data, priors, errors)
File "Redacted:\crowdkit\aggregation\dawid_skene.py", line 62, in _e_step
joined = data.join(np.log2(errors), on=['performer', 'label'])
File "Redacted:\pandas\core\frame.py", line 8110, in join
return self._join_compat(
File "Redacted:\pandas\core\frame.py", line 8135, in _join_compat
return merge(
File "Redacted:\pandas\core\reshape\merge.py", line 89, in merge
return op.get_result()
File "Redacted:\pandas\core\reshape\merge.py", line 686, in get_result
llabels, rlabels = _items_overlap_with_suffix(
File "Redacted:\pandas\core\reshape\merge.py", line 2178, in _items_overlap_with_suffix
raise ValueError(f"columns overlap but no suffix specified: {to_rename}")
ValueError: columns overlap but no suffix specified: Index(['task'], dtype='object')
When I convert all the series to strings (object dtype), it works.
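The string conversion described above can be sketched as follows, using the first few rows of the frame from this report. Only the pandas conversion is shown. Whether DawidSkene then runs without the ValueError is the reporter's observation and is not verified here.

```python
import pandas as pd

# First rows of the integer-valued frame from the report
data = pd.DataFrame(
    [
        [1, 1, 0],
        [1, 2, 1],
        [2, 1, 1],
    ],
    columns=["task", "performer", "label"],
)

# Cast every column to strings so task, performer and label IDs
# are no longer integers
data_str = data.astype(str)

print(data_str["task"].tolist())
```

astype returns a new frame, so the integer version stays available if numeric labels are needed afterwards.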
Hi, I'm reviewing the library for JOSS (openjournals/joss-reviews#6227).
In your tutorials, you suggest using labeled_train_data.tsv, but I cannot find this data. Is it provided somewhere?
Thanks!
https://github.com/Toloka/crowd-kit/blob/main/examples/ECIR2023-Intents.ipynb
I wanted to compare quality metrics of several algorithms from crowdkit.aggregation.classification and found myself writing this kind of function:
def get_scores(model, data, fit=True):
    if fit:
        model.fit(data)
    probas = getattr(model, "probas_", None)
    if probas is not None:
        return probas
    predictor = getattr(model, "predict_score", None)
    if predictor is None:
        predictor = model.predict_proba
    return predictor(data)
That's because different models have different methods for retrieving scores. For example, MMSR has predict_score while almost all others have predict_proba. Some have a probas_ field, while others don't.
This seems strange and inconsistent.
Unify the naming of the predict_score functions and the presence of the probas_ field.
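To make the inconsistency concrete, the helper from this report can be exercised against three stand-in classes, one per access pattern. FieldModel, ScoreModel and ProbaModel are hypothetical stubs written for this illustration and are not crowd-kit classes. The returned values are made up.

```python
def get_scores(model, data, fit=True):
    """Helper from the report: try probas_, then predict_score, then predict_proba."""
    if fit:
        model.fit(data)
    probas = getattr(model, "probas_", None)
    if probas is not None:
        return probas
    predictor = getattr(model, "predict_score", None)
    if predictor is None:
        predictor = model.predict_proba
    return predictor(data)


class FieldModel:
    """Hypothetical stub: exposes scores only through a probas_ attribute."""

    def fit(self, data):
        self.probas_ = {"t1": {"cat": 0.8, "dog": 0.2}}
        return self


class ScoreModel:
    """Hypothetical stub: exposes scores only through predict_score."""

    def fit(self, data):
        return self

    def predict_score(self, data):
        return {"t1": {"cat": 3.5, "dog": -1.0}}


class ProbaModel:
    """Hypothetical stub: exposes scores only through predict_proba."""

    def fit(self, data):
        return self

    def predict_proba(self, data):
        return {"t1": {"cat": 0.6, "dog": 0.4}}


results = {
    type(model).__name__: get_scores(model, data=None)
    for model in (FieldModel(), ScoreModel(), ProbaModel())
}
for name, scores in results.items():
    print(name, scores)
```

The caller receives a result in every case but cannot tell from the return value whether it holds probabilities or unnormalized scores, which is the ambiguity a unified interface would remove.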
Hi,
the guide says
df = pd.read_csv('results.csv') # should contain columns: task, performer, label
but when I load this file, the second column is worker and not performer. I had used crowdkit with dataframes that had the columns task, performer, label, but after an update it broke.
On the page https://github.com/Toloka/crowd-kit, clicking any of the Method links (for example, those under Categorical Responses) leads to a 404 error.
However, these links work on the https://toloka.ai/docs/crowd-kit/ page.
https://github.com/Toloka/crowd-kit/blob/main/README.md
Would it be possible to add MACE? It is often used in my field, but there is only a Java implementation, which is hard to integrate into Python projects.
Hi,
I was trying to execute the code snippet provided as an example but it seems that the function is now in another castle.
This is the original snippet:
from crowdkit.aggregation import load_dataset
from crowdkit.aggregation import ROVER
df, gt = load_dataset('crowdspeech-test-clean')
df['text'] = df['text'].str.lower()
tokenizer = lambda s: s.split(' ')
detokenizer = lambda tokens: ' '.join(tokens)
result = ROVER(tokenizer, detokenizer).fit_predict(df)
and this is the same snippet with the first line corrected:
from crowdkit.datasets.load_dataset import load_dataset
from crowdkit.aggregation import ROVER
df, gt = load_dataset('crowdspeech-test-clean')
df['text'] = df['text'].str.lower()
tokenizer = lambda s: s.split(' ')
detokenizer = lambda tokens: ' '.join(tokens)
result = ROVER(tokenizer, detokenizer).fit_predict(df)
Thanks,
Marceau
https://toloka.ai/docs/crowd-kit/reference/crowdkit.aggregation.texts.rover.ROVER/
https://crowd-kit.readthedocs.io/en/latest/texts/#crowdkit.aggregation.texts.ROVER
I think that the first line, from crowdkit.aggregation import load_dataset, should be changed to from crowdkit.datasets.load_dataset import load_dataset (or from crowdkit.datasets import load_dataset).
I was using ROVER for textual response aggregation and found that it expects a column named 'text'. This looks inconsistent, because analogous methods such as TextRASA and TextHRRASA expect a column named 'output'.
I suggest unifying the functions' input, for example by using the name 'output' everywhere.
Python version: 3.7
Crowd-Kit version: 1.0.0
from crowdkit.aggregation import load_dataset
from crowdkit.aggregation import ROVER
df, gt = load_dataset('crowdspeech-test-clean')
df['text'] = df['text'].apply(lambda s: s.lower())
df=df.rename(columns={'performer': 'worker'})
df=df.rename(columns={'text': 'output'})
tokenizer = lambda s: s.split(' ')
detokenizer = lambda tokens: ' '.join(tokens)
result = ROVER(tokenizer, detokenizer).fit_predict(df)