$ cd ./inferbert_datasets
$ python3 prepare_datasets.py --output-dir datasets
dataset location size train=808/dev=106/test=90
dataset color size train=372/dev=48/test=48
dataset trademark size train=960/dev=120/test=120
This command will:
- remove duplicates
- remove taggings made by excluded workers
- split the data into train / test (default: 75% / 25%)
- split by worker tagging session (HIT) rather than by single sample, to avoid information leakage between train and test
- create 3 dataset types:
$ python scripts/dataset_split.py --input-file ./dataset/hy_dataset_v1.json --output-dir dataset/
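
The session-level split described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in `prepare_datasets.py` or `scripts/dataset_split.py`: the function name `split_by_hit`, the `hit_id` field, and the `seed` parameter are assumptions made for the example. Note that in this sketch the 25% applies to the number of HITs, so the sample-level ratio can differ when HITs have different sizes.

```python
import random
from collections import defaultdict


def split_by_hit(samples, test_fraction=0.25, seed=0, hit_key="hit_id"):
    """Split samples into train/test so that no HIT ends up in both sets.

    `samples` is a list of dicts; `hit_key` (a hypothetical field name)
    identifies the tagging session each sample came from.
    """
    # Group samples by tagging session.
    by_hit = defaultdict(list)
    for sample in samples:
        by_hit[sample[hit_key]].append(sample)

    # Shuffle the HIT ids reproducibly, then cut at the requested fraction.
    hit_ids = sorted(by_hit)
    random.Random(seed).shuffle(hit_ids)
    n_test = round(len(hit_ids) * test_fraction)
    test_hits = set(hit_ids[:n_test])

    # Whole sessions go to one side or the other, never both.
    train, test = [], []
    for hit in hit_ids:
        (test if hit in test_hits else train).extend(by_hit[hit])
    return train, test
```

Because whole sessions are assigned to one side, samples tagged together in the same HIT cannot appear in both train and test.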