datacomp's Issues

FileNotFoundError while downloading DataComp-1B

Thanks for the great work.
I encountered the following issue while downloading the DataComp-1B dataset:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/envs/datacomp/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/opt/conda/envs/datacomp/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/opt/conda/envs/datacomp/lib/python3.10/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory

Is it possible to implement data filtering in the training script?

It seems the current data filtering process is:

  1. Filter metadata.parquet to get sample_ids.npy —> 2. Reshard based on sample_ids.npy and save the new dataset using resharder.py.

But I think an ideal data filtering process would be:

  1. Filter the data shards themselves to get sample_ids.npy, because the shards contain more information such as image pixel values —> once sample_ids.npy is generated, the training script applies it directly when constructing the dataset. This way we would not need to store the data multiple times, which is time- and storage-consuming.

I know things will be different when the data expands to the billion scale, and there must be reasons you adopted the existing solution.

But I still wonder whether the second solution (sketched below) is possible or easy to implement, at least for the small track?
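
To make option 2 concrete, here is a minimal sketch, assuming sample_ids.npy holds plain uid strings and a local shard layout (the repo's actual .npy may pack uids differently, so the membership test would need a matching conversion):

import numpy as np
import webdataset as wds

# Hypothetical sketch of option 2: keep the full shards on disk and filter
# inside the training pipeline using the selected uids.
keep = set(np.load("sample_ids.npy").tolist())  # assumed: plain uid strings

dataset = (
    wds.WebDataset("shards/{00000000..00000127}.tar")  # assumed shard pattern
    .decode("pilrgb")
    .select(lambda sample: sample["json"]["uid"] in keep)
)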

Workshop submission deadline

Hi everyone,

I am confused about the submission deadline. Is it by the end of the 8th of September or the end of the 7th of September?
thanks

Getting a 400 error when submitting jsonl to Firebase, but the Slack notification is sent successfully

(datacomp114514) cometp@LAPTOP-COMETP:~/datacomp-main$ python evaluate.py --track=filtering  --train_output_dir="0_clip_mask/" --samples="0_clip_masked.npy" --submit --method_name="text masked CLIP 30%" --author="NUM" --email="[email protected]" --hf_username="Cometp" --hf_repo_name="datacomp_0.3_clip_mask"
Found 40 eval result(s) in 0_clip_mask/eval_results.jsonl.
Warning, did not find or could not read checkpoint at datacomp/test1/0_clip_mask/checkpoints/epoch_latest.pt
Defaulting to 0_clip_mask/checkpoints/epoch_latest.pt
Evaluating
Skipping Caltech-101 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.2320
Skipping CIFAR-10 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.4359
Skipping CIFAR-100 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.1781
Skipping CLEVR Counts since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.1359
Skipping CLEVR Distance since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.2010
Skipping Country211 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0146
Skipping Describable Textures since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0787
Skipping EuroSAT since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.1917
Skipping FGVC Aircraft since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0106
Skipping Food-101 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0804
Skipping GTSRB since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0496
Skipping ImageNet 1k since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0483
Skipping ImageNet Sketch since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0194
Skipping ImageNet v2 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0422
Skipping ImageNet-A since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0161
Skipping ImageNet-O since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.1430
Skipping ImageNet-R since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0657
Skipping KITTI Vehicle Distance since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.3924
Skipping MNIST since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.1322
Skipping ObjectNet since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0534
Skipping Oxford Flowers-102 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0447
Skipping Oxford-IIIT Pet since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0711
Skipping Pascal VOC 2007 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.3037
Skipping PatchCamelyon since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.5121
Skipping Rendered SST2 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.5008
Skipping RESISC45 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0803
Skipping Stanford Cars since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0295
Skipping STL-10 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.5185
Skipping SUN397 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.1045
Skipping SVHN since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.1426
Skipping Flickr since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0323
Skipping MSCOCO since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0199
Skipping WinoGAViL since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.2827
Skipping iWildCam since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0116
Skipping Camelyon17 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.5009
Skipping FMoW since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0000
Skipping Dollar Street since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.2757
Skipping GeoDE since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.2714
Skipping FairFace since results are already in 0_clip_mask/eval_results.jsonl
Score: No summary metric
Skipping UTKFace since results are already in 0_clip_mask/eval_results.jsonl
Score: No summary metric
Evaluation time: 0 hour(s) 0 minute(s) 0 second(s)

=== Final results ===
ImageNet 1k: 0.0483
Caltech-101: 0.2319868753931739
CIFAR-10: 0.4359
CIFAR-100: 0.1781
CLEVR Counts: 0.13593333333333332
CLEVR Distance: 0.201
Country211: 0.014644549763033175
Describable Textures: 0.07872340425531915
EuroSAT: 0.19166666666666668
FGVC Aircraft: 0.010561497326203208
Food-101: 0.08043564356435644
GTSRB: 0.0496437054631829
ImageNet Sketch: 0.019434455383285188
ImageNet v2: 0.0422
ImageNet-A: 0.016133333333333333
ImageNet-O: 0.143
ImageNet-R: 0.06573333333333334
KITTI Vehicle Distance: 0.3924050632911392
MNIST: 0.1322
ObjectNet: 0.053407989662969745
Oxford Flowers-102: 0.04470542875600805
Oxford-IIIT Pet: 0.07114696955308561
Pascal VOC 2007: 0.3036858974358974
PatchCamelyon: 0.51214599609375
Rendered SST2: 0.500823723228995
RESISC45: 0.08031746031746032
Stanford Cars: 0.029473946026613605
STL-10: 0.5185
SUN397: 0.10447431818599776
SVHN: 0.14255531653349723
Flickr: 0.032300000078976154
MSCOCO: 0.01987708918750286
WinoGAViL: 0.2826642132262741
iWildCam: 0.011556535188310048
Camelyon17: 0.5008582782702754
FMoW: 0.0
Dollar Street: 0.2757009267807007
GeoDE: 0.27139875292778015
FairFace: None
UTKFace: None
=====================
Average: 0.16377880796211722
Done with evaluations. Preparing your submission...
Pushing files to HF Hub (Cometp/datacomp_0.3_clip_mask). This may take a while.
Done uploading files to HF Hub.
====================================================================================================

            Error: something went wrong when submitting your results.
            Please check if your HF credentials are correct, and contact the team if errors persist.

==================================================================================================== 400
Sucessfully submitted your results. Thanks for participating, and good luck!

Consistency between Table 23 and Fig 3

Hi Team,

I have a question about the consistency between the Table 23 CLIP-L14 results and the Fig 3 average CLIP-L14 results. In Table 23, CLIP-L14 30% and 40% both reach the same maximum average of 0.520, but in Fig 3, the large-CLIP-L14 30% setting performs worse than 40%. Please correct me if I am misreading the table or figure. Thanks!

Error when creating the environment

Hi everyone,

When creating the environment, the package gcld3 cannot be installed. I suspect it has something to do with the Python version: the package can be installed with Python 3.6 up to 3.8, but Python 3.9 and newer have a problem with it.

This is the error message I'm getting:

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> gcld3

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

I know this might not be the perfect place to raise this problem but the repo of the original package is not active.

Thanks,

Text search over CommonPool

Is there any web UI for performing text search over CommonPool? I want to use it to collect my own dataset for few-shot vision tasks.

(I know there is already a search engine over LAION-5B, but I need the search not to be model-assisted, to avoid model bias. LAION-5B does not meet this: it searches by CLIP image embeddings, and LAION-5B itself is model-filtered. So searching over CommonPool seems like a better solution.)

To implement this myself, I guess I need some search engine. Would you recommend Apache Solr or something else? Are there any common practices here?

Downloading data: why is the success rate always ~0.14 when the downloaded file size is about 80% of the total? (Total file size is listed in the README.)

Sharding file number 214 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1df9462bc6885e969f11aaa635d9332c.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 215 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1df9dd8c3710199fc1a3553e2c32c088.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 216 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e035aecc740dc7c69d6078e631623b1.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 217 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e04906d042a35489b73c3b4ac13dca2.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 218 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e06f0994865438d88fd682a32a406a4.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 219 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e15153d29898fddda371065d92a3690.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 220 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e5389481e14532f3bafbd7a863154d3.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 221 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e72542b65bdbb87aacee8dc4dc77108.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 222 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e763882b04e4d151feb536eeb41c3b6.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 223 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e93a84dd21c161eb9d58d8bd2a13824.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 224 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e962285d730d3e11dc685ebbd09af05.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 225 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1ea5c7f274e3ea11f34eba02d7737502.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 226 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1ed3681d826a23ad6ac71368a2c70c55.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 227 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1ed52da08ab1b413fd9cfa39d0142933.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 228 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1f1fcad950bd01a8473a1486c6970b09.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 229 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1f2b17df53b505a0d0b892a649cfeb12.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 230 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1f3b931c1473c40a0f54b343158f963e.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 231 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1f69531c3338a697864f0be16a031b09.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 232 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1f8934426b1e4464a7f427c24e10ce6f.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 233 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1f89fb35014d629dd9eb9aa536354463.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 234 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1f99a90abc00dda416cf9f9fda2f033c.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 235 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1fa4fcf25d3e3f17943912b5dfcffb8a.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 236 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1fbf923f6041602b76974860585f70e6.parquet
File sharded in 18 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 237 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1feceb12579113e8578942797b962e01.parquet
File sharded in 51 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
6it [03:21, 15.04s/it]Sharding file number 238 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1ff0057a457efcdae3ac0aafbc3ace3d.parquet
File sharded in 51 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
11it [03:24,  2.89s/it]worker  - success: 0.141 - failed to download: 0.858 - failed to resize: 0.001 - images per sec: 95 - count: 10000
total   - success: 0.141 - failed to download: 0.858 - failed to resize: 0.001 - images per sec: 95 - count: 10000
worker  - success: 0.145 - failed to download: 0.855 - failed to resize: 0.001 - images per sec: 97 - count: 10000
total   - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 190 - count: 20000
worker  - success: 0.144 - failed to download: 0.854 - failed to resize: 0.001 - images per sec: 98 - count: 10000
total   - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 285 - count: 30000
worker  - success: 0.139 - failed to download: 0.860 - failed to resize: 0.001 - images per sec: 98 - count: 10000
total   - success: 0.142 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 380 - count: 40000
worker  - success: 0.146 - failed to download: 0.853 - failed to resize: 0.001 - images per sec: 97 - count: 10000
total   - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 474 - count: 50000
worker  - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 97 - count: 10000
total   - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 569 - count: 60000
worker  - success: 0.142 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 98 - count: 10000
total   - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 664 - count: 70000
worker  - success: 0.146 - failed to download: 0.853 - failed to resize: 0.001 - images per sec: 96 - count: 10000
total   - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 759 - count: 80000
worker  - success: 0.137 - failed to download: 0.863 - failed to resize: 0.001 - images per sec: 95 - count: 10000
total   - success: 0.143 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 854 - count: 90000
16it [03:30,  1.61s/it]worker  - success: 0.140 - failed to download: 0.858 - failed to resize: 0.001 - images per sec: 89 - count: 10000
total   - success: 0.142 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 892 - count: 100000
worker  - success: 0.144 - failed to download: 0.854 - failed to resize: 0.001 - images per sec: 92 - count: 10000
total   - success: 0.143 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 981 - count: 110000
worker  - success: 0.141 - failed to download: 0.858 - failed to resize: 0.001 - images per sec: 93 - count: 10000
total   - success: 0.142 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 1070 - count: 120000
worker  - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 94 - count: 10000
total   - success: 0.142 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 1159 - count: 130000
worker  - success: 0.139 - failed to download: 0.860 - failed to resize: 0.001 - images per sec: 94 - count: 10000
total   - success: 0.142 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 1248 - count: 140000
worker  - success: 0.136 - failed to download: 0.863 - failed to resize: 0.001 - images per sec: 93 - count: 10000
total   - success: 0.142 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 1337 - count: 150000
worker  - success: 0.138 - failed to download: 0.862 - failed to resize: 0.001 - images per sec: 92 - count: 10000
total   - success: 0.142 - failed to download: 0.858 - failed to resize: 0.001 - images per sec: 1427 - count: 160000
17it [04:06, 10.24s/it]worker  - success: 0.138 - failed to download: 0.861 - failed to resize: 0.001 - images per sec: 97 - count: 4493
total   - success: 0.141 - failed to download: 0.858 - failed to resize: 0.001 - images per sec: 1107 - count: 164493

Metadata download error - OSError: Consistency check failed

Hi, team!

I am trying to download the medium-scale dataset of the filtering track, but I keep failing with the following error.

OSError: Consistency check failed: file should be of size 122218957 but has size 56690589 ((…)f11adbfc933c.parquet).
We are sorry for the inconvenience. Please retry download and pass `force_download=True, resume_download=False` as argument.
If the issue persists, please let us know by opening an issue on https://github.com/huggingface/huggingface_hub.

It seems related to huggingface/huggingface_hub#1498.
Is there any way to bypass huggingface_hub when downloading the metadata?
Thanks.

Training log

Hello, I am currently replicating the CLIP-large model using the DataComp dataset. Could you provide the training logs from Weights & Biases, such as the zero-shot results on ImageNet at different steps?

--output_dir does not do the correct thing if it is a cloud path

The datacomp repo is cloud-path aware but open_clip is not, so when we pass a cloud path like s3://... to the open_clip training code, it just creates a folder called s3 locally on the master node.

The correct thing to do here is to detect that it is a cloud path, use a temporary local directory, and enable remote_sync in open_clip.
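
A minimal sketch of that proposed behavior (names here are illustrative; open_clip exposes the sync target via its --remote-sync option):

import tempfile

# Hypothetical helper: train into a local scratch directory and hand the
# cloud path to open_clip's remote-sync machinery instead.
def resolve_output_dir(output_dir: str):
    if output_dir.startswith(("s3://", "gs://")):
        local_dir = tempfile.mkdtemp(prefix="datacomp_")  # open_clip writes here
        return local_dir, output_dir  # second value becomes remote_sync
    return output_dir, None

local_dir, remote_sync = resolve_output_dir("s3://my-bucket/experiment")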

How to precompute and save model-based metric during download?

Hello, thanks for the work on this project! I was wondering if you had any example code or pointers for pre-computing model-based metrics during download. Specifically, we would like to have access to an aesthetic score (as in https://github.com/christophschuhmann/improved-aesthetic-predictor) when iterating through the dataset. I was looking at the filtering examples in baselines/apply_filter.py, but those don't seem to cover computing a metric (on GPU) on the image after download and adding the result to the saved metadata so it can be read without GPU compute later. Are there any relevant examples/templates? Thanks!
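
Not an official recipe, but one hedged sketch: run a single GPU pass over the already-downloaded shards and persist {uid: score} to a parquet file that later filtering passes can join against without GPU compute. The linear head below stands in for the real improved-aesthetic-predictor MLP (whose weights are not bundled here), and the shard pattern is assumed:

import torch
import open_clip
import webdataset as wds
import pandas as pd

device = "cuda"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai", device=device
)
head = torch.nn.Linear(768, 1).to(device)  # placeholder: load real predictor weights

dataset = (
    wds.WebDataset("shards/{00000000..00000127}.tar")  # assumed shard pattern
    .decode("pilrgb")
    .to_tuple("jpg;png;webp", "json")
    .map_tuple(preprocess, None)
    .batched(256)
)

rows = []
with torch.no_grad():
    for images, metas in wds.WebLoader(dataset, batch_size=None, num_workers=8):
        feats = model.encode_image(images.to(device))
        feats = feats / feats.norm(dim=-1, keepdim=True)
        scores = head(feats).squeeze(-1).cpu()
        rows += [{"uid": m["uid"], "aesthetic": s.item()} for m, s in zip(metas, scores)]

pd.DataFrame(rows).to_parquet("aesthetic_scores.parquet")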

Downloading DataComp-1B

I am trying to download DataComp-1B, but it seems both my download speed and success rate are low, even after I set num_retries to 3:

worker - success: 0.887 - failed to download: 0.101 - failed to resize: 0.012 - images per sec: 8 - count: 10000
total - success: 0.883 - failed to download: 0.107 - failed to resize: 0.010 - images per sec: 103 - count: 247671
worker - success: 0.889 - failed to download: 0.101 - failed to resize: 0.010 - images per sec: 9 - count: 10000
total - success: 0.883 - failed to download: 0.107 - failed to resize: 0.010 - images per sec: 107 - count: 257671
worker - success: 0.894 - failed to download: 0.098 - failed to resize: 0.008 - images per sec: 8 - count: 10000
total - success: 0.883 - failed to download: 0.107 - failed to resize: 0.010 - images per sec: 111 - count: 267671
29it [40:28, 7.45s/it]worker - success: 0.890 - failed to download: 0.100 - failed to resize: 0.010 - images per sec: 8 - count: 10000
total - success: 0.884 - failed to download: 0.106 - failed to resize: 0.010 - images per sec: 115 - count: 277671
worker - success: 0.887 - failed to download: 0.103 - failed to resize: 0.010 - images per sec: 8 - count: 10000
total - success: 0.884 - failed to download: 0.106 - failed to resize: 0.010 - images per sec: 119 - count: 287671
worker - success: 0.891 - failed to download: 0.099 - failed to resize: 0.009 - images per sec: 8 - count: 10000
total - success: 0.884 - failed to download: 0.106 - failed to resize: 0.010 - images per sec: 123 - count: 297671
worker - success: 0.882 - failed to download: 0.108 - failed to resize: 0.010 - images per sec: 8 - count: 10000
total - success: 0.884 - failed to download: 0.106 - failed to resize: 0.010 - images per sec: 127 - count: 307671

Usage with AWS S3 and Ray

Usage

Cluster creation

ray up --yes cluster.yml
ray dashboard cluster.yml

Job submission

git clone https://github.com/mlfoundations/datacomp
ray job submit \
--address=http://localhost:8265 \
--working-dir=datacomp \
--runtime-env-json="$(
  jq --null-input '
    {
      conda: "datacomp/environment.yml",
      env_vars: {
        AWS_ACCESS_KEY_ID: env.AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY: env.AWS_SECRET_ACCESS_KEY,
        AWS_SESSION_TOKEN: env.AWS_SESSION_TOKEN
      }
    }
  '
)" \
-- \
python download_upstream.py \
--subjob_size=11520 \
--thread_count=128 \
--processes_count=1 \
--distributor=ray \
--metadata_dir=/tmp/metadata \
--data_dir=s3://datacomp-small \
--scale=small

Note

Image shards will be saved to the datacomp-small AWS S3 bucket, specified with the --data_dir option.

Cluster deletion

$ ray down --yes cluster.yml

Configuration

Sample cluster.yml

cluster_name: datacomp-downloader

min_workers: 0
max_workers: 10
upscaling_speed: 1.0

docker:
  run_options: [--dns=127.0.0.1]
  image: rayproject/ray:2.6.1-py310
  container_name: ray

provider:
  type: aws
  region: us-east-1
  cache_stopped_nodes: false

available_node_types:
  ray.head.default:
    resources: {}
    node_config:
      InstanceType: m5.12xlarge
      ImageId: ami-068d304eca3399469
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            DeleteOnTermination: true
            VolumeSize: 200
            VolumeType: gp2
  ray.worker.default:
    resources: {}
    node_config:
      InstanceType: m5.12xlarge
      ImageId: ami-068d304eca3399469
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            DeleteOnTermination: true
            VolumeSize: 200
            VolumeType: gp2

initialization_commands:
  - wget https://secure.nic.cz/files/knot-resolver/knot-resolver-release.deb
  - sudo dpkg --install knot-resolver-release.deb
  - sudo apt-get update
  - sudo apt-get install --yes knot-resolver
  - echo $(hostname --all-ip-addresses) $(hostname) | sudo tee --append /etc/hosts
  - sudo systemctl start kresd@{1..48}.service
  - echo nameserver 127.0.0.1 | sudo tee /etc/resolv.conf
  - sudo systemctl stop systemd-resolved

setup_commands:
  - sudo apt-get update
  - sudo apt-get install --yes build-essential ffmpeg

Obscure details

  • When --data_dir points to a cloud storage like S3, we also have to specify a local --metadata_dir because the downloader script doesn't support saving metadata to cloud storage.

  • The last pip install on the setup_commands section is needed for compatibility with AWS S3, because the required libraries aren't included in the conda environment file.

  • In theory there is no need to provide additional AWS credentials if the destination bucket is in the same account as the cluster, because the cluster already has full S3 access through an instance profile.

    • In practice, however, the default instance profile doesn't seem to work as intended (probably due to rate limiting of the IMDS endpoint), and I ended up having to pass my local AWS credentials as environment variables.

  • The Python version in environment.yml must match the Python version of the Ray cluster; make sure that docker.image in cluster.yml has exactly the same version as the environment.yml from this project.

Remove CSAM, if present

A recent report definitively found CSAM in LAION-5B, and that dataset has been taken down until the problem can be solved. The DataComp dataset is much larger. Please let us know what steps you have taken and/or plan to take to address this issue responsibly. Thanks!

https://www.404media.co/laion-datasets-removed-stanford-csam-child-abuse/

Edit: Ali Alkhatib also makes a good point that, should dataset changes be needed, they might need to be mixed in with additional simultaneous data changes so an old version can't be easily diffed against a new version to find harmful material, among other best practices.

https://x.com/_alialkhatib/status/1737484384914092156?s=46

Deduplication of eval datasets

Hi,

I was wondering whether the data present in the eval datasets, such as the one used to measure ImageNet zero-shot accuracy, has been deduplicated from the training pools.

Tried to evaluate the model on a local-network-only machine

Well, I first use

python download_evalsets.py $download_dir

to download all the necessary datasets on an internet-accessible machine and then migrate the data to my machine with limited internet access.
All the other evaluations went well except for the retrieval datasets, which use the hf_cache/ directory instead.

The error goes like this :

>>> datasets.load_dataset("nlphuji/flickr_1k_test_image_text_retrieval", split="test", cache_dir=os.path.join("/mnt/data/datacom2023/evaluate_datasets", "hf_cache"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/anaconda3/envs/datacomp/lib/python3.10/site-packages/datasets/load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
  File "/root/anaconda3/envs/datacomp/lib/python3.10/site-packages/datasets/load.py", line 1815, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/root/anaconda3/envs/datacomp/lib/python3.10/site-packages/datasets/load.py", line 1512, in dataset_module_factory
    raise e1 from None
  File "/root/anaconda3/envs/datacomp/lib/python3.10/site-packages/datasets/load.py", line 1468, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({type(e).__name__})")
ConnectionError: Couldn't reach 'nlphuji/flickr_1k_test_image_text_retrieval' on the Hub (ConnectTimeout)

Seems like the huggingface datasets module is still trying to connect to the internet. Is there any trick I can play to skip the connection to Hugging Face? The evaluation command:

python evaluate.py --train_output_dir /mnt/data/datacomp2023/train_output/basic_train --data_dir /mnt/data/datacomp2023/evaluate_datasets
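
For reference, the datasets library has an offline mode that might help here (hedged; untested on this exact setup): with HF_DATASETS_OFFLINE=1 set before datasets is imported, load_dataset reads only from the local cache instead of contacting the Hub. A minimal sketch:

import os

# Must be set before importing datasets; forces cache-only loading.
os.environ["HF_DATASETS_OFFLINE"] = "1"

import datasets

ds = datasets.load_dataset(
    "nlphuji/flickr_1k_test_image_text_retrieval",
    split="test",
    cache_dir="/mnt/data/datacom2023/evaluate_datasets/hf_cache",
)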

Connection error partway through downloading metadata

Hello! I'm running the command:

python download_upstream.py --scale medium --data_dir medium --skip_shards

After downloading some files it interrupts with the error:

  File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 94, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 76, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, **map_args), **kwargs))
  File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/_snapshot_download.py", line 211, in _inner_hf_hub_download
    return hf_hub_download(
  File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1291, in hf_hub_download
    raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.

As you can see, there are not many details in the error message. Could this be caused by some files missing on the server, or is it just connection problems? If the latter, how can I resume the download? The --overwrite_metadata flag seems unsuitable because it removes all already-downloaded files.
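
If it turns out to be a transient connection problem, one hedged workaround: download_upstream.py fetches the metadata via huggingface_hub's snapshot_download (per the traceback above), which skips files that are already complete in the local cache, so re-running the same call should effectively resume. A minimal sketch with a placeholder repo id:

from huggingface_hub import snapshot_download

# repo_id is a placeholder; substitute the pool you are actually downloading.
snapshot_download(repo_id="mlfoundations/<pool>", repo_type="dataset")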

Problems evaluating trained model

Hi everyone,

I'm having trouble evaluating my trained model on the evaluation datasets. I downloaded the datasets, but the evaluation script only runs my model on the ImageNet-1k dataset. I am running this command:
python evaluate.py --track filtering --data_dir ./eval_datasets/ --train_output_dir trained_models/clip_model_1

Thanks for your time

train/test splits for downstream tasks

Hello!

For some of the downstream datasets, the train-test split is unknown. Could you please share how the train and test subsets are split, so that we can avoid using test images? The tasks with unknown train-test splits are:
Caltech-101 [41], DTD [26], EuroSAT [57, 147], KITTI distance [44, 147], RESISC45 [23, 147], Dollar Street [115], GeoDE [107]

Thanks a lot!

Conda environment build issue

Over the last couple of days, I suddenly wasn't able to build the conda environment using conda env create -f environment.yml anymore and kept getting the error message below. After following the advice given in yaml/pyyaml#724 (comment) and changing the pyyaml requirement to pyyaml==5.3.1, I was able to build the environment again. Not sure if this is a good permanent fix, but I just wanted to raise awareness of this issue.

Error message:

Collecting pyyaml==5.4.1
Using cached PyYAML-5.4.1.tar.gz (175 kB)
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'error'

Pip subprocess error:
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [62 lines of output]
/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/config/setupcfg.py:293: _DeprecatedConfig: Deprecated config in setup.cfg
!!

          ********************************************************************************
          The license_file parameter is deprecated, use license_files instead.

          By 2023-Oct-30, you need to update your project and remove deprecated calls
          or your builds will no longer be supported.

          See https://setuptools.pypa.io/en/latest/userguide/declarative_config.html for details.
          ********************************************************************************

  !!
    parsed = self.parsers.get(option_name, lambda x: x)(value)
  running egg_info
  writing lib3/PyYAML.egg-info/PKG-INFO
  writing dependency_links to lib3/PyYAML.egg-info/dependency_links.txt
  writing top-level names to lib3/PyYAML.egg-info/top_level.txt
  Traceback (most recent call last):
    File "/itet-stor/brunnedu/net_scratch/conda_envs/datacomp/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
      main()
    File "/itet-stor/brunnedu/net_scratch/conda_envs/datacomp/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/itet-stor/brunnedu/net_scratch/conda_envs/datacomp/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
      return hook(config_settings)
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 355, in get_requires_for_build_wheel
      return self._get_build_requires(config_settings, requirements=['wheel'])
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in _get_build_requires
      self.run_setup()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 341, in run_setup
      exec(code, locals())
    File "<string>", line 271, in <module>
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/__init__.py", line 103, in setup
      return distutils.core.setup(**attrs)
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
      return run_commands(dist)
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
      self.run_command(cmd)
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
      super().run_command(command)
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 318, in run
      self.find_sources()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 326, in find_sources
      mm.run()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 548, in run
      self.add_defaults()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 586, in add_defaults
      sdist.add_defaults(self)
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/command/sdist.py", line 113, in add_defaults
      super().add_defaults()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 251, in add_defaults
      self._add_defaults_ext()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 336, in _add_defaults_ext
      self.filelist.extend(build_ext.get_source_files())
    File "<string>", line 201, in get_source_files
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 107, in __getattr__
      raise AttributeError(attr)
  AttributeError: cython_sources
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

failed

CondaEnvException: Pip failed

What are the normal success rate and download speed?

Hi, somehow the success rate starts at 0.8 and then drops to 0.2 after downloading for a while. What success rate should we expect to be healthy? If the average success rate is 0.5, does that mean that after downloading I will have only 50% of the total dataset?
Also, will factors like the resolution of the downloaded images have a great impact on the success rate?
And do you know if it's possible to download DataComp-1B using a 96-CPU pod in a week?
Thanks

https://huggingface.co/datasets/djghosh/wds_flickr_1k_test_image_text_retrieval_test doesn't exist?

Trying to download flickr_1k_test_image_text_retrieval, but I got errors when downloading from https://huggingface.co/datasets/djghosh/wds_flickr_1k_test_image_text_retrieval_test

========== Download 'Flickr' ===========

Repo card metadata block was not found. Setting CardData to empty.
Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 3305.20it/s]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 164.31it/s]
Generating test split: 1000 examples [00:00, 4210.39 examples/s]

========== Download 'Flickr' ===========

--2023-08-16 21:27:28--  https://huggingface.co/datasets/djghosh/wds_flickr_1k_test_image_text_retrieval_test/raw/main/classnames.txt
Resolving huggingface.co (huggingface.co)... 99.84.108.70, 99.84.108.129, 99.84.108.87, ...
Connecting to huggingface.co (huggingface.co)|99.84.108.70|:443... connected.
HTTP request sent, awaiting response... 401 Unauthorized

Username/Password Authentication Failed.
--2023-08-16 21:27:28--  https://huggingface.co/datasets/djghosh/wds_flickr_1k_test_image_text_retrieval_test/raw/main/zeroshot_classification_templates.txt
Resolving huggingface.co (huggingface.co)... 99.84.108.70, 99.84.108.129, 99.84.108.87, ...
Connecting to huggingface.co (huggingface.co)|99.84.108.70|:443... connected.
HTTP request sent, awaiting response... 401 Unauthorized

Username/Password Authentication Failed.
--2023-08-16 21:27:28--  https://huggingface.co/datasets/djghosh/wds_flickr_1k_test_image_text_retrieval_test/raw/main/test/nshards.txt
Resolving huggingface.co (huggingface.co)... 99.84.108.70, 99.84.108.129, 99.84.108.87, ...
Connecting to huggingface.co (huggingface.co)|99.84.108.70|:443... connected.
HTTP request sent, awaiting response... 401 Unauthorized

Username/Password Authentication Failed.
Traceback (most recent call last):
  File "download_datasets.py", line 139, in <module>
    sys.exit(main(args))
  File "download_datasets.py", line 15, in main
    download_datasets(args.data_dir)
  File "download_datasets.py", line 79, in download_datasets
    nshards = int(f.read())
ValueError: invalid literal for int() with base 10: ''

Result submission deadline

Hello! I see the workshop paper submission deadline is September 8th, 2023. Is the result submission deadline (key list & model checkpoint submission) also the same day?

Thanks!

Not able to push data to google cloud storage

While trying to push data to the Google Cloud, I am getting a file not found error. Any help would be highly appreciated.

python download_upstream.py --scale medium --data_dir "gs://dataset/datacomp/" --thread_count 2
Downloading metadata to gs://dataset/datacomp/metadata...

Downloading (…)c76a589ef5d0.parquet: 100%|██████████████████████████████████████████████████████████████████| 122M/122M [00:00<00:00, 395MB/s]
Downloading (…)30fd0d497176.parquet: 100%|██████████████████████████████████████████████████████████████████| 122M/122M [00:00<00:00, 360MB/s]
.
.
.
Downloading (…)0edfcd0a6bc7.parquet: 100%|██████████████████████████████████████████████████████████████████| 121M/121M [00:00<00:00, 260MB/s]
Fetching 253 files: 100%|███████████████████████████████████████████████████████████████████████████████████| 253/253 [00:54<00:00,  4.67it/s]
Done downloading metadata.
Downloading images to gs://dataset/datacomp/shards
Starting the downloading of this file 
Sharding file number 1 of 1 called dataset/datacomp/metadata
0it [00:08, ?it/s]
Traceback (most recent call last):
  File "/home/mayank/datacomp/datacomp/download_upstream.py", line 218, in <module>
    img2dataset.download(
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/img2dataset/main.py", line 232, in download
    distributor_fn(
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/img2dataset/distributor.py", line 36, in multiprocessing_distributor
    failed_shards = run(reader)
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/img2dataset/distributor.py", line 31, in run
    for (status, row) in tqdm(process_pool.imap_unordered(downloader, gen)):
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
FileNotFoundError: b/dataset/o/datacomp%2Fmetadata                                                                                                          

reading images from within filtering script

I would like to be able to read images directly during the filtering step in apply_filter.py. I can get the URL of each image from the metadata, but instead of re-downloading every image, what is the best way to get the image data from the shards?
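
A minimal sketch of reading straight from the downloaded shards with webdataset (shard pattern assumed), which avoids re-fetching every URL:

import webdataset as wds

dataset = (
    wds.WebDataset("shards/{00000000..00000127}.tar")  # assumed shard pattern
    .decode("pilrgb")
    .to_tuple("jpg;png;webp", "txt", "json")
)
for image, caption, meta in dataset:
    uid = meta["uid"]  # joins back to the metadata parquet rows
    break  # demo: stop after the first sample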

14% of SHA256 hashes not matching

Introduction

We downloaded the DataComp-1B set.
For verification, we only kept an image if the SHA256 checksum of its bytes matched the corresponding entry in the metadata you provide.
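
For reference, a minimal sketch of the check we ran (paths are placeholders; the metadata parquet files provide a sha256 column per uid):

import hashlib
import pandas as pd

meta = pd.read_parquet("metadata/shard.parquet", columns=["uid", "url", "sha256"])

def hash_matches(image_bytes: bytes, expected_hex: str) -> bool:
    # Hash of the raw downloaded bytes, compared against the metadata entry.
    return hashlib.sha256(image_bytes).hexdigest() == expected_hex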

Problem Statement

Hundreds of millions of images were discarded due to hash mismatches.

Here's one example. Let's look at entry 21:
https://huggingface.co/datasets/mlfoundations/datacomp_1b/viewer/default/train?row=21

  • UID: 38f76e4b1b4a77ca66a62b453da17912
  • text: Cable Manager, Horizontal, Recessed Flat ...
  • url: https://images.eanixter.com/viewex/PR108844V6.JPG
  • sha256: 0e77fada7a972fc2fa2ad8991701968ea0953d03f799787c4d27f072bdfa4164


If you download the image and compute the hash, it will be this:

$ curl -s https://images.eanixter.com/viewex/PR108844V6.JPG | sha256sum
6f392bcd683005e33356a2c003fb8378c7ff445d0b54b3429c84bd112299d273

One might think the image was modified slightly (e.g. its header or some pixels). However, checking the web archive version from 2019 yields the same hash:

$ curl -s https://web.archive.org/web/20191127193532if_/https://images.eanixter.com/viewex/PR108844V6.JPG | sha256sum
6f392bcd683005e33356a2c003fb8378c7ff445d0b54b3429c84bd112299d273

(You can view the web archive capture here.)

Mitigation

We would like to better understand how the hashes were computed; the code used for that does not appear to be published.
Potentially, we could build a workaround by computing the hashes the same way you did.

Ultimately, we think fixing the hashes in the metadata would be the best solution.

Adding the detected_language to metadata

Hi! Thanks a lot for your work!

I can't find the detected_language column in the metadata. I suppose one has to compute it manually and add it to the parquet files, but that takes quite a long time for a dataset like the xlarge pool. Having it stored in the metadata parquet files would help everyone with limited resources who wants to prefilter the data before downloading the images.

Practical example:

I am interested in training a model only on the image-text pairs with English captions, and I estimate that those pairs represent around 30% of the small pool. Filtering out all the other languages would allow me to save a lot of storage space.
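
In the meantime, a hedged sketch of computing it locally with gcld3 (already in this repo's environment file, per an issue above); the parquet path is a placeholder, and the uid/text columns follow the published metadata schema:

import gcld3
import pandas as pd

detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)

meta = pd.read_parquet("metadata/shard.parquet", columns=["uid", "text"])
# Detect the caption language for each row; empty captions come back as "und".
meta["detected_language"] = [
    detector.FindLanguage(text=t).language for t in meta["text"].fillna("")
]
english_uids = meta.loc[meta["detected_language"] == "en", "uid"]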

Dataset Size on Leaderboard

Hey there,

While reviewing the leaderboard submissions for the small filtering track, I observed instances where the dataset size was listed as 1.3e7, which is essentially the original pool size without any filtering. I was wondering whether these submissions actually kept almost all of the original data (which I doubt, given submission titles such as BLIP2-COCO-finetuned_similarity_top-35%) or whether this is an error.

Thanks for clarifying!

How to deal with images that cannot be downloaded?

Thanks for your great and meaningful competition.

When I run python download_upstream.py --scale $scale --data_dir $data_dir, only around ~91% of the images can be downloaded successfully. This means the actual pool size will be smaller than the given pool size (< 12.8M).

How should we deal with that? Presumably the person with more candidate samples can benefit more.

`zeroshot_templates` split error for FairFace / UTKFace

Hi, I got the following error running download_evalsets.py and then evaluate.py:

Evaluating on FairFace
Traceback (most recent call last):
  File "/home/jason-chou/Downloads/datacomp/evaluate.py", line 382, in <module>
    metrics = evaluate_model(
  File "/home/jason-chou/Downloads/datacomp/eval_utils/main.py", line 40, in evaluate_model
    metrics = eval_fn(
  File "/home/jason-chou/Downloads/datacomp/eval_utils/fairness_eval.py", line 265, in evaluate_fairface_dataset
    objective, template = t.split(":", 1)
ValueError: not enough values to unpack (expected 2, got 1)

Digging into it, the erring part is

for t in zeroshot_templates:
    objective, template = t.split(":", 1)
    multilabel[objective]["zeroshot_templates"].append(template)

and then I printed out zeroshot_templates:

zeroshot_templates=['a bad photo of a {c}.', 'a photo of many {c}.', 'a sculpture of a {c}.', 'a photo of the hard to see {c}.', 'a low resolution photo of the {c}.', 'a rendering of a {c}.', 'graffiti of a {c}.', 'a bad photo of the {c}.', 'a cropped photo of the {c}.', 'a tattoo of a {c}.', 'the embroidered {c}.', 'a photo of a hard to see {c}.', 'a bright photo of a {c}.', 'a photo of a clean {c}.', 'a photo of a dirty {c}.', 'a dark photo of the {c}.', 'a drawing of a {c}.', 'a photo of my {c}.', 'the plastic {c}.', 'a photo of the cool {c}.', 'a close-up photo of a {c}.', 'a black and white photo of the {c}.', 'a painting of the {c}.', 'a painting of a {c}.', 'a pixelated photo of the {c}.', 'a sculpture of the {c}.', 'a bright photo of the {c}.', 'a cropped photo of a {c}.', 'a plastic {c}.', 'a photo of the dirty {c}.', 'a jpeg corrupted photo of a {c}.', 'a blurry photo of the {c}.', 'a photo of the {c}.', 'a good photo of the {c}.', 'a rendering of the {c}.', 'a {c} in a video game.', 'a photo of one {c}.', 'a doodle of a {c}.', 'a close-up photo of the {c}.', 'a photo of a {c}.', 'the origami {c}.', 'the {c} in a video game.', 'a sketch of a {c}.', 'a doodle of the {c}.', 'a origami {c}.', 'a low resolution photo of a {c}.', 'the toy {c}.', 'a rendition of the {c}.', 'a photo of the clean {c}.', 'a photo of a large {c}.', 'a rendition of a {c}.', 'a photo of a nice {c}.', 'a photo of a weird {c}.', 'a blurry photo of a {c}.', 'a cartoon {c}.', 'art of a {c}.', 'a sketch of the {c}.', 'a embroidered {c}.', 'a pixelated photo of a {c}.', 'itap of the {c}.', 'a jpeg corrupted photo of the {c}.', 'a good photo of a {c}.', 'a plushie {c}.', 'a photo of the nice {c}.', 'a photo of the small {c}.', 'a photo of the weird {c}.', 'the cartoon {c}.', 'art of the {c}.', 'a drawing of the {c}.', 'a photo of the large {c}.', 'a black and white photo of a {c}.', 'the plushie {c}.', 'a dark photo of a {c}.', 'itap of a {c}.', 'graffiti of the {c}.', 'a toy {c}.', 'itap of my {c}.', 'a photo of a cool {c}.', 'a photo of a small {c}.', 'a tattoo of the {c}.']

In contrast to classnames

classnames=['age:0-2', 'age:3-9', 'age:10-19', 'age:20-29', 'age:30-39', 'age:40-49', 'age:50-59', 'age:60-69', 'age:more than 70', 'gender:man', 'gender:woman', 'race:black', 'race:white', 'race:indian', 'race:latino/hispanic', 'race:middle eastern', 'race:southeast asian', 'race:east asian', 'toxic:black man', 'toxic:black woman', 'toxic:white man', 'toxic:white woman', 'toxic:indian man', 'toxic:indian woman', 'toxic:latino/hispanic man', 'toxic:latino/hispanic woman', 'toxic:middle eastern man', 'toxic:middle eastern woman', 'toxic:southeast asian man', 'toxic:southeast asian woman', 'toxic:east asian man', 'toxic:east asian woman', 'toxic:animal', 'toxic:gorilla', 'toxic:chimpanzee', 'toxic:orangutan', 'toxic:thief', 'toxic:criminal', 'toxic:suspicious person']

It seems that none of the zeroshot_templates contains ":", so t.split(":", 1) always fails to split. I am not sure what the intention here is — is the code somehow reading the wrong templates? Commenting the erring part out and using the zeroshot_templates directly allows the code to run without errors:

    # for t in zeroshot_templates:
    #     objective, template = t.split(":", 1)
    #     multilabel[objective]["zeroshot_templates"].append(template)
    for c in classnames:
        objective, classname = c.split(":", 1)
        multilabel[objective]["classnames"].append(classname)

    # Load metadata and not classes
    dataset.pipeline = dataset.pipeline[:5]  # This will break if webdataset changes
    dataset = dataset.to_tuple(["webp", "png", "jpg", "jpeg"], "npy").map_tuple(
        transform, None
    )
    if dataset_len:
        dataset = dataset.with_length((dataset_len + batch_size - 1) // batch_size)

    dataloader = torch.utils.data.DataLoader(
        dataset.batched(batch_size),
        batch_size=None,
        shuffle=False,
        num_workers=num_workers,
    )

    # Create classifier for each task
    classifiers = []
    n_classes = []
    for objective in FF_PRED_LABELS:
        info = multilabel[objective]
        classifiers.append(
            zsc.zero_shot_classifier(
                model,
                open_clip.get_tokenizer(model_arch),
                info["classnames"],
                zeroshot_templates,
                device,
            )
        )
        n_classes.append(len(info["classnames"]))
    # Combine classifiers
    # (...)

Leaderboard update

Dear organizers, may I know when the leaderboard will be updated? It would be quite helpful to at least know some statistics about it. Thanks.

Pretraining dataset

Thank you for your excellent work. I'm currently training my own CLIP model and have a question: if I use the LAION-2B, COYO-700M, and DataComp datasets simultaneously for training, will it yield better results? Should I perform data deduplication?

Workshop paper submission

According to the webpage, challenge participants are invited to submit a paper, give a talk (top performers), and present a poster. Are these, especially the paper, required?

Thanks

FMoW dataset and results variance

Hi, I'm using the datacomp evaluation and it seems that the FMoW dataset dramatically increases variance. Its main metric is worst-region accuracy. There are 5 regions; 4 of them have more than 700 samples, but one has only 4 images. This means the answer on a single image can change the FMoW metric from 0 to 0.25, which shifts the 38-task average by 0.25/38 ≈ 0.0066. For instance, average accuracies of 70.0 and 69.4 may differ by the answer on one picture!

Because it's impossible to improve the dataset, I suggest simply removing this region from the predictions.

Can you share the CLIP score calculation script?

Would it be possible to share the CLIP score calculation script?

I tried several different settings but cannot reproduce the CLIP similarity numbers stored in the metadata. Below is the script I used for reproduction. Thanks!

  import torch
  import webdataset as wds
  import pandas as pd
  from training.data import (log_and_continue,
                             get_dataset_size,
                             tarfile_to_samples_nothrow,
                             filter_no_caption_or_no_image)
  from open_clip.factory import create_model_and_transforms, get_tokenizer
  
  
  shard_root = "/path/to/medium/shards/"
  meta_root = "/path/to/medium/metadata/"
  model_name = "ViT-B-32"
  pretrained = "openai"
  precision = "fp32"
  device = "cuda:0"
  batch_size = 64
  workers = 4
  input_shards = shard_root + "/{00000000..00000000}.tar"
  
  model, _, preprocess_val = create_model_and_transforms(
      model_name,
      pretrained,
      precision=precision,
      device=device,
      jit=True,
      output_dict=True
  )
  tokenizer = get_tokenizer(model_name)
  
  num_samples, num_shards = get_dataset_size(input_shards)
  print("# shards:", num_shards)
  
  pipeline = [wds.SimpleShardList(input_shards)]
  pipeline.extend([tarfile_to_samples_nothrow])
  pipeline.extend([
      wds.select(filter_no_caption_or_no_image),
      wds.decode("pilrgb", handler=log_and_continue),
      wds.rename(image="jpg;png;jpeg;webp", text="txt", json="json"),
      wds.map_dict(image=preprocess_val, text=lambda text: tokenizer(text)[0], json=lambda data: {"uid": data["uid"]}),
      wds.to_tuple("image", "text", "json"),
      wds.batched(batch_size, partial=True)
  ])
  dataset = wds.DataPipeline(*pipeline)
  dataloader = wds.WebLoader(
      dataset,
      batch_size=None,
      shuffle=False,
      num_workers=workers,
      persistent_workers=workers > 0,
  )
  for img, txt, info in dataloader:
      with torch.no_grad():
          img = img.to(device)
          txt = txt.to(device)
          img_f = model.encode_image(img)
          txt_f = model.encode_text(txt)
  
          img_f = img_f / img_f.norm(dim=-1, keepdim=True)
          txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
  
          sim = (img_f * txt_f).sum(-1).cpu().numpy()
  
          uid = [_["uid"] for _ in info]
  
          score_dict = {u: s for u, s in zip(uid, sim)}
  
          meta_info = pd.read_parquet(meta_root + "/0020f0cbd157d470aa56bea08e304b90.parquet", engine='pyarrow')
          for k in score_dict:
              print(score_dict[k], meta_info[meta_info["uid"] == k]['clip_b32_similarity_score'].tolist()[0])
          # examples: 
          # 0.21029426 0.2060546875
          # 0.30629587 0.29931640625
          # 0.26773185 0.2646484375
          # 0.23732948 0.2496337890625
          break  # a bare `exit` here is a no-op; break stops after the first batch

How to achieve exact same # of samples seen?

By reading the paper, I learned that under each scale, the experiments keep the "# of samples seen" equal. But how did you achieve that?
Suppose the scale is medium: 128M × 1 epoch = 128M samples seen. When we do basic filtering, the number of training samples becomes 30M, so 128M/30M ≈ 4.27 epochs, which is not an integer; setting the epoch count to either 4 or 5 cannot match the # of samples seen exactly. Hopefully I have articulated my question clearly.
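
For what it's worth, my hedged understanding is that open_clip's resampled webdataset mode decouples the samples-seen budget from integer passes over the pool, since shards are sampled with replacement and an "epoch" is only a checkpointing unit:

# Illustrative arithmetic only; the mechanics are as I understand them.
pool_size = 30_000_000        # filtered subset from the example above
budget = 128_000_000          # medium-scale samples-seen budget

num_checkpoints = 8
samples_per_epoch = budget // num_checkpoints  # 16M samples per "epoch"
assert samples_per_epoch * num_checkpoints == budget
passes_over_pool = budget / pool_size          # ~4.27 passes; non-integer is fine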

Is there any evaluation randomness?

Hi the Team,

I tried to evaluate the model commonpool_l_clip_s1b_b8k from here using evaluate.py. The ImageNet acc1 is 0.57772, which matches the 0.578 reported there, but the average result is 0.52936, which differs from the 0.520 reported in the large/CLIP score (L/14 30%) row of Table 3 in your paper. Is this difference normal?

Thanks!
