Comments (7)
Yes, this is exactly what I was looking for!
Comparing your uploaded parquets with the ones I generated using https://github.com/mlfoundations/dataset2metadata, I found some incompatible keys ('clip_b32_similarity_score' vs. 'oai-clip-vit-b32-score').
I think this can be solved by changing postprocess_parquet_lookup (https://github.com/mlfoundations/dataset2metadata/blob/225408d90462323cf2afc5b40e789bc8cd966ce6/dataset2metadata/registry.py#L25).
I will post here once I finish reproducing the baselines using this entire process.
I ran clip_score_l14_30_percent filtering on my generated parquets and got 5.39% zero-shot ImageNet accuracy on the small track, which is comparable to the official 5.1%.
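For readers unfamiliar with this kind of filter, here is a minimal sketch of keeping the top 30% of a pool by score. It is not the datacomp implementation: the function name `top_fraction_uids`, the in-memory dict of scores, and the 32-character hex uids are all assumptions made for illustration.

```python
def top_fraction_uids(scores_by_uid, fraction=0.3):
    """Return the uids whose score is in the top `fraction` of the pool.

    `scores_by_uid` is a hypothetical {uid: score} mapping; in practice the
    scores would be read from the metadata parquets.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Sort by score, highest first; ties are broken by uid so the
    # result is deterministic.
    ranked = sorted(scores_by_uid.items(), key=lambda kv: (-kv[1], kv[0]))
    keep = int(len(ranked) * fraction)
    return [uid for uid, _score in ranked[:keep]]


# Toy pool of ten samples with scores 0.0, 0.1, ..., 0.9.
scores = {f"{i:032x}": i / 10 for i in range(10)}
kept = top_fraction_uids(scores, fraction=0.3)
```

With ten samples and `fraction=0.3`, `kept` holds the three highest-scoring uids.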
from datacomp.
Hey @vtddggg. That's a great question, and it was a big point of discussion when we were designing our codebase. While it's certainly possible to implement the filtering solution, it can make training quite slow (especially if you are filtering out a large fraction of the pool) because the dataloader still needs to iterate through all samples in the pool. Therefore, we decided to go with the first approach, where you pay an initial cost to reshard and need some extra storage, but in return training is much faster.
We discuss this decision in Appendix J of our paper (https://arxiv.org/abs/2304.14108).
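The trade-off can be illustrated with a toy count of how many samples are read per epoch. This is only a sketch: `epoch_reads` is a hypothetical helper, and the pool is modelled as a plain list of uids rather than real shards.

```python
def epoch_reads(pool, keep_uids, resharded):
    """Return (samples_read, samples_used) for one pass over the data."""
    # Resharding pays the filtering cost once, up front, so later epochs
    # only touch the kept samples.
    source = [uid for uid in pool if uid in keep_uids] if resharded else pool
    read = used = 0
    for uid in source:
        read += 1  # every sample in the source is still loaded
        if uid in keep_uids:
            used += 1  # only kept samples contribute to training
    return read, used


pool = list(range(1000))
keep_uids = set(range(100))  # keep 10% of the pool
on_the_fly = epoch_reads(pool, keep_uids, resharded=False)
after_reshard = epoch_reads(pool, keep_uids, resharded=True)
```

Filtering on the fly reads all 1000 samples to use 100 of them in every epoch, while the resharded pool reads only the 100 that are kept.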
from datacomp.
I got it. Thanks!
from datacomp.
@gabrielilharco Sorry, I have just reopened the issue.
When we want to compute CLIP scores from raw images and text (instead of reading them from the metadata), we have to process the tar files ourselves. However, there are few reference examples of how to use webdataset.
Regarding the detailed implementation of the baseline methods, could you provide a simple example of how to fetch uid, image, and text data from the tars with distributed data parallel and generate sample_ids.npy? I think it would be very helpful.
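As a rough illustration of what such a loop involves, here is a minimal standard-library sketch that does not use webdataset. It assumes each sample is stored as `<key>.jpg`, `<key>.txt`, and `<key>.json` inside the tar and that the JSON carries a `uid` field; that layout, the helper names, and the demo shard are assumptions, not confirmed details of the datacomp shards. In a real distributed job, `rank` and `world_size` would come from the launcher rather than being hard-coded.

```python
import io
import json
import os
import tarfile
import tempfile


def shards_for_rank(shard_paths, rank, world_size):
    """Give each worker a disjoint, round-robin subset of the shards."""
    return sorted(shard_paths)[rank::world_size]


def iter_samples(tar_path):
    """Yield (uid, image_bytes, caption) for each complete sample in a shard."""
    def unpack(files):
        # Assumed layout: <key>.json has a "uid" field, <key>.txt the caption.
        if all(ext in files for ext in ("json", "txt", "jpg")):
            meta = json.loads(files["json"])
            yield meta["uid"], files["jpg"], files["txt"].decode("utf-8")

    with tarfile.open(tar_path) as tar:
        current_key, files = None, {}
        for member in tar:
            if not member.isfile():
                continue
            key, ext = os.path.splitext(member.name)
            if key != current_key:
                # A new key means the previous sample is finished.
                yield from unpack(files)
                current_key, files = key, {}
            files[ext.lstrip(".")] = tar.extractfile(member).read()
        yield from unpack(files)


def write_demo_shard(path, samples):
    """Write a tiny shard in the assumed layout (illustration only)."""
    with tarfile.open(path, "w") as tar:
        for key, (uid, image, caption) in samples.items():
            payloads = {
                "jpg": image,
                "json": json.dumps({"uid": uid}).encode("utf-8"),
                "txt": caption.encode("utf-8"),
            }
            for ext, data in payloads.items():
                info = tarfile.TarInfo(f"{key}.{ext}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))


with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "00000000.tar")
    write_demo_shard(shard, {
        "000": ("a" * 32, b"fake-jpeg-1", "a cat"),
        "001": ("b" * 32, b"fake-jpeg-2", "a dog"),
    })
    my_shards = shards_for_rank([shard], rank=0, world_size=1)
    uids = [uid for path in my_shards
            for uid, _image, _caption in iter_samples(path)]
```

Each worker would score its own samples and the per-worker uid lists would then be gathered before anything is written to disk.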
from datacomp.
No problem! We also open-sourced the code that we used for generating our metadata in this repo: https://github.com/mlfoundations/dataset2metadata. Let me know if that covers what you're looking for.
from datacomp.
Yes, this is exactly what I was looking for!
Comparing your uploaded parquets with the ones I generated using https://github.com/mlfoundations/dataset2metadata, I found some incompatible keys ('clip_b32_similarity_score' vs. 'oai-clip-vit-b32-score').
I think this can be solved by changing postprocess_parquet_lookup (https://github.com/mlfoundations/dataset2metadata/blob/225408d90462323cf2afc5b40e789bc8cd966ce6/dataset2metadata/registry.py#L25).
I will post here once I finish reproducing the baselines using this entire process.
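Until the keys are aligned upstream, one workaround is to rename them after the metadata has been generated. The sketch below assumes, from the order in which the two keys are quoted above, that 'oai-clip-vit-b32-score' is the name in the generated parquets and 'clip_b32_similarity_score' the name in the uploaded ones. `KEY_RENAMES` and `rename_keys` are hypothetical names, and rows are modelled as plain dicts.

```python
# Assumed direction: generated key -> key used by the uploaded parquets.
KEY_RENAMES = {"oai-clip-vit-b32-score": "clip_b32_similarity_score"}


def rename_keys(row, renames=KEY_RENAMES):
    """Return a copy of `row` with incompatible keys renamed."""
    return {renames.get(key, key): value for key, value in row.items()}


generated_row = {"uid": "0" * 32, "oai-clip-vit-b32-score": 0.31}
compatible_row = rename_keys(generated_row)
```

With pandas, the same mapping can be applied to a whole table via `df.rename(columns=KEY_RENAMES)`.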
from datacomp.
Nice—thanks for looking into this end to end! I will add a flag to dataset2metadata to use keys that are compatible with this repo
from datacomp.