
datacomp's Introduction

DataComp

[ Paper ] [ Website ] [ Blog ]

Welcome to our competition. This repository contains the participant tooling necessary to download data from our pool, train CLIP models, evaluate them on downstream tasks and submit to our leaderboard.

Overview

DataComp is a competition about designing datasets for pre-training CLIP models. Instead of iterating on model design and hyperparameter tuning like in traditional benchmarks, in DataComp your task is to curate a multimodal pre-training dataset with image-text pairs that yields high accuracy on downstream tasks. Model architecture and hyperparameters are fixed, allowing participants to innovate on the dataset design. As part of the benchmark, we provide a large collection of uncurated image-text pairs, crawled from the public internet.

Our benchmark offers two tracks: one where participants must use only samples from the pools we provide (filtering), and another where participants can use external data, including samples from our pool (Bring your own data, BYOD).

DataComp is structured to accommodate participants with diverse levels of computational resources: each track is broken down into four scales, with varying amounts of compute requirements.

An overview of our benchmark and participant workflow can be found below. For more information, check out our paper and website.

Participant workflow

Installing dependencies

Run:

bash create_env.sh

To activate the environment:

conda activate datacomp

If using cloud storage services (e.g. AWS S3), you'll need to install additional dependencies (e.g. pip install 'cloudpathlib[s3]').

Downloading CommonPool

To download, run the following command, replacing $scale with the competition scale (i.e. small, medium, large or xlarge) and $data_dir with the output directory where you want the data to be stored.

python download_upstream.py --scale $scale --data_dir $data_dir

There are four scales in our competition:

  • small: 12.8M pool size, 12.8M examples seen
  • medium: 128M pool size, 128M examples seen
  • large: 1.28B pool size, 1.28B examples seen
  • xlarge: 12.8B pool size, 12.8B examples seen

The script will create two directories inside $data_dir: metadata and shards.

Along with the images and captions, this script will also download metadata, including .parquet files that contain the image urls, captions, and other potentially useful information such as the similarities between the images and captions given by trained OpenAI CLIP models. If the flag --download_npz is used, the script will also download the .npz files with features extracted by the trained OpenAI CLIP models for each sample.
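
If you want to inspect the metadata before choosing a filtering strategy, here is a minimal sketch using pandas (the exact column names depend on your parquet files and should be checked; the text above only guarantees uids, urls, captions, and CLIP similarity scores):

import glob
import pandas as pd

# Read one metadata shard and look at its columns and size.
parquet_files = sorted(glob.glob("path/to/data_dir/metadata/*.parquet"))
df = pd.read_parquet(parquet_files[0])
print(df.columns.tolist())  # e.g. uid, url, caption, CLIP similarity columns
print(len(df), "samples in this metadata shard")
print(df.head())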

We download the image data using img2dataset, which stores it as .tar shards with the images and captions to be consumed by webdataset. Once the download finishes, the data will be available at $data_dir/shards.
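
As a rough sketch of how the resulting shards can be consumed (assuming img2dataset's default jpg/txt/json keys; inspect a .tar shard to confirm):

import glob
import webdataset as wds

# Iterate over the downloaded shards; each sample holds an image, a caption,
# and a json record with per-sample metadata such as the uid.
shards = sorted(glob.glob("path/to/data_dir/shards/*.tar"))
dataset = wds.WebDataset(shards).decode("pil").to_tuple("jpg", "txt", "json")

for image, caption, meta in dataset:
    print(image.size, caption[:80], meta.get("uid"))
    break  # only look at the first sample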

To download only metadata, use the --skip_shards flag.

The disk requirements for each scale are shown below.

               metadata (parquets)   metadata (npzs)   data (tars)
small scale    3 GB                  75 GB             450 GB
medium scale   30 GB                 750 GB            4.5 TB
large scale    300 GB                7.5 TB            45 TB
xlarge scale   3 TB                  75 TB             450 TB

Downloading DataComp-1B

The script download_upstream.py can be used to download the DataComp-1B dataset that we release as our best performing subset of the xlarge pool. To download this, use the following command:

python download_upstream.py --scale datacomp_1b --data_dir $data_dir

The above command will create the same directory structure under $data_dir and can be modified as described above.

Downloading external data

The script download_upstream.py can also be used to download other image-text datasets, using img2dataset. Given parquet files containing the image urls and captions, you can use this script to download the images, by using the flag --metadata_dir to point to the directory where the parquet files are stored. By default, we also download the parquet files corresponding to the pools we provide, and this metadata is stored in a subfolder of $data_dir.
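
For example, a hypothetical invocation (the exact combination of flags may differ; check python download_upstream.py --help) could look like:

python download_upstream.py --scale small --data_dir $data_dir --metadata_dir path/to/your/parquets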

Optimizing the download

When using img2dataset, there are several ways to optimize the download process such as using multiple nodes in a distributed environment or setting up a DNS resolver to increase the success rate of images being downloaded. See the img2dataset repository for further instructions on how to optimize the download process, as well as information on potential issues during the download.

Selecting samples in the filtering track

Before training, you will need to select the subset of samples you wish to use. Given a set of chosen samples, we create new shards with only those samples, which the training code then consumes. For each scale, models are trained for a fixed number of steps, regardless of the size of the chosen subset of the provided pool.

Each sample in our pool has a unique identifier, which is present in the metadata parquets, and in the json files inside the .tar shards.

The format describing the subset of samples should be a numpy array of dtype numpy.dtype("u8,u8") (i.e. a structured array of pairs of unsigned 64-bit integers), with shape (subset_size,), containing a list of uids (128-bit hashes from the parquet files) in lexicographic sorted order, saved to disk in either npy format or memory-mapped format.

For instance, if you have a list of uids uids = ['139e4a9b22a614771f06c700a8ebe150', '6e356964a967af455c8016b75d691203'], you can store them by running the following python code:

import numpy as np

processed_uids = np.array([(int(uid[:16], 16), int(uid[16:32], 16)) for uid in uids], np.dtype("u8,u8"))
processed_uids.sort()
np.save(out_filename, processed_uids)  # out_filename: path where the .npy subset file is saved
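
To sanity-check a saved subset file, you can load it back and reconstruct the hex uids; a small sketch under the format described above (subset_file.npy is a placeholder path):

import numpy as np

subset = np.load("subset_file.npy")  # structured array with dtype "u8,u8"
assert np.array_equal(subset, np.sort(subset)), "uids must be in sorted order"

# Convert the (high, low) 64-bit pairs back to 32-character hex uids.
hex_uids = [f"{int(u[0]):016x}{int(u[1]):016x}" for u in subset]
print(len(hex_uids), hex_uids[:2])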

After creating a subset, you may invoke the resharder to build the subset shards in $output_dir like so:

python resharder.py -i $download_dir -o $output_dir -s $subset_file

If desired, the resharder can be run in parallel on multiple nodes. The easiest way to do so is to split the input directory into smaller subfolders with fewer shards, and run a separate resharder job for each of them, each writing to its own output directory.

Baselines

Here we provide command lines for the main filtering baselines found in Table 3 of our paper, along with short descriptions. Each baseline reads the .parquet metadata files (and also the .npz files when needed), selects a subset of uids, sorts them, and saves them to a .npy subset file. This file can then be passed to the resharder described above to create a webdataset containing only the selected subset of the pool.

Note: the --num_workers flag controls the number of metadata files that are read into memory and processed in parallel. It defaults to the number of cores, which may be too many for machines with many cores and limited memory. For baselines other than image-based filtering, allow at least 256MB of memory per worker.

No filtering

Here we load all metadata uids without any additional filtering.

python baselines.py --metadata_dir path/to/metadata --save_path path/to/no_filter.npy --name no_filter

Basic filtering

Simple checks on caption length, whether English is the detected caption language, image size, and image aspect ratio.

python baselines.py --metadata_dir path/to/metadata --save_path path/to/basic_filter.npy --name basic_filter

CLIP score filtering

Retain the top k=0.3 fraction of the pool by L/14 CLIP score.

python baselines.py --metadata_dir path/to/metadata --save_path path/to/clip_score_l14_30_percent.npy --name clip_score --arch l14 --fraction 0.3

Retain all examples with B/32 CLIP score above 0.25.

python baselines.py --metadata_dir path/to/metadata --save_path path/to/clip_score_b32_25_threshold.npy --name clip_score --arch b32 --threshold 0.25
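
For illustration, the sketch below shows roughly what such a threshold baseline does under the hood; baselines.py already implements this, and the column name clip_b32_similarity_score is an assumption that should be checked against your metadata parquets:

import glob
import numpy as np
import pandas as pd

uids = []
for path in glob.glob("path/to/metadata/*.parquet"):
    # Keep only samples whose B/32 CLIP score exceeds the threshold.
    df = pd.read_parquet(path, columns=["uid", "clip_b32_similarity_score"])
    kept = df[df["clip_b32_similarity_score"] > 0.25]["uid"]
    uids.extend((int(u[:16], 16), int(u[16:32], 16)) for u in kept)

processed_uids = np.array(uids, np.dtype("u8,u8"))
processed_uids.sort()
np.save("path/to/clip_score_b32_25_threshold.npy", processed_uids)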

LAION-2B filtering

Reproduces the filtering strategy used to create the LAION-2B dataset: applies a B/32 CLIP score filter on image-text pairs, retaining samples with score above 0.28, and an English filter using the gcld3 model to detect language.

python baselines.py --metadata_dir path/to/metadata --save_path path/to/laion.npy --name laion2b

Text-based filtering

A text-based filter that retains samples whose captions contain words from the ImageNet-21k synsets.

python baselines.py --metadata_dir path/to/metadata --save_path path/to/text_based.npy --name text_based

Image-based filtering

An image-clustering-based method that retains samples whose images have content close to ImageNet-1k training images, as measured by the nearest-neighbor cluster center of the image's L/14 CLIP embedding.

Note: this baseline uses GPU resources. By default it will try to use all GPUs. To control which GPUs are used, set the CUDA_VISIBLE_DEVICES environment variable.

python baselines.py --metadata_dir path/to/metadata --save_path path/to/image_based.npy --name image_based --image_based_scale small --batch_size 512

Note: this baseline requires pre-computed image cluster centroids which will be downloaded automatically the first time you run it. If you want to generate the centroids yourself, please see baselines/image_based_clustering.md for instructions.

Intersection of image-based and CLIP score filtering

Applies both the CLIP score (L/14) with top 0.3 fraction filter and an Image-based filter. This is our best performing baseline for medium, large, and xlarge scales. We used this strategy at the xlarge scale to create the DataComp-1B dataset.

Note: this baseline uses GPU resources. By default it will try to use all GPUs. To control which GPUs are used, set the CUDA_VISIBLE_DEVICES environment variable.

python baselines.py --metadata_dir path/to/metadata --save_path path/to/image_based_intersect_clip_score_l14_30_percent.npy --name image_based_intersect_clip_score --image_based_scale small --batch_size 512 --arch l14 --fraction 0.3

Training

To train, run the following command:

torchrun --nproc_per_node $num_gpus train.py --scale $scale --data_dir $data_dir --output_dir $output_dir --exp_name $exp_name

We support using multiple different data directories. For instance, if your data is in /path/to/dir/1 and in /path/to/dir/2, you can use the flag --data_dir=/path/to/dir/1::/path/to/dir/2.

A sample script for training with SLURM is provided at tools/slurm_train.sh.

Hyper-parameters

The hyper-parameters used for training are fixed for a given scale. For the small and medium scales we use ViT-B/32 models, for the large scale, ViT-B/16, and for the xlarge scale, ViT-L/14. The number of samples seen during training is determined by the scale, and is equal to the size of the corresponding pool we provide. Additional details on hyper-parameters can be found in the paper.

You should not modify any hyper-parameters for training, including batch size. Any changes may affect accuracy and make results incomparable.

Note on variance across runs

We observed small (but non-zero) variance in performance when changing random seeds, with differences in accuracy typically in the range of 0.2 percentage points on ImageNet and up to 0.006 on average. We also note that some factors can make runs non-deterministic even when setting the same random seed (for example, random network failures when streaming data can cause different batches to be formed when re-running; see also https://pytorch.org/docs/stable/notes/randomness.html).

Evaluation

[Optional] Pre-download evaluation datasets

Pre-downloading evaluation datasets is optional if you have a strong Internet connection; by default, the data will be streamed directly from Hugging Face Hub. If you wish to download the data, run the following command, replacing $download_dir with your desired download path:

python download_evalsets.py $download_dir

Evaluating

To evaluate, run the following command:

python evaluate.py  --train_output_dir $train_output_dir/$exp_name

If you have already downloaded the datasets, you can use the flag --data_dir to point the code to the path where the data is stored. By default, the evaluation script outputs to the same directory as $train_output_dir. This can be changed with the flag --output_dir on the evaluation script.

Note: This will not submit to our leaderboard unless you pass the --submit flag.

Submitting

To submit, you'll run the evaluate script with some extra flags.

The submission script will upload files (such as the model checkpoint and the file specifying the sample ids) to the Hugging Face Hub, so you will need a Hugging Face account and a repository where these artifacts will be stored. To set this up, follow these steps:

  1. Make sure you have git-lfs installed (run git lfs install if not)
  2. Create a Hugging Face account at https://huggingface.co/join.
  3. Login to your Hugging Face account: huggingface-cli login
  4. Create a repository where the data will be stored: huggingface-cli repo create <REPO_NAME> --type model.

Once you're ready to submit, run the evaluation script with some extra flags, for example:

python evaluate.py \
    --track=filtering \
    --train_output_dir=$train_output_dir \
    --samples=$sample_files \
    --dataset-size=1234568 \
    --submit \
    --method_name="[your method name, please be descriptive!]" \
    --author="[your name]" \
    --email="[your email]" \
    --hf_username=$hf_username \
    --hf_repo_name=$hf_repo_name

Please note that the name of your method and the authors (and no other information) will be made publicly available in our leaderboard. Be sure to replace all fields with the correct information.

If you have a paper or blog post and would like that to be linked on our leaderboard, you can add that information with the --writeup flag.

Important: We highly encourage users to specify the samples used to train the model using the --samples flag. This can be either file(s) containing the uids of samples from our pool, and/or other files specifying the urls and captions for images outside our pool. You can specify multiple files using the :: separator, for instance --samples=/path/to/sample_ids.npy::/path/to/custom_data.parquet. We also highly encourage participants to upload the checkpoints for their trained models using the --upload-checkpoint flag.

Checkpoints

We release the checkpoints for our main baselines as part of OpenCLIP. More details can be found at https://github.com/mlfoundations/open_clip/blob/main/docs/datacomp_models.md.

Citation

If you found this repository, our paper or data useful, please consider citing:

@article{datacomp,
  title={DataComp: In search of the next generation of multimodal datasets},
  author={Samir Yitzhak Gadre and Gabriel Ilharco and Alex Fang and Jonathan Hayase and Georgios Smyrnis and Thao Nguyen and Ryan Marten and Mitchell Wortsman and Dhruba Ghosh and Jieyu Zhang and Eyal Orgad and Rahim Entezari and Giannis Daras and Sarah Pratt and Vivek Ramanujan and Yonatan Bitton and Kalyani Marathe and Stephen Mussmann and Richard Vencu and Mehdi Cherti and Ranjay Krishna and Pang Wei Koh and Olga Saukh and Alexander Ratner and Shuran Song and Hannaneh Hajishirzi and Ali Farhadi and Romain Beaumont and Sewoong Oh and Alex Dimakis and Jenia Jitsev and Yair Carmon and Vaishaal Shankar and Ludwig Schmidt},
  journal={arXiv preprint arXiv:2304.14108},
  year={2023}
}

datacomp's People

Contributors

0x2b3bfa0, afang-story, borisdayma, djghosh13, eltociear, gabrielilharco, georgiossmyrnis, mitchellnw, nielsrogge, sagadre, vaishaal, yaircarmon, zwcolin


datacomp's Issues

Adding the detected_language to metadata

Hi! Thanks a lot for your work!

I can't find the detected_language column in the metadata. I suppose someone has to manually compute it and add it to the parquet files, but that takes quite a long time for a dataset like the xlarge pool. Having it stored in the metadata parquet files would help all the people that have limited resources and want to pre-filter the data before downloading the images.

Practical example:

I am interested in training a model only on the image-text pairs with English captions, and I estimate that those data represent around 30% of the small pool. Filtering out all the other languages would allow me to save a lot of storage space.

What is the normal success rate and downloading speed?

Hi, somehow the success rate starts at 0.8 and then drops to 0.2 after downloading for a while. What success rate should we expect to be healthy? If the average success rate is 0.5, does it mean that after downloading I will only have 50% of the total dataset?
Also, will factors like the downloaded image resolution have a great impact on the success rate?
And do you know if it's possible to download DataComp-1B using a 96-CPU pod in a week?
Thanks

Workshop submission deadline

Hi everyone,

I am confused about the submission deadline. Is it by the end of the 8th of September or the end of the 7th of September?
thanks

Leaderboard update

Dear organizers, can I know when the leaderboard will be updated? It would be quite helpful to at least know some statistics about the leaderboard. Thanks.

Pretraining dataset

Thank you for your excellent work. I'm currently training my own CLIP model and have a question. If I use LAION-2B, COYO-700M, and Datacomp datasets simultaneously for training, will it yield better results? Should I perform data deduplication?

Is it possible to implement data filtering in training script?

It seems the current process of data filtering is:

  1. filtering metadata.parquet to get sample_ids.npy —> 2. reshard based on sample_ids.npy and save the new dataset again using resharder.py.

But I think an ideal data filtering process will be:

  1. filtering the data shards to get sample_ids.npy, because the data shards contain more information such as image pixel values —> once sample_ids.npy is generated, the training script applies it directly for dataset construction. In this way, we do not need to store the data multiple times, which is time- and storage-consuming.

I know things will be different when the data expands to the billion level. There must be some reasons that you adopted the existing solution.

But I am still unsure whether the second solution is possible or easy to implement (at least for the small track)?

Dataset Size on Leaderboard

Hey there,

While reviewing the leaderboard submissions for the small filtering track, I observed instances where the dataset size was noted as 1.3e7, which is essentially equivalent to the original dataset size without any filtering. Now I was wondering whether it was actually the case that these submissions kept almost all of the original data (which I doubt given submission titles such as BLIP2-COCO-finetuned_similarity_top-35%) or if this is an error.

Thanks for clarifying!

Is there any evaluation randomness?

Hi the Team,

I tried to evaluate the model commonpool_l_clip_s1b_b8k here using evaluate.py. The ImageNet acc1 is 0.57772, which is the same as the 0.578 reported here, but the average result is 0.52936, which is different from the 0.520 reported in the row large/CLIP score (L/14 30%) in Table 3 of your paper. Is this difference normal?

Thanks!

Tried evaluate the model on a local network only machine

Well, I first use

python download_evalsets.py $download_dir

to download all the necessary datasets on an internet-accessible machine and then migrate the data to my machine with limited internet access.
All the other evaluations went well except for the retrieval datasets, which use the hf_cache/ directory instead.

The error goes like this :

>>> datasets.load_dataset("nlphuji/flickr_1k_test_image_text_retrieval",split="test", cache_dir=os.path.join("/mnt/data/datacom2023/evaluate_datasets", "hf_cache"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/anaconda3/envs/datacomp/lib/python3.10/site-packages/datasets/load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
  File "/root/anaconda3/envs/datacomp/lib/python3.10/site-packages/datasets/load.py", line 1815, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/root/anaconda3/envs/datacomp/lib/python3.10/site-packages/datasets/load.py", line 1512, in dataset_module_factory
    raise e1 from None
  File "/root/anaconda3/envs/datacomp/lib/python3.10/site-packages/datasets/load.py", line 1468, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({type(e).__name__})")
ConnectionError: Couldn't reach 'nlphuji/flickr_1k_test_image_text_retrieval' on the Hub (ConnectTimeout)

Seems like the huggingface datasets module is still trying to connect to the internet. Is there any trick I can play to skip the connection to huggingface? The evaluation command :

python evaluate.py --train_output_dir /mnt/data/datacomp2023/train_output/basic_train --data_dir /mnt/data/datacomp2023/evaluate_datasets

Workshop paper submission

According to the webpage, challenge participants are invited to submit a paper, give a talk (top performers), and present a poster. Are these, especially the paper, required?

Thanks

Problems evaluating trained model

Hi everyone,

I'm having trouble evaluating my trained model on the evaluation datasets. I downloaded the datasets, but the evaluation script is only running my model on the ImageNet-1k dataset. I am running this command:
python evaluate.py --track filtering --data_dir ./eval_datasets/ --train_output_dir trained_models/clip_model_1

Thanks for your time

train/test splits for downstream tasks

Hello!

For some of the downstream datasets, the train-test splitting is unknown. Could you please share how the train and test subsets are split so that we can avoid using test images? The tasks with unknown train-test split are:
Caltech-101 [41], DTD [26], EuroSAT [57, 147], KITTI distance [44, 147], RESISC45 [23, 147], Dollar Street [115], GeoDE [107]

Thanks a lot!

Training log

Hello, I am currently replicating a CLIP-Large model using the DataComp dataset. Could you provide the training logs from Weights & Biases, such as the zero-shot results on ImageNet at different steps?

14% of SHA256 hashes not matching

Introduction

We downloaded the Datacomp 1B set.
For verification, we only kept an image if the SHA256 checksum of its bytes matched the corresponding entry in the metadata you provide.

Problem Statement

Hundreds of millions of images were discarded due to hash mismatches.

Here's one example. Let's look at entry 21:
https://huggingface.co/datasets/mlfoundations/datacomp_1b/viewer/default/train?row=21

  • UID: 38f76e4b1b4a77ca66a62b453da17912
  • text: Cable Manager, Horizontal, Recessed Flat ...
  • url: https://images.eanixter.com/viewex/PR108844V6.JPG
  • sha256: 0e77fada7a972fc2fa2ad8991701968ea0953d03f799787c4d27f072bdfa4164


If you download the image and compute the hash, it will be this:

$ curl -s https://images.eanixter.com/viewex/PR108844V6.JPG | sha256sum
6f392bcd683005e33356a2c003fb8378c7ff445d0b54b3429c84bd112299d273

One might think the image was modified slightly (e.g. its header or some pixels). However, checking the web archive version from 2019 yields the same hash:

$ curl -s https://web.archive.org/web/20191127193532if_/https://images.eanixter.com/viewex/PR108844V6.JPG | sha256sum
6f392bcd683005e33356a2c003fb8378c7ff445d0b54b3429c84bd112299d273

(You can view the web archive capture here.)

Mitigation

We would like to better understand how the hashes were computed. It seems the code that was used for that is not published.
Potentially, we could build a workaround by computing the hashes in the same way you did.

Ultimately, we think fixing the hashes in the metadata will be the best solution.

FileNotFoundError while downloading DataComp-1B

Thanks for the great work.
I encountered the following issue while downloading the DataComp-1B dataset:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/envs/datacomp/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/opt/conda/envs/datacomp/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/opt/conda/envs/datacomp/lib/python3.10/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory

FMoW dataset and results variance

Hi, I'm using the DataComp evaluation and it seems that the FMoW dataset dramatically increases variance. The main metric is 'worst-region accuracy'. There are 5 regions, 4 of which have more than 700 samples, but one has only 4 images. This means the answer on a single image can change the FMoW metric from 0 to 0.25, which shifts the overall average by 0.25/38 ≈ 0.0066. For instance, an average accuracy of 70.0 and an average accuracy of 69.4 may differ by the answer on one picture!

Because it's impossible to improve the dataset, I suggest simply removing this region from the predictions.

Consistency between Table 23 and Fig 3

Hi Team,

I have a question about the consistency between Table 23 CLIP-L14 results and Fig 3 avg CLIP-L14 results. In Table 23, the average results of both CLIP-L14 30% and 40% have the same maximum 0.520, but, in Fig 3, the large-CLIP-L14 30% setting has a lower performance than 40%. Please correct me if I wrongly interpret the table or figure. Thanks!

https://huggingface.co/datasets/djghosh/wds_flickr_1k_test_image_text_retrieval_test doesn't exist?

trying to download flickr_1k_test_image_text_retrieval but got errors for downloading from https://huggingface.co/datasets/djghosh/wds_flickr_1k_test_image_text_retrieval_test

========== Download 'Flickr' ===========

Repo card metadata block was not found. Setting CardData to empty.
Downloading data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3305.20it/s]
Extracting data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 164.31it/s]
Generating test split: 1000 examples [00:00, 4210.39 examples/s]

========== Download 'Flickr' ===========

--2023-08-16 21:27:28--  https://huggingface.co/datasets/djghosh/wds_flickr_1k_test_image_text_retrieval_test/raw/main/classnames.txt
Resolving huggingface.co (huggingface.co)... 99.84.108.70, 99.84.108.129, 99.84.108.87, ...
Connecting to huggingface.co (huggingface.co)|99.84.108.70|:443... connected.
HTTP request sent, awaiting response... 401 Unauthorized

Username/Password Authentication Failed.
--2023-08-16 21:27:28--  https://huggingface.co/datasets/djghosh/wds_flickr_1k_test_image_text_retrieval_test/raw/main/zeroshot_classification_templates.txt
Resolving huggingface.co (huggingface.co)... 99.84.108.70, 99.84.108.129, 99.84.108.87, ...
Connecting to huggingface.co (huggingface.co)|99.84.108.70|:443... connected.
HTTP request sent, awaiting response... 401 Unauthorized

Username/Password Authentication Failed.
--2023-08-16 21:27:28--  https://huggingface.co/datasets/djghosh/wds_flickr_1k_test_image_text_retrieval_test/raw/main/test/nshards.txt
Resolving huggingface.co (huggingface.co)... 99.84.108.70, 99.84.108.129, 99.84.108.87, ...
Connecting to huggingface.co (huggingface.co)|99.84.108.70|:443... connected.
HTTP request sent, awaiting response... 401 Unauthorized

Username/Password Authentication Failed.
Traceback (most recent call last):
  File "download_datasets.py", line 139, in <module>
    sys.exit(main(args))
  File "download_datasets.py", line 15, in main
    download_datasets(args.data_dir)
  File "download_datasets.py", line 79, in download_datasets
    nshards = int(f.read())
ValueError: invalid literal for int() with base 10: ''

Text search over CommonPool

Is there any web UI for performing text search over CommonPool? I want to use it to collect my own dataset for few-shot vision tasks.

(I know there is already a search engine over LAION-5B, but I need the search not to be model-assisted to avoid model bias. In LAION-5B, this is not the case, because it searches by CLIP image embeddings, and LAION-5B itself is model-filtered. So searching over CommonPool seems like a better solution.)

To implement this by myself, I guess I need some search engine. Would you recommend Apache Solr or other? Any common practices here?

Usage with AWS S3 and Ray

Usage

Cluster creation

ray up --yes cluster.yml
ray dashboard cluster.yml

Job submission

git clone https://github.com/mlfoundations/datacomp
ray job submit \
--address=http://localhost:8265 \
--working-dir=datacomp \
--runtime-env-json="$(
  jq --null-input '
    {
      conda: "datacomp/environment.yml",
      env_vars: {
        AWS_ACCESS_KEY_ID: env.AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY: env.AWS_SECRET_ACCESS_KEY,
        AWS_SESSION_TOKEN: env.AWS_SESSION_TOKEN
      }
    }
  '
)" \
-- \
python download_upstream.py \
--subjob_size=11520 \
--thread_count=128 \
--processes_count=1 \
--distributor=ray \
--metadata_dir=/tmp/metadata \
--data_dir=s3://datacomp-small \
--scale=small

Note

Image shards would be saved to the datacomp-small AWS S3 bucket, specified with the --data_dir option.

Cluster deletion

$ ray down --yes cluster.yml

Configuration

Sample cluster.yml

cluster_name: datacomp-downloader

min_workers: 0
max_workers: 10
upscaling_speed: 1.0

docker:
  run_options: [--dns=127.0.0.1]
  image: rayproject/ray:2.6.1-py310
  container_name: ray

provider:
  type: aws
  region: us-east-1
  cache_stopped_nodes: false

available_node_types:
  ray.head.default:
    resources: {}
    node_config:
      InstanceType: m5.12xlarge
      ImageId: ami-068d304eca3399469
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            DeleteOnTermination: true
            VolumeSize: 200
            VolumeType: gp2
  ray.worker.default:
    resources: {}
    node_config:
      InstanceType: m5.12xlarge
      ImageId: ami-068d304eca3399469
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            DeleteOnTermination: true
            VolumeSize: 200
            VolumeType: gp2

initialization_commands:
  - wget https://secure.nic.cz/files/knot-resolver/knot-resolver-release.deb
  - sudo dpkg --install knot-resolver-release.deb
  - sudo apt-get update
  - sudo apt-get install --yes knot-resolver
  - echo $(hostname --all-ip-addresses) $(hostname) | sudo tee --append /etc/hosts
  - sudo systemctl start kresd@{1..48}.service
  - echo nameserver 127.0.0.1 | sudo tee /etc/resolv.conf
  - sudo systemctl stop systemd-resolved

setup_commands:
  - sudo apt-get update
  - sudo apt-get install --yes build-essential ffmpeg

Obscure details

  • When --data_dir points to a cloud storage like S3, we also have to specify a local --metadata_dir because the downloader script doesn't support saving metadata to cloud storage.

  • The last pip install on the setup_commands section is needed for compatibility with AWS S3, because the required libraries aren't included in the conda environment file.

  • There is no need to provide additional AWS credentials if the destination bucket is on the same account as the cluster, because it already has S3 full access through an instance profile.

    • While the cluster has a default instance profile that grants full S3 access, it doesn't seem to work as intended (probably due to rate limit of IMDS endpoint), and I ended up having to pass my local AWS credentials as environment variables.
  • The Python version in environment.yml must match the Python version of the Ray cluster; make sure that docker.image on cluster.yaml has exactly the same version as the environment.yml from this project.

How to deal with images that cannot be downloaded?

Thanks for your great and meaningful competition.

When I run python download_upstream.py --scale $scale --data_dir $data_dir, only around ~91% of the images can be downloaded successfully. This means the actual pool size will be smaller than the given pool size (< 12.8M).

How should we deal with that? I think participants who manage to download more candidate samples can benefit more.

Download data: why is the success rate always 0.14, while the file size is 80% of the total (total file size is in the README)?

Sharding file number 214 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1df9462bc6885e969f11aaa635d9332c.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 215 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1df9dd8c3710199fc1a3553e2c32c088.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 216 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e035aecc740dc7c69d6078e631623b1.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 217 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e04906d042a35489b73c3b4ac13dca2.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 218 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e06f0994865438d88fd682a32a406a4.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 219 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e15153d29898fddda371065d92a3690.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 220 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e5389481e14532f3bafbd7a863154d3.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 221 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e72542b65bdbb87aacee8dc4dc77108.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 222 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e763882b04e4d151feb536eeb41c3b6.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 223 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e93a84dd21c161eb9d58d8bd2a13824.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 224 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1e962285d730d3e11dc685ebbd09af05.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 225 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1ea5c7f274e3ea11f34eba02d7737502.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 226 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1ed3681d826a23ad6ac71368a2c70c55.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 227 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1ed52da08ab1b413fd9cfa39d0142933.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 228 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1f1fcad950bd01a8473a1486c6970b09.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 229 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1f2b17df53b505a0d0b892a649cfeb12.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 230 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1f3b931c1473c40a0f54b343158f963e.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 231 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1f69531c3338a697864f0be16a031b09.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 232 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1f8934426b1e4464a7f427c24e10ce6f.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 233 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1f89fb35014d629dd9eb9aa536354463.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 234 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1f99a90abc00dda416cf9f9fda2f033c.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 235 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1fa4fcf25d3e3f17943912b5dfcffb8a.parquet
File sharded in 0 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 236 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1fbf923f6041602b76974860585f70e6.parquet
File sharded in 18 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
Sharding file number 237 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1feceb12579113e8578942797b962e01.parquet
File sharded in 51 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
6it [03:21, 15.04s/it]Sharding file number 238 of 253 called /hetu_group/chenqilin/datacomp/data_medium/metadata/1ff0057a457efcdae3ac0aafbc3ace3d.parquet
File sharded in 51 shards
Downloading starting now, check your bandwidth speed (with bwm-ng)your cpu (with htop), and your disk usage (with iotop)!
11it [03:24,  2.89s/it]worker  - success: 0.141 - failed to download: 0.858 - failed to resize: 0.001 - images per sec: 95 - count: 10000
total   - success: 0.141 - failed to download: 0.858 - failed to resize: 0.001 - images per sec: 95 - count: 10000
worker  - success: 0.145 - failed to download: 0.855 - failed to resize: 0.001 - images per sec: 97 - count: 10000
total   - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 190 - count: 20000
worker  - success: 0.144 - failed to download: 0.854 - failed to resize: 0.001 - images per sec: 98 - count: 10000
total   - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 285 - count: 30000
worker  - success: 0.139 - failed to download: 0.860 - failed to resize: 0.001 - images per sec: 98 - count: 10000
total   - success: 0.142 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 380 - count: 40000
worker  - success: 0.146 - failed to download: 0.853 - failed to resize: 0.001 - images per sec: 97 - count: 10000
total   - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 474 - count: 50000
worker  - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 97 - count: 10000
total   - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 569 - count: 60000
worker  - success: 0.142 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 98 - count: 10000
total   - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 664 - count: 70000
worker  - success: 0.146 - failed to download: 0.853 - failed to resize: 0.001 - images per sec: 96 - count: 10000
total   - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 759 - count: 80000
worker  - success: 0.137 - failed to download: 0.863 - failed to resize: 0.001 - images per sec: 95 - count: 10000
total   - success: 0.143 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 854 - count: 90000
16it [03:30,  1.61s/it]worker  - success: 0.140 - failed to download: 0.858 - failed to resize: 0.001 - images per sec: 89 - count: 10000
total   - success: 0.142 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 892 - count: 100000
worker  - success: 0.144 - failed to download: 0.854 - failed to resize: 0.001 - images per sec: 92 - count: 10000
total   - success: 0.143 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 981 - count: 110000
worker  - success: 0.141 - failed to download: 0.858 - failed to resize: 0.001 - images per sec: 93 - count: 10000
total   - success: 0.142 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 1070 - count: 120000
worker  - success: 0.143 - failed to download: 0.856 - failed to resize: 0.001 - images per sec: 94 - count: 10000
total   - success: 0.142 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 1159 - count: 130000
worker  - success: 0.139 - failed to download: 0.860 - failed to resize: 0.001 - images per sec: 94 - count: 10000
total   - success: 0.142 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 1248 - count: 140000
worker  - success: 0.136 - failed to download: 0.863 - failed to resize: 0.001 - images per sec: 93 - count: 10000
total   - success: 0.142 - failed to download: 0.857 - failed to resize: 0.001 - images per sec: 1337 - count: 150000
worker  - success: 0.138 - failed to download: 0.862 - failed to resize: 0.001 - images per sec: 92 - count: 10000
total   - success: 0.142 - failed to download: 0.858 - failed to resize: 0.001 - images per sec: 1427 - count: 160000
17it [04:06, 10.24s/it]worker  - success: 0.138 - failed to download: 0.861 - failed to resize: 0.001 - images per sec: 97 - count: 4493
total   - success: 0.141 - failed to download: 0.858 - failed to resize: 0.001 - images per sec: 1107 - count: 164493

Error when creating the environment

Hi everyone,

When creating the environment, the package gcld3 cannot be installed. I suspect it has something to do with the Python version, as the package can be installed with Python 3.6 up to 3.8, but Python 3.9 and newer have a problem with this package.

This is the error message I'm getting:

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> gcld3

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

I know this might not be the perfect place to raise this problem but the repo of the original package is not active.

Thanks,

`zeroshot_templates` split error for FairFace / UTKFace

Hi, I got the following error running download_evalsets.py and then evaluate.py:

Evaluating on FairFace
Traceback (most recent call last):
  File "/home/jason-chou/Downloads/datacomp/evaluate.py", line 382, in <module>
    metrics = evaluate_model(
  File "/home/jason-chou/Downloads/datacomp/eval_utils/main.py", line 40, in evaluate_model
    metrics = eval_fn(
  File "/home/jason-chou/Downloads/datacomp/eval_utils/fairness_eval.py", line 265, in evaluate_fairface_dataset
    objective, template = t.split(":", 1)
ValueError: not enough values to unpack (expected 2, got 1)

Digging into it, the erring part is

for t in zeroshot_templates:
    objective, template = t.split(":", 1)
    multilabel[objective]["zeroshot_templates"].append(template)

and then I printed out zeroshot_templates:

zeroshot_templates=['a bad photo of a {c}.', 'a photo of many {c}.', 'a sculpture of a {c}.', 'a photo of the hard to see {c}.', 'a low resolution photo of the {c}.', 'a rendering of a {c}.', 'graffiti of a {c}.', 'a bad photo of the {c}.', 'a cropped photo of the {c}.', 'a tattoo of a {c}.', 'the embroidered {c}.', 'a photo of a hard to see {c}.', 'a bright photo of a {c}.', 'a photo of a clean {c}.', 'a photo of a dirty {c}.', 'a dark photo of the {c}.', 'a drawing of a {c}.', 'a photo of my {c}.', 'the plastic {c}.', 'a photo of the cool {c}.', 'a close-up photo of a {c}.', 'a black and white photo of the {c}.', 'a painting of the {c}.', 'a painting of a {c}.', 'a pixelated photo of the {c}.', 'a sculpture of the {c}.', 'a bright photo of the {c}.', 'a cropped photo of a {c}.', 'a plastic {c}.', 'a photo of the dirty {c}.', 'a jpeg corrupted photo of a {c}.', 'a blurry photo of the {c}.', 'a photo of the {c}.', 'a good photo of the {c}.', 'a rendering of the {c}.', 'a {c} in a video game.', 'a photo of one {c}.', 'a doodle of a {c}.', 'a close-up photo of the {c}.', 'a photo of a {c}.', 'the origami {c}.', 'the {c} in a video game.', 'a sketch of a {c}.', 'a doodle of the {c}.', 'a origami {c}.', 'a low resolution photo of a {c}.', 'the toy {c}.', 'a rendition of the {c}.', 'a photo of the clean {c}.', 'a photo of a large {c}.', 'a rendition of a {c}.', 'a photo of a nice {c}.', 'a photo of a weird {c}.', 'a blurry photo of a {c}.', 'a cartoon {c}.', 'art of a {c}.', 'a sketch of the {c}.', 'a embroidered {c}.', 'a pixelated photo of a {c}.', 'itap of the {c}.', 'a jpeg corrupted photo of the {c}.', 'a good photo of a {c}.', 'a plushie {c}.', 'a photo of the nice {c}.', 'a photo of the small {c}.', 'a photo of the weird {c}.', 'the cartoon {c}.', 'art of the {c}.', 'a drawing of the {c}.', 'a photo of the large {c}.', 'a black and white photo of a {c}.', 'the plushie {c}.', 'a dark photo of a {c}.', 'itap of a {c}.', 'graffiti of the {c}.', 'a toy {c}.', 'itap of my {c}.', 'a photo of a cool {c}.', 'a photo of a small {c}.', 'a tattoo of the {c}.']

In contrast to classnames

classnames=['age:0-2', 'age:3-9', 'age:10-19', 'age:20-29', 'age:30-39', 'age:40-49', 'age:50-59', 'age:60-69', 'age:more than 70', 'gender:man', 'gender:woman', 'race:black', 'race:white', 'race:indian', 'race:latino/hispanic', 'race:middle eastern', 'race:southeast asian', 'race:east asian', 'toxic:black man', 'toxic:black woman', 'toxic:white man', 'toxic:white woman', 'toxic:indian man', 'toxic:indian woman', 'toxic:latino/hispanic man', 'toxic:latino/hispanic woman', 'toxic:middle eastern man', 'toxic:middle eastern woman', 'toxic:southeast asian man', 'toxic:southeast asian woman', 'toxic:east asian man', 'toxic:east asian woman', 'toxic:animal', 'toxic:gorilla', 'toxic:chimpanzee', 'toxic:orangutan', 'toxic:thief', 'toxic:criminal', 'toxic:suspicious person']

It seems that none of the zeroshot_templates contains ":", so t.split(":", 1) always fails to split. I am not sure what the intention here is — is the code somehow reading the wrong templates? Commenting the erring part out and using the zeroshot_templates directly allows the code to run without errors:

    # for t in zeroshot_templates:
    #     objective, template = t.split(":", 1)
    #     multilabel[objective]["zeroshot_templates"].append(template)
    for c in classnames:
        objective, classname = c.split(":", 1)
        multilabel[objective]["classnames"].append(classname)

    # Load metadata and not classes
    dataset.pipeline = dataset.pipeline[:5]  # This will break if webdataset changes
    dataset = dataset.to_tuple(["webp", "png", "jpg", "jpeg"], "npy").map_tuple(
        transform, None
    )
    if dataset_len:
        dataset = dataset.with_length((dataset_len + batch_size - 1) // batch_size)

    dataloader = torch.utils.data.DataLoader(
        dataset.batched(batch_size),
        batch_size=None,
        shuffle=False,
        num_workers=num_workers,
    )

    # Create classifier for each task
    classifiers = []
    n_classes = []
    for objective in FF_PRED_LABELS:
        info = multilabel[objective]
        classifiers.append(
            zsc.zero_shot_classifier(
                model,
                open_clip.get_tokenizer(model_arch),
                info["classnames"],
                zeroshot_templates,
                device,
            )
        )
        n_classes.append(len(info["classnames"]))
    # Combine classifiers
    # (...)

Remove CSAM, if present

A recent report definitively found CSAM in LAION-5B, and that dataset has been taken down until the problem can be solved. The DataComp dataset is much larger. Please let us know what steps you have taken and/or plan to take to address this issue responsibly. Thanks!

https://www.404media.co/laion-datasets-removed-stanford-csam-child-abuse/

Edit: Ali Alkhatib also makes a good point that, should dataset changes be needed, they might need to be mixed in with additional simultaneous data changes so an old version can't be easily diffed against a new version to find harmful material, among other best practices.

https://x.com/_alialkhatib/status/1737484384914092156?s=46

Deduplication of eval datasets

Hi,

I was wondering whether the data present in the eval datasets, such as the one used to measure ImageNet zero-shot accuracy, has been deduplicated from the training pool.

Result submission deadline

Hello! I see the workshop paper submission deadline is September 8th, 2023. Is the result submission deadline (key list & model checkpoint submission) also the same day?

Thanks!

Metadata download error - OSError: Consistency check failed

Hi, team!

I am trying to download the medium-scale dataset of the filtering track, but I keep failing with the following error.

OSError: Consistency check failed: file should be of size 122218957 but has size 56690589 ((…)f11adbfc933c.parquet).
We are sorry for the inconvenience. Please retry download and pass `force_download=True, resume_download=False` as argument.
If the issue persists, please let us know by opening an issue on https://github.com/huggingface/huggingface_hub.

It seems related to this issue huggingface/huggingface_hub#1498
Is there any bypass for downloading metadata, without using huggingface_hub?
Thanks.

Not able to push data to google cloud storage

While trying to push data to the Google Cloud, I am getting a file not found error. Any help would be highly appreciated.

python download_upstream.py --scale medium --data_dir "gs://dataset/datacomp/" --thread_count 2
Downloading metadata to gs://dataset/datacomp/metadata...

Downloading (…)c76a589ef5d0.parquet: 100%|██████████████████████████████████████████████████████████████████| 122M/122M [00:00<00:00, 395MB/s]
Downloading (…)30fd0d497176.parquet: 100%|██████████████████████████████████████████████████████████████████| 122M/122M [00:00<00:00, 360MB/s]
.
.
.
Downloading (…)0edfcd0a6bc7.parquet: 100%|██████████████████████████████████████████████████████████████████| 121M/121M [00:00<00:00, 260MB/s]
Fetching 253 files: 100%|███████████████████████████████████████████████████████████████████████████████████| 253/253 [00:54<00:00,  4.67it/s]
Done downloading metadata.7.parquet:  26%|████████████████▉                                                | 31.5M/121M [00:00<00:00, 143MB/s]
Downloading images to gs://dataset/datacomp/shards██████████████████████████████████████████████   | 115M/121M [00:00<00:00, 307MB/s]
Starting the downloading of this file 
Sharding file number 1 of 1 called dataset/datacomp/metadata
0it [00:08, ?it/s]
Traceback (most recent call last):
  File "/home/mayank/datacomp/datacomp/download_upstream.py", line 218, in <module>
    img2dataset.download(
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/img2dataset/main.py", line 232, in download
    distributor_fn(
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/img2dataset/distributor.py", line 36, in multiprocessing_distributor
    failed_shards = run(reader)
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/img2dataset/distributor.py", line 31, in run
    for (status, row) in tqdm(process_pool.imap_unordered(downloader, gen)):
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/mayank/miniconda3/envs/datacomp/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
FileNotFoundError: b/dataset/o/datacomp%2Fmetadata                                                                                                          

--output_dir does not do correct thing if --output_dir is a cloud path

The datacomp repo is cloudpath aware but open_clip is not, so when we pass a cloudpath like s3:// ... to the open_clip training code it just creates a folder called s3 locally on the master node.

The correct thing to do here is to detect that it's a cloud path, use a temporary local directory, and enable remote_sync on open_clip.

How to achieve exact same # of samples seen?

By reading the paper, I learnt that under each scale, the experiments keep the "# of samples seen" the same. But how did you achieve that?
Suppose the scale is medium: 128M * 1 epoch = 128M samples seen. When we do basic filtering, the number of training samples becomes 30M, so 128M/30M ≈ 4.27 epochs, which is not an integer; setting the epochs to either 4 or 5 cannot match the exact same # of samples seen. Hopefully I have articulated my question clearly.

reading images from within filtering script

I would like to be able to read images directly during the filtering step in apply_filter.py. I am able to get the url of each image from the metadata. Instead of downloading each image, what is the best way to get the image data from the shards?

Conda environment build issue

Over the last couple days, I suddenly wasn't able to build the conda environment using conda env create -f environment.yml anymore and kept getting the attached error message. After following advice given in yaml/pyyaml#724 (comment) and changing the pyyaml requirement to pyyaml==5.3.1 I was able to build the environment again. Not sure if this is a good permanent fix but I just wanted to raise awareness about this issue.

Error message:

Collecting pyyaml==5.4.1
Using cached PyYAML-5.4.1.tar.gz (175 kB)
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'error'

Pip subprocess error:
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [62 lines of output]
/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/config/setupcfg.py:293: _DeprecatedConfig: Deprecated config in setup.cfg
!!

          ********************************************************************************
          The license_file parameter is deprecated, use license_files instead.

          By 2023-Oct-30, you need to update your project and remove deprecated calls
          or your builds will no longer be supported.

          See https://setuptools.pypa.io/en/latest/userguide/declarative_config.html for details.
          ********************************************************************************

  !!
    parsed = self.parsers.get(option_name, lambda x: x)(value)
  running egg_info
  writing lib3/PyYAML.egg-info/PKG-INFO
  writing dependency_links to lib3/PyYAML.egg-info/dependency_links.txt
  writing top-level names to lib3/PyYAML.egg-info/top_level.txt
  Traceback (most recent call last):
    File "/itet-stor/brunnedu/net_scratch/conda_envs/datacomp/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
      main()
    File "/itet-stor/brunnedu/net_scratch/conda_envs/datacomp/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/itet-stor/brunnedu/net_scratch/conda_envs/datacomp/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
      return hook(config_settings)
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 355, in get_requires_for_build_wheel
      return self._get_build_requires(config_settings, requirements=['wheel'])
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in _get_build_requires
      self.run_setup()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 341, in run_setup
      exec(code, locals())
    File "<string>", line 271, in <module>
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/__init__.py", line 103, in setup
      return distutils.core.setup(**attrs)
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
      return run_commands(dist)
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
      self.run_command(cmd)
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/dist.py", line 989, in run_command
      super().run_command(command)
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 318, in run
      self.find_sources()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 326, in find_sources
      mm.run()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 548, in run
      self.add_defaults()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 586, in add_defaults
      sdist.add_defaults(self)
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/command/sdist.py", line 113, in add_defaults
      super().add_defaults()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 251, in add_defaults
      self._add_defaults_ext()
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 336, in _add_defaults_ext
      self.filelist.extend(build_ext.get_source_files())
    File "<string>", line 201, in get_source_files
    File "/tmp/pip-build-env-7n51fsjq/overlay/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 107, in __getattr__
      raise AttributeError(attr)
  AttributeError: cython_sources
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

failed

CondaEnvException: Pip failed

Downloading DataComp-1B

I am trying to download DataComp-1B, but both my download speed and success rate seem low, even after setting num_retries to 3:

worker - success: 0.887 - failed to download: 0.101 - failed to resize: 0.012 - images per sec: 8 - count: 10000
total - success: 0.883 - failed to download: 0.107 - failed to resize: 0.010 - images per sec: 103 - count: 247671
worker - success: 0.889 - failed to download: 0.101 - failed to resize: 0.010 - images per sec: 9 - count: 10000
total - success: 0.883 - failed to download: 0.107 - failed to resize: 0.010 - images per sec: 107 - count: 257671
worker - success: 0.894 - failed to download: 0.098 - failed to resize: 0.008 - images per sec: 8 - count: 10000
total - success: 0.883 - failed to download: 0.107 - failed to resize: 0.010 - images per sec: 111 - count: 267671
29it [40:28, 7.45s/it]worker - success: 0.890 - failed to download: 0.100 - failed to resize: 0.010 - images per sec: 8 - count: 10000
total - success: 0.884 - failed to download: 0.106 - failed to resize: 0.010 - images per sec: 115 - count: 277671
worker - success: 0.887 - failed to download: 0.103 - failed to resize: 0.010 - images per sec: 8 - count: 10000
total - success: 0.884 - failed to download: 0.106 - failed to resize: 0.010 - images per sec: 119 - count: 287671
worker - success: 0.891 - failed to download: 0.099 - failed to resize: 0.009 - images per sec: 8 - count: 10000
total - success: 0.884 - failed to download: 0.106 - failed to resize: 0.010 - images per sec: 123 - count: 297671
worker - success: 0.882 - failed to download: 0.108 - failed to resize: 0.010 - images per sec: 8 - count: 10000
total - success: 0.884 - failed to download: 0.106 - failed to resize: 0.010 - images per sec: 127 - count: 307671

Connection error partway through downloading metadata

Hello! I'm running the command:

python download_upstream.py --scale medium --data_dir medium --skip_shards

After downloading some files it interrupts with the error:

  File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 94, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 76, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, **map_args), **kwargs))
  File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/_snapshot_download.py", line 211, in _inner_hf_hub_download
    return hf_hub_download(
  File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1291, in hf_hub_download
    raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.

As you can see, there is not much detail in the error message. Could this be caused by some files missing on the server, or is it just a connection problem? If the latter, how can I resume the download? The --overwrite_metadata flag does not seem suitable because it removes all the files that were already downloaded.

How to precompute and save a model-based metric during download?

Hello, thanks for the work on this project! I was wondering if you had any example code/pointers for pre-computing model metrics during download. Specifically, we would like to have access to an aesthetic score (as in https://github.com/christophschuhmann/improved-aesthetic-predictor) when iterating through the dataset. I was looking at the filtering examples in baselines/apply_filter.py but those don't seem to cover computing a metric (on GPU) on the image after download and adding the result to the save file so it can be read w/o GPU compute at inference. Are there any relevant examples/templates? Thanks!
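There does not appear to be an official template for this, but one workable pattern is a single GPU pass over the downloaded shards that scores each image and writes a uid-to-score table, which a later CPU-only filtering step can join against the metadata. The sketch below assumes an open_clip ViT-L/14 backbone and uses an untrained linear layer as a stand-in for the MLP head from the linked aesthetic-predictor repo; everything here is an assumption, not DataComp-provided code:

  # Hedged sketch: precompute a model-based score per uid and save it for
  # GPU-free filtering later. The aesthetic head is a stand-in; replace it with
  # the trained MLP from the improved-aesthetic-predictor repo.
  import pandas as pd
  import torch
  import webdataset as wds
  import open_clip

  device = "cuda"
  model, _, preprocess = open_clip.create_model_and_transforms(
      "ViT-L-14", pretrained="openai", device=device
  )
  aesthetic_head = torch.nn.Linear(768, 1).to(device)  # stand-in for the trained MLP

  shards = "/path/to/data_dir/shards/{00000000..00000010}.tar"  # hypothetical range
  dataset = (
      wds.WebDataset(shards, handler=wds.warn_and_continue)
      .decode("pilrgb")
      .to_tuple("jpg;png;jpeg;webp", "json")
      .map_tuple(preprocess, lambda meta: meta["uid"])
      .batched(256)
  )

  records = []
  with torch.no_grad():
      for images, uids in wds.WebLoader(dataset, batch_size=None, num_workers=4):
          feats = model.encode_image(images.to(device))
          feats = feats / feats.norm(dim=-1, keepdim=True)
          scores = aesthetic_head(feats).squeeze(-1).cpu().numpy()
          records.extend(zip(uids, scores))

  # One row per sample; join on uid against the metadata parquets at filter time.
  pd.DataFrame(records, columns=["uid", "aesthetic_score"]).to_parquet("aesthetic_scores.parquet")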

Getting a 400 error when submitting the jsonl to Firebase, but the Slack notification succeeds

(datacomp114514) cometp@LAPTOP-COMETP:~/datacomp-main$ python evaluate.py --track=filtering  --train_output_dir="0_clip_mask/" --samples="0_clip_masked.npy" --submit --method_name="text masked CLIP 30%" --author="NUM" --email="[email protected]" --hf_username="Cometp" --hf_repo_name="datacomp_0.3_clip_mask"
Found 40 eval result(s) in 0_clip_mask/eval_results.jsonl.
Warning, did not find or could not read checkpoint at datacomp/test1/0_clip_mask/checkpoints/epoch_latest.pt
Defaulting to 0_clip_mask/checkpoints/epoch_latest.pt
Evaluating
Skipping Caltech-101 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.2320
Skipping CIFAR-10 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.4359
Skipping CIFAR-100 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.1781
Skipping CLEVR Counts since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.1359
Skipping CLEVR Distance since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.2010
Skipping Country211 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0146
Skipping Describable Textures since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0787
Skipping EuroSAT since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.1917
Skipping FGVC Aircraft since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0106
Skipping Food-101 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0804
Skipping GTSRB since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0496
Skipping ImageNet 1k since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0483
Skipping ImageNet Sketch since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0194
Skipping ImageNet v2 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0422
Skipping ImageNet-A since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0161
Skipping ImageNet-O since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.1430
Skipping ImageNet-R since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0657
Skipping KITTI Vehicle Distance since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.3924
Skipping MNIST since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.1322
Skipping ObjectNet since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0534
Skipping Oxford Flowers-102 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0447
Skipping Oxford-IIIT Pet since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0711
Skipping Pascal VOC 2007 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.3037
Skipping PatchCamelyon since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.5121
Skipping Rendered SST2 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.5008
Skipping RESISC45 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0803
Skipping Stanford Cars since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0295
Skipping STL-10 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.5185
Skipping SUN397 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.1045
Skipping SVHN since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.1426
Skipping Flickr since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0323
Skipping MSCOCO since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0199
Skipping WinoGAViL since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.2827
Skipping iWildCam since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0116
Skipping Camelyon17 since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.5009
Skipping FMoW since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.0000
Skipping Dollar Street since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.2757
Skipping GeoDE since results are already in 0_clip_mask/eval_results.jsonl
Score: 0.2714
Skipping FairFace since results are already in 0_clip_mask/eval_results.jsonl
Score: No summary metric
Skipping UTKFace since results are already in 0_clip_mask/eval_results.jsonl
Score: No summary metric
Evaluation time: 0 hour(s) 0 minute(s) 0 second(s)

=== Final results ===
ImageNet 1k: 0.0483
Caltech-101: 0.2319868753931739
CIFAR-10: 0.4359
CIFAR-100: 0.1781
CLEVR Counts: 0.13593333333333332
CLEVR Distance: 0.201
Country211: 0.014644549763033175
Describable Textures: 0.07872340425531915
EuroSAT: 0.19166666666666668
FGVC Aircraft: 0.010561497326203208
Food-101: 0.08043564356435644
GTSRB: 0.0496437054631829
ImageNet Sketch: 0.019434455383285188
ImageNet v2: 0.0422
ImageNet-A: 0.016133333333333333
ImageNet-O: 0.143
ImageNet-R: 0.06573333333333334
KITTI Vehicle Distance: 0.3924050632911392
MNIST: 0.1322
ObjectNet: 0.053407989662969745
Oxford Flowers-102: 0.04470542875600805
Oxford-IIIT Pet: 0.07114696955308561
Pascal VOC 2007: 0.3036858974358974
PatchCamelyon: 0.51214599609375
Rendered SST2: 0.500823723228995
RESISC45: 0.08031746031746032
Stanford Cars: 0.029473946026613605
STL-10: 0.5185
SUN397: 0.10447431818599776
SVHN: 0.14255531653349723
Flickr: 0.032300000078976154
MSCOCO: 0.01987708918750286
WinoGAViL: 0.2826642132262741
iWildCam: 0.011556535188310048
Camelyon17: 0.5008582782702754
FMoW: 0.0
Dollar Street: 0.2757009267807007
GeoDE: 0.27139875292778015
FairFace: None
UTKFace: None
=====================
Average: 0.16377880796211722
Done with evaluations. Preparing your submission...
Pushing files to HF Hub (Cometp/datacomp_0.3_clip_mask). This may take a while.
Done uploading files to HF Hub.
====================================================================================================

            Error: something went wrong when submitting your results.
            Please check if your HF credentials are correct, and contact the team if errors persist.

==================================================================================================== 400
Sucessfully submitted your results. Thanks for participating, and good luck!

Can you share the CLIP score calculation script?

Would it be possible to share the CLIP score calculation script?

I tried several different settings but cannot reproduce the CLIP similarity numbers stored in the metadata. Below is the script I used for reproduction. Thanks!

  import torch
  import webdataset as wds
  import pandas as pd
  from training.data import (log_and_continue,
                             get_dataset_size,
                             tarfile_to_samples_nothrow,
                             filter_no_caption_or_no_image)
  from open_clip.factory import create_model_and_transforms, get_tokenizer
  
  
  shard_root = "/path/to/medium/shards/"
  meta_root = "/path/to/medium/metadata/"
  model_name = "ViT-B-32"
  pretrained = "openai"
  precision = "fp32"
  device = "cuda:0"
  batch_size = 64
  workers = 4
  input_shards = shard_root + "/{00000000..00000000}.tar"
  
  model, _, preprocess_val = create_model_and_transforms(
      model_name,
      pretrained,
      precision=precision,
      device=device,
      jit=True,
      output_dict=True
  )
  tokenizer = get_tokenizer(model_name)
  
  num_samples, num_shards = get_dataset_size(input_shards)
  print("# shards:", num_shards)
  
  pipeline = [wds.SimpleShardList(input_shards)]
  pipeline.extend([tarfile_to_samples_nothrow])
  pipeline.extend([
      wds.select(filter_no_caption_or_no_image),
      wds.decode("pilrgb", handler=log_and_continue),
      wds.rename(image="jpg;png;jpeg;webp", text="txt", json="json"),
      wds.map_dict(image=preprocess_val, text=lambda text: tokenizer(text)[0], json=lambda data: {"uid": data["uid"]}),
      wds.to_tuple("image", "text", "json"),
      wds.batched(batch_size, partial=True)
  ])
  dataset = wds.DataPipeline(*pipeline)
  dataloader = wds.WebLoader(
      dataset,
      batch_size=None,
      shuffle=False,
      num_workers=workers,
      persistent_workers=workers > 0,
  )
  for img, txt, info in dataloader:
      with torch.no_grad():
          img = img.to(device)
          txt = txt.to(device)
          img_f = model.encode_image(img)
          txt_f = model.encode_text(txt)
  
          img_f = img_f / img_f.norm(dim=-1, keepdim=True)
          txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
  
          sim = (img_f * txt_f).sum(-1).cpu().numpy()
  
          uid = [_["uid"] for _ in info]
  
          score_dict = {u: s for u, s in zip(uid, sim)}
  
          meta_info = pd.read_parquet(meta_root + "/0020f0cbd157d470aa56bea08e304b90.parquet", engine='pyarrow')
          for k in score_dict:
              print(score_dict[k], meta_info[meta_info["uid"] == k]['clip_b32_similarity_score'].tolist()[0])
          # examples: 
          # 0.21029426 0.2060546875
          # 0.30629587 0.29931640625
          # 0.26773185 0.2646484375
          # 0.23732948 0.2496337890625
          break  # stop after the first batch (the bare `exit` here was a no-op)
