laion-prepro

Get billions of image+url from the laion datasets and preprocess them.

This repository can be run on:

  • for laion400m: one machine with 32GB of RAM, 8TB of disk, 16 i7 cores and a 1Gbps connection.
  • for laion5B: 10 machines similar to the laion400m one.

What is laion?

The laion project aims to use commoncrawl to retrieve billions of aligned image+text pairs. It is composed of a central server that tracks the progress of decentralized workers (which anyone can run) that process small chunks of commoncrawl. So far, 5B such pairs have been retrieved. Read more about it in the laion 400M release post.

What can be done with these datasets?

Vision and language modeling took off in 2021. Here are some pointers about what this kind of image + text dataset unlocks and why it seems really interesting:

  • 6 months ago, OpenAI released 2 blog posts and papers, clip and dall-e. Both models rely on a large amount of (text, image) pairs. They used an unreleased dataset of 400M pairs.
    • CLIP is a model that computes how related a text and an image are. This makes it possible to build large text-to-image search, and to build that kind of crazy text-to-image art (clip-art). They released a small and a medium version of the model but no training code.
    • DALL-E is a model that directly generates images from text. As can be seen from the blog post, it achieves very impressive results that could have a direct impact on the world, for anything that needs drawings and illustrations. OpenAI did not release any model, even through an API.
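
As a rough illustration of what "how related a text and an image are" can mean, here is a minimal sketch that ranks captions by cosine similarity with an image embedding. It assumes the relatedness score is a cosine similarity between embedding vectors; the three-dimensional vectors and the `rank_captions` helper are made up for illustration and do not come from a real CLIP model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_captions(image_vec, captions):
    """Return caption texts sorted from most to least similar to the image."""
    scores = {text: cosine_similarity(image_vec, vec) for text, vec in captions.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (assumption: a real model would produce these vectors).
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a car": [0.0, 0.1, 0.9],
}

best_first = rank_captions(image_embedding, caption_embeddings)
```

Text-to-image search works the same way in the other direction: embed the query text once, then rank many image embeddings against it.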

Since then, several efforts have been organized to replicate DALL-E. People initially organized around the awesome DALL-E replication repository DALLE-pytorch, with some nice results that can be seen in its readme. More recently, as part of a huggingface event, new results have been achieved (see the dalle mini report) and an online demo is now available (dalle-mini demo).

The replication effort is still far from achieving the same performance as the original DALL-E, and it seems possible to go even further. Some people also want to make a better CLIP to produce even better generated art.

A large part of the results that can be achieved with such models is thanks to data. Large amounts of data. Before laion 400M, the largest open datasets of (image, text) pairs were on the order of 10M (see DALLE-datasets), which is enough to train okay models, but not enough to reach the best performance. Having a public dataset with hundreds of millions of pairs will help a lot in building these image+text models.

Visualization of the dataset

Check the colab and the web demo

laion5B

laion5B and laion400m processing is overall the same, but because laion5B is 10x larger, it required making everything distributed.

Read more at laion5B/README.md

laion400m

See laion400m/README.md

laion-prepro's People

Contributors

rom1504, vanpersie32

laion-prepro's Issues

Link to download annotated data

Hi @rom1504!
Thanks for open-sourcing the code! Great work!
The project page mentions that 3456 samples of LAION-400M were annotated for identifying NSFW content. Is it possible to download these annotations with the corresponding image-text pairs?

Thanks!

Define process and load_clip in data loader

Hi! Awesome repo, thanks for building this.

I have 10TB of the Laion dataset downloaded, thanks to your scripts! However, I'm trying to use your data loader, and ran into an issue.

In your WebdatasetReader object, you have a preprocess argument passed. However, this isn't defined in the script.

It seems to come from a ghost function, load_clip(). I don't see this defined, or used anywhere in your repo for that matter. Could you explain? Thanks.

_, preprocess = load_clip()

add content

Then add:

  • how to make clip embeddings out of that
  • how to make knn indices
  • what to train using that (clip, dalle)
  • how to get interesting datasets

How to download the newest version of dataset without duplicate files?

Hi, @rom1504
I know there are three versions of the parquet files as below.

Version   Parquet file size   Hash value   Total size
1.0       1.6G                5b54c5d5     400 million
2.0       3.6G                03f11a48     800 million
3.0       4.9G                f27692e1     1.1 billion

So I wonder whether the parquet files in the different versions correspond one-to-one.
I downloaded the 400 million version of the dataset. What should I do if I'd like to download the newest version of the dataset without downloading the duplicate files?
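
One way to think about this question is as a set difference on URLs. The sketch below is only an illustration under two assumptions that the thread does not confirm: that the URL is a usable key for a sample, and that the newer version actually overlaps with the older one. The `urls_to_download` helper and the example URLs are hypothetical; in practice the lists would come from the URL column of the parquet files.

```python
def urls_to_download(already_downloaded, newer_version):
    """Return URLs of newer_version that are not in already_downloaded.

    Order is preserved and repeated URLs are returned only once.
    """
    seen = set(already_downloaded)
    fresh = []
    for url in newer_version:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


# Hypothetical URL lists standing in for two versions of the metadata.
old_urls = ["http://example.com/a.jpg", "http://example.com/b.jpg"]
new_urls = [
    "http://example.com/b.jpg",
    "http://example.com/c.jpg",
    "http://example.com/c.jpg",
    "http://example.com/d.jpg",
]

missing = urls_to_download(old_urls, new_urls)
```

The set lookup keeps the pass over the newer list linear, which matters at hundreds of millions of rows, although at that scale the set itself may not fit in memory on a single machine.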

How to set '--url_list' parameter in download_images.sh?

Hi @rom1504 !
If I'd like to download images from 'part-[00000-00031]-03f11a48-0c63-4b59-a590-c03169a0d265-c000.snappy.parquet', how should I set the '--url_list' parameter? Should I make a dir named 'laion400m-meta' and put all the *.parquet files in this dir?

Another question: can I set '--url_list' to one of the '*.parquet' files to download part of this dataset? Like:

    img2dataset --url_list part-00000-03f11a48-0c63-4b59-a590-c03169a0d265-c000.snappy.parquet --input_format "parquet" \
        --url_col "URL" --caption_col "TEXT" --output_format webdataset \
        --output_folder your_output_folder --processes_count 16 --thread_count 128 --image_size 256 \
        --save_additional_columns '["NSFW","similarity","LICENSE"]' --enable_wandb True
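
For scripting, the same flags can be assembled into an argument list, which avoids the shell quoting problems of a long one-line command. This is a minimal sketch: `build_img2dataset_command` is a hypothetical helper, it only reuses the flags quoted in the question, and it does not run anything. Whether `--url_list` accepts a single parquet file, a folder, or both should be checked against the img2dataset documentation.

```python
import json
import shlex


def build_img2dataset_command(url_list, output_folder):
    """Return the img2dataset invocation from the question as an argument list."""
    return [
        "img2dataset",
        "--url_list", url_list,
        "--input_format", "parquet",
        "--url_col", "URL",
        "--caption_col", "TEXT",
        "--output_format", "webdataset",
        "--output_folder", output_folder,
        "--processes_count", "16",
        "--thread_count", "128",
        "--image_size", "256",
        # json.dumps produces the JSON list the flag expects as a single argument.
        "--save_additional_columns", json.dumps(["NSFW", "similarity", "LICENSE"]),
        "--enable_wandb", "True",
    ]


single_file = "part-00000-03f11a48-0c63-4b59-a590-c03169a0d265-c000.snappy.parquet"
command = build_img2dataset_command(single_file, "your_output_folder")

# Print a copy-pasteable, correctly quoted shell command.
print(shlex.join(command))
```

Passing `command` to `subprocess.run(command, check=True)` would launch the download, assuming img2dataset is installed.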

Does https://github.com/rom1504/laion-prepro/blob/main/laion5B/safety/join.py work for non-en langs?

Issue

Our team requires removal of all nsfw content (especially nudity)

Fix

I see here - https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/ we are pointed at this script:

https://github.com/rom1504/laion-prepro/blob/main/laion5B/safety/join.py

However, I see references to 2B rather than 5B

Question

Is the script above usable for non-en langs? Or does the script only work for en langs?

make this more user friendly

  • package installation instructions
  • add path options
  • make it easy to switch on/off incremental mode
  • add end to end scheduling script

How many samples are in the dataset?

Hi, @rom1504
I downloaded the 32 parquet files and counted the total number of urls. I find about 26760000 urls in every parquet file, and 32*26760000 is more than 800 million. But you said this dataset contains 400M pairs?
So what is the difference?
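
The arithmetic in the question can be checked in a few lines. The per-file count is the approximate figure reported above, and the variable names are only illustrative.

```python
FILES = 32
URLS_PER_FILE = 26_760_000   # approximate count per parquet file, from the question
ADVERTISED = 400_000_000     # the "400m" in the dataset name

total = FILES * URLS_PER_FILE
ratio = total / ADVERTISED

print(f"{total:,} urls, {ratio:.2f}x the advertised size")
# prints "856,320,000 urls, 2.14x the advertised size"
```

The count is therefore slightly more than twice the advertised 400M, which is consistent with the 32 files having the `03f11a48` hash that another thread lists as the 800 million version.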
