deep1b_gt

Compute the exact 100 nearest neighbors for the deep1M, deep10M, and deep100M datasets. These neighbors serve as the ground truth for search tasks on the deep{1, 10, 100}M datasets.

Note that the deep{1, 10, 100}M datasets are the first {1, 10, 100}M vectors of the deep1b dataset, respectively.
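To illustrate how such a ground truth is typically used, here is a minimal sketch of a recall@k computation. The function name `recall_at_k` and the input layout (one list of neighbor IDs per query) are assumptions made for this example and are not part of this repository.

```python
def recall_at_k(results, groundtruth, k):
    """Average fraction of the true top-k neighbors found in the top-k results.

    results / groundtruth: one list of neighbor IDs per query, best first.
    (Hypothetical helper for illustration only.)
    """
    if len(results) != len(groundtruth):
        raise ValueError("results and groundtruth must cover the same queries")
    if k <= 0 or not groundtruth:
        raise ValueError("k must be positive and groundtruth must be non-empty")
    hits = 0
    for found, truth in zip(results, groundtruth):
        # Count the IDs shared by the two top-k lists of this query.
        hits += len(set(found[:k]) & set(truth[:k]))
    return hits / (k * len(groundtruth))
```

With the files from this repository, `groundtruth` would hold the rows of one of the `*_groundtruth.ivecs` files and `results` the IDs returned by the search method under evaluation.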

Result

You can download the results from https://github.com/matsui528/deep1b_gt/releases/download/v0.1.0/gt.zip. The archive contains:

  • deep1M_groundtruth.ivecs
  • deep10M_groundtruth.ivecs
  • deep100M_groundtruth.ivecs

wget https://github.com/matsui528/deep1b_gt/releases/download/v0.1.0/gt.zip
unzip gt

# The directory structure will be:
# .
# ├── deep100M_groundtruth.ivecs
# ├── deep10M_groundtruth.ivecs
# ├── deep1M_groundtruth.ivecs
# └── gt.zip
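
The downloaded files can then be loaded in Python. The following is a minimal sketch that assumes the files use the standard .ivecs layout, in which each record is a little-endian int32 dimension followed by that many int32 values. `read_ivecs` is a hypothetical helper written for this example, not a function shipped with this repository.

```python
import struct


def read_ivecs(path):
    """Read an .ivecs file into a list of integer lists (one list per record)."""
    with open(path, "rb") as f:
        raw = f.read()
    rows = []
    offset = 0
    while offset < len(raw):
        # Each record starts with its dimension as a 4-byte integer.
        (dim,) = struct.unpack_from("<i", raw, offset)
        offset += 4
        # The dimension is followed by `dim` 4-byte integers.
        rows.append(list(struct.unpack_from(f"<{dim}i", raw, offset)))
        offset += 4 * dim
    return rows
```

For example, `gt = read_ivecs("deep1M_groundtruth.ivecs")` should yield one row per query, and each row is expected to hold the IDs of the 100 nearest neighbors.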

How to run by yourself

git clone https://github.com/matsui528/deep1b_gt.git
cd deep1b_gt

# Download the Deep1b data to ./deep1b. This may take several days. I recommend preparing 2TB of disk space.
python download_deep1b.py --root ./deep1b

# After downloading the data, the directory structure will be:
# .
# ├── base
# │   ├── base_00
# │   ├── base_01
# │   ...
# │   ├── base_35
# │   └── base_36
# ├── base.fvecs                # 388,000,000,000 bytes
# ├── deep1B_groundtruth.ivecs
# ├── deep1B_queries.fvecs
# ├── learn
# │   ├── learn_00
# │   ├── learn_01
# │   ...
# │   ├── learn_12
# │   └── learn_13
# └── learn.fvecs               # 139,090,240,000 bytes

# rm -rf ./deep1b/base ./deep1b/learn    # Optionally, you can delete base and learn, which are no longer needed

# Compute the ground truth. This step requires faiss.
conda install -c pytorch faiss-cpu
python compute_gt.py --out ./

# You'll get deep1M_groundtruth.ivecs, deep10M_groundtruth.ivecs, and deep100M_groundtruth.ivecs
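
To make clear what "exact" nearest neighbors means here, the following is a minimal pure-Python sketch of a brute-force search. It is not the code in compute_gt.py, which relies on faiss. It assumes squared Euclidean distance as the metric, and the name `exact_knn` is hypothetical. It is far too slow for the real datasets and is only meant to show the definition.

```python
import heapq


def exact_knn(base, query, k):
    """Return the indices of the k base vectors closest to `query`.

    Assumption: squared Euclidean distance; ties are broken by smaller index.
    """
    def squared_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Score every base vector: an exhaustive scan, hence "exact".
    scored = ((squared_l2(vec, query), idx) for idx, vec in enumerate(base))
    return [idx for _, idx in heapq.nsmallest(k, scored)]
```

Running this for every query with k = 100 would produce rows comparable to those stored in a ground-truth file, under the stated distance assumption.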

(Bonus) Deep1M

Because the deep1b dataset is huge, you may want to download only a subset of it (the first 1M vectors). The following script will:

  • pick up the first 1M vectors from base_00 to construct deep1M_base.fvecs
  • pick up the first 100K vectors from learn_00 to construct deep1M_learn.fvecs

git clone https://github.com/matsui528/deep1b_gt.git
cd deep1b_gt

# Download base_00, learn_00, and the queries to ./deep1b. This may take a few hours. I recommend preparing 25GB of disk space.
python download_deep1b.py --root ./deep1b --base_n 1 --learn_n 1 --ops query base learn

# Select the top 1M vectors from base_00 and save them to deep1M_base.fvecs
python pickup_vecs.py --src ./deep1b/base/base_00 --dst ./deep1b/deep1M_base.fvecs --topk 1000000

# Select the top 100K vectors from learn_00 and save them to deep1M_learn.fvecs
python pickup_vecs.py --src ./deep1b/learn/learn_00 --dst ./deep1b/deep1M_learn.fvecs --topk 100000

# After running the above commands, the directory structure will be:
# .
# ├── base
# │   └── base_00
# ├── deep1M_base.fvecs                # 388,000,000 bytes
# ├── deep1B_queries.fvecs             
# ├── learn
# │   └── learn_00
# └── deep1M_learn.fvecs               # 38,800,000 bytes

# rm -rf ./deep1b/base ./deep1b/learn    # Optionally, you can delete base and learn, which are no longer needed
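
The selection step can be pictured as copying a prefix of fixed-size records. The following is a minimal sketch, not the code in pickup_vecs.py. It assumes the standard .fvecs layout (a little-endian int32 dimension followed by that many float32 values per record) and that all records share one dimension. `copy_first_vectors` is a hypothetical name.

```python
import struct


def copy_first_vectors(src, dst, topk):
    """Copy the first `topk` records of an .fvecs file; return how many were copied."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        header = fin.read(4)
        if len(header) < 4:
            return 0  # empty source file
        (dim,) = struct.unpack("<i", header)
        record_size = 4 + 4 * dim  # 4-byte header plus dim 4-byte floats
        fin.seek(0)
        copied = 0
        while copied < topk:
            record = fin.read(record_size)
            if len(record) < record_size:
                break  # source has fewer than topk records
            fout.write(record)
            copied += 1
    return copied
```

The file sizes in the listing above are consistent with 388-byte records (a 4-byte header plus 96 float32 values): 1,000,000 records give 388,000,000 bytes and 100,000 records give 38,800,000 bytes.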

Reference

Several codes are from:
