
cdp's Introduction

Implementation of "Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition" (CDP)

Introduction

Original paper: Xiaohang Zhan, Ziwei Liu, Junjie Yan, Dahua Lin, Chen Change Loy, "Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition", ECCV 2018

Project Page: http://mmlab.ie.cuhk.edu.hk/projects/CDP/

You can use this code for:

  1. State-of-the-art face clustering in linear complexity.
  2. High-efficiency generic clustering.
  3. Plugging the pair-to-cluster module into your clustering algorithm.

Dependency

  • Please use python3, as we cannot guarantee its compatibility with python2.

  • The version of PyTorch we use is 0.3.1.

  • Other dependencies:

    pip install nmslib
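
  • Optional sanity check (not part of the repo): a minimal NMSLIB knn query on random vectors, the same kind of cosine-similarity search CDP performs. All shapes and parameters here are illustrative.

    import nmslib
    import numpy as np

    feat = np.random.rand(1000, 256).astype(np.float32)     # dummy features
    index = nmslib.init(method='hnsw', space='cosinesimil')
    index.addDataPointBatch(feat)
    index.createIndex({'post': 2})
    ids, dists = index.knnQuery(feat[0], k=15)              # 15 nearest neighbors of sample 0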

Usage

  1. Clone the repo.

    git clone git@github.com:XiaohangZhan/cdp.git
    cd cdp

Using ready-made data for face clustering

  1. Download the data from Google Drive or Baidu Yun (password: u8vz) to the repo root, and uncompress it.

    tar -xf data.tar.gz
  2. Make sure the structure looks like the following:

    cdp/data/
    cdp/data/labeled/emore_l200k/
    cdp/data/unlabeled/emore_u200k/
    # ... other directories and files ...
  3. Run CDP

    • Single model case:

      python -u main.py --config experiments/emore_u200k_single/config.yaml
    • Multi-model voting case (committee size: 4):

      python -u main.py --config experiments/emore_u200k_cmt4/config.yaml
    • Multi-model mediator case (committee size: 4):

      # edit `experiments/emore_u200k_cmt4/config.yaml` as follows:
      # strategy: mediator
      python -u main.py --config experiments/emore_u200k_cmt4/config.yaml
  4. Collect the results

    Take the multi-model mediator case as an example: the results are stored in experiments/emore_u200k_cmt4/output/k15_mediator_111_th0.9915/sz600_step0.05/meta.txt. The order is the same as that in data/unlabeled/emore_u200k/list.txt. Samples labeled -1 are discarded by CDP; you may assign them new unique labels if you must use them.
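
    A minimal sketch of that relabeling, assuming the output meta.txt holds one integer label per line (as the description above implies):

      import numpy as np

      out = "experiments/emore_u200k_cmt4/output/k15_mediator_111_th0.9915/sz600_step0.05/meta.txt"
      labels = np.loadtxt(out, dtype=int)
      discarded = np.where(labels == -1)[0]
      # give each discarded sample its own brand-new unique label
      labels[discarded] = labels.max() + 1 + np.arange(len(discarded))
      np.savetxt("meta_relabeled.txt", labels, fmt="%d")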

Using your own data

  1. Create your data directory, e.g. mydata

    mkdir data/unlabeled/mydata
  2. Prepare your data list as list.txt and copy it to the directory. If the data does not come with a list file, just make a dummy one (see the sketch below), and make sure the length of the list equals the number of examples.
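
    For example, a dummy list with one placeholder line per sample (200000 stands in for your sample count):

      num_examples = 200000  # must equal the number of feature vectors
      with open("data/unlabeled/mydata/list.txt", "w") as f:
          f.writelines("%d\n" % i for i in range(num_examples))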

  3. (optional) If you want to evaluate the performance on your data, prepare the meta file as meta.txt and copy it to the directory.

  4. Prepare your feature files. Extract face features corresponding to list.txt with your trained face recognition models, and save them as binary files via feature.tofile("xxx.bin") in numpy (see the sketch below). The features should be suited to comparison by cosine similarity. Finally, link/copy them to data/unlabeled/mydata/features/. We recommend naming the feature files after the models, e.g., resnet18.bin. CDP works in the single-model case, but we recommend using multiple models (i.e., preparing multiple feature files extracted from different models) with the mediator for better results.
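
    A minimal sketch of this step; the random vectors stand in for real model features, and the shape and the name resnet18.bin are placeholders:

      import numpy as np

      # stand-in for (N, D) float32 features from one face recognition model;
      # N must equal the number of lines in list.txt
      features = np.random.rand(200000, 256).astype(np.float32)
      features /= np.linalg.norm(features, axis=1, keepdims=True)  # unit norm for cosine similarity
      features.tofile("data/unlabeled/mydata/features/resnet18.bin")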

  5. The structure should look like:

    cdp/data/unlabeled/mydata/
    cdp/data/unlabeled/mydata/list.txt
    cdp/data/unlabeled/mydata/meta.txt (optional)
    cdp/data/unlabeled/mydata/features/
    cdp/data/unlabeled/mydata/features/*.bin

    (You do not need to prepare knn files.)

  6. Prepare the config file. Please refer to the examples in experiments/

    mkdir experiments/myexp
    cp experiments/emore_u200k_cmt4/config.yaml experiments/myexp/
    # edit experiments/myexp/config.yaml to fit your case.
    # you may need to change `base`, `committee`, `data_name`, etc.
  7. If you want to use mediator mode, please also prepare the training set, i.e., the features extracted with the same face recognition models as in step 4, as well as the meta file containing labels. Organize them in data/labeled/mydata/ in the same way as data/labeled/emore_l200k/.

  8. Tips for adjusting parameters (see the config sketch after this list)

    • Adjust threshold so that precision and recall are roughly balanced; this yields a higher fscore.
    • A higher threshold results in higher precision and lower recall.
    • A larger max_sz results in lower precision and higher recall.
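
    Below is a hypothetical config.yaml sketch assembled only from keys that appear elsewhere in this README (strategy, base, committee, data_name, the mediator's train_data_name, and propagation's max_sz/step). The exact nesting may differ, so check it against the shipped experiments/*/config.yaml:

      data_name: mydata
      base: resnet18                   # feature file name, without .bin
      committee: [resnet34, resnet50]  # empty for the single-model case
      strategy: mediator               # or: vote
      mediator:
        train_data_name: mydata        # labeled set under data/labeled/
      propagation:
        max_sz: 600
        step: 0.05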

Using single model API for generic clustering

  • The example is equivalent to using experiments/emore_u200k_single/config.yaml, but it is easier to use if you prefer the single-model version of CDP. With this API, you can perform generic clustering on your own data, with plenty of metrics to choose from.

    # an example
    python -u test_api.py

Using the isolated pair-to-cluster module

  • This function converts pairs into clusters with extremely high efficiency.

    # pairs:  numpy array (N, 2) containing the sample indices of each pair, N: number of pairs
    # scores: numpy array (N,) containing the edge score of each pair
    from source import graph
    import numpy as np

    max_sz = 600  # maximal size of a cluster
    step = 0.05   # step used to adjust the threshold (default: 0.05)
    num = len(np.unique(pairs.flatten()))
    components = graph.graph_propagation(pairs, scores, max_sz, step)
    clusters = [[n.name for n in c] for c in components]
    assert sum(len(c) for c in clusters) == num, \
        "Fatal error: some samples missing, please report to the author: [email protected]"
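
    For a toy run (values invented for illustration), four edges over five samples yield two clusters:

      import numpy as np
      from source import graph

      pairs = np.array([[0, 1], [1, 2], [0, 2], [3, 4]])  # two connected components
      scores = np.array([0.95, 0.90, 0.85, 0.80])
      components = graph.graph_propagation(pairs, scores, 600, 0.05)
      print([[n.name for n in c] for c in components])    # e.g. [[0, 1, 2], [3, 4]]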

Run Baselines

  • We also implement several baseline clustering methods, including KMeans, MiniBatch-KMeans, Spectral Clustering, Hierarchical Agglomerative Clustering (HAC), FastHAC, DBSCAN, HDBSCAN, KNN DBSCAN, and Approximate Rank-Order. See the sketch after this section for the rough shape of one baseline.

    sh run_baselines.sh # results stored in `baseline_output/`
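
    For orientation, the MiniBatch-KMeans baseline boils down to roughly the following (scikit-learn; the feature path, dimension 256, and batch size are placeholder assumptions; run_baselines.sh is authoritative):

      import numpy as np
      from sklearn.cluster import MiniBatchKMeans

      feat = np.fromfile("data/unlabeled/emore_u200k/features/xxx.bin",
                         dtype=np.float32).reshape(-1, 256)
      labels = MiniBatchKMeans(n_clusters=2577, batch_size=1024).fit_predict(feat)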

Evaluation Results

  1. Data

    • emore_u200k (images: 200K, identities: 2,577)
    • emore_u600k (images: 600K, identities: 8,436)
    • emore_u1.4m (images: 1.4M, identities: 21,433)

    (These datasets are not the ones used in the paper, which cannot be released, but the relative results are similar.)

  2. Baselines

    • emore_u200k
    method | #clusters | prec, recall, fscore | total time
    * kmeans (ncluster=2577) | 2577 | 94.24, 74.89, 83.45 | 618.1s
    * MiniBatchKMeans (ncluster=2577) | 2577 | 89.98, 87.86, 88.91 | 122.8s
    * Spectral (ncluster=2577) | 2577 | 97.42, 97.05, 97.24 | 12.1h
    * HAC (ncluster=2577, knn=30) | 2577 | 97.74, 88.02, 92.62 | 5.65h
    FastHAC (distance=0.7, method=single) | 46767 | 99.79, 53.18, 69.38 | 1.66h
    DBSCAN (eps=0.75, min_samples=10) | 52813 | 99.52, 65.52, 79.02 | 6.87h
    HDBSCAN (min_samples=10) | 31354 | 99.35, 75.99, 86.11 | 4.87h
    KNN DBSCAN (knn=80, min_samples=10) | 39266 | 97.54, 74.42, 84.43 | 60.5s
    ApproxRankOrder (knn=20, th=10) | 85150 | 52.96, 16.93, 25.66 | 86.4s
    • emore_u600k
    method | #clusters | prec, recall, fscore | total time
    * kmeans (ncluster=8436) | 8436 | fail (out of memory) | -
    * MiniBatchKMeans (ncluster=8436) | 8436 | 81.64, 86.58, 84.04 | 2265.6s
    * Spectral (ncluster=8436) | 8436 | fail (out of memory) | -
    * HAC (ncluster=8436, knn=30) | 8436 | 95.39, 86.28, 90.60 | 60.9h
    FastHAC (distance=0.7, method=single) | 94949 | 98.75, 68.49, 80.88 | 16.3h
    DBSCAN (eps=0.75, min_samples=10) | 174886 | 99.02, 61.95, 76.22 | 79.6h
    HDBSCAN (min_samples=10) | 124279 | 99.01, 69.31, 81.54 | 47.9h
    KNN DBSCAN (knn=80, min_samples=10) | 133061 | 96.60, 70.97, 81.82 | 644.5s
    ApproxRankOrder (knn=30, th=10) | 304022 | 65.56, 8.139, 14.48 | 626.9s

    Note: methods marked with * are reported with their theoretical upper-bound results, since they require the number of clusters as input; we take that value from the ground truth. For each method, we tune the parameters to achieve its best performance.

  3. CDP (in linear time!)

    • emore_u200k
    strategy | #model | setting | prec, recall, fscore | knn time | cluster time | total time
    vote | 1 | k15_accept0_th0.66 | 89.35, 88.98, 89.16 | 14.8s | 7.7s | 22.5s
    vote | 5 | k15_accept4_th0.605 | 93.36, 92.91, 93.13 | 78.7s | 6.0s | 84.7s
    mediator | 5 | k15_110_th0.9938 | 94.06, 92.45, 93.25 | 78.7s | 77.7s | 156.4s
    mediator | 5 | k15_111_th0.9925 | 96.66, 94.93, 95.79 | 78.7s | 100.2s | 178.9s
    • emore_u600k
    strategy | #model | setting | prec, recall, fscore | knn time | cluster time | total time
    vote | 1 | k15_accept0_th0.665 | 88.19, 85.33, 86.74 | 60.8s | 24s | 84.8s
    vote | 5 | k15_accept4_th0.605 | 90.21, 89.9, 90.05 | 309.4s | 18.3s | 327.7s
    mediator | 5 | k15_110_th0.985 | 90.43, 89.13, 89.78 | 309.4s | 184.2s | 493.6s
    mediator | 5 | k15_111_th0.982 | 96.55, 91.98, 94.21 | 309.4s | 246.3s | 555.7s
    • emore_u1.4m
    strategy | #model | setting | prec, recall, fscore | knn time | cluster time | total time
    vote | 1 | k15_accept0_th0.68 | 89.49, 81.25, 85.17 | 187.5s | 47.7s | 235.2s
    vote | 5 | k15_accept4_th0.62 | 90.63, 87.32, 88.95 | 967.0s | 44.3s | 1011.3s
    mediator | 5 | k15_110_th0.99 | 93.67, 84.43, 88.81 | 967.0s | 406.9s | 1373.9s
    mediator | 5 | k15_111_th0.982 | 95.29, 90.97, 93.08 | 967.0s | 584.7s | 1551.7s

    Note:

    • For mediator, 110 means using relationship and affinity; 111 means using relationship, affinity and structure.

    • The results may not be exactly reproducible, because there is randomness in the knn search performed by NMSLIB.

    • Experiments are performed on a server with 48 CPU cores, 8 TITAN Xp GPUs, and 252 GB of memory.

Face recognition framework

You may use this framework to train/evaluate face recognition models and extract features.

url: https://github.com/XiaohangZhan/face_recognition_framework

Bibtex

@inproceedings{zhan2018consensus,
  title={Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition},
  author={Zhan, Xiaohang and Liu, Ziwei and Yan, Junjie and Lin, Dahua and Loy, Chen Change},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  pages={568--583},
  year={2018}
}


cdp's Issues

some questions

Hello, I have some questions about this program. Where is the folder "somewhere"? There is no "data_name/features/model_name.bin" under the folder "data", and the code cannot run.

the question about nmslib

When I run pip install nmslib on Windows, I get this error:

    Building wheels for collected packages: nmslib
      Running setup.py bdist_wheel for nmslib ... error
      Complete output from command C:\Users\Administrator\Anaconda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\pip-install-va5h3yd3\\nmslib\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d C:\Users\ADMINI~1\AppData\Local\Temp\pip-wheel-vu3x5ox6 --python-tag cp36:
      running bdist_wheel
      running build
      running build_ext
      building 'nmslib' extension
      error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

Do I have to install Microsoft Visual C++?

Questioning the baseline metrics

emore_u200k actually contains 2,577 identities, so for

method | #clusters | prec, recall, fscore | total time

  • kmeans (ncluster=2577) | 2577 | 94.24, 74.89, 83.45 | 618.1s

    I think this metric is credible.

But for other entries, such as

method | #clusters | prec, recall, fscore | total time
HDBSCAN (min_samples=10) | 31354 | 99.35, 75.99, 86.11 | 4.87h

although the numbers look good, HDBSCAN has already split emore_u200k into 31,354 clusters. Going from the correct 2,577 identities to 31,354 clusters means single identities are being split across multiple clusters, which does not seem reasonable.

Issue on training with ready-made dataset

Hello,
Thanks for contributing this paper and code.
I ran inference successfully with the ready-made emore_u200k data and the pre-trained CDP model.
I then tried to retrain the CDP model with the ready-made emore_l200k in mediator mode, but I met the following issue. (error screenshot omitted)
I am using a Tesla V100 (16 GB GPU memory) and 128 GB of DDR memory.
The PyTorch version is 1.4.0 and the Python version is 3.7.
I used the following config.yml. (config screenshot omitted)
I thought this might be due to a lack of GPU memory, so I decreased the batch_size from 1024 to 8, but the result was the same.
Please let me know the reason asap.
Thanks

About meta file

Hi,

I have a question regarding the meta file for evaluation. What should I put into it?

Thanks.

about file list.txt

@XiaohangZhan Hello author. When I prepare my own data, the file list.txt is missing. Is this list file necessary, and what is its role? Does it index the feature files? I already have the feature .bin file; do I still need list.txt, or is it only used to help generate the KNN graph? If it is not needed, where should I modify the code? Looking forward to your early reply.

Joint Training

Thank you for your source code and paper!
As for the "Joint Training" ,in my opinion , it combines the loss of two dataset ,which will make the performance of combined model lower than big dataset and higher than the smaller dataset . Maybe the performance of model trained from unlabeled dataset is better than the combined one , so do you have the performance result of only unlabeled dataset ?

Data load

Thanks for your work!
Could you provide a Baidu link for the prepared data? There is a problem when downloading it via the link in the readme file!

Comparison with hierarchical clustering

Hi Xiaohang,
I am curious how you apply hierarchical clustering to such a large dataset, since its best complexity is O(n^2). Do you use any technique to accelerate hierarchical clustering?

By the way, I'm also interested in how CDP performs against hierarchical clustering based on a single face embedding; I'd appreciate it if you could offer some comparison.

Thanks a lot!

how to calculate the evaluation indicator

(singular removed) prec / recall / fscore: 97.2, 97.33, 97.26
(singular kept) prec / recall / fscore: 97.2, 94.93, 96.05

This is the output when I run the code on my own data. What are the exact definitions of precision and recall here? I cannot fully understand the code in eval_cluster.py.

Where is train_mediator located

Thanks for this great work. A quick question: train_mediator is called inside cdp but its implementation is not in the repo. Where can we find it?

About reproducing `k15_accept0_th0.66` by simple_api.py

Thank you for your great paper and repo!

I think that to reproduce the results of the k15_accept0_th0.66 model, i.e.,

strategy | #model | setting | prec, recall, fscore
vote | 1 | k15_accept0_th0.66 | 89.35, 88.98, 89.16

we may not need to normalize the distance as shown here.

ERROR when using the mediator

Hi, when I run the mediator, I get the following errors:
1.
Creating pair set for: labeled
Loading features
Loading base KNN
Loading committee KNN
Loading pairs
got 4813844 pairs
relationship features exist
getting affinity features
processing: 0/5
Traceback (most recent call last):
  File "main.py", line 50, in <module>
    main()
  File "main.py", line 45, in main
    cdp(args)
  File "/media/syl/BIG_ONE/cdp-master/source/cdp.py", line 77, in cdp
    pairs, scores = mediator(args)
  File "/media/syl/BIG_ONE/cdp-master/source/cdp.py", line 172, in mediator
    create(args.mediator['train_data_name'], args, phase="train")
  File "/media/syl/BIG_ONE/cdp-master/source/create_pair_set.py", line 137, in create
    affinity_feat = get_affinity_feat(features, pairs)
  File "/media/syl/BIG_ONE/cdp-master/source/create_pair_set.py", line 33, in get_affinity_feat
    cosine_simi.append(cosine_similarity(feat[pairs[:,0],:], feat[pairs[:,1],:]))
  File "/media/syl/BIG_ONE/cdp-master/source/create_pair_set.py", line 24, in cosine_similarity
    feat1 /= np.linalg.norm(feat1, axis=1).reshape(-1, 1)
  File "/usr/local/lib/python3.5/dist-packages/numpy/linalg/linalg.py", line 2480, in norm
    s = (x.conj() * x).real
MemoryError

  2. When I remove affinity from the yaml, there is a new error:

    false_pos = ((pred == 1) & (label == 0)).sum() / float((pred == 1).sum())
    Testing set. acc: 0.9997, abs_recall: 0, rel_recall: 0, false_pos: nan, false_neg: 0.0002552
    pair num: 0
    Propagation ...
    Traceback (most recent call last):
      File "main.py", line 50, in <module>
        main()
      File "main.py", line 45, in main
        cdp(args)
      File "/media/syl/BIG_ONE/cdp-master/source/cdp.py", line 84, in cdp
        components = graph.graph_propagation(pairs, scores, args.propagation['max_sz'], args.propagation['step'])
      File "/media/syl/BIG_ONE/cdp-master/source/graph.py", line 81, in graph_propagation
        th = score.min()
      File "/usr/local/lib/python2.7/dist-packages/numpy/core/_methods.py", line 29, in _amin
        return umr_minimum(a, axis, None, out, keepdims)
    ValueError: zero-size array to reduction operation minimum which has no identity

Do you have any ideas? Waiting for your reply, thank you very much!

some question about the code

Thanks for sharing! While reading the code I have some questions:
1. Are list.txt and meta.txt under the unlabeled data the labels of the unlabeled data? Where do they come from? If the data is unlabeled, how can it have label names?
2. In cdp.py, in
def sample(): vote_num += (tile_knn == tile_cmt).sum(axis=2) selidx = np.where((simi > th) & (vote_num >= accept) & (knn != -1) & (knn != anchor))
what do the vote_num and accept parameters refer to? Which samples does the condition (knn != anchor) rule out?
Hoping for your answer, thanks!

Question with respect to training the mediator

Hi, @XiaohangZhan , thanks for your implementation.

You said in the paper that the mediator is trained on D_{l}, but you explained in the previous section that the inputs to the mediator are the relationships, the affinity, and the local structure of the graphs derived from the unlabeled dataset. How do you obtain these on D_{l} for training the mediator? Did you repeat the process again on the labeled data?

Thanks,

cleaned data

Just wondering whether the CLEANED data used in the paper ("we clean up official training set and crawl images of more identities, producing about 7M images with 385K identities") will be released for public research in the future? Many thanks!

Vote is much better than mediator

In my experiments, I found that vote mode works much better than the mediator. Is there an explanation for this, or any tips for improving the mediator?

Memory issue when computing affinity features

Hi, thanks for sharing your code

I am trying to run CDP on my own data (its size is similar to the data in your paper). I get hundreds of millions of pairs after loading the committee KNN, resulting in a MemoryError when computing the affinity features. I can modify the source code of this part to lower the memory requirements, but that will increase the computation time significantly.

Could you please tell me what computation resources you used when running CDP?

Many Thanks

Cheers
