
cdp's Introduction

Implementation of "Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition" (CDP)

Introduction

Original paper: Xiaohang Zhan, Ziwei Liu, Junjie Yan, Dahua Lin, Chen Change Loy, "Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition", ECCV 2018

Project Page: http://mmlab.ie.cuhk.edu.hk/projects/CDP/

You can use this code for:

  1. State-of-the-art face clustering in linear complexity.
  2. High-efficiency generic clustering.
  3. Plugging the pair-to-cluster module into your clustering algorithm.

Dependency

  • Please use python3, as we cannot guarantee its compatibility with python2.

  • The version of PyTorch we use is 0.3.1.

  • Other dependencies:

    pip install nmslib
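
  • Optional sanity check (not part of the repo): a minimal NMSLIB knn query on random vectors, the same kind of cosine-similarity search CDP performs. All shapes and parameters here are illustrative.

    import nmslib
    import numpy as np

    feat = np.random.rand(1000, 256).astype(np.float32)     # dummy features
    index = nmslib.init(method='hnsw', space='cosinesimil')
    index.addDataPointBatch(feat)
    index.createIndex({'post': 2})
    ids, dists = index.knnQuery(feat[0], k=15)              # 15 nearest neighbors of sample 0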

Usage

  1. Clone the repo.

    git clone git@github.com:XiaohangZhan/cdp.git
    cd cdp

Using ready-made data for face clustering

  1. Download the data from Google Drive or Baidu Yun (password: u8vz) to the repo root, and uncompress it.

    tar -xf data.tar.gz
  2. Make sure the structure looks like the following:

    cdp/data/
    cdp/data/labeled/emore_l200k/
    cdp/data/unlabeled/emore_u200k/
    # ... other directories and files ...
  3. Run CDP

    • Single model case:

      python -u main.py --config experiments/emore_u200k_single/config.yaml
    • Multi-model voting case (committee size: 4):

      python -u main.py --config experiments/emore_u200k_cmt4/config.yaml
    • Multi-model mediator case (committee size: 4):

      # edit `experiments/emore_u200k_cmt4/config.yaml` as follows:
      # strategy: mediator
      python -u main.py --config experiments/emore_u200k_cmt4/config.yaml
  4. Collect the results

    Take the multi-model mediator case as an example: the results are stored in experiments/emore_u200k_cmt4/output/k15_mediator_111_th0.9915/sz600_step0.05/meta.txt. The order is the same as that in data/unlabeled/emore_u200k/list.txt. Samples labeled -1 are discarded by CDP; you may assign them new unique labels if you must use them.
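
    A minimal sketch of that relabeling, assuming the output meta.txt holds one integer label per line (as the description above implies):

      import numpy as np

      out = "experiments/emore_u200k_cmt4/output/k15_mediator_111_th0.9915/sz600_step0.05/meta.txt"
      labels = np.loadtxt(out, dtype=int)
      discarded = np.where(labels == -1)[0]
      # give each discarded sample its own brand-new unique label
      labels[discarded] = labels.max() + 1 + np.arange(len(discarded))
      np.savetxt("meta_relabeled.txt", labels, fmt="%d")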

Using your own data

  1. Create your data directory, e.g. mydata

    mkdir data/unlabeled/mydata
  2. Prepare your data list as list.txt and copy it to the directory. If the data does not come with a list file, just make a dummy one (see the sketch below), and make sure the length of the list equals the number of examples.
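
    For example, a dummy list with one placeholder line per sample (200000 stands in for your sample count):

      num_examples = 200000  # must equal the number of feature vectors
      with open("data/unlabeled/mydata/list.txt", "w") as f:
          f.writelines("%d\n" % i for i in range(num_examples))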

  3. (optional) If you want to evaluate the performance on your data, prepare the meta file as meta.txt and copy it to the directory.

  4. Prepare your feature files. Extract face features corresponding to list.txt with your trained face recognition models, and save them as binary files via feature.tofile("xxx.bin") in numpy (see the sketch below). The features should be suited to comparison by cosine similarity. Finally, link/copy them to data/unlabeled/mydata/features/. We recommend naming the feature files after the models, e.g., resnet18.bin. CDP works in the single-model case, but we recommend using multiple models (i.e., preparing multiple feature files extracted from different models) with the mediator for better results.
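
    A minimal sketch of this step; the random vectors stand in for real model features, and the shape and the name resnet18.bin are placeholders:

      import numpy as np

      # stand-in for (N, D) float32 features from one face recognition model;
      # N must equal the number of lines in list.txt
      features = np.random.rand(200000, 256).astype(np.float32)
      features /= np.linalg.norm(features, axis=1, keepdims=True)  # unit norm for cosine similarity
      features.tofile("data/unlabeled/mydata/features/resnet18.bin")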

  5. The structure should look like:

    cdp/data/unlabeled/mydata/
    cdp/data/unlabeled/mydata/list.txt
    cdp/data/unlabeled/mydata/meta.txt (optional)
    cdp/data/unlabeled/mydata/features/
    cdp/data/unlabeled/mydata/features/*.bin

    (You do not need to prepare knn files.)

  6. Prepare the config file. Please refer to the examples in experiments/

    mkdir experiments/myexp
    cp experiments/emore_u200k_cmt4/config.yaml experiments/myexp/
    # edit experiments/myexp/config.yaml to fit your case.
    # you may need to change `base`, `committee`, `data_name`, etc.
  7. If you want to use mediator mode, please also prepare the training set, i.e., the features extracted with the same face recognition models as in step 4, as well as the meta file containing labels. Organize them in data/labeled/mydata/ in the same way as data/labeled/emore_l200k/.

  8. Tips for adjusting parameters (see the config sketch after this list)

    • Adjust threshold so that precision and recall are roughly balanced; this yields a higher fscore.
    • A higher threshold results in higher precision and lower recall.
    • A larger max_sz results in lower precision and higher recall.
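
    Below is a hypothetical config.yaml sketch assembled only from keys that appear elsewhere in this README (strategy, base, committee, data_name, the mediator's train_data_name, and propagation's max_sz/step). The exact nesting may differ, so check it against the shipped experiments/*/config.yaml:

      data_name: mydata
      base: resnet18                   # feature file name, without .bin
      committee: [resnet34, resnet50]  # empty for the single-model case
      strategy: mediator               # or: vote
      mediator:
        train_data_name: mydata        # labeled set under data/labeled/
      propagation:
        max_sz: 600
        step: 0.05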

Using single model API for generic clustering

  • The example is equivalent to using experiments/emore_u200k_single/config.yaml, but it is easier to use if you prefer the single-model version of CDP. With this API, you can perform generic clustering on your own data, with plenty of metrics to choose from.

    # an example
    python -u test_api.py

Using the isolated pair-to-cluster module

  • This function converts pairs into clusters with extremely high efficiency.

    # pairs:  numpy array (N, 2) containing the sample indices of each pair, N: number of pairs
    # scores: numpy array (N,) containing the edge score of each pair
    from source import graph
    import numpy as np

    max_sz = 600  # maximal size of a cluster
    step = 0.05   # step used to adjust the threshold (default: 0.05)
    num = len(np.unique(pairs.flatten()))
    components = graph.graph_propagation(pairs, scores, max_sz, step)
    clusters = [[n.name for n in c] for c in components]
    assert sum(len(c) for c in clusters) == num, \
        "Fatal error: some samples missing, please report to the author: [email protected]"
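
    For a toy run (values invented for illustration), four edges over five samples yield two clusters:

      import numpy as np
      from source import graph

      pairs = np.array([[0, 1], [1, 2], [0, 2], [3, 4]])  # two connected components
      scores = np.array([0.95, 0.90, 0.85, 0.80])
      components = graph.graph_propagation(pairs, scores, 600, 0.05)
      print([[n.name for n in c] for c in components])    # e.g. [[0, 1, 2], [3, 4]]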

Run Baselines

  • We also implement several baseline clustering methods, including KMeans, MiniBatch-KMeans, Spectral Clustering, Hierarchical Agglomerative Clustering (HAC), FastHAC, DBSCAN, HDBSCAN, KNN DBSCAN, and Approximate Rank-Order. See the sketch after this section for the rough shape of one baseline.

    sh run_baselines.sh # results stored in `baseline_output/`
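
    For orientation, the MiniBatch-KMeans baseline boils down to roughly the following (scikit-learn; the feature path, dimension 256, and batch size are placeholder assumptions; run_baselines.sh is authoritative):

      import numpy as np
      from sklearn.cluster import MiniBatchKMeans

      feat = np.fromfile("data/unlabeled/emore_u200k/features/xxx.bin",
                         dtype=np.float32).reshape(-1, 256)
      labels = MiniBatchKMeans(n_clusters=2577, batch_size=1024).fit_predict(feat)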

Evaluation Results

  1. Data

    • emore_u200k (images: 200K, identities: 2,577)
    • emore_u600k (images: 600K, identities: 8,436)
    • emore_u1.4m (images: 1.4M, identities: 21,433)

    (These datasets are not the ones used in the paper, which cannot be released, but the relative results are similar.)

  2. Baselines

    • emore_u200k
    method | #clusters | prec, recall, fscore | total time
    * kmeans (ncluster=2577) | 2577 | 94.24, 74.89, 83.45 | 618.1s
    * MiniBatchKMeans (ncluster=2577) | 2577 | 89.98, 87.86, 88.91 | 122.8s
    * Spectral (ncluster=2577) | 2577 | 97.42, 97.05, 97.24 | 12.1h
    * HAC (ncluster=2577, knn=30) | 2577 | 97.74, 88.02, 92.62 | 5.65h
    FastHAC (distance=0.7, method=single) | 46767 | 99.79, 53.18, 69.38 | 1.66h
    DBSCAN (eps=0.75, min_samples=10) | 52813 | 99.52, 65.52, 79.02 | 6.87h
    HDBSCAN (min_samples=10) | 31354 | 99.35, 75.99, 86.11 | 4.87h
    KNN DBSCAN (knn=80, min_samples=10) | 39266 | 97.54, 74.42, 84.43 | 60.5s
    ApproxRankOrder (knn=20, th=10) | 85150 | 52.96, 16.93, 25.66 | 86.4s
    • emore_u600k
    method | #clusters | prec, recall, fscore | total time
    * kmeans (ncluster=8436) | 8436 | fail (out of memory) | -
    * MiniBatchKMeans (ncluster=8436) | 8436 | 81.64, 86.58, 84.04 | 2265.6s
    * Spectral (ncluster=8436) | 8436 | fail (out of memory) | -
    * HAC (ncluster=8436, knn=30) | 8436 | 95.39, 86.28, 90.60 | 60.9h
    FastHAC (distance=0.7, method=single) | 94949 | 98.75, 68.49, 80.88 | 16.3h
    DBSCAN (eps=0.75, min_samples=10) | 174886 | 99.02, 61.95, 76.22 | 79.6h
    HDBSCAN (min_samples=10) | 124279 | 99.01, 69.31, 81.54 | 47.9h
    KNN DBSCAN (knn=80, min_samples=10) | 133061 | 96.60, 70.97, 81.82 | 644.5s
    ApproxRankOrder (knn=30, th=10) | 304022 | 65.56, 8.139, 14.48 | 626.9s

    Note: methods marked with * are reported with their theoretical upper-bound results, since they require the number of clusters as input; we take that value from the ground truth. For each method, we tune the parameters to achieve its best performance.

  3. CDP (in linear time!)

    • emore_u200k
    strategy | #model | setting | prec, recall, fscore | knn time | cluster time | total time
    vote | 1 | k15_accept0_th0.66 | 89.35, 88.98, 89.16 | 14.8s | 7.7s | 22.5s
    vote | 5 | k15_accept4_th0.605 | 93.36, 92.91, 93.13 | 78.7s | 6.0s | 84.7s
    mediator | 5 | k15_110_th0.9938 | 94.06, 92.45, 93.25 | 78.7s | 77.7s | 156.4s
    mediator | 5 | k15_111_th0.9925 | 96.66, 94.93, 95.79 | 78.7s | 100.2s | 178.9s
    • emore_u600k
    strategy | #model | setting | prec, recall, fscore | knn time | cluster time | total time
    vote | 1 | k15_accept0_th0.665 | 88.19, 85.33, 86.74 | 60.8s | 24s | 84.8s
    vote | 5 | k15_accept4_th0.605 | 90.21, 89.9, 90.05 | 309.4s | 18.3s | 327.7s
    mediator | 5 | k15_110_th0.985 | 90.43, 89.13, 89.78 | 309.4s | 184.2s | 493.6s
    mediator | 5 | k15_111_th0.982 | 96.55, 91.98, 94.21 | 309.4s | 246.3s | 555.7s
    • emore_u1.4m
    strategy | #model | setting | prec, recall, fscore | knn time | cluster time | total time
    vote | 1 | k15_accept0_th0.68 | 89.49, 81.25, 85.17 | 187.5s | 47.7s | 235.2s
    vote | 5 | k15_accept4_th0.62 | 90.63, 87.32, 88.95 | 967.0s | 44.3s | 1011.3s
    mediator | 5 | k15_110_th0.99 | 93.67, 84.43, 88.81 | 967.0s | 406.9s | 1373.9s
    mediator | 5 | k15_111_th0.982 | 95.29, 90.97, 93.08 | 967.0s | 584.7s | 1551.7s

    Note:

    • For mediator, 110 means using relationship and affinity; 111 means using relationship, affinity and structure.

    • The results may not be exactly reproducible, because there is randomness in the knn search performed by NMSLIB.

    • Experiments are performed on a server with 48 CPU cores, 8 TITAN Xp GPUs, and 252 GB of memory.

Face recognition framework

You may use this framework to train/evaluate face recognition models and extract features.

url: https://github.com/XiaohangZhan/face_recognition_framework

Bibtex

@inproceedings{zhan2018consensus,
  title={Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition},
  author={Zhan, Xiaohang and Liu, Ziwei and Yan, Junjie and Lin, Dahua and Loy, Chen Change},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  pages={568--583},
  year={2018}
}


cdp's Issues

some questions

Hello, I have some questions about this program. Where is the folder "somewhere"? There is no "data_name/features/model_name.bin" under the folder "data", and the code cannot run.

the question about nmslib

When I run pip install nmslib on Windows, I get this error:

    Building wheels for collected packages: nmslib
      Running setup.py bdist_wheel for nmslib ... error
      Complete output from command C:\Users\Administrator\Anaconda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\pip-install-va5h3yd3\\nmslib\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d C:\Users\ADMINI~1\AppData\Local\Temp\pip-wheel-vu3x5ox6 --python-tag cp36:
      running bdist_wheel
      running build
      running build_ext
      building 'nmslib' extension
      error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

Do I have to install Microsoft Visual C++?

Questioning the baseline metrics

emore_u200k actually contains 2,577 identities, so for

method | #clusters | prec, recall, fscore | total time

  • kmeans (ncluster=2577) | 2577 | 94.24, 74.89, 83.45 | 618.1s

    I think this metric is credible.

But for other entries, such as

method | #clusters | prec, recall, fscore | total time
HDBSCAN (min_samples=10) | 31354 | 99.35, 75.99, 86.11 | 4.87h

although the numbers look good, HDBSCAN has already split emore_u200k into 31,354 clusters. Going from the correct 2,577 identities to 31,354 clusters means single identities are being split across multiple clusters, which does not seem reasonable.

Issue on training with ready-made dataset

Hello,
Thanks for contributing this paper and code.
I ran inference successfully with the ready-made emore_u200k data and the pre-trained CDP model.
I then tried to retrain the CDP model with the ready-made emore_l200k in mediator mode, but I met the following issue. (error screenshot omitted)
I am using a Tesla V100 (16 GB GPU memory) and 128 GB of DDR memory.
The PyTorch version is 1.4.0 and the Python version is 3.7.
I used the following config.yml. (config screenshot omitted)
I thought this might be due to a lack of GPU memory, so I decreased the batch_size from 1024 to 8, but the result was the same.
Please let me know the reason asap.
Thanks

About meta file

Hi,

I have a question regarding the meta file for evaluation. What should I put into it?

Thanks.

about file list.txt

@XiaohangZhan Hello author. When I prepare my own data, the file list.txt is missing. Is this list file necessary, and what is its role? Does it index the feature files? I already have the feature .bin file; do I still need list.txt, or is it only used to help generate the KNN graph? If it is not needed, where should I modify the code? Looking forward to your early reply.

Joint Training

Thank you for your source code and paper!
As for the "Joint Training" ,in my opinion , it combines the loss of two dataset ,which will make the performance of combined model lower than big dataset and higher than the smaller dataset . Maybe the performance of model trained from unlabeled dataset is better than the combined one , so do you have the performance result of only unlabeled dataset ?

Data load

Thanks for your work!
Could you provide a Baidu link for the prepared data? There is a problem when downloading it via the link in the readme file!

Comparison with hierarchical clustering

Hi Xiaohang,
I am curious how you apply hierarchical clustering to such a large dataset, since its best complexity is O(n^2). Do you use any technique to accelerate hierarchical clustering?

By the way, I'm also interested in how CDP performs against hierarchical clustering based on a single face embedding; I'd appreciate it if you could offer some comparison.

Thanks a lot!

how to calculate the evaluation indicator

(singular removed) prec / recall / fscore: 97.2, 97.33, 97.26
(singular kept) prec / recall / fscore: 97.2, 94.93, 96.05

This is the output when I run the code on my own data. What are the exact definitions of precision and recall here? I cannot fully understand the code in eval_cluster.py.

Where is train_mediator located

Thanks for this great work. A quick question: train_mediator is called inside cdp but its implementation is not in the repo. Where can we find it?

About reproducing `k15_accept0_th0.66` by simple_api.py

Thank you for your great paper and repo!

I think that to reproduce the results of the k15_accept0_th0.66 model, i.e.,

strategy | #model | setting | prec, recall, fscore
vote | 1 | k15_accept0_th0.66 | 89.35, 88.98, 89.16

we may not need to normalize the distance as shown here.

ERROR when using the mediator

Hi, when I run the mediator, I get the following errors:
1.
Creating pair set for: labeled
Loading features
Loading base KNN
Loading committee KNN
Loading pairs
got 4813844 pairs
relationship features exist
getting affinity features
processing: 0/5
Traceback (most recent call last):
  File "main.py", line 50, in <module>
    main()
  File "main.py", line 45, in main
    cdp(args)
  File "/media/syl/BIG_ONE/cdp-master/source/cdp.py", line 77, in cdp
    pairs, scores = mediator(args)
  File "/media/syl/BIG_ONE/cdp-master/source/cdp.py", line 172, in mediator
    create(args.mediator['train_data_name'], args, phase="train")
  File "/media/syl/BIG_ONE/cdp-master/source/create_pair_set.py", line 137, in create
    affinity_feat = get_affinity_feat(features, pairs)
  File "/media/syl/BIG_ONE/cdp-master/source/create_pair_set.py", line 33, in get_affinity_feat
    cosine_simi.append(cosine_similarity(feat[pairs[:,0],:], feat[pairs[:,1],:]))
  File "/media/syl/BIG_ONE/cdp-master/source/create_pair_set.py", line 24, in cosine_similarity
    feat1 /= np.linalg.norm(feat1, axis=1).reshape(-1, 1)
  File "/usr/local/lib/python3.5/dist-packages/numpy/linalg/linalg.py", line 2480, in norm
    s = (x.conj() * x).real
MemoryError

  2. When I remove affinity from the yaml, there is a new error:

    false_pos = ((pred == 1) & (label == 0)).sum() / float((pred == 1).sum())
    Testing set. acc: 0.9997, abs_recall: 0, rel_recall: 0, false_pos: nan, false_neg: 0.0002552
    pair num: 0
    Propagation ...
    Traceback (most recent call last):
      File "main.py", line 50, in <module>
        main()
      File "main.py", line 45, in main
        cdp(args)
      File "/media/syl/BIG_ONE/cdp-master/source/cdp.py", line 84, in cdp
        components = graph.graph_propagation(pairs, scores, args.propagation['max_sz'], args.propagation['step'])
      File "/media/syl/BIG_ONE/cdp-master/source/graph.py", line 81, in graph_propagation
        th = score.min()
      File "/usr/local/lib/python2.7/dist-packages/numpy/core/_methods.py", line 29, in _amin
        return umr_minimum(a, axis, None, out, keepdims)
    ValueError: zero-size array to reduction operation minimum which has no identity

Do you have any ideas? Waiting for your reply, thank you very much!

some question about the code

Thanks for sharing! While reading the code I have some questions:
1. Are list.txt and meta.txt under the unlabeled data the labels of the unlabeled data? Where do they come from? If the data is unlabeled, how can it have label names?
2. In cdp.py, in
def sample(): vote_num += (tile_knn == tile_cmt).sum(axis=2) selidx = np.where((simi > th) & (vote_num >= accept) & (knn != -1) & (knn != anchor))
what do the vote_num and accept parameters refer to? Which samples does the condition (knn != anchor) rule out?
Hoping for your answer, thanks!

Question with respect to training the mediator

Hi, @XiaohangZhan , thanks for your implementation.

You said in the paper that the mediator is trained on D_{l}, but you explained in the previous section that the inputs to the mediator are the relationships, the affinity, and the local structure of the graphs derived from the unlabeled dataset. How do you obtain these on D_{l} for training the mediator? Did you repeat the process again on the labeled data?

Thanks,

cleaned data

Just wondering whether the CLEANED data used in the paper ("we clean up official training set and crawl images of more identities, producing about 7M images with 385K identities") will be released for public research in the future? Many thanks!

Vote is much better than mediator

In my experiments, I found that vote mode works much better than the mediator. Is there an explanation for this, or any tips for improving the mediator?

Memory issue when computing affinity features

Hi, thanks for sharing your code

I am trying to run CDP on my own data (its size is similar to the data in your paper). I get hundreds of millions of pairs after loading the committee KNN, resulting in a MemoryError when computing the affinity features. I can modify the source code of this part to lower the memory requirements, but that will increase the computation time significantly.

Could you please tell me what computation resources you used when running CDP?

Many Thanks

Cheers
