fwang91 / imdb-face

A new large-scale noise-controlled face recognition dataset.

face-recognition dataset eccv-2018 imdb-face

imdb-face's Introduction

The Devil of Face Recognition is in the Noise (ECCV'18)

By Fei Wang, Liren Chen, Cheng Li, Shiyao Huang, Yanjie Chen, Chen Qian, Chen Change Loy

IMDb-Face is a new large-scale noise-controlled dataset for face recognition research. The dataset contains about 1.7 million faces of 59k identities, manually cleaned from 2.0 million raw images. All images are obtained from the IMDb website. A detailed introduction to IMDb-Face can be found in the paper (https://arxiv.org/abs/1807.11649).

We hope that the IMDb-Face dataset can shed light on the influence of data noise on the face recognition task and point to potential labelling strategies that mitigate some of the problems. It can serve as a relatively clean dataset to facilitate future studies of noise in large-scale face recognition.

Citation

If you find IMDb-Face useful in your research, please cite:

@article{wang2018devil,
	title={The Devil of Face Recognition is in the Noise},
	author={Wang, Fei and Chen, Liren and Li, Cheng and Huang, Shiyao and Chen, Yanjie and Qian, Chen and Loy, Chen Change},
	journal={arXiv preprint arXiv:1807.11649},
	year={2018}
}

Data Download
Data Statistics
Overlap with Face Recognition Benchmarks
Notation
Contact

Data Download

IMDb-Face.csv

GoogleDrive Download: https://drive.google.com/open?id=134kOnRcJgHZ2eREu8QRi99qj996Ap_ML

BaiduDrive Download: https://pan.baidu.com/s/1eRylM-jMgjYL6cyU6qQd8g

Note: We found that the resolution of some images has changed, so we provide the shape information (height and width) of each image. If the resolution of a newly downloaded image differs from the one we provide, you can rescale the rectangle to obtain the final rectangle information.
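
The rescaling mentioned in the note can be sketched as below. This is a minimal illustration, not the authors' code: the helper names `parse_ints` and `rescale_rect` are hypothetical, the rectangle is assumed to be stored as "x1 y1 x2 y2", and the shape is assumed to be stored as "height width".

```python
def parse_ints(field):
    """Split a space-separated CSV field such as '143 246 608 719' into ints."""
    return tuple(int(part) for part in field.split())


def rescale_rect(rect, listed_hw, actual_hw):
    """Map a rectangle from the listed image shape to the downloaded one.

    rect      -- (x1, y1, x2, y2); this coordinate order is an assumption
    listed_hw -- (height, width) recorded in the CSV
    actual_hw -- (height, width) of the image that was actually downloaded
    """
    x1, y1, x2, y2 = rect
    listed_h, listed_w = listed_hw
    actual_h, actual_w = actual_hw
    scale_x = actual_w / listed_w  # horizontal scale factor
    scale_y = actual_h / listed_h  # vertical scale factor
    return (round(x1 * scale_x), round(y1 * scale_y),
            round(x2 * scale_x), round(y2 * scale_y))


# Hypothetical values: an image listed as 1000x800 (height x width)
# that now downloads at half that size.
rect = parse_ints("100 200 300 400")
print(rescale_rect(rect, (1000, 800), (500, 400)))  # (50, 100, 150, 200)
```

Scaling each axis independently only makes sense when the downloaded image is a resized copy of the listed one. If the aspect ratio differs, the image was probably cropped or replaced and the rescaled rectangle should not be trusted.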

Data Statistics

Overall

Total number of images: 1.7M

Total number of identities: 59k

IMDb-Face dataset statistics

Overlap with Face Recognition Benchmarks

We have removed images of celebrities whose identities appear in the LFW dataset, FaceScrub (MegaFace evaluation images), and YTF, based on names. You can therefore evaluate a face recognition model trained on IMDb-Face on these public benchmarks directly.

Notation

(1) IMDb-Face does not own the copyright of the images. IMDb-Face only provides URLs of images. The images in their original resolutions may be subject to copyright, so we cannot make them publicly available on our server. The dataset is released for non-commercial research and/or educational purposes.

(2) If you are a celebrity included in IMDb-Face and you do not want to be included in the dataset, please contact us and we will remove the data at your request.

Contact

Fei Wang

Questions can also be left as issues in the repository.

imdb-face's Issues

The number of images is less than 1.7M

Over half of the URLs have expired; could you please offer a download link for the dataset? Thanks

Over half of the URLs have expired. Could you please offer a download link for the dataset? We really need the dataset for academic research. Thanks!

Pre-trained model

Hi,
Would you please release the pre-trained model so that the results can be reproduced?

Annotation tool

Is it possible for you to publish your annotation tool as well? It would be really helpful for other researchers. (Most of the images are no longer available: only 40K+ of the 1.7M URLs were valid when tested on 2018/11/8.)

Cleaned subsets of popular face databases

Thanks for your excellent work!

In your paper, you said

We contribute cleaned subsets of popular face databases, i.e., MegaFace and MS-Celeb-1M datasets,

Will you upload the cleaned subsets of MegaFace and MS-Celeb-1M datasets?

Best wishes!

Downloading is too slow.

Great job!
I use Python's urllib.
Maybe because I am in China, downloading from the URLs is too slow. Is there any way to deal with this, or could anyone share the dataset?
Thanks.

duplicate rows in IMDb-Face.csv

I found that there are some duplicate rows in IMDb-Face.csv. For example, for the person named Peter_Loung, there are tens of rows recording the same information. You can search for nm0522014 in IMDb-Face.csv to verify this.

When I delete the duplicate rows in IMDb-Face.csv, the number of rows decreases from 1662888 to 1632928. Is there anything wrong with the CSV file?
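
One way to remove exact duplicate rows is sketched below using only the standard library. It is an illustration under assumptions rather than the method used in this issue: `unique_rows` and `dedupe_csv` are hypothetical helper names, the first line is assumed to be a header, and a row counts as a duplicate only when every field matches.

```python
import csv
import io


def unique_rows(rows):
    """Yield each distinct row once, keeping the first occurrence in order."""
    seen = set()
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            yield row


def dedupe_csv(src, dst):
    """Copy CSV data from src to dst without repeated rows.

    Returns the number of data rows written (the header is not counted).
    """
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader, None)
    if header is None:  # empty input, nothing to copy
        return 0
    writer.writerow(header)
    kept = 0
    for row in unique_rows(reader):
        writer.writerow(row)
        kept += 1
    return kept


# Tiny in-memory demonstration with made-up rows.
source = io.StringIO("name,index\nA,nm1\nA,nm1\nB,nm2\n")
target = io.StringIO()
print(dedupe_csv(source, target))  # 2
```

For the real file, the two streams could be opened with `open('IMDb-Face.csv', newline='')` and `open('IMDb-Face-dedup.csv', 'w', newline='')`, where the output file name is only a placeholder.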

List of noise classes

I found multiple classes in the dataset that each contain a wide variety of different people.
For example, for the person Milos_Twilight with index nm1115471 there are about 100 photos of different people. Two sample files:
1)
image 10.jpg
rect 470 134 551 214
height width 714 1000

2)
image 101.jpg
rect 198 161 491 475
height width 1000 678

It seems that there are many classes like that in the dataset.

Does anybody have a complete list of such noisy classes?
Can you provide any recommendations on how to fix that problem?

IMDb-Face vs celebrity

Do you mean that IMDb-Face shares many images with the celebrity data? If I want to expand celebrity data, how should I use IMDb-Face?

Download Link

Hi,

Is the link still active? When I go to the link I am unable to see the files. Is it possible for you to provide a Dropbox link?

Thanks,

Incorrect bounding box coordinates

I notice that for certain images, the coordinates provided are incorrect, even though the resolution has not changed.
E.g., file 3.jpg in nm0469103 (Jeroen_Krabbé) has the following coordinates: 719 602 831 714, whereas the image size is 528 1264.
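
A quick way to flag such rows is to test whether the rectangle fits inside the listed image shape. The sketch below is only an illustration: `rect_in_bounds` is a hypothetical helper, and the rectangle is assumed to be ordered as x1 y1 x2 y2.

```python
def rect_in_bounds(rect, height, width):
    """Return True if the rectangle lies inside an image of the given size.

    rect is assumed to be (x1, y1, x2, y2) with x to the right and y downward.
    """
    x1, y1, x2, y2 = rect
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


# The rectangle reported above, tested against both readings of "528 1264".
rect = (719, 602, 831, 714)
print(rect_in_bounds(rect, 528, 1264))  # False: y2 = 714 exceeds height 528
print(rect_in_bounds(rect, 1264, 528))  # False: x2 = 831 exceeds width 528
```

The check fails whichever of 528 and 1264 is taken as the height, which is consistent with the rectangle not belonging to an image of that size.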

Duplicate rows in IMDb-Face.csv

Hi, I just wonder whether it is OK that only 1300482 URLs are unique out of a total count of 1726419?

In the top 18 rows of IMDb-Face.csv, it can easily be checked that each line appears twice.

A lot of boxes are incorrect.

We found that a great many images are given wrong box coordinates.
Could you please publish the correct boxes? Soft bounding boxes are also OK, and we can run detection on them.
Thank you very much.

Some url could not be found anymore (404 Not Found).

I got about 300 "404 Not Found" errors when downloading the first 2000 images in the .csv file. Besides, some links are duplicated. We are interested in your database and are trying to use it in research. Would you be willing to provide the original images?

Bounding boxes are incorrect.

Lots of bounding boxes are incorrect.
I'm aware of issues #11 and #6.
However, I want some clarification.

It seems that bounding boxes are correct only for images whose resolution has not changed.
For example, the image with id 1 is OK:
name Terence_Bernie_Hines
index nm0385722
image 2.jpg
rect 143 246 608 719
height width 999 799

Original image with shape (999, 799) matches the shape listed in .csv file:

Cropped bounding box is valid:

How should images whose resolution has changed dramatically be handled?

For example, for id 1310606:
name Audrey_Hepburn
index nm0000030
image 2.jpg
rect 650 266 724 378
height width 999 1535

The original image's shape (336, 222) doesn't match the listed size (999, 1535) at all.

Original image with shape (336, 222):

Resized image with shape (999, 1535):

Cropped bounding box is invalid:

Do you have any recommendations on how to handle this problem?
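
One possible way to detect such cases before rescaling is to compare aspect ratios and skip images that fail. This is only a sketch, not a fix from the authors: `same_aspect_ratio` is a hypothetical helper, the 2% tolerance is an arbitrary choice, and shapes are assumed to be (height, width) as in the listing above.

```python
def same_aspect_ratio(listed_hw, actual_hw, tol=0.02):
    """Return True if two (height, width) shapes have nearly equal aspect ratios.

    tol is the allowed relative difference; 0.02 is an arbitrary default.
    """
    listed_h, listed_w = listed_hw
    actual_h, actual_w = actual_hw
    listed_ratio = listed_w / listed_h
    actual_ratio = actual_w / actual_h
    return abs(listed_ratio - actual_ratio) <= tol * listed_ratio


# Audrey_Hepburn example: listed (999, 1535) versus downloaded (336, 222).
print(same_aspect_ratio((999, 1535), (336, 222)))  # False
```

A landscape shape in the listing against a portrait download means the file is not a resized copy of the annotated image, so stretching it to (999, 1535) cannot recover a valid rectangle. Such images may need a fresh face detection pass or may have to be dropped.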

Repetitions in file names and class labels

It has been observed that the CSV file used to download the dataset contains a few repeated URL values (maybe intentional, because a single picture may contain several faces) and repeated class labels assigned to a few celebrity names.

The following entries refer to two different celebrities, yet have the same class index.

Kanchan - nm0437156
Ilias_Kanchan - nm0437156

Apart from that, there are a few entries in the dataset that are pure repetitions, such that each entry has the same class index, filename, and URL (assuming that the format {class_index}_{filename.jpg} should mark a unique entry).

Hope this helps!
Alternatively, please do let me know if I was mistaken and those were on purpose.

Sample code to reproduce the problem.

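import csv

# Build one key per row from the class index and the image file name.
with open('IMDb-Face.csv', 'r', newline='') as file_a:
    spreadsheet = csv.DictReader(file_a)
    entries = ['%s_%s' % (entry['index'], entry['image']) for entry in spreadsheet]
print(len(entries), 'entries were found.')

# A set keeps each key once, so the size difference is the number of repeats.
unique_entries = set(entries)
print(len(unique_entries), 'unique entries were found.')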

+ 1662888 entries were found.
- 1632927 unique entries were found.

One suggestion to the sharing

Firstly, thanks to the authors for their contribution to data cleaning. I think the way of sharing (usually related to some scripts, like the one released by insightface) could be changed a little. For example, the method used to clean the data is more important than sharing the original URLs. The details are explained very clearly in the paper "The Devil of Face Recognition is in the Noise", but redoing the cleaning ourselves would be time-consuming and labor-intensive. So it would be better if the script could also be released. That way, we could reproduce your result on our own PCs rather than each of us downloading the cleaned images.

The URL for IMDb-Face dataset

The URL you released is not accessible. Could you please send me the URL?
Thank you very much!