Abstract:Deep learning methods have been achieved brilliant results in face recognition. One of the important tasks to improve the performance is to collect and label images as many as possible. However, labeling identities and checking qualities of large image data are difficult task and mistakes cannot be avoided in processing large data. Previous works have been trying to deal with the problem only in training domain, however it can cause much serious problem if the mistakes are in gallery data of face identification. We proposed gallery data sampling methods which are robust to outliers including wrong labeled, low quality, and less-informative images and reduce searching time. The proposed sampling-by-pruning and sampling-by-generating methods significantly improved face identification performance on our 5.4M web image dataset of celebrities. The proposed method achieved 0.0975 in terms of FNIR at FPIR=0.01, while conventional method showed 0.3891. The average number of feature vectors for each individual gallery was reduced to 17.1 from 115.9 and it can provide much faster search. We also made experiments on public datasets and our method achieved 0.1314 and 0.0668 FNIRs at FPIR=0.01 on the CASIA-WebFace and MS1MV2, while the convectional method did 0.5446, and 0.1327, respectively.