Large datasets are very useful for training speaker recognition systems, and various research groups have constructed several over the years. Voxceleb is a large dataset for speaker recognition that is extracted from Youtube videos. This paper presents an audio-visual method for acquiring audio data from Youtube given the speaker's name as input. The system follows a pipeline similar to that of the Voxceleb data acquisition method. However, our work focuses on fast data acquisition by using face-tracking in subsequent frames once a face has been detected -- this is preferable over face detection for every frame considering its computational cost. We show that applying audio diarization to our data after acquiring it can yield equal error rates comparable to Voxceleb. A secondary set of experiments showed that we could further decrease the error rate by fine-tuning a pre-trained x-vector system with the acquired data. Like Voxceleb, the work here focuses primarily on developing audio for celebrities. However, unlike Voxceleb, our target audio data is from celebrities in East Asian countries. Finally, we set up a speaker verification task to evaluate the accuracy of our acquired data. After diarization and fine-tuning, we achieved an equal error rate of approximately 4\% across our entire dataset.