Learning from web data has attracted considerable research interest in recent years. However, crawled web images usually suffer from two types of noise, label noise and background noise, which make them harder to use effectively. Most existing methods either rely on human supervision or ignore the background noise. In this paper, we propose ProtoNet, a novel method that handles both types of noise jointly, without supervision from clean images in the training stage. Specifically, we use a memory module to identify representative and discriminative prototypes for each category. We then remove noisy images and noisy region proposals from the web dataset with the aid of the memory module. Our approach is efficient and can be easily integrated into arbitrary CNN models. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our method.
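
To make the idea of prototype-based filtering concrete, the following is a minimal sketch, not the memory module proposed in this paper. It assumes a single prototype per category, taken here to be the mean feature vector, and it assumes cosine similarity with a fixed threshold as the rejection rule. The function names, the threshold value, and the toy features are all hypothetical. The same check could in principle be applied to region-proposal features instead of whole-image features.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filter_by_prototype(features, threshold=0.5):
    """Return indices of features whose similarity to the class mean
    (an assumed stand-in for a learned prototype) is at least `threshold`."""
    proto = mean_vector(features)
    return [i for i, f in enumerate(features) if cosine(f, proto) >= threshold]

# Toy features for one category: three similar vectors and one outlier
# that plays the role of a mislabeled web image.
feats = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [-1.0, 0.3]]
kept = filter_by_prototype(feats, threshold=0.5)
print(kept)
```

In this toy setting the outlier points away from the class mean and is dropped, while the three consistent vectors are kept. A learned set of several prototypes per category would replace the simple mean used here.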