The challenge of object categorization in images is largely due to arbitrary translations and scales of the foreground objects. To attack this difficulty, we propose a new approach called collaborative receptive field learning to extract specific receptive fields (RF's) or regions from multiple images, and the selected RF's are supposed to focus on the foreground objects of a common category. To this end, we solve the problem by maximizing a submodular function over a similarity graph constructed by a pool of RF candidates. However, measuring pairwise distance of RF's for building the similarity graph is a nontrivial problem. Hence, we introduce a similarity metric called pyramid-error distance (PED) to measure their pairwise distances through summing up pyramid-like matching errors over a set of low-level features. Besides, in consistent with the proposed PED, we construct a simple nonparametric classifier for classification. Experimental results show that our method effectively discovers the foreground objects in images, and improves classification performance.