Part-based approaches for fine-grained recognition do not show the expected performance gain over global methods, although being able to explicitly focus on small details that are relevant for distinguishing highly similar classes. We assume that part-based methods suffer from a missing representation of local features, which is invariant to the order of parts and can handle a varying number of visible parts appropriately. The order of parts is artificial and often only given by ground-truth annotations, whereas viewpoint variations and occlusions result in parts that are not observable. Therefore, we propose integrating a Fisher vector encoding of part features into convolutional neural networks. The parameters for this encoding are estimated jointly with those of the neural network in an end-to-end manner. Our approach improves state-of-the-art accuracies for bird species classification on CUB-200-2011 from 90.40\% to 90.95\%, on NA-Birds from 89.20\% to 90.30\%, and on Birdsnap from 84.30\% to 86.97\%.