Natural images contain many variations such as illumination differences, affine transformations, and shape distortions. Correctly classifying these variations poses a long standing problem. The most commonly adopted solution is to build large-scale datasets that contain objects under different variations. However, this approach is not ideal since it is computationally expensive and it is hard to cover all variations in one single dataset. Towards addressing this difficulty, we propose the spatial transformer introspective neural network (ST-INN) that explicitly generates samples with the unseen affine transformation variations in the training set. Experimental results indicate ST-INN achieves classification accuracy improvements on several benchmark datasets, including MNIST, affNIST, SVHN and CIFAR-10. We further extend our method to cross dataset classification tasks and few-shot learning problems to verify our method under extreme conditions and observe substantial improvements from experiment results.