A current trend in industries such as semiconductors and foundry is to shift their visual inspection processes to Automatic Visual Inspection (AVI) systems, to reduce their costs, mistakes, and dependency on human experts. This paper proposes a two-staged fault diagnosis framework for AVI systems. In the first stage, a generation model is designed to synthesize new samples based on real samples. The proposed augmentation algorithm extracts objects from the real samples and blends them randomly, to generate new samples and enhance the performance of the image processor. In the second stage, an improved deep learning architecture based on Faster R-CNN, Feature Pyramid Network (FPN), and a Residual Network is proposed to perform object detection on the enhanced dataset. The performance of the algorithm is validated and evaluated on two multi-class datasets. The experimental results performed over a range of imbalance severities demonstrate the superiority of the proposed framework compared to other solutions.