Abstract:Unsupervised visual anomaly detection is crucial for enhancing industrial production quality and efficiency. Among unsupervised methods, reconstruction approaches are popular due to their simplicity and effectiveness. The key aspect of reconstruction methods lies in the restoration of anomalous regions, which current methods have not satisfactorily achieved. To tackle this issue, we introduce a novel Adaptive Mask Inpainting Network (AMI-Net) from the perspective of adaptive mask-inpainting. In contrast to traditional reconstruction methods that treat non-semantic image pixels as targets, our method uses a pre-trained network to extract multi-scale semantic features as reconstruction targets. Given the multi-scale nature of industrial defects, we incorporate a training strategy involving random positional and quantitative masking. Moreover, we propose an innovative adaptive mask generator capable of generating adaptive masks that effectively mask anomalous regions while preserving normal regions. In this manner, the model can leverage the visible normal global contextual information to restore the masked anomalous regions, thereby effectively suppressing the reconstruction of defects. Extensive experimental results on the MVTec AD and BTAD industrial datasets validate the effectiveness of the proposed method. Additionally, AMI-Net exhibits exceptional real-time performance, striking a favorable balance between detection accuracy and speed, rendering it highly suitable for industrial applications. Code is available at: https://github.com/luow23/AMI-Net
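The "random positional and quantitative masking" strategy mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `random_mask`, the ratio bounds, and the flat patch indexing are assumptions made purely for illustration. The sketch draws at random both how many patches are masked (quantity) and which ones (position).

```python
import random

def random_mask(num_patches, min_ratio=0.1, max_ratio=0.9, rng=None):
    """Return a 0/1 list in which 1 marks a masked patch.

    Hypothetical helper: the ratio bounds are illustrative assumptions,
    not values taken from the abstract.
    """
    rng = rng or random.Random()
    # Quantity: draw the fraction of patches to mask.
    ratio = rng.uniform(min_ratio, max_ratio)
    # Keep at least one patch masked and at least one patch visible.
    num_masked = max(1, min(num_patches - 1, round(ratio * num_patches)))
    # Position: draw which patches are masked, without replacement.
    masked_idx = set(rng.sample(range(num_patches), num_masked))
    return [1 if i in masked_idx else 0 for i in range(num_patches)]

# Example: a mask over a 4 x 4 grid of feature patches, flattened to 16 entries.
mask = random_mask(num_patches=16, rng=random.Random(0))
```

Varying the masked quantity from one training step to the next exposes the model to both small and large missing regions, which is one plausible reading of how such a strategy addresses multi-scale defects.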
Abstract:Image anomaly detection plays a pivotal role in industrial inspection. Traditional approaches often demand distinct models for specific categories, resulting in substantial deployment costs. This raises concerns about multi-class anomaly detection, where a unified model is developed for multiple classes. However, applying conventional methods, particularly reconstruction-based models, directly to multi-class scenarios encounters challenges such as identical shortcut learning, hindering effective discrimination between normal and abnormal instances. To tackle this issue, our study introduces the Prior Normality Prompt Transformer (PNPT) method for multi-class image anomaly detection. PNPT strategically incorporates normal semantics prompting to mitigate the "identical mapping" problem. This entails integrating a prior normality prompt into the reconstruction process, yielding a dual-stream model. This innovative architecture combines normal prior semantics with abnormal samples, enabling dual-stream reconstruction grounded in both prior knowledge and intrinsic sample characteristics. PNPT comprises four essential modules: Class-Specific Normality Prompting Pool (CS-NPP), Hierarchical Patch Embedding (HPE), Semantic Alignment Coupling Encoding (SACE), and Contextual Semantic Conditional Decoding (CSCD). Experimental validation on diverse benchmark datasets and real-world industrial applications highlights PNPT's superior performance in multi-class industrial anomaly detection.
Abstract:Texture surface anomaly detection finds widespread applications in industrial settings. However, existing methods often necessitate gathering numerous samples for model training. Moreover, they predominantly operate within a closed-set detection framework, limiting their ability to identify anomalies beyond the training dataset. To tackle these challenges, this paper introduces a novel zero-shot texture anomaly detection method named Global-Regularized Neighborhood Regression (GRNR). Unlike conventional approaches, GRNR can detect anomalies on arbitrary textured surfaces without any training data or training cost. Drawing from human visual cognition, GRNR derives two intrinsic prior supports directly from the test texture image: local neighborhood priors characterized by coherent similarities and global normality priors featuring typical normal patterns. The fundamental principle of GRNR is to use these two intrinsic support priors for self-reconstructive regression of the query sample. The regression is driven by the local neighbor support and regularized by the global normality support, so that the reconstruction is both visually consistent and faithful to normal patterns. We validate the effectiveness of GRNR across various industrial scenarios using eight benchmark datasets, demonstrating its superior detection performance without the need for training data. Remarkably, our method is applicable to open-set texture defect detection and can even surpass existing vanilla approaches that require extensive training.
Abstract:The unsupervised visual inspection of defects in industrial products poses a significant challenge due to substantial variations in product surfaces. Current unsupervised models struggle to strike a balance between detecting texture and object defects, lacking the capacity to discern latent representations and intricate features. In this paper, we present a novel self-supervised learning algorithm designed to derive an optimal encoder by tackling the renowned jigsaw puzzle. Our approach involves dividing the target image into nine patches, tasking the encoder with predicting the relative position relationships between any two patches to extract rich semantics. Subsequently, we introduce an affinity-augmentation method to accentuate differences between normal and abnormal latent representations. Leveraging the classic support vector data description algorithm yields final detection results. Experimental outcomes demonstrate that our proposed method achieves outstanding detection and segmentation performance on the widely used MVTec AD dataset, with rates of 95.8% and 96.8%, respectively, establishing a state-of-the-art benchmark for both texture and object defects. Comprehensive experimentation underscores the effectiveness of our approach in diverse industrial applications.
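The jigsaw pretext task described in this abstract (nine patches, relative position between any two patches) can be sketched as follows. The abstract does not say how the relative position is encoded, so representing it as a (row offset, column offset) pair is an assumption, as are the helper names `split_into_patches` and `relative_position_pairs`. Plain nested lists stand in for image tensors.

```python
from itertools import permutations

def split_into_patches(image, grid=3):
    """Split a 2D list (H x W) into grid * grid patches in row-major order."""
    h, w = len(image), len(image[0])
    ph, pw = h // grid, w // grid
    patches = []
    for r in range(grid):
        for c in range(grid):
            rows = image[r * ph:(r + 1) * ph]
            patches.append([row[c * pw:(c + 1) * pw] for row in rows])
    return patches

def relative_position_pairs(grid=3):
    """Yield (i, j, (d_row, d_col)) for every ordered pair of distinct patches.

    The offset label is an assumed encoding of "relative position".
    """
    for i, j in permutations(range(grid * grid), 2):
        ri, ci = divmod(i, grid)
        rj, cj = divmod(j, grid)
        yield i, j, (rj - ri, cj - ci)

# Toy 6 x 6 "image" whose pixel values encode their own coordinates.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
patches = split_into_patches(image)    # nine 2 x 2 patches
pairs = list(relative_position_pairs())  # training pairs with offset labels
```

Under this encoding, an encoder would receive two patches and be trained to classify their offset; with a 3 x 3 grid there are 24 distinct non-zero offsets.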
Abstract:In this paper, we introduce the novel state-of-the-art Dual-attention Transformer and Discriminative Flow (DADF) framework for visual anomaly detection. Relying only on normal knowledge, visual anomaly detection has wide applications in industrial scenarios and has attracted significant attention. However, most existing methods fail to meet the requirements of such scenarios. In contrast, the proposed DADF presents a new paradigm: it first leverages a pre-trained network to acquire multi-scale prior embeddings, then develops a vision Transformer with dual attention mechanisms, namely self-attention and memorial-attention, to achieve two-level reconstruction of the prior embeddings with sequential and normality association. Additionally, we propose using normalizing flow to establish a discriminative likelihood for the joint distribution of the prior embeddings and their reconstructions at each scale. DADF achieves 98.3/98.4 image/pixel AUROC on MVTec AD, and 83.7 image AUROC and 67.4 pixel sPRO on the MVTec LOCO AD benchmark, demonstrating the effectiveness of the proposed approach.
Abstract:This paper presents a novel framework, named Global-Local Correspondence Framework (GLCF), for visual anomaly detection with logical constraints. Visual anomaly detection has become an active research area in various real-world applications, such as industrial anomaly detection and medical disease diagnosis. However, most existing methods focus on identifying local structural degeneration anomalies and often fail to detect high-level functional anomalies that involve logical constraints. To address this issue, we propose a two-branch approach that consists of a local branch for detecting structural anomalies and a global branch for detecting logical anomalies. To facilitate local-global feature correspondence, we introduce a novel semantic bottleneck enabled by the visual Transformer. Moreover, we develop a separate feature estimation network for each branch to detect anomalies. Our proposed framework is validated on various benchmarks, including the industrial datasets MVTec AD and MVTec LOCO AD and the Retinal-OCT medical dataset. Experimental results show that our method outperforms existing methods, particularly in detecting logical anomalies.
Abstract:Industrial vision anomaly detection plays a critical role in advanced intelligent manufacturing, but several limitations still need to be addressed in this context. First, existing reconstruction-based methods suffer from the trivial shortcut of identity mapping, where the reconstruction error gap between normal and abnormal samples becomes negligible, leading to inferior detection capability. Second, previous studies have mainly concentrated on convolutional neural network (CNN) models, which capture the local semantics of objects but neglect the global context, also resulting in inferior performance. Moreover, existing studies follow an individual learning fashion in which each detection model handles only one product category, while generalizable detection across multiple categories has not been explored. To tackle these limitations, we propose a self-induction vision Transformer (SIVT) for unsupervised, generalizable multi-category industrial visual anomaly detection and localization. The proposed SIVT first extracts discriminative features from a pre-trained CNN as property descriptors. Then, the self-induction vision Transformer reconstructs the extracted features in a self-supervised fashion, with auxiliary induction tokens additionally introduced to induce the semantics of the original signal. Finally, abnormal properties are detected using the semantic feature residual difference. We evaluated SIVT on the existing MVTec AD benchmark; the results reveal that the proposed method advances state-of-the-art detection performance with an improvement of 2.8-6.3 in AUROC and 3.3-7.6 in AP.
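The final step of this abstract, detecting anomalies from the feature residual, can be illustrated with a small sketch. The abstract does not specify the distance or the image-level aggregation, so the per-location Euclidean distance and the max-pooling score below are assumptions, and the names `residual_anomaly_map` and `image_score` are hypothetical. Nested lists of shape H x W x C stand in for feature maps.

```python
import math

def residual_anomaly_map(features, reconstructed):
    """Per-location Euclidean distance between two H x W x C nested lists."""
    return [
        [math.dist(f, r) for f, r in zip(f_row, r_row)]
        for f_row, r_row in zip(features, reconstructed)
    ]

def image_score(anomaly_map):
    """Image-level score as the map maximum (an assumed, common choice)."""
    return max(max(row) for row in anomaly_map)

# Toy 2 x 2 feature map with 2 channels; only location (1, 0) is
# reconstructed poorly, so only that location gets a non-zero residual.
features = [[[1.0, 0.0], [0.0, 1.0]],
            [[3.0, 4.0], [1.0, 1.0]]]
reconstructed = [[[1.0, 0.0], [0.0, 1.0]],
                 [[0.0, 0.0], [1.0, 1.0]]]
amap = residual_anomaly_map(features, reconstructed)
score = image_score(amap)
```

The idea is that a reconstruction model trained only on normal samples reproduces normal features well, so large residuals point to locations that deviate from what it learned.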
Abstract:Due to the extreme imbalance between the amounts of normal and abnormal data, visual anomaly detection is important for the development of automatic industrial product quality inspection. Unsupervised methods based on reconstruction and embedding have been widely studied for anomaly detection, of which reconstruction-based methods are the most popular. However, establishing a unified model for textured surface defect detection remains a challenge because these surfaces can vary in both homogeneous and irregular ways. Furthermore, existing reconstruction-based methods have limited ability to convert defect features into normal features. To address these challenges, we propose a novel unsupervised reference-based autoencoder (RB-AE) to accurately inspect a variety of textured defects. Unlike most reconstruction-based methods, RB-AE is trained with artificial defects and a novel pixel-level discrimination loss function, which gives the model pixel-level discrimination ability. First, the RB-AE employs an encoding module to extract multi-scale features of the textured surface. Subsequently, a novel reference-based attention module (RBAM) converts the defect features into normal features to suppress the reconstruction of defects. In addition, RBAM effectively suppresses the defective feature residual caused by skip connections. Next, a decoding module uses the repaired features to reconstruct the normal texture background. Finally, a novel multi-scale feature discrimination module (MSFDM) is employed for defect detection and segmentation.
Abstract:Unsupervised visual anomaly detection is of practical significance in many scenarios and is a challenging task due to the unbounded definition of anomalies. Moreover, most previous methods are application-specific, and establishing a unified model for anomalies across application scenarios remains unsolved. This paper proposes a novel hybrid framework termed Siamese Transition Masked Autoencoders (ST-MAE) to handle various visual anomaly detection tasks uniformly via deep feature transition. Concretely, the proposed method first extracts hierarchical semantic features from a pre-trained deep convolutional neural network and then develops a feature decoupling strategy to split the deep features into two disjoint feature patch subsets. Leveraging the decoupled features, the ST-MAE is built from Siamese encoders that operate on each subset of feature patches and perform the latent representation transition between the two subsets, along with a lightweight decoder that reconstructs the original features from the transitioned latent representation. Finally, the anomalous attributes can be detected using the semantic deep feature residual. Our deep feature transition scheme yields a nontrivial and semantic self-supervisory task to extract prototypical normal patterns, which allows for learning uniform models that generalize well across different visual anomaly detection tasks. Extensive experiments demonstrate that the proposed ST-MAE method advances state-of-the-art performance on multiple benchmarks across application scenarios with superior inference efficiency, showing great potential as a uniform model for unsupervised visual anomaly detection.
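The feature decoupling step in this abstract, splitting the feature patches into two disjoint subsets, can be sketched in a few lines. The abstract does not state how the split is chosen, so the random half-and-half partition below is an assumption, and `decouple_patches` is a hypothetical helper name.

```python
import random

def decouple_patches(num_patches, rng=None):
    """Partition patch indices into two disjoint subsets that cover all patches.

    A random half-and-half split is assumed here for illustration only.
    """
    rng = rng or random.Random()
    idx = list(range(num_patches))
    rng.shuffle(idx)
    half = num_patches // 2
    return sorted(idx[:half]), sorted(idx[half:])

# Example: 16 feature patches split into two subsets of 8 indices each.
subset_a, subset_b = decouple_patches(16, rng=random.Random(0))
```

In a Siamese setup each subset would be encoded separately, with one subset's latent representation used to predict the other's, so neither encoder can simply copy its own input.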
Abstract:In surface defect detection, because of the extreme imbalance between the numbers of positive and negative samples, anomaly detection methods based on positive samples have received increasing attention. Among them, reconstruction-based methods are the most popular. However, existing methods struggle either to repair abnormal foregrounds or to reconstruct clear backgrounds. Therefore, we propose a clear memory-augmented auto-encoder (CMA-AE). First, we propose a novel clear memory-augmented module, which combines the encoding and the memory-encoding by way of forgetting and inputting, thereby repairing abnormal foregrounds and preserving clear backgrounds. Second, a general artificial anomaly generation algorithm is proposed to simulate anomalies that are as realistic and feature-rich as possible. Finally, we propose a novel multi-scale feature residual detection method for defect segmentation, which makes defect localization more accurate. CMA-AE is compared against 11 state-of-the-art methods on five benchmark datasets, showing an average improvement of 18.6% in F1-measure.