Abstract:The rapid advancement of deepfake technologies raises significant concerns about the security of face recognition systems. While existing methods leverage the clues left by deepfake techniques for face forgery detection, malicious users may intentionally manipulate forged faces to obscure these clues and thereby deceive detection tools. Meanwhile, attaining cross-domain robustness is challenging for data-driven methods, because the training data may not cover samples from all relevant domains. Therefore, in this paper, we introduce a Cross-Domain Robust Bias Expansion Network (BENet) designed to enhance face forgery detection. BENet employs an auto-encoder to reconstruct input faces, maintaining the invariance of real faces while selectively enhancing the difference between reconstructed fake faces and their original counterparts. This enhanced bias forms a robust foundation upon which dependable forgery detection can be built. To optimize the reconstruction results in BENet, we employ a bias expansion loss infused with contrastive concepts. In addition, to further amplify forgery clues, BENet incorporates a Latent-Space Attention (LSA) module. This LSA module captures variances in latent features between the auto-encoder's encoder and decoder, placing emphasis on inconsistent forgery-related information. Furthermore, BENet incorporates a cross-domain detector with a threshold to determine whether a sample belongs to a known distribution. Correcting the classification results through the cross-domain detector enables BENet to defend against unknown deepfake attacks from unseen domains. Extensive experiments demonstrate the superiority of BENet over state-of-the-art methods in intra-database and cross-database evaluations.
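The reconstruction objective described in this abstract can be illustrated with a minimal sketch. It assumes a hinge-style contrastive form in which real faces are pulled toward zero reconstruction error while fake faces are pushed toward an error of at least a margin. The name `bias_expansion_loss`, the `margin` parameter, and the mean-squared-error measure are assumptions made for illustration and are not taken from the paper.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def bias_expansion_loss(original, reconstructed, is_real, margin=1.0):
    """Hypothetical contrastive reconstruction loss (an assumed form).

    Real faces: penalize any reconstruction error, so they stay invariant.
    Fake faces: penalize errors smaller than `margin`, so the difference
    between the fake input and its reconstruction is expanded.
    """
    err = mse(original, reconstructed)
    if is_real:
        return err
    return max(0.0, margin - err)


# A perfectly reconstructed real face costs nothing, while a perfectly
# reconstructed fake face pays the full margin.
real_loss = bias_expansion_loss([0.2, 0.4], [0.2, 0.4], is_real=True)
fake_loss = bias_expansion_loss([0.2, 0.4], [0.2, 0.4], is_real=False)
```

Under these assumptions, a detector could then threshold the per-sample reconstruction error, since the loss drives that error apart for the two classes.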
Abstract:The automated generation of radiology diagnostic reports helps radiologists make timely and accurate diagnostic decisions while also enhancing clinical diagnostic efficiency. However, the pronounced imbalance in the data distribution between normal and abnormal samples (including visual and textual biases) poses significant challenges for a data-driven task such as automatic radiology report generation. Therefore, we propose a Dynamic Multi-Domain Knowledge (DMDK) network for radiology diagnostic report generation. The DMDK network consists of four modules: a Chest Feature Extractor (CFE), a Dynamic Knowledge Extractor (DKE), a Specific Knowledge Extractor (SKE), and a Multi-knowledge Integrator (MKI). Specifically, the CFE module is primarily responsible for extracting the unprocessed visual medical features of the images. The DKE module is responsible for extracting dynamic disease topic labels from the retrieved radiology diagnostic reports. We then fuse the dynamic disease topic labels with the original visual features of the images to highlight the abnormal regions in those features and thereby alleviate the visual data bias problem. The SKE module expands upon the conventional static knowledge graph to mitigate textual data biases and improve the interpretability of the model via domain-specific dynamic knowledge graphs. The MKI module distills all the knowledge and generates the final diagnostic radiology report. We performed extensive experiments on two widely used datasets, IU X-Ray and MIMIC-CXR. The experimental results demonstrate the effectiveness of our method, with all evaluation metrics outperforming previous state-of-the-art models.
Abstract:A novel approach is proposed for improving the accuracy of fault detection in distribution networks. This technique combines adaptive probability learning and waveform decomposition to optimize the similarity of features. Its objective is to discover the most appropriate linear mapping between simulated and real data to minimize distribution differences. By aligning the data in the same feature space, the proposed method effectively overcomes the challenge posed by limited sample size when identifying faults and classifying real data in distribution networks. Experimental results on simulated system data and real field data demonstrate that this approach outperforms commonly used classification models such as convolutional neural networks, support vector machines, and k-nearest neighbors, especially under adaptive learning conditions. Consequently, this research provides a fresh perspective on fault detection in distribution networks, particularly when adaptive learning conditions are employed.
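The idea of a linear mapping that reduces the distribution gap between simulated and real data can be sketched with a much simpler stand-in: per-feature moment matching, which scales and shifts each simulated feature so that its mean and standard deviation match the real data. This is not the adaptive probability learning method of the abstract. The function names and the toy values are hypothetical.

```python
import statistics


def fit_linear_alignment(source, target):
    """Fit a per-feature (scale, shift) so source matches target mean and std."""
    params = []
    for s_col, t_col in zip(zip(*source), zip(*target)):
        s_mu, t_mu = statistics.fmean(s_col), statistics.fmean(t_col)
        s_sd, t_sd = statistics.pstdev(s_col), statistics.pstdev(t_col)
        # Leave constant features unscaled to avoid dividing by zero.
        scale = t_sd / s_sd if s_sd > 0 else 1.0
        params.append((scale, t_mu - scale * s_mu))
    return params


def apply_alignment(rows, params):
    """Apply the fitted per-feature linear map to every row."""
    return [[scale * x + shift for x, (scale, shift) in zip(row, params)]
            for row in rows]


# Toy one-feature example with illustrative values only.
simulated = [[0.0], [2.0]]
field = [[10.0], [14.0]]
params = fit_linear_alignment(simulated, field)
aligned = apply_alignment(simulated, params)
```

After alignment, a classifier trained on the mapped simulated samples sees inputs on the same scale as the field data, which is the role the learned mapping plays in the described method.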
Abstract:Face parsing infers a pixel-wise label map for each semantic facial component. Previous methods generally work well for unoccluded faces but overlook facial occlusion and ignore contextual areas outside a single face, even though facial occlusion became common during the COVID-19 epidemic. Inspired by the illumination theory of images, we propose novel homogeneous tanh-transforms for image preprocessing, made up of four tanh-transforms that fuse central vision and peripheral vision. The proposed method addresses the dilemma of face parsing under occlusion and compresses more information from the surrounding context. Based on the homogeneous tanh-transforms, we propose an occlusion-aware convolutional neural network for occluded face parsing. It combines information from both Tanh-polar space and Tanh-Cartesian space and is capable of enhancing receptive fields. Furthermore, we introduce an occlusion-aware loss to focus on the boundaries of occluded regions. The network is simple and flexible, and can be trained end-to-end. To facilitate future research on occluded face parsing, we also contribute a new cleaned face parsing dataset, manually purified from several academic or industrial datasets, including CelebAMask-HQ, Short-video Face Parsing, and Helen, and will make it public. Experiments demonstrate that our method surpasses state-of-the-art methods for face parsing under occlusion.
Abstract:Thyroid nodule segmentation is a crucial step in the diagnostic procedure of physicians and computer-aided diagnosis systems. Most current studies treat segmentation and diagnosis as independent tasks without considering the correlation between them. Performing these independent tasks sequentially in computer-aided diagnosis systems may lead to the accumulation of errors. Therefore, it is worth combining them as a whole by exploring the relationship between thyroid nodule segmentation and diagnosis. According to the thyroid imaging reporting and data system (TI-RADS), the assessment of shape and margin characteristics is the prerequisite for discriminating between benign and malignant thyroid nodules. These characteristics can be observed in the thyroid nodule segmentation masks. Inspired by the diagnostic procedure of TI-RADS, this paper proposes a shape-margin knowledge augmented network (SkaNet) for simultaneous thyroid nodule segmentation and diagnosis. Due to the similarity in visual features between segmentation and diagnosis, SkaNet shares visual features in the feature extraction stage and then utilizes a dual-branch architecture to perform thyroid nodule segmentation and diagnosis simultaneously. To enhance discriminative features, an exponential mixture module is devised, which combines convolutional feature maps and self-attention maps by exponential weighting. Then, SkaNet is jointly optimized by a knowledge augmented multi-task loss function with a constraint penalty term. It embeds shape and margin characteristics through numerical computation and models the relationship between the thyroid nodule diagnosis results and segmentation masks.
Abstract:Face anti-spoofing (FAS) is crucial for securing face recognition systems. However, existing FAS methods with handcrafted binary or pixel-wise labels have limitations due to diverse presentation attacks (PAs). In this paper, we propose an attack type robust face anti-spoofing framework under light flash, called ATR-FAS. Due to imaging differences caused by various attack types, traditional FAS methods based on a single binary classification network may result in excessive intra-class distance among spoof faces, which makes the decision boundary hard to learn. Therefore, we employ multiple networks to reconstruct multi-frame depth maps as auxiliary supervision, with each network specializing in one type of attack. A dual gate module (DGM) consisting of a type gate and a frame-attention gate is introduced; the two gates perform attack type recognition and multi-frame attention generation, respectively. The outputs of the DGM are used as weights to mix the results of the multiple expert networks. The multi-expert mixture enables ATR-FAS to generate spoof-differentiated depth maps and to detect spoof faces stably regardless of the type of PA. Moreover, we design a differential normalization procedure to convert original flash frames into differential frames. This simple but effective processing enhances the details in flash frames, aiding in the generation of depth maps. To verify the effectiveness of our framework, we collected a large-scale dataset containing 12,660 live and spoof videos with diverse PAs under dynamic flash from the smartphone screen. Extensive experiments illustrate that the proposed ATR-FAS significantly outperforms existing state-of-the-art methods. The code and dataset will be available at https://github.com/Chaochao-Lin/ATR-FAS.
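The gated mixing of expert outputs can be sketched as a weighted sum of per-expert depth maps. The sketch below covers only the type gate and assumes its scores are normalized with a softmax; the frame-attention gate is omitted. The function names and the flat-list representation of a depth map are hypothetical and are not taken from the ATR-FAS code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_expert_depth_maps(expert_maps, type_scores):
    """Blend per-expert depth maps using softmax-normalized type-gate scores.

    expert_maps: one flattened depth map (list of floats) per expert.
    type_scores: one raw gate score per expert (assumed interface).
    """
    weights = softmax(type_scores)
    n_pixels = len(expert_maps[0])
    return [sum(w * m[i] for w, m in zip(weights, expert_maps))
            for i in range(n_pixels)]


# Two hypothetical experts with equal gate scores contribute equally.
mixed = mix_expert_depth_maps([[0.0, 0.0], [1.0, 1.0]], [0.0, 0.0])
```

When the gate is confident about one attack type, the corresponding expert dominates the mixed depth map, which is the behavior the abstract attributes to the multi-expert mixture.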
Abstract:In healthcare, multimodal data such as medical images and clinical reports is prevalent and must be comprehensively analyzed before diagnostic decisions are made. However, current large-scale artificial intelligence models predominantly focus on single-modal cognitive abilities and neglect the integration of multiple modalities. Therefore, we propose Stone Needle, a general multimodal large-scale model framework tailored explicitly for healthcare applications. Stone Needle serves as a comprehensive medical multimodal model foundation, integrating various modalities such as text, images, videos, and audio to surpass the limitations of single-modal systems. Through the framework components of intent analysis, medical foundation models, prompt manager, and medical language module, our architecture can perform multimodal interaction over multiple rounds of dialogue. Our method is a general multimodal large-scale model framework that integrates diverse modalities and can be tailored to specific tasks. The experimental results demonstrate the superior performance of our method compared to single-modal systems. The fusion of different modalities and the ability to process complex medical information in Stone Needle benefit accurate diagnosis, treatment recommendations, and patient care.
Abstract:Natural gradient descent (NGD) has provided deep insights and powerful tools for deep neural networks. However, computing the Fisher information matrix becomes increasingly difficult as the network structure grows large and complex. This paper proposes a new optimization method whose main idea is to accurately replace natural gradient optimization by reconstructing the network. More specifically, we reconstruct the structure of the deep neural network and optimize the new network using traditional gradient descent (GD). The reconstructed network achieves the effect of optimization with natural gradient descent. Experimental results show that our optimization method can accelerate the convergence of deep network models and achieve better performance than GD while sharing its computational simplicity.
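For context, the standard natural gradient update that the method aims to reproduce preconditions the gradient with the inverse Fisher information matrix: theta <- theta - lr * F^{-1} * grad. The sketch below shows one such step for a two-parameter toy problem with an explicit 2x2 Fisher matrix. It illustrates the textbook NGD update only, not the network reconstruction proposed in the paper, and the function name and values are hypothetical.

```python
def natural_gradient_step(theta, grad, fisher, lr=0.1):
    """One NGD step for two parameters: theta - lr * F^{-1} @ grad.

    fisher is a 2x2 matrix given as nested lists; it is inverted in
    closed form, which is only practical for toy problems like this one.
    """
    (a, b), (c, d) = fisher
    det = a * d - b * c
    if det == 0:
        raise ValueError("Fisher matrix is singular")
    # Closed-form product F^{-1} @ grad for a 2x2 matrix.
    nat_grad = [(d * grad[0] - b * grad[1]) / det,
                (-c * grad[0] + a * grad[1]) / det]
    return [t - lr * g for t, g in zip(theta, nat_grad)]


# Toy values: a diagonal Fisher matrix rescales each gradient coordinate.
new_theta = natural_gradient_step(
    theta=[1.0, 1.0], grad=[2.0, 4.0],
    fisher=[[2.0, 0.0], [0.0, 4.0]], lr=0.5)
```

The matrix inversion is the step that grows expensive for large networks, which is the cost the abstract's method avoids by training a reconstructed network with plain GD.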
Abstract:This paper proposes a new convolutional neural network with multiscale processing for detecting ground-glass opacity (GGO) nodules in 3D computed tomography (CT) images, which is referred to as PiaNet for short. PiaNet consists of a feature-extraction module and a prediction module. The former module is constructed by introducing pyramid multiscale source connections into a contracting-expanding structure. The latter module includes a bounding-box regressor and a classifier that are employed to simultaneously recognize GGO nodules and estimate bounding boxes at multiple scales. To train the proposed PiaNet, a two-stage transfer learning strategy is developed. In the first stage, the feature-extraction module is embedded into a classifier network that is trained on a large data set of GGO and non-GGO patches, which are generated by performing data augmentation from a small number of annotated CT scans. In the second stage, the pretrained feature-extraction module is loaded into PiaNet, and then PiaNet is fine-tuned using the annotated CT scans. We evaluate the proposed PiaNet on the LIDC-IDRI data set. The experimental results demonstrate that our method outperforms state-of-the-art counterparts, including the Subsolid CAD and Aidence systems and the S4ND and GA-SSD methods. PiaNet achieves a sensitivity of 91.75% with only one false positive per scan.
Abstract:The popular softmax loss and its recent extensions have achieved great success in deep learning-based image classification. However, the data used to train image classifiers usually vary in quality. When this problem is ignored, low-quality data are hard to classify correctly. In this paper, we discover a positive correlation between the feature norm of an image and its quality through careful experiments on various applications and various deep neural networks. Based on this finding, we propose a contraction mapping function to compress the range of feature norms of training images according to their quality, and we embed this contraction mapping function into the softmax loss or its extensions to produce novel learning objectives. Experiments on various classification applications, including handwritten digit recognition, lung nodule classification, face verification, and face recognition, demonstrate that the proposed approach shows promise in dealing effectively with the problem of learning from data of varying quality and leads to significant and stable improvements in classification accuracy.
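The effect of compressing the range of feature norms can be sketched with a simple linear contraction toward a fixed center, g(n) = center + k * (n - center) with 0 < k < 1, applied to each feature's norm before the feature is passed to the loss. The linear form, the parameters `center` and `k`, and the function names are assumptions for illustration. The abstract does not specify the paper's actual mapping function.

```python
import math


def contract_norm(norm, center=10.0, k=0.5):
    """Assumed linear contraction: distances between norms shrink by factor k."""
    return center + k * (norm - center)


def rescale_feature(feature, center=10.0, k=0.5):
    """Rescale a feature vector so its L2 norm equals the contracted norm."""
    norm = math.sqrt(sum(x * x for x in feature))
    if norm == 0.0:
        return list(feature)  # direction is undefined, so leave it unchanged
    scale = contract_norm(norm, center, k) / norm
    return [x * scale for x in feature]


# A low-norm feature (a lower-quality image under the reported correlation)
# is pulled up and a high-norm feature is pulled down.
low = rescale_feature([3.0, 4.0])     # norm 5 is mapped to 7.5
high = rescale_feature([12.0, 16.0])  # norm 20 is mapped to 15
```

In this toy case the norm range [5, 20] is compressed to [7.5, 15], so low-quality and high-quality samples contribute on a more comparable scale once the rescaled features enter a softmax-style loss.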