Abstract: Visual neural decoding refers to the process of extracting and interpreting original visual experiences from human brain activity. Recent advances in metric learning-based EEG visual decoding methods have delivered promising results and demonstrated the feasibility of decoding novel visual categories from brain activity. However, methods that directly map EEG features to the CLIP embedding space may introduce mapping bias and cause semantic inconsistency among features, thereby degrading alignment and impairing decoding performance. To further explore the semantic consistency between visual and neural signals, we construct a joint semantic space and propose a Visual-EEG Semantic Decouple Framework that explicitly extracts the semantic-related features of these two modalities to facilitate optimal alignment. Specifically, a cross-modal information decoupling module is introduced to guide the extraction of semantic-related information from both modalities. Then, by quantifying the mutual information between image and EEG features, we observe a strong positive correlation between decoding performance and the magnitude of mutual information. Furthermore, inspired by the mechanisms of visual object understanding from neuroscience, we propose an intra-class geometric consistency approach for the alignment process. This strategy maps visual samples within the same class to consistent neural patterns, which further enhances the robustness and performance of EEG visual decoding. Experiments on a large Image-EEG dataset show that our method achieves state-of-the-art results in zero-shot neural decoding tasks.
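The zero-shot decoding step that metric learning-based methods rely on can be illustrated with a minimal sketch. It assumes an EEG embedding and per-class image embeddings already live in a shared space; the function names, the three-dimensional vectors, and the class labels are hypothetical and are not taken from the paper. It shows only the generic nearest-neighbour retrieval by cosine similarity, not the proposed decoupling framework.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_decode(eeg_embedding, class_embeddings):
    """Return the class whose image embedding is closest to the EEG embedding.

    Assumption: both embeddings were already projected into one joint space.
    """
    scores = {
        label: cosine(eeg_embedding, emb)
        for label, emb in class_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy, hand-written embeddings (hypothetical values for illustration only).
class_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "car": [0.0, 0.2, 0.9],
    "tree": [0.1, 0.9, 0.1],
}
eeg_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_decode(eeg_embedding, class_embeddings)
print(label, scores)
```

Because the candidate classes enter only through their embeddings, categories never seen during training can be added to `class_embeddings` without retraining, which is what makes the setting zero-shot.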
Abstract: Interpretable deep learning models have received widespread attention in the field of image recognition. Because medical image diagnosis has a distinctive multi-instance nature and its decision-making regions are difficult to identify, many existing interpretability models still suffer from insufficient accuracy and interpretability in medical image disease diagnosis. To solve these problems, we propose the feature-driven inference network (FeaInfNet). Our first key innovation is a feature-based network reasoning structure, which is applied to FeaInfNet. A network with this structure compares the similarity of each sub-region image patch with the disease templates and normal templates that may appear in that region, and then combines the comparisons from all sub-regions to make the final diagnosis. It simulates the diagnostic process of doctors, making the model interpretable in its reasoning process while avoiding the misleading influence of normal regions participating in the reasoning. Second, we propose local feature masks (LFM) for extracting feature vectors, which provide global information to these vectors and thus enhance the expressive ability of FeaInfNet. Finally, we propose adaptive dynamic masks (Adaptive-DM) to interpret feature vectors and prototypes as human-understandable image patches, providing accurate visual interpretation. We conducted qualitative and quantitative experiments on multiple publicly available medical datasets, including RSNA, iChallenge-PM, Covid-19, ChinaCXRSet, and MontgomerySet. The results validate that our method achieves state-of-the-art classification accuracy and interpretability compared with baseline methods in medical image diagnosis. Additional ablation studies verify the effectiveness of each proposed component.
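The patch-versus-template comparison described above can be sketched as follows. This is a toy illustration under stated assumptions, not FeaInfNet itself: cosine similarity as the comparison, "best disease match minus best normal match" as the per-region evidence, and a maximum over regions as the aggregation rule are all choices made here for illustration, and every name and vector is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def region_evidence(patch, disease_templates, normal_templates):
    """Evidence for disease in one sub-region (assumed scoring rule).

    Positive when the patch resembles a disease template more than any
    normal template, negative otherwise.
    """
    best_disease = max(cosine(patch, t) for t in disease_templates)
    best_normal = max(cosine(patch, t) for t in normal_templates)
    return best_disease - best_normal


def diagnose(patches, disease_templates, normal_templates):
    """Combine per-region evidence into one decision (assumed: max over regions).

    Taking the maximum means regions that look normal cannot outvote a
    single region that looks diseased.
    """
    evidence = [
        region_evidence(p, disease_templates, normal_templates)
        for p in patches
    ]
    return max(evidence) > 0, evidence


# Hypothetical two-dimensional feature vectors.
disease_templates = [[1.0, 0.0]]
normal_templates = [[0.0, 1.0]]

sick_patches = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.2]]
healthy_patches = [[0.1, 0.9], [0.2, 0.8]]

sick_flag, sick_evidence = diagnose(sick_patches, disease_templates, normal_templates)
healthy_flag, healthy_evidence = diagnose(healthy_patches, disease_templates, normal_templates)
print(sick_flag, sick_evidence)
print(healthy_flag, healthy_evidence)
```

The per-region evidence list doubles as the explanation: the index of the largest value points at the sub-region that drove the decision.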
Abstract: Saliency methods, which generate visual explanatory maps representing the importance of image pixels for model classification, are a popular technique for explaining neural network decisions. This paper proposes hierarchical dynamic masks (HDM), a novel method for generating explanatory maps that enhances the granularity and comprehensiveness of saliency maps. First, we propose dynamic masks (DM), which enable multiple small-sized benchmark mask vectors to roughly learn the critical information in the image through optimization. The benchmark mask vectors then guide the learning of large-sized auxiliary mask vectors, so that their superimposed mask can accurately learn fine-grained pixel importance information and reduce sensitivity to adversarial perturbations. In addition, we construct the HDM by concatenating DM modules. In a learning-based way, these DM modules find and fuse the regions of interest that remain in the masked image for the neural network's classification decision. Since HDM forces the DM modules to perform importance analysis in different areas, the fused saliency map is more comprehensive. The proposed method significantly outperformed previous approaches in recognition and localization capabilities when tested on natural and medical datasets.
Abstract: The interpretation of decisions made by neural networks is a focus of recent research. Previous methods modify the architecture of the neural network so that it simulates the human reasoning process, that is, it makes decisions by finding decision elements, which gives the network an interpretable reasoning process. However, such a specific interpretable architecture limits the fitting space of the network, resulting in lower classification performance, unstable convergence, and limited interpretability. We propose DProtoNet (Decoupling Prototypical network), which stores the decision basis of the neural network using feature masks and uses Multiple Dynamic Masks (MDM) to explain the decision basis retained by the feature masks. It decouples the neural network inference module from the interpretation module and removes the architectural limitations specific to interpretable networks, so that the decision-making architecture retains the original network architecture as much as possible. This makes the neural network more expressive and greatly improves both the classification performance and the interpretability of explanatory networks. We also propose replacing prototype learning from a single image with prototype learning from multiple images, which makes the prototypes robust, improves the convergence speed of network training, and makes the accuracy of the network more stable during learning. In tests on multiple datasets, DProtoNet improves the accuracy of recent advanced interpretable network models by 5% to 10%, and its classification performance is comparable to that of backbone networks without interpretability. It also achieves state-of-the-art interpretability performance.
Abstract: A Class Activation Map (CAM) lookup of a neural network can tell us which regions the network focuses on when making a decision. We propose Multiple Dynamic Mask (MDM), a general saliency map query method with an interpretable inference process. The algorithm is based on one assumption: when a picture is input into a trained neural network, only the activation features related to classification affect the classification results, and the features unrelated to classification hardly affect them. MDM is a learning-based, end-to-end algorithm for finding the regions of interest for neural network classification. It has the following advantages: 1. Its reasoning process is interpretable and conforms to human cognition. 2. It is universal: it can be used with any neural network and does not depend on the network's internal structure. 3. Its search performance is better: because the algorithm is learning-based, it adapts to different data and networks, and it outperforms previously proposed methods. We evaluate the MDM saliency map search algorithm with ResNet and DenseNet as the trained neural networks and compare it with recent advanced saliency map search methods; across the search performance metrics, MDM reaches the state of the art. We also applied MDM to the interpretable neural networks ProtoPNet and XProtoNet, which improved their prototype search performance. Finally, we visualize saliency map search results for convolutional and Transformer architectures, illustrating the interpretability and generality of MDM.
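The assumption stated above, that masking classification-irrelevant features hardly changes the output, suggests learning a mask by optimization. The sketch below is a toy illustration of that idea and not the MDM algorithm: it assumes a four-feature linear classifier with a sigmoid output, one soft mask value per feature, a sparsity penalty on the mask, and a hand-derived gradient. All names and numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def learn_mask(weights, x, sparsity=0.02, lr=1.0, steps=2000):
    """Learn a soft mask in (0, 1) per input feature by gradient descent.

    Loss (assumed for this toy): -log(p) + sparsity * mean(mask), where
    p = sigmoid(sum_i weights[i] * mask[i] * x[i]) is the class probability
    of the masked input. The first term keeps features the classifier needs;
    the second pushes every other mask value toward zero.
    """
    n = len(x)
    theta = [0.0] * n  # mask = sigmoid(theta), so every mask starts at 0.5
    for _ in range(steps):
        mask = [sigmoid(t) for t in theta]
        z = sum(w * m * xi for w, m, xi in zip(weights, mask, x))
        p = sigmoid(z)
        for i in range(n):
            # d(loss)/d(mask_i), derived by hand for the linear toy model.
            grad_mask = -(1.0 - p) * weights[i] * x[i] + sparsity / n
            # Chain rule through mask_i = sigmoid(theta_i).
            theta[i] -= lr * grad_mask * mask[i] * (1.0 - mask[i])
    return [sigmoid(t) for t in theta]


# Features 0 and 3 drive the toy classifier; features 1 and 2 are ignored.
weights = [2.0, 0.0, 0.0, 3.0]
x = [1.0, 1.0, 1.0, 1.0]

mask = learn_mask(weights, x)
print([round(m, 3) for m in mask])
```

In this toy the mask values for the two features the classifier uses rise above their starting point, while the values for the ignored features fall, which is the behaviour a saliency mask is meant to expose. A model-agnostic method would obtain the gradient by backpropagating through the trained network rather than from a closed-form expression.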
Abstract: Accuracy and diversity are two essential, measurable qualities of natural and semantically correct captions. Many efforts have enhanced one of them at the expense of the other because of the trade-off between the two. However, compromise is not progress: decayed diversity makes the captioner a repeater, and decayed accuracy makes it a fake advisor. In this work, we exploit a novel Variational Transformer framework to improve accuracy and diversity simultaneously. To ensure accuracy, we introduce the "Invisible Information Prior" along with the "Auto-selectable GMM" to instruct the encoder to learn precise language information and object relations in different scenes. To ensure diversity, we propose the "Range-Median Reward" baseline to retain more diverse candidates with higher rewards during the RL-based training process. Experiments show that our method improves accuracy (CIDEr) and diversity (self-CIDEr) simultaneously, by up to 1.1 and 4.8 percent, respectively, compared with the baseline. Our method also outperforms others under the newly proposed measurement of the trade-off gap, by at least 3.55 percent.