Abstract: As Transformer models grow in complexity, their ability to generalize to novel, compositional tasks becomes crucial. This study challenges conventional wisdom about sparse activation in Sparse Mixture of Experts (SMoE) models when faced with increasingly complex compositional tasks. Through experiments on the SRAVEN symbolic reasoning task and the SKILL-MIX benchmark, we demonstrate that activating more experts improves performance on difficult tasks, with the optimal number of activated experts scaling with task complexity. Our findings reveal that pretrained SMoE-based Large Language Models achieve better results by increasing the number of experts per token on challenging compositional tasks.
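To make the mechanism concrete, the sketch below shows a minimal top-k SMoE layer in PyTorch where the experts-per-token count k is an explicit knob; all names, dimensions, and the dense dispatch loop are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed, not the paper's code): a top-k routed SMoE layer
# whose experts-per-token count k can be raised for harder compositional tasks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int, k: int):
        super().__init__()
        self.k = k  # number of experts activated per token
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        logits = self.router(x)                        # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)     # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)           # renormalize over the selected k
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # simple dispatch loop, for clarity
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Raising k at inference time densifies the forward pass at a proportional compute cost, which is the trade-off the abstract argues pays off on harder compositional tasks.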
Abstract: Mixture-of-Experts (MoE) is an ensemble method that combines predictions from several specialized sub-models (referred to as experts). The combination is performed by a router that dynamically assigns a weight to each expert's contribution based on the input data. Conventional MoE activates all available experts, incurring substantial computational cost. In contrast, Sparse Mixture-of-Experts (Sparse MoE) engages only a limited number of experts, or even just one, significantly reducing computational overhead while empirically preserving, and sometimes even improving, performance. Despite its wide-ranging applications and these advantageous characteristics, MoE's theoretical underpinnings have remained elusive. In this paper, we analyze Sparse MoE's generalization error with respect to several critical factors: the number of data samples, the total number of experts, the sparsity of expert selection, the complexity of the routing mechanism, and the complexity of individual experts. Our analysis sheds light on \textit{how \textbf{sparsity} contributes to MoE's generalization}, offering insights from the perspective of classical learning theory.
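For intuition only, a bound of the classical flavor the abstract points to might tie the generalization gap to the named factors as in the schematic below; this is an assumed, textbook-style form for illustration, not the paper's theorem, and every symbol is a placeholder.

```latex
% Schematic, assumed form (not the paper's result):
% m = number of samples, N = total experts, k = experts activated per input,
% C_expert / C_router = complexity measures of one expert and of the router.
\[
  \underbrace{\mathbb{E}[\ell] - \widehat{\mathbb{E}}[\ell]}_{\text{generalization gap}}
  \;\lesssim\;
  \sqrt{\frac{k\,\mathcal{C}_{\mathrm{expert}} + \mathcal{C}_{\mathrm{router}} + \log N}{m}}
\]
```

The point such a form makes is that the effective capacity scales with the k experts actually engaged rather than with all N, which is one way sparsity could aid generalization.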
Abstract: Diffusion-based image synthesis has attracted extensive attention recently. In particular, ControlNet, which uses image-based prompts, exhibits powerful capabilities in tasks such as generation conditioned on Canny edges and produces images well aligned with these prompts. However, vanilla ControlNet generally requires extensive training of around 5000 steps to achieve desirable control for a single task. Recent in-context learning approaches have improved its adaptability, but mainly for edge-based tasks, and they rely on paired examples. Thus, two important open issues must be addressed to reach ControlNet's full potential: (i) zero-shot control for certain tasks and (ii) faster adaptation for non-edge-based tasks. In this paper, we introduce a novel Meta ControlNet method, which adopts a task-agnostic meta-learning technique and features a new layer-freezing design. Meta ControlNet reduces the number of training steps required to attain control from 5000 to 1000. Further, Meta ControlNet exhibits direct zero-shot adaptability on edge-based tasks without any finetuning, and achieves control within only 100 finetuning steps on more complex non-edge tasks such as Human Pose, outperforming all existing methods. The code is available at https://github.com/JunjieYang97/Meta-ControlNet.
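The abstract does not spell out the meta-learning algorithm, so as a hedged sketch, a first-order Reptile-style outer loop combined with a frozen layer subset could look like the following; the task interface, the frozen prefixes, and all hyperparameters are assumptions for illustration.

```python
# Hedged sketch: Reptile-style meta-update with a layer-freezing design.
# The frozen prefixes, task.loss interface, and hyperparameters are assumed.
import copy
import torch

FROZEN_PREFIXES = ("encoder.",)  # hypothetical names of layers excluded from adaptation

def meta_step(model, tasks, inner_lr=1e-4, meta_lr=0.1, inner_steps=5):
    meta_weights = copy.deepcopy(model.state_dict())
    for task in tasks:
        model.load_state_dict(meta_weights)              # start each task from meta-weights
        params = [p for n, p in model.named_parameters()
                  if not n.startswith(FROZEN_PREFIXES)]  # freezing: adapt only these
        opt = torch.optim.Adam(params, lr=inner_lr)
        for _ in range(inner_steps):                     # inner-loop adaptation
            loss = task.loss(model)                      # assumed task interface
            opt.zero_grad()
            loss.backward()
            opt.step()
        adapted = model.state_dict()
        for name in meta_weights:                        # Reptile: move toward adapted weights
            if meta_weights[name].dtype.is_floating_point:
                meta_weights[name] += meta_lr * (adapted[name] - meta_weights[name])
    model.load_state_dict(meta_weights)
```

Freezing a subset of layers keeps task-agnostic features fixed across tasks while the remaining layers absorb task-specific control, which is consistent with the fast adaptation the abstract reports.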
Abstract: Explainability is critical for deep learning applications in healthcare, which are mandated by legal regulations and responsibilities to provide interpretations to both patients and doctors. Explainable AI methods, such as feature importance via integrated gradients, model approximation via LIME, or neuron activation and layer conductance, provide interpretations for health risk predictions. In medical imaging diagnosis, disease classification usually achieves high accuracy, but the generated bounding boxes have much lower Intersection over Union (IoU). Various methods with self-supervised or semi-supervised learning strategies have been proposed, but few improvements have been achieved for bounding box generation. Previous work shows that bounding boxes generated by these methods are usually larger than the ground truth and mostly cover non-disease areas. This paper leverages post-hoc explainable AI methods to generate bounding boxes for chest X-ray diagnosis. We propose CrossEAI, which combines a heatmap and a gradient map to generate more targeted bounding boxes. By using a weighted average of Guided Backpropagation and Grad-CAM++, we generate bounding boxes that are closer to the ground truth. We evaluate our model on a chest X-ray dataset and achieve a significant improvement over the state-of-the-art model in the same setting, with a $9\%$ improvement averaged over all diseases and all IoU thresholds. Moreover, without using any ground-truth bounding box information for training, our model matches the overall performance of a model trained with $80\%$ of the ground-truth bounding box information.
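The fusion step the abstract describes, a weighted average of Guided Backpropagation and Grad-CAM++ followed by box extraction, can be sketched as below; the fusion weight, the threshold, and the largest-connected-component heuristic are assumptions for illustration, not the paper's exact settings.

```python
# Hedged sketch: fuse two saliency maps and extract a bounding box.
# alpha, thresh, and the largest-component rule are illustrative assumptions.
import numpy as np
from scipy import ndimage

def fuse_and_box(gbp: np.ndarray, gradcampp: np.ndarray,
                 alpha: float = 0.5, thresh: float = 0.5):
    # Normalize each map to [0, 1] before fusing
    norm = lambda m: (m - m.min()) / (m.max() - m.min() + 1e-8)
    fused = alpha * norm(gbp) + (1.0 - alpha) * norm(gradcampp)
    mask = fused >= thresh * fused.max()               # keep the most salient pixels
    labels, n = ndimage.label(mask)                    # connected components
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    largest = 1 + int(np.argmax(sizes))                # label of the largest component
    ys, xs = np.where(labels == largest)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())  # (x0, y0, x1, y1)
```

Because the box comes from where both maps agree, a smaller alpha or a higher threshold tightens the box, in the spirit of the paper's goal of avoiding oversized boxes over non-disease areas.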