Abstract:Text-to-image (T2I) generation models often struggle with multi-instance synthesis (MIS), where they must accurately depict multiple distinct instances in a single image based on complex prompts detailing individual features. Traditional MIS control methods for UNet architectures like SD v1.5/SDXL fail to adapt to DiT-based models like FLUX and SD v3.5, which rely on integrated attention between image and text tokens rather than text-image cross-attention. To enhance MIS in DiT, we first analyze the mixed attention mechanism in DiT. Our token-wise and layer-wise analysis of attention maps reveals a hierarchical response structure: instance tokens dominate early layers, background tokens in middle layers, and attribute tokens in later layers. Building on this observation, we propose a training-free approach for enhancing MIS in DiT-based models with hierarchical and step-layer-wise attention specialty tuning (AST). AST amplifies key regions while suppressing irrelevant areas in distinct attention maps across layers and steps, guided by the hierarchical structure. This optimizes multimodal interactions by hierarchically decoupling the complex prompts with instance-based sketches. We evaluate our approach using upgraded sketch-based layouts for the T2I-CompBench and customized complex scenes. Both quantitative and qualitative results confirm our method enhances complex layout generation, ensuring precise instance placement and attribute representation in MIS.
Abstract:Given the critical role of birds in ecosystems, Fine-Grained Bird Recognition (FGBR) has gained increasing attention, particularly in distinguishing birds within similar subcategories. Although Vision Transformer (ViT)-based methods often outperform Convolutional Neural Network (CNN)-based methods in FGBR, recent studies reveal that the limited receptive field of plain ViT model hinders representational richness and makes them vulnerable to scale variance. Thus, enhancing the multi-scale capabilities of existing ViT-based models to overcome this bottleneck in FGBR is a worthwhile pursuit. In this paper, we propose a novel framework for FGBR, namely Multi-scale Diverse Cues Modeling (MDCM), which explores diverse cues at different scales across various stages of a multi-scale Vision Transformer (MS-ViT) in an "Activation-Selection-Aggregation" paradigm. Specifically, we first propose a multi-scale cue activation module to ensure the discriminative cues learned at different stage are mutually different. Subsequently, a multi-scale token selection mechanism is proposed to remove redundant noise and highlight discriminative, scale-specific cues at each stage. Finally, the selected tokens from each stage are independently utilized for bird recognition, and the recognition results from multiple stages are adaptively fused through a multi-scale dynamic aggregation mechanism for final model decisions. Both qualitative and quantitative results demonstrate the effectiveness of our proposed MDCM, which outperforms CNN- and ViT-based models on several widely-used FGBR benchmarks.
Abstract:Designing effective reward functions in multi-agent reinforcement learning (MARL) is a significant challenge, often leading to suboptimal or misaligned behaviors in complex, coordinated environments. We introduce Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality (M3HF), a novel framework that integrates multi-phase human feedback of mixed quality into the MARL training process. By involving humans with diverse expertise levels to provide iterative guidance, M3HF leverages both expert and non-expert feedback to continuously refine agents' policies. During training, we strategically pause agent learning for human evaluation, parse feedback using large language models to assign it appropriately and update reward functions through predefined templates and adaptive weight by using weight decay and performance-based adjustments. Our approach enables the integration of nuanced human insights across various levels of quality, enhancing the interpretability and robustness of multi-agent cooperation. Empirical results in challenging environments demonstrate that M3HF significantly outperforms state-of-the-art methods, effectively addressing the complexities of reward design in MARL and enabling broader human participation in the training process.
Abstract:Designing effective reward functions in multi-agent reinforcement learning (MARL) is a significant challenge, often leading to suboptimal or misaligned behaviors in complex, coordinated environments. We introduce Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality ($\text{M}^3\text{HF}$), a novel framework that integrates multi-phase human feedback of mixed quality into the MARL training process. By involving humans with diverse expertise levels to provide iterative guidance, $\text{M}^3\text{HF}$ leverages both expert and non-expert feedback to continuously refine agents' policies. During training, we strategically pause agent learning for human evaluation, parse feedback using large language models to assign it appropriately and update reward functions through predefined templates and adaptive weight by using weight decay and performance-based adjustments. Our approach enables the integration of nuanced human insights across various levels of quality, enhancing the interpretability and robustness of multi-agent cooperation. Empirical results in challenging environments demonstrate that $\text{M}^3\text{HF}$ significantly outperforms state-of-the-art methods, effectively addressing the complexities of reward design in MARL and enabling broader human participation in the training process.
Abstract:Autonomous driving technology is rapidly evolving and becoming a pivotal element of modern automation systems. Effective decision-making and planning are essential to ensuring autonomous vehicles operate safely and efficiently in complex environments. This paper introduces a decision-making and planning framework for autonomous vehicles, leveraging dynamic programming (DP) for global path planning and quadratic programming (QP) for local trajectory optimization. The proposed approach utilizes S-T graphs to achieve both dynamic and static obstacle avoidance. A comprehensive vehicle dynamics model supports the control system, enabling precise path tracking and obstacle handling. Simulation studies are conducted to evaluate the system's performance in a variety of scenarios, including global path planning, static obstacle avoidance, and dynamic obstacle avoidance involving pedestrian interactions. The results confirm the effectiveness and robustness of the proposed decision-making and planning algorithms in navigating complex environments, demonstrating the feasibility of this approach for autonomous driving applications.
Abstract:Since the release of ChatGPT, large language models (LLMs) have demonstrated remarkable capabilities across various domains. A key challenge in developing these general capabilities is efficiently sourcing diverse, high-quality data. This becomes especially critical in reasoning-related tasks with sandbox checkers, such as math or code, where the goal is to generate correct solutions to specific problems with higher probability. In this work, we introduce Flaming-hot Initiation with Regular Execution (FIRE) sampling, a simple yet highly effective method to efficiently find good responses. Our empirical findings show that FIRE sampling enhances inference-time generation quality and also benefits training in the alignment stage. Furthermore, we explore how FIRE sampling improves performance by promoting diversity and analyze the impact of employing FIRE at different positions within a response.
Abstract:Sound Event Detection (SED) plays a vital role in comprehending and perceiving acoustic scenes. Previous methods have demonstrated impressive capabilities. However, they are deficient in learning features of complex scenes from heterogeneous dataset. In this paper, we introduce a novel dual-branch architecture named Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection (MTDA-HSED). The MTDA-HSED architecture employs the Mutual-Assistance Audio Adapter (M3A) to effectively tackle the multi-scenario problem and uses the Dual-Branch Mid-Fusion (DBMF) module to tackle the multi-granularity problem. Specifically, M3A is integrated into the BEATs block as an adapter to improve the BEATs' performance by fine-tuning it on the multi-scenario dataset. The DBMF module connects BEATs and CNN branches, which facilitates the deep fusion of information from the BEATs and the CNN branches. Experimental results show that the proposed methods exceed the baseline of mpAUC by \textbf{$5\%$} on the DESED and MAESTRO Real datasets. Code is available at https://github.com/Visitor-W/MTDA.
Abstract:Diffusion tensor imaging (DTI) holds significant importance in clinical diagnosis and neuroscience research. However, conventional model-based fitting methods often suffer from sensitivity to noise, leading to decreased accuracy in estimating DTI parameters. While traditional data-driven deep learning methods have shown potential in terms of accuracy and efficiency, their limited generalization to out-of-training-distribution data impedes their broader application due to the diverse scan protocols used across centers, scanners, and studies. This work aims to tackle these challenges and promote the use of DTI by introducing a data-driven optimization-based method termed DoDTI. DoDTI combines the weighted linear least squares fitting algorithm and regularization by denoising technique. The former fits DW images from diverse acquisition settings into diffusion tensor field, while the latter applies a deep learning-based denoiser to regularize the diffusion tensor field instead of the DW images, which is free from the limitation of fixed-channel assignment of the network. The optimization object is solved using the alternating direction method of multipliers and then unrolled to construct a deep neural network, leveraging a data-driven strategy to learn network parameters. Extensive validation experiments are conducted utilizing both internally simulated datasets and externally obtained in-vivo datasets. The results, encompassing both qualitative and quantitative analyses, showcase that the proposed method attains state-of-the-art performance in DTI parameter estimation. Notably, it demonstrates superior generalization, accuracy, and efficiency, rendering it highly reliable for widespread application in the field.
Abstract:We present Flow Matching for Reaction Coordinates (FMRC), a novel deep learning algorithm designed to identify optimal reaction coordinates (RC) in biomolecular reversible dynamics. FMRC is based on the mathematical principles of lumpability and decomposability, which we reformulate into a conditional probability framework for efficient data-driven optimization using deep generative models. While FMRC does not explicitly learn the well-established transfer operator or its eigenfunctions, it can effectively encode the dynamics of leading eigenfunctions of the system transfer operator into its low-dimensional RC space. We further quantitatively compare its performance with several state-of-the-art algorithms by evaluating the quality of Markov State Models (MSM) constructed in their respective RC spaces, demonstrating the superiority of FMRC in three increasingly complex biomolecular systems. Finally, we discuss its potential applications in downstream applications such as enhanced sampling methods and MSM construction.
Abstract:In the Sound Event Localization and Detection (SELD) task, Transformer-based models have demonstrated impressive capabilities. However, the quadratic complexity of the Transformer's self-attention mechanism results in computational inefficiencies. In this paper, we propose a network architecture for SELD called SELD-Mamba, which utilizes Mamba, a selective state-space model. We adopt the Event-Independent Network V2 (EINV2) as the foundational framework and replace its Conformer blocks with bidirectional Mamba blocks to capture a broader range of contextual information while maintaining computational efficiency. Additionally, we implement a two-stage training method, with the first stage focusing on Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses, and the second stage reintroducing the Source Distance Estimation (SDE) loss. Our experimental results on the 2024 DCASE Challenge Task3 dataset demonstrate the effectiveness of the selective state-space model in SELD and highlight the benefits of the two-stage training approach in enhancing SELD performance.