Abstract:Generating sound effects for videos often requires both artistic sound effects that diverge significantly from real-life sources and flexible control over the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation. Through automated evaluations and human studies, we demonstrate that MultiFoley successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods. Please see our project page for video results: https://ificl.github.io/MultiFoley/
Abstract:Real-world applications of stereo matching, such as autonomous driving, place stringent demands on both safety and accuracy. However, learning-based stereo matching methods inherently suffer from the loss of geometric structures in certain feature channels, creating a bottleneck in achieving precise detail matching. Additionally, these methods lack interpretability due to the black-box nature of deep learning. In this paper, we propose MoCha-V2, a novel learning-based paradigm for stereo matching. MoCha-V2 introduces the Motif Correlation Graph (MCG) to capture recurring textures, which are referred to as "motifs" within feature channels. These motifs reconstruct geometric structures and are learned in a more interpretable way. Subsequently, we integrate features from multiple frequency domains through wavelet inverse transformation. The resulting motif features are utilized to restore geometric structures in the stereo matching process. Experimental results demonstrate the effectiveness of MoCha-V2. MoCha-V2 achieved 1st place on the Middlebury benchmark at the time of its release. Code is available at https://github.com/ZYangChen/MoCha-Stereo.
Abstract:Medical images often exhibit distribution shifts due to variations in imaging protocols and scanners across different medical centers. Domain Generalization (DG) methods aim to train models on source domains that can generalize to unseen target domains. Recently, the segment anything model (SAM) has demonstrated strong generalization capabilities due to its prompt-based design, and has gained significant attention in image segmentation tasks. Existing SAM-based approaches attempt to address the need for manual prompts by introducing prompt generators that automatically generate these prompts. However, we argue that auto-generated prompts may not be sufficiently accurate under distribution shifts, potentially leading to incorrect predictions that still require manual verification and correction by clinicians. To address this challenge, we propose a method for 2D medical image segmentation called Self-Correcting SAM (CoSAM). Our approach begins by generating coarse masks using SAM in a prompt-free manner, providing prior prompts for the subsequent stages, and eliminating the need for prompt generators. To automatically refine these coarse masks, we introduce a generalized error decoder that simulates the correction process typically performed by clinicians. Furthermore, we generate diverse prompts as feedback based on the corrected masks, which are used to iteratively refine the predictions within a self-correcting loop, enhancing the generalization performance of our model. Extensive experiments on two medical image segmentation benchmarks across multiple scenarios demonstrate the superiority of CoSAM over state-of-the-art SAM-based methods.
Abstract:How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks--which, differing from algorithms, are more flexible and can support different algorithms--including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.
Abstract:Question Answering (QA) systems face challenges in handling complex questions that require multi-domain knowledge synthesis. Naive RAG models, although effective in information retrieval, struggle with complex questions that require comprehensive and in-depth answers. The pioneering task is defined as explanatory answer generation, which entails handling identified challenges such as the requirement for comprehensive information and logical coherence within the generated context. To address these issues, we refer to systematic thinking theory and propose SynthRAG, an innovative framework designed to enhance QA performance. SynthRAG improves on conventional models by employing adaptive outlines for dynamic content structuring, generating systematic information to ensure detailed coverage, and producing customized answers tailored to specific user inquiries. This structured approach guarantees logical coherence and thorough integration of information, yielding responses that are both insightful and methodically organized. Empirical evaluations underscore SynthRAG's effectiveness, demonstrating its superiority in handling complex questions, overcoming the limitations of naive RAG models, and significantly improving answer quality and depth. Furthermore, an online deployment on the Zhihu platform revealed that SynthRAG's answers achieved notable user engagement, with each response averaging 5.73 upvotes and surpassing the performance of 79.8% of human contributors, highlighting the practical relevance and impact of the proposed framework. Our code is available at https://github.com/czy1999/SynthRAG.
Abstract:Distribution shifts widely exist in medical images acquired from different medical centers, hindering the deployment of semantic segmentation models trained on data from one center (source domain) to another (target domain). While unsupervised domain adaptation (UDA) has shown significant promise in mitigating these shifts, it poses privacy risks due to sharing data between centers. To facilitate adaptation while preserving data privacy, source-free domain adaptation (SFDA) and test-time adaptation (TTA) have emerged as effective paradigms, relying solely on target domain data. However, the scenarios currently addressed by SFDA and TTA are limited, making them less suitable for clinical applications. In a more realistic clinical scenario, the pre-trained model is deployed in a medical center to assist with clinical tasks during the day and rests at night. During the daytime process, TTA can be employed to enhance inference performance. During the nighttime process, after collecting the test data from the day, the model can be fine-tuned utilizing SFDA to further adapt to the target domain. With the above insights, we propose a novel adaptation framework called Day-Night Adaptation (DyNA). This framework adapts the model to the target domain through day-night loops without requiring access to source data. Specifically, we implement distinct adaptation strategies for daytime and nighttime to better meet the demands of clinical settings. During the daytime, model parameters are frozen, and a specific low-frequency prompt is trained for each test sample. Additionally, we construct a memory bank for prompt initialization and develop a warm-up mechanism to enhance prompt training. During nighttime, we integrate a global student model into the traditional teacher-student self-training paradigm to fine-tune the model while ensuring training stability...
Abstract:Precision pose detection is increasingly demanded in fields such as personal fabrication, Virtual Reality (VR), and robotics due to its critical role in ensuring accurate positioning information. However, conventional vision-based systems used in these fields often struggle with achieving high precision and accuracy, particularly when dealing with complex environments or fast-moving objects. To address these limitations, we investigate Laser Speckle Imaging (LSI), an emerging optical tracking method that offers promising potential for improving pose estimation accuracy. Specifically, our proposed LSI-Based Tracking (SpecTrack) leverages the captures from a lensless camera and a retro-reflector marker with a coded aperture to achieve multi-axis rotational pose estimation with high precision. Our extensive trials using our in-house built testbed have shown that SpecTrack achieves an accuracy of 0.31° (std=0.43°), significantly outperforming state-of-the-art approaches and improving accuracy by up to 200%.
Abstract:Universal segmentation models offer significant potential in addressing a wide range of tasks by effectively leveraging discrete annotations. As the scope of tasks and modalities expands, it becomes increasingly important to generate and strategically position task- and modal-specific priors within the universal model. However, existing universal models often overlook the correlations between different priors, and the optimal placement and frequency of these priors remain underexplored. In this paper, we introduce MedUniSeg, a prompt-driven universal segmentation model designed for 2D and 3D multi-task segmentation across diverse modalities and domains. MedUniSeg employs multiple modal-specific prompts alongside a universal task prompt to accurately characterize the modalities and tasks. To generate the related priors, we propose the modal map (MMap) and the fusion and selection (FUSE) modules, which transform modal and task prompts into corresponding priors. These modal and task priors are systematically introduced at the start and end of the encoding process. We evaluate MedUniSeg on a comprehensive multi-modal upstream dataset consisting of 17 sub-datasets. The results demonstrate that MedUniSeg achieves superior multi-task segmentation performance, attaining a 1.2% improvement in the mean Dice score across the 17 upstream tasks compared to nnUNet baselines, while using less than 1/10 of the parameters. For tasks that underperform during the initial multi-task joint training, we freeze MedUniSeg and introduce new modules to re-learn these tasks. This approach yields an enhanced version, MedUniSeg*, which consistently outperforms MedUniSeg across all tasks. Moreover, MedUniSeg surpasses advanced self-supervised and supervised pre-trained models on six downstream tasks, establishing itself as a high-quality, highly generalizable pre-trained segmentation model.
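The MedUniSeg abstract above reports its gains as a mean Dice score. As a point of reference, the following is a minimal sketch of the standard Dice coefficient, 2|A∩B| / (|A| + |B|), on binary masks, averaged over several cases. The function names, the flat-list mask representation, and the convention for two empty masks are assumptions made for illustration; the abstract does not describe the authors' evaluation code.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count positions where both masks are foreground.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground size of both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_dice(pairs):
    """Average Dice over (prediction, ground truth) pairs, e.g. one pair per task."""
    scores = [dice_score(p, t) for p, t in pairs]
    return sum(scores) / len(scores)
```

For example, `dice_score([1, 1, 0, 0], [1, 0, 0, 0])` has one overlapping pixel and three foreground pixels in total, giving 2/3.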
Abstract:Large language models (LLMs), such as the GPT series and the Llama series, have demonstrated strong capabilities in natural language processing, contextual understanding, and text generation. In recent years, researchers have been trying to enhance the abilities of LLMs in performing various tasks, and numerous studies have shown that well-designed prompts can significantly improve the performance of LLMs on these tasks. This study compares the language style imitation ability of three different large language models under the guidance of the same zero-shot prompt. It also compares the imitation ability of the same large language model when guided by three different prompts individually. Additionally, by applying a Tree-of-Thoughts (ToT) Prompting method to Llama 3, a conversational AI with the language style of a real person was created. In this study, three evaluation methods were used to evaluate LLMs and prompts. The results show that Llama 3 performs best at imitating language styles, and that the ToT prompting method is the most effective at guiding it in imitating language styles. Using a ToT framework, Llama 3 was guided to interact with users in the language style of a specific individual without altering its core parameters, thereby creating a text-based conversational AI that reflects the language style of the individual.
Abstract:Parameter-efficient fine-tuning (PEFT) techniques have emerged to address issues of overfitting and high computational costs associated with full fine-tuning in the paradigm of self-supervised learning. Mainstream methods based on PEFT involve adding a few trainable parameters while keeping the pre-trained parameters of the backbone fixed. These methods achieve comparable, and often superior, performance to full fine-tuning, demonstrating the powerful representation ability of the pre-trained backbone. Despite their success, these methods typically ignore the initialization of the new parameters, often relying solely on random initialization. We argue that if pre-training is significantly beneficial, it should be applied to all parameters requiring representational capacity. Motivated by this insight, we propose a simple yet effective fine-tuning framework based on Target Parameter Pre-training (TPP). The target parameters refer to the new parameters introduced during fine-tuning. TPP includes an additional stage before PEFT to pre-train these target parameters. During this stage, the pre-trained backbone parameters are frozen, and only the target parameters are trainable. A defined pretext task is used to encourage the target parameters to learn specific representations of downstream data. When PEFT is subsequently employed, the pre-trained target parameters are loaded to enhance fine-tuning efficiency. The proposed TPP framework is versatile, allowing for the integration of various pretext tasks for pre-training and supporting different PEFT methods as backbones. We evaluated the fine-tuning performance of our method using five public datasets, including three modalities and two task types. The results demonstrate that the proposed TPP can be easily integrated into existing PEFT methods, significantly improving performance.
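The two-stage idea in the TPP abstract above (pre-train only the newly added target parameters with the backbone frozen, then start PEFT from them) can be illustrated with a toy sketch. Everything below is hypothetical and chosen only for illustration: the one-weight "backbone", the additive "target" weight, the pretext targets, and the learning rate are not taken from the paper, and the lower starting loss follows from how this toy pretext task was picked rather than from any reported result.

```python
def predict(params, x):
    # Toy model: a frozen backbone weight plus an additive target (adapter) weight.
    return (params["backbone"] + params["target"]) * x


def mse(params, data):
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)


def train(params, data, trainable, lr=0.05, steps=100):
    """Gradient descent on MSE that updates only the weights named in `trainable`."""
    p = dict(params)
    for _ in range(steps):
        # In this additive toy model the MSE gradient is the same for both weights.
        grad = sum(2 * (predict(p, x) - y) * x for x, y in data) / len(data)
        for name in trainable:
            p[name] -= lr * grad
    return p


xs = [1.0, 2.0, 3.0]
pretext_data = [(x, 1.5 * x) for x in xs]     # hypothetical pretext targets
downstream_data = [(x, 2.0 * x) for x in xs]  # hypothetical downstream targets

init = {"backbone": 1.0, "target": 0.0}       # zero init stands in for random init

# Stage 1: pre-train the target parameter on the pretext task, backbone frozen.
stage1 = train(init, pretext_data, trainable=["target"])

# Stage 2: PEFT on the downstream task, starting from the pre-trained target parameter.
loss_plain_init = mse(init, downstream_data)
loss_tpp_init = mse(stage1, downstream_data)
final = train(stage1, downstream_data, trainable=["target"])
```

Because `"backbone"` never appears in `trainable`, its value is unchanged after both stages, which mirrors the frozen-backbone constraint described in the abstract.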