Abstract:Scoliosis is a spinal curvature disorder that is difficult to detect early and can compress the chest cavity, impairing respiratory function and cardiac health. Especially for adolescents, delayed detection and treatment result in worsening compression. Traditional scoliosis detection methods rely heavily on clinical expertise, and X-ray imaging poses radiation risks, limiting large-scale early screening. Inspired by ScoNet-MT's pioneering use of gait patterns for scoliosis detection, we propose an Attention-Guided Deep Multi-Instance Learning method (Gait-MIL) to effectively capture discriminative features from gait patterns. We evaluate our method on the first large-scale gait-based dataset for scoliosis classification. The results demonstrate that our method improves the performance of gait as a biomarker for scoliosis detection and significantly enhances detection accuracy for the particularly challenging Neutral cases, where subtle indicators are often overlooked. Gait-MIL also performs robustly in imbalanced scenarios, making it a promising tool for large-scale scoliosis screening.
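To make the mechanism concrete, the following is a minimal PyTorch sketch of attention-based multi-instance pooling, the generic building block behind attention-guided MIL methods such as Gait-MIL; the paper's actual architecture, feature dimensions, and backbone are not given in the abstract, so all sizes and the two-class head below are illustrative assumptions.

```python
# Hedged sketch of attention-based multi-instance pooling; layer sizes,
# feature dimension, and the binary head are assumptions, not the paper's design.
import torch
import torch.nn as nn

class AttentionMILPool(nn.Module):
    def __init__(self, feat_dim=256, attn_dim=128, num_classes=2):
        super().__init__()
        # Attention scorer assigns a weight to each instance (e.g., one gait cycle or clip).
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, instance_feats):
        # instance_feats: (num_instances, feat_dim) features of one subject's gait instances.
        scores = self.attn(instance_feats)                 # (num_instances, 1)
        weights = torch.softmax(scores, dim=0)             # normalize over instances
        bag_feat = (weights * instance_feats).sum(dim=0)   # attention-weighted bag embedding
        return self.classifier(bag_feat), weights

# Example: 12 gait instances, each a 256-d feature from some backbone network.
logits, attn_weights = AttentionMILPool()(torch.randn(12, 256))
```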
Abstract:Referring expression comprehension (REC) aims at achieving object localization based on natural language descriptions. However, existing REC approaches are constrained by object category descriptions and single-attribute intention descriptions, hindering their application in real-world scenarios. In natural human-robot interactions, users often express their desires through individual states and intentions, accompanied by guiding gestures, rather than detailed object descriptions. To address this challenge, we propose Multi-ref EC, a novel task framework that integrates state descriptions, derived intentions, and embodied gestures to locate target objects. We introduce the State-Intention-Gesture Attributes Reference (SIGAR) dataset, which combines state and intention expressions with embodied references. Through extensive experiments with various baseline models on SIGAR, we demonstrate that properly ordered multi-attribute references contribute to improved localization performance, revealing that a single-attribute reference is insufficient for natural human-robot interaction scenarios. Our findings underscore the importance of multi-attribute reference expressions in advancing visual-language understanding.
Abstract:The rapid development of large language models (LLMs) has brought significant attention to speech models, particularly recent progress in speech2speech protocols supporting speech input and output. However, existing benchmarks, which adopt automatic text-based evaluators to assess the instruction-following ability of these models, lack consideration of paralinguistic information in both speech understanding and generation. To address these issues, we introduce S2S-Arena, a novel arena-style S2S benchmark that evaluates instruction-following capabilities with paralinguistic information in both speech-in and speech-out across real-world tasks. We design 154 samples that fuse TTS and live recordings across four domains with 21 tasks and manually evaluate existing popular speech models in an arena-style manner. The experimental results show that: (1) in addition to the superior performance of GPT-4o, the speech model cascading ASR, LLM, and TTS outperforms the jointly trained model after text-speech alignment in speech2speech protocols; (2) considering paralinguistic information, the knowledgeability of a speech model mainly depends on its LLM backbone, while its multilingual support is limited by the speech module; (3) excellent speech models can already understand the paralinguistic information in speech input, but generating appropriate audio with paralinguistic information is still a challenge.
Abstract:User simulators are crucial for replicating human interactions with dialogue systems, supporting both collaborative training and automatic evaluation, especially for large language models (LLMs). However, existing simulators often rely solely on text utterances, missing implicit user traits such as personality, speaking style, and goals. In contrast, persona-based methods lack generalizability, as they depend on predefined profiles of famous individuals or archetypes. To address these challenges, we propose User Simulator with implicit Profiles (USP), a framework that infers implicit user profiles from human-machine conversations and uses them to generate more personalized and realistic dialogues. We first develop an LLM-driven extractor with a comprehensive profile schema. Then, we refine the simulation through conditional supervised fine-tuning and reinforcement learning with cycle consistency, optimizing it at both the utterance and conversation levels. Finally, we adopt a diverse profile sampler to capture the distribution of real-world user profiles. Experimental results demonstrate that USP outperforms strong baselines in terms of authenticity and diversity while achieving comparable performance in consistency. Furthermore, dynamic multi-turn evaluations based on USP strongly align with mainstream benchmarks, demonstrating its effectiveness in real-world applications.
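As an illustration of the cycle-consistency idea mentioned above, here is a small, hedged sketch in which a profile is re-extracted from the simulated conversation and rewarded for agreeing with the conditioning profile; the actual reward, profile schema, and extractor used by USP are not specified in the abstract, so the Profile fields and field-matching score below are purely hypothetical.

```python
# Hedged sketch of a cycle-consistency reward; the fields and matching rule
# are illustrative assumptions, not USP's actual schema or objective.
from dataclasses import dataclass

@dataclass
class Profile:
    personality: str
    speaking_style: str
    goal: str

def cycle_consistency_reward(conditioning: Profile, extracted: Profile) -> float:
    # Fraction of profile fields the extractor recovers from the simulated dialogue.
    fields = ("personality", "speaking_style", "goal")
    matches = sum(getattr(conditioning, f) == getattr(extracted, f) for f in fields)
    return matches / len(fields)

# Example: two of three implicit traits survive the simulate-then-extract cycle.
reward = cycle_consistency_reward(
    Profile("curious", "concise", "book a flight"),
    Profile("curious", "verbose", "book a flight"))
assert abs(reward - 2 / 3) < 1e-9
```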
Abstract:Traditional sequential recommendation (SR) methods heavily rely on explicit item IDs to capture user preferences over time. This reliance introduces critical limitations in cold-start scenarios and domain transfer tasks, where unseen items and new contexts often lack established ID mappings. To overcome these limitations, recent studies have shifted towards leveraging text-only information for recommendation, thereby improving model generalization and adaptability across domains. Although promising, text-based SR faces unique difficulties: items' text descriptions often share semantic similarities that lead to clustered item representations, compromising their uniformity, a property essential for promoting diversity and enhancing generalization in recommendation systems. In this paper, we explore a novel framework to improve the uniformity of item representations in text-based SR. Our analysis reveals that items within a sequence exhibit marked semantic similarity, meaning they are closer in representation space than items overall, and that this effect is more pronounced for less popular items, which form tighter clusters than their more popular counterparts. Based on these findings, we propose UniT, a framework that employs three pairwise item sampling strategies: Unified General Sampling Strategy, Sequence-Driven Sampling Strategy, and Popularity-Driven Sampling Strategy. Each strategy applies varying degrees of repulsion to selectively adjust the distances between item pairs, thereby refining representation uniformity while considering both sequence context and item popularity. Extensive experiments on multiple real-world datasets demonstrate that our proposed approach outperforms state-of-the-art models, validating the effectiveness of UniT in enhancing both representation uniformity and recommendation accuracy. The source code is available at https://github.com/ccwwhhh/Model-Rec.
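To illustrate the kind of pairwise repulsion UniT's sampling strategies could drive, here is a hedged PyTorch sketch of a uniformity regularizer applied to sampled item pairs; the paper's exact sampling rules, loss form, and weights are not given in the abstract and are assumed here for illustration only.

```python
# Hedged sketch of a pairwise repulsion (uniformity) regularizer on sampled item pairs;
# the sampling policy and loss form are assumptions, not UniT's published objective.
import torch
import torch.nn.functional as F

def uniformity_loss(emb_a, emb_b, t=2.0):
    # Wang & Isola-style uniformity: pushes paired embeddings apart on the unit sphere.
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    sq_dist = (emb_a - emb_b).pow(2).sum(dim=-1)
    return torch.exp(-t * sq_dist).mean().log()

def sample_pairs(item_emb, index_pairs):
    # index_pairs: (num_pairs, 2) item indices chosen by some strategy
    # (general, sequence-driven, or popularity-driven in UniT's terminology).
    return item_emb[index_pairs[:, 0]], item_emb[index_pairs[:, 1]]

# Example: repel 64 sampled pairs drawn from a 1000-item embedding table.
item_emb = torch.randn(1000, 64, requires_grad=True)
pairs = torch.randint(0, 1000, (64, 2))
loss = uniformity_loss(*sample_pairs(item_emb, pairs))
loss.backward()
```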
Abstract:Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected (or dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, a phenomenon known as likelihood displacement. To tackle this challenge, in this work we introduce DPO-Shift to controllably shift the distribution of the chosen probability. We then show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win-rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.
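For intuition, the sketch below shows a standard DPO loss with a hypothetical shift factor lam scaling the rejected-response term, illustrating how the chosen-probability distribution could be shifted controllably; this is not the authors' formula, since the abstract does not state the DPO-Shift objective, and lam and its placement are assumptions.

```python
# Hedged sketch: standard DPO loss plus a hypothetical shift factor "lam" on the
# rejected-response term. The paper's actual objective is not given in the abstract;
# lam and where it enters the loss are assumptions, not the authors' formula.
import torch
import torch.nn.functional as F

def dpo_shift_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
                   beta=0.1, lam=1.0):
    # Log-ratio "implicit rewards" of the policy relative to the reference model.
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    # lam < 1 down-weights the rejected term, easing the pressure that can push the
    # chosen log-probability down (likelihood displacement); lam = 1 recovers DPO.
    return -F.logsigmoid(r_chosen - lam * r_rejected).mean()

# Example with sequence log-probabilities for a small batch.
loss = dpo_shift_loss(torch.tensor([-12.3, -9.8]), torch.tensor([-11.1, -10.5]),
                      torch.tensor([-12.0, -10.0]), torch.tensor([-11.5, -10.2]))
```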
Abstract:CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g., zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. IDEA captures fine-grained features by leveraging both the visual features and the textual descriptions of images. It is training-free for CLIP, and it is comparable to, or even exceeds, state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA with two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model's performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for the images of the 11 datasets, resulting in a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are released at https://github.com/FourierAI/IDEA.
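As a rough illustration of a training-free, cache-based adapter in the spirit of IDEA, the sketch below lets few-shot image features and the CLIP text features of their generated descriptions jointly vote for the label of a test image; the actual fusion rule, hyperparameters, and feature dimensions used by IDEA are not specified in the abstract, so the similarity blend here is an assumption.

```python
# Hedged sketch of a training-free, cache-based few-shot classifier; the blend of
# image and description similarities and all hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def training_free_predict(test_img_feat, shot_img_feats, shot_desc_feats,
                          shot_labels, class_text_feats, alpha=0.5, beta=5.0):
    # All inputs are assumed to be L2-normalized CLIP embeddings.
    zero_shot_logits = test_img_feat @ class_text_feats.T            # (num_classes,)
    img_sim = test_img_feat @ shot_img_feats.T                       # (num_shots,)
    desc_sim = test_img_feat @ shot_desc_feats.T                     # (num_shots,)
    affinity = torch.exp(beta * ((img_sim + desc_sim) / 2 - 1))      # sharpen blended similarity
    one_hot = F.one_hot(shot_labels, class_text_feats.shape[0]).float()
    cache_logits = affinity @ one_hot                                # votes from the few-shot cache
    return zero_shot_logits + alpha * cache_logits

# Example: 16 labelled shots, 10 classes, 512-d CLIP features (dimensions assumed).
feats = lambda *s: F.normalize(torch.randn(*s), dim=-1)
pred = training_free_predict(feats(512), feats(16, 512), feats(16, 512),
                             torch.randint(0, 10, (16,)), feats(10, 512))
```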
Abstract:Exploring the functions of genes and gene products is crucial to a wide range of fields, including medical research, evolutionary biology, and environmental science. However, discovering new functions largely relies on expensive and exhaustive wet-lab experiments. Existing methods of automatic function annotation or prediction mainly focus on protein function prediction using sequences, 3D structures, or protein family information. In this study, we propose to tackle the gene function prediction problem by exploring the Gene Ontology graph and annotations with BERT (GoBERT) to decipher the underlying relationships among gene functions. Our proposed novel function prediction task uses existing functions as inputs and generalizes function prediction to genes and gene products. Specifically, two pre-training tasks are designed to jointly train GoBERT to capture both explicit and implicit relations among functions. Neighborhood prediction is a self-supervised multi-label classification task that captures explicit function relations, while a specified masking-and-recovering task helps GoBERT find implicit patterns among functions. The pre-trained GoBERT possesses the ability to predict novel functions for various genes and gene products based on known functional annotations. Extensive experiments, biological case studies, and ablation studies demonstrate the superiority of our proposed GoBERT.
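To make the two pre-training signals concrete, here is a hedged sketch using a tiny Transformer encoder over sequences of GO-term IDs, with a masked-term recovery head and a multi-label neighborhood head; GoBERT's real architecture, vocabulary, and neighborhood definition are not given in the abstract, so every size and target construction below is illustrative.

```python
# Hedged sketch of masked-recovery and neighborhood-prediction pre-training over GO-term
# sequences; model sizes and targets are assumptions, not GoBERT's actual configuration.
import torch
import torch.nn as nn

class TinyGoEncoder(nn.Module):
    def __init__(self, num_terms=1000, dim=128):
        super().__init__()
        self.mask_id = num_terms                               # extra id used as [MASK]
        self.embed = nn.Embedding(num_terms + 1, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mask_head = nn.Linear(dim, num_terms)             # recover masked GO terms
        self.neighbor_head = nn.Linear(dim, num_terms)         # multi-label neighborhood prediction

    def forward(self, term_ids):
        h = self.encoder(self.embed(term_ids))                 # (batch, seq, dim)
        pooled = h.mean(dim=1)                                 # pooled annotation embedding
        return self.mask_head(h), self.neighbor_head(pooled)

model = TinyGoEncoder()
term_ids = torch.randint(0, 1000, (2, 8))                      # two annotation sets of 8 GO terms
masked = term_ids.clone()
masked[:, 0] = model.mask_id                                   # hide the first term of each set
mask_logits, neighbor_logits = model(masked)
# Masked-recovery loss on the hidden position + multi-label BCE against neighborhood targets
# (here, naively, the annotation's own terms stand in for the graph neighborhood).
mlm_loss = nn.functional.cross_entropy(mask_logits[:, 0], term_ids[:, 0])
neighbor_targets = torch.zeros_like(neighbor_logits).scatter_(1, term_ids, 1.0)
nbr_loss = nn.functional.binary_cross_entropy_with_logits(neighbor_logits, neighbor_targets)
```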
Abstract:Human trust plays a crucial role in the effectiveness of human-robot collaboration. Despite its significance, the development and maintenance of an optimal trust level are obstructed by the complex nature of the influencing factors and their mechanisms. This study investigates the effects of cognitive load on human trust within the context of a hybrid human-robot collaboration task. An experiment is conducted in which humans and a robot, acting as team members, collaboratively construct pyramids at differentiated levels of task complexity. Our findings reveal that cognitive load exerts diverse impacts on human trust in the robot. Notably, human trust increases under conditions of high cognitive load. Furthermore, the rewards for performance are substantially higher in tasks with high cognitive load than in those with low cognitive load, and a significant correlation exists between human trust and the failure risk of performance in tasks with low and medium cognitive load. By integrating interdependent task steps, this research emphasizes the unique dynamics of hybrid human-robot collaboration scenarios. The insights gained not only contribute to understanding how cognitive load influences trust but also assist developers in optimizing collaborative target selection and designing more effective human-robot interfaces in such environments.
Abstract:Composed Image Retrieval (CIR) involves retrieving a target image based on a composed query of an image paired with text that specifies modifications or changes to the visual reference. CIR is inherently an instruction-following task, as the model needs to interpret and apply modifications to the image. In practice, due to the scarcity of annotated data in downstream tasks, Zero-Shot CIR (ZS-CIR) is desirable. While existing ZS-CIR models based on CLIP have shown promising results, their capability in interpreting and following modification instructions remains limited. Some research attempts to address this by incorporating Large Language Models (LLMs). However, these approaches still face challenges in effectively integrating multimodal information and instruction understanding. To tackle the above challenges, we propose a novel embedding method that utilizes an instruction-tuned Multimodal LLM (MLLM) to generate composed representations, which significantly enhances the instruction-following capability and enables comprehensive integration of images and instructions. Nevertheless, directly applying MLLMs introduces a new challenge, since MLLMs are primarily designed for text generation rather than embedding extraction as required in CIR. To address this, we introduce a two-stage training strategy that first learns a joint multimodal embedding space efficiently and then refines the ability to follow modification instructions by tuning the model on a triplet dataset in the CIR format. Extensive experiments on four public datasets (FashionIQ, CIRR, GeneCIS, and CIRCO) demonstrate the superior performance of our model, outperforming state-of-the-art baselines by a significant margin. Codes are available at the GitHub repository.
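As a sketch of the retrieval objective the second training stage suggests, the code below tunes a composed (reference image + instruction) embedding to match the target-image embedding on CIR-style triplets with in-batch negatives; how the MLLM embedding is actually extracted and the paper's exact loss are not stated in the abstract, so the InfoNCE form and dimensions are assumptions.

```python
# Hedged sketch of a triplet-style retrieval loss on composed embeddings; the InfoNCE
# form, temperature, and embedding dimension are assumptions, not the paper's recipe.
import torch
import torch.nn.functional as F

def composed_retrieval_loss(composed_emb, target_emb, temperature=0.07):
    # composed_emb: (batch, dim) embeddings of (reference image, modification text) pairs
    # target_emb:   (batch, dim) embeddings of the corresponding target images
    composed_emb = F.normalize(composed_emb, dim=-1)
    target_emb = F.normalize(target_emb, dim=-1)
    logits = composed_emb @ target_emb.T / temperature   # other targets act as in-batch negatives
    labels = torch.arange(composed_emb.size(0))
    return F.cross_entropy(logits, labels)

# Example with 8 triplets and 768-d embeddings (dimension assumed).
loss = composed_retrieval_loss(torch.randn(8, 768), torch.randn(8, 768))
```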