Abstract:Despite advancements in text-to-image models, generating images that precisely align with textual descriptions remains challenging due to misalignment in training data. In this paper, we analyze the critical role of caption precision and recall in text-to-image model training. Our analysis of human-annotated captions shows that both precision and recall are important for text-image alignment, but precision has a more significant impact. Leveraging these insights, we utilize Large Vision Language Models to generate synthetic captions for training. Models trained with these synthetic captions show similar behavior to those trained on human-annotated captions, underscoring the potential of synthetic data in text-to-image training.
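For concreteness, caption precision and recall in the sense used above can be pictured by treating a caption and an image's ground-truth annotations as sets of visual concepts. This is a hedged illustration only; the `caption_precision_recall` helper below is not the paper's measurement protocol.

```python
# Hedged illustration: precision = fraction of caption concepts that are actually
# in the image; recall = fraction of image concepts the caption mentions.
def caption_precision_recall(caption_concepts: set, image_concepts: set):
    correct = caption_concepts & image_concepts
    precision = len(correct) / len(caption_concepts) if caption_concepts else 0.0
    recall = len(correct) / len(image_concepts) if image_concepts else 0.0
    return precision, recall

# Example: a caption mentioning {"dog", "frisbee", "beach"} for an image
# annotated with {"dog", "frisbee", "grass", "tree"} scores roughly (0.67, 0.5):
# one wrong detail lowers precision; unmentioned content lowers recall.
print(caption_precision_recall({"dog", "frisbee", "beach"},
                               {"dog", "frisbee", "grass", "tree"}))
```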
Abstract:Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for downstream tasks. However, the lack of compositional diversity in contemporary image-text datasets limits the compositional reasoning ability of CLIP. We show that generating ``hard'' negative captions via in-context learning and synthesizing corresponding negative images with text-to-image generators offers a solution. We introduce a novel contrastive pre-training strategy that leverages these hard negative captions and images in an alternating fashion to train CLIP. We demonstrate that our method, named TripletCLIP, when applied to existing datasets such as CC3M and CC12M, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark under an equal computational budget, as well as improvements in zero-shot image classification and image retrieval. Our code, models, and data are available at: https://tripletclip.github.io
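As a rough illustration of how hard negative captions can enter a CLIP-style objective, the sketch below appends each image's hard negative caption embedding as an extra candidate in the contrastive softmax. It is a minimal sketch assuming L2-normalized embeddings, not the TripletCLIP training code, and the alternating negative-image branch is omitted.

```python
# Minimal sketch (not the TripletCLIP implementation): a CLIP-style contrastive
# loss extended with one hard negative caption per image. All embeddings are
# assumed to be L2-normalized, with shape (batch, dim).
import torch
import torch.nn.functional as F

def clip_loss_with_hard_negatives(image_emb, text_emb, hard_neg_text_emb, temperature=0.07):
    logits_per_image = image_emb @ text_emb.t() / temperature                     # (B, B)
    hard_neg_logits = (image_emb * hard_neg_text_emb).sum(-1, keepdim=True) / temperature  # (B, 1)
    # Append each image's hard negative caption as an extra (wrong) candidate.
    logits = torch.cat([logits_per_image, hard_neg_logits], dim=1)                # (B, B+1)
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits_per_image.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```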
Abstract:Large language models (LLMs) have shown remarkable capabilities in natural language processing; however, they still face difficulties when tasked with understanding lengthy contexts and executing effective question answering. These challenges often arise due to the complexity and ambiguity present in longer texts. To enhance the performance of LLMs in such scenarios, we introduce the Long Question Coreference Adaptation (LQCA) method. This innovative framework focuses on coreference resolution tailored to long contexts, allowing the model to identify and manage references effectively. The LQCA method encompasses four key steps: resolving coreferences within sub-documents, computing the distances between mentions, defining a representative mention for coreference, and answering questions through mention replacement. By processing information systematically, the framework provides easier-to-handle partitions for LLMs, promoting better understanding. Experimental evaluations on a range of LLMs and datasets have yielded positive results, with notable improvements on the OpenAI-o1-mini and GPT-4o models, highlighting the effectiveness of leveraging coreference resolution to bridge context gaps in question answering.
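The four LQCA steps listed above can be pictured with the simplified sketch below. `resolve_coreference` and `ask_llm` are assumed external callables (a coreference model and an LLM client), and the distance-based cluster merging is reduced to choosing the longest mention as the representative, so this is an illustration rather than the authors' pipeline.

```python
# Hypothetical, simplified sketch of the four LQCA-style steps; not the paper's code.
def lqca_answer(document, question, resolve_coreference, ask_llm, chunk_size=2000):
    # 1) Resolve coreferences inside manageable sub-documents.
    starts = range(0, len(document), chunk_size)
    clusters = []  # each cluster: list of (char_offset, mention_text) in document coordinates
    for start in starts:
        chunk = document[start:start + chunk_size]
        for cluster in resolve_coreference(chunk):
            clusters.append([(start + off, text) for off, text in cluster])

    # 2) + 3) For each cluster, pick the longest (most explicit) mention as the
    # representative; the paper's distance-based cross-chunk merging is omitted here.
    replacements = []
    for cluster in clusters:
        representative = max((text for _, text in cluster), key=len)
        replacements += [(off, text, representative) for off, text in cluster]

    # 4) Replace mentions with their representative (back-to-front so offsets stay
    # valid), then answer the question over the resolved text.
    resolved = document
    for off, text, rep in sorted(replacements, reverse=True):
        resolved = resolved[:off] + rep + resolved[off + len(text):]
    return ask_llm(resolved, question)
```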
Abstract:This paper introduces a novel family of deep dynamical models designed to represent continuous-time sequence data. This family of models generates each data point in the time series by a neural emission model, which is a non-linear transformation of a latent state vector. The trajectory of the latent states is implicitly described by a neural ordinary differential equation (ODE), with the initial state following an informative prior distribution parameterized by an energy-based model. Furthermore, we can extend this model to disentangle dynamic states from underlying static factors of variation, represented as time-invariant variables in the latent space. We train the model using maximum likelihood estimation with Markov chain Monte Carlo (MCMC) in an end-to-end manner, without requiring additional assisting components such as an inference network. Our experiments on oscillating systems, videos, and real-world state sequences (MuJoCo) illustrate that ODEs with the learnable energy-based prior outperform existing counterparts and can generalize to new dynamics parameterizations, enabling long-horizon predictions.
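A minimal sketch of the generative structure described above, with several assumptions: the latent ODE is Euler-integrated, the energy-based prior is sampled with short-run Langevin dynamics, and the maximum-likelihood/MCMC training loop is omitted.

```python
# Sketch of the generative path only (assumptions, not the paper's code):
# z0 ~ energy-based prior (short-run Langevin), latent trajectory from a neural
# ODE (Euler-integrated), observations from a nonlinear emission network.
import torch
import torch.nn as nn

latent_dim, obs_dim = 8, 16
drift = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, latent_dim))
emission = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, obs_dim))
energy = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, 1))

def sample_prior(batch, steps=20, step_size=0.1):
    z = torch.randn(batch, latent_dim)
    for _ in range(steps):  # short-run Langevin dynamics on the EBM prior
        z = z.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy(z).sum(), z)[0]
        z = z - 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(z)
    return z.detach()

def generate(z0, num_steps=50, dt=0.05):
    zs, z = [z0], z0
    for _ in range(num_steps - 1):  # Euler integration of dz/dt = drift(z)
        z = z + dt * drift(z)
        zs.append(z)
    return emission(torch.stack(zs, dim=1))  # (batch, time, obs_dim)

x = generate(sample_prior(4))
```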
Abstract:Recent advancements in deep learning for Medical Artificial Intelligence have demonstrated that models can match the diagnostic performance of clinical experts in adult chest X-ray (CXR) interpretation. However, their application in the pediatric context remains limited due to the scarcity of large annotated pediatric image datasets. Additionally, significant challenges arise from the substantial variability in pediatric CXR images across different hospitals and the diverse age range of patients from 0 to 18 years. To address these challenges, we propose SCC, a novel approach that combines transfer learning with self-supervised contrastive learning, augmented by an unsupervised contrast enhancement technique. Transfer learning from a well-trained adult CXR model mitigates issues related to the scarcity of pediatric training data. Contrastive learning with contrast enhancement focuses on the lungs, reducing the impact of image variations and producing high-quality embeddings across diverse pediatric CXR images. We train SCC on one pediatric CXR dataset and evaluate its performance on two other pediatric datasets from different sources. Our results show that SCC's out-of-distribution (zero-shot) performance exceeds that of regular transfer learning in terms of AUC by 13.6% and 34.6% on the two test datasets. Moreover, with few-shot learning using 10 times fewer labeled images, SCC matches the performance of regular transfer learning trained on the entire labeled dataset. To test the generality of the framework, we verify its performance on three benchmark breast cancer datasets. Starting from a model trained on natural images and fine-tuned on one breast cancer dataset, SCC outperforms the fully supervised learning baseline on the other two datasets in terms of AUC by 3.6% and 5.5% in zero-shot learning.
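The abstract does not specify the unsupervised contrast enhancement technique; one common choice for chest X-rays is CLAHE, shown below purely as an illustration of the kind of preprocessing step that would feed the contrastive-learning pipeline, not as SCC's actual method.

```python
# Illustration only: CLAHE is one common unsupervised contrast-enhancement step
# for chest X-rays; the specific enhancement used by SCC may differ.
import cv2
import numpy as np

def enhance_cxr(gray_uint8: np.ndarray) -> np.ndarray:
    """gray_uint8: single-channel 8-bit chest X-ray image."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(gray_uint8)
```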
Abstract:Large language models have consistently demonstrated remarkable performance across a wide spectrum of applications. Nonetheless, deploying these models can inadvertently expose user privacy to potential risks, and their substantial memory demands during training impose a considerable burden on resources, which is a matter of significant concern in practice. In this paper, we present MemDPT, an innovative training framework that not only reduces the memory cost of large language models but also places a strong emphasis on safeguarding user data privacy. MemDPT provides edge network and reverse network designs to accommodate various differentially private, memory-efficient fine-tuning schemes. Our approach not only achieves $2 \sim 3 \times$ memory optimization but also provides robust privacy protection, ensuring that user data remains secure and confidential. Extensive experiments demonstrate that MemDPT can effectively provide memory-efficient fine-tuning with differential privacy across various task scenarios.
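MemDPT's edge-network and reverse-network designs are not reproduced here; as background, the sketch below shows the standard DP-SGD update (per-example gradient clipping plus Gaussian noise) that differentially private fine-tuning schemes build on, written with explicit per-example gradients for clarity rather than efficiency.

```python
# Background sketch, not MemDPT itself: one DP-SGD step with per-example
# gradient clipping and Gaussian noise. loss_fn is assumed to return a scalar.
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, lr=1e-3, clip_norm=1.0, noise_mult=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(batch_x, batch_y):                       # per-example gradients
        grads = torch.autograd.grad(loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)), params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (norm + 1e-6), max=1.0)  # clip to bound sensitivity
        for s, g in zip(summed, grads):
            s.add_(g * scale)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * noise_mult * clip_norm
            p -= lr * (s + noise) / len(batch_x)             # noisy, averaged update
```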
Abstract:Large language models (LLMs) have demonstrated exceptional reasoning capabilities, enabling them to solve various complex problems. Recently, this ability has been applied to the paradigm of tool learning, in which LLMs are given examples of tool usage and the corresponding functions, allowing them to formulate plans and carry out the process of invoking and executing each tool. In this way, LLMs can address tasks that they cannot complete independently, thereby enhancing their potential across different tasks. However, this approach faces two key challenges: redundant error correction leads to unstable planning and long execution times, and designing a correct plan among multiple tools is itself difficult. To address these issues, we propose Tool-Planner, a task-processing framework based on toolkits. Tool-Planner groups tools whose APIs provide the same functionality into a toolkit and allows LLMs to plan across the various toolkits. When a tool error occurs, the language model can reselect and adjust tools within the corresponding toolkit. Experiments show that our approach achieves high pass and win rates across different datasets and optimizes the planning scheme for tool learning in models such as GPT-4 and Claude 3, showcasing the potential of our method.
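The toolkit idea can be pictured with the hypothetical sketch below: tools exposing the same kind of functionality are grouped, and when one tool fails the planner retries another tool from the same toolkit instead of replanning from scratch. The `Tool` dataclass and capability strings are stand-ins, not the paper's API.

```python
# Hypothetical sketch of toolkit grouping and in-toolkit reselection on error.
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    capability: str              # e.g. "web_search", "calculator"
    call: Callable[[str], str]

def group_into_toolkits(tools):
    toolkits = defaultdict(list)
    for tool in tools:
        toolkits[tool.capability].append(tool)
    return toolkits

def execute_step(toolkits, capability, query):
    for tool in toolkits[capability]:    # reselect within the toolkit on error
        try:
            return tool.call(query)
        except Exception:
            continue
    raise RuntimeError(f"all tools in toolkit '{capability}' failed")
```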
Abstract:In cascaded planning and control architectures for robot navigation, the frequency gap between the planner and the controller has received limited attention. In this study, we introduce a novel B-spline parameterized optimization-based planner (BSPOP) designed to address this frequency gap under the limited onboard computational power of robots. The proposed planner generates continuous-time control inputs that low-level controllers running at arbitrary frequencies can track. Furthermore, when the control action sets are convex, BSPOP uses the convex hull property to automatically constrain the continuous-time control inputs within the convex set. Consequently, compared with discrete-time optimization-based planners, BSPOP reduces the number of decision variables and inequality constraints, which improves computational efficiency as a byproduct. Simulation results demonstrate that our approach achieves planning performance comparable to high-frequency baseline optimization-based planners while demanding less computational power. Both simulation and experimental results show that the proposed method outperforms baseline planners running at the same frequency.
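The convex hull property that BSPOP exploits can be checked directly: if every B-spline control point lies inside a box constraint, the continuous-time control input does too, so only the control points need to be constrained in the optimizer. The snippet below illustrates this for an assumed one-dimensional control channel with a clamped cubic B-spline; it is not the planner itself.

```python
# Illustration of the convex hull property for an assumed 1-D control channel:
# control points inside [-1, 1] imply the continuous-time input stays in [-1, 1].
import numpy as np
from scipy.interpolate import BSpline

degree = 3
ctrl_pts = np.array([0.2, 0.9, -0.4, 0.7, -0.8, 0.5])   # all within [-1, 1]
n = len(ctrl_pts)
# Clamped knot vector: n + degree + 1 knots in total.
knots = np.concatenate(([0.0] * degree, np.linspace(0, 1, n - degree + 1), [1.0] * degree))
u = BSpline(knots, ctrl_pts, degree)

t = np.linspace(0, 1, 1000)
assert u(t).min() >= -1.0 and u(t).max() <= 1.0           # input stays in the box
```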
Abstract:Quadrotors are increasingly used in the evolving field of aerial robotics for their agility and mechanical simplicity. However, inherent uncertainties, such as aerodynamic effects coupled with quadrotors' operation in dynamically changing environments, pose significant challenges for traditional, nominal model-based control designs. We propose a multi-task meta-learning method called Encoder-Prototype-Decoder (EPD), which effectively balances shared and distinctive representations across diverse training tasks. We then integrate the EPD model into a model predictive control problem (Proto-MPC) to enhance the quadrotor's ability to adapt and operate across a spectrum of dynamically changing tasks with an efficient online implementation. We validate the proposed method in simulations, demonstrating Proto-MPC's robust performance in trajectory tracking for a quadrotor subject to static and spatially varying side winds.
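The architectural details of EPD are not given in the abstract; the sketch below is speculative and only illustrates one plausible reading, in which an encoder produces mixing weights over a small set of shared prototype vectors and a decoder turns the mixed prototype, together with the state, into a dynamics correction.

```python
# Speculative sketch (architecture details are assumptions, not from the paper).
import torch
import torch.nn as nn

class EncoderPrototypeDecoder(nn.Module):
    def __init__(self, ctx_dim=16, state_dim=6, num_protos=8, proto_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(ctx_dim, 64), nn.ReLU(),
                                     nn.Linear(64, num_protos))
        self.prototypes = nn.Parameter(torch.randn(num_protos, proto_dim))
        self.decoder = nn.Sequential(nn.Linear(proto_dim + state_dim, 64), nn.ReLU(),
                                     nn.Linear(64, state_dim))

    def forward(self, context, state):
        weights = torch.softmax(self.encoder(context), dim=-1)   # task-specific mixing
        code = weights @ self.prototypes                         # shared prototypes
        return self.decoder(torch.cat([code, state], dim=-1))    # dynamics correction
```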
Abstract:Model predictive control (MPC) has been applied to many platforms in robotics and autonomous systems for its capability to predict a system's future behavior while incorporating constraints that a system may have. To enhance the performance of a system with an MPC controller, one can manually tune the MPC's cost function. However, manual tuning can be challenging due to the possibly high dimension of the parameter space as well as the potential difference between the open-loop cost function in MPC and the overall closed-loop performance metric. This paper presents DiffTune-MPC, a novel learning method, to learn the cost function of an MPC in a closed-loop manner. The proposed framework is compatible with the scenario where the time interval for performance evaluation and MPC's planning horizon have different lengths. We present the auxiliary problem whose solution yields the analytical gradients of the MPC and discuss its variations in different MPC settings. Simulation results demonstrate the capability of DiffTune-MPC and illustrate the influence of constraints (from actuation limits) on learning.
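DiffTune-MPC derives analytical gradients of the MPC via an auxiliary problem; as a simplified stand-in, the sketch below replaces the constrained MPC with an unconstrained finite-horizon LQR on a double integrator and differentiates the closed-loop tracking loss with respect to learnable cost weights using autograd. It conveys the closed-loop cost-tuning idea but is not the paper's method.

```python
# Simplified stand-in for closed-loop cost-function learning (not DiffTune-MPC):
# tune diagonal LQR cost weights by backpropagating a closed-loop rollout loss.
import torch

A = torch.tensor([[1.0, 0.1], [0.0, 1.0]])     # double-integrator dynamics
B = torch.tensor([[0.0], [0.1]])
log_q = torch.zeros(2, requires_grad=True)     # learnable diagonal state cost
log_r = torch.zeros(1, requires_grad=True)     # learnable control cost

def lqr_gain(horizon=20):
    Q, R = torch.diag(log_q.exp()), torch.diag(log_r.exp())
    P = Q
    for _ in range(horizon):                   # backward Riccati recursion
        K = torch.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

opt = torch.optim.Adam([log_q, log_r], lr=0.05)
for _ in range(100):
    K = lqr_gain()
    x = torch.tensor([[1.0], [0.0]])
    loss = 0.0
    for _ in range(50):                        # closed-loop rollout
        u = -K @ x
        x = A @ x + B @ u
        loss = loss + x.pow(2).sum() + 0.01 * u.pow(2).sum()  # true closed-loop metric
    opt.zero_grad()
    loss.backward()
    opt.step()
```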