Columbia University
Abstract:Video-based person re-identification (ReID) has become increasingly important due to its applications in video surveillance applications. By employing events in video-based person ReID, more motion information can be provided between continuous frames to improve recognition accuracy. Previous approaches have assisted by introducing event data into the video person ReID task, but they still cannot avoid the privacy leakage problem caused by RGB images. In order to avoid privacy attacks and to take advantage of the benefits of event data, we consider using only event data. To make full use of the information in the event stream, we propose a Cross-Modality and Temporal Collaboration (CMTC) network for event-based video person ReID. First, we design an event transform network to obtain corresponding auxiliary information from the input of raw events. Additionally, we propose a differential modality collaboration module to balance the roles of events and auxiliaries to achieve complementary effects. Furthermore, we introduce a temporal collaboration module to exploit motion information and appearance cues. Experimental results demonstrate that our method outperforms others in the task of event-based video person ReID.
Abstract:Cross-view object geo-localization (CVOGL) aims to locate an object of interest in a captured ground- or drone-view image within the satellite image. However, existing works treat ground-view and drone-view query images equivalently, overlooking their inherent viewpoint discrepancies and the spatial correlation between the query image and the satellite-view reference image. To this end, this paper proposes a novel View-specific Attention Geo-localization method (VAGeo) for accurate CVOGL. Specifically, VAGeo contains two key modules: view-specific positional encoding (VSPE) module and channel-spatial hybrid attention (CSHA) module. In object-level, according to the characteristics of different viewpoints of ground and drone query images, viewpoint-specific positional codings are designed to more accurately identify the click-point object of the query image in the VSPE module. In feature-level, a hybrid attention in the CSHA module is introduced by combining channel attention and spatial attention mechanisms simultaneously for learning discriminative features. Extensive experimental results demonstrate that the proposed VAGeo gains a significant performance improvement, i.e., improving acc@0.25/acc@0.5 on the CVOGL dataset from 45.43%/42.24% to 48.21%/45.22% for ground-view, and from 61.97%/57.66% to 66.19%/61.87% for drone-view.
Abstract:With the rapid advancements in large language model (LLM) technology and the emergence of bioinformatics-specific language models (BioLMs), there is a growing need for a comprehensive analysis of the current landscape, computational characteristics, and diverse applications. This survey aims to address this need by providing a thorough review of BioLMs, focusing on their evolution, classification, and distinguishing features, alongside a detailed examination of training methodologies, datasets, and evaluation frameworks. We explore the wide-ranging applications of BioLMs in critical areas such as disease diagnosis, drug discovery, and vaccine development, highlighting their impact and transformative potential in bioinformatics. We identify key challenges and limitations inherent in BioLMs, including data privacy and security concerns, interpretability issues, biases in training data and model outputs, and domain adaptation complexities. Finally, we highlight emerging trends and future directions, offering valuable insights to guide researchers and clinicians toward advancing BioLMs for increasingly sophisticated biological and clinical applications.
Abstract:Q-learning methods are widely used in robot path planning but often face challenges of inefficient search and slow convergence. We propose an Improved Q-learning (IQL) framework that enhances standard Q-learning in two significant ways. First, we introduce the Path Adaptive Collaborative Optimization (PACO) algorithm to optimize Q-table initialization, providing better initial estimates and accelerating learning. Second, we incorporate a Utility-Controlled Heuristic (UCH) mechanism with dynamically tuned parameters to optimize the reward function, enhancing the algorithm's accuracy and effectiveness in path-planning tasks. Extensive experiments in three different raster grid environments validate the superior performance of our IQL framework. The results demonstrate that our IQL algorithm outperforms existing methods, including FIQL, PP-QL-based CPP, DFQL, and QMABC algorithms, in terms of path-planning capabilities.
Abstract:Question answering over temporal knowledge graphs (TKGs) is crucial for understanding evolving facts and relationships, yet its development is hindered by limited datasets and difficulties in generating custom QA pairs. We propose a novel categorization framework based on timeline-context relationships, along with \textbf{TimelineKGQA}, a universal temporal QA generator applicable to any TKGs. The code is available at: \url{https://github.com/PascalSun/TimelineKGQA} as an open source Python package.
Abstract:Bayesian Optimization (BO) is a sample-efficient black-box optimizer commonly used in search spaces where hyperparameters are independent. However, in many practical AutoML scenarios, there will be dependencies among hyperparameters, forming a conditional search space, which can be partitioned into structurally distinct subspaces. The structure and dimensionality of hyperparameter configurations vary across these subspaces, challenging the application of BO. Some previous BO works have proposed solutions to develop multiple Gaussian Process models in these subspaces. However, these approaches tend to be inefficient as they require a substantial number of observations to guarantee each GP's performance and cannot capture relationships between hyperparameters across different subspaces. To address these issues, this paper proposes a novel approach to model the response surfaces of all subspaces in one, which can model the relationships between hyperparameters elegantly via a self-attention mechanism. Concretely, we design a structure-aware hyperparameter embedding to preserve the structural information. Then, we introduce an attention-based deep feature extractor, capable of projecting configurations with different structures from various subspaces into a unified feature space, where the response surfaces can be formulated using a single standard Gaussian Process. The empirical results on a simulation function, various real-world tasks, and HPO-B benchmark demonstrate that our proposed approach improves the efficacy and efficiency of BO within conditional search spaces.
Abstract:ChartQA presents significant challenges due to the complex distribution of chart elements and the implicit patterns embedded within the underlying data. In this chapter, we have developed a joint multimodal scene graph for charts, explicitly representing the relationships between chart elements and their associated patterns. Our proposed multimodal scene graph consists of two components: a visual graph and a textual graph, each designed to capture the structural and semantic information within the chart. To unify representations across these different modalities, we introduce a multimodal graph contrastive learning approach that learns unified representations by maximizing similarity between nodes representing the same object across multimodal graphs. The learned graph representations can be seamlessly incorporated into a transformer decoder as a soft prompt. Additionally, given the growing need for Multimodal Large Language Models (MLLMs) in zero-shot scenarios, we have designed Chain-of-Thought (CoT) prompts for MLLMs to reduce hallucinations. We tested both methods on public benchmarks such as ChartQA, OpenCQA, and ChartX, demonstrating improved performance and validating the effectiveness of our proposed methods.
Abstract:Large language models (LLMs) excel at few-shot in-context learning (ICL) without requiring parameter updates. However, as the number of ICL demonstrations increases from a few to many, performance tends to plateau and eventually decline. We identify two primary causes for this trend: the suboptimal negative log-likelihood (NLL) optimization objective and the incremental data noise. To address these issues, we introduce DR-ICL, a novel optimization method that enhances model performance through Differentiated Learning and advantage-based Reweighting objectives. Globally, DR-ICL utilizes differentiated learning to optimize the NLL objective, ensuring that many-shot performance surpasses zero-shot levels. Locally, it dynamically adjusts the weighting of many-shot demonstrations by leveraging cumulative advantages inspired by reinforcement learning, thereby improving generalization. This approach allows the model to handle varying numbers of shots effectively, mitigating the impact of noisy data. Recognizing the lack of multi-task datasets with diverse many-shot distributions, we develop the Many-Shot ICL Benchmark (MICLB)-a large-scale benchmark covering shot numbers from 1 to 350 within sequences of up to 8,000 tokens-for fine-tuning purposes. MICLB facilitates the evaluation of many-shot ICL strategies across seven prominent NLP tasks and 50 distinct datasets. Experimental results demonstrate that LLMs enhanced with DR-ICL achieve significant improvements in many-shot setups across various tasks, including both in-domain and out-of-domain scenarios. We release the code and benchmark dataset hoping to facilitate further research in many-shot ICL.
Abstract:The study of physical adversarial patches is crucial for identifying vulnerabilities in AI-based recognition systems and developing more robust deep learning models. While recent research has focused on improving patch stealthiness for greater practical applicability, achieving an effective balance between stealth and attack performance remains a significant challenge. To address this issue, we propose a novel physical adversarial attack method that leverages knowledge distillation. Specifically, we first define a stealthy color space tailored to the target environment to ensure smooth blending. Then, we optimize an adversarial patch in an unconstrained color space, which serves as the 'teacher' patch. Finally, we use an adversarial knowledge distillation module to transfer the teacher patch's knowledge to the 'student' patch, guiding the optimization of the stealthy patch. Experimental results show that our approach improves attack performance by 20%, while maintaining stealth, highlighting its practical value.
Abstract:While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions including color, lighting, composition, etc. In this paper, we propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to upgrade the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into the content description and aesthetic description by the initialization of aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving the image-text alignment. Through our meticulous design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is https://vmix-diffusion.github.io/VMix/.