Abstract: As key elements within the central dogma, DNA, RNA, and proteins play crucial roles in maintaining life by guaranteeing accurate genetic expression and implementation. Although research on these molecules has profoundly impacted fields like medicine, agriculture, and industry, the diversity of machine learning approaches, from traditional statistical methods to deep learning models and large language models, poses challenges for researchers in choosing the most suitable models for specific tasks, especially for cross-omics and multi-omics tasks, owing to the lack of comprehensive benchmarks. To address this, we introduce COMET (Benchmark for Biological COmprehensive Multi-omics Evaluation Tasks and Language Models), the first comprehensive multi-omics benchmark designed to evaluate models across single-omics, cross-omics, and multi-omics tasks. First, we curate and develop a diverse collection of downstream tasks and datasets covering key structural and functional aspects of DNA, RNA, and proteins, including tasks that span multiple omics levels. Then, we evaluate existing foundational language models for DNA, RNA, and proteins, as well as a newly proposed multi-omics method, offering valuable insights into their performance in integrating and analyzing data from different biological modalities. This benchmark aims to define critical issues in multi-omics research and guide future directions, ultimately promoting advancements in understanding biological processes through integrated analysis of diverse omics data.
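To make the single-omics evaluation setup above concrete, here is a minimal, hypothetical sketch of the common "frozen encoder plus lightweight head" protocol for benchmarking a sequence model on a downstream classification task. The toy k-mer encoder, the synthetic GC-content task, and all names below are stand-ins of our own, not part of COMET.

```python
# Hypothetical sketch: linear probe on frozen sequence embeddings.
# The toy k-mer encoder stands in for a real DNA language model.
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
K = 3
KMER_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def toy_encoder(seq):
    """Stand-in for a pretrained DNA language model: normalized k-mer counts."""
    vec = np.zeros(len(KMER_INDEX))
    for i in range(len(seq) - K + 1):
        vec[KMER_INDEX[seq[i:i + K]]] += 1
    return vec / max(len(seq) - K + 1, 1)

def sample_seq(gc_bias, length=200):
    p = [(1 - gc_bias) / 2, gc_bias / 2, gc_bias / 2, (1 - gc_bias) / 2]  # A, C, G, T
    return "".join(rng.choice(list("ACGT"), size=length, p=p))

# Synthetic downstream task: GC-rich (label 1) vs. AT-rich (label 0) sequences.
seqs = [sample_seq(0.7) for _ in range(200)] + [sample_seq(0.3) for _ in range(200)]
y = np.array([1] * 200 + [0] * 200)
X = np.stack([toy_encoder(s) for s in seqs])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
head = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # linear probe on frozen features
print("MCC:", matthews_corrcoef(y_te, head.predict(X_te)))
```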
Abstract: As a fundamental vision task, stereo matching has made remarkable progress. While recent iterative optimization-based methods have achieved promising performance, their feature extraction capabilities still have room for improvement. Inspired by the ability of vision foundation models (VFMs) to extract general representations, in this work we propose AIO-Stereo, which can flexibly select and transfer knowledge from multiple heterogeneous VFMs to a single stereo matching model. To better reconcile features between heterogeneous VFMs and the stereo matching model and to fully exploit prior knowledge from VFMs, we propose a dual-level feature utilization mechanism that aligns heterogeneous features and transfers multi-level knowledge. Based on this mechanism, a dual-level selective knowledge transfer module is designed to selectively transfer knowledge and integrate the advantages of multiple VFMs. Experimental results show that AIO-Stereo achieves state-of-the-art performance on multiple datasets, ranks $1^{st}$ on the Middlebury dataset, and outperforms all published work on the ETH3D benchmark.
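The general idea of aligning heterogeneous VFM features to a stereo backbone and distilling them selectively can be illustrated with the minimal sketch below. It is our own simplification, not the paper's module: the 1x1 projection, the sigmoid gate, and the cosine distillation loss are assumed design choices.

```python
# Minimal sketch (our own illustration): align a VFM feature map to a stereo
# backbone's feature map with a learned 1x1 projection, then distill with a
# cosine loss that is selectively weighted per spatial location.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveAlignDistill(nn.Module):
    def __init__(self, vfm_dim, stereo_dim):
        super().__init__()
        self.proj = nn.Conv2d(vfm_dim, stereo_dim, kernel_size=1)  # feature alignment
        self.gate = nn.Conv2d(stereo_dim, 1, kernel_size=1)        # per-pixel selection weight

    def forward(self, vfm_feat, stereo_feat):
        # Resize VFM features to the stereo feature resolution, then project channels.
        vfm_feat = F.interpolate(vfm_feat, size=stereo_feat.shape[-2:],
                                 mode="bilinear", align_corners=False)
        aligned = self.proj(vfm_feat)
        # Cosine distillation loss, weighted by a learned selection gate.
        cos = F.cosine_similarity(aligned, stereo_feat, dim=1)     # (B, H, W)
        weight = torch.sigmoid(self.gate(aligned)).squeeze(1)      # (B, H, W)
        return ((1.0 - cos) * weight).mean()

# Toy usage with random tensors standing in for real VFM / stereo features.
loss_fn = SelectiveAlignDistill(vfm_dim=768, stereo_dim=128)
vfm = torch.randn(2, 768, 24, 32)
stereo = torch.randn(2, 128, 96, 128)
print(loss_fn(vfm, stereo).item())
```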
Abstract: Predicting roll-call votes through modeling political actors has emerged as a focus in quantitative political science and computer science. Widely used embedding-based methods generate vectors for legislators from diverse data sets to predict legislative behaviors. However, these methods often contend with challenges such as the need for manually predefined features, reliance on extensive training data, and a lack of interpretability. Achieving more interpretable predictions under flexible conditions remains an unresolved issue. This paper introduces the Political Actor Agent (PAA), a novel agent-based framework that utilizes Large Language Models to overcome these limitations. By employing role-playing architectures and simulating the legislative system, PAA provides a scalable and interpretable paradigm for predicting roll-call votes. Our approach not only enhances the accuracy of predictions but also offers multi-view, human-understandable decision reasoning, providing new insights into political actor behaviors. We conducted comprehensive experiments using voting records from the 117th-118th U.S. House of Representatives, validating the superior performance and interpretability of PAA. This study demonstrates not only PAA's effectiveness but also its potential in political science research.
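As a rough illustration of the role-playing idea, the sketch below builds a persona prompt for a legislator and parses a vote plus rationale from an LLM reply. The profile fields, prompt wording, and the `call_llm` callable are hypothetical placeholders, not PAA's actual architecture.

```python
# Hypothetical agent-style prompt for roll-call vote prediction.
from dataclasses import dataclass

@dataclass
class LegislatorProfile:
    name: str
    party: str
    state: str
    committee: str

def build_role_prompt(profile: LegislatorProfile, bill_summary: str) -> str:
    return (
        f"You are {profile.name}, a {profile.party} representative from "
        f"{profile.state} serving on the {profile.committee} committee.\n"
        f"Bill summary: {bill_summary}\n"
        "Decide how you would vote (Yea or Nay) and explain your reasoning "
        "from your district's interests, party position, and committee expertise.\n"
        "Answer in the form:\nVOTE: <Yea|Nay>\nREASONING: <short explanation>"
    )

def predict_vote(call_llm, profile, bill_summary):
    """call_llm: any function mapping a prompt string to a reply string."""
    reply = call_llm(build_role_prompt(profile, bill_summary))
    vote = "Yea" if "VOTE: Yea" in reply else "Nay"
    reasoning = reply.split("REASONING:", 1)[-1].strip()
    return vote, reasoning  # prediction plus human-readable rationale
```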
Abstract: In the rapidly evolving domain of Artificial Intelligence (AI), the complex interaction between innovation and regulation has become an emerging focus of our society. Despite tremendous advancements in AI's capabilities to excel in specific tasks and contribute to diverse sectors, establishing a high degree of trust in AI-generated outputs and decisions necessitates meticulous caution and continuous oversight. A broad spectrum of stakeholders, including governmental bodies, private sector corporations, academic institutions, and individuals, have launched significant initiatives. These efforts include developing ethical guidelines for AI and engaging in vibrant discussions on AI ethics, both among AI practitioners and within the broader society. This article thoroughly analyzes the ground-breaking AI regulatory framework proposed by the European Union. It delves into the fundamental ethical principles of safety, transparency, non-discrimination, traceability, and environmental sustainability for AI development and deployment. Considering the technical efforts and strategies undertaken by academia and industry to uphold these principles, we explore the synergies and conflicts among the five ethical principles. Through this lens, this work presents a forward-looking perspective on the future of AI regulation, advocating for a harmonized approach that safeguards societal values while encouraging technological advancement.
Abstract: Evaluating the quality of synthesized images remains a significant challenge in the development of text-to-image (T2I) generation. Most existing studies in this area primarily focus on evaluating text-image alignment, image quality, and object composition capabilities, with comparatively few studies addressing the evaluation of the factuality of T2I models, particularly when the concepts involved are knowledge-intensive. To mitigate this gap, we present T2I-FactualBench, the largest benchmark to date in terms of the number of concepts and prompts, specifically designed to evaluate the factuality of knowledge-intensive concept generation. T2I-FactualBench consists of a three-tiered knowledge-intensive text-to-image generation framework, ranging from the basic memorization of individual knowledge concepts to the more complex composition of multiple knowledge concepts. We further introduce a multi-round visual question answering (VQA) based evaluation framework to assess the factuality of the three-tiered knowledge-intensive text-to-image generation tasks. Experiments on T2I-FactualBench indicate that current state-of-the-art (SOTA) T2I models still leave significant room for improvement.
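To give a feel for multi-round VQA-based factuality scoring, the sketch below checks a generated image with successive rounds of yes/no questions and averages the pass rate. It is an illustrative loop, not the benchmark's official evaluator; `vqa_answer` and the example questions are assumptions.

```python
# Illustrative scoring loop: fraction of "yes" answers across question rounds.
# `vqa_answer(image, question)` is a placeholder for any VQA backend.
def factuality_score(image, question_rounds, vqa_answer):
    """question_rounds: list of rounds, each a list of yes/no questions."""
    passed, total = 0, 0
    for round_questions in question_rounds:
        for q in round_questions:
            total += 1
            if vqa_answer(image, q).strip().lower().startswith("yes"):
                passed += 1
    return passed / total if total else 0.0

# Example rounds for a hypothetical knowledge-intensive prompt ("Eiffel Tower at night"):
rounds = [
    ["Is there a tower in the image?"],                 # round 1: object presence
    ["Does the tower have an iron lattice structure?",  # round 2: concept fidelity
     "Is the scene set at night?"],
]
```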
Abstract: Terahertz (THz) integrated sensing and communication (ISAC) holds the potential to achieve high data rates and high-resolution sensing. Reconstructing the propagation environment is a vital step for THz ISAC, as it enhances the predictability of the communication channel and reduces communication overhead. In this letter, we propose an environment reconstruction methodology (ERM) that merges the reflectors of multiple targets based on single-sided small-scale channel characteristics in the THz band. In this method, the inclination and position of the tiny reflection faces of a single multi-path component (MPC) are first detected by double-triangle equations based on Snell's law and geometric properties. Then, the reflection faces of multi-target MPCs, filtered to retain available first-order reflection MPCs, are globally merged to accurately reconstruct the entire propagation environment. The ERM can operate using only the small-scale parameters of received MPCs. Subsequently, we validate the ERM through two experiments: bi-static ray-tracing simulations in an L-shaped room and channel measurements in an urban macrocellular (UMa) scenario in the THz band. The validation results demonstrate a small deviation of 0.03 m between the sensing outcomes and the predefined reflectors in the ray-tracing simulation, and small sensing root-mean-square errors of 1.28 m and 0.45 m in the line-of-sight and non-line-of-sight cases, respectively, based on the channel measurements. Overall, this work is valuable for designing THz communication systems and facilitating the application of THz ISAC techniques.
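The core geometric step can be illustrated with a simplified 2D sketch: given the Tx and Rx positions, the arrival direction, and the total path length of a first-order MPC, the single reflection point is located, and the reflector orientation follows from specular reflection (the surface normal bisects the incident and reflected rays). This is our own simplification of the idea, not the paper's double-triangle formulation.

```python
# 2D geometric sketch: locate a first-order reflection point and reflector normal.
import numpy as np

def reflection_point_and_normal(tx, rx, arrival_dir, path_len):
    """tx, rx: 2D positions; arrival_dir: unit vector from the reflection point
    toward rx (the direction of travel at arrival); path_len: total MPC length."""
    tx, rx = np.asarray(tx, float), np.asarray(rx, float)
    u = -np.asarray(arrival_dir, float)  # from rx back toward the reflector
    u = u / np.linalg.norm(u)
    w = rx - tx
    # Solve |rx + r*u - tx| = path_len - r for the distance r from rx to the reflector.
    r = (path_len**2 - w @ w) / (2 * (path_len + w @ u))
    p = rx + r * u
    # Surface normal = bisector of the directions toward tx and rx (specular reflection).
    n = (tx - p) / np.linalg.norm(tx - p) + (rx - p) / np.linalg.norm(rx - p)
    n = n / np.linalg.norm(n)
    return p, n

# Toy check: a wall along y = 5 with tx = (0, 0) and rx = (8, 0).
tx, rx = np.array([0.0, 0.0]), np.array([8.0, 0.0])
p_true = np.array([4.0, 5.0])
path_len = np.linalg.norm(p_true - tx) + np.linalg.norm(rx - p_true)
arrival_dir = (rx - p_true) / np.linalg.norm(rx - p_true)
print(reflection_point_and_normal(tx, rx, arrival_dir, path_len))  # -> (4, 5), (0, -1)
```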
Abstract: Finetuning-free personalized image generation can synthesize customized images without test-time finetuning, attracting wide research interest owing to its high efficiency. Current finetuning-free methods simply adopt a single training stage with a plain image reconstruction task, and they typically generate low-quality images that are inconsistent with the reference images at test time. To mitigate this problem, inspired by the recent DPO (i.e., direct preference optimization) technique, this work proposes an additional training stage to improve pre-trained personalized generation models. However, traditional DPO only determines the overall superiority or inferiority of two samples, which is not suitable for personalized image generation because the generated images are commonly inconsistent with the reference images only in some local image patches. To tackle this problem, this work proposes PatchDPO, which estimates the quality of image patches within each generated image and trains the model accordingly. To this end, PatchDPO first leverages a pre-trained vision model with a proposed self-supervised training method to estimate patch quality. Next, PatchDPO adopts a weighted training approach that uses the estimated patch quality to reward image patches of high quality while penalizing image patches of low quality. Experimental results demonstrate that PatchDPO significantly improves the performance of multiple pre-trained personalized generation models, and achieves state-of-the-art performance on both single-object and multi-object personalized image generation. Our code is available at https://github.com/hqhQAQ/PatchDPO.
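A simplified sketch of a patch-quality-weighted objective is shown below: each patch's loss term is weighted by a signed function of its estimated quality, so high-quality patches are reinforced and low-quality patches are pushed away. This is our reading of the general idea under assumed shapes and weighting, not the released implementation.

```python
# Sketch of a patch-weighted training objective (assumed shapes and weighting).
import torch
import torch.nn.functional as F

def patch_weighted_loss(per_pixel_loss, patch_quality, patch_size=16):
    """per_pixel_loss: (B, H, W), e.g. a denoising loss map.
    patch_quality: (B, H//patch_size, W//patch_size) scores in [0, 1]."""
    # Average the loss inside each patch.
    patch_loss = F.avg_pool2d(per_pixel_loss.unsqueeze(1), patch_size).squeeze(1)
    # Map quality in [0, 1] to signed weights in [-1, 1]:
    # quality 1 -> weight +1 (reinforce the patch), quality 0 -> weight -1 (penalize it).
    weights = 2.0 * patch_quality - 1.0
    return (weights * patch_loss).mean()

# Toy usage with random tensors standing in for real loss maps and quality scores.
loss_map = torch.rand(2, 64, 64)
quality = torch.rand(2, 4, 4)
print(patch_weighted_loss(loss_map, quality).item())
```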
Abstract: As a globally celebrated sport, soccer has attracted widespread interest from fans around the world. This paper aims to develop a comprehensive multi-modal framework for soccer video understanding. Specifically, we make the following contributions: (i) we introduce SoccerReplay-1988, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches, together with an automated annotation pipeline; (ii) we present the first visual-language foundation model in the soccer domain, MatchVision, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks; (iii) we conduct extensive experiments and ablation studies on action classification, commentary generation, and multi-view foul recognition, achieving state-of-the-art performance on all of them and substantially outperforming existing models, which demonstrates the superiority of our proposed data and model. We believe that this work will offer a standard paradigm for sports understanding research. The code and model will be publicly available for reproduction.
Abstract: Low-latency and high-precision vehicle localization plays a significant role in enhancing traffic safety and improving traffic management for intelligent transportation. However, in complex road environments, the low-latency and high-precision requirements cannot always be fulfilled due to the high computational complexity of localization. To tackle this issue, we propose a road-aware localization mechanism in heterogeneous networks (HetNets) of the mobile communication system, which enables real-time acquisition of vehicular position information, including the vehicle's current road, its segment within the road, and its coordinates. By employing this multi-scale localization approach, the computational complexity can be greatly reduced while ensuring accurate positioning. Specifically, to reduce positioning search complexity and ensure positioning precision, roads are partitioned into low-dimensional segments of unequal lengths by the proposed singular point (SP) segmentation method. To reduce feature-matching complexity, distinctive salient features (SFs) are extracted to sparsely represent roads and segments, eliminating redundant features while maximizing the feature information gain. The Cram\'er-Rao Lower Bound (CRLB) of the vehicle positioning error is derived to verify the positioning accuracy improvement brought by the segment partition and SF extraction. Additionally, through SF matching that integrates inclusion and adjacency position relationships, a multi-scale vehicle localization (MSVL) algorithm is proposed to identify vehicular road signal patterns and determine the real-time segment and coordinates. Simulation results show that the proposed multi-scale localization mechanism achieves lower latency and higher precision compared to the benchmark schemes.
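The coarse-to-fine matching idea can be sketched as follows: a measured signal fingerprint is first matched against sparse salient features per road, then per segment within the chosen road, before fine coordinates would be estimated inside that segment. This is a conceptual illustration with an assumed database layout and Euclidean matching, not the paper's MSVL algorithm.

```python
# Conceptual coarse-to-fine (multi-scale) matching over a salient-feature database.
import numpy as np

def match_multiscale(measurement, road_db):
    """road_db: {road_id: {segment_id: salient_feature_vector}}."""
    def dist(a, b):
        return np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))

    # Coarse scale: pick the road whose closest segment feature matches best.
    best_road = min(road_db,
                    key=lambda r: min(dist(measurement, f) for f in road_db[r].values()))
    # Fine scale: pick the segment within that road.
    best_seg = min(road_db[best_road],
                   key=lambda s: dist(measurement, road_db[best_road][s]))
    return best_road, best_seg

# Toy example with 2-D "fingerprints" standing in for salient features.
db = {"road_A": {"seg_1": [0.1, 0.9], "seg_2": [0.4, 0.6]},
      "road_B": {"seg_1": [0.9, 0.1]}}
print(match_multiscale([0.35, 0.65], db))  # -> ('road_A', 'seg_2')
```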
Abstract: While 3D Gaussian Splatting enables high-quality real-time rendering, existing Gaussian-based frameworks for 3D semantic segmentation still face significant challenges in boundary recognition accuracy. To address this, we propose a novel 3DGS-based framework named GradiSeg, which incorporates Identity Encoding to construct a deeper semantic understanding of scenes. Our approach introduces two key modules: Identity Gradient Guided Densification (IGD) and Local Adaptive K-Nearest Neighbors (LA-KNN). The IGD module supervises the gradients of Identity Encoding to refine Gaussian distributions along object boundaries, aligning them closely with boundary contours. Meanwhile, the LA-KNN module employs position gradients to adaptively establish locality-aware propagation of Identity Encodings, preventing irregular Gaussian spreads near boundaries. We validate the effectiveness of our method through comprehensive experiments. Results show that GradiSeg effectively addresses boundary-related issues, significantly improving segmentation accuracy without compromising scene reconstruction quality. Furthermore, our method's robust segmentation capability and decoupled Identity Encoding representation make it highly suitable for various downstream scene editing tasks, including 3D object removal and swapping.
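As a rough illustration of locality-aware identity propagation among Gaussians, the sketch below smooths each Gaussian's identity encoding with its K nearest neighbors, weighted by spatial proximity. It shows the general K-NN propagation idea only; the Gaussian-kernel weights, the 0.5 blending factor, and the absence of position-gradient adaptation are simplifications of our own, not GradiSeg's LA-KNN module.

```python
# Illustrative K-NN smoothing of per-Gaussian identity encodings.
import numpy as np

def knn_propagate_identity(positions, identities, k=8, sigma=0.1):
    """positions: (N, 3) Gaussian centers; identities: (N, D) identity encodings."""
    n = positions.shape[0]
    # Pairwise distances between Gaussian centers (exclude self-matches).
    d = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]                      # indices of K nearest neighbors
    w = np.exp(-d[np.arange(n)[:, None], nbrs] ** 2 / sigma ** 2)
    w = w / w.sum(axis=1, keepdims=True)                     # normalized proximity weights
    # Blend each identity with the weighted average of its neighbors' identities.
    neighbor_mix = (w[..., None] * identities[nbrs]).sum(axis=1)
    return 0.5 * identities + 0.5 * neighbor_mix

pos = np.random.rand(100, 3)
ids = np.random.rand(100, 16)
print(knn_propagate_identity(pos, ids).shape)  # (100, 16)
```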