Abstract:As large language models (LLMs) continue to advance, the demand for higher quality and faster processing of long contexts across various applications is growing. KV cache is widely adopted as it stores previously generated key and value tokens, effectively reducing redundant computations during inference. However, as memory overhead becomes a significant concern, efficient compression of KV cache has gained increasing attention. Most existing methods perform compression from two perspectives: identifying important tokens and designing compression strategies. However, these approaches often produce biased distributions of important tokens due to the influence of accumulated attention scores or positional encoding. Furthermore, they overlook the sparsity and redundancy across different heads, which leads to difficulties in preserving the most effective information at the head level. To this end, we propose EMS to overcome these limitations, while achieving better KV cache compression under extreme compression ratios. Specifically, we introduce a Global-Local score that combines accumulated attention scores from both global and local KV tokens to better identify the token importance. For the compression strategy, we design an adaptive and unified Evict-then-Merge framework that accounts for the sparsity and redundancy of KV tokens across different heads. Additionally, we implement the head-wise parallel compression through a zero-class mechanism to enhance efficiency. Extensive experiments demonstrate our SOTA performance even under extreme compression ratios. EMS consistently achieves the lowest perplexity, improves scores by over 1.28 points across four LLMs on LongBench under a 256 cache budget, and preserves 95% retrieval accuracy with a cache budget less than 2% of the context length in the Needle-in-a-Haystack task.
Abstract:Neural operators aim to approximate operators mapping between Banach spaces of functions and have achieved much success in scientific computing. Compared to certain deep learning-based solvers, such as Physics-Informed Neural Networks (PINNs) and the Deep Ritz Method (DRM), neural operators can solve a class of Partial Differential Equations (PDEs). Although much work has been done to analyze the approximation and generalization error of neural operators, there is still a lack of analysis on their training error. In this work, we conduct the convergence analysis of gradient descent for wide shallow neural operators within the framework of the Neural Tangent Kernel (NTK). The core idea lies in the fact that over-parameterization and random initialization together ensure that each weight vector remains near its initialization throughout all iterations, yielding the linear convergence of gradient descent. We demonstrate that, under over-parameterization, gradient descent can find the global minimum in both continuous time and discrete time.
Abstract:Vision-centric autonomous driving has demonstrated excellent performance with economical sensors. As the fundamental step, 3D perception aims to infer 3D information from 2D images based on 3D-2D projection. This makes driving perception models susceptible to sensor configuration (e.g., camera intrinsics and extrinsics) variations. However, generalizing across camera configurations is important for deploying autonomous driving models on different car models. In this paper, we present UniDrive, a novel framework for vision-centric autonomous driving to achieve universal perception across camera configurations. We deploy a set of unified virtual cameras and propose a ground-aware projection method to effectively transform the original images into these unified virtual views. We further propose a virtual configuration optimization method by minimizing the expected projection error between original cameras and virtual cameras. The proposed virtual camera projection can be applied to existing 3D perception methods as a plug-and-play module to mitigate the challenges posed by camera parameter variability, resulting in more adaptable and reliable driving perception models. To evaluate the effectiveness of our framework, we collect a dataset on Carla by driving the same routes while only modifying the camera configurations. Experimental results demonstrate that our method trained on one specific camera configuration can generalize to varying configurations with minor performance degradation.
Abstract:Emotion Recognition in Conversations (ERC) is a vital area within multimodal interaction research, dedicated to accurately identifying and classifying the emotions expressed by speakers throughout a conversation. Traditional ERC approaches predominantly rely on unimodal cues, such as text, audio, or visual data, leading to limitations in their effectiveness. These methods encounter two significant challenges: 1) Consistency in multimodal information. Before integrating various modalities, it is crucial to ensure that the data from different sources is aligned and coherent. 2) Contextual information capture. Successfully fusing multimodal features requires a keen understanding of the evolving emotional tone, especially in lengthy dialogues where emotions may shift and develop over time. To address these limitations, we propose a novel Mamba-enhanced Text-Audio-Video alignment network (MaTAV) for the ERC task. MaTAV aligns unimodal features to ensure consistency across different modalities and handles long input sequences to better capture contextual multimodal information. Extensive experiments on the MELD and IEMOCAP datasets demonstrate that MaTAV outperforms existing state-of-the-art methods on the ERC task by a large margin.
Abstract:Continuous blood pressure (BP) monitoring is essential for timely diagnosis and intervention in critical care settings. However, BP varies significantly across individuals, and this inter-patient variability motivates the development of personalized models tailored to each patient's physiology. In this work, we propose a personalized BP forecasting model mainly using electrocardiogram (ECG) and photoplethysmogram (PPG) signals. This time-series model incorporates 2D representation learning to capture complex physiological relationships. Experiments are conducted on datasets collected from three diverse scenarios with BP measurements from 60 subjects in total. Results demonstrate that the model achieves accurate and robust BP forecasts across scenarios within the Association for the Advancement of Medical Instrumentation (AAMI) standard criteria. Such reliable early detection of abnormal fluctuations in BP is crucial for at-risk patients undergoing surgery or intensive care. The proposed model provides a valuable addition for continuous BP tracking to reduce mortality and improve prognosis.
Abstract:Solving Singularly Perturbed Differential Equations (SPDEs) poses computational challenges arising from the rapid transitions in their solutions within thin regions. The effectiveness of deep learning in addressing differential equations motivates us to employ these methods for solving SPDEs. In this manuscript, we introduce the Component Fourier Neural Operator (ComFNO), an operator learning method that builds upon the Fourier Neural Operator (FNO) while incorporating prior knowledge obtained from asymptotic analysis. Our approach is not limited to FNO and can be applied to other neural network frameworks, such as the Deep Operator Network (DeepONet), potentially yielding similar SPDE solvers. Experimental results across diverse classes of SPDEs demonstrate that ComFNO significantly improves accuracy compared to vanilla FNO. Furthermore, ComFNO adapts naturally to diverse data distributions and performs well in few-shot scenarios, showcasing its generalization ability in practical situations.
Abstract:Physics-informed deep learning (PIDL)-based models have recently garnered remarkable success in traffic state estimation (TSE). However, the prior knowledge used to guide regularization training in current mainstream architectures is based on deterministic physical models. The drawback is that a solely deterministic model fails to capture the universally observed traffic flow dynamic scattering effect, thereby yielding unreliable outcomes for traffic control. This study, for the first time, proposes stochastic physics-informed deep learning (SPIDL) for traffic state estimation. The idea behind SPIDL is simple and is based on the fact that a stochastic fundamental diagram provides the entire range of possible speeds for any given density with associated probabilities. Specifically, we select the percentile-based fundamental diagram and the distribution-based fundamental diagram as stochastic physics knowledge, and design corresponding physics-uninformed neural networks for effective fusion, thereby realizing two specific SPIDL models, namely $\alpha$-SPIDL and $\mathcal{B}$-SPIDL. The main contribution of SPIDL lies in addressing the "overly centralized guidance" caused by the one-to-one speed-density relationship in deterministic models during neural network training, enabling the network to digest more reliable knowledge-based constraints. Experiments on a real-world dataset indicate that the proposed SPIDL models achieve accurate traffic state estimation in sparse data scenarios. More importantly, as expected, SPIDL models reproduce well the scattering effect of field observations, demonstrating the effectiveness of fusing stochastic physics model knowledge with deep learning frameworks.
Abstract:Full waveform inversion (FWI) plays a crucial role in the field of geophysics. There has been much research on applying deep learning (DL) methods to FWI. The success of DL-FWI relies significantly on the quantity and diversity of the datasets. Nevertheless, existing FWI datasets, like OpenFWI, where sources have fixed locations or identical frequencies, provide limited information and do not represent complex real-world scenes. For instance, low frequencies help in resolving larger-scale structures, while high frequencies allow for more detailed subsurface features. We consider that simultaneously using sources with different frequencies, instead of performing inversion using low-frequency data and then gradually introducing higher-frequency data, is well motivated and has potential advantages. Hence, we develop three enhanced datasets based on OpenFWI in which the sources have varying locations, frequencies, or both. Moreover, we propose a novel deep operator network (DeepONet) architecture, Inversion-DeepONet, for FWI. We utilize a convolutional neural network (CNN) to extract features from seismic data in the branch net. Source parameters, such as locations and frequencies, are fed to the trunk net. Then another CNN is employed as the decoder of DeepONet to reconstruct the velocity models more effectively. Through experiments, we confirm the superior accuracy and generalization ability of our network compared with existing data-driven FWI methods.
Abstract:Underwater object detection places higher demands on the running speed and deployment efficiency of the detector due to its specific environmental challenges. The NMS step of one- and two-stage object detectors and the transformer architecture of query-based end-to-end object detectors are not conducive to deployment on underwater embedded devices with limited processing power. To counter the detrimental effect of underwater color cast noise, recent underwater object detectors make the network architecture or training more complex, which also hinders their application and deployment on underwater vehicle platforms. In this paper, we propose the Underwater DECO with improved deNoising training (U-DECN), a query-based end-to-end object detector (with ConvNet encoder-decoder architecture) for underwater color cast noise that addresses the above problems. We integrate advanced technologies from DETR variants into DECO and design optimization methods specifically for the ConvNet architecture, including Separate Contrastive DeNoising Forward and Deformable Convolution in SIM. To address the underwater color cast noise issue, we propose an underwater color denoising query to improve the generalization of the model to object feature information biased by different color cast noise. Our U-DECN, with a ResNet-50 backbone, achieves 61.4 AP (50 epochs), 63.3 AP (72 epochs), and 64.0 AP (100 epochs) on DUO, and 21 FPS (5 times faster than the 4 FPS of Deformable DETR and DINO) on NVIDIA AGX Orin with TensorRT FP16, outperforming the other state-of-the-art query-based end-to-end object detectors. The code is available at https://github.com/LEFTeyex/U-DECN.
Abstract:First-order methods, such as gradient descent (GD) and stochastic gradient descent (SGD), have been proven effective in training neural networks. In the context of over-parameterization, there is a line of work demonstrating that randomly initialized (stochastic) gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. However, the learning rate of GD for training two-layer neural networks exhibits poor dependence on the sample size and the Gram matrix, leading to a slow training process. In this paper, we show that for the $L^2$ regression problems, the learning rate can be improved from $\mathcal{O}(\lambda_0/n^2)$ to $\mathcal{O}(1/\|\bm{H}^{\infty}\|_2)$, which implies that GD actually enjoys a faster convergence rate. Furthermore, we generalize the method to GD in training two-layer Physics-Informed Neural Networks (PINNs), showing a similar improvement for the learning rate. Although the improved learning rate has a mild dependence on the Gram matrix, we still need to set it small enough in practice due to the unknown eigenvalues of the Gram matrix. More importantly, the convergence rate is tied to the least eigenvalue of the Gram matrix, which can lead to slow convergence. In this work, we provide the convergence analysis of natural gradient descent (NGD) in training two-layer PINNs, demonstrating that the learning rate can be $\mathcal{O}(1)$, and at this rate, the convergence rate is independent of the Gram matrix.
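To give a feel for the gap between the two learning-rate scales quoted in the last abstract, the following is a minimal numerical sketch, not the authors' code. It assumes the standard infinite-width Gram matrix of a two-layer ReLU network on unit-norm inputs, $H^{\infty}_{ij} = x_i^{\top} x_j \,(\pi - \arccos(x_i^{\top} x_j)) / (2\pi)$, which the abstract itself does not spell out. The function names and the random data are hypothetical, and the constants hidden in the $\mathcal{O}(\cdot)$ notation are ignored.

```python
import numpy as np

def relu_gram_matrix(X):
    """Assumed Gram matrix for a two-layer ReLU network:
    H_ij = <x_i, x_j> * (pi - arccos(<x_i, x_j>)) / (2*pi), rows of X normalized."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    G = np.clip(X @ X.T, -1.0, 1.0)  # clip guards arccos against round-off
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

def step_size_scales(X):
    """Return (lambda_0 / n^2, 1 / ||H||_2), the two scales up to constants."""
    H = relu_gram_matrix(X)
    eigvals = np.linalg.eigvalsh(H)  # ascending eigenvalues of a symmetric matrix
    n = X.shape[0]
    lambda_0 = eigvals[0]            # least eigenvalue
    spectral_norm = eigvals[-1]      # largest eigenvalue equals ||H||_2 for PSD H
    return lambda_0 / n**2, 1.0 / spectral_norm

# Hypothetical data: 50 random samples in 10 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
H = relu_gram_matrix(X)
old_scale, new_scale = step_size_scales(X)
print(f"lambda_0 / n^2 : {old_scale:.3e}")
print(f"1 / ||H||_2    : {new_scale:.3e}")
```

Under this assumed kernel every diagonal entry of the Gram matrix equals 1/2, so its spectral norm is at most n/2 and its least eigenvalue at most 1/2. The second scale is therefore at least 2/n while the first is at most 1/(2n^2), which is why the improved learning rate is larger for any sample size.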