Abstract:The challenges of road network segmentation demand an algorithm capable of adapting to sparse and irregular shapes as well as diverse context, which often causes traditional encoder-decoder methods and simple Transformer embeddings to fail. We introduce a computationally efficient and powerful framework for elegant road-aware segmentation. Our method, called URoadNet, effectively encodes fine-grained local road connectivity and holistic global topological semantics while decoding multiscale road network information. URoadNet offers a novel alternative to the U-Net architecture by integrating connectivity attention, which exploits intra-road interactions across multi-level sampling features with reduced computational complexity. This local interaction serves as valuable prior information for learning global interactions between road networks and the background through a second integrality attention mechanism. The two forms of sparse attention are arranged alternately and complementarily, and trained jointly, yielding performance improvements without a significant increase in computational complexity. Extensive experiments on datasets with different resolutions, including Massachusetts, DeepGlobe, SpaceNet, and large-scale remote sensing images, demonstrate that URoadNet outperforms state-of-the-art techniques. Our approach represents a significant advance in road network extraction, providing a computationally feasible solution that achieves high-quality segmentation results.
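As a rough illustration of how such alternating sparse attention could be organized (the module names, window size, and strided global tokens below are assumptions for a sketch, not URoadNet's actual design), a block might apply windowed local attention followed by cross-attention to a coarse set of global tokens:

import torch
import torch.nn as nn

class LocalWindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping windows of tokens (local interactions)."""
    def __init__(self, dim, window=16, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, N, C), N divisible by window
        b, n, c = x.shape
        w = x.reshape(b * (n // self.window), self.window, c)
        out, _ = self.attn(w, w, w)
        return out.reshape(b, n, c)

class GlobalPooledAttention(nn.Module):
    """Cross-attention from all tokens to a strided subset (sparse global context)."""
    def __init__(self, dim, stride=8, heads=4):
        super().__init__()
        self.stride = stride
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, N, C)
        ctx = x[:, ::self.stride]               # coarse global tokens
        out, _ = self.attn(x, ctx, ctx)
        return out

class AlternatingBlock(nn.Module):
    """Local then global attention, applied alternately with residual connections."""
    def __init__(self, dim):
        super().__init__()
        self.local = LocalWindowAttention(dim)
        self.glob = GlobalPooledAttention(dim)
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.local(self.n1(x))          # local connectivity first
        x = x + self.glob(self.n2(x))           # then road-vs-background context
        return x

tokens = torch.randn(2, 256, 64)                # e.g. flattened multi-level features
print(AlternatingBlock(64)(tokens).shape)       # torch.Size([2, 256, 64])

Both attention passes touch far fewer token pairs than full self-attention, which is where the reduced complexity claimed above would come from.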
Abstract:Recent advances in agentic LLMs have demonstrated remarkable automated Verilog code generation capabilities. However, existing approaches either demand substantial computational resources or rely on LLM-assisted single-agent prompt learning techniques, which we observe, for the first time, suffer from a degeneration issue characterized by deteriorating generative performance and diminished error detection and correction capabilities. This paper proposes a novel multi-agent prompt learning framework to address these limitations and enhance code generation quality. We show for the first time that multi-agent architectures can effectively mitigate the degeneration risk while improving code error correction capabilities, resulting in higher-quality Verilog code generation. Experimental results show that the proposed method achieves 96.4% and 96.5% pass@10 scores on the VerilogEval Machine and Human benchmarks, respectively, while attaining 100% Syntax and 99.9% Functionality pass@5 metrics on the RTLLM benchmark.
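A hypothetical skeleton of such a generator/reviewer loop (in Python) is sketched below; call_llm is a placeholder for any chat-completion backend, and the prompts, round limit, and stub are illustrative rather than the paper's actual agents:

from typing import Callable

def generate_verilog(spec: str, call_llm: Callable[[str, str], str], max_rounds: int = 3) -> str:
    """Design agent proposes code; review agent checks it and feeds errors back."""
    code = call_llm("You are a Verilog design agent. Write synthesizable Verilog.", spec)
    for _ in range(max_rounds):
        review = call_llm(
            "You are a Verilog review agent. Report syntax or functional errors, or reply PASS.",
            f"Spec:\n{spec}\n\nCode:\n{code}")
        if review.strip() == "PASS":
            break
        # the reviewer's findings become context for the next design attempt
        code = call_llm(
            "You are a Verilog design agent. Fix the reported errors.",
            f"Spec:\n{spec}\n\nCode:\n{code}\n\nReview:\n{review}")
    return code

# toy stub standing in for a real LLM client, so the loop runs end to end
def stub_llm(role: str, task: str) -> str:
    return "PASS" if "review agent" in role else "module half_adder(input a, b, output s); assign s = a ^ b; endmodule"

print(generate_verilog("1-bit half adder, sum output only", stub_llm))

Separating generation from review is one way to keep a single prompt history from degenerating, since each agent sees a fresh, role-specific context every round.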
Abstract:Current knowledge distillation (KD) methods primarily focus on transferring various structured knowledge and designing corresponding optimization goals to encourage the student network to imitate the output of the teacher network. However, introducing too many additional optimization objectives may lead to unstable training, such as gradient conflicts. Moreover, these methods ignore the guidance provided by the relative learning difficulty between the teacher and student networks. Inspired by human cognitive science, in this paper we redefine knowledge from a new perspective, namely the relative difficulty of samples for the student and teacher networks, and propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD). We propose a two-stage RDD framework: Teacher-Full Evaluated RDD (TFE-RDD) and Teacher-Student Evaluated RDD (TSE-RDD). RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals, thus avoiding the need to adjust learning weights for multiple losses. Extensive experimental evaluations using a general distillation loss function on popular datasets such as Cityscapes, CamVid, Pascal VOC, and ADE20k demonstrate the effectiveness of RDD against state-of-the-art KD methods. Additionally, our research shows that RDD can be integrated with existing KD methods to improve their upper performance bound.
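A minimal sketch of the underlying idea, assuming per-pixel cross-entropy as the difficulty proxy and a softmax re-weighting of a standard KL distillation term (both are simplifications for illustration, not the exact TFE-RDD/TSE-RDD formulations):

import torch
import torch.nn.functional as F

def relative_difficulty_kd(student_logits, teacher_logits, labels, T=4.0):
    # per-pixel cross-entropy as a proxy for how hard each pixel is
    ce_s = F.cross_entropy(student_logits, labels, reduction="none")   # (B, H, W)
    ce_t = F.cross_entropy(teacher_logits, labels, reduction="none")
    # emphasize pixels that are relatively harder for the student than for the teacher
    weight = torch.softmax((ce_s - ce_t).flatten(1), dim=1)            # (B, H*W), sums to 1
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="none").sum(1)                             # (B, H, W)
    return (weight * kl.flatten(1)).sum(1).mean() * T * T

student = torch.randn(2, 19, 64, 64, requires_grad=True)
teacher = torch.randn(2, 19, 64, 64)
labels = torch.randint(0, 19, (2, 64, 64))
print(relative_difficulty_kd(student, teacher, labels).item())

The re-weighting modulates a single distillation loss instead of adding extra loss terms, which matches the abstract's point about avoiding additional optimization objectives.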
Abstract:Ultrasound is a widely used imaging modality in clinical practice due to its low cost, portability, and safety. Current research on general AI for healthcare focuses on large language models and general segmentation models, with insufficient attention to solutions that address both disease prediction and tissue segmentation. In this study, we propose a novel universal framework for ultrasound, namely DeepUniUSTransformer, a promptable model accommodating multiple clinical tasks. The universality of this model derives from its versatility across several aspects: it handles any ultrasound nature, any anatomical position, and any input type, and it excels not only in segmentation tasks but also in computer-aided diagnosis tasks. We introduce a novel module that incorporates this information as a prompt and seamlessly embeds it within the model's learning process. To train and validate our proposed model, we curated a comprehensive ultrasound dataset from publicly accessible sources, encompassing up to 7 distinct anatomical positions with over 9.7K annotations. Experimental results demonstrate that our model surpasses both a model trained on a single dataset and an ablated version of the network lacking prompt guidance. We will continuously expand the dataset and optimize the task-specific prompting mechanism toward universality in medical ultrasound. Model weights, datasets, and code will be made publicly available.
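One plausible way to inject such prompts (the embedding tables and additive injection below are illustrative assumptions, not the paper's actual module) is to map anatomy, task, and input-type identifiers to learned vectors and add them to the image tokens:

import torch
import torch.nn as nn

class PromptEmbedder(nn.Module):
    """Turns categorical prompt ids into a vector added to every image token."""
    def __init__(self, dim, n_anatomy=7, n_task=2, n_input=3):
        super().__init__()
        self.anatomy = nn.Embedding(n_anatomy, dim)
        self.task = nn.Embedding(n_task, dim)
        self.inp = nn.Embedding(n_input, dim)

    def forward(self, tokens, anatomy_id, task_id, input_id):
        # tokens: (B, N, C); the prompt vector is broadcast over all N tokens
        prompt = self.anatomy(anatomy_id) + self.task(task_id) + self.inp(input_id)
        return tokens + prompt.unsqueeze(1)

tokens = torch.randn(2, 196, 128)
anatomy, task, inp = torch.tensor([0, 3]), torch.tensor([1, 0]), torch.tensor([2, 1])
print(PromptEmbedder(128)(tokens, anatomy, task, inp).shape)  # torch.Size([2, 196, 128])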
Abstract:Spatiotemporal prediction plays an important role in solving natural problems and processing video frames, especially in weather forecasting and human action recognition. Recent advances attempt to incorporate prior physical knowledge into deep learning frameworks to estimate the unknown governing partial differential equations (PDEs), and have shown promising results in spatiotemporal prediction tasks. However, previous approaches only restrict neural network architectures or loss functions to acquire physical or PDE features, which decreases the representative capacity of the neural network. Meanwhile, the updating process of the physical state cannot be effectively estimated. To address these problems, this paper proposes a physics-guided neural network, which utilizes a frequency-enhanced Fourier module and a moment loss to strengthen the model's ability to estimate spatiotemporal dynamics. Furthermore, we propose an adaptive second-order Runge-Kutta method with physical constraints to model the physical states more precisely. We evaluate our model on both spatiotemporal and video prediction tasks. The experimental results show that our model outperforms state-of-the-art methods and performs best on several datasets, with a much smaller parameter count.
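For intuition, a second-order Runge-Kutta (Heun) update with a learned derivative and a simple step-halving rule might look as follows; the network, step size, and adaptivity criterion here are an illustration rather than the paper's exact scheme:

import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 16))  # predicts d(state)/dt

def rk2_step(state, dt):
    k1 = f(state)
    k2 = f(state + dt * k1)
    return state + 0.5 * dt * (k1 + k2)

def adaptive_rk2(state, dt=0.1, tol=1e-2, max_halvings=5):
    full = rk2_step(state, dt)                                # one step of size dt
    half = rk2_step(rk2_step(state, dt / 2), dt / 2)          # two steps of size dt/2
    if (full - half).abs().max() > tol and max_halvings > 0:  # local error estimate too large
        return adaptive_rk2(state, dt / 2, tol, max_halvings - 1)
    return half

state = torch.randn(4, 16)
print(adaptive_rk2(state).shape)                              # torch.Size([4, 16])

Physical constraints could additionally penalize the intermediate slopes k1 and k2, but their exact form is not specified in the abstract.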
Abstract:Time series forecasting plays a crucial role in various fields such as industrial equipment maintenance, meteorology, energy consumption, traffic flow, and financial investment. However, despite their considerable advantages over traditional statistical approaches, current deep learning-based predictive models often exhibit a significant deviation between their forecasting outcomes and the ground truth. This discrepancy is largely due to an insufficient emphasis on extracting the sequence's latent information, particularly its global information in the frequency domain and the relationships between different variables. To address this issue, we propose a novel model, Frequency-domain Attention In Two Horizons (FAITH), which decomposes time series into trend and seasonal components using a multi-scale sequence adaptive decomposition and fusion architecture and processes them separately. FAITH utilizes a Frequency Channel feature Extraction Module and a Frequency Temporal feature Extraction Module to capture inter-channel relationships and temporal global information in the sequence, significantly improving its ability to handle long-term dependencies and complex patterns. Furthermore, FAITH achieves theoretically linear complexity by modifying the time-frequency domain transformation method, effectively reducing computational costs. Extensive experiments on 6 benchmarks for long-term forecasting and 3 benchmarks for short-term forecasting demonstrate that FAITH outperforms existing models in many fields, such as electricity, weather, and traffic, proving its effectiveness and superiority in both long-term and short-term time series forecasting tasks. Our code and data are available at https://github.com/LRQ577/FAITH.
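A rough sketch of the two ingredients named above, under generic assumptions (a moving-average trend/seasonal split and an rFFT feature map stand in for FAITH's actual decomposition and frequency modules):

import torch
import torch.nn.functional as F

def decompose(x, kernel=25):                         # x: (B, L, C)
    pad = kernel // 2
    trend = F.avg_pool1d(F.pad(x.transpose(1, 2), (pad, pad), mode="replicate"),
                         kernel, stride=1).transpose(1, 2)
    return trend, x - trend                          # trend, seasonal residual

def frequency_features(x):                           # global information via the frequency domain
    spec = torch.fft.rfft(x, dim=1)                  # (B, L//2 + 1, C), complex
    return torch.cat([spec.real, spec.imag], dim=-1)

x = torch.randn(8, 96, 7)                            # batch, length, variables
trend, seasonal = decompose(x)
print(trend.shape, frequency_features(seasonal).shape)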
Abstract:This paper reviews the NTIRE 2024 RAW Image Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW super-resolution could be essential in modern Image Signal Processing (ISP) pipelines; however, this problem is not as well explored as in the RGB domain. The goal of this challenge is to upscale RAW Bayer images by 2x, considering unknown degradations such as noise and blur. In the challenge, a total of 230 participants registered, and 45 submitted results during the challenge period. The performance of the top-5 submissions is reviewed and provided here as a gauge for the current state of the art in RAW image super-resolution.
Abstract:Ensuring both accuracy and robustness in time series prediction is critical to many applications, ranging from urban planning to pandemic management. With sufficient training data in which all spatiotemporal patterns are well represented, existing deep-learning models can make reasonably accurate predictions. However, existing methods fail when the training data are drawn from different circumstances (e.g., traffic patterns on regular days) than the test data (e.g., traffic patterns after a natural disaster). Such challenges are usually classified under domain generalization. In this work, we show that one way to address this challenge in the context of spatiotemporal prediction is by incorporating domain differential equations into Graph Convolutional Networks (GCNs). We theoretically derive conditions under which GCNs incorporating such domain differential equations are robust to mismatched training and testing data compared to baseline domain-agnostic models. To support our theory, we propose two domain-differential-equation-informed networks: the Reaction-Diffusion Graph Convolutional Network (RDGCN), which incorporates differential equations for traffic speed evolution, and the Susceptible-Infectious-Recovered Graph Convolutional Network (SIRGCN), which incorporates a disease propagation model. Both RDGCN and SIRGCN are based on reliable and interpretable domain differential equations that allow the models to generalize to unseen patterns. We experimentally show that RDGCN and SIRGCN are more robust to mismatched testing data than state-of-the-art deep learning methods.
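Schematically, a single reaction-diffusion step on a graph could be written as below; the diffusivity, time step, and learned reaction term are placeholders rather than the exact RDGCN formulation:

import torch
import torch.nn as nn

def laplacian(adj):
    return torch.diag(adj.sum(1)) - adj

class ReactionDiffusionStep(nn.Module):
    """One explicit Euler step of dx/dt = -D * L x + reaction(x) on a graph."""
    def __init__(self, n_features, diffusivity=0.1, dt=0.05):
        super().__init__()
        self.reaction = nn.Linear(n_features, n_features)    # learned local reaction term
        self.diffusivity, self.dt = diffusivity, dt

    def forward(self, x, adj):                                # x: (n_nodes, n_features)
        diffusion = -self.diffusivity * laplacian(adj) @ x    # smoothing along edges
        return x + self.dt * (diffusion + torch.tanh(self.reaction(x)))

adj = (torch.rand(10, 10) > 0.7).float()
adj = ((adj + adj.T) > 0).float().fill_diagonal_(0)           # symmetric, no self-loops
x = torch.randn(10, 4)                                        # e.g. node-level traffic features
print(ReactionDiffusionStep(4)(x, adj).shape)                 # torch.Size([10, 4])

Because the update is built around the graph Laplacian, it encodes how quantities spread along edges, which is the kind of interpretable domain structure the abstract relies on for generalization.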
Abstract:Uncertainty is critical to reliable decision-making with machine learning. Conformal prediction (CP) handles uncertainty by predicting a set for a test input, aiming for the set to cover the true label with at least $(1-\alpha)$ confidence. This coverage can be guaranteed on test data even if the marginal distributions $P_X$ differ between calibration and test datasets. However, as is common in practice, when the conditional distribution $P_{Y|X}$ differs between calibration and test data, the coverage is not guaranteed, and it is essential to measure and minimize the coverage loss under distributional shift at \textit{all} possible confidence levels. To address these issues, we upper bound the coverage difference at all levels using the cumulative distribution functions of calibration and test conformal scores and the Wasserstein distance. Inspired by the invariance of physics across data distributions, we propose a physics-informed structural causal model (PI-SCM) to reduce the upper bound. We validate that PI-SCM can improve coverage robustness across confidence levels and test domains on a traffic speed prediction task and an epidemic spread task with multiple real-world datasets.
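As a small numeric illustration of the quantity involved (not the paper's code), the 1-Wasserstein distance between empirical calibration and test conformal-score distributions with equal sample counts reduces to the mean absolute difference of their sorted samples:

import numpy as np

def wasserstein_1(scores_cal, scores_test):
    # equal-weight empirical distributions: W1 = mean |quantile difference|
    a, b = np.sort(scores_cal), np.sort(scores_test)
    return np.mean(np.abs(a - b))

rng = np.random.default_rng(0)
cal = rng.normal(0.0, 1.0, 1000)           # calibration conformal scores
test = rng.normal(0.3, 1.2, 1000)          # shifted test conformal scores
print(round(float(wasserstein_1(cal, test)), 3))

The bound described in the abstract is built from exactly these score distributions, so a causal model that brings them closer is what would tighten it.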
Abstract:Recently, the FourCastNet Neural Earth System Model (NESM) has shown impressive results on predicting various atmospheric variables, trained on the ERA5 reanalysis dataset. While FourCastNet enjoys quasi-linear time and memory complexity in sequence length, compared to the quadratic complexity of vanilla transformers, training FourCastNet on ERA5 from scratch still requires a large amount of compute resources, which is expensive or even inaccessible to most researchers. In this work, we show improved methods that can train FourCastNet using only 1% of the compute required by the baseline, while maintaining model performance on par with or even better than the baseline.