School of Information Science and Technology, ShanghaiTech University
Abstract:Executing precise and agile flight maneuvers is critical for quadrotors in various applications. Traditional quadrotor control approaches are limited by their reliance on flat trajectories or time-consuming optimization, which restricts their flexibility. Recently, RL-based policy has emerged as a promising alternative due to its ability to directly map observations to actions, reducing the need for detailed system knowledge and actuation constraints. However, a significant challenge remains in bridging the sim-to-real gap, where RL-based policies often experience instability when deployed in real world. In this paper, we investigate key factors for learning robust RL-based control policies that are capable of zero-shot deployment in real-world quadrotors. We identify five critical factors and we develop a PPO-based training framework named SimpleFlight, which integrates these five techniques. We validate the efficacy of SimpleFlight on Crazyflie quadrotor, demonstrating that it achieves more than a 50% reduction in trajectory tracking error compared to state-of-the-art RL baselines, and achieves 70% improvement over the traditional MPC. The policy derived by SimpleFlight consistently excels across both smooth polynominal trajectories and challenging infeasible zigzag trajectories on small thrust-to-weight quadrotors. In contrast, baseline methods struggle with high-speed or infeasible trajectories. To support further research and reproducibility, we integrate SimpleFlight into a GPU-based simulator Omnidrones and provide open-source access to the code and model checkpoints. We hope SimpleFlight will offer valuable insights for advancing RL-based quadrotor control. For more details, visit our project website at https://sites.google.com/view/simpleflight/.
Abstract:With the rapid growth of generative AI and its widespread application in image editing, new risks have emerged regarding the authenticity and integrity of digital content. Existing versatile watermarking approaches suffer from trade-offs between tamper localization precision and visual quality. Constrained by the limited flexibility of previous framework, their localized watermark must remain fixed across all images. Under AIGC-editing, their copyright extraction accuracy is also unsatisfactory. To address these challenges, we propose OmniGuard, a novel augmented versatile watermarking approach that integrates proactive embedding with passive, blind extraction for robust copyright protection and tamper localization. OmniGuard employs a hybrid forensic framework that enables flexible localization watermark selection and introduces a degradation-aware tamper extraction network for precise localization under challenging conditions. Additionally, a lightweight AIGC-editing simulation layer is designed to enhance robustness across global and local editing. Extensive experiments show that OmniGuard achieves superior fidelity, robustness, and flexibility. Compared to the recent state-of-the-art approach EditGuard, our method outperforms it by 4.25dB in PSNR of the container image, 20.7% in F1-Score under noisy conditions, and 14.8% in average bit accuracy.
Abstract:Recently Transformer-based hyperspectral image (HSI) change detection methods have shown remarkable performance. Nevertheless, existing attention mechanisms in Transformers have limitations in local feature representation. To address this issue, we propose Global and Local Attention-based Transformer (GLAFormer), which incorporates a global and local attention module (GLAM) to combine high-frequency and low-frequency signals. Furthermore, we introduce a cross-gating mechanism, called cross-gated feed-forward network (CGFN), to emphasize salient features and suppress noise interference. Specifically, the GLAM splits attention heads into global and local attention components to capture comprehensive spatial-spectral features. The global attention component employs global attention on downsampled feature maps to capture low-frequency information, while the local attention component focuses on high-frequency details using non-overlapping window-based local attention. The CGFN enhances the feature representation via convolutions and cross-gating mechanism in parallel paths. The proposed GLAFormer is evaluated on three HSI datasets. The results demonstrate its superiority over state-of-the-art HSI change detection methods. The source code of GLAFormer is available at \url{https://github.com/summitgao/GLAFormer}.
Abstract:Accurate motion control in the face of disturbances within complex environments remains a major challenge in robotics. Classical model-based approaches often struggle with nonlinearities and unstructured disturbances, while RL-based methods can be fragile when encountering unseen scenarios. In this paper, we propose a novel framework, Neural Internal Model Control, which integrates model-based control with RL-based control to enhance robustness. Our framework streamlines the predictive model by applying Newton-Euler equations for rigid-body dynamics, eliminating the need to capture complex high-dimensional nonlinearities. This internal model combines model-free RL algorithms with predictive error feedback. Such a design enables a closed-loop control structure to enhance the robustness and generalizability of the control system. We demonstrate the effectiveness of our framework on both quadrotors and quadrupedal robots, achieving superior performance compared to state-of-the-art methods. Furthermore, real-world deployment on a quadrotor with rope-suspended payloads highlights the framework's robustness in sim-to-real transfer. Our code is released at https://github.com/thu-uav/NeuralIMC.
Abstract:Recent image-to-3D reconstruction models have greatly advanced geometry generation, but they still struggle to faithfully generate realistic appearance. To address this, we introduce ARM, a novel method that reconstructs high-quality 3D meshes and realistic appearance from sparse-view images. The core of ARM lies in decoupling geometry from appearance, processing appearance within the UV texture space. Unlike previous methods, ARM improves texture quality by explicitly back-projecting measurements onto the texture map and processing them in a UV space module with a global receptive field. To resolve ambiguities between material and illumination in input images, ARM introduces a material prior that encodes semantic appearance information, enhancing the robustness of appearance decomposition. Trained on just 8 H100 GPUs, ARM outperforms existing methods both quantitatively and qualitatively.
Abstract:Formation control of multiple Unmanned Aerial Vehicles (UAVs) is vital for practical applications. This paper tackles the task of behavior-based UAV formation while avoiding static and dynamic obstacles during directed flight. We present a two-stage reinforcement learning (RL) training pipeline to tackle the challenge of multi-objective optimization, large exploration spaces, and the sim-to-real gap. The first stage searches in a simplified scenario for a linear utility function that balances all task objectives simultaneously, whereas the second stage applies the utility function in complex scenarios, utilizing curriculum learning to navigate large exploration spaces. Additionally, we apply an attention-based observation encoder to enhance formation maintenance and manage varying obstacle quantity. Experiments in simulation and real world demonstrate that our method outperforms planning-based and RL-based baselines regarding collision-free rate and formation maintenance in scenarios with static, dynamic, and mixed obstacles.
Abstract:In this paper, we propose a novel and efficient parameter estimator based on $k$-Nearest Neighbor ($k$NN) and data generation method for the Lognormal-Rician turbulence channel. The Kolmogorov-Smirnov (KS) goodness-of-fit statistical tools are employed to investigate the validity of $k$NN approximation under different channel conditions and it is shown that the choice of $k$ plays a significant role in the approximation accuracy. We present several numerical results to illustrate that solving the constructed objective function can provide a reasonable estimate for the actual values. The accuracy of the proposed estimator is investigated in terms of the mean square error. The simulation results show that increasing the number of generation samples by two orders of magnitude does not lead to a significant improvement in estimation performance when solving the optimization problem by the gradient descent algorithm. However, the estimation performance under the genetic algorithm (GA) approximates to that of the saddlepoint approximation and expectation-maximization estimators. Therefore, combined with the GA, we demonstrate that the proposed estimator achieves the best tradeoff between the computation complexity and the accuracy.
Abstract:In multi-source remote sensing image classification field, remarkable progress has been made by convolutional neural network and Transformer. However, existing methods are still limited due to the inherent local reductive bias. Recently, Mamba-based methods built upon the State Space Model have shown great potential for long-range dependency modeling with linear complexity, but it has rarely been explored for the multi-source remote sensing image classification task. To this end, we propose Multi-Scale Feature Fusion Mamba (MSFMamba) network for hyperspectral image (HSI) and LiDAR/SAR data joint classification. Specifically, MSFMamba mainly comprises three parts: Multi-Scale Spatial Mamba (MSpa-Mamba) block, Spectral Mamba (Spe-Mamba) block, and Fusion Mamba (Fus-Mamba) block. Specifically, to solve the feature redundancy in multiple canning routes, the MSpa-Mamba block incorporates the multi-scale strategy to minimize the computational redundancy and alleviate the feature redundancy of SSM. In addition, Spe-Mamba is designed for spectral feature exploration, which is essential for HSI feature modeling. Moreover, to alleviate the heterogeneous gap between HSI and LiDAR/SAR data, we design Fus-Mamba block for multi-source feature fusion. The original Mamba is extended to accommodate dual inputs, and cross-modal feature interaction is enhanced. Extensive experimental results on three multi-source remote sensing datasets demonstrate the superiority performance of the proposed MSFMamba over the state-of-the-art models. Source codes of MSFMamba will be made public available at https://github.com/summitgao/MSFMamba .
Abstract:Hyperspectral image (HSI) and synthetic aperture radar (SAR) data joint classification is a crucial and yet challenging task in the field of remote sensing image interpretation. However, feature modeling in existing methods is deficient to exploit the abundant global, spectral, and local features simultaneously, leading to sub-optimal classification performance. To solve the problem, we propose a hierarchical attention and parallel filter fusion network for multi-source data classification. Concretely, we design a hierarchical attention module for hyperspectral feature extraction. This module integrates global, spectral, and local features simultaneously to provide more comprehensive feature representation. In addition, we develop parallel filter fusion module which enhances cross-modal feature interactions among different spatial locations in the frequency domain. Extensive experiments on two multi-source remote sensing data classification datasets verify the superiority of our proposed method over current state-of-the-art classification approaches. Specifically, our proposed method achieves 91.44% and 80.51% of overall accuracy (OA) on the respective datasets, highlighting its superior performance.
Abstract:Photoacoustic (PA) imaging technology combines the advantages of optical imaging and ultrasound imaging, showing great potential in biomedical applications. Many preclinical studies and clinical applications urgently require fast, high-quality, low-cost and portable imaging system. Translating advanced image reconstruction algorithms into hardware implementations is highly desired. However, existing iterative PA image reconstructions, although exhibit higher accuracy than delay-and-sum algorithm, suffer from high computational cost. In this paper, we introduce a model-based hardware acceleration architecture based on superposed Wave (s-Wave) for palm-size PA tomography (palm-PAT), aiming at enhancing both the speed and performance of image reconstruction at a much lower system cost. To achieve this, we propose an innovative data reuse method that significantly reduces hardware storage resource consumption. We conducted experiments by FPGA implementation of the algorithm, using both phantoms and in vivo human finger data to verify the feasibility of the proposed method. The results demonstrate that our proposed architecture can substantially reduce system cost while maintaining high imaging performance. The hardware-accelerated implementation of the model-based algorithm achieves a speedup of up to approximately 270 times compared to the CPU, while the corresponding energy efficiency ratio is improved by more than 2700 times.