Abstract: Transporting a heavy payload using multiple aerial robots (MARs) is an efficient way to extend the load capacity of a single aerial robot. However, existing schemes for the multiple aerial robots transportation system (MARTS) still lack the capability to generate collision-free and dynamically feasible trajectories in real time, and to track agile trajectories, especially when no sensors are available to measure the states of the payload and cables. They are therefore limited to low-agility transportation in simple environments. To bridge this gap, we propose complete planning and control schemes for the MARTS, achieving safe and agile aerial transportation (SAAT) of a cable-suspended payload in complex environments. Flatness maps for each aerial robot are derived, accounting for the complete kinematic constraints and the dynamic coupling between each aerial robot and the payload. To improve responsiveness in generating safe, dynamically feasible, and agile trajectories in complex environments, a real-time spatio-temporal trajectory planning scheme is proposed for the MARTS. Moreover, we remove the reliance on state measurements of the payload and cables, as well as on closed-loop payload control, and propose a fully distributed control scheme for tracking agile trajectories that is robust against imprecise payload mass and non-point-mass payloads. The proposed schemes are extensively validated through benchmark comparisons, ablation studies, and simulations. Finally, extensive real-world experiments are conducted on a MARTS composed of three aerial robots with onboard computers and sensors. The results validate the efficiency and robustness of our proposed schemes for SAAT in complex environments.
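To make the cable-coupling constraint concrete, here is a minimal NumPy sketch (not the paper's actual flatness maps) of how the payload's flat output determines the total force the cables must supply, and how per-robot tensions can then be distributed. The point-mass payload, the least-squares distribution, and all names are illustrative assumptions.

```python
import numpy as np

def distribute_cable_tensions(a_L, q, m_L=1.0, g=9.81):
    """Distribute cable tension among n robots for a point-mass payload.

    a_L : (3,) desired payload acceleration (a flat-output derivative).
    q   : (3, n) unit vectors from the payload to each robot along the cables.
    Returns the (n,) tension magnitudes T such that sum_i T_i * q_i
    balances the required force m_L * (a_L + g * e3).
    """
    e3 = np.array([0.0, 0.0, 1.0])
    F = m_L * (a_L + g * e3)          # total force the cables must supply
    # Least-squares tension distribution (minimum-norm when n > 3).
    T, *_ = np.linalg.lstsq(q, F, rcond=None)
    return T

# Example: three robots in a symmetric formation holding a hovering payload.
n = 3
angles = 2 * np.pi * np.arange(n) / n
q = np.stack([0.5 * np.cos(angles), 0.5 * np.sin(angles),
              np.full(n, np.sqrt(1 - 0.25))])   # unit cable directions
T = distribute_cable_tensions(np.zeros(3), q)
print(T)  # equal tensions by symmetry, each ~ m_L * g / (n * q_z)
```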
Abstract: There is a dearth of publications on spreading-aided Orthogonal Time Frequency Space (OTFS) solutions, especially for Integrated Sensing and Communication (ISAC), even though Code Division Multiple Access (CDMA) assisted multi-user OTFS (CDMA/OTFS) exhibits tangible benefits. Hence, this work characterises both the communication Bit Error Rate (BER) and the sensing Root Mean Square Error (RMSE) performance of CDMA/OTFS, and contrasts them to pure OTFS. Three CDMA/OTFS configurations are considered: Delay Code Division Multiple Access OTFS (Dl-CDMA/OTFS), Doppler Code Division Multiple Access OTFS (Dp-CDMA/OTFS), and Delay Doppler Code Division Multiple Access OTFS (DD-CDMA/OTFS), which harness direct sequence spreading along the delay axis, the Doppler axis, and the DD domain, respectively. For each configuration, the performance of Gold, Hadamard, and Zadoff-Chu sequences is investigated. The results demonstrate that Zadoff-Chu Dl-CDMA/OTFS and DD-CDMA/OTFS consistently outperform pure OTFS sensing, whilst maintaining a similar communication performance at the same throughput. The extra modulation complexity of CDMA/OTFS is similar to that of other multi-user OTFS methodologies, but its demodulation complexity is lower than that of some of them. CDMA/OTFS sensing can also consistently outperform OTFS sensing whilst requiring no additional complexity for target parameter estimation. Therefore, CDMA/OTFS is an appealing candidate for implementing multi-user OTFS ISAC.
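As a concrete illustration of Dl-CDMA/OTFS-style spreading, the sketch below generates Zadoff-Chu sequences and spreads each user's symbols along the delay axis of a delay-Doppler grid; the frame layout and the despreading-by-correlation step are simplifying assumptions, not the paper's exact transceiver.

```python
import numpy as np

def zadoff_chu(root, length):
    """Zadoff-Chu sequence of odd length with the given root index."""
    n = np.arange(length)
    return np.exp(-1j * np.pi * root * n * (n + 1) / length)

def dl_cdma_otfs_spread(symbols, code, M, N):
    """Spread each user symbol along the delay axis of an M-by-N DD grid.

    symbols : (N,) one QAM symbol per Doppler bin for this user.
    code    : (M,) the user's spreading sequence (length = delay bins).
    Returns the user's (M, N) delay-Doppler frame before the usual OTFS
    transforms (ISFFT + Heisenberg), which are applied as in plain OTFS.
    """
    assert len(code) == M and len(symbols) == N
    return np.outer(code, symbols)    # delay axis carries the chips

# Two users sharing one DD grid via different ZC roots.
M, N = 7, 8                           # delay bins, Doppler bins
x1 = dl_cdma_otfs_spread(np.ones(N), zadoff_chu(1, M), M, N)
x2 = dl_cdma_otfs_spread(np.ones(N), zadoff_chu(3, M), M, N)
frame = x1 + x2                       # superposed multi-user DD frame
# Despread user 1 by correlating along the delay axis.
est1 = zadoff_chu(1, M).conj() @ frame / M
print(np.round(est1, 3))  # user-1 symbols up to ZC cross-correlation leakage (|leak| = 1/sqrt(M))
```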
Abstract: Low Earth Orbit (LEO) satellite networks are capable of improving global Internet coverage. In this context, we propose a hybrid beamforming design for holographic metasurface-based terrestrial users in multi-altitude LEO satellite networks. Firstly, the holographic beamformer is optimized by maximizing the downlink channel gain from the serving satellite to the terrestrial user. Then, the digital beamformer is designed by conceiving a minimum mean square error (MMSE) based detection algorithm for mitigating the interference arriving from other satellites. To dispense with the excessive overhead of acquiring full channel state information (CSI) for all satellites, we propose a low-complexity MMSE beamforming algorithm that relies only on the distribution of the LEO satellite constellation, harnessing stochastic geometry; it achieves a throughput comparable to that of the full-CSI algorithm in the case of a dense LEO satellite deployment. Furthermore, it outperforms the maximum ratio combining (MRC) algorithm, thanks to its inter-satellite interference mitigation capability. The simulation results show that our proposed holographic metasurface-based hybrid beamforming architecture is capable of outperforming the state-of-the-art antenna array architecture in terms of throughput, given the same physical size of the transceivers. Moreover, we demonstrate that the attainable beamforming performance can be substantially improved by taking into account the mutual coupling effect imposed by the dense placement of the holographic metasurface elements.
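The interference-mitigation step can be illustrated with a textbook full-CSI MMSE combiner (the paper's low-complexity variant instead uses only the constellation's stochastic geometry, which this sketch does not reproduce); all dimensions and names below are assumptions.

```python
import numpy as np

def mmse_combiner(h0, H_int, sigma2):
    """MMSE receive combiner for the serving satellite's channel h0.

    h0    : (Nr,) effective channel of the serving satellite.
    H_int : (Nr, K) stacked channels of the K interfering satellites.
    Returns the combining vector w; w^H y suppresses interference,
    unlike MRC, which simply uses w = h0.
    """
    Nr = len(h0)
    # Interference-plus-noise covariance.
    R = H_int @ H_int.conj().T + sigma2 * np.eye(Nr)
    return np.linalg.solve(R, h0)

rng = np.random.default_rng(0)
Nr, K, sigma2 = 8, 3, 0.1
h0 = (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr)) / np.sqrt(2)
Hi = (rng.standard_normal((Nr, K)) + 1j * rng.standard_normal((Nr, K))) / np.sqrt(2)
w = mmse_combiner(h0, Hi, sigma2)
sig = np.abs(w.conj() @ h0) ** 2
intf = np.sum(np.abs(w.conj() @ Hi) ** 2) + sigma2 * np.linalg.norm(w) ** 2
print("post-combining SINR (dB):", 10 * np.log10(sig / intf))
```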
Abstract: Diffusion models have shown exceptional performance in visual generation tasks. Recently, these models have shifted from traditional U-shaped CNN-attention hybrid structures to fully transformer-based isotropic architectures. While these transformers exhibit strong scalability and performance, their reliance on complicated self-attention operations results in slow inference speeds. In contrast to these works, we rethink one of the simplest yet fastest modules in deep learning, the 3x3 convolution, to construct a scaled-up purely convolutional diffusion model. We first discover that an encoder-decoder hourglass design outperforms scalable isotropic architectures for Conv3x3, but it still falls short of our expectations. To further improve the architecture, we introduce sparse skip connections to reduce redundancy and improve scalability. Building on this architecture, we introduce conditioning improvements, including stage-specific embeddings, mid-block condition injection, and conditional gating. These improvements lead to our proposed Diffusion CNN (DiC), which serves as a swift yet competitive diffusion architecture baseline. Experiments at various scales and settings show that DiC surpasses existing diffusion transformers by considerable margins in terms of performance while maintaining a clear speed advantage. Project page: https://github.com/YuchuanTian/DiC
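The conditional-gating idea can be sketched as a purely convolutional residual block whose scale, shift, and gate are produced from the condition embedding; this is a hypothetical block in PyTorch echoing the abstract's description, not DiC's actual implementation.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Illustrative Conv3x3 block with condition-driven scale/shift/gate.

    Hypothetical design: the timestep/class condition produces per-channel
    (scale, shift, gate), echoing the conditional-gating idea in a purely
    convolutional block.
    """
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_ssg = nn.Linear(cond_dim, 3 * channels)

    def forward(self, x, cond):
        scale, shift, gate = self.to_ssg(cond).chunk(3, dim=-1)
        scale = scale[..., None, None]           # broadcast over H, W
        shift = shift[..., None, None]
        gate = gate[..., None, None]
        h = self.norm(x) * (1 + scale) + shift   # conditioned normalization
        h = self.conv2(torch.nn.functional.silu(self.conv1(h)))
        return x + gate.tanh() * h               # gated residual update

block = GatedConvBlock(64, cond_dim=128)
x = torch.randn(2, 64, 32, 32)
cond = torch.randn(2, 128)
print(block(x, cond).shape)  # torch.Size([2, 64, 32, 32])
```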
Abstract: Given their extensive geographic coverage, low Earth orbit (LEO) satellites are envisioned to find their way into next-generation (6G) wireless communications. This paper explores space-air-ground integrated networks (SAGINs) leveraging LEO satellites to support terrestrial and non-terrestrial users. We first propose a practical satellite-ground channel model that incorporates five key aspects: 1) the small-scale fading characterized by the Shadowed-Rician distribution in terms of the Rician factor K, 2) the path loss effect of rays bent by atmospheric refraction, 3) the molecular absorption modelled by the Beer-Lambert law, 4) the Doppler effects, including the Earth's rotation, and 5) the impact of weather conditions according to the International Telecommunication Union Recommendations (ITU-R). Harnessing the proposed model, we analyze the long-term performance of the SAGIN considered. Explicitly, closed-form expressions of both the outage probability and the ergodic rate are derived. Additionally, upper bounds of the bit-error rate and of the Goodput are investigated. The numerical results yield the following insights: 1) the shadowing effect and the ratio between the line-of-sight and scattering components can be conveniently modeled by the factors K and m in the proposed Shadowed-Rician small-scale fading model; 2) atmospheric refraction has a modest effect on the path loss; 3) when calculating the transmission distance of waves, the Earth's curvature and its geometric relationship with the satellites must be considered, particularly at small elevation angles; 4) high-frequency carriers suffer from substantial path loss; and 5) the Goodput metric is eminently suitable for characterizing the performance of different coding and modulation methods and of the Doppler estimation error.
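The Beer-Lambert absorption term composes with free-space spreading in a simple link budget, sketched below; the absorption coefficient kappa and the rain term are placeholder inputs, not ITU-R values.

```python
import numpy as np

def link_loss_db(d_m, f_hz, kappa_per_m, rain_db=0.0):
    """Total path loss: free-space spreading + Beer-Lambert molecular
    absorption (+ an ITU-R style weather excess term, here a fixed input).

    Beer-Lambert: received power scales as exp(-kappa * d), i.e. an
    absorption loss of 10*log10(e) * kappa * d dB over distance d.
    kappa_per_m is an assumed absorption coefficient, not an ITU value.
    """
    c = 299_792_458.0
    fspl = 20 * np.log10(4 * np.pi * d_m * f_hz / c)
    absorption = 10 * np.log10(np.e) * kappa_per_m * d_m
    return fspl + absorption + rain_db

# A 550 km LEO link; FSPL grows with f, and a frequency-dependent
# kappa(f) would add further high-frequency absorption loss.
for f in (12e9, 30e9, 100e9):
    print(f / 1e9, "GHz:", round(link_loss_db(550e3, f, 1e-7), 1), "dB")
```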
Abstract: Animation has recently gained significant interest in the film and TV industry. Despite the success of advanced video generation models like Sora, Kling, and CogVideoX in generating natural videos, they are less effective at handling animation videos. Evaluating animation video generation is also a great challenge, owing to animation's unique artistic styles, violations of the laws of physics, and exaggerated motions. In this paper, we present a comprehensive system, AniSora, designed for animation video generation, which includes a data processing pipeline, a controllable generation model, and an evaluation dataset. Supported by the data processing pipeline with over 10M high-quality samples, the generation model incorporates a spatiotemporal mask module to facilitate key animation production functions such as image-to-video generation, frame interpolation, and localized image-guided animation. We also collect an evaluation benchmark of 948 diverse animation videos; evaluations on VBench and a human double-blind test demonstrate consistency in character and motion, achieving state-of-the-art results in animation video generation. Our evaluation benchmark will be publicly available at https://github.com/bilibili/Index-anisora.
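One plausible reading of the spatiotemporal mask module is a per-pixel, per-frame conditioning mask selecting which content is given versus generated; the layout below is a guess inferred from the three listed tasks, since the abstract does not specify the module's internals.

```python
import torch

def spatiotemporal_mask(T, H, W, task, region=None):
    """Build a (T, 1, H, W) conditioning mask: 1 = pixels given to the
    model as guidance, 0 = pixels to generate. Hypothetical layout; the
    mask module's actual internals are not described in the abstract.
    """
    m = torch.zeros(T, 1, H, W)
    if task == "image_to_video":        # first frame fully conditioned
        m[0] = 1.0
    elif task == "interpolation":       # endpoints given, middle generated
        m[0], m[-1] = 1.0, 1.0
    elif task == "localized_guidance":  # a spatial region guided in all frames
        y0, y1, x0, x1 = region
        m[:, :, y0:y1, x0:x1] = 1.0
    return m

mask = spatiotemporal_mask(16, 64, 64, "interpolation")
print(mask.sum().item())  # 2 * 64 * 64 conditioned pixels
```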
Abstract: Existing 2D methods utilize UNet-based diffusion models to generate multi-view physically-based rendering (PBR) maps but struggle with multi-view inconsistency, while some 3D methods directly generate UV maps but encounter generalization issues due to limited 3D data. To address these problems, we propose a two-stage approach comprising multi-view generation and UV material refinement. In the generation stage, we adopt a Diffusion Transformer (DiT) model to generate PBR materials, where both the specially designed multi-branch DiT and the reference-based DiT blocks adopt a global attention mechanism to promote feature interaction and fusion between different views, thereby improving multi-view consistency. In addition, we adopt a PBR-based diffusion loss to ensure that the generated materials align with realistic physical principles. In the refinement stage, we propose a material-refined DiT that performs inpainting in empty areas and enhances details in UV space. Besides the normal condition, this refinement stage also takes the material map from the generation stage as an additional condition to reduce the learning difficulty and improve generalization. Extensive experiments show that our method achieves state-of-the-art performance in texturing 3D objects with PBR materials and provides significant advantages for graphics relighting applications. Project page: https://lingtengqiu.github.io/2024/MCMat/
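The global attention mechanism for multi-view consistency can be sketched by flattening all views' tokens into one sequence before self-attention, so every token attends across views; this PyTorch block illustrates the idea and is not the paper's DiT block.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Global attention over the tokens of all views jointly, so features
    from different views can interact; a sketch of the idea only.
    """
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, V, N, C) tokens of V views, N tokens each.
        B, V, N, C = x.shape
        x = x.reshape(B, V * N, C)       # flatten views into one sequence
        out, _ = self.attn(x, x, x)      # every token attends across views
        return out.reshape(B, V, N, C)

layer = CrossViewAttention(dim=64)
tokens = torch.randn(2, 4, 256, 64)      # 4 views, 256 tokens per view
print(layer(tokens).shape)               # torch.Size([2, 4, 256, 64])
```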
Abstract: Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, including academic papers, textbooks, and slides, among others. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies. The code and dataset are available at https://github.com/opendatalab/OmniDocBench.
Abstract: Generating animatable human avatars from a single image is essential for various digital human modeling applications. Existing 3D reconstruction methods often struggle to capture fine details in animatable models, while generative approaches for controllable animation, though avoiding explicit 3D modeling, suffer from viewpoint inconsistencies in extreme poses and from computational inefficiency. In this paper, we address these challenges by leveraging the power of generative models to produce detailed multi-view canonical pose images, which help resolve ambiguities in animatable human reconstruction. We then propose a robust method for 3D reconstruction from these inconsistent images, enabling real-time rendering during inference. Specifically, we adapt a transformer-based video generation model to generate multi-view canonical pose images and normal maps, pretraining it on a large-scale video dataset to improve generalization. To handle view inconsistencies, we recast the reconstruction problem as a 4D task and introduce an efficient 3D modeling approach using 4D Gaussian Splatting. Experiments demonstrate that our method achieves photorealistic, real-time animation of 3D human avatars from in-the-wild images, showcasing its effectiveness and generalization capability.
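A common way 4D Gaussian Splatting absorbs view inconsistencies, assumed here since the abstract does not give the parameterization, is to give each Gaussian a temporal center and scale that modulate its opacity over time, so inconsistent observations become transient primitives:

```python
import torch

def temporal_opacity(alpha, t, tau, sigma_t):
    """Opacity of a 4D Gaussian evaluated at time t.

    One common 4D-GS formulation (an assumption, not necessarily this
    paper's): each Gaussian carries a time center tau and temporal scale
    sigma_t, and its 3D opacity is modulated by a 1D Gaussian in time,
    letting inconsistent views be absorbed as short-lived primitives
    instead of corrupting the persistent geometry.
    """
    return alpha * torch.exp(-0.5 * ((t - tau) / sigma_t) ** 2)

alpha = torch.tensor([0.9, 0.8])     # base opacities of two Gaussians
tau = torch.tensor([0.0, 0.5])       # time centers
sigma_t = torch.tensor([10.0, 0.1])  # persistent vs. transient primitive
print(temporal_opacity(alpha, 0.5, tau, sigma_t))  # tensor([0.8989, 0.8000])
```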