Abstract:The recent breakthroughs in OpenAI's GPT4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report presents the first-look evaluation benchmark (named GPT-ImgEval), quantitatively and qualitatively diagnosing GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, based on the GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o, where our empirical results suggest the model consists of an auto-regressive (AR) combined with a diffusion-based head for image decoding, rather than the VAR-like architectures. We also provide a complete speculation on GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond. The codes and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.
Abstract:This paper proposes an incremental voxel-based life-long localization method, LL-Localizer, which enables robots to localize robustly and accurately in multi-session mode using prior maps. Meanwhile, considering that it is difficult to be aware of changes in the environment in the prior map and robots may traverse between mapped and unmapped areas during actual operation, we will update the map when needed according to the established strategies through incremental voxel map. Besides, to ensure high performance in real-time and facilitate our map management, we utilize Dynamic i-Octree, an efficient organization of 3D points based on Dynamic Octree to load local map and update the map during the robot's operation. The experiments show that our system can perform stable and accurate localization comparable to state-of-the-art LIO systems. And even if the environment in the prior map changes or the robots traverse between mapped and unmapped areas, our system can still maintain robust and accurate localization without any distinction. Our demo can be found on Blibili (https://www.bilibili.com/video/BV1faZHYCEkZ) and youtube (https://youtu.be/UWn7RCb9kA8) and the program will be available at https://github.com/M-Evanovic/LL-Localizer.
Abstract:The accuracy of the initial state, including initial velocity, gravity direction, and IMU biases, is critical for the initialization of LiDAR-inertial SLAM systems. Inaccurate initial values can reduce initialization speed or lead to failure. When the system faces urgent tasks, robust and fast initialization is required while the robot is moving, such as during the swift assessment of rescue environments after natural disasters, bomb disposal, and restarting LiDAR-inertial SLAM in rescue missions. However, existing initialization methods usually require the platform to remain stationary, which is ineffective when the robot is in motion. To address this issue, this paper introduces a robust and fast dynamic initialization method for LiDAR-inertial systems (D-LI-Init). This method iteratively aligns LiDAR-based odometry with IMU measurements to achieve system initialization. To enhance the reliability of the LiDAR odometry module, the LiDAR and gyroscope are tightly integrated within the ESIKF framework. The gyroscope compensates for rotational distortion in the point cloud. Translational distortion compensation occurs during the iterative update phase, resulting in the output of LiDAR-gyroscope odometry. The proposed method can initialize the system no matter the robot is moving or stationary. Experiments on public datasets and real-world environments demonstrate that the D-LI-Init algorithm can effectively serve various platforms, including vehicles, handheld devices, and UAVs. D-LI-Init completes dynamic initialization regardless of specific motion patterns. To benefit the research community, we have open-sourced our code and test datasets on GitHub.
Abstract:SLAM plays a crucial role in automation tasks, such as warehouse logistics, healthcare robotics, and restaurant delivery. These scenes come with various challenges, including navigating around crowds of people, dealing with flying plastic bags that can temporarily blind sensors, and addressing reduced LiDAR density caused by cooking smoke. Such scenarios can result in over-degeneracy, causing the map to drift. To address this issue, this paper presents a multi-map LiDAR-inertial system (MM-LINS) for the first time. The front-end employs an iterated error state Kalman filter for state estimation and introduces a reliable evaluation strategy for degeneracy detection. If over-degeneracy is detected, the active map will be stored into sleeping maps. Subsequently, the system continuously attempts to construct new maps using a dynamic initialization method to ensure successful initialization upon leaving the over-degeneracy. Regarding the back-end, the Scan Context descriptor is utilized to detect inter-map similarity. Upon successful recognition of a sleeping map that shares a common region with the active map, the overlapping trajectory region is utilized to constrain the positional transformation near the edge of the prior map. In response to this, a constraint-enhanced map fusion strategy is proposed to achieve high-precision positional and mapping results. Experiments have been conducted separately on both public datasets that exhibited over-degenerate conditions and in real-world environments. These tests demonstrated the effectiveness of MM-LINS in over-degeneracy environment. Our codes are open-sourced on Github.
Abstract:Text-to-video (T2V) generation has made significant strides with diffusion models. However, existing methods still struggle with accurately binding attributes, determining spatial relationships, and capturing complex action interactions between multiple subjects. To address these limitations, we propose MagicComp, a training-free method that enhances compositional T2V generation through dual-phase refinement. Specifically, (1) During the Conditioning Stage: We introduce the Semantic Anchor Disambiguation to reinforces subject-specific semantics and resolve inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors into original text embedding; (2) During the Denoising Stage: We propose Dynamic Layout Fusion Attention, which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation. Furthermore, MagicComp is a model-agnostic and versatile approach, which can be seamlessly integrated into existing T2V architectures. Extensive experiments on T2V-CompBench and VBench demonstrate that MagicComp outperforms state-of-the-art methods, highlighting its potential for applications such as complex prompt-based and trajectory-controllable video generation. Project page: https://hong-yu-zhang.github.io/MagicComp-Page/.
Abstract:Video generation has witnessed remarkable progress with the advent of deep generative models, particularly diffusion models. While existing methods excel in generating high-quality videos from text prompts or single images, personalized multi-subject video generation remains a largely unexplored challenge. This task involves synthesizing videos that incorporate multiple distinct subjects, each defined by separate reference images, while ensuring temporal and spatial consistency. Current approaches primarily rely on mapping subject images to keywords in text prompts, which introduces ambiguity and limits their ability to model subject relationships effectively. In this paper, we propose CINEMA, a novel framework for coherent multi-subject video generation by leveraging Multimodal Large Language Model (MLLM). Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort. By leveraging MLLM to interpret subject relationships, our method facilitates scalability, enabling the use of large and diverse datasets for training. Furthermore, our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation. Through extensive evaluations, we demonstrate that our approach significantly improves subject consistency, and overall video coherence, paving the way for advanced applications in storytelling, interactive media, and personalized video generation.
Abstract:Effective Human-Robot Interaction (HRI) is crucial for future service robots in aging societies. Existing solutions are biased toward only well-trained objects, creating a gap when dealing with new objects. Currently, HRI systems using predefined gestures or language tokens for pretrained objects pose challenges for all individuals, especially elderly ones. These challenges include difficulties in recalling commands, memorizing hand gestures, and learning new names. This paper introduces NVP-HRI, an intuitive multi-modal HRI paradigm that combines voice commands and deictic posture. NVP-HRI utilizes the Segment Anything Model (SAM) to analyze visual cues and depth data, enabling precise structural object representation. Through a pre-trained SAM network, NVP-HRI allows interaction with new objects via zero-shot prediction, even without prior knowledge. NVP-HRI also integrates with a large language model (LLM) for multimodal commands, coordinating them with object selection and scene distribution in real time for collision-free trajectory solutions. We also regulate the action sequence with the essential control syntax to reduce LLM hallucination risks. The evaluation of diverse real-world tasks using a Universal Robot showcased up to 59.2\% efficiency improvement over traditional gesture control, as illustrated in the video https://youtu.be/EbC7al2wiAc. Our code and design will be openly available at https://github.com/laiyuzhi/NVP-HRI.git.
Abstract:LiDAR-Inertial Odometry (LIO) is widely used for autonomous navigation, but its deployment on Size, Weight, and Power (SWaP)-constrained platforms remains challenging due to the computational cost of processing dense point clouds. Conventional LIO frameworks rely on a single onboard processor, leading to computational bottlenecks and high memory demands, making real-time execution difficult on embedded systems. To address this, we propose QLIO, a multi-processor distributed quantized LIO framework that reduces computational load and bandwidth consumption while maintaining localization accuracy. QLIO introduces a quantized state estimation pipeline, where a co-processor pre-processes LiDAR measurements, compressing point-to-plane residuals before transmitting only essential features to the host processor. Additionally, an rQ-vector-based adaptive resampling strategy intelligently selects and compresses key observations, further reducing computational redundancy. Real-world evaluations demonstrate that QLIO achieves a 14.1% reduction in per-observation residual data while preserving localization accuracy. Furthermore, we release an open-source implementation to facilitate further research and real-world deployment. These results establish QLIO as an efficient and scalable solution for real-time autonomous systems operating under computational and bandwidth constraints.
Abstract:Zero-Shot Object Navigation (ZSON) requires agents to navigate to objects specified via open-ended natural language without predefined categories or prior environmental knowledge. While recent methods leverage foundation models or multi-modal maps, they often rely on 2D representations and greedy strategies or require additional training or modules with high computation load, limiting performance in complex environments and real applications. We propose WTRP-Searcher, a novel framework that formulates ZSON as a Weighted Traveling Repairman Problem (WTRP), minimizing the weighted waiting time of viewpoints. Using a Vision-Language Model (VLM), we score viewpoints based on object-description similarity, projected onto a 2D map with depth information. An open-vocabulary detector identifies targets, dynamically updating goals, while a 3D embedding feature map enhances spatial awareness and environmental recall. WTRP-Searcher outperforms existing methods, offering efficient global planning and improved performance in complex ZSON tasks. Code and more demos will be avaliable on https://github.com/lrm20011/WTRP_Searcher.
Abstract:Traditional unmanned aerial vehicle (UAV) swarm missions rely heavily on expensive custom-made drones with onboard perception or external positioning systems, limiting their widespread adoption in research and education. To address this issue, we propose AirSwarm. AirSwarm democratizes multi-drone coordination using low-cost commercially available drones such as Tello or Anafi, enabling affordable swarm aerial robotics research and education. Key innovations include a hierarchical control architecture for reliable multi-UAV coordination, an infrastructure-free visual SLAM system for precise localization without external motion capture, and a ROS-based software framework for simplified swarm development. Experiments demonstrate cm-level tracking accuracy, low-latency control, communication failure resistance, formation flight, and trajectory tracking. By reducing financial and technical barriers, AirSwarm makes multi-robot education and research more accessible. The complete instructions and open source code will be available at