Abstract:Vision and language are the two foundational senses for humans, and they build up our cognitive ability and intelligence. While significant breakthroughs have been made in AI language ability, artificial visual intelligence, especially the ability to generate and simulate the world we see, is far lagging behind. To facilitate the development and accessibility of artificial visual intelligence, we created Open-Sora, an open-source video generation model designed to produce high-fidelity video content. Open-Sora supports a wide spectrum of visual generation tasks, including text-to-image generation, text-to-video generation, and image-to-video generation. The model leverages advanced deep learning architectures and training/inference techniques to enable flexible video synthesis, which could generate video content of up to 15 seconds, up to 720p resolution, and arbitrary aspect ratios. Specifically, we introduce Spatial-Temporal Diffusion Transformer (STDiT), an efficient diffusion framework for videos that decouples spatial and temporal attention. We also introduce a highly compressive 3D autoencoder to make representations compact and further accelerate training with an ad hoc training strategy. Through this initiative, we aim to foster innovation, creativity, and inclusivity within the community of AI content creation. By embracing the open-source principle, Open-Sora democratizes full access to all the training/inference/data preparation codes as well as model weights. All resources are publicly available at: https://github.com/hpcaitech/Open-Sora.
Abstract:In this paper, we present HalluCana, a canary lookahead to detect and correct factuality hallucinations of Large Language Models (LLMs) in long-form generation. HalluCana detects and intervenes as soon as traces of hallucination emerge, during and even before generation. To support timely detection, we exploit the internal factuality representation in the LLM hidden space, where we investigate various proxies to the LLMs' factuality self-assessment, and discuss its relation to the models' context familiarity from their pre-training. On biography generation, our method improves generation quality by up to 2.5x, while consuming over 6 times less compute.
Abstract:Emerging digital twin technology has the potential to revolutionize voltage control in power systems. However, the state-of-the-art digital twin method suffers from low computational and sampling efficiency, which hinders its applications. To address this issue, we propose a Gumbel-Consistency Digital Twin (GC-DT) method that enhances voltage control with improved computational and sampling efficiency. First, the proposed method incorporates a Gumbel-based strategy improvement that leverages the Gumbel-top trick to enhance non-repetitive sampling actions and reduce the reliance on Monte Carlo Tree Search simulations, thereby improving computational efficiency. Second, a consistency loss function aligns predicted hidden states with actual hidden states in the latent space, which increases both prediction accuracy and sampling efficiency. Experiments on IEEE 123-bus, 34-bus, and 13-bus systems demonstrate that the proposed GC-DT outperforms the state-of-the-art DT method in both computational and sampling efficiency.
Abstract:Recent advancements in text-to-image diffusion models have significantly transformed visual content generation, yet their application in specialized fields such as interior design remains underexplored. In this paper, we present RoomDiffusion, a pioneering diffusion model meticulously tailored for the interior design industry. To begin with, we build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. Subsequently, techniques such as multiaspect training, multi-stage fine-tune and model fusion are applied to enhance both the visual appeal and precision of the generated results. Lastly, leveraging the latent consistency Distillation method, we distill and expedite the model for optimal efficiency. Unlike existing models optimized for general scenarios, RoomDiffusion addresses specific challenges in interior design, such as lack of fashion, high furniture duplication rate, and inaccurate style. Through our holistic human evaluation protocol with more than 20 professional human evaluators, RoomDiffusion demonstrates industry-leading performance in terms of aesthetics, accuracy, and efficiency, surpassing all existing open source models such as stable diffusion and SDXL.
Abstract:Large Language Models (LLMs) are reported to hold undesirable attestation bias on inference tasks: when asked to predict if a premise P entails a hypothesis H, instead of considering H's conditional truthfulness entailed by P, LLMs tend to use the out-of-context truth label of H as a fragile proxy. In this paper, we propose a pipeline that exploits this bias to do explicit inductive inference. Our pipeline uses an LLM to transform a premise into a set of attested alternatives, and then aggregate answers of the derived new entailment inquiries to support the original inference prediction. On a directional predicate entailment benchmark, we demonstrate that by applying this simple pipeline, we can improve the overall performance of LLMs on inference and substantially alleviate the impact of their attestation bias.
Abstract:Efficient data transmission and reasonable task allocation are important to improve multi-robot exploration efficiency. However, most communication data types typically contain redundant information and thus require massive communication volume. Moreover, exploration-oriented task allocation is far from trivial and becomes even more challenging for resource-limited unmanned aerial vehicles (UAVs). In this paper, we propose a fast and communication-efficient multi-UAV exploration method for exploring large environments. We first design a multi-robot dynamic topological graph (MR-DTG) consisting of nodes representing the explored and exploring regions and edges connecting nodes. Supported by MR-DTG, our method achieves efficient communication by only transferring the necessary information required by exploration planning. To further improve the exploration efficiency, a hierarchical multi-UAV exploration method is devised using MR-DTG. Specifically, the \emph{graph Voronoi partition} is used to allocate MR-DTG's nodes to the closest UAVs, considering the actual motion cost, thus achieving reasonable task allocation. To our knowledge, this is the first work to address multi-UAV exploration using \emph{graph Voronoi partition}. The proposed method is compared with a state-of-the-art method in simulations. The results show that the proposed method is able to reduce the exploration time and communication volume by up to 38.3\% and 95.5\%, respectively. Finally, the effectiveness of our method is validated in the real-world experiment with 6 UAVs. We will release the source code to benefit the community.
Abstract:Spatio-temporal (ST) trajectories are sequences of timestamped locations, which enable a variety of analyses that in turn enable important real-world applications. It is common to map trajectories to vectors, called embeddings, before subsequent analyses. Thus, the qualities of embeddings are very important. Methods for pre-training embeddings, which leverage unlabeled trajectories for training universal embeddings, have shown promising applicability across different tasks, thus attracting considerable interest. However, research progress on this topic faces two key challenges: a lack of a comprehensive overview of existing methods, resulting in several related methods not being well-recognized, and the absence of a unified pipeline, complicating the development new methods and the analysis of methods. To overcome these obstacles and advance the field of pre-training of trajectory embeddings, we present UniTE, a survey and a unified pipeline for this domain. In doing so, we present a comprehensive list of existing methods for pre-training trajectory embeddings, which includes methods that either explicitly or implicitly employ pre-training techniques. Further, we present a unified and modular pipeline with publicly available underlying code, simplifying the process of constructing and evaluating methods for pre-training trajectory embeddings. Additionally, we contribute a selection of experimental results using the proposed pipeline on real-world datasets.
Abstract:The efficacy of large language models (LLMs) is heavily dependent on the quality of the underlying data, particularly within specialized domains. A common challenge when fine-tuning LLMs for domain-specific applications is the potential degradation of the model's generalization capabilities. To address these issues, we propose a two-stage approach for the construction of production prompts designed to yield high-quality data. This method involves the generation of a diverse array of prompts that encompass a broad spectrum of tasks and exhibit a rich variety of expressions. Furthermore, we introduce a cost-effective, multi-dimensional quality assessment framework to ensure the integrity of the generated labeling data. Utilizing a dataset comprised of service provider and customer interactions from the real estate sector, we demonstrate a positive correlation between data quality and model performance. Notably, our findings indicate that the domain-specific proficiency of general LLMs can be enhanced through fine-tuning with data produced via our proposed method, without compromising their overall generalization abilities, even when exclusively domain-specific data is employed for fine-tuning.
Abstract:Identifying and understanding user intents is a pivotal task for E-Commerce. Despite its popularity, intent understanding has not been consistently defined or accurately benchmarked. In this paper, we focus on predicative user intents as "how a customer uses a product", and pose intent understanding as a natural language reasoning task, independent of product ontologies. We identify two weaknesses of FolkScope, the SOTA E-Commerce Intent Knowledge Graph, that limit its capacity to reason about user intents and to recommend diverse useful products. Following these observations, we introduce a Product Recovery Benchmark including a novel evaluation framework and an example dataset. We further validate the above FolkScope weaknesses on this benchmark.
Abstract:Many factors influence the yielding result of a driver-pedestrian interaction, including traffic volume, vehicle speed, roadway characteristics, etc. While individual aspects of these interactions have been explored, comprehensive, naturalistic studies, particularly those considering the built environment's influence on driver-yielding behavior, are lacking. To address this gap, our study introduces an extensive open-source dataset, compiled from video data at 18 unsignalized intersections across Minnesota. Documenting more than 3000 interactions, this dataset provides a detailed view of driver-pedestrian interactions and over 50 distinct contextual variables. The data, which covers individual driver-pedestrian interactions and contextual factors, is made publicly available at https://github.com/tianyi17/pedestrian_yielding_data_MN. Using logistic regression, we developed a classification model that predicts driver yielding based on the identified variables. Our analysis indicates that vehicle speed, the presence of parking lots, proximity to parks or schools, and the width of major road crossings significantly influence driver yielding at unsignalized intersections. This study contributes to one of the most comprehensive driver-pedestrian datasets in the US, offering valuable insights for traffic safety improvements. By making this information available, our study will support communities across Minnesota and the United States in their ongoing efforts to improve road safety for pedestrians.