Abstract:Large language models (LLMs) have demonstrated strong potential and impressive performance in automating the generation and optimization of workflows. However, existing approaches are marked by limited reasoning capabilities, high computational demands, and significant resource requirements. To address these issues, we propose DebFlow, a framework that employs a debate mechanism to optimize workflows and integrates reflexion to improve based on previous experiences. We evaluated our method across six benchmark datasets, including HotpotQA, MATH, and ALFWorld. Our approach achieved a 3\% average performance improvement over the latest baselines, demonstrating its effectiveness in diverse problem domains. In particular, during training, our framework reduces resource consumption by 37\% compared to the state-of-the-art baselines. Additionally, we performed ablation studies. Removing the Debate component resulted in a 4\% performance drop across two benchmark datasets, significantly greater than the 2\% drop observed when the Reflection component was removed. These findings strongly demonstrate the critical role of Debate in enhancing framework performance, while also highlighting the auxiliary contribution of reflexion to overall optimization.
Abstract:Large Language Models (LLMs) excel at generating creative narratives but struggle with long-term coherence and emotional consistency in complex stories. To address this, we propose SCORE (Story Coherence and Retrieval Enhancement), a framework integrating three components: 1) Dynamic State Tracking (monitoring objects/characters via symbolic logic), 2) Context-Aware Summarization (hierarchical episode summaries for temporal progression), and 3) Hybrid Retrieval (combining TF-IDF keyword relevance with cosine similarity-based semantic embeddings). The system employs a temporally-aligned Retrieval-Augmented Generation (RAG) pipeline to validate contextual consistency. Evaluations show SCORE achieves 23.6% higher coherence (NCI-2.0 benchmark), 89.7% emotional consistency (EASM metric), and 41.8% fewer hallucinations versus baseline GPT models. Its modular design supports incremental knowledge graph construction for persistent story memory and multi-LLM backend compatibility, offering an explainable solution for industrial-scale narrative systems requiring long-term consistency.
Abstract:Large language models (LLMs) have made significant advances in the field of natural language processing, but they still face challenges such as continuous decision-making, lack of long-term memory, and limited context windows in dynamic environments. To address these issues, this paper proposes an innovative framework Memory-Enhanced Agents with Reflective Self-improvement. The MARS framework comprises three agents: the User, the Assistant, and the Checker. By integrating iterative feedback, reflective mechanisms, and a memory optimization mechanism based on the Ebbinghaus forgetting curve, it significantly enhances the agents capabilities in handling multi-tasking and long-span information.
Abstract:Accurate material retrieval is critical for creating realistic 3D assets. Existing methods rely on datasets that capture shape-invariant and lighting-varied representations of materials, which are scarce and face challenges due to limited diversity and inadequate real-world generalization. Most current approaches adopt traditional image search techniques. They fall short in capturing the unique properties of material spaces, leading to suboptimal performance in retrieval tasks. Addressing these challenges, we introduce MaRI, a framework designed to bridge the feature space gap between synthetic and real-world materials. MaRI constructs a shared embedding space that harmonizes visual and material attributes through a contrastive learning strategy by jointly training an image and a material encoder, bringing similar materials and images closer while separating dissimilar pairs within the feature space. To support this, we construct a comprehensive dataset comprising high-quality synthetic materials rendered with controlled shape variations and diverse lighting conditions, along with real-world materials processed and standardized using material transfer techniques. Extensive experiments demonstrate the superior performance, accuracy, and generalization capabilities of MaRI across diverse and complex material retrieval tasks, outperforming existing methods.
Abstract:Modern image generation systems can produce high-quality visuals, yet user prompts often contain ambiguities, requiring multiple revisions. Existing methods struggle to address the nuanced needs of non-expert users. We propose Visual Co-Adaptation (VCA), a novel framework that iteratively refines prompts and aligns generated images with user preferences. VCA employs a fine-tuned language model with reinforcement learning and multi-turn dialogues for prompt disambiguation. Key components include the Incremental Context-Enhanced Dialogue Block for interactive clarification, the Semantic Exploration and Disambiguation Module (SESD) leveraging Retrieval-Augmented Generation (RAG) and CLIP scoring, and the Pixel Precision and Consistency Optimization Module (PPCO) for refining image details using Proximal Policy Optimization (PPO). A human-in-the-loop feedback mechanism further improves performance. Experiments show that VCA surpasses models like DALL-E 3 and Stable Diffusion, reducing dialogue rounds to 4.3, achieving a CLIP score of 0.92, and enhancing user satisfaction to 4.73/5. Additionally, we introduce a novel multi-round dialogue dataset with prompt-image pairs and user intent annotations.
Abstract:Recent advancements in text-to-image (T2I) generation using diffusion models have enabled cost-effective video-editing applications by leveraging pre-trained models, eliminating the need for resource-intensive training. However, the frame-independence of T2I generation often results in poor temporal consistency. Existing methods address this issue through temporal layer fine-tuning or inference-based temporal propagation, but these approaches suffer from high training costs or limited temporal coherence. To address these challenges, we propose a General and Efficient Adapter (GE-Adapter) that integrates temporal-spatial and semantic consistency with Baliteral DDIM inversion. This framework introduces three key components: (1) Frame-based Temporal Consistency Blocks (FTC Blocks) to capture frame-specific features and enforce smooth inter-frame transitions via temporally-aware loss functions; (2) Channel-dependent Spatial Consistency Blocks (SCD Blocks) employing bilateral filters to enhance spatial coherence by reducing noise and artifacts; and (3) Token-based Semantic Consistency Module (TSC Module) to maintain semantic alignment using shared prompt tokens and frame-specific tokens. Our method significantly improves perceptual quality, text-image alignment, and temporal coherence, as demonstrated on the MSR-VTT dataset. Additionally, it achieves enhanced fidelity and frame-to-frame coherence, offering a practical solution for T2V editing.
Abstract:Ensuring safe, comfortable, and efficient navigation is a critical goal for autonomous driving systems. While end-to-end models trained on large-scale datasets excel in common driving scenarios, they often struggle with rare, long-tail events. Recent progress in large language models (LLMs) has introduced enhanced reasoning capabilities, but their computational demands pose challenges for real-time decision-making and precise planning. This paper presents FASIONAD, a novel dual-system framework inspired by the cognitive model "Thinking, Fast and Slow." The fast system handles routine navigation tasks using rapid, data-driven path planning, while the slow system focuses on complex reasoning and decision-making in challenging or unfamiliar situations. A dynamic switching mechanism based on score distribution and feedback allows seamless transitions between the two systems. Visual prompts generated by the fast system enable human-like reasoning in the slow system, which provides high-quality feedback to enhance the fast system's decision-making. To evaluate FASIONAD, we introduce a new benchmark derived from the nuScenes dataset, specifically designed to differentiate fast and slow scenarios. FASIONAD achieves state-of-the-art performance on this benchmark, establishing a new standard for frameworks integrating fast and slow cognitive processes in autonomous driving. This approach paves the way for more adaptive, human-like autonomous driving systems.
Abstract:Recently, large language models (LLMs) have achieved significant progress in automated code generation. Despite their strong instruction-following capabilities, these models frequently struggled to align with user intent in coding scenarios. In particular, they were hampered by datasets that lacked diversity and failed to address specialized tasks or edge cases. Furthermore, challenges in supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) led to failures in generating precise, human-intent-aligned code. To tackle these challenges and improve the code generation performance for automated programming systems, we propose Feedback-driven Adaptive Long/short-term memory reinforced Coding Optimization (i.e., FALCON). FALCON is structured into two hierarchical levels. From the global level, long-term memory improves code quality by retaining and applying learned knowledge. At the local level, short-term memory allows for the incorporation of immediate feedback from compilers and AI systems. Additionally, we introduce meta-reinforcement learning with feedback rewards to solve the global-local bi-level optimization problem and enhance the model's adaptability across diverse code generation tasks. Extensive experiments demonstrate that our technique achieves state-of-the-art performance, leading other reinforcement learning methods by more than 4.5 percentage points on the MBPP benchmark and 6.1 percentage points on the Humaneval benchmark. The open-sourced code is publicly available at https://github.com/titurte/FALCON.
Abstract:With the increasing complexity of modern power systems, conventional dynamic load modeling with ZIP and induction motors (ZIP + IM) is no longer adequate to address the current load characteristic transitions. In recent years, the WECC composite load model (WECC CLM) has shown to effectively capture the dynamic load responses over traditional load models in various stability studies and contingency analyses. However, a detailed WECC CLM model typically has a high degree of complexity, with over one hundred parameters, and no systematic approach to identifying and calibrating these parameters. Enabled by the wide deployment of PMUs and advanced deep learning algorithms, proposed here is a double deep Q-learning network (DDQN)-based, two-stage load modeling framework for the WECC CLM. This two-stage method decomposes the complicated WECC CLM for more efficient identification and does not require explicit model details. In the first stage, the DDQN agent determines an accurate load composition. In the second stage, the parameters of the WECC CLM are selected from a group of Monte-Carlo simulations. The set of selected load parameters is expected to best approximate the true transient responses. The proposed framework is verified using an IEEE 39-bus test system on commercial simulation platforms.
Abstract:Creating and modeling real-world graphs is a crucial problem in various applications of engineering, biology, and social sciences; however, learning the distributions of nodes/edges and sampling from them to generate realistic graphs is still challenging. Moreover, generating a diverse set of synthetic graphs that all imitate a real network is not addressed. In this paper, the novel problem of creating diverse synthetic graphs is solved. First, we devise the deep supervised subset selection (DeepS3) algorithm; Given a ground-truth set of data points, DeepS3 selects a diverse subset of all items (i.e. data points) that best represent the items in the ground-truth set. Furthermore, we propose the deep graph representation recurrent network (GRRN) as a novel generative model that learns a probabilistic representation of a real weighted graph. Training the GRRN, we generate a large set of synthetic graphs that are likely to follow the same features and adjacency patterns as the original one. Incorporating GRRN with DeepS3, we select a diverse subset of generated graphs that best represent the behaviors of the real graph (i.e. our ground-truth). We apply our model to the novel problem of power grid synthesis, where a synthetic power network is created with the same physical/geometric properties as a real power system without revealing the real locations of the substations (nodes) and the lines (edges), since such data is confidential. Experiments on the Synthetic Power Grid Data Set show accurate synthetic networks that follow similar structural and spatial properties as the real power grid.