Fred
Abstract:Object detection plays a crucial role in smart video analysis, with applications ranging from autonomous driving and security to smart cities. However, achieving real-time object detection on edge devices presents significant challenges due to their limited computational resources and the high demands of deep neural network (DNN)-based detection models, particularly when processing high-resolution video. Conventional strategies, such as input down-sampling and network up-scaling, often compromise detection accuracy for faster performance or lead to higher inference latency. To address these issues, this paper introduces RE-POSE, a Reinforcement Learning (RL)-Driven Partitioning and Edge Offloading framework designed to optimize the accuracy-latency trade-off in resource-constrained edge environments. Our approach features an RL-Based Dynamic Clustering Algorithm (RL-DCA) that partitions video frames into non-uniform blocks based on object distribution and the computational characteristics of DNNs. Furthermore, a parallel edge offloading scheme is implemented to distribute these blocks across multiple edge servers for concurrent processing. Experimental evaluations show that RE-POSE significantly enhances detection accuracy and reduces inference latency, surpassing existing methods.
Abstract:In light of the advancements in transformer technology, extant research posits the construction of stereo transformers as a potential solution to the binocular stereo matching challenge. However, constrained by the low-rank bottleneck and quadratic complexity of attention mechanisms, stereo transformers still fail to demonstrate sufficient nonlinear expressiveness within a reasonable inference time. The lack of focus on key homonymous points renders the representations of such methods vulnerable to challenging conditions, including reflections and weak textures. Furthermore, a slow computing speed is not conducive to the application. To overcome these difficulties, we present the \textbf{H}adamard \textbf{A}ttention \textbf{R}ecurrent Stereo \textbf{T}ransformer (HART) that incorporates the following components: 1) For faster inference, we present a Hadamard product paradigm for the attention mechanism, achieving linear computational complexity. 2) We designed a Dense Attention Kernel (DAK) to amplify the differences between relevant and irrelevant feature responses. This allows HART to focus on important details. DAK also converts zero elements to non-zero elements to mitigate the reduced expressiveness caused by the low-rank bottleneck. 3) To compensate for the spatial and channel interaction missing in the Hadamard product, we propose MKOI to capture both global and local information through the interleaving of large and small kernel convolutions. Experimental results demonstrate the effectiveness of our HART. In reflective area, HART ranked \textbf{1st} on the KITTI 2012 benchmark among all published methods at the time of submission. Code is available at \url{https://github.com/ZYangChen/HART}.
Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. To this end, we investigate the problem of maintaining faithful integrity in LLMs. This involves ensuring that LLMs adhere to their faithful statements in the face of opposing arguments and are able to correct their incorrect statements when presented with faithful arguments. In this work, we propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation (AFICE), which aims to align the LLM responses with faithful integrity. Specifically, AFICE first designs a Bilateral Confidence Estimation (BCE) approach for estimating the uncertainty of each response generated by the LLM given a specific context, which simultaneously estimate the model's confidence to the question based on the internal states during decoding as well as to the answer based on cumulative probability ratios. With the BCE, we construct a conversational preference dataset composed of context, original statement, and argument, which is adopted for aligning the LLM for faithful integrity using Direct Preference Optimization (DPO). Extensive experimental results on a wide range of benchmarks demonstrate significant improvements in the LLM's ability to maintain faithful responses when encountering opposing arguments, ensuring both the practical utility and trustworthiness of LLMs in complex interactive settings. Code and data will be released via https://github.com/zhaoy777/AFICE.git
Abstract:Although large language models (LLMs) store vast amount of knowledge in their parameters, they still have limitations in the memorization and utilization of certain knowledge, leading to undesired behaviors such as generating untruthful and inaccurate responses. This highlights the critical need to understand the knowledge boundary of LLMs, a concept that remains inadequately defined in existing research. In this survey, we propose a comprehensive definition of the LLM knowledge boundary and introduce a formalized taxonomy categorizing knowledge into four distinct types. Using this foundation, we systematically review the field through three key lenses: the motivation for studying LLM knowledge boundaries, methods for identifying these boundaries, and strategies for mitigating the challenges they present. Finally, we discuss open challenges and potential research directions in this area. We aim for this survey to offer the community a comprehensive overview, facilitate access to key issues, and inspire further advancements in LLM knowledge research.
Abstract:Real-world applications of stereo matching, such as autonomous driving, place stringent demands on both safety and accuracy. However, learning-based stereo matching methods inherently suffer from the loss of geometric structures in certain feature channels, creating a bottleneck in achieving precise detail matching. Additionally, these methods lack interpretability due to the black-box nature of deep learning. In this paper, we propose MoCha-V2, a novel learning-based paradigm for stereo matching. MoCha-V2 introduces the Motif Correlation Graph (MCG) to capture recurring textures, which are referred to as ``motifs" within feature channels. These motifs reconstruct geometric structures and are learned in a more interpretable way. Subsequently, we integrate features from multiple frequency domains through wavelet inverse transformation. The resulting motif features are utilized to restore geometric structures in the stereo matching process. Experimental results demonstrate the effectiveness of MoCha-V2. MoCha-V2 achieved 1st place on the Middlebury benchmark at the time of its release. Code is available at https://github.com/ZYangChen/MoCha-Stereo.
Abstract:Traditional rule-based physical models are limited by their reliance on singular physical formulas and parameters, making it difficult to effectively tackle the intricate tasks associated with crowd simulation. Recent research has introduced deep learning methods to tackle these issues, but most current approaches focus primarily on generating pedestrian trajectories, often lacking interpretability and failing to provide real-time dynamic simulations.To address the aforementioned issues, we propose a novel data-driven crowd simulation framework that integrates Physics-informed Machine Learning (PIML) with navigation potential fields. Our approach leverages the strengths of both physical models and PIML. Specifically, we design an innovative Physics-informed Spatio-temporal Graph Convolutional Network (PI-STGCN) as a data-driven module to predict pedestrian movement trends based on crowd spatio-temporal data. Additionally, we construct a physical model of navigation potential fields based on flow field theory to guide pedestrian movements, thereby reinforcing physical constraints during the simulation. In our framework, navigation potential fields are dynamically computed and updated based on the movement trends predicted by the PI-STGCN, while the updated crowd dynamics, guided by these fields, subsequently feed back into the PI-STGCN. Comparative experiments on two publicly available large-scale real-world datasets across five scenes demonstrate that our proposed framework outperforms existing rule-based methods in accuracy and fidelity. The similarity between simulated and actual pedestrian trajectories increases by 10.8%, while the average error is reduced by 4%. Moreover, our framework exhibits greater adaptability and better interpretability compared to methods that rely solely on deep learning for trajectory generation.
Abstract:Spatial Crowdsourcing (SC) is gaining traction in both academia and industry, with tasks on SC platforms becoming increasingly complex and requiring collaboration among workers with diverse skills. Recent research works address complex tasks by dividing them into subtasks with dependencies and assigning them to suitable workers. However, the dependencies among subtasks and their heterogeneous skill requirements, as well as the need for efficient utilization of workers' limited work time in the multi-task allocation mode, pose challenges in achieving an optimal task allocation scheme. Therefore, this paper formally investigates the problem of Dependency-aware Multi-task Allocation (DMA) and presents a well-designed framework to solve it, known as Heterogeneous Graph Reinforcement Learning-based Task Allocation (HGRL-TA). To address the challenges associated with representing and embedding diverse problem instances to ensure robust generalization, we propose a multi-relation graph model and a Compound-path-based Heterogeneous Graph Attention Network (CHANet) for effectively representing and capturing intricate relations among tasks and workers, as well as providing embedding of problem state. The task allocation decision is determined sequentially by a policy network, which undergoes simultaneous training with CHANet using the proximal policy optimization algorithm. Extensive experiment results demonstrate the effectiveness and generality of the proposed HGRL-TA in solving the DMA problem, leading to average profits that is 21.78% higher than those achieved using the metaheuristic methods.
Abstract:Robotic manipulation, owing to its multi-modal nature, often faces significant training ambiguity, necessitating explicit instructions to clearly delineate the manipulation details in tasks. In this work, we highlight that vision instruction is naturally more comprehensible to recent robotic policies than the commonly adopted text instruction, as these policies are born with some vision understanding ability like human infants. Building on this premise and drawing inspiration from cognitive science, we introduce the robotic imagery paradigm, which realizes large-scale robotic data pre-training without text annotations. Additionally, we propose the robotic gaze strategy that emulates the human eye gaze mechanism, thereby guiding subsequent actions and focusing the attention of the policy on the manipulated object. Leveraging these innovations, we develop VIRT, a fully Transformer-based policy. We design comprehensive tasks using both a physical robot and simulated environments to assess the efficacy of VIRT. The results indicate that VIRT can complete very competitive tasks like ``opening the lid of a tightly sealed bottle'', and the proposed techniques boost the success rates of the baseline policy on diverse challenging tasks from nearly 0% to more than 65%.
Abstract:Object grasping is a crucial technology enabling robots to perceive and interact with the environment sufficiently. However, in practical applications, researchers are faced with missing or noisy ground truth while training the convolutional neural network, which decreases the accuracy of the model. Therefore, different loss functions are proposed to deal with these problems to improve the accuracy of the neural network. For missing ground truth, a new predicted category probability method is defined for unlabeled samples, which works effectively in conjunction with the pseudo-labeling method. Furthermore, for noisy ground truth, a symmetric loss function is introduced to resist the corruption of label noises. The proposed loss functions are powerful, robust, and easy to use. Experimental results based on the typical grasping neural network show that our method can improve performance by 2 to 13 percent.
Abstract:Multi-modal relation extraction (MMRE) is a challenging task that aims to identify relations between entities in text leveraging image information. Existing methods are limited by their neglect of the multiple entity pairs in one sentence sharing very similar contextual information (ie, the same text and image), resulting in increased difficulty in the MMRE task. To address this limitation, we propose the Variational Multi-Modal Hypergraph Attention Network (VM-HAN) for multi-modal relation extraction. Specifically, we first construct a multi-modal hypergraph for each sentence with the corresponding image, to establish different high-order intra-/inter-modal correlations for different entity pairs in each sentence. We further design the Variational Hypergraph Attention Networks (V-HAN) to obtain representational diversity among different entity pairs using Gaussian distribution and learn a better hypergraph structure via variational attention. VM-HAN achieves state-of-the-art performance on the multi-modal relation extraction task, outperforming existing methods in terms of accuracy and efficiency.