Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jingwen Wang

Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation

Apr 23, 2025

Jiahao Yuan, Xingzhe Sun, Xing Yu, Jingwen Wang, Dehui Du, Zhiqing Cui, Zixiang Di

Abstract:The XLLM@ACL2025 Shared Task-III formulates a low-resource structural reasoning task that challenges LLMs to generate interpretable, step-by-step rationales with minimal labeled data. We present Less is More, the third-place winning approach in the XLLM@ACL2025 Shared Task-III, which focuses on structured reasoning from only 24 labeled examples. Our approach leverages a multi-agent framework with reverse-prompt induction, retrieval-augmented reasoning synthesis via GPT-4o, and dual-stage reward-guided filtering to distill high-quality supervision across three subtasks: question parsing, CoT parsing, and step-level verification. All modules are fine-tuned from Meta-Llama-3-8B-Instruct under a unified LoRA+ setup. By combining structure validation with reward filtering across few-shot and zero-shot prompts, our pipeline consistently improves structure reasoning quality. These results underscore the value of controllable data distillation in enhancing structured inference under low-resource constraints. Our code is available at https://github.com/Jiahao-Yuan/Less-is-More.

Via

Access Paper or Ask Questions

Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Sep 26, 2024

Jian Li, Haojing Huang, Yujia Zhang, Pengfei Xu, Xi Chen, Rui Song, Lida Shi, Jingwen Wang, Hao Xu

Figure 1 for Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Figure 2 for Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Figure 3 for Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Figure 4 for Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Abstract:Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These approaches commonly use a binary cross-entropy mechanism on pairwise samples, i.e., minimizing and maximizing the loss based on preferred or dis-preferred responses, respectively. However, while this training strategy omits the reward model, it also overlooks the varying preference degrees within different responses. We hypothesize that this is a key factor hindering LLMs from sufficiently understanding human preferences. To address this problem, we propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss, thereby helping LLMs improve their ability to understand the degree of preference. Extensive experiments are conducted on two widely used datasets of different tasks. The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods and significantly boost their performance to achieve state-of-the-art performance. We also conduct detailed analyses to offer comprehensive insights into SPO, which verifies its effectiveness. The code is available at https://github.com/lijian16/SPO.

* Accepted at EMNLP 2024 Findings

Via

Access Paper or Ask Questions

CIER: A Novel Experience Replay Approach with Causal Inference in Deep Reinforcement Learning

May 14, 2024

Jingwen Wang, Dehui Du, Yida Li, Yiyang Li, Yikang Chen

Abstract:In the training process of Deep Reinforcement Learning (DRL), agents require repetitive interactions with the environment. With an increase in training volume and model complexity, it is still a challenging problem to enhance data utilization and explainability of DRL training. This paper addresses these challenges by focusing on the temporal correlations within the time dimension of time series. We propose a novel approach to segment multivariate time series into meaningful subsequences and represent the time series based on these subsequences. Furthermore, the subsequences are employed for causal inference to identify fundamental causal factors that significantly impact training outcomes. We design a module to provide feedback on the causality during DRL training. Several experiments demonstrate the feasibility of our approach in common environments, confirming its ability to enhance the effectiveness of DRL training and impart a certain level of explainability to the training process. Additionally, we extended our approach with priority experience replay algorithm, and experimental results demonstrate the continued effectiveness of our approach.

Via

Access Paper or Ask Questions

MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video

Dec 01, 2023

Hengyi Wang, Jingwen Wang, Lourdes Agapito

Abstract:Neural rendering has demonstrated remarkable success in dynamic scene reconstruction. Thanks to the expressiveness of neural representations, prior works can accurately capture the motion and achieve high-fidelity reconstruction of the target object. Despite this, real-world video scenarios often feature large unobserved regions where neural representations struggle to achieve realistic completion. To tackle this challenge, we introduce MorpheuS, a framework for dynamic 360{\deg} surface reconstruction from a casually captured RGB-D video. Our approach models the target scene as a canonical field that encodes its geometry and appearance, in conjunction with a deformation field that warps points from the current frame to the canonical space. We leverage a view-dependent diffusion prior and distill knowledge from it to achieve realistic completion of unobserved regions. Experimental results on various real-world and synthetic datasets show that our method can achieve high-fidelity 360{\deg} surface reconstruction of a deformable object from a monocular RGB-D video.

* Project page: https://hengyiwang.github.io/projects/morpheus

Via

Access Paper or Ask Questions

SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation

Jun 28, 2023

Jingwen Wang, Juan Tarrio, Lourdes Agapito, Pablo F. Alcantarilla, Alexander Vakhitov

Figure 1 for SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation

Figure 2 for SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation

Figure 3 for SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation

Figure 4 for SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation

Abstract:The availability of real-time semantics greatly improves the core geometric functionality of SLAM systems, enabling numerous robotic and AR/VR applications. We present a new methodology for real-time semantic mapping from RGB-D sequences that combines a 2D neural network and a 3D network based on a SLAM system with 3D occupancy mapping. When segmenting a new frame we perform latent feature re-projection from previous frames based on differentiable rendering. Fusing re-projected feature maps from previous frames with current-frame features greatly improves image segmentation quality, compared to a baseline that processes images independently. For 3D map processing, we propose a novel geometric quasi-planar over-segmentation method that groups 3D map elements likely to belong to the same semantic classes, relying on surface normals. We also describe a novel neural network design for lightweight semantic map post-processing. Our system achieves state-of-the-art semantic mapping quality within 2D-3D networks-based systems and matches the performance of 3D convolutional networks on three real indoor datasets, while working in real-time. Moreover, it shows better cross-sensor generalization abilities compared to 3D CNNs, enabling training and inference with different depth sensors. Code and data will be released on project page: http://jingwenwang95.github.io/SeMLaPS

* 8 pages, 7 figures, submitted to RA-L. Project page: http://jingwenwang95.github.io/SeMLaPS

Via

Access Paper or Ask Questions

First Place Solution to the CVPR'2023 AQTC Challenge: A Function-Interaction Centric Approach with Spatiotemporal Visual-Language Alignment

Jun 23, 2023

Tom Tongjia Chen, Hongshan Yu, Zhengeng Yang, Ming Li, Zechuan Li, Jingwen Wang, Wei Miao, Wei Sun, Chen Chen

Figure 1 for First Place Solution to the CVPR'2023 AQTC Challenge: A Function-Interaction Centric Approach with Spatiotemporal Visual-Language Alignment

Figure 2 for First Place Solution to the CVPR'2023 AQTC Challenge: A Function-Interaction Centric Approach with Spatiotemporal Visual-Language Alignment

Figure 3 for First Place Solution to the CVPR'2023 AQTC Challenge: A Function-Interaction Centric Approach with Spatiotemporal Visual-Language Alignment

Figure 4 for First Place Solution to the CVPR'2023 AQTC Challenge: A Function-Interaction Centric Approach with Spatiotemporal Visual-Language Alignment

Abstract:Affordance-Centric Question-driven Task Completion (AQTC) has been proposed to acquire knowledge from videos to furnish users with comprehensive and systematic instructions. However, existing methods have hitherto neglected the necessity of aligning spatiotemporal visual and linguistic signals, as well as the crucial interactional information between humans and objects. To tackle these limitations, we propose to combine large-scale pre-trained vision-language and video-language models, which serve to contribute stable and reliable multimodal data and facilitate effective spatiotemporal visual-textual alignment. Additionally, a novel hand-object-interaction (HOI) aggregation module is proposed which aids in capturing human-object interaction information, thereby further augmenting the capacity to understand the presented scenario. Our method achieved first place in the CVPR'2023 AQTC Challenge, with a Recall@1 score of 78.7\%. The code is available at https://github.com/tomchen-ctj/CVPR23-LOVEU-AQTC.

* Winner of CVPR2023 Long-form Video Understanding and Generation Challenge (Track 3)

Via

Access Paper or Ask Questions

Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM

Apr 27, 2023

Hengyi Wang, Jingwen Wang, Lourdes Agapito

Abstract:We present Co-SLAM, a neural RGB-D SLAM system based on a hybrid representation, that performs robust camera tracking and high-fidelity surface reconstruction in real time. Co-SLAM represents the scene as a multi-resolution hash-grid to exploit its high convergence speed and ability to represent high-frequency local features. In addition, Co-SLAM incorporates one-blob encoding, to encourage surface coherence and completion in unobserved areas. This joint parametric-coordinate encoding enables real-time and robust performance by bringing the best of both worlds: fast convergence and surface hole filling. Moreover, our ray sampling strategy allows Co-SLAM to perform global bundle adjustment over all keyframes instead of requiring keyframe selection to maintain a small number of active keyframes as competing neural SLAM approaches do. Experimental results show that Co-SLAM runs at 10-17Hz and achieves state-of-the-art scene reconstruction results, and competitive tracking performance in various datasets and benchmarks (ScanNet, TUM, Replica, Synthetic RGBD). Project page: https://hengyiwang.github.io/projects/CoSLAM

* CVPR2023. First two authors contributed equally. Project page: https://hengyiwang.github.io/projects/CoSLAM

Via

Access Paper or Ask Questions

MalIoT: Scalable and Real-time Malware Traffic Detection for IoT Networks

Apr 02, 2023

Ethan Weitkamp, Yusuke Satani, Adam Omundsen, Jingwen Wang, Peilong Li

Figure 1 for MalIoT: Scalable and Real-time Malware Traffic Detection for IoT Networks

Figure 2 for MalIoT: Scalable and Real-time Malware Traffic Detection for IoT Networks

Figure 3 for MalIoT: Scalable and Real-time Malware Traffic Detection for IoT Networks

Figure 4 for MalIoT: Scalable and Real-time Malware Traffic Detection for IoT Networks

Abstract:The machine learning approach is vital in Internet of Things (IoT) malware traffic detection due to its ability to keep pace with the ever-evolving nature of malware. Machine learning algorithms can quickly and accurately analyze the vast amount of data produced by IoT devices, allowing for the real-time identification of malicious network traffic. The system can handle the exponential growth of IoT devices thanks to the usage of distributed systems like Apache Kafka and Apache Spark, and Intel's oneAPI software stack accelerates model inference speed, making it a useful tool for real-time malware traffic detection. These technologies work together to create a system that can give scalable performance and high accuracy, making it a crucial tool for defending against cyber threats in smart communities and medical institutions.

Via

Access Paper or Ask Questions

CTT-Net: A Multi-view Cross-token Transformer for Cataract Postoperative Visual Acuity Prediction

Dec 12, 2022

Jinhong Wang, Jingwen Wang, Tingting Chen, Wenhao Zheng, Zhe Xu, Xingdi Wu, Wen Xu, Haochao Ying, Danny Chen, Jian Wu

Abstract:Surgery is the only viable treatment for cataract patients with visual acuity (VA) impairment. Clinically, to assess the necessity of cataract surgery, accurately predicting postoperative VA before surgery by analyzing multi-view optical coherence tomography (OCT) images is crucially needed. Unfortunately, due to complicated fundus conditions, determining postoperative VA remains difficult for medical experts. Deep learning methods for this problem were developed in recent years. Although effective, these methods still face several issues, such as not efficiently exploring potential relations between multi-view OCT images, neglecting the key role of clinical prior knowledge (e.g., preoperative VA value), and using only regression-based metrics which are lacking reference. In this paper, we propose a novel Cross-token Transformer Network (CTT-Net) for postoperative VA prediction by analyzing both the multi-view OCT images and preoperative VA. To effectively fuse multi-view features of OCT images, we develop cross-token attention that could restrict redundant/unnecessary attention flow. Further, we utilize the preoperative VA value to provide more information for postoperative VA prediction and facilitate fusion between views. Moreover, we design an auxiliary classification loss to improve model performance and assess VA recovery more sufficiently, avoiding the limitation by only using the regression metrics. To evaluate CTT-Net, we build a multi-view OCT image dataset collected from our collaborative hospital. A set of extensive experiments validate the effectiveness of our model compared to existing methods in various metrics. Code is available at: https://github.com/wjh892521292/Cataract OCT.

* 5 pages, 3 figures, accepted for publication in BIBM

Via

Access Paper or Ask Questions

Visual Subtitle Feature Enhanced Video Outline Generation

Sep 01, 2022

Qi Lv, Ziqiang Cao, Wenrui Xie, Derui Wang, Jingwen Wang, Zhiwei Hu, Tangkun Zhang, Ba Yuan, Yuanhang Li, Min Cao(+3 more)

Figure 1 for Visual Subtitle Feature Enhanced Video Outline Generation

Figure 2 for Visual Subtitle Feature Enhanced Video Outline Generation

Figure 3 for Visual Subtitle Feature Enhanced Video Outline Generation

Figure 4 for Visual Subtitle Feature Enhanced Video Outline Generation

Abstract:With the tremendously increasing number of videos, there is a great demand for techniques that help people quickly navigate to the video segments they are interested in. However, current works on video understanding mainly focus on video content summarization, while little effort has been made to explore the structure of a video. Inspired by textual outline generation, we introduce a novel video understanding task, namely video outline generation (VOG). This task is defined to contain two sub-tasks: (1) first segmenting the video according to the content structure and then (2) generating a heading for each segment. To learn and evaluate VOG, we annotate a 10k+ dataset, called DuVOG. Specifically, we use OCR tools to recognize subtitles of videos. Then annotators are asked to divide subtitles into chapters and title each chapter. In videos, highlighted text tends to be the headline since it is more likely to attract attention. Therefore we propose a Visual Subtitle feature Enhanced video outline generation model (VSENet) which takes as input the textual subtitles together with their visual font sizes and positions. We consider the VOG task as a sequence tagging problem that extracts spans where the headings are located and then rewrites them to form the final outlines. Furthermore, based on the similarity between video outlines and textual outlines, we use a large number of articles with chapter headings to pretrain our model. Experiments on DuVOG show that our model largely outperforms other baseline methods, achieving 77.1 of F1-score for the video segmentation level and 85.0 of ROUGE-L_F0.5 for the headline generation level.

Via

Access Paper or Ask Questions