Cong Yang

Soochow University

EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning

May 05, 2025

Lingxiao Kong, Cong Yang, Susanne Neufang, Oya Deniz Beyan, Zeyd Boukhers

Abstract: Recent advances in reinforcement learning (RL) for large language model (LLM) fine-tuning show promise in addressing multi-objective tasks but still face significant challenges, including complex objective balancing, low training efficiency, poor scalability, and limited explainability. Leveraging ensemble learning principles, we introduce an Ensemble Multi-Objective RL (EMORL) framework that fine-tunes multiple models with individual objectives while optimizing their aggregation after training to improve efficiency and flexibility. Our method is the first to aggregate the last hidden states of individual models, incorporating contextual information from multiple objectives. This approach is supported by a hierarchical grid search algorithm that identifies optimal weighted combinations. We evaluate EMORL on counselor reflection generation tasks, using text-scoring LLMs to evaluate the generations and provide rewards during RL fine-tuning. Through comprehensive experiments on the PAIR and Psych8k datasets, we demonstrate the advantages of EMORL against existing baselines: significantly lower and more stable training consumption ($17,529\pm 1,650$ data points and $6,573\pm 147.43$ seconds), improved scalability and explainability, and comparable performance across multiple objectives.

* 13 pages, 9 figures, submitted to SIGDIAL 2025 conference
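
The EMORL abstract describes a weighted aggregation of last hidden states tuned by a hierarchical grid search. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the coarse-to-fine refinement rule, and the `score_fn` callback (standing in for whatever reward the aggregated model earns) are all assumptions made for illustration.

```python
import itertools
import numpy as np

def aggregate_hidden_states(hidden_states, weights):
    """Weighted sum of per-model last hidden states, each of shape [seq, dim]."""
    stacked = np.stack(hidden_states)          # [n_models, seq, dim]
    w = np.asarray(weights, dtype=float)       # [n_models]
    return np.tensordot(w, stacked, axes=1)    # [seq, dim]

def hierarchical_grid_search(score_fn, n_models, levels=3, points=3,
                             low=0.0, high=1.0):
    """Coarse-to-fine grid search over one weight per model (hypothetical scheme).

    Each level evaluates a small grid, then halves the search box
    around the best weights found so far.
    """
    lo = np.full(n_models, low)
    hi = np.full(n_models, high)
    best_w, best_s = None, -np.inf
    for _ in range(levels):
        axes = [np.linspace(lo[i], hi[i], points) for i in range(n_models)]
        for cand in itertools.product(*axes):
            s = score_fn(np.array(cand))
            if s > best_s:
                best_w, best_s = np.array(cand), s
        half = (hi - lo) / 4.0                  # new box has half the old width
        lo = np.maximum(low, best_w - half)
        hi = np.minimum(high, best_w + half)
    return best_w, best_s
```

In this sketch, each level costs `points ** n_models` evaluations, so the refinement trades a single dense grid for a few small ones; whether the paper uses the same refinement rule is not stated in the abstract.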

Comparison of Feature Learning Methods for Metadata Extraction from PDF Scholarly Documents

Jan 09, 2025

Zeyd Boukhers, Cong Yang

Abstract: The availability of metadata for scientific documents is pivotal in propelling scientific knowledge forward and for adhering to the FAIR principles (i.e., Findability, Accessibility, Interoperability, and Reusability) of research findings. However, the lack of sufficient metadata in published documents, particularly those from smaller and mid-sized publishers, hinders their accessibility. This issue is widespread in some disciplines, such as the German Social Sciences, where publications often employ diverse templates. To address this challenge, our study evaluates various feature learning and prediction methods, including natural language processing (NLP), computer vision (CV), and multimodal approaches, for extracting metadata from documents with high template variance. We aim to improve the accessibility of scientific documents and facilitate their wider use. To support our comparison of these methods, we provide comprehensive experimental results, analyzing their accuracy and efficiency in extracting metadata. Additionally, we provide valuable insights into the strengths and weaknesses of various feature learning and prediction methods, which can guide future research in this field.

A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space

Dec 19, 2024

Yonghao He, Hu Su, Haiyong Yu, Cong Yang, Wei Sui, Cong Wang, Song Liu

Abstract: Open-set object detection (OSOD) is highly desirable for robotic manipulation in unstructured environments. However, existing OSOD methods often fail to meet the requirements of robotic applications due to their high computational burden and complex deployment. To address this issue, this paper proposes a light-weight framework called Decoupled OSOD (DOSOD), which is a practical and highly efficient solution to support real-time OSOD tasks in robotic systems. Specifically, DOSOD builds upon the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor is developed to transform text embeddings extracted by the VLM into a joint space, within which the detector learns the region representations of class-agnostic proposals. Cross-modality features are directly aligned in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. DOSOD operates like a traditional closed-set detector during the testing phase, effectively bridging the gap between closed-set and open-set detection. Compared to the baseline YOLO-World, the proposed DOSOD significantly enhances real-time performance while maintaining comparable accuracy. The small DOSOD-S model achieves a Fixed AP of $26.7\%$, compared to $26.2\%$ for YOLO-World-v1-S and $22.7\%$ for YOLO-World-v2-S, using similar backbones on the LVIS minival dataset. Meanwhile, the FPS of DOSOD-S is $57.1\%$ higher than YOLO-World-v1-S and $29.6\%$ higher than YOLO-World-v2-S. We also demonstrate that the DOSOD model facilitates deployment on edge devices. The code and models are publicly available at https://github.com/D-Robotics-AI-Lab/DOSOD.
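
The decoupled alignment described in the DOSOD abstract can be illustrated with a small sketch: an MLP maps text embeddings into a joint space, and region features are scored against them by cosine similarity. This is an assumption-laden illustration rather than the released code; the two-layer adaptor, the cosine scoring, and every name and shape below are hypothetical.

```python
import numpy as np

def mlp_adaptor(text_emb, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping text embeddings into the joint space."""
    hidden = np.maximum(text_emb @ w1 + b1, 0.0)   # ReLU
    return hidden @ w2 + b2

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def classify_regions(region_feats, class_emb):
    """Score every region feature against every adapted class embedding.

    region_feats: [n_regions, d] features already in the joint space.
    class_emb:    [n_classes, d] adapted text embeddings.
    Returns the best class index per region and the full similarity matrix.
    """
    sims = l2_normalize(region_feats) @ l2_normalize(class_emb).T
    return sims.argmax(axis=1), sims
```

Under these assumptions the adapted class embeddings depend only on the vocabulary, so they can be computed once and reused for every image, which is one plausible reading of why the model behaves like a closed-set detector at test time.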

Large Language Model in Medical Informatics: Direct Classification and Enhanced Text Representations for Automatic ICD Coding

Nov 11, 2024

Zeyd Boukhers, AmeerAli Khan, Qusai Ramadan, Cong Yang

Abstract: Accurately classifying International Classification of Diseases (ICD) codes from medical discharge summaries is challenging due to the intricate nature of medical documentation. This paper explores the use of Large Language Models (LLMs), specifically the LLAMA architecture, to enhance ICD code classification through two methodologies: direct application as a classifier and as a generator of enriched text representations within a Multi-Filter Residual Convolutional Neural Network (MultiResCNN) framework. We evaluate these methods by comparing them against state-of-the-art approaches, revealing LLAMA's potential to significantly improve classification outcomes by providing deep contextual insights into medical texts.

* accepted at the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2024)

CAMAv2: A Vision-Centric Approach for Static Map Element Annotation

Jul 31, 2024

Shiyuan Chen, Jiaxin Zhang, Ruohong Mei, Yingfeng Cai, Haoran Yin, Tao Chen, Wei Sui, Cong Yang

Abstract: The recent development of online static map element (a.k.a. HD map) construction algorithms has raised a vast demand for data with ground truth annotations. However, available public datasets currently cannot provide high-quality training data regarding consistency and accuracy. For instance, the manually labelled (low-efficiency) nuScenes dataset still contains misalignment and inconsistency between the HD maps and images (e.g., around 8.03 pixels reprojection error on average). To this end, we present CAMAv2: a vision-centric approach for Consistent and Accurate Map Annotation. Without LiDAR inputs, our proposed framework can still generate high-quality 3D annotations of static map elements. Specifically, the annotation can achieve high reprojection accuracy across all surrounding cameras and is spatio-temporally consistent across the whole sequence. We apply our proposed framework to the popular nuScenes dataset to provide efficient and highly accurate annotations. Compared with the original nuScenes static map elements, our CAMAv2 annotations achieve lower reprojection errors (e.g., 4.96 vs. 8.03 pixels). Models trained with annotations from CAMAv2 also achieve lower reprojection errors (e.g., 5.62 vs. 8.43 pixels).

* arXiv admin note: text overlap with arXiv:2309.11754
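
The reprojection error quoted in the CAMAv2 abstract is a standard quantity: the pixel distance between where a 3D annotation projects into an image and where the corresponding element is observed. A minimal sketch follows, assuming a distortion-free pinhole camera and a world-to-camera pose `(R, t)`; the paper's exact evaluation protocol is not given in the abstract, and all names here are illustrative.

```python
import numpy as np

def project_points(points_world, K_cam, R, t):
    """Project [n, 3] world points to [n, 2] pixels with a pinhole model.

    K_cam: 3x3 intrinsics; R, t: world-to-camera rotation and translation.
    Assumes no lens distortion and that all points lie in front of the camera.
    """
    cam = points_world @ R.T + t          # world frame -> camera frame
    uvw = cam @ K_cam.T                   # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]       # perspective divide

def mean_reprojection_error(points_world, observed_px, K_cam, R, t):
    """Mean Euclidean distance in pixels between projections and observations."""
    projected = project_points(points_world, K_cam, R, t)
    return float(np.linalg.norm(projected - observed_px, axis=1).mean())
```

Averaging this value over all cameras and frames of a sequence would give a single figure comparable in kind to the pixel errors reported above.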

MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description

Jun 07, 2024

Cong Yang, Zuchao Li, Lefei Zhang

Abstract: Recently, large multimodal models have built a bridge from visual to textual information, but they tend to underperform in remote sensing scenarios. This underperformance is due to the complex distribution of objects and the significant scale differences among targets in remote sensing images, leading to visual ambiguities and insufficient descriptions by these multimodal models. Moreover, the lack of multimodal fine-tuning data specific to the remote sensing field makes it challenging for the model's behavior to align with user queries. To address these issues, this paper proposes an attribute-guided Multi-Granularity Instruction Multimodal Model (MGIMM) for remote sensing image detailed description. MGIMM guides the multimodal model to learn the consistency between visual regions and corresponding text attributes (such as object names, colors, and shapes) through region-level instruction tuning. Then, with the multimodal model aligned on region-attribute, guided by multi-grain visual features, MGIMM fully perceives both region-level and global image information, utilizing large language models for comprehensive descriptions of remote sensing images. Due to the lack of a standard benchmark for generating detailed descriptions of remote sensing images, we construct a dataset featuring 38,320 region-attribute pairs and 23,463 image-detailed description pairs. Compared with various advanced methods on this dataset, the results demonstrate the effectiveness of MGIMM's region-attribute guided learning approach. Code is available at https://github.com/yangcong356/MGIMM.git

Falcon 7b for Software Mention Detection in Scholarly Documents

May 14, 2024

AmeerAli Khan, Qusai Ramadan, Cong Yang, Zeyd Boukhers

Abstract: This paper aims to tackle the challenge posed by the increasing integration of software tools in research across various disciplines by investigating the application of Falcon-7b for the detection and classification of software mentions within scholarly texts. Specifically, the study focuses on solving Subtask I of the Software Mention Detection in Scholarly Publications (SOMD), which entails identifying and categorizing software mentions from academic literature. Through comprehensive experimentation, the paper explores different training strategies, including a dual-classifier approach, adaptive sampling, and weighted loss scaling, to enhance detection accuracy while overcoming the complexities of class imbalance and the nuanced syntax of scholarly writing. The findings highlight the benefits of selective labelling and adaptive sampling in improving the model's performance. However, they also indicate that integrating multiple strategies does not necessarily result in cumulative improvements. This research offers insights into the effective application of large language models for specific tasks such as SOMD, underlining the importance of tailored approaches to address the unique challenges presented by academic text analysis.

* Accepted for publication by the first Workshop on Natural Scientific Language Processing and Research Knowledge Graphs - NSLP (@ ESCAI)

VRSO: Visual-Centric Reconstruction for Static Object Annotation

Mar 22, 2024

Chenyao Yu, Yingfeng Cai, Jiaxin Zhang, Hui Kong, Wei Sui, Cong Yang

Abstract: As a part of the perception results of intelligent driving systems, static object detection (SOD) in 3D space provides crucial cues for driving environment understanding. With the rapid deployment of deep neural networks for SOD tasks, the demand for high-quality training samples soars. The traditional and reliable approach is manual labeling over the dense LiDAR point clouds and reference images. Though most public driving datasets adopt this strategy to provide SOD ground truth (GT), it is still expensive (requires LiDAR scanners) and inefficient (time-consuming and unscalable) in practice. This paper introduces VRSO, a visual-centric approach for static object annotation. VRSO is distinguished by low cost, high efficiency, and high quality: (1) It recovers static objects in 3D space with only camera images as input, and (2) manual labeling is barely involved since GT for SOD tasks is generated based on an automatic reconstruction and annotation pipeline. (3) Experiments on the Waymo Open Dataset show that the mean reprojection error from VRSO annotation is only 2.6 pixels, around four times lower than the Waymo labeling (10.6 pixels). Source code is available at: https://github.com/CaiYingFeng/VRSO.

* submitted to IROS 2024

Gyroscope-Assisted Motion Deblurring Network

Feb 10, 2024

Simin Luan, Cong Yang, Zeyd Boukhers, Xue Qin, Dongfeng Cheng, Wei Sui, Zhijun Li

Abstract: Deblurring networks have attracted substantial attention in image research in recent years. Yet, their practical usage in real-world deblurring, especially motion blur, remains limited due to the lack of pixel-aligned training triplets (background, blurred image, and blur heat map) and restricted information inherent in blurred images. This paper presents a simple yet efficient framework to synthesize and restore motion blur images using Inertial Measurement Unit (IMU) data. Notably, the framework includes a strategy for training triplet generation, and a Gyroscope-Aided Motion Deblurring (GAMD) network for blurred image restoration. The rationale is that through harnessing IMU data, we can determine the transformation of the camera pose during the image exposure phase, facilitating the deduction of the motion trajectory (a.k.a. blur trajectory) for each point inside the three-dimensional space. Thus, the synthetic triplets using our strategy are inherently close to natural motion blur, strictly pixel-aligned, and mass-producible. Through comprehensive experiments, we demonstrate the advantages of the proposed framework: only two-pixel errors between our synthetic and real-world blur trajectories, and a marked improvement (around 33.17%) of the state-of-the-art deblurring method MIMO on Peak Signal-to-Noise Ratio (PSNR).
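
The rationale above, deriving a blur trajectory from the camera pose change during exposure, can be sketched under a strong simplification: the camera only rotates, so a pixel moves according to the homography `K R K^-1`. This is an illustrative assumption, not the paper's pipeline; the function names, the gyroscope sampling layout, and the rotation sign convention are all hypothetical.

```python
import numpy as np

def rodrigues(rotvec):
    """Rotation matrix for an axis-angle vector (Rodrigues' formula)."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rotvec / theta
    skew = np.array([[0.0, -k[2], k[1]],
                     [k[2], 0.0, -k[0]],
                     [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * skew + (1.0 - np.cos(theta)) * (skew @ skew)

def blur_trajectory(pixel, gyro, dt, K_cam):
    """Trace one pixel through the exposure under a rotation-only camera model.

    gyro:  [n, 3] angular velocity samples in rad/s (assumed camera frame).
    dt:    time between samples in seconds.
    K_cam: 3x3 camera intrinsics.
    Returns an [n + 1, 2] array of pixel positions, starting at `pixel`.
    """
    K_inv = np.linalg.inv(K_cam)
    p0 = np.array([pixel[0], pixel[1], 1.0])
    R = np.eye(3)
    traj = [np.array(pixel, dtype=float)]
    for omega in gyro:
        R = rodrigues(np.asarray(omega, dtype=float) * dt) @ R   # integrate rotation
        p = K_cam @ R @ K_inv @ p0                                # rotation homography
        traj.append(p[:2] / p[2])
    return np.array(traj)
```

With all-zero gyroscope samples the trajectory collapses to a single point, which corresponds to a sharp pixel; translation and depth, which the abstract's 3D formulation would account for, are deliberately left out of this sketch.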

SuperEdge: Towards a Generalization Model for Self-Supervised Edge Detection

Jan 04, 2024

Leng Kai, Zhang Zhijie, Liu Jie, Zed Boukhers, Sui Wei, Cong Yang, Li Zhijun

Abstract: Edge detection is a fundamental technique in various computer vision tasks. Edges are effectively delineated by pixel discontinuity and can offer reliable structural information even in textureless areas. State-of-the-art methods rely heavily on pixel-wise annotations, which are labor-intensive and subject to inconsistencies when acquired manually. In this work, we propose a novel self-supervised approach for edge detection that employs a multi-level, multi-homography technique to transfer annotations from synthetic to real-world datasets. To fully leverage the generated edge annotations, we developed SuperEdge, a streamlined yet efficient model capable of concurrently extracting edges at pixel-level and object-level granularity. Thanks to self-supervised training, our method eliminates the dependency on manually annotated edge labels, thereby enhancing its generalizability across diverse datasets. Comparative evaluations reveal that SuperEdge advances edge detection, demonstrating improvements of 4.9% in ODS and 3.3% in OIS over the existing STEdge method on BIPEDv2.

* 7 pages
