Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chen Zheng

A Comprehensive LLM-powered Framework for Driving Intelligence Evaluation

Mar 07, 2025

Shanhe You, Xuewen Luo, Xinhe Liang, Jiashu Yu, Chen Zheng, Jiangtao Gong

Figure 1 for A Comprehensive LLM-powered Framework for Driving Intelligence Evaluation

Figure 2 for A Comprehensive LLM-powered Framework for Driving Intelligence Evaluation

Figure 3 for A Comprehensive LLM-powered Framework for Driving Intelligence Evaluation

Abstract:Evaluation methods for autonomous driving are crucial for algorithm optimization. However, due to the complexity of driving intelligence, there is currently no comprehensive evaluation method for the level of autonomous driving intelligence. In this paper, we propose an evaluation framework for driving behavior intelligence in complex traffic environments, aiming to fill this gap. We constructed a natural language evaluation dataset of human professional drivers and passengers through naturalistic driving experiments and post-driving behavior evaluation interviews. Based on this dataset, we developed an LLM-powered driving evaluation framework. The effectiveness of this framework was validated through simulated experiments in the CARLA urban traffic simulator and further corroborated by human assessment. Our research provides valuable insights for evaluating and designing more intelligent, human-like autonomous driving agents. The implementation details of the framework and detailed information about the dataset can be found at Github.

* ICRA2025
* 8 pages, 3 figures

Via

Access Paper or Ask Questions

Instruct Large Language Models to Generate Scientific Literature Survey Step by Step

Aug 15, 2024

Yuxuan Lai, Yupeng Wu, Yidan Wang, Wenpeng Hu, Chen Zheng

Abstract:Abstract. Automatically generating scientific literature surveys is a valuable task that can significantly enhance research efficiency. However, the diverse and complex nature of information within a literature survey poses substantial challenges for generative models. In this paper, we design a series of prompts to systematically leverage large language models (LLMs), enabling the creation of comprehensive literature surveys through a step-by-step approach. Specifically, we design prompts to guide LLMs to sequentially generate the title, abstract, hierarchical headings, and the main content of the literature survey. We argue that this design enables the generation of the headings from a high-level perspective. During the content generation process, this design effectively harnesses relevant information while minimizing costs by restricting the length of both input and output content in LLM queries. Our implementation with Qwen-long achieved third place in the NLPCC 2024 Scientific Literature Survey Generation evaluation task, with an overall score only 0.03% lower than the second-place team. Additionally, our soft heading recall is 95.84%, the second best among the submissions. Thanks to the efficient prompt design and the low cost of the Qwen-long API, our method reduces the expense for generating each literature survey to 0.1 RMB, enhancing the practical value of our method.

* NLPCC 2024

Via

Access Paper or Ask Questions

Self-Supervised Pretext Tasks for Alzheimer's Disease Classification using 3D Convolutional Neural Networks on Large-Scale Synthetic Neuroimaging Dataset

Jun 20, 2024

Chen Zheng

Abstract:Structural magnetic resonance imaging (MRI) studies have shown that Alzheimer's Disease (AD) induces both localised and widespread neural degenerative changes throughout the brain. However, the absence of segmentation that highlights brain degenerative changes presents unique challenges for training CNN-based classifiers in a supervised fashion. In this work, we evaluated several unsupervised methods to train a feature extractor for downstream AD vs. CN classification. Using the 3D T1-weighted MRI data of cognitive normal (CN) subjects from the synthetic neuroimaging LDM100K dataset, lightweight 3D CNN-based models are trained for brain age prediction, brain image rotation classification, brain image reconstruction and a multi-head task combining all three tasks into one. Feature extractors trained on the LDM100K synthetic dataset achieved similar performance compared to the same model using real-world data. This supports the feasibility of utilising large-scale synthetic data for pretext task training. All the training and testing splits are performed on the subject-level to prevent data leakage issues. Alongside the simple preprocessing steps, the random cropping data augmentation technique shows consistent improvement across all experiments.

Via

Access Paper or Ask Questions

Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs

Jun 12, 2024

Chen Zheng, Ke Sun, Xun Zhou

Figure 1 for Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs

Figure 2 for Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs

Figure 3 for Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs

Figure 4 for Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs

Abstract:Despite the advances in Large Language Models (LLMs), exemplified by models like GPT-4 and Claude, smaller-scale LLMs such as Llama and Mistral often struggle with generating in-depth and coherent dialogues. This paper presents a novel two-step Coarse-to-Fine Actor model to address the inherent limitations in conversational and analytical capabilities of small-sized LLMs. Our approach begins with the Policy-based Coarse Actor, employing a technique we term "Continuous Maximization". The Coarse Actor establishes an enhanced, knowledge-rich pool adept at aligning with human preference styles in analysis and reasoning. Through the RLHF process, it employs Continuous Maximization, a strategy that dynamically and adaptively extends the output length limit, enabling the generation of more detailed and analytical content. Subsequently, the Fine Actor refines this analytical content, addressing the generation of excessively redundant information from the Coarse Actor. We introduce a "Knowledge Residue Merger" approach, refining the content from the Coarse Actor and merging it with an existing Instruction model to improve quality, correctness, and reduce redundancies. We applied our methodology to the popular Mistral model, creating Mistral-C2F, which has demonstrated exceptional performance across 11 general language tasks and the MT-Bench Dialogue task, outperforming similar-scale models and even larger models with 13B and 30B parameters. Our model has significantly improved conversational and analytical reasoning abilities.

Via

Access Paper or Ask Questions

Balancing Enhancement, Harmlessness, and General Capabilities: Enhancing Conversational LLMs with Direct RLHF

Mar 04, 2024

Chen Zheng, Ke Sun, Hang Wu, Chenguang Xi, Xun Zhou

Figure 1 for Balancing Enhancement, Harmlessness, and General Capabilities: Enhancing Conversational LLMs with Direct RLHF

Figure 2 for Balancing Enhancement, Harmlessness, and General Capabilities: Enhancing Conversational LLMs with Direct RLHF

Figure 3 for Balancing Enhancement, Harmlessness, and General Capabilities: Enhancing Conversational LLMs with Direct RLHF

Figure 4 for Balancing Enhancement, Harmlessness, and General Capabilities: Enhancing Conversational LLMs with Direct RLHF

Abstract:In recent advancements in Conversational Large Language Models (LLMs), a concerning trend has emerged, showing that many new base LLMs experience a knowledge reduction in their foundational capabilities following Supervised Fine-Tuning (SFT). This process often leads to issues such as forgetting or a decrease in the base model's abilities. Moreover, fine-tuned models struggle to align with user preferences, inadvertently increasing the generation of toxic outputs when specifically prompted. To overcome these challenges, we adopted an innovative approach by completely bypassing SFT and directly implementing Harmless Reinforcement Learning from Human Feedback (RLHF). Our method not only preserves the base model's general capabilities but also significantly enhances its conversational abilities, while notably reducing the generation of toxic outputs. Our approach holds significant implications for fields that demand a nuanced understanding and generation of responses, such as customer service. We applied this methodology to Mistral, the most popular base model, thereby creating Mistral-Plus. Our validation across 11 general tasks demonstrates that Mistral-Plus outperforms similarly sized open-source base models and their corresponding instruct versions. Importantly, the conversational abilities of Mistral-Plus were significantly improved, indicating a substantial advancement over traditional SFT models in both safety and user preference alignment.

Via

Access Paper or Ask Questions

ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers

Jan 04, 2024

Chen Zheng, Ke Sun, Da Tang, Yukun Ma, Yuyu Zhang, Chenguang Xi, Xun Zhou

Abstract:The emergence of Large Language Models (LLMs) such as ChatGPT and LLaMA encounter limitations in domain-specific tasks, with these models often lacking depth and accuracy in specialized areas, and exhibiting a decrease in general capabilities when fine-tuned, particularly analysis ability in small sized models. To address these gaps, we introduce ICE-GRT, utilizing Reinforcement Learning from Human Feedback (RLHF) grounded in Proximal Policy Optimization (PPO), demonstrating remarkable ability in in-domain scenarios without compromising general task performance. Our exploration of ICE-GRT highlights its understanding and reasoning ability to not only generate robust answers but also to provide detailed analyses of the reasons behind the answer. This capability marks a significant progression beyond the scope of Supervised Fine-Tuning models. The success of ICE-GRT is dependent on several crucial factors, including Appropriate Data, Reward Size Scaling, KL-Control, Advantage Normalization, etc. The ICE-GRT model exhibits state-of-the-art performance in domain-specific tasks and across 12 general Language tasks against equivalent size and even larger size LLMs, highlighting the effectiveness of our approach. We provide a comprehensive analysis of the ICE-GRT, underscoring the significant advancements it brings to the field of LLM.

Via

Access Paper or Ask Questions

A Self-enhancement Approach for Domain-specific Chatbot Training via Knowledge Mining and Digest

Nov 17, 2023

Ruohong Zhang, Luyu Gao, Chen Zheng, Zhen Fan, Guokun Lai, Zheng Zhang, Fangzhou Ai, Yiming Yang, Hongxia Yang

Figure 1 for A Self-enhancement Approach for Domain-specific Chatbot Training via Knowledge Mining and Digest

Figure 2 for A Self-enhancement Approach for Domain-specific Chatbot Training via Knowledge Mining and Digest

Figure 3 for A Self-enhancement Approach for Domain-specific Chatbot Training via Knowledge Mining and Digest

Figure 4 for A Self-enhancement Approach for Domain-specific Chatbot Training via Knowledge Mining and Digest

Abstract:Large Language Models (LLMs), despite their great power in language generation, often encounter challenges when dealing with intricate and knowledge-demanding queries in specific domains. This paper introduces a novel approach to enhance LLMs by effectively extracting the relevant knowledge from domain-specific textual sources, and the adaptive training of a chatbot with domain-specific inquiries. Our two-step approach starts from training a knowledge miner, namely LLMiner, which autonomously extracts Question-Answer pairs from relevant documents through a chain-of-thought reasoning process. Subsequently, we blend the mined QA pairs with a conversational dataset to fine-tune the LLM as a chatbot, thereby enriching its domain-specific expertise and conversational capabilities. We also developed a new evaluation benchmark which comprises four domain-specific text corpora and associated human-crafted QA pairs for testing. Our model shows remarkable performance improvement over generally aligned LLM and surpasses domain-adapted models directly fine-tuned on domain corpus. In particular, LLMiner achieves this with minimal human intervention, requiring only 600 seed instances, thereby providing a pathway towards self-improvement of LLMs through model-synthesized training data.

* Work in progress

Via

Access Paper or Ask Questions

Balancing Specialized and General Skills in LLMs: The Impact of Modern Tuning and Data Strategy

Oct 07, 2023

Zheng Zhang, Chen Zheng, Da Tang, Ke Sun, Yukun Ma, Yingtong Bu, Xun Zhou, Liang Zhao

Figure 1 for Balancing Specialized and General Skills in LLMs: The Impact of Modern Tuning and Data Strategy

Figure 2 for Balancing Specialized and General Skills in LLMs: The Impact of Modern Tuning and Data Strategy

Figure 3 for Balancing Specialized and General Skills in LLMs: The Impact of Modern Tuning and Data Strategy

Figure 4 for Balancing Specialized and General Skills in LLMs: The Impact of Modern Tuning and Data Strategy

Abstract:This paper introduces a multifaceted methodology for fine-tuning and evaluating large language models (LLMs) for specialized monetization tasks. The goal is to balance general language proficiency with domain-specific skills. The methodology has three main components: 1) Carefully blending in-domain and general-purpose data during fine-tuning to achieve an optimal balance between general and specialized capabilities; 2) Designing a comprehensive evaluation framework with 45 questions tailored to assess performance on functionally relevant dimensions like reliability, consistency, and business impact; 3) Analyzing how model size and continual training influence metrics to guide efficient resource allocation during fine-tuning. The paper details the design, data collection, analytical techniques, and results validating the proposed frameworks. It aims to provide businesses and researchers with actionable insights on effectively adapting LLMs for specialized contexts. We also intend to make public the comprehensive evaluation framework, which includes the 45 tailored questions and their respective scoring guidelines, to foster transparency and collaboration in adapting LLMs for specialized tasks.

Via

Access Paper or Ask Questions

Improving the Generalization Ability in Essay Coherence Evaluation through Monotonic Constraints

Jul 25, 2023

Chen Zheng, Huan Zhang, Yan Zhao, Yuxuan Lai

Abstract:Coherence is a crucial aspect of evaluating text readability and can be assessed through two primary factors when evaluating an essay in a scoring scenario. The first factor is logical coherence, characterized by the appropriate use of discourse connectives and the establishment of logical relationships between sentences. The second factor is the appropriateness of punctuation, as inappropriate punctuation can lead to confused sentence structure. To address these concerns, we propose a coherence scoring model consisting of a regression model with two feature extractors: a local coherence discriminative model and a punctuation correction model. We employ gradient-boosting regression trees as the regression model and impose monotonicity constraints on the input features. The results show that our proposed model better generalizes unseen data. The model achieved third place in track 1 of NLPCC 2023 shared task 7. Additionally, we briefly introduce our solution for the remaining tracks, which achieves second place for track 2 and first place for both track 3 and track 4.

* 12 pages, 1 figure, accepted to NLPCC 2023

Via

Access Paper or Ask Questions

GLUECons: A Generic Benchmark for Learning Under Constraints

Feb 16, 2023

Hossein Rajaby Faghihi, Aliakbar Nafar, Chen Zheng, Roshanak Mirzaee, Yue Zhang, Andrzej Uszok, Alexander Wan, Tanawan Premsri, Dan Roth, Parisa Kordjamshidi

Abstract:Recent research has shown that integrating domain knowledge into deep learning architectures is effective -- it helps reduce the amount of required data, improves the accuracy of the models' decisions, and improves the interpretability of models. However, the research community is missing a convened benchmark for systematically evaluating knowledge integration methods. In this work, we create a benchmark that is a collection of nine tasks in the domains of natural language processing and computer vision. In all cases, we model external knowledge as constraints, specify the sources of the constraints for each task, and implement various models that use these constraints. We report the results of these models using a new set of extended evaluation criteria in addition to the task performances for a more in-depth analysis. This effort provides a framework for a more comprehensive and systematic comparison of constraint integration techniques and for identifying related research challenges. It will facilitate further research for alleviating some problems of state-of-the-art neural models.

* 8 pages, Accepted in AAAI 2023 proceedings

Via

Access Paper or Ask Questions