Lili Mou

KETCHUP: K-Step Return Estimation for Sequential Knowledge Distillation

Apr 26, 2025

Jiabin Fan, Guoqing Luo, Michael Bowling, Lili Mou

Abstract: We propose a novel K-step return estimation method (called KETCHUP) for reinforcement learning (RL)-based knowledge distillation (KD) in text generation tasks. Our idea is to induce a K-step return by using the Bellman Optimality Equation for multiple steps. Theoretical analysis shows that this K-step formulation reduces the variance of the gradient estimates, thus leading to improved RL optimization, especially when the student model size is large. Empirical evaluation on three text generation tasks demonstrates that our approach yields superior performance in both standard task metrics and large language model (LLM)-based evaluation. These results suggest that our K-step return induction offers a promising direction for enhancing RL-based KD in LLM research.
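
As a rough illustration of what a K-step return is in general, the sketch below sums K discounted rewards and then bootstraps from a value estimate K steps ahead. This is the textbook form and not necessarily the estimator derived in the paper; the names `k_step_return`, `rewards`, `values`, and `gamma` are hypothetical.

```python
def k_step_return(rewards, values, t, k, gamma=1.0):
    """Textbook K-step return starting at time step t (illustrative only).

    rewards[i] is the reward received after acting at step i.
    values[i] is a value estimate for the state at step i, with
    len(values) == len(rewards) + 1 (the last entry is the terminal state).
    """
    horizon = min(t + k, len(rewards))  # do not run past the end of the sequence
    total = 0.0
    for i in range(t, horizon):
        total += (gamma ** (i - t)) * rewards[i]
    # Bootstrap from the value estimate of the state reached after those rewards.
    total += (gamma ** (horizon - t)) * values[horizon]
    return total


# Toy numbers, chosen only to make the arithmetic easy to follow.
rewards = [1.0, 0.0, 2.0, 1.0]
values = [3.0, 2.5, 2.0, 1.0, 0.0]
one_step = k_step_return(rewards, values, t=0, k=1, gamma=0.5)
three_step = k_step_return(rewards, values, t=0, k=3, gamma=0.5)
```

A larger `k` relies more on observed rewards and less on the bootstrapped value estimate.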

ULPT: Prompt Tuning with Ultra-Low-Dimensional Optimization

Feb 06, 2025

Zijun Wu, Yongchang Hao, Lili Mou

Abstract: Large language models achieve state-of-the-art performance but are costly to fine-tune due to their size. Parameter-efficient fine-tuning methods, such as prompt tuning, address this by reducing trainable parameters while maintaining strong performance. However, prior methods tie prompt embeddings to the model's dimensionality, which may not scale well with larger LLMs and more customized LLMs. In this paper, we propose Ultra-Low-dimensional Prompt Tuning (ULPT), which optimizes prompts in a low-dimensional space (e.g., 2D) and uses a random but frozen matrix for the up-projection. To enhance alignment, we introduce learnable shift and scale embeddings. ULPT drastically reduces the trainable parameters (e.g., the 2D variant uses only 2% of the parameters of vanilla prompt tuning) while retaining most of the performance across 21 NLP tasks. Our theoretical analysis shows that random projections can capture high-rank structures effectively, and experimental results demonstrate ULPT's competitive performance over existing parameter-efficient methods.
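
The up-projection described in the abstract can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the Gaussian initialization, the sizes, and the choice to share the shift and scale vectors across all prompt tokens are hypothetical.

```python
import random


def make_frozen_projection(low_dim, model_dim, seed=0):
    """Random up-projection matrix; sampled once and never trained."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(model_dim)] for _ in range(low_dim)]


def up_project(low_prompts, projection, scale, shift):
    """Map each low-dimensional prompt vector to a model-sized embedding."""
    embeddings = []
    for z in low_prompts:
        row = []
        for j in range(len(scale)):
            projected = sum(z[i] * projection[i][j] for i in range(len(z)))
            row.append(projected * scale[j] + shift[j])
        embeddings.append(row)
    return embeddings


num_tokens, low_dim, model_dim = 100, 2, 768  # illustrative sizes only
projection = make_frozen_projection(low_dim, model_dim)     # frozen
low_prompts = [[0.0] * low_dim for _ in range(num_tokens)]  # trainable
scale = [1.0] * model_dim  # trainable; shared across tokens (assumption)
shift = [0.0] * model_dim  # trainable; shared across tokens (assumption)

prompt_embeddings = up_project(low_prompts, projection, scale, shift)
trainable = num_tokens * low_dim + 2 * model_dim  # low-dim prompts + scale + shift
vanilla = num_tokens * model_dim                  # full-size prompt embeddings
```

With these illustrative sizes, the sketch trains 1,736 values where a full-size prompt would train 76,800, because the projection matrix itself is never updated.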

Multilingual Non-Autoregressive Machine Translation without Knowledge Distillation

Feb 06, 2025

Chenyang Huang, Fei Huang, Zaixiang Zheng, Osmar R. Zaïane, Hao Zhou, Lili Mou

Abstract: Multilingual neural machine translation (MNMT) aims at using one single model for multiple translation directions. Recent work applies non-autoregressive Transformers to improve the efficiency of MNMT, but requires expensive knowledge distillation (KD) processes. To this end, we propose an M-DAT approach to non-autoregressive multilingual machine translation. Our system leverages the recent advance of the directed acyclic Transformer (DAT), which does not require KD. We further propose a pivot back-translation (PivotBT) approach to improve the generalization to unseen translation directions. Experiments show that our M-DAT achieves state-of-the-art performance in non-autoregressive MNMT.

* In Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023

Revisiting Intermediate-Layer Matching in Knowledge Distillation: Layer-Selection Strategy Doesn't Matter (Much)

Feb 06, 2025

Zony Yu, Yuqiao Wen, Lili Mou

Abstract: Knowledge distillation (KD) is a popular method of transferring knowledge from a large "teacher" model to a small "student" model. KD can be divided into two categories: prediction matching and intermediate-layer matching. We explore an intriguing phenomenon: layer-selection strategy does not matter (much) in intermediate-layer matching. In this paper, we show that seemingly nonsensical matching strategies such as matching the teacher's layers in reverse still result in surprisingly good student performance. We provide an interpretation for this phenomenon by examining the angles between teacher layers viewed from the student's perspective.
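
To make "layer-selection strategy" concrete, the sketch below builds an evenly spaced student-to-teacher layer mapping and its reversed counterpart, then scores each with a mean-squared-error matching loss. It is an illustration under assumptions, not the paper's setup: the function names are hypothetical, and student and teacher hidden states are assumed to share one size so that no projection layer is needed.

```python
def forward_mapping(num_student, num_teacher):
    """Student layer s -> teacher layer at the same relative depth."""
    stride = num_teacher // num_student
    return {s: (s + 1) * stride - 1 for s in range(num_student)}


def reverse_mapping(num_student, num_teacher):
    """Same teacher layers as forward_mapping, assigned in reverse order."""
    forward = forward_mapping(num_student, num_teacher)
    return {s: forward[num_student - 1 - s] for s in range(num_student)}


def layer_matching_loss(student_hidden, teacher_hidden, mapping):
    """Mean squared error between each student layer and its mapped teacher layer."""
    total, count = 0.0, 0
    for s_idx, t_idx in mapping.items():
        for s_val, t_val in zip(student_hidden[s_idx], teacher_hidden[t_idx]):
            total += (s_val - t_val) ** 2
            count += 1
    return total / count


# Toy hidden states: teacher layer l holds the constant vector [l, l].
teacher_hidden = [[float(layer)] * 2 for layer in range(12)]
student_hidden = [[3.0, 3.0], [7.0, 7.0], [11.0, 11.0]]

fwd = forward_mapping(3, 12)   # {0: 3, 1: 7, 2: 11}
rev = reverse_mapping(3, 12)   # {0: 11, 1: 7, 2: 3}
forward_loss = layer_matching_loss(student_hidden, teacher_hidden, fwd)
reverse_loss = layer_matching_loss(student_hidden, teacher_hidden, rev)
```

Only the mapping changes between the two strategies; the loss function and the training loop around it would stay the same.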

A Decoding Algorithm for Length-Control Summarization Based on Directed Acyclic Transformers

Feb 06, 2025

Chenyang Huang, Hao Zhou, Cameron Jen, Kangjie Zheng, Osmar R. Zaïane, Lili Mou

Abstract: Length-control summarization aims to condense a long text into a short one within a certain length limit. Previous approaches often use autoregressive (AR) models and treat the length requirement as a soft constraint, which may not always be satisfied. In this study, we propose a novel length-control decoding algorithm based on the Directed Acyclic Transformer (DAT). Our approach allows for multiple plausible sequence fragments and predicts a path to connect them. In addition, we propose a Sequence Maximum a Posteriori (SeqMAP) decoding algorithm that marginalizes different possible paths and finds the most probable summary satisfying the length budget. Our algorithm is based on beam search, which further facilitates a reranker for performance improvement. Experimental results on the Gigaword and DUC2004 datasets demonstrate our state-of-the-art performance for length-control summarization.

* Findings of the Association for Computational Linguistics: EMNLP 2024

Error Diversity Matters: An Error-Resistant Ensemble Method for Unsupervised Dependency Parsing

Dec 16, 2024

Behzad Shayegh, Hobie H.-B. Lee, Xiaodan Zhu, Jackie Chi Kit Cheung, Lili Mou

Abstract: We address unsupervised dependency parsing by building an ensemble of diverse existing models through post hoc aggregation of their output dependency parse structures. We observe that these ensembles often suffer from low robustness against weak ensemble components due to error accumulation. To tackle this problem, we propose an efficient ensemble-selection approach that avoids error accumulation. Results demonstrate that our approach outperforms each individual model as well as previous ensemble techniques. Additionally, our experiments show that the proposed ensemble-selection method significantly enhances the performance and robustness of our ensemble, surpassing previously proposed strategies, which have not accounted for error diversity.

* Accepted by the AAAI Conference on Artificial Intelligence (AAAI) 2025

NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks

Oct 28, 2024

Yongchang Hao, Yanshuai Cao, Lili Mou

Abstract: The performance of neural networks improves when more parameters are used. However, the model sizes are constrained by the available on-device memory during training and inference. Although applying techniques like quantization can alleviate the constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we are able to achieve memory-efficient training and inference without sacrificing performance. Notably, we significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB, while keeping the training dynamics fully unchanged. In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.
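
The idea of "entropy of floating-point numbers" can be illustrated by measuring how many bits of information a group of float bits actually carries. The abstract does not say which bits are compressed or which coder is used, so looking at the 8 exponent bits of float32 values is an assumption made here for illustration, as are the synthetic Gaussian "weights" and all function names.

```python
import math
import random
import struct
from collections import Counter


def float32_exponent(x):
    """Return the 8 exponent bits of x when stored as an IEEE 754 float32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF  # drop the 23 mantissa bits, mask off the sign bit


def entropy_bits(symbols):
    """Empirical Shannon entropy, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


rng = random.Random(0)
# Synthetic stand-in for trained weights (assumption: small, zero-centered values).
weights = [rng.gauss(0.0, 0.02) for _ in range(10000)]
exponent_entropy = entropy_bits([float32_exponent(w) for w in weights])
```

If `exponent_entropy` comes out well below 8, an entropy coder could in principle store those exponent bits in less space without losing any information.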

A Dual-View Approach to Classifying Radiology Reports by Co-Training

Jun 10, 2024

Yutong Han, Yan Yuan, Lili Mou

Abstract: Radiology report analysis provides valuable information that can aid with public health initiatives, and has been attracting increasing attention from the research community. In this work, we present a novel insight that the structure of a radiology report (namely, the Findings and Impression sections) offers different views of a radiology scan. Based on this intuition, we further propose a co-training approach, where two machine learning models are built upon the Findings and Impression sections, respectively, and use each other's information to boost performance with massive unlabeled data in a semi-supervised manner. We conducted experiments in a public health surveillance study, and results show that our co-training approach is able to improve performance using the dual views and surpass competing supervised and semi-supervised methods.

* Accepted by LREC-COLING 2024

Action Controlled Paraphrasing

May 18, 2024

Ning Shi, Zijun Wu, Lili Mou

Abstract: Recent studies have demonstrated the potential to control paraphrase generation, such as through syntax, which has broad applications in various downstream tasks. However, these methods often require detailed parse trees or syntactic exemplars, which are not user-friendly. Furthermore, an inference gap exists, as control specifications are only available during training but not inference. In this work, we propose a new setup for controlled paraphrasing. Specifically, we represent user-intended actions as action tokens, which are embedded and concatenated with text embeddings and then fed together into a self-attention encoder for representation fusion. To address the inference gap, we introduce an optional action token as a placeholder that encourages the model to determine the appropriate action when control specifications are inaccessible. Experimental results show that our method successfully enables specific action-controlled paraphrasing and preserves the same or even better performance compared to conventional uncontrolled methods when actions are not given. Our findings thus promote the concept of optional action control for a more user-centered design via representation learning.

EBBS: An Ensemble with Bi-Level Beam Search for Zero-Shot Machine Translation

Feb 29, 2024

Yuqiao Wen, Behzad Shayegh, Chenyang Huang, Yanshuai Cao, Lili Mou

Abstract: The ability of zero-shot translation emerges when we train a multilingual model with certain translation directions; the model can then directly translate in unseen directions. Alternatively, zero-shot translation can be accomplished by pivoting through a third language (e.g., English). In our work, we observe that both direct and pivot translations are noisy and achieve less satisfactory performance. We propose EBBS, an ensemble method with a novel bi-level beam search algorithm, where each ensemble component explores its own prediction step by step at the lower level but they are synchronized by a "soft voting" mechanism at the upper level. Results on two popular multilingual translation datasets show that EBBS consistently outperforms direct and pivot translations as well as existing ensemble techniques. Further, we can distill the ensemble's knowledge back to the multilingual model to improve inference efficiency; notably, our EBBS-based distillation does not sacrifice, and can even improve, the translation quality.
