Miaoran Zhang

Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning

Feb 25, 2025

Xinghao Chen, Zhijing Sun, Wenjin Guo, Miaoran Zhang, Yanjun Chen, Yirong Sun, Hui Su, Yijie Pan, Dietrich Klakow, Wenjie Li(+1 more)

Abstract:Large Language Models (LLMs) excel in reasoning tasks through Chain-of-Thought (CoT) prompting. However, CoT prompting greatly increases computational demands, which has prompted growing interest in distilling CoT capabilities into Small Language Models (SLMs). This study systematically examines the factors influencing CoT distillation, including the choice of granularity, format and teacher model. Through experiments involving four teacher models and seven student models across seven mathematical and commonsense reasoning datasets, we uncover three key findings: (1) Unlike LLMs, SLMs exhibit a non-monotonic relationship with granularity, with stronger models benefiting from finer-grained reasoning and weaker models performing better with simpler CoT supervision; (2) CoT format significantly impacts LLMs but has minimal effect on SLMs, likely due to their reliance on supervised fine-tuning rather than pretraining preferences; (3) Stronger teacher models do NOT always produce better student models, as diversity and complexity in CoT supervision can outweigh accuracy alone. These findings emphasize the need to tailor CoT strategies to the specific student model, offering actionable insights for optimizing CoT distillation in SLMs. The code and datasets are available at https://github.com/EIT-NLP/Distilling-CoT-Reasoning.

AFRIDOC-MT: Document-level MT Corpus for African Languages

Jan 10, 2025

Jesujoba O. Alabi, Israel Abebe Azime, Miaoran Zhang, Cristina España-Bonet, Rachel Bawden, Dawei Zhu, David Ifeoluwa Adelani, Clement Oyeleke Odoje, Idris Akinade, Iffat Maab(+6 more)

Abstract:This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating neural machine translation (NMT) models and large language models (LLMs) for translations between English and these languages, at both the sentence and pseudo-document levels. These outputs are realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieved the best average performance among the standard NMT models, while GPT-4o outperformed general-purpose LLMs. Fine-tuning selected models led to substantial performance gains, but models trained on sentences struggled to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, repetition of words or phrases, and off-target translations, especially for African languages.

* under review

Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It?

Aug 07, 2024

Anupama Chingacham, Miaoran Zhang, Vera Demberg, Dietrich Klakow

Abstract:Large Language Models (LLMs) can generate text by transferring style attributes like formality, resulting in formal or informal text. However, instructing LLMs to generate text that, when spoken, is more intelligible in an acoustically difficult environment is an under-explored topic. We conduct the first study to evaluate LLMs on a novel task of generating acoustically intelligible paraphrases for better human speech perception in noise. Our experiments in English demonstrated that with standard prompting, LLMs struggle to control the non-textual attribute, i.e., acoustic intelligibility, while efficiently capturing the desired textual attributes like semantic equivalence. To remedy this issue, we propose a simple prompting approach, prompt-and-select, which generates paraphrases by decoupling the desired textual and non-textual attributes in the text generation pipeline. Our approach resulted in a 40% relative improvement in human speech perception, by paraphrasing utterances that are highly distorted in a listening condition with babble noise at a signal-to-noise ratio (SNR) of -5 dB. This study reveals the limitation of LLMs in capturing non-textual attributes, and our proposed method showcases the potential of using LLMs for better human speech perception in noise.

* Accepted at HuCLLM @ ACL 2024

Exploring the Effectiveness and Consistency of Task Selection in Intermediate-Task Transfer Learning

Jul 23, 2024

Pin-Jie Lin, Miaoran Zhang, Marius Mosbach, Dietrich Klakow

Abstract:Identifying beneficial tasks to transfer from is a critical step toward successful intermediate-task transfer learning. In this work, we experiment with 130 source-target task combinations and demonstrate that the transfer performance exhibits severe variance across different source tasks and training seeds, highlighting the crucial role of intermediate-task selection in a broader context. We compare four representative task selection methods in a unified setup, focusing on their effectiveness and consistency. Compared to embedding-free methods and text embeddings, task embeddings constructed from fine-tuned weights can better estimate task transferability by improving task prediction scores from 2.59% to 3.96%. Despite their strong performance, we observe that the task embeddings do not consistently demonstrate superiority for tasks requiring reasoning abilities. Furthermore, we introduce a novel method that measures pairwise token similarity using maximum inner product search, leading to the highest performance in task prediction. Our findings suggest that token-wise similarity is more predictive of transferability than averaging weights.

* Accepted to ACL SRW 2024
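
The pairwise token similarity mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say how token representations are obtained or how per-token scores are aggregated, so the function name `max_inner_product_similarity`, the toy two-dimensional vectors, and the choice to average the per-token maxima are all assumptions made for illustration, not the authors' method. A brute-force loop stands in for a real maximum inner product search index.

```python
def max_inner_product_similarity(tokens_a, tokens_b):
    """For each token vector in A, take its largest inner product with any
    token vector in B, then average those maxima (aggregation is an assumption)."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    best_matches = [max(dot(a, b) for b in tokens_b) for a in tokens_a]
    return sum(best_matches) / len(best_matches)


# Hypothetical toy token vectors for one source task and two candidate targets.
source_task = [[1.0, 0.0], [0.0, 1.0]]
target_close = [[0.9, 0.1], [0.1, 0.9]]
target_far = [[-1.0, 0.0], [0.0, -1.0]]

score_close = max_inner_product_similarity(source_task, target_close)
score_far = max_inner_product_similarity(source_task, target_far)
```

Under these toy inputs the target whose tokens point in similar directions to the source receives the higher score, which is the behavior a transferability estimate of this kind would rely on.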

Fine-Tuning Large Language Models to Translate: Will a Touch of Noisy Data in Misaligned Languages Suffice?

Apr 22, 2024

Dawei Zhu, Pinzhen Chen, Miaoran Zhang, Barry Haddow, Xiaoyu Shen, Dietrich Klakow

Abstract:Traditionally, success in multilingual machine translation can be attributed to three key factors in training data: large volume, diverse translation directions, and high quality. In the current practice of fine-tuning large language models (LLMs) for translation, we revisit the importance of all these factors. We find that LLMs display strong translation capability after being fine-tuned on as few as 32 training instances, and that fine-tuning on a single translation direction effectively enables LLMs to translate in multiple directions. However, the choice of direction is critical: fine-tuning LLMs with English on the target side can lead to task misinterpretation, which hinders translations into non-English languages. A similar problem arises when noise is introduced into the target side of parallel data, especially when the target language is well-represented in the LLM's pre-training. In contrast, noise in an under-represented language has a less pronounced effect. Our findings suggest that attaining successful alignment hinges on teaching the model to maintain a "superficial" focus, thereby avoiding the learning of erroneous biases beyond translation.

AAdaM at SemEval-2024 Task 1: Augmentation and Adaptation for Multilingual Semantic Textual Relatedness

Apr 01, 2024

Miaoran Zhang, Mingyang Wang, Jesujoba O. Alabi, Dietrich Klakow

Abstract:This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages. The shared task aims at measuring the semantic textual relatedness between pairs of sentences, with a focus on a range of under-represented languages. In this work, we propose using machine translation for data augmentation to address the low-resource challenge of limited training data. Moreover, we apply task-adaptive pre-training on unlabeled task data to bridge the gap between pre-training and task adaptation. For model training, we investigate both full fine-tuning and adapter-based tuning, and adopt the adapter framework for effective zero-shot cross-lingual transfer. We achieve competitive results in the shared task: our system performs the best among all ranked teams in both subtask A (supervised learning) and subtask C (cross-lingual transfer).

* To appear at the SemEval-2024 workshop (NAACL 2024)

The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis

Feb 20, 2024

Miaoran Zhang, Vagrant Gautam, Mingyang Wang, Jesujoba O. Alabi, Xiaoyu Shen, Dietrich Klakow, Marius Mosbach

Abstract:In-context learning is a popular inference strategy where large language models solve a task using only a few labelled demonstrations without needing any parameter updates. Compared to work on monolingual (English) in-context learning, multilingual in-context learning is under-explored, and we lack an in-depth understanding of the role of demonstrations in this context. To address this gap, we conduct a multidimensional analysis of multilingual in-context learning, experimenting with 5 models from different model families, 9 datasets covering classification and generation tasks, and 56 typologically diverse languages. Our results reveal that the effectiveness of demonstrations varies significantly across models, tasks, and languages. We also find that Llama 2-Chat, GPT-3.5, and GPT-4 are largely insensitive to the quality of demonstrations. Instead, a carefully crafted template often eliminates the benefits of demonstrations for some tasks and languages altogether. These findings show that the importance of demonstrations might be overestimated. Our work highlights the need for granular evaluation across multiple axes towards a better understanding of in-context learning.

A Lightweight Method to Generate Unanswerable Questions in English

Oct 30, 2023

Vagrant Gautam, Miaoran Zhang, Dietrich Klakow

Abstract:If a question cannot be answered with the available information, robust systems for question answering (QA) should know _not_ to answer. One way to build QA models that do this is with additional training data comprised of unanswerable questions, created either by employing annotators or through automated methods for unanswerable question generation. To show that the model complexity of existing automated approaches is not justified, we examine a simpler data augmentation method for unanswerable question generation in English: performing antonym and entity swaps on answerable questions. Compared to the prior state-of-the-art, data generated with our training-free and lightweight strategy results in better models (+1.6 F1 points on SQuAD 2.0 data with BERT-large), and has higher human-judged relatedness and readability. We quantify the raw benefits of our approach compared to no augmentation across multiple encoder models, using different amounts of generated data, and also on TydiQA-MinSpan data (+9.3 F1 points with BERT-large). Our results establish swaps as a simple but strong baseline for future work.

* Accepted to Findings of EMNLP 2023
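
A minimal sketch of the antonym and entity swaps described in the abstract above. The lexicons `ANTONYMS` and `ENTITY_SWAPS` and the helper names are hypothetical; a real pipeline would presumably draw antonyms from a lexical resource and entities from a tagger, and would need some check that the swapped question really is unanswerable against the passage. None of that is shown here.

```python
import re

# Hypothetical toy lexicons, for illustration only.
ANTONYMS = {"largest": "smallest", "before": "after", "first": "last"}
ENTITY_SWAPS = {"France": "Brazil", "Einstein": "Curie"}


def swap_first(question, table):
    """Replace the first whole-word match from `table`; return None if no word matches."""
    for word, replacement in table.items():
        pattern = r"\b" + re.escape(word) + r"\b"
        if re.search(pattern, question):
            return re.sub(pattern, replacement, question, count=1)
    return None


def make_unanswerable_candidates(question):
    """Produce one antonym-swapped and one entity-swapped variant where possible."""
    candidates = []
    for table in (ANTONYMS, ENTITY_SWAPS):
        swapped = swap_first(question, table)
        if swapped is not None:
            candidates.append(swapped)
    return candidates


examples = make_unanswerable_candidates("What is the largest city in France?")
# examples == ["What is the smallest city in France?",
#              "What is the largest city in Brazil?"]
```

The swaps need no model training, which matches the "training-free and lightweight" framing of the abstract.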

MCSE: Multimodal Contrastive Learning of Sentence Embeddings

Apr 22, 2022

Miaoran Zhang, Marius Mosbach, David Ifeoluwa Adelani, Michael A. Hedderich, Dietrich Klakow

Abstract:Learning semantically meaningful sentence embeddings is an open problem in natural language processing. In this work, we propose a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective. Through experiments on a variety of semantic textual similarity tasks, we demonstrate that our approach consistently improves the performance across various datasets and pre-trained encoders. In particular, combining a small amount of multimodal data with a large text-only corpus, we improve the state-of-the-art average Spearman's correlation by 1.7%. By analyzing the properties of the textual embedding space, we show that our model excels in aligning semantically similar sentences, providing an explanation for its improved performance.

* Accepted by NAACL 2022 main conference (short paper), 11 pages
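
The abstract above names a multimodal contrastive objective without giving its form. The sketch below assumes an in-batch InfoNCE-style loss with cosine similarity, in which sentence i should score highest against image i. The function names, the temperature of 0.05, and the toy vectors are assumptions for illustration and may differ from the paper's actual objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(sentence_vecs, image_vecs, temperature=0.05):
    """Mean cross-entropy over the batch; the positive for sentence i is image i."""
    losses = []
    for i, sentence in enumerate(sentence_vecs):
        logits = [cosine(sentence, image) / temperature for image in image_vecs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Hypothetical toy embeddings: aligned pairs versus deliberately mismatched pairs.
sentences = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[0.9, 0.1], [0.1, 0.9]]
shuffled_images = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(sentences, aligned_images)
loss_shuffled = info_nce(sentences, shuffled_images)
```

The loss is close to zero when each sentence is nearest to its own image and large when the pairing is shuffled, so minimizing it pulls matching sentence and image embeddings together.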

Knowledge Base Index Compression via Dimensionality and Precision Reduction

Apr 18, 2022

Vilém Zouhar, Marius Mosbach, Miaoran Zhang, Dietrich Klakow

Abstract:Recently, neural-network-based approaches to knowledge-intensive NLP tasks, such as question answering, have started to rely heavily on the combination of neural retrievers and readers. Retrieval is typically performed over a large textual knowledge base (KB), which requires significant memory and compute resources, especially when scaled up. On HotpotQA we systematically investigate reducing the size of the KB index by means of dimensionality (sparse random projections, PCA, autoencoders) and numerical precision reduction. Our results show that PCA is an easy solution that requires very little data and is only slightly worse than autoencoders, which are less stable. All methods are sensitive to pre- and post-processing, and data should always be centered and normalized both before and after dimension reduction. Finally, we show that it is possible to combine PCA with using 1 bit per dimension. Overall we achieve (1) 100× compression with 75%, and (2) 24× compression with 92% of the original retrieval performance.

* To be presented at Spa-NLP workshop at ACL 2022
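
The precision-reduction step from the abstract above can be sketched as centering, normalizing, and keeping one bit per dimension. This sketch omits the PCA projection itself, and the toy vectors, the sign-based binarization rule, and the 768-dimensional 32-bit starting point are assumptions chosen for illustration rather than the paper's configuration.

```python
import math


def center(vectors):
    """Subtract the per-dimension mean computed over the whole index."""
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    return [[v[d] - means[d] for d in range(dims)] for v in vectors]


def normalize(vectors):
    """Scale each vector to unit L2 norm; all-zero vectors are left unchanged."""
    normalized = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        normalized.append([x / norm for x in v])
    return normalized


def to_one_bit(vectors):
    """Keep only the sign of each dimension (1 for non-negative, 0 otherwise)."""
    return [[1 if x >= 0 else 0 for x in v] for v in vectors]


def compression_ratio(orig_dims, orig_bits, new_dims, new_bits):
    """Ratio of bits per vector before and after compression."""
    return (orig_dims * orig_bits) / (new_dims * new_bits)


# Hypothetical toy index of three 3-dimensional vectors.
index = [[1.0, 4.0, -2.0], [3.0, 0.0, 2.0], [2.0, 2.0, 3.0]]
codes = to_one_bit(normalize(center(index)))

# Assumed starting point: 768 dimensions stored as 32-bit floats.
ratio = compression_ratio(orig_dims=768, orig_bits=32, new_dims=768, new_bits=1)
```

Under the assumed starting point, one bit per dimension alone yields 32× compression, so any larger ratio has to come from also reducing the number of dimensions.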
