Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Qiang Lou

Task Oriented In-Domain Data Augmentation

Jun 24, 2024

Xiao Liang, Xinyu Hu, Simiao Zuo, Yeyun Gong, Qiang Lou, Yi Liu, Shao-Lun Huang, Jian Jiao

Figure 1 for Task Oriented In-Domain Data Augmentation

Figure 2 for Task Oriented In-Domain Data Augmentation

Figure 3 for Task Oriented In-Domain Data Augmentation

Figure 4 for Task Oriented In-Domain Data Augmentation

Abstract:Large Language Models (LLMs) have shown superior performance in various applications and fields. To achieve better performance on specialized domains such as law and advertisement, LLMs are often continue pre-trained on in-domain data. However, existing approaches suffer from two major issues. First, in-domain data are scarce compared with general domain-agnostic data. Second, data used for continual pre-training are not task-aware, such that they may not be helpful to downstream applications. We propose TRAIT, a task-oriented in-domain data augmentation framework. Our framework is divided into two parts: in-domain data selection and task-oriented synthetic passage generation. The data selection strategy identifies and selects a large amount of in-domain data from general corpora, and thus significantly enriches domain knowledge in the continual pre-training data. The synthetic passages contain guidance on how to use domain knowledge to answer questions about downstream tasks. By training on such passages, the model aligns with the need of downstream applications. We adapt LLMs to two domains: advertisement and math. On average, TRAIT improves LLM performance by 8% in the advertisement domain and 7.5% in the math domain.

Via

Access Paper or Ask Questions

Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing

Oct 20, 2023

Xinyu Hu, Pengfei Tang, Simiao Zuo, Zihan Wang, Bowen Song, Qiang Lou, Jian Jiao, Denis Charles

Figure 1 for Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing

Figure 2 for Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing

Figure 3 for Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing

Figure 4 for Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing

Abstract:Large language models (LLMs) have made impressive progress in natural language processing. These models rely on proper human instructions (or prompts) to generate suitable responses. However, the potential of LLMs are not fully harnessed by commonly-used prompting methods: many human-in-the-loop algorithms employ ad-hoc procedures for prompt selection; while auto prompt generation approaches are essentially searching all possible prompts randomly and inefficiently. We propose Evoke, an automatic prompt refinement framework. In Evoke, there are two instances of a same LLM: one as a reviewer (LLM-Reviewer), it scores the current prompt; the other as an author (LLM-Author), it edits the prompt by considering the edit history and the reviewer's feedback. Such an author-reviewer feedback loop ensures that the prompt is refined in each iteration. We further aggregate a data selection approach to Evoke, where only the hard samples are exposed to the LLM. The hard samples are more important because the LLM can develop deeper understanding of the tasks out of them, while the model may already know how to solve the easier cases. Experimental results show that Evoke significantly outperforms existing methods. For instance, in the challenging task of logical fallacy detection, Evoke scores above 80, while all other baseline methods struggle to reach 20.

Via

Access Paper or Ask Questions

DeepTagger: Knowledge Enhanced Named Entity Recognition for Web-Based Ads Queries

Jun 30, 2023

Simiao Zuo, Pengfei Tang, Xinyu Hu, Qiang Lou, Jian Jiao, Denis Charles

Figure 1 for DeepTagger: Knowledge Enhanced Named Entity Recognition for Web-Based Ads Queries

Figure 2 for DeepTagger: Knowledge Enhanced Named Entity Recognition for Web-Based Ads Queries

Figure 3 for DeepTagger: Knowledge Enhanced Named Entity Recognition for Web-Based Ads Queries

Figure 4 for DeepTagger: Knowledge Enhanced Named Entity Recognition for Web-Based Ads Queries

Abstract:Named entity recognition (NER) is a crucial task for online advertisement. State-of-the-art solutions leverage pre-trained language models for this task. However, three major challenges remain unresolved: web queries differ from natural language, on which pre-trained models are trained; web queries are short and lack contextual information; and labeled data for NER is scarce. We propose DeepTagger, a knowledge-enhanced NER model for web-based ads queries. The proposed knowledge enhancement framework leverages both model-free and model-based approaches. For model-free enhancement, we collect unlabeled web queries to augment domain knowledge; and we collect web search results to enrich the information of ads queries. We further leverage effective prompting methods to automatically generate labels using large language models such as ChatGPT. Additionally, we adopt a model-based knowledge enhancement method based on adversarial data augmentation. We employ a three-stage training framework to train DeepTagger models. Empirical results in various NER tasks demonstrate the effectiveness of the proposed framework.

Via

Access Paper or Ask Questions

Dual-Alignment Pre-training for Cross-lingual Sentence Embedding

May 16, 2023

Ziheng Li, Shaohan Huang, Zihan Zhang, Zhi-Hong Deng, Qiang Lou, Haizhen Huang, Jian Jiao, Furu Wei, Weiwei Deng, Qi Zhang

Figure 1 for Dual-Alignment Pre-training for Cross-lingual Sentence Embedding

Figure 2 for Dual-Alignment Pre-training for Cross-lingual Sentence Embedding

Figure 3 for Dual-Alignment Pre-training for Cross-lingual Sentence Embedding

Figure 4 for Dual-Alignment Pre-training for Cross-lingual Sentence Embedding

Abstract:Recent studies have shown that dual encoder models trained with the sentence-level translation ranking task are effective methods for cross-lingual sentence embedding. However, our research indicates that token-level alignment is also crucial in multilingual scenarios, which has not been fully explored previously. Based on our findings, we propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding that incorporates both sentence-level and token-level alignment. To achieve this, we introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart. This reconstruction objective encourages the model to embed translation information into the token representation. Compared to other token-level alignment methods such as translation language modeling, RTL is more suitable for dual encoder architectures and is computationally efficient. Extensive experiments on three sentence-level cross-lingual benchmarks demonstrate that our approach can significantly improve sentence embedding. Our code is available at https://github.com/ChillingDream/DAP.

* ACL 2023

Via

Access Paper or Ask Questions