Kosuke Arima

Pretraining and Updating Language- and Domain-specific Large Language Model: A Case Study in Japanese Business Domain

Apr 12, 2024

Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Tatsuya Ishigaki

Abstract: Several previous studies have considered language- and domain-specific large language models (LLMs) as separate topics. This study explores the combination of a non-English language and a high-demand industry domain, focusing on a Japanese business-specific LLM. This type of model requires expertise in the business domain, strong language skills, and regular updates of its knowledge. We trained a 13-billion-parameter LLM from scratch using a new dataset of business texts and patents, and continually pretrained it with the latest business documents. Furthermore, we propose a new benchmark for Japanese business-domain question answering (QA) and evaluate our models on it. The results show that our pretrained model improves QA accuracy without losing general knowledge, and that continual pretraining enhances adaptation to new information. Our pretrained model and business-domain benchmark are publicly available.

* 9 pages. Preprint of COLM 2024

Training Generative Question-Answering on Synthetic Data Obtained from an Instruct-tuned Model

Oct 13, 2023

Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Tatsuya Ishigaki

Abstract: This paper presents a simple and cost-effective method for synthesizing data to train question-answering systems. Fine-tuning GPT models for this purpose is common practice in resource-rich languages such as English; however, it becomes challenging for non-English languages because question-answer (QA) pairs are scarce. Existing approaches use question and answer generators trained on human-authored QA pairs, which incurs substantial human cost. In contrast, we use an instruct-tuned model to generate QA pairs in a zero-shot or few-shot manner. We conduct experiments to compare various strategies for obtaining QA pairs from the instruct-tuned model. The results demonstrate that a model trained on our proposed synthetic data achieves performance comparable to that of a model trained on manually curated datasets, without incurring human costs.

* PACLIC 2023 short paper, 4 pages (6 pages including references), 4 figures
