Daisuke Kawahara

Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

May 21, 2025

Hao Wang, Pinzhi Huang, Jihan Yang, Saining Xie, Daisuke Kawahara

Abstract: The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.

* https://github.com/nlp-waseda/traveling-across-languages

Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model

Oct 30, 2024

Keito Sasagawa, Koki Maeda, Issa Sugiura, Shuhei Kurita, Naoaki Okazaki, Daisuke Kawahara

Abstract: To develop high-performing Visual Language Models (VLMs), it is essential to prepare multimodal resources, such as image-text pairs, interleaved data, and instruction data. While multimodal resources for English are abundant, there is a significant lack of corresponding resources for non-English languages, such as Japanese. To address this problem, we take Japanese as a non-English language and propose a method for rapidly creating Japanese multimodal datasets from scratch. We collect Japanese image-text pairs and interleaved data from web archives and generate Japanese instruction data directly from images using an existing VLM. Our experimental results show that a VLM trained on these native datasets outperforms those relying on machine-translated content.

* 15 pages, 7 figures

LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs

Jul 04, 2024

LLM-jp: Akiko Aizawa, Eiji Aramaki, Bowen Chen, Fei Cheng, Hiroyuki Deguchi, Rintaro Enomoto, Kazuki Fujii, Kensuke Fukumoto (+72 more)

Abstract: This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit https://llm-jp.nii.ac.jp/en/.

Reinforcement Learning for Edit-Based Non-Autoregressive Neural Machine Translation

May 02, 2024

Hao Wang, Tetsuro Morimura, Ukyo Honda, Daisuke Kawahara

Abstract: Non-autoregressive (NAR) language models are known for their low latency in neural machine translation (NMT). However, a performance gap exists between NAR and autoregressive models due to the large decoding space and difficulty in capturing dependency between target words accurately. Compounding this, preparing appropriate training data for NAR models is a non-trivial task, often exacerbating exposure bias. To address these challenges, we apply reinforcement learning (RL) to Levenshtein Transformer, a representative edit-based NAR model, demonstrating that RL with self-generated data can enhance the performance of edit-based NAR models. We explore two RL approaches: stepwise reward maximization and episodic reward maximization. We discuss the respective pros and cons of these two approaches and empirically verify them. Moreover, we experimentally investigate the impact of temperature setting on performance, confirming the importance of proper temperature setting for NAR models' training.

Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance

Feb 22, 2024

Ziqi Yin, Hao Wang, Kaito Horio, Daisuke Kawahara, Satoshi Sekine

Abstract: We investigate the impact of politeness levels in prompts on the performance of large language models (LLMs). Polite language in human communications often garners more compliance and effectiveness, while rudeness can cause aversion, impacting response quality. We consider that LLMs mirror human communication traits, suggesting they align with human cultural norms. We assess the impact of politeness in prompts on LLMs across English, Chinese, and Japanese tasks. We observed that impolite prompts often result in poor performance, but overly polite language does not guarantee better outcomes. The best politeness level is different according to the language. This phenomenon suggests that LLMs not only reflect human behavior but are also influenced by language, particularly in different cultural contexts. Our findings highlight the need to factor in politeness for cross-cultural natural language processing and LLM usage.

SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition

Jan 18, 2024

Hao Wang, Shuhei Kurita, Shuichiro Shimizu, Daisuke Kawahara

Abstract: Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR), using video as a complement to audio. In AVSR, considerable efforts have been directed at datasets for facial features such as lip-readings, while they often fall short in evaluating the image comprehension capabilities in broader contexts. In this paper, we construct SlideAVSR, an AVSR dataset using scientific paper explanation videos. SlideAVSR provides a new benchmark where models transcribe speech utterances with texts on the slides on the presentation recordings. As technical terminologies that are frequent in paper explanations are notoriously challenging to transcribe without reference texts, our SlideAVSR dataset spotlights a new aspect of AVSR problems. As a simple yet effective baseline, we propose DocWhisper, an AVSR model that can refer to textual information from slides, and confirm its effectiveness on SlideAVSR.

Exploring Automatic Evaluation Methods based on a Decoder-based LLM for Text Generation

Oct 17, 2023

Tomohito Kasahara, Daisuke Kawahara

Abstract: Automatic evaluation of text generation is essential for improving the accuracy of generation tasks. In light of the current trend towards increasingly larger decoder-based language models, we investigate automatic evaluation methods based on such models for text generation. This paper compares various methods, including tuning with encoder-based models and large language models under equal conditions, on two different tasks, machine translation evaluation and semantic textual similarity, in two languages, Japanese and English. Experimental results show that compared to the tuned encoder-based models, the tuned decoder-based models perform poorly. The analysis of the causes for this suggests that the decoder-based models focus on surface word sequences and do not capture meaning. It is also revealed that in-context learning of very large decoder-based models such as ChatGPT makes it difficult to identify fine-grained semantic differences.

* Accepted to IJCNLP-AACL 2023 SRW

PHALM: Building a Knowledge Graph from Scratch by Prompting Humans and a Language Model

Oct 11, 2023

Tatsuya Ide, Eiki Murata, Daisuke Kawahara, Takato Yamazaki, Shengzhe Li, Kenta Shinzato, Toshinori Sato

Abstract: Despite the remarkable progress in natural language understanding with pretrained Transformers, neural language models often do not handle commonsense knowledge well. Toward commonsense-aware models, there have been attempts to obtain knowledge, ranging from automatic acquisition to crowdsourcing. However, it is difficult to obtain a high-quality knowledge base at a low cost, especially from scratch. In this paper, we propose PHALM, a method of building a knowledge graph from scratch, by prompting both crowdworkers and a large language model (LLM). We used this method to build a Japanese event knowledge graph and trained Japanese commonsense generation models. Experimental results revealed the acceptability of the built graph and inferences generated by the trained models. We also report the difference in prompting humans and an LLM. Our code, data, and models are available at github.com/nlp-waseda/comet-atomic-ja.

Kanbun-LM: Reading and Translating Classical Chinese in Japanese Methods by Language Models

May 22, 2023

Hao Wang, Hirofumi Shimizu, Daisuke Kawahara

Abstract: Recent studies in natural language processing (NLP) have focused on modern languages and achieved state-of-the-art results in many tasks. Meanwhile, little attention has been paid to ancient texts and related tasks. Classical Chinese first came to Japan approximately 2,000 years ago. It was gradually adapted to a Japanese form called Kanbun-Kundoku (Kanbun) in Japanese reading and translating methods, which has significantly impacted Japanese literature. However, compared to the rich resources for ancient texts in mainland China, Kanbun resources remain scarce in Japan. To solve this problem, we construct the first Classical-Chinese-to-Kanbun dataset in the world. Furthermore, we introduce two tasks, character reordering and machine translation, both of which play a significant role in Kanbun comprehension. We also test the current language models on these tasks and discuss the best evaluation method by comparing the results with human scores. We release our code and dataset on GitHub.

Grounding in social media: An approach to building a chit-chat dialogue model

Jun 12, 2022

Ritvik Choudhary, Daisuke Kawahara

Abstract: Building open-domain dialogue systems capable of rich human-like conversational ability is one of the fundamental challenges in language generation. However, even with recent advancements in the field, existing open-domain generative models fail to capture and utilize external knowledge, leading to repetitive or generic responses to unseen utterances. Current work on knowledge-grounded dialogue generation primarily focuses on persona incorporation or searching a fact-based structured knowledge source such as Wikipedia. Our method takes a broader and simpler approach, which aims to improve the raw conversation ability of the system by mimicking the human response behavior through casual interactions found on social media. Utilizing a joint retriever-generator setup, the model queries a large set of filtered comment data from Reddit to act as additional context for the seq2seq generator. Automatic and human evaluations on open-domain dialogue datasets demonstrate the effectiveness of our approach.

* Accepted to NAACL 2022 SRW
