Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Noriki Nishida

The Wisdom of Many Queries: Complexity-Diversity Principle for Dense Retriever Training

Feb 10, 2026

Xincan Feng, Noriki Nishida, Yusuke Sakai, Yuji Matsumoto

Abstract:Prior work reports conflicting results on query diversity in synthetic data generation for dense retrieval. We identify this conflict and design Q-D metrics to quantify diversity's impact, making the problem measurable. Through experiments on 4 benchmark types (31 datasets), we find query diversity especially benefits multi-hop retrieval. Deep analysis on multi-hop data reveals that diversity benefit correlates strongly with query complexity ($r$$\geq$0.95, $p$$<$0.05 in 12/14 conditions), measured by content words (CW). We formalize this as the Complexity-Diversity Principle (CDP): query complexity determines optimal diversity. CDP provides actionable thresholds (CW$>$10: use diversity; CW$<$7: avoid it). Guided by CDP, we propose zero-shot multi-query synthesis for multi-hop tasks, achieving state-of-the-art performance.

* Under review

Via

Access Paper or Ask Questions

Better Generalizing to Unseen Concepts: An Evaluation Framework and An LLM-Based Auto-Labeled Pipeline for Biomedical Concept Recognition

Jan 23, 2026

Shanshan Liu, Noriki Nishida, Fei Cheng, Narumi Tokunaga, Rumana Ferdous Munne, Yuki Yamagata, Kouji Kozaki, Takehito Utsuro, Yuji Matsumoto

Abstract:Generalization to unseen concepts is a central challenge due to the scarcity of human annotations in Mention-agnostic Biomedical Concept Recognition (MA-BCR). This work makes two key contributions to systematically address this issue. First, we propose an evaluation framework built on hierarchical concept indices and novel metrics to measure generalization. Second, we explore LLM-based Auto-Labeled Data (ALD) as a scalable resource, creating a task-specific pipeline for its generation. Our research unequivocally shows that while LLM-generated ALD cannot fully substitute for manual annotations, it is a valuable resource for improving generalization, successfully providing models with the broader coverage and structural knowledge needed to approach recognizing unseen concepts. Code and datasets are available at https://github.com/bio-ie-tool/hi-ald.

* Accepted to EACL 2026 (Main)

Via

Access Paper or Ask Questions

Post Persona Alignment for Multi-Session Dialogue Generation

Jun 13, 2025

Yi-Pei Chen, Noriki Nishida, Hideki Nakayama, Yuji Matsumoto

Figure 1 for Post Persona Alignment for Multi-Session Dialogue Generation

Figure 2 for Post Persona Alignment for Multi-Session Dialogue Generation

Figure 3 for Post Persona Alignment for Multi-Session Dialogue Generation

Figure 4 for Post Persona Alignment for Multi-Session Dialogue Generation

Abstract:Multi-session persona-based dialogue generation presents challenges in maintaining long-term consistency and generating diverse, personalized responses. While large language models (LLMs) excel in single-session dialogues, they struggle to preserve persona fidelity and conversational coherence across extended interactions. Existing methods typically retrieve persona information before response generation, which can constrain diversity and result in generic outputs. We propose Post Persona Alignment (PPA), a novel two-stage framework that reverses this process. PPA first generates a general response based solely on dialogue context, then retrieves relevant persona memories using the response as a query, and finally refines the response to align with the speaker's persona. This post-hoc alignment strategy promotes naturalness and diversity while preserving consistency and personalization. Experiments on multi-session LLM-generated dialogue data demonstrate that PPA significantly outperforms prior approaches in consistency, diversity, and persona relevance, offering a more flexible and effective paradigm for long-term personalized dialogue generation.

Via

Access Paper or Ask Questions

MA-COIR: Leveraging Semantic Search Index and Generative Models for Ontology-Driven Biomedical Concept Recognition

May 19, 2025

Shanshan Liu, Noriki Nishida, Rumana Ferdous Munne, Narumi Tokunaga, Yuki Yamagata, Kouji Kozaki, Yuji Matsumoto

Abstract:Recognizing biomedical concepts in the text is vital for ontology refinement, knowledge graph construction, and concept relationship discovery. However, traditional concept recognition methods, relying on explicit mention identification, often fail to capture complex concepts not explicitly stated in the text. To overcome this limitation, we introduce MA-COIR, a framework that reformulates concept recognition as an indexing-recognition task. By assigning semantic search indexes (ssIDs) to concepts, MA-COIR resolves ambiguities in ontology entries and enhances recognition efficiency. Using a pretrained BART-based model fine-tuned on small datasets, our approach reduces computational requirements to facilitate adoption by domain experts. Furthermore, we incorporate large language models (LLMs)-generated queries and synthetic data to improve recognition in low-resource settings. Experimental results on three scenarios (CDR, HPO, and HOIP) highlight the effectiveness of MA-COIR in recognizing both explicit and implicit concepts without the need for mention-level annotations during inference, advancing ontology-driven concept recognition in biomedical domain applications. Our code and constructed data are available at https://github.com/sl-633/macoir-master.

* preprint

Via

Access Paper or Ask Questions

Recent Trends in Personalized Dialogue Generation: A Review of Datasets, Methodologies, and Evaluations

May 28, 2024

Yi-Pei Chen, Noriki Nishida, Hideki Nakayama, Yuji Matsumoto

Figure 1 for Recent Trends in Personalized Dialogue Generation: A Review of Datasets, Methodologies, and Evaluations

Figure 2 for Recent Trends in Personalized Dialogue Generation: A Review of Datasets, Methodologies, and Evaluations

Figure 3 for Recent Trends in Personalized Dialogue Generation: A Review of Datasets, Methodologies, and Evaluations

Figure 4 for Recent Trends in Personalized Dialogue Generation: A Review of Datasets, Methodologies, and Evaluations

Abstract:Enhancing user engagement through personalization in conversational agents has gained significance, especially with the advent of large language models that generate fluent responses. Personalized dialogue generation, however, is multifaceted and varies in its definition -- ranging from instilling a persona in the agent to capturing users' explicit and implicit cues. This paper seeks to systemically survey the recent landscape of personalized dialogue generation, including the datasets employed, methodologies developed, and evaluation metrics applied. Covering 22 datasets, we highlight benchmark datasets and newer ones enriched with additional features. We further analyze 17 seminal works from top conferences between 2021-2023 and identify five distinct types of problems. We also shed light on recent progress by LLMs in personalized dialogue generation. Our evaluation section offers a comprehensive summary of assessment facets and metrics utilized in these works. In conclusion, we discuss prevailing challenges and envision prospect directions for future research in personalized dialogue generation.

* Presented in LREC-COLING 2024

Via

Access Paper or Ask Questions

Zero-resource Machine Translation by Multimodal Encoder-decoder Network with Multimedia Pivot

Jul 23, 2017

Hideki Nakayama, Noriki Nishida

Figure 1 for Zero-resource Machine Translation by Multimodal Encoder-decoder Network with Multimedia Pivot

Figure 2 for Zero-resource Machine Translation by Multimodal Encoder-decoder Network with Multimedia Pivot

Figure 3 for Zero-resource Machine Translation by Multimodal Encoder-decoder Network with Multimedia Pivot

Figure 4 for Zero-resource Machine Translation by Multimodal Encoder-decoder Network with Multimedia Pivot

Abstract:We propose an approach to build a neural machine translation system with no supervised resources (i.e., no parallel corpora) using multimodal embedded representation over texts and images. Based on the assumption that text documents are often likely to be described with other multimedia information (e.g., images) somewhat related to the content, we try to indirectly estimate the relevance between two languages. Using multimedia as the "pivot", we project all modalities into one common hidden space where samples belonging to similar semantic concepts should come close to each other, whatever the observed space of each sample is. This modality-agnostic representation is the key to bridging the gap between different modalities. Putting a decoder on top of it, our network can flexibly draw the outputs from any input modality. Notably, in the testing phase, we need only source language texts as the input for translation. In experiments, we tested our method on two benchmarks to show that it can achieve reasonable translation performance. We compared and investigated several possible implementations and found that an end-to-end model that simultaneously optimized both rank loss in multimodal encoders and cross-entropy loss in decoders performed the best.

* Some error corrections in Sect.2.2 and Table 5, Machine Translation, 2017

Via

Access Paper or Ask Questions