Title: Unified Multi-Modal Interleaved Document Representation for Information Retrieval

Oct 03, 2024

Jaewoo Lee, Joonho Ko, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang


Abstract: Information Retrieval (IR) methods, which aim to identify documents relevant to a given query, have gained remarkable attention due to their successful application in various natural language tasks. However, existing approaches typically consider only the textual information within documents, overlooking the fact that documents can contain multiple modalities, including text, images, and tables. Further, they often segment each long document into multiple discrete passages for embedding, which prevents them from capturing the overall document context and the interactions between paragraphs. We argue that these two limitations lead to suboptimal document representations for retrieval. In this work, to address them, we aim to produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities. Specifically, we achieve this by leveraging recent vision-language models, which can process and integrate text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we merge the representations of the segmented passages into a single document representation, and we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document when necessary. Through extensive experiments on diverse information retrieval scenarios with both textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to its unified treatment of the multimodal information interleaved within documents.
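
The retrieve-then-rerank flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the abstract does not say how passage representations are merged, so mean pooling is used here as a stand-in; cosine similarity stands in for both the retriever and the reranking strategy; and the toy 2-dimensional vectors replace embeddings that would in practice come from a vision-language model. The names `mean_pool`, `cosine`, and `retrieve_then_rerank` are hypothetical.

```python
import math


def mean_pool(passage_vecs):
    """Merge passage embeddings into one document embedding.

    Assumption: simple averaging; the abstract does not specify the merge rule.
    """
    n = len(passage_vecs)
    dim = len(passage_vecs[0])
    return [sum(vec[i] for vec in passage_vecs) / n for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_then_rerank(query_vec, corpus):
    """Pick the best document by its merged embedding, then the best passage in it.

    corpus maps a document id to the list of its passage embeddings.
    Returns (document id, index of the top-scoring passage in that document).
    """
    # Stage 1: document-level retrieval over merged (pooled) representations.
    doc_scores = {
        doc_id: cosine(query_vec, mean_pool(passages))
        for doc_id, passages in corpus.items()
    }
    best_doc = max(doc_scores, key=doc_scores.get)

    # Stage 2: rerank passages inside the retrieved document.
    # Assumption: cosine similarity replaces a learned reranker.
    passage_scores = [cosine(query_vec, p) for p in corpus[best_doc]]
    best_passage = max(range(len(passage_scores)), key=passage_scores.__getitem__)
    return best_doc, best_passage


# Toy data: hypothetical embeddings, two passages per document.
corpus = {
    "doc_a": [[1.0, 0.0], [0.8, 0.6]],
    "doc_b": [[0.0, 1.0], [-1.0, 0.0]],
}
query = [1.0, 0.2]

result = retrieve_then_rerank(query, corpus)
```

Scoring documents first and passages second keeps the first stage at one vector per document, while passage-level detail is recovered only for the document that was actually retrieved.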

* Preprint
