Title: Pralekha: An Indic Document Alignment Evaluation Benchmark

Nov 28, 2024

Sanjay Suryanarayanan, Haiyue Song, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, Mitesh M. Khapra, Raj Dabre

Abstract: Mining parallel document pairs poses a significant challenge because existing sentence embedding models often have limited context windows, preventing them from effectively capturing document-level information. Another overlooked issue is the lack of concrete evaluation benchmarks comprising high-quality parallel document pairs for assessing document-level mining approaches, particularly for Indic languages. In this study, we introduce Pralekha, a large-scale benchmark for document-level alignment evaluation. Pralekha includes over 2 million documents, with a 1:2 ratio of unaligned to aligned pairs, covering 11 Indic languages and English. Using Pralekha, we evaluate various document-level mining approaches across three dimensions: the embedding models, the granularity levels, and the alignment algorithm. To address the challenge of aligning documents using sentence and chunk-level alignments, we propose a novel scoring method, Document Alignment Coefficient (DAC). DAC demonstrates substantial improvements over baseline pooling approaches, particularly in noisy scenarios, achieving average gains of 20-30% in precision and 15-20% in F1 score. These results highlight DAC's effectiveness in parallel document mining for Indic languages.
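
To make the "baseline pooling approaches" mentioned in the abstract concrete, here is a minimal sketch of one possible pooling baseline. It mean-pools chunk embeddings into a document vector, pairs documents by cosine similarity, and scores the result with precision, recall, and F1. This is not the DAC method, which the abstract does not define. The function names (`mean_pool`, `align_documents`, `precision_recall_f1`), the greedy best-match rule, and the `threshold` value are all assumptions made for illustration, and the embeddings are assumed to come from some sentence or chunk encoder not shown here.

```python
import numpy as np

def mean_pool(chunk_embeddings):
    """Average a document's chunk vectors into one unit-length vector."""
    doc = np.asarray(chunk_embeddings, dtype=float).mean(axis=0)
    norm = np.linalg.norm(doc)
    return doc / norm if norm > 0 else doc  # guard against an all-zero vector

def align_documents(src_docs, tgt_docs, threshold=0.5):
    """Greedy baseline (illustrative): pair each source document with its
    most similar target document, keeping pairs at or above `threshold`."""
    src = np.stack([mean_pool(d) for d in src_docs])
    tgt = np.stack([mean_pool(d) for d in tgt_docs])
    sim = src @ tgt.T  # cosine similarity, since rows are unit length
    best = sim.argmax(axis=1)
    return [(i, int(j)) for i, j in enumerate(best) if sim[i, j] >= threshold]

def precision_recall_f1(predicted, gold):
    """Score predicted (source, target) index pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy 2-dimensional "embeddings": each document is an array of chunk vectors.
src_docs = [np.array([[1.0, 0.0], [1.0, 0.1]]), np.array([[0.0, 1.0], [0.1, 1.0]])]
tgt_docs = [np.array([[0.0, 1.0]]), np.array([[1.0, 0.0]])]

pairs = align_documents(src_docs, tgt_docs)
scores = precision_recall_f1(pairs, gold=[(0, 1), (1, 0)])
```

Mean pooling discards which chunks matched which, so unrelated documents with similar overall content can score highly. That weakness is one plausible reason a pooling baseline would lose precision in noisy settings, although the abstract itself does not give the cause.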

* Work in Progress
