Title: Validating and Exploring Large Geographic Corpora

Mar 13, 2024

Jonathan Dunn

Abstract: This paper investigates the impact of corpus creation decisions on large multi-lingual geographic web corpora. Beginning with a 427 billion word corpus derived from the Common Crawl, three methods are used to improve the quality of sub-corpora representing specific language-country pairs like New Zealand English: (i) the agreement of independent language identification systems, (ii) hash-based deduplication, and (iii) location-specific outlier detection. The impact of each of these steps is then evaluated at the language level and the country level by using corpus similarity measures to compare each resulting corpus with baseline data sets. The goal is to understand the impact of upstream data cleaning decisions on downstream corpora with a specific focus on under-represented languages and populations. The evaluation shows that the validity of sub-corpora is improved with each stage of cleaning but that this improvement is unevenly distributed across languages and populations. This result shows how standard corpus creation techniques can accidentally exclude under-represented populations.
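
As a rough illustration of the first two cleaning steps named in the abstract, the sketch below keeps a document only when every language identifier returns the same label, then drops exact duplicates by hashing normalized text. It is not the paper's implementation: the function names, the toy identifiers, the SHA-256 hash, and the whitespace and case normalization are all assumptions made for this example. Step (iii), location-specific outlier detection, is not covered.

```python
import hashlib
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical type: a language identifier maps text to a language code.
Identifier = Callable[[str], str]


def clean_corpus(
    docs: Iterable[str], identifiers: List[Identifier]
) -> Iterator[Tuple[str, str]]:
    """Yield (language, text) for documents that pass both filters."""
    seen = set()  # hashes of documents already kept
    for text in docs:
        # Step (i): keep only documents on which all identifiers agree.
        labels = {identify(text) for identify in identifiers}
        if len(labels) != 1:
            continue
        # Step (ii): hash-based deduplication on normalized text.
        # Collapsing whitespace and lowercasing is an assumed normalization.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield next(iter(labels)), text


# Toy stand-ins for real language identification systems (assumptions).
def toy_identifier_a(text: str) -> str:
    return "mri" if "kia ora" in text.lower() else "eng"


def toy_identifier_b(text: str) -> str:
    lowered = text.lower()
    return "mri" if "kia ora" in lowered or "whānau" in lowered else "eng"


sample = [
    "Kia ora koutou",
    "The weather is fine",
    "The  weather is FINE",          # duplicate after normalization
    "Our whānau went to the beach",  # identifiers disagree
]
result = list(clean_corpus(sample, [toy_identifier_a, toy_identifier_b]))
```

The fourth sample document is the kind of case the abstract's conclusion points at: text that mixes languages can be discarded by an agreement filter even though it is valid data for the population being sampled.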
