Abstract: A large amount of local and culture-specific knowledge (e.g., people, traditions, food) can only be found in documents written in dialects. While cross-lingual information retrieval (CLIR) has been researched extensively, cross-dialect retrieval (CDIR) has received limited attention. Dialect retrieval poses unique challenges due to the limited availability of resources for training retrieval models and the high variability of non-standardized languages. We study these challenges using German dialects as an example and introduce the first German dialect retrieval dataset, dubbed WikiDIR, which covers seven German dialects extracted from Wikipedia. Using WikiDIR, we demonstrate the weakness of lexical methods in dealing with the high lexical variation in dialects. We further show that the commonly used zero-shot cross-lingual transfer approach with multilingual encoders does not transfer well to extremely low-resource setups, motivating the need for resource-lean and dialect-specific retrieval models. We finally demonstrate that (document) translation is an effective way to reduce the dialect gap in CDIR.
Abstract: This project explores methods to enhance the translation of German Sign Language, focusing specifically on the disambiguation of homonyms. Sign language is ambiguous and understudied, which motivates our experiments. We train transformer-based models on various body-part representations to shift the focus to the respective body part. To determine the impact of individual representations, e.g., the hands or the mouth, we experiment with different combinations. The results show that focusing on the mouth increases performance in small-dataset settings, while shifting the focus to the hands yields better results in larger-dataset settings. Our results contribute to better accessibility for non-hearing persons by improving the systems powering digital assistants, enabling more accurate interaction. The code for this project can be found on GitHub.