Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chandramouli Sanchi

Language Detection Engine for Multilingual Texting on Mobile Devices

Jan 07, 2021

Sourabh Vasant Gothe, Sourav Ghosh, Sharmila Mani, Guggilla Bhanodai, Ankur Agarwal, Chandramouli Sanchi

Figure 1 for Language Detection Engine for Multilingual Texting on Mobile Devices

Figure 2 for Language Detection Engine for Multilingual Texting on Mobile Devices

Figure 3 for Language Detection Engine for Multilingual Texting on Mobile Devices

Figure 4 for Language Detection Engine for Multilingual Texting on Mobile Devices

Abstract:More than 2 billion mobile users worldwide type in multiple languages in the soft keyboard. On a monolingual keyboard, 38% of falsely auto-corrected words are valid in another language. This can be easily avoided by detecting the language of typed words and then validating it in its respective language. Language detection is a well-known problem in natural language processing. In this paper, we present a fast, light-weight and accurate Language Detection Engine (LDE) for multilingual typing that dynamically adapts to user intended language in real-time. We propose a novel approach where the fusion of character N-gram model and logistic regression based selector model is used to identify the language. Additionally, we present a unique method of reducing the inference time significantly by parameter reduction technique. We also discuss various optimizations fabricated across LDE to resolve ambiguity in input text among the languages with the same character pattern. Our method demonstrates an average accuracy of 94.5% for Indian languages in Latin script and that of 98% for European languages on the code-switched data. This model outperforms fastText by 60.39% and ML-Kit by 23.67% in F1 score for European languages. LDE is faster on mobile device with an average inference time of 25.91 microseconds.

* 2020 IEEE 14th International Conference on Semantic Computing (ICSC), San Diego, CA, USA, 2020, pp. 279-286
* 2020 IEEE 14th International Conference on Semantic Computing (ICSC). Accessible at https://ieeexplore.ieee.org/document/9031474

Via

Access Paper or Ask Questions

edATLAS: An Efficient Disambiguation Algorithm for Texting in Languages with Abugida Scripts

Jan 05, 2021

Sourav Ghosh, Sourabh Vasant Gothe, Chandramouli Sanchi, Barath Raj Kandur Raja

Figure 1 for edATLAS: An Efficient Disambiguation Algorithm for Texting in Languages with Abugida Scripts

Figure 2 for edATLAS: An Efficient Disambiguation Algorithm for Texting in Languages with Abugida Scripts

Figure 3 for edATLAS: An Efficient Disambiguation Algorithm for Texting in Languages with Abugida Scripts

Figure 4 for edATLAS: An Efficient Disambiguation Algorithm for Texting in Languages with Abugida Scripts

Abstract:Abugida refers to a phonogram writing system where each syllable is represented using a single consonant or typographic ligature, along with a default vowel or optional diacritic(s) to denote other vowels. However, texting in these languages has some unique challenges in spite of the advent of devices with soft keyboard supporting custom key layouts. The number of characters in these languages is large enough to require characters to be spread over multiple views in the layout. Having to switch between views many times to type a single word hinders the natural thought process. This prevents popular usage of native keyboard layouts. On the other hand, supporting romanized scripts (native words transcribed using Latin characters) with language model based suggestions is also set back by the lack of uniform romanization rules. To this end, we propose a disambiguation algorithm and showcase its usefulness in two novel mutually non-exclusive input methods for languages natively using the abugida writing system: (a) disambiguation of ambiguous input for abugida scripts, and (b) disambiguation of word variants in romanized scripts. We benchmark these approaches using public datasets, and show an improvement in typing speed by 19.49%, 25.13%, and 14.89%, in Hindi, Bengali, and Thai, respectively, using Ambiguous Input, owing to the human ease of locating keys combined with the efficiency of our inference method. Our Word Variant Disambiguation (WDA) maps valid variants of romanized words, previously treated as Out-of-Vocab, to a vocabulary of 100k words with high accuracy, leading to an increase in Error Correction F1 score by 10.03% and Next Word Prediction (NWP) by 62.50% on average.

* Accepted for publication in the 15th IEEE International Conference on Semantic Computing (IEEE ICSC 2021)

Via

Access Paper or Ask Questions