Abstract: Application domains such as digital humanities and tools such as chatbots involve some form of natural language processing, from digitising hardcopies to speech generation. The language of the content is typically characterised as either a low resource language (LRL) or a high resource language (HRL), also known as resource-scarce and well-resourced languages, respectively. African languages have been characterised as resource-scarce languages (Bosch et al. 2007; Pretorius & Bosch 2003; Keet & Khumalo 2014), and English is by far the most well-resourced language. Varied language resources are used to develop software systems for these languages to accomplish a wide range of tasks. In this paper we argue that the dichotomous typology of LRL and HRL for all languages is problematic. Through a clear understanding of language resources situated in a society, we develop a matrix that characterises languages as Very LRL, LRL, RL, HRL, and Very HRL. The characterisation is based on a typology of contextual features for each category, rather than on counting tools, and a motivation is provided for each feature and each characterisation. The contextualisation of resourcedness, with a focus on African languages in this paper, and an increased understanding of where on the scale the language used in a project lies, may assist in, among others, better planning of research and implementation projects. We thus argue that characterising language resources on a given scale within a project is an indispensable component, particularly in the context of low-resourced languages.
Abstract: The isiZulu verb is known for its morphological complexity, which is a subject of ongoing linguistic research and is also relevant to prospective computational uses such as controlled natural language interfaces, machine translation, and spellcheckers. We seek to answer the question of what the precise grammar rules for the isiZulu complex verb are (and, by extension, for Bantu verb morphology). To this end, we iteratively specify the grammar as a Context Free Grammar and evaluate it computationally. The grammar presented in this paper covers the subject and object concords, negation, present tense, aspect, mood, the causative, applicative, stative, and reciprocal verbal extensions, politeness, the wh-question modifiers, and aspect doubling, ensuring their correct order as they appear in verbs. The grammar conforms to the specification.
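To illustrate the idea of encoding verb morphology as a Context Free Grammar whose productions fix the order of the morpheme slots, the following is a minimal sketch in Python. It is not the grammar from the paper: the nonterminal names, the tiny morpheme inventory (a few subject and object concords, the present-tense marker -ya-, two verb roots, four extensions, and the final vowels -a and -i), and the `recognises` helper are all assumptions chosen for illustration. The sketch only enforces the overall slot order and does not constrain the relative order of the extensions or model phonological changes.

```python
from functools import lru_cache

# Toy grammar (illustrative assumption, not the paper's rule set).
# Capitalised keys are nonterminals; lowercase strings are morphemes.
GRAMMAR = {
    # positive present: SC - ya - (OC) - stem - a
    # negative present: a - SC - (OC) - stem - i
    "Verb": [["SC", "Pres", "OptOC", "Stem", "FVa"],
             ["Neg", "SC", "OptOC", "Stem", "FVi"]],
    "OptOC": [["OC"], []],             # the object concord is optional
    "Stem": [["Root", "Exts"]],
    "Exts": [["Ext", "Exts"], []],     # zero or more verbal extensions
    "SC": [["ngi"], ["si"], ["ba"], ["u"]],
    "OC": [["m"], ["ba"], ["si"]],
    "Pres": [["ya"]],
    "Neg": [["a"]],
    "Root": [["bon"], ["thand"]],
    "Ext": [["is"], ["el"], ["an"], ["ek"]],
    "FVa": [["a"]],
    "FVi": [["i"]],
}


def recognises(word, start="Verb"):
    """Return True if `word` can be derived from `start` in GRAMMAR."""

    @lru_cache(maxsize=None)
    def ends(symbol, i):
        # All positions at which `symbol` can end when it starts at index i.
        if symbol not in GRAMMAR:  # terminal: match the morpheme literally
            if word.startswith(symbol, i):
                return frozenset({i + len(symbol)})
            return frozenset()
        result = set()
        for production in GRAMMAR[symbol]:
            positions = {i}
            for part in production:
                positions = {j for p in positions for j in ends(part, p)}
            result |= positions
        return frozenset(result)

    return len(word) in ends(start, 0)


if __name__ == "__main__":
    for w in ["ngiyabona", "ngiyambona", "angiboni", "bayabonana", "bonangiya"]:
        print(w, recognises(w))
```

Because the grammar has no left recursion, a memoised top-down recogniser of this kind terminates; a fuller grammar would add rules for the remaining slots listed in the abstract and for the ordering of extensions.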
Abstract: IsiZulu is one of the eleven official languages of South Africa, and roughly half the population can speak it. It is the first (home) language of over 10 million people in South Africa. Only a few computational resources exist for isiZulu and its related Nguni languages, yet the imperative for tool development exists. We focus on natural language generation, and on the grammar options and preferences in particular, which will inform the verbalization of knowledge representation languages and could contribute to machine translation. The verbalization pattern specification shows that the grammar rules are elaborate and that several options exist, of which one may be preferred. We devised verbalization patterns for subsumption, basic disjointness, existential and universal quantification, and conjunction. These were evaluated in a survey among linguists and non-linguists. Some differences between linguists and non-linguists can be observed, with the former much more in agreement, and preferences depend on the overall structure of the sentence, such as singular for subsumption and plural in other cases.