Abstract:Despite the progress in scaling multilingual machine translation (MT) models and evaluation data to several under-resourced African languages, it is difficult to measure that progress accurately because evaluation often relies on n-gram matching metrics like BLEU, which often correlate less well with human judgments. Embedding-based metrics such as COMET correlate better; however, the lack of evaluation data with human ratings for under-resourced languages, the complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and the limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with a simplified MQM guideline for error-span annotation and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET, a COMET evaluation metric for African languages, by leveraging DA training data from high-resource languages and an African-centric multilingual encoder (AfroXLM-Roberta) to create the state-of-the-art evaluation metric for African-language MT with respect to Spearman-rank correlation with human judgments (+0.406).
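As a rough illustration of the evaluation criterion named in the abstract above, the following minimal sketch computes a Spearman rank correlation between automatic metric scores and human DA ratings. It is not the authors' evaluation code: the function names `average_ranks` and `spearman` and the sample scores are hypothetical and chosen only for this example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-segment scores, invented for illustration only.
metric_scores = [0.81, 0.42, 0.67, 0.15, 0.93]   # automatic metric outputs
human_da_scores = [78, 55, 40, 20, 95]           # human direct-assessment ratings

rho = spearman(metric_scores, human_da_scores)
print(f"Spearman correlation: {rho:.3f}")
```

Because the correlation depends only on ranks, a metric is rewarded for ordering translations the way human raters do, regardless of the scale of its raw scores.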
Abstract:The adoption of data science brings vast benefits to Small and Medium-sized Enterprises (SMEs), including business productivity, economic growth, innovation and job creation. Data science can help SMEs optimise production processes, anticipate customers' needs, predict machinery failures and deliver efficient smart services. Businesses can also harness the power of Artificial Intelligence (AI), Big Data and the smart use of digital technologies to enhance productivity and performance, paving the way for innovation. However, integrating data science decisions into an SME requires both skills and IT investments. In most cases, such expenses are beyond the means of SMEs due to limited resources and restricted access to financing. This paper presents trends and challenges towards effective data-driven decision making for organisations, based on a case study of 85 SMEs, mostly from the West Midlands region of England. The work is supported as part of a 3-year ERDF (European Regional Development Fund) project in the areas of big data management, analytics and business intelligence. We present two case studies that demonstrate the potential of Digitisation, AI and Machine Learning, and use these as examples to unveil challenges and showcase the wealth of currently available opportunities for SMEs.
Abstract:African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, we achieve more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) by leveraging the PET approach.
Abstract:Small and Medium Enterprises (SMEs) now generate digital data at an unprecedented rate from online transactions, social media marketing and associated customer interactions, online product or service reviews and feedback, clinical diagnosis, Internet of Things (IoT) sensors, and production processes. All these forms of data can be transformed into monetary value if put into a proper data value chain. This requires both skills and IT investments for the long-term benefit of businesses. However, such spending is beyond the capacity of most SMEs due to their limited resources and restricted access to finances. This paper presents lessons learned from a case study of 53 UK SMEs, mostly from the West Midlands region of England, supported as part of a 3-year ERDF project, Big Data Corridor, in the areas of big data management, analytics and related IT issues. Based on our study's sample companies, the paper presents several perspectives, including digital technology trends, the challenges facing UK SMEs, and the state of their adoption of data analytics and big data.