Abstract: Language models such as Bidirectional Encoder Representations from Transformers (BERT) have been very effective in various Natural Language Processing (NLP) and text mining tasks, including text classification. However, some tasks still pose challenges for these models, among them text classification with limited labels, which can result in a cold-start problem. Some approaches have attempted to address this problem by using single-stage clustering as an intermediate training step for a pre-trained language model, generating pseudo-labels to improve classification; however, these methods are often error-prone due to the limitations of the clustering algorithms. To overcome this, we have developed a novel two-stage intermediate clustering approach with subsequent fine-tuning that models the pseudo-labels reliably, resulting in reduced prediction errors. The key novelty of our model, IDoFew, is that the two clustering stages use two different, complementary clustering algorithms, whose combined strengths reduce the errors in the pseudo-labels generated for fine-tuning. Our approach has shown significant improvements compared to strong comparative models.
Abstract: COVID-19 was declared a global pandemic by the World Health Organisation (WHO). The severity of the disease spread is determined by various factors, such as a country's health care capacity and the lockdown measures it enforces. However, it is not clear whether a country's climate contributes to the number of infected cases. This paper examines the relationship between COVID-19 and the weather of 89 cities in Saudi Arabia using machine learning techniques. We compiled data from the official daily COVID-19 case reports of the Ministry of Health of Saudi Arabia, obtained historical weather data aligned with those daily reports, and preprocessed the combined data for model training and evaluation. Our results show that temperature and wind have the strongest association with the spread of the pandemic. Our main contribution is data collection, preprocessing, and prediction of daily cases. All tested models were evaluated using K-fold cross-validation with K=5. Our best model is the random forest, which achieves a Mean Square Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and $R^{2}$ of 97.30, 9.86, 1.85, and 82.3\%, respectively.
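
The evaluation protocol described above (5-fold cross-validation reporting MSE, RMSE, MAE, and $R^{2}$) can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic, the function names (`k_fold_indices`, `fit_line`, `regression_metrics`, `cross_validate`) are hypothetical, and a one-feature least-squares line stands in for the random forest so that the example needs only the Python standard library. Averaging the metrics over folds is also an assumption; the abstract does not say how the fold scores were aggregated.

```python
import math
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (stand-in for the random forest)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, my - a * mx


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for paired targets and predictions."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot if ss_tot else float("nan")
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


def cross_validate(xs, ys, k=5, seed=0):
    """Average the four metrics over k held-out folds (aggregation is assumed)."""
    folds = k_fold_indices(len(xs), k, seed)
    totals = {"mse": 0.0, "rmse": 0.0, "mae": 0.0, "r2": 0.0}
    for held_out in folds:
        test = set(held_out)
        train = [i for i in range(len(xs)) if i not in test]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [a * xs[i] + b for i in held_out]
        scores = regression_metrics([ys[i] for i in held_out], preds)
        for name in totals:
            totals[name] += scores[name] / k
    return totals


# Synthetic, made-up data: a temperature-like feature and a case-count-like target.
rng = random.Random(42)
temps = [rng.uniform(15.0, 45.0) for _ in range(100)]
cases = [2.0 * t + 5.0 + rng.gauss(0.0, 3.0) for t in temps]
print(cross_validate(temps, cases, k=5))
```

As a consistency check on the reported figures, RMSE is the square root of MSE, and $\sqrt{97.30} \approx 9.86$, which matches the values stated in the abstract.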