Marina Sokolova

IBDA@Dalhousie University and University of Ottawa

Explainable Multi-Label Classification of MBTI Types

May 07, 2024

Siana Kong, Marina Sokolova

Abstract: In this study, we aim to identify the most effective machine learning model for accurately classifying Myers-Briggs Type Indicator (MBTI) types from Reddit posts and a Kaggle data set. We apply multi-label classification using the Binary Relevance method. We use an Explainable Artificial Intelligence (XAI) approach to highlight the transparency and understandability of the process and results. To achieve this, we experiment with glass-box learning models, i.e. models designed for simplicity, transparency, and interpretability. We selected k-Nearest Neighbour, Multinomial Naive Bayes, and Logistic Regression as the glass-box models. We show that Multinomial Naive Bayes and k-Nearest Neighbour perform better if classes with Observer (S) traits are excluded, whereas Logistic Regression obtains its best results when all classes have > 550 entries.

* 22 pages, 12 tables, 2 figures
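
To illustrate the Binary Relevance method named in the abstract above, here is a minimal Python sketch of its label-decomposition step: a four-letter MBTI type is split into four independent binary labels, one per dichotomy, so that a separate binary classifier could be trained on each. The names `DICHOTOMIES`, `to_binary_labels`, and `split_by_label`, as well as the 0/1 encoding, are assumptions made for illustration and are not taken from the paper.

```python
# Hypothetical sketch of Binary Relevance label decomposition for MBTI types.
# All names and the 0/1 encoding are illustrative assumptions, not the authors' code.

DICHOTOMIES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]


def to_binary_labels(mbti_type):
    """Map a type such as 'INTJ' to four binary labels, one per dichotomy."""
    mbti_type = mbti_type.upper()
    if len(mbti_type) != len(DICHOTOMIES):
        raise ValueError(f"expected a 4-letter type, got {mbti_type!r}")
    labels = []
    for letter, (first, second) in zip(mbti_type, DICHOTOMIES):
        if letter == first:
            labels.append(1)
        elif letter == second:
            labels.append(0)
        else:
            raise ValueError(f"unexpected letter {letter!r} in {mbti_type!r}")
    return labels


def split_by_label(types):
    """Return one binary target vector per dichotomy (one per binary classifier)."""
    rows = [to_binary_labels(t) for t in types]
    return [list(column) for column in zip(*rows)]
```

Each of the four vectors returned by `split_by_label` would serve as the target for one binary classifier; the four predictions are then recombined into a full type.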

Longitudinal Sentiment Topic Modelling of Reddit Posts

Jan 24, 2024

Fabian Nwaoha, Ziyad Gaffar, Ho Joon Chun, Marina Sokolova

Abstract: In this study, we analyze the texts of Reddit posts written by students of four major Canadian universities. We gauge the emotional tone and uncover prevailing themes and discussions through longitudinal topic modeling of the posts' textual data. Our study focuses on four years, 2020-2023, covering the COVID-19 pandemic and the post-pandemic years. Our results highlight a gradual uptick in discussions related to mental health.

* 21 pages, 4 figures, 13 tables. arXiv admin note: text overlap with arXiv:2401.12382

Longitudinal Sentiment Classification of Reddit Posts

Jan 22, 2024

Fabian Nwaoha, Ziyad Gaffar, Ho Joon Chun, Marina Sokolova

Abstract: We report the results of a longitudinal sentiment classification of Reddit posts written by students of four major Canadian universities. We work with the texts of the posts, concentrating on the years 2020-2023. By fine-tuning the sentiment threshold to the range [-0.075, 0.075], we built classifiers proficient in categorizing post sentiments as positive or negative. Notably, our sentiment classification results are consistent across the four university data sets.

* 11 pages, 10 figures, 4 tables
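
As a rough illustration of the sentiment threshold mentioned in the abstract above, the following Python sketch labels a post from a numeric polarity score. It assumes, and the abstract does not confirm, that the score lies in [-1, 1], that scores above 0.075 count as positive and scores below -0.075 as negative, and that scores inside the range are left unlabeled. The names `THRESHOLD` and `label_sentiment` are hypothetical.

```python
# Hypothetical sketch: turning a polarity score into a positive/negative label.
# The handling of scores inside [-0.075, 0.075] is an assumption, not the authors' rule.

THRESHOLD = 0.075


def label_sentiment(score, threshold=THRESHOLD):
    """Return 'positive', 'negative', or None for scores inside the threshold range."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return None  # assumed: too close to neutral to label


scores = [0.5, -0.2, 0.03]
labels = [label_sentiment(s) for s in scores]
```

Under these assumptions, posts that come back as `None` would simply be excluded from the positive/negative training data.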

Sentiment Analysis of Covid-related Reddits

May 13, 2022

Yilin Yang, Tomas Fieg, Marina Sokolova

Abstract: This paper focuses on Sentiment Analysis of Covid-19 related messages from the r/Canada and r/Unitedkingdom subreddits of Reddit. We apply manual annotation and three Machine Learning algorithms to analyze sentiments conveyed in those messages. We use VADER and TextBlob to label messages for Machine Learning experiments. Our results show that removal of the shortest and longest messages improves VADER and TextBlob agreement on positive sentiments and the F-score of sentiment classification by all three algorithms.

* 10 pages, 1 figure, 5 tables

Sentiment Analysis of the COVID-related r/Depression Posts

Jul 28, 2021

Zihan Chen, Marina Sokolova

Abstract: Reddit.com is a popular social media platform among young people. Reddit users share their stories to seek support from other users, especially during the Covid-19 pandemic. Messages posted on Reddit and their content have provided researchers with an opportunity to analyze public concerns. In this study, we analyzed sentiments of COVID-related messages posted on r/Depression. Our study poses the following questions: a) What are the common topics that the Reddit users discuss? b) Can we use these topics to classify sentiments of the posts? c) What matters concern people more during the pandemic? Key Words: Sentiment Classification, Depression, COVID-19, Reddit, LDA, BERT

* 16 pages, 7 figures, 5 tables, 1 appendix

Explainable Multi-class Classification of the CAMH COVID-19 Mental Health Data

May 27, 2021

YuanZheng Hu, Marina Sokolova

Abstract: Application of Machine Learning algorithms to the medical domain is an emerging trend that helps to advance medical knowledge. At the same time, there is a significant lack of explainable studies that promote informed, transparent, and interpretable use of Machine Learning algorithms. In this paper, we present explainable multi-class classification of the Covid-19 mental health data. In this Machine Learning study, we aim to find the potential factors that influence a person's mental health during the Covid-19 pandemic. We found that Random Forest (RF) and Gradient Boosting (GB) scored the highest accuracy of 68.08% and 68.19% respectively, with LIME prediction accuracy of 65.5% for RF and 61.8% for GB. We then compare a Post-hoc system (Local Interpretable Model-Agnostic Explanations, or LIME) and an Ante-hoc system (Gini Importance) in their ability to explain the obtained Machine Learning results. To the best of the authors' knowledge, our study is the first explainable Machine Learning study of the mental health data collected during the Covid-19 pandemic.

* 22 pages, including Appendixes; 7 tables and 5 figures in the main text

Convolutional Neural Networks in Multi-Class Classification of Medical Data

Dec 28, 2020

YuanZheng Hu, Marina Sokolova

Abstract: We report applications of Convolutional Neural Networks (CNN) to multi-class classification of a large medical data set. We discuss in detail how changes in the CNN model and the data pre-processing impact the classification results. In the end, we introduce an ensemble model that consists of both deep learning (CNN) and shallow learning models (Gradient Boosting). The method achieves Accuracy of 64.93, the highest three-class classification accuracy we achieved in this study. Our results also show that CNN and the ensemble consistently obtain a higher Recall than Precision. The highest Recall is 68.87, whereas the highest Precision is 65.04.

* 13 pages; 14 tables

Explainable Multi-class Classification of Medical Data

Dec 26, 2020

YuanZheng Hu, Marina Sokolova

Abstract: Machine Learning applications have brought new insights into a secondary analysis of medical data. Machine Learning helps to develop new drugs, define populations susceptible to certain illnesses, and identify predictors of many common diseases. At the same time, Machine Learning results depend on a convolution of many factors, including feature selection, class (im)balance, algorithm preference, and performance metrics. In this paper, we present explainable multi-class classification of a large medical data set. We discuss in detail knowledge-based feature engineering, data set balancing, best model selection, and parameter tuning. Six algorithms are used in this study: Support Vector Machine (SVM), Naïve Bayes, Gradient Boosting, Decision Trees, Random Forest, and Logistic Regression. Our empirical evaluation is done on the UCI Diabetes 130-US hospitals for years 1999-2008 dataset, with the task to classify patient hospital re-admission stay into three classes: 0 days, <30 days, or >30 days. Our results show that using 23 medication features in learning experiments improves Recall of five out of the six applied learning algorithms. This is a new result that expands the previous studies conducted on the same data. Gradient Boosting and Random Forest outperformed other algorithms in terms of the three-class classification Accuracy.

* 21 pages; 23 tables; 2 appendixes

Machine Learning Evaluation of the Echo-Chamber Effect in Medical Forums

Oct 19, 2020

Marina Sokolova, Victoria Bobicev

Abstract: We propose the Echo-Chamber Effect assessment of an online forum. Sentiments perceived by the forum readers are at the core of the analysis; a complete message is the unit of the study. We build 14 models and apply them to represent discussions gathered from an online medical forum. We use four multi-class sentiment classification applications and two Machine Learning algorithms to evaluate the prowess of the assessment models.

* 17 pages, including Appendix; 6 figures in the main text; 5 tables in the main text and 7 tables in Appendix

Word2Vec and Doc2Vec in Unsupervised Sentiment Analysis of Clinical Discharge Summaries

May 01, 2018

Qufei Chen, Marina Sokolova

Abstract: In this study, we explored the application of Word2Vec and Doc2Vec for sentiment analysis of clinical discharge summaries. We applied unsupervised learning since the data sets did not have sentiment annotations. Note that unsupervised learning is a more realistic scenario than supervised learning, which requires access to a training set of sentiment-annotated data. We aim to detect if there exists any underlying bias towards or against a certain disease. We used SentiWordNet to establish a gold sentiment standard for the data sets and evaluate the performance of the Word2Vec and Doc2Vec methods. We have shown that the Word2Vec and Doc2Vec methods complement each other's results in sentiment analysis of the data sets.

* 23 pages, 3 figures, 16 tables
