Abstract:Speech is the most natural way of expressing ourselves as humans. Identifying emotion from speech is a nontrivial task due to the ambiguous definition of emotion itself. Speaker Emotion Recognition (SER) is essential for understanding human emotional behavior. The SER task is challenging due to the variety of speakers, background noise, complexity of emotions, and speaking styles. It has many applications in education, healthcare, customer service, and Human-Computer Interaction (HCI). Previously, conventional machine learning methods such as SVM, HMM, and KNN have been used for the SER task. In recent years, deep learning methods have become popular, with convolutional neural networks and recurrent neural networks being used for SER tasks. The input of these methods is mostly spectrograms and hand-crafted features. In this work, we study the use of self-supervised transformer-based models, Wav2Vec2 and HuBERT, to determine the emotion of speakers from their voice. The models automatically extract features from raw audio signals, which are then used for the classification task. The proposed solution is evaluated on reputable datasets, including RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB. The results show the effectiveness of the proposed method on different datasets. Moreover, the model has been used for real-world applications like call center conversations, and the results demonstrate that the model accurately predicts emotions.
Abstract:Handwriting analysis is still an important application in machine learning. A basic requirement for any handwriting recognition application is the availability of comprehensive datasets. Standard labelled datasets play a significant role in training and evaluating learning algorithms. In this paper, we present the Khayyam dataset as another large unconstrained handwriting dataset for elements (words, sentences, letters, digits) of the Persian language. We intentionally concentrated on collecting Persian word samples which are rare in the currently available datasets. Khayyam's dataset contains 44000 words, 60000 letters, and 6000 digits. Moreover, the forms were filled out by 400 native Persian writers. To show the applicability of the dataset, machine learning algorithms are trained on the digits, letters, and word data and results are reported. This dataset is available for research and academic use.