Hazem Hajj

American University of Beirut, Electrical and Computer Engineering Department, Beirut, Lebanon

Empathetic BERT2BERT Conversational Model: Learning Arabic Language Generation with Little Data

Mar 07, 2021

Tarek Naous, Wissam Antoun, Reem A. Mahmoud, Hazem Hajj

Abstract: Enabling empathetic behavior in Arabic dialogue agents is an important aspect of building human-like conversational models. While Arabic Natural Language Processing has seen significant advances in Natural Language Understanding (NLU) with language models such as AraBERT, Natural Language Generation (NLG) remains a challenge. The shortcomings of NLG encoder-decoder models are primarily due to the lack of Arabic datasets suitable to train NLG models such as conversational agents. To overcome this issue, we propose a transformer-based encoder-decoder initialized with AraBERT parameters. By initializing the weights of the encoder and decoder with AraBERT pre-trained weights, our model was able to leverage knowledge transfer and boost performance in response generation. To enable empathy in our conversational model, we train it using the ArabicEmpatheticDialogues dataset and achieve high performance in empathetic response generation. Specifically, our model achieved a low perplexity value of 17.0 and an increase of 5 BLEU points compared to the previous state-of-the-art model. Also, our proposed model was rated highly by 85 human evaluators, validating its high capability in exhibiting empathy while generating relevant and fluent responses in open-domain settings.
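The warm-start idea in the abstract above (encoder and decoder both initialized from one pre-trained checkpoint) can be illustrated with a toy sketch. This is not the authors' code: the checkpoint is a plain dictionary of made-up numbers, the function name `warm_start_encoder_decoder` is hypothetical, and the freshly initialized `cross_attention` block is an assumption based on how BERT-to-BERT models are commonly built, since the abstract does not describe it.

```python
import copy
import random


def warm_start_encoder_decoder(pretrained_weights, cross_attention_shape, seed=0):
    """Build encoder and decoder parameter sets from one pre-trained checkpoint.

    Both sides start from copies of the same weights. The decoder also gets a
    randomly initialized cross-attention matrix (an assumption of this sketch),
    because an encoder-only checkpoint has no such parameters to copy.
    """
    rng = random.Random(seed)
    encoder = copy.deepcopy(pretrained_weights)
    decoder = copy.deepcopy(pretrained_weights)
    rows, cols = cross_attention_shape
    decoder["cross_attention"] = [
        [rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)
    ]
    return encoder, decoder


# Toy stand-in for a pre-trained checkpoint (values are illustrative only).
toy_checkpoint = {
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
    "self_attention": [[0.5, 0.6], [0.7, 0.8]],
}
encoder, decoder = warm_start_encoder_decoder(toy_checkpoint, (2, 2))
```

Only the cross-attention parameters start from scratch here, which is one way to read why such a model could be fine-tuned with little dialogue data.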

AraGPT2: Pre-Trained Transformer for Arabic Language Generation

Dec 31, 2020

Wissam Antoun, Fady Baly, Hazem Hajj

Abstract: Recently, pretrained transformer-based architectures have proven to be very efficient at language modeling and understanding, given that they are trained on a large enough corpus. Applications in language generation for Arabic are still lagging in comparison to other NLP advances, primarily due to the lack of advanced Arabic language generation models. In this paper, we develop the first advanced Arabic language generation model, AraGPT2, trained from scratch on large Arabic corpora of internet text and news articles. Our largest model, AraGPT2-mega, has 1.46 billion parameters, which makes it the largest Arabic language model available. We evaluate different size variants of AraGPT2 using the perplexity measure, where AraGPT2-mega achieves a perplexity of 29.8 on held-out articles from Wikipedia. Pretrained variants of AraGPT2 (base, medium, large, mega) are publicly available at https://github.com/aub-mind/arabert/aragpt2, in the hope of encouraging new research directions and applications for Arabic NLP.
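For readers unfamiliar with the perplexity measure mentioned above, here is a minimal sketch of the standard definition: the exponential of the average negative log-likelihood per token. The function name and the example probabilities are hypothetical, and the sketch says nothing about how the paper tokenized or normalized its held-out text.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical probabilities a model assigned to four held-out tokens.
example_log_probs = [math.log(p) for p in (0.25, 0.5, 0.125, 0.25)]
ppl = perplexity(example_log_probs)  # geometric-mean probability is 1/4, so ppl is about 4
```

A lower value means the model assigns higher probability, on average, to the held-out tokens.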

AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding

Dec 31, 2020

Wissam Antoun, Fady Baly, Hazem Hajj

Abstract: Advances in English language representation enabled a more sample-efficient pre-training task by Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA). Instead of training a model to recover masked tokens, ELECTRA trains a discriminator model to distinguish true input tokens from corrupted tokens that were replaced by a generator network. On the other hand, current Arabic language representation approaches rely only on pretraining via masked language modeling. In this paper, we develop an Arabic language representation model, which we name AraELECTRA. Our model is pretrained using the replaced token detection objective on large Arabic text corpora. We evaluate our model on two Arabic reading comprehension tasks, and we show that AraELECTRA outperforms current state-of-the-art Arabic language representation models given the same pretraining data and with even a smaller model size.
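The replaced token detection objective described above boils down to a per-token binary label for the discriminator. The sketch below shows only how those labels could be derived, assuming the generator's output is already given; the function name and the toy English tokens are hypothetical, and no generator or discriminator network is modeled.

```python
def replaced_token_labels(original_tokens, corrupted_tokens):
    """Label each position 1 if the token was replaced, 0 if it matches the original.

    A position where the generator happens to reproduce the original token
    counts as "original" (label 0), since the two sequences agree there.
    """
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


# Toy example: a generator swapped one token in the input.
original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]
labels = replaced_token_labels(original, corrupted)  # [0, 0, 1, 0, 0]
```

Because every position receives a label, the discriminator gets a training signal from all tokens rather than only from a masked subset.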

AraBERT: Transformer-based Model for Arabic Language Understanding

Mar 30, 2020

Wissam Antoun, Fady Baly, Hazem Hajj

Abstract: The Arabic language is a morphologically rich language with relatively few resources and a less explored syntax compared to English. Given these limitations, Arabic Natural Language Processing (NLP) tasks like Sentiment Analysis (SA), Named Entity Recognition (NER), and Question Answering (QA) have proven to be very challenging to tackle. Recently, with the surge of transformer-based models, language-specific BERT-based models have proven to be very efficient at language understanding, provided they are pre-trained on a very large corpus. Such models were able to set new standards and achieve state-of-the-art results for most NLP tasks. In this paper, we pre-trained BERT specifically for the Arabic language in the pursuit of achieving the same success that BERT did for the English language. The performance of AraBERT is compared to multilingual BERT from Google and other state-of-the-art approaches. The results showed that the newly developed AraBERT achieved state-of-the-art performance on most tested Arabic NLP tasks. The pretrained AraBERT models are publicly available at https://github.com/aub-mind/arabert, in the hope of encouraging research and applications for Arabic NLP.

* Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France (2020)

Neural Arabic Question Answering

Jun 12, 2019

Hussein Mozannar, Karl El Hajal, Elie Maamary, Hazem Hajj

Abstract: This paper tackles the problem of open-domain factual Arabic question answering (QA) using Wikipedia as our knowledge source. This constrains the answer to any question to be a span of text in Wikipedia. Open-domain QA for Arabic entails three challenges: annotated QA datasets in Arabic, large-scale efficient information retrieval, and machine reading comprehension. To deal with the lack of Arabic QA datasets, we present the Arabic Reading Comprehension Dataset (ARCD), composed of 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of the Stanford Question Answering Dataset (Arabic-SQuAD). Our system for open-domain question answering in Arabic (SOQAL) is based on two components: (1) a document retriever using a hierarchical TF-IDF approach and (2) a neural reading comprehension model using the pre-trained bi-directional transformer BERT. Our experiments on ARCD indicate the effectiveness of our approach, with our BERT-based reader achieving a 61.3 F1 score and our open-domain system SOQAL achieving a 27.6 F1 score.

* WANLP 2019
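To make the retriever component of the abstract above more concrete, here is a minimal sketch of flat TF-IDF document ranking with cosine similarity. It is not the hierarchical retriever the paper describes: the function names, the smoothed IDF formula, the whitespace tokenizer (unrealistic for Arabic morphology), and the toy English documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for this sketch)."""
    return text.lower().split()


def build_index(documents):
    """Return (idf, vectors): smoothed IDF per term and a TF-IDF dict per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, documents, top_k=1):
    """Return indices of the top_k documents most similar to the question."""
    idf, vectors = build_index(documents)
    q_counts = Counter(tokenize(question))
    # Terms unseen in the collection carry no weight and are dropped.
    q_vec = {t: c * idf[t] for t, c in q_counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:top_k]]


# Toy collection standing in for Wikipedia articles.
docs = [
    "beirut is the capital of lebanon",
    "cairo is the capital of egypt",
    "the nile flows through egypt and sudan",
]
best = retrieve("what is the capital of lebanon", docs)
```

In a two-stage system like the one described, the retrieved documents would then be passed to a reading comprehension model that extracts the answer span.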

ArSentD-LEV: A Multi-Topic Corpus for Target-based Sentiment Analysis in Arabic Levantine Tweets

May 25, 2019

Ramy Baly, Alaa Khaddaj, Hazem Hajj, Wassim El-Hajj, Khaled Bashir Shaban

Abstract: Sentiment analysis is a highly subjective and challenging task. Its complexity further increases when applied to the Arabic language, mainly because of the large variety of dialects that are unstandardized and widely used on the Web, especially in social media. While many datasets have been released to train sentiment classifiers in Arabic, most of these datasets contain shallow annotation, only marking the sentiment of the text unit, such as a word, a sentence, or a document. In this paper, we present the Arabic Sentiment Twitter Dataset for the Levantine dialect (ArSenTD-LEV). Based on findings from analyzing tweets from the Levant region, we created a dataset of 4,000 tweets with the following annotations: the overall sentiment of the tweet, the target to which the sentiment was expressed, how the sentiment was expressed, and the topic of the tweet. Results confirm the importance of these annotations in improving the performance of a baseline sentiment classifier. They also confirm the gap that arises when training in one domain and testing in another.

* Corpus development, Levantine tweets, multi-topic, sentiment analysis, sentiment target, LREC-2018, OSACT-2018
