Atakan Kara

GECTurk: Grammatical Error Correction and Detection Dataset for Turkish

Sep 20, 2023

Atakan Kara, Farrin Marouf Sofian, Andrew Bond, Gözde Gül Şahin

Abstract: Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common practice to overcome the scarcity of such data. However, it is not straightforward for morphologically rich languages like Turkish due to complex writing rules that require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a. writing rules) implemented through complex transformation functions. Using this pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic test set by manually annotating a set of movie reviews. We implement three baselines formulating the task as i) neural machine translation, ii) sequence tagging, and iii) prefix tuning with a pretrained decoder-only model, achieving strong results. Furthermore, we perform exhaustive experiments on out-of-domain datasets to gain insights on the transferability and robustness of the proposed approaches. Our results suggest that our corpus, GECTurk, is high-quality and allows knowledge transfer for the out-of-domain setting. To encourage further research on Turkish GEC, we release our datasets, baseline models, and the synthetic data generation pipeline at https://github.com/GGLAB-KU/gecturk.

* Accepted at Findings of IJCNLP-AACL 2023
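
To make the idea of rule-based synthetic data generation concrete, here is a minimal sketch of a single transformation function. It is not taken from the GECTurk pipeline: the function name, the regular expression, and the choice of rule (the Turkish conjunction "de"/"da" is written as a separate word, and the injected error attaches it to the preceding word) are illustrative assumptions. Real rules of this kind generally need morphological analysis, which this sketch omits.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: a word followed by the standalone conjunction "de" or "da".
# Assumption: any standalone "de"/"da" token is the conjunction, with no
# morphological disambiguation.
_CONJUNCTION = re.compile(r"(\w+) (de|da)\b")


def inject_joined_conjunction(sentence: str) -> Optional[Tuple[str, str]]:
    """Return an (erroneous, correct) sentence pair, or None if the rule does not apply."""
    # Attach the first matching conjunction to the word before it.
    corrupted, count = _CONJUNCTION.subn(r"\1\2", sentence, count=1)
    if count == 0:
        return None
    return corrupted, sentence


pair = inject_joined_conjunction("Ben de geldim.")
print(pair)
```

Starting from correct text and corrupting it yields parallel pairs whose target side is grammatical by construction, which is why professionally edited articles are a natural source.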

Extracting Relations Between Sectors

Aug 30, 2022

Atakan Kara, F. Serhan Daniş, Günce Keziban Orman, Sultan Nezihe Turhan

Abstract: The term "sector" in professional business life is a vague concept since companies tend to identify themselves as operating in multiple sectors simultaneously. This ambiguity poses problems in recommending jobs to job seekers or finding suitable candidates for open positions. The latter holds significant importance when available candidates in a specific sector are also scarce; hence, finding candidates from similar sectors becomes crucial. This work focuses on discovering possible sector similarities through relational analysis. We employ several algorithms from the frequent pattern mining and collaborative filtering domains, namely negFIN, Alternating Least Squares, Bilateral Variational Autoencoder, and Collaborative Filtering based on Pearson's Correlation, Kendall and Spearman's Rank Correlation coefficients. The algorithms are compared on a real-world dataset supplied by a major recruitment company, Kariyer.net, from Turkey. The insights and methods gained through this work are expected to increase the efficiency and accuracy of various methods, such as recommending jobs or finding suitable candidates for open positions.

* 13 pages and 3 figures
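
As a rough illustration of correlation-based sector similarity, the sketch below computes Pearson's correlation between sectors represented as binary vectors over companies. The data, the sector names, and the helper names `pearson` and `most_similar` are made up for this example. The paper's actual data representation and implementation are not described in the abstract, so this should be read as one plausible setup under those assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant vectors; treat as 0
    return cov / sqrt(var_x * var_y)


def most_similar(sector, matrix):
    """Rank all other sectors by correlation with the given sector, highest first."""
    target = matrix[sector]
    scores = {s: pearson(target, v) for s, v in matrix.items() if s != sector}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: entry i is 1 if company i lists itself under the sector.
sectors = {
    "software": [1, 1, 0, 0, 1],
    "telecom":  [1, 1, 0, 0, 0],
    "textile":  [0, 0, 1, 1, 0],
}

ranking = most_similar("software", sectors)
print(ranking)
```

Sectors that tend to be declared by the same companies receive a high correlation, so the top of the ranking suggests where to look for candidates when the target sector is scarce.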
