Xinying Qiu

Label Confidence Weighted Learning for Target-level Sentence Simplification

Oct 08, 2024

Xinying Qiu, Jingshen Zhang

Abstract:Multi-level sentence simplification generates simplified sentences with varying language proficiency levels. We propose Label Confidence Weighted Learning (LCWL), a novel approach that incorporates a label confidence weighting scheme in the training loss of the encoder-decoder model, setting it apart from existing confidence-weighting methods primarily designed for classification. Experiments on an English grade-level simplification dataset show that LCWL outperforms state-of-the-art unsupervised baselines. Fine-tuning the LCWL model on in-domain data and combining it with Symmetric Cross Entropy (SCE) consistently delivers better simplifications than strong supervised methods. Our results highlight the effectiveness of label confidence weighting techniques for text simplification tasks with encoder-decoder architectures.

* Accepted to EMNLP 2024
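
As a rough illustration of the two loss ideas named in the abstract above, the sketch below computes a Symmetric Cross Entropy term (cross entropy plus reverse cross entropy, with log 0 clipped to a constant) for a single prediction and scales it by a per-example label confidence. The abstract does not say how LCWL computes or applies its weights, so the weighting rule, the function names, and the values of `alpha`, `beta`, and `LOG_ZERO` are all assumptions made for illustration.

```python
import math

LOG_ZERO = -4.0  # assumed clipping value standing in for log(0) in the reverse term


def cross_entropy(pred, target_index):
    """Standard cross entropy for a one-hot target: -log p(target)."""
    return -math.log(pred[target_index])


def reverse_cross_entropy(pred, target_index):
    """Reverse cross entropy: -sum_k pred[k] * log(onehot[k]), with log(0) clipped."""
    # log(1) = 0 for the target class, so only non-target classes contribute.
    return -sum(p * LOG_ZERO for k, p in enumerate(pred) if k != target_index)


def symmetric_cross_entropy(pred, target_index, alpha=1.0, beta=1.0):
    """Weighted sum of the forward and reverse cross entropy terms."""
    forward = cross_entropy(pred, target_index)
    reverse = reverse_cross_entropy(pred, target_index)
    return alpha * forward + beta * reverse


def confidence_weighted_loss(preds, targets, confidences):
    """Mean of per-example losses, each scaled by an assumed label confidence in [0, 1]."""
    total = sum(
        c * symmetric_cross_entropy(p, t)
        for p, t, c in zip(preds, targets, confidences)
    )
    return total / len(preds)
```

With this weighting, an example whose label confidence is 0 contributes nothing to the loss, while a fully trusted example contributes its whole Symmetric Cross Entropy value.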

System Report for CCL24-Eval Task 7: Multi-Error Modeling and Fluency-Targeted Pre-training for Chinese Essay Evaluation

Jul 11, 2024

Jingshen Zhang, Xiangyu Yang, Xinkai Su, Xinglu Chen, Tianyou Huang, Xinying Qiu

Abstract:This system report presents our approaches and results for the Chinese Essay Fluency Evaluation (CEFE) task at CCL-2024. For Track 1, we optimized predictions for challenging fine-grained error types using binary classification models and trained coarse-grained models on the Chinese Learner 4W corpus. In Track 2, we enhanced performance by constructing a pseudo-dataset with multiple error types per sentence. For Track 3, where we achieved first place, we generated fluency-rated pseudo-data via back-translation for pre-training and used an NSP-based strategy with Symmetric Cross Entropy loss to capture context and mitigate long dependencies. Our methods effectively address key challenges in Chinese Essay Fluency Evaluation.

Cross-Lingual Word Alignment for ASEAN Languages with Contrastive Learning

Jul 06, 2024

Jingshen Zhang, Xinying Qiu, Teng Shen, Wenyu Wang, Kailin Zhang, Wenhe Feng

Abstract:Cross-lingual word alignment plays a crucial role in various natural language processing tasks, particularly for low-resource languages. A recent study proposes a BiLSTM-based encoder-decoder model that outperforms pre-trained language models in low-resource settings. However, that model only considers the similarity of word embedding spaces and does not explicitly model the differences between word embeddings. To address this limitation, we propose incorporating contrastive learning into the BiLSTM-based encoder-decoder framework. Our approach introduces a multi-view negative sampling strategy to learn the differences between word pairs in the shared cross-lingual embedding space. We evaluate our model on five bilingual aligned datasets spanning four ASEAN languages: Lao, Vietnamese, Thai, and Indonesian. Experimental results demonstrate that integrating contrastive learning consistently improves word alignment accuracy across all datasets, confirming the effectiveness of the proposed method in low-resource scenarios. We will release our dataset and code to support future research on word alignment for ASEAN and other low-resource languages.
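
To make the contrastive objective described above more concrete, here is a minimal sketch of a generic InfoNCE-style loss over one aligned word pair and a set of sampled negatives. It is not the authors' multi-view negative sampling strategy, which the abstract does not detail; the cosine similarity, the `temperature` value, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(source, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log softmax of the positive pair among all candidates."""
    logits = [cosine(source, positive) / temperature]
    logits += [cosine(source, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )
    return log_denominator - logits[0]


# Hypothetical 2-d embeddings: the source word, its translation, and one negative.
aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss is small when the source embedding is closer to its translation than to the negatives (`aligned`) and large in the opposite case (`misaligned`), which is what pushes non-matching word pairs apart in the shared space.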

Comparing Feature-based and Context-aware Approaches to PII Generalization Level Prediction

Jul 03, 2024

Kailin Zhang, Xinying Qiu

Abstract:Protecting Personally Identifiable Information (PII) in text data is crucial for privacy, but current PII generalization methods face challenges such as uneven data distributions and limited context awareness. To address these issues, we propose two approaches: a feature-based method using machine learning to improve performance on structured inputs, and a novel context-aware framework that considers the broader context and semantic relationships between the original text and generalized candidates. The context-aware approach employs Multilingual-BERT for text representation, functional transformations, and mean squared error scoring to evaluate candidates. Experiments on the WikiReplace dataset demonstrate the effectiveness of both methods, with the context-aware approach outperforming the feature-based one across different scales. This work contributes to advancing PII generalization techniques by highlighting the importance of feature selection, ensemble learning, and incorporating contextual information for better privacy protection in text anonymization.

* Accepted to IALP 2024

Readability-guided Idiom-aware Sentence Simplification (RISS) for Chinese

Jun 05, 2024

Jingshen Zhang, Xinglu Chen, Xinying Qiu, Zhimin Wang, Wenhe Feng

Abstract:Chinese sentence simplification faces challenges due to the lack of large-scale labeled parallel corpora and the prevalence of idioms. To address these challenges, we propose Readability-guided Idiom-aware Sentence Simplification (RISS), a novel framework that combines data augmentation techniques with lexical simplification. RISS introduces two key components: (1) Readability-guided Paraphrase Selection (RPS), a method for mining high-quality sentence pairs, and (2) Idiom-aware Simplification (IAS), a model that enhances the comprehension and simplification of idiomatic expressions. By integrating RPS and IAS using multi-stage and multi-task learning strategies, RISS outperforms previous state-of-the-art methods on two Chinese sentence simplification datasets. Furthermore, RISS achieves additional improvements when fine-tuned on a small labeled dataset. Our approach demonstrates the potential for more effective and accessible Chinese text simplification.

* Accepted to the 23rd China National Conference on Computational Linguistics (CCL 2024)

Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment

Mar 03, 2024

Jingshen Zhang, Jiajun Xie, Xinying Qiu

Abstract:Item difficulty plays a crucial role in adaptive testing. However, few works have focused on generating questions of varying difficulty levels, especially for multiple-choice (MC) cloze tests. We propose training pre-trained language models (PLMs) as surrogate models to enable item response theory (IRT) assessment, avoiding the need for human test subjects. We also propose two strategies to control the difficulty levels of both the gaps and the distractors using ranking rules to reduce invalid distractors. Experimentation on a benchmark dataset demonstrates that our proposed framework and methods can effectively control and evaluate the difficulty levels of MC cloze tests.
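
For readers unfamiliar with item response theory, the sketch below shows the one-parameter logistic (Rasch) model and a maximum-likelihood estimate of a single item's difficulty from a set of responses. The abstract does not state which IRT model the authors use or how the surrogate models' responses are collected, so the choice of the Rasch model, the assumption that abilities are known, and all names here are illustrative.

```python
import math


def p_correct(ability, difficulty):
    """Rasch (1PL) model: probability that a test taker answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def estimate_difficulty(abilities, responses, low=-10.0, high=10.0, steps=60):
    """Maximum-likelihood item difficulty for known abilities, found by bisection.

    The likelihood equation is sum(p_correct) == number of correct responses.
    The left side decreases as difficulty increases, so bisection applies.
    If every response is correct (or every one is wrong) there is no finite
    estimate and the result approaches the search boundary.
    """
    observed = sum(responses)
    for _ in range(steps):
        mid = (low + high) / 2.0
        expected = sum(p_correct(a, mid) for a in abilities)
        if expected > observed:
            low = mid  # item is predicted too easy, so raise the difficulty
        else:
            high = mid
    return (low + high) / 2.0


# Hypothetical responses (1 = correct) from four surrogate test takers of ability 0.
difficulty = estimate_difficulty([0.0, 0.0, 0.0, 0.0], [1, 0, 0, 0])
```

In this toy setting one of four equally able test takers answers correctly, so the estimated difficulty is the point at which the model predicts a 25% success rate for them.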

Tapping the Potential of Coherence and Syntactic Features in Neural Models for Automatic Essay Scoring

Nov 24, 2022

Xinying Qiu, Shuxuan Liao, Jiajun Xie, Jian-Yun Nie

Abstract:In the prompt-specific holistic score prediction task for Automatic Essay Scoring (AES), the general approaches include pre-trained neural models, coherence models, and hybrid models that incorporate syntactic features into neural models. In this paper, we propose a novel approach to extract and represent essay coherence features with prompt-learning NSP, which matches the state-of-the-art AES coherence model and achieves the best performance for long essays. We apply syntactic feature dense embedding to augment a BERT-based model and achieve the best performance among hybrid methods for AES. In addition, we explore various ways to combine coherence, syntactic information, and semantic embeddings, which no previous study has done. Our combined model also performs better than the available SOTA combined model, even though it does not outperform our syntax-enhanced neural model. We further offer analyses that can be useful for future studies.

* Accepted to "2022 International Conference on Asian Language Processing (IALP)"

InDEX: Indonesian Idiom and Expression Dataset for Cloze Test

Nov 24, 2022

Xinying Qiu, Guofeng Shi

Abstract:We propose InDEX, an Indonesian Idiom and Expression dataset for cloze tests. The dataset contains 10438 unique sentences for 289 idioms and expressions, for which we generate 15 different types of distractors, resulting in a large cloze-style corpus. Many baseline models for cloze test reading comprehension apply BERT with random initialization to learn embedding representations. However, idioms and fixed expressions differ in that the literal meaning of a phrase may or may not be consistent with its contextual meaning. Therefore, we explore different ways to combine static and contextual representations for a stronger baseline model. Experiments show that combining definition and random initialization better supports cloze test model performance for idioms, whether they are evaluated independently or mixed with fixed expressions. For fixed expressions with no special meaning, static embedding with random initialization is sufficient for a cloze test model.

* Accepted to "2022 International Conference on Asian Language Processing (IALP)"

Yunshan Cup 2020: Overview of the Part-of-Speech Tagging Task for Low-resourced Languages

Apr 06, 2022

Yingwen Fu, Jinyi Chen, Nankai Lin, Xixuan Huang, Xinying Qiu, Shengyi Jiang

Abstract:The Yunshan Cup 2020 track focused on creating a framework for evaluating different methods of part-of-speech (POS) tagging. There were two tasks for this track: (1) POS tagging for the Indonesian language, and (2) POS tagging for the Lao language. The Indonesian dataset comprises 10000 sentences from Indonesian news with 29 tags, and the Lao dataset consists of 8000 sentences with 27 tags. 25 teams registered for the task. The participants' methods ranged from feature-based approaches to neural networks, using either classical machine learning techniques or ensemble methods. The best-performing results achieve an accuracy of 95.82% for Indonesian and 93.03% for Lao, showing that neural sequence labeling models significantly outperform classic feature-based and rule-based methods.

Learning Syntactic Dense Embedding with Correlation Graph for Automatic Readability Assessment

Jul 09, 2021

Xinying Qiu, Yuan Chen, Hanwu Chen, Jian-Yun Nie, Yuming Shen, Dawei Lu

Abstract:Deep learning models for automatic readability assessment generally discard linguistic features traditionally used in machine learning models for the task. We propose to incorporate linguistic features into neural network models by learning syntactic dense embeddings based on linguistic features. To cope with the relationships between the features, we form a correlation graph among features and use it to learn their embeddings so that similar features will be represented by similar embeddings. Experiments with six datasets of two proficiency levels demonstrate that our proposed methodology can complement a BERT-only model to achieve significantly better performance for automatic readability assessment.

* Accepted to the 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021)
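
One simple way to picture the correlation graph mentioned in the abstract above is to connect two linguistic features whenever their values correlate strongly across documents. The sketch below does this with Pearson correlation and a fixed cutoff. The feature names, the values, the `threshold`, and the use of Pearson correlation are all assumptions; the abstract specifies neither how the graph is built nor how the embeddings are learned from it.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_graph(features, threshold=0.8):
    """Return weighted edges {(name_a, name_b): r} for pairs with |r| >= threshold."""
    names = sorted(features)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                edges[(a, b)] = r
    return edges


# Hypothetical feature values measured on four documents.
features = {
    "avg_sentence_length": [10.0, 14.0, 18.0, 22.0],
    "avg_tree_depth": [3.0, 4.0, 5.0, 6.0],
    "pronoun_ratio": [0.2, 0.1, 0.3, 0.1],
}
edges = correlation_graph(features)
```

Here only the two features that rise together across documents end up connected; a graph of this kind could then serve as the neighborhood structure for learning feature embeddings.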
