Abstract:The emergence of Large Language Models (LLMs) has fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. However, their pre-trained architectures often reveal limitations in specialized contexts, including restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. These challenges necessitate advanced post-training language models (PoLMs) to address these shortcomings, such as OpenAI-o1/o3 and DeepSeek-R1 (collectively known as Large Reasoning Models, or LRMs). This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Efficiency, which optimizes resource utilization amidst increasing complexity; and Integration and Adaptation, which extend capabilities across diverse modalities while addressing coherence issues. Charting progress from ChatGPT's foundational alignment strategies to DeepSeek-R1's innovative reasoning advancements, we illustrate how PoLMs leverage datasets to mitigate biases, deepen reasoning capabilities, and enhance domain adaptability. Our contributions include a pioneering synthesis of PoLM evolution, a structured taxonomy categorizing techniques and datasets, and a strategic agenda emphasizing the role of LRMs in improving reasoning proficiency and domain flexibility. As the first survey of its scope, this work consolidates recent PoLM advancements and establishes a rigorous intellectual framework for future research, fostering the development of LLMs that excel in precision, ethical robustness, and versatility across scientific and societal applications.
Abstract:Data augmentation (DA) is indispensable in modern machine learning and deep neural networks. The basic idea of DA is to construct new training data to improve the model's generalization by adding slightly disturbed versions of existing data or synthesizing new data. In this work, we review a small but essential subset of DA -- Mix-based Data Augmentation (MixDA) that generates novel samples by mixing multiple examples. Unlike conventional DA approaches based on a single-sample operation or requiring domain knowledge, MixDA is more general in creating a broad spectrum of new data and has received increasing attention in the community. We begin with proposing a new taxonomy classifying MixDA into, Mixup-based, Cutmix-based, and hybrid approaches according to a hierarchical view of the data mix. Various MixDA techniques are then comprehensively reviewed in a more fine-grained way. Owing to its generalization, MixDA has penetrated a variety of applications which are also completely reviewed in this work. We also examine why MixDA works from different aspects of improving model performance, generalization, and calibration while explaining the model behavior based on the properties of MixDA. Finally, we recapitulate the critical findings and fundamental challenges of current MixDA studies, and outline the potential directions for future works. Different from previous related works that summarize the DA approaches in a specific domain (e.g., images or natural language processing) or only review a part of MixDA studies, we are the first to provide a systematical survey of MixDA in terms of its taxonomy, methodology, applications, and explainability. This work can serve as a roadmap to MixDA techniques and application reviews while providing promising directions for researchers interested in this exciting area.