Abstract: For a widely studied data model and general loss and sample-hardening functions, we prove that the Supervised Contrastive Learning (SCL), Hard-SCL (HSCL), and Unsupervised Contrastive Learning (UCL) risks are minimized by representations that exhibit Neural Collapse (NC), i.e., the class means form an Equiangular Tight Frame (ETF) and data from the same class are mapped to the same representation. We also prove that for any representation mapping, the HSCL and Hard-UCL (HUCL) risks are lower bounded by the corresponding SCL and UCL risks. Although the optimality of ETF is known for SCL, albeit only for the InfoNCE loss, its optimality for HSCL and UCL under general loss and hardening functions is novel. Moreover, our proofs are considerably simpler, more compact, and more transparent. We empirically demonstrate, for the first time, that ADAM optimization of the HSCL and HUCL risks with random initialization and suitable hardness levels can indeed converge to the NC geometry if we incorporate unit-ball or unit-sphere feature normalization. Without hard negatives or feature normalization, however, the representations learned via ADAM suffer from dimensional collapse (DC) and fail to attain the NC geometry.
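To make the quantities in this abstract concrete, the sketch below implements one plausible instance of a hard supervised contrastive (HSCL-style) loss with unit-sphere feature normalization and an exponential hardening function. The paper treats general loss and hardening functions, so the specific choices here (the temperature, the `beta` hardness level, the exponential negative weighting) are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch, not the paper's exact objective: a hard supervised contrastive
# loss with unit-sphere feature normalization and exponential hardening.
import torch
import torch.nn.functional as F

def hard_scl_loss(features, labels, temperature=0.1, beta=1.0):
    """features: (N, d) raw embeddings; labels: (N,) class ids; beta: hardness level."""
    z = F.normalize(features, dim=1)                       # unit-sphere normalization
    sim = z @ z.t() / temperature                          # pairwise similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos_mask = same & ~eye                                 # same class, excluding self
    neg_mask = ~same                                       # different class

    # Hardening: weight negatives in proportion to exp(beta * similarity), so the
    # most similar (hardest) negatives dominate; beta = 0 recovers plain SCL.
    neg_w = torch.exp(beta * sim.detach()) * neg_mask
    neg_w = neg_w / neg_w.sum(dim=1, keepdim=True).clamp_min(1e-12)

    exp_sim = torch.exp(sim)
    neg_term = (neg_w * exp_sim).sum(dim=1, keepdim=True)  # weighted negative mass
    log_prob = sim - torch.log(exp_sim + neg_term)         # per positive-pair term
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp_min(1)
    return loss.mean()

# Example: a random batch of 8 features in 4 classes.
feats = torch.randn(8, 16, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
hard_scl_loss(feats, labels, beta=2.0).backward()
```

Increasing `beta` up-weights the most similar negatives; this is the hardness knob whose effect on reaching the NC geometry the abstract describes.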
Abstract: For many applications of classifiers to medical images, a trustworthy label for each image can be difficult or expensive to obtain. In contrast, images without labels are more readily available. Two major research directions both promise that additional unlabeled data can improve classifier performance: self-supervised learning pretrains useful representations on unlabeled data only, then fine-tunes a classifier on these representations via the labeled set; semi-supervised learning directly trains a classifier on labeled and unlabeled data simultaneously. Recent methods from both directions have claimed significant gains on non-medical tasks, but do not systematically assess medical images and mostly compare only to methods within the same direction. This study contributes a carefully designed benchmark to help answer a practitioner's key question: given a small labeled dataset and a limited budget of hours to spend on training, what gains from additional unlabeled images are possible, and which methods best achieve them? Unlike previous benchmarks, ours uses realistic-sized validation sets to select hyperparameters, assesses runtime-performance tradeoffs, and bridges the two research fields. By comparing 6 semi-supervised methods and 5 self-supervised methods to strong labeled-only baselines on 3 medical datasets with 30-1000 labels per class, we offer insights to resource-constrained, results-focused practitioners: MixMatch, SimCLR, and BYOL represent strong choices that were not surpassed by more recent methods. After much effort selecting hyperparameters on one dataset, we publish settings that enable strong methods to perform well on new medical tasks within a few hours, with further search over dozens of hours delivering modest additional gains.
Abstract: This paper considers the problem of measure estimation under the barycentric coding model (BCM), in which an unknown measure is assumed to belong to the set of Wasserstein-2 barycenters of a finite set of known measures. Estimating a measure under this model is equivalent to estimating the unknown barycentric coordinates. We provide novel geometric, statistical, and computational insights for measure estimation under the BCM, consisting of three main results. Our first main result leverages the Riemannian geometry of Wasserstein-2 space to provide a procedure for recovering the barycentric coordinates as the solution to a quadratic optimization problem, assuming access to the true reference measures. The essential geometric insight is that the parameters of this quadratic problem are determined by inner products between the optimal displacement maps from the given measure to the reference measures defining the BCM. Our second main result establishes an algorithm for solving for the coordinates in the BCM when all the measures are observed only through i.i.d. samples. We prove precise rates of convergence for this algorithm -- determined by the smoothness of the underlying measures and their dimensionality -- thereby guaranteeing its statistical consistency. Finally, we demonstrate the utility of the BCM and the associated estimation procedures in three application areas: (i) covariance estimation for Gaussian measures; (ii) image processing; and (iii) natural language processing.
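As a concrete illustration of the first result described above, the quadratic problem can be written in the following form; the notation is introduced here for exposition and need not match the paper's.

```latex
% Illustrative formulation (notation introduced here; it may differ from the
% paper's): \mu is the observed measure, \mu_1, \dots, \mu_k are the reference
% measures, T_i is the optimal transport map from \mu to \mu_i, and \Delta^k
% is the probability simplex.
\begin{align*}
  \hat{\lambda} \in \arg\min_{\lambda \in \Delta^{k}} \; \lambda^{\top} A \lambda,
  \qquad
  A_{ij} = \int \big\langle T_i(x) - x,\; T_j(x) - x \big\rangle \, d\mu(x),
\end{align*}
% so the matrix A collects exactly the inner products between optimal
% displacement maps mentioned in the abstract, and the barycentric coordinates
% are recovered by a simplex-constrained quadratic program.
```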
Abstract: We study the problem of designing hard negative sampling distributions for unsupervised contrastive representation learning. We analyze a novel min-max framework that seeks a representation minimizing the maximum (worst-case) generalized contrastive learning loss over all couplings (joint distributions between positive and negative samples subject to marginal constraints) and prove that the resulting min-max optimal representation is degenerate. This provides the first theoretical justification for incorporating additional regularization constraints on the couplings. We re-interpret the min-max problem through the lens of Optimal Transport theory and utilize regularized transport couplings to control the degree of hardness of the negative examples. We demonstrate that recently proposed state-of-the-art hard negative sampling distributions are a special case corresponding to entropic regularization of the coupling.
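A small, heavily hedged illustration of the reinterpretation described above: in a simplified single-anchor setting, maximizing the negatives' expected similarity to the anchor over couplings with a fixed negative marginal, penalized by an entropic regularizer with weight 1/beta, yields a Gibbs (exponentially tilted) reweighting of the base negative distribution, which is the form used by recent hard negative samplers. The function below is illustrative, not the paper's algorithm.

```python
# Hedged illustration: entropic regularization of the (relaxed, single-anchor)
# worst-case coupling reduces to exponential tilting of the base negatives.
import numpy as np

def hard_negative_weights(sim_to_anchor, base_probs, beta):
    """sim_to_anchor: (m,) similarities of candidate negatives to the anchor;
    base_probs: (m,) base negative-sampling probabilities; beta: hardness level."""
    logits = np.log(base_probs) + beta * sim_to_anchor
    logits -= logits.max()                      # numerical stability
    w = np.exp(logits)
    return w / w.sum()                          # tilted negative distribution

# beta -> 0 recovers the base distribution; large beta concentrates mass on
# the most similar (hardest) negatives.
sims = np.array([0.9, 0.2, -0.1, 0.7])
base = np.full(4, 0.25)
print(hard_negative_weights(sims, base, beta=5.0))
```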
Abstract: This paper shows that a popular approach to the supervised embedding of documents for classification, namely, contrastive Word Mover's Embedding, can be significantly enhanced by adding interpretability. This interpretability is achieved by incorporating a clustering-promoting mechanism into the contrastive loss. On several public datasets, we show that our method improves significantly upon existing baselines while providing interpretation of the clusters by identifying a set of keywords that are most representative of a particular class. Our approach was motivated in part by the need to develop Natural Language Processing (NLP) methods for the \textit{novel problem of assessing student work for scientific writing and thinking} - a problem that is central to the area of (educational) Learning Sciences (LS). In this context, we show that our approach leads to a meaningful assessment of student work related to lab reports from a biology class and can help LS researchers gain insights into student understanding and assess evidence of scientific thought processes.
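The abstract does not spell out the clustering-promoting mechanism, so the following is only a hedged sketch of the general idea: a supervised contrastive objective augmented with a term that pulls each document embedding toward a learnable per-class prototype, which can later be mapped to its nearest keywords for interpretation. All names and the specific regularizer are illustrative assumptions.

```python
# Hedged sketch of a clustering-promoting contrastive objective; the mechanism
# in the paper may differ. All names here are illustrative.
import torch
import torch.nn.functional as F

def interpretable_contrastive_loss(doc_emb, labels, prototypes,
                                   temperature=0.1, gamma=0.5):
    """doc_emb: (N, d) document embeddings; labels: (N,) class ids;
    prototypes: (C, d) learnable class prototypes; gamma: clustering weight."""
    z = F.normalize(doc_emb, dim=1)
    p = F.normalize(prototypes, dim=1)

    # Contrastive term: score each document against all class prototypes.
    logits = z @ p.t() / temperature
    contrastive = F.cross_entropy(logits, labels)

    # Clustering-promoting term: pull each embedding toward its own prototype.
    clustering = (1.0 - (z * p[labels]).sum(dim=1)).mean()

    return contrastive + gamma * clustering
```

Prototypes trained this way can be interpreted by listing the keywords whose embeddings lie closest to each prototype, in the spirit of the per-class keyword sets the abstract mentions.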
Abstract: Qualitative analysis of verbal data is of central importance in the learning sciences. It is labor-intensive and time-consuming, however, which limits the amount of data researchers can include in studies. This work is a step towards building a statistical machine learning (ML) method to provide automated support for qualitative analyses of students' writing, here specifically the scoring of laboratory reports from introductory biology for sophistication of argumentation and reasoning. We start with a set of lab reports from an undergraduate biology course, scored by a four-level scheme that considers the complexity of argument structure, the scope of evidence, and the care and nuance of conclusions. Using this set of labeled data, we show that a popular natural language processing (NLP) pipeline, namely vector representations of words (word embeddings) followed by a Long Short-Term Memory (LSTM) model that captures language generation as a state-space model, is able to quantitatively capture the scoring, with a high Quadratic Weighted Kappa (QWK) prediction score, when trained via a novel contrastive learning setup. We show that the ML algorithm approached the inter-rater reliability of human analysis. Ultimately, we conclude that ML for NLP holds promise for assisting learning sciences researchers in conducting qualitative studies at much larger scales than is currently possible.
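For concreteness, here is a minimal sketch of the kind of pipeline the abstract describes; the layer sizes, class name, and training details are illustrative (the paper's contrastive training setup is not reproduced here), while the QWK computation uses scikit-learn's quadratically weighted Cohen's kappa.

```python
# Minimal sketch (illustrative, not the paper's implementation): word embeddings
# feed an LSTM whose final state is mapped to one of four score levels, and
# agreement with human raters is measured by Quadratic Weighted Kappa (QWK).
import torch
import torch.nn as nn
from sklearn.metrics import cohen_kappa_score

class ReportScorer(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=128, num_levels=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # init from pretrained word embeddings in practice
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_levels)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.emb(token_ids)
        _, (h, _) = self.lstm(x)                       # final hidden state summarizes the report
        return self.head(h[-1])                        # logits over the four score levels

# QWK between human and predicted score levels (scikit-learn's quadratic-weighted kappa).
human = [0, 1, 3, 2, 1]
predicted = [0, 1, 2, 2, 1]
print(f"QWK = {cohen_kappa_score(human, predicted, weights='quadratic'):.3f}")
```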