Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rui Xin

Spurious Rewards: Rethinking Training Signals in RLVR

Jun 12, 2025

Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna(+4 more)

Abstract:We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR improves MATH-500 performance for Qwen2.5-Math-7B in absolute points by 21.4% (random reward), 13.8% (format reward), 24.1% (incorrect label), 26.0% (1-shot RL), and 27.1% (majority voting) -- nearly matching the 29.1% gained with ground truth rewards. However, the spurious rewards that work for Qwen often fail to yield gains with other model families like Llama3 or OLMo2. In particular, we find code reasoning -- thinking in code without actual code execution -- to be a distinctive Qwen2.5-Math behavior that becomes significantly more frequent after RLVR, from 65% to over 90%, even with spurious rewards. Overall, we hypothesize that, given the lack of useful reward signal, RLVR must somehow be surfacing useful reasoning representations learned during pretraining, although the exact mechanism remains a topic for future work. We suggest that future RLVR research should possibly be validated on diverse models rather than a single de facto choice, as we show that it is easy to get significant performance gains on Qwen models even with completely spurious reward signals.

Via

Access Paper or Ask Questions

A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage

Apr 28, 2025

Rui Xin, Niloofar Mireshghallah, Shuyue Stella Li, Michael Duan, Hyunwoo Kim, Yejin Choi, Yulia Tsvetkov, Sewoong Oh, Pang Wei Koh

Abstract:Sanitizing sensitive text data typically involves removing personally identifiable information (PII) or generating synthetic data under the assumption that these methods adequately protect privacy; however, their effectiveness is often only assessed by measuring the leakage of explicit identifiers but ignoring nuanced textual markers that can lead to re-identification. We challenge the above illusion of privacy by proposing a new framework that evaluates re-identification attacks to quantify individual privacy risks upon data release. Our approach shows that seemingly innocuous auxiliary information -- such as routine social activities -- can be used to infer sensitive attributes like age or substance use history from sanitized data. For instance, we demonstrate that Azure's commercial PII removal tool fails to protect 74\% of information in the MedQA dataset. Although differential privacy mitigates these risks to some extent, it significantly reduces the utility of the sanitized text for downstream tasks. Our findings indicate that current sanitization techniques offer a \textit{false sense of privacy}, highlighting the need for more robust methods that protect against semantic-level information leakage.

Via

Access Paper or Ask Questions

DataComp-LM: In search of the next generation of training sets for language models

Jun 18, 2024

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora(+49 more)

Figure 1 for DataComp-LM: In search of the next generation of training sets for language models

Figure 2 for DataComp-LM: In search of the next generation of training sets for language models

Figure 3 for DataComp-LM: In search of the next generation of training sets for language models

Figure 4 for DataComp-LM: In search of the next generation of training sets for language models

Abstract:We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.

* Project page: https://www.datacomp.ai/dclm/

Via

Access Paper or Ask Questions

Language models scale reliably with over-training and on downstream tasks

Mar 13, 2024

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh(+13 more)

Figure 1 for Language models scale reliably with over-training and on downstream tasks

Figure 2 for Language models scale reliably with over-training and on downstream tasks

Figure 3 for Language models scale reliably with over-training and on downstream tasks

Figure 4 for Language models scale reliably with over-training and on downstream tasks

Abstract:Scaling laws are useful guides for developing language models, but there are still gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., "Chinchilla optimal" regime); however, in practice, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but ultimately models are compared based on downstream task performance. In this paper, we address both shortcomings. To do so, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we investigate scaling in the over-trained regime. We fit scaling laws that extrapolate in both the number of model parameters and the ratio of training tokens to parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32$\times$ over-trained) and a 6.9B parameter, 138B token run$\unicode{x2014}$each from experiments that take 300$\times$ less compute. Second, we relate the perplexity of a language model to its downstream task performance via a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models using experiments that take 20$\times$ less compute. Our experiments are available at https://github.com/mlfoundations/scaling.

Via

Access Paper or Ask Questions

Optimal Sparse Survival Trees

Jan 27, 2024

Rui Zhang, Rui Xin, Margo Seltzer, Cynthia Rudin

Figure 1 for Optimal Sparse Survival Trees

Figure 2 for Optimal Sparse Survival Trees

Figure 3 for Optimal Sparse Survival Trees

Figure 4 for Optimal Sparse Survival Trees

Abstract:Interpretability is crucial for doctors, hospitals, pharmaceutical companies and biotechnology corporations to analyze and make decisions for high stakes problems that involve human health. Tree-based methods have been widely adopted for \textit{survival analysis} due to their appealing interpretablility and their ability to capture complex relationships. However, most existing methods to produce survival trees rely on heuristic (or greedy) algorithms, which risk producing sub-optimal models. We present a dynamic-programming-with-bounds approach that finds provably-optimal sparse survival tree models, frequently in only a few seconds.

* AISTATS2024 preprint. arXiv admin note: text overlap with arXiv:2211.14980

Via

Access Paper or Ask Questions

Inferior Alveolar Nerve Segmentation in CBCT images using Connectivity-Based Selective Re-training

Aug 18, 2023

Yusheng Liu, Rui Xin, Tao Yang, Lisheng Wang

Figure 1 for Inferior Alveolar Nerve Segmentation in CBCT images using Connectivity-Based Selective Re-training

Figure 2 for Inferior Alveolar Nerve Segmentation in CBCT images using Connectivity-Based Selective Re-training

Figure 3 for Inferior Alveolar Nerve Segmentation in CBCT images using Connectivity-Based Selective Re-training

Abstract:Inferior Alveolar Nerve (IAN) canal detection in CBCT is an important step in many dental and maxillofacial surgery applications to prevent irreversible damage to the nerve during the procedure.The ToothFairy2023 Challenge aims to establish a 3D maxillofacial dataset consisting of all sparse labels and partial dense labels, and improve the ability of automatic IAN segmentation. In this work, in order to avoid the negative impact brought by sparse labeling, we transform the mixed supervised problem into a semi-supervised problem. Inspired by self-training via pseudo labeling, we propose a selective re-training framework based on IAN connectivity. Our method is quantitatively evaluated on the ToothFairy verification cases, achieving the dice similarity coefficient (DSC) of 0.7956, and 95\% hausdorff distance (HD95) of 4.4905, and wining the champion in the competition. Code is available at https://github.com/GaryNico517/SSL-IAN-Retraining.

* technical paper for Miccai ToothFairy2023 Challenge

Via

Access Paper or Ask Questions

Optimal Sparse Regression Trees

Dec 02, 2022

Rui Zhang, Rui Xin, Margo Seltzer, Cynthia Rudin

Figure 1 for Optimal Sparse Regression Trees

Figure 2 for Optimal Sparse Regression Trees

Figure 3 for Optimal Sparse Regression Trees

Figure 4 for Optimal Sparse Regression Trees

Abstract:Regression trees are one of the oldest forms of AI models, and their predictions can be made without a calculator, which makes them broadly useful, particularly for high-stakes applications. Within the large literature on regression trees, there has been little effort towards full provable optimization, mainly due to the computational hardness of the problem. This work proposes a dynamic-programming-with-bounds approach to the construction of provably-optimal sparse regression trees. We leverage a novel lower bound based on an optimal solution to the k-Means clustering algorithm in 1-dimension over the set of labels. We are often able to find optimal sparse trees in seconds, even for challenging datasets that involve large numbers of samples and highly-correlated features.

* AAAI 2023, camera ready version

Via

Access Paper or Ask Questions

TimberTrek: Exploring and Curating Sparse Decision Trees with Interactive Visualization

Sep 19, 2022

Zijie J. Wang, Chudi Zhong, Rui Xin, Takuya Takagi, Zhi Chen, Duen Horng Chau, Cynthia Rudin, Margo Seltzer

Figure 1 for TimberTrek: Exploring and Curating Sparse Decision Trees with Interactive Visualization

Figure 2 for TimberTrek: Exploring and Curating Sparse Decision Trees with Interactive Visualization

Figure 3 for TimberTrek: Exploring and Curating Sparse Decision Trees with Interactive Visualization

Figure 4 for TimberTrek: Exploring and Curating Sparse Decision Trees with Interactive Visualization

Abstract:Given thousands of equally accurate machine learning (ML) models, how can users choose among them? A recent ML technique enables domain experts and data scientists to generate a complete Rashomon set for sparse decision trees--a huge set of almost-optimal interpretable ML models. To help ML practitioners identify models with desirable properties from this Rashomon set, we develop TimberTrek, the first interactive visualization system that summarizes thousands of sparse decision trees at scale. Two usage scenarios highlight how TimberTrek can empower users to easily explore, compare, and curate models that align with their domain knowledge and values. Our open-source tool runs directly in users' computational notebooks and web browsers, lowering the barrier to creating more responsible ML models. TimberTrek is available at the following public demo link: https://poloclub.github.io/timbertrek.

* Accepted at IEEE VIS 2022. 5 pages, 6 figures. For a demo video, see https://youtu.be/3eGqTmsStJM. For a live demo, visit https://poloclub.github.io/timbertrek

Via

Access Paper or Ask Questions

Exploring the Whole Rashomon Set of Sparse Decision Trees

Sep 16, 2022

Rui Xin, Chudi Zhong, Zhi Chen, Takuya Takagi, Margo Seltzer, Cynthia Rudin

Figure 1 for Exploring the Whole Rashomon Set of Sparse Decision Trees

Figure 2 for Exploring the Whole Rashomon Set of Sparse Decision Trees

Figure 3 for Exploring the Whole Rashomon Set of Sparse Decision Trees

Abstract:In any given machine learning problem, there may be many models that could explain the data almost equally well. However, most learning algorithms return only one of these models, leaving practitioners with no practical way to explore alternative models that might have desirable properties beyond what could be expressed within a loss function. The Rashomon set is the set of these all almost-optimal models. Rashomon sets can be extremely complicated, particularly for highly nonlinear function classes that allow complex interaction terms, such as decision trees. We provide the first technique for completely enumerating the Rashomon set for sparse decision trees; in fact, our work provides the first complete enumeration of any Rashomon set for a non-trivial problem with a highly nonlinear discrete function class. This allows the user an unprecedented level of control over model choice among all models that are approximately equally good. We represent the Rashomon set in a specialized data structure that supports efficient querying and sampling. We show three applications of the Rashomon set: 1) it can be used to study variable importance for the set of almost-optimal trees (as opposed to a single tree), 2) the Rashomon set for accuracy enables enumeration of the Rashomon sets for balanced accuracy and F1-score, and 3) the Rashomon set for a full dataset can be used to produce Rashomon sets constructed with only subsets of the data set. Thus, we are able to examine Rashomon sets across problems with a new lens, enabling users to choose models rather than be at the mercy of an algorithm that produces only a single model.

* Accepted to NeurIPS 2022

Via

Access Paper or Ask Questions

Materials Transformers Language Models for Generative Materials Design: a benchmark study

Jun 27, 2022

Nihang Fu, Lai Wei, Yuqi Song, Qinyang Li, Rui Xin, Sadman Sadeed Omee, Rongzhi Dong, Edirisuriya M. Dilanga Siriwardane, Jianjun Hu

Figure 1 for Materials Transformers Language Models for Generative Materials Design: a benchmark study

Figure 2 for Materials Transformers Language Models for Generative Materials Design: a benchmark study

Figure 3 for Materials Transformers Language Models for Generative Materials Design: a benchmark study

Figure 4 for Materials Transformers Language Models for Generative Materials Design: a benchmark study

Abstract:Pre-trained transformer language models on large unlabeled corpus have produced state-of-the-art results in natural language processing, organic molecule design, and protein sequence generation. However, no such models have been applied to learn the composition patterns of inorganic materials. Here we train a series of seven modern transformer language models (GPT, GPT-2, GPT-Neo, GPT-J, BLMM, BART, and RoBERTa) using the expanded formulas from material deposited in the ICSD, OQMD, and Materials Projects databases. Six different datasets with/out non-charge-neutral or balanced electronegativity samples are used to benchmark the performances and uncover the generation biases of modern transformer models for the generative design of materials compositions. Our extensive experiments showed that the causal language models based materials transformers can generate chemically valid materials compositions with as high as 97.54\% to be charge neutral and 91.40\% to be electronegativity balanced, which has more than 6 times higher enrichment compared to a baseline pseudo-random sampling algorithm. These models also demonstrate high novelty and their potential in new materials discovery has been proved by their capability to recover the leave-out materials. We also find that the properties of the generated samples can be tailored by training the models with selected training sets such as high-bandgap materials. Our experiments also showed that different models each have their own preference in terms of the properties of the generated samples and their running time complexity varies a lot. We have applied our materials transformer models to discover a set of new materials as validated using DFT calculations.

* 18 pages

Via

Access Paper or Ask Questions