Tianze Shi

Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models

Feb 21, 2025

Lior Belenki, Alekh Agarwal, Tianze Shi, Kristina Toutanova

Abstract:We propose a method to optimize language model pre-training data mixtures through efficient approximation of the cross-entropy loss corresponding to each candidate mixture via a Mixture of Data Experts (MDE). We use this approximation as a source of additional features in a regression model, trained from observations of model loss for a small number of mixtures. Experiments with Transformer decoder-only language models in the range of 70M to 1B parameters on the SlimPajama dataset show that our method achieves significantly better performance than approaches that train regression models using only the mixture rates as input features. Combining this improved optimization method with an objective that takes into account cross-entropy on end task data leads to superior performance on few-shot downstream evaluations. We also provide theoretical insights on why aggregation of data expert predictions can provide good approximations to model losses for data mixtures.

Weakly Supervised Headline Dependency Parsing

Jan 25, 2023

Adrian Benton, Tianze Shi, Ozan İrsoy, Igor Malioutov

Abstract:English news headlines form a register with unique syntactic properties that have been documented in linguistics literature since the 1930s. However, headlines have received surprisingly little attention from the NLP syntactic parsing community. We aim to bridge this gap by providing the first news headline corpus of Universal Dependencies annotated syntactic dependency trees, which enables us to evaluate existing state-of-the-art dependency parsers on news headlines. To improve English news headline parsing accuracies, we develop a projection method to bootstrap silver training data from unlabeled news headline-article lead sentence pairs. Models trained on silver headline parses demonstrate significant improvements in performance over models trained solely on gold-annotated long-form texts. Ultimately, we find that, although projected silver training data improves parser performance across different news outlets, the improvement is moderated by constructions idiosyncratic to outlet.

* In Proceedings of Findings of EMNLP 2022

Evaluating the Impact of Model Scale for Compositional Generalization in Semantic Parsing

May 24, 2022

Linlu Qiu, Peter Shaw, Panupong Pasupat, Tianze Shi, Jonathan Herzig, Emily Pitler, Fei Sha, Kristina Toutanova

Abstract:Despite their strong performance on many tasks, pre-trained language models have been shown to struggle on out-of-distribution compositional generalization. Meanwhile, recent work has shown considerable improvements on many NLP tasks from model scaling. Can scaling up model size also improve compositional generalization in semantic parsing? We evaluate encoder-decoder models up to 11B parameters and decoder-only models up to 540B parameters, and compare model scaling curves for three different methods for transfer learning: fine-tuning all parameters, prompt tuning, and in-context learning. We observe that fine-tuning generally has flat or negative scaling curves on out-of-distribution compositional generalization in semantic parsing evaluations. In-context learning has positive scaling curves, but is generally outperformed by much smaller fine-tuned models. Prompt-tuning can outperform fine-tuning, suggesting further potential improvements from scaling as it exhibits a more positive scaling curve. Additionally, we identify several error trends that vary with model scale. For example, larger models are generally better at modeling the syntax of the output space, but are also more prone to certain types of overfitting. Overall, our study highlights limitations of current techniques for effectively leveraging model scale for compositional generalization, while our analysis also suggests promising directions for future work.

TGIF: Tree-Graph Integrated-Format Parser for Enhanced UD with Two-Stage Generic- to Individual-Language Finetuning

Jul 14, 2021

Tianze Shi, Lillian Lee

Abstract:We present our contribution to the IWPT 2021 shared task on parsing into enhanced Universal Dependencies. Our main system component is a hybrid tree-graph parser that integrates (a) predictions of spanning trees for the enhanced graphs with (b) additional graph edges not present in the spanning trees. We also adopt a finetuning strategy where we first train a language-generic parser on the concatenation of data from all available languages, and then, in a second step, finetune on each individual language separately. Additionally, we develop our own complete set of pre-processing modules relevant to the shared task, including tokenization, sentence segmentation, and multiword token expansion, based on pre-trained XLM-R models and our own pre-training of character-level language models. Our submission reaches a macro-average ELAS of 89.24 on the test set. It ranks top among all teams, with a margin of more than 2 absolute ELAS over the next best-performing submission, and best score on 16 out of 17 languages.

* Proceedings of IWPT 2021
* IWPT 2021 Shared Task

Transition-based Bubble Parsing: Improvements on Coordination Structure Prediction

Jul 14, 2021

Tianze Shi, Lillian Lee

Abstract:We propose a transition-based bubble parser to perform coordination structure identification and dependency-based syntactic analysis simultaneously. Bubble representations were proposed in the formal linguistics literature decades ago; they enhance dependency trees by encoding coordination boundaries and internal relationships within coordination structures explicitly. In this paper, we introduce a transition system and neural models for parsing these bubble-enhanced structures. Experimental results on the English Penn Treebank and the English GENIA corpus show that our parsers beat previous state-of-the-art approaches on the task of coordination structure prediction, especially for the subset of sentences with complex coordination structures.

* In Proceedings of ACL 2021

Diversity-Aware Batch Active Learning for Dependency Parsing

Apr 28, 2021

Tianze Shi, Adrian Benton, Igor Malioutov, Ozan İrsoy

Abstract:While the predictive performance of modern statistical dependency parsers relies heavily on the availability of expensive expert-annotated treebank data, not all annotations contribute equally to the training of the parsers. In this paper, we attempt to reduce the number of labeled examples needed to train a strong dependency parser using batch active learning (AL). In particular, we investigate whether enforcing diversity in the sampled batches, using determinantal point processes (DPPs), can improve over their diversity-agnostic counterparts. Simulation experiments on an English newswire corpus show that selecting diverse batches with DPPs is superior to strong selection strategies that do not enforce batch diversity, especially during the initial stages of the learning process. Additionally, our diversity-aware strategy is robust under a corpus duplication setting, where diversity-agnostic sampling strategies exhibit significant degradation.

* In Proceedings of NAACL 2021

Learning Syntax from Naturally-Occurring Bracketings

Apr 28, 2021

Tianze Shi, Ozan İrsoy, Igor Malioutov, Lillian Lee

Abstract:Naturally-occurring bracketings, such as answer fragments to natural language questions and hyperlinks on webpages, can reflect human syntactic intuition regarding phrasal boundaries. Their availability and approximate correspondence to syntax make them appealing as distant information sources to incorporate into unsupervised constituency parsing. But they are noisy and incomplete; to address this challenge, we develop a partial-brackets-aware structured ramp loss in learning. Experiments demonstrate that our distantly-supervised models trained on naturally-occurring bracketing data are more accurate in inducing syntactic structures than competing unsupervised systems. On the English WSJ corpus, our models achieve an unlabeled F1 score of 68.9 for constituency parsing.

* In Proceedings of NAACL 2021

On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries

Oct 21, 2020

Tianze Shi, Chen Zhao, Jordan Boyd-Graber, Hal Daumé III, Lillian Lee

Abstract:Large-scale semantic parsing datasets annotated with logical forms have enabled major advances in supervised approaches. But can richer supervision help even more? To explore the utility of fine-grained, lexical-level supervision, we introduce Squall, a dataset that enriches 11,276 WikiTableQuestions English-language questions with manually created SQL equivalents plus alignments between SQL and question fragments. Our annotation enables new training possibilities for encoder-decoder models, including approaches from machine translation previously precluded by the absence of alignments. We propose and test two methods: (1) supervised attention; (2) adopting an auxiliary objective of disambiguating references in the input queries to table columns. In 5-fold cross validation, these strategies improve over strong baselines by 4.4% execution accuracy. Oracle experiments suggest that annotated alignments can support further accuracy gains of up to 23.9%.

* Findings of ACL: EMNLP 2020

Semantic Role Labeling as Syntactic Dependency Parsing

Oct 21, 2020

Tianze Shi, Igor Malioutov, Ozan İrsoy

Abstract:We reduce the task of (span-based) PropBank-style semantic role labeling (SRL) to syntactic dependency parsing. Our approach is motivated by our empirical analysis that shows three common syntactic patterns account for over 98% of the SRL annotations for both English and Chinese data. Based on this observation, we present a conversion scheme that packs SRL annotations into dependency tree representations through joint labels that permit highly accurate recovery back to the original format. This representation allows us to train statistical dependency parsers to tackle SRL and achieve competitive performance with the current state of the art. Our findings show the promise of syntactic dependency trees in encoding semantic role relations within their syntactic domain of locality, and point to potential further integration of syntactic methods into semantic role labeling in the future.

* Appeared in EMNLP 2020

Extracting Headless MWEs from Dependency Parse Trees: Parsing, Tagging, and Joint Modeling Approaches

May 06, 2020

Tianze Shi, Lillian Lee

Abstract:An interesting and frequent type of multi-word expression (MWE) is the headless MWE, for which there are no true internal syntactic dominance relations; examples include many named entities ("Wells Fargo") and dates ("July 5, 2020") as well as certain productive constructions ("blow for blow", "day after day"). Despite their special status and prevalence, current dependency-annotation schemes require treating such flat structures as if they had internal syntactic heads, and most current parsers handle them in the same fashion as headed constructions. Meanwhile, outside the context of parsing, taggers are typically used for identifying MWEs, but taggers might benefit from structural information. We empirically compare these two common strategies--parsing and tagging--for predicting flat MWEs. Additionally, we propose an efficient joint decoding algorithm that combines scores from both strategies. Experimental results on the MWE-Aware English Dependency Corpus and on six non-English dependency treebanks with frequent flat structures show that: (1) tagging is more accurate than parsing for identifying flat-structure MWEs, (2) our joint decoder reconciles the two different views and, for non-BERT features, leads to higher accuracies, and (3) most of the gains result from feature sharing between the parsers and taggers.

* Proceedings of ACL, 2020
