Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Junmo Kang

Seeing the Forest for the Trees: A Large Scale, Continuously Updating Meta-Analysis of Frontier LLMs

Feb 26, 2025

Jungsoo Park, Junmo Kang, Gabriel Stanovsky, Alan Ritter

Abstract:The surge of LLM studies makes synthesizing their findings challenging. Meta-analysis can uncover important trends across studies, but its use is limited by the time-consuming nature of manual data extraction. Our study presents a semi-automated approach for meta-analysis that accelerates data extraction using LLMs. It automatically identifies relevant arXiv papers, extracts experimental results and related attributes, and organizes them into a structured dataset. We conduct a comprehensive meta-analysis of frontier LLMs using an automatically extracted dataset, reducing the effort of paper surveying and data extraction by more than 93\% compared to manual approaches. We validate our dataset by showing that it reproduces key findings from a recent manual meta-analysis about Chain-of-Thought (CoT), and also uncovers new insights that go beyond it, showing for example that in-context examples benefit multimodal tasks but offer limited gains in mathematical tasks compared to CoT. Our automatically updatable dataset enables continuous tracking of target models by extracting evaluation studies as new data becomes available. Through our scientific artifacts and empirical analysis, we provide novel insights into LLMs while facilitating ongoing meta-analyses of their behavior.

* 21 pages, 9 figures

Via

Access Paper or Ask Questions

Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts

Jun 17, 2024

Junmo Kang, Leonid Karlinsky, Hongyin Luo, Zhen Wang, Jacob Hansen, James Glass, David Cox, Rameswar Panda, Rogerio Feris, Alan Ritter

Abstract:We present Self-MoE, an approach that transforms a monolithic LLM into a compositional, modular system of self-specialized experts, named MiXSE (MiXture of Self-specialized Experts). Our approach leverages self-specialization, which constructs expert modules using self-generated synthetic data, each equipped with a shared base LLM and incorporating self-optimized routing. This allows for dynamic and capability-specific handling of various target tasks, enhancing overall capabilities, without extensive human-labeled data and added parameters. Our empirical results reveal that specializing LLMs may exhibit potential trade-offs in performances on non-specialized tasks. On the other hand, our Self-MoE demonstrates substantial improvements over the base LLM across diverse benchmarks such as knowledge, reasoning, math, and coding. It also consistently outperforms other methods, including instance merging and weight merging, while offering better flexibility and interpretability by design with semantic experts and routing. Our findings highlight the critical role of modularity and the potential of self-improvement in achieving efficient, scalable, and adaptable systems.

Via

Access Paper or Ask Questions

Self-Specialization: Uncovering Latent Expertise within Large Language Models

Sep 29, 2023

Junmo Kang, Hongyin Luo, Yada Zhu, James Glass, David Cox, Alan Ritter, Rogerio Feris, Leonid Karlinsky

Figure 1 for Self-Specialization: Uncovering Latent Expertise within Large Language Models

Figure 2 for Self-Specialization: Uncovering Latent Expertise within Large Language Models

Figure 3 for Self-Specialization: Uncovering Latent Expertise within Large Language Models

Figure 4 for Self-Specialization: Uncovering Latent Expertise within Large Language Models

Abstract:Recent works have demonstrated the effectiveness of self-alignment in which a large language model is, by itself, aligned to follow general instructions through the automatic generation of instructional data using a handful of human-written seeds. Instead of general alignment, in this work, we focus on self-alignment for expert domain specialization (e.g., biomedicine), discovering it to be very effective for improving zero-shot and few-shot performance in target domains of interest. As a preliminary, we first present the benchmark results of existing aligned models within a specialized domain, which reveals the marginal effect that "generic" instruction-following training has on downstream expert domains' performance. To remedy this, we explore self-specialization that leverages domain-specific unlabelled data and a few labeled seeds for the self-alignment process. When augmented with retrieval to reduce hallucination and enhance concurrency of the alignment, self-specialization offers an effective (and efficient) way of "carving out" an expert model out of a "generalist", pre-trained LLM where different domains of expertise are originally combined in a form of "superposition". Our experimental results on a biomedical domain show that our self-specialized model (30B) outperforms its base model, MPT-30B by a large margin and even surpasses larger popular models based on LLaMA-65B, highlighting its potential and practicality for specialization, especially considering its efficiency in terms of data and parameters.

Via

Access Paper or Ask Questions

Schema-Driven Information Extraction from Heterogeneous Tables

May 23, 2023

Fan Bai, Junmo Kang, Gabriel Stanovsky, Dayne Freitag, Alan Ritter

Abstract:In this paper, we explore the question of whether language models (LLMs) can support cost-efficient information extraction from complex tables. We introduce schema-driven information extraction, a new task that uses LLMs to transform tabular data into structured records following a human-authored schema. To assess various LLM's capabilities on this task, we develop a benchmark composed of tables from three diverse domains: machine learning papers, chemistry tables, and webpages. Accompanying the benchmark, we present InstrucTE, a table extraction method based on instruction-tuned LLMs. This method necessitates only a human-constructed extraction schema, and incorporates an error-recovery strategy. Notably, InstrucTE demonstrates competitive performance without task-specific labels, achieving an F1 score ranging from 72.3 to 95.7. Moreover, we validate the feasibility of distilling more compact table extraction models to minimize extraction costs and reduce API reliance. This study paves the way for the future development of instruction-following models for cost-efficient table extraction.

Via

Access Paper or Ask Questions

Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models

May 03, 2023

Junmo Kang, Wei Xu, Alan Ritter

Figure 1 for Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models

Figure 2 for Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models

Figure 3 for Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models

Figure 4 for Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models

Abstract:Fine-tuning large models is highly effective, however, inference using these models can be expensive and produces carbon emissions. Knowledge distillation has been shown to be a practical solution to reduce inference costs, but the distillation process itself requires significant computational resources. Rather than buying or renting GPUs to fine-tune, then distill a large model, an NLP practitioner who needs a compact model might also choose to simply allocate an available budget to hire annotators and manually label additional fine-tuning data. In this paper, we investigate how to most efficiently use a fixed budget to build a compact model. Through our extensive experiments on six diverse NLP tasks, we find that distilling from T5-XXL (11B) to T5-Small (60M) leads to almost always a cost-efficient option compared to annotating more data to directly train a compact model (T5-Small (60M)). We further demonstrate that the optimal amount of distillation that maximizes utility varies across different budgetary scenarios.

* Accepted at ACL 2023 main conference

Via

Access Paper or Ask Questions

Discern and Answer: Mitigating the Impact of Misinformation in Retrieval-Augmented Models with Discriminators

May 02, 2023

Giwon Hong, Jeonghwan Kim, Junmo Kang, Sung-Hyon Myaeng, Joyce Jiyoung Whang

Abstract:Most existing retrieval-augmented language models (LMs) for question answering assume all retrieved information is factually correct. In this work, we study a more realistic scenario in which retrieved documents may contain misinformation, causing conflicts among them. We observe that the existing models are highly brittle to such information in both fine-tuning and in-context few-shot learning settings. We propose approaches to make retrieval-augmented LMs robust to misinformation by explicitly fine-tuning a discriminator or prompting to elicit discrimination capability in GPT-3. Our empirical results on open-domain question answering show that these approaches significantly improve LMs' robustness to knowledge conflicts. We also provide our findings on interleaving the fine-tuned model's decision with the in-context learning process, paving a new path to leverage the best of both worlds.

* 8 pages, 4 figures

Via

Access Paper or Ask Questions

Maximizing Efficiency of Language Model Pre-training for Learning Representation

Oct 13, 2021

Junmo Kang, Suwon Shin, Jeonghwan Kim, Jaeyoung Jo, Sung-Hyon Myaeng

Figure 1 for Maximizing Efficiency of Language Model Pre-training for Learning Representation

Figure 2 for Maximizing Efficiency of Language Model Pre-training for Learning Representation

Figure 3 for Maximizing Efficiency of Language Model Pre-training for Learning Representation

Figure 4 for Maximizing Efficiency of Language Model Pre-training for Learning Representation

Abstract:Pre-trained language models in the past years have shown exponential growth in model parameters and compute time. ELECTRA is a novel approach for improving the compute efficiency of pre-trained language models (e.g. BERT) based on masked language modeling (MLM) by addressing the sample inefficiency problem with the replaced token detection (RTD) task. Our work proposes adaptive early exit strategy to maximize the efficiency of the pre-training process by relieving the model's subsequent layers of the need to process latent features by leveraging earlier layer representations. Moreover, we evaluate an initial approach to the problem that has not succeeded in maintaining the accuracy of the model while showing a promising compute efficiency by thoroughly investigating the necessity of the generator module of ELECTRA.

* Published in KSC 2020

Via

Access Paper or Ask Questions

UHD-BERT: Bucketed Ultra-High Dimensional Sparse Representations for Full Ranking

Apr 15, 2021

Kyoung-Rok Jang, Junmo Kang, Giwon Hong, Sung-Hyon Myaeng, Joohee Park, Taewon Yoon, Heecheol Seo

Figure 1 for UHD-BERT: Bucketed Ultra-High Dimensional Sparse Representations for Full Ranking

Figure 2 for UHD-BERT: Bucketed Ultra-High Dimensional Sparse Representations for Full Ranking

Figure 3 for UHD-BERT: Bucketed Ultra-High Dimensional Sparse Representations for Full Ranking

Figure 4 for UHD-BERT: Bucketed Ultra-High Dimensional Sparse Representations for Full Ranking

Abstract:Neural information retrieval (IR) models are promising mainly because their semantic matching capabilities can ameliorate the well-known synonymy and polysemy problems of word-based symbolic approaches. However, the power of neural models' dense representations comes at the cost of inefficiency, limiting it to be used as a re-ranker. Sparse representations, on the other hand, can help enhance symbolic or latent-term representations and yet take advantage of an inverted index for efficiency, being amenable to symbolic IR techniques that have been around for decades. In order to transcend the trade-off between sparse representations (symbolic or latent-term based) and dense representations, we propose an ultra-high dimensional (UHD) representation scheme equipped with directly controllable sparsity. With the high dimensionality, we attempt to make the meaning of each dimension less entangled and polysemous than dense embeddings. The sparsity allows for not only efficiency for vector calculations but also the possibility of making individual dimensions attributable to interpretable concepts. Our model, UHD-BERT, maximizes the benefits of ultra-high dimensional (UHD) sparse representations based on BERT language modeling, by adopting a bucketing method. With this method, different segments of an embedding (horizontal buckets) or the embeddings from multiple layers of BERT (vertical buckets) can be selected and merged so that diverse linguistic aspects can be represented. An additional and important benefit of our highly disentangled (high-dimensional) and efficient (sparse) representations is that this neural approach can be harmonized with well-studied symbolic IR techniques (e.g., inverted index, pseudo-relevance feedback, BM25), enabling us to build a powerful and efficient neuro-symbolic information retrieval system.

Via

Access Paper or Ask Questions

A Sequence-Oblivious Generation Method for Context-Aware Hashtag Recommendation

Dec 05, 2020

Junmo Kang, Jeonghwan Kim, Suwon Shin, Sung-Hyon Myaeng

Figure 1 for A Sequence-Oblivious Generation Method for Context-Aware Hashtag Recommendation

Figure 2 for A Sequence-Oblivious Generation Method for Context-Aware Hashtag Recommendation

Figure 3 for A Sequence-Oblivious Generation Method for Context-Aware Hashtag Recommendation

Figure 4 for A Sequence-Oblivious Generation Method for Context-Aware Hashtag Recommendation

Abstract:Like search, a recommendation task accepts an input query or cue and provides desirable items, often based on a ranking function. Such a ranking approach rarely considers explicit dependency among the recommended items. In this work, we propose a generative approach to tag recommendation, where semantic tags are selected one at a time conditioned on the previously generated tags to model inter-dependency among the generated tags. We apply this tag recommendation approach to an Instagram data set where an array of context feature types (image, location, time, and text) are available for posts. To exploit the inter-dependency among the distinct types of features, we adopt a simple yet effective architecture using self-attention, making deep interactions possible. Empirical results show that our method is significantly superior to not only the usual ranking schemes but also autoregressive models for tag recommendation. They indicate that it is critical to fuse mutually supporting features at an early stage to induce extensive and comprehensive view on inter-context interaction in generating tags in a recurrent feedback loop.

Via

Access Paper or Ask Questions

Let Me Know What to Ask: Interrogative-Word-Aware Question Generation

Oct 30, 2019

Junmo Kang, Haritz Puerto San Roman, Sung-Hyon Myaeng

Figure 1 for Let Me Know What to Ask: Interrogative-Word-Aware Question Generation

Figure 2 for Let Me Know What to Ask: Interrogative-Word-Aware Question Generation

Figure 3 for Let Me Know What to Ask: Interrogative-Word-Aware Question Generation

Figure 4 for Let Me Know What to Ask: Interrogative-Word-Aware Question Generation

Abstract:Question Generation (QG) is a Natural Language Processing (NLP) task that aids advances in Question Answering (QA) and conversational assistants. Existing models focus on generating a question based on a text and possibly the answer to the generated question. They need to determine the type of interrogative word to be generated while having to pay attention to the grammar and vocabulary of the question. In this work, we propose Interrogative-Word-Aware Question Generation (IWAQG), a pipelined system composed of two modules: an interrogative word classifier and a QG model. The first module predicts the interrogative word that is provided to the second module to create the question. Owing to an increased recall of deciding the interrogative words to be used for the generated questions, the proposed model achieves new state-of-the-art results on the task of QG in SQuAD, improving from 46.58 to 47.69 in BLEU-1, 17.55 to 18.53 in BLEU-4, 21.24 to 22.33 in METEOR, and from 44.53 to 46.94 in ROUGE-L.

* Accepted at 2nd Workshop on Machine Reading for Question Answering (MRQA), EMNLP 2019

Via

Access Paper or Ask Questions