Alexander Hanbo Li

A1: Asynchronous Test-Time Scaling via Conformal Prediction

Sep 18, 2025

Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng, Chenyang Zhao, Hui Shen, Alexander Hanbo Li, Chaofan Tao, Haochen Tan(+4 more)

Abstract:Large language models (LLMs) benefit from test-time scaling, but existing methods face significant challenges, including severe synchronization overhead, memory bottlenecks, and latency, especially during speculative decoding with long reasoning chains. We introduce A1 (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive inference framework that addresses these challenges. A1 refines arithmetic intensity to identify synchronization as the dominant bottleneck, proposes an online calibration strategy to enable asynchronous inference, and designs a three-stage rejection sampling pipeline that supports both sequential and parallel scaling. Through experiments on the MATH, AMC23, AIME24, and AIME25 datasets, across various draft-target model families, we demonstrate that A1 achieves a 56.7x speedup in test-time scaling and a 4.14x improvement in throughput, all while maintaining accurate rejection-rate control, reducing latency and memory overhead, and incurring no accuracy loss compared to using target model scaling alone. These results position A1 as an efficient and principled solution for scalable LLM inference. We have released the code at https://github.com/menik1126/asynchronous-test-time-scaling.

* Tech Report
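
The abstract ties rejection-rate control to conformal prediction but does not spell out the calibration rule. As a rough illustration only, the sketch below uses standard split-conformal calibration, which is not necessarily the online strategy A1 implements. The names `conformal_threshold` and `should_reject`, and the idea of scoring each draft output with a single nonconformity number, are assumptions made for this example.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the split-conformal quantile of the calibration scores.

    If the calibration scores and a fresh score are exchangeable, the fresh
    score exceeds the returned threshold with probability at most alpha.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    # Finite-sample-corrected rank of the (1 - alpha) quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: never reject.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def should_reject(score, threshold):
    """Reject a draft output whose nonconformity score exceeds the threshold."""
    return score > threshold
```

With ten calibration scores 1 through 10 and `alpha=0.2`, the corrected rank is 9, so the threshold is the ninth-smallest score and only draft outputs scoring above it would be rejected.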

DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation

May 20, 2025

He Wang, Alexander Hanbo Li, Yiqun Hu, Sheng Zhang, Hideo Kobayashi, Jiani Zhang, Henry Zhu, Chung-Wei Hang, Patrick Ng

Abstract:Large language model (LLM) agents have shown promising performance in generating code for solving complex data science problems. Recent studies primarily focus on enhancing in-context learning through improved search, sampling, and planning techniques, while overlooking the importance of the order in which problems are tackled during inference. In this work, we develop a novel inference-time optimization framework, referred to as DSMentor, which leverages curriculum learning -- a strategy that introduces simpler tasks first and progressively moves to more complex ones as the learner improves -- to enhance LLM agent performance in challenging data science tasks. Our mentor-guided framework organizes data science tasks in order of increasing difficulty and incorporates a growing long-term memory to retain prior experiences, guiding the agent's learning progression and enabling more effective utilization of accumulated knowledge. We evaluate DSMentor through extensive experiments on DSEval and QRData benchmarks. Experiments show that DSMentor using Claude-3.5-Sonnet improves the pass rate by up to 5.2% on DSEval and QRData compared to baseline agents. Furthermore, DSMentor demonstrates stronger causal reasoning ability, improving the pass rate by 8.8% on the causality problems compared to GPT-4 using Program-of-Thoughts prompts. Our work underscores the importance of developing effective strategies for accumulating and utilizing knowledge during inference, mirroring the human learning process and opening new avenues for improving LLM performance through curriculum-based inference optimization.
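
The easy-to-hard ordering with a growing memory described above can be sketched in a few lines. This is a toy illustration, not the DSMentor implementation: the `difficulty` field, the `solve` callback, and the rule of passing the most recent solved examples are all assumptions, since the abstract does not say how difficulty is estimated or how memory is retrieved.

```python
def run_curriculum(tasks, solve, memory_size=3):
    """Solve tasks from easiest to hardest, passing along recent successes.

    tasks: list of dicts with hypothetical keys "name" and "difficulty".
    solve: hypothetical callback (task, examples) -> solution, or None on failure.
    """
    memory = []   # (task name, solution) pairs from tasks solved so far
    results = {}
    for task in sorted(tasks, key=lambda t: t["difficulty"]):
        # Give the solver the most recent entries of the long-term memory.
        examples = memory[-memory_size:]
        solution = solve(task, examples)
        results[task["name"]] = solution
        if solution is not None:
            memory.append((task["name"], solution))
    return results
```

In a real agent, `solve` would wrap an LLM call that places the retrieved examples in the prompt; here it is left as a plain function so the ordering and memory growth are easy to inspect.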

Towards Better Understanding Table Instruction Tuning: Decoupling the Effects from Data versus Models

Jan 24, 2025

Naihao Deng, Sheng Zhang, Henghui Zhu, Shuaichen Chang, Jiani Zhang, Alexander Hanbo Li, Chung-Wei Hang, Hideo Kobayashi, Yiqun Hu, Patrick Ng

Abstract:Recent advances in natural language processing have leveraged instruction tuning to enhance Large Language Models (LLMs) for table-related tasks. However, previous works train different base models with different training data, so the resulting table LLMs lack an apples-to-apples comparison. To address this, we fine-tune base models from the Mistral, OLMo, and Phi families on existing public training datasets. Our replication achieves performance on par with or surpassing existing table LLMs, establishing new state-of-the-art performance on Hitab, a table question-answering dataset. More importantly, through systematic out-of-domain evaluation, we decouple the contributions of training data and the base model, providing insight into their individual impacts. In addition, we assess the effects of table-specific instruction tuning on general-purpose benchmarks, revealing trade-offs between specialization and generalization.

Few-Shot Data-to-Text Generation via Unified Representation and Multi-Source Learning

Aug 10, 2023

Alexander Hanbo Li, Mingyue Shang, Evangelia Spiliopoulou, Jie Ma, Patrick Ng, Zhiguo Wang, Bonan Min, William Wang, Kathleen McKeown, Vittorio Castelli(+2 more)

Abstract:We present a novel approach for structured data-to-text generation that addresses the limitations of existing methods that primarily focus on specific types of structured data. Our proposed method aims to improve performance in multi-task training, zero-shot and few-shot scenarios by providing a unified representation that can handle various forms of structured data such as tables, knowledge graph triples, and meaning representations. We demonstrate that our proposed approach can effectively adapt to new structured forms, and can improve performance in comparison to current methods. For example, our method resulted in a 66% improvement in zero-shot BLEU scores when transferring models trained on table inputs to a knowledge graph dataset. Our proposed method is an important step towards a more general data-to-text generation framework.

Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge

May 30, 2023

Xingyu Fu, Sheng Zhang, Gukyeong Kwon, Pramuditha Perera, Henghui Zhu, Yuhao Zhang, Alexander Hanbo Li, William Yang Wang, Zhiguo Wang, Vittorio Castelli(+3 more)

Abstract:The open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs using world knowledge. Recently, pre-trained Language Models (PLM) such as GPT-3 have been applied to the task and shown to be powerful world knowledge sources. However, these methods suffer from low knowledge coverage caused by PLM bias -- the tendency to generate certain tokens over other tokens regardless of prompt changes, and high dependency on the PLM quality -- only models using GPT-3 can achieve the best result. To address the aforementioned challenges, we propose RASO: a new VQA pipeline that deploys a generate-then-select strategy guided by world knowledge for the first time. Rather than following the de facto standard to train a multi-modal model that directly generates the VQA answer, RASO first adopts PLM to generate all the possible answers, and then trains a lightweight answer selection model for the correct answer. As proved in our analysis, RASO expands the knowledge coverage from in-domain training data by a large margin. We provide extensive experimentation and show the effectiveness of our pipeline by advancing the state-of-the-art by 4.1% on OK-VQA, without additional computation cost. Code and models are released at http://cogcomp.org/page/publication_view/1010

* Accepted to ACL 2023 Findings

Benchmarking Diverse-Modal Entity Linking with Generative Models

May 27, 2023

Sijia Wang, Alexander Hanbo Li, Henry Zhu, Sheng Zhang, Chung-Wei Hang, Pramuditha Perera, Jie Ma, William Wang, Zhiguo Wang, Vittorio Castelli(+2 more)

Abstract:Entities can be expressed in diverse formats, such as texts, images, or column names and cell values in tables. While existing entity linking (EL) models work well on a per-modality configuration, such as text-only EL, visual grounding, or schema linking, it is more challenging to design a unified model for diverse modality configurations. To bring various modality configurations together, we constructed a benchmark for diverse-modal EL (DMEL) from existing EL datasets, covering all three modalities including text, image, and table. To approach the DMEL task, we proposed a generative diverse-modal model (GDMM) following a multimodal-encoder-decoder paradigm. Pre-training GDMM with rich corpora builds a solid foundation for DMEL without storing the entire KB for inference. Fine-tuning GDMM builds a stronger DMEL baseline, outperforming state-of-the-art task-specific EL models by 8.51 F1 score on average. Additionally, extensive error analyses are conducted to highlight the challenges of DMEL, facilitating future research on this task.

* 15 pages. ACL 2023

Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness

Jan 21, 2023

Shuaichen Chang, Jun Wang, Mingwen Dong, Lin Pan, Henghui Zhu, Alexander Hanbo Li, Wuwei Lan, Sheng Zhang, Jiarong Jiang, Joseph Lilien(+6 more)

Abstract:Neural text-to-SQL models have achieved remarkable performance in translating natural language questions into SQL queries. However, recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations. Previous curated robustness test sets usually focus on individual phenomena. In this paper, we propose a comprehensive robustness benchmark based on Spider, a cross-domain text-to-SQL benchmark, to diagnose the model robustness. We design 17 perturbations on databases, natural language questions, and SQL queries to measure the robustness from different angles. In order to collect more diversified natural question perturbations, we utilize large pretrained language models (PLMs) to simulate human behaviors in creating natural questions. We conduct a diagnostic study of the state-of-the-art models on the robustness set. Experimental results reveal that even the most robust model suffers from a 14.0% performance drop overall and a 50.7% performance drop on the most challenging perturbation. We also present a breakdown analysis regarding text-to-SQL model designs and provide insights for improving model robustness.

* ICLR 2023

DecAF: Joint Decoding of Answers and Logical Forms for Question Answering over Knowledge Bases

Sep 30, 2022

Donghan Yu, Sheng Zhang, Patrick Ng, Henghui Zhu, Alexander Hanbo Li, Jun Wang, Yiqun Hu, William Wang, Zhiguo Wang, Bing Xiang

Abstract:Question answering over knowledge bases (KBs) aims to answer natural language questions with factual information such as entities and relations in KBs. Previous methods either generate logical forms that can be executed over KBs to obtain final answers or predict answers directly. Empirical results show that the former often produces more accurate answers, but it suffers from non-execution issues due to potential syntactic and semantic errors in the generated logical forms. In this work, we propose a novel framework DecAF that jointly generates both logical forms and direct answers, and then combines the merits of them to get the final answers. Moreover, different from most of the previous methods, DecAF is based on simple free-text retrieval without relying on any entity linking tools -- this simplification eases its adaptation to different datasets. DecAF achieves new state-of-the-art accuracy on WebQSP, FreebaseQA, and GrailQA benchmarks, while getting competitive results on the ComplexWebQuestions benchmark.
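
The abstract says DecAF combines logical-form answers with direct answers but does not state the combination rule. One simple possibility, shown here purely as an assumption and not as the paper's method, is to prefer the first logical form that executes to a non-empty result and fall back to the top direct answer otherwise. The names `combine_answers` and `execute` are hypothetical.

```python
def combine_answers(logical_forms, direct_answers, execute):
    """Prefer an executable logical form; otherwise fall back to a direct answer.

    logical_forms: candidate logical forms, best first.
    direct_answers: directly generated answers, best first.
    execute: hypothetical callback that runs a logical form against the KB and
             returns a list of answers, raising an exception if it cannot run.
    """
    for form in logical_forms:
        try:
            result = execute(form)
        except Exception:
            # Non-executable form (syntactic or semantic error): try the next one.
            continue
        if result:
            return result
    # No logical form produced an answer, so use the top direct answer if any.
    return [direct_answers[0]] if direct_answers else []
```

This fallback addresses the non-execution issue the abstract mentions: a question still receives an answer when every generated logical form fails to run.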

Improving Text-to-SQL Semantic Parsing with Fine-grained Query Understanding

Sep 28, 2022

Jun Wang, Patrick Ng, Alexander Hanbo Li, Jiarong Jiang, Zhiguo Wang, Ramesh Nallapati, Bing Xiang, Sudipta Sengupta

Abstract:Most recent research on Text-to-SQL semantic parsing relies on either the parser itself or a simple heuristic-based approach to understand the natural language query (NLQ). When synthesizing a SQL query, no explicit semantic information about the NLQ is available to the parser, which leads to undesirable generalization performance. In addition, without lexical-level fine-grained query understanding, linking between the query and the database can only rely on fuzzy string matching, which leads to suboptimal performance in real applications. In view of this, we present a general-purpose, modular neural semantic parsing framework that is based on token-level fine-grained query understanding. Our framework consists of three modules: a named entity recognizer (NER), a neural entity linker (NEL), and a neural semantic parser (NSP). By jointly modeling the query and the database, the NER model analyzes user intents and identifies entities in the query. The NEL model links typed entities to the schema and cell values in the database. The parser model leverages the available semantic information and linking results to synthesize tree-structured SQL queries based on a dynamically generated grammar. Experiments on SQUALL, a newly released semantic parsing dataset, show that we can achieve 56.8% execution accuracy on the WikiTableQuestions (WTQ) test set, which outperforms the state-of-the-art model by 2.7%.

* EMNLP Industry Track 2022

Learning to Selectively Learn for Weakly-supervised Paraphrase Generation

Sep 25, 2021

Kaize Ding, Dingcheng Li, Alexander Hanbo Li, Xing Fan, Chenlei Guo, Yang Liu, Huan Liu

Abstract:Paraphrase generation is a longstanding NLP task that has diverse applications for downstream NLP tasks. However, the effectiveness of existing efforts predominantly relies on large amounts of gold-labeled data. Though unsupervised endeavors have been proposed to address this issue, they may fail to generate meaningful paraphrases due to the lack of supervision signals. In this work, we go beyond the existing paradigms and propose a novel approach to generate high-quality paraphrases with weak supervision data. Specifically, we tackle the weakly-supervised paraphrase generation problem by: (1) obtaining abundant weakly-labeled parallel sentences via retrieval-based pseudo paraphrase expansion; and (2) developing a meta-learning framework to progressively select valuable samples for fine-tuning a pre-trained language model, i.e., BART, on the sentential paraphrasing task. We demonstrate that our approach achieves significant improvements over existing unsupervised approaches, and is even comparable in performance with supervised state-of-the-art methods.

* Accepted by EMNLP 2021 (long)
