Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ting-Han Fan

Virtual Width Networks

Nov 17, 2025

Seed, Baisheng Li, Banggu Wu, Bole Ma, Bowen Xiao, Chaoyi Zhang, Cheng Li, Chengyi Wang, Chengyin Xu, Chi Zhang(+108 more)

Abstract:We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.

Via

Access Paper or Ask Questions

LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion

Jan 25, 2025

Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen

Figure 1 for LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion

Figure 2 for LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion

Figure 3 for LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion

Figure 4 for LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion

Abstract:Large language models (LLMs) have demonstrated remarkable progress in understanding long-context inputs. However, benchmarks for evaluating the long-context reasoning abilities of LLMs fall behind the pace. Existing benchmarks often focus on a narrow range of tasks or those that do not demand complex reasoning. To address this gap and enable a more comprehensive evaluation of the long-context reasoning capabilities of current LLMs, we propose a new synthetic benchmark, LongReason, which is constructed by synthesizing long-context reasoning questions from a varied set of short-context reasoning questions through context expansion. LongReason consists of 794 multiple-choice reasoning questions with diverse reasoning patterns across three task categories: reading comprehension, logical inference, and mathematical word problems. We evaluate 21 LLMs on LongReason, revealing that most models experience significant performance drops as context length increases. Our further analysis shows that even state-of-the-art LLMs still have significant room for improvement in providing robust reasoning across different tasks. We will open-source LongReason to support the comprehensive evaluation of LLMs' long-context reasoning capabilities.

Via

Access Paper or Ask Questions

Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation

Nov 15, 2023

Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky

Figure 1 for Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation

Figure 2 for Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation

Figure 3 for Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation

Figure 4 for Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation

Abstract:An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any fine-tuning. Such long-context utilization capability relies heavily on a flexible positional embedding design. Upon investigating the flexibility of existing large pre-trained Transformer language models, we find that the T5 family deserves a closer look, as its positional embeddings capture rich and flexible attention patterns. However, T5 suffers from the dispersed attention issue: the longer the input sequence, the flatter the attention distribution. To alleviate the issue, we propose two attention alignment strategies via temperature scaling. Our findings show improvement on the long-context utilization capability of T5 on language modeling, retrieval, multi-document question answering, and code completion tasks without any fine-tuning. This suggests that a flexible positional embedding design and attention alignment can go a long way toward Transformer length extrapolation.

Via

Access Paper or Ask Questions

Advancing Regular Language Reasoning in Linear Recurrent Neural Networks

Sep 14, 2023

Ting-Han Fan, Ta-Chung Chi, Alexander I. Rudnicky

Figure 1 for Advancing Regular Language Reasoning in Linear Recurrent Neural Networks

Figure 2 for Advancing Regular Language Reasoning in Linear Recurrent Neural Networks

Abstract:In recent studies, linear recurrent neural networks (LRNNs) have achieved Transformer-level performance in natural language modeling and long-range modeling while offering rapid parallel training and constant inference costs. With the resurged interest in LRNNs, we study whether they can learn the hidden rules in training sequences, such as the grammatical structures of regular language. We theoretically analyze some existing LRNNs and discover their limitations on regular language. Motivated by the analysis, we propose a new LRNN equipped with a block-diagonal and input-dependent transition matrix. Experiments suggest that the proposed model is the only LRNN that can perform length extrapolation on regular language tasks such as Sum, Even Pair, and Modular Arithmetic.

* The first two authors contributed equally to this work

Via

Access Paper or Ask Questions

Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings

May 23, 2023

Ta-Chung Chi, Ting-Han Fan, Li-Wei Chen, Alexander I. Rudnicky, Peter J. Ramadge

Figure 1 for Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings

Figure 2 for Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings

Figure 3 for Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings

Figure 4 for Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings

Abstract:The use of positional embeddings in transformer language models is widely accepted. However, recent research has called into question the necessity of such embeddings. We further extend this inquiry by demonstrating that a randomly initialized and frozen transformer language model, devoid of positional embeddings, inherently encodes strong positional information through the shrinkage of self-attention variance. To quantify this variance, we derive the underlying distribution of each step within a transformer layer. Through empirical validation using a fully pretrained model, we show that the variance shrinkage effect still persists after extensive gradient updates. Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.

* Accepted by ACL 2023

Via

Access Paper or Ask Questions

Transformer Working Memory Enables Regular Language Reasoning and Natural Language Length Extrapolation

May 05, 2023

Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge

Figure 1 for Transformer Working Memory Enables Regular Language Reasoning and Natural Language Length Extrapolation

Figure 2 for Transformer Working Memory Enables Regular Language Reasoning and Natural Language Length Extrapolation

Figure 3 for Transformer Working Memory Enables Regular Language Reasoning and Natural Language Length Extrapolation

Figure 4 for Transformer Working Memory Enables Regular Language Reasoning and Natural Language Length Extrapolation

Abstract:Unlike recurrent models, conventional wisdom has it that Transformers cannot perfectly model regular languages. Inspired by the notion of working memory, we propose a new Transformer variant named RegularGPT. With its novel combination of Weight-Sharing, Adaptive-Depth, and Sliding-Dilated-Attention, RegularGPT constructs working memory along the depth dimension, thereby enabling efficient and successful modeling of regular languages such as PARITY. We further test RegularGPT on the task of natural language length extrapolation and surprisingly find that it rediscovers the local windowed attention effect deemed necessary in prior work for length extrapolation.

Via

Access Paper or Ask Questions

Receptive Field Alignment Enables Transformer Length Extrapolation

Dec 20, 2022

Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky

Abstract:Length extrapolation is a desirable property that permits training a transformer language model on short sequences and retaining similar perplexities when the model is tested on substantially longer sequences. A relative positional embedding mechanism applied on the transformer self-attention matrix, ALiBi, demonstrates the length extrapolation property with the widest usage to date. In this paper, we show that ALiBi surprisingly does not utilize tokens further than the training sequence length, which can be explained by its implicit windowed attention effect that aligns the receptive field during training and testing stages. Inspired by ALiBi and the receptive filed alignment hypothesis, we propose another transformer positional embedding design named~\textbf{Sandwich} that uses longer than training sequence length information, and it is a greatly simplified formulation of the earliest proposed Sinusoidal positional embedding. Finally, we show that both ALiBi and Sandwich enable efficient inference thanks to their implicit windowed attention effect.

* Work In progress

Via

Access Paper or Ask Questions

Training Discrete Deep Generative Models via Gapped Straight-Through Estimator

Jun 15, 2022

Ting-Han Fan, Ta-Chung Chi, Alexander I. Rudnicky, Peter J. Ramadge

Figure 1 for Training Discrete Deep Generative Models via Gapped Straight-Through Estimator

Figure 2 for Training Discrete Deep Generative Models via Gapped Straight-Through Estimator

Figure 3 for Training Discrete Deep Generative Models via Gapped Straight-Through Estimator

Figure 4 for Training Discrete Deep Generative Models via Gapped Straight-Through Estimator

Abstract:While deep generative models have succeeded in image processing, natural language processing, and reinforcement learning, training that involves discrete random variables remains challenging due to the high variance of its gradient estimation process. Monte Carlo is a common solution used in most variance reduction approaches. However, this involves time-consuming resampling and multiple function evaluations. We propose a Gapped Straight-Through (GST) estimator to reduce the variance without incurring resampling overhead. This estimator is inspired by the essential properties of Straight-Through Gumbel-Softmax. We determine these properties and show via an ablation study that they are essential. Experiments demonstrate that the proposed GST estimator enjoys better performance compared to strong baselines on two discrete deep generative modeling tasks, MNIST-VAE and ListOps.

* Accepted at the International Conference on Machine Learning (ICML) 2022. The first two authors contributed equally

Via

Access Paper or Ask Questions

KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

May 20, 2022

Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, Alexander I. Rudnicky

Figure 1 for KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

Figure 2 for KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

Figure 3 for KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

Figure 4 for KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

Abstract:Relative positional embeddings (RPE) have received considerable attention since RPEs effectively model the relative distance among tokens and enable length extrapolation. We propose KERPLE, a framework that generalizes relative position embedding for extrapolation by kernelizing positional differences. We achieve this goal using conditionally positive definite (CPD) kernels, a class of functions known for generalizing distance metrics. To maintain the inner product interpretation of self-attention, we show that a CPD kernel can be transformed into a PD kernel by adding a constant offset. This offset is implicitly absorbed in the Softmax normalization during self-attention. The diversity of CPD kernels allows us to derive various RPEs that enable length extrapolation in a principled way. Experiments demonstrate that the logarithmic variant achieves excellent extrapolation performance on three large language modeling datasets.

* The first two authors contributed equally to this work

Via

Access Paper or Ask Questions

Explaining Off-Policy Actor-Critic From A Bias-Variance Perspective

Oct 06, 2021

Ting-Han Fan, Peter J. Ramadge

Figure 1 for Explaining Off-Policy Actor-Critic From A Bias-Variance Perspective

Figure 2 for Explaining Off-Policy Actor-Critic From A Bias-Variance Perspective

Figure 3 for Explaining Off-Policy Actor-Critic From A Bias-Variance Perspective

Figure 4 for Explaining Off-Policy Actor-Critic From A Bias-Variance Perspective

Abstract:Off-policy Actor-Critic algorithms have demonstrated phenomenal experimental performance but still require better explanations. To this end, we show its policy evaluation error on the distribution of transitions decomposes into: a Bellman error, a bias from policy mismatch, and a variance term from sampling. By comparing the magnitude of bias and variance, we explain the success of the Emphasizing Recent Experience sampling and 1/age weighted sampling. Both sampling strategies yield smaller bias and variance and are hence preferable to uniform sampling.

Via

Access Paper or Ask Questions