Abstract:The KV cache has become a de facto technique for the inference of large language models (LLMs), where tensors of shape (layer number, head number, sequence length, feature dimension) are introduced to cache historical information for self-attention. As model and data sizes grow, the KV cache can quickly become a system bottleneck in both storage and memory transfer. To address this, prior studies usually focus on the first three axes of the cache tensors for compression. This paper complements them by focusing on the feature dimension axis, utilizing low-rank projection matrices to transform the cache features into spaces of reduced dimension. We begin by investigating the canonical orthogonal projection for data compression via principal component analysis (PCA), and find that PCA projection suffers significant performance degradation at low compression rates. To bridge this gap, we propose to directly tune the orthogonal projection matrices with a distillation objective using an elaborate Matryoshka training strategy. After training, we adaptively search for the optimal compression rates of different layers and heads under varying compression budgets. Compared to previous works, our method can be readily applied to pre-trained LLMs and offers a smooth tradeoff between performance and compression rate. We empirically demonstrate the high data efficiency of our training procedure and find that our method can sustain over 90% performance with an average KV cache compression rate of 60% (and up to 75% in certain extreme scenarios) for popular LLMs like LLaMA2-7B-base and Mistral-7B-v0.3-base.
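For intuition, below is a minimal sketch of the PCA baseline described in this abstract (not the tuned Matryoshka procedure): an orthogonal projection fitted on calibration features compresses each head's cache along the feature dimension. The calibration tensor, the 128-dim head size, and the rank of 51 are illustrative assumptions.

```python
# Sketch of PCA-based feature-dimension compression for a single attention head.
# All shapes and the calibration data are illustrative assumptions.
import numpy as np

def fit_pca_projection(calib_feats, rank):
    """calib_feats: (num_tokens, head_dim) features gathered from calibration data.
    Returns (mean, proj) with proj of shape (head_dim, rank), columns orthonormal."""
    mean = calib_feats.mean(axis=0)
    _, _, vt = np.linalg.svd(calib_feats - mean, full_matrices=False)
    return mean, vt[:rank].T

def compress(kv, mean, proj):
    # kv: (seq_len, head_dim) -> cached tensor of shape (seq_len, rank)
    return (kv - mean) @ proj

def decompress(kv_low, mean, proj):
    # Approximate reconstruction used before the attention products.
    return kv_low @ proj.T + mean

# Example: a 128-dim head reduced to rank 51 (~60% reduction along this axis).
calib_keys = np.random.randn(4096, 128)
mean, proj = fit_pca_projection(calib_keys, rank=51)
keys = np.random.randn(1024, 128)
keys_hat = decompress(compress(keys, mean, proj), mean, proj)
```

In the paper's method the projection matrices are further tuned with a distillation objective rather than kept fixed at their PCA initialization.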
Abstract:The KV-Cache technique has become the standard for the inference of large language models (LLMs). It caches the states of self-attention to avoid recomputation. Yet, the KV-Cache is widely criticized as a potential bottleneck of the LLM inference system, especially when confronted with ultra-large models and long-context queries. A natural remedy is to discard the KV-Cache of less important tokens, with StreamingLLM as an example, but its static eviction strategy cannot flexibly adapt to varying contexts. Remedies like H2O leverage accumulated attention scores to perform dynamic eviction but suffer from an attention bias issue when capturing contextual information. This paper bridges this gap by devising a parameterized KV-Cache eviction mechanism, dubbed Attention-Gate, which accepts the whole context as input and yields eviction flags for each token to realize in-context eviction. The subsequent self-attention module proceeds according to the flags, and only the KV states of the remaining tokens need to be cached. The Attention-Gates can vary across heads and layers and can be trivially plugged into pre-trained LLMs, tuned with cost-effective continual pre-training or supervised fine-tuning objectives to learn what to discard. The computational and memory overhead introduced by Attention-Gates is minimal. Our method is validated across multiple tasks, demonstrating both efficiency and adaptability. After a highly efficient continual pre-training, it achieves higher average accuracy and evicts more tokens than traditional training-free methods. In supervised fine-tuning, it not only evicts many tokens but also outperforms LoRA-finetuned LLMs on some datasets, such as RTE, where it improves accuracy by 13.9% while evicting 62.8% of tokens, showing that effective eviction of redundant tokens can even enhance performance.
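The following is an illustrative sketch of a parameterized eviction gate in the spirit of this abstract; the module name, shapes, and the hard 0.5 threshold are assumptions, and the paper's Attention-Gate design may differ in detail.

```python
# Sketch of a learnable per-token, per-head eviction gate (illustrative only).
import torch
import torch.nn as nn

class EvictionGate(nn.Module):
    def __init__(self, hidden_dim: int, num_heads: int, threshold: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, num_heads)  # one score per head
        self.threshold = threshold

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim); the gate sees the whole
        # context and emits a keep/evict flag per token and per head.
        scores = torch.sigmoid(self.scorer(hidden_states))  # (B, T, H)
        keep = scores > self.threshold                       # bool flags
        # A hard threshold is used here for clarity; training such flags in
        # practice needs a differentiable relaxation (e.g., straight-through),
        # which is omitted in this sketch.
        return keep.transpose(1, 2)                          # (B, H, T)

# Only KV states flagged "keep" would be written to the cache; evicted tokens
# are masked out of subsequent attention.
gate = EvictionGate(hidden_dim=4096, num_heads=32)
flags = gate(torch.randn(1, 128, 4096))
print(flags.float().mean())  # fraction of tokens kept
```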
Abstract:Inter-cell interference (ICI) suppression is critical for multi-cell multi-user networks. In this paper, we investigate advanced precoding techniques for coordinated multi-point (CoMP) with downlink coherent joint transmission, an effective approach to ICI suppression. Unlike centralized precoding schemes, which require frequent information exchange among the cooperating base stations, we propose a decentralized scheme that minimizes the total power consumption. In particular, based on the covariance matrices of global channel state information, we estimate the ICI bounds via deterministic equivalents and decouple the original design problem into sub-problems, each of which can be solved in a decentralized manner. To solve the sub-problem at each base station, we develop a low-complexity solver based on the alternating direction method of multipliers (ADMM) in conjunction with the convex-concave procedure (CCCP). Simulation results demonstrate the effectiveness of the proposed decentralized precoding scheme, which achieves performance close to that of the optimal centralized precoding scheme. Moreover, the proposed ADMM solver substantially reduces the computational complexity while maintaining outstanding performance.
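As a rough illustration of the decoupling idea, a per-base-station power-minimization sub-problem of the kind described above can be written schematically as below. The notation, the SINR targets, and the form of the ICI bounds are illustrative assumptions; the paper's exact deterministic-equivalent constraints differ in detail.

```latex
% Schematic sub-problem at base station b: minimize transmit power subject to
% per-user SINR targets, with incoming ICI replaced by a deterministic-equivalent
% estimate \bar{I}_k and outgoing ICI toward cell m bounded by \epsilon_{b,m}.
\begin{align}
\min_{\{\mathbf{w}_{b,k}\}} \quad & \sum_{k} \|\mathbf{w}_{b,k}\|^2 \\
\text{s.t.} \quad
& \frac{|\mathbf{h}_{b,k}^{\mathsf{H}}\mathbf{w}_{b,k}|^2}
       {\sum_{j \neq k} |\mathbf{h}_{b,k}^{\mathsf{H}}\mathbf{w}_{b,j}|^2
        + \bar{I}_{k} + \sigma^2} \ge \gamma_k, \quad \forall k, \\
& \sum_{k} |\mathbf{h}_{b,m}^{\mathsf{H}}\mathbf{w}_{b,k}|^2 \le \epsilon_{b,m},
  \quad \forall m \neq b .
\end{align}
```

Because the coupling between base stations enters only through the fixed bounds, each sub-problem can be handled locally, which is where the ADMM/CCCP solver comes in.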
Abstract:To control the inter-cell interference in a multi-cell multi-user multiple-input multiple-output network, we consider the precoder design for coordinated multi-point with downlink coherent joint transmission. To avoid the costly information exchange among the cooperating base stations required by centralized precoding schemes, we propose a decentralized scheme based on the power minimization problem. By approximating the inter-cell interference using deterministic equivalents, this problem is decoupled into sub-problems that are solved in a decentralized manner at different base stations. Simulation results demonstrate the effectiveness of the proposed decentralized precoding scheme, which requires only 2~7% more transmit power than the optimal centralized precoder.
Abstract:Neural networks with recurrent asymmetric couplings are important for understanding how episodic memories are encoded in the brain. Here, we integrate the experimental observation of a wide synaptic integration window into our model of sequence retrieval in continuous-time dynamics. The model with non-normal neuron interactions is studied theoretically by deriving a random matrix theory for the Jacobian matrix of the neural dynamics. The spectrum bears several distinct features, such as the breaking of rotational symmetry about the origin and the emergence of nested voids within the spectrum boundary. The spectral density is thus highly non-uniformly distributed in the complex plane. The random matrix theory also predicts a transition to chaos; in particular, the edge of chaos provides computational benefits for the sequential retrieval of memories. Our work provides a systematic study of time-lagged correlations with arbitrary time delays, and can thus inspire future studies of a broad class of memory models, and even large-scale analyses of biological time series.
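A toy numerical illustration of how asymmetric coupling correlations distort the spectrum is given below. This is the classic elliptic ensemble, not the paper's time-lagged Jacobian (which exhibits richer structure such as nested voids); the size, gain, and correlation parameter are assumptions.

```python
# Eigenvalue spectrum of a correlated non-symmetric random coupling matrix:
# the J_ij--J_ji correlation tau stretches the circular law into an ellipse,
# breaking rotational symmetry about the origin.
import numpy as np

N, g, tau = 1000, 1.0, 0.4          # matrix size, coupling gain, correlation
rng = np.random.default_rng(0)
A = rng.normal(size=(N, N))
S = (A + A.T) / np.sqrt(2)          # symmetric part, unit variance entries
AS = (A - A.T) / np.sqrt(2)         # antisymmetric part, unit variance entries
J = g * (np.sqrt((1 + tau) / 2) * S + np.sqrt((1 - tau) / 2) * AS) / np.sqrt(N)

eigs = np.linalg.eigvals(J)
# Elliptic-law prediction: real semi-axis ~ g*(1+tau), imaginary ~ g*(1-tau).
print(eigs.real.max(), g * (1 + tau))
print(np.abs(eigs.imag).max(), g * (1 - tau))
```

The transition to chaos in such rate models is governed by the rightmost eigenvalue crossing the stability boundary, which is why the shape of the spectrum matters for retrieval dynamics.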
Abstract:Integrating sensory inputs with prior beliefs from past experiences in unsupervised learning is a common and fundamental characteristic of brain and artificial neural computation. However, the quantitative role of prior knowledge in unsupervised learning remains unclear, hindering a scientific understanding of unsupervised learning. Here, we propose a statistical physics model of unsupervised learning with prior knowledge, revealing that the sensory inputs drive a series of continuous phase transitions related to spontaneous intrinsic-symmetry breaking. The intrinsic symmetry includes both reverse symmetry and permutation symmetry, which are commonly observed in most artificial neural networks. Compared to the prior-free scenario, the prior more strongly reduces the minimal data size that triggers the reverse-symmetry-breaking transition, and moreover, the prior merges, rather than separates, the permutation-symmetry-breaking phases. We claim that the prior can be learned from the data samples, which in physics corresponds to a two-parameter Nishimori plane constraint. This work thus reveals the mechanisms by which the prior influences unsupervised learning.
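One schematic reading of such a setup, written in generic Bayesian notation, is sketched below. The symbols and the interpretation of the Nishimori condition as a matching constraint on two hyperparameters are assumptions for illustration, not the paper's exact formulation.

```latex
% Schematic Bayesian setup: data samples x^a are generated by a teacher with
% feature vector \xi^*, and unsupervised learning amounts to sampling the posterior
\begin{equation}
  P(\boldsymbol{\xi}\mid\{\mathbf{x}^{a}\})
  \;\propto\;
  P_{\mathrm{prior}}(\boldsymbol{\xi})\,
  \prod_{a=1}^{M} P(\mathbf{x}^{a}\mid\boldsymbol{\xi}).
\end{equation}
% The Nishimori setting corresponds to the assumed prior and likelihood matching
% those that generated the data; with two hyperparameters, this matching condition
% carves out a plane in hyperparameter space.
```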
Abstract:Permuting any two hidden units leaves typical deep generative neural networks invariant. This permutation symmetry plays an important role in understanding the computational performance of a broad class of neural networks with two or more hidden units. However, a theoretical study of permutation symmetry is still lacking. Here, we propose a minimal model, a restricted Boltzmann machine with only two hidden units, to address how the permutation symmetry affects the critical learning data size at which concept formation (spontaneous symmetry breaking, in physics language) starts, and we semi-rigorously prove a conjecture that the critical data size is independent of the number of hidden units as long as this number is finite. Remarkably, we find that an embedded correlation between the two receptive fields of the hidden units reduces the critical data size. In particular, weakly correlated receptive fields significantly reduce the minimal data size that triggers the transition when the data are less noisy. Inspired by the theory, we also propose an efficient fully distributed algorithm to infer the receptive fields of the hidden units. Overall, our results demonstrate that permutation symmetry is an interesting property affecting the critical data size, and hence the computational performance, of related learning algorithms. All these effects can be probed analytically within the minimal model, providing theoretical insights toward understanding unsupervised learning in a more general context.
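To make the permutation symmetry of the minimal model concrete, here is a small numerical check with a toy two-hidden-unit RBM using ±1 units; the sizes, the Gaussian receptive fields, and the specific energy convention are illustrative assumptions rather than the paper's exact parameterization.

```python
# Toy two-hidden-unit RBM with +/-1 units: swapping the two receptive fields
# (a permutation of the hidden units) leaves the marginal over visible
# configurations unchanged.
import numpy as np

rng = np.random.default_rng(1)
N = 100                                    # number of visible units
w = rng.normal(size=(N, 2)) / np.sqrt(N)   # two receptive fields (columns of w)

def log_marginal(v, w):
    """Unnormalized log P(v) after summing out the two hidden units h_a in {-1,+1}:
    sum_a log(2 cosh(sum_i w_{ia} v_i))."""
    return np.sum(np.log(2.0 * np.cosh(v @ w)))

v = rng.choice([-1.0, 1.0], size=N)
# Permutation symmetry: the two columns of w can be exchanged with no effect.
assert np.isclose(log_marginal(v, w), log_marginal(v, w[:, ::-1]))
```

In the theory, this exchange symmetry is what must be spontaneously broken for the two receptive fields to be resolved, which is why the critical data size depends on how correlated (and how noisy) the embedded fields are.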