Jonas Kübler

Inference Optimization of Foundation Models on AI Accelerators

Jul 12, 2024

Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis

Abstract: Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI across various industries. Industry and the research community have witnessed a large number of new applications based on these foundation models. Such applications include question answering, customer service, image and video generation, and code completion, among others. However, as the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios. As a result, the demand for cost-effective and fast inference using AI accelerators is ever higher. To this end, our tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators. Beginning with an overview of basic Transformer architectures and deep learning system frameworks, we dive deep into system optimization techniques for fast and memory-efficient attention computations and discuss how they can be implemented efficiently on AI accelerators. Next, we describe architectural elements that are key for fast Transformer inference. Finally, we examine various model compression and fast decoding strategies in the same context.

* Tutorial published at KDD 2024. Camera-ready version

Kernel Conditional Moment Test via Maximum Moment Restriction

Mar 07, 2020

Krikamol Muandet, Wittawat Jitkrittum, Jonas Kübler

Abstract: We propose a new family of specification tests called kernel conditional moment (KCM) tests. Our tests are built on conditional moment embeddings (CMME), a novel representation of conditional moment restrictions in a reproducing kernel Hilbert space (RKHS). After transforming the conditional moment restrictions into a continuum of unconditional counterparts, the test statistic is defined as the maximum moment restriction within the unit ball of the RKHS. We show that the CMME fully characterizes the original conditional moment restrictions, leading to consistency in both hypothesis testing and parameter estimation. The proposed test also has an analytic expression that is easy to compute, as well as closed-form asymptotic distributions. Our empirical studies show that the KCM test has promising finite-sample performance compared to existing tests.
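
To make the "analytic expression" concrete, here is a minimal sketch of how a statistic of this kind might be estimated, assuming a scalar residual function, a Gaussian kernel on the conditioning variable, and a U-statistic estimator. The function names, the kernel choice, and the bandwidth are illustrative assumptions, not the authors' implementation, and the sketch does not include the calibration against the asymptotic null distribution that a full test would need.

```python
import numpy as np


def gaussian_kernel(x, bandwidth=1.0):
    """Gram matrix k(x_i, x_j) of a Gaussian kernel on the rows of x (shape (n, d))."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth ** 2))


def mmr_u_statistic(residuals, x, bandwidth=1.0):
    """U-statistic estimate of E[psi(Z) k(X, X') psi(Z')].

    residuals: shape (n,), values of a scalar moment function psi(z_i; theta).
    x:         shape (n, d), the conditioning variables.
    """
    n = len(residuals)
    h = np.outer(residuals, residuals) * gaussian_kernel(x, bandwidth)
    np.fill_diagonal(h, 0.0)  # drop i == j terms, as a U-statistic requires
    return h.sum() / (n * (n - 1))


# Toy check (hypothetical data): y = 2x + noise, so E[y - 2x | x] = 0 holds,
# while E[y - x | x] = x violates the restriction.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
y = 2.0 * x[:, 0] + 0.1 * rng.normal(size=500)

stat_correct = mmr_u_statistic(y - 2.0 * x[:, 0], x)
stat_wrong = mmr_u_statistic(y - 1.0 * x[:, 0], x)
```

Under the correctly specified slope the estimate should sit near zero, while the misspecified slope yields a clearly positive value; a usable test would compare the statistic against a threshold derived from its null distribution.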
