Title: Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer

Mar 04, 2025

Yujiao Yang, Jing Lian, Linhui Li

Abstract: Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. However, each expert in existing MoE paradigms works as an individual, so these paradigms lack high-quality expert interactions. Moreover, MoE has not been effectively extended to attention blocks, which constrains further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes the transformer into an equivalent group of experts and then implements dynamic routing on input data and experts. Our approach advances MoE design with four key innovations: (1) We conduct equivalent expert decomposition on both MLP blocks and attention blocks based on matrix partition in tensor parallelism. (2) We develop two routing paradigms, patch-wise data selection and expert selection, to apply routing across different levels. (3) We design the architecture of the UoE model, including Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We develop a parallel implementation of UoE's routing and computation operations, and optimize efficiency based on hardware processing analysis. Experiments demonstrate that models employing UoE surpass Full Attention, state-of-the-art MoEs, and efficient transformers in several tasks across image and natural language domains. The source code is available at https://github.com/YujiaoYang-work/UoE.

* 17 pages, 6 figures, 5 tables
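
The abstract's "equivalent expert decomposition ... based on matrix partition in tensor parallelism" can be illustrated with a small sketch for an MLP block. This is not the authors' implementation (see their repository for that). It only shows the standard tensor-parallel identity such a decomposition presumably relies on: when the hidden units of a two-layer MLP are partitioned, the slice outputs sum to the dense output, because the activation is elementwise. The function names (`split_into_experts`, `union_forward`), the bias-free ReLU MLP, and the `active` argument standing in for a routing decision are all assumptions made for illustration. UoE's actual routing and its attention decomposition are not reproduced here.

```python
import random

def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(W1, W2, x):
    """Dense two-layer MLP without biases: y = W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def split_into_experts(W1, W2, num_experts):
    """Partition the hidden dimension into equal slices (hypothetical helper).

    W1 has shape (hidden, d_in) and W2 has shape (d_out, hidden). Expert e
    receives a block of rows of W1 and the matching block of columns of W2.
    """
    hidden = len(W1)
    if hidden % num_experts != 0:
        raise ValueError("hidden size must be divisible by num_experts")
    size = hidden // num_experts
    experts = []
    for e in range(num_experts):
        lo, hi = e * size, (e + 1) * size
        experts.append((W1[lo:hi], [row[lo:hi] for row in W2]))
    return experts

def union_forward(experts, x, active=None):
    """Sum the outputs of the selected experts (all of them if active is None)."""
    chosen = range(len(experts)) if active is None else active
    y = [0.0] * len(experts[0][1])
    for e in chosen:
        W1_e, W2_e = experts[e]
        y = [a + b for a, b in zip(y, mlp(W1_e, W2_e, x))]
    return y

def random_matrix(rows, cols, rng):
    """Matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes chosen only for the demonstration.
rng = random.Random(0)
d_in, hidden, d_out = 4, 8, 3
W1 = random_matrix(hidden, d_in, rng)
W2 = random_matrix(d_out, hidden, rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(d_in)]

dense = mlp(W1, W2, x)
experts = split_into_experts(W1, W2, num_experts=4)
union = union_forward(experts, x)

# The two results agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(dense, union))
print(f"max |dense - union| = {max_gap:.2e}")
```

Passing a subset such as `active=[0, 2]` evaluates only those slices, which is the point at which a routing rule would trade accuracy for compute. How UoE chooses that subset is described in the paper, not in this sketch.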
