Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Bowen Shen

DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts

Jun 11, 2025

Yuchen Feng, Bowen Shen, Naibin Gu, Jiaxuan Zhao, Peng Fu, Zheng Lin, Weiping Wang

Abstract:Large language models (LLMs) with the Mixture-of-Experts (MoE) architecture achieve high cost-efficiency by selectively activating a subset of the parameters. Despite the inference efficiency of MoE LLMs, the training of extensive experts from scratch incurs substantial overhead, whereas reconstructing a dense LLM into an MoE LLM significantly reduces the training budget. However, existing reconstruction methods often overlook the diversity among experts, leading to potential redundancy. In this paper, we come up with the observation that a specific LLM exhibits notable diversity after being pruned on different calibration datasets, based on which we present a Diversity-Enhanced reconstruction method named DIVE. The recipe of DIVE includes domain affinity mining, pruning-based expert reconstruction, and efficient retraining. Specifically, the reconstruction includes pruning and reassembly of the feed-forward network (FFN) module. After reconstruction, we efficiently retrain the model on routers, experts and normalization modules. We implement DIVE on Llama-style LLMs with open-source training corpora. Experiments show that DIVE achieves training efficiency with minimal accuracy trade-offs, outperforming existing pruning and MoE reconstruction methods with the same number of activated parameters.

* ACL 2025

Via

Access Paper or Ask Questions

TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization

May 26, 2025

Dingyu Yao, Bowen Shen, Zheng Lin, Wei Liu, Jian Luan, Bin Wang, Weiping Wang

Abstract:The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. However, loading the entire cache incurs significant latency due to PCIe bandwidth bottlenecks in CPU-GPU communication, while aggressive compression causes notable performance degradation. We identify that certain layers in the LLM need to maintain global information and are unsuitable for selective loading. In contrast, other layers primarily focus on a few tokens with dominant activations that potentially incur substantial quantization error. This observation leads to a key insight that loading dominant tokens and quantizing all tokens can complement each other. Building on this insight, we propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading. TailorKV develops an inference framework along with a hardware-friendly implementation that leverages these complementary characteristics. Extensive long-context evaluations exhibit that TailorKV achieves nearly lossless performance under aggressive compression settings, outperforming the state-of-the-art. Particularly, the Llama-3.1-8B with 128k context can be served within a single RTX 3090 GPU, reaching 82 ms per token during decoding.

Via

Access Paper or Ask Questions

MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

May 12, 2025

Xiaomi LLM-Core Team, :, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu(+55 more)

Abstract:We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.

Via

Access Paper or Ask Questions

Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations

Jul 08, 2024

Bowen Shen, Zheng Lin, Daren Zha, Wei Liu, Jian Luan, Bin Wang, Weiping Wang

Figure 1 for Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations

Figure 2 for Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations

Figure 3 for Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations

Figure 4 for Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations

Abstract:Structured pruning fundamentally reduces computational and memory overheads of large language models (LLMs) and offers a feasible solution for end-side LLM deployment. Structurally pruned models remain dense and high-precision, highly compatible with further tuning and compression. However, as the coarse-grained structured pruning poses large damage to the highly interconnected model, achieving a high compression ratio for scaled-up LLMs remains a challenge. In this paper, we introduce a task-agnostic structured pruning approach coupled with a compact Transformer architecture design. The proposed approach, named TransAct, reduces transitional activations inside multi-head attention (MHA) and multi-layer perceptron (MLP) modules, while preserving the inter-module activations that are sensitive to perturbations. Hence, the LLM is pruned into an intra-module low-rank architecture, significantly reducing weights, KV Cache and attention computation. TransAct is implemented on the LLaMA model and evaluated on downstream benchmarks. Results verify the optimality of our approach at high compression with respect to both efficiency and performance. Further, ablation studies reveal the strength of activation-guided iterative pruning and provide experimental analysis on the redundancy of MHA and MLP modules.

* Findings of ACL 2024

Via

Access Paper or Ask Questions

Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning

Jun 06, 2024

Naibin Gu, Peng Fu, Xiyu Liu, Bowen Shen, Zheng Lin, Weiping Wang

Abstract:Parameter-efficient fine-tuning (PEFT) has emerged as the predominant technique for fine-tuning in the era of large language models. However, existing PEFT methods still have inadequate training efficiency. Firstly, the utilization of large-scale foundation models during the training process is excessively redundant for certain fine-tuning tasks. Secondly, as the model size increases, the growth in trainable parameters of empirically added PEFT modules becomes non-negligible and redundant, leading to inefficiency. To achieve task-specific efficient fine-tuning, we propose the Light-PEFT framework, which includes two methods: Masked Early Pruning of the Foundation Model and Multi-Granularity Early Pruning of PEFT. The Light-PEFT framework allows for the simultaneous estimation of redundant parameters in both the foundation model and PEFT modules during the early stage of training. These parameters can then be pruned for more efficient fine-tuning. We validate our approach on GLUE, SuperGLUE, QA tasks, and various models. With Light-PEFT, parameters of the foundation model can be pruned by up to over 40%, while still controlling trainable parameters to be only 25% of the original PEFT method. Compared to utilizing the PEFT method directly, Light-PEFT achieves training and inference speedup, reduces memory usage, and maintains comparable performance and the plug-and-play feature of PEFT.

* Findings of ACL 2024

Via

Access Paper or Ask Questions

SMAT: A Self-Reinforcing Framework for Simultaneous Mapping and Tracking in Unbounded Urban Environments

Apr 27, 2023

Tingxiang Fan, Bowen Shen, Yinqiang Zhang, Chuye Zhang, Lei Yang, Hua Chen, Wei Zhang, Jia Pan

Abstract:With the increasing prevalence of robots in daily life, it is crucial to enable robots to construct a reliable map online to navigate in unbounded and changing environments. Although existing methods can individually achieve the goals of spatial mapping and dynamic object detection and tracking, limited research has been conducted on an effective combination of these two important abilities. The proposed framework, SMAT (Simultaneous Mapping and Tracking), integrates the front-end dynamic object detection and tracking module with the back-end static mapping module using a self-reinforcing mechanism, which promotes mutual improvement of mapping and tracking performance. The conducted experiments demonstrate the framework's effectiveness in real-world applications, achieving successful long-range navigation and mapping in multiple urban environments using only one LiDAR, a CPU-only onboard computer, and a consumer-level GPS receiver.

* homepage: https://sites.google.com/view/smat-nav

Via

Access Paper or Ask Questions

COST-EFF: Collaborative Optimization of Spatial and Temporal Efficiency with Slenderized Multi-exit Language Models

Oct 27, 2022

Bowen Shen, Zheng Lin, Yuanxin Liu, Zhengxiao Liu, Lei Wang, Weiping Wang

Abstract:Transformer-based pre-trained language models (PLMs) mostly suffer from excessive overhead despite their advanced capacity. For resource-constrained devices, there is an urgent need for a spatially and temporally efficient model which retains the major capacity of PLMs. However, existing statically compressed models are unaware of the diverse complexities between input instances, potentially resulting in redundancy and inadequacy for simple and complex inputs. Also, miniature models with early exiting encounter challenges in the trade-off between making predictions and serving the deeper layers. Motivated by such considerations, we propose a collaborative optimization for PLMs that integrates static model compression and dynamic inference acceleration. Specifically, the PLM is slenderized in width while the depth remains intact, complementing layer-wise early exiting to speed up inference dynamically. To address the trade-off of early exiting, we propose a joint training approach that calibrates slenderization and preserves contributive structures to each exit instead of only the final layer. Experiments are conducted on GLUE benchmark and the results verify the Pareto optimality of our approach at high compression and acceleration rate with 1/8 parameters and 1/19 FLOPs of BERT.

* Accepted in EMNLP 2022 main conference

Via

Access Paper or Ask Questions

DynamicFilter: an Online Dynamic Objects Removal Framework for Highly Dynamic Environments

Jun 30, 2022

Tingxiang Fan, Bowen Shen, Hua Chen, Wei Zhang, Jia Pan

Figure 1 for DynamicFilter: an Online Dynamic Objects Removal Framework for Highly Dynamic Environments

Figure 2 for DynamicFilter: an Online Dynamic Objects Removal Framework for Highly Dynamic Environments

Figure 3 for DynamicFilter: an Online Dynamic Objects Removal Framework for Highly Dynamic Environments

Figure 4 for DynamicFilter: an Online Dynamic Objects Removal Framework for Highly Dynamic Environments

Abstract:Emergence of massive dynamic objects will diversify spatial structures when robots navigate in urban environments. Therefore, the online removal of dynamic objects is critical. In this paper, we introduce a novel online removal framework for highly dynamic urban environments. The framework consists of the scan-to-map front-end and the map-to-map back-end modules. Both the front- and back-ends deeply integrate the visibility-based approach and map-based approach. The experiments validate the framework in highly dynamic simulation scenarios and real-world datasets.

* ICRA 2022

Via

Access Paper or Ask Questions

Deep Learning for Highly Accelerated Diffusion Tensor Imaging

Feb 03, 2020

Hongyu Li, Zifei Liang, Chaoyi Zhang, Ruiying Liu, Jing Li, Weihong Zhang, Dong Liang, Bowen Shen, Xiaoliang Zhang, Yulin Ge(+2 more)

Figure 1 for Deep Learning for Highly Accelerated Diffusion Tensor Imaging

Figure 2 for Deep Learning for Highly Accelerated Diffusion Tensor Imaging

Figure 3 for Deep Learning for Highly Accelerated Diffusion Tensor Imaging

Figure 4 for Deep Learning for Highly Accelerated Diffusion Tensor Imaging

Abstract:Diffusion tensor imaging (DTI) is widely used to examine the human brain white matter structures, including their microarchitecture integrity and spatial fiber tract trajectories, with clinical applications in several neurological disorders and neurosurgical guidance. However, a major factor that prevents DTI from being incorporated in clinical routines is its long scan time due to the acquisition of a large number (typically 30 or more) of diffusion-weighted images (DWIs) required for reliable tensor estimation. Here, a deep learning-based technique is developed to obtain diffusion tensor images with only six DWIs, resulting in a significant reduction in imaging time. The method uses deep convolutional neural networks to learn the highly nonlinear relationship between DWIs and several tensor-derived maps, bypassing the conventional tensor fitting procedure, which is well known to be highly susceptible to noises in DWIs. The performance of the method was evaluated using DWI datasets from the Human Connectome Project and patients with ischemic stroke. Our results demonstrate that the proposed technique is able to generate quantitative maps of good quality fractional anisotropy (FA) and mean diffusivity (MD), as well as the fiber tractography from as few as six DWIs. The proposed method achieves a quantification error of less than 5% in all regions of interest of the brain, which is the rate of in vivo reproducibility of diffusion tensor imaging. Tractography reconstruction is also comparable to the ground truth obtained from 90 DWIs. In addition, we also demonstrate that the neural network trained on healthy volunteers can be directly applied/tested on stroke patients' DWIs data without compromising the lesion detectability. Such a significant reduction in scan time will allow inclusion of DTI into clinical routine for many potential applications.

* 13 pages, 9 figures

Via

Access Paper or Ask Questions