Weiyu Cheng

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Jun 16, 2025

MiniMax: Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu (+118 more)

Abstract: We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.

* A technical report from MiniMax. The authors are listed in alphabetical order. We open-source our MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1
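The abstract's one-line description of CISPO ("clips importance sampling weights rather than token updates") can be illustrated with a small sketch. The code below is not taken from the paper: it assumes a REINFORCE-style objective in which the clipped importance ratio is treated as a constant weight on each token's log-probability gradient, and it contrasts that with the familiar PPO clipped surrogate, where a token outside the clip range contributes no gradient at all. The function names and the `eps`, `eps_low`, and `eps_high` values are hypothetical.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def ppo_token_weight(ratio, advantage, eps=0.2):
    """Coefficient on grad log pi(token) under the PPO clipped surrogate.

    PPO maximizes min(r * A, clip(r) * A). When the clipped branch is the
    strict minimum, the objective is constant in the policy, so the token
    receives zero gradient (the "token update" is clipped away).
    """
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps)
    if ratio * advantage <= clipped * advantage:
        return ratio * advantage
    return 0.0


def cispo_style_token_weight(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Coefficient on grad log pi(token) when only the weight is clipped.

    Assumed objective: stop_gradient(clip(r)) * A * log pi(token).
    Every token keeps a gradient; only its magnitude is bounded.
    """
    return clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage


if __name__ == "__main__":
    # (importance ratio, advantage) pairs chosen purely for illustration.
    for r, a in [(1.0, 1.0), (1.5, 1.0), (0.5, -1.0), (1.5, -1.0)]:
        print(r, a, ppo_token_weight(r, a), cispo_style_token_weight(r, a))
```

Under these assumptions, a token with ratio 1.5 and positive advantage is dropped by the PPO surrogate but still contributes a bounded gradient in the weight-clipping variant, which is the distinction the abstract draws.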

MiniMax-01: Scaling Foundation Models with Lightning Attention

Jan 14, 2025

MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen (+80 more)

Abstract: We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering a 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.

* A technical report from MiniMax. The authors are listed in alphabetical order. We open-sourced our MiniMax-01 at https://github.com/MiniMax-AI
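The abstract does not spell out how lightning attention works. As a rough intuition for why attention in the linear-attention family scales to very long contexts, the sketch below assumes the plainest form of causal linear attention (no softmax, no normalization, no decay, no feature map) and shows that a running key-value state reproduces the quadratic computation while doing constant work per token. It is a generic illustration under those assumptions, not MiniMax's kernel; both function names are hypothetical.

```python
def quadratic_causal_linear_attention(Q, K, V):
    """Reference form: out[t] = sum over s <= t of (q_t . k_s) * v_s.

    Cost grows quadratically with sequence length.
    """
    out = []
    for t in range(len(Q)):
        acc = [0.0] * len(V[0])
        for s in range(t + 1):
            score = sum(q * k for q, k in zip(Q[t], K[s]))
            for j in range(len(acc)):
                acc[j] += score * V[s][j]
        out.append(acc)
    return out


def recurrent_causal_linear_attention(Q, K, V):
    """Same result using a running state S = sum over s <= t of k_s v_s^T.

    Each step costs O(d_k * d_v), independent of how many tokens came before.
    """
    d_k, d_v = len(K[0]), len(V[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        # Fold the current token into the state before reading it (causal).
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        out.append([sum(q[i] * state[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    K = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(quadratic_causal_linear_attention(Q, K, V))
    print(recurrent_causal_linear_attention(Q, K, V))
```

The two functions return the same values; the recurrent form is the reason per-token cost need not grow with context length in this family of mechanisms.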

RESUS: Warm-Up Cold Users via Meta-Learning Residual User Preferences in CTR Prediction

Oct 28, 2022

Yanyan Shen, Lifan Zhao, Weiyu Cheng, Zibin Zhang, Wenwen Zhou, Kangyi Lin

Abstract: Click-Through Rate (CTR) prediction on cold users is a challenging task in recommender systems. Recent research has resorted to meta-learning to tackle the cold-user challenge, either performing few-shot user representation learning or adopting optimization-based meta-learning. However, existing methods suffer from information loss or an inefficient optimization process, and they fail to explicitly model global user preference knowledge, which is crucial to complement the sparse and insufficient preference information of cold users. In this paper, we propose a novel and efficient approach named RESUS, which decouples the learning of global preference knowledge contributed by collective users from the learning of residual preferences for individual users. Specifically, we employ a shared predictor to infer basis user preferences, which acquires global preference knowledge from the interactions of different users. Meanwhile, we develop two efficient algorithms based on the nearest neighbor and ridge regression predictors, which infer residual user preferences via learning quickly from a few user-specific interactions. Extensive experiments on three public datasets demonstrate that our RESUS approach is efficient and effective in improving CTR prediction accuracy on cold users, compared with various state-of-the-art methods.

* Accepted by TOIS 2022. Code is available at https://github.com/MogicianXD/RESUS
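The decoupling described in the abstract (a shared predictor for basis preferences plus a quickly learned residual from a few user-specific interactions) can be pictured with a nearest-neighbor style sketch. Everything concrete below is an assumption made for illustration: the similarity measure (negative squared Euclidean distance with softmax weighting), the choice to add the residual in probability space, the final clamp to [0, 1], and the function name. The released code linked above is the authoritative reference.

```python
import math


def nn_residual_prediction(base_query, query_emb, support):
    """Add a similarity-weighted residual to the shared predictor's output.

    base_query: shared-predictor click probability for the query item.
    query_emb:  feature vector of the query interaction.
    support:    list of (embedding, label, base_prediction) tuples built from
                the cold user's few observed interactions.
    """
    if not support:
        # No user-specific evidence: fall back to the global prediction.
        return base_query
    # Assumed similarity: negative squared Euclidean distance.
    logits = [-sum((a - b) ** 2 for a, b in zip(query_emb, emb))
              for emb, _, _ in support]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]  # numerically stable softmax
    total = sum(weights)
    # Residual: how far each observed label sits from the shared prediction.
    residual = sum(w * (label - base)
                   for w, (_, label, base) in zip(weights, support)) / total
    return min(1.0, max(0.0, base_query + residual))


if __name__ == "__main__":
    support = [([1.0, 0.0], 1.0, 0.3), ([0.0, 1.0], 0.0, 0.4)]
    print(nn_residual_prediction(0.2, [1.0, 0.1], support))
```

With an empty support set the function returns the shared prediction unchanged, which mirrors the idea that global knowledge carries the prediction when a user has no history.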

Differentiable Neural Input Search for Recommender Systems

Jun 08, 2020

Weiyu Cheng, Yanyan Shen, Linpeng Huang

Abstract: Latent factor models are the driving forces of the state-of-the-art recommender systems, with an important insight of vectorizing raw input features into dense embeddings. The dimensions of different feature embeddings are often set to a uniform value manually or through grid search, which may yield suboptimal model performance. Existing work applied heuristic methods or reinforcement learning to search for varying embedding dimensions. However, the embedding dimension per feature is rigidly chosen from a restricted set of candidates due to the scalability issue involved in the optimization process over a large search space. In this paper, we propose a differentiable neural input search algorithm towards learning more flexible dimensions of feature embeddings, namely a mixed dimension scheme, leading to better recommendation performance and lower memory cost. Our method can be seamlessly incorporated with various existing architectures of latent factor models for recommendation. We conduct experiments with 6 state-of-the-art model architectures on two typical recommendation tasks: Collaborative Filtering (CF) and Click-Through-Rate (CTR) prediction. The results demonstrate that our method achieves the best recommendation performance compared with 3 neural input search approaches over all the model architectures, and can reduce the number of embedding parameters by 2x and 20x on CF and CTR prediction, respectively.
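To make the idea of a "mixed dimension scheme" concrete, the sketch below shows one way different features could end up with different numbers of embedding dimensions. The abstract does not describe the mechanism, so the per-dimension gate values (`alpha`), the magnitude threshold, and both function names are hypothetical; in a differentiable search the gates would be learned jointly with the embeddings rather than fixed by hand.

```python
def prune_embeddings(embeddings, alpha, threshold):
    """Gate each embedding entry, then zero the entries that fall below threshold.

    embeddings: one row of floats per feature (uniform width before pruning).
    alpha:      per-entry gate values in [0, 1], same shape as embeddings.
    """
    pruned = []
    for row, gates in zip(embeddings, alpha):
        pruned.append([e * a if abs(e * a) >= threshold else 0.0
                       for e, a in zip(row, gates)])
    return pruned


def kept_dimensions(pruned):
    """Number of surviving (non-zero) dimensions for each feature."""
    return [sum(1 for x in row if x != 0.0) for row in pruned]


if __name__ == "__main__":
    # Assumed setup: feature 0 is frequent and informative, feature 1 is rare.
    embeddings = [[0.9, -0.8, 0.7, 0.6],
                  [0.5, 0.4, -0.3, 0.2]]
    alpha = [[1.0, 1.0, 1.0, 1.0],
             [1.0, 0.5, 0.1, 0.0]]
    pruned = prune_embeddings(embeddings, alpha, threshold=0.1)
    print(kept_dimensions(pruned))
```

In this toy setup the first feature keeps all four dimensions and the second keeps two, so the embedding table holds fewer parameters without forcing every feature down to the same width.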

Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions

Sep 07, 2019

Weiyu Cheng, Yanyan Shen, Linpeng Huang

Abstract: Various factorization-based methods have been proposed to leverage second-order or higher-order cross features for boosting the performance of predictive models. They generally enumerate all the cross features under a predefined maximum order, and then identify useful feature interactions through model training, which suffers from two drawbacks. First, they have to make a trade-off between the expressiveness of higher-order cross features and the computational cost, resulting in suboptimal predictions. Second, enumerating all the cross features, including irrelevant ones, may introduce noisy feature combinations that degrade model performance. In this work, we propose the Adaptive Factorization Network (AFN), a new model that learns arbitrary-order cross features adaptively from data. The core of AFN is a logarithmic transformation layer to convert the power of each feature in a feature combination into the coefficient to be learned. The experimental results on four real datasets demonstrate the superior predictive performance of AFN against state-of-the-art methods.
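The logarithmic transformation rests on the identity x1^w1 * x2^w2 * ... = exp(w1 ln x1 + w2 ln x2 + ...), which turns each feature's power into an ordinary linear coefficient. The sketch below applies that identity element-wise to feature embeddings. The function name, the absolute value, and the small `eps` clamp (used to keep the logarithm defined) are assumptions for illustration rather than details from the abstract.

```python
import math


def log_neuron(embeddings, exponents, eps=1e-7):
    """Element-wise cross feature: out[k] = prod_j embeddings[j][k] ** exponents[j].

    Computed as exp(sum_j exponents[j] * ln(embeddings[j][k])), so each exponent
    enters as a linear coefficient that gradient descent could learn.
    Inputs are made positive (abs, then clamp at eps) before taking the log.
    """
    width = len(embeddings[0])
    out = []
    for k in range(width):
        log_sum = sum(w * math.log(max(abs(row[k]), eps))
                      for row, w in zip(embeddings, exponents))
        out.append(math.exp(log_sum))
    return out


if __name__ == "__main__":
    # Two feature embeddings of width 2; exponents (1, 2) give e0 * e1**2.
    print(log_neuron([[2.0, 4.0], [3.0, 1.0]], [1.0, 2.0]))
    # An exponent of 0 removes a feature from the combination entirely.
    print(log_neuron([[2.0, 4.0], [3.0, 1.0]], [1.0, 0.0]))
```

Because the exponents are real-valued, a single neuron of this kind can represent a cross feature of any order, including dropping a feature by driving its exponent to zero.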

Explaining Latent Factor Models for Recommendation with Influence Functions

Nov 20, 2018

Weiyu Cheng, Yanyan Shen, Yanmin Zhu, Linpeng Huang

Abstract: Latent factor models (LFMs) such as matrix factorization achieve state-of-the-art performance among various Collaborative Filtering (CF) approaches for recommendation. Despite the high recommendation accuracy of LFMs, a critical issue to be resolved is the lack of explainability. Extensive efforts have been made in the literature to incorporate explainability into LFMs. However, they either rely on auxiliary information which may not be available in practice, or fail to provide easy-to-understand explanations. In this paper, we propose a fast influence analysis method named FIA, which successfully enforces explicit neighbor-style explanations to LFMs with the technique of influence functions stemming from robust statistics. We first describe how to apply influence functions to LFMs to deliver neighbor-style explanations. Then we develop a novel influence computation algorithm for matrix factorization with high efficiency. We further extend it to the more general neural collaborative filtering and introduce an approximation algorithm to accelerate influence analysis over neural network models. Experimental results on real datasets demonstrate the correctness, efficiency and usefulness of our proposed method.
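For readers unfamiliar with influence functions, the standard first-order formulation is sketched below. It is the general textbook form, not necessarily the exact variant or notation used by FIA; the symbols (training examples z_k, loss L, parameters theta, predicted score y-hat for user u and item i) are assumptions introduced here.

```latex
% Fitted parameters and the Hessian of the empirical risk (assumed notation).
\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{k=1}^{n} L(z_k, \theta),
\qquad
H = \frac{1}{n} \sum_{k=1}^{n} \nabla_{\theta}^{2} L(z_k, \hat{\theta})

% Influence of upweighting a training rating z on the predicted score \hat{y}_{ui}.
\mathcal{I}(z, \hat{y}_{ui})
  = -\nabla_{\theta} \hat{y}_{ui}(\hat{\theta})^{\top} H^{-1}
     \nabla_{\theta} L(z, \hat{\theta})

% Removing z corresponds to upweighting it by -1/n, so to first order:
\hat{y}_{ui}^{\,-z} - \hat{y}_{ui} \approx -\frac{1}{n}\, \mathcal{I}(z, \hat{y}_{ui})
```

Ranking a user's past ratings by the magnitude of this estimated change is what makes a neighbor-style explanation possible: the ratings whose removal would move the prediction most are the ones presented as its cause.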
