Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shihao Xu

Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought

May 21, 2025

Ao Liu, Botong Zhou, Can Xu, Chayse Zhou, ChenChen Zhang, Chengcheng Xu, Chenhao Wang, Decheng Wu, Dengpeng Wu, Dian Jiao(+239 more)

Abstract:As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep "thinking" modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.

Via

Access Paper or Ask Questions

Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context Learning

Dec 12, 2024

Shihao Xu, Yiyang Luo, Wei Shi

Abstract:Geometry mathematics problems pose significant challenges for large language models (LLMs) because they involve visual elements and spatial reasoning. Current methods primarily rely on symbolic character awareness to address these problems. Considering geometry problem solving is a relatively nascent field with limited suitable datasets and currently almost no work on solid geometry problem solving, we collect a geometry question-answer dataset by sourcing geometric data from Chinese high school education websites, referred to as GeoMath. It contains solid geometry questions and answers with accurate reasoning steps as compensation for existing plane geometry datasets. Additionally, we propose a Large Multi-modal Model (LMM) framework named Geo-LLaVA, which incorporates retrieval augmentation with supervised fine-tuning (SFT) in the training stage, called meta-training, and employs in-context learning (ICL) during inference to improve performance. Our fine-tuned model with ICL attains the state-of-the-art performance of 65.25% and 42.36% on selected questions of the GeoQA dataset and GeoMath dataset respectively with proper inference steps. Notably, our model initially endows the ability to solve solid geometry problems and supports the generation of reasonable solid geometry picture descriptions and problem-solving steps. Our research sets the stage for further exploration of LLMs in multi-modal math problem-solving, particularly in geometry math problems.

Via

Access Paper or Ask Questions

Long Term Memory: The Foundation of AI Self-Evolution

Oct 21, 2024

Xun Jiang, Feng Li, Han Zhao, Jiaying Wang, Jun Shao, Shihao Xu, Shu Zhang, Weiling Chen, Xavier Tang, Yize Chen(+4 more)

Figure 1 for Long Term Memory: The Foundation of AI Self-Evolution

Figure 2 for Long Term Memory: The Foundation of AI Self-Evolution

Figure 3 for Long Term Memory: The Foundation of AI Self-Evolution

Figure 4 for Long Term Memory: The Foundation of AI Self-Evolution

Abstract:Large language models (LLMs) like GPTs, trained on vast datasets, have demonstrated impressive capabilities in language understanding, reasoning, and planning, achieving human-level performance in various tasks. Most studies focus on enhancing these models by training on ever-larger datasets to build more powerful foundation models. While training stronger models is important, enabling models to evolve during inference is equally crucial, a process we refer to as AI self-evolution. Unlike large-scale training, self-evolution may rely on limited data or interactions. Inspired by the columnar organization of the human cerebral cortex, we hypothesize that AI models could develop cognitive abilities and build internal representations through iterative interactions with their environment. To achieve this, models need long-term memory (LTM) to store and manage processed interaction data. LTM supports self-evolution by representing diverse experiences across environments and agents. In this report, we explore AI self-evolution and its potential to enhance models during inference. We examine LTM's role in lifelong learning, allowing models to evolve based on accumulated interactions. We outline the structure of LTM and the systems needed for effective data retention and representation. We also classify approaches for building personalized models with LTM data and show how these models achieve self-evolution through interaction. Using LTM, our multi-agent framework OMNE achieved first place on the GAIA benchmark, demonstrating LTM's potential for AI self-evolution. Finally, we present a roadmap for future research, emphasizing the importance of LTM for advancing AI technology and its practical applications.

* 56 pages, 13 figures

Via

Access Paper or Ask Questions

Multi-Level Graph Encoding with Structural-Collaborative Relation Learning for Skeleton-Based Person Re-Identification

Jun 06, 2021

Haocong Rao, Shihao Xu, Xiping Hu, Jun Cheng, Bin Hu

Figure 1 for Multi-Level Graph Encoding with Structural-Collaborative Relation Learning for Skeleton-Based Person Re-Identification

Figure 2 for Multi-Level Graph Encoding with Structural-Collaborative Relation Learning for Skeleton-Based Person Re-Identification

Figure 3 for Multi-Level Graph Encoding with Structural-Collaborative Relation Learning for Skeleton-Based Person Re-Identification

Figure 4 for Multi-Level Graph Encoding with Structural-Collaborative Relation Learning for Skeleton-Based Person Re-Identification

Abstract:Skeleton-based person re-identification (Re-ID) is an emerging open topic providing great value for safety-critical applications. Existing methods typically extract hand-crafted features or model skeleton dynamics from the trajectory of body joints, while they rarely explore valuable relation information contained in body structure or motion. To fully explore body relations, we construct graphs to model human skeletons from different levels, and for the first time propose a Multi-level Graph encoding approach with Structural-Collaborative Relation learning (MG-SCR) to encode discriminative graph features for person Re-ID. Specifically, considering that structurally-connected body components are highly correlated in a skeleton, we first propose a multi-head structural relation layer to learn different relations of neighbor body-component nodes in graphs, which helps aggregate key correlative features for effective node representations. Second, inspired by the fact that body-component collaboration in walking usually carries recognizable patterns, we propose a cross-level collaborative relation layer to infer collaboration between different level components, so as to capture more discriminative skeleton graph features. Finally, to enhance graph dynamics encoding, we propose a novel self-supervised sparse sequential prediction task for model pre-training, which facilitates encoding high-level graph semantics for person Re-ID. MG-SCR outperforms state-of-the-art skeleton-based methods, and it achieves superior performance to many multi-modal methods that utilize extra RGB or depth features. Our codes are available at https://github.com/Kali-Hac/MG-SCR.

* In IJCAI, 2021
* Accepted at IJCAI 2021 Main Track. Sole copyright holder is IJCAI. Codes are available at https://github.com/Kali-Hac/MG-SCR

Via

Access Paper or Ask Questions

Prototypical Contrast and Reverse Prediction: Unsupervised Skeleton Based Action Recognition

Nov 14, 2020

Shihao Xu, Haocong Rao, Xiping Hu, Bin Hu

Figure 1 for Prototypical Contrast and Reverse Prediction: Unsupervised Skeleton Based Action Recognition

Figure 2 for Prototypical Contrast and Reverse Prediction: Unsupervised Skeleton Based Action Recognition

Figure 3 for Prototypical Contrast and Reverse Prediction: Unsupervised Skeleton Based Action Recognition

Figure 4 for Prototypical Contrast and Reverse Prediction: Unsupervised Skeleton Based Action Recognition

Abstract:In this paper, we focus on unsupervised representation learning for skeleton-based action recognition. Existing approaches usually learn action representations by sequential prediction but they suffer from the inability to fully learn semantic information. To address this limitation, we propose a novel framework named Prototypical Contrast and Reverse Prediction (PCRP), which not only creates reverse sequential prediction to learn low-level information (e.g., body posture at every frame) and high-level pattern (e.g., motion order), but also devises action prototypes to implicitly encode semantic similarity shared among sequences. In general, we regard action prototypes as latent variables and formulate PCRP as an expectation-maximization task. Specifically, PCRP iteratively runs (1) E-step as determining the distribution of prototypes by clustering action encoding from the encoder, and (2) M-step as optimizing the encoder by minimizing the proposed ProtoMAE loss, which helps simultaneously pull the action encoding closer to its assigned prototype and perform reverse prediction task. Extensive experiments on N-UCLA, NTU 60, and NTU 120 dataset present that PCRP outperforms state-of-the-art unsupervised methods and even achieves superior performance over some of supervised methods. Codes are available at https://github.com/Mikexu007/PCRP.

* Codes are available at https://github.com/Mikexu007/PCRP

Via

Access Paper or Ask Questions

Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition

Aug 18, 2020

Haocong Rao, Shihao Xu, Xiping Hu, Jun Cheng, Bin Hu

Figure 1 for Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition

Figure 2 for Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition

Figure 3 for Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition

Figure 4 for Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition

Abstract:Action recognition via 3D skeleton data is an emerging important topic in these years. Most existing methods either extract hand-crafted descriptors or learn action representations by supervised learning paradigms that require massive labeled data. In this paper, we for the first time propose a contrastive action learning paradigm named AS-CAL that can leverage different augmentations of unlabeled skeleton data to learn action representations in an unsupervised manner. Specifically, we first propose to contrast similarity between augmented instances (query and key) of the input skeleton sequence, which are transformed by multiple novel augmentation strategies, to learn inherent action patterns ("pattern-invariance") of different skeleton transformations. Second, to encourage learning the pattern-invariance with more consistent action representations, we propose a momentum LSTM, which is implemented as the momentum-based moving average of LSTM based query encoder, to encode long-term action dynamics of the key sequence. Third, we introduce a queue to store the encoded keys, which allows our model to flexibly reuse proceeding keys and build a more consistent dictionary to improve contrastive learning. Last, by temporally averaging the hidden states of action learned by the query encoder, a novel representation named Contrastive Action Encoding (CAE) is proposed to represent human's action effectively. Extensive experiments show that our approach typically improves existing hand-crafted methods by 10-50% top-1 accuracy, and it can achieve comparable or even superior performance to numerous supervised learning methods.

* Our codes are available at https://github.com/Mikexu007/AS-CAL

Via

Access Paper or Ask Questions

Emotion Recognition From Gait Analyses: Current Research and Future Directions

Mar 13, 2020

Shihao Xu, Jing Fang, Xiping Hu, Edith Ngai, Yi Guo, Victor C. M. Leung, Jun Cheng, Bin Hu

Figure 1 for Emotion Recognition From Gait Analyses: Current Research and Future Directions

Figure 2 for Emotion Recognition From Gait Analyses: Current Research and Future Directions

Figure 3 for Emotion Recognition From Gait Analyses: Current Research and Future Directions

Figure 4 for Emotion Recognition From Gait Analyses: Current Research and Future Directions

Abstract:Human gait refers to a daily motion that represents not only mobility, but it can also be used to identify the walker by either human observers or computers. Recent studies reveal that gait even conveys information about the walker's emotion. Individuals in different emotion states may show different gait patterns. The mapping between various emotions and gait patterns provides a new source for automated emotion recognition. Compared to traditional emotion detection biometrics, such as facial expression, speech and physiological parameters, gait is remotely observable, more difficult to imitate, and requires less cooperation from the subject. These advantages make gait a promising source for emotion detection. This article reviews current research on gait-based emotion detection, particularly on how gait parameters can be affected by different emotion states and how the emotion states can be recognized through distinct gait patterns. We focus on the detailed methods and techniques applied in the whole process of emotion recognition: data collection, preprocessing, and classification. At last, we discuss possible future developments of efficient and effective gait-based emotion recognition using the state of the art techniques on intelligent computation and big data.

* Submitted to a journal on 06-Aug-2019

Via

Access Paper or Ask Questions