Zihan Jiang

TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation

Mar 06, 2025

Lin Sun, Guangxiang Zhao, Xiaoqi Jian, Yuhan Wu, Weihong Lin, Yongfu Zhu, Change Jia, Linglin Zhang, Jinzhu Wu, Junfeng Ran(+7 more)

Abstract: The challenge of reducing the size of Large Language Models (LLMs) while maintaining their performance has gained significant attention. However, existing methods, such as model distillation and transfer learning, often fail to achieve high accuracy. To address this limitation, we introduce the Branch-Merge distillation approach, which enhances model compression through two phases: (1) the Branch Phase, where knowledge from a large teacher model is selectively distilled into specialized student models via domain-specific supervised fine-tuning (SFT); and (2) the Merge Phase, where these student models are merged to enable cross-domain knowledge transfer and improve generalization. We validate our distillation approach using DeepSeek-R1 as the teacher and DeepSeek-R1-Distill-Qwen-32B as the student. The resulting merged model, TinyR1-32B-Preview, outperforms its counterpart DeepSeek-R1-Distill-Qwen-32B across multiple benchmarks, including Mathematics (+5.5 points), Coding (+4.4 points), and Science (+2.9 points), while achieving near-equal performance to DeepSeek-R1 on AIME 2024. The Branch-Merge distillation approach provides a scalable solution for creating smaller, high-performing LLMs with reduced computational cost and time.

* Preprint
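
The Merge Phase described in the abstract combines several domain-specialized students into one model. The abstract does not say how the merge is computed, so the sketch below uses plain weighted parameter averaging as a stand-in, not the authors' method. The function name `merge_students`, the parameter names, and the toy values are all hypothetical.

```python
from typing import Dict, List, Optional

# A toy "model": parameter name -> flat list of weights (hypothetical layout).
Params = Dict[str, List[float]]


def merge_students(students: List[Params],
                   weights: Optional[List[float]] = None) -> Params:
    """Merge same-architecture students by weighted parameter averaging.

    Assumption: averaging is only a stand-in for the paper's merge step,
    which the abstract does not specify.
    """
    if not students:
        raise ValueError("need at least one student model")
    if weights is None:
        weights = [1.0] * len(students)  # uniform average by default
    if len(weights) != len(students):
        raise ValueError("one weight per student is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")

    merged: Params = {}
    for name, reference in students[0].items():
        # Every student must carry the same parameter with the same size.
        for student in students[1:]:
            if name not in student or len(student[name]) != len(reference):
                raise ValueError(f"parameter mismatch for {name!r}")
        merged[name] = [
            sum(w * s[name][i] for w, s in zip(weights, students)) / total
            for i in range(len(reference))
        ]
    return merged


# Three toy students standing in for math, coding, and science branches.
math_student = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
code_student = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
science_student = {"layer.weight": [5.0, 6.0], "layer.bias": [2.0]}

merged = merge_students([math_student, code_student, science_student])
```

With uniform weights each merged parameter is the mean of the three students' values. Passing `weights` lets one branch count for more than the others.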

INT-FlashAttention: Enabling Flash Attention for INT8 Quantization

Sep 26, 2024

Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su, Tong Yang

Abstract: As the foundation of large language models (LLMs), the self-attention module faces the challenge of quadratic time and memory complexity with respect to sequence length. FlashAttention accelerates attention computation and reduces its memory usage by leveraging the GPU memory hierarchy. A promising research direction is to integrate FlashAttention with quantization methods. This paper introduces INT-FlashAttention, the first INT8 quantization architecture compatible with the forward workflow of FlashAttention, which significantly improves the inference speed of FlashAttention on Ampere GPUs. We implement our INT-FlashAttention prototype with fully INT8 activations and general matrix-multiplication (GEMM) kernels, making it the first attention operator with fully INT8 input. As a general token-level post-training quantization framework, INT-FlashAttention is also compatible with other data formats such as INT4. Experimental results show that INT-FlashAttention achieves 72% faster inference speed and 82% smaller quantization error compared to standard FlashAttention with FP16 and FP8 data formats.
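
The abstract describes token-level post-training quantization with INT8 inputs to the attention GEMMs. A minimal sketch of that general idea follows, assuming symmetric quantization with one scale per token. It is not the paper's kernel and contains no FlashAttention tiling or softmax. The helper names `quantize_token` and `int8_scores` and the sample matrices are hypothetical.

```python
from typing import List, Tuple


def quantize_token(vec: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-token quantization to the range [-127, 127].

    Returns the integer codes and the scale such that vec[i] ~= codes[i] * scale.
    """
    max_abs = max(abs(v) for v in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid division by zero
    codes = [max(-127, min(127, round(v / scale))) for v in vec]
    return codes, scale


def int8_scores(Q: List[List[float]], K: List[List[float]]) -> List[List[float]]:
    """Approximate Q @ K^T with integer dot products plus per-token rescaling."""
    q_quant = [quantize_token(row) for row in Q]
    k_quant = [quantize_token(row) for row in K]
    return [
        # The inner sum is pure integer arithmetic; the two scales undo the quantization.
        [sq * sk * sum(a * b for a, b in zip(q_codes, k_codes))
         for k_codes, sk in k_quant]
        for q_codes, sq in q_quant
    ]


# Toy query and key matrices: two tokens each, head dimension three.
Q = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
K = [[1.0, 0.5, -0.5], [-0.2, 0.4, 0.8]]

approx = int8_scores(Q, K)
exact = [[sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]
```

Because each token carries its own scale, a token with large activations does not force coarse quantization steps on the others. Comparing `approx` with `exact` shows the size of the quantization error on this toy input.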

MCNS: Mining Causal Natural Structures Inside Time Series via A Novel Internal Causality Scheme

Sep 13, 2023

Yuanhao Liu, Dehui Du, Zihan Jiang, Anyan Huang, Yiyang Li

Abstract: Causal inference permits us to discover covert relationships among variables in time series. However, in most existing works, these variables are the dimensions of the series. Causality between dimensions can be cursory, which hinders comprehension of the internal relationships and limits the benefit of the causal graph to neural networks (NNs). In this paper, we find that causality exists not only outside but also inside a time series, because the series reflects a succession of events in the real world. This inspires us to seek the relationships between internal subsequences. The challenges are the difficulty of discovering causality from subsequences and of using the causal natural structures to improve NNs. To address these challenges, we propose a novel framework called Mining Causal Natural Structure (MCNS), which is automatic and domain-agnostic and helps to find the causal natural structures inside time series via the internal causality scheme. We evaluate the MCNS framework and NNs impregnated with MCNS on time series classification tasks. Experimental results illustrate that our impregnation, by refining attention, shape selection classification, and pruning datasets, gives the NN, and even the data itself, better accuracy and interpretability. Besides, MCNS provides an in-depth, solid summary of the time series and datasets.

* 9 pages, 6 figures

CMLCompiler: A Unified Compiler for Classical Machine Learning

Feb 01, 2023

Xu Wen, Wanling Gao, Anzheng Li, Lei Wang, Zihan Jiang, Jianfeng Zhan

Abstract: Classical machine learning (CML) occupies nearly half of machine learning pipelines in production applications. Unfortunately, it fails to fully utilize state-of-the-practice devices and performs poorly. Without a unified framework, hybrid deployments of deep learning (DL) and CML also suffer from severe performance and portability issues. This paper presents the design of a unified compiler, called CMLCompiler, for CML inference. We propose two unified abstractions: operator representations and extended computational graphs. The CMLCompiler framework performs conversion and graph optimization based on these two unified abstractions, then outputs an optimized computational graph to DL compilers or frameworks. We implement CMLCompiler on TVM. The evaluation shows CMLCompiler's portability and superior performance. It achieves up to 4.38x speedup on CPU, 3.31x speedup on GPU, and 5.09x speedup on IoT devices, compared to the state-of-the-art solutions: scikit-learn, intel sklearn, and hummingbird. Mixed CML and DL pipelines achieve up to 3.04x speedup compared with cross-framework implementations.

FuncFooler: A Practical Black-box Attack Against Learning-based Binary Code Similarity Detection Methods

Aug 26, 2022

Lichen Jia, Bowen Tang, Chenggang Wu, Zhe Wang, Zihan Jiang, Yuanming Lai, Yan Kang, Ning Liu, Jingfeng Zhang

Abstract: Binary code similarity detection (BCSD) measures the similarity of two binary executables. Recently, learning-based BCSD methods have achieved great success, outperforming traditional BCSD in detection accuracy and efficiency. However, existing studies on the adversarial vulnerability of learning-based BCSD methods are rather sparse, and this vulnerability poses hazards in security-related applications. To evaluate adversarial robustness, this paper designs an efficient black-box adversarial code generation algorithm, namely FuncFooler. FuncFooler constrains the adversarial code 1) to keep the program's control flow graph (CFG) unchanged, and 2) to preserve the same semantic meaning. Specifically, FuncFooler consecutively 1) determines vulnerable candidates in the malicious code, 2) chooses and inserts adversarial instructions from the benign code, and 3) corrects the semantic side effects of the adversarial code to meet the constraints. Empirically, FuncFooler can successfully attack three learning-based BCSD models, SAFE, Asm2Vec, and jTrans, which calls into question whether learning-based BCSD is desirable.

* 9 pages, 4 figures

OpenClinicalAI: enabling AI to diagnose diseases in real-world clinical settings

Sep 09, 2021

Yunyou Huang, Nana Wang, Suqin Tang, Li Ma, Tianshu Hao, Zihan Jiang, Fan Zhang, Guoxin Kang, Xiuxia Miao, Xianglong Guan(+3 more)

Abstract: This paper quantitatively reveals that state-of-the-art and state-of-the-practice AI systems achieve acceptable performance only under the stringent condition that all categories of subjects are known, which we call closed clinical settings, but fail to work in real-world clinical settings. Compared to the diagnosis task in the closed setting, real-world clinical settings pose severe challenges, and we must treat them differently. We build a clinical AI benchmark named Clinical AIBench to set up real-world clinical settings and facilitate research. We propose an open, dynamic machine learning framework and develop an AI system named OpenClinicalAI to diagnose diseases in real-world clinical settings. The first versions of Clinical AIBench and OpenClinicalAI target Alzheimer's disease. In the real-world clinical setting, OpenClinicalAI significantly outperforms the state-of-the-art AI system. In addition, OpenClinicalAI develops personalized diagnosis strategies to avoid unnecessary testing and seamlessly collaborates with clinicians. It shows promise for being embedded in current medical systems to improve medical services.

Pinpointing the Memory Behaviors of DNN Training

Apr 01, 2021

Jiansong Li, Xiao Dong, Guangli Li, Peng Zhao, Xueying Wang, Xiaobing Chen, Xianzhi Yu, Yongxin Yang, Zihan Jiang, Wei Cao(+2 more)

Abstract: The training of deep neural networks (DNNs) is usually memory-hungry relative to the limited device memory capacity of DNN accelerators. Characterizing the memory behaviors of DNN training is critical to relieving device memory pressure. In this work, we pinpoint the memory behaviors of each GPU device memory block during training by instrumenting the memory allocators of the runtime system. Our results show that the memory access patterns of device memory blocks are stable and follow an iterative pattern. These observations are useful for the future optimization of memory-efficient training from the perspective of raw memory access patterns.

* Submitted to ISPASS'21 poster

AIBench: Scenario-distilling AI Benchmarking

May 06, 2020

Wanling Gao, Fei Tang, Jianfeng Zhan, Xu Wen, Lei Wang, Zheng Cao, Chuanxin Lan, Chunjie Luo, Zihan Jiang

Abstract: Real-world application scenarios like modern Internet services consist of a diversity of AI and non-AI modules with very long and complex execution paths. Using component or micro AI benchmarks alone can lead to error-prone conclusions. This paper proposes a scenario-distilling AI benchmarking methodology. Instead of using real-world applications, we propose permutations of essential AI and non-AI tasks as a scenario-distilling benchmark. We consider scenario-distilling benchmarks, component benchmarks, and micro benchmarks as three indispensable parts of a benchmark suite. Together with seventeen industry partners, we identify nine important real-world application scenarios. We design and implement a highly extensible, configurable, and flexible benchmark framework. On the basis of the framework, we propose a guideline for building scenario-distilling benchmarks and present two Internet service AI ones. The preliminary evaluation shows the advantage of scenario-distilling AI benchmarking over using component or micro AI benchmarks alone. The specifications, source code, testbed, and results are publicly available from the web site http://www.benchcouncil.org/AIBench/index.html.

* 23 pages, 8 figures. arXiv admin note: substantial text overlap with arXiv:2002.07162

AIBench: An Industry Standard AI Benchmark Suite from Internet Services

Apr 30, 2020

Fei Tang, Wanling Gao, Jianfeng Zhan, Chuanxin Lan, Xu Wen, Lei Wang, Chunjie Luo, Jiahui Dai, Zheng Cao, Xingwang Xiong(+24 more)

Abstract: The booming successes of machine learning in different domains boost industry-scale deployments of innovative AI algorithms, systems, and architectures, and thus the importance of benchmarking grows. However, the confidential nature of the workloads, the paramount importance of the representativeness and diversity of benchmarks, and the prohibitive cost of training a state-of-the-art model mutually aggravate the AI benchmarking challenges. In this paper, we present a balanced AI benchmarking methodology for meeting the subtly different requirements of different stages in developing a new system/architecture and ranking/purchasing commercial off-the-shelf ones. Performing an exhaustive survey on the most important AI domain, Internet services, with seventeen industry partners, we identify and include seventeen representative AI tasks to guarantee the representativeness and diversity of the benchmarks. Meanwhile, to reduce the benchmarking cost, we select a minimum benchmark subset of three tasks according to the following criteria: diversity of model complexity, computational cost, and convergence rate; repeatability; and whether widely accepted metrics exist. We contribute by far the most comprehensive AI benchmark suite, AIBench. The evaluations show that AIBench outperforms MLPerf in terms of the diversity and representativeness of model complexity, computational cost, convergence rate, computation and memory access patterns, and hotspot functions. Compared with the full AIBench benchmarks, the subset reduces the benchmarking cost by 41% while maintaining the primary workload characteristics. The specifications, source code, and performance numbers are publicly available from the web site http://www.benchcouncil.org/AIBench/index.html.

AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite

Feb 17, 2020

Wanling Gao, Fei Tang, Jianfeng Zhan, Chuanxin Lan, Chunjie Luo, Lei Wang, Jiahui Dai, Zheng Cao, Xiongwang Xiong, Zihan Jiang(+24 more)

Abstract: Domain-specific software and hardware co-design is encouraging, as it is much easier to achieve efficiency for fewer tasks. Agile domain-specific benchmarking speeds up the process, as it provides not only relevant design inputs but also relevant metrics and tools. Unfortunately, modern workloads like Big Data, AI, and Internet services dwarf traditional ones in terms of code size, deployment scale, and execution path, and hence raise serious benchmarking challenges. This paper proposes an agile domain-specific benchmarking methodology. Together with seventeen industry partners, we identify ten important end-to-end application scenarios, among which sixteen representative AI tasks are distilled as the AI component benchmarks. We propose permutations of essential AI and non-AI component benchmarks as end-to-end benchmarks. An end-to-end benchmark is a distillation of the essential attributes of an industry-scale application. We design and implement a highly extensible, configurable, and flexible benchmark framework, on the basis of which we propose a guideline for building end-to-end benchmarks and present the first end-to-end Internet service AI benchmark. The preliminary evaluation shows the value of our benchmark suite, AIBench, against MLPerf and TailBench for hardware and software designers, micro-architectural researchers, and code developers. The specifications, source code, testbed, and results are publicly available from the web site http://www.benchcouncil.org/AIBench/index.html.

* 25 pages, 7 figures. arXiv admin note: substantial text overlap with arXiv:1908.08998
