Haibo Qiu

Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning

Jun 16, 2025

Haibo Qiu, Xiaohan Lan, Fanfan Liu, Xiaohu Sun, Delian Ruan, Peng Shi, Lin Ma

Abstract: Recent advancements in large language models (LLMs) have witnessed a surge in the development of advanced reasoning paradigms, which are now being integrated into multimodal large language models (MLLMs). However, existing approaches often fall short: methods solely employing reinforcement learning (RL) can struggle with sample inefficiency and activating entirely absent reasoning capabilities, while conventional pipelines that initiate with a cold-start supervised fine-tuning (SFT) phase before RL may restrict the model's exploratory capacity and face suboptimal convergence. In this work, we introduce Metis-RISE (RL Incentivizes and SFT Enhances) for multimodal reasoning model learning. Unlike conventional approaches, Metis-RISE distinctively omits an initial SFT stage, beginning instead with an RL phase (e.g., using a Group Relative Policy Optimization variant) to incentivize and activate the model's latent reasoning capacity. Subsequently, the targeted SFT stage addresses two key challenges identified during RL: (1) inefficient trajectory sampling for tasks where the model possesses but inconsistently applies correct reasoning, which we tackle using self-distilled reasoning trajectories from the RL model itself; and (2) fundamental capability absence, which we address by injecting expert-augmented knowledge for prompts where the model entirely fails. This strategic application of RL for incentivization followed by SFT for enhancement forms the core of Metis-RISE, leading to two versions of our MLLMs (7B and 72B parameters). Evaluations on the OpenCompass Multimodal Reasoning Leaderboard demonstrate that both models achieve state-of-the-art performance among similar-sized models, with the 72B version ranking fourth overall.

* Project Page: https://github.com/MM-Thinking/Metis-RISE
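
The two SFT data sources described in the Metis-RISE abstract can be pictured as a routing step over RL rollouts. The sketch below is illustrative only: the function name, the bucket names, and the rule "some but not all sampled trajectories correct" are assumptions, since the abstract does not state how prompts are assigned to each case.

```python
from typing import Dict, List


def route_prompts_for_sft(rollout_correct: Dict[str, List[bool]]) -> Dict[str, List[str]]:
    """Bucket prompts by how often the RL model's sampled trajectories were correct.

    rollout_correct maps a prompt id to one correctness flag per sampled trajectory.
    The thresholds used here are an assumption, not the paper's stated rule.
    """
    buckets: Dict[str, List[str]] = {
        "self_distill": [],    # inconsistently correct: reuse the model's own correct trajectories
        "expert_augment": [],  # never correct: needs externally supplied reasoning
        "skip": [],            # always correct: nothing left to fix in this sketch
    }
    for prompt_id, flags in rollout_correct.items():
        if not flags:
            continue  # no rollouts sampled for this prompt
        n_correct = sum(flags)
        if n_correct == 0:
            buckets["expert_augment"].append(prompt_id)
        elif n_correct < len(flags):
            buckets["self_distill"].append(prompt_id)
        else:
            buckets["skip"].append(prompt_id)
    return buckets
```

For example, a prompt with flags `[True, False, False]` would land in `self_distill`, while one with `[False, False, False]` would land in `expert_augment`.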

UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding

Apr 06, 2025

Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, Yu-Gang Jiang

Abstract: We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations, enabling seamless integration of unified visual understanding and image generation tasks. Unlike previous approaches that rely on unilateral visual representations, our unified visual encoding framework captures both high-level semantics and low-level details, delivering multidimensional information that empowers heterogeneous tasks to selectively assimilate domain-specific knowledge based on their inherent characteristics. Through in-depth experiments, we uncover key principles for developing a unified model capable of both visual understanding and image generation. Extensive evaluations across a diverse range of prominent benchmarks demonstrate that UniToken achieves state-of-the-art performance, surpassing existing approaches. These results establish UniToken as a robust foundation for future research in this domain. The code and models are available at https://github.com/SxJyJay/UniToken.

* Accepted to CVPR 2025 workshop

PointHR: Exploring High-Resolution Architectures for 3D Point Cloud Segmentation

Oct 11, 2023

Haibo Qiu, Baosheng Yu, Yixin Chen, Dacheng Tao

Abstract: Significant progress has been made recently in point cloud segmentation utilizing an encoder-decoder framework, which initially encodes point clouds into low-resolution representations and subsequently decodes high-resolution predictions. Inspired by the success of high-resolution architectures in image dense prediction, which always maintain a high-resolution representation throughout the entire learning process, we consider this equally important for 3D dense point cloud analysis. Therefore, in this paper, we explore high-resolution architectures for 3D point cloud segmentation. Specifically, we generalize high-resolution architectures using a unified pipeline named PointHR, which includes a knn-based sequence operator for feature extraction and a differential resampling operator to efficiently communicate different resolutions. Additionally, we propose to avoid numerous on-the-fly computations of high-resolution architectures by pre-computing the indices for both sequence and resampling operators. By doing so, we deliver highly competitive high-resolution architectures while capitalizing on the benefits of well-designed point cloud blocks without additional effort. To evaluate these architectures for dense point cloud analysis, we conduct thorough experiments using S3DIS and ScanNetV2 datasets, where the proposed PointHR outperforms recent state-of-the-art methods without any bells and whistles. The source code is available at https://github.com/haibo-qiu/PointHR.

* Code is available at https://github.com/haibo-qiu/PointHR

Collect-and-Distribute Transformer for 3D Point Cloud Analysis

Jun 02, 2023

Haibo Qiu, Baosheng Yu, Dacheng Tao

Abstract: Although remarkable advancements have been made recently in point cloud analysis through the exploration of transformer architecture, it remains challenging to effectively learn local and global structures within point clouds. In this paper, we propose a new transformer architecture equipped with a collect-and-distribute mechanism to communicate short- and long-range contexts of point clouds, which we refer to as CDFormer. Specifically, we first utilize self-attention to capture short-range interactions within each local patch, and the updated local features are then collected into a set of proxy reference points from which we can extract long-range contexts. Afterward, we distribute the learned long-range contexts back to local points via cross-attention. To address the position clues for short- and long-range contexts, we also introduce context-aware position encoding to facilitate position-aware communications between points. We perform experiments on four popular point cloud datasets, namely ModelNet40, ScanObjectNN, S3DIS, and ShapeNetPart, for classification and segmentation. Results show the effectiveness of the proposed CDFormer, delivering several new state-of-the-art performances on point cloud classification and segmentation tasks. The code is available at https://github.com/haibo-qiu/CDFormer.

* Code is available at https://github.com/haibo-qiu/CDFormer

GFNet: Geometric Flow Network for 3D Point Cloud Semantic Segmentation

Jul 06, 2022

Haibo Qiu, Baosheng Yu, Dacheng Tao

Abstract: Point cloud semantic segmentation from projected views, such as range-view (RV) and bird's-eye-view (BEV), has been intensively investigated. Different views capture different information of point clouds and thus are complementary to each other. However, recent projection-based methods for point cloud semantic segmentation usually utilize a vanilla late fusion strategy for the predictions of different views, failing to explore the complementary information from a geometric perspective during the representation learning. In this paper, we introduce a geometric flow network (GFNet) to explore the geometric correspondence between different views in an align-before-fuse manner. Specifically, we devise a novel geometric flow module (GFM) to bidirectionally align and propagate the complementary information across different views according to geometric relationships under the end-to-end learning scheme. We perform extensive experiments on two widely used benchmark datasets, SemanticKITTI and nuScenes, to demonstrate the effectiveness of our GFNet for projection-based point cloud semantic segmentation. Concretely, GFNet not only significantly boosts the performance of each individual view but also achieves state-of-the-art results over all existing projection-based models. Code is available at https://github.com/haibo-qiu/GFNet.

* Code is available at https://github.com/haibo-qiu/GFNet

End2End Occluded Face Recognition by Masking Corrupted Features

Aug 21, 2021

Haibo Qiu, Dihong Gong, Zhifeng Li, Wei Liu, Dacheng Tao

Abstract: With the recent advancement of deep convolutional neural networks, significant progress has been made in general face recognition. However, state-of-the-art general face recognition models do not generalize well to occluded face images, which are common in real-world scenarios. The likely reasons are the absence of large-scale occluded face data for training and of specific designs for tackling the corrupted features caused by occlusions. This paper presents a novel face recognition method that is robust to occlusions, based on a single end-to-end deep neural network. Our approach, named FROM (Face Recognition with Occlusion Masks), learns to discover the corrupted features from the deep convolutional neural networks and cleans them with dynamically learned masks. In addition, we construct massive occluded face images to train FROM effectively and efficiently. FROM is simple yet powerful compared to existing methods that either rely on external detectors to discover the occlusions or employ shallow models that are less discriminative. Experimental results on LFW, MegaFace Challenge 1, RMF2, the AR dataset, and other simulated occluded/masked datasets confirm that FROM dramatically improves accuracy under occlusions and generalizes well to general face recognition.

* Accepted by TPAMI 2021

SynFace: Face Recognition with Synthetic Data

Aug 18, 2021

Haibo Qiu, Baosheng Yu, Dihong Gong, Zhifeng Li, Wei Liu, Dacheng Tao

Abstract: With the recent success of deep neural networks, remarkable progress has been achieved on face recognition. However, collecting large-scale real-world training data for face recognition has turned out to be challenging, especially due to label noise and privacy issues. Meanwhile, existing face recognition datasets are usually collected from web images and lack detailed annotations on attributes (e.g., pose and expression), so the influence of different attributes on face recognition has been poorly investigated. In this paper, we address the above-mentioned issues in face recognition using synthetic face images, i.e., SynFace. Specifically, we first explore the performance gap between recent state-of-the-art face recognition models trained with synthetic and real face images. We then analyze the underlying causes behind the performance gap, e.g., the poor intra-class variations and the domain gap between synthetic and real face images. Inspired by this, we devise SynFace with identity mixup (IM) and domain mixup (DM) to mitigate the above performance gap, demonstrating the great potential of synthetic data for face recognition. Furthermore, with the controllable face synthesis model, we can easily manage different factors of synthetic face generation, including pose, expression, illumination, the number of identities, and samples per identity. Therefore, we also perform a systematic empirical analysis on synthetic face images to provide some insights on how to effectively utilize synthetic data for face recognition.

* Accepted by ICCV 2021
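
The SynFace abstract names identity mixup (IM) without giving its formula. As a rough illustration of the general mixup idea it builds on, the sketch below linearly interpolates two identity coefficient vectors and their one-hot labels with the same weight. The function name, the argument names, and the choice to mix in coefficient space are assumptions made for illustration, not a statement of the paper's exact procedure.

```python
from typing import List, Tuple


def identity_mixup(
    coeff_a: List[float],
    coeff_b: List[float],
    label_a: List[float],
    label_b: List[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Mixup-style interpolation of two identities (hypothetical helper).

    coeff_* are identity coefficient vectors fed to a face synthesis model,
    label_* are one-hot identity labels, lam in [0, 1] is the mixing weight.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(coeff_a) != len(coeff_b) or len(label_a) != len(label_b):
        raise ValueError("inputs to be mixed must have matching lengths")
    # Interpolate the identity representation ...
    mixed_coeff = [lam * a + (1.0 - lam) * b for a, b in zip(coeff_a, coeff_b)]
    # ... and soften the label with the same weight so supervision stays consistent.
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_coeff, mixed_label
```

With `lam = 1.0` the first identity is returned unchanged; intermediate values yield an in-between identity with a correspondingly soft label.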

Cross View Fusion for 3D Human Pose Estimation

Sep 03, 2019

Haibo Qiu, Chunyu Wang, Jingdong Wang, Naiyan Wang, Wenjun Zeng

Abstract: We present an approach to recover absolute 3D human poses from multi-view images by incorporating multi-view geometric priors in our model. It consists of two separate steps: (1) estimating the 2D poses in multi-view images and (2) recovering the 3D poses from the multi-view 2D poses. First, we introduce a cross-view fusion scheme into a CNN to jointly estimate 2D poses for multiple views. Consequently, the 2D pose estimation for each view already benefits from other views. Second, we present a recursive Pictorial Structure Model to recover the 3D pose from the multi-view 2D poses. It gradually improves the accuracy of the 3D pose at affordable computational cost. We test our method on two public datasets, H36M and Total Capture. The Mean Per Joint Position Errors on the two datasets are 26mm and 29mm, which outperform the state of the art remarkably (26mm vs 52mm, 29mm vs 35mm). Our code is released at https://github.com/microsoft/multiview-human-pose-estimation-pytorch.

* Accepted by ICCV 2019
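
The Mean Per Joint Position Error quoted in the abstract above is commonly computed as the average Euclidean distance between predicted and ground-truth 3D joints. A minimal sketch follows; the function name is hypothetical, and dataset-specific conventions such as joint subsets or any alignment step are not covered here.

```python
import math
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]


def mpjpe(pred: Sequence[Joint], gt: Sequence[Joint]) -> float:
    """Average Euclidean distance between matching joints, in the input unit (e.g. mm)."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and have the same number of joints")
    # math.dist computes the Euclidean distance between two points (Python 3.8+).
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)
```

For two joints where one prediction is exact and the other is off by 5 mm, this returns 2.5 mm.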
