Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kaiyi Zhang

StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason

Jul 03, 2025

Kaiyi Zhang, Ang Lv, Jinpeng Li, Yongbo Wang, Feng Wang, Haoyuan Hu, Rui Yan

Abstract:Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving the complex reasoning abilities of large language models (LLMs). However, current RLVR methods face two significant challenges: the near-miss reward problem, where a small mistake can invalidate an otherwise correct reasoning process, greatly hindering training efficiency; and exploration stagnation, where models tend to focus on solutions within their ``comfort zone,'' lacking the motivation to explore potentially more effective alternatives. To address these challenges, we propose StepHint, a novel RLVR algorithm that utilizes multi-level stepwise hints to help models explore the solution space more effectively. StepHint generates valid reasoning chains from stronger models and partitions these chains into reasoning steps using our proposed adaptive partitioning method. The initial few steps are used as hints, and simultaneously, multiple-level hints (each comprising a different number of steps) are provided to the model. This approach directs the model's exploration toward a promising solution subspace while preserving its flexibility for independent exploration. By providing hints, StepHint mitigates the near-miss reward problem, thereby improving training efficiency. Additionally, the external reasoning pathways help the model develop better reasoning abilities, enabling it to move beyond its ``comfort zone'' and mitigate exploration stagnation. StepHint outperforms competitive RLVR enhancement methods across six mathematical benchmarks, while also demonstrating superior generalization and excelling over baselines on out-of-domain benchmarks.

Via

Access Paper or Ask Questions

Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization

Sep 03, 2024

Qi Zhang, Kaiyi Zhang, Antoni B. Chan, Hui Huang

Figure 1 for Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization

Figure 2 for Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization

Figure 3 for Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization

Figure 4 for Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization

Abstract:Multi-view crowd localization predicts the ground locations of all people in the scene. Typical methods usually estimate the crowd density maps on the ground plane first, and then obtain the crowd locations. However, the performance of existing methods is limited by the ambiguity of the density maps in crowded areas, where local peaks can be smoothed away. To mitigate the weakness of density map supervision, optimal transport-based point supervision methods have been proposed in the single-image crowd localization tasks, but have not been explored for multi-view crowd localization yet. Thus, in this paper, we propose a novel Mahalanobis distance-based multi-view optimal transport (M-MVOT) loss specifically designed for multi-view crowd localization. First, we replace the Euclidean-based transport cost with the Mahalanobis distance, which defines elliptical iso-contours in the cost function whose long-axis and short-axis directions are guided by the view ray direction. Second, the object-to-camera distance in each view is used to adjust the optimal transport cost of each location further, where the wrong predictions far away from the camera are more heavily penalized. Finally, we propose a strategy to consider all the input camera views in the model loss (M-MVOT) by computing the optimal transport cost for each ground-truth point based on its closest camera. Experiments demonstrate the advantage of the proposed method over density map-based or common Euclidean distance-based optimal transport loss on several multi-view crowd localization datasets. Project page: https://vcc.tech/research/2024/MVOT.

* ECCV 2024

Via

Access Paper or Ask Questions

A New Robust Partial $p$-Wasserstein-Based Metric for Comparing Distributions

May 06, 2024

Sharath Raghvendra, Pouyan Shirzadian, Kaiyi Zhang

Abstract:The $2$-Wasserstein distance is sensitive to minor geometric differences between distributions, making it a very powerful dissimilarity metric. However, due to this sensitivity, a small outlier mass can also cause a significant increase in the $2$-Wasserstein distance between two similar distributions. Similarly, sampling discrepancy can cause the empirical $2$-Wasserstein distance on $n$ samples in $\mathbb{R}^2$ to converge to the true distance at a rate of $n^{-1/4}$, which is significantly slower than the rate of $n^{-1/2}$ for $1$-Wasserstein distance. We introduce a new family of distances parameterized by $k \ge 0$, called $k$-RPW, that is based on computing the partial $2$-Wasserstein distance. We show that (1) $k$-RPW satisfies the metric properties, (2) $k$-RPW is robust to small outlier mass while retaining the sensitivity of $2$-Wasserstein distance to minor geometric differences, and (3) when $k$ is a constant, $k$-RPW distance between empirical distributions on $n$ samples in $\mathbb{R}^2$ converges to the true distance at a rate of $n^{-1/3}$, which is faster than the convergence rate of $n^{-1/4}$ for the $2$-Wasserstein distance. Using the partial $p$-Wasserstein distance, we extend our distance to any $p \in [1,\infty]$. By setting parameters $k$ or $p$ appropriately, we can reduce our distance to the total variation, $p$-Wasserstein, and the L\'evy-Prokhorov distances. Experiments show that our distance function achieves higher accuracy in comparison to the $1$-Wasserstein, $2$-Wasserstein, and TV distances for image retrieval tasks on noisy real-world data sets.

Via

Access Paper or Ask Questions

Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models

Apr 09, 2024

Ang Lv, Kaiyi Zhang, Yuhan Chen, Yulong Wang, Lifeng Liu, Ji-Rong Wen, Jian Xie, Rui Yan

Abstract:In this paper, we deeply explore the mechanisms employed by Transformer-based language models in factual recall tasks. In zero-shot scenarios, given a prompt like "The capital of France is," task-specific attention heads extract the topic entity, such as "France," from the context and pass it to subsequent MLPs to recall the required answer such as "Paris." We introduce a novel analysis method aimed at decomposing the outputs of the MLP into components understandable by humans. Through this method, we quantify the function of the MLP layer following these task-specific heads. In the residual stream, it either erases or amplifies the information originating from individual heads. Moreover, it generates a component that redirects the residual stream towards the direction of its expected answer. These zero-shot mechanisms are also employed in few-shot scenarios. Additionally, we observed a widely existent anti-overconfidence mechanism in the final layer of models, which suppresses correct predictions. We mitigate this suppression by leveraging our interpretation to improve factual recall performance. Our interpretations have been evaluated across various language models, from the GPT-2 families to 1.3B OPT, and across tasks covering different domains of factual knowledge.

Via

Access Paper or Ask Questions

Batch-ICL: Effective, Efficient, and Order-Agnostic In-Context Learning

Jan 12, 2024

Kaiyi Zhang, Ang Lv, Yuhan Chen, Hansen Ha, Tao Xu, Rui Yan

Abstract:In this paper, by treating in-context learning (ICL) as a meta-optimization process, we explain why LLMs are sensitive to the order of ICL examples. This understanding leads us to the development of Batch-ICL, an effective, efficient, and order-agnostic inference algorithm for ICL. Differing from the standard N-shot learning approach, Batch-ICL employs $N$ separate 1-shot forward computations and aggregates the resulting meta-gradients. These aggregated meta-gradients are then applied to a zero-shot learning to generate the final prediction. This batch processing approach renders the LLM agnostic to the order of ICL examples. Through extensive experiments and analysis, we demonstrate that Batch-ICL consistently outperforms most permutations of example sequences. In some cases, it even exceeds the performance of the optimal order for standard ICL, all while reducing the computational resources required. Furthermore, we develop a novel variant of Batch-ICL featuring multiple "epochs" of meta-optimization. This variant implicitly explores permutations of ICL examples, further enhancing ICL performance.

Via

Access Paper or Ask Questions

Point Cloud Part Editing: Segmentation, Generation, Assembly, and Selection

Dec 19, 2023

Kaiyi Zhang, Yang Chen, Ximing Yang, Weizhong Zhang, Cheng Jin

Abstract:Ideal part editing should guarantee the diversity of edited parts, the fidelity to the remaining parts, and the quality of the results. However, previous methods do not disentangle each part completely, which means the edited parts will affect the others, resulting in poor diversity and fidelity. In addition, some methods lack constraints between parts, which need manual selections of edited results to ensure quality. Therefore, we propose a four-stage process for point cloud part editing: Segmentation, Generation, Assembly, and Selection. Based on this process, we introduce SGAS, a model for part editing that employs two strategies: feature disentanglement and constraint. By independently fitting part-level feature distributions, we realize the feature disentanglement. By explicitly modeling the transformation from object-level distribution to part-level distributions, we realize the feature constraint. Considerable experiments on different datasets demonstrate the efficiency and effectiveness of SGAS on point cloud part editing. In addition, SGAS can be pruned to realize unsupervised part-aware point cloud generation and achieves state-of-the-art results.

* 9 pages, 7 figures, AAAI 2024

Via

Access Paper or Ask Questions

Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse

Nov 16, 2023

Ang Lv, Kaiyi Zhang, Shufang Xie, Quan Tu, Yuhan Chen, Ji-Rong Wen, Rui Yan

Abstract:Recent studies have highlighted a phenomenon in large language models (LLMs) known as "the reversal curse," in which the order of knowledge entities in the training data biases the models' comprehension. For example, if a model is trained on sentences where entity A consistently appears before entity B, it can respond to queries about A by providing B as the answer. However, it may encounter confusion when presented with questions concerning B. We contend that the reversal curse is partially a result of specific model training objectives, particularly evident in the prevalent use of the next-token prediction within most causal language models. For the next-token prediction, models solely focus on a token's preceding context, resulting in a restricted comprehension of the input. In contrast, we illustrate that the GLM, trained using the autoregressive blank infilling objective where tokens to be predicted have access to the entire context, exhibits better resilience against the reversal curse. We propose a novel training method, BIdirectional Casual language modeling Optimization (BICO), designed to mitigate the reversal curse when fine-tuning pretrained causal language models on new data. BICO modifies the causal attention mechanism to function bidirectionally and employs a mask denoising optimization. In the task designed to assess the reversal curse, our approach improves Llama's accuracy from the original 0% to around 70%. We hope that more attention can be focused on exploring and addressing these inherent weaknesses of the current LLMs, in order to achieve a higher level of intelligence.

Via

Access Paper or Ask Questions

A Push-Relabel Based Additive Approximation for Optimal Transport

Mar 07, 2022

Nathaniel Lahn, Sharath Raghvendra, Kaiyi Zhang

Figure 1 for A Push-Relabel Based Additive Approximation for Optimal Transport

Figure 2 for A Push-Relabel Based Additive Approximation for Optimal Transport

Abstract:Optimal Transport is a popular distance metric for measuring similarity between distributions. Exact algorithms for computing Optimal Transport can be slow, which has motivated the development of approximate numerical solvers (e.g. Sinkhorn method). We introduce a new and very simple combinatorial approach to find an $\varepsilon$-approximation of the OT distance. Our algorithm achieves a near-optimal execution time of $O(n^2/\varepsilon^2)$ for computing OT distance and, for the special case of the assignment problem, the execution time improves to $O(n^2/\varepsilon)$. Our algorithm is based on the push-relabel framework for min-cost flow problems. Unlike the other combinatorial approach (Lahn, Mulchandani and Raghvendra, NeurIPS 2019) which does not have a fast parallel implementation, our algorithm has a parallel execution time of $O(\log n/\varepsilon^2)$. Interestingly, unlike the Sinkhorn algorithm, our method also readily provides a compact transport plan as well as a solution to an approximate version of the dual formulation of the OT problem, both of which have numerous applications in Machine Learning. For the assignment problem, we provide both a CPU implementation as well as an implementation that exploits GPU parallelism. Experiments suggest that our algorithm is faster than the Sinkhorn algorithm, both in terms of CPU and GPU implementations, especially while computing matchings with a high accuracy.

Via

Access Paper or Ask Questions

SRPCN: Structure Retrieval based Point Completion Network

Feb 06, 2022

Kaiyi Zhang, Ximing Yang, Yuan Wu, Cheng Jin

Figure 1 for SRPCN: Structure Retrieval based Point Completion Network

Figure 2 for SRPCN: Structure Retrieval based Point Completion Network

Figure 3 for SRPCN: Structure Retrieval based Point Completion Network

Figure 4 for SRPCN: Structure Retrieval based Point Completion Network

Abstract:Given partial objects and some complete ones as references, point cloud completion aims to recover authentic shapes. However, existing methods pay little attention to general shapes, which leads to the poor authenticity of completion results. Besides, the missing patterns are diverse in reality, but existing methods can only handle fixed ones, which means a poor generalization ability. Considering that a partial point cloud is a subset of the corresponding complete one, we regard them as different samples of the same distribution and propose Structure Retrieval based Point Completion Network (SRPCN). It first uses k-means clustering to extract structure points and disperses them into distributions, and then KL Divergence is used as a metric to find the complete structure point cloud that best matches the input in a database. Finally, a PCN-like decoder network is adopted to generate the final results based on the retrieved structure point clouds. As structure plays an important role in describing the general shape of an object and the proposed structure retrieval method is robust to missing patterns, experiments show that our method can generate more authentic results and has a stronger generalization ability.

* 10 pages, 6 figures

Via

Access Paper or Ask Questions

Attention-based Transformation from Latent Features to Point Clouds

Dec 10, 2021

Kaiyi Zhang, Ximing Yang, Yuan Wu, Cheng Jin

Figure 1 for Attention-based Transformation from Latent Features to Point Clouds

Figure 2 for Attention-based Transformation from Latent Features to Point Clouds

Figure 3 for Attention-based Transformation from Latent Features to Point Clouds

Figure 4 for Attention-based Transformation from Latent Features to Point Clouds

Abstract:In point cloud generation and completion, previous methods for transforming latent features to point clouds are generally based on fully connected layers (FC-based) or folding operations (Folding-based). However, point clouds generated by FC-based methods are usually troubled by outliers and rough surfaces. For folding-based methods, their data flow is large, convergence speed is slow, and they are also hard to handle the generation of non-smooth surfaces. In this work, we propose AXform, an attention-based method to transform latent features to point clouds. AXform first generates points in an interim space, using a fully connected layer. These interim points are then aggregated to generate the target point cloud. AXform takes both parameter sharing and data flow into account, which makes it has fewer outliers, fewer network parameters, and a faster convergence speed. The points generated by AXform do not have the strong 2-manifold constraint, which improves the generation of non-smooth surfaces. When AXform is expanded to multiple branches for local generations, the centripetal constraint makes it has properties of self-clustering and space consistency, which further enables unsupervised semantic segmentation. We also adopt this scheme and design AXformNet for point cloud completion. Considerable experiments on different datasets show that our methods achieve state-of-the-art results.

* 9 pages, 7 figures, AAAI 2022

Via

Access Paper or Ask Questions