Zhigang Zhu

Dept. of Computer Science, City College of New York

Learning to Reason Across Parallel Samples for LLM Reasoning

Jun 10, 2025

Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, Eunsol Choi

Abstract: Scaling test-time compute brings substantial performance gains for large language models (LLMs). By sampling multiple answers and heuristically aggregating them (e.g., through majority voting or by using verifiers to rank the answers), one can achieve consistent performance gains in math domains. In this paper, we propose a new way to leverage such sample sets. We train a compact LLM, called Sample Set Aggregator (SSA), that takes a concatenated sequence of multiple samples and outputs the final answer, optimizing it for answer accuracy with reinforcement learning. Experiments on multiple reasoning datasets show that SSA outperforms other test-time scaling methods such as reward model-based re-ranking. Our approach also shows promising generalization across sample set sizes, base model families and scales, and tasks. By separating the LLMs that generate answers from the LLMs that analyze and aggregate sampled answers, our approach can work with the outputs of premier black-box models easily and efficiently.
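
The majority-voting baseline mentioned in this abstract can be illustrated with a minimal sketch. This is not the SSA method or the authors' code; the function name `majority_vote` and the sample answers are hypothetical, and ties are assumed to go to the answer seen first.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the sampled answers.

    Ties are broken in favor of the answer that was seen first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # For equal counts, Counter keeps first-encountered order.
    return counts.most_common(1)[0][0]

# Hypothetical final answers extracted from five sampled generations.
samples = ["42", "41", "42", "7", "42"]
print(majority_vote(samples))
```

A learned aggregator such as the one the abstract describes would replace this fixed rule with a model that reads all samples together.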

Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models

Mar 21, 2025

Jianing Qi, Jiawei Liu, Hao Tang, Zhigang Zhu

Abstract: Vision-Language Models (VLMs) excel at identifying and describing objects but struggle with spatial reasoning such as accurately understanding the relative positions of objects. Inspired by the dual-pathway (ventral-dorsal) model of human vision, we investigate why VLMs fail spatial tasks despite strong object recognition capabilities. Our interpretability-driven analysis reveals a critical underlying cause: vision embeddings in VLMs are treated primarily as a semantic "bag-of-tokens," overshadowing subtle yet crucial positional cues due to their disproportionately large embedding norms. We validate this insight through extensive diagnostic experiments, demonstrating minimal performance impact when token orders or fine-grained spatial details are removed. Guided by these findings, we propose simple, interpretable interventions, including normalizing vision embedding norms and extracting mid-layer spatially rich features, to restore spatial awareness. Empirical results on both our synthetic data and standard benchmarks demonstrate improved spatial reasoning capabilities, highlighting the value of interpretability-informed design choices. Our study not only uncovers fundamental limitations in current VLM architectures but also provides actionable insights for enhancing structured perception of visual scenes.
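
One of the interventions named in this abstract, normalizing vision embedding norms, can be pictured with a small sketch. The abstract does not specify the exact normalization, so this assumes a simple rescaling of each token embedding to a fixed L2 norm; `rescale_to_norm`, `target_norm`, and the toy vectors are hypothetical.

```python
import math

def rescale_to_norm(vectors, target_norm):
    """Rescale each vector so that its L2 norm equals target_norm."""
    rescaled = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            # A zero vector has no direction; leave it unchanged.
            rescaled.append(list(vec))
        else:
            scale = target_norm / norm
            rescaled.append([x * scale for x in vec])
    return rescaled

# Toy "vision token" embeddings with norms far larger than 1.
vision_tokens = [[30.0, 40.0], [0.0, 100.0]]
print(rescale_to_norm(vision_tokens, target_norm=1.0))
```

In practice the target norm might be chosen to match the typical norm of text embeddings, but that choice is an assumption here.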

VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers

Oct 10, 2024

Jianing Qi, Hao Tang, Zhigang Zhu

Abstract: Recent advancements in test-time compute, particularly through the use of verifier models, have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). This generator-verifier approach closely resembles the actor-critic framework in reinforcement learning (RL). However, current verifier models in LLMs often rely on supervised fine-tuning without temporal difference learning such as Q-learning. This paper introduces VerifierQ, a novel approach that integrates Offline Q-learning into LLM verifier models. We address three key challenges in applying Q-learning to LLMs: (1) handling utterance-level Markov Decision Processes (MDPs), (2) managing large action spaces, and (3) mitigating overestimation bias. VerifierQ introduces a modified Bellman update for bounded Q-values, incorporates Implicit Q-learning (IQL) for efficient action space management, and integrates a novel Conservative Q-learning (CQL) formulation for balanced Q-value estimation. Our method enables parallel Q-value computation and improves training efficiency. While recent work has explored RL techniques like MCTS for generators, VerifierQ is among the first to investigate the verifier (critic) aspect in LLMs through Q-learning. This integration of RL principles into verifier models complements existing advancements in generator techniques, potentially enabling more robust and adaptive reasoning in LLMs. Experimental results on mathematical reasoning tasks demonstrate VerifierQ's superior performance compared to traditional supervised fine-tuning approaches, with improvements in efficiency, accuracy and robustness. By enhancing the synergy between generation and evaluation capabilities, VerifierQ contributes to the ongoing evolution of AI systems in addressing complex cognitive tasks across various domains.
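
As background for the Implicit Q-learning (IQL) component mentioned in this abstract, the sketch below shows the standard expectile regression loss used in IQL to fit a value estimate without maximizing over actions. It is not VerifierQ's formulation, which the abstract does not spell out; the function name and the default `tau=0.7` are illustrative assumptions.

```python
def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2.

    diff is (target Q-value - predicted value). With tau > 0.5,
    underestimates (diff > 0) are penalized more than overestimates,
    which pushes the prediction toward an upper expectile of the targets.
    """
    weight = (1.0 - tau) if diff < 0 else tau
    return weight * diff * diff

# Same magnitude of error, different sign, different penalty.
print(expectile_loss(2.0), expectile_loss(-2.0))
```

Setting `tau=0.5` recovers an ordinary (scaled) squared error, so the asymmetry is the only thing this loss adds.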

GMC: A General Framework of Multi-stage Context Learning and Utilization for Visual Detection Tasks

Jul 08, 2024

Xuan Wang, Hao Tang, Zhigang Zhu

Abstract: Various kinds of contextual information have been employed by many approaches for visual detection tasks. However, most existing approaches focus only on a specific context for a specific task. In this paper, GMC, a general framework, is proposed for multi-stage context learning and utilization, with various deep network architectures for various visual detection tasks. The GMC framework encompasses three stages: preprocessing, training, and post-processing. In the preprocessing stage, the representation of local context is enhanced by utilizing commonly used labeling standards. During the training stage, semantic context information is fused with visual information, leveraging prior knowledge from the training dataset to capture semantic relationships. In the post-processing stage, general topological relations and semantic masks for stuff are incorporated to enable spatial context reasoning between objects. The proposed framework provides a comprehensive and adaptable solution for context learning and utilization in visual detection scenarios. The framework offers flexibility with user-defined configurations and provides adaptability to diverse network architectures and visual detection tasks, offering an automated and streamlined solution that minimizes user effort and inference time in context learning and reasoning. Experimental results on the visual detection tasks, for storefront object detection, pedestrian detection and COCO object detection, demonstrate that our framework outperforms previous state-of-the-art detectors and transformer architectures. The experiments also demonstrate that the three contextual learning components can be applied individually and in combination, as well as to various network architectures, which shows the framework's flexibility and effectiveness in various detection scenarios.

Segment Anything Model for Pedestrian Infrastructure Inventory: Assessing Zero-Shot Segmentation on Multi-Mode Geospatial Data

Oct 24, 2023

Jiahao Xia, Gavin Gong, Jiawei Liu, Zhigang Zhu, Hao Tang

Abstract: In this paper, a Segment Anything Model (SAM)-based pedestrian infrastructure segmentation workflow is designed and optimized, which is capable of efficiently processing multi-sourced geospatial data including LiDAR data and satellite imagery data. We used an expanded definition of pedestrian infrastructure inventory which goes beyond the traditional transportation elements to include street furniture objects often omitted from the traditional definition. Our contributions lie in producing the necessary knowledge to answer the following two questions. First, which data representation can facilitate zero-shot segmentation of infrastructure objects with SAM? Second, how well does the SAM-based method perform on segmenting pedestrian infrastructure objects? Our findings indicate that street view images generated from mobile LiDAR point cloud data, when paired with satellite imagery data, can work efficiently with SAM to create a scalable pedestrian infrastructure inventory approach with immediate benefits to GIS professionals, city managers, transportation owners, and walkers, especially those with travel-limiting disabilities.

Robots in the Garden: Artificial Intelligence and Adaptive Landscapes

May 22, 2023

Zihao Zhang, Susan L. Epstein, Casey Breen, Sophia Xia, Zhigang Zhu, Christian Volkmann

Abstract:This paper introduces ELUA, the Ecological Laboratory for Urban Agriculture, a collaboration among landscape architects, architects and computer scientists who specialize in artificial intelligence, robotics and computer vision. ELUA has two gantry robots, one indoors and the other outside on the rooftop of a 6-story campus building. Each robot can seed, water, weed, and prune in its garden. To support responsive landscape research, ELUA also includes sensor arrays, an AI-powered camera, and an extensive network infrastructure. This project demonstrates a way to integrate artificial intelligence into an evolving urban ecosystem, and encourages landscape architects to develop an adaptive design framework where design becomes a long-term engagement with the environment.

* Journal of Digital Landscape Architecture, 2023
* 4 figures, 9 pages

Context Understanding in Computer Vision: A Survey

Feb 10, 2023

Xuan Wang, Zhigang Zhu

Abstract: Contextual information plays an important role in many computer vision tasks, such as object detection, video action detection, image classification, etc. Recognizing a single object or action out of context can sometimes be very challenging, and context information may greatly improve the understanding of a scene or an event. Appearance context information, e.g., colors or shapes of the background of an object, can improve the recognition accuracy of the object in the scene. Semantic context (e.g., a keyboard on an empty desk vs. a keyboard next to a desktop computer) will improve accuracy and exclude unrelated events. Context information that is not in the image itself, such as the time or location at which an image was captured, can also help to decide whether a certain event or action should occur. Other types of context (e.g., the 3D structure of a building) will also provide additional information to improve the accuracy. In this survey, different context information that has been used in computer vision tasks is reviewed. We categorize context into different types and different levels. We also review available machine learning models and image/video datasets that can employ context information. Furthermore, we compare context-based integration and context-free integration mainly in two classes of tasks: image-based and video-based. Finally, this survey is concluded by a set of promising future directions in context learning and utilization.

SnapshotNet: Self-supervised Feature Learning for Point Cloud Data Segmentation Using Minimal Labeled Data

Jan 13, 2022

Xingye Li, Ling Zhang, Zhigang Zhu

Abstract: Manually annotating complex scene point cloud datasets is both costly and error-prone. To reduce the reliance on labeled data, a new model called SnapshotNet is proposed as a self-supervised feature learning approach, which directly works on the unlabeled point cloud data of a complex 3D scene. The SnapshotNet pipeline includes three stages. In the snapshot capturing stage, snapshots, which are defined as local collections of points, are sampled from the point cloud scene. A snapshot could be a view of a local 3D scan directly captured from the real scene, or a virtual view of such from a large 3D point cloud dataset. Snapshots could also be sampled at different sampling rates or fields of view (FOVs), yielding multi-FOV snapshots, to capture scale information from the scene. In the feature learning stage, a new pretext task called multi-FOV contrasting is proposed to recognize whether two snapshots are from the same object or not, within the same FOV or across different FOVs. Snapshots go through two self-supervised learning steps: the contrastive learning step with both part and scale contrasting, followed by a snapshot clustering step to extract higher-level semantic features. Then a weakly-supervised segmentation stage is implemented by first training a standard SVM classifier on the learned features with a small fraction of labeled snapshots. The trained SVM is used to predict labels for input snapshots, and the predicted labels are converted into point-wise label assignments for semantic segmentation of the entire scene using a voting procedure. The experiments are conducted on the Semantic3D dataset, and the results show that the proposed method is capable of learning effective features from snapshots of complex scene data without any labels. Moreover, the proposed method shows advantages when compared to the SOA method on weakly-supervised point cloud semantic segmentation.
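
The voting step described in this abstract, where snapshot-level predictions become point-wise labels, might look roughly like the sketch below. The abstract does not give the details of the procedure, so this assumes that each point takes the most common label among the snapshots that contain it; `vote_point_labels` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def vote_point_labels(snapshots, snapshot_labels):
    """Assign each point the most common label among snapshots containing it.

    snapshots: list of lists of point indices (one list per snapshot).
    snapshot_labels: one predicted label per snapshot, in the same order.
    Ties go to the label that was voted for first.
    """
    votes = defaultdict(Counter)
    for point_ids, label in zip(snapshots, snapshot_labels):
        for pid in point_ids:
            votes[pid][label] += 1
    return {pid: counter.most_common(1)[0][0] for pid, counter in votes.items()}

# Three overlapping snapshots over five points, with predicted labels.
snapshots = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
labels = ["road", "car", "car"]
print(vote_point_labels(snapshots, labels))
```

Points covered by no snapshot receive no label in this sketch; a real pipeline would need a rule for them.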

* Computer Vision and Image Understanding, Volume 216, 2022, 103339, ISSN 1077-3142

NIDA-CLIFGAN: Natural Infrastructure Damage Assessment through Efficient Classification Combining Contrastive Learning, Information Fusion and Generative Adversarial Networks

Oct 27, 2021

Jie Wei, Zhigang Zhu, Erik Blasch, Bilal Abdulrahman, Billy Davila, Shuoxin Liu, Jed Magracia, Ling Fang

Abstract: During natural disasters, aircraft and satellites are used to survey the impacted regions. Usually, human experts are needed to manually label the degrees of the building damage so that proper humanitarian assistance and disaster response (HADR) can be achieved, which is labor-intensive and time-consuming. Expecting human labeling of major disasters over a wide area gravely slows down the HADR efforts. It is thus of crucial interest to take advantage of cutting-edge Artificial Intelligence and Machine Learning techniques to speed up the natural infrastructure damage assessment process to achieve effective HADR. Accordingly, the paper demonstrates a systematic effort to achieve efficient building damage classification. First, two novel generative adversarial nets (GANs) are designed to augment data used to train the deep-learning-based classifier. Second, a contrastive learning based method using novel data structures is developed to achieve great performance. Third, by using information fusion, the classifier is effectively trained with very few training data samples for transfer learning. All the classifiers are small enough to be loaded in a smart phone or simple laptop for first responders. Based on the available overhead imagery dataset, results demonstrate data and computational efficiency: using 10% of the collected data combined with a GAN reduces the computation time from roughly half a day to about 1 hour with roughly similar classification performance.

Unsupervised Feature Learning for Point Cloud by Contrasting and Clustering With Graph Convolutional Neural Network

May 03, 2019

Ling Zhang, Zhigang Zhu

Abstract: To alleviate the cost of collecting and annotating large-scale point cloud datasets, we propose an unsupervised learning approach to learn features from an unlabeled point cloud "3D object" dataset by using part contrasting and object clustering with deep graph neural networks (GNNs). In the contrast learning step, all the samples in the 3D object dataset are cut into two parts and put into a "part" dataset. Then a contrast learning GNN (ContrastNet) is trained to verify whether two randomly sampled parts from the part dataset belong to the same object. In the cluster learning step, the trained ContrastNet is applied to all the samples in the original 3D object dataset to extract features, which are used to group the samples into clusters. Then another GNN for clustering learning (ClusterNet) is trained to predict the cluster ID of all the training samples. The contrast learning forces ContrastNet to learn high-level semantic features of objects but probably ignores low-level features, while ClusterNet improves the quality of the learned features by being trained, through the use of cluster IDs, to discover objects that probably belong to the same semantic categories. We have conducted extensive experiments to evaluate the proposed framework on point cloud classification tasks. The proposed unsupervised learning approach obtained performance comparable to state-of-the-art unsupervised learning methods that use much more complicated network structures. The code of this work is publicly available via: https://github.com/lingzhang1/ContrastNet.
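
The construction of the "part" dataset and the same-object pairs described in this abstract can be sketched as follows. This is not the code from the linked repository; how objects are cut is an assumption (a median split along one axis), and `split_object`, `make_part_pairs`, and the toy objects are hypothetical.

```python
import random

def split_object(points, axis=0):
    """Cut a list of 3D points into two parts at the median along one axis."""
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def make_part_pairs(objects, num_pairs, seed=0):
    """Sample (part_a, part_b, same_object) triples from a list of objects.

    same_object is 1 when both parts were cut from the same object, else 0.
    """
    rng = random.Random(seed)
    parts = []  # list of (object_id, part)
    for obj_id, points in enumerate(objects):
        for part in split_object(points):
            parts.append((obj_id, part))
    pairs = []
    for _ in range(num_pairs):
        (id_a, part_a), (id_b, part_b) = rng.sample(parts, 2)
        pairs.append((part_a, part_b, int(id_a == id_b)))
    return pairs

# Two toy "objects", each a small list of (x, y, z) points.
objects = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
    [(0.0, 5.0, 1.0), (1.0, 5.0, 1.0), (2.0, 5.0, 1.0), (3.0, 5.0, 1.0)],
]
for part_a, part_b, same in make_part_pairs(objects, num_pairs=3):
    print(len(part_a), len(part_b), same)
```

A network trained on such pairs would receive the two parts as input and the 0/1 flag as its target.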
