Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xian Zhang

Reviving DSP for Advanced Theorem Proving in the Era of Reasoning Models

Jun 13, 2025

Chenrui Cao, Liangcheng Song, Zenan Li, Xinyi Le, Xian Zhang, Hui Xue, Fan Yang

Abstract:Recent advancements, such as DeepSeek-Prover-V2-671B and Kimina-Prover-Preview-72B, demonstrate a prevailing trend in leveraging reinforcement learning (RL)-based large-scale training for automated theorem proving. Surprisingly, we discover that even without any training, careful neuro-symbolic coordination of existing off-the-shelf reasoning models and tactic step provers can achieve comparable performance. This paper introduces \textbf{DSP+}, an improved version of the Draft, Sketch, and Prove framework, featuring a \emph{fine-grained and integrated} neuro-symbolic enhancement for each phase: (1) In the draft phase, we prompt reasoning models to generate concise natural-language subgoals to benefit the sketch phase, removing thinking tokens and references to human-written proofs; (2) In the sketch phase, subgoals are autoformalized with hypotheses to benefit the proving phase, and sketch lines containing syntactic errors are masked according to predefined rules; (3) In the proving phase, we tightly integrate symbolic search methods like Aesop with step provers to establish proofs for the sketch subgoals. Experimental results show that, without any additional model training or fine-tuning, DSP+ solves 80.7\%, 32.8\%, and 24 out of 644 problems from miniF2F, ProofNet, and PutnamBench, respectively, while requiring fewer budgets compared to state-of-the-arts. DSP+ proves \texttt{imo\_2019\_p1}, an IMO problem in miniF2F that is not solved by any prior work. Additionally, DSP+ generates proof patterns comprehensible by human experts, facilitating the identification of formalization errors; For example, eight wrongly formalized statements in miniF2F are discovered. Our results highlight the potential of classical reasoning patterns besides the RL-based training. All components will be open-sourced.

* 31 pages. Associated code and results are available at https://github.com/microsoft/DSP-Plus

Via

Access Paper or Ask Questions

Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

May 23, 2025

Zinuo Li, Xian Zhang, Yongxin Guo, Mohammed Bennamoun, Farid Boussaid, Girish Dwivedi, Luqi Gong, Qiuhong Ke

Abstract:Humans naturally understand moments in a video by integrating visual and auditory cues. For example, localizing a scene in the video like "A scientist passionately speaks on wildlife conservation as dramatic orchestral music plays, with the audience nodding and applauding" requires simultaneous processing of visual, audio, and speech signals. However, existing models often struggle to effectively fuse and interpret audio information, limiting their capacity for comprehensive video temporal understanding. To address this, we present TriSense, a triple-modality large language model designed for holistic video temporal understanding through the integration of visual, audio, and speech modalities. Central to TriSense is a Query-Based Connector that adaptively reweights modality contributions based on the input query, enabling robust performance under modality dropout and allowing flexible combinations of available inputs. To support TriSense's multimodal capabilities, we introduce TriSense-2M, a high-quality dataset of over 2 million curated samples generated via an automated pipeline powered by fine-tuned LLMs. TriSense-2M includes long-form videos and diverse modality combinations, facilitating broad generalization. Extensive experiments across multiple benchmarks demonstrate the effectiveness of TriSense and its potential to advance multimodal video analysis. Code and dataset will be publicly released.

Via

Access Paper or Ask Questions

Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning

Feb 19, 2025

Zenan Li, Zhaoyu Li, Wen Tang, Xian Zhang, Yuan Yao, Xujie Si, Fan Yang, Kaiyu Yang, Xiaoxing Ma

Figure 1 for Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning

Figure 2 for Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning

Figure 3 for Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning

Figure 4 for Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning

Abstract:Large language models (LLMs) can prove mathematical theorems formally by generating proof steps (\textit{a.k.a.} tactics) within a proof system. However, the space of possible tactics is vast and complex, while the available training data for formal proofs is limited, posing a significant challenge to LLM-based tactic generation. To address this, we introduce a neuro-symbolic tactic generator that synergizes the mathematical intuition learned by LLMs with domain-specific insights encoded by symbolic methods. The key aspect of this integration is identifying which parts of mathematical reasoning are best suited to LLMs and which to symbolic methods. While the high-level idea of neuro-symbolic integration is broadly applicable to various mathematical problems, in this paper, we focus specifically on Olympiad inequalities (Figure~1). We analyze how humans solve these problems and distill the techniques into two types of tactics: (1) scaling, handled by symbolic methods, and (2) rewriting, handled by LLMs. In addition, we combine symbolic tools with LLMs to prune and rank the proof goals for efficient proof search. We evaluate our framework on 161 challenging inequalities from multiple mathematics competitions, achieving state-of-the-art performance and significantly outperforming existing LLM and symbolic approaches without requiring additional training data.

* Published as a conference paper at ICLR 2025. Code is available at https://github.com/Lizn-zn/NeqLIPS/

Via

Access Paper or Ask Questions

Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints

Jan 12, 2025

Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang

Abstract:Multi-task visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. The majority of advanced methods predominantly focus on transformer-based multimodal fusion, aiming to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) is error-prone, leading to inconsistencies between multi-task predictions. Besides, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture ($\text{C}^3\text{VG}$), which integrates implicit and explicit modeling approaches within a two-stage framework. Initially, query and pixel decoders are employed to generate preliminary detection and segmentation outputs, a process referred to as the Rough Semantic Perception (RSP) stage. These coarse predictions are subsequently refined through the proposed Mask-guided Interaction Module (MIM) and a novel explicit bidirectional consistency constraint loss to ensure consistent representations across tasks, which we term the Refined Consistency Interaction (RCI) stage. Furthermore, to address the challenge of insufficient multimodal understanding, we leverage pre-trained models based on visual-linguistic fusion representations. Empirical evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the efficacy and soundness of $\text{C}^3\text{VG}$, which significantly outperforms state-of-the-art REC and RIS methods by a substantial margin. Code and model will be available at \url{https://github.com/Dmmm1997/C3VG}.

* AAAI2025

Via

Access Paper or Ask Questions

Neuro-Symbolic Data Generation for Math Reasoning

Dec 06, 2024

Zenan Li, Zhi Zhou, Yuan Yao, Yu-Feng Li, Chun Cao, Fan Yang, Xian Zhang, Xiaoxing Ma

Abstract:A critical question about Large Language Models (LLMs) is whether their apparent deficiency in mathematical reasoning is inherent, or merely a result of insufficient exposure to high-quality mathematical data. To explore this, we developed an automated method for generating high-quality, supervised mathematical datasets. The method carefully mutates existing math problems, ensuring both diversity and validity of the newly generated problems. This is achieved by a neuro-symbolic data generation framework combining the intuitive informalization strengths of LLMs, and the precise symbolic reasoning of math solvers along with projected Markov chain Monte Carlo sampling in the highly-irregular symbolic space. Empirical experiments demonstrate the high quality of data generated by the proposed method, and that the LLMs, specifically LLaMA-2 and Mistral, when realigned with the generated data, surpass their state-of-the-art counterparts.

* Published as a conference paper at NeurIPS 2024

Via

Access Paper or Ask Questions

Autoformalize Mathematical Statements by Symbolic Equivalence and Semantic Consistency

Oct 28, 2024

Zenan Li, Yifan Wu, Zhaoyu Li, Xinming Wei, Xian Zhang, Fan Yang, Xiaoxing Ma

Figure 1 for Autoformalize Mathematical Statements by Symbolic Equivalence and Semantic Consistency

Figure 2 for Autoformalize Mathematical Statements by Symbolic Equivalence and Semantic Consistency

Figure 3 for Autoformalize Mathematical Statements by Symbolic Equivalence and Semantic Consistency

Figure 4 for Autoformalize Mathematical Statements by Symbolic Equivalence and Semantic Consistency

Abstract:Autoformalization, the task of automatically translating natural language descriptions into a formal language, poses a significant challenge across various domains, especially in mathematics. Recent advancements in large language models (LLMs) have unveiled their promising capabilities to formalize even competition-level math problems. However, we observe a considerable discrepancy between pass@1 and pass@k accuracies in LLM-generated formalizations. To address this gap, we introduce a novel framework that scores and selects the best result from k autoformalization candidates based on two complementary self-consistency methods: symbolic equivalence and semantic consistency. Elaborately, symbolic equivalence identifies the logical homogeneity among autoformalization candidates using automated theorem provers, and semantic consistency evaluates the preservation of the original meaning by informalizing the candidates and computing the similarity between the embeddings of the original and informalized texts. Our extensive experiments on the MATH and miniF2F datasets demonstrate that our approach significantly enhances autoformalization accuracy, achieving up to 0.22-1.35x relative improvements across various LLMs and baseline methods.

* Published as a conference paper at NeurIPS 2024. Code is available at [this https URL](https://github.com/Miracle-Messi/Isa-AutoFormal)

Via

Access Paper or Ask Questions

Exploit CAM by itself: Complementary Learning System for Weakly Supervised Semantic Segmentation

Mar 04, 2023

Jiren Mai, Fei Zhang, Junjie Ye, Marcus Kalander, Xian Zhang, WanKou Yang, Tongliang Liu, Bo Han

Abstract:Weakly Supervised Semantic Segmentation (WSSS) with image-level labels has long been suffering from fragmentary object regions led by Class Activation Map (CAM), which is incapable of generating fine-grained masks for semantic segmentation. To guide CAM to find more non-discriminating object patterns, this paper turns to an interesting working mechanism in agent learning named Complementary Learning System (CLS). CLS holds that the neocortex builds a sensation of general knowledge, while the hippocampus specially learns specific details, completing the learned patterns. Motivated by this simple but effective learning pattern, we propose a General-Specific Learning Mechanism (GSLM) to explicitly drive a coarse-grained CAM to a fine-grained pseudo mask. Specifically, GSLM develops a General Learning Module (GLM) and a Specific Learning Module (SLM). The GLM is trained with image-level supervision to extract coarse and general localization representations from CAM. Based on the general knowledge in the GLM, the SLM progressively exploits the specific spatial knowledge from the localization representations, expanding the CAM in an explicit way. To this end, we propose the Seed Reactivation to help SLM reactivate non-discriminating regions by setting a boundary for activation values, which successively identifies more regions of CAM. Without extra refinement processes, our method is able to achieve breakthrough improvements for CAM of over 20.0% mIoU on PASCAL VOC 2012 and 10.0% mIoU on MS COCO 2014 datasets, representing a new state-of-the-art among existing WSSS methods.

Via

Access Paper or Ask Questions

EfficientGrasp: A Unified Data-Efficient Learning to Grasp Method for Multi-fingered Robot Hands

Jun 30, 2022

Kelin Li, Nicholas Baron, Xian Zhang, Nicolas Rojas

Figure 1 for EfficientGrasp: A Unified Data-Efficient Learning to Grasp Method for Multi-fingered Robot Hands

Figure 2 for EfficientGrasp: A Unified Data-Efficient Learning to Grasp Method for Multi-fingered Robot Hands

Figure 3 for EfficientGrasp: A Unified Data-Efficient Learning to Grasp Method for Multi-fingered Robot Hands

Figure 4 for EfficientGrasp: A Unified Data-Efficient Learning to Grasp Method for Multi-fingered Robot Hands

Abstract:Autonomous grasping of novel objects that are previously unseen to a robot is an ongoing challenge in robotic manipulation. In the last decades, many approaches have been presented to address this problem for specific robot hands. The UniGrasp framework, introduced recently, has the ability to generalize to different types of robotic grippers; however, this method does not work on grippers with closed-loop constraints and is data-inefficient when applied to robot hands with multigrasp configurations. In this paper, we present EfficientGrasp, a generalized grasp synthesis and gripper control method that is independent of gripper model specifications. EfficientGrasp utilizes a gripper workspace feature rather than UniGrasp's gripper attribute inputs. This reduces memory use by 81.7% during training and makes it possible to generalize to more types of grippers, such as grippers with closed-loop constraints. The effectiveness of EfficientGrasp is evaluated by conducting object grasping experiments both in simulation and real-world; results show that the proposed method also outperforms UniGrasp when considering only grippers without closed-loop constraints. In these cases, EfficientGrasp shows 9.85% higher accuracy in generating contact points and 3.10% higher grasping success rate in simulation. The real-world experiments are conducted with a gripper with closed-loop constraints, which UniGrasp fails to handle while EfficientGrasp achieves a success rate of 83.3%. The main causes of grasping failures of the proposed method are analyzed, highlighting ways of enhancing grasp performance.

* This paper has been accepted for publication in IEEE Robotics and Automation Letters (RA-L), and for presentation at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2022)

Via

Access Paper or Ask Questions

Identification and classification of exfoliated graphene flakes from microscopy images using a hierarchical deep convolutional neural network

Mar 29, 2022

Soroush Mahjoubi, Fan Ye, Yi Bao, Weina Meng, Xian Zhang

Figure 1 for Identification and classification of exfoliated graphene flakes from microscopy images using a hierarchical deep convolutional neural network

Figure 2 for Identification and classification of exfoliated graphene flakes from microscopy images using a hierarchical deep convolutional neural network

Figure 3 for Identification and classification of exfoliated graphene flakes from microscopy images using a hierarchical deep convolutional neural network

Figure 4 for Identification and classification of exfoliated graphene flakes from microscopy images using a hierarchical deep convolutional neural network

Abstract:Identification of the mechanically exfoliated graphene flakes and classification of the thickness is important in the nanomanufacturing of next-generation materials and devices that overcome the bottleneck of Moore's Law. Currently, identification and classification of exfoliated graphene flakes are conducted by human via inspecting the optical microscope images. The existing state-of-the-art automatic identification by machine learning is not able to accommodate images with different backgrounds while different backgrounds are unavoidable in experiments. This paper presents a deep learning method to automatically identify and classify the thickness of exfoliated graphene flakes on Si/SiO2 substrates from optical microscope images with various settings and background colors. The presented method uses a hierarchical deep convolutional neural network that is capable of learning new images while preserving the knowledge from previous images. The deep learning model was trained and used to classify exfoliated graphene flakes into monolayer (1L), bi-layer (2L), tri-layer (3L), four-to-six-layer (4-6L), seven-to-ten-layer (7-10L), and bulk categories. Compared with existing machine learning methods, the presented method possesses high accuracy and efficiency as well as robustness to the backgrounds and resolutions of images. The results indicated that our deep learning model has accuracy as high as 99% in identifying and classifying exfoliated graphene flakes. This research will shed light on scaled-up manufacturing and characterization of graphene for advanced materials and devices.

* 17 pages

Via

Access Paper or Ask Questions

Face Deblurring Based on Separable Normalization and Adaptive Denormalization

Dec 18, 2021

Xian Zhang, Hao Zhang, Jiancheng Lv, Xiaojie Li

Figure 1 for Face Deblurring Based on Separable Normalization and Adaptive Denormalization

Figure 2 for Face Deblurring Based on Separable Normalization and Adaptive Denormalization

Figure 3 for Face Deblurring Based on Separable Normalization and Adaptive Denormalization

Figure 4 for Face Deblurring Based on Separable Normalization and Adaptive Denormalization

Abstract:Face deblurring aims to restore a clear face image from a blurred input image with more explicit structure and facial details. However, most conventional image and face deblurring methods focus on the whole generated image resolution without consideration of special face part texture and generally produce unsufficient details. Considering that faces and backgrounds have different distribution information, in this study, we designed an effective face deblurring network based on separable normalization and adaptive denormalization (SNADNet). First, We fine-tuned the face parsing network to obtain an accurate face structure. Then, we divided the face parsing feature into face foreground and background. Moreover, we constructed a new feature adaptive denormalization to regularize fafcial structures as a condition of the auxiliary to generate more harmonious and undistorted face structure. In addition, we proposed a texture extractor and multi-patch discriminator to enhance the generated facial texture information. Experimental results on both CelebA and CelebA-HQ datasets demonstrate that the proposed face deblurring network restores face structure with more facial details and performs favorably against state-of-the-art methods in terms of structured similarity indexing method (SSIM), peak signal-to-noise ratio (PSNR), Frechet inception distance (FID) and L1, and qualitative comparisons.

Via

Access Paper or Ask Questions