Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jin Zeng

Seed1.5-VL Technical Report

May 11, 2025

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang(+187 more)

Abstract:We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)

Via

Access Paper or Ask Questions

Towards Robust Time-of-Flight Depth Denoising with Confidence-Aware Diffusion Model

Mar 25, 2025

Changyong He, Jin Zeng, Jiawei Zhang, Jiajie Guo

Abstract:Time-of-Flight (ToF) sensors efficiently capture scene depth, but the nonlinear depth construction procedure often results in extremely large noise variance or even invalid areas. Recent methods based on deep neural networks (DNNs) achieve enhanced ToF denoising accuracy but tend to struggle when presented with severe noise corruption due to limited prior knowledge of ToF data distribution. In this paper, we propose DepthCAD, a novel ToF denoising approach that ensures global structural smoothness by leveraging the rich prior knowledge in Stable Diffusion and maintains local metric accuracy by steering the diffusion process with confidence guidance. To adopt the pretrained image diffusion model to ToF depth denoising, we apply the diffusion on raw ToF correlation measurements with dynamic range normalization before converting to depth maps. Experimental results validate the state-of-the-art performance of the proposed scheme, and the evaluation on real data further verifies its robustness against real-world ToF noise.

Via

Access Paper or Ask Questions

URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

Jan 08, 2025

Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, Yujiu Yang

Figure 1 for URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

Figure 2 for URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

Figure 3 for URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

Figure 4 for URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

Abstract:Chain-of-thought (CoT) reasoning has been widely applied in the mathematical reasoning of Large Language Models (LLMs). Recently, the introduction of derivative process supervision on CoT trajectories has sparked discussions on enhancing scaling capabilities during test time, thereby boosting the potential of these models. However, in multimodal mathematical reasoning, the scarcity of high-quality CoT training data has hindered existing models from achieving high-precision CoT reasoning and has limited the realization of reasoning potential during test time. In this work, we propose a three-module synthesis strategy that integrates CoT distillation, trajectory-format rewriting, and format unification. It results in a high-quality CoT reasoning instruction fine-tuning dataset in multimodal mathematics, MMathCoT-1M. We comprehensively validate the state-of-the-art (SOTA) performance of the trained URSA-7B model on multiple multimodal mathematical benchmarks. For test-time scaling, we introduce a data synthesis strategy that automatically generates process annotation datasets, known as DualMath-1.1M, focusing on both interpretation and logic. By further training URSA-7B on DualMath-1.1M, we transition from CoT reasoning capabilities to robust supervision abilities. The trained URSA-RM-7B acts as a verifier, effectively enhancing the performance of URSA-7B at test time. URSA-RM-7B also demonstrates excellent out-of-distribution (OOD) verifying capabilities, showcasing its generalization. Model weights, training data and code will be open-sourced.

* 27 pages, 10 tables, 17 figures. The training data has been released. The code and model are currently undergoing internal review. They will be made available soon. Project url: https://ursa-math.github.io

Via

Access Paper or Ask Questions

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning

Aug 16, 2024

Wenwen Zhuang, Xin Huang, Xiantao Zhang, Jin Zeng

Abstract:Multimodal Large Language Models (MLLMs) excel in solving text-based mathematical problems, but they struggle with mathematical diagrams since they are primarily trained on natural scene images. For humans, visual aids generally enhance problem-solving, but MLLMs perform worse as information shifts from textual to visual modality. This decline is mainly due to their shortcomings in aligning images and text. To tackle aforementioned challenges, we propose Math-PUMA, a methodology focused on Progressive Upward Multimodal Alignment. This approach is designed to improve the mathematical reasoning skills of MLLMs through a three-stage training process, with the second stage being the critical alignment stage. We first enhance the language model's mathematical reasoning capabilities with extensive set of textual mathematical problems. We then construct a multimodal dataset with varying degrees of textual and visual information, creating data pairs by presenting each problem in at least two forms. By leveraging the Kullback-Leibler (KL) divergence of next-token prediction distributions to align visual and textual modalities, consistent problem-solving abilities are ensured. Finally, we utilize multimodal instruction tuning for MLLMs with high-quality multimodal data. Experimental results on multiple mathematical reasoning benchmarks demonstrate that the MLLMs trained with Math-PUMA surpass most open-source MLLMs. Our approach effectively narrows the performance gap for problems presented in different modalities.

Via

Access Paper or Ask Questions

Sparse Graph Learning with Eigen-gap for Spectral Filter Training in Graph Convolutional Networks

Feb 28, 2022

Jin Zeng, Saghar Bagheri, Yang Liu, Gene Cheung, Wei Hu

Figure 1 for Sparse Graph Learning with Eigen-gap for Spectral Filter Training in Graph Convolutional Networks

Figure 2 for Sparse Graph Learning with Eigen-gap for Spectral Filter Training in Graph Convolutional Networks

Figure 3 for Sparse Graph Learning with Eigen-gap for Spectral Filter Training in Graph Convolutional Networks

Abstract:It is now known that the expressive power of graph convolutional neural nets (GCN) does not grow infinitely with the number of layers. Instead, the GCN output approaches a subspace spanned by the first eigenvector of the normalized graph Laplacian matrix with the convergence rate characterized by the "eigen-gap": the difference between the Laplacian's first two distinct eigenvalues. To promote a deeper GCN architecture with sufficient expressiveness, in this paper, given an empirical covariance matrix $\bar{C}$ computed from observable data, we learn a sparse graph Laplacian matrix $L$ closest to $\bar{C}^{-1}$ while maintaining a desirable eigen-gap that slows down convergence. Specifically, we first define a sparse graph learning problem with constraints on the first eigenvector (the most common signal) and the eigen-gap. We solve the corresponding dual problem greedily, where a locally optimal eigen-pair is computed one at a time via a fast approximation of a semi-definite programming (SDP) formulation. The computed $L$ with the desired eigen-gap is normalized spectrally and used for supervised training of GCN for a targeted task. Experiments show that our proposal produced deeper GCNs and smaller errors compared to a competing scheme without explicit eigen-gap optimization.

Via

Access Paper or Ask Questions

Towards Geometry Guided Neural Relighting with Flash Photography

Aug 12, 2020

Di Qiu, Jin Zeng, Zhanghan Ke, Wenxiu Sun, Chengxi Yang

Figure 1 for Towards Geometry Guided Neural Relighting with Flash Photography

Figure 2 for Towards Geometry Guided Neural Relighting with Flash Photography

Figure 3 for Towards Geometry Guided Neural Relighting with Flash Photography

Figure 4 for Towards Geometry Guided Neural Relighting with Flash Photography

Abstract:Previous image based relighting methods require capturing multiple images to acquire high frequency lighting effect under different lighting conditions, which needs nontrivial effort and may be unrealistic in certain practical use scenarios. While such approaches rely entirely on cleverly sampling the color images under different lighting conditions, little has been done to utilize geometric information that crucially influences the high-frequency features in the images, such as glossy highlight and cast shadow. We therefore propose a framework for image relighting from a single flash photograph with its corresponding depth map using deep learning. By incorporating the depth map, our approach is able to extrapolate realistic high-frequency effects under novel lighting via geometry guided image decomposition from the flashlight image, and predict the cast shadow map from the shadow-encoding transformed depth map. Moreover, the single-image based setup greatly simplifies the data capture process. We experimentally validate the advantage of our geometry guided approach over state-of-the-art image-based approaches in intrinsic image decomposition and image relighting, and also demonstrate our performance on real mobile phone photo examples.

Via

Access Paper or Ask Questions

A Self-Attention Joint Model for Spoken Language Understanding in Situational Dialog Applications

May 27, 2019

Mengyang Chen, Jin Zeng, Jie Lou

Figure 1 for A Self-Attention Joint Model for Spoken Language Understanding in Situational Dialog Applications

Figure 2 for A Self-Attention Joint Model for Spoken Language Understanding in Situational Dialog Applications

Figure 3 for A Self-Attention Joint Model for Spoken Language Understanding in Situational Dialog Applications

Figure 4 for A Self-Attention Joint Model for Spoken Language Understanding in Situational Dialog Applications

Abstract:Spoken language understanding (SLU) acts as a critical component in goal-oriented dialog systems. It typically involves identifying the speakers intent and extracting semantic slots from user utterances, which are known as intent detection (ID) and slot filling (SF). SLU problem has been intensively investigated in recent years. However, these methods just constrain SF results grammatically, solve ID and SF independently, or do not fully utilize the mutual impact of the two tasks. This paper proposes a multi-head self-attention joint model with a conditional random field (CRF) layer and a prior mask. The experiments show the effectiveness of our model, as compared with state-of-the-art models. Meanwhile, online education in China has made great progress in the last few years. But there are few intelligent educational dialog applications for students to learn foreign languages. Hence, we design an intelligent dialog robot equipped with different scenario settings to help students learn communication skills.

Via

Access Paper or Ask Questions

Deep Surface Normal Estimation with Hierarchical RGB-D Fusion

Apr 06, 2019

Jin Zeng, Yanfeng Tong, Yunmu Huang, Qiong Yan, Wenxiu Sun, Jing Chen, Yongtian Wang

Figure 1 for Deep Surface Normal Estimation with Hierarchical RGB-D Fusion

Figure 2 for Deep Surface Normal Estimation with Hierarchical RGB-D Fusion

Figure 3 for Deep Surface Normal Estimation with Hierarchical RGB-D Fusion

Figure 4 for Deep Surface Normal Estimation with Hierarchical RGB-D Fusion

Abstract:The growing availability of commodity RGB-D cameras has boosted the applications in the field of scene understanding. However, as a fundamental scene understanding task, surface normal estimation from RGB-D data lacks thorough investigation. In this paper, a hierarchical fusion network with adaptive feature re-weighting is proposed for surface normal estimation from a single RGB-D image. Specifically, the features from color image and depth are successively integrated at multiple scales to ensure global surface smoothness while preserving visually salient details. Meanwhile, the depth features are re-weighted with a confidence map estimated from depth before merging into the color branch to avoid artifacts caused by input depth corruption. Additionally, a hybrid multi-scale loss function is designed to learn accurate normal estimation given noisy ground-truth dataset. Extensive experimental results validate the effectiveness of the fusion strategy and the loss design, outperforming state-of-the-art normal estimation schemes.

Via

Access Paper or Ask Questions

Deep Graph Laplacian Regularization

Jul 31, 2018

Jin Zeng, Jiahao Pang, Wenxiu Sun, Gene Cheung, Ruichao Xiao

Figure 1 for Deep Graph Laplacian Regularization

Figure 2 for Deep Graph Laplacian Regularization

Figure 3 for Deep Graph Laplacian Regularization

Figure 4 for Deep Graph Laplacian Regularization

Abstract:We propose to combine the robustness merit of model-based approaches and the learning power of data-driven approaches for image restoration. Specifically, by integrating graph Laplacian regularization as a trainable module into a deep learning framework, we are less susceptible to overfitting than pure CNN-based approaches, achieving higher robustness to small dataset and cross-domain denoising. First, a sparse neighborhood graph is built from the output of a convolutional neural network (CNN). Then the image is restored by solving an unconstrained quadratic programming problem, using a corresponding graph Laplacian regularizer as a prior term. The proposed restoration pipeline is fully differentiable and hence can be end-to-end trained. Experimental results demonstrate that our work avoids overfitting given small training data. It is also endowed with strong cross-domain generalization power, outperforming the state-of-the-art approaches by remarkable margin.

Via

Access Paper or Ask Questions

3D Point Cloud Denoising using Graph Laplacian Regularization of a Low Dimensional Manifold Model

Mar 20, 2018

Jin Zeng, Gene Cheung, Michael Ng, Jiahao Pang, Cheng Yang

Figure 1 for 3D Point Cloud Denoising using Graph Laplacian Regularization of a Low Dimensional Manifold Model

Figure 2 for 3D Point Cloud Denoising using Graph Laplacian Regularization of a Low Dimensional Manifold Model

Figure 3 for 3D Point Cloud Denoising using Graph Laplacian Regularization of a Low Dimensional Manifold Model

Figure 4 for 3D Point Cloud Denoising using Graph Laplacian Regularization of a Low Dimensional Manifold Model

Abstract:3D point cloud - a new signal representation of volumetric objects - is a discrete collection of triples marking exterior object surface locations in 3D space. Conventional imperfect acquisition processes of 3D point cloud - e.g., stereo-matching from multiple viewpoint images or depth data acquired directly from active light sensors - imply non-negligible noise in the data. In this paper, we adopt a previously proposed low-dimensional manifold model for the surface patches in the point cloud and seek self-similar patches to denoise them simultaneously using the patch manifold prior. Due to discrete observations of the patches on the manifold, we approximate the manifold dimension computation defined in the continuous domain with a patch-based graph Laplacian regularizer and propose a new discrete patch distance measure to quantify the similarity between two same-sized surface patches for graph construction that is robust to noise. We show that our graph Laplacian regularizer has a natural graph spectral interpretation, and has desirable numerical stability properties via eigenanalysis. Extensive simulation results show that our proposed denoising scheme can outperform state-of-the-art methods in objective metrics and can better preserve visually salient structural features like edges.

Via

Access Paper or Ask Questions