Zongyu Guo

Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

May 23, 2025

Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu

Abstract: Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent, which leverages an agentic search strategy over segmented video clips. Unlike previous video agents that rely on a manually designed, rigid workflow, our approach emphasizes the autonomous nature of agents. Given a set of search-centric tools over a multi-granular video database, our DVD agent leverages the advanced reasoning capability of the LLM to plan based on its current observation state, strategically select tools, formulate appropriate parameters for actions, and iteratively refine its internal reasoning in light of the gathered information. We perform a comprehensive evaluation on multiple long video understanding benchmarks that demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code will be released later.

* Under review
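
The plan-act-observe loop described in the abstract above can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: the tool names (`search_clips`, `inspect_clip`), the rule-based `stub_planner` that replaces the LLM, and the caption-only clip database are assumptions made for illustration, not the DVD agent's actual tools or interface.

```python
from typing import List, Tuple, Union

# Hypothetical clip database: (clip index, caption). Not the paper's data model.
Clip = Tuple[int, str]
Observation = Union[List[Clip], str]


def search_clips(db: List[Clip], query: str) -> List[Clip]:
    """Coarse tool (assumed): clips whose caption mentions the query word."""
    return [clip for clip in db if query.lower() in clip[1].lower()]


def inspect_clip(db: List[Clip], index: int) -> str:
    """Fine-grained tool (assumed): full caption of one clip."""
    return dict(db)[index]


def stub_planner(question: str, observations: List[Observation]):
    """Stands in for the LLM: chooses the next action from what was observed."""
    if not observations:
        # First step: search for the last word of the question.
        return ("search", question.split()[-1].rstrip("?"))
    last = observations[-1]
    if isinstance(last, list) and last:
        return ("inspect", last[0][0])  # look closer at the top hit
    if isinstance(last, str):
        return ("answer", last)  # enough evidence gathered
    return ("answer", "not found")  # search came back empty


def run_agent(db: List[Clip], question: str, planner=stub_planner, max_steps: int = 5) -> str:
    """Iterate: plan on current observations, call a tool, record the result."""
    observations: List[Observation] = []
    for _ in range(max_steps):
        action, arg = planner(question, observations)
        if action == "answer":
            return arg
        if action == "search":
            observations.append(search_clips(db, arg))
        elif action == "inspect":
            observations.append(inspect_clip(db, arg))
    return "no answer within step budget"
```

In this sketch the planner, not a fixed pipeline, decides which tool to call at each step; swapping `stub_planner` for a model-backed planner would leave `run_agent` unchanged.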

UniMIC: Towards Universal Multi-modality Perceptual Image Compression

Dec 09, 2024

Yixin Gao, Xin Li, Xiaohan Pan, Runsen Feng, Zongyu Guo, Yiting Lu, Yulin Ren, Zhibo Chen

Abstract: We present UniMIC, a universal multi-modality image compression framework that aims to unify the rate-distortion-perception (RDP) optimization for multiple image codecs simultaneously by exploiting cross-modality generative priors. Unlike most existing works that need to design and optimize image codecs from scratch, our UniMIC introduces the visual codec repository, which incorporates a number of representative image codecs and directly uses them as the basic codecs for various practical applications. Moreover, we propose multi-grained textual coding, where a variable-length content prompt and a compression prompt are designed and encoded to assist the perceptual reconstruction through multi-modality conditional generation. In particular, a universal perception compensator is proposed to improve the perceptual quality of decoded images from all basic codecs at the decoder side by reusing text-assisted diffusion priors from Stable Diffusion. With the cooperation of the above three strategies, our UniMIC achieves a significant improvement of RDP optimization for different compression codecs, e.g., traditional and learnable codecs, and different compression costs, e.g., ultra-low bitrates. The code will be available at https://github.com/Amygyx/UniMIC .

GTPC-SSCD: Gate-guided Two-level Perturbation Consistency-based Semi-Supervised Change Detection

Nov 28, 2024

Yan Xing, Qi'ao Xu, Zongyu Guo, Rui Huang, Yuxiang Zhang

Abstract: Semi-supervised change detection (SSCD) employs partially labeled data and a substantial amount of unlabeled data to identify differences between images captured in the same geographic area but at different times. However, existing consistency regularization-based SSCD methods only implement perturbations at a single level and cannot exploit the full potential of unlabeled data. In this paper, we introduce a novel Gate-guided Two-level Perturbation Consistency regularization-based SSCD method (GTPC-SSCD), which simultaneously maintains strong-to-weak consistency at the image level and perturbation consistency at the feature level, thus effectively utilizing the unlabeled data. Moreover, a gate module is designed to evaluate the training complexity of different samples and determine the necessity of performing feature perturbations on each sample. This differential treatment enables the network to more effectively explore the potential of unlabeled data. Extensive experiments conducted on six public remote sensing change detection datasets demonstrate the superiority of our method over seven state-of-the-art SSCD methods.

* 6 pages, 4 figures

Conditional Neural Video Coding with Spatial-Temporal Super-Resolution

Jan 25, 2024

Henan Wang, Xiaohan Pan, Runsen Feng, Zongyu Guo, Zhibo Chen

Abstract: This document is an expanded version of a one-page abstract originally presented at the 2024 Data Compression Conference. It describes our proposed method for the video track of the Challenge on Learned Image Compression (CLIC) 2024. Our scheme follows the typical hybrid coding framework with some novel techniques. First, we adopt the SpyNet network to produce accurate motion vectors for motion estimation. Second, we introduce a context mining scheme with conditional frame coding to fully exploit the spatial-temporal information. To handle the low target bitrates set by CLIC, we integrate spatial-temporal super-resolution modules to improve rate-distortion performance. Our team name is IMCLVC.

* Accepted by the 2024 Data Compression Conference (DCC) for presentation as a poster

Spy-Watermark: Robust Invisible Watermarking for Backdoor Attack

Jan 04, 2024

Ruofei Wang, Renjie Wan, Zongyu Guo, Qing Guo, Rui Huang

Abstract: A backdoor attack aims to deceive a victim model on backdoor instances while maintaining its performance on benign data. Current methods use manual patterns or special perturbations as triggers, but they often overlook robustness against data corruption, making backdoor attacks easy to defend against in practice. To address this issue, we propose a novel backdoor attack method named Spy-Watermark, which remains effective when facing data collapse and backdoor defense. Specifically, we introduce a learnable watermark embedded in the latent domain of images, serving as the trigger. Then, we search for a watermark that can withstand collapse during image decoding, and combine it with several anti-collapse operations to further enhance the resilience of our trigger against data corruption. Extensive experiments are conducted on the CIFAR10, GTSRB, and ImageNet datasets, demonstrating that Spy-Watermark outperforms ten state-of-the-art methods in terms of robustness and stealthiness.

* Accepted by ICASSP2024

RECOMBINER: Robust and Enhanced Compression with Bayesian Implicit Neural Representations

Sep 29, 2023

Jiajun He, Gergely Flamich, Zongyu Guo, José Miguel Hernández-Lobato

Abstract: COMpression with Bayesian Implicit NEural Representations (COMBINER) is a recent data compression method that addresses a key inefficiency of previous Implicit Neural Representation (INR)-based approaches: it avoids quantization and enables direct optimization of the rate-distortion performance. However, COMBINER still has significant limitations: 1) it uses factorized priors and posterior approximations that lack flexibility; 2) it cannot effectively adapt to local deviations from global patterns in the data; and 3) its performance can be susceptible to modeling choices and the variational parameters' initializations. Our proposed method, Robust and Enhanced COMBINER (RECOMBINER), addresses these issues by 1) enriching the variational approximation while maintaining its computational cost via a linear reparameterization of the INR weights, 2) augmenting our INRs with learnable positional encodings that enable them to adapt to local details and 3) splitting high-resolution data into patches to increase robustness and utilizing expressive hierarchical priors to capture dependency across patches. We conduct extensive experiments across several data modalities, showcasing that RECOMBINER achieves competitive results with the best INR-based methods and even outperforms autoencoder-based codecs on low-resolution images at low bitrates.

* 24 pages, 13 figures

Compression with Bayesian Implicit Neural Representations

May 30, 2023

Zongyu Guo, Gergely Flamich, Jiajun He, Zhibo Chen, José Miguel Hernández-Lobato

Abstract: Many common types of data can be represented as functions that map coordinates to signal values, such as pixel locations to RGB values in the case of an image. Based on this view, data can be compressed by overfitting a compact neural network to its functional representation and then encoding the network weights. However, most current solutions for this are inefficient, as quantization to low-bit precision substantially degrades the reconstruction quality. To address this issue, we propose overfitting variational Bayesian neural networks to the data and compressing an approximate posterior weight sample using relative entropy coding instead of quantizing and entropy coding it. This strategy enables direct optimization of the rate-distortion performance by minimizing the $\beta$-ELBO, and makes it possible to target different rate-distortion trade-offs for a given network architecture by adjusting $\beta$. Moreover, we introduce an iterative algorithm for learning prior weight distributions and employ a progressive refinement process for the variational posterior that significantly enhances performance. Experiments show that our method achieves strong performance on image and audio compression while retaining simplicity.

* Preprint
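
For orientation, the $\beta$-ELBO mentioned in the abstract above is commonly written in the following form. The notation is an assumption made here for illustration ($\mathcal{D}$ for the data, $w$ for the network weights, $q$ for the variational posterior, $p(w)$ for the prior) and may differ from the paper's own symbols:

$$
\mathcal{L}_\beta(q) \;=\; \underbrace{\mathbb{E}_{w \sim q}\big[-\log p(\mathcal{D} \mid w)\big]}_{\text{distortion}} \;+\; \beta \, \underbrace{D_{\mathrm{KL}}\big(q(w) \,\|\, p(w)\big)}_{\text{rate}}
$$

Under this reading, the KL term approximates the cost of transmitting a posterior sample with relative entropy coding, so increasing $\beta$ favors lower rate at the price of higher distortion.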

NVTC: Nonlinear Vector Transform Coding

May 25, 2023

Runsen Feng, Zongyu Guo, Weiping Li, Zhibo Chen

Abstract: In theory, vector quantization (VQ) is always better than scalar quantization (SQ) in terms of rate-distortion (R-D) performance. Recent state-of-the-art methods for neural image compression are mainly based on nonlinear transform coding (NTC) with uniform scalar quantization, overlooking the benefits of VQ due to its exponentially increased complexity. In this paper, we first investigate some toy sources, demonstrating that even if modern neural networks considerably enhance the compression performance of SQ with nonlinear transform, there is still an insurmountable chasm between SQ and VQ. Therefore, revolving around VQ, we propose a novel framework for neural image compression named Nonlinear Vector Transform Coding (NVTC). NVTC solves the critical complexity issue of VQ through (1) a multi-stage quantization strategy and (2) nonlinear vector transforms. In addition, we apply entropy-constrained VQ in latent space to adaptively determine the quantization boundaries for joint rate-distortion optimization, which improves the performance both theoretically and experimentally. Compared to previous NTC approaches, NVTC demonstrates superior rate-distortion performance, faster decoding speed, and smaller model size. Our code is available at https://github.com/USTC-IMCL/NVTC

* Accepted by CVPR 2023
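
The entropy-constrained VQ step mentioned in the abstract above can be illustrated with the textbook assignment rule: pick the codeword that minimizes squared error plus a rate penalty. This is a minimal sketch of generic ECVQ, not the NVTC implementation; the function name `ecvq_assign`, the fixed codebook, and the fixed codeword probabilities are assumptions for illustration.

```python
import math
from typing import Sequence


def ecvq_assign(
    x: Sequence[float],
    codebook: Sequence[Sequence[float]],
    probs: Sequence[float],
    lam: float,
) -> int:
    """Return the index i minimizing ||x - c_i||^2 + lam * (-log2 p_i).

    With lam = 0 this reduces to nearest-neighbor VQ; larger lam shifts the
    decision boundaries toward cheaper (more probable) codewords.
    """
    best_index, best_cost = -1, math.inf
    for i, (codeword, p) in enumerate(zip(codebook, probs)):
        distortion = sum((xi - ci) ** 2 for xi, ci in zip(x, codeword))
        rate_bits = -math.log2(p)  # ideal code length of codeword i
        cost = distortion + lam * rate_bits
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index
```

The rate term is what lets the quantization boundaries adapt: a point slightly closer to a rare codeword can still be assigned to a common one when the bit savings outweigh the extra distortion.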

Exploring the Rate-Distortion-Complexity Optimization in Neural Image Compression

May 12, 2023

Yixin Gao, Runsen Feng, Zongyu Guo, Zhibo Chen

Abstract: Despite a short history, neural image codecs have been shown to surpass classical image codecs in terms of rate-distortion performance. However, most of them suffer from significantly longer decoding times, which hinders the practical application of neural image codecs. This issue is especially pronounced when employing an effective yet time-consuming autoregressive context model, since it increases entropy decoding time by orders of magnitude. In this paper, unlike most previous works that pursue optimal RD performance while temporarily overlooking the coding complexity, we make a systematic investigation of rate-distortion-complexity (RDC) optimization in neural image compression. By quantifying the decoding complexity as a factor in the optimization goal, we are able to precisely control the RDC trade-off and then demonstrate how the rate-distortion performance of neural image codecs can adapt to various complexity demands. Going beyond the investigation of RDC optimization, a variable-complexity neural codec is designed to leverage the spatial dependencies adaptively according to industrial demands, which supports fine-grained complexity adjustment by balancing the RDC trade-off. By implementing this scheme in a powerful base model, we demonstrate the feasibility and flexibility of RDC optimization for neural image codecs.

Versatile Neural Processes for Learning Implicit Neural Representations

Jan 21, 2023

Zongyu Guo, Cuiling Lan, Zhizheng Zhang, Zhibo Chen, Yan Lu

Abstract: Representing a signal as a continuous function parameterized by a neural network (a.k.a. Implicit Neural Representations, INRs) has attracted increasing attention in recent years. Neural Processes (NPs), which model the distributions over functions conditioned on partial observations (context set), provide a practical solution for fast inference of continuous functions. However, existing NP architectures suffer from inferior modeling capability for complex signals. In this paper, we propose an efficient NP framework dubbed Versatile Neural Processes (VNP), which largely increases the capability of approximating functions. Specifically, we introduce a bottleneck encoder that produces fewer but more informative context tokens, relieving the high computational cost while providing high modeling capability. At the decoder side, we hierarchically learn multiple global latent variables that jointly model the global structure and the uncertainty of a function, enabling our model to capture the distribution of complex signals. We demonstrate the effectiveness of the proposed VNP on a variety of tasks involving 1D, 2D and 3D signals. In particular, our method shows promise in learning accurate INRs of a 3D scene without further finetuning.

* Accepted by ICLR2023
