Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Nam Ling

TVC: Tokenized Video Compression with Ultra-Low Bitrate

Apr 22, 2025

Lebin Zhou, Cihan Ruan, Nam Ling, Wei Wang, Wei Jiang

Abstract:Tokenized visual representations have shown great promise in image compression, yet their extension to video remains underexplored due to the challenges posed by complex temporal dynamics and stringent bitrate constraints. In this paper, we propose Tokenized Video Compression (TVC), the first token-based dual-stream video compression framework designed to operate effectively at ultra-low bitrates. TVC leverages the powerful Cosmos video tokenizer to extract both discrete and continuous token streams. The discrete tokens (i.e., code maps generated by FSQ) are partially masked using a strategic masking scheme, then compressed losslessly with a discrete checkerboard context model to reduce transmission overhead. The masked tokens are reconstructed by a decoder-only transformer with spatiotemporal token prediction. Meanwhile, the continuous tokens, produced via an autoencoder (AE), are quantized and compressed using a continuous checkerboard context model, providing complementary continuous information at ultra-low bitrate. At the Decoder side, both streams are fused using ControlNet, with multi-scale hierarchical integration to ensure high perceptual quality alongside strong fidelity in reconstruction. This work mitigates the long-standing skepticism about the practicality of tokenized video compression and opens up new avenues for semantics-aware, token-native video compression.

Via

Access Paper or Ask Questions

Stereo Image Coding for Machines with Joint Visual Feature Compression

Feb 20, 2025

Dengchao Jin, Jianjun Lei, Bo Peng, Zhaoqing Pan, Nam Ling, Qingming Huang

Abstract:2D image coding for machines (ICM) has achieved great success in coding efficiency, while less effort has been devoted to stereo image fields. To promote the efficiency of stereo image compression (SIC) and intelligent analysis, the stereo image coding for machines (SICM) is formulated and explored in this paper. More specifically, a machine vision-oriented stereo feature compression network (MVSFC-Net) is proposed for SICM, where the stereo visual features are effectively extracted, compressed, and transmitted for 3D visual task. To efficiently compress stereo visual features in MVSFC-Net, a stereo multi-scale feature compression (SMFC) module is designed to gradually transform sparse stereo multi-scale features into compact joint visual representations by removing spatial, inter-view, and cross-scale redundancies simultaneously. Experimental results show that the proposed MVSFC-Net obtains superior compression efficiency as well as 3D visual task performance, when compared with the existing ICM anchors recommended by MPEG and the state-of-the-art SIC method.

Via

Access Paper or Ask Questions

WTDUN: Wavelet Tree-Structured Sampling and Deep Unfolding Network for Image Compressed Sensing

Nov 25, 2024

Kai Han, Jin Wang, Yunhui Shi, Hanqin Cai, Nam Ling, Baocai Yin

Abstract:Deep unfolding networks have gained increasing attention in the field of compressed sensing (CS) owing to their theoretical interpretability and superior reconstruction performance. However, most existing deep unfolding methods often face the following issues: 1) they learn directly from single-channel images, leading to a simple feature representation that does not fully capture complex features; and 2) they treat various image components uniformly, ignoring the characteristics of different components. To address these issues, we propose a novel wavelet-domain deep unfolding framework named WTDUN, which operates directly on the multi-scale wavelet subbands. Our method utilizes the intrinsic sparsity and multi-scale structure of wavelet coefficients to achieve a tree-structured sampling and reconstruction, effectively capturing and highlighting the most important features within images. Specifically, the design of tree-structured reconstruction aims to capture the inter-dependencies among the multi-scale subbands, enabling the identification of both fine and coarse features, which can lead to a marked improvement in reconstruction quality. Furthermore, a wavelet domain adaptive sampling method is proposed to greatly improve the sampling capability, which is realized by assigning measurements to each wavelet subband based on its importance. Unlike pure deep learning methods that treat all components uniformly, our method introduces a targeted focus on important subbands, considering their energy and sparsity. This targeted strategy lets us capture key information more efficiently while discarding less important information, resulting in a more effective and detailed reconstruction. Extensive experimental results on various datasets validate the superior performance of our proposed method.

* 20pages,Accepted by ACM Transactions on Multimedia Computing Communications and Applications (TOMM)

Via

Access Paper or Ask Questions

GameIR: A Large-Scale Synthesized Ground-Truth Dataset for Image Restoration over Gaming Content

Aug 29, 2024

Lebin Zhou, Kun Han, Nam Ling, Wei Wang, Wei Jiang

Abstract:Image restoration methods like super-resolution and image synthesis have been successfully used in commercial cloud gaming products like NVIDIA's DLSS. However, restoration over gaming content is not well studied by the general public. The discrepancy is mainly caused by the lack of ground-truth gaming training data that match the test cases. Due to the unique characteristics of gaming content, the common approach of generating pseudo training data by degrading the original HR images results in inferior restoration performance. In this work, we develop GameIR, a large-scale high-quality computer-synthesized ground-truth dataset to fill in the blanks, targeting at two different applications. The first is super-resolution with deferred rendering, to support the gaming solution of rendering and transferring LR images only and restoring HR images on the client side. We provide 19200 LR-HR paired ground-truth frames coming from 640 videos rendered at 720p and 1440p for this task. The second is novel view synthesis (NVS), to support the multiview gaming solution of rendering and transferring part of the multiview frames and generating the remaining frames on the client side. This task has 57,600 HR frames from 960 videos of 160 scenes with 6 camera views. In addition to the RGB frames, the GBuffers during the deferred rendering stage are also provided, which can be used to help restoration. Furthermore, we evaluate several SOTA super-resolution algorithms and NeRF-based NVS algorithms over our dataset, which demonstrates the effectiveness of our ground-truth GameIR data in improving restoration performance for gaming content. Also, we test the method of incorporating the GBuffers as additional input information for helping super-resolution and NVS. We release our dataset and models to the general public to facilitate research on restoration methods over gaming content.

Via

Access Paper or Ask Questions

HAAP: Vision-context Hierarchical Attention Autoregressive with Adaptive Permutation for Scene Text Recognition

May 15, 2024

Honghui Chen, Yuhang Qiu, Jiabao Wang, Pingping Chen, Nam Ling

Abstract:Internal Language Model (LM)-based methods use permutation language modeling (PLM) to solve the error correction caused by conditional independence in external LM-based methods. However, random permutations of human interference cause fit oscillations in the model training, and Iterative Refinement (IR) operation to improve multimodal information decoupling also introduces additional overhead. To address these issues, this paper proposes the Hierarchical Attention autoregressive Model with Adaptive Permutation (HAAP) to enhance the location-context-image interaction capability, improving autoregressive generalization with internal LM. First, we propose Implicit Permutation Neurons (IPN) to generate adaptive attention masks to dynamically exploit token dependencies. The adaptive masks increase the diversity of training data and prevent model dependency on a specific order. It reduces the training overhead of PLM while avoiding training fit oscillations. Second, we develop Cross-modal Hierarchical Attention mechanism (CHA) to couple context and image features. This processing establishes rich positional semantic dependencies between context and image while avoiding IR. Extensive experimental results show the proposed HAAP achieves state-of-the-art (SOTA) performance in terms of accuracy, complexity, and latency on several datasets.

* 12 pages, 10 figures

Via

Access Paper or Ask Questions

Novel View Synthesis from a Single Image via Unsupervised learning

Oct 29, 2021

Bingzheng Liu, Jianjun Lei, Bo Peng, Chuanbo Yu, Wanqing Li, Nam Ling

Figure 1 for Novel View Synthesis from a Single Image via Unsupervised learning

Figure 2 for Novel View Synthesis from a Single Image via Unsupervised learning

Figure 3 for Novel View Synthesis from a Single Image via Unsupervised learning

Figure 4 for Novel View Synthesis from a Single Image via Unsupervised learning

Abstract:View synthesis aims to generate novel views from one or more given source views. Although existing methods have achieved promising performance, they usually require paired views of different poses to learn a pixel transformation. This paper proposes an unsupervised network to learn such a pixel transformation from a single source viewpoint. In particular, the network consists of a token transformation module (TTM) that facilities the transformation of the features extracted from a source viewpoint image into an intrinsic representation with respect to a pre-defined reference pose and a view generation module (VGM) that synthesizes an arbitrary view from the representation. The learned transformation allows us to synthesize a novel view from any single source viewpoint image of unknown pose. Experiments on the widely used view synthesis datasets have demonstrated that the proposed network is able to produce comparable results to the state-of-the-art methods despite the fact that learning is unsupervised and only a single source viewpoint image is required for generating a novel view. The code will be available soon.

* 9 pages, submitted to TCSVT

Via

Access Paper or Ask Questions

Improved Image Coding Autoencoder With Deep Learning

Feb 28, 2020

Licheng Xiao, Hairong Wang, Nam Ling

Figure 1 for Improved Image Coding Autoencoder With Deep Learning

Figure 2 for Improved Image Coding Autoencoder With Deep Learning

Figure 3 for Improved Image Coding Autoencoder With Deep Learning

Figure 4 for Improved Image Coding Autoencoder With Deep Learning

Abstract:In this paper, we build autoencoder based pipelines for extreme end-to-end image compression based on Ball\'e's approach, which is the state-of-the-art open source implementation in image compression using deep learning. We deepened the network by adding one more hidden layer before each strided convolutional layer with exactly the same number of down-samplings and up-samplings. Our approach outperformed Ball\'e's approach, and achieved around 4.0% reduction in bits per pixel (bpp), 0.03% increase in multi-scale structural similarity (MS-SSIM), and only 0.47% decrease in peak signal-to-noise ratio (PSNR), It also outperforms all traditional image compression methods including JPEG2000 and HEIC by at least 20% in terms of compression efficiency at similar reconstruction image quality. Regarding encoding and decoding time, our approach takes similar amount of time compared with traditional methods with the support of GPU, which means it's almost ready for industrial applications.

Via

Access Paper or Ask Questions

HSCS: Hierarchical Sparsity Based Co-saliency Detection for RGBD Images

Nov 16, 2018

Runmin Cong, Jianjun Lei, Huazhu Fu, Qingming Huang, Xiaochun Cao, Nam Ling

Figure 1 for HSCS: Hierarchical Sparsity Based Co-saliency Detection for RGBD Images

Figure 2 for HSCS: Hierarchical Sparsity Based Co-saliency Detection for RGBD Images

Figure 3 for HSCS: Hierarchical Sparsity Based Co-saliency Detection for RGBD Images

Figure 4 for HSCS: Hierarchical Sparsity Based Co-saliency Detection for RGBD Images

Abstract:Co-saliency detection aims to discover common and salient objects in an image group containing more than two relevant images. Moreover, depth information has been demonstrated to be effective for many computer vision tasks. In this paper, we propose a novel co-saliency detection method for RGBD images based on hierarchical sparsity reconstruction and energy function refinement. With the assistance of the intra saliency map, the inter-image correspondence is formulated as a hierarchical sparsity reconstruction framework. The global sparsity reconstruction model with a ranking scheme focuses on capturing the global characteristics among the whole image group through a common foreground dictionary. The pairwise sparsity reconstruction model aims to explore the corresponding relationship between pairwise images through a set of pairwise dictionaries. In order to improve the intra-image smoothness and inter-image consistency, an energy function refinement model is proposed, which includes the unary data term, spatial smooth term, and holistic consistency term. Experiments on two RGBD co-saliency detection benchmarks demonstrate that the proposed method outperforms the state-of-the-art algorithms both qualitatively and quantitatively.

* 11 pages, 5 figures, Accepted by IEEE Transactions on Multimedia, https://rmcong.github.io/

Via

Access Paper or Ask Questions