Abstract: Ultra-low bitrate image compression is a challenging and demanding topic. With the development of Large Multimodal Models (LMMs), a Cross Modality Compression (CMC) paradigm of Image-Text-Image has emerged. Compared with traditional codecs, this semantic-level compression can reduce the image data size to 0.1% or even lower, giving it strong application potential. However, CMC still suffers from deficiencies in consistency with the original image and in perceptual quality. To address this problem, we introduce CMC-Bench, a benchmark of the cooperative performance of Image-to-Text (I2T) and Text-to-Image (T2I) models for image compression. The benchmark covers 18,000 and 40,000 images to evaluate 6 mainstream I2T and 12 T2I models respectively, together with 160,000 subjective preference scores annotated by human experts. At ultra-low bitrates, this paper shows that combinations of certain I2T and T2I models have already surpassed the most advanced visual signal codecs, while also highlighting where LMMs can be further optimized for the compression task. We encourage LMM developers to participate in this test to promote the evolution of visual signal codec protocols.
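For context, the Image-Text-Image paradigm evaluated by CMC-Bench can be sketched in a few lines: an I2T model turns the image into a caption, the caption is the entire bitstream, and a T2I model regenerates an image from it. The sketch below is only an illustration under assumed model choices (BLIP as the I2T model, Stable Diffusion as the T2I model, and an arbitrary input path); CMC-Bench itself evaluates 6 I2T and 12 T2I models.

```python
# Minimal sketch of the Image-Text-Image (CMC) pipeline.
# Model choices and the file path are illustrative assumptions, not part of CMC-Bench.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import StableDiffusionPipeline

image = Image.open("input.png").convert("RGB")

# I2T stage: compress the image into a short caption.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
tokens = captioner.generate(**processor(image, return_tensors="pt"), max_new_tokens=60)
caption = processor.decode(tokens[0], skip_special_tokens=True)

# The caption *is* the bitstream, so bits-per-pixel follows from its UTF-8 length.
bpp = len(caption.encode("utf-8")) * 8 / (image.width * image.height)
print(f"caption: {caption!r}, {bpp:.5f} bpp")

# T2I stage: regenerate an image from the caption alone.
generator = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
reconstruction = generator(caption).images[0]
reconstruction.save("reconstruction.png")
```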
Abstract: With the evolution of storage and communication protocols, ultra-low bitrate image compression has become a highly demanding topic. However, existing compression algorithms must sacrifice either consistency with the ground truth or perceptual quality at ultra-low bitrates. In recent years, the rapid development of Large Multimodal Models (LMMs) has made it possible to balance these two goals. To solve this problem, this paper proposes a method called Multimodal Image Semantic Compression (MISC), which consists of an LMM encoder to extract the semantic information of the image, a map encoder to locate the region corresponding to each semantic, an image encoder to generate an extremely compressed bitstream, and a decoder to reconstruct the image based on the above information. Experimental results show that the proposed MISC is suitable for compressing both traditional Natural Scene Images (NSIs) and emerging AI-Generated Images (AIGIs). It achieves optimal consistency and perceptual quality while saving 50% of the bitrate, giving it strong application potential in next-generation storage and communication. The code will be released at https://github.com/lcysyzxdxc/MISC.
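The four-module structure described above can be read as a simple data flow: semantic text from the LMM encoder, one spatial map per semantic from the map encoder, and a heavily compressed image code, all consumed by the decoder. The sketch below only mirrors that structure; every name and type in it is a placeholder assumption, not the released MISC implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical three-part bitstream mirroring the MISC description above.
@dataclass
class MISCBitstream:
    semantics: List[str]                       # text produced by the LMM encoder
    regions: List[Tuple[int, int, int, int]]   # one box per semantic, from the map encoder
    image_code: bytes                          # extremely compressed code from the image encoder

def misc_encode(image, lmm_encoder: Callable, map_encoder: Callable,
                image_encoder: Callable) -> MISCBitstream:
    """Sketch of the encoding flow; the three encoder callables are assumptions."""
    semantics = lmm_encoder(image)                        # semantic descriptions of the image
    regions = [map_encoder(image, s) for s in semantics]  # localize each semantic in the image
    return MISCBitstream(semantics, regions, image_encoder(image))

def misc_decode(bitstream: MISCBitstream, decoder: Callable):
    """The decoder reconstructs the image conditioned on all three streams."""
    return decoder(bitstream.image_code, bitstream.semantics, bitstream.regions)
```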
Abstract: Current per-shot encoding schemes aim to improve compression efficiency through shot-level optimization: they split a source video sequence into shots and apply an optimal set of encoding parameters to each shot. Per-shot encoding achieves approximately 20% bitrate savings over baseline fixed-QP encoding at the expense of pre-processing complexity. However, the adjustable parameter space of current per-shot encoding schemes covers only spatial resolution and QP/CRF, which limits encoding flexibility. In this paper, we extend the per-shot encoding framework along the complexity dimension. We believe that per-shot encoding with flexible complexity will help in deploying user-generated content. We propose a rate-distortion-complexity optimization process for encoders and a methodology to determine the coding parameters under complexity constraints and bitrate ladders. Experimental results show that our proposed method densely covers complexity constraints ranging from 100% down to 3% of the slowest per-shot anchor. At complexities similar to those of the per-shot scheme fixed at specific presets, our proposed method achieves a BD-rate gain of up to -19.17%.
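The bitrate comparisons above (the roughly 20% savings and the -19.17% BD-rate gain) are conventionally reported as Bjøntegaard delta rate. The function below is a standard formulation of that metric for two rate-quality curves, shown with made-up example points; it is not code or data from the paper.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate in percent: average bitrate difference of the test
    codec versus the anchor at equal quality. Negative values mean bitrate savings."""
    log_ra, log_rt = np.log(rate_anchor), np.log(rate_test)
    # Fit third-order polynomials of log-rate as a function of quality (e.g. PSNR).
    poly_a = np.polyfit(psnr_anchor, log_ra, 3)
    poly_t = np.polyfit(psnr_test, log_rt, 3)
    # Integrate both curves over the overlapping quality interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    area_a = np.polyval(np.polyint(poly_a), [lo, hi])
    area_t = np.polyval(np.polyint(poly_t), [lo, hi])
    mean_log_a = (area_a[1] - area_a[0]) / (hi - lo)
    mean_log_t = (area_t[1] - area_t[0]) / (hi - lo)
    return (np.exp(mean_log_t - mean_log_a) - 1) * 100

# Usage with made-up rate-quality points (bitrates in kbps, quality in PSNR dB).
anchor = ([1000, 2000, 4000, 8000], [34.0, 36.5, 38.8, 40.6])
test   = ([ 800, 1600, 3200, 6400], [34.1, 36.6, 38.9, 40.7])
print(f"BD-rate: {bd_rate(*anchor, *test):.2f}%")  # negative => test codec saves bitrate
```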