Abstract:Recent advances in generative compression methods have demonstrated remarkable progress in enhancing the perceptual quality of compressed data, especially in scenarios with low bitrates. Nevertheless, their efficacy and applicability in achieving extreme compression ratios ($<0.1$ bpp) still remain constrained. In this work, we propose a simple yet effective coding framework by introducing vector quantization (VQ)-based generative models into the image compression domain. The main insight is that the codebook learned by the VQGAN model yields strong expressive capacity, facilitating efficient compression of continuous information in the latent space while maintaining reconstruction quality. Specifically, an image can be represented as VQ-indices by finding the nearest codeword, which can be encoded using lossless compression methods into bitstreams. We then propose clustering a pre-trained large-scale codebook into smaller codebooks using the K-means algorithm. This enables images to be represented as diverse ranges of VQ-indices maps, resulting in variable bitrates and different levels of reconstruction quality. Extensive qualitative and quantitative experiments on various datasets demonstrate that the proposed framework outperforms the state-of-the-art codecs in terms of perceptual quality-oriented metrics and human perception under extremely low bitrates.
Abstract:This paper describes our approach to the multi-modal fact verification (FACTIFY) challenge at AAAI2023. In recent years, with the widespread use of social media, fake news can spread rapidly and negatively impact social security. Automatic claim verification becomes more and more crucial to combat fake news. In fact verification involving multiple modal data, there should be a structural coherence between claim and document. Therefore, we proposed a structure coherence-based multi-modal fact verification scheme to classify fake news. Our structure coherence includes the following four aspects: sentence length, vocabulary similarity, semantic similarity, and image similarity. Specifically, CLIP and Sentence BERT are combined to extract text features, and ResNet50 is used to extract image features. In addition, we also extract the length of the text as well as the lexical similarity. Then the features were concatenated and passed through the random forest classifier. Finally, our weighted average F1 score has reached 0.8079, achieving 2nd place in FACTIFY2.