Abstract:Document pre-trained models and grid-based models have proven to be very effective on various tasks in Document AI. However, for the document layout analysis (DLA) task, existing document pre-trained models, even those pre-trained in a multi-modal fashion, usually rely on either textual features or visual features. Grid-based models for DLA are multi-modal but largely neglect the effect of pre-training. To fully leverage multi-modal information and exploit pre-training techniques to learn better representations for DLA, in this paper we present VGT, a two-stream Vision Grid Transformer, in which a Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding. Furthermore, a new dataset named D$^4$LA, which is so far the most diverse and detailed manually-annotated benchmark for document layout analysis, is curated and released. Experimental results show that the proposed VGT model achieves new state-of-the-art results on DLA tasks, e.g. PubLayNet ($95.7\%$$\rightarrow$$96.2\%$), DocBank ($79.6\%$$\rightarrow$$84.1\%$), and D$^4$LA ($67.7\%$$\rightarrow$$68.8\%$). The code and models, as well as the D$^4$LA dataset, will be made publicly available at \url{https://github.com/AlibabaResearch/AdvancedLiterateMachinery}.
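To make the two-stream idea concrete, here is a minimal PyTorch sketch of a vision stream and a grid stream fused into a joint representation. All module names, shapes, and the mean-pool fusion are illustrative assumptions, not the released VGT implementation.

```python
import torch
import torch.nn as nn

class TwoStreamEncoder(nn.Module):
    """Toy two-stream encoder: one transformer over visual patch features,
    one over word tokens rasterized onto a 2D grid (GiT-style)."""
    def __init__(self, dim=256, vocab_size=30522):
        super().__init__()
        vis_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        grid_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.vision_stream = nn.TransformerEncoder(vis_layer, num_layers=2)
        self.grid_embed = nn.Embedding(vocab_size, dim)   # token ids laid out on a grid
        self.grid_stream = nn.TransformerEncoder(grid_layer, num_layers=2)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, patch_feats, grid_token_ids):
        # patch_feats: (B, N_patches, dim) visual patch embeddings
        # grid_token_ids: (B, H*W) token ids of words rasterized into a 2D grid
        v = self.vision_stream(patch_feats)
        g = self.grid_stream(self.grid_embed(grid_token_ids))
        # simplification: pool each stream and fuse; a real DLA model would
        # keep per-location features and feed them to detection heads
        return self.fuse(torch.cat([v.mean(1), g.mean(1)], dim=-1))
```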
Abstract:The diversity in length constitutes a significant characteristic of text. Due to the long-tail distribution of text lengths, most existing methods for scene text recognition (STR) work well only on short or seen-length text and lack the capability to recognize longer text or perform length extrapolation. This is a crucial issue, since the lengths of the text to be recognized are usually not given in advance in real-world applications, yet it has not been adequately investigated in previous works. Therefore, we propose in this paper a method called Length-Insensitive Scene TExt Recognizer (LISTER), which remedies the limited robustness to varying text lengths. Specifically, a Neighbor Decoder is proposed to obtain accurate character attention maps with the assistance of a novel neighbor matrix, regardless of the text length. Besides, a Feature Enhancement Module is devised to model long-range dependencies at low computational cost, and it can iterate with the Neighbor Decoder to enhance the feature map progressively. To the best of our knowledge, we are the first to achieve effective length-insensitive scene text recognition. Extensive experiments demonstrate that the proposed LISTER algorithm exhibits clear superiority in long text recognition and the ability to perform length extrapolation, while comparing favourably with previous state-of-the-art methods on standard STR benchmarks (mainly short text).
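A rough picture of how a neighbor matrix could move character attention from one step to the next, independently of text length, is sketched below in PyTorch. The tensor shapes, the single transition matrix, and the fixed loop count are toy assumptions, not the actual LISTER decoder.

```python
import torch
import torch.nn.functional as F

def neighbor_step(attn, feat, neighbor_matrix):
    """One toy decoding step: shift attention from character t to t+1 via a
    (learned) neighbor matrix, then read out a glimpse feature.
    attn:            (B, L)     attention over L flattened feature positions
    feat:            (B, L, C)  flattened visual feature map
    neighbor_matrix: (L, L)     transition weights between positions
    """
    next_attn = F.softmax(attn @ neighbor_matrix, dim=-1)    # attention for the next character
    char_feat = torch.einsum('bl,blc->bc', next_attn, feat)  # glimpse for classification
    return next_attn, char_feat

# usage with random tensors
B, L, C = 2, 64, 128
attn = F.one_hot(torch.zeros(B, dtype=torch.long), L).float()  # start at position 0
feat = torch.randn(B, L, C)
neighbor = torch.randn(L, L)
for _ in range(5):   # length-agnostic in principle: loop as long as needed
    attn, glimpse = neighbor_step(attn, feat, neighbor)
```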
Abstract:Due to the enormous technical challenges and wide range of applications, scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this tough problem, numerous innovative methods have been successively proposed, and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet functionally powerful vision STR model, which is built upon ViT and a tailored Adaptive Addressing and Aggregation (A$^3$) module. It already outperforms most previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way, \ie, subword representations (BPE and WordPiece) widely used in NLP are introduced into the output space, in addition to the conventional character level representation, while no independent language model (LM) is adopted. To produce the final recognition results, two strategies for effectively fusing the multi-granularity predictions are devised. The resultant algorithm (termed MGP-STR) is able to push the performance envelope of STR to an even higher level. Specifically, MGP-STR achieves an average recognition accuracy of $94\%$ on standard benchmarks for scene text recognition. Moreover, it also achieves state-of-the-art results on widely-used handwritten benchmarks as well as more challenging scene text datasets, demonstrating the generality of the proposed MGP-STR algorithm. The source code and models will be available at: \url{https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR}.
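One simple way to fuse multi-granularity predictions is to keep the decoding route (character, BPE, or WordPiece) whose output sequence is most confident. The snippet below illustrates this with made-up scores; it is a hedged sketch of such a confidence-based fusion, not the exact procedure used in MGP-STR.

```python
# Hypothetical fusion of multi-granularity predictions: pick the route
# (character / BPE / WordPiece) whose sequence confidence is highest.
def fuse_predictions(candidates):
    """candidates: list of (text, per_token_probs) from the three heads."""
    def seq_confidence(probs):
        score = 1.0
        for p in probs:
            score *= p
        return score
    return max(candidates, key=lambda c: seq_confidence(c[1]))[0]

# usage with made-up head outputs
char_out = ("coffee", [0.98, 0.97, 0.91, 0.90, 0.95, 0.96])
bpe_out  = ("coffee", [0.99, 0.97])   # two subword tokens
wp_out   = ("coffe",  [0.88, 0.60])
print(fuse_predictions([char_out, bpe_out, wp_out]))  # -> "coffee"
```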
Abstract:A novel scene text recognizer based on Vision-Language Transformer (VLT) is presented. Inspired by the Levenshtein Transformer in the area of NLP, the proposed method (named Levenshtein OCR, and LevOCR for short) explores an alternative way to automatically transcribe textual content from cropped natural images. Specifically, we cast the problem of scene text recognition as an iterative sequence refinement process. The initial prediction sequence produced by a pure vision model is encoded and fed into a cross-modal transformer to interact and fuse with the visual features, to progressively approximate the ground truth. The refinement process is accomplished via two basic character-level operations, deletion and insertion, which are learned with imitation learning and allow for parallel decoding, dynamic length change and good interpretability. The quantitative experiments clearly demonstrate that LevOCR achieves state-of-the-art performance on standard benchmarks, and the qualitative analyses verify the effectiveness and advantage of the proposed LevOCR algorithm. Code will be released soon.
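The deletion-and-insertion refinement loop can be illustrated with a toy example. Hand-crafted policies stand in for the learned operations, so the snippet only conveys the control flow, not LevOCR's imitation-learned policies.

```python
# Toy illustration of iterative refinement via deletion and insertion.
def refine(seq, delete_policy, insert_policy, max_iters=3):
    """seq: list of characters; the policies return per-position decisions."""
    for _ in range(max_iters):
        keep = delete_policy(seq)                     # one keep/drop flag per character
        seq = [c for c, k in zip(seq, keep) if k]
        for pos, ch in sorted(insert_policy(seq), reverse=True):
            seq.insert(pos, ch)                       # insert predicted characters
    return "".join(seq)

# usage with hand-crafted policies that fix "hullo" -> "hello"
delete_policy = lambda s: [c != "u" for c in s]           # drop the wrong 'u'
insert_policy = lambda s: [(1, "e")] if "e" not in s else []
print(refine(list("hullo"), delete_policy, insert_policy))  # -> "hello"
```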
Abstract:Scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this challenging problem, numerous innovative methods have been successively proposed, and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model, which is built upon ViT and outperforms previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way, i.e., subword representations (BPE and WordPiece) widely used in NLP are introduced into the output space, in addition to the conventional character-level representation, while no independent language model (LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the performance envelope of STR to an even higher level. Specifically, it achieves an average recognition accuracy of 93.35% on standard benchmarks. Code will be released soon.
Abstract:Nowadays, live-stream and short-video shopping in E-commerce have grown exponentially. However, sellers have to manually match images of the products on sale to the timestamps at which they are exhibited in the untrimmed video, which makes the process complicated. To solve this problem, we present an innovative demonstration of a multi-modal retrieval system called "Fashion Focus", which can precisely localize product images in online videos as the focuses. Different modalities contribute to product localization: visual content, linguistic features, and interaction context are jointly investigated via the presented multi-modal learning. Our system employs two procedures for analysis, video content structuring and multi-modal retrieval, to automatically achieve accurate video-to-shop matching. Fashion Focus presents a unified framework that can orient consumers towards relevant product exhibitions while they watch videos and help sellers deliver their products effectively via search and recommendation.
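As a hedged illustration of the retrieval step only, the snippet below matches product-image embeddings to structured video-segment embeddings by cosine similarity; the encoders, embedding dimension, and random inputs are placeholders, not the Fashion Focus system.

```python
import torch
import torch.nn.functional as F

# Toy video-to-shop matching: for each product embedding, find the video
# segment whose embedding is most similar. The embeddings are assumed to
# come from jointly trained visual/text encoders (not shown here).
def match_products_to_segments(product_emb, segment_emb):
    """product_emb: (P, D), segment_emb: (S, D); returns best segment per product."""
    sims = F.normalize(product_emb, dim=-1) @ F.normalize(segment_emb, dim=-1).T
    return sims.argmax(dim=-1)   # (P,) index of the matched segment

products = torch.randn(3, 256)   # 3 selling products
segments = torch.randn(10, 256)  # 10 structured video segments
print(match_products_to_segments(products, segments))
```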