Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Mengping Yang

Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption

Mar 12, 2025

Luozheng Qin, Zhiyu Tan, Mengping Yang, Xiaomeng Yang, Hao Li

Abstract:Video Detailed Captioning (VDC) is a crucial task for vision-language bridging, enabling fine-grained descriptions of complex video content. In this paper, we first comprehensively benchmark current state-of-the-art approaches and systematically identified two critical limitations: biased capability towards specific captioning aspect and misalignment with human preferences. To address these deficiencies, we propose Cockatiel, a novel three-stage training pipeline that ensembles synthetic and human-aligned training for improving VDC performance. In the first stage, we derive a scorer from a meticulously annotated dataset to select synthetic captions high-performing on certain fine-grained video-caption alignment and human-preferred while disregarding others. Then, we train Cockatiel-13B, using this curated dataset to infuse it with assembled model strengths and human preferences. Finally, we further distill Cockatiel-8B from Cockatiel-13B for the ease of usage. Extensive quantitative and qualitative experiments reflect the effectiveness of our method, as we not only set new state-of-the-art performance on VDCSCORE in a dimension-balanced way but also surpass leading alternatives on human preference by a large margin as depicted by the human evaluation results.

* For more details, please refer to our project page: https://sais-fuxi.github.io/projects/cockatiel/

Via

Access Paper or Ask Questions

Adversarial Semantic Augmentation for Training Generative Adversarial Networks under Limited Data

Feb 02, 2025

Mengping Yang, Zhe Wang, Ziqiu Chi, Dongdong Li, Wenli Du

Abstract:Generative adversarial networks (GANs) have made remarkable achievements in synthesizing images in recent years. Typically, training GANs requires massive data, and the performance of GANs deteriorates significantly when training data is limited. To improve the synthesis performance of GANs in low-data regimes, existing approaches use various data augmentation techniques to enlarge the training sets. However, it is identified that these augmentation techniques may leak or even alter the data distribution. To remedy this, we propose an adversarial semantic augmentation (ASA) technique to enlarge the training data at the semantic level instead of the image level. Concretely, considering semantic features usually encode informative information of images, we estimate the covariance matrices of semantic features for both real and generated images to find meaningful transformation directions. Such directions translate original features to another semantic representation, e.g., changing the backgrounds or expressions of the human face dataset. Moreover, we derive an upper bound of the expected adversarial loss. By optimizing the upper bound, our semantic augmentation is implicitly achieved. Such design avoids redundant sampling of the augmented features and introduces negligible computation overhead, making our approach computation efficient. Extensive experiments on both few-shot and large-scale datasets demonstrate that our method consistently improve the synthesis quality under various data regimes, and further visualized and analytic results suggesting satisfactory versatility of our proposed method.

* This work was completed in 2022 and submitted to an IEEE journal for potential publication

Via

Access Paper or Ask Questions

E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models

Dec 30, 2024

Zhiyu Tan, WenXu Qian, Hesen Chen, Mengping Yang, Lei Chen, Hao Li

Figure 1 for E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models

Figure 2 for E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models

Figure 3 for E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models

Figure 4 for E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models

Abstract:Diffusion models have emerged as a powerful framework for generative modeling, achieving state-of-the-art performance across various tasks. However, they face several inherent limitations, including a training-sampling gap, information leakage in the progressive noising process, and the inability to incorporate advanced loss functions like perceptual and adversarial losses during training. To address these challenges, we propose an innovative end-to-end training framework that aligns the training and sampling processes by directly optimizing the final reconstruction output. Our method eliminates the training-sampling gap, mitigates information leakage by treating the training process as a direct mapping from pure noise to the target data distribution, and enables the integration of perceptual and adversarial losses into the objective. Extensive experiments on benchmarks such as COCO30K and HW30K demonstrate that our approach consistently outperforms traditional diffusion models, achieving superior results in terms of FID and CLIP score, even with reduced sampling steps. These findings highlight the potential of end-to-end training to advance diffusion-based generative models toward more robust and efficient solutions.

* technical report, to be further updated

Via

Access Paper or Ask Questions

EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models

Jun 27, 2024

Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Mengping Yang, Cheng Zhang, Hao Li

Abstract:The recent advancements in text-to-image generative models have been remarkable. Yet, the field suffers from a lack of evaluation metrics that accurately reflect the performance of these models, particularly lacking fine-grained metrics that can guide the optimization of the models. In this paper, we propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) pre-trained on extensive datasets. We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment. Each protocol comprises a set of detailed, fine-grained instructions linked to specific scoring options, enabling precise manual scoring of the generated images. We Supervised Fine-Tune (SFT) the MLLM to align closely with human evaluative judgments, resulting in a robust evaluation model. Our comprehensive tests across 24 text-to-image generation models demonstrate that EvalAlign not only provides superior metric stability but also aligns more closely with human preferences than existing metrics, confirming its effectiveness and utility in model assessment.

* Github Repository: https://github.com/SAIS-FUXI/EvalAlign

Via

Access Paper or Ask Questions

EvalAlign: Evaluating Text-to-Image Models through Precision Alignment of Multimodal Large Models with Supervised Fine-Tuning to Human Annotations

Jun 24, 2024

Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Mengping Yang, Cheng Zhang, Hao Li

* Github Repository: https://github.com/SAIS-FUXI/EvalAlign

Via

Access Paper or Ask Questions

Attention Calibration for Disentangled Text-to-Image Personalization

Apr 11, 2024

Yanbing Zhang, Mengping Yang, Qin Zhou, Zhe Wang

Abstract:Recent thrilling progress in large-scale text-to-image (T2I) models has unlocked unprecedented synthesis quality of AI-generated content (AIGC) including image generation, 3D and video composition. Further, personalized techniques enable appealing customized production of a novel concept given only several images as reference. However, an intriguing problem persists: Is it possible to capture multiple, novel concepts from one single reference image? In this paper, we identify that existing approaches fail to preserve visual consistency with the reference image and eliminate cross-influence from concepts. To alleviate this, we propose an attention calibration mechanism to improve the concept-level understanding of the T2I model. Specifically, we first introduce new learnable modifiers bound with classes to capture attributes of multiple concepts. Then, the classes are separated and strengthened following the activation of the cross-attention operation, ensuring comprehensive and self-contained concepts. Additionally, we suppress the attention activation of different classes to mitigate mutual influence among concepts. Together, our proposed method, dubbed DisenDiff, can learn disentangled multiple concepts from one single image and produce novel customized images with learned concepts. We demonstrate that our method outperforms the current state of the art in both qualitative and quantitative evaluations. More importantly, our proposed techniques are compatible with LoRA and inpainting pipelines, enabling more interactive experiences.

* CVPR 2024 (Oral)

Via

Access Paper or Ask Questions

Improving Few-shot Image Generation by Structural Discrimination and Textural Modulation

Aug 30, 2023

Mengping Yang, Zhe Wang, Wenyi Feng, Qian Zhang, Ting Xiao

Abstract:Few-shot image generation, which aims to produce plausible and diverse images for one category given a few images from this category, has drawn extensive attention. Existing approaches either globally interpolate different images or fuse local representations with pre-defined coefficients. However, such an intuitive combination of images/features only exploits the most relevant information for generation, leading to poor diversity and coarse-grained semantic fusion. To remedy this, this paper proposes a novel textural modulation (TexMod) mechanism to inject external semantic signals into internal local representations. Parameterized by the feedback from the discriminator, our TexMod enables more fined-grained semantic injection while maintaining the synthesis fidelity. Moreover, a global structural discriminator (StructD) is developed to explicitly guide the model to generate images with reasonable layout and outline. Furthermore, the frequency awareness of the model is reinforced by encouraging the model to distinguish frequency signals. Together with these techniques, we build a novel and effective model for few-shot image generation. The effectiveness of our model is identified by extensive experiments on three popular datasets and various settings. Besides achieving state-of-the-art synthesis performance on these datasets, our proposed techniques could be seamlessly integrated into existing models for a further performance boost.

* To appear in ACM MM 2023, code is available at https://github.com/kobeshegu/SDTM-GAN-ACMMM-2023

Via

Access Paper or Ask Questions

Image Synthesis under Limited Data: A Survey and Taxonomy

Jul 31, 2023

Mengping Yang, Zhe Wang

Figure 1 for Image Synthesis under Limited Data: A Survey and Taxonomy

Figure 2 for Image Synthesis under Limited Data: A Survey and Taxonomy

Figure 3 for Image Synthesis under Limited Data: A Survey and Taxonomy

Figure 4 for Image Synthesis under Limited Data: A Survey and Taxonomy

Abstract:Deep generative models, which target reproducing the given data distribution to produce novel samples, have made unprecedented advancements in recent years. Their technical breakthroughs have enabled unparalleled quality in the synthesis of visual content. However, one critical prerequisite for their tremendous success is the availability of a sufficient number of training samples, which requires massive computation resources. When trained on limited data, generative models tend to suffer from severe performance deterioration due to overfitting and memorization. Accordingly, researchers have devoted considerable attention to develop novel models that are capable of generating plausible and diverse images from limited training data recently. Despite numerous efforts to enhance training stability and synthesis quality in the limited data scenarios, there is a lack of a systematic survey that provides 1) a clear problem definition, critical challenges, and taxonomy of various tasks; 2) an in-depth analysis on the pros, cons, and remain limitations of existing literature; as well as 3) a thorough discussion on the potential applications and future directions in the field of image synthesis under limited data. In order to fill this gap and provide a informative introduction to researchers who are new to this topic, this survey offers a comprehensive review and a novel taxonomy on the development of image synthesis under limited data. In particular, it covers the problem definition, requirements, main solutions, popular benchmarks, and remain challenges in a comprehensive and all-around manner.

* 230 references, 25 pages. GitHub: https://github.com/kobeshegu/awesome-few-shot-generation

Via

Access Paper or Ask Questions

Revisiting the Evaluation of Image Synthesis with GANs

Apr 04, 2023

Mengping Yang, Ceyuan Yang, Yichi Zhang, Qingyan Bai, Yujun Shen, Bo Dai

Abstract:A good metric, which promises a reliable comparison between solutions, is essential to a well-defined task. Unlike most vision tasks that have per-sample ground-truth, image synthesis targets generating \emph{unseen} data and hence is usually evaluated with a distributional distance between one set of real samples and another set of generated samples. This work provides an empirical study on the evaluation of synthesis performance by taking the popular generative adversarial networks (GANs) as a representative of generative models. In particular, we make in-depth analyses on how to represent a data point in the feature space, how to calculate a fair distance using selected samples, and how many instances to use from each set. Experiments on multiple datasets and settings suggest that (1) a group of models including both CNN-based and ViT-based architectures serve as reliable and robust feature extractors, (2) Centered Kernel Alignment (CKA) enables better comparison across various extractors and hierarchical layers in one model, and (3) CKA shows satisfactory sample efficiency and complements existing metrics (\textit{e.g.}, FID) in characterizing the similarity between two internal data correlations. These findings help us design a new measurement system, based on which we re-evaluate the state-of-the-art generative models in a consistent and reliable way.

* Technical report

Via

Access Paper or Ask Questions

FreGAN: Exploiting Frequency Components for Training GANs under Limited Data

Oct 11, 2022

Mengping Yang, Zhe Wang, Ziqiu Chi, Yanbing Zhang

Figure 1 for FreGAN: Exploiting Frequency Components for Training GANs under Limited Data

Figure 2 for FreGAN: Exploiting Frequency Components for Training GANs under Limited Data

Figure 3 for FreGAN: Exploiting Frequency Components for Training GANs under Limited Data

Figure 4 for FreGAN: Exploiting Frequency Components for Training GANs under Limited Data

Abstract:Training GANs under limited data often leads to discriminator overfitting and memorization issues, causing divergent training. Existing approaches mitigate the overfitting by employing data augmentations, model regularization, or attention mechanisms. However, they ignore the frequency bias of GANs and take poor consideration towards frequency information, especially high-frequency signals that contain rich details. To fully utilize the frequency information of limited data, this paper proposes FreGAN, which raises the model's frequency awareness and draws more attention to producing high-frequency signals, facilitating high-quality generation. In addition to exploiting both real and generated images' frequency information, we also involve the frequency signals of real images as a self-supervised constraint, which alleviates the GAN disequilibrium and encourages the generator to synthesize adequate rather than arbitrary frequency signals. Extensive results demonstrate the superiority and effectiveness of our FreGAN in ameliorating generation quality in the low-data regime (especially when training data is less than 100). Besides, FreGAN can be seamlessly applied to existing regularization and attention mechanism models to further boost the performance.

* To appear in NeurIPS 2022, github:https://github.com/kobeshegu/FreGAN_NeurIPS2022

Via

Access Paper or Ask Questions