Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ruichen Wang

NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment

May 22, 2025

Shuhao Han, Haotian Fan, Fangyuan Kong, Wenjie Liao, Chunle Guo, Chongyi Li, Radu Timofte, Liang Li, Tao Li, Junhui Cui(+105 more)

Abstract:This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects: image-text alignment and image structural distortion detection, and is divided into the alignment track and the structural track. The alignment track uses the EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track has a total of 371 registered participants. A total of 1,883 submissions are received in the development phase, and 507 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structure track uses the EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion mask. A total of 211 participants have registered in the structure track. A total of 1155 submissions are received in the development phase, and 487 submissions are received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on T2I model quality assessment.

Via

Access Paper or Ask Questions

PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control

Dec 02, 2024

Ruichen Wang, Junliang Zhang, Qingsong Xie, Chen Chen, Haonan Lu

Abstract:Recently, diffusion models have exhibited superior performance in the area of image inpainting. Inpainting methods based on diffusion models can usually generate realistic, high-quality image content for masked areas. However, due to the limitations of diffusion models, existing methods typically encounter problems in terms of semantic consistency between images and text, and the editing habits of users. To address these issues, we present PainterNet, a plugin that can be flexibly embedded into various diffusion models. To generate image content in the masked areas that highly aligns with the user input prompt, we proposed local prompt input, Attention Control Points (ACP), and Actual-Token Attention Loss (ATAL) to enhance the model's focus on local areas. Additionally, we redesigned the MASK generation algorithm in training and testing dataset to simulate the user's habit of applying MASK, and introduced a customized new training dataset, PainterData, and a benchmark dataset, PainterBench. Our extensive experimental analysis exhibits that PainterNet surpasses existing state-of-the-art models in key metrics including image quality and global/local text consistency.

Via

Access Paper or Ask Questions

EM-GANSim: Real-time and Accurate EM Simulation Using Conditional GANs for 3D Indoor Scenes

May 27, 2024

Ruichen Wang, Dinesh Manocha

Abstract:We present a novel machine-learning (ML) approach (EM-GANSim) for real-time electromagnetic (EM) propagation that is used for wireless communication simulation in 3D indoor environments. Our approach uses a modified conditional Generative Adversarial Network (GAN) that incorporates encoded geometry and transmitter location while adhering to the electromagnetic propagation theory. The overall physically-inspired learning is able to predict the power distribution in 3D scenes, which is represented using heatmaps. Our overall accuracy is comparable to ray tracing-based EM simulation, as evidenced by lower mean squared error values. Furthermore, our GAN-based method drastically reduces the computation time, achieving a 5X speedup on complex benchmarks. In practice, it can compute the signal strength in a few milliseconds on any location in 3D indoor environments. We also present a large dataset of 3D models and EM ray tracing-simulated heatmaps. To the best of our knowledge, EM-GANSim is the first real-time algorithm for EM simulation in complex 3D indoor environments. We plan to release the code and the dataset.

* 10 pages, 8 figures, 5 tables

Via

Access Paper or Ask Questions

When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral Imagery

Apr 17, 2024

Yiqun Xie, Zhihao Wang, Weiye Chen, Zhili Li, Xiaowei Jia, Yanhua Li, Ruichen Wang, Kangyang Chai, Ruohan Li, Sergii Skakun

Figure 1 for When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral Imagery

Figure 2 for When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral Imagery

Figure 3 for When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral Imagery

Figure 4 for When are Foundation Models Effective? Understanding the Suitability for Pixel-Level Classification Using Multispectral Imagery

Abstract:Foundation models, i.e., very large deep learning models, have demonstrated impressive performances in various language and vision tasks that are otherwise difficult to reach using smaller-size models. The major success of GPT-type of language models is particularly exciting and raises expectations on the potential of foundation models in other domains including satellite remote sensing. In this context, great efforts have been made to build foundation models to test their capabilities in broader applications, and examples include Prithvi by NASA-IBM, Segment-Anything-Model, ViT, etc. This leads to an important question: Are foundation models always a suitable choice for different remote sensing tasks, and when or when not? This work aims to enhance the understanding of the status and suitability of foundation models for pixel-level classification using multispectral imagery at moderate resolution, through comparisons with traditional machine learning (ML) and regular-size deep learning models. Interestingly, the results reveal that in many scenarios traditional ML models still have similar or better performance compared to foundation models, especially for tasks where texture is less useful for classification. On the other hand, deep learning models did show more promising results for tasks where labels partially depend on texture (e.g., burn scar), while the difference in performance between foundation models and deep learning models is not obvious. The results conform with our analysis: The suitability of foundation models depend on the alignment between the self-supervised learning tasks and the real downstream tasks, and the typical masked autoencoder paradigm is not necessarily suitable for many remote sensing problems.

Via

Access Paper or Ask Questions

Indoor Wireless Signal Modeling with Smooth Surface Diffraction Effects

Oct 16, 2023

Ruichen Wang, Samuel Audia, Dinesh Manocha

$Figure 1 for Indoor Wireless Signal Modeling with Smooth Surface Diffraction Effects$

$Figure 2 for Indoor Wireless Signal Modeling with Smooth Surface Diffraction Effects$

$Figure 3 for Indoor Wireless Signal Modeling with Smooth Surface Diffraction Effects$

$Figure 4 for Indoor Wireless Signal Modeling with Smooth Surface Diffraction Effects$

Abstract:We present a novel algorithm that enhances the accuracy of electromagnetic field simulations in indoor environments by incorporating the Uniform Geometrical Theory of Diffraction (UTD) for surface diffraction. This additional diffraction phenomenology is important for the design of modern wireless systems and allows us to capture the effects of more complex scene geometries. Central to our methodology is the Dynamic Coherence-Based EM Ray Tracing Simulator (DCEM), and we augment that formulation with smooth surface UTD and present techniques to efficiently compute the ray paths. We validate our additions by comparing them to analytical solutions of a sphere, method of moments solutions from FEKO, and ray-traced indoor scenes from WinProp. Our algorithm improves shadow region predicted powers by about 5dB compared to our previous work, and captures nuanced field effects beyond shadow boundaries. We highlight the performance on different indoor scenes and observe 60% faster computation time over WinProp.

* 5 pages, 9 figures, conference

Via

Access Paper or Ask Questions

Towards Language-guided Interactive 3D Generation: LLMs as Layout Interpreter with Generative Feedback

May 25, 2023

Yiqi Lin, Hao Wu, Ruichen Wang, Haonan Lu, Xiaodong Lin, Hui Xiong, Lin Wang

Abstract:Generating and editing a 3D scene guided by natural language poses a challenge, primarily due to the complexity of specifying the positional relations and volumetric changes within the 3D space. Recent advancements in Large Language Models (LLMs) have demonstrated impressive reasoning, conversational, and zero-shot generation abilities across various domains. Surprisingly, these models also show great potential in realizing and interpreting the 3D space. In light of this, we propose a novel language-guided interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D layout interpreter into the off-the-shelf layout-to-3D generative models, allowing users to flexibly and interactively generate visual content. Specifically, we design a versatile layout structure base on the bounding boxes and semantics to prompt the LLMs to model the spatial generation and reasoning from language. Our system also incorporates LLaVA, a large language and vision assistant, to provide generative feedback from the visual aspect for improving the visual quality of generated content. We validate the effectiveness of LI3D, primarily in 3D generation and editing through multi-round interactions, which can be flexibly extended to 2D generation and editing. Various experiments demonstrate the potential benefits of incorporating LLMs in generative AI for applications, e.g., metaverse. Moreover, we benchmark the layout reasoning performance of LLMs with neural visual artist tasks, revealing their emergent ability in the spatial layout domain.

* Preprint. Work in Progres

Via

Access Paper or Ask Questions

Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models

May 23, 2023

Ruichen Wang, Zekang Chen, Chen Chen, Jian Ma, Haonan Lu, Xiaodong Lin

Figure 1 for Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models

Figure 2 for Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models

Figure 3 for Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models

Figure 4 for Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models

Abstract:Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, these models fail to semantically align the generated images with the text descriptions due to their limited compositional capabilities, leading to attribute leakage, entity leakage, and missing entities. In this paper, we propose a novel attention mask control strategy based on predicted object boxes to address these three issues. In particular, we first train a BoxNet to predict a box for each entity that possesses the attribute specified in the prompt. Then, depending on the predicted boxes, unique mask control is applied to the cross- and self-attention maps. Our approach produces a more semantically accurate synthesis by constraining the attention regions of each token in the prompt to the image. In addition, the proposed method is straightforward and effective, and can be readily integrated into existing cross-attention-diffusion-based T2I generators. We compare our approach to competing methods and demonstrate that it not only faithfully conveys the semantics of the original text to the generated content, but also achieves high availability as a ready-to-use plugin.

Via

Access Paper or Ask Questions

Edit Everything: A Text-Guided Generative System for Images Editing

Apr 27, 2023

Defeng Xie, Ruichen Wang, Jian Ma, Chen Chen, Haonan Lu, Dong Yang, Fobo Shi, Xiaodong Lin

Figure 1 for Edit Everything: A Text-Guided Generative System for Images Editing

Figure 2 for Edit Everything: A Text-Guided Generative System for Images Editing

Figure 3 for Edit Everything: A Text-Guided Generative System for Images Editing

Figure 4 for Edit Everything: A Text-Guided Generative System for Images Editing

Abstract:We introduce a new generative system called Edit Everything, which can take image and text inputs and produce image outputs. Edit Everything allows users to edit images using simple text instructions. Our system designs prompts to guide the visual module in generating requested images. Experiments demonstrate that Edit Everything facilitates the implementation of the visual aspects of Stable Diffusion with the use of Segment Anything model and CLIP. Our system is publicly available at https://github.com/DefengXie/Edit_Everything.

Via

Access Paper or Ask Questions

GlyphDraw: Learning to Draw Chinese Characters in Image Synthesis Models Coherently

Mar 31, 2023

Jian Ma, Mingjun Zhao, Chen Chen, Ruichen Wang, Di Niu, Haonan Lu, Xiaodong Lin

Abstract:Recent breakthroughs in the field of language-guided image generation have yielded impressive achievements, enabling the creation of high-quality and diverse images based on user instructions. Although the synthesis performance is fascinating, one significant limitation of current image generation models is their insufficient ability to generate coherent text within images, particularly for complex glyph structures like Chinese characters. To address this problem, we introduce GlyphDraw, a general learning framework aiming at endowing image generation models with the capacity to generate images embedded with coherent text. To the best of our knowledge, this is the first work in the field of image synthesis to address the generation of Chinese characters. % we first adopt the OCR technique to collect images with Chinese characters as training samples, and extract the text and locations as auxiliary information. We first sophisticatedly design the image-text dataset's construction strategy, then build our model specifically on a diffusion-based image generator and carefully modify the network structure to allow the model to learn drawing Chinese characters with the help of glyph and position information. Furthermore, we maintain the model's open-domain image synthesis capability by preventing catastrophic forgetting by using a variety of training techniques. Extensive qualitative and quantitative experiments demonstrate that our method not only produces accurate Chinese characters as in prompts, but also naturally blends the generated text into the background. Please refer to https://1073521013.github.io/glyph-draw.github.io

* 24 pages, 5 figures

Via

Access Paper or Ask Questions

Dynamic EM Ray Tracing for Large Urban Scenes with Multiple Receivers

Mar 19, 2023

Ruichen Wang, Dinesh Manocha

Abstract:Radio applications are increasingly being used in urban environments for cellular radio systems and safety applications that use vehicle-vehicle, and vehicle-to-infrastructure. We present a novel ray tracing-based radio propagation algorithm that can handle large urban scenes with hundreds or thousands of dynamic objects and receivers. Our approach is based on the use of coherence-based techniques that exploit spatial and temporal coherence for efficient wireless propagation and radio network planning. Our formulation also utilizes channel coherence which is used to determine the effectiveness of the propagation model within a certain time in dynamically generated paths; and spatial consistency which is used to estimate the similarity and accuracy of changes in a dynamic environment with varying propagation models and blocking obstacles. We highlight the performance of our simulator in large urban traffic scenes with an area of 2*2 km^2 and more than 10,000 users and devices. We evaluate the accuracy by comparing the results with discrete model simulations performed using WinProp. In practice, our approach scales linearly with the area of the urban environment and the number of dynamic obstacles or receivers.

* 6 pages, 10 figures, conference

Via

Access Paper or Ask Questions