Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jihye Park

Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration

Dec 09, 2025

Jin Hyeon Kim, Paul Hyunbin Cho, Claire Kim, Jaewon Min, Jaeeun Lee, Jihye Park, Yeji Choi, Seungryong Kim

Abstract:Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploit these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance in TAIR task.

Via

Access Paper or Ask Questions

Text-Aware Image Restoration with Diffusion Models

Jun 11, 2025

Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, Seungryong Kim

Figure 1 for Text-Aware Image Restoration with Diffusion Models

Figure 2 for Text-Aware Image Restoration with Diffusion Models

Figure 3 for Text-Aware Image Restoration with Diffusion Models

Figure 4 for Text-Aware Image Restoration with Diffusion Models

Abstract:Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: https://cvlab-kaist.github.io/TAIR/

* Project page: https://cvlab-kaist.github.io/TAIR/

Via

Access Paper or Ask Questions

HyperCLOVA X Technical Report

Apr 13, 2024

Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim, Munhyong Kim, Sungju Kim(+386 more)

Abstract:We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.

* 44 pages; updated authors list and fixed author names

Via

Access Paper or Ask Questions

MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation

Mar 28, 2024

Seyeon Kim, Siyoon Jin, Jihye Park, Kihong Kim, Jiyoung Kim, Jisu Nam, Seungryong Kim

Figure 1 for MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation

Figure 2 for MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation

Figure 3 for MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation

Figure 4 for MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation

Abstract:Conventional GAN-based models for talking head generation often suffer from limited quality and unstable training. Recent approaches based on diffusion models aimed to address these limitations and improve fidelity. However, they still face challenges, including extensive sampling times and difficulties in maintaining temporal consistency due to the high stochasticity of diffusion models. To overcome these challenges, we propose a novel motion-disentangled diffusion model for high-quality talking head generation, dubbed MoDiTalker. We introduce the two modules: audio-to-motion (AToM), designed to generate a synchronized lip motion from audio, and motion-to-video (MToV), designed to produce high-quality head video following the generated motion. AToM excels in capturing subtle lip movements by leveraging an audio attention mechanism. In addition, MToV enhances temporal consistency by leveraging an efficient tri-plane representation. Our experiments conducted on standard benchmarks demonstrate that our model achieves superior performance compared to existing models. We also provide comprehensive ablation studies and user study results.

Via

Access Paper or Ask Questions

Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation

Feb 21, 2024

Kihong Kim, Haneol Lee, Jihye Park, Seyeon Kim, Kwanghee Lee, Seungryong Kim, Jaejun Yoo

Figure 1 for Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation

Figure 2 for Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation

Figure 3 for Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation

Figure 4 for Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation

Abstract:Generating high-quality videos that synthesize desired realistic content is a challenging task due to their intricate high-dimensionality and complexity of videos. Several recent diffusion-based methods have shown comparable performance by compressing videos to a lower-dimensional latent space, using traditional video autoencoder architecture. However, such method that employ standard frame-wise 2D and 3D convolution fail to fully exploit the spatio-temporal nature of videos. To address this issue, we propose a novel hybrid video diffusion model, called HVDM, which can capture spatio-temporal dependencies more effectively. The HVDM is trained by a hybrid video autoencoder which extracts a disentangled representation of the video including: (i) a global context information captured by a 2D projected latent (ii) a local volume information captured by 3D convolutions with wavelet decomposition (iii) a frequency information for improving the video reconstruction. Based on this disentangled representation, our hybrid autoencoder provide a more comprehensive video latent enriching the generated videos with fine structures and details. Experiments on video generation benchamarks (UCF101, SkyTimelapse, and TaiChi) demonstrate that the proposed approach achieves state-of-the-art video generation quality, showing a wide range of video applications (e.g., long video generation, image-to-video, and video dynamics control).

* 17 pages, 13 figures

Via

Access Paper or Ask Questions

Towards Flexible Inductive Bias via Progressive Reparameterization Scheduling

Oct 04, 2022

Yunsung Lee, Gyuseong Lee, Kwangrok Ryoo, Hyojun Go, Jihye Park, Seungryong Kim

Figure 1 for Towards Flexible Inductive Bias via Progressive Reparameterization Scheduling

Figure 2 for Towards Flexible Inductive Bias via Progressive Reparameterization Scheduling

Figure 3 for Towards Flexible Inductive Bias via Progressive Reparameterization Scheduling

Figure 4 for Towards Flexible Inductive Bias via Progressive Reparameterization Scheduling

Abstract:There are two de facto standard architectures in recent computer vision: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Strong inductive biases of convolutions help the model learn sample effectively, but such strong biases also limit the upper bound of CNNs when sufficient data are available. On the contrary, ViT is inferior to CNNs for small data but superior for sufficient data. Recent approaches attempt to combine the strengths of these two architectures. However, we show these approaches overlook that the optimal inductive bias also changes according to the target data scale changes by comparing various models' accuracy on subsets of sampled ImageNet at different ratios. In addition, through Fourier analysis of feature maps, the model's response patterns according to signal frequency changes, we observe which inductive bias is advantageous for each data scale. The more convolution-like inductive bias is included in the model, the smaller the data scale is required where the ViT-like model outperforms the ResNet performance. To obtain a model with flexible inductive bias on the data scale, we show reparameterization can interpolate inductive bias between convolution and self-attention. By adjusting the number of epochs the model stays in the convolution, we show that reparameterization from convolution to self-attention interpolates the Fourier analysis pattern between CNNs and ViTs. Adapting these findings, we propose Progressive Reparameterization Scheduling (PRS), in which reparameterization adjusts the required amount of convolution-like or self-attention-like inductive bias per layer. For small-scale datasets, our PRS performs reparameterization from convolution to self-attention linearly faster at the late stage layer. PRS outperformed previous studies on the small-scale dataset, e.g., CIFAR-100.

* Accepted at VIPriors ECCVW 2022, camera-ready version

Via

Access Paper or Ask Questions

LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data

Aug 31, 2022

Jihye Park, Soohyun Kim, Sunwoo Kim, Jaejun Yoo, Youngjung Uh, Seungryong Kim

Figure 1 for LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data

Figure 2 for LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data

Figure 3 for LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data

Figure 4 for LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data

Abstract:Existing techniques for image-to-image translation commonly have suffered from two critical problems: heavy reliance on per-sample domain annotation and/or inability of handling multiple attributes per image. Recent methods adopt clustering approaches to easily provide per-sample annotations in an unsupervised manner. However, they cannot account for the real-world setting; one sample may have multiple attributes. In addition, the semantics of the clusters are not easily coupled to human understanding. To overcome these, we present a LANguage-driven Image-to-image Translation model, dubbed LANIT. We leverage easy-to-obtain candidate domain annotations given in texts for a dataset and jointly optimize them during training. The target style is specified by aggregating multi-domain style vectors according to the multi-hot domain assignments. As the initial candidate domain texts might be inaccurate, we set the candidate domain texts to be learnable and jointly fine-tune them during training. Furthermore, we introduce a slack domain to cover samples that are not covered by the candidate domains. Experiments on several standard benchmarks demonstrate that LANIT achieves comparable or superior performance to the existing model.

* Project Page: https://ku-cvlab.github.io/LANIT/

Via

Access Paper or Ask Questions

InstaFormer: Instance-Aware Image-to-Image Translation with Transformer

Mar 30, 2022

Soohyun Kim, Jongbeom Baek, Jihye Park, Gyeongnyeon Kim, Seungryong Kim

Figure 1 for InstaFormer: Instance-Aware Image-to-Image Translation with Transformer

Figure 2 for InstaFormer: Instance-Aware Image-to-Image Translation with Transformer

Figure 3 for InstaFormer: Instance-Aware Image-to-Image Translation with Transformer

Figure 4 for InstaFormer: Instance-Aware Image-to-Image Translation with Transformer

Abstract:We present a novel Transformer-based network architecture for instance-aware image-to-image translation, dubbed InstaFormer, to effectively integrate global- and instance-level information. By considering extracted content features from an image as tokens, our networks discover global consensus of content features by considering context information through a self-attention module in Transformers. By augmenting such tokens with an instance-level feature extracted from the content feature with respect to bounding box information, our framework is capable of learning an interaction between object instances and the global image, thus boosting the instance-awareness. We replace layer normalization (LayerNorm) in standard Transformers with adaptive instance normalization (AdaIN) to enable a multi-modal translation with style codes. In addition, to improve the instance-awareness and translation quality at object regions, we present an instance-level content contrastive loss defined between input and translated image. We conduct experiments to demonstrate the effectiveness of our InstaFormer over the latest methods and provide extensive ablation studies.

* Accepted to CVPR 2022

Via

Access Paper or Ask Questions

Automatic Construction of Context-Aware Sentiment Lexicon in the Financial Domain Using Direction-Dependent Words

Jun 10, 2021

Jihye Park, Hye Jin Lee, Sungzoon Cho

Figure 1 for Automatic Construction of Context-Aware Sentiment Lexicon in the Financial Domain Using Direction-Dependent Words

Figure 2 for Automatic Construction of Context-Aware Sentiment Lexicon in the Financial Domain Using Direction-Dependent Words

Figure 3 for Automatic Construction of Context-Aware Sentiment Lexicon in the Financial Domain Using Direction-Dependent Words

Figure 4 for Automatic Construction of Context-Aware Sentiment Lexicon in the Financial Domain Using Direction-Dependent Words

Abstract:Increasing attention has been drawn to the sentiment analysis of financial documents. The most popular examples of such documents include analyst reports and economic news, the analysis of which is frequently used to capture the trends in market sentiments. On the other hand, the significance of the role sentiment analysis plays in the financial domain has given rise to the efforts to construct a financial domain-specific sentiment lexicon. Sentiment lexicons lend a hand for solving various text mining tasks, such as unsupervised classification of text data, while alleviating the arduous human labor required for manual labeling. One of the challenges in the construction of an effective sentiment lexicon is that the semantic orientation of a word may change depending on the context in which it appears. For instance, the word ``profit" usually conveys positive sentiments; however, when the word is juxtaposed with another word ``decrease," the sentiment associated with the phrase ``profit decreases" now becomes negative. Hence, the sentiment of a given word may shift as one begins to consider the context surrounding the word. In this paper, we address this issue by incorporating context when building sentiment lexicon from a given corpus. Specifically, we construct a lexicon named Senti-DD for the Sentiment lexicon composed of Direction-Dependent words, which expresses each term a pair of a directional word and a direction-dependent word. Experiment results show that higher classification performance is achieved with Senti-DD, proving the effectiveness of our method for automatically constructing a context-aware sentiment lexicon in the financial domain.

Via

Access Paper or Ask Questions