Abstract:Scene Text Editing (STE) aims to substitute text in an image with new desired text while preserving the background and styles of the original text. However, present techniques present a notable challenge in the generation of edited text images that exhibit a high degree of clarity and legibility. This challenge primarily stems from the inherent diversity found within various text types and the intricate textures of complex backgrounds. To address this challenge, this paper introduces a three-stage framework for transferring texts across text images. Initially, we introduce a text-swapping network that seamlessly substitutes the original text with the desired replacement. Subsequently, we incorporate a background inpainting network into our framework. This specialized network is designed to skillfully reconstruct background images, effectively addressing the voids left after the removal of the original text. This process meticulously preserves visual harmony and coherence in the background. Ultimately, the synthesis of outcomes from the text-swapping network and the background inpainting network is achieved through a fusion network, culminating in the creation of the meticulously edited final image. A demo video is included in the supplementary material.