Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dogucan Yaman

Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck

Oct 15, 2024

Fevziye Irem Eyiokur, Christian Huber, Thai-Binh Nguyen, Tuan-Nam Nguyen, Fabian Retkowski, Enes Yavuz Ugan, Dogucan Yaman, Alexander Waibel

Figure 1 for Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck

Figure 2 for Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck

Figure 3 for Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck

Figure 4 for Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck

Abstract:In this paper, we report on communication experiments conducted in the summer of 2022 during a deep dive to the wreck of the Titanic. Radio transmission is not possible in deep sea water, and communication links rely on sonar signals. Due to the low bandwidth of sonar signals and the need to communicate readable data, text messaging is used in deep-sea missions. In this paper, we report results and experiences from a messaging system that converts speech to text in a submarine, sends text messages to the surface, and reconstructs those messages as synthetic lip-synchronous videos of the speakers. The resulting system was tested during an actual dive to Titanic in the summer of 2022. We achieved an acceptable latency for a system of such complexity as well as good quality. The system demonstration video can be found at the following link: https://youtu.be/C4lyM86-5Ig

Via

Access Paper or Ask Questions

Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation

May 07, 2024

Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Seymanur Aktı, Hazım Kemal Ekenel, Alexander Waibel

Abstract:In the task of talking face generation, the objective is to generate a face video with lips synchronized to the corresponding audio while preserving visual details and identity information. Current methods face the challenge of learning accurate lip synchronization while avoiding detrimental effects on visual quality, as well as robustly evaluating such synchronization. To tackle these problems, we propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating lip synchronization loss during training. Moreover, leveraging AV-HuBERT's features, we introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance. Experimental results, along with a detailed ablation study, demonstrate the effectiveness of our approach and the utility of the proposed evaluation metrics.

* CVPR2024 NTIRE Workshop

Via

Access Paper or Ask Questions

Plug the Leaks: Advancing Audio-driven Talking Face Generation by Preventing Unintended Information Flow

Jul 18, 2023

Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Hazim Kemal Ekenel, Alexander Waibel

Figure 1 for Plug the Leaks: Advancing Audio-driven Talking Face Generation by Preventing Unintended Information Flow

Figure 2 for Plug the Leaks: Advancing Audio-driven Talking Face Generation by Preventing Unintended Information Flow

Figure 3 for Plug the Leaks: Advancing Audio-driven Talking Face Generation by Preventing Unintended Information Flow

Figure 4 for Plug the Leaks: Advancing Audio-driven Talking Face Generation by Preventing Unintended Information Flow

Abstract:Audio-driven talking face generation is the task of creating a lip-synchronized, realistic face video from given audio and reference frames. This involves two major challenges: overall visual quality of generated images on the one hand, and audio-visual synchronization of the mouth part on the other hand. In this paper, we start by identifying several problematic aspects of synchronization methods in recent audio-driven talking face generation approaches. Specifically, this involves unintended flow of lip and pose information from the reference to the generated image, as well as instabilities during model training. Subsequently, we propose various techniques for obviating these issues: First, a silent-lip reference image generator prevents leaking of lips from the reference to the generated image. Second, an adaptive triplet loss handles the pose leaking problem. Finally, we propose a stabilized formulation of synchronization loss, circumventing aforementioned training instabilities while additionally further alleviating the lip leaking issue. Combining the individual improvements, we present state-of-the art performance on LRS2 and LRW in both synchronization and visual quality. We further validate our design in various ablation experiments, confirming the individual contributions as well as their complementary effects.

* Submitted to ICCV 2023

Via

Access Paper or Ask Questions

Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos

Jun 09, 2022

Alexander Waibel, Moritz Behr, Fevziye Irem Eyiokur, Dogucan Yaman, Tuan-Nam Nguyen, Carlos Mullov, Mehmet Arif Demirtas, Alperen Kantarcı, Stefan Constantin, Hazım Kemal Ekenel

Figure 1 for Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos

Figure 2 for Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos

Figure 3 for Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos

Figure 4 for Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos

Abstract:In this paper, we propose a neural end-to-end system for voice preserving, lip-synchronous translation of videos. The system is designed to combine multiple component models and produces a video of the original speaker speaking in the target language that is lip-synchronous with the target speech, yet maintains emphases in speech, voice characteristics, face video of the original speaker. The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model. The translated text is then synthesized by a Text-to-Speech model that recreates the original emphases mapped from the original sentence. The resulting synthetic voice is then mapped back to the original speakers' voice using a voice conversion model. Finally, to synchronize the lips of the speaker with the translated audio, a conditional generative adversarial network-based model generates frames of adapted lip movements with respect to the input face image as well as the output of the voice conversion model. In the end, the system combines the generated video with the converted audio to produce the final output. The result is a video of a speaker speaking in another language without actually knowing it. To evaluate our design, we present a user study of the complete system as well as separate evaluations of the single components. Since there is no available dataset to evaluate our whole system, we collect a test set and evaluate our system on this test set. The results indicate that our system is able to generate convincing videos of the original speaker speaking the target language while preserving the original speaker's characteristics. The collected dataset will be shared.

Via

Access Paper or Ask Questions

Exposure Correction Model to Enhance Image Quality

Apr 22, 2022

Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel

Figure 1 for Exposure Correction Model to Enhance Image Quality

Figure 2 for Exposure Correction Model to Enhance Image Quality

Figure 3 for Exposure Correction Model to Enhance Image Quality

Figure 4 for Exposure Correction Model to Enhance Image Quality

Abstract:Exposure errors in an image cause a degradation in the contrast and low visibility in the content. In this paper, we address this problem and propose an end-to-end exposure correction model in order to handle both under- and overexposure errors with a single model. Our model contains an image encoder, consecutive residual blocks, and image decoder to synthesize the corrected image. We utilize perceptual loss, feature matching loss, and multi-scale discriminator to increase the quality of the generated image as well as to make the training more stable. The experimental results indicate the effectiveness of proposed model. We achieve the state-of-the-art result on a large-scale exposure dataset. Besides, we investigate the effect of exposure setting of the image on the portrait matting task. We find that under- and overexposed images cause severe degradation in the performance of the portrait matting models. We show that after applying exposure correction with the proposed model, the portrait matting quality increases significantly. https://github.com/yamand16/ExposureCorrection

* Accepted for CVPR2022 NTIRE Workshop

Via

Access Paper or Ask Questions

Alpha Matte Generation from Single Input for Portrait Matting

Jun 14, 2021

Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel

Figure 1 for Alpha Matte Generation from Single Input for Portrait Matting

Figure 2 for Alpha Matte Generation from Single Input for Portrait Matting

Figure 3 for Alpha Matte Generation from Single Input for Portrait Matting

Figure 4 for Alpha Matte Generation from Single Input for Portrait Matting

Abstract:Portrait matting is an important research problem with a wide range of applications, such as video conference app, image/video editing, and post-production. The goal is to predict an alpha matte that identifies the effect of each pixel on the foreground subject. Traditional approaches and most of the existing works utilized an additional input, e.g., trimap, background image, to predict alpha matte. However, providing additional input is not always practical. Besides, models are too sensitive to these additional inputs. In this paper, we introduce an additional input-free approach to perform portrait matting using Generative Adversarial Nets (GANs). We divide the main task into two subtasks. For this, we propose a segmentation network for the person segmentation and the alpha generation network for alpha matte prediction. While the segmentation network takes an input image and produces a coarse segmentation map, the alpha generation network utilizes the same input image as well as a coarse segmentation map that is produced by the segmentation network to predict the alpha matte. Besides, we present a segmentation encoding block to downsample the coarse segmentation map and provide feature representation to the residual block. Furthermore, we propose border loss to penalize only the borders of the subject separately which is more likely to be challenging and we also adapt perceptual loss for portrait matting. To train the proposed system, we combine two different popular training datasets to improve the amount of data as well as diversity to address domain shift problems in the inference time. We tested our model on three different benchmark datasets, namely Adobe Image Matting dataset, Portrait Matting dataset, and Distinctions dataset. The proposed method outperformed the MODNet method that also takes a single input.

Via

Access Paper or Ask Questions

CAGAN: Text-To-Image Generation with Combined Attention GANs

Apr 26, 2021

Henning Schulze, Dogucan Yaman, Alexander Waibel

Figure 1 for CAGAN: Text-To-Image Generation with Combined Attention GANs

Figure 2 for CAGAN: Text-To-Image Generation with Combined Attention GANs

Figure 3 for CAGAN: Text-To-Image Generation with Combined Attention GANs

Figure 4 for CAGAN: Text-To-Image Generation with Combined Attention GANs

Abstract:Generating images according to natural language descriptions is a challenging task. In this work, we propose the Combined Attention Generative Adversarial Network (CAGAN) to generate photo-realistic images according to textual descriptions. The proposed CAGAN utilises two attention models: word attention to draw different sub-regions conditioned on related words; and squeeze-and-excitation attention to capture non-linear interaction among channels. With spectral normalisation to stabilise training, our proposed CAGAN improves the state of the art on the IS and FID on the CUB dataset and the FID on the more challenging COCO dataset. Furthermore, we demonstrate that judging a model by a single evaluation metric can be misleading by developing an additional model adding local self-attention which scores a higher IS, outperforming the state of the art on the CUB dataset, but generates unrealistic images through feature repetition.

Via

Access Paper or Ask Questions

Ear2Face: Deep Biometric Modality Mapping

Jun 02, 2020

Dogucan Yaman, Fevziye Irem Eyiokur, Hazım Kemal Ekenel

Figure 1 for Ear2Face: Deep Biometric Modality Mapping

Figure 2 for Ear2Face: Deep Biometric Modality Mapping

Figure 3 for Ear2Face: Deep Biometric Modality Mapping

Figure 4 for Ear2Face: Deep Biometric Modality Mapping

Abstract:In this paper, we explore the correlation between different visual biometric modalities. For this purpose, we present an end-to-end deep neural network model that learns a mapping between the biometric modalities. Namely, our goal is to generate a frontal face image of a subject given his/her ear image as the input. We formulated the problem as a paired image-to-image translation task and collected datasets of ear and face image pairs from the Multi-PIE and FERET datasets to train our GAN-based models. We employed feature reconstruction and style reconstruction losses in addition to adversarial and pixel losses. We evaluated the proposed method both in terms of reconstruction quality and in terms of person identification accuracy. To assess the generalization capability of the learned mapping models, we also run cross-dataset experiments. That is, we trained the model on the FERET dataset and tested it on the Multi-PIE dataset and vice versa. We have achieved very promising results, especially on the FERET dataset, generating visually appealing face images from ear image inputs. Moreover, we attained a very high cross-modality person identification performance, for example, reaching 90.9% Rank-10 identification accuracy on the FERET dataset.

* 13 pages, 4 figures

Via

Access Paper or Ask Questions

Multimodal Age and Gender Classification Using Ear and Profile Face Images

Jul 23, 2019

Dogucan Yaman, Fevziye Irem Eyiokur, Hazım Kemal Ekenel

Abstract:In this paper, we present multimodal deep neural network frameworks for age and gender classification, which take input a profile face image as well as an ear image. Our main objective is to enhance the accuracy of soft biometric trait extraction from profile face images by additionally utilizing a promising biometric modality: ear appearance. For this purpose, we provided end-to-end multimodal deep learning frameworks. We explored different multimodal strategies by employing data, feature, and score level fusion. To increase representation and discrimination capability of the deep neural networks, we benefited from domain adaptation and employed center loss besides softmax loss. We conducted extensive experiments on the UND-F, UND-J2, and FERET datasets. Experimental results indicated that profile face images contain a rich source of information for age and gender classification. We found that the presented multimodal system achieves very high age and gender classification accuracies. Moreover, we attained superior results compared to the state-of-the-art profile face image or ear image-based age and gender classification methods.

* 8 pages, 4 figures, accepted for CVPR 2019 - Workshop on Biometrics

Via

Access Paper or Ask Questions

The Unconstrained Ear Recognition Challenge 2019 - ArXiv Version With Appendix

Mar 14, 2019

Žiga Emeršič, Aruna Kumar S. V., B. S. Harish, Weronika Gutfeter, Jalil Nourmohammadi Khiarak, Andrzej Pacut, Earnest Hansley, Mauricio Pamplona Segundo, Sudeep Sarkar, Hyeonjung Park(+21 more)

Figure 1 for The Unconstrained Ear Recognition Challenge 2019 - ArXiv Version With Appendix

Figure 2 for The Unconstrained Ear Recognition Challenge 2019 - ArXiv Version With Appendix

Figure 3 for The Unconstrained Ear Recognition Challenge 2019 - ArXiv Version With Appendix

Figure 4 for The Unconstrained Ear Recognition Challenge 2019 - ArXiv Version With Appendix

Abstract:This paper presents a summary of the 2019 Unconstrained Ear Recognition Challenge (UERC), the second in a series of group benchmarking efforts centered around the problem of person recognition from ear images captured in uncontrolled settings. The goal of the challenge is to assess the performance of existing ear recognition techniques on a challenging large-scale ear dataset and to analyze performance of the technology from various viewpoints, such as generalization abilities to unseen data characteristics, sensitivity to rotations, occlusions and image resolution and performance bias on sub-groups of subjects, selected based on demographic criteria, i.e. gender and ethnicity. Research groups from 12 institutions entered the competition and submitted a total of 13 recognition approaches ranging from descriptor-based methods to deep-learning models. The majority of submissions focused on ensemble based methods combining either representations from multiple deep models or hand-crafted with learned image descriptors. Our analysis shows that methods incorporating deep learning models clearly outperform techniques relying solely on hand-crafted descriptors, even though both groups of techniques exhibit similar behaviour when it comes to robustness to various covariates, such presence of occlusions, changes in (head) pose, or variability in image resolution. The results of the challenge also show that there has been considerable progress since the first UERC in 2017, but that there is still ample room for further research in this area.

* The content of this paper was published in ICB, 2019. This ArXiv version is from before the peer review

Via

Access Paper or Ask Questions