Daria Diatlova

METR: Image Watermarking with Large Number of Unique Messages

Aug 15, 2024

Alexander Varlamov, Daria Diatlova, Egor Spirin

Abstract: Improvements in diffusion models have boosted the quality of image generation, which has led researchers, companies, and creators to focus on improving watermarking algorithms. This provision would make it possible to clearly identify the creators of generative art. The main challenges that modern watermarking algorithms face have to do with their ability to withstand attacks and encrypt many unique messages, such as user IDs. In this paper, we present METR: Message Enhanced Tree-Ring, which is an approach that aims to address these challenges. METR is built on the Tree-Ring watermarking algorithm, a technique that makes it possible to encode multiple distinct messages without compromising attack resilience or image quality. This ensures the suitability of this watermarking algorithm for any Diffusion Model. In order to surpass the limitations on the quantity of encoded messages, we propose METR++, an enhanced version of METR. This approach, while limited to the Latent Diffusion Model architecture, is designed to inject a virtually unlimited number of unique messages. We demonstrate its robustness to attacks and ability to encrypt many unique messages while preserving image quality, which makes METR and METR++ hold great potential for practical applications in real-world settings. Our code is available at https://github.com/deepvk/metr

* 14 pages, 9 figures, code is available at https://github.com/deepvk/metr

Adapting WavLM for Speech Emotion Recognition

May 07, 2024

Daria Diatlova, Anton Udalov, Vitalii Shutov, Egor Spirin

Abstract: Recently, the usage of speech self-supervised models (SSL) for downstream tasks has been drawing a lot of attention. While large pre-trained models commonly outperform smaller models trained from scratch, questions regarding the optimal fine-tuning strategies remain prevalent. In this paper, we explore the fine-tuning strategies of the WavLM Large model for the speech emotion recognition task on the MSP Podcast Corpus. More specifically, we perform a series of experiments focusing on using gender and semantic information from utterances. We then sum up our findings and describe the final model we used for submission to Speech Emotion Recognition Challenge 2024.

EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech

Jun 28, 2023

Daria Diatlova, Vitaly Shutov

Abstract: State-of-the-art speech synthesis models try to get as close as possible to the human voice. Hence, modelling emotions is an essential part of Text-To-Speech (TTS) research. In our work, we selected FastSpeech2 as the starting point and proposed a series of modifications for synthesizing emotional speech. According to automatic and human evaluation, our model, EmoSpeech, surpasses existing models regarding both MOS score and emotion recognition accuracy in generated speech. We provided a detailed ablation study for every extension to FastSpeech2 architecture that forms EmoSpeech. The uneven distribution of emotions in the text is crucial for better, synthesized speech and intonation perception. Our model includes a conditioning mechanism that effectively handles this issue by allowing emotions to contribute to each phone with varying intensity levels. The human assessment indicates that proposed modifications generate audio with higher MOS and emotional expressiveness.
