Abstract: In this paper, we study whether music source separation can be used as a pre-training strategy for music representation learning, targeted at music classification tasks. To this end, we first pre-train U-Net networks under various music source separation objectives, such as the isolation of vocal or instrumental sources from a musical piece; afterwards, we attach a convolutional tail network to the pre-trained U-Net and jointly fine-tune the whole network. The features learned by the separation network are also propagated to the tail network through skip connections. Experimental results on two widely used and publicly available datasets indicate that pre-training the U-Nets with a music source separation objective can improve performance, compared to both training the whole network from scratch and using the tail network as a standalone model, in two music classification tasks: music auto-tagging, when vocal separation is used, and music genre classification in the case of multi-source separation.
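A minimal PyTorch sketch of the general idea described above: a toy U-Net is pre-trained for separation, and a small convolutional tail consumes its source estimate together with intermediate encoder features through skip connections. All module names, layer sizes and the 50-class output are illustrative assumptions, not the architecture used in the paper.

```python
# Sketch only: toy U-Net pre-trained for separation, then a classification tail
# fed by its output and skip features; sizes are placeholders.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU())
        self.out = nn.Conv2d(32, 1, 3, padding=1)    # soft mask over the input spectrogram

    def forward(self, spec):
        e1 = self.enc1(spec)
        e2 = self.enc2(e1)
        d1 = self.dec1(e2)
        mask = torch.sigmoid(self.out(torch.cat([d1, e1], dim=1)))
        return mask * spec, (e1, e2)                 # source estimate + features for the tail

class ClassifierTail(nn.Module):
    def __init__(self, n_classes=50):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(1 + 16 + 32, 64, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(64, n_classes)

    def forward(self, source, feats):
        e1, e2 = feats
        e2_up = nn.functional.interpolate(e2, size=e1.shape[-2:])
        x = torch.cat([source, e1, e2_up], dim=1)    # skip connections into the tail
        return self.fc(self.conv(x).flatten(1))

# 1) pre-train TinyUNet with a separation loss (e.g. L1 on vocal spectrograms),
# 2) then jointly fine-tune unet + tail with a tagging loss.
unet, tail = TinyUNet(), ClassifierTail()
spec = torch.randn(4, 1, 128, 128).abs()             # dummy magnitude spectrograms
source, feats = unet(spec)
logits = tail(source, feats)
```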
Abstract: Contrastive learning constitutes an emerging branch of self-supervised learning that leverages large amounts of unlabeled data by learning a latent space in which pairs of different views of the same sample are associated. In this paper, we propose musical source association as a pair generation strategy in the context of contrastive music representation learning. To this end, we modify COLA, a widely used contrastive audio learning framework, to learn to associate a song excerpt with a stochastically selected and automatically extracted vocal or instrumental source. We further introduce a novel modification to the contrastive loss that incorporates information about the presence or absence of specific sources. Our experimental evaluation on three downstream tasks (music auto-tagging, instrument classification and music genre classification), using the publicly available Magna-Tag-A-Tune (MTAT) as the source dataset, yields results competitive with existing methods in the literature, as well as faster network convergence. The results also show that this pre-training method can be steered towards specific features, according to the selected musical source, while remaining dependent on the quality of the separated sources.
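The following sketch illustrates, under assumed shapes and module names, how a COLA-style objective could treat a separated stem as the positive view of a mixture excerpt, using a bilinear similarity and in-batch negatives as in COLA. It does not include the proposed source-presence modification to the loss.

```python
# Sketch: contrastive pairing of a mixture excerpt with one of its separated stems.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, 32, 9, stride=4), nn.ReLU(),
                                 nn.Conv1d(32, 64, 9, stride=4), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1))
        self.proj = nn.Linear(64, dim)

    def forward(self, wav):                        # wav: (batch, 1, samples)
        return self.proj(self.net(wav).squeeze(-1))

def cola_style_loss(encoder, bilinear, mixture, sources):
    """mixture: (B, 1, T); sources: dict of separated stems, e.g. {'vocals': ..., 'accomp': ...}."""
    keys = list(sources.keys())
    pick = torch.randint(len(keys), (mixture.size(0),))   # stochastic source selection
    positive = torch.stack([sources[keys[int(k)]][i] for i, k in enumerate(pick)])
    za = encoder(mixture)                          # anchor embeddings
    zp = encoder(positive)                         # positive embeddings
    logits = za @ bilinear(zp).t()                 # bilinear similarity, as in COLA
    labels = torch.arange(mixture.size(0))         # other batch items act as negatives
    return F.cross_entropy(logits, labels)

encoder, bilinear = Encoder(), nn.Linear(128, 128, bias=False)
mix = torch.randn(8, 1, 16000)
stems = {'vocals': torch.randn(8, 1, 16000), 'accomp': torch.randn(8, 1, 16000)}
loss = cola_style_loss(encoder, bilinear, mix, stems)
```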
Abstract: The study of Music Cognition and neural responses to music has been invaluable in understanding human emotions. Brain signals, though, manifest a highly complex structure that makes processing and retrieving meaningful features challenging, particularly for abstract constructs like affect. Moreover, the performance of learning models is undermined by the limited amount of available neuronal data and their severe inter-subject variability. In this paper, we extract efficient, personalized affective representations from EEG signals during music listening. To this end, we employ music signals as a supervisory modality for EEG, aiming to project their semantic correspondence onto a common representation space. We utilize a bi-modal framework combining an LSTM-based attention model to process EEG and a pre-trained model for music tagging, along with a reverse domain discriminator to align the distributions of the two modalities, further constraining the learning process with emotion tags. The resulting framework can be utilized for emotion recognition both directly, by performing supervised predictions from either modality, and indirectly, by providing relevant music samples to EEG input queries. The experimental findings show the potential of enhancing neuronal data with stimulus information for recognition purposes and yield insights into the distribution and temporal variance of music-induced affective features.
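A rough sketch of the alignment mechanism described above: an attention-equipped LSTM encodes EEG, a projection sits on top of (assumed) music-tagger outputs, and a domain discriminator behind a gradient reversal layer adversarially aligns the two embedding distributions. Dimensions and module names are placeholders, not the paper's implementation.

```python
# Sketch: EEG encoder + music projection aligned via a reversed domain discriminator.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad                               # reverse gradients for adversarial alignment

class EEGEncoder(nn.Module):
    def __init__(self, channels=32, hidden=64, dim=128):
        super().__init__()
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.proj = nn.Linear(hidden, dim)

    def forward(self, eeg):                        # eeg: (batch, time, channels)
        h, _ = self.lstm(eeg)
        w = torch.softmax(self.attn(h), dim=1)     # attention over time steps
        return self.proj((w * h).sum(dim=1))

eeg_enc = EEGEncoder()
music_proj = nn.Linear(50, 128)                    # on top of frozen music-tagger outputs (assumed 50 tags)
disc = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

z_eeg = eeg_enc(torch.randn(4, 200, 32))
z_mus = music_proj(torch.randn(4, 50))
domain_logits = disc(GradReverse.apply(torch.cat([z_eeg, z_mus], dim=0)))
domain_labels = torch.tensor([0] * 4 + [1] * 4)    # which modality each embedding came from
adv_loss = nn.functional.cross_entropy(domain_logits, domain_labels)
```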
Abstract: The advent of deep learning has led to the prevalence of deep neural network architectures for monaural music source separation, with end-to-end approaches that operate directly at the waveform level increasingly receiving research attention. Among these approaches, transforming the input mixture to a learned latent space and multiplicatively applying a soft mask to the latent mixture achieves the best performance, but is prone to introducing artifacts into the source estimate. To alleviate this problem, in this paper we propose a hybrid time-domain approach, termed HTMD-Net, combining a lightweight masking component and a denoising module based on skip connections, in order to refine the source estimated by the masking procedure. Evaluation of our approach on the task of monaural singing voice separation on the musdb18 dataset indicates that our proposed method achieves competitive performance compared to methods based purely on masking when trained under the same conditions, especially regarding behavior during silent segments, while achieving higher computational efficiency.
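A time-domain toy sketch of the "mask, then denoise" structure outlined above: a learned-basis masking front end produces a rough source estimate, which a small skip-connection module then refines. Kernel sizes and widths are illustrative assumptions, not those of HTMD-Net.

```python
# Sketch: latent masking front end followed by a residual denoising stage.
import torch
import torch.nn as nn

class MaskingFrontEnd(nn.Module):
    def __init__(self, latent=64, kernel=16, stride=8):
        super().__init__()
        self.encoder = nn.Conv1d(1, latent, kernel, stride=stride)
        self.masker = nn.Sequential(nn.Conv1d(latent, latent, 3, padding=1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(latent, 1, kernel, stride=stride)

    def forward(self, mix):                        # mix: (batch, 1, samples)
        latent_mix = self.encoder(mix)
        masked = self.masker(latent_mix) * latent_mix   # multiplicative soft mask
        return self.decoder(masked)

class Denoiser(nn.Module):
    """Small encoder/decoder with a skip connection that refines the masked estimate."""
    def __init__(self, width=32):
        super().__init__()
        self.down = nn.Sequential(nn.Conv1d(1, width, 15, padding=7), nn.ReLU())
        self.up = nn.Conv1d(width, 1, 15, padding=7)

    def forward(self, rough):
        return rough + self.up(self.down(rough))   # residual/skip refinement

mix = torch.randn(2, 1, 16384)
front, denoise = MaskingFrontEnd(), Denoiser()
rough_vocals = front(mix)
vocals = denoise(rough_vocals)                     # refined singing-voice estimate
```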
Abstract: Sound Event Detection and Audio Classification tasks are traditionally addressed through time-frequency representations of audio signals, such as spectrograms. However, the emergence of deep neural networks as efficient feature extractors has enabled the direct use of audio signals for classification purposes. In this paper, we attempt to recognize musical instruments in polyphonic audio by feeding only their raw waveforms into deep learning models. Various recurrent and convolutional architectures incorporating residual connections are examined and parameterized in order to build end-to-end classifiers with low computational cost and only minimal preprocessing. We obtain competitive classification scores and useful instrument-wise insight on the IRMAS test set, utilizing a parallel CNN-BiGRU model with multiple residual connections, while maintaining a significantly reduced number of trainable parameters.
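A condensed sketch of an end-to-end raw-waveform classifier in this spirit, combining a residual 1-D CNN front end with a BiGRU branch. The class count (11, as in IRMAS) and all layer sizes are assumptions for illustration, not the evaluated model.

```python
# Sketch: parallel CNN + BiGRU readout over raw audio, with residual blocks.
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv1d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv1d(ch, ch, 3, padding=1))
    def forward(self, x):
        return torch.relu(x + self.body(x))        # residual connection

class WaveformTagger(nn.Module):
    def __init__(self, n_instruments=11):
        super().__init__()
        self.front = nn.Sequential(nn.Conv1d(1, 32, 64, stride=16), nn.ReLU(),
                                   ResBlock1d(32), nn.MaxPool1d(4),
                                   ResBlock1d(32), nn.MaxPool1d(4))
        self.gru = nn.GRU(32, 32, batch_first=True, bidirectional=True)
        self.head = nn.Linear(32 + 64, n_instruments)

    def forward(self, wav):                        # wav: (batch, 1, samples)
        feat = self.front(wav)                     # (batch, 32, frames)
        cnn_vec = feat.mean(dim=-1)                # global average over time
        rnn_out, _ = self.gru(feat.transpose(1, 2))
        rnn_vec = rnn_out[:, -1]                   # last BiGRU state
        return self.head(torch.cat([cnn_vec, rnn_vec], dim=1))

model = WaveformTagger()
logits = model(torch.randn(4, 1, 22050))           # multi-label scores per instrument
```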
Abstract: Emotion Recognition from EEG signals has long been researched, as it can assist numerous medical and rehabilitative applications. However, the complex and noisy structure of these signals has proven to be a serious barrier for traditional modeling methods. In this paper, we employ multifractal analysis to examine the behavior of EEG signals, in terms of the presence of fluctuations and the degree of fragmentation along their major frequency bands, for the task of emotion recognition. In order to extract emotion-related features, we utilize two novel algorithms for EEG analysis, based on Multiscale Fractal Dimension and Multifractal Detrended Fluctuation Analysis. The proposed feature extraction methods perform efficiently, surpassing some widely used baseline features on the competitive DEAP dataset, indicating that multifractal analysis could serve as a basis for the development of robust models for affective state recognition.
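For concreteness, a small NumPy sketch of standard multifractal detrended fluctuation analysis (MFDFA), the kind of analysis the abstract builds on: the generalized Hurst exponents h(q) are obtained as slopes of log F(q, s) versus log s. The scales and q values below are arbitrary examples, and the signal is a random stand-in for a band-filtered EEG epoch.

```python
# Sketch: textbook MFDFA fluctuation function for a 1-D signal.
import numpy as np

def mfdfa(signal, scales, q_values, order=1):
    """Return the fluctuation function F(q, s)."""
    profile = np.cumsum(signal - np.mean(signal))
    F = np.zeros((len(q_values), len(scales)))
    for j, s in enumerate(scales):
        n_seg = len(profile) // s
        variances = []
        for v in range(n_seg):
            segment = profile[v * s:(v + 1) * s]
            t = np.arange(s)
            trend = np.polyval(np.polyfit(t, segment, order), t)   # local detrending
            variances.append(np.mean((segment - trend) ** 2))
        variances = np.asarray(variances)
        for i, q in enumerate(q_values):
            if q == 0:
                F[i, j] = np.exp(0.5 * np.mean(np.log(variances)))
            else:
                F[i, j] = np.mean(variances ** (q / 2)) ** (1 / q)
    return F

eeg_band = np.random.randn(4096)                   # stand-in for an alpha-band filtered epoch
scales = np.array([16, 32, 64, 128, 256])
qs = np.array([-2.0, 0.0, 2.0])
F = mfdfa(eeg_band, scales, qs)
# generalized Hurst exponents h(q): slopes of log F(q, s) vs. log s
h = [np.polyfit(np.log(scales), np.log(F[i]), 1)[0] for i in range(len(qs))]
```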
Abstract: Instrument classification is one of the fields of Music Information Retrieval (MIR) that has attracted considerable research interest. However, the majority of this research deals with monophonic music, while efforts on polyphonic material mainly focus on predominant instrument recognition or multi-instrument recognition for entire tracks. We present an approach for instrument classification in polyphonic music that is trained on monophonic data and relies on mixing-based augmentation methods. Specifically, we experiment with pitch- and tempo-based synchronization, as well as mixes of tracks with similar music genres. Furthermore, a custom CNN model is proposed that uses the augmented training data efficiently, and a range of suitable evaluation metrics is discussed. The tempo-sync and genre-based techniques stand out, achieving a label ranking average precision of 81% and detecting up to 9 instruments in over 2300 test tracks.
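An illustrative sketch of tempo-synchronized mixing augmentation: monophonic recordings are time-stretched to a common tempo, summed, and assigned the union of their instrument labels. File paths, tempi and label codes are placeholders, and this is a simplified stand-in rather than the paper's exact pipeline.

```python
# Sketch: build one polyphonic training example from tempo-synchronized monophonic stems.
import numpy as np
import librosa

def tempo_synced_mix(paths, tempi, labels, target_tempo=120.0, sr=22050):
    stems, mixed_labels = [], set()
    for path, bpm, label in zip(paths, tempi, labels):
        y, _ = librosa.load(path, sr=sr, mono=True)
        y = librosa.effects.time_stretch(y, rate=target_tempo / bpm)   # sync to common tempo
        stems.append(y)
        mixed_labels.add(label)
    length = min(len(s) for s in stems)            # crop to the shortest stem
    mix = np.sum([s[:length] for s in stems], axis=0)
    mix /= np.max(np.abs(mix)) + 1e-8              # peak-normalize the mixture
    return mix, sorted(mixed_labels)               # label set = union of instrument labels

# hypothetical usage with annotated tempi:
# mix, labels = tempo_synced_mix(["cello.wav", "flute.wav"], [96.0, 132.0], ["cel", "flu"])
```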
Abstract: Detecting visual relationships, i.e. <Subject, Predicate, Object> triplets, is a challenging Scene Understanding task, approached in the past via linguistic priors or spatial information in a single feature branch. We introduce a new deeply supervised two-branch architecture, the Multimodal Attentional Translation Embeddings, where the visual features of each branch are driven by a multimodal attentional mechanism that exploits spatio-linguistic similarities in a low-dimensional space. We present a variety of experiments comparing against all related approaches in the literature, several of which we re-implement and fine-tune. Results on the commonly employed VRD dataset [1] show that the proposed method clearly outperforms all others, and we justify our claims both quantitatively and qualitatively.
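A schematic sketch of the translation-embedding scoring at the heart of such approaches, in which subject + predicate ≈ object in a low-dimensional space and an attention gate modulates the visual features from a spatio-linguistic context vector. All dimensions and the specific gating form are assumptions, not the published architecture.

```python
# Sketch: attention-gated features scored with a translation embedding (S + P ≈ O).
import torch
import torch.nn as nn

class TranslationEmbedding(nn.Module):
    def __init__(self, visual_dim=512, context_dim=64, embed_dim=128, n_predicates=70):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(context_dim, visual_dim), nn.Sigmoid())
        self.proj_subj = nn.Linear(visual_dim, embed_dim)
        self.proj_obj = nn.Linear(visual_dim, embed_dim)
        self.predicates = nn.Embedding(n_predicates, embed_dim)

    def forward(self, subj_feat, obj_feat, context):
        gate = self.attn(context)                  # spatio-linguistic attention over features
        s = self.proj_subj(subj_feat * gate)
        o = self.proj_obj(obj_feat * gate)
        # score every predicate by how well it "translates" the subject onto the object
        diff = s.unsqueeze(1) + self.predicates.weight.unsqueeze(0) - o.unsqueeze(1)
        return -diff.pow(2).sum(dim=-1)            # higher score = better <S, P, O> fit

model = TranslationEmbedding()
scores = model(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 64))
predicted_predicate = scores.argmax(dim=1)         # (batch,) predicate indices
```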
Abstract: Despite the availability of a huge amount of video data accompanied by descriptive texts, it is not always easy to exploit the information contained in natural language in order to automatically recognize video concepts. Towards this goal, in this paper we use textual cues as a means of supervision, introducing two weakly supervised techniques that extend the Multiple Instance Learning (MIL) framework: Fuzzy Sets Multiple Instance Learning (FSMIL) and Probabilistic Labels Multiple Instance Learning (PLMIL). The former encodes the spatio-temporal imprecision of the linguistic descriptions with Fuzzy Sets, while the latter models the different interpretations of each description's semantics with Probabilistic Labels; both are formulated through a convex optimization algorithm. In addition, we provide a novel technique for extracting weak labels in the presence of complex semantics, based on semantic similarity computations. We evaluate our methods on two distinct problems, namely face and action recognition, in the challenging and realistic setting of movies accompanied by their screenplays, contained in the COGNIMUSE database. We show that, on both tasks, our method considerably outperforms a state-of-the-art weakly supervised approach, as well as other baselines.
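A toy sketch in the same weakly supervised MIL spirit: instances in a bag (e.g. face tracks or clips in a scene) are scored, bag scores are max-pooled, and uncertain labels derived from text are treated as soft evidence that weights the loss. This is a deliberate simplification for illustration, not the convex FSMIL/PLMIL formulations named above.

```python
# Sketch: max-pooled MIL bag scoring with probabilistic (soft) weak labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceScorer(nn.Module):
    def __init__(self, feat_dim=256, n_labels=10):
        super().__init__()
        self.fc = nn.Linear(feat_dim, n_labels)

    def forward(self, bag):                        # bag: (instances, feat_dim)
        inst_logits = self.fc(bag)
        return inst_logits.max(dim=0).values       # bag-level scores via max pooling

def probabilistic_mil_loss(bag_scores, label_probs):
    """label_probs: soft evidence (0..1) that each label applies to the bag."""
    bce = F.binary_cross_entropy_with_logits(bag_scores, torch.round(label_probs),
                                             reduction='none')
    confidence = torch.abs(label_probs - 0.5) * 2  # down-weight ambiguous descriptions
    return (confidence * bce).mean()

scorer = InstanceScorer()
bag = torch.randn(20, 256)                         # 20 face/clip instances in one scene
probs = torch.rand(10)                             # weak labels mined from the screenplay
loss = probabilistic_mil_loss(scorer(bag), probs)
```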