Abstract:Face anti-spoofing (FAS) plays a vital role in preventing face recognition (FR) systems from presentation attacks. Nowadays, FAS systems face the challenge of domain shift, impacting the generalization performance of existing FAS methods. In this paper, we rethink about the inherence of domain shift and deconstruct it into two factors: image style and image quality. Quality influences the purity of the presentation of spoof information, while style affects the manner in which spoof information is presented. Based on our analysis, we propose DiffFAS framework, which quantifies quality as prior information input into the network to counter image quality shift, and performs diffusion-based high-fidelity cross-domain and cross-attack types generation to counter image style shift. DiffFAS transforms easily collectible live faces into high-fidelity attack faces with precise labels while maintaining consistency between live and spoof face identities, which can also alleviate the scarcity of labeled data with novel type attacks faced by nowadays FAS system. We demonstrate the effectiveness of our framework on challenging cross-domain and cross-attack FAS datasets, achieving the state-of-the-art performance. Available at https://github.com/murphytju/DiffFAS.
Abstract:Facial expression recognition (FER) is an important research topic in emotional artificial intelligence. In recent decades, researchers have made remarkable progress. However, current FER paradigms face challenges in generalization, lack semantic information aligned with natural language, and struggle to process both images and videos within a unified framework, making their application in multimodal emotion understanding and human-computer interaction difficult. Multimodal Large Language Models (MLLMs) have recently achieved success, offering advantages in addressing these issues and potentially overcoming the limitations of current FER paradigms. However, directly applying pre-trained MLLMs to FER still faces several challenges. Our zero-shot evaluations of existing open-source MLLMs on FER indicate a significant performance gap compared to GPT-4V and current supervised state-of-the-art (SOTA) methods. In this paper, we aim to enhance MLLMs' capabilities in understanding facial expressions. We first generate instruction data for five FER datasets with Gemini. We then propose a novel MLLM, named EMO-LLaMA, which incorporates facial priors from a pretrained facial analysis network to enhance human facial information. Specifically, we design a Face Info Mining module to extract both global and local facial information. Additionally, we utilize a handcrafted prompt to introduce age-gender-race attributes, considering the emotional differences across different human groups. Extensive experiments show that EMO-LLaMA achieves SOTA-comparable or competitive results across both static and dynamic FER datasets. The instruction dataset and code are available at https://github.com/xxtars/EMO-LLaMA.
Abstract:Recent advancements in the automatic re-identification of animal individuals from images have opened up new possibilities for studying wildlife through camera traps and citizen science projects. Existing methods leverage distinct and permanent visual body markings, such as fur patterns or scars, and typically employ one of two strategies: local features or end-to-end learning. In this study, we delve into the impact of training set size by conducting comprehensive experiments across six different methods and five animal species. While it is well known that end-to-end learning-based methods surpass local feature-based methods given a sufficient amount of good-quality training data, the challenge of gathering such datasets for wildlife animals means that local feature-based methods remain a more practical approach for many species. We demonstrate the benefits of both local feature and end-to-end learning-based approaches and show that species-specific characteristics, particularly intra-individual variance, have a notable effect on training data requirements.
Abstract:In this work, we focus on a special group of human body language -- the micro-gesture (MG), which differs from the range of ordinary illustrative gestures in that they are not intentional behaviors performed to convey information to others, but rather unintentional behaviors driven by inner feelings. This characteristic introduces two novel challenges regarding micro-gestures that are worth rethinking. The first is whether strategies designed for other action recognition are entirely applicable to micro-gestures. The second is whether micro-gestures, as supplementary data, can provide additional insights for emotional understanding. In recognizing micro-gestures, we explored various augmentation strategies that take into account the subtle spatial and brief temporal characteristics of micro-gestures, often accompanied by repetitiveness, to determine more suitable augmentation methods. Considering the significance of temporal domain information for micro-gestures, we introduce a simple and efficient plug-and-play spatiotemporal balancing fusion method. We not only studied our method on the considered micro-gesture dataset but also conducted experiments on mainstream action datasets. The results show that our approach performs well in micro-gesture recognition and on other datasets, achieving state-of-the-art performance compared to previous micro-gesture recognition methods. For emotional understanding based on micro-gestures, we construct complex emotional reasoning scenarios. Our evaluation, conducted with large language models, shows that micro-gestures play a significant and positive role in enhancing comprehensive emotional understanding. The scenarios we developed can be extended to other micro-gesture-based tasks such as deception detection and interviews. We confirm that our new insights contribute to advancing research in micro-gesture and emotional artificial intelligence.
Abstract:Emotion AI is the ability of computers to understand human emotional states. Existing works have achieved promising progress, but two limitations remain to be solved: 1) Previous studies have been more focused on short sequential video emotion analysis while overlooking long sequential video. However, the emotions in short sequential videos only reflect instantaneous emotions, which may be deliberately guided or hidden. In contrast, long sequential videos can reveal authentic emotions; 2) Previous studies commonly utilize various signals such as facial, speech, and even sensitive biological signals (e.g., electrocardiogram). However, due to the increasing demand for privacy, developing Emotion AI without relying on sensitive signals is becoming important. To address the aforementioned limitations, in this paper, we construct a dataset for Emotion Analysis in Long-sequential and De-identity videos called EALD by collecting and processing the sequences of athletes' post-match interviews. In addition to providing annotations of the overall emotional state of each video, we also provide the Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free clue to understanding the emotional state. Moreover, we provide a simple but effective baseline for further research. More precisely, we evaluate the Multimodal Large Language Models (MLLMs) with de-identification signals (e.g., visual, speech, and NFBLs) to perform emotion analysis. Our experimental results demonstrate that: 1) MLLMs can achieve comparable, even better performance than the supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long sequential emotion analysis. EALD will be available on the open-source platform.
Abstract:Plankton recognition provides novel possibilities to study various environmental aspects and an interesting real-world context to develop domain adaptation (DA) methods. Different imaging instruments cause domain shift between datasets hampering the development of general plankton recognition methods. A promising remedy for this is DA allowing to adapt a model trained on one instrument to other instruments. In this paper, we present a new DA dataset called DAPlankton which consists of phytoplankton images obtained with different instruments. Phytoplankton provides a challenging DA problem due to the fine-grained nature of the task and high class imbalance in real-world datasets. DAPlankton consists of two subsets. DAPlankton_LAB contains images of cultured phytoplankton providing a balanced dataset with minimal label uncertainty. DAPlankton_SEA consists of images collected from the Baltic Sea providing challenging real-world data with large intra-class variance and class imbalance. We further present a benchmark comparison of three widely used DA methods.
Abstract:Image-based re-identification of animal individuals allows gathering of information such as migration patterns of the animals over time. This, together with large image volumes collected using camera traps and crowdsourcing, opens novel possibilities to study animal populations. For many species, the re-identification can be done by analyzing the permanent fur, feather, or skin patterns that are unique to each individual. In this paper, we address the re-identification by combining two types of pattern similarity metrics: 1) pattern appearance similarity obtained by pattern feature aggregation and 2) geometric pattern similarity obtained by analyzing the geometric consistency of pattern similarities. The proposed combination allows to efficiently utilize both the local and global pattern features, providing a general re-identification approach that can be applied to a wide variety of different pattern types. In the experimental part of the work, we demonstrate that the method achieves promising re-identification accuracies for Saimaa ringed seals and whale sharks.
Abstract:Planktonic organisms are key components of aquatic ecosystems and respond quickly to changes in the environment, therefore their monitoring is vital to understand the changes in the environment. Yet, monitoring plankton at appropriate scales still remains a challenge, limiting our understanding of functioning of aquatic systems and their response to changes. Modern plankton imaging instruments can be utilized to sample at high frequencies, enabling novel possibilities to study plankton populations. However, manual analysis of the data is costly, time consuming and expert based, making such approach unsuitable for large-scale application and urging for automatic solutions. The key problem related to the utilization of plankton datasets through image analysis is plankton recognition. Despite the large amount of research done, automatic methods have not been widely adopted for operational use. In this paper, a comprehensive survey on existing solutions for automatic plankton recognition is presented. First, we identify the most notable challenges that that make the development of plankton recognition systems difficult. Then, we provide a detailed description of solutions for these challenges proposed in plankton recognition literature. Finally, we propose a workflow to identify the specific challenges in new datasets and the recommended approaches to address them. For many of the challenges, applicable solutions exist. However, important challenges remain unsolved: 1) the domain shift between the datasets hindering the development of a general plankton recognition system that would work across different imaging instruments, 2) the difficulty to identify and process the images of previously unseen classes, and 3) the uncertainty in expert annotations that affects the training of the machine learning models for recognition. These challenges should be addressed in the future research.
Abstract:Phytoplankton parasites are largely understudied microbial components with a potentially significant ecological impact on phytoplankton bloom dynamics. To better understand their impact, we need improved detection methods to integrate phytoplankton parasite interactions in monitoring aquatic ecosystems. Automated imaging devices usually produce high amount of phytoplankton image data, while the occurrence of anomalous phytoplankton data is rare. Thus, we propose an unsupervised anomaly detection system based on the similarity of the original and autoencoder-reconstructed samples. With this approach, we were able to reach an overall F1 score of 0.75 in nine phytoplankton species, which could be further improved by species-specific fine-tuning. The proposed unsupervised approach was further compared with the supervised Faster R-CNN based object detector. With this supervised approach and the model trained on plankton species and anomalies, we were able to reach the highest F1 score of 0.86. However, the unsupervised approach is expected to be more universal as it can detect also unknown anomalies and it does not require any annotated anomalous data that may not be always available in sufficient quantities. Although other studies have dealt with plankton anomaly detection in terms of non-plankton particles, or air bubble detection, our paper is according to our best knowledge the first one which focuses on automated anomaly detection considering putative phytoplankton parasites or infections.
Abstract:We propose a method for Saimaa ringed seal (Pusa hispida saimensis) re-identification. Access to large image volumes through camera trapping and crowdsourcing provides novel possibilities for animal monitoring and conservation and calls for automatic methods for analysis, in particular, when re-identifying individual animals from the images. The proposed method NOvel Ringed seal re-identification by Pelage Pattern Aggregation (NORPPA) utilizes the permanent and unique pelage pattern of Saimaa ringed seals and content-based image retrieval techniques. First, the query image is preprocessed, and each seal instance is segmented. Next, the seal's pelage pattern is extracted using a U-net encoder-decoder based method. Then, CNN-based affine invariant features are embedded and aggregated into Fisher Vectors. Finally, the cosine distance between the Fisher Vectors is used to find the best match from a database of known individuals. We perform extensive experiments of various modifications of the method on a new challenging Saimaa ringed seals re-identification dataset. The proposed method is shown to produce the best re-identification accuracy on our dataset in comparisons with alternative approaches.