Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Topic:Face Reenactment

What is Face Reenactment? Face reenactment is the process of transferring facial expressions from one person to another in videos.

InteractAvatar: Modeling Hand-Face Interaction in Photorealistic Avatars with Deformable Gaussians

Apr 10, 2025

Kefan Chen, Sergiu Oprea, Justin Theiss, Sreyas Mohan, Srinath Sridhar, Aayush Prakash

Abstract:With the rising interest from the community in digital avatars coupled with the importance of expressions and gestures in communication, modeling natural avatar behavior remains an important challenge across many industries such as teleconferencing, gaming, and AR/VR. Human hands are the primary tool for interacting with the environment and essential for realistic human behavior modeling, yet existing 3D hand and head avatar models often overlook the crucial aspect of hand-body interactions, such as between hand and face. We present InteracttAvatar, the first model to faithfully capture the photorealistic appearance of dynamic hand and non-rigid hand-face interactions. Our novel Dynamic Gaussian Hand model, combining template model and 3D Gaussian Splatting as well as a dynamic refinement module, captures pose-dependent change, e.g. the fine wrinkles and complex shadows that occur during articulation. Importantly, our hand-face interaction module models the subtle geometry and appearance dynamics that underlie common gestures. Through experiments of novel view synthesis, self reenactment and cross-identity reenactment, we demonstrate that InteracttAvatar can reconstruct hand and hand-face interactions from monocular or multiview videos with high-fidelity details and be animated with novel poses.

Via

Access Paper or Ask Questions

Detecting Localized Deepfake Manipulations Using Action Unit-Guided Video Representations

Mar 28, 2025

Tharun Anand, Siva Sankar, Pravin Nair

Abstract:With rapid advancements in generative modeling, deepfake techniques are increasingly narrowing the gap between real and synthetic videos, raising serious privacy and security concerns. Beyond traditional face swapping and reenactment, an emerging trend in recent state-of-the-art deepfake generation methods involves localized edits such as subtle manipulations of specific facial features like raising eyebrows, altering eye shapes, or modifying mouth expressions. These fine-grained manipulations pose a significant challenge for existing detection models, which struggle to capture such localized variations. To the best of our knowledge, this work presents the first detection approach explicitly designed to generalize to localized edits in deepfake videos by leveraging spatiotemporal representations guided by facial action units. Our method leverages a cross-attention-based fusion of representations learned from pretext tasks like random masking and action unit detection, to create an embedding that effectively encodes subtle, localized changes. Comprehensive evaluations across multiple deepfake generation methods demonstrate that our approach, despite being trained solely on the traditional FF+ dataset, sets a new benchmark in detecting recent deepfake-generated videos with fine-grained local edits, achieving a $20\%$ improvement in accuracy over current state-of-the-art detection methods. Additionally, our method delivers competitive performance on standard datasets, highlighting its robustness and generalization across diverse types of local and global forgeries.

Via

Access Paper or Ask Questions

Two-Stream Spatial-Temporal Transformer Framework for Person Identification via Natural Conversational Keypoints

Feb 28, 2025

Masoumeh Chapariniya, Hossein Ranjbar, Teodora Vukovic, Sarah Ebling, Volker Dellwo

Abstract:In the age of AI-driven generative technologies, traditional biometric recognition systems face unprecedented challenges, particularly from sophisticated deepfake and face reenactment techniques. In this study, we propose a Two-Stream Spatial-Temporal Transformer Framework for person identification using upper body keypoints visible during online conversations, which we term conversational keypoints. Our framework processes both spatial relationships between keypoints and their temporal evolution through two specialized branches: a Spatial Transformer (STR) that learns distinctive structural patterns in keypoint configurations, and a Temporal Transformer (TTR) that captures sequential motion patterns. Using the state-of-the-art Sapiens pose estimator, we extract 133 keypoints (based on COCO-WholeBody format) representing facial features, head pose, and hand positions. The framework was evaluated on a dataset of 114 individuals engaged in natural conversations, achieving recognition accuracies of 80.12% for the spatial stream, 63.61% for the temporal stream. We then explored two fusion strategies: a shared loss function approach achieving 82.22% accuracy, and a feature-level fusion method that concatenates feature maps from both streams, significantly improving performance to 94.86%. By jointly modeling both static anatomical relationships and dynamic movement patterns, our approach learns comprehensive identity signatures that are more robust to spoofing than traditional appearance-based methods.

Via

Access Paper or Ask Questions

GHOST 2.0: generative high-fidelity one shot transfer of heads

Feb 26, 2025

Alexander Groshev, Anastasiia Iashchenko, Pavel Paramonov, Denis Dimitrov, Andrey Kuznetsov

Figure 1 for GHOST 2.0: generative high-fidelity one shot transfer of heads

Figure 2 for GHOST 2.0: generative high-fidelity one shot transfer of heads

Figure 3 for GHOST 2.0: generative high-fidelity one shot transfer of heads

Figure 4 for GHOST 2.0: generative high-fidelity one shot transfer of heads

Abstract:While the task of face swapping has recently gained attention in the research community, a related problem of head swapping remains largely unexplored. In addition to skin color transfer, head swap poses extra challenges, such as the need to preserve structural information of the whole head during synthesis and inpaint gaps between swapped head and background. In this paper, we address these concerns with GHOST 2.0, which consists of two problem-specific modules. First, we introduce enhanced Aligner model for head reenactment, which preserves identity information at multiple scales and is robust to extreme pose variations. Secondly, we use a Blender module that seamlessly integrates the reenacted head into the target background by transferring skin color and inpainting mismatched regions. Both modules outperform the baselines on the corresponding tasks, allowing to achieve state of the art results in head swapping. We also tackle complex cases, such as large difference in hair styles of source and target. Code is available at https://github.com/ai-forever/ghost-2.0

Via

Access Paper or Ask Questions

Generating and Detecting Various Types of Fake Image and Audio Content: A Review of Modern Deep Learning Technologies and Tools

Jan 07, 2025

Arash Dehghani, Hossein Saberi

Abstract:This paper reviews the state-of-the-art in deepfake generation and detection, focusing on modern deep learning technologies and tools based on the latest scientific advancements. The rise of deepfakes, leveraging techniques like Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Diffusion models and other generative models, presents significant threats to privacy, security, and democracy. This fake media can deceive individuals, discredit real people and organizations, facilitate blackmail, and even threaten the integrity of legal, political, and social systems. Therefore, finding appropriate solutions to counter the potential threats posed by this technology is essential. We explore various deepfake methods, including face swapping, voice conversion, reenactment and lip synchronization, highlighting their applications in both benign and malicious contexts. The review critically examines the ongoing "arms race" between deepfake generation and detection, analyzing the challenges in identifying manipulated contents. By examining current methods and highlighting future research directions, this paper contributes to a crucial understanding of this rapidly evolving field and the urgent need for robust detection strategies to counter the misuse of this powerful technology. While focusing primarily on audio, image, and video domains, this study allows the reader to easily grasp the latest advancements in deepfake generation and detection.

Via

Access Paper or Ask Questions

G3FA: Geometry-guided GAN for Face Animation

Aug 23, 2024

Alireza Javanmardi, Alain Pagani, Didier Stricker

Figure 1 for G3FA: Geometry-guided GAN for Face Animation

Figure 2 for G3FA: Geometry-guided GAN for Face Animation

Figure 3 for G3FA: Geometry-guided GAN for Face Animation

Figure 4 for G3FA: Geometry-guided GAN for Face Animation

Abstract:Animating human face images aims to synthesize a desired source identity in a natural-looking way mimicking a driving video's facial movements. In this context, Generative Adversarial Networks have demonstrated remarkable potential in real-time face reenactment using a single source image, yet are constrained by limited geometry consistency compared to graphic-based approaches. In this paper, we introduce Geometry-guided GAN for Face Animation (G3FA) to tackle this limitation. Our novel approach empowers the face animation model to incorporate 3D information using only 2D images, improving the image generation capabilities of the talking head synthesis model. We integrate inverse rendering techniques to extract 3D facial geometry properties, improving the feedback loop to the generator through a weighted average ensemble of discriminators. In our face reenactment model, we leverage 2D motion warping to capture motion dynamics along with orthogonal ray sampling and volume rendering techniques to produce the ultimate visual output. To evaluate the performance of our G3FA, we conducted comprehensive experiments using various evaluation protocols on VoxCeleb2 and TalkingHead benchmarks to demonstrate the effectiveness of our proposed framework compared to the state-of-the-art real-time face animation methods.

* BMVC 2024, Accepted

Via

Access Paper or Ask Questions

TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model

Oct 14, 2024

Jiazhi Guan, Quanwei Yang, Kaisiyuan Wang, Hang Zhou, Shengyi He, Zhiliang Xu, Haocheng Feng, Errui Ding, Jingdong Wang, Hongtao Xie(+2 more)

Abstract:Recently, 2D speaking avatars have increasingly participated in everyday scenarios due to the fast development of facial animation techniques. However, most existing works neglect the explicit control of human bodies. In this paper, we propose to drive not only the faces but also the torso and gesture movements of a speaking figure. Inspired by recent advances in diffusion models, we propose the Motion-Enhanced Textural-Aware ModeLing for SpeaKing Avatar Reenactment (TALK-Act) framework, which enables high-fidelity avatar reenactment from only short footage of monocular video. Our key idea is to enhance the textural awareness with explicit motion guidance in diffusion modeling. Specifically, we carefully construct 2D and 3D structural information as intermediate guidance. While recent diffusion models adopt a side network for control information injection, they fail to synthesize temporally stable results even with person-specific fine-tuning. We propose a Motion-Enhanced Textural Alignment module to enhance the bond between driving and target signals. Moreover, we build a Memory-based Hand-Recovering module to help with the difficulties in hand-shape preserving. After pre-training, our model can achieve high-fidelity 2D avatar reenactment with only 30 seconds of person-specific data. Extensive experiments demonstrate the effectiveness and superiority of our proposed framework. Resources can be found at https://guanjz20.github.io/projects/TALK-Act.

* Accepted to SIGGRAPH Asia 2024 (conference track). Project page: https://guanjz20.github.io/projects/TALK-Act

Via

Access Paper or Ask Questions

Anchored Diffusion for Video Face Reenactment

Jul 21, 2024

Idan Kligvasser, Regev Cohen, George Leifman, Ehud Rivlin, Michael Elad

Figure 1 for Anchored Diffusion for Video Face Reenactment

Figure 2 for Anchored Diffusion for Video Face Reenactment

Figure 3 for Anchored Diffusion for Video Face Reenactment

Figure 4 for Anchored Diffusion for Video Face Reenactment

Abstract:Video generation has drawn significant interest recently, pushing the development of large-scale models capable of producing realistic videos with coherent motion. Due to memory constraints, these models typically generate short video segments that are then combined into long videos. The merging process poses a significant challenge, as it requires ensuring smooth transitions and overall consistency. In this paper, we introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. We extend Diffusion Transformers (DiTs) to incorporate temporal information, creating our sequence-DiT (sDiT) model for generating short video segments. Unlike previous works, we train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance, increasing flexibility and allowing it to capture both short and long-term relationships. Furthermore, during inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame, ensuring consistency regardless of temporal distance. To demonstrate our method, we focus on face reenactment, the task of creating a video from a source image that replicates the facial expressions and movements from a driving video. Through comprehensive experiments, we show our approach outperforms current techniques in producing longer consistent high-quality videos while offering editing capabilities.

Via

Access Paper or Ask Questions

EmojiHeroVR: A Study on Facial Expression Recognition under Partial Occlusion from Head-Mounted Displays

Oct 04, 2024

Thorben Ortmann, Qi Wang, Larissa Putzar

Figure 1 for EmojiHeroVR: A Study on Facial Expression Recognition under Partial Occlusion from Head-Mounted Displays

Figure 2 for EmojiHeroVR: A Study on Facial Expression Recognition under Partial Occlusion from Head-Mounted Displays

Figure 3 for EmojiHeroVR: A Study on Facial Expression Recognition under Partial Occlusion from Head-Mounted Displays

Figure 4 for EmojiHeroVR: A Study on Facial Expression Recognition under Partial Occlusion from Head-Mounted Displays

Abstract:Emotion recognition promotes the evaluation and enhancement of Virtual Reality (VR) experiences by providing emotional feedback and enabling advanced personalization. However, facial expressions are rarely used to recognize users' emotions, as Head-Mounted Displays (HMDs) occlude the upper half of the face. To address this issue, we conducted a study with 37 participants who played our novel affective VR game EmojiHeroVR. The collected database, EmoHeVRDB (EmojiHeroVR Database), includes 3,556 labeled facial images of 1,778 reenacted emotions. For each labeled image, we also provide 29 additional frames recorded directly before and after the labeled image to facilitate dynamic Facial Expression Recognition (FER). Additionally, EmoHeVRDB includes data on the activations of 63 facial expressions captured via the Meta Quest Pro VR headset for each frame. Leveraging our database, we conducted a baseline evaluation on the static FER classification task with six basic emotions and neutral using the EfficientNet-B0 architecture. The best model achieved an accuracy of 69.84% on the test set, indicating that FER under HMD occlusion is feasible but significantly more challenging than conventional FER.

Via

Access Paper or Ask Questions

Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

Aug 01, 2024

Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber

Figure 1 for Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

Figure 2 for Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

Figure 3 for Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

Figure 4 for Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

Abstract:Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion. Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input, while the text/image embedding injected via cross-attention predominantly controls motion. We thus represent motion using text/image embedding tokens. By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity. Once optimized on the motion reference video, this embedding can be applied to various target images to generate videos with semantically similar motions. Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks such as full-body and face reenactment, as well as controlling the motion of inanimate objects and the camera. We empirically demonstrate the effectiveness of our method in the semantic video motion transfer task, significantly outperforming existing methods in this context.

* Preprint. All videos in this paper are best viewed as animations with Acrobat Reader by pressing the highlighted frame of each video

Via

Access Paper or Ask Questions

Topic:Face Reenactment

Papers and Code