Ozgur Kara

ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models

May 12, 2025

Ozgur Kara, Krishna Kumar Singh, Feng Liu, Duygu Ceylan, James M. Rehg, Tobias Hinz

Abstract: Current diffusion-based text-to-video methods are limited to producing short, single-shot video clips and cannot generate multi-shot videos with discrete transitions in which the same character performs distinct activities across the same or different backgrounds. To address this limitation, we propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. Our approach generates a multi-shot video as a single video with full attention across all frames of all shots, ensuring character and background consistency, and lets users control the number, duration, and content of shots through shot-specific conditioning. This is achieved by incorporating a transition token into the text-to-video model to control the frames at which a new shot begins, together with a local attention masking strategy that controls the transition token's effect and allows shot-specific prompting. To obtain training data, we propose a novel data collection pipeline that constructs a multi-shot video dataset from existing single-shot video datasets. Extensive experiments demonstrate that fine-tuning a pre-trained text-to-video model for a few thousand iterations is enough for it to generate multi-shot videos with shot-specific control, outperforming the baselines. More details are available at https://shotadapter.github.io/

* CVPR 2025
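
The ShotAdapter abstract combines full attention across frames with shot-local conditioning. The sketch below is one possible reading of that idea, not the authors' implementation: it assumes a hypothetical token layout (all text tokens first, grouped by shot, then all frame tokens, grouped by shot), lets every frame attend to every other frame, and restricts text tokens to their own shot. The function name `build_shot_mask` and its parameters are illustrative, and the transition token itself is not modeled.

```python
from typing import List


def build_shot_mask(frames_per_shot: List[int],
                    text_tokens_per_shot: List[int]) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed (hypothetical) layout: text tokens for shot 0, shot 1, ...,
    followed by frame tokens for shot 0, shot 1, ...
    """
    if len(frames_per_shot) != len(text_tokens_per_shot):
        raise ValueError("one text-token count is needed per shot")

    # Record, for every token, which shot it belongs to and whether it is text.
    text_shots = [s for s, n in enumerate(text_tokens_per_shot) for _ in range(n)]
    frame_shots = [s for s, n in enumerate(frames_per_shot) for _ in range(n)]
    shots = text_shots + frame_shots
    is_frame = [False] * len(text_shots) + [True] * len(frame_shots)

    total = len(shots)
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if is_frame[q] and is_frame[k]:
                # Full attention across all frames of all shots.
                mask[q][k] = True
            else:
                # Any pair involving text stays local to its own shot.
                mask[q][k] = shots[q] == shots[k]
    return mask


# Two shots: 2 and 3 frames, one text token each (7 tokens in total).
mask = build_shot_mask(frames_per_shot=[2, 3], text_tokens_per_shot=[1, 1])
```

In this toy layout, index 2 is a frame of the first shot: it can see frames of the second shot and its own prompt token, but not the second shot's prompt token.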

Optimization-Free Image Immunization Against Diffusion-Based Editing

Nov 27, 2024

Tarik Can Ozden, Ozgur Kara, Oguzhan Akcin, Kerem Zaman, Shashank Srivastava, Sandeep P. Chinchali, James M. Rehg

Abstract: Current image immunization defense techniques against diffusion-based editing embed imperceptible noise in target images to disrupt editing models. However, these methods face scalability challenges, as they require time-consuming re-optimization for each image, taking hours for small batches. To address these challenges, we introduce DiffVax, a scalable, lightweight, and optimization-free framework for image immunization, specifically designed to prevent diffusion-based editing. Our approach enables effective generalization to unseen content, reducing computational costs and cutting immunization time from days to milliseconds, achieving a 250,000x speedup. This is achieved through a loss term that ensures the failure of editing attempts and the imperceptibility of the perturbations. Extensive qualitative and quantitative results demonstrate that our model is scalable, optimization-free, adaptable to various diffusion-based editing tools, robust against counter-attacks, and, for the first time, effectively protects video content from editing. Our code is provided on our project webpage.

* Project webpage: https://diffvax.github.io/

Leveraging Object Priors for Point Tracking

Sep 09, 2024

Bikram Boote, Anh Thai, Wenqi Jia, Ozgur Kara, Stefan Stojanov, James M. Rehg, Sangmin Lee

Abstract: Point tracking is a fundamental problem in computer vision with numerous applications in AR and robotics. A common failure mode in long-term point tracking occurs when the predicted point leaves the object it belongs to and lands on the background or another object. We identify this as the failure to correctly capture objectness properties in learning to track. To address this limitation of prior work, we propose a novel objectness regularization approach that guides points to be aware of object priors by forcing them to stay inside the boundaries of object instances. By capturing objectness cues at training time, we avoid the need to compute object masks during testing. In addition, we leverage contextual attention to enhance the feature representation for capturing objectness at the feature level more effectively. As a result, our approach achieves state-of-the-art performance on three point tracking benchmarks, and we further validate the effectiveness of our components via ablation studies. The source code is available at: https://github.com/RehgLab/tracking_objectness

* ECCV 2024 ILR Workshop

Transfer Learning for Cross-dataset Isolated Sign Language Recognition in Under-Resourced Datasets

Mar 21, 2024

Ahmet Alp Kindiroglu, Ozgur Kara, Ogulcan Ozdemir, Lale Akarun

Abstract: Sign language recognition (SLR) has recently achieved a breakthrough in performance thanks to deep neural networks trained on large annotated sign datasets. Of the many different sign languages, these annotated datasets are only available for a select few. Since acquiring gloss-level labels on sign language videos is difficult, learning by transferring knowledge from existing annotated sources is useful for recognition in under-resourced sign languages. This study provides a publicly available cross-dataset transfer learning benchmark from two existing public Turkish SLR datasets. We use a temporal graph convolution-based sign language recognition approach to evaluate five supervised transfer learning approaches and experiment with closed-set and partial-set cross-dataset transfer learning. Experiments demonstrate that specialized supervised transfer learning methods can improve over fine-tuning-based transfer learning.

* Accepted to the 18th IEEE International Conference on Automatic Face and Gesture Recognition, 2024. Code available at https://github.com/alpk/tid-supervised-transfer-learning-dataset

RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

Dec 07, 2023

Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M. Rehg, Pinar Yanardag

Abstract: Recent advancements in diffusion-based models have demonstrated significant success in generating images from text. However, video editing models have not yet reached the same level of visual quality and user control. To address this, we introduce RAVE, a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training. RAVE takes an input video and a text prompt to produce high-quality videos while preserving the original motion and semantic structure. It employs a novel noise shuffling strategy, leveraging spatio-temporal interactions between frames, to produce temporally consistent videos faster than existing methods. It is also efficient in terms of memory requirements, allowing it to handle longer videos. RAVE is capable of a wide range of edits, from local attribute modifications to shape transformations. To demonstrate the versatility of RAVE, we create a comprehensive video evaluation dataset ranging from object-focused scenes to complex human activities like dancing and typing, and dynamic scenes featuring swimming fish and boats. Our qualitative and quantitative experiments highlight the effectiveness of RAVE in diverse video editing scenarios compared to existing methods. Our code, dataset, and videos can be found at https://rave-video.github.io.

* Project webpage: https://rave-video.github.io, GitHub: http://github.com/rehg-lab/RAVE

ISNAS-DIP: Image-Specific Neural Architecture Search for Deep Image Prior

Nov 27, 2021

Metin Ersin Arican, Ozgur Kara, Gustav Bredell, Ender Konukoglu

Abstract: Recent works show that convolutional neural network (CNN) architectures have a spectral bias towards lower frequencies, which has been leveraged for various image restoration tasks in the Deep Image Prior (DIP) framework. The benefit of the inductive bias the network imposes in the DIP framework depends on the architecture. Therefore, researchers have studied how to automate the search to determine the best-performing model. However, common neural architecture search (NAS) techniques are resource- and time-intensive. Moreover, best-performing models are determined for a whole dataset of images instead of for each image independently, which would be prohibitively expensive. In this work, we first show that optimal neural architectures in the DIP framework are image-dependent. Leveraging this insight, we then propose an image-specific NAS strategy for the DIP framework that requires substantially less training than typical NAS approaches, effectively enabling image-specific NAS. For a given image, noise is fed to a large set of untrained CNNs, and their outputs' power spectral densities (PSD) are compared to that of the corrupted image using various metrics. Based on this, a small cohort of image-specific architectures is chosen and trained to reconstruct the corrupted image. Among this cohort, the model whose reconstruction is closest to the average of the reconstructed images is chosen as the final model. We justify the proposed strategy's effectiveness by (1) demonstrating its performance on a NAS Dataset for DIP that includes 500+ models from a particular search space and (2) conducting extensive experiments on image denoising, inpainting, and super-resolution tasks. Our experiments show that image-specific metrics can reduce the search space to a small cohort of models, of which the best model outperforms current NAS approaches for image restoration.
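
The final selection step described in the ISNAS-DIP abstract (pick the cohort member whose reconstruction is closest to the average reconstruction) can be illustrated with a minimal sketch. It assumes images are flattened lists of floats and that "closest" means smallest squared Euclidean distance; the abstract does not state the distance, and the function names here are hypothetical rather than taken from the authors' code.

```python
from typing import List, Sequence

Image = Sequence[float]  # a flattened image; an assumption made for this sketch


def mean_image(images: Sequence[Image]) -> List[float]:
    """Pixel-wise average of equally sized flattened images."""
    count = len(images)
    return [sum(pixels) / count for pixels in zip(*images)]


def squared_distance(a: Image, b: Image) -> float:
    """Squared Euclidean distance (assumed metric, not stated in the abstract)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_closest_to_average(reconstructions: Sequence[Image]) -> int:
    """Return the index of the reconstruction nearest to the cohort average."""
    if not reconstructions:
        raise ValueError("the cohort must contain at least one reconstruction")
    average = mean_image(reconstructions)
    distances = [squared_distance(r, average) for r in reconstructions]
    return min(range(len(distances)), key=distances.__getitem__)


# Toy cohort of three 4-pixel reconstructions produced by three trained models.
cohort = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.4, 0.5, 0.6, 0.5],
]
chosen = select_closest_to_average(cohort)
```

Here the third reconstruction sits near the cohort mean while the other two are outliers in opposite directions, so index 2 is selected.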

Towards Fair Affective Robotics: Continual Learning for Mitigating Bias in Facial Expression and Action Unit Recognition

Mar 15, 2021

Ozgur Kara, Nikhil Churamani, Hatice Gunes

Abstract: As affective robots become integral in human life, these agents must be able to fairly evaluate human affective expressions without discriminating against specific demographic groups. With bias in Machine Learning (ML) systems identified as a critical problem, different approaches have been proposed to mitigate such biases in the models at both the data and algorithmic levels. In this work, we propose Continual Learning (CL) as an effective strategy to enhance fairness in Facial Expression Recognition (FER) systems, guarding against biases arising from imbalances in data distributions. We compare different state-of-the-art bias mitigation approaches with CL-based strategies for fairness on expression recognition and Action Unit (AU) detection tasks using a popular benchmark for each: RAF-DB and BP4D, respectively. Our experiments show that CL-based methods, on average, outperform popular bias mitigation techniques, strengthening the need for further investigation into CL for the development of fairer FER algorithms.

* Accepted at the Workshop on Lifelong Learning and Personalization in Long-Term Human-Robot Interaction (LEAP-HRI) at the 16th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2021. arXiv admin note: substantial text overlap with arXiv:2103.08637

Domain-Incremental Continual Learning for Mitigating Bias in Facial Expression and Action Unit Recognition

Mar 15, 2021

Nikhil Churamani, Ozgur Kara, Hatice Gunes

Abstract: As Facial Expression Recognition (FER) systems become integrated into our daily lives, these systems need to prioritise making fair decisions instead of aiming at higher individual accuracy scores. In applications ranging from surveillance to diagnosing mental and emotional health conditions of individuals, these systems need to balance the accuracy vs fairness trade-off to make decisions that do not unjustly discriminate against specific under-represented demographic groups. With bias identified as a critical problem in facial analysis systems, different methods have been proposed that aim to mitigate bias at both the data and algorithmic levels. In this work, we propose the novel usage of Continual Learning (CL), in particular Domain-Incremental Learning (Domain-IL) settings, as a potent bias mitigation method to enhance the fairness of FER systems while guarding against biases arising from skewed data distributions. We compare different non-CL-based and CL-based methods for their classification accuracy and fairness scores on expression recognition and Action Unit (AU) detection tasks using two popular benchmarks, the RAF-DB and BP4D datasets, respectively. Our experimental results show that CL-based methods, on average, outperform other popular bias mitigation techniques on both accuracy and fairness metrics.
