Abstract: Extracting a Bird's Eye View (BEV) representation from multiple camera images offers a cost-effective, scalable alternative to LIDAR-based solutions in autonomous driving. However, the performance of existing BEV methods drops significantly under various corruptions such as brightness and weather changes or camera failures. To improve the robustness of BEV perception, we propose to adapt a large vision foundation model, DINOv2, to BEV estimation using Low-Rank Adaptation (LoRA). Our approach builds on the strong representation space of DINOv2 by adapting it to the BEV task within a state-of-the-art framework, SimpleBEV. Our experiments show increased robustness of BEV perception under various corruptions, with gains that grow as the model and the input resolution are scaled up. We also demonstrate the effectiveness of the adapted representations, which require fewer learnable parameters and converge faster during training.
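As a rough illustration of how LoRA can be attached to a frozen DINOv2 backbone, the PyTorch sketch below wraps a linear projection with a trainable low-rank update. The `LoRALinear` class, the rank, and the way it is attached to the attention projections are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update (illustrative)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # keep the foundation model weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # low-rank update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Hypothetical usage, assuming the public DINOv2 torch.hub checkpoint exposes its
# attention projections as blocks[i].attn.qkv:
# backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
# for block in backbone.blocks:
#     block.attn.qkv = LoRALinear(block.attn.qkv, rank=8)
```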
Abstract: Large-scale vision foundation models have demonstrated remarkable success across various tasks, underscoring their robust generalization capabilities. While their proficiency in two-view correspondence has been explored, their effectiveness in long-term correspondence within complex environments remains largely unexplored. To address this, we evaluate the geometric awareness of visual foundation models in the context of point tracking: (i) in zero-shot settings, without any training; (ii) by probing with low-capacity layers; (iii) by fine-tuning with Low-Rank Adaptation (LoRA). Our findings indicate that features from Stable Diffusion and DINOv2 exhibit superior geometric correspondence abilities in zero-shot settings. Furthermore, DINOv2 achieves performance comparable to supervised models in adaptation settings, demonstrating its potential as a strong initialization for correspondence learning.
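A minimal sketch of the zero-shot setting, assuming dense per-frame features have already been extracted from a frozen backbone such as DINOv2: a query point is transferred to another frame by nearest-neighbor matching in feature space. The function name and this simple transfer rule are illustrative of how geometric correspondence can be probed without training, not the paper's exact evaluation protocol.

```python
import torch
import torch.nn.functional as F

def track_point_zero_shot(feat_src, feat_tgt, query_xy):
    """Nearest-neighbor point transfer in a frozen feature space (illustrative).

    feat_src, feat_tgt: (C, H, W) dense features from a frozen backbone.
    query_xy: (x, y) location in the source frame, in feature-grid coordinates.
    """
    C, H, W = feat_src.shape
    x, y = query_xy
    q = feat_src[:, y, x]                                          # (C,) query descriptor
    sim = F.cosine_similarity(q[:, None, None], feat_tgt, dim=0)  # (H, W) similarity map
    idx = sim.flatten().argmax()
    return int(idx % W), int(idx // W)                             # predicted (x, y) in the target frame
```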
Abstract: Unsupervised multi-object segmentation has shown impressive results on images by utilizing powerful semantics learned from self-supervised pretraining. An additional modality such as depth or motion is often used to facilitate segmentation in video sequences. However, the performance improvements observed on synthetic sequences, which rely on the robustness of an additional cue, do not translate to more challenging real-world scenarios. In this paper, we propose the first fully unsupervised method for segmenting multiple objects in real-world sequences. Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames. From these temporally-aware slots, the training objective is to reconstruct the middle frame in a high-level semantic feature space. For efficiency and regularization, we propose a masking strategy that drops a significant portion of tokens in the feature space. Additionally, we address over-clustering by merging slots based on their similarity. Our method can successfully segment multiple instances of complex and diverse classes in YouTube videos.
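A minimal sketch of dropping feature tokens before reconstruction, assuming (B, N, D) patch tokens from a frozen feature extractor; the function name and the keep ratio are illustrative, and the paper's actual masking schedule may differ.

```python
import torch

def random_token_mask(tokens, keep_ratio=0.25):
    """Keep a random subset of feature tokens (illustrative masking strategy).

    tokens: (B, N, D) patch features of the middle frame.
    Returns the kept tokens and their indices, so the reconstruction loss
    can be computed only on the retained positions.
    """
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    scores = torch.rand(B, N, device=tokens.device)
    keep_idx = scores.topk(n_keep, dim=1).indices                          # (B, n_keep)
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep_idx
```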
Abstract: Forecasting future trajectories of agents in complex traffic scenes requires reliable and efficient predictions for all agents in the scene. However, existing methods for trajectory prediction are either inefficient or sacrifice accuracy. To address this challenge, we propose ADAPT, a novel approach for jointly predicting the trajectories of all agents in the scene with dynamic weight learning. Our approach outperforms state-of-the-art methods in both single-agent and multi-agent settings on the Argoverse and Interaction datasets, with a fraction of their computational overhead. We attribute this improvement, first, to the adaptive head, which augments model capacity without increasing model size, and second, to our design choices in endpoint-conditioned prediction, reinforced by gradient stopping. Our analyses show that ADAPT can focus on each agent with adaptive prediction, enabling accurate predictions efficiently. https://KUIS-AI.github.io/adapt
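A minimal PyTorch sketch of endpoint-conditioned prediction with gradient stopping, assuming a per-agent feature vector has already been computed; the module name, dimensions, and single-linear heads are illustrative stand-ins rather than ADAPT's actual heads.

```python
import torch
import torch.nn as nn

class EndpointConditionedHead(nn.Module):
    """Predict an endpoint, then a full trajectory conditioned on it (illustrative)."""
    def __init__(self, d_model: int = 128, horizon: int = 30):
        super().__init__()
        self.endpoint_head = nn.Linear(d_model, 2)                # final (x, y) position
        self.traj_head = nn.Linear(d_model + 2, horizon * 2)      # full future trajectory

    def forward(self, agent_feat):
        endpoint = self.endpoint_head(agent_feat)                 # (B, 2)
        # Gradient stopping: the trajectory loss does not backpropagate into the
        # endpoint head, so the two objectives do not interfere with each other.
        cond = torch.cat([agent_feat, endpoint.detach()], dim=-1)
        traj = self.traj_head(cond).view(agent_feat.size(0), -1, 2)
        return endpoint, traj
```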
Abstract: Predicting the future locations of agents in the scene is an important problem in self-driving. In recent years, there has been significant progress in representing the scene and the agents in it. The interactions of agents with the scene and with each other are typically modeled with a Graph Neural Network. However, the graph structure is mostly static and fails to represent the temporal changes in highly dynamic scenes. In this work, we propose a temporal graph representation to better capture the dynamics of traffic scenes. We complement our representation with two types of memory modules: one focusing on the agent of interest and the other on the entire scene. This allows us to learn temporally-aware representations that achieve good results even with simple regression of multiple futures. When combined with goal-conditioned prediction, we show further improvements, reaching state-of-the-art performance on the Argoverse benchmark.
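As a rough illustration of a memory module over a temporal graph, the sketch below accumulates per-timestep agent features with a GRU cell; the class name, feature shapes, and the choice of a GRU are assumptions for illustration, not the paper's exact memory design.

```python
import torch
import torch.nn as nn

class AgentMemory(nn.Module):
    """GRU-based memory over per-timestep agent features (illustrative)."""
    def __init__(self, d_model: int = 128):
        super().__init__()
        self.gru = nn.GRUCell(d_model, d_model)

    def forward(self, feats_over_time):
        # feats_over_time: (T, B, D) agent features produced by the temporal graph at each step.
        T, B, D = feats_over_time.shape
        h = feats_over_time.new_zeros(B, D)
        for t in range(T):
            h = self.gru(feats_over_time[t], h)   # accumulate a temporally-aware state
        return h                                   # summary feature used for trajectory regression
```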