Qiaole Dong

Online Dense Point Tracking with Streaming Memory

Mar 09, 2025

Qiaole Dong, Yanwei Fu

Abstract:Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video, even in the presence of occlusions. Traditional methods use optical flow models to directly estimate long-range motion, but they often suffer from appearance drifting because they do not consider temporal consistency. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one, which is slow and less effective for long-range tracking. To account for temporal consistency and enable efficient information propagation, we present a lightweight and fast model with Streaming memory for dense POint Tracking and online video processing. The SPOT framework features three core components: a customized memory reading module for feature enhancement, a sensory memory for short-term motion dynamics modeling, and a visibility-guided splatting module for accurate information propagation. This combination enables SPOT to perform dense point tracking with state-of-the-art accuracy on the CVO benchmark, as well as comparable or superior performance to offline models on sparse tracking benchmarks such as TAP-Vid and RoboTAP. Notably, SPOT, with 10× fewer parameters, operates at least 2× faster than previous state-of-the-art models while maintaining the best performance on CVO. We will release the models and code at: https://github.com/DQiaole/SPOT.

MemFlow: Optical Flow Estimation and Prediction with Memory

Apr 07, 2024

Qiaole Dong, Yanwei Fu

Abstract:Optical flow is a classical task that is important to the vision community. Classical optical flow estimation uses two frames as input, whilst some recent methods consider multiple frames to explicitly model long-range information. The former cannot fully leverage temporal coherence along the video sequence, and the latter incur heavy computational overhead that typically rules out real-time flow estimation. Some multi-frame-based approaches even require unseen future frames for the current estimation, compromising real-time applicability in safety-critical scenarios. To this end, we present MemFlow, a real-time method for optical flow estimation and prediction with memory. Our method uses memory read-out and update modules to aggregate historical motion information in real time. Furthermore, we integrate resolution-adaptive re-scaling to accommodate diverse video resolutions. Our approach also extends seamlessly to the future prediction of optical flow based on past observations. Leveraging effective historical motion aggregation, our method outperforms VideoFlow with fewer parameters and faster inference speed on the Sintel and KITTI-15 datasets in terms of generalization performance. At the time of submission, MemFlow also leads in performance on the 1080p Spring dataset. Code and models will be available at: https://dqiaole.github.io/MemFlow/.

* CVPR 2024
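
The abstract above mentions memory read-out and update modules that aggregate historical information. As a rough illustration of that general idea only, the sketch below keeps a fixed-capacity buffer of key/value features and reads it with scaled dot-product attention. It is not MemFlow's implementation: the class name `StreamingMemory`, the capacity-based eviction, and the attention form are all assumptions made for this example.

```python
import math
from collections import deque
from typing import Deque, List, Tuple

Vector = List[float]


class StreamingMemory:
    """Hypothetical fixed-size memory of (key, value) feature pairs."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) silently drops the oldest entry once full.
        self.entries: Deque[Tuple[Vector, Vector]] = deque(maxlen=capacity)

    def update(self, key: Vector, value: Vector) -> None:
        """Store the features of the newest frame."""
        self.entries.append((list(key), list(value)))

    def read(self, query: Vector) -> Vector:
        """Return the attention-weighted average of the stored values."""
        if not self.entries:
            raise ValueError("memory is empty")
        scale = math.sqrt(len(query))
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key, _ in self.entries
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        dim = len(self.entries[0][1])
        return [
            sum(w * value[i] for w, (_, value) in zip(weights, self.entries)) / total
            for i in range(dim)
        ]
```

In an online setting, `read` would be called with the current frame's query before `update` stores that frame, so each estimate depends only on past frames.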

Repositioning the Subject within Image

Jan 30, 2024

Yikai Wang, Chenjie Cao, Qiaole Dong, Yifan Li, Yanwei Fu

Abstract:Current image manipulation primarily centers on static manipulation, such as replacing specific regions within an image or altering its overall style. In this paper, we introduce an innovative dynamic manipulation task, subject repositioning. This task involves relocating a user-specified subject to a desired position while preserving the image's fidelity. Our research reveals that the fundamental sub-tasks of subject repositioning, which include filling the void left by the repositioned subject, reconstructing obscured portions of the subject and blending the subject to be consistent with surrounding areas, can be effectively reformulated as a unified, prompt-guided inpainting task. Consequently, we can employ a single diffusion generative model to address these sub-tasks using various task prompts learned through our proposed task inversion technique. Additionally, we integrate pre-processing and post-processing techniques to further enhance the quality of subject repositioning. These elements together form our SEgment-gEnerate-and-bLEnd (SEELE) framework. To assess SEELE's effectiveness in subject repositioning, we assemble a real-world subject repositioning dataset called ReS. Our results on ReS demonstrate the quality of repositioned image generation.

* Project page: https://yikai-wang.github.io/seele/. Dataset: https://github.com/Yikai-Wang/ReS. Arxiv version uses small size images for fast preview. Full size PDF is available at project page

Open-DDVM: A Reproduction and Extension of Diffusion Model for Optical Flow Estimation

Dec 04, 2023

Qiaole Dong, Bo Zhao, Yanwei Fu

Abstract:Google recently proposed DDVM, which demonstrated for the first time that a general diffusion model for image-to-image translation works impressively well on optical flow estimation without any task-specific designs such as those in RAFT. However, DDVM is still a closed-source model that relies on expensive, private Palette-style pretraining. In this technical report, we present the first open-source DDVM by reproducing it. We study several design choices and identify the important ones. By training on 40k public data with 4 GPUs, our reproduction achieves comparable performance to the closed-source DDVM. The code and model have been released at https://github.com/DQiaole/FlowDiffusion_pytorch.

* Technical Report

A Unified Prompt-Guided In-Context Inpainting Framework for Reference-based Image Manipulations

May 19, 2023

Chenjie Cao, Qiaole Dong, Yikai Wang, Yunuo Cai, Yanwei Fu

Abstract:Recent advancements in Text-to-Image (T2I) generative models have yielded impressive results in generating high-fidelity images based on consistent text prompts. However, there is a growing interest in exploring the potential of these models for more diverse reference-based image manipulation tasks that require spatial understanding and visual context. Previous approaches have achieved this by incorporating additional control modules or fine-tuning the generative models specifically for each task until convergence. In this paper, we propose a different perspective. We conjecture that current large-scale T2I generative models already possess the capability to perform these tasks, but that this capability is not fully activated within the standard generation process. To unlock these capabilities, we introduce a unified Prompt-Guided In-Context inpainting (PGIC) framework, which leverages large-scale T2I models to re-formulate and solve reference-guided image manipulations. In the PGIC framework, the reference and masked target are stitched together as a new input for the generative models, so that filling the masked regions produces the final result. Furthermore, we demonstrate that the self-attention modules in T2I models are well-suited for establishing spatial correlations and efficiently addressing challenging reference-guided manipulations. These large T2I models can be effectively driven by task-specific prompts with minimal training cost or even with frozen backbones. We synthetically evaluate the effectiveness of the proposed PGIC framework across various tasks, including reference-guided image inpainting, faithful inpainting, outpainting, local super-resolution, and novel view synthesis. Our results show that PGIC achieves significantly better performance while requiring less computation compared to other fine-tuning based approaches.
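
The stitching step described in the abstract can be pictured with a small sketch. The abstract does not specify the layout, so this example assumes a side-by-side arrangement of grayscale images stored as nested lists; the function name `stitch_reference_and_target` and the `fill` parameter are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel values


def stitch_reference_and_target(
    reference: Image, target: Image, mask: Image, fill: float = 0.0
) -> Tuple[Image, Image]:
    """Place the reference on the left and the masked target on the right.

    Returns (stitched_image, stitched_mask). In the stitched mask, 1 marks
    pixels an inpainting model would be asked to fill; the reference half
    is never masked.
    """
    if not (len(reference) == len(target) == len(mask)):
        raise ValueError("reference, target, and mask must share the same height")
    stitched_image: Image = []
    stitched_mask: Image = []
    for ref_row, tgt_row, mask_row in zip(reference, target, mask):
        if len(tgt_row) != len(mask_row):
            raise ValueError("target and mask rows must share the same width")
        # Blank out the masked target pixels before stitching.
        masked_row = [fill if m else px for px, m in zip(tgt_row, mask_row)]
        stitched_image.append(list(ref_row) + masked_row)
        stitched_mask.append([0] * len(ref_row) + [1 if m else 0 for m in mask_row])
    return stitched_image, stitched_mask
```

The stitched pair would then be passed to an inpainting model, and the right half of its output cropped as the result.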

Rethinking Optical Flow from Geometric Matching Consistent Perspective

Mar 15, 2023

Qiaole Dong, Chenjie Cao, Yanwei Fu

Abstract:Optical flow estimation remains a challenging, unsolved problem. Recent deep learning based optical flow models have achieved considerable success. However, these models often train networks from scratch on standard optical flow data, which restricts their ability to robustly and geometrically match image features. In this paper, we rethink previous optical flow estimation. In particular, we leverage Geometric Image Matching (GIM) as a pre-training task for optical flow estimation (MatchFlow) to obtain better feature representations, as GIM shares some common challenges with optical flow estimation and comes with massive labeled real-world data. Thus, matching static scenes helps to learn more fundamental feature correlations of objects and scenes with consistent displacements. Specifically, the proposed MatchFlow model employs a QuadTree attention-based network pre-trained on MegaDepth to extract coarse features for further flow regression. Extensive experiments show that our model has great cross-dataset generalization. Our method achieves 11.5% and 10.1% error reduction from GMA on Sintel clean pass and KITTI test set. At the time of anonymous submission, our MatchFlow(G) enjoys state-of-the-art performance on Sintel clean and final pass compared to published approaches with comparable computation and memory footprint. Code and models will be released at https://github.com/DQiaole/MatchFlow.

* Accepted by CVPR 2023

ZITS++: Image Inpainting by Improving the Incremental Transformer on Structural Priors

Oct 12, 2022

Chenjie Cao, Qiaole Dong, Yanwei Fu

Abstract:The image inpainting task fills missing areas of a corrupted image. Although impressive results have been achieved recently, it is still challenging to restore corrupted images with both vivid textures and reasonable structures. Some previous methods only tackle regular textures while losing holistic structures, limited by the receptive fields of Convolutional Neural Networks (CNNs). To this end, we study learning a Zero-initialized residual addition based Incremental Transformer on Structural priors (ZITS++), an improved model over our conference ZITS model. Specifically, given one corrupted image, we present the Transformer Structure Restorer (TSR) module to restore holistic structural priors at low image resolution, which are further upsampled by the Simple Structure Upsampler (SSU) module to higher image resolution. Further, to recover image texture details well, we adopt the Fourier CNN Texture Restoration (FTR) module, which has both Fourier and large-kernel attention convolutions. Typically, FTR can be independently pre-trained without image structural priors. Furthermore, to enhance the FTR, the upsampled structural priors from TSR are further processed by the Structure Feature Encoder (SFE), and the FTR is updated through a novel incremental training strategy of Zero-initialized Residual Addition (ZeroRA). In addition, a new masking positional encoding is proposed to encode large irregular masks. Extensive experiments on various datasets validate the efficacy of our model compared with other competitors. We also conduct extensive ablations to compare and verify various priors for image inpainting tasks.

* arXiv admin note: substantial text overlap with arXiv:2203.00867

Learning Prior Feature and Attention Enhanced Image Inpainting

Aug 03, 2022

Chenjie Cao, Qiaole Dong, Yanwei Fu

Abstract:Many recent inpainting works have achieved impressive results by leveraging Deep Neural Networks (DNNs) to model various prior information for image restoration. Unfortunately, the performance of these methods is largely limited by the representation ability of vanilla Convolutional Neural Network (CNN) backbones. On the other hand, Vision Transformers (ViT) with self-supervised pre-training have shown great potential for many visual recognition and object detection tasks. A natural question is whether the inpainting task can benefit greatly from the ViT backbone. However, it is nontrivial to directly swap the new backbones into inpainting networks, as inpainting is an inverse problem fundamentally different from the recognition tasks. To this end, this paper incorporates the pre-training based Masked AutoEncoder (MAE) into the inpainting model, which enjoys richer informative priors to enhance the inpainting process. Moreover, we propose to use attention priors from MAE to make the inpainting model learn more long-distance dependencies between masked and unmasked regions. Sufficient ablations on the inpainting and the self-supervised pre-training models are discussed in this paper. Besides, experiments on both Places2 and FFHQ demonstrate the effectiveness of our proposed model. Code and pre-trained models are released at https://github.com/ewrfcas/MAE-FAR.

* ECCV 2022

Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding

Mar 16, 2022

Qiaole Dong, Chenjie Cao, Yanwei Fu

Abstract:Image inpainting has made significant advances in recent years. However, it is still challenging to recover corrupted images with both vivid textures and reasonable structures. Some specific methods only tackle regular textures while losing holistic structures due to the limited receptive fields of convolutional neural networks (CNNs). On the other hand, attention-based models can learn better long-range dependency for the structure recovery, but they are limited by the heavy computation for inference with large image sizes. To address these issues, we propose to leverage an additional structure restorer to facilitate the image inpainting incrementally. The proposed model restores holistic image structures with a powerful attention-based transformer model in a fixed low-resolution sketch space. Such a grayscale space is easy to upsample to larger scales to convey correct structural information. Our structure restorer can be integrated with other pretrained inpainting models efficiently with the zero-initialized residual addition. Furthermore, a masking positional encoding strategy is utilized to improve the performance with large irregular masks. Extensive experiments on various datasets validate the efficacy of our model compared with other competitors. Our code is released at https://github.com/DQiaole/ZITS_inpainting.

* Accepted by CVPR 2022
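
The abstract above says the structure restorer is integrated with pretrained inpainting models through a zero-initialized residual addition. A minimal sketch of that general technique follows, assuming the added branch is scaled by a single scalar that starts at zero. The class name `ZeroInitResidualAdd` and the flat-list features are simplifications for illustration and are not taken from the released code.

```python
from typing import List


class ZeroInitResidualAdd:
    """Adds a new feature branch scaled by a gate that starts at zero."""

    def __init__(self) -> None:
        # In a real model this would be a learnable parameter; starting at
        # 0.0 means the pretrained features pass through unchanged at first.
        self.alpha = 0.0

    def __call__(
        self, backbone_features: List[float], structure_features: List[float]
    ) -> List[float]:
        if len(backbone_features) != len(structure_features):
            raise ValueError("feature vectors must have the same length")
        return [
            b + self.alpha * s
            for b, s in zip(backbone_features, structure_features)
        ]
```

Because the gate is zero at initialization, the pretrained model's output is preserved at the start of incremental training, and the structural features are blended in only as `alpha` moves away from zero.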
