Abstract:Video object segmentation is a fundamental research problem in computer vision. Recent techniques have often applied attention mechanism to object representation learning from video sequences. However, due to temporal changes in the video data, attention maps may not well align with the objects of interest across video frames, causing accumulated errors in long-term video processing. In addition, existing techniques have utilised complex architectures, requiring highly computational complexity and hence limiting the ability to integrate video object segmentation into low-powered devices. To address these issues, we propose a new method for self-supervised video object segmentation based on distillation learning of deformable attention. Specifically, we devise a lightweight architecture for video object segmentation that is effectively adapted to temporal changes. This is enabled by deformable attention mechanism, where the keys and values capturing the memory of a video sequence in the attention module have flexible locations updated across frames. The learnt object representations are thus adaptive to both the spatial and temporal dimensions. We train the proposed architecture in a self-supervised fashion through a new knowledge distillation paradigm where deformable attention maps are integrated into the distillation loss. We qualitatively and quantitatively evaluate our method and compare it with existing methods on benchmark datasets including DAVIS 2016/2017 and YouTube-VOS 2018/2019. Experimental results verify the superiority of our method via its achieved state-of-the-art performance and optimal memory usage.
Abstract:Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions. This indicates that there exists a strong correlation between the visual and textual domains. In addition, text-image discriminative models such as CLIP excel in image labelling from text prompts, thanks to the rich and diverse information available from open concepts. In this paper, we leverage these technical advances to solve a challenging problem in computer vision: camouflaged instance segmentation. Specifically, we propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations. Such cross-domain representations are desirable in segmenting camouflaged objects where visual cues are subtle to distinguish the objects from the background, especially in segmenting novel objects which are not seen in training. We also develop technically supportive components to effectively fuse cross-domain features and engage relevant features towards respective foreground objects. We validate our method and compare it with existing ones on several benchmark datasets of camouflaged instance segmentation and generic open-vocabulary instance segmentation. Experimental results confirm the advances of our method over existing ones. We will publish our code and pre-trained models to support future research.
Abstract:Artificial intelligence has the potential to make valuable contributions to wildlife management through cost-effective methods for the collection and interpretation of wildlife data. Recent advances in remotely piloted aircraft systems (RPAS or ``drones'') and thermal imaging technology have created new approaches to collect wildlife data. These emerging technologies could provide promising alternatives to standard labourious field techniques as well as cover much larger areas. In this study, we conduct a comprehensive review and empirical study of drone-based wildlife detection. Specifically, we collect a realistic dataset of drone-derived wildlife thermal detections. Wildlife detections, including arboreal (for instance, koalas, phascolarctos cinereus) and ground dwelling species in our collected data are annotated via bounding boxes by experts. We then benchmark state-of-the-art object detection algorithms on our collected dataset. We use these experimental results to identify issues and discuss future directions in automatic animal monitoring using drones.
Abstract:Medical image analysis using computer-based algorithms has attracted considerable attention from the research community and achieved tremendous progress in the last decade. With recent advances in computing resources and availability of large-scale medical image datasets, many deep learning models have been developed for disease diagnosis from medical images. However, existing techniques focus on sub-tasks, e.g., disease classification and identification, individually, while there is a lack of a unified framework enabling multi-task diagnosis. Inspired by the capability of Vision Transformers in both local and global representation learning, we propose in this paper a new method, namely Multi-task Vision Transformer (MVC) for simultaneously classifying chest X-ray images and identifying affected regions from the input data. Our method is built upon the Vision Transformer but extends its learning capability in a multi-task setting. We evaluated our proposed method and compared it with existing baselines on a benchmark dataset of COVID-19 chest X-ray images. Experimental results verified the superiority of the proposed method over the baselines on both the image classification and affected region identification tasks.
Abstract:Neural radiance field is an emerging rendering method that generates high-quality multi-view consistent images from a neural scene representation and volume rendering. Although neural radiance field-based techniques are robust for scene reconstruction, their ability to add or remove objects remains limited. This paper proposes a new language-driven approach for object manipulation with neural radiance fields through dataset updates. Specifically, to insert a new foreground object represented by a set of multi-view images into a background radiance field, we use a text-to-image diffusion model to learn and generate combined images that fuse the object of interest into the given background across views. These combined images are then used for refining the background radiance field so that we can render view-consistent images containing both the object and the background. To ensure view consistency, we propose a dataset updates strategy that prioritizes radiance field training with camera views close to the already-trained views prior to propagating the training to remaining views. We show that under the same dataset updates strategy, we can easily adapt our method for object insertion using data from text-to-3D models as well as object removal. Experimental results show that our method generates photorealistic images of the edited scenes, and outperforms state-of-the-art methods in 3D reconstruction and neural radiance field blending.
Abstract:In this paper, we address the problem of conditional scene decoration for 360-degree images. Our method takes a 360-degree background photograph of an indoor scene and generates decorated images of the same scene in the panorama view. To do this, we develop a 360-aware object layout generator that learns latent object vectors in the 360-degree view to enable a variety of furniture arrangements for an input 360-degree background image. We use this object layout to condition a generative adversarial network to synthesize images of an input scene. To further reinforce the generation capability of our model, we develop a simple yet effective scene emptier that removes the generated furniture and produces an emptied scene for our model to learn a cyclic constraint. We train the model on the Structure3D dataset and show that our model can generate diverse decorations with controllable object layout. Our method achieves state-of-the-art performance on the Structure3D dataset and generalizes well to the Zillow indoor scene dataset. Our user study confirms the immersive experiences provided by the realistic image quality and furniture layout in our generation results. Our implementation will be made available.
Abstract:Out-of-distribution (OOD) generalisation aims to build a model that can well generalise its learnt knowledge from source domains to an unseen target domain. However, current image classification models often perform poorly in the OOD setting due to statistically spurious correlations learning from model training. From causality-based perspective, we formulate the data generation process in OOD image classification using a causal graph. On this graph, we show that prediction P(Y|X) of a label Y given an image X in statistical learning is formed by both causal effect P(Y|do(X)) and spurious effects caused by confounding features (e.g., background). Since the spurious features are domain-variant, the prediction P(Y|X) becomes unstable on unseen domains. In this paper, we propose to mitigate the spurious effect of confounders using front-door adjustment. In our method, the mediator variable is hypothesized as semantic features that are essential to determine a label for an image. Inspired by capability of style transfer in image generation, we interpret the combination of the mediator variable with different generated images in the front-door formula and propose novel algorithms to estimate it. Extensive experimental results on widely used benchmark datasets verify the effectiveness of our method.
Abstract:In this paper, we propose a new method for mapping a 3D point cloud to the latent space of a 3D generative adversarial network. Our generative model for 3D point clouds is based on SP-GAN, a state-of-the-art sphere-guided 3D point cloud generator. We derive an efficient way to encode an input 3D point cloud to the latent space of the SP-GAN. Our point cloud encoder can resolve the point ordering issue during inversion, and thus can determine the correspondences between points in the generated 3D point cloud and those in the canonical sphere used by the generator. We show that our method outperforms previous GAN inversion methods for 3D point clouds, achieving state-of-the-art results both quantitatively and qualitatively. Our code is available at https://github.com/hkust-vgd/point_inverter.
Abstract:With the recent developments of convolutional neural networks, deep learning for 3D point clouds has shown significant progress in various 3D scene understanding tasks including 3D object recognition. In a safety-critical environment, it is however not well understood how such neural networks are vulnerable to adversarial examples. In this work, we explore adversarial attacks for point cloud-based neural networks with a focus on real-world data. We propose a general formulation for adversarial point cloud generation via $\ell_0$-norm optimisation. Our method generates adversarial examples by attacking the classification ability of the point cloud-based networks while considering the perceptibility of the examples and ensuring the minimum level of point manipulations. The proposed method is general and can be realised in different attack strategies. Experimental results show that our method achieves the state-of-the-art performance with 80\% of attack success rate while manipulating only about 4\% of the points. We also found that compared with synthetic data, real-world point cloud classification is more vulnerable to attacks.
Abstract:Deep learning has been widely adopted in automatic emotion recognition and has lead to significant progress in the field. However, due to insufficient annotated emotion datasets, pre-trained models are limited in their generalization capability and thus lead to poor performance on novel test sets. To mitigate this challenge, transfer learning performing fine-tuning on pre-trained models has been applied. However, the fine-tuned knowledge may overwrite and/or discard important knowledge learned from pre-trained models. In this paper, we address this issue by proposing a PathNet-based transfer learning method that is able to transfer emotional knowledge learned from one visual/audio emotion domain to another visual/audio emotion domain, and transfer the emotional knowledge learned from multiple audio emotion domains into one another to improve overall emotion recognition accuracy. To show the robustness of our proposed system, various sets of experiments for facial expression recognition and speech emotion recognition task on three emotion datasets: SAVEE, EMODB, and eNTERFACE have been carried out. The experimental results indicate that our proposed system is capable of improving the performance of emotion recognition, making its performance substantially superior to the recent proposed fine-tuning/pre-trained models based transfer learning methods.