Abstract:Tracking the 6DoF pose of unknown objects in monocular RGB video sequences is crucial for robotic manipulation. However, existing approaches typically rely on accurate depth information, which is non-trivial to obtain in real-world scenarios. Although depth estimation algorithms can be employed, geometric inaccuracy can lead to failures in RGBD-based pose tracking methods. To address this challenge, we introduce GSGTrack, a novel RGB-based pose tracking framework that jointly optimizes geometry and pose. Specifically, we adopt 3D Gaussian Splatting to create an optimizable 3D representation, which is learned simultaneously with a graph-based geometry optimization to capture the object's appearance features and refine its geometry. However, the joint optimization process is susceptible to perturbations from noisy pose and geometry data. Thus, we propose an object silhouette loss to address the issue of pixel-wise loss being overly sensitive to pose noise during tracking. To mitigate the geometric ambiguities caused by inaccurate depth information, we propose a geometry-consistent image pair selection strategy, which filters out low-confidence pairs and ensures robust geometric optimization. Extensive experiments on the OnePose and HO3D datasets demonstrate the effectiveness of GSGTrack in both 6DoF pose tracking and object reconstruction.
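The abstract does not give the form of the object silhouette loss. As a rough illustration of why a mask-level loss can be less sensitive to small pose errors than a pixel-wise color loss, the sketch below uses a soft intersection-over-union between a rendered mask and a target mask. The function name `silhouette_iou_loss` and the soft-IoU formulation are assumptions chosen for illustration, not the authors' definition.

```python
def silhouette_iou_loss(rendered, target, eps=1e-8):
    """Soft IoU loss between two same-sized masks with values in [0, 1].

    Hypothetical formulation: 1 - |A ∩ B| / |A ∪ B|, computed on soft masks.
    """
    inter = 0.0
    union = 0.0
    for row_r, row_t in zip(rendered, target):
        for r, t in zip(row_r, row_t):
            inter += r * t            # soft intersection
            union += r + t - r * t    # soft union
    return 1.0 - inter / (union + eps)


target = [[0, 1, 1, 0],
          [0, 1, 1, 0]]
shifted = [[0, 0, 1, 1],   # same shape, moved one pixel to the right
           [0, 0, 1, 1]]

print(silhouette_iou_loss(target, target))   # close to 0 for a perfect overlap
print(silhouette_iou_loss(shifted, target))  # about 0.667 (intersection 2, union 6)
```

The loss depends only on how much the two shapes overlap, so it changes gradually as the rendered object slides across the target mask.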
Abstract:In this paper, we propose a new approach to perform supervised texture classification/segmentation. The proposed idea is to feed a Fully Convolutional Network with specific texture descriptors. These texture features are extracted from images by using an empirical curvelet transform. We propose a method to build a unique empirical curvelet filter bank adapted to a given dictionary of textures. We then show that the outputs of these filters can be used to build efficient texture descriptors, which are finally fed to deep learning networks. Finally, we evaluate our approach on several datasets, compare the results to various state-of-the-art algorithms, and show that the proposed method dramatically outperforms all of them.
Abstract:Wavelet-based segmentation approaches are widely used for texture segmentation because of their ability to characterize different textures. In this paper, we assess the influence of the chosen wavelet and propose to use the recently introduced empirical wavelets. We show that the adaptability of the empirical wavelet makes it possible to reach better results than classic wavelets. In order to focus only on the textural information, we also propose to perform a cartoon + texture decomposition step before applying the segmentation algorithm. The proposed method is tested on six classic benchmarks, based on several popular texture images.
Abstract:The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Nighttime Flare Removal track on MIPI 2024. In total, 170 participants were successfully registered, and 14 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art performance on Nighttime Flare Removal. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2024/.
Abstract:Autonomous virtual agents are often limited by their singular mode of interaction with real-world environments, restricting their versatility. To address this, we propose the Multi-Modal Agent Collaboration framework (MMAC-Copilot), a framework that utilizes the collective expertise of diverse agents to enhance interaction ability with operating systems. The framework introduces a team collaboration chain, enabling each participating agent to contribute insights based on its specific domain knowledge, effectively reducing the hallucination associated with knowledge domain gaps. To evaluate the performance of MMAC-Copilot, we conducted experiments using both the GAIA benchmark and our newly introduced Visual Interaction Benchmark (VIBench). VIBench focuses on non-API-interactable applications across various domains, including 3D gaming, recreation, and office scenarios. MMAC-Copilot achieved exceptional performance on GAIA, with an average improvement of 6.8\% over existing leading systems. Furthermore, it demonstrated remarkable capability on VIBench, particularly in managing various methods of interaction within systems and applications. These results underscore MMAC-Copilot's potential in advancing the field of autonomous virtual agents through its innovative approach to agent collaboration.
Abstract:This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF codec, instead of JPEG. All the proposed methods improve PSNR fidelity over Lanczos interpolation, and process images in under 10 ms. Out of the 160 participants, 25 teams submitted their code and models. The solutions present novel designs tailored for memory-efficiency and runtime on edge devices. This survey describes the best solutions for real-time SR of compressed high-resolution images.
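For readers unfamiliar with the Lanczos baseline mentioned above, the following is a minimal one-dimensional sketch of Lanczos resampling at a 4x factor. The window size `a=3`, the edge clamping, the weight normalization, and the function names are assumptions made for illustration. The abstract does not say which implementation or parameters the challenge used.

```python
import math


def lanczos_kernel(x, a=3):
    """Lanczos window: sinc(x) * sinc(x / a) for |x| < a, else 0."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)


def upsample_1d(samples, factor=4, a=3):
    """Upsample a 1D signal by `factor` using Lanczos weights (illustrative)."""
    n = len(samples)
    out = []
    for j in range(n * factor):
        # Position of output sample j in input coordinates (centers aligned).
        x = (j + 0.5) / factor - 0.5
        base = math.floor(x)
        acc = 0.0
        wsum = 0.0
        for i in range(base - a + 1, base + a + 1):
            w = lanczos_kernel(x - i, a)
            clamped = min(max(i, 0), n - 1)  # repeat edge samples
            acc += w * samples[clamped]
            wsum += w
        out.append(acc / wsum)  # normalize so constant signals are preserved
    return out


upsampled = upsample_1d([5.0] * 6)
print(len(upsampled))  # 24 output samples for 6 inputs at 4x
```

A 2D image would typically be resampled by applying the same 1D filter along rows and then along columns.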
Abstract:This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4), highlighting the solutions proposed and the outcomes obtained. The challenge involves generating corresponding high-resolution (HR) images, magnified by a factor of four, from low-resolution (LR) inputs using prior information. The LR images originate from bicubic downsampling degradation. The aim of the challenge is to obtain designs/solutions with the most advanced SR performance, with no constraints on computational resources (e.g., model size and FLOPs) or training data. The track of this challenge assesses performance with the PSNR metric on the DIV2K testing dataset. The competition attracted 199 registrants, with 20 teams submitting valid entries. This collective endeavour not only pushes the boundaries of performance in single-image SR but also offers a comprehensive overview of current trends in this field.
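Since the challenge ranks entries by PSNR, a small worked sketch of the metric may help. It computes PSNR from the mean squared error over flat lists of pixel values, assuming an 8-bit peak value of 255. Evaluation details such as color space or border cropping are not stated in the abstract and are not modelled here.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by 10 gray levels, so MSE = 100.
print(round(psnr([100, 100], [110, 90]), 2))  # 28.13
```

Higher values mean the estimate is closer to the reference, and identical inputs give infinite PSNR.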
Abstract:In this work, we address the problem of rain streak removal with RAW images. The general approach is to first process RAW data into RGB images and then remove rain streaks from the RGB images. However, the original information about rain in RAW images is affected by image signal processing (ISP) pipelines, which include non-linear algorithms, unexpected noise, artifacts, and so on. It is therefore more beneficial to remove rain directly in the RAW data before it is processed into RGB format. To solve this problem, we propose a joint solution for rain removal and RAW processing to obtain clean color images from rainy RAW images. Specifically, we generate rainy RAW data by converting color rain streaks into RAW space and design simple but efficient RAW processing algorithms to synthesize both rainy and clean color images. The rainy color images are used as references to help color correction. Experiments with different backbones show that our method achieves better results than several other state-of-the-art deraining methods that focus on color images. In addition, the proposed network generalizes well to other cameras beyond our selected RAW dataset. Finally, we report results on images processed by different ISP pipelines to show that our model generalizes better than methods that operate on color images.
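The abstract does not describe how color rain streaks are converted into RAW space. One ingredient of any such conversion is sampling a three-channel image onto a single-channel color filter array. The sketch below shows only that mosaicing step for an RGGB Bayer layout. The layout, the function name `mosaic_rggb`, and the omission of other inverse-ISP steps (such as undoing gamma or white balance) are assumptions made for illustration.

```python
def mosaic_rggb(rgb):
    """Sample an H x W grid of (r, g, b) tuples onto an RGGB Bayer mosaic.

    Even rows alternate R, G; odd rows alternate G, B (assumed layout).
    """
    raw = []
    for y, row in enumerate(rgb):
        out_row = []
        for x, (r, g, b) in enumerate(row):
            if y % 2 == 0:
                out_row.append(r if x % 2 == 0 else g)
            else:
                out_row.append(g if x % 2 == 0 else b)
        raw.append(out_row)
    return raw


pixel = (1, 2, 3)  # (r, g, b)
rgb = [[pixel, pixel],
       [pixel, pixel]]
print(mosaic_rggb(rgb))  # [[1, 2], [2, 3]]
```

Each output location keeps a single color sample, which is why rain information in RAW data differs from what a deraining network sees after demosaicing and the rest of the ISP.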
Abstract:Data augmentation via back-translation is common when pretraining Vision-and-Language Navigation (VLN) models, even though the generated instructions are noisy. But: does that noise matter? We find that nonsensical or irrelevant language instructions during pretraining can have little effect on downstream performance for both HAMT and VLN-BERT on R2R, and that pretraining with them is still better than using only clean, human data. To underscore these results, we concoct an efficient augmentation method, Unigram + Object, which generates nonsensical instructions that nonetheless improve downstream performance. Our findings suggest that what matters for VLN R2R pretraining is the quantity of visual trajectories, not the quality of instructions.
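The abstract names Unigram + Object but does not specify how it works. Purely as a guess at what such a generator could look like, the sketch below samples words from the unigram frequencies of an instruction corpus and overwrites evenly spaced positions with object names. The function `unigram_object_instruction`, its parameters, and the placement rule are all hypothetical and should not be read as the authors' method.

```python
import random
from collections import Counter


def unigram_object_instruction(corpus, objects, length=8, seed=0):
    """Build a nonsensical instruction from unigram samples plus object names."""
    rng = random.Random(seed)
    counts = Counter(word for sentence in corpus for word in sentence.split())
    words = list(counts)
    weights = [counts[w] for w in words]
    # Sample words independently, in proportion to corpus frequency.
    tokens = rng.choices(words, weights=weights, k=length)
    # Overwrite evenly spaced positions with the given object names.
    for k, obj in enumerate(objects):
        tokens[(k * length) // len(objects)] = obj
    return " ".join(tokens)


corpus = ["walk past the sofa", "turn left at the door"]
instruction = unigram_object_instruction(corpus, ["lamp", "stairs"], length=6, seed=1)
print(instruction)
```

The output has no grammatical structure, which matches the abstract's description of the generated instructions as nonsensical.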
Abstract:Multi-goal path finding (MGPF) aims to find a closed, collision-free path that visits a sequence of goals in order. As MGPF is a physical travelling salesman problem, an undirected complete graph with accurate weights is crucial for determining the visiting order. The lack of prior knowledge of local paths between vertices poses challenges in meeting the optimality and efficiency requirements of algorithms. In this study, a multi-task learning model, designated Prior Knowledge Extraction (PKE), is designed to estimate the local path length between pairwise vertices as the weights of the graph. Simultaneously, a promising region and a guideline are predicted as heuristics for the path-finding process. Utilizing the outputs of the PKE model, a variant of Rapidly-exploring Random Tree (RRT), known as PKE-RRT, is proposed. It effectively tackles the MGPF problem with a local planner incorporating a prioritized visiting order, which is obtained from the complete graph. Furthermore, the predicted region and guideline facilitate efficient exploration of the tree structure, enabling the algorithm to rapidly provide a sub-optimal solution. Extensive numerical experiments demonstrate the outstanding performance of PKE-RRT for the MGPF problem with different numbers of goals, in terms of calculation time, path cost, sample number, and success rate.
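To make the role of the weighted complete graph concrete, the sketch below picks a closed visiting order from a symmetric weight matrix by exhaustive search. The weights are made-up numbers standing in for estimated local path lengths. The brute-force solver is a stand-in chosen for clarity, as the abstract does not state how the order is computed.

```python
from itertools import permutations


def best_closed_tour(weights, start=0):
    """Return the cheapest closed tour over all vertices and its cost.

    Exhaustive search, so only practical for a handful of goals.
    """
    n = len(weights)
    others = [v for v in range(n) if v != start]
    best_order, best_cost = None, float("inf")
    for perm in permutations(others):
        order = (start, *perm, start)
        cost = sum(weights[a][b] for a, b in zip(order, order[1:]))
        if cost < best_cost:
            best_order, best_cost = order, cost
    return list(best_order), best_cost


# Hypothetical pairwise path-length estimates between four goals.
weights = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
print(best_closed_tour(weights))  # ([0, 1, 3, 2, 0], 18)
```

Inaccurate weights can change which tour is cheapest, which is why the abstract stresses accurate estimates of local path length.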