Abstract:Capturing videos with wrong exposure usually produces unsatisfactory visual effects. While image exposure correction is a popular topic, the video counterpart is less explored in the literature. Directly applying prior image-based methods to input videos often results in temporal incoherence with low visual quality. Existing research in this area is also limited by the lack of high-quality benchmark datasets. To address these issues, we construct the first real-world paired video dataset, including both underexposure and overexposure dynamic scenes. To achieve spatial alignment, we utilize two DSLR cameras and a beam splitter to simultaneously capture improper and normal exposure videos. In addition, we propose a Video Exposure Correction Network (VECNet) based on Retinex theory, which incorporates a two-stream illumination learning mechanism to enhance the overexposure and underexposure factors, respectively. The estimated multi-frame reflectance and dual-path illumination components are fused at both feature and image levels, leading to visually appealing results. Experimental results demonstrate that the proposed method outperforms existing image exposure correction and underexposed video enhancement methods. The code and dataset will be available soon.
Abstract:Exposure correction aims to enhance images suffering from improper exposure to achieve satisfactory visual effects. Despite recent progress, existing methods generally mitigate either overexposure or underexposure in input images, and they still struggle to handle images with mixed exposure, i.e., one image incorporates both overexposed and underexposed regions. The mixed exposure distribution is non-uniform and leads to varying representation, which makes it challenging to address in a unified process. In this paper, we introduce an effective Region-aware Exposure Correction Network (RECNet) that can handle mixed exposure by adaptively learning and bridging different regional exposure representations. Specifically, to address the challenge posed by mixed exposure disparities, we develop a region-aware de-exposure module that effectively translates regional features of mixed exposure scenarios into an exposure-invariant feature space. Simultaneously, as de-exposure operation inevitably reduces discriminative information, we introduce a mixed-scale restoration unit that integrates exposure-invariant features and unprocessed features to recover local information. To further achieve a uniform exposure distribution in the global image, we propose an exposure contrastive regularization strategy under the constraints of intra-regional exposure consistency and inter-regional exposure continuity. Extensive experiments are conducted on various datasets, and the experimental results demonstrate the superiority and generalization of our proposed method. The code is released at: https://github.com/kravrolens/RECNet.
Abstract:With the rapid advancements in deep learning technologies, person re-identification (ReID) has witnessed remarkable performance improvements. However, the majority of prior works have traditionally focused on solving the problem via extracting features solely from a single perspective, such as uniform partitioning, hard attention mechanisms, or semantic masks. While these approaches have demonstrated efficacy within specific contexts, they fall short in diverse situations. In this paper, we propose a novel approach, Mutual Distillation Learning For Person Re-identification (termed as MDPR), which addresses the challenging problem from multiple perspectives within a single unified model, leveraging the power of mutual distillation to enhance the feature representations collectively. Specifically, our approach encompasses two branches: a hard content branch to extract local features via a uniform horizontal partitioning strategy and a Soft Content Branch to dynamically distinguish between foreground and background and facilitate the extraction of multi-granularity features via a carefully designed attention mechanism. To facilitate knowledge exchange between these two branches, a mutual distillation and fusion process is employed, promoting the capability of the outputs of each branch. Extensive experiments are conducted on widely used person ReID datasets to validate the effectiveness and superiority of our approach. Notably, our method achieves an impressive $88.7\%/94.4\%$ in mAP/Rank-1 on the DukeMTMC-reID dataset, surpassing the current state-of-the-art results. Our source code is available at https://github.com/KuilongCui/MDPR.
Abstract:Current object re-identification (ReID) system follows the centralized processing paradigm, i.e., all computations are conducted in the cloud server and edge devices are only used to capture and send images. As the number of videos experiences a rapid escalation, this paradigm has become impractical due to the finite computational resources. In such a scenario, the ReID system should be converted to fit in the cloud-edge collaborative processing paradigm, which is crucial to boost the scalability and practicality of ReID systems. However, current relevant work lacks research on this issue, making it challenging for ReID methods to be adapted effectively. Therefore, we pioneer a cloud-edge collaborative inference framework for ReID systems and particularly propose a distribution-aware correlation modeling network (DaCM) to make the desired image return to the cloud server as soon as possible via learning to model the spatial-temporal correlations among instances. DaCM embeds the spatial-temporal correlations implicitly included in the timestamps into a graph structure, and it can be applied in the cloud to regulate the size of the upload window and on the edge device to adjust the sequence of images, respectively. Traditional ReID methods can be combined with DaCM seamlessly, enabling their application within our proposed edge-cloud collaborative framework. Extensive experiments demonstrate that our method obviously reduces transmission overhead and significantly improves performance. We will release our code and model.
Abstract:Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos, only taking video-level labels as the supervised information. Pseudo label generation is a promising strategy to solve the challenging problem, but most existing methods are limited to employing snippet-wise classification results to guide the generation, and they ignore that the natural temporal structure of the video can also provide rich information to assist such a generation process. In this paper, we propose a novel weakly-supervised temporal action localization method by inferring snippet-feature affinity. First, we design an affinity inference module that exploits the affinity relationship between temporal neighbor snippets to generate initial coarse pseudo labels. Then, we introduce an information interaction module that refines the coarse labels by enhancing the discriminative nature of snippet-features through exploring intra- and inter-video relationships. Finally, the high-fidelity pseudo labels generated from the information interaction module are used to supervise the training of the action localization network. Extensive experiments on two publicly available datasets, i.e., THUMOS14 and ActivityNet v1.3, demonstrate our proposed method achieves significant improvements compared to the state-of-the-art methods.
Abstract:Video denoising aims to recover high-quality frames from the noisy video. While most existing approaches adopt convolutional neural networks(CNNs) to separate the noise from the original visual content, however, CNNs focus on local information and ignore the interactions between long-range regions. Furthermore, most related works directly take the output after spatio-temporal denoising as the final result, neglecting the fine-grained denoising process. In this paper, we propose a Dual-stage Spatial-Channel Transformer (DSCT) for coarse-to-fine video denoising, which inherits the advantages of both Transformer and CNNs. Specifically, DSCT is proposed based on a progressive dual-stage architecture, namely a coarse-level and a fine-level to extract dynamic feature and static feature, respectively. At both stages, a Spatial-Channel Encoding Module(SCEM) is designed to model the long-range contextual dependencies at spatial and channel levels. Meanwhile, we design a Multi-scale Residual Structure to preserve multiple aspects of information at different stages, which contains a Temporal Features Aggregation Module(TFAM) to summarize the dynamic representation. Extensive experiments on four publicly available datasets demonstrate our proposed DSCT achieves significant improvements compared to the state-of-the-art methods.