Abstract:This paper introduces EasyInv, an easy yet novel approach that significantly advances the field of DDIM Inversion by addressing the inherent inefficiencies and performance limitations of traditional iterative optimization methods. At the core of our EasyInv is a refined strategy for approximating inversion noise, which is pivotal for enhancing the accuracy and reliability of the inversion process. By prioritizing the initial latent state, which encapsulates rich information about the original images, EasyInv steers clear of the iterative refinement of noise items. Instead, we introduce a methodical aggregation of the latent state from the preceding time step with the current state, effectively increasing the influence of the initial latent state and mitigating the impact of noise. We illustrate that EasyInv is capable of delivering results that are either on par with or exceed those of the conventional DDIM Inversion approach, especially under conditions where the model's precision is limited or computational resources are scarce. Concurrently, our EasyInv offers an approximate threefold enhancement regarding inference efficiency over off-the-shelf iterative optimization techniques.
Abstract:We introduce ObjectAdd, a training-free diffusion modification method to add user-expected objects into user-specified area. The motive of ObjectAdd stems from: first, describing everything in one prompt can be difficult, and second, users often need to add objects into the generated image. To accommodate with real world, our ObjectAdd maintains accurate image consistency after adding objects with technical innovations in: (1) embedding-level concatenation to ensure correct text embedding coalesce; (2) object-driven layout control with latent and attention injection to ensure objects accessing user-specified area; (3) prompted image inpainting in an attention refocusing & object expansion fashion to ensure rest of the image stays the same. With a text-prompted image, our ObjectAdd allows users to specify a box and an object, and achieves: (1) adding object inside the box area; (2) exact content outside the box area; (3) flawless fusion between the two areas
Abstract:Vertebrae identification in arbitrary fields-of-view plays a crucial role in diagnosing spine disease. Most spine CT contain only local regions, such as the neck, chest, and abdomen. Therefore, identification should not depend on specific vertebrae or a particular number of vertebrae being visible. Existing methods at the spine-level are unable to meet this challenge. In this paper, we propose a three-stage method to address the challenges in 3D CT vertebrae identification at vertebrae-level. By sequentially performing the tasks of vertebrae localization, segmentation, and identification, the anatomical prior information of the vertebrae is effectively utilized throughout the process. Specifically, we introduce a dual-factor density clustering algorithm to acquire localization information for individual vertebra, thereby facilitating subsequent segmentation and identification processes. In addition, to tackle the issue of interclass similarity and intra-class variability, we pre-train our identification network by using a supervised contrastive learning method. To further optimize the identification results, we estimated the uncertainty of the classification network and utilized the message fusion module to combine the uncertainty scores, while aggregating global information about the spine. Our method achieves state-of-the-art results on the VerSe19 and VerSe20 challenge benchmarks. Additionally, our approach demonstrates outstanding generalization performance on an collected dataset containing a wide range of abnormal cases.
Abstract:Leveraging the advantage of satellite and terrestrial networks, the integrated satellite terrestrial networks (ISTNs) can help to achieve seamless global access and eliminate the digital divide. However, the dense deployment and frequent handover of satellites aggravate intra- and inter-system interference, resulting in a decrease in downlink sum rate. To address this issue, we propose a coordinated intra- and inter-system interference management algorithm for ISTN. This algorithm coordinates multidimensional interference through a joint design of inter-satellite handover and resource allocation method. On the one hand, we take inter-system interference between low earth orbit (LEO) and geostationary orbit (GEO) satellites as a constraint, and reduce interference to GEO satellite ground stations (GEO-GS) while ensuring system capacity through inter-satellite handover. On the other hand, satellite and terrestrial resource allocation schemes are designed based on the matching idea, and channel gain and interference to other channels are considered during the matching process to coordinate co-channel interference. In order to avoid too many unnecessary handovers, we consider handover scenarios related to service capabilities and service time to determine the optimal handover target satellite. Numerical results show that the gap between the results on the system sum rate obtained by the proposed method and the upper bound is reduced as the user density increases, and the handover frequency can be significantly reduced.
Abstract:In recent years deep learning methods have shown great superiority in compressed video quality enhancement tasks. Existing methods generally take the raw video as the ground truth and extract practical information from consecutive frames containing various artifacts. However, they do not fully exploit the valid information of compressed and raw videos to guide the quality enhancement for compressed videos. In this paper, we propose a unique Valid Information Guidance scheme (VIG) to enhance the quality of compressed videos by mining valid information from both compressed videos and raw videos. Specifically, we propose an efficient framework, Compressed Redundancy Filtering (CRF) network, to balance speed and enhancement. After removing the redundancy by filtering the information, CRF can use the valid information of the compressed video to reconstruct the texture. Furthermore, we propose a progressive Truth Guidance Distillation (TGD) strategy, which does not need to design additional teacher models and distillation loss functions. By only using the ground truth as input to guide the model to aggregate the correct spatio-temporal correspondence across the raw frames, TGD can significantly improve the enhancement effect without increasing the extra training cost. Extensive experiments show that our method achieves the state-of-the-art performance of compressed video quality enhancement in terms of accuracy and efficiency.
Abstract:Clothing changes and lack of data labels are both crucial challenges in person ReID. For the former challenge, people may occur multiple times at different locations wearing different clothing. However, most of the current person ReID research works focus on the benchmarks in which a person's clothing is kept the same all the time. For the last challenge, some researchers try to make model learn information from a labeled dataset as a source to an unlabeled dataset. Whereas purely unsupervised training is less used. In this paper, we aim to solve both problems at the same time. We design a novel unsupervised model, Sync-Person-Cloud ReID, to solve the unsupervised clothing change person ReID problem. We developer a purely unsupervised clothing change person ReID pipeline with person sync augmentation operation and same person feature restriction. The person sync augmentation is to supply additional same person resources. These same person's resources can be used as part supervised input by same person feature restriction. The extensive experiments on clothing change ReID datasets show the out-performance of our methods.
Abstract:Person Re-identification (ReID) is a critical computer vision task which aims to match the same person in images or video sequences. Most current works focus on settings where the resolution of images is kept the same. However, the resolution is a crucial factor in person ReID, especially when the cameras are at different distances from the person or the camera's models are different from each other. In this paper, we propose a novel two-stream network with a lightweight resolution association ReID feature transformation (RAFT) module and a self-weighted attention (SWA) ReID module to evaluate features under different resolutions. RAFT transforms the low resolution features to corresponding high resolution features. SWA evaluates both features to get weight factors for the person ReID. Both modules are jointly trained to get a resolution-invariant representation. Extensive experiments on five benchmark datasets show the effectiveness of our method. For instance, we achieve Rank-1 accuracy of 43.3% and 83.2% on CAVIAR and MLR-CUHK03, outperforming the state-of-the-art.
Abstract:RGB-Infrared (RGB-IR) person re-identification (ReID) is a technology where the system can automatically identify the same person appearing at different parts of a video when light is unavailable. The critical challenge of this task is the cross-modality gap of features under different modalities. To solve this challenge, we proposed a Teacher-Student GAN model (TS-GAN) to adopt different domains and guide the ReID backbone to learn better ReID information. (1) In order to get corresponding RGB-IR image pairs, the RGB-IR Generative Adversarial Network (GAN) was used to generate IR images. (2) To kick-start the training of identities, a ReID Teacher module was trained under IR modality person images, which is then used to guide its Student counterpart in training. (3) Likewise, to better adapt different domain features and enhance model ReID performance, three Teacher-Student loss functions were used. Unlike other GAN based models, the proposed model only needs the backbone module at the test stage, making it more efficient and resource-saving. To showcase our model's capability, we did extensive experiments on the newly-released SYSU-MM01 RGB-IR Re-ID benchmark and achieved superior performance to the state-of-the-art with 49.8% Rank-1 and 47.4% mAP.
Abstract:Most existing works in Person Re-identification (ReID) focus on settings where illumination either is kept the same or has very little fluctuation. However, the changes in the illumination degree may affect the robustness of a ReID algorithm significantly. To address this problem, we proposed a Two-Stream Network that can separate ReID features from lighting features to enhance ReID performance. Its innovations are threefold: (1) A discriminative entropy loss to ensure the ReID features contain no lighting information. (2) A ReID Teacher model trained by images under "neutral" lighting conditions to guide ReID classification. (3) An illumination Teacher model trained by the differences between the illumination-adjusted and original images to guide illumination classification. We construct two augmented datasets by synthetically changing a set of predefined lighting conditions in two of the most popular ReID benchmarks: Market1501 and DukeMTMC-ReID. Experiments demonstrate that our algorithm outperforms other state-of-the-art works and particularly potent in handling images under extremely low light.