Abstract: Recent 3D novel view synthesis (NVS) methods are limited to generating new viewpoints of single-object-centric scenes and struggle with complex environments. They often require extensive 3D data for training and lack generalization beyond the training distribution. Conversely, 3D-free methods can generate text-controlled views of complex, in-the-wild scenes using a pretrained Stable Diffusion model without tedious fine-tuning, but they lack camera control. In this paper, we introduce HawkI++, a method capable of generating camera-controlled viewpoints from a single input image. HawkI++ excels at handling complex and diverse scenes without additional 3D data or extensive training. It leverages widely available pretrained NVS models for weak guidance, integrating this knowledge into a 3D-free view synthesis approach to achieve the desired results efficiently. Our experimental results demonstrate that HawkI++ outperforms existing models in both qualitative and quantitative evaluations, providing high-fidelity and consistent novel view synthesis at the desired camera angles across a wide variety of scenes.
Abstract: A pedestrian navigation system (PNS) is necessary in indoor environments, where access to global navigation satellite system (GNSS) signals is difficult, particularly for search and rescue (SAR) operations in large buildings. This paper studies pedestrian walking behaviors to enhance the performance of indoor pedestrian dead reckoning (PDR) and map matching techniques. Specifically, our research aims to detect pedestrian turning motions within a given PDR trajectory using smartphone inertial measurement unit (IMU) data. To improve on existing methods, namely the threshold-based, hidden Markov model (HMM)-based, and pruned exact linear time (PELT) algorithm-based turn detection methods, we propose enhanced algorithms that better detect pedestrian turning motions. In field tests, the threshold-based method showed a missed detection rate of 20.35% and a false alarm rate of 7.65%. The PELT-based method achieved a significant improvement, with a missed detection rate of 8.93% and a false alarm rate of 6.97%. The best results were obtained with the HMM-based method, which achieved a missed detection rate of 5.14% and a false alarm rate of 2.00%. In summary, our research contributes to a more accurate and reliable pedestrian navigation system by leveraging smartphone IMU data and advanced turn detection algorithms for indoor environments.
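To make the simplest of the three compared detectors concrete, below is a minimal sketch of threshold-based turn detection from gyroscope yaw rate. The sampling rate, threshold, window, and duration values are illustrative assumptions, not the parameters used in the paper.

```python
import numpy as np

def detect_turns(yaw_rate, fs=50.0, thresh=0.5, min_dur=0.4, win=0.5):
    """Threshold-based turn detection (illustrative sketch).

    yaw_rate : 1-D array of gyroscope yaw rate [rad/s]
    fs       : sampling frequency [Hz] (assumed 50 Hz)
    thresh   : |yaw rate| above this counts as turning [rad/s]
    min_dur  : minimum duration of a turn segment [s]
    win      : moving-average smoothing window [s]
    """
    # Smooth the raw yaw rate with a simple moving average.
    k = max(1, int(win * fs))
    smooth = np.convolve(yaw_rate, np.ones(k) / k, mode="same")

    # Mark samples whose smoothed magnitude exceeds the threshold.
    active = np.abs(smooth) > thresh

    # Group consecutive active samples; keep segments long enough.
    turns, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            if (i - start) / fs >= min_dur:
                turns.append((start / fs, i / fs))  # (t_start, t_end) [s]
            start = None
    if start is not None and (len(active) - start) / fs >= min_dur:
        turns.append((start / fs, len(active) / fs))
    return turns
```

A fixed threshold like this is exactly what produces the missed detections (slow turns) and false alarms (hand jitter) reported above, which is what motivates the HMM- and PELT-based alternatives.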
Abstract: Target localization is essential in emergency dispatching situations. Maximum likelihood estimation (MLE) methods are widely used to estimate the target position from received signal strength measurements. However, the performance of MLE solvers is significantly affected by their initialization (i.e., the initial guess of the solution or the solution search space). To address this, a previous study proposed semidefinite programming (SDP)-based MLE initialization. However, the performance of SDP-based initialization is largely affected by the shadowing variance and the geometric diversity between the target and receivers. In this study, a radio frequency (RF) fingerprinting-based MLE initialization is proposed, and a maximum likelihood problem for target localization that incorporates RF fingerprinting is formulated. In three test environments (open space, urban, and indoor), the proposed RF fingerprinting-aided target localization method improved performance by up to 63.31%, and by 39.13% on average, compared to the MLE algorithm initialized with SDP. Furthermore, unlike the SDP-MLE method, the proposed method was not significantly affected by poor geometry between the target and receivers in our experiments.
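As a sketch of why initialization matters here, the snippet below solves RSS-based MLE under the standard log-distance path loss model with log-normal shadowing, in which case the MLE reduces to nonlinear least squares. The model parameters (p0, eta, d0) and the solver choice are assumptions for illustration, not the paper's calibrated values.

```python
import numpy as np
from scipy.optimize import minimize

def rss_mle(rx_pos, rss, x0, p0=-40.0, eta=3.0, d0=1.0):
    """MLE of a 2-D target position from RSS measurements (sketch).

    Assumes rss_i = p0 - 10*eta*log10(d_i/d0) + Gaussian shadowing,
    so maximizing the likelihood is minimizing squared residuals.

    rx_pos : (N, 2) receiver positions [m]
    rss    : (N,) received signal strengths [dBm]
    x0     : (2,) initial guess, e.g., from an RF fingerprint lookup
    """
    def neg_log_likelihood(x):
        d = np.maximum(np.linalg.norm(rx_pos - x, axis=1), 1e-6)
        pred = p0 - 10.0 * eta * np.log10(d / d0)
        return np.sum((rss - pred) ** 2)

    res = minimize(neg_log_likelihood, x0, method="Nelder-Mead")
    return res.x
```

Because this cost surface is nonconvex, the quality of x0, supplied here by a fingerprint lookup rather than SDP, largely determines which local minimum the solver converges to.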
Abstract: Image-based virtual try-on transfers a clothing item onto a photo of a given person, usually by warping the item to the person's pose and then adjusting the warped item to the person. However, results synthesized from real-world images (e.g., selfies) by previous methods are not realistic because of limitations that cause the neck to be misrepresented and the garment style to change significantly. To address these challenges, we propose a novel method, called VITON-CROP, that solves this issue. By integrating random crop augmentation, VITON-CROP synthesizes images more robustly than existing state-of-the-art virtual try-on models. In our experiments, we demonstrate that VITON-CROP is superior to VITON-HD both qualitatively and quantitatively.
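Since the key ingredient is random crop augmentation, the snippet below shows the generic torchvision form of such a pipeline. The crop and resize sizes are illustrative assumptions, not VITON-CROP's actual training configuration.

```python
import torch
from torchvision import transforms

# Illustrative augmentation: randomly crop the person image before
# feeding it to the try-on network, so the model sees varied framings
# (e.g., selfie-like upper-body crops) instead of full-body studio shots.
augment = transforms.Compose([
    transforms.RandomCrop(size=(768, 576)),  # assumed crop size
    transforms.Resize((1024, 768)),          # back to the network input size
])

person = torch.rand(3, 1024, 768)  # placeholder person image tensor
cropped = augment(person)
print(cropped.shape)  # torch.Size([3, 1024, 768])
```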
Abstract: We propose an indoor navigation algorithm based on pedestrian dead reckoning (PDR) using the inertial measurement unit of a smartphone, combined with map matching. The proposed indoor navigation system is user-friendly and convenient because it requires no device other than a smartphone and works for a pedestrian walking in a casual posture with the smartphone in hand. Because PDR performance degrades over time, we greatly reduce the position error of the PDR-estimated trajectory using a map matching method with a known indoor map. To verify the proposed algorithm, we conducted an experiment in a real indoor environment using a commercial Android smartphone; the experimental results demonstrate the performance of our algorithm.
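PDR propagates position step by step from step detection, stride length, and heading. A minimal dead-reckoning update, with all parameter values assumed for illustration, looks like this:

```python
import numpy as np

def pdr_propagate(step_times, headings, stride=0.7, origin=(0.0, 0.0)):
    """Minimal PDR position propagation (illustrative sketch).

    step_times : times of detected steps (e.g., accelerometer peaks)
    headings   : heading [rad] at each step (gyro/magnetometer fusion)
    stride     : assumed constant stride length [m]; real systems
                 estimate it per step from step frequency and amplitude
    """
    pos = np.array(origin, dtype=float)
    trajectory = [pos.copy()]
    for _, h in zip(step_times, headings):
        pos += stride * np.array([np.cos(h), np.sin(h)])
        trajectory.append(pos.copy())
    return np.array(trajectory)  # (num_steps + 1, 2) x-y track [m]
```

Map matching then snaps this slowly drifting track onto the known floor plan, for example by constraining segments between turns to corridor axes, which is how the accumulated PDR error is kept bounded.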
Abstract: Existing state-of-the-art techniques for exemplar-based image-to-image translation have several critical problems. They cannot translate an input image tuple (source, target) that is not aligned, and they show limited generalization to unseen images. To overcome these limitations, we propose Multiple GAN Inversion for exemplar-based image-to-image translation. Our novel Multiple GAN Inversion avoids human intervention by using a self-deciding algorithm that chooses the number of layers using the Fréchet Inception Distance (FID), selecting the most plausible image reconstruction among multiple hypotheses without any training or supervision. Experimental results show the advantages of the proposed method compared to existing state-of-the-art exemplar-based image-to-image translation methods.
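The self-deciding step can be pictured as a small selection loop. In the sketch below, `invert` and `fid_score` are hypothetical helpers standing in for the actual inversion network and an FID implementation (FID is normally computed over image sets; the calls here are only stand-ins for whatever statistic the method evaluates per hypothesis).

```python
def select_inversion(image, candidate_layers, invert, fid_score):
    """Pick the most plausible GAN-inversion hypothesis by FID (sketch).

    candidate_layers : e.g., [4, 8, 12] -- numbers of layers to invert
    invert(image, n) : reconstruction using the first n layers (hypothetical)
    fid_score(a, b)  : FID between two image sets (hypothetical)
    """
    best, best_fid = None, float("inf")
    for n in candidate_layers:
        recon = invert(image, n)
        score = fid_score([image], [recon])  # lower FID = more plausible
        if score < best_fid:
            best, best_fid = recon, score
    return best, best_fid
```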
Abstract: Existing techniques for exemplar-based image-to-image translation with deep convolutional neural networks (CNNs) generally require a training phase to optimize the network parameters on domain- and task-specific benchmarks, and thus have limited applicability and generalization ability. In this paper, we propose, for the first time, a framework that solves exemplar-based translation through online optimization on a given input image pair, called online exemplar fine-tuning (OEFT), in which we fine-tune off-the-shelf, general-purpose networks to the input image pair itself. We design two sub-networks, namely correspondence fine-tuning and multiple GAN inversion, and optimize their network parameters and latent codes, starting from the pre-trained ones, with well-defined loss functions. Our framework does not require an offline training phase, which has been the main challenge of existing methods; it only needs pre-trained networks to enable online optimization. Experimental results show that our framework generalizes to unseen image pairs and even outperforms the state of the art, which requires an intensive training phase.
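To illustrate the online-optimization idea at the heart of this family of methods, here is a minimal latent-optimization loop against a pre-trained generator. The `generator` stands in for any off-the-shelf network, and the loss and hyperparameters are assumptions, not OEFT's actual configuration, which additionally fine-tunes network parameters and uses richer losses.

```python
import torch
import torch.nn.functional as F

def online_invert(generator, target, latent_dim=512, steps=500, lr=0.01):
    """Online GAN inversion by latent optimization (illustrative sketch).

    generator : pre-trained generator mapping a latent to an image
    target    : image tensor to reconstruct (same shape as the output)
    """
    latent = torch.randn(1, latent_dim, device=target.device,
                         requires_grad=True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = generator(latent)
        loss = F.mse_loss(recon, target)  # pixel reconstruction loss
        loss.backward()                   # gradients flow to the latent only
        opt.step()
    return latent.detach()
```

The point is that all optimization happens at test time, on the input pair itself, so no offline benchmark-specific training is needed.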
Abstract: Unsupervised image translation aims to learn the transformation from a source domain to a target domain given unpaired training data. Several state-of-the-art works have yielded impressive results in GAN-based unsupervised image-to-image translation, but these methods fail to capture strong geometric or structural changes between domains and are unsatisfactory for complex scenes, compared to texture-change tasks such as style transfer. Recently, SAGAN (Zhang et al., 2018) showed that a self-attention network produces better results than convolution-based GANs. However, the effectiveness of self-attention networks in unsupervised image-to-image translation tasks has not been verified. In this paper, we propose unsupervised image-to-image translation with self-attention networks, in which long-range dependencies help not only to capture strong geometric changes but also to generate details using cues from all feature locations. In experiments, we qualitatively and quantitatively show the superiority of the proposed method compared to existing state-of-the-art unsupervised image-to-image translation methods.
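For reference, a minimal PyTorch sketch of the SAGAN-style self-attention block this line of work builds on is shown below; the channel reduction to c/8 follows the common SAGAN convention, and where such a block is placed in the translation network is a design choice of the specific method.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention block (sketch after Zhang et al., 2018).

    Every spatial location attends to every other location, so the layer
    can use cues from the whole feature map -- the long-range dependency
    referred to above.
    """
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (b, hw, c//8)
        k = self.key(x).flatten(2)                    # (b, c//8, hw)
        attn = torch.softmax(q @ k, dim=-1)           # (b, hw, hw)
        v = self.value(x).flatten(2)                  # (b, c, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                   # residual connection
```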