Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yunzhong Hou

REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

Apr 14, 2025

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, Liang Zheng

Abstract:In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training both VAE and diffusion-model using standard diffusion-loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss -- allowing both VAE and diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself; leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art; achieving FID of 1.26 and 1.83 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.

Via

Access Paper or Ask Questions

Learning Camera Movement Control from Real-World Drone Videos

Dec 12, 2024

Yunzhong Hou, Liang Zheng, Philip Torr

Figure 1 for Learning Camera Movement Control from Real-World Drone Videos

Figure 2 for Learning Camera Movement Control from Real-World Drone Videos

Figure 3 for Learning Camera Movement Control from Real-World Drone Videos

Figure 4 for Learning Camera Movement Control from Real-World Drone Videos

Abstract:This study seeks to automate camera movement control for filming existing subjects into attractive videos, contrasting with the creation of non-existent content by directly generating the pixels. We select drone videos as our test case due to their rich and challenging motion patterns, distinctive viewing angles, and precise controls. Existing AI videography methods struggle with limited appearance diversity in simulation training, high costs of recording expert operations, and difficulties in designing heuristic-based goals to cover all scenarios. To avoid these issues, we propose a scalable method that involves collecting real-world training data to improve diversity, extracting camera trajectories automatically to minimize annotation costs, and training an effective architecture that does not rely on heuristics. Specifically, we collect 99k high-quality trajectories by running 3D reconstruction on online videos, connecting camera poses from consecutive frames to formulate 3D camera paths, and using Kalman filter to identify and remove low-quality data. Moreover, we introduce DVGFormer, an auto-regressive transformer that leverages the camera path and images from all past frames to predict camera movement in the next frame. We evaluate our system across 38 synthetic natural scenes and 7 real city 3D scans. We show that our system effectively learns to perform challenging camera movements such as navigating through obstacles, maintaining low altitude to increase perceived speed, and orbiting towers and buildings, which are very useful for recording high-quality videos. Data and code are available at dvgformer.github.io.

Via

Access Paper or Ask Questions

Optimizing Camera Configurations for Multi-View Pedestrian Detection

Dec 04, 2023

Yunzhong Hou, Xingjian Leng, Tom Gedeon, Liang Zheng

Abstract:Jointly considering multiple camera views (multi-view) is very effective for pedestrian detection under occlusion. For such multi-view systems, it is critical to have well-designed camera configurations, including camera locations, directions, and fields-of-view (FoVs). Usually, these configurations are crafted based on human experience or heuristics. In this work, we present a novel solution that features a transformer-based camera configuration generator. Using reinforcement learning, this generator autonomously explores vast combinations within the action space and searches for configurations that give the highest detection accuracy according to the training dataset. The generator learns advanced techniques like maximizing coverage, minimizing occlusion, and promoting collaboration. Across multiple simulation scenarios, the configurations generated by our transformer-based model consistently outperform random search, heuristic-based methods, and configurations designed by human experts, shedding light on future camera layout optimization.

Via

Access Paper or Ask Questions

Learning to Select Camera Views: Efficient Multiview Understanding at Few Glances

Mar 10, 2023

Yunzhong Hou, Stephen Gould, Liang Zheng

Abstract:Multiview camera setups have proven useful in many computer vision applications for reducing ambiguities, mitigating occlusions, and increasing field-of-view coverage. However, the high computational cost associated with multiple views poses a significant challenge for end devices with limited computational resources. To address this issue, we propose a view selection approach that analyzes the target object or scenario from given views and selects the next best view for processing. Our approach features a reinforcement learning based camera selection module, MVSelect, that not only selects views but also facilitates joint training with the task network. Experimental results on multiview classification and detection tasks show that our approach achieves promising performance while using only 2 or 3 out of N available views, significantly reducing computational costs. Furthermore, analysis on the selected views reveals that certain cameras can be shut off with minimal performance impact, shedding light on future camera layout optimization for multiview systems. Code is available at https://github.com/hou-yz/MVSelect.

Via

Access Paper or Ask Questions

Learning to Structure an Image with Few Colors and Beyond

Aug 17, 2022

Yunzhong Hou, Liang Zheng, Stephen Gould

Figure 1 for Learning to Structure an Image with Few Colors and Beyond

Figure 2 for Learning to Structure an Image with Few Colors and Beyond

Figure 3 for Learning to Structure an Image with Few Colors and Beyond

Figure 4 for Learning to Structure an Image with Few Colors and Beyond

Abstract:Color and structure are the two pillars that combine to give an image its meaning. Interested in critical structures for neural network recognition, we isolate the influence of colors by limiting the color space to just a few bits, and find structures that enable network recognition under such constraints. To this end, we propose a color quantization network, ColorCNN, which learns to structure an image in limited color spaces by minimizing the classification loss. Building upon the architecture and insights of ColorCNN, we introduce ColorCNN+, which supports multiple color space size configurations, and addresses the previous issues of poor recognition accuracy and undesirable visual fidelity under large color spaces. Via a novel imitation learning approach, ColorCNN+ learns to cluster colors like traditional color quantization methods. This reduces overfitting and helps both visual fidelity and recognition accuracy under large color spaces. Experiments verify that ColorCNN+ achieves very competitive results under most circumstances, preserving both key structures for network recognition and visual fidelity with accurate colors. We further discuss differences between key structures and accurate colors, and their specific contributions to network recognition. For potential applications, we show that ColorCNNs can be used as image compression methods for network recognition.

Via

Access Paper or Ask Questions

Multi-View Correlation Consistency for Semi-Supervised Semantic Segmentation

Aug 17, 2022

Yunzhong Hou, Stephen Gould, Liang Zheng

Figure 1 for Multi-View Correlation Consistency for Semi-Supervised Semantic Segmentation

Figure 2 for Multi-View Correlation Consistency for Semi-Supervised Semantic Segmentation

Figure 3 for Multi-View Correlation Consistency for Semi-Supervised Semantic Segmentation

Figure 4 for Multi-View Correlation Consistency for Semi-Supervised Semantic Segmentation

Abstract:Semi-supervised semantic segmentation needs rich and robust supervision on unlabeled data. Consistency learning enforces the same pixel to have similar features in different augmented views, which is a robust signal but neglects relationships with other pixels. In comparison, contrastive learning considers rich pairwise relationships, but it can be a conundrum to assign binary positive-negative supervision signals for pixel pairs. In this paper, we take the best of both worlds and propose multi-view correlation consistency (MVCC) learning: it considers rich pairwise relationships in self-correlation matrices and matches them across views to provide robust supervision. Together with this correlation consistency loss, we propose a view-coherent data augmentation strategy that guarantees pixel-pixel correspondence between different views. In a series of semi-supervised settings on two datasets, we report competitive accuracy compared with the state-of-the-art methods. Notably, on Cityscapes, we achieve 76.8% mIoU with 1/8 labeled data, just 0.6% shy from the fully supervised oracle.

Via

Access Paper or Ask Questions

Multiview Detection with Cardboard Human Modeling

Jul 10, 2022

Jiahao Ma, Zicheng Duan, Yunzhong Hou, Liang Zheng, Chuong Nguyen

Figure 1 for Multiview Detection with Cardboard Human Modeling

Figure 2 for Multiview Detection with Cardboard Human Modeling

Figure 3 for Multiview Detection with Cardboard Human Modeling

Figure 4 for Multiview Detection with Cardboard Human Modeling

Abstract:Multiview detection uses multiple calibrated cameras with overlapping fields of views to locate occluded pedestrians. In this field, existing methods typically adopt a "human modeling - aggregation" strategy. To find robust pedestrian representations, some intuitively use locations of detected 2D bounding boxes, while others use entire frame features projected to the ground plane. However, the former does not consider human appearance and leads to many ambiguities, and the latter suffers from projection errors due to the lack of accurate height of the human torso and head. In this paper, we propose a new pedestrian representation scheme based on human point clouds modeling. Specifically, using ray tracing for holistic human depth estimation, we model pedestrians as upright, thin cardboard point clouds on the ground. Then, we aggregate the point clouds of the pedestrian cardboard across multiple views for a final decision. Compared with existing representations, the proposed method explicitly leverages human appearance and reduces projection errors significantly by relatively accurate height estimation. On two standard evaluation benchmarks, the proposed method achieves very competitive results.

* The thesis is not perfect enough

Via

Access Paper or Ask Questions

Adaptive Affinity for Associations in Multi-Target Multi-Camera Tracking

Dec 14, 2021

Yunzhong Hou, Zhongdao Wang, Shengjin Wang, Liang Zheng

Figure 1 for Adaptive Affinity for Associations in Multi-Target Multi-Camera Tracking

Figure 2 for Adaptive Affinity for Associations in Multi-Target Multi-Camera Tracking

Figure 3 for Adaptive Affinity for Associations in Multi-Target Multi-Camera Tracking

Figure 4 for Adaptive Affinity for Associations in Multi-Target Multi-Camera Tracking

Abstract:Data associations in multi-target multi-camera tracking (MTMCT) usually estimate affinity directly from re-identification (re-ID) feature distances. However, we argue that it might not be the best choice given the difference in matching scopes between re-ID and MTMCT problems. Re-ID systems focus on global matching, which retrieves targets from all cameras and all times. In contrast, data association in tracking is a local matching problem, since its candidates only come from neighboring locations and time frames. In this paper, we design experiments to verify such misfit between global re-ID feature distances and local matching in tracking, and propose a simple yet effective approach to adapt affinity estimations to corresponding matching scopes in MTMCT. Instead of trying to deal with all appearance changes, we tailor the affinity metric to specialize in ones that might emerge during data associations. To this end, we introduce a new data sampling scheme with temporal windows originally used for data associations in tracking. Minimizing the mismatch, the adaptive affinity module brings significant improvements over global re-ID distance, and produces competitive performance on CityFlow and DukeMTMC datasets.

* This paper appears in: IEEE Transactions on Image Processing

Via

Access Paper or Ask Questions

Label-Free Model Evaluation with Semi-Structured Dataset Representations

Dec 01, 2021

Xiaoxiao Sun, Yunzhong Hou, Hongdong Li, Liang Zheng

Figure 1 for Label-Free Model Evaluation with Semi-Structured Dataset Representations

Figure 2 for Label-Free Model Evaluation with Semi-Structured Dataset Representations

Figure 3 for Label-Free Model Evaluation with Semi-Structured Dataset Representations

Figure 4 for Label-Free Model Evaluation with Semi-Structured Dataset Representations

Abstract:Label-free model evaluation, or AutoEval, estimates model accuracy on unlabeled test sets, and is critical for understanding model behaviors in various unseen environments. In the absence of image labels, based on dataset representations, we estimate model performance for AutoEval with regression. On the one hand, image feature is a straightforward choice for such representations, but it hampers regression learning due to being unstructured (\ie no specific meanings for component at certain location) and of large-scale. On the other hand, previous methods adopt simple structured representations (like average confidence or average feature), but insufficient to capture the data characteristics given their limited dimensions. In this work, we take the best of both worlds and propose a new semi-structured dataset representation that is manageable for regression learning while containing rich information for AutoEval. Based on image features, we integrate distribution shapes, clusters, and representative samples for a semi-structured dataset representation. Besides the structured overall description with distribution shapes, the unstructured description with clusters and representative samples include additional fine-grained information facilitating the AutoEval task. On three existing datasets and 25 newly introduced ones, we experimentally show that the proposed representation achieves competitive results. Code and dataset are available at https://github.com/sxzrt/Semi-Structured-Dataset-Representations.

* 10 pages, 8 figures, 3 tables

Via

Access Paper or Ask Questions

Ranking Models in Unlabeled New Environments

Sep 06, 2021

Xiaoxiao Sun, Yunzhong Hou, Weijian Deng, Hongdong Li, Liang Zheng

Figure 1 for Ranking Models in Unlabeled New Environments

Figure 2 for Ranking Models in Unlabeled New Environments

Figure 3 for Ranking Models in Unlabeled New Environments

Figure 4 for Ranking Models in Unlabeled New Environments

Abstract:Consider a scenario where we are supplied with a number of ready-to-use models trained on a certain source domain and hope to directly apply the most appropriate ones to different target domains based on the models' relative performance. Ideally we should annotate a validation set for model performance assessment on each new target environment, but such annotations are often very expensive. Under this circumstance, we introduce the problem of ranking models in unlabeled new environments. For this problem, we propose to adopt a proxy dataset that 1) is fully labeled and 2) well reflects the true model rankings in a given target environment, and use the performance rankings on the proxy sets as surrogates. We first select labeled datasets as the proxy. Specifically, datasets that are more similar to the unlabeled target domain are found to better preserve the relative performance rankings. Motivated by this, we further propose to search the proxy set by sampling images from various datasets that have similar distributions as the target. We analyze the problem and its solutions on the person re-identification (re-ID) task, for which sufficient datasets are publicly available, and show that a carefully constructed proxy set effectively captures relative performance ranking in new environments. Code is available at \url{https://github.com/sxzrt/Proxy-Set}.

* 13 pages, 10 figures, ICCV2021

Via

Access Paper or Ask Questions