Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuliang Zou

SceneCrafter: Controllable Multi-View Driving Scene Editing

Jun 24, 2025

Zehao Zhu, Yuliang Zou, Chiyu Max Jiang, Bo Sun, Vincent Casser, Xiukun Huang, Jiahao Wang, Zhenpei Yang, Ruiqi Gao, Leonidas Guibas(+2 more)

Abstract:Simulation is crucial for developing and evaluating autonomous vehicle (AV) systems. Recent literature builds on a new generation of generative models to synthesize highly realistic images for full-stack simulation. However, purely synthetically generated scenes are not grounded in reality and have difficulty in inspiring confidence in the relevance of its outcomes. Editing models, on the other hand, leverage source scenes from real driving logs, and enable the simulation of different traffic layouts, behaviors, and operating conditions such as weather and time of day. While image editing is an established topic in computer vision, it presents fresh sets of challenges in driving simulation: (1) the need for cross-camera 3D consistency, (2) learning ``empty street" priors from driving data with foreground occlusions, and (3) obtaining paired image tuples of varied editing conditions while preserving consistent layout and geometry. To address these challenges, we propose SceneCrafter, a versatile editor for realistic 3D-consistent manipulation of driving scenes captured from multiple cameras. We build on recent advancements in multi-view diffusion models, using a fully controllable framework that scales seamlessly to multi-modality conditions like weather, time of day, agent boxes and high-definition maps. To generate paired data for supervising the editing model, we propose a novel framework on top of Prompt-to-Prompt to generate geometrically consistent synthetic paired data with global edits. We also introduce an alpha-blending framework to synthesize data with local edits, leveraging a model trained on empty street priors through novel masked training and multi-view repaint paradigm. SceneCrafter demonstrates powerful editing capabilities and achieves state-of-the-art realism, controllability, 3D consistency, and scene editing quality compared to existing baselines.

* CVPR 2025

Via

Access Paper or Ask Questions

Differentially Private Video Activity Recognition

Jun 27, 2023

Zelun Luo, Yuliang Zou, Yijin Yang, Zane Durante, De-An Huang, Zhiding Yu, Chaowei Xiao, Li Fei-Fei, Animashree Anandkumar

Figure 1 for Differentially Private Video Activity Recognition

Figure 2 for Differentially Private Video Activity Recognition

Figure 3 for Differentially Private Video Activity Recognition

Figure 4 for Differentially Private Video Activity Recognition

Abstract:In recent years, differential privacy has seen significant advancements in image classification; however, its application to video activity recognition remains under-explored. This paper addresses the challenges of applying differential privacy to video activity recognition, which primarily stem from: (1) a discrepancy between the desired privacy level for entire videos and the nature of input data processed by contemporary video architectures, which are typically short, segmented clips; and (2) the complexity and sheer size of video datasets relative to those in image classification, which render traditional differential privacy methods inadequate. To tackle these issues, we propose Multi-Clip DP-SGD, a novel framework for enforcing video-level differential privacy through clip-based classification models. This method samples multiple clips from each video, averages their gradients, and applies gradient clipping in DP-SGD without incurring additional privacy loss. Moreover, we incorporate a parameter-efficient transfer learning strategy to make the model scalable for large-scale video datasets. Through extensive evaluations on the UCF-101 and HMDB-51 datasets, our approach exhibits impressive performance, achieving 81% accuracy with a privacy budget of epsilon=5 on UCF-101, marking a 76% improvement compared to a direct application of DP-SGD. Furthermore, we demonstrate that our transfer learning strategy is versatile and can enhance differentially private image classification across an array of datasets including CheXpert, ImageNet, CIFAR-10, and CIFAR-100.

Via

Access Paper or Ask Questions

Learning Instance-Specific Adaptation for Cross-Domain Segmentation

Mar 30, 2022

Yuliang Zou, Zizhao Zhang, Chun-Liang Li, Han Zhang, Tomas Pfister, Jia-Bin Huang

Figure 1 for Learning Instance-Specific Adaptation for Cross-Domain Segmentation

Figure 2 for Learning Instance-Specific Adaptation for Cross-Domain Segmentation

Figure 3 for Learning Instance-Specific Adaptation for Cross-Domain Segmentation

Figure 4 for Learning Instance-Specific Adaptation for Cross-Domain Segmentation

Abstract:We propose a test-time adaptation method for cross-domain image segmentation. Our method is simple: Given a new unseen instance at test time, we adapt a pre-trained model by conducting instance-specific BatchNorm (statistics) calibration. Our approach has two core components. First, we replace the manually designed BatchNorm calibration rule with a learnable module. Second, we leverage strong data augmentation to simulate random domain shifts for learning the calibration rule. In contrast to existing domain adaptation methods, our method does not require accessing the target domain data at training time or conducting computationally expensive test-time model training/optimization. Equipping our method with models trained by standard recipes achieves significant improvement, comparing favorably with several state-of-the-art domain generalization and one-shot unsupervised domain adaptation approaches. Combining our method with the domain generalization methods further improves performance, reaching a new state of the art.

* Project page: https://yuliang.vision/InstCal/

Via

Access Paper or Ask Questions

Learning Representational Invariances for Data-Efficient Action Recognition

Mar 30, 2021

Yuliang Zou, Jinwoo Choi, Qitong Wang, Jia-Bin Huang

Figure 1 for Learning Representational Invariances for Data-Efficient Action Recognition

Figure 2 for Learning Representational Invariances for Data-Efficient Action Recognition

Figure 3 for Learning Representational Invariances for Data-Efficient Action Recognition

Figure 4 for Learning Representational Invariances for Data-Efficient Action Recognition

Abstract:Data augmentation is a ubiquitous technique for improving image classification when labeled data is scarce. Constraining the model predictions to be invariant to diverse data augmentations effectively injects the desired representational invariances to the model (e.g., invariance to photometric variations), leading to improved accuracy. Compared to image data, the appearance variations in videos are far more complex due to the additional temporal dimension. Yet, data augmentation methods for videos remain under-explored. In this paper, we investigate various data augmentation strategies that capture different video invariances, including photometric, geometric, temporal, and actor/scene augmentations. When integrated with existing consistency-based semi-supervised learning frameworks, we show that our data augmentation strategy leads to promising performance on the Kinetics-100, UCF-101, and HMDB-51 datasets in the low-label regime. We also validate our data augmentation strategy in the fully supervised setting and demonstrate improved performance.

* Project page: https://yuliang.vision/video-data-aug

Via

Access Paper or Ask Questions

PseudoSeg: Designing Pseudo Labels for Semantic Segmentation

Oct 19, 2020

Yuliang Zou, Zizhao Zhang, Han Zhang, Chun-Liang Li, Xiao Bian, Jia-Bin Huang, Tomas Pfister

Figure 1 for PseudoSeg: Designing Pseudo Labels for Semantic Segmentation

Figure 2 for PseudoSeg: Designing Pseudo Labels for Semantic Segmentation

Figure 3 for PseudoSeg: Designing Pseudo Labels for Semantic Segmentation

Figure 4 for PseudoSeg: Designing Pseudo Labels for Semantic Segmentation

Abstract:Recent advances in semi-supervised learning (SSL) demonstrate that a combination of consistency regularization and pseudo-labeling can effectively improve image classification accuracy in the low-data regime. Compared to classification, semantic segmentation tasks require much more intensive labeling costs. Thus, these tasks greatly benefit from data-efficient training methods. However, structured outputs in segmentation render particular difficulties (e.g., designing pseudo-labeling and augmentation) to apply existing SSL strategies. To address this problem, we present a simple and novel re-design of pseudo-labeling to generate well-calibrated structured pseudo labels for training with unlabeled or weakly-labeled data. Our proposed pseudo-labeling strategy is network structure agnostic to apply in a one-stage consistency training framework. We demonstrate the effectiveness of the proposed pseudo-labeling strategy in both low-data and high-data regimes. Extensive experiments have validated that pseudo labels generated from wisely fusing diverse sources and strong data augmentation are crucial to consistency training for segmentation. The source code is available at https://github.com/googleinterns/wss.

* Project page: https://yuliang.vision/pseudo_seg/

Via

Access Paper or Ask Questions

DRG: Dual Relation Graph for Human-Object Interaction Detection

Aug 26, 2020

Chen Gao, Jiarui Xu, Yuliang Zou, Jia-Bin Huang

Figure 1 for DRG: Dual Relation Graph for Human-Object Interaction Detection

Figure 2 for DRG: Dual Relation Graph for Human-Object Interaction Detection

Figure 3 for DRG: Dual Relation Graph for Human-Object Interaction Detection

Figure 4 for DRG: Dual Relation Graph for Human-Object Interaction Detection

Abstract:We tackle the challenging problem of human-object interaction (HOI) detection. Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features. In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph (one human-centric and one object-centric). Our proposed dual relation graph effectively captures discriminative cues from the scene to resolve ambiguity from local predictions. Our model is conceptually simple and leads to favorable results compared to the state-of-the-art HOI detection algorithms on two large-scale benchmark datasets.

* ECCV 2020. Project: http://chengao.vision/DRG/ Code: https://github.com/vt-vl-lab/DRG

Via

Access Paper or Ask Questions

Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling

Jul 21, 2020

Yuliang Zou, Pan Ji, Quoc-Huy Tran, Jia-Bin Huang, Manmohan Chandraker

Figure 1 for Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling

Figure 2 for Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling

Figure 3 for Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling

Figure 4 for Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling

Abstract:Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation. In this paper, we present a self-supervised learning method for VO with special consideration for consistency over longer sequences. To this end, we model the long-term dependency in pose prediction using a pose network that features a two-layer convolutional LSTM module. We train the networks with purely self-supervised losses, including a cycle consistency loss that mimics the loop closure module in geometric VO. Inspired by prior geometric systems, we allow the networks to see beyond a small temporal window during training, through a novel a loss that incorporates temporally distant (e.g., O(100)) frames. Given GPU memory constraints, we propose a stage-wise training mechanism, where the first stage operates in a local time window and the second stage refines the poses with a "global" loss given the first stage features. We demonstrate competitive results on several standard VO datasets, including KITTI and TUM RGB-D.

* ECCV 2020. Project page: https://yuliang.vision/LTMVO

Via

Access Paper or Ask Questions

DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency

Sep 05, 2018

Yuliang Zou, Zelun Luo, Jia-Bin Huang

Figure 1 for DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency

Figure 2 for DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency

Figure 3 for DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency

Figure 4 for DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency

Abstract:We present an unsupervised learning framework for simultaneously training single-view depth prediction and optical flow estimation models using unlabeled video sequences. Existing unsupervised methods often exploit brightness constancy and spatial smoothness priors to train depth or flow models. In this paper, we propose to leverage geometric consistency as additional supervisory signals. Our core idea is that for rigid regions we can use the predicted scene depth and camera motion to synthesize 2D optical flow by backprojecting the induced 3D scene flow. The discrepancy between the rigid flow (from depth prediction and camera motion) and the estimated flow (from optical flow model) allows us to impose a cross-task consistency loss. While all the networks are jointly optimized during training, they can be applied independently at test time. Extensive experiments demonstrate that our depth and flow models compare favorably with state-of-the-art unsupervised methods.

* ECCV 2018. Project website: http://yuliang.vision/DF-Net/ Code: https://github.com/vt-vl-lab/DF-Net

Via

Access Paper or Ask Questions

iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection

Aug 30, 2018

Chen Gao, Yuliang Zou, Jia-Bin Huang

Figure 1 for iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection

Figure 2 for iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection

Figure 3 for iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection

Figure 4 for iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection

Abstract:Recent years have witnessed rapid progress in detecting and recognizing individual object instances. To understand the situation in a scene, however, computers need to recognize how humans interact with surrounding objects. In this paper, we tackle the challenging task of detecting human-object interactions (HOI). Our core idea is that the appearance of a person or an object instance contains informative cues on which relevant parts of an image to attend to for facilitating interaction prediction. To exploit these cues, we propose an instance-centric attention module that learns to dynamically highlight regions in an image conditioned on the appearance of each instance. Such an attention-based network allows us to selectively aggregate features relevant for recognizing HOIs. We validate the efficacy of the proposed network on the Verb in COCO and HICO-DET datasets and show that our approach compares favorably with the state-of-the-arts.

* BMVC 2018. Project webpage: https://gaochen315.github.io/iCAN/ Code: https://github.com/vt-vl-lab/iCAN

Via

Access Paper or Ask Questions

Learning to Generate Long-term Future via Hierarchical Prediction

Jan 08, 2018

Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, Honglak Lee

Figure 1 for Learning to Generate Long-term Future via Hierarchical Prediction

Figure 2 for Learning to Generate Long-term Future via Hierarchical Prediction

Figure 3 for Learning to Generate Long-term Future via Hierarchical Prediction

Figure 4 for Learning to Generate Long-term Future via Hierarchical Prediction

Abstract:We propose a hierarchical approach for making long-term predictions of future frames. To avoid inherent compounding errors in recursive pixel-level prediction, we propose to first estimate high-level structure in the input frames, then predict how that structure evolves in the future, and finally by observing a single frame from the past and the predicted high-level structure, we construct the future frames without having to observe any of the pixel-level predictions. Long-term video prediction is difficult to perform by recurrently observing the predicted frames because the small errors in pixel space exponentially amplify as predictions are made deeper into the future. Our approach prevents pixel-level error propagation from happening by removing the need to observe the predicted frames. Our model is built with a combination of LSTM and analogy based encoder-decoder convolutional neural networks, which independently predict the video structure and generate the future frames, respectively. In experiments, our model is evaluated on the Human3.6M and Penn Action datasets on the task of long-term pixel-level video prediction of humans performing actions and demonstrate significantly better results than the state-of-the-art.

* International Conference on Machine Learning (ICML) 2017

Via

Access Paper or Ask Questions