Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ziyi Wu

Memory-based Ensemble Learning in CMR Semantic Segmentation

Feb 13, 2025

Yiwei Liu, Ziyi Wu, Liang Zhong, Linyi Wen, Yuankai Wu

Abstract:Existing models typically segment either the entire 3D frame or 2D slices independently to derive clinical functional metrics from ventricular segmentation in cardiac cine sequences. While performing well overall, they struggle at the end slices. To address this, we leverage spatial continuity to extract global uncertainty from segmentation variance and use it as memory in our ensemble learning method, Streaming, for classifier weighting, balancing overall and end-slice performance. Additionally, we introduce the End Coefficient (EC) to quantify end-slice accuracy. Experiments on ACDC and M\&Ms datasets show that our framework achieves near-state-of-the-art Dice Similarity Coefficient (DSC) and outperforms all models on end-slice performance, improving patient-specific segmentation accuracy.

Via

Access Paper or Ask Questions

Mind the Time: Temporally-Controlled Multi-Event Video Generation

Dec 06, 2024

Ziyi Wu, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Yuwei Fang, Varnith Chordia, Igor Gilitschenski, Sergey Tulyakov

Abstract:Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE. This encoding helps to guide the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our approach produces coherent videos with smoothly connected events. For the first time in the literature, our model offers control over the timing of events in generated videos. Extensive experiments demonstrate that MinT outperforms existing open-source models by a large margin.

* Project Page: https://mint-video.github.io/

Via

Access Paper or Ask Questions

SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

Nov 07, 2024

Koichi Namekata, Sherwin Bahmani, Ziyi Wu, Yash Kant, Igor Gilitschenski, David B. Lindell

Figure 1 for SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

Figure 2 for SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

Figure 3 for SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

Figure 4 for SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

Abstract:Methods for image-to-video generation have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error, e.g., involving re-generating videos with different random seeds. Recent techniques address this issue by fine-tuning a pre-trained model to follow conditioning signals, such as bounding boxes or point trajectories. Yet, this fine-tuning procedure can be computationally expensive, and it requires datasets with annotated object motion, which can be difficult to procure. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided$\unicode{x2013}$offering zero-shot control by relying solely on the knowledge present in a pre-trained image-to-video diffusion model without the need for fine-tuning or external knowledge. Our zero-shot method outperforms unsupervised baselines while being competitive with supervised models in terms of visual quality and motion fidelity.

* Project page: https://kmcode1.github.io/Projects/SG-I2V/

Via

Access Paper or Ask Questions

Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

Jun 13, 2024

Ziyi Wu, Yulia Rubanova, Rishabh Kabra, Drew A. Hudson, Igor Gilitschenski, Yusuf Aytar, Sjoerd van Steenkiste, Kelsey R. Allen, Thomas Kipf

Figure 1 for Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

Figure 2 for Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

Figure 3 for Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

Figure 4 for Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

Abstract:We address the problem of multi-object 3D pose control in image diffusion models. Instead of conditioning on a sequence of text tokens, we propose to use a set of per-object representations, Neural Assets, to control the 3D pose of individual objects in a scene. Neural Assets are obtained by pooling visual representations of objects from a reference image, such as a frame in a video, and are trained to reconstruct the respective objects in a different image, e.g., a later frame in the video. Importantly, we encode object visuals from the reference image while conditioning on object poses from the target frame. This enables learning disentangled appearance and pose features. Combining visual and 3D pose representations in a sequence-of-tokens format allows us to keep the text-to-image architecture of existing models, with Neural Assets in place of text tokens. By fine-tuning a pre-trained text-to-image diffusion model with this information, our approach enables fine-grained 3D pose and placement control of individual objects in a scene. We further demonstrate that Neural Assets can be transferred and recomposed across different scenes. Our model achieves state-of-the-art multi-object editing results on both synthetic 3D scene datasets, as well as two real-world video datasets (Objectron, Waymo Open).

* Additional details and video results are available at https://neural-assets-paper.github.io/

Via

Access Paper or Ask Questions

SPAD : Spatially Aware Multiview Diffusers

Feb 07, 2024

Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, Aliaksandr Siarohin

Abstract:We present SPAD, a novel approach for creating consistent multi-view images from text prompts or single images. To enable multi-view generation, we repurpose a pretrained 2D diffusion model by extending its self-attention layers with cross-view interactions, and fine-tune it on a high quality subset of Objaverse. We find that a naive extension of the self-attention proposed in prior work (e.g. MVDream) leads to content copying between views. Therefore, we explicitly constrain the cross-view attention based on epipolar geometry. To further enhance 3D consistency, we utilize Plucker coordinates derived from camera rays and inject them as positional encoding. This enables SPAD to reason over spatial proximity in 3D well. In contrast to recent works that can only generate views at fixed azimuth and elevation, SPAD offers full camera control and achieves state-of-the-art results in novel view synthesis on unseen objects from the Objaverse and Google Scanned Objects datasets. Finally, we demonstrate that text-to-3D generation using SPAD prevents the multi-face Janus issue. See more details at our webpage: https://yashkant.github.io/spad

* Webpage: https://yashkant.github.io/spad

Via

Access Paper or Ask Questions

LEOD: Label-Efficient Object Detection for Event Cameras

Nov 29, 2023

Ziyi Wu, Mathias Gehrig, Qing Lyu, Xudong Liu, Igor Gilitschenski

Figure 1 for LEOD: Label-Efficient Object Detection for Event Cameras

Figure 2 for LEOD: Label-Efficient Object Detection for Event Cameras

Figure 3 for LEOD: Label-Efficient Object Detection for Event Cameras

Figure 4 for LEOD: Label-Efficient Object Detection for Event Cameras

Abstract:Object detection with event cameras enjoys the property of low latency and high dynamic range, making it suitable for safety-critical scenarios such as self-driving. However, labeling event streams with high temporal resolutions for supervised training is costly. We address this issue with LEOD, the first framework for label-efficient event-based detection. Our method unifies weakly- and semi-supervised object detection with a self-training mechanism. We first utilize a detector pre-trained on limited labels to produce pseudo ground truth on unlabeled events, and then re-train the detector with both real and generated labels. Leveraging the temporal consistency of events, we run bi-directional inference and apply tracking-based post-processing to enhance the quality of pseudo labels. To stabilize training, we further design a soft anchor assignment strategy to mitigate the noise in labels. We introduce new experimental protocols to evaluate the task of label-efficient event-based detection on Gen1 and 1Mpx datasets. LEOD consistently outperforms supervised baselines across various labeling ratios. For example, on Gen1, it improves mAP by 8.6% and 7.8% for RVT-S trained with 1% and 2% labels. On 1Mpx, RVT-S with 10% labels even surpasses its fully-supervised counterpart using 100% labels. LEOD maintains its effectiveness even when all labeled data are available, reaching new state-of-the-art results. Finally, we show that our method readily scales to improve larger detectors as well.

Via

Access Paper or Ask Questions

EventCLIP: Adapting CLIP for Event-based Object Recognition

Jun 10, 2023

Ziyi Wu, Xudong Liu, Igor Gilitschenski

Abstract:Recent advances in 2D zero-shot and few-shot recognition often leverage large pre-trained vision-language models (VLMs) such as CLIP. Due to a shortage of suitable datasets, it is currently infeasible to train such models for event camera data. Thus, leveraging existing models across modalities is an important research challenge. In this work, we propose EventCLIP, a new method that utilizes CLIP for zero-shot and few-shot recognition on event camera data. First, we demonstrate the suitability of CLIP's image embeddings for zero-shot event classification by converting raw events to 2D grid-based representations. Second, we propose a feature adapter that aggregates temporal information over event frames and refines text embeddings to better align with the visual inputs. We evaluate our work on N-Caltech, N-Cars, and N-ImageNet datasets under the few-shot learning setting, where EventCLIP achieves state-of-the-art performance. Finally, we show that the robustness of existing event-based classifiers against data variations can be further boosted by ensembling with EventCLIP.

* Pre-print, work in progress

Via

Access Paper or Ask Questions

SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models

May 18, 2023

Ziyi Wu, Jingyu Hu, Wuyue Lu, Igor Gilitschenski, Animesh Garg

Abstract:Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.

* Project page: https://slotdiffusion.github.io/ . An earlier version of this work appeared at the ICLR 2023 Workshop on Neurosymbolic Generative Models: https://nesygems.github.io/assets/pdf/papers/SlotDiffusion.pdf

Via

Access Paper or Ask Questions

Breaking Bad: A Dataset for Geometric Fracture and Reassembly

Oct 20, 2022

Silvia Sellán, Yun-Chun Chen, Ziyi Wu, Animesh Garg, Alec Jacobson

Figure 1 for Breaking Bad: A Dataset for Geometric Fracture and Reassembly

Figure 2 for Breaking Bad: A Dataset for Geometric Fracture and Reassembly

Figure 3 for Breaking Bad: A Dataset for Geometric Fracture and Reassembly

Figure 4 for Breaking Bad: A Dataset for Geometric Fracture and Reassembly

Abstract:We introduce Breaking Bad, a large-scale dataset of fractured objects. Our dataset consists of over one million fractured objects simulated from ten thousand base models. The fracture simulation is powered by a recent physically based algorithm that efficiently generates a variety of fracture modes of an object. Existing shape assembly datasets decompose objects according to semantically meaningful parts, effectively modeling the construction process. In contrast, Breaking Bad models the destruction process of how a geometric object naturally breaks into fragments. Our dataset serves as a benchmark that enables the study of fractured object reassembly and presents new challenges for geometric shape understanding. We analyze our dataset with several geometry measurements and benchmark three state-of-the-art shape assembly deep learning methods under various settings. Extensive experimental results demonstrate the difficulty of our dataset, calling on future research in model designs specifically for the geometric shape assembly task. We host our dataset at https://breaking-bad-dataset.github.io/.

* NeurIPS 2022 Track on Datasets and Benchmarks. The first three authors contributed equally to this work. Project page: https://breaking-bad-dataset.github.io/ Code: https://github.com/Wuziyi616/multi_part_assembly Dataset: https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP3/LZNPKB

Via

Access Paper or Ask Questions

SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

Oct 12, 2022

Ziyi Wu, Nikita Dvornik, Klaus Greff, Thomas Kipf, Animesh Garg

Figure 1 for SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

Figure 2 for SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

Figure 3 for SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

Figure 4 for SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

Abstract:Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively still remains a challenge. We address this problem by introducing SlotFormer -- a Transformer-based autoregressive model operating on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions. Moreover, the unsupervised SlotFormer's dynamics model can be used to improve the performance on supervised downstream tasks, such as Visual Question Answering (VQA), and goal-conditioned planning. Compared to past works on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics, while retaining high quality visual generation. Besides, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.

* Project page: https://slotformer.github.io/

Via

Access Paper or Ask Questions