Zhenheng Yang

Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning

Mar 17, 2025

Long Context Tuning for Video Generation

Mar 13, 2025

UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths

Feb 10, 2025

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Jan 06, 2025

Parallelized Autoregressive Visual Generation

Dec 19, 2024

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Dec 12, 2024

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Aug 22, 2024

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Jul 02, 2024

Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency

Mar 23, 2021

SPAN: Spatial Pyramid Attention Network for Image Manipulation Localization

Sep 01, 2020