Wenyi Hong

CogVLM2: Visual Language Models for Image and Video Understanding

Aug 29, 2024

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Aug 12, 2024

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Aug 12, 2024

LVBench: An Extreme Long Video Understanding Benchmark

Jun 12, 2024

Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

May 08, 2024

CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

Feb 06, 2024

CogAgent: A Visual Language Model for GUI Agents

Dec 21, 2023

CogVLM: Visual Expert for Pretrained Language Models

Nov 06, 2023

Relay Diffusion: Unifying diffusion process across resolutions for image synthesis

Sep 04, 2023

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

May 29, 2022