Shijie Geng

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

May 09, 2024

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Feb 08, 2024

Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs

Sep 27, 2023

VIP5: Towards Multimodal Foundation Models for Recommendation

May 23, 2023

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Apr 28, 2023

Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens

Mar 27, 2023

HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention

Mar 06, 2023

Mono-STAR: Mono-camera Scene-level Tracking and Reconstruction

Jan 30, 2023

Frozen CLIP Models are Efficient Video Learners

Aug 06, 2022

Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning

Jul 20, 2022