Hanoona Rasheed

Perception Encoder: The best visual embeddings are not at the output of the network

Apr 17, 2025

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Apr 17, 2025

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

Jun 13, 2024

PALO: A Polyglot Large Multimodal Model for 5B People

Mar 05, 2024

GLaMM: Pixel Grounding Large Multimodal Model

Nov 06, 2023

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Jun 08, 2023

SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications

Mar 27, 2023

UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation

Dec 08, 2022

Fine-tuned CLIP Models are Efficient Video Learners

Dec 06, 2022

MaPLe: Multi-modal Prompt Learning

Oct 06, 2022