Fahad Shahbaz Khan

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

Jan 10, 2025
Via arXiv

Mask Factory: Towards High-quality Synthetic Data Generation for Dichotomous Image Segmentation

Dec 26, 2024
Via arXiv

Discriminative Image Generation with Diffusion Models for Zero-Shot Learning

Dec 23, 2024
Via arXiv

EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues

Dec 19, 2024
Via arXiv

UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities

Dec 13, 2024
Via arXiv

Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook

Nov 29, 2024
Via arXiv

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

Nov 28, 2024
Via arXiv

ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction

Nov 12, 2024
Via arXiv

Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis

Nov 11, 2024
Via arXiv

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Nov 07, 2024
Via arXiv