Enming Zhang

MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving

Sep 11, 2024
Via arXiv

First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models

Jun 18, 2024
Via arXiv

Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models

Apr 17, 2024
Via arXiv

PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

Mar 21, 2024
Via arXiv

A Simple Knowledge Distillation Framework for Open-world Object Detection

Dec 14, 2023
Via arXiv

Looking and Listening: Audio Guided Text Recognition

Jun 06, 2023
Via arXiv

Detecting the open-world objects with the help of the Brain

Mar 21, 2023
Via arXiv