Shijie Li

DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation

Jan 26, 2026
Via arXiv

VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

Jan 25, 2026
Via arXiv

Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

Oct 02, 2025
Via arXiv

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Aug 08, 2025
Via arXiv

CogStream: Context-guided Streaming Video Question Answering

Jun 12, 2025
Via arXiv

Zero-Shot 3D Visual Grounding from Vision-Language Models

May 28, 2025
Via arXiv

Multi-View Industrial Anomaly Detection with Epipolar Constrained Cross-View Fusion

Mar 14, 2025
Via arXiv

Global-Aware Monocular Semantic Scene Completion with State Space Models

Mar 09, 2025
Via arXiv

Future-Aware Interaction Network For Motion Forecasting

Mar 09, 2025
Via arXiv

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

Dec 05, 2024
Via arXiv