Hongze Shen

Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

Jan 27, 2026

Zhixiang Wei, Yi Li, Zhehan Kan, Xinghua Jiang, Zuwei Long, Shifeng Liu, Hongze Shen, Wei Liu, Xiaoyu Tan, Haojia Lin(+31 more)

Abstract: Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
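The shift from "vision-as-input" to "vision-as-target" described in the abstract can be illustrated with a toy next-token loss. The sketch below is not the Youtu-VL implementation. It assumes, purely for illustration, that visual content is represented as discrete token ids sharing one vocabulary with text, and the names `cross_entropy`, `autoregressive_loss`, and `supervise_visual` are hypothetical. It only shows the difference between masking visual positions out of the loss and supervising every position.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def autoregressive_loss(step_logits, targets, is_visual, supervise_visual):
    """Mean next-token loss over the supervised positions.

    step_logits[i] are the logits used to predict targets[i].
    is_visual[i] is True when targets[i] is a (hypothetical) visual token.
    With supervise_visual=False, visual positions are masked out, which
    mimics a text-only objective; with True, every position is a target.
    """
    losses = [
        cross_entropy(logits, target)
        for logits, target, visual in zip(step_logits, targets, is_visual)
        if supervise_visual or not visual
    ]
    if not losses:
        raise ValueError("no supervised positions in the sequence")
    return sum(losses) / len(losses)


# Toy sequence: two visual tokens followed by two text tokens, vocabulary of 4.
step_logits = [[0, 0, 0, 0], [0, 0, 0, 0], [5, 0, 0, 0], [0, 5, 0, 0]]
targets = [1, 2, 0, 1]
is_visual = [True, True, False, False]

text_only = autoregressive_loss(step_logits, targets, is_visual, False)
unified = autoregressive_loss(step_logits, targets, is_visual, True)
```

In this toy case the model predicts the text tokens well but is uniform over the visual tokens, so the text-only loss stays small while the unified loss exposes the poorly modeled visual positions.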

HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding

Oct 09, 2024

Keliang Li, Zaifei Yang, Jiahe Zhao, Hongze Shen, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen

Abstract: The significant advancements in visual understanding and instruction following from Multimodal Large Language Models (MLLMs) have opened up more possibilities for broader applications in diverse and universal human-centric scenarios. However, existing image-text data may not support the precise modality alignment and integration of multi-grained information, which is crucial for human-centric visual understanding. In this paper, we introduce HERM-Bench, a benchmark for evaluating the human-centric understanding capabilities of MLLMs. Our work reveals the limitations of existing MLLMs in understanding complex human-centric scenarios. To address these challenges, we present HERM-100K, a comprehensive dataset with multi-level human-centric annotations, aimed at enhancing MLLMs' training. Furthermore, we develop HERM-7B, an MLLM that leverages enhanced training data from HERM-100K. Evaluations on HERM-Bench demonstrate that HERM-7B significantly outperforms existing MLLMs across various human-centric dimensions, reflecting the current inadequacy of data annotations used in MLLM training for human-centric visual understanding. This research emphasizes the importance of specialized datasets and benchmarks in advancing MLLMs' capabilities for human-centric understanding.

How Drones Look: Crowdsourced Knowledge Transfer for Aerial Video Saliency Prediction

Nov 14, 2018

Kui Fu, Jia Li, Hongze Shen, Yonghong Tian

Abstract: In ground-level platforms, many saliency models have been developed to perceive the visual world as humans do. However, they may not fit a drone that can look from many abnormal viewpoints. To address this problem, this paper proposes a Crowdsourced Multi-path Network (CMNet) that transfers the ground-level knowledge for spatiotemporal saliency prediction in aerial videos. To train CMNet, we first collect and fuse the eye-tracking data of 24 subjects on 1,000 aerial videos to annotate the ground-truth salient regions. Inspired by the crowdsourced annotation in eye-tracking experiments, we design a multi-path architecture for CMNet, in which each path is initialized under the supervision of a classic ground-level saliency model. After that, the most representative paths are selected in a data-driven manner, which are then fused and simultaneously fine-tuned on aerial videos. In this manner, the prior knowledge in various classic ground-level saliency models can be transferred into CMNet so as to improve its capability in processing aerial videos. Finally, the spatial predictions given by CMNet are adaptively refined by incorporating the temporal saliency predictions via a spatiotemporal saliency optimization algorithm. Experimental results show that the proposed approach outperforms ten state-of-the-art models in predicting visual saliency in aerial videos.
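The abstract does not say how the eye-tracking data of the 24 subjects are fused. As a rough illustration only, one common approach is to accumulate every subject's fixation points on a frame into a single count map and normalize it. The sketch below assumes that approach. The function name `fuse_fixations` and the `(row, col)` fixation format are hypothetical, and the spatial smoothing that real pipelines typically apply is omitted.

```python
def fuse_fixations(fixations_per_subject, height, width):
    """Fuse per-subject fixation points into one normalized saliency map.

    fixations_per_subject: one list of (row, col) fixations per subject
    (a hypothetical input format). Returns a height x width grid of
    values in [0, 1], where 1 marks the most frequently fixated pixel.
    """
    counts = [[0.0] * width for _ in range(height)]
    for fixations in fixations_per_subject:
        for row, col in fixations:
            # Ignore fixations that fall outside the frame.
            if 0 <= row < height and 0 <= col < width:
                counts[row][col] += 1.0

    peak = max(max(r) for r in counts)
    if peak == 0:
        return counts  # no valid fixations: all-zero map
    return [[value / peak for value in r] for r in counts]


# Three subjects looking at a 2x2 frame; one fixation lies off-frame.
subjects = [[(0, 0), (1, 1)], [(0, 0)], [(0, 0), (5, 5)]]
saliency = fuse_fixations(subjects, height=2, width=2)
```

Here all three subjects fixate the top-left pixel and one fixates the bottom-right pixel, so the fused map peaks at the top-left.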
