Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Guobing Gan

GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Jul 02, 2025

GLM-V Team, :, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi(+69 more)

Figure 1 for GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Figure 2 for GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Figure 3 for GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Figure 4 for GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Abstract:We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. We open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at https://github.com/THUDM/GLM-4.1V-Thinking.

Via

Access Paper or Ask Questions

HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads

Feb 09, 2025

Guobing Gan, Kaiming Gao, Li Wang, Shen Jiang, Peng Jiang

Figure 1 for HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads

Figure 2 for HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads

Figure 3 for HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads

Figure 4 for HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads

Abstract:Search advertising is essential for merchants to reach the target users on short video platforms. Short video ads aligned with user search intents are displayed through relevance matching and bid ranking mechanisms. This paper focuses on improving query-to-video relevance matching to enhance the effectiveness of ranking in ad systems. Recent vision-language pre-training models have demonstrated promise in various multimodal tasks. However, their contribution to downstream query-video relevance tasks is limited, as the alignment between the pair of visual signals and text differs from the modeling of the triplet of the query, visual signals, and video text. In addition, our previous relevance model provides limited ranking capabilities, largely due to the discrepancy between the binary cross-entropy fine-tuning objective and the ranking objective. To address these limitations, we design a high-consistency multimodal relevance model (HCMRM). It utilizes a simple yet effective method to enhance the consistency between pre-training and relevance tasks. Specifically, during the pre-training phase, along with aligning visual signals and video text, several keywords are extracted from the video text as pseudo-queries to perform the triplet relevance modeling. For the fine-tuning phase, we introduce a hierarchical softmax loss, which enables the model to learn the order within labels while maximizing the distinction between positive and negative samples. This promotes the fusion ranking of relevance and bidding in the subsequent ranking stage. The proposed method has been deployed in the Kuaishou search advertising system for over a year, contributing to a 6.1% reduction in the proportion of irrelevant ads and a 1.4% increase in ad revenue.

* Accepted by WWW 2025 (Industry Track)

Via

Access Paper or Ask Questions

MorphTE: Injecting Morphology in Tensorized Embeddings

Oct 27, 2022

Guobing Gan, Peng Zhang, Sunzhu Li, Xiuqing Lu, Benyou Wang

Abstract:In the era of deep learning, word embeddings are essential when dealing with text tasks. However, storing and accessing these embeddings requires a large amount of space. This is not conducive to the deployment of these models on resource-limited devices. Combining the powerful compression capability of tensor products, we propose a word embedding compression method with morphological augmentation, Morphologically-enhanced Tensorized Embeddings (MorphTE). A word consists of one or more morphemes, the smallest units that bear meaning or have a grammatical function. MorphTE represents a word embedding as an entangled form of its morpheme vectors via the tensor product, which injects prior semantic and grammatical knowledge into the learning of embeddings. Furthermore, the dimensionality of the morpheme vector and the number of morphemes are much smaller than those of words, which greatly reduces the parameters of the word embeddings. We conduct experiments on tasks such as machine translation and question answering. Experimental results on four translation datasets of different languages show that MorphTE can compress word embedding parameters by about 20 times without performance loss and significantly outperforms related embedding compression methods.

* 20 pages, 6 figures, 18 tables. Published at NeurIPS 2022

Via

Access Paper or Ask Questions