Zhixiong Zeng

ScaleTrack: Scaling and back-tracking Automated GUI Agents

May 01, 2025

Jing Huang, Zhixiong Zeng, Wenkang Han, Yufeng Zhong, Liming Zheng, Shuai Fu, Jingyuan Chen, Lin Ma

Abstract: Automated GUI agents aim to facilitate user interaction by automatically performing complex tasks in digital environments, such as web, mobile, and desktop devices. An agent receives a textual task instruction and a GUI description, and generates executable actions (e.g., click) and operation boxes step by step. Training a GUI agent mainly involves grounding and planning stages: GUI grounding focuses on finding the execution coordinates according to the task, while the planning stage aims to predict the next action based on historical actions. However, previous work suffers from insufficient training data for GUI grounding and neglects backtracking of historical behaviors in GUI planning. To handle these challenges, we propose ScaleTrack, a training framework that scales grounding and backtracks planning for automated GUI agents. We carefully collected GUI samples of different synthesis criteria from a wide range of sources, and unified them into the same template for training GUI grounding models. Moreover, we design a novel training strategy that predicts the next action from the current GUI image, while also backtracking the historical actions that led to the GUI image. In this way, ScaleTrack explains the correspondence between GUI images and actions, which effectively describes the evolution rules of the GUI environment. Extensive experimental results demonstrate the effectiveness of ScaleTrack. Data and code will be available at url.

Learning Multi-Stage Multi-Grained Semantic Embeddings for E-Commerce Search

Mar 20, 2023

Binbin Wang, Mingming Li, Zhixiong Zeng, Jingwei Zhuo, Songlin Wang, Sulong Xu, Bo Long, Weipeng Yan

Abstract: Retrieving relevant items that match users' queries from a billion-scale corpus forms the core of industrial e-commerce search systems, in which embedding-based retrieval (EBR) methods are prevalent. These methods adopt a two-tower framework to learn embedding vectors for queries and items separately, and thus leverage efficient approximate nearest neighbor (ANN) search to retrieve relevant items. However, existing EBR methods usually ignore inconsistent user behaviors in industrial multi-stage search systems, resulting in insufficient retrieval efficiency with a low commercial return. To tackle this challenge, we propose to improve EBR methods by learning Multi-level Multi-Grained Semantic Embeddings (MMSE). We propose multi-stage information mining to exploit the ordered, clicked, unclicked, and randomly sampled items in practical user behavior data, and then capture query-item similarity via a post-fusion strategy. We then propose multi-grained learning objectives that integrate a retrieval loss with global comparison ability and a ranking loss with local comparison ability to generate semantic embeddings. Experiments on a real-world billion-scale dataset and online A/B tests both verify the effectiveness of MMSE in achieving significant performance improvements on metrics such as offline recall and online conversion rate (CVR).
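
To make the idea of combining a retrieval loss (global comparison) with a ranking loss (local comparison) concrete, here is a minimal sketch in plain Python. It is not the MMSE implementation: the softmax cross-entropy, the pairwise hinge loss, the assumed preference order (ordered > clicked > unclicked), the use of random items as negatives, and all function and parameter names (`retrieval_loss`, `ranking_loss`, `combined_loss`, `alpha`, `margin`) are illustrative assumptions, not details taken from the paper.

```python
import math

def dot(u, v):
    """Inner product used as the query-item similarity of a two-tower model."""
    return sum(a * b for a, b in zip(u, v))

def retrieval_loss(query, positive, negatives, temperature=1.0):
    """Softmax cross-entropy of one positive against many negatives.

    Assumed stand-in for a loss with 'global comparison ability'.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]

def ranking_loss(query, better, worse, margin=0.1):
    """Pairwise hinge loss: 'better' should outscore 'worse' by a margin.

    Assumed stand-in for a loss with 'local comparison ability'.
    """
    return max(0.0, margin - (dot(query, better) - dot(query, worse)))

def combined_loss(query, ordered, clicked, unclicked, random_items, alpha=0.5):
    """Hypothetical combination: retrieval loss plus weighted ranking terms."""
    retrieval = retrieval_loss(query, ordered, random_items)
    ranking = (ranking_loss(query, ordered, clicked)
               + ranking_loss(query, clicked, unclicked))
    return retrieval + alpha * ranking

if __name__ == "__main__":
    q = [1.0, 0.0]
    print(combined_loss(q, [1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [[0.0, 1.0]]))
```

In this toy setup the ranking terms vanish once the ordered, clicked, and unclicked items are separated by more than the margin, leaving only the retrieval term to contrast the positive against the random negatives.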

A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval

Jan 08, 2022

Zhixiong Zeng, Wenji Mao

Abstract: Cross-Modal Retrieval (CMR) is an important research topic across multimodal computing and information retrieval, which takes one type of data as the query to retrieve relevant data of another type, and has been widely used in many real-world applications. Recently, the vision-language pre-trained model represented by CLIP has demonstrated its superiority in learning visual and textual representations and its impressive performance on various vision- and language-related tasks. Although CLIP and earlier pre-trained models have shown great performance improvements in unsupervised CMR, the performance and impact of these pre-trained models on supervised CMR have rarely been explored due to the lack of multimodal class-level associations. In this paper, we take CLIP as the current representative vision-language pre-trained model to conduct a comprehensive empirical study and provide insights on its performance and impact on supervised CMR. To this end, we first propose a novel model, CLIP4CMR (CLIP For supervised Cross-Modal Retrieval), that employs pre-trained CLIP as the backbone network to perform supervised CMR. We then revisit the existing loss function designs in CMR, including the most common pair-wise losses, class-wise losses, and hybrid ones, and provide insights on applying CLIP. Moreover, we investigate several issues of concern in supervised CMR and provide new perspectives for this field via CLIP4CMR, including the robustness to modality imbalance and the sensitivity to hyper-parameters. Extensive experimental results show that CLIP4CMR achieves SOTA results with significant improvements on the benchmark datasets Wikipedia, NUS-WIDE, Pascal-Sentence and XmediaNet. Our data and code are publicly available at https://github.com/zhixiongz/CLIP4CMR.
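
The distinction between pair-wise, class-wise, and hybrid losses can be illustrated with a small sketch. This is not the CLIP4CMR code: the contrastive form of the pair-wise loss, the prototype-based cross-entropy used as the class-wise loss, and every name below (`pairwise_loss`, `classwise_loss`, `hybrid_loss`, `margin`, `weight`) are assumptions chosen for illustration, with short lists of floats standing in for CLIP image and text embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def pairwise_loss(image_emb, text_emb, same_class, margin=0.5):
    """Contrastive loss on one image-text pair (assumed form).

    Same-class pairs are pulled together; other pairs are pushed
    at least `margin` apart in cosine distance.
    """
    distance = 1.0 - cosine(image_emb, text_emb)
    if same_class:
        return distance ** 2
    return max(0.0, margin - distance) ** 2

def classwise_loss(emb, class_prototypes, label):
    """Cross-entropy over similarities to per-class prototypes (assumed form)."""
    logits = [cosine(emb, p) for p in class_prototypes]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]

def hybrid_loss(image_emb, text_emb, same_class, class_prototypes,
                image_label, text_label, weight=1.0):
    """Hypothetical hybrid: class-wise terms for both modalities plus a pair-wise term."""
    class_term = (classwise_loss(image_emb, class_prototypes, image_label)
                  + classwise_loss(text_emb, class_prototypes, text_label))
    return class_term + weight * pairwise_loss(image_emb, text_emb, same_class)

if __name__ == "__main__":
    prototypes = [[1.0, 0.0], [0.0, 1.0]]
    print(hybrid_loss([1.0, 0.0], [1.0, 0.0], True, prototypes, 0, 0))
```

The pair-wise term only needs to know whether two samples share a class, while the class-wise term needs the class labels themselves, which is why the two are often combined in supervised settings.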

AliMe MKG: A Multi-modal Knowledge Graph for Live-streaming E-commerce

Sep 13, 2021

Guohai Xu, Hehong Chen, Feng-Lin Li, Fu Sun, Yunzhou Shi, Zhixiong Zeng, Wei Zhou, Zhongzhou Zhao, Ji Zhang

Abstract: Live streaming is becoming an increasingly popular sales channel in e-commerce. The core of live-streaming sales is to encourage customers to purchase in an online broadcasting room. To enable customers to better understand a product without leaving the room, we propose AliMe MKG, a multi-modal knowledge graph that aims at providing a cognitive profile for products, through which customers are able to seek information about and understand a product. Based on the MKG, we build an online live assistant that highlights product search, product exhibition, and question answering, allowing customers to skim over the item list, view item details, and ask item-related questions. Our system has been launched online in the Taobao app, and currently serves hundreds of thousands of customers per day.

* CIKM2021
