Xiaobin Zhu

DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework

Mar 19, 2025

Henrique Morimitsu, Xiaobin Zhu, Roberto M. Cesar Jr., Xiangyang Ji, Xu-Cheng Yin

Abstract:Optical flow estimation is essential for video processing tasks, such as restoration and action recognition. The quality of videos is constantly increasing, with current standards reaching 8K resolution. However, optical flow methods are usually designed for low resolution and do not generalize to large inputs due to their rigid architectures. They adopt downscaling or input tiling to reduce the input size, causing a loss of details and global information. There is also a lack of optical flow benchmarks to judge the actual performance of existing methods on high-resolution samples. Previous works only conducted qualitative high-resolution evaluations on hand-picked samples. This paper fills this gap in optical flow estimation in two ways. We propose DPFlow, an adaptive optical flow architecture capable of generalizing up to 8K resolution inputs while trained with only low-resolution samples. We also introduce Kubric-NK, a new benchmark for evaluating optical flow methods with input resolutions ranging from 1K to 8K. Our high-resolution evaluation pushes the boundaries of existing methods and reveals new insights about their generalization capabilities. Extensive experimental results show that DPFlow achieves state-of-the-art results on the MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks.

* Accepted at CVPR 2025. The code and dataset are available at https://github.com/hmorimitsu/ptlflow/tree/main/ptlflow/models/dpflow. 24 pages, 17 figures

Video-Language Alignment Pre-training via Spatio-Temporal Graph Transformer

Jul 16, 2024

Shi-Xue Zhang, Hongfa Wang, Xiaobin Zhu, Weibo Gu, Tianjin Zhang, Chun Yang, Wei Liu, Xu-Cheng Yin

Abstract:Video-language alignment is a crucial multi-modal task that benefits various downstream applications, e.g., video-text retrieval and video question answering. Existing methods either utilize multi-modal information in video-text pairs or apply global and local alignment techniques to promote alignment precision. However, these methods often fail to fully explore the spatio-temporal relationships among vision tokens within video and across different video-text pairs. In this paper, we propose a novel Spatio-Temporal Graph Transformer module to uniformly learn spatial and temporal contexts for video-language alignment pre-training (dubbed STGT). Specifically, our STGT combines spatio-temporal graph structure information with attention in transformer block, effectively utilizing the spatio-temporal contexts. In this way, we can model the relationships between vision tokens, promoting video-text alignment precision for benefiting downstream tasks. In addition, we propose a self-similarity alignment loss to explore the inherent self-similarity in the video and text. With the initial optimization achieved by contrastive learning, it can further promote the alignment accuracy between video and text. Experimental results on challenging downstream tasks, including video-text retrieval and video question answering, verify the superior performance of our method.

* under review

Arbitrary Time Information Modeling via Polynomial Approximation for Temporal Knowledge Graph Embedding

May 01, 2024

Zhiyu Fang, Jingyan Qin, Xiaobin Zhu, Chun Yang, Xu-Cheng Yin

Abstract:Distinguished from traditional knowledge graphs (KGs), temporal knowledge graphs (TKGs) must explore and reason over temporally evolving facts adequately. However, existing TKG approaches still face two main challenges, i.e., the limited capability to model arbitrary timestamps continuously and the lack of rich inference patterns under temporal constraints. In this paper, we propose an innovative TKGE method (PTBox) via polynomial decomposition-based temporal representation and box embedding-based entity representation to tackle the above-mentioned problems. Specifically, we decompose time information by polynomials and then enhance the model's capability to represent arbitrary timestamps flexibly by incorporating the learnable temporal basis tensor. In addition, we model every entity as a hyperrectangle box and define each relation as a transformation on the head and tail entity boxes. The entity boxes can capture complex geometric structures and learn robust representations, improving the model's inductive capability for rich inference patterns. Theoretically, our PTBox can encode arbitrary time information or even unseen timestamps while capturing rich inference patterns and higher-arity relations of the knowledge base. Extensive experiments on real-world datasets demonstrate the effectiveness of our method.

* Accepted by LREC-COLING 2024 (long paper, camera-ready version)

Transformer-based Reasoning for Learning Evolutionary Chain of Events on Temporal Knowledge Graph

May 01, 2024

Zhiyu Fang, Shuai-Long Lei, Xiaobin Zhu, Chun Yang, Shi-Xue Zhang, Xu-Cheng Yin, Jingyan Qin

Abstract:Temporal Knowledge Graph (TKG) reasoning often involves completing missing factual elements along the timeline. Although existing methods can learn good embeddings for each factual element in quadruples by integrating temporal information, they often fail to infer the evolution of temporal facts. This is mainly because of (1) insufficiently exploring the internal structure and semantic relationships within individual quadruples and (2) inadequately learning a unified representation of the contextual and temporal correlations among different quadruples. To overcome these limitations, we propose a novel Transformer-based reasoning model (dubbed ECEformer) for TKG to learn the Evolutionary Chain of Events (ECE). Specifically, we unfold the neighborhood subgraph of an entity node in chronological order, forming an evolutionary chain of events as the input for our model. Subsequently, we utilize a Transformer encoder to learn the embeddings of intra-quadruples for ECE. We then craft a mixed-context reasoning module based on the multi-layer perceptron (MLP) to learn the unified representations of inter-quadruples for ECE while accomplishing temporal knowledge reasoning. In addition, to enhance the timeliness of the events, we devise an additional time prediction task to complete effective temporal information within the learned unified representation. Extensive experiments on six benchmark datasets verify the state-of-the-art performance and the effectiveness of our method.

* Accepted by SIGIR 2024 (full paper track, camera-ready version)
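
The chronological unfolding step described in the ECEformer abstract can be illustrated with a minimal sketch. It assumes quadruples are plain `(head, relation, tail, timestamp)` tuples with integer timestamps; the name `evolutionary_chain` and this data layout are hypothetical and are not taken from the authors' code. Only the input construction is shown, not the Transformer encoder or the MLP-based reasoning module.

```python
from typing import List, Tuple

# Hypothetical quadruple layout: (head, relation, tail, timestamp).
Quad = Tuple[str, str, str, int]


def evolutionary_chain(quads: List[Quad], entity: str) -> List[Quad]:
    """Collect the facts that touch `entity` and order them by timestamp."""
    # Neighborhood subgraph: every quadruple where the entity is head or tail.
    neighborhood = [q for q in quads if q[0] == entity or q[2] == entity]
    # Unfold in chronological order; sorted() is stable, so facts sharing a
    # timestamp keep their original relative order.
    return sorted(neighborhood, key=lambda q: q[3])


if __name__ == "__main__":
    facts = [
        ("A", "visits", "B", 3),
        ("C", "meets", "A", 1),
        ("B", "signs", "C", 2),
        ("A", "leaves", "B", 5),
    ]
    for fact in evolutionary_chain(facts, "A"):
        print(fact)
```

In a full model, each quadruple in the returned chain would presumably be embedded and passed to the encoder as one sequence; that part is outside this sketch.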

Inverse-like Antagonistic Scene Text Spotting via Reading-Order Estimation and Dynamic Sampling

Jan 08, 2024

Shi-Xue Zhang, Chun Yang, Xiaobin Zhu, Hongyang Zhou, Hongfa Wang, Xu-Cheng Yin

Abstract:Scene text spotting is a challenging task, especially for inverse-like scene text, which has complex layouts, e.g., mirrored, symmetrical, or retro-flexed. In this paper, we propose a unified end-to-end trainable inverse-like antagonistic text spotting framework dubbed IATS, which can effectively spot inverse-like scene texts without sacrificing general ones. Specifically, we propose an innovative reading-order estimation module (REM) that extracts reading-order information from the initial text boundary generated by an initial boundary module (IBM). To optimize and train REM, we propose a joint reading-order estimation loss consisting of a classification loss, an orthogonality loss, and a distribution loss. With the help of IBM, we can divide the initial text boundary into two symmetric control points and iteratively refine the new text boundary using a lightweight boundary refinement module (BRM) for adapting to various shapes and scales. To alleviate the incompatibility between text detection and recognition, we propose a dynamic sampling module (DSM) with a thin-plate spline that can dynamically sample appropriate features for recognition in the detected text region. Without extra supervision, the DSM can proactively learn to sample appropriate features for text recognition through the gradient returned by the recognition module. Extensive experiments on both challenging scene text and inverse-like scene text datasets demonstrate that our method achieves superior performance both on irregular and inverse-like text spotting.

* 14 pages, 16 figures, Accepted by TIP-2024

Arbitrary Shape Text Detection via Segmentation with Probability Maps

Aug 26, 2022

Shi-Xue Zhang, Xiaobin Zhu, Lei Chen, Jie-Bo Hou, Xu-Cheng Yin

Abstract: Arbitrary shape text detection is a challenging task due to the significantly varied sizes and aspect ratios, arbitrary orientations or shapes, inaccurate annotations, etc. Due to the scalability of pixel-level prediction, segmentation-based methods can adapt to texts of various shapes and have hence attracted considerable attention recently. However, accurate pixel-level annotations of texts are difficult to obtain, and the existing datasets for scene text detection only provide coarse-grained boundary annotations. Consequently, numerous misclassified text pixels or background pixels always exist inside annotations, degrading the performance of segmentation-based text detection methods. Generally speaking, whether a pixel belongs to text or not is highly related to its distance to the adjacent annotation boundary. With this observation, in this paper, we propose an innovative and robust segmentation-based detection method via probability maps for accurately detecting text instances. To be concrete, we adopt a Sigmoid Alpha Function (SAF) to transfer the distances between boundaries and their inside pixels to a probability map. However, one probability map cannot cover complex probability distributions well because of the uncertainty of coarse-grained text boundary annotations. Therefore, we adopt a group of probability maps computed by a series of Sigmoid Alpha Functions to describe the possible probability distributions. In addition, we propose an iterative model to learn to predict and assimilate probability maps for providing enough information to reconstruct text instances. Finally, simple region growth algorithms are adopted to aggregate probability maps to complete text instances. Experimental results demonstrate that our method achieves state-of-the-art performance in terms of detection accuracy on several benchmarks.

* Accepted by TPAMI 2022. arXiv admin note: text overlap with arXiv:1812.01393 by other authors
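
The idea of turning boundary distances into a group of probability maps can be sketched as follows. The abstract does not give the SAF formula, so the rescaled sigmoid used here, `2 * sigmoid(alpha * d / d_max) - 1`, is only an assumed stand-in chosen so that boundary pixels (distance 0) map to probability 0. The function names and the `alphas` values are hypothetical as well.

```python
import math
from typing import List


def sigmoid_distance_to_prob(distance: float, max_distance: float, alpha: float) -> float:
    """Map a pixel's distance to the boundary into [0, 1).

    Assumed form (not the paper's exact SAF): a sigmoid of the normalized
    distance, rescaled so that distance 0 gives probability 0.
    """
    if max_distance <= 0:
        raise ValueError("max_distance must be positive")
    x = alpha * distance / max_distance
    return 2.0 / (1.0 + math.exp(-x)) - 1.0


def probability_maps(distance_map: List[List[float]], alphas: List[float]) -> List[List[List[float]]]:
    """Build one probability map per alpha from a single distance map."""
    max_d = max(max(row) for row in distance_map)
    return [
        [[sigmoid_distance_to_prob(d, max_d, a) for d in row] for row in distance_map]
        for a in alphas
    ]


if __name__ == "__main__":
    # Toy 3x3 distance map: 0 on the boundary, 2 at the center pixel.
    dist = [[0, 1, 0], [1, 2, 1], [0, 1, 0]]
    for alpha, pmap in zip([1.0, 3.0, 6.0], probability_maps(dist, [1.0, 3.0, 6.0])):
        print(alpha, [round(p, 3) for p in pmap[1]])
```

A larger `alpha` makes the map saturate faster away from the boundary, which is one way a family of maps could express different beliefs about how far inside an annotation the text pixels really begin.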

Arbitrary Shape Text Detection via Boundary Transformer

May 11, 2022

Shi-Xue Zhang, Xiaobin Zhu, Chun Yang, Xu-Cheng Yin

Abstract: Arbitrary shape text detection is a challenging task due to its complexity and variety, e.g., various scales, random rotations, and curved shapes. In this paper, we propose an arbitrary shape text detector with a boundary transformer, which can accurately and directly locate text boundaries without any post-processing. Our method mainly consists of a boundary proposal module and an iteratively optimized boundary transformer module. The boundary proposal module, consisting of multi-layer dilated convolutions, computes important prior information (including a classification map, a distance field, and a direction field) for generating coarse boundary proposals while also guiding the optimization of the boundary transformer. The boundary transformer module adopts an encoder-decoder structure, in which the encoder is constructed from multi-layer transformer blocks with residual connections while the decoder is a simple multi-layer perceptron (MLP) network. Under the guidance of prior information, the boundary transformer module gradually refines the coarse boundary proposals via iterative boundary deformation. Furthermore, we propose a novel boundary energy loss (BEL) which introduces an energy minimization constraint and a monotonically decreasing energy constraint for every boundary optimization step. Extensive experiments on publicly available and challenging datasets demonstrate the state-of-the-art performance and promising efficiency of our method.

* 13 pages, 12 figures. It is not the final version, just a preview. arXiv admin note: text overlap with arXiv:2107.12664

Graph Fusion Network for Multi-Oriented Object Detection

May 07, 2022

Shi-Xue Zhang, Xiaobin Zhu, Jie-Bo Hou, Xu-Cheng Yin

Abstract: In object detection, non-maximum suppression (NMS) methods are extensively adopted to remove horizontal duplicates of detected dense boxes for generating final object instances. However, due to the degraded quality of dense detection boxes and the lack of explicit exploration of context information, existing NMS methods based on simple intersection-over-union (IoU) metrics tend to underperform on multi-oriented and long-size object detection. In contrast to general NMS methods that rely on duplicate removal, we propose a novel graph fusion network, named GFNet, for multi-oriented object detection. Our GFNet is extensible and adaptively fuses dense detection boxes to detect more accurate and holistic multi-oriented object instances. Specifically, we first adopt a locality-aware clustering algorithm to group dense detection boxes into different clusters. We then construct an instance sub-graph for the detection boxes belonging to one cluster. Next, we propose a graph-based fusion network via Graph Convolutional Network (GCN) to learn to reason and fuse the detection boxes for generating final instance boxes. Extensive experiments on both publicly available multi-oriented text datasets (including MSRA-TD500, ICDAR2015, ICDAR2017-MLT) and multi-oriented object datasets (DOTA) verify the effectiveness and robustness of our method against general NMS methods in multi-oriented object detection.

* Accepted by Applied Intelligence (APIN 2022)
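
For context, the baseline that the GFNet abstract argues against, IoU-based duplicate removal, can be sketched as below. This is the conventional greedy NMS on axis-aligned boxes, not GFNet itself. The `(x1, y1, x2, y2)` box layout, the function names, and the 0.5 threshold are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

# Axis-aligned box as (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float], iou_threshold: float = 0.5) -> List[int]:
    """Return indices of boxes kept after greedy non-maximum suppression."""
    # Visit boxes from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(greedy_nms(boxes, scores))
```

Because suppressed boxes are simply discarded, any useful evidence they carry about the object's extent is lost, which is the limitation the abstract's fusion approach is aimed at.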

Towards Open-Set Text Recognition via Label-to-Prototype Learning

Apr 09, 2022

Chang Liu, Chun Yang, Hai-Bo Qin, Xiaobin Zhu, Cheng-Lin Liu, Xu-Cheng Yin

Abstract:Scene text recognition is a popular topic and extensively used in the industry. Although many methods have achieved satisfactory performance for the close-set text recognition challenges, these methods lose feasibility in open-set scenarios, where collecting data or retraining models for novel characters is too expensive. E.g., annotating samples for foreign languages can be expensive, whereas retraining the model each time a "novel" character is discovered from historical documents also costs time and resources. In this paper, we introduce and formulate a new task, i.e., the open-set text recognition task, which demands the capability to spot and cognize novel characters without retraining. Here, we propose a label-to-prototype learning framework that fulfills the new requirements in the proposed task. Specifically, novel characters are mapped to their corresponding prototypes with a Label-to-Prototype Learning module. The module is trained on seen labels and holds generalization capability for generating class centers for novel characters without retraining. The framework also implements rejection capability over out-of-set characters, which allows spotting unknown characters during the evaluation process. Extensive experiments show that our method achieves promising performance on a variety of zero-shot, close-set, and open-set text recognition datasets.

* V2 of the paper "Towards Open-Set Text Recognition via Label-to-Prototype Learning". It is a major extension of V1, and the models are tuned for better performance, yet the core experiments from V1 are kept, so it is not a new paper

Kernel Proposal Network for Arbitrary Shape Text Detection

Mar 12, 2022

Shi-Xue Zhang, Xiaobin Zhu, Jie-Bo Hou, Chun Yang, Xu-Cheng Yin

Abstract:Segmentation-based methods have achieved great success for arbitrary shape text detection. However, separating neighboring text instances is still one of the most challenging problems due to the complexity of texts in scene images. In this paper, we propose an innovative Kernel Proposal Network (dubbed KPN) for arbitrary shape text detection. The proposed KPN can separate neighboring text instances by classifying different texts into instance-independent feature maps, meanwhile avoiding the complex aggregation process existing in segmentation-based arbitrary shape text detection methods. To be concrete, our KPN will predict a Gaussian center map for each text image, which will be used to extract a series of candidate kernel proposals (i.e., dynamic convolution kernel) from the embedding feature maps according to their corresponding keypoint positions. To enforce the independence between kernel proposals, we propose a novel orthogonal learning loss (OLL) via orthogonal constraints. Specifically, our kernel proposals contain important self-information learned by network and location information by position embedding. Finally, kernel proposals will individually convolve all embedding feature maps for generating individual embedded maps of text instances. In this way, our KPN can effectively separate neighboring text instances and improve the robustness against unclear boundaries. To our knowledge, our work is the first to introduce the dynamic convolution kernel strategy to efficiently and effectively tackle the adhesion problem of neighboring text instances in text detection. Experimental results on challenging datasets verify the impressive performance and efficiency of our method. The code and model are available at https://github.com/GXYM/KPN.

* This paper was completed in 2020-11. It was first submitted to CVPR 2021 and then ICCV 2021. Finally, it was accepted by TNNLS in 2022-02 after major revision. Here, I thank Dr. Hou for his important contributions
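
The KPN abstract mentions an orthogonal learning loss (OLL) that enforces independence between kernel proposals but does not state its formula. The sketch below shows one generic orthogonality penalty, the mean squared cosine similarity over all pairs of kernel vectors. It is an assumed illustration of the constraint, not the paper's OLL, and the name `orthogonality_penalty` is hypothetical.

```python
import math
from typing import List, Sequence


def orthogonality_penalty(kernels: Sequence[Sequence[float]]) -> float:
    """Mean squared cosine similarity over all distinct kernel pairs.

    Returns 0.0 when every pair is orthogonal and 1.0 when all kernels are
    parallel. Assumed generic penalty, not the paper's exact OLL.
    """
    normed: List[List[float]] = []
    for k in kernels:
        norm = math.sqrt(sum(v * v for v in k))
        if norm == 0.0:
            raise ValueError("kernel vectors must be non-zero")
        normed.append([v / norm for v in k])

    n = len(normed)
    if n < 2:
        return 0.0  # A single kernel has nothing to be orthogonal to.

    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            cos = sum(a * b for a, b in zip(normed[i], normed[j]))
            total += cos * cos
    return total / (n * (n - 1) / 2)


if __name__ == "__main__":
    print(orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]]))  # orthogonal pair
    print(orthogonality_penalty([[1.0, 0.0], [2.0, 0.0]]))  # parallel pair
```

Minimizing a term like this alongside the detection loss pushes the kernels of different text instances apart in feature space, which matches the stated goal of separating neighboring instances.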
