Jiahong Li

Treatment Effect Estimation for Exponential Family Outcomes using Neural Networks with Targeted Regularization

Feb 11, 2025

Jiahong Li, Zeqin Yang, Jiayi Dan, Jixing Xu, Zhichao Zou, Peng Zhen, Jiecheng Guo

Abstract: Neural Networks (NNs) have become a natural choice for treatment effect estimation due to their strong approximation capabilities. Nevertheless, how to design NN-based estimators with desirable properties, such as low bias and double robustness, remains a significant challenge. A common approach to address this is targeted regularization, which modifies the objective function of NNs. However, existing works on targeted regularization are limited to Gaussian-distributed outcomes, significantly restricting their applicability in real-world scenarios. In this work, we aim to bridge this gap by extending this framework to the broader class of exponential family outcomes. Specifically, we first derive the von Mises expansion of the Average Dose function of Canonical Functions (ADCF), which suggests how to construct a doubly robust estimator with good properties. Based on this, we develop an NN-based estimator for ADCF by generalizing functional targeted regularization to exponential families, and provide the corresponding theoretical convergence rate. Extensive experimental results demonstrate the effectiveness of our proposed model.
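
The double robustness mentioned in the abstract can be illustrated with the classical augmented inverse-propensity-weighted (AIPW) estimator for a binary treatment. This is a minimal sketch of that textbook estimator, not the paper's ADCF estimator or its targeted regularization; the function name `aipw_ate` and its arguments (fitted outcome predictions `mu0`, `mu1` and propensity scores `e`) are assumptions made for illustration.

```python
def aipw_ate(y, t, mu0, mu1, e):
    """Textbook AIPW estimate of the average treatment effect (illustrative only).

    y   : observed outcomes
    t   : binary treatment indicators (0 or 1)
    mu0 : outcome-model predictions under control (hypothetical fitted values)
    mu1 : outcome-model predictions under treatment (hypothetical fitted values)
    e   : estimated propensity scores P(T=1 | X), strictly between 0 and 1
    """
    total = 0.0
    for yi, ti, m0, m1, ei in zip(y, t, mu0, mu1, e):
        # Outcome-model prediction plus an inverse-propensity-weighted residual.
        psi1 = m1 + ti * (yi - m1) / ei
        psi0 = m0 + (1 - ti) * (yi - m0) / (1 - ei)
        total += psi1 - psi0
    return total / len(y)


y = [3.0, 1.0, 3.0, 1.0]
t = [1, 0, 1, 0]
e = [0.5, 0.5, 0.5, 0.5]

# Correct outcome models: the residual corrections vanish.
ate_outcome_ok = aipw_ate(y, t, [1.0] * 4, [3.0] * 4, e)
# Wrong outcome models (all zeros) but correct propensities.
ate_propensity_ok = aipw_ate(y, t, [0.0] * 4, [0.0] * 4, e)
```

In this toy data both calls give the same estimate, which is the sense in which the estimator is doubly robust: it stays on target when either the outcome model or the propensity model is right.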

FlowText: Synthesizing Realistic Scene Text Video with Optical Flow Estimation

May 05, 2023

Yuzhong Zhao, Weijia Wu, Zhuang Li, Jiahong Li, Weiqiang Wang

Abstract: Current video text spotting methods can achieve strong performance when given sufficient labeled training data. However, labeling data manually is time-consuming and labor-intensive. To overcome this, using low-cost synthetic data is a promising alternative. This paper introduces a novel video text synthesis technique called FlowText, which utilizes optical flow estimation to synthesize a large amount of text video data at a low cost for training robust video text spotters. Unlike existing methods that focus on image-level synthesis, FlowText concentrates on synthesizing temporal information of text instances across consecutive frames using optical flow. This temporal information is crucial for accurately tracking and spotting text in video sequences, including text movement, distortion, appearance, disappearance, occlusion, and blur. Experiments show that combining general detectors like TransDETR with the proposed FlowText produces remarkable results on various datasets, such as ICDAR2015video and ICDAR2013video. Code is available at https://github.com/callsys/FlowText.

* ICME 2023

A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension

May 05, 2023

Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Hong Zhou, Mike Zheng Shou, Xiang Bai

Abstract: Most existing cross-modal language-to-video retrieval (VR) research focuses on single-modal input from video, i.e., visual representation, while text is omnipresent in human environments and frequently critical to understanding video. To study how to retrieve video with both modal inputs, i.e., visual and text semantic representations, we first introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR, which contains 42.2k sentence queries for 10.5k videos of 8 scenario domains, i.e., Street View (indoor), Street View (outdoor), Games, Sports, Driving, Activity, TV Show, and Cooking. The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task. Besides, we present a detailed analysis of TextVR compared to the existing datasets and design a novel multimodal video retrieval baseline for the text-based video retrieval task. The dataset analysis and extensive experiments show that, compared with previous datasets, our TextVR benchmark provides many new technical challenges and insights for the video-and-language community. The project website and GitHub repo can be found at https://sites.google.com/view/loveucvpr23/guest-track and https://github.com/callsys/TextVR, respectively.

ICDAR 2023 Video Text Reading Competition for Dense and Small Text

Apr 10, 2023

Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Mike Zheng Shou, Umapada Pal, Dimosthenis Karatzas, Xiang Bai

Abstract: Video text detection, tracking, and recognition in natural scenes have recently become very popular in the computer vision community. However, most existing algorithms and benchmarks focus on common text cases (e.g., normal size, density) and single scenarios, while ignoring extreme video text challenges, i.e., dense and small text in various scenarios. In this competition report, we establish a video text reading benchmark, DSText, which focuses on dense and small text reading challenges in video with various scenarios. Compared with previous datasets, the proposed dataset mainly includes three new challenges: 1) Dense video texts, a new challenge for video text spotters. 2) A high proportion of small texts. 3) Various new scenarios, e.g., games, sports, etc. The proposed DSText includes 100 video clips from 12 open scenarios, supporting two tasks, i.e., video text tracking (Task 1) and end-to-end video text spotting (Task 2). During the competition period (opened on 15th February 2023 and closed on 20th March 2023), a total of 24 teams participated in the three proposed tasks with around 30 valid submissions. In this article, we describe detailed statistical information of the dataset, tasks, evaluation protocols and the results summaries of the ICDAR 2023 DSText competition. Moreover, we hope the benchmark will promote video text research in the community.

* ICDAR 2023 competition

Audience Expansion for Multi-show Release Based on an Edge-prompted Heterogeneous Graph Network

Apr 08, 2023

Kai Song, Shaofeng Wang, Ziwei Xie, Shanyu Wang, Jiahong Li, Yongqiang Yang

Abstract: In user targeting and expansion for new shows on a video platform, the key point is how their embeddings are generated. They should be personalized from the perspective of both users and shows. Furthermore, the pursuit of both instant (click) and long-term (view time) rewards, and the cold-start problem for new shows, bring additional challenges. Such a problem is well suited to heterogeneous graph models because of the natural graph structure of the data. But real-world networks usually have billions of nodes and various types of edges. Few existing methods focus on handling large-scale data and exploiting different types of edges, especially the latter. In this paper, we propose a two-stage audience expansion scheme based on an edge-prompted heterogeneous graph network which can take different double-sided interactions and features into account. In the offline stage, to construct the graph, user IDs and specific side information combinations of the shows are chosen to be the nodes, and click/co-click relations and view time are used to build the edges. Embeddings and clustered user groups are then calculated. When new shows arrive, their embeddings and subsequent matching users can be produced within a consistent space. In the online stage, posterior data including click/view users are employed as seeds to look for similar users. The results on the public datasets and our billion-scale data demonstrate the accuracy and efficiency of our approach.

Real-time End-to-End Video Text Spotter with Contrastive Representation Learning

Jul 18, 2022

Weijia Wu, Zhuang Li, Jiahong Li, Chunhua Shen, Hong Zhou, Size Li, Zhongyuan Wang, Ping Luo

Abstract: Video text spotting (VTS) is the task of simultaneously detecting, tracking and recognizing text in video. Existing video text spotting methods typically develop sophisticated pipelines and multiple models, which is not friendly to real-time applications. Here we propose a real-time end-to-end video text spotter with Contrastive Representation learning (CoText). Our contributions are three-fold: 1) CoText simultaneously addresses the three tasks (i.e., text detection, tracking, recognition) in a real-time end-to-end trainable framework. 2) With contrastive learning, CoText models long-range dependencies and learns temporal information across multiple frames. 3) A simple, lightweight architecture is designed for effective and accurate performance, including GPU-parallel detection post-processing and a CTC-based recognition head with Masked RoI. Extensive experiments show the superiority of our method. In particular, CoText achieves a video text spotting IDF1 of 72.0% at 41.0 FPS on ICDAR2015video, improvements of 10.5% and 32.0 FPS over the previous best method. The code can be found at github.com/weijiawu/CoText.

* ECCV 2022
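
The IDF1 score quoted in the abstract above is the identity F1 measure commonly used in multi-object tracking: the harmonic mean of identity precision and identity recall. The sketch below only shows how that score is formed from identity-level counts; the function name `idf1` and the example counts are hypothetical and are not taken from the paper or from any benchmark's evaluation code.

```python
def idf1(idtp, idfp, idfn):
    """Identity F1 from identity true positives, false positives and false negatives."""
    denom = 2 * idtp + idfp + idfn
    if denom == 0:
        # No detections and no ground truth: define the score as 0 to avoid division by zero.
        return 0.0
    return 2 * idtp / denom


# Hypothetical counts chosen only to illustrate the arithmetic.
score = idf1(idtp=72, idfp=28, idfn=28)
```

Because IDF1 depends on a global matching between predicted and ground-truth identities, it penalizes identity switches more heavily than detection-only metrics do.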

Domain Generalization via Shuffled Style Assembly for Face Anti-Spoofing

Mar 18, 2022

Zhuo Wang, Zezheng Wang, Zitong Yu, Weihong Deng, Jiahong Li, Tingting Gao, Zhongyuan Wang

Abstract: With diverse presentation attacks emerging continually, generalizable face anti-spoofing (FAS) has drawn growing attention. Most existing methods implement domain generalization (DG) on the complete representations. However, different image statistics may have unique properties for the FAS tasks. In this work, we separate the complete representation into content and style ones. A novel Shuffled Style Assembly Network (SSAN) is proposed to extract and reassemble different content and style features for a stylized feature space. Then, to obtain a generalized representation, a contrastive learning strategy is developed to emphasize liveness-related style information while suppressing the domain-specific one. Finally, the representations of the correct assemblies are used to distinguish between live and spoof samples during inference. On the other hand, despite the decent performance, there still exists a gap between academia and industry, due to the difference in data quantity and distribution. Thus, a new large-scale benchmark for FAS is built to further evaluate the performance of algorithms in reality. Both qualitative and quantitative results on existing and proposed benchmarks demonstrate the effectiveness of our methods. The codes will be available at https://github.com/wangzhuo2019/SSAN.

* Accepted by CVPR2022

Contrastive Learning of Semantic and Visual Representations for Text Tracking

Dec 30, 2021

Zhuang Li, Weijia Wu, Mike Zheng Shou, Jiahong Li, Size Li, Zhongyuan Wang, Hong Zhou

Abstract: Semantic representation is of great benefit to the video text tracking (VTT) task, which requires simultaneously classifying, detecting, and tracking texts in video. Most existing approaches tackle this task by appearance similarity in continuous frames, while ignoring the abundant semantic features. In this paper, we explore robustly tracking video text with contrastive learning of semantic and visual representations. Correspondingly, we present an end-to-end video text tracker with Semantic and Visual Representations (SVRep), which detects and tracks texts by exploiting the visual and semantic relationships between different texts in a video sequence. Besides, with a light-weight architecture, SVRep achieves state-of-the-art performance while maintaining competitive inference speed. Specifically, with a backbone of ResNet-18, SVRep achieves an IDF1 of 65.9%, running at 16.7 FPS, on the ICDAR2015(video) dataset, an 8.6% improvement over the previous state-of-the-art methods.

* 10 pages, 5 figures

A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer

Dec 09, 2021

Weijia Wu, Yuanqiang Cai, Debing Zhang, Sibo Wang, Zhuang Li, Jiahong Li, Yejun Tang, Hong Zhou

Abstract: Most existing video text spotting benchmarks focus on evaluating a single language and scenario with limited data. In this work, we introduce a large-scale, Bilingual, Open World Video text benchmark dataset (BOVText). BOVText has four features. Firstly, we provide 2,000+ videos with more than 1,750,000 frames, 25 times larger than the existing largest dataset with incidental text in videos. Secondly, our dataset covers 30+ open categories with a wide selection of various scenarios, e.g., Life Vlog, Driving, Movie, etc. Thirdly, abundant text type annotations (i.e., title, caption or scene text) are provided for the different representational meanings in video. Fourthly, BOVText provides bilingual text annotation to promote life and communication across multiple cultures. Besides, we propose an end-to-end video text spotting framework with Transformer, termed TransVTSpotter, which solves multi-oriented text spotting in video with a simple but efficient attention-based query-key mechanism. It applies object features from the previous frame as a tracking query for the current frame and introduces a rotation angle prediction to fit multi-oriented text instances. On ICDAR2015(video), TransVTSpotter achieves state-of-the-art performance with 44.1% MOTA at 9 FPS. The dataset and code of TransVTSpotter can be found at github.com/weijiawu/BOVText and github.com/weijiawu/TransVTSpotter, respectively.

* NeurIPS 2021 Track on Datasets and Benchmarks
* 20 pages, 6 figures

Consistency Regularization for Deep Face Anti-Spoofing

Nov 25, 2021

Zezheng Wang, Zitong Yu, Xun Wang, Yunxiao Qin, Jiahong Li, Chenxu Zhao, Zhen Lei, Xin Liu, Size Li, Zhongyuan Wang

Abstract: Face anti-spoofing (FAS) plays a crucial role in securing face recognition systems. Empirically, given an image, a model with more consistent output on different views of this image usually performs better, as shown in Fig. 1. Motivated by this observation, we conjecture that encouraging feature consistency of different views may be a promising way to boost FAS models. In this paper, we explore this idea thoroughly by enhancing both Embedding-level and Prediction-level Consistency Regularization (EPCR) in FAS. Specifically, at the embedding level, we design a dense similarity loss to maximize the similarities between all positions of two intermediate feature maps in a self-supervised fashion; while at the prediction level, we optimize the mean square error between the predictions of two views. Notably, our EPCR is free of annotations and can be directly integrated into semi-supervised learning schemes. Considering different application scenarios, we further design five diverse semi-supervised protocols to measure semi-supervised FAS techniques. We conduct extensive experiments to show that EPCR can significantly improve the performance of several supervised and semi-supervised tasks on benchmark datasets. The codes and protocols will be released at https://github.com/clks-wzz/EPCR.

* 10 tables, 4 figures
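
The two consistency terms described in the abstract above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the released EPCR code: feature maps are represented as plain lists of per-position vectors, similarity is taken to be cosine similarity over every pair of positions, and the function names `dense_similarity_loss` and `prediction_consistency_loss` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors, with a small epsilon for stability."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def dense_similarity_loss(feat_a, feat_b):
    """Negative mean cosine similarity over all position pairs of two feature maps.

    feat_a, feat_b: lists of per-position feature vectors from two views
    (assumed layout). Minimizing this loss maximizes the similarities.
    """
    sims = [cosine(u, v) for u in feat_a for v in feat_b]
    return -sum(sims) / len(sims)


def prediction_consistency_loss(pred_a, pred_b):
    """Mean squared error between the predictions of two views."""
    return sum((a - b) ** 2 for a, b in zip(pred_a, pred_b)) / len(pred_a)


# Toy inputs: two views with identical features and slightly different predictions.
view_a = [[1.0, 0.0], [1.0, 0.0]]
view_b = [[1.0, 0.0], [1.0, 0.0]]
embed_loss = dense_similarity_loss(view_a, view_b)
pred_loss = prediction_consistency_loss([1.0, 0.0], [0.0, 0.0])
```

Neither term uses labels, which matches the abstract's point that the regularizers can be applied to unlabeled data in a semi-supervised setup.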
