Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ting Sun

PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction

Mar 21, 2025

Ting Sun, Cheng Cui, Yuning Du, Yi Liu

Abstract:Document layout analysis is a critical preprocessing step in document intelligence, enabling the detection and localization of structural elements such as titles, text blocks, tables, and formulas. Despite its importance, existing layout detection models face significant challenges in generalizing across diverse document types, handling complex layouts, and achieving real-time performance for large-scale data processing. To address these limitations, we present PP-DocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats. To meet different needs, we offer three models of varying scales. PP-DocLayout-L is a high-precision model based on the RT-DETR-L detector, achieving 90.4% mAP@0.5 and an end-to-end inference time of 13.4 ms per page on a T4 GPU. PP-DocLayout-M is a balanced model, offering 75.2% mAP@0.5 with an inference time of 12.7 ms per page on a T4 GPU. PP-DocLayout-S is a high-efficiency model designed for resource-constrained environments and real-time applications, with an inference time of 8.1 ms per page on a T4 GPU and 14.5 ms on a CPU. This work not only advances the state of the art in document layout analysis but also provides a robust solution for constructing high-quality training data, enabling advancements in document intelligence and multimodal AI systems. Code and models are available at https://github.com/PaddlePaddle/PaddleX .

* Github Repo: https://github.com/PaddlePaddle/PaddleX

Via

Access Paper or Ask Questions

DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services

Feb 17, 2025

Ting Sun, Penghan Wang, Fan Lai

Abstract:The rapid rise of large language models (LLMs) in text streaming services has introduced significant cost and Quality of Experience (QoE) challenges in serving millions of daily requests, especially in meeting Time-To-First-Token (TTFT) and Time-Between-Token (TBT) requirements for real-time interactions. Our real-world measurements show that both server-based and on-device deployments struggle to meet diverse QoE demands: server deployments face high costs and last-hop issues (e.g., Internet latency and dynamics), while on-device LLM inference is constrained by resources. We introduce DiSCo, a device-server cooperative scheduler designed to optimize users' QoE by adaptively routing requests and migrating response generation between endpoints while maintaining cost constraints. DiSCo employs cost-aware scheduling, leveraging the predictable speed of on-device LLM inference with the flexible capacity of server-based inference to dispatch requests on the fly, while introducing a token-level migration mechanism to ensure consistent token delivery during migration. Evaluations on real-world workloads -- including commercial services like OpenAI GPT and DeepSeek, and open-source deployments such as LLaMA3 -- show that DiSCo can improve users' QoE by reducing tail TTFT (11-52\%) and mean TTFT (6-78\%) across different model-device configurations, while dramatically reducing serving costs by up to 84\% through its migration mechanism while maintaining comparable QoE levels.

* 17 pages, 14 figures

Via

Access Paper or Ask Questions

SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation

Jun 18, 2024

Xiaoze Liu, Ting Sun, Tianyang Xu, Feijie Wu, Cunxiang Wang, Xiaoqian Wang, Jing Gao

Figure 1 for SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation

Figure 2 for SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation

Figure 3 for SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation

Figure 4 for SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation

Abstract:Large Language Models (LLMs) have transformed machine learning but raised significant legal concerns due to their potential to produce text that infringes on copyrights, resulting in several high-profile lawsuits. The legal landscape is struggling to keep pace with these rapid advancements, with ongoing debates about whether generated text might plagiarize copyrighted materials. Current LLMs may infringe on copyrights or overly restrict non-copyrighted texts, leading to these challenges: (i) the need for a comprehensive evaluation benchmark to assess copyright compliance from multiple aspects; (ii) evaluating robustness against safeguard bypassing attacks; and (iii) developing effective defenses targeted against the generation of copyrighted text. To tackle these challenges, we introduce a curated dataset to evaluate methods, test attack strategies, and propose lightweight, real-time defenses to prevent the generation of copyrighted text, ensuring the safe and lawful use of LLMs. Our experiments demonstrate that current LLMs frequently output copyrighted text, and that jailbreaking attacks can significantly increase the volume of copyrighted output. Our proposed defense mechanisms significantly reduce the volume of copyrighted text generated by LLMs by effectively refusing malicious requests. Code is publicly available at https://github.com/xz-liu/SHIELD

Via

Access Paper or Ask Questions

PP-MobileSeg: Explore the Fast and Accurate Semantic Segmentation Model on Mobile Devices

Apr 11, 2023

Shiyu Tang, Ting Sun, Juncai Peng, Guowei Chen, Yuying Hao, Manhui Lin, Zhihong Xiao, Jiangbin You, Yi Liu

Abstract:The success of transformers in computer vision has led to several attempts to adapt them for mobile devices, but their performance remains unsatisfactory in some real-world applications. To address this issue, we propose PP-MobileSeg, a semantic segmentation model that achieves state-of-the-art performance on mobile devices. PP-MobileSeg comprises three novel parts: the StrideFormer backbone, the Aggregated Attention Module (AAM), and the Valid Interpolate Module (VIM). The four-stage StrideFormer backbone is built with MV3 blocks and strided SEA attention, and it is able to extract rich semantic and detailed features with minimal parameter overhead. The AAM first filters the detailed features through semantic feature ensemble voting and then combines them with semantic features to enhance the semantic information. Furthermore, we proposed VIM to upsample the downsampled feature to the resolution of the input image. It significantly reduces model latency by only interpolating classes present in the final prediction, which is the most significant contributor to overall model latency. Extensive experiments show that PP-MobileSeg achieves a superior tradeoff between accuracy, model size, and latency compared to other methods. On the ADE20K dataset, PP-MobileSeg achieves 1.57% higher accuracy in mIoU than SeaFormer-Base with 32.9% fewer parameters and 42.3% faster acceleration on Qualcomm Snapdragon 855. Source codes are available at https://github.com/PaddlePaddle/PaddleSeg/tree/release/2.8.

* 8 pages, 3 figures

Via

Access Paper or Ask Questions

LVIS Challenge Track Technical Report 1st Place Solution: Distribution Balanced and Boundary Refinement for Large Vocabulary Instance Segmentation

Nov 05, 2021

WeiFu Fu, CongChong Nie, Ting Sun, Jun Liu, TianLiang Zhang, Yong Liu

Figure 1 for LVIS Challenge Track Technical Report 1st Place Solution: Distribution Balanced and Boundary Refinement for Large Vocabulary Instance Segmentation

Figure 2 for LVIS Challenge Track Technical Report 1st Place Solution: Distribution Balanced and Boundary Refinement for Large Vocabulary Instance Segmentation

Figure 3 for LVIS Challenge Track Technical Report 1st Place Solution: Distribution Balanced and Boundary Refinement for Large Vocabulary Instance Segmentation

Figure 4 for LVIS Challenge Track Technical Report 1st Place Solution: Distribution Balanced and Boundary Refinement for Large Vocabulary Instance Segmentation

Abstract:This report introduces the technical details of the team FuXi-Fresher for LVIS Challenge 2021. Our method focuses on the problem in following two aspects: the long-tail distribution and the segmentation quality of mask and boundary. Based on the advanced HTC instance segmentation algorithm, we connect transformer backbone(Swin-L) through composite connections inspired by CBNetv2 to enhance the baseline results. To alleviate the problem of long-tail distribution, we design a Distribution Balanced method which includes dataset balanced and loss function balaced modules. Further, we use a Mask and Boundary Refinement method composed with mask scoring and refine-mask algorithms to improve the segmentation quality. In addition, we are pleasantly surprised to find that early stopping combined with EMA method can achieve a great improvement. Finally, by using multi-scale testing and increasing the upper limit of the number of objects detected per image, we achieved more than 45.4% boundary AP on the val set of LVIS Challenge 2021. On the test data of LVIS Challenge 2021, we rank 1st and achieve 48.1% AP. Notably, our APr 47.5% is very closed to the APf 48.0%.

Via

Access Paper or Ask Questions

Multi-Target Domain Adaptation via Unsupervised Domain Classification for Weather Invariant Object Detection

Mar 25, 2021

Ting Sun, Jinlin Chen, Francis Ng

Figure 1 for Multi-Target Domain Adaptation via Unsupervised Domain Classification for Weather Invariant Object Detection

Figure 2 for Multi-Target Domain Adaptation via Unsupervised Domain Classification for Weather Invariant Object Detection

Figure 3 for Multi-Target Domain Adaptation via Unsupervised Domain Classification for Weather Invariant Object Detection

Figure 4 for Multi-Target Domain Adaptation via Unsupervised Domain Classification for Weather Invariant Object Detection

Abstract:Object detection is an essential technique for autonomous driving. The performance of an object detector significantly degrades if the weather of the training images is different from that of test images. Domain adaptation can be used to address the domain shift problem so as to improve the robustness of an object detector. However, most existing domain adaptation methods either handle single target domain or require domain labels. We propose a novel unsupervised domain classification method which can be used to generalize single-target domain adaptation methods to multi-target domains, and design a weather-invariant object detector training framework based on it. We conduct the experiments on Cityscapes dataset and its synthetic variants, i.e. foggy, rainy, and night. The experimental results show that the object detector trained by our proposed method realizes robust object detection under different weather conditions.

Via

Access Paper or Ask Questions

Application of Pre-training Models in Named Entity Recognition

Feb 09, 2020

Yu Wang, Yining Sun, Zuchang Ma, Lisheng Gao, Yang Xu, Ting Sun

Figure 1 for Application of Pre-training Models in Named Entity Recognition

Figure 2 for Application of Pre-training Models in Named Entity Recognition

Abstract:Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task to extract entities from unstructured data. The previous methods for NER were based on machine learning or deep learning. Recently, pre-training models have significantly improved performance on multiple NLP tasks. In this paper, firstly, we introduce the architecture and pre-training tasks of four common pre-training models: BERT, ERNIE, ERNIE2.0-tiny, and RoBERTa. Then, we apply these pre-training models to a NER task by fine-tuning, and compare the effects of the different model architecture and pre-training tasks on the NER task. The experiment results showed that RoBERTa achieved state-of-the-art results on the MSRA-2006 dataset.

Via

Access Paper or Ask Questions

Movable-Object-Aware Visual SLAM via Weakly Supervised Semantic Segmentation

Jul 31, 2019

Ting Sun, Yuxiang Sun, Ming Liu, Dit-Yan Yeung

Figure 1 for Movable-Object-Aware Visual SLAM via Weakly Supervised Semantic Segmentation

Figure 2 for Movable-Object-Aware Visual SLAM via Weakly Supervised Semantic Segmentation

Figure 3 for Movable-Object-Aware Visual SLAM via Weakly Supervised Semantic Segmentation

Figure 4 for Movable-Object-Aware Visual SLAM via Weakly Supervised Semantic Segmentation

Abstract:Moving objects can greatly jeopardize the performance of a visual simultaneous localization and mapping (vSLAM) system which relies on the static-world assumption. Motion removal have seen successful on solving this problem. Two main streams of solutions are based on either geometry constraints or deep semantic segmentation neural network. The former rely on static majority assumption, and the latter require labor-intensive pixel-wise annotations. In this paper we propose to adopt a novel weakly-supervised semantic segmentation method. The segmentation mask is obtained from a CNN pre-trained with image-level class labels only. Thus, we leverage the power of deep semantic segmentation CNNs, while avoid requiring expensive annotations for training. We integrate our motion removal approach with the ORB-SLAM2 system. Experimental results on the TUM RGB-D and the KITTI stereo datasets demonstrate our superiority over the state-of-the-art.

Via

Access Paper or Ask Questions

Fully Using Classifiers for Weakly Supervised Semantic Segmentation with Modified Cues

Apr 03, 2019

Ting Sun, Lei Tai, Zhihan Gao, Ming Liu, Dit-Yan Yeung

Figure 1 for Fully Using Classifiers for Weakly Supervised Semantic Segmentation with Modified Cues

Figure 2 for Fully Using Classifiers for Weakly Supervised Semantic Segmentation with Modified Cues

Figure 3 for Fully Using Classifiers for Weakly Supervised Semantic Segmentation with Modified Cues

Figure 4 for Fully Using Classifiers for Weakly Supervised Semantic Segmentation with Modified Cues

Abstract:This paper proposes a novel weakly-supervised semantic segmentation method using image-level label only. The class-specific activation maps from the well-trained classifiers are used as cues to train a segmentation network. The well-known defects of these cues are coarseness and incompleteness. We use super-pixel to refine them, and fuse the cues extracted from both a color image trained classifier and a gray image trained classifier to compensate for their incompleteness. The conditional random field is adapted to regulate the training process and to refine the outputs further. Besides initializing the segmentation network, the previously trained classifier is also used in the testing phase to suppress the non-existing classes. Experimental results on the PASCAL VOC 2012 dataset illustrate the effectiveness of our method.

Via

Access Paper or Ask Questions

Hallucinating very low-resolution and obscured face images

Dec 12, 2018

Lianping Yang, Bin Shao, Ting Sun, Song Ding, Xiangde Zhang

Figure 1 for Hallucinating very low-resolution and obscured face images

Figure 2 for Hallucinating very low-resolution and obscured face images

Figure 3 for Hallucinating very low-resolution and obscured face images

Figure 4 for Hallucinating very low-resolution and obscured face images

Abstract:Most of the face hallucination methods are designed for complete inputs. They will not work well if the inputs are very tiny or contaminated by large occlusion. Inspired by this fact, we propose an obscured face hallucination network(OFHNet). The OFHNet consists of four parts: an inpainting network, an upsampling network, a discriminative network, and a fixed facial landmark detection network. The inpainting network restores the low-resolution(LR) obscured face images. The following upsampling network is to upsample the output of inpainting network. In order to ensure the generated high-resolution(HR) face images more photo-realistic, we utilize the discriminative network and the facial landmark detection network to better the result of upsampling network. In addition, we present a semantic structure loss, which makes the generated HR face images more pleasing. Extensive experiments show that our framework can restore the appealing HR face images from 1/4 missing area LR face images with a challenging scaling factor of 8x.

* 20 pages, Submitted to Pattern Recognition Letters

Via

Access Paper or Ask Questions