Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yiming Lin

LLM-Powered Proactive Data Systems

Feb 18, 2025

Sepanta Zeighami, Yiming Lin, Shreya Shankar, Aditya Parameswaran

Abstract:With the power of LLMs, we now have the ability to query data that was previously impossible to query, including text, images, and video. However, despite this enormous potential, most present-day data systems that leverage LLMs are reactive, reflecting our community's desire to map LLMs to known abstractions. Most data systems treat LLMs as an opaque black box that operates on user inputs and data as is, optimizing them much like any other approximate, expensive UDFs, in conjunction with other relational operators. Such data systems do as they are told, but fail to understand and leverage what the LLM is being asked to do (i.e. the underlying operations, which may be error-prone), the data the LLM is operating on (e.g., long, complex documents), or what the user really needs. They don't take advantage of the characteristics of the operations and/or the data at hand, or ensure correctness of results when there are imprecisions and ambiguities. We argue that data systems instead need to be proactive: they need to be given more agency -- armed with the power of LLMs -- to understand and rework the user inputs and the data and to make decisions on how the operations and the data should be represented and processed. By allowing the data system to parse, rewrite, and decompose user inputs and data, or to interact with the user in ways that go beyond the standard single-shot query-result paradigm, the data system is able to address user needs more efficiently and effectively. These new capabilities lead to a rich design space where the data system takes more initiative: they are empowered to perform optimization based on the transformation operations, data characteristics, and user intent. We discuss various successful examples of how this framework has been and can be applied in real-world tasks, and present future directions for this ambitious research agenda.

* IEEE Data Engineering Bulletin March 2025

Via

Access Paper or Ask Questions

TWIX: Automatically Reconstructing Structured Data from Templatized Documents

Jan 11, 2025

Yiming Lin, Mawil Hasan, Rohan Kosalge, Alvin Cheung, Aditya G. Parameswaran

Figure 1 for TWIX: Automatically Reconstructing Structured Data from Templatized Documents

Figure 2 for TWIX: Automatically Reconstructing Structured Data from Templatized Documents

Figure 3 for TWIX: Automatically Reconstructing Structured Data from Templatized Documents

Figure 4 for TWIX: Automatically Reconstructing Structured Data from Templatized Documents

Abstract:Many documents, that we call templatized documents, are programmatically generated by populating fields in a visual template. Effective data extraction from these documents is crucial to supporting downstream analytical tasks. Current data extraction tools often struggle with complex document layouts, incur high latency and/or cost on large datasets, and often require significant human effort, when extracting tables or values given user-specified fields from documents. The key insight of our tool, TWIX, is to predict the underlying template used to create such documents, modeling the visual and structural commonalities across documents. Data extraction based on this predicted template provides a more principled, accurate, and efficient solution at a low cost. Comprehensive evaluations on 34 diverse real-world datasets show that uncovering the template is crucial for data extraction from templatized documents. TWIX achieves over 90% precision and recall on average, outperforming tools from industry: Textract and Azure Document Intelligence, and vision-based LLMs like GPT-4-Vision, by over 25% in precision and recall. TWIX scales easily to large datasets and is 734X faster and 5836X cheaper than vision-based LLMs for extracting data from a large document collection with 817 pages.

Via

Access Paper or Ask Questions

Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network

Oct 25, 2023

Yiming Lin, Xiao-Bo Jin, Qiufeng Wang, Kaizhu Huang

Abstract:Panoramic Narrative Grounding (PNG) is an emerging visual grounding task that aims to segment visual objects in images based on dense narrative captions. The current state-of-the-art methods first refine the representation of phrase by aggregating the most similar $k$ image pixels, and then match the refined text representations with the pixels of the image feature map to generate segmentation results. However, simply aggregating sampled image features ignores the contextual information, which can lead to phrase-to-pixel mis-match. In this paper, we propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN), whose main idea is to bring deformable attention in the iterative process of feature learning to incorporate essential context information of different scales of pixels. DRMN iteratively re-encodes pixels with the deformable attention network after updating the feature representation of the top-$k$ most similar pixels. As such, DRMN can lead to accurate yet discriminative pixel representations, purify the top-$k$ most similar pixels, and consequently alleviate the phrase-to-pixel mis-match substantially.Experimental results show that our novel design significantly improves the matching results between text phrases and image pixels. Concretely, DRMN achieves new state-of-the-art performance on the PNG benchmark with an average recall improvement 3.5%. The codes are available in: https://github.com/JaMesLiMers/DRMN.

* Accepted by ICDM 2023

Via

Access Paper or Ask Questions

FAN-Trans: Online Knowledge Distillation for Facial Action Unit Detection

Nov 11, 2022

Jing Yang, Jie Shen, Yiming Lin, Yordan Hristov, Maja Pantic

Abstract:Due to its importance in facial behaviour analysis, facial action unit (AU) detection has attracted increasing attention from the research community. Leveraging the online knowledge distillation framework, we propose the ``FANTrans" method for AU detection. Our model consists of a hybrid network of convolution and transformer blocks to learn per-AU features and to model AU co-occurrences. The model uses a pre-trained face alignment network as the feature extractor. After further transformation by a small learnable add-on convolutional subnet, the per-AU features are fed into transformer blocks to enhance their representation. As multiple AUs often appear together, we propose a learnable attention drop mechanism in the transformer block to learn the correlation between the features for different AUs. We also design a classifier that predicts AU presence by considering all AUs' features, to explicitly capture label dependencies. Finally, we make the attempt of adapting online knowledge distillation in the training stage for this task, further improving the model's performance. Experiments on the BP4D and DISFA datasets demonstrating the effectiveness of proposed method.

* WACV2023
* 9 pages, 6 figures

Via

Access Paper or Ask Questions

Self-supervised Video-centralised Transformer for Video Face Clustering

Mar 24, 2022

Yujiang Wang, Mingzhi Dong, Jie Shen, Yiming Luo, Yiming Lin, Pingchuan Ma, Stavros Petridis, Maja Pantic

Figure 1 for Self-supervised Video-centralised Transformer for Video Face Clustering

Figure 2 for Self-supervised Video-centralised Transformer for Video Face Clustering

Figure 3 for Self-supervised Video-centralised Transformer for Video Face Clustering

Figure 4 for Self-supervised Video-centralised Transformer for Video Face Clustering

Abstract:This paper presents a novel method for face clustering in videos using a video-centralised transformer. Previous works often employed contrastive learning to learn frame-level representation and used average pooling to aggregate the features along the temporal dimension. This approach may not fully capture the complicated video dynamics. In addition, despite the recent progress in video-based contrastive learning, few have attempted to learn a self-supervised clustering-friendly face representation that benefits the video face clustering task. To overcome these limitations, our method employs a transformer to directly learn video-level representations that can better reflect the temporally-varying property of faces in videos, while we also propose a video-centralised self-supervised framework to train the transformer model. We also investigate face clustering in egocentric videos, a fast-emerging field that has not been studied yet in works related to face clustering. To this end, we present and release the first large-scale egocentric video face clustering dataset named EasyCom-Clustering. We evaluate our proposed method on both the widely used Big Bang Theory (BBT) dataset and the new EasyCom-Clustering dataset. Results show the performance of our video-centralised transformer has surpassed all previous state-of-the-art methods on both benchmarks, exhibiting a self-attentive understanding of face videos.

Via

Access Paper or Ask Questions

FP-Age: Leveraging Face Parsing Attention for Facial Age Estimation in the Wild

Jun 21, 2021

Yiming Lin, Jie Shen, Yujiang Wang, Maja Pantic

Figure 1 for FP-Age: Leveraging Face Parsing Attention for Facial Age Estimation in the Wild

Figure 2 for FP-Age: Leveraging Face Parsing Attention for Facial Age Estimation in the Wild

Figure 3 for FP-Age: Leveraging Face Parsing Attention for Facial Age Estimation in the Wild

Figure 4 for FP-Age: Leveraging Face Parsing Attention for Facial Age Estimation in the Wild

Abstract:Image-based age estimation aims to predict a person's age from facial images. It is used in a variety of real-world applications. Although end-to-end deep models have achieved impressive results for age estimation on benchmark datasets, their performance in-the-wild still leaves much room for improvement due to the challenges caused by large variations in head pose, facial expressions, and occlusions. To address this issue, we propose a simple yet effective method to explicitly incorporate facial semantics into age estimation, so that the model would learn to correctly focus on the most informative facial components from unaligned facial images regardless of head pose and non-rigid deformation. To this end, we design a face parsing-based network to learn semantic information at different scales and a novel face parsing attention module to leverage these semantic features for age estimation. To evaluate our method on in-the-wild data, we also introduce a new challenging large-scale benchmark called IMDB-Clean. This dataset is created by semi-automatically cleaning the noisy IMDB-WIKI dataset using a constrained clustering method. Through comprehensive experiment on IMDB-Clean and other benchmark datasets, under both intra-dataset and cross-dataset evaluation protocols, we show that our method consistently outperforms all existing age estimation methods and achieves a new state-of-the-art performance. To the best of our knowledge, our work presents the first attempt of leveraging face parsing attention to achieve semantic-aware age estimation, which may be inspiring to other high level facial analysis tasks.

* Code and data will be available on https://github.com/hhj1897/age_estimation

Via

Access Paper or Ask Questions

Deep Polarization Imaging for 3D shape and SVBRDF Acquisition

May 06, 2021

Valentin Deschaintre, Yiming Lin, Abhijeet Ghosh

Figure 1 for Deep Polarization Imaging for 3D shape and SVBRDF Acquisition

Figure 2 for Deep Polarization Imaging for 3D shape and SVBRDF Acquisition

Figure 3 for Deep Polarization Imaging for 3D shape and SVBRDF Acquisition

Figure 4 for Deep Polarization Imaging for 3D shape and SVBRDF Acquisition

Abstract:We present a novel method for efficient acquisition of shape and spatially varying reflectance of 3D objects using polarization cues. Unlike previous works that have exploited polarization to estimate material or object appearance under certain constraints (known shape or multiview acquisition), we lift such restrictions by coupling polarization imaging with deep learning to achieve high quality estimate of 3D object shape (surface normals and depth) and SVBRDF using single-view polarization imaging under frontal flash illumination. In addition to acquired polarization images, we provide our deep network with strong novel cues related to shape and reflectance, in the form of a normalized Stokes map and an estimate of diffuse color. We additionally describe modifications to network architecture and training loss which provide further qualitative improvements. We demonstrate our approach to achieve superior results compared to recent works employing deep learning in conjunction with flash illumination.

* CVPR 2021 Oral paper

Via

Access Paper or Ask Questions

RoI Tanh-polar Transformer Network for Face Parsing in the Wild

Feb 04, 2021

Yiming Lin, Jie Shen, Yujiang Wang, Maja Pantic

Figure 1 for RoI Tanh-polar Transformer Network for Face Parsing in the Wild

Figure 2 for RoI Tanh-polar Transformer Network for Face Parsing in the Wild

Figure 3 for RoI Tanh-polar Transformer Network for Face Parsing in the Wild

Figure 4 for RoI Tanh-polar Transformer Network for Face Parsing in the Wild

Abstract:Face parsing aims to predict pixel-wise labels for facial components of a target face in an image. Existing approaches usually crop the target face from the input image with respect to a bounding box calculated during pre-processing, and thus can only parse inner facial Regions of Interest (RoIs). Peripheral regions like hair are ignored and nearby faces that are partially included in the bounding box can cause distractions. Moreover, these methods are only trained and evaluated on near-frontal portrait images and thus their performance for in-the-wild cases were unexplored. To address these issues, this paper makes three contributions. First, we introduce iBugMask dataset for face parsing in the wild containing 1,000 manually annotated images with large variations in sizes, poses, expressions and background, and Helen-LP, a large-pose training set containing 21,866 images generated using head pose augmentation. Second, we propose RoI Tanh-polar transform that warps the whole image to a Tanh-polar representation with a fixed ratio between the face area and the context, guided by the target bounding box. The new representation contains all information in the original image, and allows for rotation equivariance in the convolutional neural networks (CNNs). Third, we propose a hybrid residual representation learning block, coined HybridBlock, that contains convolutional layers in both the Tanh-polar space and the Tanh-Cartesian space, allowing for receptive fields of different shapes in CNNs. Through extensive experiments, we show that the proposed method significantly improves the state-of-the-art for face parsing in the wild.

* Code and data will be available on https://ibug.doc.ic.ac.uk/resources/ibugmask/

Via

Access Paper or Ask Questions

Dilated Convolutions with Lateral Inhibitions for Semantic Image Segmentation

Jun 16, 2020

Yujiang Wang, Mingzhi Dong, Jie Shen, Yiming Lin, Maja Pantic

Figure 1 for Dilated Convolutions with Lateral Inhibitions for Semantic Image Segmentation

Figure 2 for Dilated Convolutions with Lateral Inhibitions for Semantic Image Segmentation

Figure 3 for Dilated Convolutions with Lateral Inhibitions for Semantic Image Segmentation

Figure 4 for Dilated Convolutions with Lateral Inhibitions for Semantic Image Segmentation

Abstract:Dilated convolutions are widely used in deep semantic segmentation models as they can enlarge the filters' receptive field without adding additional weights nor sacrificing spatial resolution. However, as dilated convolutional filters do not possess positional knowledge about the pixels on semantically meaningful contours, they could lead to ambiguous predictions on object boundaries. In addition, although dilating the filter can expand its receptive field, the total number of sampled pixels remains unchanged, which usually comprises a small fraction of the receptive field's total area. Inspired by the Lateral Inhibition (LI) mechanisms in human visual systems, we propose the dilated convolution with lateral inhibitions (LI-Convs) to overcome these limitations. Introducing LI mechanisms improves the convolutional filter's sensitivity to semantic object boundaries. Moreover, since LI-Convs also implicitly take the pixels from the laterally inhibited zones into consideration, they can also extract features at a denser scale. By integrating LI-Convs into the Deeplabv3+ architecture, we propose the Lateral Inhibited Atrous Spatial Pyramid Pooling (LI-ASPP) and the Lateral Inhibited MobileNet-V2 (LI-MNV2). Experimental results on three benchmark datasets (PASCAL VOC 2012, CelebAMask-HQ and ADE20K) show that our LI-based segmentation models outperform the baseline on all of them, thus verify the effectiveness and generality of the proposed LI-Convs.

Via

Access Paper or Ask Questions

Mobile Face Tracking: A Survey and Benchmark

May 24, 2018

Yiming Lin, Jie Shen, Shiyang Cheng, Maja Pantic

Figure 1 for Mobile Face Tracking: A Survey and Benchmark

Figure 2 for Mobile Face Tracking: A Survey and Benchmark

Figure 3 for Mobile Face Tracking: A Survey and Benchmark

Figure 4 for Mobile Face Tracking: A Survey and Benchmark

Abstract:With the rapid development of smartphones, facial analysis has been playing an increasingly important role in a multitude of mobile applications. In most scenarios, face tracking serves as a crucial first step because more often than not, a mobile application would only need to focus on analysing a specific face in a complex setting. Albeit inheriting many commons traits of the generic visual tracking problem, face tracking in mobile scenarios is characterised by a unique set of challenges. In this work, we propose iBUG MobiFace benchmark, the first mobile face tracking benchmark consisting of 50 sequences captured by smartphone users in unconstrained environments. The sequences contain a total of 50,736 frames with 46 distinct identities to be tracked. The tracking target in each sequence is selected with varying difficulties in mobile scenarios. In addition to frame by frame bounding box, the annotations of 9 sequence attributes(e.g. multiple faces) are provided. We further provide a survey of 23 state-of-the-art visual trackers and a comprehensive quantitative evaluation of these methods on the proposed benchmark. In particular, trackers from two most popular frameworks, namely, correlation filter-based tracking and deep learning-based tracking, are studied. Our experiment shows that (a) the performance of all existing generic object trackers drops significantly on the mobile face tracking scenario, suggesting the need of more research effort into mobile face tracking, and (b) the effective combination of deep learning tracking and face-related algorithms(e.g. face detection) provides the most promising basis for future developments in the field. The database, annotations and evaluation protocol/code will be made publicly available on the iBUG website.

* 13 pages, 6 figures

Via

Access Paper or Ask Questions