Yue Dai

Enhancing Document Key Information Localization Through Data Augmentation

Feb 10, 2025

Yue Dai

Abstract: The Visually Rich Form Document Intelligence and Understanding (VRDIU) Track B focuses on the localization of key information in document images. The goal is to develop a method capable of localizing objects in both digital and handwritten documents, using only digital documents for training. This paper presents a simple yet effective approach that includes a document augmentation phase and an object detection phase. Specifically, we augment the training set of digital documents by mimicking the appearance of handwritten documents. Our experiments demonstrate that this pipeline enhances the models' generalization ability and achieves high performance in the competition.

* Accepted as a workshop paper in DOCUI-AAAI2025

Multimodal Graph Contrastive Learning and Prompt for ChartQA

Jan 08, 2025

Yue Dai, Soyeon Caren Han, Wei Liu

Abstract: ChartQA presents significant challenges due to the complex distribution of chart elements and the implicit patterns embedded within the underlying data. In this chapter, we have developed a joint multimodal scene graph for charts, explicitly representing the relationships between chart elements and their associated patterns. Our proposed multimodal scene graph consists of two components: a visual graph and a textual graph, each designed to capture the structural and semantic information within the chart. To unify representations across these different modalities, we introduce a multimodal graph contrastive learning approach that learns unified representations by maximizing similarity between nodes representing the same object across multimodal graphs. The learned graph representations can be seamlessly incorporated into a transformer decoder as a soft prompt. Additionally, given the growing need for Multimodal Large Language Models (MLLMs) in zero-shot scenarios, we have designed Chain-of-Thought (CoT) prompts for MLLMs to reduce hallucinations. We tested both methods on public benchmarks such as ChartQA, OpenCQA, and ChartX, demonstrating improved performance and validating the effectiveness of our proposed methods.
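
The idea of maximizing similarity between nodes that represent the same object in the two graphs can be illustrated with a generic InfoNCE-style loss. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not name the loss, and the function names, the temperature value, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(visual_nodes, textual_nodes, temperature=0.1):
    """Mean InfoNCE loss over node pairs.

    Assumption: visual_nodes[i] and textual_nodes[i] describe the same
    chart object, so pair (i, i) is the positive and every other textual
    node acts as a negative for visual node i.
    """
    assert len(visual_nodes) == len(textual_nodes)
    total = 0.0
    for i, v in enumerate(visual_nodes):
        logits = [cosine(v, t) / temperature for t in textual_nodes]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(visual_nodes)


# Toy embeddings (hypothetical): two objects, one node per graph each.
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(visual, [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce(visual, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

The loss is small when each visual node is closest to its own textual counterpart and grows when the correspondence is broken, which is the behavior a contrastive objective of this kind rewards.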

ChuLo: Chunk-Level Key Information Representation for Long Document Processing

Oct 14, 2024

Yan Li, Caren Han, Yue Dai, Feiqi Cao

Abstract: Transformer-based models have achieved remarkable success in various Natural Language Processing (NLP) tasks, yet their ability to handle long documents is constrained by computational limitations. Traditional approaches, such as truncating inputs, sparse self-attention, and chunking, attempt to mitigate these issues, but they often lead to information loss and hinder the model's ability to capture long-range dependencies. In this paper, we introduce ChuLo, a novel chunk representation method for long document classification that addresses these limitations. ChuLo groups input tokens using unsupervised keyphrase extraction, emphasizing semantically important keyphrase-based chunks to retain core document content while reducing input length. This approach minimizes information loss and improves the efficiency of Transformer-based models. Preserving all tokens is particularly important in long document understanding, and above all in token classification tasks, so that fine-grained annotations, which depend on the entire sequence context, are not lost. We evaluate our method on multiple long document classification tasks and long document token classification tasks, demonstrating its effectiveness through comprehensive qualitative and quantitative analyses.
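
As a rough illustration of how chunking can shorten the input while keeping every token and still emphasizing keyphrases, the sketch below averages token vectors within fixed-size chunks and gives keyphrase tokens a larger weight. It is an assumption-laden toy, not ChuLo itself: the abstract does not specify the weighting scheme, and `chunk_representations`, `chunk_size`, `key_weight`, and the keyphrase flags (which would come from an unsupervised extractor) are all hypothetical.

```python
def chunk_representations(token_vectors, is_keyphrase, chunk_size=4, key_weight=3.0):
    """Collapse a token sequence into one weighted-mean vector per chunk.

    Every token contributes to its chunk; tokens flagged as part of a
    keyphrase are weighted `key_weight` times more than the others.
    """
    chunks = []
    for start in range(0, len(token_vectors), chunk_size):
        vectors = token_vectors[start:start + chunk_size]
        flags = is_keyphrase[start:start + chunk_size]
        weights = [key_weight if flag else 1.0 for flag in flags]
        total = sum(weights)
        dim = len(vectors[0])
        chunks.append(
            [sum(w * vec[d] for w, vec in zip(weights, vectors)) / total
             for d in range(dim)]
        )
    return chunks


# Five toy token embeddings; only the first token is a keyphrase token.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
flags = [True, False, False, False, False]
reps = chunk_representations(tokens, flags, chunk_size=4, key_weight=3.0)
print(len(reps))  # five tokens become two chunk vectors
```

The sequence passed to the Transformer shrinks by roughly a factor of `chunk_size`, while no token is dropped outright, in contrast to truncation.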

* Submitted to ICLR 2025

MSG-Chart: Multimodal Scene Graph for ChartQA

Aug 09, 2024

Yue Dai, Soyeon Caren Han, Wei Liu

Abstract: Automatic Chart Question Answering (ChartQA) is challenging because chart elements follow a complex distribution and the patterns of the underlying data are not explicitly displayed in charts. To address this challenge, we design a joint multimodal scene graph for charts to explicitly represent the relationships between chart elements and their patterns. Our proposed multimodal scene graph includes a visual graph and a textual graph to jointly capture the structural and semantic knowledge from the chart. This graph module can be easily integrated with different vision transformers as an inductive bias. Our experiments demonstrate that incorporating the proposed graph module enhances the understanding of the structure and semantics of chart elements, thereby improving performance on the publicly available benchmarks ChartQA and OpenCQA.

* Accepted by CIKM Short 2024

EdgeOL: Efficient in-situ Online Learning on Edge Devices

Jan 30, 2024

Sheng Li, Geng Yuan, Yawen Wu, Yue Dai, Chao Wu, Alex K. Jones, Jingtong Hu, Yanzhi Wang, Xulong Tang

Abstract: Emerging applications, such as robot-assisted eldercare and object recognition, generally employ deep neural network (DNN) models and naturally require: i) handling streaming-in inference requests and ii) adapting to possible deployment scenario changes. Online model fine-tuning is widely adopted to satisfy these needs. However, fine-tuning involves significant energy consumption, making it challenging to deploy on edge devices. In this paper, we propose EdgeOL, an edge online learning framework that optimizes inference accuracy, fine-tuning execution time, and energy efficiency through both inter-tuning and intra-tuning optimizations. Experimental results show that, on average, EdgeOL reduces overall fine-tuning execution time by 82%, reduces energy consumption by 74%, and improves average inference accuracy by 1.70% over the immediate online learning strategy.

SmartFRZ: An Efficient Training Framework using Attention-Based Layer Freezing

Jan 30, 2024

Sheng Li, Geng Yuan, Yue Dai, Youtao Zhang, Yanzhi Wang, Xulong Tang

Abstract: There has been a proliferation of artificial intelligence applications, where model training is key to delivering high-quality services. However, the model training process is both time-intensive and energy-intensive, which inevitably conflicts with users' demand for application efficiency. Layer freezing, an efficient model training technique, has been proposed to improve training efficiency. Although existing layer freezing methods demonstrate great potential to reduce model training costs, they still have shortcomings such as limited generalizability and compromised accuracy. For instance, existing layer freezing methods either require the freeze configurations to be manually defined before training, which does not transfer across networks, or use heuristic freezing criteria that cannot reliably guarantee decent accuracy in different scenarios. There is therefore no generic and smart layer freezing method that can automatically perform "in-situation" layer freezing for different networks during training. To this end, we propose a generic and efficient training framework (SmartFRZ). The core technique in SmartFRZ is attention-guided layer freezing, which can automatically select the appropriate layers to freeze without compromising accuracy. Experimental results show that SmartFRZ effectively reduces the amount of computation in training, achieves significant training acceleration, and outperforms the state-of-the-art layer freezing approaches.

ImageArg: A Multi-modal Tweet Dataset for Image Persuasiveness Mining

Sep 14, 2022

Zhexiong Liu, Meiqi Guo, Yue Dai, Diane Litman

Abstract: The growing interest in developing corpora of persuasive texts has promoted applications in automated systems, e.g., debating and essay scoring systems; however, there is little prior work mining image persuasiveness from an argumentative perspective. To expand persuasiveness mining into a multi-modal realm, we present a multi-modal dataset, ImageArg, consisting of annotations of image persuasiveness in tweets. The annotations are based on a persuasion taxonomy we developed to explore image functionalities and the means of persuasion. We benchmark image persuasiveness tasks on ImageArg using widely-used multi-modal learning methods. The experimental results show that our dataset offers a useful resource for this rich and challenging topic, and there is ample room for modeling improvement.

* In Argument Mining Workshop, held in conjunction with the International Conference on Computational Linguistics (COLING), October 2022

An Analysis of Deep Reinforcement Learning Agents for Text-based Games

Sep 12, 2022

Chen Chen, Yue Dai, Josiah Poon, Caren Han

Abstract: Text-based games (TBG) are complex environments that allow users or computer agents to interact through text and achieve game goals. In TBG agent design and training, balancing the efficiency and performance of the agent models is a major challenge. Measuring the performance of TBG agent deep learning modules in standardized environments, and testing them across different evaluation types, is also important for TBG agent research. We constructed a standardized TBG agent with no hand-crafted rules, formally categorized TBG evaluation types, and analyzed selected methods in our environment.
