Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Siwen Luo

Multimodal Commonsense Knowledge Distillation for Visual Question Answering

Nov 05, 2024

Shuo Yang, Siwen Luo, Soyeon Caren Han

Abstract:Existing Multimodal Large Language Models (MLLMs) and Visual Language Pretrained Models (VLPMs) have shown remarkable performances in the general Visual Question Answering (VQA). However, these models struggle with VQA questions that require external commonsense knowledge due to the challenges in generating high-quality prompts and the high computational costs of fine-tuning. In this work, we propose a novel graph-based multimodal commonsense knowledge distillation framework that constructs a unified relational graph over commonsense knowledge, visual objects and questions through a Graph Convolutional Network (GCN) following a teacher-student environment. This proposed framework is flexible with any type of teacher and student models without further fine-tuning, and has achieved competitive performances on the ScienceQA dataset.

* AAAI 2025 (Accepted, Oral)

Via

Access Paper or Ask Questions

'No' Matters: Out-of-Distribution Detection in Multimodality Long Dialogue

Oct 31, 2024

Rena Gao, Xuetong Wu, Siwen Luo, Caren Han, Feng Liu

Figure 1 for 'No' Matters: Out-of-Distribution Detection in Multimodality Long Dialogue

Figure 2 for 'No' Matters: Out-of-Distribution Detection in Multimodality Long Dialogue

Figure 3 for 'No' Matters: Out-of-Distribution Detection in Multimodality Long Dialogue

Figure 4 for 'No' Matters: Out-of-Distribution Detection in Multimodality Long Dialogue

Abstract:Out-of-distribution (OOD) detection in multimodal contexts is essential for identifying deviations in combined inputs from different modalities, particularly in applications like open-domain dialogue systems or real-life dialogue interactions. This paper aims to improve the user experience that involves multi-round long dialogues by efficiently detecting OOD dialogues and images. We introduce a novel scoring framework named Dialogue Image Aligning and Enhancing Framework (DIAEF) that integrates the visual language models with the novel proposed scores that detect OOD in two key scenarios (1) mismatches between the dialogue and image input pair and (2) input pairs with previously unseen labels. Our experimental results, derived from various benchmarks, demonstrate that integrating image and multi-round dialogue OOD detection is more effective with previously unseen labels than using either modality independently. In the presence of mismatched pairs, our proposed score effectively identifies these mismatches and demonstrates strong robustness in long dialogues. This approach enhances domain-aware, adaptive conversational agents and establishes baselines for future studies.

* 16 pages, 5 figures

Via

Access Paper or Ask Questions

3M-Health: Multimodal Multi-Teacher Knowledge Distillation for Mental Health Detection

Jul 12, 2024

Rina Carines Cabral, Siwen Luo, Soyeon Caren Han, Josiah Poon

Figure 1 for 3M-Health: Multimodal Multi-Teacher Knowledge Distillation for Mental Health Detection

Figure 2 for 3M-Health: Multimodal Multi-Teacher Knowledge Distillation for Mental Health Detection

Figure 3 for 3M-Health: Multimodal Multi-Teacher Knowledge Distillation for Mental Health Detection

Figure 4 for 3M-Health: Multimodal Multi-Teacher Knowledge Distillation for Mental Health Detection

Abstract:The significance of mental health classification is paramount in contemporary society, where digital platforms serve as crucial sources for monitoring individuals' well-being. However, existing social media mental health datasets primarily consist of text-only samples, potentially limiting the efficacy of models trained on such data. Recognising that humans utilise cross-modal information to comprehend complex situations or issues, we present a novel approach to address the limitations of current methodologies. In this work, we introduce a Multimodal and Multi-Teacher Knowledge Distillation model for Mental Health Classification, leveraging insights from cross-modal human understanding. Unlike conventional approaches that often rely on simple concatenation to integrate diverse features, our model addresses the challenge of appropriately representing inputs of varying natures (e.g., texts and sounds). To mitigate the computational complexity associated with integrating all features into a single model, we employ a multimodal and multi-teacher architecture. By distributing the learning process across multiple teachers, each specialising in a particular feature extraction aspect, we enhance the overall mental health classification performance. Through experimental validation, we demonstrate the efficacy of our model in achieving improved performance. All relevant codes will be made available upon publication.

Via

Access Paper or Ask Questions

PDF-MVQA: A Dataset for Multimodal Information Retrieval in-based Visual Question Answering

Apr 19, 2024

Yihao Ding, Kaixuan Ren, Jiabin Huang, Siwen Luo, Soyeon Caren Han

Figure 1 for PDF-MVQA: A Dataset for Multimodal Information Retrieval in-based Visual Question Answering

Figure 2 for PDF-MVQA: A Dataset for Multimodal Information Retrieval in-based Visual Question Answering

Figure 3 for PDF-MVQA: A Dataset for Multimodal Information Retrieval in-based Visual Question Answering

Figure 4 for PDF-MVQA: A Dataset for Multimodal Information Retrieval in-based Visual Question Answering

Abstract:Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD), particularly those dominated by lengthy textual content like research journal articles. Existing studies primarily focus on real-world documents with sparse text, while challenges persist in comprehending the hierarchical semantic relations among multiple pages to locate multimodal components. To address this gap, we propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval. Unlike traditional machine reading comprehension (MRC) tasks, our approach aims to retrieve entire paragraphs containing answers or visually rich document entities like tables and figures. Our contributions include the introduction of a comprehensive PDF Document VQA dataset, allowing the examination of semantically hierarchical layout structures in text-dominant documents. We also present new VRD-QA frameworks designed to grasp textual contents and relations among document layouts simultaneously, extending page-level understanding to the entire multi-page document. Through this work, we aim to enhance the capabilities of existing vision-and-language models in handling challenges posed by text-dominant documents in VRD-QA.

* Accepted by IJCAI 2024

Via

Access Paper or Ask Questions

Workshop on Document Intelligence Understanding

Jul 31, 2023

Soyeon Caren Han, Yihao Ding, Siwen Luo, Josiah Poon, HeeGuen Yoon, Zhe Huang, Paul Duuring, Eun Jung Holden

Figure 1 for Workshop on Document Intelligence Understanding

Figure 2 for Workshop on Document Intelligence Understanding

Figure 3 for Workshop on Document Intelligence Understanding

Abstract:Document understanding and information extraction include different tasks to understand a document and extract valuable information automatically. Recently, there has been a rising demand for developing document understanding among different domains, including business, law, and medicine, to boost the efficiency of work that is associated with a large number of documents. This workshop aims to bring together researchers and industry developers in the field of document intelligence and understanding diverse document types to boost automatic document processing and understanding techniques. We also released a data challenge on the recently introduced document-level VQA dataset, PDFVQA. The PDFVQA challenge examines the structural and contextual understandings of proposed models on the natural full document level of multiple consecutive document pages by including questions with a sequence of answers extracted from multi-pages of the full document. This task helps to boost the document understanding step from the single-page level to the full document level understanding.

* Accepted at CIKM 2023; Orgnised in CIKM

Via

Access Paper or Ask Questions

PDFVQA: A New Dataset for Real-World VQA on Documents

Apr 24, 2023

Yihao Ding, Siwen Luo, Hyunsuk Chung, Soyeon Caren Han

Figure 1 for PDFVQA: A New Dataset for Real-World VQA on Documents

Figure 2 for PDFVQA: A New Dataset for Real-World VQA on Documents

Figure 3 for PDFVQA: A New Dataset for Real-World VQA on Documents

Figure 4 for PDFVQA: A New Dataset for Real-World VQA on Documents

Abstract:Document-based Visual Question Answering examines the document understanding of document images in conditions of natural language questions. We proposed a new document-based VQA dataset, PDF-VQA, to comprehensively examine the document understanding from various aspects, including document element recognition, document layout structural understanding as well as contextual understanding and key information extraction. Our PDF-VQA dataset extends the current scale of document understanding that limits on the single document page to the new scale that asks questions over the full document of multiple pages. We also propose a new graph-based VQA model that explicitly integrates the spatial and hierarchically structural relationships between different document elements to boost the document structural understanding. The performances are compared with several baselines over different question types and tasks\footnote{The full dataset will be released after paper acceptance.

* Work in progress

Via

Access Paper or Ask Questions

SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Dec 16, 2022

Siwen Luo, Feiqi Cao, Felipe Nunez, Zean Wen, Josiah Poon, Caren Han

Figure 1 for SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Figure 2 for SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Figure 3 for SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Figure 4 for SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Abstract:Most TextVQA approaches focus on the integration of objects, scene texts and question words by a simple transformer encoder. But this fails to capture the semantic relations between different modalities. The paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among the objects, Optical Character Recognition (OCR) tokens and the question words. It is achieved by a TextVQA-based scene graph that discovers the underlying semantics of an image. We created a guided-attention module to capture the intra-modal interplay between the language and the vision as a guidance for inter-modal interactions. To make explicit teaching of the relations between the two modalities, we proposed and integrated two attention modules, namely a scene graph-based semantic relation-aware attention and a positional relation-aware attention. We conducted extensive experiments on two benchmark datasets, Text-VQA and ST-VQA. It is shown that our SceneGATE method outperformed existing ones because of the scene graph and its attention modules.

Via

Access Paper or Ask Questions

PiggyBack: Pretrained Visual Question Answering Environment for Backing up Non-deep Learning Professionals

Dec 01, 2022

Zhihao Zhang, Siwen Luo, Junyi Chen, Sijia Lai, Siqu Long, Hyunsuk Chung, Soyeon Caren Han

Figure 1 for PiggyBack: Pretrained Visual Question Answering Environment for Backing up Non-deep Learning Professionals

Figure 2 for PiggyBack: Pretrained Visual Question Answering Environment for Backing up Non-deep Learning Professionals

Figure 3 for PiggyBack: Pretrained Visual Question Answering Environment for Backing up Non-deep Learning Professionals

Abstract:We propose a PiggyBack, a Visual Question Answering platform that allows users to apply the state-of-the-art visual-language pretrained models easily. The PiggyBack supports the full stack of visual question answering tasks, specifically data processing, model fine-tuning, and result visualisation. We integrate visual-language models, pretrained by HuggingFace, an open-source API platform of deep learning technologies; however, it cannot be runnable without programming skills or deep learning understanding. Hence, our PiggyBack supports an easy-to-use browser-based user interface with several deep learning visual language pretrained models for general users and domain experts. The PiggyBack includes the following benefits: Free availability under the MIT License, Portability due to web-based and thus runs on almost any platform, A comprehensive data creation and processing technique, and ease of use on deep learning-based visual language pretrained models. The demo video is available on YouTube and can be found at https://youtu.be/iz44RZ1lF4s.

* Accepted by WSDM 2023

Via

Access Paper or Ask Questions

Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis

Aug 22, 2022

Siwen Luo, Yihao Ding, Siqu Long, Soyeon Caren Han, Josiah Poon

Figure 1 for Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis

Figure 2 for Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis

Figure 3 for Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis

Figure 4 for Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis

Abstract:Recognizing the layout of unstructured digital documents is crucial when parsing the documents into the structured, machine-readable format for downstream applications. Recent studies in Document Layout Analysis usually rely on computer vision models to understand documents while ignoring other information, such as context information or relation of document components, which are vital to capture. Our Doc-GCN presents an effective way to harmonize and integrate heterogeneous aspects for Document Layout Analysis. We first construct graphs to explicitly describe four main aspects, including syntactic, semantic, density, and appearance/visual information. Then, we apply graph convolutional networks for representing each aspect of information and use pooling to integrate them. Finally, we aggregate each aspect and feed them into 2-layer MLPs for document layout component classification. Our Doc-GCN achieves new state-of-the-art results in three widely used DLA datasets.

* Accepted by COLING 2022

Via

Access Paper or Ask Questions

Local Interpretations for Explainable Natural Language Processing: A Survey

Mar 20, 2021

Siwen Luo, Hamish Ivison, Caren Han, Josiah Poon

Figure 1 for Local Interpretations for Explainable Natural Language Processing: A Survey

Figure 2 for Local Interpretations for Explainable Natural Language Processing: A Survey

Figure 3 for Local Interpretations for Explainable Natural Language Processing: A Survey

Figure 4 for Local Interpretations for Explainable Natural Language Processing: A Survey

Abstract:As the use of deep learning techniques has grown across various fields over the past decade, complaints about the opaqueness of the black-box models have increased, resulting in an increased focus on transparency in deep learning models. This work investigates various methods to improve the interpretability of deep neural networks for natural language processing (NLP) tasks, including machine translation and sentiment analysis. We provide a comprehensive discussion on the definition of the term \textit{interpretability} and its various aspects at the beginning of this work. The methods collected and summarised in this survey are only associated with local interpretation and are divided into three categories: 1) explaining the model's predictions through related input features; 2) explaining through natural language explanation; 3) probing the hidden states of models and word representations.

* This work is an initial draft

Via

Access Paper or Ask Questions