Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yihao Ding

Beyond Perception: Evaluating Abstract Visual Reasoning through Multi-Stage Task

May 28, 2025

Yanbei Jiang, Yihao Ding, Chao Lei, Jiayang Ao, Jey Han Lau, Krista A. Ehinger

Abstract:Current Multimodal Large Language Models (MLLMs) excel in general visual reasoning but remain underexplored in Abstract Visual Reasoning (AVR), which demands higher-order reasoning to identify abstract rules beyond simple perception. Existing AVR benchmarks focus on single-step reasoning, emphasizing the end result but neglecting the multi-stage nature of reasoning process. Past studies found MLLMs struggle with these benchmarks, but it doesn't explain how they fail. To address this gap, we introduce MultiStAR, a Multi-Stage AVR benchmark, based on RAVEN, designed to assess reasoning across varying levels of complexity. Additionally, existing metrics like accuracy only focus on the final outcomes while do not account for the correctness of intermediate steps. Therefore, we propose a novel metric, MSEval, which considers the correctness of intermediate steps in addition to the final outcomes. We conduct comprehensive experiments on MultiStAR using 17 representative close-source and open-source MLLMs. The results reveal that while existing MLLMs perform adequately on basic perception tasks, they continue to face challenges in more complex rule detection stages.

* Accepted at ACL Findings

Via

Access Paper or Ask Questions

Natural Language Processing in Support of Evidence-based Medicine: A Scoping Review

May 28, 2025

Zihan Xu, Haotian Ma, Gongbo Zhang, Yihao Ding, Chunhua Weng, Yifan Peng

Abstract:Evidence-based medicine (EBM) is at the forefront of modern healthcare, emphasizing the use of the best available scientific evidence to guide clinical decisions. Due to the sheer volume and rapid growth of medical literature and the high cost of curation, there is a critical need to investigate Natural Language Processing (NLP) methods to identify, appraise, synthesize, summarize, and disseminate evidence in EBM. This survey presents an in-depth review of 129 research studies on leveraging NLP for EBM, illustrating its pivotal role in enhancing clinical decision-making processes. The paper systematically explores how NLP supports the five fundamental steps of EBM -- Ask, Acquire, Appraise, Apply, and Assess. The review not only identifies current limitations within the field but also proposes directions for future research, emphasizing the potential for NLP to revolutionize EBM by refining evidence extraction, evidence synthesis, appraisal, summarization, enhancing data comprehensibility, and facilitating a more efficient clinical workflow.

* Accepted by ACL 2025 Findings

Via

Access Paper or Ask Questions

GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector

Mar 19, 2025

Zechuan Li, Hongshan Yu, Yihao Ding, Jinhao Qiao, Basim Azam, Naveed Akhtar

Abstract:We propose GO-N3RDet, a scene-geometry optimized multi-view 3D object detector enhanced by neural radiance fields. The key to accurate 3D object detection is in effective voxel representation. However, due to occlusion and lack of 3D information, constructing 3D features from multi-view 2D images is challenging. Addressing that, we introduce a unique 3D positional information embedded voxel optimization mechanism to fuse multi-view features. To prioritize neural field reconstruction in object regions, we also devise a double importance sampling scheme for the NeRF branch of our detector. We additionally propose an opacity optimization module for precise voxel opacity prediction by enforcing multi-view consistency constraints. Moreover, to further improve voxel density consistency across multiple perspectives, we incorporate ray distance as a weighting factor to minimize cumulative ray errors. Our unique modules synergetically form an end-to-end neural model that establishes new state-of-the-art in NeRF-based multi-view 3D detection, verified with extensive experiments on ScanNet and ARKITScenes. Code will be available at https://github.com/ZechuanLi/GO-N3RDet.

* Accepted by CVPR2025

Via

Access Paper or Ask Questions

DAViD: Domain Adaptive Visually-Rich Document Understanding with Synthetic Insights

Oct 02, 2024

Yihao Ding, Soyeon Caren Han, Zechuan Li, Hyunsuk Chung

Figure 1 for DAViD: Domain Adaptive Visually-Rich Document Understanding with Synthetic Insights

Figure 2 for DAViD: Domain Adaptive Visually-Rich Document Understanding with Synthetic Insights

Figure 3 for DAViD: Domain Adaptive Visually-Rich Document Understanding with Synthetic Insights

Figure 4 for DAViD: Domain Adaptive Visually-Rich Document Understanding with Synthetic Insights

Abstract:Visually-Rich Documents (VRDs), encompassing elements like charts, tables, and references, convey complex information across various fields. However, extracting information from these rich documents is labor-intensive, especially given their inconsistent formats and domain-specific requirements. While pretrained models for VRD Understanding have progressed, their reliance on large, annotated datasets limits scalability. This paper introduces the Domain Adaptive Visually-rich Document Understanding (DAViD) framework, which utilises machine-generated synthetic data for domain adaptation. DAViD integrates fine-grained and coarse-grained document representation learning and employs synthetic annotations to reduce the need for costly manual labelling. By leveraging pretrained models and synthetic data, DAViD achieves competitive performance with minimal annotated datasets. Extensive experiments validate DAViD's effectiveness, demonstrating its ability to efficiently adapt to domain-specific VRDU tasks.

* Work in progress

Via

Access Paper or Ask Questions

Deep Learning based Visually Rich Document Content Understanding: A Survey

Aug 02, 2024

Yihao Ding, Jean Lee, Soyeon Caren Han

Abstract:Visually Rich Documents (VRDs) are essential in academia, finance, medical fields, and marketing due to their multimodal information content. Traditional methods for extracting information from VRDs depend on expert knowledge and manual labor, making them costly and inefficient. The advent of deep learning has revolutionized this process, introducing models that leverage multimodal information vision, text, and layout along with pretraining tasks to develop comprehensive document representations. These models have achieved state-of-the-art performance across various downstream tasks, significantly enhancing the efficiency and accuracy of information extraction from VRDs. In response to the growing demands and rapid developments in Visually Rich Document Understanding (VRDU), this paper provides a comprehensive review of deep learning-based VRDU frameworks. We systematically survey and analyze existing methods and benchmark datasets, categorizing them based on adopted strategies and downstream tasks. Furthermore, we compare different techniques used in VRDU models, focusing on feature representation and fusion, model architecture, and pretraining methods, while highlighting their strengths, limitations, and appropriate scenarios. Finally, we identify emerging trends and challenges in VRDU, offering insights into future research directions and practical applications. This survey aims to provide a thorough understanding of VRDU advancements, benefiting both academic and industrial sectors.

* Work in Progress

Via

Access Paper or Ask Questions

PDF-MVQA: A Dataset for Multimodal Information Retrieval in-based Visual Question Answering

Apr 19, 2024

Yihao Ding, Kaixuan Ren, Jiabin Huang, Siwen Luo, Soyeon Caren Han

Figure 1 for PDF-MVQA: A Dataset for Multimodal Information Retrieval in-based Visual Question Answering

Figure 2 for PDF-MVQA: A Dataset for Multimodal Information Retrieval in-based Visual Question Answering

Figure 3 for PDF-MVQA: A Dataset for Multimodal Information Retrieval in-based Visual Question Answering

Figure 4 for PDF-MVQA: A Dataset for Multimodal Information Retrieval in-based Visual Question Answering

Abstract:Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD), particularly those dominated by lengthy textual content like research journal articles. Existing studies primarily focus on real-world documents with sparse text, while challenges persist in comprehending the hierarchical semantic relations among multiple pages to locate multimodal components. To address this gap, we propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval. Unlike traditional machine reading comprehension (MRC) tasks, our approach aims to retrieve entire paragraphs containing answers or visually rich document entities like tables and figures. Our contributions include the introduction of a comprehensive PDF Document VQA dataset, allowing the examination of semantically hierarchical layout structures in text-dominant documents. We also present new VRD-QA frameworks designed to grasp textual contents and relations among document layouts simultaneously, extending page-level understanding to the entire multi-page document. Through this work, we aim to enhance the capabilities of existing vision-and-language models in handling challenges posed by text-dominant documents in VRD-QA.

* Accepted by IJCAI 2024

Via

Access Paper or Ask Questions

M3-VRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding

Feb 28, 2024

Yihao Ding, Lorenzo Vaiani, Caren Han, Jean Lee, Paolo Garza, Josiah Poon, Luca Cagliero

Figure 1 for M3-VRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding

Figure 2 for M3-VRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding

Figure 3 for M3-VRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding

Figure 4 for M3-VRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding

Abstract:This paper presents a groundbreaking multimodal, multi-task, multi-teacher joint-grained knowledge distillation model for visually-rich form document understanding. The model is designed to leverage insights from both fine-grained and coarse-grained levels by facilitating a nuanced correlation between token and entity representations, addressing the complexities inherent in form documents. Additionally, we introduce new inter-grained and cross-grained loss functions to further refine diverse multi-teacher knowledge distillation transfer process, presenting distribution gaps and a harmonised understanding of form documents. Through a comprehensive evaluation across publicly available form document understanding datasets, our proposed model consistently outperforms existing baselines, showcasing its efficacy in handling the intricate structures and content of visually complex form documents.

* Work in progress

Via

Access Paper or Ask Questions

Workshop on Document Intelligence Understanding

Jul 31, 2023

Soyeon Caren Han, Yihao Ding, Siwen Luo, Josiah Poon, HeeGuen Yoon, Zhe Huang, Paul Duuring, Eun Jung Holden

Figure 1 for Workshop on Document Intelligence Understanding

Figure 2 for Workshop on Document Intelligence Understanding

Figure 3 for Workshop on Document Intelligence Understanding

Abstract:Document understanding and information extraction include different tasks to understand a document and extract valuable information automatically. Recently, there has been a rising demand for developing document understanding among different domains, including business, law, and medicine, to boost the efficiency of work that is associated with a large number of documents. This workshop aims to bring together researchers and industry developers in the field of document intelligence and understanding diverse document types to boost automatic document processing and understanding techniques. We also released a data challenge on the recently introduced document-level VQA dataset, PDFVQA. The PDFVQA challenge examines the structural and contextual understandings of proposed models on the natural full document level of multiple consecutive document pages by including questions with a sequence of answers extracted from multi-pages of the full document. This task helps to boost the document understanding step from the single-page level to the full document level understanding.

* Accepted at CIKM 2023; Orgnised in CIKM

Via

Access Paper or Ask Questions

Graph Neural Networks for Text Classification: A Survey

Apr 27, 2023

Kunze Wang, Yihao Ding, Soyeon Caren Han

Abstract:Text Classification is the most essential and fundamental problem in Natural Language Processing. While numerous recent text classification models applied the sequential deep learning technique, graph neural network-based models can directly deal with complex structured text data and exploit global information. Many real text classification applications can be naturally cast into a graph, which captures words, documents, and corpus global features. In this survey, we bring the coverage of methods up to 2023, including corpus-level and document-level graph neural networks. We discuss each of these methods in detail, dealing with the graph construction mechanisms and the graph-based learning process. As well as the technological survey, we look at issues behind and future directions addressed in text classification using graph neural networks. We also cover datasets, evaluation metrics, and experiment design and present a summary of published performance on the publicly available benchmarks. Note that we present a comprehensive comparison between different techniques and identify the pros and cons of various evaluation metrics in this survey.

* 28 pages

Via

Access Paper or Ask Questions

PDFVQA: A New Dataset for Real-World VQA on Documents

Apr 24, 2023

Yihao Ding, Siwen Luo, Hyunsuk Chung, Soyeon Caren Han

Figure 1 for PDFVQA: A New Dataset for Real-World VQA on Documents

Figure 2 for PDFVQA: A New Dataset for Real-World VQA on Documents

Figure 3 for PDFVQA: A New Dataset for Real-World VQA on Documents

Figure 4 for PDFVQA: A New Dataset for Real-World VQA on Documents

Abstract:Document-based Visual Question Answering examines the document understanding of document images in conditions of natural language questions. We proposed a new document-based VQA dataset, PDF-VQA, to comprehensively examine the document understanding from various aspects, including document element recognition, document layout structural understanding as well as contextual understanding and key information extraction. Our PDF-VQA dataset extends the current scale of document understanding that limits on the single document page to the new scale that asks questions over the full document of multiple pages. We also propose a new graph-based VQA model that explicitly integrates the spatial and hierarchically structural relationships between different document elements to boost the document structural understanding. The performances are compared with several baselines over different question types and tasks\footnote{The full dataset will be released after paper acceptance.

* Work in progress

Via

Access Paper or Ask Questions