Srinivas Sunkara

BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Dec 05, 2024

Juan Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte, François Savard, Ahmed Masry, Shravan Nayak(+33 more)

Abstract: Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance by up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io.

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Feb 19, 2024

Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma

Abstract: Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.

* Revision notes: 1) In Appendix I, added dataset location for ScreenQA Short. 2) In Table 4, updated evaluation numbers for Screen Annotation and Complex Screen QA benchmarks as the datasets are updated. 3) Updated Figure 4 to reflect the changes in evaluation numbers described in 2). 4) Minor revisions in other places.

Towards Better Semantic Understanding of Mobile Interfaces

Oct 06, 2022

Srinivas Sunkara, Maria Wang, Lijuan Liu, Gilles Baechler, Yu-Chung Hsiao, Jindong Chen, Abhanshu Sharma, James Stout

Abstract: Improving the accessibility and automation capabilities of mobile devices can have a significant positive impact on the daily lives of countless users. To stimulate research in this direction, we release a human-annotated dataset with approximately 500k unique annotations aimed at increasing the understanding of the functionality of UI elements. This dataset augments images and view hierarchies from RICO, a large dataset of mobile UIs, with annotations for icons based on their shapes and semantics, and associations between different elements and their corresponding text labels, resulting in a significant increase in the number of UI elements and the categories assigned to them. We also release models using image-only and multimodal inputs; we experiment with various architectures and study the benefits of using multimodal inputs on the new dataset. Our models demonstrate strong performance on an evaluation set of unseen apps, indicating their generalizability to newer screens. These models, combined with the new dataset, can enable innovative functionalities such as referring to UI elements by their labels, improved coverage, and better semantics for icons, which would go a long way in making UIs more usable for everyone.

* This paper is to be published at COLING 2022

A Unified Approach to Entity-Centric Context Tracking in Social Conversations

Jan 28, 2022

Ulrich Rückert, Srinivas Sunkara, Abhinav Rastogi, Sushant Prakash, Pranav Khaitan

Abstract: In human-human conversations, Context Tracking deals with identifying important entities and keeping track of their properties and relationships. This is a challenging problem that encompasses several subtasks such as slot tagging, coreference resolution, resolving plural mentions and entity linking. We approach this problem as an end-to-end modeling task where the conversational context is represented by an entity repository containing the entity references mentioned so far, their properties and the relationships between them. The repository is updated turn-by-turn, thus making training and inference computationally efficient even for long conversations. This paper lays the groundwork for an investigation of this framework in two ways. First, we release Contrack, a large-scale human-human conversation corpus for context tracking with people and location annotations. It contains over 7000 conversations with an average of 11.8 turns, 5.8 entities and 15.2 references per conversation. Second, we open-source a neural network architecture for context tracking. Finally, we compare this network to state-of-the-art approaches for the subtasks it subsumes and report results on the involved tradeoffs.
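
As a rough illustration of the entity repository idea in the abstract above, the following Python sketch keeps a per-conversation store of references, properties and relations, and applies one turn's predictions at a time. It is not the released Contrack architecture: the class names (`Entity`, `EntityRepository`), the `update` method and the shape of the per-turn prediction dictionary are all hypothetical, and the neural model that would produce those predictions is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entity:
    """One tracked entity: the mentions seen so far and its properties."""
    entity_id: int
    mentions: list[str] = field(default_factory=list)
    properties: dict[str, str] = field(default_factory=dict)


@dataclass
class EntityRepository:
    """Hypothetical container for the conversational context."""
    entities: dict[int, Entity] = field(default_factory=dict)
    relations: set[tuple[int, str, int]] = field(default_factory=set)

    def update(self, turn_predictions: dict) -> None:
        """Apply one turn's predictions; earlier turns are never re-read."""
        for ref in turn_predictions.get("references", []):
            entity = self.entities.setdefault(ref["id"], Entity(ref["id"]))
            entity.mentions.append(ref["text"])
            entity.properties.update(ref.get("properties", {}))
        for subj, rel, obj in turn_predictions.get("relations", []):
            self.relations.add((subj, rel, obj))


# Two turns of an invented conversation (assumed prediction format).
repo = EntityRepository()
repo.update({"references": [
    {"id": 0, "text": "my sister", "properties": {"type": "person"}},
]})
repo.update({
    "references": [
        {"id": 0, "text": "she"},
        {"id": 1, "text": "Paris", "properties": {"type": "location"}},
    ],
    "relations": [(0, "lives_in", 1)],
})
```

Because each call to `update` touches only the current turn's predictions, the cost per turn does not grow with the length of the conversation, which is the efficiency property the abstract points to.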

UIBert: Learning Generic Multimodal Representations for UI Understanding

Aug 10, 2021

Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, Blaise Agüera y Arcas

Abstract: To improve the accessibility of smart devices and to simplify their usage, building models which understand user interfaces (UIs) and assist users to complete their tasks is critical. However, unique challenges are posed by UI-specific characteristics, such as how to effectively leverage multimodal UI features that involve image, text, and structural metadata and how to achieve good performance when high-quality labeled data is unavailable. To address such challenges we introduce UIBert, a transformer-based joint image-text model trained through novel pre-training tasks on large-scale unlabeled UI data to learn generic feature representations for a UI and its components. Our key intuition is that the heterogeneous features in a UI are self-aligned, i.e., the image and text features of UI components are predictive of each other. We propose five pre-training tasks utilizing this self-alignment among different features of a UI component and across various components in the same UI. We evaluate our method on nine real-world downstream UI tasks where UIBert outperforms strong multimodal baselines by up to 9.26% accuracy.

* 8 pages, IJCAI 2021

ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces

Jan 25, 2021

Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schubiner, Ruby Lee, Jindong Chen, Blaise Agüera y Arcas

Abstract: As mobile devices are becoming ubiquitous, regularly interacting with a variety of user interfaces (UIs) is a common aspect of daily life for many people. To improve the accessibility of these devices and to enable their usage in a variety of settings, building models that can assist users and accomplish tasks through the UI is vitally important. However, there are several challenges in achieving this. First, UI components of similar appearance can have different functionalities, making understanding their function more important than just analyzing their appearance. Second, domain-specific features like Document Object Model (DOM) in web pages and View Hierarchy (VH) in mobile applications provide important signals about the semantics of UI elements, but these features are not in a natural language format. Third, owing to a large diversity in UIs and absence of standard DOM or VH representations, building a UI understanding model with high coverage requires large amounts of training data. Inspired by the success of pre-training based approaches in NLP for tackling a variety of problems in a data-efficient way, we introduce a new pre-trained UI representation model called ActionBert. Our methodology is designed to leverage visual, linguistic and domain-specific features in user interaction traces to pre-train generic feature representations of UIs and their components. Our key intuition is that user actions, e.g., a sequence of clicks on different UI components, reveal important information about their functionality. We evaluate the proposed model on a wide variety of downstream tasks, ranging from icon classification to UI component retrieval based on its natural language description. Experiments show that the proposed ActionBert model outperforms multi-modal baselines across all downstream tasks by up to 15.5%.

* Accepted to AAAI Conference on Artificial Intelligence (AAAI-21)

Schema-Guided Dialogue State Tracking Task at DSTC8

Feb 02, 2020

Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, Pranav Khaitan

Abstract: This paper gives an overview of the Schema-Guided Dialogue State Tracking task of the 8th Dialogue System Technology Challenge. The goal of this task is to develop dialogue state tracking models suitable for large-scale virtual assistants, with a focus on data-efficient joint modeling across domains and zero-shot generalization to new APIs. This task provided a new dataset consisting of over 16000 dialogues in the training set spanning 16 domains to highlight these challenges, and a baseline model capable of zero-shot generalization to new APIs. Twenty-five teams participated, developing a range of neural network models, exceeding the performance of the baseline model by a very high margin. The submissions incorporated a variety of pre-trained encoders and data augmentation techniques. This paper describes the task definition, dataset and evaluation methodology. We also summarize the approach and results of the submitted systems to highlight the overall trends in the state-of-the-art.

* Presented at DSTC workshop, AAAI 2020. arXiv admin note: text overlap with arXiv:1909.05855

The Eighth Dialog System Technology Challenge

Nov 14, 2019

Seokhwan Kim, Michel Galley, Chulaka Gunasekara, Sungjin Lee, Adam Atkinson, Baolin Peng, Hannes Schulz, Jianfeng Gao, Jinchao Li, Mahmoud Adada(+11 more)

Abstract: This paper introduces the Eighth Dialog System Technology Challenge. In line with recent challenges, the eighth edition focuses on applying end-to-end dialog technologies in a pragmatic way for multi-domain task-completion, noetic response selection, audio visual scene-aware dialog, and schema-guided dialog state tracking tasks. This paper describes the task definition, provided datasets, and evaluation set-up for each track. We also summarize the results of the submitted systems to highlight the overall trends of the state-of-the-art technologies for the tasks.

* Submitted to NeurIPS 2019 3rd Conversational AI Workshop

Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset

Sep 12, 2019

Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, Pranav Khaitan

Abstract: Virtual assistants such as Google Assistant, Alexa and Siri provide a conversational interface to a large number of services and APIs spanning multiple domains. Such systems need to support an ever-increasing number of services with possibly overlapping functionality. Furthermore, some of these services have little to no training data available. Existing public datasets for task-oriented dialogue do not sufficiently capture these challenges since they cover few domains and assume a single static ontology per domain. In this work, we introduce the Schema-Guided Dialogue (SGD) dataset, containing over 16k multi-domain conversations spanning 16 domains. Our dataset exceeds the existing task-oriented dialogue corpora in scale, while also highlighting the challenges associated with building large-scale virtual assistants. It provides a challenging testbed for a number of tasks including language understanding, slot filling, dialogue state tracking and response generation. Along the same lines, we present a schema-guided paradigm for task-oriented dialogue, in which predictions are made over a dynamic set of intents and slots, provided as input, using their natural language descriptions. This allows a single dialogue system to easily support a large number of services and facilitates simple integration of new services without requiring additional training data. Building upon the proposed paradigm, we release a zero-shot dialogue state tracking model that achieves state-of-the-art performance on recent benchmark datasets.
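
To make the schema-guided idea above concrete, here is a toy Python sketch in which the set of slots is passed in as data together with natural-language descriptions, so a new service can be handled without changing the code. The scoring is plain word overlap, used here only as a stand-in for the learned encoder of the released model; the function names, the schema dictionaries and the example utterances are all invented for illustration and are not taken from the SGD dataset.

```python
def tokens(text):
    """Lowercased whitespace tokens, as a set."""
    return set(text.lower().split())


def rank_slots(utterance, schema):
    """Rank slot names by word overlap between the utterance and each
    slot's description (a toy stand-in for a learned scorer)."""
    utt = tokens(utterance)
    scores = {
        name: len(utt & tokens(description))
        for name, description in schema["slots"].items()
    }
    # Highest overlap first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical schemas: the slot inventory is input, not hard-coded.
flights = {"slots": {
    "origin_city": "city the flight departs from",
    "destination_city": "city the flight arrives in",
}}
hotels = {"slots": {
    "check_in_date": "date of arrival at the hotel",
    "location": "city where the hotel is located",
}}

best_flight_slot = rank_slots("which flight departs from boston", flights)[0]
best_hotel_slot = rank_slots("what date is arrival at the hotel", hotels)[0]
```

The same `rank_slots` function serves both services because everything service-specific lives in the schema argument, which mirrors the paradigm's claim that new services can be integrated without additional training data.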
