Abstract: QR codes have become ubiquitous in daily life, enabling rapid information exchange. With the increasing adoption of smart wearable devices, there is a need for efficient, frictionless QR code reading from an egocentric point of view. However, adapting existing phone-based QR code readers to egocentric images poses significant challenges. Code reading from egocentric images brings unique difficulties such as a wide field of view, code distortion, and a lack of visual feedback, unlike phones, where users can adjust position and framing. Furthermore, wearable devices impose constraints on resources such as compute, power, and memory. To address these challenges, we present EgoQR, a novel system for reading QR codes from egocentric images that is well suited for deployment on wearable devices. Our approach consists of two primary components, detection and decoding, designed to operate on high-resolution images on-device with minimal power consumption and added latency. The detection component efficiently locates potential QR codes within the image, while our enhanced decoding component extracts and interprets the encoded information. We incorporate innovative techniques to handle the specific challenges of egocentric imagery, such as varying perspectives, a wider field of view, and motion blur. We evaluate our approach on a dataset of egocentric images, demonstrating a 34% improvement in code reading over an existing state-of-the-art QR code reader.
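The two-stage detect-then-decode structure described above can be illustrated with stock tooling. The sketch below uses OpenCV's built-in QRCodeDetector; it is not the EgoQR system (whose detector and decoder are custom, on-device models), only a minimal example of the same pipeline shape.

```python
# Minimal detect-then-decode sketch using OpenCV's stock QRCodeDetector.
# This is NOT EgoQR; it only illustrates the two-stage structure the
# abstract describes: locate candidate codes, then decode each region.
import cv2

def read_qr_codes(image_path: str) -> list[str]:
    image = cv2.imread(image_path)
    detector = cv2.QRCodeDetector()
    # detectAndDecodeMulti returns (ok, decoded_texts, corner_points, rectified_codes)
    ok, texts, points, _ = detector.detectAndDecodeMulti(image)
    if not ok:
        return []
    # Drop candidates that were located but failed to decode (empty strings),
    # e.g. due to the motion blur and distortion the abstract highlights.
    return [t for t in texts if t]

print(read_qr_codes("egocentric_frame.jpg"))
```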
Abstract: We introduce Lumos, the first end-to-end multimodal question-answering system with text understanding capabilities. At the core of Lumos is a Scene Text Recognition (STR) component that extracts text from first-person point-of-view images; its output is used to augment the input to a Multimodal Large Language Model (MM-LLM). While building Lumos, we encountered numerous challenges related to STR quality, overall latency, and model inference. In this paper, we delve into those challenges and discuss the system architecture, design choices, and modeling techniques employed to overcome them. We also provide a comprehensive evaluation of each component, showcasing high quality and efficiency.
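The core augmentation pattern, running STR on the image and injecting its output into the MM-LLM prompt, can be sketched as below. The two helper functions are stubs standing in for components Lumos does not expose; their names and behavior are illustrative assumptions, not Lumos APIs.

```python
# Hedged sketch of the augmentation pattern the abstract describes: run STR
# on the image, then add the recognized text to the multimodal model's input.

def run_str(image) -> list[str]:
    """Stub STR component: would return words read from the image."""
    return ["MENU", "Espresso", "$3"]

def mm_llm_generate(image, prompt: str) -> str:
    """Stub MM-LLM call: would return the model's answer."""
    return f"(model answer conditioned on: {prompt!r})"

def answer_question(image, question: str) -> str:
    words = run_str(image)
    # Augment the model input with the STR output, per the abstract.
    prompt = f"Text visible in the image: {' '.join(words)}\nQuestion: {question}"
    return mm_llm_generate(image, prompt)

print(answer_question(image=None, question="How much is an espresso?"))
```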
Abstract: Question answering (QA) is an important use case for voice assistants. A popular approach to QA is extractive reading comprehension (RC), which finds an answer span in a text passage. However, extractive answers are often unnatural in a conversational context, resulting in a suboptimal user experience. In this work, we investigate conversational answer generation for QA. We propose AnswerBART, an end-to-end generative RC model that combines answer generation from multiple passages with passage ranking and answerability. Moreover, a hurdle in applying generative RC is hallucination, where the answer is factually inconsistent with the passage text. We leverage recent work from summarization to evaluate factuality. Experiments show that AnswerBART significantly improves over the previous best published results on MS MARCO 2.1 NLGEN by 2.5 ROUGE-L and on NarrativeQA by 9.4 ROUGE-L.
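Generative RC over multiple passages can be framed as seq2seq generation on a question concatenated with ranked passages. The sketch below shows that framing with a stock BART checkpoint standing in for AnswerBART (which is not claimed to be publicly released); the separator-based input encoding is one common convention, assumed here for illustration.

```python
# Illustrative multi-passage generative RC input framing (not AnswerBART).
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

question = "Who wrote Hamlet?"
passages = ["Hamlet is a tragedy written by William Shakespeare.",
            "The play is set in Denmark."]
# One common encoding: question first, ranked passages separated by </s>.
source = question + " </s> " + " </s> ".join(passages)

inputs = tok(source, return_tensors="pt", truncation=True)
ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tok.decode(ids[0], skip_special_tokens=True))
```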
Abstract: Natural language generation (NLG) is a critical component in conversational systems, owing to its role in formulating correct and natural text responses. Traditionally, NLG components have been deployed using template-based solutions. Although neural network solutions recently developed in the research community have been shown to provide several benefits, deploying such model-based solutions has been challenging due to high latency, correctness issues, and high data needs. In this paper, we present approaches that have helped us deploy data-efficient neural solutions for NLG in conversational systems to production. We describe a family of sampling and modeling techniques that attain production quality with lightweight neural network models using only a fraction of the data that would otherwise be necessary, and we provide a thorough comparison among them. Our results show that domain complexity dictates the appropriate approach for achieving high data efficiency. Finally, we distill the lessons from our experimental findings into a list of best practices for production-level NLG model development and present them in a brief runbook. Importantly, the end products of all of these techniques are small (~2 MB) sequence-to-sequence models that we can reliably deploy in production.
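For intuition on the ~2 MB footprint, a back-of-the-envelope parameter count is useful. The architecture and sizes below are illustrative assumptions, not the paper's model; they only show that a modest vocabulary and hidden size land in that weight budget.

```python
# Back-of-the-envelope sketch: what a ~2 MB seq2seq model can look like.
# All sizes here are illustrative assumptions, not the paper's architecture.
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, vocab=2000, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

model = TinySeq2Seq()
params = sum(p.numel() for p in model.parameters())
# float32 weights: 4 bytes per parameter -> roughly 2 MB at ~500k params.
print(f"{params:,} params, approx {params * 4 / 1e6:.1f} MB")
```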
Abstract: Pre-trained multilingual contextual embeddings have demonstrated state-of-the-art performance in zero-shot cross-lingual transfer learning, where multilingual BERT is fine-tuned on a source language (typically English) and evaluated on a different target language. However, published results for baseline mBERT zero-shot accuracy vary by as much as 17 points on the MLDoc classification task across four papers. We show that the standard practice of using English dev accuracy for model selection in the zero-shot setting makes it difficult to obtain reproducible results on the MLDoc and XNLI tasks. English dev accuracy is often uncorrelated (or even anti-correlated) with target language accuracy, and zero-shot cross-lingual performance varies greatly both within a single fine-tuning run and across different fine-tuning runs. We recommend providing oracle scores alongside the zero-shot results: still fine-tune on English, but select the checkpoint using the target language dev set. Reporting this upper bound makes results more consistent by avoiding the variation from bad checkpoints.
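The recommended protocol amounts to two checkpoint selections over the same fine-tuning run, as in the sketch below. The `evaluate` function is a stub for any task-specific accuracy scorer; names are assumptions for illustration.

```python
# Sketch of the recommendation: fine-tune on English, then report both the
# standard zero-shot checkpoint (chosen on English dev) and the oracle
# checkpoint (chosen on the target-language dev set).

def evaluate(checkpoint, dev_set) -> float:
    """Stub: would load `checkpoint`, run it on `dev_set`, return accuracy."""
    return 0.0

def select_checkpoints(checkpoints, english_dev, target_dev):
    zero_shot = max(checkpoints, key=lambda c: evaluate(c, english_dev))
    oracle = max(checkpoints, key=lambda c: evaluate(c, target_dev))
    return zero_shot, oracle  # report both; the oracle score is the upper bound
```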
Abstract: We discuss the problem of echographic transcription in autoregressive sequence-to-sequence attentional architectures for automatic speech recognition, where a model produces very long sequences of repetitive outputs when presented with out-of-domain utterances. We decode audio from the British National Corpus with an attentional encoder-decoder model trained solely on the LibriSpeech corpus. We observe many 5-second recordings that produce more than 500 characters of decoding output (i.e., more than 100 characters per second). A frame-synchronous hybrid (DNN-HMM) model trained on the same data does not produce these unusually long transcripts. These decoding issues are reproducible in a speech transformer model from ESPnet, and to a lesser extent in a self-attention CTC model, suggesting that they are intrinsic to the use of the attention mechanism. We create a separate length prediction model to predict the correct number of wordpieces in the output, which allows us to identify and truncate problematic decoding results without increasing word error rate on the LibriSpeech task.
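The 100-characters-per-second observation suggests a simple screen for runaway decodes, sketched below. Note this heuristic only mirrors the abstract's diagnostic threshold; the paper's actual remedy is a trained length prediction model, not a fixed cutoff.

```python
# Simple screen in the spirit of the abstract's observation: flag hypotheses
# whose output rate is implausibly high for speech (over ~100 chars/sec).

def is_echographic(transcript: str, audio_seconds: float,
                   max_chars_per_sec: float = 100.0) -> bool:
    return len(transcript) / audio_seconds > max_chars_per_sec

# A 5-second clip producing 500+ characters is almost certainly a runaway decode.
print(is_echographic("the the the " * 50, audio_seconds=5.0))  # True
```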
Abstract: Contextual word embeddings (e.g., GPT, BERT, ELMo) have demonstrated state-of-the-art performance on various NLP tasks. Recent work with the multilingual version of BERT has shown that the model performs very well in cross-lingual settings, even when only labeled English data is used to fine-tune the model. We improve upon multilingual BERT's zero-resource cross-lingual performance via adversarial learning. We report the magnitude of the improvement on the multilingual MLDoc text classification and CoNLL 2002/2003 named entity recognition tasks. Furthermore, we show that language-adversarial training encourages BERT to align the embeddings of English documents and their translations, which may be the cause of the observed performance gains.
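The standard mechanism behind language-adversarial training is a gradient reversal layer between the encoder and a language discriminator (Ganin & Lempitsky style). The sketch below shows that general technique; it is not claimed to be the paper's exact training setup.

```python
# Minimal gradient-reversal layer: identity on the forward pass, negated
# (scaled) gradient on the backward pass, so the encoder is pushed to make
# languages indistinguishable to the discriminator.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Gradient for x is reversed; lambd gets no gradient.
        return -ctx.lambd * grad_output, None

features = torch.randn(8, 768, requires_grad=True)  # e.g. pooled BERT outputs
reversed_feats = GradReverse.apply(features, 1.0)
# `reversed_feats` would feed a language discriminator during fine-tuning.
```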
Abstract: We incorporate an explicit neural interlingua into a multilingual encoder-decoder neural machine translation (NMT) architecture. We demonstrate that our model learns a language-independent representation by performing direct zero-shot translation (without using pivot translation), and by using the source sentence embeddings to create an English Yelp review classifier that, through the mediation of the neural interlingua, can also classify French and German reviews. Furthermore, we show that, despite using a smaller number of parameters than a pairwise collection of bilingual NMT models, our approach produces comparable BLEU scores for each language pair in WMT15.
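The interlingua idea is that every language-specific encoder writes into, and every decoder reads from, one shared fixed-size representation, which is why any source/target pairing works and why the embedding doubles as a language-independent classifier feature. The toy below illustrates that wiring only; modules and shapes are illustrative assumptions, not the paper's architecture.

```python
# Toy interlingua wiring: per-language encoders/decoders around one shared
# fixed-size sentence vector. Illustrative only, not the paper's model.
import torch
import torch.nn as nn

class InterlinguaNMT(nn.Module):
    def __init__(self, langs, vocab=1000, dim=128):
        super().__init__()
        self.embed = nn.ModuleDict({l: nn.Embedding(vocab, dim) for l in langs})
        self.decode = nn.ModuleDict({l: nn.Linear(dim, vocab) for l in langs})

    def interlingua(self, tokens, src_lang):
        # Toy encoder: mean-pooled embeddings as the shared representation.
        return self.embed[src_lang](tokens).mean(dim=1)

    def forward(self, tokens, src_lang, tgt_lang):
        z = self.interlingua(tokens, src_lang)
        return self.decode[tgt_lang](z)  # per-position decoding omitted

model = InterlinguaNMT(["en", "fr", "de"])
z = model.interlingua(torch.randint(0, 1000, (2, 7)), "fr")
print(z.shape)  # torch.Size([2, 128]); same space for every source language
```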
Abstract: We describe a prototype dialogue response generation model for the customer service domain at Amazon. The model, which is trained in a weakly supervised fashion, measures the similarity between customer questions and agent answers using a dual encoder network, a Siamese-like neural network architecture. Answer templates are extracted from embeddings derived from past agent answers, without turn-by-turn annotations. Responses to customer inquiries are generated by selecting the best template from the final set of templates. We show that, in a closed domain like customer service, the selected templates cover >70% of past customer inquiries. Furthermore, the relevance of the model-selected templates is significantly higher than templates selected by a standard tf-idf baseline.
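At inference time, dual-encoder template selection reduces to a nearest-neighbor lookup: embed the question, score it against precomputed template embeddings, and return the best match. The sketch below shows that selection step; the encoder is stubbed with random vectors, and the template strings are hypothetical examples.

```python
# Selection step of a dual-encoder template system (encoder stubbed out).
import numpy as np

templates = ["Your refund has been issued.",
             "Your order will arrive within 3-5 business days."]
rng = np.random.default_rng(0)
template_vecs = rng.normal(size=(len(templates), 256))  # from the answer encoder

def encode_question(question: str) -> np.ndarray:
    """Stub for the question-side encoder of the Siamese network."""
    return rng.normal(size=256)

def best_template(question: str) -> str:
    q = encode_question(question)
    # Cosine similarity between the question and every template embedding.
    sims = template_vecs @ q / (
        np.linalg.norm(template_vecs, axis=1) * np.linalg.norm(q))
    return templates[int(np.argmax(sims))]

print(best_template("Where is my package?"))
```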