Kevin Shih

Clinical Named Entity Recognition using Contextualized Token Representations

Jun 23, 2021

Yichao Zhou, Chelsea Ju, J. Harry Caufield, Kevin Shih, Calvin Chen, Yizhou Sun, Kai-Wei Chang, Peipei Ping, Wei Wang

Abstract:The clinical named entity recognition (CNER) task seeks to locate and classify clinical terminologies into predefined categories, such as diagnostic procedure, disease disorder, severity, medication, medication dosage, and sign symptom. CNER facilitates the study of medication side effects, including the identification of novel phenomena and human-focused information extraction. Existing approaches to extracting the entities of interest focus on using static word embeddings to represent each word. However, one word can have different interpretations depending on the context of the sentence, so static word embeddings are insufficient to capture the diverse interpretations of a word. To overcome this challenge, contextualized word embeddings have been introduced to better capture the semantic meaning of each word based on its context. Two such language models, ELMo and Flair, have been widely used in natural language processing to generate contextualized word embeddings on domain-generic documents. However, these embeddings are usually too general to capture the proximity among vocabularies of specific domains. To facilitate various downstream applications using clinical case reports (CCRs), we pre-train two deep contextualized language models, Clinical Embeddings from Language Model (C-ELMo) and Clinical Contextual String Embeddings (C-Flair), using a clinical-related corpus from PubMed Central. Experiments show that our models gain dramatic improvements compared to both static word embeddings and domain-generic language models.

* 1 figure, 6 tables

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

May 13, 2020

Rafael Valle, Kevin Shih, Ryan Prenger, Bryan Catanzaro

Abstract:In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer. Flowtron borrows insights from IAF and revamps Tacotron in order to provide high-quality and expressive mel-spectrogram synthesis. Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple and stable. Flowtron learns an invertible mapping of data to a latent space that can be manipulated to control many aspects of speech synthesis (pitch, tone, speech rate, cadence, accent). Our mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality. In addition, we provide results on control of speech variation, interpolation between samples and style transfer between speakers seen and unseen during training. Code and pre-trained models will be made publicly available at https://github.com/NVIDIA/flowtron

* 10 pages, 7 pictures
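
The abstract's two central ideas, an invertible mapping from data to a latent space and training by maximizing likelihood, can be illustrated with a toy example. The sketch below is not Flowtron's architecture: it assumes a single affine transform with constant, hand-picked `scale` and `shift` (hypothetical names), whereas an autoregressive flow would predict such parameters from context. It only shows why invertibility gives an exact log-likelihood through the change-of-variables formula.

```python
import math


def forward(x, scale, shift):
    """Map data x to latent z with an invertible affine transform.

    Returns z and log|det dz/dx|. Constant scale/shift are an
    illustrative assumption, not the model described in the abstract.
    """
    z = [(xi - shift) / scale for xi in x]
    log_det = -len(x) * math.log(abs(scale))
    return z, log_det


def inverse(z, scale, shift):
    """Map latent z back to data space; exact inverse of forward()."""
    return [zi * scale + shift for zi in z]


def log_likelihood(x, scale, shift):
    """Exact log p(x) under a standard normal prior on z."""
    z, log_det = forward(x, scale, shift)
    log_prior = sum(-0.5 * (zi * zi + math.log(2 * math.pi)) for zi in z)
    return log_prior + log_det


x = [0.3, -1.2, 2.5]
z, _ = forward(x, scale=2.0, shift=0.5)
x_back = inverse(z, scale=2.0, shift=0.5)
```

Because the mapping is invertible, editing `z` and calling `inverse` yields a new point in data space, which is the mechanism the abstract refers to when it mentions manipulating the latent space.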

An Interpretable Model for Scene Graph Generation

Nov 21, 2018

Ji Zhang, Kevin Shih, Andrew Tao, Bryan Catanzaro, Ahmed Elgammal

Abstract:We propose an efficient and interpretable scene graph generator. We consider three types of features: visual, spatial and semantic, and we use a late fusion strategy such that each feature's contribution can be explicitly investigated. We study the key factors of these features that have the most impact on performance, visualize the learned visual features for relationships, and investigate the efficacy of our model. We won first place in the OpenImages Visual Relationship Detection Challenge on Kaggle, where we outperformed the 2nd place by 5% (20% relatively). We believe an accurate scene graph generator is a fundamental stepping stone for higher-level vision-language tasks such as image captioning and visual QA, since it provides a semantic, structured comprehension of an image that is beyond pixels and objects.

* arXiv admin note: substantial text overlap with arXiv:1811.00662

Introduction to the 1st Place Winning Model of OpenImages Relationship Detection Challenge

Nov 01, 2018

Ji Zhang, Kevin Shih, Andrew Tao, Bryan Catanzaro, Ahmed Elgammal

Abstract:This article describes the model we built that achieved 1st place in the OpenImages Visual Relationship Detection Challenge on Kaggle. Three key factors contribute the most to our success: 1) language bias is a powerful baseline for this task. We build the empirical distribution $P(\text{predicate} \mid \text{subject}, \text{object})$ in the training set and directly use that in testing. This baseline achieved 2nd place when submitted; 2) spatial features are as important as visual features, especially for spatial relationships such as "under" and "inside of"; 3) a very effective way to fuse different features is to first build separate modules for each of them, then add their output logits before the final softmax layer. We show in an ablation study that each factor improves performance to a non-trivial extent, and the model performs best when all of them are combined.
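
Two of the factors named in the abstract lend themselves to a short illustration: the empirical predicate distribution conditioned on a subject-object pair, and fusion by adding per-module logits before a single softmax. The following is a minimal sketch under stated assumptions, not the authors' implementation: the toy triples, the function names, and the two-class logits are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_language_prior(triples):
    """Estimate P(predicate | subject, object) by counting training triples."""
    counts = defaultdict(Counter)
    for subj, pred, obj in triples:
        counts[(subj, obj)][pred] += 1
    prior = {}
    for pair, pred_counts in counts.items():
        total = sum(pred_counts.values())
        prior[pair] = {p: c / total for p, c in pred_counts.items()}
    return prior


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def fuse_logits(*module_logits):
    """Add the logits of each module elementwise, then apply one softmax."""
    summed = [sum(vals) for vals in zip(*module_logits)]
    return softmax(summed)


# Hypothetical toy training set of (subject, predicate, object) triples.
train = [
    ("man", "on", "horse"),
    ("man", "on", "horse"),
    ("man", "holds", "horse"),
    ("cup", "on", "table"),
]
prior = build_language_prior(train)

# Hypothetical logits from two separate modules over two predicate classes.
fused = fuse_logits([1.0, 0.0], [0.0, 1.0])
```

At test time the baseline described above would look up `prior[(subject, object)]` for each detected pair. Pairs never seen in training have no entry in this sketch, and how to handle them is left open here.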

Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks

Oct 16, 2017

Tanmay Gupta, Kevin Shih, Saurabh Singh, Derek Hoiem

Abstract:An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it leads to better cross-task transfer than standard multi-task learning. In particular, the task of visual recognition is aligned to the task of visual question answering by forcing each to use the same word-region embeddings. We show this leads to greater inductive transfer from recognition to VQA than standard multitask learning. Visual recognition also improves, especially for categories that have relatively few recognition training labels but appear often in the VQA setting. Thus, our paper takes a small step towards creating more general vision systems by showing the benefit of interpretable, flexible, and trainable core representations.

* Accepted at ICCV 2017. The arXiv version has an extra analysis on correlation with human attention

Efficient Media Retrieval from Non-Cooperative Queries

Nov 19, 2014

Kevin Shih, Wei Di, Vignesh Jagadeesh, Robinson Piramuthu

Abstract:Text is ubiquitous in the artificial world and easily attainable when it comes to book titles and author names. Using the images from the book cover set of the Stanford Mobile Visual Search dataset and additional book covers and metadata from openlibrary.org, we construct a large-scale book cover retrieval dataset, complete with 100K distractor covers and title and author strings for each. Because our query images are poorly conditioned for clean text extraction, we propose a method for extracting noisy and erroneous OCR readings and matching them against clean author and book title strings in a standard document look-up problem setup. Finally, we demonstrate how to use this text matching as a feature in conjunction with popular retrieval features such as VLAD, using a simple learning setup, to achieve significant improvements in retrieval accuracy over that of either VLAD or the text alone.

* 8 pages, 9 figures, 1 table
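
The idea of matching noisy OCR readings against clean title and author strings can be sketched with approximate string matching. This example does not reproduce the paper's document look-up setup: it swaps in the similarity ratio from Python's standard `difflib.SequenceMatcher`, and the catalog entries and the OCR string are made up for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so formatting noise is ignored."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Similarity in [0, 1] between two strings after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(ocr_text, catalog):
    """Return (score, entry) for the catalog string closest to the OCR text."""
    return max((similarity(ocr_text, entry), entry) for entry in catalog)


# Hypothetical clean "title author" strings and a noisy OCR reading.
catalog = [
    "Moby Dick Herman Melville",
    "Pride and Prejudice Jane Austen",
    "Dracula Bram Stoker",
]
score, entry = best_match("M0by D1ck Hermon Melvile", catalog)
```

A score of this kind could then serve as one feature alongside an image descriptor, which is the combination the abstract describes. A linear scan over 100K entries is slow, so an indexed look-up would be needed at that scale.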
