Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sushant Kafle

Gemini: A Family of Highly Capable Multimodal Models

Dec 19, 2023

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth(+930 more)

Abstract:This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of Gemini models in cross-modal reasoning and language understanding will enable a wide variety of use cases and we discuss our approach toward deploying them responsibly to users.

Via

Access Paper or Ask Questions

Modeling Acoustic-Prosodic Cues for Word Importance Prediction in Spoken Dialogues

Mar 28, 2019

Sushant Kafle, Cecilia O. Alm, Matt Huenerfauth

Figure 1 for Modeling Acoustic-Prosodic Cues for Word Importance Prediction in Spoken Dialogues

Figure 2 for Modeling Acoustic-Prosodic Cues for Word Importance Prediction in Spoken Dialogues

Figure 3 for Modeling Acoustic-Prosodic Cues for Word Importance Prediction in Spoken Dialogues

Figure 4 for Modeling Acoustic-Prosodic Cues for Word Importance Prediction in Spoken Dialogues

Abstract:Prosodic cues in conversational speech aid listeners in discerning a message. We investigate whether acoustic cues in spoken dialogue can be used to identify the importance of individual words to the meaning of a conversation turn. Individuals who are Deaf and Hard of Hearing often rely on real-time captions in live meetings. Word error rate, a traditional metric for evaluating automatic speech recognition, fails to capture that some words are more important for a system to transcribe correctly than others. We present and evaluate neural architectures that use acoustic features for 3-class word importance prediction. Our model performs competitively against state-of-the-art text-based word-importance prediction models, and it demonstrates particular benefits when operating on imperfect ASR output.

* 8 pages, 2 figures

Via

Access Paper or Ask Questions

A Corpus for Modeling Word Importance in Spoken Dialogue Transcripts

Feb 09, 2018

Sushant Kafle, Matt Huenerfauth

Figure 1 for A Corpus for Modeling Word Importance in Spoken Dialogue Transcripts

Figure 2 for A Corpus for Modeling Word Importance in Spoken Dialogue Transcripts

Figure 3 for A Corpus for Modeling Word Importance in Spoken Dialogue Transcripts

Figure 4 for A Corpus for Modeling Word Importance in Spoken Dialogue Transcripts

Abstract:Motivated by a project to create a system for people who are deaf or hard-of-hearing that would use automatic speech recognition (ASR) to produce real-time text captions of spoken English during in-person meetings with hearing individuals, we have augmented a transcript of the Switchboard conversational dialogue corpus with an overlay of word-importance annotations, with a numeric score for each word, to indicate its importance to the meaning of each dialogue turn. Further, we demonstrate the utility of this corpus by training an automatic word importance labeling model; our best performing model has an F-score of 0.60 in an ordinal 6-class word-importance classification task with an agreement (concordance correlation coefficient) of 0.839 with the human annotators (agreement score between annotators is 0.89). Finally, we discuss our intended future applications of this resource, particularly for the task of evaluating ASR performance, i.e. creating metrics that predict ASR-output caption text usability for DHH users better thanWord Error Rate (WER).

* Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
* Language Resources and Evaluation Conference (LREC)

Via

Access Paper or Ask Questions