Ajay Kannan

Gemini: A Family of Highly Capable Multimodal Models

Dec 19, 2023

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth (+930 more)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks, notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of Gemini models in cross-modal reasoning and language understanding will enable a wide variety of use cases and we discuss our approach toward deploying them responsibly to users.

FLINT: A Platform for Federated Learning Integration

Mar 11, 2023

Ewen Wang, Ajay Kannan, Yuefeng Liang, Boyi Chen, Mosharaf Chowdhury

Abstract: Cross-device federated learning (FL) has been well-studied from algorithmic, system scalability, and training speed perspectives. Nonetheless, moving from centralized training to cross-device FL for millions or billions of devices presents many risks, including performance loss, developer inertia, poor user experience, and unexpected application failures. In addition, the corresponding infrastructure, development costs, and return on investment are difficult to estimate. In this paper, we present a device-cloud collaborative FL platform that integrates with an existing machine learning platform, providing tools to measure real-world constraints, assess infrastructure capabilities, evaluate model training performance, and estimate system resource requirements to responsibly bring FL into production. We also present a decision workflow that leverages the FL-integrated platform to comprehensively evaluate the trade-offs of cross-device FL and share our empirical evaluations of business-critical machine learning applications that impact hundreds of millions of users.

* Preprint for MLSys 2023

Bridging the Generalization Gap: Training Robust Models on Confounded Biological Data

Dec 12, 2018

Tzu-Yu Liu, Ajay Kannan, Adam Drake, Marvin Bertin, Nathan Wan

Abstract: Statistical learning on biological data can be challenging due to confounding variables in sample collection and processing. Confounders can cause models to generalize poorly and result in inaccurate prediction performance metrics if models are not validated thoroughly. In this paper, we propose methods to control for confounding factors and further improve prediction performance. We introduce OrthoNormal basis construction In cOnfounding factor Normalization (ONION) to remove confounding covariates and use the Domain-Adversarial Neural Network (DANN) to penalize models for encoding confounder information. We apply the proposed methods to simulated and empirical patient data and show significant improvements in generalization.

A Generic Multi-modal Dynamic Gesture Recognition System using Machine Learning

Sep 16, 2018

Gautham Krishna G, Karthik Subramanian Nathan, Yogesh Kumar B, Ankith A Prabhu, Ajay Kannan, Vineeth Vijayaraghavan

Abstract: Human computer interaction facilitates intelligent communication between humans and computers, in which gesture recognition plays a prominent role. This paper proposes a machine learning system to identify dynamic gestures using tri-axial acceleration data acquired from two public datasets. These datasets, uWave and Sony, were acquired using accelerometers embedded in Wii remotes and smartwatches, respectively. A dynamic gesture signed by the user is characterized by a generic set of features extracted across time and frequency domains. The system was analyzed from an end-user perspective and was modelled to operate in three modes. The modes of operation determine the subsets of data to be used for training and testing the system. From an initial set of seven classifiers, three were chosen to evaluate each dataset across all modes rendering the system towards mode-neutrality and dataset-independence. The proposed system is able to classify gestures performed at varying speeds with minimum preprocessing, making it computationally efficient. Moreover, this system was found to run on a low-cost embedded platform, the Raspberry Pi Zero (USD 5), making it economically viable.

* Accepted at IEEE Future of Information and Communications Conference (FICC 2018)

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

Feb 22, 2018

Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller

Abstract: We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.

* Published as a conference paper at ICLR 2018. (v3 changed paper title)

Reducing Bias in Production Speech Models

May 11, 2017

Eric Battenberg, Rewon Child, Adam Coates, Christopher Fougner, Yashesh Gaur, Jiaji Huang, Heewoo Jun, Ajay Kannan, Markus Kliegl, Atul Kumar (+6 more)

Abstract: Replacing hand-engineered pipelines with end-to-end deep learning systems has enabled strong results in applications like speech and object recognition. However, the causality and latency constraints of production systems put end-to-end speech models back into the underfitting regime and expose biases in the model that we show cannot be overcome by "scaling up", i.e., training bigger models on more data. In this work we systematically identify and address sources of bias, reducing error rates by up to 20% while remaining practical for deployment. We achieve this by utilizing improved neural architectures for streaming inference, solving optimization issues, and employing strategies that increase audio and label modelling versatility.

Deep Speaker: an End-to-End Neural Speaker Embedding System

May 05, 2017

Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, Zhenyao Zhu

Abstract: We present Deep Speaker, a neural speaker embedding system that maps utterances to a hypersphere where speaker similarity is measured by cosine similarity. The embeddings generated by Deep Speaker can be used for many tasks, including speaker identification, verification, and clustering. We experiment with ResCNN and GRU architectures to extract the acoustic features, then mean pool to produce utterance-level speaker embeddings, and train using triplet loss based on cosine similarity. Experiments on three distinct datasets suggest that Deep Speaker outperforms a DNN-based i-vector baseline. For example, Deep Speaker reduces the verification equal error rate by 50% (relatively) and improves the identification accuracy by 60% (relatively) on a text-independent dataset. We also present results that suggest adapting from a model trained with Mandarin can improve accuracy for English speaker recognition.
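
As a rough illustration of the pipeline the abstract describes (mean pooling of frame-level features, projection onto the unit hypersphere, and a triplet loss on cosine similarity), here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the margin value of 0.1, and the toy inputs are assumptions made for illustration, and the ResCNN/GRU feature extractor is omitted entirely.

```python
import numpy as np

def utterance_embedding(frame_features):
    """Mean-pool frame-level features of shape (T, D), then length-normalize
    so the utterance embedding lies on the unit hypersphere."""
    pooled = frame_features.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def cosine_triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge-style triplet loss on cosine similarity.

    Inputs are assumed to be unit-length, so the dot product equals the
    cosine similarity. The loss is zero once the anchor is more similar to
    the positive (same speaker) than to the negative (different speaker)
    by at least `margin` (an assumed, illustrative value).
    """
    s_ap = float(anchor @ positive)
    s_an = float(anchor @ negative)
    return max(0.0, s_an - s_ap + margin)

# Toy usage with hypothetical 2-frame, 2-dimensional features.
emb = utterance_embedding(np.array([[3.0, 0.0], [3.0, 8.0]]))
same = np.array([1.0, 0.0])
other = np.array([0.0, 1.0])
easy = cosine_triplet_loss(same, same, other)   # well-separated triplet
hard = cosine_triplet_loss(same, other, same)   # violated triplet
```

With unit-length embeddings, ranking by cosine similarity and ranking by Euclidean distance agree, which is one reason the normalization step is convenient for verification and clustering.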

Automated Attribution and Intertextual Analysis

May 03, 2014

James Brofos, Ajay Kannan, Rui Shu

Abstract: In this work, we employ quantitative methods from the realm of statistics and machine learning to develop novel methodologies for author attribution and textual analysis. In particular, we develop techniques and software suitable for applications to Classical study, and we illustrate the efficacy of our approach in several interesting open questions in the field. We apply our numerical analysis techniques to questions of authorship attribution in the case of the Greek tragedian Euripides, to instances of intertextuality and influence in the poetry of the Roman statesman Seneca the Younger, and to cases of "interpolated" text with respect to the histories of Livy.

* 10 pages, 4 tables, 4 figures
