Abstract: In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state of the art in long-document QA, long-video QA, and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k tokens) and GPT-4 Turbo (128k tokens). Finally, we highlight surprising new capabilities of large language models at the frontier: when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a level similar to that of a person who learned from the same content.
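The >99% retrieval figure above refers to long-context retrieval evaluations of the needle-in-a-haystack variety. As a rough illustration of how such an evaluation works, the sketch below hides a known fact at a random depth inside a long filler context and checks whether the model can recall it; the `generate` callable, the filler text, and the pass criterion are assumptions for illustration, not the report's actual evaluation harness.

```python
import random

def needle_in_haystack_eval(generate, filler_words, needle, question, expected,
                            context_lengths=(1_000, 128_000, 1_000_000), trials=10):
    """Estimate retrieval accuracy at several context lengths by inserting a
    'needle' sentence at a random depth in filler text and asking the model
    (via the hypothetical `generate(prompt) -> str` callable) to recall it."""
    results = {}
    for length in context_lengths:
        hits = 0
        for _ in range(trials):
            haystack = list(filler_words[:length])
            depth = random.randint(0, len(haystack))
            haystack.insert(depth, needle)                   # hide the needle
            prompt = " ".join(haystack) + "\n\n" + question
            answer = generate(prompt)
            hits += int(expected.lower() in answer.lower())  # loose string match
        results[length] = hits / trials                      # recall rate per length
    return results
```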
Abstract: Empathy is a critical element of effective and satisfying conversational communication, yet previous studies on measuring conversational empathy mostly focus on expressed communicative intents -- that is, the ways in which empathy is expressed -- ignoring the fact that conversation is also a collaborative practice involving both speakers and listeners. In contrast, we propose a multi-dimensional empathy evaluation framework that extends existing work to measure both expressed intents from the speaker's perspective and perceived empathy from the listener's perspective. Applying the proposed framework to the analysis of our internal customer-service dialogues shows that the two dimensions (expressed intent types and perceived empathy) are interconnected, and that perceived empathy correlates strongly with the satisfaction level of dialogue sessions. The proposed framework still requires subjective assessments from trained annotators, which can be non-trivial to collect. To scale up evaluation without excessive reliance on carefully annotated data, we explore different modeling options for automatically measuring conversational empathy: (1) prompting frozen large language models (LLMs) and (2) training language-model-based classifiers. Extensive experiments on both internal and external dialogue datasets show that measuring conversational empathy remains a challenging task for prompting frozen LLMs, reflected in the unsatisfactory performance of GPT-4 and Flan-family models. On the other hand, our proposed instruction-finetuned classifiers based on sequence-to-sequence (Seq2Seq) language models achieve the best performance compared to prior work and competitive baselines. Finally, we perform comprehensive ablation studies on the performance of the proposed instruction-finetuned classifiers and give recommendations on their potential adoption as automatic conversational empathy evaluation metrics.
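As a concrete illustration of modeling option (1), the sketch below prompts a frozen LLM for a perceived-empathy rating of a dialogue; the rubric wording, the 1-5 scale, and the `llm_complete` callable are illustrative assumptions rather than the prompts or labels used in the paper.

```python
# Minimal sketch, assuming a generic text-completion callable `llm_complete(prompt) -> str`.
PERCEIVED_EMPATHY_PROMPT = """You are rating a customer-service dialogue.
On a scale of 1 (not at all) to 5 (very much), how empathetic would the
customer perceive the agent to be? Reply with a single integer.

Dialogue:
{dialogue}

Rating:"""

def rate_perceived_empathy(llm_complete, dialogue: str) -> int:
    """Ask a frozen LLM for a 1-5 perceived-empathy rating of `dialogue`."""
    reply = llm_complete(PERCEIVED_EMPATHY_PROMPT.format(dialogue=dialogue))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 3   # fall back to the scale midpoint
```

The instruction-finetuned Seq2Seq classifiers of option (2) would instead be trained on annotator ratings, trading prompt engineering for supervised data.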
Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device, memory-constrained use cases. Evaluation on a broad range of benchmarks shows that our most capable model, Gemini Ultra, advances the state of the art in 30 of the 32 benchmarks; notably, it is the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and it improves the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of Gemini models in cross-modal reasoning and language understanding will enable a wide variety of use cases, and we discuss our approach toward deploying them responsibly to users.
Abstract: Personalized text generation presents a specialized mechanism for delivering content that is specific to a user's personal context. While research progress in this area has been rapid, evaluation remains a challenge. Traditional automated metrics such as BLEU and ROUGE primarily measure lexical similarity to human-written references and cannot distinguish personalization from other subtle semantic aspects, thus falling short of capturing the nuances of personalized generated content quality. On the other hand, human judgments are costly to obtain, especially in the realm of personalized evaluation. Motivated by these challenges, we explore the use of large language models (LLMs) for evaluating personalized text generation and examine their ability to understand nuanced user context. We present AuPEL, a novel evaluation method that distills three major semantic aspects of the generated text, namely personalization, quality, and relevance, and measures these aspects automatically. To validate the effectiveness of AuPEL, we design carefully controlled experiments, compare the accuracy of evaluation judgments made by LLMs against those made by human annotators, and conduct rigorous analyses of the consistency and sensitivity of the proposed metric. We find that, compared to existing evaluation metrics, AuPEL not only distinguishes and ranks models by their personalization abilities more accurately, but also shows commendable consistency and efficiency for this task. Our work suggests that using LLMs as evaluators of personalized text generation is superior to using traditional text-similarity metrics, even though interesting new challenges remain.
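As a rough sketch of how an LLM-based evaluator along AuPEL's three aspects might be queried, the code below asks an LLM to compare two candidate texts on personalization, quality, and relevance given a user's past writings; the pairwise protocol, prompt wording, and `llm_complete` callable are assumptions for illustration, not the paper's actual procedure.

```python
# Minimal sketch, assuming a generic text-completion callable `llm_complete(prompt) -> str`.
ASPECTS = ("personalization", "quality", "relevance")

PAIRWISE_PROMPT = """Given the user's past writings:
{user_context}

Which candidate text is better in terms of {aspect}?
Candidate A:
{text_a}

Candidate B:
{text_b}

Answer with exactly "A" or "B"."""

def compare_candidates(llm_complete, user_context, text_a, text_b):
    """Return a dict mapping each aspect to the preferred candidate ('A' or 'B')."""
    verdicts = {}
    for aspect in ASPECTS:
        reply = llm_complete(PAIRWISE_PROMPT.format(
            user_context=user_context, aspect=aspect,
            text_a=text_a, text_b=text_b)).strip().upper()
        verdicts[aspect] = "A" if reply.startswith("A") else "B"
    return verdicts
```

Aggregating such pairwise verdicts over many users and candidate pairs is one way an automatic evaluator could rank generation models per aspect.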