Michael Boratko

Gemini Embedding: Generalizable Embeddings from Gemini

Mar 10, 2025

Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, Gustavo Hernández Ábrego, Zhe Li, Kaifeng Chen, Henrique Schechter Vera(+37 more)

Abstract: In this report, we introduce Gemini Embedding, a state-of-the-art embedding model leveraging the power of Gemini, Google's most capable large language model. Capitalizing on Gemini's inherent multilingual and code understanding capabilities, Gemini Embedding produces highly generalizable embeddings for text spanning numerous languages and textual modalities. The representations generated by Gemini Embedding can be precomputed and applied to a variety of downstream tasks including classification, similarity, clustering, ranking, and retrieval. Evaluated on the Massive Multilingual Text Embedding Benchmark (MMTEB), which includes over one hundred tasks across 250+ languages, Gemini Embedding substantially outperforms prior state-of-the-art models, demonstrating considerable improvements in embedding quality. Achieving state-of-the-art performance across MMTEB's multilingual, English, and code benchmarks, our unified model demonstrates strong capabilities across a broad selection of tasks and surpasses specialized domain-specific models.

* 19 pages

A Fresh Take on Stale Embeddings: Improving Dense Retriever Training with Corrector Networks

Sep 03, 2024

Nicholas Monath, Will Grathwohl, Michael Boratko, Rob Fergus, Andrew McCallum, Manzil Zaheer

Abstract: In dense retrieval, deep encoders provide embeddings for both inputs and targets, and the softmax function is used to parameterize a distribution over a large number of candidate targets (e.g., textual passages for information retrieval). Significant challenges arise in training such encoders in the increasingly prevalent scenario of (1) a large number of targets, (2) a computationally expensive target encoder model, and (3) cached target embeddings that are out-of-date due to ongoing training of target encoder parameters. This paper presents a simple and highly scalable response to these challenges by training a small parametric corrector network that adjusts stale cached target embeddings, enabling an accurate softmax approximation and thereby sampling of up-to-date, high-scoring "hard negatives." We theoretically investigate the generalization properties of our proposed target corrector, relating the complexity of the network, staleness of cached representations, and the amount of training data. We present experimental results on large benchmark dense retrieval datasets as well as on QA with retrieval augmented language models. Our approach matches state-of-the-art results even when no target embedding updates are made during training beyond an initial cache from the unsupervised pre-trained model, providing a 4-80x reduction in re-embedding computational cost.
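To make the setup in this abstract concrete, here is a minimal sketch of a softmax distribution over candidate targets whose cached embeddings are first passed through a small corrector. The residual linear form of `correct`, the weights in `W`, and the toy vectors are all assumptions made for illustration; the abstract does not specify the corrector's architecture, and nothing here is learned.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def correct(stale, weight):
    """Hypothetical residual linear corrector: stale + W @ stale."""
    return [s + dot(row, stale) for s, row in zip(stale, weight)]

def target_distribution(query, targets):
    """Softmax over dot-product scores between one query and all targets."""
    return softmax([dot(query, t) for t in targets])

# Toy data (illustrative only).
query = [1.0, 0.0]
stale_cache = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W = [[0.0, 0.1], [0.1, 0.0]]  # assumed weights, not trained

corrected = [correct(t, W) for t in stale_cache]
probs = target_distribution(query, corrected)

# Highest-probability non-gold targets would serve as "hard negatives".
ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
```

The point of the sketch is only the data flow: the expensive target encoder is never called again, and the cheap correction is applied to the cache before scoring.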

* ICML 2024

Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

Jun 19, 2024

Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien M. R. Arnold, Vincent Perot, Siddharth Dalmia(+9 more)

Abstract: Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-end modeling that minimizes cascading errors in complex pipelines, and allows for the application of sophisticated prompting techniques across the entire system. To assess this paradigm shift, we introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks. However, LCLMs still face challenges in areas like compositional reasoning that are required in SQL-like tasks. Notably, prompting strategies significantly influence performance, emphasizing the need for continued research as context lengths grow. Overall, LOFT provides a rigorous testing ground for LCLMs, showcasing their potential to supplant existing paradigms and tackle novel tasks as model capabilities scale.

* 29 pages. Dataset available at https://github.com/google-deepmind/loft

Every Answer Matters: Evaluating Commonsense with Probabilistic Measures

Jun 06, 2024

Qi Cheng, Michael Boratko, Pranay Kumar Yelugam, Tim O'Gorman, Nalini Singh, Andrew McCallum, Xiang Lorraine Li

Abstract: Large language models have demonstrated impressive performance on commonsense tasks; however, these tasks are often posed as multiple-choice questions, allowing models to exploit systematic biases. Commonsense is also inherently probabilistic with multiple correct answers. The purpose of "boiling water" could be making tea and cooking, but it also could be killing germs. Existing tasks do not capture the probabilistic nature of common sense. To this end, we present commonsense frame completion (CFC), a new generative task that evaluates common sense via multiple open-ended generations. We also propose a method of probabilistic evaluation that strongly correlates with human judgments. Humans drastically outperform strong language model baselines on our dataset, indicating this approach is both a challenging and useful evaluation of machine common sense.

* ACL 2024 Camera Ready

Gecko: Versatile Text Embeddings Distilled from Large Language Models

Mar 29, 2024

Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding(+10 more)

Abstract: We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs) into a retriever. Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each query, and relabeling the positive and hard negative passages using the same LLM. The effectiveness of our approach is demonstrated by the compactness of Gecko. On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with an embedding size of 768. Gecko with 768 embedding dimensions achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings.

* 18 pages

Box Embeddings: An open-source library for representation learning using geometric structures

Sep 10, 2021

Tejas Chheda, Purujit Goyal, Trang Tran, Dhruvesh Patel, Michael Boratko, Shib Sankar Dasgupta, Andrew McCallum

Abstract: A major factor contributing to the success of modern representation learning is the ease of performing various vector operations. Recently, objects with geometric structures (e.g., distributions, complex or hyperbolic vectors, or regions such as cones, disks, or boxes) have been explored for their alternative inductive biases and additional representational capacities. In this work, we introduce Box Embeddings, a Python library that enables researchers to easily apply and extend probabilistic box embeddings.

* The source code and the usage and API documentation for the library are available at https://github.com/iesl/box-embeddings and https://www.iesl.cs.umass.edu/box-embeddings/main/index.html

Word2Box: Learning Word Representation Using Box Embeddings

Jun 28, 2021

Shib Sankar Dasgupta, Michael Boratko, Shriya Atmakuri, Xiang Lorraine Li, Dhruvesh Patel, Andrew McCallum

Abstract: Learning vector representations for words is one of the most fundamental topics in NLP, capable of capturing syntactic and semantic relationships useful in a variety of downstream NLP tasks. Vector representations can be limiting, however, in that typical scoring such as dot product similarity intertwines position and magnitude of the vector in space. Exciting innovations in the space of representation learning have proposed alternative fundamental representations, such as distributions, hyperbolic vectors, or regions. Our model, Word2Box, takes a region-based approach to the problem of word representation, representing words as $n$-dimensional rectangles. These representations encode position and breadth independently and provide additional geometric operations such as intersection and containment, which allow them to model co-occurrence patterns that vectors struggle with. We demonstrate improved performance on various word similarity tasks, particularly on less common words, and perform a qualitative analysis exploring the additional unique expressivity provided by Word2Box.
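The geometric operations named in this abstract (intersection and containment of $n$-dimensional rectangles) can be sketched in a few lines. This is a plain-Python illustration with hard, axis-aligned boxes, not the Word2Box model or the Box Embeddings library API; the box layout `(mins, maxs)`, the function names, and the example words are assumptions chosen for the sketch.

```python
def intersect(a, b):
    """Intersection of two axis-aligned boxes given as (mins, maxs)."""
    lo = [max(x, y) for x, y in zip(a[0], b[0])]
    hi = [min(x, y) for x, y in zip(a[1], b[1])]
    return lo, hi

def volume(box):
    """Product of side lengths; an empty box (hi < lo) has volume 0."""
    v = 1.0
    for lo, hi in zip(*box):
        v *= max(hi - lo, 0.0)
    return v

def contains(outer, inner):
    """True if `inner` lies entirely inside `outer` in every dimension."""
    return all(
        o_lo <= i_lo and i_hi <= o_hi
        for o_lo, i_lo, i_hi, o_hi in zip(outer[0], inner[0], inner[1], outer[1])
    )

# Toy 2-D boxes (illustrative values, not learned embeddings).
animal = ([0.0, 0.0], [4.0, 4.0])
dog = ([1.0, 1.0], [2.0, 3.0])
car = ([5.0, 5.0], [6.0, 6.0])

overlap_dog = volume(intersect(animal, dog))  # dog sits inside animal
overlap_car = volume(intersect(animal, car))  # disjoint boxes
```

Note that a box's position (its center) and its breadth (its side lengths) vary independently here, which is the property the abstract contrasts with dot-product scoring on vectors.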

* Work in progress

Probabilistic Box Embeddings for Uncertain Knowledge Graph Reasoning

Apr 09, 2021

Xuelu Chen, Michael Boratko, Muhao Chen, Shib Sankar Dasgupta, Xiang Lorraine Li, Andrew McCallum

Abstract: Knowledge bases often consist of facts which are harvested from a variety of sources, many of which are noisy and some of which conflict, resulting in a level of uncertainty for each triple. Knowledge bases are also often incomplete, prompting the use of embedding methods to generalize from known facts. However, existing embedding methods only model triple-level uncertainty, and reasoning results lack global consistency. To address these shortcomings, we propose BEUrRE, a novel uncertain knowledge graph embedding method with calibrated probabilistic semantics. BEUrRE models each entity as a box (i.e., an axis-aligned hyperrectangle) and relations between two entities as affine transforms on the head and tail entity boxes. The geometry of the boxes allows for efficient calculation of intersections and volumes, endowing the model with calibrated probabilistic semantics and facilitating the incorporation of relational constraints. Extensive experiments on two benchmark datasets show that BEUrRE consistently outperforms baselines on confidence prediction and fact ranking due to its probabilistic calibration and ability to capture high-order dependencies among facts.
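As a rough illustration of how box geometry can yield a probability-like score, the sketch below applies a per-dimension affine map to an entity box and then takes a ratio of volumes, Vol(A ∩ B) / Vol(B). The specific parameterization (shift the center, scale the side lengths), the function names, and the numbers are assumptions for this sketch; the abstract does not state BEUrRE's exact transform or scoring function.

```python
def affine(box, shift, scale):
    """Assumed affine map: translate each center, rescale each side length."""
    out_lo, out_hi = [], []
    for lo, hi, t, s in zip(box[0], box[1], shift, scale):
        center = (lo + hi) / 2 + t
        half = (hi - lo) / 2 * s
        out_lo.append(center - half)
        out_hi.append(center + half)
    return out_lo, out_hi

def overlap_volume(a, b):
    """Volume of the intersection of two axis-aligned boxes."""
    v = 1.0
    for a_lo, a_hi, b_lo, b_hi in zip(a[0], a[1], b[0], b[1]):
        v *= max(min(a_hi, b_hi) - max(a_lo, b_lo), 0.0)
    return v

def conditional(a, b):
    """Vol(a ∩ b) / Vol(b), read as P(a | b); assumes b has positive volume."""
    return overlap_volume(a, b) / overlap_volume(b, b)

# Toy entity boxes and a hypothetical relation-specific transform.
head = ([0.0, 0.0], [2.0, 2.0])
tail = ([1.0, 0.0], [3.0, 2.0])

before = conditional(head, tail)
head_r = affine(head, shift=[1.0, 0.0], scale=[1.0, 1.0])
after = conditional(head_r, tail)
```

Because the score is a ratio of volumes it always lies in [0, 1], which is what makes a probabilistic reading of the boxes possible.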

* NAACL-HLT 2021

Modeling Fine-Grained Entity Types with Box Embeddings

Jan 02, 2021

Yasumasa Onoe, Michael Boratko, Greg Durrett

Abstract: Neural entity typing models typically represent entity types as vectors in a high-dimensional space, but such spaces are not well-suited to modeling these types' complex interdependencies. We study the ability of box embeddings, which represent entity types as d-dimensional hyperrectangles, to represent hierarchies of fine-grained entity type labels even when these relationships are not defined explicitly in the ontology. Our model represents both types and entity mentions as boxes. Each mention and its context are fed into a BERT-based model to embed that mention in our box space; essentially, this model leverages typological clues present in the surface text to hypothesize a type representation for the mention. Soft box containment can then be used to derive probabilities, both the posterior probability of a mention exhibiting a given type and the conditional probability relations between types themselves. We compare our approach with a strong vector-based typing model, and observe state-of-the-art performance on several entity typing benchmarks. In addition to competitive typing performance, our box-based model shows better performance in prediction consistency (predicting a supertype and a subtype together) and confidence (i.e., calibration), implying that the box-based model captures the latent type hierarchies better than the vector-based model does.

Improving Local Identifiability in Probabilistic Box Embeddings

Oct 29, 2020

Shib Sankar Dasgupta, Michael Boratko, Dongxu Zhang, Luke Vilnis, Xiang Lorraine Li, Andrew McCallum

Abstract: Geometric embeddings have recently received attention for their natural ability to represent transitive asymmetric relations via containment. Box embeddings, where objects are represented by n-dimensional hyperrectangles, are a particularly promising example of such an embedding as they are closed under intersection and their volume can be calculated easily, allowing them to naturally represent calibrated probability distributions. The benefits of geometric embeddings also introduce a problem of local identifiability, however, where whole neighborhoods of parameters result in equivalent loss, which impedes learning. Prior work addressed some of these issues by using an approximation to Gaussian convolution over the box parameters. However, this intersection operation also increases the sparsity of the gradient. In this work, we model the box parameters with min and max Gumbel distributions, which were chosen such that the space is still closed under the intersection operation. The calculation of the expected intersection volume involves all parameters, and we demonstrate experimentally that this drastically improves the ability of such models to learn.

* Accepted at NeurIPS 2020
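The closure property mentioned in this abstract follows from a standard fact: the maximum of independent max-Gumbel variables with locations μ_i and a shared scale β is again max-Gumbel, with location β·logsumexp(μ_i/β), and symmetrically for the minimum of min-Gumbel variables. The sketch below computes the location parameters of an intersection under the assumption that lower corners are max-Gumbel and upper corners are min-Gumbel with one shared β; that pairing, the function names, and the toy boxes are assumptions of this sketch rather than details given in the abstract.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def gumbel_intersection(box_a, box_b, beta):
    """Location parameters of the intersection of two Gumbel boxes.

    Boxes are (lower_locations, upper_locations); beta is the shared scale.
    """
    lo = [beta * logsumexp([a / beta, b / beta])
          for a, b in zip(box_a[0], box_b[0])]
    hi = [-beta * logsumexp([-a / beta, -b / beta])
          for a, b in zip(box_a[1], box_b[1])]
    return lo, hi

# Toy 1-D boxes (illustrative values).
a = ([0.0], [2.0])
b = ([1.0], [3.0])

soft_lo, soft_hi = gumbel_intersection(a, b, beta=1.0)
hard_lo, hard_hi = gumbel_intersection(a, b, beta=0.01)  # close to max / min
```

With a large β every input location contributes to the result, so gradients reach all parameters; as β shrinks, the result approaches the hard `max`/`min` intersection, where only one box's corner matters in each dimension.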
