Blair Chen

Gemini Embedding: Generalizable Embeddings from Gemini

Mar 10, 2025

Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, Gustavo Hernández Ábrego, Zhe Li, Kaifeng Chen, Henrique Schechter Vera(+37 more)

Abstract: In this report, we introduce Gemini Embedding, a state-of-the-art embedding model leveraging the power of Gemini, Google's most capable large language model. Capitalizing on Gemini's inherent multilingual and code understanding capabilities, Gemini Embedding produces highly generalizable embeddings for text spanning numerous languages and textual modalities. The representations generated by Gemini Embedding can be precomputed and applied to a variety of downstream tasks including classification, similarity, clustering, ranking, and retrieval. Evaluated on the Massive Multilingual Text Embedding Benchmark (MMTEB), which includes over one hundred tasks across 250+ languages, Gemini Embedding substantially outperforms prior state-of-the-art models, demonstrating considerable improvements in embedding quality. Achieving state-of-the-art performance across MMTEB's multilingual, English, and code benchmarks, our unified model demonstrates strong capabilities across a broad selection of tasks and surpasses specialized domain-specific models.

* 19 pages
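
As a rough illustration of how precomputed embeddings can serve a ranking or retrieval task, the sketch below orders documents by cosine similarity to a query vector. The vectors are hand-made toy values, not output from Gemini Embedding, and the helper names `cosine` and `rank` are assumptions introduced for this example only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs):
    """Return document ids sorted by descending similarity to the query."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)


# Toy vectors standing in for embeddings computed ahead of time (assumed values).
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.6, 0.8, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query_vec = [0.9, 0.1, 0.0]

ranking = rank(query_vec, doc_vecs)
print(ranking)
```

Because the document vectors are computed once and stored, only the query needs to be embedded at lookup time; the same stored vectors could feed a classifier or a clustering routine instead.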

Gecko: Versatile Text Embeddings Distilled from Large Language Models

Mar 29, 2024

Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding(+10 more)

Abstract: We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs) into a retriever. Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each query, and relabeling the positive and hard negative passages using the same LLM. The effectiveness of our approach is demonstrated by the compactness of Gecko. On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with 768 embedding size. Gecko with 768 embedding dimensions achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings.

* 18 pages
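
The relabeling step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `score_fn` is a hypothetical stand-in for the LLM's relevance judgment (here a toy word-overlap count), and choosing the lowest-scoring retrieved candidate as the hard negative is an assumption made for the example.

```python
def relabel(query, candidates, score_fn):
    """Pick a positive and a hard negative from retrieved candidates.

    score_fn(query, passage) -> float is a hypothetical scorer; in the
    abstract's setting an LLM would play this role.
    """
    ranked = sorted(candidates, key=lambda p: score_fn(query, p), reverse=True)
    positive = ranked[0]
    # Assumption: use the lowest-scoring retrieved passage as the hard negative.
    hard_negative = ranked[-1] if len(ranked) > 1 else None
    return positive, hard_negative


def toy_score(query, passage):
    """Toy relevance score: number of words shared by query and passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


query = "how do geckos climb walls"
candidates = [
    "geckos climb walls using tiny hairs on their toes",
    "geckos are small lizards that live near walls",
    "walls are often painted white",
]

positive, hard_negative = relabel(query, candidates, toy_score)
print(positive)
print(hard_negative)
```

All candidates come from a retriever, so even the lowest-scoring one is topically close to the query, which is what makes it a useful negative for training.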

Content Conditional Debiasing for Fair Text Embedding

Feb 23, 2024

Wenlong Deng, Blair Chen, Xiaoxiao Li, Christos Thrampoulidis

Abstract: Mitigating biases in machine learning models has gained increasing attention in Natural Language Processing (NLP). Yet, only a few studies focus on fair text embeddings, which are crucial yet challenging for real-world applications. In this paper, we propose a novel method for learning fair text embeddings. We achieve fairness while preserving utility by ensuring conditional independence between sensitive attributes and text embeddings conditioned on the content. Specifically, we enforce that embeddings of texts with different sensitive attributes but identical content maintain the same distance toward the embedding of their corresponding neutral text. Furthermore, we address the lack of proper training data by using Large Language Models (LLMs) to augment texts into different sensitive groups. Our extensive evaluations demonstrate that our approach effectively improves fairness while preserving the utility of embeddings, representing a pioneering effort in achieving conditional independence for fair text embeddings.
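
The equal-distance constraint in the abstract can be illustrated with a simple penalty that is zero exactly when every sensitive-group variant of a text sits at the same distance from the neutral text's embedding. The abstract does not give the paper's loss, so the squared-gap form, the Euclidean metric, and the name `distance_gap_penalty` are all assumptions for this sketch.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_gap_penalty(group_embeddings, neutral_embedding):
    """Sum of squared gaps between every pair of group-to-neutral distances.

    Returns 0.0 when all group variants are equidistant from the neutral text.
    """
    dists = [euclidean(e, neutral_embedding) for e in group_embeddings]
    return sum(
        (dists[i] - dists[j]) ** 2
        for i in range(len(dists))
        for j in range(i + 1, len(dists))
    )


neutral = [0.0, 0.0]  # toy embedding of the neutral text
balanced = distance_gap_penalty([[1.0, 0.0], [0.0, 1.0]], neutral)
skewed = distance_gap_penalty([[1.0, 0.0], [3.0, 4.0]], neutral)
print(balanced, skewed)
```

In the balanced case both variants lie at distance 1 from the neutral embedding, so the penalty vanishes; in the skewed case the distances are 1 and 5, so the penalty is positive.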

An Investigation of how Label Smoothing Affects Generalization

Oct 23, 2020

Blair Chen, Liu Ziyin, Zihao Wang, Paul Pu Liang

Abstract: It has been hypothesized that label smoothing can reduce overfitting and improve generalization, and current empirical evidence seems to corroborate these effects. However, there is a lack of mathematical understanding of when and why such empirical improvements occur. In this paper, as a step towards understanding why label smoothing is effective, we propose a theoretical framework to show how label smoothing helps in controlling the generalization loss. In particular, we show that this benefit can be precisely formulated and identified in the label noise setting, where the training data is partially mislabeled. Our theory also predicts the existence of an optimal label smoothing point, a single value for the label smoothing hyperparameter that minimizes generalization loss. Extensive experiments are done to confirm the predictions of our theory. We believe that our findings will help both theoreticians and practitioners understand label smoothing, and better apply it to real-world datasets.
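
For readers unfamiliar with the technique, the sketch below shows the commonly used form of label smoothing, in which a one-hot target is mixed with a uniform distribution over the classes. The abstract does not state which parameterization the paper uses, so this form and the names `smooth_labels`, `cross_entropy`, and `alpha` are assumptions.

```python
import math


def smooth_labels(true_class, num_classes, alpha):
    """Mix a one-hot target with the uniform distribution.

    Each entry is (1 - alpha) * one_hot + alpha / num_classes.
    """
    uniform = alpha / num_classes
    return [
        (1.0 - alpha) * (1.0 if k == true_class else 0.0) + uniform
        for k in range(num_classes)
    ]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


target = smooth_labels(true_class=0, num_classes=4, alpha=0.1)
loss = cross_entropy(target, [0.7, 0.1, 0.1, 0.1])
print(target, loss)
```

Here `alpha` is the smoothing hyperparameter; setting it to 0 recovers the ordinary one-hot target, and the abstract's "optimal label smoothing point" refers to a particular value of this kind of hyperparameter.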

Learning Not to Learn in the Presence of Noisy Labels

Feb 16, 2020

Liu Ziyin, Blair Chen, Ru Wang, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency, Masahito Ueda

Abstract: Learning in the presence of label noise is a challenging yet important task: it is crucial to design models that are robust in the presence of mislabeled datasets. In this paper, we discover that a new class of loss functions called the gambler's loss provides strong robustness to label noise across various levels of corruption. We show that training with this loss function encourages the model to "abstain" from learning on the data points with noisy labels, resulting in a simple and effective method to improve robustness and generalization. In addition, we propose two practical extensions of the method: 1) an analytical early stopping criterion to approximately stop training before the memorization of noisy labels, as well as 2) a heuristic for setting hyperparameters that does not require knowledge of the noise corruption rate. We demonstrate the effectiveness of our method by achieving strong results across three image and text classification tasks as compared to existing baselines.
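
The abstention mechanism can be illustrated with one form in which a gambler's loss is often written: the model outputs one probability per class plus an extra abstention probability, and the loss is the negative log of the true-class probability plus the abstention probability divided by a payoff. The abstract gives no formula, so this form and the names `gamblers_loss` and `payoff` are assumptions for the sketch.

```python
import math


def gamblers_loss(probs, label, payoff):
    """Assumed form: -log(p[label] + p[abstain] / payoff).

    probs holds m class probabilities followed by one abstention probability;
    payoff is a hyperparameter assumed to satisfy 1 < payoff <= m.
    """
    return -math.log(probs[label] + probs[-1] / payoff)


# Two classes plus abstention. The given label (0) disagrees with the class
# the model favours, as would happen for a mislabeled example.
committed = gamblers_loss([0.1, 0.9, 0.0], label=0, payoff=1.5)
abstaining = gamblers_loss([0.1, 0.1, 0.8], label=0, payoff=1.5)
print(committed, abstaining)
```

Moving probability mass to the abstention output lowers the loss on the suspicious example without requiring the model to fit the given label, which is the sense in which the model can "abstain" from learning noisy points.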