Gui Citovsky

Google Research

Analyzing Similarity Metrics for Data Selection for Language Model Pretraining

Feb 04, 2025
Via arXiv

GIST: Greedy Independent Set Thresholding for Diverse Data Summarization

May 29, 2024
Via arXiv

SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection

Jan 24, 2024
Via arXiv

Leveraging Importance Weights in Subset Selection

Jan 28, 2023
Via arXiv

Batch Active Learning at Scale

Jul 29, 2021
Via arXiv

Scaling Hierarchical Agglomerative Clustering to Billion-sized Datasets

May 25, 2021
Via arXiv

Online Hierarchical Clustering Approximations

Sep 20, 2019
Via arXiv