Abbas Zaidi

Flexible Models for Microclustering with Application to Entity Resolution

Oct 31, 2016

Giacomo Zanella, Brenda Betancourt, Hanna Wallach, Jeffrey Miller, Abbas Zaidi, Rebecca C. Steorts

Abstract: Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.

* 15 pages, 3 figures, 1 table, to appear in NIPS 2016. arXiv admin note: text overlap with arXiv:1512.00792
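
The linear-growth behavior described in the abstract above can be seen in a small simulation. The sketch below samples partitions from a Chinese restaurant process (the partition induced by a Dirichlet process mixture) and reports the fraction of points in the largest cluster as the data set grows. For contrast, it includes a toy partition whose cluster sizes are drawn independently of the data set size. That toy partition is an illustrative assumption and is not the model class introduced in the paper; the function names, the concentration value `alpha`, and the cap `max_size` are likewise hypothetical.

```python
import random


def crp_cluster_sizes(n, alpha, rng):
    """Sample cluster sizes for n points from a Chinese restaurant process."""
    sizes = []
    for i in range(n):
        # Point i opens a new cluster with probability alpha / (i + alpha);
        # otherwise it joins an existing cluster with probability
        # proportional to that cluster's current size.
        u = rng.random() * (i + alpha)
        if u < alpha:
            sizes.append(1)
            continue
        u -= alpha
        for k, s in enumerate(sizes):
            if u < s:
                sizes[k] += 1
                break
            u -= s
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sizes


def bounded_cluster_sizes(n, max_size, rng):
    """Toy partition (assumption, not the paper's model): each cluster size
    is uniform on 1..max_size, regardless of n."""
    sizes = []
    remaining = n
    while remaining > 0:
        s = min(rng.randint(1, max_size), remaining)
        sizes.append(s)
        remaining -= s
    return sizes


def largest_fraction(sizes):
    """Fraction of all points that fall in the largest cluster."""
    return max(sizes) / sum(sizes)


def mean_largest_fraction(sampler, n, reps, rng):
    """Average the largest-cluster fraction over several sampled partitions."""
    return sum(largest_fraction(sampler(n, rng)) for _ in range(reps)) / reps


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 1000, 10000):
        crp = mean_largest_fraction(
            lambda m, r: crp_cluster_sizes(m, 1.0, r), n, 10, rng)
        toy = mean_largest_fraction(
            lambda m, r: bounded_cluster_sizes(m, 5, r), n, 10, rng)
        print(f"N={n:6d}  CRP: {crp:.3f}  size-capped toy: {toy:.5f}")
```

Under the Chinese restaurant process the largest-cluster fraction should stay roughly constant as N grows, while in the size-capped toy it shrinks like 1/N, which is the qualitative distinction the abstract draws.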

Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set

Dec 02, 2015

Jeffrey Miller, Brenda Betancourt, Abbas Zaidi, Hanna Wallach, Rebecca C. Steorts

Abstract: Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some tasks, this assumption is undesirable. For example, when performing entity resolution, the size of each cluster is often unrelated to the size of the data set. Consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks therefore require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new model that exhibits this property. We compare this model to several commonly used clustering models by checking model fit using real and simulated data sets.

* 8 pages, 3 figures, NIPS Bayesian Nonparametrics: The Next Generation Workshop Series
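
The abstract above names the microclustering property without stating it. One natural way to formalize "each cluster contains a negligible fraction of the total number of data points" is sketched below. The symbol $M_N$, denoting the size of the largest cluster in a random partition of $N$ data points, is notation assumed here for illustration.

```latex
% Assumed notation: M_N is the size of the largest cluster among N data points.
\[
  \frac{M_N}{N} \xrightarrow{\;p\;} 0 \quad \text{as } N \to \infty,
  \qquad\text{i.e.}\qquad
  \lim_{N \to \infty} \Pr\!\left( \frac{M_N}{N} \ge \varepsilon \right) = 0
  \quad \text{for every } \varepsilon > 0 .
\]
```

By contrast, under a Dirichlet process mixture the ratio $M_N / N$ converges to a nondegenerate random limit rather than to zero, which is the linear growth the abstract refers to.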
