Abstract:We show that the gradient of the cosine similarity between two points goes to zero in two under-explored settings: (1) if a point has large magnitude or (2) if the points are on opposite ends of the latent space. Counterintuitively, we prove that optimizing the cosine similarity between points forces them to grow in magnitude. Thus, (1) is unavoidable in practice. We then observe that these derivations are extremely general -- they hold across deep learning architectures and for many of the standard self-supervised learning (SSL) loss functions. This leads us to propose cut-initialization: a simple change to network initialization that helps all studied SSL methods converge faster.
Abstract:We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for compressing the number of points - while random sampling runs in sublinear time and coresets provide theoretical guarantees, the former does not enforce accuracy while the latter is too slow as the numbers of points and clusters grow. Indeed, it has been conjectured that any sensitivity-based coreset construction requires super-linear time in the dataset size. We examine this relationship by first showing that there does exist an algorithm that obtains coresets via sensitivity sampling in effectively linear time - within log-factors of the time it takes to read the data. Any approach that significantly improves on this must then resort to practical heuristics, leading us to consider the spectrum of sampling strategies across both real and artificial datasets in the static and streaming settings. Through this, we show the conditions in which coresets are necessary for preserving cluster validity as well as the settings in which faster, cruder sampling strategies are sufficient. As a result, we provide a comprehensive theoretical and practical blueprint for effective clustering regardless of data size. Our code is publicly available and has scripts to recreate the experiments.
Abstract:It has become standard to explain neural network latent spaces with attraction/repulsion dimensionality reduction (ARDR) methods like tSNE and UMAP. This relies on the premise that structure in the 2D representation is consistent with the structure in the model's latent space. However, this is an unproven assumption -- we are unaware of any convergence guarantees for ARDR algorithms. We work on closing this question by relating ARDR methods to classical dimensionality reduction techniques. Specifically, we show that one can fully recover a PCA embedding by applying attractions and repulsions onto a randomly initialized dataset. We also show that, with a small change, Locally Linear Embeddings (LLE) can reproduce ARDR embeddings. Finally, we formalize a series of conjectures that, if true, would allow one to attribute structure in the 2D embedding back to the input distribution.
Abstract:tSNE and UMAP are popular dimensionality reduction algorithms due to their speed and interpretable low-dimensional embeddings. Despite their popularity, however, little work has been done to study their full span of differences. We theoretically and experimentally evaluate the space of parameters in both tSNE and UMAP and observe that a single one -- the normalization -- is responsible for switching between them. This, in turn, implies that a majority of the algorithmic differences can be toggled without affecting the embeddings. We discuss the implications this has on several theoretic claims behind UMAP, as well as how to reconcile them with existing tSNE interpretations. Based on our analysis, we provide a method (\ourmethod) that combines previously incompatible techniques from tSNE and UMAP and can replicate the results of either algorithm. This allows our method to incorporate further improvements, such as an acceleration that obtains either method's outputs faster than UMAP. We release improved versions of tSNE, UMAP, and \ourmethod that are fully plug-and-play with the traditional libraries at https://github.com/Andrew-Draganov/GiDR-DUN
Abstract:TSNE and UMAP are two of the most popular dimensionality reduction algorithms due to their speed and interpretable low-dimensional embeddings. However, while attempts have been made to improve on TSNE's computational complexity, no existing method can obtain TSNE embeddings at the speed of UMAP. In this work, we show that this is indeed possible by combining the two approaches into a single method. We theoretically and experimentally evaluate the full space of parameters in the TSNE and UMAP algorithms and observe that a single parameter, the normalization, is responsible for switching between them. This, in turn, implies that a majority of the algorithmic differences can be toggled without affecting the embeddings. We discuss the implications this has on several theoretic claims underpinning the UMAP framework, as well as how to reconcile them with existing TSNE interpretations. Based on our analysis, we propose a new dimensionality reduction algorithm, GDR, that combines previously incompatible techniques from TSNE and UMAP and can replicate the results of either algorithm by changing the normalization. As a further advantage, GDR performs the optimization faster than available UMAP methods and thus an order of magnitude faster than available TSNE methods. Our implementation is plug-and-play with the traditional UMAP and TSNE libraries and can be found at github.com/Andrew-Draganov/GiDR-DUN.
Abstract:We present complex-valued Convolutional Neural Networks (CNNs) for RF fingerprinting that go beyond translation invariance and appropriately account for the inductive bias with respect to multipath propagation channels, a phenomenon that is specific to the fields of wireless signal processing and communications. We focus on the problem of fingerprinting wireless IoT devices in-the-wild using Deep Learning (DL) techniques. Under these real-world conditions, the multipath environments represented in the train and test sets will be different. These differences are due to the physics governing the propagation of wireless signals, as well as the limitations of practical data collection campaigns. Our approach follows a group-theoretic framework, leverages prior work on DL on manifold-valued data, and extends this prior work to the wireless signal processing domain. We introduce the Lie group of transformations that a signal experiences under the multipath propagation model and define operations that are equivariant and invariant to the frequency response of a Finite Impulse Response (FIR) filter to build a ChaRRNet. We present results using synthetic and real-world datasets, and we benchmark against a strong baseline model, that show the efficacy of our approach. Our results provide evidence of the benefits of incorporating appropriate wireless domain biases into DL models. We hope to spur new work in the area of robust RF machine learning, as the 5G revolution increases demand for enhanced security mechanisms.