Abstract: Text word embeddings that encode distributional semantic features work by modeling contextual similarities of frequently occurring words. Acoustic word embeddings, on the other hand, typically encode low-level phonetic similarities. Semantic embeddings for spoken words have previously been explored using algorithms similar to Word2Vec, but the resulting vectors still mainly encoded phonetic rather than semantic features. In this paper, we examine the assumptions and architectures used in previous works and show experimentally how Word2Vec algorithms fail to encode distributional semantics when the input units are acoustically correlated. In addition, previous works relied on the simplifying assumptions of perfect word segmentation and clustering by word type. Under these assumptions, a trivial solution identical to text-based embeddings is available but has been overlooked. We follow this simpler path using automatic word type clustering and examine the effects on the resulting embeddings, highlighting the true challenges in this task.