We propose a model of the speech perception of individual words in the presence of mishearings. This phenomenological approach is based on concepts used in linguistics, and provides a formalism that is universal across languages. We put forward an efficient two-parameter form for the word length distribution, and introduce a simple representation of mishearings, which we use in our subsequent modelling of word recognition. In a context-free scenario, word recognition often occurs via anticipation when, part-way into a word, we can correctly guess its full form. We give a quantitative estimate of this anticipation threshold when no mishearings occur, in terms of model parameters. As might be expected, the whole anticipation effect disappears when there are sufficiently many mishearings. Our global approach to the problem of speech perception is in the spirit of an optimisation problem. We show for instance that speech perception is easy when the word length is less than a threshold, to be identified with a static transition, and hard otherwise. We extend this to the dynamics of word recognition, proposing an intuitive approach highlighting the distinction between individual, isolated mishearings and clusters of contiguous mishearings. At least in some parameter range, a dynamical transition is manifest well before the static transition is reached, as is the case for many other examples of complex systems.