Abstract:Human readers can accurately count how many letters are in a word (e.g., 7 in ``buffalo''), remove a letter from a given position (e.g., ``bufflo'') or add a new one. The human brain of readers must have therefore learned to disentangle information related to the position of a letter and its identity. Such disentanglement is necessary for the compositional, unbounded, ability of humans to create and parse new strings, with any combination of letters appearing in any positions. Do modern deep neural models also possess this crucial compositional ability? Here, we tested whether neural models that achieve state-of-the-art on disentanglement of features in visual input can also disentangle letter position and letter identity when trained on images of written words. Specifically, we trained beta variational autoencoder ($\beta$-VAE) to reconstruct images of letter strings and evaluated their disentanglement performance using CompOrth - a new benchmark that we created for studying compositional learning and zero-shot generalization in visual models for orthography. The benchmark suggests a set of tests, of increasing complexity, to evaluate the degree of disentanglement between orthographic features of written words in deep neural models. Using CompOrth, we conducted a set of experiments to analyze the generalization ability of these models, in particular, to unseen word length and to unseen combinations of letter identities and letter positions. We found that while models effectively disentangle surface features, such as horizontal and vertical `retinal' locations of words within an image, they dramatically fail to disentangle letter position and letter identity and lack any notion of word length. Together, this study demonstrates the shortcomings of state-of-the-art $\beta$-VAE models compared to humans and proposes a new challenge and a corresponding benchmark to evaluate neural models.
Abstract:Learning to read places a strong challenge on the visual system. Years of expertise lead to a remarkable capacity to separate highly similar letters and encode their relative positions, thus distinguishing words such as FORM and FROM, invariantly over a large range of sizes and absolute positions. How neural circuits achieve invariant word recognition remains unknown. Here, we address this issue by training deep neural network models to recognize written words and then analyzing how reading-specialized units emerge and operate across different layers of the network. With literacy, a small subset of units becomes specialized for word recognition in the learned script, similar to the "visual word form area" of the human brain. We show that these units are sensitive to specific letter identities and their distance from the blank space at the left or right of a word, thus acting as "space bigrams". These units specifically encode ordinal positions and operate by pooling across low and high-frequency detector units from early layers of the network. The proposed neural code provides a mechanistic insight into how information on letter identity and position is extracted and allow for invariant word recognition, and leads to predictions for reading behavior, error patterns, and the neurophysiology of reading.