Abstract:What metrics should guide the development of more realistic models of the brain? One proposal is to quantify the similarity between models and brains using methods such as linear regression, Centered Kernel Alignment (CKA), and angular Procrustes distance. To better understand the limitations of these similarity measures we analyze neural activity recorded in five experiments on nonhuman primates, and optimize synthetic datasets to become more similar to these neural recordings. How similar can these synthetic datasets be to neural activity while failing to encode task relevant variables? We find that some measures like linear regression and CKA, differ from angular Procrustes, and yield high similarity scores even when task relevant variables cannot be linearly decoded from the synthetic datasets. Synthetic datasets optimized to maximize similarity scores initially learn the first principal component of the target dataset, but angular Procrustes captures higher variance dimensions much earlier than methods like linear regression and CKA. We show in both theory and simulations how these scores change when different principal components are perturbed. And finally, we jointly optimize multiple similarity scores to find their allowed ranges, and show that a high angular Procrustes similarity, for example, implies a high CKA score, but not the converse.
Abstract:Understanding how visual information is encoded in biological and artificial systems often requires vision scientists to generate appropriate stimuli to test specific hypotheses. Although deep neural network models have revolutionized the field of image generation with methods such as image style transfer, available methods for video generation are scarce. Here, we introduce the Spatiotemporal Style Transfer (STST) algorithm, a dynamic visual stimulus generation framework that allows powerful manipulation and synthesis of video stimuli for vision research. It is based on a two-stream deep neural network model that factorizes spatial and temporal features to generate dynamic visual stimuli whose model layer activations are matched to those of input videos. As an example, we show that our algorithm enables the generation of model metamers, dynamic stimuli whose layer activations within our two-stream model are matched to those of natural videos. We show that these generated stimuli match the low-level spatiotemporal features of their natural counterparts but lack their high-level semantic features, making it a powerful paradigm to study object recognition. Late layer activations in deep vision models exhibited a lower similarity between natural and metameric stimuli compared to early layers, confirming the lack of high-level information in the generated stimuli. Finally, we use our generated stimuli to probe the representational capabilities of predictive coding deep networks. These results showcase potential applications of our algorithm as a versatile tool for dynamic stimulus generation in vision science.