Abstract: Human language users can generate descriptions of perceptual concepts that go beyond instance-level representations, and can use such descriptions to learn provisional class-level representations. However, the ability of computational models to learn and operate with class-level representations is under-investigated in the language-and-vision field. In this paper, we train separate neural networks to generate and interpret class-level descriptions. We then use the zero-shot classification performance of the interpretation model as a measure of communicative success and class-level conceptual grounding. We investigate the performance of prototype- and exemplar-based neural representations for grounded category description. Finally, we show that communicative success reveals performance issues in the generation model that are not captured by traditional intrinsic NLG evaluation metrics, and argue that these issues can be traced to a failure to properly ground language in vision at the class level. We observe that the interpretation model performs better with descriptions that have low class-level diversity, possibly indicating a strong reliance on frequently occurring features.
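A brief illustration of the two representation styles named in this abstract may help: in the standard formulation, a prototype representation summarises a class by the mean of its instance embeddings, while an exemplar representation keeps the individual instances and scores a query against them directly. The sketch below is not the paper's implementation; it is a minimal, hypothetical example (function names such as `zero_shot_accuracy` and the cosine-similarity scoring are assumptions) of how zero-shot classification of generated class descriptions could serve as a communicative-success proxy.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def prototype_representation(instance_embeddings):
    # Prototype view: a class is summarised by the mean of its instance embeddings.
    return instance_embeddings.mean(axis=0)

def exemplar_similarity(query, instance_embeddings):
    # Exemplar view: a class is represented by its stored instances;
    # here the class score is the best match against any stored exemplar.
    norms = np.linalg.norm(instance_embeddings, axis=1) * np.linalg.norm(query) + 1e-9
    return float(((instance_embeddings @ query) / norms).max())

def zero_shot_accuracy(description_embeddings, class_instances, gold_labels, mode="prototype"):
    # Communicative-success proxy: the interpretation model maps each generated
    # class description to an embedding; the trial counts as successful if the
    # most similar class representation is the intended class.
    class_names = list(class_instances.keys())
    correct = 0
    for desc, gold in zip(description_embeddings, gold_labels):
        if mode == "prototype":
            scores = [cosine(desc, prototype_representation(class_instances[c]))
                      for c in class_names]
        else:
            scores = [exemplar_similarity(desc, class_instances[c]) for c in class_names]
        correct += class_names[int(np.argmax(scores))] == gold
    return correct / len(gold_labels)
```

Under this reading, comparing `mode="prototype"` with `mode="exemplar"` on the same description embeddings is one way to contrast the two kinds of class-level grounding the abstract refers to.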
Abstract: In this paper we examine the different meaning representations that are commonly used in natural language applications today and discuss their limits, both in terms of the aspects of natural language meaning they model and in terms of the aspects of the applications for which they are used.
Abstract: Building computer systems that can converse about their visual environment is one of the oldest concerns of research in Artificial Intelligence and Computational Linguistics (see, for example, Winograd's 1972 SHRDLU system). Only recently, however, have methods from computer vision and natural language processing become powerful enough to make this vision seem more attainable. Driven especially by developments in computer vision, many datasets and collection environments that bring together verbal interaction and visual processing have recently been published. Here, we argue that these datasets tend to oversimplify the dialogue part, and we propose a task, MeetUp!, that requires both visual and conversational grounding and that makes stronger demands on representations of the discourse. MeetUp! is a two-player coordination game in which players move through a visual environment with the objective of finding each other. To do so, they must talk about what they see and achieve mutual understanding. We describe a data collection and show that the resulting dialogues indeed exhibit the dialogue phenomena of interest, while also posing challenges for the language & vision side.