Annotated datasets are commonly used in the training and evaluation of tasks involving natural language and vision (image description generation, action recognition and visual question answering). However, many of the existing datasets reflect problems that emerge in the process of data selection and annotation. Here we point out some of the difficulties and problems one confronts when creating and validating annotated vision and language datasets.