Abstract: When trains collide with obstacles, the consequences are often severe. To assess how artificial intelligence might contribute to avoiding collisions, we need to understand how train drivers do it. What aspects of a situation do they consider when evaluating the risk of collision? In the present study, we assumed that train drivers not only identify potential obstacles but also interpret what they see in order to anticipate how the situation might unfold. However, to date it is unclear how exactly this is accomplished. Therefore, we assessed which cues train drivers use and what inferences they make. To this end, image-based expert interviews were conducted with 33 train drivers. Participants saw images with potential obstacles, rated the risk of collision, and explained their evaluation. Moreover, they were asked how the situation would need to change to decrease or increase collision risk. From their verbal reports, we extracted concepts about the potential obstacles, their contexts, and the anticipated consequences, and assigned these concepts to various categories (e.g., people's identity, location, movement, action, physical features, and mental states). The results revealed that, especially for people, train drivers reason about actions and mental states, and draw relations between concepts to make further inferences. These inferences systematically differ between situations. Our findings emphasise the need to understand train drivers' risk evaluation processes when aiming to enhance the safety of both human and automatic train operation.
Abstract: Deep Learning models like Convolutional Neural Networks (CNN) are powerful image classifiers, but what factors determine whether they attend to similar image areas as humans do? While previous studies have focused on technological factors, little is known about the role of factors that affect human attention. In the present study, we investigated how the tasks used to elicit human attention maps interact with image characteristics in modulating the similarity between humans and CNN. We varied the intentionality of the human tasks, ranging from spontaneous gaze during categorization, through intentional gaze-pointing, up to manual area selection. Moreover, we varied the type of image to be categorized, using either singular salient objects, indoor scenes consisting of object arrangements, or landscapes without distinct objects defining the category. The human attention maps generated in this way were compared to the CNN attention maps revealed by explainable artificial intelligence (Grad-CAM). The influence of the human task strongly depended on image type: For objects, human manual selection produced maps that were most similar to those of the CNN, while the specific eye-movement task had little impact. For indoor scenes, spontaneous gaze produced the least similarity, while for landscapes, similarity was equally low across all human tasks. To better understand these results, we also compared the different human attention maps to each other. Our results highlight the importance of taking human factors into account when comparing the attention of humans and CNN.
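For readers unfamiliar with how CNN attention maps of the kind mentioned above are obtained, the sketch below shows a minimal Grad-CAM implementation in PyTorch. It is purely illustrative: the abstract does not specify the model or layer used in the study, so the choice of resnet50, the hooked layer (layer4), and the function name grad_cam are assumptions for this example, not details taken from the paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

# Illustrative pretrained CNN; the study's actual architecture is not stated here.
model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()

activations, gradients = {}, {}

def save_activation(module, inp, out):
    activations["value"] = out.detach()

def save_gradient(module, grad_in, grad_out):
    gradients["value"] = grad_out[0].detach()

# Hook the last convolutional stage; Grad-CAM weights its spatial feature maps.
model.layer4.register_forward_hook(save_activation)
model.layer4.register_full_backward_hook(save_gradient)

def grad_cam(image):
    """Return a normalised attention map for a preprocessed (1, 3, H, W) image."""
    scores = model(image)                      # forward pass stores activations
    cls = scores.argmax(dim=1).item()          # explain the top-scoring class
    model.zero_grad()
    scores[0, cls].backward()                  # backward pass stores gradients

    acts = activations["value"]                # (1, C, h, w) feature maps
    grads = gradients["value"]                 # (1, C, h, w) gradients
    weights = grads.mean(dim=(2, 3), keepdim=True)          # channel importance
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True)) # weighted sum + ReLU
    cam = F.interpolate(cam, size=image.shape[2:],
                        mode="bilinear", align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze()      # scale to [0, 1]
```

A map produced this way can then be compared to a human attention map of the same resolution, for example by correlating the two arrays; the abstract does not state which similarity measure the study used.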