Vision-aided wireless communication is attracting increasing interest and finding new use cases across wireless communication applications. Vision-aided frameworks leverage visual data (captured, for example, by cameras installed at the infrastructure or on mobile devices) to build a perception of the communication environment (geometry, users, scatterers, etc.). This is typically achieved using deep learning and advances in computer vision and visual scene understanding. Prior work has investigated problems such as vision-aided beam, blockage, and handoff prediction in millimeter wave (mmWave) systems and vision-aided covariance prediction in massive MIMO systems. That work, however, has focused on scenarios with a single object (user) moving in front of the camera. To enable vision-aided wireless communication in practice, these systems must be able to operate in crowded scenarios with multiple objects in the visual scene. In this paper, we define the user identification task as the key enabler for realistic vision-aided wireless communication systems that can operate in crowded scenarios and support multi-user applications. The objective of the user identification task is to distinguish the target communication user from the other candidate objects (distractors) in the visual scene. We develop machine learning models that process either one frame or a sequence of frames of visual and wireless data to efficiently identify the target user in the visual/communication environment. Using the large-scale multi-modal sensing and communication dataset DeepSense 6G, which is based on real-world measurements, we show that the developed approaches can successfully identify the target users with more than 97% accuracy ...