Abstract: Advanced digital assistants can significantly enhance task performance, reduce user burden, and provide personalized guidance to improve users' abilities. However, developing such intelligent digital assistants is a formidable challenge. To address this, we introduce TOM, a conceptual architecture and software platform (https://github.com/TOM-Platform) designed to support the development of intelligent wearable assistants that are contextually aware of both the user and the environment. The system was developed collaboratively with AR/MR researchers, HCI researchers, AI/robotics researchers, and software developers, and it continues to evolve to meet the diverse requirements of these stakeholders. TOM facilitates the creation of intelligent assistive AR applications for daily activities and supports the recording and analysis of user interactions, the integration of new devices, and the provision of assistance for various activities. Additionally, we showcase several proof-of-concept assistive services and discuss the challenges involved in developing such services.
Abstract: Current gesture recognition systems primarily focus on identifying gestures within a predefined set, leaving a gap in connecting these gestures to interactive GUI elements or system functions (e.g., linking a 'thumb-up' gesture to a 'like' button). We introduce GestureGPT, a novel zero-shot gesture understanding and grounding framework leveraging large language models (LLMs). Gesture descriptions are formulated from hand landmark coordinates extracted from gesture videos and fed into our dual-agent dialogue system. A gesture agent deciphers these descriptions and asks about the interaction context (e.g., interface, history, gaze data), which a context agent organizes and provides. Following iterative exchanges, the gesture agent discerns the user's intent and grounds it to an interactive function. We validated the gesture description module using public first-view and third-view gesture datasets and tested the whole system in two real-world settings: video streaming and smart home IoT control. The highest zero-shot Top-5 grounding accuracies are 80.11% for video streaming and 90.78% for smart home tasks, showing the potential of the new gesture understanding paradigm.
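To make the reported metric concrete, the following is a minimal sketch of how a Top-k grounding accuracy could be computed, assuming each trial yields a ranked list of candidate functions and a single ground-truth function. The function name `top_k_accuracy`, the candidate labels, and the sample trials are hypothetical illustrations, not the evaluation code or data behind the figures above.

```python
from typing import Sequence


def top_k_accuracy(
    ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int = 5
) -> float:
    """Fraction of trials whose true function is among the first k ranked candidates."""
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        raise ValueError("at least one trial is required")
    # A trial counts as a hit when the target appears in the top-k slice.
    hits = sum(
        1 for candidates, target in zip(ranked, truth) if target in candidates[:k]
    )
    return hits / len(truth)


# Hypothetical trials (illustrative only, not data from the paper).
ranked_predictions = [
    ["like", "share", "pause", "mute", "next"],
    ["volume_up", "pause", "next", "like", "mute"],
    ["lights_on", "fan_on", "tv_on", "lights_off", "fan_off"],
]
ground_truth = ["like", "fullscreen", "lights_off"]

print(top_k_accuracy(ranked_predictions, ground_truth, k=5))
print(top_k_accuracy(ranked_predictions, ground_truth, k=1))
```

With these made-up trials, two of the three targets fall within the top five candidates and only one is ranked first, so the Top-5 score is higher than the Top-1 score.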