Abstract: We introduce a fitness tracking system that enables remote monitoring of exercises using only an RGB smartphone camera, making fitness tracking more private, scalable, and cost-effective. Although prior work has explored automated exercise supervision, existing models are either too limited in exercise variety or too complex for real-world deployment: they typically focus on a small set of exercises and fail to generalize across diverse movements. In contrast, we develop a robust multitask motion analysis model capable of performing exercise detection and repetition counting across hundreds of exercises, a scale far beyond previous methods. We overcome earlier data limitations by assembling a large-scale fitness dataset, Olympia, covering more than 1,900 exercises. To our knowledge, ours is the first vision-language model that can perform multiple tasks on skeletal fitness data. On Olympia, our model detects exercises with 76.5% accuracy and counts repetitions with 85.3% off-by-one accuracy, using only RGB video. By presenting a single vision-language transformer model for both exercise identification and repetition counting, we take a significant step toward democratizing AI-powered fitness tracking.
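To make the multitask setup concrete, below is a minimal sketch (assuming PyTorch) of a shared transformer encoder with two heads, one for exercise classification and one for repetition-count regression, operating on skeleton sequences. The abstract does not specify the architecture; the class name, dimensions, joint count, and mean pooling here are illustrative assumptions, and the sketch deliberately omits the vision-language component to show only the shared-encoder, two-head pattern.

```python
# Hypothetical sketch of a multitask model over skeleton sequences.
# The paper's actual architecture and heads are not specified in the abstract.
import torch
import torch.nn as nn

class MultiTaskSkeletonTransformer(nn.Module):
    """Shared transformer encoder with two task heads:
    exercise classification and repetition-count regression."""
    def __init__(self, num_joints=17, d_model=256, num_classes=1900):
        super().__init__()
        # Each frame's 2D joint coordinates are flattened into one token.
        self.embed = nn.Linear(num_joints * 2, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.cls_head = nn.Linear(d_model, num_classes)  # exercise detection
        self.count_head = nn.Linear(d_model, 1)          # repetition counting

    def forward(self, poses):
        # poses: (batch, frames, num_joints * 2) skeleton sequence
        tokens = self.embed(poses)
        features = self.encoder(tokens).mean(dim=1)  # temporal pooling
        return self.cls_head(features), self.count_head(features)

model = MultiTaskSkeletonTransformer()
clip = torch.randn(1, 120, 17 * 2)  # 120 frames of 17 2D joints
logits, count = model(clip)
print(logits.shape, count.shape)    # torch.Size([1, 1900]) torch.Size([1, 1])
```

Sharing one encoder across both tasks is what lets a single model scale to hundreds of exercises: the classification and counting heads read the same temporal representation rather than each task training a separate network.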

Abstract: Gesturing is one of the natural modes of human communication. Signs produced by gestures carry a basic meaning, over which additional information can be layered. Sign language is an important example of communicative gesture that is highly structured and widely accepted across the world as a communication medium for the deaf and hard-of-hearing community. In this paper, an online recognition scheme is proposed to interpret a standard numeric sign language comprising 10 basic hand symbols. A web camera captures the hand movements in real time as input to the system. The basic meaning of the hand gesture is extracted from each input frame by analysing the shape of the hand, with its orientation, movement, and location assumed fixed. The input hand shape is processed to identify the palm structure, the fingertips and their relative positions, and the presence of an extended thumb. A 2-dimensional skeletal model is generated from the acquired shape information to represent, and subsequently interpret, the basic meaning of the hand gesture.
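As an illustration of the shape-analysis step, the following sketch (assuming Python with OpenCV) segments the hand from a webcam frame, extracts its contour, and estimates the number of extended fingers from convexity defects between fingertips. The abstract does not give the paper's segmentation method or skeletal-model construction, so the Otsu thresholding, angle test, and depth cutoff below are all assumptions standing in for the described palm/fingertip analysis.

```python
# Hedged sketch: classic contour/convexity-defect finger counting with OpenCV.
# Thresholds and the brightness-based segmentation are illustrative assumptions.
import cv2
import numpy as np

def count_extended_fingers(frame):
    """Segment the hand and estimate extended fingers from convexity defects."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (5, 5), 0)
    # Assumes the hand is the bright foreground region (illustrative).
    _, mask = cv2.threshold(blur, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0
    hand = max(contours, key=cv2.contourArea)      # largest blob = hand
    hull = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull)
    if defects is None:
        return 0
    fingers = 0
    for s, e, f, depth in defects[:, 0]:
        start, end, far = hand[s][0], hand[e][0], hand[f][0]
        a = np.linalg.norm(end - start)
        b = np.linalg.norm(far - start)
        c = np.linalg.norm(end - far)
        # A valley between two fingertips makes a sharp angle at the defect.
        angle = np.arccos((b**2 + c**2 - a**2) / (2 * b * c + 1e-6))
        if angle < np.pi / 2 and depth > 10000:   # depth is in fixed-point
            fingers += 1
    return fingers + 1 if fingers else 0  # n valleys imply n + 1 fingertips

cap = cv2.VideoCapture(0)  # live webcam input, as in the proposed scheme
ok, frame = cap.read()
if ok:
    print("extended fingers:", count_extended_fingers(frame))
cap.release()
```

A full system along the lines the abstract describes would go further, using the fingertip and palm positions to fit the 2-dimensional skeletal model and distinguishing all 10 numeric symbols, including thumb-dependent signs that simple finger counting cannot separate.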