Abstract:With the advance of technologies, machines are increasingly present in people's daily lives. Thus, there has been more and more effort for developing interfaces, such as dynamic gestures, that provide an intuitive way of interaction. Currently, the most common trend is to use multimodal data, as depth and skeleton information, to try to recognize dynamic gestures. However, the use of only color information would be more interesting, once RGB cameras are usually found in almost every public place, and could be used for gesture recognition without the need to install other equipment. The main problem with this approach is the difficulty of representing spatio-temporal information using just color. With this in mind, we propose a technique that we called Star RGB, capable of describing a videoclip containing a dynamic gesture as an RGB image. This image is then passed to a classifier formed by two Resnet CNN's, a soft-attention ensemble, and a multilayer perceptron, which returns the predicted class label that indicates to which type of gesture the input video belongs. Experiments were carried out using the Montalbano and GRIT datasets. On the Montalbano dataset, the proposed approach achieved an accuracy of 94.58%, this result reaches the state-of-the-art using this dataset, considering only color information. On the GRIT dataset, our proposal achieves more than 98% of accuracy, recall, precision, and F1-score, outperforming the reference approach in more than 6%.