Human activity recognition based on wearable sensor data has been an attractive research topic due to its applications in areas such as healthcare, homeland security and smart environments. In this context, many works have reported remarkable results using accelerometer, gyroscope and magnetometer data to represent the categories of activities. However, current studies overlook important issues that lead to skewed results, making it hard to measure how well sensor-based human activity recognition actually performs and preventing a direct comparison of previous works. These issues include the metrics employed, the validation protocol, the sample generation process, and the quality of the dataset (i.e., the sampling rate and the number of activities to be recognized). We emphasize that in other research areas, such as image classification and object detection, these issues are well defined, which allows more effort to be directed toward the application itself. Inspired by this, in this work we conduct an extensive set of experiments to expose the vulnerable points in human activity recognition based on wearable sensor data. To this end, we implement and evaluate several state-of-the-art approaches, ranging from methods based on handcrafted features to convolutional neural networks. Furthermore, we standardize a large number of datasets, which vary in sampling rate and in the number of sensors, activities and subjects. According to our study, most of the evaluation protocols applied in the literature are not adequate for activity recognition based on wearable sensor data, with recognition accuracy dropping by around ten percentage points under the appropriate validation.