We demonstrate the practicability of an indoor positioning system (IPS) based solely on neural networks (NNs) and the channel state information (CSI) of a (massive) multiple-input multiple-output (MIMO) communication system, i.e., built only on data that already exists in today's systems. As such, our IPS promises good accuracy without any additional protocol or signaling overhead for the user localization task. In particular, we propose a tailored NN structure with an additional phase branch as feature extractor and, compared to previous results, a significantly reduced number of trainable parameters, which in turn reduces the amount of required training data. We provide actual measurements for indoor scenarios with up to 64 antennas covering a large area of 80 m². In the second part, we conduct several robustness investigations on real measurements: once the NN is trained, we analyze the recall accuracy over a time period of several days. Further, we analyze the impact of pedestrians walking around during the measurements and show that fine-tuning and pre-training of the NN help to mitigate the effects of hardware drifts and alterations in the propagation environment over time. This reduces the number of required training samples at equal precision and thereby decreases the effort of the costly training data acquisition.
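To illustrate the kind of input a separate phase branch could operate on, the following minimal sketch splits complex CSI coefficients into amplitude and phase features. The function name `csi_to_features`, the example values, and the choice of the raw phase angle as the feature are assumptions made for illustration only; the actual feature representation and NN structure are not specified here.

```python
import cmath

def csi_to_features(csi):
    """Split complex CSI coefficients into amplitude and phase lists.

    Illustrative assumption: one complex coefficient per antenna, and the
    raw phase angle is used as the phase feature.
    """
    amplitude = [abs(h) for h in csi]
    phase = [cmath.phase(h) for h in csi]  # radians, in [-pi, pi]
    return amplitude, phase

# Hypothetical CSI snapshot for four antennas (made-up values).
csi = [1 + 1j, -2 + 0j, 0 - 3j, 0.5 + 0j]
amplitude, phase = csi_to_features(csi)
```

In such a setup, the two feature lists could be fed to separate input branches of an NN, so that the phase information is processed independently before being merged with the amplitude features.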