Energy Efficiency (EE) is of high importance in Massive Multiple-Input Multiple-Output (M-MIMO) networks, where base stations (BSs) are equipped with antenna arrays of up to hundreds of elements. M-MIMO transmission, although highly spectrally efficient, results in high energy consumption that grows with the number of antennas. This paper investigates EE improvement through switching underutilized BSs on and off. A location-aware approach is proposed, in which data about the optimal set of active BSs is stored in a Radio Environment Map (REM). For efficient acquisition, processing, and utilization of the REM data, reinforcement learning (RL) algorithms are used. State-of-the-art exploration/exploitation methods, including ε-greedy, Upper Confidence Bound (UCB), and Gradient Bandit, are evaluated. Then, analytical action filtering and an REM-based Exploration Algorithm (REM-EA) are proposed to reduce the RL convergence time. The algorithms are evaluated using an advanced system-level simulator of an M-MIMO Heterogeneous Network (HetNet) that utilizes an accurate 3D ray-tracing radio channel model. The proposed RL-based BS switching algorithm is shown to provide a 70% gain in EE over a state-of-the-art algorithm based on an analytical heuristic. Moreover, the proposed action filtering and REM-EA reduce the RL convergence time relative to the best-performing state-of-the-art exploration method by 60% and 83%, respectively.
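To make the exploration/exploitation trade-off concrete, the following minimal sketch illustrates two of the evaluated baseline strategies, ε-greedy and UCB, on a toy multi-armed bandit in which each action stands for one candidate set of active BSs and the reward stands for the observed EE. It is not the simulator or the REM-EA described in this paper: the function names, the toy mean rewards in `TOY_EE`, the noise level, and the parameters `epsilon` and `c` are all illustrative assumptions. Gradient Bandit is omitted for brevity.

```python
import math
import random


def epsilon_greedy(q, epsilon, rng):
    """With probability epsilon explore uniformly, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])


def ucb(q, counts, t, c=2.0):
    """Choose the action with the highest upper confidence bound."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # try every action once before using the bound
    return max(
        range(len(q)),
        key=lambda a: q[a] + c * math.sqrt(math.log(t) / counts[a]),
    )


def run_bandit(select, true_means, steps=2000, seed=0):
    """Run one bandit episode and return value estimates and pull counts."""
    rng = random.Random(seed)
    q = [0.0] * len(true_means)    # running estimate of the reward per action
    counts = [0] * len(true_means)
    for t in range(1, steps + 1):
        a = select(q, counts, t, rng)
        # Hypothetical noisy EE observation for the chosen active-BS set.
        reward = rng.gauss(true_means[a], 0.1)
        counts[a] += 1
        q[a] += (reward - q[a]) / counts[a]  # incremental sample mean
    return q, counts


# Assumed toy values: normalized mean EE of three candidate active-BS sets.
TOY_EE = [0.4, 0.9, 0.6]

q_eps, n_eps = run_bandit(
    lambda q, n, t, rng: epsilon_greedy(q, 0.1, rng), TOY_EE
)
q_ucb, n_ucb = run_bandit(lambda q, n, t, rng: ucb(q, n, t), TOY_EE)

print("epsilon-greedy pulls per action:", n_eps)
print("UCB pulls per action:", n_ucb)
```

In this toy setting both strategies end up selecting the second action most often, but they spend a different number of steps on the inferior actions before doing so. Reducing that exploration cost is the purpose of the action filtering and REM-EA proposed here.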