The growing adoption of mmWave frequency bands to realize the full potential of 5G makes beamforming a key enabler for current and next-generation wireless technologies. Many mmWave networks rely on beam selection with a Grid-of-Beams (GoB) approach to handle user-beam association. In GoB-based beam selection, users select the appropriate beam from a set of pre-defined beams, and the overhead incurred during the selection process is a common challenge. In this paper, we propose an Advantage Actor-Critic (A2C) learning-based framework that improves the GoB and the beam selection process and optimizes transmission power in a mmWave network. The proposed beam selection technique improves performance, while accounting for transmission power improves Energy Efficiency (EE) and ensures that coverage is maintained in the network. We further investigate how the proposed algorithm can be deployed in a Service Management and Orchestration (SMO) platform. Our simulations show that A2C-based joint optimization of beam selection and transmission power is more effective than either using Equally Spaced Beams (ESB) with a fixed transmission power strategy or optimizing beam selection and transmission power separately. Compared to the ESB and fixed transmission power strategy, the proposed approach achieves more than twice the average EE in the scenarios under test and is closer to the maximum theoretical EE.