The optimal control of non-equilibrium open quantum systems is a challenging task, but it plays a key role in improving existing quantum information processing technologies. We introduce a general model-free framework based on Reinforcement Learning to identify out-of-equilibrium thermodynamic cycles that are Pareto-optimal trade-offs between power and efficiency for quantum heat engines and refrigerators. The method requires no knowledge of the quantum thermal machine, of the system model, or of the quantum state. Instead, it only observes the heat fluxes, so it is applicable to both simulations and experimental devices. We test our method by identifying Pareto-optimal trade-offs between power and efficiency in two systems: an experimentally realistic refrigerator based on a superconducting qubit, where we identify non-intuitive control sequences that reduce quantum friction and outperform previous cycles proposed in the literature; and a heat engine based on a quantum harmonic oscillator, where we find cycles with an elaborate structure that outperform the optimized Otto cycle.