Increasing levels of renewable generation are driving interest in data-driven approaches to AC optimal power flow (AC OPF) as a way to manage uncertainty; however, a lack of disciplined dataset creation and benchmarking prevents meaningful comparison among approaches in the literature. To instill confidence, models must reliably predict solutions across a wide range of operating conditions. This paper develops the OPF-Learn package for Julia and Python, which uses a computationally efficient approach to create representative datasets that span a wide spectrum of the AC OPF feasible region. Load profiles are sampled uniformly from a convex set that contains the AC OPF feasible set. For each infeasible point found, the convex set is reduced using an infeasibility certificate obtained from properties of a relaxed formulation. The framework is shown to generate datasets that are more representative of the entire feasible space than those produced by traditional techniques in the literature, which improves machine learning model performance.
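
The sample-and-shrink loop described above can be illustrated with a minimal Python sketch on a toy problem. This is not the OPF-Learn API: every name below (`sample_polytope`, `is_feasible`, `certificate_cut`, `build_dataset`) is hypothetical. The sketch assumes a two-dimensional "load" space, uses the unit disk as a stand-in for the AC OPF feasible set, uses a tangent half-space as a stand-in for an infeasibility certificate derived from a relaxation, and uses plain rejection sampling in place of whatever uniform sampler the package employs.

```python
import numpy as np


def sample_polytope(A, b, lo, hi, rng, max_tries=10_000):
    """Draw one point uniformly from {x : A x <= b} by rejection from the box [lo, hi]."""
    for _ in range(max_tries):
        x = rng.uniform(lo, hi)
        if np.all(A @ x <= b):
            return x
    raise RuntimeError("could not find a point inside the polytope")


def is_feasible(x):
    """Toy stand-in for an AC OPF feasibility check: membership in the unit disk."""
    return float(x @ x) <= 1.0


def certificate_cut(x):
    """Toy stand-in for an infeasibility certificate.

    Returns (a, beta) such that a @ y <= beta holds for every feasible y
    but is violated by the infeasible point x.
    """
    a = x / np.linalg.norm(x)  # x is infeasible, so its norm exceeds 1
    return a, 1.0


def build_dataset(n_samples, seed=0, max_iters=100_000):
    """Collect feasible samples while shrinking the outer convex set."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array([-2.0, -2.0]), np.array([2.0, 2.0])

    # Initial convex set: the box lo <= x <= hi written as A x <= b.
    A = np.vstack([np.eye(2), -np.eye(2)])
    b = np.concatenate([hi, -lo])

    feasible, n_cuts = [], 0
    for _ in range(max_iters):
        if len(feasible) >= n_samples:
            break
        x = sample_polytope(A, b, lo, hi, rng)
        if is_feasible(x):
            feasible.append(x)
        else:
            # Reduce the convex set so this region is never sampled again.
            a, beta = certificate_cut(x)
            A = np.vstack([A, a])
            b = np.append(b, beta)
            n_cuts += 1
    return np.array(feasible), A, b, n_cuts
```

Calling `build_dataset(50)` returns the feasible samples together with the reduced polytope and the number of cuts added. Each cut removes a region that contains no feasible points, so in this toy setting the share of draws that land in the feasible set can only grow as cuts accumulate, which is the intuition behind the efficiency claim in the abstract.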