Generative models are typically evaluated by direct inspection of their generated samples, e.g., by visual inspection in the case of images. Further evaluation metrics like the Fr\'echet inception distance or the maximum mean discrepancy are difficult to interpret and lack physical motivation. These observations render the evaluation of generative models in the wireless PHY layer non-trivial. This work establishes a framework of evaluation metrics and methods for generative models applied to the wireless PHY layer. The proposed metrics and methods are motivated by wireless applications, making them interpretable and accessible to the wireless community. In particular, we propose a spectral efficiency analysis to validate the generated channel norms and a codebook fingerprinting method to validate the generated channel directions. Moreover, we propose an application cross-check that evaluates the suitability of the generative model's samples for training machine learning-based models in relevant downstream tasks. Our analysis is based on real-world measurement data and covers the Gaussian mixture model, the variational autoencoder, the diffusion model, and the generative adversarial network as generative models. Under a fair comparison in terms of model architecture, our results indicate that relying solely on metrics like the maximum mean discrepancy yields insufficient evaluation outcomes. In contrast, the proposed metrics and methods exhibit consistent and explainable behavior.
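
To make the spectral efficiency analysis concrete, the following is a minimal sketch that compares the spectral efficiency distributions induced by real and generated channel samples. It assumes a matched-filter spectral efficiency of the form $\log_2(1 + \mathrm{SNR}\,\lVert h \rVert^2)$, so that the metric depends only on the channel norms; the array shapes, the SNR grid, and the random placeholder channels are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def spectral_efficiency(H: np.ndarray, snr_db: float) -> np.ndarray:
    """Per-sample spectral efficiency log2(1 + SNR * ||h||^2) in bits/s/Hz.

    H has shape (num_samples, num_antennas); each row is one channel vector.
    """
    snr = 10.0 ** (snr_db / 10.0)
    norms_sq = np.sum(np.abs(H) ** 2, axis=1)  # squared channel norms
    return np.log2(1.0 + snr * norms_sq)

# Placeholder data; in practice H_real holds measured channels and
# H_gen holds samples drawn from the generative model under evaluation.
rng = np.random.default_rng(0)
H_real = rng.normal(size=(1000, 32)) + 1j * rng.normal(size=(1000, 32))
H_gen = rng.normal(size=(1000, 32)) + 1j * rng.normal(size=(1000, 32))

# Compare the resulting SE statistics over an illustrative SNR grid;
# close agreement indicates well-matched channel norm distributions.
for snr_db in (-10.0, 0.0, 10.0):
    se_real = spectral_efficiency(H_real, snr_db)
    se_gen = spectral_efficiency(H_gen, snr_db)
    print(f"SNR {snr_db:+.0f} dB: mean SE real {se_real.mean():.2f}, "
          f"gen {se_gen.mean():.2f} bits/s/Hz")
```

Because spectral efficiency is a quantity wireless engineers reason about daily, a mismatch between the two distributions is directly interpretable, in contrast to an abstract discrepancy score such as the maximum mean discrepancy.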