Performance disparities across sub-populations are known to exist in deep learning-based vision recognition models, but previous work has largely addressed such fairness concerns assuming knowledge of sensitive attribute labels. To overcome this reliance, previous strategies have involved separate learning structures to expose and adjust for disparities. In this work, we explore a new paradigm that does not require sensitive attribute labels, and evades the need for extra training by leveraging the vision-language model, CLIP, as a rich knowledge source to infer sensitive information. We present sample clustering based on similarity derived from image and attribute-specified language embeddings and assess their correspondence to true attribute distribution. We train a target model by re-sampling and augmenting under-performed clusters. Extensive experiments on multiple benchmark bias datasets show clear fairness gains of the model over existing baselines, which indicate that CLIP can extract discriminative sensitive information prompted by language, and used to promote model fairness.