Multi-modal data, such as image data sets, often lack the detailed descriptions needed to capture the rich information encoded in them. This makes answering complex natural-language queries a major challenge in these domains. In particular, unlike traditional nearest-neighbor search, where the tuples and the query are modeled as points in the same data cube, here the query and the tuples are of different modalities, so traditional query-answering solutions are not directly applicable in such settings. The existing literature addresses this challenge for image data through vector representations jointly trained on natural language and images. This technique, however, underperforms on complex queries. This paper takes a step toward addressing this challenge by introducing a Generative-AI (GenAI) powered Monte Carlo method that uses foundation models to generate synthetic samples that capture the complexity of a natural-language query and transform it into the same space as the multi-modal data. Following this method, we develop a system for image-data retrieval and propose practical techniques that allow future advances in GenAI and vector representations to further improve our system's performance. Comprehensive experiments on multiple benchmark datasets verify that our system significantly outperforms state-of-the-art techniques.
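To make the core idea concrete, the following is a minimal sketch of the GenAI-powered Monte Carlo retrieval step, assuming a text-to-image diffusion model as the foundation model and CLIP as the shared embedding space. The specific model checkpoints and the precomputed `index_embeddings` matrix are illustrative assumptions, not the components used in this paper.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed components: a diffusion model to synthesize samples from the query,
# and a CLIP image encoder defining the shared vector space of the data set.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def monte_carlo_query_embedding(query: str, n_samples: int = 8) -> torch.Tensor:
    """Generate synthetic images for the query and average their embeddings,
    mapping the natural-language query into the image-embedding space."""
    images = pipe(query, num_images_per_prompt=n_samples).images
    inputs = proc(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        emb = clip.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize each sample
    q = emb.mean(dim=0)                          # Monte Carlo estimate of the query point
    return q / q.norm()

def retrieve(query: str, index_embeddings: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Rank data-set images by cosine similarity to the estimated query embedding.
    `index_embeddings` is a hypothetical (n_images, dim) matrix of unit-normalized
    embeddings precomputed over the image collection."""
    q = monte_carlo_query_embedding(query)
    scores = index_embeddings @ q                # cosine similarity per image
    return scores.topk(k).indices
```

Because the synthetic samples live in the same space as the data, the retrieval step reduces to ordinary nearest-neighbor search; better generators or embedding models can be swapped in without changing the surrounding pipeline, which is the sense in which the system can absorb future GenAI advances.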