Machine learning has been considered a promising approach for indoor localization. Nevertheless, sample efficiency, scalability, and generalization ability remain open issues in implementing learning-based algorithms in practical systems. In this paper, we establish a zero-shot learning framework that does not need real-world measurements in a new communication environment. Specifically, a graph neural network that scales with the number of access points (APs) and mobile devices (MDs) is used to obtain coarse locations of the MDs. Based on these coarse locations, a floor-plan-aided deep neural network exploits the floor-plan image between an MD and an AP to improve localization accuracy. To further improve the generalization ability, we develop a synthetic data generator that provides data samples for different scenarios in which real-world samples are not available. We implement the framework in a prototype that estimates the locations of MDs. Experimental results show that our zero-shot learning method reduces localization errors by around $30$\% to $55$\% compared with three baselines from the existing literature.