Radiation therapy serves as an effective and standard method for cancer treatment. Excellent radiation therapy plans always rely on high-quality dose distribution maps obtained through repeated trial and error by experienced experts. However, due to individual differences and complex clinical situations, even seasoned expert teams may need help to achieve the best treatment plan every time quickly. Many automatic dose distribution prediction methods have been proposed recently to accelerate the radiation therapy planning process and have achieved good results. However, these results suffer from over-smoothing issues, with the obtained dose distribution maps needing more high-frequency details, limiting their clinical application. To address these limitations, we propose a dose prediction diffusion model based on SwinTransformer and a projector, SP-DiffDose. To capture the direct correlation between anatomical structure and dose distribution maps, SP-DiffDose uses a structural encoder to extract features from anatomical images, then employs a conditional diffusion process to blend noise and anatomical images at multiple scales and gradually map them to dose distribution maps. To enhance the dose prediction distribution for organs at risk, SP-DiffDose utilizes SwinTransformer in the deeper layers of the network to capture features at different scales in the image. To learn good representations from the fused features, SP-DiffDose passes the fused features through a designed projector, improving dose prediction accuracy. Finally, we evaluate SP-DiffDose on an internal dataset. The results show that SP-DiffDose outperforms existing methods on multiple evaluation metrics, demonstrating the superiority and generalizability of our method.