Recently, large language models (LLMs) and visionlanguage models (VLMs) have achieved significant success, demonstrating remarkable capabilities in understanding various images and videos, particularly in classification and detection tasks. However, due to the substantial differences between remote sensing images and conventional optical images, these models face considerable challenges in comprehension, especially in detection tasks. Directly prompting VLMs with detection instructions often fails to yield satisfactory results. To address this issue, this letter explores the application of VLMs for object detection in remote sensing images. Specifically, we utilize publicly available remote sensing object detection datasets, including SSDD, HRSID, and NWPU-VHR-10, to convert traditional annotation information into natural language, thereby constructing an instruction-tuning (SFT) dataset for VLM training. We then evaluate the detection performance of different fine-tuning strategies for VLMs and obtain optimized model weights for object detection in remote sensing images. Finally, we assess the model's prior knowledge capabilities through natural language queries.Experimental results demonstrate that, without modifying the model architecture, remote sensing object detection can be effectively achieved using natural language alone. Additionally, the model exhibits the ability to perform certain vision question answering (VQA) tasks. Our dataset and relevant code will be released soon.