Multi-camera 3D object detection, a critical component of vision-only driving systems, has achieved impressive progress. Notably, transformer-based methods that augment 2D image features with 3D positional encodings (PE) have enjoyed great success. However, the mechanism and design choices of 3D PE have not been thoroughly explored. In this paper, we first explore, analyze, and compare various 3D positional encodings. In particular, we devise a 3D point PE and show its superior performance, since more precise positioning may lead to better 3D detection. In practice, we utilize monocular depth estimation to obtain the 3D point positions for multi-camera 3D object detection. The PE built from estimated 3D point locations brings significant improvements over the commonly used camera-ray PE. Among DETR-based methods, ours achieves state-of-the-art performance of 45.6 mAP and 55.1 NDS on the competitive nuScenes validation set. To our knowledge, this is the first time the performance gap between vision-only (DETR-based) and LiDAR-based methods has been narrowed to within 5\% mAP and 6\% NDS.
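To make the core idea concrete, below is a minimal PyTorch-style sketch, not the paper's actual implementation: per-pixel monocular depth estimates are back-projected through the camera intrinsics into 3D points, which are then normalized by the perception range and mapped to a positional encoding. All names here (`backproject_to_points`, `PointPE`, `pc_range`) are hypothetical, and the encoding network is a simple MLP stand-in for whatever encoder the method actually uses.

```python
import torch
import torch.nn as nn

def backproject_to_points(depth, intrinsics):
    """Lift a per-pixel depth map (B, H, W) to 3D camera-frame points (B, H, W, 3).

    depth:      estimated monocular depth map
    intrinsics: (B, 3, 3) camera intrinsic matrices
    """
    B, H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype),
        torch.arange(W, dtype=depth.dtype),
        indexing="ij",
    )
    ones = torch.ones_like(u)
    pix = torch.stack([u, v, ones], dim=-1)            # (H, W, 3) homogeneous pixels
    pix = pix.expand(B, H, W, 3)
    # p = d * K^{-1} [u, v, 1]^T : the point along each camera ray at its estimated depth.
    # A camera-ray PE would stop at the rays; the point PE uses the depth-scaled points.
    K_inv = torch.linalg.inv(intrinsics)               # (B, 3, 3)
    rays = torch.einsum("bij,bhwj->bhwi", K_inv, pix)  # unit-depth rays
    return rays * depth.unsqueeze(-1)                  # scale rays by depth

class PointPE(nn.Module):
    """Map 3D point coordinates to a C-dimensional positional encoding (illustrative)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, points, pc_range):
        # Normalize points into [0, 1] by the perception range before encoding.
        lo = points.new_tensor(pc_range[:3])
        hi = points.new_tensor(pc_range[3:])
        pts = (points - lo) / (hi - lo)
        return self.mlp(pts)                           # (B, H, W, embed_dim)
```

In a DETR-style detector, an encoding like this would be added to the 2D image features before cross-attention with the object queries, in place of the camera-ray PE.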