The implementation of Autonomous Driving (AD) technologies within urban environments presents significant challenges. These challenges necessitate the development of advanced perception systems and motion planning algorithms capable of managing situations of considerable complexity. Although the end-to-end AD method utilizing LiDAR sensors has achieved significant success in this scenario, we argue that its drawbacks may hinder its practical application. Instead, we propose the vision-centric AD as a promising alternative offering a streamlined model without compromising performance. In this study, we present a path planning method that utilizes 2D bounding boxes of objects, developed through imitation learning in urban driving scenarios. This is achieved by integrating high-definition (HD) map data with images captured by surrounding cameras. Subsequent perception tasks involve bounding-box detection and tracking, while the planning phase employs both local embeddings via Graph Neural Network (GNN) and global embeddings via Transformer for temporal-spatial feature aggregation, ultimately producing optimal path planning information. We evaluated our model on the nuPlan planning task and observed that it performs competitively in comparison to existing vision-centric methods.