We present a new learning-based approach to recover egocentric 3D vehicle pose from a single RGB image. In contrast to previous works that directly map from local appearance to 3D angles, we explore a progressive approach that extracts meaningful Intermediate Geometrical Representations (IGRs) for 3D pose estimation. We design a deep model that transforms perceived intensities into IGRs, which are then mapped to a 3D representation encoding object orientation in the camera coordinate system. To fulfill this goal, we need to specify which IGRs to use and how to learn them more effectively. We answer the former question by designing an interpolated cuboid representation that can be derived readily from primitive 3D annotations. The latter question motivates us to incorporate geometric knowledge through a new loss function based on a projective invariant. This loss function allows unlabeled data to be used during training, which we validate to improve representation learning. Our system outperforms previous monocular RGB-based methods for joint vehicle detection and pose estimation on the KITTI benchmark, achieving performance comparable even to stereo methods. Code and pre-trained models will be available at the project website.
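To illustrate the projective-invariant loss mentioned above: the cross-ratio of four collinear points is the classic projective invariant, and a minimal sketch of such a loss follows, assuming the invariant in question is indeed the cross-ratio (the abstract does not name it). The symbols $\hat{p}_1,\dots,\hat{p}_4$ (four predicted collinear 2D points, e.g. interpolated along a projected cuboid edge) and $c^{\star}$ (the cross-ratio fixed by the 3D interpolation ratios) are hypothetical notation introduced here:

\[
\mathrm{CR}(\hat{p}_1,\hat{p}_2;\hat{p}_3,\hat{p}_4)
  = \frac{\lVert \hat{p}_3-\hat{p}_1 \rVert \, \lVert \hat{p}_4-\hat{p}_2 \rVert}
         {\lVert \hat{p}_3-\hat{p}_2 \rVert \, \lVert \hat{p}_4-\hat{p}_1 \rVert},
\qquad
\mathcal{L}_{\mathrm{CR}} = \big( \mathrm{CR}(\hat{p}_1,\hat{p}_2;\hat{p}_3,\hat{p}_4) - c^{\star} \big)^2 .
\]

Because $c^{\star}$ depends only on the 3D construction and not on any image-specific annotation, a loss of this form could be evaluated on unlabeled images, which is one way unlabeled data can enter the training stage.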