State-of-the-art approaches for 6D object pose estimation require large amounts of labeled data to train deep networks. However, acquiring 6D object pose annotations in large quantities is tedious and labor-intensive. To alleviate this problem, we propose a weakly supervised 6D object pose estimation approach based on 2D keypoint detection. Our method trains only on image pairs with known relative transformations between their viewpoints. Specifically, we assign a set of arbitrarily chosen 3D keypoints to represent each unknown target 3D object and train a network to detect their 2D projections such that they comply with the relative camera viewpoints. During inference, the network first predicts the 2D keypoints in the query image and a given labeled reference image. We then use these 2D keypoints, together with the arbitrarily chosen 3D keypoints retained from training, to infer the 6D object pose. Extensive experiments demonstrate that our approach achieves performance comparable to state-of-the-art fully supervised approaches.
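To make the final inference step concrete: once 2D keypoints have been detected and paired with the 3D keypoints retained from training, one standard way to recover the 6D pose from such 2D-3D correspondences is a PnP solve. The sketch below is not the paper's implementation; it is a minimal illustration using OpenCV's solvePnP, where `kpts_3d`, `kpts_2d`, and `camera_K` are hypothetical placeholders for the 3D keypoints, their detected 2D projections, and the camera intrinsics.

```python
# Minimal sketch: recover a 6D pose (R, t) from 2D-3D keypoint
# correspondences with a standard PnP solve. Not the paper's code;
# array names and the choice of solver are assumptions.
import cv2
import numpy as np

def pose_from_keypoints(kpts_3d, kpts_2d, camera_K, dist_coeffs=None):
    """Estimate rotation R (3x3) and translation t (3,) such that the
    3D keypoints project onto the detected 2D keypoints."""
    if dist_coeffs is None:
        dist_coeffs = np.zeros(5)  # assume an undistorted image
    ok, rvec, tvec = cv2.solvePnP(
        kpts_3d.astype(np.float64),   # (N, 3) arbitrarily chosen 3D keypoints
        kpts_2d.astype(np.float64),   # (N, 2) detected 2D projections
        camera_K.astype(np.float64),  # (3, 3) camera intrinsic matrix
        dist_coeffs,
        flags=cv2.SOLVEPNP_EPNP,      # EPnP needs at least 4 correspondences
    )
    if not ok:
        raise RuntimeError("PnP solve failed")
    R, _ = cv2.Rodrigues(rvec)        # axis-angle vector -> rotation matrix
    return R, tvec.reshape(3)
```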