We propose a method that integrates two widely available data sources, building footprints from 2D maps and street-level images, to derive information that is generally difficult to acquire: building heights and building facade masks in images. Building footprints are elevated in world coordinates and projected onto images. Building heights are then estimated by scoring each projected footprint on how well it aligns with building features in the images. Footprints with estimated heights are converted to simple 3D building models, which are projected back onto the images to identify buildings. Accurate camera projections are critical to this procedure, yet camera position errors inherited from external sensors are common and degrade the results. We therefore derive a solution that precisely locates cameras on maps using correspondences between image features and building footprints. Experiments on real-world datasets show the promise of our method.
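
The elevate, project, and score step can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: it assumes an ideal pinhole camera with known intrinsics `K`, rotation `R`, and translation `t`, a world frame with the z-axis pointing up, and a precomputed edge-strength map `edge_map` standing in for the building features. The function names and the scoring rule (mean edge strength along the projected roofline) are hypothetical choices made for illustration only.

```python
import numpy as np

def project_points(points_world, K, R, t):
    """Project Nx3 world points to Nx2 pixel coordinates (pinhole model)."""
    cam = points_world @ R.T + t        # world -> camera coordinates
    uvw = cam @ K.T                     # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide

def roofline_score(edge_map, p0, p1, n_samples=50):
    """Mean edge strength sampled along the image segment from p0 to p1."""
    rows_total, cols_total = edge_map.shape
    ts = np.linspace(0.0, 1.0, n_samples)[:, None]
    pts = p0[None, :] * (1.0 - ts) + p1[None, :] * ts
    cols = np.round(pts[:, 0]).astype(int)
    rows = np.round(pts[:, 1]).astype(int)
    inside = (cols >= 0) & (cols < cols_total) & (rows >= 0) & (rows < rows_total)
    if not inside.any():
        return 0.0                      # segment falls entirely outside the image
    return float(edge_map[rows[inside], cols[inside]].mean())

def estimate_height(footprint_edge_xy, edge_map, K, R, t, candidates):
    """Return the candidate height whose projected roofline scores highest.

    footprint_edge_xy holds the two (x, y) map coordinates of one footprint
    edge. The edge is assumed to lie in front of the camera for every candidate.
    """
    scores = []
    for height in candidates:
        # Elevate both footprint corners to the candidate height (z is up).
        corners = np.array([[x, y, height] for x, y in footprint_edge_xy], dtype=float)
        p0, p1 = project_points(corners, K, R, t)
        scores.append(roofline_score(edge_map, p0, p1))
    return candidates[int(np.argmax(scores))], scores
```

Calling `estimate_height` with one footprint edge and a list of candidate heights returns the best-scoring height together with the score of every candidate. Because the projection depends directly on `R` and `t`, any error in the camera position shifts the projected roofline and can change which candidate wins, which is why camera localization matters for this step.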