Abstract: We introduce PIGEON, a multi-task end-to-end system for planet-scale image geolocalization that achieves state-of-the-art performance both on external benchmarks and in human evaluation. Our work incorporates semantic geocell creation with label smoothing, pretrains a vision transformer on images with geographic information, and refines location predictions with ProtoNets across a candidate set of geocells. The contributions of PIGEON are threefold. First, we design a semantic geocell creation and splitting algorithm based on open-source data that can be adapted to any geospatial dataset. Second, we show the effectiveness of intra-geocell refinement and the applicability of unsupervised clustering and ProtoNets to the task. Finally, we make our pretrained CLIP transformer model, StreetCLIP, publicly available for use in adjacent domains, with applications to fighting climate change and to urban and rural scene understanding.
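The abstract mentions label smoothing over geocells but does not say how the smoothing is done. As a minimal sketch, assume the soft target for each geocell decays exponentially with the great-circle distance between the true location and the geocell centroid. The function names, the centroid coordinates, and the temperature `tau_km` are all hypothetical and chosen only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def smoothed_labels(true_point, cell_centroids, tau_km=75.0):
    """Soft target over geocells: cells nearer the true location get more mass.

    tau_km is an assumed temperature, not a value taken from the abstract.
    """
    dists = [haversine_km(*true_point, *c) for c in cell_centroids]
    d_min = min(dists)
    # Subtract the smallest distance so the closest cell has weight exp(0) = 1.
    weights = [math.exp(-(d - d_min) / tau_km) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical geocell centroids near Paris, London, and New York.
centroids = [(48.86, 2.35), (51.51, -0.13), (40.71, -74.01)]
probs = smoothed_labels((48.80, 2.30), centroids)
```

Compared with a one-hot target, a soft target of this kind penalizes a prediction in a neighboring geocell less than one on another continent, which is the usual motivation for distance-aware smoothing.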
Abstract: Image geolocalization is the challenging task of predicting the geographic coordinates of origin for a given photo. It is an unsolved problem that relies on combining visual clues with general knowledge about the world to make accurate predictions across geographies. We present StreetCLIP (https://huggingface.co/geolocal/StreetCLIP), a robust, publicly available foundation model that not only achieves state-of-the-art performance on multiple open-domain image geolocalization benchmarks but also does so in a zero-shot setting, outperforming supervised models trained on more than 4 million images. Our method introduces a meta-learning approach for generalized zero-shot learning by pretraining CLIP from synthetic captions, grounding CLIP in a domain of choice. We show that our method effectively transfers CLIP's generalized zero-shot capabilities to the domain of image geolocalization, improving in-domain generalized zero-shot performance without finetuning StreetCLIP on a fixed set of classes.
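To make the idea of pretraining from synthetic captions concrete, the sketch below builds captions from location metadata and reuses the same template to form zero-shot prompts over an open set of candidate labels. The caption template and both function names are assumptions made for illustration; the abstract does not state the wording actually used.

```python
def synthetic_caption(country, region=None, city=None):
    """Build a caption from location metadata using a hypothetical template."""
    parts = []
    if city:
        parts.append(f"in the city of {city}")
    if region:
        parts.append(f"in the region of {region}")
    parts.append(f"in {country}")
    return "A Street View photo " + ", ".join(parts) + "."


def zero_shot_prompts(candidate_countries):
    """Text prompts for zero-shot inference; the label set is not fixed."""
    return [synthetic_caption(c) for c in candidate_countries]


caption = synthetic_caption("France", city="Paris")
prompts = zero_shot_prompts(["France", "Japan"])
```

Because the captions used in pretraining and the prompts used at inference share one template, the candidate label list can change at inference time without retraining a classification head, which is what a generalized zero-shot setting requires.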
Abstract: In the race towards carbon neutrality, the building sector has fallen behind and could endanger the progress made across other industries. Buildings have life spans of several decades, which creates substantial inertia in the face of climate change. This inertia is further exacerbated by the scale of the existing building stock. With several billion operational buildings around the globe, working towards a carbon-neutral building sector requires solutions that enable stakeholders to accurately identify and retrofit subpar buildings at scale. However, improving the energy efficiency of the existing building stock through targeted and efficient retrofits remains challenging. As of today, the energy efficiency of buildings is generally determined through on-site visits by certified energy auditors, which makes the process slow, costly, and geographically incomplete. To accelerate the identification of promising retrofit targets, this work proposes a new method that estimates a building's energy efficiency using purely remotely sensed data such as street view and aerial imagery, OSM-derived footprint areas, and satellite-borne land surface temperature (LST) measurements. We find that in the binary setting of distinguishing efficient from inefficient buildings, our end-to-end deep learning model achieves a macro-averaged F1-score of 62.06%. As such, this work shows the potential and complementary nature of remotely sensed data in predicting building attributes such as energy efficiency and opens up new opportunities for future work to integrate additional data sources.
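For readers unfamiliar with the reported metric, the sketch below computes a macro-averaged F1-score for a binary task: the F1-score is computed separately for each class and the per-class scores are averaged with equal weight. The labels are toy values invented for illustration and are unrelated to the paper's data.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1-score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Convention used here: a class with no true or predicted members scores 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels: 1 = efficient, 0 = inefficient (made up for this example).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)
```

In this toy example the per-class scores are 0.4 for class 1 and 8/11 for class 0, so the macro average is about 0.564. Because each class counts equally, the metric does not reward a model for simply predicting the majority class.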