Given the increasingly serious air pollution problem, the monitoring of air quality index (AQI) in urban areas has drawn considerable attention. This paper presents ImgSensingNet, a vision guided aerial-ground sensing system, for fine-grained air quality monitoring and forecasting using the fusion of haze images taken by the unmanned-aerial-vehicle (UAV) and the AQI data collected by an on-ground three-dimensional (3D) wireless sensor network (WSN). Specifically, ImgSensingNet first leverages the computer vision technique to tell the AQI scale in different regions from the taken haze images, where haze-relevant features and a deep convolutional neural network (CNN) are designed for direct learning between haze images and corresponding AQI scale. Based on the learnt AQI scale, ImgSensingNet determines whether to wake up on-ground wireless sensors for small-scale AQI monitoring and inference, which can greatly reduce the energy consumption of the system. An entropy-based model is employed for accurate real-time AQI inference at unmeasured locations and future air quality distribution forecasting. We implement and evaluate ImgSensingNet on two university campuses since Feb. 2018, and has collected 17,630 photos and 2.6 millions of AQI data samples. Experimental results confirm that ImgSensingNet can achieve higher inference accuracy while greatly reduce the energy consumption, compared to state-of-the-art AQI monitoring approaches.