This thesis tackles the subject of spatio-temporal forecasting with deep learning. The motivating application at Electricity de France (EDF) is short-term solar energy forecasting with fisheye images. We explore two main research directions for improving deep forecasting methods by injecting external physical knowledge. The first direction concerns the role of the training loss function. We show that differentiable shape and temporal criteria can be leveraged to improve the performances of existing models. We address both the deterministic context with the proposed DILATE loss function and the probabilistic context with the STRIPE model. Our second direction is to augment incomplete physical models with deep data-driven networks for accurate forecasting. For video prediction, we introduce the PhyDNet model that disentangles physical dynamics from residual information necessary for prediction, such as texture or details. We further propose a learning framework (APHYNITY) that ensures a principled and unique linear decomposition between physical and data-driven components under mild assumptions, leading to better forecasting performances and parameter identification.