We introduce an explainable, physics-aware, and end-to-end differentiable model that predicts the outcome of robot-terrain interaction from camera images. The proposed MonoForce model consists of a black-box module, which predicts robot-terrain interaction forces from the onboard camera, followed by a white-box module, which transforms these forces through the laws of classical mechanics into predicted trajectories. Since the white-box module is implemented as a differentiable ODE solver, it allows us to measure the physical consistency between the predicted forces and the ground-truth trajectories of the robot, which yields a self-supervised loss similar to that of MonoDepth. To facilitate reproducibility, we provide the source code; see the project GitHub repository for the code and supplementary materials such as videos and data sequences.
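The sketch below illustrates the described pipeline under simplifying assumptions: it is not the authors' implementation. The `ForcePredictor` network, the `rollout` integrator, the shapes, the explicit-Euler step (standing in for the paper's differentiable ODE solver), and all hyperparameters are hypothetical choices made only to show how a self-supervised trajectory-consistency loss can be backpropagated through a differentiable physics module into a force-predicting network.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: names, shapes, and the Euler integrator are
# assumptions, not the MonoForce implementation.

class ForcePredictor(nn.Module):
    """Black-box module: maps a camera image to per-step interaction forces."""
    def __init__(self, horizon: int):
        super().__init__()
        self.horizon = horizon
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, horizon * 3),          # a 3D force per time step
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.backbone(image).view(-1, self.horizon, 3)


def rollout(forces, x0, v0, mass: float, dt: float) -> torch.Tensor:
    """White-box module: differentiable explicit-Euler integration of
    Newton's second law, turning predicted forces into a trajectory."""
    xs, x, v = [], x0, v0
    for k in range(forces.shape[1]):
        a = forces[:, k] / mass        # acceleration from force
        v = v + a * dt                 # integrate velocity
        x = x + v * dt                 # integrate position
        xs.append(x)
    return torch.stack(xs, dim=1)      # (batch, horizon, 3)


# Self-supervised loss: physical consistency between the trajectory implied
# by the predicted forces and the recorded (ground-truth) robot trajectory.
model = ForcePredictor(horizon=50)
image = torch.rand(1, 3, 128, 256)     # onboard camera frame (dummy data)
gt_traj = torch.rand(1, 50, 3)         # recorded robot positions (dummy data)
x0, v0 = gt_traj[:, 0], torch.zeros(1, 3)

pred_traj = rollout(model(image), x0, v0, mass=40.0, dt=0.1)
loss = torch.nn.functional.mse_loss(pred_traj, gt_traj)
loss.backward()                        # gradients flow through the ODE rollout
```

Because the rollout is written entirely in differentiable tensor operations, the trajectory error propagates back to the force predictor without requiring any force labels, which is the sense in which the loss is self-supervised.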