Context plays a significant role in the generation of motion for dynamic agents in interactive environments. This work proposes a modular method that utilises a model of the environment to aid motion prediction of tracked agents. This paper shows that modelling the spatial and dynamic aspects of a given environment alongside the local per agent behaviour results in more accurate and informed long-term motion prediction. Further, we observe that this decoupling of dynamics and environment models allows for better adaptation to unseen environments, requiring that only a spatial representation of a new environment be learned. We highlight the model's prediction capability using a benchmark pedestrian tracking problem and by tracking a robot arm performing a tabletop manipulation task. The proposed approach allows for robust and data efficient forward modelling, and relaxes the need for full model re-training in new environments. We evaluate this through an ablation study which shows better performance gain when utilising both representation modules in addition to improved generalisation on tasks with dynamics unseen at training time.