We expect this project to advance the state of the art in predicting future video content and to yield testbeds for assessing whether a learned representation has correctly disentangled the independent causal mechanisms of a scene. Predicting future video content is a challenging task with high impact for applications such as self-driving cars and robotics, as well as for generating additional training data. Existing approaches range from predicting future actions as semantic labels to rendering realistic future frames. Most current approaches condition their predictions on convolutional descriptors of previous frames. Our goal is to go beyond such representations by modeling the causality of actions and disentangling motion from appearance. The envisioned approach will make the system more explainable and interpretable, and will hence result in more trustworthy AI for applications such as self-driving cars and robotics.
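As a toy illustration of what disentangling motion from appearance means (this sketch is not part of the proposal, and all names in it are hypothetical), consider a scene reduced to a single object: its appearance code stays fixed across frames, while a separate motion mechanism evolves its state. Prediction then amounts to rolling the motion mechanism forward and reusing the appearance untouched, rather than regressing whole future frames from a monolithic descriptor.

```python
from dataclasses import dataclass

# Hypothetical toy scene: one object described by a static appearance
# factor and a separate dynamic motion factor.

@dataclass(frozen=True)
class Appearance:
    color: str   # static factor: unchanged across frames
    size: int

@dataclass
class Motion:
    position: float   # dynamic factor
    velocity: float

def step(motion: Motion) -> Motion:
    """Independent causal mechanism for motion: constant-velocity dynamics."""
    return Motion(position=motion.position + motion.velocity,
                  velocity=motion.velocity)

def render(app: Appearance, motion: Motion) -> dict:
    """Compose the two factors into an observed 'frame'."""
    return {"color": app.color, "size": app.size, "position": motion.position}

def predict_frames(app: Appearance, motion: Motion, horizon: int) -> list:
    """Roll only the motion mechanism forward; appearance is reused as-is."""
    frames = []
    for _ in range(horizon):
        motion = step(motion)
        frames.append(render(app, motion))
    return frames

frames = predict_frames(Appearance("red", 3),
                        Motion(position=0.0, velocity=2.0),
                        horizon=3)
```

In a learned model the two factors would be latent codes produced by encoders and the dynamics would be learned, but the separation itself is what makes the predictions inspectable: one can vary motion while holding appearance fixed, which is exactly the kind of intervention a disentanglement testbed would probe.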