Generative video-based foundation models hold great promise for planning and decision-making, but they face challenges in efficiency, controllability, generalization, and consistency that limit their utility in real-world applications such as autonomous vehicles. Current research largely focuses on enhancing video quality, which does not necessarily align with the requirements of world models: models that capture real-world causality and endow agents with the ability to reason, plan, and learn from simulated experience, without direct interaction with the real world. The challenges above often manifest as memory deterioration, hallucinations, and out-of-distribution (OOD) failures, compounded by the inherent demand on foundation models to generalize across tasks, domains, and embodiments, each of which requires specialized state-action representations. Their root cause often lies in the absence of dynamic causal representations and their associated implicit action spaces, and it is exacerbated by unreliable modelling of state transitions, whether deterministic or stochastic.

In light of these limitations, a set of open questions arises regarding the misalignment between high-quality video generation and the need for controllable world models oriented towards general planning and control. This discrepancy, together with the lack of dynamic controllable representations, calls for a critical re-evaluation of current architectures and training paradigms targeted at planning and control, and for an investigation into how abundant action-free data from diverse domains might be leveraged to develop controllable, realistic foundation models. One aspect of realism is semantic consistency, e.g., motion, depth, and the object-ness of scene elements. Another pertains to behaviour and the short- and long-term relations among world entities.
For instance, in the driving domain, a left-curve sign should influence the model's imagination of the road ahead, faster cars should remain on the left, and pedestrians should never be run over. In summary, this project focuses on addressing these challenges so that foundation models scale effectively for generalized planning, ultimately paving the way for truly intelligent systems that are not only visually compelling but also capable of robust, real-world general reasoning.
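The idea above, that implicit action spaces can be recovered from abundant action-free data and then reused to drive controllable state transitions, can be made concrete with a toy sketch. The following is a drastically simplified, hypothetical analogue of latent-action learning, not the project's actual architecture: all names, shapes, and the synthetic data are illustrative assumptions. It clusters observed state deltas from action-free observation pairs into a small discrete codebook of latent actions, then reuses that codebook as a deterministic transition model for rollouts.

```python
import numpy as np

# Illustrative sketch only: recover a discrete implicit action space from
# action-free (state, next_state) pairs, then roll out with it.

rng = np.random.default_rng(0)

def infer_latent_actions(states, next_states, n_actions=2, iters=20):
    """Cluster observed state deltas into a codebook of latent actions."""
    deltas = next_states - states                 # (N, D) transition effects
    # Farthest-point initialization keeps the initial centroids distinct.
    codebook = [deltas[0]]
    for _ in range(n_actions - 1):
        dist = np.min(
            [np.linalg.norm(deltas - c, axis=1) for c in codebook], axis=0
        )
        codebook.append(deltas[dist.argmax()])
    codebook = np.stack(codebook)
    for _ in range(iters):                        # plain k-means refinement
        d = np.linalg.norm(deltas[:, None] - codebook[None], axis=-1)
        assign = d.argmin(axis=1)
        for k in range(n_actions):
            if np.any(assign == k):
                codebook[k] = deltas[assign == k].mean(axis=0)
    return assign, codebook

def rollout(state, latent_actions, codebook):
    """Deterministic transition model: next state = state + effect[action]."""
    traj = [state]
    for a in latent_actions:
        traj.append(traj[-1] + codebook[a])
    return np.stack(traj)

# Synthetic action-free data: trajectories generated by two hidden actions.
true_effects = np.array([[1.0, 0.0], [0.0, -1.0]])
states = rng.normal(size=(500, 2))
hidden = rng.integers(0, 2, size=500)
next_states = states + true_effects[hidden]

assign, codebook = infer_latent_actions(states, next_states, n_actions=2)
traj = rollout(np.zeros(2), [0, 1, 0], codebook)
```

In this toy setting the codebook recovers the two hidden transition effects exactly; in the video domain, the analogous step would operate on learned frame embeddings and a stochastic transition head rather than raw state vectors, which is precisely where the reliability issues discussed above arise.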