Abstract: The ability to understand physical dynamics is essential to learning agents acting in the world. This paper presents Counterfactual World Modeling (CWM), a candidate pure vision foundational model for physical dynamics understanding. CWM rests on three basic concepts. First, we propose a simple and powerful temporally-factored masking policy for masked prediction of video data, which encourages the model to learn disentangled representations of scene appearance and dynamics. Second, as a result of the factoring, CWM can generate counterfactual next-frame predictions by manipulating a few patch embeddings to exert meaningful control over scene dynamics. Third, the counterfactual modeling capability enables the design of counterfactual queries that extract vision structures similar to keypoints, optical flows, and segmentations, which are useful for dynamics understanding. We show that zero-shot readouts of these structures extracted by the counterfactual queries attain performance competitive with prior methods on real-world datasets. Finally, we demonstrate that CWM achieves state-of-the-art performance on the challenging Physion benchmark for evaluating physical dynamics understanding.
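To make the idea of a temporally-factored masking policy concrete, the following is a minimal sketch and not the authors' implementation. It assumes, since the abstract does not specify it, that earlier frames are left fully visible while only a handful of patches of the final frame are revealed to the predictor. The function name `temporally_factored_mask` and its parameters are hypothetical.

```python
import random


def temporally_factored_mask(num_frames, patches_per_frame, visible_in_last, seed=0):
    """Return one list of booleans per frame; True means the patch is hidden."""
    rng = random.Random(seed)
    mask = []
    for t in range(num_frames):
        if t < num_frames - 1:
            # Assumption: earlier frames are fully visible (they carry appearance).
            mask.append([False] * patches_per_frame)
        else:
            # Assumption: only a few patches of the last frame are revealed,
            # so those patches must carry the information about dynamics.
            visible = set(rng.sample(range(patches_per_frame), visible_in_last))
            mask.append([i not in visible for i in range(patches_per_frame)])
    return mask


# Two frames of 16 patches each, with 2 patches of the second frame revealed.
mask = temporally_factored_mask(num_frames=2, patches_per_frame=16, visible_in_last=2)
print(sum(mask[0]), sum(mask[1]))  # number of hidden patches per frame
```

Under this assumed policy, the asymmetry between frames is what separates appearance (fully given by the first frame) from dynamics (concentrated in the few revealed patches), which is the factoring the abstract refers to.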
Abstract: Humans are interactive agents driven to seek out situations with interesting physical dynamics. Here we formalize the functional form of physical intrinsic motivation. We first collect ratings of how interesting humans find a variety of physics scenarios. We then model human interestingness responses by implementing various hypotheses of intrinsic motivation, ranging from models that rely on simple scene features to models that depend on forward physics prediction. We find that the single best predictor of human responses is adversarial reward, a model derived from physical prediction loss. We also find that simple scene feature models do not generalize their prediction of human responses across all scenarios. Finally, linearly combining the adversarial model with the number of collisions in a scene leads to the greatest improvement in predictivity of human responses, suggesting humans are driven towards scenarios that result in high information gain and physical activity.
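The linear combination of the adversarial model with the collision count can be illustrated with an ordinary least-squares fit. This is a hedged sketch only: the abstract does not say how the combination was fitted, the helper `fit_linear` is hypothetical, and the numbers below are synthetic placeholders generated from a known linear rule, not data from the study.

```python
def fit_linear(features, targets):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    n = len(rows[0])
    # Build A = X^T X and b = X^T y.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for k in range(col + 1, n):
            factor = a[k][col] / a[col][col]
            for j in range(col, n):
                a[k][j] -= factor * a[col][j]
            b[k] -= factor * b[col]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(a[i][j] * w[j] for j in range(i + 1, n))) / a[i][i]
    return w


# Synthetic placeholders: (adversarial reward, number of collisions) per scenario.
features = [(0.1, 0), (0.4, 2), (0.9, 1), (0.3, 5), (0.7, 3)]
# Synthetic "ratings" generated from a known rule so the fit can be checked.
ratings = [1.0 + 2.0 * adv + 0.5 * col for adv, col in features]

weights = fit_linear(features, ratings)  # [intercept, w_adversarial, w_collisions]
print([round(w, 6) for w in weights])
```

With real ratings, the fitted weights would indicate how much each predictor contributes, and predictivity would need to be assessed on held-out scenarios rather than on the data used for fitting.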
Abstract: Traditional machine learning applications, such as optical character recognition, arose from the inability to explicitly program a computer to perform a routine task. In this context, learning algorithms usually derive a model exclusively from the evidence present in a massive dataset. Yet in some scientific disciplines, obtaining an abundance of data is an impractical luxury; however, there is an explicit model of the domain based upon previous scientific discoveries. Here we introduce a new approach to machine learning that is able to leverage prior scientific discoveries in order to improve generalizability over a scientific model. We show its efficacy in predicting the entire energy spectrum of a Hamiltonian on a superconducting quantum device, a key task in present quantum computer calibration. Our accuracy surpasses the current state-of-the-art by over $20\%$. Our approach thus demonstrates how artificial intelligence can be further enhanced by "standing on the shoulders of giants."