A Path Towards Autonomous Machine Intelligence
This paper proposes an architecture for autonomous intelligent agents, combining configurable predictive world models, intrinsic motivation, and hierarchical joint embedding architectures trained with self-supervised learning to achieve human-like learning and reasoning capabilities.
Abstract This paper proposes an architecture and training paradigms for autonomous intelligent agents combining world models, intrinsic motivation, and hierarchical self-supervised learning.
Prologue This position paper outlines a vision for intelligent machines that learn like animals and humans, driven by intrinsic objectives, and assembles existing ideas into a coherent proposal addressing future challenges.
Introduction Current AI systems fall short of human learning abilities because they cannot learn world models, reason in ways compatible with gradient-based learning, or represent information hierarchically, necessitating research into these three core challenges.
Learning World Models Animals learn common sense through observation and minimal interaction, forming world models that enable prediction, reasoning, and planning, a capability crucial for AI development to overcome the limitations of current data-intensive learning methods.
Hierarchies of Models Humans and animals acquire knowledge hierarchically, starting with basic concepts like dimensionality and object permanence, and building towards intuitive physics and social knowledge, suggesting a single, configurable world model engine is more efficient than task-specific models.
Figure 1: Infant Concept Acquisition Infants acquire concepts hierarchically, starting with fundamental notions like object permanence and progressing to more abstract ideas like intuitive physics, supporting the hypothesis of a single, adaptable world model.
A Model Architecture for Autonomous Intelligence The proposed architecture for autonomous agents features a configurator module, a hierarchical perception module, a world model for prediction and uncertainty representation, a cost module combining immutable intrinsic motivation with learned value estimation, a short-term memory, and an actor for action generation.
Figure 2: Autonomous Intelligence Architecture This architecture for autonomous intelligence comprises interconnected, differentiable modules including perception, world model, cost, memory, and actor, all orchestrated by a configurator for task-specific adaptation.
Figure 3: Mode-1 Perception-Action Episode Mode-1, analogous to System 1 thinking, describes a reactive perception-action loop where an actor directly generates actions based on perceived states, with optional world model updates.
Typical Perception-Action Loops The proposed model operates in two modes: Mode-1 for reactive behavior and Mode-2 for reasoning and planning using the world model, akin to Kahneman's System 1 and System 2 respectively.
Mode-2: Reasoning and Planning Using the World Model Mode-2 involves a perception-action loop where an actor proposes action sequences, the world model simulates outcomes, the cost module evaluates them, and the actor refines the sequence through gradient-based planning to minimize estimated future cost.
Mode-2 Perception-Action Episode Mode-2 perception-action episodes involve estimating world state, predicting future states via a world model, and optimizing an action sequence to minimize total energy, analogous to model-predictive control.
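To make the planning loop concrete, the sketch below optimizes a short action sequence by gradient descent on the total cost predicted by a world model. Everything here is an illustrative assumption rather than a detail from the paper: the world model is the trivial dynamics s' = s + a, the intrinsic cost is squared distance to a goal, and gradients are taken by finite differences as a stand-in for backpropagation.

```python
# Toy sketch of Mode-2 planning: refine an action sequence to minimize the
# total cost predicted by a (here trivial) world model.

def rollout(s0, actions):
    """Predict future states with the toy world model s' = s + a."""
    states, s = [], s0
    for a in actions:
        s = s + a                       # world-model prediction
        states.append(s)
    return states

def total_cost(s0, actions, goal):
    """Intrinsic cost: squared distance of each predicted state to the goal."""
    return sum((s - goal) ** 2 for s in rollout(s0, actions))

def plan(s0, goal, horizon=5, steps=200, lr=0.05, eps=1e-4):
    """Gradient-based planning via finite differences (stand-in for backprop)."""
    actions = [0.0] * horizon
    for _ in range(steps):
        for t in range(horizon):
            bumped = list(actions)
            bumped[t] += eps
            g = (total_cost(s0, bumped, goal) - total_cost(s0, actions, goal)) / eps
            actions[t] -= lr * g        # descend on the estimated future cost
    return actions

actions = plan(s0=0.0, goal=1.0)
```

With these toy dynamics the optimized sequence drives the predicted final state to the goal, which is the essence of the model-predictive-control analogy drawn above.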
Model-Predictive Control and Stochasticity Following action execution, states and costs are stored for critic training, mirroring model-predictive control where learned world models and cost functions are central, and acknowledging that real-world stochasticity requires accounting for multiple potential future states.
Training a Reactive Policy Module A reactive policy module is trained to approximate Mode-2 optimized actions by minimizing the divergence between the module's output and the optimal action, enabling faster reactive or planning-accelerated behavior.
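A minimal sketch of this distillation step, under toy assumptions: on a scalar task the Mode-2 planner's optimal action happens to be goal - s, and a linear policy is fit to those planned actions by squared-error SGD. The task, policy form, and hyperparameters are all hypothetical.

```python
# Sketch of training a reactive (Mode-1) policy to imitate Mode-2 outputs.
import random

random.seed(0)
goal = 1.0
# Imitation dataset: (state, action the planner would have produced).
data = [(s, goal - s) for s in [random.uniform(-2, 2) for _ in range(100)]]

w, b = 0.0, 0.0                      # linear policy: pi(s) = w*s + b
lr = 0.05
for _ in range(200):                 # SGD on the squared imitation error
    for s, a_star in data:
        err = (w * s + b) - a_star   # divergence from the planned action
        w -= lr * err * s
        b -= lr * err

# The distilled policy now reproduces the planner's action reactively,
# without running the expensive Mode-2 optimization at inference time.
```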
Mode-1 vs. Mode-2 Operation and Policy Training Mode-2 operation is computationally intensive, focusing on one task, while Mode-1 is less demanding and can utilize trained policy modules for reactive action generation or to propose initial sequences for Mode-2.
Reasoning as Simulation and Optimization Mode-2 reasoning is framed as simulation-based planning and energy optimization, extending beyond traditional AI reasoning paradigms to include simulation and analogy.
Cost Module Components and Intrinsic Drives The cost module combines an immutable intrinsic cost with a trainable critic, where submodules and configurable weights specify behavioral drives, analogous to biological emotional and motivational systems.
Specifying AI Agent Behavior AI agent behavior can be specified through programmed behaviors, objective functions, direct supervision, or imitation learning, with objective-based approaches offering greater simplicity and adaptability.
Critic Training Data Critic training utilizes triplets of (time, state, intrinsic energy) stored in short-term memory, where the critic learns to predict future intrinsic energies from past states.
Critic's Role in Predicting Future Energy The critic predicts future intrinsic energy values using stored state-energy pairs from short-term memory, optimizing its parameters to minimize prediction error, akin to reinforcement learning critics.
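The sketch below illustrates this training scheme under toy assumptions: an episode is stored in short-term memory as (time, state, intrinsic energy) triplets, and a linear critic is regressed onto the intrinsic energy observed at the next step. The episode, the hidden rule e(t+1) = 2*s(t), and the linear critic are all illustrative, not from the paper.

```python
# Sketch of critic training from short-term memory: the critic learns to
# predict upcoming intrinsic energy from the current state.

# Toy episode; intrinsic energy at step t happens to equal 2 * previous state.
states = [0.5, 1.0, -0.5, 2.0, 1.5, -1.0, 0.0, 0.8]
memory = [(t, states[t], 2.0 * states[t - 1] if t > 0 else 0.0)
          for t in range(len(states))]

w = 0.0                                   # linear critic: v(s) = w * s
lr = 0.1
for _ in range(500):
    for t, s, _ in memory[:-1]:
        target = memory[t + 1][2]         # intrinsic energy at the next step
        err = w * s - target              # critic's prediction error
        w -= lr * err * s                 # squared-error gradient step

# The trained critic anticipates next-step intrinsic energy: w recovers the
# hidden factor of 2 in the toy dynamics.
```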
Short-Term Memory Implementation Short-term memory is implemented as a key-value memory network, enabling soft associative retrieval and interpolation, with potential for one-shot learning and end-to-end differentiability.
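A minimal sketch of the soft associative retrieval this describes: a query is compared to every stored key, the similarities are turned into softmax weights, and the returned value interpolates between stored values. Scalar keys and a Gaussian similarity kernel are simplifying assumptions.

```python
# Sketch of key-value memory with soft (differentiable) retrieval.
import math

keys   = [0.0, 1.0, 2.0]       # stored keys (toy scalars)
values = [10.0, 20.0, 30.0]    # associated values

def retrieve(query, beta=4.0):
    """Soft retrieval: softmax weights from key similarity, then interpolate."""
    scores = [-beta * (query - k) ** 2 for k in keys]   # similarity kernel
    m = max(scores)
    w = [math.exp(s - m) for s in scores]               # numerically stable softmax
    z = sum(w)
    return sum(wi / z * v for wi, v in zip(w, values))
```

Because every operation is smooth, gradients can flow through retrieval into whatever produces the query, which is what makes the memory usable inside an end-to-end differentiable architecture.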
World Model Training Challenges and SSL Training world models, especially those handling multiple predictions and diverse timescales, is a key AI challenge, addressed through Self-Supervised Learning (SSL) focusing on pattern completion and representing multi-modal dependencies.
Hierarchical Concept Acquisition via SSL SSL on video data can lead to hierarchical acquisition of abstract concepts, from basic features like edges to complex physics and object permanence, by learning predictive relationships across different representational levels.
Latent-Variable Energy-Based Model (LVEBM) A Latent-Variable Energy-Based Model (LVEBM) uses latent variables to parameterize relationships between inputs and compatible outputs, aiding in compatibility assessment by inferring optimal latent parameters.
Latent Variables for Multi-Modal Predictions Latent variables are crucial for representing information about future outcomes not directly predictable from past observations, enabling models to capture multi-modal dependencies and uncertainty in predictions.
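The role of the latent variable can be shown in a few lines. In this toy sketch (bimodal dynamics y = x + z with z in {-1, +1} are an assumption), inference minimizes the energy over z, so both plausible futures get low energy while an averaged prediction does not: this is exactly the multi-modality an unconditional predictor cannot capture.

```python
# Sketch of a latent-variable EBM: z parameterizes which compatible outcome
# is being explained; inference minimizes the energy over z.

def energy(x, y, z):
    """E(x, y, z): how badly y matches the prediction x + z."""
    return (y - (x + z)) ** 2

def free_energy(x, y, candidates=(-1.0, 1.0)):
    """F(x, y) = min_z E(x, y, z): compatibility after inferring the best z."""
    return min(energy(x, y, z) for z in candidates)

# Both futures x+1 and x-1 are assigned zero energy; the mean outcome y = x,
# which never actually occurs, is assigned high energy.
```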
Energy-Based Model (EBM) Training Energy-Based Models (EBMs) are trained to shape an energy function, assigning low energy to compatible data pairs and higher energies to incompatible ones, requiring careful architecture design to avoid collapse.
EBM Architectures and Collapse Risk EBM architectures vary in their susceptibility to collapse, with deterministic models being safe and latent-variable or auto-encoder models requiring careful design to prevent the energy landscape from becoming too flat.
Architectures and Collapse Susceptibility Deterministic architectures avoid collapse, while non-deterministic and auto-encoder architectures can collapse if not properly constrained, and simple joint embedding architectures collapse if encoders ignore inputs.
JEPA Energy Minimization and Prediction JEPA minimizes energy by predicting within representation space, leveraging encoder invariance or a latent variable to handle multiple possible outputs without predicting every detail.
Latent Variable for Predictive Information A latent variable allows the predictor to capture information not present in the input representation, enabling predictions of different outcomes based on contextual cues.
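The prediction-in-representation-space idea can be sketched as follows. The encoders, predictor, and dynamics are all toy assumptions: the encoder keeps only magnitude and discards sign, so a detail that is genuinely unpredictable simply never has to be predicted.

```python
# Sketch of JEPA's key move: the predictor matches the target's
# representation s_y, not the raw target y.

def enc(obs):
    """Toy encoder: keeps magnitude, discards the (unpredictable) sign."""
    return abs(obs)

def predictor(s_x):
    """Toy predictor in representation space: magnitude doubles each step."""
    return 2.0 * s_x

def jepa_energy(x, y):
    """D(s_y, Pred(s_x)): prediction error measured in representation space."""
    return (enc(y) - predictor(enc(x))) ** 2

# From x = 1, both y = +2 and y = -2 are zero-energy futures: the sign was
# unpredictable, and the representation simply does not carry it.
```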
JEPA for Learning World Models Non-contrastively trained JEPAs learn abstract, predictable world models by eliminating or encoding unpredictable details, enabling hierarchical predictions at multiple time scales.
JEPA Trainability and Criteria JEPAs are trained non-contrastively by maximizing information in representations, ensuring predictability, and minimizing latent variable information, preventing informational collapse.
VICReg Method for Representation Learning VICReg maximizes the information content of representations by mapping them to a higher-dimensional space and driving their covariance matrix towards the identity, which decorrelates the components and makes them somewhat independent.
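The two regularization terms this describes can be computed directly on a batch of embeddings, as in the sketch below: a hinge keeps each dimension's standard deviation above a target (so information is preserved), and the squared off-diagonal covariances are penalized (so components decorrelate). The tiny batch and unit coefficients are illustrative assumptions.

```python
# Sketch of VICReg's variance and covariance regularization terms.
import math

batch = [[1.0, 0.2], [0.0, -0.1], [-1.0, 0.4], [0.5, -0.5]]  # 4 embeddings, dim 2
n, d = len(batch), len(batch[0])
mean = [sum(row[j] for row in batch) / n for j in range(d)]
centered = [[row[j] - mean[j] for j in range(d)] for row in batch]

# Variance term: hinge(1 - std_j), averaged over dimensions, resists collapse.
stds = [math.sqrt(sum(c[j] ** 2 for c in centered) / (n - 1)) for j in range(d)]
var_loss = sum(max(0.0, 1.0 - s) for s in stds) / d

# Covariance term: squared off-diagonal covariance entries, drives the
# covariance matrix towards a diagonal (identity-like) shape.
cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
       for i in range(d)]
cov_loss = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d
```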
VICReg for Representation Prediction VICReg's representation prediction error encourages invariant representations, and minimizing latent variable information prevents collapse, enabling JEPA to learn predictive world models.
JEPA Training with Non-Contrastive Methods Non-contrastive methods train JEPAs efficiently by regularizing the volume of low-energy space through four criteria: maximizing the information content of the source representation, maximizing the information content of the target representation, making the target representation predictable from the source, and minimizing the information content of the latent variable.
JEPA Principles and Non-Contrastive Training JEPAs are trained non-contrastively to maximize representation information, ensure predictability, and minimize latent variable information, avoiding the curse of dimensionality that afflicts contrastive methods.
Hierarchical Prediction with JEPA JEPAs learn abstract representations for hierarchical, multi-scale predictions by eliminating unpredictable details and enabling coarse, long-term forecasts.
Multilevel World State Prediction Intelligent behavior requires representing world states at multiple abstraction levels, enabling task decomposition and prediction of trajectories, routes, and arrival times.
Hierarchical Planning Hierarchical planning leverages multi-scale world models by defining high-level objectives and decomposing them into lower-level subgoals, which are then optimized through action sequences.
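A two-level toy sketch of this decomposition: a high-level step splits the task into subgoals, and a low-level loop plans fine-grained actions to reach each subgoal in turn. The scalar state, the halfway subgoal rule, and the greedy low-level controller are all hypothetical simplifications.

```python
# Sketch of two-level hierarchical planning: subgoals above, actions below.

def low_level_plan(s, subgoal, step=0.25):
    """Greedy low-level controller: move toward the subgoal in small steps."""
    actions = []
    while abs(subgoal - s) > 1e-9:
        a = max(-step, min(step, subgoal - s))   # clamp to the max step size
        actions.append(a)
        s += a
    return actions, s

def hierarchical_plan(s0, goal):
    """High level: split the task into subgoals; low level: plan each leg."""
    subgoals = [s0 + (goal - s0) * f for f in (0.5, 1.0)]
    plan, s = [], s0
    for g in subgoals:
        leg, s = low_level_plan(s, g)
        plan.append(leg)
    return plan, s

plan, final_state = hierarchical_plan(0.0, 1.0)
```

The point of the decomposition is that the high level never reasons about individual actions: it only manipulates subgoals, while each leg is optimized locally.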
Handling Uncertainty The world model handles various types of uncertainty by using latent variables that can be optimized, predicted, or sampled, allowing for robust planning through directed search and exploration of plausible outcomes.
World Model Architecture Details The world model architecture should incorporate gating or dynamic routing, utilizing feature vector displacement for low-level predictions and transformer architectures for higher-level object interactions.
Separating World and Ego Models A separate, potentially deterministic ego-model for the agent complements the world model's handling of unpredictability, and can serve as a template for modeling other agents.
Data Streams for Learning Agents can learn about the world through passive observation, active foveation, passive agency observation, active egomotion, and active agency, with the latter modes enabling more efficient and active information gathering.
Actor Module Functions The actor module infers optimal action sequences, produces latent variable configurations for uncertainty, and trains policy networks, acting as an optimizer and explorer through gradient-based methods or alternative planning techniques.
Configurator Module Functions The configurator acts as the central controller, modulating parameters and connection graphs of other modules for hardware and knowledge reuse, priming perception, and setting subgoals for the cost module.
Related Work: World Models and Planning Prior work in optimal control, reinforcement learning, and robotics has explored learned world models, model-predictive control, and hierarchical planning, with recent advances focusing on sample efficiency and learning from visual input.
Related Work: Predictive Models Various generative and non-generative models, including GANs, VAEs, and CPC, have been applied to video prediction and control tasks, with ongoing research addressing uncertainty, representation learning, and the need for supervised pre-training.
Related Work on Self-Supervised Learning and Transformers Recent works apply non-contrastive self-supervised learning (SSL) to robotics control and use transformers for state trajectory prediction, drawing inspiration from advancements in speech recognition and car trajectory prediction.
Energy-Based Models and Joint Embedding Architectures Energy-Based Models (EBMs) and Joint Embedding Architectures (JEAs), both trained contrastively and non-contrastively, have a long history in machine learning, with recent SSL approaches causing a surge in their application.
Cognitive Science and World Models Human learning, with its ability to grasp abstract concepts and plan complex actions, inspires the development of predictive world models in machines, drawing parallels with concepts like intuition, planning, and consciousness.
Challenges in Implementing the Cognitive Architecture Significant challenges exist in implementing and training the proposed Hierarchical JEPA architecture, including regularizing latent variables, optimizing action sequences, and specifying the precise architecture of its modules.
Parallels Between Proposed Architecture and Mammalian Brain The proposed architecture's modules have functional counterparts in the mammalian brain, suggesting potential links between its computational mechanisms and cognitive functions like perception, world modeling, reward processing, and executive control.
Common Sense in AI and World Models Unlike AI systems, animals possess common sense derived from world interaction, suggesting that grounded intelligence through configurable world models, potentially emergent from SSL applied to H-JEPA, could be the substrate for machine common sense.
Limitations of Current AI Approaches to Intelligence Scaling up transformer architectures and relying solely on reinforcement learning or reward are insufficient for human-level AI due to limitations in handling continuous data, representing uncertainty, and performing complex reasoning.
Role of Reinforcement Learning and Intrinsic Costs The proposed architecture plans actions by minimizing differentiable intrinsic costs through a learned world model, making it more akin to optimal control than to traditional reinforcement learning; reward-like signals play only a minor role, and the world model itself is trained from observation rather than from reward.
Reasoning and Search in the Proposed Architecture Reasoning in the proposed architecture involves energy minimization or constraint satisfaction by the actor, utilizing gradient-based or gradient-free search methods depending on the continuity and cardinality of the action space.
Acknowledgements The ideas presented in this paper are a distillation of years of interactions with numerous colleagues, with specific individuals acknowledged for their significant contributions and comments on the manuscript.
References This section lists numerous academic papers and books relevant to machine learning, robotics, cognitive science, and related fields.
Figure 18: Symbols used in architectural diagrams Architectural diagrams use symbols for variables, energy terms, and deterministic functions to represent models, with filled circles for observed variables and hollow circles for latent variables.
Figure 19: Amortized Inference with an EBM Amortized inference uses an encoder to approximate the latent variable that minimizes energy in an energy-based model, reducing computational cost.
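The idea behind Figure 19 can be sketched in a few lines. Under toy assumptions (a quadratic energy with a regularizer on z, a linear encoder), explicit inference runs gradient descent on z for every input, while the amortized encoder is trained once to output the minimizing z directly.

```python
# Sketch of amortized inference: an encoder learns to imitate explicit
# energy minimization over the latent variable.

lam = 0.5                          # regularizer weight on z (assumed)

def E(y, z):
    """Energy with a quadratic regularizer: (y - z)^2 + lam * z^2."""
    return (y - z) ** 2 + lam * z ** 2

def z_star(y, steps=100, lr=0.1):
    """Explicit inference: gradient descent on E over z, per input."""
    z = 0.0
    for _ in range(steps):
        z -= lr * (2 * (z - y) + 2 * lam * z)
    return z

# Train a linear encoder z = a * y to imitate explicit inference.
a, lr = 0.0, 0.05
for _ in range(200):
    for y in [-2.0, -1.0, 0.5, 1.0, 2.0]:
        a -= lr * (a * y - z_star(y)) * y

# One-shot inference: the encoder recovers the analytic minimizer 1/(1+lam).
```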
Figure 20: Amortized Inference with a Regularized Generative Latent-Variable EBM architecture A regularized generative latent-variable EBM architecture uses an encoder for amortized inference, where a regularizer limits information transfer from observed variables to latent variables to prevent collapse.
Appendix: Loss functions for Contrastive Training of EBM Contrastive training methods for EBMs utilize various strategies for selecting contrastive samples and define loss functions, categorized into exact/approximate maximum likelihood and methods not interpretable within a probabilistic framework.
Table 1: List of contrastive methods and loss functions Table 1 categorizes contrastive methods for training energy-based models, detailing their strategies for generating contrastive samples and their corresponding loss functions, which can be exact/approximate maximum likelihood, or based on other principles like hinge loss, GANs, and denoising auto-encoders.
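One of the hinge-style losses the table mentions can be sketched concretely: the loss is zero once the energy of an observed pair sits at least a margin below that of a contrastive sample, and otherwise pushes the two energies apart. The scalar energy model, the sample pairs, and the stopping behavior are all toy assumptions.

```python
# Sketch of a margin/hinge contrastive loss for EBM training.

def hinge_loss(e_pos, e_neg, margin=1.0):
    """Penalize unless the positive energy is `margin` below the negative."""
    return max(0.0, margin + e_pos - e_neg)

# Toy energy: E(x, y) = w * (x - y)^2, with a single trainable scale w.
w, lr = 0.0, 0.1
positives = [(0.0, 0.1), (1.0, 0.9)]     # compatible pairs (want low energy)
negatives = [(0.0, 2.0), (1.0, -1.5)]    # contrastive pairs (want high energy)

for _ in range(100):
    for (xp, yp), (xn, yn) in zip(positives, negatives):
        e_pos, e_neg = w * (xp - yp) ** 2, w * (xn - yn) ** 2
        if hinge_loss(e_pos, e_neg) > 0.0:      # only active hinges update w
            w -= lr * ((xp - yp) ** 2 - (xn - yn) ** 2)

# Training stops once every margin is satisfied, which is exactly the
# "shape the energy landscape, then leave it alone" behavior of hinge losses.
```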