Francesco Belardinelli

Imperial College London

Multi-Agent Q-Learning Dynamics in Random Networks: Convergence due to Exploration and Sparsity

Mar 13, 2025

Aamal Hussain, Dan Leonte, Francesco Belardinelli, Raphael Huser, Dario Paccagnan

Abstract: Beyond specific settings, many multi-agent learning algorithms fail to converge to an equilibrium solution, and instead display complex, non-stationary behaviours such as recurrent or chaotic orbits. In fact, recent literature suggests that such complex behaviours are likely to occur when the number of agents increases. In this paper, we study Q-learning dynamics in network polymatrix games where the network structure is drawn from classical random graph models. In particular, we focus on the Erdős-Rényi model, a well-studied model for social networks, and the Stochastic Block Model, which generalizes the above by accounting for community structures within the network. In each setting, we establish sufficient conditions under which the agents' joint strategies converge to a unique equilibrium. We investigate how this condition depends on the exploration rates, payoff matrices and, crucially, the sparsity of the network. Finally, we validate our theoretical findings through numerical simulations and demonstrate that convergence can be reliably achieved in many-agent systems, provided network sparsity is controlled.
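
To make the setting in this abstract concrete, here is a minimal sketch, not the authors' model or code. It samples an Erdős-Rényi graph and runs a discrete-time, expected-reward variant of smoothed (Boltzmann) Q-learning in which every edge carries the same two-action payoff matrix. The function names, the shared payoff matrix, the step size `alpha` and the exploration `temperature` are all illustrative assumptions.

```python
import math
import random


def sample_erdos_renyi(n, p, rng):
    """Return adjacency sets of an undirected G(n, p) random graph."""
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # each edge is present independently with probability p
                adj[i].add(j)
                adj[j].add(i)
    return adj


def softmax(q_values, temperature):
    """Boltzmann strategy; a higher temperature means more exploration."""
    top = max(q_values)
    weights = [math.exp((v - top) / temperature) for v in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def q_learning_step(q, strategies, adj, payoff, alpha, temperature):
    """One synchronous update of every agent's Q-values and mixed strategy.

    Assumption: each agent plays the same matrix `payoff` against every
    neighbour and sees the expected reward of each of its actions.
    """
    n_actions = len(payoff)
    new_q = []
    for i, neighbours in enumerate(adj):
        rewards = [
            sum(payoff[a][b] * strategies[j][b]
                for j in neighbours for b in range(n_actions))
            for a in range(n_actions)
        ]
        new_q.append([(1 - alpha) * q[i][a] + alpha * rewards[a]
                      for a in range(n_actions)])
    new_strategies = [softmax(qi, temperature) for qi in new_q]
    return new_q, new_strategies


if __name__ == "__main__":
    rng = random.Random(0)
    adj = sample_erdos_renyi(10, 0.3, rng)
    payoff = [[1.0, -1.0], [-1.0, 1.0]]  # hypothetical payoff matrix
    q = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(10)]
    strategies = [softmax(qi, 1.0) for qi in q]
    for _ in range(200):
        q, strategies = q_learning_step(q, strategies, adj, payoff, 0.1, 1.0)
    print(strategies[0])
```

Raising `temperature` increases exploration and lowering `p` makes the sampled graph sparser, which are the two levers the abstract points to. The sketch does not check the paper's convergence conditions.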

Probabilistic Shielding for Safe Reinforcement Learning

Mar 09, 2025

Edwin Hamel-De le Court, Francesco Belardinelli, Alex W. Goodall

Abstract: In real-life scenarios, a Reinforcement Learning (RL) agent aiming to maximise its reward must often also behave in a safe manner, including at training time. Thus, much attention in recent years has been given to Safe RL, where an agent aims to learn an optimal policy among all policies that satisfy a given safety constraint. However, strict safety guarantees are often provided through approaches based on linear programming, and thus have limited scaling. In this paper, we present a new, scalable method, which enjoys strict formal guarantees for Safe RL, in the case where the safety dynamics of the Markov Decision Process (MDP) are known, and safety is defined as an undiscounted probabilistic avoidance property. Our approach is based on state-augmentation of the MDP, and on the design of a shield that restricts the actions available to the agent. We show that our approach provides a strict formal safety guarantee that the agent stays safe at training and test time. Furthermore, we demonstrate that our approach is viable in practice through experimental evaluation.

* 13 pages, 3 figures, Conference: AAAI 2025
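
As a rough illustration of what "a shield that restricts the actions available to the agent" can look like when the safety dynamics are known, the sketch below computes, by value iteration, the minimal probability of ever reaching an unsafe state, then filters out actions whose risk exceeds a threshold. This is a simplified threshold filter under assumed names (`transitions`, `unsafe`, `threshold`). It deliberately omits the paper's state-augmentation construction and makes no claim to reproduce its guarantees.

```python
def min_unsafe_probability(transitions, unsafe, sweeps=1000, tol=1e-12):
    """Minimal probability of ever reaching an unsafe state, per state.

    transitions[s][a] is a list of (next_state, probability) pairs and
    must contain an entry for every state, including unsafe ones.
    """
    values = {s: (1.0 if s in unsafe else 0.0) for s in transitions}
    for _ in range(sweeps):
        change = 0.0
        for s, actions in transitions.items():
            if s in unsafe:
                continue  # unsafe states keep probability 1
            best = min(sum(p * values[t] for t, p in outcomes)
                       for outcomes in actions.values())
            change = max(change, abs(best - values[s]))
            values[s] = best
        if change < tol:
            break
    return values


def shielded_actions(state, transitions, values, threshold):
    """Actions whose one-step-then-safest risk stays within the threshold."""
    allowed = []
    for action, outcomes in transitions[state].items():
        risk = sum(p * values[t] for t, p in outcomes)
        if risk <= threshold:
            allowed.append(action)
    return allowed


if __name__ == "__main__":
    # Hypothetical three-state MDP used only for illustration.
    transitions = {
        "start": {"careful": [("safe", 1.0)],
                  "risky": [("safe", 0.8), ("crash", 0.2)]},
        "safe": {"stay": [("safe", 1.0)]},
        "crash": {"stay": [("crash", 1.0)]},
    }
    values = min_unsafe_probability(transitions, unsafe={"crash"})
    print(shielded_actions("start", transitions, values, threshold=0.1))
```

A learning agent would then choose only among the returned actions, both while training and at test time.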

Explainable Reinforcement Learning for Formula One Race Strategy

Jan 07, 2025

Devin Thomas, Junqi Jiang, Avinash Kori, Aaron Russo, Steffen Winkler, Stuart Sale, Joseph McMillan, Francesco Belardinelli, Antonio Rago

Abstract: In Formula One, teams compete to develop their cars and achieve the highest possible finishing position in each race. During a race, however, teams are unable to alter the car, so they must improve their cars' finishing positions via race strategy, i.e. optimising their selection of which tyre compounds to put on the car and when to do so. In this work, we introduce a reinforcement learning model, RSRL (Race Strategy Reinforcement Learning), to control race strategies in simulations, offering a faster alternative to the industry standard of hard-coded and Monte Carlo-based race strategies. Controlling cars with a pace equating to an expected finishing position of P5.5 (where P1 represents first place and P20 is last place), RSRL achieves an average finishing position of P5.33 on our test race, the 2023 Bahrain Grand Prix, outperforming the best baseline of P5.63. We then demonstrate, in a generalisability study, how performance for one track or multiple tracks can be prioritised via training. Further, we supplement model predictions with feature importance, decision tree-based surrogate models, and decision tree counterfactuals towards improving user trust in the model. Finally, we provide illustrations which exemplify our approach in real-world situations, drawing parallels between simulations and reality.

* 9 pages, 6 figures. Copyright ACM 2025. This is the authors' version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record will be published in SAC 2025, http://dx.doi.org/10.1145/3672608.3707766

Measuring Goal-Directedness

Dec 06, 2024

Matt MacDermott, James Fox, Francesco Belardinelli, Tom Everitt

Abstract: We define maximum entropy goal-directedness (MEG), a formal measure of goal-directedness in causal models and Markov decision processes, and give algorithms for computing it. Measuring goal-directedness is important, as it is a critical element of many concerns about harm from AI. It is also of philosophical interest, as goal-directedness is a key aspect of agency. MEG is based on an adaptation of the maximum causal entropy framework used in inverse reinforcement learning. It can measure goal-directedness with respect to a known utility function, a hypothesis class of utility functions, or a set of random variables. We prove that MEG satisfies several desiderata and demonstrate our algorithms with small-scale experiments.

* Accepted to the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

The Reasons that Agents Act: Intention and Instrumental Goals

Feb 15, 2024

Francis Rhys Ward, Matt MacDermott, Francesco Belardinelli, Francesca Toni, Tom Everitt

Abstract: Intention is an important and challenging concept in AI. It is important because it underlies many other concepts we care about, such as agency, manipulation, legal responsibility, and blame. However, ascribing intent to AI systems is contentious, and there is no universally accepted theory of intention applicable to AI agents. We operationalise the intention with which an agent acts, relating to the reasons it chooses its decision. We introduce a formal definition of intention in structural causal influence models, grounded in the philosophy literature on intent and applicable to real-world machine learning systems. Through a number of examples and results, we show that our definition captures the intuitive notion of intent and satisfies desiderata set out by past work. In addition, we show how our definition relates to past concepts, including actual causality, and the notion of instrumental goals, which is a core idea in the literature on safe AI agents. Finally, we demonstrate how our definition can be used to infer the intentions of reinforcement learning agents and language models from their behaviour.

* AAMAS24

Leveraging Approximate Model-based Shielding for Probabilistic Safety Guarantees in Continuous Environments

Feb 01, 2024

Alexander W. Goodall, Francesco Belardinelli

Abstract: Shielding is a popular technique for achieving safe reinforcement learning (RL). However, classical shielding approaches come with quite restrictive assumptions, making them difficult to deploy in complex environments, particularly those with continuous state or action spaces. In this paper, we extend the more versatile approximate model-based shielding (AMBS) framework to the continuous setting. In particular, we use Safety Gym as our test-bed, allowing for a more direct comparison of AMBS with popular constrained RL algorithms. We also provide strong probabilistic safety guarantees for the continuous setting. In addition, we propose two novel penalty techniques that directly modify the policy gradient, which empirically provide more stable convergence in our experiments.

* Accepted as an Extended Abstract at AAMAS 2024

Stability of Multi-Agent Learning in Competitive Networks: Delaying the Onset of Chaos

Dec 19, 2023

Aamal Hussain, Francesco Belardinelli

Abstract: The behaviour of multi-agent learning in competitive network games is often studied within the context of zero-sum games, in which convergence guarantees may be obtained. However, outside of this class, learning is known to display complex behaviours, and convergence cannot always be guaranteed. Nonetheless, in order to develop a complete picture of the behaviour of multi-agent learning in competitive settings, the zero-sum assumption must be lifted. Motivated by this, we study the Q-Learning dynamics, a popular model of exploration and exploitation in multi-agent learning, in competitive network games. We determine how the degree of competition, exploration rate and network connectivity impact the convergence of Q-Learning. To study generic competitive games, we parameterise network games in terms of correlations between agent payoffs and study the average behaviour of the Q-Learning dynamics across all games drawn from a choice of this parameter. This statistical approach establishes choices of parameters for which Q-Learning dynamics converge to a stable fixed point. In contrast to previous works, we find that the stability of Q-Learning is explicitly dependent only on the network connectivity rather than the total number of agents. Our experiments validate these findings and show that, under certain network structures, the total number of agents can be increased without increasing the likelihood of unstable or chaotic behaviours.

* AAAI 2024

Honesty Is the Best Policy: Defining and Mitigating AI Deception

Dec 03, 2023

Francis Rhys Ward, Francesco Belardinelli, Francesca Toni, Tom Everitt

Abstract: Deceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the goal of being evaluated as truthful). There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games. We introduce a formal definition of deception in structural causal games, grounded in the philosophy literature, and applicable to real-world machine learning systems. Several examples and results illustrate that our formal definition aligns with the philosophical and commonsense meaning of deception. Our main technical result is to provide graphical criteria for deception. We show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.

* Accepted as a spotlight at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

3vLTL: A Tool to Generate Automata for Three-valued LTL

Nov 16, 2023

Francesco Belardinelli, Angelo Ferrando, Vadim Malvone

Abstract: Multi-valued logics have a long tradition in the literature on system verification, including run-time verification. However, comparatively fewer model-checking tools have been developed for multi-valued specification languages. We present 3vLTL, a tool to generate Büchi automata from formulas in Linear-time Temporal Logic (LTL) interpreted on a three-valued semantics. Given an LTL formula, a set of atomic propositions as the alphabet for the automaton, and a truth value, our procedure generates a Büchi automaton that accepts all the words that assign the chosen truth value to the LTL formula. Owing to the particular type of its output, the tool's result can also be processed seamlessly by third-party libraries. That is, the Büchi automaton can then be used in the context of formal verification to check whether an LTL formula is true, false, or undefined on a given model.

* EPTCS 395, 2023, pp. 180-187
* In Proceedings FMAS 2023, arXiv:2311.08987
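
The abstract does not spell out which three-valued semantics the tool uses. Purely as an illustration, and assuming Kleene's strong three-valued logic for the Boolean connectives (an assumption, not something confirmed by the listing), the truth values true, false and undefined could be combined as follows. The names `TV`, `tv_not`, `tv_and` and `tv_or` are hypothetical and are not part of 3vLTL.

```python
from enum import Enum


class TV(Enum):
    """Three truth values, ordered FALSE < UNDEFINED < TRUE."""
    FALSE = 0
    UNDEFINED = 1
    TRUE = 2


def tv_not(a):
    # Negation swaps TRUE and FALSE and leaves UNDEFINED unchanged.
    return TV(2 - a.value)


def tv_and(a, b):
    # Conjunction is the minimum in the truth ordering.
    return TV(min(a.value, b.value))


def tv_or(a, b):
    # Disjunction is the maximum in the truth ordering.
    return TV(max(a.value, b.value))


if __name__ == "__main__":
    print(tv_and(TV.TRUE, TV.UNDEFINED))
    print(tv_or(TV.TRUE, TV.UNDEFINED))
```

Under this reading, an undefined atomic proposition only makes a formula undefined when the defined parts do not already settle its value.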

Approximate Model-Based Shielding for Safe Reinforcement Learning

Jul 27, 2023

Alexander W. Goodall, Francesco Belardinelli

Abstract: Reinforcement learning (RL) has shown great potential for solving complex tasks in a variety of domains. However, applying RL to safety-critical systems in the real world is not easy, as many algorithms are sample-inefficient and maximising the standard RL objective comes with no guarantees on worst-case performance. In this paper, we propose approximate model-based shielding (AMBS), a principled look-ahead shielding algorithm for verifying the performance of learned RL policies w.r.t. a set of given safety constraints. Our algorithm differs from other shielding approaches in that it does not require prior knowledge of the safety-relevant dynamics of the system. We provide a strong theoretical justification for AMBS and demonstrate superior performance to other safety-aware approaches on a set of Atari games with state-dependent safety-labels.

* Accepted at ECAI 2023 (main technical track)
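
The idea of a look-ahead shield can be sketched generically as follows. This is not the AMBS algorithm itself, only a minimal Monte Carlo illustration under stated assumptions: `model_step` stands in for some approximate (for example learned) simulator, `is_unsafe` for a safety label, and `backup_policy` for a fallback behaviour. All of these names, and the default horizon, sample count and threshold, are hypothetical.

```python
import random


def estimate_violation_probability(state, policy, model_step, is_unsafe,
                                   horizon, samples, rng):
    """Fraction of simulated rollouts that hit an unsafe state within the horizon."""
    violations = 0
    for _ in range(samples):
        s = state
        for _ in range(horizon):
            s = model_step(s, policy(s), rng)  # one step of the approximate model
            if is_unsafe(s):
                violations += 1
                break
    return violations / samples


def shielded_action(state, task_policy, backup_policy, model_step, is_unsafe,
                    horizon=10, samples=200, threshold=0.1, rng=None):
    """Follow the task policy unless its estimated risk exceeds the threshold."""
    rng = rng or random.Random()
    risk = estimate_violation_probability(
        state, task_policy, model_step, is_unsafe, horizon, samples, rng)
    return task_policy(state) if risk <= threshold else backup_policy(state)


if __name__ == "__main__":
    # Hypothetical 1-D world: positions are integers, position >= 3 is unsafe.
    step = lambda s, a, rng: s + a
    action = shielded_action(0, lambda s: 1, lambda s: -1, step,
                             lambda s: s >= 3, horizon=5, samples=20)
    print(action)
```

The quality of such a shield depends on how faithful the approximate model is and on the number of sampled rollouts; the sketch makes no attempt to quantify either.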
