Abstract

Persuasion and deception pose distinct risks in multi-agent settings. In the context of misuse, an AI agent instructed to pursue a harmful goal could use persuasion, deception, and other forms of influence to recruit other AI agents — combining their capabilities and steering their actions toward the harmful goal. We propose an evaluation framework for empirically assessing four inter-agent influence capabilities: persuasion, deception, coercion, and jailbreaking. We evaluate models in dyadic environments using the Inspect platform, complete with realistic tool access and simulated operational consequences. The benchmark enables systematic comparison of influence capability across models and informs both deployment decisions and safety research priorities.

Bio

Qi is an early-career AI safety researcher, currently completing the Cooperative AI Research Fellowship hosted by AI Safety South Africa in Cape Town. She is collaborating with the Cooperative AI Foundation on inter-agent influence evaluations.