Abstract

LLM agents now hold sensitive user data and interact autonomously at scale. But what happens when one agent is adversarial from the start? Unlike an external attacker relying on prompt injection or jailbreaking, an embedded adversary is a trusted participant that exploits cooperative dynamics to extract private information through social channels. We build a benchmark framework grounded in Contextual Integrity theory (Nissenbaum, 2004), operationalized as a 7-factor schema that specifies when sharing is contextually appropriate. Our automated pipeline generates diverse multi-agent scenarios, runs them in Google DeepMind’s Concordia framework, and scores agents on their ability to distinguish information that is appropriate to share for the task from private data that must be protected.

Bio

Omer is a researcher working at the intersection of multi-agent systems and AI safety. His work spans Multi-Agent Reinforcement Learning (MARL) and cooperative AI, investigating how agents learn, generalise, and interact in complex multi-agent settings. Through his Master’s at Stellenbosch University and ongoing research fellowships, he combines theoretical foundations in AI with hands-on implementation, bringing a background in Electrical Engineering and full-stack development to advance safe and robust AI systems.