Make Trust Irrelevant: A Gamer's Take on Agentic AI Safety
I wrote a short position paper arguing that current agentic AI safety failures are the confused deputy problem on repeat. We are handing agents ambient authority and trying to contain it with soft constraints like prompts and userland wrappers. My take: you need hard, reduce-only authority enforced at a real boundary (kernel/control-plane class), not something bypassable from userland. Curious how others are modeling this. What constraints do you think are truly non-negotiable?
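To make "reduce-only" concrete, here is a rough sketch in Python. The names and shape are illustrative only, not the paper's implementation; the invariant is just that a grant can be narrowed, never widened.

    # Illustrative sketch only: an authority grant that can shrink but never grow.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Capability:
        roots: frozenset    # filesystem subtrees the holder may touch
        actions: frozenset  # e.g. frozenset({"read"}) or frozenset({"read", "write"})

        def attenuate(self, roots=None, actions=None):
            # The result is always a subset of the current grant (reduce-only).
            new_roots = self.roots & frozenset(roots) if roots else self.roots
            new_actions = self.actions & frozenset(actions) if actions else self.actions
            return Capability(new_roots, new_actions)

        def allows(self, path, action):
            return action in self.actions and any(path.startswith(r) for r in self.roots)

    # The agent starts from whatever the user delegated and can only narrow it.
    delegated = Capability(frozenset({"/home/me/project"}), frozenset({"read", "write"}))
    read_only = delegated.attenuate(actions={"read"})
    assert not read_only.allows("/home/me/project/main.py", "write")
    assert read_only.allows("/home/me/project/main.py", "read")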
Here are some important differences:
- The players in competitive games don't write code. Coding agents do. When you copy the code outside the sandbox and run it, what permissions does it get?
- Game players usually don't have access to confidential material, so you don't need to prevent them from exfiltrating it.
You're right. Players are in a sandbox and only have access to what they've been granted. The game analogy isn’t about confidential material, it’s about adversarial incentives under fixed mechanics. In games you don’t rely on “good behavior” because players will explore every edge the rules allow.
In agentic systems, the agent often has privileged material by design (API keys, local files, browser cookies, tokens, credentials, docs) plus high-leverage actions (shell, package manager, cloud control planes). That combination is exactly why ambient authority without hard boundaries is dangerous.
The point is threat modeling: "don’t rely on intent, rely on boundaries." The paper argues for reduce-only, fast-revocable authority at a real enforcement boundary, not userland wrappers.
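For the “real enforcement boundary” part, here is roughly the shape I mean, assuming Linux with bubblewrap installed. The flags are standard bwrap options; the wrapper function and bind-mount paths are illustrative and may need adjusting per distro. The enforcement is kernel namespaces, so nothing running inside the process can talk its way around the policy.

    # Hedged sketch: enforcement comes from kernel namespaces (via bwrap),
    # not from a prompt or an in-process wrapper the agent could bypass.
    import subprocess

    def run_tool_sandboxed(cmd, project_dir):
        bwrap = [
            "bwrap",
            "--ro-bind", "/usr", "/usr",     # system binaries, read-only
            "--ro-bind", "/lib", "/lib",     # adjust bind mounts for your distro
            "--proc", "/proc",
            "--dev", "/dev",
            "--bind", project_dir, "/work",  # the only writable path
            "--chdir", "/work",
            "--unshare-net",                 # no network: nothing to exfiltrate to
            "--unshare-pid",
            "--die-with-parent",
        ]
        return subprocess.run(bwrap + cmd, capture_output=True, text=True)

    # e.g. run_tool_sandboxed(["python3", "generated_script.py"], "/home/me/project")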
Was this written with an LLM? If so, please add a note about it at the start of the README.
this is so slop-reeky that the note should reveal the prompt used as well!
People want convenience more than they want security. No one wants permission grants that expire after minutes or hours. Every time the agent is stopped by a permission-grant check, the average user's experience gets a little worse.
I agree that UX is the hard part. The point isn’t “a permission dialog popping up every minute.” It’s “remove standing power.” You can make short-lived authority feel smoother with scoped permits, pre-approved workflows, clear revocation semantics, and defaults that renew narrowly. The non-negotiable part is that authority can be pulled instantly and can never silently widen. Convenience matters, but “always-on admin” is convenience paid for with failure.
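A sketch of the permit side, again illustrative rather than the paper's code: grants expire on their own, revocation is a single switch that takes effect on the next check, and renewal extends time but never scope.

    # Illustrative sketch: short-lived permits that can be revoked instantly
    # and renewed, but whose scope can never silently widen.
    import time
    from dataclasses import dataclass

    @dataclass
    class Permit:
        scope: frozenset    # e.g. frozenset({"repo:read", "ci:trigger"})
        expires_at: float
        revoked: bool = False

    class PermitStore:
        def __init__(self):
            self._permits = {}

        def grant(self, agent_id, scope, ttl_s=900):
            self._permits[agent_id] = Permit(frozenset(scope), time.time() + ttl_s)
            return self._permits[agent_id]

        def check(self, agent_id, action):
            p = self._permits.get(agent_id)
            return (p is not None and not p.revoked
                    and time.time() < p.expires_at and action in p.scope)

        def renew(self, agent_id, ttl_s=900):
            # Renewal extends time only; scope carries over and never widens.
            old = self._permits[agent_id]
            return self.grant(agent_id, old.scope, ttl_s)

        def revoke(self, agent_id):
            # One switch, effective on the very next check.
            if agent_id in self._permits:
                self._permits[agent_id].revoked = True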
> I wrote a short position
> "Reality check"
Hi GPT :)
I thought "surely they wouldn't ...." The issues in the article turned out to be even more blatant. You were right, and you caught it extremely quickly.
I recommend researching sandboxes.