agent security

Three layers of agent safety

model architecture: fundamental limitations of transformer structure
architecture -> LLMs: training data (poisoning), training objective (reward hacking)
LLMs -> prompts: prompt injections, unintended actions, goal scheming

prompt injections

OWASP top 10 for LLM applications…. RAG/Agents are WORSE because humans do not have choice. Web agents, can browse the web and have context poisoning.

evaluation setup

etiologic validity
realistic threat models
systematic evaluations (e.g., obviously anecdotal works)
controlled environments

computer security principles

confidentiality (don’t infiltrate passwords)
integrity (don’t nuke important files)
availability (don’t bring things down)

benign inputs leading to harms

triggering compaction => failures

Unintentional behavior: “unsafe agent behavior that deviations from user intentions from a task”

questions

etiological validity? once you discover failures, how do you argue how much things happen in the wild? coverage? how to make sure you are not doing multiple of thew same thing?
- think about what the attacker can do, and try to cover more of them
- strike a tradeoff between not covering out of domains / bad scenarios
guardrails vs fine tuning
- “defense in depth”
- perhaps the model can be tuned to recognize dangerous situations and then kick the problem down the can