Three layers of agent safety
- model architecture: fundamental limitations of transformer structure
- architecture -> LLMs: training data (poisoning), training objective (reward hacking)
- LLMs -> prompts: prompt injections, unintended actions, goal scheming
prompt injections
OWASP top 10 for LLM applications…. RAG/Agents are WORSE because humans do not have choice. Web agents, can browse the web and have context poisoning.
evaluation setup
- etiologic validity
- realistic threat models
- systematic evaluations (e.g., obviously anecdotal works)
- controlled environments
computer security principles
- confidentiality (don’t infiltrate passwords)
- integrity (don’t nuke important files)
- availability (don’t bring things down)
benign inputs leading to harms
- triggering compaction => failures
Unintentional behavior: “unsafe agent behavior that deviations from user intentions from a task”
questions
- etiological validity? once you discover failures, how do you argue how much things happen in the wild? coverage? how to make sure you are not doing multiple of thew same thing?
- think about what the attacker can do, and try to cover more of them
- strike a tradeoff between not covering out of domains / bad scenarios
- guardrails vs fine tuning
- “defense in depth”
- perhaps the model can be tuned to recognize dangerous situations and then kick the problem down the can
