agent security

Three layers of agent safety

  1. model architecture: fundamental limitations of transformer structure
  2. architecture -> LLMs: training data (poisoning), training objective (reward hacking)
  3. LLMs -> prompts: prompt injections, unintended actions, goal scheming

prompt injections

OWASP top 10 for LLM applications…. RAG/Agents are WORSE because humans do not have choice. Web agents, can browse the web and have context poisoning.

evaluation setup

  1. etiologic validity
  2. realistic threat models
  3. systematic evaluations (e.g., obviously anecdotal works)
  4. controlled environments

computer security principles

  • confidentiality (don’t infiltrate passwords)
  • integrity (don’t nuke important files)
  • availability (don’t bring things down)

benign inputs leading to harms

  • triggering compaction => failures

Unintentional behavior: “unsafe agent behavior that deviations from user intentions from a task”

questions

  • etiological validity? once you discover failures, how do you argue how much things happen in the wild? coverage? how to make sure you are not doing multiple of thew same thing?
    • think about what the attacker can do, and try to cover more of them
    • strike a tradeoff between not covering out of domains / bad scenarios
  • guardrails vs fine tuning
    • “defense in depth”
    • perhaps the model can be tuned to recognize dangerous situations and then kick the problem down the can