Challenge of Making Agents
Agents are not new (Riedl and Amant 2002), but newer agents can be powered by LLMs/VLMs, meaning we now use language for reasoning and communication.
Sequentiality is hard
- what is the context/motivation?
- how do you transfer across contexts?
- how do you plan?
Evaluation
- Different from previous NLP benchmarks: we are no longer worried about language modeling itself
- There are no longer clear boundaries between fields
Common goals:
- realistic agents (move beyond playing Atari games)
- reproducible systems
- measurable goals
- scalable models
- systems that are easy to use
Web as an Interactive Environment
- agents on the web are both practical and scalable
- https://webshop-pnlp.github.io/
- agents trained in WebShop can transfer to the real Amazon site with no extra training
- Mind2Web
InterCode
Formulates agent decision making as a POMDP (partially observable Markov decision process) so that interactive tasks can be benchmarked as sequential decision problems:
https://arxiv.org/abs/2306.14898
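Roughly, the POMDP view boils down to an observe-act loop: the agent never sees the true state and must condition on its own interaction history. A minimal sketch with hypothetical Env/Agent names (not the InterCode API):

```python
# Minimal POMDP-style agent loop (illustrative sketch, not the InterCode API).
# The agent never sees the true state, only observations; its "memory" is the
# observation/action history it keeps itself.

from dataclasses import dataclass, field


@dataclass
class Env:
    """Toy environment: the true state (secret) is hidden; only observations leak out."""
    secret: int = 7
    done: bool = False

    def reset(self) -> str:
        self.done = False
        return "guess a number between 0 and 9"

    def step(self, action: int) -> tuple[str, float, bool]:
        if action == self.secret:
            self.done = True
            return "correct", 1.0, True
        hint = "higher" if action < self.secret else "lower"
        return hint, 0.0, False


@dataclass
class Agent:
    """Policy conditioned on the full observation/action history."""
    history: list = field(default_factory=list)
    low: int = 0
    high: int = 9

    def act(self, observation: str) -> int:
        # Update beliefs from the latest observation (binary search here;
        # an LLM agent would instead condition on the textual history).
        if self.history:
            last_action = self.history[-1][1]
            if observation == "higher":
                self.low = last_action + 1
            elif observation == "lower":
                self.high = last_action - 1
        action = (self.low + self.high) // 2
        self.history.append((observation, action))
        return action


env, agent = Env(), Agent()
obs, reward, done = env.reset(), 0.0, False
while not done:
    action = agent.act(obs)
    obs, reward, done = env.step(action)
print("final reward:", reward)
```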
Agent Development
Agent development has no core framework
production systems
- a set of rules, each specifying a precondition + an action
- when a rule's preconditions are met, perform its action (see the sketch below)
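A minimal sketch of the recognize-act cycle of a production system (all rule and memory names are illustrative, not from any particular framework):

```python
# Tiny production system sketch: rules are (precondition, action) pairs that
# fire whenever their precondition holds over the working memory.

working_memory = {"hungry": True, "has_ingredients": False}

rules = [
    # (precondition over memory, action that updates memory)
    (lambda m: m["hungry"] and not m["has_ingredients"],
     lambda m: m.update(has_ingredients=True)),   # "go shopping"
    (lambda m: m["hungry"] and m["has_ingredients"],
     lambda m: m.update(hungry=False)),           # "cook and eat"
]

# Recognize-act cycle: keep firing matching rules until nothing changes.
changed = True
while changed:
    changed = False
    for precondition, action in rules:
        if precondition(working_memory):
            before = dict(working_memory)
            action(working_memory)
            if working_memory != before:
                changed = True

print(working_memory)  # {'hungry': False, 'has_ingredients': True}
```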
Big "kitchen sink" unifying proposal (CoALA, Cognitive Architectures for Language Agents): https://arxiv.org/abs/2309.02427
Trust and safety
Agents are much more powerful and dynamic than static models, which amplifies trust and safety concerns
Challenges of Agent Data Collection
Agent data collection requires embodiment (the agent actually has to act on the world), so:
- infra is hard (initial environment setup is really hard)
- complex observation-action interactions in diverse environments
- we want to create / filter for goal-aligned trajectories
some strategies
- humans do it
- synthetic data: NNetNav or AgentTrek (limitation: parallelization and search are hard)
- internet-scale data: observing demonstrations from the internet (but it's hard to ground them to a goal)
Human-agent interaction collection procedure
- Have users install the AgentNet tool to capture their screens
- Have humans perform tasks that are goal-aligned
- Then we have unified agent data!
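One way to picture what a unified demonstration record could look like; the field names below are assumptions for illustration, not the actual AgentNet schema:

```python
# Illustrative schema for a recorded human demonstration (field names are
# assumptions for illustration, not the real AgentNet data format).
from dataclasses import dataclass


@dataclass
class Step:
    screenshot_path: str  # captured screen at this step
    action: str           # e.g. "click(x=312, y=80)" or "type('san francisco')"


@dataclass
class Demonstration:
    goal: str             # the goal-aligned task the human was asked to do
    steps: list[Step]


demo = Demonstration(
    goal="book the cheapest flight from SFO to JFK next Friday",
    steps=[
        Step("frames/000.png", "click(x=120, y=45)"),
        Step("frames/001.png", "type('SFO to JFK')"),
    ],
)
print(len(demo.steps), "steps toward:", demo.goal)
```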
Challenges of Agent Benchmarking
- we can only write scripted evaluations for a very limited set of tasks: time consuming
- we can't script evaluation metrics for open-answer tasks (which is what we get from real users)
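For the narrow tasks where evaluation can be scripted, it usually reduces to a programmatic check over the final environment state; a hedged sketch with a hypothetical shopping task (not taken from any specific benchmark):

```python
# Sketch of a scripted evaluation metric for a narrow, checkable task
# (hypothetical task: "add a 16GB+ USB drive under $20 to the cart").
# Open-ended tasks from real users have no such programmatic check.

def evaluate_cart(final_cart: list[dict]) -> float:
    """Return 1.0 if any cart item satisfies the scripted success condition."""
    for item in final_cart:
        if ("usb" in item["title"].lower()
                and item["capacity_gb"] >= 16
                and item["price"] < 20):
            return 1.0
    return 0.0


cart = [{"title": "SanDisk USB Flash Drive", "capacity_gb": 32, "price": 12.99}]
print(evaluate_cart(cart))  # 1.0
```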