Challenge of Making Agents
Agents are not new (Riedl and Amant 2002), but newer agents can be powered by LLMs/VLMs, meaning we now use language for reasoning and communication.
Sequentiality is hard
- what is the context/motivation?
- how do you transfer across contexts?
- how do you plan?
Evaluation
- Different from previous NLP benchmarks: we are no longer worried about language modeling itself
- There are no longer clear boundaries between fields
Common goals:
- realistic agents (move beyond playing Atari games)
- reproducible systems
- measurable goals
- scalable models
- systems that are easy to use
Web as an Interactive Environment
- agents on the web are both practical and scalable
- https://webshop-pnlp.github.io/
- agents trained in WebShop can transfer to the real Amazon site with no extra training
- Mind2Web
InterCode
Formulates agent decision making as a POMDP (partially observable Markov decision process) so that interactive tasks can be benchmarked as sequential decision problems:
https://arxiv.org/abs/2306.14898
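Roughly, the POMDP view boils down to an observe-act loop: the agent never sees the true state and must condition on its own interaction history. A minimal sketch with hypothetical Env/Agent names (not the InterCode API):

```python
# Minimal POMDP-style agent loop (illustrative sketch, not the InterCode API).
# The agent never sees the true state, only observations; its "memory" is the
# observation/action history it keeps itself.

from dataclasses import dataclass, field


@dataclass
class Env:
    """Toy environment: the true state (secret) is hidden; only observations leak out."""
    secret: int = 7
    done: bool = False

    def reset(self) -> str:
        self.done = False
        return "guess a number between 0 and 9"

    def step(self, action: int) -> tuple[str, float, bool]:
        if action == self.secret:
            self.done = True
            return "correct", 1.0, True
        hint = "higher" if action < self.secret else "lower"
        return hint, 0.0, False


@dataclass
class Agent:
    """Policy conditioned on the full observation/action history."""
    history: list = field(default_factory=list)
    low: int = 0
    high: int = 9

    def act(self, observation: str) -> int:
        # Update beliefs from the latest observation (binary search here;
        # an LLM agent would instead condition on the textual history).
        if self.history:
            last_action = self.history[-1][1]
            if observation == "higher":
                self.low = last_action + 1
            elif observation == "lower":
                self.high = last_action - 1
        action = (self.low + self.high) // 2
        self.history.append((observation, action))
        return action


env, agent = Env(), Agent()
obs, reward, done = env.reset(), 0.0, False
while not done:
    action = agent.act(obs)
    obs, reward, done = env.step(action)
print("final reward:", reward)
```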
Agent Development
Agent development has no core framework
production systems
- a set of rules, each specifying a precondition + an action
- when a rule's preconditions are met, perform its action (see the sketch below)
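A minimal sketch of the recognize-act cycle of a production system (all rule and memory names are illustrative, not from any particular framework):

```python
# Tiny production system sketch: rules are (precondition, action) pairs that
# fire whenever their precondition holds over the working memory.

working_memory = {"hungry": True, "has_ingredients": False}

rules = [
    # (precondition over memory, action that updates memory)
    (lambda m: m["hungry"] and not m["has_ingredients"],
     lambda m: m.update(has_ingredients=True)),   # "go shopping"
    (lambda m: m["hungry"] and m["has_ingredients"],
     lambda m: m.update(hungry=False)),           # "cook and eat"
]

# Recognize-act cycle: keep firing matching rules until nothing changes.
changed = True
while changed:
    changed = False
    for precondition, action in rules:
        if precondition(working_memory):
            before = dict(working_memory)
            action(working_memory)
            if working_memory != before:
                changed = True

print(working_memory)  # {'hungry': False, 'has_ingredients': True}
```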
Big "kitchen sink" unifying proposal (CoALA, Cognitive Architectures for Language Agents): https://arxiv.org/abs/2309.02427
Trust and safety
Agents are much more powerful and dynamic than static models, which amplifies trust and safety concerns
Challenges of Agent Data Collection
Agent data collection requires embodiment (the agent actually has to act on the world), so:
- infra is hard (initial environment setup is really hard)
- complex observation-action interactions in diverse environments
- we want to create / filter for goal-aligned trajectories
some strategies
- humans do it
- synthetic data: NNetNav or AgentTrek (limitation: parallelization and search are hard)
- internet-scale data: observing demonstrations from the internet (but it's hard to ground them to a goal)
Human-agent interaction collection procedure
- Have users install the AgentNet tool to capture their screens
- Have humans perform tasks that are goal-aligned
- Then we have unified agent data!
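One way to picture what a unified demonstration record could look like; the field names below are assumptions for illustration, not the actual AgentNet schema:

```python
# Illustrative schema for a recorded human demonstration (field names are
# assumptions for illustration, not the real AgentNet data format).
from dataclasses import dataclass


@dataclass
class Step:
    screenshot_path: str  # captured screen at this step
    action: str           # e.g. "click(x=312, y=80)" or "type('san francisco')"


@dataclass
class Demonstration:
    goal: str             # the goal-aligned task the human was asked to do
    steps: list[Step]


demo = Demonstration(
    goal="book the cheapest flight from SFO to JFK next Friday",
    steps=[
        Step("frames/000.png", "click(x=120, y=45)"),
        Step("frames/001.png", "type('SFO to JFK')"),
    ],
)
print(len(demo.steps), "steps toward:", demo.goal)
```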
Challenges of Agent Benchmarking
- we can only write scripted evaluations for a very limited set of tasks: time consuming
- we can't script evaluation metrics for open-answer tasks (which is what we get from real users)
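For the narrow tasks where evaluation can be scripted, it usually reduces to a programmatic check over the final environment state; a hedged sketch with a hypothetical shopping task (not taken from any specific benchmark):

```python
# Sketch of a scripted evaluation metric for a narrow, checkable task
# (hypothetical task: "add a 16GB+ USB drive under $20 to the cart").
# Open-ended tasks from real users have no such programmatic check.

def evaluate_cart(final_cart: list[dict]) -> float:
    """Return 1.0 if any cart item satisfies the scripted success condition."""
    for item in final_cart:
        if ("usb" in item["title"].lower()
                and item["capacity_gb"] >= 16
                and item["price"] < 20):
            return 1.0
    return 0.0


cart = [{"title": "SanDisk USB Flash Drive", "capacity_gb": 32, "price": 12.99}]
print(evaluate_cart(cart))  # 1.0
```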