Lecture notes taken during CS238, Decision Making. Stanford Intelligent Systems Laboratory (SISL: planning and validation of intelligent systems).
Big Ideas
Themes
- There’s a principled mathematical framework for defining rational behavior
- There are computational techniques that could lead to better, and perhaps counterintuitive, decisions
- Successful application depends on your choice of representation and approximation
- you typically can’t solve mathematical models exactly
- so we have to rely on good approximations
- The same computational approaches can be applied to different application domains
- the same set of abstractions can be carried through life
- send Mykel a note about where these topics are applied
These algorithms drive high-quality decisions on a tight timeline. You can’t fuck up: people die.
Contents
- Fundamental understanding of mathematical models and solution methods—ungraded book exercises
- Three quizzes: one question per chapter
- chapters 2, 3, 5
- Implement and extend key algorithms for learning and decision making
- Identify an application of the theory of this course and formulate it mathematically (proposal)
- what are the inputs/outputs
- what are the sensor measurements
- what are the decisions to be made
- [one other thing]
Course Outline
1-shot: Probabilistic Reasoning
- models of distributions over many variables
- using distributions to make inferences
- utility theory (a maximum-expected-utility sketch follows this list)
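Taken together, those bullets amount to the one-shot decision rule: form a distribution over outcomes, assign utilities, and pick the action with maximum expected utility. A minimal sketch; the umbrella scenario and its numbers are invented purely for illustration.

```python
# Hypothetical one-shot decision: outcome distribution P and utility table U are given.
P = {"rain": 0.3, "sun": 0.7}
U = {("bring umbrella", "rain"): 5,   ("bring umbrella", "sun"): -1,
     ("leave it",       "rain"): -10, ("leave it",       "sun"): 2}

def expected_utility(action):
    """E[U | action] = sum over outcomes of P(outcome) * U(action, outcome)."""
    return sum(P[o] * U[(action, o)] for o in P)

# The rational (maximum-expected-utility) choice:
best = max(["bring umbrella", "leave it"], key=expected_utility)
print(best, expected_utility(best))  # -> bring umbrella, 0.8
```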
n-shot: Sequential Problems
- we now extend 1-shot decision networks into making a series of decisions
- assume: model of environment is known (no Model Uncertainty), and environment is fully observable (no State Uncertainty)
- this introduces the Markov Decision Process (MDP); a minimal model sketch follows this list
- approximate solution methods, both online and offline
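As a concrete reading of "the model of the environment is known", here is a minimal sketch of an MDP as data: states, actions, a transition model T(s' | s, a), a reward R(s, a), and a discount factor. The two-state machine-maintenance example is invented for illustration, not from the course.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]
    actions: List[str]
    T: Dict[Tuple[str, str], Dict[str, float]]  # (s, a) -> {s': P(s' | s, a)}
    R: Dict[Tuple[str, str], float]             # (s, a) -> reward
    gamma: float                                # discount factor

# Invented two-state example: a machine that is "healthy" or "broken".
mdp = MDP(
    states=["healthy", "broken"],
    actions=["run", "repair"],
    T={("healthy", "run"):    {"healthy": 0.9, "broken": 0.1},
       ("healthy", "repair"): {"healthy": 1.0},
       ("broken",  "run"):    {"broken": 1.0},
       ("broken",  "repair"): {"healthy": 0.8, "broken": 0.2}},
    R={("healthy", "run"): 10.0, ("healthy", "repair"): -1.0,
       ("broken",  "run"):  0.0, ("broken",  "repair"): -5.0},
    gamma=0.95,
)
```

The policy-evaluation and value-iteration sketches further down reuse this same dictionary format.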
Model Uncertainty
- deal with situations where we don’t know what the best action is at any given step
- i.e.: future rewards, etc.
- introduce reinforcement learning and its challenges (a minimal Q-learning sketch follows this list)
- Rewards may be received long after important decisions
- Agents must generalize from limited exploration experience
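A minimal sketch of the kind of model-free update reinforcement learning is built on: tabular Q-learning, one standard algorithm, not necessarily the course's presentation. The discounted bootstrapped target is what lets reward received long after a decision propagate back to it.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Usage with made-up experience (states/actions reuse the toy MDP above):
Q = defaultdict(float)
q_learning_update(Q, "healthy", "run", 10.0, "broken", ["run", "repair"])
```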
State Uncertainty
- deal with situations where we don’t know what is actually happening: we only have a probabilistic state
- introduce the Partially Observable Markov Decision Process (POMDP)
- keep a belief: a distribution over possible states
- update the belief as actions are taken and observations arrive
- make decisions based on the belief (see the belief-update sketch after this list)
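A minimal sketch of the discrete belief update (a Bayes filter): push the current belief through the transition model, reweight by the observation likelihood, and renormalize. The dictionary model format mirrors the earlier MDP sketch and is an assumption, not the course's code; O is assumed to give P(o | s', a).

```python
def update_belief(belief, action, observation, T, O, states):
    """belief: {s: P(s)}; T[(s, a)]: {s': P(s' | s, a)}; O[(s', a)]: {o: P(o | s', a)}."""
    new_belief = {}
    for s_next in states:
        # Predict: probability of landing in s_next under the current belief.
        predicted = sum(belief[s] * T[(s, action)].get(s_next, 0.0) for s in states)
        # Correct: weight by how likely the received observation is from s_next.
        new_belief[s_next] = O[(s_next, action)].get(observation, 0.0) * predicted
    total = sum(new_belief.values())
    return {s: p / total for s, p in new_belief.items()}
```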
Multiagent Systems
- challenges of Interaction Uncertainty
- building up interaction complexity
- simple games: many agents, each with individual rewards, acting to make a single joint action (sketched after this list)
- Markov games: many agents, many states, multiple outcomes in a stochastic environment; Interaction Uncertainty arises out of unknowns about what other agents will do
- partially observable Markov games: Markov games with State Uncertainty
- decentralized partially observable Markov games: POMGs with shared rewards between agents instead of individual rewards
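For the bottom rung of that ladder, here is a minimal sketch of a two-agent simple game: each agent has its own reward over the single joint action. The payoffs are the usual prisoner's dilemma numbers, used purely as an illustration.

```python
import itertools

actions = ["cooperate", "defect"]
# Joint action -> (reward for agent 0, reward for agent 1).
rewards = {("cooperate", "cooperate"): (-1, -1),
           ("cooperate", "defect"):    (-4,  0),
           ("defect",    "cooperate"): ( 0, -4),
           ("defect",    "defect"):    (-3, -3)}

def best_response(agent, other_action):
    """Best action for `agent`, assuming the other agent's action is fixed."""
    joint = lambda a: (a, other_action) if agent == 0 else (other_action, a)
    return max(actions, key=lambda a: rewards[joint(a)][agent])

# A joint action where every agent is best-responding is a pure Nash equilibrium.
for a0, a1 in itertools.product(actions, actions):
    if best_response(0, a1) == a0 and best_response(1, a0) == a1:
        print("pure Nash equilibrium:", (a0, a1))  # -> ('defect', 'defect')
```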
Lectures
probabilistic reasoning relating to single decisions
Bayesian Networks, and how to deal with them.
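A minimal sketch of the simplest case of "dealing with" a Bayesian network: factor the joint as P(Rain) * P(WetGrass | Rain) and answer a query by enumeration plus Bayes' rule. The two-variable network and its numbers are made up for illustration.

```python
# Tiny (invented) Bayes net: Rain -> WetGrass.  Joint = P(R) * P(W | R).
P_rain = {True: 0.2, False: 0.8}
P_wet_given_rain = {True:  {True: 0.9, False: 0.1},
                    False: {True: 0.2, False: 0.8}}

def p_rain_given_wet(wet=True):
    """P(R | W = wet) is proportional to P(R) * P(W = wet | R), normalized over R."""
    unnorm = {r: P_rain[r] * P_wet_given_rain[r][wet] for r in (True, False)}
    z = sum(unnorm.values())
    return {r: p / z for r, p in unnorm.items()}

print(p_rain_given_wet(True))  # {True: ~0.53, False: ~0.47}
```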
- SU-CS238 SEP262023
- SU-CS238 SEP272023
- SU-CS238 OCT032023
- SU-CS238 OCT052023
- SU-CS238 OCT102023
- SU-CS238 OCT122023
a chain of reasoning with feedback
A Markov Decision Process uses policies, which are scored through policy evaluation: utility, the Bellman Equation, the value function, etc.
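A minimal sketch of iterative policy evaluation under the dictionary model format assumed in the MDP sketch above: repeatedly apply the Bellman expectation backup U(s) ← R(s, π(s)) + γ Σ_s' T(s' | s, π(s)) U(s') until the values stop changing.

```python
def policy_evaluation(policy, states, T, R, gamma=0.95, tol=1e-6):
    """Utility of following a fixed policy, computed by repeated Bellman backups."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            u = R[(s, a)] + gamma * sum(p * U[s2] for s2, p in T[(s, a)].items())
            delta = max(delta, abs(u - U[s]))
            U[s] = u
        if delta < tol:
            return U
```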
If we know the state space fully, we can use policy iteration and value iteration to determine an objectively optimal policy. If we don’t (or if the state space is too large), we can try to discretize our state space and approximate through Approximate Value Functions, or use online planning approaches to compute a good policy as we go.
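And a minimal value-iteration sketch in the same format: apply the Bellman optimality backup until convergence, then read the greedy policy off the converged value function. (Policy iteration alternates the evaluation step above with this greedy extraction.)

```python
def value_iteration(states, actions, T, R, gamma=0.95, tol=1e-6):
    """Bellman optimality backups: U(s) <- max_a [ R(s, a) + gamma * E[U(s')] ]."""
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            u = max(R[(s, a)] + gamma * sum(p * U[s2] for s2, p in T[(s, a)].items())
                    for a in actions)
            delta = max(delta, abs(u - U[s]))
            U[s] = u
        if delta < tol:
            break
    # Greedy policy extraction from the converged utilities.
    return {s: max(actions,
                   key=lambda a: R[(s, a)] + gamma * sum(p * U[s2]
                                                         for s2, p in T[(s, a)].items()))
            for s in states}
```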
If none of those things are feasible (i.e. your state space is too big or complex to be discretized (i.e. sampling will cause you to lose the structure of the problem)), you can do some lovely Policy Optimization, which will keep you in continuous space while iterating on the policy directly. Some nerds lmao like Policy Gradient methods if your policy is differentiable.
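A minimal sketch of the likelihood-ratio (REINFORCE-style) policy gradient for a differentiable policy. To keep it tiny, the policy is a softmax with one logit per action and no state features, and trajectories are assumed to be given as (action, return-from-that-step) pairs; both are illustrative simplifications, not the course's setup.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_gradient(theta, trajectories):
    """Estimate grad J(theta) = E[ sum_t grad log pi(a_t) * G_t ] from samples.
    For a softmax policy, d/d theta_i of log pi(a) = 1{i == a} - pi(i)."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for traj in trajectories:
        for a, G in traj:
            for i in range(len(theta)):
                grad[i] += ((1.0 if i == a else 0.0) - probs[i]) * G
    return [g / len(trajectories) for g in grad]

# One gradient-ascent step on made-up sampled trajectories:
theta = [0.0, 0.0]
trajectories = [[(0, 1.0), (1, 0.5)], [(1, 2.0)]]
grad = reinforce_gradient(theta, trajectories)
theta = [t + 0.1 * g for t, g in zip(theta, grad)]
```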
Now, Policy Optimization methods all require sampling a certain set of trajectories and optimizing over them in order to work. How do we know how much sampling to do before we start optimizing? That’s an Exploration and Exploitation question. We can try really hard to collect trajectories, but then we’d lose out on collecting intermediate reward.
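The bluntest answer is to interleave the two rather than phase them, e.g. ε-greedy, sketched below: explore with probability ε on every step, otherwise exploit the current estimates. Q is assumed to be a dict of value estimates (e.g. a defaultdict); the "optimism under uncertainty" quote at the bottom of these notes points at the smarter family of exploration strategies.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit the
    action with the highest current value estimate."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

# Usage with made-up estimates:
Q = defaultdict(float, {("healthy", "run"): 1.0})
print(epsilon_greedy(Q, "healthy", ["run", "repair"]))
```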
- SU-CS238 OCT172023
- SU-CS238 OCT192023
- SU-CS238 OCT242023
- SU-CS238 OCT262023
- SU-CS238 OCT312023
- SU-CS238 NOV022023
POMDP bomp bomp bomp
Failures?
- Change the action space
- Change the reward function
- Change the transition function
- Improve the solver
- Don’t worry about it
- Don’t deploy the system
Words of Wisdom from Mykel
“The belief update is central to learning. The point of education is to change your beliefs; look for opportunities to change your belief.”
“What’s in the action space, how do we maximize it?”
From MDPs, “we can learn from the past, but the past doesn’t influence you.”
“Optimism under uncertainty” (Exploration and Exploitation): “you should try things”