Lecture notes taken during CS238, decision making. Stanford Intelligent Systems Laboratory (SISL: planning and validation of intelligent systems).

## Big Ideas

### Themes

- There’s a principled mathematical framework for defining rational behavior
- There are computational techniques that could lead to better, and perhaps counter-intuitive decisions
- Successful application depends on your choice of representation and approximation
- you typically can’t solve mathematical models **exactly**
- so, we have to rely on good models of approximations
- The same computational approaches can be applied to different application domains
- the same set of abstractions can be carried through life
- send Mykel a note about where these topics are applied

These algorithms drive **high quality** decisions on a **tight timeline**. You can’t fuck up: people die.

### Contents

- Fundamental understanding of mathematical models and solution methods—ungraded book exercises
- Three quizzes: one question per chapter
- chapters 2, 3, 5
- Implement and extend key algorithms for learning and decision making
- Identify an application of the theory of this course and formulate it mathematically (proposal)
- what are the i/o
- what are the sensor measurements
- what are the decisions to be made

- [one other thing]

## Course Outline

### 1-shot: Probabilistic Reasoning

- models of distributions over many variables
- using distributions to make inferences
- utility theory
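
A minimal sketch of how these three pieces fit together in a one-shot decision: infer a posterior with Bayes' rule, then pick the action with the highest expected utility. The scenario (treat vs. wait after a positive test) and every number in it are made up for illustration and are not from the course.

```python
# Hypothetical one-shot decision: treat or wait, given a positive test result.
# All probabilities and utilities below are assumed values for illustration.
prior = {"sick": 0.1, "healthy": 0.9}          # P(state)
likelihood = {"sick": 0.8, "healthy": 0.2}     # P(test positive | state)
utility = {                                    # U(action, state)
    ("treat", "sick"): 10, ("treat", "healthy"): -2,
    ("wait", "sick"): -20, ("wait", "healthy"): 0,
}


def posterior_given_positive(prior, likelihood):
    """Bayes' rule: P(state | positive) is proportional to P(positive | state) * P(state)."""
    unnormalized = {s: prior[s] * likelihood[s] for s in prior}
    z = sum(unnormalized.values())
    return {s: p / z for s, p in unnormalized.items()}


def expected_utility(action, belief, utility):
    """Average the utility of an action over the belief about the state."""
    return sum(belief[s] * utility[(action, s)] for s in belief)


def best_action(belief, utility, actions=("treat", "wait")):
    """Maximum expected utility principle: choose the action with the highest EU."""
    return max(actions, key=lambda a: expected_utility(a, belief, utility))


belief = posterior_given_positive(prior, likelihood)
choice = best_action(belief, utility)
```

With these assumed numbers the posterior probability of "sick" rises from 0.1 to about 0.31, and "treat" has the higher expected utility.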

### n-shot: Sequential Problems

- we now extend 1-shot decision networks into making a series of decisions
- **assume**: model of environment is known (no Model Uncertainty), and environment is fully observable (no State Uncertainty)
- this introduces a Markov Decision Process (MDP)

- approximation solutions for observing the environment both online and offline

### Model Uncertainty

- deal with situations where we don’t know what the best action is at any given step
- i.e.: future rewards, etc.
- introduce reinforcement learning and its challenges
- Rewards may be received long after important decisions
- Agents must generalize from limited exploration experience

### State Uncertainty

- deal with situations where we don’t know what is actually happening: we only have a **probabilistic** state
- introduce Partially Observable Markov Decision Process (POMDP)
- keep a distribution of beliefs
- update the distribution of beliefs
- make decisions based on the distribution
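
The "keep a belief, update the belief" loop can be sketched as a discrete Bayes filter. This is a minimal sketch assuming finite state and observation sets; the dictionary layout of `T` and `O` and the tiger-style toy problem below are illustrative assumptions, not the course's code.

```python
def update_belief(belief, action, observation, T, O):
    """One discrete belief update: b'(s2) ∝ O(o | a, s2) * sum_s T(s2 | s, a) * b(s).

    Assumed layout (hypothetical):
      T[(s, a)]  -> dict mapping next state s2 to P(s2 | s, a)
      O[(a, s2)] -> dict mapping observation o to P(o | a, s2)
    """
    states = list(belief)
    new_belief = {}
    for s2 in states:
        # Prediction step: push the old belief through the transition model.
        predicted = sum(T[(s, action)].get(s2, 0.0) * belief[s] for s in states)
        # Correction step: weight by how likely the observation is in s2.
        new_belief[s2] = O[(action, s2)].get(observation, 0.0) * predicted
    z = sum(new_belief.values())
    if z == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / z for s, p in new_belief.items()}


# Toy problem: something is hidden on the left or right; listening does not
# move it, and the sound points to the correct side 85% of the time.
T = {("left", "listen"): {"left": 1.0}, ("right", "listen"): {"right": 1.0}}
O = {
    ("listen", "left"): {"hear_left": 0.85, "hear_right": 0.15},
    ("listen", "right"): {"hear_left": 0.15, "hear_right": 0.85},
}

b0 = {"left": 0.5, "right": 0.5}
b1 = update_belief(b0, "listen", "hear_left", T, O)
b2 = update_belief(b1, "listen", "hear_left", T, O)
```

Starting from a uniform belief, one "hear_left" moves the belief in "left" to 0.85, and a second moves it to about 0.97. Decisions are then made from `b2` rather than from a known state.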

### Multiagent Systems

- challenges of Interaction Uncertainty
- building up interaction complexity
- simple games: many agents, each with individual rewards, acting to make a single joint action
- Markov games: many agents, many states, multiple outcomes in a stochastic environment; Interaction Uncertainty arises out of unknowns about what other agents will do
- partially observable Markov game (POMG): Markov games with State Uncertainty
- decentralized partially observable Markov decision process (Dec-POMDP): POMGs with shared rewards between agents instead of individual rewards

## Lectures

### probabilistic reasoning relating to single decisions

Bayesian Networks, and how to deal with them.

- SU-CS238 SEP262023
- SU-CS238 SEP272023
- SU-CS238 OCT032023
- SU-CS238 OCT052023
- SU-CS238 OCT102023
- SU-CS238 OCT122023

### a chain of reasoning with feedback

A Markov Decision Process uses policies, which are evaluated through policy evaluation via utility, the Bellman Equation, value functions, etc.

If we know the state space fully, we can use policy iteration and value iteration to determine an objectively optimal policy. If we don’t (or if the state space is too large), we can try to discretize our state space and approximate through Approximate Value Functions, or use online planning approaches to compute a good policy as we go.
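
As a rough illustration of value iteration on a fully known, small state space, here is a minimal sketch. The two-state MDP, its rewards, and the dictionary layout of `T` and `R` are made-up assumptions for the example.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Repeat the Bellman backup U(s) <- max_a [R(s,a) + gamma * sum_s2 T(s2|s,a) U(s2)].

    Assumed layout (hypothetical):
      T[(s, a)] -> dict mapping next state s2 to P(s2 | s, a)
      R[(s, a)] -> immediate reward
    """
    U = {s: 0.0 for s in states}

    def q(s, a, U):
        # One-step lookahead value of taking action a in state s.
        return R[(s, a)] + gamma * sum(p * U[s2] for s2, p in T[(s, a)].items())

    for _ in range(max_iters):
        U_new = {s: max(q(s, a, U) for a in actions) for s in states}
        delta = max(abs(U_new[s] - U[s]) for s in states)
        U = U_new
        if delta < tol:  # stop once the values have stopped changing
            break

    # Extract the greedy policy with respect to the converged values.
    policy = {s: max(actions, key=lambda a: q(s, a, U)) for s in states}
    return U, policy


# Toy deterministic MDP: staying in "high" pays 2 per step, moving up costs 1.
states = ["low", "high"]
actions = ["stay", "move"]
T = {
    ("low", "stay"): {"low": 1.0}, ("low", "move"): {"high": 1.0},
    ("high", "stay"): {"high": 1.0}, ("high", "move"): {"low": 1.0},
}
R = {("low", "stay"): 0.0, ("low", "move"): -1.0,
     ("high", "stay"): 2.0, ("high", "move"): 0.0}

U, policy = value_iteration(states, actions, T, R)
```

For this toy problem the values converge to roughly `U["high"] = 20` and `U["low"] = 17`, and the greedy policy is to move out of "low" and stay in "high".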

If none of those things are feasible (e.g. your state space is too big or complex to be discretized, because sampling will cause you to lose the structure of the problem), you can do some lovely Policy Optimization, which will keep you in continuous space while iterating on the policy directly. Some nerds lmao like Policy Gradient methods if your policy is differentiable.

Now, Policy Optimization methods all require sampling a certain set of trajectories and optimizing over them in order to work. How do we know how much sampling to do before we start optimizing? That’s an Exploration and Exploitation question. We can try really hard to collect trajectories, but then we’d lose out on collecting intermediate reward.
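
One simple way to see the exploration/exploitation tradeoff is an epsilon-greedy strategy on a multi-armed bandit. This is only a sketch of that one strategy, not necessarily the method used in lecture; the arm probabilities, `epsilon`, and the function name are assumptions for illustration.

```python
import random


def epsilon_greedy_bandit(success_probs, steps=2000, epsilon=0.1, seed=0):
    """Pull Bernoulli arms; explore with probability epsilon, otherwise exploit.

    success_probs holds the (hidden from the agent) win probability of each arm.
    """
    rng = random.Random(seed)
    n_arms = len(success_probs)
    counts = [0] * n_arms        # how often each arm was pulled
    estimates = [0.0] * n_arms   # running mean reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            # exploit: pick the arm with the best current estimate
            arm = max(range(n_arms), key=lambda i: estimates[i])
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update, so no reward history needs to be stored.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return counts, estimates, total_reward


counts, estimates, total_reward = epsilon_greedy_bandit([0.2, 0.8])
```

A larger `epsilon` learns the arm estimates faster but gives up more reward while doing so, which is the same tension as sampling more trajectories before optimizing.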

- SU-CS238 OCT172023
- SU-CS238 OCT192023
- SU-CS238 OCT242023
- SU-CS238 OCT262023
- SU-CS238 OCT312023
- SU-CS238 NOV022023

### POMDP bomp bomp bomp

### Failures?

- Change the action space
- Change the reward function
- Change the transition function
- Improve the solver
- Don’t worry about it
- Don’t deploy the system

### Words of Wisdom from Mykel

“The belief update is central to learning. The point of education is to change your beliefs; look for opportunities to change your belief.”

“What’s in the action space, how do we maximize it?”

From MDPs, “we can learn from the past, but the past doesn’t influence you.”

“Optimism under uncertainty”: Exploration and Exploitation “you should try things”