“Can we come up with a policy that, if not fast, **at least reaches the goal**?”

## Background

### Stochastic Shortest-Path

We start at an initial state, are given a set of goal states, and want to reach one of them.

We can solve this with:

- value iteration
- simulating trajectories and updating only the reachable states: RTDP, LRTDP
- MBP
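
As a rough illustration of the first option, here is a minimal value-iteration sketch for a tiny stochastic shortest-path problem. The three-state example, the dictionary layout (`P[s][a]` as a list of `(next_state, probability)` pairs, `C[s][a]` as a per-step cost), and all names are assumptions made for this sketch, not something taken from the paper.

```python
def value_iteration(P, C, goals, eps=1e-9, max_iters=10_000):
    """Compute expected cost-to-goal V and a greedy policy.

    P[s][a] is a list of (next_state, probability) pairs (hypothetical layout).
    C[s][a] is the cost of taking action a in state s.
    """
    V = {s: 0.0 for s in P}  # goal states keep value 0

    def q(s, a):
        # Expected cost of taking a in s, then following the current estimate V.
        return C[s][a] + sum(p * V[s2] for s2, p in P[s][a])

    for _ in range(max_iters):
        residual = 0.0
        for s in P:
            if s in goals:
                continue
            best = min(q(s, a) for a in P[s])
            residual = max(residual, abs(best - V[s]))
            V[s] = best  # in-place (Gauss-Seidel style) update
        if residual < eps:
            break

    policy = {s: min(P[s], key=lambda a: q(s, a)) for s in P if s not in goals}
    return V, policy


# Toy problem: "risky" reaches the goal with probability 0.8, else stays put.
P = {
    "s0": {"safe": [("s1", 1.0)], "risky": [("goal", 0.8), ("s0", 0.2)]},
    "s1": {"go": [("goal", 1.0)]},
    "goal": {},
}
C = {"s0": {"safe": 1.0, "risky": 1.0}, "s1": {"go": 1.0}}

V, policy = value_iteration(P, C, goals={"goal"})
print(V, policy)
```

Value iteration sweeps every state on each pass. RTDP and LRTDP instead apply the same backup only to states encountered along simulated trajectories from the initial state.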

## Problem

MDP + Goal States

- \(S\): set of states
- \(A\): actions
- \(P(s'|s,a)\): transition probabilities
- \(C\): cost function (a negative reward)
- \(G\): absorbing **goal states**
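
For reference, the standard optimality condition for this kind of problem can be written as below. This is a sketch that treats \(C\) as a per-step cost depending on the state and action, written \(C(s,a)\), and introduces \(V^*(s)\) for the optimal expected cost-to-goal; both are notational assumptions not present in the list above.

\[
V^*(s) =
\begin{cases}
0 & \text{if } s \in G \\
\min_{a \in A} \Big[ C(s,a) + \sum_{s' \in S} P(s'|s,a)\, V^*(s') \Big] & \text{otherwise}
\end{cases}
\]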

## Approach

Combining LRTDP with anytime dynamics

- run GPT (not the transformer; the “General Planning Tool”, think LRTDP), an exact solver
- use the GPT policy for states that are solved or have been visited more than a certain threshold number of times
- use the MBP policy for all other states
- run policy evaluation on the combined policy to check convergence
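
The state-by-state choice between the two planners can be sketched as follows. This is only an illustration of the selection rule in the list above: the function name, the `solved` set, the `visits` counter, and the example values are all hypothetical, and the real GPT and MBP tools are not called here.

```python
def hybrid_policy(states, gpt_policy, mbp_policy, solved, visits, threshold):
    """Use the GPT action for trusted states and the MBP action elsewhere.

    A state is "trusted" if GPT marked it solved or visited it more than
    `threshold` times (all inputs are hypothetical stand-ins for planner output).
    """
    policy = {}
    for s in states:
        trusted = s in solved or visits.get(s, 0) > threshold
        if trusted and s in gpt_policy:
            policy[s] = gpt_policy[s]
        else:
            # Fall back to MBP; None if neither planner covers the state.
            policy[s] = mbp_policy.get(s)
    return policy


states = ["a", "b", "c"]
gpt_policy = {"a": "left", "b": "up", "c": "down"}
mbp_policy = {"a": "right", "b": "right", "c": "right"}

combined = hybrid_policy(
    states,
    gpt_policy,
    mbp_policy,
    solved={"a"},             # GPT has converged on "a"
    visits={"b": 5, "c": 1},  # trial visit counts
    threshold=3,
)
print(combined)
```

Here state `a` is solved and `b` was visited often enough, so both take the GPT action, while the rarely visited `c` falls back to MBP.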

“use the GPT solution as much as possible, and when the search trajectories have never visited a state, we can use MBP to supplement the solution”