Policy Gradients

Now that we want the neural network to learn a better policy itself - such that the cart pole doesn't wobble as much- we shall understand the policy gradient algorithm which helps us achieve it.

Understanding Policy Gradients

To train this neural network we will need to define the target probabilities y. If an action is good we should increase its probability, and conversely, if it is bad we should reduce it. But how do we know whether an action is good or bad? The problem is that most actions have delayed effects, so when you win or lose points in an episode, it is not clear which actions contributed to this result: was it just the last action? Or the last 10? Or just one action 50 steps earlier? This is called the credit assignment problem.
The Policy Gradients algorithm tackles this problem by first playing multiple episodes, then making the actions in good episodes slightly more likely, while actions in bad episodes are made slightly less likely. First we play, then we go back and think about what we did.

Loading comments...