Registrations Closing Soon for DevOps Certification Training by CloudxLab | Registrations Closing inEnroll Now
Understanding Policy Gradients
To train this neural network we will need to define the target probabilities
y. If an action is good we should increase its probability, and conversely, if it is bad we should reduce it. But how do we know whether an action is good or bad? The problem is that most actions have delayed effects, so when you win or lose points in an episode, it is not clear which actions contributed to this result: was it just the last action? Or the last 10? Or just one action 50 steps earlier? This is called the credit assignment problem.
The Policy Gradients algorithm tackles this problem by first playing multiple episodes, then making the actions in good episodes slightly more likely, while actions in bad episodes are made slightly less likely. First we play, then we go back and think about what we did.
No hints are availble for this assesment
Answer is not availble for this assesment