In the Policy Gradients approach, the algorithm uses the model to play the episode several times (e.g., 10 times), then goes back over all the rewards, discounts them, and normalizes them.
So let's create the functions for that: the first will compute the discounted rewards (discount_rewards()); the second will normalize the discounted rewards across many episodes (discount_and_normalize_rewards()).
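Before defining them, here is a minimal, self-contained sketch of the reward-collection step that feeds these functions. The play_one_episode() helper is a hypothetical stand-in that just produces random rewards; in the real exercise, the rewards would come from playing the environment with the model:
import numpy as np

rng = np.random.default_rng(seed=42)

def play_one_episode(rng, max_steps=5):
    # Hypothetical stand-in for an environment rollout: returns a random
    # sequence of rewards, one per step.
    n_steps = int(rng.integers(1, max_steps + 1))
    return rng.normal(size=n_steps).tolist()

# Play the episode several times (e.g., 10 times) and collect the rewards.
all_rewards = [play_one_episode(rng) for _ in range(10)]
# all_rewards is a list of per-episode reward lists: exactly the structure
# that discount_and_normalize_rewards() below expects.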
Define the discount_rewards() function, which takes rewards and discount_rate as input arguments and returns the discounted array. It works backward from the last step, applying discounted[step] = rewards[step] + discount_rate * discounted[step + 1] at each step.
import numpy as np

def discount_rewards(rewards, discount_rate):
    discounted = np.array(rewards)
    # Walk backward from the second-to-last step, adding each step's
    # discounted future reward to its immediate reward.
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted
This function will be called by the discount_and_normalize_rewards() function.
Say there were 3 actions, and after each action there was a reward: first 10, then 0, then -50. If we use a discount factor of 80%, then the 3rd action will get -50 (full credit for the last reward), but the 2nd action will only get -40 (80% credit for the last reward), and the 1st action will get 80% of -40 (-32) plus full credit for the first reward (+10), which leads to a discounted reward of -22:
For example, discount_rewards([10, 0, -50], discount_rate=0.8) returns array([-22, -40, -50]).
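As a quick sanity check, the snippet below (which assumes the discount_rewards() function defined above) reproduces that result:
print(discount_rewards([10, 0, -50], discount_rate=0.8))
# prints: [-22 -40 -50]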
Define the discount_and_normalize_rewards() function, which takes all_rewards and discount_rate as input arguments and returns the normalized values of the discounted rewards, one array per episode.
def discount_and_normalize_rewards(all_rewards, discount_rate):
    # Discount each episode's rewards independently.
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                              for rewards in all_rewards]
    # Compute the mean and standard deviation over all episodes.
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    # Standardize each episode's discounted rewards with the global statistics.
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]
To normalize the discounted rewards across all episodes, we compute the mean and standard deviation of all the discounted rewards, then subtract the mean from each discounted reward and divide by the standard deviation.
For example, discount_and_normalize_rewards([[10, 0, -50], [10, 20]], discount_rate=0.8) returns [array([-0.28435071, -0.86597718, -1.18910299]), array([1.26665318, 1.0727777 ])].
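Again as a quick check, the snippet below (assuming both functions defined above) reproduces that result; note that the second episode's rewards [10, 20] first discount to [26, 20] before normalization:
all_rewards = [[10, 0, -50], [10, 20]]
print(discount_and_normalize_rewards(all_rewards, discount_rate=0.8))
# prints the two normalized arrays:
# [array([-0.28435071, -0.86597718, -1.18910299]),
#  array([1.26665318, 1.0727777 ])]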