In the Policy Gradients approach, the algorithm uses the model to play the episode several times (e.g., 10 times), then it goes back and looks at all the rewards, discounts them, and normalizes them.
So let's create the functions for that: the first will compute discounted rewards (discount_rewards()); the second will normalize the discounted rewards across many episodes (discount_and_normalize_rewards()).
The function discount_rewards takes rewards and discount_rate as input arguments and returns discounted, the array of discounted rewards.
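import numpy as np

def discount_rewards(rewards, discount_rate):
    discounted = np.array(rewards)
    # Walk backwards from the second-to-last step, adding the discounted
    # value of the following step to each reward.
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted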
This function will be called by discount_and_normalize_rewards().
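One detail of discount_rewards() that is easy to miss: np.array(rewards) infers an integer dtype when every reward is an integer, and assigning a fractional value into an integer array drops the fractional part. The sketch below illustrates this with a hypothetical float-based variant, discount_rewards_float, which is not part of the lesson and is only an assumption about how one might avoid the truncation.

```python
import numpy as np

def discount_rewards_float(rewards, discount_rate):
    # Hypothetical variant: same backward pass, but on a float array.
    discounted = np.array(rewards, dtype=float)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted

# Integer array: storing 1.5 into an integer slot keeps only the 1.
int_based = np.array([1, 1, 1])
int_based[1] = int_based[1] + int_based[2] * 0.5

# Float array: the fractional parts survive.
float_based = discount_rewards_float([1, 1, 1], discount_rate=0.5)

print(int_based[1])   # 1
print(float_based)    # values 1.75, 1.5, 1.0
```

With integer rewards and a discount rate whose products happen to be whole numbers, both versions agree, so the difference only shows up for inputs like the one above.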
Say there were 3 actions, and after each action there was a reward: first 10, then 0, then -50. If we use a discount factor of 80%, then the 3rd action will get -50 (full credit for the last reward), but the 2nd action will only get -40 (80% credit for the last reward), and the 1st action will get 80% of -40 (-32) plus full credit for the first reward (+10), which leads to a discounted reward of -22:
discount_rewards([10, 0, -50], discount_rate=0.8) returns
array([-22, -40, -50]).
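The arithmetic above can be cross-checked against the definition of a discounted reward: the reward at a step plus each later reward multiplied by the discount rate raised to the number of steps separating them. The following is a minimal sketch of that check; the helper name discounted_returns_direct is hypothetical and is not part of the lesson's code.

```python
import numpy as np

def discounted_returns_direct(rewards, discount_rate):
    """Hypothetical helper: discounted reward per step, straight from the definition."""
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    returns = np.zeros(n)
    for t in range(n):
        # Weights 1, rate, rate**2, ... applied to the rewards from step t onward.
        weights = discount_rate ** np.arange(n - t)
        returns[t] = np.dot(weights, rewards[t:])
    return returns

result = discounted_returns_direct([10, 0, -50], discount_rate=0.8)
# Compare with the values worked out by hand, allowing for float rounding.
print(np.allclose(result, [-22, -40, -50]))
```

This direct form does more work than the backward pass in discount_rewards(), so it is only useful as a sanity check, not as a replacement.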
The function discount_and_normalize_rewards takes all_rewards and discount_rate as input arguments and returns the normalized discounted rewards for each episode.
def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                              for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]
To normalize the discounted rewards across all episodes, we compute the mean and standard deviation of all the discounted rewards pooled together, then subtract that mean from each discounted reward and divide by that standard deviation.
discount_and_normalize_rewards([[10, 0, -50], [10, 20]], discount_rate=0.8) returns
[array([-0.28435071, -0.86597718, -1.18910299]),
array([1.26665318, 1.0727777 ])].
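To see where these numbers come from, the sketch below retraces the normalization by hand for the two example episodes. It assumes the discounted rewards are already known ([-22, -40, -50] for the first episode and [26, 20] for the second, with a discount rate of 0.8); the variable names are illustrative only.

```python
import numpy as np

# Discounted rewards of the two example episodes (discount rate 0.8).
episode_1 = np.array([-22.0, -40.0, -50.0])
episode_2 = np.array([26.0, 20.0])

# Pool every step of every episode before computing the statistics.
flat = np.concatenate([episode_1, episode_2])
mean = flat.mean()   # -13.2
std = flat.std()     # population standard deviation, about 30.95

normalized = [(episode - mean) / std for episode in (episode_1, episode_2)]

# After normalization the pooled values have mean 0 and standard deviation 1
# (up to floating-point rounding).
pooled = np.concatenate(normalized)
print(mean, std)
print(pooled.mean(), pooled.std())
```

Because the statistics are pooled across episodes, every step of the first episode ends up below zero and every step of the second episode ends up above zero, matching the signs in the output above.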