 # Defining the discount function and normalizing function

• In the policy gradients approach, the algorithm uses the model to play several episodes (e.g., 10), then goes back over all the rewards, discounts them, and normalizes them.

• So let's create two functions for that: the first, `discount_rewards()`, computes the discounted rewards for one episode; the second, `discount_and_normalize_rewards()`, normalizes the discounted rewards across many episodes.

INSTRUCTIONS
• Define the `discount_rewards` function, which takes `rewards` and `discount_rate` as input arguments and returns `discounted`.

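``````
import numpy as np

def discount_rewards(rewards, discount_rate):
    # Copy the rewards into an array (integer rewards give an integer array)
    discounted = np.array(rewards)
    # Walk backward so each step adds the discounted return of the next step
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted
``````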

This function is called by the `discount_and_normalize_rewards` function.

Say there were 3 actions, and after each action there was a reward: first 10, then 0, then -50. If we use a discount factor of 80%, then the 3rd action will get -50 (full credit for the last reward), but the 2nd action will only get -40 (80% credit for the last reward), and the 1st action will get 80% of -40 (-32) plus full credit for the first reward (+10), which leads to a discounted reward of -22:

For example, `discount_rewards([10, 0, -50], discount_rate=0.8)` returns `array([-22, -40, -50])`.
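The backward loop should give the same result as summing every later reward weighted by a power of the discount rate. As a rough sanity check, the sketch below recomputes the example that way. `explicit_discounted_return` is a hypothetical helper introduced only for this comparison; it is not part of the exercise.

```python
import numpy as np

def discount_rewards(rewards, discount_rate):
    # Same backward recursion as in the exercise
    discounted = np.array(rewards)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted

def explicit_discounted_return(rewards, discount_rate, step):
    # Hypothetical helper: direct sum of the rewards from `step` onward,
    # each weighted by discount_rate ** (number of steps into the future)
    return sum(r * discount_rate ** k for k, r in enumerate(rewards[step:]))

rewards = [10, 0, -50]
recursive = discount_rewards(rewards, 0.8)
explicit = [explicit_discounted_return(rewards, 0.8, t)
            for t in range(len(rewards))]

# The two approaches should agree up to floating-point rounding
print(np.allclose(recursive, explicit))
```

The recursion is preferable in practice because it visits each reward once, whereas the direct sum repeats work for every step.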

• Define the `discount_and_normalize_rewards` function, which takes `all_rewards` and `discount_rate` as input arguments and returns the normalized discounted rewards for each episode.

``````
def discount_and_normalize_rewards(all_rewards, discount_rate):
    # Discount each episode's rewards separately
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                              for rewards in all_rewards]
    # Mean and standard deviation are taken over all episodes together
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]
``````

To normalize the discounted rewards across all episodes, we compute the mean and standard deviation of all the discounted rewards, then subtract the mean from each discounted reward and divide the result by the standard deviation.

For example, `discount_and_normalize_rewards([[10, 0, -50], [10, 20]], discount_rate=0.8)` returns `[array([-0.28435071, -0.86597718, -1.18910299]), array([1.26665318, 1.0727777 ])]`.
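One way to check the normalization is to flatten the returned arrays and confirm that their overall mean is close to 0 and their standard deviation is close to 1. The following is a minimal sketch that repeats both functions so it can run on its own; the variable names `normalized` and `flat` are only illustrative.

```python
import numpy as np

def discount_rewards(rewards, discount_rate):
    discounted = np.array(rewards)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                              for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]

normalized = discount_and_normalize_rewards([[10, 0, -50], [10, 20]],
                                            discount_rate=0.8)

# Statistics are computed over all episodes together, not per episode
flat = np.concatenate(normalized)
print(np.isclose(flat.mean(), 0.0), np.isclose(flat.std(), 1.0))

# The per-episode structure is preserved: one array per input episode
print([len(episode) for episode in normalized])
```

Note that an individual episode will generally not have zero mean on its own. In this example the first episode's values are all negative and the second's are all positive, which is what lets the algorithm tell better-than-average actions from worse ones.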
