In the Policy Gradients approach, the algorithm uses the model to play the episode several times (e.g., 10 times), then goes back over all the rewards, discounts them, and normalizes them.
So let's create the functions for that: the first will compute the discounted rewards (discount_rewards()); the second will normalize the discounted rewards across many episodes (discount_and_normalize_rewards()).
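Before defining them, here is a minimal, self-contained sketch of the reward-collection step that feeds these functions. The play_one_episode() helper is a hypothetical stand-in that just produces random rewards; in the real exercise, the rewards would come from playing the environment with the model:
import numpy as np

rng = np.random.default_rng(seed=42)

def play_one_episode(rng, max_steps=5):
    # Hypothetical stand-in for an environment rollout: returns a random
    # sequence of rewards, one per step.
    n_steps = int(rng.integers(1, max_steps + 1))
    return rng.normal(size=n_steps).tolist()

# Play the episode several times (e.g., 10 times) and collect the rewards.
all_rewards = [play_one_episode(rng) for _ in range(10)]
# all_rewards is a list of per-episode reward lists: exactly the structure
# that discount_and_normalize_rewards() below expects.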
Define the discount_rewards() function, which takes rewards and discount_rate as input arguments and returns the discounted array. It works backward from the last step, applying discounted[step] = rewards[step] + discount_rate * discounted[step + 1] at each step.
import numpy as np

def discount_rewards(rewards, discount_rate):
    discounted = np.array(rewards)
    # Walk backward from the second-to-last step, adding each step's
    # discounted future reward to its immediate reward.
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted
This function will be called by the discount_and_normalize_rewards() function.
Say there were 3 actions, and after each action there was a reward: first 10, then 0, then -50. If we use a discount factor of 80%, then the 3rd action will get -50 (full credit for the last reward), but the 2nd action will only get -40 (80% credit for the last reward), and the 1st action will get 80% of -40 (-32) plus full credit for the first reward (+10), which leads to a discounted reward of -22:
For example, discount_rewards([10, 0, -50], discount_rate=0.8) returns array([-22, -40, -50]).
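As a quick sanity check, the snippet below (which assumes the discount_rewards() function defined above) reproduces that result:
print(discount_rewards([10, 0, -50], discount_rate=0.8))
# prints: [-22 -40 -50]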
Define the discount_and_normalize_rewards() function, which takes all_rewards and discount_rate as input arguments and returns the normalized values of the discounted rewards, one array per episode.
def discount_and_normalize_rewards(all_rewards, discount_rate):
    # Discount each episode's rewards independently.
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                              for rewards in all_rewards]
    # Compute the mean and standard deviation over all episodes.
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    # Standardize each episode's discounted rewards with the global statistics.
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]
To normalize the discounted rewards across all episodes, we compute the mean and standard deviation of all the discounted rewards, then subtract the mean from each discounted reward and divide by the standard deviation.
For example, discount_and_normalize_rewards([[10, 0, -50], [10, 20]], discount_rate=0.8) returns [array([-0.28435071, -0.86597718, -1.18910299]), array([1.26665318, 1.0727777 ])].
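Again as a quick check, the snippet below (assuming both functions defined above) reproduces that result; note that the second episode's rewards [10, 20] first discount to [26, 20] before normalization:
all_rewards = [[10, 0, -50], [10, 20]]
print(discount_and_normalize_rewards(all_rewards, discount_rate=0.8))
# prints the two normalized arrays:
# [array([-0.28435071, -0.86597718, -1.18910299]),
#  array([1.26665318, 1.0727777 ])]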