Project - Reinforcement Learning - How to make a computer learn to play the CartPole game

Defining the discount function and normalizing function

  • In the Policy Gradients approach, the algorithm uses the model to play several episodes (e.g., 10), then goes back over all the rewards, discounts them, and normalizes them.

  • So let's create two functions for that: the first, discount_rewards(), computes the discounted rewards for one episode; the second, discount_and_normalize_rewards(), normalizes the discounted rewards across many episodes.

  • Define the discount_rewards function, which takes rewards and discount_rate as input arguments and returns discounted, the array of discounted rewards.

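    def discount_rewards(rewards, discount_rate):
        # Note: an integer rewards list gives an integer array, so any
        # fractional discounted value is truncated when it is stored.
        discounted = np.array(rewards)
        # Walk backwards from the second-to-last step, adding the
        # discounted return of the following step to each reward.
        for step in range(len(rewards) - 2, -1, -1):
            discounted[step] += discounted[step + 1] * discount_rate
        return discounted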

    This function will be called by the discount_and_normalize_rewards function.

    Say there were 3 actions, each followed by a reward: first 10, then 0, then -50. With a discount rate of 0.8, the 3rd action gets -50 (full credit for the last reward), the 2nd action gets -40 (its own reward of 0 plus 80% of -50), and the 1st action gets -22 (its own reward of +10 plus 80% of -40, which is -32).

    For example, discount_rewards([10, 0, -50], discount_rate=0.8) returns array([-22, -40, -50]).

  • Define the discount_and_normalize_rewards function, which takes all_rewards and discount_rate as input arguments and returns the normalized discounted rewards for each episode.

    def discount_and_normalize_rewards(all_rewards, discount_rate):
        # Discount each episode's rewards separately.
        all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                                  for rewards in all_rewards]
        # Compute the mean and standard deviation over all episodes combined.
        flat_rewards = np.concatenate(all_discounted_rewards)
        reward_mean = flat_rewards.mean()
        reward_std = flat_rewards.std()
        # Normalize each episode with the shared statistics.
        return [(discounted_rewards - reward_mean) / reward_std
                for discounted_rewards in all_discounted_rewards]

    To normalize the discounted rewards across all episodes, we compute the mean and standard deviation of all the discounted rewards taken together, then subtract that mean from each discounted reward and divide the result by that standard deviation.

    For example, discount_and_normalize_rewards([[10, 0, -50], [10, 20]], discount_rate=0.8) returns [array([-0.28435071, -0.86597718, -1.18910299]), array([1.26665318, 1.0727777 ])].
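The two functions above can be exercised with a small, self-contained sketch. This is not the course's reference solution: the names discount_rewards_float and discount_and_normalize_rewards_safe, the dtype=float conversion, and the eps constant are assumptions introduced here for illustration. The sketch keeps fractional discounted rewards (which an integer array would truncate) and avoids a division by zero when every discounted reward is identical.

```python
import numpy as np


def discount_rewards_float(rewards, discount_rate):
    """Hypothetical float variant of discount_rewards (illustrative name)."""
    # dtype=float is an assumption: it keeps fractional discounted rewards
    # that an integer array would truncate on assignment.
    discounted = np.array(rewards, dtype=float)
    # Walk backwards so each step adds the discounted return of the next step.
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted


def discount_and_normalize_rewards_safe(all_rewards, discount_rate, eps=1e-8):
    """Hypothetical variant that tolerates a zero standard deviation."""
    all_discounted = [discount_rewards_float(rewards, discount_rate)
                      for rewards in all_rewards]
    flat = np.concatenate(all_discounted)
    mean, std = flat.mean(), flat.std()
    # eps is an assumed small constant, not part of the original function.
    return [(discounted - mean) / (std + eps) for discounted in all_discounted]


# Rewards of 1 with a discount rate of 0.5 produce fractional values.
example = discount_rewards_float([1, 1, 1], discount_rate=0.5)
print(example)  # values 1.75, 1.5, 1.0

# After normalization, the rewards pooled over all episodes should have
# a mean close to 0 and a standard deviation close to 1.
normalized = discount_and_normalize_rewards_safe(
    [[10, 0, -50], [10, 20]], discount_rate=0.8)
flat = np.concatenate(normalized)
print(flat.mean(), flat.std())
```

Pooling the statistics across episodes, rather than normalizing each episode on its own, means that actions from an episode that went better than average end up with positive values, and actions from a worse episode end up with negative ones.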
