Project - Reinforcement Learning - How to make a computer learn to play the CartPole game

Defining play_one_step function

  • Let's start by creating a function that plays a single step using the model. For now, we pretend that whatever action it takes is the right one, so that we can compute the loss and its gradients. We just save these gradients for now and modify them later, depending on how good or bad the action turned out to be.
  • Define the play_one_step function, which takes env, obs, model, and loss_fn as input arguments and returns obs, reward, and done (the results of the step) along with grads.

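    import numpy as np
    import tensorflow as tf

    def play_one_step(env, obs, model, loss_fn):
        with tf.GradientTape() as tape:
            # Predicted probability of going left, shape (1, 1)
            left_proba = model(obs[np.newaxis])
            # Sample the action: False (0) = left, True (1) = right
            action = (tf.random.uniform([1, 1]) > left_proba)
            # Pretend the sampled action was correct: target is 1 for left, 0 for right
            y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
            loss = tf.reduce_mean(loss_fn(y_target, left_proba))
        # Gradients are saved now and modified later, once we know how good the action was
        grads = tape.gradient(loss, model.trainable_variables)
        # Classic Gym API: step() returns (obs, reward, done, info)
        obs, reward, done, info = env.step(int(action[0, 0].numpy()))
        return obs, reward, done, grads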

    In the above,

    • We get the probability of going left from the model, given the observation passed to the function.

    • We then sample the action by comparing that probability with a random value (a tensor of shape (1, 1)), and compute y_target and the loss. Notice that if left_proba is high, action will most likely be False, since a random number sampled uniformly between 0 and 1 will probably not be greater than left_proba. False becomes 0 when cast to a number, so y_target is 1 - 0 = 1. In other words, we set the target to 1, pretending that the probability of going left should have been 100% (so we took the right action). Conversely, if action is True (go right), y_target is 1 - 1 = 0. In short, whichever action is sampled, we pretend it was the correct one by building y_target from it, so that minimizing the loss between y_target and left_proba would make that action more likely.

    • Then we compute the gradients and make the agent take the next step, which returns obs, reward, done, and info.

    • Finally, we return obs, reward, done, and grads.
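
The sampling and target logic described above can be tried in isolation. The following is a minimal sketch in plain Python (no TensorFlow), assuming a scalar left_proba in place of the (1, 1) tensor; the helper name sample_action_and_target and the fixed seed are illustrative and not part of the original function.

```python
import random

def sample_action_and_target(left_proba, rng):
    """Mirror the sampling logic of play_one_step for a scalar probability."""
    # True (1) means "right", False (0) means "left"
    action = rng.random() > left_proba
    # Pretend the sampled action was correct: 1.0 for left, 0.0 for right
    y_target = 1.0 - float(action)
    return int(action), y_target

rng = random.Random(42)  # fixed seed, only for reproducibility
samples = [sample_action_and_target(0.9, rng) for _ in range(10_000)]
left_fraction = sum(1 for action, _ in samples if action == 0) / len(samples)
print(f"fraction of left moves: {left_fraction:.3f}")  # close to 0.9
```

With left_proba set to 0.9, roughly nine out of ten sampled actions are 0 (left), and in every case y_target equals 1 minus the action.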

    This function will be called in the play_multiple_episodes function.
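
To see what the saved gradients encode, here is a small numerical check. It assumes (this is not stated above) that the model ends in a single sigmoid unit and that loss_fn is binary cross-entropy. Under those assumptions, the derivative of the loss with respect to the pre-activation value z is left_proba - y_target, so a gradient descent step would raise the probability of whichever action was sampled. All helper names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y_target, p):
    """Binary cross-entropy for a single prediction p in (0, 1)."""
    return -(y_target * math.log(p) + (1.0 - y_target) * math.log(1.0 - p))

def loss_grad_wrt_logit(y_target, z, eps=1e-6):
    """Central finite-difference estimate of d(loss)/dz."""
    return (bce(y_target, sigmoid(z + eps)) - bce(y_target, sigmoid(z - eps))) / (2 * eps)

z = 0.5                      # hypothetical pre-activation value (assumption)
left_proba = sigmoid(z)

grad_if_left = loss_grad_wrt_logit(1.0, z)   # sampled action was "left"
grad_if_right = loss_grad_wrt_logit(0.0, z)  # sampled action was "right"

# Negative gradient: a descent step increases z, so left_proba goes up.
print(grad_if_left, left_proba - 1.0)
# Positive gradient: a descent step decreases z, so left_proba goes down.
print(grad_if_right, left_proba - 0.0)
```

This is why the gradients are only saved at this stage: applied unchanged, they would reinforce every sampled action, so they are modified later depending on how good or bad the action turned out to be.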
