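Let us define the play_one_step function, which takes env, obs, model, and loss_fn as input arguments. It returns obs, reward, and done, which are the values returned by taking one step in the environment, along with grads, the gradients of the loss with respect to the model's trainable variables.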
```python
def play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])
        action = (tf.random.uniform([1, 1]) > left_proba)
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, done, grads
```
In the above,
we first get left_proba, the probability of going left, by calling the model on the observation passed to the function.
Then we sample the action by comparing a random value (a tensor of shape (1, 1)) with the predicted probability, and compute y_target and the loss. Notice that if left_proba is high, then action will most likely be False (since a random number uniformly sampled between 0 and 1 will probably not be greater than left_proba). False means 0 when you cast it to a number, so y_target would be equal to 1 - 0 = 1. In other words, we set the target to 1, meaning we pretend that the probability of going left should have been 100% (that is, we assume the action we took was the correct one). In short, whichever action is sampled, we simply pretend that it was the correct one: y_target is 1 if the agent went left and 0 if it went right, so the gradients of the loss between y_target and left_proba push the model toward the action that was actually taken.
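The sampling rule and the target construction described above can be tried out without TensorFlow. The following is a minimal sketch in plain Python; the helper names sample_action and target_for are hypothetical and only mirror the two corresponding lines of play_one_step for a single scalar probability.

```python
import random

def sample_action(left_proba, rng):
    """Return True (go right) only if a uniform draw in [0, 1) exceeds left_proba."""
    return rng.random() > left_proba

def target_for(action):
    """y_target = 1 - action: 1.0 if the agent went left (False), 0.0 if it went right (True)."""
    return 1.0 - float(action)

rng = random.Random(42)   # fixed seed so the run is repeatable
left_proba = 0.9          # assumed model output, chosen only for illustration

# Sample many actions and measure how often the agent goes left.
actions = [sample_action(left_proba, rng) for _ in range(10_000)]
frac_left = actions.count(False) / len(actions)

print(f"fraction of left moves: {frac_left:.3f}")  # should be close to left_proba
print(target_for(False), target_for(True))         # prints "1.0 0.0"
```

With left_proba set to 0.9, roughly nine out of ten sampled actions are False (left), and each of those gets a target of 1.0, which matches the reasoning above.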
Then we calculate the gradients of the loss with respect to the model's trainable variables and make the agent take the next step, which returns
obs, reward, done, info.
Finally, we return
obs, reward, done, grads.
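To see why these gradients favor the sampled action, here is a small worked sketch. It assumes that loss_fn is binary cross-entropy, which the text above does not state, and it differentiates with respect to the probability itself rather than the model weights, so it is a simplification of what tape.gradient computes. The names bce and bce_grad are hypothetical helpers.

```python
import math

def bce(y_target, p):
    """Binary cross-entropy between a 0/1 target and a probability p in (0, 1)."""
    return -(y_target * math.log(p) + (1.0 - y_target) * math.log(1.0 - p))

def bce_grad(y_target, p):
    """Derivative of bce with respect to p."""
    return -(y_target / p) + (1.0 - y_target) / (1.0 - p)

p = 0.7              # assumed predicted probability of going left
learning_rate = 0.1  # illustrative step size

for y_target in (1.0, 0.0):
    grad = bce_grad(y_target, p)
    new_p = p - learning_rate * grad  # one plain gradient-descent step on p
    print(f"y_target={y_target}: loss={bce(y_target, p):.3f}, "
          f"grad={grad:.3f}, p moves from {p} to {new_p:.3f}")
```

When y_target is 1 (the agent went left) the gradient is negative, so a descent step raises the probability of going left; when y_target is 0 (the agent went right) the gradient is positive and the step lowers it.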
This function will be called in the