Let us define the play_one_step function, which takes env, obs, model, and loss_fn as input arguments and returns obs, reward, and done (the values produced by taking one step in the environment), along with grads.
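import numpy as np
import tensorflow as tf

def play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        # Probability of going left (action 0) for this single observation
        left_proba = model(obs[np.newaxis])
        # True (go right, action 1) if the random draw exceeds left_proba
        action = (tf.random.uniform([1, 1]) > left_proba)
        # Target probability of going left: 1 if we went left, 0 if we went right
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    # Gradients of the loss with respect to the model's trainable variables
    grads = tape.gradient(loss, model.trainable_variables)
    # Take the sampled action in the environment
    obs, reward, done, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, done, grads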
In the above,
we get the probability of going left by passing the observation to the model.
Then we sample the action by comparing that predicted probability with a random value (a tensor of shape (1, 1)), and compute y_target and the loss. Notice that if left_proba is high, then action will most likely be False (since a random number uniformly sampled between 0 and 1 will probably not be greater than left_proba). And False means 0 when you cast it to a number, so y_target would be equal to 1 - 0 = 1. In other words, we set the target to 1, meaning we pretend that the probability of going left should have been 100% (so we took the correct action). Conversely, if action is True (go right), y_target is 1 - 1 = 0. In short, whichever action is sampled, we pretend it was the correct one: y_target is set to the left probability that would have made that action certain, so minimizing the loss between y_target and left_proba makes the sampled action more likely.
Then we calculate the gradients of the loss with respect to the model's trainable variables and make the agent take the next step, which returns obs, reward, done, and info.
Finally, we return obs, reward, done, and grads.
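The sampling and target logic described above can be tried out with plain Python numbers. This is only a minimal sketch, not the exercise's code: sample_action_and_target and binary_cross_entropy are hypothetical helpers, scalars stand in for the (1, 1) tensors, and binary cross-entropy is assumed as the loss, since the text above does not say which loss_fn is used.

```python
import math
import random


def sample_action_and_target(left_proba, rng):
    """Scalar stand-in for the sampling step inside play_one_step."""
    # True means "go right" (action 1); False means "go left" (action 0).
    action = rng.random() > left_proba
    # Target probability of going left: 1.0 if we went left, 0.0 if we went right.
    y_target = 1.0 - float(action)
    return action, y_target


def binary_cross_entropy(y_target, proba):
    """Binary cross-entropy for one prediction (an assumed choice of loss)."""
    return -(y_target * math.log(proba) + (1.0 - y_target) * math.log(1.0 - proba))


rng = random.Random(0)
left_proba = 0.9

# With left_proba = 0.9, about 90% of the sampled actions should be "left".
samples = [sample_action_and_target(left_proba, rng) for _ in range(10_000)]
frac_left = sum(1 for action, _ in samples if not action) / len(samples)
print(f"fraction of left actions: {frac_left:.3f}")

# The loss is small when the sampled action matches the model's preference
# and large when it does not.
loss_if_left = binary_cross_entropy(1.0, left_proba)
loss_if_right = binary_cross_entropy(0.0, left_proba)
print(f"loss when left was sampled:  {loss_if_left:.3f}")
print(f"loss when right was sampled: {loss_if_right:.3f}")
```

Under these assumptions, the target always points at whichever action was sampled, so a gradient step on this loss would nudge left_proba toward that action.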
This function will be called in the play_multiple_episodes function.
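The play_multiple_episodes function is not shown in this section, so the following is only a guess at the calling pattern for a single episode. Everything here is hypothetical: play_one_episode, CountdownEnv, and fake_step are illustrative stand-ins, and the environment is assumed to follow the same older Gym-style interface as the code above (reset returns an observation, step returns four values).

```python
def play_one_episode(env, model, loss_fn, step_fn, n_max_steps):
    """Hypothetical helper: run one episode, collecting rewards and gradients."""
    rewards, all_grads = [], []
    obs = env.reset()
    for _ in range(n_max_steps):
        # step_fn has the same signature and return values as play_one_step.
        obs, reward, done, grads = step_fn(env, obs, model, loss_fn)
        rewards.append(reward)
        all_grads.append(grads)
        if done:
            break
    return rewards, all_grads


class CountdownEnv:
    """Toy stand-in environment that ends after a fixed number of steps."""

    def __init__(self, n_steps):
        self.n_steps = n_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.n_steps, {}


def fake_step(env, obs, model, loss_fn):
    """Stand-in for play_one_step that skips the model and returns dummy grads."""
    obs, reward, done, info = env.step(0)
    return obs, reward, done, ["placeholder-grads"]


rewards, grads = play_one_episode(CountdownEnv(3), None, None, fake_step, n_max_steps=10)
print(rewards)     # [1.0, 1.0, 1.0]
print(len(grads))  # 3
```

In real use, play_one_step would be passed as step_fn together with an actual environment, model, and loss function. The rewards and gradients are kept per step because the gradients returned by play_one_step are not applied inside it.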