# Defining the `play_one_step` function

• Let's start by creating a function that plays a single step using the model. For now, we pretend that whatever action it takes is the right one, so that we can compute the loss and its gradients. We just save these gradients for now, and modify them later depending on how good or bad the action turned out to be.
INSTRUCTIONS
• Define the `play_one_step` function, which takes `env`, `obs`, `model`, and `loss_fn` as input arguments and returns `obs`, `reward`, and `done` (the results of taking one step in the environment) along with `grads`.

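``````
import numpy as np
import tensorflow as tf

def play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])
        action = (tf.random.uniform([1, 1]) > left_proba)
        # Pretend the sampled action (False/0 = left, True/1 = right) was the right one.
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    grads = tape.gradient(loss, model.trainable_variables)  # saved for later, not applied yet
    obs, reward, done, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, done, grads
``````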

In the above,

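• we call the model on the observation passed to the function to get `left_proba`, the predicted probability of going left. `obs[np.newaxis]` adds a batch dimension, so the model receives a batch containing a single observation.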

• then we sample the action by comparing `left_proba` with a random number drawn uniformly between 0 and 1 (a tensor of shape (1, 1)), and compute `y_target` and the loss. Notice that if `left_proba` is high, then `action` will most likely be `False` (since a random number uniformly sampled between 0 and 1 will probably not be greater than `left_proba`). `False` becomes 0 when cast to a number, so `y_target` is equal to 1 - 0 = 1. In other words, we set the target to 1, meaning we pretend that the probability of going left should have been 100% (so the action we took was the correct one). Conversely, if `action` is `True`, then `y_target` is 1 - 1 = 0. In short, whichever action is sampled, we pretend it was the correct one by setting `y_target` to match it, so a gradient descent step using the resulting gradients would make the sampled action more likely.

• Then we compute the gradients of the loss with respect to the model's trainable variables and make the agent take the step in the environment, which returns `obs, reward, done, info`.

• Finally, we return `obs, reward, done, grads`.
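
The sampling and target logic described above can be tried out without TensorFlow. The following is only a minimal sketch in plain Python: `sample_action_and_target` and `binary_crossentropy` are hypothetical helpers written for this illustration, and it assumes `loss_fn` is a binary cross-entropy, which this section does not state.

```python
import math
import random


def sample_action_and_target(left_proba, rng):
    """Hypothetical plain-float version of the sampling step in play_one_step."""
    # action is True (1, go right) when the uniform draw exceeds left_proba
    action = rng.random() > left_proba
    # Pretend the sampled action was correct: the target for P(left) is 1 - action
    y_target = 1.0 - float(action)
    return int(action), y_target


def binary_crossentropy(y_target, proba):
    """Binary cross-entropy for one prediction (assumed choice of loss_fn)."""
    return -(y_target * math.log(proba) + (1.0 - y_target) * math.log(1.0 - proba))


rng = random.Random(42)  # fixed seed so the run is repeatable
left_proba = 0.9

samples = [sample_action_and_target(left_proba, rng) for _ in range(10_000)]
left_fraction = sum(1 for action, _ in samples if action == 0) / len(samples)

loss_if_left = binary_crossentropy(1.0, left_proba)   # sampled action was left
loss_if_right = binary_crossentropy(0.0, left_proba)  # sampled action was right

print(f"fraction of left actions: {left_fraction:.3f}")
print(f"loss when left is sampled:  {loss_if_left:.3f}")
print(f"loss when right is sampled: {loss_if_right:.3f}")
```

With `left_proba = 0.9`, the left action is sampled about nine times out of ten, and the loss is small in that case. When the less likely right action is sampled, the target becomes 0 and the loss is larger, so the gradients push harder toward the action that was actually taken.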

This function will be called in the `play_multiple_episodes` function.
