What is Reinforcement Learning?

Reinforcement Learning

Reinforcement learning is a special kind of machine learning.
Reinforcement learning has evolved a lot in the gaming industry. It is more commonly used in training computers to play games.
There are certain terms in a reinforcement learning setting:
- agent: the software component or algorithm which learns the best possible way of achieving the task. It performs an action to get some reward. The agent learns when to take which action, with the aim of maximizing its rewards.
- environment: the problem or scenario the agent is expected to face and solve.
- state: the scenario returned by the environment.
- rewards: we can think of this as a gift or encouragement the agent gets for making a desirable step.
- punishment: as the name suggests, it is like a penalty the agent gets for making an undesirable step.
- policy: the policy is the rule(or strategy) that the agent applies to determine the action to perform in the current state. Upon applying the action, the agent is returned with the next state and reward by the environment.
  
  Mathematically,
  
  The goal is to enhance the policy so as to maximize the total rewards.

Intuitive Example

Let us say we have a dog and we want to train it to perform some tasks.
It is clear that we humans don't know the language of the dog, and the dog too doesn't understand human language. So how do we train it to the task?
A simple method one could try is to incentivize the desirable actions and penalize the undesirable actions performed by the dog. For example, in a given situation, we may give it a biscuit if it performed a desirable action. Likewise, we may punish it for performing an undesirable action.
In this way, the dog learns what is right and wrong, how to behave in different situations, etc.
So in reinforcement learning,
- the dog represents the agent that faces the situation(here environment)
- biscuits are like rewards
- the dog's behavior is the set of actions it takes in different situations(here states). The actions are chosen based on the policy it learns, expecting to maximize the rewards.