Now, let's start preprocessing the categorical attributes. Machine learning algorithms generally prefer to work with numbers rather than text, so let's convert the categories of the attribute ocean_proximity from text to numbers. But it is not as simple as it sounds.
As we know, there are 5 categories in ocean_proximity, i.e., ('<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'). We could assign a number from 1 to 5 to each category, but that would impose an order on the attribute: the category assigned 5 would be treated as having a greater value than the category assigned 1. It would also imply that two nearby values (say, categories 1 and 2) are more similar than two distant values (say, categories 1 and 4). None of this makes sense, because the categories of ocean_proximity are independent of each other, so no two values can be ranked or compared this way. But how do we tell this to a machine learning algorithm?
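To make the problem concrete, here is a minimal sketch in plain Python. The 1 to 5 labels follow the order in which the categories are listed above; the names `label` and `implied_distance` are made up for this illustration and are not part of any library.

```python
# Hypothetical integer labels (1 to 5), assigned in the order listed above.
categories = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]
label = {name: i for i, name in enumerate(categories, start=1)}

def implied_distance(a, b):
    """Gap between two integer labels, as a numeric model would see it."""
    return abs(label[a] - label[b])

# The labels introduce an ordering that the categories do not have.
print(label["NEAR OCEAN"] > label["<1H OCEAN"])

# They also make some pairs look "closer" than others.
print(implied_distance("<1H OCEAN", "INLAND"))    # categories 1 and 2
print(implied_distance("<1H OCEAN", "NEAR BAY"))  # categories 1 and 4
```

The second gap is three times the first, even though nothing about the data suggests that 'INLAND' is closer to '<1H OCEAN' than 'NEAR BAY' is.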
One way is to create one binary attribute per category. As we have 5 categories, there will be 5 attributes, each of which can contain either 1 or 0. For example, one attribute equals 1 when the category is '<1H OCEAN', and the other four attributes equal 0. The other four categories are represented in the same way. This way of encoding categorical variables is called one-hot encoding.
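For example, the categories can be encoded as:
'<1H OCEAN'  = [1, 0, 0, 0, 0]
'INLAND'     = [0, 1, 0, 0, 0]
'ISLAND'     = [0, 0, 1, 0, 0]
'NEAR BAY'   = [0, 0, 0, 1, 0]
'NEAR OCEAN' = [0, 0, 0, 0, 1]
where each vector holds the values of the 5 attributes.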
So, when the ocean_proximity value of a district is '<1H OCEAN', the first attribute will be 1 and the other four will be 0. In the same way, for 'NEAR BAY', the fourth attribute will be 1 and the others will be 0.
This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called dummy attributes.
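The encoding described above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `one_hot` and `differing_positions` and the sample district values are made up here, and the attribute order is assumed to follow the category list given earlier. In practice, libraries such as scikit-learn (OneHotEncoder) and pandas (get_dummies) provide this encoding.

```python
categories = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]

def one_hot(value, categories=categories):
    """Return a list with a single 1 at the position of `value`, 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

def differing_positions(u, v):
    """Count the positions at which two encoded vectors differ."""
    return sum(a != b for a, b in zip(u, v))

# Made-up sample of ocean_proximity values for three districts.
districts = ["NEAR BAY", "<1H OCEAN", "INLAND"]
encoded = [one_hot(d) for d in districts]
for d, row in zip(districts, encoded):
    print(d, row)

# Every pair of distinct categories differs in the same number of positions,
# so no category looks "closer" to another.
gaps = {
    differing_positions(one_hot(a), one_hot(b))
    for a in categories for b in categories if a != b
}
print(gaps)
```

Unlike the 1 to 5 labels, the one-hot vectors put every pair of distinct categories at the same distance from each other, which is what removes the artificial order.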