Handling categorical and text attributes

Now, let's start preprocessing the categorical attributes. Machine learning algorithms generally work on numbers rather than text, so let's convert the categories of the ocean_proximity attribute from text to numbers. But it is not as simple as it sounds.

As we know, ocean_proximity has 5 categories: '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', and 'NEAR OCEAN'. We could assign a number from 1 to 5 to each category, but that would impose an order on the attribute: the category assigned 5 would be treated as larger than the category assigned 1. It would also imply that two nearby values (say, categories 1 and 2) are more similar than two distant values (say, categories 1 and 4). None of this makes sense here, because the categories of ocean_proximity are independent of each other, so their values cannot be meaningfully compared. So how do we tell this to a machine learning algorithm?
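To make the problem with integer codes concrete, here is a minimal sketch in plain Python. The code assignment (1 to 5 in the order the categories are listed) and the helper name `implied_distance` are assumptions made for illustration, not part of the original project code.

```python
# The five ocean_proximity categories named in the text.
categories = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]

# Hypothetical integer codes 1..5, assigned in listing order.
codes = {name: i for i, name in enumerate(categories, start=1)}


def implied_distance(a, b):
    """Numeric gap between two categories under the integer coding."""
    return abs(codes[a] - codes[b])


# A model that sees only the numbers would read these gaps as "similarity".
print(implied_distance("<1H OCEAN", "INLAND"))    # categories 1 and 2
print(implied_distance("<1H OCEAN", "NEAR BAY"))  # categories 1 and 4
```

The second gap is three times the first, even though nothing about the categories themselves justifies saying that '<1H OCEAN' is closer to 'INLAND' than to 'NEAR BAY'.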

One way is to create one binary attribute per category. As we have 5 categories, there will be 5 attributes, each containing either 1 or 0. For example, one attribute equals 1 when the category is '<1H OCEAN' and the other four attributes equal 0. The other four categories are represented in the same way. This way of encoding categorical variables is called one-hot encoding.

For example, the categories can be encoded as:

'<1H OCEAN' - [1,0,0,0,0]
'INLAND' - [0,1,0,0,0]
'ISLAND' - [0,0,1,0,0]
'NEAR BAY' - [0,0,0,1,0]
'NEAR OCEAN' - [0,0,0,0,1]

where each position in a vector corresponds to one of the 5 attributes.

So, when the ocean_proximity value of a district is '<1H OCEAN', the first attribute will be 1 and the other four will be 0. In the same way, for 'NEAR BAY', the fourth attribute will be 1 and the others will be 0.

The encoding gets its name from the fact that only one attribute is equal to 1 (hot), while the others are 0 (cold). The new attributes are sometimes called dummy attributes.
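The encoding described above can be sketched in a few lines of plain Python. The function name `one_hot` and the sample values are hypothetical, chosen only for illustration. The column order follows the category list given earlier.

```python
# The five ocean_proximity categories, in the order used in the text.
categories = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]


def one_hot(value, categories=categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Hypothetical ocean_proximity values for three districts.
sample = ["NEAR BAY", "<1H OCEAN", "INLAND"]
encoded = [one_hot(v) for v in sample]

for value, row in zip(sample, encoded):
    print(value, row)
```

Each row contains exactly one 1, so no category is numerically "larger" than or "closer" to another. In practice you would likely rely on a library for this step, for example scikit-learn's `OneHotEncoder` or pandas' `get_dummies`, rather than a hand-written function.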
