
End to End ML Project - Handling categorical attributes

So far we have only dealt with numerical attributes, but now let's look at text attributes. In this dataset, there is just one: the ocean_proximity attribute. Most Machine Learning algorithms expect numerical input and cannot work with text categories directly, so we will convert this attribute into numerical values using one-hot encoding.

One-hot encoding creates one binary attribute per category: one attribute equal to 1 when the category is <1H OCEAN (and 0 otherwise), another attribute equal to 1 when the category is INLAND (and 0 otherwise), and so on. For each row, exactly one of these attributes is 1 (the "hot" one) and all the others are 0.
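To make the idea concrete, here is a minimal plain-Python sketch of one-hot encoding on a few made-up values. The sample rows and the `one_hot` helper are hypothetical and only illustrate the concept; this is not how scikit-learn implements it, and the category list is shortened for illustration.

```python
# Toy illustration of one-hot encoding (made-up rows, shortened category list).
categories = ["<1H OCEAN", "INLAND", "NEAR BAY"]
samples = ["INLAND", "<1H OCEAN", "NEAR BAY", "INLAND"]


def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 everywhere else."""
    return [1 if value == category else 0 for category in categories]


# One binary attribute (column) per category, one row per sample.
encoded = [one_hot(value, categories) for value in samples]

for value, row in zip(samples, encoded):
    print(f"{value:<10} -> {row}")
```

Each row contains a single 1, in the column that matches its category.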

By default, OneHotEncoder returns a SciPy sparse matrix instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding, we get a matrix with thousands of columns, and the matrix is full of 0s except for a single 1 per row. Using up lots of memory mostly to store zeros would be very wasteful, so a sparse matrix stores only the locations of the nonzero elements.
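The memory argument above can be sketched in plain Python. The sizes and the `hot_columns` list below are hypothetical, and the storage scheme is a conceptual simplification: SciPy's actual sparse formats are more elaborate than a list of positions.

```python
# Conceptual sketch only: compares how many values each storage scheme keeps.
n_categories = 1000  # hypothetical number of categories (columns)

# Hypothetical column index of the single 1 in each of five rows.
hot_columns = [3, 999, 0, 42, 3]

# Dense storage: every cell is stored, including all the zeros.
dense = [
    [1 if col == hot else 0 for col in range(n_categories)]
    for hot in hot_columns
]
dense_cells = sum(len(row) for row in dense)

# Sparse storage: keep only the (row, column) positions of the nonzero entries.
sparse_positions = [(row, hot) for row, hot in enumerate(hot_columns)]

print("cells stored densely:", dense_cells)
print("positions stored sparsely:", len(sparse_positions))
```

With one nonzero entry per row, the sparse representation grows with the number of rows rather than with rows times categories.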

Let's see how it is done.

INSTRUCTIONS

First, we will store the categorical feature in a new variable called housing_cat:
```
<<your code goes here>> = housing[["ocean_proximity"]]
```
Let's look at its first 10 rows using the head method:
```
housing_cat.<<your code goes here>>(10)
```

Now let's import OneHotEncoder from sklearn.preprocessing:
```
from sklearn.preprocessing import <<your code goes here>>
```

Now we will call fit_transform on our categorical data:
```
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.<<your code goes here>>(housing_cat)
housing_cat_1hot
```

Finally, we will convert it to a dense NumPy array using the toarray method:
```
housing_cat_1hot.<<your code goes here>>()
```
