So far we have only dealt with numerical attributes, but now let's look at text attributes. In this dataset, there is just one: the ocean_proximity attribute.
Most Machine Learning algorithms work with numbers rather than text categories, so we will turn this attribute into numerical values using one-hot encoding.
One-hot encoding creates one binary attribute per category: one attribute equal to 1 when the category is <1H OCEAN (and 0 otherwise), another attribute equal to 1 when the category is INLAND (and 0 otherwise), and so on.
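The idea can be sketched in plain Python before reaching for a library. This is only an illustration: the four sample values are made up, and the column order (sorted category names) is an assumption chosen for this sketch.

```python
# Minimal pure-Python sketch of one-hot encoding (illustration only;
# the sample values below are made up for this example).
samples = ["<1H OCEAN", "INLAND", "NEAR BAY", "INLAND"]

# One column per distinct category, in sorted order.
categories = sorted(set(samples))

# Each row gets a 1 in the column of its category and 0 everywhere else.
one_hot = [
    [1 if value == category else 0 for category in categories]
    for value in samples
]

for value, row in zip(samples, one_hot):
    print(f"{value:10} -> {row}")
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.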
Notice that the output of OneHotEncoder is a SciPy sparse matrix instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding, we get a matrix with thousands of columns, and the matrix is full of 0s except for a single 1 per row. Using up tons of memory mostly to store zeros would be very wasteful, so instead a sparse matrix only stores the location of the nonzero elements.
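To get a feel for the saving, here is a rough sketch that builds a one-hot matrix directly as a SciPy CSR matrix and compares the number of stored values with the number of cells a dense array would need. The sizes (1,000 rows, 5,000 categories) and the random category indices are hypothetical and are not taken from the housing dataset.

```python
import numpy as np
from scipy import sparse

# Hypothetical sizes, chosen only to illustrate the idea.
n_rows, n_categories = 1_000, 5_000

rng = np.random.default_rng(0)
# One random category index per row.
codes = rng.integers(0, n_categories, size=n_rows)

# CSR matrix with a single 1 per row, in the column of that row's category.
one_hot = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), codes)),
    shape=(n_rows, n_categories),
)

dense_cells = n_rows * n_categories  # cells a dense array would hold
stored_values = one_hot.nnz          # values the sparse matrix actually stores
print(f"dense cells: {dense_cells}, stored values: {stored_values}")
```

With one nonzero value per row, the sparse matrix stores as many values as there are rows, no matter how many categories (columns) exist.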
Let's see how it is done.
First, we will store the categorical feature in a new variable called housing_cat:
<<your code goes here>> = housing[["ocean_proximity"]]
Let's see what it looks like using the head method:
housing_cat.<<your code goes here>>(10)
Now let's import OneHotEncoder from sklearn.preprocessing:
from sklearn.preprocessing import <<your code goes here>>
Now we will create the encoder and call fit_transform on our categorical data:
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.<<your code goes here>>(housing_cat)
housing_cat_1hot
Finally, we will convert it to a dense NumPy array using the toarray method:
housing_cat_1hot.<<your code goes here>>()
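To see which output column corresponds to which category, it can help to run the same steps on a tiny stand-in DataFrame. The sketch below assumes pandas and scikit-learn are installed; toy_cat and its four values are made up for illustration and are not the real housing data. The categories_ attribute of the fitted encoder lists the learned categories in column order.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for housing[["ocean_proximity"]] (values are made up).
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "<1H OCEAN", "NEAR BAY", "INLAND"]}
)

encoder = OneHotEncoder()
# fit_transform learns the categories and encodes the data in one step;
# by default the result is a SciPy sparse matrix.
toy_1hot = encoder.fit_transform(toy_cat)

# Learned categories, one array per input column, in output-column order.
print(encoder.categories_)

# Dense NumPy array view of the same encoding.
print(toy_1hot.toarray())
```

The columns of the dense array follow the order of encoder.categories_[0], so the first row (INLAND) has its 1 in the second column.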