End-to-End ML Project - California Housing


End-to-End ML Project - Handling Categorical Attributes

So far we have only dealt with numerical attributes, but now let's look at text attributes. In this dataset, there is just one: the ocean_proximity attribute. Most Machine Learning algorithms work with numbers rather than text labels, so we will convert this attribute into numerical values using one-hot encoding.

One-hot encoding creates one binary attribute per category: one attribute equal to 1 when the category is <1H OCEAN (and 0 otherwise), another attribute equal to 1 when the category is INLAND (and 0 otherwise), and so on. For each row, exactly one of these attributes is 1 (hot) and all the others are 0 (cold).
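To make the idea concrete, here is a minimal sketch that builds a one-hot matrix by hand with NumPy. The four sample values are made up for illustration and are not taken from the housing dataset; the exercise below uses scikit-learn's OneHotEncoder instead of this manual approach.

```python
import numpy as np

# Hypothetical sample of category values (illustration only).
values = np.array(["<1H OCEAN", "INLAND", "NEAR BAY", "INLAND"])

# np.unique returns the sorted distinct categories and, for each value,
# the index of its category in that sorted list.
categories, codes = np.unique(values, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing it with the codes yields one one-hot row per sample.
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

Each row contains a single 1, in the column that corresponds to that row's category.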

Note that the output of scikit-learn's OneHotEncoder, which we use below, is by default a SciPy sparse matrix instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding, we get a matrix with thousands of columns, and the matrix is full of 0s except for a single 1 per row. Using up large amounts of memory mostly to store zeros would be very wasteful, so instead a sparse matrix only stores the location of the nonzero elements.
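The memory argument can be illustrated with a small sketch that builds a one-hot matrix directly in SciPy's CSR sparse format. The sizes (1,000 rows, 5,000 categories) and the random category codes are assumptions chosen for illustration only.

```python
import numpy as np
from scipy import sparse

# Hypothetical sizes: 1,000 samples, 5,000 distinct categories.
n_rows, n_categories = 1000, 5000

# Random category index for each row (illustration only).
rng = np.random.default_rng(0)
codes = rng.integers(0, n_categories, size=n_rows)

# Build the one-hot matrix in sparse form: a single 1 per row,
# placed in the column given by that row's category code.
data = np.ones(n_rows)
rows = np.arange(n_rows)
one_hot_sparse = sparse.csr_matrix(
    (data, (rows, codes)), shape=(n_rows, n_categories)
)

stored_values = one_hot_sparse.nnz      # nonzero entries actually stored
dense_cells = n_rows * n_categories     # cells a dense array would hold

print(stored_values, dense_cells)
```

Here the sparse matrix stores only 1,000 nonzero entries, whereas the equivalent dense array would hold 5,000,000 cells, almost all of them zeros.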

Let's see how it is done.

  • First, we will store the categorical feature in a new variable called housing_cat

    <<your code goes here>> = housing[["ocean_proximity"]]
  • Let's see what it looks like using the head method

    housing_cat.<<your code goes here>>(10)
  • Now let's import OneHotEncoder from sklearn

    from sklearn.preprocessing import <<your code goes here>>
  • Now we will call fit_transform on our categorical data, which fits the encoder and encodes the data in one step

    cat_encoder = OneHotEncoder()
    housing_cat_1hot = cat_encoder.<<your code goes here>>(housing_cat)
  • Finally, we will convert it to a dense NumPy array using the toarray method

    housing_cat_1hot.<<your code goes here>>()
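The steps above can be sketched end to end on a tiny stand-in DataFrame. The five rows below are made up for illustration and are not the real housing data; only the column name ocean_proximity and the variable names come from the exercise.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the housing DataFrame (illustration only).
housing = pd.DataFrame(
    {"ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND", "NEAR OCEAN"]}
)

# Double brackets keep the selection as a 2-D DataFrame,
# which is the input shape OneHotEncoder expects.
housing_cat = housing[["ocean_proximity"]]

# Fit the encoder and transform the data in one step.
# The result is a SciPy sparse matrix by default.
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

# Convert to a dense NumPy array for inspection.
dense = housing_cat_1hot.toarray()

# The learned categories give the meaning of each output column.
print(cat_encoder.categories_)
print(dense)
```

The categories_ attribute of the fitted encoder lists the categories in column order, which is handy for checking which column corresponds to which value.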