End-to-End ML Project - California Housing


End-to-End ML Project - Fill in the missing data

When you were exploring the dataset, you must have noticed that some of the features had missing data.

INSTRUCTIONS
  • We will revert to a clean copy of the training set produced by StratifiedShuffleSplit and drop median_house_value, since it is the label we want to predict. Note that drop returns a new DataFrame and leaves strat_train_set unchanged

    housing = strat_train_set.<<your code goes here>>("median_house_value", axis=1)
    
  • Now we will store the labels in the housing_labels variable

    <<your code goes here>> = strat_train_set["median_house_value"].copy()
    
  • Now we will impute the missing values using the SimpleImputer class. First, import the SimpleImputer class from sklearn.impute

    from sklearn.impute import <<your code goes here>>
    
  • Now, we will replace each missing value with the median of that feature. We use the median rather than the mean because the median is robust to outliers: a few extreme values can shift the mean considerably but barely move the median. We will set the strategy parameter to "median" in the SimpleImputer class

    imputer = SimpleImputer(<<your code goes here>>="median")
    
  • Now let's drop the categorical column ocean_proximity, because the median can only be calculated on numerical attributes

    housing_num = housing.drop("ocean_proximity", axis=1)
    
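  • We will fit the imputer on the housing_num dataset. This computes the median of each column and stores the results in the imputer's statistics_ attribute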

    imputer.<<your code goes here>>(housing_num)
    
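  • Now we will use the fitted imputer to transform the training set, replacing the missing values with the learned medians. The result is a plain NumPy array, so we wrap it back into a DataFrame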

    X = imputer.<<your code goes here>>(housing_num)
    housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                              index=housing.index)
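
The imputation steps above can be tried out on a small stand-in dataset before running them on the real training set. The sketch below is only illustrative: the toy DataFrame, its values, and the names toy, toy_num and toy_tr are assumptions made up for this example and are not part of the housing data. It includes one outlier so that the difference between the median and the mean is visible.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the housing training set.
toy = pd.DataFrame(
    {
        "total_bedrooms": [2.0, np.nan, 4.0, 100.0],  # 100.0 is an outlier
        "median_income": [3.0, 5.0, np.nan, 4.0],
        "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
    },
    index=[10, 11, 12, 13],
)

# Keep only the numerical columns; the median is undefined for text.
toy_num = toy.drop("ocean_proximity", axis=1)

# fit() learns one median per column, ignoring the missing entries.
imputer = SimpleImputer(strategy="median")
imputer.fit(toy_num)

# transform() returns a NumPy array with the gaps filled in,
# so we rebuild a DataFrame with the original columns and index.
X = imputer.transform(toy_num)
toy_tr = pd.DataFrame(X, columns=toy_num.columns, index=toy_num.index)

print(imputer.statistics_)             # learned medians, one per column
print(toy_num["total_bedrooms"].mean())  # the mean is pulled up by the outlier
print(toy_tr)
```

Here the learned median for total_bedrooms is 4.0, while the mean of the same column is above 35 because of the single outlier, which is why the median is the safer fill value in this toy case.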
    