When you were exploring the dataset, you must have noticed that some of the features had missing data.
We will revert to the clean training set that we got after using StratifiedShuffleSplit and drop median_house_value, since it is the label that we will predict.
housing = strat_train_set.<<your code goes here>>("median_house_value", axis=1)
Now we will store the labels in the housing_labels variable.
<<your code goes here>> = strat_train_set["median_house_value"].copy()
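The effect of separating predictors from labels can be sketched on a toy DataFrame. The name toy_train_set and its values are made up for illustration and only stand in for strat_train_set.

```python
import pandas as pd

# Hypothetical stand-in for strat_train_set (values are made up).
toy_train_set = pd.DataFrame({
    "total_bedrooms": [120.0, None, 310.0],
    "median_income": [3.2, 5.1, 2.7],
    "median_house_value": [150000.0, 320000.0, 98000.0],
})

# drop() returns a new DataFrame; toy_train_set itself is not modified.
toy_features = toy_train_set.drop("median_house_value", axis=1)

# copy() gives the labels their own data, independent of the training set.
toy_labels = toy_train_set["median_house_value"].copy()

print(list(toy_features.columns))
print("median_house_value" in toy_train_set.columns)
```

Because the original training set keeps its label column, the same split can be repeated later without reloading the data.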
Now we will impute the missing values using the SimpleImputer class. First, import the SimpleImputer class from sklearn.impute.
from sklearn.impute import <<your code goes here>>
Now, for the missing values we will use the median value of each feature. We prefer the median to the mean because the median is much less affected by outliers, which makes it a more robust measure of central tendency for skewed data. We will set the strategy parameter to "median" in the SimpleImputer class.
imputer = SimpleImputer(<<your code goes here>>="median")
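The reason for preferring the median can be illustrated with a short sketch that uses only the Python standard library. The numbers are made up; the last one plays the role of an outlier.

```python
from statistics import mean, median

# Made-up values with one extreme outlier at the end.
values = [2, 3, 3, 4, 100]

print(mean(values))    # pulled upward by the outlier
print(median(values))  # stays at the middle value
```

A single extreme value moves the mean far from the bulk of the data, while the median stays where most values lie.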
Now let's drop the categorical column ocean_proximity, because the median can only be calculated on numerical attributes.
housing_num = housing.drop("ocean_proximity", axis=1)
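As a side note, dropping the column by name is not the only option. The sketch below, on a made-up two-column frame, shows that selecting columns by dtype with the pandas select_dtypes method gives the same result; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one text column.
toy = pd.DataFrame({
    "median_income": [3.2, 5.1],
    "ocean_proximity": ["INLAND", "NEAR BAY"],
})

# Approach used in this exercise: drop the text column by name.
by_name = toy.drop("ocean_proximity", axis=1)

# Alternative: keep only the numeric columns, whatever they are called.
by_dtype = toy.select_dtypes(include=[np.number])

print(list(by_name.columns))
print(list(by_dtype.columns))
```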
We will call fit on the housing_num dataset so that the imputer computes the median of each column.
imputer.<<your code goes here>>(housing_num)
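What fitting learns can be inspected through the imputer's statistics_ attribute. The following is a minimal sketch on made-up data; toy_num and toy_imputer are hypothetical names, not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numerical data with one missing value.
toy_num = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 200.0],
    "median_income": [2.0, 4.0, 6.0, 8.0],
})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy_num)

# statistics_ holds one learned median per column; NaN entries are ignored.
print(toy_imputer.statistics_)

# The same numbers computed directly with pandas, for comparison.
print(toy_num.median().values)
```

Fitting only stores the medians; the data itself is not changed until the imputer is used to transform it.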
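Now we will use transform to replace the missing values in the training set with the learned medians. The result is a plain NumPy array, so we wrap it back into a DataFrame.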
X = imputer.<<your code goes here>>(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)
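The whole fit, transform, and wrap sequence can be sketched end to end on a toy frame. The data and the names toy_num, X_toy, and toy_tr are made up for illustration, and the sketch assumes scikit-learn's default output setting, under which transform returns a NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with a non-default index, to show that it is preserved.
toy_num = pd.DataFrame(
    {"total_bedrooms": [100.0, np.nan, 300.0],
     "median_income": [2.0, 4.0, 9.0]},
    index=[10, 20, 30],
)

toy_imputer = SimpleImputer(strategy="median").fit(toy_num)
X_toy = toy_imputer.transform(toy_num)  # NumPy array: column names and index are lost

# Rebuild a DataFrame with the original column names and row index.
toy_tr = pd.DataFrame(X_toy, columns=toy_num.columns, index=toy_num.index)

print(type(X_toy).__name__)
print(toy_tr)
```

Passing the original index matters because the training set was produced by a shuffled split, so its row labels are not a simple 0 to n-1 range.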