Our dataset is not large enough to justify dropping rows, especially when each affected row is missing only a single attribute. Also, the attribute total_bedrooms contains very few missing values. Hence, we choose the third option, i.e., set the missing values to some value.
We'll set the missing values of the attribute to its median, since the median is much less affected by outliers than the mean. We'll study outliers later in this chapter.
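To see why the median is the safer fill value, here is a small sketch using made-up numbers (they are not taken from our dataset). Adding a single extreme value shifts the mean a lot, while the median barely moves.

```python
import numpy as np

# Hypothetical bedroom counts, invented for illustration only.
values = np.array([2.0, 3.0, 3.0, 4.0, 5.0])

# The same values plus one extreme outlier.
with_outlier = np.append(values, 500.0)

print("mean:  ", np.mean(values), "->", np.mean(with_outlier))
print("median:", np.median(values), "->", np.median(with_outlier))
```

The median only moves from 3.0 to 3.5, whereas the mean jumps from 3.4 to more than 86.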
We'll use the SimpleImputer class from the impute submodule of sklearn for handling missing values. We first have to create an instance of it. Its most important parameter is strategy. We can set it to 'mean' for the mean (which is also the default), 'median' for the median, 'most_frequent' for the mode, and 'constant' for a constant value.
Refer to the SimpleImputer documentation for further details about the class.
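The following sketch compares the four strategy values on a tiny, made-up column with one missing entry. The data and the fill_value of 0.0 are assumptions chosen for illustration. The fit() and transform() calls it uses are explained later in this section.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data; the fourth row is missing.
X = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

filled = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(X)
    # Record the value that replaced the missing entry.
    filled[strategy] = imputer.transform(X)[3, 0]

# strategy="constant" does not learn anything; it uses fill_value.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
constant_imputer.fit(X)
filled["constant"] = constant_imputer.transform(X)[3, 0]

print(filled)
```

Here the mean strategy fills in 3.75, while median and most_frequent both fill in 2.0, because the single large value 10.0 pulls only the mean upward.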
Since the median can only be computed on numerical attributes, we have to drop the text attributes from our dataset. For categorical or text attributes, we generally prefer the mode.
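As a rough sketch of this step, the snippet below builds a small stand-in for the training data and removes its text column. The column name ocean_proximity and all the values are assumptions for illustration; use the actual text columns of your dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training data.
# "ocean_proximity" is an assumed name for the text column.
train_data = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# Drop the text column by name.
housing_num = train_data.drop("ocean_proximity", axis=1)

# Alternative: keep only the numeric columns, whatever their names are.
housing_num_alt = train_data.select_dtypes(include="number")

print(housing_num.columns.tolist())
```

Both approaches give the same result here. select_dtypes() can be handy when there are several text columns and you do not want to list them by hand.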
After that, we fit the instance we created to the training data using the fit() method. Its syntax is:
imputer_name.fit(dataset)
When fit() is called, the imputer simply computes the median of each attribute and stores the result in its statistics_ instance variable. You can access an instance variable of an instance like this:
imputer_name.variable_name
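Here is a minimal sketch of fitting and then inspecting statistics_. The small DataFrame is made up for illustration and only borrows the column names from our dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with one missing value in total_bedrooms.
housing_num = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)

# One learned median per column, in column order.
print(imputer.statistics_)

# pandas skips missing values when computing the median, so this matches.
print(housing_num.median().values)
```

Missing values are ignored when the medians are computed, so the median of total_bedrooms is taken over the three known values.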
We know that only the attribute total_bedrooms contains missing values, but we can never be sure that there won't be any missing values in new data after the system goes live. Hence, we apply the imputer to all the numerical attributes.
After that, we use the trained imputer to transform the training set by replacing the missing values with the learned medians. We use the transform() method for that. It is called the same way as the fit() method.
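The sketch below shows transform() on made-up data. It also illustrates why fitting on all numerical attributes is useful: the hypothetical new_data has a missing value in total_rooms, a column that had none during training, and it is still filled with that column's learned median.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training data; only total_bedrooms has a missing value.
housing_num = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)

# Replace missing values in the training set with the learned medians.
out = imputer.transform(housing_num)

# Hypothetical new data with a missing value in a different column.
new_data = pd.DataFrame({
    "total_rooms": [np.nan],
    "total_bedrooms": [300.0],
})
new_out = imputer.transform(new_data)

print(out)
print(new_out)
```

Note that transform() returns a new array and leaves housing_num itself unchanged.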
There's also a fit_transform() method, which performs both steps in a single call and saves us a little time. I wanted to demonstrate both methods separately, which is why I used fit() and transform() separately. Otherwise, it is advised to use the fit_transform() method.
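As a quick check on made-up data, the sketch below confirms that the one-step and two-step approaches produce the same result.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 40.0],
])

# Two steps: learn the medians, then apply them.
two_step_imputer = SimpleImputer(strategy="median")
two_step_imputer.fit(X)
two_step = two_step_imputer.transform(X)

# One step: learn and apply in a single call.
one_step = SimpleImputer(strategy="median").fit_transform(X)

print(np.array_equal(two_step, one_step))
```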
The transform() method returns a NumPy array, so we convert it back to a pandas DataFrame.
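A minimal sketch of the conversion on made-up data is shown below. Passing index=housing_num.index is an optional addition, assumed here to be desirable: it keeps the original row labels, which would otherwise be reset to 0, 1, 2, and so on.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with a non-default index.
housing_num = pd.DataFrame(
    {
        "total_rooms": [880.0, 7099.0, 1467.0],
        "total_bedrooms": [129.0, np.nan, 190.0],
    },
    index=[10, 20, 30],
)

imputer = SimpleImputer(strategy="median")
out = imputer.fit_transform(housing_num)  # plain NumPy array by default

# Restore the column names and (optionally) the original row labels.
housing_tr = pd.DataFrame(
    out, columns=housing_num.columns, index=housing_num.index
)

print(type(out).__name__)
print(housing_tr)
```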
Import SimpleImputer from sklearn.impute.
Create an instance of the class SimpleImputer with the name imputer. Set the parameter strategy to 'median'.
Drop all the non-numeric attributes from our dataset train_data using the DataFrame.drop() method and store the result in a variable named housing_num.
Fit the imputer on our dataset housing_num using the fit() method.
Call the transform() method on imputer, passing our dataset housing_num as its argument, and store the result in a variable named out.
Finally, run the following command to convert the output back to a pandas DataFrame:
housing_tr = pd.DataFrame(out, columns=housing_num.columns)