End-to-End ML Project- Beginner friendly



Looking at our dataset, we notice that it is not large enough for us to afford dropping rows, especially when each affected row is missing only one attribute. Also, the attribute total_bedrooms has only a small number of missing values, so there is little reason to discard the whole attribute. Hence, we choose the third option, i.e. set the missing values to some value.

We'll set the missing values of the attribute to its median, as the median is far less sensitive to outliers than the mean. We'll study outliers later in this chapter.
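To see why the median is the safer choice, here is a small sketch that compares the mean and the median before and after adding one extreme value. The numbers are made up for illustration and are not taken from our dataset.

```python
import numpy as np

# Hypothetical values; 500.0 plays the role of an outlier.
values = np.array([2.0, 3.0, 3.0, 4.0, 5.0])
with_outlier = np.append(values, 500.0)

# The mean is pulled strongly towards the outlier.
mean_before = values.mean()
mean_after = with_outlier.mean()

# The median only shifts to the next middle position.
median_before = np.median(values)
median_after = np.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean jumps by more than an order of magnitude, while the median moves only from 3.0 to 3.5.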

We'll use the SimpleImputer class from the impute submodule of sklearn to handle missing values. We first have to create an instance of it. Its most important parameter is strategy, which accepts mean for the mean (the default), median for the median, most_frequent for the mode, and constant for a constant value.
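As a quick illustration of the strategy parameter, the sketch below fills the same missing entry with each of the four strategies. The single-column array is a made-up toy example, not our housing data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-column data with one missing value at row 3 (values are made up).
X = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # Record the value that replaced the missing entry.
    filled[strategy] = float(imputer.fit_transform(X)[3, 0])

# strategy="constant" takes its replacement value from fill_value.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
filled["constant"] = float(constant_imputer.fit_transform(X)[3, 0])

print(filled)
```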

Refer to SimpleImputer documentation for further details about the class.

Since the median can only be computed on numerical attributes, we have to drop the text attributes from our dataset before fitting the imputer. For categorical or text attributes, we generally prefer the mode.
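A minimal sketch of separating the numerical attributes from a text attribute is shown below. The small DataFrame and the column name ocean_proximity are assumptions made for illustration; substitute the text columns that your own train_data actually contains.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data; column names and values are hypothetical.
train_data = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0],
    "median_income": [8.3, 7.2, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# Option A: drop the text column by name.
housing_num = train_data.drop("ocean_proximity", axis=1)

# Option B: keep only numeric columns without naming the text ones.
numeric_only = train_data.select_dtypes(include="number")

# For a text attribute, the mode (most frequent value) is the usual fill value.
text_mode = train_data["ocean_proximity"].mode()[0]

print(list(housing_num.columns), text_mode)
```

Both options give the same result here; select_dtypes() is convenient when there are several text columns.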

After that, we fit the instance we created to the training data using the fit() method, passing the data as its argument, for example imputer.fit(housing_num).


When fit() is called, the imputer simply computes the median value of each attribute and stores the result in its statistics_ instance variable. You can access an instance variable through the instance, for example imputer.statistics_.
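The following sketch fits a median imputer on a small made-up DataFrame and compares statistics_ with the medians computed by pandas. The column total_rooms and all values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical data with one missing value (values are made up).
housing_num = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)  # learns one median per column, ignoring NaN

# The learned medians, in column order.
print(imputer.statistics_)

# pandas also skips NaN by default, so the two should agree.
print(housing_num.median().values)
```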


We know that only the attribute total_bedrooms contains missing values, but we cannot be sure that other attributes won't have missing values in new data after the system goes live. Hence, we apply the imputer to all the numerical attributes.
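To illustrate why fitting on every numerical attribute pays off, here is a sketch in which new data is missing a value in a column that was complete during training. All column values are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Training data: only total_bedrooms has a missing value (made-up numbers).
train = pd.DataFrame({
    "total_rooms": [800.0, 1200.0, 1500.0],
    "total_bedrooms": [100.0, np.nan, 300.0],
})
imputer = SimpleImputer(strategy="median").fit(train)

# New data after going live: this time total_rooms is missing.
new_data = pd.DataFrame({
    "total_rooms": [np.nan, 950.0],
    "total_bedrooms": [210.0, 180.0],
})

# The imputer already knows the training median of total_rooms (1200.0).
new_filled = imputer.transform(new_data)
print(new_filled)
```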

After that, we use the trained imputer to transform the training set by replacing the missing values with the learned medians. We use the transform() method for that. Its syntax is the same as that of the fit() method.

There's also a fit_transform() method, which performs both steps in a single call and saves us a little time. I used fit() and transform() separately here only to demonstrate each method. Otherwise, it is advised to use the fit_transform() method.
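As a small check, the sketch below confirms on a made-up DataFrame that fit() followed by transform() gives the same result as fit_transform().

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical data (values are made up).
housing_num = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Two separate steps: learn the medians, then apply them.
two_step = SimpleImputer(strategy="median").fit(housing_num).transform(housing_num)

# One combined step on the same data.
one_step = SimpleImputer(strategy="median").fit_transform(housing_num)

print(np.array_equal(two_step, one_step))
```

Note that fit_transform() is meant for the training data. On new data, call only transform() so that the medians learned from the training set are reused.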

The transform() method returns a NumPy array, so we convert it back to a pandas DataFrame.

  1. Import SimpleImputer from sklearn.impute.

  2. Create an instance of the class SimpleImputer with the name imputer. Set the parameter strategy to 'median'.

  3. Drop all the non-numeric attributes from our dataset train_data using the DataFrame.drop() method and store the result in a variable with the name housing_num.

  4. Fit the imputer on our dataset housing_num using the fit() method.

  5. Call the transform() method of imputer, passing our dataset housing_num as its argument, and store the result in a variable named out.

  6. Finally, run the following command to convert the output back to a pandas DataFrame:

    housing_tr = pd.DataFrame(out, columns=housing_num.columns)
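One detail worth knowing about the conversion command above: it restores the column names, but the new DataFrame gets a fresh 0, 1, 2, ... row index. If your training set has a non-default index (for example after a shuffled train/test split, which is an assumption here), you may also want to pass index=. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with a non-default row index (values and index are made up).
housing_num = pd.DataFrame(
    {"total_rooms": [880.0, 7099.0, 1467.0],
     "total_bedrooms": [129.0, np.nan, 190.0]},
    index=[10, 20, 30],
)

imputer = SimpleImputer(strategy="median")
out = imputer.fit_transform(housing_num)  # plain NumPy array by default

# Column names only: the row index is reset to 0, 1, 2.
housing_tr = pd.DataFrame(out, columns=housing_num.columns)

# Optional variant that also keeps the original row labels.
housing_tr_indexed = pd.DataFrame(
    out, columns=housing_num.columns, index=housing_num.index
)

print(list(housing_tr.index), list(housing_tr_indexed.index))
```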
