As you have seen, there are many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides the Pipeline class to help with such sequences of transformations.
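To make the idea concrete, here is a minimal sketch of a Pipeline on made-up data. The array X_toy and the name toy_pipeline are illustrative assumptions and are not part of the housing exercise.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): two numeric columns, one missing value.
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0]])

# Each step is a (name, estimator) pair; steps run in the order listed.
toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# fit_transform() fits each step on the output of the previous step,
# then returns the fully transformed data.
X_toy_tr = toy_pipeline.fit_transform(X_toy)
print(X_toy_tr.shape)
```

Every step except the last must be a transformer, that is, it must provide fit and transform methods. The last step can be any estimator.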
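Copy and paste the code below as is. Here we use a pipeline to process the numerical data: it first imputes missing values using SimpleImputer, then uses the custom transformer created earlier to add the combined attribute columns, and finally uses the StandardScaler class to scale the numerical attributes. A ColumnTransformer then applies this pipeline to the numerical columns and a OneHotEncoder to the categorical column.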
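col_names = "total_rooms", "total_bedrooms", "population", "households"
rooms_ix, bedrooms_ix, population_ix, households_ix = [
    housing.columns.get_loc(c) for c in col_names]

# Assumption: housing_extra_attribs already holds the array returned by the
# custom transformer in the earlier step; here it is wrapped in a DataFrame.
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
    index=housing.index)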
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Numerical pipeline: impute missing values, add combined attributes, then scale
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)
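The sketch below shows, on made-up data, how a custom transformer sits between an imputer and a scaler, and how a fitted step can be inspected afterwards. RatioAdder is a hypothetical stand-in for CombinedAttributesAdder, and the toy array is an assumption for illustration only.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column 0 divided by column 1."""

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        ratio = X[:, 0] / X[:, 1]
        return np.c_[X, ratio]


# Toy data (made up): think of the columns as "rooms" and "households".
X_toy = np.array([[100.0, 20.0],
                  [np.nan, 10.0],
                  [300.0, 30.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("ratio_adder", RatioAdder()),
    ("std_scaler", StandardScaler()),
])
X_toy_tr = toy_pipeline.fit_transform(X_toy)

# Fitted steps stay accessible by name, e.g. the medians the imputer learned.
learned_medians = toy_pipeline.named_steps["imputer"].statistics_
print(X_toy_tr.shape, learned_medians)
```

Because the imputer runs first, the custom step never sees a missing value, which is why the order of the steps matters.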
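from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

# Apply num_pipeline to the numerical columns and one-hot encode the categorical column
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])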
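As a rough illustration of what a ColumnTransformer does, the sketch below routes the columns of a small made-up DataFrame to different transformers and concatenates the results. The DataFrame toy and its column names are assumptions, not the housing data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up): two numerical columns and one categorical column.
toy = pd.DataFrame({
    "rooms": [3.0, 5.0, 4.0, 6.0],
    "population": [100.0, 250.0, 180.0, 300.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

# Each entry is (name, transformer, list of columns it applies to).
toy_transformer = ColumnTransformer([
    ("num", StandardScaler(), ["rooms", "population"]),
    ("cat", OneHotEncoder(), ["proximity"]),
])

# Outputs are concatenated side by side: 2 scaled columns
# plus one column per category (3 here).
toy_prepared = toy_transformer.fit_transform(toy)
print(toy_prepared.shape)
```

Columns that are not listed in any transformer are dropped by default, so every column you want to keep should be assigned to a step.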
Finally, we will fit_transform the entire training data:
housing_prepared = full_pipeline.<<your code goes here>>(housing)
