End-to-End ML Project - Beginner friendly


Up to now, we have had to handle categorical and numerical attributes separately. It would be more convenient to create a single pipeline that handles both, applying the appropriate transformation to each attribute. But the Pipeline class applies the same transformations to every attribute. To solve this problem, we use the ColumnTransformer class of sklearn.

In ColumnTransformer, we can specify the lists of numerical and categorical attributes in our dataset; it then applies each transformation to the appropriate columns and finally concatenates the outputs. Its basic syntax is ColumnTransformer(transformers),

where transformers is the list of tuples (name, transformer, columns) specifying the transformer objects to be applied to subsets of the data. Here, name is a string label we choose for the step, transformer is a transformer instance (for example, StandardScaler()), and columns is the list of attributes to which that particular transformation is applied.

Its syntax closely resembles that of the Pipeline class. For example, we can apply StandardScaler to the numerical attributes under the name scaler, and OneHotEncoder to the categorical attribute under the name cat, by-

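from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
pipe = ColumnTransformer([
    ("scaler", StandardScaler(), num_attributes),      # scale the numerical columns
    ("cat", OneHotEncoder(), ["ocean_proximity"]),     # one-hot encode the categorical column
])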

where num_attributes is a list containing names of numerical attributes and ocean_proximity is the categorical attribute.
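
To make the example above concrete, here is a minimal runnable sketch. The small DataFrame and its values are made up for illustration (they are not the real housing data), and the column names other than ocean_proximity are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the housing data (values are invented for illustration).
df = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8],
    "housing_median_age": [41.0, 21.0, 52.0, 30.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"],
})
num_attributes = ["median_income", "housing_median_age"]

pipe = ColumnTransformer([
    ("scaler", StandardScaler(), num_attributes),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

# Each transformer sees only its own columns; the outputs are joined side by side.
prepared = pipe.fit_transform(df)

# 2 scaled numerical columns + 3 one-hot columns (one per distinct category)
print(prepared.shape)
```

The output columns follow the order of the transformers list: first the scaled numerical columns, then the one-hot columns.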

We can also pass a Pipeline in place of a single transformer, using the same syntax. But remember, ColumnTransformer works only with transformers, so the last step of such a pipeline must itself be a transformer rather than a predictor.
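
One way to see this constraint is to check whether a pipeline exposes a transform method, since ColumnTransformer needs every entry to provide fit and transform. The sketch below is only an illustration; the two pipelines and their step names are hypothetical.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline whose last step is a transformer: usable inside ColumnTransformer.
transformer_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Pipeline whose last step is a predictor: it has predict() but no transform().
predictor_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# A Pipeline only exposes transform() when its final step does.
transformer_has_transform = hasattr(transformer_pipe, "transform")
predictor_has_transform = hasattr(predictor_pipe, "transform")
print(transformer_has_transform, predictor_has_transform)
```

Only the first pipeline could be placed in the transformers list; the second belongs after the ColumnTransformer, as the final modelling step.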

We can then call its methods, such as fit(), transform() and fit_transform(), just as we would on a Pipeline instance.
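
As a rough sketch of those methods in use, the snippet below fits a ColumnTransformer on a small training frame and then transforms a new row with the statistics learned during fit. The data and the variable names train and new are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented training data and one new row to transform later.
train = pd.DataFrame({
    "median_income": [2.0, 4.0, 6.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
})
new = pd.DataFrame({
    "median_income": [4.0],
    "ocean_proximity": ["NEAR BAY"],
})

pipe = ColumnTransformer([
    ("scaler", StandardScaler(), ["median_income"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

# fit() learns the scaling statistics and the categories from the training data.
pipe.fit(train)

# transform() reuses what was learned; nothing is re-fitted on the new row.
new_prepared = pipe.transform(new)

# Fitted sub-transformers can be looked up by the name given in the tuple.
learned_mean = pipe.named_transformers_["scaler"].mean_
print(new_prepared)
print(learned_mean)
```

Fitting on the training data only and then calling transform() on other data keeps the preprocessing consistent between the two sets.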

Refer to ColumnTransformer documentation for further details about the class.

  1. Import the class ColumnTransformer from sklearn.compose.

  2. Create an instance of ColumnTransformer named full_pipeline with the following transformers:

    a) Pipeline num_pipeline which we created before and specify its name as num and columns as list(housing_num). list(housing_num) contains names of all the numerical attributes.

    b) Class OneHotEncoder with the name cat and specify columns as the categorical columns of our dataset.

  3. Call the fit_transform() method of full_pipeline on the dataset train_data. Store the output in a variable named housing_prepared.
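
The steps above could be sketched roughly as follows. Because num_pipeline, housing_num and train_data were built earlier in the project, the versions here are hypothetical stand-ins with invented values; in the real exercise, reuse the objects you already have.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer  # step 1
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for train_data (invented values, one missing entry).
train_data = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.1],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0, 280.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND", "INLAND"],
})

# Hypothetical stand-ins for objects created earlier in the project.
housing_num = train_data.drop("ocean_proximity", axis=1)
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Step 2: numerical columns go through num_pipeline, the categorical one is one-hot encoded.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, list(housing_num)),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

# Step 3: fit on the training data and transform it in one call.
housing_prepared = full_pipeline.fit_transform(train_data)
print(housing_prepared.shape)
```

Note that list(housing_num) works because calling list() on a DataFrame returns its column names.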
