End-to-End ML Project- Beginner friendly


Transformation Pipelines

As we have seen, several data transformation steps have to be performed in the right order. sklearn provides a Pipeline class that chains these steps and executes them sequentially. Its syntax is-

Pipeline(steps)

where steps is the list of (name, estimator) tuples in the order in which we want to perform the transformations. Here, name is a string label that we choose for the step, and estimator is an instance of the estimator (an object, not the class itself).

An estimator is any object that learns from data; it may be a classification, regression, or clustering algorithm or a transformer that extracts/filters useful features from raw data.
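To make "learns from data" concrete, here is a minimal sketch using SimpleImputer. The toy array X is made up for illustration and is not part of the housing dataset. Calling fit() stores the learned column means on the estimator, and transform() then applies them.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): two numeric columns with one missing value.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X)                       # learns one mean per column, ignoring NaN
learned_means = imputer.statistics_  # the values learned during fit()

X_filled = imputer.transform(X)      # replaces NaN with the learned mean
print(learned_means)
print(X_filled)
```

Because SimpleImputer has both fit() and transform(), it is a transformer and can be used as an intermediate step in a pipeline.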

For example, we can chain SimpleImputer (with its parameter strategy set as 'mean') with its instance name as imputer and StandardScaler with its instance name as scaler in the order by-

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

where pipe is the name of the pipeline instance.

Remember, all estimators except the last must be transformers (i.e. they must implement the fit() and transform() methods). The last estimator may or may not be a transformer. In the above example, StandardScaler is a transformer. Also, the names of the estimators can be anything as long as they do not contain a double underscore, because the double underscore is reserved for addressing a parameter of a step (for example, imputer__strategy). So you can't name an estimator std__scaler.
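The naming rule can be checked with a short sketch. The toy array X is made up for illustration. The first part uses the step name plus a double underscore to change a parameter of one step, and the second part shows that a step name containing a double underscore is rejected with a ValueError. The sketch calls fit() before expecting the error, on the assumption that the check may run at fit time rather than at construction, depending on the scikit-learn version.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical), used only to trigger fitting.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# <step name>__<parameter name> reaches a parameter of a single step.
pipe.set_params(imputer__strategy="median")
strategy_after = pipe.named_steps["imputer"].strategy

# A step name that itself contains "__" is rejected.
bad_name_rejected = False
try:
    bad_pipe = Pipeline([("std__scaler", StandardScaler())])
    bad_pipe.fit(X)  # name validation happens no later than here
except ValueError:
    bad_name_rejected = True

print(strategy_after, bad_name_rejected)
```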

Then, we can call the methods fit(), transform() or fit_transform() on the pipeline object, for example-

pipe.fit_transform(X)

where X is the dataset to be transformed.


When we call the fit() method on our pipeline instance, it calls fit_transform() on every estimator except the last, in sequential order, passing the output of each as input to the next. On the last estimator, it calls only the fit() method. When we call fit_transform() on the pipeline instead, fit_transform() is called on the last estimator too, and the transformed data is returned. So, if the last estimator is a transformer and we want the transformed data, it is advised to use the fit_transform() method.
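The difference between the two calls can be seen in a small sketch. The toy array X and the helper make_pipe() are made up for illustration. Calling fit() followed by transform() gives the same result as a single fit_transform() call when the last step is a transformer.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one missing value in the second column.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])

def make_pipe():
    """Build a fresh, unfitted imputer + scaler pipeline."""
    return Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ])

# Option 1: fit() only fits the steps; transform() is a second, separate call.
out_a = make_pipe().fit(X).transform(X)

# Option 2: fit_transform() fits every step and returns the transformed data.
out_b = make_pipe().fit_transform(X)

same = np.allclose(out_a, out_b)
print(same)
```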

Refer to Pipeline documentation for further details about the class.

  1. Import the class Pipeline from sklearn.pipeline.
  2. Create a pipeline instance with the name num_pipeline with its estimators in the order-

    a) SimpleImputer transformer with its parameter strategy specified as median. The name of the estimator should be imputer.

    b) Our custom transformer CombinedAttributesAdder with no parameters. The name of the estimator should be attribs_adder.

    c) StandardScaler with no parameters. The name of the estimator should be std_scaler.

  3. Use the method fit_transform() on the pipeline num_pipeline. Specify the dataset as housing_num and store the result in a variable named housing_num_tr.
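The steps above can be sketched as follows. This is only an illustration under stated assumptions: the CombinedAttributesAdder class and the housing_num data defined here are hypothetical stand-ins so that the block runs on its own. In the actual exercise, use the custom transformer and the housing_num dataset created earlier in the project instead.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: appends column 0 / column 1 as a new feature."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        ratio = X[:, 0] / X[:, 1]
        return np.c_[X, ratio]

# Hypothetical stand-in for the numeric housing data.
housing_num = np.array([[6.0, 2.0],
                        [np.nan, 4.0],
                        [8.0, 1.0],
                        [10.0, 5.0]])

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("attribs_adder", CombinedAttributesAdder()),
    ("std_scaler", StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)
print(housing_num_tr.shape)
```

The imputer comes first so that the later steps never see missing values, and the scaler comes last so that the added attribute is standardized along with the original columns.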
