Treating Outliers in Python

Transforming values

Transforming a variable can also reduce the influence of outliers. The transformed values show less of the variation caused by extreme values. Common transformation methods include:

  1. Scaling
  2. Log transformation
  3. Cube Root Normalization
  4. Box-Cox transformation

These techniques map the values in the dataset onto a smaller scale. If the data is skewed or has many extreme values, they can help bring its distribution closer to normal. Note that no rows are lost with these methods. However, these techniques do not always give the best results.
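
As a minimal sketch of the two claims above (smaller values, no loss of data), the snippet below applies the log and cube root transformations to the ages used later in this exercise. The variable names are illustrative, and only NumPy is assumed.

```python
import numpy as np

# Ages from the dataset used in this exercise.
ages = np.array([20, 21, 19, 99, 23, 18, 98], dtype=float)

logged = np.log(ages)    # log transformation
rooted = np.cbrt(ages)   # cube root transformation

# The spread (max - min) shrinks under both transformations.
for name, values in [("raw", ages), ("log", logged), ("cube root", rooted)]:
    print(f"{name:>9}: range = {values.max() - values.min():.2f}")

# Every row survives, and the originals can be recovered
# (up to floating-point rounding) by applying the inverse function.
restored_from_log = np.exp(logged)
restored_from_root = rooted ** 3
```

Because both functions are invertible, the transformation can be undone whenever results need to be reported in the original units.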

INSTRUCTIONS
  • First, import pandas as pd, NumPy as np, seaborn as sns, pyplot as plt, preprocessing from scikit-learn, and scipy.

    import pandas as <<your code goes here>>
    import numpy as <<your code goes here>>
    import seaborn as <<your code goes here>>
    import matplotlib.pyplot as <<your code goes here>>
    from sklearn import preprocessing
    import scipy.stats  # ensures scipy.stats.boxcox is available
    
  • Next, we create a dataset

    data = {'Name':['Tom', 'Dick', 'Harry', 'Jack', 'Alex', 'Mike', 'John'],
            'Age':[20, 21, 19, 99, 23, 18, 98]}
    orig = pd.DataFrame(data)
    
  • Now, let's plot the dataset and observe the outliers

    # Plot the original data; no transformation has been applied yet
    sns.boxplot(x=orig['Age'])
    plt.title("Box Plot of the original data")
    plt.show()
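
A box plot conventionally marks a point as an outlier when it falls outside the 1.5 × IQR fences. The sketch below computes those fences numerically, assuming NumPy's default linear-interpolation quartiles; `iqr_fences` is a hypothetical helper, not part of any library.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule (hypothetical helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

ages = np.array([20, 21, 19, 99, 23, 18, 98], dtype=float)
lower, upper = iqr_fences(ages)

# Values outside the fences are the ones a box plot would draw as points.
flagged = ages[(ages < lower) | (ages > upper)]
print(f"fences: [{lower}, {upper}], flagged: {flagged}")
```

On a sample this small, the two large ages pull the upper quartile up with them, so the upper fence lands above 99 and the rule flags nothing. It is worth checking the numbers alongside the plot rather than relying on the picture alone.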
    
  • Let's start with the scaling method

    train = orig.copy()
    scaler = preprocessing.StandardScaler()
    train['Age'] = scaler.fit_transform(train['Age'].values.reshape(-1,1))
    sns.boxplot(x=train['Age'])
    plt.title("Box Plot after scaling")
    plt.show()
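
To make the scaling step less of a black box, here is a hand-rolled z-score using only NumPy. It mirrors the default behaviour of `StandardScaler` (subtract the mean, divide by the population standard deviation); the variable names are illustrative.

```python
import numpy as np

ages = np.array([20, 21, 19, 99, 23, 18, 98], dtype=float)

# z-score: subtract the mean, divide by the population standard deviation
# (ddof=0), which is the convention StandardScaler uses by default.
z = (ages - ages.mean()) / ages.std(ddof=0)

print("mean:", round(z.mean(), 6), "std:", round(z.std(), 6))

# The transformation is linear, so the ordering of the rows is unchanged
# and the largest ages are still the largest z-scores.
same_order = np.array_equal(np.argsort(ages), np.argsort(z))
```

Since scaling is linear, it changes the units of the axis but not the shape of the box plot. The extreme ages sit just as far from the rest, measured in standard deviations.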
    
  • Next, we apply the log transformation method

    train = orig.copy()
    train['Age'] = np.log(train['Age'])
    sns.boxplot(x=train['Age'])
    plt.title("Box Plot after log transformation")
    plt.show()
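
`np.log` works here because every age is positive. For columns that may contain zeros, one common workaround is `np.log1p`, which computes log(1 + x). The sketch below wraps it in a hypothetical helper, `safe_log`, that also rejects negative input.

```python
import numpy as np

def safe_log(values):
    """Log-transform non-negative data (hypothetical helper).

    Uses log(1 + x) so that zeros map to 0 instead of -inf.
    """
    values = np.asarray(values, dtype=float)
    if np.any(values < 0):
        raise ValueError("log transformation needs non-negative values")
    return np.log1p(values)

print(safe_log([0, 20, 99]))
```

Note that `log1p` is a slightly different transformation from `log`, so its inverse is `np.expm1` rather than `np.exp`.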
    
  • Now, we will use the cube root transformation method

    train = orig.copy()
    train['Age'] = (train['Age']**(1/3))
    sns.boxplot(x=train['Age'])
    plt.title("Box Plot after cube root transformation")
    plt.show()
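
The expression `**(1/3)` is fine for ages, which are positive, but it returns NaN for negative floats. If a column can go below zero, `np.cbrt` is a safer choice. A small illustration with made-up values:

```python
import numpy as np

values = np.array([-8.0, 27.0])  # illustrative values, not from the dataset

# Raising a negative float to the power 1/3 gives NaN in NumPy.
with np.errstate(invalid="ignore"):
    by_power = values ** (1 / 3)

# np.cbrt computes the real cube root, including for negative inputs.
by_cbrt = np.cbrt(values)

print(by_power)
print(by_cbrt)
```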
    
  • Finally, we will use the Box-Cox transformation method to reduce the effect of the outliers

    train = orig.copy()
    # With lmbda=None, boxcox also returns the lambda it fitted to the data
    train['Age'], fitted_lambda = scipy.stats.boxcox(train['Age'], lmbda=None)
    sns.boxplot(x=train['Age'])
    plt.title("Box Plot after Box-Cox transformation")
    plt.show()
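
Box-Cox is only defined for strictly positive data, and the fitted lambda is needed to undo the transformation later. The following sketch, which assumes SciPy's `scipy.special.inv_boxcox` is available, shows the round trip on the same ages.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

ages = np.array([20, 21, 19, 99, 23, 18, 98], dtype=float)

# With lmbda=None (the default), boxcox estimates lambda by maximum
# likelihood and returns it alongside the transformed values.
transformed, fitted_lambda = stats.boxcox(ages)
print("fitted lambda:", fitted_lambda)

# Keeping fitted_lambda makes it possible to map results back to ages.
restored = inv_boxcox(transformed, fitted_lambda)
```

Storing `fitted_lambda` alongside the transformed column is a reasonable habit, since the same value must be reused for any new data and for the inverse transformation.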
    