Treating Outliers in Python

Imputation

Just as we can impute missing values, we can also impute outliers. Common choices for the replacement value are the mean, the median, or zero. Since we are replacing values rather than dropping rows, no observations are lost. The median is usually the most appropriate choice because, unlike the mean, it is barely affected by the outliers themselves.
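
To see why the median is the more robust choice, here is a minimal sketch on made-up ages (the values are illustrative assumptions, not taken from the Titanic dataset). It compares the mean and the median with and without a single extreme value:

```python
import numpy as np

# Hypothetical ages; 95 is an obvious outlier
ages = np.array([22, 25, 27, 30, 31, 95])

# Statistics computed with the outlier included
mean_with = ages.mean()
median_with = np.median(ages)

# Statistics computed with the outlier left out
mean_without = ages[:-1].mean()
median_without = np.median(ages[:-1])

print(f"mean:   {mean_without:.1f} -> {mean_with:.1f}")      # mean:   27.0 -> 38.3
print(f"median: {median_without:.1f} -> {median_with:.1f}")  # median: 27.0 -> 28.5
```

A single extreme value shifts the mean by more than 11 years but the median by only 1.5, so a mean-based replacement value is itself distorted by the outliers it is meant to fix.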

INSTRUCTIONS
  • First, we will import NumPy as `np`, Pandas as `pd`, Seaborn as `sns`, and Pyplot as `plt`

    import pandas as <<your code goes here>>
    import numpy as <<your code goes here>>
    import seaborn as <<your code goes here>>
    import matplotlib.pyplot as plt
    
  • Next, we will load the Titanic dataset

    data = pd.read_csv('/cxldata/datasets/project/titanic/train.csv')
    
  • Now let's plot the Age attribute and observe the outliers

    sns.boxplot(x=data['Age'])
    plt.title("Box Plot before imputation")
    plt.show()
    
  • Now we will try several imputation methods. First, we will replace the outliers with the mean and plot the Age attribute again

    train = data.copy()
    # Values more than 1.5 * IQR beyond the quartiles are treated as outliers
    q1 = train['Age'].quantile(0.25)
    q3 = train['Age'].quantile(0.75)
    iqr = q3 - q1
    Lower_tail = q1 - 1.5 * iqr
    Upper_tail = q3 + 1.5 * iqr
    m = np.mean(train['Age'])
    # Boolean mask of outliers; missing ages compare as False and are left untouched
    outliers = (train['Age'] > Upper_tail) | (train['Age'] < Lower_tail)
    train.loc[outliers, 'Age'] = m
    sns.boxplot(x=train['Age'])
    plt.title("Box Plot after mean imputation")
    plt.show()
    
  • Next, we will try the median value

    train = data.copy()
    # Values more than 1.5 * IQR beyond the quartiles are treated as outliers
    q1 = train['Age'].quantile(0.25)
    q3 = train['Age'].quantile(0.75)
    iqr = q3 - q1
    Lower_tail = q1 - 1.5 * iqr
    Upper_tail = q3 + 1.5 * iqr
    # Age contains missing values, so np.median would return NaN; nanmedian ignores them
    med = np.nanmedian(train['Age'])
    outliers = (train['Age'] > Upper_tail) | (train['Age'] < Lower_tail)
    train.loc[outliers, 'Age'] = med
    sns.boxplot(x=train['Age'])
    plt.title("Box Plot after median imputation")
    plt.show()
    
  • Finally, we will try the zero value imputation

    train = data.copy()
    # Values more than 1.5 * IQR beyond the quartiles are treated as outliers
    q1 = train['Age'].quantile(0.25)
    q3 = train['Age'].quantile(0.75)
    iqr = q3 - q1
    Lower_tail = q1 - 1.5 * iqr
    Upper_tail = q3 + 1.5 * iqr
    outliers = (train['Age'] > Upper_tail) | (train['Age'] < Lower_tail)
    train.loc[outliers, 'Age'] = 0
    sns.boxplot(x=train['Age'])
    plt.title("Box Plot after zero value imputation")
    plt.show()
    
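
The three steps above repeat the same fence calculation and differ only in the replacement value. One possible way to factor that out is sketched below. The function name `impute_outliers`, its `strategy` and `k` parameters, and the sample data are hypothetical and not part of the exercise:

```python
import pandas as pd

def impute_outliers(series, strategy="median", k=1.5):
    """Return a copy of `series` with values outside the IQR fences replaced.

    Hypothetical helper: `strategy` is one of "mean", "median" or "zero".
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    if strategy == "mean":
        fill = series.mean()      # pandas skips missing values by default
    elif strategy == "median":
        fill = series.median()    # pandas skips missing values by default
    elif strategy == "zero":
        fill = 0
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    # Missing values compare as False, so they are not treated as outliers
    outliers = (series < lower) | (series > upper)
    # Series.mask replaces the entries where the condition is True
    return series.mask(outliers, fill)

# Made-up ages with one outlier (95) and one missing value
ages = pd.Series([22.0, 25.0, 27.0, 30.0, 31.0, 95.0, None])
result = impute_outliers(ages, "median")
print(result.tolist())
```

With the Titanic data loaded as in the steps above, a call such as `impute_outliers(data['Age'], "median")` would return the imputed column without modifying `data`.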

Observe that median imputation replaces every value outside the IQR fences with the median age.
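
The box plots give a visual check, and the same observation can be confirmed numerically by counting the values that fall outside the fences before and after imputation. The following is a minimal sketch on made-up ages. The helper `iqr_fences` and the data are assumptions for illustration only:

```python
import pandas as pd

def iqr_fences(series, k=1.5):
    """Hypothetical helper: return the (lower, upper) IQR fences of a Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up ages with one low outlier (2) and one high outlier (95)
ages = pd.Series([2.0, 22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 95.0])

lower, upper = iqr_fences(ages)
is_outlier = (ages < lower) | (ages > upper)

# Replace the flagged values with the median of the column
imputed = ages.mask(is_outlier, ages.median())

# Count values outside the ORIGINAL fences before and after imputation
before = int(is_outlier.sum())
after = int(((imputed < lower) | (imputed > upper)).sum())
print(before, after)  # 2 0
```

Note that the count is measured against the original fences. A box plot drawn after imputation recomputes the quartiles from the imputed data, so its whiskers can differ slightly from the fences used for the replacement.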
