Treating Outliers in Python

Deleting observations

If the outliers are few in number, or were caused by data entry or data processing errors, we can delete them. We can also trim both ends of the distribution to remove outliers. However, deleting observations is rarely a good idea when the dataset is small, because every dropped row is a large share of the data.
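To make "trimming at both ends" concrete, here is a minimal sketch. The helper name `trim_both_ends` and the 10%/90% cut-offs are illustrative assumptions, not part of this exercise; it simply keeps the values that lie between two quantiles.

```python
import pandas as pd

def trim_both_ends(series, lower_q=0.05, upper_q=0.95):
    """Hypothetical helper: keep only values between two quantiles (inclusive)."""
    low, high = series.quantile([lower_q, upper_q])
    return series[(series >= low) & (series <= high)]

ages = pd.Series([20, 21, 19, 99, 23, 18])

# Trim the lowest and highest 10% of the distribution
trimmed = trim_both_ends(ages, lower_q=0.10, upper_q=0.90)
print(trimmed.tolist())  # [20, 21, 19, 23]
```

Note that trimming removes values from both tails regardless of whether they are extreme: here it drops 99, but also 18, which is a perfectly ordinary age. That is one reason to be cautious with it on small datasets.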

Here, we will first observe a dataset that contains outliers. Next, we will delete those outliers, and then visualize the dataset once again.

INSTRUCTIONS
  • We will start by importing pandas as pd, NumPy as np, seaborn as sns, and pyplot as plt

    import pandas as <<your code goes here>>
    import numpy as <<your code goes here>>
    import seaborn as <<your code goes here>>
    import matplotlib.pyplot as <<your code goes here>>
    
  • Next, we will define a dataframe containing the names and ages of 6 different people

    data = {'Name':['Tom', 'Dick', 'Harry', 'Jack', 'Alex', 'Mike'],
            'Age':[20, 21, 19, 99, 23, 18]}
    train = pd.DataFrame(data)
    
  • Now let's plot the dataframe and observe the outliers

    sns.boxplot(x=train['Age'])
    plt.title("Box plot before removing outliers")
    plt.show()
    
  • Next, we will define a function drop_outliers which will take a dataframe and a corresponding column name, check for outliers using the IQR method, and finally drop the outliers from that dataframe in place

    def <<your code goes here>>(df, field_name):
        # Compute the quartiles once, before any rows are dropped
        q1, q3 = np.percentile(df[field_name], [25, 75])
        whisker = 1.5 * (q3 - q1)  # 1.5 times the interquartile range
        # Select rows outside [q1 - whisker, q3 + whisker] and drop them in place
        outliers = df[(df[field_name] > q3 + whisker) | (df[field_name] < q1 - whisker)]
        df.drop(outliers.index, inplace=True)
    
  • Now let's call this function with our dataset

    <<your code goes here>>(train, 'Age')
    
  • We have dropped the outliers from our dataset. Now let's visualize it once again

    sns.boxplot(x=train['Age'])
    plt.title("Box plot after removing outliers")
    plt.show()
    

Observe that the outliers have been dropped from the dataset.
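To check the result without relying on the plot, the IQR fences can be computed by hand. The following is a small sketch on the same six ages; the variable names (`lower_fence`, `upper_fence`, `is_outlier`) are illustrative assumptions, and it uses a boolean mask rather than dropping rows in place.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20, 21, 19, 99, 23, 18], name="Age")

# Quartiles with NumPy's default (linear interpolation) percentile method
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (ages < lower_fence) | (ages > upper_fence)
cleaned = ages[~is_outlier]  # non-mutating: the original series is untouched

print(q1, q3, iqr)
print(lower_fence, upper_fence)
print(ages[is_outlier].tolist())
```

For this data, Q1 is 19.25 and Q3 is 22.5, so the IQR is 3.25 and the fences sit at 14.375 and 27.375. Only the age 99 falls outside them, which matches the single point removed from the box plot.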
