Treating Outliers in Python


Visualizing the data

Finally, we will learn how to detect outliers using data visualization techniques. Visualization is useful for cleaning and exploring data, detecting outliers and unusual groups, and identifying trends and clusters. The following plots can help spot outliers:

  1. Box and whisker plot (box plot)
  2. Scatter plot
  3. Histogram
  4. Distribution Plot
  5. QQ plot

We will be using the same Titanic dataset we used earlier to detect the outliers.

INSTRUCTIONS
  • First, we will import Pandas as pd, seaborn as sns, Pyplot as plt, and qqplot from statsmodels

    import pandas as <<your code goes here>>
    import seaborn as <<your code goes here>>
    import matplotlib.pyplot as <<your code goes here>>
    from statsmodels.graphics.gofplots import qqplot
    
  • Next, we will load the Titanic dataset we used earlier

    train = pd.read_csv('/cxldata/datasets/project/titanic/train.csv')
    
  • Now, we will create a function named box_plots to plot the data for a particular column from our dataset, and then we will call this function using the Age column

    def <<your code goes here>>(df):
        plt.figure(figsize=(10, 4))
        plt.title("Box Plot")
        sns.boxplot(x=df)  # pass the column as x so the box is drawn horizontally
        plt.show()
    
    box_plots(train['Age'])
    

    Notice the data points lying at the extreme right, beyond the whisker; those are the outliers.
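
To see where those isolated points come from, here is a minimal sketch of the 1.5 × IQR rule that a default box plot uses to decide which points to draw individually. The `ages` values and the helper name `iqr_fences` are made up for illustration; they are not taken from the Titanic file.

```python
import pandas as pd

# Hypothetical stand-in for a numeric column such as train['Age'];
# the values are invented for illustration only.
ages = pd.Series([22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 80])

def iqr_fences(series, k=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    clean = series.dropna()            # ignore missing values
    q1 = clean.quantile(0.25)          # first quartile
    q3 = clean.quantile(0.75)          # third quartile
    iqr = q3 - q1                      # interquartile range
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(ages)
outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper)
print(outliers.tolist())
```

The same helper could be called on a real column, for example `iqr_fences(train['Age'])`, assuming the dataset has been loaded as in the steps above.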

  • Now, we will create a function named hist_plots to plot the data for a particular column from our dataset, and then we will call this function using the Age column

    def <<your code goes here>>(df):
        plt.figure(figsize=(10, 4))
        plt.hist(df)
        plt.title("Histogram Plot")
        plt.show()
    
    hist_plots(train['Age'])
    

    Notice the thin tail at the extreme right of the histogram; the values in it are the outliers.
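
The tail can also be read as numbers. The sketch below uses `numpy.histogram` with 10 bins (the default bin count of `plt.hist`) and prints the count per bin, so sparse bins far from the bulk of the data stand out. The `ages` array is invented sample data, not the Titanic column.

```python
import numpy as np

# Hypothetical sample values standing in for an age column.
ages = np.array([22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 80], dtype=float)

# Count how many values fall into each of 10 equal-width bins.
counts, edges = np.histogram(ages, bins=10)

for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

Bins with very few values, separated from the rest by empty bins, correspond to the tail seen in the plot.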

  • Now, we will create a function named scatter_plots to plot the data for a particular column from our dataset, and then we will call this function using the Age and Fare columns

    def <<your code goes here>>(df1,df2):
        fig, ax = plt.subplots(figsize=(10,4))
        ax.scatter(df1,df2)
        ax.set_xlabel('Age')
        ax.set_ylabel('Fare')
        plt.title("Scatter Plot")
        plt.show()
    
    scatter_plots(train['Age'],<<your code goes here>>)
    

    Notice the data points far above the densely populated region of the scatter plot; those are the outliers.

  • Now, we will create a function named dist_plots to plot the data for a particular column from our dataset, and then we will call this function using the Fare column

    def <<your code goes here>>(df):
        plt.figure(figsize=(10, 4))
        sns.histplot(df, kde=True)  # replaces sns.distplot, which is deprecated in recent seaborn releases
        plt.title("Distribution plot")
        sns.despine()
        plt.show()
    
    dist_plots(train['Fare'])
    

    Notice the data points lying in the tail at the extreme right; those are the outliers.
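
A long right tail also shows up as positive skewness. As a rough sketch, `Series.skew()` can be compared before and after dropping the largest value to see how much a single extreme point stretches the distribution. The `fares` values are hypothetical and chosen only to illustrate the effect.

```python
import pandas as pd

# Hypothetical fare values with one very large entry.
fares = pd.Series([7.25, 71.28, 7.92, 53.1, 8.05, 8.46, 51.86,
                   21.08, 11.13, 30.07, 16.7, 26.55, 263.0])

skew_all = fares.skew()                           # skewness of all values
trimmed = fares.drop(fares.idxmax())              # same data without the largest value
skew_trimmed = trimmed.skew()

print(f"skewness with the largest value:    {skew_all:.2f}")
print(f"skewness without the largest value: {skew_trimmed:.2f}")
```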

  • Finally, we will create a function named qq_plots to plot the data for a particular column from our dataset, and then we will call this function using the Fare column

    def <<your code goes here>>(df):
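        # qqplot opens its own figure unless it is given an axes, so create one and pass it in
        fig, ax = plt.subplots(figsize=(10, 4))
        qqplot(df, line='s', ax=ax)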
        plt.title("Normal QQPlot")
        plt.show()
    
    qq_plots(train['Fare'])
    

    Notice the data points in the extreme top right corner, far above the reference line; those are the outliers.
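
To make the QQ plot less of a black box, here is a minimal sketch of the idea behind it: sort the sample, pair each value with a standard normal quantile, and compare it with the value a normal distribution with the same mean and standard deviation would give. The plotting positions `i / (n + 1)` are one common convention and are an assumption here; the library may compute them differently. The `fares` list and the helper name `qq_points` are made up for illustration.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical fare values with one very large entry.
fares = [7.25, 71.28, 7.92, 53.1, 8.05, 8.46, 51.86,
         21.08, 11.13, 30.07, 16.7, 26.55, 263.0]

def qq_points(values):
    """Pair each sorted value with a standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed plotting positions: i / (n + 1) for i = 1..n
    theoretical = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

mu, sigma = mean(fares), stdev(fares)
points = qq_points(fares)

for z, observed in points:
    expected = mu + sigma * z   # value on a reference line scaled by the sample mean and std
    print(f"z={z:6.2f}  observed={observed:7.2f}  expected={expected:7.2f}")
```

Values whose observed quantile is far above the expected one are the points that sit away from the line in the top right corner of the plot.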

We can observe the points beyond the regular boundaries, which are the outliers in these cases.
