Visualizing the data

Finally, we will learn how to detect outliers using various data visualization techniques. Visualization is useful for cleaning and exploring data, detecting outliers and unusual groups, and identifying trends and clusters. Here is the list of plots we will use to spot outliers:

Box and whisker plot (box plot)
Scatter plot
Histogram
Distribution Plot
QQ plot

We will be using the same Titanic dataset we used earlier to detect the outliers.

INSTRUCTIONS

First, we will import Pandas as pd, Seaborn as sns, and Pyplot as plt, along with the qqplot function from statsmodels.

```
import pandas as <<your code goes here>>
import seaborn as <<your code goes here>>
import matplotlib.pyplot as <<your code goes here>>
from statsmodels.graphics.gofplots import qqplot
```

Next, we will load the Titanic dataset we used earlier.

train = pd.read_csv('/cxldata/datasets/project/titanic/train.csv')
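
Before plotting, it can help to check the columns for missing values and glance at their summary statistics. The sketch below is only an illustration: `sample` is a hypothetical stand-in with made-up values, not the real Titanic file, so the same calls would be applied to `train` in practice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; the values are made up.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
})

# Count the missing entries in each column we intend to plot.
missing_counts = sample[["Age", "Fare"]].isna().sum()
print(missing_counts)

# A maximum far above the 75% quartile is a first hint of outliers.
print(sample[["Age", "Fare"]].describe())
```

Knowing which columns contain missing values matters because plotting functions do not all treat NaN entries the same way.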

Now, we will create a function named box_plots to draw a box plot of a particular column from our dataset, and then we will call this function using the Age column.
```
def <<your code goes here>>(df):
    plt.figure(figsize=(10, 4))
    plt.title("Box Plot")
    sns.boxplot(x=df)  # pass the column as x so the box is drawn horizontally
    plt.show()

box_plots(train['Age'])
```
Notice the data points drawn individually beyond the whisker at the extreme right; those are the outliers.
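
By default, a box plot draws as individual points the values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be sketched numerically. The helper name `iqr_outliers` and the sample ages are hypothetical and used only for illustration.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = pd.Series(values).dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up ages with one unusually large value.
ages = pd.Series([22, 25, 28, 30, 31, 33, 35, 38, 40, 80])
flagged = iqr_outliers(ages)
print(flagged.tolist())  # [80]
```

Calling `iqr_outliers(train['Age'])` should list the same ages that the box plot shows as separate points.
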
Now, we will create a function named hist_plots to draw a histogram of a particular column from our dataset, and then we will call this function using the Age column.
```
def <<your code goes here>>(df):
    plt.figure(figsize=(10, 4))
    plt.hist(df.dropna())  # drop missing values before binning
    plt.title("Histogram Plot")
    plt.show()

hist_plots(train['Age'])
```
Notice the tail at the extreme right of the histogram; the few values in it are the outliers.
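
The tail can also be read off the bin counts directly. The following is a minimal sketch using `numpy.histogram` on made-up ages. The choice of 6 bins is an assumption for this small sample, whereas `plt.hist` uses 10 bins by default.

```python
import numpy as np

# Made-up ages with one unusually large value.
ages = np.array([22, 25, 28, 30, 31, 33, 35, 38, 40, 80], dtype=float)

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(ages, bins=6)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

Empty bins followed by a sparsely filled bin at the far end correspond to the tail seen in the plot.
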
Now, we will create a function named scatter_plots to plot one column of our dataset against another, and then we will call this function using the Age and Fare columns.
```
def <<your code goes here>>(df1,df2):
    fig, ax = plt.subplots(figsize=(10,4))
    ax.scatter(df1,df2)
    ax.set_xlabel('Age')
    ax.set_ylabel('Fare')
    plt.title("Scatter Plot")
    plt.show()

scatter_plots(train['Age'],<<your code goes here>>)
```
Notice the data points far above the densely populated region of the scatter plot; those are the outliers.
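
To back up the visual impression with numbers, one option is the modified z-score, which measures distance from the median in units of the median absolute deviation (MAD). This is a different technique from the plot itself, shown here as a sketch only. The helper name, the sample rows, and the commonly used cutoff of 3.5 are assumptions for illustration.

```python
import pandas as pd

def modified_z_scores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD. Assumes MAD is not zero."""
    values = pd.Series(values, dtype=float)
    median = values.median()
    mad = (values - median).abs().median()
    return 0.6745 * (values - median) / mad

# Made-up passengers; one fare is far above the rest.
passengers = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 512.33],
})

scores = modified_z_scores(passengers["Fare"])
flagged = passengers[scores.abs() > 3.5]
print(flagged)
```

Because the median and MAD are barely affected by extreme values, the score of an outlier is not masked by the outlier itself.
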
Now, we will create a function named dist_plots to draw a distribution plot of a particular column from our dataset, and then we will call this function using the Fare column.
```
def <<your code goes here>>(df):
    plt.figure(figsize=(10, 4))
    sns.distplot(df)  # deprecated in newer seaborn versions, which suggest histplot or displot instead
    plt.title("Distribution plot")
    sns.despine()
    plt.show()

dist_plots(train['Fare'])
```
Notice the long tail towards the extreme right; the data points in it are the outliers.
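
A long right tail also shows up as positive skewness and as a mean pulled above the median. The sketch below checks both on made-up fares; the values are assumptions for illustration, not the real column.

```python
import pandas as pd

# Made-up fares with one very large value.
fares = pd.Series([7.25, 7.92, 8.05, 11.13, 21.08, 51.86, 53.10, 71.28, 512.33])

# Positive skewness indicates a tail on the right-hand side.
skewness = fares.skew()
print(f"skewness: {skewness:.2f}")

# A single extreme value drags the mean well above the median.
print(f"mean: {fares.mean():.2f}, median: {fares.median():.2f}")
```
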
Finally, we will create a function named qq_plots to draw a normal QQ plot of a particular column from our dataset, and then we will call this function using the Fare column.
```
def <<your code goes here>>(df):
    fig, ax = plt.subplots(figsize=(10, 4))
    qqplot(df, line='s', ax=ax)  # draw on our axes; otherwise qqplot opens a second figure
    plt.title("Normal QQPlot")
    plt.show()

qq_plots(train['Fare'])
```
Notice the data points at the extreme top right corner, far above the reference line; those are the outliers.
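
A QQ plot pairs each sorted sample value with the quantile a normal distribution would have at the same position. The sketch below computes those pairs by hand. The helper name, the made-up fares, and the plotting positions `(i - 0.5) / n` are assumptions for illustration; statsmodels may use a different convention.

```python
from statistics import NormalDist

import numpy as np

def normal_qq_points(values):
    """Return (theoretical, standardized sample) quantile pairs."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    # Plotting positions: one probability per sorted observation.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    # Standardize so the sample is comparable to a standard normal.
    standardized = (x - x.mean()) / x.std(ddof=1)
    return theoretical, standardized

# Made-up fares with one very large value.
fares = [7.25, 7.92, 8.05, 11.13, 21.08, 51.86, 53.10, 71.28, 512.33]
theoretical, sample = normal_qq_points(fares)
gap = sample - theoretical
print(f"largest value sits {gap[-1]:.2f} above its normal quantile")
```

For normally distributed data the pairs fall close to a straight line, so a large positive gap at the top end matches the points seen in the top right corner of the plot.
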

In each of these plots, the outliers are the points that lie well beyond the bulk of the data.
