Finally, we will learn how to detect outliers using various data visualization techniques. Data visualization is useful for cleaning and exploring data, detecting outliers and unusual groups, and identifying trends and clusters. We will use the following plots to spot outliers: box plot, histogram, scatter plot, distribution plot, and QQ plot.
We will be using the same Titanic dataset we used earlier to detect the outliers.
First, we will import Pandas as pd, seaborn as sns, Pyplot as plt, and qqplot from statsmodels:
import pandas as <<your code goes here>>
import seaborn as <<your code goes here>>
import matplotlib.pyplot as <<your code goes here>>
from statsmodels.graphics.gofplots import qqplot
Next, we will load the Titanic dataset we used earlier:
train = pd.read_csv('/cxldata/datasets/project/titanic/train.csv')
Now, we will create a function named box_plots to plot the data for a particular column from our dataset, and then we will call this function using the Age column:
def <<your code goes here>>(df):
    plt.figure(figsize=(10, 4))
    plt.title("Box Plot")
    # Pass the column as x so the box is drawn horizontally
    sns.boxplot(x=df)
    plt.show()
box_plots(train['Age'])
Notice the data points drawn individually towards the extreme right, beyond the whisker. Those are the outliers.
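The points a box plot draws individually are, by default, those lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule follows, so the flagged values can be listed as well as seen. The helper name iqr_outliers and the sample ages are hypothetical, chosen only for illustration; they are not taken from the Titanic file.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    values = series.dropna()              # ignore missing entries
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical sample, not the real Titanic ages
sample_ages = pd.Series([22, 38, 26, 35, 35, 54, 2, 27, 14, 80, None])
flagged = iqr_outliers(sample_ages)
print(flagged.tolist())
```

With the dataset loaded above, the same helper could be called as iqr_outliers(train['Age']).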
Now, we will create a function named hist_plots to plot the data for a particular column from our dataset, and then we will call this function using the Age column:
def <<your code goes here>>(df):
    plt.figure(figsize=(10, 4))
    # Drop missing values before binning
    plt.hist(df.dropna())
    plt.title("Histogram Plot")
    plt.show()
hist_plots(train['Age'])
Notice the tail at the extreme right of the histogram. The values in that tail are the outliers.
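The tail can also be read off as numbers: a bin far to the right that holds very few values, separated from the bulk by empty bins, is what appears as the tail in the plot. A rough sketch using numpy.histogram with 10 equal-width bins (the default bin count of plt.hist) is shown below. The sample ages are hypothetical, not the Titanic data.

```python
import numpy as np

# Hypothetical sample, not the real Titanic ages
sample_ages = np.array([22, 38, 26, 35, 35, 54, 2, 27, 14, 80])

# 10 equal-width bins spanning the minimum to the maximum value
counts, edges = np.histogram(sample_ages, bins=10)

# Print each bin's range and how many values fall into it
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

In this sample the last bin holds a single value and is preceded by two empty bins, which is the numeric counterpart of an isolated bar at the right edge.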
Now, we will create a function named scatter_plots to plot the data for two columns from our dataset against each other, and then we will call this function using the Age and Fare columns:
def <<your code goes here>>(df1, df2):
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.scatter(df1, df2)
    ax.set_xlabel('Age')
    ax.set_ylabel('Fare')
    plt.title("Scatter Plot")
    plt.show()
scatter_plots(train['Age'], <<your code goes here>>)
Notice the data points lying well above the densely populated region of the scatter plot. Those are the outliers.
Now, we will create a function named dist_plots to plot the data for a particular column from our dataset, and then we will call this function using the Fare column:
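def <<your code goes here>>(df):
    plt.figure(figsize=(10, 4))
    # distplot is deprecated since seaborn 0.11 and emits a warning;
    # sns.histplot(df, kde=True) is the closest axes-level replacement
    sns.distplot(df)
    plt.title("Distribution plot")
    sns.despine()
    plt.show()
dist_plots(train['Fare'])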
Notice the data points lying in the long tail at the extreme right. Those are the outliers.
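A long right tail like this one can be summarized with the sample skewness, which is positive when the tail extends to the right, and by a mean that sits well above the median. The sketch below assumes a small hypothetical set of fares, not the real Titanic column.

```python
import pandas as pd

# Hypothetical sample, not the real Titanic fares
sample_fares = pd.Series(
    [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 512.33]
)

skewness = sample_fares.skew()        # sample skewness; > 0 means a longer right tail
mean_fare = sample_fares.mean()
median_fare = sample_fares.median()

print(f"skewness: {skewness:.2f}")
print(f"mean: {mean_fare:.2f}, median: {median_fare:.2f}")
```

With the dataset loaded above, train['Fare'].skew() would give the corresponding figure for the real column.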
Finally, we will create a function named qq_plots to plot the data for a particular column from our dataset, and then we will call this function using the Fare column:
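def <<your code goes here>>(df):
    fig, ax = plt.subplots(figsize=(10, 4))
    # Draw on this axes; without ax, qqplot opens a second figure and leaves this one empty
    qqplot(df, line='s', ax=ax)
    ax.set_title("Normal QQPlot")
    plt.show()
qq_plots(train['Fare'])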
Notice the data points in the extreme top right corner, far above the reference line. Those are the outliers.
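To make the QQ plot less of a black box, the sketch below computes by hand what such a plot compares: the sorted sample values against standard-normal quantiles, with a reference line of the kind line='s' requests (theoretical quantiles scaled by the sample standard deviation, plus the sample mean). The plotting positions i / (n + 1) are an assumed convention, since libraries differ slightly here, and the fares are hypothetical values, not the Titanic column.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample, not the real Titanic fares
sample_fares = [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 512.33]

ordered = sorted(sample_fares)
n = len(ordered)
mu, sigma = mean(ordered), stdev(ordered)

# Theoretical standard-normal quantiles at plotting positions i / (n + 1)
# (assumed convention; libraries differ slightly in this choice)
theoretical = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

# Height of the standardized reference line at each theoretical quantile
line = [mu + sigma * q for q in theoretical]

# Vertical distance of each sample quantile from the reference line
deviations = [obs - ref for obs, ref in zip(ordered, line)]

# Index of the point that strays furthest from the line
worst = max(range(n), key=lambda i: abs(deviations[i]))
print(ordered[worst], round(deviations[worst], 1))
```

In this sample the largest value is the one furthest from the line, and it lies above it, which matches the top right corner pattern in the plot.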
In each of these plots, the points that fall beyond the regular boundaries of the data are the outliers.