Just as we can impute missing values, we can also impute outliers. An outlier can be replaced with the mean, the median, or zero. Since we are imputing values rather than dropping rows, no rows are lost. Here the median is the most appropriate choice because it is not affected by outliers.
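To see why the median is described as unaffected by outliers, here is a minimal sketch with made-up ages (the values, including the extreme 95, are assumptions chosen only for illustration). It compares how far the mean and the median move when a single extreme value is added.

```python
import numpy as np

# Hypothetical ages, not taken from the Titanic data
ages = np.array([22, 25, 27, 29, 31, 33, 35])
# Same ages plus one extreme value
ages_with_outlier = np.append(ages, 95)

# How far each statistic moves once the extreme value is present
mean_shift = ages_with_outlier.mean() - ages.mean()
median_shift = np.median(ages_with_outlier) - np.median(ages)

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
```

On this toy sample the mean moves by several years while the median moves by only one, which is the behaviour the text relies on.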
First, we will import NumPy as np, Pandas as pd, Seaborn as sns, and Pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Next, we will load the Titanic dataset
data = pd.read_csv('/cxldata/datasets/project/titanic/train.csv')
Now let's plot the Age attribute and observe the outliers
sns.boxplot(data['Age'])
plt.title("Box Plot before imputation")
plt.show()
Now we will try the different imputation methods. First, we will replace the outliers with the mean value and observe the dataset once again after imputation
train = data.copy()
q1 = train['Age'].quantile(0.25)
q3 = train['Age'].quantile(0.75)
iqr = q3-q1
Lower_tail = q1 - 1.5 * iqr
Upper_tail = q3 + 1.5 * iqr
m = np.mean(train['Age'])
# Replace every Age value outside the IQR fences with the mean
for i in train['Age']:
    if i > Upper_tail or i < Lower_tail:
        train['Age'] = train['Age'].replace(i, m)
sns.boxplot(train['Age'])
plt.title("Box Plot after mean imputation")
plt.show()
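The same IQR rule can also be written without an explicit loop. The following is a sketch only: the function name impute_outliers_iqr and the sample ages are hypothetical, and it uses a boolean mask with Series.mask in place of the loop-and-replace approach shown in this exercise.

```python
import pandas as pd

def impute_outliers_iqr(values: pd.Series, fill_value: float) -> pd.Series:
    """Replace values outside the 1.5 * IQR fences with fill_value."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_tail = q1 - 1.5 * iqr
    upper_tail = q3 + 1.5 * iqr
    # True where a value lies outside the fences; missing values compare as False
    is_outlier = (values < lower_tail) | (values > upper_tail)
    # Series.mask replaces the entries where the condition is True
    return values.mask(is_outlier, fill_value)

# Hypothetical ages with one extreme value
ages = pd.Series([22.0, 25.0, 27.0, 29.0, 31.0, 33.0, 35.0, 95.0])
imputed = impute_outliers_iqr(ages, ages.mean())
print(imputed)
```

Passing ages.median() or 0 as fill_value would give the other two variants. The function returns a new Series, so the original data is left unchanged.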
Next, we will try the median value
train = data.copy()
q1 = train['Age'].quantile(0.25)
q3 = train['Age'].quantile(0.75)
iqr = q3-q1
Lower_tail = q1 - 1.5 * iqr
Upper_tail = q3 + 1.5 * iqr
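# Use the pandas median, which skips missing values; np.median returns NaN if Age contains any
med = train['Age'].median()
# Replace every Age value outside the IQR fences with the median
for i in train['Age']:
    if i > Upper_tail or i < Lower_tail:
        train['Age'] = train['Age'].replace(i, med)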
sns.boxplot(train['Age'])
plt.title("Box Plot after median imputation")
plt.show()
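When the column being imputed contains missing values, how the median is computed matters. The sketch below uses a small made-up Series with one missing entry (an assumption for illustration; whether your copy of the Age column has missing values can be checked with train['Age'].isna().sum()). It compares NumPy's plain median, the pandas median, and NumPy's NaN-ignoring variant.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry
ages = pd.Series([22.0, np.nan, 30.0, 38.0])

numpy_median = np.median(ages.to_numpy())    # NaN propagates into the result
pandas_median = ages.median()                # pandas skips NaN by default
nan_median = np.nanmedian(ages.to_numpy())   # NumPy variant that ignores NaN

print(numpy_median, pandas_median, nan_median)
```

If the replacement value itself turns out to be NaN, the outliers are silently turned into missing values instead of being imputed, so it is worth printing the fill value before using it.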
Finally, we will try the zero value imputation
train = data.copy()
q1 = train['Age'].quantile(0.25)
q3 = train['Age'].quantile(0.75)
iqr = q3-q1
Lower_tail = q1 - 1.5 * iqr
Upper_tail = q3 + 1.5 * iqr
# Replace every Age value outside the IQR fences with zero
for i in train['Age']:
    if i > Upper_tail or i < Lower_tail:
        train['Age'] = train['Age'].replace(i, 0)
sns.boxplot(train['Age'])
plt.title("Box Plot after Zero value imputation")
plt.show()
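Besides comparing box plots by eye, the three strategies can be compared numerically. This is a rough sketch on made-up ages (the data and the names fill_values and summary are assumptions, not part of the exercise); it applies each fill value to the same outlier mask and tabulates a few summary statistics.

```python
import pandas as pd

# Hypothetical ages with one extreme value
ages = pd.Series([22.0, 25.0, 27.0, 29.0, 31.0, 33.0, 35.0, 95.0])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
is_outlier = (ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)

# One candidate replacement value per strategy
fill_values = {"mean": ages.mean(), "median": ages.median(), "zero": 0.0}

# One column per strategy, one row per summary statistic
summary = pd.DataFrame({
    name: ages.mask(is_outlier, fill).describe()
    for name, fill in fill_values.items()
})
print(summary.loc[["mean", "50%", "min", "max"]])
```

On this toy sample, zero imputation pulls the minimum down to 0 and lowers the average, while the mean and median replacements stay inside the range of typical ages.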
Compare the three box plots and observe that, after median imputation, the outliers in the Age attribute have been replaced by the median value.