 # Imputation

Just as we can impute missing values, we can also impute outliers. Common choices for the replacement value are the mean, the median, or zero. Because we replace the outlying values instead of dropping their rows, no observations are lost. The median is usually the most appropriate choice here because it is not affected by outliers.
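To see why the median is the more robust choice, here is a minimal sketch on a small made-up sample (the ages below are illustrative and are not taken from the Titanic data). Adding a single extreme value moves the mean considerably, while the median barely changes.

```python
import numpy as np

# Hypothetical ages, used only for illustration
ages = np.array([22, 25, 27, 29, 31])
# The same sample with one obvious outlier appended
ages_with_outlier = np.append(ages, 95)

# How far does each statistic move once the outlier is added?
mean_shift = np.mean(ages_with_outlier) - np.mean(ages)
median_shift = np.median(ages_with_outlier) - np.median(ages)

print(f"mean moves by {mean_shift:.2f}, median moves by {median_shift:.2f}")
```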

INSTRUCTIONS
• First, we will import `NumPy` as `np`, `pandas` as `pd`, `seaborn` as `sns`, and `pyplot` as `plt`

``````
import pandas as <<your code goes here>>
import numpy as <<your code goes here>>
import seaborn as <<your code goes here>>
import matplotlib.pyplot as plt
``````
• Next, we will load the `Titanic` dataset

``````
data = pd.read_csv('/cxldata/datasets/project/titanic/train.csv')
``````
• Now let's plot the `Age` attribute and observe the outliers

``````
# Pass the column as x explicitly so the plot is horizontal in every seaborn version
sns.boxplot(x=data['Age'])
plt.title("Box Plot before imputation")
plt.show()
``````
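By default, the box plot places its whiskers at 1.5 × IQR beyond the quartiles, so the points it marks as outliers can also be listed numerically. The following is a minimal sketch on a made-up series (the `ages` values are hypothetical; on the real data you would use `data['Age']` instead).

```python
import pandas as pd

# Hypothetical ages with one missing value, used only for illustration
ages = pd.Series([2, 21, 24, 28, 30, 35, 38, 40, 71, 80, None])

# quantile() ignores missing values
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Comparisons with NaN are False, so missing values are never flagged
outliers = ages[(ages < lower) | (ages > upper)]
print(f"fences: [{lower}, {upper}]")
print(outliers.tolist())
```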
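• Now we will try several imputation methods. In each case, a value counts as an outlier if it lies more than 1.5 × IQR below the first quartile or above the third quartile. First, we will replace the outliers with the `mean` value and plot the attribute again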

``````
train = data.copy()
q1 = train['Age'].quantile(0.25)
q3 = train['Age'].quantile(0.75)
iqr = q3 - q1
Lower_tail = q1 - 1.5 * iqr
Upper_tail = q3 + 1.5 * iqr
m = np.nanmean(train['Age'])  # mean of the non-missing ages
# Replace every age outside the 1.5 * IQR fences with the mean
outliers = (train['Age'] > Upper_tail) | (train['Age'] < Lower_tail)
train.loc[outliers, 'Age'] = m
sns.boxplot(x=train['Age'])
plt.title("Box Plot after mean imputation")
plt.show()
``````
• Next, we will try the `median` value

``````
train = data.copy()
q1 = train['Age'].quantile(0.25)
q3 = train['Age'].quantile(0.75)
iqr = q3 - q1
Lower_tail = q1 - 1.5 * iqr
Upper_tail = q3 + 1.5 * iqr
# np.median returns NaN if the column has missing values, so use np.nanmedian
med = np.nanmedian(train['Age'])
# Replace every age outside the 1.5 * IQR fences with the median
outliers = (train['Age'] > Upper_tail) | (train['Age'] < Lower_tail)
train.loc[outliers, 'Age'] = med
sns.boxplot(x=train['Age'])
plt.title("Box Plot after median imputation")
plt.show()
``````
• Finally, we will try the `zero` value imputation

``````
train = data.copy()
q1 = train['Age'].quantile(0.25)
q3 = train['Age'].quantile(0.75)
iqr = q3 - q1
Lower_tail = q1 - 1.5 * iqr
Upper_tail = q3 + 1.5 * iqr
# Replace every age outside the 1.5 * IQR fences with zero
outliers = (train['Age'] > Upper_tail) | (train['Age'] < Lower_tail)
train.loc[outliers, 'Age'] = 0
sns.boxplot(x=train['Age'])
plt.title("Box Plot after Zero value imputation")
plt.show()
``````

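Compare each box plot with the original one to see how the extreme ages have been replaced. The median is generally the preferred replacement because, unlike the mean, it is not pulled toward the outliers themselves.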
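The three imputation steps differ only in the replacement value, so they can be folded into one helper. This is a minimal sketch, not part of the original exercise: the function name `impute_outliers`, its `strategy` parameter, and the sample `ages` are all hypothetical. As in the steps above, the mean and median are computed from the column before any value is replaced.

```python
import numpy as np
import pandas as pd

def impute_outliers(series, strategy="median"):
    """Return a copy of `series` with 1.5 * IQR outliers replaced."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # Series.mean() and Series.median() skip missing values by default
    fill_values = {
        "mean": series.mean(),
        "median": series.median(),
        "zero": 0.0,
    }
    fill = fill_values[strategy]
    is_outlier = (series < lower) | (series > upper)
    # mask() replaces the entries where the condition is True
    return series.mask(is_outlier, fill)

# Hypothetical ages with one missing value, used only for illustration
ages = pd.Series([2, 21, 24, 28, 30, 35, 38, 40, 71, 80, np.nan])
for strategy in ("mean", "median", "zero"):
    print(strategy, impute_outliers(ages, strategy).tolist())
```

On the Titanic data this would be called as `train['Age'] = impute_outliers(train['Age'], "median")`. Missing ages are left untouched, since a comparison with `NaN` is never true.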

