
Transforming values

Transforming variables can also reduce the influence of outliers. The transformed values show less variation caused by extreme values. Common transformation methods include:

Scaling
Log transformation
Cube root transformation
Box-Cox transformation

These techniques rescale or compress the values in the dataset. If the data is skewed or has many extreme values, the nonlinear methods (log, cube root, and Box-Cox) can bring its distribution closer to normal. Note that no data is lost, because every observation is kept. However, these techniques do not always give the best results.
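As a rough illustration of this compression, the sketch below compares how far the largest value sits from the median before and after a log and a cube root transformation. It reuses the ages from the exercise dataset; the max-to-median ratio is only an informal measure chosen for this example, not a standard outlier test.

```python
import numpy as np

# Ages from the exercise dataset; 99 and 98 are the extreme values.
ages = np.array([20, 21, 19, 99, 23, 18, 98], dtype=float)

def max_to_median(values):
    """Informal measure (an assumption of this sketch): how many times
    larger the maximum is than the median."""
    return values.max() / np.median(values)

ratio_original = max_to_median(ages)
ratio_cbrt = max_to_median(np.cbrt(ages))
ratio_log = max_to_median(np.log(ages))

print(f"original : {ratio_original:.2f}")
print(f"cube root: {ratio_cbrt:.2f}")
print(f"log      : {ratio_log:.2f}")
```

A smaller ratio means the extreme values have been pulled closer to the typical ones, while all seven observations are still present.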

INSTRUCTIONS

First, import Pandas as pd, NumPy as np, Seaborn as sns, and Pyplot as plt, then import preprocessing from scikit-learn and import SciPy.

import pandas as <<your code goes here>>
import numpy as <<your code goes here>>
import seaborn as <<your code goes here>>
import matplotlib.pyplot as <<your code goes here>>
from sklearn import preprocessing
import scipy.stats  # binds scipy and makes sure the stats submodule is loaded

Next, we create a dataset.

data = {'Name':['Tom', 'Dick', 'Harry', 'Jack', 'Alex', 'Mike', 'John'],
        'Age':[20, 21, 19, 99, 23, 18, 98]}
orig = pd.DataFrame(data)

Now, let's plot the dataset and observe the outliers

sns.boxplot(x=orig['Age'])  # plot the original data; train is not defined yet
plt.title("Box Plot of the original data")
plt.show()

First, we apply the scaling method to the Age column.

train = orig.copy()
scaler = preprocessing.StandardScaler()
train['Age'] = scaler.fit_transform(train['Age'].values.reshape(-1,1))
sns.boxplot(x=train['Age'])
plt.title("Box Plot after scaling")
plt.show()
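To see what the scaler does to the extreme values, the following sketch checks that StandardScaler matches a manual z-score (subtract the mean, divide by the population standard deviation) and that the ordering of the values is unchanged. The variable names are illustrative and not part of the exercise.

```python
import numpy as np
from sklearn import preprocessing

ages = np.array([20, 21, 19, 99, 23, 18, 98], dtype=float)

# StandardScaler expects a 2-D array, hence the reshape; ravel() flattens the result.
scaled = preprocessing.StandardScaler().fit_transform(ages.reshape(-1, 1)).ravel()

# Manual z-score; NumPy's std() uses the population formula (ddof=0) by default.
manual = (ages - ages.mean()) / ages.std()

same_as_manual = np.allclose(scaled, manual)
order_preserved = np.array_equal(np.argsort(ages), np.argsort(scaled))

print(same_as_manual, order_preserved)
```

Because this rescaling is linear, it changes the units of the column but not the shape of its distribution, so the box plot looks the same apart from the axis values.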

Next, we apply the log transformation.

train = orig.copy()
train['Age'] = np.log(train['Age'])
sns.boxplot(x=train['Age'])
plt.title("Box Plot after log transformation")
plt.show()
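The ages in this exercise are all positive, so np.log is safe here. As a hedged aside, the sketch below shows what happens when a column contains zeros, and one common workaround, np.log1p, which computes log(1 + x). The sample values are made up for illustration.

```python
import numpy as np

# Hypothetical values including a zero, which the plain log cannot handle.
values = np.array([0.0, 1.0, 10.0, 100.0])

# np.log(0) evaluates to -inf and emits a warning, silenced here for the demo.
with np.errstate(divide="ignore"):
    plain_log = np.log(values)

# log1p shifts every value by 1 before taking the log, so zero maps to zero.
shifted_log = np.log1p(values)

print(plain_log)
print(shifted_log)
```

Whether the shift is appropriate depends on the data; it is one option, not a requirement of the log transformation.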

Now, we will use the cube root transformation method

train = orig.copy()
train['Age'] = (train['Age']**(1/3))
sns.boxplot(x=train['Age'])
plt.title("Box Plot after cube root transformation")
plt.show()
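Raising to the power 1/3 works for the positive ages used here. If a column can contain negative numbers, the power operator returns NaN for them, whereas np.cbrt computes the real cube root. The sketch below uses a small made-up Series to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical values with negatives, unlike the Age column in the exercise.
values = pd.Series([-27.0, -8.0, 8.0, 27.0])

# A fractional power of a negative float is undefined in real arithmetic,
# so the first two results are NaN.
with np.errstate(invalid="ignore"):
    power_result = values ** (1 / 3)

# np.cbrt returns the real cube root for negative inputs as well.
cbrt_result = np.cbrt(values)

print(power_result.tolist())
print(cbrt_result.tolist())
```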

Finally, we apply the Box-Cox transformation.

train = orig.copy()
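# With lmbda=None, boxcox estimates lambda from the data and returns it as the second value
train['Age'], fitted_lambda = scipy.stats.boxcox(train['Age'], lmbda=None)
sns.boxplot(x=train['Age'])
plt.title("Box Plot after Box-Cox transformation")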
plt.show()
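The second return value of boxcox is the fitted lambda. As a minimal sketch, it can be kept and passed to scipy.special.inv_boxcox to map transformed values back to the original scale, which illustrates that the transformation does not discard information. The variable names below are illustrative.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Box-Cox requires strictly positive input, which holds for these ages.
ages = np.array([20, 21, 19, 99, 23, 18, 98], dtype=float)

# With no lmbda argument, boxcox estimates lambda by maximum likelihood.
transformed, fitted_lambda = stats.boxcox(ages)

# Apply the inverse transformation with the same lambda to recover the ages.
restored = inv_boxcox(transformed, fitted_lambda)

print("fitted lambda:", fitted_lambda)
print("round trip ok:", np.allclose(restored, ages))
```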
