
Isolation Forest

Isolation Forest is an unsupervised anomaly detection algorithm. It belongs to the family of decision tree ensembles and is similar in principle to Random Forest.
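The intuition behind the trees can be illustrated with a small sketch. It is not scikit-learn's implementation, only a simplified one-dimensional version of the idea: keep splitting the data at a random value and count how many splits it takes until a chosen point is alone. The helper name `isolation_depth` and the synthetic data are assumptions made for this illustration.

```python
import numpy as np

def isolation_depth(values, target_index, rng):
    """Count the random splits needed to isolate values[target_index]."""
    idx = np.arange(len(values))
    depth = 0
    while len(idx) > 1:
        lo, hi = values[idx].min(), values[idx].max()
        if lo == hi:
            # Remaining points are identical, so no further split is possible
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target point
        if values[target_index] < split:
            idx = idx[values[idx] < split]
        else:
            idx = idx[values[idx] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
# 200 typical values around 10, plus one extreme value at index 200
values = np.concatenate([rng.normal(10.0, 1.0, size=200), [100.0]])

trials = 200
outlier_avg = np.mean([isolation_depth(values, 200, rng) for _ in range(trials)])
inlier_avg = np.mean([isolation_depth(values, 0, rng) for _ in range(trials)])
print("average splits to isolate the extreme value:", outlier_avg)
print("average splits to isolate a typical value:", inlier_avg)
```

The extreme value should need far fewer splits on average than a typical value. That short path is what an Isolation Forest uses as its signal for an outlier.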

It labels each data point as either an outlier or an inlier, and it is often a good fit for high-dimensional data.
It builds decision trees from random splits and isolates the outliers, which tend to need fewer splits to separate from the rest of the data.
A prediction of -1 means that the data point is an outlier. A prediction of 1 means that the data point is not an outlier.
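As a quick check of the -1 / 1 labelling, here is a minimal sketch on synthetic data (the numbers are made up for illustration and are not from the Titanic dataset). It assumes a scikit-learn version in which `IsolationForest` is constructed without the `behaviour` argument.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# 200 typical values around 10, followed by one extreme value
values = np.concatenate([rng.normal(10.0, 1.0, size=200), [100.0]])

iso = IsolationForest(random_state=0, contamination='auto')
# fit_predict expects a 2D array of shape (n_samples, n_features)
labels = iso.fit_predict(values.reshape(-1, 1))

print(np.unique(labels))  # the distinct labels assigned by the model
print(labels[-1])         # label of the extreme value
```

The only labels returned are -1 and 1, and the extreme value at the end of the array should be labelled -1.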

In this example, we will use the Titanic dataset to find outliers in the 'Fare' column.

INSTRUCTIONS

First, we will import the IsolationForest class from sklearn.ensemble, NumPy as np, and Pandas as pd

from sklearn.ensemble import <<your code goes here>>
import numpy as <<your code goes here>>
import <<your code goes here>> as pd

Next, we will load the dataset using the read_csv function from Pandas

train = pd.<<your code goes here>>('/cxldata/datasets/project/titanic/train.csv')

Now, we will define a function named iso_forest that fits an Isolation Forest and prints how many data points were labelled as outliers (-1) and as inliers (1)

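def <<your code goes here>>(df):
    # Note: `behaviour` was deprecated in scikit-learn 0.22 and removed in 0.24;
    # drop that argument if you are running a newer version.
    iso = IsolationForest(behaviour='new', random_state=1, contamination='auto')
    # fit_predict expects a 2D array, so reshape the single column to (n_samples, 1)
    preds = iso.fit_predict(df.values.reshape(-1, 1))
    data = pd.DataFrame()
    data['cluster'] = preds  # -1 = outlier, 1 = not an outlier
    # Count how many data points received each label
    print(data['cluster'].value_counts().sort_values(ascending=False))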

Finally, we will call this function on the 'Fare' column of our dataset
```
iso_forest(train['Fare'])
```

From the result, we can see that 182 data points in the 'Fare' column are labelled as outliers (-1).
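The function above only prints the counts. If you also want to see which values were flagged, one possible approach is to keep the predictions and use them as a boolean mask. The sketch below uses a small hypothetical Series named `fares` as a stand-in for train['Fare'], so the values and the result are illustrative only, and it omits the `behaviour` argument on the assumption of a recent scikit-learn version.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical stand-in for train['Fare']; these values are made up
fares = pd.Series([7.25, 8.05, 7.90, 13.00, 26.00, 8.66,
                   10.50, 7.75, 9.50, 14.45, 11.13, 512.33])

iso = IsolationForest(random_state=1, contamination='auto')
# Keep the original index so the labels line up with the fares
labels = pd.Series(iso.fit_predict(fares.to_numpy().reshape(-1, 1)),
                   index=fares.index)

outliers = fares[labels == -1]  # values labelled as outliers
inliers = fares[labels == 1]    # values labelled as not outliers
print(outliers)
```

With the real data, replacing `fares` with train['Fare'] should give the flagged fares directly, which makes it easier to decide how to treat them.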
