Treating Outliers in Python


Isolation Forest

Isolation Forest is an unsupervised anomaly detection algorithm that belongs to the family of ensemble decision-tree methods and is similar in principle to Random Forest.

  1. It classifies each data point as either an outlier or not an outlier, and works well with very high-dimensional data.
  2. It is built on decision trees and isolates the outliers: because outliers are few and different from the rest of the data, they tend to be separated in fewer random splits.
  3. If the result is -1, the data point is an outlier. If the result is 1, the data point is not an outlier.
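
To make the -1 / 1 labelling concrete, here is a minimal sketch on synthetic data (not the Titanic dataset). The array values and the variable names `values` and `labels` are assumptions made up for illustration; only `IsolationForest` and its `fit_predict` method come from scikit-learn.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (illustrative only): 100 values near 10, plus one extreme value.
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=10, scale=1, size=100), 1000.0)

# IsolationForest expects a 2-D array of shape (n_samples, n_features).
iso = IsolationForest(random_state=1, contamination='auto')
labels = iso.fit_predict(values.reshape(-1, 1))

# Every label is either -1 (outlier) or 1 (not an outlier).
print(np.unique(labels))
# The extreme value is the last element of the array.
print(labels[-1])
```

The extreme value is cut off from the rest after very few random splits, which is why it receives the label -1.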

In this example, we will use the Titanic dataset to find outliers in the 'Fare' column.

INSTRUCTIONS
  • First, we will import the IsolationForest class from sklearn, NumPy as np, and Pandas as pd

    from sklearn.ensemble import <<your code goes here>>
    import numpy as <<your code goes here>>
    import <<your code goes here>> as pd
    
  • Next, we will load the dataset using the read_csv function from Pandas

    train = pd.<<your code goes here>>('/cxldata/datasets/project/titanic/train.csv')
    
  • Now, we will define a function named iso_forest that detects the outliers using this method

    def <<your code goes here>>(df):
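        # Note: the behaviour parameter was deprecated in scikit-learn 0.22 and removed in 0.24; omit it on newer versions
        iso = IsolationForest(behaviour='new', random_state=1, contamination='auto')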
        preds = iso.fit_predict(df.values.reshape(-1,1))
        data = pd.DataFrame()
        data['cluster'] = preds
        print(data['cluster'].value_counts().sort_values(ascending=False))
    
  • Finally, we will call this function using our dataset

    iso_forest(train['Fare'])
    

From the result, we can see that there are 182 outliers in the 'Fare' column of the dataset.
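
If the flagged values themselves are needed rather than only the counts, one option is to return the rows labelled -1. The sketch below is an assumption-laden illustration: `flag_outliers` is a hypothetical helper that is not part of the exercise, and the fares are made-up values rather than the Titanic data. It also omits the `behaviour` argument, so it assumes a recent scikit-learn release.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def flag_outliers(series, random_state=1):
    """Hypothetical helper: return the entries of `series` labelled -1."""
    values = series.dropna()  # drop missing values before fitting
    iso = IsolationForest(random_state=random_state, contamination='auto')
    # Reshape to (n_samples, 1) because the model expects 2-D input.
    labels = iso.fit_predict(values.to_numpy().reshape(-1, 1))
    return values[labels == -1]

# Made-up fares for illustration: mostly small values plus one very large one.
fares = pd.Series(
    [7.25, 7.9, 8.05, 8.05, 7.75, 8.5, 7.9, 8.1, 7.8, 8.3] * 5 + [512.33],
    name='Fare',
)
outliers = flag_outliers(fares)
print(outliers)
```

With the real data, the same helper could be called as `flag_outliers(train['Fare'])`. With contamination='auto', some values at the edge of the main group may also be flagged, so the result is worth inspecting rather than treating as final.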
