Treating Outliers in Python

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

As the name suggests, DBSCAN is a density-based, unsupervised machine learning algorithm. It takes multi-dimensional data as input and groups the points into clusters according to two model parameters: epsilon (the neighbourhood radius) and minimum samples (the number of points required within that radius to form a dense region). Points that do not belong to any dense region are labelled as noise, which is how the algorithm identifies outliers.
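
To make the roles of the two parameters concrete, here is a minimal sketch on made-up data (the points, `eps` and `min_samples` values are illustrative assumptions, not taken from the Iris exercise). Two tight groups of points form clusters, while a single isolated point receives the label -1, which scikit-learn uses for noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up data: two tight groups of four points plus one isolated point.
points = np.array([
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [9.0, 0.0],  # far away from both groups
])

# eps is the neighbourhood radius; min_samples is how many points
# (counting the point itself) must lie within eps to form a dense region.
model = DBSCAN(eps=0.5, min_samples=3).fit(points)
labels = model.labels_
print(labels)  # [ 0  0  0  0  1  1  1  1 -1]

# Points labelled -1 were not assigned to any cluster.
noise_mask = labels == -1
print(points[noise_mask])  # [[9. 0.]]
```

Each group has four points within 0.5 of one another, so both satisfy `min_samples=3`; the isolated point has no neighbours within `eps` and is therefore treated as an outlier.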

Scikit-learn provides a DBSCAN implementation in its sklearn.cluster module. The algorithm has many real-life applications in outlier detection; for example, it can be used for fraud detection in credit card transactions. Here, we will demonstrate how to detect outliers in the Iris dataset.

INSTRUCTIONS
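import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Load the Iris dataset and preview the first rows
df = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")
print(df.head())

# Cluster on two features; eps is the neighbourhood radius and
# min_samples is the number of points needed to form a dense region
data = df[["sepal_length", "sepal_width"]]
model = DBSCAN(eps=0.4, min_samples=10).fit(data)

# Colour each point by its cluster label (-1 marks noise)
colors = model.labels_
plt.scatter(data["sepal_length"], data["sepal_width"], c=colors)
plt.xlabel("sepal_length")
plt.ylabel("sepal_width")
plt.show()

# Points that DBSCAN did not assign to any cluster are the outliers
outliers = data[model.labels_ == -1]
print(outliers)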
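
Which points are flagged as outliers depends strongly on the chosen parameters. The sketch below is an illustration rather than part of the exercise: it counts the noise points for a few `eps` values while keeping `min_samples` at 10. It assumes that scikit-learn's bundled copy of Iris (`load_iris`) can stand in for the CSV above, and the `eps` values tried are arbitrary.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

# First two columns of the bundled Iris data: sepal length and sepal width (cm).
# Assumption: this copy is equivalent to the CSV used in the exercise.
X = load_iris().data[:, :2]

noise_counts = {}
for eps in (0.2, 0.4, 0.6):  # illustrative values only
    labels = DBSCAN(eps=eps, min_samples=10).fit_predict(X)
    # Label -1 marks points that belong to no cluster.
    noise_counts[eps] = int((labels == -1).sum())
    print(f"eps={eps}: {noise_counts[eps]} points labelled as outliers")
```

With `min_samples` fixed, a larger `eps` can only keep or reduce the number of noise points, so comparing a few values like this is a quick way to see how sensitive the result is before settling on one.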