Hi, welcome to the chapter "Classification" of the AI and ML Course for Managers.
So far in this course, we have discussed only supervised regression problems, like
predicting housing prices in California or
predicting the area burned by forest fires.
Let’s understand supervised classification problems. In classification, the algorithm learns from the labeled data given to it, builds the model and then this model is used to classify new observations into classes. Let us see some examples.
Say our task is to build a model which classifies a given image into fruits like Banana, Apple, and Kiwi. Here we have labeled images for training the model. Each image has a label such as Apple, Banana or Kiwi. Since we have labeled data, such a classification task is a supervised machine learning task.
We feed these labeled images to the algorithm. The algorithm learns from these labeled images and
builds the model.
Then this model can be used to classify new images into one of the fruits.
So how does the training work?
We first represent the images in tabular form, that is, in the form of rows and columns. We will soon revisit how to convert an image into tabular form.
Then we split the data into training and test sets in an 80:20 ratio.
Then, using the training set, we build the model, and
using the test set we evaluate the performance of the model.
Finally, we deploy the model to production, where it classifies new, unseen images into one of the fruits.
Let’s see one more example. Say we have to build a model which classifies the images into dogs or cats. Each image in the training set has a label such as a cat or a dog. We feed these labeled images to the algorithm. The algorithm learns from the features of these images and builds the model. Since here we are feeding labeled images to the algorithm, such a classification is a supervised machine learning task.
Let’s revisit the MNIST dataset from the topic “Representing data”. The MNIST dataset consists of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. As you can see, each image is labeled with the digit it represents. The MNIST dataset is considered the “Hello World” of machine learning. Whenever people come up with a new classification algorithm, they want to see its performance on the MNIST dataset.
Let’s build a classifier or model using the MNIST dataset which classifies an image into one of the digits between 0 and 9. For example, here the model classifies the image as class 5. This type of classification is known as multiclass classification, where the model classifies an object into one of multiple classes. How do we train such a model?
We have total seventy thousand images in the MNIST dataset.
Now, the question is how do we represent images in the tabular form so that we can feed them to the algorithm.
Also, what are the features and the label?
There are seventy thousand images in the MNIST dataset, so the number of instances is seventy thousand.
Each image has a label associated with it.
Let’s see how we represent these images in tabular form.
Each image in MNIST is 28 by 28 pixels. Also, each image is grayscale, or in other words black-and-white. Such an image is basically a two-dimensional array of numbers: black is represented as 0, white is represented as 255, and the other shades are in between. So, an image is a two-dimensional matrix of numbers. There are many ways of converting a two-dimensional image into a one-dimensional array or list of numbers. For now, we will take a simple approach whereby we append all the rows one after another and make a single-dimensional array.
A 28 by 28 image has a total of 784 pixels, so it can be represented as an array of 784 numbers. Each row will therefore have 784 columns, hence 784 features.
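If you are curious how this flattening looks in code, here is a minimal Python sketch (the pixel values below are random, purely for illustration):

```python
import numpy as np

# A 28 x 28 grayscale image is a 2-D array of pixel intensities
# between 0 (black) and 255 (white).
image = np.random.randint(0, 256, size=(28, 28))

# Append the rows one after another to get a single 1-D array of 784 numbers.
flat = image.reshape(784)   # same result as image.flatten()

print(flat.shape)           # (784,) -> 784 features per image
```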
The MNIST dataset can thus be represented in tabular form, where each row represents one hand-drawn digit and the value of each pixel in the drawing is a feature. Now we know the features and the label, and we have represented them in tabular form.
We feed this data to the algorithm and the algorithm builds the model.
Let’s simplify the problem for now and
first train a binary classifier which classifies the given image into
class 5, if the image is of digit 5 else
into class “not 5”.
A binary classifier classifies an object into one of the two groups.
Let’s name this classifier as “5-detector” for now. As we have learnt, machine learning algorithms prefer to work with numbers. So we will have to convert features and labels into numbers before feeding them to the algorithm. Here features are images and we have already seen how to represent images in the form which can be fed to the algorithm. The output classes “5” and “not-5” are not numerical. We will have to convert these two classes into numerical values.
How do we convert classes “5” and “not-5” into numbers?
We can either use one hot encoding or assign numbers to these classes.
Let’s assign numbers 0 and 1 to these two classes.
0 denotes the class “not-5” and
1 denotes the class 5. Now the label also contains numbers.
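In code, this label conversion could look like the following minimal sketch (the digit labels below are made up, just for illustration):

```python
import numpy as np

# Original digit labels (0-9) for a handful of images.
y = np.array([5, 0, 4, 1, 9, 2, 5, 3])

# 1 means "5", 0 means "not 5" -- the two classes of our 5-detector.
y_5 = (y == 5).astype(int)

print(y_5)   # [1 0 0 0 0 0 1 0]
```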
We can now feed the features and labels to the machine learning algorithm, and the algorithm will build the model for the 5-detector.
Then this 5-detector model can be used to classify new, unseen images into one of the two classes, “5” or “not 5”.
Before training the 5-detector, the question which comes to mind is how will we validate this model? Until we have a way to validate the model, we will not have confidence in the model.
So which performance measure should we use to evaluate the performance of 5-detector?
As we have seen earlier, in the case of regression, the preferred performance measure is RMSE - root mean square error.
Can we use root mean square as a performance measure for classification?
The answer is no.
Because a root mean square error of, say, one thousand seven hundred and eighteen will not make any sense for classification.
In classification, we are interested in classifying an object into one of the categories. Hence we need different performance measures for classification. Let’s see different performance measures for the classification task.
In classification one of the performance measures is accuracy.
In the case of the 5-detector, accuracy is the fraction of images classified correctly. Here, a total of 2 predictions are correct out of three ...
so the accuracy is 66.67 percent.
Here is a question on accuracy. We have three images showing 4, 3 and 5. The 5-detector is some model which has detected a 5 only in the second case. What is the accuracy of the 5-detector in this case? Pause the video for a minute and calculate the accuracy.
Here only one prediction is correct out of three, so the accuracy is 33.33%. Though accuracy sounds like a natural measure in everyday life, it is not such a good measure of performance in classification tasks. Let’s understand this with an example.
Imagine we have a very dumb classifier which classifies every input image as “not 5”. Its accuracy for the given example is 3 out of 4, which is 75%. Let us see the accuracy of this dumb classifier on the MNIST dataset.
In the MNIST dataset, around 90% of images are of digits other than 5.
Remaining 10% images are of digit 5.
Since this dumb classifier classifies every image as “Not 5”, it will classify
90% images correctly but
10% images of digit 5 incorrectly
So the accuracy of this dumb classifier will be 90%.
This is simply because only 10% of the images are of 5. There is a huge class imbalance between the two classes, and such a dataset is called a skewed dataset.
If you always guess that an image is not 5, you will be right about 90% of the time. Accuracy is a reasonable measure only when the classes in the data are nearly balanced; this demonstrates why accuracy is generally not the preferred performance measure for classifiers, especially for skewed datasets.
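If you are curious, here is a minimal sketch of this arithmetic in code (the 90/10 split is assumed, matching the discussion above):

```python
import numpy as np

# Roughly 90% of labels are "not 5" (0) and 10% are "5" (1), as in MNIST.
y_true = np.array([0] * 90 + [1] * 10)

# The "dumb" classifier always predicts "not 5".
y_pred = np.zeros(100, dtype=int)

accuracy = (y_true == y_pred).mean()
print(accuracy)   # 0.9 -> 90% accuracy despite never detecting a single 5
```

Let’s see one more example.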
Let’s say we have to build a model which predicts if someone has cancer or not. We have 100 people,
out of which 95 people do not have cancer and
5 people have cancer.
Say the model is very bad and predicts every case as no cancer. In such a case, it will classify
95 non-cancerous people correctly but
5 cancerous patients as non-cancerous. Now even though the model is pretty bad at predicting cancer, the accuracy of
such a bad model is 95%.
This is because there are two classes, “no cancer” and “have cancer”, having 95 and 5 people respectively. There is a huge class imbalance and the dataset is skewed. This again demonstrates why accuracy is generally not a preferred performance measure for classification, especially when your dataset is skewed.
Let’s see one more example. Say you have built a spam classifier which classifies a given email as spam or ham with
90% accuracy. Do you think this classifier is good?
According to an analysis by Symantec in 2009, 90% of all the emails sent on the internet are spam. With this analysis, you know that only 10% of emails are not spam and that you are dealing with a skewed dataset. Again, you would not trust this classifier on the basis of accuracy alone.
Here is a question. Say your friend has built a classifier which classifies a given image as a
Male or
a female. Your friend is really confident with this classifier as
its accuracy is 99%. By now you know accuracy is not a good performance measure for classification. What ..
….questions will you ask your friend to make sure his classifier is really good? Pause the video for a minute and think about all the questions.
You will ask what was the percentage of males and
females images in the dataset?
Your friend replies that in the dataset, 80% of the images were of males and
20% images were of females. Now you surely know the dataset is skewed and
you will not trust his classifier.
Let’s explore other better performance measures for classification.
A much better way to look into the performance of the classifier is to look at the confusion matrix. The general idea of the confusion matrix is to count the number of times instances were correctly and incorrectly classified.
Let’s see the confusion matrix of the 5-detector. The confusion matrix contains the counts of actual versus predicted values. Each row in the confusion matrix represents an actual class while each column represents a predicted class, and each cell in the matrix contains a count. In this matrix, there are a total of 5 cases where not-5s are correctly classified as not-5s. Similarly, there are a total of 3 cases where 5s are correctly classified as 5s. There are 2 cases where 5s are classified as not-5s, and one case where a not-5 is classified as a 5. If a dataset has 10 classes, the confusion matrix will have 10 rows and 10 columns.
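If you would like to reproduce such counts yourself, here is a minimal sketch using scikit-learn (the individual labels and predictions below are illustrative, chosen to match the counts above):

```python
from sklearn.metrics import confusion_matrix

# 1 = "5", 0 = "not 5"
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # 6 not-5s, 5 fives
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]   # what the model predicted

print(confusion_matrix(y_true, y_pred))
# [[5 1]   row 0: actual not-5s -> 5 correct, 1 wrongly called a 5
#  [2 3]]  row 1: actual 5s     -> 2 missed,  3 correctly called a 5
```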
Here is the illustrated view of confusion matrix. Let’s understand the terms associated with the confusion matrix.
True negatives are the cases in which images were correctly classified as not 5s. It includes the cases when the actual images were
of not-5s and the model also classified them as not-5s.
False Positives are the cases in which images were wrongly classified as 5s. It includes cases when the actual images were
of not-5s but the model classified them as 5s
False Negatives are the cases in which the images were wrongly classified as non-5s. It includes cases when the actual images
were of 5s but the model classified them as not-5s.
True Positives are the cases in which the images are correctly classified as 5s. It includes cases when the actual images
were of 5s and the model also predicted them as 5s. Though the confusion matrix gives us a lot of information, we require more concise metrics ...
such as Precision and Recall. Let’s understand Precision and Recall.
In real life, precision is marked by a lack of mistakes.
For the 5-detector, precision is the measure that tells us what proportion of images that were classified as 5s were actually 5.
Here the 5-detector classified total four images as 5s and it was correct only ...
… three times.
So the precision is 3 out of 4. In other words, precision is about being precise. It measures how many classifications were correct out of all the classifications made. So if a model classifies only one image as a 5 but classifies it correctly, its precision will be 100%.
In real life, recall means to remember something learnt in the past.
For 5-detector, recall is the measure of what proportion of images that were actually 5 were classified as class 5.
Here there are total five images of digit 5 and
the 5-detector classified only three of them as 5s.
So the recall is 3 out of 5. In other words, recall is about classifying all the images of 5 as 5s. Let’s take one more example to understand precision and recall.
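Before that, here is a minimal sketch computing both numbers from the 5-detector counts above:

```python
# Counts taken from the 5-detector confusion matrix discussed earlier.
TP = 3   # 5s correctly classified as 5
FP = 1   # not-5s wrongly classified as 5
FN = 2   # 5s wrongly classified as not-5

precision = TP / (TP + FP)   # 3 / 4 = 0.75
recall    = TP / (TP + FN)   # 3 / 5 = 0.60

print(precision, recall)
```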
Here is the confusion matrix of the model which predicts if someone has cancer or not. Here precision is the measure which tells us
what proportion of patients that the model predicted as having cancer
actually have cancer. Here the model predicts that ...
110 patients have cancer but out of these 110 predictions only ...
… 100 predictions are correct.
So the precision is 100 out of 110 which is approximately 0.91.
Recall is the measure which tells us what proportion of patients who...
actually had cancer were diagnosed by the model as...
having cancer. Here total...
105 patients have cancer, and the model predicts that...
only 100 of them have cancer. So the recall is...
100 by 105 which is approximately 0.95.
Instead of computing precision and recall every time we train a classifier, we prefer a single metric which combines both precision and recall. This single metric is the F1 score.
The F1 score favors classifiers which have similar precision and recall.
If, between precision and recall, one number is really small,
then the F1 score will be closer to the smaller number.
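For reference, the F1 score is the harmonic mean of precision and recall. Here is a tiny sketch showing why it stays close to the smaller of the two numbers:

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# When precision and recall are similar, F1 is close to both.
print(f1_score(0.80, 0.80))   # 0.8

# When one of them is very small, F1 stays close to the smaller number.
print(f1_score(0.95, 0.10))   # ~0.18, dragged down by the low recall
```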
As we have learned, the F1 score favors classifiers which have similar precision and recall. But depending on the problem, we may want high precision, high recall, or equal precision and recall (the last being the case where we would want the F1 score as our performance measure). Let’s see some examples.
Say we have to build a model which detects if a video is safe for kids or not.
Would you prefer this model to have high precision or high recall for this task? Just take a moment and think what will be the meaning of high precision and high recall.
High precision means if the model classifies video 4 and video 6 as safe for kids, they are actually safe for kids.
With high precision, we are okay if the model is not able to classify video 2 as safe for kids, but whichever videos it does classify as safe for kids are actually safe.
High recall means the model will try to maximize the number of safe videos that it classifies as safe.
With high recall, there is a chance that the model may classify video 3 as “safe for kids”. This is because recall is more about classifying all the “safe for kids” videos as “safe” rather than classifying every video correctly.
So will you prefer high precision or high recall for this task?
We would prefer a model which has high precision. It is okay if the model rejects many good videos but keeps only really safe ones, instead of
having high recall and classifying a few really bad videos as safe for kids.
Let’s see one more example. Say we have to build a model which detects shoplifters on the basis of surveillance images. In case someone is marked as a shoplifter, a security guard would manually check.
Would you prefer this model to have high precision or high recall?
We would prefer the model to have high recall even if the precision is low because our goal is to catch almost all the shoplifters.
With high recall, the security guard might catch and examine some non-shoplifters too, but we will achieve our goal of catching almost all the shoplifters.
Now you may think that we can have both high precision and high recall in a good model. But unfortunately, we can’t have both high precision and high recall at the same time.
Increasing the precision reduces recall and
vice versa. This is called
the precision-recall tradeoff. Let’s understand this tradeoff.
For each instance in the dataset,
the classifier computes a
decision score. The decision score is computed by the classification algorithm. If the decision score is
greater than a threshold, the classifier assigns that instance to the positive class,
else to the negative class. By default, the threshold is also decided by the classification algorithm.
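Here is a minimal sketch of how thresholding a decision score works (the scores below are made up, purely for illustration):

```python
import numpy as np

# Illustrative decision scores for five images.
scores = np.array([2.1, -0.4, 0.7, -1.3, 3.5])

threshold = 0.0
print((scores > threshold).astype(int))   # [1 0 1 0 1]  (1 = "5", 0 = "not 5")

# Raising the threshold makes the classifier more conservative:
# fewer positives, typically higher precision but lower recall.
threshold = 1.0
print((scores > threshold).astype(int))   # [1 0 0 0 1]
```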
Here are the precision and recall values of the 5-detector at various thresholds.
As we can see if we increase the threshold, the precision increases and
recall decreases.
And, if we decrease the threshold,
the precision decreases and
recall increases. We can change the threshold as per our requirement.
So how do we decide which threshold value to use? The answer is we can simply select the threshold value which gives us the best precision and recall values for our task. Say for our task we need high precision then
we increase the threshold and if we need high recall
then we decrease the threshold. Therefore it is fairly easy to create a classifier with virtually any precision or recall we want. Let’s say we want to achieve 90% precision
To achieve the same we set the threshold to around four hundred thousand.
But in this case we will have only 18% recall. A high-precision classifier is not very useful if its recall is too low.
So if your boss asks you to reach 99% precision, you should
ask him at what recall :)
Another way to select a good precision-recall tradeoff is to plot precision directly against recall. As we can see, the precision drops sharply at around 80% recall. We may select the precision-recall tradeoff just before that drop ...
… for example at 60% recall. But again, this entirely depends on your project.
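If you are curious how such a plot is produced in code, here is a minimal sketch using scikit-learn (the labels and decision scores are synthetic, generated only for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Synthetic 0/1 labels and decision scores that are correlated with them.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1000)
scores = y_true + rng.normal(scale=1.0, size=1000)

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

plt.plot(recalls, precisions)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```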
Let’s understand one more measure of performance for binary classifiers, the ROC Curve - Receiver Operating Characteristic curve. Before understanding the ROC curve let’s understand two more terms associated with the confusion matrix.
Here is the confusion matrix of the model which predicts if someone has cancer or not.
Recall is also known as the true positive rate.
The false positive rate is the ratio of negative instances that are incorrectly classified as positive. It is the measure which tells us what proportion of patients that
did not actually have cancer were classified as
having cancer by the model.
Here, the false positive rate is 10 out of 60, which is approximately 0.17.
Now let’s come back to the ROC curve. The ROC curve plots the true positive rate against the false positive rate at various thresholds.
We can use ROC curves to compare various classifiers. One way to compare classifiers is to measure the area under the curve (AUC): the higher the area under the curve, the better the model. A purely random classifier, which randomly assigns instances to the positive and negative classes, has an area under the curve of
0.5, whereas a perfect classifier has an
area under the curve equal to 1. For example, if the area under the curve of a Random Forest classifier is larger than that of an SGD (Stochastic Gradient Descent) classifier, the Random Forest classifier performs better.
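Here is a minimal sketch of computing the ROC curve and the area under it with scikit-learn (again using synthetic labels and scores, only for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic 0/1 labels and decision scores correlated with them.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1000)
scores = y_true + rng.normal(scale=1.0, size=1000)

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(roc_auc_score(y_true, scores))   # roughly 0.76 here; 0.5 = random, 1.0 = perfect
```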
This was a brief introduction to the various performance measures used in classification tasks. Hopefully you now have a fair understanding of which performance measure to choose, how to select the right precision/recall tradeoff for your needs, and how to compare various classifiers using the ROC curve and the area under the curve.
So far we’ve worked with the binary classifier 5-detector. Let’s come back to our original problem. We wanted to build a classifier or model using the MNIST dataset which classifies an image into one of the digits between 0 and 9. As discussed, this is multiclass classification. How do we train this multiclass model? There are two strategies:
one-versus-all and one-versus-one. Let’s discuss these strategies.
In one-versus-all, we train 10 different binary classifiers, one for each digit between 0 and 9, like a 0-detector,
a 1-detector,
a 2-detector, and so on up to a 9-detector.
While classifying an image, we pass the image to each classifier and select the class whose classifier gives the best score.
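A minimal one-versus-all sketch with scikit-learn might look like the following (it uses scikit-learn's small built-in digits dataset instead of the full MNIST, just to keep the example fast):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# 8 x 8 digit images, a small stand-in for MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# OneVsRestClassifier trains one binary classifier per digit
# (a 0-detector, a 1-detector, ... a 9-detector) and picks the best-scoring class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)

print(len(ovr.estimators_))        # 10 binary classifiers
print(ovr.score(X_test, y_test))   # accuracy on the test set
```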
Let’s understand one-versus-one strategy using a very simple multiclass classification problem. Say we have to train a classifier which classifies an object into one of the three classes A, B, and C. In one-versus-one strategy,
we train multiple binary classifiers, one for each possible pair of classes. As you can see, we have trained three binary classifiers, one for each possible pair of the classes A, B and C. Now, to classify an object “x”, we pass it to each binary classifier and assign it the class which gets the majority of the votes.
Say the A-B classifier classifies object “x” as “B”,
B-C classifier classifies object “x” as “B” and
A-C classifier classifies object “x” as “C”.
Since here class B has the majority
our model will classify object “x” as “B”.
If there are 10 classes, the total number of pairs of 2 classes is 45. If there are 100 classes, the total number of pairs is 4,950, which is close to 5,000, meaning there would be close to 5,000 classifiers to train.
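This count is simply "n choose 2", that is, n * (n - 1) / 2. A quick check:

```python
# Number of one-versus-one pairs for n classes.
def num_pairs(n_classes):
    return n_classes * (n_classes - 1) // 2

print(num_pairs(3))     # 3    (A-B, B-C, A-C)
print(num_pairs(10))    # 45
print(num_pairs(100))   # 4950 -- close to 5,000 classifiers to train
```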
The main advantage of the one-versus-one strategy is that each binary classifier only needs to be trained on the part of the training set containing the two classes it must distinguish.
For example, the A-B classifier will get trained only on the small part of the training set which includes the classes A and B. This saves a lot of time and computing resources as compared to training on the full training set.
As with most approaches in machine learning, you should rely on cross-validation or a comparison of the performance of both strategies, and take the decision based on that.
Hopefully you now have a good understanding of how to train a model for a multiclass classification problem.
Let’s understand one more type of classification called multilabel classification. In multiclass classification, the classifier classifies the given instance into only one of the classes.
For example, a fruit can be classified as either apple or banana but not both at the same time. On the other hand in multilabel classification, an instance can be classified into multiple classes at the same time.
For example, a movie can be classified into multiple categories. Say the multilabel movie classifier is trained on four categories Biography, Drama, Sport, and Sci-Fi.
Then, for the movie Raging Bull, such a multilabel classifier will output multiple binary labels, one for each category.
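As a minimal sketch, the output of such a multilabel classifier might look like this (the categories and the prediction below are illustrative, not from a real trained model):

```python
categories = ["Biography", "Drama", "Sport", "Sci-Fi"]

# One binary label per category for the movie "Raging Bull".
prediction = [1, 1, 1, 0]

print([c for c, p in zip(categories, prediction) if p == 1])
# ['Biography', 'Drama', 'Sport']
```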
Let’s do some hands-on work and learn how to train a binary classifier using AzureML. We will train a binary classifier which predicts if someone has breast cancer or not. We will use the Breast Cancer dataset available in AzureML. Log in to AzureML. Go to experiments and click on “blank experiment”. Go to “Samples Dataset” and drag “breast cancer data” to the canvas. This dataset consists of breast cancer diagnosis data against features from cell samples. Click on visualize to visualize the data. The dataset has 683 rows and 10 columns. Here, Class is the label and the other columns are features. Next, drag “Select columns in dataset” and join it with the dataset box. Click on the “Select columns in dataset” box, click on “launch column selector” and select all the columns. Next, drag “split data” and join it with the “Select columns in dataset” box. Split the data into training and test sets in an 80:20 ratio. Specify the random seed as 42. Type classification in the search box and here you can see all the classification algorithms. Since our classifier is a binary classifier, let’s select “Two-class Logistic Regression” for now.
Logistic Regression is used to estimate the probability that an instance belongs to a particular class. For example, logistic regression can answer a question like: what is the probability that this image is of the digit 5?
If the estimated probability is greater than 0.5 then the model predicts that the instance belongs to the positive class labeled as 1 ...
….else to the negative class labeled as 0.
Drag “train model” and join it with the training set box and the “Two-class Logistic Regression” box. Click on the “train model” box and specify the label as “Class”. Drag “Score model” and join it with “train model” and the test set. Drag “evaluate model” and join it with the “score model” box. Now we are done with the steps. Let’s run the steps and wait for the execution to complete. Let’s visualize the score model step. Here we can see the “Scored Labels” and “Scored Probabilities” columns. Scored Label is the class predicted by the model and Scored Probability is the probability predicted by the model. As you can see, if the probability is greater than 0.5 then the scored label is 1, else 0. Next, evaluate the model. Here we can see the ROC curve and precision vs recall. Below is the confusion matrix. Accuracy is 96%. Precision is 97.8%. Recall is 91.7% and the F1 score is 0.946. Currently the threshold is 0.5 and the area under the curve is 0.99. We can change the threshold to achieve any precision or recall we want. For example, to achieve 96% recall we can decrease the threshold to 0.43. As you can see, on changing the threshold, the precision has also changed. This was a quick demo of how to train a model for a classification task. To get a better model, try different algorithms, select the best one and then tune its hyperparameters. Once you are confident with your model, deploy it to production.
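If you prefer to see the same idea outside AzureML, here is a minimal scikit-learn sketch. Note that it uses scikit-learn's built-in breast cancer dataset, which is similar to but not the same file as the AzureML sample used above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale the features, then fit a logistic regression model.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

probabilities = model.predict_proba(X_test)[:, 1]   # "scored probabilities"
predictions = (probabilities > 0.5).astype(int)     # "scored labels"
print(predictions[:5])
print(probabilities[:5].round(2))
```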
Let’s quickly revisit the concepts we have learnt in this chapter. We learnt how to train a model for a classification task. We started with a multiclass classification problem. Then we learnt various performance measures for the classification task. We learnt that accuracy is not a good performance measure, especially for skewed datasets. Then we learnt about the confusion matrix. The general idea of the confusion matrix is to count the number of times instances of class A are classified as class B. Each row in a confusion matrix represents an actual class while each column represents a predicted class. Then we learnt about the various terms associated with the confusion matrix. Then we learnt about Precision and Recall. Then we learnt about the F1 score, which is a single metric combining both precision and recall. Then we went through some examples where we require high precision or high recall. Then we learnt about the precision-recall tradeoff: increasing the precision reduces recall and vice versa. Then we learnt that we can achieve any precision or recall value by changing the threshold. Then we learnt about the ROC curve and the area under the curve. The ROC curve can be used to compare the performance of various classifiers; a perfect classifier has an area under the curve of 1. Then we learnt about the two multiclass classification strategies, one-versus-all and one-versus-one. Then we learnt about multilabel classification. Finally, we trained a binary classifier to predict if someone has breast cancer or not.
Hope you liked the chapter. Stay tuned for the next chapter and happy learning!
https://discuss.cloudxlab.com/c/course-discussions/ai-and-ml-for-managers
Hi, welcome to the chapter "Classification" of AI and ML Course for Managers .
So far in the course, we have discussed only the supervised regression problems like
predicting housing prices in California or
predicting the area burned by the forest fires.
Let’s understand supervised classification problems. In classification, the algorithm learns from the labeled data given to it, builds the model and then this model is used to classify new observations into classes. Let us see some examples.
Say our task is to build a model which classifies the given image into fruits like Banana, Apple, and Kiwi. Here we have labeled images for training the model. Each image has a label such as Apple, Banana or Kiwi. Since we have labeled data such a classification will be supervised machine learning task.
We feed these labeled images to the algorithm. The algorithm learns from these labeled images and
builds the model.
Then this model can be used to classify new images into one of the fruits.
So how does the training work?
We first represent the images in the tabular form - In the form of rows and columns. We will soon revisit how to convert an image into tabular form.
Then we split the data into training and test set in the 80 20 ratio.
and then using training set we build the model and
using the test set we evaluate the performance of the model.
And then we deploy the model to production where it classifies the unknown new images into one of the fruits.
Let’s see one more example. Say we have to build a model which classifies the images into dogs or cats. Each image in the training set has a label such as a cat or a dog. We feed these labeled images to the algorithm. The algorithm learns from the features of these images and builds the model. Since here we are feeding labeled images to the algorithm, such a classification is a supervised machine learning task.
Let’s revisit the MNIST dataset from the topic “Representing data”. The MNIST dataset consists of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. As you can see each image is labeled with the digit it represents. The MNIST dataset is considered the “Hello Word” of machine learning. Whenever people come up with a new classification algorithm they are curious to see its performance on the MNIST dataset.
Let’s build a classifier or model using MNIST dataset which classifies an image into one of the digits between 0 and 9. For example, here the model classified the image as class 5. This type of classification is known as multiclass classification where the model classifies the object into one of the multiple classes. How do we train such model?
We have total seventy thousand images in the MNIST dataset.
Now, the question is how do we represent images in the tabular form so that we can feed them to the algorithm.
Also, what are the features and label
There are seventy thousand images in the MNIST dataset so the number of instances are seventy thousand.
Each image has a label associated with it.
Let’s see how do we represent these images in the tabular form.
Each image in the MNIST is of size 28 by 28. Also, each image is grayscale or in other words black-and-white. Such an image is basically a two-dimensional array of numbers. The black color is represented as 0 and white is represented as 255 and other shades in between. So, an image is a two-dimensional matrix of numbers. The are many ways of converting a two-dimensional image into a one-dimensional array or list of numbers. For now, we will take a simple approach whereby we will append all rows and make a single dimension array.
A 28 by 28 image have total 784 pixels and it can be represented as an array of 784 numbers. So, each row will have 784 columns hence 784 features.
The MNIST dataset can be represented in the tabular form where each row represents one hand-drawn digit and value of each pixel in the drawing is a feature. Now we know the features and labels and we have represented them in the tabular form
We feed this data to the algorithm and the algorithm builds the model.
Let’s simplify the problem for now and
first train a binary classifier which classifies the given image into
class 5, if the image is of digit 5 else
into class “not 5”.
A binary classifier classifies an object into one of the two groups.
Let’s name this classifier as “5-detector” for now. As we have learned, machine learning algorithms prefer to work with numbers. So we will have to convert features and labels into numbers before feeding them to the algorithm. Here features are images and we have already seen how to represent images in the form which can be fed to the algorithm. Classes “5” and “not-5” which are labels are not numerical. We will have to convert these two classes into numerical values.
How do we convert classes “5” and “not-5” into numbers?
We can either use one hot encoding or assign numbers to these classes.
Let’s assign numbers 0 and 1 to these two classes.
0 denotes the class “not-5” and
1 denotes the class 5. Now the label also contains numbers.
We can now feed features and labels to machine learning algorithm and the algorithm will build the model for 5-detector
Then this 5-detector model can be used to classify the unknown images into one of two classes “5” or “not 5”
Before training the 5-detector, the question which comes to mind is how will we validate this model? Until we have a way to validate the model, we will not have confidence in the model.
So which performance measure should we use to evaluate the performance of 5-detector?
As we have seen earlier, in the case of regression, the preferred performance measure is RMSE - root mean square error.
Can we use root mean square as a performance measure for classification?
The answer is no.
Because root mean square error of say one thousand seven hundred and eighteen will not make any sense for classification
In classification, we are interested in classifying an object into one of the categories. Hence we need different performance measures for classification. Let’s see different performance measures for the classification task.
In classification one of the performance measures is accuracy.
In the case of 5-detector, accuracy is the number of images classified correctly. Here total 2 predictions are correct out of three ...
so the accuracy is 66.67 percent.
Here is the question on accuracy. We have three images showing 4, 3 and 5. The 5-detector is some model which has detect 5 only in the second case. What is the accuracy of 5-detector in this case? Pause the video for a minute and calculate the accuracy.
Here only one prediction is correct out of three. So the accuracy is 33.33%. Though accuracy looks good in normal life. It is not such a good measure of performance in classification tasks. Let’s understand with an example.
Imagine, we have a very dumb classifier which classifies every input image as “not 5”. It's accuracy for the given example is ¾ which 75%. Let us see the accuracy of this dumb classifier on MNIST dataset.
In the MNIST dataset, around 90% of images are of digits other than 5.
Remaining 10% images are of digit 5.
Since this dumb classifier classifies every image as “Not 5”, it will classify
90% images correctly but
10% images of digit 5 incorrectly
So the accuracy of this dumb classifier will be 90%.
This is simply because only 10% images are of 5. There is a huge class imbalance between the two classes and such a dataset is called skewed dataset.
So the accuracy of this dumb classifier is 90% as it classifies 90% images correctly. This is simply because only 10% images are of 5. If you always guess that an image is not 5, you will be right about 90% of the time. This demonstrates why accuracy is generally not the preferred performance measure for classifiers especially for skewed datasets. Let’s see one more example.
Let’s say we have to build a model which predicts if someone has cancer or not. We have 100 people,
out of which 95 people do not have cancer and
5 people have cancer.
Say the model is very bad and predicts every case as no cancer. In such case, it will classify
95 non-cancerous people correctly but
5 cancerous patients as non-cancerous. Now even though the model is pretty bad at predicting cancer, the accuracy of
such a bad model is 95%.
This is because there are two classes “no cancer” and “ have cancer” having 95 and 5 people respectively. There is a huge class imbalance and the dataset is skewed. This again demonstrates why accuracy is generally not a preferred performance measure for classification, especially when your dataset is skewed.
Let’s see one more example. Say you have built the spam classifier which classifies the given email into spam or ham with
90% accuracy. Do you think this classifier is good?
According to an analysis by Symantec in 2009, 90% of all the emails sent on the internet are spam. With this analysis you know only 10% emails are not spam and you are dealing with the skewed dataset. Again you will not trust this classifier on the basis of accuracy
Here is a question. Say your friend has built a classifier which classifies a given image as a
Male or
a female. Your friend is really confident with this classifier as
its accuracy is 99%. By now you know accuracy is not a good performance measure for classification. What ..
….questions will you ask your friend to make sure if his classifier is really good? Pause the video for a minute and think about all the questions.
You will ask what was the percentage of males and
females images in the dataset?
Your friend replies in the dataset 80% of the images were of males and
20% images were of females. Now you surely know the dataset is skewed and
you will not trust his classifier.
Let’s explore other better performance measures for classification.
A much better way to look into the performance of the classifier is to look at the confusion matrix. The general idea of the confusion matrix is to count the number of times instances of class A are classified as class B.
Let’s see the confusion matrix of 5-detector. The confusion matrix contains the counts of actual versus predicted value. Each row in the confusion matrix represents an actual class while each column represents a predicted class. Each cell in the matrix represents the count. In this matrix, there are total 5 cases where non-5s are correctly classified as not-5s. Similarly, there are total 3 cases where fives are correctly classified as 5s. While there are total 2 cases where 5 is classified as not-5s and one instance where non-5s are classified as 5.If a dataset has 10 classes, the confusion matrix will have 10 rows and 10 columns.
Here is the illustrated view of confusion matrix. Let’s understand the terms associated with the confusion matrix.
True negatives are the cases in which images were correctly classified as not 5s. It includes the cases when the actual images were
of not-5s and the model also classified them as not-5s.
False Positives are the cases in which images were wrongly classified as 5s. It includes cases when the actual images were
of not-5s but the model classified them as 5s
False Negatives are the cases in which the images were wrongly classified as non-5s. It includes cases when the actual images
were of 5s but the model classified them as not-5s.
True Positives are the cases in which the images are correctly classified as 5s. It includes cases when the actual images
were of 5s and the model also predicted them as 5s. Though confusion matrix gives us a lot of information we require more concise metric
such us Precision and Recall. Let’s understand Precision and Recall.
In real-life, precision is marked by lack of mistakes..
For the 5-detector, precision is the measure that tells us what proportion of images that were classified as 5s were actually 5.
Here the 5-detector classified total four images as 5s and it was correct only ...
… three times.
So the precision is 3 out of 4. In other words, precision is about being precise. It measures how many classifications were correct out of the all the classifications. So if a model classifies only one image but classifies it correctly then its precision will be 100%.
In real life, recall means to remember something learnt in the past.
For 5-detector, recall is the measure of what proportion of images that were actually 5 were classified as class 5.
Here there are total five images of digit 5 and
the 5-detector classified only three of them as 5s.
So the recall is 3 out of 5. In other words, recall is about classifying all the images of 5 as 5s. Let’s take one more example to understand precision and recall.
Here is the confusion matrix of the model which predicts if someone has cancer or not. Here precision is the measure which tells us
how many patients that model predicted as having cancers
actually have cancer. Here model predicted that ...
110 patients have cancer but out of these 110 predictions only ...
… 100 predictions are correct.
So the precision is 100 out of 110 which is approximately 0.91.
Recall is the measure which tells us what proportion of patients that...
actually had cancer was diagnosed by the model as...
having cancer. Here total...
105 patients, have cancer and model predicted that...
only 100 of them have cancer. So the recall is...
100 by 105 which is approximately 0.95.
Instead of computing precision and recall every time we train a classifier, we prefer a single metric which combines both precision and recall. This single metric is f1 score.
F1 score favors the classifiers that have similar precision and recall.
If between precision and recall one number is really small,
Then the f1 score will be closer to the smaller number than the bigger one
As we have learned F1 score favors the classifiers which have almost similar precision and recall.
Depending on the problem we may want high precision, high recall or equal precision and recall (that is the case where we would want f1 score as our performance measure.). Let’s see some examples.
Say we have to build a model which detects if a video is safe for kids or not.
Would you prefer this model to have high precision or high recall for this task? Just take a moment and think what will be the meaning of high precision and high recall.
High precision means if the model classifies video 4 and video 6 as safe for kids, they are actually safe for kids.
In high precision, we are okay if the model is not able to classify video 2 as safe for kids but whichever videos it classifies as safe for kids they are actually safe.
High recall means the model will try to maximize the number of videos that are classified as safe.
In high recall there might be chances that model may classify video 3 as “safe for kids”. This is because recall is more about classifying all the “safe for kids” videos as “safe” rather than classifying all the videos correctly.
So will you prefer high precision or high recall for this task?
We would prefer a model which has high precision and low recall. It is okay if the model rejects many good videos but keeps only really safe ones, instead of
having high recall and classifying a few really bad videos as safe for kids.
Let’s see one more example. Say we have to build a model which detects shoplifters on the basis of surveillance images. In case, someone is marked as shoplifter, we manually examine.
Would you prefer this model to have high precision or high recall?
We would prefer the model to have high recall even if the precision is low because our goal is to catch almost all the shoplifters.
In the high recall, the security guard might catch and examine some non shoplifters also but we will achieve our goal of catching almost all the shoplifters.
Now you may think that we can have both high precision and high recall in a good model. But unfortunately, we can’t have both high precision and high recall at the same time.
Increasing the precision reduces recall and
vice versa. This is called
precision recall tradeoff. Let’s understand this tradeoff.
For each instance in the dataset,
the classifier computes a
Decision score. Decision score is decided by the classification algorithm. If the decision score is
greater than a threshold, the classifier assigns that instance to the positive class
else to the negative class. Here again the threshold is decided by the classification algorithm.
Here is the precision and recall values of 5-detector during the various threshold.
As we can see if we increase the threshold, the precision increases and
recall decreases.
And, if we decrease the threshold,
the precision decreases and
recall increases. We can change the threshold as per our requirement.
So how do we decide which threshold value to use? The answer is we can simply select the threshold value which gives us the best precision-recall threshold for our task. Say for our task we need high precision then
we increase the threshold and if we need high recall
then we decrease the threshold. Therefore it is fairly easy to create a classifier with virtually any precision or recall we want. Let’s say we want to achieve 90% precision
To achieve the same we set the threshold to around four hundred thousand.
But if we set the threshold to four hundred thousand, we have only 18% recall. A high-precision classifier is not very useful if its recall is too low.
So if your boss asks you to reach 99% precision, you should
ask him at what recall :)
Another way to select good precision-recall tradeoff is to plot precision directly against the recall. As we can see, the precision drops sharply at 80% recall. We may select the precision-recall tradeoff just before the drop
for example at 60% recall. But again it entirely depends on your project.
Let’s understand one more measure of performance for binary classifiers, the ROC Curve - Receiver Operating Characteristic curve. Before understanding the ROC curve let’s understand two more terms associated with the confusion matrix.
Here is the confusion matrix of the model which predicts if someone has cancer or not.
Recall is also known as the true positive rate.
False positive rate is the ratio of negative instances that are incorrectly classified as positive. It is the measure which tells us what proportion of patients that
actually were not having cancer are classified as
having cancer by the model.
Here, the false positive rate is 10 out of 60 which is 0.17
Now let’s come back to ROC curve. The ROC curve plots the true positive rate against the false positive rate at various threshold
We can use ROC curves to compare various classifiers. One way to compare classifiers is to measure the area under the curve (AUC). Higher the area under the curve, better the model is. A purely random classifier have the area under the curve equal
0.5 whereas the perfect classifier have the
area under the curve equal to 1. This was the brief introduction to various performance measures used in the classification task.
Hopefully, now you have a fair understanding of which performance measure to choose, select the precision/recall tradeoff that fits your needs and compare various classifiers using ROC curves and area under the curve.
So far we’ve worked with a binary classifier 5-detector. Let’s come back to our original problem. We wanted to build a classifier or model using MNIST dataset which classifies an image into one of the digits between 0 and 9. As discussed this is multiclass classification. How do we train this multiclass model? There are two strategies
One-versus-all and one-versus-one. Let’s discuss these strategies.
In one-versus-all, we train 10 different binary classifiers, one for each digit between 0 and 9: a 0-detector,
a 1-detector,
a 2-detector, and so on up to a 9-detector.
While classifying an image, we pass the image to each classifier and select the class whose classifier outputs the highest score.
Let’s understand the one-versus-one strategy using a very simple multiclass classification problem. Say we have to train a classifier which classifies an object into one of three classes A, B, and C. To do so using the one-versus-one strategy,
we train multiple binary classifiers, one for each possible pair of classes. As you can see, we have trained three binary classifiers, one for each possible pair of the classes A, B, and C. Now to classify an object “x”, we pass it to each binary classifier and assign it the class which gets the majority of votes.
Say the A-B classifier classifies object “x” as “B”,
B-C classifier classifies object “x” as “B” and
A-C classifier classifies object “x” as “C”.
Since class B has the majority here,
our model will classify object “x” as “B”.
In general, for N classes the one-versus-one strategy needs N × (N − 1) / 2 binary classifiers. If there are 10 classes, the total number of pairs is 45. If there are 100 classes, the total number of pairs is 4,950, which means there will be close to 5,000 classifiers that we need to train.
The main advantage of one-versus-one strategy is that each binary classifier only needs to be trained on the part of the training set for the two classes it must distinguish.
For example, the A-B classifier is trained only on the part of the training set which includes the classes A and B. This saves a lot of time and computing resources compared to training on the full training set.
So which strategy should you use? As with most approaches in machine learning, you should rely on cross-validation: compare the performance of both approaches and take the decision based on that.
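As a rough sketch of what such a comparison could look like outside AzureML, here are scikit-learn’s explicit one-versus-rest and one-versus-one wrappers compared with cross-validation on its small built-in digits dataset (not the full MNIST data); the names and numbers here are illustrative only.

```python
# A minimal sketch of comparing the two multiclass strategies with
# cross-validation, using scikit-learn's small built-in digits dataset.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_digits(return_X_y=True)               # 8x8 digit images, classes 0-9
base = LogisticRegression(max_iter=1000)

ova = OneVsRestClassifier(base)                   # trains 10 classifiers, one per digit
ovo = OneVsOneClassifier(base)                    # trains 45 classifiers, one per pair

print("one-versus-all accuracy:", cross_val_score(ova, X, y, cv=3).mean())
print("one-versus-one accuracy:", cross_val_score(ovo, X, y, cv=3).mean())
```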
Hope you now have a good understanding of how to train a model for a multiclass classification problem.
Let’s understand one more type of classification called multilabel classification. In multiclass classification, the classifier classifies the given instance into only one of the classes.
For example, a fruit can be classified as either apple or banana but not both at the same time. On the other hand in multilabel classification, an instance can be classified into multiple classes at the same time.
For example, a movie can be classified into multiple categories. Say a multilabel movie classifier is trained on four classes: Biography, Drama, Sport, and Sci-Fi.
Then for the movie Raging Bull, such a multilabel classifier will output multiple binary labels, one for each class.
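Here is a tiny, purely illustrative sketch of what multilabel targets look like in code: the label for each movie is a row of 0/1 flags, one per genre, and several flags can be 1 at the same time. The feature values and label rows below are made up for illustration.

```python
# A minimal, illustrative sketch of multilabel classification: each movie can
# belong to several genres at once, so its label is a row of 0/1 flags, one per
# genre. The features and label rows are made up purely for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

genres = ["Biography", "Drama", "Sport", "Sci-Fi"]

# Made-up numeric features for a few movies (e.g. outputs of some feature pipeline).
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
# Multilabel targets: one column per genre, several columns can be 1 at once.
y = np.array([
    [1, 1, 1, 0],   # biography + drama + sport
    [1, 1, 0, 0],   # biography + drama
    [0, 1, 0, 1],   # drama + sci-fi
    [0, 0, 0, 1],   # sci-fi
])

clf = KNeighborsClassifier(n_neighbors=2).fit(X, y)
print(dict(zip(genres, clf.predict([[0.85, 0.15]])[0])))  # predicted genre flags
```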
Let’s do some hands-on and learn how to train a binary classifier using AzureML. We will train a binary classifier which predicts if someone has breast cancer or not. We will use the Breast Cancer dataset available in AzureML. Log in to AzureML. Go to experiments and click on “blank experiment”. Go to “Samples Dataset” and drag “breast cancer data” to the canvas. This dataset consists of breast cancer diagnosis data against features from cell samples. Click on visualize to visualize the data. The dataset has 683 rows and 10 columns. Here Class is the label and the other columns are features. Next drag “Select columns in dataset” and join it with the dataset box. Click on the “Select columns in dataset” box, click on “launch column selector” and select all the columns. Next drag “split data” and join it with the “Select columns in dataset” box. Split the data into training and test sets in an 80:20 ratio. Specify the random seed as 42. Type classification in the search box and here you can see all the classification algorithms. Since our classifier is a binary classifier, let’s select “Two-class Logistic Regression” for now.
Logistic Regression is used to estimate the probability that an instance belongs to a particular class. For example, logistic regression can answer a question like: what is the probability that this image is of digit 5?
If the estimated probability is greater than 0.5, then the model predicts that the instance belongs to the positive class, labeled as 1,
else to the negative class, labeled as 0.
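As a minimal, self-contained sketch of this decision rule (a tiny made-up dataset, not the course’s data), scikit-learn’s LogisticRegression exposes the estimated probability directly:

```python
# A minimal sketch of the decision rule described above, using scikit-learn's
# LogisticRegression on a tiny made-up one-feature dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba([[6.0]])[0, 1]       # estimated P(positive class)
print("probability:", proba)
print("predicted class:", 1 if proba > 0.5 else 0)
```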
Drag “Train Model” and join it with the training set box and the “Two-class Logistic Regression” box. Click on the “Train Model” box and specify the label as “Class”. Drag “Score Model” and join it with “Train Model” and the test set. Drag “Evaluate Model” and join it with the “Score Model” box. Now we are done with the steps. Let’s run the steps and wait for the execution to complete. Let’s visualize the score model step. Here we can see the “Scored Labels” and “Scored Probabilities” columns. Scored Label is the class predicted by the model and Scored Probability is the probability predicted by the model. As you can see, if the probability is greater than 0.5 then the scored label is 1, else 0. Next, evaluate the model. Here we can see the ROC curve and precision vs recall. Below is the confusion matrix. Accuracy is 96%. Precision is 97.8%. Recall is 91.7% and the F1 score is 0.946. Currently the threshold is 0.5 and the area under the curve is 0.99. We can change the threshold to achieve any precision or recall we want. For example, to achieve 96% recall we can decrease the threshold to 0.43. As you can see, on changing the threshold, the precision has also changed. This was a quick demo of how to train a model for a classification task. To get a better model, try 2-3 different algorithms, select the best one and then tune its hyperparameters. Once you are confident with your model, deploy it to production.
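For readers who prefer code, here is a rough sketch of a similar workflow outside AzureML, using scikit-learn’s own built-in breast cancer dataset (a different but comparable dataset, so the exact numbers will not match the demo above): an 80/20 split, two-class logistic regression, the same metrics, and the threshold exposed as a variable you can move to trade precision for recall.

```python
# A rough sketch of a similar workflow with scikit-learn's built-in breast
# cancer dataset (not the AzureML sample dataset, so numbers will differ).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)           # 80:20 split

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]              # scored probabilities

threshold = 0.5                                      # lower it to raise recall
pred = (proba > threshold).astype(int)               # scored labels

print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("f1 score :", f1_score(y_test, pred))
print("AUC      :", roc_auc_score(y_test, proba))
```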
Let’s quickly revisit the concepts we have learnt in this chapter. We learnt how to train a model for a classification task, starting with a multiclass classification problem. Then we learnt about various performance measures for classification tasks. We learnt that accuracy is not a good performance measure, especially for a skewed dataset. Then we learnt about the confusion matrix. The general idea of the confusion matrix is to count the number of times instances of class A are classified as class B. Each row in a confusion matrix represents an actual class while each column represents a predicted class. Then we learnt about the various terms associated with the confusion matrix. Then we learnt about precision and recall. Precision means: when the model predicts yes, how often is it correct. Recall means: when it is actually yes, how often does the model predict yes. Then we learnt about the F1 score, which is a single metric combining both precision and recall. Then we went through some examples where we require high precision or high recall. Then we learnt about the precision-recall tradeoff: increasing the precision reduces the recall and vice versa. Then we learnt that we can achieve any precision or recall value by changing the threshold. Then we learnt about the ROC curve and the area under the curve. ROC curves can be used to compare the performance of various classifiers, and a perfect classifier has an area under the curve of 1. Then we learnt about the two multiclass classification strategies, one-versus-all and one-versus-one. Then we learnt about multilabel classification. Finally, we trained a binary classifier to predict if someone has breast cancer or not.
Hope you liked the chapter. Stay tuned for the next chapter and happy learning!
https://discuss.cloudxlab.com/c/course-discussions/ai-and-ml-for-managers