Hi, welcome to the third chapter, "Analytics and Data Sciences", of the AI and ML Course for Managers.
We will start with analytics basics which will help you understand machine learning concepts.
Then we will see how data visualization helps in gaining insights from the data.
Afterwards, we will learn a few important concepts in machine learning, like data cleaning
and feature scaling. Alright, so let’s start.
As discussed in the previous chapter, the features in a dataset can be both numerical and categorical.
A numerical variable is a variable where the measurement or attribute has a numerical meaning. For example, heart rate and rainfall measured in inches are numerical variables. Numerical variables
are of two kinds: continuous and discrete. Continuous numerical variables can take any value within a range. For example, a person’s height can be any value within the range of human heights, not just certain fixed heights such as 3 ft or 4 ft.
Discrete numerical variables, on the other hand, can only take certain values. An example is the number of students in a class, because there can't be half a student.
A categorical variable represents a category or type. For example, gender is a categorical variable. As discussed earlier, categorical variables can be of two types: ordinal and regular. Ordinal categorical values such as Low, Medium, and High have a particular order, whereas regular (also called nominal) categorical values such as Male and Female don’t have a specific order.
Let us now understand a very important concept in machine learning called Probability. Probability is the measure of the likelihood that an event will occur. The higher the probability of an event, the more likely it is that the event will occur.
The probability of an event is always between 0 and 1. An event with a probability nearer to 0 is less likely to occur, while an event with a probability nearer to 1 is more likely to occur.
To understand probability, let us take a simple example. If we pick a month of the year at random, what is the probability that its name contains the letter “M”?
There are 12 months in a year.
Out of these 12 months, there are 5 months whose names contain the letter “m”: March, May, September, November, and December.
Hence the probability of getting “M” in the month’s name is the number of months having “M” in their names divided by the total number of months, that is, 5 divided by 12.
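As a quick check of this counting argument, here is a minimal Python sketch. It assumes every month is equally likely to be picked; the variable names are only illustrative.

```python
from fractions import Fraction

# English month names, written out so the result does not depend on locale.
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Favourable outcomes: names containing the letter "m" in either case.
months_with_m = [name for name in months if "m" in name.lower()]

# Probability = favourable outcomes / total outcomes.
probability = Fraction(len(months_with_m), len(months))

print(months_with_m)
print(probability, "=", round(float(probability), 3))
```

Using Fraction keeps the answer as an exact ratio of counts instead of a rounded decimal.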
Let us consider one more example. When we roll a pair of fair dice, what is the probability that the sum of the numbers on the faces is 4?
A die has six faces. So the total number of possible ways in which a pair of dice can land is 6 multiplied by 6, which is 36.
The sum can be 4 when both dice show 2,
or the first shows 1 and the second shows 3,
or the first shows 3 and the second shows 1.
So the number of outcomes where the sum is 4 is 3.
Hence the probability is 3 / 36, i.e. 1 / 12.
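The same result can be reached by listing every outcome. The sketch below is one possible way to do it in Python, assuming the two dice are fair and distinguishable (a first die and a second die).

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# All ordered (first die, second die) pairs: 6 * 6 = 36 outcomes.
outcomes = list(product(faces, repeat=2))

# Keep only the outcomes whose faces add up to 4.
favourable = [pair for pair in outcomes if sum(pair) == 4]

probability = Fraction(len(favourable), len(outcomes))

print(len(outcomes))
print(favourable)
print(probability)
```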
Another very important concept to learn is measures of central tendency.
Say you are given data on students’ heights in centimeters. How will you find the middle or center of these heights?
A measure of central tendency is a single value which represents the middle or the center of the data. There are three measures of central tendency: mean, median and mode.
The mean is the average of the given data points. Say we have to find the mean of the numbers 75, 69, 88, 93, 95, 54, 87, 88 and 27.
To find the mean, we first sum the numbers
and then divide the sum by the total count of numbers. Hence the mean is 676 divided by 9, which is 75.11.
The advantage of the mean is that it can be used for both continuous and discrete numerical data. Since the mean takes into account every value in the data, it can be influenced by outliers. This is its main disadvantage. Let’s understand this with an example.
Say in the previous data, instead of 27 we had the number 5000. Clearly, 5000 is an outlier here, as the other numbers are very small compared to it.
Let’s calculate the mean. As you can see, the mean has changed from 75.11 to 627.67. It got heavily influenced by that one outlier.
So, let us say we are taking measurements and for one particular reading our sensor malfunctioned. In this case, that single faulty reading can shift the mean substantially.
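To see the effect of the outlier in numbers, here is a small sketch using the mean function from Python's standard statistics module. The data is the list from the example; only the variable names are made up for illustration.

```python
from statistics import mean

scores = [75, 69, 88, 93, 95, 54, 87, 88, 27]

# Mean of the original data: 676 / 9.
mean_original = mean(scores)

# Replace 27 with the outlier 5000 and recompute: 5649 / 9.
scores_with_outlier = [5000 if value == 27 else value for value in scores]
mean_with_outlier = mean(scores_with_outlier)

print(round(mean_original, 2))      # 75.11
print(round(mean_with_outlier, 2))  # 627.67
```

One changed value out of nine moves the mean from about 75 to about 628.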
The second measure of central tendency is the median. The median is the midpoint of the data.
To find the median, we first sort the numbers in ascending order
and then we find the middlemost value. The middle value is called the median. In this example, the median is 87.
If there are two numbers in the middle,
then we take the average of the two middle numbers to calculate the median. Here the median is 87.5, which is the average of 87 and 88. Since the median depends only on the middle of the ordered data and not on the size of every value, it is less affected by outliers than the mean.
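The following sketch shows both cases with Python's statistics.median. The nine-number list is the one used for the mean. The eight-number list is an assumption made here for illustration (the same data with 27 dropped), since the even-sized data itself is not listed in this narration.

```python
from statistics import median

scores = [75, 69, 88, 93, 95, 54, 87, 88, 27]

# Odd count: the middle value of the sorted data.
median_odd = median(scores)

# Even count (assumed example: same data without 27):
# the average of the two middle values, 87 and 88.
scores_even = [value for value in scores if value != 27]
median_even = median(scores_even)

# Swapping 27 for the outlier 5000 barely moves the median.
scores_with_outlier = [5000 if value == 27 else value for value in scores]
median_with_outlier = median(scores_with_outlier)

print(median_odd)           # 87
print(median_even)          # 87.5
print(median_with_outlier)  # 88
```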
The third measure of central tendency is mode. The mode is the most commonly occurring value in the data.
In this example, 88 occurs twice and all the other numbers occur only once, hence the mode is 88. The mode has an advantage over the median and the mean, as it can be calculated for both numerical and categorical data.
One of the limitations of the mode is that there can be multiple modes in a dataset if several values share the highest frequency. In this example, there are two modes, 88 and 54, as both appear twice.
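Here is a hedged sketch of finding modes with statistics.multimode (available in Python 3.8 and later). The two-mode list is an assumed example built by adding a second 54 to the earlier data, and the categorical list is hypothetical.

```python
from statistics import multimode

scores = [75, 69, 88, 93, 95, 54, 87, 88, 27]

# Only 88 appears twice, so there is a single mode.
single_mode = multimode(scores)

# Assumed variation: a second 54 makes 88 and 54 tie for most frequent.
scores_two_modes = scores + [54]
two_modes = multimode(scores_two_modes)

# The mode also works for categorical data (hypothetical values).
risk_levels = ["Low", "High", "Low", "Medium"]
categorical_mode = multimode(risk_levels)

print(single_mode)       # [88]
print(two_modes)         # [88, 54]
print(categorical_mode)  # ['Low']
```

multimode returns a list, so a tie between several values is reported in full.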
So which measure of central tendency should we choose among mean, median and mode? The median is usually the preferred measure of central tendency, as it is less affected by outliers. However, the median is more expensive to compute than the mean, because for the median we need to sort the data. So we often use the mean when computational performance matters.
Now let us understand what measures of spread are. As we have seen, we use measures of central tendency to summarize the data into a single value using the mean, median, or mode. That single value is the representative of all the values in the dataset. But that single value is only a part of the picture. Measures of spread summarise the data in a way that shows how scattered the values are and how much they differ from the mean value. Let’s understand this with an example.
In dataset A, the mean, median and mode are all 6.
In dataset B too, the mean, median and mode are all 6.
So you may think the two datasets are similar because their measures of central tendency are the same. Are these datasets really similar?
Let’s plot these datasets. It is clear from the plot that dataset B is more dispersed than dataset A.
Dataset A starts at 4 and ends at 8,
while dataset B starts at 1 and ends at 11.
So the two datasets are not similar. The measures of central tendency and measures of spread together help us to better understand the data.
Let’s see some common measures of spread.
The range is the difference between the largest and the smallest value in the data. In this data, the largest value is 9 and the smallest value is 3. This gives a range of 6, which is 9 minus 3. Like the mean, the range can also be influenced by outliers.
Quartiles divide an ordered dataset into 4 equal parts. Here Q1, Q2 and Q3 are the quartiles. Let’s understand quartiles with an example.
Here there are 12 ordered values in the dataset.
The quartiles divide this data into 4 equal parts where each part has 3 values.
Since the first quartile Q1 falls between 3 and 4, its value is the average of 3 and 4, i.e. 3.5.
The second quartile Q2 falls between 6 and 6. Its value is the average of 6 and 6, i.e. 6.
The third quartile Q3 falls between 8 and 9. Its value is the average of 8 and 9, i.e. 8.5.
Quartile Q1 is between the lowest 25% and the remaining 75% of the values. Hence it is also called the 25th percentile.
Quartile Q2 is between the lowest 50% and the highest 50% of the values. Hence it is also called the 50th percentile or median.
Quartile Q3 is between the lowest 75% and the highest 25% of the values. Hence it is also called the 75th percentile.
From the values of Q1, Q2 and Q3 we can say that the 25th percentile is 3.5,
the 50th percentile or median is 6,
and the 75th percentile is 8.5.
The interquartile range is the difference between the upper quartile Q3 and the lower quartile Q1 and describes the middle 50% of the values.
Here the interquartile range is 5, which is the difference between quartiles Q3 and Q1. The interquartile range is often seen as a better measure of spread than the range, as it is much less affected by outliers.
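The quartile steps can be sketched in Python as below. The 12 values are hypothetical, chosen only so that the quartiles fall between the same pairs of numbers as in the example, because the actual values are not listed in this narration. The helper uses the split-into-halves method described here. Statistical libraries often use other interpolation conventions and may return slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]   # values below the middle
    upper_half = ordered[-half:]  # values above the middle
    return median(lower_half), median(ordered), median(upper_half)

# Hypothetical ordered dataset with 12 values (not taken from the slides).
data = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]

q1, q2, q3 = quartiles(data)
iqr = q3 - q1  # interquartile range: spread of the middle 50%

print(q1, q2, q3)  # 3.5 6.0 8.5
print(iqr)         # 5.0
```

For an odd number of values, this helper leaves the middle value out of both halves, which is one common convention among several.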
In analytics, you may have to draw a box plot at times. A box plot is the visual presentation
of the minimum value,
the first quartile Q1,
the second quartile Q2 or median,
the third quartile Q3,
and the highest value of a given dataset.
From this box plot, you can find that the lowest value is 2,
the 25th percentile is 4,
the 50th percentile or median is 5,
the 75th percentile is 7,
and the highest value is 8.
And the interquartile range, Q3 minus Q1, is 3.
The other two measures of spread are variance and standard deviation. Variance and standard deviation are measures of the spread of the data around the mean. They summarise how close each data value is to the mean value.
Let’s take our earlier datasets A and B. The mean of each of these two datasets is 6.
Since dataset B is more dispersed around the mean value 6, dataset B will have a higher variance and standard deviation than dataset A. The smaller the variance, the more the mean value is indicative of the whole dataset. In the extreme case, if all values of a dataset are the same, the variance and standard deviation are zero.
So how do we calculate variance? Let’s see this with an example.
First, we calculate the difference of each value in the dataset from the mean. As discussed earlier, the mean of dataset A is 6.
Then we square the differences.
Then we sum the squared differences.
And then we divide the sum by the number of data points.
Here the variance is 1.16.
Why do we square the differences in step 2? We square the differences so that negative values do not cancel out the positive values. Also, squaring the differences emphasizes larger deviations, i.e. numbers farther from the mean are weighted more.
What is the downside of squaring the differences? Firstly, because the difference from the mean is squared, the variance may give more weight to outliers. Secondly, due to squaring the differences, the variance is measured in the square of the units, which is not the same unit as our actual dataset. Calculating the standard deviation rather than the variance rectifies this second problem.
The standard deviation is the square root of the variance. Hence, the standard deviation is in the same unit as the actual dataset.
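The four steps can be written out as a short Python sketch. The values below are a hypothetical dataset with mean 6, not dataset A from the slides, so the variance printed here differs from the 1.16 quoted above. Dividing by the number of data points, as described, gives the population variance, which is what statistics.pvariance computes.

```python
from math import sqrt, isclose
from statistics import pvariance, pstdev

# Hypothetical dataset with mean 6 (not dataset A from the slides).
values = [4, 5, 6, 6, 7, 8]

mean_value = sum(values) / len(values)

# Step 1: difference of each value from the mean.
differences = [value - mean_value for value in values]
# Step 2: square the differences.
squared = [difference ** 2 for difference in differences]
# Step 3: sum the squared differences.
total = sum(squared)
# Step 4: divide by the number of data points.
variance = total / len(values)

# Standard deviation: square root of the variance, in the original units.
std_dev = sqrt(variance)

print(round(variance, 2), round(std_dev, 2))

# The standard library's population versions agree with the manual steps.
print(isclose(variance, pvariance(values)), isclose(std_dev, pstdev(values)))
```

Note that statistics.variance and statistics.stdev divide by one less than the number of data points (the sample versions), so they would give slightly larger results.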
Now let’s learn about the normal distribution.
We can plot the frequencies of the values in the data, with the values on the x-axis or horizontal axis and the count or frequency of each value on the y-axis. This means that the higher the frequency of a particular value, the taller its bar.
Data can be distributed in different ways.
It can be spread out more to the left, meaning higher values are more frequent.
Or it can be spread out more to the right, meaning smaller values have occurred more often.
Or it can be jumbled up.
But there are many cases where data tends to be
clustered around the central values and symmetrical on the left and right. Such a distribution is called the normal distribution. It is often called a "Bell Curve" because it looks like a bell.
In the normal distribution, the mean, median and mode are the same. It is symmetric about the center. 50% of the values are less than the mean and the remaining 50% are greater than the mean. In the real world, many things closely follow a normal distribution. For example, the heights of people and the marks scored by students in an exam will generally follow an approximately normal distribution.
This means that the number of students achieving around the average marks is very high, while students who get very high or very low marks are rare.
The normal distribution is usually described by its mean and standard deviation.
Also, the normal distribution follows the 68-95-99.7% rule. What is this rule?
In the normal distribution, about 68% of the values are within 1 standard deviation of the mean. What does this mean?
Say we have the heights of students, which follow the normal distribution.
The mean of the heights is 150 cm and the standard deviation is 10 cm.
The rule says 68% of the values would be within 1 standard deviation of the mean. It means that 34% of the students will have heights between 150 cm and 160 cm.
Also, 34% of the students will have heights between 140 cm and 150 cm.
We can also say that 68% of the students will have heights between 140 cm and 160 cm.
In the normal distribution, about 95% of the values are within 2 standard deviations of the mean.
Since the standard deviation is 10, 2 standard deviations is 2 multiplied by 10, i.e. 20.
It means that 47.5% of the students will have heights between 150 cm and 170 cm.
And 47.5% of the students will have heights between 130 cm and 150 cm.
We can also say that 95% of the students will have heights between 130 cm and 170 cm.
In the normal distribution, about 99.7% of the values are within 3 standard deviations of the mean.
Since the standard deviation is 10, 3 standard deviations is 3 multiplied by 10, i.e. 30.
It means that 99.7% of the students will have heights between 120 cm and 180 cm.
The remaining 0.3% of the values are often treated as outliers. Any normal distribution will follow the 68-95-99.7% rule.
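The three percentages can be checked with the NormalDist class from Python's standard statistics module (Python 3.8 and later). This sketch assumes the heights follow an exact normal distribution with mean 150 cm and standard deviation 10 cm, as in the example. The helper name share_between is only illustrative.

```python
from statistics import NormalDist

# Heights modelled as a normal distribution: mean 150 cm, std dev 10 cm.
heights = NormalDist(mu=150, sigma=10)

def share_between(low, high):
    """Fraction of values expected between low and high."""
    return heights.cdf(high) - heights.cdf(low)

within_1 = share_between(140, 160)  # mean +/- 1 standard deviation
within_2 = share_between(130, 170)  # mean +/- 2 standard deviations
within_3 = share_between(120, 180)  # mean +/- 3 standard deviations

print(round(within_1 * 100, 1))  # about 68.3
print(round(within_2 * 100, 1))  # about 95.4
print(round(within_3 * 100, 1))  # about 99.7
```

The rule's 68 and 95 are rounded versions of these values, and half of each share lies on either side of the mean because the distribution is symmetric.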
Also, please note that if the data distribution is tail-heavy, it becomes harder for some machine learning algorithms to detect patterns. In such cases, we try transforming the data to have a more bell-shaped, closer-to-normal distribution. Let’s see a question on the normal distribution.
A college admissions officer wants to determine which of two applicants scored better on their standardized test: Pam, who scored 1800 on her SAT, or Jim, who scored 24 on his ACT? Below are the mean and standard deviation for SAT and ACT scores.
In the SAT, the mean score is 1500 and the standard deviation is 300. In the ACT, the mean score is 21 and the standard deviation is 5.
We can find out who performed better between Pam and Jim by finding out whose score is more standard deviations above the mean.
As you can see, Pam is one standard deviation above the mean and Jim is 0.6 standard deviations above the mean. This shows that Pam has performed better on the test compared to Jim.
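The comparison above is a standard-score (z-score) calculation: subtract the mean from the score and divide by the standard deviation. A minimal sketch in Python, with an illustrative helper name:

```python
def z_score(score, mean, std_dev):
    """Number of standard deviations a score lies above the mean."""
    return (score - mean) / std_dev

# SAT: mean 1500, standard deviation 300. Pam scored 1800.
pam_z = z_score(1800, mean=1500, std_dev=300)

# ACT: mean 21, standard deviation 5. Jim scored 24.
jim_z = z_score(24, mean=21, std_dev=5)

print(pam_z)  # 1.0
print(jim_z)  # 0.6
print("Pam" if pam_z > jim_z else "Jim", "scored better relative to the mean")
```

Putting both scores on the same scale of standard deviations is what makes two different tests comparable.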