Data Visualization, Data Cleaning and Feature Scaling


In machine learning, it is always important to know what kind of data you are dealing with.

It is good to get a general understanding of the data before feeding it to an algorithm. Before diving into data visualization, let’s understand a few important plots that we frequently use in machine learning.

Histograms are great for illustrating the distribution of your data.

The continuous variable shown on the x-axis is broken into discrete intervals, and the number of data points in each interval determines the height of the bar on the y-axis. Let’s understand this with an example.

Let’s say we need to plot the heights of black cherry trees using a histogram. The heights of the trees are 60, 63, 75, 82 feet and so on.

To plot the histogram, the values of the continuous variable “height” are broken into discrete intervals on the x-axis. On the y-axis, we plot the number of data points in each interval.

Here, we have divided the “height” into the intervals 60 to 65, 65 to 70, 70 to 75 and so on.

Interval 60 to 65 contains all the black cherry trees whose height is greater than or equal to 60 feet and less than 65 feet.

Interval 65 to 70 contains all the trees whose height is greater than or equal to 65 feet and less than 70 feet.

Here, in the interval 70 to 75, there are 8 trees. It means 8 trees have a height greater than or equal to 70 feet and less than 75 feet.
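The binning described above can be sketched in a few lines of Python. This is only an illustration: the list of heights is made up (only 60, 63, 75 and 82 appear in the lecture), and the helper name histogram_counts is an assumption, not part of any library.

```python
# Hypothetical tree heights in feet; only 60, 63, 75 and 82 come from the lecture.
heights = [60, 63, 64, 66, 69, 70, 71, 72, 74, 75, 77, 82]

def histogram_counts(values, start, width, n_bins):
    """Count how many values fall into each interval [low, low + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)  # which interval the value belongs to
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

counts = histogram_counts(heights, start=60, width=5, n_bins=5)

# Print a text version of the histogram, one bar per interval.
for i, count in enumerate(counts):
    low = 60 + 5 * i
    print(f"{low} to {low + 5}: {'#' * count} ({count})")
```

Each interval includes its lower edge and excludes its upper edge, so a tree of exactly 65 feet is counted in the interval 65 to 70, as described in the lecture.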

A scatter plot uses dots to represent the values obtained for two different variables, one plotted along the x-axis and the other plotted along the y-axis.

Here is the scatter plot of car weights in kilograms and their fuel efficiencies in mpg (miles per gallon). Each dot represents one car, with its weight measured along the x-axis and its fuel efficiency measured along the y-axis.

For example, the car with a weight of around 1500 kilograms has a fuel efficiency of approximately 23 miles per gallon.

Scatter plots are used when we want to show the relationship between two variables. Scatter plots are sometimes called correlation plots because they show how two variables are correlated.

So what is correlation? Correlation indicates the extent to which two or more variables fluctuate together. It often refers to how close two variables are to having a linear relationship with each other. Two variables may have a positive correlation, a negative correlation or no correlation at all.

Positive correlation means values increase together. If one value increases then the other value also increases. The chart on the left shows a perfect positive correlation. As the value on the x-axis increases, the corresponding value on the y-axis also increases. In a perfect positive correlation, if you draw a straight line then all the points will be on that line.

In a perfect positive correlation, the correlation coefficient is always one. The correlation coefficient is a statistical measure of the strength of the linear relationship between two variables.

The chart in the middle shows a high positive correlation. Its correlation coefficient is 0.9. As the value on the x-axis increases, the corresponding value on the y-axis increases for most of the points, but not for all of them.

The chart on the right shows a low positive correlation. Its correlation coefficient is 0.5. If you draw a straight line here, many of the points will be off the line.

When two variables are not linked at all, we say the variables are not correlated. In this diagram, there is no pattern. We cannot clearly say whether one value increases or decreases when the other value increases. A correlation coefficient of zero indicates that there is no linear correlation.

In a negative correlation, one value decreases as the other value increases. The chart on the left shows a perfect negative correlation. As the value on the x-axis increases, the corresponding value on the y-axis decreases. In a perfect negative correlation, if you draw a straight line then all the points will be on that line, and the correlation coefficient is always minus one.

The charts in the middle and on the right show weaker negative correlations than the chart on the left. Their correlation coefficients are minus 0.9 and minus 0.5 respectively.

The correlation coefficient always takes values in the range of minus one to plus one.

A correlation coefficient of minus one shows a perfect negative correlation.

A value of zero shows there is no correlation.

And plus one shows a perfect positive correlation.
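To make the range from minus one to plus one concrete, here is a small sketch that computes the Pearson correlation coefficient, the most common correlation measure, by hand. The lecture does not say which coefficient its charts use, so treating it as Pearson is an assumption, and the data points are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together, and how much each varies on its own.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(variation_x * variation_y)

x = [1, 2, 3, 4, 5]  # hypothetical data

perfect_positive = pearson(x, [2, 4, 6, 8, 10])  # all points on a rising line
perfect_negative = pearson(x, [10, 8, 6, 4, 2])  # all points on a falling line
noisy_positive = pearson(x, [2, 1, 4, 3, 5])     # rising trend, points off the line

print(perfect_positive, perfect_negative, noisy_positive)
```

The first two results are plus one and minus one, and the third lies in between (0.8 for this data), because the points follow an upward trend without sitting exactly on a straight line.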

So why do we need to know correlation in machine learning? We will figure it out shortly. For now let’s understand the importance of data visualization in machine learning with a few examples.

Let’s say our task is to build a model of housing prices in California using the California census data. This data has features such as the population, median income, latitude, longitude, ocean proximity, and median housing price. In the table, we have shown some of the features of this dataset. Let’s plot the histogram of some of the columns and see if we can make sense of the data.

Here is the histogram of median income. Did you notice anything unusual in this histogram?

The median income attribute does not seem to be expressed in US dollars. Say, after checking with the team that collected the data, you are told that the data was scaled and capped at 15 for higher median incomes and at 0.5 for lower median incomes. Working with preprocessed attributes is common in machine learning, and it is not necessarily a problem, but it is good to know how the data was computed.

Let’s plot the histogram of housing median age. As you can see housing median age is also capped at 50.

Let’s plot the histogram of the median house value. Median house value is also the target variable or label.

As you can see, the median house value is also capped, at five hundred thousand dollars. This is a problem, since the median house value is the label which we have to predict for unknown instances.

If these values are capped, the machine learning model may learn that housing prices never go beyond five hundred thousand dollars. Some housing prices may actually be higher, but the model will not be able to predict them. In this case, we have two options.

Either obtain the correct values for those instances whose labels were capped.

Or remove those instances whose labels were capped from the training set as well as the test set. In this example, we will drop rows 2 and 4.
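The second option can be sketched as a simple filter. The rows below are hypothetical (the table from the slide is not reproduced here); only the cap of five hundred thousand dollars comes from the lecture, and the helper name drop_capped is made up for this example.

```python
CAP = 500_000  # cap on the median house value, as stated in the lecture

# Hypothetical rows; rows 2 and 4 carry the capped label.
districts = [
    {"row": 1, "median_income": 8.3, "median_house_value": 452_600},
    {"row": 2, "median_income": 11.6, "median_house_value": 500_000},
    {"row": 3, "median_income": 5.6, "median_house_value": 341_300},
    {"row": 4, "median_income": 13.5, "median_house_value": 500_000},
    {"row": 5, "median_income": 3.8, "median_house_value": 226_700},
]

def drop_capped(rows, cap):
    """Keep only the rows whose label lies strictly below the cap."""
    return [r for r in rows if r["median_house_value"] < cap]

kept = drop_capped(districts, CAP)
print([r["row"] for r in kept])  # [1, 3, 5]
```

The same function would be applied to both the training set and the test set, so that neither contains capped labels.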

Let’s draw the scatter plot of latitude and longitude. The longitude is on the x-axis and the latitude is on the y-axis. The size of the circle represents the population: the bigger the circle, the larger the population in that area. The color of the circle represents the median house value and ranges from blue to red. The blue circles have low housing prices and the red circles have high housing prices. Any observations in this scatter plot?

Housing prices look higher in the locations which are closer to the ocean and have high population density. So you may say that housing prices are related to the location and the population density.

But this is not a general rule. As you can see, housing prices in the coastal area of northern California are not too high.

Now let’s understand why we should know about correlation in machine learning. In machine learning, it is good to identify any data quirks and remove them before feeding the data to an algorithm. Let’s understand this with an example.

Again, let us take a look at the same task of predicting housing prices in California. There are various features in the data. Let’s plot median income versus median house value. This plot reveals a few things.

First, the correlation is very strong, as you can clearly see the upward trend and the points are not too dispersed.

Second, the price cap that we noticed earlier is clearly visible as a horizontal line at five hundred thousand dollars.

Third, this plot reveals other less obvious straight lines, like a horizontal line around four hundred and fifty thousand dollars, another around three hundred and fifty thousand dollars, and perhaps one around two hundred and eighty thousand dollars. You may want to remove the corresponding instances from the dataset to prevent your algorithms from learning to reproduce these data quirks.

This was a quick introduction to data visualization. Hope it gave you an idea of a few ways you can explore the data and gain insights.
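A horizontal line in the scatter plot means that many instances share almost the same median house value. One possible way to look for such quirks without a plot is to count how often each value repeats. This is only a sketch under assumptions: the data is made up, and both the helper name suspicious_values and the threshold min_count are hypothetical choices that would need tuning on a real dataset.

```python
from collections import Counter

# Hypothetical median house values; the repeated figures mimic the horizontal lines.
house_values = [
    450_000, 187_500, 450_000, 350_000, 262_100, 450_000,
    350_000, 500_000, 131_900, 350_000, 500_000, 500_000, 500_000,
]

def suspicious_values(values, min_count):
    """Return (value, count) pairs for values that repeat at least min_count times."""
    counts = Counter(values)
    return [(v, c) for v, c in counts.most_common() if c >= min_count]

quirks = suspicious_values(house_values, min_count=3)
for value, count in quirks:
    print(f"{value} appears {count} times")

# Drop the instances that sit on one of the suspicious values.
quirk_values = {value for value, _ in quirks}
cleaned = [v for v in house_values if v not in quirk_values]
```

On real data the values on a line may not be exactly identical, so rounding the values before counting may be needed.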

Now let’s learn data cleaning. Data cleaning is one of the important steps in the machine learning process.

Most ML algorithms cannot work with missing features. It means they cannot work if there are missing values in the columns.

As you can see, the "Age" column has missing values. We have to deal with them before feeding this data to the algorithm. How do we deal with missing values? There are three approaches.

The first approach is to drop the rows which have missing values.

In this approach we drop the third row which contains the missing value.

The second approach is to drop the entire column which has missing values.

In the second approach, we drop the age column.

In the third approach, we replace the missing values with zero, the mean or the median of that column in the entire training set.

To replace missing values with zero, we simply set every missing value to 0.

To replace missing values with the mean, we first compute the mean of that column in the training set and then replace the missing values with the computed mean. Here the mean of the "Age" column is 27, so we replace the missing values in the "Age" column with 27.

To replace missing values with the median, we first compute the median of that column in the training set and then replace the missing values with the median. Here the median of the "Age" column is 22, so we replace the missing values in the "Age" column with 22.
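The three approaches can be sketched in plain Python as follows. The rows are hypothetical: the ages were chosen so that the mean is 27 and the median is 22, matching the numbers in the lecture, and the missing value sits in the third row. The helper name fill_age is made up for this example.

```python
import statistics

# Hypothetical rows; None marks the missing age in the third row.
rows = [
    {"name": "A", "age": 22},
    {"name": "B", "age": 44},
    {"name": "C", "age": None},
    {"name": "D", "age": 20},
    {"name": "E", "age": 22},
]

known_ages = [r["age"] for r in rows if r["age"] is not None]

# Approach 1: drop the rows that have a missing value.
dropped_rows = [r for r in rows if r["age"] is not None]

# Approach 2: drop the entire column that has missing values.
dropped_column = [{k: v for k, v in r.items() if k != "age"} for r in rows]

# Approach 3: replace the missing values with zero, the mean or the median.
def fill_age(records, value):
    """Return copies of the records with missing ages replaced by value."""
    return [{**r, "age": value if r["age"] is None else r["age"]} for r in records]

filled_zero = fill_age(rows, 0)
filled_mean = fill_age(rows, statistics.mean(known_ages))      # mean is 27
filled_median = fill_age(rows, statistics.median(known_ages))  # median is 22

print(filled_mean[2]["age"], filled_median[2]["age"])
```

Note that the mean and the median are computed from the known values only. The missing entry itself does not take part in the calculation.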

Please note that we can replace missing values with zero, the mean or the median only for numerical attributes. If the cabin column, which contains categorical values, had missing values, then we could not replace them with the mean or the median.

Also note that if you replace the missing values in the training set with the median, then you should use the same median value, computed on the training set, to replace missing values in the test set when you want to evaluate the model. Also, once the model goes live, use the same median value to replace missing values in the new incoming data.
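One way to make sure the same median is reused everywhere is to compute it once and keep it in a small object. The sketch below is only an illustration: MedianImputer is a hypothetical class written for this example, not a library class, and both columns are made up.

```python
import statistics

class MedianImputer:
    """Learns a median from the training data and reuses it on any other data."""

    def __init__(self):
        self.median = None

    def fit(self, values):
        # Compute the median from the known training values only.
        known = [v for v in values if v is not None]
        self.median = statistics.median(known)
        return self

    def transform(self, values):
        if self.median is None:
            raise ValueError("call fit() on the training data first")
        return [self.median if v is None else v for v in values]

train_ages = [22, 44, None, 20, 22]  # hypothetical training column
test_ages = [30, None, 50]           # hypothetical test column

imputer = MedianImputer().fit(train_ages)
train_filled = imputer.transform(train_ages)
test_filled = imputer.transform(test_ages)  # filled with the training median, 22

print(test_filled)
```

The test column is never passed to fit, so the median of the test data plays no role. The same imputer object can later be used on new incoming data once the model goes live.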

Let’s understand feature scaling.

Here is the housing dataset.

In this dataset, the total number of rooms ranges from about 6 to 39320, while the median incomes range from 0 to 15.

Machine learning algorithms often don’t perform well when the input numerical features have very different scales. So how should we handle the case when the input data contains such varied scales? There are two common ways to bring all attributes onto the same scale: min-max scaling and standardization.

So what is min-max scaling? Min-max scaling is also known as normalization.

In min-max scaling, all the values are shifted and rescaled so that they end up ranging from 0 to 1.

The minimum value in the original data becomes 0 in the normalized data, and the maximum value in the original data becomes 1 in the normalized data.

The remaining values in the original data take values between 0 and 1 in the normalized data.

To calculate the normalized value, we first subtract the minimum value of the list from the original value. Then we divide the result by the range of the list, i.e. the difference between the maximum value and the minimum value in the list.

Let’s find the normalized value of 50.

The original data consists of minus 100, minus 50, 0, 50 and 100.

In the original data, the minimum value is minus 100 and the maximum value is 100.

So the normalized value of 50 will be 50 minus negative 100, divided by 100 minus negative 100, which is 150 divided by 200, or 0.75. In short, in min-max scaling, values are shifted and rescaled so that the new values are between 0 and 1.
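The calculation above can be checked with a short Python sketch. The function name min_max_scale is made up for this example, and the data is the same list used in the lecture.

```python
def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    value_range = highest - lowest
    if value_range == 0:
        raise ValueError("all values are identical, so the range is zero")
    # Subtract the minimum, then divide by the range.
    return [(v - lowest) / value_range for v in values]

original = [-100, -50, 0, 50, 100]
normalized = min_max_scale(original)
print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

The value 50 ends up as 0.75, the minimum as 0 and the maximum as 1, in line with the worked example.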

The second approach to scaling is standardization. It is quite different from min-max scaling. As you can see in the above chart, min-max scaling rescaled the input data to the range 0 to 1, but standardization does not bound values to a specific range.

In standardization, we scale the values by calculating how many standard deviations away each value is from the mean: we subtract the mean from the value and then divide by the standard deviation. The rescaled features then have zero mean and unit variance, like the standard normal distribution.
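Here is a minimal sketch of standardization in Python, using the same five values as in the min-max example. The function name standardize is made up for this example, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption.

```python
import statistics

def standardize(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        raise ValueError("all values are identical, so the standard deviation is zero")
    return [(v - mean) / std for v in values]

original = [-100, -50, 0, 50, 100]
standardized = standardize(original)
print([round(z, 2) for z in standardized])  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```

The rescaled values have a mean of zero and a standard deviation of one, and some of them fall outside the range 0 to 1, which shows that standardization does not bound values to a fixed range.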

So which approach should we use for feature scaling - min-max scaling or standardization?

Min-max scaling is good for neural network algorithms, as they often expect input values in the range of 0 to 1.

Unlike min-max scaling, standardization does not bound values to a specific range.

In other words, min-max scaling always results in values between 0 and 1, while standardization may result in a larger range.

Compared to min-max scaling, standardization is less affected by outliers. If we are using machine learning algorithms like support vector machines and logistic regression, we typically use standardization for scaling.

One important point to note is that scaling the target values or labels is generally not required.

Let us quickly revisit what we have learnt so far in the chapter.

In this chapter, we learnt about numerical and categorical variables. Heart rate and rainfall measured in inches are numerical variables, whereas gender is a categorical variable.

Then we learnt probability. Probability is the measure of the likelihood that an event will occur.

Then we learnt measures of central tendency - Mean, median and mode. Mean is the average, median is the midpoint and mode is the most frequent number.

Then we learnt measures of spread - range, quartiles, interquartile range, variance and standard deviation.

Then we learnt the normal distribution and the 68-95-99.7 percent rule.

Then we learnt different types of plots. Histograms are great for illustrating the distribution of your data.

Scatter plots are used when you want to show the relationship between two variables.

Then we learnt about correlation. Correlation indicates the extent to which two or more variables fluctuate together. The negative correlation means one value decreases as the other value increases. No correlation means the values are not linked at all. The positive correlation means values increase together.

Then we learnt how data visualization helps in gaining insights from the data and in removing data quirks.

Then we learnt data cleaning. Machine learning algorithms cannot work with missing features, hence we deal with the missing features before feeding data to the machine learning algorithms. There are three approaches to handle missing values.

In the first approach, we drop the rows which have missing values.

In the second approach, we drop the entire column which has missing values.

In the third approach, we replace the missing values with zero, the mean or the median.

Then we learnt feature scaling. Machine learning algorithms don’t perform well when the input numerical features have very different scales. There are two ways of feature scaling - min-max scaling and standardization.

In min-max scaling, the rescaled values are always in the range of zero to one.

In standardization, the rescaled values have zero mean and unit variance, like the standard normal distribution.

Hope you liked the chapter. Stay tuned for the next chapter and happy learning!

