Building End to End Machine Learning Project


Hi, welcome to the chapter "End to End Project" of the AI and ML Course for Managers. In this chapter, we are going to build a complete machine learning project. You will get an understanding of the workflow of building a typical machine learning project.

A typical machine learning project consists of a sequence of tasks. Let’s understand this using a checklist.

The first step is to look at the big picture. When we are given the objective of a machine learning project, we first understand the business objective and how the business expects to use and benefit from the model.

Then we get the data. Getting the data involves downloading or obtaining the data on which the machine learning model will be trained.

Next, we explore the acquired data to gain meaningful insights.

Then we prepare the data for machine learning algorithms. Preparing the data involves data cleaning and feature scaling.

Once our data is ready, we feed it to different algorithms and explore various models, and then we shortlist the best model.

Next, we fine-tune the model for better accuracy.

Then we present the solution to the team.

Finally, we launch the model, monitor its performance, and periodically maintain it. We will see each of these steps in detail.

Let’s build an end-to-end project to predict housing prices in California using the California census data.

The dataset for this project is based on the data collected from the 1990 California census. Let’s have a quick look at the data. As you can see, there are various columns such as longitude, latitude, and housing_median_age for each block group. Each row is a block group, not an individual house. Block groups are the smallest geographical unit for which the US Census Bureau publishes data. A block group typically has a population of 600 to 3,000 people. Let’s call them districts for short. The columns are the attributes, or features, of each district.

Our model should learn from this data and be able to predict the median housing prices in any district, given all the other features.

You can download the complete data from the CloudxLab GitHub repository using the link displayed on the screen.

Let’s build the model by following the checklist.

The first step in the checklist is to look at the big picture. It generally consists of three tasks: frame the problem, select a performance measure, and check the assumptions. Let’s look at each task.

To frame the problem, we ask questions such as: what is the business objective, and how does the company expect to use and benefit from this model? These questions help in determining which algorithm to select and which performance measure to choose to evaluate the model.

Say, after discussing with the business team, we come to know that the housing prices predicted by the model will be fed to the investment analysis system. Based on the predicted housing prices and other signals, this system will determine whether it is worth investing in a given area or not.

Getting this right is critical, as it directly affects revenue.

The next question to ask is whether there is a current solution and, if so, what it looks like.

Say your team informs you that the district housing prices are currently estimated manually by experts, who use complex rules to estimate the housing price. This process is costly and time-consuming, and above all, their estimates have around 15% error. With this information, you know that your model should have less than 15% error if it is to replace the current solution.

Next, we have to ask questions such as: is this a supervised, unsupervised, or reinforcement learning problem?

Is it a classification task, a regression task, or something else?

Should we use batch learning or online learning?

Please pause this video for a minute and try to answer these questions… (gap) ... Hope you have found the answers. Let’s discuss.

It’s a supervised learning task as we have labeled training data. We need to predict the median house value, and we are provided with past median house values from which our model needs to learn.

Within supervised learning, it is a regression task, as we have to predict the median housing price of a district, which is a numeric value.

Since the California census data is not updated that often, we don't need to learn on the go, meaning we don't need online learning. Instead, we need batch learning: we retrain our model whenever new California census data is available, and in between we use the trained model in the pipeline to predict the median house value.

Now with all this information, we are ready to design the system.

The second task in looking at the big picture is to select criteria for measuring the performance of the model. In other words, we need to select a performance measure.

A typical performance measure for regression problems is the Root Mean Square Error (RMSE), which is the square root of the Mean Squared Error (MSE). The MSE is the average of the squared errors across all predictions, so large errors are penalized more heavily than small ones.
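To make the definition concrete, here is a minimal sketch of computing the MSE and RMSE in plain Python. The house values and predictions are made-up numbers for illustration only, and the function names are our own, not part of BootML.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, expressed in the same unit as the target."""
    return math.sqrt(mean_squared_error(actual, predicted))


# Hypothetical median house values (in dollars) for four districts.
actual = [200000, 150000, 300000, 250000]
predicted = [210000, 140000, 290000, 270000]

mse = mean_squared_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(f"MSE = {mse:.0f}, RMSE = {rmse:.2f}")
```

Because the RMSE is in the same unit as the label, here dollars, it is usually easier to explain to the business team than the MSE.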

The third task in looking at the big picture is to verify the assumptions that were made so far. This helps in catching serious issues early on

Till now we have learned that the housing prices predicted by the model will be fed to the investment analysis system. Based on the predicted housing prices and other signals this system will determine if it is worth investing in a given area or not. It is important to check with the team if the investment analysis system requires the housing prices in numbers. If the investment system requires housing prices in categories such as cheap, medium and expensive then this problem will become the classification task instead of the regression task. We do not want to learn this after spending days on building the regression model.

At CloudxLab, we have built BootML, which helps in training and building machine learning models without writing any code. Access it at cloudxlab dot com slash bootml.

Go to cloudxlab dot com slash bootml. Click on “Give it a Shot”. Please note that an active lab subscription is required to access BootML. On the left side, you can see the list of steps to be followed. Let’s use it to train the model which predicts housing prices in California.

The first step is to select a project. We can either select one of the publicly available projects or create a new project. Let’s create a new project. Click on create new project and specify the project name. We can also specify the objective of the project and the current approach, if any. Keep the project private if you do not want other users to see it. Click on Next. In this step, select supervised learning as the type of project, regression as the type of supervised learning, and mean squared error (MSE) as the performance measure. Let’s go back to the checklist.

The second step in the checklist is to get the data.

The first task in getting the data is to download or obtain the data on which the model can be trained

The second task is to convert the data into the format which machines can understand and

The third task is to split the data into training and test set. Let’s look at each task in detail

Before downloading the data, we first need to figure out what data we need and how much of it. This helps in determining how much storage is required.

Next, we find out the sources from where we can get this data

Afterward, we set up the workspace with enough storage.

Finally, we download or obtain the data from the source and store it in the workspace

Next, we ensure that sensitive information is deleted or protected. We generally anonymize the sensitive information.

In the next task, we convert the data into rows and columns so that we can process it.

In the next task, we split the data into training and test set.

We keep 80% data in the training set and the remaining 20% in the test set.

Let’s specify the dataset in BootML. We already have the CSV file containing the housing prices in California. Select the Housing California dataset from the list of existing datasets. You can also upload your own dataset. For that, click on create your own machine learning dataset. Specify the name of the dataset and a description, if any. Then click on step 1 to create a folder for your dataset. Now click on step 2 to upload your dataset files. You will be redirected to Jupyter Notebook, an open-source web application that lets you create and share documents containing live code, equations, and visualizations. Sign in using your lab username and password. Now upload the files by clicking on the upload button.

In the next step, select the file and its type. In the next step, discard the fields which are not required for training the model. Click on “see your data” to see a sample of the data. In the next step, select the features and the label. Since the model will predict the housing price, select median house value as the label. The remaining columns can be used as features.

In the next step, split the data into training and test sets in an 80:20 ratio. Note that we can specify any number as the random seed, but do not change the random seed once the data has been split. The random seed makes sure that the same set of rows goes into the training and test sets every time we split the data. Also, select the column on which you want to apply stratified sampling. You can leave this box unchecked if you do not want to apply stratified sampling on any column.
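To see why the random seed matters, here is a minimal sketch of an 80:20 split in plain Python. The function name `split_train_test` and the seed value 42 are assumptions for illustration; the notebook that BootML generates may do this differently.

```python
import random


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle row positions with a fixed seed and hold out test_ratio of them."""
    rng = random.Random(seed)          # a fixed seed gives a repeatable shuffle
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    test_size = round(len(rows) * test_ratio)
    test_indices = set(indices[:test_size])
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test


# Stand-in for 100 district records; real rows would hold the housing features.
rows = list(range(100))

train_a, test_a = split_train_test(rows, seed=42)
train_b, test_b = split_train_test(rows, seed=42)

print(len(train_a), len(test_a))   # sizes of the two sets
print(test_a == test_b)            # same seed, same test rows
```

Running the split twice with the same seed puts exactly the same rows in the test set, which is what keeps the model from ever "seeing" test rows across repeated runs.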

The third step in the checklist is “Explore the data to gain insights”.

As we have seen in the previous chapter, data visualization helps in gaining insights from the data and removing data quirks. Let’s go back to BootML.

Plot the scatter plot between latitude and longitude. Set transparency to 0.1. Setting transparency makes it easier to visualize the places where there is a high density of data points. Generate correlations and scatter matrix. Scatter matrix is a type of matrix that plots every numeric attribute against every other numeric attribute.
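The plots themselves are best viewed in BootML, but the correlation step can be sketched in a few lines. The following is a minimal illustration of the standard (Pearson) correlation coefficient in plain Python; the column values are made-up numbers, not rows from the actual dataset.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical values for five districts.
median_income = [1.5, 2.0, 3.5, 5.0, 8.0]
housing_median_age = [30, 12, 45, 20, 25]
median_house_value = [90000, 120000, 180000, 260000, 450000]

income_corr = pearson_correlation(median_income, median_house_value)
age_corr = pearson_correlation(housing_median_age, median_house_value)
print(f"income vs value: {income_corr:.3f}")
print(f"age vs value:    {age_corr:.3f}")
```

A value close to 1 or -1 indicates a strong linear relationship with the label, while a value near 0 indicates little linear relationship.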

The next step in the checklist is “Prepare the data for Machine Learning algorithms”.

In this step, we clean the data. The data might have missing values. For example, the age for passenger 9 is missing. The algorithms that we apply may not work with records having missing values.

We have multiple choices here:

Drop the age column,
Drop passenger 9
Fill the age of passenger 9 with either zero, average of the age column or median of the age column.

If most of the values of the age column are blank, we drop the column. If most of the values of a record are blank and the record is not significant from a diversity perspective, we drop the record. Otherwise, we fill it with some value.

Mostly, we don't drop the column or the row. Instead, we fill the missing value with the median of the column. This process is known as imputation.
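Here is a minimal sketch of median imputation in plain Python. The ages are made-up values for nine passengers, with the ninth one missing as in the example above; `None` is our assumed marker for a missing value.

```python
import statistics


def impute_with_median(values):
    """Replace missing entries (None) with the median of the known entries."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]


# Hypothetical age column; the age of passenger 9 is missing.
ages = [22, 38, 26, 35, 35, 54, 2, 27, None]
filled_ages = impute_with_median(ages)
print(filled_ages)
```

The median is often preferred over the average here because a few extreme values do not pull it as much.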

Afterward, we perform feature scaling. Feature scaling essentially involves bringing the values within a certain range. Most machine learning algorithms are known to perform well when the various features, i.e. the columns of the dataset, have similar ranges. One way of feature scaling is min-max scaling. In min-max scaling, we move the minimum value to 0 and the maximum value to 1, and everything else in between is moved proportionately.

The other common form of scaling is standardization whereby we first calculate the mean and standard deviation. Then we compute how many standard deviations away each value is from the mean. These newly calculated values are basically standardized values. Let’s go back to BootML.

Replace the missing values with the median and move all the categorical fields to the categorical fields box. The fields in the categorical fields box will be converted into numerical values using one-hot encoding.
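One-hot encoding turns each category into its own 0/1 column. The following is a minimal sketch in plain Python; the column name `ocean_proximity` and its values are assumptions used for illustration.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded


# Hypothetical categorical column for four districts.
ocean_proximity = ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"]

categories, encoded = one_hot_encode(ocean_proximity)
print(categories)
for row in encoded:
    print(row)
```

Each row contains exactly one 1, in the position of the category that the district belongs to.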

In the next step, select columns for min-max scaling and standardization. Let’s apply standardization in all the columns.
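The two scaling methods can be sketched in plain Python as follows. This is a minimal illustration on made-up values; the standardization here uses the population standard deviation, which is one common choice and an assumption on our part.

```python
import statistics


def min_max_scale(values):
    """Map the minimum to 0, the maximum to 1, and the rest proportionately."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]   # a constant column has no spread to scale
    return [(v - lowest) / (highest - lowest) for v in values]


def standardize(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


values = [10, 20, 30, 40, 50]   # hypothetical feature column

scaled = min_max_scale(values)
standardized = standardize(values)
print(scaled)
print([round(z, 3) for z in standardized])
```

Min-max scaling always lands in the range 0 to 1, whereas standardized values have a mean of 0 and are not bounded to a fixed range.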

The next step in the checklist is “Explore many different models and short-list the best ones”. There are many models that can do predictions. In this step, we try many models from different categories, such as linear regression, SVM (i.e. support vector machines), random forests, and neural networks. We’ll learn these algorithms later in the course.

Afterward, we measure and compare the performance of the different models. For each model, we use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds. Let’s understand cross-validation.

Earlier, we had split the original dataset into training and test set.

We do not touch the test set until we have finalized the model.

So, before touching the test set, in order to compare the various model's performance, we use a technique called cross-validation.

In cross-validation, we split the training set into a distinct smaller training set and a validation set. Basically, we split the training set into smaller folds.

In each iteration, we pick one fold as the validation set and the remaining records as the training set.

Suppose we choose 10 folds for a dataset having 450 records. There will be 10 iterations. In each iteration, we will pick one fold having 45 records as the validation set and the remaining 405 records as the training set. So, in each iteration, we will train on 90% of the data and use the remaining 10% for validation.
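The fold arithmetic can be checked with a short sketch in plain Python. The function name `k_fold_indices` is our own; in practice the records are usually shuffled before being cut into folds, which this minimal version skips for clarity.

```python
def k_fold_indices(n_records, n_folds):
    """Yield (training_indices, validation_indices) once per fold."""
    indices = list(range(n_records))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_records // n_folds + (1 if i < n_records % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


folds = list(k_fold_indices(n_records=450, n_folds=10))

for number, (training, validation) in enumerate(folds, start=1):
    print(f"iteration {number}: train on {len(training)}, validate on {len(validation)}")
```

Every record is used for validation exactly once and for training in the other nine iterations.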

In each of the 10 iterations, we measure the performance, in this case in terms of mean squared error. We then compute the mean and standard deviation of the performance measured across the 10 iterations. Next, we short-list the top three to five most promising models. Let’s go back to BootML.

Specify the number of folds for cross-validation as 10. We can specify any number of folds but cross-validation comes at the cost of training the model several times. If we increase the folds, it is going to take more time.
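To show how cross-validation is used to compare models, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data is synthetic, the two "models" are deliberately simple (a mean baseline and a straight line), and the error is reported as RMSE.

```python
import math
import random
import statistics


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fit_mean_baseline(xs, ys):
    """Toy model 1: always predict the mean of the training targets."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Toy model 2: least-squares straight line through the training data."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def cross_validate(xs, ys, fit, n_folds=10):
    """Train n_folds times and return one validation RMSE per fold."""
    n = len(xs)
    scores = []
    for k in range(n_folds):
        val_idx = set(range(k, n, n_folds))   # fold k holds every n_folds-th record
        train_x = [xs[i] for i in range(n) if i not in val_idx]
        train_y = [ys[i] for i in range(n) if i not in val_idx]
        model = fit(train_x, train_y)
        val_actual = [ys[i] for i in sorted(val_idx)]
        val_predicted = [model(xs[i]) for i in sorted(val_idx)]
        scores.append(rmse(val_actual, val_predicted))
    return scores


# Synthetic data: price grows with income, plus random noise.
rng = random.Random(0)
incomes = [rng.uniform(1, 10) for _ in range(100)]
prices = [40000 * x + rng.gauss(0, 20000) for x in incomes]

results = {}
for name, fit in [("mean baseline", fit_mean_baseline), ("straight line", fit_line)]:
    scores = cross_validate(incomes, prices, fit)
    results[name] = (statistics.fmean(scores), statistics.stdev(scores))
    print(f"{name}: mean RMSE = {results[name][0]:.0f}, std = {results[name][1]:.0f}")
```

Note that each model is trained ten times here, once per fold, which is why increasing the number of folds increases the total training time.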

Select linear regression, random forests and decision tree algorithms to train three different models.

The next step in the checklist is to fine-tune the model.

In this step, we optimize the hyperparameters using grid search or randomized search. We will learn about these algorithms and their hyperparameters later in the course.

So what are hyperparameters?

Hyperparameters are the parameters of the machine learning algorithms which cannot be directly learned during the training process. The algorithms can't figure out these parameters by themselves the way they figure out the model.

These parameters express higher-level properties of the algorithm, such as the learning rate (meaning how fast or slow the model should learn), the number of hidden layers in a deep neural network, and the number of clusters in k-means clustering.

In case you have multiple models that perform equivalently, you can combine them. This is called ensembling.

To achieve better performance, we should try ensemble methods. Combining your best models often performs better than running them individually. Most of the models that have been winning on Kaggle, such as XGBoost, are basically ensemble models.

In an ensemble, we ask multiple models for their predictions and give the end user the most popular prediction or, for a numeric target such as a price, the average of the predictions. Such combined predictions are often more accurate than the predictions of the individual models.
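Here is a minimal sketch of the two simplest ways of combining predictions in plain Python: a majority vote for categories and an average for numeric values. The predictions are made-up outputs of three hypothetical models.

```python
from collections import Counter
import statistics


def majority_vote(predictions):
    """Return the most popular prediction among the models (for categories)."""
    return Counter(predictions).most_common(1)[0][0]


def average_prediction(predictions):
    """Return the mean of the models' predictions (for numeric targets)."""
    return statistics.fmean(predictions)


# Hypothetical outputs of three models for the same district.
category_predictions = ["expensive", "medium", "expensive"]
price_predictions = [210000.0, 190000.0, 200000.0]

voted = majority_vote(category_predictions)
averaged = average_prediction(price_predictions)
print(voted, averaged)
```

Real ensemble methods such as random forests build on the same idea, but they also control how the individual models are trained so that their errors differ.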

We programmatically try various values of the hyperparameters and pick the combination that gives the best performance. This way of tuning hyperparameters is called grid search. Alternatively, we can try a fixed number of randomly selected hyperparameter combinations. This is known as randomized search. We usually start with randomized search and then further fine-tune the hyperparameters using grid search.
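The difference between the two searches can be sketched in plain Python. The function `validation_error` below is a hypothetical stand-in for "train the model with these hyperparameters and measure its cross-validated error", and the hyperparameter names and ranges are assumptions for illustration.

```python
import itertools
import random


def validation_error(max_depth, n_trees):
    """Hypothetical error surface; a real one would train and cross-validate a model."""
    return (max_depth - 8) ** 2 + (n_trees - 100) ** 2 / 100


def grid_search(grid):
    """Try every combination of the listed values and keep the best one."""
    names = list(grid)
    best_params, best_error = None, None
    for combination in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combination))
        error = validation_error(**params)
        if best_error is None or error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def random_search(n_iterations, seed=0):
    """Try a fixed number of randomly drawn combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_error = None, None
    for _ in range(n_iterations):
        params = {"max_depth": rng.randint(2, 16), "n_trees": rng.randint(10, 300)}
        error = validation_error(**params)
        if best_error is None or error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


grid = {"max_depth": [4, 8, 12], "n_trees": [50, 100, 200]}
grid_params, grid_error = grid_search(grid)
random_params, random_error = random_search(n_iterations=20)

print("grid search:  ", grid_params, grid_error)
print("random search:", random_params, random_error)
```

Grid search evaluates all nine combinations in the grid, while randomized search evaluates exactly as many combinations as we ask for, however large the ranges are.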

After tweaking the hyperparameters of the model for a while, we evaluate the final model on the test set. After we have evaluated the model on the test set, we must resist the temptation to tweak the hyperparameters to make the model look good on the test set, because such improvements are unlikely to generalize to unseen data. Let's go back to BootML.

Tune the hyperparameters using grid search. Now we are done with all the steps required to predict the housing prices in California. In the next step, we click on generate the notebook to generate the Jupyter notebook. This notebook contains the code required for building the model. Let's have a quick look at the generated notebook. Open the notebook. If you are not already logged in, log in using your lab username and password. Each cell of the notebook contains the code based on your steps in BootML. You can either run an individual cell by pressing Shift + Enter or run all the cells by clicking on Cells and then selecting “Run All”. Let’s run all the cells. Wait for the cells to execute, and you can see the final root mean square error on the test data in the last cell.

Let’s see the various sections of the notebook. Here we have defined the data location.

Then we have split the data into training and test sets. Instead of random sampling, we are using stratified sampling. In this project, we came to know from the business people that the median income is a very important criterion, as it impacts the predictions the most. Therefore, we want to ensure that every income group has the right representation in the training and test sets.

For that, we first create the median income categories and then use the stratified sampling on median income categories to split the dataset into training and test set.
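Here is a minimal sketch of that idea in plain Python. The income bin edges, the district records, and the function names are all assumptions for illustration; the generated notebook may define the categories differently.

```python
import random
from collections import Counter, defaultdict

# Hypothetical upper edges of the income categories 1 to 4; category 5 is the rest.
INCOME_BIN_EDGES = [1.5, 3.0, 4.5, 6.0]


def income_category(median_income):
    """Map a median income to a category number from 1 to 5."""
    for category, upper_edge in enumerate(INCOME_BIN_EDGES, start=1):
        if median_income < upper_edge:
            return category
    return len(INCOME_BIN_EDGES) + 1


def stratified_split(rows, key, test_ratio=0.2, seed=42):
    """Hold out test_ratio of the rows from each category separately."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    train, test = [], []
    for category in sorted(strata):
        members = strata[category][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_ratio)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Made-up districts: 50 low-income, 30 middle-income and 20 high-income.
incomes = [1.0] * 50 + [3.5] * 30 + [7.0] * 20
districts = [{"id": i, "median_income": income} for i, income in enumerate(incomes)]

train_set, test_set = stratified_split(
    districts, key=lambda d: income_category(d["median_income"]))

test_counts = Counter(income_category(d["median_income"]) for d in test_set)
print(len(train_set), len(test_set), dict(test_counts))
```

The test set keeps the same 50:30:20 proportions as the full data, which a purely random split would only achieve approximately.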

The code in the next section is for visualizing the data. Currently, latitude and longitude are on X and Y axis respectively as we had selected in the BootML. We can change the variables in the code to get the visualization of attributes of our choice. Then we generate correlations and scatter matrix.

Currently, the scatter matrix is looking a bit congested. To get a better understanding we can reduce the number of parameters.

Here we prepare the data for ML algorithms by cleaning the missing values, encoding the categorical values and feature scaling.

Finally, we train the model using the selected algorithms and do cross-validation to compare the various models. Now, we can see and compare the performance of the various models. The mean squared error for the decision tree is approximately seventy-one thousand, for linear regression it is approximately sixty-nine thousand, and for Random Forest it is approximately twenty-two thousand. So, Random Forest is a clear winner as it has the minimum error. The other measure of a good model is a low standard deviation in the error, and Random Forest also has the lowest standard deviation. Hence, we will be using the Random Forest.

We can further improve the model by fine-tuning the hyperparameters for the selected model. After fine tuning the various parameters the mean squared error has come down to approximately forty-nine thousand.

Now we can analyze the best models, see the importance score of each attribute and test it on the test set to get the final root mean square error. The root mean square error on the test set is 47766.

The next step in the checklist is to present the solution. This step is part of the pre-launch phase.

In this step, we document everything, highlighting what worked and what did not work, what assumptions were made, and what the model’s limitations are.

Next, we create nice presentations with clear visualizations and easy-to-remember statements such as “the median income is the number one predictor of housing prices”.

The final step in the checklist is to “Launch, monitor and maintain the system”. In this step, we get the model ready for production by plugging in the production input data.

We also write monitoring code to check the model’s performance at regular intervals and trigger an alert when it drops.
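A monitoring check of this kind can be as simple as the following sketch in plain Python. The threshold value, the function names, and the sample batches are assumptions for illustration; a real system would also need the true prices, which usually arrive later or come from expert review.

```python
import math


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def check_performance(actual, predicted, rmse_threshold):
    """Compare the current error with a threshold and flag an alert if it is exceeded."""
    current_rmse = rmse(actual, predicted)
    return {"rmse": current_rmse, "alert": current_rmse > rmse_threshold}


RMSE_THRESHOLD = 50000   # hypothetical acceptable error, in dollars

# Two made-up batches of sampled predictions with their true values.
healthy = check_performance([200000, 300000], [210000, 290000], RMSE_THRESHOLD)
degraded = check_performance([200000, 300000], [100000, 400000], RMSE_THRESHOLD)

for name, report in [("healthy batch", healthy), ("degraded batch", degraded)]:
    status = "ALERT" if report["alert"] else "ok"
    print(f"{name}: RMSE = {report['rmse']:.0f} -> {status}")
```

Since a drop in performance shows up as a rise in error, the alert fires when the RMSE goes above the threshold.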

Evaluating the model's performance also requires sampling the model's predictions and having them evaluated by human experts. You can work either with field experts or with workers on a crowdsourcing platform such as Amazon Mechanical Turk.

Also for online learning systems, it is important to monitor the quality of input data. An online learning system keeps updating the model or retraining the model as and when new data comes. The quality of a model may drop if the quality of input data has dropped. For example, the performance may degrade if a malfunctioning sensor starts sending random values suddenly and your model learns using this bad input.

Also, the model tends to rot with time as the model may not be as relevant on new data as it was on older data.

So keep on retraining your model on fresh data regularly. This is all for this chapter. Keep this checklist handy as it can guide you through your Machine Learning projects. Let’s quickly revisit the checklist.

Let's take a look at all the steps that we followed. First, we look at the big picture which involves framing the problem and selecting the performance measure.

Next, we download data, convert it into the form which machines can understand and then we split it into training and test set.

Next, we explore the data to gain insights and remove data quirks.

Next, we prepare the data for machine learning algorithms which involves data cleaning and feature scaling.

Next, we try different models and shortlist the top three to five most promising models.

Next, we tune the hyperparameters to improve the model.

Next, we document everything and present the solution.

And finally, we launch, monitor and maintain the system.

Hope you liked the chapter. Stay tuned for the next chapter and happy learning!

