Predicting housing pricing in California using AzureML


Resources -

Housing Dataset

Let's build a model for predicting housing prices in California using Microsoft AzureML. AzureML is a collection of services and tools intended to help developers train and deploy machine learning models without writing code.

Go to studio.azureml.net and log in with your Azure account. If you don't have an account, you can sign up for free.

Let’s follow the machine learning checklist to build this model.

First, we look at the big picture, which involves framing the problem and selecting the performance measure. We are predicting a numeric value (the median house value) from labeled examples, so this is a supervised learning regression problem, and the performance measure can be the root mean square error.
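As a small illustration of the performance measure, the root mean square error can be computed as in the sketch below. The house values are made-up numbers used only to show the arithmetic; they do not come from the housing dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up house values in dollars, for illustration only.
actual = [200000.0, 150000.0, 300000.0]
predicted = [210000.0, 140000.0, 330000.0]
error = rmse(actual, predicted)
print(round(error, 2))
```

Because the errors are squared before averaging, one large miss (30,000 here) weighs more than several small ones.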

Next, we download the data, convert it into a form which machines can understand, and then split it into training and test sets. Let’s get the data.

Download the housing.csv file from https://raw.githubusercontent.com/cloudxlab/ml/master/machine_learning/datasets/housing/housing.csv
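If you want to inspect the downloaded file outside Azure ML, a minimal sketch using only the Python standard library could look like this. The function name `read_housing_rows` and the three-column sample are assumptions made for illustration; the sample rows are invented and much smaller than the real file.

```python
import csv
import io


def read_housing_rows(stream):
    """Parse CSV text into dicts: numbers become floats, blanks become None."""
    rows = []
    for record in csv.DictReader(stream):
        row = {}
        for name, value in record.items():
            if value == "":
                row[name] = None  # missing value
            else:
                try:
                    row[name] = float(value)
                except ValueError:
                    row[name] = value  # text column such as ocean_proximity
        rows.append(row)
    return rows


# Invented sample with the same kind of columns as housing.csv.
sample = io.StringIO(
    "median_income,total_bedrooms,ocean_proximity\n"
    "8.3252,129,NEAR BAY\n"
    "3.1200,,INLAND\n"
)
rows = read_housing_rows(sample)
print(rows[1])

# For the real file, pass an open file object instead:
# with open("housing.csv", newline="") as f:
#     rows = read_housing_rows(f)
```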

Now go to Azure ML, click on datasets, click on new and upload the housing dataset from your local machine.

Next, click on experiments and create a blank experiment. Here, you will see a canvas where you can drag and drop various services to train your model. Name the experiment Housing California. Drag your dataset onto the canvas and click on visualize to see the data.

Next, drag “select columns in dataset” and join it with the dataset box. Click on the “select columns in dataset” box and click on “launch column selector”. Here, select all the columns which you think are required for training the model. Let’s select all the columns for now.

Next, drag “split data” and join it with the “select columns in dataset” box. Click on the “split data” box and split the data into training and test sets in an 80:20 ratio. Specify any number as the random seed. Set “stratified split” to true and select the median income column for the stratified split.
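To make the idea of a stratified 80:20 split concrete, here is a rough sketch in plain Python. Grouping median income into bins of width 1.5 (capped at 5 bins) is an assumption of this sketch, not a description of what the “split data” box does internally; the function name, parameters and synthetic rows are likewise hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, test_fraction=0.2, seed=42,
                     bin_width=1.5, max_bin=5):
    """Split rows into (train, test), taking test_fraction from each stratum."""
    strata = defaultdict(list)
    for row in rows:
        # Bin the continuous column so that every stratum holds several rows.
        bin_index = min(int(row[key] // bin_width) + 1, max_bin)
        strata[bin_index].append(row)

    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, test = [], []
    for bin_index in sorted(strata):
        members = strata[bin_index][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Synthetic rows, for illustration only.
rows = [{"median_income": 0.5 + i / 10} for i in range(100)]
train, test = stratified_split(rows, "median_income")
print(len(train), len(test))
```

Sampling within each income bin keeps the income distribution of the test set close to that of the full dataset, which is the point of stratifying.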

Next, we explore the data to gain insights and remove data quirks.

Select the dataset box and click on visualize to visualize the data.

Next, we prepare the data for machine learning algorithms, which involves data cleaning and feature scaling.

Let’s clean the training and test sets. Drag a “Clean missing data” box for each and join them with the training and test sets. Select “cleaning mode” as “replace with median”. Please note that we can replace missing values with the median only in the numerical columns. Let’s select the numerical columns for both the “clean missing data” boxes.

Click on “launch column selector”, select “no columns” and select “column names”. Select all the numerical columns. Do the same for the test set “clean missing data” box.
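What “replace with median” does to a numerical column can be sketched as follows. The helper names are hypothetical and the bedroom counts are invented. This sketch learns the median from the training rows and reuses it for the test rows, which is one common convention and not necessarily how two separate “clean missing data” boxes behave.

```python
import statistics


def fit_medians(rows, columns):
    """Compute the median of each listed column, ignoring missing values."""
    medians = {}
    for column in columns:
        present = [row[column] for row in rows if row[column] is not None]
        medians[column] = statistics.median(present)
    return medians


def fill_missing(rows, medians):
    """Return copies of rows with None replaced by the stored median."""
    return [
        {name: (medians[name] if value is None and name in medians else value)
         for name, value in row.items()}
        for row in rows
    ]


# Invented values, for illustration only.
train_rows = [
    {"total_bedrooms": 100.0},
    {"total_bedrooms": None},
    {"total_bedrooms": 300.0},
    {"total_bedrooms": 200.0},
]
test_rows = [{"total_bedrooms": None}]

medians = fit_medians(train_rows, ["total_bedrooms"])
clean_train = fill_missing(train_rows, medians)
clean_test = fill_missing(test_rows, medians)
print(medians, clean_test)
```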

Now feature scale the training and test set. Drag “normalize data” and join it with “clean missing data” boxes. Select the “transformation method” as “zscore” which is standardization. Next, select the columns for standardization. Please note that feature scaling of the label is generally not required.
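Standardization subtracts the column mean and divides by the column standard deviation. The sketch below uses the population standard deviation; whether the “normalize data” box uses the population or the sample formula is not stated here, so treat that choice as an assumption. The input values are arbitrary.

```python
import statistics


def fit_zscore(values):
    """Return the mean and population standard deviation of a column."""
    return statistics.mean(values), statistics.pstdev(values)


def apply_zscore(values, mean, std):
    """Standardize values with a previously computed mean and std."""
    if std == 0:
        return [0.0 for _ in values]  # constant column carries no information
    return [(value - mean) / std for value in values]


# Arbitrary numbers, for illustration only.
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = fit_zscore(column)
scaled = apply_zscore(column, mean, std)
print(mean, std, scaled)
```

After scaling, the column has mean 0 and standard deviation 1, so features measured in very different units become comparable.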

Next, we try different models and shortlist the top three to five most promising models. Let’s try linear regression and boosted decision tree regression algorithms.

First, try the linear regression algorithm. Drag “Linear regression” and “train model”. Join “linear regression” and “training set normalize data” with the “train model” box. Click on the “train model” box and select the label as median house value.

Also, cross-validate the model. Drag “cross-validate model” and join it with “train model” and “normalized training set”. Select the label column as median house value. Let’s run the steps.
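The idea behind cross-validation can be sketched in a few lines: split the training data into k folds, train on k-1 of them, measure the root mean square error on the held-out fold, and average over the folds. The sketch below assumes 5 folds and a one-feature least-squares line as the model, purely to keep it short; the data is synthetic and the function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_rmse(xs, ys, k=5):
    """Return the per-fold RMSE values and their mean."""
    fold_scores = []
    for fold in range(k):
        held_out = set(range(fold, len(xs), k))  # every k-th row goes to this fold
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        squared = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out]
        fold_scores.append(math.sqrt(sum(squared) / len(squared)))
    return fold_scores, sum(fold_scores) / k


# Synthetic data: a straight line with a small alternating disturbance.
xs = [float(x) for x in range(20)]
ys = [3.0 * x + 1.0 + (1.0 if int(x) % 2 else -1.0) for x in xs]
fold_scores, mean_rmse = cross_validated_rmse(xs, ys)
print(fold_scores, mean_rmse)
```

Averaging over folds gives a steadier estimate than a single train and validation split, because every row is used for validation exactly once.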

Now select “cross-validate model” box and click on visualize to see the root mean square error of linear regression algorithm across all the folds. As you can see the mean root mean square error across all the folds is approximately sixty-eight thousand. Let’s see if we can find a better model. Change Linear Regression to boosted decision tree regression. Let's run the steps again.

The root mean square error of the boosted decision tree regression algorithm is approximately forty-seven thousand, which is a lot better than the linear regression algorithm. Let’s select the model generated by the boosted decision tree regression algorithm for hyperparameter tuning.

Next, we improve the model by fine-tuning its hyperparameters.

Drag “tune model hyperparameters” and join it with “train model” and “normalized training set”. Select the label column as median house value and the “metric for measuring performance” as the root mean square error. This step gives us the final hyperparameter-tuned model. Let’s evaluate it on the test set.
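Conceptually, hyperparameter tuning tries several parameter combinations and keeps the one with the lowest error. The sketch below shows an exhaustive grid sweep, which is only one of several possible sweep strategies. The parameter names `learning_rate` and `num_trees` are hypothetical, and `stand_in_rmse` is a made-up scoring function standing in for "train the model and measure its validation error"; it does not reflect real results.

```python
import itertools


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the lowest-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def stand_in_rmse(params):
    """Made-up error surface; a real version would train and validate a model."""
    return (45000
            + 10 * abs(params["num_trees"] - 200)
            + 20000 * abs(params["learning_rate"] - 0.1))


# Hypothetical search space, for illustration only.
param_grid = {"learning_rate": [0.05, 0.1, 0.2], "num_trees": [100, 200, 400]}
best_params, best_score = grid_search(param_grid, stand_in_rmse)
print(best_params, best_score)
```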

Drag “score model” and join it with “normalized test set” and “tune model hyperparameters”. Next, drag “evaluate model” and join it with the “score model” box. Let’s run the steps. The root mean square error on the test set is approximately seventy thousand.

If you are happy with the results then you can deploy this model as a web service. Azure ML gives this functionality out of the box. With web service, you can call this model from your mobile app or web application. Select the model generated by “tune model hyperparameters” box, click on “setup web service” and select “predictive web service”. Click on run and after the run is complete, click on “deploy web service” to deploy the web service. Here you can see your API key and the API help page. Open API help page. Here is the HTTP POST URL and below are the sample codes in C#, Python and R using which you can call this API.
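A call to the deployed web service from Python might look roughly like the sketch below, which builds the HTTP POST request with the standard library but does not send it. The URL and API key are placeholders, and the JSON layout (`Inputs`, `input1`, `ColumnNames`, `Values`, `GlobalParameters`) is an assumption; copy the exact URL, key and request body from your own API help page. The input row is invented.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, column_names, rows):
    """Build (but do not send) a JSON POST request for a scoring endpoint."""
    payload = {
        # Assumed layout; check the API help page for the real one.
        "Inputs": {"input1": {"ColumnNames": column_names, "Values": rows}},
        "GlobalParameters": {},
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholders: replace with the values shown on your API help page.
SCORING_URL = "https://example.invalid/score"
API_KEY = "YOUR_API_KEY"

columns = ["longitude", "latitude", "housing_median_age", "median_income"]
row = [["-122.23", "37.88", "41", "8.3252"]]  # invented input row
request = build_scoring_request(SCORING_URL, API_KEY, columns, row)
print(request.get_method(), request.full_url)

# To actually call the service:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```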

Hope you liked this project and happy learning!

https://discuss.cloudxlab.com/c/course-discussions/ai-and-ml-for-managers
