Let us build one more machine learning project using BootML. Say we have to build a model that predicts the burned area of forest fires in the northeast region of Portugal using meteorological and other data.
The dataset contains various features such as month, day, temperature and wind. Our algorithm should learn from this data and build the model. Then this model should be able to predict the area burned by a forest fire given all the other features.
Here columns X and Y contain the x-axis and y-axis spatial coordinates, respectively, within the Montesinho park map.
The month and day columns represent the month of the year and the day of the week, respectively.
The FFMC, DMC, DC and ISI columns contain the corresponding indexes from the Fire Weather Index system.
The temp column contains the temperature in degrees Celsius.
The RH column records the relative humidity as a percentage.
The wind column contains the wind speed in km/h.
The rain column contains the outside rain in millimeters per square meter.
And the area column contains the burned area of the forest in hectares.
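As a quick sketch, the column layout described above can be inspected with pandas. The values below are made up for illustration; they are not rows from the real forestfires.csv file.

```python
import pandas as pd

# Hypothetical miniature with the same 13 columns as forestfires.csv
data = pd.DataFrame({
    "X": [7, 7], "Y": [5, 4],
    "month": ["mar", "oct"], "day": ["fri", "tue"],
    "FFMC": [86.2, 90.6], "DMC": [26.2, 35.4],
    "DC": [94.3, 669.1], "ISI": [5.1, 6.7],
    "temp": [8.2, 18.0], "RH": [51, 33],
    "wind": [6.7, 0.9], "rain": [0.0, 0.0],
    "area": [0.0, 0.0],
})

print(data.shape)             # 13 columns, as in the dataset description
print(data.columns.tolist())  # X, Y, month, day, ..., area
```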
For a detailed description of the dataset, please check the link displayed on the screen. This dataset is already part of BootML. Let's train the model. Open BootML.
Create a new project and name it "Forest fires". Click on "Next". This problem is a supervised learning regression task, and as we know, the root mean square error is the preferred performance measure for regression tasks. Next, we select the existing Forest fires dataset. Then, select the forestfires.csv file and specify the file type as csv. Let's move to the next step. Click on "see your data here" to see the first few rows of the data. Next, discard the columns which you think are not required for training. And then, select the features and the label. Here the label is area and all other columns are features. Move to the next step. We plan to split the data into training and test sets in an 80:20 ratio. Specify the random seed as 42. You could specify any other random seed.
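Under the hood, the generated notebook performs this kind of split with scikit-learn's train_test_split. A minimal sketch, using synthetic stand-in data of the same size as the forest fires dataset (517 rows):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 517-row forest fires data
rng = np.random.default_rng(0)
X = rng.normal(size=(517, 12))  # 12 feature columns
y = rng.normal(size=517)        # "area" label

# 80:20 split with random seed 42, as configured in BootML
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))
```

Fixing the random seed makes the split reproducible, so the same rows land in the test set every time the notebook is run.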
Afterwards, select the columns for the scatter plot and check the "generate correlations" and "scatter matrix" checkboxes. Next, select median as the imputer for fixing the missing values. Move to the next step. Specify the numerical and categorical fields. Click on "see your data here" to see the data. As you can see, only the month and day columns are categorical. Move them to the categorical field column. In this step, scale all the columns using standardization. Move to the next step. In this step, specify the number of cross-validation folds as 10 and train three different models using the Linear regression, Random forest, and Decision tree algorithms. Tune the hyperparameters using the Grid search algorithm. Click on "Next". In this step, generate the notebook. Open the notebook and run all the cells to see the results.
Now, let us walk through the data and the project notebook to understand them in more detail.
Here is the dataset location, and the filename is forestfires.csv. Let's quickly see the data in the lab. Log in to the web console using your lab username and password and see the first few lines of the dataset. The dataset has a total of 13 columns and 517 rows. Here we can see the mean, standard deviation, minimum, 1st quartile, 2nd quartile, 3rd quartile and maximum of all the columns. Here we have split the data into an 80:20 ratio with a random seed of 42.
Here we can see how the area is correlated with the other features. The area has a positive correlation with the DMC and temp features and a negative correlation with the RH and rain features.
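The correlations come from the standard (Pearson) correlation matrix that the notebook computes. A sketch on a tiny made-up frame whose values are chosen to show the same signs as described above:

```python
import pandas as pd

# Hypothetical values: temp rises with area, RH falls as area rises
df = pd.DataFrame({
    "temp": [8.2, 18.0, 14.6, 22.2],
    "RH":   [51, 33, 40, 29],
    "area": [0.0, 6.4, 3.1, 10.2],
})

# Correlation of every column with the label, strongest first
corr = df.corr()["area"].sort_values(ascending=False)
print(corr)
```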
Next, we fix the missing values with the median and separate the numerical and categorical features. Then we apply one-hot encoding to the categorical columns and standardization to the numerical columns. Next, we train models using the Linear regression, Decision tree and Random forest algorithms and compute the RMSE and the cross-validation score for each algorithm. Let's see the RMSE of each algorithm.
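A minimal sketch of this preprocessing and training step, using scikit-learn on a tiny made-up frame (column names and values are illustrative; the generated notebook works on the full dataset):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical miniature of the forest fires data, with one missing value
df = pd.DataFrame({
    "temp": [8.2, 18.0, np.nan, 22.2],
    "wind": [6.7, 0.9, 4.0, 5.4],
    "month": ["mar", "oct", "aug", "aug"],
    "area": [0.0, 6.4, 3.1, 10.2],
})

num_cols, cat_cols = ["temp", "wind"], ["month"]
preprocess = ColumnTransformer([
    # numerical columns: median imputation, then standardization
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    # categorical columns: one-hot encoding
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

model = Pipeline([("prep", preprocess),
                  ("tree", DecisionTreeRegressor(random_state=42))])
model.fit(df[num_cols + cat_cols], df["area"])
print(model.predict(df[num_cols + cat_cols]))
```

Note that the unrestricted decision tree reproduces the training labels exactly, which is precisely the kind of training-set fit that makes its RMSE look deceptively low.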
As you can see, the RMSE of the Decision tree algorithm is 0.62, which is the minimum among the three algorithms. You might think that the decision tree is the perfect model because it has the minimum RMSE. But how do we validate whether the model generated by the Decision tree algorithm is really a perfect model? This is where we use cross-validation. Let's see the mean RMSE of 10-fold cross-validation.
The mean RMSE of 10-fold cross-validation for the Linear regression algorithm is 36.07, while it is 69.68 and 47.44 for the Decision tree and Random forest algorithms respectively. Here we can see that the model generated by the Decision tree algorithm is overfitting, as its mean RMSE in 10-fold cross-validation is much higher than its RMSE on the training set.
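A sketch of how this overfitting check works, using synthetic noise data rather than the actual forest fires dataset. The tree's training RMSE is near zero, but 10-fold cross-validation exposes its true error:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: the label is pure noise, so there is nothing to learn
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = rng.normal(size=200)

def cv_rmse(model):
    # scikit-learn reports negative MSE, so negate it before the square root
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=10)
    return np.sqrt(-scores).mean()

tree = DecisionTreeRegressor(random_state=42).fit(X, y)
train_rmse = np.sqrt(np.mean((tree.predict(X) - y) ** 2))

print(train_rmse)                   # near zero: the tree memorizes the data
print(cv_rmse(tree))                # much higher: overfitting revealed by CV
print(cv_rmse(LinearRegression()))
```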
In fact, the Decision tree model performed worse than Linear regression.
Let's discard the Decision tree algorithm. We can see that the RMSE of Random forest on the training set is lower than the RMSE of Linear regression, but its mean RMSE in cross-validation is much higher than on the training set. This means that Random forest is also overfitting the training set. To solve this, we can either regularize the model or get a lot more training data. We will learn about regularization later in the course. Let's select the Random forest model now for hyperparameter tuning.
Here we are fine-tuning the hyperparameters. The model after hyperparameter tuning is the final model. Next, we calculate the root mean square error on the test data, and if we are happy with the results, we deploy this model in production. The root mean square error on the test set is 109.40.
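A sketch of Grid search tuning with scikit-learn's GridSearchCV. The data and the parameter grid below are illustrative assumptions; BootML's generated notebook defines its own grid for the Random forest model:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical data with a simple linear signal plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X[:, 0] * 3 + rng.normal(scale=0.1, size=100)

# Hypothetical grid: every combination is tried with 5-fold CV
param_grid = {"n_estimators": [10, 30], "max_features": [2, 4]}
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

best = search.best_estimator_            # the final model after tuning
best_rmse = np.sqrt(-search.best_score_) # CV RMSE of the best combination
print(search.best_params_)
print(best_rmse)
```

After tuning, the final model (best_estimator_) is the one evaluated once on the held-out test set.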
One important thing to note is that during the model training process, we can combine features and create our own custom features if we think they would be good predictors of the label. For example, we can create a new feature, RH by wind, which contains RH divided by wind.
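Such a combined feature is one line of pandas. The column name RH_by_wind and the values below are illustrative:

```python
import pandas as pd

# Hypothetical rows with the two source columns
df = pd.DataFrame({"RH": [51, 33, 40], "wind": [6.7, 0.9, 4.0]})

# Custom feature: relative humidity divided by wind speed
df["RH_by_wind"] = df["RH"] / df["wind"]
print(df["RH_by_wind"].round(2).tolist())
```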