End-to-End ML Project - California Housing

End-to-End ML Project - Train a Decision Tree model

Now that we have prepared the data, we will train a Decision Tree model on that data and see how it performs. Since this is a regression problem, we will use the DecisionTreeRegressor class from Scikit-learn.

  • Import the DecisionTreeRegressor class from Scikit-learn

    from sklearn.tree import <<your code goes here>>
  • Now let's train the DecisionTreeRegressor

    tree_reg = DecisionTreeRegressor(random_state=42)
    tree_reg.fit(housing_prepared, housing_labels)
  • To evaluate the performance of our model, we will import the mean_squared_error function from Scikit-learn

    from sklearn.metrics import <<your code goes here>>
  • Now let's make predictions on the training data using the model's predict method

    housing_predictions = tree_reg.<<your code goes here>>(housing_prepared)
  • Finally, let's evaluate our model

    tree_mse = mean_squared_error(housing_labels, housing_predictions)
    tree_rmse = np.sqrt(tree_mse)

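    If you trained your model correctly, tree_rmse will come out to 0.0. Note that we evaluated the model on the same data it was trained on, so an error of exactly zero does not mean the model is perfect. It most likely means the model has badly overfit the training data. How do we check for and resolve this issue? We will come to that in a bit, but before that we will train a Random Forest model.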
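The steps above can be sketched end to end on synthetic data, which may help show why the training RMSE collapses to zero. This is only an illustration: `X_demo` and `y_demo` are made-up stand-ins for `housing_prepared` and `housing_labels`, not the actual housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for housing_prepared / housing_labels (assumption: not the real data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)  # noisy target

# With no depth limit, the tree keeps splitting until each leaf is pure.
tree_demo = DecisionTreeRegressor(random_state=42)
tree_demo.fit(X_demo, y_demo)

# Predict on the SAME rows the tree was trained on, as in the exercise.
pred_demo = tree_demo.predict(X_demo)
rmse_demo = np.sqrt(mean_squared_error(y_demo, pred_demo))
print("training RMSE:", rmse_demo)
```

Even though the target contains random noise that no model could truly predict, the unconstrained tree reproduces every training label, so the printed training RMSE should be zero or vanishingly small. That is memorization of the training set, not evidence of a good model.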

Note: If a model performs significantly better on the training data than on the test data, it may be overfitting the training data. But we can't be sure that it is always overfitting, because the same gap can arise from other problems. One is a data mismatch between the training and test sets: the test set contains a different type of data (with a different distribution) that was not present in the training data. Another is the stochastic nature of the algorithm, which can produce a gap by chance on a single run. So we have to check whether the model is really overfitting or suffering from some other problem.
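One possible way to probe the note above is to compare training and held-out RMSE over several random splits. The sketch below uses synthetic data; `X_toy`, `y_toy` and the `rmse` helper are hypothetical names introduced for illustration. Because the training and held-out rows come from the same distribution here, data mismatch is ruled out by construction, and repeating the split with different seeds helps judge whether the gap is a chance effect.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def rmse(model, X, y):
    """Root mean squared error of model predictions on (X, y)."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))


# Synthetic data (assumption: a stand-in, not the housing dataset).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 4))
y_toy = 3.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=1.0, size=500)

gaps = []
for seed in range(5):
    # A fresh random split and a fresh tree for every seed.
    X_train, X_test, y_train, y_test = train_test_split(
        X_toy, y_toy, test_size=0.2, random_state=seed
    )
    model = DecisionTreeRegressor(random_state=seed).fit(X_train, y_train)

    train_rmse = rmse(model, X_train, y_train)
    test_rmse = rmse(model, X_test, y_test)
    gaps.append(test_rmse - train_rmse)
    print(f"seed={seed} train RMSE={train_rmse:.3f} held-out RMSE={test_rmse:.3f}")
```

If the held-out error is consistently much larger than the training error across all seeds, as it should be here, chance and data mismatch become unlikely explanations and overfitting is the more plausible one. On real data, the distributions of the two sets would still need to be checked separately.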

