End-to-End Project - Self-Contained


End-to-end Machine Learning Project Part-4









59 Comments

My notebook stops responding after running the cross-validation step.


Hi,

Fitting or hyperparameter tuning might take some time, as it needs to try all the combinations in the search space. Please wait for 5-10 minutes. If it still doesn't respond, restart the kernel and run all the cells from the beginning.

Thanks.


Hi,

The error is self-explanatory. You have either not defined the n_features variable, or have not executed the cell containing that code. If you have not defined the variable, please check the step hint/answer. If you have not executed the cell, or have restarted your browser, please execute your code from the beginning.

Thanks.


I am getting the below error while running this code:

cat_encoder = CategoricalEncoder(encoding="onehot-dense")
housing_cat_reshaped = housing_cat.values.reshape(-1, 1)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)
housing_cat_1hot

 


Hi 

In this model, the most important attribute related to median house value was median income, which is numerical.

But suppose in some other model the most important attribute is categorical; obviously, we will have to one-hot encode that attribute.

I am having difficulty understanding how we will train and predict in that case.

 


Hi,

Since we can't use the raw categorical values directly, we represent the categorical data in numerical form (for example via one-hot encoding), and then train and test as usual.
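For illustration, a minimal sketch of one-hot encoding a categorical column with scikit-learn (the toy data below stands in for the project's ocean_proximity column):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column standing in for ocean_proximity from the project
housing_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "ISLAND", "INLAND"]})

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat).toarray()  # dense 0/1 matrix
print(cat_encoder.categories_)   # category order = column order of the encoded matrix
print(housing_cat_1hot)

The resulting 0/1 columns are then fed to the model alongside the numerical features.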

Thanks.


Correlation gives a score on a scale of 0 to 1 in magnitude, with the sign signifying a positive or negative relation.

Kindly explain the 'feature importance' values obtained here. How are they better than the correlation values, as stated in the video? The last value, 'ISLAND', is 5.8729...

Thanks.


Hi,

Good question.

Correlation essentially measures the positive/negative 'change' in one feature as you increase/decrease the other, so it mainly captures linear relationships. Feature importance, on the other hand, is more likely to identify which features are most influential for the model's predictions, including non-linear effects, provided that the model performs well.
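As an illustration (a sketch on synthetic data, not the course notebook), the two measures can be computed side by side; here x1 influences y strongly but non-linearly, so its correlation is near zero while its feature importance is high:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: y depends on x1 quadratically (non-linear) and on x2 linearly
rng = np.random.RandomState(42)
X = pd.DataFrame({"x1": rng.uniform(-3, 3, 500), "x2": rng.uniform(-3, 3, 500)})
y = X["x1"] ** 2 + 0.5 * X["x2"] + rng.normal(0, 0.1, 500)

# Pearson correlation only sees the linear part, so x1 scores near zero
print(X.assign(y=y).corr()["y"])

# A fitted forest's feature importances pick up the non-linear influence of x1
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
print(dict(zip(X.columns, forest.feature_importances_)))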

Thanks.


1:44:20,

'Up to longitude & latitude, the accuracy seems to be good.'

How are we reaching this conclusion?

Thanks.


Hi,

If you look at the code, you will see that we have sorted these features in descending order based on their importance. So we can roughly see that, down to "longitude" and "latitude", the features together account for more than 50% of the total importance. That is how we come to the above conclusion.

Thanks.


Q1. Why is 'ISLAND' value not on top?

Q2. I was reading about feature importance. Is it available only in the case of random forests? Are there feature importance scores/values in other algorithms like linear regression and decision trees? Do correct me if I am wrong.

 

Thanks.


Hi,

1. I am not sure why it is not showing at the top even though it is supposed to. Try without the zip() as shown below:

# important_features_dict is assumed to map feature name -> importance score
important_features_list = sorted(important_features_dict,
                                 key=important_features_dict.get,
                                 reverse=True)

print('Most important features: %s' % important_features_list)

2. Feature importance scores can be obtained for other models too: tree-based models such as decision trees and random forests expose them directly via feature_importances_, while for models like linear regression you can look at the coefficients, or use a model-agnostic technique such as permutation importance.

Thanks.


Time stamp- 59:10

Regarding the non-zero cross-validation error for the regression tree, it is said that 'it is not based on overfitting, it is based on real error.' I didn't understand this. Isn't the algorithm the same? Or will this be explained in further modules?

Does cross validation bring in some sort of 'hyperparameter'? I mean, from '0' error to 'non-zero' error in all 10 iterations?

Thanks.


The 'hyperparameter' or the 'constraint' that is being talked about. I am currently midway through the video.


Hi,

Cross Validation is a resampling technique that is used to evaluate machine learning models on a limited data sample.

A test set should still be kept aside for final evaluation, but we no longer need a separate validation set (sometimes called the dev set) while doing cross validation. The training set is split into k smaller sets (there are other approaches too, but they generally follow the same principle). For each of the k folds:

  • A model is trained using k-1 of the folds as training data
  • The resulting model is validated on the remaining fold (i.e., it is used as a test set to compute a performance measure such as accuracy)

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
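A minimal sketch of that procedure in code (toy data standing in for housing_prepared and housing_labels, not the course notebook):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Toy data standing in for housing_prepared / housing_labels
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

fold_rmse = []
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=42).split(X):
    model = DecisionTreeRegressor(random_state=42)
    model.fit(X[train_idx], y[train_idx])              # train on the other k-1 folds
    preds = model.predict(X[val_idx])                  # validate on the held-out fold
    fold_rmse.append(np.sqrt(mean_squared_error(y[val_idx], preds)))

print(np.mean(fold_rmse), np.std(fold_rmse))           # the reported score is the average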

Hope this explains the process.

Thanks.


I wanted to know if the decision tree could have produced zero error in each iteration of cross validation? Or is that generally unlikely?


Hi,

A zero error can happen if the model is overfitting.

Thanks.



Hello Team,

I'm confused between univariate and multivariate regression. Slide 64 says we will use univariate regression, but in the housing project we are using more than one attribute, even additional attributes like "rooms per household". As per my understanding, we are using multivariate regression, and after looking at the correlation with median house value we find the most promising attribute.

I need clarification; correct me if I'm wrong!


Hi,

The problem given in slide 64 is different from the end-to-end project given later, where we calculate rooms per household. The slide 64 problem has only one feature, house area, for the purpose of explaining what univariate regression is.

Thanks.


Thanks for your response.

So you are saying the end-to-end project is based on multivariate regression. But after finding the most promising attribute, i.e. median income, can we ignore the other features? It seems we focus on median income because it is strongly correlated with the house value, even after generating additional features. What difference do the other features make when preparing the data for the ML algorithm?


Hi,

What you are referring to is called feature selection. However, we do not necessarily select one single feature, not even in the case of the end-to-end project. I can point to at least one more attribute, other than "median income", that can be considered as a feature to train the model: "rooms per household". There are others too.

Thanks.


Hi Team,

I am facing the attached error while comparing the linear regression predictions with the actuals. Please advise.


Hi,

You need to reshape the data.
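For instance (a generic sketch, not the exact notebook code), scikit-learn estimators expect 2-D input, so a single feature or a single sample usually has to be reshaped:

import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])
as_single_feature = values.reshape(-1, 1)   # shape (4, 1): 4 samples, 1 feature
as_single_sample = values.reshape(1, -1)    # shape (1, 4): 1 sample, 4 features
print(as_single_feature.shape, as_single_sample.shape)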

Thanks.



I created a new file (California House Price Estimator) and took the parts which were chosen by Mr. Sandeep to calculate the final RMSE value. But my value is different from the one obtained in the course. I double checked but couldn't find the error. Can you please check and let me know? Here is the link to the file.

https://jupyter.e.cloudxlab.com/user/atulkhera4454/notebooks/CaliforniaHousePredictValue.ipynb

By all standards, the value should not change for the same set of data.

I am pretty sure I might have missed something while creating the new file. Kindly point it out for me so I can learn from it.


How is the feature importance array perfectly aligned with the numerical columns and the categorical column values, so that we are able to combine them? Can you explain in detail? What would be the order of the resulting array if we had two categorical columns?

feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

.............

cat_one_hot_attribs = list(cat_encoder.categories_[0])

attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

 


Hi,

Please go through the below link for more details:

https://stackoverflow.com/questions/41900387/mapping-column-names-to-random-forest-feature-importances
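To sketch the idea behind the alignment (an illustrative toy example, not the course notebook): the one-hot columns appear in the order given by cat_encoder.categories_, one array per encoded column, so with two categorical columns you concatenate both category lists in the same column order:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Tiny example with two categorical columns to show the resulting column order
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "L", "M"]})
enc = OneHotEncoder()
enc.fit(df)

# categories_ holds one array per input column, in the order the columns were passed
print(enc.categories_)

# The one-hot matrix lists all "color" categories first, then all "size" categories,
# so the attribute list is built by concatenating categories_ in that same order
one_hot_attribs = [c for cats in enc.categories_ for c in cats]
print(one_hot_attribs)   # ['blue', 'red', 'L', 'M', 'S']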

Let me know if this answers your query.

Thanks.


Did we not remove the data points forming the horizontal lines (at 500,000) for housing value? Can this improve the RMSE value?


Hi,

Could you please tell me which horizontal lines you are referring to, by mentioning the video timestamp if possible?

Thanks.


Sir, 

Can you please explain more about what feature importance measures, and how?

Thank you.


Hi,

Could you please explain why we use a negative value for scoring mean squared error (scoring="neg_mean_squared_error")?

Thanks

############################################################################

Code Snippet below:

# Cross Validation in Random Forest model

from sklearn.model_selection import cross_val_score


import numpy as np  # needed for np.sqrt

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
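An aside on the neg_mean_squared_error question above: scikit-learn's scoring API follows a "greater is better" convention, so error metrics like MSE are reported negated (hence the neg_ prefix), and the sign is flipped back with -forest_scores before taking the square root. The display_scores helper is defined earlier in the notebook; a sketch along these lines (an assumption, not necessarily the exact notebook code):

def display_scores(scores):
    # Print the per-fold scores plus their mean and spread
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())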


Hi,

In cross validation, the training set is split into smaller training and validation sets. When we perform k-fold cross validation, the test set in one fold becomes the training set in another fold. Doesn't this increase the chances of data leakage, with the model also learning the test set?

Thanks.



Hi,

Excellent question. We often use nested cross-validation to avoid data leakage:

https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
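A condensed sketch of the idea (toy data and a deliberately tiny grid, not the linked example verbatim):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Inner loop: GridSearchCV tunes hyperparameters on each outer training split
inner = GridSearchCV(RandomForestRegressor(random_state=42),
                     param_grid={"n_estimators": [10, 30]},
                     scoring="neg_mean_squared_error", cv=3)

# Outer loop: the tuned model is scored on folds it never saw during tuning
outer_scores = cross_val_score(inner, X, y, scoring="neg_mean_squared_error", cv=5)
print(outer_scores.mean())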

Thanks.

 


Hi Team,

My concept of fine-tuning the model is not clear. Why am I fine-tuning the parameters at the end? Why am I doing it? What are those parameters here?

As per my understanding, we are supposed to fine-tune the parameters on the validation dataset before finalizing the best machine learning algorithm for our problem statement.

Regards, S Dash.



Hi,

We would suggest you go through slides 377 onwards; we are not fine-tuning on the test set but on the validation set.

Thanks.


Hi,

I went through the slides; they are of great help. Still, I have some questions.

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
We are optimizing a few parameters of the random forest model (as we selected it as the best model). How do we select those values of n_estimators and max_features? Why and how these two parameters (we could also take max_depth as a parameter for tuning the model)? What does "# try 12 (3×4) combinations of hyperparameters" mean?

Regards,

S Dash.


Hi,

Out of all the combinations of hyperparameters, the combination that yields the best score is picked.
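To make the counting concrete: each dictionary in param_grid is expanded into the Cartesian product of its value lists, so the first dictionary gives 3 × 4 = 12 combinations and the second gives 2 × 3 = 6, and GridSearchCV cross-validates every one of these 18 combinations before reporting the best. A sketch of the wiring, assuming the housing_prepared and housing_labels variables from the course notebook:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},                 # 3 x 4 = 12
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},  # 2 x 3 = 6
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)

# housing_prepared / housing_labels come from the earlier data preparation steps
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_)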

Thanks.


Is it mandatory to do hyperparameter tuning for all models?


Hi,

We do hyperparameter tuning to check whether we can get better results by changing the parameters.

Thanks.


Please help with the grid concept; it is not very clear to me.


Hi,

By grid concept, are you referring to GridSearchCV?

Thanks.


yes sir


Hi,

We would request you to go through the link below for a detailed explanation of GridSearchCV:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Thanks.


Please share the link to the complete code; the grid concept is not very clear to me. Please help.


Hi,

You will find all the notebooks in our GitHub repository at the link below:

https://github.com/cloudxlab/ml

Thanks.


How can a mean error of 52,564 be good? Why are we considering such a huge value as the mean error?

Kindly help me understand. P.S. I might have missed some points.


Hi,

This is the mean of the RMSE scores for the Random Forest model; it is the typical amount by which the model's predictions deviate from the actual values. This might seem like a high value, but there are two things to consider. First, it is small compared to the overall mean of the predicted variable. Second, this model is better than the rest, but we will not stop there: next, we fine-tune this model to make it better.
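A rough way to sanity-check this (an illustrative sketch, assuming the housing_labels and forest_rmse_scores variables from the notebook):

# Compare the cross-validated RMSE with the scale of the target variable itself
mean_house_value = housing_labels.mean()
relative_error = forest_rmse_scores.mean() / mean_house_value
print(relative_error)   # e.g. an RMSE near 50,000 on values averaging roughly 200,000 is about 25%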

Thanks.


That was really helpful. Thank You Sir. 

Also, why are we considering grid_search and not random_search for the final model?


Hi,

This is because GridSearchCV was giving better results here.

Thanks.


Also in Slide 374 and 375:

We have final_predictions= final_model.predict(X_test_prepared)

and 

final_rmse = final_model.predict(X_test_prepared)

respectively.

Can you please confirm if it is a misprint or something?


Hi,

Why do you think it is a misprint?

Thanks.
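For reference, if slide 375 is indeed meant to compute the RMSE on the test set, the usual pattern would be along these lines (a sketch assuming the final_model, X_test_prepared, and y_test variables from the notebook):

import numpy as np
from sklearn.metrics import mean_squared_error

final_predictions = final_model.predict(X_test_prepared)   # predict on the prepared test set
final_mse = mean_squared_error(y_test, final_predictions)  # compare with the true labels
final_rmse = np.sqrt(final_mse)
print(final_rmse)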


Good, comprehensive and exhaustive coverage of exploratory data analysis via the end-to-end machine learning example. Thanks for updating and revising the contents of the study material for the End-to-End Project!

Suggestion - if the topic of data conversions using Python libraries or packages could be incorporated in this PPT, the contents would be welcome. For example:

  • char to numeric/float
  • numeric/float to char
  • a list to a df (dataframe)
  • a tuple to a df & vice versa
  • Appending the df
  • Merging the df

In real-time scenarios, I have seen many struggle due to untidy datasets. The aforesaid topics would act as a quick reckoner for students and learners.
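For readers looking for such a quick reckoner, a minimal pandas sketch of the conversions listed above (illustrative only; the column names are made up):

import pandas as pd

df = pd.DataFrame({"price": ["100.5", "200.0"], "rooms": [3, 4]})

df["price"] = df["price"].astype(float)        # char/string to numeric/float
df["rooms_str"] = df["rooms"].astype(str)      # numeric/float to char/string

df_from_list = pd.DataFrame([1, 2, 3], columns=["value"])                       # list to DataFrame
df_from_tuples = pd.DataFrame([("a", 1), ("b", 2)], columns=["key", "value"])   # tuples to DataFrame
back_to_tuples = list(df_from_tuples.itertuples(index=False, name=None))        # and back to tuples

appended = pd.concat([df_from_list, pd.DataFrame({"value": [4]})], ignore_index=True)           # appending rows
merged = df_from_tuples.merge(pd.DataFrame({"key": ["a", "b"], "extra": [10, 20]}), on="key")   # merging
print(appended, merged, back_to_tuples, sep="\n")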


Hi,

Thank you for your feedback! We will keep these suggestions in mind and will definitely try to improve our course materials over time.

Thanks.


Hi Rajtilak,

Appreciate you making a note of my comments. Such techniques are indeed needed in real-time projects for the data cleansing process.

Thanks & Regards,
