End-to-End Project - Self-Contained


End-to-end Machine Learning Project Part-4









59 Comments

My notebook stops responding after running the cross-validation step.


Hi,

Fitting or hyperparameter tuning might take some time, as it needs to try all the combinations in the search space. Please wait for 5-10 minutes. If it still doesn't respond, restart the kernel and run all the cells from the beginning.

Thanks.


Hi,

The error is self-explanatory. You have either not defined the n_features variable, or have not executed the cell containing that code. If you have not defined the variable, please check the step hint/answer. If you have not executed the cell, or have restarted your browser, please execute your code from the beginning.

Thanks.


I am getting the below error while running this code:

cat_encoder = CategoricalEncoder(encoding="onehot-dense")
housing_cat_reshaped = housing_cat.values.reshape(-1, 1)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)
housing_cat_1hot

 


Hi 

In this model, the most important attribute related to median house value was median income, which is numerical.

But suppose in some other model the most important attribute is categorical; obviously, we will have to one-hot encode that attribute.

I am having difficulty understanding how we will train and predict in that case.

 


Hi,

Since we can't use the raw categorical values directly, we represent the categorical data in numerical form (for example via one-hot encoding), and then train and test as usual.
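For illustration, a minimal sketch of one-hot encoding a categorical column with scikit-learn (the toy data below stands in for the project's ocean_proximity column):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column standing in for ocean_proximity from the project
housing_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "ISLAND", "INLAND"]})

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat).toarray()  # dense 0/1 matrix
print(cat_encoder.categories_)   # category order = column order of the encoded matrix
print(housing_cat_1hot)

The resulting 0/1 columns are then fed to the model alongside the numerical features.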

Thanks.


Correlation gives a score on a scale of 0 to 1 in magnitude, with the sign signifying a positive or negative relation.

Kindly explain the 'feature importance' values obtained here. How are they better than the correlation values, as stated in the video? The last value, 'ISLAND', is 5.8729...

Thanks.


Hi,

Good question.

Correlation essentially measures the positive/negative 'change' in one feature as you increase/decrease the other, so it mainly captures linear relationships. Feature importance, on the other hand, is more likely to identify which features are most influential for the model's predictions, including non-linear effects, provided that the model performs well.
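As an illustration (a sketch on synthetic data, not the course notebook), the two measures can be computed side by side; here x1 influences y strongly but non-linearly, so its correlation is near zero while its feature importance is high:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: y depends on x1 quadratically (non-linear) and on x2 linearly
rng = np.random.RandomState(42)
X = pd.DataFrame({"x1": rng.uniform(-3, 3, 500), "x2": rng.uniform(-3, 3, 500)})
y = X["x1"] ** 2 + 0.5 * X["x2"] + rng.normal(0, 0.1, 500)

# Pearson correlation only sees the linear part, so x1 scores near zero
print(X.assign(y=y).corr()["y"])

# A fitted forest's feature importances pick up the non-linear influence of x1
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
print(dict(zip(X.columns, forest.feature_importances_)))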

Thanks.


1:44:20,

'Up to longitude & latitude, the accuracy seems to be good.'

How are we reaching this conclusion?

Thanks.


Hi,

If you look at the code, you will see that we have sorted these features in descending order based on their importance. So we can roughly see that, down to "longitude" and "latitude", the features together account for more than 50% of the total importance. That is how we come to the above conclusion.

Thanks.


Q1. Why is 'ISLAND' value not on top?

Q2. I was reading about feature importance. Is it available only in the case of random forests? Are there feature importance scores/values in other algorithms like linear regression and decision trees? Do correct me if I am wrong.

 

Thanks.


Hi,

1. I am not sure why it is not showing at the top even though it is supposed to. Try without the zip() as shown below:

# important_features_dict is assumed to map feature name -> importance score
important_features_list = sorted(important_features_dict,
                                 key=important_features_dict.get,
                                 reverse=True)

print('Most important features: %s' % important_features_list)

2. Feature importance scores can be obtained for other models too: tree-based models such as decision trees and random forests expose them directly via feature_importances_, while for models like linear regression you can look at the coefficients, or use a model-agnostic technique such as permutation importance.

Thanks.


Time stamp- 59:10

Regarding the non-zero cross-validation error for the regression tree, it is said that 'it is not based on overfitting, it is based on real error.' I didn't understand this. Isn't the algorithm the same? Or will this be explained in further modules?

Does cross validation bring in some sort of 'hyperparameter'? I mean, from '0' error to 'non-zero' error in all 10 iterations?

Thanks.


The 'hyperparameter' or the 'constraint' that is being talked about. I am currently midway through the video.


Hi,

Cross Validation is a resampling technique that is used to evaluate machine learning models on a limited data sample.

A test set should still be kept aside for final evaluation, but we no longer need a separate validation set (sometimes called the dev set) while doing cross validation. The training set is split into k smaller sets (there are other approaches too, but they generally follow the same principle). For each of the k folds:

  • A model is trained using k-1 of the folds as training data
  • The resulting model is validated on the remaining fold (i.e., it is used as a test set to compute a performance measure such as accuracy)

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
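A minimal sketch of that procedure in code (toy data standing in for housing_prepared and housing_labels, not the course notebook):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Toy data standing in for housing_prepared / housing_labels
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

fold_rmse = []
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=42).split(X):
    model = DecisionTreeRegressor(random_state=42)
    model.fit(X[train_idx], y[train_idx])              # train on the other k-1 folds
    preds = model.predict(X[val_idx])                  # validate on the held-out fold
    fold_rmse.append(np.sqrt(mean_squared_error(y[val_idx], preds)))

print(np.mean(fold_rmse), np.std(fold_rmse))           # the reported score is the average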

Hope this explains the process.

Thanks.


I wanted to know if the decision tree could have produced zero error in each iteration of cross validation? Or is that generally unlikely?


Hi,

A zero error can happen if the model is overfitting.

Thanks.



Hello Team,

I'm confused between univariate and multivariate regression. Slide 64 says we will use univariate regression, but in the housing project we are using more than one attribute, even additional attributes like "rooms per household". As per my understanding, we are using multivariate regression, and after looking at the correlation with median house value we find the most promising attribute.

I need clarification; correct me if I'm wrong!


Hi,

The problem given in slide 64 is different from the end-to-end project given later, where we calculate rooms per household. The slide 64 problem has only one feature, house area, for the purpose of explaining what univariate regression is.

Thanks.


Thanks for your response.

So you are saying the end-to-end project is based on multivariate regression. But after finding the most promising attribute, i.e. median income, can we ignore the other features? It seems we focus on median income because it is strongly correlated with the house value, even after generating additional features. What difference do the other features make when preparing the data for the ML algorithm?


Hi,

What you are referring to is called feature selection. However, we do not necessarily select one single feature, not even in the case of the end-to-end project. I can point to at least one more attribute, other than "median income", that can be considered as a feature to train the model: "rooms per household". There are others too.

Thanks.


Hi Team,

I am facing the attached error while comparing the linear regression predictions with the actuals. Please advise.


Hi,

You need to reshape the data.
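For instance (a generic sketch, not the exact notebook code), scikit-learn estimators expect 2-D input, so a single feature or a single sample usually has to be reshaped:

import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])
as_single_feature = values.reshape(-1, 1)   # shape (4, 1): 4 samples, 1 feature
as_single_sample = values.reshape(1, -1)    # shape (1, 4): 1 sample, 4 features
print(as_single_feature.shape, as_single_sample.shape)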

Thanks.



I created a new file (California House Price Estimator) and took the parts which were chosen by Mr. Sandeep to calculate the final RMSE value. But my value is different from the one obtained in the course. I double checked but couldn't find the error. Can you please check and let me know? Here is the link to the file.

https://jupyter.e.cloudxlab.com/user/atulkhera4454/notebooks/CaliforniaHousePredictValue.ipynb

By all standards, the value should not change for the same set of data.

I am pretty sure I might have missed something while creating the new file. Kindly point it out for me so I can learn from it.


How is the feature importance array perfectly aligned with the numerical columns and the categorical column values, so that we are able to combine them? Can you explain in detail? What would be the order of the resulting array if we had two categorical columns?

feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

.............

cat_one_hot_attribs = list(cat_encoder.categories_[0])

attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

 


Hi,

Please go through the below link for more details:

https://stackoverflow.com/questions/41900387/mapping-column-names-to-random-forest-feature-importances
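To sketch the idea behind the alignment (an illustrative toy example, not the course notebook): the one-hot columns appear in the order given by cat_encoder.categories_, one array per encoded column, so with two categorical columns you concatenate both category lists in the same column order:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Tiny example with two categorical columns to show the resulting column order
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "L", "M"]})
enc = OneHotEncoder()
enc.fit(df)

# categories_ holds one array per input column, in the order the columns were passed
print(enc.categories_)

# The one-hot matrix lists all "color" categories first, then all "size" categories,
# so the attribute list is built by concatenating categories_ in that same order
one_hot_attribs = [c for cats in enc.categories_ for c in cats]
print(one_hot_attribs)   # ['blue', 'red', 'L', 'M', 'S']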

Let me know if this answers your query.

Thanks.


Did we not remove the data points forming the horizontal lines (at 500,000) for housing value? Can this improve the RMSE value?


Hi,

Could you please tell me which horizontal lines you are referring to, by mentioning the video timestamp if possible?

Thanks.


Sir, 

Can you please explain more about what feature importance measures, and how?

Thank you.


Hi,

Could you please explain why we use a negative value for scoring mean squared error (scoring="neg_mean_squared_error")?

Thanks

############################################################################

Code Snippet below:

# Cross Validation in Random Forest model

from sklearn.model_selection import cross_val_score


import numpy as np  # needed for np.sqrt

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
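An aside on the neg_mean_squared_error question above: scikit-learn's scoring API follows a "greater is better" convention, so error metrics like MSE are reported negated (hence the neg_ prefix), and the sign is flipped back with -forest_scores before taking the square root. The display_scores helper is defined earlier in the notebook; a sketch along these lines (an assumption, not necessarily the exact notebook code):

def display_scores(scores):
    # Print the per-fold scores plus their mean and spread
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())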


Hi,

In cross validation, the training set is split into smaller training and validation sets. When we perform k-fold cross validation, the test set in one fold becomes the training set in another fold. Doesn't this increase the chances of data leakage, with the model also learning the test set?

Thanks.



Hi,

Excellent question. We often use nested cross-validation to avoid data leakage:

https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
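A condensed sketch of the idea (toy data and a deliberately tiny grid, not the linked example verbatim):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Inner loop: GridSearchCV tunes hyperparameters on each outer training split
inner = GridSearchCV(RandomForestRegressor(random_state=42),
                     param_grid={"n_estimators": [10, 30]},
                     scoring="neg_mean_squared_error", cv=3)

# Outer loop: the tuned model is scored on folds it never saw during tuning
outer_scores = cross_val_score(inner, X, y, scoring="neg_mean_squared_error", cv=5)
print(outer_scores.mean())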

Thanks.

 


Hi Team,

My concept of fine-tuning the model is not clear. Why am I fine-tuning the parameters at the end? Why am I doing it? What are those parameters here?

As per my understanding, we are supposed to fine-tune the parameters on the validation dataset before finalizing the best machine learning algorithm for our problem statement.

Regards, S Dash.



Hi,

We would suggest you go through slides 377 onwards; we are not fine-tuning on the test set but on the validation set.

Thanks.


Hi,

I went through the slides; they are of great help. Still, I have some questions.

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
We are optimizing a few parameters of the random forest model (as we selected it as the best model). How do we select those values of n_estimators and max_features? Why and how these two parameters (we could also take max_depth as a parameter for tuning the model)? What does "# try 12 (3×4) combinations of hyperparameters" mean?

Regards,

S Dash.


Hi,

Out of all the combinations of hyperparameters, the combination that yields the best score is picked.
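To make the counting concrete: each dictionary in param_grid is expanded into the Cartesian product of its value lists, so the first dictionary gives 3 × 4 = 12 combinations and the second gives 2 × 3 = 6, and GridSearchCV cross-validates every one of these 18 combinations before reporting the best. A sketch of the wiring, assuming the housing_prepared and housing_labels variables from the course notebook:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},                 # 3 x 4 = 12
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},  # 2 x 3 = 6
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)

# housing_prepared / housing_labels come from the earlier data preparation steps
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_)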

Thanks.


Is it mandatory to do hyperparameter tuning for all models?


Hi,

We do hyperparameter tuning to check whether we can get better results by changing the parameters.

Thanks.


Please help with the grid concept; it is not very clear to me.


Hi,

By grid concept, are you referring to GridSearchCV?

Thanks.


yes sir


Hi,

We would request you to go through the link below for a detailed explanation of GridSearchCV:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Thanks.


Please share the link to the complete code; the grid concept is not very clear to me. Please help.


Hi,

You will find all the notebooks in our GitHub repository at the link below:

https://github.com/cloudxlab/ml

Thanks.


How can a mean error of 52,564 be good? Why are we considering such a huge value as the mean error?

Kindly help me understand. P.S. I might have missed some points.


Hi,

This is the mean of the RMSE scores for the Random Forest model; it is the typical amount by which the model's predictions deviate from the actual values. This might seem like a high value, but there are two things to consider. First, it is small compared to the overall mean of the predicted variable. Second, this model is better than the rest, but we will not stop there: next, we fine-tune this model to make it better.
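A rough way to sanity-check this (an illustrative sketch, assuming the housing_labels and forest_rmse_scores variables from the notebook):

# Compare the cross-validated RMSE with the scale of the target variable itself
mean_house_value = housing_labels.mean()
relative_error = forest_rmse_scores.mean() / mean_house_value
print(relative_error)   # e.g. an RMSE near 50,000 on values averaging roughly 200,000 is about 25%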

Thanks.


That was really helpful. Thank You Sir. 

Also, why are we considering grid_search and not random_search for the final model?


Hi,

This is because GridSearchCV was giving better results here.

Thanks.


Also in Slide 374 and 375:

We have final_predictions= final_model.predict(X_test_prepared)

and 

final_rmse = final_model.predict(X_test_prepared)

respectively.

Can you please confirm if it is a misprint or something?


Hi,

Why do you think it is a misprint?

Thanks.
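For reference, if slide 375 is indeed meant to compute the RMSE on the test set, the usual pattern would be along these lines (a sketch assuming the final_model, X_test_prepared, and y_test variables from the notebook):

import numpy as np
from sklearn.metrics import mean_squared_error

final_predictions = final_model.predict(X_test_prepared)   # predict on the prepared test set
final_mse = mean_squared_error(y_test, final_predictions)  # compare with the true labels
final_rmse = np.sqrt(final_mse)
print(final_rmse)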


Good, comprehensive and exhaustive coverage of exploratory data analysis via the end-to-end machine learning example. Thanks for updating and revising the contents of the study material for the End-to-End Project!

Suggestion - if the topic of data conversions using Python libraries or packages could be incorporated in this PPT, the contents would be welcome. For example:

  • char to numeric/float
  • numeric/float to char
  • a list to a df (dataframe)
  • a tuple to a df & vice versa
  • Appending the df
  • Merging the df

In real-time scenarios, I have seen many struggle due to untidy datasets. The aforesaid topics would act as a quick reckoner for students and learners.
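For readers looking for such a quick reckoner, a minimal pandas sketch of the conversions listed above (illustrative only; the column names are made up):

import pandas as pd

df = pd.DataFrame({"price": ["100.5", "200.0"], "rooms": [3, 4]})

df["price"] = df["price"].astype(float)        # char/string to numeric/float
df["rooms_str"] = df["rooms"].astype(str)      # numeric/float to char/string

df_from_list = pd.DataFrame([1, 2, 3], columns=["value"])                       # list to DataFrame
df_from_tuples = pd.DataFrame([("a", 1), ("b", 2)], columns=["key", "value"])   # tuples to DataFrame
back_to_tuples = list(df_from_tuples.itertuples(index=False, name=None))        # and back to tuples

appended = pd.concat([df_from_list, pd.DataFrame({"value": [4]})], ignore_index=True)           # appending rows
merged = df_from_tuples.merge(pd.DataFrame({"key": ["a", "b"], "extra": [10, 20]}), on="key")   # merging
print(appended, merged, back_to_tuples, sep="\n")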


Hi,

Thank you for your feedback! We will keep these suggestions in mind and will definitely try to improve our course materials over time.

Thanks.


Hi Rajtilak,

Appreciate you making a note of my comments. Such techniques are indeed needed in real-time projects for the data cleansing process.

Thanks & Regards,
