Training Models

Machine Learning Training Models Part-2



48 Comments

Long video sessions seem boring. It would be better to have interactive exercises with examples. Please take this as constructive criticism.


Hi Arun,

We appreciate your feedback and couldn't agree more with the importance of interactive learning. We share your belief that it's much more engaging and enjoyable to learn through hands-on exercises and practical examples. That's why we have carefully curated a mix of content and interactive exercises, placing them strategically to enhance your learning experience. Thank you for your constructive criticism, as it helps us continuously improve and provide the best learning environment for our users.


Hi,

@slide 170

The formula says there will be 10 new features, but only 9 features are mentioned. What is the 10th feature?


Hi,

Here are the 10 features:

1, a, b, a^2, a^3, b^2, b^3, ab, a^2b, ab^2

You can go through the below link for more explanation:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
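You can also verify this with a quick sketch (assuming two input features a and b, and a recent scikit-learn version):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features, a and b
X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=3, include_bias=True)
poly.fit_transform(X)

# In older scikit-learn versions this method is called get_feature_names
print(poly.get_feature_names_out(["a", "b"]))
# ['1' 'a' 'b' 'a^2' 'a b' 'b^2' 'a^3' 'a^2 b' 'a b^2' 'b^3'] -> 10 features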

Thanks.


Sir, I have two queries:

1. For SGD or mini-batch gradient descent, how can we be sure that each iteration is minimizing the cost function, given that these methods do not use all of the observations?

2. Can you suggest any books for further reading on all the gradient descent methods?

Hi,

1. The best way is to observe the result after each iteration, because there is no guarantee that SGD will minimize the cost function. For example, if the learning rate is too high, it will not converge.
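As a minimal illustration of such monitoring (toy data; none of these names come from the lecture notebook):

import numpy as np

# Toy data: y = 4 + 3x + noise
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]  # add the bias feature x0 = 1

theta = np.random.randn(2, 1)
eta = 0.1
for epoch in range(5):
    for i in range(m):
        idx = np.random.randint(m)  # one random instance per step
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        theta = theta - eta * gradients
    # Observe the full-data cost after each epoch to check it is decreasing
    mse = np.mean((X_b.dot(theta) - y) ** 2)
    print(f"epoch {epoch}: MSE = {mse:.4f}")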

2. I am not aware of any book that focuses on all gradient descent methods; however, I personally prefer the following book for a detailed study of optimization methods:

Algorithms for Optimization by Mykel J. Kochenderfer, Tim A. Wheeler

Let me know if you find it useful.

Thanks.


Thank you, sir, for your valuable information. I have downloaded the book. I'll let you know if I find anything insightful about the algorithms.


Can we build polynomial models using the OLS method?


Hi,

Yes, you can.
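For instance, a rough sketch (the data-generating equation mirrors the one from the lecture; the rest is illustrative). After expanding the features with PolynomialFeatures, an ordinary least squares fit recovers the polynomial coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Quadratic data: y = 0.5*x^2 + x + 2 + noise
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

# Expand the features, then solve the ordinary least squares problem
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
lin_reg = LinearRegression()  # LinearRegression performs an OLS fit
lin_reg.fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)  # roughly [2.] and [[1., 0.5]]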

Thanks.


Can you please provide a link to the Jupyter notebook that was used for SGD?


Hi,

Please find below the link to our GitHub repository:

https://github.com/cloudxlab/ml

Within the Machine Learning folder, you will find the Jupyter notebook for Training Models.

Thanks.


Hi.

In slide 108, in the diagram where the learning rate is 0.1, why and how does the algorithm shorten the length of its jumps as it approaches the optimal solution of minimal RMSE?


Hi,

The learning rate itself does not change; we do not need to change it in a real-world scenario (we only varied it manually across the diagrams to show cause and effect). The jumps get shorter because each step equals the learning rate times the gradient, and the gradient shrinks as the algorithm approaches the minimum.
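A tiny illustration of this effect with a fixed learning rate (a hypothetical 1-D cost function):

# Minimize f(theta) = theta^2, whose gradient is 2*theta
theta, eta = 4.0, 0.1
for i in range(5):
    grad = 2 * theta
    step = eta * grad  # eta stays fixed, yet the step shrinks with the gradient
    theta -= step
    print(f"iter {i}: step = {step:.4f}, theta = {theta:.4f}")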

Thanks.


Which book is preferable for this course?


Hi,

Here is a list of ML/DL books that you can choose from:

https://cloudxlab.com/blog/gigantic-list-of-machine-learning-books/

Thanks.


When should we use RMSE, and when MSE?

What is the code for RMSE? Is there any library for it?


Hi,

The smaller the Mean Squared Error, the closer the fit is to the data. However, the MSE has the square of the units of whatever is plotted on the vertical axis, whereas the RMSE is directly interpretable in terms of the measurement units, and so is a better measure of goodness of fit than a correlation coefficient.
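As for the code, a minimal sketch (assuming scikit-learn and made-up arrays):

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # hypothetical targets
y_pred = np.array([2.5, 0.0, 2.0, 8.0])   # hypothetical predictions

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is simply the square root of MSE
print(mse, rmse)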

Would request you to go through the tutorial for details.

Thanks.



Hi,

As mentioned in the slides, m is the number of instances in the training dataset.

Thanks.


Dear sir, I cannot find the notebook file "training_linear_models.ipynb" in the path shown. How can I get it? Please help.


Hi Tathagata,

You can find the notebooks here: https://github.com/cloudxlab/ml


Hello sirs,

I had one question: does the normal equation take the bias into account? If not, how would one optimize it without gradient descent?

Thank you for your time


Hi,

Would request you to go through slide# 56 onwards for an explanation of the same.
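In short, the bias term is taken into account by adding a constant feature x0 = 1 to every instance before applying the normal equation; a minimal sketch with toy data:

import numpy as np

# Toy data: y = 4 + 3x + noise, so the true bias (intercept) is 4
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)

# Prepend a column of ones so that theta[0] plays the role of the bias
X_b = np.c_[np.ones((m, 1)), X]
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print(theta_best)  # roughly [[4.], [3.]]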

Thanks.


Different problems/applications will have different learning rates. How do we determine the closest-to-perfect learning rate for different problems? Is it based on trial and error?


Got it.


Hi Team,

In the below sample code:

from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly_features.fit_transform(X)
X       # original features
X_poly  # original plus new features

Doubts:

  • We have only one variable 'x', so it has 1 feature, right?
  • The equation used was y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1), so the degree is 2, right?
  • poly_features = PolynomialFeatures(degree=3, include_bias=False) -- here we are specifying a degree greater than 2, and X_poly changes its dimensions accordingly. For instance, if the degree is 3, then X_poly is 100x3. It confuses me that the equation is of degree 2 but the feature expansion is done with a higher degree. Please explain this.
  • include_bias=False -- does this mean we are ignoring the distance from the centre?
  • The formula for calculating the number of features in slide 167 is not satisfied here. Ideally there should be 4 features, but X_poly shows 3 features. Is it because include_bias=False?

Thanks

Birendra Singh


Hi,

1. Yes, this has one feature. But that need not be the case; it can contain multiple features.

2. Yes, it's a 2nd degree polynomial.

3. We are changing the degree of the polynomial features here, independently of the degree of the data-generating equation. Please go through the lecture video for a detailed explanation.

4. We are ignoring the bias term here.

5. Yes, that's right.
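A quick way to verify point 5 (a minimal sketch):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(100, 1)  # one feature, so n = 1

# With the bias column: (n + d)! / (d! n!) = 4! / (3! * 1!) = 4 features
print(PolynomialFeatures(degree=3, include_bias=True).fit_transform(X).shape)   # (100, 4)

# Without it, the constant column of ones is dropped, leaving x, x^2, x^3
print(PolynomialFeatures(degree=3, include_bias=False).fit_transform(X).shape)  # (100, 3)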

Thanks.


Hi Team,

While plotting gradient descent for various learning rates, we used the theta path for only one learning rate, i.e. 0.1. What is the purpose of the theta path, and why is it not used in the first plot, where the learning rate is 0.02?

plt.subplot(131); plot_gradient_descent(theta, eta=0.02)                             # no theta path
plt.subplot(132); plot_gradient_descent(theta, eta=0.1, theta_path=theta_path_bgd)   # theta path
plt.subplot(133); plot_gradient_descent(theta, eta=0.5)                              # no theta path

And please explain the plot logic as well:

if iteration < 10:
    y_predict = X_new_b.dot(theta)
    style = "b-" if iteration > 0 else "r--"
    plt.plot(X_new, y_predict, style)

Thanks,

Birendra Singh

Hi Birendra,

Answer 1:

In the code, the list theta_path_bgd is appended with every theta that is calculated. Inside the function this happens in the following step:

        if theta_path is not None:
            theta_path.append(theta)

For the plot itself, passing theta_path=theta_path_bgd has no significance: plt.subplot(132); plot_gradient_descent(theta, eta=0.1) would draw the same graph, because the plot is computed directly from theta and eta=0.1. The theta_path argument is just used to store/collect all the thetas for later use.

Answer 2:

if iteration < 10:                            # only the first 10 iterations are plotted
    y_predict = X_new_b.dot(theta)
    style = "b-" if iteration > 0 else "r--"  # red dashed line for iteration 0 (the starting model), blue for iterations 1-9
    plt.plot(X_new, y_predict, style)

All the best!

Thanks Satyajit



Sir,

On what basis are we selecting the values of t0 and t1?


Hi,

Could you please point out which part of the video or the slides your query refers to?

Thanks.


Hi Team,

The file training_linear_models.ipynb is not opening. Is there any problem with the file?


Hi,

How are you trying to open the file, from the GitHub page, or from your lab?

Thanks.


Hi Team,

In the video at 2:01:18, the code at line 21 calculates the gradient.

I think the equation should be divided by the minibatch_size variable, in line with what is done for Batch Gradient Descent, where the value is averaged by dividing the result by m (i.e., the number of instances).

Batch Gradient Descent:
gradients = 2/m * X.T.dot(X.dot(theta) - y)

Mini-Batch Gradient Descent:
gradients = 2/minibatch_size * X.T.dot(X.dot(theta) - y)
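For reference, both snippets implement the gradient of the MSE cost averaged over the instances used in that step:

$$\nabla_\theta \, \mathrm{MSE}(\theta) = \frac{2}{m} \, \mathbf{X}^\mathsf{T} (\mathbf{X}\theta - \mathbf{y})$$

where m is the number of instances in the (mini-)batch, so dividing by minibatch_size is the consistent choice.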


Hi,

You are right! Thank you for pointing this out; we will change the code in our GitHub repository shortly.

Thanks.

-- Rajtilak Bhattacharjee


Thanks for confirming with a quick turnaround.

It would be my pleasure to help.


Hi,

I am getting an error in the code. Please help.


Hi,

Can you try max_iter instead of n_iter and run it once again?
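For instance, a rough sketch with toy data (in recent scikit-learn releases the n_iter parameter of SGDRegressor was renamed to max_iter):

import numpy as np
from sklearn.linear_model import SGDRegressor

# Toy data: y = 4 + 3x + noise
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# n_iter no longer exists in newer scikit-learn; use max_iter instead
sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
print(sgd_reg.intercept_, sgd_reg.coef_)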

Thanks.

-- Rajtilak Bhattacharjee


Yes, it's working now. Thanks!


How do we simply define:
- overfitting
- underfitting


Hi,

You can define them as follows:

*Overfitting*: good performance on the training data, poor generalization to other data.

*Underfitting*: poor performance on the training data and poor generalization to other data.
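A rough way to see both in code (an illustrative sketch; the degrees and data are arbitrary). The degree-1 model typically does poorly on both sets (underfitting), while the degree-30 model typically does well on the training set but poorly on the validation set (overfitting):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Quadratic ground truth with noise
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X.ravel()**2 + X.ravel() + 2 + np.random.randn(100)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

for degree in (1, 2, 30):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree {degree}: train MSE = {train_mse:.2f}, val MSE = {val_mse:.2f}")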

Thanks.

-- Rajtilak Bhattacharjee


Hi,

Could anyone please let me know the purpose of multiplying by X_b.T in the gradient calculation?


And I couldn't understand why it is (n+d)!/(d!n!). What is the reasoning behind this?
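For reference, this count is the number of monomials of degree at most d that can be formed from n features (a standard stars-and-bars combinatorial argument):

$$\frac{(n+d)!}{d! \, n!} = \binom{n+d}{d}, \qquad \text{e.g. } n = 2,\ d = 3: \binom{5}{3} = 10$$

which matches the 10 features (1, a, b, a^2, ab, b^2, a^3, a^2b, ab^2, b^3) discussed earlier in this thread.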


Did you say something wrong? Because you said at 2:07:36 that Batch GD doesn't have to load all the data into memory, but the diagram you showed said that it can't operate out-of-core.
