When we want our machine learning model to accomplish a certain task (say, classification of images), the model should have knowledge of the different classes involved. For example, for dog-vs-cat classification, the model should know the attributes of cats and dogs and which attributes distinguish one from the other. So we try to impart this knowledge with the help of a varied set of images of both cats and dogs. This set of images which we use to impart knowledge into the model is called the training data, since we are using these images to train the model and enable it to distinguish between cat and dog images.
Further, we also assess the trained model: how well has it learnt about cats and dogs, and how well can it use this knowledge to classify an image of a cat or dog which it has never seen before? This is just like tutoring a child on how to solve a problem (for example, teaching them how to solve a mathematical problem) and then conducting an exam to test their knowledge. This is called the testing phase.
We train the model to impart knowledge, and test it to know how well it would perform on unseen data, so that we understand whether it needs more training, whether it needs a greater variety of data, whether it is memorising or actually learning, and so on.
Some days I got irritated, specifically the theory part but on some days I loved your videos, specifically the hands-on part. Now I am in love hahahahha....you are just brilliant
Why have we separated the numerical and categorical data for applying the imputer first and then the one-hot encoder?
Can't we specify the columns to which we want the imputer to apply and similarly the encoder?
I have a question. Can you please explain why we are using these modules "BaseEstimator, TransformerMixin" as inputs to the custom classes that we are creating?
When we are creating our custom classes, we generally add BaseEstimator and TransformerMixin as base classes to get the advantage of their methods. The former one gives us get_params() and set_params() methods and the latter gives us fit_transform() method.
1. This is the initializer (__init__) of the class CombinedAttributesAdder. I would suggest you go back to the Python tutorial for more information on classes in Python.
2. add_bedrooms_per_room is a parameter; if it is set to True, we also calculate bedrooms_per_room and include it in the returned array.
3. Here y is set to None. So y does not have any value; we are fitting the transformer only to X and not to y.
Hope this helps explain your query. Let me know if you need help with any other topic.
It is somewhat clear now, but what was the purpose of having the condition add_bedrooms_per_room=True when it is a necessary quantity, as in there is enough correlation?
CombinedAttributesAdder() is a custom transformer class that adds the combined (derived) attributes to the data.
When we are creating our custom classes (i.e. transformers and estimators), we can add BaseEstimator and TransformerMixin as base classes. The former gives us the get_params() and set_params() methods, and the latter gives us the fit_transform() method for free.
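For illustration, here is a minimal sketch of such a custom transformer; the class name, column indices and data below are made up for this example and are not the course code:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    # Appends the ratio of two columns as a new feature (illustrative only)
    def __init__(self, numerator_ix=0, denominator_ix=1):  # get_params()/set_params() come from BaseEstimator
        self.numerator_ix = numerator_ix
        self.denominator_ix = denominator_ix
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X, y=None):
        ratio = X[:, self.numerator_ix] / X[:, self.denominator_ix]
        return np.c_[X, ratio]

X = np.array([[10.0, 2.0], [9.0, 3.0]])
print(RatioAdder().fit_transform(X))  # fit_transform() comes for free from TransformerMixin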
s = housing.population/100 means that the size of the points plotted in the scatter plot should vary with the corresponding population. We divide by 100 to reduce the size of the plotted points, since population is a large numeric value. Try it with and without dividing by 100.
where() works as follows: wherever the condition (the first argument) is False, i.e. wherever income_cat < 5 is False, the value is replaced with the second argument, 5.0 in this case.
inplace=True applies the changes directly to the DataFrame instead of returning a modified copy.
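A small sketch of how this plays out (the income_cat construction follows the approach discussed in this project; the data values below are made up):

import numpy as np
import pandas as pd

housing = pd.DataFrame({"median_income": [1.2, 3.8, 7.5, 12.0]})
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)

# Keep values where the condition is True; replace the rest with 5.0.
# inplace=True modifies the Series in place instead of returning a copy.
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
print(housing)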
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()
After dropping "median_house_value" from the dataframe, how can we copy it ?
Here, by using drop, we are storing all the columns - except median_house_value - as a new dataframe housing. strat_train_set is not modified; it still has all the columns, including median_house_value. Then, in the next line, we are storing the strat_train_set["median_house_value"] values in a new variable housing_labels.
So basically, we are not disturbing the original dataframe strat_train_set; we are just storing some columns in one variable and median_house_value in another variable.
Note that we would modify the original dataframe only if we set its parameter inplace=True. Since this is False by default, no changes happen to that dataframe. Hope this helps.
Could you please tell me which part of the lecture video you are referring to? If this is something related to creating charts, you can use Matplotlib for the same. If you want to know about Matplotlib, you can try out our free Intro to Matplotlib project.
Please ignore the 2nd image. I was trying to point out the statement "Test set generated using stratified sampling has income category proportion is quite skewed". Is it correct ?
Shouldn't it have been "Test set generated using Random sampling has income category proportion is quite skewed" ?
Got it. I got confused by the underscore. Here set_ takes each dataset, strat_train_set and strat_test_set, one at a time, and then removes the column income_cat from it using the drop() function.
The data was capped while it was being recorded. This is a property of this dataset. Yes, there can definitely be a situation where the price will not exceed a certain value; however, if we notice an abnormal number of instances at the maximum value, it almost surely means that the value was capped.
Can you please explain why we used the code for "bedrooms_per_room" when we just wanted to add only 2 new columns, "rooms_per_household" and "population_per_household"?
I appreciate your patience. This code helps in adding 3 features: rooms per household, population per household and bedrooms per room. However, the last one is added only if you ask for it when calling this transformer. This is a small transformer class that adds the combined attributes. The add_bedrooms_per_room hyperparameter will allow you to easily find out whether adding this attribute helps the Machine Learning algorithms or not.
Thanks a lot. Just one query pending for today. I guess have bothered you too much for the day :)
#############################################
housing.info() data shows that there are 206 missing values in 'total_bedrooms'. But when I execute this 'housing.isnull().sum()' , I get 158 missing values from column "total_bedrooms"? Please explain this discrepancy.
In the screenshot below, can you please explain why we used the code for "bedrooms_per_room" when we just wanted to add only 2 new columns, "rooms_per_household" and "population_per_household"?
I have a question. During the initial exploration of the 'housing' dataset we see that there were missing entries in the column "total_bedrooms 20433 non-null float64". However, when we execute
isn = housing.isnull()
isn.any(axis=1)
The column entries show 'False'. What about the missing entries from "total_bedrooms"? Why don't they show here?
I have gone through the details from the link. Still I have 2 main questions:
1) Why does this code (isn = housing.isnull() followed by isn.any(axis=1)) not return 'True' for missing values from the column 'total_bedrooms'?
2) housing.info() data shows that there are 206 missing values in 'total_bedrooms'. But when I execute this 'housing.isnull().sum()' , I get 158 missing values from column "total_bedrooms"? Please explain this discrepancy.
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 158
population 0
households 0
median_income 0
ocean_proximity 0
dtype: int64
With axis=0, we did get the column "total_bedrooms" as False. But doesn't axis=0 represent rows and axis=1 represent columns? So why did we use axis=1?
Also, the point 2 also needs clarity from previous email. i.e. housing.info() data shows that there are 206 missing values in 'total_bedrooms'. But when I execute this 'housing.isnull().sum()' , I get 158 missing values from column "total_bedrooms"? Please explain this discrepancy.
I have gone through it and am unable to understand, and this is the reason I am asking several times. Hope you and/or Sandeep can help me understand. It is difficult, especially when the Q&A is not live and the sessions are recorded.
Also my second part of query is still awaiting clarification .i.e.
housing.info() data shows that there are 206 missing values in 'total_bedrooms'. But when I execute this 'housing.isnull().sum()' , I get 158 missing values from column "total_bedrooms"? Please explain this discrepancy.
Let me explain. The axis argument specifies the axis along which the operation is computed. So axis=0 means along the rows, which basically means it considers all the rows of a given column. This is the reason that, even though you found missing values initially, when you used axis=1 it simply looked for missing values across each row, not down each column. I know it can be a bit confusing. In brief, axis=0 is said to be "column-wise" (and axis=1 "row-wise").
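A tiny example of the difference (toy DataFrame, not the housing data):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# axis=0 (the default for sum): aggregate down the rows, one result per column
print(df.isnull().sum())        # a -> 1, b -> 0

# axis=1: aggregate across the columns, one result per row
print(df.isnull().any(axis=1))  # row 1 -> True, rows 0 and 2 -> False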
As for your second query, I am still awaiting the screenshot from your end.
housing.info() data shows that there are 206 missing values in 'total_bedrooms'. But when I execute this 'housing.isnull().sum()' , I get 158 missing values from column "total_bedrooms"? Please explain this difference too. Thanks!!
Could you please tell me how you are trying to attach the screenshots? You need to click on the image button, click on the Upload tab, select the image by clicking on the Choose File button, click on Send it to server and that should do the trick.
For some strange reason there has been a constant issue with uploading screenshots. Though before posting my comments I can see the preview of the screenshot. But once I have submitted the comments, the screenshots do not appear. Trying once again -
The path is relative to the location of the Jupyter notebook. Also note that the path you provided in your comment is an URL and not an actual path. It will throw an error if you use that.
I am unable to post screenshot. I take the screenshot and it gets uploaded too. But once I post the comments it disappears. Could you please let me know why?
Anyways I am sending you the location of the file ' california.png':
How can we implement splitting with identifier along with StratifiedShuffleSplit??
Say, we have datasets which contains data from 200 scanners, the data is being updated regularly and in the train and test sets, we want the proportion of the scanners to be similar. How can we do that?
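One possible way to do this, sketched below with a made-up scanner_id column (an assumption for illustration, not course code), is to stratify the split on that column:

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: 1000 rows spread unevenly across 4 scanners (a stand-in for 200)
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "scanner_id": rng.choice(["s1", "s2", "s3", "s4"], size=1000, p=[0.5, 0.3, 0.15, 0.05]),
    "reading": rng.normal(size=1000),
})

# Stratifying on scanner_id keeps the scanner proportions similar in train and test
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(data, data["scanner_id"]):
    train_set, test_set = data.iloc[train_idx], data.iloc[test_idx]

print(train_set["scanner_id"].value_counts(normalize=True))
print(test_set["scanner_id"].value_counts(normalize=True))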
Yeah so basically, after training our model with trainset, we would test it in order to gain confidence about its performance on unseen data which is the test data. So splitting it initially makes sure that our data remains unseen.
There is no one-rule-fits-all concept in Machine Learning. So even though here we are using them only for correlation, another project might have some other use for them. For example, you may want to find out the most crowded area, or the area of the highest number of tax payers, in these cases you will have to use latitude and longitude data.
What could we infer from a low median income and a high median house value, like an income of 2 on the x-axis and a house value of 500000 on the y-axis? Similarly, for example, at a median income of 4 we get house values ranging from 100k to 500k. Does this depend on proximity to the sea or on being a popular area?
X[:, rooms_ix] selects the column at index rooms_ix for every row (NumPy 2-D indexing), whereas X[rooms_ix] would select the single row at index rooms_ix. You could check the difference between the two by printing what they yield. Hope this helps.
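A quick way to see the difference on a toy array:

import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
rooms_ix = 1

print(X[:, rooms_ix])  # column 1 of every row -> [2 5 8]
print(X[rooms_ix])     # row 1 only            -> [4 5 6]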
Could you please elaborate a little more on your issue? Which file are you trying to access? Please share a screenshot of your code and the error that you are getting.
I would request you to go through the lecture video, the slides, and the Jupyter notebook from our GitHub repository. The concepts have been explained in detail there.
I got this solved. Earlier, when I opened the lab, the End to End project was under "Cloudxlab_jupyter_notebooks", and from there I was not able to reach the ml folders.
Later I figured out that the end to end project is outside, under the ml/ directory.
It seems 'from sklearn.preprocessing import Imputer' has been deprecated. This gives an error 'cannot import name "Imputer" '
The following was successful - from sklearn.impute import SimpleImputer. Can you please confirm if SimpleImputer needs to be used now, instead of Imputer?
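As confirmed later in this thread, SimpleImputer is indeed the replacement for the removed Imputer. A minimal sketch of its use (toy data; the column names are only illustrative):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer

housing_num = pd.DataFrame({"total_rooms": [880.0, 7099.0, 1467.0],
                            "total_bedrooms": [129.0, np.nan, 190.0]})

imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(housing_num)  # NumPy array with the NaN filled in
print(imputer.statistics_)              # the learned medians
print(X)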
How will we know when to do feature scaling, and how do we verify it in the dataset, e.g. total number of rooms ranging from 0 to some large value? Is there any particular way to find such huge-range columns in the dataset?
Also, how is it going to help if we do scaling? How is it related to the other columns?
And if we do standardization instead of min-max, the values are not bounded by 0 and 1, so the features are again scaled widely, right? Mainly, how will feature scaling help, as the other columns are not scaled accordingly?
Feature Scaling or Standardization is a data pre-processing step which is applied to the independent variables or features of the data. It basically helps to normalise the data within a particular range. Sometimes, it also helps in speeding up the calculations in an algorithm.
The Min-Max scaler is an estimator that scales and translates each feature individually so that it lies in a given range on the training set, e.g. between zero and one.
For a detailed discussion, I would suggest you go through the lecture videos once again.
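For illustration, a small sketch contrasting the two scalers on toy values (not the housing data):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # one feature with a large outlier

print(MinMaxScaler().fit_transform(X).ravel())    # squeezed into [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance, not bounded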
Hi sir, is there any method through which we can know when to apply Normalization and when to use Standardization? What is the general rule to follow ?
So this is basically like creating a histogram. There are a few ways using which you can calculate the bins of a histogram:
1. Count the number of data points.
2. Calculate the number of bins by taking the square root of the number of data points and rounding up.
3. Calculate the bin width by dividing the specification tolerance or range (USL-LSL or Max-Min value) by the number of bins.
I just checked the end_to_end_project.ipynb file from my end and it is running fine. Could you please tell me which file you are trying to run, and what kernel you are using?
# Just run this cell, or copy it to your code, do not try to understand it (yet).
# Definition of the CategoricalEncoder class, copied from PR #9151.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse
class CategoricalEncoder(BaseEstimator, TransformerMixin):
----------------
----------------
else: return outf not np.all(valid_mask):
if self.handle_unknown == 'error':
diff = np.unique(X[~valid_mask, i])
msg = ("Found unknown categories {0} in column {1}"
" during transform".format(diff, i))
raise ValueError(msg)
else:
1.
I am trying to run the CategoricalEncoder code from the cell in the end_to_end_project.ipynb
2.
The code is giving error in the file itself (file name- end_to_end_project.ipynb)
In end_to_end_project_bootcamp.ipynb file the end code is as follows:
if self.encoding == 'onehot-dense':
return out.toarray()
else:
return out
2.
But in the end_to_end_project.ipynb file the end code is as follows:
if self.encoding == 'onehot-dense':
return out.toarray()
else:
return outf not np.all(valid_mask):
if self.handle_unknown == 'error':
diff = np.unique(X[~valid_mask, i])
msg = ("Found unknown categories {0} in column {1}"
" during transform".format(diff, i))
raise ValueError(msg)
else:
# Set the problematic rows to an acceptable value and
3.
Due to the variation in the ending code it was earlier producing error:
File "<ipython-input-5-a83bca1aa9f7>", line 178
return outf not np.all(valid_mask):
^
SyntaxError: invalid syntax
So when I tried on my end, I was working on the end_to_end_project.ipynb file and it worked fine on my end. Would request you to get the latest version of the file from our git repository and run it again.
1. n_splits is the number of re-shuffling & splitting iterations. It will re-shuffle and split the dataset 10 times in case you pass on the default value of 10.
2. In case of the split() function, it is mandatory to pass the X and y values. You can check more about it here:
When we use cross_val_score with cv = 10, which trained model is finally selected for doing the evaluation on the test data: the one which showed the minimum RMSE, or the last one? Can we manually choose the trained model out of the 10?
Hi Team, in the video the tutor said random search is better as it takes random values for the hyperparameters, as compared to grid search. I see the only benefit is less computation for the CPU, since only a few random values are taken. But we cannot guarantee its result will be 100 percent correct as compared to grid. Also, we should not compare grid with random when we have high variance, because grid cannot be used there in the first place, according to what I understood from the lecture. So, shall I conclude that if we are lucky enough then random search will give the correct set of hyperparameters, otherwise there is no guarantee?
While it is possible that RandomizedSearchCV will not find as accurate a result as GridSearchCV, it surprisingly picks the best result more often than not, and in a *fraction* of the time GridSearchCV would have taken. Given the same resources, Randomized Search can even outperform Grid Search. Also, "less computation" is one of the key benefits that acts as a deciding factor when it comes to ML/DL models.
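A rough sketch of how the two searches are set up (the model, parameter ranges and data below are arbitrary examples, not values from the course):

from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=8, random_state=42)

# Grid search tries every combination (3 x 3 = 9 candidates)
grid = GridSearchCV(RandomForestRegressor(random_state=42),
                    {"n_estimators": [10, 30, 50], "max_features": [2, 4, 6]},
                    cv=3, scoring="neg_mean_squared_error")

# Randomized search samples a fixed budget of candidates from distributions
rnd = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                         {"n_estimators": randint(10, 60), "max_features": randint(2, 7)},
                         n_iter=9, cv=3, scoring="neg_mean_squared_error", random_state=42)

grid.fit(X, y)
rnd.fit(X, y)
print(grid.best_params_, rnd.best_params_)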
Please help me in clearing the below doubts.
1. What is an estimator in machine learning?
2. Why are we extending the BaseEstimator and TransformerMixin classes while creating our own classes like CombinedAttributesAdder and CategoricalEncoder?
3. What are the BaseEstimator and TransformerMixin classes?
4. Is there any other way to implement the above classes? Meaning, can't we achieve the functionality of the above classes by just creating a user-defined function, because to my understanding the CombinedAttributesAdder functionality can be achieved with a function.
5. Sir, when you introduced these classes in your session, several things after that were not clear to me; please suggest some reading material which can help me in understanding these concepts.
Without using custom transformer pipelines, if we choose to use the scikit-learn methods as done in the previous video, step by step, then how do we union the numerical and categorical columns into housing again (code please)?
Also, in the very last step, when using the stratified test data to predict the test-set result, if we don't want to use a pipeline, then what should we use instead of full_pipeline.transform()?
No, on slide 276 of 401 the FeatureUnion class is used to union the custom transforms using a pipeline. But I am asking how to union "housing_cat_1hot" and "housing_tr", which were obtained without doing the custom transform and without using pipelining.
So how do we union these two now without using pipelining?
I tried doing ColumnTransformer by passing the last column as (-1), but after that, when I split h (i.e. housing) into training and test sets and print test_set.head(), I get the error:
AttributeError: 'numpy.ndarray' object has no attribute 'head'
h.head() was printing tables as in pandas, but after doing the ColumnTransformer, h is no longer a pandas DataFrame? I am not able to understand the error.
You can check the following articles to know more about how to solve this error that you are getting: https://stackoverflow.com/q... https://stackoverflow.com/q... Is there any specific reason you are not using the Pipeline class as shown in the tutorial?
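If the goal is simply to get .head() back after a ColumnTransformer, one option (a hedged sketch with assumed column names, not the course code) is to wrap the resulting NumPy array in a new DataFrame:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the housing data; column names are assumptions for illustration
h = pd.DataFrame({"total_rooms": [880.0, 7099.0, np.nan],
                  "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"]})

ct = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["total_rooms"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
], sparse_threshold=0)  # force a dense array

arr = ct.fit_transform(h)              # a NumPy array, so .head() no longer exists
h2 = pd.DataFrame(arr, index=h.index)  # wrap it back into a DataFrame to use .head()
print(h2.head())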
Getting below error. Do you have any suggestion for this? ImportError: cannot import name '_NAN_METRICS' from 'sklearn.metrics.pairwise' (C:\Users\csree\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py)
Hi, you can download your Jupyter notebook by clicking on the "File" menu and selecting the "Download as" option; from there you can select the desired format in which you want to download it.
DataFrameSelector is used before this step for separating the numerical and categorical data, so how does it help in passing a NumPy array to 'attrib_adder'?
Here in this for loop we are dividing the data into stratified test and train sets as per their indices. First we split the data, where the split (stratified on the income category) assigns some indices to the test set and the rest to the train set; then we store these in the variables strat_train_set and strat_test_set using those indexes.
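The loop being described looks roughly like this (a self-contained sketch with a toy stand-in for the housing DataFrame):

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for the housing DataFrame with an income_cat column
housing = pd.DataFrame({"median_income": np.random.default_rng(42).uniform(0.5, 15.0, size=500)})
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# split() yields (train_index, test_index) pairs; with n_splits=1 the loop runs once
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

print(strat_train_set["income_cat"].value_counts(normalize=True))
print(strat_test_set["income_cat"].value_counts(normalize=True))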
For all models you are using the same data for training and prediction. For example:
# Train a model using Decision Tree
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)
# Calculate RMSE for the Decision Tree model
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
Here we are using housing_prepared. So why are we not using a different data set for testing? It is obvious that it will give an accurate result if we try to predict on a data set that was already used in training. For testing we should use a data set that was never used during training.
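The concern is valid: error measured on the training data is optimistic. A hedged sketch of the usual alternatives, cross-validation on the training set plus a final check on a held-out test set, using synthetic data rather than the housing set:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)

# RMSE on the data the model was trained on: overly optimistic (often ~0 for a deep tree)
train_rmse = np.sqrt(np.mean((tree_reg.predict(X_train) - y_train) ** 2))

# Better estimates of generalisation: cross-validation on the training set,
# and a final check on the held-out test set
cv_rmse = np.sqrt(-cross_val_score(tree_reg, X_train, y_train,
                                   scoring="neg_mean_squared_error", cv=10)).mean()
test_rmse = np.sqrt(np.mean((tree_reg.predict(X_test) - y_test) ** 2))

print(train_rmse, cv_rmse, test_rmse)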
Hi Sandeep, we implemented the One Hot Encoder for the categorical data, but why didn't we resolve the dummy variable trap issue? Should we consider the dummy variable trap or not in the pre-processing step of any model?
After converting the text labels to numerical, I am getting zero at all the indexes and it is not matching with the output shown in the video part 3.
Please help!
Hi,
Would request you to match your code against the actual code given on our GitHub repository:
https://github.com/cloudxlab/ml/blob/master/machine_learning/end_to_end_project.ipynb
Thanks.
Thanks
Please help with the below error:
Hi,
Your screenshot did not get attached. Could you please reattach it and share it again?
Thanks.
Can you please explain the concept of training and test data set.
Upvote ShareHi,
When we want our machine learning model to accomplish a certain task (say, classification of images), the model should have knowledge of the different classes involved. For example, for dog-vs-cat classification, the model should know the attributes of cats and dogs and which attributes distinguish one from the other. So we try to impart this knowledge with the help of a varied set of images of both cats and dogs. This set of images which we use to impart knowledge into the model is called the training data, since we are using these images to train the model and enable it to distinguish between cat and dog images.
Further, we also assess the trained model: how well has it learnt about cats and dogs, and how well can it use this knowledge to classify an image of a cat or dog which it has never seen before? This is just like tutoring a child on how to solve a problem (for example, teaching them how to solve a mathematical problem) and then conducting an exam to test their knowledge. This is called the testing phase.
We train the model to impart knowledge, and test it to know how well it would perform on unseen data, so that we understand whether it needs more training, whether it needs a greater variety of data, whether it is memorising or actually learning, and so on.
Hope this helps.
Thanks.
Some days I got irritated, specifically the theory part but on some days I loved your videos, specifically the hands-on part. Now I am in love hahahahha....you are just brilliant
Why have we separated the numerical and categorical data for applying the imputer first and then the one-hot encoder?
Can't we specify the columns to which we want the imputer to apply and similarly the encoder?
Hi,
Good question. The imputer (with the median strategy) can only be applied to the numerical columns, and the one-hot encoder applies only to the categorical column. That is why the two were separated.
Thanks.
Thanks for the reply
However, this is what I tried:
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values=np.nan,strategy='median')
imputer.fit(strat_train_set.iloc[:,0:-1])
strat_train_set.iloc[:,0:-1]=imputer.transform(strat_train_set.iloc[:,0:-1])
We could specify indexes like this right?
Want to know the issue with this approach.
Hi,
Please go through the code from our GitHub repository and match against your code to understand the difference:
https://github.com/cloudxlab/ml/blob/master/machine_learning/end_to_end_project.ipynb
Thanks.
Upvote ShareHello,
I have a question. Can you please explain why we are using these modules "BaseEstimator, TransformerMixin" as inputs to the custom classes that we are creating?
Thanks
1 Upvote ShareHi,
When we are creating our custom classes, we generally add BaseEstimator and TransformerMixin as base classes to get the advantage of their methods. The former one gives us get_params() and set_params() methods and the latter gives us fit_transform() method.
Thanks.
It goes over my head whenever I watch! These libraries have not been explained at all in the video.
Demotivating to be honest.
Don't understand how it is working.
Upvote ShareHi,
Could you please tell me which libraries/which part of the video you are unable to understand. I can help you with those.
Thanks.
1. def __init__(self,add_bedrooms_per_room=True):
self.add_bedrooms_per_room = add_bedrooms_per_room
What is the initialisation for?
Why add_bedrooms_per_room? Why not add_bedrooms_per household?
2. def fit(self, X , y=None)
Is X the same from X = imputer.transform(housing_num)? What is 'y'?
Upvote ShareHi,
1. This is the initializer (__init__) of the class CombinedAttributesAdder. I would suggest you go back to the Python tutorial for more information on classes in Python.
2. add_bedrooms_per_room is a parameter; if it is set to True, we also calculate bedrooms_per_room and include it in the returned array.
3. Here y is set to None. So y does not have any value; we are fitting the transformer only to X and not to y.
Hope this helps explain your query. Let me know if you need help with any other topic.
Thanks.
It is somewhat clear now, but what was the purpose of having the condition add_bedrooms_per_room=True when it is a necessary quantity, as in there is enough correlation?
I used this for pipeline, it is working no issue:
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
def fit(self, X, y=None):
return self # nothing else to do
def transform(self, X, y=None):
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
population_per_household = X[:, population_ix] / X[:, household_ix]
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
return np.c_[X, rooms_per_household, population_per_household,bedrooms_per_room]
Where can i find the code snippet for CombinedAttributesAdder() ?
Upvote ShareHi,
You can find all code that is referenced in this course from our GitHub repository. For this Class, you can find it's code from the below link:
ml/end_to_end_project.ipynb at master · cloudxlab/ml (github.com)
Thanks.
Upvote ShareHi,
Can some one explain the code for user defined CombineAttributeAdder() and DataFrameSelector() ?
What are BaseEstimator and TransformerMixin?
How is it all working?
Thanks.
1 Upvote ShareHi,
Good question!
CombinedAttributesAdder() is a custom transformer class that adds the combined (derived) attributes to the data.
When we are creating our custom classes (i.e. transformers and estimators), we can add BaseEstimator and TransformerMixin as base classes. The former gives us the get_params() and set_params() methods, and the latter gives us the fit_transform() method for free.
Thanks.
Upvote ShareHi.
One question -
housing.plot(kind = 'scatter', x = 'longitude', y = 'latitude', alpha = 0.4, s = housing.population/100, label = 'population',
figsize = (10,7), c = 'median_house_value', cmap = plt.get_cmap('jet'), colorbar = True, sharex = False)
plt.legend()
Please explain this code -
s = housing.population/100
(a) - What is s?
(b) - why are we dividing by 100?
's' means size
s = housing.population/100, implies that the size of the points plotted via scatter plots should vary as per the corresponding population. Division by 100 in order to reduce the size of points plotted as population is a huge numeric value. Try with and without dividing by 100.
Thanks a lot Abhinav. Really appreciate your help on this.
1 Upvote ShareCan you help me understand how the above piece of code is labeling housing["income_cat"] above 5 as 5 ?
Thanks.
Upvote Sharewhere() works as, if the condition, the first attribute is False, ie if income_cat<5 is False, replace it with the value mention in the 2nd attribute, 5.0 in this case.
inplace=True makes the changes in the data frame permanent.
1 Upvote ShareCombineAttributeAdder() is a user defined class right?
From where can I get the CategoricalEncoder code, and how can I use it? Any help would be appreciated.
Upvote ShareHi,
Please refer to our GitHub repository for the complete code, the link to which is given below:
cloudxlab/ml: Machine Learning Projects and Learning Content (github.com)
Thanks.
Thanks, I got it.
Upvote ShareHi,
After dropping "median_house_value" from the dataframe, how can we copy it ?
FYI, I've gove through https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html and w3schools and other links, but they didn't help much to clarify the doubt.
Thank you.
Upvote ShareHi,
Here, by using drop, we are storing all the columns - except median_house_value - as a new dataframe housing. strat_train_set is not getting modified, it just has all the columns including median_house_value. Then in the next line, we are storing the strat_train_set["median_house_value"] values in a new variable housing_labels.
So basically, we are not disturbing the original dataframe strat_train_set, but we are just storing some columns in one variable and the median_house_value in another variable.
Note that we will be modifying the original dataframe when we set its parameter inplace=True. Since this is False by default, no changes happen to that dataframe.Hope this helps.
Thanks.
1 Upvote ShareHi,
The code for plotting coordinates on background image/map is not mentioned in slides?
Thanks.
Upvote ShareHi,
Could you please tell me which part of the lecture video you are referring to? If this is something related to creating charts, you can use Matplotlib for the same. If you want to know about Matplotlib, you can try out our free Intro to Matplotlib project.
Thanks.
Time stamp: 00:18:10
Upvote ShareHi,
The slides are for presentational purposes only. Please check our GitHub repository for the codes:
ml/end_to_end_project.ipynb at master · cloudxlab/ml (github.com)
Thanks.
Shouldn't it have been "Test set generated using Random sampling has income category proportion is quite skewed", instead of:
Thank you.
Hi,
We didn't get your question and the second image is not visible here. Can you please look into this?
Thanks.
Please ignore the 2nd image. I was trying to point out the statement "Test set generated using stratified sampling has income category proportion is quite skewed". Is it correct ?
Shouldn't it have been "Test set generated using Random sampling has income category proportion is quite skewed" ?
Thank you.
for set_ in (strat_train_set, strat_test_set):
set_.drop('income_cat',axis=1,inplace =True)
How does the loop iterate in this case.
Sorry, i know this might be a silly question, but I don't know why I am struggling to grasp this. :|
Thanks.
Upvote ShareHi,
Could you tell me which slide has this code been referred in?
Thanks.
Upvote ShareHi,
SLIDE: 144
Thanks
Upvote ShareHi,
Got it. I got confused with the underscore. Here set is taking the set of data, both strat_train_set and strat_test_set, one at a time, and then remove the column income_cat from them using the drop() function.
Thanks.
slide 89,
median age - 50 , median house value - 500000 are capped and due to which ml algo. may learn that price never go beyond that limit.
silly question but how the data is getting capped here? and there must be a situation where the price may not exceed at some point then in that case ?
Upvote ShareHi,
The data was capped while it was being recorded. This is a property of this dataset. Yes, there can definitely be a situation that the price will not exceed certain value, however, if we notice that there is an abnormal number of instances in the last value, it would almost surely mean that that value is capped.
Thanks.
Upvote ShareTrying once again
Upvote Sharescrrenshot
Upvote Shareattaching screenshot
Upvote ShareFor some strange reason I am not able to upload screenshot. And thus for this query I am including the code snippet below with query
########################################################################
from sklearn.base import BaseEstimator, TransformerMixin
# column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
self.add_bedrooms_per_room = add_bedrooms_per_room
def fit(self, X, y=None):
return self # nothing else to do
def transform(self, X, y=None):
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
population_per_household = X[:, population_ix] / X[:, household_ix]
if self.add_bedrooms_per_room:
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
return np.c_[X, rooms_per_household, population_per_household,
bedrooms_per_room]
else:
return np.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs = pd.DataFrame(housing_extra_attribs, columns=list(housing.columns)+["rooms_per_household", "population_per_household"])
housing_extra_attribs.head()
##############################################################################3
My Question is -->
Can you please explain why we used the code to ''bedrooms_per_room" when we just wanted to add only 2 new columns as "rooms_per_household, population_per_household"
if self.add_bedrooms_per_room:
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
return np.c_[X, rooms_per_household, population_per_household,
bedrooms_per_room]
else:
return np.c_[X, rooms_per_household, population_per_household]
Thanks
Hi,
I appreciate your patience. This code helps in adding 3 features: rooms per household, population per household and bedrooms per room. However, the last one is added only if you ask for it when calling this transformer. This is a small transformer class that adds the combined attributes. The add_bedrooms_per_room hyperparameter will allow you to easily find out whether adding this attribute helps the Machine Learning algorithms or not.
Thanks.
Upvote ShareThanks a lot. Just one query pending for today. I guess have bothered you too much for the day :)
#############################################
housing.info() data shows that there are 206 missing values in 'total_bedrooms'. But when I execute this 'housing.isnull().sum()' , I get 158 missing values from column "total_bedrooms"? Please explain this discrepancy.
Upvote ShareHi,
Apologies for the late reply. Is this caused by the axis=0 or 1 issue? Could you please check and let me know.
Thanks.
Upvote ShareHi,
I did not mention the axis but just used this code 'housing.isnull().sum()'.
Thanks
Upvote ShareHi,
I am not getting any missing values on my end in the dataset. Here is the screenshot of the output from my code:
Would request you to re-download the dataset by cloning the repository once again.
Thanks.
Upvote ShareHello,
In the screenshot below, can you please explain why we used the code to ''bedrooms_per_room" when we just wanted to add only 2 new columns as "rooms_per_household, population_per_household"
if self.add_bedrooms_per_room:
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
return np.c_[X, rooms_per_household, population_per_household,
bedrooms_per_room]
else:
return np.c_[X, rooms_per_household, population_per_household]
Thanks
Upvote ShareHello,
I have a question. During the initial exploration of the 'housing' dataaet we see that there was missing entry in the column " total_bedrooms 20433 non-null float64 ". However, when we execute
isn = housing.isnull()
isn.any(axis=1)
The column entries shows 'False'. What about missing entries from "total_bedrooms"? Why doesn't it shows here?
Thanks
Hi,
Please go through the below discussion for more details on this:
https://stackoverflow.com/questions/22149584/what-does-axis-in-pandas-mean
Thanks.
Upvote ShareHi,
I have through the details from the link. Still I have 2 main questions:-->
1) Why does this code (isn = housing.isnull() /n isn.any(axis=1)) do not return "True' for missing values from the column 'total_bedrooms'?
2) housing.info() data shows that there are 206 missing values in 'total_bedrooms'. But when I execute this 'housing.isnull().sum()' , I get 158 missing values from column "total_bedrooms"? Please explain this discrepancy.
Hi,
Try it with axis=0 and let me know the results.
Thanks.
Upvote ShareHi,
With axis=0, we did get the column "total_bedrooms" as False. But isn't axis=0 represent Rows and axis=1 represent Columns? So why did we use axis=1?
Also, the point 2 also needs clarity from previous email. i.e. housing.info() data shows that there are 206 missing values in 'total_bedrooms'. But when I execute this 'housing.isnull().sum()' , I get 158 missing values from column "total_bedrooms"? Please explain this discrepancy.
Hi,
I referred you to the link to understand this. Please go through it once again.
Thanks.
Upvote ShareI have gone through it and unable to understand and this is the reason I am asking several times. Hope you AND/OR sandeep can help me explain. It does difficult specially when the Q&A is not live and the sessions are recorded.
Also my second part of query is still awaiting clarification .i.e.
housing.info() data shows that there are 206 missing values in 'total_bedrooms'. But when I execute this 'housing.isnull().sum()' , I get 158 missing values from column "total_bedrooms"? Please explain this discrepancy.
Thanks for understanding and helping out!
Upvote ShareHi,
Let me explain. The axis argument specifies the axis along which the operation is computed. So axis=0 means along the rows, which basically means it considers all the rows of a given column. This is the reason that, even though you found missing values initially, when you used axis=1 it simply looked for missing values across each row, not down each column. I know it can be a bit confusing. In brief, axis=0 is said to be "column-wise" (and axis=1 "row-wise").
As for your second query, I am still awaiting the screenshot from your end.
Thanks.
Got it Rajtilak. This was really confusing!
The second of initial query was this -
housing.info() data shows that there are 206 missing values in 'total_bedrooms'. But when I execute this 'housing.isnull().sum()' , I get 158 missing values from column "total_bedrooms"? Please explain this difference too. Thanks!!
***********************************************************************
Screenshot is different query which I am facing difficulty in uploading it.
Upvote ShareHi,
Could you please tell me how you are trying to attach the screenshots? You need to click on the image button, click on the Upload tab, select the image by clicking on the Choose File button, click on Send it to server and that should do the trick.
Thanks.
Upvote ShareUploaded it. Thanks for the quick help.
Upvote Shareimage
Hi,
Please use the following path:
Thanks.
Upvote ShareIt worked. Thanks
Upvote ShareGood work!
Upvote ShareScreenshot
Upvote ShareAttaching the screenshot again
Upvote ShareFor some strange reason there has been a constant issue with uploading screenshots. Though before posting my comments I can see the preview of the screenshot. But once I have submitted the comments, the screenshots do not appear. Trying once again -
Upvote ShareHi,
Please mail us the screenshot with details of the issue you are facing and the name of the topic.
Thanks.
Upvote ShareSent the image. managed it this time
Upvote ShareHi,
I am getting the following error on running this code. Please advise as the image file exists at the location:
"https://jupyter.f.cloudxlab.com/user/shashwatv8093/tree/ml/machine_learning/images/end_to_end_project"
import matplotlib.image as mpimg
california_img=mpimg.imread('images/end_to_end_project/california.png')
ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
s=housing['population']/100, label="Population",
c="median_house_value", cmap=plt.get_cmap("jet"),
colorbar=False, alpha=0.4,
)
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5)
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)
prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
cbar = plt.colorbar()
cbar.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cbar.set_label('Median House Value', fontsize=16)
plt.legend(fontsize=16)
plt.show()
**********************************************************************************************************
Attached screenshot
Upvote ShareHi,
Please try the following path:
ml/machine_learning/images/end_to_end_project/california.png
or
/ml/machine_learning/images/end_to_end_project/california.png
The path is relative to the location of the Jupyter notebook. Also note that the path you provided in your comment is an URL and not an actual path. It will throw an error if you use that.
Thanks.
Upvote ShareI am still getting the same error. Have tried your recommendation. Still getting the same error:
Hi,
Please share a screenshot of the location where the file is stored.
Thanks.
Upvote SharePlease see the location
Upvote ShareI am unable to post screenshot. I take the screenshot and it gets uploaded too. But once I post the comments it disappears. Could you please let me know why?
Anyways I am sending you the location of the file ' california.png':
https://jupyter.f.cloudxlab.com/user/shashwatv8093/tree/ml/machine_learning/images/end_to_end_project
Topic name - End-to-End Machine Learning Project Part-3
Issue - Unable to plot
******************************************************************************
I did try print screen + snipping tool option for uploading screenshot but this is not working. Any workaroung for the same please?
Thanks
1 Upvote ShareHello sir, Can you please suugest how can we give the correct path for this housing.csv file. as i have already downloaded it in my local machine.
HOUSING_PATH = '/home/Downloads/housing.csv'
But still gives an error that 'File is not found'.
Upvote ShareHi,
Please share a screenshot of the location where the file is stored in the lab, and the location of your Jupyter notebook in which you are working.
Thanks.
Upvote ShareHi,
Please fix the indentation of compare_props, right now it is showing as inside the function, you need to ensure that it is outside the function.
Thanks.
Upvote ShareHi,
Can you please verify why the below error is coming
Upvote ShareHow can we implement splitting with identifier along with StratifiedShuffleSplit??
Say, we have datasets which contains data from 200 scanners, the data is being updated regularly and in the train and test sets, we want the proportion of the scanners to be similar. How can we do that?
Please, help and suggest some readups
1 Upvote ShareHi,
Are you referring to an imbalanced dataset? If yes, then you have a number of options you can use to deal with imbalanced datasets:
1. Random Under-Sampling
2. Random Over-Sampling
3. Cluster-Based Over Sampling
4. Synthetic Minority Over-sampling Technique for imbalanced data
5. Modified synthetic minority oversampling technique (MSMOTE) for imbalanced data
6. Bagging Based techniques for imbalanced data
There are several other methods that you can read about if you search in Google.
Thanks.
Upvote ShareThank you very much for the quick response
Infact yes, I was referring to an imbalanced dataset like the ones we have to deal with for anomaly detection.
I went through some of the mentioned sampling methods, but they are primarily dealing with imbalanced data which doesn't have unique time stamps.
So how to deal with the data if it is highly imbalanced time-series data
Thank you again
1 Upvote ShareHi,
Interesting question! You can go through the below link to find more details on imbalanced time series data:
https://datascience.stackexchange.com/questions/28200/when-should-you-balance-a-time-series-dataset
And then there is also this interesting research paper:
https://www.nrso.ntua.gr/geyannis/wp-content/uploads/geyannis-pc327.pdf
Thanks.
Upvote ShareThis comment has been removed.
While I was looking for solutions I found this:
https://cran.r-project.org/web/packages/OSTSC/vignettes/Over_Sampling_for_Time_Series_Classification.pdf
I think it be of help to others who might want to refer
1 Upvote Sharehi team,
can anybody explain what does n_splits=1 means in below code?
Hi,
n_splits constitutes the number of re-shuffling & splitting iterations.
Thanks.
1 Upvote ShareHello,
We perform train- test split initially only, to avoid :
Am I correct ?
Upvote ShareHi,
Yeah so basically, after training our model with trainset, we would test it in order to gain confidence about its performance on unseen data which is the test data. So splitting it initially makes sure that our data remains unseen.
Thanks.
Upvote ShareDo Longitude and Lattitude play any key significance??? IN this example, we are using Longitude and Latitude for correlation.
Upvote ShareHi,
There is no one-rule-fits-all concept in Machine Learning. So even though here we are using them only for correlation, another project might have some other use for them. For example, you may want to find out the most crowded area, or the area of the highest number of tax payers, in these cases you will have to use latitude and longitude data.
Thanks.
Upvote Sharehello Sir,
please explain.
What could we infer from low median income and high median house value like 2 income on x axis and 500000 house value on y axis.Similerly for example @ 4 median income we ge house ranging from 100k to 500k .Whether this depends on proximity to the sea or popular area?
reference plot @ lecture 36:30, slide 183.
thanks
Upvote ShareHi,
Median is a measure of central tendency, like the mean. So this means those areas have average income rates as given.
Thanks.
Upvote Sharerom sklearn.base import BaseEstimator, TransformerMixin
# column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
self.add_bedrooms_per_room = add_bedrooms_per_room
def fit(self, X, y=None):
return self # nothing else to do
def transform(self, X, y=None):
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
population_per_household = X[:, population_ix] / X[:, household_ix]
if self.add_bedrooms_per_room:
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
return np.c_[X, rooms_per_household, population_per_household,
bedrooms_per_room]
else:
return np.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs = pd.DataFrame(housing_extra_attribs, columns=list(housing.columns)+["rooms_per_household", "population_per_household"])
housing_extra_attribs.head()
In the above class, can you please explain why we are writing this:
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
Why not simply write:
rooms_per_household = X[rooms_ix] / X[household_ix]
Upvote ShareHi,
X[:, rooms_ix] selects the column at index rooms_ix for every row (NumPy 2-D indexing), whereas X[rooms_ix] would select the single row at index rooms_ix. You could check the difference between the two by printing what they yield. Hope this helps.
Thanks.
Upvote ShareThis comment has been removed.
i am getting file not found error ,what i have to do ,to get that data file to my library
Upvote ShareHi,
Could you please elaborate a little more about your issue? Which file you are trying to access. Please share a screenshot of your code and the error that you are getting.
Thanks.
Upvote Sharewine.plot(kind="scatter", x="longitude", y="latitude", alph=0.2)
Getting keyError: longitude exception.
Please help
Upvote ShareHi,
It is unable to find longitude, please see the correct syntax below:
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot
Thanks.
Upvote SharePlease explain how to add the one-hot-encoded dataframe into the original files??
Hi,
Would request you to go through the lecture video, the slides, and the Jupyter notebook from our GitHub repository. The concept have been explained in detail here.
Thanks.
Upvote ShareFile not found for california.png
Upvote ShareHi,
Please share a screenshot of your code and the error that you are getting.
Thanks.
Upvote ShareHi,
I got this solved, earlier when I opened the lab the End to End project was under "Cloudxlab_jupyter_notebooks", from there I was not able to call ml folders.
Later I figured out that the end to end project is outside unde ml/.. directory.
Upvote ShareThis comment has been removed.
Sir, please can you explain CombinedAttributesAdder()?
Upvote ShareHi,
This is a custom transformer to combine all attributes.
Thanks.
Upvote ShareHi,
I am getting invalid username and password
Hi,
I have reset your password. Please type in your new password instead of copy pasting it.
Thanks.
Upvote ShareHello,
It seems 'from sklearn.preprocessing import Imputer' has been deprecated. This gives an error 'cannot import name "Imputer" '
The following was successful - from sklearn.impute import SimpleImputer. Can you please confirm if SimpleImputer needs to be used now, instead of Imputer?
Thanks,
Kartik
2 Upvote ShareHi,
You are absolutely correct!
As mentioned in the slides 215 and 216, we need to use SimpleImputer now.
Thanks.
1 Upvote Sharehow we will know ,to do feature scaling,how we will verifiy in data set ,like totta number rooms range from 0 to ect..,any partucal way to get those huge range coumns in data set??
Upvote ShareAnd also how is it goign to help if we do scaling?how it is related to other columns??
and if we do standaztiion intsted of min and max...vlaues are not bounded by 0 and 1,so agian features are scalled widelyright ..mainly how feature scaling wil help,as other coulmns are not scalled accoridnly?
Hi,
Feature Scaling or Standardization is a step of Data Pre Processing which is applied to independent variables or features of data. It basically helps to normalise the data within a particular range. Sometimes, it also helps in speeding up the calculations in an algorithm.
The Min-Max scaler is an estimator that scales and translates each feature individually so that it lies in a given range on the training set, e.g. between zero and one.
For detailed discussion, I would suggest you to go through the lecture videos once again.
Thanks.
Upvote ShareHi sir, is there any method through which we can know when to apply Normalization and when to use Standardization? What is the general rule to follow ?
Upvote Shareinstead of "round" why are we using "ceil" to get discrete categories???
Upvote ShareHi,
Good question!
Here is a link which discussed in-depth about the difference between round and ceil function in Python:
https://blog.tecladocode.com/rounding-in-python/
Thanks.
Upvote ShareOne qucik question why we need to divind exactly with 1.5 to limit number of strata? housing["income_cat"]=np.ceil(housing["median_income"]/1.5)
Upvote ShareHi,
Good question!
As mentioned in the slides, we are dividing by 1.5 so that we will not have too many strata and each stratum will be large enough.
Thanks.
Upvote ShareThanks for quick reply,why eaxtly 1.5 is what i am looking for,any criteria to get this number
Upvote ShareHi,
So this is basically like creating a histogram. There are a few ways using which you can calculate the bins of a histogram:
1. Count the number of data points.
2. Calculate the number of bins by taking the square root of the number of data points and round up.
3. Calculate the bin width by dividing the specification tolerance or range (USL-LSL or Max-Min value) by the # of bins.
Thanks.
Upvote ShareThank you
Upvote ShareSir ,
the code for CategoricalEncoder is generating an error after being copying from the repository ,
Please help
Upvote ShareHi,
I just checked the end_to_end_project.ipynb file from my end and it is running fine. Could you please tell me which file you are trying to run, and what kernel you are using?
Thanks.
Upvote Share# Just run this cell, or copy it to your code, do not try to understand it (yet).
# Definition of the CategoricalEncoder class, copied from PR #9151.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse
class CategoricalEncoder(BaseEstimator, TransformerMixin):
----------------
----------------
else:
return outf not np.all(valid_mask):
if self.handle_unknown == 'error':
diff = np.unique(X[~valid_mask, i])
msg = ("Found unknown categories {0} in column {1}"
" during transform".format(diff, i))
raise ValueError(msg)
else:
1.
I am trying to run the CategoricalEncoder code from the cell in the end_to_end_project.ipynb
2.
The code is giving error in the file itself (file name- end_to_end_project.ipynb)
3. I am running it in Python 3 kernel
Hi,
Please restart your server and try once again:
https://discuss.cloudxlab.com/t/im-having-problem-with-assessment-engine-how-should-i-fix/3734
Thanks.
Upvote ShareThanks for your efforts sir.
My code started working .
Just a remainder that the code for CategoricalEncoder in file
end_to_end_project.ipynb is incomplete and the complete code is written in the
end_to_end_project_bootcamp.ipynb file.
Upvote ShareHi,
Could you please point out which part is missing?
Thanks.
1 Upvote Share1.
In end_to_end_project_bootcamp.ipynb file the end code is as follows:
if self.encoding == 'onehot-dense':
return out.toarray()
else:
return out
2.
But in the end_to_end_project.ipynb file the end code is as follows:
if self.encoding == 'onehot-dense':
return out.toarray()
else:
return outf not np.all(valid_mask):
if self.handle_unknown == 'error':
diff = np.unique(X[~valid_mask, i])
msg = ("Found unknown categories {0} in column {1}"
" during transform".format(diff, i))
raise ValueError(msg)
else:
# Set the problematic rows to an acceptable value and
3. Due to this difference in the ending code, it was producing the error earlier.
Hi,
When I tried the end_to_end_project.ipynb file on my end, it worked fine. Would request you to get the latest version of the file from our Git repository and run it again.
Thanks.
After completing the video, the status is showing as incomplete. Why is that?
After clicking the "Mark Completed" button, is it still showing incomplete?
Hi,
I am unable to import the Imputer class from the sklearn.preprocessing library.
Thanks in advance!
Hi,
Please go through the below discussion:
https://discuss.cloudxlab.com/t/solved-cannot-import-imputer/4052
Thanks.
Areas below latitude 38 appear to be more densely populated.
Hi,
Are you facing any challenge with this?
Thanks.
Hi,
I'm unable to clone the repository as I am unable to log in to GitHub.
git clone https://github.com/cloudxlab/ml.git
Also, I tried to do the same through the terminal, but could not locate the ml directory.
Can you please help ?
Hi,
Try the following command on a web console:
git clone https://github.com/cloudxlab/ml ~/ml
If this does not work, please share a screenshot.
Thanks.
Hi Rajtilak,
I tried the above-mentioned command and it worked. Thanks for your help!
Doubts in the Housing project:
1. What is n_splits in StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)? What happens if I use the default value of 10?
2. Is it necessary to pass the label based on which the splitting occurs (in the case of housing it was median income)?
3. Can we split the data based on more than one column?
Hi,
1. n_splits is the number of re-shuffling & splitting iterations. It will re-shuffle and split the dataset 10 times in case you pass on the default value of 10.
2. In case of the split() function, it is mandatory to pass the X and y values. You can check more about it here:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html
3. I think what you are referring to is a nested StratifiedShuffleSplit or stratifying on multiple columns; you can find more about it from the links below (see also the sketch after this reply):
https://stackoverflow.com/questions/40400351/nested-cross-validation-with-stratifiedshufflesplit-in-sklearn
https://stackoverflow.com/questions/45516424/sklearn-train-test-split-on-pandas-stratify-by-multiple-columns
Thanks.
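For point 3, one common workaround (a sketch, not the only approach) is to stratify on a single key built by combining the columns of interest. Note that every combined category must occur at least twice; very rare combinations would need to be merged first.

import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

housing = pd.read_csv("housing.csv")
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6.0, float("inf")],
                               labels=[1, 2, 3, 4, 5])

# Hypothetical combined key: stratify on income category AND ocean proximity.
strat_key = housing["income_cat"].astype(str) + "_" + housing["ocean_proximity"]

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, strat_key):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]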
class cars():
    def __init__(self,modelname,yearm,price):
        self.modelname = modelname
        self.yearm = yearm
        self.price = price
    def price_inc(self):
        self.price = (self.price*1.15)

class SuperCars(cars):
    def __init__(self,modelname,yearm,price):
        super.__init__(modelname,yearm,price)
        self.cc=cc

honda = SuperCars('City',2019,1000000)
honda.cc=1500
honda.price_inc()
print(honda.price)
Output: an error (screenshot not included).
What is wrong with my code?
Hi,
Could you please tell me which assessment this is a part of?
Thanks.
No sir, I have tried this on my own but it is not working.
That's why I want to know what is wrong with this code.
Hi,
Would request you to post this on our discussion forum at the below link:
https://discuss.cloudxlab.com/
Thanks.
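For reference, a minimal corrected sketch of the class above (assuming cc was meant to be a constructor parameter; the two likely issues are that super must be called as super().__init__(...) and that cc is never passed in):

class Cars:
    def __init__(self, modelname, yearm, price):
        self.modelname = modelname
        self.yearm = yearm
        self.price = price

    def price_inc(self):
        self.price = self.price * 1.15


class SuperCars(Cars):
    def __init__(self, modelname, yearm, price, cc):
        super().__init__(modelname, yearm, price)  # call super(), don't just reference it
        self.cc = cc                               # cc now comes in as an argument


honda = SuperCars('City', 2019, 1000000, cc=1500)
honda.price_inc()
print(honda.price)  # ~1150000.0 (price increased by 15%)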
When we use cross_val_score with cv=10, which trained model is finally selected for evaluation on the test data?
The one which showed the minimum RMSE, or the last one?
Can we manually choose one of the 10 trained models?
Hi,
cross_val_score does not select a model for you: it trains and evaluates 10 separate models only to estimate how well the model generalizes, returns the 10 scores, and discards those models. You then refit the chosen model on the full training set before evaluating it on the test data (or use GridSearchCV/RandomizedSearchCV, which refit the best estimator automatically).
Thanks.
-- Rajtilak Bhattacharjee
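A short sketch of that workflow, reusing the housing_prepared and housing_labels variables from the notebook:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)

# 10-fold cross-validation: trains 10 throwaway models and returns 10 scores.
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print(rmse_scores.mean(), rmse_scores.std())

# The model you actually keep is then refit on the full training set.
tree_reg.fit(housing_prepared, housing_labels)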
Please send me the link to the Jupyter notebook where sir is teaching the code.
Hi,
Please find below the link to our GitHub repository where all the Jupyter notebooks from this course are hosted:
https://github.com/cloudxla...
Thanks.
-- Rajtilak Bhattacharjee
Hi Team,
In the video, the tutor said random search is better as it tries random values for the hyperparameters, compared to grid search. The only benefit I see is less computation for the CPU, as only a few random values are tried. But we cannot guarantee its result will be as correct as grid search's. Also, we should not compare grid search with random search when we have high variance, because grid search cannot be used there in the first place, according to what I understood from the lecture.
So, shall I conclude that if we are lucky enough, random search will give the correct set of hyperparameters, and otherwise there is no guarantee?
Thanks
Upvote ShareHi,
While it is possible that RandomizedSearchCV will not find as accurate a result as GridSearchCV, it surprisingly picks the best result more often than not, and in a *fraction* of the time GridSearchCV would have taken. Given the same resources, randomized search can even outperform grid search. Also, "less computation" is one of the key deciding factors when it comes to ML/DL models.
Thanks.
-- Rajtilak Bhattacharjee
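A minimal sketch of the two searches side by side (the parameter values here are illustrative, not the ones from the lecture):

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

forest_reg = RandomForestRegressor(random_state=42)

# Grid search: tries every listed combination (3 x 3 = 9 candidates).
grid_search = GridSearchCV(
    forest_reg,
    param_grid={"n_estimators": [10, 30, 100], "max_features": [4, 6, 8]},
    cv=5, scoring="neg_mean_squared_error")

# Randomized search: samples n_iter combinations from the distributions,
# which scales much better when the search space is large.
rnd_search = RandomizedSearchCV(
    forest_reg,
    param_distributions={"n_estimators": randint(10, 200),
                         "max_features": randint(2, 9)},
    n_iter=10, cv=5, scoring="neg_mean_squared_error", random_state=42)

# grid_search.fit(housing_prepared, housing_labels)
# rnd_search.fit(housing_prepared, housing_labels)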
Hi,
While executing the below command, I am getting an AxisError:
housing_prepared = full_pipeline.fit_transform(housing)
AxisError: axis -1 is out of bounds for array of dimension 0
My code (Housing_1) is at the link below:
https://jupyter.e.cloudxlab...
Hi,
Would request you to recheck your code. If you are stuck somewhere, you can take a hint or look at the answer.
Thanks.
-- Rajtilak Bhattacharjee
Hi Team,
I am getting a "No space left on device" error. Please suggest how to fix this. I don't have anything stored apart from my work, which is hardly 2 MB.
Thank you
We have fixed these issues. Let us know if it persists.
-- Praveen Pavithran
Not able to access the Jupyter notebook; a 503 Service Unavailable error is coming. Can you please help?
Hi,
Would request you to follow all the steps from the link given below:
https://discuss.cloudxlab.c...
Thanks.
-- Rajtilak Bhattacharjee
How can I access the housing median project code which is discussed in class?
Hi,
You can find all the code in our GitHub repository. The URL is given below:
https://github.com/cloudxla...
Thanks.
-- Rajtilak Bhattacharjee
I am having the following error. Please resolve.
Hi @sandeepgiri sir,
Please help me in clearing the below doubts.
1. What is an estimator in machine learning?
2. Why are we extending the BaseEstimator and TransformerMixin classes while creating our own classes like CombinedAttributesAdder and CategoricalEncoder?
3. What are the BaseEstimator and TransformerMixin classes?
4. Is there any other way to implement the above classes? Meaning, can't we achieve their functionality by just creating a user-defined function, since the CombinedAttributesAdder functionality can be achieved with a plain function, to my understanding?
5. Sir, when you introduced these classes in your session, several things after that were not clear to me. Please suggest some reading material which can help me in understanding these concepts.
These questions are answered in the later part of the course. You can use these classes as-is here for now.
-- Praveen Pavithran
utkarshtrivedi1403@gmail.com
I have completed Topic 3 but it is still showing 98%; kindly rectify it.
Getting an error while trying the imputer:
ImportError: cannot import name 'SimpleImputer'
Code tried:
from sklearn.preprocessing import SimpleImputer
imputer = SimpleImputer(strategy='median')
This is as per the end-to-end notebook code. Please help.
Hi,
Please change it to the following:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
We have updated our notebooks, would request you to pull the latest set from our GitHub repository.
Thanks.
-- Rajtilak Bhattacharjee
Hi,
Without using custom transformer pipelines, if we choose to use the scikit-learn methods step by step as done in the previous video, how do we union the numerical and categorical columns back into housing (code please)?
Also, in the very last step, when using the stratified test data to predict the test-set result, if we don't want to use a pipeline, what should we use instead of full_pipeline.transform()?
Please explain, sir.
Hi,
We use the FeatureUnion class for the same. Please refer to slide# 276 of 401 for more details.
Thanks.
-- Rajtilak Bhattacharjee
No.
On slide# 276 of 401, the FeatureUnion class is used to union the custom transforms using a pipeline.
But I am asking how to union "housing_cat_1hot" and "housing_tr", which were obtained without doing a custom transform and without using a pipeline.
So how do we union these two without using a pipeline?
Hi,
The ColumnTransformer class is an alternative you might want to have a look at. You will find the details here:
https://scikit-learn.org/st...
Thanks.
-- Rajtilak Bhattacharjee
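A minimal sketch of that alternative, assuming the standard housing column names; ColumnTransformer imputes and scales the numeric columns, one-hot encodes ocean_proximity, and joins the results in a single step, so no separate FeatureUnion is needed:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

housing = pd.read_csv("housing.csv")

num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Each sub-transformer is applied to its own columns and the results are
# concatenated side by side.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)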
Hi,
I tried using ColumnTransformer by passing the last column as (-1), but after that, when I split h (i.e. housing) into training and test sets and print test_set.head(), I get this error:
AttributeError: 'numpy.ndarray' object has no attribute 'head'
h.head() prints tables as in pandas, but after applying the ColumnTransformer, h is no longer a pandas DataFrame? I am not getting why this error occurs.
Please help me out, sir!
Hi,
You can check the following articles to know more about how to solve this error that you are getting:
https://stackoverflow.com/q... https://stackoverflow.com/q...
Is there any specific reason you are not using the Pipeline class as shown in the tutorial?
Thanks.
-- Rajtilak Bhattacharjee
Thanks.
No, I just wanted to try the other method too.
I am facing an error importing mlxtend.preprocessing for both end_to_end_project_bootcamp and end_to_end_project.
Hi,
Would request you to restart your server using the following method and try once again:
https://discuss.cloudxlab.c...
Thanks.
-- Rajtilak Bhattacharjee
Getting the below error. Do you have any suggestions for this?
Upvote ShareImportError: cannot import name '_NAN_METRICS' from 'sklearn.metrics.pairwise' (C:\Users\csree\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py)
Hi,
Could you please share a screenshot of your code and the error that you are getting?
Thanks.
-- Rajtilak Bhattacharjee
Hi,
I have some queries regarding the recorded sessions on machine learning. Please let me know where I can ask my queries.
Thanks
Hi,
You can write your queries here, or send us an email, or write to us in the discussion forum, and we would be happy to help you with them.
Thanks.
-- Rajtilak Bhattacharjee
I would like to mail all my queries. Can you please tell me the mail address?
Thanks
Hi,
You can mail us through this link; our email ID is given there too:
https://cloudxlab.com/conta...
Thanks.
-- Rajtilak Bhattacharjee
OK, thanks.
Hi CloudxLab,
Please let me know how I can download my Python notebooks (.ipynb files which I have created in my lab) to my local machine.
Thanks
Hi,
You can download your Jupyter notebook by clicking on the "File" menu and selecting the "Download as" option; from there you can choose the format in which you want to download it.
Thanks.
-- Mayank Sharma
In the num_pipeline, the data coming out of fit() and transform() of the 'imputer' step should be a NumPy array, so how is this data passed to 'attribs_adder'?
Upvote ShareHi Mahipal,
We use DataFrameSelector for this purpose.
Thanks.
-- Rajtilak Bhattacharjee
DataFrameSelector is used before this step, for separating the numerical and categorical data, so how does it help in passing the NumPy array to 'attribs_adder'?
Upvote ShareHi Mahipal,
The detailed process is discussed here:
https://github.com/ageron/h...
Hope this helps.
Thanks.
-- Rajtilak Bhattacharjee
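In short: each pipeline step's transform() output becomes the next step's input, and CombinedAttributesAdder indexes columns by position, so it is happy to receive the NumPy array produced by the imputer. A self-contained sketch of the numeric pipeline (class definitions follow the notebook; the indices 3-6 assume the housing dataset's column order):

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Selects the given columns and returns them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Adds combined attributes, indexing the array's columns by position."""
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

housing = pd.read_csv("housing.csv")
num_attribs = list(housing.drop("ocean_proximity", axis=1))

num_pipeline = Pipeline([
    ("selector", DataFrameSelector(num_attribs)),   # DataFrame in, NumPy array out
    ("imputer", SimpleImputer(strategy="median")),  # array in, array out (NaNs filled)
    ("attribs_adder", CombinedAttributesAdder()),   # receives the imputer's array
    ("std_scaler", StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing)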
Hi,
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
What does n_splits signify?
Hi Vernika,
It signifies the number of re-shuffling and splitting iterations.
Thanks.
-- Rajtilak Bhattacharjee
How is CombinedAttributesAdder different from feature engineering?
Hi Anuj,
The CombinedAttributesAdder class is a custom transformer that performs the feature engineering (creating the new combined attributes) as a step inside a pipeline; feature engineering is the general process of creating new features.
Thanks.
-- Rajtilak Bhattacharjee
Please explain this for loop, I am not able to understand it:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
Hi Anubhav,
In this for loop we are dividing the data into training and test sets by index. split.split(housing, housing["income_cat"]) yields one pair of (train_index, test_index) arrays per iteration (only one pair here, since n_splits=1), chosen so that the proportions of the income categories are preserved in both sets. housing.loc[train_index] and housing.loc[test_index] then select those rows into strat_train_set and strat_test_set.
Thanks.
-- Rajtilak Bhattacharjee
Hi Sharathchandran,
The class is deprecated, use the following instead:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
All the best!
Upvote ShareHello Disqus,
Thanks for contacting CloudxLab!
This automatic reply is just to let you know that we received your message and we’ll get back to you with a response as quickly as possible. During business hours (9am-5pm IST, Monday-Friday) we do our best to reply within a few hours. Evenings and weekends may take us a little bit longer.
If you have a general question about using CloudxLab, you’re welcome to browse our below Knowledge Base for walkthroughs of all of our features and answers to frequently asked questions.
- Tech FAQ <https: cloudxlab.com="" faq="" support="">
- General FAQ <https: cloudxlab.com="" faq=""/>
If you have any additional information that you think will help us to assist you, please feel free to reply to this email. We look forward to chatting soon!
Cheers,
Upvote ShareThe CloudxLab Team
Hello Disqus,
Thanks for contacting CloudxLab!
This automatic reply is just to let you know that we received your message and we’ll get back to you with a response as quickly as possible. During business hours (9am-5pm IST, Monday-Friday) we do our best to reply within a few hours. Evenings and weekends may take us a little bit longer.
If you have a general question about using CloudxLab, you’re welcome to browse our below Knowledge Base for walkthroughs of all of our features and answers to frequently asked questions.
- Tech FAQ <https: cloudxlab.com="" faq="" support="">
- General FAQ <https: cloudxlab.com="" faq=""/>
If you have any additional information that you think will help us to assist you, please feel free to reply to this email. We look forward to chatting soon!
Cheers,
Upvote ShareThe CloudxLab Team
For all models you are using the same data for training and prediction. For example:
# Train a model using Decision Tree
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)
# Calculate RMSE in Decision Tree model
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
Here we are using housing_prepared in both places. So why are we not using a different dataset for testing? It is obvious that it will give an accurate result if we predict on data that was already used in training. For testing we should use data that was never used during training.
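The comment above is right that the RMSE computed this way is only a training error. For reference, a sketch of the final test-set evaluation (assuming the notebook's names: strat_test_set, full_pipeline fitted on the training data, and final_model as the chosen, tuned model):

import numpy as np
from sklearn.metrics import mean_squared_error

# Separate features and labels of the held-out stratified test set.
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

# Only transform (never fit) the test data with the pipeline fitted on training data.
X_test_prepared = full_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print(final_rmse)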
Hi Sandeep, we implemented the one-hot encoder for the categorical data, but why didn't we resolve the dummy variable trap issue? Should we consider the dummy variable trap or not in the preprocessing step of any model?
Are fit() and transform() internal methods?
Yes, every transformer in sklearn has these two methods: fit() learns the required parameters from the data, and transform() applies them.
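A small illustration with the imputer (a sketch; the toy values are made up):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0], [100.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X)               # learns the median (3.0) from the data
print(imputer.statistics_)   # [3.]
print(imputer.transform(X))  # the NaN is replaced by the learned median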
Why do we use a decision tree for regression?
I read_csv this housing.csv but there is an error:
'utf-8' codec can't decode byte 0xa4 in position 25: invalid start byte
Can't solve it.