End-to-End Project - Self-contained


End-to-End Machine Learning Project Part-3

Recording of Session

Slides



220 Comments

After converting the text labels to numerical, I am getting zeros at all indices, and it does not match the output shown in the video (Part 3).

Please help!

 

  Upvote    Share

Hi,

Would request you to match your code against the actual code given on our GitHub repository:

https://github.com/cloudxlab/ml/blob/master/machine_learning/end_to_end_project.ipynb

Thanks.

  Upvote    Share

Thanks

  Upvote    Share

Please help with the below error:

 

  Upvote    Share

Hi,

Your screenshot did not get attached. Could you please reattach it and share again?

Thanks.

  Upvote    Share

Can you please explain the concept of training and test data sets?

  Upvote    Share

Hi,

When we want our machine learning model to accomplish a certain task (say, classification of images), the model should have knowledge of the different classes. For example, in dog vs. cat classification, the model should know the attributes of cats and dogs, which attributes distinguish them, etc. So we try to impart this knowledge with the help of a varied set of images of both cats and dogs. This set of images used to impart knowledge to the model is called the training data, since we are using these images to train the model and enable it to distinguish between cat and dog images.

Further, we also assess the trained model: how well has it learnt about cats and dogs, and how well can it use this knowledge to classify an image of a cat or dog which it has never seen before? This is just like tutoring a child on how to solve a problem (for example, teaching him how to solve a mathematical problem) and then conducting an exam to test his knowledge. This is called the testing phase.

We train the model to impart knowledge, and test it to know how well it would perform on unseen data, so that we understand whether we need to train it more, whether it needs more varied data, whether it is memorising or actually learning, etc.

Hope this helps.

Thanks. 

  Upvote    Share


Some days I got irritated, specifically with the theory part, but on other days I loved your videos, specifically the hands-on part. Now I am in love hahahaha... you are just brilliant!

 2  Upvote    Share


Why have we separated the numerical and categorical data, applying the imputer first and then the one-hot encoder?
Can't we specify the columns to which we want to apply the imputer, and similarly for the encoder?

  Upvote    Share

Hi,

Good question. A median imputer cannot be applied to categorical data, and a one-hot encoder is meant only for categorical data, not numerical columns. That is why the two were separated.

Thanks.

  Upvote    Share

Thanks for the reply

However, this is what I tried:

import numpy as np
from sklearn.impute import SimpleImputer

# Median imputation over all columns except the last
# (presumably the categorical ocean_proximity).
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(strat_train_set.iloc[:, 0:-1])
strat_train_set.iloc[:, 0:-1] = imputer.transform(strat_train_set.iloc[:, 0:-1])

 

We could specify indexes like this, right?
I want to know the issue with this approach.

 

  Upvote    Share

Hi,

Please go through the code from our GitHub repository and match against your code to understand the difference:

https://github.com/cloudxlab/ml/blob/master/machine_learning/end_to_end_project.ipynb

Thanks.

  Upvote    Share

Hello,

I have a question. Can you please explain why we are using the "BaseEstimator, TransformerMixin" classes as base classes for the custom classes that we are creating?

Thanks

 1  Upvote    Share

Hi,

When we are creating our custom classes, we generally add BaseEstimator and TransformerMixin as base classes to take advantage of their methods: the former gives us the get_params() and set_params() methods, and the latter gives us the fit_transform() method.

Thanks.

 1  Upvote    Share

It goes over my head whenever I watch! These libraries have not been explained at all in the video.

Demotivating to be honest.

  Upvote    Share

Don't understand how it is working..

  Upvote    Share

Hi,

Could you please tell me which libraries/which part of the video you are unable to understand? I can help you with those.

Thanks.

  Upvote    Share

1. def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

What is the initialisation for?

Why add_bedrooms_per_room? Why not add_bedrooms_per_household?

2. def fit(self, X, y=None)

Is X the same as in X = imputer.transform(housing_num)? What is 'y'?

  Upvote    Share

Hi,

1. This is initializing the class CombinedAttributesAdder. I would suggest you go back to the Python tutorial for more information on classes in Python.

2. add_bedrooms_per_room is a parameter; if it is set to True, we also calculate bedrooms_per_room and return it along with the other combined attributes.

3. Here y is set to None. So y does not have any value; we are fitting the transformer only to X and not to y.

Hope this helps explain your query. Let me know if you need help with any other topic.

Thanks.

  Upvote    Share

It is somewhat clear now, but what was the purpose of having the condition add_bedrooms_per_room=True when it is a necessary quantity, as in, there is enough correlation?

I used this for the pipeline, and it is working with no issue:

 

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:,  household_ix]
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household,bedrooms_per_room]

 1  Upvote    Share

Where can i find the code snippet for CombinedAttributesAdder() ?

  Upvote    Share

Hi,

You can find all the code referenced in this course in our GitHub repository. For this class, you can find its code at the link below:

ml/end_to_end_project.ipynb at master · cloudxlab/ml (github.com)

Thanks.

  Upvote    Share

Hi,

Can someone explain the code for the user-defined CombinedAttributesAdder() and DataFrameSelector()?

 

What are BaseEstimator and TransformerMixin?

How is it all working?

 

Thanks.

 1  Upvote    Share

Hi,

Good question!

CombinedAttributesAdder() is a custom transformer that adds new attributes derived by combining existing ones (such as rooms_per_household and population_per_household).

When we are creating our custom classes (i.e. transformer, estimator), we can add BaseEstimator and TransformerMixin as base classes. The former one gives us get_params() and set_params() methods and the latter gives us fit_transform() method for free.

Thanks.

  Upvote    Share

Hi. 

One question - 

housing.plot(kind = 'scatter', x = 'longitude', y = 'latitude', alpha = 0.4, s = housing.population/100, label = 'population', 
            figsize = (10,7), c = 'median_house_value', cmap = plt.get_cmap('jet'), colorbar = True, sharex = False)
plt.legend()

 

Please explain this code  - 

s = housing.population/100

(a) - What is s?

(b) - why are we dividing by 100?

  Upvote    Share

's' means size.

s = housing.population/100 implies that the size of the points plotted in the scatter plot should vary with the corresponding population. We divide by 100 to reduce the size of the plotted points, since population is a huge numeric value. Try it with and without dividing by 100.

 1  Upvote    Share

Thanks a lot Abhinav. Really appreciate your help on this. 

 1  Upvote    Share
# Label those above 5 as 5
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)

Can you help me understand how the above piece of code labels housing["income_cat"] values above 5 as 5?

Thanks.

  Upvote    Share

where() works as follows: wherever the condition (the first argument) is False, i.e. wherever income_cat < 5 is False, the value is replaced with the value mentioned in the second argument, 5.0 in this case.

inplace=True makes the changes in the data frame permanent.

 1  Upvote    Share

CombinedAttributesAdder() is a user-defined class, right?

  Upvote    Share

Where can I get the CategoricalEncoder code, and how can I use it? Any help would be appreciated.

  Upvote    Share

Hi,

Please refer to our GitHub repository for the complete code, the link to which is given below:

cloudxlab/ml: Machine Learning Projects and Learning Content (github.com)

Thanks.

  Upvote    Share

Thanks, I got it.

  Upvote    Share

Hi,

housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

After dropping "median_house_value" from the dataframe, how can we copy it?

FYI, I've gone through https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html and w3schools and other links, but they didn't help much to clarify the doubt.

Thank you.

  Upvote    Share

Hi,

Here, by using drop, we are storing all the columns - except median_house_value - as a new dataframe housing. strat_train_set is not getting modified; it still has all the columns, including median_house_value. Then, in the next line, we are storing the strat_train_set["median_house_value"] values in a new variable housing_labels.

So basically, we are not disturbing the original dataframe strat_train_set; we are just storing some columns in one variable and median_house_value in another variable.

Note that we would modify the original dataframe only if we set its parameter inplace=True. Since this is False by default, no changes happen to that dataframe. Hope this helps.

Thanks.

 1  Upvote    Share

Hi,

The code for plotting coordinates on a background image/map is not in the slides?

Thanks.

  Upvote    Share

Hi,

Could you please tell me which part of the lecture video you are referring to? If this is something related to creating charts, you can use Matplotlib for the same. If you want to know about Matplotlib, you can try out our free Intro to Matplotlib project.

Thanks.

  Upvote    Share

Time stamp: 00:18:10

  Upvote    Share

Hi,

The slides are for presentational purposes only. Please check our GitHub repository for the codes:

ml/end_to_end_project.ipynb at master · cloudxlab/ml (github.com)

Thanks.

  Upvote    Share

Shouldn't it have been "Test set generated using Random sampling has income category proportion is quite skewed", instead of: 

 ?

Thank you.

 

  Upvote    Share

Hi,

We didn't get your question and the second image is not visible here. Can you please look into this?

Thanks.

  Upvote    Share

Please ignore the 2nd image. I was trying to point out the statement "Test set generated using stratified sampling has income category proportion is quite skewed". Is it correct ?

Shouldn't it have been "Test set generated using Random sampling has income category proportion is quite skewed" ?

Thank you.

  Upvote    Share

for set_ in (strat_train_set, strat_test_set):
    set_.drop('income_cat', axis=1, inplace=True)

 

How does the loop iterate in this case?

Sorry, i know this might be a silly question, but I don't know why I am struggling to grasp this. :|

 

Thanks.

  Upvote    Share

Hi,

Could you tell me in which slide this code is referred to?

Thanks.

  Upvote    Share

Hi,

SLIDE: 144

Thanks

  Upvote    Share

Hi,

Got it. I got confused with the underscore. Here set_ takes each dataset, both strat_train_set and strat_test_set, one at a time, and then removes the column income_cat from it using the drop() function.

Thanks.

 1  Upvote    Share

Slide 89:

median age (50) and median house value (500,000) are capped, due to which the ML algorithm may learn that the price never goes beyond that limit.

Silly question, but how is the data getting capped here? And there must be situations where the price genuinely does not exceed some point; what about that case?

  Upvote    Share

Hi,

The data was capped while it was being recorded; this is a property of this dataset. Yes, there can definitely be a situation where the price does not exceed a certain value. However, if we notice an abnormal number of instances at the last value, it would almost surely mean that the value is capped.

Thanks.

  Upvote    Share


For some strange reason I am not able to upload screenshots, and thus for this query I am including the code snippet below with the query.

########################################################################

from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs = pd.DataFrame(housing_extra_attribs, columns=list(housing.columns)+["rooms_per_household", "population_per_household"])
housing_extra_attribs.head()

##############################################################################

My question is:

Can you please explain why we used the code for "bedrooms_per_room" when we just wanted to add only 2 new columns, "rooms_per_household" and "population_per_household"?

 if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

Thanks

 

  Upvote    Share

Hi,

I appreciate your patience. This code helps in adding 3 features: rooms per household, population per household, and bedrooms per room. However, the last one is added only if you request it while constructing the transformer. This is a small transformer class that adds the combined attributes. The add_bedrooms_per_room hyperparameter allows you to easily find out whether adding this attribute helps the Machine Learning algorithms or not.

Thanks.

  Upvote    Share

Thanks a lot. Just one query pending for today. I guess I have bothered you too much for the day :)

#############################################

housing.info() shows that there are 206 missing values in 'total_bedrooms'. But when I execute housing.isnull().sum(), I get 158 missing values in the column "total_bedrooms". Please explain this discrepancy.

  Upvote    Share

Hi,

Apologies for the late reply. Is this caused by the axis=0 vs. axis=1 issue? Could you please check and let me know?

Thanks.

  Upvote    Share

Hi,

I did not mention the axis but just used this code 'housing.isnull().sum()'.

Thanks

  Upvote    Share

Hi,

I am not getting any missing values on my end in the dataset. Here is the screenshot of the output from my code:

 

Would request you to re-download the dataset by cloning the repository once again.

Thanks.

  Upvote    Share

Hello,

In the screenshot below, can you please explain why we used the code to ''bedrooms_per_room" when we just wanted to add only 2 new columns as "rooms_per_household, population_per_household"

 if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

Thanks

  Upvote    Share

Hello,

I have a question. During the initial exploration of the 'housing' dataset, we saw that there were missing entries in the column "total_bedrooms 20433 non-null float64". However, when we execute

isn = housing.isnull()
isn.any(axis=1)

the column entries show 'False'. What about the missing entries in "total_bedrooms"? Why don't they show up here?

Thanks

  Upvote    Share

Hi,

Please go through the below discussion for more details on this:

https://stackoverflow.com/questions/22149584/what-does-axis-in-pandas-mean

Thanks.

  Upvote    Share

Hi,

I have gone through the details in the link. Still, I have 2 main questions:

1) Why does this code (isn = housing.isnull() followed by isn.any(axis=1)) not return 'True' for the missing values in the column 'total_bedrooms'?

2) housing.info() shows that there are 206 missing values in 'total_bedrooms'. But when I execute housing.isnull().sum(), I get 158 missing values in the column "total_bedrooms". Please explain this discrepancy.

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        158
population              0
households              0
median_income           0
ocean_proximity         0
dtype: int64

  Upvote    Share

Hi,

Try it with axis=0 and let me know the results.

Thanks.

  Upvote    Share

Hi,

With axis=0, we did get the column "total_bedrooms" as False. But doesn't axis=0 represent rows and axis=1 represent columns? So why did we use axis=1?

Also, point 2 from my previous message still needs clarification, i.e. housing.info() shows that there are 206 missing values in 'total_bedrooms', but when I execute housing.isnull().sum() I get 158 missing values in the column "total_bedrooms". Please explain this discrepancy.

 

 

  Upvote    Share

Hi,

I referred you to the link to understand this. Please go through it once again.

Thanks.

  Upvote    Share

I have gone through it and am unable to understand, and this is the reason I am asking several times. Hope you and/or Sandeep can help me with an explanation. It does get difficult, especially when the Q&A is not live and the sessions are recorded.

Also my second part of query is  still awaiting clarification .i.e. 

housing.info() shows that there are 206 missing values in 'total_bedrooms'. But when I execute housing.isnull().sum(), I get 158 missing values in the column "total_bedrooms". Please explain this discrepancy.

Thanks for understanding and helping out!

  Upvote    Share

Hi,

Let me explain. The axis switch specifies the axis along which the values are aggregated. axis=0 aggregates along the rows, i.e. it considers all rows for each column and produces one result per column; axis=1 aggregates along the columns, producing one result per row. This is why isnull().any(axis=1) tells you which rows contain a missing value, while isnull().any(axis=0) tells you which columns do. I know it can be a bit confusing; in brief, axis=0 is said to be "column-wise" (and axis=1 "row-wise").

As for your second query, I am still awaiting the screenshot from your end.

Thanks.

 

  Upvote    Share

Got it, Rajtilak. This was really confusing!

The second part of my initial query was this:

housing.info() shows that there are 206 missing values in 'total_bedrooms'. But when I execute housing.isnull().sum(), I get 158 missing values in the column "total_bedrooms". Please explain this difference too. Thanks!!

***********************************************************************

The screenshot is a different query, which I am facing difficulty in uploading.

  Upvote    Share

Hi,

Could you please tell me how you are trying to attach the screenshots? You need to click on the image button, click on the Upload tab, select the image by clicking on the Choose File button, click on Send it to server and that should do the trick.

Thanks.

  Upvote    Share

Uploaded it. Thanks for the quick help.

  Upvote    Share


Hi,

Please use the following path:

../ml/machine_learning/images/end_to_end_project/california.png

Thanks.

  Upvote    Share

It worked. Thanks

  Upvote    Share

Good work!

  Upvote    Share


For some strange reason there has been a constant issue with uploading screenshots. Before posting my comment I can see the preview of the screenshot, but once I have submitted the comment, the screenshot does not appear. Trying once again -

  Upvote    Share

Hi,

Please mail us the screenshot with details of the issue you are facing and the name of the topic.

Thanks.

  Upvote    Share

Sent the image; managed it this time.

  Upvote    Share

Hi,

I am getting the following error on running this code. Please advise, as the image file exists at this location:

"https://jupyter.f.cloudxlab.com/user/shashwatv8093/tree/ml/machine_learning/images/end_to_end_project"

import matplotlib.image as mpimg
california_img=mpimg.imread('images/end_to_end_project/california.png')

ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
                       s=housing['population']/100, label="Population",
                       c="median_house_value", cmap=plt.get_cmap("jet"),
                       colorbar=False, alpha=0.4,
                      )
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5)
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
cbar = plt.colorbar()
cbar.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cbar.set_label('Median House Value', fontsize=16)

plt.legend(fontsize=16)
plt.show()

**********************************************************************************************************

FileNotFoundError: [Errno 2] No such file or directory: 'images/end_to_end_project/california.png'

 

  Upvote    Share

Attached screenshot 

  Upvote    Share

Hi,

Please try the following path:

ml/machine_learning/images/end_to_end_project/california.png

or

/ml/machine_learning/images/end_to_end_project/california.png

The path is relative to the location of the Jupyter notebook. Also note that the path you provided in your comment is a URL and not an actual path; it will throw an error if you use that.

Thanks.

  Upvote    Share

I have tried your recommendation but am still getting the same error:

FileNotFoundError: [Errno 2] No such file or directory: 'ml/machine_learning/images/end_to_end_project/california.png'

 

  Upvote    Share

Hi,

Please share a screenshot of the location where the file is stored.

Thanks.

  Upvote    Share

Please see the location

  Upvote    Share

I am unable to post screenshots. I take the screenshot and it gets uploaded too, but once I post the comment it disappears. Could you please let me know why?

Anyway, I am sending you the location of the file 'california.png':

https://jupyter.f.cloudxlab.com/user/shashwatv8093/tree/ml/machine_learning/images/end_to_end_project

Topic name - End-to-End Machine Learning Project Part-3

Issue - Unable to plot

california_img=mpimg.imread('ml/machine_learning/images/end_to_end_project/california.png')
FileNotFoundError: [Errno 2] No such file or directory: 'ml/machine_learning/images/end_to_end_project/california.png'

******************************************************************************

I did try the print screen + snipping tool option for uploading the screenshot, but it is not working. Any workaround for the same, please?

Thanks

 1  Upvote    Share

Hello sir, can you please suggest how we can give the correct path for this housing.csv file, as I have already downloaded it to my local machine.

HOUSING_PATH = '/home/Downloads/housing.csv'

But it still gives the error 'File is not found'.

  Upvote    Share

Hi,

Please share a screenshot of the location where the file is stored in the lab, and the location of your Jupyter notebook in which you are working.

Thanks.

  Upvote    Share

 

  Upvote    Share

Hi,

Please fix the indentation of compare_props; right now it shows as being inside the function. You need to ensure that it is outside the function.

Thanks.

  Upvote    Share

Hi,

Can you please check why the below error is occurring?

  Upvote    Share

How can we implement splitting with an identifier along with StratifiedShuffleSplit?

Say we have a dataset which contains data from 200 scanners. The data is being updated regularly, and in the train and test sets we want the proportion of the scanners to be similar. How can we do that?

Please help and suggest some reading material.

 1  Upvote    Share

Hi,

Are you referring to an imbalanced dataset? If yes, then you have a number of options you can use to deal with imbalanced datasets:

1. Random Under-Sampling
2. Random Over-Sampling
3. Cluster-Based Over-Sampling
4. Synthetic Minority Over-sampling Technique (SMOTE) for imbalanced data
5. Modified Synthetic Minority Over-sampling Technique (MSMOTE) for imbalanced data
6. Bagging-based techniques for imbalanced data

There are several other methods that you can read about if you search on Google.

Thanks.

  Upvote    Share

Thank you very much for the quick response.

In fact, yes, I was referring to an imbalanced dataset, like the ones we have to deal with for anomaly detection.

I went through some of the mentioned sampling methods, but they primarily deal with imbalanced data which doesn't have unique timestamps.

So how do we deal with the data if it is a highly imbalanced time series?

Thank you again.

 1  Upvote    Share

Hi,

Interesting question! You can go through the below link to find more details on imbalanced time series data:

https://datascience.stackexchange.com/questions/28200/when-should-you-balance-a-time-series-dataset

And then there is also this interesting research paper:

https://www.nrso.ntua.gr/geyannis/wp-content/uploads/geyannis-pc327.pdf

Thanks.

  Upvote    Share


While I was looking for solutions I found this:

 https://cran.r-project.org/web/packages/OSTSC/vignettes/Over_Sampling_for_Time_Series_Classification.pdf

I think it may be of help to others who might want to refer to it.

 1  Upvote    Share

Hi team,

Can anybody explain what n_splits=1 means in the below code?

# stratified sampling using scikit-learn's StratifiedShuffleSplit class
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing['income_cat']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

  Upvote    Share

Hi,

n_splits constitutes the number of re-shuffling & splitting iterations.

Thanks.

 1  Upvote    Share

Hello,

We perform the train-test split initially in order to:

  • avoid data leakage
  • create a bias-free test set

Am I correct?

  Upvote    Share

Hi,

Yeah, so basically, after training our model with the train set, we test it in order to gain confidence about its performance on unseen data, which is the test data. Splitting it initially makes sure that our test data remains unseen.

Thanks.

  Upvote    Share

Do longitude and latitude have any key significance? In this example, we are using longitude and latitude for correlation.

  Upvote    Share

Hi,

There is no one-rule-fits-all concept in Machine Learning. So even though here we are using them only for correlation, another project might have some other use for them. For example, you may want to find out the most crowded area, or the area of the highest number of tax payers, in these cases you will have to use latitude and longitude data.

Thanks.

  Upvote    Share

hello Sir,

please explain.

What could we infer from low median income and high median house value, like 2 income on the x axis and a 500,000 house value on the y axis? Similarly, for example, at a median income of 4 we get houses ranging from 100k to 500k. Does this depend on proximity to the sea or a popular area?

Reference: plot at lecture 36:30, slide 183.

thanks

  Upvote    Share

Hi,

The median, like the mean, is a measure of central tendency. So this means those areas have median incomes as given.

Thanks.

  Upvote    Share

from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs = pd.DataFrame(housing_extra_attribs, columns=list(housing.columns)+["rooms_per_household", "population_per_household"])
housing_extra_attribs.head()

 

In the above class, can you please explain why we write this:

rooms_per_household = X[:, rooms_ix] / X[:, household_ix]

Why not simply write:

rooms_per_household = X[rooms_ix] / X[household_ix]

  Upvote    Share

Hi,

X[:, rooms_ix] selects the column at index rooms_ix for all rows of the 2D array X, whereas X[rooms_ix] would select the single row at index rooms_ix. You could check the difference between the two by printing what they yield. Hope this helps.

Thanks.

  Upvote    Share


I am getting a file-not-found error. What do I have to do to get that data file into my library?

  Upvote    Share

Hi,

Could you please elaborate a little more about your issue? Which file you are trying to access. Please share a screenshot of your code and the error that you are getting.

Thanks.

  Upvote    Share

wine.plot(kind="scatter", x="longitude", y="latitude", alph=0.2)

 

Getting a KeyError: 'longitude' exception.

Please help.

  Upvote    Share

Hi,

It is unable to find a column named longitude in your dataframe; please check that your dataset actually has longitude and latitude columns. Also note that the parameter name is alpha, not alph. Please see the correct syntax below:

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot

Thanks.

  Upvote    Share

Please explain how to add the one-hot-encoded dataframe back into the original dataframe.

 

  Upvote    Share

Hi,

Would request you to go through the lecture video, the slides, and the Jupyter notebook from our GitHub repository. The concept has been explained in detail there.

Thanks.

  Upvote    Share

File not found for california.png

  Upvote    Share

Hi,

Please share a screenshot of your code and the error that you are getting.

Thanks.

  Upvote    Share

Hi,

I got this solved. Earlier, when I opened the lab, the End-to-End project was under "Cloudxlab_jupyter_notebooks", and from there I was not able to reach the ml folders.

Later I figured out that the end-to-end project is outside, under the ml/ directory.

  Upvote    Share


Sir, can you please explain CombinedAttributesAdder()?

  Upvote    Share

Hi,

This is a custom transformer that adds combined attributes.

Thanks.

  Upvote    Share

Hi,

I am getting invalid username and password

 

  Upvote    Share

Hi,

I have reset your password. Please type in your new password instead of copy pasting it.

Thanks.

  Upvote    Share

Hello,

It seems 'from sklearn.preprocessing import Imputer' has been deprecated. This gives the error: cannot import name 'Imputer'.

The following was successful: from sklearn.impute import SimpleImputer. Can you please confirm if SimpleImputer needs to be used now, instead of Imputer?

Thanks,

Kartik

 2  Upvote    Share

Hi,

You are absolutely correct!

As mentioned in the slides 215 and 216, we need to use SimpleImputer now.

Thanks.

 1  Upvote    Share

How will we know when to do feature scaling, and how do we verify it in the dataset? For example, the total number of rooms ranges widely. Is there any particular way to find such huge-range columns in a dataset?

  Upvote    Share

Also, how is it going to help if we do scaling? How is it related to the other columns?

And if we do standardization instead of min-max, the values are not bounded by 0 and 1, so the features are again scaled widely, right? Mainly, how will feature scaling help if the other columns are not scaled accordingly?

 

  Upvote    Share

Hi,

Feature scaling, or standardization, is a data pre-processing step applied to the independent variables or features of the data. It basically helps to normalize the data within a particular range. Sometimes it also helps in speeding up the calculations in an algorithm.

The min-max scaler is an estimator that scales and translates each feature individually such that it lies in a given range on the training set, e.g. between zero and one.

For detailed discussion, I would suggest you to go through the lecture videos once again.

Thanks.

  Upvote    Share

Hi sir, is there any method through which we can know when to apply normalization and when to use standardization? What is the general rule to follow?

  Upvote    Share

Instead of round, why are we using ceil to get discrete categories?

  Upvote    Share

Hi,

Good question!

Here is a link which discussed in-depth about the difference between round and ceil function in Python:

https://blog.tecladocode.com/rounding-in-python/

Thanks.

  Upvote    Share

One quick question: why do we need to divide by exactly 1.5 to limit the number of strata?

housing["income_cat"] = np.ceil(housing["median_income"]/1.5)

  Upvote    Share

Hi,

Good question!

As mentioned in the slides, we are dividing by 1.5 so that we will not have too many strata and each stratum will be large enough.

Thanks.

  Upvote    Share

Thanks for the quick reply. Why exactly 1.5 is what I am looking for; is there any criterion for arriving at this number?

  Upvote    Share

Hi,

So this is basically like creating a histogram. There are a few ways in which you can calculate the bins of a histogram:

1. Count the number of data points.
2. Calculate the number of bins by taking the square root of the number of data points and rounding up.
3. Calculate the bin width by dividing the specification tolerance or range (USL-LSL or Max-Min value) by the number of bins.

Thanks.

  Upvote    Share

Thank you

  Upvote    Share

Sir,

The code for CategoricalEncoder is generating an error after being copied from the repository.

Please help.


File "<ipython-input-5-a83bca1aa9f7>", line 178
    return outf not np.all(valid_mask):
                     ^
SyntaxError: invalid syntax
  Upvote    Share

Hi,

I just checked the end_to_end_project.ipynb file from my end and it is running fine. Could you please tell me which file you are trying to run, and what kernel you are using?

Thanks.

  Upvote    Share

# Just run this cell, or copy it to your code, do not try to understand it (yet).
# Definition of the CategoricalEncoder class, copied from PR #9151.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse

class CategoricalEncoder(BaseEstimator, TransformerMixin):

----------------

----------------

else:
return outf not np.all(valid_mask):
                if self.handle_unknown == 'error':

                    diff = np.unique(X[~valid_mask, i])
                    msg = ("Found unknown categories {0} in column {1}"
                           " during transform".format(diff, i))
                    raise ValueError(msg)
                else:

1. I am trying to run the CategoricalEncoder code from the cell in end_to_end_project.ipynb.

2. The code is giving the error in the file itself (file name: end_to_end_project.ipynb).

3. I am running it in the Python 3 kernel.

 

  Upvote    Share

Thanks for your efforts, sir.

My code started working.

Just a reminder that the code for CategoricalEncoder in the file end_to_end_project.ipynb is incomplete; the complete code is in the end_to_end_project_bootcamp.ipynb file.

  Upvote    Share

Hi,

Could you please point out which part is missing?

Thanks.

 1  Upvote    Share

1.

In the end_to_end_project_bootcamp.ipynb file, the ending code is as follows:

 if self.encoding == 'onehot-dense':
            return out.toarray()
        else:
            return out

 

2.

But in the end_to_end_project.ipynb file, the ending code is as follows:

if self.encoding == 'onehot-dense':
            return out.toarray()
        else:
            return outf not np.all(valid_mask):
                if self.handle_unknown == 'error':
                    diff = np.unique(X[~valid_mask, i])
                    msg = ("Found unknown categories {0} in column {1}"
                           " during transform".format(diff, i))
                    raise ValueError(msg)
                else:
                    # Set the problematic rows to an acceptable value and

3.

Due to the variation in the ending code, it was producing this error:

File "<ipython-input-5-a83bca1aa9f7>", line 178
    return outf not np.all(valid_mask):
                     ^
SyntaxError: invalid syntax
  Upvote    Share

Hi,

So when I tried the end_to_end_project.ipynb file on my end, it worked fine. Would request you to get the latest version of the file from our Git repository and run it again.

Thanks.

  Upvote    Share


After completing the video, the status is showing as incomplete. Why is that?

  Upvote    Share

After clicking the "Mark Completed" button, is it still showing as incomplete?

  Upvote    Share

Hi,

I am unable to import the Imputer function from the sklearn.preprocessing library.

What should I do?

Thanks In advance!!

  Upvote    Share

Hi,

Please go through the below discussion:

https://discuss.cloudxlab.com/t/solved-cannot-import-imputer/4052

Thanks.

  Upvote    Share

Latitudes less than 38 are more populated.

 

 

  Upvote    Share

Hi,

Are you facing any challenge with this?

Thanks.

  Upvote    Share


Hi,

I'm unable to clone the repository as I am unable to log in to GitHub.

git clone https://github.com/cloudxlab/ml.git

Also, I tried to do the same through the terminal but could not locate the ml directory.

Can you please help ?

 

  Upvote    Share

Hi,

Try the following command on a web console:

git clone https://github.com/cloudxlab/ml ~/ml

If this does not work, please share a screenshot.

Thanks.

  Upvote    Share

Hi Rajtilak,

I tried the above-mentioned command and it worked. Thanks for your help!

 

  Upvote    Share

Doubts in the Housing project:

1. What is n_splits in StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)? What happens if I use the default value of 10?

2. Is it necessary to pass the label based on which the splitting occurs (in the case of housing, it was median income)?

3. Can we split data based on more than one column?

  Upvote    Share

Hi,

1. n_splits is the number of re-shuffling & splitting iterations. It will re-shuffle and split the dataset 10 times in case you pass on the default value of 10.

2. In case of the split() function, it is mandatory to pass the X and y values. You can check more about it here:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html

3. I think what you are referring to is nested StratifiedShuffleSplit, you can find more about it from the below links:

https://stackoverflow.com/questions/40400351/nested-cross-validation-with-stratifiedshufflesplit-in-sklearn

https://stackoverflow.com/questions/45516424/sklearn-train-test-split-on-pandas-stratify-by-multiple-columns

Thanks.

  Upvote    Share

class cars():
    def __init__(self,modelname,yearm,price):
        self.modelname = modelname
        self.yearm = yearm
        self.price = price
    def price_inc(self):
        self.price = (self.price*1.15)
class SuperCars(cars):
    def __init__(self,modelname,yearm,price):
        super.__init__(modelname,yearm,price)
        self.cc=cc


honda = SuperCars('City',2019,1000000)
honda.cc=1500
honda.price_inc()
print(honda.price)

output :-

TypeError                                 Traceback (most recent call last)
<ipython-input-6-944ff83e2317> in <module>
     12 
     13 
---> 14 honda = SuperCars('City',2019,1000000)
     15 honda.cc=1500
     16 honda.price_inc()

<ipython-input-6-944ff83e2317> in __init__(self, modelname, yearm, price)
      8 class SuperCars(cars):
      9     def __init__(self,modelname,yearm,price):
---> 10         super.__init__(modelname,yearm,price)
     11         self.cc=cc
     12 

TypeError: descriptor '__init__' requires a 'super' object but received a 'str'

What is wrong with my code?

  Upvote    Share

Hi,

Could you please tell me this is a part of which assessment?

Thanks.

  Upvote    Share

No sir, I tried it on my own, but it is not working.

That's why I want to know what is wrong with this code.

  Upvote    Share

Hi,

Would request you to post this on our discussion forum at the below link:

https://discuss.cloudxlab.com/

Thanks.

  Upvote    Share

When we use cross_val_score with cv=10, which trained model is finally selected for evaluation on the test data - the one which showed the minimum RMSE, or the last one?
Can we manually choose one of the 10 trained models?

  Upvote    Share

Hi,

cross_val_score is used only for evaluation: it trains one model per fold and returns their scores, but it does not keep any of those models. The final model is typically retrained on the full training set afterwards. If you want access to the per-fold estimators, cross_validate with return_estimator=True will return them.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Please send me the link to the Jupyter notebook where sir is teaching the code.

  Upvote    Share

Hi,

Please find below the link to our GitHub repository, where all the Jupyter notebooks from this course are hosted:

https://github.com/cloudxlab/ml

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Hi Team,
In the video, the tutor said random search is better as it takes random values for the hyperparameters, compared to grid search. I see the only benefit is less computation for the CPU, as few random values are taken. But we cannot guarantee its result will be 100 percent correct compared to grid search. Also, we should not compare grid with random search when we have high variance, because grid search cannot be used in the first place, according to what I understood from the lecture.
So, shall I conclude that if we are lucky enough then random search will give the correct set of hyperparameters, and otherwise there is no guarantee?

Thanks

  Upvote    Share

Hi,

While it's possible that RandomizedSearchCV will not find as accurate a result as GridSearchCV, it surprisingly picks the best result more often than not, and in a *fraction* of the time GridSearchCV would have taken. Given the same resources, randomized search can even outperform grid search. Also, "less computation" is one of the key benefits and often acts as a deciding factor for ML/DL models.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Hi,
While executing the below command, I am getting an AxisError:

housing_prepared = full_pipeline.fit_transform(housing)

AxisError: axis -1 is out of bounds for array of dimension 0

My code Housing_1 is in below link,
https://jupyter.e.cloudxlab...

  Upvote    Share

Hi,

Would request you to recheck your code. If you are stuck somewhere, you can take a hint or look at the answer.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Hi Team,

I am getting a 'No space left on device' error. Please suggest how to fix this. I don't have anything stored apart from my work, which is hardly 2 MB.

Thankyou

  Upvote    Share

We have fixed these issues. Let us know if it persists.

-- Praveen Pavithran

  Upvote    Share

Not able to access the Jupyter notebook; a 503 Service Unavailable error is coming. Can you please help?

  Upvote    Share

Hi,

Would request you to follow all the steps from the link given below:
https://discuss.cloudxlab.c...
Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

How can I access the housing median project code which is discussed in class?

  Upvote    Share

Hi,

You can find all the code in our GitHub repository. The URL for the same is given below:

https://github.com/cloudxlab/ml

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

I am having the following error. Please resolve.

  Upvote    Share

Hi @sandeepgiri:disqus sir,

Please help me in clearing the below doubts.

1. What is an estimator in machine learning?
2. Why are we extending the BaseEstimator and TransformerMixin classes while creating our own classes like CombinedAttributesAdder and CategoricalEncoder?
3. What are the BaseEstimator and TransformerMixin classes?
4. Is there any other way to implement the above classes? Meaning, can't we achieve the functionality of the above classes by just creating a user-defined function? The CombinedAttributesAdder class functionality can be achieved using a function, as per my understanding.
5. Sir, when you introduced these classes in your session, several things after that were not clear to me. Please suggest some reading material which can help me in understanding these concepts.

  Upvote    Share

These questions are answered in the latter part of the course. You can use these here for now.

-- Praveen Pavithran

  Upvote    Share

utkarshtrivedi1403@gmail.com
I have completed Topic 3 but it is still showing 98%. Kindly rectify it.

  Upvote    Share

Getting an error while trying the imputer:

ImportError: cannot import name 'SimpleImputer'

Code tried:
from sklearn.preprocessing import SimpleImputer
imputer = SimpleImputer(strategy='median')

This is as per the end-to-end notebook code. Please help.

  Upvote    Share

Hi,

Please change the following line:

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')

We have updated our notebooks, would request you to pull the latest set from our GitHub repository.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Hi,

Without using custom transformer pipelines - if we choose to use the scikit-learn methods step by step, as done in the previous video - how do we union the numerical and categorical columns back into housing (code, please)?

Also, in the very last step, when using the stratified test data to predict on the test set, if we don't want to use a pipeline, what do we use instead of full_pipeline.transform()?

Please explain, sir.

  Upvote    Share

Hi,

We use the FeatureUnion class for the same. Please refer to slide# 276 of 401 for more details.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

No, on slide# 276 of 401 the FeatureUnion class is used to union the custom transforms using a pipeline. But I am asking how to union "housing_cat_1hot" and "housing_tr", which were obtained without doing the custom transform and without using pipelining.

So how do we union these two without using pipelining?

  Upvote    Share

Hi,

ColumnTransformer class is an alternative you might want to have a look at. You would find the details here:
https://scikit-learn.org/st...
Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Hi,

I tried doing ColumnTransformer by passing the last column as (-1), but after that, when I split h (i.e. housing) into training and test sets and print test_set.head(), I get the error:

AttributeError: 'numpy.ndarray' object has no attribute 'head'

h.head() printed tables as in pandas, but after applying ColumnTransformer, h is no longer a pandas DataFrame? I am not understanding the error.

Please help me out, sir!

  Upvote    Share

Hi,

You can check the following articles to know more about how to solve this error that you are getting:
https://stackoverflow.com/q... https://stackoverflow.com/q...
Is there any specific reason you are not using the Pipeline class as shown in the tutorial?

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Thanks. No, I just wanted to try the other method too.

  Upvote    Share

I am facing an error importing mlxtend.preprocessing for both end_to_end_project_bootcamp and end_to_end_project

  Upvote    Share

Hi,

Would request you to restart your server using the following method and try once again:
https://discuss.cloudxlab.c...
Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Getting below error. Do you have any suggestion for this?
ImportError: cannot import name '_NAN_METRICS' from 'sklearn.metrics.pairwise' (C:\Users\csree\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py)

  Upvote    Share

Hi,

Could you please share a screenshot of your code and the error that you are getting.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Hi
I have some queries regarding the recorded sessions on machine learning. Please let me know where I can ask my queries.
Thanks

  Upvote    Share

Hi,

You can write your queries here, or send us an email, or write to us in the discussion forum, and we would be happy to help you with them.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

I would like to mail all my queries. Can you please tell me the mail address?
Thanks

  Upvote    Share

Hi,

You can mail us through the link below; our email id is given there too:
https://cloudxlab.com/conta...

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Ok Thanks

  Upvote    Share

Hi CloudXlabs,
Please let me know how I can download my Python notebooks (.ipynb files, which I have created in my lab) to my local machine.

Thanks

  Upvote    Share

Hi,
You can download your Jupyter notebook by clicking on the "File" menu and selecting the "Download as" option; from there you can select the format in which you want to download it.

Thanks.

-- Mayank Sharma

  Upvote    Share

In num_pipeline, the data coming out of the imputer's fit() & transform() should be a NumPy array; how is this data then passed to 'attrib_adder'?

  Upvote    Share

Hi Mahipal,

We use DataFrameSelector for this purpose.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

DataFrameSelector is used before this step, for separating the numerical and categorical data; so how does it help in passing a NumPy array to 'attrib_adder'?

  Upvote    Share

Hi Mahipal,

The detailed process is discussed here:

https://github.com/ageron/h...

Hope this helps.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Hi,

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
what does n_splits signify?

  Upvote    Share

Hi Vernika,

It signifies the number of re-shuffling and splitting iterations.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

How is CombinedAttributesAdder different from feature engineering?

  Upvote    Share

Hi Anuj,

The CombinedAttributesAdder class is a transformer used inside a pipeline; feature engineering is the general process of creating new features, which is exactly what this transformer does.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Please explain this for loop; I am not able to understand it:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

  Upvote    Share


Hi Anubhav,

In this for loop, we are dividing the data into stratified train and test sets by index. split.split(housing, housing["income_cat"]) yields the train and test index arrays, stratified by the income category; we then use those indexes with housing.loc to set the variables strat_train_set and strat_test_set.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Hi Sharathchandran,

The class is deprecated, use the following instead:

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

All the best!

  Upvote    Share



For all models you are using the same data for training and prediction. For example:

# Train a model using Decision Tree
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

# Calculate RMSE in Decision Tree model
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

Here we are using housing_prepared. So why are we not using a different dataset for testing? It is obvious that it will give accurate results if we predict on a dataset that was already used in training. For testing we should use a dataset that was never used during training.

  Upvote    Share

Hi Sandeep, we implemented the one-hot encoder for categorical data, but why didn't we address the dummy variable trap? Should we consider the dummy variable trap in the preprocessing step of any model?

  Upvote    Share

Are fit() and transform() internal methods?

  Upvote    Share

Yes, every estimator in sklearn has the fit() method, and transformers additionally have transform().

  Upvote    Share

Why do we use a decision tree for regression?

  Upvote    Share

I tried read_csv on this housing.csv, but there is an error:
'utf-8' codec can't decode byte 0xa4 in position 25: invalid start byte
I can't solve it.

  Upvote    Share