End-to-End Project - Self-contained


End-to-End Machine Learning Project Part-2

Recording of Session

Slides



241 Comments

If possible, kindly edit the video recordings with some tool to reduce the repetition and unwanted lag, for the benefit of the learners.


Hi Arun,

Thank you for your feedback! We appreciate you taking the time to share your thoughts. Your input is valuable to us, and we'll make sure to consider it as we continue to improve. If you have any further suggestions or questions, feel free to let us know.


Dear Sir/Mam,

Can you explain once again why we are comparing the hash with 256?

def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

 



Hi Mayank,

This function helps in creating the test set. If test_ratio is 20%, it returns True for approximately 20% of the identifiers. We could have simply generated a random number between 1 and 100 and checked if it is less than 20; that would do the same job.

But here, since we want the selection to depend on the data itself, we do some juggling with the values of an id column. It is like saying: take the id from each row, hash it, pick the last byte of the hash (which will be somewhere between 0 and 255), and check whether it is less than 0.2 * 256.

Since the generated value depends only on the id column, a given row will always land consistently in either the test set or the train set.
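A minimal, self-contained sketch of that behaviour (test_set_check is the function from the lecture; the ids here are made up):

import hashlib
import numpy as np

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    # The last byte of the digest is spread uniformly over 0..255,
    # so roughly test_ratio of all ids fall below 256 * test_ratio.
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

ids = np.arange(20000)
in_test = np.array([test_set_check(i, 0.2) for i in ids])
print(in_test.mean())  # close to 0.2, and identical on every run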


What does %matplotlib inline mean?

Can you explain it in simple terms.


Hi,

It makes Jupyter display matplotlib charts directly below the code cell, immediately after the cell is executed.

Thanks.


 

But even without using it, the matplotlib charts are displayed.


Hi,

Good point. Yes, it would still display the charts; however, you will have to explicitly call plt.show() every time.

Thanks.
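A quick illustration (a minimal sketch, not from the course notebook):

import matplotlib.pyplot as plt

# Without %matplotlib inline, the figure only appears
# once plt.show() is called explicitly.
plt.hist([1, 2, 2, 3, 3, 3], bins=3)
plt.show()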


Hi,

Does scikit-learn's train_test_split function take care of the test data, like the identifier code in the video, to protect the test data?


Hi,

With scikit-learn, you wouldn't need to add any identifier column.

Thanks.


This section of the course is from "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron.

Not a problem where you got it from, though.


Hi Sandeep,

Based on my understanding, I think that to avoid sampling bias we are introducing data snooping bias: we do stratified sampling only after analyzing the data, which is nothing but snooping the data and creating a model that can work precisely on the test data. Isn't that data snooping bias? In such a case, how can we make better test data? Please correct me if my understanding is wrong.


Hi,

Good question. Data snooping bias appears when, while exhaustively searching for combinations of variables, the probability that a result arose by chance grows with the number of combinations tested. You can refer to the comic strip from XKCD below for a visual representation:

https://xkcd.com/882/

I think what you want to refer to is data leakage. Data leakage is when information from outside the training dataset is used to create the model. This makes the model learn something it's not supposed to, and thus leads to an erroneous model.

Thanks.


When I tried to run the below code:

import numpy as np

def split_train_test(data,test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc(train_indices), data.iloc(test_indices)

np.random.seed(42)
train_set, test_set = split_train_test(housing,0.2)
print(len(train_set), "train +", len(test_set), "test")

 

Got below error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-f4e2d78e57f6> in <module>
      9 
     10 np.random.seed(42)
---> 11 train_set, test_set = split_train_test(housing,0.2)
     12 print(len(train_set), "train +", len(test_set), "test")

<ipython-input-19-f4e2d78e57f6> in split_train_test(data, test_ratio)
      6     test_indices = shuffled_indices[:test_set_size]
      7     train_indices = shuffled_indices[test_set_size:]
----> 8     return data.iloc(train_indices), data.iloc(test_indices)
      9 
     10 np.random.seed(42)

/usr/local/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py in __call__(self, axis)
    110 
    111         if axis is not None:
--> 112             axis = self.obj._get_axis_number(axis)
    113         new_self.axis = axis
    114         return new_self

/usr/local/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
    400     @classmethod
    401     def _get_axis_number(cls, axis):
--> 402         axis = cls._AXIS_ALIASES.get(axis, axis)
    403         if is_integer(axis):
    404             if axis in cls._AXIS_NAMES:

TypeError: unhashable type: 'numpy.ndarray'

 

Can anyone help?

 


Is this code from github.com/cloudxlab/ml ?


Thank you so much. It was my syntax mistake. It should be in square brackets "[]".

data.iloc[train_indices], data.iloc[test_indices]


np.random.seed(42)

What is the logic of using 42 in the seed function?


It's the same as random_state in R; it is used to make the randomness in the model reproducible.


How do we decide the value of the random state, e.g. 42 or 24 or 21?


Any seed is ok.
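A small sketch of what the seed does (42 is arbitrary; any fixed integer behaves the same):

import numpy as np

np.random.seed(42)
print(np.random.permutation(5))

np.random.seed(42)
print(np.random.permutation(5))  # identical to the first call: the seed
                                 # makes the shuffling reproducible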



Hi,

I have a couple of questions:

1. If we have Mean Square Error, why do we need RMSE?

2. How do we decide whether to use MSE or RMSE?

Thanks in advance.



Hi. Couple of questions - 

1. 

Current Code- I think we are converting all values of housing["income_cat"] that are below 5 to 5

# Label those above 5 as 5
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)

why it is not -

housing["income_cat"].where(housing["income_cat"] > 5, 5.0, inplace=True)

As we want the capping to be done for all values above/beyond 5, we should be using the > sign and not the < sign.

 

2. We divided by 1.5. Why did we consider only 1.5 for the division? Is this a standard procedure to reduce the spread, or can we use some other number as well?

 

Please clarify. Thanks in advance.
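For reference, pandas' Series.where(cond, other) keeps the values where cond is True and replaces the values where it is False, so the < 5 version does cap at 5. A minimal sketch:

import pandas as pd

s = pd.Series([1.0, 3.0, 4.0, 6.0, 9.0])
# where() KEEPS values that satisfy the condition and replaces the rest,
# so 1, 3, 4 stay untouched while 6 and 9 become 5.0.
print(s.where(s < 5, 5.0).tolist())  # [1.0, 3.0, 4.0, 5.0, 5.0]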


Please revert. Waiting for the response. Thanks



Hello,

from sklearn.model_selection import train_test_split
train_set, test_set=train_test_split(housing, test_size=0.2,random_state=42)
print(len(train_set),"train +",len(test_set),"test")
test_set.head()

Receiving the following error:

NameError                                 Traceback (most recent call last)
<ipython-input-3-afeea3dcd5fb> in <module>
      1 from sklearn.model_selection import train_test_split
----> 2 train_set, test_set=train_test_split(housing, test_size=0.2,random_state=42)
      3 print(len(train_set),"train +",len(test_set),"test")
      4 test_set.head()

NameError: name 'housing' is not defined

Hi,

Please ensure you have defined the housing variable and have run all the cells from the beginning before running this cell.

Thanks.


Hello,

Facing the same file-not-found error; tried with the whole path but the error persists. The code is mentioned below:

import pandas as pd
import os
HOUSING_PATH='ml/machine_learning/datasets/housing/housing.csv'
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path,"housing.csv")
    return pd.read_csv(csv_path)

 

housing = load_housing_data()
housing.head()

 

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-13-6a9011700846> in <module>
----> 1 housing = load_housing_data()
      2 housing.head()

<ipython-input-12-5ebf01cf80e5> in load_housing_data(housing_path)
      4 def load_housing_data(housing_path=HOUSING_PATH):
      5     csv_path = os.path.join(housing_path,"housing.csv")
----> 6     return pd.read_csv(csv_path)

Hi,

First, check if the file exists at that location. If it does not, use the following command to download the file by cloning our GitHub repository:

git clone https://github.com/cloudxlab/ml ~/ml

If the file exists at that location, please use this path instead:

../ml/machine_learning/datasets/housing/housing.csv

Thanks.


Hello Sir,

I tried as you mentioned; the problem still exists.


Hi,

Please share a screenshot of the location of the file and your code.

Thanks.


Hello Sir,

Here is the code and screenshot:

import pandas as pd
import os
HOUSING_PATH='../ml/machine_learning/datasets/housing/housing.csv'
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path,"housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()

 


Hi,

Since you are using the function load_housing_data() to load the dataset, that function is adding housing.csv at the end of the path. So please change the path to the following and your code should run fine:

'../ml/machine_learning/datasets/housing/'

Thanks.


Hello Sir,

The above line worked fine.

Thanks.


Hi,

Please explain in more detail why "the solution will break next time when we fetch an updated dataset" with the split_train_test() function.

Thank you.


Hi,

This is because split_train_test() shuffles the row indices randomly. An updated dataset has a different number of rows, so the permutation changes completely, and rows that were previously in the test set can end up in the training set; the model would then have already seen part of its test data.

Thanks.


Hi,

What does data.iloc[a] do?

Thank you.


Hi,

`iloc` fetches the row(s) at the given integer position(s) from the given data frame (here `data`). Please have a look here for more info on `iloc`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

Hope this helps.

Thanks.
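A tiny sketch of positional indexing with iloc:

import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30]}, index=["x", "y", "z"])
print(df.iloc[0])       # first row by position, regardless of its label
print(df.iloc[[0, 2]])  # rows 0 and 2, returned as a new DataFrame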


Another question,

Let's say we plot data on a histogram and see that the distribution isn't normal. The data still has a mean and a standard deviation.

If we plugged that mean and standard deviation into the normal distribution formula and plotted the resulting bell curve, that would give an erroneous analysis, right?

Thanks.


Hi,

The value of the z-score tells you how many standard deviations you are away from the mean. If a z-score is equal to 0, it is on the mean. A positive z-score indicates the raw score is higher than the mean average. For example, if a z-score is equal to +1, it is 1 standard deviation above the mean.

Thanks.
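To make this concrete, a small sketch (the ACT mean of 21 and standard deviation of 5 are assumed from the video's example):

def z_score(x, mean, std):
    # number of standard deviations x lies above (+) or below (-) the mean
    return (x - mean) / std

print(z_score(28.5, 21, 5))  # 1.5, i.e. 1.5 standard deviations above the mean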


Hi,

Time stamp: 37:00

In the example, let's say Jim's score is 28.5, i.e. a z-score of 1.5, greater than Pam's z-score of 1. Does that mean Jim is the better performer in this case, even though the standard deviation of the ACT score (5) is much less than that of the SAT (300)?

Thanks.


np.ceil just seems like a very faulty way to round off numbers. How can 1.1 round off to 2? It doesn't make sense. Why wasn't a better function used to round off numbers?


Hi,

np.ceil is not really meant for rounding here; it is used deliberately to form buckets, so every value in (0, 1] becomes 1, every value in (1, 2] becomes 2, and so on. If you want conventional rounding, you can use round().

Thanks.
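A quick comparison of the two (a sketch, not from the course notebook):

import numpy as np

print(np.ceil(1.1))  # 2.0 -- always rounds up, which is what we want for buckets
print(round(1.1))    # 1   -- conventional rounding to the nearest integer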


sir,

Could you please explain why you take digest()[-1]? The main question is why you take [-1].


Hi,

When you use a negative index on a sequence, it counts that many elements from the end. Try the following code to understand this better:

a = [1,2,3,4]
a[-1]

Thanks.


sir,

I know [-1] means from the end... but the thing is, why choose the end rather than the start?


Hi,

This is because we only want to take the last byte of the hash as mentioned in the slides.

Thanks.


Hello,

Can you please explain: to categorize the median_income column, why is it divided by 1.5 and not by any other number?


Hi,

Dividing by 1.5 scales median_income down so that, after applying ceil, we get only a handful of income categories (strata), each large enough to be representative. It is not a magic number; with different data, or even with this one, you are free to use your own value. You can try that here and see how different the result is from the one given here.

Thanks.


Thanks for replying.

I find this concept very interesting. Can you please suggest some datasets on which I can do some hands-on practice?


Hi,

You can login to Kaggle and search for datasets.

Thanks.


Hello,

while using housing.hist(), the plots coming up are not bell shaped. 

1. Does that mean the data we are using is incorrect or wrongly sampled?

2. How do bell-shaped plots give assurance that the model is predicting correct values? Or: why is so much importance given to bell-shaped curves in statistics?

Thanks!


Hi,

Having a bell-shaped plot means the feature follows a normal distribution; many ML models give better results with such datasets.

Thanks.


Hello,

In the video, it is said that np.random.seed() is valid for a single run.

What does that mean?


Hi,

This means that next time you are running the code, you need to set the seed value once again.

Thanks.


Sir, why am I getting a FileNotFoundError?


Hi. 

Kindly check the housing data path: HOUSING_PATH = '../ml/machine_learning/datasets/housing/'


FileNotFoundError: [Errno 2] File b'datasets/housing/housing.csv' does not exist: b'datasets/housing/housing.csv'

 

Please advise what to do next.

Even though I have cloned as instructed earlier.


Hi,

Please check if the file exists. If it does, then please provide the full path to the file.

Thanks.


Hi, 

 

I did not understand the idea of ML algorithms not being able to detect patterns when the data is heavy-tailed.

Can you explain it more concretely. 

Thanks 


Hi,

Heavy-tailed means the distribution stretches far out from its center, i.e. the data contains extreme values (outliers). Some ML algorithms do not perform well with datasets that contain such outliers.

Thanks.


Getting error when reading from URL. How to do this?

I could read another csv file successfully using URL: 


Hi Gaurav,

Use this URL "https://github.com/cloudxlab/ml/raw/master/machine_learning/datasets/housing/housing.csv".


Awesome, it worked to read data. Thank you!



 

Please resolve the issue. I am not getting the read-file output; the data results are not showing. Please refer to the screenshot.



Hi,

Please check the location of the file and then provide the full path to the file.

Thanks.


You can see, my housing.csv file is in 'datasets/housing'

Guide me where I am going wrong.


Hi,

That is true, but the Python notebook is not. So please provide this path instead:

~/ml/machine_learning/datasets/housing

Thanks.


Query solved. Thank you so much.


For those who need to get a look at Pandas: 

NumPy & Pandas | Python for Machine Learning | Session 11

https://youtu.be/LxVzfncBcng?t=5688 Pandas

 

 


Hi Team,

Can you please check why I am unable to log in to Hue?

Also, I have the housing.csv file in Hadoop and locally, but when trying to load this file in Jupyter, it's not showing data from all the columns.

 


Hi Dilip,

Kindly use housing.describe() to see all the columns. \t is the tab between the columns; the column's name is housing_median_age.

Kindly check before posting!

All the best !


Hi Satyajit,

Thanks for prompt reply.

Now the error is 'Housing.csv' does not exist. But the file is there locally and in the Hadoop system; I can access it from the console.

Please let me know if i am missing anything.

https://cloudxlab.com/assessment/displayslide/1317/end-to-end-machine-learning-project-part-2?course_id=73&playlist_id=414#c23029


Hi,

Try housing.csv instead of Housing.csv. Also, if the issue persists, please share a screenshot of your code and the error that you are getting.

Thanks.


Hi,
Referring to slide #56, video 7:47 mins. When calculating the variance, why do we square the differences from the mean?

1. To eliminate negatives canceling positive differences. 

2. To amplify the higher differences more than the lower ones.

3. To ease the calculation compared to using the absolute distance.

Questions: 

1. Why do we need to amplify the higher differences so that they are weighted more heavily?

2. What does #3  mean?

Thank you very much for the help.


Hi,

The main reason to square the values is so they are all positive. You could take the absolute value instead, but squaring means that more variable points have a higher weighting. Squaring rather than taking the absolute value also means that taking the derivative of the function is easier.

Thanks.
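A small sketch of the difference between the two choices (made-up numbers):

import numpy as np

errors = np.array([1.0, 1.0, 10.0])
print(np.mean(np.abs(errors)))  # 4.0  -- every unit of error counts equally
print(np.mean(errors ** 2))     # 34.0 -- the single large error dominates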


Thank you for the response Rajtilak,

But my question was about what you said:  "but squaring means that more variable points have a higher weighting."

Why do the more variable points have to be given a higher weighting?

Thanks,

Harish

 

 



Please elaborate on using the hash function here; is it just to keep the training dataset constant so that its contents remain consistent?


Hi,

You can find the answer to your query on slide# 101.

Thanks.


What is the significance of the bins parameter while plotting a hist from a dataframe?

housing.hist(bins=50, figsize=(20,15))


Hi,

A histogram displays numerical data by grouping it into "bins" of equal width. Each bin is plotted as a bar whose height corresponds to how many data points fall in that bin. Bins are also sometimes called "intervals", "classes", or "buckets".

Thanks.


Why can the command not read the file?


Hi,

You need to specify the full path.

Thanks.


Could you please clarify what the full path would be? I did not get you.

 


Hi,

The full path to the file including the root directory. It may look something like this, but depends on where you saved your file:

~/ml/machine_learning/datasets/housing/housing.csv

Thanks.


1)

from sklearn.model_selection import train_test_split
test_set,train_set=train_test_split(housing,test_size=0.2,random_state=42)
print('test_set is',len(test_set),'train_set is',len(train_set))
len(housing)

O/P: 

test_set is 13209 train_set is 3303
16512

2)

# With unique and immutable identifier

import hashlib

def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

print(len(train_set), "train +", len(test_set), "test")
housing.head()
print(len(housing))

O/P:                    

13064 train + 3448 test
16512

 

Using method 1 and method 2, I'm getting two different 20% test sets. How do we know which is the reliable method to use?


Hi,

Please refer to the lecture video to understand the difference between the two methods. One of the methods executes the process from scratch; we mostly do not use this. It has been shown to help you understand how the process works.

Thanks.


I totally understood the part where one is an in-built feature while the other is done by us. What I fail to understand is why both methods don't give the same output when the test ratio is 0.2 in both.

Another thing the video fails to address: when using the hash method, it's understood that a 0.2 test ratio of records is taken, but there is no mention of whether it is a randomized 20% or the first 20% encountered. In which order is the 20% chosen?


Hi,

The small difference in counts is expected: train_test_split selects exactly 20% of the rows at random, while the hash-based method puts a row in the test set whenever the last byte of the MD5 hash of its id falls below 256 * 0.2 ≈ 51, which selects only approximately 20% of the ids. The selection is neither the first 20% nor a fresh random draw each run; it is a pseudo-random but deterministic subset, so the same ids land in the test set every time. Note also that in your method 1 the unpacking is swapped: train_test_split returns the train split first, so it should be train_set, test_set = train_test_split(...); that is why your printed sizes look reversed.

Also, to gain a better understanding of how split_train_test_by_id works, you can try using the hash function on a smaller dataset, and then check what kind of values it generates. The dataset is split based on the hash values generated; a detailed explanation is given in slide# 100 and subsequent slides.

Thanks.


Have the videos been updated with the new changes?


Hi,

Yes, we keep updating our content to make it better.

Thanks.


It's really sad that it took 2 years to update the content... I had to go through the entire video, which I had already finished, just to see what was updated. Even after the update, many things are still unclear.


Hi,

We only update the courseware when it is necessary. If you need more clarity on a topic, you can always reach out to us.

Thanks.


Hi CloudxLab Team,

The random sampling method works fine if the dataset is large enough. How do we decide how much is "large enough"? Quantitatively and measurably, how can we decide what is "large enough" for a problem?

Regards,

Ravikiran Nalla


Hi,

There is no universal cutoff. "Large enough" means large relative to the number of attributes and subgroups you care about, so that pure random sampling is unlikely to under-represent any important group. If the dataset is small, or some strata are rare, stratified sampling is the safer choice.

Thanks.


Hi CloudxLab Team,

Please refer to slides 98 to 102.

We know that there is an inherent problem in the split_train_test() function: the solution breaks if an updated dataset becomes available, and the suggested solution is to hash each instance's id and use the last byte. I understand using the hash value as a unique, stable identifier of an instance, but how does using the last byte of the hash determine whether the instance should go to the test set?

Can you clarify?

Regards,

Ravi


Hi,

The last byte of the hash is an integer between 0 and 255 that is effectively uniformly distributed and depends only on the row's id. Checking whether it is below 256 * test_ratio (about 51 for a 20% split) therefore sends roughly 20% of the ids to the test set, and since the hash of a given id never changes, the same row gets the same decision even after the dataset is updated with new rows.

Thanks.


I am getting the below error while trying the one-hot method:

NameError: name 'CategoricalEncoder' is not defined

Please suggest immediately.


Hi,

Would request you to get the latest notebooks from our Github repository.

Thanks.



https://prnt.sc/t7rsxm Sir, I have been trying since yesterday evening to restart my server as given in the instructions, but I'm still facing the same problem. Please help me solve this.


Hi,

Please try the following command in a web console:

rsync -avz --ignore-existing /cxldata/cloudxlab_jupyter_notebooks/ /home/$USER/cloudxlab_jupyter_notebooks/

Thanks.


Please explain why we are using the second parameter housing["income_cat"] in the line below. This is very confusing. What role does housing["income_cat"] play in the split function?

 

for train_index, test_index in split.split(housing, housing["income_cat"]): 


Could someone from CloudxLab please respond to my question?


Hi,

split.split(housing, housing["income_cat"]) yields pairs of (train_index, test_index) arrays. The first argument is the data being split; the second, housing["income_cat"], is the label used for stratification, so both the train and the test indices preserve the income-category proportions of the full dataset.

Thanks.


Dear Team, 

At 2:36:20 in the video, we are importing the Imputer class. However, this throws an error while importing.

I just googled and found the new syntax to import Imputer as listed below:

from sklearn.impute import SimpleImputer 
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

Kindly advise if I am doing something wrong. If not, please update the video and PPT with the changes.

Thank you.


Hi,

Would request you to go through the following:

https://discuss.cloudxlab.com/t/solved-cannot-import-imputer/4052

Thanks.


The professor is just reading the slides... I am unable to understand a single concept. One example: the 'stratified' concept, which he started at about 1h 17min, was a total bouncer.


Hi,

Would suggest you try the code yourself while listening to the video. You also have the slides, which explain the same concepts shown in the lecture video. You can clone our GitHub repository using the following command in the web console:

git clone https://github.com/cloudxlab/ml ~/ml

Thanks.


Kindly go over the video from 1h 17min. He didn't explain the coding lines: why we are using them, how they work, or the importance of doing it that way. How am I supposed to learn just by seeing the code without understanding it? Also, for the comparison-proportion code, he just read 2-3 lines and went directly to the output, explaining neither the code nor the output. In my opinion, clearing concepts matters more than copy-pasting. Kindly understand my point.


Hi Harmeet,

This topic is to give you an overview of how the Machine Learning pipeline works. The next couple of topics will cover in depth how individual Machine Learning algorithms like Classification etc. work. Also, if you are unable to understand any particular piece of code, let me know; I will help by explaining it to you.

Thanks.


Kindly explain the stratified sampling and comparison-proportion code. I have rewatched it a number of times; I am stuck here.


Hi,

Stratified sampling refers to a type of sampling method. With stratified sampling, the researcher divides the population into separate groups, called strata. Then a probability sample (often a simple random sample) is drawn from each group.

Stratified sampling has several advantages over simple random sampling. For example, using stratified sampling, it may be possible to reduce the sample size required to achieve a given precision. Or it may be possible to increase the precision with the same sample size.

Could you help me by referring to the slide which has the comparison proportion code?

Thanks.
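For the stratified part, here is a minimal, self-contained sketch of what the lecture code does (a toy label column stands in for income_cat):

import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

df = pd.DataFrame({"value": range(100),
                   "cat":   [1] * 60 + [2] * 40})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(df, df["cat"]):
    train, test = df.loc[train_index], df.loc[test_index]

# The 60/40 proportion of "cat" is preserved in the test set.
print(test["cat"].value_counts(normalize=True))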


hello sir ,

Can you please explain the workings of plt.legend() in matplotlib?


Hi,

This function places a legend on the axes. You can find more about it from the below link:

https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html

Thanks.
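A quick sketch:

import matplotlib.pyplot as plt

plt.plot([0, 1, 2], [0, 1, 4], label="squares")
plt.plot([0, 1, 2], [0, 1, 2], label="linear")
plt.legend()  # draws a box mapping each line's style to its label
plt.show()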


Hello sir,

I have a doubt about split.split(housing, housing["income_cat"]); please explain why it is used.


Hi,

split.split(housing, housing["income_cat"]) generates the (train_index, test_index) pairs that we use to build strat_train_set and strat_test_set; passing income_cat as the second argument tells StratifiedShuffleSplit() to stratify the shuffle on that column.

Thanks.


Hi team,

All attributes are numerical, except the ocean_proximity field. Its type is object, so it could hold any kind of Python object, but since you loaded this data from a CSV file you know that it must be a text attribute.

What is the logical meaning of the line, i.e. "its type is object, so it could hold any kind of Python object"?

As I know, an object is simply a collection of data (variables) and methods (functions) that act on that data. Similarly, a class is a blueprint for that object... An object is also called an instance of a class, and the process of creating this object is called instantiation. I can visualize this definition,

but that line seems confusing to me.
Kindly make it clear with an example.


Hi,

The textual data is referred to as categorical data. As for the object dtype: you can change it if you want, but it is used so that the column can hold non-numerical data. As you know, non-numerical data can be almost anything, and it depends on the dataset what the type of each feature will be.

Thanks.

-- Rajtilak Bhattacharjee


What specific benefit did we get by:
# dividing "median_income" by 1.5 to limit the number of income categories,
# rounding up using ceil to have discrete categories,
# truncating it to 5,
to create a new column "income_cat", and using this column for stratification when carrying out the test?

Can't we directly stratify using "median_income" alone after applying the ceil function? That would also ensure that every income group is represented in the test data.


Hi,

We divide the median income by 1.5 so that we do not have too many strata, and each stratum is large enough. This is done to address the anomalies in the data we have. Would request you to go through the part of the lecture that discusses the shape of the data; that should make things clear for you.

Thanks.

-- Rajtilak Bhattacharjee


Why should each stratum be large enough?


Hi,

It should be large enough to be representative.

Thanks.


1) We already split the data into test and train sets, so why do we split the data ("income_cat") again using the StratifiedShuffleSplit class?

2) housing["income_cat"].value_counts() / len(housing) ... why did we use this line of code?


Hi,

Please tell me how to obtain y_score for categorical values and plot a ROC curve.


Hi,

I have already replied to your mail, please check.

Thanks.

-- Rajtilak Bhattacharjee


Can you please tell me a way to save my Jupyter notebooks for future reference after completing the project? After completing the project we won't be able to access the lab. Please advise.


Hi,

There are 2 things you can do:

1. You can save your notebooks as .ipynb files. Open your notebook, click on File -> Download as -> notebook (.ipynb) and save it on your local hard drive.

2. You can publish your notebooks on your personal GitHub accounts. However, would request you to read the following before you go ahead with this option:
https://discuss.cloudxlab.c...
Thanks.

-- Rajtilak Bhattacharjee


np.random.seed(10)

What's the meaning of 10 in the seed method? What does this value signify?


Hi. Rachit.

The seed() is given to make the output of np.random the same if you rerun your Jupyter cell or program again and again.
The np.random functions generate random numbers; to make those numbers reproducible throughout your program we use seed(), and the argument to seed() can be any fixed non-negative integer.

All the best!

-- Satyajit Das


Hi Team,
# Let's use Scikit-Learn Imputer class to fill missing values

import sklearn
from sklearn.preprocessing import Imputer

Error: ---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-51-46d3542f179c> in <module>
2
3 import sklearn
----> 4 from sklearn.preprocessing import Imputer

ImportError: cannot import name 'Imputer'

Please Help


Hi,

Please find the solution to this issue in the below link:

https://discuss.cloudxlab.c...

Thanks.

-- Rajtilak Bhattacharjee


Hi,
Below import isn't working. Cannot import Imputer. Am I missing any syntax?
from sklearn.preprocessing import Imputer


Please refer to this note

-- Praveen Pavithran


In the end-to-end project:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Why n_splits=1? Why test_size=0.2? Why random_state=42 (if we take any seed except 42, what will happen)?


Hi,

Here n_splits defines the number of re-shuffling and splitting iterations. test_size can be either float or int; if int, it represents the absolute number of test samples. random_state controls the randomness of the training and testing indices produced. You can find more details in the link given below:
https://scikit-learn.org/st...
Thanks.

-- Rajtilak Bhattacharjee


Why the error?


Hi Rajeev,
It is showing error because the answer is not correct.

Hint: Please consider the second line (a[2:5] = [7,4,9]). Here, we have changed the values of some elements.

-- Sachin Giri
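A quick sketch of what that hint line does:

a = list(range(6))   # [0, 1, 2, 3, 4, 5]
a[2:5] = [7, 4, 9]   # replaces the elements at positions 2, 3 and 4
print(a)             # [0, 1, 7, 4, 9, 5]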


import pandas as pd
import os

HOUSING_PATH = 'datasets/housing/'

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = pd.read_csv('datasets/housing/housing.csv')
housing.head()

I am not able to get the output. Why?


Hi, Varshith.

Kindly give the complete/absolute path of the file.

All the best!


Not able to see the slides; a 'The connection was reset' message appears. Tried reloading, logging off and on, etc., but no luck.


Hi,

Thank you for contacting us.
Are you not able to see the complete PDF, or are the slides taking time to load? The problem might be with the internet connection; kindly check. Please feel free to let me know if you have any queries and I'll be glad to help.

Hope this helps.

Thanks.

-- Anupam Singh Vishal


You have not cleared my query, kindly get it cleared


I have completed topic 3, but it is still showing as 98% done.
Kindly rectify it.


Hi,

Would request you to share your email id with us.

Thanks.

-- Rajtilak Bhattacharjee


utkarshtrivedi1403@gmail.com



Hi,
It is correct now. Can you please check?


Now my topic 1 is at 99%, yet I have completed everything, including the newly added lambda lecture.


It is corrected. Also, please check if you completed all the slides from topic 1, as we have added 2 slides to that topic. If you re-mark any slide from that topic, the topic progress is updated.


Sir, in my topic 1, only one new track has been completed, which is the lambda function.
If two tracks were added, kindly point out the new one.
If only one track was added, then I have completed it and it is still showing 99%. Kindly correct it.
It is very unfortunate that I have paid the money but have to send queries again and again. I am feeling betrayed.


Reminder


Reminder2


Hi,

You still have not completed the sub-topic Tuples (# 119) in topic 1, this is why it is showing 99% complete. Would request you to complete it to set the completion percentage to 100% for topic 1.

Thanks.

-- Rajtilak Bhattacharjee


Again same problem, after completing #Task 119


Hi,

Could you please tell me which topic you are referring to since topic# 1 is showing 100% complete.

Thanks.

-- Rajtilak Bhattacharjee


It is again 99% complete.


I am talking about Topic 1


Hi Utkarsh,

Topic# 1 is showing 100% complete. Would request you to check once again and confirm.

Thanks.

-- Rajtilak Bhattacharjee


Sorry, Sir. Now it is corrected.
I think the server takes some time to process. Kindly don't consider my three latest queries.


Hi,

No problem. We are always there to help you out. Happy learning!

Thanks.

-- Rajtilak Bhattacharjee


Closing ticket

-- Praveen Pavithran


Kindly, get it corrected


Sir! Now the same problem is with topic 1 and topic 7. Kindly correct it.


Now my topics 1 & 6 are showing the same problem. Kindly check.


When we make a prediction, is it made on the training set or the test set?


Hi,

We create the model using the training set, then make the predictions using that model on the test set.

Thanks.

-- Rajtilak Bhattacharjee


It applied ceiling on median income to get value counts from 1 to 10, and then added a where condition to get value counts from 1 to 5. Instead, we could directly apply the where condition to get value counts from 1 to 5 without applying ceiling.

Why did it apply both ceil and where on the housing dataset's median income?


Got it; without ceil it creates a bucket for each unique value less than 5.
Thanks


Hi Team,

Why is the tutor running a for loop to drop 'income_cat'? We could have done it directly, like df.drop(['income_cat'], axis=1). Is there any difference in doing it the other way?
Please explain if there is a difference between the two; see also the sketch below.

Thanks,
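For context, the loop in the notebook simply drops the helper column from both stratified sets in one pass; there is no functional difference from calling drop() on each set separately. A self-contained sketch:

import pandas as pd

strat_train_set = pd.DataFrame({"median_income": [2.3], "income_cat": [2.0]})
strat_test_set = pd.DataFrame({"median_income": [5.1], "income_cat": [4.0]})

# income_cat was only needed for the stratified split; the loop removes it
# from BOTH sets without writing the drop call twice.
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

print(strat_train_set.columns.tolist())  # ['median_income']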


Hi,

Could you please help me by pointing out where it was mentioned that the income cat was dropped using a for loop.

Thanks.

-- Rajtilak Bhattacharjee


Hi,

Without using custom transformer pipelines,
if we choose to use the scikit-learn methods as done in this video in various steps,
how do we union the numerical and categorical columns into housing again without using pipelining (code please)?

Also, in the very last step, when using the stratified test data to predict the test-set result, if we don't want to use a pipeline, what do we use instead of full_pipeline.transform()?

Please explain, sir.


Sir,
please provide a solution for the above: how to union the numerical and categorical columns into housing again without using the pipelining method, and how to pass the test set for prediction without using the pipeline transformer (only using the libraries as done in this video).

Hi, am I right or wrong? Please clarify, sir.

The new columns "rooms_per_household", "bedrooms_per_room" and "population_per_household" are used nowhere later in creating the model?

Did we create the new columns "rooms_per_household", "bedrooms_per_room" and "population_per_household" in a copy of the housing set? Please clarify. When we print housing.info() after seeing the null values in housing (at the beginning of cleaning), the newly created columns are not there.

So the cleaning we are doing on the original strat_train_set,
and the creation of the new columns "rooms_per_household", "bedrooms_per_room" and "population_per_household" we are doing on a copy of strat_train_set?

Am I right or wrong? Please clarify, sir.


How can I see the version of sklearn in CloudxLab, and how can I upgrade it if required?


Hi,

You can use the following to check the version of sklearn:

import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

You do not need to upgrade it, since we keep the libraries updated from our end.

Thanks.

-- Rajtilak Bhattacharjee


Hi,

Would request you to follow the instructions given here:

https://discuss.cloudxlab.c...

Thanks.

-- Rajtilak Bhattacharjee


"from sklearn.preprocessing import Imputer " is not working in jupiter notebook in cloudex lab as getting an error "cannot import name 'Imputer'"


What does negative correlation mean? How are we concluding that bedrooms per room is more appropriate?


Hi,

Negative correlation is a relationship between two variables in which one variable increases as the other decreases, and vice versa.

If you mean how we determine that bedrooms per room is the feature worth adding: you need to study the problem you are trying to solve in order to find and formulate such features.

Thanks.

-- Rajtilak Bhattacharjee


Hi,

1. I have not understood the hashcode generation and the division-of-data procedure. Please explain it to me in detail, sir.

2. As we created the category column at the beginning, after dividing the dataset into training and test sets, are there values from both the median_income and income_cat columns in the training and test sets? I think we should not include median_income values during splitting, as we are already using income_cat instead.

We can drop the category column... as we have manipulated median_income into income_cat for more accuracy and a smaller error % from the original, we should use income_cat and drop median_income before splitting into training and test sets.

Also, while dropping: axis=1 means row, but we are dropping the column income_cat, so shouldn't it be axis=0?


Unable to import SimpleImputer in sklearn for end-to-end project. Screenshot attached. Pl help.


Hi,

Would request you to restart your server using the following method and then try once again:
https://discuss.cloudxlab.c...
Thanks.

-- Rajtilak Bhattacharjee


I restarted the server and followed the steps on the page linked above, then tried to import SimpleImputer in the Jupyter notebook for the end-to-end project, but it's not working... the message is "cannot import name 'SimpleImputer'". This is run in order to use the scikit-learn Imputer class to fill missing values. Please help, I am unable to proceed further. Screenshot attached.


It is solved now. Probably a scikit-learn version issue, so I had to use "from sklearn.impute import SimpleImputer" instead of "from sklearn.preprocessing import SimpleImputer". Thanks.


For loading california.png the path is '../ml/machine_learning/images/end_to_end_project/california.png'

why not 'images/end_to_end_project/california.png'? Please explain.


Hi,

This is because it is a relative path with respect to the location of your notebook.

Thanks.

-- Rajtilak Bhattacharjee


Hi
How can I save all the files present in CloudxLab on my PC for future use and reference?
Thanks
Prachi


Hi,

Very good question. You can open individual Jupyter notebook in the lab, click on File -> Download as -> notebook (.ipynb), then you can choose your local desktop to save the notebook.

Thanks.

-- Rajtilak Bhattacharjee


ok Thanks


compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100

compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100
Sir, please explain the above expressions and their purpose. I am not able to understand what they are actually doing.


Hi Mohini,

We are comparing the income category proportion in Stratified Sampling and Random Sampling. In the last 2 lines, we are calculating the error percentage of the same compared to the overall results.

Thanks.

-- Rajtilak Bhattacharjee


Why is this not working?


Hi Ritu,

Would request you to check if the file exists in that path.

Thanks.

-- Rajtilak Bhattacharjee


I did, and it exists in the path.


Hi Ritu,

Try this path instead:

'ml/machine_learning/images/end_to_end_project/california.png'

Thanks.

-- Rajtilak Bhattacharjee


Tried this but not working.


Hi Ritu,

Could you please share a screenshot of that file within that folder.

Thanks.

-- Rajtilak Bhattacharjee


Hi Ritu,

Try this path instead:

'/ml/machine_learning/images/end_to_end_project/california.png'

Thanks.

-- Rajtilak Bhattacharjee


It says "cannot import imputer"
since Imputer is a python class, i believe i dont have to install it exclusively

Kindly help me out with this


Hi Rohit,

Please follow these steps to solve your issue:

https://discuss.cloudxlab.c...

Thanks.

-- Rajtilak Bhattacharjee


So to do the one-hot encoding, do we first have to use factorize? We cannot use OneHotEncoder directly?


Please explain the working of the for loop in the code below; it's a bit confusing for me.

Thank you.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]


What is the difference between train_test_split and split_train_test, and why do we use one over the other?



Hi Anubhav,

train_test_split is an in-built function, whereas split_train_test is the function that Sandeep created from scratch with the same functionality. This was done so that learners can understand the underlying workings of the in-built function.

Thanks.

-- Rajtilak Bhattacharjee


- Compute the hash of each instance's identifier
- Take only the last byte of the hash
- If the last byte's value is lower than or equal to 51 (20% of 256), put the instance in the test set

Please explain the concept of the hash, and of "if the last byte value is lower or equal to 51 (20% of 256)".



Hi Anubhav,

Here we are trying to segregate the data into train and test sets. The function split_train_test was initially plagued by the problem that the solution breaks the next time we fetch an updated dataset. To avoid this issue, we compute a hash of each instance's identifier and use that to divide the data into train and test sets. Now the question is, what is a hash? A hash is a function that is deterministic, such that if a == b then f(a) == f(b), and if a != b then with very high probability f(a) != f(b). These conditions are met by the built-in hash functions. Would suggest you go through the slides accompanying this video to get a better understanding of the problems and their solutions.

Thanks.

-- Rajtilak Bhattacharjee
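A small sketch of the determinism property, using MD5 as in the lecture code:

import hashlib
import numpy as np

def last_byte(identifier):
    # MD5 of the same id always yields the same digest, so the same
    # row always gets the same train/test decision, run after run.
    return hashlib.md5(np.int64(identifier)).digest()[-1]

print(last_byte(12345) == last_byte(12345))  # True, on every run
print(last_byte(12345) <= 51)                # the "20% of 256" test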




What's wrong in this code?


Hi Sharathchandran,

Please ensure that the file housing.csv exists at that given path.

Thanks.



Hi. My lab access is over. I am unable to access any files from my courses. Also, how am I going to submit the answers for the exercises? Is it compulsory to renew the lab access?


Hi, Abhinav.

Yes, you need to renew your lab to access the files; kindly contact the CloudxLab team to renew the lab.

All the best


What is the name of the torrent site that has the datasets?


Slide no. 213

Data Cleaning - Missing Values - Option Two

Following code is not working...
>>> sample_incomplete_rows.drop(subset=["total_bedrooms"])

TypeError: drop() got an unexpected keyword argument 'subset'

I tried the following code to drop a column, but it is neither producing an error nor dropping the column. Please help.

>>>sample_incomplete_rows.drop(columns = ['total_bedrooms']) #Not working
>>>sample_incomplete_rows.drop(['total_bedrooms'],axis = 1) #Not working
>>>sample_incomplete_rows.drop('total_bedrooms',axis = 1) #Not working
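A likely explanation, for reference: DataFrame.drop returns a new DataFrame by default instead of modifying the original, so the result has to be assigned back (or inplace=True passed); and the TypeError above arises because the subset keyword belongs to dropna(), not drop(). A sketch:

import pandas as pd

df = pd.DataFrame({"total_bedrooms": [1, 2], "households": [3, 4]})

df.drop(columns=["total_bedrooms"])       # returns a copy; df is unchanged
df = df.drop(columns=["total_bedrooms"])  # assign the result back instead...
# ...or: df.drop(columns=["total_bedrooms"], inplace=True)
print(df.columns.tolist())                # ['households']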

Slide no. 218

>>> from sklearn.preprocessing import Imputer

ImportError: cannot import name 'Imputer'

The code is not working because the implementation changed with scikit-learn 0.20. Please replace the deprecated methods/functions/classes.

https://scikit-learn.org/st...

Changed the code as follows:

>>>from sklearn.impute import SimpleImputer
>>>imputer = SimpleImputer(strategy="median")

https://scikit-learn.org/de...

sklearn.impute

New module, adopting preprocessing.Imputer as impute.SimpleImputer with minor changes (see under preprocessing below).

Major Feature Added impute.MissingIndicator which generates a binary indicator for missing values. #8075 by Maniteja Nandana and Guillaume Lemaitre.

Feature The impute.SimpleImputer has a new strategy, 'constant', to complete missing values with a fixed one, given by the fill_value parameter. This strategy supports numeric and non-numeric data, and so does the 'most_frequent' strategy now. #11211 by Jeremie du Boisberranger.


I cannot import fetch_mldata; please help. The folder scikit_learn is not there.


The reshape method is not working out for me.


Hi,
I have a question.
If a categorical column's values are not ordinal, it should be one-hot encoded, right?
But if it has many unique categories, e.g. a categorical column CatA has 200 unique values and the dataset has 10000 rows, will it still be a good idea to one-hot encode it? Doing so will create 199 additional columns.
If not, how do we deal with such an attribute?


I cannot see the End_to_end_project.ipynb content; please help!


Hi, Arindam.

I request you to please recheck the tutorials and follow the steps; the "End_to_end_project.ipynb" will be present.

All the best


How did we come to the conclusion that median_income is an important feature, and to categorize it?


How do we do stratified sampling? Is it only a term, or is there a function?


Hi,
You can find more about this in the article.

https://www.surveygizmo.com...

All the best.


About creating strata, i.e. the line housing["income_cat"] = np.ceil(housing["median_income"]/1.3):
how do we decide the strata? Here I divided by 1.3 instead of 1.5, but the resulting histogram is the same as the one we got with 1.5.

import matplotlib.pyplot as mt
import numpy as np
import pandas as pd
import os

HOUSING_PATH = 'datasets/housing'

def load_housing_data(housing_path=HOUSING_PATH):
    path = os.path.join(housing_path, 'housing.csv')
    return pd.read_csv(path)

housing = load_housing_data()

housing["median_income"].hist()
mt.plot()
mt.show()

housing["income_cat"] = np.ceil(housing["median_income"]/1.3)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
housing["income_cat"].value_counts()
housing["income_cat"].hist()


What is meant by capped data? In the last lecture I did not completely get that part.


Sometimes while saving the data, we set a limit. For example, all values of income beyond 1000 will be set to 1000. This is generally a human-introduced defect.
