End-to-End Project - Self-contained


End-to-End Machine Learning Project Part-2

Recording of Session

Slides



241 Comments

If possible, kindly edit the video recordings with some tool to reduce the repetition and unwanted lag, for the benefit of the learners.


Hi Arun,

Thank you for your feedback! We appreciate you taking the time to share your thoughts. Your input is valuable to us, and we'll make sure to consider it as we continue to improve. If you have any further suggestions or questions, feel free to let us know.


Dear Sir/Mam,

Can you explain once again why we are comparing the hash with 256?

def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

 



Hi Mayank,

This function helps in creating the test set. If test_ratio is 20%, it returns True for approximately 20% of the identifiers. We could have simply generated a random number between 1 and 100 and checked if it is less than 20; that would do the same job.

But here, since we want the selection to depend on the data itself, we do some juggling with the values of an id column. It is like saying: take the id from each row, hash it, pick the last byte of the hash (which will be somewhere between 0 and 255), and check whether it is less than 0.2 * 256.

Since the generated value depends only on the id column, a given row will always land consistently in either the test set or the train set.
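A minimal, self-contained sketch of that behaviour (test_set_check is the function from the lecture; the ids here are made up):

import hashlib
import numpy as np

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    # The last byte of the digest is spread uniformly over 0..255,
    # so roughly test_ratio of all ids fall below 256 * test_ratio.
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

ids = np.arange(20000)
in_test = np.array([test_set_check(i, 0.2) for i in ids])
print(in_test.mean())  # close to 0.2, and identical on every run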


What does %matplotlib inline mean?

Can you explain it in simple terms.


Hi,

It makes Jupyter display matplotlib charts directly below the code cell, immediately after the cell is executed.

Thanks.


 

But even without using it, the matplotlib charts are displayed.


Hi,

Good point. Yes, it would still display the charts; however, you will have to explicitly call plt.show() every time.

Thanks.
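A quick illustration (a minimal sketch, not from the course notebook):

import matplotlib.pyplot as plt

# Without %matplotlib inline, the figure only appears
# once plt.show() is called explicitly.
plt.hist([1, 2, 2, 3, 3, 3], bins=3)
plt.show()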


Hi,

Does scikit-learn's train_test_split function take care of the test data, like the identifier code in the video, to protect the test data?


Hi,

With scikit-learn, you wouldn't need to add any identifier column.

Thanks.


This section of the course is from "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron.

Not a problem where you got it from, though.


Hi Sandeep,

Based on my understanding, I think that to avoid sampling bias we are introducing data snooping bias: we do stratified sampling only after analyzing the data, which is nothing but snooping the data and creating a model that can work precisely on the test data. Isn't that data snooping bias? In such a case, how can we make better test data? Please correct me if my understanding is wrong.


Hi,

Good question. Data snooping bias appears when, while exhaustively searching for combinations of variables, the probability that a result arose by chance grows with the number of combinations tested. You can refer to the comic strip from XKCD below for a visual representation:

https://xkcd.com/882/

I think what you want to refer to is data leakage. Data leakage is when information from outside the training dataset is used to create the model. This makes the model learn something it's not supposed to, and thus leads to an erroneous model.

Thanks.


When I tried to run the below code:

import numpy as np

def split_train_test(data,test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc(train_indices), data.iloc(test_indices)

np.random.seed(42)
train_set, test_set = split_train_test(housing,0.2)
print(len(train_set), "train +", len(test_set), "test")

 

Got below error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-f4e2d78e57f6> in <module>
      9 
     10 np.random.seed(42)
---> 11 train_set, test_set = split_train_test(housing,0.2)
     12 print(len(train_set), "train +", len(test_set), "test")

<ipython-input-19-f4e2d78e57f6> in split_train_test(data, test_ratio)
      6     test_indices = shuffled_indices[:test_set_size]
      7     train_indices = shuffled_indices[test_set_size:]
----> 8     return data.iloc(train_indices), data.iloc(test_indices)
      9 
     10 np.random.seed(42)

/usr/local/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py in __call__(self, axis)
    110 
    111         if axis is not None:
--> 112             axis = self.obj._get_axis_number(axis)
    113         new_self.axis = axis
    114         return new_self

/usr/local/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in _get_axis_number(cls, axis)
    400     @classmethod
    401     def _get_axis_number(cls, axis):
--> 402         axis = cls._AXIS_ALIASES.get(axis, axis)
    403         if is_integer(axis):
    404             if axis in cls._AXIS_NAMES:

TypeError: unhashable type: 'numpy.ndarray'

 

Can anyone help?

 


Is this code from github.com/cloudxlab/ml ?


Thank you so much. It was my syntax mistake. It should be in square brackets "[]".

data.iloc[train_indices], data.iloc[test_indices]


np.random.seed(42)

What is the logic of using 42 in the seed function?


It's the same as random_state in R; it is used to make the randomness in the model reproducible.


How do we decide the value of the random state, e.g. 42 or 24 or 21?


Any seed is ok.
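A small sketch of what the seed does (42 is arbitrary; any fixed integer behaves the same):

import numpy as np

np.random.seed(42)
print(np.random.permutation(5))

np.random.seed(42)
print(np.random.permutation(5))  # identical to the first call: the seed
                                 # makes the shuffling reproducible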



Hi,

I have a couple of questions:

1. If we have Mean Square Error, why do we need RMSE?

2. How do we decide whether to use MSE or RMSE?

Thanks in advance.



Hi. Couple of questions - 

1. 

Current Code- I think we are converting all values of housing["income_cat"] that are below 5 to 5

# Label those above 5 as 5
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)

why it is not -

housing["income_cat"].where(housing["income_cat"] > 5, 5.0, inplace=True)

As we want the capping to be done for all values above/beyond 5, we should be using the > sign and not the < sign.

 

2. We divided by 1.5. Why did we consider only 1.5 for the division? Is this a standard procedure to reduce the spread, or can we use some other number as well?

 

Please clarify. Thanks in advance.
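For reference, pandas' Series.where(cond, other) keeps the values where cond is True and replaces the values where it is False, so the < 5 version does cap at 5. A minimal sketch:

import pandas as pd

s = pd.Series([1.0, 3.0, 4.0, 6.0, 9.0])
# where() KEEPS values that satisfy the condition and replaces the rest,
# so 1, 3, 4 stay untouched while 6 and 9 become 5.0.
print(s.where(s < 5, 5.0).tolist())  # [1.0, 3.0, 4.0, 5.0, 5.0]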


Please revert. Waiting for the response. Thanks



Hello,

from sklearn.model_selection import train_test_split
train_set, test_set=train_test_split(housing, test_size=0.2,random_state=42)
print(len(train_set),"train +",len(test_set),"test")
test_set.head()

Receiving the following error:

NameError                                 Traceback (most recent call last)
<ipython-input-3-afeea3dcd5fb> in <module>
      1 from sklearn.model_selection import train_test_split
----> 2 train_set, test_set=train_test_split(housing, test_size=0.2,random_state=42)
      3 print(len(train_set),"train +",len(test_set),"test")
      4 test_set.head()

NameError: name 'housing' is not defined

Hi,

Please ensure you have defined the housing variable and have run all the cells from the beginning before running this cell.

Thanks.


Hello,

Facing the same file-not-found error; tried with the whole path but the error persists. The code is mentioned below:

import pandas as pd
import os
HOUSING_PATH='ml/machine_learning/datasets/housing/housing.csv'
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path,"housing.csv")
    return pd.read_csv(csv_path)

 

housing = load_housing_data()
housing.head()

 

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-13-6a9011700846> in <module>
----> 1 housing = load_housing_data()
      2 housing.head()

<ipython-input-12-5ebf01cf80e5> in load_housing_data(housing_path)
      4 def load_housing_data(housing_path=HOUSING_PATH):
      5     csv_path = os.path.join(housing_path,"housing.csv")
----> 6     return pd.read_csv(csv_path)

Hi,

First, check if the file exists at that location. If it does not, use the following command to download the file by cloning our GitHub repository:

git clone https://github.com/cloudxlab/ml ~/ml

If the file exists at that location, please use this path instead:

../ml/machine_learning/datasets/housing/housing.csv

Thanks.


Hello Sir,

I tried as you mentioned; the problem still exists.


Hi,

Please share a screenshot of the location of the file and your code.

Thanks.


Hello Sir,

Here is the code and screenshot:

import pandas as pd
import os
HOUSING_PATH='../ml/machine_learning/datasets/housing/housing.csv'
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path,"housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()

 


Hi,

Since you are using the function load_housing_data() to load the dataset, that function is adding housing.csv at the end of the path. So please change the path to the following and your code should run fine:

'../ml/machine_learning/datasets/housing/'

Thanks.


Hello Sir,

The above line worked fine.

Thanks.


Hi,

Please explain in more detail why "the solution will break next time when we fetch an updated dataset" with the split_train_test() function.

Thank you.


Hi,

This is because split_train_test() shuffles the row indices randomly. An updated dataset has a different number of rows, so the permutation changes completely, and rows that were previously in the test set can end up in the training set; the model would then have already seen part of its test data.

Thanks.


Hi,

What does data.iloc[a] do?

Thank you.


Hi,

`iloc` fetches the row(s) at the given integer position(s) from the given data frame (here `data`). Please have a look here for more info on `iloc`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

Hope this helps.

Thanks.
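A tiny sketch of positional indexing with iloc:

import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30]}, index=["x", "y", "z"])
print(df.iloc[0])       # first row by position, regardless of its label
print(df.iloc[[0, 2]])  # rows 0 and 2, returned as a new DataFrame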


Another question,

Let's say we plot data on a histogram and see that the distribution isn't normal. The data still has a mean and a standard deviation.

If we plugged that mean and standard deviation into the normal distribution formula and plotted the resulting bell curve, that would give an erroneous analysis, right?

Thanks.


Hi,

The value of the z-score tells you how many standard deviations you are away from the mean. If a z-score is equal to 0, it is on the mean. A positive z-score indicates the raw score is higher than the mean average. For example, if a z-score is equal to +1, it is 1 standard deviation above the mean.

Thanks.
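To make this concrete, a small sketch (the ACT mean of 21 and standard deviation of 5 are assumed from the video's example):

def z_score(x, mean, std):
    # number of standard deviations x lies above (+) or below (-) the mean
    return (x - mean) / std

print(z_score(28.5, 21, 5))  # 1.5, i.e. 1.5 standard deviations above the mean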


Hi,

Time stamp: 37:00

In the example, let's say Jim's score is 28.5, i.e. a z-score of 1.5, greater than Pam's z-score of 1. Does that mean Jim is the better performer in this case, even though the standard deviation of the ACT score (5) is much less than that of the SAT (300)?

Thanks.


np.ceil just seems like a very faulty way to round off numbers. How can 1.1 round off to 2? It doesn't make sense. Why wasn't a better function used to round off numbers?


Hi,

np.ceil is not really meant for rounding here; it is used deliberately to form buckets, so every value in (0, 1] becomes 1, every value in (1, 2] becomes 2, and so on. If you want conventional rounding, you can use round().

Thanks.
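A quick comparison of the two (a sketch, not from the course notebook):

import numpy as np

print(np.ceil(1.1))  # 2.0 -- always rounds up, which is what we want for buckets
print(round(1.1))    # 1   -- conventional rounding to the nearest integer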


sir,

Could you please explain why you take digest()[-1]? The main question is why you take [-1].


Hi,

When you use a negative index on a sequence, it counts that many elements from the end. Try the following code to understand this better:

a = [1,2,3,4]
a[-1]

Thanks.


sir,

I know [-1] means from the end... but the thing is, why choose the end rather than the start?


Hi,

This is because we only want to take the last byte of the hash as mentioned in the slides.

Thanks.


Hello,

Can you please explain: to categorize the median_income column, why is it divided by 1.5 and not by any other number?


Hi,

Dividing by 1.5 scales median_income down so that, after applying ceil, we get only a handful of income categories (strata), each large enough to be representative. It is not a magic number; with different data, or even with this one, you are free to use your own value. You can try that here and see how different the result is from the one given here.

Thanks.


Thanks for replying.

I find this concept very interesting. Can you please suggest some datasets on which I can do some hands-on practice?


Hi,

You can login to Kaggle and search for datasets.

Thanks.


Hello,

while using housing.hist(), the plots coming up are not bell shaped. 

1. Does that mean the data we are using is incorrect or wrongly sampled?

2. How do bell-shaped plots give assurance that the model is predicting correct values? Or: why is so much importance given to bell-shaped curves in statistics?

Thanks!


Hi,

Having a bell-shaped plot means the feature follows a normal distribution; many ML models give better results with such datasets.

Thanks.


Hello,

In the video, it is said that np.random.seed() is valid for a single run.

What does that mean?


Hi,

This means that next time you are running the code, you need to set the seed value once again.

Thanks.


Sir, why am I getting a FileNotFoundError?


Hi. 

Kindly check the housing data path: HOUSING_PATH = '../ml/machine_learning/datasets/housing/'


FileNotFoundError: [Errno 2] File b'datasets/housing/housing.csv' does not exist: b'datasets/housing/housing.csv'

 

Please advise what to do next.

Even though I have cloned as instructed earlier.


Hi,

Please check if the file exists. If it does, then please provide the full path to the file.

Thanks.


Hi, 

 

I did not understand the idea of ML algorithms not being able to detect patterns when the data is heavy-tailed.

Can you explain it more concretely. 

Thanks 


Hi,

Heavy-tailed means the distribution stretches far out from its center, i.e. the data contains extreme values (outliers). Some ML algorithms do not perform well with datasets that contain such outliers.

Thanks.


Getting error when reading from URL. How to do this?

I could read another csv file successfully using URL: 


Hi Gaurav,

Use this URL "https://github.com/cloudxlab/ml/raw/master/machine_learning/datasets/housing/housing.csv".


Awesome, it worked to read data. Thank you!



 

Please resolve the issue. I am not getting the read-file output; the data results are not showing. Please refer to the screenshot.



Hi,

Please check the location of the file and then provide the full path to the file.

Thanks.


You can see, my housing.csv file is in 'datasets/housing'

Guide me where I am going wrong.


Hi,

That is true, but the Python notebook is not. So please provide this path instead:

~/ml/machine_learning/datasets/housing

Thanks.


Query solved. Thank you so much.


For those who need to get a look at Pandas: 

NumPy & Pandas | Python for Machine Learning | Session 11

https://youtu.be/LxVzfncBcng?t=5688 Pandas

 

 


Hi Team,

Can you please check why I am unable to log in to Hue?

Also, I have the housing.csv file in Hadoop and locally, but when trying to load this file in Jupyter, it's not showing data from all the columns.

 


Hi Dilip,

Kindly use housing.describe() to see all the columns. \t is the tab between the columns; the column's name is housing_median_age.

Kindly check before posting!

All the best !


Hi Satyajit,

Thanks for prompt reply.

Now the error is 'Housing.csv' does not exist. But the file is there locally and in the Hadoop system; I can access it from the console.

Please let me know if i am missing anything.

https://cloudxlab.com/assessment/displayslide/1317/end-to-end-machine-learning-project-part-2?course_id=73&playlist_id=414#c23029


Hi,

Try housing.csv instead of Housing.csv. Also, if the issue persists, please share a screenshot of your code and the error that you are getting.

Thanks.


Hi,
Referring to slide #56, video 7:47 mins. When calculating the variance, why do we square the differences from the mean?

1. To eliminate negatives canceling positive differences. 

2. To amplify the higher differences more than the lower ones.

3. To ease the calculation compared to using the absolute distance.

Questions: 

1. Why do we need to amplify the higher differences so that they are weighted more heavily?

2. What does #3  mean?

Thank you very much for the help.


Hi,

The main reason to square the values is so they are all positive. You could take the absolute value instead, but squaring means that more variable points have a higher weighting. Squaring rather than taking the absolute value also means that taking the derivative of the function is easier.

Thanks.
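A small sketch of the difference between the two choices (made-up numbers):

import numpy as np

errors = np.array([1.0, 1.0, 10.0])
print(np.mean(np.abs(errors)))  # 4.0  -- every unit of error counts equally
print(np.mean(errors ** 2))     # 34.0 -- the single large error dominates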


Thank you for the response Rajtilak,

But my question was about what you said:  "but squaring means that more variable points have a higher weighting."

Why do the more variable points have to be given a higher weighting?

Thanks,

Harish

 

 



Please elaborate on using the hash function here; is it just to keep the training dataset constant so that its contents remain consistent?


Hi,

You can find the answer to your query on slide# 101.

Thanks.


What is the significance of the bins parameter while plotting a hist from a dataframe?

housing.hist(bins=50, figsize=(20,15))


Hi,

A histogram displays numerical data by grouping it into "bins" of equal width. Each bin is plotted as a bar whose height corresponds to how many data points fall in that bin. Bins are also sometimes called "intervals", "classes", or "buckets".

Thanks.


Why can the command not read the file?


Hi,

You need to specify the full path.

Thanks.


Could you please clarify what the full path would be? I did not get you.

 


Hi,

The full path to the file including the root directory. It may look something like this, but depends on where you saved your file:

~/ml/machine_learning/datasets/housing/housing.csv

Thanks.


1)

from sklearn.model_selection import train_test_split
test_set,train_set=train_test_split(housing,test_size=0.2,random_state=42)
print('test_set is',len(test_set),'train_set is',len(train_set))
len(housing)

O/P: 

test_set is 13209 train_set is 3303
16512

2)

# With unique and immutable identifier

import hashlib

def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

print(len(train_set), "train +", len(test_set), "test")
housing.head()
print(len(housing))

O/P:                    

13064 train + 3448 test
16512

 

Using method 1 and method 2, I'm getting two different 20% test sets. How do we know which is the reliable method to use?


Hi,

Please refer to the lecture video to understand the difference between the two methods. One of the methods executes the process from scratch; we mostly do not use this. It has been shown to help you understand how the process works.

Thanks.


I totally understood the part where one is an in-built feature while the other is done by us. What I fail to understand is why both methods don't give the same output when the test ratio is 0.2 in both.

Another thing the video fails to address: when using the hash method, it's understood that a 0.2 test ratio of records is taken, but there is no mention of whether it is a randomized 20% or the first 20% encountered. In which order is the 20% chosen?


Hi,

The small difference in counts is expected: train_test_split selects exactly 20% of the rows at random, while the hash-based method puts a row in the test set whenever the last byte of the MD5 hash of its id falls below 256 * 0.2 ≈ 51, which selects only approximately 20% of the ids. The selection is neither the first 20% nor a fresh random draw each run; it is a pseudo-random but deterministic subset, so the same ids land in the test set every time. Note also that in your method 1 the unpacking is swapped: train_test_split returns the train split first, so it should be train_set, test_set = train_test_split(...); that is why your printed sizes look reversed.

Also, to gain a better understanding of how split_train_test_by_id works, you can try using the hash function on a smaller dataset, and then check what kind of values it generates. The dataset is split based on the hash values generated; a detailed explanation is given in slide# 100 and subsequent slides.

Thanks.


Have the videos been updated with the new changes?


Hi,

Yes, we keep updating our content to make it better.

Thanks.


It's really sad that it took 2 years to update the content... I had to go through the entire video, which I had already finished, just to see what was updated. Even after the update, many things are still unclear.


Hi,

We only update the courseware when it is necessary. If you need more clarity on a topic, you can always reach out to us.

Thanks.


Hi CloudxLab Team,

The random sampling method works fine if the dataset is large enough. How do we decide how much is "large enough"? Quantitatively and measurably, how can we decide what is "large enough" for a problem?

Regards,

Ravikiran Nalla


Hi,

There is no universal cutoff. "Large enough" means large relative to the number of attributes and subgroups you care about, so that pure random sampling is unlikely to under-represent any important group. If the dataset is small, or some strata are rare, stratified sampling is the safer choice.

Thanks.


Hi CloudxLab Team,

Please refer to slides 98 to 102.

We know that there is an inherent problem in the split_train_test() function: the solution breaks if an updated dataset becomes available, and the suggested solution is to hash each instance's id and use the last byte. I understand using the hash value as a unique, stable identifier of an instance, but how does using the last byte of the hash determine whether the instance should go to the test set?

Can you clarify?

Regards,

Ravi


Hi,

The last byte of the hash is an integer between 0 and 255 that is effectively uniformly distributed and depends only on the row's id. Checking whether it is below 256 * test_ratio (about 51 for a 20% split) therefore sends roughly 20% of the ids to the test set, and since the hash of a given id never changes, the same row gets the same decision even after the dataset is updated with new rows.

Thanks.


I am getting the below error while trying the one-hot method:

NameError: name 'CategoricalEncoder' is not defined

Please suggest immediately.


Hi,

Would request you to get the latest notebooks from our Github repository.

Thanks.



https://prnt.sc/t7rsxm Sir, I have been trying since yesterday evening to restart my server as given in the instructions, but I'm still facing the same problem. Please help me solve this.


Hi,

Please try the following command in a web console:

rsync -avz --ignore-existing /cxldata/cloudxlab_jupyter_notebooks/ /home/$USER/cloudxlab_jupyter_notebooks/

Thanks.


Please explain why we are using the second parameter housing["income_cat"] in the line below. This is very confusing. What role does housing["income_cat"] play in the split function?

 

for train_index, test_index in split.split(housing, housing["income_cat"]): 


Could someone from CloudxLab please respond to my question?


Hi,

split.split(housing, housing["income_cat"]) yields pairs of (train_index, test_index) arrays. The first argument is the data being split; the second, housing["income_cat"], is the label used for stratification, so both the train and the test indices preserve the income-category proportions of the full dataset.

Thanks.


Dear Team, 

At 2:36:20 in the video, we are importing the Imputer class. However, this throws an error while importing.

I just googled and found the new syntax to import Imputer as listed below:

from sklearn.impute import SimpleImputer 
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

Kindly advise if I am doing something wrong. If not, please update the video and PPT with the changes.

Thank you.


Hi,

Would request you to go through the following:

https://discuss.cloudxlab.com/t/solved-cannot-import-imputer/4052

Thanks.


The professor is just reading the slides... I am unable to understand a single concept. One example: the 'stratified' concept, which he started at about 1h 17min, was a total bouncer.


Hi,

Would suggest you try the code yourself while listening to the video. You also have the slides, which explain the same concepts shown in the lecture video. You can clone our GitHub repository using the following command in the web console:

git clone https://github.com/cloudxlab/ml ~/ml

Thanks.


Kindly go over the video from 1h 17min. He didn't explain the coding lines: why we are using them, how they work, or the importance of doing it that way. How am I supposed to learn just by seeing the code without understanding it? Also, for the comparison-proportion code, he just read 2-3 lines and went directly to the output, explaining neither the code nor the output. In my opinion, clearing concepts matters more than copy-pasting. Kindly understand my point.


Hi Harmeet,

This topic is to give you an overview of how the Machine Learning pipeline works. The next couple of topics will cover in depth how individual Machine Learning algorithms like Classification etc. work. Also, if you are unable to understand any particular piece of code, let me know; I will help by explaining it to you.

Thanks.


Kindly explain the stratified sampling and comparison-proportion code. I have rewatched it a number of times; I am stuck here.


Hi,

Stratified sampling refers to a type of sampling method. With stratified sampling, the researcher divides the population into separate groups, called strata. Then a probability sample (often a simple random sample) is drawn from each group.

Stratified sampling has several advantages over simple random sampling. For example, using stratified sampling, it may be possible to reduce the sample size required to achieve a given precision. Or it may be possible to increase the precision with the same sample size.

Could you help me by referring to the slide which has the comparison proportion code?

Thanks.
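For the stratified part, here is a minimal, self-contained sketch of what the lecture code does (a toy label column stands in for income_cat):

import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

df = pd.DataFrame({"value": range(100),
                   "cat":   [1] * 60 + [2] * 40})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(df, df["cat"]):
    train, test = df.loc[train_index], df.loc[test_index]

# The 60/40 proportion of "cat" is preserved in the test set.
print(test["cat"].value_counts(normalize=True))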


hello sir ,

Can you please explain the workings of plt.legend() in matplotlib?


Hi,

This function places a legend on the axes. You can find more about it from the below link:

https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html

Thanks.
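A quick sketch:

import matplotlib.pyplot as plt

plt.plot([0, 1, 2], [0, 1, 4], label="squares")
plt.plot([0, 1, 2], [0, 1, 2], label="linear")
plt.legend()  # draws a box mapping each line's style to its label
plt.show()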


Hello sir,

I have a doubt about split.split(housing, housing["income_cat"]); please explain why it is used.


Hi,

split.split(housing, housing["income_cat"]) generates the (train_index, test_index) pairs that we use to build strat_train_set and strat_test_set; passing income_cat as the second argument tells StratifiedShuffleSplit() to stratify the shuffle on that column.

Thanks.


Hi team,

All attributes are numerical, except the ocean_proximity field. Its type is object, so it could hold any kind of Python object, but since you loaded this data from a CSV file you know that it must be a text attribute.

What is the logical meaning of the line, i.e. "its type is object, so it could hold any kind of Python object"?

As I know, an object is simply a collection of data (variables) and methods (functions) that act on that data. Similarly, a class is a blueprint for that object... An object is also called an instance of a class, and the process of creating this object is called instantiation. I can visualize this definition,

but that line seems confusing to me.
Kindly make it clear with an example.


Hi,

The textual data is referred to as categorical data. As for the object dtype: you can change it if you want, but it is used so that the column can hold non-numerical data. As you know, non-numerical data can be almost anything, and it depends on the dataset what the type of each feature will be.

Thanks.

-- Rajtilak Bhattacharjee


What specific benefit did we get by:
# dividing "median_income" by 1.5 to limit the number of income categories,
# rounding up using ceil to have discrete categories,
# truncating it to 5,
to create a new column "income_cat", and using this column for stratification when carrying out the test?

Can't we directly stratify using "median_income" alone after applying the ceil function? That would also ensure that every income group is represented in the test data.


Hi,

We divide the median income by 1.5 so that we do not have too many strata, and each stratum is large enough. This is done to address the anomalies in the data we have. Would request you to go through the part of the lecture that discusses the shape of the data; that should make things clear for you.

Thanks.

-- Rajtilak Bhattacharjee


Why should each stratum be large enough?


Hi,

It should be large enough to be representative.

Thanks.


1) We already split the data into test and train sets, so why do we split the data ("income_cat") again using the StratifiedShuffleSplit class?

2) housing["income_cat"].value_counts() / len(housing) ... why did we use this line of code?


Hi,

Please tell me how to obtain y_score for categorical values and plot a ROC curve.


Hi,

I have already replied to your mail, please check.

Thanks.

-- Rajtilak Bhattacharjee


Can you please tell me a way to save my Jupyter notebooks for future reference after completing the project? After completing the project we won't be able to access the lab. Please advise.


Hi,

There are 2 things you can do:

1. You can save your notebooks as .ipynb files. Open your notebook, click on File -> Download as -> notebook (.ipynb) and save it on your local hard drive.

2. You can publish your notebooks on your personal GitHub accounts. However, would request you to read the following before you go ahead with this option:
https://discuss.cloudxlab.c...
Thanks.

-- Rajtilak Bhattacharjee


np.random.seed(10)

What's the meaning of 10 in the seed method? What does this value signify?


Hi. Rachit.

The seed() is given to make the output of np.random the same if you rerun your Jupyter cell or program again and again.
The np.random functions generate random numbers; to make those numbers reproducible throughout your program we use seed(), and the argument to seed() can be any fixed non-negative integer.

All the best!

-- Satyajit Das


Hi Team,
# Let's use Scikit-Learn Imputer class to fill missing values

import sklearn
from sklearn.preprocessing import Imputer

Error: ---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-51-46d3542f179c> in <module>
2
3 import sklearn
----> 4 from sklearn.preprocessing import Imputer

ImportError: cannot import name 'Imputer'

Please Help


Hi,

Please find the solution to this issue in the below link:

https://discuss.cloudxlab.c...

Thanks.

-- Rajtilak Bhattacharjee


Hi,
Below import isn't working. Cannot import Imputer. Am I missing any syntax?
from sklearn.preprocessing import Imputer


Please refer to this note

-- Praveen Pavithran


In the end-to-end project:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Why n_splits=1? Why test_size=0.2? Why random_state=42 (if we take any seed except 42, what will happen)?


Hi,

Here n_splits defines the number of re-shuffling and splitting iterations. test_size can be either float or int; if int, it represents the absolute number of test samples. random_state controls the randomness of the training and testing indices produced. You can find more details in the link given below:
https://scikit-learn.org/st...
Thanks.

-- Rajtilak Bhattacharjee


Why the error?


Hi Rajeev,
It is showing error because the answer is not correct.

Hint: Please consider the second line (a[2:5] = [7,4,9]). Here, we have changed the values of some elements.

-- Sachin Giri
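A quick sketch of what that hint line does:

a = list(range(6))   # [0, 1, 2, 3, 4, 5]
a[2:5] = [7, 4, 9]   # replaces the elements at positions 2, 3 and 4
print(a)             # [0, 1, 7, 4, 9, 5]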


import pandas as pd
import os

HOUSING_PATH = 'datasets/housing/'

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = pd.read_csv('datasets/housing/housing.csv')
housing.head()

I am not able to get the output. Why?


Hi, Varshith.

Kindly give the complete/absolute path of the file.

All the best!


Not able to see the slides; a 'The connection was reset' message appears. Tried reloading, logging off and on, etc., but no luck.


Hi,

Thank you for contacting us.
Are you not able to see the complete PDF, or are the slides taking time to load? The problem might be with the internet connection; kindly check. Please feel free to let me know if you have any queries and I'll be glad to help.

Hope this helps.

Thanks.

-- Anupam Singh Vishal


You have not cleared my query, kindly get it cleared


I have completed topic 3, but it is still showing as 98% done.
Kindly rectify it.


Hi,

Would request you to share your email id with us.

Thanks.

-- Rajtilak Bhattacharjee


utkarshtrivedi1403@gmail.com



Hi,
It is correct now. Can you please check?


Now my topic 1 is at 99%, yet I have completed everything, including the newly added lambda lecture.


It is corrected. Also, please check if you completed all the slides from topic 1, as we have added 2 slides to that topic. If you re-mark any slide from that topic, the topic progress is updated.


Sir, in my topic 1, only one new track has been completed, which is the lambda function.
If two tracks were added, kindly point out the new one.
If only one track was added, then I have completed it and it is still showing 99%. Kindly correct it.
It is very unfortunate that I have paid the money but have to send queries again and again. I am feeling betrayed.


Reminder


Reminder2


Hi,

You still have not completed the sub-topic Tuples (# 119) in topic 1, this is why it is showing 99% complete. Would request you to complete it to set the completion percentage to 100% for topic 1.

Thanks.

-- Rajtilak Bhattacharjee


Again same problem, after completing #Task 119


Hi,

Could you please tell me which topic you are referring to since topic# 1 is showing 100% complete.

Thanks.

-- Rajtilak Bhattacharjee


It is again 99% complete.


I am talking about Topic 1


Hi Utkarsh,

Topic# 1 is showing 100% complete. Would request you to check once again and confirm.

Thanks.

-- Rajtilak Bhattacharjee


Sorry, Sir. Now it is corrected.
I think the server takes some time to process. Kindly don't consider my three latest queries.


Hi,

No problem. We are always there to help you out. Happy learning!

Thanks.

-- Rajtilak Bhattacharjee


Closing ticket

-- Praveen Pavithran


Kindly, get it corrected


Sir! Now the same problem is with topic 1 and topic 7. Kindly correct it.


Now my topics 1 & 6 are showing the same problem. Kindly check.


When we make a prediction, is it made on the training set or the test set?


Hi,

We create the model using the training set, then make the predictions using that model on the test set.

Thanks.

-- Rajtilak Bhattacharjee


It applied ceiling on median income to get value counts from 1 to 10, and then added a where condition to get value counts from 1 to 5. Instead, we could directly apply the where condition to get value counts from 1 to 5 without applying ceiling.

Why did it apply both ceil and where on the housing dataset's median income?


Got it; without ceil it creates a bucket for each unique value less than 5.
Thanks


Hi Team,

Why is the tutor running a for loop to drop 'income_cat'? We could have done it directly, like df.drop(['income_cat'], axis=1). Is there any difference in doing it the other way?
Please explain if there is a difference between the two; see also the sketch below.

Thanks,
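For context, the loop in the notebook simply drops the helper column from both stratified sets in one pass; there is no functional difference from calling drop() on each set separately. A self-contained sketch:

import pandas as pd

strat_train_set = pd.DataFrame({"median_income": [2.3], "income_cat": [2.0]})
strat_test_set = pd.DataFrame({"median_income": [5.1], "income_cat": [4.0]})

# income_cat was only needed for the stratified split; the loop removes it
# from BOTH sets without writing the drop call twice.
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

print(strat_train_set.columns.tolist())  # ['median_income']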


Hi,

Could you please help me by pointing out where it was mentioned that the income cat was dropped using a for loop.

Thanks.

-- Rajtilak Bhattacharjee


Hi,

Without using custom transformer pipelines,
if we choose to use the scikit-learn methods as done in this video in various steps,
how do we union the numerical and categorical columns into housing again without using pipelining (code please)?

Also, in the very last step, when using the stratified test data to predict the test-set result, if we don't want to use a pipeline, what do we use instead of full_pipeline.transform()?

Please explain, sir.


Sir,
please provide a solution for the above: how to union the numerical and categorical columns into housing again without using the pipelining method, and how to pass the test set for prediction without using the pipeline transformer (only using the libraries as done in this video).

Hi, am I right or wrong? Please clarify, sir.

The new columns "rooms_per_household", "bedrooms_per_room" and "population_per_household" are used nowhere later in creating the model?

Did we create the new columns "rooms_per_household", "bedrooms_per_room" and "population_per_household" in a copy of the housing set? Please clarify. When we print housing.info() after seeing the null values in housing (at the beginning of cleaning), the newly created columns are not there.

So the cleaning we are doing on the original strat_train_set,
and the creation of the new columns "rooms_per_household", "bedrooms_per_room" and "population_per_household" we are doing on a copy of strat_train_set?

Am I right or wrong? Please clarify, sir.


How can I see the version of sklearn in CloudxLab, and how can I upgrade it if required?


Hi,

You can use the following to check the version of sklearn:

import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

You do not need to upgrade it, since we keep the libraries updated from our end.

Thanks.

-- Rajtilak Bhattacharjee


Hi,

Would request you to follow the instructions given here:

https://discuss.cloudxlab.c...

Thanks.

-- Rajtilak Bhattacharjee


"from sklearn.preprocessing import Imputer " is not working in jupiter notebook in cloudex lab as getting an error "cannot import name 'Imputer'"


What does negative correlation mean? How are we concluding that bedrooms per room is more appropriate?


Hi,

Negative correlation is a relationship between two variables in which one variable increases as the other decreases, and vice versa.

If you mean how we determine that bedrooms per room is the feature worth adding: you need to study the problem you are trying to solve in order to find and formulate such features.

Thanks.

-- Rajtilak Bhattacharjee


Hi,

1. I have not understood the hashcode generation and the division-of-data procedure. Please explain it to me in detail, sir.

2. As we created the category column at the beginning, after dividing the dataset into training and test sets, are there values from both the median_income and income_cat columns in the training and test sets? I think we should not include median_income values during splitting, as we are already using income_cat instead.

We can drop the category column... as we have manipulated median_income into income_cat for more accuracy and a smaller error % from the original, we should use income_cat and drop median_income before splitting into training and test sets.

Also, while dropping: axis=1 means row, but we are dropping the column income_cat, so shouldn't it be axis=0?


Unable to import SimpleImputer in sklearn for end-to-end project. Screenshot attached. Pl help.


Hi,

Would request you to restart your server using the following method and then try once again:
https://discuss.cloudxlab.c...
Thanks.

-- Rajtilak Bhattacharjee


I restarted the server and followed the steps on the page linked above, then tried to import SimpleImputer in the Jupyter notebook for the end-to-end project, but it's not working... the message is "cannot import name 'SimpleImputer'". This is run in order to use the scikit-learn Imputer class to fill missing values. Please help, I am unable to proceed further. Screenshot attached.


It is solved now. Probably a scikit-learn version issue, so I had to use "from sklearn.impute import SimpleImputer" instead of "from sklearn.preprocessing import SimpleImputer". Thanks.


For loading california.png the path is '../ml/machine_learning/images/end_to_end_project/california.png'

why not 'images/end_to_end_project/california.png'? Please explain.


Hi,

This is because it is a relative path with respect to the location of your notebook.

Thanks.

-- Rajtilak Bhattacharjee


Hi
How can I save all the files present in CloudxLab on my PC for future use and reference?
Thanks
Prachi


Hi,

Very good question. You can open individual Jupyter notebook in the lab, click on File -> Download as -> notebook (.ipynb), then you can choose your local desktop to save the notebook.

Thanks.

-- Rajtilak Bhattacharjee


ok Thanks


compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100

compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100
Sir, please explain the above expressions and their purpose. I am not able to understand what they are actually doing.


Hi Mohini,

We are comparing the income category proportion in Stratified Sampling and Random Sampling. In the last 2 lines, we are calculating the error percentage of the same compared to the overall results.

Thanks.

-- Rajtilak Bhattacharjee


Why is this not working?


Hi Ritu,

Would request you to check if the file exists in that path.

Thanks.

-- Rajtilak Bhattacharjee


I did, and it exists in the path.


Hi Ritu,

Try this path instead:

'ml/machine_learning/images/end_to_end_project/california.png'

Thanks.

-- Rajtilak Bhattacharjee


Tried this but not working.


Hi Ritu,

Could you please share a screenshot of that file within that folder.

Thanks.

-- Rajtilak Bhattacharjee


Hi Ritu,

Try this path instead:

'/ml/machine_learning/images/end_to_end_project/california.png'

Thanks.

-- Rajtilak Bhattacharjee


It says "cannot import imputer"
since Imputer is a python class, i believe i dont have to install it exclusively

Kindly help me out with this


Hi Rohit,

Please follow these steps to solve your issue:

https://discuss.cloudxlab.c...

Thanks.

-- Rajtilak Bhattacharjee


So to do the one-hot encoding, do we first have to use factorize? We cannot use OneHotEncoder directly?


Please explain the working of the for loop in the code below; it's a bit confusing for me.

Thank you.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]


What is the difference between train_test_split and split_train_test, and why do we use one over the other?



Hi Anubhav,

train_test_split is an in-built function, whereas split_train_test is the function that Sandeep created from scratch with the same functionality. This was done so that learners can understand the underlying workings of the in-built function.

Thanks.

-- Rajtilak Bhattacharjee


- Compute the hash of each instance's identifier
- Take only the last byte of the hash
- If the last byte's value is lower than or equal to 51 (20% of 256), put the instance in the test set

Please explain the concept of the hash, and of "if the last byte value is lower or equal to 51 (20% of 256)".



Hi Anubhav,

Here we are trying to segregate the data into train and test sets. The function split_train_test was initially plagued by the problem that the solution breaks the next time we fetch an updated dataset. To avoid this issue, we compute a hash of each instance's identifier and use that to divide the data into train and test sets. Now the question is, what is a hash? A hash is a function that is deterministic, such that if a == b then f(a) == f(b), and if a != b then with very high probability f(a) != f(b). These conditions are met by the built-in hash functions. Would suggest you go through the slides accompanying this video to get a better understanding of the problems and their solutions.

Thanks.

-- Rajtilak Bhattacharjee
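A small sketch of the determinism property, using MD5 as in the lecture code:

import hashlib
import numpy as np

def last_byte(identifier):
    # MD5 of the same id always yields the same digest, so the same
    # row always gets the same train/test decision, run after run.
    return hashlib.md5(np.int64(identifier)).digest()[-1]

print(last_byte(12345) == last_byte(12345))  # True, on every run
print(last_byte(12345) <= 51)                # the "20% of 256" test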




What's wrong in this code?


Hi Sharathchandran,

Please ensure that the file housing.csv exists at that given path.

Thanks.



Hi. My lab access is over. I am unable to access any files from my courses. Also, how am I going to submit the answers for the exercises? Is it compulsory to renew the lab access?


Hi, Abhinav.

Yes, you need to renew your lab to access the files; kindly contact the CloudxLab team to renew the lab.

All the best


What is the name of the torrent site that has the datasets?


Slide no. 213

Data Cleaning - Missing Values - Option Two

Following code is not working...
>>> sample_incomplete_rows.drop(subset=["total_bedrooms"])

TypeError: drop() got an unexpected keyword argument 'subset'

I tried the following code to drop a column, but it is neither producing an error nor dropping the column. Please help.

>>>sample_incomplete_rows.drop(columns = ['total_bedrooms']) #Not working
>>>sample_incomplete_rows.drop(['total_bedrooms'],axis = 1) #Not working
>>>sample_incomplete_rows.drop('total_bedrooms',axis = 1) #Not working
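A likely explanation, for reference: DataFrame.drop returns a new DataFrame by default instead of modifying the original, so the result has to be assigned back (or inplace=True passed); and the TypeError above arises because the subset keyword belongs to dropna(), not drop(). A sketch:

import pandas as pd

df = pd.DataFrame({"total_bedrooms": [1, 2], "households": [3, 4]})

df.drop(columns=["total_bedrooms"])       # returns a copy; df is unchanged
df = df.drop(columns=["total_bedrooms"])  # assign the result back instead...
# ...or: df.drop(columns=["total_bedrooms"], inplace=True)
print(df.columns.tolist())                # ['households']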

Slide no. 218

>>> from sklearn.preprocessing import Imputer

ImportError: cannot import name 'Imputer'

The code is not working because the implementation changed with scikit-learn 0.20. Please replace the deprecated methods/functions/classes.

https://scikit-learn.org/st...

Changed the code as follows:

>>>from sklearn.impute import SimpleImputer
>>>imputer = SimpleImputer(strategy="median")

https://scikit-learn.org/de...

sklearn.impute

New module, adopting preprocessing.Imputer as impute.SimpleImputer with minor changes (see under preprocessing below).

Major Feature Added impute.MissingIndicator which generates a binary indicator for missing values. #8075 by Maniteja Nandana and Guillaume Lemaitre.

Feature The impute.SimpleImputer has a new strategy, 'constant', to complete missing values with a fixed one, given by the fill_value parameter. This strategy supports numeric and non-numeric data, and so does the 'most_frequent' strategy now. #11211 by Jeremie du Boisberranger.


I cannot import fetch_mldata; please help. The folder scikit_learn is not there.


The reshape method is not working out for me.


Hi,
I have a question.
If a categorical column's values are not ordinal, it should be one-hot encoded, right?
But if it has many unique categories, e.g. a categorical column CatA has 200 unique values and the dataset has 10000 rows, will it still be a good idea to one-hot encode it? Doing so will create 199 additional columns.
If not, how do we deal with such an attribute?


I cannot see the End_to_end_project.ipynb content; please help!


Hi, Arindam.

I request you to please recheck the tutorials and follow the steps; the "End_to_end_project.ipynb" will be present.

All the best


How did we come to the conclusion that median_income is an important feature, and to categorize it?


How do we do stratified sampling? Is it only a term, or is there a function?


Hi,
You can find more about this in the article.

https://www.surveygizmo.com...

All the best.


About creating strata, i.e. the line housing["income_cat"] = np.ceil(housing["median_income"]/1.3):
how do we decide the strata? Here I divided by 1.3 instead of 1.5, but the resulting histogram is the same as the one we got with 1.5.

import matplotlib.pyplot as mt
import numpy as np
import pandas as pd
import os

HOUSING_PATH = 'datasets/housing'

def load_housing_data(housing_path=HOUSING_PATH):
    path = os.path.join(housing_path, 'housing.csv')
    return pd.read_csv(path)

housing = load_housing_data()

housing["median_income"].hist()
mt.plot()
mt.show()

housing["income_cat"] = np.ceil(housing["median_income"]/1.3)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
housing["income_cat"].value_counts()
housing["income_cat"].hist()


What is meant by capped data? In the last lecture I did not completely get that part.


Sometimes while saving the data, we set a limit. For example, all values of income beyond 1000 will be set to 1000. This is generally a human-introduced defect.
