End-to-End ML Project- Beginner friendly


StratifiedShuffleSplit

Now that we have categorized median_income, we can perform stratified sampling on it. We split the attribute into 5 categories because the histogram of median_income shows that most values are clustered between 1.5 and 6. As this range is small, 5 categories are enough to represent the strata.
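As a reminder of what the categorization step can look like, here is a minimal sketch. The synthetic data, the column name income_cat, and the bin edges are all assumptions made for illustration; they are not taken from the actual dataset or from the earlier steps of this project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (assumption: not the real dataset).
rng = np.random.default_rng(42)
housing = pd.DataFrame(
    {"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000)}
)

# Hypothetical bin edges; the last bin is open-ended so that every
# income value falls into exactly one of the 5 categories.
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Share of instances in each stratum.
print(housing["income_cat"].value_counts(normalize=True).sort_index())
```

Each category then acts as one stratum, and the split should preserve these proportions in both the training and the test set.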

We use the StratifiedShuffleSplit class from the sklearn library to perform stratified sampling. It is present inside the submodule model_selection of sklearn. It is a cross-validator that outputs train/test indices, which we then use to split the data into train/test sets. We'll discuss cross-validation later.

StratifiedShuffleSplit has 3 important parameters-

  1. n_splits- It is the number of re-shuffling and splitting iterations, i.e., how many different test (and training) sets the splitter will create. Most cross-validators in sklearn take this parameter. We'll study more about it later.

  2. test_size- It represents the proportion of the dataset to include in the test set. If we specify it as 0.2, then 20% of the data will be included in the test set and 80% in the training set.

  3. random_state- It simply sets the seed of the random number generator. If we don't specify random_state, then a different seed is used every time we run our code, so the training and test sets will contain different instances on every run. However, if we specify random_state = some_number, then no matter how many times we run our code, the training and test sets will contain the same instances on every run. So, it solves the first problem which we discussed, i.e., the problem of different training and test sets being generated on every run. We generally specify its value as 42 arbitrarily. 42 is a reference to the book The Hitchhiker's Guide to the Galaxy, in which it is the answer to life, the universe, and everything. It is meant as a joke and has no other significance.
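The three parameters above can be seen together in a small sketch. The toy DataFrame and its column names (value, stratum) are made up for illustration and are not part of the project's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (assumption): 100 rows in three strata of 50%, 30% and 20%.
data = pd.DataFrame({
    "value": np.arange(100),
    "stratum": [1] * 50 + [2] * 30 + [3] * 20,
})

# One split, 20% of the rows in the test set, fixed seed.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# split() yields (train indices, test indices) pairs; with n_splits=1
# there is exactly one pair, so we take it with next().
train_idx, test_idx = next(splitter.split(data, data["stratum"]))
train_set = data.iloc[train_idx]
test_set = data.iloc[test_idx]

# The test set keeps the stratum proportions of the full data.
print(test_set["stratum"].value_counts(normalize=True).sort_index())

# A second splitter with the same seed picks the same rows.
again = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
_, test_idx_again = next(again.split(data, data["stratum"]))
print(np.array_equal(test_idx, test_idx_again))
```

Changing random_state to another number, or leaving it out, would generally select different rows, while the stratum proportions in the test set would stay the same.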

You can refer to StratifiedShuffleSplit documentation for more details.

INSTRUCTIONS

Import library sklearn

Import StratifiedShuffleSplit class from submodule model_selection of sklearn.
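One way the two imports could be written is sketched below. It assumes that scikit-learn is already installed in your environment.

```python
# Import the sklearn package itself.
import sklearn

# Import the StratifiedShuffleSplit class from the model_selection submodule.
from sklearn.model_selection import StratifiedShuffleSplit
```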


