Now that we have categorized median_income, we can perform stratified sampling on it. We used 5 categories because the histogram of median_income shows that most values are clustered between 1.5 and 6. Since that range is narrow, 5 categories are enough to represent the strata.
We use the StratifiedShuffleSplit class from the sklearn library to perform stratified sampling. It is located in the model_selection submodule of sklearn. It is a cross-validator that yields train/test indices, which we then use to split the data into train and test sets. We'll discuss cross-validation later.
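To make the idea of "train/test indices" concrete, here is a minimal sketch on toy data. The arrays X and strata are hypothetical stand-ins (they are not the housing data), chosen so the strata sit in a 60/40 ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 10 samples whose strata labels are in a 60/40 ratio.
X = np.arange(10).reshape(-1, 1)
strata = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2])

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)

# split() yields pairs of index arrays, not the data itself.
train_idx, test_idx = next(splitter.split(X, strata))

# Use the indices to select rows for each set.
X_train, X_test = X[train_idx], X[test_idx]

# Each half keeps the 60/40 ratio: 3 samples of stratum 1 and 2 of stratum 2.
print(np.bincount(strata[train_idx])[1:])
print(np.bincount(strata[test_idx])[1:])
```

Because the splitter returns positions rather than rows, the same indices can be applied to any array or DataFrame with the same row order.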
StratifiedShuffleSplit has 3 important parameters-
n_splits- It is a parameter shared by most cross-validators. It sets the number of re-shuffling and splitting iterations, i.e., how many different train/test pairs will be generated. We'll study more about it later.
test_size- It represents the proportion of the dataset to include in the test set. If we specify it as 0.2, then 20% of the data will be included in the test set and 80% in the training set.
random_state- It sets the seed of the random number generator. If we don't specify random_state, a different seed is used every time we run our code, so the training and test sets will contain different instances on each run. However, if we specify random_state = some_number, then no matter how many times we run our code, the training and test sets will contain the same instances on every run. So, it solves the first problem we discussed, i.e., the problem of different training and test sets being generated on every run. We generally set its value to 42 arbitrarily. 42 is a reference to the book The Hitchhiker's Guide to the Galaxy, where it is the answer to life, the universe, and everything. It is meant as a joke and has no other significance.
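The three parameters can be seen working together in the sketch below. It uses synthetic data: the housing DataFrame, the income_cat column name, and the way the 5 categories are built here are all assumptions made for illustration, not necessarily what the course's dataset or earlier steps use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical stand-in for the housing data: 1,000 synthetic income values.
rng = np.random.default_rng(0)
housing = pd.DataFrame({"median_income": rng.uniform(0.5, 15.0, size=1000)})

# Assumed categorization into 5 strata; everything above category 5 is capped at 5.
housing["income_cat"] = np.minimum(
    np.ceil(housing["median_income"] / 1.5), 5
).astype(int)


def stratified_split(data, seed):
    """Return (train, test) DataFrames stratified on the income_cat column."""
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, test_idx = next(splitter.split(data, data["income_cat"]))
    return data.iloc[train_idx], data.iloc[test_idx]


train_a, test_a = stratified_split(housing, seed=42)
train_b, test_b = stratified_split(housing, seed=42)

# test_size=0.2 on 1,000 rows gives an 800/200 split.
print(len(train_a), len(test_a))

# The same random_state selects the same test rows on every run.
print(test_a.index.equals(test_b.index))

# The test set's category proportions stay close to those of the full data.
print(housing["income_cat"].value_counts(normalize=True).sort_index())
print(test_a["income_cat"].value_counts(normalize=True).sort_index())
```

With n_splits=1 the splitter produces a single train/test pair, which is all that is needed for a one-off split; larger values would yield several independently shuffled pairs.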
You can refer to the StratifiedShuffleSplit documentation for more details.
Import library sklearn
Import the StratifiedShuffleSplit class from the model_selection submodule of sklearn.
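One way this step might look is sketched below. The sanity check at the end is an optional addition, and the variable name splitter is only illustrative.

```python
import sklearn
from sklearn.model_selection import StratifiedShuffleSplit

# Optional sanity check that the import worked: build a splitter and query it.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
print(splitter.get_n_splits())
```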