
End to End ML Project - Split the dataset

In this step, we will split the dataset into train and test sets. We will use StratifiedShuffleSplit from scikit-learn's model_selection module, a cross-validator that yields train/test indices while preserving the proportion of each class (here, each income category) in both sets.

INSTRUCTIONS

Import StratifiedShuffleSplit from sklearn.model_selection:
```
from sklearn.model_selection import <<your code goes here>>
```

Now let's divide the dataset in an 80-20 split. To do this, set test_size to 0.2:

```
split = StratifiedShuffleSplit(n_splits=1, test_size=<<your code goes here>>, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
```
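
To see what stratification buys you, here is a minimal sketch on a made-up DataFrame. The `toy` data, its category probabilities, and the names `toy_train`, `toy_test`, and `max_gap` are assumptions for illustration, not part of the housing dataset. It runs the same split and compares the share of each income category in the test set against the full data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical stand-in for the housing data: 1000 rows with an
# income_cat column drawn from made-up category probabilities.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=1000),
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=1000,
                             p=[0.05, 0.30, 0.35, 0.20, 0.10]),
})

# Same splitter as in the exercise: one 80-20 split, stratified on income_cat.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(toy, toy["income_cat"]))

# split() yields positional indices, so iloc is the safe accessor here;
# loc gives the same rows only when the index is the default 0..n-1 range.
toy_train = toy.iloc[train_index]
toy_test = toy.iloc[test_index]

# Compare category proportions in the full data and in the test set.
overall = toy["income_cat"].value_counts(normalize=True).sort_index()
in_test = toy_test["income_cat"].value_counts(normalize=True).sort_index()
max_gap = float((overall - in_test).abs().max())
print(pd.DataFrame({"overall": overall, "test": in_test}))
```

The two columns should agree closely for every category, which is the property a purely random split does not guarantee.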

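Finally, we will drop the income_cat column from both the train and test sets. It was only a helper attribute for stratified sampling, not a feature the model should be trained on, so removing it returns the data to its original columns. For this we will use the drop method: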
```
for set_ in (strat_train_set, strat_test_set):
    set_.<<your code goes here>>("income_cat", axis=1, inplace=True)
```
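
As an alternative to `inplace=True`, drop can also return a new DataFrame and leave the original untouched. The sketch below uses two tiny made-up frames; `train`, `test`, and their values are hypothetical and only show the call.

```python
import pandas as pd

# Hypothetical miniature train/test sets with the helper column still present.
train = pd.DataFrame({"median_income": [2.1, 5.3, 3.8], "income_cat": [2, 4, 3]})
test = pd.DataFrame({"median_income": [4.4], "income_cat": [3]})

# drop(columns=...) is equivalent to drop(..., axis=1); without inplace=True
# it returns a new DataFrame instead of modifying the original.
train_clean = train.drop(columns="income_cat")
test_clean = test.drop(columns="income_cat")

print(list(train_clean.columns))
```

Which form to use is mostly a matter of taste; the non-inplace version makes it explicit which variable holds the cleaned data.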
