After analyzing the dataset, we divide it into a training set and a test set in the ratio 70:30 using train_test_split. The function samples rows at random, so the resulting train_set and test_set are not in order; we therefore sort each of them by dayCount afterwards.
Task: Correct the train_test_split call so that it splits the dataset into training and test sets in the ratio 70:30.
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(bikesData, test_size=0.3, random_state=42)

train_set.sort_values('dayCount', axis=0, inplace=True)
test_set.sort_values('dayCount', axis=0, inplace=True)

print(len(train_set), "train +", len(test_set), "test")
A random split like this is counter-intuitive given what we have learned so far, and it can introduce the data snooping problem discussed earlier.
The split is done with the train_test_split() function from the sklearn.model_selection module. This may not be the best way to divide the data. Can you think of a better way of sampling the dataset?
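One possible answer, offered only as a sketch: if the rows are ordered in time by dayCount, a chronological split keeps the earliest 70% of days for training and the latest 30% for testing, so the model never trains on days that come after the test period. The helper chronological_split and the synthetic bikes_demo frame below are hypothetical stand-ins for illustration; they are not part of the exercise or of sklearn.

```python
import numpy as np
import pandas as pd


def chronological_split(df, time_col, test_size=0.3):
    """Hypothetical helper: earliest rows go to train, latest rows go to test."""
    ordered = df.sort_values(time_col)
    cut = int(round(len(ordered) * (1 - test_size)))
    # .copy() so later in-place changes do not touch the original frame
    return ordered.iloc[:cut].copy(), ordered.iloc[cut:].copy()


# Synthetic stand-in for bikesData (assumption: one row per dayCount value)
rng = np.random.default_rng(42)
bikes_demo = pd.DataFrame({
    "dayCount": np.arange(100),
    "cnt": rng.integers(0, 500, size=100),
})

train_set, test_set = chronological_split(bikes_demo, "dayCount", test_size=0.3)
print(len(train_set), "train +", len(test_set), "test")
```

With this approach every dayCount in train_set is smaller than every dayCount in test_set. Whether it is actually better than a random split depends on how the model will be used, so treat it as one option to compare rather than the intended solution.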