Project - Bike Rental Forecasting

End to End Project - Bikes Assessment - Divide into training/test datasets

Dividing the dataset into training and test dataset

After analyzing the dataset, we divide it into a training set and a test set in the ratio 70:30 using train_test_split. Because train_test_split samples rows at random, the resulting train_set and test_set are no longer in chronological order, so we sort each of them by dayCount afterwards.

Task: Correct the train_test_split call so that it splits the dataset into training and test sets in the ratio 70:30.


        from sklearn.model_selection import train_test_split

        # Hold out 30% of the rows for testing; random_state makes the split reproducible
        train_set, test_set = train_test_split(bikesData, test_size=0.3, random_state=42)
        # The split shuffles rows, so put each subset back in dayCount order
        train_set = train_set.sort_values('dayCount')
        test_set = test_set.sort_values('dayCount')
        print(len(train_set), "train +", len(test_set), "test")


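  • Splitting time-ordered data at random is counter-intuitive: the training set ends up containing days that come later than some days in the test set, which can introduce the data-snooping problem discussed earlier.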

  • The split is done with the train_test_split() function from sklearn.model_selection. This may not be the best way to divide the data. Can you think of a better way of sampling the dataset?
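
One possible answer to the question above is a chronological split: train on the earliest 70% of the days and test on the latest 30%, so that no training row comes after a test row. The sketch below is only an illustration under stated assumptions, not the course's reference solution. The helper name chronological_split and the toy DataFrame are hypothetical; the only thing taken from the exercise is the dayCount column.

```python
import pandas as pd


def chronological_split(df, time_col, test_fraction=0.3):
    """Hypothetical helper: the latest rows (by time_col) become the test set."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    # Number of rows reserved for testing, taken from the end of the timeline
    n_test = round(len(ordered) * test_fraction)
    split_at = len(ordered) - n_test
    return ordered.iloc[:split_at].copy(), ordered.iloc[split_at:].copy()


# Toy stand-in for bikesData (assumed data; only the dayCount column matters here)
toy = pd.DataFrame({"dayCount": range(10), "cnt": range(100, 110)})
train_part, test_part = chronological_split(toy, "dayCount")
print(len(train_part), "train +", len(test_part), "test")
```

With this kind of split, every test day is later than every training day, which mirrors how a forecasting model would be used in practice. For cross-validation on ordered data, scikit-learn also provides TimeSeriesSplit in sklearn.model_selection, which may be worth exploring.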