
# End to End Project - Bikes Assessment - Basic - Divide into training/test datasets

Now that we have cleaned the `bikesData` data set, let us split it into training and test data sets in a 70:30 ratio using scikit-learn's `train_test_split()` function.

The `train_test_split()` function uses random sampling, so the rows of the resulting `train_set` and `test_set` data sets come back shuffled and have to be sorted by `dayCount` to restore their time order. Random sampling may not be the best way to split this data. What other sampling methods can you think of?
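One possible alternative for time-ordered data is a chronological split, where the earliest 70% of rows form the training set and the most recent 30% form the test set. The following is only a minimal sketch under stated assumptions: a small synthetic frame stands in for `bikesData`, and the `cnt` column and the helper name `chronological_split` are illustrative and not part of the project.

```python
import numpy as np
import pandas as pd


def chronological_split(df, time_col, test_size=0.3):
    """Hypothetical helper: hold out the most recent rows as the test set."""
    ordered = df.sort_values(time_col)
    cutoff = int(round(len(ordered) * (1 - test_size)))
    # No shuffling: everything before the cutoff is training data.
    return ordered.iloc[:cutoff].copy(), ordered.iloc[cutoff:].copy()


# Synthetic stand-in for bikesData (assumption): 240 hourly rows, i.e. 10 days.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "dayCount": np.arange(240) / 24,
    "cnt": rng.integers(0, 500, size=240),
})

train_demo, test_demo = chronological_split(demo, "dayCount")
print(len(train_demo), len(test_demo))
```

Stratified sampling, for example with scikit-learn's `StratifiedShuffleSplit`, is another option when the proportions of a categorical feature should be preserved in both sets.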

We will also define a utility function named `display_scores`. It prints the basic statistics (the individual scores, their mean, and their standard deviation) of the scores obtained from cross-validating a model. Please copy this function into your code, as we will use it often in this project.

INSTRUCTIONS
• Set NumPy's random seed to 42 using the code below so that the results of the exercise are repeatable.

```python
import numpy as np
np.random.seed(42)
```
• Import the `train_test_split` function from scikit-learn's `model_selection` module.

• Add a new feature (column) `dayCount` to the `bikesData` data set using the code below:

```python
# shape[0] is the number of rows; shape itself is a (rows, columns) tuple
bikesData['dayCount'] = pd.Series(range(bikesData.shape[0])) / 24
```
• Split the `bikesData` data set into a training set `train_set` and a test set `test_set` in a 70:30 ratio (`test_size=0.3`) using scikit-learn's `train_test_split()` function.

• Sort `train_set` and `test_set` by `dayCount` using the code below:

```python
train_set.sort_values('dayCount', axis=0, inplace=True)
test_set.sort_values('dayCount', axis=0, inplace=True)
```
• Now print the number of instances in the `train_set` and `test_set` data sets.

• Finally, create the function `display_scores` as shown below:

```python
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
```
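The steps above can be sketched end to end on synthetic data. This is an illustration rather than the expected answer. The frame `bikes_demo`, its `temp`, `hum` and `cnt` columns, the `LinearRegression` model, and the `random_state=42` argument are all assumptions made for the example. The sketch also reassigns the sorted frames instead of sorting in place, which is purely a stylistic choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

np.random.seed(42)

# Synthetic stand-in for the cleaned bikesData (assumption: 240 hourly rows).
bikes_demo = pd.DataFrame({
    "temp": np.random.rand(240),
    "hum": np.random.rand(240),
})
bikes_demo["cnt"] = 100 * bikes_demo["temp"] + 10 * np.random.rand(240)
bikes_demo["dayCount"] = pd.Series(range(bikes_demo.shape[0])) / 24

# 70:30 random split, then restore time order within each set.
train_set, test_set = train_test_split(bikes_demo, test_size=0.3, random_state=42)
train_set = train_set.sort_values("dayCount")
test_set = test_set.sort_values("dayCount")
print(len(train_set), "train +", len(test_set), "test")


def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())


# cross_val_score returns negated MSE for this scorer, so flip the sign
# before taking the square root to get one RMSE value per fold.
neg_mse = cross_val_score(
    LinearRegression(),
    train_set[["temp", "hum"]],
    train_set["cnt"],
    scoring="neg_mean_squared_error",
    cv=5,
)
rmse_scores = np.sqrt(-neg_mse)
display_scores(rmse_scores)
```

Here `display_scores` receives a NumPy array with one score per fold, which is why `.mean()` and `.std()` are available on its argument.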