End-to-End ML Project- Beginner friendly

Split criteria

Some research shows that median_income is an important attribute for predicting median_house_value, so it is a good characteristic for creating strata. We want the test set to be representative of the various income categories in the whole dataset. Since median_income is a continuous numerical attribute, we first need to convert it into a categorical attribute.

We can use the cut() function from the pandas library to convert median_income to a categorical attribute. In simplified form, its syntax is:

pd.cut(x, bins, labels=None)

where x is the input array to be binned or categorized.

cut() has two important parameters:

  1. bins: The criteria to bin by. For example, if we provide bins as [1, 4, 7, 10], values of x greater than 1 and up to 4 go in category 1, values greater than 4 and up to 7 go in category 2, and values greater than 7 and up to 10 go in category 3. By default each bin includes its right edge but not its left edge, and values that fall outside all bins become NaN.

  2. labels: The labels for the returned bins. For the bins above, if we provide labels as [1, 2, 3], all instances in category 1 are labeled 1, those in category 2 are labeled 2, and those in category 3 are labeled 3. If we instead provide labels as ['one', 'two', 'three'], the instances are labeled one, two, and three respectively. The labels can be anything, but there must be exactly one label per resulting bin.
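The two parameters can be tried out on a few numbers. The snippet below is a minimal sketch: the sample values in x are hypothetical and chosen only to show how values land in bins, including two values that fall outside every bin.

```python
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the binning.
x = pd.Series([2.0, 4.0, 5.5, 9.0, 1.0, 12.0])

# Four edges define three bins: (1, 4], (4, 7], (7, 10].
numeric = pd.cut(x, bins=[1, 4, 7, 10], labels=[1, 2, 3])
named = pd.cut(x, bins=[1, 4, 7, 10], labels=["one", "two", "three"])

# 1.0 sits on the excluded left edge and 12.0 is above the last edge,
# so both come back as NaN.
print(numeric.tolist())
print(named.tolist())
```

Note that 4.0 is placed in the first bin, because each bin includes its right edge by default.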

Refer to the pd.cut() documentation for more details about the function.


Categorize the median_income attribute of our dataset into 5 categories and store the result in a variable named income_cat, such that:

  1. All values from 0 to 1.5 are valued at 1.
  2. All values from 1.5 to 3 are valued at 2.
  3. All values from 3 to 4.5 are valued at 3.
  4. All values from 4.5 to 6 are valued at 4.
  5. All values from 6 to 16 are valued at 5.

Display the first five rows of the income_cat using the head() method.
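One possible way to approach this is sketched below. It assumes the dataset is a pandas DataFrame; the name housing and the handful of income values are hypothetical stand-ins so that the snippet runs on its own, and the real dataset should be used in their place.

```python
import pandas as pd

# Hypothetical stand-in for the dataset; only median_income matters here.
housing = pd.DataFrame(
    {"median_income": [0.4999, 1.5, 2.8, 3.9, 5.2, 8.3, 15.0001]}
)

# Six edges give five bins, matched one-to-one with the five labels.
income_cat = pd.cut(
    housing["median_income"],
    bins=[0, 1.5, 3, 4.5, 6, 16],
    labels=[1, 2, 3, 4, 5],
)

# Show the first five rows of the new categorical attribute.
print(income_cat.head())
```

Because bins include their right edge by default, an income of exactly 1.5 lands in category 1 rather than category 2.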

Note- We use 16 as the last bin edge because the maximum value of median_income is 15.000100, so an upper edge of 16 is sure to cover every instance. Any number larger than 15.000100 gives the same result.
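The claim in the note can be checked with a small sketch. The income values below are hypothetical apart from the maximum quoted above, and using np.inf as the upper edge is an illustrative alternative, not something the exercise requires.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes, including the maximum value quoted in the note.
incomes = pd.Series([0.4999, 2.8, 5.2, 15.0001])
edges = [0, 1.5, 3, 4.5, 6]
labels = [1, 2, 3, 4, 5]

# Same lower edges, three different choices for the last edge.
with_16 = pd.cut(incomes, bins=edges + [16], labels=labels)
with_100 = pd.cut(incomes, bins=edges + [100], labels=labels)
with_inf = pd.cut(incomes, bins=edges + [np.inf], labels=labels)

# All three assign every income to the same category.
same = with_16.tolist() == with_100.tolist() == with_inf.tolist()
print(same)
```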
