Stratified sampling

Splitting the dataset may look like an easy task, but it isn't. We can't just split our dataset randomly, for two reasons:

First, if we run the program again, it will generate a different test dataset. That makes evaluating our model harder, because results from different runs are no longer measured on the same data.
Second, it may introduce sampling bias. For example, suppose we have data for a school in which 65% of the students are male and 35% are female. We take a sample of 100 students and assume that they represent all students. A good sample would consist of 65 males and 35 females, because it maintains the school's real male-to-female ratio of 65:35. With purely random sampling, the sample rarely matches this ratio exactly and can end up noticeably skewed, which introduces sampling bias.

Sampling bias is introduced when our sample can't capture the actual distribution of our data.
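
To see how often plain random sampling misses the 65:35 ratio, here is a minimal sketch using only Python's standard library. The school size of 1,000 students, the seed, and the number of trials are assumptions chosen for illustration; they are not part of the example above.

```python
import random

# Hypothetical school: 1,000 students, 65% male and 35% female.
students = ["male"] * 650 + ["female"] * 350

rng = random.Random(42)  # fixed seed so the experiment itself is repeatable
trials = 2000
male_counts = []
for _ in range(trials):
    sample = rng.sample(students, 100)  # plain random sample, no stratification
    male_counts.append(sample.count("male"))

# Share of samples that reproduce the real ratio exactly.
exact_share = sum(1 for c in male_counts if c == 65) / trials
print(f"Samples with exactly 65 males: {exact_share:.1%}")
print(f"Fewest males in a sample: {min(male_counts)}, most: {max(male_counts)}")
```

Most samples land near 65 males, but only a minority hit the ratio exactly, and some drift several students away from it.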

So, we want to split the data in such a way that we get the same test set on every run and the sample captures the actual distribution of our dataset. We capture the actual distribution by using stratified sampling instead of purely random sampling.
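
One common way to get the same test set on every run is to shuffle with a fixed random seed. The sketch below assumes that approach; the function name `split_train_test`, the 20% test ratio, and the seed value are illustrative choices, not something the text prescribes.

```python
import random

def split_train_test(data, test_ratio, seed):
    """Shuffle a copy of data with a seeded generator, then cut off the test part."""
    shuffled = list(data)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # same seed -> same order on every run
    test_size = int(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]

data = list(range(100))  # stand-in for 100 dataset rows
train_a, test_a = split_train_test(data, test_ratio=0.2, seed=42)
train_b, test_b = split_train_test(data, test_ratio=0.2, seed=42)

print(test_a == test_b)           # True: identical test set on both calls
print(len(train_a), len(test_a))  # 80 20
```

This stays stable only while the dataset itself is unchanged; adding or reordering rows changes which rows land in the test set.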

In stratified sampling, we first divide our dataset into homogeneous subgroups called strata, based on some characteristic of our data. We then sample the right number of instances from each stratum, so that the test set is representative of the overall data with respect to that characteristic.

In the school example above, we tried to maintain the 65:35 ratio in our sample. That was stratified sampling, and the characteristic used to create the strata was gender.
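
The procedure can be sketched in plain Python as follows. The helper `stratified_sample`, the record layout, and the school size of 1,000 students are hypothetical, chosen only to mirror the 65:35 example; this is one simple way to do proportional allocation, not a definitive implementation.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, sample_size, seed):
    """Draw from each stratum in proportion to its share of the whole dataset."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)  # group records by the chosen characteristic

    sample = []
    for members in strata.values():
        # Proportional allocation; rounding can shift the total by a record
        # or two when the shares do not divide evenly.
        quota = round(len(members) * sample_size / len(records))
        sample.extend(rng.sample(members, quota))
    return sample

# Hypothetical school of 1,000 students with the 65:35 ratio from the example.
students = [{"id": i, "gender": "male" if i < 650 else "female"} for i in range(1000)]

sample = stratified_sample(students, key="gender", sample_size=100, seed=42)
counts = Counter(s["gender"] for s in sample)
print(counts["male"], counts["female"])  # 65 35
```

Which students are picked still depends on the seed, but the 65:35 split is fixed by the quotas rather than left to chance.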

Note: Stratified sampling doesn't remove sampling bias completely, but it reduces it considerably.
