Why don't we want class imbalance?
From our analysis, we observe heavy imbalance between the classes: Non-Fraud transactions make up 99.83% of the dataframe, while Fraud transactions make up only 0.17%.
Using this imbalanced data as-is is not a good idea for training a model to classify whether a transaction is fraudulent or not.
This is because, if this imbalanced data is used to train a model, the algorithm does not have a decent amount of fraudulent data from which to learn the patterns of fraudulent transactions. Thus, it will most probably assume that every transaction is non-fraudulent (the dominant class of the data).
This would be a pity because the model naively assumes the dominant class instead of learning to detect the patterns needed to classify. Such a model would still be right about 99.83% of the time while catching no fraud at all, so its accuracy alone would be misleading.
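To make the problem concrete, here is a minimal sketch of what happens when a model always predicts the dominant class. The labels are hypothetical toy data built to match the 99.83% / 0.17% ratio above, not the actual dataframe.

```python
# Hypothetical labels: 1 = fraud, 0 = non-fraud, in the 0.17% / 99.83% ratio.
n_total = 10_000
n_fraud = 17
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# A "model" that has learned nothing and always predicts the dominant class.
y_pred = [0] * n_total

# Accuracy: fraction of all transactions labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_total

# Recall on the fraud class: fraction of real frauds that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_fraud

print(f"accuracy     = {accuracy:.4f}")  # accuracy     = 0.9983
print(f"fraud recall = {recall:.4f}")    # fraud recall = 0.0000
```

The accuracy looks excellent, yet not a single fraudulent transaction is detected, which is why accuracy on its own is a poor guide on data like this.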
Yes! To make the dataset balanced, we could either undersample the majority class (Non-Fraud) or oversample the minority class (Fraud).
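The two options can be sketched in their simplest, random form as follows. The function names and the toy rows are hypothetical and chosen only for illustration; this is plain random resampling, not SMOTE.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Keep every minority row plus an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

def random_oversample(majority, minority, seed=0):
    """Keep every row and duplicate minority rows (with replacement) until the classes match."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

def class_counts(rows):
    """Count how many rows carry each label (the last element of the row)."""
    counts = {0: 0, 1: 0}
    for row in rows:
        counts[row[-1]] += 1
    return counts

# Hypothetical toy rows of the form (feature, label).
majority = [(float(i), 0) for i in range(1000)]
minority = [(float(i), 1) for i in range(5)]

under = random_undersample(majority, minority)
over = random_oversample(majority, minority)

print(class_counts(under))  # {0: 5, 1: 5}
print(class_counts(over))   # {0: 1000, 1: 1000}
```

Undersampling throws away most of the majority rows, while random oversampling only repeats existing minority rows; SMOTE instead creates new, synthetic minority rows.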
What are we going to do now?
We should do most pre-processing steps (splitting the data, normalization/standardization, etc.) before under- or over-sampling the data.
This is because many sampling techniques rely on a simple model (e.g. SMOTE uses a k-NN search to generate samples). These models perform better on pre-processed datasets (e.g. both k-NN and k-means use Euclidean distance, which is dominated by features with large ranges unless the data is normalized).
So, for the sampling techniques to work best, we should first perform any pre-processing steps we can. Then we shall use the SMOTE technique to oversample the training data, which will in turn be used to train the classification algorithm.
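The order of operations described above (split, then scale, then oversample only the training data) can be sketched as follows. This is a simplified, standard-library-only stand-in for SMOTE written for illustration: the helper names, the toy data and the choice of k=5 are assumptions, and in practice a library such as imbalanced-learn provides a full SMOTE implementation.

```python
import math
import random

def standardize(train, test):
    """Scale each feature with the mean and standard deviation of the training rows only."""
    n_features = len(train[0])
    means = [sum(r[j] for r in train) / len(train) for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((r[j] - means[j]) ** 2 for r in train) / len(train)
        stds.append(math.sqrt(var) or 1.0)  # fall back to 1.0 for constant features

    def scale(rows):
        return [[(r[j] - means[j]) / stds[j] for j in range(n_features)] for r in rows]

    return scale(train), scale(test)

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

def split(rows, frac=0.8):
    """Put the first frac of the rows in train and the rest in test."""
    cut = int(len(rows) * frac)
    return rows[:cut], rows[cut:]

# Hypothetical toy data: two features on very different scales.
rng = random.Random(42)
non_fraud = [[rng.gauss(50, 20), rng.gauss(0, 1)] for _ in range(500)]
fraud = [[rng.gauss(120, 20), rng.gauss(2, 1)] for _ in range(20)]

# 1. Split first, class by class, so the test set keeps the original class ratio.
nf_train, nf_test = split(non_fraud)
f_train, f_test = split(fraud)

# 2. Standardize using statistics from the training rows only.
X_train, X_test = standardize(nf_train + f_train, nf_test + f_test)
y_train = [0] * len(nf_train) + [1] * len(f_train)
y_test = [0] * len(nf_test) + [1] * len(f_test)

# 3. Oversample the minority class in the training data only.
minority_train = [x for x, y in zip(X_train, y_train) if y == 1]
n_new = len(nf_train) - len(f_train)
synthetic = smote_like(minority_train, n_new)
X_balanced = X_train + synthetic
y_balanced = y_train + [1] * n_new

print(sum(y_balanced), len(y_balanced) - sum(y_balanced))  # 400 400
```

The test rows are never resampled, so any evaluation still reflects the original, imbalanced class ratio.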