We have been given two days of European credit-card transaction data. For privacy reasons, the personal details are represented as principal components. Besides the principal components, the columns include Amount (the transaction amount) and Time (the seconds elapsed between each transaction and the first transaction in the dataset). Each transaction is either valid or fraudulent. The goal is to build a classifier that detects fraudulent transactions.
We first loaded the data, explored it, and checked for null values. While exploring, we found that the data is highly class-imbalanced: around 99.83% of the transactions are valid and only about 0.17% are fraudulent.
It is not a good idea to train a classifier on such highly imbalanced data: the algorithm can score well simply by assuming that every transaction is valid, without learning anything about fraud. We can either undersample the majority class or oversample the minority class to balance the number of samples per class.
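To make the problem concrete, here is a minimal sketch of what happens when a model ignores the minority class. The label counts are illustrative toy numbers chosen to mirror the 99.83% / 0.17% ratio; they are not the dataset's real row counts.

```python
# Toy labels mirroring the reported class ratio: 0 = valid, 1 = fraudulent.
# The counts are illustrative assumptions, not the real dataset's row counts.
n_valid, n_fraud = 9983, 17
y_true = [0] * n_valid + [1] * n_fraud

# A "classifier" that has learned nothing and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of labels predicted correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of fraudulent transactions that were actually caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_fraud

print(f"accuracy = {accuracy:.4f}")
print(f"recall   = {recall:.4f}")
```

The accuracy looks excellent even though not a single fraud is detected, which is why recall is a more telling metric for this task.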
Before oversampling, we split the data into train and test parts. This prevents data leakage and keeps the test data untouched.
We scaled the Amount and Time features using StandardScaler.
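The split and scaling steps could look roughly like the sketch below. It runs on a small synthetic frame, not the real data: the column V1 stands in for the principal components, and the label column name Class, the 80/20 split, and the stratification are all assumptions. Fitting the scaler on the train part only is also an assumption, made so that no statistics from the test data leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumed column names and sizes).
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "V1": rng.normal(size=n),                        # stand-in principal component
    "Time": rng.uniform(0, 172800, size=n),          # seconds within two days
    "Amount": rng.exponential(scale=80.0, size=n),   # transaction amount
    "Class": np.r_[np.ones(50), np.zeros(950)].astype(int),  # 1 = fraudulent
})

X = df.drop(columns="Class")
y = df["Class"]

# Split first, keeping the class ratio the same in both parts (stratify).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the train part only, then apply it to both parts.
cols = ["Amount", "Time"]
scaler = StandardScaler()
X_train[cols] = scaler.fit_transform(X_train[cols])
X_test[cols] = scaler.transform(X_test[cols])

print(X_train[cols].mean().round(6).to_dict())
```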
We then applied SMOTE (Synthetic Minority Over-sampling Technique) to oversample the train data and formed a new dataset from the resulting over-sampled data points.
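The text does not say which SMOTE implementation was used; a common choice is the SMOTE class from the imbalanced-learn package. The sketch below is not that library. It is a simplified NumPy illustration of the core idea: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The function name smote_sketch and the toy data are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority neighbours.
    Simplified illustration only, not a replacement for a library implementation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)

    # Pairwise Euclidean distances within the minority class.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)   # random base points
    pick = rng.integers(0, k, size=n_new)            # which neighbour to use
    nb = neighbours[base, pick]
    gap = rng.random((n_new, 1))                     # position along the segment
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy train data: 200 valid points and 10 fraudulent points in two dimensions.
rng = np.random.default_rng(1)
X_valid = rng.normal(0.0, 1.0, size=(200, 2))
X_fraud = rng.normal(3.0, 0.5, size=(10, 2))

# Generate enough synthetic fraud points to match the valid class.
X_new = smote_sketch(X_fraud, n_new=len(X_valid) - len(X_fraud))
X_balanced = np.vstack([X_valid, X_fraud, X_new])
y_balanced = np.array([0] * len(X_valid) + [1] * (len(X_fraud) + len(X_new)))

print(X_balanced.shape, y_balanced.mean())
```

Only the train part is oversampled here, in line with the split described earlier; the test data keeps its original class ratio.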
We used grid search over different parameter values: logistic regression classifiers were trained with each combination of these parameters, and the one that yields the least loss on the over-sampled dataset was kept as the best logistic regression classifier. This whole mechanism is implemented internally by sklearn's GridSearchCV.
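A minimal sketch of such a search is shown below. The parameter grid, the five-fold cross-validation, and the neg_log_loss scoring are assumptions, since the text does not list the values actually tried, and the synthetic data from make_classification merely stands in for the over-sampled train set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the over-sampled (balanced) train data.
X_train, y_train = make_classification(
    n_samples=400, n_features=8, random_state=0
)

# Hypothetical grid: inverse regularization strengths to try.
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="neg_log_loss",  # higher is better, so this picks the lowest log loss
    cv=5,                    # each candidate is scored by 5-fold cross-validation
)
grid.fit(X_train, y_train)

# After fitting, the best candidate is refit on all the train data.
best_clf = grid.best_estimator_
print(grid.best_params_, grid.best_score_)
```

GridSearchCV ranks candidates by their mean cross-validated score, so the selected model is the one with the lowest average log loss across the validation folds.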
We then evaluated the best estimator thus obtained on the unseen test data, calculating the recall, confusion matrix, and ROC AUC score.
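The three metrics can be computed with sklearn.metrics as sketched below. The six labels and scores are made-up toy values, small enough to check by hand; with a real model the scores would come from something like predict_proba on the test data, and the 0.5 threshold is an assumption.

```python
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score

# Toy test labels (1 = fraudulent) and predicted fraud probabilities.
y_test = [0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.4]

# Hard predictions at an assumed 0.5 threshold.
y_pred = [int(s >= 0.5) for s in y_score]

recall = recall_score(y_test, y_pred)   # caught frauds / all frauds
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted class
auc = roc_auc_score(y_test, y_score)    # uses the scores, not the hard predictions

print("recall:", recall)                # 0.5
print("confusion matrix:\n", cm)        # [[3 1], [1 1]]
print("ROC AUC:", auc)                  # 0.875
```

Recall and the confusion matrix depend on the chosen threshold, while ROC AUC summarizes how well the scores rank fraudulent above valid transactions across all thresholds.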