Hi, are the slides not accessible anymore? It says drive location changed for me.
Hi Sathish,
It's working fine from my end. Can you please share a screenshot of the issue?
Hi Team,
I was going through the F1-score section on measuring a model's performance, and while browsing I came across different types of F1-scores. When we say F1-score, which one do we mean: the weighted, macro, or micro F1-score?
In which scenario is each type of F1-score used, and what is the reason for using that particular type?
Regards,
Birendra Singh
Please assist with below:
Hi,
It should be OneVsOneClassifier.
Thanks.
Hi,
Since in the 5/not-5 binary classification the random forest gave us the best results, I tried using the same algorithm for multilabel classification too.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
forest_clf = RandomForestClassifier(random_state=42)
y_preds_forest = cross_val_predict(forest_clf, X_train, y_train, cv=3)
The accuracy of this model using cross-validation comes out as below:
Code: cross_val_score(forest_clf, X_train, y_train, cv=3, scoring="accuracy")
Accuracy: [0.94041192, 0.93879694, 0.93949092]
If I then apply the standard scaler, I don't see any change in the accuracy as such:
Code:
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(forest_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
Post Standard Scaler Accuracy: [0.94056189, 0.93909695, 0.93914087]
Does this mean that RandomForestClassifier takes care of the scaling internally? Any other reason that could explain this?
Thanks,
Rohit
Hi,
Good question. Tree-based algorithms do not require scaling of data: each split compares a single feature against a threshold, and standardization does not change the ordering of values within a feature, so the learned trees are essentially unchanged.
Thanks.
Dear CloudxLab,
On performing "ovo_clf.predict([some_digit])" the system predicted the value as "array([4], dtype=int8)".
Here I am a bit confused: has the model predicted the value as "4", or has it predicted the value which is stored in y[4], where y[4] = 9?
Because [some_digit] = X[36000] = int(9), and the answer array([4]) is y[4], whose value is also int(9).
Hi,
array([4], dtype=int8) means it is a NumPy array having an element 4, which is of type int8. So it predicted 4.
Thanks.
Thank You.
So basically the ovo_clf.predict prediction was incorrect.
How to calculate probability in OvO and OvR?
At slide no. 116, how to calculate the score?
Upvote ShareGood question!
Refer this : https://discuss.cloudxlab.com/t/mnist-classification/5400/5
https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
https://scikit-learn.org/stable/modules/multiclass.html#ovr-classification
All the best!
Hi,
Will OvA make the dataset imbalanced? For example, if we take 100 samples for each digit, then for any of the binary datasets it will be 100 vs 900. Will that affect the classifier?
Thanks
Sneha
Hi,
OvA does not mean we are taking a batch of data from the entire dataset. It is a heuristic method for using binary classification algorithms for multi-class classification. It involves splitting the multi-class dataset into multiple binary classification problems. A binary classifier is then trained on each binary classification problem and predictions are made using the model that is the most confident.
For example, given a multi-class classification problem with examples for each class ‘red,’ ‘blue,’ and ‘green,’ this could be divided into three binary classification datasets as follows:
Binary Classification Problem 1: red vs [blue, green]
Binary Classification Problem 2: blue vs [red, green]
Binary Classification Problem 3: green vs [red, blue]
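As a minimal sketch of how scikit-learn wires this up (assuming X_train, y_train, and some_digit are defined as in the notebook; any binary-capable estimator works, here SGDClassifier):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

# Trains one binary classifier per class; predict() picks the class
# whose classifier is most confident about the instance.
ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
ovr_clf.fit(X_train, y_train)
ovr_clf.predict([some_digit])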
Thanks.
Why is digit 5 so relevant in this part?
Upvote ShareHi,
It is not relevant, you can try the same code with any other digit too.
Thanks.
When I am writing the code:
y_train_5 = (y_train == 5)
I am getting all the values in my dataset as False. Why?
And when I am executing the sgd_clf command, I am getting an error.
Please explain.
Hi,
Is this a part of any assessment? If not, would request you to follow the notebook given in our GitHub repository.
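One likely cause, assuming the data was loaded with fetch_openml: it returns the labels as strings, so (y_train == 5) compares '5' against the integer 5, is False everywhere, and leaves the target with a single class. A minimal sketch of the fix:

import numpy as np

y_train = y_train.astype(np.uint8)  # cast string labels like '5' to integers
y_train_5 = (y_train == 5)          # now contains both True and False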
Thanks.
Good Morning,
In slide 41
sgd_clf.fit(X_train, y_train_5) is taken. But we did not define y_train_5 in building the model.
When we run the above code, we get an error showing that y_train_5 is not defined.
Hi,
Would request you to refer to our notebooks from our GitHub repository.
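For reference, y_train_5 is the binary target that the notebook defines before the fit call:

y_train_5 = (y_train == 5)  # True for all 5s, False for every other digit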
Thanks.
Thank you sir.
Upvote ShareThis comment has been removed.
Hi CloudxLab,
I used to see the Jupyter notebook on the right panel. But right now it is not appearing, and I cannot see the "Show Playground" button either. Can you help?
Hi,
This is a lecture video only slide and does not have any assessments, so it does not have any Jupyter notebook on the right.
Thanks.
What is the meaning of "probability of positive class" in a random forest classifier?
Upvote ShareHi,
Here we are detecting the digit "5"; the positive class is when the digit is "5".
Thanks.
Hi,
Unlike my previous sessions, why am I not seeing the Jupyter notebook on the right panel for the classification section?
Hi Sumbul,
Try locating a "Show Playground" button on the top-right corner of your screen. If you are not able to locate it kindly send us a screenshot.
Thanks
Can you please share the training models slides?
Hi Shreya,
The training models slides are available in next session.
Why are we performing standard scaler on X_train? Couldn't understand the logic.
Also at 1:43 hrs why are we normalizing the rows?
Hi,
The idea behind StandardScaler is that it will transform your data such that its distribution will have a mean value 0 and standard deviation of 1. In case of multivariate data, this is done feature-wise (in other words independently for each column of the data). Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information. Normalization is also required for some algorithms to model the data correctly.
For example, assume your input dataset contains one column with values ranging from 0 to 1, and another column with values ranging from 10,000 to 100,000. The great difference in the scale of the numbers could cause problems when you attempt to combine the values as features during modeling.
Normalization avoids these problems by creating new values that maintain the general distribution and ratios in the source data, while keeping values within a scale applied across all numeric columns used in the model.
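A minimal sketch contrasting the two on toy data like the example above (StandardScaler works per column, Normalizer per row):

import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

X = np.array([[1.0, 10000.0], [0.5, 50000.0], [0.0, 100000.0]])
print(StandardScaler().fit_transform(X))  # each column -> mean 0, std 1
print(Normalizer().fit_transform(X))      # each row rescaled to unit L2 norm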
Thanks.
Hi Team,
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
len(precisions) - 59711
len(recalls) - 59711
len(thresholds) - 59710
Why is the length of thresholds one less than the length of precisions and recalls?
Kindly make it clear.
Hi,
Could you tell me which slide are you referring to?
Thanks.
Upvote ShareClassification slide - precision - recall tradeoff - threshold
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
Hi,
These are arrays of numbers; I am not sure how you are getting a single value for each. Would request you to check with the notebook and, after running that cell, print the values of these separately.
Thanks.
Dear sir,
You didn't get my question. Please have a look below at your notebook code:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
After that run this code to calculate the length of each
print('Precision:', precisions.size, '\nRecall:', recalls.size, '\nThreshold:', thresholds.size)
Precision: 59903
Recall: 59903
Threshold: 59902
why threshold one less than the precision and recall ?
Hi,
Would request you to take a look at the following topic in the forum:
https://discuss.cloudxlab.com/t/classification-sgd-classifier-precision-recall/4659
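In short: precision_recall_curve appends one final point (precision = 1, recall = 0) that represents "no instances predicted positive" and has no corresponding threshold, so precisions and recalls are always exactly one element longer than thresholds. A quick check, assuming y_train_5 and y_scores from the notebook:

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
print(precisions[-1], recalls[-1])        # 1.0 0.0 -- the appended end point
print(precisions.size - thresholds.size)  # 1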
Thanks.
Hi CloudxLab Team,
Let me sum up & correct me if I'm wrong:
a) After dividing the given Sample dataset (whatever be the dataset or the scenario), you are dividing it into Train & Test Dataset Samples.
b) Thereafter you are implementing SGD (Stochastic Gradient Descent) which is an iterative Optimization Classifier technique. This technique is carried out IF AND ONLY IF the output is a class of Ordinal or Nominal Data-type or in Binomial Probability scenarios.
c) Thereafter you are using appropriate Performance Metrics i.e. Confusion Matrix, Precision Recall Tradeoff, F1 Score & ROC on each dataset.
d) Based on these Performance Metrics, you select the most appropriate Performance Metric amongst them.
e) Thereafter you do the fine-tuning of the Performance Metrics based on the given dataset sampling scenarios. During this phase, you go a step further and again, DEPENDING ON THE CLASS OUTPUTS, the end-user does further classification of SGD Classifiers i.e. Binary Classifier, Multi-class Classifier, Multi-label Classifier & Multi-Output Classifier. However, the usage of these classifiers is purely dependent on the given business scenario and the sampling data that is shared.
Indications/Usage of these different types of SGD Classifiers depends on the ML Techniques to be incorporated viz.:
a) Binary classifier - For Logistic Regression & Binomial Probability Scenarios
b) Multi-class Classifier - For Random Forests, Naive Bayes Classification
c) Multi-label Classifier - For K-NN
d) Multi-Output Classifier - For K-NN
Aadhar Data, SSN (~ equivalent of Aadhar Card i.e. Social Security Number) Data used in USA, Finger-Print Analysis (Criminology), Facial Visualization, Iris Analysis, Baggage screening at the Airports, Cargo Screening & Clearance at the Customs, Supply-Chain Management etc are examples of Classification ML techniques and SGD can be employed on them with full gusto as the "appropriate technique of choice".
Kindly clarify my mistakes and understanding of this aforesaid concept, wherever I have gone/understood wrongly.
Hi CloudxLab,
I have a doubt while we are comparing RF and SGD using ROC AUC. In the code :
1. You say that RF uses predict_proba instead of decision_function and gives a probability value of identifying the class( i.e is 5 or not 5)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
2. Then you say, we need scores for ROC curve plot, not probability values. So the work around is "use the positive class’s probability as the score:"
y_scores_forest = y_probas_forest[:, 1]
I really didn't get this logic. What do you mean by "use the positive class's probability as the score"? And why? cross_val_predict on RF is already giving us probability scores of both classes. Why are we leaving out the negative class and saying let's take the probabilities of only the positive class and directly consider them as scores to plot the ROC? Please explain with a proper technical explanation.
Hi,
In classification problems, we use two types of algorithms (dependent on the kind of output it creates):
Class output: Algorithms like SVM and KNN create a class output. For instance, in a binary classification problem, the outputs will be either 0 or 1. However, today we have algorithms which can convert these class outputs to probability. But these algorithms are not well accepted by the statistics community.
Probability output: Algorithms like Logistic Regression, Random Forest, Gradient Boosting, Adaboost etc. give probability outputs. Converting probability outputs to class output is just a matter of creating a threshold probability.
Now regarding the ROC:
1. A model which gives a class as output will be represented as a single point in the ROC plot.
2. Such models cannot be compared with each other, as the judgement needs to be taken on a single metric and not using multiple metrics. For instance, a model with parameters (0.2, 0.8) and a model with parameters (0.8, 0.2) can be coming out of the same model, hence these metrics should not be directly compared.
3. In case of a probabilistic model, we were fortunate enough to get a single number, which was AUC-ROC. But still, we need to look at the entire curve to make conclusive decisions. It is also possible that one model performs better in some region and the other performs better in another.
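To make the workaround concrete: roc_curve only needs a score that ranks instances from least to most likely positive, and the positive-class probability does exactly that. predict_proba returns one column per class, and for a binary problem the two columns sum to 1, so the negative-class column carries no extra ranking information. A minimal sketch, assuming y_probas_forest was computed with cross_val_predict as in the notebook:

from sklearn.metrics import roc_curve

y_scores_forest = y_probas_forest[:, 1]  # P(is a 5) for each instance, used as the score
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)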
Hope this addresses your query.
Thanks.
-- Rajtilak Bhattacharjee
Rajtilak, in further continuation to your clarifications, pts 1, 2, & 3 are especially relevant for SGD wherein only 1 point is considered and comparisons cannot be done.
However, in case of Gradient Descent, I believe comparisons can be done between the data points, but it is a rather lengthy and tedious process and therefore seldom used. What is the case for Mini-Batch Gradient Descent? Can one use various points for comparison? What are the indications for its usage?
Sir,
I am unable to import fetch_mldata in the Jupyter notebook while studying classification of the MNIST dataset.
Hi,
Please note that fetch_mldata is deprecated, we have updated our code using fetch_openml. You can find the updated code in our GitHub repository at the below link:
https://github.com/cloudxla...
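A minimal sketch of the updated loading code (assuming scikit-learn >= 0.20; note that fetch_openml returns the labels as strings, hence the cast):

import numpy as np
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
X, y = mnist["data"], mnist["target"]
y = y.astype(np.uint8)  # labels arrive as strings like '5'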
Thanks.
-- Rajtilak Bhattacharjee
Sir,
In the video towards the end it is mentioned that the interview questions will be uploaded soon in the interview preparation blog. May I get the link of the interview preparation blog?
Hi,
Please find the machine learning interview questions here https://cloudxlab.com/blog/...
In the error analysis step:
We are calculating row sums, but axis=1 means column, so maybe it should be axis=0 for row sums?
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
Hi, Queen.
Actually, axis=1 is correct here. In NumPy, axis=1 collapses the columns, giving one sum per row, which is exactly the row sums we want (the features are always the columns; in the rows you have the data points). Each row of the confusion matrix holds the counts for one actual class, so dividing by the row sums normalizes each row.
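A quick sanity check of the axis semantics on a toy array (not the actual confusion matrix):

import numpy as np

a = np.array([[1, 2], [3, 4]])
print(a.sum(axis=0))  # [4 6] -- column sums (collapses the rows)
print(a.sum(axis=1))  # [3 7] -- row sums (collapses the columns)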
All the best!
-- Satyajit Das
Okay, thanks.
1. For finding the confidence of a digit here, how do we know which confidence is better? Is there any limit in this example?
Also, what is the confidence interval we are taking in this problem, like a 95% interval etc.?
Or does the threshold value work the same as a confidence interval?
2. In the previous video for calculating the cross validation score, what is meant by taking 3-fold?
3. Can confusion matrices be used for regression?
Please clarify this, sir.
Hi,
First you need to understand what the confidence of a digit is. It is the probability of predicting that digit correctly, which is basically the accuracy of your model. Now, the video has an in-depth description of how to measure the accuracy of your model, whether to consider accuracy as the measure of success for a model, etc. So, would request you to review the materials once more to gain a better understanding.
Thanks.
-- Rajtilak Bhattacharjee
Okay, thanks sir. Please answer my 2nd query; I have not understood what is meant by 3-fold.
2. In the previous video for calculating the cross validation score, what is meant by taking 3-fold?
# Plotting our results
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.figure(figsize=(16,5))
    # Removing last value to avoid divide by zero in precision computation
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()
ERROR
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-87-7849f863f576> in <module>
9 plt.legend(loc="upper_left")
10
---> 11 plot_precision_recall_vs_threshold(precisions,recalls,thresholds)
12 plt.show()
<ipython-input-87-7849f863f576> in plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
1 #ploting results
2 def plot_precision_recall_vs_threshold(precisions,recalls,thresholds):
----> 3 plt.figure(figsize=(16,5))
4 #last column removed from precisions or recalls to avoid zero division error
5 plt.plot(thresholds,(precisions[:-1]),"b--",label="precisions")
TypeError: 'tuple' object is not callable
Hi,
Kindly write plt.show() inside the function where you are defining the plotting properties. If you write it outside, it will be assessed with the plt object defined inside the function. Kindly refer to the matplotlib plotting documentation.
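For reference, a minimal self-contained version of the helper with plt.show() inside, as suggested (note: a "'tuple' object is not callable" error at plt.figure(...) usually means plt.figure was accidentally rebound to a tuple earlier in the session, so restarting the kernel also clears it):

import matplotlib.pyplot as plt

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.figure(figsize=(16, 5))
    # precisions/recalls have one extra element; drop it to match thresholds
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
    plt.show()  # called inside the function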
All the best!
-- Satyajit Das
Hi Rajtilak,
I was able to complete the assignment with the latest file. I don't think we need a call at this point in time. Thanks for your help. I have added a couple of queries in '44. Project - Spam classifier'. Please have a look.
Hi Punit,
Good to hear that you were able to complete the assignment. We are looking into the spam classifier comments and will get back to you.
Thanks.
-- Rajtilak Bhattacharjee
Thanks Rajtilak. Yes, I observed the modified 'Fetch Data' code. Could you please install urlextract for my user id? Not sure what is to be done for google.colab. Without these two packages the spam classifier project cannot be continued.
Hi Punit,
We are still deciding the best option to go for the Spam Classifier project which would be easiest for our learners. We will get back to you on the same. Meanwhile, would request you to continue with the rest of the course.
Happy learning!
Thanks.
-- Rajtilak Bhattacharjee
Hi,
While practicing I found multiple warnings, and it does not calculate the desired results. In my view you need to review your videos and update them; maybe some methods have been deprecated. These types of warnings make us lose hope in the course and wonder whether we have taken the right course or not.
This is not the first time we are getting this type of warning again and again while practicing.
I request you to please update the videos so that it improves CloudxLab's efficiency and trust.
I am expecting a positive response from you all, so that people do not lose their interest in learning.
Thanks,
Amit
Hi,
Would request you to share a screenshot of the error that you are getting. The second screenshot was not uploaded properly and we are unable to view it. Also, please note that warnings are very common, errors are not. Are you getting a warning or an error? How do you know that you are not getting the desired result? Are you not able to submit your code? If you are stuck somewhere, you can always take a hint or look at the answer to compare with your code and check if it needs any amendment.
Thanks.
-- Rajtilak Bhattacharjee
Hi Rajtilak,
Please find the warning image attached. We found this type of warning message multiple times while practicing and even while giving assessments.
You ask us to take the help of the hint; we do, but warnings always stall my practice and assessments.
Hi,
Could you please tell me how you know that you are not getting the desired result? Are you not able to submit your code? You mentioned that you have taken the hint and looked at the answer. Was it working after that?
Thanks.
-- Rajtilak Bhattacharjee
I am following Sandeep's videos step by step and practicing, but in multiple cases I have not found the desired results, even though the datasets are the same.
As an example, I tried to load the MNIST dataset but was not able to; I searched a lot and then got to know that the method has been deprecated.
So my suggestion was to update the videos (or divide the videos into multiple smaller parts, so that if any deprecation happens only that part needs updating), so that they do not get messed up by deprecated methods or features in the future.
This is only a suggestion; it's totally in your hands whether you take it or not.
Because I am a software guy, and we need to write code in such a manner that it can be extended and modified at any point of time, so why can't we apply this to videos?
Hi,
Thank you very much for your feedback. We really appreciate it. Please note that we have updated the notebooks associated with this course, and they contain the updated code, including the fetch_openml code for Fashion-MNIST. Would request you to clone our GitHub repository to access the latest code.
Thanks.
-- Rajtilak Bhattacharjee
Hi!
I have a query and need your help to clarify it.
When the decision scores are calculated using cross_val_predict, the length of the output is equivalent to the number of instances (i.e. 60,000 in the training set). However, when we calculate the precision and recall for each level of threshold, the length reduces below 60,000. I am curious to understand this behaviour. The code is as below -
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method='decision_function')
print(y_scores.size)
>> 60000
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
print('Precision:', precisions.size, '\nRecall:', recalls.size, '\nThreshold:', thresholds.size)
>> Precision: 59903
>> Recall: 59903
>> Threshold: 59902
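A likely explanation for the drop below 60,000 (based on the scikit-learn documentation): precision_recall_curve produces one point per distinct score value, so n_thresholds <= len(np.unique(y_scores)) and tied decision scores collapse into a single threshold. The extra element in precisions and recalls is the appended (precision = 1, recall = 0) end point, which has no corresponding threshold.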
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42, max_iter=10) # if you want reproducible results set the random_state value.
sgd_clf.fit(X_train, y_train_5)
While executing above getting below error, can you please suggest what is the issue?
ValueError Traceback (most recent call last)
<ipython-input-19-a34e7e88f69f> in <module>
3
4 sgd_clf = SGDClassifier(random_state=42, max_iter=10) # if you want reproducible results set the random_state value.
----> 5 sgd_clf.fit(X_train, y_train_5)
/usr/local/anaconda/lib/python3.6/site-packages/sklearn/linear_model/_stochastic_gradient.py in fit(self, X, y, coef_init, intercept_init, sample_weight)
709 loss=self.loss, learning_rate=self.learning_rate,
710 coef_init=coef_init, intercept_init=intercept_init,
--> 711 sample_weight=sample_weight)
712
713
/usr/local/anaconda/lib/python3.6/site-packages/sklearn/linear_model/_stochastic_gradient.py in _fit(self, X, y, alpha, C, loss, learning_rate, coef_init, intercept_init, sample_weight)
548
549 self._partial_fit(X, y, alpha, C, loss, learning_rate, self.max_iter,
--> 550 classes, sample_weight, coef_init, intercept_init)
551
552 if (self.tol is not None and self.tol > -np.inf
/usr/local/anaconda/lib/python3.6/site-packages/sklearn/linear_model/_stochastic_gradient.py in _partial_fit(self, X, y, alpha, C, loss, learning_rate, max_iter, classes, sample_weight, coef_init, intercept_init)
512 raise ValueError(
513 "The number of classes has to be greater than one;"
--> 514 " got %d class" % n_classes)
515
516 return self
ValueError: The number of classes has to be greater than one; got 1 class
Hi,
Would request you to share a screenshot of your code and the error that you are getting. You can also take a hint, or look at the answer to compare with your code and check where you need to make amends.
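For what it's worth, this particular error means y_train_5 ended up containing only one class (all False). A likely cause, assuming the labels were loaded as strings via fetch_openml, is comparing them against the integer 5; casting first, e.g. y_train = y_train.astype(np.uint8) before y_train_5 = (y_train == 5), typically resolves it.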
Thanks.
-- Rajtilak Bhattacharjee
There are 4 queries. Please respond to all. Let me know how I can share my ipynb so the code can be looked at if need be.
1. Cell no. 11 in my file
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
print(y_train_5)
y[36000]
When I execute the code from the beginning, y[36000] displays an "out of index" error, but when I execute the following code again and then rerun the above code, it executes successfully and returns the following output:
[False False False ... False False False]
5
Code executed again:
Cell no. 2nd from top
X, y = mnist["data"], mnist["target"]
Query: Could you please let me know the reason?
2. Precision / Recall Tradeoff - Thresholds [pdf slide no. - 90]
classification.ipynb cell no. 25 (Coded by cloudxlab)
y_scores = sgd_clf.decision_function([some_digit])
y_scores
Output: array([1206.46829305])
I am getting output as: array([76209.10439098]) [Cell no. 24 in my file]
Why is the output different for the same data, and what is the impact?
3. Precision / Recall Tradeoff - Thresholds pdf slide no. - 92
For the ‘5’ and ‘Not 5’ classifier:
- For threshold = 0, the classifier correctly classifies 5 as 5
- For threshold = 20000, the classifier incorrectly classifies digit 5 as not 5
classification.ipynb cell no. 27 (Coded by cloudxlab)
threshold = 20000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
Output: array([False])
I had to set the threshold to 77000, but for 20000 the result is 'True'.
[Cell no. 26 in my file]
threshold = 77000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
Output: array([False])
Why is the output different for the same data, and what is the impact?
4. Multi-class Classification [pdf slide no. - 117]
classification.ipynb cell no. 62,61,47 (Coded by cloudxlab)
y_train[1000]
Output - 8.0
some_digit_scores = sgd_clf.decision_function([X_train[1000]])
some_digit_scores
Output: array([[-330256.74131412, -408727.27252892, -93759.48808581,
-460978.05122121, -189552.87772983, -281278.72718979,
-272451.01681648, -198320.5270848 , 37284.63986995,
-169374.51720389]])
# The highest score is indeed the one corresponding to class 5:
print ("The index of the maximum score is ", np.argmax(some_digit_scores))
The index of the maximum score is 5
My code and output:
y_train[1000]
Output - 8.0
some_digit_scores = sgd_clf.decision_function([X_train[1000]])
some_digit_scores
Output - array([[-267509.20001414, -394156.44832612, -56146.63583926,
-372654.032655 , -114735.98485769, -217137.99447395,
-302545.78761571, -226519.39466287, -36787.41625735,
-277927.61280844]])
print ("The index of the maximum score is ", np.argmax(some_digit_scores))
The index of the maximum score is 8
Query: I am getting the output as 8. When the decision_function is applied on digit 8 (y_train[1000] and X_train[1000]), how is the maximum score displayed as '5' in your code?
Hi,
Would request you to mail us your notebook with your queries.
Thanks.
-- Rajtilak Bhattacharjee
Thanks Rajtilak. I shared the files.
Upvote ShareHi Punit,
Could you now try again? I have updated the classifications notebook. Please note that in some cases, specifically cross validation and fit() steps, it might take a long time.
Please let me know if it works for you.
--Sandeep Giri
Hi Sandeep,
Thanks for your prompt response. I tried to execute the file but the kernel/server got disconnected multiple times. I did not even reach to CV or FIT. I will try to execute entire notebook tomorrow and get back. Meanwhile can you please explain the below workaround?
# y_scores_1 = y_scores[:,1] # This code was raising exception 'too many indices for array'
# hack to work around issue #9589 in Scikit-Learn 0.19.0
if y_scores.ndim == 2:
    y_scores = y_scores[:, 1]
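As I understand the hack: in Scikit-Learn 0.19.0, cross_val_predict with method="decision_function" could return a two-column array (one score per class) for a binary problem instead of the expected single score per instance. The ndim check keeps only the positive-class column, so the downstream ROC/PR code gets the 1-D scores it expects.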
Hi Punit,
We have made some changes in our notebooks since we last spoke, we would request you to clone our repository once again and work on the updated code. Do let us know if you are still facing any further challenges.
Thanks.
-- Rajtilak Bhattacharjee
Hi Rajtilak,
I cloned the repository first and tried to execute the notebook, but as I mentioned I encountered obstacles. I will try to execute the code as soon as I get time.
Hi Punit,
Take your time, we are always there to help you.
Thanks.
-- Rajtilak Bhattacharjee
Hi,
I again tried to execute the classification notebook. I could execute it till the end of section 1.7. The kernel got disconnected after that.
Could not execute from the below-mentioned section. Please do something.
1.8 Comparison of SGDClassifier and RandomForestClassifier on the basis of ROC-AUC
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
y_probas_forest
Hi Punit,
The kernel getting disconnected issue has got nothing to do with the codes in the notebook. I tried this exact location on my system and it worked fine. However, would request you to share your email to check what is causing the issue on your end.
Thanks.
-- Rajtilak Bhattacharjee
punitnb@gmail.com
Hi Punit,
I can see that you have used more than 70 % of the allocated disk space and due to which you were not able to connect. You can check this using the "df /dev/sda1" or "df -h" commands.
You can delete some of the heavy files or datasets so that Jupyter works seamlessly. On the right-hand side there is a Control Panel option; kindly restart your server, log out and log in again, and restart your kernel.
If it still does not work out, kindly let me know.
All the best!
-- Rajtilak Bhattacharjee
Hi Rajtilak,
Below data is from my home directory
[punitnbxxxx@cxln4 ~]$ du -sk *
8508 cloudxlab_jupyter_notebooks
161600 ml
2268 myproject
[punitnbxxxx@cxln4 ~]$ du -sk
177472
I am using a total of 177 MB, and the command you provided shows the below data.
[punitnb7985@cxln4 ~]$ df /dev/sda1
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sda1      314557500 239049056  75508444  76% /
As per this command it shows 300 GB of data; is that correct? Can you please clarify? This may be for multiple users and with other installed software packages.
Hi Punit,
Hope you are doing well!
After speaking to you the last time, we have made some more changes to the classification notebook. Would request you to comment out the contents of cell 62 before running the notebook. Also, use the rm -r command to remove the scikit_learn_data folder before you proceed. This will be automatically created on the next run. Please let me know if this worked, else we can schedule a one-to-one session to check the issues you are facing.
Thanks.
-- Rajtilak Bhattacharjee
I am good. Hope you are safe. I will do the needful. Yesterday I could not execute the notebook since the mentioned cell took too long to execute. I will take the latest file and try again.
Hi Punit,
Good to hear from you. If this does not resolve your issue, would request you to schedule a meeting with me so that we can go over these using a Hangouts chat. You can book my calendar from the below link:
rajtilak.youcanbook.me
Thanks.
-- Rajtilak Bhattacharjee
Why y_scores_forest = y_probas_forest[:, 1]?
What does the predict_proba() method return?
What is the meaning of the ROC curve?
Hi Sharathchandran,
The answers to all of your queries are there in the tutorial. The predict_proba method returns an array of probabilities with one column per category of the target variable, so y_probas_forest[:, 1] picks the positive-class column to use as a score. The ROC curve plots the true positive rate against the false positive rate at different thresholds. Would suggest you go through the materials once again if required.
All the best.
-- Rajtilak Bhattacharjee
What is the cost function used in this classification problem?
If I am correct, SGD is an algorithm to update the parameters of the model, but that itself is not the cost function. And what is the function (or model) which takes all the 784 pixel values (784 inputs) and outputs the digit?
Hi,
The cost function is the average of the errors between the ground-truth (original) labels and the predicted outputs.
Yes, you are right: SGD (Stochastic Gradient Descent) is an optimization algorithm to reduce the cost function. It is already implemented in sklearn; you just need to import it.
You are using the SGDClassifier model for the classification here.
I recommend you kindly watch the tutorial again!
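For reference, a minimal sketch (this is a scikit-learn fact rather than something stated in the video): the cost function of SGDClassifier is chosen by its loss parameter, and the resulting model is a linear function of the 784 pixel inputs.

from sklearn.linear_model import SGDClassifier

# Default loss="hinge" trains a linear SVM; loss="log" (in this scikit-learn
# version) would train logistic regression instead. Either way the model
# is linear in the 784 pixel inputs.
sgd_clf = SGDClassifier(loss="hinge", random_state=42)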
All the best!
-- Satyajit Das
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3)
Why are we using y_train and not y_multilabel?
Where can I download these slides?
What are the criteria to receive the certificate of course completion?
Upvote ShareHi, Vennela.
You need to complete 60% of course and mandatory projects.
All the best.