Classification

Machine Learning Classification Part 2

Recording of Session

Slides



94 Comments

Hi, are the slides not accessible anymore? It says the drive location has changed for me.

  Upvote    Share

Hi Sathish,

It's working fine from my end. Can you please share a screenshot of the issue?

  Upvote    Share

Hi Team,

I was going through the F1-score section on measuring a model's performance, and while browsing I came across several different types of F1-score. Basically, when we say F1-score, which one do we mean: the weighted, macro, or micro F1-score?

In which scenario is each type of F1-score used, and what is the reason for using that particular type?

 

Regards,

Birendra Singh

  Upvote    Share
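For reference on the question above: scikit-learn exposes all three variants through the average parameter of f1_score. A minimal sketch with hypothetical labels:

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 2, 1]
y_pred = [0, 0, 2, 2, 1, 1]

print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average="micro"))     # global F1 from total TP/FP/FN
print(f1_score(y_true, y_pred, average="weighted"))  # per-class F1 weighted by support

Roughly: macro treats every class as equally important, micro aggregates all true/false positives and negatives so frequent classes dominate, and weighted averages the per-class F1 by class support, a common choice for reporting on imbalanced datasets. For binary problems the default average="binary" simply reports the F1 of the positive class.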

Please assist with below:

 

  Upvote    Share

Hi,

It should be OneVsOneClassifier.

Thanks.

  Upvote    Share

Hi,

Since the random forest gave us the best results in the '5 vs not-5' binary classification, I tried using the same algorithm for the full multiclass problem too.

 

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

forest_clf = RandomForestClassifier(random_state=42)
y_preds_forest = cross_val_predict(forest_clf, X_train, y_train, cv=3)

 

Using the cross-validation score as below, the accuracy of this model comes out to be:

Code: cross_val_score(forest_clf, X_train, y_train, cv=3, scoring="accuracy")

Accuracy: [0.94041192, 0.93879694, 0.93949092]

If I then apply the standard scaler, I don't see any change in the accuracy as such:

Code: 

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(forest_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")

Post Standard Scaler Accuracy: [0.94056189, 0.93909695, 0.93914087]

 

Does this mean that a RandomForestClassifier takes care of the scaling internally? Is there any other reason that could explain this?

Thanks,

Rohit

 1  Upvote    Share

Hi,

Good question. Tree-based algorithms do not require scaling of the data: a decision tree splits on per-feature thresholds, and standardization is a monotonic transformation that preserves the ordering of values within each feature, so the learned splits (and hence the accuracy) are essentially unchanged.

Thanks.

 2  Upvote    Share

Dear CloudxLab,

On performing ovo_clf.predict([some_digit]), the system predicted the value as array([4], dtype=int8).

Here I am a bit confused: has the model predicted the value 4, or has it predicted the value stored in y[4], where y[4] = 9?

Because [some_digit] = X[36000] = 9, and if the answer array([4]) refers to y[4], its value is also 9.

 

  Upvote    Share

Hi,

array([4], dtype=int8) means it is a NumPy array having an element 4, which is of type int8. So it predicted 4.

Thanks.

  Upvote    Share

Thank You. 

So basically the ovo_clf.predict  prediction was incorrect.

  Upvote    Share

How do we calculate probabilities in OvO and OvR?

  Upvote    Share

Hi,

Will OvA make the dataset imbalanced? For example, if we take 100 samples for each digit, then each binary dataset will be 100 vs 900. Will that affect the classifier?

Thanks

Sneha

  Upvote    Share

Hi,

OvA does not mean we are taking a batch of data from the entire dataset. It is a heuristic for using binary classification algorithms on a multi-class problem: the multi-class dataset is split into multiple binary classification problems, a binary classifier is trained on each one, and predictions are made using the model that is the most confident.

For example, given a multi-class classification problem with examples for each of the classes 'red', 'blue', and 'green', this could be divided into three binary classification datasets as follows (see the sketch after the list):

  • Binary Classification Problem 1: red vs [blue, green]
  • Binary Classification Problem 2: blue vs [red, green]
  • Binary Classification Problem 3: green vs [red, blue]
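In scikit-learn you rarely build these splits by hand; a minimal sketch, assuming X_train, y_train and some_digit as in the notebook:

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.linear_model import SGDClassifier

# OvA / OvR: one "this class vs the rest" binary model per class
ova_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
# OvO: one binary model per pair of classes, i.e. N*(N-1)/2 models
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))

ova_clf.fit(X_train, y_train)
ova_clf.predict([some_digit])

And yes, each OvA sub-problem is indeed imbalanced (roughly 1:9 for the MNIST digits); OvO avoids this by training each model only on the data of the two classes involved.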

Thanks.

  Upvote    Share

Why is digit 5 so relevant in this part?

  Upvote    Share

Hi,

It is not relevant, you can try the same code with any other digit too.

Thanks.

  Upvote    Share

When I am writing the code:

y_train_5 = (y_train == 5)

I am getting all False values in y_train_5. Why?

Because of this, when I execute the SGD classifier's fit, I get the error:

"The number of classes has to be greater than one; got 1 class"

Please explain.

  Upvote    Share

Hi,

Is this a part of any assessment? If not, would request you to follow the notebook given in our GitHub repository. A likely cause of what you are seeing: with fetch_openml the MNIST labels are loaded as strings, so (y_train == 5) is False everywhere and only a single class remains for the classifier; see the sketch below.
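A minimal sketch of the fix, assuming MNIST was loaded with fetch_openml as in the updated notebook:

import numpy as np
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
X, y = mnist["data"], mnist["target"]
y = y.astype(np.uint8)      # the labels arrive as strings; cast them to integers

y_train, y_test = y[:60000], y[60000:]
y_train_5 = (y_train == 5)  # now a proper True/False mask instead of all False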

Thanks.

  Upvote    Share

Good Morning,

In slide 41

sgd_clf.fit(X_train, y_train_5) is used, but y_train_5 was not defined while building the model.

When we run the above code, we get an error showing that y_train_5 is not defined.

  Upvote    Share

Hi,

Would request you to refer to our notebooks from our GitHub repository.

Thanks.

  Upvote    Share

Thank you sir.

  Upvote    Share


Hi CloudxLab,

I used to see the Jupyter notebook on the right panel. But right now it is not appearing, and I cannot see the "Show Playground" button either. Can you help?

  Upvote    Share

Hi,

This is a lecture-video-only slide and does not have any assessments, so there is no Jupyter notebook on the right.

Thanks.

  Upvote    Share

What is the meaning of the "probability of the positive class" in a random forest classifier?

  Upvote    Share

Hi,

Here we are detecting the digit "5"; the positive class corresponds to the image being a "5", so the "probability of the positive class" is the model's estimated probability that the image is a 5.

Thanks.

  Upvote    Share

hi

Unlike in my previous sessions, why am I not seeing the Jupyter notebook on the right panel for the classification section?

  Upvote    Share

Hi Sumbul,

Try locating a "Show Playground" button on the top-right corner of your screen. If you are not able to locate it kindly send us a screenshot.

Thanks

  Upvote    Share

Can you please share training models slides?

  Upvote    Share

Hi Shreya,

The training models slides are available in the next session.

 

  Upvote    Share

Why are we applying StandardScaler on X_train? I couldn't understand the logic.

Also, at 1:43 hrs, why are we normalizing the rows?

  Upvote    Share

Hi,

The idea behind StandardScaler is that it transforms your data so that its distribution has a mean of 0 and a standard deviation of 1. For multivariate data this is done feature-wise, in other words independently for each column of the data. Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not look more or less like standard normally distributed data (e.g., Gaussian with zero mean and unit variance).

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values or losing information. Normalization is also required for some algorithms to model the data correctly.

For example, assume your input dataset contains one column with values ranging from 0 to 1, and another column with values ranging from 10,000 to 100,000. The great difference in the scale of the numbers could cause problems when you attempt to combine the values as features during modeling.

Normalization avoids these problems by creating new values that maintain the general distribution and ratios of the source data, while keeping values within a scale applied across all numeric columns used in the model (see the sketch below).
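A minimal sketch of feature-wise standardization, assuming X_train as in the notebook:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))

print(X_train_scaled.mean(axis=0)[:5])  # close to 0 for every feature
print(X_train_scaled.std(axis=0)[:5])   # close to 1 (all-zero pixel columns stay at 0)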

Thanks.

  Upvote    Share

Hi Team,

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

len(precisions) -  59711

len(recalls)  -    59711

len(thresholds)  -  59710

Why is the length of thresholds one less than the lengths of precisions and recalls?

Kindly make it clear.

  Upvote    Share

Hi,

Could you tell me which slide you are referring to?

Thanks.

  Upvote    Share

Classification slide -    precision - recall tradeoff - threshold

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

  Upvote    Share

Hi,

These are arrays of numbers; I am not sure how you are getting a single value for each. Would request you to check with the notebook and, after running that cell, print the values of these separately.

Thanks.

  Upvote    Share

Dear sir,

You didn't get my question; please have a look at the code below from your notebook:

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

(array([0.09078881, 0.09077359, 0.09077511, ..., 1.        , 1.        ,
        1.        ]),
 array([1.00000000e+00, 9.99815532e-01, 9.99815532e-01, ...,
        3.68935621e-04, 1.84467810e-04, 0.00000000e+00]),
 array([-916672.35378949, -915970.13258695, -915874.26060129, ...,
         504195.67830029,  523476.13678117,  533971.35573232]))

After that, run this code to calculate the length of each:

print('Precision:', precisions.size, '\nRecall:', recalls.size, '\nThreshold:', thresholds.size)

Precision: 59903
Recall: 59903
Threshold: 59902

Why is thresholds one element shorter than precisions and recalls?

  Upvote    Share

Hi,

This is by design: precision_recall_curve appends one final point (precision = 1, recall = 0) that has no corresponding threshold, so precisions and recalls always have exactly one more element than thresholds. Also, thresholds contains only the distinct decision-score values, which is why its length is below 60,000. For more discussion, would request you to take a look at the following topic in the forum:

https://discuss.cloudxlab.com/t/classification-sgd-classifier-precision-recall/4659

Thanks.

  Upvote    Share

Hi Cloud X Team,

Let me sum up; correct me if I'm wrong:

a) After obtaining the given sample dataset (whatever the dataset or scenario), you divide it into train and test dataset samples.
b) Thereafter you implement SGD (Stochastic Gradient Descent), an iterative optimization technique for training the classifier. This technique is carried out if and only if the output is a class of ordinal or nominal data type, or in binomial probability scenarios.
c) Thereafter you use appropriate performance metrics, i.e. confusion matrix, precision/recall tradeoff, F1 score and ROC, on each dataset.
d) Based on these performance metrics, you select the most appropriate one among them.
e) Thereafter you fine-tune the performance metrics based on the given dataset sampling scenarios. During this phase you go a step further and, again depending on the class outputs, the end user does a further classification of SGD classifiers, i.e. binary classifier, multi-class classifier, multi-label classifier and multi-output classifier. However, the usage of these classifiers is purely dependent on the given business scenario and the sample data that is shared.

Indications/usage of these different types of SGD classifiers depends on the ML techniques to be incorporated, viz.:
a) Binary classifier - for Logistic Regression and binomial probability scenarios
b) Multi-class classifier - for Random Forests, Naive Bayes classification
c) Multi-label classifier - for k-NN
d) Multi-output classifier - for k-NN

Aadhaar data, SSN (the US equivalent of the Aadhaar card, i.e. Social Security Number) data, fingerprint analysis (criminology), facial visualization, iris analysis, baggage screening at airports, cargo screening and clearance at customs, supply-chain management etc. are examples of classification ML applications, and SGD can be employed on them with full gusto as the "appropriate technique of choice".

Kindly point out my mistakes in understanding the aforesaid concepts, wherever I have gone wrong.

  Upvote    Share

Hi CloudXlab,

I have a doubt while we are comparing RF and SGD using ROC AUC. In the code :

1. You say that RF uses predict_proba instead of decision_function and gives a probability value of identifying the class (i.e. 5 or not 5).

y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")

2. Then you say we need scores for the ROC curve plot, not probability values. So the workaround is to "use the positive class's probability as the score":
y_scores_forest = y_probas_forest[:, 1]

I really didn't get this logic. What do you mean by using the positive class's probability as the score, and why? cross_val_predict on RF already gives us the probabilities of both classes. Why are we leaving out the negative class and taking only the positive-class probabilities directly as scores to plot the ROC? Please explain with a proper technical explanation.

  Upvote    Share

Hi,

In classification problems we use two types of algorithms, depending on the kind of output they produce:

Class output: Algorithms like SVM and KNN produce a class output. For instance, in a binary classification problem, the output is either 0 or 1. There are algorithms today which can convert these class outputs to probabilities, but they are not well accepted by the statistics community.

Probability output: Algorithms like Logistic Regression, Random Forest, Gradient Boosting, AdaBoost etc. give probability outputs. Converting a probability output to a class output is just a matter of choosing a threshold probability.

Now regarding the ROC:

1. A model which gives a class as output is represented as a single point in the ROC plot.

2. Such models cannot be compared with each other, as the judgement needs to be made on a single metric and not using multiple metrics. For instance, a model with parameters (0.2, 0.8) and a model with parameters (0.8, 0.2) can come out of the same model family, hence these metrics should not be directly compared.

3. In the case of a probabilistic model, we are fortunate to get a single number, the AUC-ROC. But we still need to look at the entire curve to make conclusive decisions; it is also possible that one model performs better in some regions and another performs better in others. (See the sketch below.)
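A minimal sketch, assuming y_train_5 and y_probas_forest as computed above. roc_curve only needs one score per instance that ranks positives above negatives, and the positive-class probability is exactly such a score (column 0 is just 1 minus column 1, so it carries no extra information):

from sklearn.metrics import roc_curve, roc_auc_score

# column 0 = P(not 5), column 1 = P(5); keep the positive-class column as the score
y_scores_forest = y_probas_forest[:, 1]

fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)
print(roc_auc_score(y_train_5, y_scores_forest))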

Hope this addresses your query.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Rajtilak, in further continuation of your clarifications: points 1, 2 and 3 are especially relevant for SGD, wherein only one point is considered and comparisons cannot be done.
However, in the case of Gradient Descent I believe comparisons can be done between the data points, though it is a rather lengthy and tedious process and therefore seldom used. What is the case for Mini-Batch Gradient Descent? Can one use various points for comparison? What are the indications for its usage?

  Upvote    Share

Sir,
I am unable to import fetch_mldata in the Jupyter notebook while studying the classification of the MNIST dataset.

  Upvote    Share

Hi,

Please note that fetch_mldata is deprecated; we have updated our code to use fetch_openml. You can find the updated code in our GitHub repository at the below link:
https://github.com/cloudxla...
Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Sir,
In the video, towards the end, it is mentioned that the interview questions will be uploaded soon to the interview preparation blog. May I get the link to the interview preparation blog?

  Upvote    Share

Hi,

Please find the machine learning interview questions here https://cloudxlab.com/blog/...

  Upvote    Share

In the error analysis step:
We are calculating row sums, but doesn't axis=1 mean columns? So maybe it should be axis=0 for row sums?

row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

  Upvote    Share

Hi, Queen.

Actually, the code is correct as it is: in NumPy, axis=1 sums across the columns within each row, so conf_mx.sum(axis=1, keepdims=True) produces the row sums (one total per actual class). Dividing the confusion matrix by these row sums normalizes each row, so that we can compare error rates instead of absolute error counts. A tiny sketch of the axis semantics is below.
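A tiny sketch with a hypothetical 2x2 confusion matrix:

import numpy as np

conf_mx = np.array([[8, 2],
                    [1, 9]])

print(conf_mx.sum(axis=0))                 # [ 9 11] -> column sums
print(conf_mx.sum(axis=1, keepdims=True))  # [[10] [10]] -> row sums
norm_conf_mx = conf_mx / conf_mx.sum(axis=1, keepdims=True)  # each row now sums to 1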

All the best!

-- Satyajit Das

  Upvote    Share

Okay, thanks.

  Upvote    Share

1. For finding the confidence of a digit here, how do we know which confidence is better? Is there any limit in this example? Also, what confidence interval are we taking in this problem, e.g. a 95% interval? Or does the threshold value work the same way as the confidence interval we take?

2. In the previous video, for calculating the cross-validation score, what is meant by taking 3-fold?

3. Can confusion matrices be used for regression?

Please clarify this, sir.

  Upvote    Share

Hi,

First you need to understand what the confidence of a digit is. It is the probability of predicting that digit correctly, which is basically the accuracy of your model. The video has an in-depth description of how to measure the accuracy of your model, whether to consider accuracy as the measure of success for a model, etc. So, would request you to review the materials once more to gain a better understanding.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Okay, thanks sir. Please answer my 2nd query; I have not understood what is meant by 3-fold.
2. In the previous video, for calculating the cross-validation score, what is meant by taking 3-fold?

  Upvote    Share
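For reference on the 3-fold question: cross-validation with cv=3 splits the training set into 3 equal folds, trains the model on two folds and evaluates it on the held-out third, and repeats this three times so that each fold serves once as the validation set, producing three scores. A minimal sketch, assuming sgd_clf, X_train and y_train_5 as in the notebook:

from sklearn.model_selection import cross_val_score

# three train/validate rounds -> three accuracy values, one per held-out fold
scores = cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
print(scores)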

# Plotting our results

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.figure(figsize=(16,5))
    # Removing last value to avoid divide by zero in precision computation
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

ERROR
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-87-7849f863f576> in <module>
9 plt.legend(loc="upper_left")
10
---> 11 plot_precision_recall_vs_threshold(precisions,recalls,thresholds)
12 plt.show()

<ipython-input-87-7849f863f576> in plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
1 #ploting results
2 def plot_precision_recall_vs_threshold(precisions,recalls,thresholds):
----> 3 plt.figure(figsize=(16,5))
4 #last column removed from precisions or recalls to avoid zero division error
5 plt.plot(thresholds,(precisions[:-1]),"b--",label="precisions")

TypeError: 'tuple' object is not callable

  Upvote    Share

Hi,
Kindly write plt.show() inside the function where you are defining the plotting properties; if you call it outside, it will be assessed against the plt state defined inside the function. Also, the TypeError itself ('tuple' object is not callable at plt.figure(...)) suggests that the name plt.figure was accidentally overwritten with a tuple somewhere earlier in the notebook; restarting the kernel and re-running the corrected cell should clear it. Kindly refer to the matplotlib plotting documentation.

All the best!

-- Satyajit Das

  Upvote    Share

Hi Rajtilak,

I was able to complete the assignment with the latest file. I don't think we need a call at this point in time. Thanks for your help. I have added a couple of queries in '44. Project - Spam classifier'. Please have a look.

  Upvote    Share

Hi Punit,

Good to hear that you were able to complete the assignment. We are looking into the spam classifier comments and will get back to you.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Thanks Rajtilak. Yes, I observed the modified 'Fetch Data' code. Could you please install urlextract for my user ID? I am not sure what is to be done for google.colab. Without these two packages the spam classifier project cannot be continued.

  Upvote    Share

Hi Punit,

We are still deciding on the best option for the Spam Classifier project that would be easiest for our learners. We will get back to you on the same. Meanwhile, would request you to continue with the rest of the course.
Happy learning!

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Hi @disqus_XTh3bUKOBh,

While practicing I found multiple warnings, and the code did not produce the desired results. In my view you need to review your videos and update them; maybe some methods have been deprecated. These kinds of warnings make us doubt whether we have taken the right course or not.

This is not the first time we are getting this type of warning while practicing.

I request you to please update the videos, as that would improve CloudxLab's efficiency and trust.

I am expecting a positive response from you all, so that people do not lose their interest in learning.

Thanks
Amit

  Upvote    Share

Hi,

Would request you to share a screenshot of the error that you are getting. The second screenshot was not uploaded properly and we are unable to view it. Also, please note that warnings are very common, errors are not. Are you getting a warning or an error? How do you know that you are not getting the desired result? Are you not able to submit your code? If you are stuck somewhere, you can always take a hint or look at the answer to compare with your code and check if it needs any amendment.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Hi Rajtilak,

Please find the warning image attached. We found this type of warning message multiple times, while practicing and even while giving assessments.
You ask us to take the help of hints, and we do, but warnings always stall my practice and assessments.

  Upvote    Share

Hi,

Could you please tell me how you know that you are not getting the desired result? Are you not able to submit your code? You mentioned that you took the hint and looked at the answer. Did it work after that?
Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

I am following Sandeep's videos step by step and practicing, but in multiple cases I did not get the desired results, even though the datasets are the same. As an example, I tried to load the MNIST dataset but was not able to; I searched a lot and then got to know that the method had been deprecated.

So my suggestion is to update the videos (or divide them into multiple smaller parts so that if any deprecation happens, only that part needs updating) so that they do not get out of sync with deprecated methods or features in future.

This is only a suggestion; it is totally in your hands whether you take it or not.

I am a software guy, and we need to write code in such a manner that it can be extended and modified at any point of time, so why not apply this to videos?

  Upvote    Share

Hi,

Thank you very much for your feedback. We really appreciate it. Please note that we have updated the notebooks associated with this course, and they contain the updated code, including the fetch_openml code for Fashion-MNIST. Would request you to clone our GitHub repository to access the latest code.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Hi @disqus_XTh3bUKOBh,

I have a query and need your help to clarify it.

When the decision scores are calculated using cross_val_predict, the length of the output equals the number of instances (i.e. 60,000 in the training set). However, when we calculate the precision and recall for each threshold level, the length drops below 60,000. I am curious to understand this behaviour. The code is as below -

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method='decision_function')
print(y_scores.size)
>> 60000

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
print('Precision:', precisions.size, '\nRecall:', recalls.size, '\nThreshold:', thresholds.size)
>> Precision: 59903
>> Recall: 59903
>> Threshold: 59902

  Upvote    Share

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42, max_iter=10) # if you want reproducible results set the random_state value.
sgd_clf.fit(X_train, y_train_5)

While executing the above I am getting the below error; can you please suggest what the issue is?

ValueError Traceback (most recent call last)
<ipython-input-19-a34e7e88f69f> in <module>
3
4 sgd_clf = SGDClassifier(random_state=42, max_iter=10) # if you want reproducible results set the random_state value.
----> 5 sgd_clf.fit(X_train, y_train_5)

/usr/local/anaconda/lib/python3.6/site-packages/sklearn/linear_model/_stochastic_gradient.py in fit(self, X, y, coef_init, intercept_init, sample_weight)
709 loss=self.loss, learning_rate=self.learning_rate,
710 coef_init=coef_init, intercept_init=intercept_init,
--> 711 sample_weight=sample_weight)
712
713

/usr/local/anaconda/lib/python3.6/site-packages/sklearn/linear_model/_stochastic_gradient.py in _fit(self, X, y, alpha, C, loss, learning_rate, coef_init, intercept_init, sample_weight)
548
549 self._partial_fit(X, y, alpha, C, loss, learning_rate, self.max_iter,
--> 550 classes, sample_weight, coef_init, intercept_init)
551
552 if (self.tol is not None and self.tol > -np.inf

/usr/local/anaconda/lib/python3.6/site-packages/sklearn/linear_model/_stochastic_gradient.py in _partial_fit(self, X, y, alpha, C, loss, learning_rate, max_iter, classes, sample_weight, coef_init, intercept_init)
512 raise ValueError(
513 "The number of classes has to be greater than one;"
--> 514 " got %d class" % n_classes)
515
516 return self

ValueError: The number of classes has to be greater than one; got 1 class

  Upvote    Share

Hi,

Would request you to share a screenshot of your code and the error that you are getting. You can also take a hint, or look at the answer to compare with your code and check where you need to make amends. (As noted above, a common cause of this particular error is that the labels are strings after fetch_openml, so y_train_5 = (y_train == 5) is all False; compare with '5' instead, or cast the labels to integers first.)

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

There are 4 queries. Please respond to all. Let me know how I can share my ipynb so the code can be looked at if need be.

1. Cell no. 11 in my file

y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
print(y_train_5)
y[36000]

When I execute the code from the beginning, y[36000] displays the error "out of index", but when I execute the following code again and then rerun the above code, it executes successfully and returns the following output:
[False False False ... False False False]
5

Code executed again:

Cell no. 2nd from top

X, y = mnist["data"], mnist["target"]

Query: Could you please let me know the reason?

2. Precision / Recall Tradeoff - Thresholds [pdf slide no. - 90]

classification.ipynb cell no. 25 (Coded by cloudxlab)

y_scores = sgd_clf.decision_function([some_digit])
y_scores

Output: array([1206.46829305])

I am getting output as: array([76209.10439098]) [Cell no. 24 in my file]

Why is the output different for the same data, and what is the impact?

3. Precision / Recall Tradeoff - Thresholds pdf slide no. - 92

For the ‘5’ and ‘Not 5’ classifier:
  • For threshold = 0, the classifier correctly classifies 5 as 5
  • For threshold = 20000, the classifier incorrectly classifies digit 5 as not 5

classification.ipynb cell no. 27 (Coded by cloudxlab)

threshold = 20000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

Output: array([False])

I had to set the threshold to 77000, whereas for 20000 the result is 'True'
[Cell no. 26 in my file]

threshold = 77000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

Output: array([False])

Why is the output different for the same data, and what is the impact?

4. Multi-class Classification [pdf slide no. - 117]
classification.ipynb cell no. 62,61,47 (Coded by cloudxlab)
y_train[1000]

Output - 8.0

some_digit_scores = sgd_clf.decision_function([X_train[1000]])
some_digit_scores

Output: array([[-330256.74131412, -408727.27252892, -93759.48808581,
-460978.05122121, -189552.87772983, -281278.72718979,
-272451.01681648, -198320.5270848 , 37284.63986995,
-169374.51720389]])

# The highest score is indeed the one corresponding to class 5:
print ("The index of the maximum score is ", np.argmax(some_digit_scores))

The index of the maximum score is 5

My code and output:

y_train[1000]

Output - 8.0

some_digit_scores = sgd_clf.decision_function([X_train[1000]])
some_digit_scores

Output - array([[-267509.20001414, -394156.44832612, -56146.63583926,
-372654.032655 , -114735.98485769, -217137.99447395,
-302545.78761571, -226519.39466287, -36787.41625735,
-277927.61280844]])

print ("The index of the maximum score is ", np.argmax(some_digit_scores))

The index of the maximum score is 8

Query: I am getting the output as 8. When decision_function is applied on digit 8 (y_train[1000] and X_train[1000]), how is the maximum score displayed as '5' in your code?

  Upvote    Share

Hi,

Would request you to mail us your notebook with your queries. (In general, small differences in decision scores across environments are expected: SGD depends on the library version and the exact order in which it sees the training data, so scores and hand-tuned thresholds may not transfer between setups.)

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Thanks Rajtilak. I shared the files.

  Upvote    Share

Hi Punit,

Could you try again now? I have updated the classification notebook. Please note that some cells, specifically the cross-validation and fit() steps, might take a long time.

Please let me know if it works for you.
--Sandeep Giri

  Upvote    Share

Hi Sandeep,

Thanks for your prompt response. I tried to execute the file but the kernel/server got disconnected multiple times; I did not even reach the CV or fit steps. I will try to execute the entire notebook tomorrow and get back. Meanwhile, can you please explain the workaround below?

# y_scores_1 = y_scores[:, 1]  # This code was raising exception 'too many indices for array'

# hack to work around issue #9589 in Scikit-Learn 0.19.0
if y_scores.ndim == 2:
    y_scores = y_scores[:, 1]

  Upvote    Share

Hi Punit,

We have made some changes in our notebooks since we last spoke, we would request you to clone our repository once again and work on the updated code. Do let us know if you are still facing any further challenges.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Hi Rajtilak,
I cloned the repository first and tried to execute the notebook, but as I mentioned, I encountered obstacles. I will try to execute the code as soon as I get time.

  Upvote    Share

Hi Punit,

Take your time, we are always there to help you.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Hi,

I again tried to execute the classification notebook. I could execute it till the end of section 1.7; the kernel got disconnected after that.

Could not execute from the below-mentioned section onwards. PLEASE DO SOMETHING.

1.8 Comparison of SGDClassifier and RandomForestClassifier on the basis of ROC-AUC

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
y_probas_forest

  Upvote    Share

Hi Punit,

The kernel disconnection issue has nothing to do with the code in the notebook; I tried this exact notebook on my system and it worked fine. However, would request you to share your email so we can check what is causing the issue on your end.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

punitnb@gmail.com

  Upvote    Share

Hi Punit,

I can see that you have used more than 70% of the allocated disk space, which is why you were not able to connect. You can check this using the "df /dev/sda1" or "df -h" commands.

You can delete some of the heavy files or datasets so that Jupyter works seamlessly. On the right-hand side there is a control panel option; kindly restart your server, log out and log in again, and restart your kernel.
If it still does not work, kindly let me know.

All the best!

-- Rajtilak Bhattacharjee

  Upvote    Share

Hi Rajtilak,

Below is the data from my home directory:

[punitnbxxxx@cxln4 ~]$ du -sk *
8508      cloudxlab_jupyter_notebooks
161600    ml
2268      myproject
[punitnbxxxx@cxln4 ~]$ du -sk
177472

I am using 177 MB in total, and the command you provided shows the data below.

[punitnb7985@cxln4 ~]$ df /dev/sda1
Filesystem     1K-blocks      Used  Available Use% Mounted on
/dev/sda1      314557500 239049056   75508444  76% /

As per this command it shows about 300 GB; is that correct? Can you please clarify? This may be for multiple users and include other installed software packages.

  Upvote    Share

Hi Punit,

Hope you are doing well!

After speaking to you the last time, we have made some more changes to the classification notebook. Would request you to comment out the contents of cell 62 before running the notebook. Also, use the rm -r command to remove the scikit_learn_data folder before you proceed. This will be automatically created on the next run. Please let me know if this worked, else we can schedule a one-to-one session to check the issues you are facing.

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

I am good. Hope you are safe. I will do the needful. Yesterday I could not execute the notebook since the mentioned cell took too long to execute. I will take the latest file and try again.

  Upvote    Share

Hi Punit,

Good to hear from you. If this does not resolve your issue, would request you to schedule a meeting with me so that we can go over these using a Hangout chat. You can book my calendar from the below link:

rajtilak.youcanbook.me

Thanks.

-- Rajtilak Bhattacharjee

  Upvote    Share

Why y_scores_forest = y_probas_forest[:, 1]?
What does the predict_proba() method return?
What is the meaning of the ROC curve?

  Upvote    Share


Hi Sharathchandran,

The answers to all of your queries are there in the tutorial. The predict_proba method returns, for each row, an array of probabilities with one entry per category of the target variable; y_probas_forest[:, 1] selects the positive-class column to use as a score. The ROC curve plots the true positive rate against the false positive rate at various thresholds, and the area under it (AUC) summarizes it as a single metric. Would suggest you go through the materials once again if required.

All the best.

-- Rajtilak Bhattacharjee

  Upvote    Share

What is the cost function used in this classification problem?
If I am correct, SGD is an algorithm to update the parameters of the model, but it is not itself the cost function. And what is the function (or model) that takes all 784 pixel values (784 inputs) and outputs the digit?

  Upvote    Share


Hi,

The cost function is the average of the errors between the ground-truth (original) labels and the predicted outputs.

Yes, you are right: SGD (stochastic gradient descent) is an optimization algorithm that reduces the cost function, and it is already implemented in sklearn; you just need to import it. By default SGDClassifier minimizes the hinge loss (i.e. it trains a linear SVM), and the model itself is linear: it computes a score from the 784 pixel inputs per class and picks the class with the highest score (see the sketch below).

You are using the SGDClassifier model for the classification here. I recommend you watch the tutorial again!
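A minimal sketch of the optimizer/loss separation; the loss parameter is standard sklearn (note that "log_loss" was named "log" in versions before 1.1):

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(loss="hinge", random_state=42)     # default: linear SVM loss
log_clf = SGDClassifier(loss="log_loss", random_state=42)  # logistic-regression loss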

All the best!

-- Satyajit Das

  Upvote    Share

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3)

Why are we using y_train and not y_multilabel?

 1  Upvote    Share

Where can I download these slides?

  Upvote    Share

What are the criteria to receive the certificate of course completion?

  Upvote    Share

Hi, Vennela.
You need to complete 60% of the course and the mandatory projects.

All the best.

  Upvote    Share
