The statement "Gini impurity tends to isolate the most frequent class in its own branch of the tree" means that when constructing a decision tree using Gini impurity as the splitting criterion, the algorithm tends to create splits that quickly separate the dominant class (the one that occurs most frequently in the dataset) into its own branch. This happens because Gini impurity measures how often a randomly chosen element would be incorrectly classified, and it prefers splits that reduce this impurity as much as possible. When there is a class that occurs frequently, the algorithm finds it more effective to separate that class early, which reduces the overall Gini impurity for subsequent splits. This behavior can lead to shorter, simpler branches for the dominant class while the less frequent classes may require more splits to isolate accurately.
Thank you for your guidance
My data has similar properties to Iris, except that the target is continuous
The numbers are continuous because of these conjunctions
I gave the target a point in a certain limit so that the putty tree could predict the decision, but again it gave a snort
Thank you
Good question! You will notice that decision boundaries made by a decision tree are parallel to the axis.The reason for this is that a Decision tree splits the data based on a feature value and this value would remain constant throughout for one decision boundary. It is because of this that they are sensitive to data rotation. Data that could easily be split by a single diagonal is split by multiple decision boundaries, a solution would be to have a simpler model that would generalize better for this particular problem.
Sir, what you've said, that I've understood, but what I don't understand is what may cause the data to rotate. What is the real life implication of data being rotated. Also, how and why does one rotate the data?
The real life implecation of data rotation is that the model becomes unstable. To solve this problem, one can use dimensionality reduction techniques like PCA which orients the data better. You can also use Random Forest model instead of Decision Tree. Rotation basically means that the data changes shape/side with respect to the feature axes. You would be able to understand this better if you visualize the data as a matrix.
When we talk about the target average values in DT, what measure of average is used by default? Mean or Median? if Mean, then why not median (as it'll handle noise better than mean)
Mean is used since computing median requires sorting of the values which is more process intensive. Also, the cases where median is a better measure of center is when the data has skewed data points. In the decision tree, the datapoints that are part of the leaf are most likely not have the outlier because the outlier would land in other leaf.
1.On splitting of a branch whether 1 feature or multiple featues are considered?Is that what max_features signify?
2.Suppose the algorithum spits based on feature 1 and its threshold ,then in further nodes will the algorithum exclude Feature 1 as it was already considered and a branch was formed based on it?
1. max_features is the number of features to consider when looking for the best split
2. Please refer to slide 15 for a better understanding of this concept. In slide 15 it is shown that at first we are considering the feature petal length, in the second node we are considering another feature, which is petal width.
How can we decide what max_depth to pass as the hyperparameter without knowing how many nodes it will create? Is it possible through GridsearchCV or RandomSearchCV?
In general, the deeper you allow your tree to grow, the more complex your model will become because you will have more splits and it captures more information about the data and this is one of the root causes of overfitting in decision trees because your model will fit perfectly for the training data and will not be able to generalize well on test set.
It is also bad to have a very low depth because your model will underfit so how to find the best value, experiment because overfitting and underfitting are very subjective to a dataset, there is no one value fits all solution. So what you can do is, let the model decide the max_depth first and then by comparing the train and test scores, look for overfitting or underfitting and depending on the degree decrease or increase the max_depth.
How can we plot the graph for Decision Tree's decision boundaries as given through slide numbers 22-24. Please share the function used to plot the same. Thank You.
Basically decision tree is a non-parametric algorithm while linear or non-linear regression are parametric. What this actually means is that the parametric models are going to assume a shape for the function that maps the input variables to the target variables (e.g. if you plot the function the plot is linear or exponential, etc.). So there is a possibility of underfitting for data that is non-linear. But, non-parametric model, since it does not assume any shape for the mapping function can fit the dataset very closely, nonetheless, the risk of overfitting cannot be denied for such models (which can be dealt with some hyperparameter tuning).
Having said that, if there is less no. of instances and large no. of features with very less noise, linear models could outperform the tree models. But, in general decision trees tend to perform better.
Here is a problem , you said the Gini is computationally faster than the entropy criteria , but you also said that Entropy created the more balanced tree. but if the more balanced tree is less computationally intensive , how can it be slower than the Gini ?? this is contradictory....
More balanced tree does prediction faster, but more computationally intensive to train it, Gini is faster to train but a little bit slow in prediction than Entropy because the tree length is very high in Gini.
AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
i.e. Recall indicates indicates out of all the instances that were actually positive how many were predicted as positive
So, the use depends upon the problem statement.
For example, if you are predicting cancer for a sample of the population it matters more that the person is not wrongly diangnosed of not having cancer, but it can be tolerated if he is misdiagnosed of having cancer, further tests will make take care of it. In this case, you need to have low false negatives and so higher Recall is important here.
Another example, if suppose you are trying to identify dogs from other animals, it is important to you that only dogs are identified as dogs and not any other animal, it would be okay if a dog is identified as a cat in some instance. In this case you want to reduce the false positives, so higher Precision is important here.
Referring to the lecture videos on classification would clear you concept as it has been explained with good example and shall also supplement for the above two points
For example, when you are predicting the presence of heart disease for people.
If the presence of heart disease is important to you, you'll try to increase Sensitivity as the goal will be to reduce the False negatives.
But, if the absence of heart disease is more important to you, you'll try to increase the Specificity as the goal will be to reduce the False positives
Please login to comment
41 Comments
Could you please elaborate on the line below:
"However, Gini impurity tends to isolate the most frequent class in its own branch of the tree"
Hi Veena,
The statement "Gini impurity tends to isolate the most frequent class in its own branch of the tree" means that when constructing a decision tree using Gini impurity as the splitting criterion, the algorithm tends to create splits that quickly separate the dominant class (the one that occurs most frequently in the dataset) into its own branch. This happens because Gini impurity measures how often a randomly chosen element would be incorrectly classified, and it prefers splits that reduce this impurity as much as possible. When there is a class that occurs frequently, the algorithm finds it more effective to separate that class early, which reduces the overall Gini impurity for subsequent splits. This behavior can lead to shorter, simpler branches for the dominant class while the less frequent classes may require more splits to isolate accurately.
Hi,
I tried the same code on a dataset that I made myself, but it gives an error; I hope someone can help me.
I do not know why the pandas error occurs, what the data should look like, or how to change it.
Thanks a lot.
Hi, can you share some instances of the target variable y?
Thank you for your guidance.
My data has similar properties to Iris, except that the target is continuous.
The numbers are continuous because of the way they are computed.
I mapped the target values to classes within certain limits so that the decision tree could predict a decision, but again it gave an error.
Thank you.
Hi, can you display some instances of the target variable?
I found it, thanks.
!dot -Tpng iris_tree.dot -o iris_tree.png
works only on CloudxLab and not in a local Jupyter notebook on my PC. What is the alternative?
Hi,
You need to install GraphViz on your local computer first before you can use it. Here are the details on how to install it:
https://graphviz.org/download/
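Once GraphViz is installed, an alternative to the dot command line is the graphviz Python package (pip install graphviz). A sketch, assuming the Iris tree from the lecture:

    import graphviz
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    iris = load_iris()
    clf = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)

    # With out_file=None, export_graphviz returns the DOT source as a string,
    # so no intermediate iris_tree.dot file or dot CLI call is needed.
    dot_data = export_graphviz(clf, out_file=None,
                               feature_names=iris.feature_names,
                               class_names=iris.target_names,
                               filled=True, rounded=True)
    graphviz.Source(dot_data).render("iris_tree", format="png")  # writes iris_tree.png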
Thanks.
In real life, what does the rotation of data mean? What action on the data might lead to it being rotated when plotting?
Hi,
Good question! You will notice that the decision boundaries made by a decision tree are parallel to the axes. The reason is that a decision tree splits the data on a single feature value, and that value stays constant along one decision boundary. It is because of this that decision trees are sensitive to data rotation: data that could easily be split by a single diagonal line gets split by multiple axis-parallel boundaries instead. A solution would be a simpler model that generalizes better for this particular problem.
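Here is a small sketch of that sensitivity (assuming scikit-learn and NumPy; the data and rotation angle are made up): the same linearly separable data needs a far deeper tree once it is rotated 45 degrees relative to the feature axes.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.RandomState(42)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = (X[:, 0] > 0).astype(int)            # separable by one vertical line

    theta = np.pi / 4                        # rotate the points 45 degrees
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    X_rot = X @ R.T                          # now only a diagonal separates them

    for name, data in [("axis-aligned", X), ("rotated", X_rot)]:
        tree = DecisionTreeClassifier(random_state=0).fit(data, y)
        print(name, "tree depth:", tree.get_depth())
    # The axis-aligned data needs a single split; the rotated copy needs a
    # staircase of axis-parallel splits to approximate the diagonal boundary.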
Thanks.
Sir, I've understood what you said, but what I don't understand is what may cause the data to rotate. What is the real-life implication of data being rotated? Also, how and why does one rotate the data?
Hi,
The real-life implication of data rotation is that the model becomes unstable. To solve this problem, one can use dimensionality reduction techniques like PCA, which orients the data better. You can also use a Random Forest model instead of a single decision tree. Rotation basically means that the data changes its orientation with respect to the feature axes. You would be able to understand this better if you visualize the data as a matrix.
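A hedged sketch of the PCA suggestion (assuming scikit-learn; X_train and y_train are placeholders for your own data):

    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    # PCA re-orients the data along its directions of maximum variance, which
    # often realigns a diagonal class boundary with the new axes.
    model = make_pipeline(PCA(n_components=2), DecisionTreeClassifier(random_state=0))
    # model.fit(X_train, y_train)
    # model.predict(X_test)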
Thanks.
When we talk about the target average values in a decision tree, what measure of average is used by default, mean or median? If mean, then why not median, as it would handle noise better than the mean?
Hi,
Good question!
Mean is used since computing the median requires sorting the values, which is more process-intensive. Also, the cases where the median is a better measure of center are when the data has skewed data points. In a decision tree, the data points that end up in a given leaf are unlikely to include the outlier, because the outlier would land in another leaf.
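In scikit-learn terms (recent versions), this corresponds to the criterion parameter of DecisionTreeRegressor: the default "squared_error" stores the mean of each leaf, while "absolute_error" stores the median, at extra training cost. A sketch with made-up data:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 1, size=(100, 1))
    y = 3 * X.ravel() + rng.normal(scale=0.1, size=100)

    # Default criterion: leaves predict the mean of their training targets.
    mean_tree = DecisionTreeRegressor(criterion="squared_error", max_depth=2).fit(X, y)
    # Alternative criterion: leaves predict the median instead (more robust
    # to outliers, but slower to fit because of the sorting involved).
    median_tree = DecisionTreeRegressor(criterion="absolute_error", max_depth=2).fit(X, y)

    print(mean_tree.predict([[0.5]]), median_tree.predict([[0.5]]))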
Hope this clarifies your doubts.
Thanks.
On calculating the entropy for the node at depth 2, the formula requires p_{i,k} not equal to zero, but my values are [0, 49, 5], so according to the formula p_{2,1} = 0.
Please explain.
Hi,
Please check our notebook from our GitHub repository for the complete code:
https://github.com/cloudxlab/ml
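For reference, here is a minimal sketch of the entropy computation (plain Python): the sum in the formula runs only over classes with a nonzero proportion, so a count of 0, like the first entry in [0, 49, 5], is simply skipped rather than producing log(0).

    import math

    # Entropy of a node: -sum(p_k * log2(p_k)), skipping classes with p_k = 0,
    # exactly as the formula's condition p_{i,k} != 0 says.
    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    print(entropy([0, 49, 5]))   # ~0.445; the zero-count class contributes nothing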
Thanks.
I did not understand the part at 2:13. Let's say we take the value x = 0.9; then how is the value of y 0.6 in the graph for max_depth=0?
Please reply.
Hi,
Could you please tell me which part of the video you are referring to, since I could not find any such reference at the 2:13 timestamp.
Thanks.
1. On splitting a branch, is one feature or are multiple features considered? Is that what max_features signifies?
2. Suppose the algorithm splits based on feature 1 and its threshold; then in further nodes, will the algorithm exclude feature 1, since it was already considered and a branch was formed based on it?
Hi,
1. max_features is the number of features to consider when looking for the best split at each node.
2. No, features are not excluded once used; the same feature can be split on again at a deeper node with a different threshold. Please refer to slide 15 for a better understanding: there, the first node splits on petal length, and the second node splits on another feature, petal width. The sketch below illustrates feature reuse.
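A small sketch (assuming scikit-learn; the one-feature data is made up) where the tree has no choice but to reuse its single feature:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 10, size=(200, 1))        # a single feature, x0
    y = (X.ravel() > 3) & (X.ravel() < 7)        # positives need two thresholds

    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(clf, feature_names=["x0"]))
    # x0 is split on at the root and again one level down with a different
    # threshold -- a used feature stays available for later splits.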
Thanks.
I cannot find the decision tree Jupyter notebook.
Hi,
You can find the notebooks for Decision Trees in our GitHub repository:
https://github.com/cloudxlab/ml/tree/master/machine_learning
Thanks.
How can we decide what max_depth to pass as the hyperparameter without knowing how many nodes it will create?
Is it possible through GridSearchCV or RandomizedSearchCV?
Hi,
In general, the deeper you allow your tree to grow, the more complex your model becomes, because there are more splits and the tree captures more information about the data. This is one of the root causes of overfitting in decision trees: the model fits the training data perfectly but does not generalize well on the test set.
It is also bad to have a very low depth, because your model will underfit. So how do you find the best value? Experiment: overfitting and underfitting are very subjective to a dataset, and there is no one-value-fits-all solution.
So what you can do is let the model decide the max_depth first, and then, by comparing the train and test scores, look for overfitting or underfitting and, depending on the degree, decrease or increase the max_depth.
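And yes, GridSearchCV or RandomizedSearchCV can automate that experiment. A sketch (assuming scikit-learn; X and y stand in for your training data, and the candidate depths are arbitrary):

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    param_grid = {"max_depth": [2, 3, 5, 8, 12, None]}   # None lets the tree grow fully
    search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          param_grid, cv=5, scoring="accuracy")
    # search.fit(X, y)
    # print(search.best_params_, search.best_score_)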
Thanks.
-- Rajtilak Bhattacharjee
How can we plot the graph of a decision tree's decision boundaries as shown on slides 22-24? Please share the function used to plot it.
Thank you.
Hi,
Here's an article on how to plot the DT's decision boundaries.
https://scikit-learn.org/st...
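As a minimal sketch of the usual approach (assuming Matplotlib, NumPy, and scikit-learn, with a tree trained on the two Iris petal features as in the slides): predict over a dense grid and colour each region by its predicted class.

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    X, y = iris.data[:, 2:4], iris.target        # petal length and petal width
    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

    # Predict on a dense grid covering the feature space, then colour each
    # grid cell by the predicted class to reveal the axis-parallel boundaries.
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                         np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
    plt.xlabel("petal length (cm)")
    plt.ylabel("petal width (cm)")
    plt.show()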
Thanks.
-- Rajtilak Bhattacharjee
You didn't really explain the benefits of a decision tree over a linear or non-linear regression model.
Upvote ShareBasically decision tree is a non-parametric algorithm while linear or non-linear regression are parametric. What this actually means is that the parametric models are going to assume a shape for the function that maps the input variables to the target variables (e.g. if you plot the function the plot is linear or exponential, etc.). So there is a possibility of underfitting for data that is non-linear. But, non-parametric model, since it does not assume any shape for the mapping function can fit the dataset very closely, nonetheless, the risk of overfitting cannot be denied for such models (which can be dealt with some hyperparameter tuning).
Having said that, if there is less no. of instances and large no. of features with very less noise, linear models could outperform the tree models. But, in general decision trees tend to perform better.
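A quick illustration of the difference (a sketch assuming scikit-learn, on made-up sine-wave data that no straight line can fit):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)   # non-linear target

    lin = LinearRegression().fit(X, y)
    tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
    # The linear model assumes a straight line and underfits the sine; the
    # tree approximates it piecewise without assuming any functional shape.
    print("linear R^2:", lin.score(X, y))
    print("tree   R^2:", tree.score(X, y))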
I hope that helps.
Here is a problem: you said Gini is computationally faster than the entropy criterion, but you also said that entropy creates the more balanced tree. If the more balanced tree is less computationally intensive, how can it be slower than Gini? This seems contradictory.
A more balanced tree does prediction faster but is more computationally intensive to train: computing entropy involves a logarithm at every candidate split, which is more expensive than the Gini formula. Gini is faster to train but a little slower in prediction than entropy, because the tree depth tends to be higher with Gini.
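You can check this yourself with a rough sketch like the one below (assuming scikit-learn; the dataset is synthetic and the exact timings and depths will vary):

    import time
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

    for criterion in ("gini", "entropy"):
        clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
        t0 = time.time()
        clf.fit(X, y)
        # Entropy pays for a logarithm at every candidate split, so it tends
        # to train more slowly; compare depths to see which tree is shallower.
        print(criterion, "fit time: %.3fs" % (time.time() - t0), "depth:", clf.get_depth())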
When should we use the ROC AUC curve?
When should we use the precision-recall curve? Which one is mostly preferable?
When do we need high precision and low recall?
When do we need low precision and high recall?
Please explain all of the above with real-world scenarios.
Thanks in advance.
Hi, Vinod.
Good questions!
AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
Please refer to this link for more details: https://machinelearningmast...
All the best!
In brief:
1. The ROC AUC curve is useful when we are dealing with a fairly balanced dataset, i.e., the number of instances for each class is approximately equal.
2. The precision-recall curve should be used when the dataset is imbalanced.
3. Precision = True Positive / (True Positive + False Positive)
i.e., precision indicates, out of all the instances that were predicted as positive, how many are actually positive.
Recall = True Positive / (True Positive + False Negative)
i.e., recall indicates, out of all the instances that were actually positive, how many were predicted as positive.
So, the use depends upon the problem statement.
For example, if you are predicting cancer for a sample of the population, it matters more that a person is not wrongly diagnosed as not having cancer, but it can be tolerated if they are misdiagnosed as having cancer, since further tests will take care of it. In this case, you need to have low false negatives, so higher recall is important here.
Another example: suppose you are trying to identify dogs among other animals, and it is important to you that only dogs are identified as dogs and not any other animal; it would be okay if a dog is identified as a cat in some instance. In this case you want to reduce the false positives, so higher precision is important here.
Referring to the lecture videos on classification would clarify your concepts, as this has been explained there with good examples, and it will also supplement the above two points.
I'll add a couple of things to my previous answer.
Recall is also called Sensitivity.
And there's Specificity corresponding to it:
Specificity = True Negatives / (True Negatives + False Positives)
For example, when you are predicting the presence of heart disease for people.
If the presence of heart disease is important to you, you'll try to increase Sensitivity as the goal will be to reduce the False negatives.
But if the absence of heart disease is more important to you, you'll try to increase the Specificity, as the goal will be to reduce the False positives.
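Both quantities fall straight out of a confusion matrix. A small sketch (assuming scikit-learn, with hypothetical labels where 1 marks heart disease):

    from sklearn.metrics import confusion_matrix

    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = heart disease present
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]   # hypothetical model predictions

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)   # recall: how many sick patients were caught
    specificity = tn / (tn + fp)   # how many healthy patients were cleared
    print("sensitivity:", sensitivity, "specificity:", specificity)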