Hi, welcome to the chapter "Challenges in AI/ML Projects" of AI for Managers. In this chapter, we will cover the main challenges you might face while executing a Machine Learning project, and how we can successfully overcome them.
While you go through the different stages of a project, you might face several challenges, such as not having enough training data, having to deal with too much data, poor model performance, a lack of computing resources, and too much time taken by the model creation process. We will go through each of these problems one by one, see what these challenges really are, and look at possible remedies for each.
Let’s look into each one in detail. Machine learning models typically need huge amounts of data. What do we do when we do not have enough data?
Let’s take the example of a classifier model that looks at an image of an object and classifies it as either a mobile phone or not a mobile phone.
Imagine that you do not have enough images of mobile phones to train the model. How do you create more sample images of mobile phones for training? One way to create more images is by rotating the existing images by different angles. We can also get different images by resizing or scaling the images, skewing them at different angles, adding noise to them, or cropping them to show partial views of the original image.
You can also employ combinations of more than one augmentation technique to increase the volume of the training sample. What other ways can you think of to augment image data?
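To make these ideas concrete, here is a minimal sketch of the augmentation techniques just mentioned, using NumPy arrays as stand-in grayscale images. The noise level and crop fraction are illustrative choices, not tuned values.

```python
import numpy as np

def augment(image, seed=0):
    """Generate simple augmented variants of a 2-D grayscale image array.

    Returns rotated, flipped, noisy, and cropped copies of the input.
    """
    rng = np.random.default_rng(seed)
    rotated = np.rot90(image)                          # rotate by 90 degrees
    flipped = np.fliplr(image)                         # horizontal flip
    noisy = image + rng.normal(0, 0.05, image.shape)   # add Gaussian noise
    h, w = image.shape
    cropped = image[: h * 3 // 4, : w * 3 // 4]        # keep the top-left 75%
    return rotated, flipped, noisy, cropped

# A tiny 4x4 "image" standing in for a real photo.
img = np.arange(16.0).reshape(4, 4)
variants = augment(img)
```

Each variant is a new training sample derived from the original; real pipelines would combine several such transforms per image.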
Though data augmentation can help when you have too little data, there are a few things you have to keep in mind while augmenting data. Make sure that flipping an image does not change the meaning of the data. What do we mean by this?
Let me give you a couple of examples. Flipping an image of a car does NOT change the meaning of the image. An image of a flipped car is still that of a car.
But flipping the image of a number might change the meaning of the image. Here, flipping the image of the number 9 made it look like the number 6, which changed the meaning of the data.
Data augmentation, especially when done on a very small dataset, often does not give a well-generalized model. This is because we do not have good variance in the dataset to begin with, and the augmented data is always derived from the available small dataset, so it does not provide much additional variation.
There are several methods you can use for data augmentation, but you need to be aware of the right methods for the right kind of data, so that you strike a balance between a lack of data and so much data augmentation that the model overfits.
Another problem we might face is choosing the right data - too much unnecessary data can also be a problem in the implementation of ML projects. Before going further, let’s understand what redundant data is, using an example.
Have a look at this table and see which of the columns do not add additional information if we were to create a model out of the data that is available.
The Is_Employed column contains only one value. It does not have any data-variability.
The salary column has very few data points. It does not add much value to the creation of the model.
Since these two columns do not offer any additional information that might help create a better model, we can drop these columns.
And we can get a reduced dataset, which will give a model with the same performance.
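The idea above can be sketched in a few lines of Python. The table and its values here are hypothetical, with column names mirroring the example; the missing-value threshold of 50% is an illustrative choice.

```python
# A small table as a dict of columns; names mirror the example above.
data = {
    "Name": ["Asha", "Ben", "Carl", "Dia"],
    "Age": [29, 35, 41, 27],
    "Is_Employed": ["Yes", "Yes", "Yes", "Yes"],  # single value: no variability
    "Salary": [None, None, 50000, None],          # very few data points
}

def drop_uninformative(table, max_missing=0.5):
    """Drop columns that have a single unique value or too many missing entries."""
    reduced = {}
    for name, values in table.items():
        present = [v for v in values if v is not None]
        missing_ratio = 1 - len(present) / len(values)
        if len(set(present)) <= 1 and missing_ratio == 0:
            continue  # constant column: carries no information
        if missing_ratio > max_missing:
            continue  # too sparse to help the model
        reduced[name] = values
    return reduced

reduced = drop_uninformative(data)
```

Only the columns with real variability survive; the constant and mostly-empty columns are dropped.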
Another instance where we reduce the number of variables or columns is when two columns are very highly correlated, e.g., gratuity and salary within a company, or the stock prices of two related indexes. These are examples of multicollinearity, where we can reduce the amount of data by removing one of the correlated variables, or by merging the two into a single variable that represents both - and that without losing much information from the original dataset.
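As a small sketch of detecting multicollinearity, we can compute the Pearson correlation between two columns and drop one if they are highly correlated. The salary and gratuity numbers below are made up for illustration, and the 0.9 cut-off is an arbitrary choice.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical columns: gratuity is roughly a fixed fraction of salary.
salary = [30000, 45000, 60000, 80000, 100000]
gratuity = [1450, 2200, 2900, 3900, 4800]

r = pearson(salary, gratuity)
if abs(r) > 0.9:
    # Highly correlated: keeping both adds little information,
    # so we keep just one (or merge them into a single feature).
    columns_to_keep = ["salary"]
```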
We have seen that having a lot of data increases the need for computational power to process it all. Another way in which we can reduce the amount of data is called dimensionality reduction. This is a formal approach to what we previously discussed regarding multicollinearity. Let’s take an example to explain this.
A photograph of a person is the projection of their 3-D image onto a 2-D surface, such that you capture the maximum amount of detail from one angle. Imagine that we want to capture as much detail of a person’s face as possible by photographing it. One option is to capture several photographs at slightly different angles.
But if you do not have enough storage space, the essence of the face can be captured by just two images taken at right angles to each other.
In fact, when someone gets arrested, their mugshot images are taken for record keeping. A typical mugshot consists of two photographs of the person - one side view and one front view, taken orthogonally to each other. The two images capture most of the information needed to recognize the person, so there is no need to store many images at different angles. But the question is, how did we decide at which angles to take the photographs?
Let’s take another example. If you were to take one photograph of a group of people standing together, at what angle or from what direction would you prefer to take it? You would prefer to take it directly from the front, facing the group, and not from the side or from the top. This way you capture the maximum information in one picture. What we have done is reduce the dimension from 3-D to 2-D by taking the photograph in a direction that captures the maximum amount of data.
One such technique is called Principal Component Analysis, or PCA for short, which does exactly this: it reduces dimensionality while capturing the maximum information, or variance, in the data.
To understand this, let’s extend the example of the group photo and consider it to be a dataset of points plotted across two axes, x1 and x2.
Now, if we have to project all these points onto a plane such that we lose the least amount of information - in this case, the variation in the data - which plane would be the best to project onto?
A plane along C1, C2, or C3?
If we are to project these points onto the three planes,
we get the variance as shown on the right.
From the point of view of preserving variability, C1 would be the best plane to project onto without losing much information, and C2 would capture the least variance. This is the core idea behind PCA. In a nutshell, PCA finds the axis of projection that accounts for the maximum variance in the dataset.
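A minimal sketch of this idea with NumPy: we generate synthetic 2-D points stretched mostly along one direction (like the group-photo example), find the axis of maximum variance via the eigendecomposition of the covariance matrix, and project the points onto it. The data here is made up; real uses would call a library such as scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)
# 2-D points stretched mostly along one direction.
t = rng.normal(0, 3, 200)
points = np.column_stack([t, 0.3 * t + rng.normal(0, 0.3, 200)])

# PCA by eigendecomposition of the covariance matrix.
centered = points - points.mean(axis=0)
cov = np.cov(centered.T)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
c1 = eigvecs[:, -1]                      # axis of maximum variance ("C1")
projected = centered @ c1                # 2-D points reduced to 1-D

# Fraction of the total variance the first principal axis captures.
explained = eigvals[-1] / eigvals.sum()
```

Because the points lie close to a line, a single axis captures almost all of the variance, so the 1-D projection loses very little information.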
There are other ways to reduce the dimensions of the data without compromising heavily on the information it contains. Color is an example of multi-dimensional data.
A colored pixel has 3 values corresponding to the 3 primary colors - Red, Green, and Blue - each ranging from 0 to 255 in intensity.
Making a grayscale image out of a color image is also an example of dimensionality reduction - here we compress the RGB values of each pixel into one value between 0 and 255. One way to do this is to take the average of the R, G, and B values in each pixel. We would still get an image without losing its meaning, but we have compressed the data to one-third of its original size - a reduction of about 66%.
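The averaging step above fits in a couple of lines; the four pixel values here are just a toy example.

```python
def to_grayscale(pixels):
    """Average the R, G, B values of each pixel into one intensity value.

    `pixels` is a list of (r, g, b) tuples; the result has one number per
    pixel - a third of the original data.
    """
    return [round((r + g + b) / 3) for r, g, b in pixels]

color_image = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (120, 120, 120)]
gray_image = to_grayscale(color_image)
```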
Thus we see that in real-life machine learning problems, with datasets involving millions of data points, reducing the dimensions of the data will considerably reduce the amount of data fed to the model, reduce the demand on computational resources, and reduce the time needed to train or build a model - making machine learning models faster to build and algorithms faster to execute.
This is the basic idea behind dimensionality reduction.
That brings us to the third and most critical challenge faced while building machine learning projects - the performance of the model. Poor model performance can occur for many reasons -
Due to the quality of the data itself, for example where there are a lot of missing values in your training data or
because of the process of creating the model itself - where you overfit or underfit the data.
In earlier chapters, we had seen how we can address the problem of data quality - particularly that of missing values in the training data. We had computed the average and median values of the existing data to fill missing values in the training dataset.
This process of substituting missing data with relevant, approximate values based on the context of the data is called imputation. Let’s take a minute and recap some of the imputation techniques.
Mean imputation is where we use the average of the existing values to fill in the missing value. In median imputation, we replace the missing value by taking the median value of the existing values. Hot Deck Imputation is where we randomly choose a value from the existing values to replace missing values.
The second reason why we end up with a poorly performing model lies in the model creation process.
Remember Tom and Mary from one of our previous chapters? Tom, who knew only addition, and Mary, who memorized all the questions from the Math textbook?
When the model training process is not done right, models generated from the training set are like Tom: they learn only part of the data in the training set. In such cases, the model is said to be underfitting. And sometimes models memorize the entire training set, like Mary. They perform really well on known instances but do badly on unknown instances. In such cases, the model is said to be overfitting.
These are the two extreme cases where models end up either over-fitting the training data or underfitting the data.
In machine learning, the process of building a generalized model , such that it performs well when presented with unseen data, is called regularization.
Now that we know overfitting and underfitting adversely affect model performance, how do we avoid them? Let’s take the case of the decision tree algorithm. We have already seen how a decision tree works in the previous chapter, and we have also seen that the more splits we use to divide the dataset, the more granular the individual nodes become, and the more the tree might tend to overfit the data.
Before going further, let us understand some basic concepts of the decision tree structure. Depth refers to the number of levels of splits from the root of the tree down to its deepest leaf. Any node where the data is split further is called a decision node. The last node of a branch, beyond which no split happens, is called a terminal node or a leaf node.
If a decision tree has a very deep structure and was trained on a limited number of samples, the chances are the model will overfit the training data. A general method employed to reduce overfitting in such a case is to reduce the depth of the tree and limit the maximum number of leaf nodes. This is called pruning. Pruning greatly reduces the complexity of the model and also reduces the chance of overfitting.
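As a sketch, assuming scikit-learn is available, depth and leaf-count caps can be set directly when building a tree; the tiny dataset and the specific caps here are illustrative only.

```python
# Assumes scikit-learn is installed; max_depth and max_leaf_nodes act as
# pre-pruning constraints that keep the tree from growing too complex.
from sklearn.tree import DecisionTreeClassifier

# A toy dataset: the label depends only on the first feature.
X = [[i, i % 3] for i in range(30)]
y = [0 if i < 15 else 1 for i in range(30)]

# Unconstrained tree: free to grow until every leaf is pure.
deep = DecisionTreeClassifier(random_state=0).fit(X, y)

# "Pruned" tree: depth and leaf count are capped.
pruned = DecisionTreeClassifier(max_depth=2, max_leaf_nodes=4,
                                random_state=0).fit(X, y)
```

On real, noisy data the capped tree usually generalizes better, even though the unconstrained tree fits the training set perfectly.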
In neural networks, to reduce overfitting, we follow a process called dropout. To understand dropout better, let’s take a real-life scenario.
Imagine a scenario in a classroom: whenever the teacher asks the class to perform an activity, the same two kids always volunteer, not letting the other students attempt it. By always doing the activity, only these two students get to make mistakes and learn - the other kids never get a chance to experiment. Now, if the teacher asks those two to stay quiet for some time and lets the other pupils participate, the other students also get a chance to learn by experimenting. Maybe they answer wrongly, but the teacher can correct them every time they make a mistake. This way the whole class learns the topic better. Dropout is a similar approach.
Now let’s see how dropout really works in neural networks. Consider a neural network with the architecture shown. This is called a fully connected network, where every node is connected to every node in the next layer.
During the training phase of the network, neurons are randomly deactivated, or ‘dropped’, at every iteration, and the network is formed from the remaining neurons. The dropped nodes do not contribute any information and do not learn during that iteration, while the remaining neurons adapt to work without them. A new set of neurons might be dropped in the next iteration. Based on the accuracy at each iteration, the weights of the participating neurons are updated. This process continues until all the iterations are completed and the network has been trained with different combinations of neurons.
During the prediction phase , with the test data , we do not ignore any neurons, i.e. no dropout is applied.
This process reduces overfitting and gives major improvements in the model performance. The downside of training with dropouts is that it roughly doubles the number of iterations required to converge. However, the training time for each epoch is less.
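The mechanics above can be sketched with NumPy. This uses the common "inverted dropout" variant (an assumption on our part): surviving activations are scaled by 1/(1-p) during training so that, at prediction time, activations simply pass through unchanged.

```python
import numpy as np

def dropout(activations, p=0.5, train=True, seed=0):
    """Inverted dropout: during training, randomly zero a fraction p of the
    activations and scale the survivors by 1/(1-p); at prediction time the
    activations pass through unchanged (no dropout applied)."""
    if not train:
        return activations
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) >= p   # True = neuron kept
    return activations * mask / (1 - p)

layer = np.ones(10)                              # activations of one layer
train_out = dropout(layer, p=0.5, train=True)    # some neurons zeroed
test_out = dropout(layer, p=0.5, train=False)    # unchanged at prediction
```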
The next couple of challenges that are usually encountered while building Machine Learning projects are the lack of training data and computing infrastructure. There are a few ways in which we can manage the lack of training data.
One of them is called transfer learning. Let us look at an interesting competition to understand what transfer learning is. The ImageNet competition is a worldwide contest for object detection and image classification, held every year since 2010. Its dataset contains almost a million images classified into 1000 groups, and the participants have to classify test images into these 1000 groups.
Over the last 7 years the error rate in object classification has come down significantly from around 30% to less than 5%.
Training on this huge image dataset takes a significant amount of time and resources. Every year, the winner of the competition releases their model architecture, and the model itself, into the public domain. This lets anybody re-use the model to solve similar image classification problems. These models are also called pre-trained models, since they have been created after hours, sometimes days, of training, and anyone can use them to solve their own image classification problems. They only have to train an incremental amount, depending on the level of customization their problem requires.
This process of re-using a pre-trained model developed by someone else to solve your specific deep learning problem, without having to redo the whole training effort, is called transfer learning.
We have seen that transfer learning helps you save a lot of time and resources when you can reuse another model that solves a similar problem. Another way of saving time and resources is using distributed modes of training models.
TensorFlow supports distributed computing by splitting computations across many servers and saving the intermediate variable values on a common server, instead of in the session, so that all the other machines can access them. Distributed TensorFlow architectures give the developer enough flexibility to parallelize and synchronize operations to save time. This can reduce the overall processing time from weeks to a few hours. Apart from this, there are other scalable, distributed machine learning libraries - Apache Spark’s MLlib is one such library. It contains most of the traditional ML algorithms for classification, regression, and decision trees, and also supports ML workflows that can be run in a distributed environment, making them much faster and easier to scale.
To summarize this topic - we have seen some of the methods and modifications, both in data and in infrastructure, that we can employ to address some of the common challenges faced while executing Machine Learning projects. In the next chapter we will cover some of the machine learning algorithms that are used when we do not have labeled data - namely, unsupervised algorithms.
Hope you liked the chapter. Stay tuned for the next chapter and happy learning!