Hi, welcome to the chapter "Design and Constructs" of AI for Managers. Now that we have seen some of the algorithms in machine learning, let's touch upon typical problem-solving constructs. These constructs are employed while solving computational problems. In this chapter, we will learn
Ensemble learning (pronounced /ɑːnˈsɑːm.bəl/)
CNN Convolutional Neural Networks
RNN Recurrent Neural Networks and
Reinforcement learning. Let's start with ensemble learning.
Someone once asked me: We are all biased. What should we do?
My answer was - "Create an Ensemble". What did I mean by this? Let me explain.
Imagine you have to hire a candidate for a role.
Would it be better to entrust the decision of
no-hire to an expert interviewer
who might be biased or
get a panel of moderately good interviewers
who make their independent decisions and
then take a majority vote and
decide on whether or not to hire the candidate?
In almost all cases, you will find that the decision made by the panel of moderately good interviewers is better than that of the individual expert interviewer. This is called the wisdom of the crowd, and it is the crux of ensemble learning.
In ensemble learning
we aggregate the predictions of a group of predictors such as classifiers or regressors. This aggregation often leads to better predictions than the individual predictors. So at which step do we use ensemble learning?
Remember the checklist for machine learning projects. By step 5, you would have shortlisted the 3-4 best-performing models. In step 6, we can use ensemble learning on the shortlisted models to fine-tune the solution. Let's understand this.
Say, while working on a classification problem, you have trained a bunch of classifiers like a logistic regression classifier, an SVM, a Random Forest and others. And each of these classifiers predicts the class of an instance with
80% accuracy. If we aggregate the predictions of these classifiers, there is a high chance that the ensemble will give more than 80% accuracy. So how do we ensemble these classifiers?
First, we find out the predictions made by the different classifiers. As you can see, the different classifiers predicted the class of the new instance as either 1 or 2. Then we aggregate these predictions and
select the class that gets the highest number of votes. The class with the highest number of votes is the final output of the ensemble. Here, class 1 is predicted by three classifiers and class 2 by only one classifier.
So the final output of the ensemble will be class 1. This type of majority-vote classifier is called a hard voting classifier, and the process is called hard voting.
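As a quick sketch, here is hard voting in plain Python (the votes are made-up predictions matching the four classifiers in the example):

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class that receives the highest number of votes."""
    return Counter(predictions).most_common(1)[0][0]

# Four classifiers predict the class of a new instance:
# three vote for class 1, one votes for class 2.
votes = [1, 1, 2, 1]
print(hard_vote(votes))  # prints 1: class 1 wins the majority vote
```

In scikit-learn, the same idea is available out of the box as `VotingClassifier(voting="hard")`.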
On the other hand, a soft voting classifier takes into account how certain each classifier is, rather than just its binary output. In soft voting we first take
the probability from each classifier and then
average the probabilities. Since here the average probability is 0.6, which is greater than 0.5, the
decision will be "hired" on the basis of soft voting. Soft voting classifiers often achieve higher performance than hard voting classifiers because they give more weight to highly confident votes.
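A minimal sketch of soft voting in plain Python (the per-classifier probabilities are made-up values chosen to average to 0.6, as in the example):

```python
def soft_vote(probabilities, threshold=0.5):
    """Average each classifier's probability of the "hire" class and
    compare the mean against the decision threshold."""
    mean_p = sum(probabilities) / len(probabilities)
    return "hired" if mean_p > threshold else "not hired"

# Each classifier outputs its probability of "hire".
probs = [0.9, 0.4, 0.7, 0.4]   # average = 0.6
print(soft_vote(probs))        # 0.6 > 0.5, so "hired"
```

Note how the two very confident classifiers (0.9 and 0.7) pull the average above the threshold, even though two of the four classifiers lean towards "no-hire".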
For ensemble learning to be effective, we should use diverse models. If we have only one type of classifier or model, then we train each copy of it on a different subset of the same dataset.
Let’s understand this with our interview panel example. Each interviewer in the panel takes the hiring decision based on specific aspects of the interviewee.
Say the first interviewer is from HR and looks for good communication skills and attitude.
The second interviewer is a data analyst and assesses the interviewee's analytical skills.
The third interviewer is an engineer and assesses the interviewee's technical skills. There is a high chance that the hiring decision made by such a diverse panel of interviewers will be better than that of an individual expert interviewer.
Also, note that each interviewer must do a better job than a coin toss, otherwise the ensemble won't be effective.
So, an ensemble uses diverse models to make predictions and gives great results. Again, these models need to be better than random guessing, which gives equal chances of yes and no. Also, note that ensembling can be used for classification as well as regression problems. This is all for ensemble learning.
Let’s learn about a very interesting neural network architecture called CNN - Convolutional neural networks
Although IBM’s Deep Blue supercomputer beat the chess world champion Garry Kasparov in 1997, until quite recently computers were unable to perform seemingly trivial tasks such as detecting a puppy in a picture or recognizing spoken words. Why are these tasks so effortless to us humans? The answer lies in the fact that perception largely takes place outside the realm of our consciousness, within specialized visual, auditory, and other sensory modules in our brains. By the time sensory information reaches our consciousness, it is already adorned with high-level features.
For example, when you look at a picture of a cute puppy, you cannot choose not to see the puppy, or not to notice its cuteness. Nor can you explain how you recognize a cute puppy; it’s just obvious to you. Thus, we cannot trust our subjective experience: perception is not trivial at all, and to understand it we must look at how the sensory modules work.
Convolutional neural networks (CNNs) emerged from the study of the brain’s visual cortex, and they have been used in image recognition since the 1980s.
In the last few years, CNNs have managed to achieve superhuman performance on some complex visual tasks. All this was possible because of the increase in computational power, the amount of available training data, and the techniques presented in the previous chapter on training deep neural nets.
Today CNNs power image search services,
automatic video classification systems, and more. Moreover, CNNs are not restricted to visual perception: they are also successful at other tasks such as
voice recognition or
natural language processing (NLP); however, we will focus on visual applications for now.
David H. Hubel and Torsten Wiesel performed a series of experiments on cats in 1958 and 1959 (and a few years later on monkeys), giving crucial insights into the structure of the visual cortex (the authors received the Nobel Prize in Physiology or Medicine in 1981 for their work).
In particular, they showed that many neurons in the visual cortex have a small local receptive field, meaning they react only to visual stimuli located in a limited region of the visual field. In the diagram, the local receptive fields of five neurons are represented by dashed circles. The receptive fields of different neurons may overlap, and together they tile the whole visual field. Moreover, the authors showed that some neurons react only to images of horizontal lines, while others react only to lines with different orientations (two neurons may have the same receptive field but react to different line orientations). They also noticed that some neurons have larger receptive fields, and they react to more complex patterns that are combinations of the lower-level patterns. These observations led to the idea that the higher-level neurons are based on the outputs of neighboring lower-level neurons (in Figure 13-1, notice that each neuron is connected only to a few neurons from the previous layer). This powerful architecture is able to detect all sorts of complex patterns in any area of the visual field. These studies of the visual cortex inspired the neocognitron, introduced in 1980, which gradually evolved into what we now call convolutional neural networks.
Why not simply use a regular deep network with fully connected layers for image recognition tasks?
Remember how we fed our image to the neural network: we converted the two-dimensional image into a single dimension by concatenating the rows together, basically turning the image into a one-dimensional array.
This is generally called a deep neural network. The usual deep neural network works fine for small images such as MNIST. But there is a huge loss of information:
pixels which are adjacent but in the next row end up at far distant places in the input. Therefore the usual deep neural networks are not that effective when it comes to images. Their predictions could be improved further.
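A tiny sketch makes this concrete (a hypothetical 3×3 image, pure Python):

```python
# A 3x3 "image": pixel values laid out in rows.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# Flattening concatenates the rows into one long 1-D array.
flat = [pixel for row in image for pixel in row]
print(flat)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Pixels 1 and 4 are vertical neighbours in the image, yet they end up
# 3 positions apart in the flattened input: the spatial layout is lost.
print(flat.index(4) - flat.index(1))  # 3
```

For a 28×28 MNIST image the gap is 28 positions; for a real photograph it is hundreds, which is why a fully connected network struggles to exploit the local structure.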
CNNs solve this problem by using partially connected layers called convolutional layers.
Neurons in the first convolutional layer are not connected to every single pixel in the input image, but only to pixels in their receptive fields - a rectangular region of the image.
In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer.
The network concentrates on low-level features in the first hidden layer, then assembles them into higher-level features in the next hidden layer, and so on.
This hierarchical structure is common in real-world images, which is one of the reasons why CNNs work so well for image recognition.
Until now, all multilayer neural networks had layers composed of a long line of neurons, and we had to flatten input images to 1D before feeding them to the neural network.
Now each layer is represented in two dimensions, which makes it easier to match neurons with their corresponding inputs. The rectangle below is the input layer of neurons representing the input image, and the rectangle above is the second layer, called the convolutional layer. The moving rectangle represents a receptive field. Each neuron in the convolutional layer is the result of applying some operation to its receptive field. This operation is called a filter, and it could be as simple as a weighted sum of all the pixels. So, as we apply the filter over all possible positions of the moving rectangle, i.e. the receptive field, we generate the next layer of neurons, which can further be chained.
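A minimal sketch of this filter operation in plain Python (the image and kernel values are made up; a real CNN would learn the kernel weights during training):

```python
def convolve(image, kernel):
    """Slide the kernel over every possible position of the receptive
    field and compute a weighted sum of the pixels at each stop."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        output.append(row)
    return output

image = [[1, 0, 1, 0],
         [0, 1, 0, 1],
         [1, 0, 1, 0],
         [0, 1, 0, 1]]
kernel = [[1, 0],
          [0, 1]]  # responds strongly to a diagonal pattern
print(convolve(image, kernel))  # [[2, 0, 2], [0, 2, 0], [2, 0, 2]]
```

Each entry of the output is one neuron of the convolutional layer: a weighted sum over one position of the receptive field. Notice how the output peaks (value 2) exactly where the image matches the diagonal pattern.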
Another kind of layer used in convolutional neural networks is the pooling layer.
Once you understand how convolutional layers work, the pooling layers are quite easy to grasp. Their goal is to shrink the input image in order to reduce the computational load.
While the convolutional layer has weights, the pooling layer doesn't. The pooling layer is used to shrink the input.
There are various kinds of pooling layers: one computes the maximum value of the receptive field (max pooling), another computes the average of all the pixels (average pooling), and so on.
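Both variants can be sketched in plain Python over non-overlapping 2×2 windows (the input values are made up for illustration):

```python
def pool(image, size=2, op=max):
    """Shrink the image by applying `op` (e.g. max or average)
    to each non-overlapping size x size window."""
    output = []
    for i in range(0, len(image), size):
        row = []
        for j in range(0, len(image[0]), size):
            window = [image[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(op(window))
        output.append(row)
    return output

def mean(values):
    return sum(values) / len(values)

image = [[1, 3, 2, 4],
         [5, 7, 6, 8],
         [9, 2, 3, 1],
         [4, 6, 5, 7]]
print(pool(image, op=max))   # max pooling: [[7, 8], [9, 7]]
print(pool(image, op=mean))  # average pooling
```

Note that the pooling function has no weights to learn; it simply halves each dimension, cutting the computational load for the layers that follow.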
Typical CNN architectures stack a few convolutional layers, then a pooling layer, then another few convolutional layers, then another pooling layer, and so on. The image gets smaller and smaller as it progresses through the network. Towards the end, a regular feedforward neural network is added, composed of a few fully connected layers, and the final layer outputs the prediction.
Over the years, variants of this fundamental architecture have been developed, leading to amazing advances in the field. A good measure of this progress is the error rate in competitions such as the ILSVRC ImageNet challenge.
The ILSVRC or ImageNet Large Scale Visual Recognition Challenge (ILSVRC) evaluates algorithms for object detection and image classification at large scale.
We will first look at the classical LeNet-5 architecture, then three of the winners of the ILSVRC challenge: AlexNet, GoogLeNet, and ResNet.
The LeNet-5 architecture is perhaps the most widely known CNN architecture. As mentioned earlier, it was created by Yann LeCun in 1998 and widely used for handwritten digit recognition (MNIST). It is composed of the layers shown in the diagram: input followed by convolution and pooling, again convolution and pooling, and finally the usual deep neural network, also called a fully connected neural network, which gives the output.
The AlexNet CNN architecture won the 2012 ImageNet ILSVRC challenge by a large margin. It was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It is quite similar to LeNet-5, only much larger and deeper.
The GoogLeNet architecture was developed by Christian Szegedy et al. from Google Research, and it won the ILSVRC 2014 challenge. This great performance came in large part from the fact that the network was much deeper than previous CNNs. This was made possible by sub-networks called inception modules.
Last but not least, the winner of the ILSVRC 2015 challenge was the Residual Network (or ResNet), developed by Kaiming He et al. It is an extremely deep CNN composed of 152 layers. The key architectural change was the skip connection. That’s all for CNNs.
Let’s learn about Recurrent Neural Networks - RNN. We have just learnt CNN so the obvious question is why one more neural network? Let’s understand the need for RNN.
Say you have to build an app like Gmail Smart Compose. This app suggests the next words based on the words previously typed by the user. We cannot build such an app using traditional neural networks. This is because, in order to suggest the next words, the neural network needs to remember the previously typed words and understand the context. In other words, the neural network must have internal memory to remember the previously typed words.
The traditional neural networks which we discussed in the previous chapter pass the information through each layer exactly once, so the inputs cannot be retained. We need a different neural network architecture to retain the inputs. This is where RNNs shine. So how do RNNs retain the inputs?
RNNs look pretty much the same as traditional neural networks except
they also have the loop pointing backward. The loop in RNNs allows information to be passed from one step of the network to the next. In the diagram, we have the simplest possible RNN composed of
just one neuron
producing an output and
sending that output back to itself. This neuron is called
a recurrent neuron. Let’s see how inputs are retained over time in RNNs.
At time 0, the neuron receives input and produces the output.
At time 1, the neuron receives the input as well as output from the previous step
Similarly, at time 2, the neuron receives the input as well as the output from the previous step. As you can see, the inputs are being retained over time; we say that RNNs have memory and can retain information. RNNs are the preferred algorithm for sequential data like time series, speech, text, financial data, audio, video, weather and much more. Let’s see some of the applications of RNNs.
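(A quick aside: the feedback loop just described can be sketched in a few lines of Python. The weights here are made-up constants; a real RNN would learn them during training.)

```python
def recurrent_neuron(inputs, w_x=0.5, w_h=0.8):
    """A single recurrent neuron: at each time step the output depends
    on the current input AND the neuron's own previous output."""
    h = 0.0            # there is no previous output at time 0
    outputs = []
    for x in inputs:
        h = w_x * x + w_h * h   # previous output is fed back in
        outputs.append(round(h, 3))
    return outputs

# The input at time 0 still influences the outputs at times 1 and 2:
print(recurrent_neuron([1.0, 0.0, 0.0]))  # [0.5, 0.4, 0.32]
```

Even though the inputs at times 1 and 2 are zero, the neuron still produces non-zero outputs: that lingering trace of earlier inputs is the "memory" of the RNN.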
RNNs can predict the future. RNNs can analyze time series data such as stock prices and tell us when to buy or sell the stock.
In self-driving cars, RNNs can anticipate car trajectories and help avoid accidents.
RNNs are being used in translating text from one language to another. For example, RNNs can translate text from English to Mandarin and vice versa.
RNNs are being used in speech recognition. RNNs can predict the word from the input sound waves
RNNs can predict the users’ feeling about the movie by analyzing the movie reviews.
Also, notice that even if we do not provide an input, RNNs are capable of producing results because of the inputs from previous steps. This makes RNNs capable of creativity.
For example, Google’s Magenta project enables artists to generate new music and art.
Here is an example melody produced by Google’s Magenta project. [Play the video for a few seconds]
Though RNNs are good at predicting the future, they suffer from short-term memory. If a sequence is long, RNNs generally cannot carry information from the earlier steps to the later ones. Let’s understand this with an example.
Take the example of our smart compose app, in which you have to predict the next word by understanding the context from long paragraphs. Here, as humans, we can make out that since the author has lived in Spain for 25 years, it is very likely that
he speaks Spanish fluently. But for RNNs, the
relevant information needed to make the prediction is separated by a huge amount of irrelevant data. And since RNNs suffer from short-term memory, they may not be able to make accurate predictions in this case.
If you are interested in learning why RNNs face the short-term memory problem, you can look up the vanishing gradient problem on Wikipedia.
To overcome the short-term memory problem of RNNs, two more networks came into the picture as improvements over RNNs. These networks are
Long Short-Term Memory (LSTM) and
Gated Recurrent Unit (GRU).
LSTM and GRU can learn which information is relevant and which one to throw away. Because of this, they pass only the relevant information from the earlier steps to the later ones to make predictions. Let’s understand this using a real-life example.
Say you are looking at online reviews to buy a smartphone. While reading the reviews, you pick up phrases like
“camera quality is too good” and
“definitely be buying again”. You don’t care much for the filler words in between.
If a friend asks you the next day about the review, you will probably remember only the main points like
“excellent phone” and
“definitely be buying again”. And this is what LSTM and GRU networks also do: they keep only the relevant information to make predictions and forget the irrelevant information. LSTM and GRU networks are among the main reasons behind the success of RNNs in recent years,
particularly in natural language processing. We will learn more about natural language processing later in the course. This is all for RNNs.
Let’s learn about reinforcement learning. Reinforcement learning is one of the most exciting fields of Machine Learning today.
Remember, from chapter 1, the program which learns to play Mario on its own. That program uses reinforcement learning. Reinforcement learning has been around since the 1950s and has produced many interesting applications like ...
TD-Gammon, a backgammon-playing program. But the revolution in reinforcement learning took place in 2013, when a startup called DeepMind
built a system which could learn to play any Atari game from scratch. This system used only raw pixels as inputs and learned to master the games without any prior knowledge of the rules of the games.
In 2016 their system AlphaGo
defeated Lee Sedol, the world champion of the game of Go. This was all possible because of reinforcement learning. So what does the typical reinforcement learning process look like?
In reinforcement learning the system learns to optimize rewards. Let’s understand this
In reinforcement learning, a software agent makes observations and
takes actions within an environment and
in return it receives rewards. The software agent’s objective is to learn to take actions
which maximize the rewards. Please note that rewards can be both positive and negative. We will soon see examples of this.
In other words, the software agent acts in an environment and learns by trial and error to maximize its rewards.
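The observe-act-reward loop can be sketched as a toy example in Python. Everything here is made up for illustration: a one-dimensional world, a random trial-and-error policy, a reward of -1 per step for wasting time and +10 for reaching the goal.

```python
import random

def run_episode(goal=5, max_steps=50, seed=42):
    """One episode: the agent starts at position 0 on a number line,
    acts by stepping left or right, and collects rewards."""
    rng = random.Random(seed)
    position, total_reward = 0, 0
    for _ in range(max_steps):
        action = rng.choice([-1, 1])  # trial and error, no learning yet
        position += action            # act in the environment
        total_reward -= 1             # negative reward: time wasted
        if position == goal:          # observation: goal reached
            total_reward += 10        # positive reward
            break
    return position, total_reward

print(run_episode())
```

A learning agent would additionally update its policy from these rewards, so that over many episodes it takes fewer wasted steps and reaches the goal more reliably.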
Let’s see some examples of how we can apply this to real-life applications.
In the case of the walking robot, the software agent
can be the program controlling the robot.
The environment will be the real world.
The agent observes the environment through sensors like camera and touch sensors.
Its actions consist of sending signals to the motors to walk if there is no obstacle and to stop if there is an obstacle.
The agent may be programmed to get positive rewards when it approaches the destination without hitting obstacles...
and negative rewards whenever it wastes time
goes in the wrong direction
or falls down.
In the case of Pac-Man game
the agent can be the program controlling the player.
The environment will be the simulation of Atari Game.
The actions are possible joystick positions like upper left, down and so on. The agent uses the joystick to move around and ….
observe the game screen to see if the score is going up or down. It keeps on trying various joystick positions and learns the positions which increase the score.
The rewards are the game points.
In the case of the smart thermostat
the agent is the thermostat itself. Note that the agent does not necessarily have to control a moving thing. The thermostat, in order to be effective, must learn to anticipate human needs.
It gets positive rewards when it automatically figures out the right temperature and
...negative rewards when humans need to tweak the temperature manually. Let’s see one last example.
In the case of the automatic trader
the agent observes the
stock market prices and
decides how many stocks to buy or sell.
The rewards are monetary gains and
losses. Note that in reinforcement learning there may not be any positive reward at all. For example ...
in the maze game, the agent may get negative rewards for every move so that it will try to find the exit as quickly as possible. This is all for reinforcement learning.
In this chapter first we learnt
about Ensemble learning and
why ensemble learning gives better predictions
Then we learnt about hard voting and
soft voting classifiers.
Next, we learnt about convolutional neural networks (CNN). We learnt concepts like ...
convolutional and pooling layers, and
then we saw the various architectures.
Next, we learnt about recurrent neural networks RNN.
We learnt why we need RNNs,
the various applications of RNNs,
the shortcomings of RNNs and
two improved versions of RNNs
Long Short-Term Memory and
Gated Recurrent Unit networks.
Next, we learnt about reinforcement learning.
We learnt the various applications of reinforcement learning and
how the system learns to optimize the rewards.
Hope you liked the chapter. Stay tuned for the next chapter and happy learning!