YARN - Why?
In this session, we are going to discuss YARN - Yet Another Resource Negotiator. YARN is a resource manager which keeps track of various resources such as memory and CPU of machines in the network. It also runs applications on the machines and keeps track of what is running where.
Before jumping into the YARN architecture, let us try to understand with an example why we need distributed computing.
Let us say we have a computer with a 1 GHz processor and 1 GB of RAM. It takes 20 milliseconds to read a profile pic from disk and then 5 more milliseconds to resize it. How much time would this computer take to resize a million profile pics?
Can we do two things in parallel when dealing with so many pics? Yes, because reading from disk mainly involves the disk, while resizing mainly involves the CPU and RAM. So, reading and resizing can be done in parallel as shown in the diagram. In the diagram, time is increasing from left to right.
You can see that while pic1 is being resized, pic2 is being read from the disk. For three pics, the total is 20 times 3 milliseconds of reading plus 5 milliseconds for the last resize, not 25 times 3. So, it takes 65 ms, not 75 ms.
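The overlap described above can be checked with a small timeline simulation. This is only a sketch under the assumptions of the example: one disk, one CPU, a fixed 20 ms read and a fixed 5 ms resize. The function names are made up for illustration.

```python
def pipelined_time_ms(num_pics, read_ms=20, resize_ms=5):
    """Total time when the disk reads the next pic while the CPU resizes the current one."""
    read_done = 0    # time at which the disk finishes the current read
    resize_done = 0  # time at which the CPU finishes the current resize
    for _ in range(num_pics):
        read_done += read_ms  # reads happen back to back on the single disk
        # Resizing starts once the pic has been read and the CPU is free.
        resize_done = max(read_done, resize_done) + resize_ms
    return resize_done


def sequential_time_ms(num_pics, read_ms=20, resize_ms=5):
    """Total time when each pic is read and resized before the next read starts."""
    return num_pics * (read_ms + resize_ms)


print(pipelined_time_ms(3))   # 65
print(sequential_time_ms(3))  # 75
```

Because resizing is faster than reading, the CPU is always free by the time the next pic arrives, so only the very last resize adds to the total.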
So, it is only the disk read time that matters, and at large scale we can completely ignore the last 5 ms. For one million pics, it would be 1 million times 20 milliseconds, that is 20,000 seconds, which is approximately 5.5 hours.
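The arithmetic can be spelled out in a few lines. This is a rough sketch using the numbers from the example; the constant names are chosen here for illustration.

```python
NUM_PICS = 1_000_000
READ_MS = 20
RESIZE_MS = 5

# Reads run back to back; only the final resize is not hidden behind a read.
total_ms = NUM_PICS * READ_MS + RESIZE_MS
total_seconds = total_ms / 1000
total_hours = total_seconds / 3600

print(total_seconds)           # 20000.005
print(round(total_hours, 2))   # 5.56
```

The extra 5 ms changes the result by a few millionths of an hour, which is why it can be dropped from the estimate.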
What if 5.5 hours is not good enough? The next question is: how can we make it faster?
If we use a computer which has four cores or processors, can this process finish in less than 5.5 hours?
No, because it is not the CPU which is causing the delay. The main time is being consumed in disk reads. If we make disk reads faster, the process will become faster. Disk reads can be made faster by using Solid State Drives and by using many disk drives.
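To see why extra cores do not help but extra disks do, here is a deliberately simplified model. It assumes that work spreads perfectly evenly across disks and across cores, which real hardware will not quite achieve, and it ignores the final resize as before. The function name and its parameters are hypothetical, introduced only for this illustration.

```python
def estimated_hours(num_pics, num_disks=1, num_cores=1, read_ms=20, resize_ms=5):
    """Rough estimate: the slower of the two stages sets the pace per pic.

    Assumes perfect scaling across disks and cores (an idealization).
    """
    read_per_pic_ms = read_ms / num_disks      # disks read different pics in parallel
    resize_per_pic_ms = resize_ms / num_cores  # cores resize different pics in parallel
    bottleneck_ms = max(read_per_pic_ms, resize_per_pic_ms)
    return num_pics * bottleneck_ms / 1000 / 3600


print(round(estimated_hours(1_000_000), 2))               # 5.56
print(round(estimated_hours(1_000_000, num_cores=4), 2))  # 5.56
print(round(estimated_hours(1_000_000, num_disks=4), 2))  # 1.39
```

With four cores the estimate is unchanged because the disk is still the bottleneck. With four disks the read time per pic drops to the same 5 ms as the resize, and the estimate falls to roughly a quarter.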