In this video, we will learn about the Hadoop Distributed File System (HDFS), which is one of the main components of the Hadoop ecosystem.
Before going deeper into HDFS, let us discuss a problem statement.
If we have 100TB of data, how will we design a system to store it? Let’s take 2 minutes to think of possible solutions and then we will discuss them.
One possible solution is to build network-attached storage (NAS) or a storage area network (SAN). We can buy a hundred 1TB hard disks and mount them on a hundred subfolders as shown in the image. What will be the challenges in this approach? Let us take 2 minutes to think about the challenges and then we will discuss them.
Let us discuss the challenges.
How will we handle failover and backups?
Failover means switching to a redundant or standby hard disk when any hard disk fails. For backup, we can add extra hard disks or build a RAID (redundant array of independent disks) for every hard disk in the system. However, this still does not solve the problem of failover, which is really important for real-time applications.
How will we distribute the data uniformly?
Distributing the data uniformly across the hard disks is really important so that no single disk will be overloaded at any point in time.
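To see what uniform distribution could look like, here is a minimal sketch. It assumes we pick a disk for each file by hashing the file name, which is only one possible approach and is not how HDFS itself places data. The names `pick_disk` and `files_per_disk`, and the generated file names, are hypothetical and used only for illustration. Note that it balances the number of files, not the number of bytes, so files of very different sizes could still leave some disks fuller than others.

```python
import hashlib
from collections import Counter

NUM_DISKS = 100  # the hundred 1TB hard disks from our example


def pick_disk(file_name: str, num_disks: int = NUM_DISKS) -> int:
    """Map a file name to a disk index using a stable hash (illustrative only)."""
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_disks


def files_per_disk(file_names, num_disks: int = NUM_DISKS) -> Counter:
    """Count how many files land on each disk."""
    return Counter(pick_disk(name, num_disks) for name in file_names)


# Hypothetical file names, generated only for this illustration.
names = [f"file_{i}.dat" for i in range(100_000)]
counts = files_per_disk(names)

print("busiest disk holds", max(counts.values()), "files")
print("emptiest disk holds", min(counts.values()), "files")
```

With 100,000 files and 100 disks, each disk should hold roughly 1,000 files, so no single disk is overloaded by file count.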
Is it the best use of available resources?
We may have other, smaller hard disks available, but we may not be able to add them to the NAS or SAN because huge files cannot be stored on these smaller hard disks. Therefore, we will need to buy new, bigger hard disks.
How will we handle frequent access to files?
What if most of the users want to access the files stored on one of the hard disks? File access will be really slow in that case, and eventually no user may be able to access the files due to congestion.
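We can put rough numbers on this congestion. The sketch below assumes that one disk can read about 100 MB per second and that this bandwidth is shared equally among all users reading from it. Both are simplifying assumptions made for illustration; real disks usually behave worse under many concurrent reads. The function name `per_user_throughput` is hypothetical.

```python
def per_user_throughput(disk_mb_per_s: float, concurrent_users: int) -> float:
    """Throughput each user gets if one disk's bandwidth is shared equally."""
    if concurrent_users <= 0:
        raise ValueError("concurrent_users must be positive")
    return disk_mb_per_s / concurrent_users


DISK_MB_PER_S = 100.0  # assumed read speed of a single hard disk

print(per_user_throughput(DISK_MB_PER_S, 10))    # 10.0 MB/s each for 10 users
print(per_user_throughput(DISK_MB_PER_S, 1000))  # 0.1 MB/s each for 1000 users
```

Even if the other ninety-nine disks are idle, the users of the popular disk cannot borrow their bandwidth, because each file lives on only one disk.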
How will we scale out?
Scaling out means adding new hard disks when we need more storage. When we add more hard disks, the data will not be uniformly distributed: the old hard disks will have more data, and the newly added hard disks will have little or no data.
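A small calculation makes this imbalance concrete. The numbers below are assumptions chosen only for illustration: a hundred old 1TB disks that are each 90% full, and twenty new, empty 1TB disks. The function name `fill_ratios` is hypothetical.

```python
def fill_ratios(used_gb, capacity_gb):
    """Return the fraction of each disk that is in use."""
    return [used / capacity for used, capacity in zip(used_gb, capacity_gb)]


old_used = [900] * 100   # assumed: 100 old disks, each with 900 GB used
new_used = [0] * 20      # assumed: 20 newly added disks, still empty
capacity = [1000] * 120  # every disk is 1 TB (1000 GB)

used = old_used + new_used
ratios = fill_ratios(used, capacity)
average = sum(used) / sum(capacity)

print("fullest disk:", max(ratios))   # 0.9
print("emptiest disk:", min(ratios))  # 0.0
print("average fill:", average)       # 0.75
```

If the data were uniformly distributed, every disk would be 75% full. Instead, the old disks stay at 90% and the new ones sit empty until someone moves the data by hand.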
To solve the above problems, Hadoop comes with a distributed filesystem called HDFS. We may sometimes see references to “DFS” informally or in older documentation or configurations.