HDFS - Hadoop Distributed File System


HDFS - Design & Limitations


HDFS is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Let’s look at each part of this design.

It is designed for very large files. “Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size.

It is designed for streaming data access. It is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from the source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

It is designed for commodity hardware. Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on the commonly available hardware that can be obtained from multiple vendors. HDFS is designed to carry on working without a noticeable interruption to the user in case of hardware failure.

It is also worth knowing the applications for which HDFS does not work so well.

HDFS does not work well for low-latency data access. Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. HDFS is optimized for delivering high throughput, and this may come at the expense of latency.
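To make the throughput-versus-latency trade-off concrete, here is a rough Python sketch. It uses two figures from the latency comparison numbers quoted on this page (about 10 ms for a disk seek and about 20 ms to read 1 MB sequentially from disk). The function name `seek_fraction` and the 128 MB and 4 KB read sizes are assumptions chosen for illustration, not HDFS internals.

```python
# Rough figures taken from the latency comparison numbers on this page.
DISK_SEEK_MS = 10.0        # one disk seek
READ_1MB_DISK_MS = 20.0    # read 1 MB sequentially from disk


def seek_fraction(read_mb: float) -> float:
    """Fraction of the total time spent seeking for one seek plus one read."""
    transfer_ms = read_mb * READ_1MB_DISK_MS
    return DISK_SEEK_MS / (DISK_SEEK_MS + transfer_ms)


if __name__ == "__main__":
    # Illustrative sizes (assumptions): one large sequential read
    # versus one small record lookup.
    for label, size_mb in [("128 MB scan", 128.0), ("4 KB lookup", 4 / 1024)]:
        print(f"{label}: {seek_fraction(size_mb):.1%} of the time is seek")
```

For the large read the seek cost is well under 1% of the total, while for the small lookup the seek dominates. This is why a design tuned for whole-dataset scans does not suit applications that need many small, fast reads.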

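HDFS is not a good fit if we have a lot of small files. Because the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. Every file, directory, and block is an entry in that metadata, so millions of small files consume namenode memory no matter how little data they hold.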
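The effect of small files on the namenode can be estimated with a back-of-the-envelope calculation. The sketch below assumes the commonly quoted rule of thumb of roughly 150 bytes of namenode memory per file, directory, or block; that figure and the function name `namenode_memory_bytes` are assumptions for illustration, and real usage depends on the Hadoop version and configuration.

```python
# Rule-of-thumb assumption: ~150 bytes of namenode heap per metadata object.
BYTES_PER_OBJECT = 150


def namenode_memory_bytes(num_files: int,
                          blocks_per_file: int = 1,
                          num_dirs: int = 0) -> int:
    """Estimate namenode memory for the given number of files and directories."""
    # Each file contributes one file object plus one object per block.
    objects = num_files + num_files * blocks_per_file + num_dirs
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    estimate = namenode_memory_bytes(1_000_000)
    print(f"1 million single-block files: about {estimate / 1e6:.0f} MB")
```

Under these assumptions, one million single-block files need about 300 MB of namenode memory, regardless of whether each file holds 1 KB or 100 MB of data.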

If we have multiple writers and arbitrary file modifications, HDFS will not be a good fit. A file in HDFS can be written by only a single writer at any time.

Writes are always made at the end of the file, in append-only fashion. There is no support for modifications at arbitrary offsets in the file, so HDFS is not a good fit when a file has to be modified in place.
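The single-writer, append-only rules can be illustrated with a toy in-memory model in Python. This is not the HDFS client API. The class `AppendOnlyFile` and all of its methods are hypothetical names invented for this sketch, which only mimics the semantics described above.

```python
class AppendOnlyFile:
    """Toy model (hypothetical, not HDFS code): one writer, appends only."""

    def __init__(self) -> None:
        self._data = bytearray()
        self._writer = None  # id of the client currently holding the file open

    def open_for_append(self, writer_id: str) -> None:
        # Only one writer may hold the file open for writing at a time.
        if self._writer is not None:
            raise PermissionError(f"file is already being written by {self._writer}")
        self._writer = writer_id

    def append(self, writer_id: str, payload: bytes) -> None:
        if self._writer != writer_id:
            raise PermissionError("only the current writer may append")
        self._data.extend(payload)  # new bytes always go at the end

    def write_at(self, writer_id: str, offset: int, payload: bytes) -> None:
        # Modifications at arbitrary offsets are not supported in this model.
        raise NotImplementedError("writes at arbitrary offsets are not supported")

    def close(self, writer_id: str) -> None:
        if self._writer == writer_id:
            self._writer = None

    def read(self) -> bytes:
        # Any number of readers may read the data.
        return bytes(self._data)
```

A second client calling `open_for_append` while the first still has the file open gets a `PermissionError`, and any call to `write_at` fails. These are the two restrictions that make HDFS a poor fit for workloads with concurrent writers or in-place updates.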

Questions & Answers

Q: Can you briefly explain low-latency data access with an example?

A: Low latency here means the ability to access data almost instantaneously. In HDFS, a read request first goes to the namenode and then to the datanodes, so there is a delay before the first byte of data arrives. HDFS therefore has high latency for data access.

<pre>

Latency Comparison Numbers

L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms
</pre>

Source: https://gist.github.com/jboner/2841832




3 Comments

Yash Choudhary

3 years ago

Hi,

I am not able to understand this paragraph:

Latency Comparison Numbers


Dr. Manpreet Singh Sehgal

3 years ago

This is the justification for the high latency of HDFS. When the namenode is accessed, the metadata is in RAM.

Before RAM is accessed, the cache memory is checked first. There are two levels of cache, L1 and L2. L1 is referenced first, and accessing it takes 0.5 ns. If the data is found in L1, it is called a hit; otherwise it is called a miss. (The "Branch mispredict" row in the table is a separate CPU cost, not a cache miss.) On a miss, L2 is searched, and accessing L2 takes 14 times as long as accessing L1 (14 x 0.5 = 7 ns). If there is a miss there too, main memory (RAM) is accessed, whose reference time the table lists as 20 times that of the L2 cache.

The list goes on, ending with the round-trip time for a packet sent from California to the Netherlands and back.

I hope this will give you some idea of what this paragraph is all about.


Karthikeyan Mahalingam

2 years ago

By default, all the nodes (NN and DN) must have L1 and L2 caches.

In this case, the NameNode holds only the references and provides the location details to the DataNode. When the DataNode is processing, will it use its local L1/L2 caches and RAM for faster processing?

