HDFS - Hadoop Distributed File System


HDFS - File Reading and Writing





When a user wants to read a file, the client talks to the namenode, and the namenode returns the file's metadata. The metadata contains information about the blocks and their locations.

Once the client receives the metadata, it communicates with the datanodes and accesses the data either sequentially or in parallel. The namenode therefore does not become a bottleneck, because the client talks to it only once to fetch the file's metadata.
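
To make the idea of block metadata concrete, here is a minimal Java sketch using Hadoop's FileSystem API. The file path and class name are made up for illustration; the call to getFileBlockLocations() asks the namenode for each block's offset, length, and the datanodes that hold its replicas, without reading any file data.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);              // client handle backed by the namenode
        Path file = new Path("/user/demo/sample.txt");     // hypothetical file path

        FileStatus status = fs.getFileStatus(file);
        // A single metadata call: block offsets, lengths, and the datanodes holding each replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```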

By design, HDFS ensures that no two writers can write to the same file at the same time, because a single namenode coordinates all writes.

If there were multiple namenodes and clients sent their requests to different ones, the entire filesystem could become corrupted, because those requests could end up writing to the same file at the same time.

Let's understand how files are written to HDFS. When a user uploads a file to HDFS, the client, on behalf of the user, tells the namenode that it wants to create the file. The namenode replies with the locations of the datanodes where the file can be written, and also creates a temporary entry for the file in its metadata.

The client then opens an output stream and writes the file to the first datanode. The first datanode is the one closest to the client machine; if the client is running on a machine that is also a datanode, the first copy is written on that machine.

Once the file is stored on one datanode, the data is copied to the remaining datanodes in parallel. As soon as the first copy is completely written, the datanode informs the client that the file has been created.

The client then confirms to the namenode that the file has been created. The namenode cross-checks this with the datanodes and finalizes the entry in its metadata.
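
The write path can be sketched with Hadoop's Java API. In the hypothetical example below (the destination path and class name are assumptions), fs.create() registers the new file with the namenode and returns an FSDataOutputStream; the bytes written to that stream go to the first datanode, and close() completes the file at the namenode.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/output.txt");         // hypothetical destination

        try (FSDataOutputStream out = fs.create(file, true)) { // overwrite if it already exists
            out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
            out.hflush();                                       // push buffered bytes to the datanodes
        }                                                       // close() completes the file at the namenode
        fs.close();
    }
}
```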

Now, let's try to understand what happens while reading a file from HDFS.

When a user wants to read a file, the HDFS client, on behalf of the user, talks to the namenode.

The namenode provides the locations of the various blocks of the file and their replicas, rather than the actual data.

From these locations, the client chooses the datanodes closest to it. The client talks to these datanodes directly and reads the data from the blocks.

The client can read the blocks of the file either sequentially or in parallel.
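
Here is a minimal read-path sketch, assuming a file like the one written above exists (the path and class name are made up). fs.open() fetches the block locations from the namenode, and the returned FSDataInputStream pulls the bytes directly from the datanodes, either streaming sequentially or using positioned reads for specific byte ranges.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/output.txt");    // hypothetical file

        // Sequential read: stream the whole file to stdout.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        // Positioned read: fetch a specific byte range without reading from the start.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buffer = new byte[5];
            in.readFully(0, buffer);                       // 5 bytes starting at offset 0
            System.out.println(new String(buffer, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```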

NOTE: Please see the discussion below to get your questions answered.




Comments

What do `FSDataOutputStream` and `FSDataInputStream` mean in the above explanation?


In Hadoop Distributed File System (HDFS), `FSDataOutputStream` and `FSDataInputStream` are classes used for performing input and output operations on files stored in HDFS.

1. **FSDataOutputStream**: 
   - `FSDataOutputStream` is a class in Hadoop's Java API that represents an output stream for writing data to files in HDFS.
   - It provides methods to write bytes, integers, longs, etc., to a file in HDFS.
   - This class is typically used when writing data to files in HDFS from a Hadoop MapReduce job, Spark job, or any other application running on the Hadoop ecosystem.

2. **FSDataInputStream**:
   - `FSDataInputStream` is a class in Hadoop's Java API that represents an input stream for reading data from files in HDFS.
   - It provides methods to read bytes, integers, longs, etc., from a file in HDFS.
   - This class is commonly used when reading data from files stored in HDFS within Hadoop MapReduce jobs, Spark applications, or any other Hadoop-based application.

Both `FSDataOutputStream` and `FSDataInputStream` are part of Hadoop's Input/Output (IO) library, which provides abstractions for reading from and writing to files in HDFS. These classes ensure efficient and reliable data access operations within the Hadoop ecosystem, enabling applications to interact with large datasets stored in HDFS.
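
As a rough illustration of those methods (the file path and class name below are hypothetical), `FSDataOutputStream` extends Java's `DataOutputStream`, so calls like `writeInt()` and `writeLong()` are available, while `FSDataInputStream` extends `DataInputStream` and adds `seek()` plus positioned reads:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrimitiveIoDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/numbers.bin");    // hypothetical file

        // Write primitive values with FSDataOutputStream.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeInt(42);
            out.writeLong(123456789L);
        }

        // Read them back with FSDataInputStream, using seek() to jump around.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readInt());               // 42
            in.seek(4);                                     // skip past the 4-byte int
            System.out.println(in.readLong());              // 123456789
        }
        fs.close();
    }
}
```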


How do I display the last few lines of a file?


You can use the tail command for this, for example:

hadoop fs -tail file_path
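
If you want to keep watching the file as new data is appended (similar to Unix tail -f), the shell tail command also accepts a -f option:

hadoop fs -tail -f file_path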

 


Ok thanks... 



By default, the replication factor is 3, so each block is copied to other datanodes as well. The namenode provides the list of blocks and their locations to the client, and the client decides where to read from based on proximity and other factors.

If replication is disabled, what will happen for the above case?


Why don't you try that out?
