The metadata is maintained in memory as well as on disk. On disk, it is kept in two parts: the namespace image and the edit logs.
The namespace image is created on demand, while edit logs are written whenever the metadata changes. So, at any point, the current state of the metadata is obtained by applying the edit logs to the image.
Since the metadata is huge, writing all of it to disk on every change would be time-consuming. Therefore, recording just the change makes each update extremely fast.
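The image-plus-edit-log idea can be sketched as a checkpoint plus a list of change records that are replayed to recover the current state. The names and structures below are illustrative, not HDFS's actual on-disk format.

```python
# Illustrative sketch of the namespace-image + edit-log idea.
# The dictionaries and operation names are hypothetical, not HDFS internals.

def replay(image, edit_log):
    """Apply each logged change to a copy of the checkpoint image."""
    state = dict(image)  # start from the last saved namespace image
    for op, path, value in edit_log:
        if op == "create":
            state[path] = value          # e.g. the file's block list
        elif op == "delete":
            state.pop(path, None)
    return state

image = {"/a.txt": ["blk_1"]}              # checkpoint written earlier
edits = [("create", "/b.txt", ["blk_2"]),  # changes made since the checkpoint
         ("delete", "/a.txt", None)]

current = replay(image, edits)             # current metadata state
```

Each change appends one small record to the log, which is far cheaper than rewriting the whole image.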
Without the namenode, HDFS cannot be used at all, because we would not know which files are stored on which datanodes. Therefore, it is very important to make the namenode resilient to failures. Hadoop provides various approaches to safeguard the namenode.
The first approach is to maintain a copy of the metadata on NFS (Network File System). Hadoop can be configured to do this. Metadata modifications then happen atomically: either on both NFS and the local disk, or nowhere.
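This can be configured by listing more than one directory in the `dfs.namenode.name.dir` property of `hdfs-site.xml`; the namenode writes its metadata to every directory listed. The paths below are examples only, an NFS mount would be one of them:

```xml
<!-- hdfs-site.xml: write namenode metadata to a local disk and an NFS mount.
     The directory paths are example values. -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data/namenode,file:///mnt/nfs/namenode</value>
</property>
```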
In the second approach to making the namenode resilient, we run a secondary namenode on a different machine.
The main role of the secondary namenode is to periodically merge the namespace image with edit logs to prevent the edit logs from becoming too large.
When a namenode fails, we have to first prepare the latest namespace image and then bring up the secondary namenode.
This approach is not suitable for production applications, as there is downtime until the secondary namenode is brought online. With this method, the namenode is not highly available.
To make the namenode resilient, Hadoop 2.0 added support for high availability.
This is done using multiple namenodes and ZooKeeper. Of these namenodes, one is active and the rest are standby namenodes. The standby namenodes are exact replicas of the active namenode.
The datanodes send block reports to both the active and the standby namenodes, keeping all namenodes up to date at any point in time.
If the active namenode fails, a standby can take over very quickly because it has the latest state of the metadata.
ZooKeeper coordinates the switch between the active and the standby namenodes.
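An HA setup of this kind is described in `hdfs-site.xml` and `core-site.xml` using real Hadoop properties; the nameservice name, namenode IDs, and hostnames below are example values:

```xml
<!-- hdfs-site.xml: one logical nameservice backed by two namenodes.
     "mycluster", "nn1", "nn2", and the zk* hosts are example names. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml: the ZooKeeper quorum used for automatic failover. -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```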
The namenode maintains a reference to every file and block in memory. If there are too many files or folders in HDFS, the metadata becomes huge, and the namenode's memory may become insufficient.
To solve this problem, HDFS federation was introduced in Hadoop 2.0.
In HDFS Federation, we have multiple namenodes containing parts of metadata.
The metadata - the information about files and folders - is distributed manually across the namenodes. This distribution is done by maintaining a mapping between folders and namenodes.
This mapping is known as the mount table.
In this diagram, the mount table defines that the /mydata1 folder is on namenode1, while /mydata2 and /mydata3 are on namenode2.
The mount table is not a service. It is a file kept alongside, and referenced from, the HDFS configuration file.
The client reads the mount table to find out which folders belong to which namenode.
It routes a request to read or write a file to the namenode corresponding to the file's parent folder.
The same pool of datanodes is used for storing data from all namenodes.
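In Hadoop, such a mount table is expressed with ViewFS properties in the configuration; the sketch below mirrors the /mydata1, /mydata2, /mydata3 example, with cluster name, hostnames, and ports as assumed example values:

```xml
<!-- core-site.xml: ViewFS mount table mapping folders to namenodes.
     "clusterX" and the namenode host:port values are example names. -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://clusterX</value>
</property>
<property>
  <name>fs.viewfs.mounttable.clusterX.link./mydata1</name>
  <value>hdfs://namenode1:8020/mydata1</value>
</property>
<property>
  <name>fs.viewfs.mounttable.clusterX.link./mydata2</name>
  <value>hdfs://namenode2:8020/mydata2</value>
</property>
<property>
  <name>fs.viewfs.mounttable.clusterX.link./mydata3</name>
  <value>hdfs://namenode2:8020/mydata3</value>
</property>
```

With this in place, a client opening /mydata2/somefile is routed to namenode2 automatically.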
Let us discuss what metadata is. The following attributes are stored in the metadata.
The transaction logs, or edit logs, store file creation and file deletion timestamps.
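As a rough illustration, the per-file attributes a namenode tracks in memory might look like the record below. The field names and values are hypothetical, chosen to illustrate the idea, not HDFS internals:

```python
# Hypothetical sketch of per-file metadata tracked by a namenode.
# Field names and values are illustrative only.
file_metadata = {
    "path": "/mydata1/report.csv",
    "replication": 3,                    # copies kept of each block
    "block_size": 134217728,             # 128 MB, a common default
    "blocks": ["blk_1001", "blk_1002"],  # blocks that make up the file
    "permissions": "rw-r--r--",
    "owner": "hdfs",
    "modification_time": 1700000000,     # epoch seconds
}
```

Multiplying a record like this by millions of files shows why memory can become the limiting resource on a single namenode.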