9 Comments
Sir, please clarify this.
Hi, Sonal.
If you want to split the file into blocks yourself, you would have to write your own program.
When a file is written into HDFS, HDFS divides the file into blocks and takes care of its replication. This behavior is controlled through the configuration files.
Parameters such as the block size, the replication factor, and the number of mappers and reducers are set there.
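As a sketch of what "writing into the configuration files" looks like, here is a minimal `hdfs-site.xml` fragment. The property names (`dfs.blocksize`, `dfs.replication`) are the standard HDFS ones; the values are just examples, not recommendations.

```xml
<!-- Minimal hdfs-site.xml sketch: block size and replication factor.
     Property names are the real HDFS keys; values are examples only. -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB per block -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- each block is stored on 3 datanodes -->
  </property>
</configuration>
```

Note that the number of mappers is not set directly here; it falls out of the number of InputSplits, which is discussed further down in this thread.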
This will be explained in an upcoming session!
All the best!
Thank you so much.
Hi,
I think splitting the file into blocks is done by HDFS, but splitting the data into InputSplits is done by the client program.
Hi Manoj,
The client writing to HDFS splits the file. The client is aware of the block size: a temporary file the size of one block is created on the local disk, which is then transferred to HDFS.
Hope this helps.
Hi Abhinav,
I think you misunderstood my question, or maybe I communicated it badly :). Anyway, I am speaking of two different concepts: the block (the physical division of the file) and the InputSplit (the logical division used as input to each task). In your reply, can you please explain what you mean by "the client divides"? Will a client process be started when a user writes data to HDFS?
And can you please explain who decides the InputSplit size, and how?
/Manoj
Hi Manoj,
When a user writes to HDFS, they use a client program or library, such as "hadoop fs ..." from the Unix command line.
This program interacts with the namenode and datanodes while writing the file. The client program or library is responsible for splitting the file into blocks, or chunks, of 128 MB (64 MB in earlier Hadoop versions) while writing to the various datanodes.
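To make the block splitting concrete, here is a small arithmetic sketch of how a file maps onto fixed-size blocks. The class and method names are my own, purely for illustration; this is not a Hadoop API.

```java
// Illustrative only: how many HDFS blocks a file of a given size occupies.
// Number of blocks = ceiling(fileSize / blockSize); the last block may be
// smaller than the block size (HDFS does not pad it).
public class BlockMath {
    static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        // Integer ceiling division.
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blockSize = 128 * mb; // default block size since Hadoop 2.x
        long fileSize = 300 * mb;  // example: a 300 MB file
        // 300 MB splits into 128 MB + 128 MB + 44 MB
        System.out.println(numBlocks(fileSize, blockSize)); // prints 3
    }
}
```

So a 300 MB file is stored as three blocks, and each block is replicated independently across datanodes.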
Regarding the InputSplits:
InputSplits are formed out of HDFS blocks during the processing phase. HDFS blocks are raw chunks of data of a fixed size, regardless of format, while InputSplits are logical groups of records. During a MapReduce job, the raw data of the HDFS blocks is converted into InputSplits (bunches of records). This conversion is done by the InputFormat class.
So, depending on the data and our processing requirements, we choose an appropriate InputFormat. The same file can be loaded using different InputFormats, producing different kinds of splits and records from it.
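On Manoj's question of who decides the InputSplit size: in Hadoop's `FileInputFormat` the split size is derived from the block size, clamped by a configurable minimum and maximum (set via `mapreduce.input.fileinputformat.split.minsize` / `maxsize`). A minimal sketch of that rule, using illustrative class and method names rather than the real API:

```java
// Sketch of the split-size rule used by Hadoop's FileInputFormat:
// splitSize = max(minSize, min(maxSize, blockSize)).
// Names here are illustrative; in Hadoop this logic lives in a
// protected method of FileInputFormat.
public class SplitMath {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // With default min (1 byte) and max (Long.MAX_VALUE), the split
        // size equals the block size: one InputSplit per HDFS block.
        System.out.println(computeSplitSize(128 * mb, 1, Long.MAX_VALUE) / mb);      // prints 128
        // Raising the minimum split size yields fewer, larger splits.
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE) / mb); // prints 256
    }
}
```

This is why, by default, the number of mappers roughly equals the number of HDFS blocks in the input: each block becomes one split, and each split is handed to one map task.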
Can you please explain what functions the client program performs?
Can someone answer? I have the same question.