The two main components of Hadoop are the MapReduce framework and HDFS. MapReduce is a parallel, scalable framework that runs on top of HDFS. The name refers to the two separate tasks that Hadoop jobs perform: the Map task and the Reduce task.
The Map task takes an input dataset and produces a set of intermediate key/value pairs, which are sorted and partitioned per reducer. This intermediate output is then passed to the Reduce tasks to generate the final output.
User applications implement the Mapper and Reducer interfaces to provide the map and reduce functions. In the MapReduce framework, the computation is moved closer to the nodes that store the data, rather than moving the data to the compute nodes. In the ideal case, the compute node is also the storage node, which minimizes network congestion and maximizes overall throughput.
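The map, shuffle/sort, and reduce phases can be sketched without the Hadoop framework itself. The following plain-Java word count (the canonical MapReduce example) shows the data flow; the class and method names are illustrative and are not Hadoop's actual Mapper/Reducer API:

```java
import java.util.*;

public class WordCountSketch {
    // Map phase: emit a (word, 1) pair for each word in the input split.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle/sort: group values by key, as the framework does between phases.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the counts collected for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(shuffle(map("to be or not to be")));
        System.out.println(counts); // {be=2, not=1, or=1, to=2}
    }
}
```

In the real framework, map and reduce run on different nodes and the shuffle moves the intermediate pairs over the network; here all three phases run in one process purely to show the data flow.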
The two significant modules in the MapReduce framework are the JobTracker and the TaskTracker.
The JobTracker is the MapReduce master daemon: it accepts user jobs, splits each job into multiple tasks, and assigns those tasks to the MapReduce slave nodes in the cluster, called TaskTrackers. TaskTrackers are the processing nodes in the cluster that run the Map and Reduce tasks. The JobTracker's responsibility is to schedule tasks and to re-execute failed tasks on the TaskTrackers. TaskTrackers report to the JobTracker at regular intervals through heartbeat messages that carry the status of their running tasks and the number of free task slots available.
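As a rough illustration of what such a heartbeat carries, the following plain-Java sketch models the payload and the scheduling decision it enables; the field names and status strings are assumptions, not Hadoop's actual heartbeat wire format:

```java
import java.util.*;

public class HeartbeatSketch {
    // Hypothetical heartbeat payload sent from a TaskTracker to the JobTracker;
    // field names are illustrative, not Hadoop's real protocol.
    record Heartbeat(String taskTrackerId, Map<String, String> taskStatus,
                     int freeMapSlots, int freeReduceSlots) {}

    public static void main(String[] args) {
        Heartbeat hb = new Heartbeat("tracker-01",
                Map.of("task_0001_m_00", "RUNNING", "task_0001_r_00", "SUCCEEDED"),
                2, 1);
        // The JobTracker uses the free-slot counts to decide whether this
        // TaskTracker can be assigned further Map or Reduce tasks.
        boolean canScheduleMap = hb.freeMapSlots() > 0;
        System.out.println(hb.taskTrackerId() + " can take map task: " + canScheduleMap);
    }
}
```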
HDFS, on the other hand, is a reliable, fault-tolerant distributed file system designed for storing very large datasets. Its main features include load balancing for maximum efficiency, configurable block replication strategies for data protection, recovery mechanisms for handling faults, and automatic scalability.
In HDFS, each file is split into blocks, and each block is replicated to several devices across the cluster. The two modules in the HDFS layer are the NameNode and the DataNode. The NameNode is the file system master daemon, which holds metadata about the stored files.
The NameNode stores inode records for directories and files, which contain attributes such as name, permissions, size, and last modification time.
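A minimal sketch of such an inode record, with hypothetical field names (the NameNode's actual in-memory and on-disk formats differ):

```java
public class InodeSketch {
    // Illustrative inode record; field names are assumptions, not the
    // NameNode's real metadata layout.
    record Inode(String name, String permissions, long sizeBytes, long modificationTime) {}

    public static void main(String[] args) {
        Inode file = new Inode("/user/data/input.txt", "rw-r--r--",
                64L * 1024 * 1024, 1700000000000L);
        System.out.println(file.name() + " (" + file.sizeBytes() + " bytes)");
    }
}
```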
DataNodes are the file system slave nodes; they are the storage nodes of the cluster. DataNodes store file blocks and serve read/write requests from clients. The NameNode maps each file to its list of blocks, and each block to the list of DataNodes that store it. DataNodes report to the NameNode at regular intervals through heartbeat messages containing information about their stored blocks. The NameNode builds its metadata from these block reports and always stays in sync with the DataNodes in the cluster.
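How the NameNode assembles its block-to-DataNode map from incoming block reports can be sketched as follows; the identifiers (`dn1`, `blk_1`, and the method name) are illustrative, not the NameNode's actual data structures:

```java
import java.util.*;

public class BlockReportSketch {
    // NameNode-side view: block ID -> set of DataNodes holding a replica.
    static Map<String, Set<String>> blockMap = new HashMap<>();

    // Process one DataNode's block report, as delivered with its heartbeat.
    static void processBlockReport(String dataNodeId, List<String> blockIds) {
        for (String blockId : blockIds) {
            blockMap.computeIfAbsent(blockId, k -> new TreeSet<>()).add(dataNodeId);
        }
    }

    public static void main(String[] args) {
        processBlockReport("dn1", List.of("blk_1", "blk_2"));
        processBlockReport("dn2", List.of("blk_1", "blk_3"));
        processBlockReport("dn3", List.of("blk_2", "blk_3"));
        // After three reports, every block is known to live on two DataNodes.
        System.out.println(blockMap.get("blk_1")); // [dn1, dn2]
    }
}
```

A map built this way also lets the NameNode detect under-replicated blocks: any entry whose set is smaller than the replication factor needs re-replication.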
When an HDFS client initiates a file read operation, it gets the list of blocks and their corresponding DataNode locations from the NameNode. The locations are ordered by their distance from the reader. The client then reads the block contents directly from the first location; if the read fails, it tries the next location in order. Because clients fetch data directly from the DataNodes, network traffic is dispersed across all the DataNodes in the HDFS cluster.
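The read-with-fallback behavior reduces to a loop over the ordered replica locations. In this sketch the `failed` set stands in for unreachable DataNodes, and all node and block names are assumptions for illustration:

```java
import java.util.*;

public class ReadFallbackSketch {
    // Simulated DataNode read: nodes in "failed" throw, standing in for
    // DataNodes that are down or return errors.
    static String readFrom(String location, Set<String> failed, String blockId) {
        if (failed.contains(location)) {
            throw new RuntimeException("read failed at " + location);
        }
        return "contents of " + blockId + " from " + location;
    }

    // Try each location in the NameNode-supplied order (nearest first),
    // moving to the next replica only when a read fails.
    static String readBlock(List<String> orderedLocations, Set<String> failed, String blockId) {
        for (String loc : orderedLocations) {
            try {
                return readFrom(loc, failed, blockId);
            } catch (RuntimeException e) {
                // fall through to the next replica location
            }
        }
        throw new RuntimeException("all replicas unreachable for " + blockId);
    }

    public static void main(String[] args) {
        List<String> locations = List.of("dn1", "dn2", "dn3"); // sorted by distance
        String data = readBlock(locations, Set.of("dn1"), "blk_1");
        System.out.println(data); // contents of blk_1 from dn2
    }
}
```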
When an HDFS client writes data to a file, it initiates a pipelined write to a list of DataNodes retrieved from the NameNode. The NameNode chooses this list according to the pluggable block placement policy. Each DataNode receives data from its predecessor in the pipeline and forwards it to its successor. Finally, the DataNodes report to the NameNode once the block is received.
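The pipelined write can be sketched as a chain in which each node forwards the block downstream before acknowledging receipt; all identifiers here are illustrative, and this is not the actual DataNode transfer protocol:

```java
import java.util.*;

public class WritePipelineSketch {
    static List<String> acks = new ArrayList<>();

    // Each DataNode receives the block from its predecessor, keeps a local
    // replica, forwards the data to its successor, and reports receipt once
    // the downstream nodes have it (so acks flow back upstream).
    static void receive(List<String> pipeline, int index, String blockId) {
        String node = pipeline.get(index);
        if (index + 1 < pipeline.size()) {
            receive(pipeline, index + 1, blockId); // forward to successor
        }
        acks.add(node + " reports " + blockId + " received");
    }

    public static void main(String[] args) {
        // Pipeline as chosen by the NameNode's block placement policy.
        receive(List.of("dn1", "dn2", "dn3"), 0, "blk_7");
        acks.forEach(System.out::println);
        // dn3 reports blk_7 received
        // dn2 reports blk_7 received
        // dn1 reports blk_7 received
    }
}
```

Note that the last node in the pipeline acknowledges first: each node only reports success after its successor has the data, which is how the write guarantees all replicas were stored.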