Apache Hadoop and MapReduce
Apache Hadoop and MapReduce are popular big data analytics platforms built on a programming model that can be applied to a variety of problem scenarios (Assuncao et al. 2014). The platform comprises MapReduce, the Hadoop kernel, Apache Hive and the Hadoop Distributed File System (HDFS). Hadoop is a widely used open source implementation of MapReduce. Hadoop (Liebowitz, 2013) uses a master node to divide the input data into segments and distribute them to worker nodes, whose sub-problem outputs are then combined. MapReduce thus follows a divide and conquer approach, consisting of a Map step and a Reduce step, to process voluminous data sets.
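The Map and Reduce steps described above can be illustrated with a minimal single-process sketch in Python. It is not Hadoop code: the word-count task, the function names (split_input, map_step, shuffle, reduce_step) and the number of segments are assumptions chosen for illustration, and the segments are processed sequentially here rather than on separate worker nodes.

```python
from collections import defaultdict


def split_input(records, num_segments):
    """Divide the input into segments, one per hypothetical worker."""
    return [records[i::num_segments] for i in range(num_segments)]


def map_step(segment):
    """Map step: emit a (word, 1) pair for every word in the segment."""
    return [(word.lower(), 1) for line in segment for word in line.split()]


def shuffle(mapped_outputs):
    """Group all emitted values by key across the workers' outputs."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_step(groups):
    """Reduce step: combine the values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records, num_segments=2):
    segments = split_input(records, num_segments)
    # In a real cluster each segment would be mapped on a different worker.
    mapped = [map_step(segment) for segment in segments]
    return reduce_step(shuffle(mapped))


if __name__ == "__main__":
    lines = ["big data big clusters", "data nodes", "big nodes"]
    print(word_count(lines))
```

Because each segment is mapped independently and the reduce step only needs the grouped values for a key, the result does not depend on how the input is split, which is what lets the work be spread over many nodes.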
Hadoop and MapReduce provide throughput-intensive data processing and fault-tolerant storage. Amazon EMR allows clients to instantiate Hadoop clusters using Amazon Elastic Compute Cloud (EC2) and other Amazon Web Services for processing, storing and transferring voluminous data (Assuncao et al. 2014). Hadoop uses HDFS to partition data sets and replicate them across multiple nodes. MapReduce is a programming model introduced by Google that abstracts away the underlying hardware and supports concurrent execution of programs on clusters (Nair & Shetty, 2014). Apache Mahout (Liebowitz, 2013) provides accessible, business-oriented machine learning techniques for large-scale, intelligent data analysis applications.
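The idea of partitioning a data set into blocks and replicating each block on several nodes, as HDFS does, can be sketched as follows. This is a toy model under stated assumptions: the round-robin placement, the function names and the node labels are hypothetical, and the actual HDFS placement policy (which is rack-aware) is not reproduced here.

```python
from typing import Dict, List


def partition(data: bytes, block_size: int) -> List[bytes]:
    """Split the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: simple round-robin placement, not the HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def available_blocks(placement: Dict[int, List[str]],
                     failed_node: str) -> List[int]:
    """Return the blocks that still have a replica after one node fails."""
    return [block for block, holders in placement.items()
            if any(node != failed_node for node in holders)]


if __name__ == "__main__":
    blocks = partition(b"abcdefghij", block_size=4)
    placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], 2)
    print(placement)
    print(available_blocks(placement, failed_node="n2"))
```

With a replication factor of at least two, every block survives the loss of any single node in this model, which is the basis of the fault tolerance mentioned above.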
The core algorithms of Mahout, including clustering, classification, pattern mining, regression, dimensionality reduction, evolutionary algorithms and batch-based collaborative filtering, run on top of the Hadoop platform through the MapReduce framework. According to Nair & Shetty (2014), Stubby is a workflow optimizer based on extensible transformations that generates MapReduce workflows. Starfish is a self-tuning system built on Hadoop that improves performance by adapting to user requirements and system workloads. Radoop combines Hadoop and RapidMiner, taking advantage of the capabilities of both to handle growing data sizes. Sailfish is a MapReduce framework that aggregates data and transfers it to reduce tasks more efficiently than Hadoop. Twister extends MapReduce to allow continuous and fast data processing. Twitter Storm enables the processing and aggregation of streaming data in real time.
It can process a million data tuples per node per second.
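Aggregation over a stream of tuples, of the kind attributed to Storm above, can be illustrated with a small generator-based sketch. It does not use the Storm API: the tumbling-window scheme, the function name and the (key, value) tuple layout are assumptions made for illustration only.

```python
from collections import Counter


def tumbling_window_counts(stream, window_size):
    """Yield per-key totals for each consecutive window of tuples.

    Tuples are consumed one at a time, so the full stream never has
    to be held in memory.
    """
    totals = Counter()
    seen = 0
    for key, value in stream:
        totals[key] += value
        seen += 1
        if seen == window_size:
            yield dict(totals)
            totals = Counter()
            seen = 0
    if seen:
        # Emit the final, partially filled window.
        yield dict(totals)


if __name__ == "__main__":
    events = [("a", 1), ("b", 2), ("a", 3), ("c", 1), ("b", 1)]
    for window in tumbling_window_counts(iter(events), window_size=2):
        print(window)
```

Unlike the batch-oriented MapReduce model, results here are emitted as soon as each window closes rather than after the whole input has been read.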