This essay outlines the background to Hadoop, summarises the changes implemented across the different versions of Hadoop and some of their advantages, and then discusses some Hadoop use cases along with the perceived advantages of Apache Spark and its use cases. Hadoop is an open source framework that provides a distributed, scalable storage and processing platform for large data volumes without format requirements. Hadoop is managed by the Apache Software Foundation and has its origins in Apache Nutch, an open source search engine project. Due to the volume of webpages on the internet, the Nutch project realised that Apache Nutch would have difficulty scaling to index them.
In 2003, Google published a paper on the Google File System (GFS), and the Nutch project identified that a similar filesystem would be an ideal storage solution for the data generated by their open source search engine. In 2004, the Nutch project developed the Nutch Distributed Filesystem (NDFS), an open source implementation of GFS. Also in 2004, Google published a paper on MapReduce, and by mid-2005 Nutch had developed an open source version of MapReduce. In 2006, NDFS and MapReduce moved out of Nutch to form Hadoop.
(White, 2012). Hadoop has two components at its core: the Hadoop Distributed File System (HDFS), a storage cluster, and the MapReduce engine, a programming model for data processing. Hadoop has been widely adopted by companies that process significant amounts of data, and some of its use cases are discussed below. Hadoop is becoming the foundation of most big data architectures and has gained significant adoption in areas with huge datasets. Organisations can deploy Hadoop and the required supporting software packages in their local data centre. However, most big data projects depend on short-term use of substantial computing resources, and this usage scenario normally utilises scalable public cloud services.
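The MapReduce programming model at Hadoop's core can be illustrated with a short sketch in plain Python. This is only a simulation of the map, shuffle and reduce phases for a word count, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

# Each string stands in for one input split held on a DataNode.
splits = ["Hadoop stores data", "Hadoop processes data"]
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop the splits live on different DataNodes and the map tasks run on the nodes holding the data, but the three-phase structure is the same.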
The following are examples of where Hadoop is used:
· Data Science is a field where machine learning, statistics, advanced analysis and programming are combined. Data Science is a relatively new discipline that draws hidden insights out of huge datasets, and Hadoop has been adopted as a core component for the storage and processing of such datasets.
· Hadoop is also used for data lake technologies. A data lake is a shared data environment that comprises multiple repositories and capitalises on big data technologies. Data lakes amalgamate data from various sources within an organisation and combine it to provide a single view of an entity, for example a customer. Hadoop data lakes are used within financial institutions where various systems hold information about a customer, for example a loan book system, a current account system and another system with customer investment details. Hadoop data lakes are utilised to amalgamate data from these various applications and provide a single customer view.
· Hadoop has also been adopted for stream computing, where processing is performed on data directly as it is produced or received, before it is stored.
Technologies enabling organisations to capture, process, ingest and analyse large volumes of data at high speed have become increasingly important, and Hadoop can be used in such instances. Since its inception, Hadoop has gone through various iterations. The second iteration, Hadoop 2.x, improved resource management and scheduling.
The latest version is Hadoop 3.x, which is considered stable and of production-ready quality. The changes introduced between the different versions, and some of the benefits they bring, are summarised below.
· Hadoop 1.x had one namespace for the whole cluster, managed by a single NameNode. With a single NameNode, the HDFS cluster could be scaled horizontally by adding DataNodes, but more namespaces could not be added to an existing cluster. The single NameNode was a single point of failure, and file system operations were limited to the throughput of that one node. Hadoop 2.x introduced cluster federation, which separates the namespace layer from the storage layer.
Hadoop federation allows horizontal scaling and uses several NameNodes which are independent of each other.
· In Hadoop 1.x, the processing engine and resource management capabilities of MapReduce were combined, with a single JobTracker coordinating thousands of TaskTrackers and MapReduce tasks. It should also be noted that MapReduce had other drawbacks: it did not support other workflows such as join, filter, union and intersection.
In addition, functions had to read from and write to disk before and after each Map and Reduce phase. Hadoop 2.0 introduced Yet Another Resource Negotiator (YARN), which splits resource management and scheduling into separate tasks. YARN has a central ResourceManager and an ApplicationMaster created for each individual application, allowing multiple applications to run simultaneously.
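The division of labour YARN introduced can be sketched as a toy model in Python: one cluster-wide ResourceManager grants containers from a shared pool, while each application brings its own ApplicationMaster to negotiate for them. The class and method names here are simplified illustrations, not the real Hadoop API:

```python
class ResourceManager:
    """Toy stand-in for YARN's central ResourceManager."""
    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        # Grant up to `requested` containers from the cluster-wide pool.
        granted = min(requested, self.free)
        self.free -= granted
        return granted

class ApplicationMaster:
    """One per application, as in Hadoop 2.x."""
    def __init__(self, app_name, containers_needed):
        self.app_name = app_name
        self.needed = containers_needed

    def negotiate(self, rm):
        # Each application negotiates its own resources with the RM,
        # so several applications can run side by side.
        return rm.allocate(self.needed)

rm = ResourceManager(total_containers=10)
apps = [ApplicationMaster("mapreduce-job", 4),
        ApplicationMaster("streaming-job", 3),
        ApplicationMaster("graph-job", 5)]
grants = {app.app_name: app.negotiate(rm) for app in apps}
print(grants)  # the last application only receives the 3 containers left
```

Contrast this with Hadoop 1.x, where the single JobTracker both tracked cluster resources and scheduled every MapReduce task itself.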
Hadoop 2.x improved scalability and supports up to 10,000 nodes and 400,000 tasks. Hadoop 2.0 introduced 'Federation', which allows multiple servers to manage namespaces, thereby enabling horizontal scaling and improving reliability. The NameNodes are now independent and do not require coordination with each other (Apache Hadoop, 2017).
· Hadoop 1.x had a restricted batch processing model and only supported MapReduce processing. As a result, it could only be applied to a limited set of tasks. YARN has the ability to run non-MapReduce tasks inside Hadoop, allowing other applications such as streaming and graph processing to run, thereby extending the range of tasks that can be performed in Hadoop 2.x compared to Hadoop 1.x. According to the open-source Apache Software Foundation, "Apache Hadoop 3.x incorporates a number of significant enhancements over the previous major release line".
· The initial implementation of HDFS NameNode high availability provided for a single active and a single standby NameNode. Version 3.x allows multiple standby NameNodes, which increases fault tolerance.
· Hadoop was originally developed to support UNIX; a significant change in Hadoop 3.x has been the introduction of support for Microsoft Azure Data Lake integration as an alternative Hadoop-compatible filesystem.
· Hadoop 2.x supports replication for fault tolerance. Hadoop 3.x offers support for erasure coding, a method of storing data durably with significant space savings compared to replication. It should be noted that erasure coding is mainly used for remote data reading and is suitable for data that is not accessed frequently.
· The YARN resource model has been generalised to support user-defined resource types such as GPUs, software licences or locally-attached storage.
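The space saving from erasure coding is simple arithmetic: 3-way replication stores three full copies of the data (200% storage overhead), whereas a Reed-Solomon scheme such as RS(6,3), one of the policies Hadoop 3.x ships with, stores 3 parity blocks for every 6 data blocks (50% overhead) while still tolerating the loss of any 3 blocks. A small sketch of that comparison:

```python
def storage_overhead_replication(replicas):
    # Extra space beyond the original data, as a fraction of the data size:
    # 3 replicas means 2 additional full copies.
    return replicas - 1

def storage_overhead_erasure(data_blocks, parity_blocks):
    # Reed-Solomon RS(data, parity): the overhead is the parity blocks
    # relative to the data blocks.
    return parity_blocks / data_blocks

rep = storage_overhead_replication(3)   # 3-way replication -> 2.0 (200%)
ec = storage_overhead_erasure(6, 3)     # RS(6,3)           -> 0.5 (50%)
print(f"replication: {rep:.0%}, erasure coding: {ec:.0%}")
```

The trade-off, as noted above, is that reconstructing a lost block under erasure coding requires reading from several remote nodes, which is why it suits infrequently accessed data.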
(Apache Hadoop, 2017). The changes discussed above resulted in additional use cases. In 2015, Microsoft announced Microsoft Azure Data Lake, a set of big data storage and analytics services built to be part of Hadoop. Azure Data Lake utilises HDFS and YARN as key components.
(Microsoft, 2015). The option to run multiple NameNodes in a cluster, for example two redundant NameNodes in the same cluster in an active/passive configuration with a hot standby, allows fast failover to a new NameNode, resulting in high availability (Apache Hadoop, 2017). In addition, federation enables a multi-tenant environment where a single cluster is shared by multiple organisations, or by multiple teams on a single project. Federation allows the cluster to have several different NameNodes or namespaces which are independent of each other.
The perceived advantages of Spark and its additional use cases.
Apache Spark supports data analysis, machine learning, graphs and streaming data functionality, among others. It can read from and write to a range of data sources and supports development in multiple languages. Spark's key use case centres on its ability to process streaming data. Organisations are processing significant amounts of data on a daily basis, and there is an argument for streaming and analysing such data in real time. This can be traced back to the rise of social media in the mid-to-late 2000s. Companies like Facebook, Google and Twitter designed and launched technology platforms that allow millions of users to share data simultaneously and in near-real time, and in most cases Hadoop forms the backbone of these technologies (IBM, 2017).
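Spark Streaming processes a live stream as a sequence of small "micro-batches" while keeping running state across them. The idea can be sketched in plain Python; this imitates the micro-batch model (roughly what Spark's stateful operations such as `updateStateByKey` provide) rather than using the actual PySpark API:

```python
from collections import Counter

def process_stream(batches):
    # Each element of `batches` is the data that arrived in one
    # micro-batch interval; state accumulates across intervals.
    running_totals = Counter()
    for batch in batches:
        batch_counts = Counter(word.lower()
                               for line in batch
                               for word in line.split())
        running_totals.update(batch_counts)  # carry state forward
    return running_totals

stream = [["user posts update", "user shares photo"],  # batch at t=1
          ["user posts comment"]]                      # batch at t=2
totals = process_stream(stream)
print(totals["user"])  # 3 occurrences across both micro-batches
```

In Spark the batches are distributed across the cluster and processed in parallel, but the discretised-stream structure is the same.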
Spark Streaming has the capability to handle extra workload because it can unify disparate data processing capabilities, allowing developers to use a single framework for different processing requirements. Spark incorporates an integrated framework for advanced analytics which allows users to run repeated queries on datasets. Spark's scalable machine learning library (MLlib) is the component that provides this facility. As a result, Spark is suitable for big data functions such as predictive intelligence and customer segmentation, among others. Spark is also capable of interactive analytics; however, there are other tools more suited to this task, because Spark relies on other Hadoop components to complete it. Although MapReduce was built to handle batch processing, other Hadoop components, e.g. Hive or Pig, are notably slower than Spark for interactive analysis. Although they are considered highly flexible, Spark's in-memory capabilities are not a one-size-fits-all solution for every scenario. For example, Spark was not designed for multi-user environments. Spark users need to know whether the memory they have access to is sufficient for the dataset they will be working on.
Coordination is required when more users are added to a project, to ensure they do not exceed the allocated memory. This inability to handle concurrency may mean that, as the number of users increases, projects are required to utilise other engines such as Apache Hive for large batch workloads. In conclusion, it can be argued that Hadoop (HDFS, MapReduce) provides a reliable tool for processing schema-on-read data, and that together they brought a paradigm shift in programming distributed systems. Spark has extended MapReduce with in-memory computation for streaming, interactive, iterative and machine learning tasks.