This essay outlines the background to Hadoop, summarises the changes that have
been implemented across its different versions and describes some of their
advantages. It also discusses some of the Hadoop use cases and the perceived
advantages of Apache Spark and its use cases.


Hadoop is an open source framework that provides a distributed, scalable
storage and processing platform for large data volumes without format
requirements. Hadoop is managed by the Apache Software Foundation and has its
origins in Apache Nutch, an open source search engine project. Given the volume
of webpages on the internet, the Nutch Project realised that Apache Nutch would
have difficulty scaling to index them. In 2003, Google published a paper on the
Google File System (GFS), its distributed filesystem, and the Nutch Project
identified that a similar filesystem would be an ideal storage solution for the
data generated by its open source search engine. In 2004, the Nutch Project
developed the Nutch Distributed Filesystem (NDFS), an open source
implementation of GFS. Also in 2004, Google published a paper on MapReduce, and
by early 2005 the Nutch developers had a working open source implementation of
MapReduce. In 2006, NDFS and MapReduce moved out of Nutch to form Hadoop
(White, 2012).

Hadoop has two components at its core: the Hadoop Distributed File System
(HDFS), which is a storage cluster, and the MapReduce engine, a programming
model for data processing. Hadoop has been widely adopted by companies that
process significant amounts of data, and some of its use cases are discussed
below.
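
The MapReduce programming model mentioned above can be illustrated with a
minimal, single-machine sketch in Python. It is not Hadoop code: the function
names (map_phase, shuffle, reduce_phase, word_count) are assumptions made for
illustration, and a real cluster would run the map and reduce steps in parallel
across many nodes, with the framework performing the grouping step.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

The programmer supplies only the map and reduce functions; distribution,
grouping and fault handling are left to the engine, which is what makes the
model attractive for very large datasets.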

Hadoop is becoming the foundation of most big data architectures
and has gained significant adoption in areas where there are huge datasets.
Organisations can deploy Hadoop and the required supporting software packages
in their local data centre. However, most big data projects depend on
short-term use of substantial computing resources, and this usage scenario
normally relies on scalable public cloud services. The following are examples
of where Hadoop is used.

Data science is a field that combines machine learning, statistics,
advanced analysis and programming. It is a relatively new discipline that
draws hidden insights out of huge datasets, and Hadoop has been adopted as a
core component for the storage and processing of such data.

Hadoop is also used for data lake technologies. A data lake is a
shared data environment that comprises multiple repositories and capitalises on
big data technologies. A data lake amalgamates data from various sources within
an organisation and combines it to provide a single view of an entity, for
example a customer. Hadoop data lakes are used within financial institutions
where various systems hold information about a customer, for example a loan
book system, a current account system and another system with customer
investment details. Hadoop data lakes are utilised to amalgamate data from
these applications and provide a single customer view.

Hadoop has also been adopted for stream computing, where processing
is performed on data directly as it is produced or received, before it is
stored. Technologies enabling organisations to capture, ingest, process and
analyse large volumes of data at high speed have become increasingly important,
and Hadoop can be used in such instances.

Since its inception, Hadoop has gone through various iterations. The
second iteration, Hadoop 2.x, improved resource management and scheduling. The
latest version, Hadoop 3.x, is considered stable and production-ready. The
changes introduced between the different versions, and some of the benefits
they bring, are summarised below.

Hadoop 1.x had one namespace for the whole cluster, which was
managed by a single NameNode. With a single NameNode, the HDFS cluster could be
scaled horizontally by adding DataNodes, but additional namespaces could not be
added to an existing cluster. The single NameNode was a single point of
failure, and file system operations were limited to the throughput of that one
NameNode. Hadoop 2.x introduced cluster federation, which separates the
namespace layer from the storage layer. Federation allows horizontal scaling by
using several NameNodes which are independent of each other.
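
One way to picture independent namespaces is a table that maps path prefixes to
the NameNode responsible for them. The following Python sketch is purely
illustrative and is not an HDFS API: the mount table, the host names and the
resolve_namenode function are all hypothetical, chosen only to show how a
client could route a path to one of several NameNodes without those NameNodes
coordinating with each other.

```python
# Hypothetical mount table: path prefix -> NameNode that owns that namespace.
MOUNT_TABLE = {
    "/user": "namenode-a.example.com",
    "/data": "namenode-b.example.com",
    "/data/archive": "namenode-c.example.com",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the NameNode for the longest mounted prefix that contains path."""
    matches = [
        prefix
        for prefix in mount_table
        if path == prefix or path.startswith(prefix + "/")
    ]
    if not matches:
        raise KeyError(f"no namespace is mounted for {path}")
    # The most specific (longest) prefix wins.
    return mount_table[max(matches, key=len)]


print(resolve_namenode("/user/alice/report.csv"))
print(resolve_namenode("/data/archive/2016/log"))
```

Adding capacity in this picture means adding another entry and another
NameNode, which is the horizontal scaling of the namespace that a single
NameNode could not provide.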

In Hadoop 1.x, the processing engine and the resource management capabilities
of MapReduce were combined, and a single JobTracker served thousands of
TaskTrackers and MapReduce tasks. It should also be noted that MapReduce had
other drawbacks: it did not provide other operations such as join, filter,
union or intersection, and functions had to read from and write to disk before
and after each Map and Reduce step. Hadoop 2.0 introduced ‘Yet Another Resource
Negotiator’ (YARN), which splits resource management and job scheduling into
separate components. YARN has a central ResourceManager and an
ApplicationMaster that is created for each individual application, allowing
multiple applications to run simultaneously. As a result, Hadoop 2.x scales
better, up to 10,000 nodes and 400,000 tasks. Hadoop 2.0 also introduced
‘Federation’, which allows multiple servers to manage namespaces, thereby
allowing horizontal scaling and improving reliability. The NameNodes are now
independent and do not require coordination with each other (Apache Hadoop,
2017).
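
The division of labour between a central ResourceManager and per-application
ApplicationMasters can be sketched as a toy model in Python. This is not the
YARN API: the class and method names below are assumptions made for
illustration, memory is the only resource tracked, and real YARN scheduling
involves queues, NodeManagers and far more detail.

```python
class ResourceManager:
    """Toy central scheduler that tracks free memory across the cluster."""

    def __init__(self, total_memory_mb):
        self.free_memory_mb = total_memory_mb

    def allocate(self, memory_mb):
        """Grant one container if enough memory is free; return True if granted."""
        if memory_mb > self.free_memory_mb:
            return False
        self.free_memory_mb -= memory_mb
        return True


class ApplicationMaster:
    """Toy per-application master that requests containers for its own job."""

    def __init__(self, name, resource_manager):
        self.name = name
        self.resource_manager = resource_manager
        self.containers = 0

    def request_containers(self, count, memory_mb):
        """Ask for `count` containers; return the total this application holds."""
        for _ in range(count):
            if self.resource_manager.allocate(memory_mb):
                self.containers += 1
        return self.containers


rm = ResourceManager(total_memory_mb=8192)
batch_job = ApplicationMaster("mapreduce-job", rm)
stream_job = ApplicationMaster("streaming-job", rm)

batch_job.request_containers(3, 2048)
stream_job.request_containers(2, 1024)
print(batch_job.containers, stream_job.containers, rm.free_memory_mb)
```

Because each application has its own master, a batch job and a streaming job
can share the same cluster at the same time, while the ResourceManager only
decides who receives which resources.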

Hadoop 1.x had a restricted batch processing model and supported only
MapReduce processing. As a result, it could only be applied to a limited set of
tasks. YARN can run non-MapReduce tasks inside Hadoop, allowing other kinds of
application, such as streaming and graph processing, to run, thereby extending
the range of tasks that can be performed in Hadoop 2.x compared to Hadoop 1.x.

According to the Apache Software Foundation, “Apache Hadoop 3.x incorporates a
number of significant enhancements over the previous major release line”.

The initial implementation of HDFS NameNode high availability provided for a
single active NameNode and a single standby NameNode. Version 3.x allows
multiple standby NameNodes, which increases fault tolerance.

Hadoop was originally developed to support UNIX. A significant
change in Hadoop 3.x has been the introduction of support for Microsoft Azure
Data Lake as an alternative Hadoop-compatible filesystem.

Hadoop 2.x supports replication for fault tolerance. Hadoop 3.x adds support
for erasure coding, a method of storing data durably with significant space
savings compared to replication. It should be noted that reading erasure-coded
data generally involves remote reads, so erasure coding is best suited to data
that is not accessed frequently. In addition, the YARN resource model has been
generalised to support user-defined resource types such as GPUs, software
licenses or locally-attached storage (Apache Hadoop, 2017).
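
The space saving of erasure coding relative to replication can be made concrete
with a short calculation. The Python sketch below assumes three-way replication
and a Reed-Solomon layout with six data units and three parity units. Both
figures are assumptions added for illustration and do not come from the essay's
sources, and the function names are hypothetical.

```python
def replication_overhead(replicas):
    """Extra storage, as a fraction of the original data, for n-way replication."""
    return replicas - 1


def erasure_coding_overhead(data_units, parity_units):
    """Extra storage, as a fraction of the original data, for erasure coding."""
    return parity_units / data_units


def raw_storage_tb(logical_tb, overhead):
    """Raw disk capacity needed to hold `logical_tb` of data at a given overhead."""
    return logical_tb * (1 + overhead)


three_way = replication_overhead(3)       # assumed: 3 copies of every block
rs_6_3 = erasure_coding_overhead(6, 3)    # assumed: 6 data units, 3 parity units

print(raw_storage_tb(100, three_way))  # raw TB for 100 TB with replication
print(raw_storage_tb(100, rs_6_3))     # raw TB for 100 TB with erasure coding
```

Under these assumptions, 100 TB of data occupies 300 TB of disk with
replication but 150 TB with erasure coding, at the cost of extra computation
and network traffic when data is read or reconstructed.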





The changes discussed above resulted in additional use cases. In 2015,
Microsoft announced Microsoft Azure Data Lake, a set of big data storage
and analytics services built to be part of the Hadoop ecosystem. Azure Data
Lake utilises HDFS and YARN as key components (Microsoft, 2015).

The option to run multiple NameNodes in a cluster, for example two redundant
NameNodes in the same cluster in an Active/Passive configuration with a hot
standby, allows fast failover to a new NameNode, resulting in high availability
(Apache Hadoop, 2017).

In addition, federation can support a multi-tenant environment in which a
single cluster is shared by multiple organisations, or by multiple teams on a
single project. Federation allows the cluster to have several different
NameNodes or namespaces which are independent of each other.

The perceived advantages of Spark and its additional use cases


Apache Spark supports data analysis, machine learning, graph processing and
streaming data, among other functions. It can read from and write to a range of
data formats and supports development in multiple languages.

Spark’s key use case is centred on its ability to process streaming data.
Organisations process significant amounts of data on a daily basis, and there
is an argument for streaming and analysing such data in real time. This can be
traced back to the rise of social media in the mid-to-late 2000s. Companies
like Facebook, Google and Twitter designed and launched technology platforms
that allow millions of users to share data simultaneously and in near-real
time, and in most cases Hadoop forms the backbone of these technologies (IBM,
2017). Spark Streaming can handle this extra workload because it unifies
disparate data processing capabilities, allowing developers to use a single
framework to meet different processing requirements. Spark also incorporates an
integrated framework for advanced analytics which allows users to run repeated
queries on datasets. Spark’s scalable Machine Learning Library (MLlib) is the
component that provides this facility. As a result, Spark is suitable for big
data functions such as predictive intelligence and customer segmentation, among
others.

Spark is also capable of interactive analytics; however, there are other tools
that are more suited to this task than Spark, because it relies on other Hadoop
components to complete it. MapReduce was built to handle batch processing, and
other Hadoop components such as Hive or Pig are notably slower than Spark for
interactive analysis.

Although they are considered to be highly flexible, Spark’s in-memory
capabilities are not a one-size-fits-all solution for every scenario. For
example, Spark was not designed for multi-user environments. Spark users need
to know whether the memory they have access to is sufficient for the dataset
they will be working on. Where more users are added to a project, coordination
is required to ensure that users do not exceed the allocated memory. This
inability to handle concurrency might mean that, as the number of users
increases, projects may be required to utilise other engines, such as Apache
Hive, for large batch projects.

In conclusion, it can be argued that Hadoop (HDFS and MapReduce) provides a
reliable tool for processing schema-on-read data and that it brought a paradigm
shift in programming distributed systems. Spark has extended MapReduce with
in-memory computation for streaming, interactive, iterative and machine
learning tasks.

