Executive Summary:
Google developed three major software layers to serve and support its huge hardware fleet and data. In this paper we explain the working principles and performance of these systems. First is the Google File System (GFS), a distributed file system specifically designed to handle large data sets and to run over commodity hardware, which is cheaper but has higher failure rates than server hardware. GFS offered solutions that traditional file systems could not.
GFS efficiently handled data sets too large to fit on a single commodity hard disk, offered a cheap solution using commodity hardware, and optimized the common write access pattern of appending data to existing files. Second is MapReduce, a software framework for processing large data sets in a distributed fashion over several machines. The core idea behind MapReduce is mapping the data into a collection of intermediate key/value pairs and then reducing over all pairs that share the same key.
Third is BigTable, a widely applicable, scalable, distributed storage system for managing small- to large-scale structured data with high performance and availability. Many Google products such as Google Analytics, Google Finance, Personalized Search and Google Earth use Bigtable for workloads ranging from throughput-oriented batch jobs to latency-sensitive serving of data. While Bigtable shares many implementation strategies with other databases, it provides a simpler data model that supports dynamic control over data layout, format and locality properties.

Introduction:
Raw data describes the facts and figures that a company processes every day. In a retail environment, each sale will be recorded. However, a company will learn little by looking at each sale in isolation.
Data becomes information after it has been processed to add context, relevance and purpose. Analysis of daily sales will reveal trends and patterns, such as peak shopping days or biggest-selling items. Knowledge is a set of beliefs based on the relationship between pieces of information. A retailer may know to order additional cakes on a Tuesday because these are his biggest selling item every Wednesday.
The ability to analyze and act on data is increasingly important to businesses. The pace of change requires companies to be able to react quickly to changing demands from customers and environmental conditions. Although prompt action may be required, decisions are increasingly complex as companies compete in a global marketplace. Managers may need to understand high volumes of data before they can make the necessary decisions. Effective business intelligence (BI) tools assist managers with decision making. Because of the challenges posed by traditional data systems, it has become important to create a new platform that can fulfill the demands of organizations: even by leveraging the talent and collaborative efforts of people and resources, managing massive amounts of data has become a tedious job for organizations.
This need can be met by implementing big data and its tools, which are capable of storing, analyzing and processing large amounts of data far faster than traditional data processing systems. Big data has become a game changer in today's world. The major difference between traditional data and big data is that traditional data uses a centralized database architecture, in which large and complex problems are solved by a single computer system. A centralized architecture is costly and ineffective for processing large amounts of data. Big data is based on a distributed database architecture, in which a large problem is solved by dividing the data into several smaller blocks. The solution is then computed by several different computers in a given network, which makes it a lot faster. Traditional database systems are based on structured data, i.e.
traditional data is stored in fixed formats or fields in a file, whereas big data uses semi-structured and unstructured data and improves the variety of data gathered from different sources such as customers, audiences or subscribers. Addressing the need to store and manage data that does not fit neatly into rows and columns, NoSQL technologies are at the forefront, representing a broader world that connects to the internet at large. NoSQL databases can run on commodity hardware, support the unstructured, non-relational data flowing into organizations from the proliferation of new sources, and are available in a variety of structures that open up new types of data sources, providing ways to tap into the institutional knowledge locked in PCs and departmental silos. For example, the emerging blockchain technology is designed to store data, general-ledger style, in a highly distributed manner across the wider internet. A Distributed File System (DFS) is a file system that is distributed across multiple workstations and machines. The purpose of such file systems is to share dispersed files across platforms within a network securely and quickly. Resources on the machine itself are local, whereas resources placed on other machines are remote. A file system provides a service for clients.
The server interface is the normal set of file operations on files: create, read, and so on. Servers, clients and storage are usually dispersed across different machines, which also means that their implementations and configurations differ and vary by platform and usage. In many scenarios servers and clients are on different machines, but they can also be on the same machine, depending on the requirements of the users. File sharing is an important aspect here, since files must be shared among the different machines when and where required.

GFS Architecture:
The architecture contains three kinds of entities: clients, a master server and chunkservers. There is generally one master, one or more clients and many chunkservers [1].
A backup master is also present in case the primary master fails. Clients - Applications and personal computers are referred to as clients. A client can request to add or edit a file [2]. Chunkservers - These servers are called chunkservers because they store the data in small packets of 64 MB called chunks. They are an important part of the architecture, and their presence is critical to its operation. The chunk size is well suited to storing most file types [2]. The master is mostly involved in control-flow activities [1].
The chunkserver always sends the data straight to the application and does not involve the master in the data transfer [2]. Master - All control activities are monitored by the master. The master coordinates the chunks and maintains the operation log [1]. The log is used by the master to record its actions.
For troubleshooting, the logs are very important, as they point to exactly where and when a failure happened. The master stores changes to the metadata, and it sends delete-chunk and create-chunk requests to the chunkservers [2]. We can treat the master as a storehouse: it stores the mapping details and the namespace of the chunks [1]. The operation log is constantly monitored; if the log stops being updated, the system understands that something is wrong and the master may not be alive, in which case a backup master quickly takes its place so that operation does not stop [2].
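To make the role of the operation log concrete, the following small Python sketch shows how logging every metadata mutation before applying it lets a backup master rebuild the namespace by replaying the log. This is illustrative only; the class, log format and method names are invented for the example and are not GFS's actual code.

class Master:
    def __init__(self):
        self.log = []                  # operation log: one entry per metadata mutation
        self.namespace = {}            # file name -> ordered list of chunk handles

    def create_chunk(self, filename, handle):
        self.log.append(("CREATE", filename, handle))            # log the mutation first ...
        self.namespace.setdefault(filename, []).append(handle)   # ... then apply it

    @classmethod
    def replay(cls, log):
        # What a backup master would do after a failure: replay the logged
        # mutations in order to reconstruct the same namespace.
        backup = cls()
        for op, filename, handle in log:
            if op == "CREATE":
                backup.namespace.setdefault(filename, []).append(handle)
        backup.log = list(log)
        return backup

primary = Master()
primary.create_chunk("/data/file", "h-000")
primary.create_chunk("/data/file", "h-001")
backup = Master.replay(primary.log)
print(backup.namespace)   # {'/data/file': ['h-000', 'h-001']}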
For disaster recovery and data safety, GFS makes several copies of each chunk and keeps them on different chunkservers; each copy is known as a replica. Initially GFS makes three copies of every chunk, and the master can be configured to make extra copies as and when wanted [2]. The replicas are not stored on the same chunkserver; they are spread across different chunkservers so that if one chunkserver is dead or unresponsive, the remaining chunkservers can still answer and provide the needed data to the client, allowing the operation to keep running [2]. The master uses its metadata to identify which chunkserver holds the data the client is looking for and which chunk contains it. Garbage collection of abandoned chunks is also the responsibility of the master [1].

Figure 1: GFS Architecture

Figure 1 shows the data/control flow from the client to the master and then to the chunkservers. The client asks the master where the requested data is held [1]. Each chunkserver periodically sends a heartbeat to the master to tell it that it is alive [1]. Using this state, the master sends the metadata (chunk locations and chunk handle) back to the application [1].
Read operation: The application sends the file name and byte range to the GFS client. The client converts the byte offset given by the application into a chunk index [1]. The client then sends the file name and chunk index to the master. The master checks which chunkservers hold the file [2] and returns the chunk handle and the locations of the replicas to the client. The client then contacts a chunkserver with the chunk handle and byte range, and the chunkserver transfers the data to the client [2].
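As a rough illustration of this read path, the following Python sketch shows how a byte offset is turned into a chunk index and resolved to a chunk handle and replica locations before the client contacts a chunkserver. The dictionaries and names are assumptions standing in for the master's in-memory metadata; only the 64 MB chunk size comes from the description above.

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks

# Illustrative master metadata: file name -> chunk handles, handle -> replica locations.
namespace = {"/logs/web.log": ["h-000", "h-001", "h-002"]}
locations = {"h-000": ["cs-1", "cs-4", "cs-7"],
             "h-001": ["cs-2", "cs-5", "cs-8"],
             "h-002": ["cs-3", "cs-6", "cs-9"]}

def resolve(filename, offset):
    chunk_index = offset // CHUNK_SIZE          # which chunk holds this offset
    chunk_offset = offset % CHUNK_SIZE          # where the data starts inside that chunk
    handle = namespace[filename][chunk_index]   # the master maps (file, index) -> handle
    return handle, locations[handle], chunk_offset

# A read at byte offset 70 MB falls in the second chunk of the file.
handle, replicas, chunk_offset = resolve("/logs/web.log", 70 * 1024 * 1024)
print(handle, replicas, chunk_offset)           # h-001 ['cs-2', 'cs-5', 'cs-8'] 6291456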
The master is not involved in the data transfer, so it does not become the bottleneck of the process [2].

Write operation (Figure 2): The application sends the file name and the data to the client. The client sends the file name and chunk index to the master. The master checks which chunk replica has control (the lease); if none has control, the master assigns it to one replica, which becomes the primary chunkserver [2]. The master then sends the replica locations and chunk handle to the application.
The client sends the data to every one of the replicas. After receiving the data, the primary sends a positive response to the client [1]. The client then sends the write command to the primary [1]. The primary forwards the write command to the two secondary replicas in serial order [1].
After the data is written, the secondaries reply to the primary and acknowledge the write. The primary then confirms to the client that the write operation is complete. Read/write operations can be performed in parallel by the client.
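The write pipeline just described can be sketched in a few lines of Python. This is a single-process toy, not GFS's implementation: the Replica class and method names are invented, and the "RPCs" are plain method calls, but the order of the steps mirrors the flow above.

class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = {}      # chunk handle -> data pushed by the client, not yet applied
        self.chunks = {}      # chunk handle -> committed chunk contents

    def push_data(self, handle, data):
        self.buffer[handle] = data          # buffer the pushed data and acknowledge
        return "ack"

    def apply_write(self, handle):
        self.chunks[handle] = self.chunks.get(handle, b"") + self.buffer.pop(handle)
        return "ack"                        # the buffered data is appended to the chunk

def gfs_write(handle, data, primary, secondaries):
    # Step 1: the client pushes the data to every replica.
    for replica in [primary] + secondaries:
        assert replica.push_data(handle, data) == "ack"
    # Step 2: the client sends the write command to the primary.
    primary.apply_write(handle)
    # Step 3: the primary forwards the write to the secondaries in serial order.
    for secondary in secondaries:
        assert secondary.apply_write(handle) == "ack"
    # Step 4: the primary confirms completion to the client.
    return "write complete"

# Example: three replicas, one of which holds the lease and acts as primary.
primary, s1, s2 = Replica("cs-1"), Replica("cs-2"), Replica("cs-3")
print(gfs_write("chunk-0001", b"appended record", primary, [s1, s2]))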
Figure 2: Write data flow operation

Sources:
[1] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. Google. Available at: https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
[2] Garden, H. and Basic, I. (2017). How the Google File System Works. [online] HowStuffWorks. Available at: https://computer.howstuffworks.com/internet/basic/google-file-system2.htm

MapReduce:
A programming model for processing and generating large data sets in which users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

Execution Overview:
1. The MapReduce library in the user program first splits the input files into K pieces. It then starts up many copies of the program on a cluster of machines.
2. One of these copies is the master and the rest are workers that are assigned work by the master. Because the input is split into K pieces, there are K map tasks and R reduce tasks to assign.
3. A worker with a map task reads the contents of its input split and parses key/value pairs out of the input data. It then passes each pair to the user-defined map function. The map function produces intermediate key/value pairs, as shown in the figure below, which are buffered in memory.
4. The buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs are then passed back to the master, which is responsible for forwarding the locations to the reduce workers.
5. After being notified by the master, a reduce worker uses remote procedure calls to read the buffered data from the local disks of the map workers. Once the reduce worker has read all intermediate data for its partition, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's reduce function.
7. The output of the reduce function is appended to the final output file. The procedure is shown in the figure below, and a small word-count sketch of it follows.
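To make these steps concrete, here is a small, runnable word-count sketch in Python. It runs in a single process purely for illustration, and the function names and the choice of R are our own, not the framework's: map emits (word, 1) pairs, a partitioning function assigns each key to one of R regions, the keys in each region are sorted and grouped, and reduce sums the counts.

from collections import defaultdict

R = 4                                   # number of reduce tasks / partitions

def map_fn(document):
    # Map: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield word.lower(), 1

def partition(key):
    # Partitioning function: decides which reduce task receives the key.
    return hash(key) % R

def reduce_fn(key, values):
    # Reduce: merge all intermediate values associated with the same key.
    return key, sum(values)

def mapreduce(documents):
    regions = [defaultdict(list) for _ in range(R)]     # R intermediate regions
    for doc in documents:                               # the "map workers"
        for key, value in map_fn(doc):
            regions[partition(key)][key].append(value)
    output = {}
    for region in regions:                              # the "reduce workers"
        for key in sorted(region):                      # group-by via sorted keys
            k, v = reduce_fn(key, region[key])
            output[k] = v
    return output

print(mapreduce(["the quick brown fox", "the lazy dog", "the fox"]))
# word counts: the=3, fox=2, quick=1, brown=1, lazy=1, dog=1 (dictionary order may vary)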
In this word-count example, the reduce function ends up counting the total occurrences of each word in the documents.

Advantages of using MapReduce:
1. The model is easy to use. It can be used by someone without any experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization and load balancing.
2. A large variety of problems are easily expressible as MapReduce computations.
3. MapReduce scales to large clusters comprising thousands of machines, making efficient use of these machine resources, and is therefore suitable for many of the large computational problems encountered.
4. Instead of bringing the storage to the processing, it brings the processing to the storage, which makes it much faster.

Limitations of MapReduce:
1. Since MapReduce is suitable only for batch processing jobs, implementing interactive jobs and models becomes impossible.
2. Applications that involve precomputation on the dataset reduce the advantages of MapReduce.
3. Iterative MapReduce jobs are expensive due to the huge space consumption of each job.
Sources:
1. https://en.wikipedia.org/wiki/MapReduce
2. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters.
3. https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm