Abstract:
Data-driven science is an interdisciplinary field concerned with gathering, processing, managing, and analyzing unstructured data in order to extract its inherent meaning and formulate it as structured information. That information can then be employed in many practical applications to solve real-life problems. Hadoop is an open-source data science tool that processes large data sets in a distributed manner across a cluster of computers (a single master server and several worker machines). Hadoop runs many tasks in parallel and processes huge amounts of complex data efficiently with respect to time, performance, and cost, which makes learning Hadoop and its various sub-modules worthwhile. This project covers the implementation of a Hadoop cluster with SSH public key authentication for processing large volumes of data, using cheap, readily available personal computer hardware (Intel/AMD-based PCs) and freely available open-source software (Ubuntu Linux, Apache Hadoop, etc.). In addition, MapReduce- and YARN-based distributed applications are ported to the cluster and used to test its workability.
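As an illustration of the kind of MapReduce application ported to the cluster, the sketch below follows the canonical WordCount example from the Apache Hadoop MapReduce tutorial; it is a minimal sketch rather than the specific applications tested in this thesis, and the HDFS input and output paths supplied as command-line arguments are placeholders.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every token in its input split.
      public static class TokenizerMapper
           extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer: sums the per-word counts produced by the mappers.
      public static class IntSumReducer
           extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir (placeholder)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir (placeholder)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

A jar built from this class would typically be submitted to the cluster's YARN resource manager with a command such as hadoop jar wordcount.jar WordCount /user/hduser/input /user/hduser/output, where both HDFS paths are hypothetical.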
Description:
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering at East West University, Dhaka, Bangladesh.