From Micro-benchmarks to Machine Learning: Unveiling the Efficiency and Scalability of Hadoop and Spark

Salah Eddine Hebabaze; Mohamed EL Ghmary; Hamid El bouabidi; Sara Maftah; Mohamed Amnai

doi:10.3991/ijim.v18i17.44555

Authors

Salah Eddine Hebabaze, Ibn Tofail University, Faculty of Science
Mohamed EL Ghmary, FSDM, Sidi Mohamed Ben Abdellah University, https://orcid.org/0000-0001-5970-481X
Hamid El bouabidi, Ibn Tofail University, Faculty of Science, https://orcid.org/0000-0002-9351-1402
Sara Maftah, Ibn Tofail University, Faculty of Science, https://orcid.org/0000-0001-8423-5015
Mohamed Amnai, Ibn Tofail University, Faculty of Science

DOI:

https://doi.org/10.3991/ijim.v18i17.44555

Keywords:

Big Data, Hadoop, Apache Spark, MapReduce, HiBench benchmark, Machine Learning, Memory Resource Limitations, Data Workloads

Abstract

With the exponential growth of data, efficient and scalable data processing has become essential. This study evaluates Hadoop and Spark, two pivotal components of the open-source Big Data landscape, through a comprehensive performance analysis in virtualized environments. The benchmarks span a range of workloads, from micro-benchmarks such as Sort, WordCount, and TeraSort to web search tasks such as PageRank and machine learning workloads including Naive Bayes and K-means. The central focus is on performance, efficiency, and resource utilization. The findings underscore the benefits of Spark’s in-memory processing, which outperforms Hadoop in various scenarios. Spark excels in machine learning and web search applications, particularly with smaller inputs, where its efficient memory management and support for multiple iterations make it a strong choice. In resource-constrained environments, or when large input files must be processed with limited memory, Hadoop may still hold an edge. The design and implementation of data processing solutions in virtualized environments should therefore weigh the specific demands of each framework. Beyond the performance comparison itself, the study highlights the implications for designing and deploying data processing solutions in virtualized settings, offering a basis for informed decision-making and for optimizing algorithms and techniques in big data processing.