Estimate the Performance of Cloudera Decision Support Queries

— Hive and Impala queries are used to process a big amount of data. The overwriting amount of information requires an efficient data processing system. When we deal with a long-term batch query and analysis Hive will be more suitable for this query. Impala is the most powerful system suitable for re-al-time interactive Structured Query Language (SQL) query which are added a massive parallel processing to Hadoop distributed cluster. The data growth makes a problem with SQL Cluster because the execution processing time is increased. In this paper, a comparison is demonstrated between the performance time of Hive, Impala and SQL on two different data models with different queries chosen to test the performance. The results demonstrate that Impala outper-forms Hive and SQL cluster when it comes to analyze data and processing tasks. Using two benchmark datasets, TPC-H and statistical computing, we compare the performance of Hive, Impala, and SQL clusters 2009 Statistical Graphics Data Expo.


Introduction
A massive quantity of data is produced daily from every part of the world [1]. This Inflation in the data comes from the advances of technology like the arise of cloud computing, internet of things and smart and sending devices which exist nowadays in [2]. The source of data is heterogenous and diversified. It could be means of communication like social media sites, sensor networks, health care reports, security camera, hospitals, governments and so on [3]. Big data is used to define large amount of complex data that can't be processed or analyzed by traditional means [4].
One of the many techniques to analyze a big amount of data is Hadoop framework [5][6][7]. Hadoop is an Apache framework that can be used to process big data [8]. Not only Hadoop can analyze data but also is a distributed storage for big data as well relying on distributed clusters of commodity machines [9]. An easy way to query and analyze data in Hadoop is by using Hive or Impala. In the next two sections we are explained the Hive and Impala frameworks.
By integrating big data, studies try to enhance the platform's usage as a platform for processing such large data [10]. To address the issue, Apache Hadoop is the best option [11]. However, for complicated businesses, relying on only one platform is insufficient. Big data processing with sophisticated queries, such as Cloudera [12] and Hortonworks [13], will impact execution time if using the data Platform, which is a solution that unifies the features and capabilities of several apps and utilities. Both platforms offer a variety of query processing scenarios for cutting down on execution time.
The rest of this article is organized as follows: Section 2 explains the Apache Hive Architecture. Section 3 briefly discusses the Cloudera Impala framework. Section 4 depicts the proposed frameworks. In Section 5 shows the experiment and result. Section 6 is a conclusion.

Related work
Hive is a data warehouse which offers HiveQL (HQL) language which is similar to SQL for easy query data [14]. Apache Hive supports different file formats such as text files, Avro, sequence file…etc. Also, it supports various programming languages like java, python which means Hive client applications can be implemented with many languages [15]. Figure 1 shows the Hive Architecture.
Hive services consist of three main components which are Compiler, Driver and Metastore. When a SQL command is written for execution, it gets parsed by the help of the compiler to check syntax and convert it into MapReduce input. Processing and resources managements can be MapReduce or YARN applications. The Metadata generated from executing Hive queries is storied in Metastore by the driver. The generated Metadata contains data for tables such as its schema and location [16]. Hadoop distributed file systems represent a distributed storage.  Cloudera IMPALA On the other hand, Apache Impala is considered the newest engine integrated into the Hadoop Distributed System [16]. Unlike Hive, Impala doesn't rely on MapReduce architecture, instead, Impala relied on a massively parallel process (MPP) architecture, so it performs faster than Hive [17]. Thus, Impala is considered a real time query engine which uses Hive metastore of the table information which already exists. Analysts prefer impala to perform real time analytics with lower cost, complexity and high speed [18].
Impala's architecture consists of three daemons which are catalog, statestored and impalad. Impalad deamons execute on every node of the distributed cluster as it considered as the main part of the Impala. Impalad also accepts queries and distribute it to other nodes in the cluster. While, Statestored deamon keeps track of the state of the nodes to inform impalad deamons of the healthy nodes which will accept and execute the queries. Finally, the catalogue deamon transfers the changes done to the metadata to all the nodes in the cluster [16] [19]. Figure 2 depicts how the Impala architecture looks from the front.

Fig. 2. The architecture of Impala
Even though Impala is based on Hadoop, it doesn't use it. All your nodes have daemons running that cache some of the HDFS data. The MapReduce technique does not need to be used because these Impala daemons can swiftly deliver data. Impala is not a replacement for Hive; rather, it excels in situations where Hive fails. Data scientists and business analysts who only want to look at and study some data without constructing comprehensive workflows may find the Impala to be an excellent choice. As a side note, Impala isn't all that mature. When the amount of data is greater than the available memory, it can crash. Figure 3 illustrates the query processing procedure in Impala [19] using

Experimental result of TPC-H dataset
TPC-H dataset contains 22 queries of decision support queries from Q1 to Q22 which asses the performance of different decision support systems by examining large volumes of data and executing complex queries [20]. We examine only the queries which contains multiple joins between tables to show the difference between each framework. Queries "Q1, Q3, Q5, Q7, Q8, Q9, Q10, Q11, Q12, Q18, Q21" are shown the joint between tables. Each TPC-H query asks a business question and includes the corresponding query to answer the question. We make a comparison between Hadoop and SQL which discussed in TPC BenchmarkTM H Standard Specification Revision 3.0.0. The comparison depends on the execution time of each query in different approaches.
Experiments are carried out using Cloudera VM on Oracle VM VirtualBox. The queries were performed on Hue. Four processors were used with 2 cores per each processor and 16GB memory. According to TPC-H Rev. 2.18.0, the total database size is 1000 GB. We applied the same database and the same queries to explain the execution time for each query on Hadoop frameworks and SQL.

Experimental result of data expo '09 dataset
Statistical analysis and modeling graphs that display statistics data originally created from RITA, the Data Expo 2009 dataset which contains large records of flight departure and arrival that could reach more than 120 million records [21]. Each data record contains 23 attributes and 70-80 lakh rows. The data size is up to 1.6 gigabytes when it is compressed and 12 gigabytes when it is uncompressed. This data is providing the important features of the dataset graphically. For each dataset, we have a year's worth of airline data saved in the D1 dataset, and 2-and 3-years' worth of airline data in the D2, D3 datasets [22]. We make a comparison with a previous work but on a different experiments' environment. Experiments are carried out using Cloudera VM on Oracle VM VirtualBox. The queries were performed on Hue. Four processors were used with 2 cores per each processor and 16 GB memory.

Results and discussion
We have two different frameworks: Hive and Impala on Hadoop. There are two different datasets are used: the TPC-H dataset and Statistical Graphics Data expo '09 dataset. Performance parameters including data set file size, query statements, and query average time all have an impact on the final output.

TPC-H dataset results
The performance of Hadoop frameworks "Hive and Impala" were tested by two benchmark databases named TPC-H and Graphics Data expo '09. In TPC-H Dataset, we tested the queries which contained the most joins in specific query to test the performance of Hive and Impala when multiple tables exist compared with SQL query. The database load time is 02:34:12 on SQL framework [20]. In Hadoop frameworks, the database sored on HDFS directly which save this time in the execution phase.
As shown in Table 1, Impala showed lower execution time while using TPC-H dataset. Hence, joining tables from multiple nodes in the cluster didn't affect the performance of Impala unlike Hive like shown in figure 4. In SQL and Hive queries, there are a big difference in the execution time as compared with Impala framework.

Statistical graphics data expo '09 dataset results
While in the other dataset, performance is measured regarding the size of data in one table and the efficiency of Hive and Impala is tested on table containing the data of one-year D1, while D2 dataset contained the data of two years and D3 dataset contained data of three accumulated years. As stated in [22], size of the data being processed affects the performance and the query execution time, so the data of the statistical computing statistical graphics dataset is divided into three datasets. We test the query in the Hive and Impala three periods and takes these three outcomes, on average. Table 2 and figure 5 shown the mean result in a Hive query on D1datasets and the compared results. Data processing is scalable and straight forward in Hive cluster. Our Hive cluster has a less execution time than a compared cluster [22].   Table 3 and Figure 6 shown the mean result in Impala cluster on datasets D1 and the compared Impala results [19]. Impala cluster on Hadoop has a less execution time than Hive cluster.    Figure 7 shown the mean result in a Hive query on D2 datasets and the compared results.   Table 5 and Figure 8 shown the mean result in Impala cluster on datasets D2 and the compared Impala results [22]. Impala cluster on Hadoop has a less execution time than Hive cluster.  Lastly, a bigger dataset was made containing three years of data named D3. It tested on both Hive and Impala and compared with previous work [22][23][24]. Query mean time was higher than the last two experiments, but Impala's MPP architecture proved its efficiency as shown in Table 6, 7 and Figures 9, 10. Even though time increased but Impala still has better execution time than Hive as shown below.

Conclusion
Even though Impala is the newest addition to the Hadoop distributed system, but it proved high efficiency in a big data. Two datasets were used in this paper to evaluate the outcomes of Hive and Impala when the size of data being processing is increased or when joining multiple tables which exists in multiple nodes occurs. According to experiments, Impala showed better performance and efficiency when it comes to query completion time due to relying on MPP which is designed to handle complex queries. Being a real time query system will open the doors to different applications which will need fast response and decisions depending on millions of data. In Hive, A good designed tables and query are improved the execution time which reduced the processing cost. Hive used MapReduce in parallel processing to execute one program while Impala provides MPP. Impala is ideal for real time queries, but it is not ideal for heavy joins data.