Big Data Cleaning Algorithms in Cloud Computing

Abstract—Big data cleaning is an important research issue in cloud computing. Existing data cleaning algorithms assume that all the data can be loaded into main memory at once, which is infeasible for big data. To this end, a data cleaning algorithm based on a knowledge base is proposed for cloud computing using MapReduce. It first extracts atomic knowledge from the selected nodes, then analyzes their relations, deletes duplicate objects, builds an atomic knowledge sequence based on weights, and finally cleans the data according to this sequence. The experimental results show that the big data algorithm is effective and feasible in the cloud computing environment and has good scalability.


I. INTRODUCTION
With the development of the Internet of Things, its technology has been widely applied in various fields and has accumulated massive data [1]. Since the emergence of cloud computing, with the continuous development of science and technology and the efforts of academia and industry, the applications of cloud computing keep expanding, and cloud computing is moving from theory to practice.
With the development of cloud computing, the data center has also evolved. Nowadays, a data center is not only a site where servers are managed and maintained, but also a cluster of many high-performance computers that can process and store huge amounts of data. Many cleaning algorithms for big data have been proposed [2]; they can be divided mainly into regional object cleaning algorithms [3], object cleaning algorithms based on information theory [4][5], and object cleaning algorithms based on the discernibility matrix and its improvements [6]. Many scholars mainly study how to deal with inconsistent decision tables [7] and how to improve the efficiency of big data algorithms [4][8][9].
Big data cleaning is an important way to address the massive data mining problem. A big data cleaning algorithm that combines a parallel genetic algorithm with a co-evolutionary algorithm [10] decomposes the object cleaning task and can improve the efficiency of big data algorithms. However, such big data cleaning algorithms assume that all the data can be loaded into main memory at once, which is infeasible for big data. Cloud computing is a new business computing model proposed in recent years; it is the development of distributed computing, parallel computing and grid computing [11]. Google Inc., a pioneer of cloud computing, proposed GFS (Google File System), a large distributed file system with massive data storage and access capacity [12], together with MapReduce, a parallel programming model for handling massive data, which provides a feasible solution for massive data mining [13]. Cloud computing technology has been applied in the field of machine learning [14], but it has not yet been truly applied to big data cleaning algorithms [15].

This paper studies the MapReduce big data programming technique and analyzes existing big data cleaning algorithms. By analyzing the limitations of traditional knowledge base structures, an extended tree-like knowledge base is built by decomposing and recomposing the domain knowledge. Combined with MapReduce, algorithms suitable for computing the equivalence classes of large-scale data sets are designed, and the big data cleaning algorithms are implemented in a cloud computing environment based on the open-source Hadoop platform. The experimental results show that the algorithm not only has good scalability but is also able to handle huge amounts of data.

II. CLOUD COMPUTING ALGORITHM FOR BIG DATA
Assume that the decision table T has n different decision values. The decision values of the compatible objects are mapped to 1, ..., n, and the decision values of all incompatible objects are mapped to n+1. The decision table T can then be decomposed into n+1 sub-decision tables $D_1, D_2, \ldots, D_{n+1}$, each containing the objects of one decision class, with object numbers $n_1, n_2, \ldots, n_{n+1}$; in this way T can be treated as a compatible decision table. Assume further that an object a takes r different values, which are mapped to 1, ..., r.

A. Discernibility objects calculation method in the cloud computing environment

A discernibility object pair is generated by two objects that differ both in decision value and in the value of some condition object. If two objects have different decision values and the condition object a also takes different values on them, then a can discern these two objects, that is, a has relative discernibility ability. The more object pairs an object can discern, the stronger its relative discernibility ability, so the number of discernible object pairs can be used to measure the relative discernibility ability.
In a compatible decision table T, for $a \in C$, the number of object pairs that a can discern relative to D is given by

$ObjS_a^D = |\{\{x, y\} \subseteq U : a(x) \neq a(y) \wedge D(x) \neq D(y)\}|$.  (1)

In a compatible decision table T, for $Q \subseteq C$, the number of object pairs that the object set Q can discern is given by

$ObjS_Q^D = |\{\{x, y\} \subseteq U : \exists a \in Q,\ a(x) \neq a(y) \wedge D(x) \neq D(y)\}|$.  (2)

Assume that A divides U into r equivalence classes,

$U/A = \{A_1, A_2, \ldots, A_r\}$,  (3)

where the value combinations on A are mapped to 1, ..., r. The number of object pairs that A can discern is then calculated according to the following definition.
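As a small illustration of Equation (1), the following worked example uses hypothetical objects and values that are not taken from Table 1: a table with four objects, one condition object a and the decision object D.

% Hypothetical illustration of Equation (1); the values are assumptions, not data from Table 1.
% U = {x1, x2, x3, x4}, with a(x1)=1, a(x2)=1, a(x3)=2, a(x4)=2
% and D(x1)=1, D(x2)=2, D(x3)=1, D(x4)=2.
\[
ObjS_a^D
  = \bigl|\{\{x,y\} \subseteq U : a(x) \neq a(y) \wedge D(x) \neq D(y)\}\bigr|
  = \bigl|\{\{x_1, x_4\}, \{x_2, x_3\}\}\bigr|
  = 2 .
\]

Only the pairs $\{x_1, x_4\}$ and $\{x_2, x_3\}$ differ both in the value of a and in the decision value, so a can discern exactly two object pairs in this table.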
In a compatible decision table T, for $A \subseteq C$, with $U/A = \{A_1, \ldots, A_r\}$ and $U/D = \{D_1, \ldots, D_{n+1}\}$, the number of object pairs that the object set A can discern is given by

$ObjS_A^D = \frac{1}{2} \sum_{i \neq j} \sum_{k \neq l} |A_i \cap D_k| \cdot |A_j \cap D_l|$.  (4)

As can be seen from Equation (4), this calculation method requires the equivalence classes induced on U by A and by D and the intersections between them, which is relatively complex; the number of discernible object pairs can therefore be computed in memory only for small-scale object sets. In the cloud computing environment, however, the data of the different equivalence classes are stored across many nodes and files, and computing $ObjS_A^D$ involves many different equivalence classes, so the number of discernible pairs cannot be computed directly from the definition. How to compute this number quickly thus becomes a key issue for data cleaning algorithms in the cloud computing environment. Two methods for calculating the number of discernible object pairs in the cloud computing environment are presented below.
As is well known, the object set A can discern any two objects that lie in different equivalence classes of U/A, which shows that A has a certain amount of discernibility ability; if A places all elements into a single equivalence class, A has the weakest discernibility ability. Therefore, $ObjS_A^D$ can also be calculated from the object pairs that A cannot discern.
In a compatible decision table T, for $A \subseteq C$, the number of object pairs that the object set A can discern is given by

$ObjS_A^D = \frac{|U|^2 - \sum_{k=1}^{n+1} n_k^2}{2} - \sum_{i=1}^{r} \frac{|A_i|^2 - \sum_{k=1}^{n+1} |A_i \cap D_k|^2}{2}$,  (5)

where the first term is the total number of object pairs in U with different decision values, and each summand of the second term is the number of object pairs inside the equivalence class $A_i$ that have different decision values, i.e. the pairs that A cannot discern; the second term sums these quantities over $A_1, A_2, \ldots, A_r$, and each summand can be computed independently of the others.

B. Big data cleaning algorithm in the cloud computing environment

It can be seen from Equations (2) and (5) that the number of object pairs $ObjS_A^D$ that the object set A can discern, or equivalently the number it cannot discern, has to be obtained by computing equivalence classes, and that the different equivalence classes can be computed in parallel. Therefore, the MapReduce big data programming technique can be used to handle the large data.
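Before turning to the MapReduce implementation, the following minimal Java sketch (an illustration with a made-up count matrix, not the paper's code) checks that Equations (4) and (5) give the same value and shows that the second term of Equation (5) is a per-equivalence-class sum, which is what makes the parallel computation possible.

// Minimal sketch: computes ObjS_A^D from the counts |A_i ∩ D_k|,
// once by Eq. (4) and once by Eq. (5). The count matrix in main() is hypothetical.
public class DiscernibilityCount {

    // Eq. (4): 1/2 * sum over i != j, k != l of |A_i ∩ D_k| * |A_j ∩ D_l|
    static long byCrossIntersection(long[][] cnt) {
        int r = cnt.length, m = cnt[0].length;
        long pairs = 0;
        for (int i = 0; i < r; i++)
            for (int j = 0; j < r; j++)
                if (i != j)
                    for (int k = 0; k < m; k++)
                        for (int l = 0; l < m; l++)
                            if (k != l)
                                pairs += cnt[i][k] * cnt[j][l];
        return pairs / 2;                        // each unordered pair is counted twice
    }

    // Eq. (5): all pairs with different decision values, minus the pairs inside one A_i
    // that still have different decision values (the pairs A cannot discern).
    static long byComplement(long[][] cnt) {
        int r = cnt.length, m = cnt[0].length;
        long total = 0;                          // |U|
        long[] nk = new long[m];                 // n_k = |D_k|
        for (int i = 0; i < r; i++)
            for (int k = 0; k < m; k++) { total += cnt[i][k]; nk[k] += cnt[i][k]; }
        long diffDecisionPairs = total * total;
        for (int k = 0; k < m; k++) diffDecisionPairs -= nk[k] * nk[k];
        diffDecisionPairs /= 2;
        long indiscernible = 0;                  // one independent term per A_i
        for (int i = 0; i < r; i++) {
            long classSize = 0, sameDecision = 0;
            for (int k = 0; k < m; k++) { classSize += cnt[i][k]; sameDecision += cnt[i][k] * cnt[i][k]; }
            indiscernible += (classSize * classSize - sameDecision) / 2;
        }
        return diffDecisionPairs - indiscernible;
    }

    public static void main(String[] args) {
        // hypothetical counts: rows = equivalence classes of U/A, columns = decision classes
        long[][] cnt = { {2, 1, 0}, {0, 3, 1}, {1, 0, 2} };
        System.out.println(byCrossIntersection(cnt)); // prints 26
        System.out.println(byComplement(cnt));        // prints 26 as well
    }
}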
In the MapReduce programming framework, the user focuses on the data processing algorithm itself and only writes a Map function and a Reduce function to realize large-scale data processing. Specifically, the Map function computes the equivalence classes within each data block, and the Reduce function collects the objects of the same equivalence class across data blocks and counts either the discernible or the indiscernible object pairs of that equivalence class. Based on the different calculation methods of $ObjS_A^D$, two data cleaning algorithms in the cloud computing environment are given.
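As an illustration of the kind of Map and Reduce pair just described (a sketch under assumptions, not the authors' Algorithms 1 and 2), the following Java code assumes each input record is a comma-separated line of condition values followed by the decision value. The Map function emits one <equivalence class, 1> pair per record, keyed by a configured candidate column together with the decision value, and the Reduce function sums the counts of each equivalence class. The Hadoop job driver and configuration are omitted.

// Sketch of an equivalence-class counting Map/Reduce pair (assumed record
// format "v1,v2,...,vm,d": condition values, then the decision value).
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class EquivalenceClassCount {

    public static class EqMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private int candidate;                   // index of the candidate condition object

        @Override
        protected void setup(Context context) {
            candidate = context.getConfiguration().getInt("candidate.index", 0);
        }

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String[] fields = record.toString().split(",");
            String decision = fields[fields.length - 1];
            // key = candidate value plus decision value, i.e. the class of U/{a, D}
            context.write(new Text(fields[candidate] + "|" + decision), ONE);
        }
    }

    public static class EqReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text eqClass, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();   // size of this equivalence class
            context.write(eqClass, new IntWritable(sum));
        }
    }
}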
In a compatible decision table T, for $A \subseteq C$, a necessary and sufficient condition for A to preserve the discernibility ability of C relative to the decision object D is given by

$ObjS_A^D = ObjS_C^D$.  (6)

The algorithm based on $ObjS_A^D$ consists of three parts: the Map function (Algorithm 1), the Reduce function (Algorithm 2) and the main program (Algorithm 3), which are described below. Algorithm 1, the Map function, computes the equivalence classes of each data block and their occurrence counts; Algorithm 2 collects the counts of the same equivalence class from all data blocks; and Algorithm 3 calculates the identification ability of each candidate object and selects the best candidate. According to the relative identification ability of each candidate object, the above process is repeated until the data cleaning is complete.
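Since the main program is described above only in prose, the following is a hedged sketch of the greedy selection loop it suggests, not the authors' Algorithm 3. Here evaluateCandidate is a hypothetical stand-in for one round of MapReduce jobs that returns the discernibility count obtained by adding candidate c to the already selected set, and target stands for $ObjS_C^D$ from Equation (6).

// Sketch of the greedy selection loop suggested by Algorithm 3 (assumed behaviour).
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;

public class GreedyCleaning {

    public static List<Integer> select(int numCandidates, long target,
                                       BiFunction<Set<Integer>, Integer, Long> evaluateCandidate) {
        List<Integer> selected = new ArrayList<>();
        Set<Integer> chosen = new HashSet<>();
        long current = 0;
        while (current < target) {                       // until full discernibility is reached
            int best = -1;
            long bestValue = current;
            for (int c = 0; c < numCandidates; c++) {    // one MapReduce round per candidate
                if (chosen.contains(c)) continue;
                long value = evaluateCandidate.apply(chosen, c);
                if (value > bestValue) { bestValue = value; best = c; }
            }
            if (best < 0) break;                         // no candidate adds discernibility
            chosen.add(best);
            selected.add(best);
            current = bestValue;
        }
        return selected;
    }
}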

C. Big data parallel strategy
The traditional parallel object cleaning algorithm is based on the assumption that all the data can be loaded into memory at once, which is not suitable for large-scale data sets. In the proposed strategy, the large-scale data are divided into a number of data slices; each task computes, in parallel, the equivalence classes induced by the candidate object sets on its slice, then the numbers of object pairs that each candidate can or cannot discern are computed from these results, and finally the best candidate object is determined.
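A minimal local sketch of this data-parallel strategy is given below, assuming records are string arrays whose last element is the decision value; each slice's equivalence-class counts are computed independently and then merged, mirroring the Map and Reduce roles. It uses Java 8 streams for brevity rather than the Hadoop 0.20.2/Java 1.7 stack used in the experiments.

// Sketch of the data-parallel counting step (assumption: String[] records,
// decision value in the last position).
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DataParallelCounts {

    // counts of the equivalence classes of one data slice for one candidate column
    static Map<String, Long> sliceCounts(List<String[]> slice, int candidate) {
        Map<String, Long> counts = new HashMap<>();
        for (String[] rec : slice) {
            String key = rec[candidate] + "|" + rec[rec.length - 1];
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    // each slice is processed independently (the "Map" role), then the counts are merged
    static Map<String, Long> mergedCounts(List<List<String[]>> slices, int candidate) {
        return slices.parallelStream()
                     .map(s -> sliceCounts(s, candidate))
                     .flatMap(m -> m.entrySet().stream())
                     .collect(Collectors.groupingBy(Map.Entry::getKey,
                              Collectors.summingLong(Map.Entry::getValue)));
    }
}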

III. EXPERIMENTAL ANALYSIS

A. Example analysis
A compatible decision table (Table 1) is used to illustrate the two algorithms proposed in this paper; Table 1 lists the decision values of all compatible objects.
Assume that Table 1 is divided into two data slices: the first data slice contains objects 1 to 2, and the second data slice contains objects 3 to 5. The following describes how the algorithm computes the numbers of object pairs on this data.

For the relative identification algorithm based on $ObjS_A^D$, the computation for the candidate object $C_1$ proceeds as follows. In the Map stage, object $x_1$ in the first data slice generates three <key, value> pairs: the equivalence class induced by $C_1$ gives <"$C_1$=1", 1>, the equivalence class induced by D gives <"D=2", 1>, and the equivalence class induced by the object set {$C_1$, D} gives <"$C_1$=1, D=2", 1>. The computation for the remaining objects is similar.
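Since Table 1 itself is not reproduced here, the following tiny sketch uses assumed values ($C_1$ = 1 and D = 2 for object $x_1$) purely to make the three emitted <key, value> pairs concrete.

// Hypothetical illustration only (values assumed, not taken from Table 1):
// the three <key, 1> pairs the Map stage emits for one object.
public class MapStageExample {
    public static void main(String[] args) {
        String c1 = "1", d = "2";                              // assumed values for object x1
        System.out.println("<C1=" + c1 + ", 1>");              // equivalence class of U/{C1}
        System.out.println("<D=" + d + ", 1>");                // equivalence class of U/{D}
        System.out.println("<C1=" + c1 + ",D=" + d + ", 1>");  // equivalence class of U/{C1, D}
    }
}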

B. Experimental results
This section evaluates the performance of the parallel algorithm based on $ObjS_A^D$ in terms of running time, speedup and scaleup in the cloud computing environment. The tests use the artificial data sets Obj1, Obj2, Obj3 and Obj4; Table 2 lists the characteristics of the different data sets. The experiments were run on a cloud computing environment built from 10 ordinary computers using the open-source cloud computing platform Hadoop 0.20.2 and Java 1.7, with one master node and nine slave nodes.
Firstly, the running times of the data cleaning algorithms are compared. On the two small data sets Obj1 and Obj2, the three data cleaning algorithms based on $ObjS_A^D$ were compared. As can be seen from Figure 1, MapReduce is not well suited to small data sets, and the data cleaning method that uses both data and task parallelism has a shorter running time than the algorithm that uses only data parallelism.
Secondly, the running times of the data sets Obj1 to Obj5, each tested on four nodes, are shown in Figure 2.

The existing data cleaning algorithms can compute a minimum object cleaning, but only on small data sets: the big data are randomly divided into several sub-decision tables, the size of the positive region is calculated for each sub-decision table, the optimal single candidate object is selected for each one, and the process is repeated to obtain the cleaning. However, for an inconsistent decision table this method does not guarantee that the positive regions computed on the individual sub-decision tables are equivalent to the positive region of the whole decision table, because no information is exchanged between the sub-decision tables when computing positive regions; it is also unable to clean larger sub-decision tables. Another approach decomposes the large-scale data using MapReduce, cleans each data partition, then merges the cleanings of the data slices, adds other necessary candidate objects, and finally removes redundant objects. However, the merged candidate cleaning may become a rather large set of condition objects, which makes it very difficult to remove the redundant objects.
To solve the data cleaning problem for large data sets, this paper divides the large-scale data using MapReduce, computes the equivalence classes of the different candidate object sets for each data partition, then summarizes the equivalence classes, calculates the numbers of object pairs that each candidate object can or cannot discern, selects the optimal single candidate object, and iterates this process until the cleaning is obtained. The proposed algorithm is based on a data parallel strategy, so it greatly saves MapReduce job start-up and scheduling time, thereby improving the efficiency of the data cleaning algorithm in the cloud computing environment.

IV. CONCLUSIONS
The existing data cleaning algorithms reduce time complexity through improved sorting algorithms or better data structures for quickly computing equivalence classes, but they can only handle small data sets that fit in main memory. For the data cleaning of large-scale data sets, this paper analyzes which parts of the big data cleaning algorithm can be parallelized and, based on the knowledge base, proposes a data cleaning algorithm for cloud computing using MapReduce. It first extracts atomic knowledge from the selected nodes, analyzes the relations among the atomic knowledge, deletes duplicate objects, and builds an atomic knowledge sequence based on weights; several data-parallel strategies are discussed and implemented, and experiments are carried out on an ordinary computer cluster using Hadoop. The experimental results show that the data cleaning algorithm achieves good speedup and is able to handle large data sets.