Method of Association Rules Mining and Its Application in Analysis of Seawater Samples

Abstract: This paper aims to set up new rules for processing seawater quality monitoring data collected by a photoelectric sensor network, and to mine the useful information contained in the data. For this purpose, the immune algorithm was introduced into the classical genetic algorithm, the fitness function was designed, and the crossover and mutation probabilities were adjusted, thus creating the adaptive immune genetic algorithm (IIGA). The new algorithm is described in detail and applied to an actual case. Through a comparison of the IIGA, IGA and apriori algorithms, the author concluded that the IIGA not only shortened the mining time, but also ensured the operation accuracy. The research findings are of great importance to association rules mining in various fields.


Introduction
In recent years, China has established a marine environment monitoring network in the seas under its jurisdiction. The monitoring agencies have been assigned clear duties for the monitoring and early warning of marine environment quality. Years of monitoring have accumulated a large amount of raw data, which far exceeds the scope of the single-index evaluation standards for marine water quality established in 1997. To solve the problem, this paper aims to set up new rules for processing seawater quality monitoring data collected by a photoelectric sensor network, and to mine the useful information contained in the data.
In data mining, the most popular association rules algorithms are the apriori algorithm and its optimized variants. The existing method accesses the database through various search algorithms, sets a support threshold to find the frequent item sets, and then generates association rules from the frequent item sets. Since the database must be accessed repeatedly to determine the frequent item sets, the existing method increases the I/O burden and the workload, and reduces the mining efficiency. Therefore, the genetic algorithm has been introduced to the research on association rules.
The genetic algorithm is a research hotspot abroad, as stated in [1][2][3]. Researchers have combined the immune algorithm and the genetic algorithm into the immune genetic algorithm (IGA). Thanks to its unique concentration control function, the new algorithm outperforms both the immune algorithm and the genetic algorithm in many aspects, such as the avoidance of local optimum traps, the maintenance of population diversity, and the accuracy of convergence results, as stated in [4][5][6]. So far, the IGA has been successfully applied in production system optimization, resource scheduling, allocation of water resources, etc.
In view of the speed and accuracy of the IGA, this paper incorporates it into the research on association rules, and thus creates an improved immune genetic algorithm (IIGA) for the mining of association rules. The IIGA was then applied to a seawater quality monitoring database, aiming to dig out potentially useful information from the data.

Algorithm improvement

Algorithm analysis
Association rule mining is a procedure which is meant to find frequent patterns, correlations, associations, or causal structures from data sets found in various kinds of databases such as relational databases, transactional databases, and other forms of data repositories. Powerful search strategies are required to improve the performance of association rules mining.
Nowadays, the most popular association rules algorithms are the apriori algorithm and its optimized variants, as stated in [7][8][9]. However, these algorithms suffer from shortcomings such as high computing complexity. Therefore, the genetic algorithm has been gradually introduced to association rules mining.
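To make the cost of repeated database scans concrete, the following is a minimal Python sketch of support, confidence and a brute-force level-wise search for frequent item sets. It is not the apriori algorithm itself (there is no candidate pruning), and the function names and the toy transactions are assumptions made purely for illustration.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / base

def frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise search; every candidate triggers a full pass over the data."""
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    for size in range(1, max_size + 1):
        found = False
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                frequent[frozenset(candidate)] = s
                found = True
        if not found:
            break  # no frequent set of this size, so none of a larger size
    return frequent

# Hypothetical toy transactions (not taken from the monitoring data).
transactions = [frozenset(t) for t in (
    {"temp_high", "pH_high"},
    {"temp_high", "pH_high", "turbid"},
    {"temp_high"},
    {"turbid"},
)]
pair_support = support({"temp_high", "pH_high"}, transactions)
rule_confidence = confidence({"temp_high"}, {"pH_high"}, transactions)
frequent = frequent_itemsets(transactions, min_support=0.5)
```

Each call to `support` rescans all transactions, which is the source of the I/O burden that motivates a search-based alternative.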
Known for its high efficiency, the genetic algorithm is inspired by the biological competition strategy of survival of the fittest. It can encode and decode database information according to a specific strategy, and is well suited to encoding and searching. However, the genetic algorithm sometimes suffers from low accuracy and redundancy.
Drawing on the biological immune mechanism, the IGA is developed by integrating the memory function into the genetic algorithm. Compared with the genetic algorithm, the IGA carries the following prominent features, as stated in [3].
1. The fitness function and the objective function correspond to the antigen and the solution algorithm, respectively. The former is also the constraint of problem solving.
2. The antibody is a candidate solution of the problem, and the antibody set contains a group of antibodies. Similar to the genetic algorithm, the IGA also uses binary encoding and decimal encoding.
3. Repulsion between antibodies reflects the interaction between antibodies and their binding capacity.
4. The affinity between the antigen and the antibodies it induces in different groups reveals the degree of matching between antibody and antigen in the IGA.
5. The memory unit, an antibody group in the IGA, safeguards the speed and quality of convergence and the diversity of the population.
6. As in organisms, the vaccine can guard against pathogens in advance through analysis of the pathogenic mechanism. The problem-solving process requires some prior knowledge of the evolution environment and an estimate of the genes of the best individual, as stated in [10][11][12].

Improvement design
In the genetic algorithm, some excellent genes are lost prematurely because of the way the crossover and mutation operators are selected. In this case, the search range gets narrower, making it hard to find the global optimum. It also dampens the search efficiency in the late stage of evolution. In practical applications, sampling error is almost inevitable if the amount of data is very limited, and the error causes deviation from the expected results. Previous studies have shown that these defects can be resolved by the self-adaptive genetic algorithm, as stated in [13][14][15]. Here, the self-adaptive genetic algorithm is incorporated into the IGA, forming the IIGA. The crossover and mutation probabilities of both the old algorithm and the new algorithm are expressed in terms of the following quantities: f max is the maximum individual fitness of the population; f avg is the average individual fitness of the population; f min is the minimum individual fitness of the population; f' is the larger fitness of the two individuals to be crossed; f is the fitness of the individual to be mutated.
In this research, the improved Pc and Pm are non-zero and change automatically with the individual fitness. Through the improvement, the excellent individuals of the population are more likely to pass on their genes, without evolutionary stagnation or convergence to a local optimum.
Individuals whose fitness exceeds the average individual fitness of the current population should be retained during genetic evolution. In this way, the improved algorithm can jump out of the local optimum trap and avoid premature convergence. Besides, the IIGA consumes much less time in problem-solving than the original algorithms.

Algorithm design
Coding design. In terms of computer processing speed, binary encoding is the fastest, the simplest to implement and the most widely used in water quality monitoring. Given the characteristics of the water quality monitoring data in this paper, however, real number encoding is selected. Real number encoding can also be mixed with binary encoding, which yields a good mining effect.
Design of immune memory function. An organism reacts much more slowly to the first invasion of a pathogen than to the second. During the first invasion, the body's memory cells store information about the pathogen in a memory bank, so that when the pathogen strikes again, the body can respond quickly. The immune genetic algorithm simulates this behavior of the immune system. Memory cells are implemented through a database. When the same antigen is encountered again, the saved data can be found in the database, thus speeding up the calculation.
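The memory mechanism can be pictured as a lookup table placed in front of an expensive evaluation. The sketch below is one possible reading, not the paper's implementation: `MemoryBank`, `toy_fitness` and the example antibody are hypothetical names and values.

```python
class MemoryBank:
    """Stores evaluated antibodies so a repeated one skips re-evaluation."""

    def __init__(self, evaluate):
        self._evaluate = evaluate  # expensive fitness evaluation (assumed)
        self._store = {}           # antibody -> remembered fitness
        self.scans = 0             # number of real evaluations performed

    def fitness(self, antibody):
        key = tuple(antibody)
        if key not in self._store:      # first "invasion": slow path
            self.scans += 1
            self._store[key] = self._evaluate(antibody)
        return self._store[key]         # later "invasions": fast path


# Hypothetical stand-in for a full database scan.
def toy_fitness(antibody):
    return sum(1 for gene in antibody if gene != 0) / len(antibody)

bank = MemoryBank(toy_fitness)
first = bank.fitness([0, 2, 3, 0, 0])
second = bank.fitness([0, 2, 3, 0, 0])
```

After both calls, `bank.scans` is 1, because the second request is answered from the memory bank.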
Antibody promotion and suppression function design. The improved algorithm (IIGA) introduces the immune algorithm into the genetic algorithm and takes advantage of the immune algorithm. The fitness function is selected according to the actual problem, and the formulas of the adaptive crossover and mutation probabilities are adjusted. The specific operations involve the following quantities:
1. The fitness function of each antibody, where X is the immune system, a non-empty set of antibodies.
2. The relation of the antibody to the set X.
3. The antibody concentration.
4. The selection probability based on antibody concentration.
5. The selection probability based on the mixture of antibody concentration and fitness, where $P_f$ is the probability of choosing an individual based on fitness and $\alpha$ is the concentration attenuation coefficient, $0 < \alpha < 1$.
This scheme preserves the diversity of the population and increases the convergence rate. The greater the fitness of an individual, the greater its probability of selection, which promotes the population. Conversely, the higher the antibody concentration, the less likely the antibody is to be selected.
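The selection formulas are not reproduced in the text, so the following Python sketch only illustrates one common way to realize such a scheme. It assumes a vector-distance concentration: the distance of an antibody is the sum of absolute fitness differences to all other antibodies, a small distance means a high concentration, and the final probability is a weighted mix of the fitness-based and distance-based terms. The weighting by `alpha`, the default value 0.7 and all function names are assumptions.

```python
import random

def selection_probabilities(fitness, alpha=0.7):
    """Mix fitness-based and concentration-based selection probabilities.

    Assumes all fitness values are non-negative and not all zero.
    """
    n = len(fitness)
    total_f = sum(fitness)
    p_f = [f / total_f for f in fitness]          # fitness-proportional part
    # Vector distance: a crowded antibody (many similar peers) gets a small
    # distance, i.e. a high concentration, and therefore a low probability.
    dist = [sum(abs(fi - fj) for fj in fitness) for fi in fitness]
    total_d = sum(dist)
    if total_d == 0:                               # all antibodies identical
        p_d = [1.0 / n] * n
    else:
        p_d = [d / total_d for d in dist]
    return [alpha * a + (1 - alpha) * b for a, b in zip(p_f, p_d)]

def select(population, fitness, k, alpha=0.7, rng=random):
    """Draw k antibodies with the mixed probabilities."""
    probs = selection_probabilities(fitness, alpha)
    return rng.choices(population, weights=probs, k=k)

# Three crowded antibodies and one isolated, fitter antibody (toy values).
probs = selection_probabilities([0.2, 0.2, 0.2, 0.8])
only_concentration = selection_probabilities([0.5, 0.5, 0.5, 0.6], alpha=0.0)
```

With `alpha=0.0` only the concentration term remains, and the three crowded antibodies each receive a lower probability than the isolated one.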
Adaptive crossover and mutation operation design. As described under Improvement design, adaptive crossover and mutation rates are used in order to improve the convergence of the algorithm. In the adaptive immune genetic algorithm, the crossover and mutation probabilities vary with the fitness function. The crossover probability $P_c$ is calculated by formula (3). The general value of $P_{c1}$ is 0.9, and the general value of $P_{c2}$ is 0.6.
In the classical genetic algorithm, the mutation operation is a supplementary search operation. It is mainly used to maintain the individual diversity of the population.
The mutation probability $P_m$ is calculated by formula (4). The general value of $P_{m1}$ is 0.1, and the general value of $P_{m2}$ is 0.001.
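Formulas (3) and (4) are not reproduced in the text. As an illustration only, the sketch below uses a widely used linear adaptive form that is consistent with the description: both probabilities stay non-zero, take the upper value (0.9 or 0.1) for below-average individuals, and fall linearly to the lower value (0.6 or 0.001) for the best individual. The exact expressions in the paper may differ, and the function names are hypothetical.

```python
def adaptive_rate(f, f_avg, f_max, high, low):
    """Linear adaptive rate: `high` below average, `low` at the best fitness."""
    if f < f_avg or f_max == f_avg:
        return high
    return high - (high - low) * (f - f_avg) / (f_max - f_avg)

def crossover_probability(f_prime, f_avg, f_max, pc1=0.9, pc2=0.6):
    """f_prime: the larger fitness of the two individuals to be crossed."""
    return adaptive_rate(f_prime, f_avg, f_max, pc1, pc2)

def mutation_probability(f, f_avg, f_max, pm1=0.1, pm2=0.001):
    """f: the fitness of the individual to be mutated."""
    return adaptive_rate(f, f_avg, f_max, pm1, pm2)

# Toy population statistics: average fitness 0.5, best fitness 1.0.
pc_weak = crossover_probability(0.3, f_avg=0.5, f_max=1.0)
pc_best = crossover_probability(1.0, f_avg=0.5, f_max=1.0)
pm_best = mutation_probability(1.0, f_avg=0.5, f_max=1.0)
```

Under these assumptions a weak individual is recombined aggressively, while the best individual is disturbed the least but never frozen, because neither probability reaches zero.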

This paper introduces the adaptive immune genetic algorithm (IIGA). The algorithm takes advantage of the immune algorithm, selects the fitness function according to the actual problem, and adjusts the formulas of the adaptive crossover and mutation probabilities.
The IIGA is implemented in the following steps:
1. Configure the parameters;
2. Generate the initial population with real number encoding;
3. Scan the entire database, and calculate F(x), Supp and Conf of each individual in database I;
4. Add the individuals surpassing the pre-set significance threshold to the rule table; otherwise, perform the following steps;
5. Carry out antibody promotion and suppression (selection) operations;
6. Perform adaptive crossover and mutation operations;
7. Terminate the implementation if the pre-set number of iterations is reached; otherwise, go to step 3;
8. Output the rule table and obtain the rule results.
The addition of antigen recognition greatly shortens the running time of the IIGA. Besides, the formulas of the crossover and mutation probabilities are adjusted properly. The implementation process is shown in Figure 1.
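The eight steps can be sketched as a short Python skeleton. This is an outline under stated assumptions rather than the implementation used in the paper: every operator is passed in as a function, the default parameter values are placeholders, and the toy operators at the bottom are hypothetical stand-ins that exist only so the loop can run.

```python
import random

def run_iiga(evaluate, random_rule, select, crossover, mutate,
             pop_size=40, iterations=300, threshold=0.8, seed=0):
    """Skeleton of steps 1-8; every operator is supplied by the caller."""
    rng = random.Random(seed)                                   # step 1
    population = [random_rule(rng) for _ in range(pop_size)]    # step 2
    rule_table = {}
    for _ in range(iterations):                                 # step 7
        scores = [evaluate(rule) for rule in population]        # step 3
        for rule, score in zip(population, scores):             # step 4
            if score >= threshold:
                rule_table[tuple(rule)] = score
        parents = select(population, scores, pop_size, rng)     # step 5
        population = []
        for a, b in zip(parents[0::2], parents[1::2]):          # step 6
            child_a, child_b = crossover(a, b, rng)
            population.append(mutate(child_a, rng))
            population.append(mutate(child_b, rng))
    return rule_table                                           # step 8


# Toy operators, all hypothetical, so that the skeleton can be run.
def toy_random_rule(rng):
    return [rng.randint(0, 3) for _ in range(5)]

def toy_evaluate(rule):
    return sum(1 for gene in rule if gene == 2) / len(rule)

def toy_select(population, scores, k, rng):
    weights = [s + 0.01 for s in scores]   # keep every weight positive
    return rng.choices(population, weights=weights, k=k)

def toy_crossover(a, b, rng):
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def toy_mutate(rule, rng):
    rule = list(rule)
    if rng.random() < 0.1:
        rule[rng.randrange(len(rule))] = rng.randint(0, 3)
    return rule

rules = run_iiga(toy_evaluate, toy_random_rule, toy_select,
                 toy_crossover, toy_mutate, iterations=50)
```

In a real setting `evaluate` would scan the database and combine F(x), Supp and Conf, `select` would apply the concentration-based selection, and `crossover` and `mutate` would use the adaptive probabilities.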

Algorithm validation
The programming was carried out in MATLAB 2015. The experimental data were extracted from the open source UCI Ecoli dataset. The specific parameters were configured as follows: Pc=0.95; Pm=0.01; Supp=0.3; numbers=100; number of iterations=300; number of generations in the population=40.
The experimental results of the IIGA, the apriori algorithm and the IGA are shown in Table 1.
According to Table 1, the IIGA and the IGA produced similar numbers of simplified properties, breakpoints and optimization rules; however, the IIGA outperformed the IGA and the apriori algorithm in operation accuracy and mining time. This is because the IIGA reduces the number of mined rules, improves the accuracy and shortens the execution time. The main reason for the short running time lies in the concentration selection scheme based on vector distance. The high rule accuracy is attributed to the adjustment of the crossover and mutation probabilities, which makes the mined rules more comprehensive, effective and concise. Figures 2 and 3 compare the three algorithms at different support thresholds. As can be seen from the two figures, the IIGA maintained an edge over the IGA and the apriori algorithm in both running time and accuracy. Besides, the support threshold is negatively correlated with the running time and positively correlated with the accuracy.

Data preprocessing
The seawater quality data were obtained through monitoring on a photoelectric sensor network, and were analyzed by the IIGA to mine association rules. Figure 4 shows the data captured from the photoelectric sensor network. The parameters include the time, latitude, temperature, salinity, turbidity, algae fluorescence, COD and buoy ID.
iJOE - Vol. 14, No. 5, 2018
Because the mining parameters are numerical, real numbers were adopted in the following parts, and each monitoring value was assigned to one of several intervals.
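Dividing a monitoring value into intervals amounts to a lookup against a list of boundaries. The Python sketch below shows one way to do this; the field names, the boundary values and the helper names are hypothetical and are not the intervals used in the paper.

```python
from bisect import bisect_right

def encode_value(value, edges):
    """Return the interval code 1..n for `value`, where n = len(edges) + 1.

    `edges` must be sorted; a value equal to an edge falls in the upper
    interval.
    """
    return bisect_right(edges, value) + 1

def encode_record(record, edges_by_field, fields):
    """Concatenate one digit per field (assumes at most 9 intervals each)."""
    return "".join(str(encode_value(record[f], edges_by_field[f]))
                   for f in fields)

# Hypothetical interval boundaries for two of the monitored parameters.
edges_by_field = {"temperature": [10.0, 20.0], "salinity": [30.0, 33.0]}
fields = ["temperature", "salinity"]
code = encode_record({"temperature": 15.0, "salinity": 31.0},
                     edges_by_field, fields)
```

Here `code` is the string `"22"`: both readings fall in the second interval of their field.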

Coding
According to the actual needs of the IIGA, the monitoring values in each field were encoded as integers from 1 to n according to the interval they fall in. In addition, the code 0 can be used in a field, which leaves that attribute unconstrained in a rule. Under this constraint, the IIGA generated rules in a random manner. For example, the generated rule 02300 is covered by examples 22355 and 52366, but not by example 35632. The parameters in Table 2 were mined and mapped to the results in Table 3.
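The coverage test in the example above can be written in a few lines of Python. The sketch assumes that a 0 in the rule matches any value in that position, which is the reading suggested by the three examples; the function name `covers` is hypothetical.

```python
def covers(rule, example):
    """True if every non-zero digit of `rule` equals the digit in `example`."""
    if len(rule) != len(example):
        raise ValueError("rule and example must have the same length")
    return all(r == "0" or r == e for r, e in zip(rule, example))

results = {ex: covers("02300", ex) for ex in ("22355", "52366", "35632")}
```

Rule 02300 constrains only the second and third fields, so 22355 and 52366 are covered while 35632 is not.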

Rule description
After mining the data in Table 3, the generation rules were created (Table 4). Every rule and its encoding interval reflect a specific monitoring value. The potential information in the monitoring data was mined according to different encoding rules.

The monitoring data from the photoelectric sensor network were mined for association rules by the IIGA. Take rules 0200010, 0030003 and 0303000 as examples:
(1) Rule 0200010: 56% means that the pH value will be slightly higher if the water temperature rises; 86% means that this is very likely to happen in every season of the year.
(2) Rule 0030003: 62% means that the turbidity will increase with water salinity; 50% means that the relationship is not very significant.
(3) Rule 0303000: 81% means that the algae fluorescence will increase with the rise of seawater temperature; 97% means that this is very likely to happen.

Conclusions
In this paper, the immune algorithm was introduced into the classical genetic algorithm, the fitness function was designed, and the crossover and mutation probabilities were adjusted, thus creating the adaptive immune genetic algorithm (IIGA). The new algorithm was described in detail and applied to an actual case. Through a comparison of the IIGA, IGA and apriori algorithms, the author concluded that the IIGA not only shortened the mining time, but also ensured the operation accuracy. The research findings are of great importance to association rules mining in various fields.