Medical Image Segmentation Using a Combination of Lattice Boltzmann Method and Fuzzy Clustering Based on GPU CUDA Parallel Processing

— The rapid development of computer technology has had a significant influence on advances in medical science. This development concerns segmenting medical images that can be used to help doctors diagnose patient diseases. The boundary between objects contained in an image is captured using the level set function. The equation of the level set function is solved numerically by combining the Lattice Boltzmann (LBM) method and fuzzy clustering. Parallel processing using a graphical processing unit (GPU) accelerates the execution of the segmentation process. The results showed that image segmentation with a relatively large size could be done quickly. The use of parallel programming with the GPU can accelerate up to 39.22 times compared to the speed of serial programming with the CPU. In addition, the comparisons with other research and benchmark data show consistent results.


Introduction
Image segmentation has a vital role in medical imaging. Image segmentation is the process of separating the objects contained in an image. Thus, this segmentation can be used to study the body's anatomy [1] [2], volume estimation [3] [4], tumors identify [5] [6], cancer [7], or the eyes and surrounding organs [8]. The obstacles faced in image segmentation are artifacts such as noise, intensity inhomogeneity, low contrast, or nonsharp edges in the image. The main difficulties in the study of images are mainly image segmentation [9].
A vast number of methods have been proposed in the field of medical image segmentation. A well-established class of ways is the level set method (LSM) proposed by Osher and Sethian (1988) and Choudhry and R. Kapoor (2018) to solve surface evolution problems [10] [11]. Medical image processing algorithms are generally sensitive to noise and consist of low informative regions [12]. The Level set is robust and accurate, and the segmentation process does not need prior initialization knowledge. This method involves a spatial function that can be expressed as a contour where the absolute value of the normal vector perpendicular to the contour is always one. These contours can dynamically change shape following a pixel gradient, so they can be used to find boundaries and enclose areas of an object in the image. The model equation of the level set function is a partial differential equation (PDE). This equation is solved using a numerical method called the Finite Difference method. This method's shortcoming is the high computational load, and the contour is deteriorated due to the numerical dissipation effect. Hence, the level set is no longer a signed distance function. For long time calculations, the calculation accuracy will be significantly reduced, Reinitialization is used in the LSM to remedy the interface's numerical deterioration. This costly reinitialization allows us to keep the norm of the gradient close to unity. Several researchers have attempted to improve the performance of the set-level method without reinitialization. Li et al. at 2005 proposed a variational formulation to force the set level function to be close to a signed distance function. Then Li, in 2010, continues by developing the distance regularized level set evolution (DRLSE) method. Li et al. also showed that their proposed plans have outstanding performances [13]. Zhang et al.
(2013) added a diffusion equation to the Level Set equation to avoid the reinitialization process [14]. The calculation is divided into two steps; the first step calculates the set level method and is continued by solving the diffusion term. Diffusion terms can increase the stability of numerical calculations. The calculation results show that the proposed way is more straightforward, has better performance, and is more effective than the conventional LSM.
Generally, fuzzy clustering algorithms merge different clustering parameters for a cluster head (CH) election [15]. Li et al. (2016) discussed fuzzy multilevel image thresholding based on modified discrete grey wolf optimizer and local information aggregation [16] [4]. Fuzzy clustering is used to initialize the LSM so that the LSM requires fewer iteration. Khosravanian, Asieh et al. (2020) also developed similar medical image segmentation methods [6].
The use of the finite difference method for the LSM's discretization increases the computational load. The Lattice Boltzmann method (LBM) is proposed to reduce the LSM's numerical solution's computational load. The LBM is a mesoscopic method developed based on statistical mechanics theory and gas-particle kinetic theory. The LBM simulates the motion and collision of gas particles. Using the Chapman Enskog transformation, a partial differential equation can be obtained, including the level set function. This method's advantage is that implementing a computer program is relatively simple, and the execution time is shorter [17]. Balla-Arabé and Gao (2012) proposed image multi-thresholding by combining the LBM and a localized level set algorithm [18]. Then in 2013, they developed an image multi-thresholding method for the localized LSM, which is solved using the LBM. The results show that this approach is faster and accurate [19]. Wen et al. (2014) presented the LBM based on image area integration for the medical image segmentation process. This method can be used to handle images with the inhomogeneous property [20]. Chen et al. (2014) developed the Lattice Boltzmann Geodesic Active Contour Method (LBGM) for the segmentation of medical images of giant intracranial aneurysms (GIA) [21]. Wang (2018) proposed a hybrid LSM using the LBM and Sparse Field Constraint. The accuracy, efficiency, and robustness of the segmentation process can be improved using the proposed method [13] [22].
On the other hand, parallel processing of image segmentation will perform multiple image segmentation in parallel, saving a lot of time for the user [23]. Recently parallel processing based on graphics processing units (GPUs) is used today in various applications, including medical image processing. Because they are cheap, efficient, and they can dramatically accelerate the numerical computing process. Roels et al. show achieving speedup factors of more than 50x GPU compared to single-threaded CPU execution are not uncommon due to parallel processing [24]. Chen et al. used a GPUbased parallel LBM for raw image contour detection screening. Execution time can be shortened by 25x, and contours can still be detected even if the image is contaminated with noise [21]. Kumar et al. proposed a joint LSM and LBM-based approach for histogram thresholding in image segmentation [25]. Kalaiselvi et al. presented medical image processing algorithms implemented in CUDA running on GPU-based Machines [26], and Huang and Li use CUDA to accelerate in parallel computation [27].
Balla-Arabé and Gao (2013) developed the LBM combined fuzzy clustering as the external force function to solve the LSM. Using this method, segmentation of image contained intensity inhomogeneity, and blurred object boundaries can be handled easily. It is also shown that the technique needs fewer iteration and is faster compared with a well-known method, such as the CV method and Gibou-Fedkiw method [19]. Based on the paper, we propose combining the Lattice Boltzmann method and fuzzy clustering based on GPU Compute Unified Device Architecture (CUDA) Parallel Processing for medical image segmentation.

2
Materials and methods

Hardware and software
The hardware used in this research is 1 unit PC. The specifications: CPU consists of 6 Processor Intel I7 @ 3.70 GHz, GPU Nvidia GeForce GTX 1080 Ti, and RAM 32 GB DDR4, and the specifications of the software are: Windows 10 operating system, C++ Ms. Visual Studio 2012, and CUDA Toolkit Version 9.

GPU CUDA
GPU CUDA uses the kernel to execute a command n-times parallel on the GPU device from a programming perspective. An n-thread number handles this execution. The kernel is a subprogram in C language implemented on the GPU device and called by the host (CPU). The sequence interaction between the host (CPU) and device (GPU) is illustrated in Figure 1. The host launches the kernel and the thread asynchronously to the GPU and waits for parallel execution in the GPU to finish before tackling the next kernel. GPU CUDA executes a kernel using many threads, and each thread runs a specific statement. The threads are organized into blocks and grids. The maximum number of threads per block is 1024. Threads in a block can be arranged according to a 1,2, or 3dimensional hierarchy. Each grid consists of some blocks, arranged according to a hierarchy of 1, 2, or 3-dimensions. The number of blocks should be a multiple of 32. Figure 2 shows the organization of threads, blocks, and grids, which are indexed using 2-dimensions. The mathematical model for the evolution of level set change is written as: Hagan and Zou's (2009) external force function accommodates the speed function and the curve curvature change process [28]. In Balla-Arabé and Gao (2013), the part accommodates pixel grouping using the fuzzy clustering method [19].

b) Lattice Boltzmann method
The Lattice Boltzmann (LBM) method was developed for solving fluid flow problems but later developed for image processing, which uses partial differential equations (PDE) as the mathematical model [13][17] [22]. For image processing, the pixel position is analogous to the lattice, where several variables are called the particle distribution function. This study uses the D2Q5 scheme [29], which has five kinds of particle distribution in each lattice ( ) 4 ,..., 0 , = i f i . The five particle distribution functions move at the following speed and direction i e as described below: The particle distribution function's speed and direction for each lattice for the D2Q5 scheme are described in Figure 3. where x, t dan  are the lattice spacing, the time step, and the relaxation time, while eq i f the particle distribution function is in a state of balance (equilibrium) and is calculated by: The relaxation time () is related to the diffusion coefficient by: The value of the level set in each lattice is the sum of the five particle distribution functions: The calculation of equation (3) consists of 2 steps, the first step is the collision, and the second step is streaming: f is the post-collision particle distribution function. The streaming step is illustrated in Figure 4.

c) Fuzzy clustering
In this study, the cluster of the segmentation consists of 2 classes only. The calculation of the external force function that accommodates Fuzzy clustering is written as follows: The formula calculates the class centroid: Image intensity bias calculations are as follows: iJOE -Vol. 17

Proposed method
The proposed method is a combination of LBM and fuzzy clustering based on GPU CUDA parallel processing with an algorithm is as follows: a) The programming is based on the Matlab program by Balla-Arabé and Gao [19]. b) Read the image and convert it into RGB, initialize the level set using the circle signed distance function. c) Initialize the equilibrium particle distribution functions using equation (4). d) Start the iterative calculation e) The iteration calculation process starts: i. Compute Partition Matrix (Fuzzy membership) using equation (9).
ii. Compute class centroids using equation (10). f) Compute Bias image intensity using equation (11). g) Compute external force function equation (8). h) Compute the equilibrium condition (equation 4) and collision step (equation 7a). i) Compute streaming step using equation (7b). j) Compute the level set using equation (6). k) If the segmentation is not done, then go back to step 4.
In this study, the authors use medical images because the authors want to help the medical world accelerate segmentation results and the previous method where previous researchers used non-medical images to assess segmentation results.
In this study, the authors use GPU and CPU. The most significant difference between GPU and CPU is in the architecture and how they work. The GPU architecture consists of many threads so that they can work in parallel. In contrast, the architecture on the CPU has many cores or fewer cores than the GPU how the CPU works in serial. For this reason, even though GPU and CPU programming are used simultaneously, programming with GPU is in parallel. When programming CPU serially. So there is a difference in coding at the time of the computing process.

Results and discussion
Performance study was conducted using various resolutions of the medical images, i.e., 512×512, 1024×1024, 1536×1536, 2048×2048, 2560×2560 dan 4096×4096 pixels. All calculations using a two-dimensional block, in which the number of threads per block is 16×16. The GPU used the global memory for all calculations, except for the class centroids calculation, which used shared memory. The performance study consists of 3 parts. First, it compares the contour of the segmentation results between the CPU serial programming and the GPU parallel programming segmentation contour. The sec-ond part compares the elapsed time of segmentation processes between CPU serial programming and GPU parallel programming. While the third part compares the results of the segmentation process using parallel GPUs with the results of the segmentation process by Arbeláez (2006) [32], Balla-Arabé and Gao (2013) [19], and segmentation by a human. Arbeláez and human segmentation are taken from the BSDS300 dataset. The segmentation process is carried out using the Lattice Boltzmann and Fuzzy Clustering methods on the CPU and GPU. The segmentation results produce the same segmentation results, but the execution time on the CPU and GPU is different. To assess whether the researcher's segmentation process is correct, the researcher compares the segmentation results with the previous methods. Figures 5 -10 show that the image segmentation contours with the serial CPU and parallel GPU programs for various resolutions produced in this study resemble the image segmentation contours generated by the comparison data mentioned above. From these figures, it is clear that the segmentation results between the two programming methods show no difference. The result fits to compare the contours of the segmentation results of this study with the benchmark data is to evaluate the level of similarity of the segmentation results obtained in this study with previous studies.       Table 1 shows the elapsed time for the serial and parallel segmentation processes. Based on this table, it can be seen that the use of parallel programming with GPU CUDA can increase the speed significantly, being able to be 39.22 times faster than serial programming with a CPU. The results of this study are the results of Kalaiselvi et al. [26] and Chang et al. [33]. Improving image resolution also increases the execution time, but the speedup also increases. The increase in speedup experienced a slight slowdown for the image resolution of 2560 × 2560, but it went up again after that. The growth is because the GPU works optimally if the image resolution is a power of 2. This section compares the contour of the segmentation results obtained in this study with the results of previous methods, namely: the method proposed by Arbelaez (2006) [32], Balla-Arabé, and Gao (2013) [19], and the contour segmentation performed by humans. The image used in this study follows the results of segmentation by humans taken from the Berkeley segmentation data set BSDS300. The original images (481 × 321 and 321 × 481) were then converted to 256 × 256. The comparison results obtained are shown in Table 2. The comparison indicates that the image segmentation contour produced in this study resembles the image segmentation contour produced by the comparative data mentioned above.
These results agree with Bobbia, S. et al. (2021) with their paper entitled Iterative Boundaries Implicit Identification for Superpixel Segmentation: Real-Time Approach.
The author proposes an optimal iteration method to reduce the computation time further. This optimization allowed the authors to accelerate segmentation by a factor of 4 on the Jetson TX1 embedded GPU platform [34]. Balla-Arabé and Gao (2013) [19] Proposed method Humans (BSDS300 data set)

Conclusion
The conclusions that can be drawn from this research are: 1. Combination with the Fuzzy clustering method can reduce the number of iterations in the calculation of the segmentation process 2. GPU CUDA-based parallel programming can significantly speed up the calculation time compared to serial programming with the CPU. In this research, parallel programming with GPU CUDA can speed up 39.22 times. 3. The segmentation process for images with a size of more than 2000 × 2000 pixels can be handled easily. 4. The comparison of the results of the segmentation process with benchmark data has good similarities.