A CAD System for the Early Detection of Lung Nodules Using Computed Tomography Scan Images

In this paper, a computer-aided detection (CAD) system is developed to detect lung nodules at an early stage using Computed Tomography (CT) scan images; lung nodules are one of the most important indicators for predicting lung cancer. The developed system consists of four stages. First, the raw CT lung images are pre-processed to enhance the image contrast and eliminate noise. Second, the human lungs and the pulmonary nodule candidates (nodules and blood vessels) are segmented automatically using a two-level thresholding technique and morphological operations. Third, a feature fusion technique combines four feature extraction techniques to extract the main features: first- and second-order statistical features, value histogram features, histogram of oriented gradients features, and gray level co-occurrence matrix texture features based on wavelet coefficients. The fourth stage is the classifier. Three classifiers were used and their performance was compared in order to obtain the highest classification accuracy: a multi-layer feed-forward neural network, a radial basis function neural network, and a support vector machine. The performance of the proposed system was assessed using three quantitative parameters: the classification accuracy rate, the sensitivity, and the specificity. Forty standard CT scan images containing 320 regions of interest, obtained from the Early Lung Cancer Action Project association, were used to test and evaluate the developed system. The results show that the fused feature vector, after feature selection with a genetic algorithm, combined with the support vector machine classifier gives the highest classification accuracy rate.


Introduction
Lung cancer has become one of the diseases that pose the greatest threat to humanity because of high rates of air pollution, the spread of smoking in recent years, and the difficulty of treatment. Developing methods for the early detection of this disease has become a concern of scientists in medical fields [1].
Early detection of lung cancer increases the patient's chance of surviving for up to 5 years by up to 70% and improves the chance of successful treatment when the disease is diagnosed in its early stages. This has increased the importance of developing early detection systems [1].
One of the most accurate techniques used in the diagnosis of lung cancer is Computerized Tomography (CT) of the patient's chest, because it images the lung in many sections and so produces a large number of images, enabling radiologists and physicians to examine all parts of the lung [1]. However, this large number of images, together with the use of low radiation doses to protect the patient from the risk of exposure to large amounts of radiation, makes the examination of these images by a radiologist a difficult and onerous task [1]. This motivated scientists to develop computerized systems that process and analyze these images and automatically determine the presence of pulmonary nodules. These systems are known as Computer-Aided Detection (CAD) systems [2].
In general, any CAD system for the automatic detection of pulmonary nodules is composed of four stages: a preprocessing stage for image contrast enhancement and noise reduction, an automatic segmentation stage that extracts the human lung area and the nodules, a feature extraction stage applied to the pulmonary nodule candidates, and a final classification stage [2]. Figure 1 illustrates the main stages of a CAD system. The accurate extraction of the lungs from the CT chest images is an essential step in CAD systems. Previously reported lung segmentation techniques are based on intensity variation (thresholding methods), on image regions (region merging, region splitting, and region growing techniques), or on object texture, motion tracking, and edge detection [2]. In the present work, a novel Image Size Dependent Normalization Technique (ISDNT) was adopted.
According to the clinical opinions of physicians, blood vessels and pulmonary nodules appear in the CT scan image with lower contrast values and higher gray values [2]. Several attempts at nodule extraction have been reported. Thresholding techniques [2], in which the extraction of the nodule candidates is based on the intensity variation between the lung parenchyma and the nodule candidates, were utilized.
For feature extraction, four types of feature extraction techniques [3,4] were utilized: the Histogram of Oriented Gradients (HOG) features, the statistical features, the texture features of the Gray Level Co-Occurrence Matrix (GLCM) based on wavelet coefficients, and the Value Histogram (VH) features.
Several classification approaches have been proposed, such as Artificial Neural Networks (ANN), the linear discriminant analysis classifier, rule-based classifiers, the Bayesian classifier, the support vector machine (SVM), and k-NN [5]. In the present work, ANN, SVM, and Radial Basis Function Neural Network (RBF-NN) classifiers were used.
To increase the accuracy of the classification, a fusion technique was used. Fusion can be performed at three different levels: data fusion at the level of data, feature fusion at the level of features, and decision fusion at the decision level [6]. A fusion step at the feature level has been adopted.
The performance of the developed system is compared with that of previously reported classifiers: the ANN, RBF-NN, and SVM classifiers [5]. The paper is organized as follows: Section 2 describes the dataset used. Section 3 describes the different stages of the proposed system. Section 4 discusses the results, and Section 5 presents the conclusion.

The Dataset
Forty CT scans containing 320 regions of interest (ROI) were made available by the Early Lung Cancer Action Project (ELCAP) association [7]. The images in this database are available in Digital Imaging and Communications in Medicine (DICOM) format and have a resolution of 0.76 × 0.76 × 1.25. The size of the pulmonary nodules considered in this work varies from 3 mm to 30 mm. Figure 2 shows a typical example of the chest CT images.

Methodology
The developed CAD system consists of the following four stages:

Image pre-processing
Physicians use low radiation doses during the CT scan to protect the patient from the risk of exposure to large amounts of radiation, but this leads to low-resolution images. In addition, the CT acquisition process itself exposes the images to noise from different sources, which reduces the image quality. Preprocessing was accomplished in two steps: enhancing the image contrast and denoising the CT chest image.
Image contrast enhancement: Enhancing the contrast of CT scan images increases the accuracy of nodule detection. Hence, three image contrast enhancement techniques were compared: histogram equalization, adaptive histogram equalization, and a novel Image Size Dependent Normalization Technique (ISDNT) [8]. Visual comparison of the contrast-enhanced images showed that the ISDNT gives the best results. Figure 3 shows the CT image before and after contrast enhancement using the ISDNT.
Image denoising: A comparison of denoising filters, including a wavelet filter, concluded that the Wiener filter gives the best results [9]. Figure 4 shows an example of a denoised image using the Wiener filter.

Lung segmentation
Having preprocessed the CT chest images, the next step is to extract the human lung area from them. The proposed lung segmentation algorithm [10] consists of three main steps: calculating an optimal threshold, segmenting the thorax from the background, and finally segmenting the lungs.
To calculate the optimal gray-level threshold, a diagonal gray-level histogram was constructed from the intensities of the diagonal pixels of all CT chest images of a complete scan. The resulting histogram was found to have three clear peaks (Figure 5), a feature common to all CT chest scan images. Peak P1 is formed by the black background pixels; peak P2 is formed by the low-intensity pixels of the external region that surrounds the thorax area and of the internal lung parenchyma; and peak P3 is formed by the high-intensity pixels that represent blood vessels, the bones of the rib cage, the heart, and pulmonary nodules. Accordingly, the gray level that lies midway between the second and third peaks was used as the optimal gray-level threshold in the present automatic segmentation work. The optimal gray-level threshold is calculated according to the following equation:
$$T = \frac{g_{P2} + g_{P3}}{2} \quad (1)$$
where $g_{P2}$ and $g_{P3}$ denote the gray levels at which peaks P2 and P3 occur.
Segment the thorax from background: The thorax extraction removes all image components external to the chest area. First, the bi-level thresholding technique was applied to obtain a binary image. Then a morphological operation and a median filter of size 15 × 15 were applied to obtain the thorax binary mask, which was multiplied with the preprocessed image to obtain the segmented thorax area. Figure 6 shows the steps in detail.
Segment the lungs within the thorax: The goal of this step is to separate the human lung area from the thoracic area. To extract the lung area, the bi-level thresholding technique was applied to obtain a binary image. Then morphological operations and a median filter were used to obtain the lung binary mask. This mask was multiplied with the thorax image to obtain the segmented lung area. Figure 7 shows the resulting images.
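The threshold rule described above can be illustrated with a minimal sketch. It assumes that "diagonal pixels" means the main diagonal of each slice and that the positions of peaks P2 and P3 have already been located in the histogram; the function names and the toy image are hypothetical and are not taken from the original implementation.

```python
from collections import Counter


def diagonal_histogram(slices):
    """Count gray levels along the main diagonal of every slice in a scan."""
    hist = Counter()
    for img in slices:
        n = min(len(img), len(img[0]))
        for i in range(n):
            hist[img[i][i]] += 1
    return hist


def midpoint_threshold(p2, p3):
    """Gray level halfway between the second and third histogram peaks."""
    return (p2 + p3) / 2


def binarize(img, threshold):
    """Bi-level thresholding: 1 where the pixel exceeds the threshold, else 0."""
    return [[1 if px > threshold else 0 for px in row] for row in img]


# Toy 4x4 "slice": background (0), parenchyma-like (60), dense tissue (200).
toy = [
    [0, 0, 60, 60],
    [0, 60, 200, 60],
    [60, 200, 200, 60],
    [60, 60, 60, 200],
]
hist = diagonal_histogram([toy])
threshold = midpoint_threshold(60, 200)  # peak positions assumed known here
mask = binarize(toy, threshold)
```

In a full pipeline the peak positions would be read from the histogram of a complete scan rather than passed in by hand.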

Extraction of nodule candidates
The objective of this step is to extract the regions of interest (ROIs) that contain the nodule candidates, using the bi-level thresholding technique and a median filter of size 5 × 5. The resulting image was multiplied with the gray-level lung image to obtain the ROIs in the CT chest images. The performance of the proposed framework was evaluated by comparing the resulting area with the areas obtained by three other techniques: Otsu thresholding, local entropy-based transition region extraction and thresholding, and basic global thresholding [11]. The region non-uniformity criterion [11] was used to compare the performance of the four thresholding methods.
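The candidate extraction step (threshold, median-filter the binary image, multiply with the gray-level lung image) might be sketched as follows. This is an illustration under stated assumptions: edges are handled by replicating border pixels, the window defaults to 3 × 3 so that the toy image stays small (the paper uses 5 × 5), and all names and values are hypothetical.

```python
from statistics import median


def median_filter(img, size=3):
    """Median filter with replicated borders; `size` must be odd."""
    h, w = len(img), len(img[0])
    r = size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            out[y][x] = median(window)
    return out


def extract_candidates(lung, threshold, size=3):
    """Threshold, clean the binary mask, then keep gray levels under the mask."""
    binary = [[1 if px > threshold else 0 for px in row] for row in lung]
    mask = median_filter(binary, size)
    return [[px * m for px, m in zip(row, mrow)] for row, mrow in zip(lung, mask)]


# Toy 7x7 lung image: a 3x3 bright blob plus one isolated bright pixel.
lung = [[10] * 7 for _ in range(7)]
for y in range(3, 6):
    for x in range(3, 6):
        lung[y][x] = 180
lung[1][1] = 200  # isolated pixel, removed by the median filter

rois = extract_candidates(lung, threshold=100)
```

The isolated pixel is suppressed while the core of the blob survives; the median filter also rounds off the blob's corners, which is the usual behaviour of this filter on binary masks.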
Region non-uniformity: The region non-uniformity (NU) is defined as:
$$NU = \frac{|F_T|}{|F_T| + |B_T|} \cdot \frac{\sigma_f^2}{\sigma^2} \quad (2)$$
where $\sigma^2$ is the variance of the whole image, $\sigma_f^2$ is the foreground variance, and $B_T$ and $F_T$ are the background and foreground area pixels in the segmented image [11]. According to this measure, the thresholding technique that yields the segmented image with the smallest NU value is the best. The NU measures calculated for each segmented image and for all applied histogram thresholding techniques are shown in Table 1. Visual comparison of the images in Figure 8, together with the results in Table 1, shows that the developed system detects the pulmonary nodule candidates with the highest accuracy.
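As a worked illustration of the NU measure, the sketch below computes the foreground area fraction multiplied by the ratio of foreground variance to whole-image variance. The function name, the toy image, and the two masks are hypothetical; population variance is assumed.

```python
from statistics import pvariance


def region_nonuniformity(img, mask):
    """NU = (|F| / (|F| + |B|)) * (foreground variance / whole-image variance)."""
    pixels = [px for row in img for px in row]
    foreground = [
        px for row, mrow in zip(img, mask) for px, m in zip(row, mrow) if m
    ]
    total_var = pvariance(pixels)
    if not foreground or total_var == 0:
        raise ValueError("NU is undefined for an empty foreground or a flat image")
    return (len(foreground) / len(pixels)) * (pvariance(foreground) / total_var)


img = [
    [10, 10, 200],
    [10, 200, 200],
    [10, 10, 10],
]
# A mask that covers exactly the bright region: uniform foreground.
good_mask = [[0, 0, 1], [0, 1, 1], [0, 0, 0]]
# A mask that also swallows one dark pixel: non-uniform foreground.
poor_mask = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]

nu_good = region_nonuniformity(img, good_mask)
nu_poor = region_nonuniformity(img, poor_mask)
```

A perfectly uniform foreground gives NU = 0, and the value grows as dissimilar pixels leak into the foreground, which is why the smallest NU indicates the best segmentation.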

Feature extraction
The feature extraction process aims to extract a set of features that represent the information used in the analysis and classification process. Its goal is to achieve significant data reduction and to determine informative measures. In the present work, four different feature extraction techniques were used: the first- and second-order statistical features [12], the Histogram of Oriented Gradients (HOG) features [13], the Value Histogram (VH) features [12], and the texture features of the Gray Level Co-Occurrence Matrix (GLCM) based on wavelet coefficients [13].
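To make one of these feature families concrete, the following sketch builds a gray level co-occurrence matrix and derives three common texture descriptors from it. It is only an illustration: it operates on a small quantized image rather than on wavelet coefficients as in the present work, it uses a single horizontal offset, and the homogeneity definition shown is one of several in use. All names are hypothetical.

```python
def glcm(img, levels, dy=0, dx=1):
    """Normalized co-occurrence matrix for pixel pairs separated by (dy, dx)."""
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(img), len(img[0])
    total = 0
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                counts[img[y][x]][img[y2][x2]] += 1
                total += 1
    return [[c / total for c in row] for row in counts]


def glcm_features(p):
    """Contrast, energy, and homogeneity of a normalized GLCM."""
    n = len(p)
    cells = [(i, j) for i in range(n) for j in range(n)]
    return {
        "contrast": sum(p[i][j] * (i - j) ** 2 for i, j in cells),
        "energy": sum(p[i][j] ** 2 for i, j in cells),
        "homogeneity": sum(p[i][j] / (1 + abs(i - j)) for i, j in cells),
    }


# Toy image quantized to four gray levels (0..3).
toy = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(toy, levels=4)
features = glcm_features(p)
```

For this toy image there are 12 horizontal pixel pairs, and the contrast evaluates to 7/12.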

Feature fusion
In the feature fusion process, a new set of features is created from different sets of features obtained from different domains, after the insignificant and redundant features have been removed. Accordingly, the four different feature vectors were fused into a new hybrid feature vector using a simple concatenation procedure.
Having formed the hybrid feature vector, the next step is to remove any redundant and correlated information, a step known as "feature selection". In the present work, the genetic algorithm (GA) [20] was applied to the hybrid feature vector as a feature selection technique. The performance of each individual feature vector and of the new hybrid feature vector was then compared.
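The fusion-then-selection idea can be sketched as below. This is a generic binary-mask genetic algorithm with tournament selection, one-point crossover, bit-flip mutation, and elitism; it is not the configuration used in the present work, whose GA parameters and fitness function are not specified here. The toy fitness function, the feature values, and all names are assumptions made purely for illustration; in practice the fitness would typically be a classifier's validation accuracy.

```python
import random


def fuse(*vectors):
    """Feature-level fusion by simple concatenation."""
    return [v for vec in vectors for v in vec]


def tournament(pop, fitness, rng):
    """Return the fitter of two randomly drawn individuals."""
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b


def ga_select(n_features, fitness, pop_size=20, generations=30,
              mutation_rate=0.05, seed=0):
    """Return the best binary mask found (1 = keep the feature)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            p1 = tournament(pop, fitness, rng)
            p2 = tournament(pop, fitness, rng)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best


# Four hypothetical feature vectors fused into one hybrid vector.
hybrid = fuse([0.4, 1.2], [3.0, 0.1], [7.5, 2.2], [0.9, 5.0])

informative = {0, 3, 5}  # assumption: pretend only these features help


def toy_fitness(mask):
    """Reward informative features, penalize vector length."""
    return sum(mask[i] for i in informative) - 0.1 * sum(mask)


mask = ga_select(len(hybrid), toy_fitness)
selected = [v for v, keep in zip(hybrid, mask) if keep]
```

The length penalty in the toy fitness mirrors the goal stated above: keep the discriminative features while shrinking the hybrid vector.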

Nodules detection
The final stage classifies the extracted nodule candidates into nodules and non-nodules (blood vessels). Three classifiers were utilized and their performance was compared: the Support Vector Machine (SVM) [15], the Multi-Layer Feed-Forward Neural Network (ANN) [15], and the Radial Basis Function Neural Network (RBF-NN) [14]. The classifiers were trained, and their performance was compared using the classification accuracy rate (CAR), sensitivity (S), and specificity (SP) measures [20]. For each classifier, 25% of the available dataset was used for training and the remaining 75% was used for testing.
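The three performance measures and the 25%/75% split can be illustrated with a short sketch. It uses the standard definitions (CAR = (TP + TN) / total, S = TP / (TP + FN), SP = TN / (TN + FP)) and assumes that the split is applied to the 320 ROIs with random shuffling; the function names and toy labels are hypothetical.

```python
import random


def evaluate(y_true, y_pred):
    """Return CAR, sensitivity (S), and specificity (SP); label 1 = nodule."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "CAR": (tp + tn) / len(pairs),
        "S": tp / (tp + fn),
        "SP": tn / (tn + fp),
    }


def split(samples, train_fraction=0.25, seed=0):
    """Shuffle and split samples into training and testing subsets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy labels: 4 nodules and 6 non-nodules, with one error of each kind.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
scores = evaluate(y_true, y_pred)

train, test = split(range(320))  # 320 ROIs -> 80 for training, 240 for testing
```

With these toy labels the sketch gives CAR = 0.8, S = 0.75, and SP = 5/6.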

Experimental Results
The classification accuracy rate (CAR), sensitivity (S), and specificity (SP) were calculated for each classifier using the four types of features and the hybrid feature vector. The results are given in Tables 2-4. Table 5 gives the number of features before and after selection using the GA, together with the CAR, S, and SP corresponding to each classifier. As is clear from Table 5, applying the GA feature selection technique increased the CAR, S, and SP in addition to reducing the feature vector size. This also led to a reduction in the computational time. With the RBF-NN classifier, the number of features decreased significantly, but the CAR and SP are lower than those of the other two classifiers. The ANN and SVM reach equal values of CAR, but the SVM does so with fewer features than the ANN.
Table 5. The classification accuracy rate (CAR), the sensitivity (S), and the specificity (SP) of the three classifiers, and the number of hybrid features before and after applying the genetic algorithm (GA) technique.

Conclusion
In the present work, a Computer-Aided Detection (CAD) system for the early detection of lung nodules in CT scan images has been developed. The proposed system consists of four main stages: an image preprocessing stage to enhance the contrast of the CT images, an automatic segmentation stage to extract the human lungs and the nodule candidates, a feature extraction and selection stage, and a classification stage to classify the detected nodules.
Forty CT scans with 320 regions of interest (ROI) were made available by the Early Lung Cancer Action Project (ELCAP) association to train and test the classifiers. The size of the pulmonary nodules considered varies from 3 mm to 30 mm.
In the image preprocessing stage, an Image Size Dependent Normalization Technique (ISDNT) was utilized to enhance the CT image contrast, and a Wiener filter was used to improve the CT image quality.
For the automatic segmentation stage, the bi-level thresholding technique was applied to the preprocessed CT images, and a median filter and mathematical morphological operations were utilized to suppress any unwanted pixels.
In the third stage, four feature extraction techniques were utilized: the first- and second-order statistical features, the Value Histogram (VH) features, the Histogram of Oriented Gradients (HOG) features, and the texture features of the Gray Level Co-Occurrence Matrix (GLCM) based on wavelet coefficients. A feature fusion step was applied to the four different sets of extracted features to produce the hybrid feature vector. The five feature vectors were then used as the input to three types of classifiers, and their performance was evaluated. The classifiers are the Artificial Neural Network (ANN), the Radial Basis Function Neural Network (RBF-NN), and the Support Vector Machine (SVM). Each classifier was trained using 25% of the dataset and tested using the remaining 75%.
The Classification Accuracy Rate (CAR), the Sensitivity (S), and the Specificity (SP) were calculated for each classifier using each of the five feature vectors. Comparing the CAR, S, and SP obtained from each classifier showed that the hybrid features gave the highest values. This leads to the conclusion that the feature fusion technique increases the detection accuracy of pulmonary nodules and improves the system performance. An attempt was then made to increase the classification accuracy, enhance the system performance, and reduce the computational time by applying the Genetic Algorithm (GA) as a feature selection algorithm to the hybrid feature vector. The CAR, S, and SP increased for all three trained classifiers (ANN, RBF-NN, and SVM). The CAR reached 99.6%, 99.2%, and 99.6% for the three classifiers, respectively. Based on these results, it can be concluded that applying the GA as a feature selection technique to the hybrid feature vector increases the classification performance of the system significantly.
In conclusion, the SVM classifier gives the highest CAR, S, and SP values: 99.6%, 100%, and 99.2%, respectively. Table 6 compares the performance of the suggested system with that of five systems reported in previously published research. The comparison shows that the suggested system achieves the best classification rate and the lowest number of false positives.
Much work is still needed to discriminate between benign and malignant lung nodules. This is the aim of the next stage of the work.