Lung Cancer Diagnosis and Treatment Using AI and Mobile Applications

Cancer has become increasingly common in the modern world; technological advancement and increased exposure to radiation have contributed to making it a common disease. Many types of cancer exist, including skin, breast, prostate, blood, colorectal, kidney and lung cancer. Among these, the mortality rate is highest in lung cancer, which is difficult to diagnose and is often detected only in advanced stages. Lung cancer has two types, small cell and non-small cell, of which non-small cell lung cancer (NSCLC) is the most common, making up 80 to 85 percent of all cases [1]. Advances in digital image processing and artificial intelligence have contributed greatly to medical image analysis and Computer Aided Diagnosis (CAD). Much research has been carried out in this field to improve the detection and prediction of cancerous tissue. Current methods apply traditional image processing techniques for noise removal and feature extraction, and a few good approaches apply artificial intelligence and produce better results. However, no research has achieved 100% accuracy in nodule detection or early detection of cancerous nodules, nor sufficiently fast processing methods; the application of artificial intelligence techniques such as machine learning and deep learning remains minimal and limited. In this paper [Figure 1], we apply artificial intelligence techniques to process CT (Computed Tomography) scan images for data collection and data model training. The DICOM image data is saved as a numpy file, with all medical information extracted from the files for training. With the trained data we apply deep learning for noise removal and feature extraction. This approach can process a huge volume of medical images for data collection, image processing, and detection and prediction of nodules.
Patients are made well aware of the disease and enabled to track their health using various mobile applications available in the online stores for iOS and Android devices.

Keywords—Lung cancer, CAD, artificial intelligence, k-means, CNN, CT scan.


Introduction
system for processing. Originally, medical image analysis was done by low-level pixel processing applied sequentially, i.e. using line and edge detecting filters, region growing, and mathematical modeling (fitting lines, circles and ellipses) to construct compound rule-based systems that solve a particular task. This has an analogy to expert systems, where chains of if-then-else rules are used, a well-known approach in AI (Artificial Intelligence) systems. From the early 2000s, techniques such as supervised learning, which involve training on data, were used to develop systems, and these have become increasingly popular in the field of medical image analysis: for example, active shape models for segmentation, atlas-based methods where atlases are fitted to the training data, and feature extraction with statistical classifiers in CAD, i.e. computer aided detection and diagnosis. In commercial medical image analysis, pattern recognition and machine learning remain dominant and form the basis of the medical image diagnosis systems available on the market. These changes have resulted in a big shift from handcrafted systems to systems trained by computers on data from which feature vectors are extracted. Computer algorithms determine the optimal decision boundary in a high-dimensional feature space, in which the most crucial step is extracting discriminant features from the images. However, for such systems the features are still handcrafted with human intervention [3]. Computers have since developed the ability to learn the features that optimally represent the data for a given problem statement.
This is achieved through deep learning, using algorithms and data models composed of many layers in a network whose input is typically the image and whose output is the result of computer aided diagnosis of a disease or symptom, i.e. the presence, absence or prediction of a disease and its symptoms. Among the available deep learning algorithms, the Convolutional Neural Network (CNN) is the most successful at training on data and predicting results. A CNN uses convolution filters to refine the data and produce more accurate results. Work on CNNs has been carried out since Fukushima's neocognitron, and they were applied to medical image analysis as early as 1995 by Lo et al. [16]. The first successful real-world application of CNNs was the recognition of hand-written digits. Apart from these first successes, CNNs did not gain much ground in their initial years until various new techniques were developed to improve their results, after which they gained momentum and advanced to become core computing systems. Deep convolutional networks have now become the popular choice in the field of computer vision [3].
Various mobile applications available in the stores for Android and iOS devices help with patient communication, education, awareness, important medical terminology, finding nearby health centers, etc. Through mobile apps, doctors can also share CAD (Computer Aided Diagnosis) results and monitor patient improvement and health. Many popular free mobile apps exist on the market, such as CaringBridge, Cancer.Net Mobile, My Cancer Circle, and Find a Health Center, for iOS and Android.

Related Work
In Lung Cancer Detection using CT Scan Images [4], median and Gaussian filters were used instead of a Gabor filter in the image pre-processing stage. The processed image is segmented using watershed segmentation. A data model is trained on the extracted features, which serve as the training features for the model. Unknown cancer nodules are predicted and classified using the trained data sets. This model has a higher accuracy of cancer nodule detection. Detected lung cancer is then classified as malignant or benign, and false detection of nodules is avoided by removing salt-pepper and speckle noise. In this research, the author notes scope for further work on classifying cancerous tissue into the stages (stage I, II, III or IV) with which the patient is affected and for which treatment is required. In research work [5], the experiment is based on Keras. The network parameters used are batch size 32 and 50 epochs, with the dice coefficient index as the similarity metric, calculated as DSC = 2|X ∩ Y| / (|X| + |Y|). This approach can be applied widely to various kinds of medical image segmentation tasks. Limitations: the objective of the next stage is to perform lung nodule segmentation based on the results of this work. In A Comparative Study of Lung Cancer Detection Using Supervised Neural Networks [6], the strength of a signal processing algorithm is determined using root mean square (RMS) to find the relative noise. A Random Forest classifier is used, giving 59.2% sensitivity, 66% efficiency and 52.8% specificity. It is observed that SVM classification gives the best approximation in comparison, with 94.5% accuracy, 74.2% sensitivity, 77.6% specificity and 66.3% recall. The accuracy of the whole system is still low and needs to be improved.
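The dice coefficient above can be sketched in a few lines of numpy; the arrays here are small illustrative masks, not data from the cited work:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice similarity coefficient: DSC = 2|X ∩ Y| / (|X| + |Y|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum() + eps)

# Two toy segmentation masks: 2 overlapping pixels, 3 positives each
a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
print(round(dice_coefficient(a, b), 3))  # → 0.667
```

A DSC of 1 means perfect overlap between predicted and ground-truth segmentation; 0 means no overlap.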
In K-means Cluster Algorithm Based on Color Image Enhancement for Cell Segmentation [7], the paper gives detailed information on using the k-means algorithm to cluster a sample X in a given YCbCr space. Segmentation and morphological processing are used to test cell images in the study, with Swiss dyeing applied; the erythrocytes and leukocytes are segmented. Segmentation must improve to achieve a more refined effect in cytoplasm and nucleus image segmentation. A Study on Lung Cancer Detection by Image Processing [8] attempts to identify lung cancer tissue in its early stages: suspicious lesions are identified from a given CT scan image. However, more research can be done to detect cancerous nodules in their early stages. Evaluation and management of indeterminate nodules is a difficult challenge, leaving room in this area for future work that helps in early detection of malignant nodules and reduces mortality rates.
In the research work to detect lung cancer with the help of DIP and AI [9], the back propagation method is used along with artificial neural networks: 70 images are trained using a back propagation network (BPN) with 6 input neurons, 2 output neurons and 12 hidden layer neurons. This research uses a dataset from The Cancer Imaging Archive [13] [14] [15], a publicly available archive database. The result shows whether the identified tumor is benign or malignant. The design gives 78% accuracy and is useful for lung cancer detection using Computer Aided Detection (CADe) systems; however, 78% accuracy is still low and needs to be improved. The lung tumor segmentation algorithm [10] combines different image processing techniques to segment the original image using threshold values. To remove noise from the images, an erosion method is applied along with a median filter. The input DICOM image is converted into JPEG format by removing information such as saturation and tint while maintaining the image luminance. A median filter is used in this proposed system, which gives a higher accuracy of 97.14%. More research can be done on identifying cancerous tissue at earlier stages and predicting cancer using AI techniques. In our proposed model, the input CT scan DICOM (Digital Imaging and Communications in Medicine) 3D image is scanned, i.e. loaded, to extract the image information. The image is converted into a 2D gray scale image for noise removal, enhancement, segmentation and finally feature extraction. The image is sliced at the voxels, the third dimension; fortunately, the DICOM image carries pixel spacing information that can be used for slicing. In this flow, we apply a deep convolutional neural network to the CT scan image data loaded from the numpy file for training data sets, noise removal and feature extraction.
This model enables quick processing of large volumes of CT scan images and produces output for CAD (Computer Aided Diagnosis) and treatment.

Pre-Processing
In medical image diagnosis, pre-processing is the most vital step for accurate results. Below are the noise types most commonly found in medical images.
• Gaussian Noise: a statistical noise, added in MATLAB with g = imnoise(I, 'gaussian', m, var), where I is the input image, m the mean and var the variance.
• Salt and Pepper Noise: mostly caused by analog-to-digital converter errors, dead pixels, bit errors in transmission, etc. It can be removed using a median filter, DFS or a morphological filter [11].
• Speckle Noise: a granular noise that degrades the quality of the image [11].
• Poisson Noise: an electronic noise that occurs due to energy-carrying particles such as electrons in an electronic circuit or photons [11].
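As a small sketch of the salt-and-pepper case above, a median filter suppresses isolated impulse pixels; the synthetic image and noise density here are illustrative assumptions, not the paper's data:

```python
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(0)
img = np.full((64, 64), 100.0)          # uniform gray test image

# Sprinkle ~2.5% "pepper" (0) and ~2.5% "salt" (255) pixels
noise = rng.random(img.shape)
img[noise < 0.025] = 0.0
img[noise > 0.975] = 255.0

# A 3x3 median filter replaces each pixel with the median of its
# neighborhood, so isolated impulses are voted out
clean = median_filter(img, size=3)
```

Unlike a mean filter, the median never invents intermediate gray values, which is why it is the standard choice for impulse noise.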
In our case, we load 130 CT scan Digital Imaging and Communications in Medicine (DICOM) images from the source described in the dataset section.
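Stacking the loaded slices into one 3D array and persisting it as a numpy file can be sketched as follows; `slices_hu` stands in for the per-slice pixel arrays that a DICOM reader would produce, and the filename is illustrative:

```python
import numpy as np

# Hypothetical: one 2D int16 array per DICOM slice (here zero-filled
# stand-ins with the 512 x 512 geometry used in the paper)
slices_hu = [np.zeros((512, 512), dtype=np.int16) for _ in range(130)]

volume = np.stack(slices_hu)                    # shape: (130, 512, 512)
np.savez_compressed("ct_volume.npz", volume=volume)

loaded = np.load("ct_volume.npz")["volume"]     # round-trip for training
print(loaded.shape)                             # → (130, 512, 512)
```

Saving the whole volume once avoids re-parsing 130 DICOM headers on every training run.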

Hounsfield Unit (HU)
The Hounsfield Unit is widely used by radiologists to interpret CT (Computed Tomography) images as a measure of radiodensity. These units are standard across all computed tomography images irrespective of the absolute number of photons the scanner detector captured [12]. The standard units used across the globe identify the substances in a CT scan DICOM image. The HU plot for the 130 sample images taken for processing is shown below.
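Raw CT pixel values are converted to Hounsfield Units with the rescale slope and intercept stored in the DICOM header; a minimal sketch, with typical tag values assumed:

```python
import numpy as np

def to_hounsfield(pixel_array, slope, intercept):
    """Convert raw CT pixel values to Hounsfield Units using the
    RescaleSlope and RescaleIntercept values from the DICOM header."""
    return pixel_array.astype(np.float64) * slope + intercept

# Typical DICOM values: slope = 1, intercept = -1024
raw = np.array([0, 1024, 2024])
hu = to_hounsfield(raw, 1.0, -1024.0)
print(hu)  # → [-1024.  0.  1000.]  (≈ air, water, dense bone)
```

After this conversion the values are scanner-independent, which is what makes the fixed HU ranges for air, fat, soft tissue and bone usable for thresholding.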

Fig. 3. Histogram plotting of 130 sliced DICOM images
The histogram in the above figure [Figure 3] gives the following details about the plotted DICOM image:
• The image contains a lot of air.
• Some lung tissue is present.
• Many soft tissues such as muscle and liver exist in the image, along with some fat.
• The image contains little or no bone, i.e. substances in the range 700-3000 HU.
• These observations show that significant pre-processing is needed to remove the unwanted substances from the sample image and isolate only the lung region for analysis and diagnosis.

Image Slicing
We slice our 130 sample DICOM images [Figure 4] at the voxels for better visibility and analysis. This usually means taking a plane out of a 3D volumetric image: in a CT or MRI image, you can take a slice along the X, Y or Z dimension, or even a plane at an arbitrary orientation, and obtain a 2D image of the pixel values lying on that plane, i.e. excluding the voxels (pixels in the 3rd dimension).
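Extracting axis-aligned planes from a volume is plain numpy indexing; the small stand-in volume and slice indices below are illustrative:

```python
import numpy as np

# Stand-in volume ordered (z, y, x): 130 axial slices of 4 x 4 pixels
volume = np.arange(130 * 4 * 4).reshape(130, 4, 4)

axial    = volume[64, :, :]   # one plane along Z (a single CT slice)
coronal  = volume[:, 2, :]    # one plane along Y
sagittal = volume[:, :, 2]    # one plane along X
print(axial.shape, coronal.shape, sagittal.shape)
```

Each result is a 2D array, i.e. the third dimension has been fixed to a single voxel index.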

Noise Removal
From the numpy file created, all the CT scan information of the DICOM images is read for processing. A precise threshold value is obtained using k-means centroid clustering with k=2, comparing soft tissue and bone versus lung and air. Erosion and dilation are also used to denoise the image and remove tiny features. Each distinct region is identified with a separate image label by applying a bounding box to each label, which helps distinguish the lung from all other substances. Later we apply masking techniques to isolate the lung region and extract the ROI (Region of Interest) for feature extraction and further study and analysis. We have applied the following morphological operations to the converted 2D gray scale image for noise removal:
• Thresholding: the threshold value is calculated dynamically by adjusting the overflow and underflow pixel values after computing the mean and standard deviation over the image X, Y shape. The mean and standard deviation functions from the numpy module are used to calculate these values.

• k-Means Clustering: we apply k=2 to cluster the pixels and use the cluster centers to derive the threshold, excluding pixels not of interest for study and analysis. The KMeans function from the sklearn module can be used directly to cluster the image pixels and calculate the optimal threshold value for noise removal.
• Morphological operations: used to process images based on shape. A structuring element is applied to the input image, producing an output image of the same shape; each output pixel value is based on a comparison of the corresponding input pixel with its neighbors. In this paper we use the scikit-image library to perform morphological operations.
• Dilation: the reverse process of erosion, with regions growing out from their boundaries. It increases the size of objects and fills holes and broken areas, connecting regions separated by gaps smaller than the structuring element, and it increases the brightness of objects. It follows the distributive, duality, translation and decomposition properties, and is applied first in the closing operation and second in the opening operation.
Dilation formula: A ⊕ B, the Minkowski addition of image A and structuring element B.
• Erosion: a method in which pixels are removed at the edges of a region. It reduces the size of objects, removes small anomalies, and reduces the brightness of bright objects. It removes objects smaller than the structuring element and follows properties such as duality. It is the dual of dilation, applied second in the closing operation and first in the opening operation.
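The k-means thresholding and erosion/dilation steps above can be sketched together; the synthetic two-population slice is an illustrative assumption standing in for a real HU image:

```python
import numpy as np
from sklearn.cluster import KMeans
from skimage import morphology

# Hypothetical 2D slice in HU: lung/air ~ -900, soft tissue ~ 40
rng = np.random.default_rng(1)
img = np.where(rng.random((64, 64)) < 0.5, -900.0, 40.0)

# k-means with k=2 finds the two intensity centroids; their midpoint
# serves as a dynamic, per-slice threshold
centers = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    img.reshape(-1, 1)).cluster_centers_.ravel()
threshold = centers.mean()
mask = img < threshold                  # True for lung/air pixels

# erosion strips specks smaller than the structuring element,
# dilation grows the surviving regions back to size
mask = morphology.binary_erosion(mask, morphology.disk(1))
mask = morphology.binary_dilation(mask, morphology.disk(1))
```

Because the threshold is derived from the cluster centroids of each slice, it adapts to per-slice brightness differences instead of relying on one fixed HU cutoff.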

[Figure: original image; thresholded image; image after erosion and dilation]

TensorFlow and Convolutional Neural Network (CNN)
The Convolutional Neural Network is a well-known machine learning algorithm used to learn the features of data, in this case images, and to save those features to predict the type of an image based on the trained model. The CNN is a widely used image classification algorithm: it accepts an image as input, processes it, and classifies it into categories, e.g. dog, cat, tiger, lion. An image is an array of pixels whose size depends on the image resolution; it is represented as h x w x d (h = height, w = width, d = dimension). To train and test the datasets in a deep learning CNN model, the input image array is passed through a series of convolution layers with filters (kernels) and pooling layers into fully connected (FC) layers, and a softmax function classifies the image with probabilistic values between 0 and 1. The input DICOM image, converted to a 2D gray scale image of size 512 x 512, is passed to the CNN network [Figure 8] for dataset preparation, classification and training of the test and train data.
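The conv → ReLU → pool → FC → softmax pipeline described above can be illustrated with a tiny numpy sketch (a toy forward pass, not the paper's trained network; the 8 x 8 input, kernel and FC weights are placeholders for a real 512 x 512 model):

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2D cross-correlation of a single-channel image."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(img, size=2):
    """Non-overlapping max pooling."""
    h, w = img.shape[0] // size, img.shape[1] // size
    return img[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.random.default_rng(0).random((8, 8))           # stand-in for a slice
feat = np.maximum(conv2d(x, np.ones((3, 3)) / 9), 0)  # convolution + ReLU
pooled = max_pool(feat)                               # 2x2 max pooling
logits = pooled.ravel() @ np.ones((9, 2))             # placeholder FC layer
probs = softmax(logits)                               # 2-class probabilities
```

In practice these layers would be built with TensorFlow/Keras and the kernel and FC weights learned by back propagation; the sketch only shows how each stage transforms the array shapes.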

Dataset training example

Fig. 8. Sample CNN for classification and Training
With the above approach we train on our numpy data from the .npz numpy file. With the trained data sets, we can perform automatic noise removal and feature extraction.
Resampling means changing the pixel dimensions of an image. When you downsample, you eliminate pixels and therefore delete information and detail from the image; when you upsample, you add pixels. In the method below we use interpolation for resampling and upsample the volume: the input image shape before resampling is (130, 512, 512) and after resampling is (390, 500, 500).
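Interpolation-based resampling can be sketched with scipy's zoom; a small stand-in volume is used here, with the same 3x factor along Z that turns the paper's 130 slices into 390:

```python
import numpy as np
from scipy.ndimage import zoom

volume = np.zeros((13, 32, 32))              # small stand-in for (130, 512, 512)

# (z, y, x) scale factors; 3.0 on Z mirrors 130 -> 390 slices,
# 500/512 on Y and X mirrors 512 -> 500 pixels
factors = (3.0, 500 / 512, 500 / 512)
resampled = zoom(volume, factors, order=1)   # order=1: linear interpolation
print(volume.shape, "->", resampled.shape)
```

The scale factors would normally be derived from the DICOM slice thickness and pixel spacing so that the resampled volume has the desired physical voxel size.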

Image Meshing
Image meshing is the process of creating 3D computer models from images such as MRI (Magnetic Resonance Imaging) and CT (Computed Tomography) scans for use in CFD (computational fluid dynamics) and FEA (finite element analysis). Applying the image mask to the 390 slices, we plot a 3D image of the lung cavity.

Image Masking for Feature Extraction
To the denoised, thresholded and enhanced image slices, we apply image masking to remove pixels of no interest. This gives the ROI (Region of Interest) requiring medical attention, for further diagnosis, confirmation of cancerous nodules, analysis and treatment. This approach can process larger data sets for data extraction, training, noise removal, feature extraction and nodule detection. We have used images from this consortium for this research paper and the study of CT scan images. The consortium contains around 1018 cases, which have been used for our study; it was created through a collaboration of 7 academic centers and 8 medical imaging companies.
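Masking out pixels of no interest reduces to a boolean filter in numpy; the toy slice and the -400 HU cutoff below are illustrative assumptions:

```python
import numpy as np

# Hypothetical slice in HU: lung parenchyma/air ~ -900, with a 4x4
# soft-tissue-density region standing in for a candidate nodule
slice_hu = np.full((16, 16), -900.0)
slice_hu[6:10, 6:10] = 30.0

lung_mask = np.ones_like(slice_hu, dtype=bool)  # lung assumed already isolated

# Keep only in-lung pixels dense enough to matter; zero out the rest
roi = np.where(lung_mask & (slice_hu > -400), slice_hu, 0.0)
print(int((roi != 0).sum()))  # → 16 pixels survive the mask
```

Everything outside the ROI is zeroed rather than cropped, so the slice keeps its original geometry for later feature extraction.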

Conclusion
In the proposed model, we applied advanced image processing techniques and AI concepts for image pre-processing, noise removal, enhancement, slicing and feature extraction. This model can handle a huge volume of data and can plot the slices and lung image in 2D and 3D form for CAD (Computer Aided Diagnosis), helping pathologists study the results and act quickly on cancer confirmation and treatment in early stages, which helps reduce mortality rates. In this approach the threshold value is calculated dynamically using the k-means centroid cluster algorithm, based on the image noise and mean values, so each slice is processed uniquely. Further research can address identifying cancer stages and early prediction of lung nodules by applying CNNs with improved training datasets.
In addition to the above diagnosis improvements, more seamless mobile apps should be developed for Android, iOS and Microsoft mobile devices to improve patient communication, provide education on the disease, and create awareness that supports patients' mental health in overcoming hard times and the disease faster.