Cluster Analysis of Patients’ Clinical Information for Medical Practitioners and Insurance Companies

A number of approaches have been proposed in the literature to collect and classify patient-related information for better clinical diagnosis, safer treatment, and administration of related activities. This type of data collection and classification benefits doctors and the corresponding hospitals. However, to our knowledge, no effort has been made to classify the data accumulated within insurance company databases so as to help doctors as well as insurance companies with better analysis and cost-effective treatment of patients suffering from chronic (and expensive to treat) diseases, such as those related to oncology. In this study, a customized self-organized data classification model is applied to an insurance company database to build clusters based on age, patient condition, tests done, etc. These clusters give doctors an integrated analysis that supports patient-specific, disease-specific, and cost-effective treatment. They also save the cost of repeating tests on the patient. An experimental setup is developed to train such a network, and testing results are presented. The practical constraints are also discussed.
Keywords: Clustering; Clinical Information; Data Classification


Introduction
A number of improvements in hospitals have led to the implementation of automated ordering and dispensing systems [1], resulting in more time for patient care activities, performance improvement, savings, etc. However, data collected within hospitals typically remains focused on billing or administrative information. Several types of products have applications for in-patient pharmacy departments. Real-time information technology (IT) systems include (1) automated prescribed order-entry systems, (2) clinical decision-support system platforms including intelligent systems to guide treatment and check orders, (4) automated patient records integrated with data from other patient-related departments [2][3]. From a business point of view, clinical data security is important for maintaining an edge over competing institutions, since organizations invest heavily in research and development units to develop new drugs, medical devices, and medical treatment procedures [4]. The corresponding results are stored in clinical trial files and records.
Patients' records typically vary in a multitude of ways, including diagnosis, severity of illness, medical complications, speed of recovery, resource consumption, lab tests, discharge destination, and social circumstances. Such data is classified for various purposes, mostly within the hospital domain. For example, the authors of [5] study medical data classification approaches applied to heart disease cases and employ decision tree algorithms to collect results. In a similar work, the authors of [6] discuss an end-to-end dynamic neural network that examines medical records, updates previous medical history, and then infers illness states and predicts future medical states. In another study [7], the authors propose an approach that combines rule-based features and knowledge-guided learning models for effective disease classification. The classification of such data may also be used for analysis, planning, decision making, etc., and it is an active research subject in many disciplines, such as neural networks. The most important and frequently used classification method when no information is available about the clusters is the Self-Organizing Map (SOM).
The objectives of this study are: (i) to guide doctors on chronic patient cases, and insurance companies on improving future professional links with hospitals and doctors; (ii) to help doctors analyze patient clinical information from different hospitals; and (iii) to benefit insurance companies through savings from cost-effective treatment.
Typically, insurance investigation of patient cases, with special reference to oncology, involves intensive, long-term, and phase-wise collection of patient data. This helps insurance companies to determine the type and cost of tests, diagnostics, and treatment conducted in the hospital per patient, per physician, and per hospital. Indirectly, this may help insurance companies to protect their interests. The system reported in this study is based on a neural network and is intended to be used by insurance companies as well as doctors. Effectively, the solution classifies a set of standard data, taken from an insurance company database, into a number of classes. The paper is structured as follows. In Section 2, we present the proposed approach for data classification as applied to an oncology patient database. Section 3 presents the experimental setup, network training, and testing results, followed by conclusions in Section 4.

Proposed Approach
Unsupervised learning is a class of machine learning techniques for finding patterns in data. The data given to an unsupervised algorithm is not labelled, which means that only the input variables (x) are provided, with no corresponding output variables. When input data is fed to a neural network in unsupervised mode, the Euclidean (straight-line) distance between the input vector and the weight vector of each node is computed. As the actual data moves through the network, the weights of the links between the nodes start to look more like the data as the iterations continue. The resulting output grid map does not have target vectors, since its purpose is to divide the input vectors into clusters based on similarity. Generally, the more nodes in the grid map, the more detailed the clustering, but the longer the training takes.
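The distance computation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `best_matching_unit`, the 10x10 map, and the random weights are hypothetical and are not taken from the experiments reported in this paper.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Return (row, col) of the node whose weight vector is closest to x.

    weights: array of shape (rows, cols, dim); x: array of shape (dim,).
    """
    # Euclidean distance from x to every node's weight vector
    dists = np.linalg.norm(weights - x, axis=-1)
    # Flat index of the smallest distance, converted back to grid coordinates
    row, col = np.unravel_index(np.argmin(dists), dists.shape)
    return int(row), int(col)

rng = np.random.default_rng(0)
weights = rng.random((10, 10, 7))   # hypothetical 10x10 map with 7 inputs
x = weights[3, 4].copy()            # an input identical to node (3, 4)
print(best_matching_unit(weights, x))
```

Because the input is a copy of one node's weight vector, its distance to that node is zero, so that node is returned as the closest one.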
The neural network trains itself to see patterns in the data much the way a human does. In this unsupervised mode, each neuron is fully connected to all the source units in the input layer. The number of input nodes equals the dimension of the input vector. The number of output nodes, typically set by the user, determines the maximum number of classes to be found. Each neuron (node) in the output layer represents a cluster, or alternatively a set of common features. Nearby nodes represent similar clusters, and the network tries to associate input patterns with common features to the same (or a nearby) output node. The node in the network that is most similar to the input data is called the best matching unit. The neurons become selectively tuned to various input patterns during learning. Their locations are tuned so that the winning neurons become ordered and a meaningful coordinate system for the input features is created on the lattice.
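A single tuning step of this kind might look as follows. The sketch assumes a rectangular lattice and a Gaussian neighborhood of fixed width `sigma`; both are illustrative choices, and the function name `som_update` is hypothetical rather than part of the system described here.

```python
import numpy as np

def som_update(weights, x, bmu, lr=0.1, sigma=1.5):
    """One learning step: pull the winning node and its neighbours toward x.

    weights: (rows, cols, dim) array; x: (dim,) input; bmu: (row, col) of the winner.
    Returns a new weight array; the input array is not modified.
    """
    rows, cols, _ = weights.shape
    # Grid coordinates of every node on the lattice
    r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    # Squared lattice distance from each node to the winning node
    d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
    # Gaussian neighbourhood: 1 at the winner, decaying with lattice distance
    h = np.exp(-d2 / (2.0 * sigma ** 2))
    # Move each weight vector toward x in proportion to lr * h
    return weights + lr * h[..., None] * (x - weights)

weights = np.zeros((5, 5, 3))
updated = som_update(weights, np.ones(3), bmu=(2, 2))
print(updated[2, 2], updated[0, 0])
```

The winner moves the most and nodes farther away on the lattice move less, which is what lets neighbouring nodes end up representing similar inputs.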
The most commonly used form of unsupervised learning is the self-organizing map (SOM) or self-organizing feature map (SOFM), also known as the Kohonen network [8][9], as shown in Figure 1. The main advantage of using a SOM is that the data is easily interpreted and understood. The reduction in dimensionality and the grid clustering make it easy to detect similarities in the data. Self-organizing maps differ from other artificial neural networks in that they apply a computationally convenient competitive learning approach using a neighborhood function, in order to preserve the topological properties of the input space. The nodes in the resulting map may be arranged on a hexagonal grid, and since a format (map ratio) is taken into account, the number of nodes in the actual map may differ slightly from the number specified. For the purpose of competitive learning, the well-known SOM-Ward distance measure may be used [10]:

$$ d(x, y) = \begin{cases} \dfrac{n_x \, n_y}{n_x + n_y} \, \lVert \bar{x} - \bar{y} \rVert^2 & \text{if } x \text{ and } y \text{ are adjacent in the SOM} \\ \infty & \text{otherwise} \end{cases} $$

where $x$ and $y$ denote two specific clusters, $n_x$ and $n_y$ denote the number of data points in the two clusters, $\bar{x}$ and $\bar{y}$ denote the centers of gravity of the clusters, and $\lVert \cdot \rVert$ is the Euclidean norm. Thus, the SOM-Ward distance takes the topological location of the clusters into account. In particular, two clusters that are not adjacent in the SOM are never considered for merging.
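As a small worked example, the SOM-Ward distance can be computed as below. The sketch assumes the usual Ward form, that is, the product of the cluster sizes divided by their sum, multiplied by the squared Euclidean distance between the centers of gravity, and an infinite distance for clusters that are not adjacent on the map. The function name and the adjacency flag are hypothetical.

```python
import numpy as np

def som_ward_distance(points_x, points_y, adjacent):
    """SOM-Ward distance between two clusters given as arrays of data points.

    adjacent: True if the two clusters are neighbours on the SOM lattice.
    """
    if not adjacent:
        # Non-adjacent clusters are never merged, so their distance is infinite
        return float("inf")
    points_x = np.asarray(points_x, dtype=float)
    points_y = np.asarray(points_y, dtype=float)
    n_x, n_y = len(points_x), len(points_y)
    # Centres of gravity of the two clusters
    center_x = points_x.mean(axis=0)
    center_y = points_y.mean(axis=0)
    # Ward weighting times the squared Euclidean distance between the centres
    return n_x * n_y / (n_x + n_y) * float(np.sum((center_x - center_y) ** 2))

cluster_a = [[0.0, 0.0], [2.0, 0.0]]   # centre of gravity (1, 0), two points
cluster_b = [[1.0, 3.0]]               # centre of gravity (1, 3), one point
print(som_ward_distance(cluster_a, cluster_b, adjacent=True))
print(som_ward_distance(cluster_a, cluster_b, adjacent=False))
```

For the two toy clusters the centers are 3 apart, so the adjacent case evaluates to (2 x 1 / 3) x 9 = 6, while the non-adjacent case is infinite.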

Experimental Results
In this section, the experimental setup and results are presented, based on a typical database of 800 case studies related to oncology in Al-Ain, UAE. Each case study specifies a data vector as input: patient condition (ten different diseases), test type (six different tests), patient origin (local, or expat from seven different countries), doctor (local, or expat from six different countries), patient age (four age ranges), patient insurance coverage (five different types), and hospital name (six different names). The training data are taken from this database.

Setup and training
A set of parameters must be defined. The input vector consists of seven inputs, one for each attribute listed above, the last of which is X7: Hospital Name. Since the size of the input vector is seven, a grid map of roughly 10x10 clusters seems sufficient to represent the data. To enter these inputs into a neural network and train it, each possible value of an input must be represented by a continuous numeric value between 0.0 and 1.0, so that the input vector finally presented to the network consists only of numerical values. For this purpose, the range between 0.0 and 1.0 is divided into equidistant values according to the number of possible values of that input. Table 1 shows the inputs with their possible values and the corresponding continuous numerical value for each.
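The encoding step can be illustrated as follows. Since Table 1 is not reproduced here, the sketch assumes that the equidistant values include both endpoints 0.0 and 1.0; the helper name `equidistant_codes` and the age-range labels are hypothetical.

```python
import numpy as np

def equidistant_codes(categories):
    """Map each category to an equally spaced value in [0.0, 1.0].

    Assumption: both endpoints are used, so k categories get the
    values 0, 1/(k-1), ..., 1. The actual spacing in Table 1 may differ.
    """
    values = np.linspace(0.0, 1.0, num=len(categories))
    return {category: float(value) for category, value in zip(categories, values)}

# Hypothetical labels for the four age ranges
age_codes = equidistant_codes(["range 1", "range 2", "range 3", "range 4"])
print(age_codes)
```

Each of the seven inputs would be encoded this way, and the seven resulting numbers form the input vector of one case study.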
The topology is a 10x10 grid, so there are 100 neurons. Using Matlab, input vectors are randomly generated, and the resulting topology is shown in Figure 2. Figure 2 shows that the maximum number of hits associated with any neuron is 15; thus, there are 15 input vectors in that cluster. We set the learning rate parameter to 0.1 and the initial number of epochs to 200. For training, 70% of the data (560 samples) were selected to train the network for 200 iterations. The first training result is the self-organizing clustering map, shown in Figure 3a. The rows of Figure 3b represent the clusters. The first column contains the cluster name, the second column displays the description (if any) of the cluster, the third column displays the median, the fourth column displays the frequency, and so on. Subsequent columns display the aggregated attribute values; each cell holds the aggregated value. The map shows that our input data are classified into four clusters, with each cluster containing a number of neurons.
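For readers without Matlab, a comparable setup can be sketched in NumPy. This is not the code used for the reported results: it assumes a rectangular 10x10 lattice, a constant learning rate of 0.1, a fixed Gaussian neighborhood width, and uniformly random stand-in data in place of the insurance database. The function names and the `sigma` parameter are hypothetical.

```python
import numpy as np

def train_som(data, rows=10, cols=10, lr=0.1, epochs=200, sigma=2.0, seed=0):
    """Train a rectangular SOM with a constant learning rate and neighbourhood."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    for _ in range(epochs):
        # Present the samples in a fresh random order every epoch
        for x in data[rng.permutation(len(data))]:
            d = np.linalg.norm(weights - x, axis=-1)
            br, bc = np.unravel_index(np.argmin(d), d.shape)
            h = np.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2.0 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
    return weights

def hit_counts(weights, data):
    """Count how many samples have each node as their best matching unit."""
    hits = np.zeros(weights.shape[:2], dtype=int)
    for x in data:
        d = np.linalg.norm(weights - x, axis=-1)
        hits[np.unravel_index(np.argmin(d), d.shape)] += 1
    return hits

rng = np.random.default_rng(1)
data = rng.random((800, 7))                 # random stand-in for 800 encoded cases
order = rng.permutation(len(data))
train, test = data[order[:560]], data[order[560:]]   # 70% / 30% split
weights = train_som(train, epochs=5)        # 200 in the text; fewer here for speed
hits = hit_counts(weights, train)
print(hits.sum(), hits.max())
```

The hit map plays the role of the per-neuron hit counts in Figure 2: its entries sum to the number of training samples, and its maximum is the size of the most populated node.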

Testing and results
To test the network, the remaining 30% of the data (240 samples) was used. Four vectors belonging to four different clusters were selected, and extreme noise was then added to the inputs. Table 3 shows the X, Y coordinates of the centers of the four clusters (C1, C2, C3, C4). Next, extreme noise (0.0) is added to the condition input (Table 4). The network still gives a good response (75%): every vector is assigned to the correct cluster except that of C3, which is moved to another cluster. This response is expected, since C3 depends on high values of the "Condition" input and the noise forced this input to zero.
In another test, extreme noise (1.0) is added to the condition input (as shown in Table 5). The network still gives an accurate response (100%) for all clusters. It is observed from the testing that a clustering error usually happens when the noise shifts the input far from its original value, or far from the cluster center in that dimension. To analyze the performance of the network in another way, a noisy input is replaced by its mean value. For example, if the noisy input in test 2 (Table 4) is replaced by its mean value, the network works perfectly (Table 6). In order to enhance the performance of the network, the number of neurons was increased from 100 to 1000. The result is shown in Figure 4. The number of clusters increases from 4 to 5, and the data is now uniformly distributed among the clusters, each containing 1/5 of the data set. Reducing the learning rate also enhances the network performance: Figure 5 shows that the clusters are more uniform.
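The noise tests can be mimicked on a toy example. The four cluster centers below are invented by hand so that the qualitative behaviour resembles the description above; they are not the values of Tables 3 to 6, and the index of the "Condition" input and the function names are assumptions.

```python
import numpy as np

CONDITION = 0  # assumed position of the "Condition" input in the vector

# Hand-made centres for C1..C4 over seven inputs (illustrative only)
centers = np.array([
    [0.6, 0.6, 0.1, 0.1, 0.1, 0.1, 0.1],   # C1
    [0.5, 0.1, 0.9, 0.9, 0.1, 0.1, 0.1],   # C2
    [0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],   # C3: relies on a high condition value
    [0.5, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9],   # C4
])

def assign(centers, x):
    """Index of the cluster centre closest to x (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(centers - x, axis=1)))

def accuracy_with_condition(centers, value):
    """Overwrite the condition input of each centre vector and re-assign it."""
    correct = 0
    for k, center in enumerate(centers):
        x = center.copy()
        x[CONDITION] = value
        correct += int(assign(centers, x) == k)
    return correct / len(centers)

print(accuracy_with_condition(centers, 0.0))   # 0.75: the C3 vector moves to C1
print(accuracy_with_condition(centers, 1.0))   # 1.0
print(accuracy_with_condition(centers, centers[:, CONDITION].mean()))   # 1.0
```

In this toy setting, forcing the condition input to 0.0 pushes only the C3 vector into a neighbouring cluster, while forcing it to 1.0 or replacing it by the mean leaves every assignment unchanged.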

Conclusion
To return to the objectives, an analysis platform was presented that gives an integrated and wider view of patients' clinical information retrieved from an insurance company database, to help doctors with in-depth medical analysis and to help insurance companies benefit from cost-effectiveness. The platform is not limited to oncology: other clinical databases may be linked to the same platform, or a separate analysis may be developed for each database. Based on our experimental results, it is recommended that:
• The number of clusters (or neurons) be increased to fine-tune the clustering map.
• The number of iterations be at least 500 times the number of neurons in the lattice.
• The learning rate parameter be selected as a small constant, typically 0.01, and decreased over thousands of iterations, but never allowed to reach zero.
• A different metric be investigated to produce competitive results for data classification, since the Euclidean distance, the most commonly used metric in SOM, is not the best choice for some problems.
Furthermore, as the initial positions of the neurons differ each time the SOM analysis is run, the SOM map eventually generated will also differ. This can be alleviated by initially setting the input vector to its maximum size and increasing the number of neurons before training, for smoother classification.
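The recommendations on iteration count and learning rate can be expressed as a small schedule. The exponential form of the decay and the floor value `lr_min` are assumptions made for illustration; the text only requires a small starting value, a gradual decrease, and a rate that never reaches zero.

```python
import math

def recommended_iterations(rows, cols, factor=500):
    """Lower bound on iterations: 500 times the number of neurons in the lattice."""
    return factor * rows * cols

def learning_rate(t, total, lr0=0.01, lr_min=0.001):
    """Exponentially decaying learning rate, floored so it never reaches zero.

    t: current iteration; total: planned number of iterations.
    lr_min and the exponential form are illustrative assumptions.
    """
    return max(lr_min, lr0 * math.exp(-t / total))

total = recommended_iterations(10, 10)
print(total)
print(learning_rate(0, total), learning_rate(total, total))
```

For a 10x10 lattice this gives at least 50000 iterations, with the rate starting at 0.01 and shrinking steadily while staying above the chosen floor.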

Authors
Qurban A. Memon has contributed at levels of teaching, research, and community service in the area of electrical and computer engineering. He graduated from University of Central Florida, Orlando, US with PhD degree in 1996. Currently, he is working as Associate Professor at UAE University, College of Engineering, United Arab Emirates. He has authored/co-authored over ninety publications in his academic career. He