Toddler ASD Classification Using Machine Learning Techniques

Abstract—Autism Spectrum Disorder (ASD) has become one of the most severe neurodevelopmental disorders worldwide, and early recognition can substantially mitigate its impact. The proposed work analyzes the unbalanced ASD toddler dataset from the UCI data repository. The work in this paper is performed in three stages. In the first stage, the original data is preprocessed by converting the categorical attributes to numeric values through frequency encoding, followed by standardization of the numeric attributes. In the second stage, the dimension of the input is reduced using Principal Component Analysis (PCA). Finally, the ASD toddler data is classified with different machine learning models in two ways, viz. through a training parameter ε and through k-fold cross-validation (k = 10). The experiments yield very high classification performance in comparison with other state-of-the-art approaches.


Introduction
A neuro-developmental disorder is a mental condition in which the nervous system is affected. ASD is one such disorder, affecting the social interaction, communication and behavior of an individual. It is characterized by repetitive activities and aimless imaginary thoughts [1]. In Asia, the average prevalence of ASD in the general population was nearly 1.9/10,000 in 1980, and it increased to approximately 14.8/10,000 by 2008 [2]. It is observed that the early signs of ASD appear during the first 6-18 months of a toddler's life. These signs are followed by developmental regression, such as loss of verbal, social and communication ability along with abnormal motor development, between 18 and 36 months of age [3]. ASD diagnosis can be carried out on all individuals, including adults (17 years and above), adolescents (12 to 16 years), children (4 to 11 years) and toddlers (up to 36 months). However, it is not possible using conventional medical tests such as a blood test.

Literature Review
Images of the brain were taken by functional Magnetic Resonance Imaging (fMRI) while the subject was idle [10]. From the Autism Brain Imaging Data Exchange (ABIDE) [11], the analysis procured 1035 fMRI instances and analyzed the patterns for diagnosis of ASD. Deep learning techniques were implemented to understand and classify the distinguishing features of neuro-images of autistic brains. It was found that, compared to conventional diagnosis methods, the neuro-imaging patterns of the brain distinguished ASD with improved accuracy.
In a further study, node variability was obtained from the subject's brain. Various machine learning models [12] were trained with the obtained variability on resting-state fMRI ASD and non-ASD data from ABIDE. Machine learning (ML) algorithms such as Naive Bayes (NB) [13], Random Forest (RF) [14] and Support Vector Machines (SVM) [15] were applied on 147 cases, achieving a performance of 60-65 percent.
A total of 2925 Social Responsiveness Scale (SRS) samples were derived from the Simons Simplex Collection 15.0, the Boston Autism Consortium and AGRE [16]. The test was carried out on the 2925 samples by dividing them into 10 folds, each comprising 65 samples with 10 percent of both ADHD and ASD data [17]. Following a feature selection method based on minimal-redundancy-maximal-relevance (mRMR) [18], six ML algorithms, SVM, Linear Discriminant Analysis (LDA) [19], Categorical Lasso (CL) [20], Decision Tree (DT) [21], RF and the Logistic Regression model (LR) [22], were tested using the Scikit-learn package [23]. The investigation showed the best results for classification of ASD and ADHD test samples with the fusion of SVM, LDA, CL and LR.
The author in [24] designed a mobile-based ASD screening tool called ASDTest for testing all categories of individuals, based on the Q-CHAT and Autism Spectrum Quotient (AQ-10) screening questionnaires [8], in which the response to each question scores a point. The author collected 1452 instances using the app across all categories of individuals but left out the toddler cases, as they made the entire data unbalanced. The remaining 1100 instances comprised the adult, adolescent and child data sets, with 21 features in each data set. Following feature extraction using wrapper filtering, two ML algorithms, LR and NB, were used for classifying the ASD data. The adult data showed the maximum performance with the LR model.
To improve the performance of detecting the ASD class within children, the author utilized fuzzy data mining models [25]. The data set was obtained from the UCI data repository, collected with the ASDTest app developed by the author in [24]. It consists of 509 instances, distributed as 252 non-ASD and 257 ASD cases, with 21 features. Along with FURIA, a fuzzy data mining algorithm, the JRip, RIDOR [26] and PRISM [27] algorithms were also used to generalize the overall performance. In terms of accuracy as well as sensitivity, the FURIA classification model surpassed the other models, but its specificity fell below that of the JRip algorithm.
Instead of analyzing a single category of individuals, the author in [28] focused on early detection of ASD in children, adolescents and adults based on supervised learning. The data was gathered from the UCI repository, collected with the ASDTest app designed by the author in [24]. Following pre-processing, the author performed classification on the data sets employing the K-Nearest Neighbor (KNN) [29][30], SVM and RF classifiers. The pre-processed data was partitioned into training data, expressed as α percent within a range of 50 to 90 percent, and testing data, expressed as (100 - α) percent. The performance parameters were calculated over five experiments performed with α = 50 to 90 percent under two cases: missing and complete data. The RF classifier showed the maximum performance for both the adult and adolescent data sets on both complete and missing data.
The author in [31] predicted the chance of ASD in children, adolescents and adults by applying five ML algorithms: NB, LR, SVM, KNN and Artificial Neural Network (ANN). Apart from these, a Convolutional Neural Network (CNN) was also used to predict ASD. The datasets were collected from the UCI repository, gathered by the author in [24] via the ASDTest app. The author pre-processed the raw data and then partitioned it into training and testing datasets in the ratio of 80:20. Without any dimensionality reduction, all 21 features were used to predict ASD. CNN outperformed the rest of the models on all categories of data sets, with the adult data set yielding the maximum result.

Data Collection
Dr. Fadi Fayez Thabtah developed a data set by using ASDTest to screen for ASD in individuals including toddlers, children, adolescents and adults. The autism screening data set for toddlers was published in July 2018 at the UCI data repository and has been active since November 2018, with six unique contributors. The data set is based on the QCHAT-10 toddler ASD screening tool [8]. It comprises 1054 instances with 17 independent variables, excluding the case number, and one dependent variable as the output, which indicates the ASD class. Out of the 1054 cases, 326 belong to the non-ASD class and the remaining 728 to the ASD class [9]. Due to this distribution of instances, the data set is unbalanced. Hence, prior to the investigation, the research aimed to balance the data set through the process of standardization.

Proposed Work
The workflow of the proposed method is shown in Fig.1. The complete method is divided into four parts, viz. preprocessing of the original ASD data, dimension reduction of the original data, classification using different ML models and performance evaluation of all the ML models. Training is carried out in two ways, i.e. through a training parameter ε [28] and through 10-fold cross-validation [5]. For both approaches, the classification performance parameters are calculated and compared with existing state-of-the-art methods.

Preprocessing
Numeric transformation of categorical data: The original toddler dataset contains the categorical input attributes "Ethnicity", "Sex", "Jaundice", "Family_mem_ASD" and "Who_completed_the_test", and the categorical output "Class/ASD Traits". It is essential to convert these categorical data into numeric values before feeding them to the classifier. Among the categorical attributes, "Ethnicity" is transformed with dummy variables: for a particular ethnicity the corresponding dummy variable is set to 1 while, for the same case, all other dummy variables are set to 0. The other input attributes, "Sex", "Jaundice", "Family_mem_ASD" and "Who_completed_the_test", are assigned numbers based on the frequency of repetition of each value in the table. The output class "Class/ASD Traits" is assigned logical values, i.e. 1 for "yes" and 0 for "no". Table.1 shows the numeric transformation rule for all categorical attributes. Table.2 presents a sample of the numeric data after transformation, where the number of input attributes has increased from 17 to 27 through these conversion rules.
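The two encodings described above can be sketched in a few lines of plain Python. The column values below are toy stand-ins, not the actual UCI records; the paper itself works in MATLAB.

```python
from collections import Counter

def dummy_encode(values):
    """One 0/1 dummy column per distinct category, as used for "Ethnicity"."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

def frequency_encode(values):
    """Replace each category by its count of repetitions in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]

# Toy example (illustrative values only)
sex = ["m", "f", "m", "m", "f"]
print(frequency_encode(sex))                      # [3, 2, 3, 3, 2]
print(dummy_encode(["asian", "white", "asian"]))  # one 0/1 list per ethnicity
```

Note that dummy encoding is what grows the attribute count from 17 to 27: each distinct "Ethnicity" value becomes its own column.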

Scaling of numeric data
After the numeric transformation of the original dataset, it is seen that some transformed attributes, such as "Sex", "Jaundice", "Family_mem_with_ASD" and "Who_completed_the_test", take high numeric values that exceed the range of the output class. Hence they are standardized by the mean and standard deviation approach using Eq. (1) [32],

x_{new} = (x - \mu) / \sigma,  (1)

where x_{new} represents the standardized value of the attribute, x the original value, \mu the mean of all values of the attribute and \sigma the standard deviation of all values of the attribute. Table.3 presents a sample of the numeric dataset after standardization.

Dimension reduction using PCA: In the previous steps it is noticed that the total number of input attributes exceeds the original number of attributes. Hence it is essential to reduce the dimension of the data while affecting the output of the system as little as possible. For this purpose, Principal Component Analysis (PCA) [33] is used in this paper.
PCA is the most popular linear dimensionality-reduction technique, projecting the input data onto a lower-dimensional linear subspace. It constructs a lower-dimensional representation of the input data that describes as much of the variance in the data as possible. In this method, the standardized input with 27 attributes is applied to PCA, and only six principal components, with variances of 20.8%, 12.5%, 12.05%, 10.81%, 9.90% and 9.16%, are retained as the transformed features that affect the output class the most.
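A minimal numpy sketch of standardization (Eq. (1)) followed by PCA via the eigendecomposition of the covariance matrix. The input here is random stand-in data, not the 27-attribute toddler matrix, and the paper's own computation is done in MATLAB.

```python
import numpy as np

def standardize(X):
    """Eq. (1): subtract the column mean, divide by the column standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca(X, n_components):
    """Project standardized data onto the top principal components."""
    Z = standardize(X)
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # sort by descending variance
    components = eigvecs[:, order[:n_components]]
    explained = eigvals[order] / eigvals.sum()  # explained-variance ratios
    return Z @ components, explained[:n_components]

# Toy data standing in for the standardized 27-attribute input
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
scores, ratios = pca(X, 2)
print(scores.shape)   # (100, 2)
```

In the paper the same step maps 27 standardized attributes to the six components whose variance percentages are listed above.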

Data splitting
The pre-processed data obtained from PCA is partitioned into training and testing sets through two methods.
In the first approach the complete data is split into training and testing samples through a training parameter ε with the two values 0.8 and 0.7. To prepare an effective trained ML model, a larger number of samples is required for training; hence training parameters of 0.7 and 0.8 are chosen. The ML models are trained with the training set, and the same classifiers are tested with the testing samples determined by the testing parameter (1 - ε).
In the second approach a 10-fold cross-validation procedure is used for training the ML models. The complete data is split into ten groups; the model is trained on nine groups and tested on the remaining one, and the procedure is repeated ten times over the randomly partitioned data. During the entire process of data splitting, random shuffling must be performed so that each class is represented in each partition.
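Both splitting schemes can be sketched in plain Python. The seed, the ε value and the use of 1054 toddler instances below are illustrative; in particular, this simple shuffle does not stratify by class, which the paper's requirement of every class appearing in every partition would additionally demand.

```python
import random

def holdout_split(data, epsilon, seed=0):
    """Split into an epsilon training fraction and a (1 - epsilon) testing fraction."""
    items = data[:]
    random.Random(seed).shuffle(items)
    cut = int(len(items) * epsilon)
    return items[:cut], items[cut:]

def kfold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

train, test = holdout_split(list(range(1054)), 0.8)
print(len(train), len(test))   # 843 211
```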

Machine learning models
Decision tree: DT [21] is a supervised learning algorithm for solving statistical classification and regression problems. In this paper, the DT creates a training model which is used for predicting the target variable (YES/NO). The algorithm partitions the data into subsets containing instances with similar values.
To build the DT, two types of entropy are evaluated over the dataset:

a) The entropy of the output class alone is given by Eq. (2),

E(T) = -\sum_{i=1}^{c} p_i \log_2 p_i,  (2)

where E is the entropy, T is the output class, p_i is the probability of the i-th class and c is the number of output classes.

b) The entropy of the output with respect to one input attribute is given by Eq. (3),

E(T, X) = \sum_{v \in X} P(v)\, E(T_v),  (3)

where T is the output class, X is one of the inputs, P(v) is the probability of X taking the value v and E(T_v) is the entropy of the output over the instances with X = v.

c) Steps a) and b) are repeated for each input, and the information gain is obtained from Eqs. (2) and (3) by the formula in Eq. (4),

Gain(T, X) = E(T) - E(T, X).  (4)

d) The same process continues until the entropy reaches 0; at this point a leaf node is found where all the data are classified.
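Eqs. (2)-(4) can be checked with a short Python sketch; the toy attribute and labels below are illustrative, not taken from the toddler data.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Eq. (2): E(T) = -sum p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(inputs, labels):
    """Eq. (4): E(T) minus the weighted entropy of each input value's subset."""
    n = len(labels)
    cond = 0.0
    for v in set(inputs):
        subset = [y for x, y in zip(inputs, labels) if x == v]
        cond += (len(subset) / n) * entropy(subset)   # Eq. (3)
    return entropy(labels) - cond

# Toy split: this attribute separates the two classes perfectly
x = ["a", "a", "b", "b"]
y = [1, 1, 0, 0]
print(round(info_gain(x, y), 3))   # 1.0
```

A gain of 1.0 bit is the maximum possible for a balanced two-class output, which is why the DT would split on this attribute first.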
Discriminant analysis: DA [19] finds a set of prediction equations, based on the independent attributes, for classifying individuals into groups. A DA has two possible objectives: to find a predictive equation for classifying new individuals, and to interpret that equation to understand the relationships between the variables. In the DA model, each class (Y) generates data (X) according to a multivariate normal distribution.

The basic purpose is the minimization of the expected classification cost, as shown in Eq. (5),

\hat{y} = \arg\min_{y = 1, \dots, N} \sum_{n=1}^{N} \hat{P}(n \mid x)\, C(y \mid n),  (5)

where \hat{y} is the predicted classification, N is the total number of classes, \hat{P}(n \mid x) is the posterior probability of class n for observation x and C(y \mid n) is the cost of classifying an observation as y when its true class is n.
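The cost minimization in Eq. (5) can be sketched directly, given posterior probabilities and a cost matrix; the probabilities and costs below are toy values, and a real DA would first estimate the posteriors from the fitted normal densities.

```python
def min_cost_class(posterior, cost):
    """Eq. (5): pick the class y minimizing sum_n P(n|x) * C(y|n)."""
    n_classes = len(posterior)
    expected = [sum(posterior[n] * cost[y][n] for n in range(n_classes))
                for y in range(n_classes)]
    return expected.index(min(expected))

# Toy example: with 0/1 costs this reduces to the maximum-posterior rule
posterior = [0.2, 0.8]          # P(n | x) for classes 0 and 1
zero_one = [[0, 1], [1, 0]]     # C(y | n): 0 if correct, 1 otherwise
print(min_cost_class(posterior, zero_one))   # 1
```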
K-Nearest neighbor: The kNN [29][30] classification model is implemented in numerous areas such as data mining, prediction and pattern recognition, as well as other areas of applied science. An unclassified instance is identified by testing the closeness of the k nearest samples in the dataset, where the nearest neighbours are determined by Euclidean distance. Let k be the number of attributes, m_i the i-th attribute of one sample and n_i the i-th attribute of the other; then Eq. (6) gives the Euclidean distance,

d(m, n) = \sqrt{\sum_{i=1}^{k} (m_i - n_i)^2}.  (6)

Support vector machine: SVM [15] is utilized for regression as well as classification problems with high accuracy. It finds a hyperplane that separates the data samples in an N-dimensional space, maximizing the margin between the data samples of the two classes. The soft-margin formulation is given by Eq. (7),

\min_{w, b, \xi} \; \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \; \xi_i \ge 0,  (7)

where (x_i, y_i), i = 1, \dots, l, are the instance-label pairs, \phi is the mapping function and C > 0 is the penalty parameter. The kernel function is K(x_i, x_j) = \phi(x_i)^T \phi(x_j).

Random forest: The RF classification algorithm [14] is a collection of decision trees \{h(x, \theta_k)\}, where the \theta_k are independent, identically distributed random vectors and every tree yields a classification result. For a sufficiently large number of trees, the generalization error is bounded as in Eq. (8),

PE^* \le \bar{\rho}\,(1 - s^2) / s^2,  (8)

where \bar{\rho} is the mean correlation between two trees and s is the strength of the set of trees.
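Of the classifiers above, kNN is the simplest to sketch from scratch: Eq. (6) plus a majority vote among the k closest training samples. The 2-D points and k = 3 below are illustrative choices, not the paper's configuration.

```python
from math import sqrt
from collections import Counter

def euclidean(m, n):
    """Eq. (6): straight-line distance between two attribute vectors."""
    return sqrt(sum((mi - ni) ** 2 for mi, ni in zip(m, n)))

def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    nearest = sorted(train, key=lambda t: euclidean(t[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data: two clusters labelled 0 and 1
train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
         ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]
print(knn_predict(train, (5.5, 5.5)))   # 1
```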

Evaluation of performance parameters
To outline the performance of the implemented ML algorithms, five evaluation parameters, viz. accuracy (Acc), sensitivity (SN), specificity (SP), F-measure (F1-score) and Area Under the Curve (AUC), are used for classifying ASD and non-ASD test instances of the toddler data set. Eqs. (9)-(13) give the expressions for the performance parameters. The confusion matrix describes the performance of the classification models on a set of test data with known true values [5].
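The first four parameters follow directly from the confusion-matrix counts. Since Eqs. (9)-(13) are not reproduced here, the sketch below uses the standard definitions, and the TP/TN/FP/FN counts are illustrative, not taken from the paper's tables.

```python
def metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics: accuracy, sensitivity, specificity, F1."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)                   # sensitivity (recall)
    sp = tn / (tn + fp)                   # specificity
    f1 = 2 * tp / (2 * tp + fp + fn)      # F-measure
    return acc, sn, sp, f1

# Toy counts for a 210-instance test split (illustrative values)
acc, sn, sp, f1 = metrics(tp=144, tn=63, fp=2, fn=1)
print(round(acc, 3), round(sn, 3), round(sp, 3), round(f1, 3))
```

AUC is the one parameter not computable from a single confusion matrix; it requires the classifier's scores across all thresholds, as visualized by the ROC curves.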

Result and Discussion
The proposed work is implemented in the MATLAB 2016a environment on a PC with an Intel Core i3 1.99 GHz processor and 4 GB RAM. Following preprocessing, training of the data set is done in two ways: a) through the training parameter (ε) and testing parameter (1 - ε); b) through 10-fold cross-validation.
Two values of the training parameter (ε1 = 0.8 and ε2 = 0.7) are set [28] and, for each ε, the values of True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) are obtained from the confusion matrix. For ε = 0.8, the number of test instances is 20 percent of the total number of toddler instances, i.e. 210, whereas for ε = 0.7 it is 30 percent, i.e. 316. The performance parameters are then evaluated.

Table.5 shows the statistics of the performance parameters for the training/testing-parameter approach. From Table.5 it is observed that for ε = 0.8, DT outperformed the other classification models for classifying the toddler ASD class. The maximum accuracy of DT is followed by SVM with an accuracy of 0.997 for ε = 0.7. Further, SVM yielded the maximum sensitivity for both ε = 0.8 and ε = 0.7, as the FN rate came out to be zero, meaning not a single instance was wrongly classified as non-ASD. DA produced the maximum specificity for both ε = 0.8 and ε = 0.7 because of a zero FP rate, i.e. no instance was wrongly classified as ASD, and RF likewise gave maximum specificity for ε = 0.8. Following DT, SVM achieved F1-scores of 0.995 and 0.993 for ε = 0.7 and ε = 0.8 respectively, and the AUC of SVM is next to that of DT. Overall, it is concluded that DT, owing to its maximum performance, is preferable for classification of the standardized toddler ASD data.

Table.6 presents the results from 10-fold cross-validation, where the final result is aggregated from the results of each of the ten folds. From Table.6 it can be concluded that SVM performed effectively for toddler ASD classification with a performance of more than 99 percent, while DA also yielded a specificity of 99.86 percent.
By comparison, the performance of the DA classifier is the lowest among all classifiers, but still acceptable; this is due to its higher FN rate, with 26 instances wrongly classified as non-ASD. Similarly, the specificity of KNN is the lowest among all classifiers due to its higher FP rate, with 12 instances wrongly classified as ASD. Table.7 presents the results of various state-of-the-art methods on the adolescent, adult and child ASD datasets, compared with the results obtained in this paper for the toddler ASD dataset.
The performances of the ML algorithms can also be visualized through the Receiver Operating Characteristic (ROC) curves in Fig.2.

Conclusion
The proposed work emphasizes early ASD detection in toddlers. In this work, PCA reduces the dimension of the data by discarding attributes that contribute minimal benefit, followed by the use of ML classifier models to detect ASD in the toddler data set. The evaluation parameters yielded clinically acceptable results with the ML classifiers: the acceptance level of standard performance is 80 percent, and in this investigation all the evaluation parameters produced more than 90 percent performance. Generally, in most research analyses, the toddler data set is dropped because of its unbalanced nature, which makes the detection of the disorder difficult in that category. In this study, the toddler data set is analyzed, making it possible to detect ASD in toddlers. The study will be further enhanced in future by extending the investigation to the remaining categories of individuals in addition to the toddler one.