Measuring the Performance of Computer Vision Systems in Malaria Studies

Digital image processing-computer vision (DIP-CV) systems are used to automate malaria diagnosis through microscopy analysis of thin blood smears. Some variability is observed in the experimental designs used to evaluate the statistical measures of performance (SMP) of such systems. The objective of this work is to assess good practices when using SMP to evaluate DIP-CV systems for malaria diagnosis. A mathematical model was built to characterize diagnosis using DIP-CV systems and was used to obtain curve families showing the relationships among various SMP of these systems, both from theoretical equations and by computer simulation. The curve families show (a) the relationships among the minimum number of positive erythrocytes (RBCs) to be observed, the per object (RBC) sensitivity and the probability of detecting at least one positive, (b) per specimen sensitivity vs. total number of RBCs observed for a typical per object sensitivity and a range of parasite densities, and (c) per object positive predictive value vs. per object specificity for a typical per object sensitivity and various parasite densities. When determining the per specimen sensitivity, the parasite density p was shown to have more influence on the number of RBCs that must be analyzed than the per object sensitivity. Measuring p accurately depends heavily upon the per object positive predictive value of the classifier. For low p values, this requires a very high per object specificity and a number of observed RBCs high enough to measure it accurately.
Keywords: Malaria, Plasmodium, Digital Image Processing, Computer Vision, Statistical Measures.


Introduction
Malaria remains one of the largest health problems faced by our planet: in 2018, 228 million cases occurred worldwide, with 405 000 deaths, mostly among children under 5 years old and with a high prevalence in the Sub-Saharan Africa region [1]. An effective way to diagnose malaria and to determine the infection rate is the analysis of thin blood smears under a microscope, during which the infected erythrocytes (red blood cells, RBCs) are detected and can be counted to determine the parasite density p, which in this work designates the proportion of infected RBCs. For human experts, this procedure is a time-consuming task, prone to intra- and inter-analyst errors due to tiredness, subjectivity and lack of experience. As a consequence, the application of digital image processing-computer vision (DIP-CV) systems to the analysis of thin blood smears in malaria studies is a current research topic that has produced numerous scientific publications in the last few years. However, the practical meaning of the statistical measures of performance (SMP) of these systems, and the influence on them of the errors from the various sources usually present, has not received enough attention and deserves a more detailed analysis, which is the object of this work. At this point, we emphasize that the performance of binary detection of the parasites is of paramount importance in any system, both because of its intrinsic importance when calculating the parasite density and because of its role as a usual first stage in systems that also classify the Plasmodium species and their life stages. Analyzing a representative set of published papers reveals a lack of uniformity in how the effectiveness of these systems is evaluated. This problem is aggravated by the lack of publicly available annotated databases that could be used for this purpose.
An early work [2] uses covariance, correlation coefficients and distances to compare the results of a DIP-CV system with those from two human analysts, with a total number of RBCs in the order of 180; however, the standard SMP were not used there. Linear regression was used in Ref. [3] to determine the correlation between the outcomes of a proposed system and human analysts on sets having up to 16000 RBCs, reporting good results in terms of correlation with some dependence upon the experimental conditions, also without expressing the results in terms of SMP. Most references, however, use standard SMP such as sensitivity (Se), specificity (Sp) and positive predictive value (PPV), for which a relatively wide range of values has been obtained. An example using a neural network classifier in Ref. [4] reports = 0.8513 and = 0.8084 in a set of RBCs that comprised 481 true positives. This work recognizes the influence of p on the evaluation of Se, as well as that of the false positives (FP) on the accuracy (ACC) when evaluating p, and used an elementary simulation to assess the system's performance. A DIP-CV system that uses a neural network classifier in Ref. [5] obtained average = 0.76 and = 0.82 with a limited number of RBCs. Another work [6] reports values of Se = 0.724, Sp = 0.976, PPV = 0.858 and ACC = 0.933 for experiments using 4100 RBCs (669 infected) to compare three classifier systems for binary detection. This work also addresses the classification of species and stages of the parasite and includes a thorough analysis of the importance and the effects of the parasite density value p on the estimations of the SMP and of p itself. However, as will be shown later, this analysis deserves an extension to obtain new insights on these relationships. Ref. [7] worked with 1000 RBCs showing a maximum = 0.98 and presumably = 0.98 (the definition given there for precision is unclear), without considering the influence of p. Ref. [8] reports = 1 and = 0.5 to 0.88 without providing enough data on the experimental conditions of this evaluation.
Experiments with 888 RBCs using 148 in each class (normal and infected) for training five different classifiers reported in [9] shows as the best results obtained (using neural networks) = 1, = 0.9874, = 0.9973 and ACC = 0.9673. This paper also tabulates for comparison the SMP of a number of works where the problem that gives rise to our work can be appreciated. Ref. [10] made experiments with 12,577 erythrocytes, out of which 713 were infected (p=0.0568), with the final purpose of classifying the species of the parasite and comparing three classifiers, obtaining values of = 0.94 and = 0.997, and calculating the F-measure which reflects the balance between Se and PPV. Research using a K-means classifier and 118 RBC images reported in [11] obtained = 0.93 and = 0.95. Ref. [12] reports a Support Vector Machine (SVM) classifier using samples from patients with p estimated in the range 0.07-0. 16 [14] which used an SVM classifier obtained = 0.9894, = 0.9612 and ACC = 0.9866. However, information on the number of RBCs tested or p value is not provided. The problem of determining the parasite density was addressed in Ref. [15] and in this case the authors provide only the accuracy obtained, ACC = 0.9646, as well as regression curves and tables of parasite counts per slide showing the degree of agreement with human-made counts.
An approach using an image analysis system [16] was developed based on a 16-layer convolutional neural network (CNN) model. The binary classification performance of this system was evaluated by means of ten-fold cross-validation using 27578 single cell images with a 1:1 ratio of infected cells to uninfected cells. The results reported in this work were Se = 0.9699, Sp = 0.9775, PPV = 0.9773 and F1 score = 0.9736. A set of 138 quantitative features related to color, morphology and texture from segmented erythrocytes is proposed to detect malaria in Ref. [17] and tested using four feature selection methods and three classifiers, among which the best results were obtained using Correlation-based Feature Selection (CFS) together with a C4.5 classifier. These were = 0.992, = 0.996. This work also used a balanced erythrocyte database with 500 normal and 500 infected (divided in five groups of 100 infected with two species: 200 Plasmodium falciparum and 300 Plasmodium vivax) in two and three sets of 100 erythrocytes respectively at different stages (rings, gametocytes and schizonts). This work presents a comparison with other image processing algorithms for malaria detection in a table in which the variability in the use of different SMP and datasets can be appreciated.
A comprehensive review of malaria parasite detection systems based on morphological image processing is presented in Ref. [18]. In this article the authors summarized the results of 27 different approaches, which with regard to erythrocyte classification were expressed in terms of the standard SMP. However, it should be noticed that in the various works analyzed, these measures of performance were obtained under different experimental conditions in terms of the number of erythrocytes involved, the level of class balance and the variability of image acquisition conditions, which limits the value of the corresponding SMP for a fair comparison of their effectiveness.
SVMs with different kernels were employed as classifiers in Ref. [19], using as features the histogram statistics as well as the gray level co-occurrence matrix (GLCM) and the gray-level run-length matrix (GLRLM) as texture features. The SVM with cubic kernel obtained the best results, reporting Se = 0.951, Sp = 1, ACC = 0.974, PPV = 1 and NPV = 0.949. The dataset employed consisted of 975 isolated erythrocytes segmented out of 46 digital microscopy images of thin blood smears.
A custom convolutional neural network (CNN) and pre-trained models: VGG-19, Squeeze-net, Inception ResNet-V2 and All-ensemble, were implemented and evaluated in Ref. [20] to classify RBCs into normal and parasitized in thin blood smears. The models in this case were evaluated in terms of SMP using accuracy, area under ROC curve (AUC), MS error, positive predictive value, F score and Matthews correlation coefficient. A dataset composed of 27558 RBCs was evenly divided into normal and infected, obtaining as best results: ACC = 0.9932, = 0.9931, = 0.9, = 0.9956 and F = 0.9908. Table 1 summarizes in chronological order the results reported in the previous representative sample of articles reporting the results of DIP-CV systems in the analysis of microscopy images of thin blood smears during malaria studies. In some cases, the SMP have been extended by calculating, from the data provided, measures not explicitly given in the corresponding article.
A review article [21] analyzes a large number of papers (173 references) describing techniques involved in machine learning and image analysis for the detection of malaria. Here the authors recognize explicitly the difficulty that gives rise to our article: they consider as a very difficult task to compare the performance of the published systems, due to the variability of blood slides coming from different origins as well as of the methods to prepare slides and to acquire the images. They mention also the fact that the evaluation sets are often too small or too limited as well as the lack of publicly available image benchmark datasets. Having all these in mind the authors considered that there was not a reliable basis to include in their article any formal comparative evaluation based in SMP. In this short review, the papers analyzed were chosen both to illustrate some of the most recent work and previous works of noticeable relevance as well. The values of the SMP reported in these articles reflect that there is also variability in the number of decimal places with which these values have been given.
It is worth mentioning that the laboratory assays made to develop anti-malarial drugs, using the rodent species Plasmodium berghei, usually generate large numbers of images in which determining the parasite density accurately is of crucial importance. This makes them an important target for the application of DIP-CV techniques.
The purpose of this work is to emphasize the importance of making an appropriate evaluation of the DIP-CV algorithms used in malaria studies, to provide further insights about the practical meaning of the SMP when making such evaluations and to provide some clues to set the limits of validity of the measured parameters. The approach followed was to develop a mathematical model that allowed extensive simulations of the RBCs classification process. This model encompasses all the sources of error in a typical DIP-CV system, making abstraction of their physical nature. Then representative values of the SMP were obtained and an analysis was made in order to clarify their actual meaning when assessing the limitations of a given system. At this point, some distinctive properties of DIP-CV systems in the analysis of malaria are to be mentioned: When the microscopy slides are analyzed by human experts, it is assumed that the per-specimen sensitivity depends only on the parasite density, and the per-object (per individual RBCs) sensitivity and specificity are deemed as 100 percent. On the contrary, a DIP-CV system has inherently various sources of false positives and false negatives: noise and artifacts due to imperfections in preparing the slides and in the image acquisition processes, limitations of image processing algorithms like filtering and segmentation, limited discriminating power of the features used and the effectiveness of the classifier algorithms. Studying the effects of these errors is one of the main purposes of this work.
In the visual analysis by human experts, it is very laborious to make an exact count of the RBCs. This leads to the use of approximate methods such as counting only the infected RBCs and the leukocytes (WBCs, white blood cells), estimating a proportion between the total number of infected RBCs (np) and of observed WBCs (nw), and calculating the parasite density p in infected RBCs/μl from these data using
$$p = \frac{n_p}{n_w}\, d_w \tag{1}$$
where the WBC density is considered approximately equal to dw = 6000 WBC/μl of blood. A DIP-CV system usually does not have this limitation: it can perform a complete and fairly exact count of the RBCs within its limits of precision, which allows using just the number of RBCs necessary for a good evaluation of p as the fraction of infected RBCs. This makes it important to know the minimum number of RBCs that must actually be analyzed to evaluate fairly an expected range of p values.
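The WBC-based estimate described above can be sketched in a few lines. This is a minimal illustration, assuming the usual proportion p = (np / nw) · dw with dw = 6000 WBC/μl; the function names are hypothetical, and the conversion to a fraction of infected RBCs assumes roughly 5×10^6 RBCs per μl of blood.

```python
def parasite_density_per_ul(n_p, n_w, d_w=6000.0):
    """Infected RBCs per microlitre from counts of infected RBCs (n_p) and WBCs (n_w)."""
    if n_w <= 0:
        raise ValueError("n_w must be positive")
    return n_p / n_w * d_w


def infected_fraction(density_per_ul, rbc_per_ul=5e6):
    """Convert a density in infected RBCs/ul into a fraction of infected RBCs.

    rbc_per_ul is an assumed typical RBC concentration, not a measured value.
    """
    return density_per_ul / rbc_per_ul


# Example: 50 infected RBCs counted while observing 200 WBCs
density = parasite_density_per_ul(50, 200)
fraction = infected_fraction(density)
print(density, fraction)
```

The second function only rescales the result; it is included to relate the density per μl to the fraction-based p that a DIP-CV system measures directly.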
In laboratory experiments with rodents, the per-specimen SMP are very important because they indicate the margins within which an infected specimen would have a satisfactorily low probability of being considered free of the parasite and, conversely, a non-infected specimen would not be deemed infected. This is of course also important in diagnostic tasks with humans.
In this work, the following issues were addressed: (a) determining the minimum number of positive objects (infected RBCs) to be analyzed for evaluating p, (b) the relationship between per-object and per-specimen sensitivity and (c) the influence of false positives on the precision with which p can be measured.
The core of the analysis presented here is the use of the Bernoulli process as a model to describe the behavior of one RBC classification in a DIP-CV system used for malaria studies, from which it follows that analyzing a set of M RBCs obeys the binomial probability distribution. This can be justified by looking at the properties of such processes in Ref. [22], which are summarized below: 1. The experiment consists of repeated trials, which are the individual analyses of independent, previously segmented erythrocytes. 2. Each trial results in a binary outcome: an RBC is infected (positive) or non-infected (negative). 3. The RBCs are analyzed at random, which justifies the assumption that the probability p of finding a positive is constant for any observed RBC. 4. The repeated trials are fully independent processes: a given result has no influence on the rest.
The remainder of this article is organized as follows. In Section "Materials and methods" the models and experimental design used in the simulation experiments are explained in detail. Section "Results" shows the results obtained from the equations describing the theoretical model and those resulting from the simulation experiments. Section "Discussion" then interprets the curve families obtained and analyzes the influence of this information upon the evaluation of the SMP in DIP-CV systems for the analysis of malaria. Finally, Section "Conclusion" summarizes the main findings together with their expected influence upon the design of experiments to test the DIP-CV systems that have been the object of this study.

Materials and Methods
The problem addressed here is that of determining the relationship between the sensitivity per object (i.e. the proportion of infected RBCs that the system is capable of detecting) and the sensitivity per specimen (i.e. the proportion of positive specimens that the system is capable of detecting when analyzing a large number of individual RBCs from their blood samples). This issue has been treated before in some detail in Ref. [6], whereas a different approach is introduced in the present work.

Basic definitions
Starting from the definition of sensitivity
$$Se = \frac{TP}{TP + FN} \tag{2}$$
where TP stands for the number of true positives and FN for the number of false negatives, we will call per object sensitivity, Seo, that associated with the detection of individual infected RBCs (objects in what follows), and per specimen sensitivity, Ses, that associated with the detection of an infected specimen after having analyzed the images of its blood smears. In an analogous way, the per object (Spo) and per specimen (Sps) specificity can be defined.
The following definitions will be used in this work: Seo = per object sensitivity. Spo = per object specificity. It is worth mentioning that one μl of blood contains approximately 5×10^6 RBCs, as stated in Ref. [23].

Mathematical model
In order to perform the simulation study proposed here, the process of classifying a set of M objects (RBCs) as infected (positive) or not (negative), was modeled using the binomial probability function.

Relationship among Seo, Po and N: The probability Po of detecting at least one positive object for a given Seo when analyzing N positive objects is given by
$$P_o = 1 - (1 - Se_o)^N \tag{3}$$
From this expression, the number of positive objects N that should be analyzed to detect at least one, with probability at least equal to Po, is
$$N \ge \frac{\ln(1 - P_o)}{\ln(1 - Se_o)} \tag{4}$$
This expression will be used to calculate N as a function of Seo. Notice that this result leads to the need to find the number M of RBCs to be examined to find at least N positives, which would allow detecting a positive specimen with a desired probability.
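As a worked illustration of the minimum number of positive objects, the sketch below assumes that the probability of detecting at least one of N positives is 1 − (1 − Seo)^N and solves for the smallest integer N that reaches a target probability. The function name `min_positives` is hypothetical.

```python
import math


def min_positives(se_o, p_o):
    """Smallest integer N such that 1 - (1 - se_o)**N >= p_o."""
    if not (0.0 < se_o <= 1.0) or not (0.0 <= p_o < 1.0):
        raise ValueError("need 0 < se_o <= 1 and 0 <= p_o < 1")
    if se_o == 1.0:
        return 1  # a perfect detector needs a single positive object
    n = math.ceil(math.log(1.0 - p_o) / math.log(1.0 - se_o))
    return max(n, 1)


# Lower per-object sensitivity requires more positive RBCs to be observed
for se_o in (0.95, 0.7, 0.5):
    print(se_o, min_positives(se_o, 0.99))
```

For target probabilities that fall exactly on an integer boundary of the ratio of logarithms, floating-point rounding may shift the result by one, so such cases are best checked directly against the defining inequality.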

Relationships among Ses, Seo, p and M: The per-specimen sensitivity is affected both by the presence of positive objects in the sample analyzed by the DIP-CV system, which depends upon the parasite density, and by the per-object sensitivity of the system. The presence of i positive objects in a sample of M objects (RBCs), when p is the probability that a given object is positive (parasite density), is characterized by the binomial probability distribution [22]
$$P(i) = \binom{M}{i} p^i (1 - p)^{M - i} \tag{5}$$
where
$$\binom{M}{i} = \frac{M!}{i!\,(M - i)!} \tag{6}$$
The probability of classifying a positive specimen as such when analyzing M RBCs from this specimen can be calculated by adding, for all i (1 ≤ i ≤ M), the probability of finding i positives in M, weighted by the probability that at least one of them is detected for a given Seo, which is
$$Se_s = \sum_{i=1}^{M} \binom{M}{i} p^i (1 - p)^{M - i} \left[1 - (1 - Se_o)^i\right] \tag{7}$$
Equation (7) allows determining iteratively the number of objects M that should be analyzed to obtain a desired value of Ses given Seo and an estimated p. However, usually M ≫ 1, which is necessary for a precise calculation of p, whose values are usually low. Therefore, evaluating equation (7) requires calculating large factorials whose direct computation might not even be feasible. The well-known approximation of the binomial probability function by the Poisson distribution [22] for p ≪ 1 can then be used to obtain
$$Se_s \approx \sum_{i=1}^{M} \frac{(Mp)^i e^{-Mp}}{i!} \left[1 - (1 - Se_o)^i\right] \tag{8}$$
This expression was used to calculate curve families that reveal useful information about the interdependence between Ses, Seo, M and p. When the condition p ≪ 1 does not hold (i.e. p ≥ 0.01) the normal approximation of the binomial distribution was used instead. This is given by
$$Se_s \approx \sum_{i=1}^{M} \mathcal{N}(i \pm 0.5;\ \mu, \sigma) \left[1 - (1 - Se_o)^i\right] \tag{9}$$
where $\mathcal{N}$ is the normal (Gaussian) probability calculated in the interval (i ± 0.5) with parameters µ = Mp and σ = √(Mp(1 − p)).
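The following sketch shows one way to evaluate the per-specimen sensitivity numerically. It is not the authors' implementation. It assumes the weighted binomial sum described above (probability of i positives in M, times the probability 1 − (1 − Seo)^i of detecting at least one) and evaluates it with log-gamma terms so that no large factorial is formed. The binomial theorem also collapses that sum to 1 − (1 − p·Seo)^M, and the corresponding Poisson sum to 1 − exp(−M·p·Seo); these closed forms are used here only as a cross-check. All function names are hypothetical.

```python
import math


def se_s_binomial(M, p, se_o):
    """Weighted binomial sum over i = 1..M, computed in log space (0 < p < 1)."""
    total = 0.0
    for i in range(1, M + 1):
        log_pmf = (math.lgamma(M + 1) - math.lgamma(i + 1) - math.lgamma(M - i + 1)
                   + i * math.log(p) + (M - i) * math.log1p(-p))
        total += math.exp(log_pmf) * (1.0 - (1.0 - se_o) ** i)
    return total


def se_s_closed_form(M, p, se_o):
    """Same quantity via the binomial theorem: each RBC is a TP with probability p*se_o."""
    return 1.0 - (1.0 - p * se_o) ** M


def se_s_poisson(M, p, se_o):
    """Poisson approximation for small p, summed in closed form."""
    return 1.0 - math.exp(-M * p * se_o)


def se_s_normal(M, p, se_o):
    """Normal approximation with continuity correction, mu = M*p, sigma = sqrt(M*p*(1-p))."""
    mu, sigma = M * p, math.sqrt(M * p * (1.0 - p))
    cdf = lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return sum((cdf(i + 0.5) - cdf(i - 0.5)) * (1.0 - (1.0 - se_o) ** i)
               for i in range(1, M + 1))


print(se_s_binomial(2000, 0.001, 0.95), se_s_closed_form(2000, 0.001, 0.95))
print(se_s_poisson(2000, 0.001, 0.95), se_s_normal(1000, 0.02, 0.95))
```

The closed form is convenient for drawing curve families over wide ranges of M, while the explicit sums mirror the structure of the model more directly.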
Model testing by computer simulation: Curve families were obtained by means of equations (8) and (9) and were also calculated by computer simulation. Simulating the detection of positive RBCs consisted here in generating a first random binary sequence associated with the M analyzed RBCs, with P(1) = p, where a 1 counts as a positive object (RBC). Then a second binary random sequence with P(1) = Seo is generated to simulate the appearance of FN. The purpose of the second sequence is to change the state of a positive outcome in the first sequence, according to the considered value of Seo, whenever this element corresponds to a false negative. The value of Ses was determined by repeating the process just described and considering that the outcome was a positive specimen whenever at least one TP was found in a run of the simulation experiment. The number n of times that this experiment had to be repeated for each value of M to evaluate Ses was determined [22] by the expression
$$n = \frac{z_{\alpha/2}^2}{4e^2} \tag{10}$$
where z is the normalized normal variable, α the significance level (5%) and e (0.02) the maximum absolute error that was set when estimating the proportion of detected positive specimens. This led to the value n ≅ 2400.
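The simulation procedure described above can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes that the first random draw marks an RBC as infected with probability p, that the second draw keeps an infected RBC as a detected TP with probability Seo, and that the number of runs follows the worst-case proportion formula n = z² / (4e²) with z = 1.96 and e = 0.02. Function names and the fixed seed are illustrative choices.

```python
import math
import random


def runs_needed(z=1.96, e=0.02):
    """Worst-case number of runs to estimate a proportion within +/- e."""
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(z * z / (4.0 * e * e), 6))


def simulate_se_s(M, p, se_o, n_runs, seed=0):
    """Fraction of simulated positive specimens in which at least one TP is found."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_runs):
        for _ in range(M):
            infected = rng.random() < p             # first sequence: P(1) = p
            if infected and rng.random() < se_o:    # second sequence: P(1) = Seo
                detected += 1                       # one TP makes the specimen positive
                break
    return detected / n_runs


n = runs_needed()
estimate = simulate_se_s(M=500, p=0.005, se_o=0.95, n_runs=n)
exact = 1.0 - (1.0 - 0.005 * 0.95) ** 500  # each RBC is a TP with probability p*Seo
print(n, estimate, exact)
```

Stopping a run at the first TP does not change the estimate, since only the presence of at least one TP matters for the per-specimen outcome.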
 Assessing the effect of false positives: The rate of false positive detections is a very important factor to be considered in this analysis, given that especially for low values of p the occurrence of false positives can severely distort the results. Here the positive predictive value PPVo was used as the statistical measure of performance which characterizes the effect of false positives.
A first approach to analyze this problem is to study the dependence of PPVo on the specificity and the parasite density. Consider a sample of M RBCs and a parasite density p, and assume that on average the number of positive objects in the sample is pM; then
$$TP + FN = pM, \qquad FP + TN = (1 - p)M \tag{11}$$
Now from equations (11) and the definitions of Seo and Spo, a system of four equations can be formed, from which the dependence of PPVo upon Seo, Spo and p can be found as
$$PPV_o = \frac{p\,Se_o}{p\,Se_o + (1 - p)(1 - Sp_o)} \tag{12}$$
This equation was used to obtain curves that illustrate how the proportion of true positives depends on the specificity for a given parasite density and sensitivity.
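To make the dependence of the per object positive predictive value on specificity and parasite density concrete, the sketch below assumes the standard relation PPVo = p·Seo / (p·Seo + (1 − p)(1 − Spo)), which follows from the expected counts TP = p·M·Seo and FP = (1 − p)·M·(1 − Spo). The function name and the example values are illustrative.

```python
def ppv_o(p, se_o, sp_o):
    """Per-object positive predictive value from prevalence p, sensitivity and specificity."""
    tp = p * se_o                    # expected true positives per analyzed RBC
    fp = (1.0 - p) * (1.0 - sp_o)    # expected false positives per analyzed RBC
    if tp + fp == 0.0:
        raise ValueError("no positive calls expected; PPV is undefined")
    return tp / (tp + fp)


# At 1 % parasite density, small changes in specificity move PPV substantially
for sp_o in (0.99, 0.999, 0.9999):
    print(sp_o, round(ppv_o(0.01, 0.95, sp_o), 4))
```

With p = 0.01 and Seo = 0.95, a specificity of 0.99 gives a PPVo below 0.5, which illustrates why low parasite densities demand very high per object specificity.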
In order to validate the analysis made, these curves were also obtained by computer simulation. In addition to the random sequences used previously to compare with the results of equations (8) and (9), a third binary random sequence with P(1) = Spo was used to simulate the occurrence of false positives by changing the state of the negatives in the first binary sequence. Consider that when measuring the parasite density an apparent value of p, denoted here pa, will be obtained, affected both by the presence of false positives and false negatives. It is easy to prove the relationship
$$p_a = \frac{TP + FP}{M} = p\,\frac{Se_o}{PPV_o} \tag{13}$$
Some implications of this relationship will be discussed later.
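A sketch of this extended simulation is given below, under stated assumptions: each RBC is infected with probability p, an infected RBC is detected with probability Seo, and a non-infected RBC is correctly rejected with probability Spo. The apparent parasite density is taken as the fraction of RBCs called positive, (TP + FP) / M, which by simple counting equals the true sample fraction plus (FP − FN) / M. The function name and parameter values are illustrative.

```python
import random


def simulate_counts(M, p, se_o, sp_o, seed=0):
    """Return (TP, FN, FP, TN) for M simulated RBC classifications."""
    rng = random.Random(seed)
    tp = fn = fp = tn = 0
    for _ in range(M):
        if rng.random() < p:             # infected RBC
            if rng.random() < se_o:
                tp += 1
            else:
                fn += 1                  # missed parasite
        else:                            # non-infected RBC
            if rng.random() < sp_o:
                tn += 1
            else:
                fp += 1                  # spurious detection
    return tp, fn, fp, tn


M = 200_000
tp, fn, fp, tn = simulate_counts(M, p=0.01, se_o=0.95, sp_o=0.99)
p_sample = (tp + fn) / M                 # true infected fraction in this sample
p_apparent = (tp + fp) / M               # fraction the classifier reports as infected
print(p_sample, p_apparent, (fp - fn) / M)
```

With these example values the expected apparent density is p·Seo + (1 − p)(1 − Spo) = 0.0194, almost twice the true value of 0.01, because false positives outnumber false negatives.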

Results
Relationship among Seo, Po and N: The relationship obtained through equation (4) is shown in Fig. 1, where it is seen that a lower Seo implies the need to observe more positive RBCs to detect at least one with a high value of Po. The actual values obtained for Po for a set of values of Seo and their corresponding minimum value of N (Nmin) are also shown. The interdependence among Ses, Seo, M and p for low and high values of p respectively (the borderline was considered around pperc = 0.5 %) was obtained using equations (8) and (9). Figures 2 and 3 show respectively the correspondence between the results obtained using equations (8) and (9) and those obtained by computer simulation. In Figure 2a, Ses was calculated for Seo = 0.95 (a reasonable practical value according to those reported in the literature) and a wide range of M, for low values of p as a parameter expressed in percentage as pperc, using the Poisson approximation of equation (8); the same conditions are set in Fig. 2b, in this case using computer simulation.
An analogous procedure was followed to obtain the curve families shown in Figures 3a and 3b, in this case employing the Gaussian approximation of equation (9). Calculations of PPVo in terms of Seo, Spo and p were made using equation (12) and plotted in Fig. 5a. Figure 5b shows the same curves obtained by means of computer simulation. Notice the correspondence between the curves obtained using the analysis that led to equation (12) and the computer simulation results.

Discussion
Relationship among Seo, Po and N: Figure 1 portrays the minimum number of positive objects (infected RBCs) that have to be observed, for a given per object sensitivity, to have a very high certainty (Po) of detecting at least one of them. This graph illustrates the effect of an insufficient per object sensitivity on the per specimen sensitivity. For a typical per object sensitivity (see Table 1) and an expected infection rate expressed in percentage pperc, the required number M of observed erythrocytes to obtain a very high per specimen sensitivity Ses tends to be very high for relatively low values of pperc. For higher values of pperc, Figures 3 (a and b), obtained using the Gaussian model, show that a rather high value of Ses can be obtained for much lower values of M.
It is worth mentioning that the mean square error between the Ses vs. M curve families obtained by the theoretical approximation and by computer simulation was calculated, and very low values (order of magnitude 10^-5) were obtained for the range (0.6 ≤ Ses ≤ 1). Figures 4a and 4b are intended to illustrate the same relationships as Figures 2 and 3, but in this case for very high values of Ses (e.g. above 0.99), which could be desired in a practical situation. These figures differ in their corresponding ranges of parasite density p; notice that obtaining high enough values of Ses for low values of pperc, such as those used in Figure 4b, might require analyzing much higher numbers of RBCs, reaching the range above ten thousand. As will be seen later, this effect is worsened by the fact that with low p the negative effect of the presence of false positives increases significantly.
Curves of Ses vs. M obtained for a low p with Seo as the parameter, not included here for reasons of space, showed a low relative dependence of Ses on Seo when high values of Ses are sought.
The obtained relationships illustrate how the existing parasite density tends to be a very significant parameter in the process of detecting infected RBCs in regard of the value of M needed; in this case the number of RBCs to be observed in order to find the required minimum number of positives to ensure at least one detection tends to be very high, particularly for low values of p.
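The dominance of the parasite density over the per object sensitivity can be illustrated numerically. The sketch below is not taken from the original analysis: it assumes that each analyzed RBC is independently a true positive with probability p·Seo, so that the per specimen sensitivity after M RBCs is 1 − (1 − p·Seo)^M, and it inverts that expression for M. The function name is hypothetical.

```python
import math


def rbcs_needed(p, se_o, target_se_s):
    """Smallest M with 1 - (1 - p*se_o)**M >= target_se_s."""
    if not (0.0 < p * se_o < 1.0) or not (0.0 < target_se_s < 1.0):
        raise ValueError("need 0 < p*se_o < 1 and 0 < target_se_s < 1")
    return math.ceil(math.log1p(-target_se_s) / math.log1p(-p * se_o))


# Changing p by a factor of ten changes M by about the same factor,
# whereas changing Seo from 0.95 to 0.80 changes M by less than 20 %.
for p in (0.0001, 0.001, 0.01):
    print(p, rbcs_needed(p, 0.95, 0.99), rbcs_needed(p, 0.80, 0.99))
```

Under this assumption, M scales roughly as 1 / (p·Seo) for a fixed target, which is consistent with the observation that low parasite densities drive the required number of RBCs far more than realistic variations in Seo.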
Assessing the effect of false positives: Figure 5 is related to consequences derived from equation (13), which suggests that measuring p with perfect accuracy would be theoretically possible in a detection system that produces equal numbers of FN and FP; in other words, PPVo should ideally be equal to Seo. This condition cannot of course be attained in practice, because the mechanisms that give rise to FP and FN are distinct and produce them at random.
Failures in the correct classification of RBCs in a DIP-CV system depend upon the specific attributes of the observed cell; in fact, it is impossible to ensure that FP = FN. Therefore, the objective should be to obtain both Seo and PPVo as close to unity as possible. Figure 5 reveals that PPVo drops very fast when Spo decreases, even within a range of high values of the latter, when p has a moderately low value. This means that obtaining a relatively high PPVo when p tends to be low would require very high values of Spo, an objective that cannot be overlooked in applications like drug development, where low infection rates are monitored while following the response of laboratory animals to drugs under testing. Very high specificity, however, is not clearly sought in many works.
It is also important to recognize that, to evaluate such high values of Spo with acceptable accuracy, M must also be high enough to keep estimation errors in the order of the third decimal place or lower. Evaluating and , on the other hand, is a matter of evaluating proportions, and equation (10) should be applied to determine the minimum M needed to evaluate them with controlled error for an expected range of p.
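As a rough guide to how many RBCs such an evaluation requires, the sketch below uses the standard normal-approximation sample size for a proportion, n = z²·q(1 − q) / e². Setting q = 0.5 gives the worst case, n = z² / (4e²), while a value of q near the expected proportion gives a less conservative figure. Using the q-specific variant here is an assumption of this example, and the function name is hypothetical. For specificity, n counts non-infected RBCs, since only those contribute to its estimate.

```python
import math


def cells_for_proportion(q, e, z=1.96):
    """Cells needed to estimate a proportion near q within +/- e at ~95 % confidence."""
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(z * z * q * (1.0 - q) / (e * e), 6))


worst_case = cells_for_proportion(0.5, 0.001)   # no prior knowledge of the proportion
near_099 = cells_for_proportion(0.99, 0.001)    # specificity expected around 0.99
print(worst_case, near_099)
```

Even the less conservative figure runs into tens of thousands of non-infected RBCs for an error at the third decimal place, which is far more than the evaluation sets used in many of the works reviewed.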

Conclusion
This work has been motivated by the need to establish a common framework to perform the evaluation of DIP-CV systems employed in the analysis of microscopy images during malaria studies. The approach followed in this work could be valuable to address similar analysis in other DIP-CV current applications related to microscopy images, examples of which can be found in [24] and [25]. In the work developed here a mathematical model of the analysis of thin blood smears employing DIP-CV in the diagnosis of malaria was introduced and it allowed revealing some limitations that can be observed in many works when determining the SMP of DIP-CV algorithms for this application, especially the sensitivity per specimen and the parasite density of a sample. It was found that when determining the per specimen sensitivity, the parasite density has a much larger influence on the number of RBCs that must be analyzed than the per object sensitivity.
A second factor to take into account is that measuring p with adequate accuracy depends strongly upon the positive predictive value of the algorithm. For low expected values of p, this would require very high values of specificity per object Spo, which consequently demands a high enough value of M in order to measure it accurately.
The fact that the number of RBCs to be analyzed is closely related to the accuracy with which the SMP can be calculated should be taken into account when evaluating a DIP-CV system. In this case, a minimum M is to be defined for (a) measuring Ses for an expected p, (b) measuring the parasite density p with a desired accuracy, and (c) obtaining PPVo for some Spo and an expected p, the latter being the dominating factor. As a final remark, special attention should be paid to providing a high enough Spo in conditions of low parasite density in order to obtain a reasonably good PPVo.