Building Face Detection with Face Divine Proportions

In this paper, we propose an algorithm for detecting multiple human faces in an image, based on Haar-like features that represent the invariant characteristics of a face. The most relevant and representative features are chosen according to the divine proportions of a face. This technique, widely used in the world of beauty and especially in aesthetic medicine, divides the face into a set of specific regions according to known mathematical measures. We then use the AdaBoost algorithm for the learning phase. Our work builds on the Viola and Jones algorithm, in particular on their innovative technique called Integral Image, which calculates the value of a Haar-like feature extracted from a face image. In the rest of this article, we show that our approach is promising and can achieve detection rates of up to 99%.


Introduction
Over the past two decades, with the development of information technology, face detection has become a key technology in computer vision through its many applications, such as face recognition, expression recognition, digital video processing, video detection, video surveillance, human-computer interfaces, video retrieval, and face image database management. The human face is a dynamic object with a high degree of variability in its appearance, which makes face detection a difficult problem in computer vision. The mission of a face detection system is to find and locate each human face in an image. However, face detection from a single image is challenging because of variability in scale, location, orientation (upright, rotated), and pose (frontal, profile). Facial expression, occlusion, and lighting conditions also change the overall appearance of faces.
To meet all these requirements, an object detection system must involve two main steps: (1) extracting features (face representation), and (2) training a classifier on the extracted features to detect and locate faces. Building an object detection system therefore requires a robust feature extraction scheme. A good feature set allows the detector to cleanly discriminate the human face even against cluttered backgrounds. Hence, how to extract robust and discriminating features that can differentiate a face from a non-face remains a central and difficult problem. A variety of facial feature representations have been proposed in recent years, namely Haar-like features [1] [20], which are the basis of the Viola and Jones algorithm, the best known of the past two decades, LBP [2], SURF [3], HOG [4] [18], NPD [5], color and skin color [6], Eigenfaces [7], etc.
The selection of robust and discriminating features is a major challenge for any face detection system. For example, the Viola and Jones algorithm chooses its features from a set containing thousands of features (160,000) [1], which has a great influence on the duration of the learning phase. The main idea of our algorithm is to exploit the divine proportions of a face [9] to determine these robust and discriminating Haar-like features manually. These proportions divide the face into specific regions with well-known measures valid for all types of face. Then, we use the AdaBoost learning algorithm to determine the ideal combination of these features. This combination constitutes our detection system.
The rest of the paper is organized as follows. In Section 2, we review the problem of feature selection, emphasizing different search and evaluation strategies, and give an overview of the proposed method. In Section 3, we present a brief explanation of the divine face proportions technique. Section 4 explores our approach in detail, and Section 5 presents the tests and experiments carried out to assess the performance of our algorithm. Finally, we conclude the article.

Related Work
The first phase of a computer vision system is undoubtedly detection, and more particularly face detection. A lot of research has been done on this subject. This research can be grouped into four categories: knowledge-based methods, feature invariant methods, template matching methods, and appearance-based methods [6]. According to several studies, including [6], appearance-based algorithms, which use local feature representation and classifier learning, are the most popular. Among the face detection algorithms that have gained increasing attention due to their remarkable results is that of Viola and Jones [1] [19], which has been integrated into the Intel Open Computer Vision library [8] with five Haar-cascade classifiers. The real-time speed and high detection accuracy of this face detector can be attributed to three factors. The first is the use of rectangular filters called Haar-like features, together with an image representation called Integral Image that makes them simple and fast to compute (see Figure 1). The second is the adaptation of the AdaBoost algorithm to select the best features. The third is the arrangement of the features selected by AdaBoost as a cascade, which the authors called the Attentional Cascade. This algorithm has encouraged several researchers to look for other solutions for an ideal face detector, especially regarding feature extraction [15] [16]. Several other feature types have been proposed, such as Local Binary Pattern (LBP) [2], SURF [3], HOG [4], and NPD [5]. Although some of these features are more discriminating than Haar-like features for face detection, they generally increase the computational cost [5].
http://www.i-joe.org
All these methods extract features without regard to their relevance: the significant features are chosen automatically only during the learning phase, which lengthens training, and the chosen features are not necessarily relevant. This issue is the motivation for our work. The objective is to propose a system that chooses relevant and significant features without going through a learning algorithm, relying instead on facial proportions [9,10,11], which make it easy to divide a face into well-known regions that are invariant whatever the face, and then to represent these regions by the Haar-like features illustrated in Figure 1.

Facial Proportions
Several centuries ago, certain mathematical rules were discovered that were said to govern the mathematical beauty of a person according to the proportions of the face. It would even be possible to assess the symmetry and the level of mathematical perfection of a human face from these so-called divine proportions. Among the first to address this subject was Leonardo da Vinci, in his writings on human anatomy [11].
Aesthetic medicine specialists, namely orthodontists and maxillofacial surgeons, use these proportions for the treatment and analysis of the face and for the planning of surgical rejuvenation.
In the world of aesthetic medicine, the face is divided horizontally into three symmetrical parts (Figure 2 (a)) and vertically into five symmetrical parts (Figure 2 (b)) [9,10,11]. The three parts of the horizontal division are as follows: the upper part extends from the trichion (the point where the hairline meets the middle of the forehead) to the glabella, the middle part lies between the glabella and the subnasal point, and the lower part extends from the subnasal point to the chin. For the vertical division, the width of each eye, the intercanthal distance, and the nasal width each measure one fifth of the face width (Figure 2 (b)). In practice, depending on criteria such as age, sex, and ethnicity, these parts are not always equal.
Golden ratio: The golden ratio, also called Phi, the mysterious number, or the divine proportion, is a ratio observed in nature, art, and architecture. Its approximate value is Phi = 1.618. It is also used to measure the beauty of the human face and is considered a key to determining beautiful, ideal, and harmonious forms.

Fig. 2. The principle of face division: (a) horizontally, into three symmetrical parts; (b) vertically, into five symmetrical parts; (c) the ratio between the height of the face and its width is equal to Phi = 1.618
In this context, beauty specialists apply the golden ratio to several proportions and structures of the face, such as the ratio of the width of the mouth to the width of the nose, the ratio of the height of a tooth to its width, or the ratio between the height of the face and its width, which is equal to 1.618 as shown in Figure 2 (c). For this work, it is this last ratio that interests us.

Building Face Detector with Face Proportions
The construction of our final detector goes through five main phases, as shown in Figure 3. In the first phase, we build the training and test data from our dataset by data splitting. Each image then undergoes preprocessing before moving on to the third phase, which extracts the features determined manually in advance according to the rules of face proportions. Next comes the learning phase with AdaBoost, which yields a detector with a high detection rate. The learning phase is repeated several times and the resulting detectors are stored to form the final detector. In the rest of this paper, each phase is discussed in detail.

Dataset splitting
In supervised learning, a prediction model is built to predict the results of an unknown target function. This target function is represented by a finite set $T$ of pairs $[\vec{x}_i, \vec{y}_i]$, where $\vec{x}_i$ is an input and $\vec{y}_i$ its desired output: $T = \{[\vec{x}_1, \vec{y}_1], \dots, [\vec{x}_n, \vec{y}_n]\}$, where $n > 0$ is the number of ordered input/output pairs [13].
In machine learning, data splitting is a common strategy that takes all the data in the dataset and divides it into two (or even three [13]) subsets for learning and evaluation, usually with a ratio of 70% to 80% for learning and 20% to 30% for evaluation.

Fig. 3. Main steps for building a face detection system using face proportions
Our goal is to create a robust and reliable detector with a high detection rate. For this purpose, our system performs repeated training, and at each iteration it randomly divides the dataset into two sets, namely learning and validation. Each division yields a detector that contributes to the final detector.
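The random division performed at each iteration can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_dataset`, the default 70% ratio, and the `seed` parameter are assumptions made for the example.

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=None):
    """Randomly split samples into (learning, validation) lists."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)      # local generator, reproducible when seeded
    shuffled = list(samples)       # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (image name, label) pairs; +1 = face, -1 = non-face.
pairs = [(f"img_{k}", 1 if k % 2 == 0 else -1) for k in range(10)]
learning, validation = split_dataset(pairs, train_ratio=0.7, seed=0)
```

Calling the function again with a different seed gives a different division of the same data, which is what lets each iteration produce a different detector.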

Preprocessing
The image of a face may have low contrast and uneven lighting caused by the position of the light sources, both of which can affect feature extraction. This is why we enhance the face image and remove noise by filtering it with a Gaussian low-pass filter, which has shown its effectiveness in the preprocessing of faces. Figure 4 shows faces before and after application of the Gaussian filter. After filtering, the characteristics of a face, namely the eyes, nose, and mouth, are very clear and can play an essential role in the classification phase.
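As an illustration of this preprocessing step, a Gaussian low-pass filter can be written as two one-dimensional passes (rows, then columns). The sketch below works on a grayscale image stored as a list of rows; the kernel radius, the value of sigma, and the edge-replication border handling are assumptions, since the filter parameters are not specified in the text.

```python
import math

def gaussian_kernel(radius, sigma):
    """1-D Gaussian kernel of length 2*radius+1, normalised to sum to 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def filter_rows(image, kernel):
    """Filter every row with the kernel, replicating edge pixels at the borders."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        width = len(row)
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)  # clamp to the image
                acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, radius=1, sigma=1.0):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(radius, sigma)
    horizontal = filter_rows(image, kernel)
    return transpose(filter_rows(transpose(horizontal), kernel))

# A single bright pixel is spread over its neighbours.
impulse = [[0] * 5 for _ in range(5)]
impulse[2][2] = 255
blurred = gaussian_blur(impulse)
```

Because the kernel is normalised, uniform regions keep their brightness while isolated noisy pixels are attenuated.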

Features extraction
Principle of dividing the face into regions: Our idea is based on the principle of dividing a face according to face proportions, shown in Figure 2. According to several studies, this division is almost perfect and valid for all face types, with some differences in the measurements of each region [9]. The idea revolves around two axes. The first is the division of the image into small regions (boxes) according to the principle shown in section III. The second is to extract information from these portions, but only from the most critical ones, such as the eye portion, the nose portion, and the mouth portion, using Haar-like features (Figure 5). If we consider an image of a given size, this image is classified as a face if the determining portions correctly contain the characteristics of the face (eyes, nose, and mouth). Instead of looking for these characteristics in the whole face, as Viola and Jones do in their algorithm, our system focuses only on a well-defined area, which reduces the cost and the resources used in the learning phase. Instead of handling thousands of features during the learning phase, our system reduces this number to only 32, because we know exactly where to look for each element (eyes, mouth, nose).
In Figure 5, the first image (a) shows the outline of a face with the different measurements adopted by our system.
In (b), the first row shows face images cropped to a size of 30x36. The second row illustrates the area of the determining portions, of size 18x23. Each image of the dataset must be cut manually to a size of 30x36 (Figure 5 (a)) so that each portion contains the correct element (eyes, nose, mouth…) while respecting the division rules shown in section III. Each image is then cropped again, this time automatically, into an 18x23 image that contains only the invariant characteristics of the face. The size is chosen to simplify the calculations resulting from the division: 36 for the height because this number is divisible by 3 and also by 4, and 36/3 = 12 is in turn divisible by 3; and 30 for the width because it is divisible by 5. These dimensions are detailed in Table 1.
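The arithmetic behind the 30x36 size can be made concrete with a small sketch that divides the cropped face into horizontal thirds and vertical fifths. Only this basic grid is shown; the exact offsets of the 18x23 sub-image are given in Table 1 and are not reproduced here. The function name `face_grid` and the `(x, y, w, h)` box convention are assumptions made for the example.

```python
def face_grid(width=30, height=36, columns=5, rows=3):
    """Return a rows x columns grid of (x, y, w, h) boxes covering the face."""
    if width % columns or height % rows:
        raise ValueError("face size must be divisible by the grid size")
    cell_w, cell_h = width // columns, height // rows
    return [[(c * cell_w, r * cell_h, cell_w, cell_h) for c in range(columns)]
            for r in range(rows)]

grid = face_grid()
# Every cell is 6 pixels wide (30 / 5) and 12 pixels high (36 / 3).
top_left = grid[0][0]
bottom_right = grid[2][4]
```

With these dimensions every boundary falls on an integer pixel coordinate, so no region has to be rounded or interpolated.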

Haar-Like features:
The Haar-like features, proposed by Viola and Jones, identify faces among other objects in a given image. Viola and Jones [1] used three types of features, as shown in Figure 1 (a): two-rectangle, three-rectangle, and four-rectangle features. A Haar-like feature contains white and black areas, and its value is defined as the difference between the sum of the pixel values in the white areas and the sum of those in the black areas. The value of a Haar-like feature thus reflects the change in the gray levels of the image. For example, some characteristics of the face can be described simply by rectangular features: the eyes are darker than the cheeks, the sides of the nose are darker than the bridge of the nose, and the mouth is darker than its surroundings. Figure 6 shows how our system exploits these properties.
The main motivation for using these features rather than raw pixels for face detection is that they are much faster to compute using an image representation called the Integral Image. The Integral Image at a given point is equal to the sum of all the pixels located above and to the left of that point, as shown in Figure 1 (b). Let $i$ be an image of dimension $M \times N$, and let $i(x,y)$ be the intensity of pixel $(x,y)$. The Integral Image is another image $ii$ such that, for each pixel $(x,y)$:
$$ii(x,y) = \sum_{x' \le x,\; y' \le y} i(x',y') \qquad (1)$$
The Integral Image is obtained in a single scan of the image, after which the sum of the pixels of any rectangular region is available at constant cost, which considerably improves the efficiency of calculating feature values. The sum over each upright rectangle $R = [(x,y),w,h]$ is calculated from only four references (see Figure 1 (c)):
$$\mathrm{sum}(R) = ii(x+w-1,\,y+h-1) - ii(x-1,\,y+h-1) - ii(x+w-1,\,y-1) + ii(x-1,\,y-1) \qquad (2)$$
where $(x,y)$ is the upper left corner of $R$, $w$ and $h$ are its width and height, respectively, and $ii$ is taken as zero for negative indices.
Features extraction: For face detection, the face representation must capture not only the entire appearance of a face, but also the distinct appearance of its different parts and the contextual appearance. We believe that such a rich representation helps the face detector perform better under complex conditions. To achieve this, each part of the face, according to the proposed model, is represented by one or more of the basic features proposed by Viola and Jones.
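The Integral Image and the four-reference rectangle sum described above can be sketched in a few lines. This is an illustrative implementation under the inclusive convention (the integral value at a pixel includes that pixel); the function names and the choice of a horizontally split two-rectangle feature, with the white area on top, are assumptions made for the example.

```python
def integral_image(img):
    """ii[y][x] = sum of img over all pixels (x', y') with x' <= x and y' <= y."""
    height, width = len(img), len(img[0])
    ii = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle whose upper-left pixel is (x, y): four lookups."""
    def at(xx, yy):
        # Outside the image (negative index) the integral image is zero.
        return ii[yy][xx] if xx >= 0 and yy >= 0 else 0
    return (at(x + w - 1, y + h - 1) - at(x - 1, y + h - 1)
            - at(x + w - 1, y - 1) + at(x - 1, y - 1))

def two_rect_feature(ii, x, y, w, h):
    """Two-rectangle feature: white top half minus black bottom half."""
    half = h // 2
    return rect_sum(ii, x, y, w, half) - rect_sum(ii, x, y + half, w, half)

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ii = integral_image(image)
lower_right_block = rect_sum(ii, 1, 1, 2, 2)   # 5 + 6 + 8 + 9

bright_over_dark = [[9, 9], [9, 9], [1, 1], [1, 1]]
feature_value = two_rect_feature(integral_image(bright_over_dark), 0, 0, 2, 4)
```

Once `ii` has been built, the cost of `rect_sum` does not depend on the rectangle size, which is why the same features can be evaluated quickly at every window position.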

Learning with AdaBoost
The AdaBoost algorithm, developed by Freund and Schapire in 1995 [12], is used to improve the robustness of a weak learning algorithm. Its principle consists in combining a set of weak classifiers to form a more effective classification function (a strong classifier). In the language of boosting, a classifier is said to be weak when its classification performance is poor: a weak classifier may classify the training data correctly only 51% of the time [1]. For each feature, a weak classifier determines the optimal threshold classification function, so that the minimum number of examples is misclassified. A weak classifier $h_j$ consists of a feature $f_j$ and a threshold $\theta_j$ (formula 3).
The main idea of the AdaBoost algorithm is that, over $T$ iterations, we select $T$ weak classifiers $h_t(x)$, each with a weight $\alpha_t$, to finally obtain a strong classifier $H(x)$ as their weighted combination.
The strong classifiers trained by AdaBoost are applied sequentially, and each of them decides to accept or reject the window. If the window is classified as positive at one stage, that is, as containing the sought object, it is processed by the following classifier. If it is classified as negative, it is definitively excluded and is not processed by the following classifiers. A window that passes all the stages is therefore considered to contain the sought object. Viola and Jones call this principle the Attentional Cascade. Figure 7 illustrates this operation for a 3-stage cascade.
Our principle: Our learning algorithm repeats the learning with AdaBoost several times. At each iteration, the algorithm randomly divides the dataset into two sets, namely learning and validation. It then checks that the detection rate, the precision, and the recall of the resulting detector each exceed a threshold fixed in advance; if so, the detector is added to the final detector. Algorithm 1 details this system.
The final detector Df = {d1, d2, …, dn} comprises n detectors di. Each detector di is the result of the AdaBoost algorithm applied to the dataset with a given splitting operation. A detector is added to the final detector if its detection rate exceeds the minimum min_rate fixed in advance and its precision and recall exceed given values. The number n of detectors selected is fixed by the programmer.
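To make the boosting step concrete, the sketch below implements the classic discrete AdaBoost of Freund and Schapire with threshold weak classifiers on labels +1 / -1. It is an illustration under stated assumptions, not the authors' implementation: the polarity term (which lets a weak classifier fire below or above its threshold), the exhaustive threshold search, and all function names are choices made for the example, and the variant used by Viola and Jones differs in its label convention and weight update.

```python
import math

def stump_predict(value, theta, polarity):
    """Weak classifier: +1 if polarity * value < polarity * theta, else -1."""
    return 1 if polarity * value < polarity * theta else -1

def best_stump(values, labels, weights):
    """Threshold and polarity with the lowest weighted error for one feature."""
    best = (float("inf"), 0.0, 1)
    for theta in sorted(set(values)):
        for polarity in (1, -1):
            err = sum(w for v, y, w in zip(values, labels, weights)
                      if stump_predict(v, theta, polarity) != y)
            if err < best[0]:
                best = (err, theta, polarity)
    return best

def adaboost(features, labels, rounds):
    """features[i][j] is feature j of sample i; labels are +1 or -1."""
    n = len(labels)
    weights = [1.0 / n] * n
    strong = []  # list of (alpha, feature index, theta, polarity)
    for _ in range(rounds):
        candidates = []
        for j in range(len(features[0])):
            column = [row[j] for row in features]
            err, theta, polarity = best_stump(column, labels, weights)
            candidates.append((err, j, theta, polarity))
        err, j, theta, polarity = min(candidates)
        err = min(max(err, 1e-10), 1 - 1e-10)       # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        strong.append((alpha, j, theta, polarity))
        # Misclassified samples gain weight; then renormalise to sum to 1.
        weights = [w * math.exp(-alpha * y * stump_predict(row[j], theta, polarity))
                   for w, y, row in zip(weights, labels, features)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return strong

def strong_predict(strong, row):
    """Sign of the weighted vote of the selected weak classifiers."""
    score = sum(alpha * stump_predict(row[j], theta, polarity)
                for alpha, j, theta, polarity in strong)
    return 1 if score >= 0 else -1

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[1, 5], [2, 1], [3, 4], [8, 2], [9, 6], [10, 3]]
y = [1, 1, 1, -1, -1, -1]
model = adaboost(X, y, rounds=3)
predictions = [strong_predict(model, row) for row in X]
```

With only 32 candidate features per window, the inner loop over features stays short, which is where a manually chosen feature set saves training time compared with searching many thousands of candidates.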

Algorithm 1 Learn Final Detector
This repeated training also shows the robustness of our feature extraction system. Indeed, the number of iterations our algorithm needs to select n detectors is always close to n. Figure 8(a) shows the variation in the total number of iterations needed to select 20 detectors, for 100 final detectors. The number of iterations varies between 20 and 36, and according to Figure 8(b) the most frequent values lie between 20 and 25. This clearly shows that the detectors making up the final detector have high detection rates.
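The relationship between the number of iterations and the number n of accepted detectors can be seen in a minimal sketch of the repeated-training loop described above. The callbacks `train_fn` and `evaluate_fn`, the 70% split ratio, and the `max_iterations` safety cap are hypothetical placeholders introduced for the example; the sketch accepts a detector whose three scores are at least the given minimums.

```python
import random

def learn_final_detector(dataset, n, min_rate, min_precision, min_recall,
                         train_fn, evaluate_fn, train_ratio=0.7,
                         seed=None, max_iterations=1000):
    """Collect n detectors, each trained on a fresh random split of the dataset.

    Returns the list of accepted detectors and the number of iterations used.
    """
    rng = random.Random(seed)
    final_detector, iterations = [], 0
    while len(final_detector) < n and iterations < max_iterations:
        iterations += 1
        shuffled = list(dataset)
        rng.shuffle(shuffled)
        cut = int(train_ratio * len(shuffled))
        learning, validation = shuffled[:cut], shuffled[cut:]
        detector = train_fn(learning)
        rate, precision, recall = evaluate_fn(detector, validation)
        # Keep the detector only if all three scores reach their minimums.
        if rate >= min_rate and precision >= min_precision and recall >= min_recall:
            final_detector.append(detector)
    return final_detector, iterations

# Stub callbacks for illustration: every trained detector passes the checks.
def always_good_train(learning):
    return len(learning)

def always_good_eval(detector, validation):
    return (1.0, 1.0, 1.0)

detectors, used = learn_final_detector(
    list(range(100)), n=20, min_rate=0.95, min_precision=0.9, min_recall=0.9,
    train_fn=always_good_train, evaluate_fn=always_good_eval, seed=1)
```

When every split yields an acceptable detector, the loop stops after exactly n iterations; each rejected detector adds one iteration, so an iteration count close to n means few detectors were rejected.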
To detect a face in a given image with our final detector, we use majority-vote prediction: we combine the class labels predicted by the individual detectors $d_i$ and select the label that received the most votes. Since our case is a binary classification where class1 = −1 and class2 = +1, the majority vote prediction can be written as follows:
$$\hat{y} = \mathrm{sign}\left(\sum_{i=1}^{n} d_i(x)\right)$$
iJOE -Vol. 17
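With labels −1 and +1, the majority vote reduces to the sign of the sum of the individual predictions, as the short sketch below illustrates. Detectors are modelled as plain callables, and ties (possible when n is even) are resolved as non-face; both choices are assumptions made for the example.

```python
def majority_vote(detectors, window):
    """Return +1 (face) or -1 (non-face) from the summed +1 / -1 votes."""
    total = sum(d(window) for d in detectors)
    return 1 if total > 0 else -1   # a tie counts as non-face (assumption)

# Three hypothetical detectors voting on the same window.
votes_face = majority_vote([lambda w: 1, lambda w: 1, lambda w: -1], window=None)
votes_none = majority_vote([lambda w: -1, lambda w: 1, lambda w: -1], window=None)
```

An odd number of detectors avoids ties altogether.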

Dataset
The image databases used to evaluate the performance of face detection algorithms are numerous and varied. For this project, we chose the Color FERET [14] face database for its diversity of images. The Color FERET database extends the original grayscale FERET database with color images and higher resolution. It allows the study of a wide range of acquisition conditions, as images are acquired, for each subject, in various poses and levels of illumination, with and without facial expression. The database covers various ethnicities, which makes it possible to assess the performance of the system on a diverse population. In the context of this work, we manually cut the images from this database according to the measurements described in section IV.C. The result is a subset of images of size 30x36. From this subset, we created 200 positive images of size 18x23, which we used to train our final detector. Figure 9 shows a sample of the positive images used by our system.

Feature analysis
The choice of features was not made at random. We relied on the invariant characteristics of a face, namely the eyes, nose, mouth, eyebrows, and cheeks, and on combinations of two or more of these characteristics, as shown by feature f21. To measure the performance of our 32 selected features, we used two well-known methods, namely the feature importance property and the correlation matrix with heatmap. Figure 10 shows the importance of our features. The features extracted directly from one of the invariant characteristics, such as the eyes and the eyebrows, are clearly more decisive, while those extracted from the nose and the mouth are less important because of the different positions of the faces in our dataset. For example, the nose is not always in the same portion. Figure 11 shows that the main features of group 1 present weak collinearity, while the features of group 2 are highly correlated with those of group 1. This suggests that the group 2 features are redundant, but our experiments show that adding them increases the performance of our final detector.
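A correlation matrix of the kind visualised in a heatmap can be computed as sketched below. The Pearson coefficient is assumed here, since the text does not state which correlation measure was used, and the three feature columns are made-up numbers that only illustrate the computation.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0                  # a constant feature carries no correlation
    return cov / math.sqrt(var_a * var_b)

def correlation_matrix(columns):
    """columns[j] holds the values of feature j over all samples."""
    return [[pearson(a, b) for b in columns] for a in columns]

# Illustrative values only: f2 duplicates f1 up to scale, f3 runs opposite.
f1, f2, f3 = [1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]
matrix = correlation_matrix([f1, f2, f3])
```

Entries close to +1 or −1 flag pairs of features that carry nearly the same information, which is the redundancy pattern discussed for group 2.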

Performance and results
To test our algorithm, we created 100 final detectors and evaluated their performance on a dedicated test set. This test set consists of images collected from the web: 101 images of celebrity faces and 71 negative images that do not contain faces. Figure 12 shows the detection rate of these detectors. This rate is high, which supports the effectiveness of our approach. Table 3 shows the averages of the different measurements, which indicate that our approach is promising for face detection.
To have performance tests comparable to those carried out by Viola and Jones [1], we used the MIT-CMU test set [17], which in our case consists of 117 images and 511 faces, whereas Viola and Jones used 130 images and 507 faces. We disregarded this difference, as most of the images are the same or differ only slightly. Figure 13 shows some examples of detection using sample images from the MIT-CMU test set.
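For reference, precision and recall on a test set of faces and non-faces follow from the four outcome counts, as in the sketch below. Accuracy is included as a third measure; whether the detection rate reported here corresponds to recall or to accuracy is not specified in the text, so no mapping is assumed. The counts in the example are invented for illustration and are not results from this work.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from true/false positive/negative counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one sample is required")
    precision = tp / (tp + fp) if tp + fp else 0.0   # correct among detections
    recall = tp / (tp + fn) if tp + fn else 0.0      # detected among real faces
    accuracy = (tp + tn) / total                     # correct among all windows
    return {"precision": precision, "recall": recall, "accuracy": accuracy}

# Invented counts: 9 faces (8 found, 1 missed), 11 non-faces (2 false alarms).
metrics = classification_metrics(tp=8, fp=2, fn=1, tn=9)
```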

Conclusion
In this article, we have introduced a face detection algorithm based on Haar-like features applied to specific regions of a face. These regions are deduced by applying the principle of face proportions. On the MIT-CMU test set, the detection rate is 97% and the average rate of false positives is 18%. Because faces must be cut manually according to the rules presented above, we were not able to train our final detector on a larger number of training images, which might improve its performance. In future work, we will prepare a dataset with more images and test our approach with other learning algorithms (SVM, Decision Tree, etc.).