Proposal for a New Tool to Evaluate a Serious Game

The current enthusiasm of generations of students for video games, together with the marked interest of training institutions in playful strategies that facilitate learning, has encouraged the development and use of formative games called Serious Games (SGs). The main aim of these games is not to replace the traditional training mode but to complement it, letting the learner benefit from the interactivity and ergonomics of the graphical interfaces an SG offers. So far, much research has focused on the benefits that SGs can bring to a training environment. However, little has been done to evaluate an SG not as a training tool but as the outcome of a project to develop a tool meant for use in a learning context. The purpose of this paper is to propose an SG evaluation tool built on four dimensions that an SG should have in order to fulfill the task for which it was designed. These four dimensions are expressed as measurement criteria and prioritized according to the Fuzzy-AHP (Fuzzy Analytic Hierarchy Process) method. This model was tested on an SG validated by an educational committee and used by biology students at Hassan II University, Ben M'Sik Faculty of Sciences. The results obtained show the quality and relevance of the evaluation produced by the proposed tool.


Introduction
The current enthusiasm of generations of students for video games, together with the marked interest of training institutions in playful strategies that facilitate learning, has encouraged the development and use of formative games called serious games (SGs).
The objective of a SG is not to replace the traditional training mode but to complement it by making the learner benefit from the interactivity and ergonomics of the graphical interfaces it offers.
The versatility of an SG allows it to be used in a growing number of domains [1][2]. Hence the interest in evaluating it, since evaluation enables its correct and efficient exploitation [3].
Since the evaluation of an SG is more complex than the evaluation of an entertainment game [4][5], the objective of this paper is to propose a model for the evaluation of an SG. This model is designed on the basis of four dimensions that an SG must satisfy in order to fulfill the task for which it was developed.

State of the Art
Work on SG evaluation has focused on its effectiveness in a specific training process [6][7], its impact on different stakeholders [8], its design phase [9], or its quality [10][11].
The effectiveness of an SG in a training process is an important element and must be assessed. The assessment gives the learner and the tutors guidance on learning progress and on the results acquired [12] and, above all, confirms that learning with the SG actually took place [13].
For example, De Freitas and Oliver [6] proposed a four-dimensional evaluation model: context, learner specification, mode of representation and pedagogical considerations. The model allows tutors to choose an SG that is appropriate for their needs. The authors illustrated the importance of the chosen dimensions and the relationships linking them while supporting the learner's experience.
However, as Robertson and Howells [7] have pointed out, this model obliges tutors to have prerequisites in computer games and does not provide any assistance in classifying serious games corresponding to their needs.
Work on the impact of using an SG on stakeholders includes Xu et al. [8], where the authors proposed an assessment method for the experience of SG stakeholders, called SGSEAM. The aim is to identify the strengths and weaknesses of an SG for the various stakeholders involved in its life cycle. The evaluation is based on the collection of qualitative and quantitative data from each actor. This approach can be useful in identifying the usefulness of an SG and its ability to be used.
Djelil [9] defined an evaluation methodology situated in the context of SG design and experimentation. The evaluation is based on usability, utility and acceptability, and uses analytical and empirical evaluation methods. The goal is to reduce the risk of a poor design for an SG that must be pedagogically useful, usable from the point of view of the learner/player, and acceptable from the point of view of the institution. All these criteria are met on the basis of a non-exhaustive selection of models for the analysis and design of learning games. Similarly, [10] proposed quality indicators that can help designers analyze the quality of an SG while it is being designed. This proposal is part of a quality approach to SG creation that aims to reduce time and cost.
The analysis of the use of indicators by experts enabled the chosen terminology to be adjusted and validated. They were also able to validate that the indicators covered all the important features of a Learning Game.
The quality of an SG is an important criterion to assess. In this context, the authors of [11] divided the quality characteristics of an SG, according to their use in the literature, into primary and secondary characteristics. Several criteria were taken into account, such as usability, comprehensibility, motivation, commitment and user experience. Their results indicate that comprehensibility is a problem; they therefore recommend using tutors or adding tutorials to the game so that users can better understand its goals, concepts and procedures.
More recently, in the context of SG quality assessment, Giani Petri et al. [14] revised an evaluation model called MEEGA. This model systematically breaks down quality factors using the GQM approach (Goal/Question/Metric) [15]. These quality factors were subsequently refined into a set of dimensions from which the questionnaire items are derived. The authors thus presented a new version of the model, called MEEGA+, intended to provide full support for the quality assessment of SGs.
Through this brief overview of the scientific literature dealing with the evaluation of SG, we note that several criteria have been considered, most of which relate to aspects of design quality, utility, motivation, commitment, game design, etc.
However, we do not find a generic model to evaluate SG not as a training tool but as the outcome of a project to develop a tool dedicated to use in a formative context, hence our proposal in this paper.

Our Proposed Model
The SG evaluation model that we propose in this paper is based on four dimensions that we believe are fundamental to consider in any SG evaluation.
These pedagogical (PD), technological (TD), ludic (LD) and behavioural (BD) dimensions are measured according to several well-defined criteria, presented in Table 1. We chose the pedagogical dimension because an SG must first of all meet the pedagogical objective or objectives for which it was designed. Similarly, an SG must be very attractive and benefit in its design from the technological advances of game development tools. As for the ludic (playful) dimension, its presence is essential in an SG to guarantee learning in fun and immersive situations, in order to arouse the students' interest and hold their attention during the SG. Finally, the behavioural dimension makes it possible to test whether the SG fits properly into its context of use, according to the motivation, commitment and experience of the users.

Weighting criteria validation
The importance of one dimension in relation to another depends on the context in which the SG is used. For example, if the SG is used in a purely formative context, the pedagogical dimension will be considered dominant compared to the other dimensions. Therefore, depending on the context in which the SG is used, it is essential to validate the selection of the four dimensions and the weighting of their multiple criteria with a mathematical analysis method. Candidate methods include the fuzzy multiple objective method [16], Fuzzy TOPSIS [17], Fuzzy ELECTRE [18] and Fuzzy AHP (Fuzzy Analytic Hierarchy Process) [19], which we used in our case. This method combines the AHP method introduced by Saaty in [20] with the fuzzy logic proposed by Zadeh in [21].
The traditional AHP method breaks down a complex problem into a hierarchical system, in which pairwise comparisons of evaluation criteria are established at each level of the hierarchy. This makes it possible to deduce relative priorities. However, it neglects the uncertainty that arises when a human judgment expressed in natural language is represented by a single precise number [22]. Fuzzy logic is therefore used to model imprecise or ambiguous data, because it gives reasoning considerable flexibility and closely resembles human thought and perception [23].
The combination of AHP and fuzzy logic (Fuzzy-AHP) was introduced in [24], where the triangular membership function was used for the pairwise comparison of different criteria [25]. This function was then used by Buckley [19] and Chang [26] to calculate the fuzzy weights of the criteria.
Thus, the Fuzzy-AHP method allows the use of linguistic values represented by triangular fuzzy numbers to make a pair comparison between the criteria. This is done in order to calculate the relative weights of the criteria.
A triangular fuzzy number is designated by the triplet (l, m, u), where (l) represents the smallest possible value, (m) the most promising value and (u) the largest possible value.
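Each triangular fuzzy number M = (l, m, u) has linear representations on its left and right sides, so that its membership function can be defined as follows:

$$\mu_M(x) = \begin{cases} \dfrac{x - l}{m - l}, & l \le x \le m \\[4pt] \dfrac{u - x}{u - m}, & m < x \le u \\[4pt] 0, & \text{otherwise} \end{cases}$$

The support of M is the set of elements of X belonging at least a little to M. In other words, it is the set $supp(M) = \{x \in X \mid \mu_M(x) > 0\}$.
The kernel of M is the set of elements of X totally belonging to M. In other words, it is the set $noy(M) = \{x \in X \mid \mu_M(x) = 1\}$. By construction, $noy(M) \subseteq supp(M)$.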
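As an illustration, the membership function of a triangular fuzzy number (l, m, u) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `triangular_membership` is hypothetical, and the handling of the degenerate cases l = m or m = u is an assumption made here to avoid division by zero.

```python
def triangular_membership(x, l, m, u):
    """Degree of membership of x in the triangular fuzzy number (l, m, u)."""
    if not l <= m <= u:
        raise ValueError("expected l <= m <= u")
    if x == m:
        # Kernel: full membership (also covers the degenerate cases l == m or m == u).
        return 1.0
    if l < x < m:
        # Left side: rises linearly from 0 at l to 1 at m.
        return (x - l) / (m - l)
    if m < x < u:
        # Right side: falls linearly from 1 at m to 0 at u.
        return (u - x) / (u - m)
    # Outside the support.
    return 0.0


if __name__ == "__main__":
    # Hypothetical fuzzy number (1, 2, 3).
    for value in (0.5, 1.5, 2.0, 2.5, 3.0):
        print(value, triangular_membership(value, 1, 2, 3))
```

With this definition, the support of (1, 2, 3) is the open interval between 1 and 3, and the kernel is the single point 2.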
By using the Fuzzy-AHP method, we can group our evaluation criteria into different levels and groups with similar characteristics. The method ensures logical consistency in the judgments used to determine the priorities of our evaluation criteria, while taking into account the imprecision and uncertainty of human judgment: it allows the SG evaluator to express judgments as linguistic values in the assessment process. To obtain the appropriate weightings for the judgments given for each dimension, the geometric mean method of J.J. Buckley [19] is adopted. Thus, as shown in Fig. 3, the process of weighting the evaluation criteria of an SG begins with a choice among the linguistic values proposed by the SG evaluation system. The Fuzzy-AHP method then transforms these linguistic values into weighting values.

The SG weighting validation scenario
The SG weighting validation scenario, which validates our choice of weights for the evaluation criteria, consists of four steps.
Step 1 (Translation of the selected linguistic values into triangular fuzzy numbers): First, we carry out a pairwise comparison of the elements of the set S, associating a linguistic value with each comparison. Each linguistic value is transformed into a triangular fuzzy number, which we denote $\tilde{a}_{ij} = (l_{ij}, m_{ij}, u_{ij})$, where i and j index the two elements being compared. The numbers $\tilde{a}_{ij}$ together represent the set of fuzzy values of the different possible comparisons.
In a purely formative context, such as ours, we considered the pedagogical dimension (PD) more important than any other dimension. Since the target population is of university and scientific level, and therefore accustomed to new information technologies, we favoured the technological dimension (TD) over the behavioural dimension (BD) and the ludic dimension (LD). Similarly, since the SG to be evaluated is used in a pedagogical activity sanctioned by an evaluation, we favoured BD over LD. Table 2 summarizes our choice of weighting for the dimensions.
Step 2 (Calculation of priority vectors, fuzzy judgment matrix): We define the fuzzy judgment matrix $\tilde{A} = (\tilde{a}_{ij})_{n \times n}$, composed of the triangular fuzzy numbers $\tilde{a}_{ij}$. This matrix therefore contains the aggregation of all determined triangular fuzzy numbers.
Thus, starting from Table 1, we then calculate the consistency ratio (CR) to validate our choice of weighting. This ratio is defined as the ratio between the coherence index of the evaluation matrix (CI) and the coherence index of a random matrix (RI):

$$CR = \frac{CI}{RI}, \qquad CI = \frac{\lambda_{max} - n}{n - 1}$$

where $\lambda_{max}$ is the largest eigenvalue of the judgment matrix and n the number of criteria. According to Saaty [29], the consistency ratio must be less than or equal to 10%; otherwise we have to reconsider our choice of weighting.
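The consistency check can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the paper: the check is applied to a crisp matrix (for instance the modal values m of the triangular fuzzy numbers), the largest eigenvalue is approximated by power iteration, the 4x4 matrix for PD, TD, BD and LD is purely hypothetical, and the function names are illustrative. The random index values are Saaty's commonly tabulated values for orders 3 to 5.

```python
# Saaty's random consistency index (RI) for matrix orders 3 to 5.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def principal_eigenvalue(matrix, iterations=100):
    """Approximate the largest eigenvalue of a positive square matrix by power iteration."""
    n = len(matrix)
    vector = [1.0 / n] * n  # start from a uniform vector that sums to 1
    eigenvalue = float(n)
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        # Because the vector sums to 1, the sum of A*v estimates the eigenvalue.
        eigenvalue = sum(product)
        vector = [value / eigenvalue for value in product]
    return eigenvalue


def consistency_ratio(matrix):
    """Return CR = CI / RI with CI = (lambda_max - n) / (n - 1)."""
    n = len(matrix)
    lambda_max = principal_eigenvalue(matrix)
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


if __name__ == "__main__":
    # Hypothetical crisp judgments with PD > TD > BD > LD (rows and columns in that order).
    judgments = [
        [1.0, 2.0, 3.0, 4.0],
        [1 / 2, 1.0, 2.0, 3.0],
        [1 / 3, 1 / 2, 1.0, 2.0],
        [1 / 4, 1 / 3, 1 / 2, 1.0],
    ]
    cr = consistency_ratio(judgments)
    print(f"CR = {cr:.4f}", "(acceptable)" if cr <= 0.10 else "(revise the weighting)")
```

A perfectly consistent matrix gives a largest eigenvalue equal to n and therefore CR = 0; the further the judgments depart from consistency, the larger CR becomes.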
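Step 3 (Calculation of normalized fuzzy criteria weights): This step consists of calculating, using the geometric mean method, the normalized fuzzy weight of each criterion:

$$\tilde{w}_i = \tilde{r}_i \otimes (\tilde{r}_1 \oplus \tilde{r}_2 \oplus \cdots \oplus \tilde{r}_n)^{-1}, \quad i = 1, 2, \ldots, n$$

where $\tilde{w}_i$ is the fuzzy weight of criterion i and $\tilde{r}_i$ is the geometric mean of the fuzzy comparisons $\tilde{a}_{ij}$ of criterion i with each criterion j:

$$\tilde{r}_i = (\tilde{a}_{i1} \otimes \tilde{a}_{i2} \otimes \cdots \otimes \tilde{a}_{in})^{1/n}$$

Step 4 (Calculation of priorities): Finally, we calculate the priority $w_i$ of each criterion from its fuzzy weight $\tilde{w}_i$. This priority determines the weight of each criterion in the entire hierarchical system, obtained in such a way as to have $\sum_{i=1}^{n} w_i = 1$.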
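Steps 3 and 4 can be sketched in Python as follows. This is a hedged illustration of Buckley's geometric mean method, not the authors' program. The linguistic scale and the comparison values are hypothetical, the function names are illustrative, and the final defuzzification by centroid, (l + m + u) / 3 followed by normalization, is an assumption: the exact relationship used for Step 4 is not recoverable from the text.

```python
import math


def reciprocal(tfn):
    """Inverse of a triangular fuzzy number (l, m, u) is (1/u, 1/m, 1/l)."""
    l, m, u = tfn
    return (1.0 / u, 1.0 / m, 1.0 / l)


def build_matrix(upper, n):
    """Build a reciprocal fuzzy judgment matrix from its upper-triangle comparisons."""
    matrix = [[(1.0, 1.0, 1.0)] * n for _ in range(n)]
    for (i, j), tfn in upper.items():
        matrix[i][j] = tfn
        matrix[j][i] = reciprocal(tfn)
    return matrix


def fuzzy_weights(matrix):
    """Step 3: normalized fuzzy weights from row-wise fuzzy geometric means."""
    n = len(matrix)
    means = [
        tuple(math.prod(cell[k] for cell in row) ** (1.0 / n) for k in range(3))
        for row in matrix
    ]
    total_l = sum(r[0] for r in means)
    total_m = sum(r[1] for r in means)
    total_u = sum(r[2] for r in means)
    # Multiplying by the inverse of the fuzzy sum: the lower bound is divided
    # by the largest total and the upper bound by the smallest total.
    return [(r[0] / total_u, r[1] / total_m, r[2] / total_l) for r in means]


def crisp_priorities(weights):
    """Step 4 (assumed): centroid defuzzification, then normalization to sum to 1."""
    centroids = [(l + m + u) / 3.0 for l, m, u in weights]
    total = sum(centroids)
    return [c / total for c in centroids]


if __name__ == "__main__":
    # Hypothetical comparisons, indices 0..3 standing for PD, TD, BD, LD.
    upper = {
        (0, 1): (1, 2, 3), (0, 2): (2, 3, 4), (0, 3): (3, 4, 5),
        (1, 2): (1, 2, 3), (1, 3): (2, 3, 4),
        (2, 3): (1, 2, 3),
    }
    priorities = crisp_priorities(fuzzy_weights(build_matrix(upper, 4)))
    for name, weight in zip(("PD", "TD", "BD", "LD"), priorities):
        print(name, round(weight, 3))
```

With these hypothetical judgments, the resulting priorities decrease from PD to LD, which matches the ordering chosen for a purely formative context.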

Testing the Proposed Model
Once the weights of the criteria of the four selected dimensions had been validated, we tested our evaluation model by evaluating the SG Leuco'war (or leucowar). The pedagogical committee for biological sciences at Hassan II University of Casablanca validated this choice.
Philippe Cosentino created this SG on the theme of interactions between different leukocytes. In a science-fiction scenario, the student controls the different types of white blood cells to help a patient fight an infection. During the game, the student discovers macrophages, monocytes, mast cells and B-lymphocytes.
After playing the SG, students were invited to evaluate it by answering our questionnaire, whose internal consistency and reliability were verified by Cronbach's Alpha method [30][31]. Indeed, the calculated total Cronbach's Alpha value is equal to 0.954 and thus higher than the threshold value 0.70 [32].
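For readers unfamiliar with the reliability check, Cronbach's Alpha can be computed from a respondents-by-items score table as sketched below. The response data are invented for illustration and are not the study's questionnaire results; the function name `cronbach_alpha` is hypothetical, and sample variances are used.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(responses[0])
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Hypothetical answers of five students to three questionnaire items (1 to 5 scale).
    answers = [
        [4, 5, 4],
        [3, 3, 4],
        [5, 5, 5],
        [2, 2, 3],
        [4, 4, 4],
    ]
    alpha = cronbach_alpha(answers)
    print(f"alpha = {alpha:.3f}", "(reliable)" if alpha > 0.70 else "(below threshold)")
```

Values above the 0.70 threshold are conventionally read as acceptable internal consistency.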

Results and Analysis
The analysis of the students' responses shows that the pedagogical dimension (PD) has an average of 56.32%, the technological dimension (TD) 72.88%, the behavioural dimension (BD) 75.94% and the ludic dimension (LD) 88.19%.
These results show that the SG is more suitable for use in a ludic context than in a purely formative context such as ours. It should also be noted that the technological and behavioural dimensions were prioritized over the pedagogical dimension in the design of the SG.
By refining the results of the pedagogical dimension, we observe that the Error Management (EM) and Pedagogical Consideration (PC) criteria had the lowest averages.

Fig. 5. Histogram of SG analysis results
Indeed, the SG does not offer explanations or remedies for errors made by students while playing. Moreover, the lack of a choice of difficulty level in the SG was extensively discussed by the students, and the majority of them were unable to complete it successfully.
On the other hand, the students appreciated the quality of the image and the way in which the concepts of the interactions between the different leukocytes were presented.

Comparison with MEEGA+
To confirm the effectiveness of our evaluation model, we compared our results with those obtained using the MEEGA+ model [33].
To make the comparison more realistic, we restructured the criteria considered in MEEGA+ into our four evaluation dimensions (PD, TD, BD, LD), knowing that both models show high consistency and reliability across questionnaire items. As shown in Fig. 6, the results obtained are similar and rank the four dimensions in the same increasing order, even though MEEGA+ emphasizes user experience and usability in its evaluation. Our model, by contrast, is based on four evaluation dimensions and, in addition, gives the evaluator the flexibility to weight these dimensions according to the context in which the SG is used.

Conclusion and Perspectives
In this article, we have proposed an evaluation tool for a serious game designed around four dimensions, namely the pedagogical, technological, behavioural and ludic dimensions, which can be weighted according to the context in which the SG is used.
The coherence of the weightings of these evaluation dimensions is validated by a program developed in Matlab [34] using the Fuzzy-AHP method, which takes into account the imprecision and uncertainty of human judgment. The results obtained from the test of the serious game "Leuco'war" were compared with those obtained with the MEEGA+ evaluation tool, and show the quality and relevance of the evaluation produced by our proposed tool.
In our future work, we will consider the dependence of the four evaluation dimensions.