Lessons from an Educational Game Usability Evaluation

Abstract: In this paper, we present the results of a usability evaluation for Xenubi, a cell phone game about the periodic table. The evaluation took place in a state high school, and the issues faced are described and discussed. These issues are related to conflicts between the data that were gathered through video recordings and through a questionnaire. We also examine the poor performance of the game's mechanics during the evaluation. These results were unexpected because the game had performed extremely well during its pilot test. Possible causes for this outcome are discussed.

I. INTRODUCTION
Cell phones, instant messengers, social networks, massively multiplayer online games: currently these are the tools of a teenager's everyday digital life. Criticism of the constant use of these applications highlights negative effects such as decreasing face-to-face interaction, plagiarism, overexposure and even health risks. Despite this, each release of a new cell phone or tablet has thousands of people queuing up to buy it. According to [1], the feeling of immersion provided by these technologies is legitimate, as is the change brought to the way we talk, interact and learn. This motivation inspired us to design and develop an educational game about the periodic table, called Xenubi, to be played on cell phones.
According to [2] and [3], although the use of mobile phones has the potential to create more compelling pedagogic proposals, the small screen size imposes a major limitation. Beyond this physical limitation, developing mobile applications is a difficult task, as the same application must be deployed for several operating systems, several screen sizes and devices with reduced processing capacity.
However, we believe it is worth the effort, because mobile learning (m-learning) is a valuable tool for improving student performance, as the use of mobile devices has the advantages of mobility, engagement and interaction [4] [5]. M-learning is not limited by time or space, a feature that this learning paradigm requires [6]. Mobile educational games have the potential to explore positive qualities, such as tolerance towards other players and closeness to the subject of the game, in an interactive way [7].
II. USABILITY, MOBILE, GAMES
As mobile games become more popular, design and development teams have started to focus on usability [8]. In this context, usability studies are on the rise because of the need to reach a broader and more diverse audience [9].
The main evaluation techniques are expert evaluation and user testing, and they have always been encouraged by researchers in the field of usability [10] [11]. Both have pros and cons: evaluating with experts is usually less resource intensive, but yields less information than user testing [12]. Two popular expert evaluation methods are evaluation through heuristics and checklists. Reference [13] proposed a game-specific heuristic set, which helped testers to find more issues than a user study (p. 25). Reference [14] proposed an evaluation strategy based on four checklists: three related to the physical, logical and graphical user interface, and one related to the task. Reference [15] reported, in a case study of the development of an AAA game, that expert evaluation provides novel and useful information for game development. They asked six usability experts to evaluate the game through Molich and Nielsen's heuristics [16]. The developers interviewed in [15] considered the problem list that the usability evaluators gave them to be very useful, as novel and critical issues were found. Regarding user testing, reference [17] compared expert evaluation (through heuristics and checklists) and user testing, and found that user testing was more effective and more reliable. We performed two types of evaluation: heuristic evaluations (using Nielsen's set [16]) during Xenubi's iterative design cycle [18], and a user test, which is reported in this paper.
When it comes to user testing of mobile applications, two issues should be addressed: (1) where to test and (2) how to collect data. Regarding the "where" question, reference [19] compared the results of laboratory and field testing and found no significant difference, which is an argument in favour of laboratory tests, because they are less expensive. They also found that gathering data through think-aloud protocols was the best alternative. However, we chose not to ask users to do so, for two reasons: (1) think aloud is a method suited for analyzing well-structured problem-solving processes [20] and (2) thinking aloud can interfere with the task [21]. Regarding the "how" question, reference [22] proposed a log tool to track interactivity with mobile applications, and reference [23] designed belt and backpack equipment to be worn by the user during the test.
These topics are of great relevance for evaluating the usability of educational games. However, we should not forget the "educational" aspect of our game. When it comes to educational software, there are no tasks in the typical sense; the interaction goes beyond the task and work-related usability paradigm [24]. Usability evaluation should therefore not consider only "technical" aspects, but should also address the instructional design [25] [26].

III. XENUBI, A GAME ABOUT THE PERIODIC TABLE
Xenubi is a game about the periodic table that was inspired by the work of reference [27], which proposed a Super Trump© card game about the periodic table. Aside from the name and layout, the differences between these games are the following: Xenubi is a one-player game and has several distinctive interactive elements, as shown in Fig. 1.
The game's initial screen (Fig. 1.a) shows the following buttons in order: "play", which starts the game; "instructions"; "levels" (easy, which is the default level, medium and hard); "credits"; and "save the game". The development team did not reach a consensus about the need for a "levels" and a "save" feature, so we chose to see whether users would click those buttons before actually implementing those features. The gameplay mechanics are simple: at the beginning, a pack of 12 cards representing chemical elements is evenly distributed between the two players (the human player and the computer). The human player is represented by a stick character (Fig. 1.b), and the computer is represented by Dr. Moseley (Fig. 1.c), the scientist who proposed the organisation of the periodic table by atomic number [28]. In each round, the human player chooses one of the 6 periodic properties listed in the interface (Fig. 1.d). If the player's element has a higher value in the chosen property, the human player receives the computer's card; otherwise, he/she loses the card. Unlike in Super Trump©, the properties need not be chosen by chance: the player can see the position of his/her element and of the computer's element in the periodic table (Fig. 1.e). This position carries valuable information about the relationship between the player's and the computer's elements. Xenubi's main game screen also shows other information about these two elements: the atomic number, the element name and its abbreviation (Fig. 1.f). Additionally, in the main game screen, the player can read "tips" about the periodic properties (Fig. 1.g) or can return to the initial screen (Fig. 1.h). After choosing a property, the player is taken to the results screen, which shows the highlighted property (Fig. 1.i) and representations of the two game characters (the human player and Dr. Moseley) in a win/lose pose (Fig. 1.j). The only interactive element in this screen is the "OK" button (Fig. 1.k), which starts a new round.
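The round rule described above can be sketched in a few lines of Python. This is only an illustration, not Xenubi's actual implementation (the prototype was built in Flash Lite): the function name `play_round`, the dictionary-based card representation, the property name `electronegativity`, and the convention that the winner places both cards at the bottom of the deck are all assumptions made for the example. Ties are treated as a loss for the human player, following the "otherwise" wording of the rule.

```python
from collections import deque

def play_round(player_deck, computer_deck, prop):
    """Resolve one round; each deck is a deque of dicts (property name -> value).

    Hypothetical sketch: the top card of each deck is compared on `prop`.
    The human player takes both cards only with a strictly higher value;
    otherwise the computer takes both cards.
    """
    player_card = player_deck.popleft()
    computer_card = computer_deck.popleft()
    if player_card[prop] > computer_card[prop]:
        player_deck.extend([player_card, computer_card])
        return "win"
    computer_deck.extend([computer_card, player_card])
    return "lose"

# Illustrative cards (Pauling electronegativity values).
chlorine = {"symbol": "Cl", "atomic_number": 17, "electronegativity": 3.16}
sodium = {"symbol": "Na", "atomic_number": 11, "electronegativity": 0.93}

player = deque([chlorine])
computer = deque([sodium])
result = play_round(player, computer, "electronegativity")
print(result, len(player), len(computer))
```

Because chlorine lies far to the right of sodium in the same period, a player who reads the elements' positions on the table can predict this outcome without memorising the values, which is the strategy the game is meant to encourage.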
Xenubi's prototype was built using Adobe Flash© because it is easier to make changes in a Flash Lite© file than in other programming IDEs. Reference [29] also justifies the use of Flash Lite© in terms of ease of editing. However, we would advise its adoption only for rapid prototyping and usability inspections. The reason is that Adobe discontinued Flash Lite©, focusing on mobile development with Adobe AIR© [30]. Currently, Xenubi is being ported to HTML5, CSS3 and JavaScript, using Adobe PhoneGap©, and will be compatible with Android 2.1+ and iOS.
IV. XENUBI'S USABILITY EVALUATION
When we considered that the design cycle had come to a point where no significant improvements were being made to the interface, we decided to run a usability evaluation. First, we ran a pilot test and then an evaluation at a state junior high school. In both cases, a combination of interviews and questionnaires was used, which we describe in the next sections.

A. Pilot Study
The aims of the pilot were the following: (1) to verify whether the users understood the questions and (2) to prepare the researchers for the usability evaluation in a realistic setting. The pilot was conducted in a pre-medical school with 5 students, 3 male and 2 female, ranging in age from 19 to 22 years. In Brazil, these schools prepare students for the admission exams of undergraduate medical courses, whose candidates have, on average, the highest scores in Brazilian university admissions tests. These students do not represent Xenubi's target audience because they are older and probably more motivated than junior high school students with respect to chemistry. Nevertheless, we decided to run the pilot in this school because we found it very difficult to gain access to high schools.
Two researchers ran this test, one with 5 and the other with 3 years of usability experience. The students performed very well: 4 of them completely understood the game mechanics (3 of them won the game), whereas 1 did not understand how to play. One of the students who won asked whether he could start a new match, because he had been defeated by the computer the first time he played the game. The other 2 students who won the game did not use the "tips" function frequently. After they finished the game, we asked how they chose the properties, and both answered "by the element's position on the table". This answer indicated that they did not need to be reminded of the periodicity of the properties.
All of the students inspected all of the game screens and browsed the tips screen. The average time for each match was approximately 5 minutes, and they only used the Nokia E63 cell phones.
These results suggested that the game interface, game mechanics and test materials were easy to understand.

V. INVITING TEACHERS AND JUNIOR HIGH SCHOOL PRINCIPALS TO PARTICIPATE
We used two approaches to reach the students: asking permission directly from their teachers and sending letters to high school principals. In both cases, we presented a copy of the research project, which was approved and funded by a national research agency, along with the test materials and procedures. We chose the teacher based on a friendly, informal relationship. The requirements for the invitation were teaching chemistry to junior high school students and a willingness to hold the evaluation right after teaching students about the periodic table. We also contacted four high schools, which were chosen based on the target audience profile, but none of them granted permission. From previous experiences, we know that classroom access is easier when teachers are asked directly. Once the teacher consented to allow the research team inside the classroom, the high school board was notified. The usability evaluation took place in a state school located in Canoas, a city in southern Brazil, near the state capital.
We decided to run the usability test in a school for two reasons: (1) better control of the independent variables, such as knowledge of chemistry, age and socio-economic profile and (2) it would be harder to recruit students to the usability lab than to take the usability lab to the students.

VI. TEST MATERIALS AND PROCEDURES
The test materials were the following: (1) an informed consent document, describing the aims and procedures of the test, (2) an interview guide and (3) a questionnaire. Fig. 2 illustrates the test mechanics.
The first step was to explain the test to the class and, if they agreed to take part in the evaluation, to give them some time to read and sign the informed consent document. In this presentation, it was stressed that the students' knowledge of chemistry was not under evaluation, that the aim was to evaluate the game and that they were free to choose not to participate. After this presentation, the teacher helped the class with an assignment that was not related to the subject of the game in one of the laboratory rooms (Fig. 2.a). Groups of five students were then invited to a neighbouring room to play the game. In this second step, each student was accompanied by a moderator, who provided the student with a cell phone (a Nokia C3, a Nokia E63 or an LG C570), as shown in Fig. 2.b. The moderator briefly reminded the students of the test mechanics, that questions about the game or about chemistry would not be answered and that the interaction would be recorded on video. After 1-2 minutes, the moderators began asking questions regarding the game's interface and mechanics. When a student decided to stop playing, he/she was asked to answer a questionnaire about the game features, his/her familiarity with games and cell phones and his/her interest in chemistry, as shown in Fig. 2.c. If the student wanted, he/she could receive a copy of the game via Bluetooth. The teacher knew the game, but she did not show it to her students prior to the test. However, at our request, she had warned the students that there would be an "activity with university researchers" that day. Five moderators helped in the evaluation. Four had previous experience with evaluations: 5, 3, 3 and 3 years. The fifth moderator was a design student.

VII. USABILITY EVALUATION IN A STATE HIGH SCHOOL
The usability evaluation in the state school happened about one week after the subject was taught in the classroom. In this case, the teacher gave us access to two classes of students, one before and the other after the break. The equipment was the following: five cell phones, two full HD video cameras and three 12 megapixel digital cameras, which also record video.
A total of 37 students agreed to participate, comprising 25 girls and 12 boys, with an average age of 15.5 years and a standard deviation of 1.2 years. Because we stressed that participation was voluntary, no one volunteered for the first group. The teacher and the evaluation team had to ask and encourage the students individually in both classes. At the end, approximately 1/3 of the students chose not to participate.

A. Tabulating the data
We had two data sources: video and questionnaires. One of the authors watched the videos and documented each player's match status (win, tie or lose), the duration of the match in seconds and whether the player clicked and browsed any of the game screens. Additionally, the answers to the interview and any comments were transcribed. The images of 6 videos were compromised because the design student did not follow the agreed procedure: he did not wait for the user to stop playing, and asked for the phone back after he finished the interviews. This moderator was also using a very low quality camera: his videos were so overexposed that the phone screen appeared bright white. The second data source was the questionnaires, which were tabulated by another author.

B. Playing on different cell phones
The 3 cell phone models we used in the test have hot keys. We chose these models because they are the cheapest smartphones in Brazil, as touch screen devices are still very expensive. The Nokias (E63 and C3) are very similar, but the LG C570 is different because it has a small track pad that must be used to browse the items up and down and back and forth. All of the students who played with this cell phone had to be told how it worked. Because the track pad's sensitivity is high, the students would sometimes select an item unintentionally. We decided not to use this cell phone model in future usability evaluations. In the questionnaire, 4 students said that they had great difficulty using the cell phone that was provided (they all had used the LG C570), 12 students said they had some difficulty and 21 students said they had no difficulty using the cell phone (none of them used the LG C570).
iJIM - Volume 6, Issue 2, April 2012

C. Understanding game mechanics
The results were not as promising as those of the pilot study: from the video recordings, surprisingly, no student seemed to be able to play the game. In 11 video recordings, we heard expressions such as the following: "OK, what should I do here?"; "How does it work?"; "I am not understanding anything"; "I cannot understand it"; "I have no idea how it works"; "What is my grade?"; and "Where do I answer?". None of the students played the game until reaching victory or started over after losing. The other 20 students (who did not state that they did not understand the game) demonstrated negative emotions while playing: confusion, fidgeting and boredom. We cannot say that the only reason for this reaction was a lack of understanding of the game: a student could fail to play if he understands the game but lacks the required chemistry knowledge. The average match time was approximately 3 minutes (with a standard deviation of 1, which means that there was a substantial amount of variability in those times).
In the questionnaire, 16 students said they did not understand the game, 8 students said that they understood "after 5 trials" and 12 students said that they understood "before 5 trials". One student said that they understood after realising it was a Trump© game and wrote "now I get it" next to the option. However, these answers do not correspond to what we observed on the video. Additionally, in the questionnaire, they said that they felt that the interface was pretty (69%), that the game was fun (52%), that they would play the game outside the school (60%) and that the game was useful from an educational viewpoint (72%). Their feelings about chemistry were not as positive: 60% said they do not like chemistry and 82% said that their understanding of the subject "periodic table" was poor or average.
The game scores indicated that most of the students were playing by chance. Graph 1 shows the frequency of each possible result for a 12-card game (6 for the player and 6 for the computer).
Graph 1. Final scores: player cards x computer cards. Results for 31 video recordings, excluding 6 recordings that were compromised.
As shown in Graph 1, 4 students won the game. However, from the video records, we observed that three of them expressed surprise after winning the game: "I didn't get it, how come?"; "You see… I did not know what I was doing" and "How could I win guessing only two?" A single student almost won the game (11 x 1), and 20 students (approximately 64% of the students) had scores between 8x4 (winning by 2 cards) and 4x8 (losing by 2 cards). Scores such as those are most likely to occur when the player is guessing: you lose one round, and then you win one round. A long run of wins or losses, in contrast, is unlikely to occur by chance alone. Two students won 6 rounds in a row, and one student lost 5 rounds in a row. To simplify the calculations, if we assume that every card has a 50% chance of winning, then the probability of winning 6 rounds in a row is approximately 1.6% (0.5^6 ≈ 0.0156). We cannot calculate the real probability of winning by chance in each game because 21 video recordings were not made with the full HD cameras and thus do not have sufficient resolution to allow identification of the element on each card.
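The streak estimate above can be reproduced with a short calculation. The sketch below is illustrative only: the function name `streak_probability` is hypothetical, and it relies on the same simplifying assumption stated in the text, namely that each comparison is won independently with a probability of 50%.

```python
def streak_probability(p_win: float, length: int) -> float:
    """Probability of `length` consecutive identical outcomes, assuming each
    comparison is independent and has probability `p_win` (simplification)."""
    return p_win ** length

six_wins = streak_probability(0.5, 6)     # 0.5^6 = 0.015625
five_losses = streak_probability(0.5, 5)  # 0.5^5 = 0.03125 (losing is also 50% here)

print(f"6 wins in a row:   {six_wins:.6f}")
print(f"5 losses in a row: {five_losses:.6f}")
```

Under this assumption, either streak is expected in only a few percent of cases for a player who is guessing, which is why such results stand out from the scores clustered around 6x6.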
All of the students started playing the game immediately after receiving the cell phone and did not click to read the instructions, which were available in the first screen. A total of 3 students read the instructions after they started playing. It is interesting to note that even after reading the instructions, these students still did not seem to understand the game. One of these students said that she still could not play well because she lacked the required chemistry knowledge. Another interesting piece of data is related to the frequency of use of the "tips" button: 15 students pressed it, but none of them browsed the "tips" screen (some of these students pressed the button more than once). However, in the questionnaire, approximately 84% of the students (31 students) said that they observed and pressed the "tips" button, a result that is inconsistent with the video recordings.
Another issue involves familiarity with cell phone games. Based on the results of the questionnaire, 6 students (approximately 16%) have smartphones (this question indicated that a smartphone would have Wi-Fi connectivity). A total of 23 students have cell phones that are not smartphones (14 students did not know the model), which is approximately 62%. A total of 4 students do not have cell phones, and 4 students did not answer the question (adding up to 21%). We believe that this lack of familiarity with cell phone games may be a relevant independent variable.

D. Understanding the game interface
With respect to the game interface, the results are promising: it seems that the function of each of the elements was clear. A total of 23 students (approximately 62% of the sample) correctly answered the question "Who is winning the game?", which referred to the interpretation of the two-colour bar in the main game screen. A total of 27 students (approximately 73%) correctly answered the question "How many cards do you have?", related to the cards near the player's character. A total of 16 students, approximately 43% of the sample, correctly answered the question "Which property did you choose?", which was an unexpected result because we expected that almost every student would answer that the chosen property was the property that was highlighted in the results screen. From the video analysis, we observed that 10 students attempted to browse this screen up and down, which may indicate that they expected another decision level after this screen. Because we had observed this behaviour in previous tests, we believed that it could be avoided by employing a table-like layout. The last question of the interview was "Who won this round?": 17 students, approximately 46% of the sample, answered the question correctly. With respect to this result, we can say that the consequence of the property choice is not completely clear; it would be clearer if we used an expression such as "You win" or "You lose". We chose to use stick figures and an illustration of Dr. Moseley instead.
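The percentages reported for the interview questions follow directly from the counts and the sample of 37 students. A minimal sketch of that tabulation is given below; the variable and function names are hypothetical, and only the counts come from the text.

```python
# Correct answers per interview question, as reported in the text (37 students).
SAMPLE_SIZE = 37
correct_answers = {
    "Who is winning the game?": 23,
    "How many cards do you have?": 27,
    "Which property did you choose?": 16,
    "Who won this round?": 17,
}

def percent(count: int, total: int = SAMPLE_SIZE) -> int:
    """Share of the sample as a percentage, rounded to the nearest integer."""
    return round(100 * count / total)

rates = {question: percent(count) for question, count in correct_answers.items()}
for question, rate in rates.items():
    print(f"{question:32s} {rate:3d}%")
```

Keeping the raw counts next to the derived percentages makes it easier to check each figure against the sample it was computed from (37 questionnaires and interviews versus 31 usable video recordings).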
The last results of the video recordings concern the "levels" and "save game" buttons. Only one student clicked the "levels" button (but did not change it). When asked "what level are you playing?", no student answered correctly. Regarding the "save" button, no student clicked it, although in the questionnaire, 30 students, approximately 81%, said that they would save the game to play later. This result does not reflect what was observed in the recordings.

VIII. CONCLUSIONS
We had notably different results in the pilot evaluation compared to those in the high school evaluation. In both of the tests, we used the same version of the game, materials and procedures. In the high school evaluation, we used two cell phone models that we did not use in the pilot: the LG C570 and the Nokia C3. The students had difficulty using the LG cell phone because of the track pad-based navigation. Thus, we decided not to use this model in further tests.
The difference was the extremely good performance of the pre-medical students, who were able to play the game on their own. All of the students inspected all of the game screens and browsed the "tips screen", and 3 students won the game. They seemed more engaged with the game, and their average match time was higher (5 minutes versus 3 minutes); in addition, we did not notice any negative emotions regarding the game.
In contrast, the students of the state high school were unable to successfully play the game. From the analyses of the 31 video recordings that we utilised, 64% of the matches were played by chance. In all of those videos, we could detect negative emotions that, although not verbalised, were visually evident. This result does not match the questionnaire data, which show that 55% of the 36 students understood the game (one questionnaire could not be computed), that 60% would play the game outside of school and that 52% found the game fun.
From the video recordings, we also observed that only three students read the instructions and that none read the instructions before they started playing. Even the students who read the instructions could not play the game. One of these students admitted that she could not play because she lacked the required knowledge. We also observed that 15 students pressed the "tips" button (approximately 48% of the 31 students recorded) but none navigated the tips screen. Again, these results conflict with those of the questionnaire, where approximately 84% (31 out of 37) of the students stated that they pressed the tips button.
Based on the results of the questionnaire, we observed that approximately 62% of the students did not have smartphones and that 21% did not answer, did not know the cell phone model or did not have a cell phone. This result indicates that 83% of the students did not have familiarity with games for smartphones. We think that this measurement may be a relevant independent variable.
Regarding the interface, the results are promising: between 43% and 73% of the students correctly answered the interview questions about the interface elements, such as game and round status, the property chosen and the number of cards. However, we expected that almost every student would correctly identify which property was chosen, whereas only 43% did.
The "levels" and "save" features were included in the prototype and submitted for testing because the design team did not reach a consensus about them. According to the video recordings, they are not necessary: only one student clicked the "levels" button and none clicked the "save" button. However, in contrast to the video observations, the answers in the questionnaire show that 81% would save the game.
From the data gathered in the video recordings and the questionnaire answers, we cannot determine whether the game's poor performance was caused by not knowing the Super Trump© game, a lack of knowledge of chemistry, a lack of familiarity with cell phone games or a lack of motivation. We believe that all of these factors might be relevant independent variables.
We would have obtained better, albeit fewer, data if we had not attempted to interview as many students as possible. Although 1/3 of the students did not take part in the evaluation, we ran out of time with both classes. Additionally, we had to use lower quality equipment, i.e., three digital cameras. Because we had so many students to evaluate the game, we asked a design student, who had no experience with usability tests, to act as a moderator, and we did not use the video data he gathered.
Perhaps future studies would benefit from interviewing the students after they finished playing and asking whether they understood the game and, if so, what they understood. We also should have asked how they chose the properties. The reason that we did not conduct this interview is that we were confident that the game would be easily understood, as it was in the pilot study.
Future studies will replicate this test in a private school, where we expect, based on teacher testimony, that the students will be familiar with cell phone games.