Automated Assessment, Face to Face

This research paper evaluates the usability of automated exams and compares them with traditional paper-and-pencil exams. It presents the results of a detailed study conducted at The University of Jordan (UoJ) that covered students from 15 faculties. A total of 613 students were asked for their opinions about automated exams, and their responses were analyzed in depth. The results indicate that most students are satisfied with automated exams but have suggestions for improving their deployment.


I. INTRODUCTION
As new technologies emerge, computers have entered all aspects of our lives. One of the major applications of computers in education is assessment, which covers both real exams and self-assessment performance indicators.
The use of computers in these two kinds of testing (real exams and self-assessment) has been widely welcomed by experts around the world as a way to measure knowledge and technical skills. For this purpose, some companies have produced dedicated applications that facilitate the assessment process, such as Quiz Creator (http://www.quiz-creator.com) and QuestionMark (http://www.questionmark.com). Some organizations, such as UoJ, have instead developed their own assessment software.
Automated exams can be networked exams, computer-based tests (CBT), or computer-adaptive tests (CAT). A recent advancement in networked automated exams is the Internet-Based Test (IBT). An IBT can be administered from any place in a university, a ministry, or anywhere in the world. Organizations that administer such exams always take strict security precautions to protect the exams and the associated information. This type of exam was brought into operation in September 2005 [1] by the Educational Testing Service (ETS) to administer the Test of English as a Foreign Language (TOEFL). Earlier, in 1992, the Graduate Record Examination (GRE) started to be computerized, although the Scholastic Aptitude Test (SAT) and the Law School Admission Test (LSAT) still follow the paper-and-pencil style, where the student has to wait some time, sometimes a few weeks, before getting the results [2]. The content of a CBT exam is similar to that of the manual paper-and-pencil exam, and the exam normally proceeds in one direction. The advantages of the CBT exam over the traditional manual one are exam security and automated grading. In CAT exams, the answer to a question affects the difficulty level of the following questions, which are selected by the computer. The difficulty level may go up or down based on the history of previous answers until it settles at a stable level. This method may conclude the exam before the whole set of questions has been presented, since the computer will have enough information to judge the level of the person being tested.
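The adaptive behaviour described above can be illustrated with a minimal sketch. It assumes a simple staircase rule (one level up after a correct answer, one level down after a wrong one) and a hypothetical stopping criterion; the function name, the five difficulty levels, and the parameters are illustrative assumptions, and operational CAT systems typically rely on statistical models such as item response theory rather than this simplified rule.

```python
def run_adaptive_test(answer_fn, levels=5, start=3, max_questions=20, stable_run=4):
    """Hypothetical staircase CAT (not the method of any specific exam).

    answer_fn(level) returns True if the examinee answers a question of the
    given difficulty correctly. The test stops early once the difficulty has
    stayed within a band of one level for `stable_run` consecutive questions.
    """
    level = start
    history = []  # (difficulty level asked, answered correctly?)
    for _ in range(max_questions):
        correct = answer_fn(level)
        history.append((level, correct))
        # Raise the difficulty after a correct answer, lower it after a wrong one.
        level = min(levels, level + 1) if correct else max(1, level - 1)
        recent = [asked for asked, _ in history[-stable_run:]]
        if len(recent) == stable_run and max(recent) - min(recent) <= 1:
            break  # the level has stabilised, so the exam can end early
    return level, history


# Hypothetical examinee who can answer questions up to difficulty level 3.
final_level, history = run_adaptive_test(lambda level: level <= 3)
print(final_level, len(history))
```

In this sketch the exam ends after four questions instead of the maximum of twenty, which mirrors the early termination mentioned in the text.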
Before conducting this study, the researchers expected that all students would be in favor of automated exams. This expectation was based on the following facts. (1) Automated exams have higher test validity. Especially in CAT, it is possible to track students' progress and to record the statistics needed to identify the strengths and weaknesses of the test: the effect of the distracters in the various questions, the time spent answering each question, the questions for which the student changed his/her answer, and the questions that were skipped and returned to later. These factors help test developers improve test validity. (2) Automated exams are more secure. Traditional exams are delivered manually, in person, through various stages before they reach students, whereas automated exams can be delivered directly to students. Even though cheating in automated exams is possible, it is less likely to occur because of the expected randomness in choosing and distributing questions.
However, automated exams have some drawbacks. Some characteristics of traditional exams are missing: for example, the student might not be able to view the text or the questions as a whole, and cannot use a pen to highlight important phrases. Automated exams require advanced technologies and a certain level of knowledge to deal with them. An additional initial time investment is also needed to generate automated exams. Furthermore, automated exams may not be suitable for all course contents, such as writing algorithms or the steps of a formula proof.
In this paper, an empirical study is presented to evaluate the usability of automated exams and to compare them with traditional ones. The study was conducted at UoJ and covered students from various faculties. A total of 613 students were asked for their opinions about automated exams. The results indicate that the majority of students are satisfied with automated exams, even though they have suggestions for improving their deployment.
The rest of this paper is organized as follows. Section II reviews automated assessment issues and some related studies. Section III presents the research method; it describes the purpose, the participants, and the questionnaire. The results and the analysis are discussed in Section IV. Finally, the conclusion is drawn in Section V.

II. LITERATURE REVIEW
The literature on automated assessment is rich, but little of it evaluates automated assessment from the perspective of university students. According to reference [3], the issues that influence trust after changing from traditional testing to automated testing can be minimized by implementing more checks for plagiarism and cheating. The authors also showed that future developments may increase trust by standardizing the results and reassessing ambiguous questions. They emphasized that the examination board has to trust its employees and associated personnel, because exam information is stored electronically and can therefore be edited before or after the exam; computer-based assessment is open to different methods of cheating.
In reference [4], the authors discussed, among other factors, the principles, advantages, and challenges of online assessment. Security is a major issue in online assessment because of authentication difficulties. They indicated that, in the absence of face-to-face interaction, assessment and measurement become more critical. The authors showed that assessment is meant to measure the achievement of the learning goals through two categories of assessment: formative and summative. Formative assessment provides continuous feedback to both the learner and the instructor, while summative assessment is meant to assign a value to what has been learned.
Two different studies are covered in [5]. The first study, which is similar to this research, asked 162 students which format of an accounting exam they prefer. The second study examined the reasons behind their selection. Both studies showed that the reasons for selecting a computer-based exam over a traditional paper-and-pencil one differ and can be explained. Out of the 162 students, 89 preferred the paper-and-pencil format while the remaining 73 preferred the computer format. Those who preferred the paper-and-pencil exam indicated greater comfort with this type of test, while those who preferred the computer test cited the quick feedback as the reason for their selection.
In reference [6], the authors conducted a study to examine whether the results of online assessment were significantly different from those of the traditional paper-and-pencil style. Among other results, the study revealed no difference between the two methods in terms of age and grade point average. In addition, no differences were reported between the two methods with respect to gender, ethnicity, and educational level. The only difference reported was the longer time it took students to complete the paper-and-pencil exam.
A study that examined the attitudes of prehospital undergraduate students undertaking a web-based examination in addition to the traditional paper-and-pencil one [7] showed high student satisfaction with, and acceptance of, web-based exams. That study indicates that web-based assessment should become an integral component of prehospital higher education. It concentrated on one category of students, whereas the present study covers almost all types of students studying at a large university such as UoJ.

III. METHOD
The purpose of this research study is threefold: first, to assess the satisfaction level of UoJ students with automated exams by faculty; second, to assess their satisfaction level by university level (year of study); and third, to assess their satisfaction level by gender. In none of these analyses did we intend to compare the numbers of students in each group. For instance, when studying the effect of gender on satisfaction, we did not treat the ratio of male to female students as a major determining factor; the satisfaction percentage of each group, however, was calculated.
For this purpose, a sample of 650 undergraduate students was asked to answer a set of 29 questions (see Appendix A). The sample size (650) resulted from the researchers' decision to question, all in one day, the maximum number of students who had recently used the automated assessment. For that reason, the researchers went to the IT labs in the King Abdullah II School for Information Technology (KAIISIT), to the classes of the Computer Skills 2 service course, which is offered to most university faculties, in addition to other IT courses; this explains why IT students are the majority in the sample. The total number of students who could be questioned on that day was 650. The questions were then grouped into seven categories for analysis purposes. After all the completed questionnaires were reviewed, 37 invalid questionnaires were rejected for the following reasons: (1) some important information, such as the faculty, student level, or gender, was missing; (2) the questionnaire was filled in carelessly by selecting "strongly agree", "strongly disagree", or "I don't know" for the whole set of 29 questions; (3) one side of the questionnaire was filled in and the other was left blank.
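The three rejection rules above can be sketched as a small screening function. This is a minimal illustration, not the researchers' actual procedure: the `Questionnaire` record, the field names, and in particular the `side_split` parameter (which questions were printed on which side of the sheet) are hypothetical assumptions, since the paper does not describe the form layout.

```python
from dataclasses import dataclass
from typing import List, Optional

NUM_QUESTIONS = 29
# Answers that, when repeated for all 29 questions, mark a careless response.
CARELESS_ANSWERS = {"strongly agree", "strongly disagree", "I don't know"}


@dataclass
class Questionnaire:
    faculty: Optional[str]
    level: Optional[str]
    gender: Optional[str]
    answers: List[Optional[str]]  # one entry per question, None if left blank


def rejection_reason(q: Questionnaire, side_split: int = 15) -> Optional[str]:
    """Return why a questionnaire is invalid, or None if it is accepted.

    side_split is a hypothetical assumption: the first `side_split`
    questions are taken to be on the front of the sheet, the rest on the back.
    """
    # Rule 1: important information is missing.
    if not (q.faculty and q.level and q.gender):
        return "missing information"
    # Rule 2: the same extreme or neutral answer was chosen for every question.
    given = [a for a in q.answers if a is not None]
    if len(given) == NUM_QUESTIONS and len(set(given)) == 1 and given[0] in CARELESS_ANSWERS:
        return "filled in carelessly"
    # Rule 3: one side of the sheet was left entirely blank.
    front, back = q.answers[:side_split], q.answers[side_split:]
    if all(a is None for a in front) or all(a is None for a in back):
        return "one side left blank"
    return None
```

Applied to the 650 collected questionnaires, a filter of this kind would leave the 613 valid ones reported in the study.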
As shown in Table I, this study included 613 students from 15 faculties of UoJ. The table lists the faculties in ascending order of the number of students included from each faculty.
For the purpose of this study, the questions were classified into seven major categories; Table II summarizes those categories and the questions belonging to each. It is important to point out that questions 10, 15, 17, 21, and 26 (shown inside squares in Table II) are negatively worded; the researchers therefore reversed their meaning when analyzing the data so that higher values consistently indicate satisfaction with automated exams.
For each question, the student had to select one of five answers. Table III shows the answers available to students and their weights.
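The weighting and the reversal of the negatively worded questions can be sketched as follows. The weights 4 to 0 and the list of reversed questions come from the text; the assignment of each answer label to a weight is an assumption made for illustration, because Table III is not reproduced here, and the reversal rule (maximum weight minus the original weight) is a common convention rather than a rule stated in the paper.

```python
# Assumed mapping of answers to the weights 4..0 (Table III is not reproduced here).
WEIGHTS = {
    "strongly agree": 4,
    "agree": 3,
    "I don't know": 2,
    "disagree": 1,
    "strongly disagree": 0,
}
MAX_WEIGHT = 4
# Negatively worded questions listed in the text.
NEGATIVE_QUESTIONS = {10, 15, 17, 21, 26}


def score(question_number: int, answer: str) -> int:
    """Return the weight of an answer, reversed for negatively worded questions."""
    weight = WEIGHTS[answer]
    if question_number in NEGATIVE_QUESTIONS:
        # Assumed reversal rule: agreeing with a negative statement counts
        # as dissatisfaction, so the scale is mirrored around its midpoint.
        return MAX_WEIGHT - weight
    return weight
```

With this convention the neutral answer keeps its weight of 2 in both directions, so the midpoint of the scale is unaffected by the reversal.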
Before conducting this study, the following hypotheses were postulated and tested at the 0.05 level of significance:
• Automated exams are good tools for evaluating students at all university levels for theoretical contents.
• Automated exams require types of questions other than multiple-choice questions.
• Students trust the results of automated exams.
• Automated exams are a secure tool for evaluating students fairly and quickly.
• Students are in favor of using automated exams.
• Automated exams require some improvements to satisfy all students' needs.

IV. RESULTS AND ANALYSIS
The t-test is used to indicate the satisfaction level of UoJ students in the various categories under study. The significance level is set to 5% (i.e., p < 0.05). Since the average weight of all answers is 2, calculated as the sum of all weights divided by their count, (4+3+2+1+0)/5, a mean value of 2 or above with p < 0.05 indicates that students are satisfied, and a mean value below 2 with p < 0.05 indicates that students are not satisfied. When p ≥ 0.05, no conclusion can be drawn from the survey.
Table IV shows the overall results for UoJ students and their satisfaction level. In general, students are satisfied with automated exams: they are satisfied with their advantages, duration, and idea, and with the accuracy of the marks, which they consider to be positively affected; they emphasize the importance of seeing their results at the end of the exam; and they trust the marking, the security of the questions, and the report of their mistakes when they receive one. Students are not satisfied, however, with the exam environment; they believe that they encounter difficulties in concentrating on the exam and that some students are able to cheat. For the questions category, it was difficult to draw any conclusion.
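The decision rule described above can be sketched in a few lines. Two assumptions are made and should be read as such: the test is taken to be a one-sample, two-sided test of the mean weight against the neutral value 2 (the paper does not state the exact variant), and the p-value is obtained from a normal approximation to the t distribution, which is reasonable only for large samples. The labels Y, N, and X follow the footnote of Table IV.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

NEUTRAL = 2.0   # average of the weights 4, 3, 2, 1, 0
ALPHA = 0.05    # significance level used in the study


def satisfaction_level(scores):
    """Classify a list of answer weights as 'Y', 'N', or 'X'.

    Assumption: one-sample two-sided test against NEUTRAL, with a
    large-sample normal approximation in place of the exact t distribution.
    """
    n = len(scores)
    m = mean(scores)
    s = stdev(scores)
    if s == 0:
        # All answers identical: the test statistic is undefined.
        if m == NEUTRAL:
            return "X"
        return "Y" if m > NEUTRAL else "N"
    t = (m - NEUTRAL) / (s / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    if p >= ALPHA:
        return "X"  # no conclusion can be drawn
    return "Y" if m >= NEUTRAL else "N"
```

For small groups, such as a faculty with only a dozen respondents, an exact t distribution would be the more appropriate choice than the approximation used in this sketch.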

A. Detailed Results as per Faculties
As an example of how the satisfaction levels are determined, the following explains in detail how the satisfaction levels for the faculty of agriculture and the faculty of art were filled in. As can be inferred from Table V, only the trust category for the faculty of agriculture has a positive indication (i.e., students are satisfied); the researchers were unable to draw any conclusion about the other categories because p ≥ 0.05 for all of them. Before analyzing the results, it is important to explain how the Satisfaction Level Percentage (SLP) is calculated. For each category, the researchers add up the sample sizes (n) of the faculties that are mapped to a given satisfaction level (Y, N, or X) and express the sum as a percentage of the total sample size. Mathematically, it is given by formula (1):
$$\mathrm{SLP}_l = \frac{\sum_{i=0}^{m} n_i}{S} \times 100\% \qquad (1)$$
where S is the total sample size (613 in this study), n_i is the sample size of each faculty, m is the number of different faculties, and i = 0, 1, …, 15. A value of 0 for i indicates that no faculty is mapped to that satisfaction level, and l is the satisfaction level (Y, N, or X).
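The SLP calculation, read here as the share of the total sample that belongs to faculties mapped to a given satisfaction level, can be sketched as follows. The faculty names, sizes, and level assignments in the example are hypothetical placeholders and are not the values of Table VII.

```python
def slp(faculty_sizes, faculty_levels, level):
    """Satisfaction Level Percentage for one category and one level.

    faculty_sizes maps faculty name to its sample size (n_i);
    faculty_levels maps faculty name to 'Y', 'N', or 'X' for the category.
    """
    total = sum(faculty_sizes.values())  # S, the total sample size
    matched = sum(
        n for faculty, n in faculty_sizes.items()
        if faculty_levels[faculty] == level
    )
    return 100.0 * matched / total


# Hypothetical data for a single category.
sizes = {"Faculty A": 50, "Faculty B": 30, "Faculty C": 20}
levels = {"Faculty A": "Y", "Faculty B": "Y", "Faculty C": "X"}

for lvl in ("Y", "N", "X"):
    print(lvl, slp(sizes, levels, lvl))
```

Because every faculty is mapped to exactly one level per category, the three percentages for a category always add up to 100%.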
Taking Table VII as an example, the values of SLP_Y, SLP_N, and SLP_X for the advantages category are calculated as shown in equations (2), (3), and (4). It can be inferred from Table VII that not a single faculty is satisfied with the exam environment (all faculties are VLS), and that the advantages and trust categories have a high satisfaction level. In addition, the idea and mark categories have a satisfied (S) level.
Based on the findings summarized in Table VII, the researchers produced the graphs shown in Figure 1, Figure 2, and Figure 3. To give a formal meaning to the satisfaction level percentage, the researchers made the assumptions shown in Table VIII.
It is evident from Figure 1 that the students are highly satisfied with the advantages (90.7%) and trust (94.5%) categories. They are also satisfied with the mark (81.2%) and idea (78.5%) categories. Their satisfaction with the duration (31.3%) and questions (3%) categories is very low, and they are not satisfied at all with the environment (0%) category. From Figure 2, the rejection level is high for the environment category (86.3%), while there is no rejection (0%) for the advantages, idea, mark, and trust categories. The rejection level for the duration (14.2%) and questions (3%) categories is very low. Figure 3 indicates that the great majority of students (94%) cannot make a decision regarding the questions category, while a low percentage of students (54.5%) cannot make a decision about the duration category. For the other categories (advantages, environment, idea, mark, and trust), the average is (9.3+13.7+21.5+18.8+5.5)/5 = 13.8%, i.e., about 14% of the participants were unable to make a decision regarding automated exams. Although this level is very low, it gives a good indication of the honesty of the results; the conclusion was therefore drawn from data received from students who are 86% sure (100% - 14%) of their opinions.
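The averaging step in the paragraph above can be checked with a short calculation. The five percentages are the X ("cannot decide") values quoted in the text; the variable names are illustrative.

```python
# X-level percentages for the five categories quoted in the text.
undecided = {
    "advantages": 9.3,
    "environment": 13.7,
    "idea": 21.5,
    "mark": 18.8,
    "trust": 5.5,
}

mean_undecided = sum(undecided.values()) / len(undecided)
decided = 100 - mean_undecided

print(round(mean_undecided, 2), round(decided, 2))
```

The unrounded mean is 13.76%, which the text reports as 13.8% and then as about 14%, leaving roughly 86% of participants with a definite opinion.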

B. Detailed Results as per Student's University Level
As shown in Table IX, first-year students represent 40.6% of the sample, second-year students 18.6%, third-year students 20.6%, fourth-year students 16.6%, and students in their fifth year or beyond 3.6%. Table IX also shows that students at all university levels (1st year, 2nd year, etc.) are highly satisfied (100%) with the trust category of automated exams. This satisfaction level decreases only slightly, to 96.4%, in the advantages, idea, and mark categories, where it covers the students who have been at the university for four years or less.
On the other hand, 96.4% of the students, those who have spent no more than four years at the university, are not satisfied with the exam environment. About 60% of the students, those who have already finished the first year (2nd year, 3rd year, etc.), cannot make a clear decision about the exam duration, while only the first-year students (40.6% of the sample) are satisfied with the exam duration. As is the case with almost all the reported results, students at all university levels are unable to make any decision regarding the exam questions.

C. Detailed Results as per Student's Gender
Table X summarizes the satisfaction level by student gender. The table shows that females make up 65% of the sample and males 35%. It also shows that both male and female students (100%) are satisfied with the advantages, idea, mark, and trust categories. On the other hand, neither male nor female students are satisfied with the exam environment category (their satisfaction level percentage is 0%). For the duration and questions categories, the satisfaction level percentage drops to a very low 35.0%, which corresponds to the male students. Furthermore, Table X shows that the female students (65% of the sample) are not satisfied with the questions category, and the same group cannot make a decision about the duration category.

D. Comparison Among the Three Groups
The results were classified into three groups: results by faculty, results by university level, and results by gender. These results were then compared with one another and summarized in Table XI. To simplify the analysis, three graphs were produced; they are shown in Figure 4, Figure 5, and Figure 6.
When the Y satisfaction level is compared, Figure 4 indicates that the difference between the three groups is minimal for the advantages, duration, and trust categories. It can also be noticed that the students' opinions agree between the university-level and gender groups for all categories except the questions category.
As shown in Figure 5, when the dissatisfaction level (N) is compared across the three groups, there is agreement for all categories except the questions and duration categories. In Figure 6, when the "I don't know" level (X) is compared, however, the researchers found agreement in the duration category. To sum up, the researchers found little variation in students' opinions about their satisfaction level.

E. Major Comments from Participants
In addition to the 29 questions, the questionnaire left a blank space for any extra comments the students wished to add. Out of the 613 students, 89 students recorded 105 comments; i.e., some students wrote more than one comment. The major comments added by students are summarized in Table XII.
It is evident from Table XII that about 25% of the comments suggest that automated exams are good for some types of content but not for all course contents; for instance, students prefer that physics and C++ courses follow the traditional style rather than being automated. Also, about 24% of the comments ask that students be supplied with a detailed report of their mistakes.
Also from the comments shown in Table XII, the students emphasize the importance of being able to navigate through the questions in both directions, not only forward: about 11% of the comments ask that students be allowed to change their previous answers after answering and leaving a question. About 10% of the comments raise the functionality of the computers as an important issue for having the exams run smoothly without affecting students' achievement. Issues such as the ability to collect previous questions, the exam environment, cheating, and exam errors each account for about 3% of the comments. About 2% of the comments each raise points about the number of questions, partial credit for some questions, anxiety about the remaining time, and showing the final exam results. Because the final exam result is linked to the students' overall results and the curve of the course, results require processing before students can be told their final grades. Each of the remaining 12 comments shown in Table XII has a weight of 1%. The researchers believe that, among these 12 comments, warning students about the remaining time, installing cameras, and separating students in the exam hall are good suggestions that need to be addressed; the remaining 9 comments are covered, directly or indirectly, by the 29 questionnaire questions.

V. CONCLUSION
This research was threefold: the first aim was to assess the satisfaction level of all students with automated exams, the second was to examine the effect of students' university level on their satisfaction, and the third was to assess the effect of students' gender on their satisfaction.
The study asked a sample of 650 students a set of 29 questions each. Because some questionnaires were missing important information, some had been filled in carelessly by selecting "strongly agree", "strongly disagree", or "I don't know" for the whole set of 29 questions, and some had one side left unanswered, 37 of the 650 participants were rejected and dropped from the sample. This research paper evaluated the usability of automated exams and compared them with traditional paper-and-pencil exams. It presented the results of a detailed study conducted at UoJ that covered students from 15 faculties; the opinions of the questioned students were analyzed in depth. The overall satisfaction results by faculty, university level, and gender indicate that students are in favor of continuing to use automated exams but are looking for some improvements: recognizing that automated exams are good for some types of content but not for all course contents, supplying students with a detailed report of their mistakes after the exam, and allowing students to navigate through the questions in both directions, not only forward.

TABLE I.
FACULTIES AND NUMBER OF STUDENTS

TABLE IV.
THE OVERALL RESULTS FOR THE 613 STUDENTS (N=613).
*Y means yes, N means no and X means cannot judge or neutral.

TABLE V.
RESULTS FOR THE FACULTY OF AGRICULTURE (N=12).

TABLE VI.
RESULTS FOR THE FACULTY OF ART (N=81).

Table VI shows the results for the faculty of art. Excluding the questions category, from which no conclusion can be drawn, the duration and environment categories have a negative satisfaction level. This means that UoJ art students expect a better exam environment and exam duration. Please refer to Table II for more details on the areas covered in each category. The students, however, are satisfied with the advantages, idea, mark, and trust categories of the exams.

TABLE VII.
RESULTS SUMMARY FOR ALL FACULTIES.

TABLE VIII.
SATISFACTION LEVEL PERCENTAGE RANGES.

TABLE X.
RESULTS AS PER STUDENT'S GENDER.

TABLE XII.
TRANSLATION OF STUDENTS' EXTRA COMMENTS.