Student-Graded Oral Presentations

We describe how peer-graded oral presentations can be used to reduce the load on the teacher, and show that the results are almost identical to those achieved with teacher-graded presentations. Moreover, we have found that very little in the form of explicit criteria is needed.


I. INTRODUCTION
Some of the most important skills for a new engineer seem to be those involving oral presentations [1]. This implies that prospective engineers need to practice these skills during their education. Moreover, these are skills that are easier to master with practical experience.
There has fortunately been an increase in student involvement in courses lately, mostly in the form of self or peer assessment. The main reason for this is that the students will be more active and will thus gain more from their studies (see e.g. [2,3]). Both self and peer assessment can be used to assess writing [4,5,6,7,8] as well as presentations [9].
It has been shown that well-defined assessment criteria are helpful in getting good (or at least consistent) assessments, but no conclusive evidence has been shown with regard to the influence of age brackets, educational levels or sub-assessments of various criteria [9]. We will, partly because of this, be moderate in our discussions.
Another important question to look at is that of the trustworthiness of the results from these assessments, i.e. the reliability and validity of the results. While most of the papers on the quality of peer assessments focus on either reliability (mainly between peer assessments) or validity (between peer assessments and teacher assessment), as can be seen in the meta-analysis of [9], we will look at both in our assessment of our dataset. This dataset is also bigger than any of those found there, meaning that we can apply advanced statistical methods to it.
Main ideas of this paper:
1. Students in advanced courses are able to grade fairly without being given explicit grading criteria.
2. The students and teachers will, on average, give the same grades to each group.
3. Given enough students, summative assessments can be given by their peers, using the teachers (or teaching assistants) as fail safes.

A. Background
The course in Computer Architecture at the Computing Science department at Umeå University was a C level course (second highest level) in the pre-Bologna system that was used in Sweden. The course could be used as either a last course in a Bachelor's degree or as an advanced course in a Master's degree. The course had three mandatory assignments and a written exam, and used a system where 20% of the final marks came from the assignments. These assignments were performed in pairs, but could be done individually if the student so chose. The 20% was given in lumps of 5% for each assignment if they were handed in on time (with deductions for being late) and had a passing grade before the exam. The final 5% came from an oral presentation, which was originally graded by the teacher.
The first and second assignments were to write assembly language for a number of virtual machines [10,11,12] and to implement one of the virtual machines in any computer language, respectively.
The third assignment was to write a short technical report on something within the computer architecture field, such as a processor, a bus or any type of storage media. The reports were heavily edited and collected in a proceedings volume in order to model a workshop as closely as possible. This increased the likelihood that the students actually turned in their assignments on time; everyone wanted to be in the proceedings. This also meant that the final assignment could yield up to 10% of the final marks.
During the first few years there were a number of students that contested the gradings, all the way up to heated arguments. We wanted to see if that could be alleviated by letting the students perform peer-grading [13], and the results from those experiments are presented here.
The same assignments were used for a few more years after this, but the teacher that took over the course unfortunately did not keep any records. The peer-reviewed oral presentations were after this moved to another course, with the format changed in such a way that later data cannot be directly compared to those shown here.

II. METHODS
The students could make any type of presentation that they could think of. Moreover, they could use any means available for the actual presentation, including overhead slides, the whiteboard, a tape recorder, etc. The one rule that had to be followed was that the presentation must fit in the allotted time slot, between eight and ten minutes (depending on year).
Each presentation was graded by one teaching assistant (called teacher in all tables and figures) and at least all the students that presented in the same hour. The presentations were open for anyone to attend and grade, including other teachers and students. I personally sat in on one of the presentation tracks, as backup and extra support for that teaching assistant.

The grades were given individually by each grader, one grade for each presentation group (see footnote 1). There were six possible grades to give to a presentation, ranging from zero to five. The rationale behind this was twofold:
• The grade given to a presentation should match what it would be worth on the final exam.
• There should be no single middle score to choose, thereby forcing the students to make a choice.
We had, moreover, added the extra rule that there had to be differences in the grades between presentations, e.g. a grading paper with all fives would be ignored in the process.
The only guidelines given to the students were the following: "Grade each group according to how well you thought they managed to get to the core of the subject, how prepared they were, the disposition that was used and how the presentation was done. Do not grade them according to how nervous they were."
The grades were collected in a spreadsheet. The average, median and mode of the grades for each presentation were calculated, and were used directly as the given grade if they agreed with each other. If they disagreed, the grade given by the teaching assistant was used as the deciding vote on what grade to give to that presentation.
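The aggregation rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the spreadsheet that was actually used: the function names are hypothetical, the average and median are assumed to be rounded half-up to a whole grade before comparison, only student grades are assumed to enter the three statistics, and the "deciding vote" is read as picking the candidate grade closest to the teaching assistant's grade (with the lower candidate chosen on a tie).

```python
import math
from statistics import mean, median, multimode


def round_half_up(x):
    """Round to the nearest whole grade, with .5 rounded up (an assumption)."""
    return math.floor(x + 0.5)


def drop_uniform_sheets(sheets):
    """Ignore grading sheets that give every presentation the same grade.

    Each sheet is a dict mapping a presentation group to a grade (0-5).
    """
    return [sheet for sheet in sheets if len(set(sheet.values())) > 1]


def final_grade(student_grades, ta_grade):
    """Combine the student grades for one presentation into a final grade."""
    # Candidate grades: rounded average, rounded median and every mode.
    candidates = {
        round_half_up(mean(student_grades)),
        round_half_up(median(student_grades)),
        *multimode(student_grades),
    }
    if len(candidates) == 1:
        # Average, median and mode agree, so the grade is used directly.
        return candidates.pop()
    # Assumed reading of the "deciding vote": take the candidate closest to
    # the teaching assistant's grade, preferring the lower one on a tie.
    return min(candidates, key=lambda g: (abs(g - ta_grade), g))
```

For example, the student grades 5, 5, 3, 3, 3, 4, 5, 5 give a rounded average of 4 but a median and mode of 5, so in this sketch the teaching assistant's grade decides between 4 and 5.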

III. RESULTS
A total of 2310 votes were given to the 112 presentations held in 2003-2006, disregarding no-shows, which automatically got a zero. All averages and counts of this dataset can be seen in Table I. Fig. 1 contains the average given grade as well as the 99% confidence intervals of each type, using the normal distribution for the students' grades and the t-distribution for the teacher's grades. There is very little difference between the years, and most of the differences between years are not statistically significant.
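As an illustration of the normal-based interval used for the students' grades, a 99% confidence interval for a mean can be computed as sketched below. The function name and the sample grades are hypothetical, and the sketch covers only the normal case; the t-based interval used for the teacher's grades would replace the normal quantile with a t quantile, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def normal_ci(grades, confidence=0.99):
    """Two-sided confidence interval for the mean, using the normal distribution."""
    # Two-sided quantile, roughly 2.576 for a 99% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    centre = mean(grades)
    # Standard error of the mean, based on the sample standard deviation.
    half_width = z * stdev(grades) / sqrt(len(grades))
    return centre - half_width, centre + half_width


# Hypothetical grades for illustration only, not data from the course.
low, high = normal_ci([3, 4, 4, 5, 3, 4])
```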
The difference between the staff and the student grades has been checked as well, yielding a very interesting pattern over all gradings. It is close to normally distributed, with µ=-0.047316, a skew of 0.021278, a kurtosis of -0.13622 and a standard deviation of s=1.0391 over all four years. Looking at each year (rather than each presentation track or in total) yields a slightly different picture, but the differences are still very small, as can be seen in Table II.
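The summary statistics quoted for the grade differences can be computed along the following lines. This is a sketch under assumptions: the function name is hypothetical, each difference is taken to be one student grade minus the teaching assistant's grade for the same presentation (the direction is not stated in the text), and skew and kurtosis use the plain moment-based definitions with kurtosis reported as excess kurtosis, which may differ slightly from the estimators used for the published numbers.

```python
from math import sqrt


def describe_differences(diffs):
    """Mean, sample standard deviation, skew and excess kurtosis of a list."""
    n = len(diffs)
    mu = sum(diffs) / n
    # Central moments of order two, three and four.
    m2 = sum((d - mu) ** 2 for d in diffs) / n
    m3 = sum((d - mu) ** 3 for d in diffs) / n
    m4 = sum((d - mu) ** 4 for d in diffs) / n
    return {
        "mean": mu,
        "sd": sqrt(m2 * n / (n - 1)),   # sample standard deviation
        "skew": m3 / m2 ** 1.5,         # 0 for a symmetric distribution
        "kurtosis": m4 / m2 ** 2 - 3,   # 0 for a normal distribution
    }
```

Values of skew and excess kurtosis near zero, as reported above, are what one would expect if the differences are close to normally distributed.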

A. Reliability Estimates
It is possible to make estimates of the reliability for the numbers using analysis of the variance in the dataset. The test statistic F (defined as variance between groups divided by variance within groups) given by one-way analysis of variance (ANOVA) can be used to calculate the reliability of the averaged mark (r_nn) and the estimated reliability of the individual raters (r_11), as given in (1) [13,14]. The results of these calculations can be seen in Table III.

(1)

1 Each presentation group corresponded to a subject and usually consisted of one or two students.

Figure 1. The average given grade as well as the 99% confidence intervals for the grades given by students (average, average median and average mode) and the teaching assistants per year.
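To make the reliability estimates more concrete, the sketch below computes the one-way ANOVA statistic F with each presentation as a group, and then derives r_nn and r_11 from it. The formulas used, r_nn = (F - 1) / F and r_11 = (F - 1) / (F + n - 1) with n raters per presentation, are the commonly used ANOVA-based estimates and are an assumption here, not a quotation of the paper's equation. The function names and sample ratings are hypothetical, and n is treated as equal for all presentations.

```python
from statistics import mean


def f_statistic(ratings):
    """One-way ANOVA F: variance between groups over variance within groups.

    ratings maps each presentation to the list of grades it received.
    """
    groups = list(ratings.values())
    k = len(groups)
    total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (total - k))


def reliability(f, n_raters):
    """Assumed estimates: averaged mark (r_nn) and single rater (r_11)."""
    r_nn = (f - 1) / f
    r_11 = (f - 1) / (f + n_raters - 1)
    return r_nn, r_11


# Hypothetical ratings for illustration only, not data from the course.
sample = {"A": [5, 4, 5], "B": [2, 3, 2], "C": [4, 4, 3]}
r_nn, r_11 = reliability(f_statistic(sample), n_raters=3)
```

Under these formulas the two estimates are linked by the Spearman-Brown relation, so r_nn grows towards one as more raters are averaged even when the reliability of a single rater is modest.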

IV. DISCUSSION
Looking at the results in Table I reveals some rather interesting tidbits of information: while the student average median and average mode grades were non-decreasing, the student average was actually decreasing 2004-2006. We attribute the decreasing student average to slightly more critical students, as well as a decrease of students from one program. The change in the other two is, however, not significant, but might indicate that more than six levels could have been used to get more information.
One of the teaching assistants from 2003 and 2004 was probably a bit too critical and the teaching assistant from 2006 was instead overly positive, according to the data. It is unfortunately very hard to guard against things like this, but it did not make that much difference in the end because of the large number of students that were more critical.
A very interesting question is "What should have been done differently?" The most obvious thing to change would be to increase the number of grading levels, with a corresponding increase in the number of points given by the assignments. Doubling the points from the oral presentation would not be entirely out of order, since it was a very important and large part of the course. It was also one of the most frequent suggestions found in the course evaluation.
As a closing remark, I would say that the average grades over these four years are very balanced between teaching assistants and students. The closeness of grading can be seen in Figure 2. In fact, the teaching assistants gave out on average 3.56 points per group and the students had an average of 3.55 points, meaning that the results are closer in grouping than any of the studies found in [9]. The students did an excellent job of grading each other, and using peer grading in a course will not incur any extra costs except possibly for collecting the data and performing the calculations.

V. ACKNOWLEDGEMENTS
The author would like to thank Peter Jacobsson, from whom the Computer Architecture course was inherited. Moreover, the author is forever in debt to all of the teaching assistants and students who participated in the experiment over the years.
AUTHOR
O. M. Ågren is with the Department of Applied Physics and Electronics, Umeå University, SE-901 87 Umeå, Sweden (e-mail: ola.agren@umu.se). The data presented in this paper was collected while he was finishing his PhD thesis at the Computing Science Department at the same university.