Two Algorithms for Web Applications Assessment

The usage of web applications can be measured with metrics. In an LMS, a typical web application, there are no appropriate metrics that facilitate its qualitative and quantitative measurement. The purpose of this paper is to propose the use of existing techniques in a different way, in order to analyze the log file of a typical LMS and draw useful conclusions. Three metrics for course usage measurement are used. The paper also describes two algorithms for course classification and for suggesting actions. The metrics and the algorithms were applied to Open eClass LMS tracking data of an academic institution. The results from 39 courses yielded interesting insights. Although the case study concerns an LMS, the method can also be applied to other web applications such as e-government, e-commerce, e-banking, blogs, etc.

The chapter is organized as follows. Section 2 describes the background theory. Section 3 describes the logging data and preprocessing procedure. Section 4 describes the data processing procedure with the introduced metrics. Section 5 describes two algorithms. Section 6 presents the conclusions along with future directions.

Background
There are several studies that show the impact of data mining on eLearning. While data mining methods have been used systematically in many e-commerce applications, their utilization is still lower in LMSs [24]. It is important to note that traditional educational data sets are normally small [7] compared to the files used in other data mining fields, such as e-commerce applications that involve thousands of clients [17]. This is due to the typical, relatively small size of the classroom, although it varies depending on the type of the course (elementary, primary, adult, higher, tertiary, academic and/or special education); the corresponding transactions are therefore also fewer. The user model also differs between the two kinds of system [16].
Of particular interest is the iterative methodology for developing and maintaining web-based courses, to which a specific data mining step was added [5]. The proposed system finds, shares and suggests the most appropriate modifications to improve the effectiveness of the course. The discovered information is used directly by the educator of the course in order to improve instructional and learning performance. The system recommends the improvements necessary to increase the interest and the motivation of the students. It is well known that motivation is essential for learning: lack of motivation is correlated with a decrease in learning rate [3].
There are several specialized web usage mining tools that are used in e-learning platforms. CourseVis [10] is a visualization tool that tracks web log data from an LMS. By transforming this data, it generates graphical representations that keep instructors informed about what precisely is happening in distance learning classes. GISMO [11] is a tool similar to CourseVis, but it provides different information to instructors, such as details of the students' use of the course material. Sinergo/ColAT [2] is a tool that acts as an interpreter of the students' activity in an LMS. [13] describes a tool which uses log files to represent the instructor-student interaction in a hierarchical structure. MATEP [25] is another tool, acting at two levels. Firstly, it provides a mixture of data from different sources, suitably processed and integrated. These data originate from e-learning platform log files, virtual courses, and academic and demographic data. Secondly, it feeds them to a data webhouse which provides static and dynamic reports. Analog [23] is another system, which consists of two main components. The first performs online and the second offline data processing according to web server activity. Past user activity is recorded in server log files, which are processed to form clusters of user sessions.
[18] propose a new approach for automatic user satisfaction measurement by identifying and indexing the target groups of various e-learning material, such as e-courses and educational games, through the EDUSA test. Other researchers, such as [1] and [12], propose metrics for e-learning evaluation. For e-commerce web usage analysis, some metrics are proposed by [9].
A methodology for the maintenance of web-based courses was also proposed by [8] which incorporates a specific data mining step. Publications of the authors relevant to this paper are the automated suggestions and course ranking through a web mining system [21] and the proposal of two new metrics, homogeneity and enrichment, for web applications assessment, which are also used in this chapter [22].
In more detail, the data recording module is embedded in the web server of the e-learning platform and records specific e-learning platform fields. Specifically, as a first step, eleven (11) user requests are recorded with the use of an Apache module developed in the Perl programming language.
The development of such a module has the following two advantages: rapid storage of user information, since it is executed straight from the server API and not by the LMS application, and the produced data are independent of specific formulation used by the LMS platform.

Data pre-processing
The data of the log file contain noise such as missing values, outliers, etc. These values have to be pre-processed to prepare them for data mining analysis. Specifically, this step filters the recorded data: it uses outlier detection and removes extreme values. The step is not performed by the LMS platform and can therefore be embedded into a variety of LMSs. It also helps data mining analysis methods produce robust results.
The produced log file is filtered so that it includes only the following three fields: (i) courseID, which is the identification string of each course; (ii) sessionID, which is the identification string of each session; (iii) page Uniform Resource Locator (URL), which contains the requests of each page of the platform that the user visited.
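The reduction to three fields can be sketched as follows. This is a minimal illustration, not the Perl module used in the study: the input layout (one dict per request, with hypothetical extra keys such as `remote_ip`) is an assumption, and only the handling of missing values is shown, not the outlier detection.

```python
from typing import Iterable, List, NamedTuple, Optional


class Hit(NamedTuple):
    """One page request reduced to the three retained fields."""
    course_id: str
    session_id: str
    url: str


def filter_record(record: dict) -> Optional[Hit]:
    """Keep courseID, sessionID and url; return None if any of them is missing."""
    # The key names are hypothetical; the source only names the three fields.
    values = [record.get(key) for key in ("courseID", "sessionID", "url")]
    if any(v is None or str(v).strip() == "" for v in values):
        return None
    return Hit(*(str(v).strip() for v in values))


def filter_log(records: Iterable[dict]) -> List[Hit]:
    """Apply filter_record to every record and drop the incomplete ones."""
    hits = (filter_record(r) for r in records)
    return [h for h in hits if h is not None]
```

Because the filter works on already parsed records, it stays independent of the concrete log line format, in the same spirit as the LMS independence described above.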

Study population and context
In detail, the dataset was collected from a real LMS environment used at the Technological Education Institute (TEI) of Kavala, which uses the Open eClass e-learning platform [6]. The data are from the spring semester of 2009 from the Department of Information Management and involve 1199 students and 39 different courses. The data are in ASCII form and are obtained from the Apache server log file. A view of the collected data in forensic log format (FLF) is shown in Table 1.

Table 1. eClass data in FLF

The log file produced in the previous step is filtered and pre-processed so that it includes the following fields: courseID, sessionID and page Uniform Resource Locator (URL).

Processing the Data
The aforementioned fields of the previous section are not adequate in order to evaluate the course usage. So, some metrics are used for the facilitation of the course usage evaluation (Table 2). First, the indexes Sessions, Pages, Unique pages, Unique Pages per CourseID per Session are computed with the use of a Perl program. Then, the metrics Enrichment, Disappointment, Interest and Homogeneity are calculated.
The number of the sessions and the number of the pages viewed by all users are counted for the calculation of course activity. The metric unique pages measures the total number of unique pages per course viewed by all users. The Unique Pages per Course per Session (UPCS) metric expresses the unique user visits per course and per session; it is used for the calculation of the course activity in an objective manner. Because some novice users may navigate in a course and visit some pages of the course more than once, UPCS eliminates duplicate page visits, since it considers the visits of the same user in a session only once.
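The four indexes described above can be computed in one pass over the filtered log. The sketch below is written in Python rather than the Perl program mentioned in the text, and it assumes that UPCS counts distinct (session, page) pairs per course, which is one reading of "unique pages per course per session".

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def course_indexes(hits: Iterable[Tuple[str, str, str]]) -> Dict[str, Dict[str, int]]:
    """Compute Sessions, Pages, Unique pages and UPCS per course.

    hits: iterable of (course_id, session_id, url) tuples.
    """
    pages = defaultdict(int)          # every page view counts
    sessions = defaultdict(set)       # distinct session ids per course
    unique_pages = defaultdict(set)   # distinct urls per course
    session_pages = defaultdict(set)  # distinct (session, url) pairs per course

    for course, session, url in hits:
        pages[course] += 1
        sessions[course].add(session)
        unique_pages[course].add(url)
        session_pages[course].add((session, url))

    return {
        course: {
            "sessions": len(sessions[course]),
            "pages": pages[course],
            "unique_pages": len(unique_pages[course]),
            "upcs": len(session_pages[course]),
        }
        for course in pages
    }
```

A page visited twice within the same session adds two to `pages` but only one to `upcs`, which is the duplicate elimination the UPCS metric is meant to provide.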
Table 2
Metric                                          Description
Sessions                                        The total number of sessions per course viewed by users
Pages                                           The total number of pages per course viewed by users
Unique pages                                    The total number of unique pages per course viewed by users
Unique Pages per CourseID per Session (UPCS)    The total number of unique pages per course per session viewed by users
Enrichment                                      The enrichment of courses
Disappointment                                  The disappointment of users
Interest                                        The one's complement of the disappointment
Homogeneity                                     Homogeneity of courses

Enrichment is a metric proposed to express the "enrichment" of each course in terms of educational material. It is defined as the complement of the ratio of the unique pages to the total number of course web pages, as proposed in [21][22]:

Enrichment = 1 - Unique Pages/Total Pages (1)

where Unique Pages <= Total Pages.
Enrichment values are in the range [0, 1). The value is 0 when users follow only unique paths in a course, and it is close to 1 in a course with minimal unique pages. Since it offers a measure of how many unique pages were viewed by the users, it shows how much of the information included in each course is handed over to the end user, from which it is inferred that the course contains rich educational material.
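As a small worked example of the definition (the complement of the ratio of unique pages to total pages), the following sketch computes Enrichment for one course. It assumes that "Total Pages" is the Pages index, that is, the total number of page views of the course; the function name is illustrative.

```python
def enrichment(unique_pages: int, total_pages: int) -> float:
    """Enrichment = 1 - Unique Pages / Total Pages, with values in [0, 1)."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    if not 0 < unique_pages <= total_pages:
        raise ValueError("unique_pages must be in 1..total_pages")
    return 1.0 - unique_pages / total_pages


# Hypothetical course: 100 page views spread over 25 distinct pages.
print(enrichment(25, 100))  # 0.75
```

With 10 page views of 10 distinct pages the function returns 0.0, the lower end of the range.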
Disappointment is a metric which combines the sessions and the pages viewed by users. It measures the disappointment of the users in the course, in the sense that a user who views only a few pages of the course then logs out of it. The Disappointment metric is defined as the ratio of the number of sessions per LMS course to the total number of course web pages:

Disappointment = Sessions/Total Pages (2)

In other words, the Disappointment metric reflects how quickly users discontinue viewing pages of the courses. Disappointment values are in the range (0, 1]. Because of the negative connotation of the Disappointment metric, it was replaced by a metric with a positive connotation, Interest. The Interest metric is defined as the one's complement of the Disappointment:

Interest = 1 - Disappointment (3)
Both disappointment and interest metrics were proposed in [21]. A low interest in a course means that there are not many unique pages viewed per session; therefore the course is not so popular among the students. This may be so either because students were not pleased with the educational material or there are not many pages to visit. High interest indicates that users are interested in course content and continue further with their study. When the quality of the educational material does not fulfil user requirements, the user is led to log out of the course.
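The relation between the two metrics can be illustrated with a short sketch. It follows the definitions in the text (Disappointment as sessions over total pages, Interest as its one's complement) and again assumes that "Total Pages" means the total page views of the course; the function names are illustrative.

```python
def disappointment(sessions: int, total_pages: int) -> float:
    """Disappointment = Sessions / Total Pages, with values in (0, 1]."""
    if not 0 < sessions <= total_pages:
        raise ValueError("expected 0 < sessions <= total_pages")
    return sessions / total_pages


def interest(sessions: int, total_pages: int) -> float:
    """Interest = 1 - Disappointment, with values in [0, 1)."""
    return 1.0 - disappointment(sessions, total_pages)


# Hypothetical course: 25 sessions that viewed 100 pages in total.
print(disappointment(25, 100), interest(25, 100))  # 0.25 0.75
```

A course in which every session views exactly one page has Disappointment 1 and Interest 0, which matches the reading that users leave after seeing very little.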
Homogeneity is another metric, defined as the ratio of the unique visited course pages to the number of sessions that visited that course:

Homogeneity = Unique pages/Total Sessions (4)

where Total Sessions per course >> Unique course pages.
Homogeneity values are in the range [0, 1), where 0 means that no user followed a unique path and values close to 1 mean that every user followed a unique path. It is a course quality index and characterizes the percentage of course information discovered by each user participating in a course. The aforementioned metrics contribute to the evaluation of course usage. The results for the 39 courses are presented in Table 3.
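A minimal sketch of the Homogeneity computation is given below. The check that sessions outnumber unique pages mirrors the stated condition that the total sessions per course far exceed the unique course pages; treating a violation as an error is an assumption of this sketch, not something the text prescribes.

```python
def homogeneity(unique_pages: int, total_sessions: int) -> float:
    """Homogeneity = Unique pages / Total Sessions, expected to lie in [0, 1)."""
    if total_sessions <= 0:
        raise ValueError("total_sessions must be positive")
    if not 0 <= unique_pages < total_sessions:
        # The metric assumes many more sessions than unique pages.
        raise ValueError("expected unique_pages < total_sessions")
    return unique_pages / total_sessions


# Hypothetical course: 45 unique pages visited over 100 sessions.
print(homogeneity(45, 100))  # 0.45
```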

Algorithms
In this section, two algorithms are presented which classify the LMS courses and suggest actions to the educators for course improvement.

Classifier algorithm
The first algorithm classifies LMS courses according to whether their course information material is poor or rich in quantity. Then, for the LMS courses with adequate information material, it uses the Homogeneity classification to estimate how often course information is added or updated by educators and whether users follow the updated information. Finally, using the UPCS metric, it identifies whether updates of course information can increase the students' interest in the specific course. The Classifier algorithm schema is depicted in Figure 1.

In the first stage of the algorithm, the Enrichment metric is used to identify courses with poor or rich educational content (a low Enrichment value indicates poor content and a high value rich content). The N courses with the highest Enrichment values, where N <= total number of LMS platform courses, are placed in a table ordered by Enrichment.

Table 3. LMS data and grade for 39 courses (columns: Course ID, Sessions, Pages, Unique pages, UPCS, Disappointment, Interest, Enrichment, Homogeneity)
In the second stage, the algorithm classifies the previous set of N courses using the values of Enrichment and Homogeneity. The classification of LMS courses is performed using four clusters, as shown in Figure 1. The higher the Homogeneity value, the more frequent the course updates or the more dynamic the course content, depending on the Enrichment value. The lower the Homogeneity value, the more static the course content or the poorer its updates. The classification of the courses depends on the average Enrichment value of the N LMS courses and on the average Homogeneity values of the high and low Enrichment clusters respectively.
The aim of the third stage of the algorithm is to identify whether the content can be characterized as rich or poor, and whether it is static, frequent or dynamic. In order to do this, each cluster's courses are ranked based on the value of the UPCS.
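The three stages can be sketched as below. This is an illustrative reading of the description, not the authors' implementation: the dict keys, the use of `>=` at the thresholds, and the cluster labels ("high"/"low" pairs instead of the numbering I to IV of Figure 1, whose mapping is not given in the text) are all assumptions.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def classify_courses(courses: List[dict],
                     n: Optional[int] = None) -> Dict[Tuple[str, str], List[dict]]:
    """Cluster courses by Enrichment and Homogeneity, then rank by UPCS.

    Each course is a dict with keys: id, enrichment, homogeneity, upcs.
    Returns {(enrichment_level, homogeneity_level): [courses by UPCS, descending]}.
    """
    # Stage 1: order by Enrichment and keep the N highest (all if n is None).
    ordered = sorted(courses, key=lambda c: c["enrichment"], reverse=True)
    selected = ordered if n is None else ordered[:n]
    if not selected:
        return {}

    # Stage 2: split on the average Enrichment, then on the average
    # Homogeneity computed separately inside each Enrichment group.
    avg_enrichment = mean(c["enrichment"] for c in selected)
    groups = {
        "high": [c for c in selected if c["enrichment"] >= avg_enrichment],
        "low": [c for c in selected if c["enrichment"] < avg_enrichment],
    }
    clusters: Dict[Tuple[str, str], List[dict]] = {}
    for e_level, group in groups.items():
        if not group:
            continue
        avg_homogeneity = mean(c["homogeneity"] for c in group)
        for course in group:
            h_level = "high" if course["homogeneity"] >= avg_homogeneity else "low"
            clusters.setdefault((e_level, h_level), []).append(course)

    # Stage 3: rank the courses of every cluster by UPCS, highest first.
    for members in clusters.values():
        members.sort(key=lambda c: c["upcs"], reverse=True)
    return clusters
```

The first and last element of each returned list then play the role of the high and low UPCS representatives of that cluster.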

Application of classification algorithm
The 39 courses were initially ranked according to the Enrichment metric. The algorithm was tested by picking the best and worst LMS courses from the list of 39 courses, that is, the best and worst cases from the students' usage point of view; these are shown in Table 4. On the resulting Enrichment-ordered table of 12 LMS courses, the Classifier algorithm was applied using an average Enrichment value of 0.898 and an average Homogeneity value of 0.09 for the high Enrichment cluster and of 0.45 for the low Enrichment cluster. The classification produced four clusters, which are shown in Table 5.
As shown in Table 6, for each of the four clusters the LMS courses are ordered by UPCS value. Courses IMD105 and IMD36 are the representatives of high and low UPCS values for cluster I, IMD132 and IMD41 for cluster II, IMD112 and IMD122 for cluster III, and IMD66 and IMD8 for cluster IV, respectively.
In Table 4, these courses and the classifier algorithm evaluation feedback for each one of those courses are presented.

Suggestion algorithm
The goal of the second algorithm is to provide an automated suggestion system for course improvement. The first step of the proposed algorithm is course ranking in descending order by UPCS. A course placed in the first ranking positions is a popular one, either because of the quality of its educational content or because of the quantity of its course material.
The first suggestion rule (Figure 2)

Table 4 (surviving rows: cluster, course, classifier feedback)
IV    IMD66    Course of poor static content that still contains information followed by users (or forced to follow)
IV    IMD8     Abandoned course of poor static content occasionally followed by curious users

The next suggestion rule (Figure 3) applies the Enrichment metric. A low Enrichment value means that users do not visit course pages because of the lack of course content updates. If the Enrichment value of a course is less than c*Average(Enrichment), where c is a coefficient parameter, then the algorithm suggests that it would be good practice for the author to update the course content, so as to motivate users to revisit the course.
IF Enrichment < c*Average(Enrichment) THEN "Update the course content"

The algorithm's coefficient parameters a, b and c range between 0 and 1. In order to calibrate the coefficient parameters accurately, the algorithm was first applied to a reduced set of LMS courses. Course selection was based on the best and worst case LMS courses according to the UPCS ranking. The value of a was then calculated from the deviation between the best and worst course Interest values. The value of b is the average of the best and worst course Interest values, and c was calculated from the median Interest value of the first k LMS courses in the UPCS ranking, where k = 5:

c = median(Interest_i) - k*N*0.0001 (5)

where N is the total number of LMS courses and k is the number of selected courses with the highest UPCS ranking values.
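The calibration of c and the Enrichment rule can be sketched as follows. Only c and the second rule are covered, because the text gives explicit formulas for them; the dict keys and function names are illustrative assumptions, and a and b are left out.

```python
from statistics import mean, median
from typing import List


def calibrate_c(courses: List[dict], k: int = 5) -> float:
    """c = median(Interest of the top-k courses by UPCS) - k * N * 0.0001."""
    n = len(courses)  # N: total number of LMS courses
    top_k = sorted(courses, key=lambda x: x["upcs"], reverse=True)[:k]
    return median(x["interest"] for x in top_k) - k * n * 0.0001


def second_suggestion_rule(courses: List[dict], c: float) -> List[str]:
    """Return ids of courses with Enrichment < c * Average(Enrichment).

    For these courses the suggestion is "Update the course content".
    """
    threshold = c * mean(x["enrichment"] for x in courses)
    return [x["id"] for x in courses if x["enrichment"] < threshold]
```

With N = 39 and k = 5, the correction term k*N*0.0001 equals 0.0195, so c lies slightly below the median Interest of the five courses with the highest UPCS.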

Application of suggestion algorithm
This experiment had two goals: To determine suitable values for a, b, c parameters and to test the two suggestion rules of the algorithm with respect to their impact on improving the course quality.
The first stage of the algorithm was the ranking of the courses. Table 7, due to space limitation, displays the results for the courses in ranking positions 1-5, 21-39, using the UPCS metric. The ranking of the courses is based on: first UPCS, then Enrichment and then Interest values.
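The ranking step amounts to a sort on a composite key. In the sketch below, UPCS is the primary key and Enrichment and Interest break ties; sorting all three in descending order is an assumption, since the text states the direction only for UPCS, and the dict keys are illustrative.

```python
from typing import List


def rank_courses(courses: List[dict]) -> List[dict]:
    """Rank courses by UPCS, then Enrichment, then Interest (all descending)."""
    return sorted(
        courses,
        key=lambda c: (c["upcs"], c["enrichment"], c["interest"]),
        reverse=True,
    )
```

Python compares the key tuples element by element, so Enrichment is consulted only when two courses have the same UPCS, and Interest only when both UPCS and Enrichment are equal.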
Based on the aforementioned metrics, the courses were classified using the following three steps:
1. Course ranking step: primarily, course evaluation was considered using the UPCS value and LMS courses were ranked in descending order.
2. First suggestion rule step: The first suggestion rule is used in order to evaluate course content in terms of interest as expressed by course users and provides the appropriate suggestions to the instructors, related to the quantity and the quality of their course educational content.
3. Second suggestion rule step: Course content is examined in depth in order to determine whether users are satisfied with what they see or whether the course content seems confusing or complex to the end user. The Enrichment metric is used to identify courses with poor or rich educational material, as confirmed by users, and to provide suggestions for possible course updates.
In order to perform the final two stages of the algorithm, the coefficient parameters values were firstly calculated. According to the experiment outcome, the values for a, b, c parameters were 0.9, 0.6 and 0.95 respectively.
The second goal of the experiment was to test the suggestion rules by showing them to the course instructors and receiving feedback verifying the accuracy of the suggestions. When instructors applied the proposed suggestions, their courses improved in UPCS ranking position.

Discussion and Conclusion
The proposed method uses existing techniques in a different way to perform LMS usage analysis. It uses the enrichment, homogeneity and interest metrics. It presents the clustering of students and courses. It uses two algorithms for course classification and suggestion actions.

Table 7. Processed e-learning data for 39 courses

It has the following advantages: (1) It is independent of a specific LMS, since it is based on the Apache log files and not on the LMS platform itself. Thus, it can be easily implemented for every LMS. (2) It uses new metrics to facilitate the evaluation of each course in the LMS and to help the instructors make proper adjustments to their course educational material. (3) It uses two algorithms that analyze LMS data, classify the courses and suggest the proper actions to the educators.
Feedback about the method was received from the educators. The educators were informed about the indexing results along with abstract directions on how to improve their courses. Most of them increased the quality and the quantity of their educational material. They increased the quality by reorganizing the educational material in a uniform, hierarchical and structured way. They also improved the quantity by embedding additional educational material. By updating the educational material, both quality and quantity were increased. A major outcome of informing the educators about the results was that the ranking of the courses constitutes an important motivation for the educators to try to improve their educational material. Because of their mutual competition, they want their courses to be highly ranked. A few educators complained that the organization of their courses does not help them achieve high final scores in the ranking list. They argued, for example, that the Interest metric is heavily influenced by the number of web pages used to organize the educational material. Thus, courses that have all their educational material organised in a few pages have a low Interest score. They were asked to reorganize the material of each course in the LMS according to the order in which it is taught, in order to facilitate its use by the students.
It should be mentioned that even if the scope of the method is on LMS platforms and educational content, it can be easily adopted in other web applications such as e-government, e-commerce, e-banking, blogs etc. Furthermore, enrichment, homogeneity and interest metrics may also be used for example by e-government applications, since enrichment shows how much information is handed over to the end user, homogeneity characterizes the percentage of information independently discovered by each user and interest indicates whether users are pleased with the material of the site and do not log out.