Educational Datamining in Virtual Learning Environments

This article describes the results of a medium-scale (N = 77) study that used log files from the open remote laboratory at Charles University in Prague, Faculty of Mathematics and Physics, to observe students' behavior while they worked in a virtual environment. Simple data mining and text mining techniques were used to reveal individual users' behavioral patterns, to detect disengagement, and to compare learning outcomes and student preferences.


A. Definitions of educational data mining
There are many different definitions of educational data mining and its main issues.
From the statistician's and data miner's point of view, educational data mining (EDM) is a field that applies statistical, machine-learning, and data-mining algorithms to different types of educational data, with the main goal of analysing these data in order to resolve educational research issues.
Policy makers and administrators usually think that EDM is mostly about mining enrolment and students' performance data for improving the services they provide and for increasing student grades and retention.
Generally, EDM is concerned with developing methods to explore the specific types of data obtained in educational settings and, using these methods, to better understand students and the settings in which they learn. On the one hand, the growth of both instrumental educational software and state databases of student information has created large repositories of data reflecting how students learn (Koedinger et al., 2008). On the other hand, the use of the Internet in education has created a new context, known as e-learning or web-based education, in which large amounts of information about teaching-learning interaction are continuously generated and ubiquitously available.
There is also a third, rediscovered way to understand the process of students' learning. Simple, noninvasive, low-cost measurements of neurophysiological factors such as eye blinks, galvanic skin response (GSR), or heart and breathing rate, together with recordings of screen activities and events, are nowadays easily available. They have become cheaper and increasingly transferable to "out of laboratory" conditions, that is, into real learning environments. This kind of data, gathered and processed in real time, has great potential to support an immediate and individualized reaction to decreasing attention, increasing visual or cognitive information load, task difficulty, tension, arousal, stress and/or achievement of the learner (Lustigova et al., 2010).
Educational data mining and learning analytics are increasingly used to research and build models in several areas that can influence the learning process itself, or at least improve online learning systems.

B. Description of our research problem and its "state of the art"
Our research focused mainly on user modeling and on disengagement detection and prediction within remote laboratory activities.
Remote laboratories represent one of the three most widely used laboratory landscapes today, together with so-called virtual labs (also known as simulated labs) and computer-mediated, hands-on labs.
Remote labs enable experimenting and lab work in virtual conditions through remote access. Although this work is often done in environments and conditions that were unimaginable for recent generations of students, the main goals of laboratory work remain the same. Today's students still have to master basic science concepts, to understand the role of direct observation, to distinguish between inferences based on theory and the outcomes of experiments, to cooperate, and to develop collaborative learning skills. But they have to do all this while exposed to uncertain and not exactly defined situations, since the whole virtual and remotely controlled working environment is more complicated and thus more unpredictable (Lustig, Lustigova et al. 2012). This also makes the work more unpredictable for the teacher (or online supervisor) and places greater demands on the analysts and remote lab developers, who have themselves often grown up and learned in different conditions.
Educational research in remote lab conditions likewise has to deal with greater fuzziness and unpredictability. In e-learning or online learning environments, researchers have at their disposal plenty of structured and unstructured textual information, including discussion threads and all kinds of communication between teacher and student, student and student, student and team of students, and student and learning material (in the form of personalized comments, reviews, etc.). In remote labs the situation is different. The communication tools of a remote lab are very limited and the whole work is usually task oriented: to set up the experimental environment, to gather data, and to process them. If teamwork and the associated negotiation take place, they can be observed directly, on site (see Lustig, Lustigova 2011).
Remote laboratory environments very rarely offer communication tools such as chats, discussion clubs or cafés, whether synchronous or asynchronous. This means that there is virtually no textual information available, and researchers often have to work only with log files and the information hidden in them.
In our latest "state of the art" literature review focused on remote laboratories, we did not find any study based on log file analysis. It appears that log file data from remote laboratories are collected more often than they are analyzed. Most research papers in the field focus on the development of remote experiments, the improvement of online access, and other technical and engineering aspects of the problem. Studies of users' behavior and of the learning process are quite rare and are often based on direct (on-site) observation, discussion of results and reports, or survey data (Lustigova et al., 2011). In our research we processed data from log files collected in spring and summer 2012 at the remote laboratory of Charles University in Prague, Faculty of Mathematics and Physics.
The remote laboratory at Charles University in Prague is a so-called "open remote laboratory", which means that the local laboratory is available, through a remote control option, to any interested visitor. In spring and early summer 2012 the most engaged users were students from 5 secondary schools, who were asked to measure and process their data and to report their results for a photoelectric effect experiment.
Unlike many remote laboratories, the laboratory at Charles University offers quite favorable conditions for high school students. The impression of real presence is enhanced by web cameras that provide real-time image transmission of the most interesting parts of the selected experimental setup or its results. Simultaneously, different variables are measured and visualized in the form of graphs.
Our main goal in processing the log file data from these students' activity was to reveal disengagement, to prevent it, and to improve users' motivation within the online learning and measuring environment. We mainly sought to identify and avoid objective causes of disengagement, such as an unnecessarily long wait for an event or feedback, confusing information and instructions, or other problems that cannot be easily identified with traditional techniques.
We also wanted to discover behavioral and problem-solving patterns with the help of the user modeling technique described above.

II. RESEARCH PROCESS AND RESULTS
Each record in the log file, pre-processed by special software without losing any information, contains a string describing an individual user's activity (see the example of an individual user activity recorded in the form of a string below). The first line of the example identifies the user's computer IP address, the date and time the user started to measure, the total time in seconds that the activities lasted, and the original ID under which the original data can be found in the log file. The second, long line contains the full description of the user's activities.

A. Descriptive statistics
From the collection of 613 sessions within the first half of 2011, just 155 belonged to the experimental group (April 2011), and of those just 15 sessions finished with measurement or data downloading. The length of the connections varies from very short to very long (up to one hour). The length of a connection says nothing about the meaningfulness of the activities: some short connections finished with data downloading, while the string descriptions of some very long connections contain absolutely no activity (see the histogram of connection length in Figure 4; note that the time axis is nonlinear). The average length of a connection was 354.7 seconds, while the average length of a meaningful connection (one that finished with data download or measurement) was 756.2 seconds.
Users in our experimental group connected from 43 different IP addresses. They preferred to work in late afternoons and evenings (see Fig. 4). Note that some of these secondary school students worked after midnight as well.
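The two averages reported above can be computed from per-connection records along the following lines. This is only a sketch: the record layout (a duration in seconds plus a flag marking whether the connection ended with a measurement or a data download), the function name `summarize`, and the sample values are assumptions made for illustration, not the study's actual data or tooling.

```python
from statistics import mean

# Hypothetical sample records, invented for illustration:
# (connection length in seconds, ended with measurement or data download?)
connections = [
    (42.0, False),
    (910.5, True),
    (130.0, False),
    (602.0, True),
    (15.5, False),
]


def summarize(conns):
    """Return counts and mean lengths for all and for meaningful connections."""
    durations = [length for length, _ in conns]
    meaningful = [length for length, ok in conns if ok]
    return {
        "n": len(durations),
        "n_meaningful": len(meaningful),
        "mean_all_s": mean(durations),
        # Guard against a group with no meaningful connection at all.
        "mean_meaningful_s": mean(meaningful) if meaningful else None,
    }


if __name__ == "__main__":
    print(summarize(connections))
```

Keeping the "meaningful" flag next to each duration makes it easy to compare the two groups directly, which matters here because connection length alone does not indicate meaningful activity.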

If we define a session as a chronological series of connections from a given IP address within the same day, and allow a "no activity interval" between connections of up to 15 minutes (900 s), the number of sessions decreases to 56. Since the number of participants in our experimental group was slightly higher, this suggests that some of them were not able or did not want to work within the remote lab environment.
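The session definition above can be sketched in code as follows. The `Connection` record, the function name `sessionize`, and the choice to measure the gap from the end of one connection to the start of the next are assumptions made for illustration; the text does not specify how the interval was measured or which tools were used.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby

# 15-minute "no activity interval" (900 s) taken from the text.
MAX_GAP = timedelta(seconds=900)


@dataclass
class Connection:
    """Hypothetical connection record: IP, start time, length in seconds."""
    ip: str
    start: datetime
    duration_s: float

    @property
    def end(self) -> datetime:
        return self.start + timedelta(seconds=self.duration_s)


def sessionize(connections, max_gap=MAX_GAP):
    """Group connections into sessions: same IP, same day, gaps <= max_gap."""
    # Sorting by (ip, start) keeps each (ip, day) group contiguous for groupby.
    ordered = sorted(connections, key=lambda c: (c.ip, c.start))
    sessions = []
    for _, group in groupby(ordered, key=lambda c: (c.ip, c.start.date())):
        current = []
        for conn in group:
            # Assumption: gap is measured from the previous connection's end.
            if current and conn.start - current[-1].end > max_gap:
                sessions.append(current)
                current = []
            current.append(conn)
        sessions.append(current)
    return sessions
```

Calling `len(sessionize(connections))` on the full list of connections would give the session count under these assumptions. Note that grouping by IP address treats several students behind one shared school address as a single user, which is a limitation of the definition itself rather than of this sketch.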

B. Behavioural patterns
Individual But he did not use the occasion. After a while (waiting for 2) he/she took the control and started to work. The activity record represented by the following string (figure 4) is among the longest ones, but surprisingly has no real output.
"Early bird" students, who followed the recommended time schedule, preferred real-time measurement (app # within each group), while the "last minute" students, queuing to operate the lab devices remotely, frequently used pre-measured data, often without checking their quality and reliability.
Although the remote lab offers up to 200 stored data sets, the users in the experimental group usually selected among the last 3 offered, without using the preview or checking their reliability and quality.

III. CONCLUSIONS
Although the students in the experimental group presented nicely processed reports, the reality hidden in the log files was different. Using educational data mining techniques, we found that:
1. Although our remote laboratory is open to individual secondary school students, the overwhelming majority of them are not able to work in the laboratory without meaningful training. If they are forced to do so, they leave the environment without any meaningful activity, or they play for a while but then also prefer downloading data to real measurement.
2. The "play phase" seems to be very important. Only those who played for a while were able to set up the apparatus, start the measurement, finish it correctly, and save the measured data. In the end, however, even these students mostly preferred data download.
3. The credibility of pre-measured data (regardless of what they look like and who their author is) is very high.
4. Students do not trust their own results. This might be associated with the change in the learning and teaching paradigm in general (teamwork vs. individual work), a lack of supervision that they are not used to, and/or increased uncertainty in the virtual environment.