Paper—Bachelor Thesis Analytics: Using Machine Learning to Predict Dropout and Identify ... Bachelor Thesis Analytics: Using Machine Learning to Predict Dropout and Identify Performance Factors

The bachelor thesis is commonly a necessary last step towards the first graduation in higher education and constitutes a central key to both further studies in higher education and employment that requires higher education degrees. Thus, completion of the thesis is a desirable outcome for individual students, academic institutions and society, and non-completion is a significant cost. Unfortunately, many academic institutions around the world experience that many thesis projects are not completed and that students struggle with the thesis process. This paper addresses this issue with the aim to, on the one hand, identify and explain why thesis projects are completed or not, and on the other hand, to predict non-completion and completion of thesis projects using machine learning algorithms. The sample for this study consisted of bachelor students’ thesis projects (n=2436) that have been started between 2010 and 2017. Data were extracted from two different data systems used to record data about thesis projects. From these systems, thesis project data were collected including variables related to both students and supervisors. Traditional statistical analysis (correlation tests, t-tests and factor analysis) was conducted in order to identify factors that influence non-completion and completion of thesis projects and several machine learning algorithms were applied in order to create a model that predicts completion and non-completion. When taking all the analysis mentioned above into account, it can be concluded with confidence that supervisors’ ability and experience play a significant role in determining the success of thesis projects, which, on the one hand, corroborates previous research. On the other hand, this study extends previous research by pointing out additional specific factors, such as the time supervisors take to complete thesis projects and the ratio of previously unfinished thesis projects. 
It can also be concluded that the academic title of the supervisor, which was one of the variables studied, did not constitute a factor for completing thesis projects. One of the more novel contributions of this study stems from the application of machine learning algorithms that were used in order to – reasonably accurately – predict thesis completion/non-completion. Such predictive models offer the opportunity to support a more optimal matching of students and supervisors.

Keywords: Thesis, bachelor, completion, machine learning, retention, performance, learning analytics.

http://www.i-jai.org


Introduction
The thesis has been a central element, as well as a formal requirement, of graduate programmes for more than a century and continues to be an essential component in most universities [1,2]. The thesis is intended to establish competences in critical thinking, empirical research literacy, synthesis of knowledge and assessment of the veracity of information. In most universities around the world, the bachelor thesis is a mandatory last step towards the first graduation in higher education and thus constitutes a central key to both further studies and employment that requires higher education degrees.
Students who obtain a graduate degree realise a wide array of benefits that include personal, economic and long-lasting career advantages [1,2]. These students are more likely to be employed, the jobs they secure are better paid than those of their counterparts, and they are more likely to have stable jobs. Further benefits extend to their families in the form of increasing chances of high-quality education, better parenting, and health insurance benefits [3,4,5]. The benefits of higher education also extend to governments and society at large, who derive a range of direct and indirect benefits, such as better return on investment in education, lower crime rates, improved tax revenues and less dependence on welfare programmes [3,4]. Therefore, ensuring that students enrolled in graduate programmes obtain their degrees in a timely fashion is in the best interest of governments, higher education institutions and students alike [3,6].
However, the thesis is a challenging endeavour that requires skills, aptitude, and determination for successful, timely completion [1,2,7,8,9]. As expected, a considerable number of students struggle with the thesis process, resulting in delays, disruptions, and non-completion of their degrees [10,11,12]. Non-completion results in a vast waste of faculty time and institutional resources, a devastating personal experience for students that costs precious time, money and energy, and a societal loss of high-skilled workers [1,3,11]. Thus, it is reasonable to assume that non-completion of higher education degrees should be viewed as a substantial problem that requires serious attention and proactive planning [1,3,11,13,14,15].
Previous research related to thesis projects has identified some variables that influence the performance of students undertaking thesis projects; variables that, in particular, point out the relation between the student candidate and the supervisor [1,11,16]. The specific student variables that have been indicated as influencing thesis completion are students' attitudes and motivation [2], the students' average entry grade [14], and the students' communication and language skills [15]. Among the supervisor variables, it has been shown that the supervisor's experience, research output and workload constitute factors of thesis success [15,17]. However, the review of the literature leads to the conclusion that there are few studies explicitly focusing on bachelor thesis projects. Studies on completion of thesis projects mostly concern the doctorate thesis [18,19], while studies on undergraduate non-completion tend to focus on the whole programme, not the thesis specifically [20,21,22]. Furthermore, most studies have used a qualitative approach to investigate factors for thesis completion; single factors have been looked at in isolation, with a primary focus on student variables and on completion factors (and not on non-completion and supervisor variables) [23,24,25], and there are few contemporary studies on completion and non-completion of bachelor thesis projects.
The introduction of thesis management systems, such as SciPro from Stockholm University [26] and Thesis Writer (TW) from Zurich University of Applied Sciences [27], offers accurate recording of many aspects and interactions of the thesis process. These recordings, along with the system usage logs, enable the use of learning analytics techniques to identify the factors and indicators behind thesis completion. Learning analytics has been used successfully to identify early indicators of successful course completion, inform course design, provide insights and feedback to teachers and students, and improve educational outcomes [28]. As such, learning analytics methods could offer valuable insights to students, educators and administrators, for example through the early prediction of a troubled thesis.
Against this background, this study sets out to better understand the factors that influence completion, and in particular non-completion, of bachelor thesis projects by investigating a set of student and supervisor variables. We do this supported by thesis management systems, such as Daisy and SciPro from Stockholm University, which record large amounts of data related to the thesis process and consequently pave the way for using learning analytics techniques to improve our understanding of thesis management and students' success factors [29]. More specifically, this study aims to, on the one hand, identify factors that contribute to thesis completion and non-completion, and on the other hand, predict completion and non-completion of thesis projects using machine learning algorithms. Early prediction of non-optimal thesis projects and insights from successful completers might help us introduce proactive interventions that salvage students at risk of non-completion and decrease the time to complete thesis projects.

Factors for completion of thesis projects
The literature review has led to the identification of two groups of factors that influence thesis outcomes: the student candidate and the supervisor. Below we give an account of what is known about these two groups of factors.

The thesis candidate
Rennie and Brewer proposed the term 'thesis-blocking' in 1987 while using a grounded theory approach to investigate the problem of thesis delay (Rennie & Brewer, 1987). They suggested that there are more ways to block a thesis than to complete it in a timely fashion. According to their research, successful thesis completion requires the candidate's conformity to and acceptance of the thesis process, and the candidate's willingness to manage their idealism and cope with the overwhelming nature of the project. Failure to resolve negative feelings and hesitation to approach the supervisor cause many candidates to get stuck midway [2]. House and Johnson's findings point to the applicants' average entry grade as a decisive predictive factor of successful, timely completion [14], a finding that was corroborated by Jiranek [15] and Wright and Cochrane [30].
Nevertheless, some studies have shown that entry grade is not a significant predictor of completion [30,31]. For instance, in a study by Pascarella and Terenzini [32], the background characteristics including entry grades (R2c .009) explained only a small part of retention; it was academic and social integration that explained persistence (R2c .127). The result suggests that the academic organisation has a potentially substantial effect on the student's study outcome.
The age of the candidate has shown inconsistent associations with time to completion and thesis grade, and the inconsistency persisted when researchers took discipline into account in the analysis [15,17,30,33].
Qualities and approach of the candidate were also reported as considerable predictors; for instance the ability of the candidate to cope with the demanding nature of the study and the flexibility to adapt to the process [2], motivation to finish in a timely fashion, engagement with coursework, prior research work, prior coursework in a relevant field and choosing the appropriate courses that are most relevant to the thesis topic during the programme [16,34]. Other factors also include communication skills and language proficiency skills [15], self-reliance and independence [35]. Family responsibilities and children seem to impact both genders in different ways. However, a right balance and proactive planning along with institutional support could mitigate the impact and assist the candidates [1,11,15,17,33,35]. Research has also emphasised the ability to access laboratory and scientific resources, along with the availability of students' guidance and support services for a successful process [1]. Contrary to the common belief, part-time older candidates seemed to fare better than their counterparts in their approach to research, organising their duties and being independent [30].

The supervisor
The supervisor's role is instrumental in every step of the development of the thesis; the role starts with guiding the choice of research topic and proposal, supervising the plan, and overseeing or participating in the implementation of the research or the project. The supervisor guides the writing of the thesis document, rectifies flaws, suggests directions and approves the document in its final form. The supervisor's role extends to the arrangement of the defence examination and preparing the candidate for the event [36]. As expected, since the supervisor has control over each step in the process and has to be satisfied with the quality of the work the candidate is producing, they can intentionally or unintentionally delay the process.
In some cases, the supervisor may decide to terminate the process, as in cases where the candidate is deemed unfit to do the presumed work. Rennie and Brewer liken the supervisor's negative role to the writer's block phenomenon [2]. They suggested that both phenomena share essential features, the main problem being the writer's internalisation of the negative feedback received from supervisors and poor management of duties and time constraints.
A healthy student-supervisor relationship is helpful to the success of the thesis project. The thesis is, more than most other educational writing projects, an embedded social exercise; collaboration with the supervisor, regular productive meetings and the ability to reach a shared understanding are therefore central to the success of the project [1,11,16,37,38]. A relationship in which the supervisor exerts moderate control over the process and shows greater affiliation was found to produce the best outcome regarding time and completion rates [39]. Supervisor experience and research output are considerable factors that may predict a positive outcome [15]. Furthermore, supervisor support is an indispensable element in all stages of the thesis process, from the inception and the choice of the research topic to the tackling of obstacles and the final presentation of the thesis [1,15,33,37,40].
Supervisors who are overwhelmed by research work, teaching or multiple students may have less time for their candidates and interact less with them [11,17]. Furthermore, the supervisor's constructive and on-time feedback, commitment to plans, encouragement and communicability were reported by students as the most desirable factors that helped them complete their thesis projects [41].

Sample and context
The sample for this study consisted of bachelor students' thesis projects (n=2436) during the period between 2010 and 2017 at the Department of Computer and Systems Sciences, Stockholm University, Sweden. Since it takes approximately 350 days for students to complete a thesis project, data from the year 2018 were excluded as they contained many projects likely to be completed after the data extraction.
The dropout rate for the thesis project at the department is approximately 30% for the period studied. We have included all bachelor thesis projects that adhere to the present curriculum for thesis projects. Table 1 below presents descriptive statistics regarding both student and supervisor variables.

Data collection
A challenge in data collection for learning analytics is to avoid amplifying errors that stem from different standards in data sources, especially if some sources are external and outside the researchers' control. To minimise this risk, this study used only data sources that are under the control of the university.
Data collection was performed in several iterative steps. Using SQL (structured query language) queries, we extracted data from two different data systems used by the department to record data about the thesis projects. From these systems, we collected thesis project data concerning both students and supervisors. Informed by factors identified by previous research [14,15], and taking into account additional variables that were available in the systems that record thesis data, we focused in general on three groups of factors that influence the academic thesis process, namely:
• Student's previous performance in the bachelor programme
• Supervisor's thesis project performance and experience
• Supervisor's research output
More specifically, we extracted the following variables:
• Thesis project: start and completion dates, from which the number of days to completion was calculated.
• The students: the grade of the thesis, the grade of the method course and the average grade in the studies before the bachelor thesis.
• The supervisors: academic title; number of publications and year of publication, from which the average number of publications per year was calculated; number of complete and incomplete thesis projects; number of started thesis projects; and the supervisor's average number of days to complete thesis projects, calculated from the projects.
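The derived variables listed above can be illustrated with a minimal sketch. The record layout, the supervisor identifiers and the dates below are hypothetical and are not taken from the department's systems; the exact definitions used in the study (for example, which projects enter the ratio) are not specified in the text, so this is only one plausible reading.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical records: (supervisor_id, start_date, completion_date or None).
projects = [
    ("S1", date(2015, 1, 19), date(2015, 6, 5)),
    ("S1", date(2016, 1, 18), None),              # started but never completed
    ("S2", date(2015, 1, 19), date(2016, 1, 14)),
    ("S2", date(2016, 1, 18), date(2016, 6, 3)),
]


def days_to_completion(start, end):
    """Days between start and completion, or None for an unfinished project."""
    return (end - start).days if end is not None else None


def supervisor_features(rows):
    """Per supervisor: projects started, incomplete count, unfinished ratio, mean days."""
    durations_by_supervisor = defaultdict(list)
    for supervisor, start, end in rows:
        durations_by_supervisor[supervisor].append(days_to_completion(start, end))

    features = {}
    for supervisor, durations in durations_by_supervisor.items():
        finished = [d for d in durations if d is not None]
        incomplete = len(durations) - len(finished)
        features[supervisor] = {
            "started": len(durations),
            "incomplete": incomplete,
            "unfinished_ratio": incomplete / len(durations),
            # Only completed projects contribute to the average duration.
            "avg_days": mean(finished) if finished else None,
        }
    return features


features = supervisor_features(projects)
```

In this reading, unfinished projects are excluded from the average duration but counted in the ratio; a different convention would change both features.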
All data were anonymised by converting personal identifiers to fictive IDs. The researchers who did the analysis did not know the identity of the subjects. The data were subsequently prepared for statistical and predictive analytics by removal of extreme and null values and through the computation of relevant variables.
Ethical approval for this study was obtained through the Regional Board of Ethical Vetting in Stockholm. Consent for participating in this research was also obtained from the selected supervisors in the sample. Six supervisors and their associated thesis projects were excluded because no consent for using their data was received.

Data analysis
The analysis was performed using SPSS and R. A Spearman correlation test was used to investigate the correlation between incomplete thesis projects (dropouts) and student and supervisor variables. Multiple independent sample t-tests were performed in order to explore differences between completers and non-completers with regard to student and supervisor variables. The Shapiro-Wilk test of normality was employed and confirmed that the assumptions for the t-tests were satisfied.
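To make the correlation step concrete, the following is a minimal standard-library sketch of Spearman's rank correlation, computed as the Pearson correlation of average ranks. It is not the SPSS or R procedure used in the study, and the two toy vectors (a dropout flag and a supervisor's unfinished ratio) are invented for illustration only.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (spread_x * spread_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy data: 1 = dropout, 0 = completed; ratio of unfinished projects.
dropout = [0, 0, 1, 1, 0, 1]
unfinished_ratio = [0.1, 0.2, 0.5, 0.4, 0.3, 0.6]
rho = spearman(dropout, unfinished_ratio)
```

Because the dropout variable is binary, it contains many ties, which is why the sketch uses average ranks rather than the shortcut formula based on squared rank differences.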
For the predictive analytics, seven supervised machine learning classifiers were applied: Naive Bayes, Logistic Regression, kNN, Neural Network, Deep Learning, Decision Tree, and Random Forest in order to predict completers and non-completers of thesis projects. These classifiers were chosen because they are frequently used for predicting dropout, and each has demonstrated good and comparable performance in predicting at-risk students and dropout [42,43]. The data set was split into a training and testing set. The training set consisted of 70% of the total data set, and the testing set the remaining 30%. After implementation of the predictive models, features were ranked using the information gain ratio. To prevent overfitting and increase robustness, 10-fold cross-validation was performed, where performances were measured from multiple iterations of cross-validation and averaged over iterations. To measure the prediction performance of the different models, the area under the receiver operating characteristic curve (AUC) was obtained, along with measures for precision and recall.
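The evaluation protocol described above can be sketched as follows. The text does not state how the 70/30 split and the 10-fold cross-validation were combined, so the sketch simply shows both mechanisms on index lists; the function names, the seed and the caller-supplied `fit_and_score` callback are assumptions made for illustration.

```python
import random


def train_test_split(n_items, train_fraction=0.7, seed=0):
    """Shuffle the indices 0 .. n_items-1 and cut them into train and test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_items * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold(indices, k=10):
    """Yield (train, validation) index lists; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(indices, fit_and_score, k=10):
    """Average a score over k folds; fit_and_score(train, validation) is hypothetical."""
    scores = [fit_and_score(train, validation) for train, validation in k_fold(indices, k)]
    return sum(scores) / len(scores)


# 2436 is the sample size reported in the paper; everything else is illustrative.
train_idx, test_idx = train_test_split(2436)
folds = list(k_fold(train_idx, k=10))
```

A stratified variant, which keeps the proportion of completers and non-completers equal across folds, would be a natural refinement given the roughly 30% dropout rate, but the text does not say whether stratification was used.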
In addition, a clustering of the thesis projects was performed through the k-means clustering algorithm as well as through factor analysis using the principal component method with varimax rotation.
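As an illustration of the clustering step, the sketch below implements the basic k-means procedure (Lloyd's algorithm) with the standard library. It is not the implementation used in the study; the random initialisation, the seed and the one-dimensional toy points are assumptions, and real feature vectors would normally be standardised first.

```python
import random
from math import dist


def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on tuples of floats; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k distinct data points
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: dist(point, centroids[c])) for point in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the solution is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments


# Invented, well-separated toy points (single feature per project).
toy_points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
toy_centroids, toy_assignments = k_means(toy_points, k=2)
```

k-means is sensitive to initialisation on less clearly separated data, so in practice the algorithm is usually run several times with different seeds.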

Statistical analysis
After performing the descriptive analysis presented in Table 1, a correlation test (Spearman's) was performed in order to study the correlation between incomplete thesis projects (dropouts) and student and supervisor variables (see full correlation matrix in Table 2). This analysis revealed that non-completion (dropout) is significantly (albeit weakly) correlated with grades on method course (r=0.14, p<0.01), students' average grade in their study programme at the university (r=0.19, p<0.01), the average time it takes for supervisors to complete thesis projects (r=-0.17, p<0.01), the number of scientific publications published by supervisors (r=0.06, p<0.05), the ratio of incomplete thesis projects of supervisors (r=-0.35, p<0.01), and the total number of incomplete thesis projects of supervisors (r=-0.20, p<0.01). As can be noted, the ratio and total number of unfinished thesis projects by supervisors presented the strongest correlations with thesis dropout. One can also note that the experience of teachers measured in the number of thesis projects supervised did not render a significant correlation with thesis non-completion (r=-0.14, p>0.05). Furthermore, it is noted that the supervisor's experience measured in the number of scientific publications shows a weak and almost non-existent correlation with non-completion of thesis projects (r=-0.07, p<0.07).

Multiple independent t-tests were also performed in order to explore differences between completers and dropouts with regard to student and supervisor variables. See Table 3 for a full presentation of the t-test results. Based on these tests the following can be concluded:
• There is a significant difference between completers (M=2.02, SD=1.90) and non-completers (M=1.45 ...

Significant differences were, however, not revealed concerning the total number of scientific publications published by supervisors or the total number of thesis projects supervised by the supervisors.
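For readers unfamiliar with the independent samples t-test, a minimal sketch of the pooled-variance (Student's) test statistic follows. Whether the study relied on the equal-variance or the Welch version is not stated, so the pooled form is an assumption, and the two short samples are invented.

```python
from statistics import mean, variance


def independent_t(sample_a, sample_b):
    """Student's t statistic and degrees of freedom, assuming equal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    degrees_of_freedom = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / degrees_of_freedom
    standard_error = (pooled * (1 / n_a + 1 / n_b)) ** 0.5
    t_statistic = (mean(sample_a) - mean(sample_b)) / standard_error
    return t_statistic, degrees_of_freedom


# Invented grade-like values for two small groups.
completers = [2, 4, 6]
non_completers = [1, 3, 5]
t_value, df = independent_t(completers, non_completers)
```

The p-value is then obtained by comparing the statistic with a t distribution with the returned degrees of freedom, which a statistics package would do directly.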

Prediction of completers and non-completers
Predictive analytics was then performed using several machine learning models (Naive Bayes, Logistic Regression, Deep Learning, Decision Tree and Random Forest) in order to predict the completion/non-completion variable using the features described in Table 1. The performance across the models showed AUC values between 0.56 and 0.79, with modest gains for the Deep Learning and Logistic Regression models (see Table 4). The Logistic Regression model performed best concerning accuracy and AUC, correctly identifying almost 91% of the actual completers. However, only 42% of the actual non-completers were correctly identified (see Table 5). The Deep Learning model's performance, on the other hand, was more balanced: it was able to identify 72% of the students who did not complete the thesis project and 71% of the actual completers (see Table 6).
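The two kinds of figures reported here, AUC and the share of each class that is correctly identified, can be computed as in the following sketch. The labels and scores are invented toy values and do not reproduce Tables 4 to 6; the 0.5 decision threshold is likewise an assumption.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def recall_per_class(labels, predictions):
    """For each class, the fraction of its actual members that were predicted as such."""
    result = {}
    for cls in sorted(set(labels)):
        total = sum(1 for label in labels if label == cls)
        hits = sum(
            1 for label, pred in zip(labels, predictions) if label == cls and pred == cls
        )
        result[cls] = hits / total
    return result


# Invented toy data: 1 = non-completer, 0 = completer.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]      # hypothetical predicted dropout risk
predictions = [1 if s >= 0.5 else 0 for s in scores]

toy_auc = auc(labels, scores)
toy_recall = recall_per_class(labels, predictions)
```

A model can score well on overall accuracy and AUC while still missing most members of the minority class, which is the pattern described for the Logistic Regression model; reporting per-class recall makes that imbalance visible.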
As can be seen from Table 7, the features with most weight were the ratio of unfinished thesis projects of supervisors, students' average grade during university studies, and the average time it takes for supervisors to complete a thesis project.
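Feature weights of this kind can be obtained with the information gain ratio mentioned in the data analysis section. The sketch below computes it for a discrete feature; how the study discretised continuous features such as grades or days is not described, so the binned "high"/"low" values here are an assumption, as are the toy labels.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    total = len(values)
    return -sum(
        (count / total) * log2(count / total) for count in Counter(values).values()
    )


def gain_ratio(feature, labels):
    """Information gain of a discrete feature divided by its split information."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [label for f, label in zip(feature, labels) if f == value]
        conditional += len(subset) / total * entropy(subset)
    information_gain = entropy(labels) - conditional
    split_information = entropy(feature)
    # A feature with a single value carries no split information.
    return information_gain / split_information if split_information > 0 else 0.0


# Invented toy example: a binned supervisor ratio versus the outcome.
outcome = [1, 1, 0, 0]                       # 1 = non-completer
informative = ["high", "high", "low", "low"]
uninformative = ["high", "low", "high", "low"]
ranking = sorted(
    {"informative": gain_ratio(informative, outcome),
     "uninformative": gain_ratio(uninformative, outcome)}.items(),
    key=lambda item: item[1],
    reverse=True,
)
```

Dividing by the split information penalises features with many distinct values, which would otherwise receive inflated information gain.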

Cluster and factor analysis of complete and incomplete thesis projects
The k-means clustering algorithm was then applied to see whether the thesis projects can be grouped into distinct clusters of similar projects. A four-cluster solution demonstrated the best silhouette scores. Among the 2436 student thesis projects in the sample, 1152 thesis projects were classified into the first cluster, 623 into the second, 421 into the third, and 240 into the fourth (see Table 8). What can be concluded from the cluster analysis is that one non-completer cluster is explained by supervisors having a high ratio of unfinished thesis projects and taking, on average, more days to complete thesis projects. For the completers, there are three different clusters, characterised respectively by: 1) a smaller ratio of unfinished thesis projects and fewer average days to complete thesis projects among supervisors; 2) fewer scientific publications of supervisors; and 3) more scientific publications of supervisors. The cluster analysis thus points at three critical factors, namely: previous performance of supervisors measured in average time to complete thesis projects, the ratio of incomplete/complete thesis projects, and the research output of supervisors measured in the total number of published scientific publications.
The cluster analysis was complemented with factor analysis using the principal component method with varimax rotation. The factor analysis resulted in four factors with eigenvalues greater than 1 that explained thesis outcomes. When excluding factors that did not contain items with loadings above 0.6, three factors remained, which have been named: supervisor performance; supervisor research output; and student performance. Table 9 reports the factor loadings, eigenvalues and explained variance for each of the factors. Consequently, the results from the factor analysis demonstrate that non-completion of thesis work is dependent on supervisor performance in terms of the number of thesis projects they complete and the grades of these, how many thesis projects supervisors have experienced, supervisors' research output, and students' grades on preparatory methodology courses and their average study grade.
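The silhouette criterion used to select the number of clusters can be sketched as follows. The points and the two candidate assignments are invented, Euclidean distance is an assumption, and at least two clusters are required for the score to be defined.

```python
from math import dist
from statistics import mean


def silhouette(points, assignments):
    """Mean silhouette coefficient over all points (requires at least two clusters)."""
    clusters = {}
    for point, cluster in zip(points, assignments):
        clusters.setdefault(cluster, []).append(point)

    scores = []
    for point, cluster in zip(points, assignments):
        own = clusters[cluster]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean([dist(point, other) for other in members])
            for label, members in clusters.items()
            if label != cluster
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return mean(scores)


# Invented toy points forming two obvious groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = silhouette(toy_points, [0, 0, 1, 1])   # groups respected
bad_score = silhouette(toy_points, [0, 1, 0, 1])    # groups mixed
```

Comparing this score across candidate values of k, and keeping the k with the highest mean silhouette, is the selection rule implied by the text.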

Discussion
The bachelor thesis is, almost worldwide, a necessary last step towards the first graduation in higher education and thus constitutes a central key to both further studies in higher education as well as employment that requires higher education degrees. In light of this, non-completion results in a vast waste of faculty time and institutional resources, a devastating personal experience for students that costs precious time, loss of money and energy, and a societal loss of high-skilled workers [1,3,11].
The results demonstrated that three general groups of factors influence students' performance during the thesis process regarding completion and non-completion. These were:
• Student's previous performance in the bachelor programme
• Supervisor's thesis project performance and experience
• Supervisor's research output
The statistical analysis revealed that non-completion of thesis projects most strongly correlated with the supervisor's ability to complete thesis projects measured in the ratio of unfinished thesis projects and average time to complete thesis projects. The independent sample t-tests pointed out significant differences between completed and incomplete thesis projects regarding students' previous performance (average grade for the whole bachelor programme), supervisors' average thesis grade, supervisors' average time to complete a thesis, and supervisors' ratio of unfinished thesis projects. The conducted factor analysis pointed out three factors: the supervisor's previous thesis performance; the supervisor's research output; and the student's previous performance. In total, 43% of the variance in the thesis outcomes (regarding completion) could be explained by the supervisor's previous performance and their research output, and 12% by the student's previous performance. The cluster analysis performed generated four clusters of thesis projects. Among these, one was a non-completer cluster that is explained by supervisors having a high ratio of unfinished thesis projects and taking, on average, more days to complete thesis projects. For the completers, we identified three different clusters in which the following variables were influential:
Cluster 1) A smaller ratio of unfinished thesis projects and fewer average days to complete thesis projects among supervisors
Cluster 2) Fewer scientific publications of supervisors
Cluster 3) More scientific publications of supervisors
Thus, taking all the above-mentioned analyses into account, we can with confidence conclude that the time supervisors take, on average, to complete thesis projects, and their experience of completing or not completing thesis projects (in terms of ratio of incomplete thesis projects), play a significant role in determining the completion and non-completion of thesis projects. No similar results have been found in previous work, which to some extent has overlooked supervisor variables, and thus these findings can be considered as one of the main contributions of this paper. Furthermore, the analysis leads to the conclusion that the research output of supervisors is influential in determining success, but to a much lesser extent than the previously mentioned factors. This particular finding corroborates Jiranek (2010), who looked at supervisors of doctoral students. However, we could also conclude that the academic title of the supervisor, which was one of the variables studied, did not constitute a factor for completing thesis projects. Also, we could conclude that students' previous performance regarding average grade during the bachelor programme is shown to be one of the factors that influence completion and non-completion of thesis projects. However, this student-related factor was less influential than supervisors' average time to complete thesis projects and their ratio of unfinished thesis projects. The finding that students' previous grades are of less importance for thesis completion corroborates previous findings of Tinto [44], Pascarella and Terenzini [45], Astin [46], and Nouri et al. [47].
Another novel contribution of this study stems from the application of machine learning algorithms, which were used in order to predict thesis completion/non-completion. Using the set of features mentioned in previous sections, and especially when using the deep learning algorithm, it is possible to predict completers and non-completers of thesis projects reasonably accurately. For future work, we would suggest adding more student features/variables by using additional data collection methods such as questionnaires, in order to increase the performance of the predictive model; the limited number of student variables is one of the limitations of this study.
As a final remark, it is argued that the insights gained from the statistical analysis, and the predictive models constructed through machine learning algorithms, can be used to support the matching of students and supervisors in order to increase graduation rates and the probability of thesis projects being completed.