Predictors of Academic Achievement in Blended Learning: the Case of Data Science Minor

This paper is dedicated to studying patterns of learning behavior in connection with educational achievement in a multi-year undergraduate Data Science minor specialization for non-STEM students. We focus on analyzing predictors of academic achievement in blended learning, taking into account factors related to initial mathematics knowledge, specific traits of educational programs, online and offline learning engagement, and connections with peers. Robust Linear Regression and non-parametric statistical tests reveal a significant gap in achievement between students from different educational programs. Achievement is not related to communication on the Q&A forum, while peers do have an effect on academic success: being better than nominated friends, as well as having friends among Teaching Assistants, boosts academic achievement.


Introduction and Background
A ubiquitous proliferation of demand for data science-related competences (data literacy) poses new challenges for universities all over the world. On the one hand, high market demand makes it easy to attract students from different disciplines [1], including non-STEM disciplines [2]. On the other hand, it makes the student audience extremely heterogeneous in terms of disciplinary background and the level of mathematical preparation [3], complicating the pedagogical design.
This paper is a part of the project dedicated to studying the patterns of learning behaviour in connection with educational achievement in multi-year undergraduate data science minor specialization.
Data science is a minor specialization that undergraduate students can choose to study for two years (2nd and 3rd year of a bachelor program). The specialization unites students from different non-STEM programs and departments ranging from economics to oriental studies. Students study the basics of computer science and different methods and techniques related to data science, text mining, and social network analysis. The first cohort of students started their studies in September 2015, and the second cohort started one year later in September 2016.
Traditional face-to-face classroom settings for the data science minor are supported by a virtual learning environment (VLE), a web-based software system that includes an RStudio Server IDE and a Q&A forum. Students can access the same working environment inside and outside the class due to the web-based nature of VLE. Also, the nature of the subject and the setting inform the choice of a blended learning approach for the minor.
One of the major issues on the course is the level of skill disparity among students. Some students, especially those with previous experience in programming and IT (e.g., those who have taken statistics and IT-related courses), tend to grasp the new material more quickly, while other students may struggle with the unfamiliar topics. Consequently, the course is supplemented with an online exercise part complementing regular class-based lessons with seven modules, with each module addressing different aspects of data analysis. This strategy supports students who struggle to learn programming and follows recommendations to support advanced students [4]. Each module became available to students after the corresponding topic was covered in class by the instructor and remained open until the end of the first semester with an unlimited number of attempts. The completion of more than 60% of the assignments in the additional modules accounted for 10% of the final grade.
Having such a mixed group of students with very diverse levels of math knowledge, different motivation and learning involvement levels, as well as a combination of offline and online learning components, requires the creation of a baseline model connecting different aspects of student learning behaviours with academic achievement while considering student diversity and existing social ties. In addition, the model must answer the following research question: "How are different aspects of students' learning behaviour in a blended learning data science course connected with academic achievement, considering students' background diversity and existing social ties?" This paper discusses our first results of building such a model, combining estimates of peer effects generated by offline ties of nominated friendship with metrics capturing various aspects of learning behaviour, and partially controlling for the diversity of students' backgrounds in math.

Current Research in Learning Behaviour and Educational Achievement
Previous educational achievement (GPA) is one of the most influential predictors of a student's academic achievement on a given course [3], [4], because it may reflect both the skills the student obtained during previous courses and the intention and motivation to succeed in future courses [7]. GPA also shows students' level of adaptation to the university's academic culture. Several studies have shown that higher grades on math-related courses are a positive indicator of advanced potential even on non-math-related courses [8], [9].
A student's behaviour and academic achievement are influenced by the diversity of social groups in which he or she participates. Different types of social networks (e.g., friendship and advice) have different effects on academic success [10], and the top-performing students can form a rich club where the communication is very active and fruitful, leading to higher final academic achievement of the members [11]. Meanwhile, low-performing students are deprived of access to these networks and have fewer chances to improve their academic achievement.
In addition, peer instruction is considered an important aspect of CS courses. Specifically, [12] states this practice has been successful in the case of non-STEM students in CS courses. Students who were affiliated with a group demonstrated higher final academic achievement than those who got through the course by themselves [12].
Furthermore, research shows that the effect of the number of friends on academic achievement is nonlinear. At first, the relationship is positive, but after a certain threshold, the trend changes: having too many connections leads to lower achievement. The maintenance of the large and diverse connections needs resources, such as time and attention, that might be spent studying [10], [13].
The social environment also affects the achievement of students through social comparison mechanisms [14], [15]. Comparison with more successful peers increases an individual's achievement, although the comparison decreases the student's self-concept [16]. However, if the gap between the student and the reference group becomes too large, it deprives the student and leads to lower achievement [14].
The third group of relevant studies comes from the rapidly evolving area of learning analytics and educational data mining [17], [18]. Specifically, current developments in the field are connected with the appearance of large amounts of data about students' learning behaviour in VLEs and learning management systems (LMS). The analysis of this data may help in finding non-trivial patterns of students' learning behaviour as well as determinants of academic achievement [19].
The blended model of learning has certain features of both online and offline learning methods. One of the notable features is the automatic grading of certain types of assignments, which is usually provided by VLEs. Moreover, VLEs allow students multiple attempts to accomplish an assignment successfully, leading to two substantial strategies students might employ to accomplish the assigned tasks [20]. The first strategy assumes that students are cognitively involved in the task-solving process, and this strategy can be observed as a low ratio of wrong attempts. In contrast, students might use a trial-and-error approach. The latter strategy might be an indicator of lower course engagement and lead to worse academic achievement in the future, despite the student eventually completing the assignment [20], [21].
In addition, the assessment of students' data in the form of binary outcomes (ratio of successful vs. unsuccessful attempts), despite its simplistic nature, might also be a good predictor. In particular, Petersen claims that the existing tools used to predict academic achievement based on programming errors and debugging skills demonstrate major differences in predictive power based on course content and the programming language used [22]. Nonetheless, the type of mistakes adds to the predictive ability of the model. For example, unsuccessful students make syntactic errors (and too many attempts), preventing them from solving the problems [23].
In this regard, the non-obligatory assignments' completion rate seems to be a useful indicator of academic achievement. Meanwhile, [24] reports that the completion of the additional non-mandatory assignments has a positive effect on academic achievement only after adjusting the results based on previous programming experience.
Another predictor specific to the offline model of learning is class attendance. Research on the effect of lecture attendance on academic achievement in CS courses demonstrated a negative relationship [25]. This result is partially explained by the availability of course materials online and opportunities for independent study practice. Home assignment completeness is associated with an increased final grade; however, the effect varies considerably based on the year observed [25].
Thus, this paper explores three main dimensions of learning behaviour and student life that are related to students' achievement:
• Previous achievement, which can be operationalized either as general abilities or as the average grade the student obtained in previous years. Higher abilities indicate higher chances of success in each subsequent course. This predictor is found to be more influential than parental status, and it is believed that achievement may reflect the student's SES.
• Social environment: A student's friends can both enhance achievement (through networks of help and advice, as well as through inclusion in the rich club) and decrease it (if the student has too many friends or the gap between her or his abilities and the achievements of others is too large).
• Engagement in offline and online learning: The more students are involved in learning and in activities related to the study process, the higher their achievement. The amount of overall training, additional tasks performed by the student, attendance, and other proxies of how much the student is interested in the course should be considered.

Data and Methodology
Data (Table 1) was extracted from the VLE and contains three main parts:
Students' programming activity logs show how much the student coded overall, which is a proxy of general learning engagement. The more a student is interested in the minor, the more he or she trains coding skills while trying to solve the tasks and goes beyond the standard tasks given to everybody. The overall number of code lines written by each student was log-transformed because of its clearly exponential distribution and wide range from 500 to 20000 lines of code. The programming activity logs also provide data on non-obligatory student activity, specifically how many additional tasks students attempted and the percentage of wrong submissions.
Q&A forum communication indicates whether and to what extent the student is involved in the online activities, e.g., whether he or she uses the Q&A forum at all, and the intensity of communication measured through the number of posts, answers, comments, and votes.
Survey data with name-generator functions, which students were asked to fill in, provide information about their friends among same-year students and among teaching assistants. This information about nominated friends from each participant allows us to test whether friendships with peers are connected with the performance of students. To capture some of the social comparison effects, a variable called grade gap was created, reflecting the difference between the individual achievement of a student and the average achievement of his or her friends.
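The derived predictors described above can be illustrated with a minimal Python sketch. The function names, the toy records, and the use of the natural logarithm are assumptions made for illustration only; the text does not specify the base of the logarithm or how students without nominated friends were coded.

```python
import math
from statistics import mean

def log_code_lines(n_lines):
    """Log of the number of code lines written (natural log is an assumption)."""
    return math.log(n_lines)

def wrong_attempt_share(n_wrong, n_total):
    """Share of wrong submissions among all submissions; None if nothing was submitted."""
    if n_total == 0:
        return None
    return n_wrong / n_total

def grade_gap(student, grades, friends):
    """Student's own grade minus the mean grade of his or her nominated friends.

    Returns None when the student nominated no friends with a known grade.
    """
    friend_grades = [grades[f] for f in friends.get(student, []) if f in grades]
    if not friend_grades:
        return None
    return grades[student] - mean(friend_grades)

# Hypothetical toy data: grades and friendship nominations.
grades = {"anna": 40, "boris": 30, "clara": 35, "dima": 28}
friends = {"anna": ["boris", "clara"], "boris": ["anna"], "dima": []}

for name in grades:
    print(name, grade_gap(name, grades, friends))
```

A positive grade gap means the student outperformed the average of the nominated friends; a negative one means the friends did better on average.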
Previous achievement was measured by a student's previous year GPA. The dependent variable, academic achievement, is operationalized and measured as the sum of the raw scores for the following tests: two midterms and the final exam.
To investigate the predictors' effects on the outcome, a robust linear regression with backward variable selection and the Kruskal-Wallis non-parametric test were employed.
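To show why a robust fit is preferred when a few observations are extreme, the following sketch fits a single-predictor line by Huber-type M-estimation with iteratively reweighted least squares. This is an illustration under stated assumptions, not the procedure used in the study: the text does not name the robust estimator, the tuning constant 1.345 and the toy data are assumed, and backward variable selection over several predictors is omitted.

```python
from statistics import median

def weighted_line(x, y, w):
    """Weighted least-squares fit of y = a + b * x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b

def huber_line(x, y, c=1.345, max_iter=100, tol=1e-8):
    """Huber M-estimate of a line via iteratively reweighted least squares."""
    a, b = weighted_line(x, y, [1.0] * len(x))  # start from the OLS fit
    for _ in range(max_iter):
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust residual scale: median absolute residual, rescaled for normal errors.
        scale = median([abs(r) for r in resid]) / 0.6745
        if scale == 0:
            break
        # Huber weights: 1 for small residuals, shrinking for large ones.
        w = [1.0 if abs(r) <= c * scale else c * scale / abs(r) for r in resid]
        a_new, b_new = weighted_line(x, y, w)
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b

# Hypothetical toy data: points on y = 2 + 3x with one gross outlier.
x = list(range(10))
y = [2 + 3 * xi for xi in x]
y[9] += 50

ols_intercept, ols_slope = weighted_line(x, y, [1.0] * len(x))
rob_intercept, rob_slope = huber_line(x, y)
print("OLS slope:  ", round(ols_slope, 3))
print("Huber slope:", round(rob_slope, 3))
```

In this toy case the single outlier pulls the ordinary least-squares slope well above 3, while the reweighted fit stays close to the slope of the remaining points.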
Overall, 194 second-year students were officially enrolled in the data science specialization, with 189 cases in our data. Due to the small number of students from some educational programs, the programs were analytically split into three groups:
• Economics, management, and logistics
• Area studies, public administration, history, political science, philology, and law
• Sociology
Sociology students are separated from all other programs for the following reason. Conventionally, sociology is closer to the second block of programs; however, based on the exploratory analysis of the third-year students of the data science minor, sociology students reveal traits of students from both blocks of programs. In addition, the division including economics and similar programs can be interpreted as a proxy for the level of quantitative skills required for the program because a higher level of math skills and knowledge is expected. In contrast, the group including area studies, philology, and other social and humanities programs is more weakly connected with university-level mathematics.

Analysis and Results
The aim of this work is to determine the main predictors of student academic achievement in the blended learning model. Specifically, this work focuses on the factors associated with the following three dimensions of a student's academic life:
• Previous achievement, previous math knowledge (based on student's educational major)
• Different types of connections with peers
• Off-line and online learning activities.
Using robust OLS regression, significant factors are identified and their effects on a student's academic success are compared, providing an overview of the factors relevant to the achievements of students in the blended learning model in a data science minor program in a Russian university.
The robust linear regression results are presented in Table 2 where the final model is a result of a backward selection procedure.
The effect of engagement in learning (measured by code quantity metrics) is significant and positive. Specifically, a 1-point increase in the log of the lines of code written (range of the variable: 4-10) is associated with nearly a 3-point increase in academic achievement. This variable can also serve as an indicator of self-guided work. Another interesting result is the insignificant effect of communication activity on the Q&A forum in any form (QA and QA intensity). As shown in Table 1, forum communication is perceived as a proxy for cognitive engagement in the learning process, as substantial discussions require a deep understanding of both theoretical concepts behind data analysis and practical implementation rules. Additional data on reading the Q&A forum, unavailable for this study, could help to explain the relationship between academic achievement and passive consumption of communication.
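The coefficient on the log-transformed predictor can be read in multiplicative terms. The short derivation below assumes that the natural logarithm was used and rounds the coefficient to 3, as in "nearly a 3-point increase"; both are assumptions, and the exact estimate is the one reported in Table 2.

```latex
\begin{align*}
\hat{y} &= \beta_0 + \beta_1 \ln(\text{lines}) + \dots, \qquad \beta_1 \approx 3 \\
\Delta\hat{y} &= \beta_1 \bigl[\ln(k \cdot \text{lines}) - \ln(\text{lines})\bigr] = \beta_1 \ln k \\
k = e \approx 2.72 &: \quad \Delta\hat{y} \approx 3 \\
k = 2 &: \quad \Delta\hat{y} \approx 3 \times 0.693 \approx 2.1
\end{align*}
```

Under these assumptions, a student who writes twice as many lines of code is expected to score roughly 2 points higher, other predictors held constant.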
According to the model, there is no linear connection with the student's previous achievement (GPA) while controlling for other predictors.
The percentage of non-obligatory assignments completed (addit. tasks) and the percentage of wrong submission attempts (wrong attempts) on the online platform are complementary. The first percentage indicates the overall extra learning activity progress, while the second predictor indicates the strategy used to accomplish assignments. The student may make a number of wrong attempts hoping to arrive at the right solution, or may invest in a deeper understanding of the task and make one or just a few submissions. The strong negative effect of the share of wrong attempts on academic achievement corresponds to other studies of blended learning settings [20].
As in [24], our model shows a significant positive effect of non-mandatory exercises: the submission of all extra assignments is associated with a 12.66-point increase in academic achievement.
Sociology students (program -soc) have no statistically significant direct increase in terms of academic achievement in comparison to students from other social science & humanities programs, although students from educational programs with larger mathematics components (program -ecmanlog) have approximately a 5-point increase in mean academic achievement while controlling for other predictors. Another result is the difference in effects of attendance for the programs with a larger mathematics component. Table 2 shows that the interaction of attendance with the educational program has a significantly lower effect (attendance: program interaction term) on academic achievement among students from economics, management, and logistics departments in comparison with social and humanities programs (excluding sociology). Existing research in the field considering CS1 courses shows a negative association between lecture attendance and academic achievement [25]. In our case, one possible explanation is that the academic performance of economics and management students, whose majors more efficiently prepare them to study data science, depends less on in-class participation in their learning.
Furthermore, student gender (gender) demonstrates neither a significant direct connection nor an indirect (moderated) effect.
To study the peer effects, the difference in achievement between those who stated they have no friends (no friends) and those who stated they have friends was analysed; the two subsamples were compared using the Kruskal-Wallis test. According to the results, the null hypothesis was rejected (Kruskal-Wallis chi-squared = 10.401, df = 1, p-value < 0.01, M1 = 30, M2 = 37, n1 = 22, n2 = 170), and students with access to a friends network have higher academic achievement than their peers without nominated friends, confirming the findings in [11].
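The two-group comparison above can be sketched in plain Python as follows. The function names and the toy scores are hypothetical; the statistic uses average ranks for ties with the standard tie correction, and for two groups (df = 1) the chi-squared tail probability reduces to a complementary error function.

```python
import math

def midranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_two_groups(a, b):
    """Kruskal-Wallis H statistic and p-value for two independent samples."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = midranks(pooled)
    r_a = sum(ranks[:len(a)])
    r_b = sum(ranks[len(a):])
    h = 12 / (n * (n + 1)) * (r_a ** 2 / len(a) + r_b ** 2 / len(b)) - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n) over tie groups.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h = max(h / correction, 0.0)
    p = math.erfc(math.sqrt(h / 2))  # chi-squared survival function for df = 1
    return h, p

# Hypothetical toy scores for students without and with nominated friends.
no_friends = [22, 25, 30, 31, 28]
with_friends = [35, 37, 40, 33, 38, 36]
h, p = kruskal_two_groups(no_friends, with_friends)
print("H =", round(h, 3), "p =", round(p, 4))
```

Applying the same tail formula to the reported statistic, `math.erfc(math.sqrt(10.401 / 2))` gives a value of about 0.0013, in line with the reported p-value < 0.01.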
The grade gap with friends (grade gap) demonstrates a significant positive connection with academic achievement. Students who got higher grades than their friends did in the first semester tend to have an additional increase in academic achievement.

Conclusion
In this research, we analysed the main predictors of academic achievement for the 2nd year non-STEM undergraduates from a blended data science course. The factors related to initial mathematics knowledge, specific traits of educational programs, online and off-line learning engagement, and connections with peers were considered.
Students of social sciences and humanities educational programs have significantly lower achievement than students from economics, management, and logistics programs; moreover, the latter demonstrated a weaker positive connection between class attendance and academic achievement.
Also, communication on the educational Q&A forum, which is associated in the literature with higher cognitive involvement, showed no statistically significant connection with academic achievement when controlling for other predictors, which calls for a deeper exploration of this and other achievement factors. Many of the active forum users also show a high level of engagement in coding and online exercises. Thus, a form of the Matthew effect can be proposed here: students engaged in the subject get even more actively engaged, while others do not benefit from additional forms of task discussion.
In addition, connections with peers have a significant effect on a student's achievement. A student being more successful than their friends leads to an additional increase in academic achievement. These effects seem to work for any student regardless of educational level and specialization. One of the possible explanations is that peer support, especially when peers have more expertise in the subject, boosts the self-confidence of a student. Self-concept is also enhanced by the positive comparison with others. Specifically, being better than the frame of reference makes a student more confident in any academic domain.
This study represents our first attempt to build a baseline model of academic achievement in a blended setting of an introductory data science course and has some limitations.
First, this work mostly focuses on metrics of learning behaviour and does not consider the effects of demographic factors, although these have been shown to have minor effects (in comparison with GPA) in university settings [6].
Second, with regard to the insignificance of previous achievement effects, scenarios with relationships between previous academic success and observed academic achievement being non-linear or mediated by other variables should be further explored. While there is a significant medium correlation between previous academic success and academic achievement (r = 0.43, n = 189, p < 0.01), it also correlates significantly with engagement in learning, e.g. code lines (r = 0.52, n = 189, p < 0.01), and class attendance (r = 0.44, n = 189, p < 0.01).
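The significance of these correlations can be checked with the usual t-statistic for a correlation coefficient. The calculation below assumes that the reported values are Pearson correlations, which the text does not state explicitly.

```latex
\begin{align*}
t &= r \sqrt{\frac{n - 2}{1 - r^2}}, \qquad n = 189,\ \text{df} = 187 \\
r = 0.43 &: \quad t = 0.43 \sqrt{187 / 0.8151} \approx 6.51 \\
r = 0.52 &: \quad t = 0.52 \sqrt{187 / 0.7296} \approx 8.33 \\
r = 0.44 &: \quad t = 0.44 \sqrt{187 / 0.8064} \approx 6.70
\end{align*}
```

All three values lie far above the two-sided critical value of about 2.6 for p = 0.01 with 187 degrees of freedom, which agrees with the reported p < 0.01.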
Lastly, a crude measure of peer effects was employed without further investigation of their mechanisms using social network analysis models of peer effects.
We plan to continue our work in this direction, including measures of motivation, self-regulation and social network analysis-based measures of peer effects, and explore the ways to support student course motivation [26], [27] via educational technology [28], [29] interventions.

Ksenia Tenisheva is a senior lecturer in the Department of Sociology at National Research University Higher School of Economics, Soyuza Pechatnikov, 16, in St. Petersburg, Russia. She also works as a junior research fellow at the Sociology of Education and Science Laboratory. She has a PhD degree in Sociology from HSE University.