Open Research Online An Experimental Investigation of ‘Drill-and-Practice’ Mobile Apps and Young Children

The choice of mobile applications (apps) for learning has relied heavily on customer and teacher reviews, designers’ descriptions, and alignment with existing learning and human-computer interaction theories. There is limited empirical evidence on the educational value of mobile apps as they are actually used by children. Understanding the impact of mobile apps on young children’s learning is timely given the lack of evidence-based recommendations that could guide parents and teachers in selecting apps for their children. In this paper, we present the results of a series of Randomised Control Trials (RCTs) with 376 children aged 5 to 6 years old who interacted with two maths apps in three schools in the UK. Pre/post-test comparisons revealed learning gains in both the control and intervention groups, suggesting that the selected applications are as good as standard maths practice. Implications for the selection and use of mobile apps are discussed.


Introduction
A plethora of mobile applications (apps), often labelled as 'educational', targets young children. The mobile and tactile nature of apps facilitates a great degree of independence, enabling young children, or even toddlers and infants, to interact with them easily [1], [2]. Yet, the listing of an app in the education category of an app store does not necessarily mean that the app has educational value [3], or that it has been tested with children and shown to promote learning. Constraints such as a lack of resources (e.g., time, money) may inhibit app evaluation [4] or, in other cases, educational technology experts such as instructional designers are not involved in the design process to ensure that effective pedagogical practices and relevant game mechanics are considered. Review ratings by customers, teachers or designers may not be particularly helpful either, as they often overstate information or assess aspects not directly related to the educational quality of an app [5]. Transparency and information that could help assess the quality of apps are missing from the app stores [6].
skills and knowledge may benefit differently from using a single app. The abilities of children can influence how an app is used and the degree to which it helps or hinders learning [13]. Such knowledge can help teachers and parents make evidence-based decisions when selecting apps for children, and can inform app designers, resulting in better quality educational apps.
The examination of maths apps has attracted limited interest compared to other domains, such as language literacy [9]. Underachievement of children in maths is a global phenomenon [14], necessitating the development and testing of interventions that can help children reach (and exceed) minimum maths proficiency levels. The use of maths apps can be promising in this respect, especially if future research activities focus on testing specific app characteristics, align theory, design and outcome measures, and assess varied cognitive and skill-based outcomes [15]. This study aims to contribute to this line of work by examining the impact of two mobile apps on maths skills and knowledge and determining whether these apps can bring benefits to certain groups of children as defined by their age, gender and prior maths knowledge. Aligning with existing recommendations [10], it deployed a Randomised Control Trial (RCT) methodological design. The two apps under study, Moose Math and Monster Numbers, could be described as "instructive" or "drill-and-practice" apps; they require limited cognitive effort in the form of remembering or recalling previously acquired knowledge [9], [10]. They resemble pen-and-paper activities with the advantage of providing immediate feedback. They train children to automate tasks such as addition and subtraction through ongoing practice and repetition. Such apps are amongst the most popular and top rated in the app market [3], [16]. It is thus worthwhile to examine whether these widely used and positively perceived apps are beneficial for children and their learning.

Reviewing existing studies
An increasing number of studies, including systematic reviews, examine or summarize the learning impact of selected mobile applications on young children [9], [10], [17], [11]. In particular for maths apps, positive effects have been observed on early maths learning in typically developing children in the areas of number recognition and naming, and simple addition and subtraction [10]. Prior knowledge and performance are found to be significant moderators of effectiveness [11]. Although these studies are underpinned by a common goal (to determine the effects of mobile apps on early years maths development), they vary greatly in terms of who the children under examination are and what the apps under study look like. This suggests that a closer examination of reported studies is needed to shed some light on who can benefit the most from interacting with apps and which design features or implementations can support or promote these benefits. The skills or expertise of learners may interact with the cognitive load of tasks. For example, novice or less skilled learners may need considerable guidance and complex instructions broken down into steps, whereas more skilful learners may find that this impairs their progress [18].
The few studies available measuring effects of mobile apps on early maths learning present rather mixed findings in relation to which children can benefit the most from interacting with apps. In particular, Schacter and Jo [19] examined low-income children (Mean age = 4,6) who used the tablet-based curriculum app Math Shelf in a classroom setting and identified that the intervention group outperformed the control group. While gender and race had no effects on outcomes, prior maths knowledge had a moderating effect; children with lower pre-test scores on number sense (<50%) benefited nearly twice as much from the intervention as those with higher pre-test scores, showing the value of the app especially for low performing children. Yet, another study testing the same app with children of the same age showed contradictory outcomes. While the intervention group performed better than the control group, aligning with previous findings, pre-test scores and gender were found to moderate effects. This time the higher performers (>50%) and female children had better post-test scores in number sense [20]. Enhanced learning outcomes in numbers, shapes, space, and measure were also reported in a number of studies with children 4 to 7 years old, who interacted with a set of apps from OneBillion. In one of the reported studies, low achievers aged 4-5 years were found to benefit more from the apps than high achievers of a similar age. No impact of socio-economic status or child's first language was found [21].
Age was shown to moderate effects on STEM learning (including quantity of different sets) in a study where one group of children played a game (Mesozoic Math Adventures) and another group watched the experimenter playing the same game [22]. Younger children (Mean age = 3,6) were found to learn more from watching rather than playing the game, while older children (Mean age = 4,7) learnt equally well from playing or watching the game. These differences were explained by cognitive load, which is likely to increase when playing a game and to be difficult for the younger children to manage. Yet, in another study comparing video versus tablet-based interactions, observed differences held true even after controlling for age. Children (3,7-5,6 years old) who played a tablet-based game about approximate measuring or viewed a video-recorded version of the game demonstrated greater transfer of knowledge than a control group playing a zoo keeping game. Children in the interactive condition (tablet-based) had better outcomes in a near transfer test, whereas children in the video-recording condition were better in the far transfer test. Other covariates including gender, verbal ability, parent's education, and household income were not significant [23].
Yet, in other studies, neither age nor gender was associated with post-test performance. A tablet-based maths implementation consisting of 32 different digital games was superior to a respective computer-based one in terms of developing numbering skills, numeral literacy, mastery of number facts, calculation skills and understanding of concepts with 4- and 5-year-olds (Mean age = 5,2) [24]. Also, the game-based app Measure Up! was found to result in greater learning gains in understanding measurement concepts such as height and length, weight, and capacity in the intervention condition (Mean age = 5) than in the control condition [25]. Similarly, a numeracy app was found to improve numerical magnitude knowledge in 6-year-olds, yet a working memory game app did not result in any improvements compared to the control group. A combination of the two game apps was found to improve working memory for at least a month afterwards. No differences in age, gender, ethnicity, race, and home languages were observed between the intervention and control groups, hence these variables were excluded from the analysis [26].
A study that examined a game app closely similar in design to the two apps we examined in this paper showed improvements in the arithmetic fluency of 7-year-olds; it helped children become fluent in adding and subtracting simple sums up to 20. The app gave the children an arithmetic addition or subtraction problem (e.g., 6 + 8 = ?) and a number of possible answers. The speed and correctness of each answer were associated with game performance. Post-test comparisons showed significantly greater gains for the intervention group than the control group in subtraction using non-symbolic (dots; ::) number representations. Improvements in non-symbolic problems required students to make a calculation in order to find the answer, and therefore the authors concluded that the game improved calculation efficiency rather than retrieval efficiency, as originally expected [27]. Also, to the best of the authors' knowledge, only a single study reported equally good pre-post test outcomes between the intervention and control groups (5 and 6 years old) in mathematical abilities, spatial awareness and working memory. This concerned a comparison between a programming app (Bee-bot app), programming with pen-and-paper, and a control group [28]. The lack of significant differences was explained by standard teaching practice in addition and subtraction that may have helped all groups perform equally well in the proposed tasks, a lack of statistical power, and a possibility of Type II error.
The apps used in the aforementioned studies feature certain design characteristics. The apps of OneBillion used an in-app virtual teacher to guide children's learning with instructions and demonstrations [21]. A racing game in which children competed against a virtual enemy helped children solve certain maths problems correctly, suggesting that non-symbolic arithmetic skills can be improved through simple multiple-choice tasks [27]. The Math Shelf app, structured around games that support short-term maths goals, can be tailored to students' needs. It assigns content based on an assessment children take, which determines where in the curriculum they are [19]. The Mesozoic Math Adventures presented two games in which a character indicated what the children should do, either by asking a question that could be answered by selecting from a number of options or by asking children to test a hypothesis by, for example, arranging objects on the screen [22]. The Measure that Animal app introduces a zookeeper who needs to measure some animals but has forgotten his measuring tape. Children can select an item from a box and place it on a line to measure the animal. This interactive approach has been designed to scaffold the process of measuring [29].
With the exception of one study, the outcomes of existing studies point to enhanced post-test performance after children aged 4 to 7 years old interacted with certain maths apps. Yet, the effects of prior knowledge, age, and gender on post-test performance are rather blurred. There are mixed findings as to whether maths apps help low or high achievers in particular, and whether older (rather than younger) children and female children are those who benefit the most from interacting with apps. None of the reported studies evidenced significant effects of socio-economic status, ethnicity, child's first language, verbal ability, parent's education, or household income. These insights raise the need to explore further the effects of moderating factors in order to determine which children benefit the most from interacting with selected maths apps.

Learning through "drill-and-practice"
Early years maths curricula are mainly focused on improving skills such as counting, using numbers, and calculating addition and subtraction problems [30], [31]. Number combination problems (e.g., 6 + 4 = ?, 10 - 4 = ?) can be solved by counting, decomposing, or by automatic retrieval of the answer from memory. Children make use of specific strategies that can help them solve number combination problems, often starting with "counting all", then moving to strategies such as counting on from the biggest number and decomposing a whole into different combinations of parts. Over time, associations of problems with correct answers become established in memory and children retrieve answers rather than practising strategies to find the correct answer [32]. There are three stages to skills acquisition: cognitive (performing calculations to produce the correct answer); associative (retrieving the answer from declarative knowledge); and autonomous (no strategy is used and retrieving the answer becomes a reflex) [33].
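The progression from counting to retrieval can be made concrete with a small sketch. The Python below is purely illustrative and is not drawn from the cited studies: the function names and the "steps" count (a rough stand-in for effort) are our own assumptions, and decomposition is omitted for brevity.

```python
# Illustrative sketch only: three ways of solving a + b, each returning
# (answer, steps). "steps" is a hypothetical proxy for effort, not a
# validated cognitive measure.

def count_all(a, b):
    """'Counting all': count every item in both sets, starting from one."""
    total = 0
    steps = 0
    for _ in range(a + b):
        total += 1
        steps += 1
    return total, steps


def count_on_from_larger(a, b):
    """Counting on: start at the bigger number and count up the smaller one."""
    total = max(a, b)
    steps = 0
    for _ in range(min(a, b)):
        total += 1
        steps += 1
    return total, steps


# A memorised table of number facts, standing in for established
# problem-answer associations.
NUMBER_FACTS = {(a, b): a + b for a in range(11) for b in range(11)}


def retrieve(a, b):
    """Retrieval: look the answer up directly, with no counting steps."""
    return NUMBER_FACTS[(a, b)], 0
```

For 6 + 4, all three functions return the answer 10, but the number of counting steps falls from ten (counting all) to four (counting on) to none (retrieval), which mirrors the reduced effort that repeated practice is meant to produce.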
Drill-and-practice is a significant part of learning about number combinations that can lead to the "autonomous" stage of skills acquisition or arithmetic fluency. It is a behaviourist-oriented approach to learning that can result in conducting lower level processes (such as addition or subtraction of small sums) with limited effort. This is a significant skill as it enables greater cognitive capacity for solving complex tasks [34]. Teaching the strategies for solving a task coupled with deliberate practice were shown to result in better learning outcomes than teaching without practice [32]. Developmental differences were found in terms of practising number combinations. A computer-based task showing children the strategy to use to solve an addition problem was found to be more beneficial for 3rd graders, while a process-based training (no strategy or scaffolding provided) was more beneficial for 5th graders. It is more likely the older children could develop their own strategies for solving the tasks by possessing relevant cognitive skills, and this resulted in becoming faster in finding the correct answer. On the contrary, younger children who were given a strategy to solve the task became more accurate after practice [18].

The maths apps under study
In this paper, we examined two commercially available mobile apps that have not been researched before, Moose Math by Duck Duck Moose and Monster Numbers by Didactoons. To select apps, we first reviewed the design features of available maths apps for early years and grouped existing apps in three main categories: (i) apps linked to physical artefacts, (ii) "drill-and-practice" apps with external rewards, and (iii) apps that combined gaming and learning elements, for example a racing maths game. In this study, we chose to study one exemplary app from each of the second and third categories. The criteria we used to select the specific apps were as follows: (a) free maths apps available in both the Apple and Google stores, (b) apps not used in previously published work, and (c) apps rated at least 3.5/5. The apps under study were "instructive", that is, supporting learning through "drill-and-practice" [3] and targeting recall of simple addition and subtraction tasks and counting. At the time of writing, Moose Math was rated 4.5/5 (iOS) and 4.4/5 (100,000+ installs) (Google store), and Monster Numbers 4.5/5 (iOS) and 4.0/5 (10,000+ installs). In Moose Math, participating children were asked to interact with specific in-app learning activities that were deemed suitable for their age, in particular the Juice Mixer, Pets, and Pets Bingo.

The present study
This paper presents evidence from the project mEvaluate: Devising an evaluation framework for the design and use of mobile learning applications in early years' education, funded by the British Academy Mid-Career Fellowship scheme, the aim of which was to devise an evidence-based evaluation framework for the design and use of mobile apps for math literacy (see [35]). Project data were collected from a series of RCTs in primary schools in the UK. The following Research Questions (RQs) were addressed in the study:
RQ1: What is the learning impact of the mobile apps Moose Math and Monster Numbers on 5- to 6-year-olds? We hypothesised that the performance of participating children would improve after interacting with the two apps under study. The apps would be an opportunity for practising maths concepts and processes taught in the classroom [32]. They would enable performance of lower level processes with limited effort and help to establish the correct number associations in memory [32], [34]. The added value of using mobile apps, rather than a pen-and-paper equivalent, is that children receive immediate feedback from the apps, which can help quick recovery from mistakes and facilitate progress.
RQ2: How do children's characteristics, in particular age, gender, and previous maths performance, relate to learning using these apps? Existing studies present rather mixed findings about the impact of age and gender on maths learning with apps (e.g., [19], [22], [29]). Therefore, we hypothesised that these characteristics would not influence post-test performance. In contrast, given the evidence that previous performance is related to post-test outcomes in a positive or negative way [21], [19], we hypothesised that previous maths performance, as measured by pre-tests and the teachers' assessment of each individual child, would moderate effects on post-tests.

Context and process of data collection
We ran four RCTs in four self-selected primary schools in the UK, identified through announcements we shared with different teachers' associations in England. Using an SPSS function, we numbered and randomly allocated students within each class into a control and an intervention group. Ethical approval was gained from the ethics committee of the Open University UK. Parental consent was obtained from the guardians of all children who took part in the study. Teachers were offered Amazon vouchers as a thank you gift for their participation in the study. No incentives were offered to participating children. We treated the first school (20 Year 1 children) as a pilot of the process of implementation and data collection. In particular, we piloted and refined the pre/post-tests designed to measure impact on learning after interacting with the apps, and the instruction documents we shared with teachers that detailed how to implement the study. Also, we monitored and gave feedback to teachers on how to respond to students' queries when using the mobile app and when completing the pre/post-tests, to ensure that only limited guidance was provided to children, as more extensive guidance could bias the results of this study.
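Within-class random allocation of this kind can be sketched as follows. This is a minimal Python stand-in, not the SPSS function used in the study; the function name, the seed parameter, and the rule that an odd pupil goes to the control group are all assumptions made for illustration.

```python
import random


def allocate_within_class(pupils_by_class, seed=0):
    """Randomly split the pupils of each class into two groups.

    pupils_by_class maps a class name to a list of pupil identifiers.
    Returns a dict mapping each pupil to "intervention" or "control".
    Assumption: with an odd class size, the extra pupil joins the control group.
    """
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    allocation = {}
    for class_name, pupils in pupils_by_class.items():
        shuffled = list(pupils)  # copy so the input list is left untouched
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for index, pupil in enumerate(shuffled):
            allocation[pupil] = "intervention" if index < half else "control"
    return allocation
```

Randomising inside each class, rather than across the whole school, keeps the two groups balanced on anything that varies by class, such as the teacher or the year group.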
In terms of the socio-economic status of participating schools, all four schools were public and had a higher-than-national-average concentration of disadvantaged students (i.e., minority ethnic groups, English as an additional language, free school meals, children in care or adopted, and pupil premium, that is, the governmental grant offered to schools and families of disadvantaged children to minimise the attainment gap). The Index of Multiple Deprivation, that is, the official measure of relative deprivation of small areas in England, classifies the area around School 1 at the 7th decile (the 10th decile is the least deprived small area nationally), and Schools 2, 3 and 4 at the 1st decile, an indication that the areas where those schools are located are amongst the most deprived 10% of small areas nationally. In terms of technological equipment, School 1 had 15 iPads shared across the entire school and Schools 2, 3, and 4 had no mobile devices. In these schools, mobile devices were provided to each child by the authors.
In this paper, we excluded the pilot school and report on the outcomes from three schools (coded hereafter as School 1, School 2, School 3) with a total of N = 376 children as follows: School 1, one Year 1 and one Reception class (n = 46); School 2, two Year 1 and two Reception classes (n = 100); School 3, four Year 1 and four Reception classes (n = 230). The duration of the intervention varied slightly across schools to accommodate the needs and availability of participating teachers: children in School 1 had 8 sessions of 15-20 minutes each with the mobile app, and children in Schools 2 and 3 had 5-7 sessions averaging 15 minutes each (two sessions per week). Prior to the start of the intervention, we shared written instructions with participating teachers and briefed them orally on how they should use the devices. In School 1, the intervention took place during maths teaching. While the intervention group was interacting with the app, the control group was doing standard maths practice. In Schools 2 and 3, teachers organised the study around the school's needs; therefore, the control group was doing standard maths practice in some sessions and practising other subjects in others. The role of the teachers was to moderate or supervise the study and provide limited technical support if needed. This design aligned with existing studies examining the use of technology in classroom settings [21].
The first and last sessions at each school were coordinated by the research team, as a means to show the teachers how to implement the study and also to distribute and collect pre/post-tests. Children in the intervention group sat together and worked individually (one device per child) with the mobile devices. No guidance or help was provided by the teacher or the researchers, unless technical difficulties prevented a child from interacting with the app. The research team contacted the teachers once a week as a means to monitor the progress of the study, resolve any issues they were facing, and enhance the fidelity of the implementation.
Pre- and post-tests were designed based on the learning objectives of the apps under study, and were piloted and revised during the piloting phase. The piloting indicated that the tests were relatively lengthy and therefore they were substantially shortened. The tests followed the type of activities children were asked to complete in the app: (a) number recognition, (b) counting, (c) adding, and (d) subtracting numbers. Instructions on how to complete each activity in the tests were written in a separate document and shared with teachers across all classes. When children could not understand the instructions given, an example unrelated to the app was given and explained by the teacher. Table 1 summarizes the sample characteristics across schools. Overall, 376 self-selected children took part in the study. Participating schools had on average similar numbers of male and female children, aged between 5 and 6 years old. In terms of children's existing performance in maths, the teachers' assessment showed that in Schools 1 and 3 the majority of children had an average performance, whereas in School 2 a slight majority was above average. Another measure of students' previous maths knowledge and understanding is their scores in pre-tests, which indicated that School 1 had a lower maths average than the other two schools.

Mobile apps under study
To visualize the interaction pathways and design features of each app, we used the Activity Theory framework for analyzing serious games [36]. Moose Math presents a cyclical interaction pathway consisting of: (a) selecting a learning activity, (b) selecting a reward, (c) correctly completing a learning task, and (d) receiving a reward (see Figure 1 and Table 2). It allows for a maximum of three wrong answers to a given task before proceeding to the next one. Instructions are provided in the form of oral help before a learning task starts. Help (oral and visual) is available on demand (see purple bird in Table 3). Similarly, Monster Numbers presents a cyclical interaction pathway with separate learning and gaming tasks. The successful completion of a learning task follows a gaming session (racing game). There is no limit on the number of wrong attempts in either the learning or the gaming task (unlimited repetition of activity) (Figure 2). Instructions are both visual and written (but not oral) and are presented before the start of a gaming or learning task. In learning tasks only, these can be skipped by pressing the start button (Table 4).

Process of data analysis
Aligning with [37], we used a multiple linear regression analysis; independent variables or predictors were the pre-test scores, the condition (intervention versus control), gender, age, and previous maths performance. We transformed all pre/post-test scores to percentages to allow for easier interpretation (see [37]). The analysis considered only complete cases (listwise deletion), that is, cases where both pre- and post-test values were available. Three pre-test and six post-test cases were missing and excluded from the analysis. In all three datasets (three schools), no values over .70 were observed in the correlation matrix, the P-P plot and scatterplots showed a linear relationship of standardized residuals, and Cook's distance was not greater than 1, meeting the assumptions for running a regression.
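The data preparation and model fitting described above can be sketched in plain Python. This is a minimal illustration and not the authors' analysis software: the record field names ("pre", "post"), the function names, and the use of the normal equations are assumptions made for the example, and a statistics package would normally be used in practice.

```python
def to_percent(raw_score, max_score):
    """Express a raw test score as a percentage of the maximum score."""
    return 100.0 * raw_score / max_score


def complete_cases(records):
    """Listwise deletion: keep only children with both pre- and post-test scores.

    Each record is a dict with hypothetical keys "pre" and "post";
    a missing score is represented by None.
    """
    return [r for r in records if r["pre"] is not None and r["post"] is not None]


def ols(predictors, outcome):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    predictors: list of rows, one list of predictor values per child.
    outcome: list of outcome values (e.g., post-test percentages).
    Returns [intercept, b1, b2, ...] (unstandardised coefficients).
    """
    rows = [[1.0] + list(row) for row in predictors]  # prepend intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, outcome))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot_row = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot_row] = a[pivot_row], a[col]
        pivot = a[col][col]
        a[col] = [value / pivot for value in a[col]]
        for r in range(k):
            if r != col:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][k] for i in range(k)]
```

A typical call would pass one row per child, such as `[pre_percent, condition]` with condition coded 0 for control and 1 for intervention, together with the post-test percentages. Note that `ols` returns unstandardised coefficients, whereas β values in a regression table are usually standardised.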
Moose Math app: This app was tested in Schools 1 and 2. We ran a separate regression analysis for each participating school. We first inspected the distribution of the dependent variable (post-test scores) within each group (control, intervention) and within each school dataset. In School 1, the skewness and kurtosis measures and their standard errors, the histograms, normal Q-Q plots and box-plots, and the Kolmogorov-Smirnov test of normality (School 1: control p = .200; intervention p = .200) showed that the data were approximately normally distributed. Levene's test verified the equality of variances between the control and intervention groups (School 1: p = .65).
In School 2, the data were found to be non-normally distributed. The Kolmogorov-Smirnov test of normality was significant for both the control (p = .009) and intervention groups (p = .001). Nevertheless, we performed a regression analysis given that the sample size was 'sufficiently large', at over 80 participants, which is considered appropriate for running a parametric test [38].
Monster Numbers app: This app was tested in School 3. We first inspected the distribution of the dependent variable (post-test scores) within each group (control, intervention). The skewness and kurtosis measures and their standard errors, the histograms, normal Q-Q plots and box-plots, and the Kolmogorov-Smirnov test of normality (control p < .001, intervention p < .001) showed that the data were not normally distributed. Levene's test verified the equality of variances between the control and intervention groups (p = .793). Despite the non-normal distribution, we performed a linear regression analysis given that the sample size was 'sufficiently large', as above [38].

Gaming Levels
In-game instructions given visually and as text.
When instructions are given the running speed of the main character decreases.
Children have no option of opting out of instruction.
Instructions do not repeat when level is failed.
Ongoing feedback is shown through score, coins collected, and potion collected.
End of level evaluation based on performance.
Rewards given as parts of spacecraft.
Rating out of 3 is given.

Math Levels
Instructions given at the beginning of level as text and visual representation.
Children can skip the instructions by pressing play.
Instructions repeat when level is failed.
Ongoing feedback shown through progression bar, lives remaining, and timer.
End of level evaluation based on performance.
Rewards given as collectable coins.
Rating out of 3 is given.

School 1 (Moose Math):
The regression model explained 26% of the variance in the dependent variable (post-test scores) (R² = .26, F(2, 40) = 6.99, p < .01), with only one significant predictor. Pre-test scores significantly predicted post-test scores (β = .45, p = .003) (see Table 5), while the condition (control versus intervention group) was not statistically significant (β = -.12, p = .401, NS). After entering demographic variables, the model remained significant (R² = .29, F(5, 37) = 3.1, p < .01). The only variable predicting post-test scores was pre-test performance (β = .39, p = .025), indicating that the greater the pre-test performance, the better the post-test scores were. In particular, a one-point increase in pre-test scores corresponds to a 0.39 increase in post-test performance. Participating children had significantly better scores in post-tests, over and above the condition they were in, gender, age and previous maths performance.
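As a quick arithmetic cross-check, the overall F statistic of a regression can be recovered from R² and the degrees of freedom using F = (R² / k) / ((1 - R²) / df_residual), where k is the number of predictors. The short Python sketch below is illustrative only (the function name is ours); because the reported R² values are rounded to two decimals, the recomputed figures can only approximate the reported ones.

```python
def f_from_r_squared(r_squared, n_predictors, df_residual):
    """Overall F statistic of a regression model computed from its R-squared."""
    explained = r_squared / n_predictors
    unexplained = (1.0 - r_squared) / df_residual
    return explained / unexplained


# School 1, first model: R² = .26 with F(2, 40)
f_model_1 = f_from_r_squared(0.26, 2, 40)   # approximately 7.03
# School 1, model with demographic variables: R² = .29 with F(5, 37)
f_model_2 = f_from_r_squared(0.29, 5, 37)   # approximately 3.02
```

Both values are close to the reported F statistics of 6.99 and 3.1. The degrees of freedom also imply 43 complete cases in each model (40 + 2 + 1 and 37 + 5 + 1).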

Discussion
In this paper, we conducted three RCTs with 376 children aged 5 and 6 years old to capture the impact of two popular and highly rated "drill-and-practice" mobile maths apps at three primary schools located in relatively deprived areas of the UK. In contrast to the majority of existing studies reporting positive learning gains from using maths apps (e.g., [39], [10]), this study identified no significant differences between the app and non-app conditions. Participating children were found to have better learning outcomes in post-tests by the end of the intervention over and above the condition they were in, suggesting that both conditions (interacting with a maths app and standard teaching practice) were equally beneficial in helping children complete basic maths tasks such as counting, addition and subtraction of small numbers. In contrast to our initial hypothesis for RQ1, the intervention group did not present improved learning outcomes compared to the control group. This finding aligns with a few studies that reported non-significant gains post intervention for the app condition [40], [41], as well as with insights suggesting inflated effect sizes in studies examining constrained maths skills, as such skills have a ceiling effect, are mastered by most children, and are influenced more by direct teaching [42]. The increased performance of children in post-tests in both conditions could also be explained by overall progress in understanding early maths concepts over the period of the intervention. Counting, addition and subtraction are core topics in early years maths instruction, hence systematic classroom practice may have had a positive impact on the performance of students as a whole.
Aligning with our hypothesis for RQ2, the only factor explaining post-test performance in both the control and intervention groups was pre-test scores. The greater the children's performance in pre-tests, the better their post-test outcomes were. There was no effect of age, gender and previous maths performance (as assessed by teachers), suggesting that these factors are unrelated to post-test changes. These findings confirm studies showing that the effectiveness of mobile apps for learning is often related to prior knowledge and performance [11]. Children who were more skilful or knowledgeable might have performed better in the tests than other children either because they had developed the strategies needed to solve the tasks at hand or because they were at the "autonomous" stage of calculations, in which they could recall answers from memory with no effort [33]. In contrast, the low performing children might have performed less well due to a lack of additional guidance or explanations (either from a teacher or the app) that could help them manage the cognitive load and cope with the tasks successfully [18].
Reflecting on the delivery of the intervention, there was variation in the activities that children in the control group were engaged with across sessions and schools. For example, practising addition or subtraction using pen and paper, or receiving instruction on how to solve such problems, may have benefited the control group and helped them perform as well as the intervention group. Also, for the intervention group, the medium used to deliver the pre/post-tests was different from the medium used to practise maths concepts. Generalisation may not happen when children are instructed using a mobile device but the assessment is completed in a different format such as pen and paper. Studies on computer-based maths instruction showed that students who practised only on a computer performed better in a computer-based assessment than in a pen-and-paper one, whereas those who practised with pen and paper had similar outcomes in the computer-based and the pen-and-paper assessments [43]. These factors may have disadvantaged the intervention group and benefited the control group, which was used to practising with pen and paper. On the other hand, the design of the pre/post-tests followed the structure and content of the activities presented in the two mobile apps. In other words, they were closely aligned with the content of the in-app learning experience to facilitate near transfer [29]. Researcher-developed instruments, as opposed to standardised ones, have been shown to inflate effect sizes for app conditions [12], suggesting that the tests may have favoured the intervention group. The combination of the above factors could explain why both groups improved after the intervention.
Another factor that may explain the lack of superior post-test performance in the intervention group is the design of the two apps under study and their focus on "drill-and-practice" of already acquired knowledge. The selected apps had no elements of explicit or direct teaching or structured instruction, an app feature that has been shown to relate to enhanced learning outcomes [44]. Such features could show children the strategies to use to solve tasks, such as how to add up quantities, and help children understand and recover from mistakes in a constructive way. Given the young age of participating children (5 and 6 years old), they are more likely to be at the cognitive stage of calculations [33], that is, practising strategies to find the correct answer rather than drawing established number associations from memory [32]. Aligning with existing studies [18], "drill-and-practice" apps may be more beneficial for older children who are transitioning to the "autonomous" stage of skills acquisition or reaching arithmetic fluency. For these children, a "drill-and-practice" app could help them become faster at finding the correct answers, a skill needed for solving more complex problems.
In addition, the delivery of feedback in the apps under study may have inhibited learning and recovery from mistakes. Studies have shown that specific (and not all) types of feedback can result in enhanced learning outcomes [19], [45]. In this study, the Moose Math app provided verbal and emotional feedback (e.g., "Let's try again" or "Looks delicious") (see Table 3), that is, feedback at a 'self-level' referring to personal evaluations and affect in the form of reinforcement [46]. In contrast, help-on-demand provided feedback at a 'task-level', that is, instructions about how to proceed. These instructions were verbal, written, and graphical. Yet, the most beneficial form of feedback has been shown to be elaborative feedback, that is, providing explanations as to why an answer is correct or wrong, as well as cues and suggestions as to how to modify a response [47]. In Moose Math, elaborative feedback is provided through the help-on-demand button (see Table 3, purple bird), yet not in the task feedback, suggesting that the latter could be enhanced by explaining why an answer is correct or wrong, or by providing personalised feedback that responds to specific actions on the screen. Examining the role of feedback in Moose Math using screen recordings, Herodotou [35] has shown that feedback is perceived differently by children of the same age, with some children being unable to recover from errors after accessing oral and visual help.

Limitations
In an effort to increase the ecological validity of the study and improve the fidelity of the implementation, we produced and piloted implementation protocols with instructions as to how teachers should interact with children and how children should interact with the apps. We also had weekly email communication with teachers to discuss progress and any issues related to the implementation. This is a rather common approach to conducting research with technology in educational contexts (e.g., [44]), where teachers receive training on how to facilitate the study while the researchers are not present in all implementation sessions. Yet, we cannot rule out slight variations in the implementation by individual teachers that may have had an impact on outcomes. For example, participating children used the apps inside the classroom. In some of the sessions (as reported by some teachers), children in the control group may have been in the same physical environment carrying out other maths-related activities. This may have posed a threat to internal validity due to contamination.
There was also variation in the length of the intervention across participating classes and schools, and in the activities the control group was engaged with. In particular, we originally planned the study to span four consecutive weeks with three sessions of 20 min in each week (a total of 12 sessions). Yet, due to teachers' workload and last-minute school priorities, eight sessions of 15-20 min each ran in School 1 and 5-7 sessions of 15 min each ran in Schools 2 and 3. The shorter duration of the intervention may have had an impact on the performance of the intervention group, which could explain the lack of the superior outcomes, often cited in the literature, compared to the control group. In addition, in Schools 2 and 3, children in the control group were not always engaged with standard maths practice. Despite the instructions given to teachers, there were sessions when children were studying other, non-maths-related topics. This suggests that, in some cases, the control group may have had less exposure to maths content and teaching than the intervention group.

Conclusions
Although "drill-and-practice" maths apps are quite popular in the app market, highly rated and frequently downloaded, few studies have attempted to examine their impact on learning. In this paper, we conducted three experimental studies with a total of 376 children aged 5-6 years old from deprived areas in the UK, in an effort to assess the impact of such apps on early maths learning. The findings showed that the app condition was equivalent to standard teaching practice, suggesting that popular apps, such as Moose Math and Monster Numbers, could help children practise basic maths tasks such as counting, addition and subtraction, yet they were not superior to, nor shown to have added value over, standard maths practice. Considering the development of early maths skills and, in particular, the transition of children through different stages prior to calculating with no effort, it is suggested that teachers and app designers consider the skills and knowledge children have developed prior to using or recommending the use of "drill-and-practice" apps. Children who have developed an understanding of the strategies needed to calculate and are starting to become more autonomous in performing such tasks are those more likely to benefit from these apps. Such apps can help them develop calculation efficacy (performing tasks quickly) or establish number associations in memory, a skill needed for reducing cognitive load and enabling the solution of complex maths problems.
App designers should be cautious with the age recommendations they make for such apps (e.g., suitable for children 3-7 years old), as children up to 6 years old may not benefit from interacting with them. Apps with instructive or teaching features, including more elaborated feedback and scaffolding, might be more beneficial for these ages as they can help children develop the strategies needed to calculate effectively. In this respect, the role of instructional designers or experts in educational technology should be carefully considered in the design process; they could provide valuable insights as to how children develop [35], which features or mechanisms are more appropriate for their age, and how to embed these in the app design to enable active and personalised learning experiences. Partnerships between app designers and educational experts should be promoted to ensure that educational apps take pedagogical principles into account and have been tested with children prior to their release to the market [48]. Such evaluations could contribute to the development of an evidence base that could guide parents and teachers when choosing and using apps with children. Also, the design of apps should move from a "one size fits all" approach to more tailored and personalised approaches, using for example machine learning techniques, that consider children's individual learning needs, including prior experiences with maths and how these may relate to app use and understanding, thus presenting each child with a dynamic and tailored learning experience.