Automatic Assessment of University Teachers’ Critical Thinking Levels

This work describes a pilot study designed to test a tool that automatically assesses Critical Thinking (CT) levels through language analysis techniques. Starting from the Wikipedia database and lexical analysis procedures based on n-grams, we propose a new approach to the automatic assessment of open-ended questions, in which CT can be detected. Automatic assessment focuses on four CT macro-indicators: basic language skills, relevance, importance and novelty. The pilot study was carried out through workshops adapted from the Crithinkedu EU Erasmus+ Project model for training university teachers in the field of CT. The workshops were designed to support the development of CT teaching practices in higher education and to enhance university teachers’ own CT. The two-hour workshops were conducted in two higher education institutions, the first in the U.S.A. (CCRWT, Berkeley College NYC, 26 university teachers) and the second in Italy (Inclusive Memory project, University Roma Tre, 22 university teachers). After the two workshops, data were collected through an online questionnaire developed and adapted in the framework of the Erasmus+ Crithinkedu project. The questionnaire includes both open-ended and multiple-choice questions. The results present the CT levels shown by university teachers and the pedagogical practices they intend to promote in their courses after this experience. In addition, a comparison between the values inferred by the algorithm and those assigned by human domain experts is offered. Finally, follow-up activity is outlined, taking into consideration a further set of macro-indicators: argumentation and critical evaluation.


INTRODUCTION
In recent years, new ways to define and assess Critical Thinking have been developed. Large amounts of behavioural data connected to learning processes are stored automatically in digital platforms (e.g. social media, LMS). The analysis of data collected from virtual learning environments has attracted much attention from different fields of study, and a new research field, known as learning analytics, has emerged.
In addition, many researchers agree that multiple assessment formats are needed for Critical Thinking assessment. However, the use of open-ended questions raises problems concerning the high cost of human scoring. Automated scoring could be a viable solution to these concerns, with automated scoring tools designed to assess both short-answer and essay questions (Liu, Frankel, & Roohr, 2014). In the field of Critical Thinking assessment, Gordon, Prakken and Walton (2007) proposed a functional model for the evaluation of arguments in dialogical and argumentative contexts. Wegerif and colleagues (2010) described a computational model to identify places within e-discussions in which students adopt critical and creative thinking. Developing a computational model to identify Critical Thinking in students' written comments provides many advantages, such as assisting the researcher in finding key aspects in large amounts of data and helping a teacher or a moderator identify when students are thinking more critically.
Despite the advantages, it is important to examine the accuracy of automated scores to make sure they achieve an acceptable level of agreement with valid human scores. Liu and colleagues (2014) asserted that, in many cases, automated scoring can be used as a substitute for the second human rater and can be compared with the score from the first human rater. If discrepancies beyond what is typically allowed between two human raters occur between the human and machine scores, additional human scoring will be introduced for adjudication.
In this work, we present a pilot study carried out to test a tool developed to automatically assess Critical Thinking levels through language analysis techniques. Starting from the Wikipedia database and lexical analysis based on n-grams, we propose a new approach to the automatic assessment of open-ended questions, in which Critical Thinking levels can be detected. The prototype devised so far is based on a framework developed in previous research (Poce, Corcione & Iovine, 2012; Poce, 2015), mainly inspired by the Newman, Webb and Cochrane model (1995). The framework is composed of six macro-indicators: Basic language skills, Justification, Relevance, Importance, Critical evaluation and Novelty. The first macro-indicator, Basic language skills, assesses the correct use of the language. The Justification macro-indicator evaluates students' ability to elaborate on their thesis and support their arguments throughout a discourse. Relevance analyses the consistency of students' texts, such as the correct use of outlines and students' capability to accurately use given stimuli. The Importance macro-indicator evaluates the knowledge students use in their discourse. Finally, Critical evaluation and Novelty refer to the personal and critical elaboration of sources, data and background knowledge, with the use of new ideas and solutions associated with the initial hypothesis and students' personal thesis. At the moment, the prototype has been designed to assess four areas out of six: Basic language skills, Relevance, Importance and Novelty. To test the employability of the tool, we carried out a pilot study through a workshop adapted from the Crithinkedu EU Erasmus+ Project training course for university teachers. The workshop was designed both to support the development of Critical Thinking teaching practices at Higher Education level and to enhance university teachers' Critical Thinking.

OBJECTIVES
In the context of the Crithinkedu EU Erasmus+ Project 'Critical Thinking Across the European Higher Education Curricula', funded by the European Commission under the Erasmus+ Programme, a specific training course was designed to improve the quality of CT teaching and learning in universities across the curricula. The main idea underpinning the training course is that HE teachers need not only to be trained in methods to improve their students' Critical Thinking, but also to develop a Critical Thinking attitude within their own professional practices. Indeed, Critical Thinking development in Higher Education is often considered a priority, not only because it improves deep comprehension and allows people to be effective in the workplace, but also because Critical Thinking is a necessary mind-set for active citizens of the wider social environment (Davies & Barnett, 2015). For this reason, university teachers need to receive proper training to incorporate Critical Thinking instruction in their curricula. From previous research, it is clear that improvement in students' CT skills and dispositions cannot be a matter of implicit expectations (Marin & Halpern, 2011; Dominguez, 2018). Educators should make CT objectives explicit and include them in training and faculty development. In addition, a gap was observed between university and workplace expectations (Dominguez, 2018; see Table 1), as well as in definitions of who a "critical thinker" is and what he/she should be able to do. In the first part of this paper, the structure of the workshop adapted from the Crithinkedu training course model (Dominguez, 2018) for university teachers is described. The workshop model is aimed to: 1. support the development of Critical Thinking teaching practices at higher education level; and 2. enhance university teachers' Critical Thinking.
As mentioned, the two-hour workshop model was conducted in two Higher Educational institutions, the first carried out at Berkeley College NYC (U.S.A) 1 and the second at the University of Roma Tre 2 (Italy).

A. The workshop structure
The two workshops, carried out in the United States and in Italy, followed a general structure inspired by the Crithinkedu training course, although there were some differences due to the specific context. The course carried out in the United States took place in the setting of the 6th Annual Conference "Defining Critical in the 21st Century?" 3. The conference was devoted to Critical Thinking in Higher Education, and university professors were invited because of their interest in improving their teaching and professional practices.
The course carried out in Italy, on the other hand, took place in the framework of a project named "Inclusive Memory" 4. Local university professors from different fields were involved in order to develop Critical Thinking knowledge, skills and dispositions and to produce inclusive museum Object Based Learning paths to be used in their own courses. Our goal was to see whether the CT knowledge they had to acquire for the sake of the project would also be used in their university teaching practices.
Both workshops lasted two hours and were implemented bearing in mind the following objectives: 1. Participants should be introduced to the more general/transversal elements of CT; 2. Participants should be able to discuss and apply CT in their discipline/field; 3. Participants should be encouraged to redesign their courses with the aim of strengthening/embedding the 'teaching CT' aspects; 4. Participants should have the opportunity to discuss field/discipline-specific instances of teaching CT.
1 CCRWT October 19th 2019 https://ccrwt.weebly.com/2018ccrwt.html
2 Inclusive memory project, November 6th 2019
3 https://ccrwt.weebly.com/uploads/2/2/7/1/22712194/5683_ccrwt_program_onlinedoc_final_pdf.pdf
4 The project is aimed to support the inclusiveness of minorities and disadvantaged groups through the enjoyment of cultural heritage in museums and through the development of the 4Cs (Collaboration, Creativity, Communication and Critical Thinking).
At the beginning of each activity, the goals of the workshop were explained and negotiated with the participants. Then, the most widely used definitions of Critical Thinking, based on the Facione (1990) and Jiménez-Aleixandre and Puig (2012) conceptualizations, were shown to the participants. After that, they were invited to reflect upon learning strategies that can be used to improve Critical Thinking (e.g. Jigsaw methods, conceptual maps, problem and project based learning) and on methods to assess Critical Thinking, according to the skills and dispositions they intended to improve. After the theoretical presentation, participants were invited to work in small groups, divided according to their field background (STEM, humanities, social sciences, foreign language and literature, engineering) in the case of the USA group, or according to their role in the Inclusive Memory project in the case of the Italian group. All of them were invited to:
-Define their CT learning goals. Based on the Facione and Jiménez-Aleixandre & Puig definitions, they had to choose which CT skills or dispositions they aimed to focus on.
-Define their activities. They had to decide methods that could support the chosen CT skill or disposition to be developed.
-Define their assessment method. Finally, they had to decide on assessment methods consistent with their CT learning goals.
At the end of the workshop, groups were invited to present and compare their ideas in a plenary session and to comment on the choices made by other groups.

B. Data collection and analysis
To answer the research questions described above, data were collected after the two workshops through an online questionnaire developed and adapted in the Erasmus+ Crithinkedu project. The questionnaire includes both open-ended and multiple-choice questions. We received 22 answers from the Italian group and 26 answers from the American group.
The two questionnaires covered the same areas of interest, even though each one was adapted to its context. Both tools presented closed questions regarding the following topics:
-Personal contacts and information;
-Departments and discipline field (STEM, humanities, social sciences);
In order to detect the Critical Thinking levels shown by university teachers in the pilot activity and how they intend to change their pedagogical practices, we analysed the open questions mentioned above by comparing human assessment with the assessment carried out by a prototype for the automatic assessment of Critical Thinking, devised by the research group for this purpose.

C. The Automatic Tool for Critical Thinking Assessment
Our prototype is composed of four main modules that perform all the operations necessary to obtain the experimental results (figure 1).
1. Authentication Manager: the module is implemented within the Spring framework and manages user sessions and authentication. All operations within the system are logged, but operations are anonymous among users in order not to influence their interactions with the system. The module allows online registration via email and provides a secure login form to access the services offered.
2. Input module: this module manages the insertion of the questions and answers to be evaluated. For questions, in addition to the title and text, users are also asked to include words representing the concepts that should be included in a correct answer and words representing possible inferred developments. This information is exploited by the automatic response analysis module to evaluate the four indicators of critical thinking. Several questions or answers can be inserted at the same time by using the import function from Google Forms and uploading the generated XML file. The module interacts with Hibernate, a framework for the automatic management of entities in the local database where all the questions and answers are saved.
3. Manual evaluator: through this module, field experts can manually evaluate the indicators for the answers entered. Any question on the system can be selected, and the system then proposes in series all the answers not yet evaluated. The user can decide whether to evaluate an answer or delegate it to another teacher. For each question it is possible to associate only one anonymous evaluation; these evaluations are compared with the automatic evaluations to verify the validity of the proposed approach.
4. Automatic evaluator: this module is at the heart of the system. It uses external modules to perform the automatic evaluation of the four indicators presented.
Basic language skills: to evaluate language skills, the system relies on an external system, JLanguageTool (https://languagetool.org/). This tool, offered as an online REST web service, accepts a text and returns information on its grammar errors in just a few milliseconds. It can also return a version of the submitted text with the most probable corrections applied. This corrected version is fundamental for the more advanced analyses, because an incorrect text introduces noise that lowers the performance of the whole system. The value of the indicator is obtained by normalizing the number of errors by the number of words that make up the text of the answer.
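The normalization described above can be illustrated with a minimal Python sketch. It assumes the grammar checker's reply has already been parsed into a dictionary whose "matches" list holds one entry per detected error; the function names, the whitespace word count and the mapping of the error rate onto a 0 to 1 value are illustrative assumptions, not the prototype's actual code.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens in the answer (a simplifying assumption)."""
    return len(text.split())


def language_score(check_response: dict, text: str) -> float:
    """Return a value in [0, 1]; 1.0 means that no errors were reported.

    `check_response` is assumed to be a parsed grammar-check reply whose
    "matches" list contains one entry per detected error.
    """
    n_words = count_words(text)
    if n_words == 0:
        return 0.0  # assumption: an empty answer receives the lowest value
    n_errors = len(check_response.get("matches", []))
    error_rate = n_errors / n_words  # errors normalized by answer length
    # Assumption: clamp so that more errors than words still yields 0.0.
    return max(0.0, 1.0 - error_rate)


# Hypothetical answer and a hand-written stand-in for the checker's reply.
answer = "Critical thinking help students to evaluates new claims"
fake_response = {"matches": [{"message": "agreement"}, {"message": "verb form"}]}
score = language_score(fake_response, answer)
print(score)
```

Keeping the network call outside the scoring function means the normalization can be tested without access to the external service.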
Relevance: relevance is assessed through a Wikipedia-based analysis. Initially, the texts of the question and of the answer are sent to TAGME (https://tagme.d4science.org/tagme/), an online service that tags text with Wikipedia pages. The service returns the set of pages associated with a given text, in our case the text of the question or of the answer. To see how many topics related to the question have been described, the system intersects the titles of all the pages reached through the outgoing links of the entities tagged in the question with the titles reported by the TAGME service for the answer. The hypothesis we want to test is that the outgoing links from the pages representing the concepts of the question point to concepts that must be covered by the answers.
Novelty: the same analysis carried out for relevance is performed to evaluate this indicator, using the set of concepts defined, when the question was created, for possible inferred developments.
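The intersection step described above can be sketched in Python as follows. The entity tagger's output and the outgoing links of each Wikipedia page are passed in as plain data, so no external service is called; the function names, the toy page titles and the final ratio are hypothetical and only illustrate the idea.

```python
def expected_concepts(question_pages, outgoing_links):
    """Collect the titles reachable through outgoing links of the question's pages."""
    expected = set()
    for page in question_pages:
        expected.update(outgoing_links.get(page, ()))
    return expected


def relevance_overlap(question_pages, answer_pages, outgoing_links):
    """Intersect the concepts expected from the question with those found in the answer.

    Returns the covered titles and, as an illustrative assumption, the share
    of expected concepts that the answer covers.
    """
    expected = expected_concepts(question_pages, outgoing_links)
    covered = expected & set(answer_pages)
    ratio = len(covered) / len(expected) if expected else 0.0
    return covered, ratio


# Toy data standing in for tagger output and Wikipedia links (hypothetical).
question_pages = ["Critical thinking"]
outgoing_links = {"Critical thinking": ["Argument", "Logic", "Bias", "Evidence"]}
answer_pages = ["Logic", "Evidence", "Museum"]

covered, ratio = relevance_overlap(question_pages, answer_pages, outgoing_links)
print(sorted(covered), ratio)
```

Under the same assumptions, the novelty indicator could reuse `relevance_overlap` by replacing the expected set with the concepts entered for possible inferred developments.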
Importance: importance is evaluated through an analysis of the concepts defined when the question was created. The text of the response is processed by a POS tagger (https://nlp.stanford.edu/software/tagger.shtml), which extracts all the nouns. This set of nouns is passed to an algorithm that generates n-grams of length one to three, which are then compared with the concepts defined for the question. The number of n-grams that also appear among the concepts indicates how closely the answer addresses the topic treated. The analysis of concepts through Wikipedia is also applied, as for the previous indicators.
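A minimal Python sketch of the noun n-gram comparison follows. It assumes that the answer has already been POS-tagged into (word, tag) pairs with Penn Treebank tags, in which noun tags begin with "NN"; the helper names, the lower-casing step and the example data are illustrative assumptions.

```python
def extract_nouns(tagged_tokens):
    """Keep tokens whose tag marks a noun (NN, NNS, NNP, NNPS), lower-cased."""
    return [word.lower() for word, tag in tagged_tokens if tag.startswith("NN")]


def ngrams(tokens, max_n=3):
    """Generate all contiguous n-grams of length 1..max_n as space-joined strings."""
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def importance_matches(tagged_tokens, concepts):
    """Return the question concepts that appear among the answer's noun n-grams."""
    nouns = extract_nouns(tagged_tokens)
    normalized = {concept.lower() for concept in concepts}
    return ngrams(nouns) & normalized


# Hypothetical tagged answer and question concepts.
tagged = [("Teachers", "NNS"), ("use", "VBP"), ("problem", "NN"),
          ("solving", "NN"), ("and", "CC"), ("project", "NN"), ("work", "NN")]
concepts = ["Problem solving", "Project work", "Assessment"]

matches = importance_matches(tagged, concepts)
print(sorted(matches), len(matches))
```

Because the n-grams are built over the extracted noun sequence, multi-word concepts such as "problem solving" can match even though function words have been filtered out.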
We decided to take advantage of Wikipedia for these analyses because most of the teachers, about 87%, use Wikipedia in their didactic activities, and because the reliability of Wikipedia (primarily of the English-language edition) has also been assessed: an early study published in the journal Nature in 2005 reported that Wikipedia's scientific articles came close to the level of accuracy of the Encyclopedia Britannica (Giles, 2005).
For this first evaluation of the prototype, we analyzed only the American group, because Wikipedia.it contains only 1 million pages against the 5.5 million of the English version, and this leads to a considerable decline in performance in finding the concepts associated with questions and answers. In the future, we hope to extend the approach to other languages.
The first interaction that users have with the system, after entering the URL to reach the platform (currently hosted locally on a Roma3 server), is with the login form. A user who reaches the platform for the first time is asked to register by email, with confirmation from the system administrator.
Submitting the login form redirects the user to the main page of the system. Here the user finds all the questions inserted in the system and, for each question, can perform a manual evaluation of the answers based on the four criteria of the grid (basic language skills, relevance, novelty and importance).
When the user chooses manual evaluation, the texts of the question and of the answer are displayed. Through four checkboxes, the values of each Critical Thinking indicator can be entered manually. Alternatively, the system can perform the automatic evaluation of the answers and create the entry in the database for future evaluation. By clicking on the "insert a question" button, the user is shown an insertion form in which to write the text of the question and two sets of concepts: those that should be treated in the answer and those that represent possible developments or conclusions.
In the Italian group, most of the teachers come from the departments of Educational Sciences (45.5%), Foreign Literature (22.7%), Engineering (9.1%), Economics (9.1%) and Business School (9.1%).
Manual evaluation was carried out by two domain experts, and the averages of the values collected were taken as a reference for comparisons. For each sub-skill, each question collected is marked from a minimum of 1 to a maximum of 5. The two groups, as shown in Figure 5 and Figure 6, obtained similar scores in terms of sub-skills related to Critical Thinking. In the first five skills (Basic language skills, Relevance, Importance, Argumentation, Critical Evaluation) both groups obtained a score higher than four, showing a good level of CT in the five areas. With regard to the last skill, "Novelty", both groups showed a similar score, lower than four.
In the automatic classification shown in figure 9, only 4 Critical Thinking indicators are considered, and therefore the maximum score for each question is 20. To analyze the performance of the prototype, the metric used to compare manual and automatic evaluation is accuracy, which has the following definition (1):
\[
\text{Accuracy} = \frac{\text{number of evaluations on which the prototype and the experts agree}}{\text{total number of evaluations}} \tag{1}
\]
For this first test we made some simplifications:
-The terms inserted in the system to contextualize relevance and novelty are defined as the titles of the Wikipedia pages associated with the concepts, in order not to introduce noise in the evaluation process.
-The manual scoring was divided into three classes of values (negative, neutral and positive) for each indicator of critical thinking.
For this first pilot, an ad hoc dictionary of terms for the recognition of synonyms and related concepts was built as a simple query expansion module applied to the concepts, in order to maximise the number of retrieved entities for the relevance, importance and novelty indicators.
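The comparison between manual and automatic evaluation can be sketched in Python as below. The source states only that scores on the 1 to 5 scale were grouped into three classes and that accuracy was used; the class thresholds, the function names and the example scores in this sketch are assumptions made for illustration.

```python
def to_class(score: float) -> str:
    """Map a 1-5 score to one of three classes (thresholds are an assumption)."""
    if score < 2.5:
        return "negative"
    if score <= 3.5:
        return "neutral"
    return "positive"


def accuracy(manual_scores, automatic_scores):
    """Share of evaluations where manual and automatic classes coincide."""
    if len(manual_scores) != len(automatic_scores):
        raise ValueError("both score lists must have the same length")
    if not manual_scores:
        return 0.0
    agreements = sum(
        to_class(m) == to_class(a)
        for m, a in zip(manual_scores, automatic_scores)
    )
    return agreements / len(manual_scores)


# Hypothetical scores: manual values are averages of two expert ratings.
manual = [4.5, 3.0, 2.0, 5.0]
automatic = [4.0, 4.0, 1.0, 3.0]
print(accuracy(manual, automatic))
```

Comparing classes rather than raw scores means that small numeric differences within the same class still count as agreement.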
Currently, the prototype can only evaluate texts related to the domain of the three questions considered. Following the typical approach to the development of classifiers, we try to identify the best features to describe the problem and then try to generalize these conclusions outside the analyzed domain. In the future we will try to create these dictionaries automatically by exploiting knowledge bases such as DBpedia (https://wiki.dbpedia.org/) or WordNet (https://wordnet.princeton.edu). Under these conditions the prototype agreed with the domain expert in 30% of cases. When only a sub-sample of the dataset is analyzed, the one with the best answers (more complete and longer in terms of words), the value grows to almost 34%.
The best evaluations were obtained for the Basic language skills and Importance indicators, with accuracy values of 67% and 39% respectively. The result is not yet satisfactory for effective classification, considering the narrow domain of the study (only three questions) and the restrictions made, but it has allowed us to identify many points on which to extend the approach. An analysis of the misclassified instances highlighted some evidence: the concepts associated with the Importance indicator must be defined very specifically, otherwise the system cannot evaluate the indicator correctly, because general concepts lead the system off topic in the Wikipedia analysis. Moreover, the more general the question is, the worse the system performs in calculating the relevance of the answer, because of the number of concepts found in the explored Wikipedia pages that are not related to the question.
Finally, to increase the accuracy of the classification, it may be interesting to analyze a semantic database for a better contextualization of the questions and answers considered; specifically, it could be interesting to extract the set of associated concepts and traverse the tree of the Wikipedia categories linked to the pages, going back to common nodes, to better recognize the level of relevance.
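One way to picture the category traversal proposed above is the following Python sketch, which walks upward from two categories and returns the closest category they share. The category graph is a hand-written toy example rather than real Wikipedia data, and the function names and the tie-breaking rule are hypothetical.

```python
from collections import deque


def ancestors_with_depth(category, parents):
    """Breadth-first walk upward; return {category: steps needed to reach it}."""
    depths = {category: 0}
    queue = deque([category])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in depths:
                depths[parent] = depths[current] + 1
                queue.append(parent)
    return depths


def nearest_common_category(cat_a, cat_b, parents):
    """Return the shared ancestor with the smallest combined distance, or None."""
    up_a = ancestors_with_depth(cat_a, parents)
    up_b = ancestors_with_depth(cat_b, parents)
    common = set(up_a) & set(up_b)
    if not common:
        return None
    # Ties are broken alphabetically so that the result is deterministic.
    return min(common, key=lambda c: (up_a[c] + up_b[c], c))


# Toy category graph mapping each category to its parents (hypothetical).
parents = {
    "Logic": ["Reasoning"],
    "Pedagogy": ["Education"],
    "Critical thinking": ["Reasoning", "Education"],
    "Reasoning": ["Cognition"],
    "Education": ["Society"],
}

print(nearest_common_category("Logic", "Critical thinking", parents))
print(nearest_common_category("Logic", "Pedagogy", parents))
```

A short combined distance to the common node could then be read as a sign that two concepts are closely related, although how to turn that distance into a relevance level is left open here.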

V. FINAL REMARKS
Taking into consideration the starting research questions, some final remarks can be made. First of all, the data collected here are limited to a pilot activity in which a small number of participants was involved (48 in total), so no generalisation is possible. University teachers within the sample have a fairly correct interpretation and knowledge of CT, and they show positive results in 4 out of the 5 CT macro-indicators. The attempt to automate CT assessment through open-ended questions is at its start but proves to be a useful support to human evaluation. The use of language analysis procedures seems to be a possible direction, according to the first results collected in the present study. The research group therefore feels encouraged to follow up the research described above through further experimentation, working also on different macro-indicators from the adapted Newman, Webb and Cochrane model employed so far.
In follow-up research, the textual corpus will be expanded, because the prototype performed slightly better with longer and more elaborate open answers. We will conduct further validation studies with a larger sample and with different kinds of questions.
In addition, the research group is interested in understanding, through task analysis, which kinds of mental processes are involved in detecting Critical Thinking in open answers; the information collected will be used to improve the algorithm's design and functioning.

Antonella Poce is Associate Professor in Experimental Pedagogy at the University Roma Tre, Department of Education. Her research concerns innovative teaching practices in higher education at national and international level. She has been a member of EDEN (European Distance and E-Learning Network) since 2009, was elected Chair of NAP SC in 2017 and has been an EDEN Executive Committee member since then. She has been a member of ICOM-CECA (Committee for Education and Cultural Action) since 2006. She has coordinated 4 departmental projects and 4 Erasmus+ projects. She chairs the two-year postgraduate course "Advanced Studies in Museum Education" (email: antonella.poce@uniroma3.it).
Carlo De Medio is a PhD candidate in Computer Science at University of Roma TRE. His research interests are in the field of Adaptive learning and Critical Thinking Evaluation tools. Contribution: analysis of technological tools to promote social inclusion in museum contexts, production and evaluation of Inclusive Memory OERs and MOOC, transversal skill development assessment, project evaluation.
Francesca Amenduni is a PhD Student in Culture, Education and Communication at the University of Roma TRE in collaboration with University of Foggia. Her expertise is in the e-learning field both as practitioner and researcher. She has worked as e-learning tutor and instructional designer since 2015. She has carried out research related to blended learning and her current PhD project regards semi-automated assessment of Critical Thinking in e-learning forums.
Maria Published as submitted by the author(s).