Information Theoretic Clustering for an Intelligent Multilingual Tutoring System

Working in groups and collaborating, rather than working individually, has unquestionably helped people develop and accomplish more than could have been imagined. It is commonly believed that human beings have an inner need to act as social beings. In computer science, and particularly in intelligent tutoring systems, the related scientific literature strengthens the conviction that students, too, may improve the way they learn when they work in groups and are organized in clusters. In this paper, we present information theoretic clustering developed in the context of student collaboration in an environment for learning the English and French languages. Collaborative student groups are created with respect to the corresponding user models.


I. INTRODUCTION
During the last few years, the ever-increasing role that computers can play in education has been emphasized, and the combination of computer-assisted learning with data clustering has been realized [1]. The massive use of intelligent tutoring systems widens the avenues of learning. Furthermore, young people's everyday practice is interconnected with their language learning activities [2]. Consequently, on the one hand, there is societal knowledge of young people's considerable interaction with computers using English; on the other hand, language education still has to improve its understanding of the changing conditions for languages in order to make more informed use of the language-learning potential of this new arena. When incorporating computers in educational contexts, we have to recognize that young people's engagement with them in everyday life belongs to their "self-directed practices" [3], which differ from school practices in many respects. This implies that students represent and express themselves differently in the naturally occurring linguistic activities of their everyday computer use than in the more instructionally designed language teaching and learning practices of schools. Therefore, when computer-assisted approaches are used for educational purposes, the discrepancies in the views of learning, namely what counts as knowledge and the goals of the different practices, implicitly lead to tensions and practical challenges. Furthermore, there is not always an awareness of this divergence [4].
User clustering has significant educational implications in language learning and supports the learners' personal relationships and collaboration with their peers. Therefore, supporting classification in language learning promotes the educational process. Adaptive personalized e-learning systems can accelerate the learning process by revealing the strengths and weaknesses of each student; on that basis, they can dynamically plan lessons and personalize the communication and didactic strategy. Machine learning techniques are used to acquire models of individual users interacting with educational applications and to group them into communities or stereotypes with common interests. User clustering concerns the construction and application of correlated data sets, based on searching large databases. The correlations and patterns that emerge can highlight specific characteristics of an already existing group, or they can drive the construction of the group, cluster or category itself.
The authors of this paper have used the k-means algorithm in order to cluster the students in a multiple language learning environment and ameliorate the learning process in [1], [5], and [6]. Selecting an appropriate distance measure (or metric) is fundamental to many learning algorithms such as k-means. However, choosing such a measure is highly problem-specific and ultimately dictates the success (or failure) of the learning algorithm. To this end, there have been several recent approaches that attempt to learn distance functions [7], [8]. These methods work by exploiting distance information that is intrinsically available in many learning settings. In semi-supervised clustering, points are constrained to be either similar (namely, the distance between them should be relatively small) or dissimilar (the distance should be larger). In information retrieval settings, constraints between pairs of distances can be gathered from click-through feedback. In fully supervised settings, constraints can be inferred so that points in the same class have smaller distances to each other than to points in different classes [9]. Although existing algorithms for metric learning have been shown to perform well across various learning tasks, each fails to satisfy some basic requirement. First, a metric learning algorithm should be sufficiently flexible to support the variety of constraints realized across different learning paradigms. Second, the algorithm must be able to learn a distance function that generalizes well to unseen test data [9]. Finally, the algorithm should be fast and scalable.
In view of the above, we propose a novel approach of information theoretic clustering, based on entropy. Our approach generalizes the standard Euclidean distance used in the k-means clustering algorithm by admitting arbitrary linear scaling and rotations of the feature space, and models the problem in an information-theoretic setting. In this way, we offer the possibility of qualitative collaboration among students of the same cluster, so that they are capable of succeeding in multiple language learning, namely in the learning of the English and French languages.
iJET - Volume 8, Issue 6, December 2013
This paper is organized as follows. Section II presents related scientific work. Section III describes the estimation of entropy, and Section IV discusses the entropy-based information theoretic clustering and the general architecture of the system. Section V presents an evaluation of the student clustering together with a discussion of the usability of the resulting system. Finally, Section VI presents our conclusions and future plans.

II. RELATED WORK
In this section, we present the related scientific work, firstly related to machine learning in education and user clustering and secondly to Intelligent Tutoring Systems (ITSs).

A. Machine learning in education and User Clustering
In [10], the authors proposed the exploitation of machine learning techniques to improve and adapt the set of user model stereotypes by making use of users' logged interactions with the system. To do this, a clustering technique is exploited to create a set of user model prototypes; then, an induction module is run on these aggregated classes in order to improve a set of rules aimed at classifying new and unseen users. Their approach exploited the knowledge extracted by the analysis of log interaction data without requiring explicit feedback from the user.
In [11], the author presented a snapshot of what has been investigated in terms of the relationship between machine translation (MT) and foreign language (FL) teaching and learning. Moreover, the author outlined some of the implications of the use of MT and of free online MT for FL learning.
In [12], the authors investigated which human factors are responsible for the behavior and the stereotypes of digital library users, so that considering these human factors for personalization can be justified. To achieve this aim, the authors studied whether there is a statistically significant relationship between the stereotypes created by robust clustering and each human factor, including cognitive styles, levels of expertise and gender differences.
In [13], the authors focused on machine learning approaches for inducing student profiles, based on Inductive Logic Programming and on methods using numeric algorithms, to be exploited in this environment. Moreover, the authors carried out an experimental session comparing the effectiveness of these methods, along with an evaluation of their efficiency, in order to decide how best to exploit them in the induction of student profiles.
In [14], the authors studied the problem of unsupervised domain adaptation, which aims to adapt classifiers trained on a labeled source domain to an unlabeled target domain. Whereas many existing approaches first learn domain-invariant features and then construct classifiers with them, the authors proposed a novel approach that learns both jointly.
In [9], the authors presented an information-theoretic approach to learning a Mahalanobis distance function and formulated the problem as that of minimizing the differential relative entropy between two multivariate Gaussians under constraints on the distance function.
In [15], the authors presented an organized study of information theoretic measures for clustering comparison. They have shown that the normalized information distance (NID) and the normalized variation of information (NVI) satisfy both the normalization and the metric properties. Between the two, the NID is preferable since the tighter upper bound of the mutual information (MI) used for normalization allows it to better use the [0,1] range. They highlighted the importance of correcting these measures for chance agreement, especially when the number of data points is relatively small compared with the number of clusters.
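To make the two measures concrete, the following is a minimal sketch that computes them for small label lists. It assumes the commonly used definitions NID = 1 - I(U;V) / max(H(U), H(V)) and NVI = 1 - I(U;V) / H(U,V); the function names and the toy labelings are illustrative and are not taken from the cited study. No chance correction is applied here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def joint_entropy(u, v):
    """Shannon entropy of the joint labeling (u_i, v_i)."""
    return entropy(list(zip(u, v)))

def mutual_information(u, v):
    return entropy(u) + entropy(v) - joint_entropy(u, v)

def nid(u, v):
    """Normalized information distance (assumed definition, see lead-in)."""
    return 1.0 - mutual_information(u, v) / max(entropy(u), entropy(v))

def nvi(u, v):
    """Normalized variation of information (assumed definition, see lead-in)."""
    return 1.0 - mutual_information(u, v) / joint_entropy(u, v)

# Toy clusterings of six items (hypothetical data).
a = [0, 0, 0, 1, 1, 1]
b = [1, 1, 1, 0, 0, 0]   # same partition as a, with the label names swapped
c = [0, 0, 1, 1, 2, 2]   # a genuinely different partition
```

Because max(H(U), H(V)) never exceeds H(U,V), the NID of any pair is at most its NVI, which is the "tighter upper bound" referred to above. Both functions divide by an entropy, so they assume that at least one of the clusterings has more than one cluster.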
In [16], the authors proposed to embed the clustering problem into a Bayesian framework to automatically detect the number of clusters. The entropy is considered to define a prior and enables them to overcome the problem of defining a priori the number of clusters and an initialization of their centers. A deterministic algorithm derived from the standard k-means algorithm was proposed and compared with simulated annealing algorithms.

B. Intelligent Tutoring Systems
In [6], the authors presented an electronic learning platform that addresses the problem of grouping students in order to provide sophisticated user models.
Another CALL system is SignMT, presented in [17], which tries to translate sentences and phrases from different sources in four steps: word transformation, word constraint, word addition and word ordering.
Another computer-based program on second language acquisition is Diglot Reader, presented in [18], which is used in a way that students may read a native language text with second language vocabulary and grammatical structures increasingly embedded within the text.
TAGARELA is an individualized instruction program, presented in [19], which analyzes student input for different activities and provides individual feedback.
In [20], the authors described a ubiquitous e-learning tutoring system for multiple language learning, called CAMELL (Computer Assisted Multilingual E-Language Learning). It is a post-desktop model of human-computer interaction in which students "naturally" interact with the system in order to get used to electronically supported computer-based learning. Their system presents advances in user modeling, error proneness and user interface design.
In [21], the author discussed the effects of the inclusion of a Facebook activity as part of the language classroom. They examined students' feedback on the activity and the use of language in their interaction. Based on their findings, the activity proved to be advantageous because it kept the students connected with their friends while they took the language course.
In [22], the author investigated the benefits of SNS communities for learning Japanese as a second language (L2). They found that SNSs have provided "a portal for L2 learners to access other information and sources, and present a safe introduction to wider communication in the L2". Seeing the usefulness of SNS communities as a learning tool, they recommend that "researchers and educators should direct their attention to such new tools, and the communities that form around them".
In [23], the author found that simple activities on Facebook helped the less language-proficient students to become more actively involved in the language learning process.
In [24], the author investigated the possibility of gathering learners' experience and views pertaining to issues on English language learning problems in secondary school, college or university, gathering learners' views on English language teaching and learning in secondary school, college and university and gathering learners' suggestions on ways to enhance English language learning and teaching.
In [25], the authors focused on the utilization of Facebook to enhance students' interaction using English in the English Speaking Zone. In undertaking this study, the authors had two research objectives: to understand the ways in which a social network can be used to provide students with opportunities to use English as a medium of communication, and to reflect on and improve their own practice with the aim of creating an environment that can continue to be used to enhance opportunities to use the language.
In [26], the authors presented multivariate clustering for group learning on Facebook. By incorporating clustering based on multiple parameters, the problem of grouping students is treated further. Multivariate clustering is conducted by the k-means algorithm, which takes as input multiple characteristics of the students.
In [27], the authors presented misconception diagnosis in multiple language learning over social networks, assisted by user clustering. In particular, they introduced a prototype Facebook application that can diagnose the errors of users who are confused while they learn multiple languages.
In [28], the authors developed an educational application on Facebook for learning the grammatical phenomenon of conditionals. The users are modeled so that they can receive advice from the system. The procedure of personalized profiling is further improved by the incorporation of machine learning techniques for clustering.
However, after a thorough investigation of the related scientific literature, we concluded that there are no educational systems that incorporate information theoretic learning for clustering. With such clustering, students can be classified in order to collaborate better. Hence, this is another step towards the amelioration of the educational process.

III. ESTIMATING ENTROPY
Entropy plays an important role in forming many information theoretic quantities. In this paper, we are interested in how it is related to $I(X; Y)$, the mutual information between a random variable $X$ and its cluster membership $Y$; the entropy estimation techniques described in [15] have also been used in our research. Concretely [15],
$$I(X; Y) = H(X) - H(X \mid Y) \tag{1}$$
where the right-hand side is the difference between the entropy and the conditional entropy of $X$.
For high-dimensional $X$, estimating entropies from samples is a challenging problem. Standard approaches include quantizing (binning) $X$ or constructing density estimators of $X$; they often do not work well due to the "curse of dimensionality", where the number of data points required for an accurate estimate grows exponentially with the dimension [15]. There has been growing interest in using non-parametric statistics to estimate entropies. Specifically, it was shown that, given $N$ samples $\mathcal{D} = \{x_1, \ldots, x_N\}$ where $x_i \in \mathbb{R}^D$, the entropy of $X$ can be estimated [15] by
$$\hat{H}_k(X) = \frac{D}{N} \sum_{i=1}^{N} \log \lVert x_i - x_i^{(k)} \rVert + \text{const} \tag{2}$$
where $x_i^{(k)}$ is the $k$-th nearest neighbor of $x_i$ in $\mathcal{D}$ [29], [30]. The estimator approaches $H(X)$ with a convergence rate of $O(1/\sqrt{N})$. Averaging $\hat{H}_k(X)$ over all possible $k$ from 1 to $N-1$ leads to a simplified estimator [15],
$$\hat{H}(X) = \frac{D}{N(N-1)} \sum_{i \neq j} \log \lVert x_i - x_j \rVert + \text{const} \tag{3}$$
This estimator was first investigated in [31] and can be understood intuitively as follows. To estimate the entropy, one would need to obtain an unbiased estimator of $-\log p(x_i)$ such that [15]
$$H(X) \approx -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i) \tag{4}$$
For one-dimensional $X$, we can approximate $p(x_i)$ as a uniform distribution between $x_i$ and $x_j$, which gives rise to [15]
$$-\log p(x_i) \approx \log \lvert x_i - x_j \rvert \tag{5}$$
Averaging this estimator over all possible $x_j \neq x_i$, an estimator is obtained in the form of (3). A detailed derivation of this result is given in [8].
$\hat{H}(X)$ is more computationally convenient than $\hat{H}_k(X)$ as it does not need to identify nearest neighbors. Thus, the focus is on $\hat{H}(X)$ in the rest of the paper. For the conditional entropy $H(X \mid Y)$, we estimate it with the following [15]
$$\hat{H}(X \mid Y) = \sum_{y} \hat{p}(Y = y)\, \hat{H}(X \mid Y = y) \tag{6}$$
where $\hat{p}(Y = y)$ is the (empirical) prior distribution and $\hat{H}(X \mid Y = y)$ is the entropy of data samples whose corresponding $Y$ is $y$. Specifically, in the context of clustering, $Y$ stands for cluster memberships. We assume that there are $K$ clusters, each with $N_k$ data points. The conditional entropy is thus given by (up to a constant) [15]
$$\hat{H}(X \mid Y) = \frac{D}{N} \sum_{k=1}^{K} \frac{1}{N_k - 1} \sum_{i \neq j} \log \lVert x_i - x_j \rVert \tag{7}$$
where the inner summation is over data points $x_i$ and $x_j$ which are both assigned to the cluster $k$.
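The pairwise estimator and its cluster-weighted version can be sketched in a few lines. This is only an illustration under stated assumptions: it takes estimator (3) to have the pairwise form D / (N (N - 1)) times the sum over i != j of log ||x_i - x_j||, drops the additive constant, and weights each cluster's entropy by its empirical prior N_k / N as in (6). The function names and the sample points are hypothetical.

```python
import math
from itertools import combinations

def entropy_estimate(points):
    """Pairwise entropy estimate in the assumed form of (3), constant dropped.

    Assumes at least two points, all distinct (log 0 is undefined)."""
    n, d = len(points), len(points[0])
    # Each unordered pair {i, j} appears twice in the sum over i != j.
    pair_sum = 2.0 * sum(math.log(math.dist(x, y))
                         for x, y in combinations(points, 2))
    return d / (n * (n - 1)) * pair_sum

def conditional_entropy_estimate(points, labels):
    """Per-cluster entropy estimates weighted by the empirical prior N_k / N."""
    n = len(points)
    total = 0.0
    for k in set(labels):
        members = [p for p, y in zip(points, labels) if y == k]
        total += len(members) / n * entropy_estimate(members)
    return total

# Hypothetical 2-D feature vectors forming two tight groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
separated = [0, 0, 0, 1, 1, 1]   # labels that follow the two groups
mixed = [0, 1, 0, 1, 0, 1]       # labels that cut across the groups

h_separated = conditional_entropy_estimate(points, separated)
h_mixed = conditional_entropy_estimate(points, mixed)
```

On this toy data the labeling that follows the two groups yields the lower estimate, since only short within-group distances enter its logarithms. Because the constant is dropped, the values are meaningful for comparing assignments rather than as absolute entropies.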
Since the entropy $H(X)$ does not depend on how data points are assigned to different clusters, maximizing the mutual information (a clustering criterion described in detail in the next section) is equivalent to minimizing the conditional entropy. Further insight is gained by contrasting the conditional entropy with the criterion minimized in K-means [15]:
$$J(X, Y) = \sum_{k=1}^{K} \sum_{i:\, y_i = k} \lVert x_i - \mu_k \rVert^2 \tag{8}$$
where $y_i$ is the cluster membership of $x_i$ and $\mu_k$ is the centroid of the cluster $k$. Both $\hat{H}(X \mid Y)$ and $J(X, Y)$ measure how tight the clusters are. However, the conditional entropy uses the logarithm of the distances, which grows more slowly than the squared distance used by K-means, the algorithm applied in the authors' past research [1], [5], [6]. Thus, arguably the conditional entropy tends to be less sensitive and more robust to outliers. In the following, we describe how optimal clustering in the information theoretic sense can be obtained.
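The robustness argument can be checked with a small numeric experiment. The sketch below is illustrative only: it assumes the usual K-means criterion (sum of squared distances to the cluster centroid) and compares it with the sum of log-distances from an outlier to the regular points, for a hypothetical one-dimensional cluster in which the outlier is moved from 10 to 100.

```python
import math

def kmeans_objective(points, labels):
    """Sum of squared distances of each point to its cluster centroid."""
    total = 0.0
    for k in set(labels):
        members = [p for p, y in zip(points, labels) if y == k]
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total

def outlier_log_term(regular, outlier):
    """Log-distance contribution of one outlier to the regular points."""
    return sum(math.log(math.dist(p, outlier)) for p in regular)

# Hypothetical 1-D cluster with one outlier placed near, then far away.
regular = [(0.0,), (1.0,), (2.0,)]
labels = [0, 0, 0, 0]

j_near = kmeans_objective(regular + [(10.0,)], labels)
j_far = kmeans_objective(regular + [(100.0,)], labels)

log_near = outlier_log_term(regular, (10.0,))
log_far = outlier_log_term(regular, (100.0,))
```

Moving the outlier ten times farther away multiplies the squared-distance criterion by more than a hundred, whereas the log-distance term only roughly doubles, which is the sense in which the conditional entropy is less sensitive to outliers.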

IV. INFORMATION THEORETIC CLUSTERING AND GENERAL ARCHITECTURE
The conditional entropy of (7) depends on the cluster memberships of the data points. The minimization of this quantity over all possible assignments is referred to as information theoretical clustering (ITC) in [31]. Experimental results reported there have shown that this is an effective and useful clustering criterion. Despite its similarity to the K-means objective function in (8), minimizing the conditional entropy does not admit the two-step alternate minimization procedure often used in K-means. Specifically, it is not obvious how to define a single centroid (as in K-means) for each cluster and iteratively update the locations of these centroids.
Instead, in [31], the authors proposed a local search procedure that greedily assigns data points to clusters. The procedure starts with a random assignment. Then a data point is cyclically but randomly chosen from $\mathcal{D}$ and evaluated. If changing its cluster membership $k$ to a different $k'$ would result in a reduction in the conditional entropy, the data point's cluster membership is updated to $k'$. It is easy to see that the procedure converges to a local optimum. Furthermore, determining whether to change assignments can be performed efficiently, involving at most $N-1$ calculations of distances. However, there is no rigorous analysis of how many such evaluations are needed in order to converge or of how good the converged solution is in terms of approximation guarantee.
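The local search described above can be sketched as follows. This is a simplified illustration, not the implementation of [31]: the cost is the conditional entropy with constant factors dropped (for each cluster, the sum of log pairwise distances divided by N_k - 1), it is recomputed from scratch for every candidate move rather than updated with at most N - 1 distance evaluations, and two guards are our own assumptions: the random start is balanced, and a move is skipped if it would leave a cluster with fewer than two points. All names are hypothetical.

```python
import math
import random

def cluster_term(members):
    """One cluster's share of the cost, constants dropped:
    (1 / (N_k - 1)) * sum over i != j of log ||x_i - x_j||."""
    n = len(members)
    total = sum(math.log(math.dist(members[i], members[j]))
                for i in range(n) for j in range(n) if i != j)
    return total / (n - 1)

def conditional_entropy(points, labels, n_clusters):
    return sum(cluster_term([p for p, y in zip(points, labels) if y == k])
               for k in range(n_clusters))

def greedy_itc(points, n_clusters, seed=0, max_sweeps=1000):
    """Greedy reassignment; assumes distinct points and N >= 2 * n_clusters."""
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, i in enumerate(order):
        labels[i] = rank % n_clusters  # random but balanced start
    best = conditional_entropy(points, labels, n_clusters)
    for _ in range(max_sweeps):
        changed = False
        rng.shuffle(order)  # each sweep visits every point once, in random order
        for i in order:
            source = labels[i]
            if labels.count(source) <= 2:
                continue  # sketch-specific guard: keep >= 2 points per cluster
            for k in range(n_clusters):
                if k == source:
                    continue
                labels[i] = k
                cost = conditional_entropy(points, labels, n_clusters)
                if cost < best - 1e-12:
                    best, source, changed = cost, k, True
                labels[i] = source  # the move is kept only if it lowered the cost
        if not changed:
            break  # no single reassignment helps: a local optimum
    return labels, best

# Hypothetical student feature vectors forming two groups.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
          (4.0, 4.0), (4.2, 4.1), (4.1, 4.3), (4.3, 4.2)]
labels, cost = greedy_itc(points, n_clusters=2)
```

Every accepted move strictly lowers the cost, so the loop stops after finitely many sweeps at an assignment that no single reassignment can improve. As the text notes, this is only a local optimum; running the sketch with several seeds and keeping the lowest cost is a common way to mitigate that.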
In what follows, we show how we can apply information theoretic clustering for a multilingual learning system and we incorporate the techniques described in [32].
In data clustering, a label $l_k \in \{1, \ldots, C\}$, $k = 1, \ldots, N$, is assigned to each $d$-dimensional data point $x_k$, $k = 1, \ldots, N$, comprising a data set, and the data points with the same label form a group in the data set [32]. Each group should correspond to a "natural" cluster of data points, such that members of the same cluster are more similar, in some sense, than members of different clusters; in this way, collaboration between members of the same cluster can enhance the tutoring. In the current information theoretic clustering approach, similarity is quantified by a divergence measure, $D(p_1, \ldots, p_C)$, between the probability density functions (pdfs), $p_1(x), \ldots, p_C(x)$, of the clusters. The divergence measure is based on the quadratic Renyi entropy [33], which is an information theoretic measure of the uncertainty of a random variable [32]. The goal is to assign labels such that an estimate of the divergence is maximized, because this corresponds to clusters which are maximally distant in terms of the cluster pdfs [32].
The quadratic Renyi entropy is given by [32]
$$H_2(p) = -\log \int p^2(x)\, dx \tag{9}$$
where $p(x)$ is the pdf. It is a special case of the family of Renyi entropies, $H_\alpha(p) = \frac{1}{1-\alpha} \log \int p^\alpha(x)\, dx$, where $\alpha$ is the Renyi entropy parameter (Renyi, 1976). In the limit $\alpha \to 1$ the Shannon entropy is obtained.
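As a quick sanity check of these definitions, the sketch below evaluates the Renyi entropy of a one-dimensional Gaussian by numerical integration. The Gaussian, its standard deviation and the integration range are our illustrative choices. For a Gaussian with standard deviation sigma, the integral of p^2 equals 1 / (2 sigma sqrt(pi)), so the quadratic Renyi entropy has the closed form log(2 sigma sqrt(pi)), and the Shannon entropy is 0.5 log(2 pi e sigma^2).

```python
import math

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

def renyi_entropy(pdf, alpha, lo, hi):
    """H_alpha(p) = 1 / (1 - alpha) * log(integral of p^alpha), for alpha != 1."""
    return math.log(integrate(lambda x: pdf(x) ** alpha, lo, hi)) / (1.0 - alpha)

sigma = 1.5  # illustrative choice
pdf = lambda x: gaussian_pdf(x, 0.0, sigma)

h2_numeric = renyi_entropy(pdf, 2.0, -15.0, 15.0)
h2_closed = math.log(2.0 * sigma * math.sqrt(math.pi))

h_near_one = renyi_entropy(pdf, 1.001, -15.0, 15.0)
shannon = 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)
```

The numerical value for alpha = 2 agrees with the closed form, and the value for alpha close to 1 approaches the Shannon entropy, in line with the limit stated above.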
Consider two pdfs, $p_1(x)$ and $p_2(x)$. By the Cauchy-Schwarz (CS) inequality, the following holds [32]
$$\left( \int p_1(x)\, p_2(x)\, dx \right)^2 \le \int p_1^2(x)\, dx \int p_2^2(x)\, dx \tag{10}$$
Using (10), a CS divergence measure between these two pdfs may therefore be defined as [34]
$$D_{CS}(p_1, p_2) = -\log \frac{\int p_1(x)\, p_2(x)\, dx}{\sqrt{\int p_1^2(x)\, dx \int p_2^2(x)\, dx}} \tag{11}$$
This measure is principled in the sense that it is additive under statistical independence (joint densities are products of marginal densities), a fundamental requirement for an information measure; it is symmetric and obeys $D_{CS}(p_1, p_2) \in [0, \infty)$. It is zero only if $p_1(x) = p_2(x)$. It does not satisfy the triangle inequality and is for that reason not a distance metric.
The CS divergence is based on the quadratic Renyi entropy since
$$D_{CS}(p_1, p_2) = -\log \int p_1(x)\, p_2(x)\, dx - \frac{1}{2} H_2(p_1) - \frac{1}{2} H_2(p_2) \tag{12}$$
where $-\log \int p_1(x)\, p_2(x)\, dx$ can be considered a "cross" Renyi entropy between $p_1(x)$ and $p_2(x)$. As shown in the next section, the relation to the Renyi entropy provides a convenient estimator for this quantity.
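The properties listed above can be verified numerically. The sketch below assumes the CS divergence in the form -log( int p1 p2 / sqrt(int p1^2 int p2^2) ) and evaluates it for two unit-variance Gaussians by midpoint integration; the densities and the integration range are illustrative choices. For two Gaussians with equal standard deviation sigma and means differing by Delta, this form reduces to Delta^2 / (4 sigma^2).

```python
import math

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

def cs_divergence(p1, p2, lo, hi):
    """-log( int p1 p2 / sqrt(int p1^2 * int p2^2) )."""
    cross = integrate(lambda x: p1(x) * p2(x), lo, hi)
    v1 = integrate(lambda x: p1(x) ** 2, lo, hi)
    v2 = integrate(lambda x: p2(x) ** 2, lo, hi)
    return -math.log(cross / math.sqrt(v1 * v2))

def cs_divergence_via_entropies(p1, p2, lo, hi):
    """Cross Renyi entropy minus half of each quadratic Renyi entropy."""
    cross_entropy = -math.log(integrate(lambda x: p1(x) * p2(x), lo, hi))
    h2_1 = -math.log(integrate(lambda x: p1(x) ** 2, lo, hi))
    h2_2 = -math.log(integrate(lambda x: p2(x) ** 2, lo, hi))
    return cross_entropy - 0.5 * h2_1 - 0.5 * h2_2

p = lambda x: gaussian_pdf(x, 0.0, 1.0)
q = lambda x: gaussian_pdf(x, 2.0, 1.0)
lo, hi = -12.0, 14.0

d_pq = cs_divergence(p, q, lo, hi)
d_qp = cs_divergence(q, p, lo, hi)
d_pp = cs_divergence(p, p, lo, hi)
d_alt = cs_divergence_via_entropies(p, q, lo, hi)
```

With means 0 and 2 and unit variance, the divergence evaluates to approximately 1, is the same in both argument orders, vanishes for identical densities, and coincides with the entropy-based expression.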
The CS divergence may be extended to the C-cluster case as follows [32]
$$D_{CS}(p_1, \ldots, p_C) = -\log \left( \frac{1}{\kappa} \sum_{i=1}^{C-1} \sum_{j>i} \frac{\int p_i(x)\, p_j(x)\, dx}{\sqrt{\int p_i^2(x)\, dx \int p_j^2(x)\, dx}} \right) \tag{13}$$
where
$$\kappa = \sum_{c=1}^{C-1} c \tag{14}$$
To be useful as a clustering cost function, the sample-based estimator of the CS divergence must be expressed in terms of the labels of the data points. The goal of the current clustering approach is precisely to adjust the labels such that the cost function attains its maximum value. This optimization procedure may be implemented in several different ways. In [34], the authors first devised an exhaustive search optimization. This means that every clustering possibility had to be examined by computing the cost function in each case, and the best result was then picked. Since the computation of the cost function scales as $O(N^2)$, where $N$ is the number of data points, this method became prohibitive for anything but very small and manageable data sets. The results obtained were, however, very promising and instigated research focused on more efficient optimization rules.
http://www.i-jet.org
Such an information theoretic divergence measure directly captures the statistical information contained in the data, as expressed by the probability density function, and can thus produce non-convex cluster boundaries. Student clustering (Fig. 1) offers the possibility of collaboration among students of the same cluster, and thus the learning of both English and French can be improved.
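A sample-based estimate of the CS divergence, expressed through the labels of the data points, can be sketched as follows. The assumptions are ours and are not fixed by the text: each cluster pdf is modeled with a Gaussian Parzen window of width sigma, so that the integral of the product of two cluster densities becomes the average of a Gaussian kernel of width sigma * sqrt(2) over all pairs of points from the two clusters, and the pairwise terms are averaged over all cluster pairs before taking the negative logarithm. Names and data are hypothetical.

```python
import math
from itertools import combinations

def gaussian_kernel(x, y, width):
    """Isotropic Gaussian kernel evaluated at x - y."""
    d = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    norm = (2.0 * math.pi * width ** 2) ** (d / 2.0)
    return math.exp(-sq / (2.0 * width ** 2)) / norm

def mean_kernel(xs, ys, width):
    """Average kernel value over all pairs (x in xs, y in ys)."""
    return sum(gaussian_kernel(x, y, width) for x in xs for y in ys) / (len(xs) * len(ys))

def cs_divergence_from_labels(points, labels, sigma=1.0):
    """Parzen-window estimate of the multi-cluster CS divergence (>= 2 clusters)."""
    width = sigma * math.sqrt(2.0)  # two Parzen kernels of width sigma, convolved
    clusters = [[p for p, y in zip(points, labels) if y == k]
                for k in sorted(set(labels))]
    pairs = list(combinations(range(len(clusters)), 2))
    total = 0.0
    for i, j in pairs:
        cross = mean_kernel(clusters[i], clusters[j], width)
        v_i = mean_kernel(clusters[i], clusters[i], width)
        v_j = mean_kernel(clusters[j], clusters[j], width)
        total += cross / math.sqrt(v_i * v_j)
    return -math.log(total / len(pairs))

# Hypothetical student feature vectors forming two groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
d_separated = cs_divergence_from_labels(points, [0, 0, 0, 1, 1, 1])
d_mixed = cs_divergence_from_labels(points, [0, 1, 0, 1, 0, 1])
```

The labeling that follows the two groups gives a much larger divergence than the mixed one, which is what a label-adjusting optimizer would exploit. Each evaluation touches all pairs of points and therefore scales quadratically with the number of data points; the kernel width sigma is a free parameter that would need tuning in practice.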

V. EXPERIMENTAL RESULTS AND DISCUSSION
The approval of educational software by human instructors and students constitutes one of the most significant aspects that play a crucial role in its success. Educational systems have a large number of users, by whom they should be approved. Hence, evaluation of this kind of software is an important phase that has to follow development at all times. In particular, formative evaluation is one of the most critical steps in the development of learning materials, because it helps the designer improve the cost-effectiveness of the software, and this increases the likelihood that the final product will achieve its stated goals [35]. The modeling of instructors and students constitutes the cornerstone of the educational process. Hence, evaluation methods that are specifically adjusted to educational software have been presented in the scientific literature.
One such evaluation framework outlines three dimensions to evaluate: (i) context; (ii) interactions; and (iii) attitudes and outcomes [36]. The context determines the reason why the educational software is adopted in the first place, namely the underlying rationale for its development and use. Users' interactions with the software reveal information about the users' learning processes. The "outcome" stage examines information from a variety of sources, such as pre- and post-achievement tests, interviews and questionnaires with students and tutors [35].
The aforementioned framework, as described in [35], has been used for the evaluation of our multiple language learning application. The underlying rationale of our system involves providing educational features, since ITSs provide opportunities within professional education, curriculum education, and learning. Furthermore, the context of the evaluation required an emphasis on the information theoretic clustering framework of the application. Moreover, users' interactions with the application were evaluated with respect to the users' learning processes. Finally, the "outcomes" stage involved pre- and post-achievement tests before and after the use of the application. In addition, it involved questions to students and instructors, which focused mainly on evaluating the use of the aforementioned framework in the application.
In view of the above, the evaluation of our system involved both instructors and students and was conducted in two phases. In the first phase, only the instructors took part in the evaluation. The second phase concerned the evaluation of the resulting educational application and involved both instructors and students. The instructors of the second phase were exactly the same as in the first phase, so that they could have a complete experience with our language learning application. In both phases, 4 instructors participated in the evaluation. All of them are private instructors of the English language and were asked to use our system, namely to check in detail the tutoring of the multiple languages. All of the instructors who participated in the experiment were familiar with the use of computers.
When asked, all of the instructors confirmed that our language learning application had a user-friendly interface and that the clustering procedure was satisfactory. More specifically, three of them stated that they found the application very useful, while one of them stated that the application is useful. Concerning the question "How do you see the possibility of the system to create clusters and are you satisfied with the cluster, in which the system classified you?", the instructors stated that they were all very satisfied with the ability of the system to create clusters and with the cluster in which they were classified. The second phase involved 60 users in total, of whom 30 were undergraduate students and 30 were postgraduate students. The underlying rationale of the application rests on the hypothesis that such applications are more convenient and flexible to use (because students can collaborate with their peers) while retaining their educational quality. At first glance, the validity of this hypothesis might look obvious. However, there may be users who are not familiar with educational software in general and thus might not like the particular applications. On the other hand, there may be students who are very familiar with computers and are eager to use them for educational purposes. Hence, one important aspect of the evaluation was to find out whether users were indeed helped by the whole environment.
After using the application, the users were asked about their experience. 77% found the ability of the system to create clusters satisfactory and were satisfied with the cluster in which they were classified. Moreover, the students were very pleased with and enthusiastic about the opportunity that our application gave them to collaborate with their peers, namely the members of the same cluster. Fig. 2 shows a pie chart representing this result (users' answers).

VI. CONCLUSIONS AND FUTURE WORK
ITSs have already become very popular among students, and thus a new era has begun. However, many issues should be taken seriously into account so that the resulting applications can be educationally beneficial to users and the tutoring process is promoted. Moreover, instructors should be included in the educational design of such applications, as an integral part of the educational process.
Information theoretic clustering is the process of grouping, or clustering, the items comprising a data set according to a divergence measure between probability density functions based on Renyi's quadratic entropy. Such an information theoretic divergence measure directly captures the statistical information contained in the data, as expressed by the probability density function, and can thus produce non-convex cluster boundaries. As is well known, K-means clustering produces only convex clusters. The information divergence measure is estimated directly from the data with pairwise sample interactions, using concepts from kernel density estimation.
In this paper, we have described how information theoretic learning can be used in the clustering of students, so that the emerging clusters offer the possibility of collaboration among their members and thus the educational procedure may be assisted. Our application for multiple language learning is designed to provide the relevant facilities to both instructors and users, while retaining the high quality of the educational application with respect to interactivity, adaptivity, collaboration and user classification. The research conducted has resulted in the development of the aforementioned application, which has been evaluated by instructors and users. The evaluation results were very encouraging for our language learning application. The results showed that the contribution of the entropy-based information theoretic clustering can be appreciated by instructors and ordinary users and can ameliorate the teaching of languages. It is in our future plans to further evaluate our multilingual language learning system in order to examine the degree of usefulness of the individualized learning offered by our system. Furthermore, we are planning to further evaluate the system in terms of the usefulness of its multilingual support, along with the efficiency of such applications within education.

Figure 1. Information theoretic clustering for students

Figure 2. Rate of satisfaction from user clustering.