Towards identifying collaborative learning groups using social media: How social media can contribute to spontaneously initiated collaborative learning

This work reports the preliminary results of ongoing research on profiling collaborative learning groups of persons within social micro-blogging platforms such as Twitter who potentially share common interests in a specific topic. The focus is on spontaneously initiated collaborative learning in social media and on the detection of collaborative learning groups based upon their communication dynamics. The research questions targeted here are: are there useful data mining algorithms to fulfill the task of pre-selection and clustering of users in social networks, how well do they perform, and what metrics can be used for detection and evaluation within this task. The basic approach presented here rests on the hypothesis that users of social networks and their interests can be identified through the content they generate and the content they consume. Special focus is placed on a topic-oriented approach as the least common bonding point; this is also the basic criterion used to detect and outline the learning groups. The aim of this work is to deliver the first scientific pre-work for the successful implementation of recommender systems that use social network metrics and content features of social network users for the purpose of better learning group communication and information consumption.

million users. As such a platform it implies numerous daily social interactions based upon interest sharing and the exchange of opinions and experiences. Recent research has shown that social interactions with people who share the same affinities can contribute to progress in research and learning [1].
Another trend is that many of them blog and tweet about events such as conferences, especially in the communication and technical research communities [2] [3] [4]. Lately universities have also started to use the advantage of fast information exchange in micro-blogs to consolidate information sharing and discussion across courses, led by the idea of technology-based collaborative learning.
These processes implicitly create a huge number of opportunities for profiling [6]. Attendees tweet about what they notice and what they find interesting with respect to a specific topic. In the context of lecture support, for instance, this could be a particular lecture or a topic related to it. However, much of the content generated by the persons a user follows on an online platform does not offer a focused view on a specific interest; it remains noisy and unstructured. Apart from text-based search, there is still no groundbreaking alternative approach to the challenge of managing this data and the knowledge hidden inside it.
Micro-bloggers assign topics, links and media artifacts to their user-generated content. A focused view on heterogeneously disseminated information resources like these, adapted to personal preferences and learning goals, offers the possibility of spontaneous involvement in and initiation of collaborative learning tasks based upon the content matter. Interacting on the same topic with a learning focus generates opinion exchange and knowledge aggregation.
What if these users could be clustered into topic-based sub-networks according to their interests using this information? What if science could help these users receive a filtered view of the information generated in their micro sub-networks? Which methods or technologies would be suitable for this challenge? What metrics can be used to achieve this distinction?
These are the questions this paper tries to address in the specific area of collaborative learning. The efforts described here cannot offer answers to all of these questions; rather, they report a preliminary study of the scientific possibilities for detecting and clustering people with similar interests inside social networks and letting them communicate purposefully with each other within the boundaries of their interests. Such awareness enables many applications, e.g. recommender systems for collaborative learning and technology enhanced learning, or interconnecting interest groups such as learning and research communities [5]. Beyond that, this work is interesting for areas like viral marketing and market research, for placing offers and materials a certain group of users would consume [7].
Processes that happen spontaneously are mostly initiated by adequate stimuli. A familiar ambience is assumed as a fundamentally important precondition for stimuli of this kind. All methodologies presented in the following subsections use this hypothesis as a premise.

II. METHODOLOGY
Approaching such a complex task as collaborative learning content consumption inside heterogeneous information networks such as social networks, the first sub-task to be solved is to identify the information stakeholders relevant for the process of collaborative learning with respect to the information consumer. For this first task, the area of semantic-lexical analysis combined with NLP and data mining can deliver the proper tools and techniques.
However, before the clustering process can begin, the data has to be pre-processed and shaped into a form acceptable for common clustering algorithms. Then significant content features should be used to determine the least common relation between potential members of the same interest group. In the case of Twitter these are mentions, denoted in the micro-text fragments as "@someusername", and hash tags, denoted as "#sometopic". Hash tags are expected to contribute to content-related clustering, while mentions will be used to discover relatedness in the social context. This methodology follows the logic of item-based filtering in recommender system design.
Using these two common features (hash tags, mentions) as the basis for clustering and identification of potential collaborative groups is inspired by the idea that persons who communicate about the same topic potentially belong to the same interest area. On the other hand, persons mentioning the same communication actors implicitly share an interest in the content generated by that particular source.
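The extraction of these two features from raw tweet text can be sketched as follows (a minimal illustration; the regular expressions simplify Twitter's actual entity rules, and the example tweets are invented):

```python
import re
from collections import Counter

# Simplified patterns for the two significant content features;
# Twitter's real entity syntax has additional edge cases.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def extract_features(tweets):
    """Count hash tags and mentions across a user's tweets."""
    hashtags, mentions = Counter(), Counter()
    for text in tweets:
        hashtags.update(tag.lower() for tag in HASHTAG.findall(text))
        mentions.update(user.lower() for user in MENTION.findall(text))
    return hashtags, mentions

# Invented example tweets
tweets = [
    "Great keynote on #elearning at #edmedia, thanks @mebner!",
    "Slides from my #elearning talk: http://example.org @mebner @grabeeter",
]
tags, users = extract_features(tweets)
```

The occurrence counts collected here are exactly what the pair-valued vectors used later require.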
Tweets, as small as they are, can deliver astonishing results when brought into a proper context. Their usage as "social sensors" is applicable for several purposes. Lately, work on tracking the sentiment inside the "electronic word of mouth", as tweets have been described, has been published with respect to the e-commerce area [7]. Sentiment sensing platforms for tweets such as tweetfeel, Sentiment140 or Tweetsentiments have already been realized, partly powered by Twitter itself. Their realization mostly relies on classification using Support Vector Machines and sentiment-relevant reference data. All these examples support the assumption that even text fragments as short as tweets can be a reliable source for a complicated task such as classification.
This is a pre-assumption that necessarily has to be verified before the context of learning groups in the sense of e-learning can be considered. Therefore, for now, the focus of this paper remains on this precondition. The aim here was targeted primarily at the evaluation of similarity measures needed for the clustering of collaborative groups based upon specific content parts of tweets.
To the best of the authors' knowledge, no similar comparison or evaluation of similarity measures as a preamble to item-based recommendation for learning and interest groups has been done so far.

III. ACQUISITION OF DATA
The data source is the database of the Grabeeter tool, which includes the tweets of around 1600 users, mostly from the educational and research area. This tool, developed by the Social Learning Group at Graz University of Technology, simply grabs the user timeline via the regular Twitter API. Thus, potentially every person or institution that owns a Twitter account can grab his/her own tweets using Grabeeter. These tweets are then preserved in the local database of the software and can be searched via a web interface or a JavaFX-based client. Alternatively, Grabeeter offers a rudimentary REST API with the possibility to export a timeline to XML or JSON format. For local search with the Java client, tweets can be exported to the file system via the JavaFX client and indexed by an embedded Apache Lucene search engine. Grabeeter serves primarily as tweet storage. In contrast to the Twitter API, which allows insight into only the last 300 tweets, Grabeeter provides all stored tweets and imposes no restriction over time. At the moment of writing this paper, the Grabeeter database contained approximately 4,700,000 tweets, which makes it a very reliable source.
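For illustration, consuming such a JSON timeline export might look like the following sketch (the field names in the sample document are assumptions made for this example, not Grabeeter's documented schema):

```python
import json

# Hypothetical excerpt of a JSON timeline export; the keys
# "user", "tweets" and "text" are assumed for illustration only.
export = """
{
  "user": "mebner",
  "tweets": [
    {"text": "New post on #elearning"},
    {"text": "Thanks @grabeeter for archiving my timeline"}
  ]
}
"""

data = json.loads(export)
texts = [t["text"] for t in data["tweets"]]  # raw tweet texts for feature extraction
```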

A. Definitions
Considered as a simple concept, a collaborative learning group can primarily be treated as an "Interest Group". Let us define a potential "Interest Group" in a more formal way. Let G = {G1, ..., Gn} be the set of "Group Candidates", where each candidate Gi is described by a hash-tag item vector Cj and a mention item vector Lk. In the current observation, single values or value pairs are used as items, depending on which similarity function is applied (e.g. #hash-tag, or the tuple {#hash-tag, 2}, where 2 represents the occurrence count). The indexes j and k are of the same length, i.e. we assume j = k.
Let H be a "Reference Candidate" of the type "Group Candidate" as previously defined, with item vectors Cr and Lr; note that the indexes j, k and r are of the same length. Further, a pair of real-valued thresholds T = (tC, tL), each between 0 and 1, is defined. The intersection between the corresponding item sets of each "Group Candidate" and the reference, here denoted as Cj ∩ Cr and Lk ∩ Lr, delivers an intersection subset μ. This subset μ serves as the input for a similarity ratio function α. The function α delivers the correspondence ratio between either the significant content items or the social reference items from the intersection set μ and the "Group Candidate" vectors, as a value between 0 and 1.
As a final step, a threshold-based clustering function δ is applied to each α value to determine whether a "Group Candidate" Gi belongs to an "Interest Group" or not.
Hence the "Interest Group" I is defined through the factors above. For evaluation purposes, one additional measure called λ, the "acceptance ratio", is defined: the ratio between the count of accepted and the count of considered "Group Candidates".
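The interplay of μ, α, δ and λ can be illustrated with a minimal sketch (the candidate data is invented, and α is computed here as the share of overlapping items relative to the fixed vector length, which is only one plausible reading of the correspondence ratio):

```python
def alpha(candidate_items, reference_items):
    """Similarity ratio: size of the intersection subset mu
    relative to the (fixed) vector length, a value in [0, 1]."""
    mu = set(candidate_items) & set(reference_items)
    return len(mu) / len(candidate_items)

def delta(ratio, threshold):
    """Threshold-based clustering decision."""
    return ratio >= threshold

def interest_group(candidates, reference, threshold):
    """Return accepted candidates and the acceptance ratio lambda."""
    accepted = [name for name, items in candidates.items()
                if delta(alpha(items, reference), threshold)]
    lam = len(accepted) / len(candidates)
    return accepted, lam

# Invented example: two group candidates against one reference candidate
candidates = {
    "user_a": ["#elearning", "#edtech", "#mooc", "#oer", "#tel"],
    "user_b": ["#food", "#travel", "#music", "#film", "#art"],
}
reference = ["#elearning", "#oer", "#opened", "#tel", "#edmedia"]
accepted, lam = interest_group(candidates, reference, 0.2)
```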
As similarity functions in the context of "Interest Group" detection, Cosine Similarity was used for single-valued vectors, while Euclidean Distance was used as the similarity measure for value-pair vectors.

1) Cosine Similarity
This ratio can be used as a similarity measure between any two vectors representing documents, text fragments, snippets or the like. Cosine Similarity is the cosine of the angle between two vectors, which reflects their diversity. As the angle between the vectors becomes smaller, the cosine approaches the value of 1, which means that the two vectors are getting closer regarding their similarity; total diversity (orthogonal vectors) is represented by 0. Cosine Similarity is defined as:

cos(θ) = (A · B) / (‖A‖ ‖B‖)
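In code, this measure can be sketched in a few lines (plain Python; equal-length vectors are assumed, matching the fixed vector sizes used in the experiments):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||) for equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing in the same direction yield 1.0,
# orthogonal vectors yield 0.0.
```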

2) Euclidean Distance
Euclidean Distance is the basis for many similarity measures. The distance between the vectors A and B is defined as follows:

d(A, B) = √( Σi (Ai − Bi)² )

This metric is most often used to compare profiles of respondents across variables. In other words, the Euclidean distance is the square root of the sum of squared differences between corresponding elements of the two vectors. Note that out of the box no adjustment is made for differences in scale. In order to honor the scaling convention, some correlations and scaling adjustments were already made as a precondition of the current work. Also, for the purposes of evaluation, i.e. expressing the similarity ratio as a value between 0 and 1, some calculation adjustments were made to the native definition.
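A sketch of the distance and one possible rescaling to a similarity value between 0 and 1 follows (the mapping 1/(1 + d) is a common choice, not necessarily the exact adjustment made in the paper's implementation):

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between
    corresponding elements of two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def euclidean_similarity(a, b):
    """Map the unbounded distance to a similarity in [0, 1];
    identical vectors yield 1.0, growing distance approaches 0."""
    return 1.0 / (1.0 + euclidean_distance(a, b))
```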

B. Data set preparation and measurement process
As reference data for the evaluation set, 100 accounts from the Grabeeter account register were prepared. Using a simple pre-selection, all account owners who used the keyword "elearning" or "e-learning" in their tweets were taken. This was done in order to create a basic data set that allows comparing the ratio of similarity against the size of the candidate group.
For evaluation purposes, always the last 250 tweets of a specific user profile were taken into account. Out of them, vectors of the top 5, 10 and 20 hash tags and mentions per user were generated and compared using the two similarity measures, Cosine Similarity and Euclidean Distance. The vectors used as input for the similarity and ratio calculations are all of the same length. Dynamic vector size adjustment, e.g. for differently sized vectors, was intentionally left out, since the main point of interest is whether the approach delivers promising results rather than the scalability of the algorithms applied here.
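The generation of the top-N pair-valued vectors described above can be sketched as follows (the hash-tag list is an invented example):

```python
from collections import Counter

def top_n(items, n):
    """Top-n most frequent items with their occurrence counts,
    i.e. a pair-valued vector of fixed length n."""
    return Counter(items).most_common(n)

# Invented hash tags collected from a user's last tweets
hashtags = ["#elearning", "#elearning", "#oer", "#tel", "#oer", "#elearning"]
vector = top_n(hashtags, 2)
```

Dropping the counts from each pair yields the single-valued vectors used as input for Cosine Similarity.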
All measurements regarding the detection of potential "Interest Groups" were made using a specially designed Similarity API built upon the Grabeeter tool. The Similarity API was implemented in PHP using the Grabeeter database as the primary data source. Results are delivered in JSON (Fig. 2). Formatting and the final calculation of results were done using the statistic functions inside the API. Cosine Similarity and Euclidean Distance were used as similarity measures since prior research [8] referenced them as reliable indicators for the detection of text-based similarity. The distances used here belong to two different groups: Cosine Similarity uses only simple items to calculate the similarity angle between two text terms, while Euclidean Distance is calculated using the text items and their occurrence counts.
Upon these results, clustering using simple thresholds of 10% and 20% was applied to the similarity results. The @mebner account was used as the reference candidate for the target learning group, because this account can be considered one of the key competence bearers in the e-learning area. Each simulation consisted, as described above, of the similarity calculation and the calculation of the δ ratio function, which checks whether the calculated similarity result reaches the threshold. "Interest Group" potential was reflected by the number of accepted group candidates relative to the number of observed group candidates, or, as defined in Def. (8), as λ (the "acceptance ratio").
The values presented in the results section represent the median value of the retrieval ratio. To get deeper insight, the number of top "hash tags" and "mentions" was also varied from 5 to 10 to 20 in order to evaluate how the length of the parameter vectors influences the result.
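The threshold sweep and the resulting λ and median values can be illustrated with a small sketch (the similarity scores are invented stand-ins for the computed α values):

```python
import statistics

# Invented per-candidate similarity scores standing in for alpha values
scores = [0.05, 0.12, 0.18, 0.25, 0.31, 0.08, 0.22, 0.15, 0.02, 0.27]

def acceptance_ratio(scores, threshold):
    """lambda: accepted candidates over considered candidates."""
    return sum(s >= threshold for s in scores) / len(scores)

lam_10 = acceptance_ratio(scores, 0.10)   # 10% threshold
lam_20 = acceptance_ratio(scores, 0.20)   # 20% threshold
median = statistics.median(scores)        # median retrieval ratio
```

As expected, the lower threshold accepts more candidates, so λ at 10% dominates λ at 20%.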
The expectation towards the presented measurement relies on the thought that the comparison of different similarity measures should deliver first hints for techniques to build collaborative groups, and an evaluation of which of the measures is best suited for the proposed effort. Considering that the test group was quite small, the results delivered are very encouraging. A very important role in the preparation process was played by the choice of keywords for filtering the users of the candidate groups, as well as the choice of the reference candidate.

V. PRELIMINARY RESULTS AND DISCUSSION
At the beginning of this section it has to be mentioned that all of the observations made regarding the simple clustering of the potential "Interest Group(s)" aim at the evaluation of the proposed methodology and the system dynamics rather than at a qualitative analysis of the retrieved results. "Interest Group" detection is meant to serve as a pre-step for building the qualitative "Collaborative Groups". The methodologies described in this paper are meant to act as "sieves" and can be used as tools to simplify the task of building "Collaborative Groups" by reducing the number of potential candidates.

A. Single valued measurement results with Cosine Similarity

1) Evaluation of "hash tag" vectors
Evaluation results for the Cosine Similarity measure applied on "hash tag" vectors of different lengths (5, 10, 20) with thresholds of 0.1 (10%) and 0.2 (20%) can be seen in Fig. 3 and Fig. 4.

Figure 3. Cosine Similarity: ratio of the number of accepted candidates to the total number of evaluated candidates for tC ≥ 0.1

The 10% threshold seems to be reached easily according to the values in Fig. 3; the results are disseminated between 0 and 0.35 (0% and 35%). The same can be observed for the 20% threshold (Fig. 4), although there the highest "acceptance ratio" reaches only 0.2. While the 10% threshold course drifts more stably and goes hand in hand with the candidate group size, both tend to converge towards a median value. The linear behavior of both systems relies on the distribution of correspondences across the test set and on the nature of the similarity function. The 10% threshold is obviously reached more easily and causes less oscillation. On the other side, the real nature of the similarity measure for the 20% threshold only becomes recognizable for candidate sets with n > 80. Both systems tend towards stability as the candidate group increases and respond linearly with respect to the number of hash tags. In Fig. 3 there are some deviations for vectors of size 20; the reason is the structure of the data set and its potential regarding the variation of vector size. The same can be said for Fig. 4 and the vectors of size 5. It is obvious that the significantly corresponding hash tags in the test data set are placed at the top 5 positions. Fig. 5 and Fig. 6 reflect the results of applying Cosine Similarity to "mentions"; as in the case of "hash tags", the vector size was varied from 5 over 10 up to 20.

2) Evaluation of "mentions" vectors
For the 10% matching threshold, the values of λ (the "acceptance ratio") seem to perform better than for "hash tags" (0.05 < λ < 0.45). This fact points to the better distribution consistency and quality of the "mentions" retrieved from the test data. The same can be observed for the 20% threshold (0 ≤ λ ≤ 0.3). This is also reflected in the trend of λ, which changes consistently with the growth of the number of group candidates.

Figure 5. Cosine Similarity: ratio of the number of accepted candidates to the total number of evaluated candidates for tL ≥ 0.1

Figure 6. Cosine Similarity: ratio of the number of accepted candidates to the total number of evaluated candidates for tL ≥ 0.2

A similar observation as for the "hash tags" can be drawn for the application of Cosine Similarity to the "mentions" regarding the linear dependency of the "acceptance ratio" on the candidate group size. The dynamics of the system, as already mentioned, rely on the distribution of relevant "mentions" and on the nature of the similarity function. Deviations regarding the vector size are caused, as in the case of "hash tags", by the placement of relevant "mentions" inside the vector, which depends on their occurrence count. From the analysis of the course and shape of the "acceptance ratio" it can easily be concluded that in the observed data set the matching "mentions" are distributed more evenly across the data set than the "hash tags".

B. Pair valued measurement results with Euclidean Distance

1) Evaluation of "hash tag" vectors with occurrences
In the following figures, results based upon Euclidean Distance are presented. In addition to the sole "hash tags", their occurrence counts are taken into account in the calculation of the Euclidean distance. As will be shown, the occurrence counts contributed to a more stable course of the "acceptance ratio". Fig. 7 and Fig. 8 represent the results for the thresholds of 10% and 20%. It is significant that a larger number of "hash tags" in the vector for the case of the 10% threshold also increases the "acceptance ratio" (0 ≤ λ ≤ 0.3). For the 20% threshold this happens when the size of the candidate group exceeds 70, with an approximately half as large "acceptance ratio" (0 ≤ λ ≤ 0.12).
Except for two deviating values at n = 50 and n = 60, the observations made at the 10% threshold mainly correspond with the 20% case. It is also evident, especially for the 10% threshold, that the "acceptance ratio" reaches a nearly median value. For all three vector sizes, the deviation with respect to the predecessor values decreases towards a minimum. Depending on the threshold, this convergent behavior is reached at a different candidate count for each vector size.

2) Evaluation of "mentions" vectors with occurrences

Threshold-based clustering based upon the Euclidean Distance behaves hardly differently for input vectors consisting of "mentions" and their occurrence counts, as clearly depicted in Fig. 9 and Fig. 10. Once again, the size of the input vector, here filled with "mentions" and occurrence counts, influences the "acceptance ratio".
For the threshold of 10%, the "acceptance ratio" varies, depending on the size of the vectors, between 0 (in the single case of 10 candidates and 10 "mentions" with occurrence counts in the vector) and a high rate of 0.4. The same characteristics are also measured at the 20% threshold; however, here the highest "acceptance ratio" value is 0.3.
In comparison to the "hash tag" Euclidean Distance measurements with the same clustering threshold, the "acceptance ratio" for "mentions" does not decrease by the same coefficient. The reason for this behavior most probably lies in a more even dissemination of the relevant vector items ("mentions") in the test data set than that of the "hash tags", as in the case of Cosine Similarity for the same observation. As in the case of the "hash tags", and here even more evidently, the course of the "acceptance ratio" values deviates less as the number of candidates increases.

VI. CONCLUSION AND FUTURE WORK
Concluding the measurements, some significant observations worth outlining as results have been made. First of all, despite the very small test set, including only 100 candidates and one reference candidate, very promising insights regarding the dynamics of similarity-measure-based, threshold-driven clustering could be drawn.
Although no qualitative evaluation was made, and the "acceptance ratio" as such is clearly an inaccurate indicator for the precise distinction of the discovered "Interest Groups", it was sufficient to confirm the significance of the intention behind using a similarity-based approach for organizing and steering targeted information exchange between users with the same interests participating in social networks such as Twitter.
The results presented in the previous section show that this approach looks promising even on very small data sets, which is encouraging for future work. The choice of parameters confirmed the initial expectation that the first steps were set in the right direction. Furthermore, it made the comparison of the two approaches possible.
Details from the measurements also clearly outlined facts about the stability of the single measures. Euclidean Distance performed more stably and consistently in comparison to Cosine Similarity, at least according to the presented measurement. Some instability characteristics of Cosine Similarity can be explained by the uneven dissemination of relevant matching items across the data set, which brings this measurement even closer to realistic circumstances.
It would be too optimistic to claim that the presented approach could be the final concept towards building collaborative learning groups; however, it seems to be a small step in the right direction. Extending the measurement to more application cases and reference users from different areas would contribute a more granular view on the research issues addressed by this work.
Additionally, in order to enable a more accurate and qualitative evaluation of the clustering, single matching similarities should be considered, clustered and re-evaluated more precisely during the measurement process. Other approved similarity measures such as Pearson correlation or the Jaccard coefficient could be considered as extensions to the current experiment setup. In this way it would be possible to determine the level of quality of each single similarity method. Such an extension of the presented approach would contribute to the reliability of the initial idea. Improvements towards the preparation of a more extensive test data set are planned, with the expectation of re-confirming the results.
Nevertheless, the presented results confirm the basic intention of the current work, namely that improving organized collaboration, information recommendation and exchange in social networks is feasible and worthy of future research. They also underline the claim that such an effort is based upon realistic expectations.
Most encouraging about this approach is the awareness that current scientific technologies, methods and techniques can be used to deliver complete solutions and answers to the addressed challenges in the very near future.