Prototyping Text Mining and Network Analysis Tools to Support Netnographic Student Projects

Social science is witnessing tremendous growth in the data available on the Internet regarding social phenomena; however, social science students are typically not prepared for the challenges and opportunities of analysing online data. One of the areas where this growth is especially important is the social study of consumption. This article discusses a prototype of a visualisation tool intended to support the learning of netnographic analysis with computational tools.


Introduction
For social science students, an important part of applying new skills is working with online data, which provide valuable opportunities to learn more about social phenomena [1]. In fields such as the sociology of consumption, valuation studies, and marketing research, the text data of online discussions can be a fruitful source of information about people's motivations, their perceived dimensions of experience, and their evaluation practices. However, qualitative or manual quantitative content analysis of large corpora becomes very demanding in terms of time and other resources [2]. This paper presents a design concept of a visualisation tool that allows students to analyse user-generated content from a netnographic perspective. By combining activity-centred design (ACD) [3], [4] and instructional design practices, we prototype a software tool that covers the common research goals of netnographers.
By creating a visualisation tool, we aim to support the exploration of data collections using text mining and network analysis methods. Moreover, the tool aims to help students structure the exploration process and connect findings to theoretical concepts using affordances provided by the structured visualisation of data, facilitating a balance between building quantitative summaries of discussions and developing a deeper qualitative understanding of texts.
In this article, we demonstrate the prototype using the case of brand analysis of esports athletes [5].

Netnographic Approach in Studies of Consumption
The development of Internet technologies has led to the emergence of new sources of user-generated data that allow researchers to study different aspects of consumption. Internet users now have various platforms, such as blogs [6], social media [1], [6], [7], review-based websites [8]-[10], and websites of organisations and public institutions [11], to express themselves and share opinions about their experiences. In this way, user-generated content (UGC) provides researchers with data on what users consume and how they evaluate the consumption process.
Netnography [7] is a contemporary approach to studying online communities that employs mixed methods to build an understanding of users and their interactions. It allows online studies to address questions usually associated with traditional face-to-face ethnography [12], which typically examines actions and visible interactions in real-life situations and is not suited to working with massive concentrations of text data. Netnography helps to focus on cases where text and other forms of communication prevalent on online platforms are the main media of meaning. Quantitative analysis within netnography reveals patterns and highlights the most important pieces of material; in this way, it supports qualitative analysis and guides the researcher through a large corpus of data. This analysis can also provide insights into consumer behaviour in particular cases, equipping social science students with tools to bridge the gap between the theoretical models they study and applied analytics [13].

Computational Methods to Support Netnographic Analysis
Our prototype is based on tools and methods adopted from the emerging areas of computational social science and digital humanities to support netnographic analysis of large UGC datasets and provide students with a repertoire of methods and practices combined in a single interface and blended in learning design. At this stage of the project, we build on the approaches used to analyse text data on different levels, starting from simple word-frequency-based methods to topic modelling and network analysis of concepts and entities extracted from UGC.
As texts are usually the primary content of netnographic studies, computational support of text analysis is the key component of our prototype. Methods based on word-frequency analysis and derived weights and metrics (e.g., TF-IDF, log-likelihood ratio-based methods) allow us to search for the texts that are most relevant to particular queries, quickly grasp the meaning of texts, and analyse the changes in the corpora. In combination with specialised dictionaries, reflecting sentiment or particular topics, more nuanced analysis is possible.
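To make the word-frequency weighting concrete, the following is a minimal Python sketch of TF-IDF. It is not the prototype's implementation (the prototype builds on R packages); the function names, the whitespace tokeniser, the specific weighting variant (relative term frequency multiplied by the natural log of inverse document frequency), and the example posts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a real pipeline would also strip punctuation)."""
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical forum posts, invented for illustration only.
posts = [
    "great aim and great teamwork",
    "teamwork wins the match",
    "the aim was off this match",
]
scores = tf_idf(posts)
top_term = max(scores[0], key=scores[0].get)  # highest-weighted term of the first post
```

Terms that are frequent in one post but rare across the collection receive the highest weights, which is what makes the score useful for ranking texts against a query or summarising what distinguishes a text.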
However, analysing the context of discussions, which is one of the main requirements for netnographic analysis, requires more advanced methods, ranging from n-grams (commonly used token collocations of length n) to topic models and advanced deep learning models.
In our example case, we derived bigrams from the text and applied a log-likelihood ratio-based comparison to the spectators' discussions of players before and after team changes.
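A comparison of this kind can be sketched as follows. The source does not state which log-likelihood variant was used, so this Python sketch assumes one common two-term formulation for corpus comparison, G2 = 2 * (a * ln(a / E_a) + b * ln(b / E_b)), where a and b are the observed counts of a bigram in the two corpora and E_a, E_b are the counts expected if the bigram were equally frequent in both. All names and the example texts are hypothetical.

```python
import math
from collections import Counter


def bigrams(text):
    """Return adjacent token pairs from a whitespace-tokenised, lowercased text."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def log_likelihood(a, b, total_a, total_b):
    """G2 for an item observed a times in corpus A and b times in corpus B."""
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:  # the term 0 * ln(0) is treated as 0
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2


def compare_corpora(texts_a, texts_b):
    """Rank bigrams by how strongly their frequency differs between two corpora."""
    counts_a = Counter(bg for t in texts_a for bg in bigrams(t))
    counts_b = Counter(bg for t in texts_b for bg in bigrams(t))
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    rows = []
    for bg in set(counts_a) | set(counts_b):
        a, b = counts_a[bg], counts_b[bg]
        rows.append((bg, a, b, log_likelihood(a, b, total_a, total_b)))
    return sorted(rows, key=lambda r: r[3], reverse=True)


# Hypothetical discussions before and after a team change.
before = ["solid support player", "solid support player again"]
after = ["new team new role", "new team better results"]
ranking = compare_corpora(before, after)
```

The statistic itself is unsigned, so each row keeps both raw counts; the direction of the difference (more typical before or after the change) is read from those counts.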
Topic modelling [14], [15] allows researchers to find themes that show consumers' reflections on their experiences. It is a machine-learning-based method of fuzzy biclustering of words (n-grams) and documents into groups called topics. When applying this method, researchers ignore all relationships between words in the text, e.g., adjacency, the position of the words in the text, and lexical meaning. The text becomes a bag-of-words that only provides the model with information about the co-occurrence of words. Considering the frequency of words and their co-occurrence, the model creates topics, each represented as a list of unique words ordered by their probability of belonging to that topic. The role of the researcher in topic modelling is twofold. First, the researcher defines the number of topics that should be created during topic model computation, relying on diagnostic metrics [16], [17] to make this decision. Second, the researcher interprets the groups of words in an attempt to reveal the themes. By using the list of most probable words and examples of texts with the highest proportion of a topic, the researcher can label each topic. While other advanced methods of text mining exist, showing superior performance on many natural language processing tasks, topic modelling balances performance and interpretability, making it a suitable and widespread instrument for social science goals [18].
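The bag-of-words assumption described above can be illustrated with a short Python sketch. It only shows the input representation a topic model receives, not the model itself; the function names and example sentences are hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Reduce a text to word counts, discarding order and position."""
    return Counter(text.lower().split())


def document_term_matrix(docs):
    """Return (vocabulary, matrix), where matrix[d][v] is the count of word v in document d."""
    bags = [bag_of_words(d) for d in docs]
    vocabulary = sorted(set().union(*bags))
    matrix = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, matrix


# Two hypothetical sentences with different meanings but identical word counts.
doc_a = "the player left the team"
doc_b = "the team left the player"
same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)

vocabulary, matrix = document_term_matrix([doc_a, doc_b, "the team won"])
```

Because the two sentences yield the same row in the document-term matrix, a model fitted on this representation cannot distinguish them; it sees only which words occur together and how often.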
Another approach used in the prototype is network analysis [18] of relationships between the entities mentioned in the text. Network analysis is a fruitful method of analysing relationships between tokens in the texts. What can the co-occurrence of tokens in the text tell us about community practices? In discussions of brands, community members compare athletes and mention their names in the posts, allowing us to visualise the network of brands' co-mentioning. In addition, networks are useful when scholars are focused on understanding similarities between compared brands and the contexts of discussion that mention two brands, i.e., what users write in the forums when mentioning two brands. For example, is co-mentioning a case of finding dissimilar brands or stating that one brand is better than another?
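A co-mention network of this kind can be derived from posts with a few lines of code. The Python sketch below is an illustration under simplifying assumptions: entities are matched as exact lowercase tokens (real athlete names would need proper entity extraction), and the entity list, posts, and function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_mention_edges(posts, entities):
    """Count, for each unordered pair of entities, the posts that mention both."""
    edges = Counter()
    for post in posts:
        tokens = set(post.lower().split())
        # Sorting gives each pair a canonical order, so (a, b) and (b, a) are one edge.
        mentioned = sorted(e for e in entities if e.lower() in tokens)
        for pair in combinations(mentioned, 2):
            edges[pair] += 1
    return edges


# Hypothetical athlete names and forum posts.
entities = ["alpha", "bravo", "charlie"]
posts = [
    "alpha plays better than bravo",
    "bravo and alpha should team up",
    "charlie reminds me of alpha",
    "bravo had a good game",
]
edges = co_mention_edges(posts, entities)
```

The resulting edge weights give the network structure; keeping the posts behind each edge alongside the counts would let a researcher read the contexts in which two brands are compared.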

Design Methods
There are already solutions to help scholars to code and visualise ethnographic data (e.g., [19]); however, those tools are less suitable for supporting the analysis of large bodies of text data, which is an important task of netnographic studies. Our approach implies rapid prototype design and development, building on existing open-source software components in the R ecosystem [20] and modern visualisation and interactive presentation tools.
In the spirit of activity-centred design, we use the Jobs-to-be-done framework [21] to focus on representing users' needs in a wider netnographic analysis context. Afterwards, we map netnographers' activities to learning goals to give students the competences to perform these activities with computational tool support.
Because the focus is on building a tool that supports learning real research activities, we base the prototype on available algorithms, methods, and software instruments belonging to the R project ecosystem [20].
This approach allows for the gradual removal of tool-provided scaffolding for those students progressing to a deeper technical level of expertise and fast integration of the new methods and activities.
There are existing approaches to support exploration for some of the tasks. For example, LDAvis is a tool for exploring topic modelling results, which introduces a novel metric for presenting words connected with topics and supports model tuning [22]. However, such tools are task-focused rather than activity-focused, which makes them less relevant for supporting both research and learning goals. Instead, we use computational instruments to perform computational text analysis [23] and network analysis [24], and we rely heavily on visualisations and scaffolds in the form of applied recipes, parameters, and connections between different instruments and stages of analysis.
For visualisations, we use ggplot2 ecosystem packages [25], [26], which are built on the grammar of graphics approach. As the framework for the application, we use the flexdashboard [27] package and an interactive web-service package called Shiny [28].
We believe that our approach helps to embrace a more holistic perspective in our design, interconnecting preprocessing, interpretation, and results construction parts of computationally supported netnographic analysis.

Activities and Learning Tasks
Based on the literature review and syllabi analysis of related courses and projects, we attempted to extract the main activities that netnographers perform when working with collected text data (Table 1). Then, we connected these activities with learning goals extracted from the course and project syllabi and mapped the supporting computational technologies and related concepts. By using the four-component instructional design approach [29], [30], we construct a sequence of learning tasks built around authentic netnographic activities supported by part-task practice and procedural and supporting information. In our case, the task sequence starts from a focus on instrument-related skills with a high level of scaffolding. Then, the sequence progresses to core tasks focusing on the interpretation of project data in a netnographic sense, with supporting information communicating details of netnographic approaches and procedural information communicating necessary details about tool applications. Lastly, the focus of the tasks switches to higher-level theoretical, methodological, and ethical problems. While core methodological assumptions of the approach are introduced in [7] and further discussed in multiple studies [31], the challenging tasks for advanced students can involve reflections on the traits of big data; these tasks are based on [32] and structured around concepts such as algorithmic confoundedness and system, usage, and population drifts.

Design and Prototype
We build a prototype with dashboard-based representations, mapping each representation to the supported activity. In this section, we present the prototypes for two activities: analysing themes in the text (see Fig. 1) and analysing relationships between entities in text.

Themes in text
When working with a large body of text, the vital task for the netnographer is to correctly interpret what participants discuss on social media. This task includes an understanding of the main discussion topics, an evaluation of the topics' prevalence, and a comparison of discussions (e.g., a discussion at several points in time or discussions in several communities). As mentioned previously, topic models produce two kinds of output: the distribution of topics by their proportion in each text and the distribution of words by their probability within each topic. By being able to select the most probable words and the texts with the prevalent topic of interest, netnographers can interpret topics.
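The two outputs can be thought of as two matrices, and the selection step as sorting their rows and columns. The Python sketch below assumes a fitted model is already available and uses small hand-made matrices in its place; the variable names, vocabulary, and probabilities are hypothetical and are not results from the example case.

```python
def top_words(topic_word_probs, vocabulary, topic, n=3):
    """Return the n most probable words for one topic."""
    row = topic_word_probs[topic]
    ranked = sorted(range(len(vocabulary)), key=lambda i: row[i], reverse=True)
    return [vocabulary[i] for i in ranked[:n]]


def top_documents(doc_topic_props, topic, n=2):
    """Return the indices of the n documents with the highest proportion of one topic."""
    ranked = sorted(
        range(len(doc_topic_props)),
        key=lambda d: doc_topic_props[d][topic],
        reverse=True,
    )
    return ranked[:n]


# Hypothetical model output: 2 topics, 5 words, 3 documents.
vocabulary = ["aim", "clutch", "transfer", "salary", "roster"]
topic_word_probs = [
    [0.45, 0.35, 0.05, 0.05, 0.10],  # topic 0: word probabilities sum to 1
    [0.05, 0.05, 0.40, 0.20, 0.30],  # topic 1
]
doc_topic_props = [
    [0.9, 0.1],  # document 0 is mostly topic 0
    [0.2, 0.8],
    [0.6, 0.4],
]

words_topic_1 = top_words(topic_word_probs, vocabulary, topic=1)
docs_topic_1 = top_documents(doc_topic_props, topic=1)
```

Reading the top words together with the top documents is what supports labelling: here the second topic might be labelled something like "team changes", and the selected documents would be the texts to read closely.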
In interpreting topics, we can rely on different simple probability-based metrics or more complex ones, such as FREX (FRequency + EXclusivity) [23]. The FREX score combines a word's frequency within a topic with its exclusivity, i.e., the degree to which the word is associated with that topic rather than with the others. The interpreter should rely on multiple scores and their comparisons in the process of interpretation.
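As an illustration of how frequency and exclusivity can be combined, the Python sketch below uses one commonly cited formulation: the weighted harmonic mean of a word's within-topic rank (empirical CDF) on exclusivity and on frequency. Whether this matches the exact computation behind [23] is an assumption, as are the weight w = 0.5, the function names, and the example probabilities.

```python
def ecdf(values):
    """Empirical CDF evaluated at each value: the share of values less than or equal to it."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


def frex(topic_word_probs, topic, w=0.5):
    """FREX-style score for every word in one topic.

    Assumes every word has non-zero probability in at least one topic.
    """
    n_words = len(topic_word_probs[0])
    row = topic_word_probs[topic]
    # Exclusivity: the topic's share of the word's probability across all topics.
    exclusivity = [
        row[v] / sum(t[v] for t in topic_word_probs) for v in range(n_words)
    ]
    excl_rank = ecdf(exclusivity)
    freq_rank = ecdf(row)
    # Weighted harmonic mean of the two ranks.
    return [1.0 / (w / e + (1.0 - w) / f) for e, f in zip(excl_rank, freq_rank)]


# Hypothetical model output: "the" is frequent in both topics, "aim" mostly in topic 0.
vocabulary = ["the", "aim", "clutch", "transfer"]
topic_word_probs = [
    [0.50, 0.30, 0.15, 0.05],  # topic 0
    [0.50, 0.05, 0.05, 0.40],  # topic 1
]

scores = frex(topic_word_probs, topic=0)
best_by_frex = vocabulary[scores.index(max(scores))]
best_by_prob = vocabulary[topic_word_probs[0].index(max(topic_word_probs[0]))]
```

In this toy example the most probable word of the first topic is the uninformative "the", whereas the FREX-style score promotes "aim", which is both frequent in the topic and rare elsewhere. Comparing the two rankings is one way to follow the advice to rely on multiple scores.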
Once the topic model is calculated and topics are labelled, researchers can use topic proportions to apply the methods of statistical inference and qualitative analysis of web community content. In the spirit of netnography [7], [33], it is possible to use the topic model as a snapshot of discussions that reveals key themes and highlights the most interesting texts. Based on this approach, researchers treat topics as frames of discussions [2], and texts with the highest proportion of a particular topic can demonstrate in what context consumers mention particular themes and how texts on this theme can vary. Moreover, netnographers who deal with several corpora of texts can quickly compare their content by examining the prevalence of tokens in each corpus.

Relationships between entities of interest in texts
Visualisation of networks allows researchers to learn more about how entities under study are connected. Relying on information about connections between entities in texts helps researchers understand who or what is mentioned together in discussions. By building on this baseline structure, netnographers can start exploring attribute-based properties of the network and mapping attributes to visualisations. The distribution of attributes in the whole network, or in the personal networks of particular entities, can be summarised with a radar chart, allowing us to make comparisons and build profiles.
Whereas personal networks in traditional social research are most often self-reported, in netnographic studies they can be reconstructed from the discussions. In our example case, we represent the community-perceived connections between a particular athlete and the other athletes with whom they are associated in texts, and analyse which attributes they have in common. This suggests possible explanations for the comparisons, allows the researcher to reason about the mechanisms behind them, and supports this reasoning with relevant examples.
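The reconstruction of a personal network and its attribute profile can be sketched in Python as follows. This is an illustration only: the edge weights, athlete names, the "role" attribute, and the function names are hypothetical, and the weighted share computed here is one simple choice of summary that could feed a radar chart.

```python
from collections import Counter


def ego_alters(edges, ego):
    """Return {alter: weight} for all entities co-mentioned with the ego."""
    alters = {}
    for (a, b), weight in edges.items():
        if a == ego:
            alters[b] = alters.get(b, 0) + weight
        elif b == ego:
            alters[a] = alters.get(a, 0) + weight
    return alters


def attribute_profile(alters, attributes, key):
    """Co-mention-weighted share of each value of attribute `key` among the alters."""
    totals = Counter()
    for alter, weight in alters.items():
        totals[attributes[alter][key]] += weight
    grand_total = sum(totals.values())
    return {value: count / grand_total for value, count in totals.items()}


# Hypothetical co-mention counts between athletes and one hypothetical attribute.
edges = {
    ("alpha", "bravo"): 3,
    ("alpha", "charlie"): 1,
    ("bravo", "delta"): 2,
}
attributes = {
    "alpha": {"role": "sniper"},
    "bravo": {"role": "sniper"},
    "charlie": {"role": "support"},
    "delta": {"role": "support"},
}

alters = ego_alters(edges, "alpha")
profile = attribute_profile(alters, attributes, "role")
```

Here three quarters of the co-mentions of the focal athlete involve athletes with the same role, which would suggest that the community mostly compares this athlete with direct counterparts. Computing such a profile for several attributes yields the axes of a radar chart.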

Conclusion and Future Work
While we describe only the first results of the proposed approach, some directions for software support are already clear: improving the program's ease of use and automating typical operations will help lower the entry barrier for undergraduate students.
When supporting progression to more professional use of the computational instruments, resulting recipes and practices can be detached from the current interface and be reused, e.g., for integration with existing high-level tools such as jamovi [34] or R-QDA [35].
Further challenges are associated with investing in the usability of big text data exploration, focusing on aspects beyond the scope of traditional-scale content analysis [36] (e.g., ranking cases based on transparent interest measures), with visualising the uncertainty of quantitative estimates [37], and with plot layouts that minimise the algorithmic confoundedness of explanations produced with the help of computational tools [38]. Current developments in computational data analysis workflow research and the focus on implementing visualisation techniques in education [42], [43], together with the goals behind these approaches, are relevant to enhancing the reproducibility, transparency, and collaboration of netnographic analysis among students. Lastly, recent developments in the interpretability of word-embedding-based [44] and aspect-extraction-oriented NLP methods [45] build a foundation for further integration with computationally supported netnographic research.