Cluster Analysis of Internet Public Sentiment in Universities by a Combined Method

—A clustering method that combines the Latent Dirichlet Allocation (LDA) topic model and the VSM vector space model to compute text similarity is presented. The LDA topic model and the TF-IDF weighting strategy of the VSM are used separately to calculate text similarity, and a linear combination of the two results gives the final similarity. The k-means algorithm is then chosen for cluster analysis. This approach not only addresses the loss of deep semantic information in traditional text clustering, but also avoids the problem that LDA alone cannot distinguish texts because of excessive dimension reduction. Deep semantic information is thus mined from the text, and clustering efficiency is improved. Comparison with traditional methods shows that the algorithm improves the performance of text clustering.


Introduction
In recent years, with the development of information technology, the Internet has become widely available in universities, and the number of students who use the Internet has increased considerably [1]. Through a series of Internet public affairs, people have begun to recognize the important function of internet public sentiment. The uniqueness of well-educated college students makes them a major and indispensable component of internet public sentiment. The complicated public sentiment on the Internet poses a challenge that cannot be ignored for political and ideological work in universities. Therefore, student management departments should reinforce their work on online public opinion collection, research, and assessment, and attach importance to the control and guidance of internet public sentiment. In internet public opinion analysis, student affairs administrators need intelligent methods to find the exact information in massive information sources for deep analysis. Only by using intelligent algorithms to collect and analyze public opinion corpora automatically can an effective, comprehensive, and fast monitoring and early-warning mechanism be established.
To meet the requirements of analyzing network public opinion at colleges and universities, an online public opinion detection and analysis clustering method has been built based on LDA (Latent Dirichlet Allocation). The algorithm fuses the topic model based on LDA with the VSM model based on TF-IDF weights to compute text similarity, and cluster analysis is then carried out. In this way, deep semantic information is mined from the text and clustering efficiency is improved. Comparison with traditional methods shows that the algorithm improves the performance of text clustering.

The topic model

Latent Dirichlet Allocation
A topic can be expressed as a certain distribution over word frequencies, and a paragraph or a sentence is regarded as being generated from a probabilistic model. To measure document similarity, the most common way is to count the words that appear in both documents at the same time; the TF-IDF algorithm is one of the common methods. The deficiency of this approach is that it ignores the implicit meaning of the documents. Sometimes no word appears in both documents, yet the two documents are semantically related to each other. So the implied semantics must be considered when judging the similarity between two documents, and it is necessary to use a topic model. Latent Dirichlet Allocation is one of the most common such models.
The probability of each word in a document can be expressed as:

p(w | d) = Σ_t p(w | t) · p(t | d)    ( 1 )

This probability formula can be represented by matrices; see Figure 1. In Figure 1, the text-word matrix on the left gives the probability distribution of each word in each text, the topic-word matrix on the right gives the probability distribution of each word in each topic, and the text-topic matrix gives the probability of the appearance of each topic in each text. For a series of texts, after word segmentation, the probability of the appearance of each word in each text is calculated, so the text-word matrix on the left can be obtained. Constructing the topic model means obtaining the text-topic matrix and the topic-word matrix on the right by training and learning from the text-word matrix on the left. In the Latent Dirichlet Allocation topic model, every document in a document set can be regarded as a probabilistic mixture of the topics in a latent topic set. The structure of the topic model based on Latent Dirichlet Allocation can be described as a three-layer topology; see Figure 2.
Step 1: for a document in the document set, randomly select a topic from the corresponding topic set.
Step 2: randomly select a word from the word set corresponding to the selected topic.
Step 3: repeat these steps until all the words in the document are covered. The topic model based on Latent Dirichlet Allocation consists of three layers: the document set layer, the document layer, and the feature word layer.
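The generative view above can be sketched in code. The sizes, the topic-word matrix, and the document-topic mixture below are illustrative assumptions, not values from the paper; the sketch shows the two-step sampling and the matrix identity of formula ( 1 ), p(w | d) = Σ_t p(t | d) · p(w | t), that relates the matrices of Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: 2 topics over a 4-word vocabulary.
topic_word = np.array([[0.50, 0.40, 0.05, 0.05],   # topic 0 -> word distribution
                       [0.05, 0.05, 0.50, 0.40]])  # topic 1 -> word distribution
doc_topic = np.array([0.8, 0.2])                   # one document's topic mixture

def generate_doc(n_words):
    """Generate feature words by the two-step sampling described above."""
    words = []
    for _ in range(n_words):
        z = rng.choice(2, p=doc_topic)       # step 1: pick a topic for this word
        w = rng.choice(4, p=topic_word[z])   # step 2: pick a word from that topic
        words.append(int(w))
    return words

# The document-word distribution is the product of the document-topic
# and topic-word matrices: p(w|d) = sum_t p(t|d) * p(w|t).
doc_word = doc_topic @ topic_word
print(doc_word)   # a proper probability distribution over the 4 words
```

Multiplying the mixture vector by the topic-word matrix reproduces exactly the word distribution that the sampling loop draws from, which is the content of the matrix decomposition in Figure 1.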
The three-tier structure based on Latent Dirichlet Allocation is shown in Figure 3. In Figure 3, the parameters α and β define the document set layer of the topic model. The vector α is used to generate the vector θ, and the matrix β represents the probability distributions of words corresponding to the latent topics. α and β operate at the level of the document set, and their values only need to be set once. The random variable θ defines the document layer of the topic model: θi is the probability distribution of each latent topic in the i-th document, and it is a vector. θ is a document-level variable; each document maps to its own θ, so the probability that each document generates a topic z is different, and the value of θ also needs to be set only once per generated document. z and w are the parameters of the feature word level: w is the vector of feature words in a document, and z gives the distribution of all the feature words in the document over the topics. w is an observed variable, while z and θ are hidden variables. α and β are obtained by learning; w and z are word-level variables, with z drawn from θ and w drawn from z.
So the joint probability of Latent Dirichlet Allocation can be expressed as:

p(θ, z, w | α, β) = p(θ | α) · Π_{n=1..N} p(z_n | θ) p(w_n | z_n, β)    ( 2 )

The process of generating a document with the topic model based on Latent Dirichlet Allocation covers the following steps.
Step 1: get the scale of the feature words in the document; this can be regarded as selecting the number N of feature words.
Step 2: get the parameter θ of the topic distribution in the document, θ ~ Dirichlet(α), where α is the Dirichlet distribution parameter.
Step 3: generate all the feature words for the document. 1) Select a latent topic z; it comes from the multinomial distribution given by the topic probability vector θ. 2) Then select a feature word w; it comes from the multinomial probability distribution of the latent topic z. Following the above steps, the generating probability of the i-th feature word wi in document d can be expressed as:

p(w_i | d) = Σ_j p(w_i | z_j) · p(z_j | d)    ( 3 )

The probability that document d includes the feature word w can be expressed as:

p(w | d, α, β) = ∫ p(θ | α) Σ_z p(z | θ) p(w | z, β) dθ    ( 4 )

iJES, Vol. 6, No. 3, 2018

Then maximum likelihood estimation is used to build the three-layer model of Latent Dirichlet Allocation based on the parameters α and β.
The conditional distribution of generating document d is expressed as:

p(d | α, β) = ∫ p(θ | α) ( Π_{n=1..N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ    ( 6 )

Estimations of parameters
The Gibbs sampling algorithm is used to estimate the parameters. In the topic model based on the Gibbs sampling algorithm, α is the prior (Dirichlet) parameter of the document-topic distribution θ, and β is the prior parameter of the topic-word distributions. In the implementation of Latent Dirichlet Allocation, it is only necessary to assign topics to the words, so the variable z is the one that is sampled. The formula of the posterior probability is:

p(z_i = j | z_-i, w) ∝ (n_{-i,j}^{(wi)} + β) / (n_{-i,j}^{(·)} + Wβ) · (n_{-i,j}^{(di)} + α) / (n_{-i,·}^{(di)} + Tα)    ( 11 )

In the formula, z_-i is the topic assignment of all feature words except the current one; n_{-i,j}^{(wi)} is the number of times the feature word wi is assigned to topic j; n_{-i,j}^{(·)} is the total number of feature words assigned to topic j; n_{-i,j}^{(di)} is the number of feature words in document di assigned to topic j; n_{-i,·}^{(di)} is the total number of feature words in document di; W is the vocabulary size and T is the number of topics.
The basic process based on the Gibbs sampling algorithm is as follows.
Step 1: give each feature word a random initial topic assignment as the initial sample set.
Step 2: for every feature word, resample its topic according to the posterior probability of formula ( 11 ) and update the counts, producing the next state of the Markov chain.
Step 3: after multiple iterations, when the Markov chain converges to a stable state, estimate the topic of each feature word; the topic distribution and the topic-word distribution are estimated by the formulas below.
φ̂_j(w) = (n_j^{(w)} + β) / (n_j^{(·)} + Wβ)    ( 12 )

θ̂_d(j) = (n_j^{(d)} + α) / (n_·^{(d)} + Tα)    ( 13 )

In the formulas, n_j^{(w)} is the number of times the feature word w is assigned to topic j, n_j^{(·)} is the total number of feature words assigned to topic j, n_j^{(d)} is the number of feature words in document d assigned to topic j, and n_·^{(d)} is the total number of feature words in document d; W is the vocabulary size and T is the number of topics.
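As a sketch of the sampling procedure above, a minimal collapsed Gibbs sampler can be written as follows. The corpus, hyperparameters, and iteration count are illustrative assumptions; the resampling step implements the posterior of formula ( 11 ) (the document-side denominator is constant across topics and drops out under normalization), and the final point estimates follow formulas ( 12 ) and ( 13 ).

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=100, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative sketch).

    docs: list of lists of word ids.  Returns (theta, phi) estimates.
    """
    rng = np.random.default_rng(seed)
    # Count tables: doc-topic, topic-word, and per-topic totals.
    n_dj = np.zeros((len(docs), n_topics))
    n_jw = np.zeros((n_topics, vocab_size))
    n_j = np.zeros(n_topics)
    z = []  # current topic assignment of every word token
    for d, doc in enumerate(docs):              # step 1: random initial sample
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            n_dj[d, t] += 1; n_jw[t, w] += 1; n_j[t] += 1
    for _ in range(n_iter):                     # step 2: resample every token
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                     # remove the current assignment
                n_dj[d, t] -= 1; n_jw[t, w] -= 1; n_j[t] -= 1
                # Unnormalized posterior of formula (11).
                p = (n_jw[:, w] + beta) / (n_j + vocab_size * beta) * (n_dj[d] + alpha)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t
                n_dj[d, t] += 1; n_jw[t, w] += 1; n_j[t] += 1
    # Step 3: point estimates of formulas (12) and (13).
    phi = (n_jw + beta) / (n_j[:, None] + vocab_size * beta)
    theta = (n_dj + alpha) / (n_dj.sum(1, keepdims=True) + n_topics * alpha)
    return theta, phi

# Hypothetical word-id documents, just to exercise the sampler.
docs = [[0, 0, 1, 1], [2, 3, 2, 3]]
theta, phi = gibbs_lda(docs, n_topics=2, vocab_size=4)
```

Each row of `theta` is a document's topic distribution and each row of `phi` is a topic's word distribution, which is all the later similarity computation needs.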
Suppose that there are two texts di and dj. Based on the TF-IDF weighting strategy, their text similarity is calculated by the cosine formula below:

sim_VSM(d_i, d_j) = Σ_k w_ik · w_jk / ( √(Σ_k w_ik²) · √(Σ_k w_jk²) )    ( 14 )

where w_ik is the TF-IDF weight of the k-th feature word in document di.
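The TF-IDF similarity can be sketched in plain Python as follows. The toy documents are hypothetical, and the weighting used here (term frequency times log inverse document frequency) is a common TF-IDF variant that may differ in detail from the paper's exact weights.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF weight vectors for tokenized documents (one common weighting)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))   # document frequency
    vocab = sorted(df)
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        # tf/len(doc) is the term frequency, log(n/df) the inverse doc frequency.
        vecs.append([tf[w] / len(doc) * math.log(n / df[w]) for w in vocab])
    return vecs

def cosine_sim(a, b):
    """Cosine similarity of two weight vectors, as in the formula above."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical toy documents, already segmented into words.
docs = [["network", "opinion", "student"],
        ["network", "opinion", "campus"],
        ["weather", "rain"]]
v = tfidf_vectors(docs)
```

Documents sharing weighted terms score higher, and documents with no terms in common score exactly zero, which is the behavior the VSM side of the combined similarity relies on.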
Based on the topic vectors from Latent Dirichlet Allocation, their text similarity is calculated by the formula below:

sim_LDA(d_i, d_j) = Σ_k θ_ik · θ_jk / ( √(Σ_k θ_ik²) · √(Σ_k θ_jk²) )    ( 15 )

This research uses the Latent Dirichlet Allocation topic model and the TF-IDF weighting strategy of the VSM vector space model separately to calculate text similarity. The linear combination of the two results then gives the final text similarity, and cluster analysis based on the k-means algorithm is carried out. The formula of the linear combination is:

sim(d_i, d_j) = λ · sim_VSM(d_i, d_j) + (1 − λ) · sim_LDA(d_i, d_j)    ( 16 )

In the formula, λ is the combination coefficient. The process is illustrated in Figure 4.
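The linear combination and the subsequent k-means step can be sketched as below. Feeding each document's row of the combined similarity matrix to k-means as its feature vector is one possible realization, an assumption on our part since the paper does not spell out how the similarity enters the clustering; λ = 0.5 and the toy matrices are likewise illustrative.

```python
import numpy as np

def combined_sim(S_vsm, S_lda, lam=0.5):
    """Formula ( 16 ): sim = lam * sim_VSM + (1 - lam) * sim_LDA, elementwise."""
    return lam * S_vsm + (1 - lam) * S_lda

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means over the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each row to its nearest center, then recompute the centers.
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Hypothetical similarity matrices for 4 documents: the first two and the
# last two are similar to each other under both measures.
S_vsm = np.array([[1.0, 0.9, 0.1, 0.1],
                  [0.9, 1.0, 0.1, 0.1],
                  [0.1, 0.1, 1.0, 0.9],
                  [0.1, 0.1, 0.9, 1.0]])
S_lda = S_vsm.copy()
S = combined_sim(S_vsm, S_lda, lam=0.5)
labels = kmeans(S, k=2)   # each document's similarity row is its feature vector
```

On this toy input the two similar pairs end up in the two clusters, matching the intuition that documents with similar similarity profiles belong together.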

Experiments and analysis
The research data come from the Chinese corpus categorization database. Simulations are run separately with the Latent Dirichlet Allocation topic model, the VSM vector space model with the TF-IDF weighting strategy, and the clustering algorithm based on the linear combination of the two. The clustering quality is measured by the F-measure, which involves recall and precision.
The F-measure is defined as:

F(i, j) = 2 · P(i, j) · R(i, j) / ( P(i, j) + R(i, j) )    ( 17 )

In the formula, P is the precision and R is the recall:

P(i, j) = N_ij / N_j,  R(i, j) = N_ij / N_i    ( 18 )

N_i is the number of samples whose class is i in the original data set, N_j is the number of objects whose cluster label is j in the clustering results, and N_ij is the number of samples in the intersection of the two. The evaluation of clustering quality is usually based on the weighted average of the F values over the classes:

F = Σ_i (N_i / N) · max_j F(i, j)    ( 19 )

Comparing the three methods gives the results shown in Table 1. The test data show that the average clustering quality of the present method is 92.03%. The F value of this algorithm is improved by 14.95% over the VSM algorithm and by 13.38% over the LDA algorithm, a noticeable improvement compared with the simple VSM and LDA methods. Taking the PKU Weiming BBS forum database as an example, it is then shown that the proposed algorithm can be used for online public opinion analysis.
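The evaluation can be sketched as follows; the toy label lists are illustrative. The function computes P(i, j) = N_ij / N_j, R(i, j) = N_ij / N_i, F(i, j) = 2PR / (P + R), and the N_i/N-weighted average of the best F per class, as in the formulas above.

```python
def f_measure(true_labels, cluster_labels):
    """Weighted-average F over classes for a clustering result."""
    n = len(true_labels)
    total = 0.0
    for i in set(true_labels):
        n_i = sum(1 for t in true_labels if t == i)          # class size N_i
        best = 0.0
        for j in set(cluster_labels):
            n_j = sum(1 for c in cluster_labels if c == j)   # cluster size N_j
            n_ij = sum(1 for t, c in zip(true_labels, cluster_labels)
                       if t == i and c == j)                 # overlap N_ij
            if n_ij:
                p = n_ij / n_j                               # precision P(i, j)
                r = n_ij / n_i                               # recall R(i, j)
                best = max(best, 2 * p * r / (p + r))
        total += (n_i / n) * best                            # weight by N_i / N
    return total

# Hypothetical labels: a perfect clustering (up to label permutation)
# and a half-right one.
perfect = f_measure([0, 0, 1, 1], [1, 1, 0, 0])
mixed = f_measure([0, 0, 1, 1], [0, 1, 0, 1])
```

A perfect clustering scores 1.0 regardless of how the cluster labels are numbered, since the measure matches each class to its best cluster.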

Conclusion
The campus network is an important information resource, and the extraction of comments from it is basic work for public opinion analysis research and for college student management. The proposed algorithm, based on the topic model and the vector space model, is used to calculate the similarity grade for the clustering analysis. The experimental results show that the proposed algorithm can enhance the accuracy of the clustering algorithm.