Partitioned-based Fuzzy Clustering to Learn Documents' Triadic Similarity

Abstract: With the development of the Web and the high availability of storage space, more and more documents are becoming accessible. As a result, similarity learning suffers from a scalability problem in both memory use and computation time when the data set is large. This paper provides a fuzzy triadic similarity measure to calculate memberships in the context of document co-clustering. It simultaneously computes fuzzy co-similarity matrices between documents/sentences and sentences/words, each one built on the basis of the others. The proposed model is extended to tackle large data sets through a splitting architecture, in which the new fuzzy triadic similarity is used to distribute both memory use and computation across distributed computers. This architecture is based on fuzzy clustering, which partitions data sets into similar groups (or clusters) in order to create more coherent sub-sets.


I. INTRODUCTION
With the increasing number of available documents, the processing efficiency and scalability of systems and their underlying computations become a major concern. For economic and operational reasons, it is often preferable not to execute the computations on a single machine. One of the major problems is similarity computation over such large volumes of data. Several methods dealing with this task are referred to as co-clustering approaches and have been extensively studied. A co-similarity measure called X-Sim has been proposed [1], [2]; it builds on the idea of iteratively generating the similarity matrices between documents and words. This measure works well for unsupervised document clustering.
However, in recent research, the sentence has been considered as a more informative feature term for improving the effectiveness of document clustering [3]. By considering the three levels Documents × Sentences × Words to represent the data set, we are able to deal with the dependencies between them. This is done through the computation of weights based on statistical models. However, this has given rise to the view that classical probability theory is unable to deal with uncertainties in natural language and machine learning.
We proceed with a fuzzification control process which converts crisp similarities to fuzzy ones. The conversion to fuzzy values is represented by membership functions [4]. The resulting fuzzy similarity matrices are used to calculate the fuzzy similarity between documents, sentences and words in a triadic computation called FT-Sim (Fuzzy Triadic Similarity). Several extensions to the co-clustering methods have been proposed to deal with such multi-view data. Some works aim at combining multiple similarity matrices to perform a given learning task [5], [6], [7], the idea being to build clusters from multiple similarity matrices computed along different views.
To combine multiple occurrences of FT-Sim, we can adopt sequential, merging or splitting-based parallel architectures. In this work, we are interested in the splitting-based mode. In this case, there are no methods to construct smaller data sets, and a random strategy is often used. It is for that reason that we propose a strategy to split a given data set which is considered as huge into smaller ones using the Fuzzy C-Means (FCM) clustering [8].
The rest of this paper is organized as follows: in section 2 we highlight the background related to textual data co-clustering. In section 3 we discuss some previous work and present our motivations. In section 4 we provide a detailed description of our fuzzy triadic similarity. In section 5 we present our splitting-based model based on fuzzy clustering. The corresponding parallel architecture is described in section 6. In section 7 we conclude the paper and give some indications about further research.

II. BACKGROUNDS
The purpose of clustering is generally to organize a set of objects according to similarity criteria, in order to discover the structure underlying them. Its objective is to group these objects into homogeneous classes, so that the similarity of a pair of objects belonging to the same class is maximized, whereas that of a pair of objects belonging to two different classes is minimized.
Before document clustering, a cleaning procedure is executed on all documents. Several researchers consider the word to be the main unit characterizing a document. The preprocessing step aims at reducing noise and transforming the data into an appropriate format, while extracting the most representative terms in the analyzed corpora.
First, all non-word tokens are stripped off. Second, the text is parsed into words. Third, all stop words are identified and removed [9], [10]. The obtained data must then be indexed. Several techniques have been proposed to index documents. The most commonly used one is the Vector Space Model (VSM) [11] and its graphical representation as a k-partite graph [12].
A second aspect to consider is the weighting of terms (or words). Various techniques have been proposed, such as the binary-valued weighting scheme, which indicates the presence or absence of the word in the document, or real-valued schemes, which indicate the importance of the word in the document. Several models have been proposed for computing real-valued weights, such as tf-idf [13], term distribution [14], or simply the number of occurrences of a word in a document.
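As a minimal sketch of the weighting schemes mentioned above, the following Python computes binary, raw-count and tf-idf weights for a toy corpus. The tf-idf variant used here (raw count multiplied by log(N / df)) is one common formulation and is an assumption, since the text does not state which variant is meant; the function name `term_weights` and the toy documents are purely illustrative.

```python
import math
from collections import Counter

def term_weights(docs):
    """Return (binary, counts, tfidf): three lists of dicts keyed by word, one dict per document."""
    counts = [Counter(doc.split()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents in which each word appears.
    df = Counter(word for c in counts for word in c)
    binary = [{w: 1 for w in c} for c in counts]
    # Assumed tf-idf variant: raw term count times log(N / df).
    tfidf = [{w: tf * math.log(n_docs / df[w]) for w, tf in c.items()} for c in counts]
    return binary, [dict(c) for c in counts], tfidf

docs = [
    "fuzzy clustering of documents",
    "fuzzy similarity of sentences",
    "parallel clustering",
]
binary, counts, tfidf = term_weights(docs)
```

A word that occurs in every document receives a tf-idf weight of zero under this variant, which is why rarer words such as "parallel" end up with the largest weights in the toy corpus.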
Another point of view is to represent a document as a collection of sentences. To achieve more accurate document clustering, a more informative feature, the sentence, has been considered in recent research work. A sentence of a document is an ordered sequence of one or more words [3].
Bigrams and trigrams [15] are commonly used to extract and identify meaningful sentences in statistical natural language processing. In [16], a method is presented that uses a suffix array to compute the word and document frequencies of all substrings (sentences) in large document corpora. In [17], the authors propose a sentence-based document index model, the Document Index Graph (DIG), which allows the incremental construction of a sentence-based index for a set of documents. The quality of Web document clustering based on this model surpassed that of the traditional VSM-based approach.
Classically, data are described as a set of instances characterized by a set of features. For example, when using the VSM, text corpuses are represented by a matrix whose rows represent document vectors and whose columns represent the word vectors. The similarity between two documents obviously depends on the similarity between the words they contain and vice-versa.
The purpose of co-clustering is to take into account this duality to identify the relevant clusters [1].
Consequently, the concept of higher-order co-occurrences has been investigated, and an algorithm called X-Sim [18] was recently introduced. It exploits the duality between words and documents in a document corpus as well as their respective higher-order co-occurrences. While most researchers have focused on directly co-clustering the data, X-Sim [18] builds two similarity matrices, one for the rows and one for the columns, each being built iteratively on the basis of the other.
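The alternation described above, in which a row similarity matrix and a column similarity matrix are each rebuilt from the other, can be sketched as follows. This is a simplified illustration in the spirit of X-Sim and not the published formulation: the normalisation by the product of row sums, the simultaneous update of both matrices and all function names are assumptions of this sketch.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def normalised_similarity(data, sim):
    """data: n x p matrix; sim: p x p similarity between its columns.
    Returns the n x n similarity between rows, scaled by the row sums (assumed normalisation)."""
    raw = matmul(matmul(data, sim), transpose(data))
    sums = [sum(row) for row in data]
    n = len(data)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                out[i][j] = 1.0  # an object is fully similar to itself
            elif sums[i] and sums[j]:
                out[i][j] = raw[i][j] / (sums[i] * sums[j])
    return out

def alternating_similarity(M, iterations=3):
    """M: documents x words matrix. Returns (SR, SC), the row and column similarity matrices."""
    n_docs, n_words = len(M), len(M[0])
    SR = [[float(i == j) for j in range(n_docs)] for i in range(n_docs)]
    SC = [[float(i == j) for j in range(n_words)] for i in range(n_words)]
    for _ in range(iterations):
        new_SR = normalised_similarity(M, SC)             # documents via words
        new_SC = normalised_similarity(transpose(M), SR)  # words via documents
        SR, SC = new_SR, new_SC
    return SR, SC

M = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1]]
SR1, _ = alternating_similarity(M, iterations=1)
SR2, _ = alternating_similarity(M, iterations=2)
```

In this toy matrix the first and third documents share no word, so their similarity is zero after one iteration; it becomes positive at the second iteration because their words have meanwhile become similar through the second document, which is the higher-order co-occurrence effect the text refers to.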
Moreover, with the development of the Web and the high availability of the storage spaces, more and more documents become accessible. Data can be provided from multiple sites and can be seen as a collection of matrices. By separately processing these matrices, we get a huge loss of information.
Several extensions to the co-clustering methods have been proposed to deal with such multi-view data. Some works aim at combining multiple similarity matrices to perform a given learning task [5], [6], the idea being to build clusters from multiple similarity matrices computed along different views.
Multi-view co-clustering architectures such as MV-Sim [7], based on the X-Sim measure [18], deal with the problem of learning co-similarities from a collection of matrices describing interrelated types of objects. It was shown that this architecture provides interesting properties in terms of both convergence and scalability, and that it allows an efficient parallelization of the process.

III. DISCUSSION
In traditional document models such as the VSM, words or characters are considered to be the basic terms in statistical feature analysis and extraction. The statistical features of all words are taken into account in the term weights (usually tf-idf) and similarity measures, whereas the sequence order of words is rarely considered in clustering approaches based on the VSM.
The motivation in this paper is that we believe that document clustering should be based not only on single word analysis, but on analysis of the sentence as well. Sentence-based analysis means that the similarity between documents should be based on matching sentences rather than single words only. Sentences contain more information than single words (information regarding proximity and order of words) and have a higher descriptive power. Thus a document must be broken down into a set of sentences, and a sentence is broken down into a set of words. We focus our work on how to combine the advantages of two representation models in document co-clustering. As a result, each document is represented as a vector of sentences, and each sentence is represented as a vector of words.
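To make the two-level representation concrete, the sketch below builds a Sentences by Documents count matrix and a Words by Sentences count matrix from a toy corpus. It assumes that documents have already been segmented into sentences and that whitespace tokenisation is sufficient; the function name `build_matrices` and the example sentences are hypothetical.

```python
def build_matrices(docs):
    """docs: list of documents, each given as a list of sentence strings.
    Returns (sentences, words, SD, WS), where SD is sentences x documents
    and WS is words x sentences, both holding occurrence counts."""
    sentences = sorted({s for doc in docs for s in doc})
    words = sorted({w for s in sentences for w in s.split()})
    # SD[j][i]: number of occurrences of sentence j in document i.
    SD = [[doc.count(s) for doc in docs] for s in sentences]
    # WS[k][j]: number of occurrences of word k in sentence j.
    WS = [[s.split().count(w) for s in sentences] for w in words]
    return sentences, words, SD, WS

docs = [
    ["fuzzy sets model uncertainty", "documents contain sentences"],
    ["documents contain sentences", "sentences contain words"],
]
sentences, words, SD, WS = build_matrices(docs)
```

Each column of `SD` is the sentence vector of one document and each column of `WS` is the word vector of one sentence, which mirrors the vector-of-sentences and vector-of-words view described above.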
Most of the models use statistical approaches or probabilistic methods to model the membership of sentences (or words) in the documents. The question that arises is: are probabilistic methods and statistical techniques the best available tools for solving problems involving uncertainty?
This question is often answered negatively, especially by computer scientists and engineers. These respondents are motivated by the view that probability is inadequate for dealing with certain kinds of uncertainty.
In [19], it has been claimed that probability lacks sufficient expressiveness to deal with uncertainty in natural language. In contrast, fuzzy set theory prescribes a calculus for the treatment of uncertainty associated with classification. We propose to apply this theory to the computation of memberships. To determine the membership of a sentence (resp. word) in a document (resp. in a sentence), it is necessary to take into account the size of the document (number of sentences or words) and the frequency of appearance of the sentence (or of the word) in the document, so as to ensure coherence between the size and the number of occurrences.
However, with the development of the Web and the high availability of the storage spaces, more and more documents become accessible. Data can be provided from multiple sites and can be seen as a collection of matrices. By separately processing these matrices, we get a huge loss of information.
We provide a splitting-based model for FT-Sim to tackle the problem of learning similarities from a collection of matrices. For multi-source or large matrices, we propose a parallel architecture in which each FT-Sim is the basic component or node we will use to deal with multiple matrices.
Thus, we consider a model in which data sets are distributed into N sites (or relation matrices). They describe the connections between documents for each local data set.
Our goal is then to compute a fuzzy Documents × Documents similarity matrix that takes into account all the representative information expressed in the relations.

A. Assumptions and Notations
The following assumptions and notations are used in developing the proposed model:
SD: Sentences × Documents matrix of size $J$ (sentences) by $I$ (documents). It represents the number of occurrences of the $j$-th sentence in the $i$-th document.
WS: Words × Sentences matrix of size $K$ (words or terms) by $J$ (sentences). It represents the number of occurrences of the $k$-th word in the $j$-th sentence.
Fuzzy Sentences × Documents matrix of size $J$ (sentences) by $I$ (documents). It represents the membership degrees associated with the $j$-th sentence with respect to the $i$-th document.
Fuzzy Words × Sentences matrix of size $K$ (words) by $J$ (sentences). It represents the membership degrees associated with the $k$-th word with respect to the $j$-th sentence.
To represent our textual data set, two representations have been proposed: the collection of matrices and the k-partite graph [12]. In the first, each matrix describes a view on the data. In the second, a graph is said to be k-partite when the nodes are partitioned into k subsets with the condition that no two nodes of the same subset are adjacent. Thus in the k-partite graph paradigm [12], a given subset of nodes contains the instances of one type of objects, and a link between two nodes of different subsets represents the relation between these two nodes.
From a functional point of view, the proposed FT-Sim model can be represented as shown in figure 1, where SD and WS are two data matrices representing a corpus and describing the connections between Documents/Sentences and Sentences/Words, as given by the three-partite graph [20].
The $\mathrm{Sim}_D$ matrix provides a fuzzy similarity between the documents of the corpus, $\mathrm{Sim}_S$ provides that between the sentences of the corpus, and $\mathrm{Sim}_W$ provides that between the words of the corpus. $\mathrm{Sim}_D$, $\mathrm{Sim}_S$ and $\mathrm{Sim}_W$ are initialized with the identity matrix at the first iteration. At each iteration, $\mathrm{Sim}_D$ is updated taking into account the similarity provided by $\mathrm{Sim}_S$; $\mathrm{Sim}_S$ is updated taking into account the similarity provided by $\mathrm{Sim}_D$ and $\mathrm{Sim}_W$; and $\mathrm{Sim}_W$ is updated taking into account the similarity provided by $\mathrm{Sim}_S$.

B. Fuzzification Controller Process
Let us consider the similarity matrices between sentences and documents (resp. words and sentences). Once these values are determined, we proceed to a fuzzification process, which converts crisp values to fuzzy ones. The conversion to fuzzy values is represented by membership functions [4], which allow a graphical representation of a fuzzy set. The x axis represents the universe of discourse (number of occurrences of sentences or words), whereas the y axis represents the membership degrees in the [0,1] interval.
For each document, we define a fuzzy membership function through a linear transformation between the lower bound value, which is assigned a membership of 0, and the upper bound value, which is assigned a membership of 1. With this function, membership increases linearly from the smaller values to the larger values for a positive slope, and decreases for a negative slope.
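The linear transformation described above can be sketched as follows, using the positive-slope form (x - lower) / (upper - lower) clipped to [0, 1]. Taking the bounds as the minimum and maximum count of each document column is an assumption of this sketch, as are the handling of a degenerate range and the names `linear_membership` and `fuzzify_columns`.

```python
def linear_membership(x, lower, upper):
    """Linear membership: 0 at `lower`, 1 at `upper`, clipped to [0, 1]."""
    if upper == lower:
        # Degenerate range (assumption): full membership for any positive count.
        return 1.0 if x > 0 else 0.0
    return min(1.0, max(0.0, (x - lower) / (upper - lower)))

def fuzzify_columns(matrix):
    """Fuzzify a count matrix column by column (one column per document)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    fuzzy = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_cols):
        column = [matrix[j][i] for j in range(n_rows)]
        lower, upper = min(column), max(column)  # assumed choice of bounds
        for j in range(n_rows):
            fuzzy[j][i] = linear_membership(matrix[j][i], lower, upper)
    return fuzzy

# Toy Sentences x Documents count matrix (3 sentences, 2 documents).
SD = [[2, 0],
      [1, 3],
      [0, 1]]
FSD = fuzzify_columns(SD)
```

Because the bounds are taken per column, the same raw count can yield different membership degrees in documents of different sizes, which is the coherence between size and number of occurrences that the fuzzification is meant to provide.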
These functions are formulated for the Sentences × Documents matrix and for the Words × Sentences matrix, and applying them yields the corresponding fuzzy matrices. Before proceeding to the fuzzy triadic computation, we must initialize the Documents × Documents, Sentences × Sentences and Words × Words matrices with identity matrices, denoted $\mathrm{Sim}_D^{(0)}$, $\mathrm{Sim}_S^{(0)}$ and $\mathrm{Sim}_W^{(0)}$. The similarity between identical documents (resp. sentences and words) is equal to 1, and all other values are initialized to zero. Usually, the similarity measure between two documents $D_i$ and $D_j$ is defined as a function that is the sum of the similarities between shared sentences.
Our idea is to generalize this function in order to take into account the intersection between all the possible pairs of sentences occurring in documents $D_i$ and $D_j$. In this way, we capture not only the fuzzy similarity of their common sentences but also that coming from sentences that are not directly common to the two documents but are shared with some other documents. For each pair of sentences not directly shared by the documents, we need to take into account the fuzzy similarity between them as provided by $\mathrm{Sim}_S^{(t-1)}$.
Since we work with fuzzy matrices formed by membership degrees, the computation should be carried out with the operators for fuzzy sets, especially intersection and union. Each entry $\mathrm{Sim}_D^{(t)}(i,j)$, except in the case $i = j$, is formulated with these operators. As for the computation of $\mathrm{Sim}_D^{(t)}$, we generalize the fuzzy similarities in order to take into account the intersection between all the possible pairs of words occurring in sentences $S_l$ and $S_m$. In this way, we capture not only the fuzzy similarity of their common words but also that coming from words that are not directly common to the two sentences but are shared with some other sentences. For each pair of words not directly shared by the sentences, we need to take into account the fuzzy similarity between them as provided by $\mathrm{Sim}_W^{(t-1)}$.
The overall fuzzy similarity between documents $D_i$ and $D_j$ is then defined from these fuzzy similarities.

As shown by algorithm 1, the proposed fuzzy triadic algorithm is based on an iterative approach, in which each iteration t consists in evaluating the similarities according to the documents/sentences/words three-partite graph.
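The iterative scheme can be illustrated with the sketch below. It is not the authors' algorithm 1: the max-min composition used as fuzzy union and intersection, the combination of the document view and the word view of sentence similarity by a maximum, and all names (`compose`, `ft_sim`, `sim_D`, `sim_S`, `sim_W`) are assumptions chosen for illustration. What it does follow from the text is the identity initialisation, the use of the matrices from iteration t-1, and the unit diagonal.

```python
def compose(memberships, sim):
    """Max-min composition (assumed fuzzy operators).
    memberships[a][x]: membership degree linking element a to object x.
    sim[a][b]: current similarity between elements a and b.
    Returns out[x][y] = max over (a, b) of min(memberships[a][x], sim[a][b], memberships[b][y])."""
    n_elems, n_objs = len(memberships), len(memberships[0])
    out = [[0.0] * n_objs for _ in range(n_objs)]
    for x in range(n_objs):
        for y in range(n_objs):
            if x == y:
                out[x][y] = 1.0  # the case i = j is fixed to 1
                continue
            out[x][y] = max(
                min(memberships[a][x], sim[a][b], memberships[b][y])
                for a in range(n_elems) for b in range(n_elems)
            )
    return out

def transpose(m):
    return [list(row) for row in zip(*m)]

def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

def ft_sim(FSD, FWS, iterations=2):
    """FSD: fuzzy sentences x documents matrix; FWS: fuzzy words x sentences matrix."""
    n_sent, n_docs, n_words = len(FSD), len(FSD[0]), len(FWS)
    sim_D, sim_S, sim_W = identity(n_docs), identity(n_sent), identity(n_words)
    for _ in range(iterations):
        new_D = compose(FSD, sim_S)                # documents via sentences
        via_docs = compose(transpose(FSD), sim_D)  # sentences via documents
        via_words = compose(FWS, sim_W)            # sentences via words
        # Assumed combination of the two views: element-wise maximum (fuzzy union).
        new_S = [[max(p, q) for p, q in zip(r1, r2)] for r1, r2 in zip(via_docs, via_words)]
        new_W = compose(transpose(FWS), sim_S)     # words via sentences
        sim_D, sim_S, sim_W = new_D, new_S, new_W  # all updates use iteration t-1
    return sim_D, sim_S, sim_W

# Two documents with no sentence in common, whose sentences share one word.
FSD = [[1.0, 0.0],
       [0.0, 1.0]]
FWS = [[1.0, 0.8],
       [0.5, 0.0],
       [0.0, 1.0]]
first_D, _, _ = ft_sim(FSD, FWS, iterations=1)
second_D, second_S, _ = ft_sim(FSD, FWS, iterations=2)
```

In this toy case the two documents have similarity 0 after the first iteration, since they share no sentence, and 0.8 after the second, once the sentence similarity learned from the shared word has been propagated up to the document level.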

V. SPLITTING-BASED MODEL
In order to reduce the complexity of the problem of treating huge databases, it is possible to split a given data matrix into a collection of smaller ones, each sub-matrix becoming a component of our network and processed as a separate view. The splitting strategy can be random or use the FCM algorithm [8].

A. Random Split
Let us suppose that several machines (or nodes) are allocated in a distributed environment for our target similarity learning task. In the random split strategy, we can choose the number of splits to correspond to the number of cores available.
We then explore the behavior of the proposed architecture while varying the number of splits, and hence of sub-matrices, with the aim of finding the number most suitable for our solution. The split is performed on sentences (random sentence split method). Figure 2 shows an overview of the random splitting process. This model presents a solution which is not based on any strategy: there is no guarantee of obtaining a matrix suited to optimal processing by varying the number of splits. Thus, we cannot deduce rules to be applied to this kind of problem.
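A random sentence split can be sketched as below. The sketch assumes that the Sentences by Documents matrix has one row per sentence and the Words by Sentences matrix one column per sentence, and that sub-matrices keep all documents and all words; the function name `random_sentence_split`, the `seed` parameter and the toy matrices are hypothetical.

```python
import random

def random_sentence_split(SD, WS, n_splits, seed=0):
    """Randomly partition the sentence indices into n_splits groups and cut SD and WS accordingly.
    SD: sentences x documents (rows are sentences); WS: words x sentences (columns are sentences)."""
    indices = list(range(len(SD)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_splits groups.
    groups = [sorted(indices[p::n_splits]) for p in range(n_splits)]
    parts = []
    for group in groups:
        sub_SD = [SD[j] for j in group]                   # keep the selected sentence rows
        sub_WS = [[row[j] for j in group] for row in WS]  # keep the selected sentence columns
        parts.append((group, sub_SD, sub_WS))
    return parts

SD = [[1, 0], [0, 2], [1, 1], [0, 1]]          # 4 sentences, 2 documents
WS = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]  # 3 words, 4 sentences
parts = random_sentence_split(SD, WS, n_splits=2)
```

Nothing in this procedure looks at the content of the sentences, so two similar sentences are as likely to be separated as to be kept together, which is the weakness of the random strategy noted above.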

B. FCM-based Split
The second alternative for splitting a given data set is to adopt the FCM algorithm [8]. As shown in section 2, FT-Sim exploits the triadic nature of the similarity problem, that is, the relationship between groups of sentences that occur in a group of documents and the relationship between groups of words that occur in a group of sentences. Thus, documents are considered similar, and hence grouped together, if they contain similar sentences, and sentences in turn are considered similar, and therefore grouped together, if they occur in similar documents, and so on.
iJES, Volume 1, Issue 1, August 2013
The idea behind this method is that by performing a rapid clustering of sentences before constructing the sub-matrices for each core, as shown in figure 3, we can already obtain groups of similar sentences. This facilitates the following task, which is the parallel co-similarity learning. The clustering is unsupervised and adopts a fuzzy partitioned-based strategy. Fuzzy clustering methods allow objects to belong to several clusters simultaneously, with different degrees of membership. The data set $X = \{x_1, x_2, \dots, x_n\}$ is thus partitioned into $c$ fuzzy subsets. The result is the partition matrix $U = [\mu_{ik}]$ for $1 \le i \le c$ and $1 \le k \le n$. The aim of this algorithm is to minimize an objective function, denoted as $J_m$, of the following form:
$$J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} (\mu_{ik})^m \, \lVert x_k - v_i \rVert^2$$
where $v_i$ is the center of cluster $i$ and $m > 1$ is the fuzzification exponent; this formulation is defined and well studied in [8].
Thus, the FCM can be used to obtain a set of clusters with similar sentences.
In this way we can exploit these results to construct Documents × Sentences and Sentences × Words sub-matrices with similar sentences, with the aim of obtaining more coherent matrices.
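The FCM-based grouping of sentences can be sketched with a plain implementation of the standard fuzzy c-means updates, followed by a rule that turns memberships into sentence groups. Assigning each sentence to its highest-membership cluster is an assumption of this sketch (the text does not state how memberships are turned into sub-matrices), as are the random initialisation, the fixed iteration count and the names `fcm` and `hard_groups`.

```python
import random

def fcm(points, c, m=2.0, iterations=50, seed=0):
    """Plain fuzzy c-means. points: list of equal-length numeric vectors.
    Returns (U, centers), with U[i][k] the membership of point k in cluster i."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial partition matrix whose columns sum to 1.
    U = [[rng.random() for _ in range(n)] for _ in range(c)]
    for k in range(n):
        total = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= total
    centers = []
    for _ in range(iterations):
        # Center update: mean of the points weighted by membership^m.
        centers = []
        for i in range(c):
            w = [U[i][k] ** m for k in range(n)]
            total_w = sum(w) or 1.0  # guard against an empty cluster
            centers.append([sum(w[k] * points[k][d] for k in range(n)) / total_w
                            for d in range(dim)])
        # Membership update from the distances to every center.
        for k in range(n):
            dist = [sum((points[k][d] - centers[i][d]) ** 2 for d in range(dim)) ** 0.5
                    for i in range(c)]
            if min(dist) == 0.0:
                hits = [i for i in range(c) if dist[i] == 0.0]
                for i in range(c):
                    U[i][k] = 1.0 / len(hits) if i in hits else 0.0
            else:
                for i in range(c):
                    U[i][k] = 1.0 / sum((dist[i] / dist[j]) ** (2.0 / (m - 1.0))
                                        for j in range(c))
    return U, centers

def hard_groups(U):
    """Assign each point to its highest-membership cluster (assumed splitting rule)."""
    c, n = len(U), len(U[0])
    groups = [[] for _ in range(c)]
    for k in range(n):
        best = max(range(c), key=lambda i: U[i][k])
        groups[best].append(k)
    return groups

# Toy sentence vectors over two words: two clearly separated groups.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
U, centers = fcm(points, c=2)
groups = hard_groups(U)
```

Each resulting group of sentence indices can then be used to select the rows and columns of the sub-matrices, in the same way as a random group would be, but with similar sentences kept together.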
Dividing a huge database into smaller ones can considerably reduce the time and complexity of the computation, but we must add the cost of the FCM run, which can be neglected because FCM is considered fast compared with other clustering methods.
On the other hand, this solution saves time in the following task, which is the co-similarity learning.

VI. SPLITTING-BASED PARALLEL ARCHITECTURE
After the splitting step, we compute the similarity matrices from the local sub-matrices and aggregate them before running the co-clustering algorithm on the aggregated result. Figure 4 shows the splitting-based parallel architecture.

The aggregation function takes the local similarity matrices produced at a given iteration t. If a given document appears in only a single local data source, then we assign its corresponding similarity measures directly to the consensus matrix. If a particular document appears in several different local data sources, we assign to the consensus matrix the minimum of all the similarity measures relevant to this document, without taking into account the value 0. The different steps of the aggregation are presented in algorithm 2. So, for a given iteration t, each FT-Sim instance produces its own similarity matrix. We thus get a set of output similarity matrices whose number is equal to the number of local data sets. We therefore use the aggregation function developed in algorithm 2 to compute a consensus similarity matrix merging all of them. In turn, this resulting consensus matrix is connected to the inputs of all the FT-Sim instances, to be taken into account in the next iteration, thus creating feedback loops that allow the system to spread the knowledge provided by each local similarity matrix within the network.
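The aggregation rule described above (copy a value when only one source provides it, otherwise take the minimum while ignoring zeros) can be sketched as follows. The sketch assumes that every local matrix is indexed by the same global list of documents and that a 0 entry means the source provides no similarity for that pair; the function name `aggregate` and the example values are hypothetical.

```python
def aggregate(local_matrices):
    """Merge local Documents x Documents similarity matrices into one consensus matrix.
    Each entry is the minimum of the non-zero local values, or 0 if no source provides one."""
    n = len(local_matrices[0])
    consensus = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            values = [m[i][j] for m in local_matrices if m[i][j] != 0]
            if values:
                # A single value is copied directly; several values give their minimum.
                consensus[i][j] = min(values)
    return consensus

A = [[1.0, 0.4], [0.4, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]  # this source has no information on the pair (0, 1)
C = [[1.0, 0.7], [0.7, 1.0]]
consensus = aggregate([A, B, C])
```

Ignoring zeros keeps a source that never saw a pair of documents from erasing the similarity learned elsewhere, while the minimum keeps the consensus conservative when several sources disagree.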

Algorithm 2 Aggregation Function
The complexity of this architecture is obviously related to that of the FT-Sim algorithm. In the parallel splitting-based architecture, as each FT-Sim instance can run on an independent core, the method can easily be parallelized, thus keeping the global complexity unchanged (considering the number of iterations as a constant factor). The complexity of the aggregation function can therefore be ignored.
By splitting a matrix, we lose some information: the solution does not compute the co-similarities between all pairs of sentences but only between the words occurring in each local sub-matrix. Thanks to the feedback loops of this architecture and to the presence of the common similarity matrix, we are able to spread the information through the network and alleviate the problem of inter-matrix comparisons.
Algorithm 3 presents the different steps of the parallel splitting-based process. By running a parallel version of FT-Sim on several cores, we gain in both time and space complexity. Indeed, the time complexity decreases with the number of cores and, in the same way, the memory needed to store the similarity matrices between words decreases.

VII. CONCLUSION
In this paper, a fuzzy triadic similarity model for the co-clustering task has been proposed. It iteratively takes into account three levels of abstraction (documents/sentences/words). Sentences, consisting of one or more words, are used to determine the fuzzy co-similarity of two documents. We are able to cluster together documents that have similar concepts based on their shared (or similar) sentences and, in the same way, to cluster together sentences based on words. This also allows us to use any classical clustering algorithm such as FCM or other fuzzy partitioned-based clustering approaches.
Our fuzzy triadic similarity-based model assigns word and sentence memberships in accordance with the size of a document, which makes it possible to deal with the uncertainty associated with co-clustering. This ensures a good interpretation of the result of the co-clustering method, which does not need to cluster the words and sentences in order to cluster the documents.
To tackle the problem of large matrices, we have proposed an extension of our fuzzy triadic co-similarity model to a multi-view one. There is no single right way to approach the analysis of a large data set; it often involves an assortment of complementary approaches, each building upon another. Thus, we have proposed to use our fuzzy triadic similarity method in association with FCM clustering. Before proceeding to the similarity learning, we split the data set, considered as huge, into a set of smaller ones according to the sentences. There are several possible splitting alternatives. We opted for a split based on a preliminary rapid clustering using the FCM method, in order to obtain coherent sub-matrices. Then, a parallel architecture was presented, which combines FT-Sim instances to compute similarities on different cores. The combination of these techniques may be particularly useful to help simplify computing on large amounts of textual data and to focus on the rich, descriptive, and expressive details of qualitative data.
For future work, many directions seem worth exploring in the splitting approach. More sophisticated models will be investigated, such as a two-dimensional splitting that divides the data set according to both sentences and words.