Using Linguistic Resource for Cross-Lingual Ontology Alignment

In the Semantic Web context, ontology alignment is a cornerstone solution to data heterogeneity, since it enables interoperability between data sources. However, most existing alignment methods assume that all ontologies to be aligned are described in the same language, and very few approaches have addressed the growing challenge of multilingual ontology alignment. This paper introduces a new alignment method for multilingual ontologies. The proposed method implements a direct alignment strategy based on an external linguistic resource. The results of extensive experiments are encouraging and provide useful insights into the proposed method.


I. INTRODUCTION
Multilingualism has become an issue of major interest for the Semantic Web community. This trend has been accelerated by several initiatives that encourage participants to make their data publicly available. These actors often publish their data sources in their own languages, so this information must be made interoperable and accessible to members of other linguistic communities [1].
As a solution, the ontology alignment process aims to provide semantic bridges between heterogeneous and distributed information systems. Indeed, the volume of information reachable via the Semantic Web stresses the need for techniques that guarantee the sharing, reuse, and interaction of all resources [2]. Making explicit the concepts related to a particular domain of interest relies on ontologies, which are considered the kernel of the Semantic Web [3]. According to Gruber [4], an ontology can be defined, in the context of computer and information sciences, as a set of representational primitives with which to model a domain of knowledge or discourse. On the other hand, the open and dynamic resources of the future Web give it a heterogeneous character, reflected in the variety of formats and languages used to describe it. Consequently, the ontologies used to describe and structure these resources will be expressed in diverse formats and, in particular, in different natural languages.
Indeed, multilingual knowledge representation, access, and translation are a pressing need. For that purpose, ontology alignment becomes particularly important, because it enables the reconciliation of resources described by different or multilingual ontologies. Multilingualism is identified as one of the six challenges of the Semantic Web. Consequently, solutions have been proposed at the ontology level, the annotation level, and the interface level [5]. At the ontology level, ontology designers should be supported in creating knowledge representations in diverse natural languages. At the annotation level, tools should be developed to help users annotate ontologies independently of the natural languages used in their descriptions. At the interface level, users should be able to access information in the natural language of their choice, without any linguistic constraint.
The lack of multilingual coverage can be a real handicap for information exchange between the various services offered by the Semantic Web. Application fields are increasingly numerous, and each raises very specific difficulties. Moreover, multilingual coverage allows reasoning on the intersections between the contexts of various ontological representations. In this respect, reasoning about overlapping context domains supports multilingual information retrieval and digital content management. Multilingual ontology alignment is still a little-investigated domain, in spite of the large number of alignment methods, which remain restricted to monolingual ontologies, such as OMAP [6], H-MATCH [7], OLA2 [8], and RIMOM [9], to cite but a few. This paper addresses challenges at the annotation level. It proposes a new method for multilingual ontology alignment called DCLOA (Direct Cross Lingual Ontology Alignment). The main idea is to align existing ontologies that are expressed in the OWL-DL language and written in different natural languages. Therefore, the two ontologies $O_s$ and $O_t$ treated in the alignment process contain entities described in two different natural languages. The originality of the DCLOA method lies in how it deals with the multilingual aspect of ontology alignment. Furthermore, the proposed approach relies on an external resource to establish equivalence between ontological descriptors expressed in two different languages.
The outline of this paper is as follows. Section 2 reviews the existing methods in the field of multilingual ontology alignment and defines the terminology and notation used in the rest of this paper. Section 3 gives a detailed description of the DCLOA method, its foundation, and its various steps, as the main contribution of this work. Section 4 presents the experimental results obtained on the considered test base, as well as a comparative study with the pioneering methods of the literature. Finally, Section 5 concludes the paper and outlines future work.
Ontology alignment can be seen as an evaluation of the degree of resemblance, or of the differences, between ontologies [10]. More precisely, given two ontologies $O_s$ and $O_t$, an alignment between $O_s$ and $O_t$ is a set of correspondences, each a quadruple $\langle e_s, e_t, r, Conf_n \rangle$, where $e_s$ is an entity of $O_s$, $e_t$ is an entity of $O_t$, $r$ is a relation between the two entities $e_s$ and $e_t$, and $Conf_n$ is the confidence level in this relation [10].
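The quadruple definition above maps naturally onto a small record type. The following Python sketch is only an illustration: the class name, the field names, the "=" relation symbol, the entity name Woman, and the confidence values are assumptions introduced here, and the restriction of the confidence level to [0, 1] is a common convention rather than something stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correspondence:
    """One alignment cell <e_s, e_t, r, Conf_n> (field names are illustrative)."""
    source_entity: str   # e_s, an entity of the source ontology O_s
    target_entity: str   # e_t, an entity of the target ontology O_t
    relation: str        # r, e.g. "=" for equivalence (assumed notation)
    confidence: float    # Conf_n, assumed here to lie in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


# An alignment is a set of correspondences; the values below are made up.
alignment = {
    Correspondence("HommeAdulte", "Man", "=", 0.9),
    Correspondence("FemmeAdulte", "Woman", "=", 0.8),
}
```

Making the record frozen keeps it hashable, so correspondences can be stored in a set, which matches the definition of an alignment as a set of correspondences.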
In the literature, few methods have addressed multilingual ontology alignment. A theoretical idea was presented for building indirect alignments between multilingual ontologies [12]. The basic principle of this method is the reuse of existing, stored alignment files: an intermediary alignment between the source and target ontologies is used to compose a new alignment. Beforehand, equivalences between multilingual entities belonging to the two distinct ontologies must be discovered and established by a human expert. Then, alignment composition is applied using an alignment algebra [11], where each composed relation is obtained from two correspondence relations, e.g., equivalence and inclusion.
An API for multilingual ontology alignment was also developed, which specifies a minimal interface based on a few strategies [13]. Following a direct translation-based strategy, a source ontology is translated into a new one. The TranslateOnto module reads the source ontology, translates it, and writes the resulting ontology. The translation step relies on a strategy that translates URI labels. The TranslateStrategy module implements the OWLEntityURIConverterStrategy of the OWL-API. The conversion is carried out using an external resource, the Google-Translator-API, to provide the translations. The source ontology is read and the translated one rendered using tools provided by the Alignment API [14] and the OWL-API, respectively. Consequently, the translated and target ontologies can be matched using a chosen matcher.
The DAMO method [12] implements a direct alignment strategy based on two phases and two external resources. The alignment process contains two complementary components. The first is the string-based similarity module, which computes a compound terminological similarity between the descriptors of the ontological entities to be aligned. The second is the structure-based similarity module.
The recent evaluation campaign for alignment systems was marked by the release of a new multilingual test base for the ontology alignment community. This has led to the emergence of new methods dedicated to the multilingual aspect. The AUTOMSV2 [16] alignment algorithm is composed of four complementary modules. The first module synthesizes the alignments of two string-based similarity distance methods distributed with the Alignment API. The second combines the alignments of two WordNet-based similarity distance methods of the Alignment API. The third is a single method based on a string similarity distance. Finally, two more methods, one structure-based and one instance-based, are integrated, based on the general principle of neighborhood similarity. To handle multilingualism, AUTOMSV2 uses a free Java API called WebTranslator: it translates the labels of classes and properties that are in a non-English language and creates an English-labeled copy of each non-English ontology file.
The core idea of the WESEE approach [12] is to use a web search engine to retrieve web documents that are relevant to the concepts in the ontologies to match. Concept labels, comments, and URI fragments are used as search terms. The search results of all concepts are then compared to each other: the more similar the search results, the higher the similarity score of the concepts. First, stop words such as 'and' and 'or' are inherently filtered, because they occur in the majority of documents. Second, terms that are common in the domain, and thus have little value for disambiguating mappings, are weighted lower. For multilingual ontologies, the fragments, labels, and comments are first translated into English as a pivot language, using the translation capabilities of the Bing Search API. The translated concepts are then processed as described above.
In the YAM++ approach [18], multiple strategies are implemented to deal with both the terminological and the conceptual heterogeneity of ontologies. Input ontologies are loaded and parsed. Then, information about the entities is indexed by the annotation indexing and structure indexing components. The terminological matcher component produces a set of mappings by comparing the annotations of entities. The instance-based matcher component adds new mappings through instances shared between the ontologies. The results of the terminological matcher and the instance-based matcher are aggregated into an element-level matching result. Finally, the semantic verification component refines these mappings to eliminate inconsistent ones. When the input ontologies use different languages to describe the annotations of entities, the Bing multilingual translator is used to translate those annotations into English.
GOMMA [18] includes a generic component to semantically align ontologies. Its matching component allows for direct and indirect ontology matching. Direct match strategies rely on internal ontology knowledge, such as concept-associated or structural information. By contrast, indirect matching is based on the composition of existing mappings to intermediate ontologies. The post-processing stage combines or aggregates the directly and indirectly obtained mappings and selects the most likely correspondences from the combined mapping. For each concept, this approach returns only the correspondences with the maximal similarity value or those within a small delta of the maximal value, i.e., only the best correspondences for each source and target concept are kept. For multilingual ontologies, a free translation API is used to automatically translate non-English terms and produce a temporary list of equivalent terms.
In the context of multilingual ontology alignment, the alignment process should establish a semantic link between a source ontology $O_s$, expressed in a language $L_s$, and a target ontology $O_t$, expressed in another language $L_t$. To this end, the entity descriptors of $O_s$ and $O_t$, as represented in their corresponding graphs, are translated from the language $L_s$ to the language $L_t$.
III. THE DCLOA METHOD
The DCLOA method aligns a source ontology $O_s$ to a target ontology $O_t$ and operates on OWL-DL ontologies. In a preprocessing stage, both input ontologies are parsed: each ontology is loaded into memory only once and transformed into a graph structure. Parsing analyzes an input character stream by segmenting it according to a model, which makes it useful for extracting targeted information from data. For the DCLOA method, parsing is performed through the OWL API. All the information contained in an ontology, i.e., its classes, relations, and instances, is described by the corresponding graph. The nodes of each graph are classes and instances, whereas the arcs represent links between the ontological entities. Each entity of an ontology is expressed in the RDF formalism as a triple <subject, predicate, object> [20] and described by means of OWL-DL constructors. The subject corresponds to a class or a relation. The predicates are OWL-DL primitives or RDF properties. Each property used in a triple enriches the knowledge about the described entity, and the combination of all this knowledge constitutes the definition of the entity. Representing an OWL-DL ontology as a graph allows it to be stored in main memory only once, which reduces disk accesses to the OWL-DL file. Note also that the alignment process is restricted to entity names.
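As a rough illustration of the graph representation described above, the sketch below builds an in-memory adjacency structure from a list of <subject, predicate, object> triples. It does not use the OWL API; the triples are hard-coded, hypothetical examples (the class name Personne and the function name `build_graph` are assumptions made here), standing in for what a real parser would deliver.

```python
from collections import defaultdict

# Hypothetical triples; a real run would obtain them from an OWL parser.
triples = [
    ("Homme", "rdfs:subClassOf", "Personne"),
    ("Femme", "rdfs:subClassOf", "Personne"),
    ("HommeAdulte", "rdfs:subClassOf", "Homme"),
]


def build_graph(triples):
    """Return {node: set of (predicate, neighbor)} built from the triples."""
    graph = defaultdict(set)
    for subject, predicate, obj in triples:
        graph[subject].add((predicate, obj))
        graph[obj]  # register the object as a node even without outgoing arcs
    return dict(graph)


# The ontology is turned into a graph once and then queried in memory.
graph = build_graph(triples)
```

Once the graph is built, later steps can look up the arcs of a node directly instead of re-reading the ontology file.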

A. Cross-Lingual Similarity
From the graphs corresponding to the two ontologies $O_s$ and $O_t$, we retrieve the entity names. As a first treatment, every target entity name is used as a trigger for a lookup. Then, from the external linguistic resource, i.e., Bing, we obtain the list of n words equivalent to the source word, as depicted in Figure 1. Assume that we want to compute the cross-lingual similarity (CLS) between two given strings s and t. Each of these strings can be a single word or a text containing several statements.
First, we tokenize each string using delimiters and convert it to a bag of words. Any non-alphabetical character in the given strings is treated as a delimiter and deleted. For example, if a string s contains two non-alphabetical characters, both are treated as delimiters and removed from s, so that s is tokenized into three words.
After converting each of the strings s and t to a bag of words, every word that is common to the two bags increases the similarity score. In addition, stop words are removed from each bag, based on a stop-word list for each language supported by the alignment process. For each entity name $N_t$ extracted from $O_t$, we query the external resource Bing to create a list of correlated words, i.e., a list in the same language as s.
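The tokenization, stop-word removal, and overlap scoring can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the stop-word lists and the `EQUIVALENTS` table are tiny hypothetical stand-ins for the per-language lists and for the external resource, and the score is taken to be the number of common words divided by the larger of the two word counts, as the text defines it.

```python
import re

# Illustrative stop-word lists; a real system would load one per language.
STOPWORDS = {
    "en": {"the", "of", "a", "and", "or"},
    "fr": {"le", "la", "de", "et", "ou"},
}

# Hypothetical stand-in for the external resource:
# target-language word -> equivalent words in the source language.
EQUIVALENTS = {
    "man": {"homme", "monsieur"},
    "adult": {"adulte", "majeur"},
}


def bag_of_words(text, lang):
    """Split on non-alphabetical characters, lowercase, drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    stop = STOPWORDS.get(lang, set())
    return {t.lower() for t in tokens if t.lower() not in stop}


def cls_similarity(s, t, s_lang="fr", t_lang="en"):
    """Common words divided by the larger bag size."""
    s_words = bag_of_words(s, s_lang)
    t_words = bag_of_words(t, t_lang)
    if not s_words or not t_words:
        return 0.0
    # A target word counts as common when one of its equivalents occurs in s.
    common = sum(1 for w in t_words if EQUIVALENTS.get(w, set()) & s_words)
    return common / max(len(s_words), len(t_words))
```

With these toy tables, `cls_similarity("Homme", "Man")` gives 1.0, while `cls_similarity("Homme Adulte", "Man")` gives 0.5, because only one of the two source words is matched.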
If a given word d appears in this list, we conclude that the two entities, belonging respectively to $O_s$ and $O_t$, are similar; otherwise they are not. Let $\alpha$ be the number of words in s, $\beta$ the number of words in t, and $\gamma$ the number of words common to the two strings s and t. The cross-lingual similarity based on the external linguistic resource is then computed as follows:

$$Sim_{CLS}(s,t) = \frac{|\gamma|}{\max(|\alpha|, |\beta|)}$$

B. Context-Based Similarity
The context-based similarity rests on the assumption that, when two entities are similar, there is a good chance that the concepts surrounding them are also similar. Here, the surrounding concepts, which define the semantic context, are the super-concepts, sub-concepts, and sibling concepts. Therefore, in the context-based similarity, the description of a concept is based on its context. The value of this similarity is the ratio of the number of nodes that are linguistically similar to the total number of nodes belonging to the neighborhood. If we denote by N.CLS the number of nodes that are linguistically similar and by N.NEIGHBORS the total number of nodes belonging to the neighborhood, then the context-based similarity is expressed by:

$$Sim_{CBS}(n_s, n_t) = \frac{|N.CLS|}{|N.NEIGHBORS|}$$

Suppose that a source node $n_s$, as depicted in Figure 2, is aligned to a target node $n_t$. If the pairs $(n_{s1}, n_{t1})$, $(n_{s2}, n_{t4})$, and $(n_{s3}, n_{t2})$ are linguistically similar, then the three similar pairs account for six of the eight neighborhood nodes, and the context-based similarity value is computed as follows:

$$Sim_{CBS}(N_{E_s}, N_{E_t}) = \frac{|N.CLS|}{|N.NEIGHBORS|} = \frac{6}{8} = 0.75$$

C. Global Similarity Value
The alignment process ends by aggregating the values produced by the two modules, the linguistic one and the context-based one. The aggregation is a weighted combination whose weights sum to 1, i.e., $\omega_1 + \omega_2 = 1$. The aggregated correspondence value, $V_{Agg.Corr}$, is computed as follows:

$$V_{Agg.Corr} = \omega_1 \cdot Sim_{CLS}(n_s, n_t) + \omega_2 \cdot Sim_{CBS}(n_s, n_t)$$
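As a small worked example of the two formulas, the sketch below computes the context-based ratio for the 6-out-of-8 case from the text and then combines it with a cross-lingual score. The function names, the equal weights of 0.5, and the cross-lingual score of 1.0 are assumptions chosen only for illustration.

```python
def cbs_similarity(similar_nodes, neighborhood_size):
    """Ratio of linguistically similar nodes to all neighborhood nodes."""
    if neighborhood_size == 0:
        return 0.0
    return similar_nodes / neighborhood_size


def aggregate(sim_cls, sim_cbs, w1=0.5, w2=0.5):
    """Weighted sum of the two similarity values; the weights must sum to 1."""
    if abs(w1 + w2 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w1 * sim_cls + w2 * sim_cbs


sim_cbs = cbs_similarity(6, 8)   # 6 similar nodes out of 8 neighbors: 0.75
value = aggregate(1.0, sim_cbs)  # 0.5 * 1.0 + 0.5 * 0.75 = 0.875
```

Changing the weights shifts the balance between the linguistic evidence and the structural context, which is why a single fixed setting may not suit every pair of ontologies.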
Such a setting is not always adequate, since it does not take into account the intrinsic nature of the ontologies to align, which is described by their hierarchy and their information content. The values of the two weights cannot be universally optimal. In other words, a particular parameter configuration can produce an excellent alignment for one pair of ontologies and poor results in other cases. In this context, we use a novel approach for the automatic adaptation of the parameters of an ontology alignment method. This approach is based on the Choquet integral as an advanced aggregation operator [21]. The aggregation problem is reduced to solving a linear system, so as to maximize the sum of the similarity values. This was done using the Kappalab R package.
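To make the aggregation operator concrete, the sketch below evaluates the standard discrete Choquet integral for two criteria. It is not the authors' procedure: it does not use Kappalab and does not learn the capacity from data. The capacity values (0.6, 0.3, 1.0) and the criterion names "cls" and "cbs" are hypothetical and chosen only for illustration.

```python
def choquet(values, capacity):
    """Discrete Choquet integral.

    values:   dict mapping criterion -> score
    capacity: dict mapping frozenset of criteria -> weight of that coalition
    """
    order = sorted(values, key=values.get)  # criteria by ascending score
    total, previous = 0.0, 0.0
    for i, criterion in enumerate(order):
        # Coalition of the criteria whose score is at least the current one.
        coalition = frozenset(order[i:])
        total += (values[criterion] - previous) * capacity[coalition]
        previous = values[criterion]
    return total


# Hypothetical capacity; a learned one would come from an optimization step.
capacity = {
    frozenset({"cls"}): 0.6,
    frozenset({"cbs"}): 0.3,
    frozenset({"cls", "cbs"}): 1.0,
}

# 0.75 * 1.0 + (1.0 - 0.75) * 0.6 = 0.9, up to floating-point rounding
score = choquet({"cls": 1.0, "cbs": 0.75}, capacity)
```

When the capacity is additive, for example 0.5 for each single criterion and 1.0 for the pair, the Choquet integral reduces to the ordinary weighted sum, so the fixed-weight setting is a special case of this operator.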
The first stage of the DCLOA method consists in opening the OWL-DL files representing the ontologies to be aligned. Each ontology is loaded into memory only once and transformed into a graph structure. The alignment process can operate using fixed weights assigned to every entity name as well as to the various stages of the DCLOA method. The example uses two mock-up ontologies, shown in Figure 3, whose source code is given in what follows.
The various stages of the alignment process are based on the equivalence links that can exist between the entities of the two considered ontologies. This task supplies two vectors corresponding to the similarity computation in the DCLOA method. Commonly, ontology developers adopt syntactic rules for some entity names, such as camel case. This notation is the practice of writing compound words or phrases in which the elements are joined without spaces, with the initial letter of each element capitalized within the compound and the first letter either upper or lower case.
As shown in the illustrative example, there are two compound entity names written in camel case, namely FemmeAdulte and HommeAdulte. In the DCLOA method, such entity names are split so that they are treated as a set of words.
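Splitting compound names such as FemmeAdulte into separate words can be sketched with a regular expression. The pattern below is one possible choice, not necessarily the one used by DCLOA, and it assumes ASCII letters only; accented capitals would need a wider character class. The function name `split_compound` is introduced here for illustration.

```python
import re


def split_compound(name):
    """Split a compound entity name at capital-letter boundaries."""
    # Alternatives: a capitalized or lowercase word, a run of capitals
    # (an acronym) not followed by a lowercase letter, or a run of digits.
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)


words = split_compound("FemmeAdulte")  # ['Femme', 'Adulte']
```

The resulting word list can then be compared with the words of the other entity name, for example {Homme, Adulte} against {Man}.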
As depicted in Figure 2, the HommeAdulte entity is aligned to the Man entity. The task thus amounts to determining, using the external linguistic resource, the statistical correspondence that may exist between the two sets {Man} and {Homme, Adulte}. For example, Table 1 shows that the two entities Man and Homme have the highest similarity value among the obtained similarity values.

V. EXPERIMENTAL STUDY
In what follows we will present the experimental study, based on the metrics of Precision and Recall. Subsequently, a comparative study is conducted with the pioneering methods.
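For reference, the two metrics can be computed from a produced alignment and a reference alignment as in the sketch below. It uses the standard definitions (correct correspondences over found correspondences for precision, and over reference correspondences for recall); the entity pairs are made-up examples, not data from the experiments, and correspondences are simplified to (source, target) pairs.

```python
def precision_recall(found, reference):
    """Standard precision and recall of a found alignment vs. a reference."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical alignments, given as (source entity, target entity) pairs.
found = {("HommeAdulte", "Man"), ("FemmeAdulte", "Man"), ("Personne", "Person")}
reference = {
    ("HommeAdulte", "Man"),
    ("FemmeAdulte", "Woman"),
    ("Personne", "Person"),
    ("Enfant", "Child"),
}

# Two of the three found pairs are correct, out of four reference pairs.
precision, recall = precision_recall(found, reference)
```

In this toy case a wrong pair lowers precision, while a reference pair that was never found lowers recall.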

A. Test Cases
The experimental evaluation uses the set of test files provided by the OAEI (Ontology Alignment Evaluation Initiative) campaign. This dataset is composed of a subset of the Conference track, translated into eight different languages: Chinese (cn), Czech (cz), Dutch (nl), French (fr), German (de), Portuguese (pt), Russian (ru), and Spanish (es). With their special focus on multilingualism, these test cases make it possible to evaluate and compare the performance of alignment approaches.
B. Results and Discussion
Table I summarizes the precision and recall values for the pioneering methods and the DCLOA method. These metrics are presented in condensed form, i.e., each language pair reflects the mapping of eight ontologies in the source language to eight ontologies in the target language. The DCLOA method supports five language pairs, on which it clearly outperforms the other methods, except for the pair (fr-de), where it is outperformed by AUTOMSV2. The accuracy of the DCLOA method is highlighted by the recall metric, which reflects the number of entity pairs correctly aligned. This is explained by two aspects: first, the DCLOA method adopts a dynamic and flexible aggregation operator, which allows it to react to changes in similarity values; second, the context-based similarity eliminates ambiguities among the aligned entity pairs, thereby enhancing the recall values by reducing the number of false positive entities. Unlike the other methods mentioned above, the DCLOA method does not use string-based similarity measures, which are not totally reliable; instead, its linguistic similarity is based on a semantic, boolean test. During the translation stage, the DCLOA method does not create a new translated ontology, which is an expensive process; the translation starts and ends on temporary lists.
OAEI 2012 MultiFarm results page: http://oaei.ontologymatching.org/2012/multifarm/results/
VI. CONCLUSIONS AND OUTLOOKS
Multilingual ontology alignment is an important task in the ontology engineering field. In this paper, we introduced the DCLOA method for aligning multilingual OWL-DL ontologies. The results obtained by the DCLOA method are satisfactory, and they highlight the contribution of the external resource.
The proposed method performed well compared to other methods, but still requires some improvements. In the near future, we intend to extend DCLOA so that it can handle a wider range of natural languages. The integration of new external resources can also provide a wider choice of translations, which benefits the alignment task. Besides, a graphical user interface (GUI) is needed to assist ordinary users. In addition, we are working on the automatic detection of hierarchical trend paths in the considered ontologies, even in the multilingual context, since logical links are usually independent of linguistic details. Finally, we aim to integrate the DCLOA method into a complete ontology localization system for interactive environments in the Semantic Web domain.