Health Query Expansion Based on Graph Matching between DBpedia and UMLS

Abstract: Information Retrieval (IR) in the medical domain is considered a challenging task for many reasons. Short health queries tend to lack information on the user's intent, and the target corpus may not have sufficient information for Relevance Feedback. Even if users obtain documents relevant to their queries, they may find the technical terms difficult to understand. In this paper, we propose an approach for health query reformulation based on graph matching between two external linked data sources: DBpedia and the Unified Medical Language System (UMLS). DBpedia has broad topic coverage and less noise than Wikipedia articles, and UMLS is specific to the medical domain. We also introduce degree centrality to measure graph connectivity and to select the most effective candidate terms for query expansion. Experimental results on the MEDLINE collection using Okapi BM25 as a retrieval model showed that our approach outperformed related methods, and that the two sources achieved very good retrieval results. They helped diversify the retrieved documents and improve recall.


Introduction
Nowadays, health-related information is increasingly available through several biomedical data sources, including clinical reports and forums, among others. Moreover, a survey indicates that around 80% of US search engine users look for information on particular diseases or health problems [1].
This increasing demand for domain-specific IR from medical practitioners has led communities like TREC and CLEF to gather health resources and foster research in this field [2]. One of the largest well-known databases of biomedical literature is
This paper is organized as follows. Section 2 discusses related work. Section 3 describes our approaches, and Section 4 presents the obtained results and gives an outlook on our future work.

Related Works
Recently, some studies have suggested using external resources as well as graph matching to improve query expansion. Query reformulation that combines multiple information sources is more effective than the use of a single information source [10]. Diverse sources such as Wikipedia [11] were found to be beneficial for document retrieval.
In [12], the authors performed query expansion on the MEDLINE collection by exploiting the top retrieved Wikipedia articles and the values of their corresponding Wikidata attributes. However, even though Wikidata has links to UMLS as well as to other databases, using UMLS directly would have been a more appropriate alternative.
Other work [13] introduced external "document expansion" from Wikipedia through "document reduction", which generates a query for a document. This approach has a disadvantage: in general, a feedback document is independently relevant to the query, but a feedback document from Wikipedia may correspond to only a segment of the query rather than to the whole of it [14]. Also, the study in [15] showed that the top retrieved documents contain 65% harmful terms.
Another line of work [16] identified objects within a query and assigned them ranking scores using the Google Search API, then performed Pseudo Relevance Feedback (PRF) on the descriptions of the linked objects, taken from Freebase, to select the expansion terms. An object (entity) in Freebase is identified by a unique machine ID. Yet the first level of the Freebase ontology can be too general for certain queries, and the lower levels are difficult to use because of the lack of instances.
Most RDF graph matching algorithms either involve a pairwise comparison of semantic resources, like SimRank [17], which considers objects similar if they are related to similar objects, or are based on finding paths between resources, like LDSD [18], which provides measures for determining the semantic distance between Linked Data resources. Moreover, such algorithms do not use the predicates, which usually contain valuable information [19].
Also, using the whole graph representation of information sources to exploit their matching relationships is computationally demanding [20]. To overcome this problem, the authors in [20] explored, for each WordNet concept, a WordNet sub-graph centered on it as well as a UMLS sub-graph of candidate matches, in order to identify the matching relationships between the two ontologies. In our work, we used DBpedia instead of WordNet, which treats terms individually, i.e., it does not take the context of a term into account to determine its meaning. Also, unlike DBpedia, WordNet covers only a few relations: synonymy, hypernymy, and hyponymy. Moreover, WordNet concepts follow a tree structure, whereas UMLS has a general graph structure [20]. As a consequence, UMLS has better connectivity.
Based on RDF graph matching, the authors in [19] enhanced user satisfaction by generating intelligent snippets (i.e., snippets providing more valuable information). First, they generated RDF graphs for queries and documents using, among others, WordNet (for expanding queries with synonyms) and DBpedia Spotlight (for named-entity recognition). Second, they transformed the graphs into bipartite graphs [21]. Third, for snippet generation, they opted either for the resource-graph matching algorithm Relevance Search (RS) [22], which compares a node with a graph, or for a graph-graph matching algorithm that determines the resources common to two graphs.
In [7], the authors used a WordNet graph-based method to expand queries by selecting all synonyms, hypernyms, etc. of each query term to obtain a sub-graph including only the shortest path between a pair of query terms. This method is limited by its reliance on WordNet, which may not have an ontology for a given domain. Moreover, WordNet has low coverage of concepts and phrases [23] compared to DBpedia, which annotates entities that can be phrases.
In [24], the authors expanded biomedical [25] queries using the MeSH thesaurus. Then, they retrieved documents based on the similarity between those expanded queries and clusters of biomedical documents. Unlike our approaches, this approach does not take advantage of other resources within UMLS.

Proposed Approach
This paper aims at formulating a new form of query expansion by integrating and matching two external resources: DBpedia and UMLS (Figure 1). The process of query expansion is carried out according to the following phases:
1. Preprocessing of queries through stop-word removal and stemming [26] using the Porter Stemmer.
2. Using DBpedia Spotlight to determine the concepts (entities) in the query and divide its keywords into two categories, concept and non-concept (1), where q_I is the initial query, t_c are the DBpedia concepts in the query, and t_nc are the non-DBpedia concepts in the query.
3. Searching for each of the found concepts in the UMLS Metathesaurus Browser.
4. Matching between DBpedia and UMLS: matching the directly related "dct:subject" values, as well as the indirectly related "skos:broader" and "is dct:subject of" values, of the DBpedia concept with the attributes of the UMLS concept.
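Phases 1 and 2 can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the stop-word list is a tiny illustrative one, Porter stemming is omitted, and the hypothetical `concept_phrases` argument stands in for the entities that an annotator such as DBpedia Spotlight would return.

```python
import re
from typing import Iterable, List, Tuple

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"the", "of", "in", "and", "a", "an", "to", "with", "for"}


def split_query(query: str, concept_phrases: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition a query into concept phrases (t_c) and other keywords (t_nc).

    `concept_phrases` is a hypothetical, hand-supplied stand-in for the
    output of an entity annotator; no annotation service is called here.
    """
    # Lower-case, drop punctuation, and collapse whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", query.lower())
    text = " ".join(text.split())

    concepts: List[str] = []
    # Try longer phrases first so that multi-word concepts are not split.
    for phrase in sorted(concept_phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        if re.search(pattern, text):
            concepts.append(phrase.lower())
            text = re.sub(pattern, " ", text)

    # Whatever remains, minus stop words, is treated as non-concept keywords.
    non_concepts = [word for word in text.split() if word not in STOP_WORDS]
    return concepts, non_concepts


t_c, t_nc = split_query(
    "The treatment of aortic regurgitation in infants",
    ["aortic regurgitation"],
)
print(t_c, t_nc)
```

Keeping the non-concept keywords in the query reflects the choice made in this work to retain every query term, whether or not it is recognized as a DBpedia concept.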
For a concept i in ontology A, a concept j in ontology B is considered a match of i if i and j have similar meanings [20]. Similarly, in this work, we considered attributes from the two ontologies as matching if they were either equivalent (═) or if the attribute from the first ontology was more general (⊒) than the one from the other ontology, as shown in Figure 2. We chose the attributes "dct:subject", "is dct:subject of" and "skos:broader" from DBpedia for two reasons. First, they keep our approach simple and computationally inexpensive. Second, they are commonly found for most entities, and we believe that they carry valuable information for the matching with UMLS.
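The matching criterion can be sketched as follows. This is only an illustration under stated assumptions: labels are compared after a simple normalization, the "first ontology" is taken to be DBpedia, and the hypothetical `broader` dictionary (mapping a term to the terms that are more general than it) stands in for the hierarchical relations that would come from the two sources. The example values are toy data.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def normalize(label: str) -> str:
    """Lower-case a label and strip underscores and a 'Category:' prefix."""
    label = label.lower().replace("_", " ")
    if label.startswith("category:"):
        label = label[len("category:"):]
    return " ".join(label.split())


def match_attributes(
    dbpedia_values: Iterable[str],
    umls_values: Iterable[str],
    broader: Optional[Dict[str, Set[str]]] = None,
) -> List[Tuple[str, str, str]]:
    """Return (dbpedia_value, umls_value, relation) triples.

    relation is "=" for equivalent labels and ">=" when the DBpedia value
    is listed as more general than the UMLS value in `broader`
    (a hypothetical lookup keyed by normalized labels).
    """
    broader = broader or {}
    umls_list = list(umls_values)
    matches = []
    for d in dbpedia_values:
        for u in umls_list:
            nd, nu = normalize(d), normalize(u)
            if nd == nu:
                matches.append((d, u, "="))
            elif nd in broader.get(nu, set()):
                matches.append((d, u, ">="))
    return matches


# Toy values for illustration only.
dbpedia = ["Category:Valvular_heart_disease", "Aortic stenosis"]
umls = ["Aortic Stenosis", "Aortic Valve Insufficiency"]
hierarchy = {"aortic valve insufficiency": {"valvular heart disease"}}
print(match_attributes(dbpedia, umls, hierarchy))
```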
In Figure 2, we matched DBpedia concepts with UMLS ones. These UMLS concepts are connected to "aortic insufficiency" through the relationship types "RB", which means "has a broader relationship", and "RO", which stands for "has relationship other than synonymous, narrower, or broader". "Valvular heart disease" is a DBpedia category of the entity "aortic insufficiency" and is directly linked to this entity, whereas "aortic stenosis" is an entity that has "valvular heart disease" as one of its categories. That is to say, "aortic stenosis" is not directly linked to "aortic insufficiency".

5. Reformulation of the initial query through:
Strategy 1: Using the candidate terms that were found in both of the external semantic sources (the concepts_match of the 4th phase), as shown in (2).
q_E1: (DBpedia − UMLS) GraphMatching
$$q_{E1} = q_I + \mathit{concepts}_{match} \qquad (2)$$
Strategy 2: Using the "degree centrality" measure to determine the query reformulation terms among those obtained in the 4th phase. We use both the content of the nodes whose degree centrality is higher than the average and the matching terms between their directly linked DBpedia nodes and the UMLS nodes, as shown in (3). This strategy lets us examine the performance of degree centrality.
q_E2: (DBpedia − UMLS) DegreeCentralityMatches
$$q_{E2} = q_I + V_{DCaboveaverage} + \mathit{degreecentrality}_{matches} \qquad (3)$$
where V_DCaboveaverage are the vertices whose degree centrality is above the average, and degreecentrality_matches are the concepts_match of the vertices directly linked to V_DCaboveaverage, i.e., the matching terms between each DBpedia vertex directly linked to a V_DCaboveaverage vertex and a UMLS vertex.
The "degree centrality" measure is one of the well-known graph connectivity measures. It is a variant of "graph centrality", which aims at determining the importance of a node in a graph by considering its relations with the other nodes in the graph [23]. It is the simplest way to determine the importance of a vertex, namely by its degree [27]. The degree of a vertex is the number of edges incident on that vertex [7], and the degree centrality is the degree of a vertex normalized by the maximum degree [27].
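The selection of above-average vertices and the assembly of the expanded queries can be sketched as below. The sketch assumes an undirected simple graph and normalizes by n − 1, the maximum possible degree; the text does not say whether the maximum possible or the maximum observed degree is meant, so this is one reading. The function names and the toy graph are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def degree_centrality(edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Degree of each vertex divided by (n - 1), for an undirected graph."""
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adjacency[u].add(v)
            adjacency[v].add(u)
    n = len(adjacency)
    if n < 2:
        return {vertex: 0.0 for vertex in adjacency}
    return {vertex: len(nbrs) / (n - 1) for vertex, nbrs in adjacency.items()}


def above_average(centrality: Dict[str, float]) -> List[str]:
    """Vertices whose degree centrality is strictly above the mean."""
    if not centrality:
        return []
    mean = sum(centrality.values()) / len(centrality)
    return sorted(v for v, c in centrality.items() if c > mean)


def expand_query(initial_terms: Iterable[str], *term_groups: Iterable[str]) -> List[str]:
    """Append expansion terms to the initial query, skipping duplicates."""
    expanded = list(initial_terms)
    seen = {term.lower() for term in expanded}
    for group in term_groups:
        for term in group:
            if term.lower() not in seen:
                seen.add(term.lower())
                expanded.append(term)
    return expanded


# Toy sub-graph for illustration only.
edges = [
    ("aortic insufficiency", "valvular heart disease"),
    ("aortic stenosis", "valvular heart disease"),
    ("mitral stenosis", "valvular heart disease"),
]
hubs = above_average(degree_centrality(edges))
q_initial = ["aortic", "regurgitation"]
# Strategy 2: initial query + above-average vertices + their neighbors' matches.
q_e2 = expand_query(q_initial, hubs, ["aortic stenosis"])
print(q_e2)
```

Strategy 1 corresponds to calling `expand_query(q_initial, concepts_match)` with the matching terms from the 4th phase.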
Concept terms that are separated by "/" in UMLS are all used for query reformulation because they are considered as synonyms.

Fig. 2. A sub-graph of query 6 from the MEDLINE collection that corresponds to the query terms "aortic regurgitation" (annotated as "aortic insufficiency" in DBpedia and "aortic valve insufficiency" in UMLS)

Dataset description
To evaluate our approach, we used the MEDLINE collection and Okapi BM25 as a retrieval model. We indexed this dataset using the Indri search engine.
The dataset's texts (Table 1) vary in length and contain many technical terms.
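For readers unfamiliar with the retrieval model, the following is a minimal sketch of Okapi BM25 scoring over tokenized documents. It is not Indri's implementation: the parameter values k1 = 1.2 and b = 0.75 are common defaults assumed here (the values used in the experiments are not stated), and the IDF uses the variant with 1 added inside the logarithm so that it stays non-negative.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(
    query_terms: Sequence[str],
    documents: Sequence[Sequence[str]],
    k1: float = 1.2,   # assumed default, not taken from the paper
    b: float = 0.75,   # assumed default, not taken from the paper
) -> List[float]:
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(documents)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in documents) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        length_ratio = len(doc) / avg_len if avg_len else 0.0
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length_ratio))
        scores.append(score)
    return scores


docs = [
    ["aortic", "regurgitation", "treatment"],
    ["aortic", "stenosis"],
    ["renal", "failure"],
]
print(bm25_scores(["aortic", "regurgitation"], docs))
```

An expanded query is scored the same way, simply with more terms in `query_terms`, which is why expansion can retrieve documents that share no term with the initial query.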

Evaluation measures
The overall performance was evaluated in terms of Recall (R), Precision (P), and MAP.
• Precision. Shows to which level a system is capable of returning only relevant documents [28]:
$$P = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$$
• MAP. The MAP for a set of queries is the mean of the Average Precision (AP) scores for each query [29]:
$$MAP = \frac{1}{Q} \sum_{q=1}^{Q} AP(q)$$
where Q is the number of queries.
Table 2 presents the results obtained with our two suggested query reformulation methods, as well as their comparison with the UMLS approach, which we performed using the attributes obtained in step 3 of our method, prior to the matching step, and with the "Clusters' Retrieval Derived from Expanding Statistical Language Modeling Similarity and Thesaurus-Query Expansion with Thesaurus" (CRDESLM-QET) approach [24]. In both P@10 and R@10, our graph-based approaches considerably outperformed CRDESLM-QET [24].
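The evaluation measures can be made concrete with a short sketch. It uses the standard definitions of precision at k, recall at k, Average Precision, and MAP; the function names and the toy rankings are hypothetical, and the sketch is not the evaluation script used to produce the reported results.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the ranks of the relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the AP scores over all (ranking, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy rankings for illustration only.
runs: List[Tuple[List[str], Set[str]]] = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}),
    (["d5", "d6"], {"d6"}),
]
print(precision_at_k(runs[0][0], runs[0][1], 2))
print(recall_at_k(runs[0][0], runs[0][1], 2))
print(mean_average_precision(runs))
```

Truncating each ranking to its first 10 documents before calling `mean_average_precision` gives a MAP@10 figure.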

Results
As for using each linked data source separately (i.e., without matching them), the DBpedia approach gave lower precision and MAP at 10 than the (DBpedia-UMLS) graph matching, whereas the UMLS approach gave a slightly better recall and MAP at 10 than the (DBpedia-UMLS) graph matching. The P@10 of UMLS was nearly the same as that of our (DBpedia-UMLS) graph matching.
Also, the UMLS approach gave better results than the DBpedia approach in terms of P@10 and MAP@10. We believe that this difference is expected, since UMLS is specialized in the medical domain and is therefore richer than DBpedia in the technical terms of that domain.
Moreover, we believe that our approaches significantly boost recall because we use linked data to expand queries. As a result, documents that do not contain the concepts of the initial query but contain concepts interlinked with them are retrieved. Consequently, users who are not domain experts can find relevant documents even when they do not know the right terms to use. Experts, too, can find other relevant documents they may be interested in, since the expansion concepts are interlinked with their queries and cover more aspects of their intent. In other words, our approaches considerably improve recall because they add terms from different external resources that cover different aspects of the query concepts.
As for the graph matching approach, we think that the low results are due to the small number of features we used from both DBpedia and UMLS, which already has a very low number of features. One way to improve the results of this first approach would be to exploit the interlinked data of indirect features. In other words, we need to exploit concepts from the root to all of their n-hop neighbors.
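The idea of exploring a concept's n-hop neighbors can be sketched with a breadth-first search. This only illustrates the suggested extension and is not something evaluated in this work; the adjacency dictionary and its contents are hypothetical toy data.

```python
from collections import deque
from typing import Dict, Iterable


def n_hop_neighbors(adjacency: Dict[str, Iterable[str]], root: str, n: int) -> Dict[str, int]:
    """Return {vertex: hop distance} for all vertices within n hops of root."""
    distances = {root: 0}
    queue = deque([root])
    while queue:
        vertex = queue.popleft()
        if distances[vertex] == n:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[vertex] + 1
                queue.append(neighbor)
    del distances[root]  # the root itself is not a neighbor
    return distances


# Toy graph for illustration only.
graph = {
    "aortic insufficiency": ["valvular heart disease"],
    "valvular heart disease": ["aortic insufficiency", "aortic stenosis", "heart diseases"],
    "heart diseases": ["cardiovascular diseases"],
}
print(n_hop_neighbors(graph, "aortic insufficiency", 2))
```

With n = 1 this reduces to the directly linked attributes used in the current approach; larger values of n bring in indirect features at the cost of a larger candidate set.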
In this work, although the queries of the MEDLINE collection are very long, most of them containing two to three sentences, we opted for query expansion instead of query reduction, which is another possibility for query reformulation. Since we are dealing with a domain-specific dataset and professional terms, almost every term in the query is necessary for retrieving relevant documents, even if it is not recognized as a concept by DBpedia. In the future, we plan to look for an accurate way to perform query reduction on such queries. We will also further diversify the expansion terms by using multiple external sources. Another way to improve our approach is, first, to match the objects of each source separately, i.e., to match the DBpedia objects within a query using their similar attributes, and second, to match the attributes of one source, such as DBpedia, with their equivalent attributes from other sources, as we did in this work.

Conclusion
Related query expansion work tends either to combine linguistic features from a thesaurus like WordNet with semantic features from a non-domain-specific linked data source, or to rely on only one external source for graph matching. The novelty of our work is the application of graph matching, together with the degree centrality measure, to two linked data sources, one general and the other domain-specific. Another advantage of our approach is that it covers and exploits all query concepts, and not only a certain segment or n-gram of the query as in related work that uses Wikipedia feedback documents. And since we use a graph matching method, we do not process a concept separately from the others.
In this work, we performed query reformulation by, first, distinguishing between DBpedia concepts and non-DBpedia concepts in the initial query and, second, expanding the query with external terms from DBpedia and UMLS based on degree centrality results as well as common matching terms. This approach led to better recall results than related approaches. Consequently, we can say that reformulating queries [30] by matching the graphs of external sources and using the degree centrality measure helps, especially in improving recall, because the expansion terms are more diverse and are based on semantics. As a result, multiplying the external information sources for query expansion helps diversify the retrieved documents and thus improve recall. External sources tend to expand the query from different aspects of its meaning rather than expanding it based only on the target document collection.