Integration of User Profile in Search Process according to the Bayesian Approach

Most information retrieval systems (IRS) rely on the so-called system-centered approach: the system behaves as a black box that produces the same answer to the same query, independently of the user's specific information needs. Without considering the user, it is hard to know which sense a query term refers to. To satisfy user needs, personalization is an appropriate solution to improve IRS usability. Modeling the user profile can be the first step towards personalization of information search. The user profile refers to the user's interests built across his/her interactions with the retrieval system. In this paper, we present a personalized information retrieval approach for building and exploiting the user profile in the search process, based on Bayesian networks. The theoretical framework provided by these networks allows better capturing the relationships between different pieces of information. Experiments carried out on TREC-1 ad hoc and TREC 2011 Track collections show that our approach achieves significant improvements over a personalized search approach described in the state of the art and also over a baseline search process that does not consider the user profile.


Introduction
Personalization is an appropriate solution to find information adapted to the user's needs. Modeling the user can be the first step towards personalization of information retrieval. Most personalized information retrieval approaches therefore focus on constructing the user profile in order to better identify the user's information needs. The user profile can be built explicitly, by asking users questions [Ma et al., 2007], or implicitly, by observing their activities [Gauch et al., 2003], [Speretta et al., 2005], [Liu et al., 2010], [Srinivasa et al., 2016], [Zhou et al., 2016]. It can be represented by a simple structure based on keywords [Shen et al., 2012], by a concept hierarchy issued from the user's documents of interest [Kim et al., 2003], [Speretta et al., 2005], or by using an external domain ontology as additional evidence to model the user profile as a set of concepts issued from a predefined ontology [Gauch et al., 2003], [Daoud et al., 2009].

Related Work
Various approaches have been proposed to personalize the search results for a given user. Personalization consists of modeling the user to build a powerful user profile and then exploiting this profile in the search process. The user profile refers to the user's interests built across his/her interactions with the retrieval system. In the next sections, we present work related to the personalization process, namely user modeling and user profile exploitation in the retrieval process.

User modeling
A user model describes data that characterizes a user, such as data related to the user's preferences, goals and interests [Sieg et al., 2007], [Micarelli et al., 2007], [Shen et al., 2012], [Jenifer et al., 2015], [Srinivasa et al., 2016]. Most user modeling approaches represent the user profile as one or more vectors of terms [Gowan 2003], [Shen et al., 2005], [Tan et al., 2006]. Others organize the user profile as a hierarchical structure of concepts representing the domains of interest [Gauch et al., 2003], [Kim et al., 2003], [Speretta et al., 2005] or as a structured model of predefined dimensions (personal data, interests, preferences, etc.). The work presented in [Micarelli et al., 2007] describes the user profile with two dimensions: the history of interactions with the search system and the user's information needs based on his/her interests. Other approaches use an external domain ontology as additional evidence to model the user profile as a set of concepts issued from a predefined ontology [Gauch et al., 2003], [Daoud et al., 2009].
The construction of the user profile consists of collecting information representing the user. It can be done in two ways: explicitly or implicitly [Micarelli et al., 2007], [Jenifer et al., 2015], [Zhou et al., 2016]. In the explicit approach, the user is asked to be proactive and to directly communicate his/her data and preferences to the system [Ma et al., 2007]. However, an explicit request for information burdens the user and relies on the user's willingness to specify the required information. To overcome this problem, several techniques have been proposed in the literature to capture the user's interests automatically through implicit feedback; this is done by monitoring the user's actions during the interaction with the system and by inferring the user's preferences from them. The proposed techniques include click-through data analysis, query log analysis, desktop information analysis and document display time [Speretta et al., 2005], [Agichtein et al., 2006], [Srinivasa et al., 2016].

Personalization process
The user profile can be exploited before the search, to reformulate the query, or after the search, to re-rank the initial results [Micarelli et al., 2007]. Query reformulation consists of expanding the initial query with terms from the user profile [Koutrika et al., 2005], [Joachims et al., 2007], [Gan et al., 2008]. In [Qiu et al., 2006], the user profile is incorporated in the query-document matching model: the document score is computed by considering the relevance of the document both to the query and to the user profile. Most personalization approaches are based on re-ranking the initial results by combining either the original rank or the score between the document and the query with the rank or score between the document and the user profile [Gowan 2003], [Liu et al., 2010], [Teevan et al., 2011], [Cai et al., 2017].
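As an illustration of the re-ranking family of approaches, the following minimal sketch combines a query-document score with a profile-document score. The linear combination and the mixing parameter `gamma` are assumptions made for illustration; the cited works each use their own combination scheme.

```python
def rerank(query_scores, profile_scores, gamma=0.5):
    """Re-rank documents by a linear mix of two scores (assumed scheme).

    query_scores / profile_scores: {doc_id: score}; gamma is a hypothetical
    mixing weight between the query score and the profile score.
    """
    combined = {
        doc: gamma * score + (1 - gamma) * profile_scores.get(doc, 0.0)
        for doc, score in query_scores.items()
    }
    # Highest combined score first
    return sorted(combined, key=combined.get, reverse=True)

ranking = rerank({"d1": 0.9, "d2": 0.8}, {"d1": 0.1, "d2": 0.7})
```

In this toy example, d2 overtakes d1 because its profile score is much higher, which is the intended effect of re-ranking with a user profile.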

User profile representation and exploitation approach
In personalized search, one of the main issues is how to infer the user profile and how to exploit it in the search process.
To address these issues, our general approach to search personalization relies on building and using this user profile in the retrieval process. First, the user profile is modeled by the user's general interests, learned across his/her interactions with the retrieval system, including queries.
A user interest is built from the returned documents judged relevant by the user for a query. It is represented as a vector of weighted terms. The resulting user profile is used to improve the retrieval of relevant results that match the user's information needs. We propose a variant of the Bayesian network approach for search personalization, performed by integrating the user profile in the retrieval process. In particular, we extend the Bayesian belief network model proposed in [Ribeiro-Neto et al., 1996] to provide a structure for representing a user interaction and interpreting the query-document-user profile relevance as a belief in a document and in a user profile with respect to a query.
We summarize below the terminology and notations used in our contribution, then we detail our approach.

Terminology and notations
User's Interaction: A user's interaction with the search system, noted in, includes a query submitted by the user, the returned documents and the subset of documents judged relevant implicitly by the user.
User profile: The user profile refers to the user's interests learned across his/her interactions with the retrieval system. A user interest is issued from the relevant documents selected by the user during an interaction. It is represented as a vector of weighted terms, noted ck = {(t1,w1k), (t2,w2k), …, (ti,wik)}, where wik denotes the weight of term ti in user interest ck. The computation of the weight wik is detailed below.

Building a keyword user interest
Building the user interest starts by collecting a set of relevant documents Dr returned with respect to a query q related to a user's interaction. Each relevant document is represented as a vector of weighted terms, where the weight wij of term ti in document dj is computed using the TF-IDF weighting scheme:
$$w_{ij} = tf_{ij} \times \log\left(\frac{N}{n_i}\right) \quad (1)$$
where tfij is the frequency of term ti in document dj, N is the total number of documents and ni is the number of documents that contain term ti.
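The TF-IDF weighting of the relevant documents can be sketched as follows. This is a minimal illustration assuming the common form tf × log(N / n_i) with raw term frequencies and a natural logarithm; the function name and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)  # N: total number of documents
    # n_i: number of documents that contain term t_i
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: frequency of term t_i in document d_j
        vectors.append(
            {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
        )
    return vectors

# Toy collection (illustrative only)
docs = [["bayes", "network", "bayes"], ["network", "search"], ["search", "profile"]]
vectors = tfidf_vectors(docs)
```

Here "bayes" occurs twice in the first document and in no other document, so it receives the largest weight in that document's vector.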
The user interest ck is also represented as a weighted vector of the most relevant terms occurring in the documents judged relevant by the user. The weight wik of term ti in user interest ck is computed by formula (2), where N and R are respectively the total number of documents and the number of documents relevant to the query that belong to user interest ck, r is the number of relevant documents that contain term ti, and ni is the number of documents that contain term ti.
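To make the role of the four quantities N, R, r and n_i concrete, the sketch below builds a user interest vector with the Robertson-Sparck Jones relevance weight (with 0.5 smoothing). This is an assumption chosen because it uses exactly these quantities; it is not necessarily the weighting of formula (2). The function names, the `top_k` cut-off and the toy statistics are hypothetical.

```python
import math

def rsj_weight(r, n_i, R, N):
    """Robertson-Sparck Jones relevance weight with 0.5 smoothing (assumed form).

    r: relevant documents containing the term; n_i: documents containing the term;
    R: relevant documents for the query; N: documents in the collection.
    """
    return math.log(((r + 0.5) * (N - n_i - R + r + 0.5)) /
                    ((n_i - r + 0.5) * (R - r + 0.5)))

def build_user_interest(relevant_docs, doc_freq, n_docs, top_k=10):
    """relevant_docs: list of term sets judged relevant; doc_freq: {term: n_i}."""
    R = len(relevant_docs)
    terms = set().union(*relevant_docs)
    weights = {}
    for t in terms:
        r = sum(1 for d in relevant_docs if t in d)
        weights[t] = rsj_weight(r, doc_freq[t], R, n_docs)
    # Keep the top_k highest-weighted terms as the interest vector c_k
    best = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(best)

# Toy statistics (illustrative only)
relevant = [{"bayes", "network"}, {"bayes", "search"}]
interest = build_user_interest(
    relevant, {"bayes": 2, "network": 50, "search": 60}, n_docs=100, top_k=2
)
```

Terms that are concentrated in the relevant documents ("bayes") are weighted high, while terms that are common across the collection receive weights near or below zero and fall out of the vector.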

Bayesian belief network for search personalization
To improve the retrieval of relevant results that match the user's information needs, we present a personalized information retrieval approach that integrates the user profile in the retrieval process. Let us consider a submitted query q related to the user's interaction. Let D = {d1,…,dj,…,dn} be the set of documents in the collection, C_I = {c1,…,ck,…,cm} the set of user interests, and T = {t1,…,ti,…,tp} the set of index terms used to index these documents and user interests. Furthermore, documents, user interests and the query are modeled identically.
The relationship between user interests, documents and the query can be modeled as a Bayesian belief network, which provides an effective and flexible framework for modeling distinct sources of evidence in support of a ranking. We propose to extend the Bayesian belief network model proposed in [Ribeiro-Neto et al., 1996] by integrating the user profile, to provide a structure for representing a user's interaction and interpreting the query-document-user profile relevance as a belief in a document and in a user profile with respect to a query.
A Bayesian belief network is represented by a directed acyclic graph G(V, E), where the nodes V = T ∪ D ∪ C_I ∪ {q} correspond to the set of random variables and the set of arcs E ⊆ V × V represents conditional dependencies among them. Fig. 1 shows the topology of our belief network model for a user's interaction, where the term nodes represent the network roots.
Each term in the index terms, ti ∈ T, is modeled by a random variable ti ∈ {0, 1}. The event of "observing term ti" is noted ti = 1 or shortly ti. The complement event that "term ti is not observed" is noted ti = 0 or shortly $\bar{t}_i$. Let p be the number of index terms in the set T. There exist $2^p$ possible term configurations, represented by the set θ. A term configuration may represent a query, a document, or a user interest. It is represented by a vector of random variables $\vec{t} = (t_1, t_2, \ldots, t_p)$, where each variable indicates whether the corresponding term is observed. For example, an index of 2 terms t1 and t2 presents $2^2 = 4$ possible term configurations, represented by the set $\theta = \{(t_1, t_2), (t_1, \bar{t}_2), (\bar{t}_1, t_2), (\bar{t}_1, \bar{t}_2)\}$. The event of observing a particular term configuration is noted $\vec{t}$.
http://www.i-jes.org
Each document dj ∈ D is modeled by a random variable dj ∈ {0, 1} with two possible values, 0 or 1. The event dj = 1, simplified as dj, denotes that the document dj is observed. The event dj = 0, simplified as $\bar{d}_j$, denotes that document dj is not observed. A document dj is represented as a term configuration dj = (t1,t2,…,tp), where ti is a random variable indicating whether term ti is present in the document or not. Observing a document in a retrieval process means that this document is relevant to the query.
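The set of term configurations can be enumerated directly for a small index, as in the following minimal sketch. The function name is hypothetical, and the enumeration is only practical for a handful of terms because the number of configurations doubles with every added term.

```python
from itertools import product

def term_configurations(terms):
    """Yield every binary configuration over the index terms as a {term: 0/1} dict.

    1 means the term is observed, 0 means it is not observed.
    """
    for bits in product((1, 0), repeat=len(terms)):
        yield dict(zip(terms, bits))

# Two index terms give 2**2 = 4 configurations
configs = list(term_configurations(["t1", "t2"]))
```

For two terms, the configurations are produced in the order (t1, t2), (t1, not t2), (not t1, t2), (not t1, not t2), matching the example in the text.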
Each user interest ck ∈ C_I is modeled by a random variable ck ∈ {0, 1}. The event ck = 1, simplified as ck, denotes that the user interest ck is observed. The complement event that "user interest ck is not observed" is noted ck = 0 or shortly $\bar{c}_k$. A user interest ck is represented as a term configuration ck = (t1,t2,…,tp), where ti is a random variable indicating whether term ti is present in the user interest or not. Observing a user interest means that this user interest is related to the query q.
A user query q is represented by a random variable q ∈ {0, 1}. The two events of observing the query (q = 1) or not observing the query (q = 0) are noted q and $\bar{q}$, respectively. In our case, we are interested only in a positive instantiation of q. In the same way as documents and user interests, a query is represented as a term configuration q = (t1,t2,…,tp), where ti is a random variable indicating whether term ti is present in the query or not. To express conditional dependencies between random variables, three types of arcs are identified in the inference network model for search personalization: (1) Term to document: arcs joining term node ti ∈ T to document node dj ∈ D; (2) Term to user interest: arcs joining term node ti ∈ T to user interest node ck ∈ C_I; (3) Term to query: arcs joining term node ti ∈ T to the user's query node q. An arc exists whenever term ti belongs to document dj, to user interest ck or to the query q, respectively.
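The three arc types can be sketched as a simple edge set built from the term content of each node. The function name, the node identifiers and the use of the string "q" for the query node are assumptions for illustration.

```python
def build_arcs(doc_terms, interest_terms, query_terms):
    """Return the arcs (term, node) of the belief network.

    doc_terms / interest_terms: {node_id: set of terms}; query_terms: set of terms.
    Every arc goes from a term node (a root) to a document, interest or query node.
    """
    arcs = set()
    for doc_id, terms in doc_terms.items():        # (1) term to document
        arcs |= {(t, doc_id) for t in terms}
    for interest_id, terms in interest_terms.items():  # (2) term to user interest
        arcs |= {(t, interest_id) for t in terms}
    arcs |= {(t, "q") for t in query_terms}        # (3) term to query
    return arcs

arcs = build_arcs({"d1": {"t1", "t2"}}, {"c1": {"t2"}}, {"t1"})
```

Because arcs only ever leave term nodes, the resulting graph is acyclic by construction and the term nodes are its roots.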
We detail in what follows the query evaluation process for the proposed belief network.
iJES, Vol. 6, No. 4, 2018
Evaluation process: Intuitively, we can express the personalized retrieval problem as follows: given a query q, search personalization consists in ranking documents according to the information need and the user interest. In the network of Fig. 1, the ranking computation is based on interpreting the similarity between a document dj, a user interest ck and the query q as an intersection between dj, ck and q. To quantify the degree of intersection of the document dj and the user interest ck given the query q, we use the probability P(dj, ck | q). Thus, to compute a ranking, we use Bayes' law and the rule of total probabilities, as follows:
$$P(d_j, c_k \mid q) = \frac{P(d_j, c_k, q)}{P(q)} \quad (3)$$
As the denominator P(q) is a constant, we can use only the numerator in order to estimate the probability P(dj, ck | q). Thus, applying the rule of total probabilities over the term configurations $\vec{t} \in \theta$, formula (3) is computed as:
$$P(d_j, c_k \mid q) \propto P(d_j, c_k, q) = \sum_{\vec{t} \in \theta} P(d_j, c_k, q \mid \vec{t})\, P(\vec{t}) \quad (4)$$
The probability $P(\vec{t})$ corresponds to the likelihood of observing term configuration $\vec{t}$. We assume that all the configurations are independent and have an equal probability of being observed. Therefore, the probability P(dj, ck, q) is approximated with:
$$P(d_j, c_k, q) \propto \sum_{\vec{t} \in \theta} P(d_j, c_k, q \mid \vec{t}) \quad (5)$$
In the network of Fig. 1, instantiation of the root nodes separates the document nodes, the user interest nodes and the query node, making them mutually independent, which allows writing:
$$P(d_j, c_k, q \mid \vec{t}) = P(d_j \mid \vec{t})\, P(c_k \mid \vec{t})\, P(q \mid \vec{t}) \quad (6)$$
By substituting formula (6) in formula (5), the probability P(dj, ck, q) is estimated as:
$$P(d_j, c_k, q) \propto \sum_{\vec{t} \in \theta} P(d_j \mid \vec{t})\, P(c_k \mid \vec{t})\, P(q \mid \vec{t}) \quad (7)$$
We detail in what follows the computation of the conditional probabilities in formula (7).
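Before turning to the individual conditional probabilities, the sum over term configurations can be illustrated with a brute-force sketch: the score of a (document, user interest) pair is the sum, over every configuration, of the product of the three conditional probabilities, with the constant uniform prior dropped. The callables and their toy values are hypothetical, and the indicator used for the query conditional is an assumption borrowed from the classic belief network model, not a value stated in this paper.

```python
from itertools import product

def joint_score(p_doc, p_interest, p_query, n_terms):
    """Sum P(d|t) * P(c|t) * P(q|t) over all 2**n_terms term configurations.

    The uniform prior P(t) is a constant factor and is dropped.
    """
    return sum(
        p_doc(t) * p_interest(t) * p_query(t)
        for t in product((0, 1), repeat=n_terms)
    )

# Hypothetical toy conditionals over two index terms (illustrative values only)
query_config = (1, 0)

def p_query(t):
    # Assumed indicator: 1 only for the configuration equal to the query
    return 1.0 if t == query_config else 0.0

def p_doc(t):
    return 0.8 if t[0] == 1 else 0.1

def p_interest(t):
    return 0.5 if t[0] == 1 else 0.2

score = joint_score(p_doc, p_interest, p_query, n_terms=2)
```

With an indicator for the query conditional, only the configuration that matches the query contributes, so the exponential sum collapses to a single product in practice.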

Probability $P(d_j \mid \vec{t})$
The probability that document dj is generated by term configuration $\vec{t}$ is estimated as the similarity between the document dj and term configuration $\vec{t}$. As described in [Ribeiro-Neto et al., 1996], a Bayesian network can be used to represent the rankings generated by any of the classic models. For instance, a Bayesian network can be used to compute the vector space model ranking. So, the similarity between the document dj and term configuration $\vec{t}$ is interpreted as an intersection between document dj and term configuration $\vec{t}$. Then $P(d_j \mid \vec{t})$ is computed from the term weights wij and wit, which denote respectively the weight of term ti in document dj and in term configuration $\vec{t}$.
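One way to instantiate this similarity under the vector space model is the cosine between the two weighted term vectors. The sketch below is an assumption for illustration: the exact normalization used in the paper's formula is not recoverable from the text, and the function name and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector shares nothing with any other vector
    return dot / (norm_u * norm_v)

# Toy document and term configuration (illustrative weights only)
doc = {"bayes": 2.0, "network": 1.0}
config = {"bayes": 1.0}
similarity = cosine(doc, config)
```

The same function applies unchanged to a user interest vector, since documents, user interests and queries are all modeled as weighted term vectors.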
Probability $P(c_k \mid \vec{t})$
Analogously, the probability that user interest ck is generated by term configuration $\vec{t}$ is interpreted as the similarity between the user interest ck and term configuration $\vec{t}$.
The probability $P(c_k \mid \vec{t})$ is then computed analogously from the term weights, where wik denotes the weight of term ti in user interest ck. Given these probabilities, formula (7) becomes formula (11), which includes a factor that is constant for a given document and user interest. Ignoring this factor, formula (11) is rewritten as formula (12). We can use an m×n matrix X, noted Xm,n (m and n indicate respectively the number of user interests and documents), to represent the resulting probabilities for each instantiation of document dj and user interest ck. It is defined as:
$$X_{m,n} = \begin{pmatrix} P_{11} & \cdots & P_{1j} & \cdots & P_{1n} \\ \vdots & & \vdots & & \vdots \\ P_{k1} & \cdots & P_{kj} & \cdots & P_{kn} \\ \vdots & & \vdots & & \vdots \\ P_{m1} & \cdots & P_{mj} & \cdots & P_{mn} \end{pmatrix}$$
where Pkj denotes the probability P(dj, ck | q) of relevance of user interest ck and document dj for a given query q. We consider that the most likely user interest, noted $c^{*}$, given a query q, is selected as follows:
$$c^{*} = \arg\max_{c_k \in C_I} \bar{P}_k \quad (13)$$
where $\bar{P}_k = \frac{1}{n} \sum_{j=1}^{n} P_{kj}$ represents the arithmetic mean of the set {Pk1,…,Pkj,…,Pkn} for a given user interest ck.
Therefore, for a given query q and the selected (most likely) user interest, the probabilities Pkj in the corresponding row of matrix Xm,n are used to output a ranked list of documents.
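The selection and ranking step can be sketched as follows: compute the arithmetic mean of each row of the matrix, pick the user interest with the highest mean, and sort the documents by the probabilities in that row. The function name and the toy matrix values are hypothetical.

```python
def select_interest_and_rank(X, doc_ids):
    """X[k][j] holds P_kj for user interest k and document j.

    Returns (index of the selected user interest, document ids ranked by P_kj).
    """
    # Arithmetic mean of each row, one per user interest
    means = [sum(row) / len(row) for row in X]
    k_best = max(range(len(X)), key=lambda k: means[k])
    # Rank documents by their probability under the selected interest
    ranked = sorted(zip(doc_ids, X[k_best]), key=lambda pair: pair[1], reverse=True)
    return k_best, [doc_id for doc_id, _ in ranked]

# Two user interests, three documents (illustrative values only)
X = [[0.10, 0.40, 0.20],
     [0.30, 0.05, 0.10]]
k_best, ranking = select_interest_and_rank(X, ["d1", "d2", "d3"])
```

In this example the first user interest has the higher row mean, so its row determines the final document order.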

Experimental Evaluation
Our experiments have two main objectives. The first one is to compare the performance of our search personalization approach to the personalized approach proposed in [Daoud et al., 2009]. The second one is to evaluate the impact of user profile on the search results by comparing our personalized search approach to a baseline search information process that does not consider the user profile.

Evaluating the effectiveness of our personalized approach
Our purpose is to compare the performance of our search personalization approach to the approach proposed in [Daoud et al., 2009]. We recall that in our approach, the user profile is integrated in the retrieval process by interpreting the query-document-user profile relevance as a belief in a document and in a user profile with respect to a query. In the approach of [Daoud et al., 2009], personalization consists of re-ranking the search results by combining the query-document score and the profile-document score.
The experiments were carried out on the TREC data set from disks 1 & 2 of the TREC ad hoc collections AP88 (Associated Press News, 1988) and WSJ90-92 (Wall Street Journal, 1990-92). The collections contain 741670 documents, together with queries and relevance judgments. We particularly tested the queries among q51 − q100.
The choice of this test collection is due to the availability of a manually annotated domain for each query. This allows us to simulate user interests changing over different TREC domains. We used the same domain categorization as [Daoud et al., 2009].
Experimental design and results:
The evaluation simulates the process of building user interests, based on an N-fold cross validation strategy [Mitchell 1997], as follows: for each TREC domain, the query set is divided into N subsets. We repeat the experiments N times, each time using a different subset as the test set and the remaining N−1 subsets as the training set.
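The N-fold protocol can be sketched as below. The round-robin assignment of queries to folds and the function name are assumptions; the paper does not state how queries are distributed across the N subsets.

```python
def n_fold_splits(queries, n_folds):
    """Yield (training set, test set) pairs for N-fold cross validation.

    Queries are assigned to folds round-robin (assumed assignment).
    """
    folds = [queries[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # Training set: every fold except the current test fold
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield train, test

# Hypothetical domain with ten query identifiers
queries = list(range(51, 61))
splits = list(n_fold_splits(queries, n_folds=5))
```

Each query appears in exactly one test set, so every query is evaluated once with a user interest learned only from the other queries of its domain.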
For each query in the training set, the top 1000 documents are first returned by the BM25 model provided by the terrier-3.5 platform. An automatic process then uses the returned top documents that are listed in the assessment file (qrels) provided with the TREC collections to generate the user interest vector of weighted terms, using formula (2).
Then, for each query in the test set, an automatic evaluation process (cf. section 3.3.1) generates the matrix giving the relevance scores of documents and user interests. Table 2 shows the percentage of improvement of our approach compared to the approach of [Daoud et al., 2009], computed at P5, P10 and MAP (Mean Average Precision) and averaged over the queries belonging to the same domain. We notice that our approach gives higher performance than the approach of Daoud et al. (2009) for most of the queries in all domains at P5, P10 and MAP. Based on the overall evaluation results, we conclude that integrating the user profile in the matching model of the retrieval process, by computing the query-document-user profile relevance, improves the search more than re-ranking the search results of a given query using the user profile, as done in [Daoud et al., 2009].
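The evaluation measures used above follow their standard definitions, which can be sketched as follows. The function names and the toy ranking are hypothetical; in practice these measures are usually computed with a standard evaluation tool rather than by hand.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant (P5, P10, ...)."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy run for one query (illustrative only)
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3"}
p5 = precision_at_k(ranking, relevant, 5)
ap = average_precision(ranking, relevant)
```

The percentage of improvement reported in the tables is then the relative difference between the measure obtained by the personalized run and the one obtained by the compared run.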

Evaluating the impact of user profile on the search results
The goal of this experiment is to evaluate the system performance when the user profile is introduced in the search process. We compare our approach to the baseline BM25 model [Robertson et al., 1998] provided by the terrier-3.5 platform, which uses only the query and ignores any user profile.
We use a TREC 2011 Track collection. It consists of the clueweb09_English1 collection of documents and includes relevance judgments and 61 main queries (topics). Each topic has a number of subtopics, distributed as follows: 202 interaction queries and 75 current queries. The interaction queries and the current query are a sequence of reformulations of the main query. Table 3 shows the statistics of the test collection.

Experimental design and results:
The evaluation scenario we adopted is the following:
Inferring user profile: For each main query, the top 1000 documents are first returned by the BM25 model provided by the terrier-3.5 platform. An automatic process then uses the returned top documents that are listed in the assessment file (qrels) provided by TREC to generate the user interest vector of weighted terms, using formula (2). The vector represents the user interest.
Personalization process: It consists of ranking the search results of a current query by using the user profile. We present in Table 4 the precision improvement obtained by our approach, which introduces the user profile, compared to the baseline BM25 model [Robertson et al., 1998], which uses only the query and ignores any user profile, at P5, P10, P20 and MAP averaged over the current queries. We notice that our approach gives higher performance than the BM25 model at P5, P10 and P20. More particularly, our approach brings an improvement of 43.44% in

Discussion
Our research work addresses how to build and how to exploit a user profile in the search process to produce better result rankings. Our intuition is based on the assumption that the search system provides the probability that a document is relevant to a user query; the goal is to estimate this probability by taking the user profile into account. For this purpose, our user profile is modeled by the user's general interests learned across his/her interactions with the retrieval system. Following this general view, our approach can be distinguished from others in the personalized search community by two features. The first one concerns the construction of the user profile and the second one concerns its integration in the search process.
In our approach, the user profile is modeled by the user's interests, represented as weighted vectors of terms. We consider the relevant documents selected by the user during his/her interactions with the retrieval system as the data source used to build his/her interests. Then, to estimate the relevance of a document, we use a Bayesian approach for the matching measure, integrating the user profile as a separate component in the retrieval function. In contrast, in [Gauch et al., 2003] and [Daoud et al., 2009], the user profile is represented by a list of concepts issued from an external data source, namely a domain ontology, and personalization combines the original score between the document and the query with the score between the document and the user profile [Daoud et al., 2009]. The main idea behind our representation is to model the user profile as a weighted vector of terms and to incorporate it in the query-document matching model, where the query and the document are also represented as vectors of terms, using a probabilistic approach.

Conclusion
In this paper, we have explored our approach for the user profile representation and its integration in personalized search. It consists of two basic steps: (1) inferring the user interest from the user's interaction and (2) incorporating the user profile in the matching model of the retrieval process. The user profile refers to the user's interests built across his/her interactions. To integrate the user interest in the search process, we use a Bayesian network to represent the user's interaction.
To evaluate the performance of our approach, we have conducted two experiments, using standard test collections in order to allow accurate comparative evaluation. First, to evaluate the effectiveness of our personalized search approach, we used the TREC ad hoc collections and compared our approach to the approach of Daoud et al. (2009). In our approach, we integrate the user profile in the matching model by interpreting the query-document-user profile relevance as a belief in a document and in a user profile with respect to a query. In the approach of Daoud et al. (2009), personalization consists of re-ranking the search results of a given query using the user profile. Our experimental evaluation shows an improvement in personalized retrieval effectiveness compared to the approach of Daoud et al. (2009). Second, to evaluate the impact of the user profile on the search results, we used the clueweb09_English1 test collection and compared our approach to the baseline BM25 model of the Terrier-3.5 platform, which uses only the query and ignores any user profile. The obtained results show that our approach gives higher performance than the BM25 model.
As future work, we plan to exploit the evolution of the user profile to improve the system performance for recurring queries, and then to conduct experiments evaluating the impact of introducing the user profile in personalizing search results by comparing our approach to another personalized approach.