A Chatbot as a Natural Web Interface to Arabic Web

In this paper, we describe a way to access an Arabic Web Question Answering (QA) corpus using a chatbot, without the need for sophisticated natural language processing or logical inference. Any Natural Language (NL) interface to a Question Answering (QA) system is constrained to reply with the given answers, so there is no need for NL generation to recreate well-formed answers, or for deep analysis or logical inference to map user input questions onto this logical ontology; a simple (but large) set of pattern-template matching rules will suffice. In previous research, this approach worked properly with English and other European languages. In this paper, we examine how the same chatbot behaves with an Arabic Web QA corpus. Initial results show that 93% of answers were correct, but because of several characteristics of the Arabic language, rephrasing Arabic questions into other forms may lead to no answers.


I. INTRODUCTION
Human-computer interfaces are created to facilitate communication between humans and computers in a user-friendly way. For instance, information retrieval systems such as Google, Yahoo, and AskJeeves are used to remotely access and search a large information system based on keyword matching and document retrieval. However, with the tremendous amount of information available via web pages, what a user really needs is an answer to his/her request instead of documents or links to these documents. This is what a question answering system does. A question answering (QA) system accepts a user's question in natural language, then retrieves an answer from its knowledge base rather than "full documents or even best-matching passages as most information retrieval systems currently do" [1]. QA systems are classified into two categories [2]: open-domain QA and closed-domain QA. Closed-domain question answering systems answer questions in a specific domain such as medicine, education, or weather forecasting. In contrast, open-domain question answering systems answer questions about everything and rely on general ontology and world knowledge. In recent years, "the combination of the Web growth and the explosive demand for better information access has motivated the interest in Web-based QA systems" [3]. Katz et al. [4] addressed three challenges that QA developers face in providing right answers: "understanding questions, identifying where to find the information, and fetching the information itself". To understand questions and retrieve correct answers, QA systems use different NLP techniques: some use a support vector machine to classify questions and an HMM-based named entity recognizer to obtain the right answer [5]; others use surface patterns to extract important terms from questions, construct the terms' relations from sentences in the corpus, and then use these relations to filter appropriate answer candidates [6].
In contrast to English and other European languages, the Arabic language suffers from a shortage of NLP resources and tools. In this paper, we use an Arabic QA corpus to retrieve answers to questions without the need for sophisticated NLP, through an interface that fools users into thinking that they are speaking to a real human: a chatbot.
A chatbot is a conversational software agent that interacts with users using natural language. The idea of chatbot systems originated at the Massachusetts Institute of Technology [7], where Weizenbaum implemented the ELIZA chatbot to emulate a psychotherapist. After that, Colby developed PARRY [8] to simulate a paranoid patient. Colby [8] "regarded PARRY as a tool to study the nature of paranoia, and considered ELIZA as a potential clinical agent who could, within a time-sharing framework, autonomously handle several hundred patients an hour." Nowadays, several chatbots are available online and are used for different purposes, such as: MIA, a German advisor on opening a bank account; Sanelma, a fictional female character to talk with in a museum, who provides information related to a specific piece of art; Cybelle; and AskJeeves, a web-based search engine.
The remainder of this paper describes our ALICE/AIML architecture in section II. Arabic information retrieval tools and the characteristics of the Arabic language are described in section III. The Arabic QA corpus that is used to retrain ALICE and the adapted program are described in sections IV and V respectively. Results and conclusions are discussed in sections VI and VII respectively.

II. ALICE/AIML CHATBOT ARCHITECTURE
AIML consists of data objects called AIML objects, which are made up of units called topics and categories, as shown in figure 2. The topic is an optional top-level element; it has a name attribute and a set of categories related to that topic. Categories are the basic unit of knowledge in AIML. Each category is a rule for matching an input and converting it to an output, and consists of a pattern, which represents the user input, and a template, which gives the ALICE robot answer. The AIML pattern is simple, consisting only of words, spaces, and the wildcard symbols _ and *. The words may consist of letters and numerals, but no other characters. Words are separated by a single space, and the wildcard characters function like words. The pattern language is case invariant. The pattern matching technique is based on finding the best, longest pattern match. Recursive categories are those with templates having <srai> and <sr> tags, which refer to simply recursive artificial intelligence and symbolic reduction. Recursive categories have many applications: symbolic reduction, which reduces complex grammatical forms to simpler ones; divide and conquer, which splits an input into two or more subparts and combines the responses to each; and dealing with synonyms, by mapping different ways of saying the same thing to the same reply. In the synonym case, the input is mapped to another form, which has the same meaning.
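The synonym-handling use of recursive categories can be illustrated with a small Python sketch. It is not the ALICE interpreter: the `CATEGORIES` table, the `SRAI:` prefix convention, and the `respond` function are hypothetical names chosen for illustration, and only exact (atomic) patterns are handled.

```python
# Toy model of AIML categories: pattern -> template.
# A template starting with "SRAI:" stands in for <srai>...</srai>:
# the rest of the template is fed back in as a new input.
# All entries below are made-up examples, not taken from ALICE.
CATEGORIES = {
    "HELLO": "Hi there!",
    "HI": "SRAI:HELLO",          # synonym redirected to HELLO
    "GOOD MORNING": "SRAI:HI",   # chained redirection
}


def normalize(text: str) -> str:
    """Uppercase and collapse whitespace, mimicking case-invariant patterns."""
    return " ".join(text.upper().split())


def respond(user_input: str, max_depth: int = 10) -> str:
    """Return the template for an input, following SRAI redirections."""
    pattern = normalize(user_input)
    for _ in range(max_depth):
        template = CATEGORIES.get(pattern)
        if template is None:
            return ""  # no matching category
        if not template.startswith("SRAI:"):
            return template
        # Recursive step: match again using the redirected form.
        pattern = normalize(template[len("SRAI:"):])
    raise RecursionError("too many SRAI redirections")
```

With this table, `respond("hi")` and `respond("Good morning")` both end at the template stored under `HELLO`, which is the behaviour the synonym mapping is meant to achieve.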

B. ALICE/AIML PATTERN MATCHING
The AIML interpreter tries to match word by word to obtain the longest pattern match, as this is normally the best one. This behavior can be described in terms of the Graphmaster, as shown in figure 3. The Graphmaster is a set of files and directories, which has a set of nodes called nodemappers and branches representing the first words of all patterns and wildcard symbols. Assume the user input starts with word X and the root of this tree structure is a folder of the file system that contains all patterns and templates; the pattern matching algorithm uses a depth-first search technique:
• If the folder has a subfolder starting with underscore, then turn to "_/" and scan through it to match all words suffixed to X; if there is no match, then:
• Go back to the folder and try to find a subfolder that starts with word X; if so, turn to "X/" and scan for a match of the tail of X; if there is no match, then:
• Go back to the folder and try to find a subfolder that starts with the star notation; if so, turn to "*/" and try all remaining suffixes of the input following X to see if one matches.
If no match is found, change directory back to the parent of this folder and put X back on the head of the input. When a match is found, the process stops, and the template that belongs to that category is processed by the interpreter to construct the output.
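The search order described above ("_" first, then the exact word, then "*") can be sketched in Python with nested dictionaries standing in for the folders. This is a minimal illustration under stated assumptions, not the ALICE implementation: `add_pattern`, `match`, `respond`, and the `TEMPLATE` key are hypothetical names, and each wildcard is assumed to absorb one or more words, shortest span first.

```python
TEMPLATE = "<template>"  # hypothetical key marking the end of a stored pattern


def add_pattern(root: dict, pattern: str, template: str) -> None:
    """Insert a pattern into the tree, one nested node per word."""
    node = root
    for word in pattern.upper().split():
        node = node.setdefault(word, {})
    node[TEMPLATE] = template


def match(node: dict, words: list):
    """Depth-first search trying "_", then the exact word, then "*"."""
    if not words:
        return node.get(TEMPLATE)
    head, tail = words[0], words[1:]
    for key in ("_", head, "*"):
        child = node.get(key)
        if child is None:
            continue
        if key in ("_", "*"):
            # A wildcard absorbs one or more words; try every split point.
            for i in range(1, len(words) + 1):
                found = match(child, words[i:])
                if found is not None:
                    return found
        else:
            found = match(child, tail)
            if found is not None:
                return found
    # Returning None hands control back to the parent node (backtracking).
    return None


def respond(root: dict, text: str):
    """Match a raw input string against the pattern tree."""
    return match(root, text.upper().split())
```

Because the underscore branch is tried before the exact word, a pattern such as `_ PLEASE` wins over `WHAT IS YOUR NAME` only when the input really ends in "please"; otherwise the search backtracks and continues with the exact-word branch.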

III. STATE OF THE ART
A. Arabic Language Characteristics
The Arabic language belongs to the family of Semitic languages, which differ from Indo-European languages semantically, syntactically, and morphologically. The Arabic alphabet is composed of 25 consonants and three long vowels, which are written from right to left and take different shapes according to their position in the word. In addition, Arabic has short vowels (diacritics) that appear in written text above or beneath letters and affect the pronunciation and meaning of the word [9]. According to diacritics usage, written Arabic text can be classified into Classical Arabic and Modern Standard Arabic (MSA). Classical Arabic is vowelized and is used in religious texts that require strict obedience to pronunciation rules, such as the Qur'an, the holy book of Islam, and some scripts of Al-Hadith (teachings of Prophet Mohammad (PBUH)), as well as in classical poetry, in children's literature, and in ordinary text when it is ambiguous to read [10,11]. For example, the Arabic word "علم", which is composed of three letters, can be ambiguous without vowels, as shown in table 1.
MSA is the language of the media, newspapers, books, and magazines, and is also used as a communication medium between different Arab nationalities. A third category of the Arabic language is Colloquial Arabic dialects, the spoken language in the different Arab countries, where each country has its own dialect. Colloquial Arabic dialects are used in informal settings and between friends [10]. Arabic is a derivative language where "most Arabic words are derived from a root, generally composed of three consonants; occasionally the root can be also formed of two, four or rarely five consonants" [12]. According to [11], Arabic words are classified into three categories:
• Original Arabic words, which include Arabic verbs and nouns that are formed according to Arabic derivation rules;
• Fixed Arabic words, which include words that do not follow derivation rules; these words were modeled by Arabs in ancient times;
• Arabized words, which include nouns that are taken from foreign languages and have become common among Arab people.
Even though Arabic is an international language, rivaling English in number of mother-tongue speakers [13], progress in Arabic Natural Language Processing is slower than in English and other European languages, for the following reasons [14, 15]:
• Orthographic variations are prevalent in Arabic;
• Arabic has a very complex morphology;
• Arabic words are often ambiguous due to the tri-literal root system;
• Synonyms are widespread, perhaps because variety in expression is appreciated as part of a good writing style;
• Broken plurals are common;
• The absence of diacritics in the written text creates ambiguity, and therefore complex morphological rules are required to identify the tokens and parse the text;
• The writing direction is from right to left, and some of the characters change their shapes based on their location in the word;
• Capitalization is not used in Arabic, which makes it hard to identify proper names, acronyms, and abbreviations.

B. Arabic Information Retrieval
In 2002, Arabic-speaking Internet users numbered about 4.4 million, which is about 1.5% of the Arab population [16]. Day by day, the number of Arab Internet users increases, and the volume of Arabic online information also increases, which necessitates Arabic information retrieval (AIR) systems to facilitate access to this information. Abdelali et al. [12] classified Arabic IR into two categories:
• Full-form-based IR, which is used in commercial systems such as Yahoo, Google, and Ayna.
• Morphology-based IR. These systems use different NLP techniques to improve Arabic IR. These techniques involve part-of-speech taggers [17,18], thesauri [14], ontologies [19], and light stemmers [20,21].
The lack of Arabic IR tools is due mainly to two reasons [10]:
• Complexity of the Arabic language;
• Lack of adequate resources (corpora, morphological analyzers, lexicons, part-of-speech taggers, etc.).
Some researchers try to use what is known as Cross Language Information Retrieval (CLIR), which allows a user to enter a query in his own natural language and obtain documents in one or several other languages [22]. Different approaches have been used to tackle the Arabic CLIR problem, as listed below [10]:
• Machine translation and dictionary-based approach, where input queries are translated using a dictionary into the language in which documents may be found. This technique was adopted by Aljlayl et al. [22] to build an Arabic-English IR system. Diekema et al. [23] built an English-Arabic CLIR system that takes an English query and returns documents in Arabic; at the same time, those Arabic documents are automatically translated into English to facilitate reading for English-speaking analysts.
• Transliteration/transcription approach, where the characters of an alphabetical or syllabic script in the input query are converted to the characters of a conversion alphabet. This technique was used by AbdulJaleel and Larkey [24] to build an Arabic-English IR system.
• Latent semantic indexing (LSI) approach, where automatic statistical algorithms are used to improve the retrieval process, as in Abdelali et al. [25].
• Corpus-based approach: Semmar and Fluhr [26] describe a new approach to align Arabic-French sentences retrieved from a parallel corpus based on a CLIR system. In this approach, a database of sentences of the target text is created, and each sentence of the source text is considered as a query to that database.

iJET - Volume 6, Issue 1, March 2011
Generally speaking, IR systems return the most relevant documents according to the user request [27]. This is not always sufficient in this electronic age: sometimes what users really need is a specific answer instead of a set of relevant documents. The goal of a Question Answering (QA) system is "to provide inexperienced users with a flexible access to the information allowing for writing a query in natural language and obtaining a precise answer" [28]. Most QA systems are developed for English as the target language; so far, few QA systems have been developed for Arabic, and these also rely on sophisticated NLP and machine translation, such as AQAS [29], LMRA [30], QARAB [15], [28], and [11].
In this paper, we examine accessing an Arabic Web QA corpus using the simple pattern matching technique of the ALICE chatbot, without the need for sophisticated NLP.
IV. USING WEB ARABIC QA TO RETRAIN ALICE
The ALICE chatbot was originally created for chatting and entertainment. In order to find other useful applications for ALICE, a Java program that converts a text corpus to the AIML chatbot language model format was developed.
The program generates two files: an atomic file and a default file. The atomic file holds the same questions and answers as they appear in the corpus, where the pattern represents the question and the template represents the answer. Since we cannot guarantee that the user will enter a question exactly as it is stored in the ALICE knowledge base, the default file was built using the first word and most significant word approaches.
The first word acts as a question classifier: for example, "when" differs from "who" or "what". The most significant word is the least frequent word in the question, which has the highest information content. For example, when you ask "what is your name?", the least frequent word is "name", and the answer is generated accordingly.
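The first word and most significant word idea can be sketched in a few lines of Python. This is only an illustration: the function name `first_and_most_significant` and the frequency counts in the usage note are hypothetical, and the original program was written in Java.

```python
def first_and_most_significant(question: str, freq: dict) -> tuple:
    """Return (first word, least frequent word) for one question.

    `freq` maps each word to its count in the question corpus; words
    missing from `freq` are treated as having a count of zero.
    """
    words = question.lower().replace("?", " ").split()
    first = words[0]
    # min() keeps the earliest word when several share the lowest count.
    most_significant = min(words, key=lambda w: freq.get(w, 0))
    return first, most_significant
```

For example, with made-up counts `{"what": 120, "is": 300, "your": 80, "name": 5}`, the question "what is your name?" yields `("what", "name")`: "what" classifies the question and "name" carries most of the information.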
We previously adapted this Java program to the Qur'an, the holy book of Islam [31]. The generated chatbot accepts an Arabic question related to Islamic issues, and the answers are verses from the Qur'an that match some keywords. However, because the Qur'an is by nature a monologue text rather than a set of questions and answers, evaluation of the Qur'an chatbot showed that most of the responses were not related to the question. In this paper, we extend our FAQ chatbot systems [32], previously generated in English and Spanish, to include Arabic QA.
For this purpose, we used different Web pages to build a small corpus consisting of 412 Arabic question-answer pairs and covering five domains:
• Mothers and pregnancy issues,
• Teeth care issues,
• Fasting and related health issues,
• Blood diseases such as cholesterol and diabetes,
• Blood charity issues.
To guarantee their correctness, the questions and answers were gathered not from users' forums but from the web pages of institutions such as medical centers and hospitals.
Several problems arose that are related to the QA format and structural issues, which necessitated some manual and automatic treatment, as follows. The questions in these sites were denoted using different symbols: stars, bullet points, numbers, and sometimes "س:", which means "Q:". To facilitate programming and unify these symbols, all questions were preceded with "Q:". Samples of those questions are presented in table 1.
Another problem was that some of these sources were in fact PDF files rather than web pages, which had to be converted into text files.
The answers to some questions were long and spread over many lines, which required a concatenation procedure to merge these lines together.

A. Module 1: Generating Atomic file
The first program generates the atomic file; the following steps are applied:
1. Reading the questions, which are denoted by "س:" ("Q:").
2. Normalizing the question by removing punctuation and unnecessary symbols.
3. Adding the question as a pattern.
4. Reading the answer, which comes on a separate line after the question mark.
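The steps above, together with the concatenation of multi-line answers, can be sketched in Python as follows. This is a hedged illustration rather than the original Java module: the function names are hypothetical, the marker defaults to "Q:", and "unnecessary symbols" is interpreted here as every Unicode punctuation or symbol character.

```python
import unicodedata
from xml.sax.saxutils import escape


def normalize(question: str) -> str:
    """Replace punctuation and symbol characters with spaces."""
    kept = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch for ch in question
    )
    return " ".join(kept.split())


def build_atomic_categories(lines, marker="Q:"):
    """Pair each marked question with its (possibly multi-line) answer."""
    categories = []
    question, answer_lines = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(marker):
            if question is not None:
                categories.append((question, " ".join(answer_lines)))
            question, answer_lines = normalize(line[len(marker):]), []
        elif question is not None:
            answer_lines.append(line)  # answer continues on this line
    if question is not None:
        categories.append((question, " ".join(answer_lines)))
    return categories


def to_aiml(categories) -> str:
    """Serialize (pattern, template) pairs as atomic AIML categories."""
    parts = ['<?xml version="1.0" encoding="UTF-8"?>', "<aiml>"]
    for pattern, template in categories:
        parts.append(
            "<category><pattern>%s</pattern><template>%s</template></category>"
            % (escape(pattern), escape(template))
        )
    parts.append("</aiml>")
    return "\n".join(parts)
```

Filtering by Unicode category rather than by an ASCII punctuation list means that the Arabic question mark "؟" and comma "،" are removed as well, while letters and diacritics are left untouched.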

B. Module 2: Generating the frequency list
The frequency list is created using the questions only, since the most significant words will be used within the questions. All questions denoted by <pattern> are read from the atomic file, and a file of these questions is generated. After that, a tokenization process is applied to obtain the lexical items and find their frequencies. As a result, a frequency list is created.
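A minimal Python sketch of this module is given below. It assumes that the atomic file is available as a string and that whitespace tokenization is sufficient; the name `frequency_list` is hypothetical.

```python
import re
from collections import Counter

# Captures the text of every <pattern> element in the atomic file.
PATTERN_RE = re.compile(r"<pattern>(.*?)</pattern>", re.DOTALL)


def frequency_list(aiml_text: str) -> list:
    """Return (word, count) pairs for the questions, most frequent first."""
    counts = Counter()
    for question in PATTERN_RE.findall(aiml_text):
        counts.update(question.split())  # whitespace tokenization
    return counts.most_common()
```

Only the patterns are counted, so words that occur in the answers do not influence which question words are considered significant.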

C. Module 3: Generating the default file
1. Reading the questions and extracting the two most significant words (content words only), which are the least frequent words.
2. Different categories are added to extend the chance of finding answers, as shown below:
o Build four categories using the most significant word (least 1) in four positions as patterns and the set of links it has as templates.
o Repeat the same process using the second-most significant word (least 2).
o Build four categories using the first word and the most significant word (least 1), where the most significant word is handled in four positions.
o Build two categories using most significant 1 and most significant 2, keeping the order of position as in the original question.
o Build a category using the first word, most significant word 1, and most significant word 2, where the template is a direct answer.
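The pattern side of these default categories can be sketched in Python as below. The text does not give the exact wildcard layout of each pattern, so every layout here is an assumption (keyword alone, at the start, at the end, and in the middle for the "four positions"), as are the function names and the choice to exclude only the first word when ranking content words. Templates are omitted.

```python
def four_positions(word: str, prefix: str = "") -> list:
    """Assumed layouts: keyword alone, at the start, at the end, in the middle."""
    head = f"{prefix} " if prefix else ""
    return [
        f"{head}{word}",
        f"{head}{word} *",
        f"{head}_ {word}",
        f"{head}_ {word} *",
    ]


def default_patterns(question: str, freq: dict) -> list:
    """Build the 15 default patterns for one normalized question."""
    words = question.split()
    first = words[0]
    # Candidate keywords: every distinct word after the first one.
    candidates = list(dict.fromkeys(words[1:]))
    if len(candidates) < 2:
        raise ValueError("need at least two candidate keywords")
    # Rank from least to most frequent; sorted() is stable on ties.
    ranked = sorted(candidates, key=lambda w: freq.get(w, 0))
    least1, least2 = ranked[0], ranked[1]
    # Keep the two keywords in their original order of appearance.
    a, b = sorted((least1, least2), key=words.index)
    patterns = four_positions(least1)                  # least 1, four positions
    patterns += four_positions(least2)                 # least 2, four positions
    patterns += four_positions(least1, prefix=first)   # first word + least 1
    patterns += [f"_ {a} * {b}", f"_ {a} * {b} *"]     # both keywords, in order
    patterns.append(f"{first} * {a} * {b}")            # first word + both keywords
    return patterns
```

Since a wildcard is assumed to stand for at least one word, these layouts do not cover every rephrasing (for example, two adjacent keywords); they only show how a handful of patterns per question widens the set of inputs that reach an answer.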
At the end of this stage, two files were generated: an atomic file and a default one. One of the default categories for the above atomic category is:
وهناك عوامل عديدة …… </template> </category>

VI. RESULTS AND EVALUATIONS
Before training ALICE with the generated AIML files, these files were converted into UTF-8 encoding so that the Arabic letters are recognized. For this purpose, two steps were taken:
1. All Arabic AIML files start with: <?xml version="1.0" encoding="UTF-8"?>
2. An online tool (Foxe2318) was used to convert the encoding into UTF-8.
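The two encoding steps can also be scripted. The Python sketch below is a stand-in for the conversion tool, not a description of it: the function name `to_utf8_aiml` is hypothetical, and the source encoding (Windows Arabic, `cp1256`) is an assumption, since the original encoding of the files is not stated.

```python
def to_utf8_aiml(raw: bytes, source_encoding: str = "cp1256") -> bytes:
    """Re-encode AIML bytes as UTF-8 and put a UTF-8 XML declaration first."""
    text = raw.decode(source_encoding).lstrip()
    if text.startswith("<?xml"):
        # Drop the existing declaration so that it cannot name another encoding.
        text = text[text.index("?>") + 2:].lstrip()
    declaration = '<?xml version="1.0" encoding="UTF-8"?>'
    return (declaration + "\n" + text).encode("utf-8")
```

Keeping the declaration and the actual byte encoding in agreement matters because an XML parser trusts the declared encoding when it decodes the Arabic patterns and templates.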
As a result, five versions of ALICE were generated to cover the five domains, as shown in table 3. Table 4 shows the number of categories generated from each WWW FAQ. In total, 5,665 categories were generated.
Fifteen questions were submitted to the generated versions; 93% of the answers were correct. A sample chat is shown in figure 4.
The same questions were submitted to Google and AskJeeves; the recall was 87% for both. However, because Google and AskJeeves return documents that hold the answers, we measured how easy it is to find the answers inside the documents, based on whether the correct document is the first one in the returned list and whether the answer is found at its beginning. In both search engines, 50% of the answers were found in files, where users need to search again within these files to find what they requested.

Figure 3. A Graphmaster that represents ALICE brain

So ALICE will pick a random answer from the list.

TABLE II. SAMPLES OF ARABIC QUESTIONS

TABLE IV. AIML CATEGORIES GENERATED FROM ARABIC WWW FAQS