Efficient Data Mining Model for Question Retrieval and Question Analytics Using Semantic Web Framework in Smart E-learning Environment

subhabrata.sengupta@iemcal.com

Abstract — In the field of information retrieval, the fundamental goal is to find relevant, and ideally the most relevant, information for a given query. The central difficulty has always been that the query domain is so large that retrieving relevant information efficiently is hard. Recent advances in deep learning and AI models, algorithms and applications, however, allow a smart, automated information retrieval mechanism to be paired with text analysis that determines a question's defining parameters, including its complexity and weight-age. Keeping in mind constraints such as CPU cost, efficiency, automation and consistency, we target our information retrieval framework specifically at the academic institutional domain, to serve the variety of institution-related queries. The aim is to build an efficient data mining and analytical model that automates question retrieval and analyses each question for complexity and weight-age.


Overview
The World Wide Web serves as an enormous, widely distributed, global information service centre for news, advertisements, consumer information, financial management, education, government, e-commerce and many other information services. With the explosive growth of information available on the Web, various intelligent web services have been developed to help users access relevant information. Web search has its roots in information retrieval (IR) [1]. Traditional IR assumes that the basic information unit is a document, and that a large collection of documents is available to form the text knowledge base. Retrieving information then simply means finding the set of documents relevant to the user's query. The set of documents is usually also ranked by relevance scores to the query [2]. The most commonly used query format is a list of keywords, also called "terms". IR differs from data retrieval in databases using SQL queries, because data in databases is highly structured and stored in relational tables, whereas information in text is unstructured [4].
Web personalization is a technique, a marketing instrument, and an art. Personalization requires implicitly or explicitly collecting visitor data and using that information in the content delivery framework to control what information is presented to users and how it is presented [6]. Web mining aims to discover useful information or knowledge from Web hyperlink structure, page content, and usage data. Although Web mining uses many data mining techniques, it is not purely an application of traditional data mining, because of the heterogeneous and semi-structured or unstructured nature of Web data.
The system aims to achieve a smart, automated question retrieval mechanism built on text processing and analysis techniques. Algorithms for key phrase and keyword extraction, topic classification and similarity comparison are used to build a relevant question bank from a seed set of questions. The acquired questions are then classified by their defining parameters, such as textual complexity, lexical complexity, difficulty and accuracy rate.
Our aim is to couple AI and deep learning algorithms for customized learning and personalized information retrieval with smart learning frameworks, such as semantic web technologies, to improve the e-learning system.

Background study
Novel features based on citation network information, used alongside traditional features for key phrase extraction, yield significant improvements in performance over strong baselines [7]. Kea is an algorithm for automatically extracting key phrases from text: it identifies candidate key phrases using lexical methods, computes feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good key phrases. The learning scheme first builds a prediction model from training documents with known key phrases, and then uses the model to find key phrases in new documents [8]. A document can be treated as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of key phrases. The GenEx algorithm was designed specifically for automatically extracting key phrases from text; experimental results support the claim that a hand-crafted algorithm (GenEx), incorporating specialised procedural domain knowledge, can produce better key phrases than a general-purpose algorithm (C4.5). Subjective human evaluation of the key phrases produced by GenEx suggests that about 80% of them are acceptable to human readers, a level of performance that should be adequate for a wide variety of applications [9]. Text summarization has emerged as a significant research area in the recent past, so a survey of existing work on the summarization process is useful for carrying out further research [10]. For elaborate questions, a summary of the question text is genuinely helpful for understanding its essence.
Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval; it enables text documents to be represented in a condensed way. A comprehensive study has compared base learning algorithms (Naïve Bayes, support vector machines, logistic regression and Random Forest) with five widely used ensemble techniques (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting) [11]. If the probability distribution of co-occurrence between a term and the frequent terms is biased towards a particular subset of frequent terms, then that term is likely to be a keyword; the degree of bias of a distribution is measured by the χ2 measure. This algorithm shows performance comparable to TF-IDF without using a corpus [12]. Graph-based methods are among the most efficient unsupervised approaches for extracting keywords from a single web text, but previous graph-based techniques rarely considered sentence importance. WS-Rank is a graph-based keyword extractor that brings sentences into the graph, where sentences are treated distinctly according to their importance [13]. System evaluation can be conducted in several ways, including comparison with human-annotated keywords using F-measure, a weighted score relative to oracle system performance, and a novel alternative human evaluation [14]. A supervised framework has also been proposed for extracting keywords from meeting transcripts, a genre fundamentally different from written text or other speech domains such as broadcast news.
Beyond the traditional frequency- or position-based cues, a variety of novel features have been investigated, including semantically motivated term-specificity features, sentence-related features, prosodic prominence scores, and a group of features derived from summary sentences [15]. The problem of automatic keyword extraction in the meeting domain, a genre fundamentally different from written text, has been tackled with a supervised framework using a rich set of features beyond the typical TF-IDF measures, such as sentence salience weight, lexical features, summary sentences and speaker information [16]. Extractive summarization techniques usually revolve around finding the most relevant and frequent keywords and then extracting sentences based on those keywords; manual extraction or annotation of relevant keywords is a tedious, error-prone process involving substantial manual effort and time [17]. The adoption of educational data mining (EDM) by higher education as an analytical and decision-making tool offers new opportunities to exploit the untapped data generated by student information systems (SIS) and learning management systems (LMS). One study describes a hybrid approach that uses EDM and regression analysis to analyse live video streaming (LVS) students' online learning behaviour and their performance in their courses: students' interaction and login frequency, as well as the number of chat messages and questions they submitted to their teachers, were examined alongside students' final grades [18].

Problem statement and proposed methodology
Information retrieval is the study of helping users find information that matches their information needs. In practice, IR covers the acquisition, organisation, storage, retrieval and distribution of information. Key phrases and important key points are extracted from a seed set of questions; matching against these, our mining model searches for and stores relevant questions. These questions are then filtered and clustered according to their complexity and difficulty. Text analysis algorithms are used to calculate lexical complexity, keywords and difficulty, and the questions are classified using clustering algorithms with a set of sample questions to determine weight-age, difficulty and related parameters.
Figure 1 shows the full-scale architecture of the proposed model. The collected set of questions on a particular topic is processed with textual analysis algorithms to extract the similar patterns, keywords and phrases that define that topic's questions. Based on these, the mining model searches for questions on the topic, comparing and analysing the phrases and patterns to build a relevant question set. The retrieval module uses the document index to retrieve documents that contain some of the query terms (such documents are likely to be relevant to the query), computes relevance scores for them, and then ranks the retrieved documents by score. The ranked documents are then presented to the user. The document collection, also called the text knowledge base, is indexed by the indexer for efficient retrieval. If this collective model is applied in the Unifying Logic Layer of the Semantic Web architecture, new query rules can be set over information retrieval.
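The index–score–rank loop described above can be sketched with a TF-IDF index and cosine-similarity relevance scores. This is an illustrative reconstruction, not the paper's implementation: the question bank and the query below are invented, and scikit-learn is assumed as the vectorisation library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical text knowledge base of previously collected questions.
question_bank = [
    "Define a binary search tree and state its lookup complexity.",
    "Explain the difference between supervised and unsupervised learning.",
    "What is the time complexity of quicksort in the worst case?",
]

# Indexer: build the document index over the collection.
vectorizer = TfidfVectorizer(stop_words="english")
index = vectorizer.fit_transform(question_bank)

# Retrieval module: score each document against the query, then rank.
query = "worst case complexity of quicksort"
scores = cosine_similarity(vectorizer.transform([query]), index).ravel()
ranked = sorted(zip(scores, question_bank), reverse=True)

for score, question in ranked:
    print(f"{score:.3f}  {question}")
```

Here ranking is simply a descending sort on the relevance score; a production system would also apply the topic-relevance filter described later.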

Question analytics and weight-age determination
After building the question collection from the retrieval system, the questions are passed through NLTK-based algorithms for analysis. Each question is assigned a category using classification algorithms (e.g. KNN), and its subjective complexity is then determined through keyword extraction.
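A minimal sketch of the KNN categorisation step, assuming scikit-learn; the training questions and the two category labels are invented for illustration and are not from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Tiny labelled sample; in the real system the labels would come from
# the seed question bank acquired during retrieval.
train_questions = [
    "What is a balanced binary tree?",
    "Explain hash table collision resolution.",
    "Derive the gradient descent update rule.",
    "What is overfitting in a neural network?",
]
train_labels = ["data-structures", "data-structures",
                "machine-learning", "machine-learning"]

# TF-IDF features + 1-nearest-neighbour classifier.
clf = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
clf.fit(train_questions, train_labels)

predicted = clf.predict(["How does gradient descent minimise a loss function?"])
print(predicted[0])
```

With n_neighbors=1 the new question inherits the category of its closest neighbour in TF-IDF space, which is the behaviour the pipeline above relies on.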
Lexical complexity is an indication of the readability of a text. Readability is affected by complex sentence formation and complex word combinations, and, in technical fields, by the use of technical terms. Many algorithms formulate lexical complexity from various parameters for different use cases, ranging from general text to technical, educational or healthcare material. The Automated Readability Index (ARI) is very straightforward to calculate: it considers only characters, words and sentences, leaving out debatable measurements such as "complex words", which makes it the most usable formulation for the technical domain:

ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43    (1)

The Dale-Chall readability score measures a text against the word knowledge of an average fourth-grader. On its scale, the more unfamiliar words used, the higher the reading level. In our case we can tweak the original formula so that, instead of counting complex words, we count technical keywords, and compute the score accordingly.
Depending on the domain of implementation, a combination of both indexes, with different weights, can be used to determine the readability index.
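The two indexes can be implemented directly from formula (1) and the standard Dale-Chall constants. The "technical keywords" substitution in the second function is the tweak proposed above; the technical-term set passed in is a hypothetical input, not something the paper specifies.

```python
import re

def ari(text):
    """Automated Readability Index: characters, words and sentences only."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    chars = sum(len(w) for w in words)
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / sentences) - 21.43

def dale_chall_technical(text, technical_terms):
    """Dale-Chall variant suggested in the text: the 'difficult word' count
    is replaced by a count of domain-specific technical keywords."""
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    pct_hard = 100 * sum(w in technical_terms for w in words) / len(words)
    score = 0.1579 * pct_hard + 0.0496 * (len(words) / sentences)
    # The standard Dale-Chall adjustment when difficult words exceed 5%.
    return score + 3.6365 if pct_hard > 5 else score

q = "Explain the amortised complexity of union-find with path compression."
print(ari(q))
print(dale_chall_technical(q, {"amortised", "complexity", "compression"}))
```

A domain-weighted combination of the two scores can then serve as the single readability index used in the weight-age formula.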
The weight-age of a question, treated as text, is determined as a combination of key phrase and keyword density, readability index, and the inverse of the accuracy on attempts.
*Since the accuracy index is a dynamic quantity that changes over time, the weight-age is also a dynamic quantity.
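One way to realise this combination is a weighted sum; the coefficients a, b, c below are illustrative assumptions, since the paper does not fix them.

```python
def question_weightage(keyword_density, readability, accuracy,
                       a=0.4, b=0.4, c=0.2):
    """Weight-age as a combination of keyword/key-phrase density,
    readability index, and the inverse of attempt accuracy.
    The coefficients a, b, c are hypothetical, not from the paper."""
    return a * keyword_density + b * readability + c * (1.0 / max(accuracy, 1e-6))

# Accuracy starts at 1 and is updated as candidates attempt the question,
# so the weight-age is recomputed dynamically.
initial = question_weightage(keyword_density=0.3, readability=8.2, accuracy=1.0)
later = question_weightage(keyword_density=0.3, readability=8.2, accuracy=0.5)
print(initial, later)
```

As accuracy drops, the inverse-accuracy term grows, so questions that prove harder in practice automatically gain weight-age.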

Semantic web architecture overview
Our principal concern is the layered architecture of the Semantic Web, arranged as a hierarchy. Each layer serves the layer above it and acts as a client to the layer beneath, forming an upward pyramid. Because this workflow is sequence-driven, it represents an abstraction of higher order: a change in a single layer affects the functionality of the other layers. The most eminent precedent is the International Standards Organization (ISO) Open Systems Interconnection (OSI) model [19]. The current Semantic Web architecture as designed by Tim Berners-Lee can be described as follows. The Unicode and URI layer provides a simplified means of identifying resources such as a website, an image, a document or a person: essentially any unit with an identity. Unicode, used for character representation in computers, is a universal standard encoding system: unlike other encoding systems, it represents all languages. The XML layer describes document content, while XML Schema provides lexical support for validated XML documents. Next comes the RDF and data interchange layer: URIs are used to identify web resources, and graph models describe relationships among resources. RDF Schema is a simplified modelling language that introduces classes of resources, properties and their interrelationships [20, 21]. The following layer comprises SPARQL (query), Ontology (OWL), RIF and RDF-S. Ontology, the foundation of the Semantic Web, provides machine-executable semantics and a shareable space for better human-to-application communication; its primary goal is to supply the semantics that give the web meaning, helping machines decode that meaning and supporting information sharing [22]. The RIF (Rule Interchange Format) specifies an XML format for rules compatible with RDF and OWL, at an intermediate level of expressive power according to the RIF Working Group [23].
The Unifying Logic Layer serves as a foundation for uniting the two previously mentioned layers into a whole, coordinating queries and drawing rules over data represented in RDF together with its associated ontologies and schemata. Various works in this area have focused on combining rules with embedded query facilities, mixing rules and syntax. The Proof Layer then validates particular statements. The Trust Layer depends on the information source, which can deny undesirable applications or users access to these sources; it establishes trust and confidence between data sources and groups of users. The UI and Application Layer is the interface users interact with, so it must satisfy both users and applications. The vertical layers, such as Crypto, are used for encryption and digital signatures; spanning the fifth to sixth layers, they establish the web of trust. XML signatures, applied to resource content, allow that content to be verified.

Experimental setup and flow

─ Step 1: A set of collected questions is processed to extract keywords and key phrases. Each question goes through textual pre-processing, such as stop-word removal and lemmatization. Keywords and key phrases are then extracted using NLTK algorithms.

─ Step 2: Using these keywords and key phrases as search tokens, along with the topic name, the mining model searches web pages for relevant questions. The class a question belongs to is detected using a KNN algorithm trained on the acquired set of questions, and topic relevance is measured as the average cosine similarity to the questions in that class.

─ Step 3: These questions are then processed with NLTK algorithms to extract keywords and key phrases. Their lexical complexity (readability index) is calculated and their accuracy is initialised to 1.
Their initial weight-age is then calculated using the weight-age formulation described earlier. The readability index is calculated according to the sector of application: the ARI algorithm for the technical domain, and DCRI for the general or medical domain. This makes it possible to monitor acceptability and accuracy and to maintain a dynamic drill set for the examinee.
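Step 1's pre-processing can be sketched as follows. The paper uses NLTK for these stages; to keep the sketch self-contained, the stop-word list and the suffix-stripping lemmatiser below are simplified stand-ins for NLTK's stopwords corpus and WordNetLemmatizer.

```python
import re
from collections import Counter

# Illustrative subset of English stop words (NLTK's list is much larger).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "is", "are", "what",
              "which", "how", "and", "to", "for", "between"}

def lemmatise(word):
    # Very rough suffix stripping in place of a real lemmatiser.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def keywords(question, top_n=3):
    """Tokenise, drop stop words, lemmatise, rank terms by frequency."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    terms = [lemmatise(t) for t in tokens if t not in STOP_WORDS]
    return [w for w, _ in Counter(terms).most_common(top_n)]

extracted = keywords("What are the differences between stacks and queues in parsing?")
print(extracted)
```

The extracted terms then serve as the search tokens of Step 2.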

Architecture & Influence of AI services with web based semantic concept
The goal is to design an intelligent, well-performing e-learning framework offering easy access to searched courses, modules and so on. Several important components must be developed and monitored, notably the data and theory-based knowledge that forms the core of the architecture. These components act as a repository where ontologies, metadata, inference rules, educational resources, course materials and user profiles are maintained. The metadata can be placed in the same document or kept externally in a separate data unit. Advantages of external storage, or externally stored components, include the ease of clearing the various metadata descriptions stored in the database, which helps in efficient memory management. Also, regarding the reusability of the same materials, users and authors take differing approaches, which allows for multiple descriptions according to different concepts.
The main component, however, is the GUI through which the client interacts: users connect to it, navigate through it, and obtain their desired queries. This is the Access Interface. Semantic search engines furnish the API with various strategies for directing queries and knowledge bases. The Inference Engine is an intelligent combination of facts that helps process and answer requests, and is responsible for deriving new facts from the knowledge already in its knowledge base. Services comprise the collection of predefined services provided to clients, such as viewing the details of a course, the course map and so on. In the services segment, efficient automated customised search can be incorporated using AI, where keyword extraction makes search results more efficient. NLP algorithms pre-process the search keywords to find and rank results more effectively. After pre-processing (elimination of stop words, lemmatisation, tokenisation and so on), the RAKE algorithm is used to extract the keywords whose significance exceeds a threshold. The keywords are then mapped to topics and course units, and from the selected cluster the cosine function helps produce the best results most efficiently.
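The RAKE step can be illustrated with a minimal implementation of its core idea: candidate phrases are the runs of words between stop words, and each phrase is scored by summing its words' degree-to-frequency ratios. This is a simplified sketch (real RAKE also splits at punctuation and uses a full stop-word list); the example sentence is invented.

```python
import re
from collections import defaultdict

# Illustrative subset of English stop words.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "for", "is",
              "are", "with", "on", "how", "what"}

def rake(text):
    words = re.findall(r"[a-z0-9]+", text.lower())

    # Candidate phrases: maximal runs of non-stop-words.
    phrases, current = [], []
    for w in words:
        if w in STOP_WORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    # Word scores: co-occurrence degree within phrases over frequency.
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)

    def score(phrase_words):
        return sum(degree[w] / freq[w] for w in phrase_words)

    # Phrases ranked by total score, highest first.
    return sorted((" ".join(p) for p in phrases),
                  key=lambda p: score(p.split()), reverse=True)

ranked_phrases = rake("Semantic web services map keywords to course units and topics.")
print(ranked_phrases)
```

A significance threshold over these phrase scores yields the keyword set that is then mapped to topics and course units.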

Conclusion
In the proposed system we have tried to build an end-to-end question retrieval and analytics model. The model is trained on a set of questions analysed through NLTK algorithms for key-point and key-phrase extraction. These key-points and key-phrases are ordered by relevance and importance, and based on them the mining model retrieves questions from web pages. The retrieved questions are then categorized and grouped into classes using classification algorithms, and passed through the analytics model for weight-age determination. This enables a dynamic examination system in which questions are shifted or changed according to the candidate's accuracy rate.

Future scope

The question analytics system currently determines weight-age only from the various complexity parameters, but in future more in-depth analytics can be conducted to explore further qualities of a question, such as categorizing it as subjective, numerical or philosophical. This will also help in automatic evaluation of examinations.

Authors
Mr. Subhabrata Sengupta obtained his Bachelor of Technology degree in Information Technology from Bengal Institute of Technology, India. He then obtained his Master's degree in the same field from Jadavpur University, India, and is currently pursuing his PhD in Computer Science, majoring in Machine Learning and Information Retrieval, at the University of Engineering and Management, Kolkata, India. He is currently a professor at the Faculty of Information Technology, Institute of Engineering & Management, Kolkata, India. His specializations are Machine Learning, Information Retrieval Systems, and smart and adaptive education systems. His current research interests are smart information retrieval systems, student and academic performance analysis, educational data mining, and predictive modelling for smart and adaptive education systems with applications of Machine Learning.
Mr. Anish Banerjee is pursuing his Bachelor of Technology (2017-2021) in Information Technology at the Institute of Engineering and Management, Kolkata, India. He worked as a Machine Learning Engineer and Team Lead at Analysed, Hyderabad, and as a Machine Learning Developer and consultant at Durbin Technologies, Kolkata, India, and is currently placed in Tata