Towards an Ontology Proposal Model in Data Lake for Real-time COVID-19 Cases Prevention

The coronavirus epidemic has now affected the lives of millions of people around the world. The threat posed by this virus keeps growing as new cases appear every day. Countries affected by the coronavirus are taking important measures against it, using Artificial Intelligence (AI) and Big Data technologies. According to the World Health Organization (WHO), AI and Big Data have played an important role in China's response to COVID-19, the disease caused by the new coronavirus. Predicting the emergence of an epidemic, from the appearance of the coronavirus to a person's predisposition to develop the disease, is fundamental to combating it. In this battle, Big Data is on the front line. However, Big Data alone cannot provide all of the expected insights or derive value from the data being manipulated. This is why we propose a semantic approach to facilitate the use of these data. In this paper, we present a novel approach that combines Semantic Web Services (SWS) with the characteristics of Big Data in order to extract significant information from multiple data sources, which can then be exploited to generate real-time statistics and reports.
Keywords: Big Data, Data Lake, Ontology, COVID-19, Data Warehouse, Semantic Web Services.


Introduction
The 2019-2020 coronavirus outbreak is a pandemic of an emerging infectious disease, called COVID-19, caused by the coronavirus SARS-CoV-2. It began in December 2019 in Wuhan, central China, and then spread all over the world [1]. On March 11, 2020, the WHO declared that the COVID-19 epidemic had become a pandemic, calling for essential protective measures against the new coronavirus, for action on the saturation of intensive care units in hospitals, and for stronger preventive hygiene (elimination of physical contact, kisses and handshakes; an end to gatherings, large events and unnecessary trips; application of quarantine, etc.). According to the WHO, the reproduction rate (the average number of individuals that an infectious person infects while contagious) is estimated at between 1.4 and 2.5 [2].
The international reaction was organized relatively quickly: suspected patients were confined and attempts were made to prevent or slow down the formation of new clusters of contagion. As of February 25, the number of new cases reported daily outside China was higher than in China itself [3,4]. In the first months of 2020, COVID-19 took on a global scale, causing serial cancellations of sporting and cultural events all over the planet, threatening the global economy, closing the borders of many countries, and causing a stock market crash in Europe and North America on March 12 [5,6].
Thanks to continuing advances in medicine, the death rate from COVID-19 has dropped, but the number of new cases is expected to increase. As we wage an ongoing battle against COVID-19, a powerful new weapon could well shift the balance. This weapon is Big Data.
As its name implies, Big Data is massive. It concerns data sets so large and complex that traditional IT tools cannot handle them, and whose collection, management and analysis have required the development of new technologies. Combining these data can reveal trends, connections and indications that can improve a multitude of things, from the profit margin of a company to the chances of survival of a person suffering from an illness, especially COVID-19.
Gathering Big Data quickly could thus allow health authorities to implement measures without delay where necessary. This type of initiative would have a good chance of success, given that mobile phones are used all over the world and that more and more people interact on social networks such as Twitter or Facebook, where published messages can be analyzed to learn more about the spread of a disease.
The use of Big Data can sometimes make it possible to detect an epidemic more quickly, particularly in regions where public health surveillance is limited [7]. The case of the Ebola virus in 2014 is an example [8]: the first public alert about Ebola was issued on March 14, 2014 by HealthMap (an automated alert system that collects disease-specific data). The WHO and the Ministry of Health in Sierra Leone published their notice on the possible spread of the Ebola virus in Sierra Leone on March 22, 2014. In 2010, during the earthquake in Haiti, the identification of those infected and the deployment of a cholera vaccine could have been facilitated by the use of digital data. Unfortunately, because the population to be vaccinated could not be identified, no vaccine was used during the early stages of the epidemic [9].
Today, several concerns about COVID-19 have prompted governments to ask diverse questions: how can persons contaminated by this virus be detected? How can the propagation of the disease be reduced and avoided? Can Big Data help governments to prevent COVID-19? This article aims to answer these questions and discusses the proposed architecture, which is based on a semantic layer. We also focus on that layer by presenting a case study of COVID-19 prevention. Our ontology system will be built by combining different data sources used by governments.
This paper is organized as follows. In the next section, we set out our research context, present the motivating scenario and goals of this work, and discuss related work. Our proposed Data Lake architecture and its layers are detailed in the third section. In the fourth section, we address the context of ontologies and the development methodology. Finally, we conclude and suggest future research.

Research context
For a good understanding of Big Data, it is useful to start with a definition. According to Gartner, Big Data brings together data of great variety, arriving in increasing volumes and at high velocity. These are called the three "V's" [10]. In other words, Big Data is made up of complex datasets, mostly from new sources. These datasets are so large that traditional data processing software simply cannot handle them. However, this huge amount of data can be used to solve problems that could never have been solved before. Storing data in a Data Lake without any data management is one of the challenges of Big Data. To address it, consistent knowledge must be extracted from such data.
Despite the overwhelming growth of data, users typically seek a unified view of the data available in the Data Lake [11]. As a result, integration issues are gaining increasing attention. Data integration aims to unify data that share certain common semantics but are obtained from unrelated sources [12]. This involves combining data to provide users with a uniform view. When working on data integration, the major issue is heterogeneity, which creates an interoperability problem when distributed systems have to work together. To tackle this issue, both structural and semantic heterogeneity must be addressed [13].
Ontologies offer a solution to the problems of heterogeneity. They have been widely used in data integration systems because they present an explicit and machine-understandable conceptualization of a domain [14]. They provide a semantic model of the data sets being integrated. Ontologies provide a common vocabulary for a particular domain and specify the meaning of terms, together with their dependencies, at different levels of formalization. The aim of ontologies is to provide insight into a specific domain and a shared representation for reuse [15].
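As a minimal sketch of how a shared vocabulary can resolve semantic heterogeneity, the Python fragment below maps records from two sources onto common terms. All field names, the vocabulary itself and the sample values are hypothetical and are not taken from the paper's data models; the unit conversion is included only to show that two fields can differ in meaning as well as in name.

```python
# Hypothetical shared vocabulary: every source field maps to one common term,
# optionally with a unit conversion. None of these names come from the paper.

def fahrenheit_to_celsius(value):
    """Convert a temperature and round to one decimal place."""
    return round((value - 32.0) * 5.0 / 9.0, 1)


def identity(value):
    """Leave the value unchanged."""
    return value


SHARED_VOCABULARY = {
    "hospital": {
        "patient_name": ("person_name", identity),
        "temp_c": ("body_temperature_celsius", identity),
    },
    "airport": {
        "traveller": ("person_name", identity),
        "temp_f": ("body_temperature_celsius", fahrenheit_to_celsius),
    },
}


def to_shared_terms(source, record):
    """Rename fields and convert units so records from different sources agree."""
    mapping = SHARED_VOCABULARY[source]
    unified = {}
    for field_name, value in record.items():
        if field_name in mapping:  # fields without a mapping are ignored
            term, convert = mapping[field_name]
            unified[term] = convert(value)
    return unified


# Two structurally different records describing the same (made-up) observation.
hospital_view = to_shared_terms("hospital", {"patient_name": "A. Doe", "temp_c": 38.5})
airport_view = to_shared_terms("airport", {"traveller": "A. Doe", "temp_f": 101.3})
```

Once both records are expressed in the shared terms, they can be compared or merged directly, which is the role the text assigns to the ontology's common vocabulary.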
In the next sub-section, we detail motivations and goals of this work.

Motivating scenario and goals
With the spread of the epidemic around the world, and in order to control the propagation of the virus, we suppose that governments want to merge data provided by different hospitals, airports and street cameras to identify infected people and prevent new cases. Data integration and interoperability will be of interest to them, since these three kinds of organization may use different data management techniques before the merger and the data exchange involved is enormous. The complexity of data integration and interoperability in the Data Lake concerns data storage, data structure, and the ways in which data can be integrated and operated as a single entity [16].
Ontologies are the main proposed solution to interoperability problems. They help to better understand the complexity of data and enable systems and data to interoperate semantically. Ontologies provide cross-cutting meanings for terms in the Data Lake as a solution to semantic heterogeneity between data.
In our case, and as discussed above, we define the entity-relationship data models shown in the figure below, respectively (from the left) for Hospital, Airport and Camera Street. For an efficient use of ontologies, a new architecture has been built to better orchestrate and manage data from Data Lake to Data Warehouse [17], [18], [19]. This architecture is divided into four functional layers: Acquisition, Exploration, Semantic and Insight. The Acquisition layer implements the interface between the data sources and the proposed system, while the Insight layer supports most of the human-machine interaction. In between are the Exploration and Semantic layers, which dynamically and incrementally extract and summarize the current metadata of the Data Lake and provide a uniform query interface [20].

Fig. 2. Overview of the proposed Data Lake process
Data acquisition layer: The goal of this layer is to obtain and import data from heterogeneous sources and to store them without any change to their original format. To meet this need, metadata is extracted, classified, and then indexed.
Data exploration layer: Data exploration is the collection and analysis of data from various sources to gain personalized insights from hidden patterns and trends. The goal of this layer is to make messy and scattered data clean and available for future use. To achieve this, ingested data sets are searched, compared, assembled and, finally, stored as new data sets.
Semantic layer: The goal of this layer is to prepare the new data sets to be used for creating insights. The data are cleansed, standardized and harmonized to ensure their quality. These data sets are stored in a format that depends on the intended usage (schema on write), which can be an Enterprise Data Warehouse/Data Mart, or a SQL, key-value, document, graph, or column-oriented database. Finally, the metadata is extracted, classified and indexed, and the assembled data set is made available for distribution.
Insight layer: The goal of this layer is to create any type of insight (descriptive, diagnostic, predictive, and prescriptive). To achieve this, assembled data sets are consumed within reports, algorithms, and/or simulations. Natural language processing capabilities are required when the data used come directly from user input.
The output can be visualized, distributed, or embedded into a business process, and is stored in a format that depends on the intended usage. The system is able to keep track of the changes made to data from ingestion until actual use, by means of data lineage and monitoring as well as data authorization capabilities. Finally, data archiving capabilities must be integrated for data that is no longer used or that must be retained to comply with legislation [21].
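The flow through the four layers can be illustrated with a deliberately small Python sketch. It is not the authors' implementation: the `DataSet` structure, the function names, the choice of a simple count as the descriptive insight, and the sample records are all assumptions made for illustration. Each step appends to a lineage list, as a stand-in for the data lineage capability mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    name: str
    records: list
    metadata: dict = field(default_factory=dict)
    lineage: list = field(default_factory=list)  # ordered trail of processing steps


def acquire(name, raw_records):
    """Acquisition: keep records in their original form and index basic metadata."""
    fields = sorted({key for rec in raw_records for key in rec})
    ds = DataSet(name=name, records=list(raw_records))
    ds.metadata = {"source": name, "fields": fields, "count": len(raw_records)}
    ds.lineage.append(f"acquired:{name}")
    return ds


def explore(datasets, key):
    """Exploration: assemble the records of several ingested sets that carry `key`."""
    assembled = [rec for ds in datasets for rec in ds.records if key in rec]
    out = DataSet(name="assembled", records=assembled)
    for ds in datasets:
        out.lineage.extend(ds.lineage)
    out.lineage.append(f"explored:{key}")
    return out


def harmonize(ds, key):
    """Semantic: standardize the values of `key` (trim, lower-case) and drop empties."""
    cleaned = []
    for rec in ds.records:
        value = str(rec[key]).strip().lower()
        if value:
            cleaned.append({**rec, key: value})
    out = DataSet(name="harmonized", records=cleaned,
                  lineage=ds.lineage + [f"harmonized:{key}"])
    out.metadata = {"fields": sorted({k for r in cleaned for k in r}),
                    "count": len(cleaned)}
    return out


def insight(ds, key):
    """Insight: a descriptive statistic, here the number of records per value of `key`."""
    counts = {}
    for rec in ds.records:
        counts[rec[key]] = counts.get(rec[key], 0) + 1
    return counts


# Made-up sample data, for illustration only.
hospital = acquire("hospital", [{"city": " Casablanca ", "case": 1}, {"city": "Rabat", "case": 1}])
airport = acquire("airport", [{"city": "casablanca", "flight": "XY123"}, {"gate": "B2"}])
clean = harmonize(explore([hospital, airport], "city"), "city")
report = insight(clean, "city")
```

In this sketch the raw records are never modified: each layer produces a new data set, so the original sources remain available, in line with the acquisition layer's goal of storing data unchanged.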
To take advantage of this novel architecture, the next section presents the semantic layer approach, which aims at a better understanding of the data and at optimal exploitation and exploration of data insights.

Semantic Layer: Methodology Development

Ontology in Big Data
In knowledge engineering, an ontology is defined as a detailed description of a problem domain; it is used to define that domain declaratively and formally [22]. An ontology can be seen as a special type of knowledge base, which may be divided, isolated and used separately within a specific domain [23]. The use of ontologies as an appropriate means of describing various areas is now widely recognized, and the broad range of ontologies available on the Web confirms the popularity of this approach among users and developers of Web applications, including Big Data applications.
These ontologies are described in different languages and relate to a wide variety of domains. They differ in expressive means, volume, purpose and level of knowledge formalization. Each ontology has its own classification parameters, which can be split into two groups: semantic and pragmatic. The semantic group classifies ontologies by parameters tied to their information content: domain, formality level of the presented knowledge, expressiveness, and degree of information description [24]. The pragmatic group classifies ontologies by their development purpose and domain of use.
A domain ontology is the part of domain knowledge that fixes the meaning of terms, a meaning that does not depend on the rest of the domain knowledge. A domain ontology can thus be seen as a set of agreements about the domain, while the remaining domain knowledge consists of empirical facts and laws of the area. The ontology therefore defines the level of agreement among field specialists on the meaning of terms [25].
Various sources provide diverse formal models of ontological representation. Each of them includes a set of terms (concepts, notions), which can be split into instances and classes, together with relationships between concepts. The groups of relations ("class-subclass", hierarchical, synonymy and taxonomic) can be clearly delimited, as can functions (a particular case of relation in which one element is uniquely determined by the preceding elements), axioms, and interpretation functions for concepts and relations [26].
To build a Big Data ontological model, the set of classes must be distinguished from the set of class instances. It is also recommended to separate object relations, which hold among instances of different classes, from data relations, which hold between attribute instances and their values.
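The two separations recommended above (classes versus instances, object relations versus data relations) can be made concrete with a small in-memory model. This is a sketch under stated assumptions: the `Ontology` class, its method names and the `examinedBy` and `bodyTemperature` property names are hypothetical, and a real system would rely on an OWL toolkit rather than plain Python structures.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    classes: dict = field(default_factory=dict)            # class -> parent class or None
    individuals: dict = field(default_factory=dict)        # instance -> class
    object_assertions: list = field(default_factory=list)  # (instance, property, instance)
    data_assertions: list = field(default_factory=list)    # (instance, property, literal)

    def add_class(self, name, parent=None):
        if parent is not None and parent not in self.classes:
            raise ValueError(f"unknown parent class: {parent}")
        self.classes[name] = parent

    def add_individual(self, name, cls):
        if cls not in self.classes:
            raise ValueError(f"unknown class: {cls}")
        self.individuals[name] = cls

    def relate(self, subject, prop, obj):
        """Object relation: both ends must be known instances."""
        for name in (subject, obj):
            if name not in self.individuals:
                raise ValueError(f"unknown individual: {name}")
        self.object_assertions.append((subject, prop, obj))

    def set_value(self, subject, prop, literal):
        """Data relation: links an instance to a plain value."""
        if subject not in self.individuals:
            raise ValueError(f"unknown individual: {subject}")
        self.data_assertions.append((subject, prop, literal))

    def is_a(self, individual, cls):
        """Walk up the class-subclass chain from the instance's own class."""
        current = self.individuals[individual]
        while current is not None:
            if current == cls:
                return True
            current = self.classes[current]
        return False


onto = Ontology()
onto.add_class("Person")
onto.add_class("Patient", parent="Person")
onto.add_class("Doctor", parent="Person")
onto.add_individual("patient_1", "Patient")
onto.add_individual("doctor_1", "Doctor")
onto.relate("patient_1", "examinedBy", "doctor_1")   # object relation
onto.set_value("patient_1", "bodyTemperature", 38.5)  # data relation
```

Keeping the four collections apart means a class name can never be mistaken for an instance, and a literal value can never appear where an instance is expected.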

Ontology construction approach
Several tools and languages allow ontologies to be built. Our approach to deriving an ontology from a Data Lake must take into account the characteristics of Big Data (Volume, Variety and Velocity). The starting point is the Data Lake, which holds the different data sources, whereas the target schema is expressed in the Web Ontology Language (OWL).
We choose OWL as a language for representing ontologies since it is the standard recommended by the World Wide Web Consortium (W3C) for modeling ontologies.
We are particularly interested in OWL DL because it supports maximum expressiveness while retaining decidability and computational completeness. The ontology learning process starts from a corpus formed of autonomous and scalable data sources, scalable with regard to both the number of data sources and the amount of data in each source.
Since Big Data is very heterogeneous, we suggest copying the data sources into a common representation while maintaining the independence of the initial sources. The role of this conversion is to partially reduce the complexity and heterogeneity of the data. It takes place in the preprocessing stage, which is the work of wrappers. The processing strategy for a data source varies according to its category. The main objective of this stage is to standardize the data and take into account the variety dimension of Big Data [27].
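A wrapper stage of this kind might look like the following sketch, which uses only the Python standard library. The two source categories (CSV and JSON), the common record layout with `source` and `attributes` keys, and the sample content are assumptions chosen for illustration; the paper does not specify which categories its wrappers handle.

```python
import csv
import io
import json


def wrap_csv(text, source):
    """Wrapper for delimited text sources: one common record per row."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"source": source, "attributes": dict(row)} for row in reader]


def flatten(obj, prefix=""):
    """Flatten nested JSON objects into dotted attribute names."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat


def wrap_json(text, source):
    """Wrapper for JSON sources holding a list of objects."""
    return [{"source": source, "attributes": flatten(item)} for item in json.loads(text)]


# One processing strategy per source category, as described in the text.
WRAPPERS = {"csv": wrap_csv, "json": wrap_json}


def homogenize(sources):
    """Copy every source into the common representation; inputs are left untouched."""
    records = []
    for source in sources:
        wrapper = WRAPPERS[source["category"]]
        records.extend(wrapper(source["content"], source["name"]))
    return records


# Made-up sample sources, for illustration only.
sources = [
    {"name": "hospital", "category": "csv",
     "content": "patient,symptom\nP1,fever\n"},
    {"name": "airport", "category": "json",
     "content": '[{"passenger": "P1", "flight": {"origin": "X"}}]'},
]
common = homogenize(sources)
```

Every record keeps a `source` tag, so the common representation stays traceable to the initial source, which remains independent and unmodified.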
To summarize, our approach to learning ontologies from Big Data is based on wrapping each data source in the Data Lake and then generating a local ontology that matches each wrapped source. Afterward, we combine the resulting local ontologies into a global one [28]. The figure below depicts the proposed ontology construction approach and presents the data sources used in our case study. Extracting and analyzing the data that correspond to each local ontology's data source is the key to data reconstruction [29]. After the data source of a local ontology is reconstructed, relational parameters are set. Data are compared and preliminarily integrated, then reintegrated a second time with regard to the global ontology. Finally, the multi-source heterogeneous model is established.
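The local-to-global step can be sketched as follows. The dictionary layout of an ontology, the function names and, in particular, the hand-written `alignment` table are assumptions: the paper does not describe how matching concepts are found, so this sketch simply takes the correspondences as given.

```python
def build_local_ontology(source, concept, records):
    """Derive one concept and its attributes from the records of a wrapped source."""
    attributes = sorted({key for rec in records for key in rec})
    return {"source": source, "concepts": {concept: attributes}}


def merge_ontologies(local_ontologies, alignment):
    """Combine local ontologies into a global one.

    `alignment` maps (source, local concept) to a global concept name;
    concepts without an entry keep their local name.
    """
    global_concepts = {}
    provenance = {}
    for onto in local_ontologies:
        for concept, attributes in onto["concepts"].items():
            target = alignment.get((onto["source"], concept), concept)
            global_concepts.setdefault(target, set()).update(attributes)
            provenance.setdefault(target, []).append(f'{onto["source"]}.{concept}')
    return {
        "concepts": {name: sorted(attrs) for name, attrs in global_concepts.items()},
        "provenance": provenance,
    }


# Made-up records and a hypothetical alignment, for illustration only.
hospital = build_local_ontology("hospital", "Patient", [{"name": "A", "symptom": "fever"}])
airport = build_local_ontology("airport", "Passenger", [{"name": "A", "origin": "X"}])
alignment = {("hospital", "Patient"): "Person", ("airport", "Passenger"): "Person"}
global_ontology = merge_ontologies([hospital, airport], alignment)
```

The `provenance` entry records which local concepts contributed to each global concept, so the global ontology can always be traced back to the local ones.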
We chose a modular conceptualization from the beginning of the ontology development process. Modularity is an important technique for collaborative knowledge development environments. It is central to reducing the complexity of designing and understanding ontologies, and to facilitating ontology verification, reasoning, maintenance and integration. It limits the scope of each module as much as possible to what is strictly necessary. To summarize, ontology modules offer the following advantages:
• First, they facilitate knowledge reuse across various applications.
• Second, they are easier to build, maintain, and replace.
• Third, they enable distributed engineering of ontology modules over different locations and different areas of expertise.
• Finally, they enable effective management and browsing of modules.
Our approach takes Big Data sets as input and produces an OWL ontology as output. The process comprises three main steps: data homogenization, local ontology generation and, finally, global ontology composition. The process is fully automatic and has the particularity of including a homogenization step to facilitate ontology generation.
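To give an idea of what the OWL output of the last step could look like, the sketch below serializes a small class hierarchy and one object property in Turtle, one of the standard RDF syntaxes for OWL. The base IRI is a placeholder, the `examinedBy` property is hypothetical, and writing Turtle by string formatting is a simplification chosen to keep the example self-contained; it is not presented as the authors' tooling.

```python
BASE_IRI = "http://example.org/covid"  # placeholder IRI, not taken from the paper


def to_turtle(classes, object_properties):
    """Serialize classes and object properties as an OWL document in Turtle.

    classes: {class name: parent class name or None}
    object_properties: {property name: (domain class, range class)}
    """
    lines = [
        f"@prefix : <{BASE_IRI}#> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
        f"<{BASE_IRI}> a owl:Ontology .",
    ]
    for name, parent in classes.items():
        if parent is None:
            lines.append(f":{name} a owl:Class .")
        else:
            lines.append(f":{name} a owl:Class ; rdfs:subClassOf :{parent} .")
    for name, (domain, range_) in object_properties.items():
        lines.append(
            f":{name} a owl:ObjectProperty ; rdfs:domain :{domain} ; rdfs:range :{range_} ."
        )
    return "\n".join(lines) + "\n"


document = to_turtle(
    {"Person": None, "Patient": "Person", "Doctor": "Person"},
    {"examinedBy": ("Patient", "Doctor")},  # hypothetical property name
)
```

The resulting text can be saved to a file and opened in an ontology editor for inspection.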

Implementation & Results
Technically, ontologies are usually developed using integrated graphical ontology editors, such as Protégé, OntoEdit and OILed. In this paper, Protégé is used since it is an open source tool and allows domain ontologies to be constructed easily [30]. In addition, this tool follows the W3C recommendation, i.e. the OWL standard. We first focus on building the local ontologies. These local ontologies were defined on the basis of the data model presented in Figure 1. The created ontologies follow the entity-relationship data models defined respectively for the Hospital, Airport and Camera Street data sources. We illustrate the major concepts, such as Person, which includes Doctor, Nurse and Patient, and Flight, which includes Plane and the Passenger who travels. Likewise, the Camera Street ontology captures a person in a Localization and detects different Symptoms and the person's Behaviors. Finally, the Disease concept defines different categories of Viruses, including COVID-19.
Next, we detail the second part, which consists in extracting the global ontology from the concepts defined in the local ones. After analyzing the latter, we determined the global ontology using OWL annotations. This global ontology takes up the concepts of the local ontologies in order to detect persons who may be contaminated with COVID-19. Such a person may have been examined by a Doctor who reports COVID-19 symptoms such as headache, fever, trouble breathing, or bluish lips or face, or may have been detected by a street camera that captures the person's information and the same external symptoms. In addition, the ontology covers persons who arrived from abroad during the period in which the virus appeared, from countries with a large number of contaminated people. The main purpose of this ontology is to extract and present significant information from multiple data sources, which can then be exploited to generate real-time statistics and reports.
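The detection criteria described above can be read as a set of signals attached to a person. The following sketch is one possible reading, not the authors' rule set: the text does not say how the three criteria combine, so each one is reported as a separate reason, and the time window on arrivals is omitted for brevity. The `PersonRecord` structure, the function name and the sample country are hypothetical; only the symptom list comes from the text.

```python
from dataclasses import dataclass, field
from typing import Optional

# Symptoms listed in the text.
COVID_SYMPTOMS = {"headache", "fever", "trouble breathing", "bluish lips or face"}


@dataclass
class PersonRecord:
    name: str
    doctor_symptoms: set = field(default_factory=set)  # reported after a medical examination
    camera_symptoms: set = field(default_factory=set)  # detected by a street camera
    arrived_from: Optional[str] = None                 # country of departure, if any


def suspicion_reasons(person, high_risk_countries):
    """Return the reasons, if any, for which a person may be contaminated."""
    reasons = []
    if person.doctor_symptoms & COVID_SYMPTOMS:
        reasons.append("symptoms reported by a doctor")
    if person.camera_symptoms & COVID_SYMPTOMS:
        reasons.append("symptoms detected by a street camera")
    if person.arrived_from in high_risk_countries:
        reasons.append("arrival from a heavily affected country")
    return reasons


# Made-up persons and a placeholder country name, for illustration only.
high_risk = {"Country X"}
examined = PersonRecord("P1", doctor_symptoms={"fever", "cough"})
traveller = PersonRecord("P2", camera_symptoms={"fever"}, arrived_from="Country X")
healthy = PersonRecord("P3")
flagged = {p.name: suspicion_reasons(p, high_risk) for p in (examined, traveller, healthy)}
```

Returning the reasons rather than a single yes/no answer keeps the source of each signal visible, which fits the stated purpose of feeding statistics and reports.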

Conclusion and Future Work
This paper presents part of a study whose goal is to develop a methodology for acquiring knowledge from a Data Lake into a knowledge base, for the development of intelligent systems. This is a very long-term project and it is still at an exploratory stage. The work presented is a case study on coronavirus prediction. It describes the role of ontologies in Big Data in preventing new cases, by identifying persons suspected of being infected with COVID-19, with the aim of reducing the spread rate of the epidemic.
In this paper, we proposed an ontology approach based on different data sources for automated ontology mapping and merging. The role of the Data Lake is to make ontologies interoperate, so that learning objects can be retrieved and reused from the local ontologies to the global ontology.
For future work, we will experiment with discovering instances from the ontological clusters through k-means to enable more efficient querying. We also wish to set up a new architecture combining classification, categorization and internal learning in the Data Lake, for the purpose of better data management.

Authors
Jabrane Kachaoui is a PhD student in the Laboratory of Information Technology and Modeling at Hassan II University, Faculty of Science Ben M'Sik, Casablanca, Morocco. He is currently working as an IT Project Manager. His main interests are new technologies around Big Data and Data Lake. Email: jabrane2005@gmail.com
Jihane Larioui is a PhD student in the Laboratory of computer science and modeling of decision support systems at Hassan II University, Faculty of Science in Casablanca, Morocco. She currently works as a Data Analyst and BI Consultant in well-known companies. Her main current research interests concern multi-agent systems and decision-making in the context of urban mobility.