The Impact of XML Databases Normalization on Design and Usability of Internet Applications

Database normalization is a process that eliminates redundancy, organizes data efficiently and improves data consistency. Functional, multivalued, and join dependencies (FDs, MVDs, and JDs) play fundamental roles in relational databases, where they provide semantics for the data and at the same time are the foundations of database design. In this study we investigate how to define functional, multivalued and join dependencies and their normal forms in the XML database model. We show that, like relational databases, XML documents may contain redundant information, and that this redundancy may cause update anomalies. Such problems are caused by certain dependencies among paths in the document. Our goal is to find a way of converting an arbitrary XML Schema into a well-designed one that avoids these problems. We extend the notion of tuple for relational databases to the XML model and show that an XML tree can be represented as a set of tree tuples. We introduce definitions of FD, MVD, and JD and new normal forms of XML Schema that are based on these dependencies (X-1NF, X-2NF, X-3NF, X-BCNF, X-4NF, and X-5NF). We show that our proposed normal forms are necessary and sufficient to ensure that all conforming XML documents have no redundancies.


INTRODUCTION
Recently, several researchers have studied Web-based applications and distinguished three basic levels in every such application: the Web character of the program, the pedagogical background, and the personalized management of the learning material [23]. They defined a Web-based program as an information system that contains a Web server, a network, a communication protocol like HTTP, and a browser, in which data supplied by users act on the system's status and cause changes. The pedagogical background is the educational model that is used in combination with the pedagogical goals set by the instructor. The personalized management of the learning material is the set of rules and mechanisms used to select learning materials based on the student's characteristics, the educational objectives, the teaching model, and the available media.
Many works have combined and integrated these three factors in e-learning systems, leading to several standardization projects. Some projects have focused on determining the standard architecture and format for learning environments, such as IEEE Learning Technology Systems Architecture (LTSC), Instructional Management Systems (IMS), and Sharable Content Object Reference Model (SCORM). IMS and SCORM define and deliver XML-based interoperable specifications for exchanging and sequencing learning content, i.e., learning objects, among many heterogeneous e-learning systems. They mainly focus on the standardization of learning and teaching methods as well as on the modeling of how the systems manage interoperating educational data relevant to the educational process.
The eXtensible Markup Language (XML) has recently emerged as a standard for data representation and interchange on the Internet. With the increase of data-intensive Web applications, XML has conquered the field of databases. It is argued that XML can be used as a database language, which would support more than data exchange on the Web. This has led to significant research efforts, including: 1) the storage of XML documents in relational databases; 2) query languages for XML, which led to the standard query language XQuery; 3) schema languages for XML, which led to the widely accepted XML Schema language; 4) updates of XML documents; and 5) dependency and normal form theory [1][2][3][4][5][6][7].
Although many XML documents are views of relational data, the number of applications using native XML documents is increasing rapidly. Such applications may use native XML storage facilities [2] and update XML data [3]. Updates, as in relational databases, may cause anomalies if data is redundant. In the relational world, anomalies are avoided by developing a well-designed database schema. XML has its own schema languages too, such as DTD (Document Type Definition) and XML Schema [4]. Our goal is to find the principles of good XML Schema design. We believe that it is important to do this research now, as a lot of data is being put on the Web. Once massive Web databases are created, it is very hard to change their organization; thus, there is a risk of having large amounts of widely accessible, but at the same time poorly organized, legacy data.
Normalization is a process that eliminates redundancy, organizes data efficiently and improves data consistency. Whereas normalization in the relational world has been explored extensively, it is a new research area in native XML databases. Even though native XML databases mainly work with document-centric XML documents, and the structure of XML documents might differ from one to another, there is room for redundant information. This redundancy may affect document updates, the efficiency of queries, etc. Figure 1 shows an overview of the XML normalization process that we propose. Functional dependency (FD) is one of the integrity constraints for any data model. In the relational data model, FDs, MVDs, and JDs are well studied and are widely used in normalization theory and in key algorithms. In recent years, XML has emerged as a widely used data representation and storage format over the World Wide Web. The growing use of XML has made it necessary to make XML documents semantically stronger. XML functional dependencies have been studied as one of the ways to make XML data semantically richer [8,13,14,21,22].
The focus of this paper is on functional, multivalued and join dependencies and normal form theory. This theory concerns the old question of well-designed databases or, in other words, the syntactic characterization of semantically desirable properties. These properties are tightly connected with dependencies such as keys, functional dependencies, weak functional dependencies, equality generating dependencies, multivalued dependencies, inclusion dependencies, join dependencies, etc. [9][10][11][12]. All these classes of dependencies have been deeply investigated in the context of the relational data model [5,6]. This work now needs to be generalized to the XML (tree-like) model.
The main contributions of this study are the new definitions of MVD and JD and the new normal forms of XML Schema (X-4NF and X-5NF). We extend our previous research work in [21,22], show how to use MVDs and JDs to detect data redundancy in XML documents, and then propose normal forms of XML Schema with respect to the MVD and JD constraints.

II. PRELIMINARY DEFINITIONS
To extend the notions of FDs, MVDs and JDs to the XML database model, we represent XML trees as sets of tuples [13,14,21,22], and find the correspondence between documents and relations that leads to the definitions of functional and multivalued dependencies. We first describe the formal definitions of XML Schema (XSchema) and of the conformance of an XML tree to an XSchema. Assume that we have the following disjoint sets: Ê (element names), Â (attribute names), DT (atomic data types), Str (string values) and Vert (node identifiers). An XSchema is a 6-tuple X = (E, A, M, P, r, ∑), with E ⊆ Ê and A ⊆ Â finite, in which:
• M is a function from E to its element type definitions, i.e., M(e) = α, where e ∈ E and α is a regular expression in which ε denotes the empty element, t ∈ DT, "+" denotes union, "," denotes concatenation, α* is the Kleene closure, α? stands for (α + ε) and α+ stands for (α, α*).
• P is a function from an attribute name a to its attribute type definition, i.e., P(a) = β, where β is a 4-tuple (t, n, d, f) with t ∈ DT; n is either "?" (nullable) or "¬?" (not nullable); d is a finite set of valid domain values of a, or ε if not known; and f is a default value of a, or ε if not known.
• r ⊆ E is a finite set of root elements.
• ∑ is a finite set of integrity constraints for the XML model. The integrity constraints we consider are keys (P.K, F.K, …) and dependencies (functional and inclusion).
Definition 2 (path in XSchema): Given an XSchema X = (E, A, M, P, r, ∑), a string p = p_1…p_n is a path in X if p_1 = r, p_i is in the alphabet of M(p_{i−1}) for each i ∈ [2, n − 1], and p_n is in the alphabet of M(p_{n−1}) or p_n = @l for some @l ∈ P(p_{n−1}).


We let paths(X) stand for the set of all paths in X and EPaths(X) for the set of all paths that end with an element type (rather than an attribute or S), that is: EPaths(X) = {p ∈ paths(X) | last(p) ∈ E}.
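As an illustration of Definition 2, the following minimal Python sketch enumerates paths(X) and EPaths(X) for a small, non-recursive schema. The dictionaries `M` and `P` and the courses schema itself are assumptions made for this example: they keep only the alphabet of each content model and the attribute names of each element, and ignore the regular-expression structure.

```python
# Hypothetical, simplified XSchema for illustration only: M keeps the
# alphabet of each content model, P the attribute names of each element.
M = {
    "courses": ["course"],
    "course": ["title", "taken_by"],
    "title": ["S"],
    "taken_by": ["student"],
    "student": ["name", "grade"],
    "name": ["S"],
    "grade": ["S"],
}
P = {"course": ["@cno"], "student": ["@sno"]}
ROOT = "courses"


def paths(element=ROOT, prefix=()):
    """Enumerate all paths of a non-recursive schema as tuples of steps."""
    here = prefix + (element,)
    result = [here]
    for attr in P.get(element, []):
        result.append(here + (attr,))          # path ending in an attribute
    for child in M.get(element, []):
        if child == "S":
            result.append(here + ("S",))       # path ending in text content
        else:
            result.extend(paths(child, here))  # descend into a child element
    return result


def epaths():
    """Paths whose last step is an element name (not an attribute or S)."""
    return [p for p in paths() if p[-1] != "S" and not p[-1].startswith("@")]


all_paths = [".".join(p) for p in paths()]
element_paths = [".".join(p) for p in epaths()]
```

The recursion terminates only because the assumed schema is non-recursive; a recursive content model would have infinitely many paths and would need a depth bound.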

Definition 3 (XML tree): An XML tree T is defined to be a tree T = (V, lab, ele, att, root).
Definition 4 (path in XML tree): Given an XML tree T, a string p_1…p_n with p_1, …, p_{n−1} ∈ Ê and p_n ∈ Ê ∪ Â ∪ {S} is a path in T if there are vertices v_1 … v_{n−1} ∈ V such that:
Definition 5 (conformance and compatibility): Given an XSchema X = (E, A, M, P, r, ∑) and an XML tree T = (V, lab, ele, att, root), we say that T is valid w.r.t. X (or T conforms to X), written T ╞ X, if:
where s ∈ Str. Otherwise, ele(v) = [v_1, …, v_n] and the string lab(v_1) … lab(v_n) must be in the regular language defined by M(lab(v)).
• att is a partial function, att: V × A → Str, such that for any v ∈ V and @l ∈ A, att(v, @l) is defined iff @l ∈ P(lab(v)).
We say that T is compatible with X (written T ⊲ X) iff paths(T) ⊆ paths(X). Clearly, T ╞ X ⇒ T ⊲ X.
Definition 6 (subsumed): Given two XML trees T_1 = (V_1, lab_1, ele_1, att_1, root_1) and T_2 = (V_2, lab_2, ele_2, att_2, root_2), we say that T_1 is subsumed by T_2, written T_1 ≤ T_2.
Definition 7 (equivalence): Given two XML trees T_1 and T_2, we say that T_1 is equivalent to T_2, written T_1 ≡ T_2, iff T_1 ≤ T_2 and T_2 ≤ T_1 (i.e., T_1 ≡ T_2 iff T_1 and T_2 are equal as unordered trees). We shall also write T_1 < T_2 when T_1 ≤ T_2 and T_2 ≰ T_1.
In [21,22] we extended the notion of tuple for relational databases to the XML model. In a relational database, a tuple is a function that assigns to each attribute a value from the corresponding domain. In our setting, a tree tuple t in an XML Schema X is a function that assigns to each path in X a value in Vert ∪ Str ∪ {φ} in such a way that t represents a finite tree with paths from X containing at most one occurrence of each path. We have shown that an XML tree can be represented as a set of tree tuples.
Definition 8 (tree tuples): Given XML Schema X = (E, A, M, P, r, ∑), a tree tuple t in X is a function t: paths(X) → Vert ∪ Str ∪ {φ} such that t represents a finite tree with paths from X containing at most one occurrence of each path. T(X) is defined to be the set of all tree tuples in X. For a tree tuple t and a path p, we write t.p for t(p).
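To make the tree tuple representation concrete, here is a minimal Python sketch that extracts the maximal tree tuples of a small document. The sample document, the function name `tuples_of` and the generated node identifiers (`v0`, `v1`, …) are assumptions made for illustration; paths that a tuple leaves undefined (the null value φ) are simply absent from its dictionary.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = """
<courses>
  <course cno="csc200">
    <title>Automata Theory</title>
    <taken_by>
      <student sno="st1"><name>Deere</name><grade>A+</grade></student>
      <student sno="st2"><name>Smith</name><grade>B-</grade></student>
    </taken_by>
  </course>
  <course cno="mat100">
    <title>Calculus I</title>
    <taken_by>
      <student sno="st1"><name>Deere</name><grade>A-</grade></student>
    </taken_by>
  </course>
</courses>
"""


def tuples_of(node, path, ids):
    """Return the tree tuples of the subtree rooted at `node` as dicts path -> value."""
    here = f"{path}.{node.tag}" if path else node.tag
    # Element paths map to a node identifier (a stand-in for Vert).
    base = {here: ids.setdefault(id(node), f"v{len(ids)}")}
    for name, value in node.attrib.items():
        base[f"{here}.@{name}"] = value
    text = (node.text or "").strip()
    if text and len(node) == 0:
        base[f"{here}.S"] = text
    # Group children by label: a tuple keeps at most one child per label.
    by_label = {}
    for child in node:
        by_label.setdefault(child.tag, []).extend(tuples_of(child, here, ids))
    result = []
    for combo in itertools.product(*by_label.values()):
        merged = dict(base)
        for part in combo:
            merged.update(part)
        result.append(merged)
    return result


TUPLES = tuples_of(ET.fromstring(DOC), "", {})
```

Each (course, student) combination yields one tuple, so the sample document produces three tuples that all share the same root node.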

Definition 11 (trees_X): Given XML Schema X and a set of tree tuples Y ⊆ T(X), trees_X(Y) is defined to be the set of minimal (with respect to ≤) XML trees T such that T ⊲ X and tree_X(t) ≤ T for every t ∈ Y. Notice that, if T ∈ trees_X(Y) and T' ≡ T, then T' is in trees_X(Y). The following shows that every XML document can be represented as a set of tree tuples, if we consider it as an unordered tree. That is, a tree T can be reconstructed from tuples_X(T), up to equivalence ≡. We have proved the following theorem [21,22].

Note that:
• We say that Y ⊆ T(X) is X-compatible if there is an XML tree T such that T ⊲ X and Y ⊆ tuples_X(T).
• For an X-compatible set of tree tuples Y, there is always an XML tree T such that tree_X(t) ≤ T for every t ∈ Y.


We have proved the following proposition, and corollary [21,22]:

A. Functional dependencies of XML schema
We define functional dependencies for XML Schema by using the tree tuple representation discussed previously.
Definition 12 (functional dependencies): Given an XML Schema X, a functional dependency (FD) over X is an expression of the form S_1 → S_2, where S_1, S_2 ⊆ paths(X) and S_1, S_2 ≠ φ. The set of all FDs over X is denoted by FD(X).

Definition 14: If, for every pair of tree tuples t_1, t_2 in an XML tree T, t_1.S_1 = t_2.S_1 implies that they have a null value on some p ∈ S_1, then the FD is trivially satisfied by T.
The previous definitions extend to equivalence classes, since, for any FD f and T ≡ T', T ╞ f iff T' ╞ f. We write T ╞ F, for F ⊆ FD(X), if T ╞ f for each f ∈ F, and we write T ╞ (X, F) if T ╞ X and T ╞ F.
Example 6: Consider the XML Schema in example 1. We have the following FDs. Note that cno is a key of course:
courses.course.@cno → courses.course (FD1)
Another FD says that two distinct student sub-elements of the same course cannot have the same sno:
{courses.course, courses.course.taken_by.student.@sno} → courses.course.taken_by.student (FD2)
Finally, to say that two student elements with the same sno value must have the same name, we use:
courses.course.taken_by.student.@sno → courses.course.taken_by.student.name.S (FD3)
Definition 15: Given XML Schema X, a set F ⊆ FD(X) and f ∈ FD(X), we say that (X, F) implies f, written (X, F) ⊦ f, if for any tree T with T ╞ X and T ╞ F, it is the case that T ╞ f. The set of all FDs implied by (X, F) will be denoted by (X, F)+.
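The FDs of Example 6 can be checked against a set of tree tuples with a few lines of Python. This is a sketch under stated assumptions: the tuples and node identifiers below are hypothetical, and satisfaction is read in the usual way (whenever two tuples agree on S_1 with no null value, they must agree on S_2), which is the reading suggested by Definition 14.

```python
C = "courses.course"
ST = C + ".taken_by.student"

# Hypothetical tree tuples (dicts path -> value); absent paths stand for null.
TUPLES = [
    {C: "v1", C + ".@cno": "csc200", ST: "v4", ST + ".@sno": "st1", ST + ".name.S": "Deere"},
    {C: "v1", C + ".@cno": "csc200", ST: "v7", ST + ".@sno": "st2", ST + ".name.S": "Smith"},
    {C: "v10", C + ".@cno": "mat100", ST: "v13", ST + ".@sno": "st1", ST + ".name.S": "Deere"},
]


def satisfies(tuples, lhs, rhs):
    """True if every pair of tuples agreeing (non-null) on lhs also agrees on rhs."""
    seen = {}
    for t in tuples:
        key = tuple(t.get(p) for p in lhs)
        if None in key:  # a null on the left-hand side: trivially satisfied
            continue
        value = tuple(t.get(p) for p in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


FD1 = ([C + ".@cno"], [C])
FD2 = ([C, ST + ".@sno"], [ST])
FD3 = ([ST + ".@sno"], [ST + ".name.S"])

holds = [satisfies(TUPLES, *fd) for fd in (FD1, FD2, FD3)]
# The same student number occurs under two different student nodes,
# so @sno does not determine the student element itself.
sno_determines_student = satisfies(TUPLES, [ST + ".@sno"], [ST])
```

On this data all three FDs hold, while @sno → student fails; the name Deere is therefore stored once per enrolment rather than once per student.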

B. Primary and Foreign Keys of XML Schema
We present the definitions of the primary and foreign keys of the XML Schema. We will use these definitions to introduce the normal forms of XML Schema. We also observe that, while there are important differences between the XML and relational models, much of the thinking that commonly goes into relational database design can be applied to XML Schema design as well.
Definition 17 (key, foreign key and superkey): Let X = (E, A, M, P, r, ∑) be an XML Schema. A constraint in ∑ over X has one of the following forms:
Key: e(l) → e, where e ∈ E and l is a set of attributes in P(e). It indicates that the set l of attributes is a key of e elements.
Foreign key: e_1(l_1) ⊆ e_2(l_2) and e_2(l_2) → e_2, where e_1, e_2 ∈ E and l_1, l_2 are non-empty sequences of attributes in P(e_1), P(e_2), respectively, and moreover l_1 and l_2 have the same length. This constraint indicates that l_1 is a foreign key of e_1 elements referencing the key l_2 of e_2 elements. A constraint of the form e_1(l_1) ⊆ e_2(l_2) is called an inclusion constraint. Observe that a foreign key is actually a pair of constraints, namely an inclusion constraint e_1(l_1) ⊆ e_2(l_2) and a key e_2(l_2) → e_2.
Superkey: Suppose that e ⊆ E and, for any two distinct paths p_1 and p_2 in the XML Schema X, we have the constraint that p_1(e) ≠ p_2(e). The subset e is called a superkey of X. Every XML Schema has at least one default superkey: the set of all its elements.
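The key and foreign key constraints of Definition 17 can be tested on a concrete document as follows. This is a minimal sketch: the sample document and the helper names `values`, `is_key` and `is_foreign_key` are assumptions for illustration, and elements are matched by tag name anywhere in the document rather than by full path.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: students are referenced by sno, details live under info.
DOC = """
<courses>
  <course cno="csc200"><taken_by><student sno="st1"/><student sno="st2"/></taken_by></course>
  <course cno="mat100"><taken_by><student sno="st1"/></taken_by></course>
  <info><number sno="st1"/><name>Deere</name></info>
  <info><number sno="st2"/><name>Smith</name></info>
</courses>
"""
ROOT = ET.fromstring(DOC)


def values(root, element, attrs):
    """Attribute-value tuples of every `element` node under `root`."""
    return [tuple(node.get(a) for a in attrs) for node in root.iter(element)]


def is_key(root, element, attrs):
    """Key e(l) -> e: all attributes are present and no two nodes agree on them."""
    vals = values(root, element, attrs)
    complete = all(None not in v for v in vals)
    return complete and len(set(vals)) == len(vals)


def is_foreign_key(root, e1, l1, e2, l2):
    """Inclusion e1(l1) within e2(l2), together with the key e2(l2) -> e2."""
    included = set(values(root, e1, l1)) <= set(values(root, e2, l2))
    return included and is_key(root, e2, l2)


course_key = is_key(ROOT, "course", ["cno"])
student_key = is_key(ROOT, "student", ["sno"])
student_fk = is_foreign_key(ROOT, "student", ["sno"], "number", ["sno"])
```

Here cno is a key of course and student.@sno is a foreign key referencing number.@sno, whereas sno is not a key of student because st1 is enrolled twice.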

C. First normal form for XML schema (X-1NF)
First normal form (1NF) is now considered to be part of the formal definition of a relation in the basic relational database model. Historically, it was defined as: "The domain of an attribute in a tuple must be a single value from the domain of that attribute" [20]. XML, of course, is hierarchical by nature. An XML "tuple" can deviate from first normal form in several ways, all of which are valid from a data modeling point of view.

D. Second normal form of XML schema (X-2NF)
X-2NF is based on the concept of full functional dependency.
Definition 18: An FD S_1 → S_2, where S_1, S_2 ⊆ paths(X), is called a full FD if removing any path p from S_1 means that the dependency no longer holds (i.e., for any p ∈ S_1, (S_1 − {p}) does not functionally determine S_2).
Definition 19: An FD S_1 → S_2 is called a partial dependency if, for some p ∈ S_1, (S_1 − {p}) → S_2 holds.

Definition 20 (X-2NF): An XML Schema X = (E, A, M, P, r, ∑) is in second normal form (X-2NF) if every element e ∈ E and every attribute l ∈ P(e) is fully functionally dependent on the key elements of X.
The test for X-2NF involves testing for FDs whose left-hand sides are part of the primary key. If the primary key contains a single element path, the test need not be applied at all.
Example 8: The XML Schema Emp_Proj in the above example is in X-1NF but is not in X-2NF, because the FDs FD2 and FD3 make Emp_Proj.Ename, Emp_Proj.Pname and Emp_Proj.Plocation partially dependent on the primary key {Emp_Proj.Sss, Emp_Proj.Pnumber} of Emp_Proj, thus violating the X-2NF test.
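The X-2NF test can be sketched in Python with a standard closure computation over FDs. The three FDs below are an assumption modeled on Example 8 (the Hours path in particular does not appear in the text above), and `closure` and `partial_dependencies` are hypothetical helper names.

```python
from itertools import combinations


def closure(paths, fds):
    """Closure of a set of paths under FDs given as (lhs, rhs) pairs of frozensets."""
    result = set(paths)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def partial_dependencies(key, fds):
    """Non-key paths that are determined by a proper, non-empty subset of the key."""
    found = set()
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            found |= closure(subset, fds) - set(key)
    return found


# Assumed FDs for Emp_Proj; path names are shortened to their last step.
KEY = frozenset({"Sss", "Pnumber"})
FDS = [
    (KEY, frozenset({"Hours"})),                                   # assumed FD1
    (frozenset({"Sss"}), frozenset({"Ename"})),                    # assumed FD2
    (frozenset({"Pnumber"}), frozenset({"Pname", "Plocation"})),   # assumed FD3
]

violations = partial_dependencies(KEY, FDS)
in_x2nf = not violations
```

With a single-path key the loop over proper subsets is empty, which mirrors the remark that the test need not be applied in that case.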

E. Third Normal Form of XML Schema (X-3NF)
X-3NF is based on the concept of transitive dependency.
Definition 21: An FD S_1 → S_2, where S_1, S_2 ⊆ paths(X), is a transitive dependency if there is a set of paths Z (that is neither a key nor a subset of any key of X) such that both S_1 → Z and Z → S_2 hold.

Example 10: The XML Schema Emp_Dept in the above example is in X-2NF (since no partial dependencies on a key element exist), but Emp_Dept is not in X-3NF because of the transitive dependency of Emp_Dept.DmgrSsn (and also Emp_Dept.Dname) on Emp_Dept.Ssn via Emp_Dept.Dnumber.
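The transitive dependency in Example 10 can be detected mechanically. The sketch below assumes a single key and a hypothetical FD set for Emp_Dept (the Ename path is an assumption); it only considers intermediate sets Z that appear as the left-hand side of a listed FD.

```python
def closure(paths, fds):
    """Closure of a set of paths under FDs given as (lhs, rhs) pairs of frozensets."""
    result = set(paths)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def transitive_dependencies(key, fds):
    """Map each intermediate set Z to the paths that depend on the key via Z."""
    found = {}
    reachable = closure(key, fds)
    for z, _ in fds:
        # Z must be neither part of the key nor a superkey itself.
        if z <= key or key <= closure(z, fds):
            continue
        if z <= reachable:  # key -> Z holds
            found[z] = closure(z, fds) - z - key
    return found


# Assumed FDs for Emp_Dept; path names are shortened to their last step.
KEY = frozenset({"Ssn"})
FDS = [
    (frozenset({"Ssn"}), frozenset({"Ename", "Dnumber"})),
    (frozenset({"Dnumber"}), frozenset({"Dname", "DmgrSsn"})),
]

via = transitive_dependencies(KEY, FDS)
```

The result reports Dname and DmgrSsn as depending on Ssn only through Dnumber, which is the violation described in the example.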

F. Boyce-Codd normal form of XML schema (X-BCNF)
X-BCNF is proposed as a form similar to X-3NF, but it is stricter than X-3NF: every XML Schema in X-BCNF is also in X-3NF, whereas an XML Schema in X-3NF is not necessarily in X-BCNF. The formal definition of X-BCNF differs slightly from the definition of X-3NF.
Definition 23 (X-BCNF): An XML Schema X = (E, A, M, P, r, ∑) is in Boyce-Codd Normal Form (X-BCNF) if, whenever a nontrivial FD S_1 → S_2 holds in X, where S_1, S_2 ⊆ paths(X), S_1 is a superkey of X.
We can also consider the following definition of X-BCNF:
Definition 24: Given XML Schema X and F ⊆ FD(X), (X, F) is in X-BCNF iff for every nontrivial FD f ∈ (X, F)+ of the form S → p.@l or S → p.S, it is the case that S → p ∈ (X, F)+.
The intuition is as follows. Suppose that S → p.@l ∈ (X, F)+. If T is an XML tree conforming to X and satisfying F, then in T, for every set of values of the elements in S, we can find only one value of p.@l. Thus, for every set of values of S, we need to store the value of p.@l only once; in other words, S → p must be implied by (X, F).
In definition 24, we suppose that f is a nontrivial FD. Indeed, the trivial FD p.@l → p.@l is always in (X, F)+, but often p.@l → p ∉ (X, F)+, which does not necessarily represent a bad design.
To show how X-BCNF distinguishes good XML design from bad design, we consider example 1 again, when only functional dependencies are provided.
Example 11: Consider the XML Schema from example 1, whose FDs are FD1, FD2 and FD3, shown in example 6. FD3 associates a unique name with each student number, which is therefore redundant. The design is not in X-BCNF, since it contains FD3 but does not imply the functional dependency:
courses.course.taken_by.student.@sno → courses.course.taken_by.student.name
To solve this problem, we gave a revised XML Schema in example 1. The idea was to create a new element info for storing information about students. That design satisfies the FDs FD1 and FD2, as well as courses.info.number.@sno → courses.info, and can be easily verified to be in X-BCNF.
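The reasoning in Example 11 can be approximated by a naive syntactic check in Python. This is only a sketch under strong assumptions: it inspects the explicitly listed FDs rather than the full set (X, F)+, it approximates implication by a relational-style closure that ignores the schema, and the helper names are hypothetical.

```python
def closure(paths, fds):
    """Closure of a set of paths under FDs given as (lhs, rhs) pairs of frozensets."""
    result = set(paths)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def parent(path):
    """Drop the last step of a dotted path: p.@l -> p, p.S -> p."""
    return path.rsplit(".", 1)[0]


def bcnf_violations(fds):
    """Listed FDs S -> p.@l or S -> p.S for which S -> p is not derivable."""
    bad = []
    for lhs, rhs in fds:
        for target in rhs:
            last = target.rsplit(".", 1)[-1]
            is_value = last == "S" or last.startswith("@")
            nontrivial = target not in lhs
            if is_value and nontrivial and parent(target) not in closure(lhs, fds):
                bad.append((lhs, target))
    return bad


C = "courses.course"
ST = C + ".taken_by.student"
FD1 = (frozenset({C + ".@cno"}), frozenset({C}))
FD2 = (frozenset({C, ST + ".@sno"}), frozenset({ST}))
FD3 = (frozenset({ST + ".@sno"}), frozenset({ST + ".name.S"}))
FD4 = (frozenset({"courses.info.number.@sno"}), frozenset({"courses.info"}))

original = bcnf_violations([FD1, FD2, FD3])   # design of example 6
revised = bcnf_violations([FD1, FD2, FD4])    # design with the info element
```

For the original design the check flags FD3, while none of the FDs listed for the revised design is flagged. A complete test would have to reason over all FDs implied by the schema and F, which this sketch does not attempt.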
IV. NORMAL FORMS BASED ON MULTIVALUED DEPENDENCIES
We have discussed only FDs, which are by far the most important type of dependency in XML database design theory. However, in many cases XML documents have constraints that cannot be specified as FDs. In this part of the article, we discuss the concept of multivalued

Figure 1. An overview of the XML normalization process

• Ê: set of element names
• Â: set of attribute names
• DT: set of atomic data types (e.g., ID, IDREF, IDREFS, string, integer, date, etc.)
• Str: set of possible values of string-valued attributes
• Vert: set of node identifiers
All attribute names start with the symbol @. The symbols φ and S represent the element type declarations EMPTY (null) and #PCDATA, respectively.
Definition 1 (XSchema): An XSchema is denoted by a 6-tuple X = (E, A, M, P, r, ∑), where:
• E ⊆ Ê is a finite set of element names.
• A ⊆ Â is a finite set of attribute names.


then v_{n−1} has a child in Str.
• We let paths(T) stand for the set of paths in T.
iJAC - Volume 3, Issue 2, May 2010
Definition 5 (conformation and compatibility): Given an XSchema X = (


• root = t.r
• V = {v ∈ Vert | ∃p ∈ paths(X) such that v = t.p}
• If v = t.p and v ∈ V, then lab(v) = last(p)
• If v = t.p and v ∈ V, then ele(v) is defined to be the list containing {t.p' | t.p' ≠ φ and p' = p.τ, τ ∈ E, or p' = p.S}, ordered lexicographically

Proposition 1: If t ∈ T(X), then tree_X(t) ⊲ X.