XML und Datenbanken

Transcrição

XML und Datenbanken
Chapter 5
Mapping XML to Databases
Three Layer Architecture
Text-based Storage
Model-based Storage
Structure-based Storage
Hybrid Storage
Three Layer Architecture for Database Processing
Conceptual
Layer
Logical
Layer
Physical
Layer
Abstract database schema
e.g. ER schema
Concrete database schema
e.g. table schema
Storage structures
and access paths
e.g. B-tree index
Abstract Instances
e.g. entities
Concrete Instances
e.g. tables and tuples
Data sets in
database pages

Conceptual Layer: ERM, UML, ...

Logical Layer: relational model, NF2 model, OR model, ...

Physical Layer: internal storage, indexing, updates, transaction processing, query optimization
Lecture "XML and Databases" - Dr. Can Türker
5-2
Three Layer Architecture for Document Processing
Conceptual
Layer
Logical
Layer
Physical
Layer
Style sheet
Document structure
File
Document
Access structure
e.g. full-text index

Conceptual Layer:
–
editors
–
basic agreement on the document structure (intro before conclusions etc.)

Logical Layer: document model, document instances

Physical Layer: internal storage, document indexing, query processing
Lecture "XML and Databases" - Dr. Can Türker
5-3
Three Layer Architecture for XML Processing
Conceptual
Layer
Logical
Layer
<..>
<..>
<..>
Document structure
e.g. XML schema
Physical
Layer
</..>
</..>
</..>
XML documents,
XML elements

Conceptual Layer: XML schema, DTD

Logical Layer: XML documents, XML elements, etc.

Physical Layer - Storage:
–
Single file inappropriate for large document collections
–
Hot to store large XML documents?
–
How to support queries efficiently?
Lecture "XML and Databases" - Dr. Can Türker
?
Storage structure
Access structure
5-4
Types of XML Documents
<order>
<customer>Meyer</customer>
<position>
<isbn>1-234-56789-0</isbn>
<number>2</number>
<price currency="Euro">30.00</price>
</position>
</order>

Structured
–
regular
–
example: product catalogue, orders, invoices
–
data-centric applications

Unstructured
–
irregular
–
example: articles, books, emails, Web pages
–
document-centric applications
<content>
XML builds on the principles of two
existing languages, <em>HTML</em> and
<em>SGML</em> to create a simple
mechanism ...
The generalized markup concept ...
</content>

Semi-structured
–
structured and unstructured parts
–
example: publications, Amazon, MS Press
(example chapters)
<book>
<author>Neil Bradley</author>
<title>XML companion</title>
<isbn>1-234-56789-0</isbn>
<content>
XML builds on the principles of two
existing languages, <em>HTML</em>
and ...
</content>
</book>
Lecture "XML and Databases" - Dr. Can Türker
5-5
Three Layer Architecture and XML Document Types
Conceptual
Layer
Logical
Layer
Physical
Layer
document-centric
semi-structured
data-centric
Modeling structure and
content
Modeling structure and
content
Modeling structure
Tree-based editors
XML
ER, UML
Document model
Data and document model
Data model
Queries/Updates on
structure and content
Queries/Updates on
structure and content
Queries/Updates on
content
XML, SGML
XML, OEM
XML, RDM, OODM
XPath/XQuery, DOM,
IR queries
XPath/XQuery, XIRQL
XQuery, SQL, OQL
Structure on instance level
Structure on schema and
instance level
Structure on schema level
Files and full-text indexes,
structure indexes
Semi-structured databases
and full-text indexes, path
indexes
(Object-)Relational
databases and B-tree / hash
indexes
Lecture "XML and Databases" - Dr. Can Türker
5-6
Requirements on the Physical Layer

Order preserving and lossless storage of XML documents

Efficient access to XML documents and selected document parts
–
Short response times for


–
–
–
–
–
Queries
Updates
Physical design of the storage layer
Indexing
Transaction processing
Support for XPath/XQuery
Support for XML processors such DOM and SAX
Lecture "XML and Databases" - Dr. Can Türker
5-7
Storage Strategies for XML Documents
1.
Text-based
–
Store XML document as string
2.
Model-based
–
Store XML graph structure and DOM informations
3.
Structure-based
–
Derive database schema from XML structure
–
Custom mapping of XML structure to database schema
Lecture "XML and Databases" - Dr. Can Türker
5-8
Text-based Storage

Entire XML document stored as
–
File in the file system
–
CLOB (Character Large Object) in a database system
Id
Content
1
<book genre="autobiography">
<title>The Autobiography of Benjamin Franklin</title>
<author>
<first-name>Benjamin</first-name>
<last-name>Franklin</last-name>
</author>
<price currency="USD">8.99</price>
</book>

Operations (read/write) on entire XML document are efficient

No explicit support for more fine-granule access; use of indexes required!
–
Full-text index
–
Path indexes
Lecture "XML and Databases" - Dr. Can Türker
5-9
Summary Text-based Storage

Schema
–
Not required

Further Specifics
–
Full-text functionality

Document reconstruction
–
Not needed since documents are stored
entirely as string


Queries
–
Information Retrieval queries
–
XML queries possible
Efficiency
–
String must be parsed at access time
 time-consuming
–
No concurrency for read and write
operations  no parallel executions

Application area
–
Document-centric XML applications
–
Storing semi-structured XML documents
Lecture "XML and Databases" - Dr. Can Türker
5-10
Model-based Storage

Generic storage of XML document tree
–
XML elements, XML attributes, etc. are nodes of tree

Store XML elements/attributes in corresponding database tables

Element
ID
Name
Value
Parent Element
Attribute
ID
Name
Value
Parent Element
Position
Extensions for adequate storage of different data types possible
Lecture "XML and Databases" - Dr. Can Türker
5-11
Model-based Storage: EDGE, BINARY
XML documents
store
store
auction
price
title
XML
…
5.20
auction
title
Stars
price
47.11
• BINARY variant: horizontal
partitioning of EDGE along the
values of the column label
Lecture "XML and Databases" - Dr. Can Türker
CREATE TABLE edge (
id
INTEGER NOT NULL,
doc
INTEGER NOT NULL, -- document id
parent
INTEGER,
label
VARCHAR(30) NOT NULL,
content VARCHAR(100),
nodetype VARCHAR(20) NOT NULL,
pos
INTEGER,-- element position
PRIMARY KEY (id, doc),
UNIQUE (parent, pos),
CHECK (nodetype<>ˈattributeˈ and pos is null)
);
id
doc
parent
label
content
nodetype
pos
1
1
null
store
null
element
1
2
1
1
auction
null
element
1
3
1
2
title
XML
element
1
4
1
2
price
5.20
element
2
5
2
null
store
null
element
1
6
2
5
auction
null
element
1
7
2
6
title
Stars
element
1
8
2
6
price
47.11
element
2
5-12
Model-based Storage: Queries
XML queries
XQuery
SQL
DB

XML (XPath, XQuery) queries internally transformed to SQL queries

XML query result derived from the SQL result
–
Labelling of the resulting tuples
–
Result must be an XML value

Example: Get the description of all auctions that have a price less than 10
SELECT a.content, b.content
FROM Edge a, Edge b
WHERE (a.label = 'price')
AND (a.content < 10.00)
AND (b.label = 'title')
AND (b.parent = a.parent)
AND (a.doc = b.doc)
Lecture "XML and Databases" - Dr. Can Türker
id
doc
parent
label
content
nodetype
pos
1
1
null
store
null
element
1
2
1
1
auction
null
element
1
3
1
2
title
XML
element
1
4
1
2
price
5.20
element
2
5
2
null
store
null
element
1
6
2
5
auction
null
element
1
7
2
6
title
Stars
element
1
8
2
6
price
47.11
element
2
5-13
Model-based Storage: DOM (1)
DOMImplementation
Node
NodeList
NamedNodeMap
Attr

Information of the Document Object Model
stored in a database
CharacterData
Comment
Text
CDataSection
Document
DocumentFragment
DocumentType
Element
Entity
EntityReference
Notation
ProcessingInstruction
Lecture "XML and Databases" - Dr. Can Türker
5-14
Model-based Storage: DOM (2)
Node type:
ELEMENT
2
1
book
title
The Autobiography
of Benjamin Franklin
Node type:
TEXT
Lecture "XML and Databases" - Dr. Can Türker
Node type:
ATTRIBUTE
3
price
8.99
NodeID
NodeType
DocID
ParentNode
1
ELEMENT
1
NULL
2
ELEMENT
1
1
3
ATTRIBUTE
1
1
4
TEXT
1
2
NodeID
TagName
NodeID
Content
b
book
4
The Auto...
2
title
NodeID
PreviousSibling
NextSibling
FirstChild
1
NULL
NULL
2
2
NULL
NULL
4
3
NULL
NULL
NULL
NodeID
ElementID
AttributeName
AttributeValue
3
1
price
8.99
5-15
Model-based Storage: DOM Methods
XML queries
DOM
SQL
DB

XML queries (DOM method calls) mapped to corresponding SQL queries

Result of the method call derived from the result of the SQL query
Lecture "XML and Databases" - Dr. Can Türker
5-16
Summary Model-based Storage

Schema
–
Not required

Document reconstruction
–
Possible but expensive

Queries
–
XML queries possible
–
Adapted database queries
–
Queries over many elements/attributes
are expensive
Lecture "XML and Databases" - Dr. Can Türker

Efficiency
–
Context-based navigation is efficient
–
Document reconstruction expensive
–
Path expression evaluation inefficient

Application area
–
Data-centric
–
Document-centric
–
Semi-structured
5-17
Structure-based Storage

Idea:
–
Shred XML documents and map the parts to corresponding database structures
–
If available derive database schema from a DTD or XML schema
–
Tolerate loss of native XML access

Problem
–
Automatic generation of database schema
–
Exploiting structure information in queries

Example: Basic mapping from a DTD to a database schema
–
Transform DTD in tree representation
–
Nodes represent element types, attributes, etc., i.e., the type layer!
–
Edges represent relationships between element types as well as certain restrictions
Lecture "XML and Databases" - Dr. Can Türker
5-18
Mapping DTD to Databases

Rules for mapping element types:
–
–
–
–
–
–

XML element type
Sequence of element types
Alternative of element types
Element type with quantifier ?
Element type with quantifier +,*
Nested element type
table column
table column
table column
table column allowing nulls
set/list-valued table column
tuple-valued table column
Rules for mapping attributes:
–
XML attribute
table column
–
IMPLIED
NULLABLE
–
REQUIRED
NOT NULLABLE
–
Default value
DEFAULT Constraint
Lecture "XML and Databases" - Dr. Can Türker
5-19
Mapping XML onto Relational Databases
<hotel>
<name>Eden</name>
<address>
<zip>8008</zip>
<city>Zurich</city>
<street>Bellerive</street>
<no>12</No>
</address>
<phone>+441234567</phone>
<phone>+449876543</phone>
</hotel>
Variant 1
Hotel
Id
Name
Zip
City
Street
No
1
Eden
8008
Zurich
Bellerive
12
Phone
Id
Text
Hotel
1
+441234567
1
2
+449876543
1
Variant 2
Hotel
Id
Name
Address
1
Eden
101
Address

Schema typically needed

Queries use SQL

RDBMS data types

Mapping collection types to separate tables

Mapping tuple types implicitly (Variant1; tuple structure lost) or separately (Variant 2)
Lecture "XML and Databases" - Dr. Can Türker
Id
Zip
City
Street
No
101
8008
Zurich
Bellerive
12
5-20
Mapping XML onto Object-Relational Databases
<hotel>
<name>Eden</name>
<address>
<zip>8008</zip>
<city>Zurich</city>
<street>Bellerive</street>
<no>12</No>
</address>
<phone>+441234567</phone>
<phone>+449876543</phone>
</hotel>
Hotel
ID
1
Name
Eden
<Address>
{Phone}
Zip
City
Street
No
8008
Zurich
Bellerive
12
{+441234567, +449876543}

Schema typically required

Queries use SQL

Straight-forward mapping of tuple and collection types to corresponding database data types
Lecture "XML and Databases" - Dr. Can Türker
5-21
Mapping XML onto Object-Relational Databases (2)
XML-DTD
Informix-Syntax:
<!ELEMENT book (title, author+,
edition?, publisher)>
<!ATTLIST book isbn CDATA #REQUIRED>
<!ELEMENT title
(#PCDATA)>
<!ELEMENT author
(first, last)>
<!ELEMENT first
(#PCDATA)>
<!ELEMENT last
(#PCDATA)>
<!ELEMENT edition
(#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
CREATE TABLE book (
isbn
VARCHAR(20) NOT NULL,
title
VARCHAR(100),
author
LIST(ROW(
first VARCHAR(30),
last VARCHAR(50))
NOT NULL),
edition
INTEGER,
publisher VARCHAR(100) NOT NULL,
);

ISBN
Title
3-89864-148-1
XML & Databases
{<Author>}
First
Last
Meike
Klettke
Holger
Meyer
Edition
Publisher
1
dpunkt
Problem: Map to an SQL schema that allows only XML documents which are valid w.r.t. the given DTD
Lecture "XML and Databases" - Dr. Can Türker
5-22
Mapping XML Alternatives

XML DTD supports alternatives

Example:

Mapping variants:
1. Each alternative is mapped to a separate column of a common table

<!ELEMENT author (name | ((first-name, last-name)))>
Parse tables with many null values
2. Each alternative is mapped to a separate table

Queries require union over several tables
3. Map to a single table column of type XML
<author>
<first-name>Benjamin</first-name>
<last-name>Franklin</last-name>
</author>
<author>
<first-name>Herman</first-name>
<last-name>Melville</last-name>
</author>
<author>
<name>Plato</name>
</author>
Lecture "XML and Databases" - Dr. Can Türker
Name
Variant 1
FirstName
LastName
NULL
Benjamin
Franklin
NULL
Herman
Melville
Plato
NULL
NULL
Name
Variant 2 Plato
FirstName
LastName
Benjamin
Franklin
Herman
Melville
5-23
Mapping Recursive XML Elements

XML DTD supports recursive XML elements

In database no endless recursion on instance level possible

Example: <!ELEMENT book (front, body, references)>
<!ELEMENT references (book+)>
<book>
...
<references>
<book>
...
<references>
...
</references>
</book>
</references>
</book>

book
<references>
book
ID
references
book
references
Solution: Map recursive XML elements to corresponding tables for each element type and use
primary/foreign keys and reference types, respectively, to connect these tables
Lecture "XML and Databases" - Dr. Can Türker
5-24
Mapping XML Element Ordering

Ordering is relevant in XML; must therefore preserved

Simple mapping rule: Add additional table column to hold the position of each element

Example:
<book>
<chapter>Introduction</chapter>
<chapter>Basic</chapter>
<chapter>Concept</chapter>
<chapter>Example</chapter>
<chapter>Conclusions</chapter>
</book>
Lecture "XML and Databases" - Dr. Can Türker
Chapter
Position
Introduction
1
Basics
2
Concept
3
Example
4
Conclusions
5
1-25
Mapping ANY, Mixed Content

No explicit representation available for content model ANY
–
Appropriate storage structure can only be selected on the instance level

Use text-based storage variants for mixed content
1. One column for entire document
MixedContent
<description>
Reach us at FGCZ:
<tram>10 min from SBB Zurich</tram>
<car>20 min from SBB Zurich</car>
Check www.fgcz.ch
</description>
2. Separate columns for reoccurring element types
Position
#PCDATA
Tram
Car
1
Reach us at FGCZ:
NULL
NULL
2
NULL
10 min from SBB Zurich
NULL
3
NULL
NULL
20 min from SBB Zurich
4
Check www.fgcz.ch
NULL
NULL
Lecture "XML and Databases" - Dr. Can Türker
5-26
Mapping with STORED

STORED: Semi-structured TO Relational Data

Basic idea: apply data mining on XML documents to get a good mapping onto database tables

Approach:
–
Input



–
Output


–
Relational database schema
STORED queries: rules for XML documents onto database tables
Algorithm



–
XML documents (at least a representative probe of the given XML document collection)
Query workload
Restrictions: hard disk space, number of database tables, etc.
Determine the parts of the XML tree parts that have the highest support in the document collection
and queries
These tree parts are explicitly mapped onto own tables
Irregular data mapped onto common table Overflow which is stored using the EDGE approach
Neither DTD nor XML schema required
Lecture "XML and Databases" - Dr. Can Türker
5-27
Mapping with STORED: Example
STORED query
Tree parts with
high support
FROM '//auction'
{
'title' : v1
'price' : v2
}
STORE auction(v1,v2)
XML documents
store
store
auction
…
price
title
auction
title
price
Side Table auction
unusual
doc
XML
5.20
Stars
47.11
XYZ
…
title
price
1
XML
5.20
2
Stars
47.11
Overflow Table
Lecture "XML and Databases" - Dr. Can Türker
doc
ordinal
name
datatype
value
2
3
unusual
string
XYZ
5-28
Structure-based Storage: Querying
XML queries
XQuery
SQL
DB

XML (XQuery) queries mapped to corresponding SQL queries
–
Mapping XML query to database query must be generated automatically
–
Process of mapping XML documents onto databases must be logged

Database query results must be mapped XML query results
Lecture "XML and Databases" - Dr. Can Türker
1-29
Structure-based Storage with Automatic Mapping

Schema
–
Required for DTD/XML schema mapping
–
Not needed in case of STORED

Document reconstruction
–
Logging mapping process including
inverse mapping required → not always
possible

Queries
–
Database queries
–
XML queries
–
No full-text operations
–
Usually no native access
Lecture "XML and Databases" - Dr. Can Türker

Further Specifics
–
Big schema → weakly filled database
tables (many null values)
–
No flexible data types, no explicit
storage support for alternatives

Efficiency
–
High due to full exploitation of the
DBMS

Application area
–
Data-centric XML applications
5-30
Structure-based Storage with Custom Mapping

User determines storage structures using custom mapping rules

XML document structure and database schema can be designed independently

Example:
<Hotel url="www.hotel-eden.ch">
<Name>Eden</Name>
<Address>
<Zip>8008</Zip>
<City>Zurich</City>
<Street>Bellerive</Street>
<No>12</No>
</Address>
<Price>198</Price>
</Hotel>
Price
Hotel
CHF
Link
Eden
198
www.hotel-eden.ch
<ClassMap>
<ElementType Name="Hotel"/>
<ToClassTable>
<Table Name="Price">
</ToClassTable>
<PropertyMap>
<Element Name="Name"/>
<ToColumn Name="Hotel" />
</PropertyMap>
<PropertyMap>
<Element Name="Price"/>
<ToColumn Name="CHF" />
</PropertyMap>
<PropertyMap>
<Attribute Name="url"/>
<ToColumn Name="Link" />
</PropertyMap>
</ClassMap>

XML queries possible if inverse mapping rules can be derived from the logged mapping rules

Inappropriate if full document reconstruction is required!
Lecture "XML and Databases" - Dr. Can Türker
5-31
Summary Structure-based Storage with Custom Mapping

Schema
–
Required

Efficiency
–
High due to DBMS

Document reconstruction
–
Requires inverse mappings
–
Usually not possible

Application area
–
Data-centric applications

Queries
–
Database queries
–
XML queries possible in restricted cases
Lecture "XML and Databases" - Dr. Can Türker
5-32
Hybrid Storage
Text-based
Storage
Full-Text Index
Model-based
Storage
Full-Text Index
and XML Index
Structure-based
Storage
Standard
Mapping
Custom
Mapping
Shredding
Degree
document-centric
semi-structured
•
Detecting data-centric and document-centric document parts
•
Use different storage approaches for the different document parts
Lecture "XML and Databases" - Dr. Can Türker
data-centric
1-33
Hybrid Storage

XML documents with data-centric and document-centric parts are difficult to map
–
Storage strategies for data-centric parts inappropriate for document-centric parts and vice
versa
–
None approach satisfies all requirements

Hybrid storage: use different storage models for different document parts
–
Combining text-based and structure-based storage
–
Combining structure-based storage and database type XML


Structured parts mapped to object-relational structures
Unstructured parts mapped to objects of type XML

Problem: Detecting and separating structured from irregular document parts

Basic idea of a hybrid approach to create a database schema
–
Build DTD or XML schema graph
–
For each graph node, determine a mass representing the significance of the node
–
Set shredding degree to determine granularity of resulting database schema
–
Derive table columns of type XML
–
Map structured document parts onto corresponding object-relational structures
Lecture "XML and Databases" - Dr. Can Türker
5-34
Conclusions

Mapping Strategies for Storing XML Documents
–
Text-based


–
Model-based

–
Analyze structure of XML documents to derive a database schema from the XML structure
Hybrid


Store the XML graph model
Structure-based

–
Store entire XML document as string
Plus full-text and/or path indexing
Combination of strategies above
No one-fits-all solution!
–
Appropriate storage solution mainly depends on the application area!
Lecture "XML and Databases" - Dr. Can Türker
5-35