XML und Datenbanken
Transcrição
XML und Datenbanken
Chapter 5 Mapping XML to Databases Three Layer Architecture Text-based Storage Model-based Storage Structure-based Storage Hybrid Storage Three Layer Architecture for Database Processing Conceptual Layer Logical Layer Physical Layer Abstract database schema e.g. ER schema Concrete database schema e.g. table schema Storage structures and access paths e.g. B-tree index Abstract Instances e.g. entities Concrete Instances e.g. tables and tuples Data sets in database pages Conceptual Layer: ERM, UML, ... Logical Layer: relational model, NF2 model, OR model, ... Physical Layer: internal storage, indexing, updates, transaction processing, query optimization Lecture "XML and Databases" - Dr. Can Türker 5-2 Three Layer Architecture for Document Processing Conceptual Layer Logical Layer Physical Layer Style sheet Document structure File Document Access structure e.g. full-text index Conceptual Layer: – editors – basic agreement on the document structure (intro before conclusions etc.) Logical Layer: document model, document instances Physical Layer: internal storage, document indexing, query processing Lecture "XML and Databases" - Dr. Can Türker 5-3 Three Layer Architecture for XML Processing Conceptual Layer Logical Layer <..> <..> <..> Document structure e.g. XML schema Physical Layer </..> </..> </..> XML documents, XML elements Conceptual Layer: XML schema, DTD Logical Layer: XML documents, XML elements, etc. Physical Layer - Storage: – Single file inappropriate for large document collections – Hot to store large XML documents? – How to support queries efficiently? Lecture "XML and Databases" - Dr. Can Türker ? Storage structure Access structure 5-4 Types of XML Documents <order> <customer>Meyer</customer> <position> <isbn>1-234-56789-0</isbn> <number>2</number> <price currency="Euro">30.00</price> </position> </order> Structured – regular – example: product catalogue, orders, invoices – data-centric applications Unstructured – irregular – example: articles, books, emails, Web pages – document-centric applications <content> XML builds on the principles of two existing languages, <em>HTML</em> and <em>SGML</em> to create a simple mechanism ... The generalized markup concept ... </content> Semi-structured – structured and unstructured parts – example: publications, Amazon, MS Press (example chapters) <book> <author>Neil Bradley</author> <title>XML companion</title> <isbn>1-234-56789-0</isbn> <content> XML builds on the principles of two existing languages, <em>HTML</em> and ... </content> </book> Lecture "XML and Databases" - Dr. Can Türker 5-5 Three Layer Architecture and XML Document Types Conceptual Layer Logical Layer Physical Layer document-centric semi-structured data-centric Modeling structure and content Modeling structure and content Modeling structure Tree-based editors XML ER, UML Document model Data and document model Data model Queries/Updates on structure and content Queries/Updates on structure and content Queries/Updates on content XML, SGML XML, OEM XML, RDM, OODM XPath/XQuery, DOM, IR queries XPath/XQuery, XIRQL XQuery, SQL, OQL Structure on instance level Structure on schema and instance level Structure on schema level Files and full-text indexes, structure indexes Semi-structured databases and full-text indexes, path indexes (Object-)Relational databases and B-tree / hash indexes Lecture "XML and Databases" - Dr. Can Türker 5-6 Requirements on the Physical Layer Order preserving and lossless storage of XML documents Efficient access to XML documents and selected document parts – Short response times for – – – – – Queries Updates Physical design of the storage layer Indexing Transaction processing Support for XPath/XQuery Support for XML processors such DOM and SAX Lecture "XML and Databases" - Dr. Can Türker 5-7 Storage Strategies for XML Documents 1. Text-based – Store XML document as string 2. Model-based – Store XML graph structure and DOM informations 3. Structure-based – Derive database schema from XML structure – Custom mapping of XML structure to database schema Lecture "XML and Databases" - Dr. Can Türker 5-8 Text-based Storage Entire XML document stored as – File in the file system – CLOB (Character Large Object) in a database system Id Content 1 <book genre="autobiography"> <title>The Autobiography of Benjamin Franklin</title> <author> <first-name>Benjamin</first-name> <last-name>Franklin</last-name> </author> <price currency="USD">8.99</price> </book> Operations (read/write) on entire XML document are efficient No explicit support for more fine-granule access; use of indexes required! – Full-text index – Path indexes Lecture "XML and Databases" - Dr. Can Türker 5-9 Summary Text-based Storage Schema – Not required Further Specifics – Full-text functionality Document reconstruction – Not needed since documents are stored entirely as string Queries – Information Retrieval queries – XML queries possible Efficiency – String must be parsed at access time time-consuming – No concurrency for read and write operations no parallel executions Application area – Document-centric XML applications – Storing semi-structured XML documents Lecture "XML and Databases" - Dr. Can Türker 5-10 Model-based Storage Generic storage of XML document tree – XML elements, XML attributes, etc. are nodes of tree Store XML elements/attributes in corresponding database tables Element ID Name Value Parent Element Attribute ID Name Value Parent Element Position Extensions for adequate storage of different data types possible Lecture "XML and Databases" - Dr. Can Türker 5-11 Model-based Storage: EDGE, BINARY XML documents store store auction price title XML … 5.20 auction title Stars price 47.11 • BINARY variant: horizontal partitioning of EDGE along the values of the column label Lecture "XML and Databases" - Dr. Can Türker CREATE TABLE edge ( id INTEGER NOT NULL, doc INTEGER NOT NULL, -- document id parent INTEGER, label VARCHAR(30) NOT NULL, content VARCHAR(100), nodetype VARCHAR(20) NOT NULL, pos INTEGER,-- element position PRIMARY KEY (id, doc), UNIQUE (parent, pos), CHECK (nodetype<>ˈattributeˈ and pos is null) ); id doc parent label content nodetype pos 1 1 null store null element 1 2 1 1 auction null element 1 3 1 2 title XML element 1 4 1 2 price 5.20 element 2 5 2 null store null element 1 6 2 5 auction null element 1 7 2 6 title Stars element 1 8 2 6 price 47.11 element 2 5-12 Model-based Storage: Queries XML queries XQuery SQL DB XML (XPath, XQuery) queries internally transformed to SQL queries XML query result derived from the SQL result – Labelling of the resulting tuples – Result must be an XML value Example: Get the description of all auctions that have a price less than 10 SELECT a.content, b.content FROM Edge a, Edge b WHERE (a.label = 'price') AND (a.content < 10.00) AND (b.label = 'title') AND (b.parent = a.parent) AND (a.doc = b.doc) Lecture "XML and Databases" - Dr. Can Türker id doc parent label content nodetype pos 1 1 null store null element 1 2 1 1 auction null element 1 3 1 2 title XML element 1 4 1 2 price 5.20 element 2 5 2 null store null element 1 6 2 5 auction null element 1 7 2 6 title Stars element 1 8 2 6 price 47.11 element 2 5-13 Model-based Storage: DOM (1) DOMImplementation Node NodeList NamedNodeMap Attr Information of the Document Object Model stored in a database CharacterData Comment Text CDataSection Document DocumentFragment DocumentType Element Entity EntityReference Notation ProcessingInstruction Lecture "XML and Databases" - Dr. Can Türker 5-14 Model-based Storage: DOM (2) Node type: ELEMENT 2 1 book title The Autobiography of Benjamin Franklin Node type: TEXT Lecture "XML and Databases" - Dr. Can Türker Node type: ATTRIBUTE 3 price 8.99 NodeID NodeType DocID ParentNode 1 ELEMENT 1 NULL 2 ELEMENT 1 1 3 ATTRIBUTE 1 1 4 TEXT 1 2 NodeID TagName NodeID Content b book 4 The Auto... 2 title NodeID PreviousSibling NextSibling FirstChild 1 NULL NULL 2 2 NULL NULL 4 3 NULL NULL NULL NodeID ElementID AttributeName AttributeValue 3 1 price 8.99 5-15 Model-based Storage: DOM Methods XML queries DOM SQL DB XML queries (DOM method calls) mapped to corresponding SQL queries Result of the method call derived from the result of the SQL query Lecture "XML and Databases" - Dr. Can Türker 5-16 Summary Model-based Storage Schema – Not required Document reconstruction – Possible but expensive Queries – XML queries possible – Adapted database queries – Queries over many elements/attributes are expensive Lecture "XML and Databases" - Dr. Can Türker Efficiency – Context-based navigation is efficient – Document reconstruction expensive – Path expression evaluation inefficient Application area – Data-centric – Document-centric – Semi-structured 5-17 Structure-based Storage Idea: – Shred XML documents and map the parts to corresponding database structures – If available derive database schema from a DTD or XML schema – Tolerate loss of native XML access Problem – Automatic generation of database schema – Exploiting structure information in queries Example: Basic mapping from a DTD to a database schema – Transform DTD in tree representation – Nodes represent element types, attributes, etc., i.e., the type layer! – Edges represent relationships between element types as well as certain restrictions Lecture "XML and Databases" - Dr. Can Türker 5-18 Mapping DTD to Databases Rules for mapping element types: – – – – – – XML element type Sequence of element types Alternative of element types Element type with quantifier ? Element type with quantifier +,* Nested element type table column table column table column table column allowing nulls set/list-valued table column tuple-valued table column Rules for mapping attributes: – XML attribute table column – IMPLIED NULLABLE – REQUIRED NOT NULLABLE – Default value DEFAULT Constraint Lecture "XML and Databases" - Dr. Can Türker 5-19 Mapping XML onto Relational Databases <hotel> <name>Eden</name> <address> <zip>8008</zip> <city>Zurich</city> <street>Bellerive</street> <no>12</No> </address> <phone>+441234567</phone> <phone>+449876543</phone> </hotel> Variant 1 Hotel Id Name Zip City Street No 1 Eden 8008 Zurich Bellerive 12 Phone Id Text Hotel 1 +441234567 1 2 +449876543 1 Variant 2 Hotel Id Name Address 1 Eden 101 Address Schema typically needed Queries use SQL RDBMS data types Mapping collection types to separate tables Mapping tuple types implicitly (Variant1; tuple structure lost) or separately (Variant 2) Lecture "XML and Databases" - Dr. Can Türker Id Zip City Street No 101 8008 Zurich Bellerive 12 5-20 Mapping XML onto Object-Relational Databases <hotel> <name>Eden</name> <address> <zip>8008</zip> <city>Zurich</city> <street>Bellerive</street> <no>12</No> </address> <phone>+441234567</phone> <phone>+449876543</phone> </hotel> Hotel ID 1 Name Eden <Address> {Phone} Zip City Street No 8008 Zurich Bellerive 12 {+441234567, +449876543} Schema typically required Queries use SQL Straight-forward mapping of tuple and collection types to corresponding database data types Lecture "XML and Databases" - Dr. Can Türker 5-21 Mapping XML onto Object-Relational Databases (2) XML-DTD Informix-Syntax: <!ELEMENT book (title, author+, edition?, publisher)> <!ATTLIST book isbn CDATA #REQUIRED> <!ELEMENT title (#PCDATA)> <!ELEMENT author (first, last)> <!ELEMENT first (#PCDATA)> <!ELEMENT last (#PCDATA)> <!ELEMENT edition (#PCDATA)> <!ELEMENT publisher (#PCDATA)> CREATE TABLE book ( isbn VARCHAR(20) NOT NULL, title VARCHAR(100), author LIST(ROW( first VARCHAR(30), last VARCHAR(50)) NOT NULL), edition INTEGER, publisher VARCHAR(100) NOT NULL, ); ISBN Title 3-89864-148-1 XML & Databases {<Author>} First Last Meike Klettke Holger Meyer Edition Publisher 1 dpunkt Problem: Map to an SQL schema that allows only XML documents which are valid w.r.t. the given DTD Lecture "XML and Databases" - Dr. Can Türker 5-22 Mapping XML Alternatives XML DTD supports alternatives Example: Mapping variants: 1. Each alternative is mapped to a separate column of a common table <!ELEMENT author (name | ((first-name, last-name)))> Parse tables with many null values 2. Each alternative is mapped to a separate table Queries require union over several tables 3. Map to a single table column of type XML <author> <first-name>Benjamin</first-name> <last-name>Franklin</last-name> </author> <author> <first-name>Herman</first-name> <last-name>Melville</last-name> </author> <author> <name>Plato</name> </author> Lecture "XML and Databases" - Dr. Can Türker Name Variant 1 FirstName LastName NULL Benjamin Franklin NULL Herman Melville Plato NULL NULL Name Variant 2 Plato FirstName LastName Benjamin Franklin Herman Melville 5-23 Mapping Recursive XML Elements XML DTD supports recursive XML elements In database no endless recursion on instance level possible Example: <!ELEMENT book (front, body, references)> <!ELEMENT references (book+)> <book> ... <references> <book> ... <references> ... </references> </book> </references> </book> book <references> book ID references book references Solution: Map recursive XML elements to corresponding tables for each element type and use primary/foreign keys and reference types, respectively, to connect these tables Lecture "XML and Databases" - Dr. Can Türker 5-24 Mapping XML Element Ordering Ordering is relevant in XML; must therefore preserved Simple mapping rule: Add additional table column to hold the position of each element Example: <book> <chapter>Introduction</chapter> <chapter>Basic</chapter> <chapter>Concept</chapter> <chapter>Example</chapter> <chapter>Conclusions</chapter> </book> Lecture "XML and Databases" - Dr. Can Türker Chapter Position Introduction 1 Basics 2 Concept 3 Example 4 Conclusions 5 1-25 Mapping ANY, Mixed Content No explicit representation available for content model ANY – Appropriate storage structure can only be selected on the instance level Use text-based storage variants for mixed content 1. One column for entire document MixedContent <description> Reach us at FGCZ: <tram>10 min from SBB Zurich</tram> <car>20 min from SBB Zurich</car> Check www.fgcz.ch </description> 2. Separate columns for reoccurring element types Position #PCDATA Tram Car 1 Reach us at FGCZ: NULL NULL 2 NULL 10 min from SBB Zurich NULL 3 NULL NULL 20 min from SBB Zurich 4 Check www.fgcz.ch NULL NULL Lecture "XML and Databases" - Dr. Can Türker 5-26 Mapping with STORED STORED: Semi-structured TO Relational Data Basic idea: apply data mining on XML documents to get a good mapping onto database tables Approach: – Input – Output – Relational database schema STORED queries: rules for XML documents onto database tables Algorithm – XML documents (at least a representative probe of the given XML document collection) Query workload Restrictions: hard disk space, number of database tables, etc. Determine the parts of the XML tree parts that have the highest support in the document collection and queries These tree parts are explicitly mapped onto own tables Irregular data mapped onto common table Overflow which is stored using the EDGE approach Neither DTD nor XML schema required Lecture "XML and Databases" - Dr. Can Türker 5-27 Mapping with STORED: Example STORED query Tree parts with high support FROM '//auction' { 'title' : v1 'price' : v2 } STORE auction(v1,v2) XML documents store store auction … price title auction title price Side Table auction unusual doc XML 5.20 Stars 47.11 XYZ … title price 1 XML 5.20 2 Stars 47.11 Overflow Table Lecture "XML and Databases" - Dr. Can Türker doc ordinal name datatype value 2 3 unusual string XYZ 5-28 Structure-based Storage: Querying XML queries XQuery SQL DB XML (XQuery) queries mapped to corresponding SQL queries – Mapping XML query to database query must be generated automatically – Process of mapping XML documents onto databases must be logged Database query results must be mapped XML query results Lecture "XML and Databases" - Dr. Can Türker 1-29 Structure-based Storage with Automatic Mapping Schema – Required for DTD/XML schema mapping – Not needed in case of STORED Document reconstruction – Logging mapping process including inverse mapping required → not always possible Queries – Database queries – XML queries – No full-text operations – Usually no native access Lecture "XML and Databases" - Dr. Can Türker Further Specifics – Big schema → weakly filled database tables (many null values) – No flexible data types, no explicit storage support for alternatives Efficiency – High due to full exploitation of the DBMS Application area – Data-centric XML applications 5-30 Structure-based Storage with Custom Mapping User determines storage structures using custom mapping rules XML document structure and database schema can be designed independently Example: <Hotel url="www.hotel-eden.ch"> <Name>Eden</Name> <Address> <Zip>8008</Zip> <City>Zurich</City> <Street>Bellerive</Street> <No>12</No> </Address> <Price>198</Price> </Hotel> Price Hotel CHF Link Eden 198 www.hotel-eden.ch <ClassMap> <ElementType Name="Hotel"/> <ToClassTable> <Table Name="Price"> </ToClassTable> <PropertyMap> <Element Name="Name"/> <ToColumn Name="Hotel" /> </PropertyMap> <PropertyMap> <Element Name="Price"/> <ToColumn Name="CHF" /> </PropertyMap> <PropertyMap> <Attribute Name="url"/> <ToColumn Name="Link" /> </PropertyMap> </ClassMap> XML queries possible if inverse mapping rules can be derived from the logged mapping rules Inappropriate if full document reconstruction is required! Lecture "XML and Databases" - Dr. Can Türker 5-31 Summary Structure-based Storage with Custom Mapping Schema – Required Efficiency – High due to DBMS Document reconstruction – Requires inverse mappings – Usually not possible Application area – Data-centric applications Queries – Database queries – XML queries possible in restricted cases Lecture "XML and Databases" - Dr. Can Türker 5-32 Hybrid Storage Text-based Storage Full-Text Index Model-based Storage Full-Text Index and XML Index Structure-based Storage Standard Mapping Custom Mapping Shredding Degree document-centric semi-structured • Detecting data-centric and document-centric document parts • Use different storage approaches for the different document parts Lecture "XML and Databases" - Dr. Can Türker data-centric 1-33 Hybrid Storage XML documents with data-centric and document-centric parts are difficult to map – Storage strategies for data-centric parts inappropriate for document-centric parts and vice versa – None approach satisfies all requirements Hybrid storage: use different storage models for different document parts – Combining text-based and structure-based storage – Combining structure-based storage and database type XML Structured parts mapped to object-relational structures Unstructured parts mapped to objects of type XML Problem: Detecting and separating structured from irregular document parts Basic idea of a hybrid approach to create a database schema – Build DTD or XML schema graph – For each graph node, determine a mass representing the significance of the node – Set shredding degree to determine granularity of resulting database schema – Derive table columns of type XML – Map structured document parts onto corresponding object-relational structures Lecture "XML and Databases" - Dr. Can Türker 5-34 Conclusions Mapping Strategies for Storing XML Documents – Text-based – Model-based – Analyze structure of XML documents to derive a database schema from the XML structure Hybrid Store the XML graph model Structure-based – Store entire XML document as string Plus full-text and/or path indexing Combination of strategies above No one-fits-all solution! – Appropriate storage solution mainly depends on the application area! Lecture "XML and Databases" - Dr. Can Türker 5-35