XML Databases 10 O - IfIS - Technische Universität Braunschweig
Transcrição
XML Databases 10 O - IfIS - Technische Universität Braunschweig
XML Databases 10. XML Storage 1 – Overview Silke Eckstein Andreas Kupfer Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 10. XML Storage 1 10.1 Motivation 10.2 Text-based storage 10.2.1 Index structures 10.3 Model-based storage 10.4 Schema-based storage 10.5 Conclusion 10.6 Overview and References XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 2 10.1 Motivation • Applications require different types of XML documents – Structure vs. content – Regular vs. irregular • Thus, XML documents are – Data-centric – Document-centric – or somewhere in-between • Questions – Storage of XML documents – Efficient processing of queries on the stored documents or data • There are several methods for storage – 1st goal: Learn and understand methods – 2nd goal: Classify methods • Principles • Advantages and disadvantages • Usage XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 3 10.1 Motivation • Characterisation of XML documents: – Data-centric documents • Structured, regular • E.g. product catalog, order, invoice – Document-centric documents • Unstructured, irregular • E.g. scientific article, book, email, web page – Semi-structured documents • Data-centric and document-centric parts • E.g. publications, Amazon, MS Press (example chapters) XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 4 10.1 Motivation • Requirements for the physical layer: – Order preserving and lossless storage of XML documents – Efficient access to XML documents or parts thereof • Quick response time for – Queries – Update operations • • • • Indexing Transaction processing Support of XPath and XQuery Support of SAX and DOM for applications XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 5 10.1 Motivation • Storage approaches for XML documents – Text-based • Storage as character data – Model-based • Generic storage of the graph structure • Storage of the DOM – Schema-based • Mapping to (object-)relational databases – Deriving the database schema from the XML structure – Using user defined mapping procedures XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 6 10. XML Storage 1 10.1 Motivation 10.2 Text-based storage 10.2.1 Index structures 10.3 Model-based storage 10.4 Schema-based storage 10.5 Conclusion 10.6 Overview and References XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 7 10.2 Text-based storage • The whole XML document text is stored as character data – File in the file system – CLOB (Character-Large-OBject) in the DBS • Operations documents as a whole are very efficient – Reading and writing the whole document – But the content is monolithic and opaque with respect to the relational query engine (query can't inspect a fragment) • Getting granular access requires additional support – Full text index – Path index XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 8 10.2.1 Index structures • Index structures for XML documents allow efficient access for specific queries – Different types of indexes are optimized for different types of queries • Generate redundancy – Index has to be up-to-date by propagating data changes • Index structures can be storage structures as well – They define the storage method XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 9 10.2.1 Index structures • Types of index structures – Value index • Indexes atomar values of an XML document, like element content or attribute values • Index format for structured parts of XML documents • Already known from databases (B-trees, hash index, …) – Full text index • Indexes single words from the full text • Index format for unstructured parts of XML documents • Already known from Information Retrieval (inverted lists, tries, suffix trees, …) – Path index • Indexes subtrees/paths in an XML document • Index format for semistructured parts of XML documents • Already known from object-databases (access support relations, …) XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 10 10. 2.1 Index structures • B-tree as value index for an XML fragment document [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 11 10. 2.1 Index structures • Full text index – Not limited to exact matches • Keyword-based search and boolean retrieval • Pattern search (with regular expressions) – Use of • Statistical, word-based methods – Stop word removal – Elimination of uncommon items • Linguistic methods – Normalization of words (e.g. capitalisation, hyphenation,) – Word decomposition by rules (engl.) or dictionaries (german) – Stemming • Knowledge-based methods – Use of ontologies and thesauri to search for synonyms, hypernyms and hyponyms XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 12 10. 2.1 Index structures • Inverted list as full text index for XML word [Tür08] occurrence word position in the text XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 13 10. 2.1 Index structures word [Tür08] occurrence XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 14 10. 2.1 Index structures • Path index – Structure information must be identifiable and reconstructable • Assigning the markup to the content as well as • Representing the hierarchical nesting and order of elements/attributes – Especially suited for keyword search with regard to structure or path expressions FOR $b IN //book WHERE CONTAINS($b/author,"Benjamin") RETURN $b XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 15 10. 2.1 Index structures • Types of path indexes – Nested path index • Access to root node from every node – Multi-index • Accessing parent nodes – Join-index • Access parent and child nodes – Access Support Relations (ASR) • Generalization of indexes above, by listing all paths in a table [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 16 10. 2.1 Index structures • Conclusion – Efficient query processing on XML documents requires different types of index structures – Value index • For efficient access to structured parts • Keyword search, value search – Full text index • For efficient access to unstructured parts – Path index • Using the document structure • Navigating queries XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 17 10.2 Text-based storage • Summary text-based storage – Schema definition: • not required – Document reconstruction: • documents stay in their original format – Queries: • Information retrieval queries • Processing the markup of the queries • XML queries possible – Special features: • Full text functions – Efficiency: • Character string must be parsed on every access with XML processors expensive • No concurrency on read or write no parallel processing – Usage: • For document-centric XML applications • Suitable to only a limited extent also for semi-structured applications XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 18 10. XML Storage 1 10.1 Motivation 10.2 Text-based storage 10.2.1 Index structures 10.3 Model-based storage 10.4 Schema-based storage 10.5 Conclusion 10.6 Overview and References XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 19 10.3 Model-based storage • Idea: generic storage of the graph structure – XML elements, XML attributes, … are nodes of a graph – Nesting of elements defines edges – Nodes get an (internal) ID based on graph traversal • Using relations or object classes to store elements and attributes Elements ID Element name Value Reference to preceeding Rank Attributes ID Attribute name Value Reference to element • Document structure can be restored completely • Extension for data type adapted storage is possible XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 20 10.3 Model-based storage • The EDGE approach [FK99] XML documents – Variant BINARY: horizontal partition of EDGE based on label [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 21 10.3 Model-based storage • XML queries – XML queries (XPath, XQuery) are mapped to SQL queries (taking storage structures into account) – Result of XML query is generated from result of database query • "Labeling" of the result tuples • Result is in XML format [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 22 10.3 Model-based storage • Example: list bargain buy with prices SELECT a.content, b.content FROM Edge a, Edge b WHERE (a.label = 'price') AND (a.content < 10.00) AND (b.label = 'description') AND (b.parent = a.parent) AND (a.key = b.key) [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 23 10.3 Model-based storage • DOM-based storage – Information from the Document Object Model are stored in the database – Storage alternatives • (Object-)relational databases • Object-oriented databases • Developing own data structure [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 24 10.3 Model-based storage DOM-based storage – example Node type: ELEMENT Node type: ATTRIBUTE Node type: TEXT [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 25 10.3 Model-based storage • XML Queries – XML queries (DOM method invocations) are mapped to SQL queries (taking storage structures into account) – Result of method invocation is generated from result of database query [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 26 10.3 Model-based storage Summary model-based storage – Schema definition: • not required for storage – Document reconstruction: • Possible, but expensive – Queries: • XML queries possible • Adapted database queries – Special features: • Querying many elements/attributes is expensive – Efficiency: • Navigation from the given context is efficient • Restoring the document and evaluating path expressions is inefficient – Usage: • For data- and document-centric as well as for semi-structured XML applications XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 27 10. XML Storage 1 10.1 Motivation 10.2 Text-based storage 10.2.1 Index structures 10.3 Model-based storage 10.4 Schema-based storage 10.5 Conclusion 10.6 Overview and References XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 28 10.4 Schema-based storage • Motivation – XML content shall be stored in a conventional database – Accepting the loss of native access – DB schema is derieved from a DTD or an XML schema • Problem – Generate DB schema automatically – Thereby use as much structure information as possible • General approach for mapping from a DTD – – – – Transform DTD into a tree representation Nodes: element types, attributes, etc. (type layer!!!) Edges: nesting relationships of element types and their restrictions Traverse tree in order to transform nodes and edges into database tables (according to certain rules) XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 29 10.4 Schema-based storage • Generating the DB schema for a DTD: – Rules to map element types: XML element type Sequence of element types Alternative of element types Element type with quantifier ? Element type with quantifier +,* Nested element types column of a table columns of a table column of a table column with null values set/list of columns (SET OF, LIST OF) TUPLE OF – Rules to map attributes: XML attribute IMPLIED REQUIRED Default value column of a table null values allowed null values not allowed DEFAULT constraint XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 30 10.4 Schema-based storage • Mapping to relational databases – – – – DTD is usually required Queries use SQL functionality RDBMS data types are used (e.g. prices are NUMERIC) Problem: Mapping of collection types • Subdivide into additional relations – Example: Comment: Comment_ID 44901 Customer_Info: ID Fname Customer_info Feedback C0001 F0001 Lname C0001 Charles Sanchez Feedback: ID F001 Email C.Sanchez@hotmail... Type Content opinion Darjeeling Special… XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 31 10.4 Schema-based storage • Mapping with STORED (Semistructured TO RElational Data) – Basic idea: Use data mining techniques on the XML structure to find a good mapping to tables [DFS99] – Input • • • • XML documents (or an average sample of the collection) Query workload Restrictions of storage space, number of tables, … No DTD or XML schema is required! – Output • Relational schema • STORED-queries: Mapping instructions for XML documents to DB tables – Procedure • Determine the XML subtrees with the largest support in the collection and in the queries • These subtrees are materialised in tables • Irregular data is stored in overflow tables according to the EDGE approach XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 32 10.4 Structure-based storage • Mapping with STORED – example Subtrees with high support XML documents shown as tree structure [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 33 10.4 Schema-based storage • Mapping to object relational databases – DTD is usually required – Queries use SQL functionality – "Natural" mapping to tuple types, collection types – In case of irregular document structure databases contain many null values. Comment: Comment_ID 44901 <Customer_info> <Feedback> Fname Lname Email Type Charles Sanchez C.Sanchez@hotmail... opinion Darjeeling Specia… XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig Content 34 10.4 Schema-based storage [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 35 10.4 Schema-based storage • Mapping of recursive data definitions – DTDs can be recursive – Infinite recursion is impossible on instance layer of a database – Procedure: • • • • Marking the nodes Subdividing into separate tables Use primary and foreign keys in RDBMS Use reference types in ORDBMS <!ELEMENT book (front, body, references)> <!ELEMENT references (book+)> XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 36 10.4 Schema-based storage • Mapping of element sequences – Sequence can be important • Use an additional attribute in these cases – Example: <lecture> <lesson>Introduction</lesson> <lesson>XML basics</lesson> … ⇓ Order Lesson 1 Introduction 2 XML basics XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 37 10.4 Schema-based storage • Mapping of alternatives – XML allows to specify alternatives – Example: <!ELEMENT car (compactCar | sedan | van)*> – Three possible storage variants • Each alternative is stored as separate table column • Subdivide alternatives in separate tables • Use a table column of type XML type XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 38 10.4 Schema-based storage • Variant 1 – all alternatives in one table • – Problem: many null values (wasting storage space) [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 39 10.4 Schema-based storage • Variant 2 – subdivided into multiple tables • – For queries, combination of tables is needed [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 40 10.4 Schema-based storage • Variant 3 – Using column type XML – XML type allows XML queries or DOM methods [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 41 10.4 Schema-based storage Mapping of mixed content – example [Tür08] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 42 10.4 Schema-based storage • Mapping of mixed content – Mapping to plain tables is ill-suited – Use variant 3 from above or • Content model ANY is not representable at all – Arbitrary content, arbitrary element types – Often the fitting storage structure can only be decided on instance layer XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 43 10.4 Schema-based storage • Schema-based storage with automatic mapping – Advantages • Queries, data types, aggregation functions, views • Integration in other databases when storing structured data – Disadvantages • Large schema, sparsely filled databases (many null values) • No flexible data types, storage of alternatives has problems • Less flexible queries – No information retrieval queries possible without additional extensions – No full text operations for semi- or unstructured data – Usually native access is not possible any more XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 44 10.4 Schema-based storage • Mapping solutions with different specializations – Algorithms, middleware, commercial applications, … – Varying amount of required input or user decisions – Many algorithms create different database schemas • Two phases – Mapping • Assign a place for each node type in the DB – Shredding • Import the XML data as DB tuples XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 45 10.4 Schema-based storage Algorithm/product [Bus08] |based on: n/a DTD schema |restrictions: keys cardin. types | DTD optimisation XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 46 10.4 Schema-based storage • The shredder can be part of the DB – Usually requires an XML schema – In the IBM Data Studio, the shredder is part of the "annotated XML schema decomposition" – Direct approach in DB2: • register the XML schema and call the stored procedure: register xmlschema http://our.org/custacc from dec_files/custacc.xsd as cust_schema ; complete xmlschema cust_schema enable decomposition ; call SYSPROC.XDBDECOMPXML ('VRODRIG', 'CUST_SCHEMA', ? , ?, 1, null, null, null) XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 47 10.4 Schema-based storage • Shredding without XML schema in DB2 – XMLTABLE function in combination with an INSERT INSERT INTO ENVELOPEXT (MAILFROM, MAILTO, MAILDATE, SUBJECT) SELECT MAILFROM, MAILTO, MAILDATE, SUBJECT FROM XMLTABLE( XMLNAMESPACES('http://www.sal.com/mails' AS "email"), '$doc/email:mails/mail' (: some xquery-expression :) PASSING xml-source AS "doc" COLUMNS MAILFROM VARCHAR (100) PATH 'envelope/from', MAILTO VARCHAR (100) PATH 'envelope/to', MAILDATE VARCHAR (30) PATH 'envelope/email:Date', SUBJECT VARCHAR (100) PATH 'envelope/Subject') AS T; http://www.ibm.com/developerworks/db2/l ibrary/techarticle/dm-0801ledezma/ XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 48 10.4 Schema-based storage • Summary Schema-based storage with automatic mapping – Schema definition: • Is usually required and analysed • not required, e.g. for STORED – Document reconstruction: • Limited (requires logging of the mapping process) – Queries: • Database queries • XML queries possible,but lack the XPath horizontal axes, e.g. following, preceding-sibling – Special features: • Federation with existing databases is possible – Efficiency: • High efficiency by using the DB-engine – Usage: • For data-centric XML applications, but with limited nesting XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 49 10.4 Schema-based storage • User defined mapping – Idea • In all previously shown methods it is not possible to affect the storage in the DB • With user defined mappings the user defines the storage structure • The structure of XML documents and database schema can be designed independently from each other • Also possible: storing XML documents in existing databases – Annotation of DTD and XML schema, respectively • In many cases the mapping definition is combined with existing schema information – Only limited XML queries possible • Logging of the mapping process from XML documents to databases • For a given query all relevant data has to be stored (lossless mapping) XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 50 10.4 Schema-based storage • Example: XML document [Tür08] mapping instruction XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 51 10.4 Schema-based storage • Mapping instruction – Example syntax for XML-DBMS (Roland Bourret) <ClassMap> <ElementType Name="sales:SalesOrder"/> <ToClassTable> <Table Name="Sales"/> </ToClassTable> <PropertyMap> <Attribute Name="SONumber"/> <ToColumn> <Column Name="Number"/> </ToColumn> </PropertyMap> </ClassMap> XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig Connection between elements and tables Connection between elements/attributes and table columns 52 10.4 Schema-based storage • Remarks – Many different mapping languages or schema annotations • Automatic mappings usually have an internal mapping language – Remember the mapping constructs from lecture 5 and 6. The SQL/XML annotations are a mapping language, too. – DB2 uses similar annotations as SQL/XML • On the next slide, the example from lecture 6 is shown with DB2 syntax XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 53 <xsd:complexType xmlns:db2-xdb= "http://www.ibm.com/xmlns/prod/db2/xdb1" name="ROW.ACCOUNT"> <xsd:sequence> <xsd:element name="NAME" Mapping SQL type="CHAR_20" table columns to db2-xdb:rowSet="Account" XML SQL/XML elements db2-xdb:column="Name"/> schema <xsd:element name="BALANCE" annotations in type="NUMERIC_12_2"/> Mapping table DB2 db2-xdb:rowSet="Account" rows to XML (table is called db2-xdb:column="Balance"/> <row> rowSet) </xsd:sequence> elements </xsd:complexType> Mapping SQL tables CREATE TABLE Account ( Name CHAR(20), Balance NUMERIC(12,2), ); Name Balance Joe 2000 Jim 3500 <ACCOUNT> <row> <NAME>Joe</NAME> <BALANCE>2000</BALANCE> </row> <row> <NAME>Jim</NAME> <BALANCE>3500</BALANCE> </row> </ACCOUNT> [Tür08] <xsd:complexType name="TABLE.ACCOUNT"> <xsd:sequence> <xsd:element name="row" type="ROW.ACCOUNT"/> </xsd:sequence> </xsd:complexType> <xsd:element name="ACCOUNT" type="TABLE.ACCOUNT"/> XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 54 10.4 Schema-based storage • Summary schema-based storage with user defined mapping – Schema definition: • Depends on mapping language – Document reconstruction: • Not possible in most cases (requires logging of the mapping process) – Queries: • Database queries • XML queries in rare cases only! – Special features: • Integration with existing databases is possible – Efficiency: • High efficiency by using the DB-engine – Usage: • For data-centric XML applications XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 55 10. XML Storage 1 10.1 Motivation 10.2 Text-based storage 10.2.1 Index structures 10.3 Model-based storage 10.4 Schema-based storage 10.5 Conclusion 10.6 Overview and References XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 56 10.5 Conclusion • Different methods for storage of XML documents – Text-based • Storing whole XML documents as string • Can use full text index or path index – Model-based • Generic mapping of the tree structure – Schema-based • Detect and analyse the structure of the XML documents • Derive a DB schema from the structure – Hybrid approaches • A combination of some of those methods – No algorithm has the optimal solution for all kind of XML documents – Reasonable solution is heavily dependent on the application XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 57 10.6 References • "XML und Datenbanken" [Tür08] – Can Türker – Lecture, University of Zurich, 2008 • "XML und Datenbanken" [KM03] – M. Klettke, H. Meier – dpunkt.verlag, 2003 • "Generierung eines adaptiven Datenbankschemas für datenzentrierte XMLDokumente" [Bus08] – Carsten Busche – Diplomarbeit, TU Braunschweig, 2008 • [FK99] – D. Florescu, D. Kossmann: Storing and Querying XML Data using an RDBMS. IEEE Data engineering Bulletin (DEBU), Volume 22(3), Seiten 27-34, 1999. • [DFS99] – A. Deutsch, M.F. Fernández, D. Suciu: Storing Semistructured Data with STORED. Proceedings of the 1999 ACM SIGMOD international conference on Management of data, Seiten 431-442, ACM, 1999. XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 58 10.6 Overview 1. 2. 3. 4. 5. Introduction XML Basics Schema definition XML query languages I Mapping relational data to XML 6. SQL/XML 7. XML processing 8. XML query languages II – XQuery Data Model 9. XML query languages III – XQuery 10. XML storage I – Overview 11.XML storage II 12. Updates / Transactions 13. Systems XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 59 Questions, Ideas, Comments • Now, or ... • Room: IZ 232 • Office our: Tuesday, 12:30 – 13:30 Uhr or on appointment • Email: [email protected] XML Databases – Silke Eckstein – Institut für Informationssysteme – TU Braunschweig 60