An Architecture for Dimensional and Spatial Analyses Integration
Ana Cristina F. Ferreira (1), Maria Luiza Campos (1), Astério Kiyoshi Tanaka (2)

ABSTRACT

During the last 30 years, decisions involving conventional data have been supported by traditional decision support systems, such as EIS (Executive Information Systems), DSS (Decision Support Systems) and, lately, OLAP (On-Line Analytical Processing) systems on Data Warehouse (DW) environments. Another kind of system also used by decision makers is the Geographical Information System (GIS). These systems focus on spatial analysis and are more commonly used in specific application domains such as Cartography, Insurance, Health and Environment Management. In this paper, we propose an architecture that supports dimensional and spatial analyses and takes advantage of the main features and functionalities of OLAP and GIS tools through an integration model at the conceptual/logical level. A case study that applies the proposed framework is described.

1. Introduction

Since the 70's, decision-making processes have been supported by information systems, implemented either on file systems or on database management systems (DBMSs). More recently, two groups of systems have emerged: systems based on Data Warehouse (DW), most often using On-Line Analytical Processing (OLAP) tools/applications, and Geographical Information Systems (GISs).

Systems based on DW are oriented to business decision making. In this domain, the temporal aspect is usually a very important factor in the analyses. Information is presented as cubes consisting of several dimensions. Each dimension represents a perspective through which data can be analyzed [VT99]. OLAP tools offer powerful functionalities to carry out multidimensional analyses, supporting management decisions.

GISs are information systems where data are analyzed from a geographical perspective and the results of the analyses are presented in the form of maps.
The use of GISs to support decision-making processes with a strong emphasis on spatial analysis is continually increasing. However, although time and space are dimensions commonly used together in decision making, the functionality to deal properly with both dimensions simultaneously cannot be found in a single tool. GIS and OLAP tools appeared in different periods, have different focuses, and are used in different domains, normally by users with different skills.

(1) Departamento de Ciência da Computação - IM - Universidade Federal do Rio de Janeiro, RJ, Brasil.
(2) Departamento de Informática Aplicada - UNI-RIO - Universidade do Rio de Janeiro, RJ, Brasil.

A common problem found in enterprise information systems is the complexity of combining spatial and dimensional analyses. When an enterprise has an OLAP application and wants to perform spatial analyses on its data, it is necessary: 1) to understand the concepts of one specific tool; 2) to search for the source data in the database used by the OLAP tool; and 3) to construct some mechanism to migrate data from one system to the other. For each specific pair of tools (OLAP, GIS), all these processes must be repeated. The same occurs when a GIS is in use and the same data must be visualized from the OLAP perspective.

This paper presents a generic architecture based on a model that integrates concepts of both approaches. Mapping concepts from each tool (OLAP or GIS) to corresponding concepts of the integration model creates a path between the two approaches. With this approach, data originally structured for one tool can be viewed and translated into constructs supported by the other tool.

The remainder of this paper is organized as follows: sections 2 and 3 present, respectively, the main characteristics and functionalities of OLAP and GIS tools that are relevant to our work.
Section 4 describes an architecture where such functionalities are integrated and highlights the advantages of this integration solution. In section 5, a case study using the integration model is presented. In section 6, the proposed architecture is compared to some related works and, finally, section 7 presents our final remarks.

2. OLAP Tools

OLAP applications focus on end users' analytical requirements and on the modeling and computation necessary to meet them, without detailing the processes involved in making raw data available or in ensuring their accuracy [Thom97]. For those reasons, OLAP applications are well integrated in Data Warehouse environments and are usually built over them [CWM00].

OLAP systems organize data using the multidimensional paradigm in the form of data cubes, each of which is a combination of multiple dimensions. Each cell of the data cube corresponds to a unique set of values for the different dimensions and contains the values of the measures associated with this set of values. The members of one dimension can be organized based on parent-child relationships. A parent member usually represents the consolidation of its child members; these relationships between members of one dimension are called hierarchies [Thom97]. A dimension contains one or more natural hierarchies, together with other attributes that do not have a hierarchical relationship to any of the attributes in the dimension [Kimb98].

Typical operations and OLAP functionalities offered to end users include the aggregation or de-aggregation of information along a dimension (roll-up and drill-down), the selection of specific parts of a cube (slicing) and the re-orientation of the multidimensional view of data on the display (pivoting) ([OLAP97, VT99]). Several characteristics and functionalities expected in OLAP tools can be found in [Thom97, Kimb98, OLAP97].
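The operations named above can be illustrated with a minimal Python sketch. It is not taken from any OLAP tool: the cube is modeled as a plain dictionary from dimension coordinates to one measure, and the cell values, the coordinate order (year, month, state, city) and the function names roll_up, slice_cube and pivot are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical cube cells: (year, month, state, city) -> one numeric measure.
# All values are invented for illustration.
cells = {
    ("2000", "01", "RJ", "Rio de Janeiro"): 12,
    ("2000", "01", "RJ", "Niteroi"): 5,
    ("2000", "02", "RJ", "Rio de Janeiro"): 7,
    ("2000", "01", "SP", "Sao Paulo"): 20,
    ("2001", "01", "SP", "Campinas"): 4,
}


def roll_up(cube, keep):
    """Sum the measure over every coordinate position not listed in `keep`."""
    totals = defaultdict(int)
    for coords, value in cube.items():
        totals[tuple(coords[i] for i in keep)] += value
    return dict(totals)


def slice_cube(cube, position, member):
    """Keep only the cells whose coordinate at `position` equals `member`."""
    return {c: v for c, v in cube.items() if c[position] == member}


def pivot(table):
    """Swap the two axes of a table keyed by two coordinates."""
    return {(b, a): v for (a, b), v in table.items()}


# Roll-up: from (year, month, state, city) to (year, state).
by_year_state = roll_up(cells, keep=(0, 2))
# Drill-down is the reverse move: ask for a finer level again.
by_year_state_city = roll_up(cells, keep=(0, 2, 3))
# Slice: restrict the cube to one member of the state level.
rj_only = slice_cube(cells, position=2, member="RJ")
# Pivot: present the same totals as (state, year) instead of (year, state).
by_state_year = pivot(by_year_state)
```

In this sketch a drill-down is simply a roll-up that keeps more coordinate positions, which mirrors the parent-child consolidation described for hierarchies; a real tool would precompute or index these aggregates rather than scan every cell.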
OLAP tools can access data stored in commercial DBMSs or in proprietary databases. The implementation of operations and metadata management is usually proprietary. This format diversity and the need for interoperability among tools motivated many modeling standards initiatives. Some examples are the Common Warehouse Metamodel (CWM) and the Open Information Model (OIM). The former was adopted by the OMG (Object Management Group) as the standard for metamodeling and metadata interchange in the data warehousing domain. The latter was initially proposed by Microsoft and later adopted by the MDC (Metadata Coalition). More recently, it was announced that the two were to be merged [OMG01]. The CWM establishes a common metamodel for warehousing and also standardizes the syntax and semantics needed for importing, exporting and other dynamic warehousing operations [CWM00]. The CWM is one of the modeling standards used in the integration model, which is under development at the time of this writing.

3. GIS Tools

Cowen [Cowe98] defines a GIS as a decision support system involving the integration of spatially referenced data in a problem-solving environment. In [AS98], GISs are defined as systems that give a computational treatment to geographical data. A GIS stores the geometry (shape) and attributes of geographically referenced data that are located on the earth's surface, according to a cartographic projection. The main characteristics of a GIS are the integration of spatial information in a single database and the supporting mechanisms that make it possible to combine several kinds of information (through algorithms for manipulation and analysis) and to query, retrieve, visualize and plot geographically referenced data.

GISs can be used in various application domains, such as insurance, health, environmental management, urban planning, cartography and networks (water, energy, and telephony).
There are three main usage approaches for GISs, which characterize their chronological evolution: as a tool to produce maps, based on proprietary file systems (first generation, in the 70's); as a support to spatial analysis, based on conventional DBMSs (second generation, in the 80's); and as a geographic database, based on spatially-aware DBMSs (third generation, from the 90's to the present).

Usually, spatial data is classified into geo-objects and geo-fields, according to two different views of the geographical space [Worb95]. Geo-objects are represented by discrete object types (points, lines, and polygons), which correspond to the topological dimensions of the object, and are used in cadastral maps. Geo-fields are continuous geographical entities, with no well-defined locations, and are used in thematic maps.

In this work, we focus on the usage of a GIS as a tool to support spatial analysis in conjunction with OLAP tools, independently of the way GIS data is stored. Our framework is restricted to the object view of the geographical space; therefore, spatial data must be represented as geo-objects.

The development of GIS applications, like that of traditional information systems, should evolve through the phases of analysis, design and implementation. The analysis phase consists mainly of data and process modeling. During this phase, conceptual models are powerful tools to guide designers in developing a GIS application. Although some general-purpose conceptual models (e.g., E-R, OMT, UML) have been successfully applied to model conventional information systems, they cannot cope well with some new specific requirements that GISs place on database modeling. Therefore, in the last few years many database researchers have proposed extensions to those models (e.g., GeoER, MODUL-R, GeoOMT, GeoOOA, MADS) in order to support the conceptual modeling of geographical databases [Pret99].
In this work, to map GIS concepts, we use GeoFrame [IF98], a conceptual framework developed for the conceptual modeling of geographic data.

4. The Proposed Architecture

Given that geographic data analysis would benefit from a dimensional approach, and that data analysis with a dimensional approach very often needs a geographic perspective, this work proposes an architecture (figure 1) that gives end users an integrated dimensional and spatial view of data, regardless of whether the data sources are OLAP or GIS tools.

Figure 1 - Architecture to integrate dimensional and spatial analyses
(Figure not reproduced. Labels: Application Layer with three Clients; Middleware Layer with Integration Model, Metadata Store, Query Results, Mediator and two Wrappers; Data Source Layer with OLAP Tool and GIS Tool.)

4.1 – Architecture Components

The following layers compose the architecture shown in figure 1:

• Data source layer – responsible for providing the data to be analyzed, formed mostly by preexisting data stored in OLAP or GIS tools. At this level, heterogeneity is high, not only because different types of tools might be used, but also because data might be structured in different ways, especially among GIS tools.

• Middleware layer – responsible for the mediation between the application and data source layers. This middleware is composed of the following components:

  • The integration model – the main component of the architecture. It is a model responsible for integrating the concepts used in OLAP and GIS tools. Applications developed in the application layer can be based on this integration model. The model is based on CWM, extending some already available OLAP concepts with concepts commonly found in GIS tools. The mapping to deployment structures is done with the support of another CWM package, the Transformation package, also extended to include GIS deployment structures. Figure 2 shows the main classes and relationships of the proposed model.
  According to this model, an integrated Schema is composed of Cubes, Dimensions and Themes. A Cube is composed of an association of Dimensions, which can be measure (fact), Time, Spatial or NonTimeSpatial Dimensions. A Dimension can have several Levels, referring to different Hierarchies. The information from the GIS tools is integrated into the Schema through the Theme class, which aggregates Views, which in turn are related to TemporalInformation. A View is a grouping of GeographicPhenomena, a class that generalizes the phenomena whose location on the earth's surface is considered at a given time (GeographicObject or GeographicField). A GeographicObject is spatially represented by a SpatialObject (point, line, or polygon) and corresponds, in the real world, to a RealObject. The LinkToView class represents the relationship between a Dimension and Level on the OLAP side and a View. In practice, at the implementation level, LinkToView is a process where parameters from both sides are associated in order to obtain a query result.

  Figure 2 - Main Classes and Relationships of the Integration Model
  (Figure not reproduced; multiplicities are not recoverable. Classes shown: Schema, Cube, CubeDimAssoc, Dimension, TimeDim, SpatialDim, MeasureDim, NomTimeSpatDim, Hierarchy, HierarchyLevelAssoc, Level, Theme, View, TemporalInformation, LinkToView, GeographicPhenomena, GeographicField, GeographicObject, SpatialObject, RealObject. Labeled associations: calcHierarchy, displayDefault.)

  Due to space restrictions, the classes and relationships responsible for the mapping to deployment structures are not shown.

  • Mediator – responsible for query processing, including decomposition (according to some integration schema), optimization and submission of application queries to the wrappers.

  • Wrappers – responsible for the translation between the data model used by the data sources and the integration model. Each structural change in a data source requires the adjustment of this architectural component.
  The use of wrappers copes well with the structural heterogeneity of the data sources, as each wrapper is built for a specific tool.

  • Metadata repository – this repository stores integration schemas (derived according to the integration model), statistics about submitted queries and other descriptive and operational information.

  • Query results database – responsible for storing the results of the most frequently used queries.

• Application layer – a layer of client applications developed over the integration model. These applications should provide visualization functionalities similar to those provided individually by OLAP and GIS tools.

4.2 – Guidelines for integration

The integration between OLAP and GIS tools can be achieved at two different levels (figure 3): complete integration and integration-by-association. In the first case, members of a spatial dimension in OLAP tools can be linked to spatial objects. In this way, a drill-down in the spatial dimension corresponds, for instance, to changing from a "state view" to a "city view" in a map. At the integration-by-association level, a dimension, not necessarily a geographical one, can be linked to geographical data.

Figure 3 - Integration levels
(Figure not reproduced. GIS & OLAP integration splits into Complete Integration, which applies to spatial dimensions, and Integration-by-Association, which applies to dimensions with some aspect related to the geographic theme.)

The steps and engines necessary to integrate a data source in the proposed architecture are described below:

1 - The DBA needs to evaluate the data sources and verify whether their integration into the architecture is feasible. The following criteria should be taken into account: whenever OLAP tools are the data source, the information must contain elements that allow data to be spatially referenced (for instance, city code, state and zip code).
When GIS tools are the data sources, the information must include elements that allow analyses according to perspectives other than space (for instance, a historical perspective is possible if there are several versions of the same cadastral map over time).

2 - Once the DBA decides to integrate the data source through the architecture, the concepts of the data source tool need to be described by concepts of the integration model. It is also necessary to indicate which elements should be mapped into the concepts of the other approach. This phase corresponds, in fact, to the creation of a schema (based on the integration model) for the data sources.

3 - These descriptions are stored in the metadata repository and are used to support the implementation of wrappers and mediators. Storing this information in the metadata repository allows the construction of mechanisms for monitoring changes in the data sources, in terms of their structure or of their data. Another advantage is greater transparency in the operation of wrappers and mediators (usually considered black boxes).

4 - An application built on top of the integration model can query, in a transparent way, the data sources instantiated in the integration model. The queries based on the integration model have their syntax stored in the metadata repository, where a specific application (agent) is responsible for tracking how frequently these queries are used. That information is useful for choosing the query results to be materialized in the query results database. The query results are "inserted" in the architecture through the mediator and are eventually stored in the query results database. The query results are made available to the end-user applications in XML format.

5. A case study

In this section, the use of the integration model is explored in a simple case study. A health insurance company has an OLAP application and a GIS application, both containing information about losses and claims.
In the OLAP application, it is possible to analyze the quantity of claims and the paid claims per time, medical procedure and space dimension. Figure 4 shows a star schema for the application, with dimension and fact tables stored in a relational database. Each dimension includes a two-level hierarchy: procedure_code → statistical_group in Medical_procedure_dim, month → year in Time_dim and city → state in Space_dim.

Figure 4 - OLAP Conceptual Schema (star schema)
(Figure not reproduced. Tables and columns shown; each dimension table is related to many rows of Loss_fact.)
  Medical_procedure_dim: medical_procedure_key, statistical_group, procedure_code
  Time_dim: time_key, year, month
  Space_dim: space_key, state, city
  Loss_fact: medical_procedure_key, time_key, space_key, claim_quantity, claim_payed

In the GIS tool, one interesting map for this case shows loss occurrences by region in a given year. In this map, states are represented as polygons; hospitals, industries and the insured homes are represented by points. Figure 5 shows a conceptual schema (2) for this situation. In this example, the geographic database has objects that can be GeoObjects (hospital, state, industry and the address of the insured parties who suffered the loss), InformationPlan objects and non-spatial objects (Insured Person). GeoObjects are located in a Map of geo-objects, which in this example is always a Cadastral Map (Industry Map, State Map, Hospital Map and Loss Map). At the logical level, some information is represented in relational structures and some in proprietary file structures.

(2) It is important to note that few GISs have a conceptual level of abstraction. Even so, it is possible to create one from the logical structure.
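The two views described so far can be placed side by side in a small Python sketch. It is only an illustration under stated assumptions: the table and column names follow Figure 4, but every row, the state polygons, the hospital points and the helper names (claims_by_state, point_in_polygon, hospitals_by_state) are invented here, and the ray-casting test stands in for whatever spatial operator a real GIS tool would provide.

```python
from collections import defaultdict

# OLAP side: star schema of Figure 4, with invented rows.
time_dim = {1: {"year": 2000, "month": 1},
            2: {"year": 2000, "month": 2},
            3: {"year": 2001, "month": 1}}
space_dim = {1: {"state": "RJ", "city": "Rio de Janeiro"},
             2: {"state": "RJ", "city": "Niteroi"},
             3: {"state": "SP", "city": "Sao Paulo"}}
loss_fact = [
    {"medical_procedure_key": 1, "time_key": 1, "space_key": 1,
     "claim_quantity": 10, "claim_payed": 900.0},
    {"medical_procedure_key": 1, "time_key": 2, "space_key": 2,
     "claim_quantity": 4, "claim_payed": 350.0},
    {"medical_procedure_key": 1, "time_key": 1, "space_key": 3,
     "claim_quantity": 7, "claim_payed": 610.0},
    {"medical_procedure_key": 1, "time_key": 3, "space_key": 3,
     "claim_quantity": 5, "claim_payed": 480.0},
]

# GIS side (hypothetical geometry): states as polygons, hospitals as points.
state_polygons = {"RJ": [(0, 0), (4, 0), (4, 3), (0, 3)],
                  "SP": [(5, 0), (9, 0), (9, 3), (5, 3)]}
hospitals = {"H1": (1, 1), "H2": (3, 2), "H3": (6, 1)}


def claims_by_state(year):
    """Roll claim_quantity up the city -> state hierarchy for one year."""
    totals = defaultdict(int)
    for row in loss_fact:
        if time_dim[row["time_key"]]["year"] == year:
            totals[space_dim[row["space_key"]]["state"]] += row["claim_quantity"]
    return dict(totals)


def point_in_polygon(point, polygon):
    """Ray-casting test: count polygon edges crossed by a ray going right."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def hospitals_by_state():
    """Count the hospital points that fall inside each state polygon."""
    return {state: sum(point_in_polygon(p, poly) for p in hospitals.values())
            for state, poly in state_polygons.items()}


# The state member of the spatial dimension is the key shared by both sides.
claims = claims_by_state(2000)
hospital_counts = hospitals_by_state()
combined = {s: (claims.get(s, 0), hospital_counts[s]) for s in state_polygons}
```

The design point the sketch tries to show is that the state level of Space_dim and the state polygons share one identifier, which is what lets a dimensional aggregate and a spatial count be read together without copying data from one side to the other.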
Figure 5 - GIS Conceptual Schema
(Figure not reproduced. Labels: Objects, specialized ("is a") into Non-spatial, Geo-Object and Geo-fields; Loss, StatisticalGroup, Time, InformationPlan, Insured Person, Hospital, State, Industry; Map of geo-objects, specialized ("is a") into Cadastral Map, with Industry Map, Loss Map, Hospital Map and State Map.)

Starting the integration from the OLAP application point of view, the schema (loss schema) has Dimensions (ClaimPayedDim, ClaimQuantityDim, MedProcDim, TimeDim, SpaceDim) and Cubes (LossCube). A Cube is formed by the aggregation of Dimensions. A Dimension can have Hierarchies (3) (procedure_code → statistical_group, month → year, and city → state) and Levels (procedure_code, statistical_group, month, year, city, state).

From the GIS tool point of view, loss_theme is a group of Views at different times, associated with MapOfObjects. The theme is in fact the StatisticalGroup from the LossMap. The time of MapOfObjects is the time of its associated maps, restricted by the time passed as a parameter from the OLAP side. The Spatial dimension is represented by a process applied to the StateMap with the SpaceDimLevel as a parameter. Figure 6 shows the integrated schema for the application.

Figure 6 - Integration Schema
(Figure not reproduced; multiplicities are not recoverable. Classes shown: LossCube, ClaimPayedDim, ClaimQuantityDim, MedProcDim, TimeDim, SpaceDim, MedProcDimLevel, TimeDimLevel, SpaceDimLevel, Loss_View, MapOfGeoObjects, LossMap, HospitalMap, StateMap, IndustryMap, StatisticalGroup, Time.)

(3) To make the schema more readable, the hierarchies were omitted from figure 6.

An example of a useful query is to find the correlation between a given disease (statistical group) in a given period and the presence of hospitals and/or industries. On the OLAP side, TimeDimLevel, MedProcDimLevel and SpaceDimLevel are selected. These data are parameters for the execution of the process on the GIS side, which generates a Loss_View enabling the spatial correlation analysis.

6. Related Works

The integration of distributed systems is itself a challenge that has already been addressed by many research groups. In the literature, we can find proposals for the integration of distributed DW environments as well as for GIS integration. Interoperability among diverse geo-processing systems is the goal of the OpenGIS Consortium [OGC01], and has been proposed through conceptual models [Thom98] and through an architecture [SSM97]. An architecture for the construction of distributed DW systems using a heterogeneous DBMS is proposed in [SSU+00].

The most common scenario for combining time and space perspectives is to perform the dimensional analysis using OLAP tools and export the results to a GIS tool, where some spatial analyses are made ([Keel97, Gonz99]). Whenever new analyses are necessary, all the work must be redone.

Another frequently used solution is the construction of bridges between specific tools. EnviroMapper [ZSM+98] is a bridge that binds a GIS implemented in ArcView / MapInfo and an environmental data repository (Envirofacts Warehouse) implemented in an Oracle DBMS. EnviroMapper and the Envirofacts Warehouse are being developed for the US Environmental Protection Agency. These approaches provide interoperability but do not constitute a satisfactory integrated environment for the user. If the GIS or the DW application is updated, the bridge may need maintenance.

The GOAL Project [KMM00, KKM00] has as its objectives DW and GIS integration, the use of data mining techniques and the evaluation of specific features of geographical data with regard to knowledge discovery processes. In the GOAL project, GISs are data sources, from which data are extracted and loaded into DW platforms. GISs are also used for the presentation of the analysis results. As an Extract-Transform-Load (ETL) process is involved, it seems that this approach is oriented to the construction, from scratch, of a DW with spatial objects.
The binding between GIS and DW is made only through elements of the spatial dimension and the objects of the GIS taxonomy.

In [Stef97], the construction of a spatial data cube is proposed using a spatial data warehouse model, which consists of both spatial and non-spatial dimensions and measures. There are three cases for modeling dimensions in a spatial data cube: non-spatial dimensions; spatial-to-nonspatial dimensions, whose primitive-level data are spatial but whose generalization, starting at a certain high level, becomes non-spatial; and spatial-to-spatial dimensions, whose primitive-level and all high-level generalized data are spatial. This model considers two types of measures: numerical measures and spatial measures, the latter containing one pointer or a collection of pointers to spatial objects. This proposal is oriented to the construction of cubes, considering only one data source.

In summary, there are some proposals related to the integration of dimensional and spatial analyses, namely: integration using a spatial data cube model [Stef97]; integration through bridges between two specific tools [ZSM+98]; and an architecture that seems to solve at least part of the integration problems ([KMM00, KKM00]).

An approach to the integration of data managed by different tools should address structural as well as semantic aspects. Semantic integration still constitutes a challenge and a research issue for the academic world. There is an expectation that, with the maturity of standards and metadata architectures such as MOF, CWM, OpenGIS and RDF, and the complementary use of ontologies, this problem can be adequately addressed.

7. Conclusions

In the enterprise world, there is often neither the time nor the justification to replace or convert existing systems. New functionality must be integrated with other packages, existing applications, and data sources.
Therefore, building applications that are adaptable to business and technology changes, while retaining existing applications and legacy technology as far as is reasonable, is a very useful approach.

In this article, an architecture that uniformly supports dimensional and spatial data analyses was proposed. This architecture is very helpful when applied over existing systems, since no changes are needed in the source (original) applications. Therefore, it is possible to protect existing investments and to enable rapid responses to changing user requirements.

Even though the idea of integrating multidimensional and spatial analysis (DW and GIS) has also been discussed elsewhere ([KMM00, KKM00]), the architecture proposed in this paper takes a very different approach, focusing on the integration of preexisting systems without extracting, transforming and loading data from one system into the other.

Besides the proposed architecture, the integration model is now being refined, based on current metadata and modeling standards, mainly CWM. As next steps in our research, we plan the implementation of the wrappers and the mediator in the real case study of a health insurance company, where OLAP and GIS applications need a complementary view. There are also other issues to be addressed:

• The implementation of a mechanism responsible for managing the most frequent queries (an agent);
• The use of standards encoded in XML (e.g., FGDC, RDF) when delivering information to the application layer;
• Algorithms to choose more efficiently which query results should be materialized, taking into account parameters other than just query frequency.

References

[AS98] Assad, E.D.; Sano, E.E. "Geographical Information Systems. Applications in the Agriculture". Embrapa, 1998. (In Portuguese).
[Cowe98] Cowen, D.J. "GIS Versus CAD Versus DBMS: What are the Differences". Photogrammetric Engineering and Remote Sensing, 1998, 54:11, 1551-1554.
[CWM00] OMG. "CWM – Common Warehouse Metamodel Specification". http://www.omg.org
[Gonz99] Gonzales, M.L. "Spatial OLAP: conquering Geography". DB2 magazine, Spring 1999. http://www.db2mag.com/99sp_gonz.shtml
[IF98] Iochpe, O.; Filho, J.L. "A Basic Class Hierarchy to support GIS Conceptual Design". In International Conference on Modeling Geographical and Environmental Systems with Geographical Information Systems, 1998, Hong Kong, China.
[Keel97] Keeler, M. "Treasure Maps for Decision Support". Database Programming & Design, special edition on Spatial Database, November 1997. www.dbpd.com
[Kimb98] Kimball, R. "The Data Warehouse Lifecycle Toolkit". John Wiley & Sons, Inc., 1998.
[KKM00] Kouba, Z.; Matousek, K.; Miksovsky, P. "On Data Warehouse and GIS Integration". In Proceedings of DEXA2000, Greenwich, 2000.
[KMM00] Kouba, Z.; Marik, V.; Miksovsky, P. "Data Warehousing and Geographical Information". In Proceedings of SCI 2000.
[OGC01] OpenGIS Consortium. http://www.opengis.org/ Last update: 02/06/2001 11:26:36.
[OLAP97] OLAP Council. "The APB-1 Benchmark". In: http://www.olapcouncil.org/research/bmarkly.htm
[OMG01] OMG. The Object Management Group Homepage. In: http://www.omg.org
[Pret99] Preto, A.G. "METASIG: Ambiente de Metadados para Aplicações de Sistemas de Informações Geográficos". Instituto Militar de Engenharia, Rio de Janeiro, Brasil. M.Sc. Thesis, 1999. (In Portuguese).
[SSM97] Strauch, J.C.M.; Souza, J.M.; Mattoso, M.L.Q. "MULTISIG: An architecture for Interoperability between Geographics Bases". In: Proceedings of GIS Brasil 97. (In Portuguese).
[SSU+00] Silva, D.S.; Siqueira, S.W.M.; Uchôa, E.M.A.; Braz, M.H.L.B.; Melo, R.N. "An architecture for Data Warehouse Systems Using a Heterogeneous Database Management System". 15th Brazilian Symposium on Databases (SBBD), Paraíba, Brasil, 2000.
[Stef97] Stefanovic, N. "Design and Implementation of On-Line Analytical Processing (OLAP) of Spatial Data". University of Belgrade, Yugoslavia. M.Sc. Thesis, 1997.
[Thom97] Thomsen, E. "OLAP Solutions. Building Multidimensional Information Systems". Wiley Computer Publishing, 1997, 576pp.
[Thom98] Thome, R. "Interoperability in Geoprocessing: conversion between GIS conceptual models and comparison to the OPENGIS standard". Instituto Nacional de Pesquisas Espaciais, Brasil. M.Sc. Thesis, 1998. (In Portuguese). In: http://www.dpi.inpe.br/teses/thome/
[VT99] Vassiliadis, P.; Sellis, T. "A survey on logical Models for OLAP Databases". In: http://www.dbnet.ece.ntua.gr/~dwq
[Worb95] Worboys, M. "GIS - A Computing Perspective". Taylor & Francis, 1995.
[ZSM+98] Zhuang, V.; Sun, J.; Moss, M.; Israel, S.; McEnaney, B.; Wolf, D. "EnviroMapper - The Visualization Tool to the Envirofacts Warehouse". In ESRI User Conference, 1997. http://www.esri.com/library/userconf/proc98/PROCEED/TO850/PAP831/P831.HTM