Relational Geographic Databases

Dehua Zhao1, Byunggu Yu (corresponding author)1, Bong H. Hong2, and Dan Randolph1

1 Department of Computer Science, University of Wyoming, PO Box 3315, Laramie, Wyoming 82071, USA. Phone: (307) 766-2440, fax: (307) 766-4036, [email protected]

2 Department of Computer Science and Engineering, Pusan National University, San 30, Jangjeon-dong, Gumjeong-gu, Pusan, 609-735, Korea. Phone: +82-51-510-2424, fax: +82-51-517-2431, [email protected]

Abstract. This paper proposes a generic relational-database schema that can efficiently accommodate various types of GIS data. The proposed schema complies with the OpenGIS Simple Features Specification for SQL developed by OGC (OpenGIS Consortium) and can be used for any geographic application whose geographic objects are represented based on 2D geometry with linear interpolation between vertices. The generic schema that we propose in this paper makes it possible to automatically generate a relational database schema for any existing or new 2D GIS dataset. This facilitates the migration and deployment of GIS data in well-established relational database environments. Consequently, sharing and integrating GIS data become much more feasible. In addition, since any relational database management system (DBMS) can be used, developing a GIS application system on existing GIS data is facilitated. We verified the proposed schema and automatic schema generation mechanism by developing and testing a relational geographic information system. Keywords. Geographic Information System, databases, relational databases, schema design.

1 Introduction

Just a few decades ago, paper maps were the principal means of synthesizing and representing geographic information. Paper maps are limited to manual manipulation and fail to meet the increasing demand for interactive manipulation and analysis of geographic data. The rapid development of computer software and hardware technologies has made meeting this demand possible: various types of geographic information systems (GIS’s) that can replace traditional paper maps have been developed. In recent years, a GIS has become more than a cartographic tool for producing digital maps. A GIS provides storage, management, and retrieval of geographic spatial data (e.g., the boundaries of lakes) and related non-spatial data (e.g., names, sizes, and average water temperatures of lakes). The GIS application domain spans many areas, including urban planning, route optimization, public utility network management, demography, cartography, agriculture, natural resources management, coastal monitoring, fire control, and epidemic monitoring. There is also an increasing demand for database-supported GIS’s that are streamlined for handling complex statistical or analytical queries.

Most modern GIS’s have been developed on top of a file system. As a result, each GIS has its own logical data formats and file structures. Unfortunately, these file-system-based GIS’s have several problems that are well known in the database area: data sharing, data redundancy and inconsistency, transaction control and recovery, concurrency control, and security. The most feasible approach to these problems is building a GIS on a well-established database model [6]. Arguably, the most successful database model that has been proven to address the problems of early file systems in a reliable manner is the relational database model; almost all database management systems (DBMS’s) support it.
Previous work on developing a GIS based on database technology falls into two approaches: the hybrid approach [15, 16] and the integration approach [12, 13, 14, 21, 22]. The hybrid approach uses a DBMS to store and manage non-spatial data, while spatial data is managed separately by either a proprietary file system (e.g., ARC/INFO) [15] or a spatial data manager (e.g., Papyrus) [16]. The integration approach, on the other hand, extends the ER-model (the relational database model) by adding new data types and operations to capture spatial semantics [12, 13, 21, 22] and requires the DBMS to support user-defined ADTs (Abstract Data Types) and operations [14, 21, 22]. In these systems, the major problem is data sharing and migration among heterogeneous systems.

This paper proposes a highly flexible and portable GIS schema called the Generic Relational-Geographic-Information Schema (GRGIS) that can be automatically specialized to accommodate any GIS dataset whose data objects are represented based on 2D geometry with linear interpolation between vertices. The schema is based on the relational database model and a widely used standard SQL (SQL92) [6], and it is fully compatible with the OpenGIS Simple Features Specification for SQL developed by OGC (OpenGIS Consortium) [5]. This paper also proposes a technique called the automatic Schema Generation mechanism for GIS applications (SGGIS), which automatically generates a relational database schema for a given GIS dataset. The GRGIS and SGGIS facilitate the migration and deployment of GIS data in well-established relational database environments. Consequently, sharing and integrating GIS data become much more feasible. In addition, since any database management system (DBMS) that supports the basic relational database model and SQL can be used, developing a GIS application system on an existing DBMS and reusing existing sets of geographic data are facilitated. An experimental GIS called the RGIS (Relational Geographic Information System) has been developed to verify the GRGIS and the SGGIS.

The remainder of this paper is organized as follows: Section 2 gives an overview of commonly used GIS data models. Section 3 introduces the GRGIS and SGGIS. Section 4 shows our experimental results. Finally, Section 5 provides the summary and discusses our future work.

2 Background

In many ways, a GIS presents a simplified view of the real world. Each geographic data object is associated with two kinds of data: non-spatial data and spatial data. The non-spatial data of a geographic object consists of alphanumeric values describing (or associated with) the object. The spatial data of a geographic object represents its geometric properties, which define the geometry (i.e., geometric figure) of the object by defining its interior, boundary, and exterior [8].

Existing spatial data models can be classified into two groups depending on how they view the real world: the field model and the object model [1, 2, 3, 4]. The field model views the world as a continuous surface over which features vary in a continuous distribution (e.g., atmospheric pressure). In this model, the world (i.e., a field) is partitioned into areas, and the emphasis is on the contents of these areas. The object model views the world as a surface littered with recognizable objects.

Another classification is based on how spatial data is represented: the raster representation and the vector representation. Typically, the field model is built on the raster representation. The vector representation, on which the object model is implemented, explicitly stores the geometric features of the identified geographic objects (typically obtained from raster data). It takes much less storage space and supports efficient geometrical and topological operations. Although the field model is still used in some applications, such as atmospheric and environmental GIS applications, the object model is becoming widely accepted because geometrical and topological operations are necessary for an increasing number of emerging GIS applications. In this paper, we focus on the object model and the vector representation.
2.1 Object Models based on Vector Representation

In the vector representation, objects are constructed from points as primitives. A point is represented by a pair of X, Y coordinates, whereas more complex linear and region objects are represented by structures (lists, sets) built on their point representation. For collections of geographic objects, the representation of topological relationships among the objects is also of interest. Depending on how topology is expressed, representations of geographic-object collections are usually classified into two models: the spaghetti representation and the topological representation.

In the spaghetti representation, the geometric properties of any spatial object are described independently of other objects. No topological relations are stored; all topological relations are computed on demand. The topological representation, on the other hand, describes geometrical properties in terms of nodes, arcs, polygons, regions, and the topological relations among them. For example, a node is represented by a point and a list of arcs starting (or ending) at this node; an arc is represented by its end nodes and the polygons having the arc as a common boundary; a polygon is represented by a list of arcs.

The main advantage of the spaghetti representation is its simplicity. Its drawbacks are mainly due to the lack of explicit information about topological relations among spatial objects. In addition, the spaghetti representation implies data redundancy. For example, the coordinate values representing a boundary shared by two adjacent regions are duplicated.

The topological representation can efficiently support some topological queries. For example, looking for polygons adjacent to a given polygon P1 is straightforward: P1 is scanned, and accessing each of its arcs provides a polygon adjacent to P1. However, if P1 is not one of the data objects, this does not work, and such topological queries are better supported by spatial access methods (e.g., topological R*-trees [9] and QSF-trees [10]). In fact, spatial access methods can be built more easily on datasets stored in the spaghetti representation. Moreover, the topological representation is slower at generating query results for display, because each object’s actual coordinate values are replaced by the identifiers of other objects. That is, a larger number of objects must be accessed, which results in an increased number of disk accesses. For these reasons, the spaghetti representation is preferable for GIS’s dealing with a large volume of geographic data.

2.2 Data Management

A GIS needs to store both spatial data and non-spatial data. Early GIS’s were built on top of proprietary file systems. This allows the system performance to be optimized for some functions, but it does not respect the important principle of data independence and leads to many drawbacks in terms of data sharing among different GIS’s, data consistency, data recovery, security, and data integrity. These problems can be solved by developing GIS’s on top of a well-developed database management system (DBMS). Although individual DBMS’s have their own extensions, in recent years all general-purpose DBMS’s have been developed based on well-established relational-database technology.
Thus, a GIS application that is designed based on the relational database model and a standard SQL (e.g., SQL92) can be implemented on any relational DBMS without significant modifications. Furthermore, the problems of data sharing among different GIS applications, data consistency, data recovery, security, and data integrity can be effectively solved.

The efforts to deploy geographic information in relational DBMS’s can be classified into two approaches: the hybrid approach and the integration approach. In the hybrid approach, spatial data is stored in external files, and only non-spatial data is managed by a DBMS. The spatial data is accessed by a geographic object identifier (gid). Each spatial object has a unique gid value, and all gid values are also stored (as attribute values) in the non-spatial database to connect the spatial data and the non-spatial data of each geographic object. Arc/Info (ESRI), MGE (Intergraph), and TiGRis (Intergraph) are well-known GIS’s that follow this approach. This approach provides more flexibility and greater potential for data sharing and integration. However, it also suffers from drawbacks, the most important of which are: (1) the coexistence of heterogeneous data models, which implies difficulties in modeling and sharing spatial data; and (2) the partial loss of basic DBMS functionality, such as data recovery and query optimization, for the spatial part of the data.

The integration approach employs a DBMS to store both spatial data and non-spatial data. This approach eliminates the drawbacks of the file-based approach and the hybrid approach, since both spatial data and non-spatial data are stored in a database. Thus, data sharing, data integrity, data recovery, security, and data consistency are guaranteed by the DBMS to a great extent (assuming that the DBMS is a reliable one, such as Informix, Oracle, or DB2, that supports concurrency control, recovery, and security).
Unfortunately, the existing implementations are based on a specially extended ER-model with new geographic constructs, ADTs (Abstract Data Types) of spatial objects, and additional operations on these ADTs [12, 13, 14, 21, 22]. These implementations suffer from one or more of the following: (1) limited representation of complex polygons (e.g., the number of boundary points must be less than 2000 in [15]); (2) poor performance; (3) an application-dependent database schema; and (4), most importantly, cumbersome data sharing and integration, since the implementations make use of the extended features of a specific commercial DBMS.

2.3 The OpenGIS Standard for SQL Environment

In May 1999, OGC (OpenGIS Consortium) released the OpenGIS Simple Features Specification for SQL [5]. The purpose of this specification is to define a standard SQL schema that supports storage, retrieval, query, and update of a collection of geospatial features. In the specification, geometric features are represented based on 2D geometry with linear interpolation between vertices. For the implementation, two target SQL environments can be considered: SQL92 and SQL92 with Geometry Types. SQL92, also known as SQL2, has been widely supported by commercial DBMS products. SQL92 with Geometry Types requires the underlying DBMS to support some geometry types and SQL3 features, which allow users to extend the type system by defining ADTs (Abstract Data Types) and operations. However, these extended SQLs are neither widely supported by DBMS’s nor standardized.

2.4 Motivations

We initiated this research to attack the problem of geographic information sharing. The goal of this research is to develop a Generic Relational-Geographic-Information Schema (GRGIS) and an automatic Schema Generation mechanism for GIS applications (SGGIS). A GRGIS is a high-level relational abstraction of all types of geographic data (both spatial and non-spatial) of the OGC Simple Features (see Section 2.3), and an SGGIS is a mechanism that can generate, without human intervention, a specific relational geographic database schema for a given geographic dataset. The GRGIS and SGGIS that we propose in this paper do not separate spatial data from non-spatial data, but store them together in a single database so as to take full advantage of relational database technology (i.e., data sharing, data integrity and consistency, recovery, security, and optimized query processing).

3 GRGIS and SGGIS

In this section, we introduce our design of the Generic Relational-Geographic-Information Schema (GRGIS) and the automatic Schema Generation mechanism for GIS applications (SGGIS). The GRGIS adopts the spaghetti representation, which makes the schema simple and efficient. Topological queries can be efficiently supported through spatial access methods and point access methods. Using efficient multidimensional access methods with the simple spaghetti representation is more efficient for processing topological queries than using the complex topological representation. The reasons for this include: (1) Topological queries can involve an arbitrary query region and one of the eight topological relations defined in [9, 10]. If the query region of a given query is not one of the data objects, keeping track of all the topological relations between data objects does not help in processing the query. (2) As mentioned in Section 2.1, in the topological representation, finding the exact coordinates of the points constituting the result set of a query may require reading the coordinates of some other objects. This results in an increased number of storage accesses. The SGGIS is the mechanism for generating the schemas of specific GIS applications.

3.1 GRGIS

Various types of geometric features are defined in the OpenGIS Simple Features Specification for SQL. Moreover, each GIS defines its own set of geometric features. Throughout the rest of the paper, to minimize redundancy in describing our GRGIS and SGGIS, we discuss only a representative subset of the geometric features: point, multipoint, polyline, and polygon.

Generic geometric features. This section gives the definitions of the generic geometric features.

Definition 1. A point is stored as $\langle x, y \rangle$ and represents a single location in coordinate space.

Definition 2. A multipoint is a set of points representing a conceptual object or group. There is no order among the points, and the number of points varies from one set to another. A multipoint is stored as $\{\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \ldots, \langle x_n, y_n \rangle\}$, where $n \geq 1$ is the number of member points.

Definition 3. A polyline is a set of one or more parts. A part is a sequence of two or more connected points (vertices) with linear interpolation between points. In a part, each consecutive pair of points defines a line segment. Parts may or may not be connected to one another, and they may or may not intersect one another. The number of parts and the number of points in each part vary. A polyline is stored as $\{\langle \langle x_{1,1}, y_{1,1} \rangle, \ldots, \langle x_{1,n[1]}, y_{1,n[1]} \rangle \rangle, \ldots, \langle \langle x_{m,1}, y_{m,1} \rangle, \ldots, \langle x_{m,n[m]}, y_{m,n[m]} \rangle \rangle\}$, where $m \geq 1$ and $n[i] \geq 2$ for all $i = 1, \ldots, m$. Note that $m$ is the number of parts and $n[i]$ is the number of vertices constituting the $i$th part.

Definition 4. A polygon is a set of one or more exterior rings and zero or more interior rings. Each interior ring defines a hole in the polygon. A ring is a sequence of three or more connected points (vertices) that form a simple, closed loop. In a ring, each consecutive pair of points defines a line segment, and the last point is connected to the first point to form a closed region. No two line segments of a ring cross. No two rings of a polygon cross; the rings of a polygon may intersect at a point, but only as a tangent [5]. The order of the vertices of a ring indicates which side of the ring is the interior of the polygon: the ring representing the outer boundary of a polygon is represented by the clockwise sequence of its vertices, whereas each interior ring representing a hole is represented by the counterclockwise sequence of its vertices. A polygon is stored as $\{\langle \langle x_{1,1}, y_{1,1} \rangle, \ldots, \langle x_{1,n[1]}, y_{1,n[1]} \rangle \rangle, \ldots, \langle \langle x_{m,1}, y_{m,1} \rangle, \ldots, \langle x_{m,n[m]}, y_{m,n[m]} \rangle \rangle\}$, where $m \geq 1$ and $n[i] \geq 3$ for all $i = 1, \ldots, m$. Note that $m$ is the number of rings and $n[i]$ is the number of vertices constituting the $i$th ring.

The above definitions cover the geometries Point, MultiPoint, Line, MultiLineString, Polygon, and MultiPolygon defined in [5]. Our definitions are simple and have high expressive power. The GRGIS and SGGIS can be extended to accommodate other types, such as Curve and MultiCurve.
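The size constraints in Definitions 2 to 4 and the ring-orientation rule of Definition 4 can be checked mechanically. The following is a minimal Python sketch, not part of the GRGIS itself: the class names, the `signed_area` helper, and the use of the shoelace formula to detect orientation are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # <x, y> as in Definition 1


def signed_area(ring: List[Point]) -> float:
    """Shoelace formula: positive for counterclockwise rings, negative for clockwise."""
    total = 0.0
    # Pair each vertex with its successor; the last vertex wraps to the first.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


@dataclass
class MultiPoint:
    points: List[Point]

    def __post_init__(self) -> None:
        if len(self.points) < 1:  # Definition 2: n >= 1
            raise ValueError("a multipoint needs at least one point")


@dataclass
class Polyline:
    parts: List[List[Point]]

    def __post_init__(self) -> None:
        # Definition 3: m >= 1 parts, each with n[i] >= 2 vertices
        if len(self.parts) < 1 or any(len(p) < 2 for p in self.parts):
            raise ValueError("a polyline needs >= 1 part of >= 2 vertices each")


@dataclass
class Polygon:
    rings: List[List[Point]]

    def __post_init__(self) -> None:
        # Definition 4: m >= 1 rings, each with n[i] >= 3 vertices
        if len(self.rings) < 1 or any(len(r) < 3 for r in self.rings):
            raise ValueError("a polygon needs >= 1 ring of >= 3 vertices each")

    def exterior_rings(self) -> List[List[Point]]:
        """Clockwise rings are outer boundaries."""
        return [r for r in self.rings if signed_area(r) < 0]

    def interior_rings(self) -> List[List[Point]]:
        """Counterclockwise rings are holes."""
        return [r for r in self.rings if signed_area(r) > 0]


outer = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]          # clockwise
hole = [(0.25, 0.25), (0.75, 0.25), (0.75, 0.75), (0.25, 0.75)]   # counterclockwise
square_with_hole = Polygon([outer, hole])
```

Here `square_with_hole.exterior_rings()` yields the outer square and `interior_rings()` yields the hole, purely from vertex order.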

Geographic object sets (Themes). Our model is a theme-based model. A theme is a collection of objects that have the same type of geometric feature (i.e., point, multipoint, polyline, or polygon) and the same set of non-spatial attributes (i.e., all objects in the same theme have the same spatial data structure and the same non-spatial data structure). This definition of themes coincides with that of ArcView themes [7, 18]. Although some GIS’s, such as Arc/Info, allow users to put objects that have different types of geometric features in the same theme, such objects can always be separated into different themes.

Definition 5. A theme is a collection of geographic objects that have the same type of geometric feature and the same set of non-spatial attributes.

For point themes, each object’s spatial (geometric) properties represent only its location. For this, two attributes, X_COORD and Y_COORD, are necessary: each element of the Cartesian product of their domains represents a unique location (in most cases, a pair of a longitude value and a latitude value). In addition, each theme requires a primary key, consisting of one or more non-spatial attributes, to identify objects. In a multipoint theme, each object consists of a varying number of points. Thus, the member points can be modeled as weak entities, each of which depends on a single multipoint object.

Our model can support analytical and statistical queries such as “Report the total, mean, and variance of the populations of the cities whose population is greater than 20000 in Wyoming.” To answer such queries efficiently, the system must be able to efficiently find the spatial objects that have a certain topological relation with a given spatial query object. For example, to process the query above, all cities that are contained by the given spatial object (the region of Wyoming) must be found first.
Then the population values of the cities are retrieved to compute the total, the mean, and the variance. In the spaghetti representation, topological relations between spatial objects are tested on the fly. To process a topological query, spatial access methods can be used to quickly find the candidate objects whose approximations (e.g., minimum bounding rectangles) satisfy the given topological predicate with respect to the approximation of the given query region. Then, the actual spatial properties of the candidate objects are examined to discard the objects whose actual regions do not satisfy the given topological predicate with respect to the actual query region. This well-known processing technique for geographic queries is called two-phase spatial query processing [9, 10]. The most frequently used spatial approximation is the MBR (minimum bounding rectangle) [9, 10]. For this purpose, each multipoint object has its own abstraction, represented by the center of the point group and the MBR that minimally encloses all the member points. In our model, the center point and the MBR are represented by ⟨CX_COORD, CY_COORD⟩ and ⟨MIN_X, MIN_Y, MAX_X, MAX_Y⟩, respectively. Figures 1a and 1b show the ER diagrams representing a generic point theme and a generic multipoint theme, respectively. We modeled polyline themes and polygon themes in the same way. Figures 2a and 2b show a generic theme of polyline objects and a generic theme of polygon objects, respectively. Note that we store a set (or sequence) of points as a string (the “POINTS” multi-value attribute of Figure 1b). Since the length of the string varies, the data type varchar(n) can be used to eliminate wasted storage space. The maximum length n is determined by the DBMS (typically, 255).
Since a multipoint object, a part of a polyline, or a ring of a polygon can have a large number of points (or vertices) that require more than n characters, the discriminator attribute SEQ is used to connect all the string chunks constituting a single point set, a part of a polyline, or a ring of a polygon. The ER diagrams in Figures 1 and 2 constitute four types of instance-level schemas. These generic instance-level schemas need to be specialized to create a geographic database. In each specialized ER diagram, the italic words are replaced with actual names. For example, to create a point theme “City” in which each city has a location, city name, state name, population, and size, and in which the primary key consists of the city name and the state name, the generic point object entity (Figure 1(a)) is specialized as follows: (1) the entity name “point object” is replaced by City; (2) “key attributes” is replaced by CITY_NAME and STATE_NAME; (3) “non-spatial attributes” is replaced by POPULATION and SIZE.
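The City specialization above can be sketched as a small generator that fills in the generic point theme. This is an illustrative Python sketch using the standard-library sqlite3 module, not the authors’ SGGIS code: the function name `point_theme_ddl`, the column types, and the sample rows are hypothetical. The statistical query filters on STATE_NAME as a simple stand-in for the spatial containment test described in the text.

```python
import sqlite3


def point_theme_ddl(theme, key_attrs, non_spatial_attrs):
    """Specialize the generic point theme: both dicts map column name -> SQL type."""
    cols = [f"{name} {sql_type} not null" for name, sql_type in key_attrs.items()]
    cols += [f"{name} {sql_type}" for name, sql_type in non_spatial_attrs.items()]
    cols += ["X_COORD real", "Y_COORD real"]  # location of each point object
    cols.append(f"primary key ({', '.join(key_attrs)})")
    return f"create table {theme} (\n  " + ",\n  ".join(cols) + "\n)"


# Column types are assumptions; the paper does not state them.
ddl = point_theme_ddl(
    "City",
    {"CITY_NAME": "varchar(64)", "STATE_NAME": "varchar(64)"},
    {"POPULATION": "integer", "SIZE": "real"},
)

conn = sqlite3.connect(":memory:")
conn.execute(ddl)
columns = [row[1] for row in conn.execute("pragma table_info(City)")]

# Made-up rows, only to exercise the schema.
conn.executemany(
    "insert into City values (?, ?, ?, ?, ?, ?)",
    [
        ("A", "Wyoming", 50000, 10.0, 1.0, 1.0),
        ("B", "Wyoming", 30000, 8.0, 2.0, 2.0),
        ("C", "Wyoming", 1500, 1.0, 3.0, 3.0),
        ("D", "Colorado", 90000, 20.0, 4.0, 4.0),
    ],
)

# Total, mean, and (population) variance of the qualifying populations.
total, mean, variance = conn.execute(
    "select sum(POPULATION), avg(POPULATION), "
    "avg(POPULATION * POPULATION) - avg(POPULATION) * avg(POPULATION) "
    "from City where STATE_NAME = ? and POPULATION > 20000",
    ("Wyoming",),
).fetchone()
```

The variance is written as E[p²] − E[p]² because a variance aggregate is not available in every SQL dialect.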

[Figure 1, diagram not reproduced. Panel (a): entity “point object” with key attributes, non-spatial attributes, X_COORD, and Y_COORD. Panel (b): entity “multipoint object” with key attributes, non-spatial attributes, CX_COORD, CY_COORD, MIN_X, MIN_Y, MAX_X, and MAX_Y, connected by the relationship “has” to the weak entity “multipoint feature” with SEQ and POINTS (maximum length n).]

Figure 1. ER diagrams representing a generic theme of point objects (a) and a generic theme of multipoint objects (b): A bolded ellipse represents a set of attributes; a double rectangle represents a weak entity set; the double ellipse superscripted by n represents a member-points array of maximum length n; the mapping cardinality of has is many (multipoint feature) to one (multipoint object); SEQ is the discriminator. Note that the union of key attributes and non-spatial attributes constitutes the non-spatial attributes of the geographic objects.
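The role of the SEQ discriminator can be illustrated by splitting a point set into string chunks of at most n characters and reassembling them in SEQ order. The paper does not give the textual encoding of POINTS, so the sketch below assumes one: points written as `x,y` and separated by single spaces, with SEQ starting at 0. The function names are hypothetical.

```python
def chunk_points(points, n=255):
    """Return (SEQ, POINTS) rows; each POINTS string holds whole points and is <= n chars."""
    chunks, current = [], ""
    for x, y in points:
        token = f"{x},{y}"
        if len(token) > n:
            raise ValueError("a single point does not fit into one chunk")
        candidate = token if not current else current + " " + token
        if len(candidate) > n:
            chunks.append(current)  # close the full chunk, start a new one
            current = token
        else:
            current = candidate
    if current:
        chunks.append(current)
    return list(enumerate(chunks))


def join_chunks(rows):
    """Rebuild the point list from (SEQ, POINTS) rows, whatever order they arrive in."""
    points = []
    for _, text in sorted(rows):
        for token in text.split(" "):
            x, y = token.split(",")
            points.append((float(x), float(y)))
    return points


sample = [(float(i), float(i) * 0.5) for i in range(100)]
rows = chunk_points(sample, n=32)
restored = join_chunks(reversed(rows))
```

Splitting only at point boundaries keeps every chunk parseable on its own, at the cost of leaving a few characters unused per row.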

[Figure 2, diagram not reproduced. Panel (a): entity “polyline object” connected by the relationship “has” to the weak entity “polyline feature” with PART_NO, SEQ_NO, and VERTICES (maximum length n). Panel (b): entity “polygon object” connected by the relationship “has” to the weak entity “polygon feature” with RING_NO, SEQ_NO, and VERTICES (maximum length n). Attribute labels on the object entities: key attributes, non-spatial attributes, CX_COORD, CY_COORD, MIN_X, MIN_Y, MAX_X, MAX_Y.]

Figure 2. ER diagrams representing a generic theme of polyline objects (a) and a generic theme of polygon objects (b): A bolded ellipse represents a set of attributes; a double rectangle represents a weak entity set; each of the double ellipses superscripted by n represents a vertex-points sequence of maximum length n. Note that the union of key attributes and non-spatial attributes constitutes the non-spatial attributes of the geographic objects.
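The two-phase spatial query processing described in this section, filtering on MBRs and then refining on the exact geometry, can be sketched as follows. This is a simplified Python illustration under stated assumptions: a linear scan stands in for a real spatial access method such as an R*-tree, the query asks which points lie inside a single ring, and the refinement uses the standard ray-casting test. All function names and the sample data are hypothetical.

```python
def mbr(points):
    """Minimum bounding rectangle as (MIN_X, MIN_Y, MAX_X, MAX_Y)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def in_mbr(point, box):
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def point_in_ring(point, ring):
    """Ray casting: count edge crossings of a horizontal ray going right from the point."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > y) != (y2 > y):  # the edge spans the ray's y coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def contained_points(objects, region_ring):
    """objects: dict name -> (x, y). Returns (candidates after filter, final result)."""
    box = mbr(region_ring)
    # Phase 1 (filter): a cheap test against the MBR of the query region.
    candidates = [name for name, p in objects.items() if in_mbr(p, box)]
    # Phase 2 (refinement): the exact geometric test, on the candidates only.
    result = [name for name in candidates if point_in_ring(objects[name], region_ring)]
    return candidates, result


region = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]  # triangular query region
cities = {"A": (1.0, 1.0), "B": (9.0, 9.0), "C": (20.0, 20.0)}
candidates, result = contained_points(cities, region)
```

In this example C is rejected by the filter alone, B passes the filter but fails the refinement, and only A is reported.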

Geographic databases. In our model, a geographic database is represented by a set of themes. Each theme can have its own attributes, such as the name, the feature type (e.g., polylines), and the area covered by the theme. In addition, data sharing raises a type conversion problem: since different GIS’s may provide different sets of primitive data types, exactly matching each primitive data type of a GIS with one of the primitive data types of a DBMS is not always possible. The most feasible approach to this problem is assigning a more general data type to an attribute if the underlying DBMS supports no exactly matching data type. Additional information about the origin of the theme, i.e., the release of the GIS (GIS software name and version) in which the theme was created, and the original data types are stored in the database. For these reasons, we defined a meta-level entity set as shown in Figure 3: each meta-level entity describes a single theme, and each theme is described by a single meta-level entity. In our model, a geographic database is defined as follows:

Definition 6. A geographic database is a union of a set of themes and a set of Theme entities. Theme entities are meta-level entities each of which describes one theme, and each theme is described by one Theme entity (there is a one-to-one relationship, called described_by, between themes and Theme entities).

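[Figure 3, diagram not reproduced. Entity “Theme” with NAME, ORIGIN, GEOMETRY_TYPE, DESCRIPTION, MIN_X, MIN_Y, MAX_X, and MAX_Y; connected by the relationship “has attributes” to “Attribute_Description” with ATTRIBUTE_NAME, ORG_TYPE_NAME, MIN_WIDTH, MAX_WIDTH, and PRECISION; and connected by the relationship “covers” to “Theme_Region” with RING_NO, SEQ_NO, and VERTICES (maximum length n).]

Figure 3. ER diagram representing meta-level entities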

3.2 SGGIS: Schema Generation mechanism for GIS applications

Creating a geographic database starts with specializing the ER diagrams introduced in Section 3.1 by replacing the italic words with actual names. A mechanism or algorithm is needed to generate the actual schema for each specific GIS application. The SGGIS (Schema Generation mechanism for GIS applications) is designed for this purpose. The two-level structure of the GRGIS makes the schema generation process automatic, without human intervention or code modification. The instance-level schema depends on the contents of the meta-level entities. For example, the theme name gives the name of the corresponding data table (data relation), and the GEOMETRY_TYPE determines the overall structure of the data relation. Given a set of themes, the SGGIS creates a geographic database as follows:

Create a database.

    create database database-name;

    create table Theme (
        NAME varchar(MAX_NAME_LENGTH) not null,
        ORIGIN varchar(MAX_ORIGIN_LENGTH) not null,
        GEOMETRY_TYPE integer,
        MIN_X real,
        MIN_Y real,
        MAX_X real,
        MAX_Y real,
        DESCRIPTION varchar(MAX_DESCRIPTION_LENGTH),
        primary key (NAME, ORIGIN)
    );

    create table Attribute_Description (
        TNAME varchar(MAX_NAME_LENGTH) not null,
        TORIGIN varchar(MAX_ORIGIN_LENGTH) not null,
        ATTRIBUTE_NAME varchar(MAX_ATTRIBUTE_NAME_LENGTH) not null,
        ORG_TYPE_NAME varchar(MAX_ORG_TYPE_NAME_LENGTH),
        MIN_WIDTH integer,
        MAX_WIDTH integer,
        PRECISION integer,
        primary key (TNAME, TORIGIN, ATTRIBUTE_NAME),
        foreign key (TNAME, TORIGIN) references Theme (NAME, ORIGIN)
    );

    create table Theme_Region (
        TNAME varchar(MAX_NAME_LENGTH) not null,
        TORIGIN varchar(MAX_ORIGIN_LENGTH) not null,
        RING_NO integer not null,
        SEQ_NO integer not null,
        VERTICES varchar(MAX_STRING_LENGTH),
        primary key (TNAME, TORIGIN, RING_NO, SEQ_NO),
        foreign key (TNAME, TORIGIN) references Theme (NAME, ORIGIN)
    );
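To try the meta-level tables without a full DBMS installation, the statements above can be run against an in-memory SQLite database. The following Python sketch makes several assumptions that are not in the paper: the concrete values of the MAX_* constants, the GEOMETRY_TYPE code 1 for point themes, and the sample Theme row (including the origin string “ExampleGIS 1.0”). The `create database` step is omitted because SQLite creates the database on connection.

```python
import sqlite3

# Assumed widths; the paper leaves the MAX_* constants unspecified.
MAX_NAME_LENGTH = 64
MAX_ORIGIN_LENGTH = 64
MAX_ATTRIBUTE_NAME_LENGTH = 64
MAX_ORG_TYPE_NAME_LENGTH = 32
MAX_DESCRIPTION_LENGTH = 255
MAX_STRING_LENGTH = 255

META_DDL = f"""
create table Theme (
    NAME varchar({MAX_NAME_LENGTH}) not null,
    ORIGIN varchar({MAX_ORIGIN_LENGTH}) not null,
    GEOMETRY_TYPE integer,
    MIN_X real, MIN_Y real, MAX_X real, MAX_Y real,
    DESCRIPTION varchar({MAX_DESCRIPTION_LENGTH}),
    primary key (NAME, ORIGIN)
);
create table Attribute_Description (
    TNAME varchar({MAX_NAME_LENGTH}) not null,
    TORIGIN varchar({MAX_ORIGIN_LENGTH}) not null,
    ATTRIBUTE_NAME varchar({MAX_ATTRIBUTE_NAME_LENGTH}) not null,
    ORG_TYPE_NAME varchar({MAX_ORG_TYPE_NAME_LENGTH}),
    MIN_WIDTH integer, MAX_WIDTH integer, PRECISION integer,
    primary key (TNAME, TORIGIN, ATTRIBUTE_NAME),
    foreign key (TNAME, TORIGIN) references Theme (NAME, ORIGIN)
);
create table Theme_Region (
    TNAME varchar({MAX_NAME_LENGTH}) not null,
    TORIGIN varchar({MAX_ORIGIN_LENGTH}) not null,
    RING_NO integer not null,
    SEQ_NO integer not null,
    VERTICES varchar({MAX_STRING_LENGTH}),
    primary key (TNAME, TORIGIN, RING_NO, SEQ_NO),
    foreign key (TNAME, TORIGIN) references Theme (NAME, ORIGIN)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite enforces foreign keys only on request
conn.executescript(META_DDL)

# Hypothetical theme: GEOMETRY_TYPE 1 is an assumed code for "point".
conn.execute(
    "insert into Theme values (?, ?, ?, ?, ?, ?, ?, ?)",
    ("City", "ExampleGIS 1.0", 1, 0.0, 0.0, 10.0, 10.0, "sample point theme"),
)
conn.execute(
    "insert into Attribute_Description values (?, ?, ?, ?, ?, ?, ?)",
    ("City", "ExampleGIS 1.0", "POPULATION", "long", 1, 10, 0),
)
tables = {row[0] for row in conn.execute("select name from sqlite_master where type = 'table'")}
```

With foreign keys enabled, an Attribute_Description or Theme_Region row that names a nonexistent theme is rejected, which is the one-theme-per-description link of Definition 6 expressed as a constraint.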

Create specialized themes. To create a theme, the low endpoint (i.e., ⟨MIN_X, MIN_Y⟩) and the high endpoint (i.e., ⟨MAX_X, MAX_Y⟩) of the MBR (minimum bounding rectangle) that minimally bounds the region covered by the theme are computed first. Then the theme is created as follows: insert into Theme values (…); for (i=0; i