The Grid-DBMS: Towards Dynamic Data Management in Grid Environments

Author: Roxanne Gibson
Giovanni Aloisio, Massimo Cafaro, Sandro Fiore, Maria Mirto
Center for Advanced Computational Technologies/ISUFI & SPACI Consortium, University of Lecce, Italy
{giovanni.aloisio, massimo.cafaro, sandro.fiore, maria.mirto}@unile.it

Abstract
Nowadays many data grid applications need to manage and process huge amounts of data distributed across multiple grid nodes and stored in heterogeneous databases. Grids encourage the publication of scientific data in a more open manner than is currently the case, and many e-Science projects have an urgent need to interconnect legacy and independently operated databases through a set of data access and integration services. In the data grid area, a set of dynamic and adaptive services could address specific issues related to automatic data management, aiming both at providing high performance and at fully exploiting a grid infrastructure. In this paper we introduce the Grid-DBMS concept, a framework for dynamic data management in grid environments, highlighting its requirements, architecture, components and services.

Keywords: Grid-DBMS, Grid-Database, Relational Data Source, Grid Computing, GRelC Project, Globus Toolkit.

1. Introduction
Many e-Science projects need to manage and process huge amounts of data distributed across multiple grid nodes and stored in heterogeneous databases. Research on grids [1] has generally focused on applications where data is stored in files, but there is now an urgent need to interconnect legacy and independently operated databases [2]. Database Management Systems (DBMSs) [3] are a reliable, accepted and powerful instrument for storing persistent data but, to date, they are not grid enabled (with the notable exception of Oracle [4]); that is, there is no data grid middleware based on standard protocols that provides basic requirements such as security, availability and transparency and fully exploits the power of grids. Recently, many efforts have concentrated on this direction, basically providing static services for data access and integration [5, 6, 7], but to really exploit a grid infrastructure for data management it is necessary to introduce a set of dynamic services, i.e., a framework that is more complex and adaptive than earlier proposals [8].

This paper presents the Grid-DBMS specification, illustrating a framework for data management in grid environments. We do not propose a new kind of DBMS or suggest a specific grid middleware. Rather, the aim of this paper is to identify the services which are strongly involved in the data grid area, paying special attention to adaptive and dynamic data management services.

The outline of the paper is as follows. In Section 2 we introduce the Grid-DBMS concept and related definitions. In Section 3 we present the requirements of this framework, whereas in Section 4 we describe the main Grid-DBMS services. In Section 5 we highlight the basic components of a Grid-DBMS, whilst in Section 6 we illustrate the overall architecture. In Section 7 we discuss the most important issues related to a Grid-DBMS, whereas in Section 8 we present the GRelC project [9], an implementation of a Grid-DBMS. We recall related work in Section 9, and conclude the paper in Section 10.

2. Grid-DBMS: Basic Definitions

This paper presents the Grid-DBMS concept, a system for dynamically managing data sources in grid environments. Before delving into the details of the Grid-DBMS requirements and architecture, we need to introduce this concept formally by providing some basic definitions. In our vision a “Grid-DBMS is a system which automatically, transparently and dynamically reconfigures at runtime components such as Data Resources, according to the Grid state, in order to maintain a desired performance level. It must offer an efficient, robust, intelligent, transparent, uniform access to Grid-Databases”. By dynamic reconfiguration we mean:
• Data source relocation (a Data Source can be moved from one node of the grid to another);
• Data source replication (a Data Source can be replicated on different nodes of the grid);
• Data source fragmentation (a Data Source can be partitioned across different nodes of the grid).

In the Grid-DBMS definition we referred to the Grid-Database concept, which “is a collection of one or more databases logically interrelated (distributed over a grid environment) which can also be heterogeneous and contain replicas, accessible through a Grid-DBMS front end. It represents an extension and a virtualization of the Database concept in a grid environment”. The last definition that we need to introduce concerns the Logical Data Space, which refers to “the virtualized physical space in which a Grid-Database can be managed”. Finally, the three concepts can be brought together as follows: “A Grid-Database is dynamically managed by the Grid-DBMS within its Logical Data Space”.

Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’05) 0-7695-2315-3/05 $ 20.00 IEEE
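To make the three reconfiguration operations concrete, the following is a minimal Python sketch of a placement catalogue for a Grid-Database. It is an illustration under stated assumptions, not part of the Grid-DBMS specification: the class name `GridDatabase`, its `placement` mapping and the node and fragment names are all hypothetical, and the sketch only tracks where copies live, without moving any data.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GridDatabase:
    """Hypothetical catalogue entry recording which nodes host each fragment."""

    name: str
    # fragment id -> set of grid nodes holding a copy of that fragment
    placement: dict[str, set[str]] = field(default_factory=dict)

    def relocate(self, fragment: str, src: str, dst: str) -> None:
        """Relocation: move one copy of a fragment from node src to node dst."""
        nodes = self.placement[fragment]
        if src not in nodes:
            raise ValueError(f"{fragment} is not hosted on {src}")
        nodes.remove(src)
        nodes.add(dst)

    def replicate(self, fragment: str, dst: str) -> None:
        """Replication: register an additional copy of a fragment on node dst."""
        self.placement[fragment].add(dst)

    def fragment(self, fragment: str, parts: list[str]) -> None:
        """Fragmentation: split a fragment into named parts on the same nodes."""
        nodes = self.placement.pop(fragment)
        for part in parts:
            self.placement[part] = set(nodes)


# Usage: start with a single copy, then replicate, relocate and fragment it.
db = GridDatabase("proteins", {"all": {"node-a"}})
db.replicate("all", "node-b")
db.relocate("all", "node-a", "node-c")
db.fragment("all", ["part-1", "part-2"])
```

Keeping placement as pure metadata mirrors the Logical Data Space idea: clients refer to the Grid-Database by name, while the catalogue is free to change underneath them.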

3. Grid-DBMS: Requirements
In our vision the Grid-DBMS must satisfy the following basic requirements:

Security: data security is a fundamental requirement of a Grid-DBMS, which aims at protecting data against unauthorized access. It includes data protection and user control. The former is required to prevent unauthorized users from understanding the physical content of data, so data encryption must be used to protect information exchanged on the network. The latter is required to perform authentication and authorization. Authentication checks a user’s identity, whilst authorization determines whether a user has the right to perform a read/write operation on a database object (the object represents a subset of a database).

Transparency: this refers to the separation of the higher-level semantics of a system from low-level implementation issues. There are various possible forms of transparency within a distributed environment, so a Grid-DBMS must hide many implementation details strongly connected with:
- physical data location: the user must know nothing about the physical location of a database on the grid. This way, mechanisms which move (totally or partially) a data source can be entirely transparent to the user (data relocation transparency);
- network: distributed databases require the network to be handled properly, and a Grid-DBMS must also hide any details connected with it;
- data replication: replication of data improves the performance, reliability and availability of the entire system. It is worth noting here that the user must not be aware of the existence of multiple copies of the same logical information;
- data fragmentation: fragmentation of data consists of dividing database information into smaller fragments and treating each one of them as a separate unit. This process allows the system to improve global performance, availability and reliability. Moreover, fragmentation increases the level of concurrency and therefore the system throughput;
- DBMS heterogeneity: many different DBMSs exist, such as ORACLE, DB2, PostgreSQL, MySQL, etc. Moreover, an increasing number of applications interact with non-relational databases (e.g., flat files in the bioinformatics field). The Grid-DBMS has to conceal this heterogeneity (different back-end errors, APIs, data types, physical support, etc.) by providing a uniform access interface to data sources, that is, by performing data virtualization. This way the access mechanism will be independent of (transparent with respect to) the actual implementation of the data source.

Easiness: the Grid-DBMS must provide an “easy” solution for accessing data sources in grid environments. Ease of use, from both the developer and the administrator point of view, must be provided by rich, automatic and powerful instruments (such as libraries, high level configuration tools, management consoles, etc.).

Robustness: this is a key factor in a distributed environment. Indeed, stability is a basic requirement for components which provide access to and interaction with data sources.

Efficiency: from the performance point of view the Grid-DBMS must provide high throughput, concurrent accesses, fault tolerance, reduced communication overhead, etc. Three factors have a strong impact on the performance of a Grid-DBMS:
- data localization, that is, the ability to store/move data in close proximity to its point of use. This can lead to reduced query response time (due to data fragmentation) and better exploitation of distributed resources (using different CPUs and I/O services for different portions of a database);
- query parallelism, due basically to data distribution (intra-query parallelism) and concurrent accesses to the Grid-DBMS (inter-query parallelism);
- high level queries: in grid environments new kinds of queries can improve global performance (reducing connection time and/or the amount of data transferred) by exploiting advanced and efficient data transport protocols (e.g., protocols supporting parallel data streams), compression mechanisms, and so on.

Dynamicity: as stated in our definition, a Grid-DBMS must be dynamic in the sense that, according to the state of the grid resources, it has to reconfigure its components (data sources) in order to provide high performance, availability and efficiency. Relocation of data sources, fragmentation (which consequently leads to the fragment allocation problem) and replication of databases are thus the three basic pillars which can be jointly used by the Grid-DBMS to perform more complex and dynamic data management activities.

Intelligence: a Grid-DBMS must provide some intelligent components (e.g., smart schedulers) in order to carry out the dynamic mechanisms cited above. In the Grid-DBMS architecture that we envision, we basically need two kinds of schedulers:
1) a data scheduler, which must address the following, choosing the “best” nodes of the grid (taking into account system optimization performance parameters/goals):
• relocation of data sources;
• replication of data sources;
• fragment allocation;
2) a query scheduler, which has to:
• provide a distributed query optimization engine which must find, in grid environments, the best node (computational resource) on which critical operations (join, semi-join, union, cartesian product, etc.) can be performed;
• choose, from a set of replicated catalogues, the “best” replica of the dataset to use, in order to maximize throughput, provide load balancing, and minimize response time and communication overhead.

To support decision making processes (scheduling activities, self-diagnosis tools, etc.), the schedulers have to retrieve both static and dynamic information from a Grid Information Service (local information about the machines, e.g., CPU, memory, disk), a Network Information Service (global information about the network, e.g., bandwidth and latency) and the Replica Performance Monitor (statistical information about query response times for different replicas of the same database hosted on different grid nodes).
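As an illustration of how a query scheduler might combine these three information sources when choosing a replica, here is a minimal Python sketch. The cost formula, the `ReplicaStats` record and the sample figures are assumptions made for the example; the paper does not prescribe a particular scoring function.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ReplicaStats:
    """Hypothetical per-replica snapshot gathered by the scheduler."""

    node: str
    response_times_s: tuple[float, ...]  # from the Replica Performance Monitor
    latency_s: float                     # from the Network Information Service
    cpu_load: float                      # from the Grid Information Service, 0..1


def estimated_cost(replica: ReplicaStats) -> float:
    """Assumed heuristic: past response time inflated by load, plus latency."""
    return mean(replica.response_times_s) * (1.0 + replica.cpu_load) + replica.latency_s


def choose_replica(replicas: list[ReplicaStats]) -> ReplicaStats:
    """Return the replica with the lowest estimated cost."""
    if not replicas:
        raise ValueError("no replica available")
    return min(replicas, key=estimated_cost)


# Usage with made-up figures for three replicas of the same database.
replicas = [
    ReplicaStats("node-a", (0.8, 1.2), latency_s=0.05, cpu_load=0.5),
    ReplicaStats("node-b", (1.4, 1.6), latency_s=0.20, cpu_load=0.0),
    ReplicaStats("node-c", (0.9, 1.1), latency_s=0.10, cpu_load=0.2),
]
best = choose_replica(replicas)
```

With these figures node-c is selected: it is neither the fastest historically nor the closest, but it has the lowest combined estimate. A real scheduler would also need to handle stale statistics and load balancing across concurrent queries.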

4. Grid-DBMS Services
The Grid-DBMS represents a valid, self-optimizing and self-configuring solution. It is not a new kind of DBMS, nor is it just a set of data access and integration services; rather, it is a dynamic and complex environment. In our vision the Grid-DBMS must provide the following services:
• Data Access Service (DAS): a layer providing a standard interface for relational and non-relational (e.g., textual database) data sources. It is placed between grid applications and DataBase Management Systems. It has to conceal many physical details such as the physical database location, the database name, etc. It must hide the DBMS heterogeneity by offering a uniform access interface to data sources, thus providing a first level of data virtualization;
• Data Gather Service (DGS): the main purpose of this service (placed on top of the DAS) is to allow the user to look at a set of distributed databases (fragments) as a single logical data source, that is, a Grid-Database. Hence, it has to hide the number of fragments, the DAS physical locations, etc., providing a second level of data virtualization. This layer must offer the capabilities of data federation [10,11] and distributed query processing (DQP) [12,13], providing the illusion that a single database is being accessed whereas, in fact, several distributed databases are being accessed;
• Static Management Service (SMS): this layer has to provide the basic primitives to move, split and copy a data repository. It is placed on top of the DAS and it is extensively used by the higher levels connected with dynamic data management. At this layer it is necessary to define a cross-DBMS delivery mechanism to move structured data from one location to one or more others;
• Data Monitoring Service (DMS): a service which aims at monitoring the entire Enterprise Grid (host and database performance) in order to obtain information snapshots useful for decision making processes. Exceeding critical thresholds can, in turn, trigger higher level data management processes aimed at re-establishing (by means of reconfiguration activities) a desired level of application performance;
• Dynamic Reconfiguration Service (DRS): a service which is responsible for automatically reconfiguring (replicating, relocating and partitioning) data sources, using the services offered by the SMS and considering:
o information (statistics) coming from the DMS;
o administrator-defined parameters (minimum performance levels, number of concurrent accesses, etc.);
• Data-Optimizer Service (DOS): a service which aims at optimizing the performance of data sources (creating views, indexes, etc., by exploiting the DAS) based on information coming from the DMS.

5. Grid-DBMS Main Components

At the highest level the Grid-DBMS consists of two components (see Fig. 1), one mainly hardware and the other specifically software:
• the Legacy System: it brings together both the hardware elements (PCs, workstations, storage, network, etc.) of the grid infrastructure (named Enterprise Grid, EG) and legacy software for data management (i.e., DBMSs already installed);
• the Grid-DBMS middleware: grid middleware which allows managing, accessing, integrating, optimizing and reconfiguring data sources installed within a Legacy System.

Fig. 1 Grid-DBMS Components

6. Grid-DBMS Architecture
As shown in Fig. 2, the Grid-DBMS architecture is composed of several layers, each one addressing important issues. For each level many primitives/services can be defined, but these details are out of the scope of this paper. In the Grid-DBMS architecture, the lower layers 1 and 2 are strongly related to the Legacy System component, whereas the upper layers 3, 4 and 5 are connected with the Grid-DBMS middleware (see Section 5). Let us now go into more detail about the Grid-DBMS architecture:
• the Fabric layer (level 1) comprises the underlying systems of the Enterprise Grid, that is, storage (containing data sources), computers, operating systems, computational resources, networks, routers and so on;



Fig. 2 Grid-DBMS architecture





• the DBMS layer (level 2) consists of a set of applications (DBMSs) used to interact with specific data sources (DBMSs, by definition, are powerful tools or complex software packages designed to store and manage databases). At this level we can find both commercial database products (e.g., ORACLE 10g, IBM/DB2, etc.) and Open Source database software (PostgreSQL, UnixODBC [13], etc.) which support data management by offering different solutions and functionalities;
• the Data Access layer (level 3) must provide basic and uniform primitives to access and interact with different data sources. It has to hide the DBMS heterogeneity from the higher levels, performing a basic kind of data virtualization. This layer (which can be considered as a set of static services) is composed of two sub-layers:
- the former (sublevel 3.1) is the Standard Database Access Interface (SDAI), strongly related to the database connectivity facilities. It provides a set of primitives to directly establish and manage a connection with data sources. The SDAI must provide a uniform set of back-end errors, APIs and data types, effectively masking any kind of heterogeneity;



- the latter (sublevel 3.2) is a set of components which exploit the SDAI and thus provide a set of high level services connected with session control, user management and data access control policy, concurrency and transaction management, advanced query submission, etc.;
• the fourth layer (level 4) consists of three joint subcomponents:
- the first is the Data Gather Service, which is responsible for offering data federation and DQP capabilities. Within this layer we find the Query Scheduler previously introduced (see Section 3);
- the second is the Grid Monitor, which is responsible for monitoring Host performance (resource consumption information, e.g., CPU load average, free memory, etc.) and Grid-Database performance (query response time, etc.). All of this metadata is then stored over time in a System Performance Database to supply information for decision making processes and diagnosis tools at the higher levels;
- the third is the Grid-Database Management, which provides basic services for dynamic data management. In particular this level offers relocation, replication and fragmentation primitives. It must exploit cross-DBMS delivery mechanisms to move structured data from one location to one or more others. Moreover, it must offer efficient data transport protocols (using, for instance, parallel streams) to reduce the data transfer time and consequently to improve the performance of the overall system;
• the fifth layer (level 5) is the Dynamic Grid-Database Management, which has to dynamically, transparently and automatically reconfigure data sources (exploiting level 4 primitives). At this level we find the Data Scheduler previously introduced (see Section 3) and also the Data Optimizer, which improves system performance by means of dynamically created views, indexes, etc.
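The interplay between the Grid Monitor at level 4 and the Dynamic Grid-Database Management at level 5 can be sketched as follows. This is a hedged Python illustration only: the `PerformanceMonitor` class, the sliding-window average, the threshold value and the "replicate onto a spare node" policy are all assumptions chosen for the example, and the in-memory sample store merely stands in for the System Performance Database.

```python
from __future__ import annotations

from collections import defaultdict, deque
from statistics import mean


class PerformanceMonitor:
    """Hypothetical stand-in for the Grid Monitor and its performance store."""

    def __init__(self, max_response_s: float, window: int = 5) -> None:
        self.max_response_s = max_response_s
        # node -> most recent query response times (bounded sliding window)
        self.samples: dict[str, deque[float]] = defaultdict(
            lambda: deque(maxlen=window)
        )

    def record(self, node: str, response_s: float) -> None:
        """Store one observed query response time for a node."""
        self.samples[node].append(response_s)

    def overloaded_nodes(self) -> list[str]:
        """Nodes whose windowed mean response time exceeds the threshold."""
        return sorted(
            node
            for node, times in self.samples.items()
            if mean(times) > self.max_response_s
        )


def plan_reconfiguration(
    monitor: PerformanceMonitor, spare_nodes: list[str]
) -> list[tuple[str, str, str]]:
    """Assumed policy: replicate each overloaded node's data onto a spare node."""
    return [
        ("replicate", src, dst)
        for src, dst in zip(monitor.overloaded_nodes(), spare_nodes)
    ]


# Usage: node-a degrades over time while node-b stays within the threshold.
monitor = PerformanceMonitor(max_response_s=2.0, window=3)
for t in (1.0, 3.0, 4.0):
    monitor.record("node-a", t)
for t in (0.5, 0.7, 0.6):
    monitor.record("node-b", t)
plan = plan_reconfiguration(monitor, ["node-c"])
```

The plan is returned as data instead of being executed, which keeps the decision step at level 5 separate from the level 4 primitives that would actually move or copy the data.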

7. Grid & Grid-DBMS: Issues
In the following we discuss some of the most important issues strongly related to the Grid-DBMS concept. By exploiting a Grid-DBMS, an Enterprise could:
• reuse the existing physical framework, optimizing the usage of physical resources in terms of storage and computational power. This way, by reusing the EG infrastructure (a low-cost solution), buying new large and expensive computing systems (a high-cost solution) can be avoided;
• improve the performance of the entire system in terms of efficiency, reliability and availability by moving, replicating and distributing data sources over the best-performing machines of the EG (the Grid-DBMS is able to support a changing workload by exploiting dynamic data resource reconfiguration processes);
• transparently access/join data stored in heterogeneous and widespread data sources, providing a real data virtualization;
• easily extend the physical framework, adding and/or removing resources without either turning off the system or activating complex reconfiguration processes;
• automatically monitor all the physical resources (EG) and the data sources (Grid-DB), providing self-diagnosis instruments aimed at reducing human interaction.

8. Case Study: the GRelC Project
The Grid Relational Catalog Project (developed at the CACT/ISUFI of the University of Lecce) represents a prototype of a Grid-DBMS. To date we have developed some services such as the Data Access Service (GRelC-Service [14]) and the Data Gather Service [15]. They provide several features such as mutual authentication based on X509v3 certificates, authorization based on an Access Control List, delegation mechanisms, data encryption in order to define a secure data transport layer, an access control policy to indicate who does what, and so on.

In our project we chose to adopt the Globus Toolkit [16] as grid middleware, because it represents the “de facto” standard, released under a public license and successfully deployed and used in many grid projects. Moreover, we can transparently take advantage of the uniform access to distributed resources using this middleware. Security is provided by means of the Grid Security Infrastructure (GSI) [17], a layer based on public key technology. Data transfer leverages the GridFTP [18] protocol and compression mechanisms in order to obtain high performance. By jointly using these two features we can dramatically reduce the connection time between the client application and the GRelC Service, leading to better performance in terms of efficiency and throughput, as shown by the results of some experiments conducted in a real European grid testbed (additional details can be found in [19]). New kinds of queries [20] are also available in order to improve the overall performance and to satisfy new requests coming from user applications.

To date, two versions of the GRelC Service are available: the former leverages a client/server architecture whereas the latter is a GSI enabled Web Service (using the gSOAP Toolkit [21]; moreover, to guarantee a secure data channel between the client and the GRelC Service, it uses the GSI support, available as a gSOAP plugin [22]). From the developer point of view, the service provides a rich set of methods/APIs including connection management, operations for data manipulation, and advanced delivery mechanisms leveraging an efficient data transport protocol (GridFTP) and exploiting compression libraries (Zlib).

We have also been investigating the Grid-DB Management services (level 4 of the Grid-DBMS architecture, see Section 6) in order to provide static services for data delivery, that is, basic primitives for the Dynamic Grid-Database Management Service. We plan in the future to move towards the Grid Services architecture (using the Globus Toolkit v4, which is OGSA [23] compliant).
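The effect of compression on the amount of data transferred can be illustrated with a short, hedged Python sketch. It does not use or reproduce the GRelC API and does not model GridFTP; the helper names `pack_rows` and `unpack_rows`, the CSV serialization and the sample rows are assumptions for the example. Only Python's standard `zlib`, `csv` and `io` modules are used.

```python
from __future__ import annotations

import csv
import io
import zlib


def pack_rows(rows: list[list[str]], level: int = 6) -> bytes:
    """Serialize a result set as CSV text and compress it with zlib."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return zlib.compress(buffer.getvalue().encode("utf-8"), level)


def unpack_rows(payload: bytes) -> list[list[str]]:
    """Decompress a payload produced by pack_rows and parse it back into rows."""
    text = zlib.decompress(payload).decode("utf-8")
    return list(csv.reader(io.StringIO(text, newline="")))


# Usage: a made-up, highly repetitive result set of 1000 rows.
rows = [["id", "organism", "status"]] + [
    [str(i), "Homo sapiens", "reviewed"] for i in range(1000)
]
payload = pack_rows(rows)
raw_size = sum(len(",".join(r)) + 2 for r in rows)  # approximate CSV size in bytes
restored = unpack_rows(payload)
```

Query results often contain repeated values, which is why a general-purpose compressor can shrink them substantially before delivery; the actual gain depends on the data and on the compression level chosen.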

9. Related Works
The Spitfire Project [24] is part of Work Package 2 of the European Data Grid Project and provides a means to access relational databases from the grid. It is a very thin layer on top of an RDBMS (by default MySQL) that provides a JDBC driver. It uses Web Service technology (Jakarta Tomcat) to provide SOAP-based RPC (through Apache Axis) to a few user-definable database operations.

The Open Grid Services Architecture Data Access and Integration (OGSA-DAI) [25] is another project concerned with constructing middleware to assist with access and integration of data from separate data sources via the grid. It is engaged in identifying the requirements, designing solutions and delivering software that will meet this purpose. The project was conceived by the UK Database Task Force and is working closely with the Global Grid Forum DAIS-WG and the Globus team.

These two projects are strongly related to data access and integration services, but they do not address dynamic data management as described in the Grid-DBMS specification.

The Storage Resource Broker (SRB) [26] was designed to provide applications with uniform access to distributed storage resources. Its main focus is file-based data and it provides several features including a Metadata Server, a logical naming scheme for datasets, automatic replica creation and maintenance, etc., but it does not solve the problem of integrating databases into a grid. With the Grid-DBMS concept, our aim is to provide a framework supporting both static (access and integration) and dynamic (management) services in a grid environment.

10. Conclusions
Nowadays, several data grid applications need to access, share, manage and integrate massive amounts of data distributed across heterogeneous and geographically spread grid resources. Database Management Systems represent a reliable, accepted and powerful instrument to store persistent data but currently (with a notable exception) they are not grid enabled. Recently, many efforts have basically provided only static services for data access and integration. Our aim is to provide a more complex framework supporting both static (access and integration) and dynamic (management) services in grid environments. This leads to the Grid-DBMS concept, which has been extensively defined and discussed in this paper. It provides the basic primitives for dynamically, transparently and automatically managing data sources in grids. We plan in the future to extend the Grid-DBMS architecture by including new services, functionalities and capabilities to take into account a wider range of issues related to grid data management.

11. References
[1] Foster, I., & Kesselman, C. (1998). The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann.
[2] Özsu, M.T., & Valduriez, P. (1999). Principles of Distributed Database Systems, 2nd edition, Prentice Hall (Ed.), Upper Saddle River, NJ, USA.
[3] Clemons E. K (1985). Principles of Database Design, Vol. 1, Prentice Hall (Ed.).
[4] Oracle Grid Computing Technologies URL: [http://otn.oracle.com/products/oracle9i/grid_computing/index.html].
[5] Database Access and Integration Services WG, URL: [https://forge.gridforum.org/projects/dais-wg].
[6] N. W. Paton, M. P. Atkinson, V. Dialani, D. Pearson, T. Storey, P. Watson, “Database Access and Integration Service on the Grid”, Global Grid Forum OGSA-DAIS WG. Technical Report (2002).
[7] P. Watson, Databases and the Grid. Technical Report CS-TR-755, University of Newcastle, 2001.
[8] V. Raman, I. Narang, C. Crone, L. Haas, S. Malaika, T. Mukai, D. Wolfson, C. Baru. “Data Access and Management Services on Grid”, Technical Report Global Grid Forum 5 (2002).
[9] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, "The GRelC Project: Towards GRID-DBMS", Proceedings of Parallel and Distributed Computing and Networks (PDCN) IASTED, Innsbruck (Austria) February 17-19, 2004.
[10] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. Open Grid Services Architecture: A Unifying Framework for Distributed System Integration. Technical Report, Globus Project, 2002. URL: [www.globus.org/research/papers/-ogsa.pdf]
[11] Sheth, A. P. and J. A. Larson (1990). “Federated Database Systems for Managing Distributed, Heterogeneous and Autonomous Databases”. ACM Computing Survey 22(3): 183-236.
[12] J. Smith, A. Gounaris, P. Watson, N. W. Paton, A.A.A. Fernandes, and R. Sakellariou. Distributed query processing on the grid. In Proceedings of the 3rd International Workshop on Grid Computing (GRID 2002), pages 279-290. LNCS 2536, Springer-Verlag, 2002.
[13] L. Haas, D. Kossmann, E.L. Wimmers and J. Yang, Optimizing Queries Across Diverse Data Sources. In proc. VLDB, pages 276-285, Morgan-Kaufmann, 1997.
[14] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, “Early Experiences with the GRelC Library", Journal of Digital Information Management, Vol. 2, No. 2, pp 54-60, June 2004. Digital Information Research Foundation (DIRF) Press.
[15] M. Mirto, G. Aloisio, M. Cafaro, S. Fiore, “A Gather Service in a Health Grid Environment”, CD-Rom of Medicon and Health Telematics 2004, IFMBE Proceedings, Volume 6, July 31 - August 05, Island of Ischia, Italy.
[16] The Globus Project, URL: [http://www.globus.org/].
[17] Tuecke S. (2001). Grid Security Infrastructure (GSI) Roadmap. Internet Draft 2001. URL: [www.gridforum.org/security/ggf1_-200103/drafts/draftggf-gsi-roadmap-02.pdf]
[18] GridFTP Protocol. URL: [http://www-fp.mcs.anl.gov/dsl/GridFTP-ProtocolRFC-Draft.pdf]
[19] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, “Advanced Delivery Mechanisms in the GRelC Project” to appear in the Proceeding of 2nd International Workshop on Middleware for Grid Computing (MGC 2004), October 18 2004, Toronto, Ontario Canada.
[20] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, “The GRelC Library: A Basic Pillar in the Grid Relational Catalog Architecture”, Proceedings of Information Technology Coding and Computing (ITCC), April 5 to 7, 2004, Las Vegas, Nevada, Volume I, pp.372-376.
[21] R.A. Van Engelen, K.A. Gallivan, “The gSOAP Toolkit for Web Services and Peer-To-Peer Computing Networks.”, Proceedings of IEEE CCGrid Conference, May 2002, Berlin, pp. 128-135.
[22] M. Cafaro, D. Lezzi, R.A. Van Engelen, “The GSI plugin for gSOAP.”, URL: [http://sara.unile.it/~cafaro/gsiplugin.html]
[23] Foster, I., Kesselman, C., Nick, J., & Tuecke, S. (2002). The Physiology of the Grid: An Open Grid Services Architecture for Distributed System Integration. Technical Report for the Globus project. URL: [http://www.globus.org/-research/papers/ogsa.pdf].
[24] The Spitfire Project, URL: [http://edg-wp2.web.cern.ch/edg-wp2/spitfire/].
[25] Open Grid Services Architecture Data Access and Integration, URL: [http://www.ogsadai.org.uk/].
[26] SRB (2000). Storage Resource Broker Documentation v.1.1.8. URL: [http://www.npaci.edu/DICE/SRB/CurrentSRB/SRB.htm].

