Spatial Data Analysis Support for Cancer Epidemiology in CARESS

Author: Sherilyn Owen
Frank Wietek¹, Vera Kamp²

¹ Oldenburger Forschungs- und Entwicklungsinstitut für Informatik-Werkzeuge und -Systeme (OFFIS), Escherweg 2, D-26121 Oldenburg, e-mail: [email protected]

² Universität Oldenburg, FB Informatik, Escherweg 2, D-26121 Oldenburg, e-mail: [email protected]

Abstract

CARTools, a toolbox designed to support population-based cancer registries, is currently being developed at OFFIS, an institute for computer science. The first application area of this toolbox is the registry being established in Lower-Saxony, a federal state of Germany. CARLOS (Cancer Registry Lower-Saxony) is the name of the corresponding project, which started in 1993. The CARTools comprise four tools, one of which is CARESS (CARLOS Epidemiological and Statistical Data Exploration System), which implements various routines for statistical and epidemiological data analysis on the registry database. This paper presents the functionality of CARESS for supporting spatial data analysis, which is the main focus of the system, and the design of the integration of data analysis features into an interactive graphical user interface. The system is intended to provide different features and interfaces for different user groups (epidemiologists, registry staff, external researchers) and different tasks (e.g. quality control, incidence monitoring, cluster analysis). To ensure both efficiency and flexibility, the underlying database system strives to integrate concepts from the fields of spatio-temporal, multidimensional and statistical database systems.

1 The Project CARLOS

To reinforce efforts in the fight against cancer and to improve knowledge about potential causes and risk factors for cancer, with the aim of better prevention, the Ministry of Social Affairs has been promoting the establishment of a homogeneous population-based cancer registry in Lower-Saxony, a federal state of Germany, since 1992. The corresponding project, CARLOS (Cancer Registry Lower-Saxony) [AFH+96], started in 1993 with a pilot period, followed by a test period from 1995 to 1997, and is presently moving on to routine data collection. The main task in the pilot period was to evaluate the applicability of a model for cancer registration proposed by Prof. Michaelis (Mainz) [MK92], on which the Federal Cancer Registration Act passed in 1995 is also based. This model distinguishes two separate offices: a notification office („Vertrauensstelle“) collects and encrypts patient-related reports before transferring them to the registration office („Registerstelle“), which persistently stores, links together, and analyses the data. Concepts for data encryption and record linkage have been designed, evaluated, and standardised [AMIT96].

Apart from OFFIS, which will take on the role of the registration office, the main collaborating partners in the project are, under the management of the ministry, the Association of Panel Doctors in Lower-Saxony („Kassenärztliche Vereinigung Niedersachsen (KVN)“) and a number of additional cancer-registering institutions, like clinical registries, pathological laboratories, etc. The KVN is expected to contribute the largest number of reports to the registry by way of the „Nachsorgeleitstellen“, which are institutions engaged in patient follow-up and the organisation of secondary therapy and further examinations.

Currently, the project is in the final stage of the test period lasting from 1995 to 1997. During this time, special emphasis has been put on the integration of a variety of cancer-registering institutions, the implementation of routine registration, and computer-based support for epidemiological data analysis.

To provide software support for almost all steps of population-based cancer registration, especially in Lower-Saxony, a toolbox called CARTools is being developed at OFFIS [AFH+96]. The CARTools comprise four tools providing support for the main tasks „data transfer to the registry“, „encryption of data sets“, „record linkage“ and „data analysis“. CARESS (CARLOS Epidemiological and Statistical Data Exploration System) is the name of the tool implementing the statistical and epidemiological data analysis process based on the registry database.

2 Analysing the Registry Database with CARESS

Experience in a number of cancer registries has shown that there is still a lack of facilities for analysing the large amounts of collected data flexibly and efficiently, both for routine monitoring and reporting and for answering ad-hoc queries from the public regarding supposed clusters of cases or increased incidence rates. Often, data storage and management, statistical analysis, and visualisation of results are implemented by different systems and tools, so that data transfer and conversion is time-consuming or at least inconvenient. Furthermore, in many cases new problems and queries that differ in just a few parameters require new software routines to be written and integrated.

The data analysis system CARESS is designed to meet these needs and to make work at population-based cancer registries more efficient and flexible. It is intended to provide a convenient data analysis environment for different groups of users (cf. fig. 1) interested in
• quality control, i.e. calculation of indices like HV (proportion of histologically verified tumours), M/I (ratio of mortality to incidence), DCO (number of cases known only from a death certificate), etc.,
• incidence monitoring, survival analysis, and cluster analysis,
• formation of hypotheses concerning potential cancer risks by comparison with background information,
• generation of reports and data export.

Fig. 1. System architecture and user groups of CARESS (figure labels: registry staff; research, general public; epidemiologists, statisticians; monitoring, routine analysis; exploratory data analysis; reports; export; menu-based user interface; graphical network editor; visual query language; MDD mapping layer; DB registry data; DB population data; DB spatial data; cancer registry)
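The quality-control indices named in the list above can be illustrated with a small sketch. It is not taken from CARESS: the record layout (`CaseRecord`) and the function name are hypothetical, and DCO is expressed here as a share of all cases rather than as an absolute count.

```python
from dataclasses import dataclass


@dataclass
class CaseRecord:
    """Hypothetical, simplified registry record (field names are assumptions)."""
    histologically_verified: bool
    death_certificate_only: bool


def quality_indices(cases, deaths_in_period):
    """Return the HV share, the DCO share and the M/I ratio for incident cases."""
    n = len(cases)
    if n == 0:
        raise ValueError("no incident cases")
    hv = sum(c.histologically_verified for c in cases) / n
    dco = sum(c.death_certificate_only for c in cases) / n
    mi = deaths_in_period / n  # mortality relative to incidence
    return {"HV": hv, "DCO": dco, "M/I": mi}


# Invented example: 10 incident cases, 4 cancer deaths in the same period.
cases = (
    [CaseRecord(True, False)] * 8
    + [CaseRecord(False, True)]
    + [CaseRecord(False, False)]
)
indices = quality_indices(cases, deaths_in_period=4)
```

A low HV share or a high DCO share would point to incomplete or late notification, which is why such indices are monitored routinely.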

We designed the system not to be restricted to use in the registry of Lower-Saxony, but to be portable at least to other population-based cancer registries and even to be applicable in similar domains like health reporting or descriptive epidemiology in general.¹ Thus we chose the metaphor of an „epidemiological workbench“, being modular and extensible with respect to new methods for analysis or visualisation and interfaces to further tools and systems.

¹ Since 1996, the cancer registry of Hamburg has also been evaluating CARESS for analysing the Hamburg cancer database.

Figure 1 outlines the architecture of CARESS. Tumour- and patient-related registry data is linked together with population data and additional spatial data into a unified view onto the database. This view is implemented by an MDD (multidimensional discrete data) mapping layer, which is based on concepts developed for spatio-temporal, multidimensional and statistical database systems [Bau94, CCS93, Gue94, Sho82]. Aggregated data sets (e.g. case or population counts by region, age, and sex) are implemented as data cubes, providing efficient access to groups of single data values. By defining categorisation hierarchies on classifying attributes (space, time, age, sex, or type of disease), flexible aggregation of data at different granularities is supported. At present, we make use of the relational database management system ORACLE, while also evaluating object-oriented and object-relational concepts.

To spare the user from having to formulate complex SQL queries to select and combine data, we plan to implement a visual query language based on the MDD mapping layer mentioned above. This language should comprise statistical, spatial and temporal operators as well as data management facilities for multidimensional data, especially aggregation. Further components of the system architecture are
• a graphical editor to easily combine different calculation and visualisation procedures, to compare algorithms and data sets within an exploratory data analysis, and even to facilitate integration of external tools [WK97],
• a tool to define and execute sequences of analysis steps for report generation and data export, and
• a menu-based user interface providing easy-to-use access to routine calculations and graphics.

Menu-based user interface and graphical editor are discussed in more detail in the following section. Support for report generation is still subject to future work.
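The idea of data cubes with categorisation hierarchies can be sketched in a few lines. The following is an illustration only and not the MDD mapping layer itself: the cube is a plain dictionary, and all region names, age classes, counts and the function `roll_up` are invented for the example.

```python
from collections import defaultdict

ALL = "ALL"  # marker: collapse a dimension completely

# Hypothetical categorisation hierarchies: fine category -> coarser category.
REGION_TO_DISTRICT = {
    "municipality_a": "district_1",
    "municipality_b": "district_1",
    "municipality_c": "district_2",
}
AGE_TO_GROUP = {"0-4": "0-14", "5-9": "0-14", "10-14": "0-14", "15-19": "15-29"}

# Base cube: (region, age class, sex) -> number of cases (invented counts).
base_cube = {
    ("municipality_a", "0-4", "f"): 2,
    ("municipality_a", "5-9", "m"): 1,
    ("municipality_b", "0-4", "f"): 3,
    ("municipality_b", "15-19", "m"): 4,
    ("municipality_c", "10-14", "f"): 5,
}


def roll_up(cube, mappings):
    """Aggregate a cube to a coarser granularity.

    mappings[i] is a dict mapping dimension i to a coarser level,
    None to keep the dimension unchanged, or ALL to sum it out.
    """
    result = defaultdict(int)
    for key, count in cube.items():
        new_key = []
        for value, mapping in zip(key, mappings):
            if mapping is None:
                new_key.append(value)
            elif mapping == ALL:
                new_key.append(ALL)
            else:
                new_key.append(mapping[value])
        result[tuple(new_key)] += count
    return dict(result)


by_district_and_age = roll_up(base_cube, (REGION_TO_DISTRICT, AGE_TO_GROUP, ALL))
by_sex = roll_up(base_cube, (ALL, ALL, None))
```

In this sketch a change of granularity is just a different tuple of mappings, so the same base cube can serve district-level maps and sex-specific totals without new query code.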

2.1 User Interfaces

A number of epidemiological measures (number of cases; crude, directly or indirectly standardised, and cumulative rates; SMR; relative risk; and so on) can be calculated in CARESS for a study population described by different parameters, like place of residence, time, age, sex, kind of tumour, and further medical attributes. Visualisation of results in bar charts and line charts, tables, and above all thematic maps is easily parametrised and can be modified interactively with respect to the kind of measure displayed, standard population, type of cases (incidence or mortality), and various specific parameters.

Figure 2 gives an impression of the user interface of CARESS, in this case an example of a thematic map created by the system. The open menu shows different types of maps provided (grouping by quantiles, maps based on a scale with 3-10 classes of equal ranges, probability maps, etc.). An online help system explains how to use the tool, describes all calculations and visualisation procedures, and gives the exact formulas of all measures.²
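Three of the measures mentioned above (crude rate, directly standardised rate, and SMR) can be written down compactly. The sketch below uses the textbook definitions; it does not reproduce the exact formulas of the CARESS online help, and the age classes, weights and reference rates are invented for illustration.

```python
def crude_rate(cases, population, per=100_000):
    """Cases per `per` persons, ignoring the age structure."""
    return per * sum(cases) / sum(population)


def directly_standardised_rate(cases, population, standard_weights, per=100_000):
    """Age-specific rates weighted by a standard population."""
    weighted = sum(c / p * w for c, p, w in zip(cases, population, standard_weights))
    return per * weighted / sum(standard_weights)


def smr(cases, population, reference_rates):
    """Standardised mortality (or morbidity) ratio: observed / expected cases,
    with expected cases obtained from reference rates per person (indirect
    standardisation)."""
    expected = sum(r * p for r, p in zip(reference_rates, population))
    return sum(cases) / expected


# Invented study population with three age classes.
cases = [2, 10, 30]
population = [20_000, 10_000, 5_000]
standard_weights = [50, 30, 20]             # illustrative standard population
reference_rates = [0.0001, 0.0008, 0.005]   # illustrative reference rates

crude = crude_rate(cases, population)
dsr = directly_standardised_rate(cases, population, standard_weights)
ratio = smr(cases, population, reference_rates)
```

With these numbers the crude rate is 120 per 100,000 and the directly standardised rate 155 per 100,000. The difference arises only because the standard population gives more weight to the oldest age class than the study population does, which is the effect standardisation is meant to expose.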

Fig. 2. Example of an evaluation

² As this system documentation is implemented using HTML, it is also accessible via WWW: „http://elbe.offis.uni-oldenburg.de:3999/caress-doku/“.

In the graphical network editor, a study comprising groups of evaluations is modelled as a network of methods. Each node represents the application of one method of a certain method class to one or more data sets, which may be processed in parallel (independently) or in combination as a joint input of one calculation. The connections between nodes represent the data flow between data sources, data analysis and visualisation procedures.

Menu-based interface and graphical editor are not meant to be isolated from each other; rather, the first can be regarded as a special-purpose view onto the second. Each menu selection can be translated into a network of single analysis steps (outlined in figure 3), which may be modified and extended in the network editor for further, more detailed calculations. Thus, both user interfaces are based on a single concept of data access and analysis.

Fig. 3. Translation of menu-based parametrisations to analysis networks (analysis steps shown: data source: registry database → restriction of parameters and aggregation → calculation of statistics and classification → visualisation of results)
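The four steps of figure 3 can be mimicked by a very small data-flow network. This is a sketch of the general idea and not the CARESS network editor: the `Node` class, the method names and the registry records are all hypothetical.

```python
class Node:
    """One application of a method to the outputs of its input nodes."""

    def __init__(self, method, *inputs):
        self.method = method
        self.inputs = inputs

    def evaluate(self):
        # Data flows along the connections: evaluate inputs first.
        return self.method(*(node.evaluate() for node in self.inputs))


def load_registry_data():
    # Invented records: (region, year, cases, population).
    return [("A", 1995, 3, 10_000), ("A", 1996, 5, 10_000),
            ("B", 1996, 2, 20_000), ("C", 1996, 9, 15_000)]


def restrict_and_aggregate(records, year=1996):
    totals = {}
    for region, rec_year, cases, population in records:
        if rec_year == year:
            c, p = totals.get(region, (0, 0))
            totals[region] = (c + cases, p + population)
    return totals


def calculate_and_classify(totals, per=100_000):
    overall = (per * sum(c for c, _ in totals.values())
               / sum(p for _, p in totals.values()))
    result = {}
    for region, (c, p) in totals.items():
        rate = per * c / p
        label = "above overall rate" if rate > overall else "below overall rate"
        result[region] = (rate, label)
    return result


def visualise(classified):
    # Stand-in for a thematic map: one text line per region.
    return [f"{region}: {rate:.1f} ({label})"
            for region, (rate, label) in classified.items()]


source = Node(load_registry_data)
selection = Node(restrict_and_aggregate, source)
stats_node = Node(calculate_and_classify, selection)
display = Node(visualise, stats_node)
lines = display.evaluate()
```

In such a representation, a menu selection would correspond to a predefined chain like the one above, and the editor would allow a node to be exchanged or a second visualisation node to be attached to `stats_node`.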

2.2 Basic Model for Data Analysis

Our application model is based on several layers describing a cycle of interactive data exploration (see figure 4). As already mentioned above, an integrated view onto the different kinds of data, especially the combination of statistical and spatial data, constitutes the core of our system. Selection of data requires facilities to perform visual queries as well as operators, especially spatial and temporal ones, which may be used in long interactive and iterative data selections. Techniques applied in analysing the data are determined by visual combination of predefined, above all statistical, operators based on aggregation and summation of data. Complex operations can be constructed and combined interactively using concepts of visual programming languages. Additionally, facilities to reuse parts of selected and analysed data in the course of the same study, but also in further data explorations, have to be provided. Finally, visual presentation of results both leads to data interpretation and triggers new steps of analysis.

Fig. 4. Interactive data exploration (figure labels: Case data, Environmental data, Spatial data; Integration; Database; Querying; Data selection; Analysis; Results; Visual presentation; Interpretation)

3 Spatial Statistics

The main focus of CARESS is on spatial data analysis. In collaboration with the cancer registry of Hamburg, different measures describing spatial and space-time clustering of cancer cases are being implemented in CARESS (cf. [BN91, Wal92]). Work in this area is based on results of the symposium „Methods of Spatial Description and Analysis of Cancer Registry Data“, which took place in March 1996 in Hamburg [BSMH96]. Cluster indices are intended to enhance the information presented visually in thematic maps (cf. section 2); we consider them another way of describing data, not primarily some kind of significance test.

One can distinguish between global and local cluster analysis, i.e. judging clustering in the region under study as a whole versus examining exact locations of clustering. The global cluster indices implemented in CARESS can be divided into three groups:
• Moran’s I (1948), Geary’s c (1954), the statistic of Ohno & Aoki (1979), and the D-test (1985) test for spatial autocorrelation, i.e. similar cancer rates in neighbouring regions. They differ in the use of measures (possibly incl. standardisation for age, sex, or population size in general), ranks and classifications.
• The test statistic of Potthoff-Whittinghill (1966) and similar indices consider heterogeneity of cancer rates in general or concentration of high rates in few regions without taking neighbourhood of regions into account.
• The test defined by Knox (1963) focuses on space-time interaction: Are cases which are „close“ in space also „close“ in time? This type of test is especially relevant for infectious diseases.

As far as possible, we calculate p-values for cluster indices both using simulations and using approximations based on estimates for mean and variance, offering the chance to compare and better understand the characteristics of these measures. Decision support regarding which index to use in which situation, and an improved understanding of how to interpret differences in the results provided by different indices, will be subject to further research.

In contrast to the cluster indices listed above, a procedure proposed by Besag & Newell in 1991 helps to find the locations of clustering. The algorithm can be outlined as follows:
• Select a cluster size M = k × average number of cases.
• For each region, test for clustering around this region:
  – Aggregate neighbouring regions until the sum of cases ≥ M.
  – Test for significance w.r.t. the aggregated population, based on a Poisson distribution of cases and an expected number of cases derived from the whole population under study.
• Highlight „significant“ regions.

Complementing the objective of „improving“ thematic maps regarding their power to judge clustering, we are also starting to evaluate, design, and implement procedures for smoothing cancer maps (distance-weighted smoothing, empirical Bayes estimators, splines etc.) in our project.
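The outline of the Besag & Newell procedure given above can be turned into a short sketch. It follows the steps as listed, under several assumptions that are not from the source: regions are represented by centroid coordinates, neighbours are added in order of centroid distance, the cluster size M is rounded up to an integer, and all data values and names are invented.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam) and a non-negative integer k."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for j in range(k):
        cdf += term
        term *= lam / (j + 1)
    return 1.0 - cdf


def besag_newell(regions, k):
    """Return a p-value per region for clustering around that region.

    regions: list of dicts with keys 'name', 'xy', 'cases', 'population'.
    """
    total_cases = sum(r["cases"] for r in regions)
    total_pop = sum(r["population"] for r in regions)
    # Cluster size M = k x average number of cases (rounded up: an assumption).
    cluster_size = math.ceil(k * total_cases / len(regions))
    p_values = {}
    for centre in regions:
        by_distance = sorted(
            regions, key=lambda r: math.dist(r["xy"], centre["xy"]))
        cases = pop = 0
        for r in by_distance:  # aggregate until the sum of cases >= M
            cases += r["cases"]
            pop += r["population"]
            if cases >= cluster_size:
                break
        else:
            continue  # M cannot be reached at all
        expected = total_cases * pop / total_pop
        p_values[centre["name"]] = poisson_upper_tail(cluster_size, expected)
    return p_values


# Invented example: four regions on a line, cases concentrated in region A.
regions = [
    {"name": "A", "xy": (0.0, 0.0), "cases": 6, "population": 1000},
    {"name": "B", "xy": (1.0, 0.0), "cases": 1, "population": 1000},
    {"name": "C", "xy": (2.0, 0.0), "cases": 1, "population": 1000},
    {"name": "D", "xy": (3.0, 0.0), "cases": 0, "population": 1000},
]
p_values = besag_newell(regions, k=3)
significant = [name for name, p in p_values.items() if p < 0.05]
```

Because one test is carried out per region, the resulting p-values are better read as a descriptive highlighting of the map than as formal tests, in line with the view of cluster indices taken in this section.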

4 Integration of Geographical Data

The most important spatial data source of CARESS is ATKIS („Amtliches Topographisch-Kartographisches Informationssystem“, Official Topographical Cartographic Information System), a database built up by the federal survey offices [AdV91]. Besides data describing the borders of administrative units, ATKIS provides a large number of geographical objects from the basic German map („Deutsche Grundkarte“), classified in different groups (e.g. streets, power plants, high-voltage mains). The project InterGIS [Fri97] aims at defining and implementing a flexible interface to ATKIS via the Internet. CARLOS, or more precisely CARESS, acts as a client of the geographical data services defined by the geographical data server. This also includes a library of geographical operations (like along of, around, etc.) located at the client (see figure 5). These operations serve as the basic tool for selecting the respective population under study and for correlating cancer rates with background information associated with geographical objects, thus guiding the formation of hypotheses concerning potential risk factors for cancer. These „first ideas“ may be subject to further investigation in subsequent case-control or cohort studies.
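How an operation such as „around“ could be used to select a study population is sketched below. The source names the operation but not its interface, so the signature, the point-based geometry and all coordinates and counts here are assumptions made purely for illustration.

```python
import math


def around(objects, centre_xy, radius):
    """Hypothetical 'around' operation: objects whose reference point lies
    within `radius` of `centre_xy` (same coordinate units)."""
    cx, cy = centre_xy
    return [o for o in objects
            if math.hypot(o["xy"][0] - cx, o["xy"][1] - cy) <= radius]


def rate(regions, per=100_000):
    """Crude rate per `per` persons over a group of regions."""
    return (per * sum(r["cases"] for r in regions)
            / sum(r["population"] for r in regions))


# Invented regions, represented by their centroids.
regions = [
    {"name": "R1", "xy": (1.0, 1.0), "cases": 12, "population": 8_000},
    {"name": "R2", "xy": (2.0, 1.5), "cases": 9, "population": 9_000},
    {"name": "R3", "xy": (8.0, 7.0), "cases": 5, "population": 10_000},
    {"name": "R4", "xy": (9.0, 2.0), "cases": 6, "population": 12_000},
]
object_xy = (1.5, 1.0)  # invented location of a geographical object

near = around(regions, object_xy, radius=2.0)
near_names = [r["name"] for r in near]
far = [r for r in regions if r["name"] not in near_names]
rate_ratio = rate(near) / rate(far)
```

A rate ratio well above 1 in such a comparison would be no more than a first hint and, as stated above, would call for further investigation in a dedicated study.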

Fig. 5. Architecture of a geographical data server (figure labels: CARLOS; WWW clients; Application; Cache; GeoLib; Internet, TCP/IP, CORBA; WWW-Server; Geo-Server; DB; Geographical Data (ATKIS))

5 Summary and Future Work

So far, menu-based user interface and graphical network editor are separate tools, with the editor at a prototype stage and the menu-based system already being employed in first routine data analyses. The next step will be to integrate these two systems into a homogeneous environment. Besides, in collaboration with the cancer registry of Hamburg, the system is continuously evaluated, enhanced by new routines for data analysis, and improved with respect to user-friendliness and efficiency. New concepts for data management, especially of spatial and multidimensional data, are evaluated and integrated.

We are aware of the problem known as „fishing for significance“. CARESS is a powerful data analysis tool that encourages the user to use and compare a variety of algorithms with similar aims (e.g. cluster analysis) on the same data set or even on a number of data sets describing different subsets of the population under study. A system of this kind may also be misused by users who do not take into account the danger of misinterpreting statistically significant results because they ignore the context of the analysis carried out (or even by those who do, but do not care). Although we want to emphasise that CARESS is just a tool which provides data analysis facilities and for whose use the users themselves are responsible, we strive to facilitate a reasonable use of our system by providing as much information as possible about the methods applied and a sensible way to interpret their results, also in combination with other procedures, in a user-friendly online help system. The integration of concepts from the field of knowledge-based systems, guiding users in their choice and warning them in case of potentially misleading results, is subject to future work (see [WK97] for some first ideas).

Once CARESS has been implemented as a complete data analysis environment in the cancer registry of Lower-Saxony, we aim at transferring the tool to related domains, thus showing the flexibility and extensibility of the underlying basic concepts and system architecture.

References

[AFH+96] H.-J. Appelrath, J. Friebe, H. Hinrichs, V. Kamp, J. Rettig, W. Thoben, and F. Wietek. CARLOS (Cancer Registry Lower-Saxony): Tätigkeitsbericht für den Zeitraum 1.1.-31.12.1996. Technical Report, OFFIS, Oldenburg, 1996.

[AFH+96] H.-J. Appelrath, J. Friebe, H. Hinrichs, V. Kamp, J. Rettig, W. Thoben, and F. Wietek. Softwarewerkzeuge für (epidemiologische) Krebsregister. In: M. P. Baur, editor, 41. Jahrestagung der GMDS, Bonn, Sept. 1996.

[AMIT96] H.-J. Appelrath, J. Michaelis, I. Schmidtmann, and W. Thoben. Empfehlung an die Bundesländer zur technischen Umsetzung der Verfahrensweisen gemäß Gesetz über Krebsregister (KRG). Informatik, Biometrie und Epidemiologie in Medizin und Biologie, 27(2): 101-10, 1996.

[AdV91] Arbeitsgemeinschaft der Vermessungsverwaltungen der Länder der Bundesrepublik Deutschland (AdV). ATKIS-Gesamtdokumentation, Teil D, ATKIS-Objektartenkatalog. Niedersächsisches Landesvermessungsamt, Hannover, 1991.

[Bau94] P. Baumann. On the Management of Multidimensional Discrete Data. VLDB Journal. Special Issue on Spatial Database Systems, 4(3): 401-444, 1994.

[BSMH96] C. Baumgardt-Elms, M. Schümann, S. v. Manikowsky, and U. Haartje, editors. Symposium „Methoden regionalisierter Beschreibung und Analyse von Krebsregisterdaten“, Hamburg, March 1996. Behörde für Arbeit, Gesundheit und Soziales, Temmen Verlag.

[BN91] J. Besag and J. Newell. The Detection of Clusters in Rare Diseases. Journal of the Royal Statistical Society, 154(1): 143-155, 1991.

[CCS93] E. F. Codd, S. B. Codd, and C. T. Salley. Providing OLAP (On-line Analytical Processing) to User Analysts: An IT Mandate. White Paper, Arbor Software Corporation, 1993.

[Fri97] J. Friebe. Eine GeoServer-Architektur zur Bereitstellung geographischer Basisdaten im Internet. In: K. R. Dittrich and A. Geppert, editors, Datenbanksysteme in Büro, Technik und Wissenschaft, pages 251-60, Ulm, March 1997. Springer Verlag.

[Gue94] H. Güting. Spatial Database Systems. VLDB Journal, 3(4), 1994.

[MK92] J. Michaelis and A. Krtschil. Aufbau des bevölkerungsbezogenen Krebsregisters für Rheinland-Pfalz. Ärzteblatt Rheinland-Pfalz, 45(10): 434-438, 1992.

[Sho82] A. Shoshani. Statistical Databases – Characteristics, Problems and Some Solutions. In: 8th International Conference on Very Large Data Bases (VLDB), pages 208-222, Mexico City, Mexico, 1982. Morgan Kaufmann.

[Wal92] S. D. Walter. The Analysis of Regional Patterns in Health Data. American Journal of Epidemiology, 136(6): 730-759, 1992.

[WK97] V. Kamp and F. Wietek. Intelligent Support for Multidimensional Data Analysis in Environmental Epidemiology. In: X. Liu, P. Cohen, and M. Berthold, editors, Advances in Intelligent Data Analysis - Reasoning about Data. Second International Symposium, IDA-97, volume 1280 of Lecture Notes in Computer Science, pages 299-310, London, August 1997. Springer Verlag.
