Spatial data management

INTERACTION WITH USERS • SESSION C

Spatial data management Geir-Harald Strand Survey and statistics division, NIBIO

Geir-Harald Strand1

A considerable amount of official statistics is based on spatially referenced primary data. The potential of these spatial references is probably not fully utilized. A more cognisant handling of the spatial aspect of statistics can boost the efficiency of data collection, create new opportunities for data analysis and improve the communication of statistics through more and better cartography. The key is to ensure that professional spatial data management is included in the statistical information chain as described in the GSBPM. This is partly a technological issue, but also an organizational challenge.

Key words: spatial data management, spatial information system

1. Introduction

Much of the official statistics produced today is based on primary data with a direct or indirect spatial reference. The reference can be an explicit coordinate, but is more likely to be an address, a cadastral unit or an administrative region. References of this kind allow observations to be linked by location, which provides for new approaches to data collection, new opportunities for data analysis and more use of maps as a visualization and communication tool. An example from the statistical community is the increased use of geographical grids as a framework for spatial statistics (Strand & Bloch 2009, Fujimoto et al. 2015).

Dedicated spatial data management through standardized and well-documented spatial databases is the key to realizing the potential of the spatial data held by producers of statistics. The primary data must remain inside the existing databases, but information about the locations referenced by the primary data can be organized in dedicated spatial data systems. The introduction of spatial data management for the statistics on land resources (Strand 2013), agriculture and forestry (Tomter et al. 2010) in Norway has fostered improvements throughout the entire information chain, from data collection through analysis to data delivery.

Spatially referenced data can be visualised as thematic maps, but also cross-linked by location and analysed with respect to spatial aspects. The potential is probably not fully utilized, as an example illustrates. Part of the Norwegian economic statistics for agriculture is based on a detailed review of the accounts from 910 farms. A possible application of these statistical data is to examine the difference in the economic results of sheep farmers in areas with and without the presence of large carnivores.

1 Survey and statistics division, NIBIO, Norway, Email: [email protected]

Statistics Sweden | scb.se/nsm2016 | [email protected]

The locations of the farms are, however, not recorded in the material. Extracting subsets based on location is therefore no simple task. Fortunately, the cadastral property identification code for each farm has been recorded. This is an asset, because the National Agricultural Administration (NAA) maintains a database where the cadastral property identification code of every farm holding in Norway is kept, together with key geographical data (including a representative point location). The location of each farm in the survey is thus obtained by connecting the two data sets and retrieving the locations from the NAA database. When this is done, the survey data can be linked (by position) to a digital map of the management areas for large carnivores. The latter data set is produced and maintained by the National Environmental Administration and is available as part of the National Spatial Data Infrastructure. Through this link, the survey farms can be classified and analysed according to their location: inside or outside the management areas for large carnivores.

The example shows that there is a potential for a spatial application, but it is not yet utilized. Such examples are probably abundant. Increased attention to management of the spatial aspect of statistics is therefore expected to improve the efficiency of data collection, provide new opportunities for data analysis and improve the communication of statistics through more and better cartography.
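The two linking steps in the farm accounts example (a key join on the cadastral code, followed by a link by position) can be illustrated with a minimal Python sketch. It is not the procedure used by any of the agencies mentioned. The record layout, the codes, the income figures and the single square management area are invented for illustration, and the point-in-polygon test is a plain ray-casting routine that assumes simple polygons without holes.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # vertices in order; the ring is closed implicitly


def point_in_polygon(pt: Point, poly: Polygon) -> bool:
    """Ray casting: count how many edges a ray from pt towards +x crosses."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # The edge is relevant only if it straddles the horizontal line at y.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def classify_farms(
    survey: List[dict],
    register: Dict[str, Point],
    areas: List[Polygon],
) -> Dict[str, str]:
    """Step 1: key join on cadastral code. Step 2: link by position."""
    result: Dict[str, str] = {}
    for row in survey:
        code = row["cadastral_code"]
        location = register.get(code)  # representative point, if registered
        if location is None:
            result[code] = "unknown"
        elif any(point_in_polygon(location, area) for area in areas):
            result[code] = "inside"
        else:
            result[code] = "outside"
    return result


# Hypothetical data: survey rows carry no coordinates, only a cadastral code.
survey = [
    {"cadastral_code": "A-1", "net_income": 310_000},
    {"cadastral_code": "B-2", "net_income": 285_000},
    {"cadastral_code": "C-3", "net_income": 240_000},
]
# Hypothetical register: cadastral code -> representative point.
register = {"A-1": (2.0, 2.0), "B-2": (8.0, 1.0)}
# Hypothetical management area: one square polygon.
management_areas = [[(0.0, 0.0), (5.0, 0.0), (5.0, 5.0), (0.0, 5.0)]]

classes = classify_farms(survey, register, management_areas)
print(classes)
```

Farms whose code is missing from the register are reported as "unknown" rather than dropped, so that gaps in the indirect reference remain visible in the analysis.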

2. Everything is somewhere

“Everything is somewhere” is the catchy title of a geography quiz book for children (McClintock 1986). This assertion may not quite be true, but much of the primary data used in official statistics does have a location and can be linked to a place. People live somewhere and they work somewhere. Accidents happen somewhere. Goods are produced somewhere and maybe sold somewhere else. They are consequently transported between those places. Many statistics can be produced without any knowledge about these locations, but more statistics can be created when the locations are known, possibly also contributing new and valuable information (Goodchild 2007).

The spatial references can be direct or indirect. A direct spatial reference is an explicit coordinate or set of coordinates. The direct spatial reference can be used to place the observation on a map. The indirect spatial reference is a reference to another object that possesses the required direct spatial reference. An address, a cadastral unit or the identification code of an administrative (e.g. NUTSx) region can act as indirect references. This information cannot, by itself, place an observation on a map, but the coordinates of the referenced object can be used for this purpose.

Spatial references, direct or indirect, allow observations to be sorted, structured and combined by location. This provides opportunities for new approaches to data collection, new prospects for data analysis, and more use of maps as a visualization and communication tool. The prerequisite is obviously that spatial references are collected and stored along with the data.
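The distinction between direct and indirect references can be made concrete with a small data-structure sketch in Python. The class names, the `resolve` function and the lookup table (here called a gazetteer) are hypothetical and exist only for illustration. The point is that an indirect reference becomes mappable only through a second data set that holds the coordinates.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union

Coordinate = Tuple[float, float]


@dataclass(frozen=True)
class DirectReference:
    """An explicit coordinate: can be placed on a map as it stands."""
    coordinate: Coordinate


@dataclass(frozen=True)
class IndirectReference:
    """A pointer to another object that holds the coordinate."""
    kind: str        # e.g. "address", "cadastral_unit", "region"
    identifier: str


SpatialReference = Union[DirectReference, IndirectReference]

# Hypothetical lookup table: (kind, identifier) -> coordinate.
Gazetteer = Dict[Tuple[str, str], Coordinate]


def resolve(ref: SpatialReference, gazetteer: Gazetteer) -> Optional[Coordinate]:
    """Return a coordinate for the reference, or None if it cannot be resolved."""
    if isinstance(ref, DirectReference):
        return ref.coordinate
    return gazetteer.get((ref.kind, ref.identifier))


# Invented entries for illustration.
gazetteer: Gazetteer = {
    ("cadastral_unit", "A-1"): (10.5, 59.9),
    ("region", "R-03"): (10.7, 59.9),
}

direct = resolve(DirectReference((5.3, 60.4)), gazetteer)
indirect = resolve(IndirectReference("cadastral_unit", "A-1"), gazetteer)
missing = resolve(IndirectReference("address", "Unknown street 1"), gazetteer)
print(direct, indirect, missing)
```

The unresolved case returns `None`, which mirrors the prerequisite stated above: an indirect reference is only useful while the referenced data set remains accessible.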


3. Spatial data in the GSBPM context

The process needed to produce official statistics is described in the Generic Statistical Business Process Model (GSBPM). The eight phases of GSBPM are: 1) Specify needs; 2) Design; 3) Build; 4) Collect; 5) Process; 6) Analyse; 7) Disseminate; 8) Evaluate. The process described by the GSBPM thus resembles a value-chain, perhaps more appropriately described as an information-chain. The spatial aspects of the data must be accounted for throughout the process, and in particular in the central part represented by steps 4 to 7 (Figure 1).

Figure 1: GSBPM steps 4 – 7 with step 5 described as “Data management” instead of “Data processing”

Proper spatial references allow the process to collect data by linkage to other databases, and to use spatial analysis and cartographic communication if needed. Spatial references allow the data in a particular business process to interact, through spatial linkages, with spatial data residing in other business processes. The multiple use of the statistical system for farm accounts, described above, is an example of such interaction. The prerequisite for interaction is that proper spatial references (direct or indirect) are obtained in the data collection phase and that they are stored for later use. Consequently, there is a strong need for (spatial) data management. This is currently not included in the GSBPM (although it could be seen as a variant of phase 5: Processing data).

Standardization, together with good knowledge and understanding of the reference systems, is crucial to obtain proper and workable spatial references. Inadequate or faulty spatial references can rarely be corrected later, at least not without incurring major additional cost or loss of information. Too often, the potential of a data set is lost, not because of missing spatial references but due to imperfect or undocumented reference systems. The solution is to follow established national and international standards, in close cooperation with competent national authorities.

Data management is at the core of a spatial data information chain. In principle, the spatial reference is simply an additional characteristic of an observation. A database containing spatial data is at first glance only an extension of any ordinary database structure. Latitude and longitude (or any other spatial reference, e.g. NUTS code, address, postal code or a grid identifier) can be handled as variable characteristics. There is, however, an inherent spatial ordering implied in the reference system, representing not only where observations are made, but also how they are related to each other. This topological information (Egenhofer et al. 1989) must be maintained in the database in order to make full use of the spatial information (Marceau et al. 2001), and is why a dedicated spatial database system is needed for spatial data management.

Data analysis (phase 6 of GSBPM) is where information is extracted from the database. A database stocked with spatially referenced data allows the analyst to include the spatial context and relationships in the analysis (e.g. Voss 2007). Survey data can also be downscaled using small area estimation methodology (Strand & Aune-Lundberg 2012, Leyk et al. 2013).

Finally, the dissemination phase of the GSBPM is where the results are conveyed to users. Spatial references enable the use of maps as part of the reporting. Any information that could be drawn on a map (if the spatial reference were available) is potentially spatial data. The spatial reference can be direct (by coordinate) or indirect (by reference to another dataset containing coordinates). Clearly, it is important to maintain access to key data supporting indirect spatial reference (cadastres, address registers, NUTS data, grids and postal codes are but a few examples).
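The point about topological information made in this section can be illustrated with a short Python sketch. A table of units with centroid coordinates stored as ordinary variables does not say which units border each other, whereas the boundary geometry does. The sketch derives an adjacency relation from shared boundary edges. It assumes, as a simplification, that shared boundaries are digitized with identical vertices in both units. The unit names and coordinates are invented, and this is not the data model of any particular spatial database.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Point = Tuple[float, float]
Ring = List[Point]  # boundary vertices in order; closed implicitly


def edges(ring: Ring) -> Set[FrozenSet[Point]]:
    """Boundary edges as unordered vertex pairs, so direction does not matter."""
    n = len(ring)
    return {frozenset((ring[i], ring[(i + 1) % n])) for i in range(n)}


def adjacency(units: Dict[str, Ring]) -> Dict[str, Set[str]]:
    """Two units are neighbours if they share at least one boundary edge."""
    neighbours: Dict[str, Set[str]] = {name: set() for name in units}
    edge_sets = {name: edges(ring) for name, ring in units.items()}
    for a, b in combinations(units, 2):
        if edge_sets[a] & edge_sets[b]:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


# Hypothetical units: A and B share the edge x = 1, C lies apart.
units = {
    "A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "B": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
    "C": [(3.0, 0.0), (4.0, 0.0), (4.0, 1.0), (3.0, 1.0)],
}

neighbours = adjacency(units)
print(neighbours)
```

A dedicated spatial database would normally provide such relationships through its own predicates and indexes. The sketch only shows that the relation lives in the geometry, not in the attribute columns.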

4. Spatial data management by example

For statistical purposes, the Norwegian agricultural sector is a wilderness of observation units. The cadastral system is built on continuous parcels of land where one or more parcels can constitute a basic property unit. Each basic property unit has one or more owners and no basic property unit can cross the border between two NUTS 5 regions. The cadastre is a spatial database managed by the National Mapping Authority and maintained (online) by the local municipal authorities. Information about the basic property units, including their location and geometry (boundary), can be retrieved from the central database by way of remote data service requests (Strand 2001).

A basic property unit is a juridical entity, but does not correspond to the actual farming unit, which is defined as an economic entity. A farm can, and frequently will, consist of several basic property units. This is a changeable relationship, and is currently not represented in the cadastre. Instead, a centralized registry (the farm register) has been established, connecting basic property units to the operational farm units.

The spatial references in the cadastre are explicit and direct, providing coordinates for individual plots of land. The spatial references in the farm register are indirect, using the basic property numbers as references. Any change in the cadastre, e.g. an adjustment of the geometry of a parcel boundary, is thus immediately also reflected in the registry. As a consequence of this organization, a particular application can request the registry to return a list of all the basic property units belonging to a particular farm unit. This list is, as a next step, used to request a longer list from the cadastre, containing all the parcels for each basic property unit.


The direct spatial reference for the farm unit consists of the combined geometries of all these parcels retrieved from the cadastre, and can be used to calculate the area or draw a map of the farm. Equally important: This geometry can be used to request additional information about the farm from auxiliary databases. Examples are information about environmentally protected locations or cultural heritage sites.
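The two-step request chain (farm register first, cadastre second) can be sketched in Python as follows. The two dictionaries are in-memory stand-ins for the remote services, and all identifiers and coordinates are invented. The area is computed with the shoelace formula, assuming planar coordinates, simple polygons and non-overlapping parcels.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Ring = List[Point]  # parcel boundary, closed implicitly

# Hypothetical stand-in for the farm register:
# farm unit -> basic property units (an indirect reference).
FARM_REGISTER: Dict[str, List[str]] = {"farm-17": ["bpu-1", "bpu-2"]}

# Hypothetical stand-in for the cadastre:
# basic property unit -> parcel geometries (a direct reference).
CADASTRE: Dict[str, List[Ring]] = {
    "bpu-1": [[(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]],
    "bpu-2": [
        [(10.0, 0.0), (12.0, 0.0), (12.0, 2.0), (10.0, 2.0)],
        [(20.0, 0.0), (21.0, 0.0), (21.0, 1.0), (20.0, 1.0)],
    ],
}


def ring_area(ring: Ring) -> float:
    """Shoelace formula for the area of a simple polygon."""
    n = len(ring)
    twice_area = sum(
        ring[i][0] * ring[(i + 1) % n][1] - ring[(i + 1) % n][0] * ring[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def farm_parcels(farm_id: str) -> List[Ring]:
    """Request 1: property units from the register. Request 2: parcels from the cadastre."""
    parcels: List[Ring] = []
    for bpu in FARM_REGISTER[farm_id]:
        parcels.extend(CADASTRE[bpu])
    return parcels


def farm_area(farm_id: str) -> float:
    """Total area of all parcels that make up the farm unit."""
    return sum(ring_area(ring) for ring in farm_parcels(farm_id))


total = farm_area("farm-17")
print(len(farm_parcels("farm-17")), total)
```

Because the register stores only property unit identifiers, a boundary adjustment in the cadastre changes the result of `farm_area` without any update to the register, which is the behaviour described above.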

5. Statistics for farms

The spatial data management described for the farm units and cadastral service provides a basis for a broader system of farmland statistics. An example is the land resources found on each farm. Land resource mapping for individual farms or parcels is a costly and inefficient undertaking. It is easier to organize land resource mapping as a broad, national wall-to-wall survey. This was done in Norway, starting in the 1960s. The results were later digitized and the information is sustained by a continuous maintenance program. The information is kept in a spatial database (accessible from http://kilden.nibio.no) at the Survey and statistics division of NIBIO.

Figure 2: Combining cadastral units and land resource map units results in “atomic” spatial elements (sometimes called Minimal Mapping Units), unique with respect to cadastral as well as land resource information

The digital land resource “map” is a database representing a partition (in a mathematical sense) of the land surface. Formally, the data structure is quite similar to the parcel data held in the cadastre. Each unit in the database is an observation with an explicit spatial reference (a geometry) and a set of attributes characterizing the area. A national standardization program has assured that the reference geometry is compatible with the cadastre, as well as with all other spatial information held by public institutions in Norway. Consequently, although the actual shapes of the spatial units in the cadastre and the land resource map are different, they can be combined by simple geometrical operations in order to create “atomic” spatial units. An “atomic” unit is unique with respect to its ancestors, in this case the cadastral as well as the land resource information (Figure 2). Each “atomic” spatial unit references a single cadastral unit as well as a single land resource unit.

With reference to the farm registry, as described above, it is now possible to assemble all the atomic elements that belong to a particular farm unit and compute land resource statistics (area by land resource class) for the farm unit. The example can of course be extended to any combination of entities present in a common geographical space.

The Norwegian system for farmland statistics uses this approach to combine spatially referenced data from multiple sources, all using spatial data management and national geospatial standards to maintain compatibility (Figure 3). Information is fetched “on-the-fly” from several sources and combined as a preparation for analysis. It is then processed statistically and cartographically, and a report with tabular and cartographic output is returned to the user.

The user initiates the farmland statistics by choosing a farm identification code. The identification code is used in a request to the central farm register, which will return a list of the basic property units that constitute the farm. This list is used in a request to the national cadastre, which returns a list of land parcels, including geometry. The extreme north, south, east and west coordinates are used to define a bounding box around the farm.

Figure 3: The Norwegian system for farmland statistics combines data from multiple sources in order to compile on-the-fly statistical information.

The bounding box is used in a request to the land resources database, which returns the land resource units falling (at least partially) within the box. This is a potentially time-consuming operation, and spatial indexing of the database is critical in order to ensure a rapid response. The organization of the spatial database is thus also an important aspect of the system.

The parcel and land resource information is intersected, as described above, in order to create atomic spatial elements. Those not belonging to the farm in question are discarded and the rest are subjected to an elementary aggregation and summary operation, resulting in a statistical report of area per land resource class. Finally, cartographic background information (topography, aerial photographs etc.) is fetched by remote requests from other spatial databases (using the bounding box) and assembled into a visual report consisting of a map and a statistical table (Figure 4). The system, which can be accessed from http://gardskart.skogoglandskap.no/, performs the whole operation on-the-fly online, typically within 15-30 seconds.

The farmland statistics system is possible due to spatial data management. Basic information is available and each topic is maintained by a particular institution, without redundant copies that create uncertainty regarding data authority. Data are reused for several purposes, and standardization ensures compatibility between systems. Another important factor is the organization of a national spatial data infrastructure facilitating the sharing and exchange of data between public agencies.
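The core of the workflow described in this section (bounding box, candidate selection, intersection into atomic elements, aggregation per class) can be sketched in Python. To keep the example short and self-contained, all geometries are axis-aligned rectangles, which is a strong simplification. A real system would intersect arbitrary polygons and rely on the spatial index of the database. The class names and coordinates are invented. Only the farm's own parcels are intersected here, so atomic elements outside the farm never arise and need not be discarded.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Tuple


class Rect(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def bounding_box(rects: List[Rect]) -> Rect:
    """Extreme west, south, east and north coordinates of the farm's parcels."""
    return Rect(
        min(r.xmin for r in rects),
        min(r.ymin for r in rects),
        max(r.xmax for r in rects),
        max(r.ymax for r in rects),
    )


def intersection(a: Rect, b: Rect) -> Optional[Rect]:
    """Overlap of two rectangles, or None if they share no area."""
    xmin, ymin = max(a.xmin, b.xmin), max(a.ymin, b.ymin)
    xmax, ymax = min(a.xmax, b.xmax), min(a.ymax, b.ymax)
    if xmin < xmax and ymin < ymax:
        return Rect(xmin, ymin, xmax, ymax)
    return None


def area(r: Rect) -> float:
    return (r.xmax - r.xmin) * (r.ymax - r.ymin)


def area_by_class(
    parcels: List[Rect], land_units: List[Tuple[str, Rect]]
) -> Dict[str, float]:
    """Intersect parcels with land resource units and sum area per class."""
    box = bounding_box(parcels)
    # Coarse filter: keep only land units that touch the bounding box.
    candidates = [(c, r) for c, r in land_units if intersection(box, r) is not None]
    totals: Dict[str, float] = defaultdict(float)
    for parcel in parcels:
        for land_class, unit in candidates:
            atom = intersection(parcel, unit)  # one "atomic" element
            if atom is not None:
                totals[land_class] += area(atom)
    return dict(totals)


# Hypothetical farm with two parcels, and a small land resource map.
parcels = [Rect(0, 0, 4, 4), Rect(6, 0, 8, 2)]
land_units = [
    ("cropland", Rect(0, 0, 5, 2)),
    ("pasture", Rect(0, 2, 5, 6)),
    ("forest", Rect(5, 0, 10, 6)),
    ("forest", Rect(20, 20, 30, 30)),  # outside the bounding box
]

report = area_by_class(parcels, land_units)
print(report)
```

Because the land resource map is a partition of the surface, the class totals add up to the total parcel area, which gives a simple consistency check on the result.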

Figure 4: Output from the Norwegian system for farmland statistics.

6. Conclusion

Efficient and flexible use of spatial information in a statistical production process following the GSBPM requires systematic and well-designed data storage between data collection and analysis. There is wide acceptance of the fact that the efficiency of the information chain is enhanced when the data storage aspect is professionalized. It allows better documentation and easier access to data, and also facilitates multiple uses of the same data. This requires that data management is included as an element of the “processing” phase in the GSBPM. We maintain that this is equally true for spatial data.

Spatial data management involves a systematic approach to include spatial data and spatial references in the overall database management strategy of the information chain. The methodology as well as the technology needed to build, maintain and use spatial data management systems are well known and thoroughly tested. The obstacle is mainly organizational. The Survey and statistics division of NIBIO has developed its spatial data management system over a period of 20 years. Our experience is that a number of organizational factors represent the key to successful spatial data management:


1) Acknowledgement - accepting that spatial data management is an important issue for the organization
2) Ownership - involvement in the issue by the top management
3) Integration (of spatial data management) in the overall data management policy
4) Availability of resources - human as well as technical
5) Prioritizing - starting in one end and leaving some tasks for later
6) Long term commitment to spatial data management

7. References

Egenhofer, M. J., Frank, A. U. and Jackson, J. P. (1989) A topological data model for spatial databases. In: Buchmann, A. P., Günther, O., Smith, T. R. and Wang, Y-F. (eds) Design and Implementation of Large Spatial Databases, Lecture Notes in Computer Science, 409: 271-286. Springer Berlin-Heidelberg.

Fujimoto, S., Mizuno, T., Ohnishi, T., Shimizu, C. and Watanabe, T. (2015) Geographic Dependency of Population Distribution. In: Proceedings of the International Conference on Social Modeling and Simulation, plus Econophysics Colloquium 2014, 151-162. Springer International Publishing.

Goodchild, M. F. (2007) The Morris Hansen Lecture 2006: Statistical Perspectives on Spatial Social Science. Journal of Official Statistics, 23: 1-15.

Leyk, S., Buttenfield, B. P., Nagle, N. N. and Stum, A. K. (2013) Establishing relationships between parcel data and land cover for demographic small area estimation. Cartography and Geographic Information Science, 40: 305-315.

Marceau, D. J., Guindon, L., Bruel, M. and Marois, C. (2001) Building temporal topology in a GIS database to study the land-use changes in a rural-urban environment. The Professional Geographer, 53: 546-558.

McClintock, J. (1986) Everything is somewhere. The geography quiz book. William Morrow & Co.

Strand, G-H. (2001) The role of Agriculture and Forestry in a National Geospatial Data Infrastructure. Third International Conference on Geospatial Information in Agriculture and Forestry, Denver, Colorado, 5-7 November 2001.

Strand, G-H. (2013) The Norwegian area frame survey of land cover and outfield land resources. Norsk Geografisk Tidsskrift - Norwegian Journal of Geography, 67: 24-35.

Strand, G-H. and Aune-Lundberg, L. (2012) Small-area estimation of land cover statistics by post-stratification of a national area frame survey. Applied Geography, 32(2): 546-555.

Strand, G-H. and Bloch, V. V. H. (2009) Statistical grids for Norway. Documentation of national grids for analysis and visualization of spatial data in Norway. Documents 2009/9, Statistics Norway, Oslo.

Tomter, S. M., Hylen, G. and Nilsen, J. E. (2010) Norway. In: Tomppo, E., Gschwantner, T., Lawrence, M. and McRoberts, R. (eds) National Forest Inventories, Pathways for Common Reporting. Springer, pp. 411-424.

Voss, P. R. (2007) Demography as a spatial social science. Population Research and Policy Review, 26: 457-476.
