DATA ENTRY AND PREPARATION
INTRODUCTION
- Spatial data can be obtained from various sources: collected from scratch, using direct spatial data acquisition techniques, or indirectly, by making use of existing spatial data collected by others.
- Direct data collection could involve field survey data and remotely sensed images, while paper maps and existing digital data sets fall under indirect sources.
- This topic discusses direct and indirect data collection and use, with a view to preparing users of spatial data by drawing attention to issues concerning data accuracy and quality.
- A range of procedures for data checking and clean-up will be examined to prepare data for analysis, including several methods for interpolating point data.
DIRECT SPATIAL DATA CAPTURE
- One way to obtain spatial data is by direct observation of the relevant geographic phenomena. This can be done through ground-based field surveys, or by using remote sensors in satellites or airplanes.
- Data which is captured directly from the environment is known as primary data.
- With primary data, the core concern in establishing its properties is knowing the process by which it was captured, the parameters of any instruments used, and the rigour with which quality requirements were observed.
- Remotely sensed imagery is usually not fit for immediate use, as various sources of error and distortion may have been present, and the imagery should first be freed from these.
- An image refers to the raw data produced by an electronic sensor; such data are not pictorial, but arrays of digital numbers related to some property of an object or scene, such as the amount of reflected light.
- For an image, no interpretation of reflectance values as thematic or geographic characteristics has taken place. When the reflectance values have been translated into some ‘thematic’ variable, we refer to it as a raster.
- In practice, it is not always feasible to obtain spatial data by direct spatial data capture. Factors of cost and available time may be a hindrance, or previous projects sometimes have acquired data that may fit the current project’s purpose.
INDIRECT SPATIAL DATA CAPTURE
- Spatial data can also be sourced indirectly; such data can be derived from existing paper maps through scanning, digitized from a satellite image, purchased as processed data from data capture firms or international agencies, etc.
- This type of data is known as secondary data: any data which is not captured directly from the environment.
- Sources of secondary data in GIS are discussed below.
DIGITIZATION
- A traditional method of obtaining spatial data is through digitizing existing paper maps which can be done using various techniques. Before adopting this approach, one must be aware that positional errors already in the paper map will further accumulate, and one must be willing to accept these errors.
- There are two forms of digitizing: on-tablet and on-screen manual digitizing. In on-tablet digitizing, the original map is fitted on a special surface (tablet), while in on-screen digitizing, a scanned image (map or some other image) is shown on the computer screen. In both of these forms, an operator follows the map’s features (mostly lines) with a mouse device, thereby tracing the lines, and storing location coordinates relative to a number of previously defined control points.
- The function of these points is to ‘lock’ a coordinate system onto the digitized data: the control points on the map have known coordinates, and by digitizing them we tell the system implicitly where all other digitized locations are.
- At least four control points are needed, but preferably more should be digitized to allow a check on the positional errors made, through the use of the root mean square error (RMSE); a minimal sketch of this check follows this list.
- Another set of techniques also works from a scanned image of the original map, but uses the GIS to find features in the image. These techniques are known as semi-automatic or automatic digitizing, depending on how much operator interaction is required. If vector data is to be distilled from this procedure, a process known as vectorization follows the scanning process.
- This procedure is less labour-intensive, but can only be applied on relatively simple sources.
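As an illustration of the control-point idea, the following minimal Python sketch (using hypothetical tablet and map coordinates) fits an affine transformation from digitizer coordinates to map coordinates by least squares and reports the residual error at the control points:

```python
import numpy as np

# Digitizer (tablet) coordinates of four control points
# (hypothetical values, in tablet units).
src = np.array([[0.0, 0.0], [10.0, 0.2], [9.8, 15.1], [0.1, 15.0]])

# Known map coordinates of the same control points.
dst = np.array([[100.0, 200.0], [600.0, 210.0], [590.0, 960.0], [105.0, 955.0]])

# Fit a six-parameter affine transformation dst ~ [x, y, 1] @ params
# by least squares; four points give some redundancy over the minimum.
G = np.hstack([src, np.ones((len(src), 1))])   # design matrix
params, *_ = np.linalg.lstsq(G, dst, rcond=None)

# Residuals at the control points reveal positional digitizing errors.
residuals = dst - G @ params
rmse = np.sqrt((residuals ** 2).sum(axis=1).mean())
print(f"control-point RMSE: {rmse:.2f} map units")
```

With more control points than the minimum, a large residual RMSE flags either poorly digitized control points or distortion in the source map.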
SCANNING
- An ‘office’ scanner illuminates a document and measures the intensity of the reflected light with a CCD array. The result is an image as a matrix of pixels, each of which holds an intensity value.
- Office scanners have a fixed maximum resolution, expressed as the highest number of pixels they can identify per inch; the unit is dots-per-inch (dpi).
- For manual on-screen digitizing of a paper map, a resolution of 200–300 dpi is usually sufficient, depending on the thickness of the thinnest lines, while for manual on-screen digitizing of aerial photographs, higher resolutions are recommended—typically, at least 800 dpi.
- (Semi-)automatic digitizing requires a resolution that results in scanned lines of at least three pixels wide to enable the computer to trace the centre of the lines and thus avoid displacements. For paper maps, a resolution of 300–600 dpi is usually sufficient.
- Automatic or semi-automatic tracing from aerial photographs can only be done in a limited number of cases. Usually, the information from aerial photos is obtained through visual interpretation.
- After scanning, the resulting image can be improved with various image processing techniques. It is important to understand that scanning does not result in a structured data set of classified and coded objects.
- Additional work is required to recognize features and to associate categories and other thematic attributes with them.
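The three-pixel rule of thumb above translates directly into a minimum scanning resolution. A small sketch (the line thickness used is a hypothetical value):

```python
# Minimum dpi so that the thinnest line on the document is covered by
# at least `pixels_across` pixels, per the rule of thumb above.
MM_PER_INCH = 25.4

def min_dpi(thinnest_line_mm: float, pixels_across: int = 3) -> float:
    return pixels_across * MM_PER_INCH / thinnest_line_mm

# A 0.2 mm line needs roughly 381 dpi, which is consistent with the
# 300-600 dpi range quoted above for paper maps.
print(round(min_dpi(0.2)))  # -> 381
```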
VECTORIZATION
- The process of distilling points, lines and polygons from a scanned image is called vectorization. As scanned lines may be several pixels wide, they are often first thinned to retain only the centreline.
- The remaining centreline pixels are converted to series of (x, y) coordinate pairs, defining a polyline. Subsequently, features are formed and attributes are attached to them (the thinning step is sketched after this list).
- This process may be entirely automated or performed semi-automatically, with the assistance of an operator. Pattern recognition methods—like Optical Character Recognition (OCR) for text—can be used for the automatic detection of graphic symbols and text.
- Vectorization causes errors such as small spikes along lines, rounded corners, errors in T- and X-junctions, displaced lines or jagged curves.
- These errors are corrected in an automatic or interactive post-processing phase.
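A minimal sketch of the thinning step, assuming the scikit-image library and a toy binary scan, is shown below; ordering the centreline pixels into polylines and forming features would follow in a real vectorizer:

```python
import numpy as np
from skimage.morphology import skeletonize

# Toy binary scan: True = line pixel. A real input would come from
# thresholding a scanned map sheet.
scan = np.zeros((10, 10), dtype=bool)
scan[4:7, 1:9] = True               # a line three pixels wide

# Thinning: reduce the line to a one-pixel-wide centreline.
centreline = skeletonize(scan)

# The remaining centreline pixels become coordinate pairs, the raw
# material for building polylines.
coords = np.argwhere(centreline)    # (row, col) pairs
print(coords)
```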

SELECTING A DIGITIZING TECHNIQUE
- The choice of digitizing technique depends on the quality, complexity and contents of the input document. Complex images are better manually digitized, while simple images are better automatically digitized.
- Images that are full of detail and symbols—like topographic maps and aerial photographs—are therefore better manually digitized.
- In practice, the optimal choice may be a combination of methods. For example, contour line film separations can be automatically digitized and used to produce a DEM.
- Existing topographic maps must be digitized manually, but new, geometrically corrected aerial photographs, with vector data from the topographic maps displayed directly over them, can be used for updating existing data files by means of manual on-screen digitizing.
OBTAINING SPATIAL DATA ELSEWHERE
- Over the past two decades, spatial data has been collected in digital form at an increasing rate, and stored in various databases by the individual producers for their own use and for commercial purposes.
- More and more of this data is being shared among GIS users for several reasons. Some of this data is freely available, although other data is only available commercially, as is the case for most satellite imagery.
- High-quality data remain costly and time-consuming to collect and verify; in addition, more and more GIS applications address not just local, but national or even global processes.
- As we will see later, new technologies have played a key role in the increasing availability of geospatial data.
- As a result of this increasing availability, we have to be more careful that the data we have acquired is of sufficient quality to be used in analysis and decision making.
CLEARINGHOUSE AND WEB PORTALS
- Spatial data can also be acquired from centralized repositories. Most often, these repositories are embedded in Spatial Data Infrastructures, which make the data available through what is sometimes called a spatial data clearinghouse.
- This is essentially a marketplace where data users can ‘shop’. It will be no surprise that such markets for digital data have an entrance through the world wide web.
- The first entrance is typically formed by a web portal that categorizes all available data and provides a local search engine and links to data documentation (also called metadata).
- It often also points to data viewing and processing services. Standards-based geo-web services have become the common technology behind such portal services.
METADATA
- Metadata is defined as background information that describes the data itself. More generally, it is known as ‘data about data’, and this includes:
- Identification information: Data source(s), time of acquisition, etc.
- Data quality information: Positional, attribute, and temporal accuracy, lineage, etc.
- Entity and attribute information: Related attributes, units of measure, etc.
- In essence, metadata answers who, what, when, where, why, and how questions about all facets of the data made available.
- Maintaining metadata is a key part of maintaining data and information quality in GIS. This is because it can serve different purposes, from a description of the data itself to providing instructions for data handling.
- Depending on the type and amount of metadata provided, it could be used to determine the data sets that exist for a geographic location, evaluate whether a given data set meets a specified need, or process and use a data set.
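To make the three categories concrete, here is an illustrative (not standards-conformant) metadata record as a Python dictionary; a production system would follow a formal standard such as ISO 19115, and all values shown are hypothetical:

```python
# Hypothetical metadata record covering the categories listed above.
metadata = {
    "identification": {
        "title": "Land cover, study area",
        "source": "aerial photography",
        "acquisition_date": "2020-05-14",
    },
    "data_quality": {
        "positional_accuracy_rmse_m": 2.5,   # positional accuracy
        "attribute_accuracy_pct": 92,        # attribute accuracy
        "lineage": "digitized from a 1:50,000 topographic map",
    },
    "entity_and_attribute": {
        "attributes": ["land_cover_class"],
        "units": "categorical",
    },
}
```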
DATA FORMATS AND STANDARDS
- An important problem in any environment involved in digital data exchange is that of data formats and data standards.
- Different formats were implemented by different GIS vendors; different standards came about with different standardization committees.
- The phrase ‘data standard’ refers to an agreed-upon way of representing data in a system in terms of content, type, and format.
- The good news about both formats and standards is that there are many to choose from; the bad news is that this can lead to a range of conversion problems.
- Several metadata standards for digital spatial data exist, including the International Organization for Standardization (ISO) and the Open Geospatial Consortium (OGC) standards.
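As a small example of a conversion between formats, the sketch below (assuming the GeoPandas library and hypothetical file names) reads a data set in one format and writes it in another; the heavy lifting is done by the underlying OGR library:

```python
import geopandas as gpd

# Read an Esri Shapefile and write the same data as GeoJSON;
# the file names are hypothetical.
data = gpd.read_file("parcels.shp")
data.to_file("parcels.geojson", driver="GeoJSON")
```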
DATA QUALITY
- With the advent of satellite remote sensing, GPS, and GIS technology, and the increasing availability of digital spatial data, resource managers and others who formerly relied on the surveying and mapping profession to supply high-quality map products are now in a position to produce maps themselves.
- At the same time, GISs are being increasingly used for decision support applications, with increasing reliance on secondary data sourced through data providers or via the internet, through geo-web services.
- The implications of using low-quality data in important decisions are potentially severe. There is also a danger that uninformed GIS users introduce errors by incorrectly applying geometric and other transformations to the spatial data held in their database.
- Below we look at the main issues related to data quality in spatial data. We shall subsequently discuss positional, temporal, and attribute accuracy, lineage, completeness, and logical consistency.
- We will begin with a brief discussion of the terms accuracy and precision, as these are often taken to mean the same thing.
DATA ACCURACY VS DATA PRECISION
- Accuracy should not be confused with precision, which is a statement of the smallest unit of measurement to which data can be recorded.
- In conventional surveying and mapping practice, accuracy and precision are closely related. Instruments with appropriate precision are employed, and surveying methods are chosen, to meet specified accuracy tolerances.
- In GIS, however, the numerical precision of computer processing and storage usually exceeds the accuracy of the data. This can give rise to so-called spurious accuracy, for example calculating area sizes to the nearest m² from coordinates obtained by digitizing a 1:50,000 map.
- Using graphs that display the probability distribution of a measurement against the true value T, the relationship between accuracy and precision can be clarified.
- An accurate measurement has a mean close to the true value; a precise measurement has a sufficiently small variance.

POSITIONAL ACCURACY
- The surveying and mapping profession has a long tradition of determining and minimizing errors. This applies particularly to land surveying and photogrammetry, both of which tend to regard positional and height errors as undesirable.
- Cartographers also strive to reduce geometric and attribute errors in their products, and, in addition, define quality in specifically cartographic terms, for example, quality of linework, layout, and clarity of text.
- It must be stressed that all measurements made with surveying and photogrammetric instruments are subject to error. These include:
- Human errors in measurement (e.g. reading errors) are generally referred to as gross errors or blunders. These are usually large errors resulting from carelessness which could be avoided through careful observation, although it is never absolutely certain that all blunders have been avoided or eliminated.
- Instrumental or systematic errors (e.g. due to misadjustment of instruments). These vary systematically in sign and/or magnitude and therefore cannot be detected by repeating the measurement with the same instrument. Systematic errors are particularly dangerous because they tend to accumulate.
- So–called random errors are caused by natural variations in the quantity being measured. These are effectively the errors that remain after blunders and systematic errors have been removed. They are usually small, and dealt with in least–squares adjustment.
QUANTIFYING POSITIONING ACCURACY USING ROOT MEAN SQUARE ERROR (RMSE)
- Measurement errors are generally described in terms of accuracy.
- In the case of spatial data, accuracy may relate not only to the determination of coordinates (positional error) but also to the measurement of quantitative attribute data. The accuracy of a single measurement can be defined as: “the closeness of observations, computations or estimates to the true values or the values perceived to be true”.
- In the case of surveying and mapping, the ‘truth’ is usually taken to be a value obtained from a survey of higher accuracy, for example by comparing photogrammetric measurements with the coordinates and heights of a number of independent checkpoints determined by field survey.
- Although it is useful for assessing the quality of definite objects, such as cadastral boundaries, this definition clearly has practical difficulties in the case of natural resource mapping where the ‘truth’ itself is uncertain, or boundaries of phenomena become fuzzy.
- Prior to the availability of GPS, resource surveyors working in remote areas sometimes had to be content with ensuring an acceptable degree of relative accuracy among the measured positions of points within the surveyed area. If location and elevation are fixed with reference to a network of control points that is itself assumed to be free of error, then the absolute accuracy of the survey can be determined.
ROOT MEAN SQUARE ERROR
- Location accuracy is normally measured as a root mean square error (RMSE).
- The RMSE is similar to, but not to be confused with, the standard deviation of a statistical sample. The value of the RMSE is normally calculated from a set of check measurements (coordinate values from an independent source of higher accuracy for identical points).
- The differences at each point can be plotted as error vectors, as is done in the figure below for a single measurement. The locational error vector can be seen as the vector sum of its constituents in the x- and y-directions.
- For each checkpoint i, the error vector has components δxi and δyi. The observed errors should be checked for a systematic error component, which may indicate a (possibly repairable) lapse in the measurement method.
- The positional error of a measurement can be expressed as a vector, which in turn can be viewed as the vector sum of its constituents in the x- and y-directions, δx and δy. The systematic error Vx in x is then defined as the average deviation from the true value:

  Vx = (1/n) Σ δxi  (summing over the n checkpoints; Vy analogously)

- Analogously to the calculation of the variance and standard deviation of a statistical sample, the root mean square errors mx and my of a series of coordinate measurements are calculated as the square roots of the average squared deviations:

  mx = √((1/n) Σ δxi²)  and  my = √((1/n) Σ δyi²)
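A minimal numerical sketch of these formulas, using hypothetical checkpoint coordinates:

```python
import numpy as np

# 'True' coordinates from an independent survey of higher accuracy,
# and the measured coordinates of the same checkpoints (hypothetical).
true = np.array([[1000.0, 2000.0], [1050.0, 2010.0], [1020.0, 2060.0]])
meas = np.array([[1000.4, 1999.7], [1050.3, 2010.2], [1019.6, 2060.5]])

dx = meas[:, 0] - true[:, 0]   # error components in x
dy = meas[:, 1] - true[:, 1]   # error components in y

# Systematic errors: the average deviations per direction.
Vx, Vy = dx.mean(), dy.mean()

# Root mean square errors, per the definitions above.
mx = np.sqrt((dx ** 2).mean())
my = np.sqrt((dy ** 2).mean())
print(f"systematic: ({Vx:+.2f}, {Vy:+.2f}); RMSE: mx={mx:.2f}, my={my:.2f}")
```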
ACCURACY TOLERANCES
- The RMSE can be used to assess the probability that a particular set of measurements does not deviate too much from, i.e. is within a certain range of, the ‘true’ value.
- In the case of coordinates, the probability density function is often considered to be that of a two-dimensional normally distributed variable.
- The three standard probability values associated with this distribution are:
- 0.50 for a circle with a radius of 1.1774 mx around the mean (known as the circular error probable, CEP);
- 0.6321 for a circle with a radius of 1.4142 mx around the mean (known as the root mean square error, RMSE);
- 0.90 for a circle with a radius of 2.1460 mx around the mean (known as the circular map accuracy standard, CMAS).
- The RMSE provides an estimate of the spread of a series of measurements around their (assumed) ‘true’ values.
- It is therefore commonly used to assess the quality of transformations such as the absolute orientation of photogrammetric models or the spatial referencing of satellite imagery.
- The RMSE also forms the basis of various statements for reporting and verifying compliance with defined map accuracy tolerances.
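Given an RMSE, the circle radii above follow directly; a small sketch with a hypothetical mx of 0.5 m:

```python
import math

mx = 0.5                       # hypothetical RMSE in metres
cep  = 1.1774 * mx             # 50% circle (circular error probable)
rmse = math.sqrt(2) * mx       # 63.21% circle (radius sqrt(2) * mx)
cmas = 2.1460 * mx             # 90% circle (circular map accuracy standard)
print(f"CEP {cep:.3f} m, RMSE circle {rmse:.3f} m, CMAS {cmas:.3f} m")
```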
ATTRIBUTE ACCURACY
- We can identify two types of attribute accuracies. These relate to the type of data we are dealing with:
- For nominal or categorical data, the accuracy of labeling (for example the type of land cover, road surface, etc).
- For numerical data, numerical accuracy (such as the concentration of pollutants in the soil, height of trees in forests, etc).
- Depending on the data type, assessment of attribute accuracy may range from a simple check on the labeling of features—for example, is a road classified as a metalled road actually surfaced or not?—to complex statistical procedures for assessing the accuracy of numerical data, such as the percentage of pollutants present in the soil.
- When spatial data are collected in the field, it is relatively easy to check on the appropriate feature labels.
- In the case of remotely sensed data, however, considerable effort may be required to assess the accuracy of the classification procedures. This is usually done by means of checks at a number of sample points.
- The field data are then used to construct an error matrix (also known as a confusion or misclassification matrix) that can be used to evaluate the accuracy of the classification.
PERCENTAGE ACCURACY IN IMAGE CLASSIFICATION
- In the table below, three land use types were identified. Of the checkpoints that are forest in reality, 62 are also identified as forest in the classified image, but two forest checkpoints are classified in the image as agriculture. Conversely, five agriculture checkpoints are classified as forest.
- Correct classifications are found on the main diagonal of the matrix, which sums to 92 correctly classified points out of 100 in total.
- The table is an example of a simple error matrix for assessing map attribute accuracy. The overall accuracy is (62 + 18 + 12)/100 = 92%.
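The overall accuracy is simply the sum of the diagonal divided by the total number of checkpoints. In the sketch below, the 62/18/12 diagonal and the 2 and 5 misclassifications come from the text; the remaining off-diagonal entry and the third class name are assumptions made so that the matrix totals 100 checkpoints:

```python
import numpy as np

# Error (confusion) matrix. Rows: reference class at the checkpoints;
# columns: class in the classified image.
# Classes: forest, agriculture, other (third class name assumed).
matrix = np.array([
    [62,  2,  0],   # forest checkpoints
    [ 5, 18,  0],   # agriculture checkpoints
    [ 1,  0, 12],   # other checkpoints (off-diagonal entry assumed)
])

overall_accuracy = np.trace(matrix) / matrix.sum()
print(f"overall accuracy: {overall_accuracy:.0%}")   # -> 92%
```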
TEMPORAL ACCURACY
- As noted, the number of spatial data sets and the amount of archived remotely sensed data have increased enormously over the last decade.
- These data can provide useful temporal information such as changes in land ownership and the monitoring of environmental processes such as deforestation.
- Analogous to its positional and attribute components, the quality of spatial data may also be assessed in terms of its temporal accuracy.
- For a feature that is static in reality, for example, a difference between its coordinate values recorded at two different times reflects error rather than actual change.
- This includes not only the accuracy and precision of time measurements (for example, the date of a survey), but also the temporal consistency of different data sets.
- Because the positional and attribute components of spatial data may change together or independently, it is also necessary to consider their temporal validity.
- For example, the boundaries of a land parcel may remain fixed over a period of many years, whereas the ownership attribute may change more frequently.
LINEAGE
- Lineage describes the history of a data set. In the case of published maps, some lineage information may be provided as part of the metadata, in the form of a note on the data sources and procedures used in the compilation of the data.
- Examples include the date and scale of aerial photography and the date of field verification.
- Lineage may be defined more formally as:
- “that part of the data quality statement that contains information that describes the source of observations or materials, data acquisition and compilation methods, conversions, transformations, analyses and derivations that the data has been subjected to, and the assumptions and criteria applied at any stage of its life.”
- All of these aspects affect other aspects of quality, such as positional accuracy.
- Clearly, if no lineage information is available, it is not possible to adequately evaluate the quality of a data set in terms of ‘fitness for use’.
COMPLETENESS
- Completeness refers to whether data are lacking in the database compared with what exists in the real world.
- Essentially, it is important to be able to assess what does and what does not belong to a complete dataset as intended by its producer. It might be incomplete (i.e. it is ‘missing’ features that exist in the real world), or overcomplete (i.e. it contains ‘extra’ features which do not belong within the scope of the data set as it is defined).
- Completeness can relate to either spatial, temporal, or thematic aspects of a dataset. For example, a data set of property boundaries might be spatially incomplete because it contains only 10 out of 12 suburbs.
- It might be temporally incomplete because it does not include recently subdivided properties, and it might be thematically overcomplete because it also includes building footprints.
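A minimal sketch of a completeness check for the suburb example above, using hypothetical suburb names:

```python
# Suburbs the data set should cover versus those actually present.
expected = {f"suburb_{i:02d}" for i in range(1, 13)}   # 12 intended
present  = {f"suburb_{i:02d}" for i in range(1, 11)}   # 10 delivered

missing = expected - present   # spatial incompleteness
extra   = present - expected   # would indicate overcompleteness
print("missing:", sorted(missing), "extra:", sorted(extra))
```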
DATA PREPARATION
- Spatial data preparation aims to make the acquired spatial data fit for use.
- Images may require enhancement and correction, and the classification scheme of the data may need to be adjusted.
- Vector data also may require editing, such as the trimming of overshoots of lines at intersections, deleting duplicate lines, closing gaps in lines, and generating polygons.
- Data may require conversion to either vector format or raster format to match other data sets which will be used in the analysis.
- Additionally, the data preparation process includes associating attribute data with the spatial features through either manual input or reading digital attribute files into the GIS/DBMS.
- The intended use of the acquired spatial data may require only a subset of the original data set, as only some of the features are relevant for subsequent analysis or subsequent map production.
- In these cases, data and/or cartographic generalization can be performed on the original data set.
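A sketch of two of the vector editing steps mentioned above (deleting duplicate lines and closing small gaps), assuming the Shapely library and toy line work:

```python
from shapely.geometry import LineString
from shapely.ops import snap, polygonize

# Hypothetical digitized lines: one exact duplicate and one small gap.
lines = [
    LineString([(0, 0), (10, 0)]),
    LineString([(0, 0), (10, 0)]),        # duplicate -> to be dropped
    LineString([(10, 0), (10, 10)]),
    LineString([(10, 10), (0.2, 0.1)]),   # almost closes the ring
]

# Delete duplicate lines.
unique = []
for line in lines:
    if not any(line.equals(u) for u in unique):
        unique.append(line)

# Close small gaps by snapping vertices to the base line within 0.5 units.
cleaned = [snap(line, unique[0], tolerance=0.5) for line in unique]

# Generate polygons from the cleaned line work.
polygons = list(polygonize(cleaned))
print(f"{len(unique)} unique lines, {len(polygons)} polygon(s)")
```

Trimming overshoots and more robust gap closing would in practice be done with a GIS's topology-cleaning tools rather than hand-rolled code.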