Geospatial Modelling Environment

# kde (Kernel Density Estimation)

## Calculates kernel density estimates based on a set of input points

### Description

This tool calculates kernel density estimates from a set of input points. It implements three types of kernel: Gaussian (bivariate normal), quartic, and uniform. The quartic kernel is an approximation to the Gaussian kernel that is used because it is computationally simpler and faster. However, I would suggest that for most scientific applications there is little justification for using the quartic kernel over the Gaussian kernel. The Gaussian kernel is the default in this tool; the quartic kernel has been included so that users can make comparisons with software packages that calculate the quartic kernel.

The bandwidth you provide depends on the type of kernel used in the calculation. If the kernel is bivariate normal, the bandwidth is the covariance matrix of a bivariate normal distribution. Although this is a 2x2 matrix, you need only provide three parameters because the two off-diagonal elements, which represent the covariance between x and y, are identical. The three parameters needed are thus: the variance for x, the variance for y, and the covariance. Note that some software packages require you to provide a bandwidth parameter, h, while others require h^2. Although h is usually the smaller number and therefore easier to work with, h^2 is the correct representation for the covariance matrix. It is important to be aware of how the bandwidth is represented when comparing the output from different software packages.
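To make the role of the three bandwidth parameters concrete, here is a minimal Python sketch of a bivariate normal kernel evaluated from them. It is an illustration under stated assumptions, not the tool's implementation: the function name `gaussian_kernel` and the argument names `var_x`, `var_y` and `cov_xy` are hypothetical, and the diagonal entries are assumed to be on the h^2 scale, as they are in any covariance matrix.

```python
import math

def gaussian_kernel(dx, dy, var_x, var_y, cov_xy):
    """Bivariate normal density at offset (dx, dy) from a point.

    The bandwidth matrix is H = [[var_x, cov_xy], [cov_xy, var_y]].
    All names here are illustrative, not part of the kde command.
    """
    det = var_x * var_y - cov_xy ** 2
    if var_x <= 0 or det <= 0:
        raise ValueError("bandwidth matrix must be positive definite")
    # Quadratic form (dx, dy) H^-1 (dx, dy)^T, written out for a 2x2 matrix.
    q = (var_y * dx * dx - 2.0 * cov_xy * dx * dy + var_x * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

# A bandwidth of h = 100 map units is h^2 = 10000 on the matrix diagonal.
h = 100.0
peak = gaussian_kernel(0.0, 0.0, h ** 2, h ** 2, 0.0)
one_h_away = gaussian_kernel(h, 0.0, h ** 2, h ** 2, 0.0)
```

With zero covariance and equal diagonal values the kernel is circular. At a distance of one h from the point, the kernel value has dropped to exp(-0.5), roughly 61%, of its peak.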

The command also implements several bandwidth estimation algorithms in the 'ks' library in R, which will estimate an optimized bandwidth matrix for you. The algorithms available are the plug-in estimator (bandwidth="PLUGIN"), smoothed cross validation (bandwidth="SCV"), biased cross-validation (bandwidth="BCV"), a second BCV algorithm (bandwidth="BCV2"), and least squares cross validation (bandwidth="LSCV"). Note that there can be large differences among the different bandwidth estimators, so it is recommended that you try several of them in order to determine which is most biologically relevant to your data and question. I have found the plug-in and SCV algorithms seem to perform very well. Processing times may be long with large numbers of points (thousands). Note also that some algorithms (e.g. LSCV) are sensitive to points with identical coordinates. You might consider adding a small amount of random noise to your points in R if this is an issue with your dataset. The bandwidths are calculated using the default settings for these estimators in the ks library in R.
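The text above suggests adding the noise in R. The same idea is sketched below in Python purely for illustration. The helper name `jitter`, the noise amount and the sample coordinates are all hypothetical. A sensible noise amount depends on your coordinate units and the precision of your locations.

```python
import random

def jitter(points, amount, seed=None):
    """Return a copy of points with uniform noise in [-amount, amount] added to x and y."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    return [(x + rng.uniform(-amount, amount),
             y + rng.uniform(-amount, amount)) for x, y in points]

# Two of these three points share identical coordinates.
points = [(500.0, 200.0), (500.0, 200.0), (510.0, 205.0)]
jittered = jitter(points, amount=0.5, seed=1)
```

Keep the noise well below the cell size and the expected bandwidth so that it breaks ties without changing the shape of the density surface.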

If the kernel is the quartic approximation to the bivariate normal distribution, then you specify only a single value that represents the radius beyond which the density estimate for the kernel is 0. The quartic kernel bandwidth parameter therefore corresponds to a real distance on the ground, unlike the bandwidth for the bivariate normal kernel, which is a covariance matrix. As a result, these two bandwidths do not directly map onto each other. You cannot, for instance, estimate the optimal bandwidth using a bivariate normal kernel algorithm (like least squares cross validation) and then use it in a quartic kernel calculation: the optimal bandwidth for the quartic kernel will be very different.
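To illustrate why the quartic bandwidth is a ground distance, the sketch below evaluates the textbook two-dimensional quartic (biweight) kernel. Whether the tool uses exactly this normalization is an assumption, and the function name `quartic_kernel` is hypothetical.

```python
import math

def quartic_kernel(d, h):
    """Textbook 2D quartic kernel at distance d from a point, for radius h.

    Normalized to integrate to 1 over the plane. The tool's own
    normalization may differ.
    """
    if h <= 0:
        raise ValueError("radius h must be positive")
    if d >= h:
        return 0.0  # no contribution at or beyond the radius
    u2 = (d / h) ** 2
    return 3.0 / (math.pi * h * h) * (1.0 - u2) ** 2

h = 500.0
at_centre = quartic_kernel(0.0, h)      # 3 / (pi * h^2)
half_radius = quartic_kernel(250.0, h)  # (1 - 0.25)^2 = 0.5625 of the centre value
outside = quartic_kernel(600.0, h)      # exactly 0 beyond the radius
```

Unlike the Gaussian kernel, which never quite reaches zero, this kernel contributes nothing beyond h. That is why a Gaussian bandwidth cannot simply be reused as a quartic radius.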

The uniform kernel corresponds to what is also sometimes referred to as 'simple density'. The bandwidth represents the radius of a circle within which points are counted around each cell. The density value is simply n / (pi * h^2), where n is the number of points in the circle and h is the bandwidth (the radius of the circle).
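The uniform kernel calculation for a single cell can be sketched in a few lines of Python. The function name `uniform_density` and the sample coordinates are hypothetical. How the tool treats points lying exactly on the circle boundary is not documented here, so the `<=` comparison is an assumption.

```python
import math

def uniform_density(cell_x, cell_y, points, h):
    """Count the points within radius h of the cell centre and divide by the circle's area."""
    h2 = h * h
    n = sum(1 for x, y in points
            if (x - cell_x) ** 2 + (y - cell_y) ** 2 <= h2)
    return n / (math.pi * h2)

# Distances from the cell centre (0, 0) are 0, 50, about 120 and 300 map units.
points = [(0.0, 0.0), (30.0, 40.0), (90.0, 80.0), (300.0, 0.0)]
density = uniform_density(0.0, 0.0, points, h=100.0)  # n = 2 points inside
```

Note the parentheses around the denominator: the count is divided by the whole area pi * h^2.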

It takes some experience to learn what suitable cell size values are. A cell size that is too large will result in a 'blocky' output raster that is a poor statistical approximation to a continuous surface. A cell size that is too small will result in a very large output raster (many cells) that takes a long time to calculate. I suggest the following rule of thumb to calculate a reasonable cell size: take the square root of the x or y variance value (whichever is smaller) and divide by 5 or 10 (I usually round to the nearest big number, so 36.7 becomes 40). Before using this rule of thumb value, calculate how many cells it will produce in the output (take the width and height of your input points, divide each by the cell size, and multiply the resulting numbers together). If you get a value somewhere between 1 and 20 million, then you have a reasonable value. If you have a value much larger than 20 million cells, then consider increasing the cell size.
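The rule of thumb, read as a way of choosing a cell size, works out as follows in a short Python sketch. The helper names `suggest_cell_size` and `cell_count`, the variances and the extent are made-up values for illustration only.

```python
import math

def suggest_cell_size(var_x, var_y, divisor=10):
    """Square root of the smaller variance, divided by 5 or 10."""
    return math.sqrt(min(var_x, var_y)) / divisor

def cell_count(min_x, max_x, min_y, max_y, cell_size):
    """Approximate number of output cells for the extent of the points."""
    cols = math.ceil((max_x - min_x) / cell_size)
    rows = math.ceil((max_y - min_y) / cell_size)
    return cols * rows

# Hypothetical bandwidth variances: sqrt(134689) = 367, so the raw value is 36.7.
raw = suggest_cell_size(var_x=134689.0, var_y=250000.0)
cell_size = 40.0  # rounded by hand to a convenient number

# Hypothetical point extent of 80 km by 60 km, in metres.
n_cells = cell_count(0.0, 80000.0, 0.0, 60000.0, cell_size)  # 2000 * 1500 cells
```

Here the result is 3 million cells, inside the suggested range of 1 to 20 million. The actual raster will be somewhat larger, because the default output extent adds a buffer around the points.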

A scaling factor is often used in KDE calculations to prevent a loss of precision in density values. Point density values are often very small numbers, and some raster formats do not support double-precision values (the Imagine img format is the only format that does, and for that reason I recommend it as the format for the output raster). The scaling factor is just a value by which the point density values are multiplied to make them larger. The default is 1000000. Again, scaling factors may vary between software packages, and this must be considered when making comparisons.

Weights can be optionally specified for each point via a field in the attribute table. The kernel based on each point is weighted by the value in this field, and the density estimates are standardized by dividing by the sum of the weight values. Thus, you do not need to standardize the weights yourself. By default (when no weight field is specified) all points are weighted equally.
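The standardization described above can be sketched in Python as follows. This is an illustration of the idea, not the tool's code: `weighted_kde` and `round_gaussian` are hypothetical names, and the circular Gaussian kernel with h = 100 is an arbitrary choice.

```python
import math

def round_gaussian(dx, dy, h=100.0):
    """Circular Gaussian kernel with bandwidth h (an illustrative choice)."""
    return math.exp(-(dx * dx + dy * dy) / (2.0 * h * h)) / (2.0 * math.pi * h * h)

def weighted_kde(x, y, points, weights, kernel):
    """Weighted density at (x, y): each point's kernel is scaled by its weight,
    and the sum is divided by the total weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(w * kernel(x - px, y - py)
                       for (px, py), w in zip(points, weights))
    return weighted_sum / total

points = [(0.0, 0.0), (150.0, 0.0)]
a = weighted_kde(50.0, 0.0, points, [1.0, 3.0], round_gaussian)
b = weighted_kde(50.0, 0.0, points, [10.0, 30.0], round_gaussian)
```

Because the sum is divided by the total weight, `a` and `b` are equal up to floating-point rounding. Only the relative sizes of the weights matter, which is why there is no need to standardize them first.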

By default the output extent is automatically calculated as the extent of the input point dataset plus a suitable buffer distance that ensures the density surface is not unduly truncated at the edges. However, you may override this extent using the 'ext' option which requires that you specify either a reference geospatial layer (vector or raster dataset), or the minimum x, maximum x, minimum y, and maximum y coordinates of the desired extent.

### Syntax

kde(in, out, bandwidth, cellsize, [weightfield], [scalingfactor], [kernel], [ext], [edgeinflation], [where]);

| Parameter | Description |
| --- | --- |
| in | the input point data source |
| out | the output raster data source |
| bandwidth | the bandwidth (see the help documentation for details); bandwidth estimators include: SCV, BCV, BCV2, PLUGIN, LSCV, CVh |
| cellsize | the cell size dimension of the output raster |
| [weightfield] | the name of a field in the input table that contains the weighting value for each point (the default is no field and equal weighting of all points) |
| [scalingfactor] | multiplies densities by this value - see help for details (default=1000000) |
| [kernel] | kernel type (default=GAUSSIAN; options: GAUSSIAN, QUARTIC, UNIFORM) |
| [ext] | the input reference data source (vector or raster dataset), or coordinates, that defines the analysis extent |
| [edgeinflation] | the extra distance to add to each side of the output raster extent when determining raster dimensions (default=0) |
| [where] | the selection statement that will be applied to the feature data source to identify a subset of features to process (see full Help documentation for further details) |

### Example

kde(in="C:\data\locs.shp", out="C:\data\kdeloc1.img", bandwidth=c(10000,10000,0), cellsize=20);

kde(in="C:\data\locs.shp", out="C:\data\kdeloc2", bandwidth=10000, cellsize=20, weightfield="WEIGHTS");

kde(in="C:\data\locs.shp", out="C:\data\kdeloc3", bandwidth=40000, cellsize=20, kernel="UNIFORM");

kde(in="C:\data\locs.shp", out="C:\data\kdeloc4.tif", bandwidth=500, cellsize=50, kernel="QUARTIC", edgeinflation=200);
