Uses of $\Sigma$ and $\Pi$ Notation
Here we focus on common uses of $\Sigma$ and $\Pi$ notation within the GISciences.
General descriptive statistics
- Mean: take the sum of a set of elements, and multiply by the reciprocal of the length of the set.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
- Variance (of a discrete random variable): describes how far the observed values fall from the expected mean, computed as the probability-weighted sum of the squared deviations of all observations from the mean.
$$\mathrm{Var}(X) = \sum_{i=1}^{n} p_i (x_i - \mu)^2$$
You will also see $\sigma^2$ in place of $\mathrm{Var}(X)$. The meaning is the same. For the above, $\mu$ is the expected value
$$\mu = \sum_{i=1}^{n} p_i x_i$$
and $p_i$ is the probability of $x_i$, which comes from a probability mass function, the discrete counterpart of a probability density function such as the normal distribution (we will talk about these in coming weeks).
Here the goal is just to see the summation notation in action and to have a general idea of what the variance equation is doing. $\mu$ and $p_i$ can be viewed as 'black box' variables where something else is going on.
- Standard Deviation
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
Here $x_i$ is the value of some observation, $\mu$ is the mean of the set of all $x_i$, and $N$ is the number of elements in $x$ (or more formally, the population size). In a spatial analysis context, you will sometimes see the denominator of a statistic as $N$ instead of $n - 1$ in instances where the population size is known.
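To see the three summations in action, here is a minimal Python sketch that writes each sum out explicitly instead of calling a statistics library. The function names and the sample values are made up for illustration, and the standard deviation uses the population form (dividing by the number of elements).

```python
import math


def mean(xs):
    # (1 / n) * sum of x_i for i = 1..n
    n = len(xs)
    return (1 / n) * sum(x for x in xs)


def discrete_variance(values, probs):
    # mu = sum of p_i * x_i; Var(X) = sum of p_i * (x_i - mu)^2
    mu = sum(p * x for x, p in zip(values, probs))
    return sum(p * (x - mu) ** 2 for x, p in zip(values, probs))


def population_std(xs):
    # sqrt((1 / N) * sum of (x_i - mu)^2), with N the population size
    n = len(xs)
    mu = mean(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / n)


# Hypothetical observations, chosen so the arithmetic is easy to check by hand.
observations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(mean(observations))            # 5.0
print(population_std(observations))  # 2.0

# A fair six-sided die as a discrete random variable: every p_i is 1/6.
die_values = [1, 2, 3, 4, 5, 6]
die_probs = [1 / 6] * 6
print(discrete_variance(die_values, die_probs))  # about 2.9167
```

Each `sum(...)` call plays the role of one $\Sigma$: the generator expression inside it is the term being summed, and the loop over the list stands in for the start and stop values.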
Linear Regression
Linear regression is one way to model a linear relationship between a dependent variable and some number of independent variables. For example, the number of snowy days (dependent) might have a linear relationship with elevation (independent). We could try to model that relationship with linear regression.
If you have never taken a statistics class, no need to worry. The goal here is simply to demonstrate how summation notation is used in the context of linear regression.
Here we see a population (black dots) and a best fit line.
The relationship between these variables is modeled as $y \approx \beta_0 + \beta_1 x$, or the dependent variable is approximately equal to the intercept plus the slope of the line times $x$. This is a linear function.
How then do we estimate $\beta_0$ and $\beta_1$? We can do this using summation!
As a quick aside, compare $y$ and $\hat{y}$. The hat here means predicted value. Looking at the image above, we can clearly see that the function describing the blue line ($\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$) will sometimes predict a value close to the observed one and other times be quite divergent.
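Back to estimation. We can find the best fit coefficients using the following formulas (derived using calculus that we are not worried about here):
$$\hat{\beta}_1 = \frac{\sum\nolimits_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum\nolimits_{i=1}^{n} (x_i - \bar{x})^2}$$
and
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
$\bar{x}$ and $\bar{y}$ are the sample means.
The formula for estimating $\hat{\beta}_1$ moves the stop and start values from the top and bottom of the sigma, respectively, to its right-hand side. This is just another way that you might encounter a formula notated. The meaning is exactly the same.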
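As a rough illustration of the estimation step, the sketch below computes the usual least-squares slope (the sum of the products of the x and y deviations divided by the sum of the squared x deviations) and then the intercept from the two sample means. The function name `fit_line` and the elevation and snowy-day numbers are invented for this example; they are not real observations.

```python
def fit_line(xs, ys):
    """Estimate intercept and slope of a best fit line with explicit sums."""
    n = len(xs)
    x_bar = sum(xs) / n  # sample mean of x
    y_bar = sum(ys) / n  # sample mean of y

    # Numerator: sum over i of (x_i - x_bar) * (y_i - y_bar)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum over i of (x_i - x_bar)^2
    denominator = sum((x - x_bar) ** 2 for x in xs)

    slope = numerator / denominator
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data: elevation in km (independent) and snowy days (dependent).
elevation_km = [0.0, 1.0, 2.0, 3.0]
snowy_days = [1.0, 3.0, 5.0, 7.0]

intercept, slope = fit_line(elevation_km, snowy_days)
print(intercept, slope)  # 1.0 2.0
```

Because the made-up points lie exactly on a line, the fit recovers it exactly. With real data the predicted values would only approximate the observations.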
Computational Geometry
Here the topics get a bit closer to what is occurring under the hood of a Geographic Information System. Much of the foundational GIS work comes from geometry and, more recently, computational geometry.
- Length of a line
First, let's start with the length of a line segment composed of two points. We know that in Euclidean space the distance between any two points is defined by the Pythagorean theorem:
$$d(p_1, p_2) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
Therefore, it is possible to utilize Sigma notation to formulate the computation of the total line length as the sum of the lengths of the segments:
$$L = \sum_{i=1}^{n-1} d(p_i, p_{i+1}),$$
where $d$ is a function that computes the Euclidean distance between $p_i$ and the next point in the set, $p_{i+1}$, and $n$ is the total number of vertices in the line.
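The same idea can be sketched in a few lines of Python: define the single-segment distance function first, then nest it inside a sum over consecutive vertex pairs. The names `segment_length` and `line_length` and the sample vertices are assumptions made for this illustration.

```python
import math


def segment_length(p, q):
    # Euclidean distance between two (x, y) points via the Pythagorean theorem.
    return math.hypot(q[0] - p[0], q[1] - p[1])


def line_length(vertices):
    # Sum the segment lengths for i = 1..n-1; a line with n vertices has n-1 segments.
    return sum(
        segment_length(vertices[i], vertices[i + 1])
        for i in range(len(vertices) - 1)
    )


# Hypothetical polyline with three vertices (two segments of length 5 and 6).
polyline = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
print(line_length(polyline))  # 11.0
```

Note that the sum stops one short of the number of vertices, since the last vertex has no "next point" to measure to.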
- Area of a simple polygon
Frequently, we utilize a GIS to compute the area of 2-dimensional vector features. For example, if we want to know the area of all the residential lots in Tempe, we would first need to be able to compute the area of each individual lot. Then it is possible to use the same strategy as above: define the single entity function and then nest said function inside Sigma notation. Assuming every residential lot in Tempe is a simple polygon, we could formulate the area function as:
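$$A = \frac{1}{2}\sum_{k=1}^{n} (x_k y_{k+1} - x_{k+1} y_k),$$
where $A$ is the area, $p_k = (x_k, y_k)$ is the k-th point in the set of $n$ counter-clockwise sorted points, $(x, y)$ are the vertex (point) coordinates, and $p_{n+1} = p_1$ closes the polygon.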
For those of you that are interested: this formula can be derived from Green's Theorem.
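One common way to write the area of a simple polygon as a sum is the so-called shoelace form: half the sum, over consecutive vertex pairs, of $x_k y_{k+1} - x_{k+1} y_k$, wrapping from the last vertex back to the first. The sketch below assumes that form; the function name `polygon_area` and the rectangular "lot" are hypothetical.

```python
def polygon_area(vertices):
    """Signed area of a simple polygon given as a list of (x, y) vertices.

    Positive for counter-clockwise vertex order, negative for clockwise.
    """
    n = len(vertices)
    total = 0.0
    for k in range(n):
        x_k, y_k = vertices[k]
        # (k + 1) % n wraps the last vertex back to the first, closing the ring.
        x_next, y_next = vertices[(k + 1) % n]
        total += x_k * y_next - x_next * y_k
    return total / 2.0


# Hypothetical 4 x 3 rectangular lot, vertices in counter-clockwise order.
lot = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
print(polygon_area(lot))        # 12.0
print(polygon_area(lot[::-1]))  # -12.0, same lot with clockwise vertex order
```

The sign flip for clockwise input is why the vertex ordering matters; taking the absolute value is a common workaround when the ordering is unknown. Summing `polygon_area` over many lots would give the total area described above.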
- Average nearest neighbor distance
As a final example, in spatial point pattern analysis, it is frequently desirable to know the distance between a given point and its nearest neighbor, for example to compute a G-function.
Average nearest neighbor distance can be formulated by first computing the distance $d_{i,j}$ between a point $i$ and all other points $j$, and keeping the smallest: $d_{min}(i) = \min_{j \neq i} d_{i,j}$. Textually, compute the minimum distance ($d_{min}$) between a point $i$ and all of the other points $j$ in a set. This notation assumes that $i$ and $j$ index the same point set.
The mean nearest neighbor distance is then:
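$$\bar{d}_{min} = \frac{1}{n}\sum_{i=1}^{n} d_{min}(i),$$
where $d_{min}(i)$ is the distance between observation $i$ and its nearest neighbor, and $n$ is the number of observations.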
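A brute-force sketch of this computation is shown below: for every point, take the minimum distance to every other point, then average those minima. The function names and the three sample points are assumptions for illustration, and the approach compares all pairs, so it is only practical for small point sets. It also assumes the set holds at least two points.

```python
import math


def nearest_neighbor_distances(points):
    # For each point i, the minimum distance to any other point j (j != i).
    distances = []
    for i, (x_i, y_i) in enumerate(points):
        d_min = min(
            math.hypot(x_j - x_i, y_j - y_i)
            for j, (x_j, y_j) in enumerate(points)
            if j != i
        )
        distances.append(d_min)
    return distances


def mean_nearest_neighbor_distance(points):
    # (1 / n) * sum of the nearest neighbor distances over all n points.
    d = nearest_neighbor_distances(points)
    return sum(d) / len(d)


# Hypothetical point pattern.
pattern = [(0.0, 0.0), (0.0, 1.0), (3.0, 1.0)]
print(nearest_neighbor_distances(pattern))      # [1.0, 1.0, 3.0]
print(mean_nearest_neighbor_distance(pattern))  # about 1.6667
```

The `if j != i` filter is what keeps a point from being its own nearest neighbor at distance zero.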