## CENTRAL TENDENCY & DISPERSION

**Resources**- To access the Google Drive folders which contain the main tasks and resources for the different measures of central tendency and dispersion click here.

**Information**- The table below shows the measures of central tendency and dispersion that are required to be studied for the Welsh A level examination.

**Measures of Central Tendency**- There are three measures that you need to be able to calculate: the **mean**, **median** and **mode**.

The following GeoFile - Measures of Central Tendency and Variation contains information and tasks to be completed - to download the file click here.

The mean - The arithmetic mean or average (x̄) is perhaps the best known and most commonly used measure of central tendency. It has applications across a range of subject areas and is calculated by summing the data within a given set (∑x) and dividing by the total number of pieces of data (n), using the equation x̄ = ∑x ÷ n.

The quantitative data in Figure 1 is a set of midday temperature values recorded in the centre of Cheltenham throughout the month of June. Calculate the mean value where ∑ x = 613 and n = 30.

Therefore the mean (x̄) is equal to 613 ÷ 30 = 20.43°C.
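The calculation above can be checked with a few lines of Python. This is only a sketch: the two figures (∑x = 613 and n = 30) come from the text, while the helper `arithmetic_mean` is a hypothetical name added for illustration.

```python
# Values quoted in the text for the June temperature data.
sum_x = 613  # sum of the midday temperatures, in °C
n = 30       # number of readings

mean = sum_x / n
print(f"Mean temperature: {mean:.2f} °C")


def arithmetic_mean(values):
    """Return the arithmetic mean of a list of numbers (hypothetical helper)."""
    return sum(values) / len(values)
```

With a full list of readings, `arithmetic_mean(readings)` would give the same result as dividing the total by the count.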

The mean is a relatively straightforward statistic to generate but does have several limitations which must be considered when using it to summarise a data set:

a) It is heavily influenced by any extreme/outlying points within the data set, and when calculated incorporating these points the mean value could be misleading with reference to the rest of the data set.

b) It gives no information as to how the data within the set is spread around this middle point, hence two data sets with similar mean values may represent widely differing distributions of data.

Means of two geographical data sets can be compared to show whether or not they are actually different, for example size of river sediment collected in two parts of the channel.

Mode - The second measure of central tendency to be considered is the mode. The mode of a set of data, referred to as the modal value, is the most frequently occurring value within the data set.

Using the data within Figure 1 there are six individual days when the temperature was 21°C, which is more than for any other temperature within this month. We therefore say that the modal temperature was 21°C.

Points to consider in relation to the mode

a) The mode of any given data set is not always a single value. For instance, if a data set has two values that occur the same number of times, and more often than any other value, then it would be termed bimodal.

b) If the data are recorded within numerically defined categories the modal category may be obvious but the exact value of the mode becomes more complicated to calculate.

c) The mode has little value in relation to the more complicated statistical tests discussed later.
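As a rough illustration of the points above, the sketch below counts how often each value occurs and reports every modal value. The temperature lists are made-up examples (they are not the Figure 1 data), and `multimode` from Python's standard library is just one convenient way to handle the bimodal case.

```python
from collections import Counter
from statistics import multimode

# Hypothetical temperatures in °C, invented for illustration only.
temperatures = [19, 21, 20, 21, 22, 21, 19, 20]

# Frequency of each value, most frequent first.
counts = Counter(temperatures)
print(counts.most_common())

# multimode returns a list, so a single mode and a bimodal set look alike.
modes = multimode(temperatures)
print(modes)

# A hypothetical bimodal set: two values tie for the highest frequency.
bimodal = [18, 18, 20, 22, 22]
bimodal_modes = multimode(bimodal)
print(bimodal_modes)
```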

Median - The median or mid-point value (m) is the numerical value within the data set at which half of the data are above it and half are below it. It is again relatively simple to calculate the median:

1) First put the data into arithmetic order (either ascending or descending) based on the data values themselves or ranks previously assigned to the data.

2) Count the number of items of data (termed n).

3) Finally put the numerical value of n into the equation in Equation Box 2, (n + 1) ÷ 2, to find the location of the median within the data set.

If the number of pieces of data in the set (n) is odd then the median position will be an integer (whole number) and it is simply a case of finding the piece of data this refers to by counting through the ordered data set from either end.

If the number of pieces of data is even then the median position will lie between two pieces of data within the set. To establish the median, simply add the data values on either side of the median point together and average them. The averaged value will be the median of the data set.

We must now consider whether there is an odd or even number of items of data in this set. Calculate the median value utilising the equation in Box 2:

1) Add 1 to the total number of pieces of data in the set (30 + 1)

2) Divide the resultant number (31) by 2. 31/2 = 15.5

3) This indicates where the median value lies (the 15.5th value). Since we only have a 15th and a 16th value we must average these in order to find the median value. So (21 + 21)/2 = 21. The median temperature for June is therefore 21°C.
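The steps above can be sketched in Python as follows. The function name `median_by_position` is hypothetical, and the position formula (n + 1) ÷ 2 is taken from the worked example; any data passed in would be your own.

```python
def median_by_position(values):
    """Find the median by locating the (n + 1) / 2 position in the ordered data."""
    if not values:
        raise ValueError("median of an empty data set is undefined")

    ordered = sorted(values)   # step 1: put the data in order
    n = len(ordered)           # step 2: count the items
    position = (n + 1) / 2     # step 3: 1-based location of the median

    if position.is_integer():
        # Odd n: the position points at a single item.
        return ordered[int(position) - 1]

    # Even n: the position falls between two items, so average them.
    below = ordered[int(position) - 1]  # e.g. the 15th value when n = 30
    above = ordered[int(position)]      # e.g. the 16th value when n = 30
    return (below + above) / 2
```

For 30 readings the position is 15.5, so the function averages the 15th and 16th ordered values, exactly as in the worked example.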

Points to consider when using the median value

a) It gives no information as to how the data within the set is spread around its median value, hence two data sets with similar medians may have a widely differing distribution of data.

b) With a large number of observations it can be a tedious statistic to calculate, especially when the data are being manipulated by hand rather than by machine; programs such as Microsoft Excel can calculate it with the click of a button.

As explained previously, there are inherent limitations related to the use of the mean, median and mode as measures of central tendency, not least the fact that they are single values being used to describe what can be large sets of observations. The mean or average has been suggested as the most statistically significant of these statistics; however, it also has inherent flaws, as discussed above.

**Measures of Dispersion**- The terms deviation, dispersion and variability used in this context all refer to analysing a set of data in terms of its spread around the mean or median value.

The following GeoFile - Measures of Central Tendency and Variation contains information and tasks to be completed - to download the file click here.

**The Range**- This is the simplest measure of variation and it is the most obvious way to describe the scattering of the data within a set. The range defines the region within which all the data values lie. In order to calculate the range the lowest value from the data set is subtracted from the highest to provide a single numerical value, again describing the data set. There are a number of points to be considered when using the range to describe a data set:

1) It is only calculated using two pieces of data from the entire data set.

2) It gives no indication of the spread of data in the remaining data set within the two extremes used in the calculation of the range.

3) Whenever an outlier/anomalous result is present and represents the highest or lowest value, the range will use this figure and, as a result, a misleading impression of the true limits/spread of the data set will be given.
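A small sketch of the third point: adding a single extreme reading changes the range dramatically even though the rest of the data are unchanged. The readings are hypothetical and `data_range` is an illustrative helper name.

```python
def data_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)


# Hypothetical readings, invented for illustration only.
readings = [18, 19, 20, 21, 21, 22]
with_outlier = readings + [35]  # one anomalous high value added

print(data_range(readings))      # 4
print(data_range(with_outlier))  # 17
```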

It would be easy to simply dismiss the range as a statistical measure of dispersion, having highlighted its flaws, but there are ways of modifying the concept of the range to provide more statistically valuable information relating to the data set. The first of these is the interquartile range.

**The Interquartile Range**- The interquartile range is a statistical value that describes where the middle 50% of the data within any given set lies. It takes into account the median value but in addition gives an indication of how the data within the set are spread out around it. It uses a calculation similar to that used in finding the range. Using this technique, the possibility of outliers having a significant impact upon the measure of spread is reduced. The interquartile range is found as follows:

1) First find the median value of the data set as explained, utilising the equation in Box 2. This represents the point halfway or 50% through the data set.

2) Next find the 25% and 75% quartiles; these are the points which represent the outer limits of the middle half of the data set. To calculate these, count the number of individual pieces of data on either side of the median. Take this value to be “n” and then calculate the median of each half of the data set using the same process as described in Box 2. The value in the upper half of the data set is described as the upper quartile and the value in the lower half as the lower quartile.

3) The numerical difference between the upper and lower quartiles is referred to as the interquartile range. It has one key benefit over the range as calculated earlier: the possibility of outliers giving a misleading impression of the spread of the data (as occurs with the range) is minimised.

The interquartile range is worked out as the upper quartile minus the lower quartile - use this formula to check if the interquartile range is correct for the data shown in Figure 3.
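The quartile method described above (the median of each half of the ordered data) can be sketched as below. The function names are hypothetical, and the choice to leave the median itself out of both halves when n is odd is an assumption based on the wording "on either side of the median"; other quartile conventions exist and can give slightly different answers.

```python
from statistics import median


def quartiles(values):
    """Return (lower quartile, median, upper quartile) using median-of-halves.

    Needs at least two values so that each half contains some data.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2

    lower_half = ordered[:half]
    # For odd n the middle item is the median and belongs to neither half.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]

    return median(lower_half), median(ordered), median(upper_half)


def interquartile_range(values):
    """Interquartile range = upper quartile minus lower quartile."""
    lower_q, _, upper_q = quartiles(values)
    return upper_q - lower_q
```

For example, with the hypothetical data 1 to 8 the halves are 1-4 and 5-8, giving quartiles of 2.5 and 6.5.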

Points to consider in relation to the interquartile range

a) It can be a laborious process to calculate the location of the quartiles, especially when there is a large number of data within the set.

b) The interquartile statistic, in a similar way to the range, does not give any further indication of how the entire set of data is distributed, just the limits of the middle 50% of the data.

c) Not all values are considered and hence a false impression may be given of the data set being analysed.

Despite these drawbacks, the interquartile range is frequently used in association with graphical techniques to represent data. One such technique is the box and whisker plot (shown in Figure 3) which, once the quartiles are found, is an effective method of displaying information pertaining to the central tendency of a data set.

The box and whisker plot not only shows the interquartile range and median but also the range, and hence is a valuable method of data presentation allowing for simple interpretation of what can be a large data set. The interquartile range is most useful when comparing two or more data sets which appear to have similar means, medians or ranges. It effectively indicates how dispersed the data are around the median; however, without the box and whisker plot there remains no indication of the spread of the data above or below the median value.

**Standard Deviation**- You might have two sets of data that produce the same mean but have a very different range of values within them. You could then use the interquartile range to take out the extreme values and give you a clearer idea of the spread of the data. Standard deviation is one additional statistical tool that produces a figure indicating the extent to which data are clustered around the mean. It represents the average difference of the data above and below the mean value of the data set. In order to apply the standard deviation to describe a given data set you must first make the assumption that the data within the set are normally distributed. This means:

1) Most values are close to the average.

2) There are only a small number of very high and very low values.

3) There are equal numbers of values above and below the mean.

These assumptions, when considered graphically, can be represented by the following curve known as the normal distribution curve (figures below). In order to calculate the standard deviation value (σ) of a data set it is necessary to use a formula. This formula, commonly known as the standard deviation equation, is displayed in equation box 3 below.

The Google Slide 'Calculating - Standard Deviation' also shows you how to work out the standard deviation.
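The equation box itself is not reproduced here, so the sketch below assumes the population form of the standard deviation, σ = √(∑(x − x̄)² ÷ n), which is the form commonly taught at this level; if your equation box divides by n − 1 instead, adjust the last step accordingly. The data values and the name `standard_deviation` are hypothetical.

```python
from math import sqrt
from statistics import pstdev


def standard_deviation(values):
    """Population standard deviation, worked through step by step.

    Assumes division by n (not n - 1); check this against your equation box.
    """
    n = len(values)
    mean = sum(values) / n                                   # find the mean
    squared_deviations = [(x - mean) ** 2 for x in values]   # (x - mean) squared
    return sqrt(sum(squared_deviations) / n)                 # root of their average


# Hypothetical measurements, invented for illustration only.
sizes = [2, 4, 4, 4, 5, 5, 7, 9]
sigma = standard_deviation(sizes)
print(sigma)

# The standard library's pstdev uses the same population formula.
print(pstdev(sizes))
```

Laying the calculation out in these three steps mirrors the columns you would fill in by hand: each value, its difference from the mean, and that difference squared.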

Mathematically speaking, the standard deviation is the most statistically sound technique for describing the dispersion of a data set around its mean. This is for a number of reasons:

1. In its calculation it includes all the data values within the set and hence there is no selective bias in choosing which pieces of data to use or ignore as in other methods.

2. It is capable of showing the variation between two data sets even if the mean values of the data sets, interquartile range or range are similar.

3. Through the application of confidence intervals it is possible to state the likelihood that future measurements taken in line with the data set will fall within a designated range.

Standard deviation is useful to geographers in the analysis of data collected from physical measurements. Comparing rainfall figures for one location at one time of year (e.g. for January) over a period of 10 or more years would be one example. Looking for variation in samples of river water from one site tested for a particular pollutant concentration, for oxygen levels, or for numbers of a species such as water fleas over a number of years, would be another.

In addition to summarising the distribution of the data within a set around its mean value the standard deviation value can be of further use. By knowing the number of items of data within the set, the standard deviation can be compared to tables of significance to give an indication of the likelihood of the result being due to chance. These tables can also be used to infer the probability that future data collected under similar conditions will lie within one or more standard deviations of the mean value.

**Normal Distribution**- To explain what standard deviation is and what it is used for, it is first important to understand something called the ‘normal distribution curve’. This is a pattern that mathematicians noticed in data that are collected.

Mathematically, under a normal distribution curve, 68.3 percent of all observations fall within plus or minus one standard deviation of the middle of the curve; 95.4 percent fall within two standard deviations and 99.7 percent fall within three standard deviations. The key point is that the larger the sample size, the more closely the results can be expected to follow these proportions.
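The proportions quoted for one, two and three standard deviations can be checked with Python's standard library. This is a sketch using the theoretical normal curve (mean 0, standard deviation 1) rather than any collected data; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

# The standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"Within ±{k} standard deviation(s): {share_within(k) * 100:.1f}%")
```

Real data sets will only approximate these figures, and the match is usually closer for larger samples.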

When sample sizes are at least 100, if the results are quantified and displayed on a graph, the results will tend to approximate what is called the "normal curve" of distribution (see diagram). That is, the majority of people will give you an "average" response, a smaller number will give you a "below average" or an "above average" response, and a very small number will give you an "exceptionally below average" or an "exceptionally above average" response. This distribution is also known as a bell curve.

The larger your sample size is, the more likely it is that your results will reflect the normal distribution curve. The steeper the curve, the more clustered the data are around the mean and vice versa. Sometimes there can be a clear skew in the data:

This may be when you are researching something controversial which people have strong views about. For example, if you did a survey on the statement ‘The BNP should be the next leaders of the UK’, you would expect the data to be skewed towards the negative responses. In a negative skew the mode lies to the right of the mean and vice versa for the positive skew. The greater the difference between the mean and mode, the greater the skew is likely to be.

**Over to You: IQR and Standard Deviation**- Complete the task below to develop your understanding of how measures of central tendency can be useful in a geographical context.

To download a Word file that contains tasks and templates that will help you to complete the calculations for the 'interquartile range' and 'standard deviation' tasks shown below click here.

The Google Sheet 'Pebbles - Standard Deviation' (see below) shows you step by step how to calculate the standard deviation for the task above - to open this file click here.

**Over to You: Portsmouth Standard Deviation**- Complete the tasks below using the resources provided, which are also contained within the 'Portsmouth - Standard Deviation' Google Slide.