Descriptive Statistics

Descriptive statistics provide a numerical summary of data by presenting a few numbers in place of many. For quantitative data (e.g. height or age) the most common summaries are a measure of central value, or average (the median or the mean), together with an indication of the spread or distribution of the data (percentiles or the standard deviation). SPSS will be used to illustrate these summary statistics using the file height.sav.

The Median

The median of a set of data is the central observation once the values have been ranked in order of size. In other words, half of the data will have a value less than the median and the other half a value greater than the median. The file height.sav contains height data from two classes, and we shall find the median height. SPSS readily gives median values, but to make the definition clearer we will work it out step by step. Before any calculations are made it is worth examining a histogram of height (see summary statistics) to get an idea of what the data look like.

To calculate the median we first need to rank the data by height. Choose Transform, Rank Cases, place height in the Variable(s) box, then OK (Figure 13). A new variable called rheight will be created, containing the ranks of height from 1 (the shortest) to 80 (the tallest). The centre of the 80 ranks lies half way between the 40th and 41st ranks (thus 40 cases, or half the data, lie either side of the central point).

In SPSS choose Data, Sort Cases, place rheight in the Sort by box, then OK. The data will now be displayed in order of rank, making it easier to find the 40th (159.59) and 41st (159.68) ranked values of height. The median lies half way between 159.59 and 159.68, i.e. 159.635.

There are several straightforward ways of obtaining the median in SPSS. For example, from the main menu choose Statistics, Summarize, Frequencies, place height in the Variable(s) box, choose Statistics, then click on the Median option (Figure 14). In the frequency table the Cum Percent column also indicates where the median lies, i.e. between the Cum Percent values 50.0 and 51.3.
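The ranking procedure can also be sketched outside SPSS. The Python below is a minimal illustration only, not part of the SPSS workflow; the function name median_by_rank is hypothetical. The only figures taken from the text are the 40th and 41st ranked heights.

```python
def median_by_rank(values):
    """Return the median by sorting the values and taking the central observation."""
    ordered = sorted(values)          # rank the data by size
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]        # odd count: a single central observation
    # even count: half way between the two central observations
    return (ordered[middle - 1] + ordered[middle]) / 2


# With 80 cases the centre lies between the 40th and 41st ranked values.
# Only those two values are quoted in the text, so they are used directly.
height_40th, height_41st = 159.59, 159.68
print(round(median_by_rank([height_40th, height_41st]), 3))
```

For an even number of cases the function averages the two central values, which is the "half way between" rule used above.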


Figure 13

Quantiles

The median divides the data into two parts. If we now divide the lower half of the data into two equal parts, the dividing point is called the first quartile. If the top half is divided similarly, the dividing point is called the third quartile. We have then divided the data into four parts. This process may be continued into eighths, or indeed any fraction that is desired. We shall find the quartiles for the height data in SPSS. Follow the procedure used for obtaining a median, but this time choose the Quartiles option too (Figure 14). A percentile is a dividing point that divides the data into hundredths; thus from the SPSS Output window we have

First Quartile or 25th Percentile = 155.470

Second Quartile or 50th Percentile = 159.634 (median)

Third Quartile or 75th Percentile = 163.801


Figure 14

The most commonly used percentiles are the 25th and 75th (Quartiles) or the 5th and 95th. Both can be used to help summarize the spread or distribution of the data. The 5th and 95th percentiles may be obtained by choosing the Percentiles option in the Frequencies: Statistics window (Figure 14), then enter 5 in the box, Add, enter 95 in the box, Add.

For example the height data was found to have a median value of 159.6cm with 5th & 95th Percentiles at 145.2cm and 175.7cm. The figures have been rounded to 1 decimal place for ease of reading.
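To make the idea of a percentile concrete, here is a minimal Python sketch. It assumes one common convention, placing the p-th percentile at rank (n + 1) * p / 100 and interpolating linearly between neighbouring ranked values. SPSS may use a different interpolation rule, so small differences from its output are possible. The function name and the sample data are hypothetical.

```python
def percentile(values, p):
    """Percentile p (0 < p < 100) at rank (n + 1) * p / 100, linearly interpolated.

    This is one common convention (an assumption here); other conventions
    give slightly different answers on small data sets.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("percentile of an empty data set is undefined")
    position = (n + 1) * p / 100      # 1-based rank, may be fractional
    if position <= 1:
        return ordered[0]
    if position >= n:
        return ordered[-1]
    lower = int(position)             # whole rank just below the position
    fraction = position - lower
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


# Hypothetical data, not taken from height.sav.
sample = [150, 152, 155, 157, 159, 160, 162, 165, 170]
print(percentile(sample, 25), percentile(sample, 50), percentile(sample, 75))
```

Under this convention, with 80 cases the 50th percentile falls at rank 40.5, half way between the 40th and 41st ranked values, which agrees with the definition of the median.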

The data has been collected from two different classes, Class A & Class B. It may be worth looking to see if there are any differences between these two groups. That is we would like to perform an analysis on Class A & Class B separately.

Figure 15 shows data from the file height.sav; the mean and standard deviation shown there should be ignored for now. It can be seen that although Classes A & B have a similar central point for height, the distribution of heights in each class is different.


Figure 15
 

To replicate the charts in Figure 15, choose Data, Split File, Repeat analysis for each group, place Class in the Groups based on box, then OK. A Split File On caption will appear in the bottom right hand corner of the screen. Produce a histogram of height to get the two charts. You will have to alter the scaling on the x and y axes to replicate Figure 15 exactly. You may now obtain the median and the 5th & 95th percentiles for both Classes A & B, to see how these differ.

You should obtain the results shown in Table 1:

           Median   5th percentile   95th percentile
Class A    159.8    142.9            181.8
Class B    159.6    152.7            166.8

Table 1

Now we have a numeric way of describing how the spread of the data in Class A is more than for Class B.
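Split File repeats the same analysis once for each group. A rough Python equivalent is sketched below: it groups the cases by class and computes the median of each group. The records are hypothetical stand-ins for the contents of height.sav, and the function name is illustrative.

```python
from collections import defaultdict
from statistics import median


def median_by_group(records):
    """Group (label, value) pairs by label and return the median of each group."""
    groups = defaultdict(list)
    for label, value in records:
        groups[label].append(value)
    return {label: median(values) for label, values in groups.items()}


# Hypothetical records; the real values are in height.sav.
records = [("A", 142.0), ("A", 160.0), ("A", 181.0),
           ("B", 153.0), ("B", 159.0), ("B", 166.0)]
print(median_by_group(records))
```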

You should repeat this exercise for the time variable. The histograms will reveal that the data is not as symmetrical as for the height variable.

You should obtain:

           Median   5th percentile   95th percentile
Class A    1.06     0.02             6.89
Class B    2.45     0.34             8.50

Table 2

In Table 2 it can be seen that the 5th & 95th percentiles do not lie roughly equidistant from their respective median values as they did in Table 1.
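The comparison can be made explicit by measuring how far each percentile lies from its median. The short Python sketch below does this with the figures copied from Table 1 and Table 2; the function name is hypothetical.

```python
def distances_from_median(median, p5, p95):
    """Return (median - 5th percentile, 95th percentile - median), rounded to 2 places."""
    return round(median - p5, 2), round(p95 - median, 2)


# Values copied from Table 1 (height) and Table 2 (time).
summaries = {
    "height, Class A": (159.8, 142.9, 181.8),
    "height, Class B": (159.6, 152.7, 166.8),
    "time, Class A": (1.06, 0.02, 6.89),
    "time, Class B": (2.45, 0.34, 8.50),
}
for label, (med, p5, p95) in summaries.items():
    below, above = distances_from_median(med, p5, p95)
    print(f"{label}: {below} below the median, {above} above")
```

For time the distance above the median is several times the distance below it, whereas for height the two distances are of similar size. This is the asymmetry that the histograms show.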

You must go back to Data, Split File, Analyze all cases, OK to remove the split file.

The Mean

Earlier, the mean Birthweight was given as 3263.2g. From the histogram, it can be seen that this value is in the centre of the distribution, with roughly half the data to the right of this point and the other half to the left (a symmetrical distribution). Thus the mean may be used as an indication of a central value of the data.

For a population of size n with each measurement denoted Xi, where i = 1, 2, ..., n:

The Mean, $\mu = \frac{1}{n}\sum_{i=1}^{n} X_i$

or the sum of all the values in the population divided by the number of values in the population.
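As a quick illustration of this definition, the following Python sketch adds up the values and divides by how many there are. The sample values are hypothetical and are not taken from height.sav.

```python
def mean(values):
    """Sum of all the values divided by the number of values."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


# Hypothetical heights in cm.
sample = [150.0, 155.0, 160.0, 165.0, 170.0]
print(mean(sample))
```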

Using the file c:\spsswin\height, the overall mean height of both classes may be obtained from the main Statistics option: choose Summarize, Explore, then place height in the Dependent List box. For the time being, to prevent unnecessary charts being produced, click on the Statistics option in the Display box, then OK (Figure 16). To obtain the mean for each class, repeat the above procedure but put class in the Factor box. (The overall mean may also be obtained by choosing the Statistics option when obtaining a Frequency table, or by choosing Descriptives instead of Explore.)

You should find the overall mean is 159.676

For Class A the mean is 159.559

For Class B the mean is 159.792.


Figure 16

The Standard Deviation

When the mean is used as an indication of the central point of some data, the standard deviation should be used as an indication of the spread of the data. Figure 15 shows data from the file height.sav. Here it can be seen that although Classes A & B have a similar mean height, the distribution of heights in each class is different: in Class B the heights are clustered closer around the mean value than in Class A. The standard deviation gives an average value of how far the cases differ from the mean. Thus the standard deviation for Class B is less than for Class A.

In precise terms the standard deviation is the square root of the average of the squared differences between the data values and the mean.

A deviation, d(Xi), is the distance between a population unit Xi and the population mean $\mu$,

i.e. $d(X_i) = X_i - \mu$

Thus in mathematical notation, for a population of size N with mean $\mu$, the standard deviation (SD) is given as:

$SD = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N - 1}}$

This may be made clearer by using SPSS to go through the steps of the calculation. We shall calculate the standard deviation for Class A:

To select just those cases from Class A for analysis, choose the Data option, then Select Cases, If condition is satisfied, If... . Place Class in the right hand box, then click on = and 1 from the grey boxes to give Figure 17. Then choose Continue, OK. A Filter On caption will appear at the bottom of the screen.

Note: When you want to return to analyzing all cases choose Data, Select Cases, All Cases, OK.


Figure 17

In SPSS d(Xi) may be calculated using the Transform option, followed by the Compute option (see Figure 18). Typing dev in the Target Variable: box creates a new variable, and placing or typing height - 159.559 in the Numeric Expression: box assigns it the values of d(Xi). The values of dev may be examined by returning to the data window.

If we were simply to take the mean of these values we would get 0 for Class A (and likewise for Class B, using its own mean), because the positive and negative deviations cancel. Instead, squaring the d(Xi) values ensures that none of the values is negative. In SPSS, return to the Compute option and compute dev2 = dev**2 (dev squared). The sum of the dev2 column (4419.836) may be obtained in a similar way to the median (Figure 14).

Divide the sum of the dev2 column by N-1 (39) to give a value known as the Variance; you may use the calculator provided with Windows. N-1 is used rather than N because only N-1 of the deviations are independent of each other: the last value can always be calculated from the others, since the dev column must sum to 0. Taking the square root of the variance gives the standard deviation.

Check your calculations by obtaining the standard deviation of height by the method illustrated in Figure 14. 
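The same steps can be followed in a few lines of Python. This is a sketch for checking the arithmetic, not a replacement for the SPSS procedure. The function name and the small sample are hypothetical; the sum of dev2 (4419.836) and N-1 (39) are the Class A figures quoted above.

```python
import math
import statistics


def sample_standard_deviation(values):
    """Standard deviation with the N - 1 divisor, following the steps in the text."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed")
    mean = sum(values) / n
    dev = [x - mean for x in values]      # the dev column: deviations from the mean
    dev2 = [d ** 2 for d in dev]          # the dev2 column: squared deviations
    variance = sum(dev2) / (n - 1)        # divide the sum by N - 1
    return math.sqrt(variance)


# Hypothetical data, used to compare with the standard library's stdev().
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_standard_deviation(sample), statistics.stdev(sample))

# Class A figures quoted in the text: sum of dev2 divided by N - 1.
variance_a = 4419.836 / 39
print(round(variance_a, 3), round(math.sqrt(variance_a), 2))
```

The last line gives the variance and standard deviation for Class A from the quoted sum of squares, which can be compared with the value SPSS reports.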


Figure 18

Use the Statistics, Summarize, Frequencies, Statistics options to obtain the standard deviations for both Class A & B. You should obtain:

           Mean     Standard Deviation
Class A    159.6    10.65
Class B    159.8    4.37

Table 3

If you now multiply the standard deviation by 2 then add and subtract these values from the respective means you will obtain Table 4:

           Mean     Mean - 2SD   Mean + 2SD
Class A    159.6    138.2        181.0
Class B    159.8    151.0        168.6

Table 4

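Now compare Table 4 with Table 1; you will see that the two tables are fairly similar. This is because for 'Normally'** distributed data, 95% of the values lie within approx. 2SD (1.96SD to be precise) of the mean. (Strictly, the 5th and 95th percentiles enclose the central 90% of the data, which for Normal data corresponds to about 1.64SD either side of the mean, so the agreement is only approximate.)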

Now repeat this for the time variable to get:

           Mean     Mean - 2SD   Mean + 2SD
Class A    1.88     -2.32        6.08
Class B    3.20     -1.96        8.36

Table 5

Table 5 implies that approx. 95% of Class A have response times between -2.32 & 6.08. This is clearly not possible, since times cannot be negative. Here the description of the data should be made using the median and percentiles, as in Table 2.
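The arithmetic behind Tables 4 and 5 can be sketched in Python as below. Two assumptions are made and should be treated as such. First, the height figures in Table 4 are consistent with the standard deviations having been rounded to one decimal place (10.7 and 4.4) before doubling; this is an inference from the numbers, not something stated in the text. Second, the standard deviation of 2.10 for Class A times is back-calculated from Table 5 as (6.08 - 1.88) / 2. The function name is hypothetical.

```python
def two_sd_range(mean, sd, decimals=1):
    """Return (mean - 2 SD, mean + 2 SD), rounded for display."""
    return round(mean - 2 * sd, decimals), round(mean + 2 * sd, decimals)


# Height: means from Table 3, SDs rounded to one decimal place (an inference).
print(two_sd_range(159.6, 10.7))   # Class A
print(two_sd_range(159.8, 4.4))    # Class B

# Time, Class A: SD of 2.10 is back-calculated from Table 5, not quoted in the text.
low, high = two_sd_range(1.88, 2.10, decimals=2)
if low < 0:
    print(f"lower limit {low} is negative, so mean and SD describe these times poorly")
```

A negative lower limit for a quantity that cannot be negative is a simple warning sign that the data are skewed and that the median and percentiles are the better summary.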

For reasons that will become clearer later, only 'Normally' distributed data should be described using the mean & standard deviation; otherwise the median & percentiles should be used.

** A full description of the Normal distribution will be given later. However it may already be concluded that Normally distributed data is symmetrical in shape.

