Statistical methods provide a numerical summary of data by presenting a few numbers in place of many. For quantitative data (e.g. height or age) the most common summaries are a measure of central value, or average (the median or mean), together with an indication of the spread or distribution of the data (percentiles or standard deviation). SPSS will be used to illustrate these summary statistics using the file height.sav.
First Quartile or 25th Percentile = 155.470
Second Quartile or 50th Percentile = 159.634 (median)
Third Quartile or 75th Percentile = 163.801
The most commonly used percentiles are the 25th and 75th (the quartiles) or the 5th and 95th. Either pair can be used to help summarize the spread or distribution of the data. The 5th and 95th percentiles may be obtained by choosing the Percentiles option in the Frequencies: Statistics window (Figure 14): enter 5 in the box and click Add, then enter 95 in the box and click Add.
For example, the height data was found to have a median value of 159.6 cm, with 5th & 95th percentiles at 145.2 cm and 175.7 cm. The figures have been rounded to 1 decimal place for ease of reading.
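The percentile idea above can be sketched in a few lines of Python. This is only an illustration: the heights are hypothetical (not taken from height.sav), the function name `percentile` is made up for this sketch, and the rank rule used, (n + 1) × p / 100 with linear interpolation, is one common convention. Other packages may use slightly different rules and give slightly different answers.

```python
def percentile(values, p):
    """Return the p-th percentile (0 < p < 100) of values.

    Uses the (n + 1) * p / 100 rank rule with linear interpolation
    between neighbouring sorted values (an assumed convention).
    """
    data = sorted(values)
    n = len(data)
    rank = (n + 1) * p / 100          # 1-based, possibly fractional rank
    if rank <= 1:
        return data[0]                # below the first value: use the minimum
    if rank >= n:
        return data[-1]               # beyond the last value: use the maximum
    lower = int(rank)                 # whole part of the rank
    frac = rank - lower               # fractional part of the rank
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])


# Hypothetical heights in cm, for illustration only.
heights = [150.0, 152.5, 155.0, 157.5, 160.0, 162.5, 165.0, 167.5, 170.0]

q1 = percentile(heights, 25)       # first quartile
median = percentile(heights, 50)   # second quartile
q3 = percentile(heights, 75)       # third quartile
print(q1, median, q3)
```

With so few values the 5th and 95th percentiles simply fall back to the minimum and maximum, which is why larger samples are needed before those percentiles become informative.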
The data has been collected from two different classes, Class A & Class B. It may be worth looking to see if there are any differences between these two groups. That is, we would like to perform the analysis on Class A & Class B separately.
Figure 15 shows data from the file height.sav; the mean and standard deviation shown there should be ignored for now. It can be seen that although Classes A & B have a similar central point for height, the distribution of heights in each class is different.
To replicate the charts in Figure 15, choose Data, Split File, Repeat analysis for each group, place Class in the Groups based on box, then OK. A Split File On caption will appear in the bottom right hand corner of the screen. Produce a histogram of height to get the two charts. You will have to alter the scaling on the x and y axes to replicate Figure 15 exactly. You may now obtain the median and 5th & 95th percentiles for both Classes A & B, to see how these differ.
You should obtain the results shown in Table 1:
[Table 1: median and 5th & 95th percentiles of height for Class A and Class B]
We now have a numerical way of describing how the spread of the data in Class A is greater than in Class B.
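The Split File step, repeating one analysis for each group, can be sketched outside SPSS as follows. The records are hypothetical and far fewer than in height.sav; the variable names are chosen for this sketch only.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (class, height in cm) records; not taken from height.sav.
records = [
    ("A", 148.0), ("A", 159.0), ("A", 171.0),
    ("B", 156.0), ("B", 160.0), ("B", 163.0),
]

# Equivalent of Split File: collect the heights for each class separately.
heights_by_class = defaultdict(list)
for class_label, height in records:
    heights_by_class[class_label].append(height)

# Repeat the same summary for each group.
medians = {c: median(h) for c, h in heights_by_class.items()}
ranges = {c: max(h) - min(h) for c, h in heights_by_class.items()}

for c in sorted(heights_by_class):
    print(c, "median:", medians[c], "range:", ranges[c])
```

In this made-up data the two classes have similar medians but Class A is more spread out, which mirrors the pattern described for Figure 15.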
You should repeat this exercise for the time variable. The histograms will reveal that the data is not as symmetrical as for the height variable.
[Table 2: median and 5th & 95th percentiles of time for Class A and Class B]
In Table 2 it can be seen that the 5th & 95th percentiles do not lie roughly equidistant from their respective median values as they did in Table 1.
You must go back to Data, Split File, Analyze all cases, OK to remove the split file.
For a population of size n with each measurement denoted $X_i$, where i = 1, 2, ..., n, the mean is

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
Using the file c:\spsswin\height, the overall mean height of both classes may be obtained from the main Statistics option by choosing Summarize, then Explore, and placing height in the Dependent List box. For the time being, to prevent unnecessary charts being produced, click on the Statistics option in the Display box, then OK (Figure 16). To obtain the mean for each class, repeat the above procedure but put class in the Factor box. (The overall mean may also be obtained by choosing the Statistics option when obtaining a Frequency table, or by choosing Descriptives instead of Explore.)
You should find that the overall mean is 159.676.
For Class A the mean is 159.559
For Class B the mean is 159.792.
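As a rough illustration of how the overall mean relates to the class means, the sketch below uses hypothetical heights (not the values in height.sav) and a small helper named `mean` written for this example. It shows that the overall mean equals the class means weighted by class size.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


# Hypothetical heights in cm for two classes of different sizes.
class_a = [150.0, 160.0, 170.0]
class_b = [158.0, 160.0, 162.0, 164.0]

mean_a = mean(class_a)
mean_b = mean(class_b)

# Overall mean computed directly from all the values pooled together.
overall = mean(class_a + class_b)

# The same figure recovered from the class means, weighted by class size.
weighted = (len(class_a) * mean_a + len(class_b) * mean_b) / (
    len(class_a) + len(class_b)
)

print(mean_a, mean_b, overall, weighted)
```

When the classes are the same size the weighting drops out, and the overall mean is simply the average of the two class means.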
In precise terms the standard deviation is the square root of the average of the squared differences between the data values and the mean.
A deviation, $d(X_i)$, is the distance between a population unit $X_i$ and the population mean $\bar{X}$, i.e.

$$d(X_i) = X_i - \bar{X}$$

Thus in mathematical notation, for a population of size N with mean $\bar{X}$, the standard deviation (SD) is given as:

$$SD = \sqrt{\frac{\sum_i d(X_i)^2}{N-1}} \quad\text{or}\quad SD = \sqrt{\frac{\sum_i (X_i - \bar{X})^2}{N-1}}$$

for i = 1, 2, ..., N. The divisor N-1, rather than N, is the one used in the SPSS calculation described below.

To select for analysis just those cases from Class A, choose the Data option, then Select Cases, If condition is satisfied, If... . Place Class in the right hand box, then click on = and 1 from the grey boxes to give Figure 17. Then click Continue, OK. A Filter On caption will appear at the bottom of the screen.
Note: When you want to return to analyzing all cases choose Data, Select Cases, All Cases, OK.
In SPSS d(Xi) may be calculated using the Transform option, followed by the Compute option (see Figure 18). Typing dev in the Target Variable: box creates a new variable, and placing or typing height - 159.559 in the Numeric Expression: box assigns it the values of d(Xi) for Class A. The values of dev may be examined by returning to the data window. If we simply took the mean of these values we would get 0, because the positive and negative deviations about a class's own mean cancel out (the same holds for Class B about its mean). Instead, squaring the d(Xi) values ensures that all the deviation values are positive. In SPSS, return to the Compute option and compute dev2 = dev**2 (dev squared). The sum of the dev2 column (4419.836) may be obtained in a similar way to the median (Figure 14). Divide the sum of the dev2 column by N-1 (39), using for example the Calculator provided with Windows, to give a value known as the variance. N-1 is used rather than N because only N-1 of the deviations are independent of each other: the last value can always be calculated from the others, because the dev column must sum to 0. Taking the square root of the variance gives the standard deviation.
Check your calculations by obtaining the standard deviation of height by the method illustrated in Figure 14.
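The same steps (dev, dev2, sum, divide by N-1, square root) can be sketched in Python. The five heights are hypothetical and stand in for the Class A column; the last two lines simply repeat the calculator arithmetic using the sum of squares (4419.836) and N-1 (39) quoted above. The variable names are chosen for this sketch only.

```python
from math import sqrt
from statistics import stdev

# Hypothetical heights in cm, standing in for the Class A column.
heights = [150.0, 155.0, 160.0, 165.0, 170.0]
n = len(heights)
mean_height = sum(heights) / n

# dev: deviation of each value from the mean (these always sum to 0).
dev = [x - mean_height for x in heights]

# dev2: squared deviations, so every term is non-negative.
dev2 = [d ** 2 for d in dev]

# Variance: sum of squared deviations divided by N - 1.
variance = sum(dev2) / (n - 1)

# Standard deviation: square root of the variance.
sd = sqrt(variance)
print(sum(dev), variance, sd, stdev(heights))

# The calculator step from the text, using the figures quoted there.
class_a_variance = 4419.836 / 39
class_a_sd = sqrt(class_a_variance)
print(round(class_a_variance, 3), round(class_a_sd, 3))
```

`statistics.stdev` also divides by N-1, so it serves as a check on the hand calculation in the same way that the Frequencies output does in SPSS.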
Use the Statistics, Summarize, Frequencies, Statistics options to obtain the standard deviations for both Class A & B. You should obtain:
[Table 3: standard deviation of height for Class A and Class B]
If you now multiply each standard deviation by 2, then add this value to and subtract it from the respective mean, you will obtain Table 4:
[Table 4: mean - 2SD and mean + 2SD of height for Class A and Class B]
Now compare Table 4 with Table 1: you will see that the two tables are fairly similar. This is because for 'Normally'** distributed data, approximately 95% of the values lie within 2 SD (1.96 SD to be precise) of the mean.
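The 95% figure can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8); the helper name `coverage` is made up for this example.

```python
from statistics import NormalDist

# Standard Normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage(k):
    """Proportion of a Normal distribution lying within k SD of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_2sd = coverage(2.0)
within_196sd = coverage(1.96)
print(round(within_2sd, 4), round(within_196sd, 4))
```

Because the result depends only on k, the same proportions apply to any Normal distribution, whatever its mean and standard deviation.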
Now repeat this for the time variable to get:
[Table 5: mean - 2SD and mean + 2SD of time for Class A and Class B]
Table 5 implies that approximately 95% of Class A have response times between -2.32 & 6.08. This is clearly not possible, since times cannot be negative. Here the data should be described using the median and percentiles, as in Table 2.
For reasons that will become clearer later, only 'Normally' distributed data should be described using the mean & standard deviation; otherwise the median & percentiles should be used.
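A small sketch of why the mean and standard deviation mislead for skewed data is given below. The response times are hypothetical and deliberately skewed by one large value; they are not the values from the time variable.

```python
from statistics import mean, median, stdev

# Hypothetical response times (all positive, with one long right-hand tail value).
times = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2, 1.5, 2.0, 9.0]

centre = mean(times)
spread = stdev(times)          # divides by N - 1

# The "mean plus or minus 2 SD" interval used for Normally distributed data.
lower = centre - 2 * spread
upper = centre + 2 * spread

print("mean - 2SD:", round(lower, 2), "mean + 2SD:", round(upper, 2))
print("smallest time:", min(times), "median:", median(times))
```

The lower limit comes out negative even though every time is positive, and the mean is pulled well above the median by the single large value. Both are signs that the median and percentiles describe such data better.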
** A full description of the Normal distribution will be given later. However it may already be concluded that Normally distributed data is symmetrical in shape.