Analysis of variance, or ANOVA, is the term given to a method of analyzing data from two or more groups. The earlier example, comparing birthweights between males and females, will again be used to illustrate the ideas behind ANOVA.
An ANOVA starts by calculating the overall variance in the data we are comparing between k groups. Here that is the Birthweight (weight) variable. The variance is the standard deviation squared, and represents the average squared deviation from the mean. In other words, it is the sum of the squared deviations from the mean divided by the number of observations (N) minus one. N-1 is referred to as the degrees of freedom (df), and the sum of the squared deviations is referred to as the sum of squares (SS). In effect, ANOVA attempts to find out how much of the total SS, or overall variation, can be explained by allowing for other variables. It uses the means of the 2 groups to calculate a sum of squares for one group and a sum of squares for the other group. The sum of these two sums of squares cannot exceed the total sum of squares. If it is significantly less than the total, then we conclude that the grouping variable explains a significant amount of the overall variation, and furthermore that there is a significant difference between the groups.
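The quantities described above can be sketched in a few lines of Python. This is only an illustration: the weights are made-up values, not the birthweight data analysed in this text, and the helper name `sum_of_squares` is our own.

```python
from statistics import mean, variance

def sum_of_squares(values):
    """Sum of squared deviations of `values` from their own mean."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values)

# Hypothetical weights in grams for two groups (NOT the data used in the text).
males = [3100.0, 3350.0, 2900.0, 3600.0, 3250.0]
females = [2950.0, 3050.0, 2800.0, 3300.0, 3150.0]
everyone = males + females

n = len(everyone)
total_ss = sum_of_squares(everyone)                          # N - 1 df
within_ss = sum_of_squares(males) + sum_of_squares(females)  # N - k df
between_ss = total_ss - within_ss                            # explained by the grouping

# The total SS is also (N - 1) times the sample variance.
assert abs(total_ss - (n - 1) * variance(everyone)) < 1e-6

print(f"total SS   = {total_ss:.2f}")
print(f"within SS  = {within_ss:.2f}")
print(f"between SS = {between_ss:.2f}")
```

The within groups SS can never exceed the total SS, because each group mean is the value that minimises the squared deviations within its own group.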
A useful mathematical relationship that helps us to calculate these values is:
or (N-1) × Variance of Birthweight
For the male group the SS are calculated from:
For the female group the SS are calculated from:
Thus the within groups SS with N-k (93) df is given as:
So, we may say that the sex of the baby explains 339390.18 of the total variation of 7834806.95, and that a further 7495416.77 of the variation is left unexplained. We now need a way of deciding whether the explained part is significantly large relative to the variation left to be explained, or residual variation. We can form an ANOVA table (Table 10).
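In symbols, the split of the total variation can be written as follows. The notation is ours: x_ij is observation j in group i, x-bar_i is the mean of group i, x-bar is the overall mean and s² is the overall variance. The numbers on the last line are the ones quoted in the text.

```latex
\begin{align*}
SS_{\text{total}}   &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 = (N-1)\,s^2 \\
SS_{\text{within}}  &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 \\
SS_{\text{between}} &= SS_{\text{total}} - SS_{\text{within}}
                     = 7834806.95 - 7495416.77 = 339390.18
\end{align*}
```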
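Table 10. ANOVA table for Birthweight by sex of baby
| Source of variation | SS | df | Mean square (SS/df) | F |
| Between groups (sex) | 339390.18 | 1 | 339390.18 | 4.21 |
| Within groups (residual) | 7495416.77 | 93 | 80595.88 | |
| Total | 7834806.95 | 94 | | |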
Following the columns in Table 10, we arrive at a value called F, which is referred to the F distribution with two degrees of freedom, df1 & df2. df1 refers to the between groups df, and df2 refers to the within groups or residual df. Using the F tables given in Appendix 3, we first go to the df1 column labeled 1, then try to find the df2 row labeled 93, or the nearest alternative. We can see from the 120 row that, at a probability of 0.05, F must be greater than 3.92 for sex to explain a significant amount of the variance. Our F of 4.21 exceeds this, so we conclude that there is a difference in the birthweights of males and females.
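The arithmetic behind the F value can be checked with a short Python sketch. It uses only the SS and df figures quoted in the text and the tabulated critical value of 3.92; the variable names are our own.

```python
# SS and df values quoted in the text for the birthweight example.
between_ss, between_df = 339390.18, 1
within_ss, within_df = 7495416.77, 93

# A mean square is an SS divided by its df.
between_ms = between_ss / between_df
within_ms = within_ss / within_df

# F is the ratio of the between groups to the within groups mean square.
f_ratio = between_ms / within_ms

# Critical value read from the F table (df1 = 1 column, df2 = 120 row, p = 0.05).
critical_f = 3.92

print(f"within groups mean square = {within_ms:.2f}")
print(f"F = {f_ratio:.2f}")
print("significant at 0.05:", f_ratio > critical_f)
```

Using the 120 row for a df2 of 93 is an approximation. An exact p value needs the F distribution itself, which is what SPSS reports.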
The F distribution is related to the t distribution: if we take the square root of the values in the df1 = 1 column of the F tables, we get the values of the t distribution with df2 degrees of freedom. Hence √4.21 = 2.05, the value of t we obtained earlier.
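The link between t and F for two groups can be demonstrated numerically. The sketch below uses made-up data (not the birthweight file) and assumes the t test in question is the equal-variance (pooled) two-sample t test. The helper `ss` is our own.

```python
import math
from statistics import mean

def ss(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values)

# Hypothetical data for two groups (NOT the data used in the text).
group_a = [3100.0, 3350.0, 2900.0, 3600.0, 3250.0]
group_b = [2950.0, 3050.0, 2800.0, 3300.0, 3150.0]
n_a, n_b = len(group_a), len(group_b)

# Pooled variance = within groups SS / (N - 2), i.e. the residual mean square.
pooled_var = (ss(group_a) + ss(group_b)) / (n_a + n_b - 2)

# Pooled two-sample t statistic.
t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# One-way ANOVA F statistic; the between groups df is 1.
between_ss = ss(group_a + group_b) - ss(group_a) - ss(group_b)
f_ratio = (between_ss / 1) / pooled_var

print(f"t squared = {t ** 2:.4f}, F = {f_ratio:.4f}")
print(f"square root of 4.21 = {math.sqrt(4.21):.2f}")
```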
To perform an ANOVA in SPSS, we choose Statistics, ANOVA, General Factorial, place weight in the Dependent Variable: box (see Figure 37), then place babysex in the Factor(s) box. Click on Define Range and type 1 in the Minimum box and 2 in the Maximum box, Continue, OK.
Your output should then be similar to Figure 38. Here you can see that SPSS gives the same ANOVA table as Table 10. SPSS also calculates the exact p value of 0.043, the same value as the t test gave us.
SPSS can also give us the difference between the two means and a confidence interval. Click on the Contrasts box in Figure 37 and then on the Display parameter estimates option, click once on babysex in the Factors box, use the down arrow to change the Contrast from deviation to simple, and click on the First option, Change, Continue, OK (see Figure 39). You will then obtain the additional output shown in Figure 40. Here the contrast you have specified is the difference in the means between each group and the first group. In our case we only have two groups, so this gives us the difference between the mean Birthweight of females and the mean Birthweight of males, together with a 95% CI, and a p value based upon the t test. When you have a number of groups it is possible for the overall group effect not to be significant, but for one of the individual contrasts to be significant. This is because, to some extent, the ANOVA takes account of the number of individual t tests you are performing with all the contrasts that are possible within the group. So you should always ensure that the overall effect of the group is significant before accepting the results of any individual contrasts, which may otherwise be spurious.
A graphical representation of the mean gain by strain and sex can also be examined. Choose Graph, Bar, Clustered, Define, choose the Other summary function option, then place gain in the Variable box, strain in the Category Axis box, and sex in the Define Clusters by box, OK. You should get a chart similar to Figure 42.
Alternatively, you can produce a graph giving the means, together with a bar showing 1 standard error of the mean either side. Whether the bars overlap or not gives an approximate indication of whether a comparison of the respective means would be significant. In SPSS, choose Graph, Error Bar, Clustered, Define, choose the Other summary function option, then place gain in the Variable box, strain in the Category Axis box, and sex in the Define Clusters by box. Now choose Bars Represent, and from the list choose Standard error of mean, then click on the Multiplier box and change the 2 to a 1, OK.
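What such an error bar represents can be sketched in Python. The weight gains and group labels below are invented for illustration and the function name `mean_and_se` is our own. The standard error of the mean is taken as the sample standard deviation divided by the square root of the group size.

```python
import math
from statistics import mean, stdev

def mean_and_se(values):
    """Return (mean, standard error of the mean) for one group."""
    return mean(values), stdev(values) / math.sqrt(len(values))

# Hypothetical weight gains for two groups (NOT the data used in the text).
gains = {
    "group 1": [10.0, 12.0, 9.0, 14.0, 11.0],
    "group 2": [13.0, 15.0, 12.0, 16.0, 14.0],
}

for label, values in gains.items():
    m, se = mean_and_se(values)
    # The bar runs from 1 standard error below the mean to 1 above it.
    print(f"{label}: mean {m:.2f}, bar from {m - se:.2f} to {m + se:.2f}")
```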
We may now perform the ANOVA. In SPSS choose Statistics, ANOVA, General Factorial, place weight in the Dependent Variable: box (see Figure 37), then place strain in the Factor(s) box. Click on Define Range and type 1 in the Minimum box and 3 in the Maximum box. Now place sex in the Factor(s) box. Click on Define Range and type 1 in the Minimum box and 2 in the Maximum box, Continue, OK.
You will get the ANOVA table shown in Figure 44.
SPSS first gives the residual SS, with df = number of strains (3) times number of sexes (2) times one less than the number of observations in each group (4). This SS indicates how much variation is left to explain. It then gives the SS for the main effect of the strain of rat, with df = no. strains - 1, then the SS for the main effect of the sex of rat, with df = no. sexes - 1. Neither of these effects explains a significant amount of the variation in weight gain. SPSS also gives what is called an interaction effect SS (STRAIN BY SEX). This tests whether the effect of strain is the same for each sex, for example whether the difference between the mean of strain a and strain b is the same for males and females. The df for this SS are given by the product of the dfs for the sex and strain SS. Again this effect is not significant.
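The degrees of freedom for a balanced two-factor design can be tallied with a small sketch. The function `two_way_df` is our own illustration. The 3 strains and 2 sexes come from the text, while 5 observations per cell is an assumption, reading the "(4)" above as one less than the group size.

```python
def two_way_df(levels_a, levels_b, n_per_cell):
    """Degrees of freedom for a balanced two-way ANOVA with interaction."""
    df_a = levels_a - 1
    df_b = levels_b - 1
    return {
        "factor A": df_a,
        "factor B": df_b,
        "A by B interaction": df_a * df_b,
        "residual": levels_a * levels_b * (n_per_cell - 1),
        "total": levels_a * levels_b * n_per_cell - 1,
    }

# 3 strains (factor A), 2 sexes (factor B); 5 per cell is ASSUMED, not given.
df = two_way_df(3, 2, 5)
for source, value in df.items():
    print(f"{source}: {value}")
```

In a balanced design the df for the two main effects, the interaction and the residual add up to the total df of N-1, which gives a quick check on an ANOVA table.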
If we were analyzing data from an observational study, rather than from a closely controlled experimental study, we would not usually have a balanced design. That is, there would not be equal numbers in each group. This means that we cannot disentangle the SS of the various effects in such a straightforward manner as before. It is not possible to divide the total SS into independent components, and two different variables may help to explain overlapping components of the total variance. In this case we have to perform a kind of sequential analysis. We first find the apparent SS explained by one variable, and then add another variable to find whether significantly more of the remaining variance can be explained.
We shall use the data from vlbw.sav as an example. To perform an ANOVA now, we follow the same procedure as before, but we must now choose the Model option, then click on Custom and under Build Term(s) change the option to Main effects, and then place cigs and then drink in the Model box, to give Figure 45. You must now change the Sum of squares: option from unique to sequential. You will now obtain an ANOVA table similar to Figure 46.
The variable xweight contains the Birthweight values with the mean Birthweight of the respective cigs group subtracted from each value. The values in the xweight column are called the residual values, having fitted cigs. If we examine this variable, we see that it has a mean of 0 and a variance of 66978.84. If we multiply this variance by the sample size minus 1 (94) we get 6296010.96, the same as the residual SS when just cigs was fitted. It is this new total SS that we are now attempting to explain. If we look at Figure 47, we can see a chart produced in a similar way to Figure 43. This time we are using the xweight variable stratified by drink. The drink row of the ANOVA table shows how much of this new SS drink accounts for, and then gives a new residual SS. Any further variables we wish to fit attempt to explain this latest reduced SS of 5770336.72. The drink variable was found to explain a further significant amount of the variance.
We should now fit the interaction term for cigs and drink to test if the effect of drinking is the same for all three levels of cigarette smoking. In SPSS, we return to the window shown in Figure 45. We can enter the interaction term by changing the Build Term(s) option to Interaction, and then clicking on cigs, drink and then simultaneously adding them to the Model box to give cigs*drink, in addition to the main effect terms of cigs and drink, Continue, OK.
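What the interaction term asks can be illustrated by comparing the effect of drinking within each smoking level. The sketch below uses invented cell data. The level names, the two drink categories and the values are all assumptions, not taken from vlbw.sav.

```python
from statistics import mean

# Hypothetical birthweights in grams, keyed by (cigs level, drink status).
cells = {
    ("none", "no"): [3300.0, 3400.0, 3200.0],
    ("none", "yes"): [3200.0, 3250.0, 3150.0],
    ("light", "no"): [3100.0, 3200.0, 3000.0],
    ("light", "yes"): [2900.0, 3000.0, 2800.0],
    ("heavy", "no"): [3000.0, 2900.0, 3100.0],
    ("heavy", "yes"): [2600.0, 2700.0, 2500.0],
}

# Effect of drinking (drinkers minus non-drinkers) within each cigs level.
drink_effect = {}
for cigs in ("none", "light", "heavy"):
    drink_effect[cigs] = mean(cells[(cigs, "yes")]) - mean(cells[(cigs, "no")])
    print(f"cigs = {cigs}: drink effect = {drink_effect[cigs]:.1f}")
```

With no interaction these differences would be roughly equal across the smoking levels. The interaction SS measures how far they depart from being equal, and the F test judges that departure against the residual variation.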
You should find that the interaction term is significant. You should now obtain the parameter effects and confidence intervals. If you examine Figure 48, a chart showing the mean Birthweight stratified by drinking status and smoking status then you should be able to interpret the parameter estimates.