Chi-Squared
What is it?
$\chi^2$ (chi-squared) is a measure of how far a set of data varies from an expected distribution. It is used as a "Goodness of Fit" test: the larger the value, the worse the fit.
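The AEB Maths syllabus requires the use of $\sum \frac{(O-E)^2}{E}$ as an approximation to $\chi^2$, where O is an observed frequency and E is the corresponding expected frequency.
You are required to do hypothesis tests on contingency tables using $\chi^2$. For instance, we might be given the table below: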
          Chips   Mashed   Boiled
Male         45       22       15
Female       67       44       26
A sample of school children were asked which was their favourite way of eating potatoes.
You are asked to find whether the choice of potato is independent of gender. As with all hypothesis tests, we need to define $H_0$ and $H_1$:
$H_0$: sex and choice of potato are independent
$H_1$: sex and choice of potato are dependent
We find out what the expected values are under the null hypothesis.
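To help us do this, we find the column totals and row totals:
          Chips   Mashed   Boiled   Total
Male         45       22       15      82
Female       67       44       26     137
Total       112       66       41     219
To find the expected number of males who prefer chips, we find $P(\text{male}) \times P(\text{chips}) \times 219 = \frac{82}{219} \times \frac{112}{219} \times 219$.
A simple way to work this out is $\frac{\text{row total} \times \text{column total}}{\text{grand total}} = \frac{82 \times 112}{219} = 41.9$.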
Work out the expected values for all the data:
          Chips   Mashed   Boiled   Total
Male       41.9     24.7     15.4      82
Female     70.1     41.3     25.6     137
Total       112       66       41     219
Note that you can use the row and column totals to help find these figures (e.g. female & chips must be 112 - 41.9 = 70.1).
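The expected-value step can be checked with a few lines of Python. This is only a sketch: the function name `expected_frequencies` and the list-of-lists layout are choices made for this illustration, not part of the syllabus.

```python
# Observed counts from the potato survey
# (rows: Male, Female; columns: Chips, Mashed, Boiled).
observed = [
    [45, 22, 15],
    [67, 44, 26],
]


def expected_frequencies(table):
    """Expected counts under independence:
    row total * column total / grand total for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for row in expected:
    # Round to one decimal place, as in the hand calculation.
    print([round(e, 1) for e in row])
```

Rounded to one decimal place, the printed rows agree with the expected values worked out by hand.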
Next we find the difference between the observed (O) results and the expected results (E). We square the result to get rid of negative results, and then divide by the expected result so that the figure is expressed as a proportion of what was expected.
Hence $\chi^2 \approx \sum \frac{(O-E)^2}{E}$ [in formula book]
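For the data above, we get:
                     O       E   |O-E|   (O-E)^2/E
Male, chips         45    41.9     3.1        0.23
Male, mashed        22    24.7     2.7        0.30
Male, boiled        15    15.4     0.4        0.01
Female, chips       67    70.1     3.1        0.14
Female, mashed      44    41.3     2.7        0.18
Female, boiled      26    25.6     0.4        0.01
Thus our test statistic, $\sum \frac{(O-E)^2}{E}$, comes to 0.87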
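As a check on the arithmetic, the same sum can be computed in Python. This is a sketch with illustrative names (`chi_squared_statistic` is not from the formula book). It uses unrounded expected values, so it gives roughly 0.85 rather than the 0.87 obtained by hand from figures rounded to one decimal place; the conclusion of the test is unaffected.

```python
# Observed counts (rows: Male, Female; columns: Chips, Mashed, Boiled).
observed = [
    [45, 22, 15],
    [67, 44, 26],
]


def chi_squared_statistic(table):
    """Sum of (O - E)^2 / E over every cell of a contingency table,
    with E = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            total += (o - e) ** 2 / e
    return total


statistic = chi_squared_statistic(observed)
print(round(statistic, 2))
```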
The shape of a typical $\chi^2$ distribution is shown below:
The value of Chi-squared is always positive, and the distribution is positively skewed.
Degrees of Freedom
The $\chi^2$ distribution has one parameter, $\nu$ (nu, pronounced "new"). This is a measure of the "degrees of freedom", in other words the number of free choices that you can make when allocating values to the expected frequencies. In this case there are 2, because you would only need 2 figures, for example (male/chips) and (male/mashed), and you would be able to work out the other figures from the row and column totals.
In general, degrees of freedom = (number of rows - 1) × (number of columns - 1)
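The rule is easy to express as a small Python helper. The function name `degrees_of_freedom` is just an illustrative choice, and the table passed in must not include the totals row or column.

```python
def degrees_of_freedom(table):
    """(number of rows - 1) * (number of columns - 1) for a contingency
    table given as a list of rows, excluding any totals."""
    n_rows = len(table)
    n_cols = len(table[0])
    return (n_rows - 1) * (n_cols - 1)


# The potato survey: 2 rows (Male, Female) by 3 columns (Chips, Mashed, Boiled).
potato = [
    [45, 22, 15],
    [67, 44, 26],
]
nu = degrees_of_freedom(potato)
print(nu)
```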
The table of the chi-squared distribution is in the formula book, and shows critical values of $\chi^2$ for varying degrees of freedom.
At the 5% significance level with $\nu = 2$, the critical value is 5.991. Our test statistic of 0.87 is less than this, so we accept the Null Hypothesis and conclude that choice of potato is independent of sex.
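The comparison with the critical value can be sketched in Python without a statistics library. This relies on a special case: with exactly 2 degrees of freedom, the upper tail probability of the chi-squared distribution is exp(-x/2), so the critical value is -2 ln(significance level). The function name is hypothetical, and for any other number of degrees of freedom you would still read the value from the formula book table.

```python
import math


def critical_value_two_df(significance):
    """Critical value of chi-squared for exactly 2 degrees of freedom.

    Valid only for nu = 2, where P(X > x) = exp(-x / 2),
    so x = -2 * ln(significance).
    """
    return -2.0 * math.log(significance)


test_statistic = 0.87  # value from the hand calculation
critical = critical_value_two_df(0.05)

# Reject the null hypothesis only if the statistic exceeds the critical value.
reject_null = test_statistic > critical
print(round(critical, 3), reject_null)
```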
Yates' Continuity Correction
When there is only one degree of freedom, i.e. on a 2 by 2 contingency table, the formula for the approximation to chi-squared is changed slightly to $\sum \frac{(|O-E| - 0.5)^2}{E}$
i.e. subtract a half from each absolute value of O - E before squaring. The question will usually remind you to use Yates' correction when it is necessary, but the altered formula is not in the formula book.
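A minimal Python sketch of the corrected formula is given here. The 2 by 2 table is made-up data chosen purely for illustration, and the function name `yates_statistic` is an assumption of this sketch.

```python
def yates_statistic(table):
    """Sum of (|O - E| - 0.5)^2 / E over a 2 by 2 contingency table."""
    if len(table) != 2 or any(len(row) != 2 for row in table):
        raise ValueError("Yates' correction applies to 2 by 2 tables only")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            # Subtract a half from |O - E| before squaring.
            total += (abs(o - e) - 0.5) ** 2 / e
    return total


# Hypothetical observed counts, not taken from the potato survey.
example = [
    [10, 20],
    [30, 40],
]
corrected = yates_statistic(example)
print(round(corrected, 3))
```

In this made-up table every |O - E| equals 2, so the correction reduces each squared difference from 4 to 2.25 and the corrected statistic is smaller than the uncorrected one.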
When E is less than 5
If the expected value for any individual cell is less than 5, then the row or column must be combined with another, as the approximation will not be good enough on such a small value.
In this case, the choice of which row or column to combine it with should be based on these criteria, which should be taken in order:
- Combine the row or column with another that is logically linked to it.
- Combine the row or column with another so as to make the expected frequencies as close to each other as possible.
In the above example, if the expected number of male/boiled were less than 5, the boiled column would be combined with the mashed column, to make the expected frequencies of the remaining columns closer.
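The check and the merge can be sketched in Python as follows. The helper names `smallest_expected` and `combine_columns` are hypothetical. In the potato data no expected value is actually below 5, so the merge of mashed and boiled is shown only to illustrate the mechanics.

```python
def smallest_expected(table):
    """Smallest expected frequency (row total * column total / grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return min(r * c / grand_total for r in row_totals for c in col_totals)


def combine_columns(table, keep, drop):
    """Return a new table with column `drop` added into column `keep`."""
    merged = []
    for row in table:
        new_row = [v for k, v in enumerate(row) if k != drop]
        # Position of the kept column once the dropped column is removed.
        target = keep if keep < drop else keep - 1
        new_row[target] += row[drop]
        merged.append(new_row)
    return merged


# Columns: Chips, Mashed, Boiled.
observed = [
    [45, 22, 15],
    [67, 44, 26],
]
needs_combining = smallest_expected(observed) < 5

# Merge Boiled (index 2) into Mashed (index 1) for illustration.
combined = combine_columns(observed, keep=1, drop=2)
print(needs_combining, combined)
```

After a merge the table has one column fewer, so the degrees of freedom must be recalculated before looking up the critical value.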