Chapter 4 – Comparing Counts with Expected Values: Chi-Square Test
INTRODUCTION
Sometimes data are collected as counts of numbers of individuals or of observations that can be placed into different categories. Often these data are best analyzed using a Chi-Square Test that compares observed counts (also called observed values) to expected values. The exact way the test works depends on how the expected values are determined. This chapter addresses two general approaches for determining expected values.
Part I describes how to calculate expected values from contingency tables; Part II describes how to calculate expected values based on a reference data set or on theory.
Important note on the terms count, value, frequency, and relative abundance
Throughout this chapter, I use the terms count, value and frequency. It’s worth defining these terms here in order to avoid confusion. Count always refers to raw data that represent counts of observations or individuals.
Value is always preceded by the terms observed or expected. Observed values are the same as the counts or the raw data – they are the data collected during the study. Expected values represent the numbers we would expect in our data set based on contingency tables, reference data sets, or theory.
In this manual, frequency always refers to a calculated number and never refers to a count. Observed frequencies are calculated as the number of observations in a particular cell or category divided by the total number of observations. Expected frequencies can be determined in several different ways, as described below. Expected frequencies are used to calculate expected values; they are never used in the Χ² equation described later in the chapter. For both observed frequencies and expected frequencies, the sum of the frequencies for all categories is always 1.
For data on the abundance of different species in a community, relative abundance is synonymous with observed frequency. For example, the observed frequencies of different tree species in a forest represent the relative abundance of those species. An example would be Sugar Maple having a relative abundance of 0.6 in a forest canopy - this means that 60% of the canopy trees are Sugar Maple.
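As a small illustration of these definitions, the Python sketch below converts counts into observed frequencies. The species names and counts are invented for the example (chosen so that Sugar Maple comes out at 0.6); they are not data from this manual.

```python
# Hypothetical canopy counts, invented for illustration only.
counts = {"Sugar Maple": 60, "American Beech": 25, "Yellow Birch": 15}

def observed_frequencies(counts):
    """Divide each count by the grand total to get a frequency."""
    grand_total = sum(counts.values())
    return {name: n / grand_total for name, n in counts.items()}

frequencies = observed_frequencies(counts)
for name, freq in frequencies.items():
    print(f"{name}: count = {counts[name]}, frequency = {freq:.2f}")

# Frequencies across all categories should add up to 1.
print("sum of frequencies:", round(sum(frequencies.values()), 10))
```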
PART I: CONTINGENCY TABLES
When data are collected as counts of observations that fall into two (or more) different categories for two different variables, they are typically analyzed using a contingency table. The contingency table is then used to calculate expected values so that a Chi-Square Test can be performed. This type of analysis is best demonstrated with a specific example.
BACKGROUND EXAMPLE
Imagine you’re interested in whether crayfish spend more time under cover during the day than during the night. You collect data on the number of crayfish found in two different categories for the variable cover type (under cover vs. in open) and two different categories for the variable light condition (dark vs. daylight). As a result, there are four possible cover type/light combinations that you sample.
Research Hypothesis: More crayfish will be found in the open under dark conditions than under daylight.
In this hypothesis, the independent variable is light condition (dark vs. daylight) and the dependent variable is where the crayfish are found (under cover vs. in open). By convention, the independent variable is represented by rows in a contingency table, and the dependent variable is represented by columns. A data table for the crayfish example could be set up as follows:
Table 4.1. Example of how to set up a contingency table for the crayfish data set.
The data that you collect can then be entered into this table as the number of crayfish found in each category.
Table 4.2. Crayfish data. Numbers represent the number of individuals found under the specified conditions.
GRAPHING THE DATA
Once data are entered into your table, the next step is to create a bar graph to visualize your data.
If numbers are entered as shown above, you can follow these directions to create your bar graph. In addition, these directions on formatting graphs may be helpful.
Once completed, your graph should look similar to this:
Figure 4.1. Bar graph showing the number of crayfish found under different cover type/light combinations.
Based on Figure 4.1, it’s clear that more crayfish are found in the open than under cover when it’s dark (and more under cover than in the open when it’s light); this pattern is consistent with your hypothesis. The next step is to determine whether this pattern is strong enough to be statistically significant. In other words, can these results be explained simply by random chance, or is something more interesting going on here?
In order to test for statistical significance, we need to calculate the expected values for each cell in Table 4.2. The expected values represent the numbers that should be found in each cell of the table if there’s no association between the two variables. In the crayfish example, it’s the number that should be found in each cell if there’s no association between light condition and whether crayfish are found in the open or under cover. Once expected values are calculated, a Chi-Square Test can be done to test for statistical significance. In order to calculate expected values, we’ll set up a contingency table.
Calculating Expected Values for Cells in Contingency Tables
(Here you can find a spreadsheet that will do these calculations.)
- First, calculate sums for rows, columns, and the grand total for all the values in the table (Table 4.3a).
- The expected value for each cell is calculated by multiplying the row total by the column total, then dividing by the grand total. The calculations are simple enough to do using a calculator (Table 4.3b), or they can be programmed into MS Excel using equations.
- The full procedure for calculating expected values for the crayfish data set is illustrated in Table 4.3. (Note: this table does not show equations in the format necessary for use with MS Excel.)
Table 4.3a. Contingency table showing the observed values and calculated row totals, column totals, and grand total.
Table 4.3b. Contingency table showing the equations for calculating expected values. These equations represent the mathematical calculations to be performed; they do not correspond to formulas for use with MS Excel.
Table 4.3c. Contingency table showing the values entered into the equations for calculating expected values.
Table 4.3d. Contingency table showing the calculated expected values.
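The rule described above (row total times column total, divided by the grand total) can be sketched in Python as follows. The counts in `observed` are made up for illustration and are not the crayfish data from Table 4.2; the function name `expected_values` is likewise just a label chosen for this sketch.

```python
def expected_values(observed):
    """Return expected values for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Each cell: (row total * column total) / grand total
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Hypothetical counts: rows = light condition, columns = cover type.
observed = [
    [12, 28],   # dark:     under cover, in open
    [30, 10],   # daylight: under cover, in open
]
expected = expected_values(observed)
for row in expected:
    print([round(value, 2) for value in row])
```

The expected values always add up to the same grand total as the observed values, which is a quick way to check the arithmetic.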
PERFORMING THE STATISTICAL TEST
Once you have your observed values and your expected values, it’s relatively simple to use an equation to calculate a value for Chi-Square (represented by the symbol Χ²). By using Χ² and the degrees of freedom, a p-value can be found. In this test, the p-value represents the probability of seeing differences between the observed and expected values at least as large as the ones in your data if random chance alone were at work. If that probability is low (if p < 0.05), we can conclude that the observed values are statistically different from the expected values.
The equation for calculating Χ² is shown below. As described below the equation, it’s easy to determine Χ² using a table and a calculator. Alternatively, this spreadsheet will do the calculations for you. Table 4.10 at the end of the chapter shows how this spreadsheet is set up. In addition, Table 4.11 shows what the table should look like once you’ve entered the crayfish data.
Χ² Equation:    Χ² = Σ (o – e)² / e
(o = observed value, e = expected value)
Using a Table and Calculator to Determine Χ²
- Set up a table with five columns (you can do this by hand on a piece of paper). Give the five columns the following headings: observed #, expected #, o-e, (o-e)², and (o-e)²/e. (The abbreviations o and e stand for observed # and expected #.)
- Enter your observed values and expected values into the first two columns.
- To fill in the third column, subtract each expected value from each observed value.
- To fill in the fourth column, simply square the values in the third column.
- To fill in the fifth column, divide the values in the fourth column by the corresponding expected values. Then add all of the values in the fifth column to calculate Χ².
- Table 4.5 shows this table for the crayfish data.
- You now need to know the degrees of freedom in order to find the p-value. For contingency table analysis, df = (the number of rows -1) * (the number of columns-1). For the crayfish data, there are two rows and two columns in the contingency table, so the df = 1.
- You can use MS Excel to find the p-value based on Χ² and the df. In a cell in an MS Excel spreadsheet, type in "=chidist(Χ²,df)", replacing Χ² and df with your own values. So, for the crayfish data, type in the following: =chidist(10.6,1). After hitting the return button, you should see the value "0.001131".
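The five-column table and the p-value lookup can also be sketched in Python. The observed and expected values below are made up for illustration (they are not the crayfish data in Table 4.5), and `p_value_df1` is a name chosen for this sketch. It relies on the fact that, for df = 1 only, the upper-tail chi-square probability equals erfc(√(Χ²/2)), so it stands in for Excel’s chidist in that one case.

```python
import math

# Hypothetical observed and expected values for the four cells of a 2 x 2 table.
observed = [12, 28, 30, 10]
expected = [21.0, 19.0, 21.0, 19.0]

print("observed  expected  o-e   (o-e)^2  (o-e)^2/e")
chi_square = 0.0
for o, e in zip(observed, expected):
    diff = o - e                 # third column
    squared = diff ** 2          # fourth column
    contribution = squared / e   # fifth column
    chi_square += contribution   # chi-square is the sum of the fifth column
    print(f"{o:8} {e:9.1f} {diff:5.1f} {squared:8.1f} {contribution:10.3f}")

rows, columns = 2, 2
df = (rows - 1) * (columns - 1)

def p_value_df1(chi_square):
    """Upper-tail chi-square probability; valid for df = 1 only."""
    return math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.2f}, df = {df}, p = {p_value_df1(chi_square):.6f}")

# Cross-check against the value quoted above for =chidist(10.6,1).
print(f"{p_value_df1(10.6):.6f}")
```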
Interpreting the Results
As shown in Table 4.5, the p-value for this Chi-Square Test is 0.001131 (this should be reported as p=0.0011). So how does this relate to our original hypothesis that more crayfish will be found in the open under dark conditions than under daylight? Well, this p-value means that if there were no association between light condition and cover type, random chance alone would produce differences between observed and expected values at least this large only about 0.11% of the time. That is a very small probability (and p is in fact < 0.05), so we can conclude that the differences are statistically significant. If we refer back to Figure 4.1 and our conclusions based on that figure, it’s clear that these data support the hypothesis that more crayfish will be found in the open under dark conditions than under daylight.
Reporting the Results
It’s traditional in Chi-Square analysis to report the degrees of freedom, the Chi-Square Value, and the p-value. Often this is simply done within parentheses at the end of a sentence that reports whether the test was significant. Here’s an example of how you might report the results of the analysis on the crayfish data: A Chi-Square Test showed that there was a significant association between light conditions and cover-type (df=1, Χ²=10.6, p=0.0011).
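If you are computing these numbers in a script rather than a spreadsheet, the parenthetical report can be assembled with ordinary string formatting. This is only a sketch: the function name `format_chi_square_report` is made up here, and the rounding (one decimal for Χ², four for p) simply mirrors the example above.

```python
def format_chi_square_report(df, chi_square, p_value):
    """Build the '(df=..., Χ²=..., p=...)' string used when reporting results."""
    return f"(df={df}, Χ²={chi_square:.1f}, p={p_value:.4f})"

# Values taken from the crayfish example in this chapter.
report = format_chi_square_report(1, 10.6, 0.001131)
print("A Chi-Square Test showed a significant association "
      "between light conditions and cover-type " + report + ".")
```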
Other Uses for Contingency Tables
Contingency tables can be applied to a wide variety of situations where data can be sorted into two types of categories (i.e. by two variables). Examples include the number of isopods found in different experimental conditions of wet vs. dry and dark vs. light, or the number of endangered mammal populations that are declining vs. not declining in protected vs. non-protected habitats. In addition, tables can be expanded to include more rows or columns depending on the different levels that can be distinguished for each variable. A 3 x 3 table would work for the endangered mammal example if the categories for population status could be divided into declining, stable, and growing and the categories for protection could be divided into unprotected, semi-protected, and fully protected.
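The same calculation extends to larger tables. The sketch below handles a table of any size and returns Χ² together with df = (rows - 1) × (columns - 1). The 3 x 3 counts are invented for the endangered mammal example and are not real data; `chi_square_contingency` is a name chosen for this sketch.

```python
def chi_square_contingency(observed):
    """Return (chi-square, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected value
            chi_square += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return chi_square, df

# Hypothetical counts of populations.
# Rows: unprotected, semi-protected, fully protected.
# Columns: declining, stable, growing.
mammals = [
    [14, 4, 2],
    [8, 8, 4],
    [3, 7, 10],
]
chi_square, df = chi_square_contingency(mammals)
print(f"chi-square = {chi_square:.2f}, df = {df}")
```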
PART II: EXPECTED FREQUENCIES BASED ON A REFERENCE DATA SET OR ON THEORY
Sometimes data consist of counts that are assigned to categories and the expected number for each category is based on expected frequencies from other data sets or from theory. For example, data on the frequency of tree species in the understory of a forest can be compared to expected frequencies based on trees in the canopy. Alternatively, theory can be used to determine expected frequencies such as when Mendelian ratios are used to predict the expected number of individuals with different phenotypes in the F2 generation of a genetic cross.
BACKGROUND EXAMPLE
You are managing a forest as part of a wildlife reserve and are interested in whether the mix of tree species in the forest is likely to change in the future. One way to get at this issue is to collect data on the mix of tree species in the understory and compare it to the mix of tree species in the canopy layer. Species that are under-represented in the understory may become less common in the future.
Research Question: In the wildlife reserve, does the mix of different tree species in the forest understory match the mix of different tree species in the canopy?
To answer this question, you collect data on the number of individuals of different tree species in the wildlife reserve. The raw data are shown in Table 4.6 and can be found here.
In this example, the mix of tree species in the canopy layer (or more specifically, the relative abundance of tree species in the canopy) can be used to calculate the expected values for the number of individuals of each tree species in the understory. Then the same formula for Chi-Square that we used for the crayfish data can be used to calculate a value for Χ² and determine whether the observed values are different from the expected values.
The directions below explain how to calculate expected values for the number of trees in the understory, how to create relative abundance graphs, and how to do a Chi-Square Test on these data. Alternatively, you can enter your data into this spreadsheet, which will do the calculations, make the graph, and perform the statistical test for you. If you choose to use the spreadsheet, I strongly recommend that you read through the directions below so that you understand what is being represented.
Here you will find the section on interpreting your statistical output.
Calculating Expected Values from Expected Frequencies
- The first step is to make sure that your reference data (in this case the canopy tree data) are converted to frequencies. To do this, simply add up the total number of trees to calculate a grand total. Then for each species, divide the number sampled by the grand total. Once you have a frequency for each tree species, you can check your work by confirming that the frequencies for all the species add up to a total of 1. (Note for analyzing genetic crosses: This same technique can be used to calculate expected frequencies from expected ratios in genetic crosses. For example, for an expected ratio of 9:3:3:1, each value is divided by the grand total of 16 to give the expected frequencies of 0.5625, 0.1875, 0.1875, and 0.0625.)
- These expected frequencies can then be used to calculate expected values for the data you are testing (in this case the understory tree data). First, calculate the grand total for the number of trees found in the understory.
- To calculate an expected value for a given species, multiply the expected frequency for that species by the grand total in the understory. The sum of the expected values should be the same as the grand total for the observed values from the understory.
- It is also worth calculating the observed frequencies for the understory data. These are calculated by dividing the number of individuals of each species in the understory by the grand total of trees in the understory. These observed frequencies are a measure of relative abundance of the different tree species and are very useful for graphing the data in order to make visual comparisons.
- The equations and calculated values for determining expected frequencies and expected numbers are shown in Table 4.7.
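The steps above can be sketched in Python as follows. The canopy and understory counts are invented for illustration (they are not the data in Table 4.6), and the function names are labels chosen for this sketch. The last lines repeat the 9:3:3:1 example from the note on genetic crosses.

```python
def frequencies_from_counts(counts):
    """Convert counts (or ratio terms) to frequencies that sum to 1."""
    grand_total = sum(counts)
    return [n / grand_total for n in counts]

def expected_from_frequencies(frequencies, observed_counts):
    """Multiply each expected frequency by the grand total of the observed data."""
    grand_total = sum(observed_counts)
    return [f * grand_total for f in frequencies]

# Hypothetical counts for five tree species (same species order in both lists).
canopy = [50, 20, 15, 10, 5]        # reference data
understory = [82, 45, 31, 27, 15]   # data being tested

expected_freq = frequencies_from_counts(canopy)
expected_vals = expected_from_frequencies(expected_freq, understory)
observed_freq = frequencies_from_counts(understory)

print("expected frequencies:", [round(f, 3) for f in expected_freq])
print("expected values:     ", [round(v, 1) for v in expected_vals])
print("observed frequencies:", [round(f, 3) for f in observed_freq])

# Check: the expected values should sum to the understory grand total.
print(round(sum(expected_vals), 6), sum(understory))

# Expected frequencies from a 9:3:3:1 Mendelian ratio.
print(frequencies_from_counts([9, 3, 3, 1]))
```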
GRAPHING THE DATA
The frequencies (columns titled expected frequ and observed frequ) in Table 4.7b represent the relative abundance of different tree species and can be used to make a bar graph showing possible differences between the canopy and the understory. Table 4.8 shows the relevant data for making the graph.
This type of bar graph (in this case referred to as a relative abundance graph) can be made by modifying the directions found earlier in this chapter for graphing the crayfish data. Of course, it will need to be labeled differently. Once completed, it should look similar to the one shown in Figure 4.2. This type of graph provides a way to visually inspect your data and begin to figure out whether the relative abundance (or "mix") of different tree species in the understory matches the canopy.
Figure 4.2. Relative abundance of tree species in the canopy and in the understory for the wildlife reserve data set.
It’s easy to see from Figure 4.2 that the relative abundance of tree species in the understory is very similar to the canopy. However, it’s still important to perform a Chi-Square Test in order to confirm that the differences are not statistically significant and to report the values for Χ² and for p.
PERFORMING THE STATISTICAL TEST
Now that you have observed values and expected values for the understory trees (these are found in Table 4.7 and summarized in Table 4.9), you can calculate Χ² exactly the same way as described earlier in this chapter for the crayfish data. However, an important difference for this type of Chi-Square Test is that df is determined differently than for contingency table analysis. Here, df = # categories – 1. For the tree data there are a total of five species, so df = 5 – 1 = 4.
Important Note: Even though we used the relative abundance data (or frequency data) to create a bar graph, Χ² should always be calculated using count data. Therefore, be sure to use the observed values and expected values in your analysis, not the observed and expected frequencies.
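Putting the pieces together, the sketch below runs this type of Chi-Square Test on counts, with df = number of categories - 1. The counts are invented for illustration and are not the tree data in Table 4.9. Because Python’s standard library has no built-in chi-square distribution, `chi_square_p_value` is a hand-written stand-in for Excel’s chidist that uses the standard closed-form series for whole-number df; if SciPy is installed, `scipy.stats.chi2.sf(chi_square, df)` should give the same result.

```python
import math

def chi_square_p_value(chi_square, df):
    """Upper-tail probability of the chi-square distribution for integer df >= 1."""
    if chi_square <= 0:
        return 1.0
    half = chi_square / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^k / k! for k = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series
    term, total = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= chi_square / (2 * r - 1)
        total += term
    correction = math.sqrt(2 * chi_square / math.pi) * math.exp(-half) * total
    return math.erfc(math.sqrt(half)) + correction

def goodness_of_fit(observed, expected):
    """Return (chi-square, df, p) using counts, never frequencies."""
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return chi_square, df, chi_square_p_value(chi_square, df)

# Hypothetical observed and expected values for five tree species.
observed = [82, 45, 31, 27, 15]
expected = [100.0, 40.0, 30.0, 20.0, 10.0]
chi_square, df, p = goodness_of_fit(observed, expected)
print(f"chi-square = {chi_square:.2f}, df = {df}, p = {p:.4f}")
```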
Interpreting your Statistical Output
When you calculate Χ², you should get a value of 0.19. When you use the chidist option in MS Excel to find the p-value for a Χ² of 0.19 with 4 degrees of freedom, you should get a p-value of 0.99 (this is done automatically if you’re using a spreadsheet programmed to do this analysis). A p-value of 0.99 means that if the understory really had the same mix of species as the canopy, random chance alone would produce differences between the observed and expected values at least this large about 99% of the time. Clearly this is not a significant difference. Here’s an example of how you could report these results: Based on a Chi-Square Test, the relative abundance of different tree species in the understory is not significantly different from the relative abundance of different species in the canopy (df=4, Χ²=0.19, p=0.99). At least based on these data, the mix of tree species in the wildlife reserve is not likely to shift in the near future.
MORE ON CHI-SQUARE TEST AND ALTERNATIVE TESTS
The Concept Behind the Chi-Square Test
In a Chi-Square Test, the equation for Χ² is used to quantify the difference between observed and expected values; the greater the difference between observed and expected, the greater the calculated value of Χ². The calculated value of Χ² is then compared to the Χ² distribution to find a p-value. In general, the greater the value of Χ², the lower the p-value. The lower the p-value, the less likely it is that random chance alone would produce such differences between the observed and expected values, and the more likely it is that something interesting is causing them.
Other Types of Chi-Square Test
It is important to mention that for 2 x 2 contingency tables, the estimate for p as described above is subject to minor error. To calculate the exact value of p, Fisher's exact test must be used (Gotelli and Ellison 2004).
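For readers curious about what Fisher’s exact test involves, here is a minimal Python sketch for a 2 x 2 table. It uses one common two-sided convention (adding up the probabilities of all tables, with the same row and column totals, that are no more probable than the observed one); other conventions exist, and this is not taken from Gotelli and Ellison (2004). The counts are invented for illustration.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def table_probability(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_value = 0.0
    for x in range(lowest, highest + 1):
        p = table_probability(x)
        # Small tolerance guards against floating-point ties
        if p <= p_observed * (1 + 1e-9):
            p_value += p
    return p_value

# Hypothetical 2 x 2 counts.
print(round(fisher_exact_2x2(3, 1, 1, 3), 4))
```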
Contingency table analysis can be expanded to include more rows and/or columns per variable (such as a 3 x 3 table) or even to include more variables in a multi-way contingency table (such as 2 x 2 x 2).
When a Chi-Square Test is Appropriate
The biggest concern in using the Chi-Square Test arises when many of the expected values are near zero. Snedecor and Cochran (1980) recommend the following guidelines to avoid problems associated with low values.
- None of the expected values should be less than one.
- Two of the expected values can be near 1 if most other values are greater than five.
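These guidelines can be turned into a quick check before running the test. The sketch below is one possible reading of them, not a rule from Snedecor and Cochran (1980): it flags any expected value below 1 and simply reports how many fall below 5 so that you can judge the rest yourself. The function name and example values are made up.

```python
def check_expected_values(expected):
    """Summarize how many expected values are small enough to cause concern."""
    below_one = [e for e in expected if e < 1]
    below_five = [e for e in expected if e < 5]
    return {
        "any_below_one": len(below_one) > 0,
        "count_below_five": len(below_five),
        "total_cells": len(expected),
    }

# Hypothetical expected values.
print(check_expected_values([21.0, 19.0, 21.0, 19.0]))
print(check_expected_values([0.8, 6.2, 12.5, 30.5]))
```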
Alternatives to the Chi-Square Test
Bayesian analysis can be used as an alternative to the Chi-Square Test as described in Gotelli and Ellison (2004).
Setting Up Spreadsheets to Calculate Chi-Square Values
Tables 4.10 through 4.13 show how to use formulas to set up spreadsheets to calculate Χ² values. In order to use those tables successfully, you need to enter the formulas exactly as shown in the exact same cells as shown. Alternatively, you can set up the tables differently if you have experience using formulas in MS Excel and if you understand how the Χ² formula works. You can also find a spreadsheet here that will do these calculations and might work for you. Good luck!
Table 4.10. Spreadsheet showing formulas to calculate Χ² for data entered into a contingency table. Formulas must be typed exactly as shown into the exact same cells as shown in order for calculations to be correct (see Table 1.4 for an introduction on how to use formulas). The spreadsheet below is "split" in two parts in order to fit on the page below, but on your computer screen, column F should appear to the right of column E.
Directions for using the spreadsheet shown in Table 4.11
- enter the observed values (counts) into the contingency table shown in gray
- data from a 2 x 2, 2 x 3, 3 x 2, or 3 x 3 table can be entered
- enter the number of rows and columns into the gray cells B16 and B17
- the value for Χ² is shown in cell K14 and the p-value is shown in K16
- Table 4.11 shows the values you should see in your spreadsheet if you enter the data shown within the gray cells in that table.
Table 4.11. Spreadsheet showing the output you should get if you set up the spreadsheet shown in Table 4.10 and enter the data shown in the gray cells below.
Table 4.12. Formulas to set up a spreadsheet to calculate Χ² when expected frequencies are based on reference data or on theory. Formulas must be typed exactly as shown into the exact same cells as shown in order for calculations to be correct. You may also use this spreadsheet that has already been set up; it might suit your needs. The spreadsheet below is "split" in two parts in order to fit on the page below, but on your computer screen, column F should appear to the right of column E.
Table 4.13. Spreadsheet showing the output you should get if you set up the spreadsheet shown in Table 4.12 and enter the data shown in the gray cells below.