
Sometimes data are collected as counts of numbers of individuals or of observations that can be placed into different categories. Often these data are best analyzed using a Chi-Square Test that compares observed counts (also called observed values) to expected values. The exact way the test works depends on how the expected values are determined. This chapter addresses two general approaches for determining expected values.

Throughout this chapter, I use the terms count, value and frequency. It’s worth defining these terms here in order to avoid confusion. Count always refers to raw data that represent counts of observations or individuals.

Value is always preceded by the terms observed or expected. Observed values are the same as the counts or the raw data – they are the data collected during the study. Expected values represent the numbers we would expect in our data set based on contingency tables, reference data sets, or theory.

In this manual, frequency always refers to a calculated number and never refers to a count. Observed frequencies are calculated as the number of observations in a particular cell or category divided by the total number of observations. Expected frequencies can be determined in several different ways, as described below. Expected frequencies are used to calculate expected values; they are never used in the Χ

For data on the abundance of different species in a community, relative abundance is synonymous with observed frequency. For example, the observed frequencies of different tree species in a forest represent the relative abundance of those species. An example would be Sugar Maple having a relative abundance of 0.6 in a forest canopy - this means that 60% of the canopy trees are Sugar Maple.

When data are collected as counts of observations that fall into two (or more) different categories for two different variables, they are typically analyzed using a contingency table. The contingency table is then used to calculate expected values so that a Chi-Square Test can be performed. This type of analysis is best demonstrated with a specific example.

Imagine you’re interested in whether crayfish spend more time under cover during the day than during the night. You collect data on the number of crayfish found in two different categories for the variable cover type (under cover vs. in open) and two different categories for the variable light condition (dark vs. daylight). As a result, there are four possible cover type/light combinations that you sample.

In this hypothesis, the independent variable is light condition (dark vs. daylight) and the dependent variable is where the crayfish are found (under cover vs. in open). By convention, the independent variable is represented by rows in a contingency table, and the dependent variable is represented by columns. A data table for the crayfish example could be set up as follows:

Table 4.1. Example of how to set up a contingency table for the crayfish data set.

The data that you collect can then be entered into this table as the number of crayfish found in each category.

Table 4.2. Crayfish data. Numbers represent the number of individuals found under the specified conditions.

Once data are entered into your table, the next step is to create a bar graph to visualize your data. If numbers are entered as shown above, you can follow these directions to create your bar graph. In addition, these directions on formatting graphs may be helpful.

Once completed, your graph should look similar to this:

Figure 4.1. Bar graph showing the number of crayfish found under different cover type/light combinations.

Based on Figure 4.1, it’s clear that more crayfish are found in the open than under cover when it’s dark (and more under cover than in the open when it’s light); this pattern is consistent with your hypothesis. The next step is to determine whether this pattern is strong enough to be statistically significant. In other words, can these results be explained simply by random chance, or is something more interesting going on here?

In order to test for statistical significance, we need to calculate the expected values for each cell in Table 4.2. The expected values represent the numbers that should be found in each cell of the table if there’s no association between the two variables. In the crayfish example, it’s the number that should be found in each cell if there’s no association between light condition and whether crayfish are found in the open or under cover. Once expected values are calculated, a Chi-Square Test can be done to test for statistical significance. In order to calculate expected values, we’ll set up a contingency table.

(Here you can find a spreadsheet that will do these calculations.)

- First, calculate the row totals, the column totals, and the grand total for all the values in the table (Table 4.3a).
- The expected value for each cell is calculated by multiplying the row total by the column total, then dividing by the grand total. The calculations are simple enough to do using a calculator (Table 4.3b), or they can be programmed into MS Excel using equations.
- The full procedure for calculating expected values for the crayfish data set is illustrated in Table 4.3. (Note: this table does not show equations in the format necessary for use with MS Excel.)
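The row total × column total ÷ grand total rule described above can also be sketched in a few lines of Python. This is only an illustration, not the spreadsheet referred to in this chapter: the function name `expected_values` is made up for this example, and the counts are hypothetical stand-ins, not the data from Table 4.2.

```python
def expected_values(observed):
    """Return expected values for a contingency table of counts.

    `observed` is a list of rows; each row is a list of counts.
    Each expected value is (row total * column total) / grand total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical counts for illustration only (NOT the values in Table 4.2).
# Rows: dark, daylight. Columns: under cover, in open.
observed = [[10, 30],
            [25, 15]]

expected = expected_values(observed)
for row in expected:
    print(row)
# prints [17.5, 22.5] twice: both rows have the same total (40),
# so both rows get the same expected values
```

A useful check is that the expected values add up to the same grand total as the observed values (80 in this made-up example).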

Table 4.3a. Contingency table showing the observed values and calculated row totals, column totals, and grand total.

Table 4.3b. Contingency table showing the equations for calculating expected values. These equations represent the mathematical calculations to be performed; they do not correspond to formulas for use with MS Excel.

Table 4.3c. Contingency table showing the values entered into the equations for calculating expected values.

Table 4.3d. Contingency table showing the calculated expected values.

Once you have your observed values and your expected values, it’s relatively simple to use an equation to calculate a value for Chi-Square (represented by the symbol Χ

The equation for calculating Χ^{2} is:

Χ^{2} = Σ (o - e)^{2} / e

(o = observed value, e = expected value)

- Set up a table with five columns (you can do this by hand on a piece of paper). Give the five columns the following headings: observed #, expected #, o-e, (o-e)^{2}, and (o-e)^{2}/e. (The abbreviations o and e stand for observed # and expected #.)
- Enter your observed values and expected values into the first two columns.
- To fill in the third column, subtract each expected value from each observed value.
- To fill in the fourth column, simply square the values in the third column.
- To fill in the fifth column, divide the values in the fourth column by the corresponding expected values. Then add all of the values in the fifth column to calculate Χ^{2}.
- Table 4.5 shows this table for the crayfish data.
- You now need to know the degrees of freedom in order to find the p-value. For contingency table analysis, df = (the number of rows -1) * (the number of columns-1). For the crayfish data, there are two rows and two columns in the contingency table, so the df = 1.
- You can use MS Excel to find the p-value based on Χ^{2} and the df. In a cell in an MS Excel spreadsheet, type "=chidist(Χ^{2},df)". So, for the crayfish data, type the following: =chidist(10.6,1). After hitting the return key, you should see the value "0.001131".
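For readers who prefer code to a hand-drawn table, the steps above can be sketched in Python using only the standard library. The function names are invented for this illustration, and the p-value function is a stand-in for the Excel chidist call, not a copy of it: it evaluates the upper tail of the Chi-Square distribution with the standard closed-form series for whole-number degrees of freedom. The observed and expected values in the first example are hypothetical, not the data in Table 4.5.

```python
import math


def chi_square(observed, expected):
    """Sum of (o - e)^2 / e over all cells (the fifth column of the table)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def contingency_df(n_rows, n_cols):
    """Degrees of freedom for a contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def chi_square_p_value(chi2, df):
    """Upper-tail probability of the Chi-Square distribution.

    Assumes df is a positive whole number. Uses the finite series for
    even df, and the complementary error function plus a finite series
    for odd df.
    """
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= chi2 / (2 * r)
            total += term
        return math.exp(-chi2 / 2) * total
    x = math.sqrt(chi2)
    density = math.exp(-chi2 / 2) / math.sqrt(2 * math.pi)
    term, total = 0.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        term = x if r == 1 else term * chi2 / (2 * r - 1)
        total += term
    return math.erfc(x / math.sqrt(2)) + 2 * density * total


# Hypothetical observed and expected values (NOT the data in Table 4.5).
observed = [10, 30, 25, 15]
expected = [17.5, 22.5, 17.5, 22.5]
chi2 = chi_square(observed, expected)
df = contingency_df(2, 2)
print(chi2, df, chi_square_p_value(chi2, df))

# The values quoted in the text: chi-square = 10.6 with df = 1.
print(round(chi_square_p_value(10.6, 1), 6))  # 0.001131
```

The last line reproduces the value that the text reports from =chidist(10.6,1), which is a quick way to confirm that the sketch and the spreadsheet agree for this case.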

As shown in Table 4.5, the p-value for this Chi-Square Test is 0.001131 (this should be reported as p=0.0011). So how does this relate to our original hypothesis that more crayfish will be found in the open under dark conditions than under daylight? Well, this p-value means that, if there were no association between light condition and cover type, the probability of getting differences between observed and expected values at least this large through random chance alone would be 0.11%. There is a very small chance that the differences between observed and expected values are the result of random chance (and p is in fact

It’s traditional in Chi-Square analysis to report the degrees of freedom, the Chi-Square Value, and the p-value. Often this is simply done within parentheses at the end of a sentence that reports whether the test was significant. Here’s an example of how you might report the results of the analysis on the crayfish data: A Chi-Square Test showed that there was a significant association between light conditions and cover-type (df=1, Χ

Contingency tables can be applied to a wide variety of situations where data can be sorted into two types of categories (i.e. by two variables). For example, the number of isopods found in different experimental conditions of wet vs. dry and dark vs. light, or the number of endangered mammal populations that are declining vs. not-declining in protected vs. non-protected habitats. In addition, tables can be expanded to include more rows or columns depending on the different levels that can be distinguished for each variable. A 3 x 3 table would work for the endangered mammal example if the categories for population status could be divided into declining, stable and growing and the categories for protection could be divided into unprotected, semi-protected, and fully protected.

Sometimes data consist of counts that are assigned to categories and the expected number for each category is based on expected frequencies from other data sets or from theory. For example, data on the frequency of tree species in the understory of a forest can be compared to expected frequencies based on trees in the canopy. Alternatively, theory can be used to determine expected frequencies such as when Mendelian ratios are used to predict the expected number of individuals with different phenotypes in the F2 generation of a genetic cross.

You are managing a forest as part of a wildlife reserve and are interested in whether the mix of tree species in the forest is likely to change in the future. One way to get at this issue is to collect data on the mix of tree species in the understory and compare it to the mix of tree species in the canopy layer. Species that are under-represented in the understory may become less common in the future.

To answer this question, you collect data on the number of individuals of different tree species in the wildlife reserve. The raw data are shown in Table 4.6 and can be found here.

In this example, the mix of tree species in the canopy layer (or more specifically, the relative abundance of tree species in the canopy) can be used to calculate the expected values for the number of individuals of each tree species in the understory. Then the same formula for Chi-Square that we used for the crayfish data can be used to calculate a value for Χ

The directions below explain how to calculate expected values for the number of trees in the understory, how to create relative abundance graphs, and how to do a Chi-Square Test on these data. Alternatively, you can enter your data into this spreadsheet, which will do the calculations, make the graph, and perform the statistical test for you. If you choose to use the spreadsheet, I strongly recommend that you read through the directions below so that you understand what is being represented.

Here you will find the section on interpreting your statistical output.

- The first step is to make sure that your reference data (in this case the canopy tree data) are converted to frequencies. To do this, simply add up the total number of trees to calculate a grand total. Then for each species, divide the number sampled by the grand total. Once you have a frequency for each tree species, you can check your work by confirming that the frequencies for all the species add up to a total of 1. (Note for analyzing genetic crosses: This same technique can be used to calculate expected frequencies from expected ratios in genetics crosses. For example, for an expected ratio of 9:3:3:1, each value is divided by the grand total of 16 to give the expected frequencies of 0.5625, 0.1875, 0.1875, and 0.0625.)
- These expected frequencies can then be used to calculate expected values for the data you are testing (in this case the understory tree data). First, calculate the grand total for the number of trees found in the understory.
- To calculate an expected value for a given species, multiply the expected frequency for that species by the grand total in the understory. The sum of the expected values should be the same as the grand total for the observed values from the understory.
- It is also worth calculating the observed frequencies for the understory data. These are calculated by dividing the number of individuals of each species in the understory by the grand total of trees in the understory. These observed frequencies are a measure of relative abundance of the different tree species and are very useful for graphing the data in order to make visual comparisons.
- The equations and calculated values for determining expected frequencies and expected numbers are shown in Table 4.7.
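The frequency calculations in the steps above can be sketched in Python as follows. The function names and the tree counts are hypothetical and chosen only for illustration; they are not the data in Table 4.6 or Table 4.7. The 9:3:3:1 ratio is the genetics example from the first step.

```python
def frequencies(counts):
    """Convert counts (or ratio terms) to frequencies that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def expected_from_reference(reference_counts, observed_counts):
    """Expected values = reference frequency * grand total of the observed data."""
    expected_freq = frequencies(reference_counts)
    grand_total = sum(observed_counts)
    return [f * grand_total for f in expected_freq]


# Expected frequencies from a 9:3:3:1 Mendelian ratio (grand total of 16).
print(frequencies([9, 3, 3, 1]))  # [0.5625, 0.1875, 0.1875, 0.0625]

# Hypothetical counts for three tree species (NOT the data in Table 4.6).
canopy = [60, 25, 15]        # reference data
understory = [110, 60, 30]   # data being tested

expected = expected_from_reference(canopy, understory)
observed_freq = frequencies(understory)

print(expected)       # expected number of each species in the understory
print(observed_freq)  # relative abundance of each species in the understory
```

As noted in the steps, the expected values should add up to the grand total of the observed understory counts (200 in this made-up example), which makes a convenient check on the arithmetic.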

The frequencies (columns titled expected frequ and observed frequ) in Table 4.7b represent the relative abundance of different tree species and can be used to make a bar graph showing possible differences between the canopy and the understory. Table 4.8 shows the relevant data for making the graph.

This type of bar graph (in this case referred to as a relative abundance graph) can be made by modifying the directions found earlier in this chapter for graphing the crayfish data. Of course, it will need to be labeled differently. Once completed, it should look similar to the one shown in Figure 4.2. This type of graph provides a way to visually inspect your data and begin to figure out whether the relative abundance (or "mix") of different tree species in the understory matches the canopy.

Figure 4.2. Relative abundance of tree species in the canopy and in the understory for the wildlife reserve data set.

It’s easy to see from Figure 4.2 that the relative abundance of tree species in the understory is very similar to the canopy. However, it’s still important to perform a Chi-Square Test in order to confirm that the differences are not statistically significant and to report the values for Χ

Now that you have observed values and expected values for the understory trees (these are found in Table 4.7 and summarized in Table 4.9), you can calculate Χ

Important Note: Even though we used the relative abundance data (or frequency data) to create a bar graph, Χ

When you calculate Χ

In a Chi-Square Test, the equation for Χ

It is important to mention that for 2 x 2 contingency tables, the estimate for p as described above is subject to minor error. To calculate the exact value of p, Fisher's exact test must be used (Gotelli and Ellison 2004).
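As a rough illustration of what Fisher's exact test does for a 2 x 2 table, here is a minimal Python sketch. It is not taken from Gotelli and Ellison (2004); the function name is made up, and it assumes one common two-sided convention: add up the probabilities of every table with the same row and column totals that is no more probable than the observed table. Statistical packages may use a different two-sided convention, so their results can differ slightly.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2 x 2 table of counts.

    Assumed convention: sum the probabilities of all tables (with the
    observed row and column totals) that are no more probable than the
    observed table.
    """
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # The small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Hypothetical 2 x 2 counts for illustration only.
print(fisher_exact_two_sided([[10, 30], [25, 15]]))
```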

Contingency table analysis can be expanded to include more rows and/or columns per category (such as a 3 x 3 table) or even to include more categories in a multi-way contingency table (such as 2 x 2 x 2).

The biggest concern in using the Chi-Square Test is when many of the expected values are near zero. Snedecor and Cochran (1980) recommend the following guidelines to avoid problems associated with low values.

- None of the expected values should be less than one.
- Two of the expected values can be near 1 if most other values are greater than five.
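A quick way to screen a table against these guidelines is sketched below in Python. The function name and the thresholds passed as defaults are assumptions made for this example. Because phrases such as "near 1" and "most other values" call for judgment, the sketch only counts the low expected values and leaves the decision to the reader.

```python
def screen_expected_values(expected, minimum=1.0, comfortable=5.0):
    """Count expected values that fall below the two thresholds.

    `expected` is a list of rows of expected values. The defaults of 1
    and 5 follow the guidelines quoted above; the function reports
    counts rather than making a pass/fail decision on the second one.
    """
    cells = [e for row in expected for e in row]
    return {
        "cells": len(cells),
        "below_minimum": sum(1 for e in cells if e < minimum),
        "below_comfortable": sum(1 for e in cells if e < comfortable),
    }


# Hypothetical expected values for illustration only.
report = screen_expected_values([[17.5, 22.5], [0.8, 4.2]])
print(report)  # {'cells': 4, 'below_minimum': 1, 'below_comfortable': 2}
```

In this made-up example one expected value is below 1, which breaks the first guideline, so the Chi-Square Test would not be appropriate without regrouping categories or collecting more data.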

Alternatives to the Chi-Square Test

Bayesian analysis can be used as an alternative to the Chi-Square Test as described in Gotelli and Ellison (2004).

Tables 4.10 through 4.13 show how to use formulas to set up spreadsheets to calculate Χ

Table 4.10. Spreadsheet showing formulas to calculate Χ

- enter the observed values (counts) into the contingency table shown in gray
- data from a 2 x 2, 2 x 3, 3 x 2, or 3 x 3 table can be entered
- enter the number of rows and columns into the gray cells B16 and B17
- the value for Χ^{2} is shown in cell K14 and the p-value is shown in K16
- Table 4.11 shows the values you should see in your spreadsheet if you enter the data shown within the gray cells in that table.

Table 4.11. Spreadsheet showing the output you should get if you set up the spreadsheet shown in Table 4.10 and enter the data shown in the gray cells below.

Table 4.12. Formulas to set up a spreadsheet to calculate Χ

Table 4.13. Spreadsheet showing the output you should get if you set up the spreadsheet shown in Table 4.12 and enter the data shown in the gray cells below.