Chapter 3 – Looking For Relationships between
Dependent and Independent Variables: Scatterplots and Regression Analysis

INTRODUCTION
As mentioned in the introduction of this manual, biologists and ecologists are often interested in patterns that occur in nature. Once a pattern has been identified, one of the main goals of science is to try to understand the cause for that pattern. This is where it becomes important to understand the distinction between the independent variable and the dependent variable. This distinction is most easily understood by referring to a specific example.

You are interested in the diversity of fish communities in lakes in northern New England. You and a team of field biologists collect data from lakes throughout the region and find that the number of fish species present varies from 1 to 15. What accounts for this variation? In other words, what is causing the differences in the number of fish species in these different lakes?

There are probably many factors that influence fish diversity and we could easily come up with several hypotheses to account for the variation in fish diversity. Each hypothesis would focus on a different factor or possible cause for the differences in fish diversity (or more sophisticated hypotheses might include several independent variables). For example, the size of the lake is likely a factor influencing diversity such that large lakes have more diverse fish communities than small lakes. In this case, lake size is the independent variable and fish diversity is the dependent variable.

Independent variable – the "predictor" variable; the variable that seems to be causing the observed change in the dependent variable. In the fish example above, one possible independent variable is lake size. Other possible independent variables that might explain fish diversity include lake pH, nutrient levels in lakes, amount of cover along the bottom of lakes, etc.

Dependent variable – the "response" variable that is influenced by the independent variable. This is usually the variable for which you are trying to find an explanation. It can also be thought of as the variable that "depends on" the independent variable.

BACKGROUND EXAMPLE

You are interested in the effects of vegetation complexity on the diversity of fish communities in the Adirondack Mountains. You visit fifteen lakes that vary in amount and type of vegetation. For each lake, you measure an index of vegetation complexity and sample the fish community to determine the number of species living in that lake. In this example, the hypothesis that you're interested in testing could be stated as follows:

Research Hypothesis: Vegetation complexity affects the diversity of fish communities in lakes in the Adirondack Mountains; the more complex the vegetation, the greater the number of fish species.

In this hypothesis, the independent variable is vegetation complexity and the dependent variable is number of fish species. (In this case, we will use a hypothetical vegetation complexity index, abbreviated as “veg. index”, as our measure of vegetation complexity.)

To test this hypothesis, you collect the following data (a copy can be found here):

ENTERING AND DESCRIBING THE DATA

See Chapter 1 for detailed directions on how to enter data and use formulas to calculate basic descriptive statistics (Table 1.4 should be particularly useful). Table 3.2 shows the raw data for the fish study along with corresponding descriptive statistics, which summarize some basic information about your data set. For example, it shows that lakes ranged in vegetation complexity from 3.8 to 6.8 with a mean index of 5.4. For the number of fish species, the range was 0 to 12 with a mean of 4.3. Although this information is useful, we have not yet begun to test our research hypothesis. In order to do that, we need to make a scatterplot and perform a regression analysis.

GRAPHING THE DATA

Directions for making a scatterplot to display the relationship between two continuous variables using the fish diversity data set can be found here.

Figure 3.1. Scatterplot of number of fish species vs. vegetation index. In this graph, the data clearly show a positive relationship between the two variables; lakes with greater vegetation complexity have more fish species.

Interpreting a Scatterplot
In this example, a scatterplot is used to begin to determine whether there's a relationship between the independent and dependent variables. Here we will only consider linear relationships (variables may also be related in a non-linear way, but that is beyond the scope of this manual). It is possible for the two variables to show a positive relationship (a line drawn through the data points would slope upward from left to right), a negative relationship (the line would slope downward from left to right), or no relationship. Sometimes the relationship between the variables is clearly revealed by a scatterplot, and a regression analysis is then used to quantify specific parameters that describe that relationship. However, some scatterplots can be difficult to interpret, and regression analysis is necessary to objectively determine whether there is a relationship between the variables.

Figure 3.1 for the fish diversity data set is an example of a positive relationship. Overall, the number of fish species increases with increasing vegetation complexity. A line drawn through the data points on this graph would slope upward from left to right. This pattern supports our hypothesis that vegetation complexity affects the diversity of fish communities in lakes in the Adirondack Mountains. Later we will confirm this positive relationship with a regression analysis and look at some specific statistical parameters that will help us quantify the relationship.

Examples of Other Scatterplots
Examples of other possible outcomes from our fish diversity study are shown below. Figure 3.4 is particularly important because it shows a graph that is difficult to interpret. For the data corresponding to Figure 3.4, a regression analysis is critical to help objectively determine whether there is a relationship or not.

Negative Relationship
Figure 3.2. Scatterplot of number of fish species vs. vegetation index. In this graph, the data clearly show a negative relationship between the two variables; lakes with greater vegetation complexity have fewer fish species.

No Relationship
Figure 3.3. Scatterplot of number of fish species vs. vegetation index. In this graph, there is no apparent relationship between the two variables. As vegetation complexity increases, the number of fish species does not consistently increase or decrease.

Positive Relationship or No Relationship?
Figure 3.4. Scatterplot of number of fish species vs. vegetation index. This graph is difficult to interpret. Is it an example of a positive relationship, or no relationship?

Without doing a regression analysis (or without a lot of experience interpreting scatterplots), there are two reasonable interpretations of Figure 3.4. One possible interpretation is that the data points are scattered throughout the graph and that lakes with a relatively high vegetation complexity do not consistently have either more or fewer fish species than lakes with low vegetation complexity; under this interpretation there is no relationship between the number of fish species and our vegetation index. However, it is also true that the lakes with fewest fish species have relatively low vegetation complexity and the lake with the most fish species has a relatively high vegetation complexity. Also, a line drawn through the center of the scatter of points would slope up from left to right. Perhaps there is a weak positive relationship between the two variables. The best way to objectively quantify this relationship is by using regression analysis (note: A regression analysis on the data for Figure 3.4 reveals a significant positive relationship between the variables; p = 0.042, R2 = 0.28; see below for definitions of these parameters).

TESTING THE DATA

A regression analysis estimates a regression line (defined by its slope and intercept) that characterizes the relationship between the two variables. The p-value represents the probability of seeing a relationship at least as strong as the one observed by random chance alone, if the two variables were actually unrelated. Here are directions for how to do a regression analysis in MS Excel. For the remainder of this chapter, we will focus on the original data presented in Table 3.1, the data that are graphed in Figure 3.1.

Table 3.3. Output from an MS Excel regression analysis on the fish diversity data set (the data shown in Table 3.1).

Presenting and Interpreting the Output
There is a lot of useful information in the "SUMMARY OUTPUT" table from the regression analysis, but at this point there are three main values of importance: the value for "R Square", the value for "Observations" (sample size), and the value for "Significance F" (the p-value for the overall regression analysis). In this case, those values are 0.59, 15, and 0.00084, respectively. In addition, the values for intercept and slope should be reported. These values (-10.7 and 2.8 in Table 3.3) are the estimates for the intercept and slope of the regression line – the line that goes through the center of the scatter of data points. The rest of the information is important for detailed analysis of the regression output, but is not necessary for a basic interpretation of the results. I recommend simplifying the output to report the regression results as shown in Table 3.4.

It is important to understand the meaning of these values in order to correctly interpret the analysis. The sample size simply reports how much data you have collected. In general, the more data you have, the more confident you can be in the results. Results from analyses with sample sizes less than 10 should be interpreted with caution.

The p-value is a measure of how likely it is that a relationship at least as strong as the one observed would arise by random chance alone if the two variables were actually unrelated. In this case the p-value is 0.00084, which is clearly less than the cutoff of 0.05, so the relationship is statistically significant. It means that if the number of fish species and vegetation complexity were unrelated, there would be only a 0.084% probability of seeing a relationship this strong by random chance. Because this probability is very low, we can conclude that there is likely something non-random, and therefore biologically or ecologically interesting, underlying the relationship between the two variables. We can conclude that there is a statistically significant positive relationship between the number of fish species and the vegetation index, supporting our hypothesis that vegetation complexity affects the diversity of fish communities in lakes in the Adirondack Mountains.

More on p-value
The p-value can vary from 0 to 1. The higher the p-value, the more easily the observed relationship could have arisen by random chance alone, and the weaker the evidence for a biologically or ecologically meaningful relationship between the two variables. In this case, if the relationship were due to random chance, then we could conclude that there is nothing about vegetation complexity that is influencing the number of fish species.

The lower the p-value, the harder it is to attribute the relationship to random chance and the stronger the evidence that it is due to something else, such as an influence of vegetation complexity on the number of fish species. It is important to remember that 0.05 is simply a cutoff agreed to by most biologists and ecologists. Using 0.05 does not mean that you will always reach the correct conclusion (see Type I and Type II Errors in Appendix I for more on reaching false conclusions). A p-value of 0.06 is nearly significant and could reflect a meaningful relationship. A p-value of 0.04 is significant, but it is still possible that the relationship between the two variables is just the result of random chance. On the other hand, a p-value of 0.0001 means that a relationship this strong would very rarely arise by random chance, and is considered highly significant. In simple terms, the lower the p-value, the stronger the evidence for a real and meaningful relationship between the two variables.

The R2 value
The R2 value (which also varies between 0 and 1) is just as important as the p-value in understanding the results of the regression analysis. R2 represents the amount of variation in the dependent variable that can be explained by the independent variable. In this case the R2 is 0.59 which means that 59% of the variation in the number of fish species in the 15 lakes that were sampled can be explained by the vegetation index of the lakes. That means that 41% of the variation in the number of fish species is unexplained by variation in our vegetation index values. Not surprisingly, there are probably other important influences on number of fish species besides vegetation complexity.

Another way of thinking about R2 is that it quantifies how tightly the data points are clustered around the regression line. If the data are tightly clustered around the line, the R2 value will be relatively high, and the relationship between the two variables is relatively strong. If the data are widely scattered around the line, the R2 value will be low and the relationship between the variables is relatively weak.
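The idea of R2 as "variation explained" can be sketched in a few lines of Python. This is only an illustration: the function names and the two small data sets below are made up for this example (they are not the Table 3.1 data), and the calculation is the standard least-squares formula rather than anything specific to MS Excel.

```python
def fit_line(x, y):
    """Least-squares estimates of slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = ss_xy / ss_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y):
    """Proportion of the variation in y that is explained by the fitted line."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    # Total variation in y around its mean.
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    # Variation left over around the regression line (unexplained).
    ss_residual = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_residual / ss_total


# Hypothetical data, invented purely for illustration.
x = [1, 2, 3, 4, 5]
tight = [2.1, 3.9, 6.2, 7.8, 10.1]  # points hug a straight line
loose = [5, 1, 8, 3, 9]             # points scatter widely around the line

print(round(r_squared(x, tight), 2))  # close to 1
print(round(r_squared(x, loose), 2))  # much lower
```

Both made-up data sets slope upward, but the tightly clustered one gives an R2 near 1 while the widely scattered one gives a much lower value.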

R2 vs. p-value
It is critical to understand both values when interpreting your analysis. With large sample sizes it is quite possible to have a very low p-value (such as 0.001) and a very low R2 (such as 0.08). The correct interpretation would be that there is a highly significant statistical relationship between the two variables (a relationship this strong would arise by random chance only 0.1% of the time if the variables were unrelated), but only 8% of the variation in the dependent variable is explained by the independent variable. Even though the relationship between the variables is probably real and meaningful, there is a lot of variation in the dependent variable that is unexplained by the independent variable.

The intercept and slope are simply estimates for the equation for the regression line that could be plotted through the center of the scatter of data points. They correspond to the parameters b and m in the equation y = mx + b. Often this equation and/or the regression line are included on scatterplots that correspond to regression analyses. In MS Excel, you can find the options for adding a trendline by clicking on your scatterplot and then clicking on Chart Tools and then the Layout tab.
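As a rough illustration of how the equation y = mx + b is used, the sketch below plugs in the rounded intercept and slope reported in Table 3.3 (-10.7 and 2.8). The function name is hypothetical, and because the estimates are rounded the results are only approximate.

```python
INTERCEPT = -10.7  # b, rounded estimate reported in Table 3.3
SLOPE = 2.8        # m, rounded estimate reported in Table 3.3


def predicted_fish_species(veg_index):
    """Number of fish species predicted by the regression line y = m*x + b."""
    return SLOPE * veg_index + INTERCEPT


# Prediction for a lake at the mean vegetation index of 5.4.
print(round(predicted_fish_species(5.4), 2))
```

At the mean vegetation index of 5.4, the line predicts about 4.4 species, close to the observed mean of 4.3. The small difference comes from rounding of the reported estimates. Predictions are best limited to the range of vegetation index values that were actually sampled (3.8 to 6.8).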

An important note on causation
Regression analysis is simply a statistical tool and cannot conclusively determine whether the independent variable that you are testing is causing the changes in the dependent variable. It is always possible that some unmeasured variable is the true cause of the changes in the dependent variable and is in fact related to both of the variables in your analysis. For example, it is possible that the lakes in the study above were artificially stocked with fish, and lakes with higher vegetation complexity were stocked with more fish species. In this case, vegetation complexity itself may not be causing the differences in the number of fish species, but it is still related to the number of fish species.

In general, experiments (rather than observational studies) in which the independent variable is manipulated and the effect on the dependent variable is measured are required in order to confirm what is truly causing variation in a dependent variable.

MORE ON REGRESSION AND ALTERNATIVE TESTS

The Concept Behind Regression Analysis
What follows is a very brief overview of how regression works. For details, consult Snedecor and Cochran (1980).

In a regression analysis, the slope of the regression line is calculated based on SS (sum of squares; see the section at the end of Chapter 2 on the concept behind Anova). Values for SS are then converted to MS (mean square) by dividing by df (degrees of freedom). In order to determine whether this slope is significantly different from zero (in other words, whether there is a significant relationship between the dependent and independent variables), the MS for the regression line is divided by the MS residual (an estimate of the variation in the data not related to the regression line) to calculate an F ratio. This F ratio is then used to find a p-value, similar to the way p-values are found in Anova. The greater the F ratio, the lower the p-value and the stronger the evidence that the relationship between the dependent and independent variables is not simply the result of random chance.
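The steps above can be sketched in Python for a regression with one independent variable. This is a minimal sketch under stated assumptions, not the routine MS Excel uses: the function names are hypothetical, the inputs are the rounded values from Table 3.3 (R2 = 0.59, 15 observations), and the p-value is found by numerically integrating the t distribution, using the fact that an F ratio with 1 and n - 2 degrees of freedom equals the square of a t statistic with n - 2 degrees of freedom.

```python
import math


def f_ratio_from_r2(r2, n):
    """F ratio = MS regression / MS residual for a one-predictor regression.

    SS regression is R2 * SS total (df = 1) and SS residual is
    (1 - R2) * SS total (df = n - 2), so SS total cancels in the ratio.
    """
    df_regression = 1
    df_residual = n - 2
    ms_regression = r2 / df_regression
    ms_residual = (1 - r2) / df_residual
    return ms_regression / ms_residual


def t_density(t, df):
    """Probability density of Student's t distribution."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def p_value_from_f(f, df_residual, steps=10000):
    """Upper-tail probability for an F ratio with (1, df_residual) df.

    Integrates the t density from 0 to sqrt(F) with Simpson's rule
    (steps must be even), then doubles the remaining tail area.
    """
    t = math.sqrt(f)
    h = t / steps
    total = t_density(0.0, df_residual) + t_density(t, df_residual)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df_residual)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t


f = f_ratio_from_r2(0.59, 15)       # rounded values from Table 3.3
p = p_value_from_f(f, 15 - 2)
print(round(f, 1), p)
```

With these rounded inputs the F ratio comes out at roughly 18.7 and the p-value somewhat below 0.001, in line with the "Significance F" of 0.00084 reported in Table 3.3; exact agreement is not expected because 0.59 is a rounded value. The same function also shows why a low R2 can still be highly significant with a large sample: `f_ratio_from_r2(0.08, 200)` gives an F ratio of about 17.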

Other Types of Regression Analysis
The regression analysis described in this chapter is linear regression. It is also possible that the two variables are related in a non-linear way, in which case a non-linear regression model should be used. Another type of regression analysis is called multiple regression, in which several different independent variables are measured and compared to see which is most strongly related to the dependent variable.

When Linear Regression is Appropriate
(See Gotelli and Ellison, 2004 for a thorough discussion of assumptions.)
• When the two variables include a dependent and an independent variable (this is the same as saying the relationship is hypothesized to be one of cause and effect). Correlation is used to test for relationships between variables when no cause and effect is hypothesized.
• When the relationship between the dependent and independent variables is linear.
• When the variances are constant along the regression line.
• When the data are independent (see glossary for more on independence).
• When the data were collected in an unbiased manner. Though this manual does not go into detail on methods for data collection, it is important to stress that statistical tests cannot correct for data sets that were collected improperly. See Brower et al. (1998) or Krebs (1989) for more on unbiased sampling.

Alternatives to Regression Analysis
Monte Carlo and Bayesian Analysis are non-parametric alternatives to regression (Gotelli and Ellison, 2004).
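
To give a flavor of the Monte Carlo approach, the sketch below runs a simple permutation (randomization) test on the slope. It is one common way to apply the idea and is not necessarily the procedure described by Gotelli and Ellison. The function names and the data are hypothetical; the values were invented for illustration and are not the Table 3.1 data.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ss_xy / ss_xx


def permutation_p_value(x, y, n_shuffles=5000, seed=0):
    """Share of random shufflings of y whose slope is at least as steep
    (in either direction) as the slope of the real data."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(slope(x, y))
    shuffled = list(y)
    at_least_as_steep = 0
    for _ in range(n_shuffles):
        # Shuffling y breaks any real link between x and y.
        rng.shuffle(shuffled)
        if abs(slope(x, shuffled)) >= observed - 1e-12:
            at_least_as_steep += 1
    # Count the observed arrangement itself so the p-value is never zero.
    return (at_least_as_steep + 1) / (n_shuffles + 1)


# Hypothetical vegetation index and fish species counts for ten lakes.
veg_index = [3.8, 4.2, 4.6, 5.0, 5.3, 5.7, 6.0, 6.3, 6.6, 6.8]
fish_species = [0, 1, 1, 3, 4, 4, 6, 8, 9, 12]

print(permutation_p_value(veg_index, fish_species))
```

The resulting p-value is read the same way as the one from the regression output: if very few shuffled data sets produce a slope as steep as the real one, the relationship is unlikely to be the result of random chance alone.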