Doing a Regression Analysis in MS Excel
• First, make sure that your data are entered appropriately. Below is an example for the fish diversity data from Chapter 3 of the Online Stats Manual.
• To do a regression analysis, go to the Data tab, select Data Analysis, Regression, and OK. Your screen should look like the screenshot shown below.
• In the Regression window, enter your data as described in the next two steps. Important - don’t switch your independent and dependent variables!
• Click within the Input Y Range box, then click and drag down the column heading and values for your dependent variable (dependent variables are typically displayed along the vertical y-axis in graphs). In the fish diversity data set, # of fish species is the dependent variable.
• Click within the Input X Range box, then click and drag down the column heading and values for your independent variable (independent variables are typically displayed along the horizontal x-axis in graphs). In the fish diversity data set, veg. index is the independent variable.
• Since you included column headings, make sure the Labels box is checked, then click OK. • If you were successful, you should see output that looks like this (though your output won’t include the red circles): • Now you need to format and interpret your output.
• For the purposes of many undergraduate projects, you will only need three values from this table: sample size (called Observations by Excel), p-value (called Significance F by Excel), and R-squared. (Be sure to check with your professor or with the project guidelines before reducing your output to these three values.)
• Rather than re-formatting the output table, I recommending building a new table to look like this: Interpreting these values:
• sample size: In this case, the number of lakes in which a vegetation index was calculated and fish species were recorded.
• p-value: This is the value for determining whether there’s statistical significance. In regression analysis, if p is less than (or equal to) 0.05, the relationship between the dependent and independent variable is statistically significant. In this case, p is much less than 0.05, so there is a significant relationship. In technical terms, p is the probability that the relationship between the two variables is due to random chance. Because there’s a low probability that this relationship is due to random chance, there is probably something interesting causing the relationship. It may be that vegetation complexity does indeed lead to greater fish diversity. See this section of the Online Stats Manual for more.
• R-squared: This is a measure of how scattered the data are around a regression line (such as a trendline in a scatterplot). The value of R-squared can vary from 0 to 1. High values of R-squared indicate low scatter. In technical terms, R-squared is the percent of variation in the dependent variable that’s explained by the independent variable. In the example above, 59% of the variation in the number of fish species can be explained by vegetation complexity.
• IMPORTANT: A scatterplot should always accompany regression analysis. You can find directions here. Here’s a scatterplot of the data for the analysis described above. As you can see from the graph, the relationship between the two variables is positive - an increase in the vegetation index is associated with an increase in the number of fish species. What some of the other output from MS Excel means:
• The MS (mean square) and F values work similarly to Anova. See the bottom of these directions for a discussion of these values.
• Coefficient: This column gives the slope and intercept of the regression line. In the output above, the slope is 2.8 (a positive value indicating a positive relationship) and the intercept is -10.7. So, the equation for the regression line is y = 2.8*x + (-10.7) . . . remember y = mx + b? . . .