This manual is for the student who has little or no training in the use of spreadsheets or statistical analysis but has the need or the desire to make sense out of quantitative data. It should be helpful for anyone wishing to analyze relatively straightforward data sets but who finds formal statistics texts "user unfriendly." It is intended to provide the beginner with the tools to start to interpret data without using sophisticated statistical software or complicated equations.
I believe that students can understand many of the key concepts in statistics without learning a lot of statistical theory. I also think that it is possible to successfully perform and interpret statistical tests while not knowing the formulas behind those tests. As my friend Michael Lehner says, a carpenter doesn't need to know the details of how a hammer is made to successfully bang a nail.
In the writing of this manual, I have attempted to make the interpretation of quantitative data as simple as possible without glossing over or over-simplifying the most important concepts. Students may be dismayed to find that even in its simplest form, statistics is a tricky topic with a certain amount of technical language that cannot be eliminated. Statisticians may be dismayed to find that equations are avoided as much as possible, discussions of null hypotheses and data transformations are left for the appendices, and discussion of the assumptions of particular tests is minimized. Nonetheless, I believe that this manual is at an appropriate level for a beginner to start using statistics.
It is essential to stress that this manual is not intended as a substitute for a statistics text. Anyone who wants to be sure they have interpreted their data correctly, particularly if they intend to present their analyses to a scientific audience or to publish their work in a professional research journal, should consult a statistics text (see references cited) and/or a statistician.
A Note on the Organization of Chapters
Each chapter includes a realistic example of a research question or hypothesis along with a hypothetical data set. I purposely chose hypothetical data in order to keep the data sets small and to ensure that they illustrate important concepts. Each chapter also includes sections with detailed directions for using MS Excel to make graphs and perform statistical analyses. After these directions, guidelines for interpreting the graphs and statistical output are included.
It is worth noting that Chapter 1 is the longest and assumes the least knowledge of MS Excel or statistics. Though a reader does not have to read Chapter 1 in order to understand Chapters 2 – 4, a true beginner with spreadsheets will have an easier time with the later chapters after a thorough reading of Chapter 1. On the other hand, those with more experience with spreadsheets and statistics should be able to skim chapters for the information they desire.
Jump in and start analyzing!
It is my hope and belief that a beginner can grasp many of the most important concepts required for basic statistical analysis. One of the best ways to begin this understanding is to jump right in and make some graphs and do some statistical tests. This experience can be a crucial step towards clear thinking about the interpretation of quantitative data.
A Few Critical Concepts
Although this manual is designed for the reader with little or no formal training in statistics, it is impossible to discuss basic statistical analysis without first covering some critical concepts. Several are discussed briefly below and reviewed and expanded periodically throughout the manual. In addition, the glossary provides concise definitions for many of the terms presented here and in the following chapters.
Any collection of quantitative data (group of numbers) can be summarized based on location and spread.
Location of Data
– Measures of location summarize where most of the data are found. The mean (= average or arithmetic mean), mode and median are all measures of location. For example, if you measure the height of 100 sugar maple trees, calculating a mean height of 30 m would tell you something about the location of the data. This manual will focus entirely on the mean as a measure of location. However, the mode and median are defined in the glossary, and tests incorporating these measures are mentioned at the end of Chapter 1.
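As a quick sketch of these three measures, here is how they could be computed with Python's built-in statistics module. The tree heights are invented for illustration and are not from the manual's data sets.

```python
import statistics

# Hypothetical heights (m) of seven sugar maple trees
heights = [24, 28, 30, 30, 31, 33, 34]

mean_height = statistics.mean(heights)      # arithmetic mean: sum / count
median_height = statistics.median(heights)  # middle value when sorted
mode_height = statistics.mode(heights)      # most frequently occurring value

print(mean_height, median_height, mode_height)  # → 30 30 30
```

For this small data set all three measures happen to agree at 30 m; with skewed data they would typically differ.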
Spread of Data
– Measures of spread summarize how variable the data are. Are all the measured sugar maples close to 30m in height, or are some considerably shorter and some considerably taller than 30m? Methods for quantifying spread include comparing maximum values and minimum values and measuring range, standard deviation and variance. As we will see in Chapter 1, summarizing the spread of the data is just as important as summarizing the location.
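The common measures of spread named above can likewise be sketched with the statistics module (again using invented heights, purely for illustration):

```python
import statistics

# Hypothetical heights (m) of seven sugar maple trees
heights = [24, 28, 30, 30, 31, 33, 34]

data_range = max(heights) - min(heights)  # range: tallest minus shortest
variance = statistics.variance(heights)   # sample variance
std_dev = statistics.stdev(heights)       # sample standard deviation = sqrt(variance)

print(data_range, variance, round(std_dev, 2))
```

Two data sets can share the same mean of 30 m yet have very different spreads; these numbers are what distinguish them.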
Pattern, Dependent Variables and Independent Variables
– Often the goal of a scientific study is to explain what factors cause a particular pattern in the natural world. For example, if you observe that sunfish grow larger in some lakes than in other lakes, you may want to understand why. You could propose and test hypotheses that focus on different factors that might affect the size of sunfish. For example, one hypothesis could be that sunfish grow larger in lakes with greater food availability.
In this hypothesis, there are two variables, fish size and food availability. Fish size is the dependent variable; it is the variable you wish to explain. Food availability is the independent variable that you hypothesize is affecting fish size. In general, the dependent variable (also called the response variable) is the variable that is affected by the independent variable, or responds to or depends on the independent variable. The independent variable (also called the predictor variable) is the factor that you hypothesize is causing the change in the dependent variable.
If you analyze your data and find that the size of sunfish is indeed associated with food availability, you have identified a pattern in the natural world! This is an exciting and important first step in the scientific process. However, it would require further study to explain the cause of the pattern. For example, maybe food availability does not determine the size of sunfish. Perhaps there is a predator present in some lakes that eats large sunfish and the invertebrates that sunfish feed on. In this case, the presence or absence of the predator is causing the pattern.
Categorical vs. Continuous Variables
In general, both independent and dependent variables can be classified into two types, continuous and categorical. Continuous variables are measured numerically and can have a wide range of values. For example, tree height can be assigned a numerical value in meters and theoretically can vary from 0 all the way up to the maximum height that trees can grow.
Categorical variables are typically assigned to one of a few limited values or categories. For example, fish may come from one of several populations from different lakes such as Dublin Lake, Lake Nubanusit, or Lake Winnipesaukee. In this case, fish population is a categorical variable and the three lakes are the possible categories for this variable.
Before analyzing your data, you are going to have to choose what type of statistical test is suitable. Knowing whether your variables are categorical or continuous is an important part of this decision. If the independent variable is categorical and the dependent variable is continuous, a t-Test or ANOVA may be most appropriate (Chapters 1 and 2). If your independent variable and dependent variable are both continuous, a regression analysis may be most appropriate (Chapter 3). If your independent variable is continuous but your dependent variable is categorical, a logistic regression may be most appropriate (not described in this manual – see Gotelli and Ellison, 2004). If both variables are categorical, some type of Chi-Square analysis may be most appropriate (Chapter 4). Appendices II and III provide more information on choosing the correct test for your data.
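The decision rules above can be condensed into a short sketch. The function name and category labels here are my own, chosen for illustration; they are not terminology from the manual.

```python
def suggest_test(independent: str, dependent: str) -> str:
    """Suggest a test family given variable types: 'categorical' or 'continuous'."""
    if independent == "categorical" and dependent == "continuous":
        return "t-Test or ANOVA"        # Chapters 1 and 2
    if independent == "continuous" and dependent == "continuous":
        return "regression"             # Chapter 3
    if independent == "continuous" and dependent == "categorical":
        return "logistic regression"    # not covered in this manual
    return "Chi-Square analysis"        # Chapter 4

print(suggest_test("categorical", "continuous"))  # → t-Test or ANOVA
```

For example, comparing fish size (continuous) across three lakes (categorical) falls into the first branch.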
Random Chance
– Random chance influences any data set that scientists work with. Let's use the height of maple trees to focus on two specific examples of how random chance can affect data.
1. Imagine a windstorm knocks a branch off a nearby oak tree, destroying the top of one of our maple trees. This event reduces the height of the maple tree and can be considered an example of random chance (unless we are specifically interested in the effects of damage by neighboring trees on the height of sugar maples).
2. Measurement error is also considered the result of random chance (unless the focus of our study is to quantify measurement error). For example, if two field workers estimate the height of a sugar maple tree, their estimates may not be exactly the same; this discrepancy introduces random chance into the data set.
It is important to remember that random chance affects every data set – we attempt to minimize its effect with good experimental or sampling design, but there is no way to entirely eliminate it. A key question in data analysis is "how much has random chance influenced my data?"
It is possible for random chance to make it look like there are meaningful patterns in a data set. For example, whenever we calculate the means for two groups of numbers (such as the height of maple trees on south-facing vs. north-facing slopes), the means will rarely be exactly the same. Imagine that we calculate a mean height of 30.1 m for randomly selected trees on south slopes and 29.9 m for trees on north slopes. On average, are trees on south slopes really taller than on north slopes, or has random chance caused the difference? (As we will see in Chapter 1, measuring variation in the data can resolve this question.) If we go out to the field again and re-sample the two tree populations, will we get exactly the same means? Probably not. Maybe we would calculate a greater mean for trees on the north slope than the south slope the second time we collect data. In this case, the differences we observe between the means are most likely due to random chance rather than the slope that trees are growing on.
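A small simulation makes this concrete. Here both "slopes" are sampled from the very same population (mean 30 m, standard deviation 3 m, values invented for illustration), so any difference between the two sample means is caused by random chance alone:

```python
import random

random.seed(1)  # fixed seed so the example is reproducible

# Twenty tree heights per slope, drawn from the SAME normal distribution
south = [random.gauss(30, 3) for _ in range(20)]
north = [random.gauss(30, 3) for _ in range(20)]

mean_south = sum(south) / len(south)
mean_north = sum(north) / len(north)
print(round(mean_south, 2), round(mean_north, 2))  # the two means will differ
```

Re-running with a different seed (a fresh "trip to the field") gives yet another pair of unequal means, even though slope has no real effect here.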
P-values
– Because random chance influences every data set, it is important to quantify its effects by using statistics to find p-values. The purpose of a p-value is to estimate how likely it is that randomness is causing a pattern in our data set. A p-value measures the probability that the pattern we are interested in (such as the difference between means) is the result of random chance. A high p-value reflects a high probability that random chance is causing the pattern. In the sugar maple example described in the previous section, a high p-value would indicate that any difference in the mean height of trees on south vs. north-facing slopes is simply the result of random chance and therefore is not meaningful.
On the other hand, a low p-value reflects a low probability that the pattern is the result of random chance. Most likely something other than random chance is causing the pattern! If we are comparing mean values, then we would say that the difference between the means is "statistically significant." In the sugar maple example, something associated with south facing slopes may be causing trees to grow taller than on north slopes.
P-values range from 0 to 1. The cutoff value for p most often used in the scientific literature is 0.05. If p ≤ 0.05, we say that the pattern is statistically significant. In other words, if the probability is less than or equal to 0.05 that random chance is causing the pattern, we consider it likely that something other than random chance is causing the pattern. Yet another way of making this statement: when the probability is 5% or less that the pattern is caused by random chance, then something else is probably the cause. It is important to point out that the 0.05 cutoff is simply a number agreed upon by the scientific community. It is still possible that random chance is causing the pattern in the data when p ≤ 0.05; it's just not very likely (see Appendix I on Type I & II Errors). Overall, the lower the p-value, the less likely that random chance accounts for the pattern and the more likely some other factor is causing the pattern.
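One way to build intuition for what a p-value measures is a permutation simulation: pool the data, shuffle it many times, and ask how often random chance alone produces a difference between means at least as large as the one observed. This sketch uses invented heights and plain Python rather than the Excel tools the manual teaches.

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

# Hypothetical tree heights (m) on south- and north-facing slopes
south = [31.2, 29.8, 32.1, 30.5, 31.7]
north = [29.1, 30.2, 28.8, 29.9, 28.5]

observed = abs(sum(south) / len(south) - sum(north) / len(north))

# Shuffle the pooled heights repeatedly; each shuffle assigns the same
# numbers to two random "slopes" with no real slope effect at all.
pooled = south + north
count = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:5]) / 5 - sum(pooled[5:]) / 5)
    if diff >= observed:
        count += 1

p_value = count / trials  # fraction of shuffles matching the observed pattern
print(round(p_value, 3))
```

Here the estimated p-value comes out well below 0.05: random chance rarely produces a difference this large, so the height difference between slopes would be called statistically significant.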
Parametric Analyses
All of the statistical analyses described in this manual are examples of parametric analyses. Parametric analyses rely on the assumption that the data being tested were sampled from a specified distribution (often the normal distribution – see Appendix III for more on parametric tests and the normal distribution). In addition, like most statistical tests, parametric analyses rely on the assumption of independence. Independence means that the outcome of one observation is not affected by the outcome of another. When you are collecting data, this means that each data point must be independent of the other data points in your study (see Gotelli and Ellison, 2004 for more on independence).
Although this manual does not include a thorough discussion of the assumptions required by particular parametric tests, there is a brief section at the end of each chapter describing when each test is appropriate. In addition, alternative non-parametric analyses are listed. It is the goal of this manual to help you start interpreting your data, a goal that can often be met with parametric analyses even if an assumption is violated. However, be sure to interpret your results with caution! As stated elsewhere, it is critical to consult a statistics text (see references cited) and/or statistician before presenting your results to a professional audience or journal.
Chapters and Appendices
Depending on your previous training in statistics, it may be helpful to refer back to this introduction while reading the following chapters. Also, some of the critical concepts introduced here are repeated or expanded in chapter sections and in the appendices. If the concepts are not entirely clear after your first reading of a certain definition or description, don't worry. You may need to revisit some of these concepts several times before you fully grasp them; but if you want to understand quantitative data, it is worth the effort – it will pay off!