
I believe that students can understand many of the key concepts in statistics without learning a lot of statistical theory. I also think that it is possible to successfully perform and interpret statistical tests while not knowing the formulas behind those tests. As my friend Michael Lehner says, a carpenter doesn't need to know the details of how a hammer is made to successfully bang a nail.

In writing this manual, I have attempted to make the interpretation of quantitative data as simple as possible without glossing over or over-simplifying the most important concepts. Students may be dismayed to find that even in its simplest form, statistics is a tricky topic with a certain amount of technical language that cannot be eliminated. Statisticians may be dismayed to find that equations are avoided as much as possible, discussion of null hypotheses and data transformations is left for the appendices, and discussion of the assumptions of particular tests is minimized. Nonetheless, I believe that this manual is at an appropriate level for a beginner to start using statistics.

It is essential to stress that this manual is not intended as a substitute for a statistics text. Anyone who wants to be sure they have interpreted their data correctly, particularly if they intend to present their analyses to a scientific audience or to publish their work in a professional research journal, should consult a statistics text (see references cited) and/or a statistician.

Each chapter includes a realistic example of a research question or hypothesis along with a hypothetical data set. I purposely chose hypothetical data in order to keep the data sets small and to ensure that they illustrate important concepts. Each chapter also includes sections with detailed directions for using MS Excel to make graphs and perform statistical analyses. After these directions, guidelines for interpreting the graphs and statistical output are included.

It is worth noting that Chapter 1 is the longest and assumes the least knowledge of MS Excel or statistics. Though a reader does not have to read Chapter 1 in order to understand Chapters 2–4, a true beginner with spreadsheets will have an easier time with the later chapters after a thorough reading of Chapter 1. On the other hand, those with more experience with spreadsheets and statistics should be able to skim chapters for the information they desire.

It is my hope and belief that a beginner can grasp many of the most important concepts required for basic statistical analysis. One of the best ways to begin this understanding is to jump right in and make some graphs and do some statistical tests. This experience can be a crucial step towards clear thinking about the interpretation of quantitative data.


Although this manual is designed for the reader with little or no formal training in statistics, it is impossible to discuss basic statistical analysis without first covering some critical concepts. Several are discussed briefly below and reviewed and expanded periodically throughout the manual. In addition, the glossary provides concise definitions for many of the terms presented here and in the following chapters.

Any collection of quantitative data (group of numbers) can be summarized based on location and spread.
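These two summaries are easy to compute for any group of numbers. The manual itself uses MS Excel for this, but as a minimal illustration (using Python's standard library and a made-up set of tree heights):

```python
import statistics

# Hypothetical sample: heights (m) of ten randomly selected trees
heights = [28.1, 30.4, 29.9, 31.2, 27.8, 30.0, 29.5, 30.8, 28.9, 30.3]

# Location: where the data are centered
mean_height = statistics.mean(heights)
median_height = statistics.median(heights)

# Spread: how widely the data vary around the center
sd_height = statistics.stdev(heights)        # sample standard deviation
height_range = max(heights) - min(heights)   # simplest measure of spread

print(f"mean = {mean_height:.2f} m, median = {median_height:.2f} m")
print(f"standard deviation = {sd_height:.2f} m, range = {height_range:.2f} m")
```

Two data sets can share the same location (mean) yet have very different spreads, which is why both summaries are needed.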

Imagine you hypothesize that sunfish grow larger in lakes where more food is available. In this hypothesis, there are two variables: fish size and food availability. Fish size is the dependent variable; it is the variable you wish to explain. Food availability is the independent variable that you hypothesize is affecting fish size. In general, the dependent variable (also called the response variable) is the variable that is affected by, responds to, or depends on the independent variable. The independent variable (also called the predictor variable) is the factor that you hypothesize is causing the change in the dependent variable.

If you analyze your data and find that the size of sunfish is indeed associated with food availability, you have identified a pattern in the natural world! This is an exciting and important first step in the scientific process. However, it would require further study to explain the cause of the pattern. For example, maybe food availability does not determine the size of sunfish. Perhaps there is a predator present in some lakes that eats large sunfish and the invertebrates that sunfish feed on. In this case, the presence or absence of the predator is causing the pattern.

In general, both independent and dependent variables can be classified into two types, continuous and categorical. Continuous variables are measured numerically and can have a wide range of values. For example, tree height can be assigned a numerical value in meters and theoretically can vary from 0 all the way up to the maximum height that trees can grow.

Categorical variables are typically assigned to one of a few limited values or categories. For example, fish may come from one of several populations from different lakes such as Dublin Lake, Lake Nubanusit, or Lake Winnipesaukee. In this case, fish population is a categorical variable and the three lakes are the possible categories for this variable.

Before analyzing your data, you will need to choose a suitable statistical test. Knowing whether your variables are categorical or continuous is an important part of this decision. If the independent variable is categorical and the dependent variable is continuous, a t-test or ANOVA may be most appropriate (Chapters 1 and 2). If your independent variable and dependent variable are both continuous, a regression analysis may be most appropriate (Chapter 3). If your independent variable is continuous but your dependent variable is categorical, a logistic regression may be most appropriate (not described in this manual – see Gotelli and Ellison, 2004). If both variables are categorical, some type of chi-square analysis may be most appropriate (Chapter 4). Appendices II and III provide more information on choosing the correct test for your data.
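The rule of thumb above can be written out as a small lookup. This sketch is only a restatement of the paragraph (the function name and strings are invented for illustration), and it is a starting point, not a substitute for checking each test's assumptions:

```python
def suggest_test(independent, dependent):
    """Suggest a statistical test from the types of the two variables.

    Each argument is either "categorical" or "continuous".
    Mirrors the rule of thumb in the text.
    """
    table = {
        ("categorical", "continuous"): "t-test or ANOVA",
        ("continuous", "continuous"): "regression",
        ("continuous", "categorical"): "logistic regression",
        ("categorical", "categorical"): "chi-square analysis",
    }
    return table[(independent, dependent)]

# Example: lake of origin (categorical) vs. fish size (continuous)
print(suggest_test("categorical", "continuous"))  # t-test or ANOVA
```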

It is important to remember that random chance affects every data set – we attempt to minimize its effect with good experimental or sampling design, but there is no way to entirely eliminate it. A key question in data analysis is "how much has random chance influenced my data?"

It is possible for random chance to make it look like there are meaningful patterns in a data set. For example, whenever we calculate the means for two groups of numbers (such as the height of maple trees on south-facing vs. north-facing slopes), the means will rarely be exactly the same. Imagine that we calculate a mean height of 30.1 m for randomly selected trees on south slopes and 29.9 m for trees on north slopes. On average, are trees on south slopes really taller than on north slopes, or has random chance caused the difference? (As we will see in Chapter 1, measuring variation in the data can resolve this question.) If we go out to the field again and re-sample the two tree populations, will we get exactly the same means? Probably not. Maybe we would calculate a greater mean for trees on the north slope than the south slope the second time we collect data. In this case, the differences we observe between the means are most likely due to random chance rather than the slope that trees are growing on.
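This kind of sampling variation is easy to see by simulation. In the sketch below (a pure illustration using Python's standard library), both "slopes" are drawn from the *same* population, yet the two sample means differ on every trial:

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable

# Suppose trees on both slopes really come from the SAME population:
# mean height 30 m, standard deviation 2 m.
def sample_mean(n=10):
    """Mean height of n trees drawn at random from the population."""
    return sum(random.gauss(30, 2) for _ in range(n)) / n

# Draw ten "south slope" and ten "north slope" trees several times.
for trial in range(3):
    south, north = sample_mean(), sample_mean()
    print(f"trial {trial + 1}: south = {south:.2f} m, north = {north:.2f} m")

# The two means differ on every trial, and which group looks "taller"
# can flip from one trial to the next, even though no real difference
# exists. That is random chance at work.
```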

A p-value estimates the probability that random chance alone could produce a pattern as strong as the one observed. A high p-value means the pattern may well be the result of random chance. On the other hand, a low p-value reflects a low probability that the pattern is the result of random chance. Most likely something other than random chance is causing the pattern! If we are comparing mean values, then we would say that the difference between the means is "statistically significant." In the sugar maple example, something associated with south-facing slopes may be causing trees to grow taller than on north slopes.

P-values range from 0 to 1. The cutoff value for p most often used in the scientific literature is 0.05. If p is less than or equal to 0.05, the result is considered statistically significant: the pattern is unlikely to be the result of random chance alone. If p is greater than 0.05, random chance cannot reasonably be ruled out as the explanation for the pattern.

All of the statistical analyses described in this manual are examples of parametric analyses. Parametric analyses rely on the assumption that the data being tested were sampled from a specified distribution (often the normal distribution – see Appendix III for more on parametric tests and the normal distribution). In addition, like most statistical tests, parametric analyses rely on the assumption of independence. Independence means that the outcome of one observation is not affected by the outcome of another. When you are collecting data, this means that each data point must be independent of the other data points in your study (see Gotelli and Ellison, 2004 for more on independence).

Although this manual does not include a thorough discussion of the assumptions required by particular parametric tests, there is a brief section at the end of each chapter describing when each test is appropriate. In addition, alternative non-parametric analyses are listed. It is the goal of this manual to help you start interpreting your data, a goal that can often be met with parametric analyses even if an assumption is violated. However, be sure to interpret your results with caution! As stated elsewhere, it is critical to consult a statistics text (see references cited) and/or statistician before presenting your results to a professional audience or journal.

Depending on your previous training in statistics, it may be helpful to refer back to this introduction while reading the following chapters. Also, some of the critical concepts introduced here are repeated or expanded in chapter sections and in the appendices. If the concepts are not entirely clear after your first reading of a certain definition or description, don't worry. You may need to revisit some of these concepts several times before you fully grasp them; but if you want to understand quantitative data, it is worth the effort – it will pay off!