APPENDIX VII: Finding and Displaying Hidden Data on Scatterplots
(back to table of contents; back to stats page)
Hidden data occur in scatterplots when two or more data points have the same values for the x and y coordinates; though there are several data points with the same values, the scatterplot only shows one data point on the graph. As we will see below, the presence of hidden data can greatly influence the interpretation of a scatterplot. Therefore, hidden data points need to be identified and displayed.
Note: If you're having problems with hidden data and you're making a
scatterplot with an x-axis representing categorical data, you may be better off simply displaying means and error bars (rather than the raw data) as described in
Appendix VI. However, if your x-axis represents a continous variable (such as in regression analysis), read on.
Below is a hypothetical data set for fish sampled in lakes with different vegetation complexity (data are different from the example in Chapter 3; data for this example can be found
here). Figure VIIa is a scatterplot of the data that does not display hidden data points.
Figure VIIa. The number of fish species plotted against vegetation index. Hidden data are not displayed. Notice that there are only nine data points on the graph even though sixteen lakes were sampled.
In the figure above it appears that there’s no relationship between the number of fish species and vegetation complexity. There is no indication that lakes with a higher vegetation index have consistently more or fewer fish species than lakes with a lower vegetation index. However, this graph is misleading because it doesn’t display hidden data. In fact, regression analysis reveals that there's a significant positive relationship between the two variables (p=0.03, R
2=0.29). How can this be? Read on.
It’s important to always check your scatterplot against your raw data for evidence of hidden data. The first (and often easiest) way to identify the presence of hidden data is to count the number of points on the scatterplot. In Figure VIIa there are only nine data points even though sixteen lakes were sampled. Clearly there must be data that are hidden.
Finding Hidden Data
-
To find all hidden data on a scatterplot, you can use the sort command on the on the Data tab in your Excel spreadsheet. In fact, I recommend "double-sorting" the data.
-
First highlight your two columns of data, then go to the Data tab and click the Sort option.
-
In the Sort by window, select Sort by veg. index (or whatever your first variable is). Be sure Order Smallest to Largest is selected.
- To "double-sort," you then need to click the Add Level button, select Sort by # fish species (or whatever your second variable is). Again, be sure Order Smallest to Largest is selected. Click OK.
-
Your data should look like those in the table below.
-
It should now be easy to see data that share identical x and y coordinates. It turns out that many of the data points share the same values for veg. index and # fish species, such as the three lakes that had a veg. index of 4.0 and 2 fish species.
-
All of these data points should be represented on your scatterplot. The easiest way to do this is by labeling data points on the graph with a number in parentheses as shown in Figure VIIb. Labeling can be done using text boxes (click on your graph, go to Chart Tools, Layout, Text Box).
These data have been "double-sorted." Note that when sorted, it's easy to identify and count data with identical x and y coordinates (identical data shown in bold with underlines).
Figure VIIb. Scatterplot showing the same data that are graphed in Figure VIIa except that hidden data are shown by numbers in parentheses. Now the scatterplot shows a positive relationship between number of fish species and vegetation complexity as also indicated by regression analysis (p=0.03, R2=0.29).
Clearly hidden data can affect your interpretation of your graphs! Be sure to check for this problem - fortunately, in many data sets it's not an issue.