Stats Trek III: What is Normal, Anyway?

Written by Jessica Fry

This is the third installment of our Stats Trek series, where we talk about all things data! In the current series, we systematically dissect a paper from the scientific literature and discuss some of the things that we should consider when reading primary literature (i.e., peer-reviewed scientific articles that present original data from an experiment). Like all worthwhile skills, this takes practice! If you are new to this series, I recommend that you begin with the second installment here.

We will rejoin our paper by Sankey et al. (download it here) in the Materials & Methods section. The Materials & Methods are intended to allow a fellow scientist to understand how an experiment was done to the point of being able to replicate that experiment. We discussed the choice of certain experimental groups in our last installment, so now we will move on to experimental design.

Experiments have one simple goal: to reject or accept a hypothesis. A hypothesis is generally a falsifiable statement that can be turned into an IF/THEN statement: IF [HYPOTHESIS], THEN [EXPERIMENTAL RESULT]. This paper starts from the observation that grooming lowers the heart rate of horses and asks: Can grooming be used to promote bonding and facilitate learning, and how does it compare to a food reward? Strictly speaking, this question is not the hypothesis. Open your paper and examine Figure 1 and the accompanying legend. An acceptable hypothesis for the experiment presented in Figure 1 would be “Horses learn as well when presented with a grooming reward as with a food reward.” This is a falsifiable statement. Our IF/THEN statement: IF horses learn as well with a grooming reward as with a food reward, THEN they will [EXPERIMENTAL RESULT]. So how do we set up the experiment to test this?

Setting up the variables

To paraphrase the Materials & Methods: 20 Konik horses, reared either under conventional domestic conditions or with their families in semi-natural conditions, were mixed together into multi-age groups after 10 months. Grooming the withers is a socio-positive behavior for these horses. Horses aged 1 to 2 years were housed identically, with identical exposure to humans, and were randomly allocated into two groups.

All of this set-up is to introduce us to the controlled variables in the experiment. Controlled variables are those that are identified by the scientist as things that must be monitored and kept as constant as possible between groups. For example, if our food-reward group was handled by people more often than our grooming-reward group, that might skew our results.

We’ve discussed the importance of group choice already, so let’s look at our other variables.

Independent variables are the variables that the scientist changes on purpose. In this experiment, we are changing the type of reward, but we are controlling the human interaction, the number of times the horse is exposed to the command, how hungry the horses are, and what the horse experiences immediately after the training session. All of this allows us to measure our dependent variable: the outcome we collect data on, which here is how long each horse remains immobile.

When we’re interpreting the results, we need to consider how the dependent variable is being measured. The authors describe a precise method: each horse is trained in one five-minute session per day for six days. A verbal cue tells the horse to stand still, and the horse is rewarded for remaining immobile for a predetermined number of seconds. When the horse succeeds three times at one duration, the required time is increased. At the end of each session, the horse is released to the outdoor paddock. Figure 1 shows us the results of this study:

Figure 1

The black bars represent the food reward, and the gray bars represent the grooming reward. The top of each bar represents the group average (the mean, as reported in the Results section text). The thin lines extending from the bars are error bars; they represent the variability of the sample, here expressed as the standard error of the mean (SE).

Considering day 1, we see that the food-rewarded horses remained immobile for an average of 9 seconds, plus or minus 1 second (standard error), while the grooming-rewarded horses remained immobile for an average of 5 seconds, plus or minus 1.3 seconds. Inspecting the graph, we see that on subsequent days the food-reward group outperforms the grooming-reward group in the stillness task. We can also see that the grooming-reward group plateaus on training day 3, but the variability within the group increases. What does this mean in practice? Let’s talk about variability, and then some statistics.


Variability describes how far the data points are spread out from the average. Let’s take two sets of numbers and calculate their averages.

Set 1:

1, 2, 2, 3, 5, 5, 5, 5, 7, 8, 8, 9
Sum = 60, divided by the number of data points (12) = 5

Average = 5

Set 2:

1, 1, 1, 1, 2, 5, 5, 8, 9, 9, 9, 9
Sum = 60, divided by the number of data points (12) = 5

Average = 5

Which set has more variation? To find out, we calculate how far each data point is from the average. For Set 1, those distances are (5 − 1), (5 − 2), and so on (we can ignore the sign, because the next step squares each value anyway), giving

4, 3, 3, 2, 0, 0, 0, 0, 2, 3, 3, 4.

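Then we square each of those distances and take their average. The result is called the variance:

$$\text{variance} = \frac{4^2 + 3^2 + 3^2 + 2^2 + 0^2 + 0^2 + 0^2 + 0^2 + 2^2 + 3^2 + 3^2 + 4^2}{12} = \frac{76}{12} \approx 6.3$$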

If we repeat this procedure for the second set, we get a variance of about 12.2 (try it!).
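If you would like to check the “try it!” arithmetic by computer, the short Python sketch below repeats the steps by hand: find the mean, square each distance from it, and average the squares. `variance_by_hand` is a name made up for this example; `statistics.pvariance` comes from Python’s standard library and is used only as a cross-check.

```python
import statistics

# The two example data sets from the article
set_1 = [1, 2, 2, 3, 5, 5, 5, 5, 7, 8, 8, 9]
set_2 = [1, 1, 1, 1, 2, 5, 5, 8, 9, 9, 9, 9]


def variance_by_hand(values):
    """Average of the squared distances from the mean (divides by n)."""
    mean = sum(values) / len(values)
    squared_distances = [(x - mean) ** 2 for x in values]
    return sum(squared_distances) / len(values)


for name, data in [("Set 1", set_1), ("Set 2", set_2)]:
    v = variance_by_hand(data)
    # statistics.pvariance also divides by n, so the two should agree
    assert abs(v - statistics.pvariance(data)) < 1e-9
    print(f"{name}: variance = {v:.1f}")
# Set 1: variance = 6.3
# Set 2: variance = 12.2
```

This divides by the number of data points (n = 12), as in the text. Many tools, including Python’s `statistics.variance` and Excel’s VAR.S, divide by n − 1 instead when estimating a population’s variance from a sample, which would give about 6.9 and 13.3 for these two sets.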

So while the averages for these two sets of data are the same, the variance is higher in our second set of numbers. We can use the variance to calculate two important numbers: the standard deviation (SD), and the standard error of the mean (SE, which our authors chose).


Figure: a normal distribution (by Dan Kernler, own work, CC BY-SA 4.0)

Standard deviation is the square root of the variance, and, together with the mean, it gives us the mathematical definition of normal!

When we are looking at statistics, it’s important to consider whether our data follow a normal distribution (or whether we should even expect them to!). What is normal?

If we plot how often each value occurs, a normal distribution will have about 68 percent of its data points within one standard deviation of the mean. The SDs of our example data sets are about 2.5 for our first set and 3.49 for our second set.

So our example graphs look something like this, with Set 1 in blue and Set 2 in red:


Figure 2: Frequency of the values in our data sets

Set 1 looks similar to the normal distribution above, but how can we check? In a normal distribution, about 68 percent of the values fall within one SD of the mean, which for Set 1 means between 5 − 2.5 = 2.5 and 5 + 2.5 = 7.5.

Set 1 has six values between 2.5 and 7.5 (50 percent of its 12 values), whereas 68 percent of 12 values would be 8.16. So it’s close, but not quite a normal distribution.
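The SD values and the one-SD check can be reproduced for both sets with a few lines of Python. This sketch assumes the population SD (dividing by n), to match the arithmetic above; `share_within_one_sd` is a helper name made up for illustration, while `statistics.mean` and `statistics.pstdev` are standard-library functions.

```python
import statistics

# The two example data sets from the article
set_1 = [1, 2, 2, 3, 5, 5, 5, 5, 7, 8, 8, 9]
set_2 = [1, 1, 1, 1, 2, 5, 5, 8, 9, 9, 9, 9]


def share_within_one_sd(values):
    """Fraction of values between (mean - SD) and (mean + SD).

    Uses the population SD (divides by n), matching the article's arithmetic.
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    inside = [x for x in values if mean - sd <= x <= mean + sd]
    return len(inside) / len(values)


for name, data in [("Set 1", set_1), ("Set 2", set_2)]:
    sd = statistics.pstdev(data)  # square root of the variance
    share = share_within_one_sd(data)
    print(f"{name}: SD = {sd:.2f}, within one SD = {share:.0%}")
# Set 1: SD = 2.52, within one SD = 50%
# Set 2: SD = 3.49, within one SD = 33%
```

For comparison, a normal distribution would put about 68 percent of its values within one SD of the mean, so Set 2 departs from that pattern even more than Set 1 does.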

Standard error is calculated by dividing the standard deviation by the square root of the number of data points. It describes how precisely the group average estimates the true average of the population: roughly, how much the average would be expected to change if you repeated the measurement on a different random sample of horses from the same population. So in Figure 1, the increase in the SE (shown by the length of the error bars) for the grooming-reward group indicates that the horses in this group responded more variably.
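As a minimal sketch of the SE formula, here it is applied to the two example sets (we do not have the individual horse measurements, so these are not the authors’ numbers). It uses the population SD to stay consistent with the worked example; most statistics software instead computes SE from the sample SD (`statistics.stdev`, dividing by n − 1), which gives slightly larger values. `standard_error` is a hypothetical helper name.

```python
import math
import statistics

# The two example data sets from the article
set_1 = [1, 2, 2, 3, 5, 5, 5, 5, 7, 8, 8, 9]
set_2 = [1, 1, 1, 1, 2, 5, 5, 8, 9, 9, 9, 9]


def standard_error(values):
    """SE of the mean = SD / sqrt(n), using the population SD (divides by n)."""
    return statistics.pstdev(values) / math.sqrt(len(values))


for name, data in [("Set 1", set_1), ("Set 2", set_2)]:
    print(f"{name}: SE = {standard_error(data):.2f}")
# Set 1: SE = 0.73
# Set 2: SE = 1.01
```

Because n sits under a square root, measuring four times as many animals with the same SD would only cut the SE in half.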

So why do we care? There is a lot of variability when measuring biological phenomena. For example, the average body temperature for a human is 98.6°F. Because most people fall close to this normal temperature, we can say with confidence that 100°F is a notable deviation that can be used to make medical decisions. But what if we lived in a world where healthy human body temperature was more variable? I could have a daily temperature of 96°F, and another person 101.2°F. In that case, a nurse would have to ask “What is your healthy temperature?” and “What is your temperature now?” to determine whether you had a fever, and a fever might be defined as an increase of 2°F above your own normal temperature. In the more variable world, we need to understand whether a change we measure is caused by our intervention or simply reflects a highly variable phenomenon. To do this, we apply statistical tests to either accept or reject our hypothesis. The tests we choose depend on whether or not our distribution is normal.

Your assignment for next time: Analyze Figure 2 in Sankey et al. What is the graph telling you? What are the dependent, independent, and controlled variables? What are the results? Answers next time, along with a discussion of what scientists mean when they say something is significant!

For more on normal distribution, standard deviation, and standard error (using dog height as an example), see:


Jessica L. Fry Ph.D., KPA-CTP, is an Assistant Professor of Biology at Curry College in Milton, MA. She is the current President of the New England Dog Training Club, and has one more cat than there are laps in her house.