Stats Trek IV

Written by Jessica Fry, PhD

This is the fourth installment of our Stats Trek series, where we talk about all things data! In the current series, we systematically dissect a paper from the scientific literature and discuss some of the things that we should consider when reading primary literature (i.e., peer-reviewed scientific articles that present original data from an experiment). Like all worthwhile skills, this takes practice! If you are new to this series, I recommend that you begin with the second installment here, and continue with the third installment here.

I would be remiss if I did not point you to this excellent blog post on the process of reading scientific articles.

We left our paper by Sankey et al. (download it here) with the assignment to analyze Figure 2 in our paper. What is the graph telling you? What are the dependent, independent, and controlled variables? What are the results?


Figure 2. Latency to approach the experimenter and time spent near her (distance <0.5 m) in the “motionless person test” by food-rewarded and grooming-rewarded horses, before and after training.

Mann-Whitney U-test & Wilcoxon test *P<0.05, **P<0.01.

One of the key practices in understanding the scientific literature is interpreting data presented on a graph.  Best practice is to look at the graphs before reading the authors’ conclusions, so that you can determine the results for yourself from the data.  This graph presents two results from one experiment, the “motionless person test” described in the materials and methods. The horse enters the ring with the experimenter for 5 minutes, and the data collected are the latency to approach (how long it takes the horse to come within 0.5 meters of the human) and how long the horse remained within 0.5 meters of the human.

The first thing to consider is the hypothesis being tested.  The hypothesis is a falsifiable statement.  In this case, a potential hypothesis for this graph would be that food-rewarded horses will spend more time near a human and approach the human faster than grooming-rewarded horses.  Remember that the question being asked by the paper is “Do horses prefer food or grooming as their reward?” and this figure presents one set of experiments that tries to determine the answer.

The next thing we need to do is orient ourselves to the graph using independent and dependent variables.  Independent variables are the variables that are changed by the experimenter, in this case whether the animal was given a food reward or a grooming reward (FR or GR).  Independent variables are generally, but not always, depicted on the X-axis.  It is important to note that we are also measuring the same group of horses twice, once before the training and once after.  The dependent variables are the variables that we expect to change, in this case latency of approach and time spent near the human.  These are found on the Y-axis, measured in seconds (5 minutes = 300 seconds).

Now that we have oriented ourselves to the graph, the first thing we need to do is check the controls.  In this experiment, the experimenters wanted to ensure that their groups of horses began the experiment with similar affinity toward humans, so they conducted this test before any rewards were given.  The first column of each group shows the results prior to training.  The column height is the mean, or average, of all of the results for the group.  The bars above and below each column represent the standard error (SE), which describes how precisely the group mean has been estimated: it is the spread of the individual horses’ results (the standard deviation) divided by the square root of the number of horses.  For example, in the latency graph, the food group had a mean ± SE of 235.6 ± 32.7 seconds, and the grooming group 202.8 ± 40.9 seconds. While their means are different, when you add in the variability of the grooming group (202.8 + 40.9 = 243.7 seconds), that number is well within the range for the food group (202.9 to 268.3 seconds), indicating that these columns represent similar responses in both groups. As a rough rule of thumb, if the error bars overlap, the groups are unlikely to be significantly different; the statistical test is what settles the question.
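To make the error bars concrete, here is a minimal Python sketch of how a mean and its standard error are commonly computed (standard deviation divided by the square root of the sample size). The latency values and the function name are hypothetical, made up for illustration; they are not data from Sankey et al.

```python
import math
import statistics

def mean_and_se(values):
    """Return the sample mean and the standard error of that mean."""
    mean = statistics.mean(values)
    # Sample standard deviation: the typical spread of individual values.
    sd = statistics.stdev(values)
    # Standard error: how much the mean itself is expected to vary, SD / sqrt(n).
    se = sd / math.sqrt(len(values))
    return mean, se

# Hypothetical latencies in seconds for eight horses (NOT the paper's data).
latencies = [300, 120, 250, 300, 90, 280, 300, 210]

mean, se = mean_and_se(latencies)
print(f"mean = {mean:.1f} s, SE = {se:.1f} s")
```

Because the standard error shrinks as the number of horses grows, a larger group gives a tighter error bar even when individual horses vary just as much.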

Our grooming and feeding reward groups began with similar responses to people in their training environment.  What happened after the horses were trained using the same techniques and the same amount of time, with only the method of reward differing? Considering the latency to approach, the average latency in the food-rewarded group decreases from a mean ± SE of 235.6 ± 32.7 s to 78.8 ± 37.7 s.  Crucially, you can see that the error bars do not overlap, which is your first indication that this result may be statistically significant.  The authors denote statistically significant results using asterisks, with * indicating p<.05 and ** indicating p<.01.
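The error-bar comparisons described so far can be checked with a few lines of Python. This sketch uses only the means and standard errors quoted in the text; the function names are my own, and an overlap check is a quick reading aid, not a replacement for the statistical tests the authors ran.

```python
def error_bar_range(mean, se):
    """Return the (low, high) span covered by mean ± SE."""
    return mean - se, mean + se

def bars_overlap(a, b):
    """True if two (low, high) ranges share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

# Latency means and standard errors quoted in the text, in seconds.
food_before = error_bar_range(235.6, 32.7)
grooming_before = error_bar_range(202.8, 40.9)
food_after = error_bar_range(78.8, 37.7)

# Food vs. grooming group before training (the control comparison).
print("groups before training overlap:", bars_overlap(food_before, grooming_before))
# Food group before vs. after training.
print("food before/after overlap:", bars_overlap(food_before, food_after))
```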

Statistical significance testing is a mathematical method used by experimenters to determine how likely a result like theirs would be to happen by chance alone.  To do so, experimenters try to “reject the null hypothesis.”  In our latency to approach example, we test the hypothesis that horses receiving food rewards will approach within 0.5m of a person more quickly than horses that have not received rewards (column 1 vs. column 2).  The null hypothesis would then be that horses that receive food rewards approach with the same latency as the same horses prior to receiving their reward.  Before running their experiment, researchers look at their experimental design and choose the statistical tests that make the most sense given their groups.  There are tests for independent groups (such as the Mann-Whitney U-test) as well as tests that consider the same measurement taken in the same test subject over time (such as the Wilcoxon test).  In our paper, the Mann-Whitney and Wilcoxon tests were chosen (for more on test parameters, look here).
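For readers who want to see what a paired before/after test does under the hood, here is a minimal sketch of an exact two-sided Wilcoxon signed-rank test written in plain Python. It assumes a small sample (it enumerates every possible pattern of plus and minus signs), it drops zero differences, and the latency numbers are hypothetical, not the paper's data. In practice you would more likely reach for a statistics library such as SciPy.

```python
from itertools import product

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def wilcoxon_signed_rank_exact(before, after):
    """Two-sided exact p-value for paired measurements."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2  # mean of the positive-rank sum under the null
    # Under the null hypothesis, each difference is equally likely to be + or -.
    extreme = 0
    total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        total += 1
        if abs(w_plus - expected) >= abs(observed - expected):
            extreme += 1
    return extreme / total

# Hypothetical latencies in seconds for the same eight horses (NOT the paper's data).
before = [300, 250, 280, 300, 190, 260, 300, 220]
after = [60, 40, 150, 90, 30, 200, 45, 10]

p_value = wilcoxon_signed_rank_exact(before, after)
print(f"p = {p_value:.4f}")
```

In this made-up example every horse approached faster after training, which is the most extreme pattern possible for eight horses, so the p-value is small.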

The result of conducting the statistical test is a p-value.  The p-value represents the probability of seeing a result at least as extreme as the one observed, if the null hypothesis is true.  Generally, a p-value of <0.05 (less than 5%) is considered significant, and many scientific studies set their significance threshold at 0.05.  A p-value of 0.05 says that, if the food reward really had no effect on latency, there would still be a 5% chance of seeing results like ours (which look like there is a change in the horses’ latency before and after food reward).  It is important to remember that a significant result can still be a false alarm: even when the null hypothesis is true, about 5 in 100 experiments will cross the 0.05 threshold by chance (or 1 in 100, for a threshold of 0.01)!  Note that the p-value is not the probability that the null hypothesis is true.
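One way to build intuition for the 0.05 threshold is to simulate many experiments in which there is truly no effect and count how often the test still comes out “significant.” The sketch below is an illustration only: it swaps in a simple z-test on simulated bell-curve data rather than the tests used in the paper, and the sample size, number of experiments, and random seed are arbitrary choices of mine.

```python
import math
import random

def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for the null hypothesis that the true mean is 0."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    # Probability of a standard normal value at least this far from zero.
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(n_experiments=10_000, n_per_experiment=20,
                        alpha=0.05, seed=1):
    """Fraction of no-effect experiments that still reach p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # The null hypothesis is true by construction: the real mean is 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_experiment)]
        if z_test_p_value(sample) < alpha:
            hits += 1
    return hits / n_experiments

rate = false_positive_rate()
print(f"Fraction of no-effect experiments with p < 0.05: {rate:.3f}")
```

The printed fraction should land close to 0.05, which is exactly what the threshold promises: about one false alarm in twenty when nothing is going on.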

Given the complexity of measuring behavior, and the pressures on scientists to produce significant results to further their careers, there are many ways for statistics to be abused.  P-hacking is a significant issue in modern science (link), and comprises a few different ethical misdemeanors.  One of the most prevalent is measuring ten effects and only choosing to report on those that show a statistically significant result.  Would we draw the same conclusions from the data if the experimenters measured five other indicators of the human-horse bond and found no effect?  This illustrates the importance of replication and meta-analysis in scientific research.  Replication is simply repeating the same experiment using the same methods in another group.  Recent studies show that many statistically significant findings in psychology cannot be replicated in another research environment. Meta-analysis refers to analyzing all the available data on a topic as a whole.  If five studies have been done on the human-horse bond formed through food or grooming, what does analyzing these together tell us?
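The “measure ten effects, report the significant ones” problem can be put into numbers. This short sketch assumes the measures are independent and that none of them has a real effect; under those assumptions, the chance of at least one false alarm is 1 minus the chance that every test stays quiet. The function name is my own.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent no-effect tests has p < alpha."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 10):
    print(f"{n:>2} measures: {chance_of_false_positive(n):.0%} chance of a false alarm")
```

With ten measures, the chance of at least one spurious “significant” result is roughly 40%, which is why reporting only the winners is so misleading.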

An additional misdemeanor is post-experiment “selection” of data.  Ideally, the criteria for outliers would be decided before the experiment begins.  In our paper, we could imagine that a data point might be dropped for a food-reward group horse that escaped their stall and helped themselves to the treats before their daily training session.  That data would be dropped even if the horse performed in a way that corresponded to the hypothesis.

Additional issues of this type include adding more experimental subjects after the experiment has begun, to increase the sample size until significance is achieved. Take, for example, the flip of a standard, fair, two-sided coin.  It is well established that the chance of getting heads is 50%.  If my first flip is heads, what is the chance that my second flip will also yield heads?  50%.  My first five flips may yield all heads!  It is only after a large enough sample size that we can see that half of the flips yield heads and half tails. If we test a batch of 100 coins for fairness, it would take five flips (.5*.5*.5*.5*.5 = .03125) for the chance of a fair coin landing all heads to be less than 5% (in this case, about 3.1%).  In a group of unfair coins that flip heads 52% of the time, five heads in five flips will happen 3.8% of the time!  So in our sample of 100 in each group, between three and four of both the fair and unfair coins will flip heads five times out of five.  Five flips doesn’t seem enough to be certain, so we continue flipping all of our coins, 100 times each, and find that one group flips heads 52% of the time and the other group flips heads 50% of the time, and this result is statistically significant.  But does it matter for the function of the coin flip to determine who goes first in the Super Bowl?
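The coin arithmetic above is easy to verify. The sketch below computes the chance of five heads in a row for each kind of coin, then runs a pooled two-proportion z-test on the full experiment. That test is my own choice for illustration, and the sketch assumes each group of 100 coins flipped 100 times lands on exactly 52% and 50% heads (5,200 and 5,000 heads out of 10,000 flips).

```python
import math

def all_heads(p_heads, flips):
    """Chance that every one of `flips` independent flips lands heads."""
    return p_heads ** flips

def two_proportion_p_value(heads_a, n_a, heads_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a, p_b = heads_a / n_a, heads_b / n_b
    pooled = (heads_a + heads_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

print(f"fair coin, five heads in a row: {all_heads(0.50, 5):.3%}")
print(f"52% coin, five heads in a row:  {all_heads(0.52, 5):.3%}")

# 100 coins x 100 flips = 10,000 flips per group (assumed exact proportions).
p_value = two_proportion_p_value(5200, 10_000, 5000, 10_000)
print(f"52% vs. 50% heads over 10,000 flips each: p = {p_value:.4f}")
```

A two-point difference in heads is far too small to notice in five flips but becomes statistically detectable with thousands of them, even though it changes almost nothing about a single coin toss.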

One of the most important takeaways from a discussion on statistics is that statistical significance does not equal functional significance!  If 1,000 people take a drug to treat their high blood pressure, and that drug reduces all 1,000 patients’ systolic blood pressure by 5 mmHg, that represents a statistically significant result.  However, if these patients are still hypertensive, this result is not functionally significant!  Many medical journals are now requiring that treatment studies cite the number needed to treat and the number needed to harm rather than statistical significance alone.  If a drug needs to be given to 1,000 people to help one, but will cause harm to 100 out of 1,000, then that drug may not be suitable for the clinic.
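Number needed to treat (NNT) and number needed to harm (NNH) are both the reciprocal of a rate. Here is a minimal sketch using the figures from the example above; it assumes the 1-in-1,000 benefit and 100-in-1,000 harm are extra events caused by the drug compared with no treatment, and the function name is hypothetical.

```python
def number_needed(extra_events, people_treated):
    """People who must be treated for one extra event (benefit or harm) to occur."""
    if extra_events <= 0:
        raise ValueError("there must be at least one extra event")
    return people_treated / extra_events

nnt = number_needed(extra_events=1, people_treated=1000)    # one person helped
nnh = number_needed(extra_events=100, people_treated=1000)  # 100 people harmed

print(f"number needed to treat: {nnt:.0f}")
print(f"number needed to harm:  {nnh:.0f}")
```

A drug whose number needed to harm is far smaller than its number needed to treat hurts many more people than it helps, no matter how small its p-value.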

For your final exercise in this paper, read the discussion/conclusion section.  Do the authors agree with your assessment of their data?  Do they overinterpret?  When you put your contrarian hat on, do the conclusions seem farfetched?  Is there data you would like to see?  I personally would love to have the data plotted for each horse, so we could assess potential individual preferences for food or petting that disappear in the group data.  As we trainers are wont to say, the value of a reinforcer is determined by the learner!


Jessica L. Fry PhD, KPA-CTP, is an assistant professor of biology at Curry College in Milton, Mass.  She is the current president of the New England Dog Training Club, and has one more cat than there are laps in her house.