
#34 Reveal human progress, part 2

In the previous post (here), I put my Ask good questions mantra on temporary hold as I argued for another three-word exhortation that I hope will catch on with statistics teachers: Reveal human progress.  In this post I will merge these two themes by presenting questions for classroom use about data that reveal human progress.

The first three of these questions present data that reveal changes over time.  I think these questions are appropriate not only for introductory statistics but also for courses in quantitative reasoning and perhaps other mathematics courses.  The fourth question concerns probability, and the last two involve statistical inference.

As always, questions that I pose to my students appear in italics.


1. The following graph displays how life expectancy has changed in South Africa over the past few decades:

  • a) Describe how life expectancy has changed in South Africa over these years. 
  • b) In which of these three time periods did life expectancy change most quickly, and in which did it change most slowly: 1960 – 1990, 1990 – 2005, 2005 – 2016?
  • c) Explain what happened in South Africa in 1990 – 2005 that undid so much progress, and also explain what happened around 2005 to restart the positive trend.  (You need to use knowledge beyond what’s shown in the graph to answer this.  Feel free to use the internet.)

Question (a) is meant to be straightforward.  I expect students to comment on the gradual increase in life expectancy from 1960 – 1990, the sudden reversal into a dramatic decline from 1990 – 2005, and then another reversal with an even more rapid increase from 2005 – 2016.  A more thorough response would note that the life expectancy in 2005 had plunged to a level about equal to that of 1965, and that the life expectancy in 2016 had rebounded to exceed the previous high from around 1990.

Question (b) addresses rates of change.  I have in mind that students simply approximate these values from the graph.  Life expectancy increased from about 52 to 62 years between 1960 and 1990, an increase of about 10 life expectancy years over a 30-year period, or a rate of about 0.33 life expectancy years per year*.  From 1990 – 2005, life expectancy decreased by almost 10 years, for a rate of about 0.67 life expectancy years per year.  The years from 2005 – 2016 saw an increase in life expectancy of about 10 years, a rate of about 1 life expectancy year per year.  So, the quickest rate of change occurred in the most recent time period (2005 – 2016), and the slowest rate of change occurred in the most distant time period (1960 – 1990).

* Unfortunately, the units here (life expectancy years per year of time) are tricky for students to express clearly.  This can be one of the downsides of using real data in an interesting context.
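
The rate arithmetic is easy to script; here is a minimal sketch in Python, using the rough values read off the graph rather than exact World Bank figures:

```python
# Rough life expectancies read off the graph: (start, end) for each period
periods = {
    "1960-1990": (52, 62),
    "1990-2005": (62, 52),
    "2005-2016": (52, 62),
}
for label, (start, end) in periods.items():
    first, last = int(label[:4]), int(label[-4:])
    rate = (end - start) / (last - first)
    print(f"{label}: {rate:+.2f} life-expectancy years per year")
```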

It usually takes students a little while to think of the explanation in part (c), but some students eventually suggest the HIV/AIDS epidemic that devastated South Africa in the 1990s.  Fortunately, effective medication became more available, helping to produce the dramatic improvement that began around the year 2005.


2. The following graph adds to the previous one by including the life expectancy for Ghana, as well as South Africa, over these years:

  • a) Compare and contrast how life expectancy changed in these two countries over these years.
  • b) Which country had a larger percentage increase in life expectancy over these years?  Explain your answer without performing any calculations.
  • c) Suppose that you were to calculate the changes in life expectancy for each year by subtracting the previous year’s value.  Which country would have a larger mean of its yearly changes?  Which country would have a larger standard deviation of its yearly changes?  Explain your answers.

For part (a), I expect students to respond that Ghana did not experience the dramatic reversals that South Africa did.  More specifically, Ghana experienced only a slight decline from about 1995 – 2000, much less dramatic and briefer than South Africa’s precipitous drop from 1990 – 2005.  For full credit I also look for students to mention at least one other aspect, such as:

  • Ghana had a much lower life expectancy than South Africa in 1960 and had a very similar life expectancy in 2016.
  • Ghana’s increase in life expectancy since 2005 has been much more gradual than South Africa’s steep increase over this period.

The key to answering part (b) correctly is to realize that the two countries ended with approximately the same life expectancy, but Ghana began with a much smaller life expectancy, so the percentage increase is larger for Ghana than for South Africa.

Part (c) is not at all routine, requiring a lot of thought.  Because Ghana had a larger increase in life expectancy over this time period, Ghana would have a larger mean for the distribution of its yearly changes.  But South Africa had steeper increases and decreases than Ghana, so South Africa would have more variability (and therefore a larger standard deviation) in its distribution of yearly changes*.

* The means of the yearly changes turn out to be 0.302 years for Ghana and 0.188 years for South Africa.  The standard deviations of the yearly changes are 0.174 years for Ghana and 0.625 years for South Africa.
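
For teachers who want students to check these values with software, the computation is short; this sketch uses a made-up series of consecutive yearly values standing in for the actual World Bank data:

```python
import numpy as np

# Made-up life expectancies for consecutive years, standing in for the
# actual World Bank series for Ghana or South Africa
life_exp = np.array([46.0, 46.5, 47.2, 47.5, 48.1, 48.0, 48.9])

yearly_changes = np.diff(life_exp)     # each year's value minus the previous year's
print(yearly_changes.mean())           # larger overall gain -> larger mean
print(yearly_changes.std(ddof=1))      # steeper swings -> larger standard deviation
```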


3. Consider the following graph of energy-related carbon dioxide (CO2) emissions (in million metric tons) in the United States from 1990 – 2005:

  • a) Describe what the graph reveals.
  • b) Determine the least-squares line for predicting CO2 emissions from year.
  • c) Interpret the value of the slope coefficient.
  • d) Use the line to predict CO2 emissions for the year 2018.
  • e) The actual value for CO2 emissions in 2018 was 5269 million metric tons.  Calculate the percentage error of the prediction from the actual value.
  • f) Explain what went wrong, why the prediction did so poorly.

Students have little difficulty with part (a), as they note that CO2 emissions increased at a fairly steady rate from about 5000 million metric tons in 1990 to about 6000 million metric tons in 2005.  I intend for students to use technology to determine the least squares line in (b), the equation of which turns out to be: predicted CO2 emissions = -135,512 + 70.61 × year. 

To interpret the slope coefficient in part (c), students should respond that the predicted CO2 emissions increases by 70.61 million metric tons for each additional year.  Using this line to predict CO2 emissions for the year 2018 in part (d) gives: -135,512 + 70.61 × 2018 ≈ 6979 million metric tons.

This prediction is not very close to the actual value of CO2 emissions in 2018, as it over-predicts by more than 1700 million metric tons.  The percentage error for part (e) is: (6979 – 5269) / 5269 × 100% ≈ 32.5%. 
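
Parts (d) and (e) can be checked with a few lines of code, using the fitted line reported above:

```python
# Least-squares line reported above: predicted emissions = -135512 + 70.61 * year
intercept, slope = -135512, 70.61
predicted = intercept + slope * 2018          # part (d)
actual = 5269                                 # 2018 emissions, million metric tons

pct_error = (predicted - actual) / actual * 100   # part (e)
print(round(predicted), round(pct_error, 1))      # 6979, 32.5
```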

The explanation in part (f) is that we should have been cautious about extrapolation.  By using the least squares line to make a prediction thirteen years into the future, we assumed that the linear increasing trend would continue in the years following 2005.  We did not have a good justification for making this assumption.

In fact, a graph of the entire dataset from 1990 – 2018 reveals that the increasing trend from 1990 – 2005 actually reversed into a decreasing trend from 2005 – 2018:

Students find these data to be very surprising.  I hope the surprise aspect helps to make the caution about extrapolation memorable for them.


The next three questions concern Hans Rosling’s Gapminder/Ignorance Test.  I presented three of the twelve questions on this test in the previous post (here).  Each of the twelve questions asks respondents to select one of three options.  The correct answer for each question is the most optimistic of the three options presented.

4. Suppose that all people select randomly among the three options on all twelve questions.  Let the random variable X represent the number of questions that a person would answer correctly.

  • a) Describe the probability distribution of X.  Include the parameter values as well as the name of the distribution.
  • b) Determine and interpret the expected value of X.
  • c) Determine the probability that a person would obtain exactly the expected value for the number of correct answers.
  • d) Determine and compare the probabilities of correctly answering fewer than the expected value vs. more than the expected value. 
  • e) Discuss how the actual survey results, as shown in the following graph, compare to the binomial distribution calculations.

Under the assumption of random selection among the three options on all twelve questions, the probability distribution of X, the number of correct answers, would follow a binomial distribution with parameters n = 12 and p = 1/3.  A graph of this probability distribution is shown here:

The expected value of X can be calculated as: E(X) = np = 12×(1/3) = 4.0.  This means that if the questions were asked of a very large number of people, all of whom selected randomly among the three options on all twelve questions, then the average number of correct answers would be very close to 4.0.

The binomial probabilities in (c) and (d) can be calculated to be 0.2384 for obtaining exactly 4 correct answers, 0.3931 for fewer than 4 correct, and 0.3685 for more than 4 correct.
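
These binomial probabilities are easy to reproduce with software; for example, with scipy:

```python
from scipy.stats import binom

n, p = 12, 1/3                      # twelve questions, three options each
print(binom.pmf(4, n, p))           # P(X = 4)  ~ 0.2384, part (c)
print(binom.cdf(3, n, p))           # P(X < 4)  ~ 0.3931, part (d)
print(binom.sf(4, n, p))            # P(X > 4)  ~ 0.3685, part (d)
```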

The survey data reveal that people do much worse on these questions than they would with truly random selections.  For example, about 80% of respondents got fewer than four correct answers, whereas random selections would produce about 39.31% with fewer than four correct answers.  On the other side, about 10% of people answered more than four questions correctly, compared with 36.85% that would be expected from random selections.


5. When asked how the proportion of the world’s population living in extreme poverty has changed over the past twenty years, only 5% of a sample of 1005 respondents in the United States gave the correct answer (cut in half), while 59% responded with the option furthest from the truth (doubled).

  • a) Determine the z-score for testing whether the sample data provide strong evidence that less than one-third of all Americans would answer correctly.
  • b) Summarize your conclusion from this z-score, and explain the reasoning process behind your conclusion.
  • c) Determine a 95% confidence interval for the population proportion who would answer that the rate has doubled.
  • d) Interpret this confidence interval.

The z-score in (a) is calculated as: z = (0.05 – 1/3) / sqrt[(1/3)×(2/3)/1005] ≈ -19.1.  This is an enormous z-score, indicating that the sample proportion who gave the correct response is more than 19 standard deviations less than the value one-third.  Such an extreme result would essentially never happen by random chance, so the sample data provide overwhelming evidence that less than one-third of all adult Americans would have answered correctly.

The 95% confidence interval for the population proportion in part (c) is: .59 ± 1.96 × sqrt(.59×.41/1005), which is .59 ± .030, which is the interval (.560 → .620).  We can be 95% confident that if this question were asked of all adult Americans, the proportion who would give the most wrong answer (doubled) would be between .560 and .620.  In other words, we can be 95% confident that between 56% and 62% of all adult Americans would give the most wrong answer to this question.
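
Both calculations can be verified directly; a quick sketch:

```python
from math import sqrt

# Part (a): z-score for testing H0: p = 1/3 against the observed 5% correct
n = 1005
z = (0.05 - 1/3) / sqrt((1/3) * (2/3) / n)
print(round(z, 1))                       # ~ -19.1

# Part (c): 95% confidence interval for the proportion answering "doubled"
p_hat = 0.59
margin = 1.96 * sqrt(p_hat * (1 - p_hat) / n)
print(round(p_hat - margin, 3), round(p_hat + margin, 3))   # ~ 0.560, 0.620
```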


I asked my students the question about how the extreme poverty rate has changed, before revealing the answer.  The table below shows the observed counts for the three response options in a recent class:

6. Conduct a hypothesis test of whether the sample data provide strong evidence against the hypothesis that the population of students at our school would be equally likely to choose among the three response options.

The null hypothesis is that students in the population would be equally likely to select among the three options (i.e., that one-third of the population would respond with each of the three options).  The expected counts (under this null hypothesis) are 83/3 ≈ 27.667 for each of the three categories.  All of these expected counts are larger than five, so a chi-square goodness-of-fit test is appropriate.  The chi-square test statistic turns out to equal 7.253, as shown in the following table:

The p-value, from a chi-square distribution with 2 degrees of freedom, is ≈ 0.027.  This p-value is fairly small (less than .05) but not very small (larger than .01), so we can conclude that the sample data provide fairly strong evidence against the hypothesis that students in the population would be equally likely to select among the three options.  The sample data suggest that students are more likely to give the most pessimistic answer (doubled) and less likely to give the most optimistic, correct answer (cut in half).  This conclusion should be regarded with caution, though, because the sample (students in my class) was not randomly selected from the population of all students at our school.
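
The test is a one-liner with scipy.  The counts below are hypothetical stand-ins for the class table shown above, chosen to be consistent with the reported test statistic; by default, chisquare compares the observed counts to equal expected counts:

```python
from scipy.stats import chisquare

# Hypothetical counts for (cut in half, stayed about the same, doubled),
# summing to 83 and consistent with the reported chi-square statistic
observed = [18, 27, 38]

stat, pval = chisquare(observed)         # expected counts default to 83/3 each
print(round(stat, 3), round(pval, 3))    # ~ 7.253, ~ 0.027
```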


The six questions that I have presented here only hint at the possibilities of asking questions that help students to learn important statistical content while also exposing them to data that reveal human progress.  I also encourage teachers to point their students toward resources that empower them to ask their own questions, and analyze data of their own choosing, about the state of the world.  I listed several websites with such data at the very end of the previous post (here).

P.S. The life expectancies for South Africa and Ghana were obtained from the World Bank’s World Development Indicators dataset, accessed through Google (here).  Life expectancy is defined here as “the average number of years a newborn is expected to live with current mortality patterns remaining the same.”  The data on CO2 emissions were obtained from the United States Energy Information Administration (here).  The data on the Gapminder/Ignorance Test were obtained from a link here.

Files containing the data on life expectancies and CO2 emissions can be downloaded from the links below:

#33 Reveal human progress, part 1

This post will feature many quotes that I find inspirational, starting with:

Quote #1: How can we soundly appraise the state of the world?  The answer is to count. …  A quantitative mindset, despite its nerdy aura, is actually the morally enlightened one, because it treats every human life as having equal value rather than privileging the people who are closest to us or most photogenic. – Steven Pinker, Enlightenment Now, pages 42-43

I am going to show some data that appraise the state of the world and how things have changed over the years.  First I will ask a few questions that Hans Rosling liked to ask his audiences, which I also ask of my students:

  • A: In the last twenty years, how has the proportion of the world’s population living in extreme poverty changed?  [Options: Almost doubled, Remained more or less the same, Almost halved]
  • B: What percentage of the world’s one-year-old children today have been vaccinated against some disease?  [Options: 20 percent, 50 percent, 80 percent]
  • C: Worldwide, 30-year-old men have spent an average of 10 years in school.  How many years have women of the same age spent in school, on average?  [Options: 9 years, 6 years, 3 years]

Are you ready for the answers?  Here’s a quote to reveal the correct answer for question A:

Quote #2: Over the past twenty years, the proportion of the global population living in extreme poverty has halved.  This is absolutely revolutionary.  I consider it to be the most important change that has happened in the world in my lifetime. – Hans Rosling, Factfulness, page 6

The correct answers for questions B and C are also the most optimistic of the options presented: 80 percent of one-year-old children have been vaccinated, and 30-year-old women have spent 9 years in school, on average.

Looking at data on a wide range of human experiences, Pinker uses even stronger language than Rosling to declare:

Quote #3: Here is a shocker: The world has made spectacular progress in every single measure of human well-being.  Here is a second shocker: Almost no one knows about it. – Steven Pinker, Enlightenment Now, page 52.


Can this really be true – that the world has made great progress, and that very few know about it?  Let’s return to questions A, B, and C, which were asked of people in many countries.  Rosling and his colleagues produced the following graphs of the percentage of correct responses for these questions:

Remember that these were multiple choice questions with three options.  Rosling pointed out that complete ignorance would lead to random guessing, which would produce roughly 33% correct responses in a large sample.  I’m sure you’ve noticed that for all three questions, in every country, respondents failed to achieve the level of complete ignorance.

Rosling and his colleagues asked twelve questions of this type.  For every question, the correct answer was the most optimistic of the three options provided.  Here is the distribution of number correct, where Rosling uses a chimpanzee to represent the expected value under the assumption of complete ignorance:

Do people really think the world is getting worse instead of better?  Further evidence is provided by the following survey results from asking this question directly in the year 2015:

Only in China did a higher percentage say that the world is getting better rather than worse.  In the United States, more than 10 times as many people responded worse than better.


Why are people so pessimistic and ignorant (actually, worse than ignorant) about the state of the world?  Pinker argues that the negative nature of news, combined with cognitive biases such as the availability heuristic, explain much of this phenomenon:

Quote #4: Whether or not the world is actually getting worse, the nature of news will interact with the nature of cognition to make us think that it is. – Steven Pinker, Enlightenment Now, page 41

Rosling offers many explanations for this disconnect between perception and reality, starting with what he calls the gap instinct:

Quote #5: I’m talking about that irresistible temptation we have to divide all kinds of things into two distinct and often conflicting groups, with an imagined gap – a huge chasm of injustice – in between.  – Hans Rosling, Factfulness, page 21

Consider the following graph, from Rosling’s Gapminder site (here), of a country’s child mortality rate vs. the average number of babies per woman (color indicates region of the world, and the size of the circle represents the country’s population):

The countries in the bottom left of this graph have low child mortality rates and small families, while those in the upper right experience high child mortality rates and large families.  This graph displays Rosling’s gap instinct: Many people see the world as separated into two distinct groups of countries, which are often labeled developed and developing.   

But have you noticed the catch?  This graph shows the world in 1968, more than 50 years ago!  The following graph displays the same variables on the same scale in the year 2018:

The world has changed dramatically in these 50 years!  Child mortality rates have dropped substantially, which is undeniably good news.  Despite the fact that more and more babies live past age 5 (in fact, probably because of that fact), women are having fewer and fewer babies.  Sure, there’s still variability, and the African countries (shown in light blue) still have some catching up to do.  But the separation of countries into two clusters with a gap in between is a relic of the past.  The gap instinct that many people hold is not consistent with current data.


Next I will offer some data and graphs that reveal human progress.  Such data and visualizations abound*, but I will confine myself here to seven graphs. 

* I provide a partial list of resources in a P.S. at the end of this post.  The seven graphs shown below come from the Our World in Data site (here).

The first three graphs show decreases, for all regions of the world, in child mortality rates, average number of babies per woman, and extreme poverty rates:

The next three graphs show dramatic increases in life expectancy, literacy rates, and mean years of schooling:

The final graph displays raw counts rather than rates or averages.  Because the population of the world has been growing over time, you might wonder whether a decreasing rate of extreme poverty means that fewer people are living in extreme poverty.  The following graph shows that the number of people living in extreme poverty has indeed decreased dramatically over the past two decades, while the number of people not living in extreme poverty has increased sharply:


What does this have to do with teaching introductory statistics?  I think we (teachers of introductory statistics) have a tremendous opportunity to make our students aware of human progress.  Here’s my plea: I urge you to use data, examples, activities, and assignments that reveal* human progress to your students.

* I like the word reveal here, because we can expose students to human progress in dramatic fashion, as a plot twist providing the climax of a suspenseful narrative.

Why do I consider this so important?  I’ll call on Rosling to answer:

Quote #6: When we have a fact-based worldview, we can see that the world is not as bad as it seems – and we can see what we have to do to keep making it better.  – Hans Rosling, Factfulness, page 255

I hasten to add an important caveat: By no means am I arguing that statistics teachers should refrain from presenting examples and data that reveal problems and injustices.  Such examples can motivate students to take action for making the world a better place.  But I suspect that many statistics teachers, who are susceptible to the same inherent biases and heuristics that affect all people*, have a tendency to overdo the negative and understate the positive.  I also believe that good news about human progress can motivate students to do their part in continuing to make the world better. I am not asking teachers to recenter their entire course around data of human progress, just to show a few examples.

* I include myself here, of course.


How can we reveal the good news about human progress to students?  You know my answer: Ask good questions!

This post is something of an anomaly for this blog, as it contains few questions.  But some previous posts have already posed questions for students that use data on human progress:

  • In post #11 (Repeat after me, here), I suggested providing students with scatterplots (bubble graphs) from Rosling’s site and asking basic questions about observational units, variables, and association. 
  • I proposed asking students to calculate the percentage decrease in the extreme poverty rate between 1990 and 2015 in post #28 (A pervasive pet peeve, here). 
  • I recommended conducting a hypothesis test of whether Americans’ responses to the extreme poverty rate question are worse than would be expected by random chance in post #8 (End of the alphabet, here).

I will continue this theme in next week’s post by providing several more examples of how I have asked questions about data on human progress to teach statistical thinking in my courses.


An encouraging development is that as the year and decade came to a close in December of 2019, several columns appeared in the news to trumpet the good news of human progress.  Two examples are:

  • “This Has Been the Best Year Ever,” by Nicholas Kristof in The New York Times (here)
  • “We’ve just had the best decade in human history, seriously,” by Matt Ridley in The Spectator (here)

Finally, I offer one more quote that I find insightful and inspiring:

Quote #7: If you could choose a moment in history to be born, and you didn’t know ahead of time who you were going to be, you’d choose now.  Because the world has never been less violent, healthier, better educated, more tolerant, with more opportunity for more people, and better connected, than it is today.  – To be revealed soon

What’s your guess – is this quote from Pinker or Rosling?  Rosling or Pinker?  I used three quotes from each above.  Does this quote break the tie?  Or do you suspect that I slipped in a quote from Kristof or Ridley here? Are you ready for the big reveal?

No, this quote does not break the tie, because these words are neither Pinker’s nor Rosling’s. They are also not Kristof’s or Ridley’s.  Who said this?  President Barack Obama, at the White House Summit on Global Development, on July 20, 2016 (here).


P.S. I highly recommend Pinker’s book Enlightenment Now (here) and Rosling’s book Factfulness (here).  These books inspired this post and provided the first six quotes above.

The project that produced the data and graphs for survey questions about the state of the world is summarized here, and the data can be found here.  The graph of survey results for the “getting better or worse” question came from the YouGov site (here).  The graph of perceived happiness levels came from the Our World in Data site (here).  The graphs displaying Rosling’s gap instinct came from his Gapminder site (here).  The seven graphs of human progress came from the Our World in Data site: child mortality (here), babies per woman (here), poverty (here), life expectancy (here), literacy (here), and years of schooling (here).

I recommend the following resources for data and tools to explore human progress.  I relied most heavily on the first two sites in preparing this post:

#32 Create your own example, part 2

In last week’s post (here), I presented examples of questions that ask students to create their own example satisfying a particular property, such as the mean exceeding the median or the inter-quartile range equaling zero.  I proposed that such questions can help students to think more carefully and deepen their understanding of statistical concepts.  All of last week’s examples concerned descriptive statistics.

Now I extend this theme to the realm of statistical inference concepts and techniques.  I present six create-your-own-example questions (each with multiple parts) concerning hypothesis tests and confidence intervals for proportions and means, with a chi-square test appearing at the end.  I believe these questions lead students to develop a stronger understanding of concepts such as the role of sample size and sample variability on statistical inference.

I encourage students to use technology, such as the applet here, to calculate confidence intervals, test statistics, and p-values.   This enables them to focus on underlying concepts rather than calculations.

The numbering of these questions picks up where the previous post left off.  As always, questions for students appear in italics.


6. Suppose that you want to test the null hypothesis that one-third of all adults in your county have a tattoo, against a two-sided alternative.  For each of the following parts, create your own example of a sample of 100 people that satisfies the indicated property.  Do this by providing the sample numbers with a tattoo and without a tattoo.  Also report the test statistic and p-value from a one-proportion z-test.

  • a) The two-sided p-value is less than 0.001.
  • b) The two-sided p-value is greater than 0.20.

Students need to realize that sample proportions closer to one-third produce larger p-values, while those farther from one-third generate smaller p-values.  Clever students might give the most extreme answers, saying that all 100 have a tattoo in part (a) and that 33 have a tattoo in part (b).

Instead of asking for one example in each part, you could make the question more challenging by asking students to determine all possible sample values that satisfy the property.  It turns out that for part (a), the condition is satisfied by having 17 or fewer, or 49 or more, with a tattoo.  For part (b), having 28 to 39 (inclusive) with a tattoo satisfies the condition.  Instead of trial-and-error, you could ask students to determine these values algebraically from the z-test statistic formula, but I would only ask this in courses for mathematically inclined students.
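
The “all possible sample values” version is well suited to a brute-force search; here is a sketch using the normal approximation for the two-sided p-value:

```python
from math import sqrt
from scipy.stats import norm

n, p0 = 100, 1/3

def two_sided_p(x):
    z = (x / n - p0) / sqrt(p0 * (1 - p0) / n)
    return 2 * norm.sf(abs(z))      # two-sided p-value from the z statistic

strong = [x for x in range(n + 1) if two_sided_p(x) < 0.001]   # part (a)
weak   = [x for x in range(n + 1) if two_sided_p(x) > 0.20]    # part (b)
print(max(x for x in strong if x < 33), min(x for x in strong if x > 33))  # 17 49
print(min(weak), max(weak))                                                # 28 39
```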


7. Suppose that you want to estimate the proportion of all adults in your county who have a tattoo. For each of the following parts, create your own example to satisfy the indicated property.  Do this by specifying the sample size and the number of people in the sample with a tattoo.  Also determine the confidence interval.

  • a) The sample proportion with a tattoo is 0.30, and a 95% confidence interval for the population proportion includes the value 0.35.
  • b) The sample proportion with a tattoo is 0.30, and a 99% confidence interval for the population proportion does not include the value 0.35.

The key here is to understand the impact of sample size on a confidence interval.  The confidence interval in both parts will be centered at the value of the sample proportion value of 0.30, so the interval in part (b) needs to be narrower than the interval in part (a).  A larger sample size produces a narrower confidence interval, so a smaller sample size is needed in part (a).

One example that works for part (a) is a sample of 100 people, 30 of whom have a tattoo, which produces a 95% confidence interval of (0.210 → 0.390).  Similarly, creating a sample of 1000 people, 300 of whom have a tattoo, satisfies part (b), as the 99% confidence interval is (0.263 → 0.337).
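
A sketch of the interval computation, using the standard Wald formula (the applet mentioned above may use a slightly different method, which can shift boundary cases):

```python
from math import sqrt

def wald_ci(x, n, z):
    p_hat = x / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return round(p_hat - margin, 3), round(p_hat + margin, 3)

print(wald_ci(30, 100, 1.96))     # part (a): ~ (0.210, 0.390), includes 0.35
print(wald_ci(300, 1000, 2.576))  # part (b): ~ (0.263, 0.337), excludes 0.35
```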

Again you could consider asking students to determine all sample sizes that work.  Restricting attention to multiples of 10 (so the sample proportion with a tattoo equals 0.30 exactly), it turns out that a sample size of 340 or fewer suffices for part (a), and a sample size of 560 or more is needed for part (b).


8. Suppose that you want to estimate the population mean body temperature of a healthy adult with a 95% confidence interval.  For each of the following parts, create your own example of a sample of 10 body temperature values that satisfy the indicated property.  Do this by listing the ten values and also producing a dotplot that displays the ten values.  Report the sample standard deviation, and determine the confidence interval.

  • a) The sample mean is 98.0 degrees, and a 95% confidence interval for the population mean includes the value 98.6.
  • b) The sample mean is 98.0 degrees, and a 99% confidence interval for the population mean does not include the value 98.6.

This question is similar to the previous one, but dealing with a mean instead of a proportion brings the variability of the sample data into consideration.  This question removes sample size from consideration by stipulating that n = 10. 

The confidence interval for both parts will be centered at the value of the sample mean temperature: 98.0 degrees.  For the confidence interval in part (a) to include the value 98.6, the sample data need to display a good bit of variability.  A student might try a fairly simple example containing five values of 97.0 and five values of 99.0, which produces a sample standard deviation of 1.054 and a 95% confidence interval of (97.25 → 98.75) degrees.

In contrast, part (b) requires less sample variability, for the confidence interval to fall short of the value 98.6.  A student might use an example with very little variability, such as one value of 97.9, eight values of 98, and one value of 98.1.  This results in a sample standard deviation of 0.047 and a 99% confidence interval of (97.95 → 98.05) degrees.

As with the previous questions, you could ask students to determine all values of the sample standard deviation that will work, either with trial-and-error or algebraically.  It turns out that the sample standard deviation needs to be at least 0.839 (to three decimal places) degrees in part (a), at most 0.583 degrees in part (b).
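
A quick check of both intervals with scipy:

```python
import numpy as np
from scipy import stats

def t_interval(data, conf):
    data = np.asarray(data, dtype=float)
    mean, sd, n = data.mean(), data.std(ddof=1), len(data)
    t_crit = stats.t.ppf((1 + conf) / 2, df=n - 1)
    margin = t_crit * sd / np.sqrt(n)
    return round(mean - margin, 2), round(mean + margin, 2)

print(t_interval([97.0] * 5 + [99.0] * 5, 0.95))        # part (a): ~ (97.25, 98.75)
print(t_interval([97.9] + [98.0] * 8 + [98.1], 0.99))   # part (b): ~ (97.95, 98.05)
```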


9. Suppose that you ask dog and cat owners whether their pet has been to a veterinarian in the past twelve months.  You organize the resulting counts in a 2×2 table as follows:

For each of the following parts, create your own example of a sample that satisfies the indicated property.  Do this by filling in the counts of the 2×2 table.  Also report the two sample proportions and the test statistic and p-value from a two-proportions z-test.

  • a) The two-sided p-value is less than 0.001.
  • b) The two-sided p-value is between 0.2 and 0.6.

Students need to produce a large difference in proportions for part (a) and a fairly small difference for part (b).  They could give a very extreme answer in part (a) by having 100% of dogs and 0% of cats visit a veterinarian.  A less extreme response that 80 of 100 dogs and 20 of 50 cats have been to a veterinarian produces a z-statistic of 4.90 and a p-value very close to zero.   

Stipulating that the p-value in part (b) must be less than 0.6 forces students not to use identical success proportions in the two groups.  One example that works is to have 80 of 100 dogs and 36 of 50 cats with a veterinarian visit. This produces a z-statistic of 1.10 and a p-value of 0.270.
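
The two-proportion z-test for both examples can be verified with a short function (this uses the pooled standard error, the usual form of the test):

```python
from math import sqrt

def two_prop_z(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)           # pooled proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

print(round(two_prop_z(80, 100, 20, 50), 2))   # part (a) example: ~ 4.90
print(round(two_prop_z(80, 100, 36, 50), 2))   # part (b) example: ~ 1.10
```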


10. The Gallup organization surveyed American adults about how many times they went to a movie at a movie theater in the year 2019.  They compared results for people with at least one child under age 18 in their household and those without such a child in their household.  Suppose that you reproduce this study by interviewing a random sample of adults in your county, and suppose that the sample means are the same as in the Gallup survey: 6.8 movies for those with children, 4.7 movies for those without, as shown in the table below:

For each of the following parts, create your own example that satisfies the indicated property.  Do this by filling in the sample size and sample standard deviation for each group.  Also report the value of the two-sample t-test statistic and the two-sided p-value.

  • a) The two-sample t-test statistic is less than 1.50.
  • b) The two-sample t-test statistic is greater than 2.50.

Students have considerable latitude in their answers here, as they can focus on sample size or sample variability.  They need to realize that large sample sizes and small standard deviations will generally produce larger test statistic values, as required in part (b).  To produce a smaller test statistic value in part (a) requires relatively small sample sizes or large standard deviations.

For example, sample sizes of 10 and sample standard deviations of 4.0 for each group produce t = 1.17 to satisfy part (a).  The condition for part (b) can be met with the same standard deviations but larger sample sizes of 50 for each group, which gives t = 2.62.
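
Here is a sketch of the test statistic calculation for these two examples, using the unpooled (Welch) form of the two-sample t statistic:

```python
from math import sqrt

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    return (mean1 - mean2) / sqrt(sd1**2 / n1 + sd2**2 / n2)

# Means fixed at 6.8 (with children) and 4.7 (without), as in the question
print(round(two_sample_t(6.8, 4.0, 10, 4.7, 4.0, 10), 2))   # part (a): ~ 1.17
print(round(two_sample_t(6.8, 4.0, 50, 4.7, 4.0, 50), 2))   # part (b): ~ 2.62
```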


11. Suppose that you interview a sample of 100 adults, asking for their political viewpoint (classified as liberal, moderate, or conservative) and how often they eat ice cream (classified as rarely, sometimes, or often).  Also suppose that you obtain the marginal totals shown in the following 3×3 table:

For each of the following parts, create your own example that satisfies the indicated property.  Do this by filling in the counts of the 3×3 table.  Also report the value of the chi-square statistic and p-value.  For part (b), also describe the nature of the association between the variables (i.e., which political groups tend to eat ice cream more or less frequently?).

  • a) The chi-square p-value is between 0.4 and 0.8.
  • b) The chi-square p-value is less than 0.05.

Like the previous questions, this one also affords students considerable leeway with their responses.  They need to supply nine cell counts in the table, but the fixed margins mean that they only have four degrees of freedom* to play around with.

* Once a student has filled in four cell counts (provided that they are not all in the same row or same column), the other five cell counts are then determined by the need to make counts add up to the marginal totals.

First students need to realize that to obtain a large p-value in part (a), the counts need to come close to producing independence between political viewpoint and ice cream frequency.  They also need to know that independence here would mean that all three political groups have 20% rarely, 50% sometimes, and 30% often eating ice cream.  Independence would produce this table of counts:

This table does not satisfy the condition for part (a), though, because the p-value is 1.0.  A correct response to part (a) requires a bit of variation from perfect independence.  The following table, which shifts two liberals from rarely to often and two conservatives from often to rarely, produces a chi-square statistic of 2.222 and a p-value of 0.695:

On the other hand, a table that successfully satisfies part (b) needs to reveal a clear association between the two variables.  Consider the following example:

The chi-square test statistic equals 13.316 for this example, and the p-value is 0.010.  This table reveals that liberals are much more likely to eat ice cream often, and much less likely to eat ice cream rarely, compared to conservatives.
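
The first example can be reproduced with scipy’s chi-square test of independence.  The cell counts below are an assumption (row totals of 30, 40, and 30), chosen to be consistent with the reported statistic and p-value:

```python
from scipy.stats import chi2_contingency

# Rows: liberal, moderate, conservative; columns: rarely, sometimes, often.
# Hypothetical counts consistent with chi-square = 2.222 and p = 0.695
table = [[4, 15, 11],
         [8, 20, 12],
         [8, 15, 7]]

stat, pval, dof, expected = chi2_contingency(table)
print(round(stat, 3), round(pval, 3), dof)   # ~ 2.222, ~ 0.695, 4
```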


Students can use create-your-own-example questions to demonstrate and deepen their understanding of statistical concepts.  The previous post provided many examples that concerned descriptive statistics, and this post has followed suit with topics of statistical inference.

I also like to ask create-your-own-example questions that ask students, for instance, to identify a potential confounding variable in a study, or to suggest a research question for which comparative boxplots would be a relevant graph.  Perhaps a future post will discuss those kinds of questions.

As with the previous post, I leave you with a (completely optional, of course) take-home assignment: Create your own example of a create-your-own-example question to ask of your students.

P.S. A recent study (discussed here) suggests that average body temperature for humans, as discussed in question 8, has dropped in the past century and is now close to 97.5 degrees Fahrenheit.  The Gallup survey mentioned in question 10 can be found here.

#31 Create your own example, part 1

I like asking questions that prompt students to create their own example to satisfy some property.  I use these questions in many settings: class activities, homework assignments, quizzes, and exams.  Such questions prompt students to engage in higher-level thinking than rote calculations.  I also believe that these questions can lead students to deepen their understanding about properties of statistical measures and methods.

I presented one such question in post #3 (here), in which I asked students to create their own example to illustrate Simpson’s paradox.  That’s a very challenging question for most students.  In this post, I will provide five examples (each with multiple parts) of create-your-own-example questions, most of which are fairly straightforward but nevertheless (I believe) worthwhile.  I will also discuss the statistical concepts, all related to the topic of descriptive statistics, that the questions address.  As always, questions for students appear in italics.


1. Suppose that you record the age of 10 customers who enter a movie theater.  For each of the following parts, create an example of 10 ages that satisfy the indicated property.  (In other words, produce a list of 10 ages for each part.)  Also, report the values of the mean and median for parts (c) – (e).  Do not bother to calculate the standard deviation in part (b).

  • a) The standard deviation equals zero.
  • b) The inter-quartile range equals zero, and the standard deviation does not equal zero. 
  • c) The mean is larger than the median.
  • d) The mean exceeds the median by at least 20 years.   
  • e) The mean exceeds the median by at least 10 years, and the inter-quartile range equals zero.

Part (a) simply requires that all 10 customers have the same age.  A correct answer to part (b) needs the 3rd through 8th values (in order) to be the same, so that the IQR equals zero, with at least one different value to make the standard deviation positive.  The easiest way to answer (b) correctly is to make nine of the ages the same and one age different.

Part (c) requires knowing that the mean will be affected by a few unusually large ages.  An example that works for (d), which is more challenging than (c), is to have six ages of 10, so the median is 10, and four ages of 60, which pulls the mean up to 30. 

Part (e) is more challenging still.  An IQR of 0 again requires the 3rd through 8th values to be the same.  Two large outliers can inflate the mean enough to satisfy the property.  For example, eight ages of 10 and two ages of 60 makes the IQR 0, median 10, and mean 20.
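
A quick way for students (or graders) to check examples like these:

```python
import numpy as np

ages_d = [10] * 6 + [60] * 4                 # proposed answer for part (d)
print(np.mean(ages_d), np.median(ages_d))    # 30.0, 10.0 -> mean exceeds median by 20

ages_e = [10] * 8 + [60] * 2                 # proposed answer for part (e)
q1, q3 = np.percentile(ages_e, [25, 75])     # note: IQR conventions vary by software
print(np.mean(ages_e), np.median(ages_e), q3 - q1)   # 20.0, 10.0, 0.0
```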

Ideally, students think about properties of mean and median as they answer questions like this.  I think it’s fine for students to use some trial-and-error, but then I hope they can explain why an example works.  You could assess this by asking students to describe their reasoning process, perhaps for part d) or e), along with submitting their example.  

I want students to consider the context here (and always), so I only give partial credit if an example uses an unrealistic age such as 150 years.

For an in-class activity or homework assignment, I ask all five parts of this question, and I encourage students to use software (such as the applet here) to facilitate the calculations.  On a quiz or exam, I only ask one or two parts of this question.  I do think it’s important to give students practice with this kind of question prior to asking it on an exam.


2. Consider the following dotplot, which displays the distribution of margin of victory in a sample of 10 football games (mean 11.0, median 9.5, standard deviation 6.04 points):

For each of the following parts, create your own example by proposing an eleventh value along with these ten to satisfy the indicated property.  (Notice that the context here requires that the new value must be a positive integer.)  For each part, add your new data value to the dotplot.

  • a) The mean, median, and standard deviation all increase.
  • b) The mean, median, and standard deviation all decrease.
  • c) The median increases, and the mean decreases.

Students should realize immediately that part (a) requires that the new value be fairly large.  The new value must be larger than the mean and median, of course, but it needs to be considerably larger in order for the standard deviation to increase.  It turns out that any integer value of 18 or higher works.  (I do not expect students to determine the smallest value that works, although you could make the question harder by asking for that.)

Part (b) requires that the new value be less than the mean and median, but fairly close to the mean in order for the standard deviation to decrease.  A natural choice that works is 9.  (It turns out that any integer from 5 through 9, inclusive, works.)  Part (c) has a unique correct answer, which is the only integer between the median and mean: 10 points.

I provide a separate copy of the dotplot for each part of this question.  If students have access to technology as they answer these questions, you could ask them to report the new values of the statistics.
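
For instructors who want to verify the claimed boundary values, here is a sketch.  The margins listed are reconstructed from the January 2020 playoff games cited in the P.S., and they reproduce the stated mean (11.0), median (9.5), and standard deviation (6.04):

```python
import numpy as np

margins = [3, 5, 6, 7, 8, 11, 16, 17, 17, 20]
base_mean, base_med = np.mean(margins), np.median(margins)
base_sd = np.std(margins, ddof=1)

def all_increase(v):
    data = margins + [v]
    return (np.mean(data) > base_mean and np.median(data) > base_med
            and np.std(data, ddof=1) > base_sd)

print(min(v for v in range(1, 100) if all_increase(v)))   # 18, as claimed for part (a)
```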


3. The Gallup organization surveyed American adults about how many times they went to a movie at a movie theater in the year 2019.  They compared results for people with at least one child under age 18 in their household and those without such a child in their household.  Suppose that you recreate this study by interviewing faculty at your school, and suppose that your sample contains 8 people in each group.  For each of the following parts, create your own example that satisfies the given property.  Do this by producing dotplots on the axes provided, making sure to include 8 data values in each group.  Do not bother to calculate the values of the means and standard deviations.

  • a) The mean for those with children is larger than the mean for those without children.
  • b) The standard deviation for those with children is larger than the standard deviation for those without children.
  • c) The mean for those with children is larger than the mean for those without, and the standard deviation for those with children is smaller than the standard deviation for those without.

Parts (a) and (b) are very straightforward, simply assessing whether students understand that the mean measures center and the standard deviation measures variability.  Part (c) is a bit more complicated, as students need to think about both aspects (center and variability) at the same time.  I provide a separate copy of the axes for each part.


4. Suppose that you ask dog and cat owners whether their pet has been to a veterinarian in the past twelve months.  You organize the resulting counts in a 2×2 table as follows:

For each of the following parts, create your own example of counts that satisfy the indicated property.  Do this by filling in the appropriate cells of the table with counts.  Also report the values for all relevant proportions, differences in proportions, and ratios of proportions.

  • a) The difference in proportions who answer yes is exactly 0.2.
  • b) The ratio of proportions who answer yes is exactly 2.0.
  • c) The difference in proportions who answer yes is greater than 0.2, and the ratio of proportions who answer yes is greater than 2.0.
  • d) The difference in proportions who answer yes is greater than 0.2, and the ratio of proportions who answer yes is less than 2.0.
  • e) The difference in proportions who answer yes is less than 0.2, and the ratio of proportions who answer yes is greater than 2.0.

You could make these questions easier by using the same sample size for both groups, but I prefer this version that requires students to think proportionally.  Part (c) requires one of the proportions to be fairly small, so the ratio can exceed 2.0.  Part (e) requires both proportions to be on the small side, so the ratio can exceed 2 without a large difference.  The following tables show examples (by no means unique) that work for parts (c), (d), and (e):
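
Since the filled-in tables are easiest to check numerically, here is a sketch with hypothetical counts (not the tables shown) that satisfy parts (c), (d), and (e):

```python
# (dogs yes, dogs total), (cats yes, cats total) -- hypothetical examples
examples = {
    "c": ((45, 100), (10, 50)),   # diff 0.25 > 0.2 and ratio 2.25 > 2.0
    "d": ((70, 100), (22, 50)),   # diff 0.26 > 0.2 but ratio ~1.59 < 2.0
    "e": ((15, 100), (3, 50)),    # diff 0.09 < 0.2 but ratio 2.50 > 2.0
}
for part, ((x1, n1), (x2, n2)) in examples.items():
    p1, p2 = x1 / n1, x2 / n2
    print(part, round(p1 - p2, 2), round(p1 / p2, 2))
```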


5. Consider the following scatterplot of sale price (in thousands of dollars) vs. size (in square feet) for seven houses that sold in Arroyo Grande, California:

The seven ordered pairs of (size, price) data points are: (1014, $474K), (1176, $520K), (1242, $459K), (1499, $470K), (1540, $575K), (1545, $500K), (1755, $580K).  The correlation coefficient between price and size is r = 0.627.  For each of the following parts, create your own example to satisfy the indicated property.  Do this by adding one point to the scatterplot and also reporting the values of the size (square feet) and price for the house that you add.  Also give a very brief description of the house (e.g., a very small and inexpensive house), and report the value of the correlation coefficient.

  • a) The correlation coefficient is larger than 0.8.
  • b) The correlation coefficient is between 0.2 and 0.4.
  • c) The correlation coefficient is negative.

Notice that I extended the scales on the axes of this graph considerably, as a hint to students that they need to consider using some small or large values for size or price.  I reproduce the graph for students in all three parts. Using technology (such as the applet here) is essential for this question.  You could ask part (a) or (c) on an exam with no technology, as long as you ask for educated guesses and do not require calculating the correlation coefficient.

The key in part (a) is to realize that the new house must reinforce the positive association considerably, which requires a house that is either considerably larger and more expensive, or else much smaller and less expensive.  Two points that work are a 500-square-foot house for $350K (r = 0.858), or a 2500-square-foot house for $650K (r = 0.846).  Students could think even bigger (or smaller) and produce a correlation coefficient even closer to 1.  For instance a 4000-square-foot house for two million dollars generates r = 0.978.

Part (b) calls for a new house that diminishes the positive association considerably, so students need to think of a house that goes against the prevailing tendency.  Students should try a small but expensive, or large but inexpensive, house.  One example that works is a 1000-square-foot-house for $550K (r = 0.374).   Part (c) is similar but requires an even more unusual house to undo the positive association completely.  For instance, a small-but-expensive house with 500 square feet for $650K achieves a negative correlation of r = -0.324.
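
With the seven data points listed above, students’ proposed additions are easy to check; for example:

```python
import numpy as np

size = [1014, 1176, 1242, 1499, 1540, 1545, 1755]     # square feet
price = [474, 520, 459, 470, 575, 500, 580]           # thousands of dollars

print(round(np.corrcoef(size, price)[0, 1], 3))       # 0.627, as given

# Part (a): a small, inexpensive house reinforces the positive association
print(round(np.corrcoef(size + [500], price + [350])[0, 1], 3))   # ~ 0.858
```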


I believe that create-your-own-example questions can help students to assess and deepen their understanding of statistical concepts related to measures of center, variability, and association.  Next week’s post will continue this theme by presenting five create-your-own-example questions that address properties of statistical inference procedures.

Are you ready for your take-home assignment*?  I bet you can guess what it is.  Ready?  Here goes: Create your own example of a create-your-own-example question that leads students to assess and deepen their understanding of a statistical concept.

* Needless to say, this assignment is optional!

P.S. The sample of 10 football games in question 2 consists of the NFL post-season games in January of 2020, prior to Super Bowl LIV, gathered from here, here, and here.  Results from the Gallup survey mentioned in question 3 can be found here.

#30 Minimize what?

What does least squares mean?  Students in an introductory statistics course typically encounter this term in the context of fitting a line to bivariate numerical data.  We tell students that among all possible lines, the least squares line is the one that makes the sum of squared vertical deviations (i.e., the sum of squared residuals) from the line as small as possible. 

In this post I explore how students can use least squares and other criteria to determine optimal measures of center for a single numerical variable.  I will describe an activity that I use with mathematically inclined students, primarily those majoring in statistics, mathematics, or economics.  I do not use this activity with typical Stat 101 students, but I do hope that this activity might be fun and worthwhile as a “beyond the exam” topic in an AP Statistics course. As always, questions that I pose to students appear in italics.


I carry a pedometer in my pocket to record the number of steps that I take each day.  Below are the data for a recent week, along with a dotplot (generated with the applet here):

Let’s start with a question meant to provoke students’ thought: Propose a number to represent the center of this distribution.  This is a very vague question, so I encourage students to just pick a value based on the graph, without giving it too much thought, and certainly without performing any calculations.  I also emphasize that there’s not a right-or-wrong answer here.

Then I ask a few students to share the values that they selected, which leads to the question: How can we decide whether one value (for the center of this distribution) is better than another?  This is a very hard question.  I try to lead students to understand that we need a criterion (a rule) for deciding.  Then I suggest that the criterion should take into account the differences (or deviations) between the data values and the proposed measure of center.  Do we prefer that these differences be small or large?  Finally, this is an easy question with a definitive answer: We prefer small differences to large ones.  I point out that with seven data values, we’ll have seven deviations to work with for each proposed measure of center.  How might we combine those seven deviations?  Would it work to simply add them?  Some students respond that this would not work, because we could have positive and negative differences cancelling out.  How can we get around that problem?  We could take absolute values of the deviations, or square them, before we add them.

Let’s get to work, starting with the least squares criterion.  Let m represent a generic measure of center.  Write out the function for the sum of squared deviations (call this SSD) as a function of m.  When students need a hint, I say that there’s nothing clever about this, just a brute-force calculation.  In general terms, we could express this function as: SSD(m) = (x1 – m)^2 + (x2 – m)^2 + … + (x7 – m)^2, where x1, …, x7 denote the seven data values.

For these particular data values, this function becomes:

Predict what the graph of this function will look like.  If students ask for a hint, I suggest that they think about whether to expect to see a line, parabola, exponential curve, or something else.  Then I either ask students to use Excel, or ask them to talk me through its use, to evaluate this function.  First enter the seven data values into column A.  Then set up column B to contain a whole bunch of (integer) values of m, from 8000 to 16000, making use of Excel’s fill down feature.  Finally, enter a formula along these lines into column C* (the exact cell references depend on where the data are stored): =SUMPRODUCT(($A$1:$A$7 - B2)^2)

* The $ symbol in the formula specifies that those data cells are fixed, as opposed to the B2 cell that fills down to produce a different output for all of the m values.

The first several rows of output look like this:

A graph of this function follows:

What is the shape of this graph?  A parabola.  Explain why this makes sense.  Because the function is quadratic, of the form a×m^2 + b×m + c.  Where does the function appear to be minimized?  Slightly above 12,000 steps.  How can we determine where the minimum occurs more precisely?  We can examine the SSD values in the Excel file to see where the minimum occurs.  Here are the values near the minimum:

We see that the minimum occurs at 12,069 steps.  Is it possible that SSD is minimized at a non-integer value of m?  Sure, that’s possible.  Can we zoom in further to identify the value of m that minimizes this function more exactly?   Yes, we can specify that Excel use multiples of .001, rather than integers, for the possible values of m, restricting our attention to the interval from 12,068 to 12,070 steps.  This produces the following graph:

Now we can examine the SSD values in the Excel file to identify where the minimum occurs:

The sum of squared deviations is minimized at the value 12,069.143.  Is this one of the seven data values?  No.  Is this the value of a common measure of center for these data?  Yes, it turns out that this is the mean of the data.  Do you think this is a coincidence?  No way, with so many decimal places of accuracy here, that would be an amazing coincidence!

If your students have studied a term of calculus, you can ask them to prove that SSD(m) is minimized by the mean of the data.  They can take the derivative, with respect to m, of the general form of SSD(m), set that derivative equal to zero, and solve for m.
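
For reference, that derivation runs as follows (a standard calculus argument, written for a general sample of n values):

```latex
\frac{d}{dm}\,\mathrm{SSD}(m)
  = \frac{d}{dm}\sum_{i=1}^{n}(x_i - m)^2
  = -2\sum_{i=1}^{n}(x_i - m) = 0
\quad\Longrightarrow\quad
m = \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}
```

The second derivative is 2n > 0, so this critical value is indeed a minimum.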


Why should we confine our attention to least squares?  Let’s consider another criterion.  Instead of minimizing the sum of squared deviations between the data values and the measure of center, let’s minimize the sum of absolute deviations.

We’ll call this function SAD(m)*.  When written out, this function looks just like SSD(m) but with absolute values instead of squares.  Again we can use Excel to evaluate this function for a wide range of values of m, using a formula such as: =SUMPRODUCT(ABS($A$1:$A$7 - B2))

* Despite the name of this function, I implore students to be happy, not sad, as they expand their horizon beyond least squares.

What do you expect the graph of this SAD(m) function to look like?  This is a much harder question than with the SSD(m) function.  Students could have realized in advance that the SSD(m) function would follow a parabola.  But what should they expect for the graph of a function that sums absolute values?  Ready?  Here’s the result:

Describe the behavior of this function.  This graph can be described as piece-wise linear.  It consists of connected line segments with different slopes.  Where do the junction points (where the line segments meet) of this function appear to occur?  Examining the SAD values in the Excel file, we find that the junction points in this graph occur at the m values 8457, 8589, 11593, and 13093*.

* The values 8457 and 8589 are so close together that it’s very hard to distinguish their junction points in the graph.  If we expanded the range of m values, we would see that all seven data values produce junction points.

Where does the minimum occur?  The minimum clearly occurs at one of these junction points: m = 11,593 steps.  Does this value look familiar?  Yes, this is one of the data values, specifically the median of the data.  Does this seem like a coincidence?  Again, no way, this would be quite a coincidence!  The sum of absolute deviations is indeed minimized at the median of the data values*. 

* The mathematical proof for this result is a bit more involved than using calculus to prove that the mean minimizes the sum of squared deviations.
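If you would like to check these two results in code rather than Excel, here is a minimal Python sketch.  The seven values below are hypothetical stand-ins, not the actual step counts, chosen so that the mean (12,069.143), median (11,593), and midrange (12,663) match the values reported in this post:

```python
import numpy as np

# Hypothetical stand-ins for the seven daily step counts (not the actual data),
# chosen so that the mean, median, and midrange match the values in this post
steps = np.array([8457, 8589, 10000, 11593, 13093, 15883, 16869])

def ssd(m): return np.sum((steps - m) ** 2)   # sum of squared deviations
def sad(m): return np.sum(np.abs(steps - m))  # sum of absolute deviations

grid = np.arange(steps.min(), steps.max() + 1)  # integer candidate values of m
print(grid[np.argmin([ssd(m) for m in grid])])  # 12069: the mean, to the nearest step
print(grid[np.argmin([sad(m) for m in grid])])  # 11593: the median
```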


Some students wonder: What if there had been an even number of data values?  I respond: What a terrific question!  What do you predict will happen?  Please explore this question and find out.

Let’s investigate this question now.  On Sunday, January 19, I walked for 14,121 steps.  Including this value in the dataset gives the following ordered values:

How will the mean and median change?  The mean will increase, because we’ve included a value larger than the previous mean.  The median will also increase, as it will now be the average of the 4th and 5th values, and the value we’ve inserted is larger than those values.  It turns out that the mean is now 12,325.625 steps, and the median is (11,593 + 13,093) / 2 = 12,343 steps.

Predict what will change in the graphs of these functions and the values of m that minimize these functions.  Ready to see the results?  Here is the graph for the SSD function:

This SSD function behaves as you expected, right?  It’s still a parabola, and it’s still minimized at the mean, which is now a bit larger than the previous mean.  Now let’s look at the SAD function:

Whoa, did you expect this?  We still have a piece-wise linear function, with junction points still at the data values.  The median still minimizes the function, but it no longer does so uniquely.  The SAD function is now minimized by any value between the two middle values of the dataset.  For this dataset, all values from 11,593 to 13,093 steps minimize the SAD function*.

* While the common convention is to declare the median of an even number of values to be the midpoint of the middle two values, an alternative is to regard any value between the two middle values as a median.


Are these two criteria (sum of squared or absolute deviations) the only ones that we could consider?  Certainly not.  These are the two most popular criteria, with least squares the most common by far, but we can investigate others.  For example, if you’re a very cautious person, you might want to minimize the worst-case scenario.  So, let’s stick with absolute deviations, but let’s seek to minimize the maximum of the absolute deviations rather than their sum.  We’ll call this function MAXAD(m), and we can evaluate it in Excel with:

What do you predict this function to look like?  The resulting graph (based on the original seven data values) is:

This MAXAD function is piece-wise linear, just as the SAD function was.  But there are only two linear pieces to this function.  The unique minimum occurs at m = 12,663 steps.  How does this minimum value relate to the data values?  It turns out that the minimum occurs at the average of the minimum and maximum values, also known as the midrange. It makes sense that we use the midpoint of the most extreme values in order to minimize the worst-case scenario.
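Continuing the Python sketch from above (still with the hypothetical stand-in data), the same grid search confirms this:

```python
def maxad(m): return np.max(np.abs(steps - m))    # worst-case absolute deviation

print(grid[np.argmin([maxad(m) for m in grid])])  # 12663: the midrange
print((steps.min() + steps.max()) / 2)            # 12663.0
```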

Now let’s continue with the idea of minimizing a worst-case scenario, but let’s work with squared differences rather than absolute values.  What do you expect the maximum of squared deviations function to look like, and where do you expect the minimum to occur?

Here’s the graph, again based on the original seven data values:

It’s hard to see, but the two pieces are not quite linear this time: each piece is an arc of a parabola.  In fact, this function is simply the square of the MAXAD function, and squaring does not change where the minimum of a non-negative function occurs.  So the minimum again occurs at the midrange of the data values: m = 12,663 steps.

Would including the 8th data value that we used above affect the midrange?  No, because that 8th value did not change the minimum or maximum.  Is the midrange resistant to outliers?  Not at all!  The midrange is not only strongly affected by very extreme values, it takes no data values into account except for the minimum and maximum.

Could we ask students to investigate other criteria?  Sure.  Here’s a weird one: How about the median of the absolute deviations, rather than the sum or maximum of them?  I have no idea why you would want to minimize this function, but it produces a very interesting graph, and the minimum occurs at m = 10,775 steps:


The concept of least squares applies to one-variable data as well as to its more typical setting of fitting lines to bivariate data.  Students can use software to explore not only this concept but other minimization criteria as well.  Along the way they can make some surprising (and pretty) graphs, and also discover some interesting results about summary statistics.

P.S. This activity was inspired by George Cobb and David Moore’s wonderful article “Mathematics, Statistics, and Teaching” (available here), which appeared in The American Mathematical Monthly in 1997.  The last section of the article discussed optimization properties of measures of center, mentioning several of the criteria presented in this post.

The very last sentence of George and David’s article (This is your take-home exam: design a better one-semester statistics course for mathematics majors) inspired Beth Chance and me to develop Investigating Statistical Concepts, Applications, and Methods (more information available here).

P.P.S. You can download the Excel file that I used in these analyses from the link below.  Notice that the file contains separate tabs for the original analysis of seven data values, a zoomed-in version of that analysis, and the analysis of eight data values.

#29 Not enough evidence

We statistics teachers often ask students to draw a conclusion, in the context of the data and research question provided, from the p-value of a hypothesis test.  Do you think a student is more likely to provide a response that earns full credit if the p-value is .02 or .20?

You may respond that it doesn’t matter.  You may believe that a student either knows how to state a conclusion from a p-value or not, regardless of whether the p-value is small or not-so-small.

I think it does matter, a lot.  I am convinced that students are more likely to give a response that earns full credit from a small p-value like .02 than from a not-so-small p-value like .20.  I think it’s a lot easier for students to express a small p-value conclusion of strong evidence against the null than a not-so-small p-value conclusion of not much evidence against the null.  Why?  In the not-so-small p-value case, it’s very easy for students to slip into wording about evidence for the null hypothesis (or accepting the null hypothesis), which does not deserve full credit in my book.

In this post I will explore this inclination to mis-state hypothesis test conclusions from a not-so-small p-value.  I will suggest two explanations for convincing students that speaking of evidence for the null, or deciding to accept the null, are not appropriate ways to frame conclusions.  I will return to an example that we’ve seen before and then present two new examples.  As always, questions that I pose to students appear in italics.


Let’s revisit the infamous 1970 draft lottery, which I discussed in post #9 (Statistics of illumination, part 3, here).  To recap: All 366 birthdays of the year were assigned a draft number.  The scatterplot on the left below displays the draft numbers vs. sequential day numbers.  At first glance, the graph appears to show nothing but random scatter, as we would expect from a truly random lottery.  But when we explored the data further, we found a bit of negative association between draft number and day number, with a correlation coefficient of -0.226.  We used simulation to investigate how surprising such a correlation would be with a truly random lottery.  The graph on the right shows the results for 10,000 random lotteries.  We see that none of the 10,000 simulated correlation coefficients is as large (in absolute value) as the -0.226 value that was achieved with the actual 1970 draft lottery.  Therefore, because a result as extreme as the one observed would be very unlikely to occur with a truly random lottery, we concluded that the observed data provide very strong evidence that the lottery process was not truly random.  (The explanation turned out to be insufficient mixing of the capsules containing the birthdays.)

This reasoning process is by no means trivial, but I think it makes sense to most students.  Without using the terminology, we have conducted a hypothesis test.  The null hypothesis is that the lottery process was truly random.  The alternative hypothesis is that the process was not truly random.  The p-value turns out to be very close to zero, less than 1 in 10,000.  Therefore, we have very strong evidence against the null hypothesis in favor of the alternative.

In the following year’s (1971) draft lottery, additional steps were taken to try to produce a truly random process.  The correlation coefficient (between draft number and day number) turned out to be 0.014.  The graph of simulation results above* shows that such a correlation coefficient is not the least bit unusual or surprising if the lottery process was truly random.  The two-sided p-value turns out to be approximately 0.78.  What do you conclude about the 1971 lottery process?

* This 1971 draft lottery involved 365 birthdays, as compared to 366 birthdays in the 1970 draft lottery.  This difference is so negligible that using the same simulation results is reasonable.

After they provide their open-ended response, I also ask students: Which of the following responses are appropriate and which are not?

  • A: The data do not provide enough evidence to conclude that the 1971 lottery process was not truly random.
  • B: The data do not provide much evidence for doubting that the 1971 lottery process was truly random.
  • C: The data provide some evidence that the 1971 lottery process was truly random.
  • D: The data provide strong evidence that the 1971 lottery process was truly random.

Responses A and B are correct and appropriate.  But they are challenging for students to express, in large part because they include a double negative.  It’s very tempting for students to avoid the double negative construction and write a more affirmative conclusion. But the affirmative responses (C and D) get the logic of hypothesis testing wrong by essentially accepting the null hypothesis.  That’s a no-no, so those responses deserve only partial credit in my book.

Students naturally ask: Why is this wrong?  Very good question.  I have two answers, one fairly philosophical and the other more practical.  I will lead off with the philosophical answer, even though students find the practical answer to be more compelling and persuasive.


The philosophical answer is: Accepting a null hypothesis, or assessing evidence in favor of the null hypothesis, is simply not how the reasoning process of hypothesis testing works.  The reasoning process only assesses the strength of evidence that the data provide against the null hypothesis.  Remember how this goes: We start by assuming that the null hypothesis is true.  Then we see how surprising the observed data would be if the null hypothesis were true.  If the answer is that the observed data would be very surprising, then we conclude that the data provide strong evidence against the null hypothesis.  If the answer is that the observed data would be somewhat surprising, then we conclude that the data provide some evidence against the null hypothesis.  But what if the answer is that the observed data would not be surprising?  Well, then we conclude that the data provide little or no evidence against the null hypothesis.

This reasoning process is closely related to the logical argument called modus tollens:

  • If P then Q
  • Not Q
  • Therefore: not P

For example, the Constitution of the United States stipulates that if a person is eligible to be elected President in the year 2020 (call this P), then that person must have been born in the U.S. (call this Q).  We know that Queen Elizabeth was not born in the U.S. (not Q).  Therefore, Queen Elizabeth is not eligible to be elected U.S. President in 2020 (not P).

But what if Q is true?  The following, sometimes called the fallacy of the converse, is NOT VALID:

  • If P then Q
  • Q
  • Therefore: P

For example, Taylor Swift was born in the U.S. (Q).  Does this mean that she is eligible to be elected President in 2020 (P)?  No, because she is younger than 35 years old, which violates a constitutional requirement to serve as president.

For the draft lotteries, P is the null hypothesis that the lottery process was truly random, and Q is that the correlation coefficient (between day number and draft number) is between about -0.1 and 0.1.  Notice that (If P, then Q) is not literally true here, but P does make Q very likely.  This is the stochastic* version of the logic.  For the 1970 lottery, we observed a correlation coefficient (-0.226) that is not Q, so we have strong evidence for not P, that the lottery process was not truly random.  For the 1971 lottery, we obtained a correlation coefficient (0.014) that satisfies Q.  This leaves us with no evidence for not P (that the lottery process was non-random), but we also cannot conclude P (that the lottery process was random).

* I don’t use this word with introductory students.  But I do like the word stochastic, which simply means involving randomness or uncertainty.

I only discuss modus tollens in courses for mathematics and statistics majors.  But for all of my students I do mention the common expression: Absence of evidence does not constitute evidence of absence.  For the 1971 draft lottery, the correlation coefficient of 0.014 leaves us with an absence of evidence that anything suspicious (non-random) was happening, but that’s not the same as asserting that we have evidence that nothing suspicious (non-random) was happening.


My second answer, the more practical one, for why it’s inappropriate to talk about evidence in favor of a null hypothesis, or to accept a null hypothesis, is: Many different hypotheses are consistent with the observed data, so it’s not appropriate to accept any one of these hypotheses.  Let me use a new example to make this point.

Instead of flipping a coin, tennis players often determine who serves first by spinning a racquet and seeing whether it lands with the label facing up or down.  Is this really a fair, 50/50 process?  A student investigated this question by spinning her racquet 100 times, keeping track of whether it landed with the label facing up or down.

  • a) What are the observational units and variable?  The observational units are the 100 spins of the racquet.  The variable is whether the spun racquet landed with the label facing up or down.  This is a binary, categorical variable.
  • b) Identify the parameter of interest.  The parameter is the long-run proportion of all spins for which the racquet would land with the label up*.  This could also be expressed as the probability that the spun racquet would land with the label facing up.
  • c) State the null and alternative hypotheses in terms of this parameter.  The null hypothesis is that the long-run proportion of all spins that land up is 0.5.  In other words, the null hypothesis states that racquet spinning is a fair, 50/50 process, equally likely to land up or down.  The alternative hypothesis is that the long-run proportion of all spins that land up is not 0.5.  This is a two-sided alternative.

* We could instead define a down label as a success and specify the parameter to be the long-run proportion of all spins that would land down.

The 100 racquet spins in the sample resulted in 44 that landed with the label up, 56 that landed with the label down.  The two-sided p-value turns out to be 0.271, as shown in the following graph of a binomial distribution*:

* You could also (or instead) present students with an approximate p-value from a simulation analysis or a normal distribution.

  • d) Interpret this p-value.  If the racquet spinning process was truly fair (equally likely to produce an up or down result), there’s a 27.1% chance that a random sample of 100 spins would produce a result as extreme as the actual one: 44 or fewer, or 56 or more, spins landing with the label up.
  • e) Summarize your conclusion.  The sample data (44 landing up in 100 spins) do not provide much evidence against the hypothesis that racquet spinning is a fair, 50/50 process.
  • f) Explain how your conclusion follows from the p-value.  The p-value of 0.271 is not small, indicating that the observed result (44 landing up in 100 spins), or a result more extreme, would not be surprising if the racquet spinning process was truly fair.  In other words, the observed result is quite consistent with a fair, 50/50 process.
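For teachers who want to double-check this value, here is a quick sketch using scipy.  Because the hypothesized value 0.5 makes the binomial distribution symmetric, the two-sided p-value is just the sum of the two tail probabilities:

```python
from scipy.stats import binom

# P(X <= 44) + P(X >= 56) for X ~ Binomial(n = 100, p = 0.5)
p_value = binom.cdf(44, 100, 0.5) + binom.sf(55, 100, 0.5)
print(round(p_value, 3))  # 0.271
```

For off-center null values such as those tested below (0.45, 0.40, 0.52), the two tails are no longer mirror images, and software packages define two-sided binomial p-values in slightly different ways.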

Once again this conclusion in part (e) is challenging for students to express, as it involves a double negative.  Students are very tempted to state the conclusion as: The sample data provide strong evidence that racquet spinning is a fair, 50/50 process.  Or even more simply: Racquet spinning is a fair, 50/50 process.

To help students understand what’s wrong with these conclusions, let’s focus on the parameter, which is the long-run proportion of racquet spins that would land with the label facing up.  Concluding that racquet spinning is a fair, 50/50 process means concluding that the value of this parameter equals 0.5. 

I ask students: Do we have strong evidence against the hypothesis that 45% of all racquet spins would land up?  Not at all!  This hypothesized value (0.45) is very close to the observed value of the sample proportion of spins that landed up (0.44).  The p-value for testing the null value of 0.45 turns out to be 0.920*.

* All of the p-values reported for this example are two-sided, calculated from the binomial distribution.

Let’s keep going: Do we have strong evidence against the hypothesis that 40% of all racquet spins would land up?  Again the answer is no, as the p-value equals 0.416.  What about 52%?  Now the p-value is down to 0.111, but that’s still not small enough to rule out 0.52 as a plausible value of the parameter.

Where does this leave us?  We cannot reject that the racquet spinning process is fair (parameter value 0.5), but there are lots and lots* of other parameter values that we also cannot reject.  Therefore, it’s inappropriate to accept one particular value, or to conclude that the data provide evidence in favor of one particular value, because there are many values that are similarly plausible for the parameter.  The racquet spinning process might be fair, but it also might be biased slightly in favor of up or considerably against up.

* Infinitely many, in fact


Now let’s consider a new example, which addresses the age-old question: Is yawning contagious?  The folks at the popular television series MythBusters investigated this question by randomly assigning 50 volunteers to one of two groups:

  • Yawn seed group: A confederate of the show’s hosts purposefully yawned as she individually led 34 subjects into a waiting room.
  • Control group: The person led 16 other subjects into a waiting room and was careful not to yawn.

All 50 subjects were observed by hidden camera as they sat in the room, to see whether or not they yawned as they waited for someone to come in.  Here is the resulting 2×2 table of counts:

The hosts of the show calculated that 10/34 ≈ 0.294 of the subjects in the yawn seed group yawned, compared to 4/16 = 0.250 of the subjects in the control group.  The hosts conceded that this difference is not dramatic, but they noted that the yawn seed group had a higher proportion who yawned than the control group, and they went on to declare that the data confirm the yawning is contagious hypothesis.

We can use an applet (here) to simulate a randomization test* on these data.  The p-value turns out to be approximately 0.513, as seen in the following graph of simulation results:

* See post #27 (Simulation-based inference, part 2, here) for an introduction to such an analysis.

  • a) State the null and alternative hypotheses, in words.
  • b) Do you agree with the conclusion reached by the show’s hosts? Explain.
  • c) How would you respond to someone who concluded: “The hosts are completely wrong.  The data from this study actually provide strong evidence that yawning is not contagious.”

a) The null hypothesis is that yawning is not contagious.  In other words, the null hypothesis is that people exposed to a yawn seed group have the same probability of yawning as people not so exposed.  The alternative hypothesis is that yawning is contagious, so people exposed to a yawn seed group are more likely to yawn than people not so exposed.

b) The conclusion of the show’s hosts is not supported by the data.  Such a small difference in yawning proportions between the two groups could easily have occurred by the random assignment process alone, even if yawning is not contagious.  The data do not provide nearly enough evidence for concluding that yawning is contagious.

c) This conclusion goes much too far in the other direction.  It’s not appropriate to conclude that yawning is not contagious.  A hypothesis test only assesses evidence against a null hypothesis, not in favor of a null hypothesis.  It’s plausible that yawning is not contagious, but the observed data are also consistent with yawning being a bit contagious or even moderately contagious.


As I wrap up this lengthy post, let me offer five pieces of advice for helping students to avoid mis-stating conclusions from not-so-small p-values:

1. I strongly advise introducing hypothesis testing with examples that produce very small p-values and therefore provide strong evidence against the null hypothesis.  The blindsight study that I used in post #12 (Simulation-based inference, part 1, here) is one such example.  I think a very small p-value makes it much easier for students to hang their hat on the reasoning process behind hypothesis testing.

2. Later be sure to present several examples that produce not-so-small* p-values, giving students experience with drawing “not enough evidence to reject the null” conclusions.

* You have no doubt noticed that I keep saying not-so-small rather than large.  I think this also indicates how tricky it is to work with not-so-small p-values.  A p-value of .20 does not provide much evidence against a null hypothesis, and I consider a p-value of .20 to be not-so-small rather than large.

3. Emphasize that there are many plausible values of the parameter that would not be rejected by a hypothesis test, so it’s not appropriate to accept the one particular value that appears in the null hypothesis.

4. Take a hard line when grading students’ conclusions.  Do not give full credit for a conclusion that mentions evidence for a null hypothesis or accepts a null hypothesis.

5. In addition to asking students to state their own conclusions, provide them with a variety of mis-stated and well-stated conclusions, and ask them to identify which are which.

Do you remember the question that motivated this post? Are students more likely to earn full credit for stating a conclusion from a p-value of .02 or .20?  Are you persuaded to reject the hypothesis that students are equally likely to earn full credit with either option? Have I provided convincing arguments that drawing an appropriate conclusion is easier for students from a p-value of .02 than from a p-value of .20?

#28 A pervasive pet peeve

Let’s suppose that you and I are both preparing to teach our next class.  Being easily distracted, I let my mind (and internet browser) wander to check on my fantasy sports teams, so I only devote 60% of my attention to my class preparation.  On the other hand, you keep distractions to a minimum and devote 90% of your attention to the task.  Let’s call these values (60% for me, 90% for you) our focus percentages.  Here’s the question on which this entire post hinges: Is your focus percentage 30% higher than mine?

I have no doubt that most students would answer yes.  But that’s incorrect, because 90 is 50% (not 30%) larger than 60.  This mistaking of a difference in percentages for a percentage difference is the pet peeve that permeates this post.

I will describe some class examples that help students learn how to work with percentage differences.  Then I’ll present some assessment items for giving students practice with this tricky idea.  Along the way I’ll sneak in a statistic that rarely appears in Stat 101 courses: relative risk.  As always, questions for students appear in italics.


A rich source of data on high school students in the United States is the Youth Risk Behavior Surveillance Survey (YRBSS).  Here are counts from the 2017 YRBSS report, comparing youths in Arizona and California on how often they wear a seat belt when riding in a car driven by someone else:

For each state, calculate the proportion (to three decimal places) of respondents who rarely or never wear a seat belt.  These proportions are 173/2139 ≈ 0.081 for Arizona, 103/1778 ≈ 0.058 for California.  Convert these proportions to percentages, and use these percentages in sentences*.  Among those who were surveyed, 8.1% of the Arizona youths and 5.8% of the California youths said that they rarely or never wear a seat belt when riding in a car driven by someone else.

* I think it’s worthwhile to explicitly ask students to convert proportions to percentages.  It’s more common to speak about percentages than proportions, and this conversion is non-trivial for some students.

Is it correct to say that Arizona youths in the sample were 2.3% more likely to wear a seat belt rarely or never than California youths in the sample?  Some students need a moment to press 8.1 – 5.8 into their calculator or cell phone to confirm the value 2.3, and then almost all students respond yes.

Let me pause here, because I want to be very clear: This is my pet peeve.  I explain that the difference between the two states’ percentages (8.1% and 5.8%) is 2.3 percentage points, but that’s not the same thing as a 2.3 percent difference.


At this point I ask students to indulge me in a brief detour.  Percentage difference between any two values is often tricky for people to understand, but working with percentages as the two values to be compared makes the calculation and interpretation all the more confusing. The upcoming detour simplifies this by using more generic values than percentages.

Suppose that my IQ is 100* and Beth’s is 140.  These IQ scores differ by 40 points.  What is the percentage difference in these IQ scores?  I quickly admit to my students that this question is not as clear as it could be.  When we talk about percentage difference, we need to specify compared to what.  In other words, we need to make clear which value is the reference (or baseline).  Let me rephrase: By what percentage does Beth’s IQ exceed mine?  Now we know that we are to treat my IQ score as the reference value, so we divide the difference by my IQ score: (140 – 100) / 100 = 0.40.  Then to express this as a percentage, we multiply by 100% to obtain: 0.40×100% = 40%.  There’s our answer: Beth’s IQ score is 40% larger than mine.

* I joked about my IQ score in post #5, titled A below-average joke, here.

Why did this percentage difference turn out to be the same as the actual difference?  Because the reference value was 100, and percent means out of 100.  Let’s make the calculation slightly harder by bringing in Tom, whose IQ is 120.  By what percentage does Beth’s IQ exceed Tom’s?  Using Tom’s IQ score as the reference gives a percentage difference of: (140 – 120) / 120 × 100% ≈ 16.7%.  Beth’s IQ score, which is 20 points higher than Tom’s, is 16.7% greater than Tom’s.

Does this mean that Tom’s IQ score is 16.7% below Beth’s?  Many students realize that the answer is no, because this question changes the reference value to be Beth’s rather than Tom’s.  The calculation is now: (120 – 140) / 140 × 100% ≈ -14.3%.  Tom’s IQ score is 14.3% lower than Beth’s.

Calculate and interpret the percentage difference between Tom’s IQ score and mine, in both directions.  Comparing Tom’s IQ score to mine is the easier direction, because the reference value of 100 keeps the calculation simple: (120 – 100) / 100 × 100% = 20%.  Tom’s IQ score is 20% higher than mine.  Comparing my score to Tom’s gives: (100 – 120) / 120 × 100% ≈ -16.7%.  My IQ score is 16.7% lower than Tom’s*.

* I think I can hear what many of you are thinking: Wait a minute, this is not statistics!  I agree, but I nevertheless think this topic, which should perhaps be classified as numeracy, is relevant and important to teach in introductory statistics courses.  Otherwise, many students will continue to make mistakes throughout their professional and personal lives when working with and interpreting percentages.  I will end this detour and return to examining real data now.
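For readers who like to see such calculations in code, here is a minimal Python sketch; the function name pct_diff and its two-argument design are my own choices:

```python
def pct_diff(value, reference):
    """Percentage by which value exceeds reference (negative if below it)."""
    return (value - reference) / reference * 100

print(pct_diff(140, 100))  # Beth vs. me: 40.0
print(pct_diff(140, 120))  # Beth vs. Tom: about 16.7
print(pct_diff(120, 140))  # Tom vs. Beth: about -14.3
```

Making the reference value a required second argument forces the compared to what question to be answered before any number comes out.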

Let’s return to the YRBSS data.  Calculating a percentage difference can seem more complicated when dealing with proportions, but the process is the same.  Calculate the percentage difference by which the Arizona youths’ proportion who rarely or never use a seat belt exceeds that for California youths.  Earlier we calculated the difference in proportions to be: 0.081 – 0.058 = 0.023. Now we divide by California’s baseline value to obtain: 0.023/0.058 ≈ .396, and finally we convert this to a percentage difference by taking: 0.396 × 100% = 39.6%.  Write a sentence interpreting this value in context.  Arizona youths in this sample were 39.6% more likely to rarely or never wear a seat belt than California youths.  Finally, just to make sure that my pet peeve is not lost on students: Is this percentage difference of 39.6% close to the absolute difference of 2.3 percentage points?  Not at all!


Next I take students on what appears to be a tangent but will lead to a connection with a different statistic for comparing proportions between two groups.  Calculate the ratio of proportions who rarely or never use a seat belt between Arizona and California youths in the survey.  This calculation is straightforward: 0.081/0.058 ≈ 1.396.  Write a sentence interpreting this value in context.  Arizona youths in the survey are 1.396 times more likely to rarely or never wear a seat belt than California youths.  I emphasize that the word times is a crucial one in this sentence.  The word times is correct here because we calculated a ratio in the first place.

Then I reveal to students that this new statistic (ratio of proportions) is important enough to have its own name: relative risk.  The relative risk of rarely or never wearing a seat belt, comparing Arizona to California youths, is 1.396.  The negative word risk is used here because this statistic is often reported in medical studies, comparing proportions with a negative result such as having a disease.  The convention is to put the larger proportion in the numerator, using the smaller proportion to indicate the reference group.

Does the number 1.396 look familiar from our earlier analysis?  Most students respond that the percentage difference was 0.396, which seems too strikingly similar to 1.396 to be a coincidence.  Make a conjecture for the relationship between percentage difference and relative risk.  Many students propose: percentage difference = (relative risk – 1) × 100%.

I ask students to test this conjecture with YRBSS data on seat belt use from Pennsylvania and California youths:

Calculate and interpret the difference and ratio of proportions who rarely or never use seat belts.  The “rarely or never” proportion in Pennsylvania is 425/3761 ≈ 0.113.  We’ve already calculated that the proportion in California is 103/1778 ≈ 0.058.  The difference in proportions is 0.113 – 0.058 = 0.055.  The percentage of Pennsylvania youths in the sample who said that they rarely or never wear a seat belt is 5.5 percentage points higher than the percentage of California youths who answered “rarely or never.”  The ratio of proportions is 0.113/0.058 ≈ 1.951*.  A Pennsylvania youth in the sample was 1.951 times more likely than a California youth to rarely or never wear a seat belt.

* I performed this calculation on the actual counts, not the proportions rounded to three decimal places in the numerator and denominator.

Verify that the conjectured relationship between percentage difference and relative risk holds.  The percentage difference in the proportions can be calculated as: (0.113 – 0.058) / 0.058 × 100% ≈ 95.1%.  This can also be calculated from the ratio as: (1.951 – 1) × 100% ≈ 95.1%.
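Here is a quick numerical check of this relationship in Python, using the Pennsylvania and California counts (variable names are mine):

```python
pa, n_pa = 425, 3761   # Pennsylvania: rarely/never count, sample size
ca, n_ca = 103, 1778   # California

rr = (pa / n_pa) / (ca / n_ca)                      # relative risk, about 1.951
print((pa / n_pa - ca / n_ca) / (ca / n_ca) * 100)  # percentage difference, about 95.1
print((rr - 1) * 100)                               # (relative risk - 1) x 100%, also about 95.1
```

The two printed values agree exactly, because the two expressions are algebraically identical.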

I am not necessarily proposing that relative risk needs to be taught in Stat 101 courses.  I am urging a very careful treatment of percentage difference, and it takes just an extra 15 minutes of class time to introduce relative risk.


Let’s follow up with a confidence interval for a difference in proportions.  If we go back to comparing the responses from Arizona and California youths, a 95% confidence interval for the difference in population proportions turns out to be: .023 ± .016, which is the interval (.007, .039).

Interpret what this interval reveals.  First recall that the order of subtraction is Arizona minus California, and notice that the interval contains only positive values.  We are 95% confident that the proportion of all Arizona youths who would answer that they rarely or never wear a seat belt is between .007 and .039 larger than the proportion of all California youths who would give that answer.  We can translate this answer to percentage points by saying that the Arizona percentage (of all youths who would answer that they rarely or never wear a seat belt) is between 0.7 and 3.9 percentage points larger than the California percentage.  But many students trip themselves up by saying that Arizona youths are between 0.7% and 3.9% more likely than California youths to answer that they rarely or never wear a seat belt.  This response is incorrect, for it succumbs to my pet peeve of mistakenly interpreting a difference in percentages as a percentage difference.

What parameter do we need to determine a confidence interval for, in order to estimate the percentage difference in population proportions (who rarely or never wear a seat belt) between Arizona and California youths?  A confidence interval for the population relative risk will allow this.  Such a procedure exists, but it is typically not taught in an introductory statistics course*.  For the YRBSS data on seat belt use in Arizona and California, a 95% confidence interval for the population relative risk turns out to be (1.103, 1.767).

* The sampling distribution of a sample relative risk is skewed to the right, but the sampling distribution of the log transformation of the sample relative risk is approximately normal.  So, a confidence interval can be determined for the log of the population relative risk, which can then be transformed back to a confidence interval for the population relative risk.
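For the curious, here is a sketch of that log-transformation method in Python, using the standard large-sample standard error for the log of a relative risk; it reproduces the interval reported above:

```python
import numpy as np
from scipy.stats import norm

a, n1 = 173, 2139   # Arizona: rarely/never count, sample size
c, n2 = 103, 1778   # California

rr = (a / n1) / (c / n2)                 # sample relative risk, about 1.396
se = np.sqrt(1/a - 1/n1 + 1/c - 1/n2)    # standard error of log(rr)
z = norm.ppf(0.975)                      # about 1.96 for 95% confidence
lo, hi = np.exp(np.log(rr) + np.array([-1, 1]) * z * se)
print(round(lo, 3), round(hi, 3))        # 1.103 1.767
```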

What aspect of this interval indicates strong evidence that Arizona and California have different population proportions?  This can be a challenging question for students, so I often offer a hint: What value would the relative risk have if the two population proportions were the same?  Most students realize that the relative risk (ratio of proportions) would equal 1 in this case.  That the interval above is entirely above 1 indicates strong evidence that Arizona’s population proportion (who rarely or never wear a seat belt) is larger than California’s.

Interpret this confidence interval.  We are 95% confident that Arizona youths are between 1.103 and 1.767 times more likely than California youths to answer that they rarely or never wear a seat belt.  Convert this to a statement about the percentage difference in the population proportions.  We can convert this to percentage difference by saying: We are 95% confident that Arizona youths are between 10.3% and 76.7% more likely than California youths to answer that they rarely or never wear a seat belt.

I am not suggesting that students learn how to calculate a confidence interval for a relative risk in Stat 101, but I do think students should be able to interpret such a confidence interval.


Now we return to the YRBSS data for a comparison that illustrates another difficulty that some students have with percentages.  The YRBSS classifies respondents by race, and the 2017 report says that 9.8% of black youths and 4.3% of white youths responded that they rarely or never wear a seat belt.  Calculate the ratio of these percentages.  This ratio is: .098/.043 ≈ 2.28.  Write a sentence interpreting the relative risk.  Black youths who were surveyed were 2.28 times more likely than white youths to rarely or never wear a seat belt.  Complete this sentence: Compared to white youths who were surveyed, black youths were ______ % more likely to rarely or never wear seat belts.  To calculate the percentage difference, we can use the relative risk as we discovered above: (2.28 – 1) × 100% = 128%.  Black youths who were surveyed were 128% more likely to rarely or never wear seat belts, as compared to white youths.

Hold on, can a percentage really be larger than 100%?  Yes, a percentage difference (or a percentage change or a percentage error) can exceed 100%.  If one value is exactly twice as big as another, then it is 100% larger.  So, if one value is more than twice as big as another, then it is more than 100% larger.  In this case, the percentage (who rarely or never use a seat belt) for black youths is more than twice the percentage for white youths, so the relative risk exceeds 2, and the percentage difference between the two percentages therefore exceeds 100%.


Here is a quiz containing five questions, all based on real data, for giving students practice working with percentage differences:

  • a) California’s state sales tax rate in early 2019 was 7.3%, compared to Hawaii’s state sales tax rate of 4.0%.  Was California’s state sales tax rate 3.3% higher than Hawaii’s?  If not, determine the correct percentage difference to use in that sentence.
  • b) Alaska had a 0% state sales tax rate in early 2019.  Could Hawaii match Alaska’s rate by reducing theirs by 4%?  If not, determine the correct percentage reduction to use in that sentence.
  • c) Steph Curry successfully made 354 of his 810 (43.7%) three-point shots in the 2018-19 NBA season, and Russell Westbrook successfully made 119 of his 411 (29.0%) three-point shots.  Could Westbrook have matched Curry’s success rate with a 14.7% improvement in his own success rate?  If not, determine the correct percentage improvement to use in that sentence.
  • d) Harvard University accepted 4.5% of its freshman applicants for Fall 2019, and Duke University accepted 7.4% of its applicants.  Was Harvard’s acceptance rate 2.9% lower than Duke’s?  If not, then determine the correct percentage difference to use in that sentence.
  • e) According to the World Bank Development Research Group, 10.0% of the world’s population lived in extreme poverty in 2015, compared to 35.9% in 1990.  Did the percentage who lived in extreme poverty decrease by 25.9% in this 25-year period?  If not, determine the correct percentage decrease to use in that sentence.

The correct answer to all of these yes/no questions is no, not even close.  Correct percentage differences are: a) 82.5% b) 100% c) 50.9% d) 39.2% e) 72.1%.
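These answers can be verified with the pct_diff helper sketched earlier (the basketball item needs the exact counts rather than the rounded percentages):

```python
print(pct_diff(7.3, 4.0))            # a) 82.5
print(pct_diff(0.0, 4.0))            # b) -100.0: a 100% reduction
print(pct_diff(354/810, 119/411))    # c) about 50.9
print(pct_diff(4.5, 7.4))            # d) about -39.2
print(pct_diff(10.0, 35.9))          # e) about -72.1: a 72.1% decrease
```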


I briefly considered titling this post: A persnickety post that preaches about a pervasive, persistent, and pernicious pet peeve concerning percentages.  That title contains 15 words, 9 of which start with the letter P, so 60% of the words in that title begin with P.  Instead I opted for the much simpler title: A pervasive pet peeve, for which 75% of the words begin with P. 

Does this mean that I increased the percentage of P-words by 15% when I decided for the shorter title?  Not at all, that’s the whole point!  I increased the percentage of P-words by 15 percentage points, but that’s not the same as 15%.  In fact, the percentage increase is (75 – 60) / 60 × 100% = 25%, not 15%. 

Furthermore, notice that 25% is 66.67% larger than 15%, so the percentage increase (in percentage of P-words) that I achieved with the shorter title is 66.67% greater than what many would mistakenly believe the percentage increase to have been.

No doubt I have gotten carried away*, as that last paragraph is correct but positively** ridiculous.  I’ll conclude with two points: 1) Misunderstanding percentage difference (or change) is very common, and 2) Teachers of statistics can help students to calculate and interpret percentage difference correctly.

* You might have come to that conclusion far earlier in this post.

** I couldn’t resist using another P word here. I really need to press pause on this preposterous proclivity.

P.S. The 2017 YRBSS report can be found here.  You might ask students to select their own questions and variables to analyze and compare. Data on state sales tax rates appear here, basketball players’ shooting percentages here, college acceptance rates here, and poverty rates here.

#27 Simulation-based inference, part 2

I believe that simulation-based inference (SBI) helps students to understand the underlying concepts and logic of statistical inference.  I described how I introduce SBI back in post #12 (here), in the scenario of inference for a single proportion.  Now I return to the SBI theme* by presenting a class activity that concerns comparing proportions between two groups.  As always, questions that I pose to students appear in italics.

* Only 15 weeks after part 1 appeared!


I devote most of a 50-minute class meeting to the activity that I will describe here.  The research question is whether metal bands* used for tagging penguins are actually harmful to their survival.

* Some students, and also some fellow teachers, tell me that they initially think that I am referring to penguins listening to heavy metal bands.

I begin by telling students that the study involved 20 penguins, of which 10 were randomly assigned to have a metal band attached to their flippers, in addition to an RFID chip for identification.  The other 10 penguins did not receive a metal band but did have an RFID chip.  Researchers then kept track of which penguins survived for the 4.5-year study and which did not.

I ask students a series of questions before showing any results from the study: Identify and classify the explanatory and response variables.  The explanatory variable is whether or not the penguin had a metal band, and the response is whether or not the penguin survived for at least 4.5 years.  Both variables are categorical and binary.  Is this an experiment or an observational study?  This is an experiment, because penguins were randomly assigned to wear a metal band or not.  Did this study make use of random sampling, random assignment, both, or neither?  Researchers used random assignment to put penguins in groups but (presumably) did not take a random sample of penguins.  State the null and alternative hypotheses, in words.  The null hypothesis is that metal bands have no effect on penguin survival.  The alternative hypothesis is that metal bands have a harmful effect on penguin survival.

Then I tell students that 9 of the 20 penguins survived, 3 with a metal band and 6 without.  Organize these results into the following 2×2 table:

The completed table becomes:

Calculate the conditional success proportions for each group.  The proportion in the control group who survived is 6/10 = 0.6, and the proportion in the metal band group who survived is 3/10 = 0.3*.  Calculate the difference in these success proportions.  I mention that students could subtract in either order, but I want us all to be consistent so I instruct them to subtract the proportion for the metal band group from that of the control group: 0.6 – 0.3 = 0.3.

* I cringe when students use their calculator or cell phone for these calculations.

Is it possible that this difference could have happened even if the metal band had no effect, simply due to the random nature of assigning penguins to groups (i.e., the luck of the draw)?  I often give my students a silly hint that the correct answer has four letters.  Realizing that neither no nor yes has four letters, I get many befuddled looks before someone realizes: Sure, it’s possible!  Joking aside, this is a key question.  This question gets at why we need to conduct inference in the first place.  We cannot conclude that metal bands are harmful simply because a smaller proportion survived with metal bands than without them.  Why not?  Because this result could have happened even if metal bands are not harmful.

What question do we need to ask next?  Students are surprised that I ask them to propose the next question.  If they ask for a hint, I remind them of our earlier experience with SBI.  To analyze a research study of whether a woman with brain damage experienced a phenomenon known as blindsight, we investigated how surprising it would be to correctly identify the burning house in 14 of 17 pairs of drawings, if in fact she was choosing randomly between the two houses (one burning, one not) presented.  For this new context I want students to suggest that we ask: How likely, or how surprising, is it to obtain a difference in success proportions of 0.3 or greater, if in fact metal bands are not harmful?

How will we investigate this question?  With simulation!


Once again we start with by-hand simulation before turning to technology.  Like always, we perform our simulation assuming that the null hypothesis is true: that the metal band has no effect on penguin survival.  More specifically, we assume that the 9 penguins who survived would have done so with the metal band or not, and the 11 penguins who did not survive would have perished with the metal band or not.

We cannot use a coin to conduct this simulation, because unlike with the blindsight study, we are not modeling a person’s random selections between two options.  Now we want our simulation to model the random assignment of penguins to treatment groups.  We can use cards to do this.

How many cards do we need?  Each card will represent a penguin, so we need 20 cards.  Why do we need two colors of cards?  How many cards do we need of each color?  We need 9 cards of one color, to represent the 9 penguins who survived, and we need 11 cards of the other color, to represent the 11 penguins who perished.  After shuffling the cards, how many will we deal into how many groups?  One group of cards will represent the control group, and a second group of cards will represent penguins who received a metal band.  We’ll deal out 10 cards into each group, just as the researchers randomly assigned 10 penguins to each group.  What will we calculate and keep track of for each repetition?  We will calculate the success proportion for each group, and then calculate the difference between those two proportions.  I emphasize that we all need to subtract in the same order, so students must decide in advance which pile represents the control group and then compute: (success proportion in control group) minus (success proportion in metal band group).

I provide packets of 20 ordinary playing cards to my students, pre-arranged with 9 red cards and 11 black ones per packet.  Students shuffle the cards and deal them into two piles of 10 each.  Then they count the number of red and black cards in each pile and fill in a table in which we already know the marginal totals:

Next we need to decide: What (one) statistic should we calculate from this table?  A very reasonable choice is to use the difference in survival proportions as our statistic*.  I remind students that it’s important that we all subtract in the same order: (proportion who survived in control group) minus (proportion who survived in metal band group).  Students then come to the whiteboard to put the value of their statistic (difference in proportions) on a dotplot.  A typical result for a class of 35 students looks like**:

* I will discuss some other possible choices for this statistic near the end of this post.

** Notice that the distribution of this statistic (difference in proportions) is discrete.  Only a small number of values are possible, because of the fixed margins of the 2×2 table.  When I draw an axis on the board, I put tick marks on these possible values before students put their dots on the graph.  Occasionally a student will obtain a value that does not fall on one of these tick marks, because they have misunderstood the process or made a calculation error.

Where is this distribution centered?  Why does this make sense?  This distribution is centered near zero.  This makes sense because the simulation assumed that there’s no effect of the metal band, so we expect this difference to be positive about half the time and negative about half the time*.

* Some students are tempted to simply take the larger proportion minus the smaller proportion, so I repeat often that they should subtract in the agreed order: (control minus metal band).  Otherwise, the center of this distribution will not be near zero as it should be.

What is important to notice in this graph, to address the key question of whether the data provide strong evidence that the metal bands are harmful to penguin survival?  This brings students back to the goal of the simulation analysis: to investigate whether the observed result would have been surprising if metal bands have no effect.  Some students usually point out that the observed value of the statistic was 0.3, so we want to see how unusual it is to obtain a statistic of 0.3 or greater.  Does the observed value of the statistic appear to be very unusual in our simulation analysis?  No, because quite a few of the repetitions produced a value of 0.3 or more.  What proportion of the repetitions produced a statistic at least as extreme as the observed value?  Counting the occurrences at 0.3 and higher reveals that 9/35 ≈ 0.257 of the 35 repetitions produced a difference in success proportions of 0.3 or more.  What does this reveal about the strength of evidence that metal bands are harmful?  Because a result as extreme as in the actual study occurred about 26% of the time in our simulation, and 26% is not small enough to indicate a surprising result, the study does not provide strong evidence that metal bands are harmful.

By what term is this 0.257 value known?  This is the (approximate) p-value.  How can we produce a better approximation for the p-value?  Repeat the process thousands of times rather than just 35 times.  In order to produce 10,000 repetitions, should we use cards or technology?  Duh!


Now we turn to an applet (here) to conduct the simulation analysis.  First we click on 2×2, enter the table of counts, and then click on Use Table:

Next we check Show Shuffle Options on the right side of the applet screen.  I like to keep the number of shuffles set at 1 and click “Shuffle” several times to see the results.  By leaving the Cards option selected, you see 20 colored cards (blue for survival, green for perishing) being shuffled and re-randomized, just as students did with their own packet of 20 cards in class.  You can also check Data or Plot to see different representations of the shuffling.  You might remind students that the underlying assumption behind the simulation analysis is that the metal bands have no effect on penguin survival (i.e., that the null hypothesis is true).

Eventually I ask for 10,000 shuffles, and the applet produces a graph such as:

Once again I ask students to notice that the distribution (of shuffled differences in proportions) is centered near zero.  But again the key question is: Does the simulation analysis indicate that the observed value of the statistic would be very surprising if metal bands have no effect?  Students are quick to say that the answer is no, because the observed value (0.3) is not very far out in the tail of this distribution.  How can we calculate the (approximate) p-value?  By counting the number of repetitions that produced a difference of 0.3 or more, and then dividing by 10,000.  The applet produces something like:

What conclusion do you draw?  Results as extreme as the one observed (a difference in survival proportions between the two groups of 0.3 or more) would not be surprising (p-value ≈ 0.1827) if the metal band had no effect on penguin survival.  Therefore, the experimental data do not provide strong evidence that metal bands are harmful to penguin survival.


I have a confession to make.  I confess this to students at this point in the class activity, and I also confess this to you now as you read this.  The sample size in this experiment was not 20 penguins.  No, the researchers actually studied 100 penguins, with 50 penguins randomly assigned to each group.  Why did I lie*?  Because 100 cards would be far too many for shuffling and counting by hand.  This also gives us an opportunity to see the effect of sample size on such an analysis.

* I chose my words very carefully above, saying I begin by telling students that the study involved 20 penguins …  While I admit to lying to my students, I like to think that I avoided telling an outright lie to you blog readers. If you don’t want to lie to your students, you could tell them at the outset that the data on 20 penguins are based on the actual study but do not comprise the complete study.

Now that I have come clean*, let me show the actual table of counts:

* Boy, does my conscience feel better for it!

We need to redo the analysis, but this goes fairly quickly in class because we have already figured out what to do.  Calculate the survival proportions for each group and their difference (control minus metal band).  The survival proportions are 31/50 = 0.62 in the control group and 16/50 = 0.32 in the metal band group, for a difference of 0.62 – 0.32 = 0.30*.  Before we re-run the simulation analysis, how do you expect the p-value to change, if at all?  Many students have good intuition that the p-value will be much smaller this time.  Here is a typical result with 10,000 repetitions:

* I try to restore my credibility with students by pointing out that I did not lie about the value of this statistic.

What conclusion would you draw?  Explain.  Now we have a very different conclusion.  This graph shows that the observed result (a difference in survival proportions of 0.3) would be very surprising if the metal band has no harmful effect.  A difference of 0.3 or larger occurred in only 23 of 10,000 repetitions under the assumption of no effect.  The full study of 100 penguins provides very strong evidence that metal bands are indeed harmful to penguin survival.
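If you prefer code to the applet, here is a minimal Python sketch of this shuffling analysis (function and variable names are mine); it handles both the 20-penguin version and the full study, and the approximate p-values will vary a bit from run to run:

```python
import numpy as np

rng = np.random.default_rng()

def randomization_pvalue(survivors, per_group, observed_diff, reps=10_000):
    """Approximate p-value for a difference in survival proportions
    (control minus metal band) at least as large as observed_diff."""
    outcomes = np.array([1] * survivors + [0] * (2 * per_group - survivors))
    count = 0
    for _ in range(reps):
        rng.shuffle(outcomes)  # re-randomize penguins to the two groups
        diff = outcomes[:per_group].mean() - outcomes[per_group:].mean()
        if diff >= observed_diff - 1e-9:  # tolerance for floating-point fuzz
            count += 1
    return count / reps

print(randomization_pvalue(9, 10, 0.3))    # about 0.18 (20 penguins)
print(randomization_pvalue(47, 50, 0.3))   # about 0.002 (full study)
```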

Before concluding this activity, a final question is important to ask: The word harmful in that conclusion is a very strong one.  Is it legitimate to draw a cause-and-effect conclusion here?  Why or why not?  Yes, because researchers used random assignment, which should have produced similar groups of penguins, and because the results produced a very small p-value, indicating that such a big difference between the survival proportions in the two groups would have been unlikely to occur if metal bands had no effect.


That completes the class activity, but I want to make two additional points for teachers, which I also explain to mathematically inclined students:

1. We could have used a different statistic than the difference in success proportions.  For a long time I advocated using simply the number of successes in group A (in this case, the number of survivors in the control group).  Why are these two statistics equivalent?  Because we are fixing the counts in both margins of the 2×2 table (9 who survived and 11 who perished, 10 in each treatment group), there’s only one degree of freedom.  What does this mean?  Once you specify the count in the upper left cell of the table (or any other cell, for that matter), the rest of the counts are then determined, and so the difference in success proportions is also determined.  In other (mathematical) words, there’s a one-to-one correspondence between the count in the upper left cell and the difference in success proportions.

Why did I previously use the count in the upper left cell as the statistic in this activity?  It’s easier to count than to calculate two proportions and the difference between them, so students are much more likely to make a mistake when they calculate a difference in success proportions.  Why did I change my mind, now favoring the difference in success proportions between the two groups?  My colleagues persuaded me that calculating proportions is always a good step when dealing with count data, and considering results from both groups is also a good habit to develop.

Those two statistics are not the only possible choices, of course.  For example, you could calculate the ratio of success proportions rather than the difference; this ratio is called the relative risk.  You could even calculate the value of a chi-square statistic, but I certainly do not recommend that when you are introducing students to 2×2 tables for the first time.  Because of the one degree of freedom, all of these statistics would produce the same (approximate) p-value from a given simulation analysis.  The applet used above allows for choosing any of these statistics, in case you want students to explore this for themselves.

2. Just as we can use the binomial distribution to calculate an exact p-value in the one-proportion scenario, we can also calculate an exact p-value for the randomization test in this 2×2 table scenario.  The relevant probability distribution is the hypergeometric distribution, and the test is called Fisher’s exact test.  The calculation involves counting techniques, namely combinations.  The exact p-values can be calculated as (on the left for the sample size of 20 penguins, on the right for the full sample of 100 penguins):


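If you would like to verify these exact p-values yourself, here is a sketch using the hypergeometric distribution in scipy (my own code, not from the activity; note that the fixed margins of the 20-penguin version, with a difference of 0.6 − 0.3 = 0.3, imply 6 survivors in the control group):

```python
from scipy.stats import hypergeom

# hypergeom.sf(k, M, n, N) gives P(X > k), so use k - 1 for P(X >= k), where
# M = total penguins, n = total survivors, N = control group size.

# 20-penguin version: 9 survivors among 20 penguins, 6 in the control group of 10
print(hypergeom.sf(6 - 1, 20, 9, 10))      # about 0.185

# Full study: 47 survivors among 100 penguins, 31 in the control group of 50
print(hypergeom.sf(31 - 1, 100, 47, 50))   # near 0.002, matching 23/10,000
```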
There you have it: simulation-based inference for comparing success proportions between two groups.  I emphasize to students throughout this activity that the reasoning process is the same as it was with one proportion (see post #12 here).  We simulate the data-collection process assuming that the null (no effect) hypothesis is true.  Then if we find that the observed result would have been very surprising, we conclude that the data provide strong evidence against the null hypothesis.  In this case, the result from the initial sample of 20 penguins would not have been surprising, so those data alone provided little evidence against the null hypothesis, but the same 0.30 difference in the full study of 100 penguins would have been very surprising, providing strong evidence that metal bands are harmful.

This activity can reinforce what students learned earlier in the course about the reasoning process of assessing strength of evidence.  You can follow up with more traditional techniques, such as a two-sample z-test for comparing proportions or a chi-square test.  I think the simulation-based approach helps students to understand what a p-value means and how it relates to strength of evidence.

P.S. You can read about the penguin study here.

P.P.S. I provided several resources and links about teaching simulation-based inference at the end of post #12 (here).

#26 Group quizzes, part 2

In last week’s post (here), I mentioned that I give lots of group quizzes and consider them to be an effective assessment tool that promotes students’ learning.  I provided six examples of quizzes, with five questions per quiz, that I have used with my students. 

Now I pick up where I left off, offering seven more quizzes with comments on each.  The topics of these quizzes include numerical variables and comparisons between groups.

As always, questions that I put to students appear in italics.  A file containing all thirteen quizzes from the two posts, along with solutions, can be downloaded from a link at the end of this post.


7. Answer these questions:

  • a) Suppose that a class of 10 students has the following exam scores: 60, 70, 50, 60, 90, 90, 80, 80, 40, 50.  Determine the median of these 10 exam scores.
  • b) Suppose that the average amount of sleep obtained by Cal Poly undergraduates last night was 6.8 hours, and the average amount of sleep obtained by Cal Poly graduate students last night was 7.6 hours.  Is it reasonable to conclude that the average amount of sleep obtained last night among all Cal Poly students was (6.8 + 7.6)/2 = 7.2 hours?  Explain.
  • c) What effect does doubling every value in a dataset have on the mean?   Explain your answer.
  • d) What effect does adding 5 to every value in a dataset have on the standard deviation?  Explain your answer.
  • e) Create an example of 10 hypothetical exam scores (on a 0 – 100 scale) with the property that the mean is at least 20 points larger than the median.  Also report the values of the mean and median for your example.

This quiz is a hodgepodge that addresses basic concepts of measures of center and variability, following up on topics raised in posts #5 (A below-average joke, here) and #6 (Two dreaded words, here).  Some students think of part (a) as a “trick” question, but I think it’s important for students to remember to put data in order before declaring that the middle value (in this case, the average of the two middle values) is the median.  For part (b), students should respond that this conclusion would only be valid if Cal Poly has the same number of undergraduate and graduate students.  You could ask parts (c) and (d) as multiple choice questions by deleting the “explain” aspect.  When I discuss part (e) with students afterward, I advise them to make such an example as extreme as possible.  To make the mean much larger than the median, they could force the median to be zero by having six scores of zero.  Then they can make the mean as large as possible by having four scores of 100.  This makes the mean equal 400/10 = 40, with a median of 0.
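A quick check of parts (a) and (e), in case you want to verify the arithmetic (this is just my own verification sketch):

```python
from statistics import mean, median

# Part (a): the median requires ordering the data first.
scores = [60, 70, 50, 60, 90, 90, 80, 80, 40, 50]
print(sorted(scores))       # [40, 50, 50, 60, 60, 70, 80, 80, 90, 90]
print(median(scores))       # 65.0, the average of the two middle values

# Part (e): six zeros force the median to 0; four 100s push the mean to 40.
extreme = [0] * 6 + [100] * 4
print(mean(extreme), median(extreme))   # 40 and 0.0
```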


8. Suppose that the mean age of all pennies currently in circulation in the U.S. is 12.3 years, and the standard deviation of these ages is 9.6 years.  Suppose also that you take a random sample of 50 pennies and calculate the mean age of the pennies in your sample.

  • a) Are the numbers 12.3 and 9.6 parameters or statistics?  Explain briefly.
  • b) Describe the sampling distribution of the sample mean penny age.  Also produce a well-labeled sketch of this sampling distribution.
  • c) Determine the probability that the sample mean age of your random sample of 50 pennies would be less than 10 years. (Show your work.)
  • d) Are your answers to parts (b) and (c) approximately valid even if the distribution of penny ages is not normally distributed? Explain.
  • e) Based on the values of the mean and standard deviation of penny ages, there is reason to believe that the distribution of penny ages is not normally distributed.  Explain why.

This quiz is a challenging one, because the Central Limit Theorem is a challenging topic.  Part (a) allows students to earn a fairly easy point.  Those numbers are described as pertaining to all pennies in circulation, so they are parameters.  I’m looking for four things in response to part (b): shape (normal), center (mean 12.3 years), and variability (SD 9.6/sqrt(50) ≈ 1.36 years), along with a sketch that specifies sample mean age as the axis label.  Even if a student group has not answered parts (b) and (c) correctly, they can still realize that the large sample size of 50 means that the distribution of the sample mean will be approximately normal, so the answer to part (d) is that the answers to parts (b) and (c) would be valid.  Part (e) is a very challenging one that brings to mind the AP Statistics question discussed in post #8 (End of alphabet, here).  I have in mind here that the value 0 is only about 1.28 standard deviations below the mean, so about 10% of pennies would have a negative age if the penny ages followed a normal distribution, which is therefore not plausible.
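For teachers who want to check parts (b) and (c) numerically, here is a short sketch with scipy (my own code, using the values from the quiz):

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 12.3, 9.6, 50
se = sigma / sqrt(n)                      # standard error, about 1.36 years

# Part (c): P(sample mean < 10)
print(norm.cdf(10, loc=mu, scale=se))     # about 0.045
```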


9. A study conducted in Dunedin, New Zealand investigated whether wearing socks over shoes could help people to walk confidently down an icy footpath*.  Volunteers were randomly assigned either to wear socks over their usual footwear or to simply wear their usual footwear, as they walked down an icy footpath.  An observer recorded whether or not the participant appeared to be walking confidently.

  • a) Is this an observational study or an experiment?  Explain briefly.
  • b) Identify the explanatory and response variables.
  • c) Does this study make use of random sampling, random assignment, both, or neither?
  • d) Did the researchers use randomness in order to give all walkers in New Zealand the same chance of being selected for the study?  Answer YES or NO.
  • e) Did the researchers use randomness in order to produce groups that were as similar as possible in all respects before the explanatory variable was imposed?  Answer YES or NO.

* This may not be scientific research of the greatest import, but this is a real study, not a figment of my imagination.  That this study was conducted in New Zealand makes it all the more appealing.  I hope my students enjoy this context as much as I do, but they are probably too focused on answering the quiz questions to notice.

Parts (a) – (c) should come as no surprise to students, as I ask these questions all the time in class.  (See post #11, Repeat after me, here.)  I especially like parts (d) and (e), which ask about the purpose of randomness in data collection.  Most students realize that random assignment does not give all walkers the same chance of being selected but does try to produce groups that are as similar as possible.  (See posts #19 and #20, Lincoln and Mandela, here and here, for more about random sampling and random assignment.)


10. Recall that a study conducted in Dunedin, New Zealand investigated whether wearing socks over shoes could help people to walk confidently down an icy footpath*.  Participants were randomly assigned to wear socks over their usual footwear, or to simply wear their usual footwear, as they walked down an icy footpath.  One of the response variables measured was whether an observer considered the participant to be walking confidently.  Results are summarized in the 2×2 table of counts below:

For parts (a) – (c), suppose that you conduct a by-hand simulation analysis to investigate whether wearing socks over shoes increases people’s confidence while walking down an icy footpath.  For parts (d) and (e), consider the results of such a simulation analysis performed with technology.

  • a) What would be the assumption involved with producing the simulation analysis?  Choose one of the following options: A. That wearing socks over shoes has no effect on walkers’ confidence; B. That wearing socks over shoes has some effect on walkers’ confidence; C. That walkers are equally likely to feel confident or not, regardless of whether they wear socks over shoes or not; D. That walkers are more likely to feel confident if they wear socks over shoes
  • b) How many cards would you use in the simulation analysis?  What would the color breakdown be? 
  • c) How many cards would you deal out into groups?  How many times would you repeat this process?
  • d) The graph below displays the results of a simulation analysis with 10,000 repetitions, displaying the distribution of the difference in success proportions between the two groups.  Describe how you would calculate an approximate p-value from this graph (i.e., where would you count?).
  • e) Based on the 2×2 table of data and on this graph of simulation results, how much evidence do the data provide in support of the conjecture that wearing socks over shoes increases people’s confidence while walking down an icy footpath?  Choose one of the following options: A. little or no evidence; B. moderate evidence; C. strong evidence; D. very strong evidence.

* This study is too fascinating to use only once!

This quiz assesses how well students understood a class activity about simulation-based inference for comparing proportions between two groups*.  Part (a) asks for the null hypothesis, without using that term.  Parts (b) – (c) concern the nuts and bolts of conducting a simulation analysis by hand.  Parts (d) and (e) address using the simulation analysis to draw a conclusion.  The hardest part for students is realizing that they need to see where the observed value of the statistic (difference in success proportions between the two groups) falls in the simulated null distribution.  I could have made this more apparent by first asking students to calculate the value of the statistic.  Instead I only give a small hint at the beginning of part (e) by reminding students to use the 2×2 table of observed counts as well as the graph of simulation results.  In this case the observed value of the statistic (10/14 – 8/15 ≈ 0.181) is not a surprising result in the simulated null distribution, so the study provides little or no evidence that wearing socks over shoes is helpful.

* My next blog post (#27) will describe and discuss such a class activity.


11. Researchers at Stanford University studied whether a curriculum could help to reduce children’s television viewing. Third and fourth grade students at two public elementary schools in San Jose were the subjects. One of the schools, chosen at random, incorporated an 18-lesson, 6-month classroom curriculum designed to reduce watching television and playing video games, whereas the other school made no changes to its curriculum. All children were asked to report how many hours per week they spent on these activities, both before the curriculum intervention and afterward.  The tables below summarize reported amounts of television watching, first at the beginning of the study and then at its conclusion:

  • a) Is the response variable in this study categorical or numerical?
  • b) The difference between the groups can be shown not to be statistically significant at the beginning of the study.  Do you think the researchers would be pleased by this result?  Explain why or why not.
  • c) Even if the distributions of reported amounts of television watching per week are sharply skewed, would it still be valid to apply a two-sample t-test on these data?  Explain briefly.
  • d) Calculate the value of the test statistic for investigating whether the two groups differ with regard to average amount of television watching per week.
  • e) Based on the value of the test statistic, summarize your conclusion for the researchers.

Part (a) is quite straightforward, offering an easy point for students.  I really like part (b), which asks students to realize that a non-significant difference between the groups at the beginning of the study is a good thing. The lack of significance suggests that random assignment achieved its goal of producing similar groups prior to the intervention.  For part (c) students should recognize that the large sample sizes establish that the two-sample t-test is valid even with skewed distributions.  Notice that the only calculation in the quiz is part (d).  The value of the test statistic in part (d) turns out to be 3.27, which is large enough to conclude in part (e) that the intervention reduced the mean amount of television watching.


12. Answer the following:

  • a) Would you expect to find a positive or negative correlation coefficient between high temperature on January 1, 2020 and distance from the equator, for a sample consisting of one city from each of the 50 U.S. states?  Explain briefly.
  • b) Suppose that you record the daily high temperature and the daily amount of ice cream sold by an ice cream vendor at your favorite beach next summer, starting on the Friday of Memorial Day weekend and ending on the Monday of Labor Day weekend.  Would you expect to find a positive or negative correlation coefficient between these variables?  Explain briefly.
  • c) Suppose that every student in this class scored 5 points lower on the second exam than on the first exam.  Consider the correlation coefficient between first exam score and second exam score.  What would the value of this correlation coefficient be?  Explain briefly.

Parts (d) and (e) pertain to the graph below, which displays data on the age (in months) at which a child first speaks and the child’s score on an aptitude test taken later in childhood:

  • d) Is the value of the correlation coefficient between these variables positive or negative?
  • e) Suppose that the child who took 42 months to speak were removed from the analysis.  Would the value of the correlation coefficient between the variables be closest to -1, 0, or 1?

This quiz addresses association and correlation between two numerical variables.  Parts (a) and (b) ask students to think about a context to determine whether an association would be positive or negative.  Part (c) is very challenging, as I discussed in post #21 (Twenty final exam questions, here).  Many students believe that the correlation must be negative, and some even respond that the correlation coefficient will equal -5!  The correct answer is that the correlation would be exactly 1.0, because the data would fall on a straight line with positive slope.
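A tiny demonstration of part (c), using hypothetical exam scores of my own invention (requires Python 3.10 or later for statistics.correlation):

```python
from statistics import correlation

# Hypothetical first-exam scores; second-exam scores are exactly 5 lower.
exam1 = [82, 65, 90, 74, 58]
exam2 = [x - 5 for x in exam1]
print(correlation(exam1, exam2))   # 1.0 (up to floating-point rounding)
```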

Parts (d) and (e) pertain to one of my all-time favorite datasets, which I encountered in Moore and McCabe’s Introduction to the Practice of Statistics near the beginning of my teaching career.  For this quiz I want students to realize that the correlation coefficient is negative but would be close to zero if the child who took the longest to speak were removed.


13. Some of the statistical inference procedures that we have studied include:

  • A. One-sample z-procedures for a proportion
  • B. Two-sample z-procedures for comparing proportions
  • C. One-sample t-procedures for a mean
  • D. Two-sample t-procedures for comparing means
  • E. Paired-sample t-procedures for comparing means

For each of the following questions, identify (by capital letter) which procedure you would use to address that question.  (Be aware that some letters may be used more than once, others not at all.)

  • a) Do cows tend to produce more milk if their handler speaks to them by name every day than if the handler does not speak to them by name?  A farmer randomly assigned half of her cows to each group and then compared how much milk they produced after one month.
  • b) A baseball coach wants to investigate whether players run more quickly from second base to home plate if they take a wide angle or a narrow angle around third base.  He recruits 20 players to serve as subjects for a study.  Each of the 20 players runs with each method (wide angle, narrow angle) once.
  • c) Does the average length of a sentence in a novel written by John Grisham exceed the average length of a sentence in a novel written by Louise Penny?  Students took a random sample of 100 sentences from each author’s most recent novel and recorded the number of words in each sentence.
  • d) Have more than 25% of Cal Poly students been outside of California in the year 2019?
  • e) Are Stanford students more likely to have been outside of California in the year 2019 than Cal Poly students?

I give a quiz like this once or twice in every course.  Students need practice with identifying which procedure to use in a particular situation.  It’s easy and appropriate for students to focus on one topic at a time, so I think we teachers need to ask questions like this that require students to synthesize what they’ve learned across topics.

Notice that the words proportion and mean do not appear in any of the five parts of this quiz, so students cannot simply look for those key words.  I tell students that the key to answering questions like this is to start by identifying the variable(s) and their types (categorical or numerical) and roles (explanatory or response).


The last of the six GAISE recommendations (here) is: Use assessments to improve and evaluate student learning.  The improve part of that recommendation can be very challenging to implement successfully.  I have found group quizzes to be very effective for motivating students to help each other with developing and strengthening their understanding of statistical concepts.

P.S. The study about wearing socks over shoes can be found here.  The study about children’s television viewing can be found here.  The data on age of first speaking can be found here.

P.P.S. The following link contains a Word file with the thirteen quizzes from this post and the previous one, along with solutions.  Teachers should feel free to modify this file for use with their own students.

#25 Group quizzes, part 1

I tell my students in my syllabi that they will take lots of quizzes in my course.  That raises an obvious question: How many is lots?  Let’s look at some data. This graph shows the distribution of number of quizzes that I have given in 25 courses over the past several years:

The median is 18 and the mean is 18.72 quizzes per course.  My courses meet for ten weeks (on the quarter system), so my students take an average of slightly fewer than 2 quizzes per week.  Many of my courses meet twice per week for 110 minutes per class meeting, so this amounts to about one quiz per class meeting.

Why do I give so many quizzes?  I tell students in my syllabi that the quizzes provide them with opportunities to:

  • improve their understanding of the material
  • assess how well they understand the material
  • prepare for the kinds of questions that will be on exams

With few exceptions these are group quizzes.  Even though I encourage students to work together and help each other as we work through in-class activities, I find that they engage much more concertedly during these group quizzes. Apparently the precious commodity of points toward their course grade provides a strong motivation.  Conversations that I overhear as students work on group quizzes convince me that they are truly helping each other to learn at these times.

Students usually form groups by self-selection.  Sometimes I mix things up by randomly assigning students to groups for a given quiz.  I typically limit group sizes to three students, but I allow them to form groups of any size for one quiz near the end of the course.  I emphasize that by putting their name on a quiz, they are attesting that they made a good-faith effort to help the group with the quiz.  Once in a while students abuse this rule by including the name of a classmate who was not even in class that day, but I don’t think this occurs too much.

These quizzes typically ask about topics that students are to have learned in class that day.  Students are almost always allowed to use their notes during the quiz. This policy encourages them to take good notes during class and also alleviates some of the stress of taking a quiz.  Once in a while, when I run out of class time, I’ll ask students to complete a quiz outside of class, but students are much less likely to work together on take-home quizzes.

Every quiz consists of five parts, each worth one point.  The lowest two (sometimes three) quiz scores in the course are dropped before the calculation of each student’s overall quiz percentage.  The quizzes usually contribute only 10% to the calculation of the overall course score.  I tell students on the first day of class that most students’ overall course score is helped by their quiz percentage.  The distribution of overall quiz percentages from one of my recent courses, shown in the graph below, is quite typical. The distribution is sharply skewed to the left, largely due to some students who miss more than a few quizzes, with a median of 87.35 percent and mean of 81.89 percent:

The remainder of this blog post, and all of the next one, consists of quizzes that I have given in introductory statistics courses, along with comments on each.  I will present thirteen quizzes, six in this post and seven to come next week.  The order of topics will reveal that I introduce simulation-based inference for a proportion early in the course.


1. For parts (a) and (b), consider the research question of whether students at Cal Poly – San Luis Obispo are more likely to wear clothing that says “Cal Poly” than students at Cal Poly – Pomona.  Suppose that you were to collect data for a statistical study of this question. 

  • a) Identify the explanatory variable, and classify it as numerical or categorical.
  • b) Identify the response variable, and classify it as numerical or categorical.

For parts (c) – (e), consider the patients who went to the emergency room at the local hospital last week as the observational units in a statistical study.

  • c) Identify one categorical variable that could be recorded on these observational units.
  • d) Identify one numerical variable that could be recorded on these observational units.
  • e) State a research question that you could investigate about these observational units, using at least one of the variables that you gave in part (c) or (d).

As I said in post #11 (here), I ask students about observational units and variables in nearly every example that we discuss in the entire course.  Not surprisingly, an early quiz focuses on this topic.  Most students do fine with parts (a) and (b), although some express the response variable poorly by saying something like “wearing Cal Poly clothing” instead of “whether or not the student wears Cal Poly clothing.”  Parts (c)-(e) are more challenging, as they ask students to think of their own variables. Part (e) is especially difficult; I have found that it’s much easier for students to describe variables than to state a research question to be investigated with those variables.  A good answer to part (e) does not need to be complicated.  For example, a fine answer is: “How long is the average waiting time to see a medical professional after arriving at the emergency room?”


2. Recall that you took samples of words from the population of words in the Gettysburg Address, for which the average length of a word is 4.295 letters.  Parts (a)-(d) refer to this situation.

  • a) Is the number 4.295 a parameter or a statistic? 
  • b) When you first selected your sample of 10 (circled) words, what was the variable?  Was it categorical or numerical?
  • c) What aspect of the first graph on the board indicated that the sampling method was biased?
  • d) Would selecting words by closing your eyes and pointing at the page 10 times produce an unbiased sampling method?  Explain briefly.
  • e) In general, does taking a very large sample (say, of millions of people) produce an unbiased sampling method?   Explain briefly.

Based on the Gettysburg Address activity that I described in post #19 (here), this quiz assesses whether students have understood the activity that they worked through in class that day.  They can answer parts (a)-(d) directly from their notes, so this quiz should provide easy points for students.  Part (e) requires students to stop and think a bit. I hope they’ll remember our class discussion of the infamous Literary Digest poll from 1936, which illustrates that a very large sample size does not guarantee an unbiased sampling method. I don’t often give quizzes that are this straightforward and can be answered with little new thinking, but I believe this makes for a nice change of pace. I also think it’s okay now and then for a quiz to reward students for being in class and paying attention.


3. Researchers investigated whether they could correctly predict the outcome of an election, more often than not, by selecting the candidate whose face is judged (by a majority of people interviewed) to be more competent-looking.  They applied this prediction method to 32 U.S. Senate races in 2004.  The “competent face” method correctly predicted the winner in 23 of the 32 races.

  • a) What are the observational units in this study, and what is the sample size?
  • b) Describe (in words) the null hypothesis to be tested.

Consider the following results of a simulation analysis with 10,000 repetitions, for testing whether the competent face method would correctly predict the winner in more than half of all races:

  • c) Describe how you would use the simulation results to approximate the p-value of the test.
  • d) The p-value turns out to be approximately 0.01.  Write a sentence interpreting this p-value in context (probability of what, assuming what?).
  • e) Do the sample data provide strong evidence in support of the “competent face” prediction method?  Justify your answer, based on the simulation analysis.

This quiz assesses students’ understanding of simulation-based inference as presented early in the course.  Students would have seen an example such as the one presented in post #12 (here) before taking this quiz.

The second question in part (a) is meant to help students answer the first question.  If they realize that the sample size is 32, they can stop and ask themselves: 32 of what?  This should lead them to recognize that the 32 Senate races are the observational units, not the people who were interviewed to determine which candidate’s face is judged to be more competent.

Part (c) requires students to specify how they would use the applet to determine the approximate p-value, without needing to give them access to the applet.

Notice that part (d) gives a big hint about the two things (in addition to context) that students should include in their interpretation of a p-value: probability of obtaining 23 or more correct predictions in 32 races, assuming that the competent-face method would be correct for 50% of all races in the long run.  I do not give these hints later in the course, when I expect students to have a firmer grasp of interpreting a p-value, but I think such a hint is appropriate and helpful early on.
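If you want to confirm that the p-value is approximately 0.01, here is a sketch of both the exact binomial calculation and a simulation (my own code, not the applet students use):

```python
import random
from math import comb

n, k = 32, 23

# Exact: P(X >= 23) for X ~ Binomial(32, 0.5)
exact = sum(comb(n, x) for x in range(k, n + 1)) / 2**n
print(f"exact p-value: {exact:.4f}")        # about 0.010

# Simulation: 32 fair coin flips, repeated 10,000 times
reps = 10_000
hits = sum(sum(random.random() < 0.5 for _ in range(n)) >= k
           for _ in range(reps))
print(f"approximate p-value: {hits / reps:.4f}")
```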


4. Suppose that a tire manufacturer believes that the lifetimes of its tires follow a normal distribution with mean 50,000 miles and standard deviation 5,000 miles.

  • a) Based on the empirical rule, 95% of tires last for between what two values?
  • b) How many standard deviations above the mean is a tire that lasts for 58,500 miles?
  • c) Determine the probability that a randomly selected tire lasts for more than 58,500 miles.
  • d) Determine the mileage for which only 25% of all tires last longer than that mileage.  Show how you arrive at your answer.
  • e) Suppose the manufacturer wants to issue a money-back guarantee for its tires that fail to achieve a certain number of miles.  If they want 99% of the tires to last for longer than the guaranteed number of miles, how many miles should they guarantee?  Show how you arrive at your answer.

Parts (a)-(d) are routine questions about the empirical rule, z-scores, and calculating probabilities and percentiles from normal distributions.  I think these provide good practice of the basics of what students are to have learned in class that day.  Students can use a table of standard normal probabilities, or a calculator with a normal probability function, or a software tool to answer parts (c) and (d).

Part (e) is not especially hard, but it does trip up a few students.  Can you guess a fairly common mistake that I find very disheartening?  The correct z-score is -2.326, so the correct answer for the lifetime to be guaranteed is 50,000 – 2.326×5000 ≈ 38,370 miles.  But some students look up the z-score for the 99th percentile and obtain positive 2.326, which produces an answer of 50,000 + 2.326×5000 ≈ 61,630 miles.  I always cringe at this response, because these students have not noticed that their calculation does not make sense: This answer would mean that the company would give 99% of customers their money back!  I also cringe because these students have neglected to follow my first piece of advice about calculations involving normal distributions: Start with a sketch!  If they had started with a sketch, they would have seen that only 1% of tire lifetimes exceed 61,630 miles, and 38,370 miles is the value for which 99% exceed that lifetime:
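Here is a quick check with scipy that makes the direction of the calculation hard to get wrong (my code, not part of the quiz):

```python
from scipy.stats import norm

mu, sigma = 50_000, 5_000

# We want the mileage that 99% of tires EXCEED: the 1st percentile.
print(round(norm.ppf(0.01, loc=mu, scale=sigma)))   # about 38,370 miles

# The common mistake computes the 99th percentile instead,
# which only 1% of tires exceed.
print(round(norm.ppf(0.99, loc=mu, scale=sigma)))   # about 61,630 miles
```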


5. Recall the formula for the z-test statistic when conducting a hypothesis test about a proportion:

  • a) What does the symbol p-hat* represent?  (Be as specific as possible.)
  • b) What does the symbol pi_0* represent?  (Be as specific as possible.)
  • c) What does the symbol n represent?
  • d) For a given value of n, what happens to the absolute value of the test statistic as the difference between p-hat and pi_0 increases?
  • e) For a given value of n, what happens to the p-value as the difference between p-hat and pi_0 increases?

* Students see the symbols rather than the words p-hat and pi_0 here.

Parts (a)-(c) simply ask students to explain what three symbols represent.  I grade parts (a) and (b) very strictly.  Full credit for part (a) requires saying that the p-hat symbol represents a sample proportion.  A response to part (b) must mention a hypothesized value of the population proportion, with all three italicized words required for full credit.  I think these are worthwhile questions, because it’s crucial for students to recognize what these symbols mean in order to understand hypothesis testing about a proportion.

Students can think about parts (d) and (e) either conceptually or mathematically.  The z-statistic measures the difference between the sample proportion and the hypothesized value of the population proportion in terms of number of standard deviations.  A larger difference produces a larger absolute value of the test statistic.  A larger difference also provides stronger evidence against the null hypothesis and therefore a smaller p-value.
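A small numerical illustration of parts (d) and (e), holding n fixed and widening the gap between p-hat and pi_0 (the sample values here are hypothetical):

```python
from math import sqrt
from scipy.stats import norm

def z_and_p(p_hat, pi0, n):
    """One-proportion z-statistic and two-sided p-value."""
    z = (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)
    return z, 2 * norm.sf(abs(z))

# As p-hat moves away from pi_0 = 0.5 with n = 100 fixed,
# |z| grows and the p-value shrinks.
for p_hat in [0.55, 0.60, 0.65, 0.70]:
    z, p = z_and_p(p_hat, 0.50, 100)
    print(f"p-hat = {p_hat:.2f}: z = {z:4.2f}, p-value = {p:.4f}")
```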


6. A Harris Poll that surveyed 2225 adult Americans on October 14-19, 2015 found that 29% reported having at least one tattoo.

  • a) Is 29% (.29) a parameter or a statistic?  What symbol do we use for it?
  • b) Determine (by hand) a 95% confidence interval for the relevant parameter.
  • c) Interpret this interval: You’re 95% confident that __________ is between ____ and ____ .
  • d) How would a 99% confidence interval differ (if at all) from the 95% one?  Comment on both the midpoint and width of the interval.  (Do not bother to calculate a 99% confidence interval.)
  • e) The same Harris Poll also found that 47% of respondents between the ages of 18-35 reported having at least one tattoo.  How would a 95% confidence interval for this age group, based on this survey, compare to the 95% confidence interval that you found in part (b)?  Comment on both the midpoint and width of the interval.

Parts (a)-(c) ask students to apply what they learned in class to a new study.  Notice that I avoid identifying the parameter for them in part (b), because I want students to describe the parameter clearly in their interpretation in part (c), not just repeat back what I’ve already said.  I do not always provide the structure for interpreting a confidence interval as in part (c), but I think this is helpful in a quiz setting.

In part (d) most students realize that the confidence interval would become wider with a larger confidence level, and I also want them to note that the midpoint would stay at .29, the value of the sample proportion with a tattoo.  Part (e) makes students think, as we did not answer a question like this in class.  They need to recognize that the sample size would be smaller for the restricted group, so the confidence interval would become wider.  They also need to see that the sample proportion (with a tattoo) is larger for the subgroup, so the midpoint of this interval would be larger than for the original interval.
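To verify part (b) and see the effect of the confidence level in part (d), here is a short sketch (my own code; the quiz does not give the subgroup sample size for part (e), so I do not attempt that comparison here):

```python
from math import sqrt
from scipy.stats import norm

p_hat, n = 0.29, 2225

for conf in [0.95, 0.99]:
    z = norm.ppf(1 - (1 - conf) / 2)            # critical value
    moe = z * sqrt(p_hat * (1 - p_hat) / n)     # margin of error
    print(f"{conf:.0%} CI: ({p_hat - moe:.3f}, {p_hat + moe:.3f})")
# Both intervals share the midpoint 0.29; the 99% interval is wider.
```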


The word quiz can be very valuable when playing Scrabble, and I find group quizzes to be very valuable for my students’ learning.  I fulfill the promise in my syllabi to give lots of quizzes.  Most students respond admirably by engaging with each other to support their learning as they discuss and respond to the quiz questions.  In next week’s blog post I will provide seven more quizzes, again with five parts each, that I have used with my students.

P.S. The article about predicting elections can be found here.  A report on the Harris poll about tattoos can be found here.

P.P.S. I will provide a link to a Word file with these quiz questions, which teachers can modify as they’d like for use with their students, at the end of next week’s post.