#38 Questions from prospective teachers

My Cal Poly colleague Anelise Sabbag recently asked me to meet with students in her undergraduate course for prospective teachers of statistics.  Most of the students in the class are majoring in Statistics, Mathematics, or Liberal Studies, the last of which is for students preparing to teach at the elementary or middle school level.

Prior to my visit, Anelise asked her students to pose a question to me about teaching statistics.  I was very impressed with the thoughtfulness of their questions, so much so that I decided to write this blog post with some of my responses and reflections.  Breaking from this blog’s custom, questions in italics in this post were posed to me by these students.


1. What is the hardest topic to teach in introductory statistics? 

This is a great question, one that I’ve thought about a lot.  My answer is: how the value of a sample statistic varies from sample to sample, if we were to repeatedly take random samples from a population.

Of course, I could have expressed this answer with just two words: sampling distributions.  But while this two-word phrase provides a very handy shorthand for people who already understand the concept, I’m not convinced that using this term is helpful to students who are first learning the idea.

In fact, let me back up and split my answer into two parts: Before we can ask students to learn and understand sampling distributions, we need to begin with the more basic notion of sampling variability.  In other words, first we must help students recognize that the value of a sample statistic varies from sample to sample, before we tackle the more challenging* idea that this variability displays a predictable, long-run pattern.  That predictable, long-run pattern is what we mean by the term sampling distribution.

* This idea is not only challenging, it’s remarkable!  Isn’t it amazing that the long-run variability of a sample mean or a sample proportion turns out (in many circumstances, anyway) to follow a beautiful bell-shaped curve?!

Why is this topic so hard?  I suggest two reasons: First, it’s always difficult to comprehend a hypothetical: What would happen if …?  This hypothetical is central to many concepts in statistics, including probability, p-value, and confidence level, as well as sampling distribution.  Second, we’re asking students to move beyond thinking of a sample statistic (such as a mean or a proportion) as a straightforward calculation that produces a number, to thinking of the statistic as a random variable that varies from sample to sample.  This is a very big cognitive step that requires a lot of careful thought*.

* An analogy from calculus is the large cognitive step from thinking of the slope of a tangent line to a curve at a point as a number, to then considering the slope of the tangent line to the curve at all points as a function.

What can be done to help students overcome their difficulties with this concept?  I will explore this question in a future post, but my one-word answer will come as no surprise: Simulate!
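For example, a minimal R sketch of such a simulation, using a hypothetical sample size of 50 and population proportion of 0.3:

```
# Simulate the sampling distribution of a sample proportion (n = 50, p = 0.3)
set.seed(1)
props <- replicate(10000, mean(rbinom(50, size = 1, prob = 0.3)))
hist(props, xlab = "Sample proportion",
     main = "10,000 simulated sample proportions")
```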


2. What do math majors struggle with when studying statistics?

First, I want to emphasize that math majors, and other students who are comfortable with math, struggle with the same challenging concepts that other students do, such as sampling distributions.  I rely on simulations to introduce math majors to sampling distributions, just as with students who are less mathematically inclined*. 

* I also explore this concept in more mathematical ways with math majors.  For example, I lead them to determine the exact sampling distribution of a sample mean in a sample of size 2 or 3 from a small population or discrete probability distribution.
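For instance, a sketch of that enumeration in R, using a hypothetical population of five values:

```
# Exact sampling distribution of the sample mean for samples of size 2,
# drawn without replacement from a hypothetical five-value population
pop <- c(1, 3, 5, 7, 9)
samples <- combn(pop, 2)         # all 10 equally likely samples of size 2
xbars <- colMeans(samples)       # sample mean of each possible sample
table(xbars) / ncol(samples)     # exact sampling distribution of the sample mean
```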

Math majors can also struggle with the fundamental ideas of uncertainty and variability.  Probabilistic thinking can come as a bit of a shock after the deterministic thinking with which they are likely more comfortable.  A related issue is tolerance for ambiguity, as math majors (and all students) can be uncomfortable with the lack of certainty associated with statistical conclusions.  In their statistics courses, students must learn to write conclusions such as “there is strong evidence that …” and “we can be very confident that …” and “the data reveal a general tendency that …”  These conclusions stand in stark contrast to the kind that might be more in the comfort zone for math majors, such as “the exact answer is …” and “we have therefore proven that …”

Speaking of writing, that’s another aspect of statistics courses that can frustrate some math majors.  Assessments in statistics courses often require students to write sentences, perhaps even paragraphs, rather than provide a single number as an answer.  These questions often begin with verbs – such as describe, compare, explain, justify, interpret – that might intimidate students who are more comfortable responding to prompts that begin with verbs such as calculate, derive, show, or even prove.

Another potential source of frustration is that much of mathematics involves abstraction, whereas statistics depends heavily on context.


3. How can teachers provide students with enough knowledge to prepare them to investigate good questions?

This question is a close cousin of one that Beth Chance and I are often asked by teachers who attend our workshops: How do you find time to include activities in class? 

I think many teachers underestimate students’ ability to create their own understanding through well-designed learning activities.  I do not accept that teachers need to lecture on a topic, or have students watch a video or read a chapter on the topic, before they turn students loose on an activity.  The questions in the activities can lead students to new knowledge.  Necessary terminology and notation can be embedded in the activity.  Teachers can lead a discussion following the activity that reinforces key take-away lessons for students.

Here are three examples:

  • The Gettysburg Address activity described in post #19 (here) is a long one that can take most or all of a 50-minute class session.  But this activity introduces students to many concepts, including sampling bias, random sampling, sampling variability, sampling distribution, and the effect of sample size on sampling variability.
  • The Random Babies activity described in posts #17 and #18 (here and here) leads students to fundamental ideas of probability as a long-run proportion and expected value as a long-run average, along with topics such as sample space, equally likely outcomes, mutually exclusive events, and the complement rule.
  • The simulation-based inference activities of posts #12 and #27 (here and here) enable students to discover the reasoning process of statistical inference, specifically hypothesis tests and p-values.  Teachers do not need to provide a multi-step outline for how to conduct a hypothesis test prior to engaging students with these activities.  They do not even need to define a null hypothesis or a p-value in advance.  Instead, teachers can introduce those terms after students have encountered the ideas in the context of real data from a genuine study.  

4. What lessons have I learned from students?

I did not expect this question.  I think this is one of the best I’ve ever been asked.  This question truly caused me to pause and reflect. 

But I must admit that despite this considerable reflection, my answer is not the least bit clever or insightful.  Here’s my list of very basic things that I believe students value and teachers should prioritize:

  • Show respect.
  • Be organized.
  • Make expectations clear.
  • Provide timely feedback. 
  • Stay humble. 

The first four items in this list are so obvious that they need no explanation.  About the last one: I like to believe that I have continued to learn more and more as time has gone by.  One thing I have surely learned is that there is so much that I don’t know: about the subject matter, about how students learn, and about everything else involved with teaching statistics.  I have also come to realize that my course is certainly not the center of my students’ world.  I also need to remember that no student will master every detail or retain every skill developed in my course.  It’s fine for me to set high expectations for my students, but I also need to keep my expectations reasonable.


5. What advice do I offer to prospective teachers of statistics?

My #1 piece of advice is no secret, but first I’ll offer two other suggestions, which I hope are less predictable.

At the beginning of my teaching career, I learned a great deal about statistics, and formed a lot of ideas about how to teach students about statistical concepts, from reading textbooks for a basic course in statistical literacy: David Moore’s Statistics: Concepts and Controversies, Freedman, Pisani, and Purves’s Statistics, and Jessica Utts’s Seeing Through Statistics.  I have also enjoyed and learned a lot from books aimed at broader audiences that involve statistics and data.  Two examples on the history of statistics are David Salsburg’s The Lady Tasting Tea and Sharon Bertsch McGrayne’s The Theory That Would Not Die.  Examples from other fields include Freakonomics by Steven Levitt and Stephen Dubner and Thinking, Fast and Slow by Daniel Kahneman.  My first piece of advice is: Read non-technical books.

More than two decades ago, I invited Jim Bohan, a high school math teacher and math coordinator for his school district, to speak to students at my college who were considering whether to pursue math teaching as a career.  I’ll never forget Jim’s advice to these students: Don’t go into teaching because you love math; go into teaching because you love working with kids.  He reinforced his point by adding: When people ask me what I teach, I don’t say that I teach math; I say that I teach kids.  Jim’s message resonated with me and leads to my second piece of advice: Pay attention to the human side of teaching and learning.  

Now for the anti-climax …  The final piece of advice that I offered to the prospective teachers in Anelise’s class, the three words that I hoped to impress upon them more than any others, will surprise no one who is reading this blog*: Ask good questions! 

* If this is my first post that you’ve read, please consider reading post #1 (here) that provided an overview of this blog and my teaching philosophy. You can also find a convenient list of all posts (here).

#37 What’s in a name?

Delivered by Juliet on her iconic balcony, Shakespeare’s poetic what’s in a name speech is one of the most famous in the English language. 

What does this have to do with teaching introductory statistics?  Well, there’s a lot of data that one could collect on students’ names.  An obvious but boring example is to count the number of letters in a name.  A more fun, albeit silly, option is to determine the number of Scrabble points in a name.  I often collected these data from students early in my teaching career.

I have abandoned this practice in recent years, primarily because I want students to analyze more important and consequential datasets.  But I am starting to rethink this for two reasons.  First, there’s nothing wrong with occasionally using a dataset that’s silly and fun.  Second, this dataset is rich with potential for achieving pedagogical goals, such as exploring the seemingly simple but actually challenging concept of a statistical tendency.

I mentioned in post #35 (Statistics of illumination, part 4, here) that psychologist Keith Stanovich has called probabilistic reasoning “the Achilles heel of human cognition.”  Data on Scrabble points in students’ names can help students to confront misunderstandings about this topic.  The research question to be investigated is: Do people with longer names have more Scrabble points?  The answer is yes, as long as we are careful to regard this as a statistical tendency and not a hard-and-fast rule.  What does this mean?  We’ll explore that question in this post.  As always, questions for posing to students appear in italics.


Here are the Scrabble point values of the 26 letters: 1 point for A, E, I, L, N, O, R, S, T, and U; 2 points for D and G; 3 points for B, C, M, and P; 4 points for F, H, V, W, and Y; 5 points for K; 8 points for J and X; and 10 points for Q and Z.

Just to make sure that we’re all clear: Shakespeare has 11 letters and 20 Scrabble points (1 + 4 + 1 + 5 + 1 + 1 + 3 + 1 + 1 + 1 + 1 = 20) in his name.  Juliet has 6 letters and 13 points, and Romeo has 5 letters and 7 points*.  Among these three names, a name with more letters than another always has more Scrabble points.  So far this is a rule and not just a tendency.  But one need not look far to find an exception to this rule: Romeo’s friend Mercutio has 8 letters but only 12 points, so his name has more letters but fewer points than Juliet’s.

* This website (here) is helpful for speeding up these calculations.

Because longer names do tend to produce more points than shorter names, I propose examining the ratio of a name’s points to letters as a measure of the name’s Scrabble-strength.  These ratios are approximately 1.818 for Shakespeare, 2.167 for Juliet, 1.400 for Romeo, and 1.500 for Mercutio.  By this measure, Juliet has the Scrabble-strongest name and Romeo the weakest.
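For readers who want to compute these values for other names, here is a small R sketch; the letter values are the standard Scrabble ones listed above, and the `scrabble_summary` helper is just an illustrative name:

```
# Standard Scrabble letter values
scrabble <- c(a=1, b=3, c=3, d=2, e=1, f=4, g=2, h=4, i=1, j=8, k=5, l=1, m=3,
              n=1, o=1, p=3, q=10, r=1, s=1, t=1, u=1, v=4, w=4, x=8, y=4, z=10)

# Count letters, total points, and points-to-letters ratio for a name
scrabble_summary <- function(name) {
  chars <- strsplit(tolower(gsub("[^A-Za-z]", "", name)), "")[[1]]
  points <- sum(scrabble[chars])
  c(letters = length(chars), points = points,
    ratio = round(points / length(chars), 3))
}

scrabble_summary("Shakespeare")   # 11 letters, 20 points, ratio 1.818
scrabble_summary("Mercutio")      #  8 letters, 12 points, ratio 1.500
```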


For the purpose of this post, I will use the names of the fifty states in the U.S. to illustrate the kinds of questions that can be asked and analyses that can be performed.  Before I show you the data: Make a guess for the state with the most letters, the state with the most points, and the state with the largest ratio.  Are your guesses made?  Ready for the answers?  Here’s the list of states, along with region of the country, number of words in the name, number of letters, number of Scrabble points, and ratio of points to letters:

Which states win bragging rights?  As indicated in bold in the table, Massachusetts, North Carolina, and South Carolina tie for the most letters with 13.  New Hampshire takes first place in number of Scrabble points with 25.  In the most coveted ratio category, the winner is Kentucky with 2.625 points per letter.

Now let’s return to the original question: Do states with more letters tend to have more points?  But first we can ask students a more basic question: What kind of graph would you produce to investigate this question?  Because both variables are numerical, we can examine a scatterplot of points versus letters:

Does this graph reveal a tendency for states with more letters to have more points?  How can you tell?  Yes, this graph displays a positive association between points and letters.  For example, states with 10 or more letters all have 15 or more points, whereas states with 6 or fewer letters all have 12 or fewer points.

What statistic could we calculate to quantify this tendency?  Make an educated guess for its value with these data.  Again because these are both numerical variables, we can calculate the correlation coefficient between points and letters.  Its value turns out to be 0.735, indicating a fairly strong, positive association.

Can you identify a pair of states for which the state with more letters has fewer points?  Indicate one such pair of states on the graph.  Many pairs of states buck the overall trend in this way.  One example is that Kentucky has fewer letters (8 vs. 9) but more points (21 vs. 15) than California, as shown here:

Another way to think about a statistical tendency is: Suppose that we select two of the fifty states at random.  Make an educated guess for the probability that the state with more letters also has more points.  This is equivalent to asking: Among all pairs of states, in what percentage does the state with more letters also have more points?  There are 50×49/2 = 1225 pairs of states.  I wrote a small program in R to analyze these 1225 pairs.  It turns out that 161 of the pairs have the same number of letters or the same number of points.  Of the remaining 1064 pairs, 802 satisfy the overall tendency (the state with more letters also has more points), and 262 run counter to that tendency.  So, the probabilities for a randomly selected pair of states are 802/1225 ≈ 0.655 that the state with more letters also has more points, 262/1225 ≈ 0.214 that the state with more letters has fewer points, and 161/1225 ≈ 0.131 that the states have the same number of letters or points.  If we restrict our attention to the pairs of states without ties, the probability is 802/1064 ≈ 0.754 that the state with more letters also has more points.  To simplify: The overall probability is about two-thirds that the state with more letters has more points, and this increases to three-fourths if we eliminate ties.
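Code along the following lines does the tallying; this sketch assumes a data frame called `states` with numeric columns `letters` and `points` (the names are illustrative):

```
# Tally all pairs of states: does the state with more letters also have more points?
pairs <- combn(nrow(states), 2)
tally <- c(follows = 0, counter = 0, tie = 0)
for (k in 1:ncol(pairs)) {
  i <- pairs[1, k]; j <- pairs[2, k]
  d_letters <- states$letters[i] - states$letters[j]
  d_points  <- states$points[i]  - states$points[j]
  if (d_letters == 0 || d_points == 0) {
    tally["tie"] <- tally["tie"] + 1
  } else if (sign(d_letters) == sign(d_points)) {
    tally["follows"] <- tally["follows"] + 1
  } else {
    tally["counter"] <- tally["counter"] + 1
  }
}
tally               # counts among the 1225 pairs
tally / sum(tally)  # probabilities for a randomly chosen pair of states
```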

The following graph displays the least squares regression line for predicting number of points from number of letters:

How many points would this line predict for Puerto Rico, if it were to become a state?  Puerto Rico has 10 letters, so the line would predict 1.611 + 1.460 × 10 = 16.211 points.  Assess the accuracy of this prediction.  Puerto Rico actually has 14 Scrabble points, so the prediction overestimates by 2.211 points, which is an overestimate of about 15.8%.  Repeat for Guam.  Guam has 4 letters, so the line would predict 1.611 + 1.460 × 4 = 7.451 points, compared to 7 actual points for Guam.  This is an overestimate of only 0.451 points, for a percentage error of just 6.4%.


We can also use these data to look for a statistical tendency in comparing two groups.  Do you expect that states with two words in their name will tend to have more Scrabble points than states with one-word names?

Before we investigate this question, here’s a more basic one: What kind of graph can we use to answer these questions?  Most students recognize that this question involves comparing two groups on a numerical response, so we can use comparative dotplots or boxplots.  Consider these graphs:

Do these graphs reveal that two-word states tend to have more points than one-word states?  Yes, absolutely.  From the boxplots, we see that every value in the five-number summary of points is higher for the two-word states than the one-word states.  In particular, the median number of points is 19 for the two-word states, 11.5 for the one-word states.  The dotplots also reveal that every two-word state has at least 16 points, but only 10 of 40 one-word states have 16 or more points. 

Can you identify a pair of states that do not satisfy the overall tendency?  Certainly, there are many such pairs.  The most extreme example is that (one-word) Kentucky has 21 points and (two-word) Rhode Island has 16 points.  I wrote some R code to analyze the 40×10 = 400 pairs of states (with a one-word and a two-word state) and found that 305 follow the overall tendency (meaning that the two-word state has more points than the one-word state), 68 run counter to the tendency, and 27 have a tie for number of points.
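Again assuming the `states` data frame, now with a `words` column recording the number of words in each name, a sketch of that comparison:

```
# Compare every (one-word, two-word) pair of states on number of Scrabble points
one <- states$points[states$words == 1]
two <- states$points[states$words == 2]
diffs <- outer(two, one, "-")   # 10 x 40 matrix of (two-word minus one-word) differences
c(follows = sum(diffs > 0), counter = sum(diffs < 0), ties = sum(diffs == 0))
```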

Could we incorporate information on number of words in the state’s name into the scatterplot of points versus letters?  If so, how?  What do you expect to see?  Yes, here is a coded scatterplot, with one-word states represented by blue circles and two-word states by red squares:

What does this graph reveal?  As expected, the two-word states appear in the top right part of the graph, indicating that they tend to have more letters and more points than one-word states.  Also, perhaps surprisingly, the association between points and letters is much weaker among the two-word states than the one-word states.


Now let’s bring the ratio (points to letters) variable into the analysis.  Which group – one-word states or two-word states – do you expect to have larger ratio values, on average?  Do you expect the difference between the groups to be substantial or small?  Ready to see the graphs?  Here you go:

Compare and contrast the distributions of ratio values between these groups.  The two-word states have slightly larger ratios, on average, than one-word states.  The medians are approximately 1.73 and 1.57 for the two-word and one-word states, respectively.  The two-word states also have more variability in ratio values than one-word states.  Both distributions appear to be slightly skewed to the right, more so for the two-word states.  The one-word states have two outliers on the high end of the ratio values – Kentucky (2.625 points per letter) and Texas (2.400 points per letter).

Now: What kind of relationship (if any) do you expect to see in a scatterplot of ratio versus letters?  How about in a scatterplot of ratio versus points?  Ready to find out?  Here are the graphs:

Describe what these graphs reveal.  Also make educated guesses for the values of the two correlation coefficients.  As expected, we see that the ratio is positively associated with number of points.  But the association is slightly negative with number of letters.  It turns out that the correlation coefficients are 0.554 between ratio and points and -0.142 between ratio and letters.

Which variable – points or letters – would be more helpful for predicting ratio?  Determine the equation of this least squares line.  Also calculate and interpret the value of r-squared.  Number of points will clearly be more helpful for predicting ratio than number of letters.  The equation of the least squares line turns out to be: predicted ratio = 1.034 + 0.04674 × points.  The value of r-squared is 30.7%, indicating that 30.7% of the variability in states’ ratios is explained by knowing their number of points.  This value is perhaps surprisingly small, considering that number of points is used directly in the calculation of ratio.

Do you think the regression model would be substantially improved by adding number of letters, as well as number of points, to the predictors of ratio?  Some students think that the answer is clearly yes, because number of letters is included in the calculation of ratio, just as number of points is included.  Other students believe that the answer is clearly no, because the scatterplot reveals a very weak association (correlation -0.142) between ratio and letters.  Software tells us that the regression model with both predictors is: predicted ratio = 1.6723 + 0.12095 × points – 0.20029 × letters.  It makes sense that the coefficient is positive for number of points and negative for number of letters.  The value of r-squared for this regression model is 96.4%, dramatically greater than the value of 30.7% based only on number of points as a single predictor.  The two predictors together do much better at predicting a state’s ratio value than the sum of their usefulness as individual predictors.
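A sketch of both regression fits, again assuming the `states` data frame (with a `ratio` column equal to points divided by letters):

```
# Simple and multiple regression models for predicting ratio
fit1 <- lm(ratio ~ points, data = states)
fit2 <- lm(ratio ~ points + letters, data = states)
summary(fit1)$r.squared   # about 0.31 with points alone
summary(fit2)$r.squared   # dramatically higher with both predictors
coef(fit2)                # positive coefficient for points, negative for letters
```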


Some students might ask themselves: Hold on, we can calculate a state’s ratio exactly from its number of points and number of letters, so why is r-squared not 100%?  The answer is that multiple regression incorporates variables in the model additively, whereas the calculation of ratio involves dividing points by letters.

Can we find a clever work-around that uses multiple regression to predict ratio from points and letters exactly, with an r-squared value of 100%?  Yes, we can.  The key is to transform all three variables by taking logarithms.  Here are scatterplots of the transformed data:

What do these graphs reveal?  We see a moderate positive association between log(ratio) and log(points), and there’s very little association between log(ratio) and log(letters).  These graphs provide no hint of what the multiple regression model will reveal.

The multiple regression model with these transformed variables turns out to be: log(ratio) = log(points) – log(letters).  The value of r-squared with this model is 100%!  What happens when you back-transform this equation to get rid of the logs*?  The right-hand side of the equation can be expressed as: log(points/letters).  Then exponentiating both sides of the equation produces what we knew it would: ratio = points/letters.

* How about that: a chance to work with properties of logarithms!
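A quick sketch of this log-scale regression, with the same assumed `states` data frame:

```
# Regression on the log scale recovers the exact relationship ratio = points/letters
fit_log <- lm(log(ratio) ~ log(points) + log(letters), data = states)
coef(fit_log)               # intercept near 0, slopes near 1 and -1
summary(fit_log)$r.squared  # essentially 1 (100%)
```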


Asking your students to analyze data from their own names is more fun than analyzing states’ names.  Let me show you some results from data on my colleagues rather than students.  The following graph comes from data on the 33 faculty members listed on the faculty directory webpage (here) for the Statistics Department at Cal Poly – San Luis Obispo in the Winter quarter of 2020:

The graph reveals who wins bragging rights: Maddie Schroth-Glanz has the most letters (18) and Scrabble points (40) in her name, while Jimmy Doi has the largest ratio of points to letters (23/8 = 2.875 points per letter)*.  The smallest values were proudly achieved by Soma Roy for the fewest letters (7) and Dennis Sun for the fewest points (10) and smallest ratio (1.111)**.

* I used names as they appeared on the faculty directory webpage.  I realize that using James instead of Jimmy, or Madeleine instead of Maddie, would have changed the data.

** Where is my name?  I have slightly more letters than average (12, average 11.4) but many fewer points than average (14, average 21.4).  Needless to say, this makes my points-to-letters ratio one of the smallest among my colleagues (1.167, third smallest of 33 names).

Again we see a statistical tendency here, as names with more letters tend to have more points.  But a longer name does not guarantee more points, which is what makes this a tendency rather than a rule.   Among the 33×32/2 = 528 pairs of names, 327 follow the tendency and 127 run counter to it, with 74 ties.

The regression line for predicting points from letters is also given in the graph, along with the r-squared value of 41.0%.  The line is a bit steeper than with the states’ names (slope 1.773 points per letter for faculty, compared to 1.460 points per letter for the states).  The value of r-squared is smaller for the faculty than for the states: 41.0% versus 54.1%.

Analyzing people’s names provides an opportunity to analyze paired data.  The following graph displays Scrabble points of last names versus first names for these faculty, along with a 45 degree line:

What tendency is revealed by most of the names falling above the 45 degree line? A slight tendency for more points in a faculty member’s last name than first name.


Depending on your course goals and student audience, you could also ask students to do some coding associated with data on Scrabble points.  Such an activity could include scraping names from a webpage, and it might also use a mapping of letters-to-points for games other than Scrabble, such as Words with Friends. I recently attended an inspiring presentation by Paul Myers (9 letters, 16 points, 1.778 ratio), who asks high school students in a data science* class to write code (in both Excel and R) for calculating numbers of letters and Scrabble points in names. 

* Have you noticed that data science has more letters, more Scrabble points, and a larger points-to-letters ratio than statistics?

Recommendation #3 of the GAISE report (here) is: Integrate real data with a context and purpose.  These data on Scrabble points do not have much purpose, other than being fun, but they do provide opportunities to explore statistical concepts.  Chief among these is the concept of a statistical tendency, which is quite fundamental but can prove elusive to many students.

Perhaps if Shakespeare had been a (far-ahead-of-his-time) statistician or data scientist, he might have asked: What tends to be in a name?

P.S. An Excel file with the data on states’ names can be found below.  Graphs in this post were produced with Minitab statistical software.

P.P.S. As so many teachers prepare to teach remotely in the coming weeks, I regret that I have no experience with online teaching and so have no advice to offer*.  Because I am on leave this academic year, I am not embarking on this foray into a brave new world alongside so many of you.  I wish you and your students well in these challenging circumstances.

* Other than: Ask good questions! 🙂

P.P.P.S. I’m a big fan of simulations*.  I highly recommend this article (here) by Harry Stevens, which shows simulations of virus spread under four conditions, illustrating the potential impact of social distancing.

* See posts #12, #13, and #27 (here, here, and here) on simulation-based inference, post #14 (here) on interpreting confidence level, and posts #23 and #24 (here and here) for simulation analyses of a probability problem.

#36 Nearly normal

Some students mistakenly believe that everything follows a normal* distribution.  Much closer to the truth is that nothing follows a normal distribution.  George Box famously said: All models are wrong; some models are useful.  The normal distribution provides a useful model for the pattern of variation in many numerical variables.  It also provides a valuable model for how many sample statistics vary, under repeated random sampling from a population.

* This normal word is not quite as objectionable and misleading as expected value (see post #18 here), but it’s still an unfortunate term.  I try to convince students that so-called normal distributions are not all that normal in any sense, and they certainly do not provide the inevitable shape for the distribution of all, or even most, numerical variables.  I realize that I could use the term Gaussian distribution, but that’s too math-y.  Some people capitalize Normal to distinguish the distribution from the everyday word, but that’s quite subtle.  I’d prefer to simply call them bell-shaped distributions, although I know that’s too vague, for example because t-distributions are also bell-shaped.

In this post, I present questions about normal distributions that my students answer in class.  The first is a straightforward introduction to the basics of normal distribution calculations.  The second tries to make clear that a normal distribution is not an appropriate model for all numerical data.  The third asks students to think through how the mean and standard deviation affect a normal distribution in a manufacturing context.  As always, questions that I pose to students appear in italics.


I use the context of birthweights to lead students through basic questions involving calculations of probabilities and percentiles from normal distributions.  I like to draw students’ attention to two different wordings for these kinds of questions.  You’ll notice that question (b) asks about a proportion of a population, whereas question (c) asks for a probability involving a randomly selected member of the population.

1. Suppose that birthweights of newborn babies in the United States follow a normal distribution with mean 3300 grams and standard deviation 500 grams.  Babies who weigh less than 2500 grams at birth are classified as low birthweight.

  • a) How many standard deviations below the mean is a baby classified as low birthweight?

I realize that calculating a z-score can be considered an unnecessary intermediate step when students are using technology rather than an old-fashioned table of standard normal probabilities.  But I think a z-score provides valuable information*, so I like to start with this question.  Because (2500 – 3300) / 500 = -1.60, a low birthweight baby is at least 1.60 standard deviations below the mean birthweight.

* I discussed z-scores at some length in post #8 (End of the alphabet, here).

Based on the normal model:

  • b) What percentage of newborn babies weigh less than 2500 grams? 
  • c) What is the probability that a randomly selected newborn baby weighs more than 10 pounds?
  • d) What percentage of newborn babies weigh between 3000 and 4000 grams?
  • e) How little must a baby weigh to be among the lightest 2.5% of all newborns?
  • f) How much must a baby weigh to be among the heaviest 10%?

Frankly, I don’t care much about whether students carry out these calculations with an old-fashioned table of standard normal probabilities or with technology.  I give my students access to an old-fashioned table and describe how to use it.  I also show students several choices for using technology (e.g., applet, Minitab, R, Excel).  I always encourage students to start with a well-labeled sketch of a normal curve, with the probability of interest shaded as an area under the normal curve.

The answer to part (b) is that the normal model predicts that 5.48% of newborns are of low birthweight, as shown in this applet (here) output:

I like that this applet draws a well-labeled sketch with the correct percentage shown as the shaded (dark blue) area under the curve.  I also like that the applet reports the z-score as well as the probability. 

Part (c) requires that students first convert 10 pounds into grams.  They are welcome to use the internet to help with this conversion to approximately 4536 grams.  If they are using a standard table of cumulative probabilities, students must realize that they need to subtract the probability given in the table from one.  The applet reports that this probability that a baby weighs more than ten pounds is only 0.0067, as shown here:

Part (d) requires students to subtract two probabilities if they are using a table.  The applet shows this percentage to be 64.50%, as shown here:

I emphasize to students that parts (e) and (f) ask fundamentally different questions from parts (b)-(d).  The previous parts asked for probabilities from given values; the upcoming parts ask for the birthweight values that produce certain probabilities.  In other words, parts (e) and (f) ask for percentiles, a term that students have heard of but probably need some reinforcement to understand well.

Students can answer part (e) approximately, without a table or software, by remembering the empirical rule.  The cut-off value for the bottom 2.5% of a normal distribution is approximately 2 standard deviations below the mean, which gives 3300 – 2×500 = 2300 grams.  A more precise answer comes from using a z-score of -1.96 rather than -2, which gives 2320 grams, as shown here:

To answer part (f) with a table, students need to realize that the question asks for the 90th percentile.  The applet shows that this value is approximately 3941 grams:

These questions are quite routine.  The more interesting part comes from thinking about this normal distribution as a model for actual birthweight data.  At this point, I show students this table from the National Vital Statistics Reports (here):

  • (g) According to this table, what percentage of newborns in 2018 weighed between 3000 and 3999 grams?  How does this compare with what the normal model predicted in part (d)?

The table reports that 38.76% + 26.44% = 65.20% of newborns weighed between 3000 and 3999 grams, which is very close to the normal model’s prediction of 64.50% from part (d).

  • (h) Compare the predictions from the normal model in parts (b) and (c) to the actual counts.

The normal model’s predictions are less accurate in the tails of the distribution than near the middle.  The normal model predicted that 5.48% would be of low birthweight, but the actual counts show that 8.28% were of low birthweight.  If we use 4500 rather than 4536 for the approximate ten-pound value, we find that 0.94% + 0.11% = 1.05% of newborns weighed more than 4500 grams, compared to a prediction of about 0.67% from the normal model using 4536 grams.

What’s the bottom line here: Do birthweights follow a normal distribution?  Certainly not exactly, but closely enough that the normal model provides a useful approximation.

I want students in more mathematical courses to become comfortable with the concept of a cumulative distribution function (cdf).  So, I ask these students to use the pnorm (cdf) and qnorm (inverse cdf) commands in R, in addition to using the more visual applet, to perform these calculations.  Commands like the following answer parts (b)-(f), with rounded results shown as comments:
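```
pnorm(2500, mean = 3300, sd = 500)                       # (b) 0.0548
pnorm(4536, mean = 3300, sd = 500, lower.tail = FALSE)   # (c) 0.0067
pnorm(4000, mean = 3300, sd = 500) -
  pnorm(3000, mean = 3300, sd = 500)                     # (d) 0.6449
qnorm(0.025, mean = 3300, sd = 500)                      # (e) 2320 grams
qnorm(0.90, mean = 3300, sd = 500)                       # (f) 3941 grams
```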


2. I recorded the game time (in minutes) until first goal for a sample of 41 National Hockey League games played on October 16-22, 2017.  The distribution of these times is displayed in the following histogram, for which the mean is 11.4 minutes and standard deviation is 10.6 minutes:

  • a) Would it be appropriate to use a normal model for the distribution of times until first goal?  Explain.
  • b) If you were to model these times with a normal distribution (using the sample mean and standard deviation), what is the probability that the time until first goal would be negative?
  • c) Comment on what the calculation in part (b) indicates about the suitability of using a normal model for time until first goal.

Students recognize immediately that this distribution is highly skewed, not bell-shaped in the least, so a normal model is completely inappropriate here.  The calculation in part (b) produces a z-score of (0 – 11.4) / 10.6 ≈ -1.08 and a probability of 0.141.  This means that a normal model would predict that about 1 in 7 hockey games would have a goal scored before the game began!  This calculation provides further evidence, as if any were needed, that a normal model would be highly inappropriate here.
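That probability is a one-line check in R:

```
pnorm(0, mean = 11.4, sd = 10.6)   # about 0.141, the model's predicted chance of a negative time
```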

This example takes only 10 minutes of class time, but I think it’s important to remind students that many numerical variables follow distributions that are not close to normal.  I also like that part (b) gives more practice with a routine calculation, even while the focus is on the inappropriateness of the normal model in this case.


The next series of questions asks students to think more carefully about properties of normal curves, particularly how the mean and standard deviation affect the distribution. 

3. Suppose that a candy bar advertises on its wrapper that its weight is 54.5 grams.  The actual weights vary a bit from candy bar to candy bar.  Let’s suppose that the actual weights follow a normal distribution with mean μ = 55.0 grams and standard deviation σ = 0.3 grams.

a) What percentage of candy bars weigh less than advertised?  This is a very routine calculation.  The z-score is -1.67, and the probability is .0478, so 4.78% of candy bars weigh less than advertised, as shown here:

b) Now suppose that the manufacturer wants to reduce this percentage so only 1% of candy bars weigh less than advertised.  If the standard deviation remains at 0.3 grams, would the mean need to increase or decrease?  Explain.  I encourage students to think about this visually: To get a smaller percentage below 54.5 grams, does the mean (and therefore the distribution) need to shift to the right or the left?  Most students realize that the curve needs to shift to the right, so the mean needs to be larger.

c) Determine the value of the mean that would achieve the goal that only 1% of candy bars weigh less than advertised.  Students cannot easily plug given numbers into an applet and press a button to answer this question.  They need to think through how to solve this.  The first step is to determine the z-score for the bottom 1% of a normal distribution, which turns out to be -2.326.  This tells us that the advertised weight (54.5 grams) must be 2.326 standard deviations below the mean.  We can then calculate the mean by adding 2.326 standard deviations to the advertised weight: 54.5 + 2.326 × 0.3 ≈ 55.20 grams.

Normal curves with the original mean (in blue) and the new mean (red dashes) are shown below.  The area to the left of the value 54.5, representing the percentage of candy bars that weigh less than advertised, is smaller with the new mean:

d) What is the downside to the manufacturer of making this change?  I want students to realize that increasing the mean weight means putting more candy in each bar, which will have a cost, perhaps substantial, to the manufacturer.

e) Now suppose that the manufacturer decides to keep the mean at 55.0 grams.  Instead they will change the standard deviation to achieve the goal that only 1% of candy bars weigh less than advertised.  Would the standard deviation need to increase or decrease to achieve this goal?  Explain.  When students need a hint, I ask: Does the original normal curve need to get taller and narrower, or shorter and wider, in order to reduce the area to the left of the value 54.5 grams?  This question is harder than the one about shifting the mean, but most students realize that the curve needs to become taller and narrower, which means that the standard deviation needs to decrease.

f) Determine the value of the standard deviation that would achieve the goal that only 1% of candy bars weigh less than advertised.  Once again we need a z-score of -2.326 to determine the bottom 1% of the distribution, which again means that the advertised weight needs to be 2.326 standard deviations below the mean.  We can express this algebraically as: (54.5 – 55.0) / σ = -2.326.  Solving gives: σ = (55.0 – 54.5) / 2.326 ≈ 0.215 grams.
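Both the new mean in part (c) and the new standard deviation in part (f) follow from the same z-score; a quick sketch:

```
z <- qnorm(0.01)      # about -2.326
54.5 - z * 0.3        # part (c): required mean, about 55.20 grams (sd held at 0.3)
(54.5 - 55.0) / z     # part (f): required sd, about 0.215 grams (mean held at 55.0)
```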

Normal curves with the original standard deviation (in blue) and the new one (red dashes) are shown below.  The area to the left of the value 54.5 is smaller with the new standard deviation:

g) Why might this be a difficult change for the manufacturer to make?  Decreasing the standard deviation of the weights requires making the manufacturing process less variable, which means achieving more consistency in the weights from candy bar to candy bar.  Reducing variability in a manufacturing setting can be a daunting task.

h) By what percentage does the manufacturer need to decrease the standard deviation of the weights in order to achieve this goal?  Percentage change is a challenging topic for students, so I look for opportunities to ask about it often*.  The manufacturer would need to decrease the standard deviation of the weights by (0.215 – 0.3) /  0.3 × 100% ≈ 28.3% to achieve this goal.

* See post #28 (A persistent pet peeve, here) for many more examples.


Teachers of introductory statistics must decide:

  • Whether to teach normal distributions as models for numerical data or only as approximations for sampling distributions;
  • Whether to include the process of standardization to z-scores when performing calculations involving normal distributions;
  • Whether to ask students to use a table of standard normal probabilities or use only technology for calculating probabilities and percentiles from normal distributions.

You can tell from the examples above that my answers are yes to the first two of these, and I don’t much care about whether students learn to read an old-fashioned normal probability table.  I do care that students learn that a normal curve only provides a model (approximation) for a distribution of real data, and that many numerical variables have a distribution that is not close to normal.  I also expect students to learn how to think carefully through normal distribution calculations that go beyond the basics.

In a follow-up post, I will describe an activity that gives students more practice with normal distribution calculations while also introducing the topic of classification and exploring the concept of trade-offs between different kinds of classification errors.

#35 Statistics of illumination, part 4

In previous posts (here, here, and here), I described examples that I present on the first day of a statistical literacy course and also when I give talks for high school students.  These activities show how data analysis can shed light on important questions and illustrate statistical thinking.

This post returns to this theme and completes the series.  Today’s example highlights multivariable thinking, much like post #3 (here) that introduced Simpson’s paradox.  One difference is that today’s example includes two numerical variables rather than all categorical ones.  A similarity is that we begin with a surprising finding about two variables that makes perfect sense after we consider a third variable.

As always, questions that I pose to students appear in italics.


We will examine data on lung capacity, as measured by a quantity called forced expiratory volume (to be abbreviated FEV), the amount of air an individual can exhale in the first second of forceful breath (in liters).  The following graph displays the distributions of FEV values for 654 people who participated in a research study, comparing smokers and non-smokers:

Which group – smokers or non-smokers – tends to have larger lung capacities?  Does this surprise you?  Students are quick to point out that although the two groups’ FEV values overlap considerably, smokers generally have higher FEV values, and therefore greater lung capacities, than non-smokers.  Next I tell students that the average FEV values for the two groups are 2.57 liters and 3.28 liters.  Which average is for smokers and which for non-smokers? Students realize from the graph that the larger average FEV belongs to the smokers.

Then I show boxplots of the distributions of FEV values.  Without going into any of the calculation details, I simply explain that the boxplots present the 25th, 50th, and 75th percentiles of the distributions, along with the minimum and maximum, with outliers shown as asterisks:

Describe how the distributions of FEV values compare between smokers and non-smokers.  The key point here is that smokers have higher FEV values than non-smokers throughout the distributions (at the minimum, 25th and 50th and 75th percentiles), except near the maximum values.  Non-smokers also have more variability in FEV values, including several outliers on the large side.

Does every smoker have a larger FEV value than every non-smoker?  No, many non-smokers have a larger FEV value than many smokers.  In other words, the FEV values overlap considerably between the two groups.  What is meant by a statistical tendency in this context?  This question is difficult but crucial to statistical thinking.  I don’t make a big deal of this on the first day of class, but I point out that a statistical tendency is not a hard-and-fast rule.  I emphasize phrases like on average and tend to and generally, in the hope that students will begin to catch on to probabilistic rather than deterministic thinking*. 

* I am reminded of a book called How to Think Straight About Psychology, by Keith Stanovich, which includes a chapter titled “The Achilles Heel of Human Cognition: Probabilistic Reasoning.” 

At this point I want students to focus on the big question: What in the world is going on here??  Why do smokers tend to have larger lung capacities than non-smokers?


I have withheld a very important piece of information until now: These 654 people were all children!  Their ages varied from 3 to 19 years old, as shown in the following histogram:

Before we analyze the data further, I ask students to think about this question in the back of their minds: How might this revelation about ages explain the surprising finding that smokers tend to have larger lung capacities than non-smokers?


Now, for the front of students’ minds, I ask: How do you expect the distribution of age to differ between smokers and non-smokers?  They naturally expect the smokers to be older children, while non-smokers include all of the younger and some of the older children.  This prediction is confirmed by this graph:

Then we consider the remaining pair of variables that we have not yet analyzed: What do you expect to see in a graph of lung capacity (FEV) vs. age?  Most students anticipate that lung capacity tends to increase as age increases.  This is confirmed by the following graph:

Do these last two graphs reveal a statistical tendency rather than a hard-and-fast rule?  Yes, absolutely.  Smokers tend to be older than non-smokers, but some smokers are younger than some non-smokers.  Furthermore, older children tend to have greater lung capacities than younger children, but the scatterplot also reveals that some older children have smaller lung capacities than younger ones.


Now let’s analyze a graph that displays all three of these variables simultaneously.  But first I ask students to take a step back and make sure that we’re all on the same page: What are the observational units, and what are the three variables here?  Also classify each variable as categorical or numerical.  The observational units are the 654 children.  The three variables are age (numerical), lung capacity as measured by FEV (numerical), and whether or not the person is a smoker (categorical). 

How can we include all three variables in one graph?  This is a harder question, but some students astutely suggest that we can code the dots in the scatterplot of FEV vs. age with different colors or symbols to indicate smoking status. 

Here’s the coded scatterplot, with blue circles for non-smokers and red squares for smokers:

This graph contains a lot of noise, so it’s hard to discern much.  We can see the overall patterns much more clearly by including lines of best fit* for the two groups:

* I’m not a fan of this phrase, but I don’t want to introduce least squares on the first day of class.

What does this graph reveal about lung capacities of smokers and non-smokers?  I am hoping for two particular responses to this question, so after some initial discussion I often need to ask more pointed questions, starting with: For those older than age 12, which line predicts greater lung capacity: smokers or non-smokers?  Does this surprise you?  Students realize, of course, that the non-smokers’ line falls above the smokers’ line for children older than 12.  This means that for a given age above 12, smokers are predicted to have smaller lung capacities than non-smokers.  This makes a lot more sense than our initial finding that smokers had larger lung capacities than non-smokers, on average, before we took age into account. 

A second pointed question: How do the slopes of the two lines compare?  What does that mean in this context?  Does this surprise you?  Clearly the solid blue line for non-smokers is steeper, and therefore has a greater slope, than the dashed red line for smokers.  This means that predicted lung capacity increases more quickly, for each additional year of age, for non-smokers than for smokers.  In fact, the line for smokers is almost flat, indicating that teenagers who smoke gain little increase in lung capacity as they age.  Again this finding is in line with what we would have expected beforehand, contrary to our surprising initial finding.
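A sketch of how such a coded scatterplot with two fitted lines can be produced, assuming a data frame `fev_data` with numeric columns `age` and `fev` and a `smoke` column coded "smoker"/"nonsmoker" (the names and coding are illustrative):

```
# Scatterplot of FEV vs. age, coded by smoking status, with a fitted line per group
cols <- ifelse(fev_data$smoke == "smoker", "red", "blue")
pchs <- ifelse(fev_data$smoke == "smoker", 15, 1)
plot(fev ~ age, data = fev_data, col = cols, pch = pchs,
     xlab = "Age (years)", ylab = "FEV (liters)")
abline(lm(fev ~ age, data = subset(fev_data, smoke == "nonsmoker")), col = "blue")
abline(lm(fev ~ age, data = subset(fev_data, smoke == "smoker")), col = "red", lty = 2)
```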

Succinctly put, the two take-away messages are:

  • At a given age, smokers tend to have smaller lung capacities than non-smokers.
  • The rate of increase in lung capacity, for each additional year of age, tends to be much slower for smokers than for non-smokers.

Oh, and just to make sure that no one missed this, I remind students of the question that I previously asked them to put at the back of their mind: How does the age variable explain the oddity that smokers in this dataset tend to have larger lung capacities than non-smokers?  At this point most students know the answer to this, but expressing it well can still be a challenge.  A full explanation requires making a connection between age and both of the other variables: smoking status and lung capacity.  Smokers tend to be older children, and older children tend to have greater lung capacities than younger ones.


How might we assess whether students can apply the same kind of multivariable thinking to new contexts?   I present two assessment questions here.  The first is based on a wonderful activity that Dick De Veaux has described about estimating how much a fireplace is worth to the value of a house in New England (see below for links).  He produced the following graph of house prices (in dollars) and living areas (in square feet), where the red dots and line represent houses with a fireplace:

How much is a fireplace worth?  De Veaux answers: It depends.  I ask students: Explain what this answer means.  At this early point in the course, I am looking for students to say two things: A fireplace does not add much or any value for modest-sized houses (smaller than 2000 square feet or so).  For houses larger than about 2000 square feet, the value of a fireplace (as seen by the distance between the red and blue lines) increases as the size of the house increases. For a 3000-square foot house, the worth of a fireplace is approximately $50,000.

A second follow-up assessment question, based on completely hypothetical data, presents the following graphs that display employees’ salary vs. experience at four companies (called A, B, C, and D) with 100 employees each.  The blue circles and lines represent cat lovers, and the red squares and lines represent dog lovers*.

* With such an obviously made-up example, I decided to use a ridiculous categorical variable rather than a more realistic one such as gender or race or education level.

A free-response question for students is: Describe the relationship between salary and experience at each company.  Also describe how the relationship varies (if at all) with regard to whether the employee is a cat lover or a dog lover.  Reading and grading their responses can take a while, though.  A multiple-choice version could present students with four descriptions and ask them to match each description to a graph.  Here are some descriptions:

  • (a) Salary increases much more quickly, for each additional year of experience, for cat lovers than for dog lovers.  But dog lovers start out with much higher salaries than cat lovers, so much that it takes a bit more than 30 years of experience for cat lovers to catch up.
  • (b) Salary increases by about $2000 for each additional year of experience, essentially the same for both cat and dog lovers, but cat lovers earn about $10,000 more than dog lovers at every experience level.
  • (c) Salary increases by about $2000 for each additional year of experience, essentially the same for both cat and dog lovers.
  • (d) Salary increases much more quickly, for each additional year of experience, for cat lovers than for dog lovers.  Cat lovers generally earn less than dog lovers if they have less than about 20 years of experience, but cat lovers generally earn more than dog lovers beyond 20 years of experience.

Which graph goes with which description?  (a): Graph D; (b): Graph B; (c): Graph A; (d): Graph C


Multivariable thinking is a core component of statistical thinking.  The 2016 GAISE recommendations (here) explicitly called for introductory students to experience multivariable thinking in a variety of contexts.  I think this example about smoking and lung capacity provides a rich context for such a learning activity.  The surprising aspect of the initial finding captures students’ attention, and the resulting explanation involving age is both understandable and comforting.

Statistics and data truly can illuminate important questions about the world.  Introductory students can experience this on the first day of class.


P.S. Michael Kahn wrote about this dataset for Stats magazine in 2003, when Beth Chance and I edited that magazine, and also for the Journal of Statistics Education in 2005 (here).  The JSE article describes the source of the data and also contains a link to the datafile (near the end of the article).

A recent JSE article (here), written by Kevin Cummiskey and co-authors, uses this dataset for introducing students to causal inference.

De Veaux’s article and dataset about the worth of a fireplace can be found among ASA’s Stats 101 resources (here).  This example is also mentioned in the 2016 GAISE report (here).

Minitab statistical software (here) was used to produce the graphs in this post.

#34 Reveal human progress, part 2

In the previous post (here), I put my Ask good questions mantra on a temporary hold as I argued for another three-word exhortation that I hope will catch on with statistics teachers: Reveal human progress.  In this post I will merge these two themes by presenting questions for classroom use about data that reveal human progress.

The first three of these questions present data that reveal changes over time.  I think these questions are appropriate not only for introductory statistics but also for courses in quantitative reasoning and perhaps other mathematics courses.  The fourth question concerns probability, and the last two involve statistical inference.

As always, questions that I pose to my students appear in italics.


1. The following graph displays how life expectancy has changed in South Africa over the past few decades:

  • a) Describe how life expectancy has changed in South Africa over these years. 
  • b) In which of these three time periods did life expectancy change most quickly, and in which did it change most slowly: 1960 – 1990, 1990 – 2005, 2005 – 2016?
  • c) Explain what happened in South Africa in 1990 – 2005 that undid so much progress, and also explain what happened around 2005 to restart the positive trend.  (You need to use knowledge beyond what’s shown in the graph to answer this.  Feel free to use the internet.)

Question (a) is meant to be straightforward.  I expect students to comment on the gradual increase in life expectancy from 1960 – 1990, the sudden reversal into a dramatic decline from 1990 – 2005, and then another reversal with an even more rapid increase from 2005 – 2016.  A more thorough response would note that the life expectancy in 2005 had plunged to a level about equal to that of 1965, and the life expectancy in 2016 had rebounded to exceed the previous high from around 1990.

Question (b) addresses rates of change.  I have in mind that students simply approximate these values from the graph.  Life expectancy increased from about 52 to 62 years between 1960 and 1990, which is an increase of about 10 life expectancy years over a 30-year time period, which is a rate of about 0.33 life expectancy years per year*.  From 1990 – 2005, life expectancy decreased by almost 10 years, for a rate of about 0.67 life expectancy years per year.  The years between 2005 – 2016 saw an increase in life expectancy of about 10 years, which is a rate of about 1 life expectancy year per year.  So, the quickest rate of change occurred in the most recent time period 2005 – 2016, and the slowest rate of change occurred in the most distant time period: 1960 – 1990.

* Unfortunately, the units here (life expectancy years per year of time) are tricky for students to express clearly.  This can be one of the downsides of using real data in an interesting context.
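If you want students to check their graph-reading with software, these rate calculations take only a few lines of Python.  The life expectancy values below are rough readings from the graph, not the exact World Bank figures, so treat this as a sketch of the method rather than a definitive calculation:

approx = {1960: 52, 1990: 62, 2005: 52, 2016: 62}              # rough readings from the graph
for start, end in [(1960, 1990), (1990, 2005), (2005, 2016)]:
    rate = (approx[end] - approx[start]) / (end - start)
    print(start, end, round(rate, 2), "life expectancy years per year")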

It usually takes students a little while to think of the explanation in part (c), but some students eventually suggest the HIV/AIDS epidemic that devastated South Africa in the 1990s.  Fortunately, effective medication became more available, helping to produce the dramatic improvement that began around the year 2005.


2. The following graph adds to the previous one by including the life expectancy for Ghana, as well as South Africa, over these years:

  • a) Compare and contrast how life expectancy changed in these two countries over these years.
  • b) Which country had a larger percentage increase in life expectancy over these years?  Explain your answer without performing any calculations.
  • c) Suppose that you were to calculate the changes in life expectancy for each year by subtracting the previous year’s value.  Which country would have a larger mean of its yearly changes?  Which country would have a larger standard deviation of its yearly changes?  Explain your answers.

For part (a), I expect students to respond that Ghana did not experience the dramatic reversals that South Africa did.  More specifically, Ghana experienced only a slight decline from about 1995 – 2000, much less dramatic and briefer than South Africa’s precipitous drop from 1990 – 2005.  For full credit I also look for students to mention at least one other aspect, such as:

  • Ghana had a much lower life expectancy than South Africa in 1960 and had a very similar life expectancy in 2016.
  • Ghana’s increase in life expectancy since 2005 has been much more gradual than South Africa’s steep increase over this period.

The key to answering part (b) correctly is to realize that the two countries ended with approximately the same life expectancy, but Ghana began with a much smaller life expectancy, so the percentage increase is larger for Ghana than for South Africa.

Part (c) is not at all routine, requiring a lot of thought.  Because Ghana had a larger increase in life expectancy over this time period, Ghana would have a larger mean for the distribution of its yearly changes.  But South Africa had steeper increases and decreases than Ghana, so South Africa would have more variability (and therefore a larger standard deviation) in its distribution of yearly changes*.

* The means of the yearly changes turn out to be 0.302 years for Ghana, 0.188 years for South Africa.  The standard deviations of the yearly changes are 0.625 years for South Africa, 0.174 years for Ghana.


3. Consider the following graph of energy-related carbon dioxide (CO2) emissions (in million metric tons) in the United States from 1990 – 2005:

  • a) Describe what the graph reveals.
  • b) Determine the least-squares line for predicting CO2 emissions from year.
  • c) Interpret the value of the slope coefficient.
  • d) Use the line to predict CO2 emissions for the year 2018.
  • e) The actual value for CO2 emissions in 2018 was 5269 million metric tons.  Calculate the percentage error of the prediction from the actual value.
  • f) Explain what went wrong, why the prediction did so poorly.

Students have little difficulty with part (a), as they note that CO2 emissions increased at a fairly steady rate from about 5000 million metric tons in 1990 to about 6000 million metric tons in 2005.  I intend for students to use technology to determine the least squares line in (b), the equation of which turns out to be: predicted CO2 emissions = -135,512 + 70.61 × year. 

To interpret the slope coefficient in part (c), students should respond that predicted CO2 emissions increase by 70.61 million metric tons for each additional year.  Using this line to predict CO2 emissions for the year 2018 in part (d) gives: -135,512 + 70.61 × 2018 ≈ 6979 million metric tons.

This prediction is not very close to the actual value of CO2 emissions in 2018, as it over-predicts by more than 1700 million metric tons.  The percentage error for part (e) is: (6979 – 5269) / 5269 × 100% ≈ 32.5%. 
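These calculations are simple enough by hand, but here is a quick check in Python, using the fitted intercept and slope reported above:

intercept, slope = -135512, 70.61                  # fitted values reported above
predicted_2018 = intercept + slope * 2018
actual_2018 = 5269                                 # million metric tons
pct_error = (predicted_2018 - actual_2018) / actual_2018 * 100
print(round(predicted_2018), round(pct_error, 1))  # 6979 and 32.5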

The explanation in part (f) is that we should have been cautious about extrapolation.  By using the least squares line to make a prediction thirteen years into the future, we assumed that the linear increasing trend would continue in the years following 2005.  We did not have a good justification for making this assumption.

In fact, a graph of the entire dataset from 1990 – 2018 reveals that the increasing trend from 1990 – 2005 actually reversed into a decreasing trend from 2005 – 2018:

Students find these data to be very surprising.  I hope the surprise aspect helps to make the caution about extrapolation memorable for them.


The next three questions concern Hans Rosling’s Gapminder/Ignorance Test.  I presented three of the twelve questions on this test in the previous post (here).  Each of the twelve questions asks respondents to select one of three options.  The correct answer for each question is the most optimistic of the three options presented.

4. Suppose that all people select randomly among the three options on all twelve questions.  Let the random variable X represent the number of questions that a person would answer correctly.

  • a) Describe the probability distribution of X.  Include the parameter values as well as the name of the distribution.
  • b) Determine and interpret the expected value of X.
  • c) Determine the probability that a person would obtain exactly the expected value for the number of correct answers.
  • d) Determine and compare the probabilities of correctly answering fewer than the expected value vs. more than the expected value. 
  • e) Discuss how the actual survey results, as shown in the following graph, compare to the binomial distribution calculations.

Under the assumption of random selection among the three options on all twelve questions, the probability distribution of X, the number of correct answers, would follow a binomial distribution with parameters n = 12 and p = 1/3.  A graph of this probability distribution is shown here:

The expected value of X can be calculated as: E(X) = np = 12×(1/3) = 4.0.  This means that if the questions were asked of a very large number of people, all of whom selected randomly among the three options on all twelve questions, then the average number of correct answers would be very close to 4.0.

The binomial probabilities in (c) and (d) can be calculated to be 0.2384 for obtaining exactly 4 correct answers, 0.3931 for fewer than 4 correct, and 0.3685 for more than 4 correct.
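For students or instructors who want to verify these binomial calculations with software, here is one way in Python, using scipy (a sketch, certainly not the only option):

from scipy.stats import binom

n, p = 12, 1/3
print(binom.pmf(4, n, p))   # P(X = 4), about 0.2384
print(binom.cdf(3, n, p))   # P(X < 4), about 0.3931
print(binom.sf(4, n, p))    # P(X > 4), about 0.3685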

The survey data reveal that people do much worse on these questions than they would with truly random selections.  For example, about 80% of respondents got fewer than four correct answers, whereas random selections would produce about 39.31% with fewer than four correct answers.  On the other side, about 10% of people answered more than four questions correctly, compared with 36.85% that would be expected from random selections.


5. When asked how the proportion of the world’s population living in extreme poverty has changed over the past twenty years, only 5% of a sample of 1005 respondents in the United States gave the correct answer (cut in half), while 59% responded with the option furthest from the truth (doubled).

  • a) Determine the z-score for testing whether the sample data provide strong evidence that less than one-third of all Americans would answer correctly.
  • b) Summarize your conclusion from this z-score, and explain the reasoning process behind your conclusion.
  • c) Determine a 95% confidence interval for the population proportion who would answer that the rate has doubled.
  • d) Interpret this confidence interval.

The z-score in (a) is calculated as: z = (0.05 – 1/3) / sqrt[(1/3)×(2/3)/1005] ≈ -19.1.  This is an enormous z-score, indicating that the sample proportion who gave the correct response is more than 19 standard deviations less than the value one-third.  Such an extreme result would essentially never happen by random chance, so the sample data provide overwhelming evidence that less than one-third of all adult Americans would have answered correctly.

The 95% confidence interval for the population proportion in part (c) is: .59 ± 1.96 × sqrt(.59×.41/1005), which is .59 ± .030, which is the interval (.560 → .620).  We can be 95% confident that if this question were asked of all adult Americans, the proportion who would give the most wrong answer (doubled) would be between .560 and .620.  In other words, we can be 95% confident that between 56% and 62% of all adult Americans would give the most wrong answer to this question.
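These calculations are easy to reproduce with software; here is one way in Python, using the same z-score and confidence interval formulas shown above:

from math import sqrt
from scipy.stats import norm

n = 1005
z = (0.05 - 1/3) / sqrt((1/3) * (2/3) / n)         # part (a)
print(z, norm.cdf(z))                              # about -19.1; p-value essentially zero

margin = 1.96 * sqrt(0.59 * 0.41 / n)              # part (c)
print(0.59 - margin, 0.59 + margin)                # about (0.560, 0.620)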


I asked my students the question about how the extreme poverty rate has changed, before revealing the answer.  The table below shows the observed counts for the three response options in a recent class:

6. Conduct a hypothesis test of whether the sample data provide strong evidence against the hypothesis that the population of students at our school would be equally likely to choose among the three response options.

The null hypothesis is that students in the population would be equally likely to select among the three options (i.e., that one-third of the population would respond with each of the three options).  The expected counts (under this null hypothesis) are 83/3 ≈ 27.667 for each of the three categories.  All of these expected counts are larger than five, so a chi-square goodness-of-fit test is appropriate.  The chi-square test statistic turns out to equal 7.253, as shown in the following table:

The p-value, from a chi-square distribution with 2 degrees of freedom, is ≈ 0.027.  This p-value is fairly small (less than .05) but not very small (larger than .01), so we can conclude that the sample data provide fairly strong evidence against the hypothesis that students in the population would be equally likely to select among the three options.  The sample data suggest that students are more likely to give the most pessimistic answer (doubled) and less likely to give the most optimistic, correct answer (cut in half).  This conclusion should be regarded with caution, though, because the sample (students in my class) was not randomly selected from the population of all students at our school.
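A chi-square goodness-of-fit test like this takes only a couple of lines with software.  The observed counts below are illustrative ones that I chose to be consistent with the test statistic reported above; my students’ actual counts appear in the table:

from scipy.stats import chisquare

observed = [38, 27, 18]                  # doubled, same, cut in half (illustrative counts)
result = chisquare(observed)             # expected counts equal by default: 83/3 each
print(result.statistic, result.pvalue)   # about 7.25 and 0.027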


The six questions that I have presented here only hint at the possibilities of asking questions that help students to learn important statistical content while also exposing them to data that reveal human progress.  I also encourage teachers to point their students toward resources that empower them to ask their own questions, and analyze data of their own choosing, about the state of the world.  I listed several websites with such data at the very end of the previous post (here).

P.S. The life expectancies for South Africa and Ghana were obtained from the World Bank’s World Development Indicators dataset, accessed through google (here).  Life expectancy is defined here as “the average number of years a newborn is expected to live with current mortality patterns remaining the same.”  The data on CO2 emissions were obtained from the United States Energy Information Administration (here).  The data on the Gapminder/Ignorance Test were obtained from a link here.

Files containing the data on life expectancies and CO2 emissions can be downloaded from the links below:

#33 Reveal human progress, part 1

This post will feature many quotes that I find inspirational, starting with:

Quote #1: How can we soundly appraise the state of the world?  The answer is to count. …  A quantitative mindset, despite its nerdy aura, is actually the morally enlightened one, because it treats every human life as having equal value rather than privileging the people who are closest to us or most photogenic. – Steven Pinker, Enlightenment Now, pages 42-43

I am going to show some data that appraise the state of the world and how things have changed over the years.  First I will ask a few questions that Hans Rosling liked to ask his audiences, which I also ask of my students:

  • A: In the last twenty years, how has the proportion of the world’s population living in extreme poverty changed?  [Options: Almost doubled, Remained more or less the same, Almost halved]
  • B: What percentage of the world’s one-year-old children today have been vaccinated against some disease?  [Options: 20 percent, 50 percent, 80 percent]
  • C: Worldwide, 30-year-old men have spent an average of 10 years in school.  How many years have women of the same age spent in school, on average?  [Options: 9 years, 6 years, 3 years]

Are you ready for the answers?  Here’s a quote to reveal the correct answer for question A:

Quote #2: Over the past twenty years, the proportion of the global population living in extreme poverty has halved.  This is absolutely revolutionary.  I consider it to be the most important change that has happened in the world in my lifetime. – Hans Rosling, Factfulness, page 6

The correct answers for questions B and C are also the most optimistic of the options presented: 80 percent of one-year-old children have been vaccinated, and 30-year-old women have spent 9 years in school, on average.

Looking at data on a wide range of human experiences, Pinker uses even stronger language than Rosling to declare:

Quote #3: Here is a shocker: The world has made spectacular progress in every single measure of human well-being.  Here is a second shocker: Almost no one knows about it. – Steven Pinker, Enlightenment Now, page 52.


Can this really be true – that the world has made great progress, and that very few know about it?  Let’s return to questions A, B, and C, which were asked of people in many countries.  Rosling and his colleagues produced the following graphs of the percentage of correct responses for these questions:

Remember that these were multiple choice questions with three options.  Rosling pointed out that complete ignorance would lead to random guessing, which would produce roughly 33% correct responses in a large sample.  I’m sure you’ve noticed that for all three questions, in every country, respondents failed to achieve the level of complete ignorance.

Rosling and his colleagues asked twelve questions of this type.  For every question, the correct answer was the most optimistic of the three options provided.  Here is the distribution of number correct, where Rosling uses a chimpanzee to represent the expected value under the assumption of complete ignorance:

Do people really think the world is getting worse instead of better?  Further evidence is provided by the following survey results from asking this question directly in the year 2015:

Only in China did a higher percentage say that the world is getting better rather than worse.  In the United States, more than 10 times as many people responded worse than better.


Why are people so pessimistic and ignorant (actually, worse than ignorant) about the state of the world?  Pinker argues that the negative nature of news, combined with cognitive biases such as the availability heuristic, explain much of this phenomenon:

Quote #4: Whether or not the world is actually getting worse, the nature of news will interact with the nature of cognition to make us think that it is. – Steven Pinker, Enlightenment Now, page 41

Rosling offers many explanations for this disconnect between perception and reality, starting with what he calls the gap instinct:

Quote #5: I’m talking about that irresistible temptation we have to divide all kinds of things into two distinct and often conflicting groups, with an imagined gap – a huge chasm of injustice – in between.  – Hans Rosling, Factfulness, page 21

Consider the following graph, from Rosling’s Gapminder site (here), of a country’s child mortality rate vs. the average number of babies per woman (color indicates region of the world, and the size of the circle represents the country’s population):

The countries in the bottom left of this graph have low child mortality rates and small families, while those in the upper right experience high child mortality rates and large families.  This graph displays Rosling’s gap instinct: Many people see the world as separated into two distinct groups of countries, which are often labeled developed and developing.   

But have you noticed the catch?  This graph shows the world in 1968, more than 50 years ago!  The following graph displays the same variables on the same scale in the year 2018:

The world has changed dramatically in these 50 years!  Child mortality rates have dropped substantially, which is undeniably good news.  Despite the fact that more and more babies live past age 5 (in fact, probably because of that fact), women have fewer and fewer babies than previously.  Sure, there’s still variability, and the African countries (shown in light blue) still have some catching up to do.  But the separation of countries into two clusters with a gap in between is a relic of the past.  The gap instinct that many people hold is not consistent with current data.


Next I will offer some data and graphs that reveal human progress.  Such data and visualizations abound*, but I will confine myself here to seven graphs. 

* I provide a partial list of resources in a P.S. at the end of this post.  The seven graphs shown below come from the Our World in Data site (here).

The first three graphs show decreases, for all regions of the world, in child mortality rates, average number of babies per woman, and extreme poverty rates:

The next three graphs show dramatic increases in life expectancy, literacy rates, and mean years of schooling:

The final graph displays raw counts rather than rates or averages.  Because the population of the world has been growing over time, you might wonder whether a decreasing rate of extreme poverty means that fewer people are living in extreme poverty.  The following graph shows that the number of people living in extreme poverty has indeed decreased dramatically over the past two decades, while the number of people not living in extreme poverty has increased sharply:


What does this have to do with teaching introductory statistics?  I think we (teachers of introductory statistics) have a tremendous opportunity to make our students aware of human progress.  Here’s my plea: I urge you to use data, examples, activities, and assignments that reveal* human progress to your students.

* I like the word reveal here, because we can expose students to human progress in dramatic fashion, as a plot twist providing the climax of a suspenseful narrative.

Why do I consider this so important?  I’ll call on Rosling to answer:

Quote #6: When we have a fact-based worldview, we can see that the world is not as bad as it seems – and we can see what we have to do to keep making it better.  – Hans Rosling, Factfulness, page 255

I hasten to add an important caveat: By no means am I arguing that statistics teachers should refrain from presenting examples and data that reveal problems and injustices.  Such examples can motivate students to take action for making the world a better place.  But I suspect that many statistics teachers, who are susceptible to the same inherent biases and heuristics that affect all people*, have a tendency to overdo the negative and understate the positive.  I also believe that good news about human progress can motivate students to do their part in continuing to make the world better. I am not asking teachers to recenter their entire course around data of human progress, just to show a few examples.

* I include myself here, of course.


How can we reveal the good news about human progress to students?  You know my answer: Ask good questions!

This post is something of an anomaly for this blog, as it contains few questions.  But some previous posts have already posed questions for students that use data on human progress:

  • In post #11 (Repeat after me, here), I suggested providing students with scatterplots (bubble graphs) from Rosling’s site and asking basic questions about observational units, variables, and association. 
  • I proposed asking students to calculate the percentage decrease in the extreme poverty rate between 1990 and 2015 in post #28 (A pervasive pet peeve, here). 
  • I recommended conducting a hypothesis test of whether Americans’ responses to the extreme poverty rate question are worse than would be expected by random chance in post #8 (End of the alphabet, here).

I will continue this theme in next week’s post by providing several more examples of how I have asked questions about data on human progress to teach statistical thinking in my courses.


An encouraging development is that as the year and decade came to a close in December of 2019, several columns appeared in the news to trumpet the good news of human progress.  Two examples are:

  • “This Has Been the Best Year Ever,” by Nicholas Kristof in The New York Times (here)
  • “We’ve just had the best decade in human history, seriously,” by Matt Ridley in The Spectator (here)

Finally, I offer one more quote that I find insightful and inspiring:

Quote #7: If you could choose a moment in history to be born, and you didn’t know ahead of time who you were going to be, you’d choose now.  Because the world has never been less violent, healthier, better educated, more tolerant, with more opportunity for more people, and better connected, than it is today.  – To be revealed soon

What’s your guess – is this quote from Pinker or Rosling?  Rosling or Pinker?  I used three quotes from each above.  Does this quote break the tie?  Or do you suspect that I slipped in a quote from Kristof or Ridley here? Are you ready for the big reveal?

No, this quote does not break the tie, because these words are neither Pinker’s nor Rosling’s. They are also not Kristof’s or Ridley’s.  Who said this?  President Barack Obama, at the White House Summit on Global Development, on July 20, 2016 (here).


P.S. I highly recommend Pinker’s book Enlightenment Now (here) and Rosling’s book Factfulness (here).  These books inspired this post and provided the first six quotes above.

The project that produced the data and graphs for survey questions about the state of the world is summarized here, and the data can be found here.  The graph of survey results for the “getting better or worse” question came from the YouGov site (here).  The graph of perceived happiness levels came from the Our World in Data site (here).  The graphs displaying Rosling’s gap instinct came from his Gapminder site (here).  The seven graphs of human progress came from the Our World in Data site: child mortality (here), babies per woman (here), poverty (here), life expectancy (here), literacy (here), and years of schooling (here).

I recommend the following resources for data and tools to explore human progress.  I relied most heavily on the first two sites in preparing this post:

#32 Create your own example, part 2

In last week’s post (here), I presented examples of questions that ask students to create their own example that satisfies a particular property, such as the mean exceeding the median and the inter-quartile range equaling zero.  I proposed that such questions can help students to think more carefully and deepen their understanding of statistical concepts.  All of last week’s examples concerned descriptive statistics.

Now I extend this theme to the realm of statistical inference concepts and techniques.  I present six create-your-own-example questions (each with multiple parts) concerning hypothesis tests and confidence intervals for proportions and means, with a chi-square test appearing at the end.  I believe these questions lead students to develop a stronger understanding of concepts such as the role of sample size and sample variability on statistical inference.

I encourage students to use technology, such as the applet here, to calculate confidence intervals, test statistics, and p-values.   This enables them to focus on underlying concepts rather than calculations.

The numbering of these questions picks up where the previous post left off.  As always, questions for students appear in italics.


6. Suppose that you want to test the null hypothesis that one-third of all adults in your county have a tattoo, against a two-sided alternative.  For each of the following parts, create your own example of a sample of 100 people that satisfies the indicated property.  Do this by providing the sample numbers with a tattoo and without a tattoo.  Also report the test statistic and p-value from a one-proportion z-test.

  • a) The two-sided p-value is less than 0.001.
  • b) The two-sided p-value is greater than 0.20.

Students need to realize that sample proportions closer to one-third produce larger p-values, while those farther from one-third generate smaller p-values.  Clever students might give the most extreme answers, saying that all 100 have a tattoo in part (a) and that 33 have a tattoo in part (b).

Instead of asking for one example in each part, you could make the question more challenging by asking students to determine all possible sample values that satisfy the property.  It turns out that for part (a), the condition is satisfied by having 17 or fewer, or 49 or more, with a tattoo.  For part (b), having 28 to 39 (inclusive) with a tattoo satisfies the condition.  Instead of trial-and-error, you could ask students to determine these values algebraically from the z-test statistic formula, but I would only ask this in courses for mathematically inclined students.
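Determining all of the sample values that work is a natural job for a short script; here is a sketch in Python that searches over every possible count:

from math import sqrt
from scipy.stats import norm

n, p0 = 100, 1/3
def two_sided_pvalue(count):
    z = (count / n - p0) / sqrt(p0 * (1 - p0) / n)
    return 2 * norm.sf(abs(z))

print([x for x in range(n + 1) if two_sided_pvalue(x) < 0.001])   # 0 through 17, and 49 through 100
print([x for x in range(n + 1) if two_sided_pvalue(x) > 0.20])    # 28 through 39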


7. Suppose that you want to estimate the proportion of all adults in your county who have a tattoo. For each of the following parts, create your own example to satisfy the indicated property.  Do this by specifying the sample size and the number of people in the sample with a tattoo.  Also determine the confidence interval.

  • a) The sample proportion with a tattoo is 0.30, and a 95% confidence interval for the population proportion includes the value 0.35.
  • b) The sample proportion with a tattoo is 0.30, and a 99% confidence interval for the population proportion does not include the value 0.35.

The key here is to understand the impact of sample size on a confidence interval.  The confidence interval in both parts will be centered at the sample proportion of 0.30, so the interval in part (b) needs to be narrower than the interval in part (a).  A larger sample size produces a narrower confidence interval, so a smaller sample size is needed in part (a).

One example that works for part (a) is a sample of 100 people, 30 of whom have a tattoo, which produces a 95% confidence interval of (0.210 → 0.390).  Similarly, a sample of 1000 people, 300 of whom have a tattoo, satisfies part (b), as the 99% confidence interval is (0.263 → 0.337).

Again you could consider asking students to determine all sample sizes that work.  Restricting attention to multiples of 10 (so the sample proportion with a tattoo equals 0.30 exactly), it turns out that a sample size of 320 or fewer suffices for part (a), and a sample size of 560 or more is needed for part (b).
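Here is a quick search in Python, using the standard z-interval formula (sample proportion ± z × sqrt(p̂(1-p̂)/n)) that underlies the intervals above; an applet that uses a different interval method could give slightly different cutoffs:

from math import sqrt

def upper_bound(n, phat, z):                       # phat + z*sqrt(phat*(1-phat)/n)
    return phat + z * sqrt(phat * (1 - phat) / n)

sizes = range(10, 2001, 10)                        # multiples of 10, with 30% having a tattoo
part_a = [n for n in sizes if upper_bound(n, 0.30, 1.96) >= 0.35]    # 95% interval reaches 0.35
part_b = [n for n in sizes if upper_bound(n, 0.30, 2.576) < 0.35]    # 99% interval stays below 0.35
print(max(part_a), min(part_b))                    # 320 and 560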

 


8. Suppose that you want to estimate the population mean body temperature of a healthy adult with a 95% confidence interval.  For each of the following parts, create your own example of a sample of 10 body temperature values that satisfy the indicated property.  Do this by listing the ten values and also producing a dotplot that displays the ten values.  Report the sample standard deviation, and determine the confidence interval.

  • a) The sample mean is 98.0 degrees, and a 95% confidence interval for the population mean includes the value 98.6.
  • b) The sample mean is 98.0 degrees, and a 99% confidence interval for the population mean does not include the value 98.6.

This question is similar to the previous one, but dealing with a mean instead of a proportion brings the variability of the sample data into consideration.  This question removes sample size from consideration by stipulating that n = 10. 

The confidence interval for both parts will be centered at the value of the sample mean temperature: 98.0 degrees.  For the confidence interval in part (a) to include the value 98.6, the sample data need to display a good bit of variability.  A student might try a fairly simple example containing five values of 97.0 and five values of 99.0, which produces a sample standard deviation of 1.054 and a 95% confidence interval of (97.25 → 98.75) degrees.

In contrast, part (b) requires less sample variability, for the confidence interval to fall short of the value 98.6.  A student might use a fairly extreme example, such as one value of 97.9, eight values of 98, and one value of 98.1.  This results in a sample standard deviation of 0.047 and a 99% confidence interval of (97.95 → 98.05) degrees.

As with the previous questions, you could ask students to determine all values of the sample standard deviation that will work, either with trial-and-error or algebraically.  It turns out that the sample standard deviation needs to be at least 0.839 (to three decimal places) degrees in part (a), at most 0.583 degrees in part (b).
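Here is a quick Python check of the part (a) example, using the one-sample t-interval:

from statistics import mean, stdev
from scipy import stats

sample = [97.0] * 5 + [99.0] * 5                  # the part (a) example
m, s, n = mean(sample), stdev(sample), len(sample)
tcrit = stats.t.ppf(0.975, df=n - 1)              # about 2.262
print(s, (m - tcrit * s / n**0.5, m + tcrit * s / n**0.5))   # 1.054 and about (97.25, 98.75)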


9. Suppose that you ask dog and cat owners whether their pet has been to a veterinarian in the past twelve months.  You organize the resulting counts in a 2×2 table as follows:

For each of the following parts, create your own example of a sample that satisfies the indicated property.  Do this by filling in the counts of the 2×2 table.  Also report the two sample proportions and the test statistic and p-value from a two-proportions z-test.

  • a) The two-sided p-value is less than 0.001.
  • b) The two-sided p-value is between 0.2 and 0.6.

Students need to produce a large difference in proportions for part (a) and a fairly small difference for part (b).  They could give a very extreme answer in part (a) by having 100% of dogs and 0% of cats visit a veterinarian.  A less extreme response that 80 of 100 dogs and 20 of 50 cats have been to a veterinarian produces a z-statistic of 4.90 and a p-value very close to zero.   

Stipulating that the p-value in part (b) must be less than 0.6 forces students not to use identical success proportions in the two groups.  One example that works is to have 80 of 100 dogs and 36 of 50 cats with a veterinarian visit. This produces a z-statistic of 1.10 and a p-value of 0.270.
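A small Python function makes it easy to check candidate tables; this sketch uses the pooled form of the two-proportions z-test, which matches the values reported above:

from math import sqrt
from scipy.stats import norm

def two_prop_z(x1, n1, x2, n2):
    p1, p2, pooled = x1 / n1, x2 / n2, (x1 + x2) / (n1 + n2)
    z = (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return z, 2 * norm.sf(abs(z))

print(two_prop_z(80, 100, 20, 50))   # z about 4.90, p-value near zero
print(two_prop_z(80, 100, 36, 50))   # z about 1.10, p-value about 0.27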


10. The Gallup organization surveyed American adults about how many times they went to a movie at a movie theater in the year 2019.  They compared results for people with at least one child under age 18 in their household and those without such a child in their household.  Suppose that you reproduce this study by interviewing a random sample of adults in your county, and suppose that the sample means are the same as in the Gallup survey: 6.8 movies for those with children, 4.7 movies for those without, as shown in the table below:

For each of the following parts, create your own example that satisfies the indicated property.  Do this by filling in the sample size and sample standard deviation for each group.  Also report the value of the two-sample t-test statistic and the two-sided p-value.

  • a) The two-sample t-test statistic is less than 1.50.
  • b) The two-sample t-test statistic is greater than 2.50.

Students have considerable latitude in their answers here, as they can focus on sample size or sample variability.  They need to realize that large sample sizes and small standard deviations will generally produce larger test statistic values, as needed for part (b).  Producing a smaller test statistic value in part (a) requires relatively small sample sizes or large standard deviations.

For example, sample sizes of 10 and sample standard deviations of 4.0 for each group produce t = 1.17 to satisfy part (a).  The condition for part (b) can be met with the same standard deviations but larger sample sizes of 50 for each group, which gives t = 2.62.
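Here is a quick Python check of these values, using the unpooled (Welch) form of the two-sample t statistic computed from the summary statistics:

from math import sqrt

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    return (mean1 - mean2) / sqrt(sd1**2 / n1 + sd2**2 / n2)

print(two_sample_t(6.8, 4.0, 10, 4.7, 4.0, 10))   # about 1.17, satisfying part (a)
print(two_sample_t(6.8, 4.0, 50, 4.7, 4.0, 50))   # about 2.62, satisfying part (b)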


11. Suppose that you interview a sample of 100 adults, asking for their political viewpoint (classified as liberal, moderate, or conservative) and how often they eat ice cream (classified as rarely, sometimes, or often).  Also suppose that you obtain the marginal totals shown in the following 3×3 table:

For each of the following parts, create your own example that satisfies the indicated property.  Do this by filling in the counts of the 3×3 table.  Also report the value of the chi-square statistic and p-value.  For part (b), also describe the nature of the association between the variables (i.e., which political groups tend to eat ice cream more or less frequently?).

  • a) The chi-square p-value is between 0.4 and 0.8.
  • b) The chi-square p-value is less than 0.01.

Like the previous questions, this one also affords students considerable leeway with their responses.  They need to supply nine cell counts in the table, but the fixed margins mean that they only have four degrees of freedom* to play around with.

* Once a student has filled in four cell counts (provided that they are not all in the same row or same column), the other five cell counts are then determined by the need to make counts add up to the marginal totals.

First students need to realize that to obtain a large p-value in part (a), the counts need to come close to producing independence between political viewpoint and ice cream frequency.  They also need to know that independence here would mean that all three political groups have 20% rarely, 50% sometimes, and 30% often eating ice cream.  Independence would produce this table of counts:

This table does not satisfy the condition for part (a), though, because the p-value is 1.0.  A correct response to part (a) requires a bit of variation from perfect independence.  The following table, which shifts two liberals from rarely to often and two conservatives from often to rarely, produces a chi-square statistic of 2.222 and a p-value of 0.695:

On the other hand, a table that successfully satisfies part (b) needs to reveal a clear association between the two variables.  Consider the following example:

The chi-square test statistic equals 13.316 for this example, and the p-value is about 0.0098, just below 0.01.  This table reveals that liberals are much more likely to eat ice cream often, and much less likely to eat ice cream rarely, compared to conservatives.
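For instructors who want to verify such tables with software, here is a sketch in Python.  Because the marginal totals appear only in the table above, I assume row totals of 30 liberals, 40 moderates, and 30 conservatives for illustration; with those margins, the code reproduces the chi-square value of 2.222 for the shifted table described earlier:

from scipy.stats import chi2_contingency

# Rows: liberal, moderate, conservative; columns: rarely, sometimes, often.
# Row totals of 30, 40, 30 are assumed here for illustration.
shifted = [[4, 15, 11],
           [8, 20, 12],
           [8, 15, 7]]
statistic, pvalue, df, expected = chi2_contingency(shifted)
print(statistic, pvalue, df)    # about 2.22, 0.695, and 4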


Students can use create-your-own-example questions to demonstrate and deepen their understanding of statistical concepts.  The previous post provided many examples that concerned descriptive statistics, and this post has followed suit with topics of statistical inference.

I also like to ask create-your-own-example questions that ask students, for instance, to identify a potential confounding variable in a study, or to suggest a research question for which comparative boxplots would be a relevant graph.  Perhaps a future post will discuss those kinds of questions.

As with the previous post, I leave you with a (completely optional, of course) take-home assignment: Create your own example of a create-your-own-example question to ask of your students.

P.S. A recent study (discussed here) suggests that average body temperature for humans, as discussed in question 8, has dropped in the past century and is now close to 97.5 degrees Fahrenheit.  The Gallup survey mentioned in question 10 can be found here.

#31 Create your own example, part 1

I like asking questions that prompt students to create their own example to satisfy some property.  I use these questions in many settings: class activities, homework assignments, quizzes, and exams.  Such questions prompt students to engage in higher-level thinking than rote calculations.  I also believe that these questions can lead students to deepen their understanding about properties of statistical measures and methods.

I presented one such question in post #3 (here), in which I asked students to create their own example to illustrate Simpson’s paradox.  That’s a very challenging question for most students.  In this post, I will provide five examples (each with multiple parts) of create-your-own-example questions, most of which are fairly straight-forward but nevertheless (I believe) worthwhile.  I will also discuss the statistical concepts, all related to the topic of descriptive statistics, that the questions address.  As always, questions for students appear in italics.


1. Suppose that you record the age of 10 customers who enter a movie theater.  For each of the following parts, create an example of 10 ages that satisfy the indicated property.  (In other words, produce a list of 10 ages for each part.)  Also, report the values of the mean and median for parts (c) – (e).  Do not bother to calculate the standard deviation in part (b).

  • a) The standard deviation equals zero.
  • b) The inter-quartile range equals zero, and the standard deviation does not equal zero. 
  • c) The mean is larger than the median.
  • d) The mean exceeds the median by at least 20 years.   
  • e) The mean exceeds the median by at least 10 years, and the inter-quartile range equals zero.

Part (a) simply requires that all 10 customers have the same age.  A correct answer to part (b) needs the 3rd through 8th values (in order) to be the same, in order for the IQR to equal zero, with at least one different value to make the standard deviation positive.  The easiest way to answer (b) correctly would make nine of the ages the same and one age different.

Part (c) requires knowing that the mean will be affected by a few unusually large ages.  An example that works for (d), which is more challenging than (c), is to have six ages of 10, so the median is 10, and four ages of 60, which pulls the mean up to 30. 

Part (e) is more challenging still.  An IQR of 0 again requires the 3rd through 8th values to be the same.  Two large outliers can inflate the mean enough to satisfy the property.  For example, eight ages of 10 and two ages of 60 make the IQR 0, the median 10, and the mean 20.
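A few lines of Python confirm the summary statistics for this example (numpy’s default percentile rule differs slightly from the quartile rule described above, but both give an IQR of zero here):

import numpy as np

ages = [10] * 8 + [60] * 2                  # the part (e) example
q1, median, q3 = np.percentile(ages, [25, 50, 75])
print(np.mean(ages), median, q3 - q1)       # mean 20.0, median 10.0, IQR 0.0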

Ideally, students think about properties of mean and median as they answer questions like this.  I think it’s fine for students to use some trial-and-error, but then I hope they can explain why an example works.  You could assess this by asking students to describe their reasoning process, perhaps for part d) or e), along with submitting their example.  

I want students to consider the context here (and always), so I only give partial credit if an example uses an unrealistic age such as 150 years.

For an in-class activity or homework assignment, I ask all five parts of this question, and I encourage students to use software (such as the applet here) to facilitate the calculations.  On a quiz or exam, I only ask one or two parts of this question.  I do think it’s important to give students practice with this kind of question prior to asking it on an exam.


2. Consider the following dotplot, which displays the distribution of margin of victory in a sample of 10 football games (mean 11.0, median 9.5, standard deviation 6.04 points):

For each of the following parts, create your own example by proposing an eleventh value along with these ten to satisfy the indicated property.  (Notice that the context here requires that the new value must be a positive integer.)  For each part, add your new data value to the dotplot.

  • a) The mean, median, and standard deviation all increase.
  • b) The mean, median, and standard deviation all decrease.
  • c) The median increases, and the mean decreases.

Students should realize immediately that part (a) requires that the new value be fairly large.  The new value must be larger than the mean and median, of course, but it needs to be considerably larger in order for the standard deviation to increase.  It turns out that any integer value of 18 or higher works.  (I do not expect students to determine the smallest value that works, although you could make the question harder by asking for that.)

Part (b) requires that the new value be less than the mean and median, but fairly close to the mean in order for the standard deviation to decrease.  A natural choice that works is 9.  (It turns out that any integer from 5 through 9, inclusive, works.)  Part (c) has a unique correct answer, which is the only integer between the median and mean: 10 points.
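A brute-force search confirms these ranges.  The ten margins below are one set consistent with the summary statistics reported above (mean 11.0, median 9.5, standard deviation 6.04); the actual values appear in the dotplot:

import numpy as np

margins = [3, 5, 6, 7, 8, 11, 16, 17, 17, 20]    # consistent with mean 11.0, median 9.5, SD 6.04
base = (np.mean(margins), np.median(margins), np.std(margins, ddof=1))

def summary_with(x):
    new = margins + [x]
    return (np.mean(new), np.median(new), np.std(new, ddof=1))

all_increase = [x for x in range(1, 60) if all(a > b for a, b in zip(summary_with(x), base))]
all_decrease = [x for x in range(1, 60) if all(a < b for a, b in zip(summary_with(x), base))]
median_up_mean_down = [x for x in range(1, 60) if summary_with(x)[1] > base[1] and summary_with(x)[0] < base[0]]
print(min(all_increase), all_decrease, median_up_mean_down)   # 18, [5, 6, 7, 8, 9], [10]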

I provide a separate copy of the dotplot for each part of this question.  If students have access to technology as they answer these questions, you could ask them to report the new values of the statistics.


3. The Gallup organization surveyed American adults about how many times they went to a movie at a movie theater in the year 2019.  They compared results for people with at least one child under age 18 in their household and those without such a child in their household.  Suppose that you recreate this study by interviewing faculty at your school, and suppose that your sample contains 8 people in each group.  For each of the following parts, create your own example that satisfies the given property.  Do this by producing dotplots on the axes provided, making sure to include 8 data values in each group.  Do not bother to calculate the values of the means and standard deviations.

  • a) The mean for those with children is larger than the mean for those without children.
  • b) The standard deviation for those with children is larger than the standard deviation for those without children.
  • c) The mean for those with children is larger than the mean for those without, and the standard deviation for those with children is smaller than the standard deviation for those without.

Parts (a) and (b) are very straight-forward, simply assessing whether students understand that the mean measures center and standard deviation measures variability.  Part (c) is a bit more complicated, as students need to think about both aspects (center and variability) at the same time.  I provide a separate copy of the axes for each part.


4. Suppose that you ask dog and cat owners whether their pet has been to a veterinarian in the past twelve months.  You organize the resulting counts in a 2×2 table as follows:

For each of the following parts, create your own example of counts that satisfy the indicated property.  Do this by filling in the appropriate cells of the table with counts.  Also report the values for all relevant proportions, differences in proportions, and ratios of proportions.

  • a) The difference in proportions who answer yes is exactly 0.2.
  • b) The ratio of proportions who answer yes is exactly 2.0.
  • c) The difference in proportions who answer yes is greater than 0.2, and the ratio of proportions who answer yes is greater than 2.0.
  • d) The difference in proportions who answer yes is greater than 0.2, and the ratio of proportions who answer yes is less than 2.0.
  • e) The difference in proportions who answer yes is less than 0.2, and the ratio of proportions who answer yes is greater than 2.0.

You could make these questions easier by using the same sample size for both groups, but I prefer this version that requires students to think proportionally.  Part (c) requires one of the proportions to be fairly small, so the ratio can exceed 2.0.  Part (e) requires both proportions to be on the small side, so the ratio can exceed 2 without a large difference.  The following tables show examples (by no means unique) that work for parts (c), (d), and (e):


5. Consider the following scatterplot of sale price (in thousands of dollars) vs. size (in square feet) for seven houses that sold in Arroyo Grande, California:

The seven ordered pairs of (size, price) data points are: (1014, $474K), (1176, $520K), (1242, $459K), (1499, $470K), (1540, $575K), (1545, $500K), (1755, $580K).  The correlation coefficient between price and size is r = 0.627.  For each of the following parts, create your own example to satisfy the indicated property.  Do this by adding one point to the scatterplot and also reporting the values of the size (square feet) and price for the house that you add.  Also give a very brief description of the house (e.g., a very small and inexpensive house), and report the value of the correlation coefficient.

  • a) The correlation coefficient is larger than 0.8.
  • b) The correlation coefficient is between 0.2 and 0.4.
  • c) The correlation coefficient is negative.

Notice that I extended the scales on the axes of this graph considerably, as a hint to students that they need to consider using some small or large values for size or price.  I reproduce the graph for students in all three parts. Using technology (such as the applet here) is essential for this question.  You could ask part (a) or (c) on an exam with no technology, as long as you ask for educated guesses and do not require calculating the correlation coefficient.

The key in part (a) is to realize that the new house must reinforce the positive association considerably, which requires a house that is either considerably larger and more expensive, or else much smaller and less expensive.  Two points that work are a 500-square-foot house for $350K (r = 0.858), or a 2500-square-foot house for $650K (r = 0.846).  Students could think even bigger (or smaller) and produce a correlation coefficient even closer to 1.  For instance a 4000-square-foot house for two million dollars generates r = 0.978.

Part (b) calls for a new house that diminishes the positive association considerably, so students need to think of a house that goes against the prevailing tendency.  Students should try a small but expensive, or large but inexpensive, house.  One example that works is a 1000-square-foot house for $550K (r = 0.374).   Part (c) is similar but requires an even more unusual house to undo the positive association completely.  For instance, a small-but-expensive house with 500 square feet for $650K achieves a negative correlation of r = -0.324.
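Students (or instructors) can check candidate points quickly with software; here is one way in Python, using the seven (size, price) pairs listed above:

import numpy as np

size  = [1014, 1176, 1242, 1499, 1540, 1545, 1755]
price = [474, 520, 459, 470, 575, 500, 580]            # in thousands of dollars
print(np.corrcoef(size, price)[0, 1])                  # about 0.627

# add a small, inexpensive house, as in part (a), and recompute
print(np.corrcoef(size + [500], price + [350])[0, 1])  # about 0.858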


I believe that create-your-own-example questions can help students to assess and deepen their understanding of statistical concepts related to measures of center, variability, and association.  Next week’s post will continue this theme by presenting five create-your-own-example questions that address properties of statistical inference procedures.

Are you ready for your take-home assignment*?  I bet you can guess what it is.  Ready?  Here goes: Create your own example of a create-your-own-example question that leads students to assess and deepen their understanding of a statistical concept.

* Needless to say, this assignment is optional!

P.S. The sample of 10 football games in question 2 consists of the NFL post-season games in January of 2020, prior to Super Bowl LIV, gathered from here, here, and here.  Results from the Gallup survey mentioned in question 3 can be found here.

#30 Minimize what?

What does least squares mean?  Students in an introductory statistics course typically encounter this term in the context of fitting a line to bivariate numerical data.  We tell students that among all possible lines, the least squares line is the one that makes the sum of squared vertical deviations (i.e., the sum of squared residuals) from the line as small as possible. 

In this post I explore how students can use least squares and other criteria to determine optimal measures of center for a single numerical variable.  I will describe an activity that I use with mathematically inclined students, primarily those majoring in statistics, mathematics, or economics.  I do not use this activity with typical Stat 101 students, but I do hope that this activity might be fun and worthwhile as a “beyond the exam” topic in an AP Statistics course. As always, questions that I pose to students appear in italics.


I carry a pedometer in my pocket to record the number of steps that I take each day.  Below are the data for a recent week, along with a dotplot (generated with the applet here):

Let’s start with a question meant to provoke students’ thought: Propose a number to represent the center of this distribution.  This is a very vague question, so I encourage students to just pick a value based on the graph, without giving it too much thought, and certainly without performing any calculations.  I also emphasize that there’s not a right-or-wrong answer here.

Then I ask a few students to share the values that they selected, which leads to the question: How can we decide whether one value (for the center of this distribution) is better than another?  This is a very hard question.  I try to lead students to understand that we need a criterion (a rule) for deciding.  Then I suggest that the criterion should take into account the differences (or deviations) between the data values and the proposed measure of center.  Do we prefer that these differences be small or large?  Finally, this is an easy question with a definitive answer: We prefer small differences to large ones.  I point out that with seven data values, we’ll have seven deviations to work with for each proposed measure of center.  How might we combine those seven deviations?  Would it work to simply add them?  Some students respond that this would not work, because we could have positive and negative differences cancelling out.  How can we get around that problem?  We could take absolute values of the deviations, or square them, before we add them.

Let’s get to work, starting with the least squares criterion.  Let m represent a generic measure of center.  Write out the function for the sum of squared deviations (call this SSD) as a function of m.  When students need a hint, I say that there’s nothing clever about this, just a brute-force calculation.  In general terms, we could express this function as: SSD(m) = (x1 – m)^2 + (x2 – m)^2 + … + (x7 – m)^2, where x1, x2, …, x7 denote the seven data values.

For these particular data values, this function becomes:

Predict what the graph of this function will look like.  If students ask for a hint, I suggest that they think about whether to expect to see a line, parabola, exponential curve, or something else.  Then I either ask students to use Excel, or ask them to talk me through its use, to evaluate this function.  First enter the seven data values into column A.  Then set up column B to contain a whole bunch of (integer) values of m, from 8000 to 16000, making use of Excel’s fill down feature.  Finally, enter this formula into column C*:

* The $ symbol in the formula specifies that those data cells are fixed, as opposed to the B2 cell that fills down to produce a different output for all of the m values.

The first several rows of output look like this:

A graph of this function follows:

What is the shape of this graph?  A parabola.  Explain why this makes sense.  Because the function is quadratic, of the form a×m^2 + b×m + c.  Where does the function appear to be minimized?  Slightly above 12,000 steps.  How can we determine where the minimum occurs more precisely?  We can examine the SSD values in the Excel file to see where the minimum occurs.  Here are the values near the minimum:

We see that the minimum occurs at 12,069 steps.  Is it possible that SSD is minimized at a non-integer value of m?  Sure, that’s possible.  Can we zoom in further to identify the value of m that minimizes this function more exactly?   Yes, we can specify that Excel use multiples of .001, rather than integers, for the possible values of m, restricting our attention to the interval from 12,068 to 12,070 steps.  This produces the following graph:

Now we can examine the SSD values in the Excel file to identify where the minimum occurs:

The sum of squared deviations is minimized at the value 12,069.143.  Is this one of the seven data values?  No.  Is this the value of a common measure of center for these data?  Yes, it turns out that this is the mean of the data.  Do you think this is a coincidence?  No way, with so many decimal places of accuracy here, that would be an amazing coincidence!

If your students have studied a term of calculus, you can ask them to prove that SSD(m) is minimized by the mean of the data.  They can take the derivative, with respect to m, of the general form of SSD(m), set that derivative equal to zero, and solve for m.
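For students more comfortable with a scripting language than with Excel, the same brute-force search takes just a few lines of Python.  The step counts below are placeholders rather than my actual data (which appear only in the dotplot above), but the conclusion is the same: the minimizing value of m is the one closest to the mean.

import numpy as np

steps = np.array([9200, 10400, 10800, 11600, 12300, 13100, 14100])  # placeholder step counts
m = np.arange(8000, 16001)                        # candidate integer values of m
ssd = ((steps[:, None] - m) ** 2).sum(axis=0)     # SSD(m) for every candidate
print(m[ssd.argmin()], steps.mean())              # 11643 and about 11642.9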


Why should we confine our attention to least squares?  Let’s consider another criterion.  Instead of minimizing the sum of squared deviations between the data values and the measure of center, let’s minimize the sum of absolute deviations.

We’ll call this function SAD(m)*.  When written out, this function looks just like SSD(m) but with absolute values instead of squares.  Again we can use Excel to evaluate this function for a wide range of values of m, using the formula:

* Despite the name of this function, I implore students to be happy, not sad, as they expand their horizon beyond least squares.

What do you expect the graph of this SAD(m) function to look like?  This is a much harder question than with the SSD(m) function.  Students could have realized in advance that the SSD(m) function would follow a parabola.  But what will they expect the graph of a function that sums absolute values to look like?  What do you expect this to look like?  Ready?  Here’s the result:

Describe the behavior of this function.  This graph can be described as piece-wise linear.  It consists of connected line segments with different slopes.  Where do the junction points (where the line segments meet) of this function appear to occur?  Examining the SAD values in the Excel file, we find that the junction points in this graph occur at the m values 8457, 8589, 11593, and 13093*.

* The values 8457 and 8589 are so close together that it’s very hard to distinguish their junction points in the graph.  If we expanded the range of m values, we would see that all seven data values produce junction points.

Where does the minimum occur?  The minimum clearly occurs at one of these junction points: m = 11,593 steps.  Does this value look familiar?  Yes, this is one of the data values, specifically the median of the data.  Does this seem like a coincidence?  Again, no way, this would be quite a coincidence!  The sum of absolute deviations is indeed minimized at the median of the data values*. 

* The mathematical proof for this result is a bit more involved than using calculus to prove that the mean minimizes the sum of squared deviations.
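The same brute-force approach works for the SAD criterion.  Again the step counts below are placeholders, and again the minimizer turns out to be the median:

import numpy as np

steps = np.array([9200, 10400, 10800, 11600, 12300, 13100, 14100])  # placeholder step counts
m = np.arange(8000, 16001)
sad = np.abs(steps[:, None] - m).sum(axis=0)      # SAD(m) for every candidate
print(m[sad.argmin()], np.median(steps))          # both are 11600: the median minimizes SAD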


Some students wonder: What if there had been an even number of data values?  I respond: What a terrific question!  What do you predict will happen?  Please explore this question and find out.

Let’s investigate this question now.  On Sunday, January 19, I walked for 14,121 steps.  Including this value in the dataset gives the following ordered values:

How will the mean and median change?  The mean will increase, because we’ve included a value larger than the previous mean.  The median will also increase, as it will now be the average of the 4th and 5th values, and the value we’ve inserted is larger than those values.  It turns out that the mean is now 12,325.625 steps, and the median is (11,593 + 13,093) / 2 = 12,343 steps.

Predict what will change in the graphs of these functions and the values of m that minimize these functions.  Ready to see the results?  Here is the graph for the SSD function:

This SSD function behaves as you expected, right?  It’s still a parabola, and it’s still minimized at the mean, which is now a bit larger than the previous mean.  Now let’s look at the SAD function:

Whoa, did you expect this?  We still have a piece-wise linear function, with junction points still at the data values.  The median still minimizes the function, but it no longer does so uniquely: the SAD function is now minimized by any value between the two middle values of the dataset.  For this dataset, all values from 11,593 through 13,093 steps minimize the SAD function*.

* While the common convention is to declare the median of an even number of values to be the midpoint of the middle two values, an alternative is to regard any value between the two middle values as a median.


Are these two criteria (sum of squared or absolute deviations) the only ones that we could consider?  Certainly not.  These are the two most popular criteria, with least squares the most common by far, but we can investigate others.  For example, if you’re a very cautious person, you might want to minimize the worst-case scenario.  So, let’s stick with absolute deviations, but let’s seek to minimize the maximum of the absolute deviations rather than their sum.  We’ll call this function MAXAD(m), and we can evaluate it in Excel with:
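
Continuing the earlier Python sketch (with the same placeholder-data caveat), this worst-case criterion is a one-line change:

```python
import numpy as np

# Same placeholder step counts and grid as in the earlier sketch.
data = np.array([8457, 8589, 11593, 13093, 12000, 14000, 16000])
grid = np.arange(8000, 17001)

def maxad(m, data):
    """Maximum absolute deviation between the data values and a candidate center m."""
    return np.max(np.abs(data - m))

maxad_values = np.array([maxad(m, data) for m in grid])
print("MAXAD minimized near m =", grid[np.argmin(maxad_values)],
      "; midrange =", (data.min() + data.max()) / 2)
```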

What do you predict this function to look like?  The resulting graph (based on the original seven data values) is:

This MAXAD function is piece-wise linear, just as the SAD function was.  But there are only two linear pieces to this function.  The unique minimum occurs at m = 12,663 steps.  How does this minimum value relate to the data values?  It turns out that the minimum occurs at the average of the minimum and maximum values, also known as the midrange. It makes sense that we use the midpoint of the most extreme values in order to minimize the worst-case scenario.

Now let’s continue with the idea of minimizing a worst-case scenario, but let’s work with squared differences rather than absolute values.  What do you expect the maximum of squared deviations function to look like, and where do you expect the minimum to occur?

Here’s the graph, again based on the original seven data values:

It’s hard to see, but the two pieces are not quite linear this time; each is a piece of a parabola.  Because squaring does not change which deviation is largest, minimizing the maximum squared deviation amounts to minimizing the maximum absolute deviation, so the minimum again occurs at the midrange of the data values: m = 12,663 steps.

Would including the 8th data value that we used above affect the midrange?  No, because that 8th value did not change the minimum or maximum.  Is the midrange resistant to outliers?  Not at all!  The midrange is not only strongly affected by very extreme values, it takes no data values into account except for the minimum and maximum.

Could we ask students to investigate other criteria?  Sure.  Here’s a weird one: How about the median of the absolute deviations, rather than the sum or maximum of them?  I have no idea why you would want to minimize this function, but it produces a very interesting graph, and its minimum occurs at m = 10,775 steps:
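
If you want students to explore this criterion in the Python sketch from above, it drops in as one more helper function (same placeholder-data caveat as before); evaluating it over the same grid of m values produces a version of this graph for whatever data they supply:

```python
import numpy as np

def medad(m, data):
    """Median of the absolute deviations between the data values and a candidate center m."""
    return np.median(np.abs(data - m))
```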


The concept of least squares applies to one-variable data as well as to its more familiar use of fitting lines to bivariate data.  Students can use software to explore not only this concept but also other minimization criteria.  Along the way they can make some surprising (and pretty) graphs and discover some interesting results about summary statistics.

P.S. This activity was inspired by George Cobb and David Moore’s wonderful article “Mathematics, Statistics, and Teaching” (available here), which appeared in The American Mathematical Monthly in 1997.  The last section of the article discussed optimization properties of measures of center, mentioning several of the criteria presented in this post.

The very last sentence of George and David’s article (This is your take-home exam: design a better one-semester statistics course for mathematics majors) inspired Beth Chance and me to develop Investigating Statistical Concepts, Applications, and Methods (more information available here).

P.P.S. You can download the Excel file that I used in these analyses from the link below.  Notice that the file contains separate tabs for the original analysis of seven data values, a zoomed-in version of that analysis, and the analysis of eight data values.

#29 Not enough evidence

We statistics teachers often ask students to draw a conclusion, in the context of the data and research question provided, from the p-value of a hypothesis test.  Do you think a student is more likely to provide a response that earns full credit if the p-value is .02 or .20?

You may respond that it doesn’t matter.  You may believe that a student either knows how to state a conclusion from a p-value or not, regardless of whether the p-value is small or not-so-small.

I think it does matter, a lot.  I am convinced that students are more likely to give a response that earns full credit from a small p-value like .02 than from a not-so-small p-value like .20.  I think it’s a lot easier for students to express a small p-value conclusion of strong evidence against the null than a not-so-small p-value conclusion of not much evidence against the null.  Why?  In the not-so-small p-value case, it’s very easy for students to slip into wording about evidence for the null hypothesis (or accepting the null hypothesis), which does not deserve full credit in my book.

In this post I will explore this inclination to mis-state hypothesis test conclusions from a not-so-small p-value.  I will suggest two explanations for convincing students that speaking of evidence for the null, or deciding to accept the null, is not an appropriate way to frame a conclusion.  I will return to an example that we’ve seen before and then present two new examples.  As always, questions that I pose to students appear in italics.


Let’s revisit the infamous 1970 draft lottery, which I discussed in post #9 (Statistics of illumination, part 3, here).  To recap: All 366 birthdays of the year were assigned a draft number.  The scatterplot on the left below displays the draft numbers vs. sequential day numbers.  At first glance, the graph appears to show nothing but random scatter, as we would expect from a truly random lottery.  But when we explored the data further, we found a bit of negative association between draft number and day number, with a correlation coefficient of -0.226.  We used simulation to investigate how surprising such a correlation would be with a truly random lottery.  The graph on the right shows the results for 10,000 random lotteries.  We see that none of the 10,000 simulated correlation coefficients is as large (in absolute value) as the -0.226 value that was achieved with the actual 1970 draft lottery.  Therefore, because a result as extreme as the one observed would be very unlikely to occur with a truly random lottery, we concluded that the observed data provide very strong evidence that the lottery process was not truly random.  (The explanation turned out to be insufficient mixing of the capsules containing the birthdays.)

This reasoning process is by no means trivial, but I think it makes sense to most students.  Without using the terminology, we have conducted a hypothesis test.  The null hypothesis is that the lottery process was truly random.  The alternative hypothesis is that the process was not truly random.  The p-value turns out to be very close to zero, less than 1 in 10,000.  Therefore, we have very strong evidence against the null hypothesis in favor of the alternative.
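
This simulation is easy to carry out in an applet, and it can also be scripted.  Here is a rough Python sketch of the idea, my own version rather than the applet’s: it shuffles the draft numbers to mimic a truly random lottery, so the count it reports will vary slightly from run to run.

```python
import numpy as np

rng = np.random.default_rng()

day_numbers = np.arange(1, 367)     # sequential day numbers, 1 through 366
observed_r = -0.226                 # correlation from the actual 1970 draft lottery

def random_lottery_correlation(rng):
    """Correlation between day number and draft number for one truly random lottery."""
    draft_numbers = rng.permutation(np.arange(1, 367))
    return np.corrcoef(day_numbers, draft_numbers)[0, 1]

simulated_r = np.array([random_lottery_correlation(rng) for _ in range(10000)])

# How many of 10,000 truly random lotteries produce a correlation at least as
# extreme (in absolute value) as the one observed in 1970?
count_extreme = np.sum(np.abs(simulated_r) >= abs(observed_r))
print(count_extreme, "of 10,000 simulated correlations are as extreme as -0.226")
```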

In the following year’s (1971) draft lottery, additional steps were taken to try to produce a truly random process.  The correlation coefficient (between draft number and day number) turned out to be 0.014.  The graph of simulation results above* shows that such a correlation coefficient is not the least bit unusual or surprising if the lottery process was truly random.  The two-sided p-value turns out to be approximately 0.78.  What do you conclude about the 1971 lottery process?

* This 1971 draft lottery involved 365 birthdays, as compared to 366 birthdays in the 1970 draft lottery.  This difference is so negligible that using the same simulation results is reasonable.

After they provide their open-ended response, I also ask students: Which of the following responses are appropriate and which are not?

  • A: The data do not provide enough evidence to conclude that the 1971 lottery process was not truly random.
  • B: The data do not provide much evidence for doubting that the 1971 lottery process was truly random.
  • C: The data provide some evidence that the 1971 lottery process was truly random.
  • D: The data provide strong evidence that the 1971 lottery process was truly random.

Responses A and B are correct and appropriate.  But they are challenging for students to express, in large part because they include a double negative.  It’s very tempting for students to avoid the double negative construction and write a more affirmative conclusion. But the affirmative responses (C and D) get the logic of hypothesis testing wrong by essentially accepting the null hypothesis.  That’s a no-no, so those responses deserve only partial credit in my book.

Students naturally ask: Why is this wrong?  Very good question.  I have two answers, one fairly philosophical and the other more practical.  I will lead off with the philosophical answer, even though students find the practical answer to be more compelling and persuasive.


The philosophical answer is: Accepting a null hypothesis, or assessing evidence in favor of the null hypothesis, is simply not how the reasoning process of hypothesis testing works.  The reasoning process only assesses the strength of evidence that the data provide against the null hypothesis.  Remember how this goes: We start by assuming that the null hypothesis is true.  Then we see how surprising the observed data would be if the null hypothesis were true.  If the answer is that the observed data would be very surprising, then we conclude that the data provide strong evidence against the null hypothesis.  If the answer is that the observed data would be somewhat surprising, then we conclude that the data provide some evidence against the null hypothesis.  But what if the answer is that the observed data would not be surprising?  Well, then we conclude that the data provide little or no evidence against the null hypothesis.

This reasoning process is closely related to the logical argument called modus tollens:

  • If P then Q
  • Not Q
  • Therefore: not P

For example, the Constitution of the United States stipulates that if a person is eligible to be elected President in the year 2020 (call this P), then that person must have been born in the U.S. (call this Q).  We know that Queen Elizabeth was not born in the U.S. (not Q).  Therefore, Queen Elizabeth is not eligible to be elected U.S. President in 2020 (not P).

But what if Q is true?  The following, sometimes called the fallacy of the converse, is NOT VALID:

  • If P then Q
  • Q
  • Therefore: P

For example, Taylor Swift was born in the U.S. (Q).  Does this mean that she is eligible to be elected President in 2020 (P)?  No, because she is younger than 35 years old, which violates a constitutional requirement to serve as president.

For the draft lotteries, P is the null hypothesis that the lottery process was truly random, and Q is that the correlation coefficient (between day number and draft number) is between about -0.1 and 0.1.  Notice that (If P, then Q) is not literally true here, but P does make Q very likely.  This is the stochastic* version of the logic.  For the 1970 lottery, we observed a correlation coefficient (-0.226) that is not Q, so we have strong evidence for not P, that the lottery process was not truly random.  For the 1971 lottery, we obtained a correlation coefficient (0.014) that satisfies Q.  This leaves us with no evidence for not P (that the lottery process was non-random), but we also cannot conclude P (that the lottery process was random).

* I don’t use this word with introductory students.  But I do like the word stochastic, which simply means involving randomness or uncertainty.

I only discuss modus tollens in courses for mathematics and statistics majors.  But for all of my students I do mention the common expression: Absence of evidence does not constitute evidence of absence.  For the 1971 draft lottery, the correlation coefficient of 0.014 leaves us with an absence of evidence that anything suspicious (non-random) was happening, but that’s not the same as asserting that we have evidence that nothing suspicious (non-random) was happening.


My second answer, the more practical one, for why it’s inappropriate to talk about evidence in favor of a null hypothesis, or to accept a null hypothesis, is: Many different hypotheses are consistent with the observed data, so it’s not appropriate to accept any one of these hypotheses.  Let me use a new example to make this point.

Instead of flipping a coin, tennis players often determine who serves first by spinning a racquet and seeing whether it lands with the label facing up or down.  Is this really a fair, 50/50 process?  A student investigated this question by spinning her racquet 100 times, keeping track of whether it landed with the label facing up or down.

  • a) What are the observational units and variable?  The observational units are the 100 spins of the racquet.  The variable is whether the spun racquet landed with the label facing up or down.  This is a binary, categorical variable.
  • b) Identify the parameter of interest.  The parameter is the long-run proportion of all spins for which the racquet would land with the label up*.  This could also be expressed as the probability that the spun racquet would land with the label facing up.
  • c) State the null and alternative hypotheses in terms of this parameter.  The null hypothesis is that the long-run proportion of all spins that land up is 0.5.  In other words, the null hypothesis states that racquet spinning is a fair, 50/50 process, equally likely to land up or down.  The alternative hypothesis is that the long-run proportion of all spins that land up is not 0.5.  This is a two-sided alternative.

* We could instead define a down label as a success and specify the parameter to be the long-run proportion of all spins that would land down.

The 100 racquet spins in the sample resulted in 44 that landed with the label up, 56 that landed with the label down.  The two-sided p-value turns out to be 0.271, as shown in the following graph of a binomial distribution*:

* You could also (or instead) present students with an approximate p-value from a simulation analysis or a normal distribution.
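
For instructors who want students to check this p-value outside a graph or applet, here is a quick sketch with scipy (my own check, not part of the original activity).  Because the binomial distribution with success probability 0.5 is symmetric, the two-sided p-value is the chance of 44 or fewer successes plus the chance of 56 or more:

```python
from scipy.stats import binom

n, observed_up, null_prob = 100, 44, 0.5

# P(X <= 44) + P(X >= 56) for X ~ Binomial(100, 0.5)
p_value = binom.cdf(observed_up, n, null_prob) + binom.sf(55, n, null_prob)
print(round(p_value, 3))   # about 0.271
```

The same kind of calculation applies to the other null values considered later in this post, although for values other than 0.5 the distribution is not symmetric, so the two tails must be handled separately and the exact two-sided p-value depends on the convention used.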

  • d) Interpret this p-value.  If the racquet spinning process was truly fair (equally likely to produce an up or down result), there’s a 27.1% chance that a random sample of 100 spins would produce a result as extreme as the actual one: 44 or fewer, or 56 or more, spins landing with the label up.
  • e) Summarize your conclusion.  The sample data (44 landing up in 100 spins) do not provide much evidence against the hypothesis that racquet spinning is a fair, 50/50 process.
  • f) Explain how your conclusion follows from the p-value.  The p-value of 0.271 is not small, indicating that the observed result (44 landing up in 100 spins), or a result more extreme, would not be surprising if the racquet spinning process was truly fair.  In other words, the observed result is quite consistent with a fair, 50/50 process.

Once again this conclusion in part (e) is challenging for students to express, as it involves a double negative.  Students are very tempted to state the conclusion as: The sample data provide strong evidence that racquet spinning is a fair, 50/50 process.  Or even more simply: Racquet spinning is a fair, 50/50 process.

To help students understand what’s wrong with these conclusions, let’s focus on the parameter, which is the long-run proportion of racquet spins that would land with the label facing up.  Concluding that racquet spinning is a fair, 50/50 process means concluding that the value of this parameter equals 0.5. 

I ask students: Do we have strong evidence against the hypothesis that 45% of all racquet spins would land up?  Not at all!  This hypothesized value (0.45) is very close to the observed value of the sample proportion of spins that landed up (0.44).  The p-value for testing the null value of 0.45 turns out to be 0.920*.

* All of the p-values reported for this example are two-sided, calculated from the binomial distribution.

Let’s keep going: Do we have strong evidence against the hypothesis that 40% of all racquet spins would land up?  Again the answer is no, as the p-value equals 0.416.  What about 52%?  Now the p-value is down to 0.111, but that’s still not small enough to rule out 0.52 as a plausible value of the parameter.

Where does this leave us?  We cannot reject that the racquet spinning process is fair (parameter value 0.5), but there are lots and lots* of other parameter values that we also cannot reject.  Therefore, it’s inappropriate to accept one particular value, or to conclude that the data provide evidence in favor of one particular value, because there are many values that are similarly plausible for the parameter.  The racquet spinning process might be fair, but it also might be biased slightly in favor of up or considerably against up.

* Infinitely many, in fact


Now let’s consider a new example, which addresses the age-old question: Is yawning contagious?  The folks at the popular television series MythBusters investigated this question by randomly assigning 50 volunteers to one of two groups:

  • Yawn seed group: A confederate of the show’s hosts purposefully yawned as she individually led 34 subjects into a waiting room.
  • Control group: The person led 16 other subjects into a waiting room and was careful not to yawn.

All 50 subjects were observed by hidden camera as they sat in the room, to see whether or not they yawned as they waited for someone to come in.  Here is the resulting 2×2 table of counts:

                     Yawned   Did not yawn   Total
  Yawn seed group      10          24          34
  Control group         4          12          16
  Total                14          36          50

The hosts of the show calculated that 10/34 ≈ 0.294 of the subjects in the yawn seed group yawned, compared to 4/16 = 0.250 of the subjects in the control group.  The hosts conceded that this difference is not dramatic, but they noted that the yawn seed group had a higher proportion who yawned than the control group, and they went on to declare that the data confirm the "yawning is contagious" hypothesis.

We can use an applet (here) to simulate a randomization test* on these data.  The p-value turns out to be approximately 0.513, as seen in the following graph of simulation results:

* See post #27 (Simulation-based inference, part 2, here) for an introduction to such an analysis.
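
For anyone who would like a scripted version alongside the applet, here is a rough Python sketch of one such randomization test.  The re-randomization scheme and number of repetitions are my choices, so the simulated p-value will only approximate the 0.513 reported above:

```python
import numpy as np

rng = np.random.default_rng()

# Observed results: 10 of 34 yawned in the yawn seed group, 4 of 16 in the control group.
outcomes = np.array([1] * 14 + [0] * 36)        # 14 yawners among the 50 subjects
observed_diff = 10 / 34 - 4 / 16                # about 0.044

def one_rerandomization(rng):
    """Shuffle the 50 outcomes and reassign the first 34 to the yawn seed group."""
    shuffled = rng.permutation(outcomes)
    return shuffled[:34].mean() - shuffled[34:].mean()

simulated_diffs = np.array([one_rerandomization(rng) for _ in range(10000)])

# One-sided p-value: how often does random assignment alone produce a difference
# in yawning proportions at least as large as the one observed?
p_value = np.mean(simulated_diffs >= observed_diff)
print(round(p_value, 3))   # roughly 0.5
```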

  • a) State the null and alternative hypotheses, in words.
  • b) Do you agree with the conclusion reached by the show’s hosts? Explain.
  • c) How would you respond to someone who concluded: “The hosts are completely wrong.  The data from this study actually provide strong evidence that yawning is not contagious.”

a) The null hypothesis is that yawning is not contagious.  In other words, the null hypothesis is that people exposed to a yawn seed group have the same probability of yawning as people not so exposed.  The alternative hypothesis is that yawning is contagious, so people exposed to a yawn seed group are more likely to yawn than people not so exposed.

b) The conclusion of the show’s hosts is not supported by the data.  Such a small difference in yawning proportions between the two groups could easily have occurred by the random assignment process alone, even if yawning is not contagious.  The data do not provide nearly enough evidence for concluding that yawning is contagious.

c) This conclusion goes much too far in the other direction.  It’s not appropriate to conclude that yawning is not contagious.  A hypothesis test only assesses evidence against a null hypothesis, not in favor of a null hypothesis.  It’s plausible that yawning is not contagious, but the observed data are also consistent with yawning being a bit contagious or even moderately contagious.


As I wrap up this lengthy post, let me offer five pieces of advice for helping students to avoid mis-stating conclusions from not-so-small p-values:

1. I strongly advise introducing hypothesis testing with examples that produce very small p-values and therefore provide strong evidence against the null hypothesis.  The blindsight study that I used in post #12 (Simulation-based inference, part 1, here) is one such example.  I think a very small p-value makes it much easier for students to hang their hat on the reasoning process behind hypothesis testing.

2. Later be sure to present several examples that produce not-so-small* p-values, giving students experience with drawing “not enough evidence to reject the null” conclusions.

* You have no doubt noticed that I keep saying not-so-small rather than large.  I think this also indicates how tricky it is to work with not-so-small p-values.  A p-value of .20 does not provide much evidence against a null hypothesis, and I consider a p-value of .20 to be not-so-small rather than large.

3. Emphasize that there are many plausible values of the parameter that would not be rejected by a hypothesis test, so it’s not appropriate to accept the one particular value that appears in the null hypothesis.

4. Take a hard line when grading students’ conclusions.  Do not give full credit for a conclusion that mentions evidence for a null hypothesis or accepts a null hypothesis.

5. In addition to asking students to state their own conclusions, provide them with a variety of mis-stated and well-stated conclusions, and ask them to identify which are which.

Do you remember the question that motivated this post? Are students more likely to earn full credit for stating a conclusion from a p-value of .02 or .20?  Are you persuaded to reject the hypothesis that students are equally likely to earn full credit with either option? Have I provided convincing arguments that drawing an appropriate conclusion is easier for students from a p-value of .02 than from a p-value of .20?