
#28 A pervasive pet peeve

Let’s suppose that you and I are both preparing to teach our next class.  Being easily distracted, I let my mind (and internet browser) wander to check on my fantasy sports teams, so I only devote 60% of my attention to my class preparation.  On the other hand, you keep distractions to a minimum and devote 90% of your attention to the task.  Let’s call these values (60% for me, 90% for you) our focus percentages.  Here’s the question on which this entire post hinges: Is your focus percentage 30% higher than mine?

I have no doubt that most students would answer yes.  But that’s incorrect, because 90 is 50% (not 30%) larger than 60.  This mistaking of a difference in percentages as a percentage difference is the pet peeve that permeates this post.

I will describe some class examples that help students learn how to work with percentage differences.  Then I’ll present some assessment items for giving students practice with this tricky idea.  Along the way I’ll sneak in a statistic that rarely appears in Stat 101 courses: relative risk.  As always, questions for students appear in italics.


A rich source of data on high school students in the United States is the Youth Risk Behavior Surveillance System (YRBSS).  Here are counts from the 2017 YRBSS report, comparing youths in Arizona and California on how often they wear a seat belt when riding in a car driven by someone else:

For each state, calculate the proportion (to three decimal places) of respondents who rarely or never wear a seat belt.  These proportions are 173/2139 ≈ 0.081 for Arizona, 103/1778 ≈ 0.058 for California.  Convert these proportions to percentages, and use these percentages in sentences*.  Among those who were surveyed, 8.1% of the Arizona youths and 5.8% of the California youths said that they rarely or never wear a seat belt when riding in a car driven by someone else.

* I think it’s worthwhile to explicitly ask students to convert proportions to percentages.  It’s more common to speak about percentages than proportions, and this conversion is non-trivial for some students.

Is it correct to say that Arizona youths in the sample were 2.3% more likely to wear a seat belt rarely or never than California youths in the sample?  Some students need a moment to press 8.1 – 5.8 into their calculator or cell phone to confirm the value 2.3, and then almost all students respond yes.

Let me pause here, because I want to be very clear: This is my pet peeve.  I explain that the difference between the two states’ percentages (8.1% and 5.8%) is 2.3 percentage points, but that’s not the same thing as a 2.3 percent difference.


At this point I ask students to indulge me in a brief detour.  Percentage difference between any two values is often tricky for people to understand, but working with percentages as the two values to be compared makes the calculation and interpretation all the more confusing. The upcoming detour simplifies this by using more generic values than percentages.

Suppose that my IQ is 100* and Beth’s is 140.  These IQ scores differ by 40 points.  What is the percentage difference in these IQ scores?  I quickly admit to my students that this question is not as clear as it could be.  When we talk about percentage difference, we need to specify compared to what.  In other words, we need to make clear which value is the reference (or baseline).  Let me rephrase: By what percentage does Beth’s IQ exceed mine?  Now we know that we are to treat my IQ score as the reference value, so we divide the difference by my IQ score: (140 – 100) / 100 = 0.40.  Then to express this as a percentage, we multiply by 100% to obtain: 0.40×100% = 40%.  There’s our answer: Beth’s IQ score is 40% larger than mine.

* I joked about my IQ score in post #5, titled A below-average joke, here.

Why did this percentage difference turn out to be the same as the actual difference?  Because the reference value was 100, and percent means out of 100.  Let’s make the calculation slightly harder by bringing in Tom, whose IQ is 120.  By what percentage does Beth’s IQ exceed Tom’s?  Using Tom’s IQ score as the reference gives a percentage difference of: (140 – 120) / 120 × 100% ≈ 16.7%.  Beth’s IQ score, which is 20 points higher than Tom’s, is 16.7% greater than Tom’s.

Does this mean that Tom’s IQ score is 16.7% below Beth’s?  Many students realize that the answer is no, because this question changes the reference value to be Beth’s rather than Tom’s.  The calculation is now: (120 – 140) / 140 × 100% ≈ -14.3%.  Tom’s IQ score is 14.3% lower than Beth’s.

Calculate and interpret the percentage difference between Tom’s IQ score and mine, in both directions.  Comparing Tom’s IQ score to mine is the easier one, because we’ve seen that a reference value of 100 makes calculations easier: (120 – 100) / 100 × 100% ≈ 20%.  Tom’s IQ score is 20% higher than mine.  Comparing my score to Tom’s gives: (100 – 120) / 120 × 100% ≈ -16.7%.  My IQ score is 16.7% lower than Tom’s*.

* I think I can hear what many of you are thinking: Wait a minute, this is not statistics!  I agree, but I nevertheless think this topic, which should perhaps be classified as numeracy, is relevant and important to teach in introductory statistics courses.  Otherwise, many students will continue to make mistakes throughout their professional and personal lives when working with and interpreting percentages.  I will end this detour and return to examining real data now.
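For readers who like to verify such arithmetic computationally, here is a minimal sketch in Python (my own illustration, not part of the original activity) that captures both the calculation and its asymmetry:

```python
def pct_diff(value, reference):
    """Percentage by which value differs from the reference (baseline)."""
    return (value - reference) / reference * 100

print(pct_diff(140, 100))   # Beth vs. me: 40.0
print(pct_diff(140, 120))   # Beth vs. Tom: about 16.7
print(pct_diff(120, 140))   # Tom vs. Beth: about -14.3, not -16.7
```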

Let’s return to the YRBSS data.  Calculating a percentage difference can seem more complicated when dealing with proportions, but the process is the same.  Calculate the percentage difference by which the Arizona youths’ proportion who rarely or never use a seat belt exceeds that for California youths.  Earlier we calculated the difference in proportions to be: 0.081 – 0.058 = 0.023. Now we divide by California’s baseline value to obtain: 0.023/0.058 ≈ 0.396, and finally we convert this to a percentage difference by taking: 0.396 × 100% = 39.6%.  Write a sentence interpreting this value in context.  Arizona youths in this sample were 39.6% more likely to rarely or never wear a seat belt than California youths.  Finally, just to make sure that my pet peeve is not lost on students: Is this percentage difference of 39.6% close to the absolute difference of 2.3 percentage points?  Not at all!
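The same point can be made in code; this sketch (again my own) computes both the percentage-point difference and the percentage difference from the raw counts:

```python
az = 173 / 2139   # Arizona proportion who rarely/never wear a seat belt
ca = 103 / 1778   # California proportion

print((az - ca) * 100)        # about 2.3: difference in percentage points
print((az - ca) / ca * 100)   # about 39.6: percentage difference, relative to California
```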


Next I take students on what appears to be a tangent but will lead to a connection with a different statistic for comparing proportions between two groups.  Calculate the ratio of proportions who rarely or never use a seat belt between Arizona and California youths in the survey.  This calculation is straightforward: 0.081/0.058 ≈ 1.396.  Write a sentence interpreting this value in context.  Arizona youths in the survey are 1.396 times more likely to rarely or never wear a seat belt than California youths.  I emphasize that the word times is a crucial one in this sentence.  The word times is correct here because we calculated a ratio in the first place.

Then I reveal to students that this new statistic (ratio of proportions) is important enough to have its own name: relative risk.  The relative risk of rarely or never wearing a seat belt, comparing Arizona to California youths, is 1.396.  The negative word risk is used here because this statistic is often reported in medical studies, comparing proportions with a negative result such as having a disease.  The convention is to put the larger proportion in the numerator, using the smaller proportion to indicate the reference group.

Does the number 1.396 look familiar from our earlier analysis?  Most students respond that the percentage difference was 0.396, which seems too strikingly similar to 1.396 to be a coincidence.  Make a conjecture for the relationship between percentage difference and relative risk.  Many students propose: percentage difference = (relative risk – 1) × 100%.

I ask students to test this conjecture with YRBSS data on seat belt use from Pennsylvania and California youths:

Calculate and interpret the difference and ratio of proportions who rarely or never use seat belts.  The “rarely or never” proportion in Pennsylvania is 425/3761 ≈ 0.113.  We’ve already calculated that the proportion in California is 103/1778 ≈ 0.058.  The difference in proportions is 0.113 – 0.058 = 0.055.  The percentage of Pennsylvania youths in the sample who said that they rarely or never wear a seat belt is 5.5 percentage points higher than the percentage of California youths who answered “rarely or never.”  The ratio of proportions is 0.113/0.058 ≈ 1.951*.  A Pennsylvania youth in the sample was 1.951 times more likely than a California youth to rarely or never wear a seat belt.

* I performed this calculation on the actual counts, not the proportions rounded to three decimal places in the numerator and denominator.

Verify that the conjectured relationship between percentage difference and relative risk holds.  The percentage difference in the proportions can be calculated as: (0.113 – 0.058) / 0.058 × 100% ≈ 95.1%, again carrying out the calculation on the actual counts rather than the rounded proportions.  This can also be calculated from the ratio as: (1.951 – 1) × 100% ≈ 95.1%.
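A quick sketch (mine, not from the original post) verifies the conjecture with the Pennsylvania and California counts, avoiding the rounding noted in the footnote above:

```python
pa = 425 / 3761   # Pennsylvania proportion who rarely/never wear a seat belt
ca = 103 / 1778   # California proportion

relative_risk = pa / ca
print(relative_risk)               # about 1.951
print((pa - ca) / ca * 100)        # percentage difference: about 95.1
print((relative_risk - 1) * 100)   # same value via the conjecture: about 95.1
```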

I am not necessarily proposing that relative risk needs to be taught in Stat 101 courses.  I am urging a very careful treatment of percentage difference, and it takes just an extra 15 minutes of class time to introduce relative risk.


Let’s follow up with a confidence interval for a difference in proportions.  If we go back to comparing the responses from Arizona and California youths, a 95% confidence interval for the difference in population proportions turns out to be: 0.023 ± 0.016, which is the interval (0.007 → 0.039).

Interpret what this interval reveals.  First recall that the order of subtraction is Arizona minus California, and notice that the interval contains only positive values.  We are 95% confident that the proportion of all Arizona youths who would answer that they rarely or never wear a seat belt is between 0.007 and 0.039 larger than the proportion of all California youths who would give that answer.  We can translate this answer to percentage points by saying that the Arizona percentage (of all youths who would answer that they rarely or never wear a seat belt) is between 0.7 and 3.9 percentage points larger than the California percentage.  But many students trip themselves up by saying that Arizona youths are between 0.7% and 3.9% more likely than California youths to answer that they rarely or never wear a seat belt.  This response is incorrect, for it succumbs to my pet peeve of mistakenly interpreting a difference in percentages as a percentage difference.

What parameter do we need to determine a confidence interval for, in order to estimate the percentage difference in population proportions (who rarely or never wear a seat belt) between Arizona and California youths?  A confidence interval for the population relative risk will allow this.  Such a procedure exists, but it is typically not taught in an introductory statistics course*.  For the YRBSS data on seat belt use in Arizona and California, a 95% confidence interval for the population relative risk turns out to be (1.103 → 1.767).

* The sampling distribution of a sample relative risk is skewed to the right, but the sampling distribution of the log transformation of the sample relative risk is approximately normal.  So, a confidence interval can be determined for the log of the population relative risk, which can then be transformed back to a confidence interval for the population relative risk.
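For the curious, here is a minimal sketch of that procedure, assuming the usual large-sample standard error for the log of a sample relative risk; it reproduces the interval reported above:

```python
from math import exp, log, sqrt

x1, n1 = 173, 2139   # Arizona: rarely/never count, sample size
x2, n2 = 103, 1778   # California

rr = (x1 / n1) / (x2 / n2)
se = sqrt(1/x1 - 1/n1 + 1/x2 - 1/n2)   # standard error of log(rr)
lower, upper = exp(log(rr) - 1.96 * se), exp(log(rr) + 1.96 * se)
print(lower, upper)   # approximately (1.103, 1.767)
```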

What aspect of this interval indicates strong evidence that Arizona and California have different population proportions?  This can be a challenging question for students, so I often offer a hint: What value would the relative risk have if the two population proportions were the same?  Most students realize that the relative risk (ratio of proportions) would equal 1 in this case.  That the interval above is entirely above 1 indicates strong evidence that Arizona’s population proportion (who rarely or never wear a seat belt) is larger than California’s.

Interpret this confidence interval.  We are 95% confident that Arizona youths are between 1.103 and 1.767 times more likely than California youths to answer that they rarely or never wear a seat belt.  Convert this to a statement about the percentage difference in the population proportions.  We can convert this to percentage difference by saying: We are 95% confident that Arizona youths are between 10.3% and 76.7% more likely than California youths to answer that they rarely or never wear a seat belt.

I am not suggesting that students learn how to calculate a confidence interval for a relative risk in Stat 101, but I do think students should be able to interpret such a confidence interval.


Now we return to the YRBSS data for a comparison that illustrates another difficulty that some students have with percentages.  The YRBSS classifies respondents by race, and the 2017 report says that 9.8% of black youths and 4.3% of white youths responded that they rarely or never wear a seat belt.  Calculate the ratio of these percentages.  This ratio is: 0.098/0.043 ≈ 2.28.  Write a sentence interpreting the relative risk.  Black youths who were surveyed were 2.28 times more likely than white youths to rarely or never wear a seat belt.  Complete this sentence: Compared to white youths who were surveyed, black youths were ______ % more likely to rarely or never wear seat belts.  To calculate the percentage difference, we can use the relative risk as we discovered above: (2.28 – 1) × 100% = 128%.  Black youths who were surveyed were 128% more likely to rarely or never wear seat belts, as compared to white youths.

Hold on, can a percentage really be larger than 100%?  Yes, a percentage difference (or a percentage change or a percentage error) can exceed 100%.  If one value is exactly twice as big as another, then it is 100% larger.  So, if one value is more than twice as big as another, then it is more than 100% larger.  In this case, the percentage (who rarely or never use a seat belt) for black youths is more than twice the percentage for white youths, so the relative risk exceeds 2, and the percentage difference between the two percentages therefore exceeds 100%.


Here is a quiz containing five questions, all based on real data, for giving students practice working with percentage differences:

  • a) California’s state sales tax rate in early 2019 was 7.3%, compared to Hawaii’s state sales tax rate of 4.0%.  Was California’s state sales tax rate 3.3% higher than Hawaii’s?  If not, determine the correct percentage difference to use in that sentence.
  • b) Alaska had a 0% state sales tax rate in early 2019.  Could Hawaii match Alaska’s rate by reducing its rate by 4%?  If not, determine the correct percentage reduction to use in that sentence.
  • c) Steph Curry successfully made 354 of his 810 (43.7%) three-point shots in the 2018-19 NBA season, and Russell Westbrook successfully made 119 of his 411 (29.0%) three-point shots.  Could Westbrook have matched Curry’s success rate with a 14.7% improvement in his own success rate?  If not, determine the correct percentage improvement to use in that sentence.
  • d) Harvard University accepted 4.5% of its freshman applicants for Fall 2019, and Duke University accepted 7.4% of its applicants.  Was Harvard’s acceptance rate 2.9% lower than Duke’s?  If not, then determine the correct percentage difference to use in that sentence.
  • e) According to the World Bank Development Research Group, 10.0% of the world’s population lived in extreme poverty in 2015, compared to 35.9% in 1990.  Did the percentage who lived in extreme poverty decrease by 25.9% in this 25-year period?  If not, determine the correct percentage decrease to use in that sentence.

The correct answer to all of these yes/no questions is no, not even close.  Correct percentage differences are: a) 82.5% b) 100% c) 50.9% d) 39.2% e) 72.1%.
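Teachers who want to generate or check answers like these could use a short snippet along these lines (my own sketch; negative values indicate a percentage decrease):

```python
def pct_diff(value, reference):
    return (value - reference) / reference * 100

print(pct_diff(7.3, 4.0))           # a) 82.5
print(pct_diff(0.0, 4.0))           # b) -100.0, i.e., a 100% reduction
print(pct_diff(354/810, 119/411))   # c) about 50.9, from the actual counts
print(pct_diff(4.5, 7.4))           # d) about -39.2
print(pct_diff(10.0, 35.9))         # e) about -72.1
```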


I briefly considered titling this post: A persnickety post that preaches about a pervasive, persistent, and pernicious pet peeve concerning percentages.  That title contains 15 words, 9 of which start with the letter P, so 60% of the words in that title begin with P.  Instead I opted for the much simpler title: A pervasive pet peeve, for which 75% of the words begin with P. 

Does this mean that I increased the percentage of P-words by 15% when I chose the shorter title?  Not at all, that’s the whole point!  I increased the percentage of P-words by 15 percentage points, but that’s not the same as 15%.  In fact, the percentage increase is (75 – 60) / 60 × 100% = 25%, not 15%.

Furthermore, notice that 25% is 66.67% larger than 15%, so the percentage increase (in percentage of P-words) that I achieved with the shorter title is 66.67% greater than what many would mistakenly believe the percentage increase to have been.

No doubt I have gotten carried away*, as that last paragraph is correct but positively** ridiculous.  I’ll conclude with two points: 1) Misunderstanding percentage difference (or change) is very common, and 2) Teachers of statistics can help students to calculate and interpret percentage difference correctly.

* You might have come to that conclusion far earlier in this post.

** I couldn’t resist using another P word here. I really need to press pause on this preposterous proclivity.

P.S. The 2017 YRBSS report can be found here.  You might ask students to select their own questions and variables to analyze and compare. Data on state sales tax rates appear here, basketball players’ shooting percentages here, college acceptance rates here, and poverty rates here.

#27 Simulation-based inference, part 2

I believe that simulation-based inference (SBI) helps students to understand the underlying concepts and logic of statistical inference.  I described how I introduce SBI back in post #12 (here), in the scenario of inference for a single proportion.  Now I return to the SBI theme* by presenting a class activity that concerns comparing proportions between two groups.  As always, questions that I pose to students appear in italics.

* Only 15 weeks after part 1 appeared!


I devote most of a 50-minute class meeting to the activity that I will describe here.  The research question is whether metal bands* used for tagging penguins are actually harmful to their survival.

* Some students, and also some fellow teachers, tell me that they initially think that I am referring to penguins listening to heavy metal bands.

I begin by telling students that the study involved 20 penguins, of which 10 were randomly assigned to have a metal band attached to their flippers, in addition to an RFID chip for identification.  The other 10 penguins did not receive a metal band but did have an RFID chip.  Researchers then kept track of which penguins survived for the 4.5-year study and which did not.

I ask students a series of questions before showing any results from the study: Identify and classify the explanatory and response variables.  The explanatory variable is whether or not the penguin had a metal band, and the response is whether or not the penguin survived for at least 4.5 years.  Both variables are categorical and binary.  Is this an experiment or an observational study?  This is an experiment, because penguins were randomly assigned to wear a metal band or not.  Did this study make use of random sampling, random assignment, both, or neither?  Researchers used random assignment to put penguins in groups but (presumably) did not take a random sample of penguins.  State the null and alternative hypotheses, in words.  The null hypothesis is that metal bands have no effect on penguin survival.  The alternative hypothesis is that metal bands have a harmful effect on penguin survival.

Then I tell students that 9 of the 20 penguins survived, 3 with a metal band and 6 without.  Organize these results into the following 2×2 table:

The completed table becomes:

                Control   Metal band   Total
    Survived       6          3           9
    Perished       4          7          11
    Total         10         10          20

Calculate the conditional success proportions for each group.  The proportion in the control group who survived is 6/10 = 0.6, and the proportion in the metal band group who survived is 3/10 = 0.3*.  Calculate the difference in these success proportions.  I mention that students could subtract in either order, but I want us all to be consistent so I instruct them to subtract the proportion for the metal band group from that of the control group: 0.6 – 0.3 = 0.3.

* I cringe when students use their calculator or cell phone for these calculations.

Is it possible that this difference could have happened even if the metal band had no effect, simply due to the random nature of assigning penguins to groups (i.e., the luck of the draw)?  I often give my students a silly hint that the correct answer has four letters.  Realizing that neither no nor yes has four letters, I get many befuddled looks before someone realizes: Sure, it’s possible!  Joking aside, this is a key question.  This question gets at why we need to conduct inference in the first place.  We cannot conclude that metal bands are harmful simply because a smaller proportion survived with metal bands than without them.  Why not?  Because this result could have happened even if metal bands are not harmful.

What question do we need to ask next?  Students are surprised that I ask them to propose the next question.  If they ask for a hint, I remind them of our earlier experience with SBI.  To analyze a research study of whether a woman with brain damage experienced a phenomenon known as blindsight, we investigated how surprising it would be to correctly identify the burning house in 14 of 17 pairs of drawings, if in fact she was choosing randomly between the two houses (one burning, one not) presented.  For this new context I want students to suggest that we ask: How likely, or how surprising, is it to obtain a difference in success proportions of 0.3 or greater, if in fact metal bands are not harmful?

How will we investigate this question?  With simulation!


Once again we start with by-hand simulation before turning to technology.  As always, we perform our simulation assuming that the null hypothesis is true: that the metal band has no effect on penguin survival.  More specifically, we assume that the 9 penguins who survived would have done so with the metal band or not, and the 11 penguins who did not survive would have perished with the metal band or not.

We cannot use a coin to conduct this simulation, because unlike with the blindsight study, we are not modeling a person’s random selections between two options.  Now we want our simulation to model the random assignment of penguins to treatment groups.  We can use cards to do this.

How many cards do we need?  Each card will represent a penguin, so we need 20 cards.  Why do we need two colors of cards?  How many cards do we need of each color?  We need 9 cards of one color, to represent the 9 penguins who survived, and we need 11 cards of the other color, to represent the 11 penguins who perished.  After shuffling the cards, how many will we deal into how many groups?  One group of cards will represent the control group, and a second group of cards will represent penguins who received a metal band.  We’ll deal out 10 cards into each group, just as the researchers randomly assigned 10 penguins to each group.  What will we calculate and keep track of for each repetition?  We will calculate the success proportion for each group, and then calculate the difference between those two proportions.  I emphasize that we all need to subtract in the same order, so students must decide in advance which group is control and which is not, and then subtract in the same order: (success proportion in control group minus success proportion in metal band group).

I provide packets of 20 ordinary playing cards to my students, pre-arranged with 9 red cards and 11 black ones per packet.  Students shuffle the cards and deal them into two piles of 10 each.  Then they count the number of red and black cards in each pile and fill in a table in which we already know the marginal totals:

Next we need to decide: What (one) statistic should we calculate from this table?  A very reasonable choice is to use the difference in survival proportions as our statistic*.  I remind students that it’s important that we all subtract in the same order: (proportion who survived in control group) minus (proportion who survived in metal band group).  Students then come to the whiteboard to put the value of their statistic (difference in proportions) on a dotplot.  A typical result for a class of 35 students looks like**:

* I will discuss some other possible choices for this statistic near the end of this post.

** Notice that the distribution of this statistic (difference in proportions) is discrete.  Only a small number of values are possible, because of the fixed margins of the 2×2 table.  When I draw an axis on the board, I put tick marks on these possible values before students put their dots on the graph.  Occasionally a student will obtain a value that does not fall on one of these tick marks, because they have misunderstood the process or made a calculation error.

Where is this distribution centered?  Why does this make sense?  This distribution is centered near zero.  This makes sense because the simulation assumed that there’s no effect of the metal band, so we expect this difference to be positive about half the time and negative about half the time*.

* Some students are tempted to simply take the larger proportion minus the smaller proportion, so I repeat often that they should subtract in the agreed order: (control minus metal band).  Otherwise, the center of this distribution will not be near zero as it should be.

What is important to notice in this graph, to address the key question of whether the data provide strong evidence that the metal bands are harmful to penguin survival?  This brings students back to the goal of the simulation analysis: to investigate whether the observed result would have been surprising if metal bands have no effect.  Some students usually point out that the observed value of the statistic was 0.3, so we want to see how unusual it is to obtain a statistic of 0.3 or greater.  Does the observed value of the statistic appear to be very unusual in our simulation analysis?  No, because quite a few of the repetitions produced a value of 0.3 or more.  What proportion of the repetitions produced a statistic at least as extreme as the observed value?  Counting the occurrences at 0.3 and higher reveals that 9/35 ≈ 0.257 of the 35 repetitions produced a difference in success proportions of 0.3 or more.  What does this reveal about the strength of evidence that metal bands are harmful?  Because a result as extreme as in the actual study occurred about 26% of the time in our simulation, and 26% is not small enough to indicate a surprising result, the study does not provide strong evidence that metal bands are harmful.

By what term is this 0.257 value known?  This is the (approximate) p-value.  How can we produce a better approximation for the p-value?  Repeat the process thousands of times rather than just 35 times.  In order to produce 10,000 repetitions, should we use cards or technology?  Duh!
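Here is one way the technology version can be implemented; this is a minimal sketch in Python (my own illustration, not the code behind the applet used below), carrying out the same shuffling that students did with cards:

```python
import numpy as np

rng = np.random.default_rng()

# 20 penguins: 9 survivors (1) and 11 who perished (0), as with the cards
penguins = np.array([1] * 9 + [0] * 11)
observed_diff = 6/10 - 3/10   # control minus metal band = 0.3

reps = 10_000
diffs = np.empty(reps)
for i in range(reps):
    shuffled = rng.permutation(penguins)            # shuffle the "cards"
    control, metal = shuffled[:10], shuffled[10:]   # deal 10 to each group
    diffs[i] = control.mean() - metal.mean()

# approximate p-value: proportion of repetitions at least as extreme as observed
print(np.mean(diffs >= observed_diff))   # typically near 0.18
```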


Now we turn to an applet (here) to conduct the simulation analysis.  First we click on 2×2, then enter the table of counts and click on Use Table:

Next we check Show Shuffle Options on the right side of the applet screen.  I like to keep the number of shuffles set at 1 and click “Shuffle” several times to see the results.  By leaving the Cards option selected, you see 20 colored cards (blue for survival, green for perishing) being shuffled and re-randomized, just as students did with their own packet of 20 cards in class.  You can also check Data or Plot to see different representations of the shuffling.  You might remind students that the underlying assumption behind the simulation analysis is that the metal bands have no effect on penguin survival (i.e., that the null hypothesis is true).

Eventually I ask for 10,000 shuffles, and the applet produces a graph such as:

Once again I ask students to notice that the distribution (of shuffled differences in proportions) is centered near zero.  But again the key question is: Does the simulation analysis indicate that the observed value of the statistic would be very surprising if metal bands have no effect?  Students are quick to say that the answer is no, because the observed value (0.3) is not very far out in the tail of this distribution.  How can we calculate the (approximate) p-value?  By counting the number of repetitions that produced a difference of 0.3 or more, and then dividing by 10,000.  The applet produces something like:

What conclusion do you draw?  Results as extreme as the one observed (a difference in survival proportions between the two groups of 0.3 or more) would not be surprising (p-value ≈ 0.1827) if the metal band had no effect on penguin survival.  Therefore, the experimental data do not provide strong evidence that metal bands are harmful to penguin survival.


I have a confession to make.  I confess this to students at this point in the class activity, and I also confess this to you now as you read this.  The sample size in this experiment was not 20 penguins.  No, the researchers actually studied 100 penguins, with 50 penguins randomly assigned to each group.  Why did I lie*?  Because 100 cards would be far too many for shuffling and counting by hand.  This also gives us an opportunity to see the effect of sample size on such an analysis.

* I chose my words very carefully above, saying I begin by telling students that the study involved 20 penguins …  While I admit to lying to my students, I like to think that I avoided telling an outright lie to you blog readers. If you don’t want to lie to your students, you could tell them at the outset that the data on 20 penguins are based on the actual study but do not comprise the complete study.

Now that I have come clean*, let me show the actual table of counts:

                Control   Metal band   Total
    Survived      31         16          47
    Perished      19         34          53
    Total         50         50         100

* Boy, does my conscience feel better for it!

We need to redo the analysis, but this goes fairly quickly in class because we have already figured out what to do.  Calculate the survival proportions for each group and their difference (control minus metal band).  The survival proportions are 31/50 = 0.62 in the control group and 16/50 = 0.32 in the metal band group, for a difference of 0.62 – 0.32 = 0.30*.  Before we re-run the simulation analysis, how do you expect the p-value to change, if at all?  Many students have good intuition that the p-value will be much smaller this time.  Here is a typical result with 10,000 repetitions:

* I try to restore my credibility with students by pointing out that I did not lie about the value of this statistic.

What conclusion would you draw?  Explain.  Now we have a very different conclusion.  This graph shows that the observed result (a difference in survival proportions of 0.3) would be very surprising if the metal band has no harmful effect.  A difference of 0.3 or larger occurred in only 23 of 10,000 repetitions under the assumption of no effect.  The full study of 100 penguins provides very strong evidence that metal bands are indeed harmful to penguin survival.

Before concluding this activity, a final question is important to ask: The word harmful in that conclusion is a very strong one.  Is it legitimate to draw a cause-and-effect conclusion here?  Why or why not?  Yes, because researchers used random assignment, which should have produced similar groups of penguins, and because the results produced a very small p-value, indicating that such a big difference between the survival proportions in the two groups would have been unlikely to occur if metal bands had no effect.


That completes the class activity, but I want to make two additional points for teachers, which I also explain to mathematically inclined students:

1. We could have used a different statistic than the difference in success proportions.  For a long time I advocated using simply the number of successes in group A (in this case, the number of survivors in the control group).  Why are these two statistics equivalent?  Because we are fixing the counts in both margins of the 2×2 table (9 who survived and 11 who perished, 10 in each treatment group), there’s only one degree of freedom.  What does this mean?  Once you specify the count in the upper left cell of the table (or any other cell, for that matter), the rest of the counts are then determined, and so the difference in success proportions is also determined.  In other (mathematical) words, there’s a one-to-one correspondence between the count in the upper left cell and the difference in success proportions.

Why did I previously use the count in the upper left cell as the statistic in this activity?  It’s easier to count than to calculate two proportions and the difference between them, so students are much more likely to make a mistake when they calculate a difference in success proportions.  Why did I change my mind, now favoring the difference in success proportions between the two groups?  My colleagues persuaded me that calculating proportions is always a good step when dealing with count data, and considering results from both groups is also a good habit to develop.

Those two statistics are not the only possible choices, of course.  For example, you could calculate the ratio of success proportions rather than the difference; this ratio is called the relative risk.  You could even calculate the value of a chi-square statistic, but I certainly do not recommend that when you are introducing students to 2×2 tables for the first time.  Because of the one degree of freedom, all of these statistics would produce the same (approximate) p-value from a given simulation analysis.  The applet used above allows for choosing any of these statistics, in case you want students to explore this for themselves.

2. Just as we can use the binomial distribution to calculate an exact p-value in the one-proportion scenario, we can also calculate an exact p-value for the randomization test in this 2×2 table scenario.  The relevant probability distribution is the hypergeometric distribution, and the test is called Fisher’s exact test.  The calculation involves counting techniques, namely combinations.  The exact p-values can be calculated both for the sample size of 20 penguins and for the full sample of 100 penguins.
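In Python, for example, scipy implements this hypergeometric calculation as fisher_exact; here is a sketch for both versions of the study:

```python
from scipy.stats import fisher_exact

# rows: survived, perished; columns: control, metal band
small_study = [[6, 3], [4, 7]]        # the 20-penguin version
full_study = [[31, 16], [19, 34]]     # the full study of 100 penguins

for table in (small_study, full_study):
    # alternative='greater' asks for evidence that survival is higher
    # in the control group (equivalently, that metal bands are harmful)
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    print(p_value)
```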


There you have it: simulation-based inference for comparing success proportions between two groups.  I emphasize to students throughout this activity that the reasoning process is the same as it was with one proportion (see post #12 here).  We simulate the data-collection process assuming that the null (no effect) hypothesis is true.  Then if we find that the observed result would have been very surprising, we conclude that the data provide strong evidence against the null hypothesis.  In this case we saw that the observed result would not have been surprising, so we do not have much evidence against the null hypothesis.

This activity can reinforce what students learned earlier in the course about the reasoning process of assessing strength of evidence.  You can follow up with more traditional techniques, such as a two-sample z-test for comparing proportions or a chi-square test.  I think the simulation-based approach helps students to understand what a p-value means and how it relates to strength of evidence.

P.S. You can read about the penguin study here.

P.P.S. I provided several resources and links about teaching simulation-based inference at the end of post #12 (here).

#26 Group quizzes, part 2

In last week’s post (here), I mentioned that I give lots of group quizzes and consider them to be an effective assessment tool that promotes students’ learning.  I provided six examples of quizzes, with five questions per quiz, that I have used with my students. 

Now I pick up where I left off, offering seven more quizzes with comments on each.  The topics of these quizzes include numerical variables and comparisons between groups.

As always, questions that I put to students appear in italics.  A file containing all thirteen quizzes from the two posts, along with solutions, can be downloaded from a link at the end of this post.


7. Answer these questions:

  • a) Suppose that a class of 10 students has the following exam scores: 60, 70, 50, 60, 90, 90, 80, 80, 40, 50.  Determine the median of these 10 exam scores.
  • b) Suppose that the average amount of sleep obtained by Cal Poly undergraduates last night was 6.8 hours, and the average amount of sleep obtained by Cal Poly graduate students last night was 7.6 hours.  Is it reasonable to conclude that the average amount of sleep obtained last night among all Cal Poly students was (6.8 + 7.6)/2 = 7.2 hours?  Explain.
  • c) What effect does doubling every value in a dataset have on the mean?   Explain your answer.
  • d) What effect does adding 5 to every value in a dataset have on the standard deviation?  Explain your answer.
  • e) Create an example of 10 hypothetical exam scores (on a 0 – 100 scale) with the property that the mean is at least 20 points larger than the median.  Also report the values of the mean and median for your example.

This quiz is a hodgepodge that addresses basic concepts of measures of center and variability, following up on topics raised in posts #5 (A below-average joke, here) and #6 (Two dreaded words, here).  Some students think of part (a) as a “trick” question, but I think it’s important for students to remember to put data in order before declaring that the middle value (in this case, the average of the two middle values) is the median.  For part (b), students should respond that this conclusion would only be valid if Cal Poly has the same number of undergraduate and graduate students.  You could ask parts (c) and (d) as multiple choice questions by deleting the “explain” aspect.  When I discuss part (e) with students afterward, I advise them to make such an example as extreme as possible.  To make the mean much larger than the median, they could force the median to be zero by having six scores of zero.  Then they can make the mean as large as possible by having four scores of 100.  This makes the mean equal 400/10 = 40, with a median of 0.
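A quick computational check of that extreme example (my own snippet):

```python
from statistics import mean, median

scores = [0] * 6 + [100] * 4   # six zeros force the median to 0
print(median(scores))          # 0
print(mean(scores))            # 40
```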


8. Suppose that the mean age of all pennies currently in circulation in the U.S. is 12.3 years, and the standard deviation of these ages is 9.6 years.  Suppose also that you take a random sample of 50 pennies and calculate the mean age of the pennies in your sample.

  • a) Are the numbers 12.3 and 9.6 parameters or statistics?  Explain briefly.
  • b) Describe the sampling distribution of the sample mean penny age.  Also produce a well-labeled sketch of this sampling distribution.
  • c) Determine the probability that the sample mean age of your random sample of 50 pennies would be less than 10 years. (Show your work.)
  • d) Are your answers to parts (b) and (c) approximately valid even if the distribution of penny ages is not normally distributed? Explain.
  • e) Based on the values of the mean and standard deviation of penny ages, there is reason to believe that the distribution of penny ages is not normally distributed.  Explain why.

This quiz is a challenging one, because the Central Limit Theorem is a challenging topic.  Part (a) allows students to earn a fairly easy point.  Those numbers are described as pertaining to all pennies in circulation, so they are parameters.  I’m looking for four things in response to part (b): shape (normal), center (mean 12.3 years), and variability (SD 9.6/sqrt(50) ≈ 1.36 years), along with a sketch that specifies sample mean age as the axis label.  Even if a student group has not answered parts (b) and (c) correctly, they can still realize that the large sample size of 50 means that the distribution of the sample mean will be approximately normal, so the answer to part (d) is that the answers to parts (b) and (c) would be valid.  Part (e) is a very challenging one that brings to mind the AP Statistics question discussed in post #8 (End of alphabet, here).  What I have in mind is that the value 0 is only 1.28 standard deviations below the mean, so about 10% of pennies would have a negative age if penny ages followed a normal distribution, which is therefore not plausible.
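The probabilities behind parts (c) and (e) are easy to check with a normal model; here is a sketch, assuming scipy is available:

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 12.3, 9.6, 50
se = sigma / sqrt(n)   # about 1.36 years, by the Central Limit Theorem

print(norm.cdf(10, loc=mu, scale=se))    # part (c): about 0.045
# part (e): a normal model for individual penny ages would put roughly
# 10% of ages below zero, which is impossible, so normality isn't plausible
print(norm.cdf(0, loc=mu, scale=sigma))  # about 0.10
```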


9. A study conducted in Dunedin, New Zealand investigated whether wearing socks over shoes could help people to walk confidently down an icy footpath*.  Volunteers were randomly assigned either to wear socks over their usual footwear or to simply wear their usual footwear, as they walked down an icy footpath.  An observer recorded whether or not the participant appeared to be walking confidently.

  • a) Is this an observational study or an experiment?  Explain briefly.
  • b) Identify the explanatory and response variables.
  • c) Does this study make use of random sampling, random assignment, both, or neither?
  • d) Did the researchers use randomness in order to give all walkers in New Zealand the same chance of being selected for the study?  Answer YES or NO.
  • e) Did the researchers use randomness in order to produce groups that were as similar as possible in all respects before the explanatory variable was imposed?  Answer YES or NO.

* This may not be scientific research of the greatest import, but this is a real study, not a figment of my imagination.  That this study was conducted in New Zealand makes it all the more appealing.  I hope my students enjoy this context as much as I do, but they are probably too focused on answering the quiz questions to notice.

Parts (a) – (c) should come as no surprise to students, as I ask these questions all the time in class.  (See post #11, Repeat after me, here.)  I especially like parts (d) and (e), which ask about the purpose of randomness in data collection.  Most students realize that random assignment does not give all walkers the same chance of being selected but does try to produce groups that are as similar as possible.  (See posts #19 and #20, Lincoln and Mandela, here and here, for more about random sampling and random assignment.)


10. Recall that a study conducted in Dunedin, New Zealand investigated whether wearing socks over shoes could help people to walk confidently down an icy footpath*.  Participants were randomly assigned to wear socks over their usual footwear, or to simply wear their usual footwear, as they walked down an icy footpath.  One of the response variables measured was whether an observer considered the participant to be walking confidently.  Results are summarized in the 2×2 table of counts below:

                                Socks over footwear   Usual footwear   Total
    Walked confidently                  10                  8            18
    Did not walk confidently             4                  7            11
    Total                               14                 15            29

For parts (a) – (c), suppose that you conduct a by-hand simulation analysis to investigate whether wearing socks over shoes increases people’s confidence while walking down an icy footpath.  For parts (d) and (e), consider the results of such a simulation analysis performed with technology.

  • a) What would be the assumption involved with producing the simulation analysis?  Choose one of the following options: A. That wearing socks over shoes has no effect on walkers’ confidence; B. That wearing socks over shoes has some effect on walkers’ confidence; C. That walkers are equally likely to feel confident or not, regardless of whether they wear socks over shoes or not; D. That walkers are more likely to feel confident if they wear socks over shoes
  • b) How many cards would you use in the simulation analysis?  What would the color breakdown be? 
  • c) How many cards would you deal out into groups?  How many times would you repeat this process?
  • d) The graph below displays the results of a simulation analysis with 10,000 repetitions, displaying the distribution of the difference in success proportions between the two groups.  Describe how you would calculate an approximate p-value from this graph (i.e., where would you count?).
  • e) Based on the 2×2 table of data and on this graph of simulation results, how much evidence do the data provide in support of the conjecture that wearing socks over shoes increases people’s confidence while walking down an icy footpath?  Choose one of the following options: A. little or no evidence; B. moderate evidence; C. strong evidence; D. very strong evidence.

* This study is too fascinating to use only once!

This quiz assesses how well students understood a class activity about simulation-based inference for comparing proportions between two groups*.  Part (a) asks for the null hypothesis, without using that term.  Parts (b) – (c) concern the nuts and bolts of conducting a simulation analysis by hand.  Parts (d) and (e) address using the simulation analysis to draw a conclusion.  The hardest part for students is realizing that they need to see where the observed value of the statistic (difference in success proportions between the two groups) falls in the simulated null distribution.  I could have made this more apparent by first asking students to calculate the value of the statistic.  Instead I only give a small hint at the beginning of part (e) by reminding students to use the 2×2 table of observed counts as well as the graph of simulation results.  In this case the observed value of the statistic (10/14 – 8/15 ≈ 0.181) is not a surprising result in the simulated null distribution, so the study provides little or no evidence that wearing socks over shoes is helpful.

* My next blog post (#27) will describe and discuss such a class activity.


11. Researchers at Stanford University studied whether a curriculum could help to reduce children’s television viewing. Third and fourth grade students at two public elementary schools in San Jose were the subjects. One of the schools, chosen at random, incorporated an 18-lesson, 6-month classroom curriculum designed to reduce watching television and playing video games, whereas the other school made no changes to its curriculum. At the beginning and end of the study, all children were asked to report how many hours per week they spent on these activities.  The tables below summarize reported amounts of television watching, first at the beginning of the study and then at its conclusion:

  • a) Is the response variable in this study categorical or numerical?
  • b) The difference between the groups can be shown not to be statistically significant at the beginning of the study.  Do you think the researchers would be pleased by this result?  Explain why or why not.
  • c) Even if the distributions of reported amounts of television watching per week are sharply skewed, would it still be valid to apply a two-sample t-test on these data?  Explain briefly.
  • d) Calculate the value of the test statistic for investigating whether the two groups differ with regard to average amount of television watching per week.
  • e) Based on the value of the test statistic, summarize your conclusion for the researchers.

Part (a) is quite straightforward, offering an easy point for students.  I really like part (b), which asks students to realize that a non-significant difference between the groups at the beginning of the study is a good thing. The lack of significance suggests that random assignment achieved its goal of producing similar groups prior to the intervention.  For part (c) students should recognize that the large sample sizes establish that the two-sample t-test is valid even with skewed distributions.  Notice that the only calculation in the quiz is part (d).  The value of the test statistic in part (d) turns out to be 3.27, which is large enough to conclude in part (e) that the intervention reduced the mean amount of television watching.


12. Answer the following:

  • a) Would you expect to find a positive or negative correlation coefficient between high temperature on January 1, 2020 and distance from the equator, for a sample consisting of one city from each of the 50 U.S. states?  Explain briefly.
  • b) Suppose that you record the daily high temperature and the daily amount of ice cream sold by an ice cream vendor at your favorite beach next summer, starting on the Friday of Memorial Day weekend and ending on the Monday of Labor Day weekend.  Would you expect to find a positive or negative correlation coefficient between these variables?  Explain briefly.
  • c) Suppose that every student in this class scored 5 points lower on the second exam than on the first exam.  Consider the correlation coefficient between first exam score and second exam score.  What would the value of this correlation coefficient be?  Explain briefly.

Parts (d) and (e) pertain to the graph below, which displays data on the age (in months) at which a child first speaks and the child’s score on an aptitude test taken later in childhood:

  • d) Is the value of the correlation coefficient between these variables positive or negative?
  • e) Suppose that the child who took 42 months to speak were removed from the analysis.  Would the value of the correlation coefficient between the variables be closest to -1, 0, or 1?

This quiz addresses association and correlation between two numerical variables.  Parts (a) and (b) ask students to think about a context to determine whether an association would be positive or negative.  Part (c) is very challenging, as I discussed in post #21 (Twenty final exam questions, here).  Many students believe that the correlation must be negative, and some even respond that the correlation coefficient will equal -5!  The correct answer is that the correlation would be exactly 1.0, because the data would fall on a straight line with positive slope.
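A quick demonstration of the answer to part (c), using hypothetical scores (any scores work, as long as every student drops exactly 5 points):

```python
import numpy as np

exam1 = np.array([72.0, 85, 90, 61, 78])   # hypothetical first-exam scores
exam2 = exam1 - 5                          # everyone scores 5 points lower
print(np.corrcoef(exam1, exam2)[0, 1])     # 1.0 (up to floating-point rounding)
```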

Parts (d) and (e) pertain to one of my all-time favorite datasets, which I encountered in Moore and McCabe’s Introduction to the Practice of Statistics near the beginning of my teaching career.  For this quiz I want students to realize that the correlation coefficient is negative but would be close to zero if the child who took the longest to speak were removed.


13. Some of the statistical inference procedures that we have studied include:

  • A. One-sample z-procedures for a proportion
  • B. Two-sample z-procedures for comparing proportions
  • C. One-sample t-procedures for a mean
  • D. Two-sample t-procedures for comparing means
  • E. Paired-sample t-procedures for comparing means

For each of the following questions, identify (by capital letter) which procedure you would use to address that question.  (Be aware that some letters may be used more than once, others not at all.)

  • a) Do cows tend to produce more milk if their handler speaks to them by name every day than if the handler does not speak to them by name?  A farmer randomly assigned half of her cows to each group and then compared how much milk they produced after one month.
  • b) A baseball coach wants to investigate whether players run more quickly from second base to home plate if they take a wide angle or a narrow angle around third base.  He recruits 20 players to serve as subjects for a study.  Each of the 20 players runs with each method (wide angle, narrow angle) once.
  • c) Does the average length of a sentence in a novel written by John Grisham exceed the average length of a sentence in a novel written by Louise Penny?  Students took a random sample of 100 sentences from each author’s most recent novel and recorded the number of words in each sentence.
  • d) Have more than 25% of Cal Poly students been outside of California in the year 2019?
  • e) Are Stanford students more likely to have been outside of California in the year 2019 than Cal Poly students?

I give a quiz like this once or twice in every course.  Students need practice with identifying which procedure to use in a particular situation.  It’s easy and appropriate for students to focus on one topic at a time, so I think we teachers need to ask questions like this that require students to synthesize what they’ve learned across topics.

Notice that the words proportion and mean do not appear in any of the five parts of this quiz, so students cannot simply look for those key words.  I tell students that the key to answering questions like this is to start by identifying the variable(s) and their types (categorical or numerical) and roles (explanatory or response).


The last of the six GAISE recommendations (here) is: Use assessments to improve and evaluate student learning.  The improve part of that recommendation can be very challenging to implement successfully.  I have found group quizzes to be very effective for motivating students to help each other with developing and strengthening their understanding of statistical concepts.

P.S. The study about wearing socks over shoes can be found here.  The study about children’s television viewing can be found here.  The data on age of first speaking can be found here.

P.P.S. The following link contains a Word file with the thirteen quizzes from this post and the previous one, along with solutions.  Teachers should feel free to modify this file for use with their own students.

#25 Group quizzes, part 1

I tell my students in my syllabi that they will take lots of quizzes in my course.  That raises an obvious question: How many is lots?  Let’s look at some data. This graph shows the distribution of number of quizzes that I have given in 25 courses over the past several years:

The median is 18 and mean is 18.72 quizzes per course.  My courses meet for ten weeks (on the quarter system), so my students take an average of slightly fewer than 2 quizzes per week.  Many of my courses meet twice per week for 110 minutes per class meeting, so this amounts to about one quiz per class meeting.

Why do I give so many quizzes?  I tell students in my syllabi that the quizzes provide them with opportunities to:

  • improve their understanding of the material
  • assess how well they understand the material
  • prepare for the kinds of questions that will be on exams

With few exceptions these are group quizzes.  Even though I encourage students to work together and help each other as we work through in-class activities, I find that they engage much more concertedly during these group quizzes. Apparently the precious commodity of points toward their course grade provides a strong motivation.  Conversations that I overhear as students work on group quizzes convince me that they are truly helping each other to learn at these times.

Students usually form groups by self-selection.  Sometimes I mix things up by randomly assigning students to groups for a given quiz.  I typically limit group sizes to three students, but I allow them to form groups of any size for one quiz near the end of the course.  I emphasize that by putting their name on a quiz, they are attesting that they made a good-faith effort to help the group with the quiz.  Once in a while students abuse this rule by including the name of a classmate who was not even in class that day, but I don’t think this occurs too much.

These quizzes typically ask about topics that students are to have learned in class that day.  Students are almost always allowed to use their notes during the quiz. This policy encourages them to take good notes during class and also alleviates some of the stress of taking a quiz.  Once in a while, when I run out of class time, I’ll ask students to complete a quiz outside of class, but students are much less likely to work together on take-home quizzes.

Every quiz consists of five parts, each worth one point.  The lowest two (sometimes three) quiz scores in the course are dropped before the calculation of each student’s overall quiz percentage.  The quizzes usually contribute only 10% to the calculation of the overall course score.  I tell students on the first day of class that most students’ overall course score is helped by their quiz percentage.  The distribution of overall quiz percentages from one of my recent courses, shown in the graph below, is quite typical. The distribution is sharply skewed to the left, largely due to some students who miss more than a few quizzes, with a median of 87.35 percent and mean of 81.89 percent:

The remainder of this blog post, and all of the next one, consists of quizzes that I have given in introductory statistics courses, along with comments on each.  I will present thirteen quizzes, six in this post and seven to come next week.  The order of topics will reveal that I introduce simulation-based inference for a proportion early in the course.


1. For parts (a) and (b), consider the research question of whether students at Cal Poly – San Luis Obispo are more likely to wear clothing that says “Cal Poly” than students at Cal Poly – Pomona.  Suppose that you were to collect data for a statistical study of this question. 

  • a) Identify the explanatory variable, and classify it as numerical or categorical.
  • b) Identify the response variable, and classify it as numerical or categorical.

For parts (c) – (e), consider the patients who went to the emergency room at the local hospital last week as the observational units in a statistical study.

  • c) Identify one categorical variable that could be recorded on these observational units.
  • d) Identify one numerical variable that could be recorded on these observational units.
  • e) State a research question that you could investigate about these observational units, using at least one of the variables that you gave in part (c) or (d).

As I said in post #11 (here), I ask students about observational units and variables in nearly every example that we discuss in the entire course.  Not surprisingly, an early quiz focuses on this topic.  Most students do fine with parts (a) and (b), although some express the response variable poorly by saying something like “wearing Cal Poly clothing” instead of “whether or not the student wears Cal Poly clothing.”  Parts (c)-(e) are more challenging, as they ask students to think of their own variables. Part (e) is especially difficult; I have found that it’s much easier for students to describe variables than to state a research question to be investigated with those variables.  A good answer to part (e) does not need to be complicated.  For example, a fine answer is: “How long is the average waiting time to see a medical professional after arriving at the emergency room?”


2. Recall that you took samples of words from the population of words in the Gettysburg Address, for which the average length of a word is 4.295 letters.  Parts (a)-(d) refer to this situation.

  • a) Is the number 4.295 a parameter or a statistic? 
  • b) When you first selected your sample of 10 (circled) words, what was the variable?  Was it categorical or numerical?
  • c) What aspect of the first graph on the board indicated that the sampling method was biased?
  • d) Would selecting words by closing your eyes and pointing at the page 10 times produce an unbiased sampling method?  Explain briefly.
  • e) In general, does taking a very large sample (say, of millions of people) produce an unbiased sampling method?   Explain briefly.

Based on the Gettysburg Address activity that I described in post #19 (here), this quiz assesses whether students have understood the activity that they worked through in class that day.  They can answer parts (a)-(d) directly from their notes, so this quiz should provide easy points for students.  Part (e) requires students to stop and think a bit.  I hope they’ll remember our class discussion of the infamous Literary Digest poll from 1936, which illustrates that a very large sample size does not guarantee an unbiased sampling method.  I don’t often give quizzes that are this straightforward and can be answered with little new thinking, but I believe this makes for a nice change of pace.  I also think it’s okay now and then for a quiz to reward students for being in class and paying attention.


3. Researchers investigated whether they could correctly predict the outcome of an election, more often than not, by selecting the candidate whose face is judged (by a majority of people interviewed) to be more competent-looking.  They applied this prediction method to 32 U.S. Senate races in 2004.  The “competent face” method correctly predicted the winner in 23 of the 32 races.

  • a) What are the observational units in this study, and what is the sample size?
  • b) Describe (in words) the null hypothesis to be tested.

Consider the following results of a simulation analysis with 10,000 repetitions, for testing whether the competent face method would correctly predict the winner in more than half of all races:

  • c) Describe how you would use the simulation results to approximate the p-value of the test.
  • d) The p-value turns out to be approximately 0.01.  Write a sentence interpreting this p-value in context (probability of what, assuming what?).
  • e) Do the sample data provide strong evidence in support of the “competent face” prediction method?  Justify your answer, based on the simulation analysis.

This quiz assesses students’ understanding of simulation-based inference as presented early in the course.  Students would have seen an example such as the one presented in post #12 (here) before taking this quiz.

The second question in part (a) is meant to help students answer the first question.  If they realize that the sample size is 32, they can stop and ask themselves: 32 of what?  This should lead them to recognize that the 32 Senate races are the observational units, not the people who were interviewed to determine which candidate’s face is judged to be more competent.

Part (c) requires students to specify how they would use the applet to determine the approximate p-value, without needing to give them access to the applet.
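For teachers who prefer code to an applet, here is a rough sketch of the same simulation in R (the applet’s three inputs map directly onto the arguments of rbinom):

    simcounts <- rbinom(10000, size = 32, prob = 0.5)  # 10,000 simulated sets of 32 races, assuming no predictive ability
    mean(simcounts >= 23)                              # approximate p-value, about 0.01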

Notice that part (d) gives a big hint about the two things (in addition to context) that students should include in their interpretation of a p-value: probability of obtaining 23 or more correct predictions in 32 races, assuming that the competent-face method would be correct for 50% of all races in the long run.  I do not give these hints later in the course, when I expect students to have a firmer grasp of interpreting a p-value, but I think such a hint is appropriate and helpful early on.


4. Suppose that a tire manufacturer believes that the lifetimes of its tires follow a normal distribution with mean 50,000 miles and standard deviation 5,000 miles.

  • a) Based on the empirical rule, 95% of tires last for between what two values?
  • b) How many standard deviations above the mean is a tire that lasts for 58,500 miles?
  • c) Determine the probability that a randomly selected tire lasts for more than 58,500 miles.
  • d) Determine the mileage for which only 25% of all tires last longer than that mileage.  Show how you arrive at your answer.
  • e) Suppose the manufacturer wants to issue a money-back guarantee for its tires that fail to achieve a certain number of miles.  If they want 99% of the tires to last for longer than the guaranteed number of miles, how many miles should they guarantee?  Show how you arrive at your answer.

Parts (a)-(d) are routine questions about the empirical rule, z-scores, and calculating probabilities and percentiles from normal distributions.  I think these provide good practice of the basics of what students are to have learned in class that day.  Students can use a table of standard normal probabilities, or a calculator with a normal probability function, or a software tool to answer parts (c) and (d).
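For instance, parts (c) and (d) can each be checked with one line of R:

    1 - pnorm(58500, mean = 50000, sd = 5000)   # part (c): about 0.045
    qnorm(0.75, mean = 50000, sd = 5000)        # part (d): about 53,372 miles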

Part (e) is not especially hard, but it does trip up a few students.  Can you guess a fairly common mistake that I find very disheartening?  The correct z-score is -2.326, so the correct answer for the lifetime to be guaranteed is 50,000 – 2.326×5000 ≈ 38,370 miles.  But some students look up the z-score for the 99th percentile and obtain positive 2.326, which produces an answer of 50,000 + 2.326×5000 ≈ 61,630 miles.  I always cringe at this response, because these students have not noticed that their calculation does not make sense: This answer would mean that the company would give 99% of customers their money back!  I also cringe because these students have neglected to follow my first piece of advice about calculations involving normal distributions: Start with a sketch!  If they had started with a sketch, they would have seen that only 1% of tire lifetimes exceed 61,630 miles, and 38,370 miles is the value for which 99% exceed that lifetime:
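A one-line check with R’s qnorm function confirms the correct answer:

    qnorm(0.01, mean = 50000, sd = 5000)   # about 38,370 miles, the 1st percentile of tire lifetimes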


5. Recall the formula for the z-test statistic when conducting a hypothesis test about a proportion:
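    z = (p-hat – pi_0) / sqrt[pi_0 × (1 – pi_0) / n]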

  • a) What does the symbol p-hat* represent?  (Be as specific as possible.)
  • b) What does the symbol pi_0* represent?  (Be as specific as possible.)
  • c) What does the symbol n represent?
  • d) For a given value of n, what happens to the absolute value of the test statistic as the difference between p-hat and pi_0 increases?
  • e) For a given value of n, what happens to the p-value as the difference between p-hat and pi_0 increases?

* Students see the symbols rather than the words p-hat and pi_0 here.

Parts (a)-(c) simply ask students to explain what three symbols represent.  I grade parts (a) and (b) very strictly.  Full credit for part (a) requires saying that the p-hat symbol represents a sample proportion.  A response to part (b) must mention a hypothesized value of the population proportion, with all three italicized words required for full credit.  I think these are worthwhile questions, because it’s crucial for students to recognize what these symbols mean in order to understand hypothesis testing about a proportion.

Students can think about parts (d) and (e) either conceptually or mathematically.  The z-statistic measures the difference between the sample proportion and the hypothesized value of the population proportion in terms of number of standard deviations.  A larger difference produces a larger absolute value of the test statistic.  A larger difference also provides stronger evidence against the null hypothesis and therefore a smaller p-value.


6. A Harris Poll that surveyed 2225 adult Americans on October 14-19, 2015 found that 29% reported having at least one tattoo.

  • a) Is 29% (.29) a parameter or a statistic?  What symbol do we use for it?
  • b) Determine (by hand) a 95% confidence interval for the relevant parameter.
  • c) Interpret this interval: You’re 95% confident that __________ is between ____ and ____ .
  • d) How would a 99% confidence interval differ (if at all) from the 95% one?  Comment on both the midpoint and width of the interval.  (Do not bother to calculate a 99% confidence interval.)
  • e) The same Harris Poll also found that 47% of respondents between the ages of 18-35 reported having at least one tattoo.  How would a 95% confidence interval for this age group, based on this survey, compare to the 95% confidence interval that you found in part (b)?  Comment on both the midpoint and width of the interval.

Parts (a)-(c) ask students to apply what they learned in class to a new study.  Notice that I avoid identifying the parameter for them in part (b), because I want students to describe the parameter clearly in their interpretation in part (c), not just repeat back what I’ve already said.  I do not always provide the structure for interpreting a confidence interval as in part (c), but I think this is helpful in a quiz setting.
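For reference, the by-hand interval in part (b) can be verified with a few lines of R:

    phat <- 0.29; n <- 2225
    se <- sqrt(phat * (1 - phat) / n)   # standard error of the sample proportion
    phat + c(-1, 1) * 1.96 * se         # approximately (0.271, 0.309)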

In part (d) most students realize that the confidence interval would become wider with a larger confidence level, and I also want them to note that the midpoint would stay at .29, the value of the sample proportion with a tattoo.  Part (e) makes students think, as we did not answer a question like this in class.  They need to recognize that the sample size would be smaller for the restricted group, so the confidence interval would become wider.  They also need to see that the sample proportion (with a tattoo) is larger for the subgroup, so the midpoint of this interval would be larger than for the original interval.


The word quiz can be very valuable when playing Scrabble, and I find group quizzes to be very valuable for my students’ learning.  I fulfill the promise in my syllabi to give lots of quizzes.  Most students respond admirably by engaging with each other to support their learning as they discuss and respond to the quiz questions.  In next week’s blog post I will provide seven more quizzes, again with five parts each, that I have used with my students.

P.S. The article about predicting elections can be found here.  A report on the Harris poll about tattoos can be found here.

P.P.S. I will provide a link to a Word file with these quiz questions, which teachers can modify as they’d like for use with their students, at the end of next week’s post.

#24 Random rendezvous, part 2

In last week’s post (here), I described one of my favorite probability problems: Two people plan to meet for lunch, their arrival times are independent and uniformly distributed over an hour, and they agree to wait fifteen minutes for the other to arrive.  I ask my students to determine the probability that they successfully meet, first using intuition, then simulation, and finally mathematics.

This post will consider a modification to this problem: Suppose that the arrival times follow normal (Gaussian) rather than uniform distributions.  Again I ask a series of questions that lead my students to tackle this extension with intuition, simulation, and mathematics.  Now the necessary mathematics will involve properties of random variables rather than geometry.  (As always, questions for students appear in italics.)


Let’s name this week’s lunch companions Michael and LeBron.  Just like Eponine and Cosette last week, they agree to wait 15 minutes for the other to arrive.  Suppose that Michael and LeBron’s arrival times at the restaurant follow independent normal distributions, each with mean 30 minutes (after noon) and standard deviation 10 minutes.  First I ask students to think about this scenario and make some predictions:

  • Do you think Michael and LeBron are more likely, less likely, or equally likely to successfully meet, as compared to Eponine and Cosette whose arrival times were uniform between noon and 1pm?
  • Make a guess for the probability that Michael and LeBron successfully meet. 
  • Draw a rough sketch of what you expect the joint distribution of their arrival times to look like.

Then I ask: How can we investigate the joint distribution of their arrival times and approximate the probability that they successfully meet?  I hope that the class erupts in response with a chorus of: Simulate!


Let’s begin the simulation analysis by asking: What lines of code do we need to change from last week’s simulation analysis?  I think it’s fair to ask this of students even if they have no coding experience themselves.  If they need a hint: What was the name of the probability distribution of the arrival times last week?  The only lines of code that need changing are the ones that simulate uniformly distributed arrival times (where N represents the number of repetitions); we simply replace them with code that simulates normally distributed arrival times:
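    # a sketch of the change; variable names are illustrative
    N <- 10000
    # last week, for uniform arrival times between noon and 1:00:
    # michael <- runif(N, min = 0, max = 60)
    # lebron  <- runif(N, min = 0, max = 60)
    # this week, for normal arrival times with mean 30 and SD 10 minutes:
    michael <- rnorm(N, mean = 30, sd = 10)
    lebron  <- rnorm(N, mean = 30, sd = 10)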

Other than this change, the rest of the program used for last week’s analysis (see the end of that post here) can be used without modification.  Before running the code, I ask: What do you expect the distributions of arrival times, difference in arrival times, and absolute difference in arrival times to look like?  Running the code for 10,000 repetitions produces results such as:

Describe what these graphs reveal.  As expected, the distributions of arrival times closely resemble normal distributions, centered around 30 minutes, with almost all the times between 0 and 60 minutes (after noon).  The distribution of differences in arrival times also appears to follow a normal distribution, with a mean close to 0 minutes and most values between -40 and 40 minutes.  The distribution of absolute differences is sharply skewed to the right, centered around 10 minutes or so.

The most informative graph is a scatterplot of arrival times, coded by whether a successful meeting occurs (green for yes, red for no).  What do you expect the scatterplot to look like?  How (if at all) will it differ from the scatterplot when we assumed uniform distributions for arrival times?  Here is a result from 10,000 repetitions:

How is this joint distribution different from the case with uniform arrival times?  These (joint) arrival times are not evenly spread out throughout a 60×60 square.  Instead they are concentrated around the point (30,30), and they become gradually less dense as they move away from that center point in any direction.

How does the region of arrival times for which they successfully meet compare to the uniform case?  The successful meetings, coded in green, occur within 15 minutes on either side of the y = x diagonal line.  It looks like more than half of the dots are green, signaling successful meetings, but it’s hard to tell from the graph.  We need to count how many of the 10,000 repetitions resulted in success.  For these 10,000 repetitions we obtain:

Use a margin-of-error to calculate a 95% confidence interval for the exact probability.  The margin-of-error is approximately 1/sqrt(10,000) = .01, so we can be about 95% confident that the exact probability of a successful meeting is within the interval (.7071, .7271).


Now suppose that Michael and LeBron want to increase their chance of successfully meeting to 90%.  Would they need to wait for more or less than 15 minutes?  Explain.  Nearly all students realize that they’ll need to increase their waiting time to more than 15 minutes in order to increase their probability of a successful meeting.

How can you use your simulation results to approximate how many minutes they need to wait to have a 90% chance of meeting?  This question is harder than a similar one in the previous post about a 50% chance of meeting.  In that case we simply calculated the median of the 10,000 simulated values for the absolute difference in arrival times.  But the same idea will work here: instead of the 50th percentile we need to determine the 90th percentile of the absolute differences.  We can use R to do this by sorting the vector of absolute differences and then picking out the 9000th value among the 10,000 simulated values in that sorted vector:
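    # continuing with the illustrative names from above
    absdiff <- abs(michael - lebron)   # absolute differences in arrival times
    sort(absdiff)[9000]                # the 9000th of the 10,000 sorted values
    quantile(absdiff, 0.90)            # a built-in shortcut for the same percentile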

We can approximate that Michael and LeBron should agree to wait for a bit more than 23 minutes in order to have a 90% chance of successfully meeting.


Now let’s turn to a mathematical solution.  We cannot use geometry as we did when the arrival times followed uniform distributions (here), because the (joint) arrival times are not equally likely to fall throughout the 60×60 square.  The key now is to use properties of random variables.

Let’s introduce some notation: Let the random variable X represent Michael’s arrival time and Y represent LeBron’s arrival time.  Express the probability of a successful meeting in terms of these random variables.  We can see from the labeled scatterplot of simulation results above that they successfully meet when their arrival times are within 15 minutes of each other, so we want Pr(|X – Y| < 15).

The bad news is that it’s not easy to determine the probability distribution of the absolute difference |X – Y|.  But the good news is that we can use probability rules to determine the distribution of the difference (X – Y).  Rewrite the probability of a successful meeting without the absolute value.  This is a challenging part for many students, who need to recall how to work with inequalities involving absolute values.  We can re-express Pr(|X – Y| < 15) as Pr(-15 < X – Y < 15).

Determine the name and parameter values of the probability distribution of the difference in arrival times (X – Y).  The first rule we need is that the sum or difference (or any linear combination) of normally distributed random variables also follows a normal distribution.  Then we can determine the mean, variance, and standard deviation of (X – Y) as follows:
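    E(X – Y) = E(X) – E(Y) = 30 – 30 = 0 minutes
    Var(X – Y) = Var(X) + Var(Y) = 10^2 + 10^2 = 200 (using independence)
    SD(X – Y) = sqrt(200) ≈ 14.142 minutes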

As a way to check their work, and as a good practice to model for students, I ask: Calculate the mean and SD of the 10,000 simulated differences.  Are they close to these theoretical values?  Students find that the simulation did produce reasonable approximations, helping to confirm that these theoretical calculations are correct.

Now we have a straightforward normal probability calculation.  We could calculate z-scores (they turn out to be z ≈ ± 1.06) and use a normal probability table, or we could use an applet (here, output shown below), or we could use R (output shown further below):
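    # the R version of the calculation
    pnorm(15, mean = 0, sd = sqrt(200)) - pnorm(-15, mean = 0, sd = sqrt(200))   # about 0.7112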

We see that Michael and LeBron have a 71.12% probability of meeting successfully.  Is the theoretical probability from the normal distribution within the margin-of-error of the approximate probability from the simulation analysis?  Yes, because 0.7112 is within the interval (.7071, .7271) that we calculated from the simulation.

Use the normal distribution to determine how long Michael and LeBron must agree to wait in order to have a 90% chance of meeting.  We need to calculate the 5th and 95th percentiles of the (normal) distribution of the difference (X – Y).  We could use an applet or R (see output below), or we could realize that the relevant z-scores are z = ±1.645.  The necessary waiting time is therefore 1.645 standard deviations above the mean: 0 + 1.645×14.142 ≈ 23.264 minutes.  Once again I ask: Compare this to the approximate value that we determined from simulation.  This is quite close to the approximate value of 23.132 minutes that we found with simulation.
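In R, this percentile can be computed directly with qnorm:

    qnorm(0.95, mean = 0, sd = sqrt(200))   # about 23.26 minutes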


A natural extension is to investigate the effect of changing the means and/or standard deviations of the arrival times.  For example: Suppose that Michael and LeBron could make their arrival times more consistent by reducing their standard deviations from 10 to 5 minutes.  How would this change the probability that they successfully meet?  First predict, then simulate, then use mathematics.

Intuition: Most students realize that with less variability (more consistency) in their arrival times, Michael and LeBron are more likely to meet than they were before.

Simulation: A simulation analysis with the smaller standard deviations produces a graph such as the one below.  The arrival times are much less spread out than before, with almost all arrival times between 15 and 45 minutes (after noon), so Michael and LeBron are much more likely to meet.

Mathematics: The standard deviation of the difference (X – Y) becomes sqrt(5^2+5^2) = sqrt(50) ≈ 7.071 minutes.  Then Pr(-15 < (X – Y) < 15) becomes ≈ 0.9661 (z-scores are ±2.12).  As expected, this probability is considerably larger than in the original scenario, because of the smaller standard deviations of the arrival times.
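A quick pnorm check in R agrees:

    pnorm(15, mean = 0, sd = sqrt(50)) - pnorm(-15, mean = 0, sd = sqrt(50))   # about 0.9661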


Ready for a more complicated extension? The calculations for this one go beyond the scope of most introductory statistics courses.  Nevertheless, the intuition here should make sense for introductory students.  How would you expect the probability of successfully meeting to change if Michael and LeBron’s arrival times are not independent but rather are positively correlated?  What if their arrival times are negatively correlated?  Here’s a hint: First think about what the scatterplot of arrival times would look like with a positive (or negative) correlation.

To make this more concrete, let’s go back to assuming that each person’s arrival time follows a normal distribution with mean 30 minutes (after noon) and standard deviation 10 minutes.  Now let’s assume that the arrival times of Michael and LeBron have a correlation coefficient of 0.7.  Later we’ll change the correlation coefficient to -0.7. Once again, let’s approach this problem with intuition, then simulation, and then mathematics.

Intuition: A positive correlation between Michael’s and LeBron’s arrival times means that one is more likely to arrive early when the other arrives early, and one is more likely to arrive late when the other arrives late.  Therefore, we expect less variability in the distribution of differences, and we expect the probability of successfully meeting to increase with a positive correlation.  On the other hand, a negative correlation means that one is more likely to arrive late when the other arrives early, and vice versa.  With a negative correlation, we expect a smaller probability of successfully meeting.

Simulation: Below is some R code for simulating arrival times from a bivariate normal distribution.  Below that are some results for 10,000 repetitions, with the positive correlation on the left and negative correlation on the right:
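    # one way to simulate correlated arrival times, using mvrnorm from the MASS package
    library(MASS)
    N <- 10000
    rho <- 0.7                            # change to -0.7 for the negative-correlation case
    mu <- c(30, 30)                       # mean arrival times
    Sigma <- matrix(c(100, rho * 100,
                      rho * 100, 100), nrow = 2)   # covariance matrix, with SDs of 10
    arrivals <- mvrnorm(N, mu = mu, Sigma = Sigma)
    michael <- arrivals[, 1]
    lebron  <- arrivals[, 2]
    mean(abs(michael - lebron) < 15)      # approximate probability of a successful meeting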

We see from the graph on the left that a positive correlation makes Michael and LeBron more likely to arrive at similar times, which increases their probability of successfully meeting (approximately 0.9486 from these 10,000 repetitions).  Similarly, we see from the graph on the right that a negative correlation makes Michael and LeBron much less likely to successfully meet (approximately 0.5859 from these 10,000 repetitions).

Mathematics: We know that the difference in arrival times (X – Y) follows a normal distribution with mean 30 – 30 = 0, as when the arrival times were independent.  The non-zero correlation changes the standard deviation of (X – Y), as shown in the following histograms of simulated differences in arrival times:

You might ask students: Which graph do you think goes with which value of the correlation coefficient (0, 0.7, -0.7)?  The answer is that the positive correlation produced the graph on the left, negative correlation for the one on the right, and zero correlation for the middle graph.  Why does this make sense?  As mentioned above, a positive correlation in arrival times produces less variability in differences, and a negative correlation produces more variability in differences.

Now let’s proceed to calculations, which go well beyond the scope of most introductory statistics courses.  The difference (X – Y) still follows a normal distribution with mean 0, but the standard deviation is different from the case of independent arrival times.  The key result is:
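    Var(X – Y) = Var(X) + Var(Y) – 2×Cov(X, Y), where Cov(X, Y) = correlation × SD(X) × SD(Y)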

With a correlation coefficient of 0.7, this produces Cov(X, Y) = 0.7×10×10 = 70 and Var(X – Y) = 10^2 + 10^2 -2(70) = 60, so SD(X – Y) = sqrt(60) ≈ 7.746 minutes.  As expected, this is a smaller SD than in the case of independent (correlation zero) arrival times.  We can then calculate Pr(-15 < (X – Y) < 15) ≈ 0.9472 (with z-scores ≈ ±1.94).  As expected, this probability is larger than before.  Note that this is quite close to the approximate probability from the simulation results, which is reassuring to see after such an involved calculation.

For the case of an equally strong but negative correlation between the two arrival times, we obtain SD(X – Y) = sqrt(340) ≈ 18.439, which is much larger than before.  This leads to Pr(-15 < (X – Y) < 15) ≈ 0.5841 (with z-scores ≈ ±0.81), which is a considerably smaller probability than before, quite consistent with the approximation from simulation.
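Both of these probabilities can be checked quickly with pnorm:

    pnorm(15, 0, sqrt(60)) - pnorm(-15, 0, sqrt(60))     # correlation 0.7: about 0.9472
    pnorm(15, 0, sqrt(340)) - pnorm(-15, 0, sqrt(340))   # correlation -0.7: about 0.5841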

These calculations are summarized in the table:
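    correlation     SD(X – Y)               Pr(-15 < X – Y < 15)
            0.7     sqrt(60)  ≈  7.746      ≈ 0.9472
            0.0     sqrt(200) ≈ 14.142      ≈ 0.7112
           -0.7     sqrt(340) ≈ 18.439      ≈ 0.5841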


This extension of the “random rendezvous” probability problem replaces uniform with normal distributions for the arrival times.  This change gives students a chance to practice calculations related to normal distributions and also to apply properties of random variables.  This extension also allows students to develop their intuition and to perform more simulation analyses.  Changing the means and/or standard deviations of arrival times offers more questions that require intuition, simulation, and mathematics.  Allowing the arrival times to be positively or negatively correlated adds another dimension to this problem, which can be tackled with intuition and simulation even if the mathematics might be beyond typical students of introductory statistics.

We could analyze even more extensions of this problem, such as using probability distributions other than uniform or normal.  The simulation analysis would involve minimal changes to the code we’ve already written.  The mathematical analysis would require calculating double integrals to determine the volume under a joint probability density function over the region in which the two people meet.

Oh, but look at the time!  I’m running late for a lunch engagement and would hate to miss meeting my friend, so I really must be going …

#23 Random rendezvous, part 1

This post describes one of my favorite examples for teaching probability*.  This activity makes use of three approaches to probability: intuition, simulation, and mathematics.  I especially like that the mathematics involved is not combinatorics or calculus, as is so often the case with probability problems, but rather geometry.  Because it involves coding and geometry as well as probability, I hope that this activity might be especially applicable in high school classrooms.  This post will also feature pretty pictures that help to develop and confirm intuition. As always, questions that I ask students appear in italics.

* I have a complicated relationship with probability. I greatly enjoy studying and teaching it, not only for applications to statistics but also other applications and for its own sake.  Nevertheless, I advocate teaching minimal probability content in Stat 101 courses, only what’s essential for understanding statistical concepts.  On the other hand, I also believe that understanding randomness, uncertainty, and probability is an important quantitative reasoning skill for all students to develop, including at the high school level.  I teach basic probability ideas in a statistical literacy course, more probability topics in an introductory course for engineering students, and an entire course on probability for students majoring in statistics, mathematics, and other quantitative fields.


Suppose that two people plan to meet for lunch at a certain restaurant.  (Let’s call them Eponine and Cosette, in memory of my first two cats*.)  They are both very busy professionals, so they cannot know for sure what time they will arrive.  Let’s assume that their arrival times are independent random variables with each uniformly distributed between 0 and 60, measured in minutes after noon.  Eponine and Cosette agree in advance that the first to arrive will only wait 15 minutes for the other to arrive.  The driving question is: How likely is it that they will successfully meet?

* You can read about my cats Eponine and Cosette, and see their photos, at the end of post #16, titled Questions about cats (here).

I start by asking students to use their intuition to make some predictions.  I want them to put some thought into, but not attempt to devise any solutions for, answering these questions: Do you think they are more likely to meet or not, or do you think it’s 50/50 for whether they meet or not?  Make a guess for the probability that they successfully meet.  In other words, if they were to repeat this random process for a very large number of days, on what percentage of days do you think they would successfully meet?


Next I engage students in a discussion about what steps we need to implement a simulation analysis:

  • Generate arrival times for both people.
  • Determine whether they successfully meet by calculating the absolute difference in arrival times, seeing whether that absolute difference is 15 minutes or less.
  • Repeat this a large number of times.
  • Calculate the proportion of repetitions in which they successfully meet, by counting how often they meet and dividing by the number of repetitions.

We also discuss what graphs would be informative:

  • Histograms of distributions of individual arrival times
  • Scatterplot of joint distribution of arrival times
    • Labeled scatterplot, color-coded according to whether or not they successfully meet
  • Histogram of distribution of difference in arrival times
  • Histogram of distribution of absolute difference in arrival times

Depending on your student audience and course goals, you might want students to write their own code to conduct this simulation analysis, or you might provide them with partial code and ask them to fill in the rest, or you might provide them with full code to run.

Here I will present code written in R*.  I like to use N for the number of repetitions, which the user specifies before running the code.  We can generate the random arrival times, with a vector of length N for each person, so we do not need to use a loop.  We can also produce graphs of the individual and joint distributions using:

* I confess at the outset that I am a novice programmer.
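    # a minimal sketch; variable names are my own
    N <- 10000                               # number of repetitions
    eponine <- runif(N, min = 0, max = 60)   # arrival times, in minutes after noon
    cosette <- runif(N, min = 0, max = 60)
    hist(eponine)                            # individual distributions of arrival times
    hist(cosette)
    plot(eponine, cosette)                   # joint distribution of arrival times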

Before running this code, I ask students: Predict what the graphs, both the histograms and the scatterplots, will look like.  Here are some results with 10,000 repetitions:

Describe what the histograms and scatterplots reveal.  There’s not a lot to see here, but the histograms do confirm our expectations about uniform distributions.  Also as expected, the scatterplot reveals random scatter throughout the 60×60 square of possible (joint) arrival times.

How can we calculate vectors of differences and absolute differences in arrival times?  What do you expect histograms of these distributions to look like?  The following code calculates these vectors and produces these graphs, and results are shown for 10,000 repetitions:
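    diffs <- eponine - cosette    # differences in arrival times
    absdiff <- abs(diffs)         # absolute differences
    hist(diffs)
    hist(absdiff)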

Describe what these graphs reveal.  The distribution of differences in the graph on the left is quite symmetric about the value zero.  This makes sense because the two people have the same distribution of arrival times.  The graph of absolute differences on the right results from folding the left half of the differences graph across zero.  The distribution of absolute differences is certainly not symmetric: smaller values for the absolute difference are more likely than larger values.

Now we’re ready to calculate an approximate probability to answer the question we started with.  We just need to count how many of the repetitions produced an absolute difference in arrival times of 15 minutes or less.  We can do this by creating a vector of true/false values, with TRUE meaning that they successfully met and FALSE meaning that they did not.  We can do this with one line of code as follows, where the user needs to enter wait, the number of minutes that they agree to wait, before running the code:
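    wait <- 15                    # the user enters the agreed waiting time first
    meet <- (absdiff <= wait)     # the one line: TRUE when they successfully meet, FALSE when they do not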

Next I like to reproduce the scatterplot above, making it more informative (and prettier!) by using different colors for whether the people successfully meet or not.  Let’s use green for repetitions in which Eponine and Cosette successfully met and red for those in which they arrived so far apart that they did not meet.  First I ask students: What do you expect the colored scatterplot to look like?  In other words, where in the 60×60 square do you expect to see the green dots (where they successfully meet), and where do you expect to see the red dots (where they do not meet)?  This question can be challenging, so I offer a hint: Where in the graph do they arrive at precisely the same time, in which case they certainly meet?  Then I follow up with another hint: How far from that line can they arrive and still successfully meet?  Here is some code and resulting graph for 10,000 repetitions:
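    plot(eponine, cosette, col = ifelse(meet, "green", "red"))
    abline(a = 15, b = 1)     # boundary lines, where arrival times differ by exactly 15 minutes
    abline(a = -15, b = 1)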

I think this is a beautiful graph*.  It shows that Eponine and Cosette successfully meet if their joint arrival times are within 15 minutes above or below the y = x diagonal line.  Next I ask: Based on this graph, make a guess for the probability that they successfully meet.

* I always tell my students that this graph would look great on a t-shirt, but so far none have taken the hint to produce such a t-shirt for me.

Now we are ready to calculate this probability (approximately) from the simulation results.  We need to count how many repetitions result in a successful meeting.  We can accomplish this by summing the vector of TRUE/FALSE values, because R treats TRUE as 1 and FALSE as 0.  Then we divide by the number of repetitions to calculate the approximate probability.  Here is the code and sample output for 10,000 repetitions:
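    sum(meet) / N     # one run of 10,000 repetitions produced 0.4394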

Calculate the margin-of-error associated with this approximate probability.  Use the margin-of-error to determine a confidence interval for the actual long-run probability.  In the beginning of a probability course, I give students the short-hand formula 1/sqrt(N) for the margin-of-error of an approximate probability based on a simulation analysis with N repetitions.  With 10,000 repetitions, this produces a margin-of-error of .01, so we can be confident that the actual long-run probability is within the interval 0.4394 ± 0.0100, which is the interval (0.4294, 0.4494).  Eponine and Cosette have a slightly less than 50% chance of successfully meeting for lunch.


Eponine and Cosette might be disheartened to learn that they are more likely not to meet than to meet.  Suppose that they want to change their waiting time to produce a 50% chance of meeting.  Again I ask students to start with intuition: Do they need to increase or decrease their waiting time to achieve this 50% goal?  Make a guess for how long they need to wait to have a 50% chance of successfully meeting.

Then I ask students: How can we use the simulation results to approximate how long they need to wait to produce a 50% chance of successfully meeting?  This can be challenging, so I have hints ready: Which of the vectors that we generated is most relevant to this question?  What aspect of that vector will approximate a 50% probability?  Students eventually recognize that they can approximate the necessary waiting time with the median of the absolute differences in arrival times.  Here’s the code and some sample output:
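    median(absdiff)   # sample output: a value a bit larger than 15 minutes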

Because waiting 15 minutes produced a slightly smaller than 50% chance of a successful meeting, it makes sense that the approximate wait time for a 50% chance is a bit larger than 15 minutes.


Before we move on from simulation to a mathematical analysis, I return to a 15-minute wait time and ask: How can we improve the approximate probability?  Students are quick to respond that using more repetitions should improve the approximation.  Simulating one million repetitions produced an approximate probability of 0.437847.  The margin-of-error is 1/sqrt(1,000,000) = 0.001, so a 95% confidence interval for the exact probability of a successful meeting is 0.437847 ± 0.001, which is the interval (0.436847, 0.438847).  The approximate wait time needed for a 50% chance of meeting turned out to be 17.60137 minutes.

With one million repetitions, the colored scatterplot of arrival times looks like:

There are so many green and red dots in this graph that you cannot see the individual dots, but you can see a very clear image of the (green) region for which a successful meeting occurs.  This region provides the key to using geometry to calculate the exact probability.


Because the arrival distributions are independent and uniform, we can determine the exact probability of a successful meeting by calculating the area of the region in which they meet as a fraction of the total area of the region of possible (joint) arrival times.  In other words, we need to calculate the probability that a point selected at random from the 60×60 square falls within the green region rather than one of the red regions.

Determine the area of the overall square.  I often advise students that the denominator is typically easier to calculate with such probability questions.  The area of the 60×60 square of possible (joint) arrival times is 60×60 = 3600 (in units of minutes squared). 

Determine the area of the green region where they successfully meet.  When students do not think of a shortcut for themselves, I offer a hint that makes this much easier: First determine the area of the red regions where they do not meet.  The two red triangles have the same area, because each has a base of 45 minutes and a height of 45 minutes.  The area of each triangle is therefore 45×45/2 = 1012.5 (again in units of minutes squared), so the combined area of the two red triangles is 45×45 = 2025.  The area of the green region is therefore 3600 – 2025 = 1575.

Use the areas to calculate the (exact, theoretical) probability that Eponine and Cosette successfully meet.  This probability is 1575/3600 = 7/16 = 0.4375.

Are the two approximate probabilities from the simulation analyses above within the margin-of-error of the exact probability?  Yes, the simulation with 10,000 repetitions produced an interval estimate of .4394 ± .0100, which is the interval (.4294, .4494).  The simulation with 1,000,000 repetitions produced an interval estimate of .437847 ± .001, which is the interval (.436847, .438847).  Both of these intervals include the exact probability of .4375.


Now let’s use geometry and algebra to determine how many minutes they must agree to wait, in order to have a 50% chance of successfully meeting.  First I ask students: Express the probability of a successful meeting as a function of the number of minutes that each person agrees to wait.  I suggest that students use Pr(S) to denote the probability of a successful meeting, and let m represent the number of minutes that each person agrees to wait.  Again I have a hint ready: Start with a sketch of the 60×60 square.  Then sketch the region where they meet, much like before, but using m rather than 15 as the number of minutes that they wait.  Here’s a sketch:

Each triangle now has base (60 – m) and height (60 – m), so its area is (60-m)×(60-m)/2.  The combined area of the two triangles is therefore (60-m)×(60-m).  The area of the non-triangular region where they successfully meet is then 3600 – (60-m)×(60-m).  The probability of a successful meeting can be expressed as:
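    Pr(S) = [3600 – (60-m)×(60-m)] / 3600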

Graph this probability as a function of m, for values of m from 0 to 60 minutes.  Some R code and output for this graph are:
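    m <- 0:60
    prS <- (3600 - (60 - m)^2) / 3600
    plot(m, prS, type = "l", xlab = "minutes they agree to wait", ylab = "Pr(successful meeting)")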

Describe the behavior of this function.  Students should comment that this function is increasing, which certainly makes sense because waiting longer increases the probability of meeting.  They should also observe that the rate of increase diminishes as the number of minutes increases.  I also ask students to check that the probability for a 15-minute wait looks to be consistent with what we calculated earlier: .4375.

Now we are ready to do the algebra to solve for how long to wait in order to have a 50% chance of meeting.  Setting Pr(S) = 0.5 produces (60-m)×(60-m) = 1800.  Taking the square root of both sides gives 60 – m = sqrt(1800) ≈ 42.4264, so m = 60 – sqrt(1800) ≈ 17.5736 minutes.  This value is very consistent with our approximate values from the simulation analyses above.


This activity can lead students to use intuition, simulation, and mathematics to solve probability problems.  Unlike many probability problems that rely on combinatorics or calculus, geometry provides the solution to this one.  Depending on your student audience and course goals, you could also ask students to do some coding themselves with this activity. You might also use this as an assignment rather than an in-class activity.

More extensions to this random rendezvous naturally present themselves.  For example, we need not have assumed that Eponine’s and Cosette’s arrival times were uniformly distributed.  How would the probability of a successful meeting change if their arrival times followed independent normal distributions?  That depends on the means and standard deviations, which could be the same for both people or not.  What if those normal distributions were not independent?  We’ll consider those extensions in next week’s post, in which we will again use intuition, simulation, and mathematics to tackle probability questions.


P.S. This activity was inspired by problem #26, called The Hurried Duelers, in Frederick Mosteller’s wonderful book Fifty Challenging Problems in Probability.

P.P.S. The link below contains a file with the R code used above.

#22 Four more exam questions

In last week’s post (here), I presented twenty multiple choice questions, all conceptual in nature, none based on real data.  This week I present four free-response questions that I have used on exams, all based on real data from genuine studies.  These questions assess students’ abilities to draw and justify appropriate conclusions.  Topics covered include confounding, biased sampling, simulation-based inference, statistical inference for comparing two groups, and cause-and-effect conclusions.  All of these questions have multiple parts*.  I also provide comments on the goal of each question and common student errors. I do not intend these four questions to comprise a complete exam.  As always, questions for students appear in italics.

* I could have titled this post “Twenty more exam questions” if I had counted each part separately.


1. (6 pts) Researchers found that people who used candy cigarettes as children were more likely to become smokers as adults, compared to people who did not use candy cigarettes as children.

  • (a) (1 pt) Identify the explanatory variable.
  • (b) (1 pt) Identify the response variable.
  • (c) (4 pts) When hearing about this study, a colleague of mine said: “But isn’t the smoking status of the person’s parents a confounding variable here?”  Describe what it means for smoking status to be a confounding variable that provides an alternative to drawing a cause/effect explanation in this context.

Describing what confounding means can be very challenging for students.  The key is to suggest a connection between the confounding variable and both the explanatory and response variables.  I’ve tried to make this task as straightforward as possible here.  Students do not need to suggest a confounding variable themselves, and the context does not require specialized knowledge to explain the confounding.

Parts (a) and (b) are meant to be helpful by directing students to think about the explanatory and response variables in this study (and also offering an opportunity to earn two relatively easy points).  The explanatory variable is whether or not the person used candy cigarettes as a child, and the response variable is whether or not the person became a smoker as an adult.

To earn full credit for part (c), students need to say something like:

  • Parents who smoke are more likely to allow their children to use candy cigarettes than parents who do not smoke.
  • Children of parents who smoke are more likely to become smokers as adults than children of parents who do not smoke.

It would be nice for students to add that these two connections would result in a higher proportion of smokers among those who used candy cigarettes as children than among those who did not use candy cigarettes, but I do not require such a statement.

Many students earn partial credit by giving only one of the two connections.  Such a response fails to explain confounding fully and falls short of providing an alternative explanation for the observed association.  Another common error is that some students focus on conjectured explanations, such as proposing only that children of smokers want to emulate their parents by using candy cigarettes, or that a genetic predisposition leads children of smokers to become smokers themselves.  Both of these explanations come up short because they only address one of the two connections.

I sometimes make this question a bit easier by providing one of the connections for students: My colleague also pointed out that children of smokers are more likely to become smokers as adults than children of non-smokers.  What else does the colleague need to say to complete the explanation of how parents’ smoking status is a confounding variable in this study?  At other times I make this question harder by asking students to propose a potential confounding variable and also explain how the confounding could provide an alternative to a cause-and-effect explanation.


2. (8 pts) The news website CNN.com has posted poll questions that people who view the website can respond to.  The following results were posted on January 10, 2012:

The margin-of-error, for 95% confidence, associated with this poll can be calculated to be ± .003, or ± 0.3%.

  • a) (1 pt) Are the percentages reported here (62%, 25%, 13%) parameters or statistics?  Explain briefly.
  • b) (1 pt) Explain (using no more than ten words) why the margin-of-error is so small.
  • c) (3 pts) Would you be very confident that between 61.7% and 62.3% of all employed Americans surf the Web often while on the job?  Circle YES or NO.  Also explain your answer.

Part (a) provides an easy point for students to earn by responding that these are statistics, because they are based on the sample of people who responded to the poll.  Part (b) is also fairly easy; an ideal answer has only four words: very large sample size.  I do not require that students report the sample size of 111,938.  They can omit the word “very” and still earn full credit.

Part (c) is the key question.  I want students to recognize that this poll relies on a very biased sampling method.  Any online poll like this is prone to sampling bias, but the topic of this poll question especially invites bias.  Only by surfing the web can a person see this poll question, so the sampling method favors those who surf the web often while at work.  Because of this biased sampling method, students should not be the least bit confident that the population proportion is within the margin-of-error of the sample result.

I’ve learned to require students to circle YES or NO along with their explanation.  Otherwise, several students try to have it both ways with a vague answer that tries to cover all possibilities, such as: I would be very confident of this, but I would also be cautious not to conclude anything too conclusively.

I used to present this poll result graphic to students and then ask specifically about sampling bias.  But I changed to the above version, as I decided that it’s important for students to be able to spot sampling bias without being prompted to look for it.


3. (12 pts)  Researchers presented young children with a choice between two toy characters who were offering stickers.  One character was described as mean, and the other was described as nice.  The mean character offered two stickers, and the nice character offered one sticker.  Researchers wanted to investigate whether children would tend to select the nice character over the mean character, despite receiving fewer stickers.  They found that 16 of the 20 children in the study selected the nice character.

  • a) (2 pts) Describe (in words) the null hypothesis in this study.
  • b) (3 pts) Suppose that you were to conduct a simulation analysis of this study to investigate whether the observed result provides strong evidence that children genuinely prefer the nice character with one sticker over the mean one with two stickers.  Indicate what you would enter for the following three inputs: i) Probability of success, ii) Sample size, iii) Number of samples.
  • c) (1 pt) One of the following graphs was produced from a correct simulation analysis.  The other two were produced from incorrect simulation analyses.  Circle the correct one.
  • d) (1 pt) Based on the correct graph, which of the following is closest to the p-value of this test: 5.000, 0.500, 0.050, 0.005?  (Circle your answer.)
  • e) (2 pts) Write an interpretation of the p-value in the context of this study.
  • f) (3 pts) Summarize your conclusion from this research study and simulation analysis.

I am often asked about how to assess students’ knowledge of simulation-based inference* without using technology during the exam.  This question shows one strategy for achieving this.  Students need to specify the input values that they would use for the simulation, pick out what the simulation results would look like, estimate the p-value from the simulation results, and summarize an appropriate conclusion.

* See post #12 (here) for an introduction to simulation-based inference.

For part (a), I am looking for students to say that the null hypothesis is that children have no preference for either character.  At this point I am not asking for students to express this hypothesis in terms of a parameter.  It’s fine for them to state that children are equally likely to select either character, or that children select a character at random.

Correct responses for part (b) are to use 0.5 for the probability of success, 20 for the sample size, and a large number such as 1000 or 10,000 for the number of samples.  Some students enter 0.8 for the probability of success, based on the sample proportion of successes.  A few students enter 20 for the number of samples.

Part (c) requires some thought, because my students have not seen such a question before.  Some mistakenly think that the simulation results should be centered at the observed value, so they incorrectly select the graph on the left.  The simulation results should be centered on what’s expected under the null hypothesis, as in graphs in the middle and on the right.  Most students realize that they’ve never seen a simulation result look like the nearly-uniform distribution in the middle graph.  Most recognize that they have frequently seen simulation results that look like the bell-shaped graph on the right, so they correctly select it.

To answer part (d) correctly, students need to be looking at the correct graph.  For the graph on the right, very few of the repetitions produced 16 or more successes in 20 trials, so the p-value is very small.  The smallest p-value among the options, .005, is the correct answer.

Many students struggle somewhat with part (e).  One of the things that I like about the simulation-based approach to statistical inference is that I think it makes the interpretation of p-value as clear as possible.  Students do not need to memorize an interpretation; they just need to describe what they see in the graph and remember the assumption behind the simulation analysis: If children had no preference between the characters, then only about 5 in 1000 (.005) repetitions would produce 16 or more successes.  Many students get the second part of this interpretation correct but forget to mention the “if there were no preference” assumption; such a response earns partial credit.  Sometimes I make this part of the question easier by giving a parenthetical hint: probability of what, assuming what?

Part (f), which is much more open-ended than previous parts, asks students to draw an appropriate conclusion.  This study provides very strong evidence that children genuinely prefer the nice character over the mean character despite receiving fewer stickers from the nice character.  This conclusion follows from the very small p-value, which establishes that it would be very surprising for 16 or more of 20 children to select the nice character, if in fact children had no preference for either character.


4. (16 pts) The Gallup organization released a report on October 20, 2014 that studied the daily lives and well-being of a random sample of American adults.  The report compared survey responses between adults with children under age 18 living in the home and those without such children living in the home.  The following table was provided in the report:

  • a) (2 pts) Does this study involve random sampling, random assignment, both, or neither?  Explain briefly.
  • b) (2 pts) State the appropriate null and alternative hypotheses (using appropriate symbols) for testing whether the two populations of adults differ with regard to the proportion who smiled or laughed a lot on the previous day.
  • c) (2 pts) The value of the test statistic turns out to be z = 18.5.  Write a sentence interpreting the value of this z-test statistic.  (This is not asking for a test decision or conclusion based on the z-test statistic.)
  • d) (2 pts) Would you reject the null hypothesis at the .01 significance level?  Explain how your answer follows from the value of the z-test statistic.
  • e) (2 pts) A 99% confidence interval based on the sample data turns out to be (.039, .051).  Interpret what this interval says in this context.
  • f) (2 pts) Is this confidence interval consistent with your test decision (from part d)?  Explain how you know.
  • g) (2 pts) Give a very brief explanation for why this confidence interval is very narrow.
  • h) (2 pts) Suppose that someone reads about this study and says that having children in the household causes a very large increase in the likelihood of smiling or laughing a lot.  Would you agree with this conclusion?  Explain why or why not.

Presenting the sample statistics in the form of this table is a bit unconventional.  This is certainly not the 2×2 table of counts that students are accustomed to seeing.  This can confuse some students at first, but I think it’s worthwhile for students to see and grapple with information presented in multiple ways.

Part (a) revisits the theme of posts #19 and #20, titled Lincoln and Mandela (here and here), about the distinction between random sampling and random assignment.  Students should note that the question states that the sample was selected randomly.  But the Gallup organization certainly did not perform random assignment, because it would not be sensible or practical to randomly assign which people have children in their household and which do not.

To answer part (b) correctly, students need to realize that the test requires comparing proportions between two groups.  The null hypothesis is that American adults with children in their household have the same proportion who smiled or laughed a lot on the previous day as those without children in their household.  This null hypothesis can be expressed in symbols* as H0: π1 = π2, tested against the two-sided alternative Ha: π1 ≠ π2, where π1 and π2 denote the population proportions who smiled or laughed a lot among adults with and without children in the household, respectively.

* Recall from post #13, titled A question of trust (here), that I like to use Greek letters for all parameter symbols, so I use π for a population proportion.

I could have asked students to calculate the z-test statistic, but part (c) provides this value and asks for an interpretation.  I try to ward off a common error by cautioning students not to provide a test decision or conclusion.  But many students do not know what interpreting the z-score means, even though we’ve done that often in class*.  I want students to respond that the sample proportions (who smiled or laughed a lot on the previous day) in the two groups (those with/without children under age 18 in the household) are 18.5 standard deviations (or standard errors) apart.  This is a huge difference.  Students do not need to comment on the huge-ness until the next part, though.  Despite my caution, many students draw a conclusion from the z-score here rather than interpret it.  This could be because they do not read carefully enough, or it could well be that they do not understand what interpreting a z-score entails.

* See post #8, titled End of alphabet (here), for more thoughts and examples about z-scores.

For part (d), students should note that because the z-score of 18.5 is enormous, the p-value will be incredibly small, very close to zero.  The tiny p-value leads to an emphatic rejection of the null hypothesis.  Notice that I do not ask for an interpretation of this test decision in context here, only because parts (c) and (e) ask for interpretations.
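I don't expect students to compute the p-value here, but for the curious, a quick calculation shows just how close to zero it is.  This sketch assumes a two-sided z-test:

```python
from scipy.stats import norm

z = 18.5
p_value = 2 * norm.sf(z)  # sf gives the upper-tail probability, 1 - cdf
print(p_value)            # on the order of 10^-76, effectively zero
```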

Students need to realize that the confidence interval presented in part (e) estimates the difference in population proportions.  I think this is fair to expect, in part because that’s the conventional confidence interval to produce when comparing proportions between two groups, and also because the reported difference in sample proportions between the groups (.045) is the midpoint of the interval.  We can be 99% confident that the proportion who smiled or laughed a lot on the previous day is between .039 and .051 higher (in other words, between 3.9 and 5.1 percentage points higher) among American adults with a child under age 18 in the household than among those without.  Some students interpret this interval only as a difference without specifying direction (that those with a child are more likely to have smiled or laughed a lot).  Such a response is only worth partial credit, because they’re leaving out an important element by not specifying which group has a higher proportion who smile or laugh a lot.

Part (f) is intended to be straightforward.  Students should have rejected the null hypothesis that the population proportions are the same in the two groups.  They should also notice that the confidence interval, containing only positive values, does not include zero as a plausible value for the difference in population proportions.  These two procedures therefore give consistent results*.

* I hope that some students will remember the cat households example from post #16, titled Questions about cats (here), when they read this part.  If they do, this recollection might also help with part (h) coming up.

Part (g) asks students to recognize that the very large sample size produces a very narrow confidence interval.  This is the same issue that I asked about in part (b) of question #2 about the CNN.com poll*.

* It’s certainly possible that I over-emphasize this point with my students.

I must admit that I really like part (h).  The previous seven parts have been leading up to this part, which asks about the scope and type of conclusion students can draw from this survey.  Notice that I use bold font for both causes and very large increase.  This is a big hint that I want students to comment on both aspects.  Most students correctly note that this is an observational study and not a randomized experiment, so a cause-and-effect conclusion (between having children in the household and being more likely to smile or laugh) is not justified.  Relatively few students go on to address whether the difference between the groups is very large.  I hope that they’ll look at the two sample proportions, and also at the confidence interval for the difference in population proportions, and then conclude that 3.9 to 5.1 percentage points does not indicate a very large difference between the two groups.


I hope these four exam questions, which aim to assess students’ abilities to draw and justify conclusions, provide a nice complement to last week’s multiple choice questions (here).  See below for a link to a Word file containing these questions.

P.S. I thank my Cal Poly colleague Kevin Ross for introducing me to the Gallup poll and some good questions to ask about it.  Kevin and his wife Amy have five children under age 18 in their household.  I suspect that Kevin and his wife smile and laugh quite often.

P.P.S. The journal article on candy cigarette use can be found here.  The article on children’s choices of toy characters can be found here; this is a follow-up study to a more well-known one that I often use in class, described here.  A report on the Gallup survey about smiling and laughing can be found here.

P.P.P.S. Follow the link below for a Word file containing these four questions, and feel free to use or revise them for use with your own students.

#21 Twenty final exam questions

My mantra of “ask good questions” applies to exams as well as in-class learning activities.  This week I present and discuss twenty multiple-choice questions that I have used on final exams.  All of these questions are conceptual in nature.  They require no calculations, they do not refer to actual studies, and they do not make use of real data.  I certainly do not intend these questions to comprise a complete exam; I strongly recommend asking many free-response questions based on real data and genuine studies as well.

At the end of this post I provide a link to a file containing these twenty questions, in case that facilitates using them with your students.  Correct answers are discussed throughout and also reported at the end.


I like to think that this question assesses some basic level of understanding, but frankly I’m not sure.  Do students ever say that a standard deviation and a p-value can sometimes be negative?  Not often, but yes.  Do I question my career choice when I read those responses?  Not often, but yes.


I think it’s valuable to ask students to apply what they’ve learned to a new situation or a new statistic.  This question is not nearly as good for this goal as my favorite question (see post #2 here), but I think this assesses something worthwhile.  The questions about resistance are fairly straightforward.  The mid-hinge is resistant because it relies only on quartiles, but the mid-range is very non-resistant because it depends completely on the most extreme values.  Both of these statistics are measures of center.  This is challenging for many students, perhaps because they have seen that the difference between the maximum and minimum, and the difference between the quartiles, are measures of variability.  One way to convince students of this is to point out that adding a constant to every value in the dataset (in other words, shifting all of the data values by the same amount) would cause the mid-hinge and mid-range to increase (or shift) by exactly that constant.
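Here is a quick numerical demonstration of that shifting argument, using a small hypothetical dataset (software packages differ slightly in their quartile conventions, but the conclusion is the same):

```python
import numpy as np

data = np.array([1, 3, 5, 7, 9, 11, 13])  # hypothetical data values

def mid_hinge(x):
    q1, q3 = np.percentile(x, [25, 75])
    return (q1 + q3) / 2

def mid_range(x):
    return (x.min() + x.max()) / 2

# Adding a constant to every value shifts both statistics by exactly that
# constant, which is the hallmark of a measure of center.
print(mid_hinge(data), mid_hinge(data + 100))  # both increase by 100
print(mid_range(data), mid_range(data + 100))  # both increase by 100
```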


This question should be very easy for all students, but some struggle.  The question boils down to: If the sum of values equals zero, does the mean have to equal zero, and does the median have to equal zero?  The answer is yes to the first, because the mean is calculated as the sum divided by the number of values.  But the answer is no to the second, as seen in this counterexample where the mean is 0 but the median is not: -20, 5, 15.  The fact that this question is stated about residuals is completely irrelevant to answering the question, but the mention of residuals leads some students to think in unhelpful directions.

I sometimes ask an open-ended version of this question where I ask students to provide a counter-example if their answer is no.


This question has been extremely challenging for my students.  I used to ask it without providing options, and the most common response was “the same.”  That’s right: Many students did not realize that they should provide a number when asked for the value of a correlation coefficient.  Among these options, it’s very discouraging when a student selects -5, apparently not knowing that a correlation coefficient needs to be between -1 and +1 (inclusive), but this answer is tempting to some students because of the “5 points lower” wording in the question.  Another commonly selected wrong answer is -1.  I think students who answer -1 realize that the data would fall on a perfectly straight line, so the correlation coefficient must be -1 or +1, but the “lower” language fools them into thinking that the association is negative.

I sometimes offer a hint, advising students to start by drawing a sketch of some hypothetical data that satisfy the description.  I have also started to ask and discuss this question in class when we first study correlation, and then include the exact same question on the final exam.  This has improved students’ performance, but many still struggle.
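Sketching hypothetical data also lends itself to a quick computational check.  This snippet assumes the scenario in the question, namely that every student scores exactly 5 points lower on the final exam than on the midterm:

```python
import numpy as np

midterm = np.array([72.0, 85, 90, 64, 78])  # hypothetical midterm scores
final = midterm - 5                         # everyone scores 5 points lower

# The points fall on a perfectly straight, upward-sloping line,
# so the correlation coefficient is +1, not -1.
print(np.corrcoef(midterm, final)[0, 1])    # 1.0
```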


Most students correctly identify (a) and (d) as categorical variables and (c) as a numerical variable.  The most challenging parts are (b) and (e), which are not variables for these observational units.  I try to emphasize that variables are things that can be recorded for each observational unit, not an overall question or measure that pertains to the entire dataset.


I started asking this question after I noticed that some of my students believe that conducting a randomized experiment always justifies drawing a cause-and-effect conclusion, regardless of how the data turn out!  The good news is that very few students give answer A.  The bad news is that more than a few give answer C.


Some students take the “correlation does not imply causation” maxim to an inappropriate higher level by believing that “correlation implies no causation.”  Of course, I want them to know that a strong correlation does not establish a cause-and-effect relationship but also does not preclude that possibility.


In my courses for mathematically inclined students, I often ask this question as a calculation to be performed.  To calculate the correct percentage, note that Brad will get 70% right because he knows the answer, and he’ll guess correctly on 1/3 of the other 30%.  So, his long-run percentage correct will be 70% + (1/3)(30%) = 80%.

When I ask for this calculation, I’ve been surprised by students giving an answer less than 70%.  I understand that mistakes happen, of course, or that a student would not know how to solve this, but I can’t understand why they wouldn’t realize immediately that the answer has to be larger than 70%.  I decided to ask this multiple-choice version of the question, which does not require a numerical answer or any calculation.  I’m still surprised that a few students get this wrong.
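A short simulation also confirms the long-run percentage.  This sketch assumes three answer options per question, consistent with the 1/3 guessing probability used above:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_questions = 100_000

knows = rng.random(n_questions) < 0.70         # Brad knows 70% of answers
guesses_right = rng.random(n_questions) < 1/3  # otherwise he guesses among 3 options
correct = knows | (~knows & guesses_right)

print(correct.mean())  # close to 0.70 + (1/3)(0.30) = 0.80
```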


This is essentially the same question as I asked in post #16 (here) about whether the percentage of American households with a pet dog plus the percentage with a pet cat equals the percentage with either a pet dog or a pet cat.  Adding these percentages is not legitimate because the events are not mutually exclusive: It’s possible that it could rain on both Saturday and Sunday.  I hope that choosing 70% and 30% as the percentages is helpful to students, who might be tipped off by the 100% value that something must be wrong because rain cannot be certain.

It might be interesting to ask this question with percentages of 70% and 40%, and also with percentages of 60% and 30%.  I hope that the version with 70% and 40% would be easier, because all students should recognize that there could not be a 110% chance of rain.  I suspect that the version with 60% and 30% would be harder, because it might be more tempting to see 90% as a reasonable chance.
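For instructors who want a concrete calculation to accompany this discussion, here is a sketch; the independence assumption is purely for illustration, since the question says nothing about how the two days relate:

```python
p_sat, p_sun = 0.70, 0.30

# Simply adding the probabilities double-counts weekends with rain on both
# days.  If the two days were independent, the chance of rain on at least
# one day would be:
print(p_sat + p_sun - p_sat * p_sun)  # 0.79, not 1.00
```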


The main point here is that you cannot just take the average of 80% and 40%, because the group sizes are not the same.  Because there are many more students than faculty, the overall percentage will be much closer to the student percentage of 80%, so the correct answer is that the overall percentage would be more than 60%.
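A back-of-the-envelope calculation makes this vivid.  The group sizes below are hypothetical, chosen only to reflect that students greatly outnumber faculty:

```python
n_students, n_faculty = 20_000, 1_000  # hypothetical group sizes
p_students, p_faculty = 0.80, 0.40

# The overall percentage is a weighted average of the two group percentages.
overall = (n_students * p_students + n_faculty * p_faculty) / (n_students + n_faculty)
print(overall)  # about 0.781, much closer to 80% than to the 60% midpoint
```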


The goal here is to assess whether students realize that a probability such as 0.5 refers to a long-run proportion and does not necessarily hold in the short-run.  A sample size of two children definitely falls into the short-run and not long-run category, so it’s not guaranteed or even very likely to have one child of each sex.

A student does not need to enumerate the sample space and calculate the exact probability to answer this question correctly.  The sample space of four equally likely outcomes is {B1B2, B1G2, G1B2, G1G2}, so the probability of having one child of each sex is indeed 2/4 = 0.5.  But a student only needs to realize that this event is neither very likely nor very unlikely in order to answer correctly.  In fact, even if a student has the misconception that the three outcomes {2 boys, 2 girls, 1 of each} are equally likely, so they think the probability is 1/3, they should still give the correct answer of C.


Students expect to perform normal distribution calculations after they read the first sentence.  But they cannot do this, because the mean and standard deviation are not provided.  For that matter, we also don’t know the value of the advertised weight.  Students are left with no option but to think things through.  I hope that they’ll remember and follow the advice that I give for any question involving normal distributions: Start with a sketch!

Part (a) can be answered without ever having taken a statistics course.  To reduce the percentage of packages that weigh less than advertised, without changing the mean or standard deviation, the manufacturer would need to decrease the advertised weight.

To answer part (b), students should realize that decreasing the percentage of underweight packages would require putting more candy in each package, so the mean of the distribution of weights would need to increase.

Part (c) is the most challenging part.  Decreasing the percentage of underweight packages, without changing the advertised weight or the mean, would require a taller and skinnier normal curve.  So, the standard deviation of the weights would need to decrease.
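If you want to verify part (c) numerically, here is a sketch with made-up values for the advertised weight, mean, and standard deviations (recall that none of these values appear in the question itself):

```python
from scipy.stats import norm

advertised = 16.0  # hypothetical advertised weight (ounces)
mean = 16.2        # hypothetical process mean, above the advertised weight

# Holding the advertised weight and mean fixed, a smaller standard deviation
# produces a smaller percentage of underweight packages.
for sd in [0.3, 0.2, 0.1]:
    print(sd, norm.cdf(advertised, loc=mean, scale=sd))  # about .25, .16, .02
```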


Most students get this wrong by answering yes.  These students have missed the whole point of the Central Limit Theorem (CLT), which describes the distribution of the sample mean.  Many students believe that whenever a sample size reaches 30 or more, that guarantees an approximately normal distribution.  Of what?  They don’t give that question any thought.  They mistakenly believe that the CLT simply guarantees a normal distribution when n ≥ 30.

I usually ask for an explanation along with a yes/no answer here.  But the explanation is almost always the same, boiling down to: Yes, because n ≥ 30.  Some students do give a very good answer, which demonstrates that they’ve learned something important (and also gives me much pleasure).  I think this question helps to distinguish students with a very strong understanding of the CLT from those with a weaker one.

You could ask a version of this question that does not refer to the Central Limit Theorem by asking: Does the sample size of 200 houses establish that the distribution of house prices is approximately normal?


This is one of my very favorite questions, which I ask on almost every final exam.  I think this is a very important idea for students to understand.  But my students perform very poorly on this question that I like so much.  Not many give the correct answer (B, 1000), and many think that the answer is 100,000 or more.

It’s fine for students to perform a sample size calculation to answer this question, but that’s not my intent.  I hope that they will have noticed that many examples in the course involved surveys with about 1000 people and that the margin-of-error turned out to be in the ballpark of 3 percentage points.

Unfortunately, many students are misled by the 325 million number that appears in the first sentence of the question.  The population size is not relevant here.  Margin-of-error depends critically on sample size but hardly at all on population size, as long as the population is much larger than the sample.  A sample size of 1000 people has the same margin-of-error whether the population of interest is all Americans or all New Zealanders or all residents of San Luis Obispo.
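One way to see this is that the conservative margin-of-error formula for a 95% confidence interval for a proportion, 1/√n, involves the sample size only; no population size appears anywhere.  A quick sketch:

```python
import math

# Conservative 95% margin of error for a proportion: 1/sqrt(n).
for n in [100, 1_000, 10_000, 1_000_000]:
    print(n, round(1 / math.sqrt(n), 4))  # n = 1000 gives about .032
```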

I suppose you could argue that I am deliberately misleading students by leading off with an irrelevant piece of information, but that’s precisely what’s being assessed: Do they realize that the population size is irrelevant here?  It’s quite remarkable that a sample size of 1000 is sufficient to obtain a margin-of-error of only 3.5 percentage points in a population as numerous as the United States.  One of my principal goals in the course is for students to appreciate the wonder of random sampling!

I sometimes give half-credit to answers of 100 and 10,000, because they are somewhat in the ballpark.  On the opposite extreme, I am tempted to deduct 2 points (even on a 1-point question!) when a student answers 1,000,000 or 10,000,000.


This question is about as straightforward as they come, and my students generally do well.  Some of the questions above are quite challenging, so it’s good to include some easier ones as well.


This is another straightforward one on which my students do well.  I hope that the answer to this question is second-nature to students by the end of the course, and I like to think that they silently thank me for the easy point when they read this question.


You might be expecting me to say that this one is also straightforward, but it is always more problematic for students than I anticipate.  Maybe some students outsmart themselves by applying a test-taking strategy that cautions against giving the same answer for both parts of a two-part question.


Part (a) is very clear-cut.  In fact, this is another question for which there’s no need to have ever set foot in a statistics classroom to answer correctly.  All that’s needed is to look for the result with the biggest difference between the success proportions in the two groups.

It does help to have been in a statistics classroom for part (b), although many students have correct intuition that larger sample sizes produce stronger evidence of a difference between the groups, when the difference in success proportions is the same.


I like questions about hypothesis tests and confidence intervals providing complementary and consistent results.  In this case students need to realize that the p-value is greater than 0.05, so the difference in group means is not statistically significant at the .05 level, and so a 95% confidence interval for the difference in population means should include both positive and negative values (and zero).


This is another example of asking students to think through a statistic that they may not have encountered in class.  They should recognize that a relative risk greater than one indicates that one group has a higher success proportion than the other.  In this case, a confidence interval consisting entirely of values greater than one provides strong evidence that the success proportions differ between the two groups.
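In case it helps to see how such an interval can arise, here is a sketch using the standard large-sample confidence interval for a relative risk, computed on the log scale; the counts are hypothetical:

```python
import math

a, n1 = 40, 200  # hypothetical: 40 of 200 successes in group 1
c, n2 = 20, 200  # hypothetical: 20 of 200 successes in group 2

rr = (a / n1) / (c / n2)                     # relative risk = 2.0
se_log = math.sqrt(1/a - 1/n1 + 1/c - 1/n2)  # SE of log(relative risk)
z = 1.96                                     # for 95% confidence

lo = math.exp(math.log(rr) - z * se_log)
hi = math.exp(math.log(rr) + z * se_log)
print(rr, lo, hi)  # interval of roughly (1.21, 3.30), entirely above 1
```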


Because this is post #21 in this blog series, I will include a twenty-first question for extra credit*.  Be forewarned that this is not really a statistics question, and it does not align with any conventional learning objective for a statistics course.

* I rarely offer extra credit to my students, but I happily extend this opportunity to blog readers.

I mentioned in post #8 (here) that this percentage has halved and that only 5% of a sample of Americans gave the correct answer.  Hans Rosling liked to point out that this represents a far worse understanding than pure ignorance, which would suggest that one-third would answer correctly.  Of course, knowing this fact is not a learning objective of an introductory statistics course, but I truly hope that statistics teachers can lead their students to learn about the world by presenting real data on many topics.  Later I will write a blog post arguing that statistics teachers can present data that help to make students aware of many measurable ways in which the world is becoming a better and better place.


P.S. More information about Rosling’s claim and the survey data on the global extreme poverty rate (question #21) can be found here and here and here.

P.P.S. I thank Beth Chance for introducing me to question #14 above (about the sample size needed to obtain a reasonable margin-of-error for the population of all U.S. residents).  Beth tells me that she borrowed this question from Tom Moore, so I thank him also.

I also thank Beth and Tom for kindly serving as two reviewers who read drafts of my blog posts very carefully and offer many helpful suggestions for improvement before I post them.

Speaking of giving thanks, to those in the U.S. who read this during the week that it is posted, let me wish you a Happy Thanksgiving!

To all who are reading this in whatever country and at whatever time: Please accept my sincere thanks for taking the time to follow this blog.

P.P.P.S. Answers to these questions are: 1a) A, 1b) A, 1c) B, 1d) B, 1e) A; 2a) A, 2b) A, 2c) B, 2d) A; 3a) A, 3b) B; 4) F; 5a) A, 5b) C, 5c) B, 5d) A, 5e) C; 6) B; 7) C; 8) C; 9) B; 10) A; 11) C; 12a) A, 12b) B, 12c) A; 13) B; 14) B; 15) A; 16) C; 17a) A, 17b) A; 18a) A, 18b) C; 19) B; 20) B; 21) C.

A Word file with these twenty questions, which you may use to copy/paste or modify questions for use with your students, can be found here:

#20 Lincoln and Mandela, part 2

In last week’s post (here) I discussed sampling bias and random sampling with an activity that made use of Lincoln’s Gettysburg Address.  Now I present an activity using random assignment that mentions another great leader: Nelson Mandela.


This activity starts by asking students to answer two questions about Mandela.  Unbeknownst to my students, I distribute two versions of the questions:

  • Version A:
    • Was Nelson Mandela, first president of South Africa following apartheid, younger or older than 16 years old when he died?
    • Make a guess for Mandela’s age when he died.
  • Version B:
    • Was Nelson Mandela, first president of South Africa following apartheid, younger or older than 160 years old when he died?
    • Make a guess for Mandela’s age when he died.

Did you notice the subtle difference between the two versions?  They are identical except for one occurrence of the 0 character.  Version A asks about an age of 16, and version B asks about an age of 160.  Both of these questions are ridiculous; everyone knows that Mandela was older than 16 and younger than 160 when he died.  Why ask these silly questions at all?  The research question here is whether people’s age guesses are subconsciously affected by first seeing the number 16 or 160.  The psychological phenomenon known as anchoring suggests that those numbers, despite being completely irrelevant, nevertheless affect subsequent responses.

How do I present these questions to students?  Sometimes I prepare and distribute strips of paper in class, with the two versions mixed up randomly beforehand.  At other times I distribute the questions to students electronically, using either a Google form or our course management system.  In this latter case I post both versions and ask students to determine which version to answer by adding the year, date, and month (as a number) of their birthdate.  I tell them to use version A if this sum is odd and version B if this sum is even.


After we’ve collected students’ responses, I reveal the two versions to students and describe the research question.  Then I ask:

  • (a) What are the observational units?  This one is easy: The students themselves are the observational units, because the data were collected on them.
  • (b) What are the variables?  Which is explanatory, and which is response?  Classify each variable as categorical or numerical.  This one takes some thought.  The answer to the first question on the survey (older or younger than …) is actually not relevant here.  The explanatory variable is which “anchor” value the student was given: 16 or 160.  This is a categorical (and binary) variable.  The response variable is the guess for Mandela’s age when he died, which is numerical.
  • (c) Is this an observational study or an experiment?  This is an experiment, because the explanatory variable groups (i.e., whether a student read the question with 16 or 160 as the anchor value) were actively imposed on the students, not passively observed.  Thinking about the explanatory variable is key here, because some students are tempted to respond that this is an observational study because it feels like only answering survey questions.
  • (d) Did this study make use of random sampling, random assignment, both, or neither?  This is an important question, because one of my primary goals with this activity is to reinforce that random sampling and random assignment have different goals and benefits.  I advise students to take this one question at a time:
    • Did we make use of random sampling?  After some time for thought, I often ask a follow-up question as a hint: Who constitutes the sample, and did we choose these people randomly?  Students realize that they comprise the sample.  They were not selected randomly from the population of all students at our university (or from any other population), because they’re all students in my class.
    • Did we make use of random assignment?  Again I provide a hint after allowing some time to think about this: How did students come to be in group A or group B?  Students recognize that they were randomly assigned to a group.
  • (e) Some students know more about world history than others.  Is this a confounding variable here?  If not, how was this variable (knowledge of world history) controlled for?  This is a tough question for many students.  But this is a key question that gets at the purpose of random assignment.  The answer is no, this is not a confounding variable, because random assignment should have balanced out those who know lots of world history (or a moderate amount, or not much at all) between the two groups.  That’s what random assignment is all about: creating groups that are as similar as possible in all respects, except for the explanatory variable being studied.
  • (f) If it turns out that the age guesses in group B are statistically significantly greater than the age guesses in group A, would it be appropriate to draw a cause-and-effect conclusion?  If so, between what and what?  This is the other key question, as it addresses the scope of conclusion that can potentially be drawn from a randomized experiment.  Because students were randomly assigned to one value or the other for the anchor, it would be legitimate to draw a cause-and-effect conclusion if the age guesses differ significantly, in the conjectured direction, between the two groups.  In other words, it would be appropriate to conclude that the value of the anchor had an effect on these students’ age guesses.

I often end this activity there, keeping students’ attention solely on data collection issues.  But you could extend the activity further, or come back to it later in the course, by asking students to analyze and draw conclusions from their data:

  • (g) Produce appropriate graphs and calculate relevant statistics.  Summarize what these reveal about the research question.
  • (h) Conduct a test to assess the strength of evidence that the data provide for the research hypothesis.  Also calculate and interpret an appropriate confidence interval for comparing the two groups.

Here are results from a recent class of mine, analyzed with Minitab statistical software:

This analysis reveals that the sample data provide strong evidence to support the anchoring phenomenon.  The mean age guesses differ by almost 18 years (68.80 for version A, 86.86 for version B) in the conjectured direction.  The medians, which are not affected by outliers, differ by 11.5 years (75.5 for version A, 87.0 for version B).  The p-value for the t-test comparing the group means is essentially zero, indicating that the class data provide strong evidence to support the hypothesis that responses are affected by the “anchor” number that they see first.  We can be 95% confident that those who see an anchor of 160 produce an average age guess that is between 9.4 and 26.7 years greater than those who see an anchor of 16.
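If you prefer code to Minitab, the same test can be run in Python.  The age guesses below are invented placeholders (the actual class data appear in the Excel file linked at the end of this post), although I have included implausible guesses of 14 and 140 like the ones discussed next:

```python
import numpy as np
from scipy import stats

# Invented placeholder guesses; substitute the class data from the Excel file.
version_a = np.array([55.0, 60, 65, 70, 72, 75, 78, 80, 14])    # anchor of 16
version_b = np.array([80.0, 85, 88, 90, 92, 95, 98, 100, 140])  # anchor of 160

# Two-sample t-test, not assuming equal variances.
t_stat, p_value = stats.ttest_ind(version_b, version_a, equal_var=False)
print(t_stat, p_value)
```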


These data also provide a good opportunity to ask about whether any values should be removed from the analysis.  Many students believe that outliers should always be discarded, but it’s important to consider whether there is ample justification for removing them. In this case the age guesses of 14 years in group A and 140 years in group B are so implausible as to suggest that the students who gave those responses did not understand the question, or perhaps did not take the question seriously.  Let’s re-analyze the data without those values.  But first let’s ask students to think through what will happen:

  • (i) Predict the effect of removing the two extreme data values on:
    • Mean age guess in each group,
    • Standard deviations of the age guesses in each group,
    • Value of the t-test statistic
    • p-value
    • Confidence interval for the difference in population means
  • (j) Remove these two data values, and re-analyze the data.  Comment on how (if at all) these quantities change.  Also re-summarize your conclusions, and comment on how (if at all) they change.

After removing the two extreme data values, we produce the following output:

We see that even without the extreme data values, the data still provide strong evidence for the anchoring phenomenon.  As most students will have predicted, the mean age guess increased in version A and decreased in version B.  The standard deviations of the age guesses decreased in both groups.  The smaller difference in group means would move the t-value toward zero, but the smaller within-group standard deviations would produce a larger (in absolute value) t-statistic.  The net effect here is that the value of the t-statistic is slightly less negative. The p-value is the same as before to three decimal places (0.000) but is actually a tad larger due to the smaller (in absolute value) t-statistic.  Similarly, the confidence interval is centered on a smaller difference and is a bit narrower.  Without the extreme data values, we are 95% confident that the average age guess with the 160 anchor is between 7.4 and 23.4 years larger than with the 16 anchor.


Before concluding this analysis, I think it’s important to return to two key questions that get at the heart of the different purposes of random sampling and random assignment:

  • (k) Is it appropriate to draw a cause-and-effect conclusion from these data?  Justify your answer, and state the conclusion in context.
  • (l) To what population is it reasonable to generalize the results of this study?  Justify your answer.

Yes, it is appropriate to draw a cause-and-effect conclusion that the larger anchor number tends to produce greater age guesses than the smaller anchor number.  This conclusion is warranted, because the study design made use of random assignment and the resulting data revealed a highly statistically significant difference in the average age guesses of the two groups.

But this study only included students from my class, which is not a random sample from any population.  We should be careful not to generalize this conclusion too broadly.  Perhaps other students at my university would react similarly, and perhaps students in general would respond similarly, but we do not have data to address that.


I mentioned in post #11, titled “Repeat after me” (here), that I ask questions about observational units and variables over and over in almost every example throughout the entire course.  After we’ve studied random sampling and random assignment, I also ask questions about those ideas, like questions (c) and (d) above, for virtually every example.  The same goes for questions about scope of conclusions, like questions (k) and (l) above.

To assess students’ understanding of the distinction between random sampling and random assignment, I also ask questions such as:

  • You want to collect data to investigate whether teenagers in the United States have read fewer Harry Potter books (from the original series of seven books) than teenagers in the United Kingdom.  Would you make use of random sampling, random assignment, both, or neither?  Explain.
  • An instructor wants to investigate whether using a red pen to grade assignments leads to lower scores on exams than using a blue pen to grade assignments.  Would you advise the instructor to make use of random sampling, random assignment, both, or neither?  Explain.
  • A student decides to investigate whether NFL football games played in indoor stadiums tend to have more points scored than games played outdoors.  The student examines points scored in every NFL game of the 2019 season. Has the student used random sampling, random assignment, both, or neither?

The Harry Potter question cannot involve random assignment, because it makes no sense to randomly assign teenagers to live in either the U.S. or U.K.  But it would be good to use random sampling to select the teenagers in each country to be asked about their Harry Potter reading habits.  On the other hand, it’s important to use random assignment for the question about red vs. blue pen, because the research question asks for a cause-and-effect conclusion.  It’s less important to select a random sample of the instructor’s students, and the instructor would probably want to include all of his or her students who agreed to participate in the study.  For the football question, the student investigator would use neither random assignment nor random sampling.  NFL games are not assigned at random to be played in an indoor stadium or outdoors, and the games from the 2019 season do not constitute a random sample from any population.


The Lincoln and Mandela activities aim to help students understand that despite the common word random, there’s actually a world of difference between random sampling and random assignment:

The textbook titled The Statistical Sleuth, by Fred Ramsey and Dan Schafer, presents the following graphic, illustrating the different scopes of conclusions that can be drawn from a statistical study, depending on whether random sampling and/or random assignment were employed:

I recommend emphasizing this distinction between random sampling and random assignment at every opportunity.  I also think we do our students a favor by inviting Lincoln and Mandela into our statistics courses for a brief visit.


P.S. Nelson Mandela (1918 – 2013) was 95 years old when he died.  You can read about the anchoring phenomenon here, and an article about the effect of implausible anchors appears here.  The data on age guesses used above can be found in the Excel file below.

#19 Lincoln and Mandela, part 1

Two great leaders will be featured in this post and the next: Abraham Lincoln and Nelson Mandela.  Well, to be honest, featured is too strong, but these men provide the background for in-class activities that help students to understand two very important concepts in statistics: random sampling and random assignment.

When I first mention these two terms in class, I suspect that many students only hear random and don’t pay much attention to sampling versus assignment.  I admit that I did not make a big deal of this distinction myself when I started teaching.  But now I try to emphasize that random sampling and random assignment are very different ideas with very different goals.  In a nutshell:

  • Random sampling concerns how to select observational units for a sample.  Random sampling allows for generalizing the results of a sample to the larger population.
  • Random assignment pertains to how observational units come to be in groups to be compared.  Random assignment allows for the possibility of drawing a cause-and-effect conclusion.

This post will discuss random sampling with reference to Lincoln, and the next will concern random assignment while mentioning Mandela.  Along the way we’ll sneak in a touch of history and also some psychology.  As always, questions for students appear in italics.


I begin this activity by asking students to consider the 268 words in this speech as the population of interest:

The natural first question is: What speech is this, and who wrote it?  I’m glad that most students recognize this as Lincoln’s Gettysburg Address.  Then I give these instructions:

  • Circle ten words as a representative sample from this population.
  • For each word in your sample, record how many letters are in the word.
  • Calculate the average (mean) number of letters per word in your sample.
  • Plot your sample average on a dotplot on the board, along with the sample averages of your classmates.

Those who remember post #11 (here) will not be surprised that I next ask students: Identify the observational units and variable, first in your sample and then for the graph on the board.  For the students’ samples of ten words, the observational units are words, and the variable is the length of the word, as measured by number of letters.  But for the dotplot that students produce on the board, the observational units are samples of ten words, and the variable is the average word length in the sample.

All of this is prelude to the important question: How can we use the dotplot on the board to tell whether this sampling method (my telling students to circle ten words) is any good?  Before any student responds, I often have to add: What additional information would you like to know to help you decide whether this sampling method was good?  At this point a student usually responds that they would like to know the average word length in the entire population of 268 words.  I reply: Great idea, and before class I calculated this population average to be 4.295 letters per word.  Then I draw a vertical line through the dotplot at this value.  Here are results from a recent class:

At this point I define sampling bias as a systematic tendency for a sampling method to over-represent some observational units and under-represent others.  Then I ask: Would you say that this sampling method (my asking students to circle ten words) is biased?  If so, in which direction?  How can you tell from the dotplot?

Students recognize that a large majority of the sample averages are greater than the population average.  This means that there’s a systematic tendency for this sampling method to over-represent large words and under-represent small words.  In other words, this sampling method is biased toward over-estimating the average length of a word in the Gettysburg Address.

I emphasize to students that sampling bias is a property of the sampling method, not of any one sample generated by the method.  One illustration of this is to ask: Whose idea was it to select a sample by circling ten words based solely on human judgment?  Students reply, somewhat sheepishly, that it was my idea.  I respond that this is absolutely right: The sampling bias here is my fault, not theirs, because the sampling method was my idea.

Then I ask: Suggest some reasons for why this sampling method turned out to be biased in this way.  Students are quick to suggest good explanations for this sampling bias.  They mention that longer words (such as government, battlefield, and consecrate) convey the meaning of the speech better than smaller words (such as a, by, and for).  Students also suggest that longer words are more likely to be selected because they are just more interesting than smaller words.

Next I ask whether sample size is the problem: Would asking people to circle twenty words (rather than ten) eliminate, or at least reduce, the sampling bias?  Most students realize that taking a larger sample of words would not help with this problem, because people would still be prone to select larger words rather than smaller ones.

Before we conclude this discussion of biased sampling, I ask students to give me a chance to redeem myself by proposing a new sampling method: Suppose that I ask you to close your eyes and point at the page ten times in order to select words for your sample.  Would this sampling method be unbiased?  (After all, doesn’t closing your eyes guarantee a lack of bias?)  Explain.  Most students correctly realize that this sampling method is still biased toward longer words.  You would be more likely to select longer words than shorter ones, because longer words take up more space on the page.

Finally, I ask: Suggest a different sampling method that would be unbiased.  Some students immediately respond with a magic word: random!  So I follow up with: What does it mean to select a random sample of words in this situation?  This question is harder, but eventually a student says that random sampling gives every word, whether it is an interesting word such as dedicate or a boring word like of, the same chance of being selected.


We then proceed to examine properties of random sampling.  Sometimes I ask students to generate their own random samples of words from this population.  One option for doing this is to give them a numbered list of the 268 words and then use a random number generator (such as the one at random.org) to select their sample.  They can then calculate their sample mean word length and put a dot on a new dotplot on the board, using the same scale as the original dotplot.

Another option is to move directly to using an applet (available here) to select random samples of words.  This applet starts by showing the distribution of word lengths in the population, which is skewed to the right:

You can select random samples by first clicking on Show Sampling Options.  I ask students to start by selecting one random sample of 5 words, which produces a result such as:

The applet calculates the sample mean word length for this sample and plots that on a graph.  Then asking the applet to select 999 more samples results in a graph of sample means that looks like:

Now we’re ready for the key questions: Does this distribution of sample means indicate sampling bias or unbiasedness of this random sampling method?  What aspect of the distribution leads you to this conclusion?  The shape and variability in this distribution are completely irrelevant to the issue of sampling bias.  To address this issue, we focus on the center of the distribution.  We see that the center of the distribution of sample means is very close to the population mean.  We can quantify this by noting that the mean of the 1000 sample means is 4.336 letters/word, which is quite close to the population mean of 4.295 letters/word. Therefore, this random sampling method appears to be unbiased.

Before moving on, I want to point out how challenging the following statement can be for students:

The mean of the sample means is the population mean.

This sentence contains only ten words, but three of them are the word mean(s)!  We can rewrite this statement mathematically, using common notation, as: μ_x̄ = μ.

Notice that this equation contains only three symbols (in addition to the equals sign), but all three of them describe a mean!  It takes considerable time and careful thought for students to recognize and understand what these three means are and how they relate to each other:

  • The population mean.  For the population of 268 words in the Gettysburg Address, the value of the population mean is 4.295 letters/word.
  • The sample mean, which varies from sample to sample.  Each student calculated his/her own sample mean and represented it with a dot on the board. The first random sample generated by the applet above had a sample mean of 3.6 letters/word.  The applet then generated 999 more random samples and calculated the sample mean number of letters/word for each one.
  • The mean of the sample means.  We could have calculated this for the students’ sample means in class; we did not bother, but we know from the graph that the mean of the sample means would have been much greater than 4.295.  The applet did calculate the mean of the 1000 sample means that it generated; the mean of these sample means turned out to be 4.336 letters/word.  If we went on to generate all possible random samples, in the long run the mean of the sample means would be 4.295, the same value as the population mean.

My next question for students: Consider taking random samples of size 20 words per sample, rather than 5 words per sample.  How (if at all) would you expect the distribution of sample means to change, in terms of center, variability, and shape?  After students think about this, discuss it among themselves, and record their predictions, we use the applet to make this change, which produces a result such as:

We see that the center of this distribution is still close to the population mean of 4.295 letters/word.  Most students expect this, because this simply shows that random sampling is still unbiased with a larger sample size.  The key finding is that the variability of sample means is smaller with a larger sample size.  How can we tell?  One way is that the sample means now range from about 3 to 6 letters/word, whereas before (with a smaller sample size of 5) they ranged from about 2 to 8 letters/word.  Even better, we can note that the standard deviation of the sample means is now about 0.463, which is much less than its value of 0.945 with the smaller sample size.  The shape of the distribution of sample means is a bit more symmetric and normal-looking with the larger sample size than with the smaller sample size, much less skewed than the distribution of the population.
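If you’d rather script these simulations than use the applet, here is a sketch; it assumes you have saved a transcription of the speech as gettysburg.txt (word counts vary slightly across versions of the speech, as noted in the P.P.S. below):

```python
import numpy as np

# Load the speech and measure word lengths by counting letters only.
with open("gettysburg.txt") as f:
    words = f.read().split()
lengths = np.array([sum(ch.isalpha() for ch in w) for w in words])
print(len(lengths), lengths.mean())  # population size and mean word length

rng = np.random.default_rng(seed=1)
for n in [5, 20]:
    sample_means = [rng.choice(lengths, size=n, replace=False).mean()
                    for _ in range(1000)]
    # The center stays near the population mean (random sampling is
    # unbiased); the variability shrinks as the sample size grows.
    print(n, np.mean(sample_means), np.std(sample_means))
```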

This last point foreshadows the concept of a sampling distribution of a sample mean and the Central Limit Theorem.  I think this context and applet provide a great opportunity to study those ideas*, but at this point I prefer to keep the focus on the topics of sampling bias and random sampling.

* One feature that I particularly like about this applet is that it displays three distributions at once, which are crucial (and challenging) for students to keep in mind when studying sampling distributions:

  • Population distribution (of word lengths)
  • Sample distribution (of word lengths)
  • Sampling** distribution (of average word lengths in a sample)

** It’s very unfortunate that the modifier words sample and sampling are so similar, yet the distributions they describe are precisely a key distinction to understand.  Perhaps we should avoid using the term sampling distribution and instead say distribution of sample averages.  It’s nice to be able to use shorthand when speaking with colleagues who understand the ideas, but in this case the extra words provide clarity for students who are just beginning to consider the ideas.


Before leaving the topic of sampling bias and random sampling, I ask a few more questions of my students, all in the context of selecting a sample of students at our university to complete a survey:

  • Would it be easy or hard to select a random sample of 50 Cal Poly students?

It takes a while for some students to realize that selecting such a random sample would be very hard to achieve.  It’s unlikely that university administrators would provide a list of all students at the university.  Having access to such a list would enable us to select a random sample of students’ names, but we would still face the challenges of contacting them successfully and then, even more problematic, convincing them to respond to our survey.

  • Suppose that you select a sample of Cal Poly students by standing in front of the library and approaching 50 students who pass by.  Would this constitute a random sample of Cal Poly students?  What if you stand in front of the recreation center and approach 50 students who pass by?

Most students realize that this sampling method (standing in one location and recruiting passersby) does not constitute random sampling.  Some students would be more likely to be selected than others, in part because they are out-and-about on campus more often.  It’s also likely that you would be more likely to approach students who appear to be …, well, …, approachable, as opposed to students who look more intimidating or less friendly.  Even though the word random is used in an everyday sense to mean anything that is unplanned or unstructured, random sampling has a technical meaning.

  • Even though the convenience sampling described above is not random, could it nevertheless result in a sample that is representative of the population of Cal Poly students?  Identify a variable for which you would not be willing to consider such a convenience sample (as described above) to be representative of the population of Cal Poly students.  Also identify a variable for which you would be willing to consider such a sample (as described above) to be representative of the population of Cal Poly students.

We should certainly not consider a convenience sample, selected from students who pass by the library or recreation center, to be representative of the population for most variables, such as how often a student uses the recreation center per week, and whether or not a student knows where the library is on campus.  We should also be wary for variables about the student’s major, or how many hours they study per week, or how much sleep they get per night.  But there’s probably no reason to doubt that such a sample is representative of the population for a variable such as blood type.


I have used far more than 268 words to write this post.  Clearly I am much less economical with words than Abraham Lincoln in his Gettysburg Address.  I look forward to name-dropping Nelson Mandela into the next post, which will feature random assignment and discuss how that is quite different from random sampling.

P.S. Beth Chance and I developed the Gettysburg Address activity based on the famous “random rectangles” activity created by Dick Scheaffer and others.  As I told Dick when I interviewed him for the Journal of Statistics Education (here), I suspect that random rectangles is the most widely used activity for teaching statistics of all time, at least among activities that do not involve M&M candies.  You can read more about the genesis of the random rectangles activity in this JSE article (here).

P.P.S. This website (here) provides six different versions of the Gettysburg Address, with minor variations (and slightly different numbers of words) among them.  The one used above is the Hay copy.