
#22 Four more exam questions

In last week’s post (here), I presented twenty multiple choice questions, all conceptual in nature, none based on real data.  This week I present four free-response questions that I have used on exams, all based on real data from genuine studies.  These questions assess students’ abilities to draw and justify appropriate conclusions.  Topics covered include confounding, biased sampling, simulation-based inference, statistical inference for comparing two groups, and cause-and-effect conclusions.  All of these questions have multiple parts*.  I also provide comments on the goal of each question and common student errors. I do not intend these four questions to comprise a complete exam.  As always, questions for students appear in italics.

* I could have titled this post “Twenty more exam questions” if I had counted each part separately.


1. (6 pts) Researchers found that people who used candy cigarettes as children were more likely to become smokers as adults, compared to people who did not use candy cigarettes as children.

  • (a) (1 pt) Identify the explanatory variable.
  • (b) (1 pt) Identify the response variable.
  • (c) (4 pts) When hearing about this study, a colleague of mine said: “But isn’t the smoking status of the person’s parents a confounding variable here?”  Describe what it means for smoking status to be a confounding variable that provides an alternative to drawing a cause/effect explanation in this context.

Describing what confounding means can be very challenging for students.  The key is to suggest a connection between the confounding variable and both the explanatory and response variables. I’ve tried to make this task as straightforward as possible here.  Students do not need to suggest a confounding variable themselves, and the context does not require specialized knowledge to explain the confounding.

Parts (a) and (b) are meant to be helpful by directing students to think about the explanatory and response variables in this study (and also offering an opportunity to earn two relatively easy points).  The explanatory variable is whether or not the person used candy cigarettes as a child, and the response variable is whether or not the person became a smoker as an adult.

To earn full credit for part (c), students need to say something like:

  • Parents who smoke are more likely to allow their children to use candy cigarettes than parents who do not smoke.
  • Children of parents who smoke are more likely to become smokers as adults than children of parents who do not smoke.

It would be nice for students to add that these two connections would result in a higher proportion of smokers among those who used candy cigarettes as children than among those who did not use candy cigarettes, but I do not require such a statement.

Many students earn partial credit by giving only one of the two connections.  Such a response fails to explain confounding fully and falls short of providing an alternative explanation for the observed association.  Another common error is that some students focus on conjectured explanations, such as proposing only that children of smokers want to emulate their parents by using candy cigarettes, or that a genetic predisposition leads children of smokers to become smokers themselves.  Both of these explanations come up short because they only address one of the two connections.

I sometimes make this question a bit easier by providing one of the connections for students: My colleague also pointed out that children of smokers are more likely to become smokers as adults than children of non-smokers.  What else does the colleague need to say to complete the explanation of how parents’ smoking status is a confounding variable in this study?  At other times I make this question harder by asking students to propose a potential confounding variable and also explain how the confounding could provide an alternative to a cause-and-effect explanation.



2. (8 pts) The news website CNN.com has posted poll questions that people who view the website can respond to.  The following results were posted on January 10, 2012:

The margin-of-error, for 95% confidence, associated with this poll can be calculated to be ± .003, or ± 0.3%.

  • a) (1 pt) Are the percentages reported here (62%, 25%, 13%) parameters or statistics?  Explain briefly.
  • b) (1 pt) Explain (using no more than ten words) why the margin-of-error is so small.
  • c) (3 pts) Would you be very confident that between 61.7% and 62.3% of all employed Americans surf the Web often while on the job?  Circle YES or NO.  Also explain your answer.

Part (a) provides an easy point for students to earn by responding that these are statistics, because they are based on the sample of people who responded to the poll.  Part (b) is also fairly easy; an ideal answer has only four words: very large sample size.  I do not require that students report the sample size of 111,938.  They can omit the word “very” and still earn full credit.
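For readers who want to verify the arithmetic, here is a minimal sketch in Python using the sample size of 111,938 and the 62% result reported above; the 1/sqrt(n) shortcut is the conservative 95% margin-of-error.

```python
from math import sqrt

n = 111_938   # number of poll respondents
p_hat = 0.62  # reported sample proportion

moe_conservative = 1 / sqrt(n)                       # conservative 95% margin-of-error
moe_at_p_hat = 1.96 * sqrt(p_hat * (1 - p_hat) / n)  # margin-of-error at the observed proportion

print(f"conservative: {moe_conservative:.4f}")  # about 0.0030, i.e., 0.3 percentage points
print(f"at p-hat:     {moe_at_p_hat:.4f}")      # about 0.0028
```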

Part (c) is the key question.  I want students to recognize that this poll relies on a very biased sampling method.  Any online poll like this is prone to sampling bias, but the topic of this poll question especially invites bias.  Only by surfing the web can a person see this poll question, so the sampling method favors those who surf the web often while at work.  Because of this biased sampling method, students should not be the least bit confident that the population proportion is within the margin-of-error of the sample result.

I’ve learned to require students to circle YES or NO along with their explanation.  Otherwise, several students try to have it both ways with a vague answer that covers all possibilities, such as: I would be very confident of this, but I would also be cautious not to conclude anything too conclusively.

I used to present this poll result graphic to students and then ask specifically about sampling bias.  But I changed to the above version, as I decided that it’s important for students to be able to spot sampling bias without being prompted to look for it.


3. (12 pts)  Researchers presented young children with a choice between two toy characters who were offering stickers.  One character was described as mean, and the other was described as nice.  The mean character offered two stickers, and the nice character offered one sticker.  Researchers wanted to investigate whether children would tend to select the nice character over the mean character, despite receiving fewer stickers.  They found that 16 of the 20 children in the study selected the nice character.

  • a) (2 pts) Describe (in words) the null hypothesis in this study.
  • b) (3 pts) Suppose that you were to conduct a simulation analysis of this study to investigate whether the observed result provides strong evidence that children genuinely prefer the nice character with one sticker over the mean one with two stickers.  Indicate what you would enter for the following three inputs: i) Probability of success, ii) Sample size, iii) Number of samples.
  • c) (1 pt) One of the following graphs was produced from a correct simulation analysis.  The other two were produced from incorrect simulation analyses.  Circle the correct one.
  • d) (1 pt) Based on the correct graph, which of the following is closest to the p-value of this test: 5.000, 0.500, 0.050, 0.005?  (Circle your answer.)
  • e) (2 pts) Write an interpretation of the p-value in the context of this study.
  • f) (3 pts) Summarize your conclusion from this research study and simulation analysis.

I am often asked about how to assess students’ knowledge of simulation-based inference* without using technology during the exam.  This question shows one strategy for achieving this.  Students need to specify the input values that they would use for the simulation, pick out what the simulation results would look like, estimate the p-value from the simulation results, and summarize an appropriate conclusion.

* See post #12 (here) for an introduction to simulation-based inference.

For part (a), I am looking for students to say that the null hypothesis is that children have no preference for either character.  At this point I am not asking for students to express this hypothesis in terms of a parameter.  It’s fine for them to state that children are equally likely to select either character, or that children select a character at random.

Correct responses for part (b) are to use 0.5 for the probability of success, 20 for the sample size, and a large number such as 1000 or 10,000 for the number of samples.  Some students enter 0.8 for the probability of success, based on the sample proportion of successes.  A few students enter 20 for the number of repetitions.

Part (c) requires some thought, because my students have not seen such a question before.  Some mistakenly think that the simulation results should be centered at the observed value, so they incorrectly select the graph on the left.  The simulation results should be centered on what’s expected under the null hypothesis, as in graphs in the middle and on the right.  Most students realize that they’ve never seen a simulation result look like the nearly-uniform distribution in the middle graph.  Most recognize that they have frequently seen simulation results that look like the bell-shaped graph on the right, so they correctly select it.

To answer part (d) correctly, students need to be looking at the correct graph.  For the graph on the right, very few of the repetitions produced 16 or more successes in 20 trials, so the p-value is very small.  The smallest p-value among the options, .005, is the correct answer.
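For reference, here is a minimal sketch in Python of the simulation described in part (b), which also approximates the p-value asked about in part (d); the seed and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

prob_success = 0.5    # null hypothesis: children have no preference between the characters
sample_size = 20      # 20 children in the study
num_samples = 10_000  # number of simulated repetitions

# Number of children (out of 20) choosing the nice character in each simulated repetition
counts = rng.binomial(n=sample_size, p=prob_success, size=num_samples)

# Approximate p-value: proportion of repetitions with 16 or more successes
p_value = np.mean(counts >= 16)
print(f"approximate p-value: {p_value:.4f}")  # close to 0.006, so 0.005 is the closest option
```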

Many students struggle somewhat with part (e).  One of the things that I like about the simulation-based approach to statistical inference is that I think it makes the interpretation of p-value as clear as possible.  Students do not need to memorize an interpretation; they just need to describe what they see in the graph and remember the assumption behind the simulation analysis: If children had no preference between the characters, then only about 5 in 1000 (.005) repetitions would produce 16 or more successes.  Many students get the second part of this interpretation correct but forget to mention the “if there were no preference” assumption; such a response earns partial credit.  Sometimes I make this part of the question easier by giving a parenthetical hint: probability of what, assuming what?

Part (f), which is much more open-ended than previous parts, asks students to draw an appropriate conclusion.  This study provides very strong evidence that children genuinely prefer the nice character over the mean character despite receiving fewer stickers from the nice character.  This conclusion follows from the very small p-value, which establishes that it would be very surprising for 16 or more of 20 children to select the nice character, if in fact children had no preference for either character.


4. (16 pts) The Gallup organization released a report on October 20, 2014 that studied the daily lives and well-being of a random sample of American adults.  The report compared survey responses between adults with children under age 18 living in the home and those without such children living in the home.  The following table was provided in the report:

  • a) (2 pts) Does this study involve random sampling, random assignment, both, or neither?  Explain briefly.
  • b) (2 pts) State the appropriate null and alternative hypotheses (using appropriate symbols) for testing whether the two populations of adults differ with regard to the proportion who smiled or laughed a lot on the previous day.
  • c) (2 pts) The value of the test statistic turns out to be z = 18.5.  Write a sentence interpreting the value of this z-test statistic.  (This is not asking for a test decision or conclusion based on the z-test statistic.)
  • d) (2 pts) Would you reject the null hypothesis at the .01 significance level?  Explain how your answer follows from the value of the z-test statistic.
  • e) (2 pts) A 99% confidence interval based on the sample data turns out to be (.039 to .051).  Interpret what this interval says in this context.
  • f) (2 pts) Is this confidence interval consistent with your test decision (from part d)?  Explain how you know.
  • g) (2 pts) Give a very brief explanation for why this confidence interval is very narrow.
  • h) (2 pts) Suppose that someone reads about this study and says that having children in the household causes a very large increase in the likelihood of smiling or laughing a lot.  Would you agree with this conclusion?  Explain why or why not.

Presenting the sample statistics in the form of this table is a bit non-conventional.  This is certainly not a 2×2 table of counts that students are accustomed to seeing.   This can confuse some at first, but I think it’s worthwhile for students to see and grapple with information presented in multiple ways.

Part (a) revisits the theme of posts #19 and #20, titled Lincoln and Mandela (here and here), about the distinction between random sampling and random assignment.  Students should note that the question states that the sample was selected randomly.  But the Gallup organization certainly did not perform random assignment, because it would not be sensible or practical to randomly assign which people have children in their household and which do not.

To answer part (b) correctly, students need to realize that the test requires comparing proportions between two groups.  The null hypothesis is that American adults with children in their household have the same proportion who smiled or laughed a lot on the previous day as those without children in their household.  This null hypothesis can be expressed in symbols* as $H_0: \pi_1 = \pi_2$ (with the two-sided alternative $H_a: \pi_1 \neq \pi_2$), where $\pi_1$ and $\pi_2$ denote the two population proportions.

* Recall from post #13, titled A question of trust (here), that I like to use Greek letters for all parameter symbols, so I use π for a population proportion.

I could have asked students to calculate the z-test statistic, but part (c) provides this value and asks for an interpretation.  I try to ward off a common error by cautioning students not to provide a test decision or conclusion.  But many students do not know what interpreting the z-score means, even though we’ve done that often in class*.  I want students to respond that the sample proportions (who smiled or laughed a lot on the previous day) in the two groups (those with/without children under age 18 in the household) are 18.5 standard deviations (or standard errors) apart.  This is a huge difference.  Students do not need to comment on the huge-ness until the next part, though.  Despite my caution, many students draw a conclusion from the z-score here rather than interpret it.  This could be because they do not read carefully enough, or it could well be that they do not understand what interpreting a z-score entails.

* See post #8, titled End of alphabet (here), for more thoughts and examples about z-scores.

For part (d), students should note that because the z-score of 18.5 is enormous, the p-value will be incredibly small, very close to zero.  The tiny p-value leads to an emphatic rejection of the null hypothesis.  Notice that I do not ask for an interpretation of this test decision in context here, only because parts (c) and (e) ask for interpretations.

Students need to realize that the confidence interval presented in part (e) estimates the difference in population proportions.  I think this is fair to expect in part because that’s the conventional confidence interval to produce when comparing proportions between two groups, and also because the reported difference in sample proportions between the groups (.045) is the midpoint of the interval.  We can be 99% confident that the proportion of American adults with a child under age 18 in the household who smiled or laughed a lot is greater than the corresponding proportion among those without a child, by between .039 and .051 (in other words, by between 3.9 and 5.1 percentage points).  Some students interpret this interval only as a difference without specifying direction (that those with a child are more likely to have smiled or laughed a lot).  Such a response is only worth partial credit, because they’re leaving out an important element by not specifying which group has a higher proportion who smile or laugh a lot.
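The Gallup report’s actual counts are not reproduced here, so the sketch below uses hypothetical group sizes and proportions, clearly labeled as such, just to show how the z-statistic and the 99% confidence interval for a difference in proportions are computed.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical values for illustration only; the real counts come from the Gallup report
n1, p1_hat = 30_000, 0.850   # adults with children under 18 in the household
n2, p2_hat = 60_000, 0.805   # adults without such children

diff = p1_hat - p2_hat       # difference in sample proportions

# Unpooled standard error of the difference (a pooled version would be used for a formal test)
se = np.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

z = diff / se                # z-test statistic
z_star = norm.ppf(0.995)     # critical value for 99% confidence

ci = (diff - z_star * se, diff + z_star * se)
print(f"z = {z:.1f}, 99% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```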

Part (f) is intended to be straightforward.  Students should have rejected the null hypothesis that the population proportions are the same in the two groups.  They should also notice that the confidence interval, containing only positive values, does not include zero as a plausible value for the difference in population proportions.  These two procedures therefore give consistent results*.

* I hope that some students will remember the cat households example from post #16, titled Questions about cats (here), when they read this part.  If they do, this recollection might also help with part (h) coming up.

Part (g) is asking about the very large sample size producing a narrow confidence interval. This is the same issue that I asked about in part (b) of question #2 about the CNN.com poll*.

* It’s certainly possible that I over-emphasize this point with my students.

I must admit that I really like part (h).  The previous seven parts have been leading up to this part, which asks about the scope and type of conclusion students can draw from this survey.  Notice that I use bold font for both causes and very large increase.  This is a big hint that I want students to comment on both aspects.  Most students correctly note that this is an observational study and not a randomized experiment, so a cause-and-effect conclusion (between having children in the household and being more likely to smile or laugh) is not justified.  Relatively few students go on to address whether the difference between the groups is very large.  I hope that they’ll look at the two sample proportions, and also at the confidence interval for the difference in population proportions, and then conclude that 3.9 to 5.1 percentage points does not indicate a very large difference between the two groups.


I hope these four exam questions, which aim to assess students’ abilities to draw and justify conclusions, provide a nice complement to last week’s multiple choice questions (here).  See below for a link to a Word file containing these questions.

P.S. I thank my Cal Poly colleague Kevin Ross for introducing me to the Gallup poll and some good questions to ask about it.  Kevin and his wife Amy have five children under age 18 in their household.  I suspect that Kevin and his wife smile and laugh quite often.

P.P.S. The journal article on candy cigarette use can be found here.  The article on children’s choices of toy characters can be found here; this is a follow-up study to a more well-known one that I often use in class, described here.  A report on the Gallup survey about smiling and laughing can be found here.

P.P.P.S. Follow the link below for a Word file containing these four questions, and feel free to use or revise them for use with your own students.

#21 Twenty final exam questions

My mantra of “ask good questions” applies to exams as well as in-class learning activities.  This week I present and discuss twenty multiple-choice questions that I have used on final exams.  All of these questions are conceptual in nature.  They require no calculations, they do not refer to actual studies, and they do not make use of real data.  I certainly do not intend these questions to comprise a complete exam; I strongly recommend asking many free-response questions based on real data and genuine studies as well.

At the end of this post I provide a link to a file containing these twenty questions, in case that facilitates using them with your students.  Correct answers are discussed throughout and also reported at the end.


I like to think that this question assesses some basic level of understanding, but frankly I’m not sure.  Do students ever say that a standard deviation and a p-value can sometimes be negative?  Not often, but yes.  Do I question my career choice when I read those responses?  Not often, but yes.


I think it’s valuable to ask students to apply what they’ve learned to a new situation or a new statistic.  This question is not nearly as good for this goal as my favorite question (see post #2 here), but I think this assesses something worthwhile.  The questions about resistance are fairly straightforward.  The mid-hinge is resistant because it relies only on quartiles, but the mid-range is very non-resistant because it depends completely on the most extreme values.  Both of these statistics are measures of center.  This is challenging for many students, perhaps because they have seen that the difference between the maximum and minimum, and the difference between the quartiles, are measures of variability.  One way to convince students of this is to point out that adding a constant to every value in the dataset (in other words, shifting all of the data values by the same amount) would cause the mid-hinge and mid-range to increase (or shift) by exactly that constant.
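Here is a quick sketch, with hypothetical data, of the mid-hinge and mid-range, illustrating both the resistance comparison and the shifting argument:

```python
import numpy as np

def mid_hinge(x):
    q1, q3 = np.percentile(x, [25, 75])
    return (q1 + q3) / 2

def mid_range(x):
    return (np.min(x) + np.max(x)) / 2

data = np.array([2, 5, 7, 8, 10, 13, 40])  # hypothetical data

# Both are measures of center: adding 100 to every value shifts each by exactly 100
print(mid_hinge(data), mid_range(data))              # 8.75  21.0
print(mid_hinge(data + 100), mid_range(data + 100))  # 108.75  121.0

# Resistance: changing the maximum from 40 to 400 leaves the mid-hinge alone
# but moves the mid-range dramatically
data_outlier = data.copy()
data_outlier[-1] = 400
print(mid_hinge(data_outlier), mid_range(data_outlier))  # 8.75  201.0
```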


This question should be very easy for all students, but some struggle.  The question boils down to: If the sum of values equals zero, does the mean have to equal zero, and does the median have to equal zero?  The answer is yes to the first, because the mean is calculated as the sum divided by the number of values.  But the answer is no to the second, as seen in this counterexample where the mean is 0 but the median is not: -20, 5, 15.  The fact that this question is stated about residuals is completely irrelevant to answering the question, but the mention of residuals leads some students to think in unhelpful directions.

I sometimes ask an open-ended version of this question where I ask students to provide a counter-example if their answer is no.


This question has been extremely challenging for my students.  I used to ask it without providing options, and the most common response was “the same.”  That’s right: Many students did not realize that they should provide a number when asked for the value of a correlation coefficient.  Among these options, it’s very discouraging when a student selects -5, apparently not knowing that a correlation coefficient needs to be between -1 and +1 (inclusive), but this answer is tempting to some students because of the “5 points lower” wording in the question.  Another commonly selected wrong answer is -1.  I think students who answer -1 realize that the data would fall on a perfectly straight line, so the correlation coefficient must be -1 or +1, but the “lower” language fools them into thinking that the association is negative.

I sometimes offer a hint, advising students to start by drawing a sketch of some hypothetical data that satisfy the description.  I have also started to ask and discuss this question in class when we first study correlation, and then include the exact same question on the final exam.  This has improved students’ performance, but many still struggle.
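A quick check with hypothetical exam scores confirms that a constant five-point drop produces a correlation of +1, not -1:

```python
import numpy as np

exam1 = np.array([68, 75, 82, 90, 97])  # hypothetical scores on the first exam
exam2 = exam1 - 5                        # every student scores exactly 5 points lower

r = np.corrcoef(exam1, exam2)[0, 1]
print(round(r, 6))  # 1.0: a perfect positive (not negative) linear relationship
```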


Most students correctly identify (a) and (d) as categorical variables and (c) as a numerical variable.  The most challenging parts are (b) and (e), which are not variables for these observational units.  I try to emphasize that variables are things that can be recorded for each observational unit, not an overall question or measure that pertains to the entire dataset.


I started asking this question after I noticed that some of my students believe that conducting a randomized experiment always justifies drawing a cause-and-effect conclusion, regardless of how the data turn out!  The good news is that very few students give answer A.  The bad news is that more than a few give answer C.


Some students take the “correlation does not imply causation” maxim to an inappropriate higher level by believing that “correlation implies no causation.”  Of course, I want them to know that a strong correlation does not establish a cause-and-effect relationship but also does not preclude that possibility.


I often ask this question as a calculation to be performed in my courses for mathematically inclined students.  To calculate the correct percentage, note that Brad will get 70% right because he knows the answer, and he’ll guess correctly on 1/3 of the other 30%.  So, his long-run percentage correct will be 70% + 1/3(30%) = 80%.
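A short simulation sketch, assuming Brad knows each answer with probability 0.7 and otherwise guesses among three equally likely options, confirms the 80% figure:

```python
import numpy as np

rng = np.random.default_rng(1)
num_questions = 1_000_000

knows_answer = rng.random(num_questions) < 0.70    # knows the answer to 70% of questions
guesses_right = rng.random(num_questions) < 1 / 3  # guesses correctly 1/3 of the time otherwise

correct = knows_answer | (~knows_answer & guesses_right)
print(correct.mean())  # close to 0.80
```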

When I ask for this calculation, I’ve been surprised by students giving an answer less than 70%.  I understand that mistakes happen, of course, or that a student would not know how to solve this, but I can’t understand why they wouldn’t realize immediately that the answer has to be larger than 70%.  I decided to ask this multiple-choice version of the question, which does not require a numerical answer or any calculation.  I’m still surprised that a few students get this wrong.


This is essentially the same question as I asked in post #16 (here) about whether the percentage of American households with a pet dog plus the percentage with a pet cat equals the percentage with either a pet dog or a pet cat.  Adding these percentages is not legitimate because the events are not mutually exclusive: It’s possible that it could rain on both Saturday and Sunday.  I hope that choosing 70% and 30% as the percentages is helpful to students, who might be tipped off by the 100% value that something must be wrong because rain cannot be certain.

It might be interesting to ask this question with percentages of 70% and 40%, and also with percentages of 60% and 30%.  I hope that the version with 70% and 40% would be easier, because all students should recognize that there could not be a 110% chance of rain.  I suspect that the version with 60% and 30% would be harder, because it might be more tempting to see 90% as a reasonable chance.


The main point here is that you cannot just take the average of 80% and 40%, because the group sizes are not the same.  Because there are many more students than faculty, the overall percentage will be much closer to the student percentage of 80%, so the correct answer is that the overall percentage would be more than 60%.
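For example, with hypothetical group sizes of 900 students and 100 faculty, the overall percentage would be

$$\frac{900(0.80) + 100(0.40)}{1000} = \frac{760}{1000} = 0.76,$$

which is indeed well above 60%.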


The goal here is to assess whether students realize that a probability such as 0.5 refers to a long-run proportion and does not necessarily hold in the short-run.  A sample size of two children definitely falls into the short-run and not long-run category, so it’s not guaranteed or even very likely to have one child of each sex.

A student does not need to enumerate the sample space and calculate the exact probability to answer this question correctly.  The sample space of four equally likely outcomes is {B1B2, B1G2, G1B2, G1G2}, so the probability of having one child of each sex is indeed 2/4 = 0.5.  But a student only needs to realize that this event is neither very likely nor very unlikely in order to answer correctly.  In fact, even if a student has the misconception that the three outcomes {2 boys, 2 girls, 1 of each} are equally likely, so they think the probability is 1/3, they should still give the correct answer of C.


Students expect to perform normal distribution calculations after they read the first sentence.  But they cannot do this, because the mean and standard deviation are not provided.  For that matter, we also don’t know the value of the advertised weight.  Students are left with no option but to think things through.  I hope that they’ll remember and follow the advice that I give for any question involving normal distributions: Start with a sketch!

Part (a) can be answered without ever having taken a statistics course.  To reduce the percentage of packages that weigh less than advertised, without changing the mean or standard deviation, the manufacturer would need to decrease the advertised weight.

To answer part (b), students should realize that decreasing the percentage of underweight packages would require putting more candy in each package, so the mean of the distribution of weights would need to increase.

Part (c) is the most challenging part.  Decreasing the percentage of underweight packages, without changing the advertised weight or the mean, would require a taller and skinnier normal curve.  So, the standard deviation of the weights would need to decrease.
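A small sketch with scipy, using entirely hypothetical numbers for the advertised weight, mean, and standard deviation, shows how each change in parts (a)-(c) reduces the percentage of underweight packages:

```python
from scipy.stats import norm

# Hypothetical values: advertised weight 16 oz, mean fill 16.2 oz, standard deviation 0.2 oz
advertised, mean, sd = 16.0, 16.2, 0.2

print(norm.cdf(advertised, mean, sd))   # baseline percent underweight, about 0.16
print(norm.cdf(15.9, mean, sd))         # (a) lower advertised weight    -> about 0.07
print(norm.cdf(advertised, 16.3, sd))   # (b) larger mean                -> about 0.07
print(norm.cdf(advertised, mean, 0.1))  # (c) smaller standard deviation -> about 0.02
```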


Most students get this wrong by answering yes.  These students have missed the whole point of the Central Limit Theorem (CLT), which describes the distribution of the sample mean.  Many students believe that whenever a sample size reaches 30 or more, that guarantees an approximately normal distribution.  Of what?  They don’t give that question any thought.  They mistakenly believe that the CLT simply guarantees a normal distribution when n ≥ 30.

I usually ask for an explanation along with a yes/no answer here.  But the explanation is almost always the same, boiling down to: Yes, because n ≥ 30.  Some students do give a very good answer, which demonstrates that they’ve learned something important (and also gives me much pleasure).  I think this question helps to distinguish students with a very strong understanding of the CLT from those with a less strong understanding.
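A small sketch with a hypothetical, strongly right-skewed population illustrates the distinction: a single sample of 200 values is still clearly skewed, while the distribution of sample means across many samples of 200 is approximately normal.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)

# One sample of 200 values from a right-skewed population model (hypothetical)
one_sample = rng.exponential(scale=300_000, size=200)
print(skew(one_sample))    # well above 0: the data themselves are still right-skewed

# The sample mean across 2000 samples of size 200
sample_means = rng.exponential(scale=300_000, size=(2_000, 200)).mean(axis=1)
print(skew(sample_means))  # close to 0: the CLT describes the sample mean, not the data
```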

You could ask a version of this question that does not refer to the Central Limit Theorem by asking: Does the sample size of 200 houses establish that the distribution of the data is approximately normal?


This is one of my very favorite questions, which I ask on almost every final exam.  I think this is a very important idea for students to understand.  But my students perform very poorly on this question that I like so much.  Not many give the correct answer (B, 1000), and many think that the answer is 100,000 or more.

It’s fine for students to perform a sample size calculation to answer this question, but that’s not my intent.  I hope that they will have noticed that many examples in the course involved surveys with about 1000 people and that the margin-of-error turned out to be in the ballpark of 3 percentage points.

Unfortunately, many students are misled by the 325 million number that appears in the first sentence of the question.  The population size is not relevant here.  Margin-of-error depends critically on sample size but hardly at all on population size, as long as the population is much larger than the sample.  A sample size of 1000 people has the same margin-of-error whether the population of interest is all Americans or all New Zealanders or all residents of San Luis Obispo.
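The standard finite population correction makes this concrete: for a sample of 1000 people, the 95% margin-of-error is essentially the same whether the population numbers thirty thousand or hundreds of millions.

```python
from math import sqrt

p, n = 0.5, 1000                   # conservative proportion and sample size
base_se = sqrt(p * (1 - p) / n)

for N in (30_000, 300_000, 325_000_000):
    fpc = sqrt((N - n) / (N - 1))  # finite population correction
    moe = 1.96 * base_se * fpc
    print(f"population size {N:>11,}: margin of error = {moe:.4f}")
# All three are about 0.031, i.e., roughly 3 percentage points
```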

I suppose you could argue that I am deliberately misleading students by leading off with an irrelevant piece of information, but that’s precisely what’s being assessed: Do they realize that the population size is irrelevant here?  It’s quite remarkable that a sample size of 1000 is sufficient to obtain a margin-of-error of only 3.5 percentage points in a population as numerous as the United States.  One of my principal goals in the course is for students to appreciate the wonder of random sampling!

I sometimes give half-credit to answers of 100 and 10,000, because they are somewhat in the ballpark.  On the opposite extreme, I am tempted to deduct 2 points (even on a 1-point question!) when a student answers 1,000,000 or 10,000,000.


This question is about as straightforward as they come, and my students generally do well.  Some of the questions above are quite challenging, so it’s good to include some easier ones as well.


This is another straightforward one on which my students do well.  I hope that the answer to this question is second-nature to students by the end of the course, and I like to think that they silently thank me for the easy point when they read this question.


You might be expecting me to say that this one is also straightforward, but it is always more problematic for students than I anticipate.  Maybe some students outsmart themselves by applying a test-taking strategy that cautions against giving the same answer for both parts of a two-part question.


Part (a) is very clear-cut.  In fact, this is another question for which there’s no need to have ever set foot in a statistics classroom to answer correctly.  All that’s needed is to look for the result with the biggest difference between the success proportions in the two groups.

It does help to have been in a statistics classroom for part (b), although many students have correct intuition that larger sample sizes produce stronger evidence of a difference between the groups, when the difference in success proportions is the same.


I like questions about hypothesis tests and confidence intervals providing complementary and consistent results.  In this case students need to realize that the p-value is greater than 0.05, so the difference in the group means is not statistically significant at the .05 level, so a 95% confidence interval for the difference in population means should include both positive and negative values (and zero).


This is another example of asking students to think through a statistic that they may not have encountered in class.  They should recognize that a relative risk greater than one indicates that one group has a higher success proportion than the other.  In this case, a confidence interval consisting entirely of values greater than one provides strong evidence that the success proportions differ between the two groups.


Because this is post #21 in this blog series, I will include a twenty-first question for extra credit*.  Be forewarned that this is not really a statistics question, and it does not align with any conventional learning objective for a statistics course.

* I rarely offer extra credit to my students, but I happily extend this opportunity to blog readers.

I mentioned in post #8 (here) that this percentage has halved and that only 5% of a sample of Americans gave the correct answer.  Hans Rosling liked to point out that this represents a far worse understanding than pure ignorance, which would suggest that one-third would answer correctly.  Of course, knowing this fact is not a learning objective of an introductory statistics course, but I truly hope that statistics teachers can lead their students to learn about the world by presenting real data on many topics.  Later I will write a blog post arguing that statistics teachers can present data that help to make students aware of many measurable ways in which the world is becoming a better and better place.


P.S. More information about Rosling’s claim, along with survey data about the global extreme poverty rate (question #21), can be found here and here and here.

P.P.S. I thank Beth Chance for introducing me to question #14 above (about the sample size needed to obtain a reasonable margin-of-error for the population of all U.S. residents).  Beth tells me that she borrowed this question from Tom Moore, so I thank him also.

I also thank Beth and Tom for kindly serving as two reviewers who read drafts of my blog posts and offer many helpful suggestions for improvement before I post them.

Speaking of giving thanks, to those in the U.S. who read this during the week that it is posted, let me wish you a Happy Thanksgiving!

To all who are reading this in whatever country and at whatever time: Please accept my sincere thanks for taking the time to follow this blog.

P.P.P.S. Answers to these questions are: 1a) A, 1b) A, 1c) B, 1d) B, 1e) A; 2a) A, 2b) A, 2c) B, 2d) A; 3a) A, 3b) B; 4) F; 5a) A, 5b) C, 5c) B, 5d) A, 5e) C; 6) B; 7) C; 8) C; 9) B; 10) A; 11) C; 12a) A, 12b) B, 12c) A; 13) B; 14) B; 15) A; 16) C; 17a) A, 17b) A; 18a) A, 18b) C; 19) B; 20) B; 21) C.

A Word file with these twenty questions, which you may use to copy/paste or modify questions for use with your students, can be found here:

#20 Lincoln and Mandela, part 2

In last week’s post (here) I discussed sampling bias and random sampling with an activity that made use of Lincoln’s Gettysburg Address.  Now I present an activity using random assignment that mentions another great leader: Nelson Mandela.


This activity starts by asking students to answer two questions about Mandela.  Unbeknownst to my students, I distribute two versions of the questions:

  • Version A:
    • Was Nelson Mandela, first president of South Africa following apartheid, younger or older than 16 years old when he died?
    • Make a guess for Mandela’s age when he died.
  • Version B:
    • Was Nelson Mandela, first president of South Africa following apartheid, younger or older than 160 years old when he died?
    • Make a guess for Mandela’s age when he died.

Did you notice the subtle difference between the two versions?  They are identical except for one occurrence of the 0 character.  Version A asks about an age of 16, and version B asks about an age of 160.  Both of these questions are ridiculous; everyone knows that Mandela was older than 16 and younger than 160 when he died.  Why ask these silly questions at all?  The research question here is whether people’s age guesses are subconsciously affected by first seeing the number 16 or 160.  The psychological phenomenon known as anchoring suggests that those numbers, despite being completely irrelevant, nevertheless affect subsequent responses.

How do I present these questions to students?  Sometimes I prepare and distribute strips of paper in class, with the two versions mixed up randomly beforehand.  At other times I distribute the questions to students electronically, using either a Google form or our course management system.  In this latter case I post both versions and ask students to determine which version to answer by adding the year, day, and month (as a number) of their birthdate. I tell them to use version A if this sum is odd and version B if this sum is even.


After we’ve collected students’ responses, I reveal the two versions to students and describe the research question.  Then I ask:

  • (a) What are the observational units?  This one is easy: The students themselves are the observational units, because the data were collected on them.
  • (b) What are the variables?  Which is explanatory, and which is response?  Classify each variable as categorical or numerical.  This one takes some thought.  The answer to the first question on the survey (older or younger than …) is actually not relevant here.  The explanatory variable is which “anchor” value the student was given: 16 or 160.  This is a categorical (and binary) variable.  The response variable is the guess for Mandela’s age when he died, which is numerical.
  • (c) Is this an observational study or an experiment?  This is an experiment, because the explanatory variable groups (i.e., whether a student read the question with 16 or 160 as the anchor value) were actively imposed on the students, not passively observed.  Thinking about the explanatory variable is key here, because some students are tempted to respond that this is an observational study because it feels like only answering survey questions.
  • (d) Did this study make use of random sampling, random assignment, both, or neither?  This is an important question, because one of my primary goals with this activity is to reinforce that random sampling and random assignment have different goals and benefits.  I advise students to take this one question at a time:
    • Did we make use of random sampling?  After some time for thought, I often ask a follow-up question as a hint: Who constitutes the sample, and did we choose these people randomly?  Students realize that they comprise the sample.  They were not selected randomly from the population of all students at our university (or from any other population), because they’re all students in my class.
    • Did we make use of random assignment?  Again I provide a hint after allowing some time to think about this: How did students come to be in group A or group B?  Students recognize that they were randomly assigned to a group.
  • (e) Some students know more about world history than others.  Is this a confounding variable here?  If not, how was this variable (knowledge of world history) controlled for?  This is a tough question for many students.  But this is a key question that gets at the purpose of random assignment.  The answer is no, this is not a confounding variable, because random assignment should have balanced out those who know lots of world history (or a moderate amount, or not much at all) between the two groups.  That’s what random assignment is all about: creating groups that are as similar as possible in all respects, except for the explanatory variable being studied.
  • (f) If it turns out that the age guesses in group B are statistically significantly greater than the age guesses in group A, would it be appropriate to draw a cause-and-effect conclusion?  If so, between what and what?  This is the other key question, as it addresses the scope of conclusion that can potentially be drawn from a randomized experiment.  Because students were randomly assigned to one value or the other for the anchor, it would be legitimate to draw a cause-and-effect conclusion if the age guesses differ significantly, in the conjectured direction, between the two groups.  In other words, it would be appropriate to conclude that the value of the anchor had an effect on these students’ age guesses.

I often end this activity there, keeping students’ attention solely on data collection issues.  But you could extend the activity further, or come back to it later in the course, by asking students to analyze and draw conclusions from their data:

  • (g) Produce appropriate graphs and calculate relevant statistics.  Summarize what these reveal about the research question.
  • (h) Conduct a test to assess the strength of evidence that the data provide for the research hypothesis.  Also calculate and interpret an appropriate confidence interval for comparing the two groups.

Here are results from a recent class of mine, analyzed with Minitab statistical software:

This analysis reveals that the sample data provide strong evidence to support the anchoring phenomenon.  The mean age guesses differ by almost 18 years (68.80 for version A, 86.86 for version B) in the conjectured direction.  The medians, which are not affected by outliers, differ by 11.5 years (75.5 for version A, 87.0 for version B).  The p-value for the t-test comparing the group means is essentially zero, indicating that the class data provide strong evidence to support the hypothesis that responses are affected by the “anchor” number that they see first.  We can be 95% confident that those who see an anchor of 160 produce an average age guess that is between 9.4 and 26.7 years greater than those who see an anchor of 16.
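If you collect the responses electronically, output like the Minitab results above can be reproduced in a few lines of Python; the arrays below are placeholders to be replaced with your own class’s age guesses.

```python
import numpy as np
from scipy import stats

# Placeholders only: substitute your class's age guesses for each version
version_a = np.array([62, 70, 75, 78, 80, 85, 66, 76, 72, 68])  # anchor of 16
version_b = np.array([90, 85, 88, 95, 82, 87, 92, 84, 86, 83])  # anchor of 160

t_stat, p_value = stats.ttest_ind(version_b, version_a, equal_var=False)  # Welch's t-test
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")

# 95% confidence interval for the difference in means (Welch approximation)
sa2 = version_a.var(ddof=1) / len(version_a)
sb2 = version_b.var(ddof=1) / len(version_b)
diff = version_b.mean() - version_a.mean()
se = np.sqrt(sa2 + sb2)
df = se**4 / (sa2**2 / (len(version_a) - 1) + sb2**2 / (len(version_b) - 1))
t_crit = stats.t.ppf(0.975, df)
print(f"95% CI: ({diff - t_crit * se:.1f}, {diff + t_crit * se:.1f})")
```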


These data also provide a good opportunity to ask about whether any values should be removed from the analysis.  Many students believe that outliers should always be discarded, but it’s important to consider whether there is ample justification for removing them. In this case the age guesses of 14 years in group A and 140 years in group B are so implausible as to suggest that the students who gave those responses did not understand the question, or perhaps did not take the question seriously.  Let’s re-analyze the data without those values.  But first let’s ask students to think through what will happen:

  • (i) Predict the effect of removing the two extreme data values on:
    • Mean age guess in each group,
    • Standard deviations of the age guesses in each group,
    • Value of the t-test statistic
    • p-value
    • Confidence interval for the difference in population means
  • (j) Remove these two data values, and re-analyze the data.  Comment on how (if at all) these quantities change.  Also re-summarize your conclusions, and comment on how (if at all) they change.

After removing the two extreme data values, we produce the following output:

We see that even without the extreme data values, the data still provide strong evidence for the anchoring phenomenon.  As most students will have predicted, the mean age guess increased in version A and decreased in version B.  The standard deviations of the age guesses decreased in both groups.  The smaller difference in group means would move the t-value toward zero, but the smaller within-group standard deviations would produce a larger (in absolute value) t-statistic.  The net effect here is that the value of the t-statistic is slightly less negative. The p-value is the same as before to three decimal places (0.000) but is actually a tad larger due to the smaller (in absolute value) t-statistic.  Similarly, the confidence interval is centered on a smaller difference and is a bit narrower.  Without the extreme data values, we are 95% confident that the average age guess with the 160 anchor is between 7.4 and 23.4 years larger than with the 16 anchor.


Before concluding this analysis, I think it’s important to return to two key questions that get at the heart of the different purposes of random sampling and random assignment:

  • (k) Is it appropriate to draw a cause-and-effect conclusion from these data?  Justify your answer, and state the conclusion in context.
  • (l) To what population is it reasonable to generalize the results of this study?  Justify your answer.

Yes, it is appropriate to draw a cause-and-effect conclusion that the larger anchor number tends to produce greater age guesses than the smaller anchor number.  This conclusion is warranted, because the study design made use of random assignment and the resulting data revealed a highly statistically significant difference in the average age guesses of the two groups.

But this study only included students from my class, which is not a random sample from any population.  We should be careful not to generalize this conclusion too broadly.  Perhaps other students at my university would react similarly, and perhaps students in general would respond similarly, but we do not have data to address that.


I mentioned in post #11, titled “Repeat after me” (here), that I ask questions about observational units and variables over and over in almost every example throughout the entire course.  After we’ve studied random sampling and random assignment, I also ask questions about this, like questions (c) and (d) above, for virtually every example.  I also ask questions about scope of conclusions, like questions (k) and (l) above, for almost every example also.

To assess students’ understanding of the distinction between random sampling and random assignment, I also ask questions such as:

  • You want to collect data to investigate whether teenagers in the United States have read fewer Harry Potter books (from the original series of seven books) than teenagers in the United Kingdom.  Would you make use of random sampling, random assignment, both, or neither?  Explain.
  • An instructor wants to investigate whether using a red pen to grade assignments leads to lower scores on exams than using a blue pen to grade assignments.  Would you advise the instructor to make use of random sampling, random assignment, both, or neither?  Explain.
  • A student decides to investigate whether NFL football games played in indoor stadiums tend to have more points scored than games played outdoors.  The student examines points scored in every NFL game of the 2019 season. Has the student used random sampling, random assignment, both, or neither?

The Harry Potter question cannot involve random assignment, because it makes no sense to randomly assign teenagers to live in either the U.S. or U.K.  But it would be good to use random sampling to select the teenagers in each country to be asked about their Harry Potter reading habits.  On the other hand, it’s important to use random assignment for the question about red vs. blue pen, because the research question asks for a cause-and-effect conclusion.  It’s less important to select a random sample of the instructor’s students, and the instructor would probably want to include all of his or her students who agreed to participate in the study.  For the football question, the student investigator would use neither random assignment nor random sampling.  NFL games are not assigned at random to be played in an indoor stadium or outdoors, and the games from the 2019 season do not constitute a random sample from any population.


The Lincoln and Mandela activities aim to help students understand that despite the common word random, there’s actually a world of difference between random sampling and random assignment:

The textbook titled The Statistical Sleuth, by Fred Ramsey and Dan Schafer, presents the following graphic, illustrating the different scopes of conclusions that can be drawn from a statistical study, depending on whether random sampling and/or random assignment were employed:

I recommend emphasizing this distinction between random sampling and random assignment at every opportunity.  I also think we do our students a favor by inviting Lincoln and Mandela into our statistics courses for a brief visit.


P.S. Nelson Mandela (1918 – 2013) was 95 years old when he died. You can read about the anchoring phenomenon here, and an article about the effect of implausible anchors appears here.  The data on age guesses used above can be found in the Excel file below.

#19 Lincoln and Mandela, part 1

Two great leaders will be featured in this post and the next: Abraham Lincoln and Nelson Mandela.  Well, to be honest, featured is too strong, but these men provide the background for in-class activities that help students to understand two very important concepts in statistics: random sampling and random assignment.

When I first mention these two terms in class, I suspect that many students only hear random and don’t pay much attention to sampling versus assignment.  I admit that I did not make a big deal of this distinction myself when I started teaching.  But now I try to emphasize that random sampling and random assignment are very different ideas with very different goals.  In a nutshell:

  • Random sampling concerns how to select observational units for a sample.  Random sampling allows for generalizing the results of a sample to the larger population.
  • Random assignment pertains to how observational units come to be in groups to be compared.  Random assignment allows for the possibility of drawing a cause-and-effect conclusion.

This post will discuss random sampling with reference to Lincoln, and the next will concern random assignment while mentioning Mandela.  Along the way we’ll sneak in a touch of history and also some psychology.  As always, questions for students appear in italics.


I begin this activity by asking students to consider the 268 words in this speech as the population of interest:

The natural first question is: What speech is this, and who wrote it?  I’m glad that most students recognize this as Lincoln’s Gettysburg Address.  Then I give these instructions:

  • Circle ten words as a representative sample from this population.
  • For each word in your sample, record how many letters are in the word.
  • Calculate the average (mean) number of letters per word in your sample.
  • Plot your sample average on a dotplot on the board, along with the sample averages of your classmates.

Those who remember post #11 (here) will not be surprised that I next ask students: Identify the observational units and variable, first in your sample and then for the graph on the board.  For the students’ samples of ten words, the observational units are words, and the variable is the length of the word, as measured by number of letters.  But for the dotplot that students produce on the board, the observational units are samples of ten words, and the variable is the average length of a word.

All of this is prelude to the important question: How can we use the dotplot on the board to tell whether this sampling method (my telling students to circle ten words) is any good?  Before a student will respond, I often have to add: What additional information would you like to know to help you decide whether this sampling method was good?  At this point a student usually responds that they would like to know the average word length in the entire population of 268 words.  I reply: Great idea, and before class I calculated this population average to be 4.295 letters per word.  Then I draw a vertical line through the dotplot at this value.  Here are results from a recent class:

At this point I define sampling bias as a systematic tendency for a sampling method to over-represent some observational units and under-represent others.  Then I ask: Would you say that this sampling method (my asking students to circle ten words) is biased?  If so, in which direction?  How can you tell from the dotplot?

Students recognize that a large majority of the sample averages are greater than the population average.  This means that there’s a systematic tendency for this sampling method to over-represent large words and under-represent small words.  In other words, this sampling method is biased toward over-estimating the average length of a word in the Gettysburg Address.

I emphasize to students that sampling bias is a property of the sampling method, not of any one sample generated by the method.  One illustration of this is to ask: Whose idea was it to select a sample by circling ten words based solely on human judgment?  Students reply, somewhat sheepishly, that it was my idea.  I respond that this is absolutely right: The sampling bias here is my fault, not theirs, because the sampling method was my idea.

Then I ask: Suggest some reasons for why this sampling method turned out to be biased in this way.  Students are quick to suggest good explanations for this sampling bias.  They mention that longer words (such as government, battlefield, and consecrate) convey the meaning of the speech better than smaller words (such as a, by, and for).  Students also suggest that longer words are more likely to be selected because they are just more interesting than smaller words.

Next I ask whether sample size is the problem: Would asking people to circle twenty words (rather than ten) eliminate, or at least reduce, the sampling bias?  Most students realize that taking a larger sample of words would not help with this problem, because people would still be prone to select larger words rather than smaller ones.

Before we conclude this discussion of biased sampling, I ask students to give me a chance to redeem myself by proposing a new sampling method: Suppose that I ask you to close your eyes and point at the page ten times in order to select words for your sample.  Would this sampling method be unbiased?  (After all, doesn’t closing your eyes guarantee a lack of bias?)  Explain.  Most students correctly realize that this sampling method is still biased toward longer words.  You would be more likely to select longer words than shorter ones, because longer words take up more space on the page.

Finally, I ask: Suggest a different sampling method that would be unbiased.  Some students immediately respond with a magic word: random!  So I follow up with: What does it mean to select a random sample of words in this situation?  This question is harder, but eventually a student says that random sampling gives every word, whether it is an interesting word such as dedicate or a boring word like of, the same chance of being selected.


We then proceed to examine properties of random sampling.  Sometimes I ask students to generate their own random samples of words from this population.  One option for doing this is to give them a numbered list of the 268 words and then use a random number generator (such as the one at random.org) to select their sample.  They can then calculate their sample mean word length and put a dot on a new dotplot on the board, using the same scale as the original dotplot.

Another option is to move directly to using an applet (available here) to select random samples of words.  This applet starts by showing the distribution of word lengths in the population, which is skewed to the right:

You can select random samples by first clicking on Show Sampling Options.  I ask students to start by selecting one random sample of 5 words, which produces a result such as:

The applet calculates the sample mean word length for this sample and plots that on a graph.  Then asking the applet to select 999 more samples results in a graph of sample means that looks like:

Now we’re ready for the key questions: Does this distribution of sample means indicate sampling bias or unbiasedness of this random sampling method?  What aspect of the distribution leads you to this conclusion?  The shape and variability in this distribution are completely irrelevant to the issue of sampling bias.  To address this issue, we focus on the center of the distribution.  We see that the center of the distribution of sample means is very close to the population mean.  We can quantify this by noting that the mean of the 1000 sample means is 4.336 letters/word, which is quite close to the population mean of 4.295 letters/word. Therefore, this random sampling method appears to be unbiased.

Before moving on, I want to point out how challenging the following statement can be for students:

The mean of the sample means is the population mean.

This sentence contains only ten words, but three of them are the word mean(s)!  We can rewrite this statement mathematically, using common notation, as:

μ_x̄ = μ

Notice that this equation contains only three symbols (in addition to the equals sign), but all three of them describe a mean!  It takes considerable time and careful thought for students to recognize and understand what these three means are and how they relate to each other:

  • The population mean.  For the population of 268 words in the Gettysburg Address, the value of the population mean is 4.295 letters/word.
  • The sample mean, which varies from sample to sample.  Each student calculated his/her own sample mean and represented it with a dot on the board. The first random sample generated by the applet above had a sample mean of 3.6 letters/word.  The applet then generated 999 more random samples and calculated the sample mean number of letters/word for each one.
  • The mean of the sample means.  We could have calculated this for the students’ sample means in class; we did not bother, but we know from the graph that the mean of the sample means would have been much greater than 4.295.  The applet did calculate the mean of the 1000 sample means that it generated; the mean of these sample means turned out to be 4.336 letters/word.  If we went on to generate all possible random samples, in the long run the mean of the sample means would be 4.295, the same value as the population mean.

My next question for students: Consider taking random samples of size 20 words per sample, rather than 5 words per sample.  How (if at all) would you expect the distribution of sample means to change, in terms of center, variability, and shape?  After students think about this, discuss it among themselves, and record their predictions, we use the applet to make this change, which produces a result such as:

We see that the center of this distribution is still close to the population mean of 4.295 letters/word.  Most students expect this, because this simply shows that random sampling is still unbiased with a larger sample size.  The key finding is that the variability of sample means is smaller with a larger sample size.  How can we tell?  One way is that the sample means now range from about 3 to 6 letters/word, whereas before (with a smaller sample size of 5) they ranged from about 2 to 8 letters/word.  Even better, we can note that the standard deviation of the sample means is now about 0.463, which is much less than its value of 0.945 with the smaller sample size.  The shape of the distribution of sample means is a bit more symmetric and normal-looking with the larger sample size than with the smaller sample size, much less skewed than the distribution of the population.
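
For readers who would like to check these simulation results themselves, here is a minimal Python sketch that approximates the mean and standard deviation of the sample means for sample sizes of 5 and 20.  The word list is again only a placeholder for the full 268-word text, so the printed values will agree with the applet's only when the complete list is substituted.

```python
import random
import statistics

# Placeholder word list as in the previous sketch; substitute the full 268 words.
words = ("four score and seven years ago our fathers brought forth on this "
         "continent a new nation conceived in liberty and dedicated to the "
         "proposition that all men are created equal").split()

def sampling_distribution(word_list, n, reps=1000, seed=1):
    """Mean and SD of the sample mean word lengths across many random samples of size n."""
    rng = random.Random(seed)
    means = [sum(len(w) for w in rng.sample(word_list, n)) / n for _ in range(reps)]
    return statistics.mean(means), statistics.stdev(means)

# With the full list, both centers should sit near the population mean of about
# 4.295 letters/word, and the SD of the sample means should shrink as n grows.
for n in (5, 20):
    center, spread = sampling_distribution(words, n)
    print(f"n = {n:2d}: mean of sample means = {center:.3f}, SD = {spread:.3f}")
```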

This last point foreshadows the concept of a sampling distribution of a sample mean and the Central Limit Theorem.  I think this context and applet provide a great opportunity to study those ideas*, but at this point I prefer to keep the focus on the topics of sampling bias and random sampling.

* One feature that I particularly like about this applet is that it displays three distributions at once, which are crucial (and challenging) for students to keep in mind when studying sampling distributions:

  • Population distribution (of word lengths)
  • Sample distribution (of word lengths)
  • Sampling** distribution (of average word lengths in a sample)

** It’s very unfortunate that the modifier words sample and sampling are so similar, because the distinction between the distributions they describe is precisely the key one to understand.  Perhaps we should avoid using the term sampling distribution and instead say distribution of sample averages.  It’s nice to be able to use shorthand when speaking with colleagues who understand the ideas, but in this case the extra words provide clarity for students who are just beginning to consider the ideas.


Before leaving the topic of sampling bias and random sampling, I ask a few more questions of my students, all in the context of selecting a sample of students at our university to complete a survey:

  • Would it be easy or hard to select a random sample of 50 Cal Poly students?

It takes a while for some students to realize that selecting such a random sample would be very hard to achieve.  It’s unlikely that university administrators would provide a list of all students at the university.  Having access to such a list would enable us to select a random sample of students’ names, but we would still face the challenges of contacting them successfully and then, even more problematic, convincing them to respond to our survey.

  • Suppose that you select a sample of Cal Poly students by standing in front of the library or recreation center and approaching 50 students who pass by.  Would this constitute a random sample of Cal Poly students?

Most students realize that this sampling method (standing in one location and recruiting passersby) does not constitute random sampling.  Some students would be more likely to be selected than others, in part because they are out-and-about on campus more often.  You would also probably be more likely to approach students who appear to be …, well, …, approachable, as opposed to students who look more intimidating or less friendly.  Even though the word random is used in an everyday sense to mean anything that is unplanned or unstructured, random sampling has a technical meaning.

  • Even though the convenience sampling described above is not random, could it nevertheless result in a sample that is representative of the population of Cal Poly students?  Identify a variable for which you would not be willing to consider such a convenience sample (as described above) to be representative of the population of Cal Poly students.  Also identify a variable for which you would be willing to consider such a sample (as described above) to be representative of the population of Cal Poly students.

We should certainly not consider a convenience sample, selected from students who pass by the library or recreation center, to be representative of the population for most variables, such as how often a student uses the recreation center per week, and whether or not a student knows where the library is on campus.  We should also be wary for variables such as the student’s major, how many hours they study per week, or how much sleep they get per night.  But there’s probably no reason to doubt that such a sample is representative of the population for a variable such as blood type.


I have used far more than 268 words to write this post.  Clearly I am much less economical with words than Abraham Lincoln in his Gettysburg Address.  I look forward to name-dropping Nelson Mandela into the next post, which will feature random assignment and discuss how that is quite different from random sampling.

P.S. Beth Chance and I developed the Gettysburg Address activity based on the famous “random rectangles” activity developed by Dick Scheaffer and others.  As I told Dick when I interviewed him for the Journal of Statistics Education (here), I suspect that random rectangles is the most widely used activity for teaching statistics of all time, at least among activities that do not involve M&M candies.  You can read more about the genesis of the random rectangles activity in this JSE article (here).

P.P.S. This website (here) provides six different versions of the Gettysburg Address, with minor variations (and slightly different numbers of words) among them.  The one used above is the Hay copy.

#18 What do you expect?

I argued in post #6 (here) that the most dreaded two-word term in statistics is standard deviation.  In this post I discuss the most misleading two-word term in statistics.  There’s no doubt in my mind about which term holds this distinction.  What do you expect me to say?

If you expect me to say expected value, then your expectation is correct.

Below are four examples for helping students to understand the concept of expected value and avoid being misled by its regrettable name.  You’ll notice that I do not even use that misleading name until the end of the second example.  As always, questions that I pose to students appear in italics.


1. Let’s return to the random babies activity from post #17 (here).  I used the applet (here) to generate one million repetitions of distributing four babies to their mothers at random, with the following results:

I ask students: Calculate the average number of matches per repetition.  I usually get some blank stares, so I ask: Remind me how to calculate an average.  A student says to add up the values and then divide by the number of values.  I respond: Yes, that’s all there is to it, so please do that with these one million values.  At this point the blank stares resume, along with mutterings that they can’t possibly be expected* to add a million values on their own.

* There’s that word again.

But of course adding these one million values is not so hard at all: Adding the 375,124 zeroes takes no time, and then adding the 332,938 ones takes barely a moment.  Then you can make use of a wonderful process known as multiplication to calculate the entire sum: 0×(375,124) + 1×(332,938) + 2×(250,014) + 4×(41,924) = 1,000,662.  Dividing by 1,000,000 just involves moving the decimal point six places to the left.  This gives 1.000662 as the average number of matches in the one million simulated repetitions of this random process of distributing four babies to their mothers at random.

Then I ask: What do you think the long-run average (number of matches per repetition) will be if we continue to repeat this random process forever and ever?   Most students predict that the long-run average will be 1.0, and I tell them that this is exactly right.  I also show the applet’s graph of the average number of matches as a function of number of repetitions (for the first 1000 repetitions), which shows considerable variation at first but then gradual convergence toward a long-run value:


At this point we discuss how to calculate the theoretical long-run average based on exact probabilities rather than simulation results.  To derive the formula, let’s rewrite the calculation of the average number of matches in one million repetitions from above:

0×(375,124/1,000,000) + 1×(332,938/1,000,000) + 2×(250,014/1,000,000) + 4×(41,924/1,000,000) ≈ 1.000662

Notice that this calculation is a weighted average, where each possible value (0, 1, 2, 4) is weighted by the proportion of repetitions that produced the value.  Now recall the exact probabilities that we calculated in post #17 (here) for this random process:

P(0 matches) = 9/24, P(1 match) = 8/24, P(2 matches) = 6/24, P(3 matches) = 0, P(4 matches) = 1/24

and then replace the proportions in the weighted average calculation with the exact, theoretical probabilities:

0×(9/24) + 1×(8/24) + 2×(6/24) + 4×(1/24)

This expression works out to be 24/24, which is better known as the value 1.0.  This is the theoretical long-run average number of matches that would result from repeating this random process forever and ever.  In general, a theoretical long-run average is the weighted average of the possible values of the random process, using probabilities as weights.  We can express this in a formula as follows, where LRA represents long-run average, x represents the possible values, and p(x) represents their probabilities:

LRA = Σ x·p(x), where the sum is taken over all possible values x

Back to the random babies context, next I ask:

  • Is this long-run average the most likely value to occur?  Students recognize that the answer is no, because we are slightly more likely to obtain 0 matches than 1 match (because probability 9/24 is greater than 8/24).
  • How likely is the long-run average value to occur?  We would obtain exactly 1 match one-third (about 33.33%) of the time, if we were to repeat the random process over and over.
  • Do you expect the long-run average value to occur if you conduct the random babies process once?  Not really, because it’s twice as likely that we will not obtain 1 match than it is that we will obtain 1 match.
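
Before moving on, here is a minimal Python sketch of the weighted-average formula above, applied to the random babies probabilities (and, looking ahead to the next example, to a fair die).  The function name long_run_average is my own choice, not standard terminology.

```python
from fractions import Fraction

def long_run_average(distribution):
    """Weighted average of the possible values, using probabilities as weights."""
    return sum(x * p for x, p in distribution.items())

# Exact probabilities for the number of matches with four babies (from post #17).
babies = {0: Fraction(9, 24), 1: Fraction(8, 24), 2: Fraction(6, 24), 4: Fraction(1, 24)}
print(long_run_average(babies))              # 1, the long-run average number of matches

# Looking ahead: a fair six-sided die gives a long-run average of 7/2 = 3.5.
die = {x: Fraction(1, 6) for x in range(1, 7)}
print(long_run_average(die))                 # 7/2
```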

2. Now a very generic example: Consider rolling a fair, ordinary, six-sided die (or number cube), and then observing the number of dots on the side that lands up.  Calculate and interpret the long-run average value from this random process.

Saying that the die is fair means that the six possible outcomes should be equally likely, so the possible values and their probabilities are:

Value: 1, 2, 3, 4, 5, 6, each with probability 1/6

We can calculate the long-run average to be: LRA = 1×(1/6) + 2×(1/6) + 3×(1/6) + 4×(1/6) + 5×(1/6) + 6×(1/6) = 21/6 = 3.5.  This means that if we were to roll the die for a very large number of rolls, the average number of dots appearing on the side that lands up would be very close to 3.5.

Now I ask the same three questions from the end of the previous example:

  • Is this long-run average the most likely value to occur in the die-rolling process?  Of course not, because it’s downright impossible to obtain 3.5 dots when rolling a die. 
  • How likely is the long-run average value to occur?  Duh, like I just said, it’s impossible!  The probability is zero.
  • Do you expect the long-run average value to occur if you roll a die once?  Once more, with feeling: Of course not!

Students naturally wonder why I asked these seemingly pointless questions for the die-rolling example.  Here’s where things get a bit dicey (pun intended).  I sheepishly reveal to students that the common term for this quantity that we have been calculating and interpreting is expected value, abbreviated as EV or E(X).

Let’s ask those questions again about the die-rolling process, but now using standard terminology:

  • Is the expected value the most likely value to occur in the die-rolling process? 
  • How likely is the expected value to occur? 
  • Do you expect the expected value to occur if you conduct the die rolling process once? 

The answers to these questions are the same as before: No, of course not, the expected value (3.5 dots) is certainly not expected, because it’s impossible!

Isn’t this ridiculous?  Can we blame students for getting confused between the expected value and what we expect to happen?  As long as we’re stuck with this horribly misleading term, it’s incumbent on us to help students understand that the expected value of a random process does not in any way, shape, or form mean the value that we expect to occur when we conduct the random process.  How can we do this?  You already know my answer: Ask good questions!


3. Now let’s consider the gambling game of roulette.  When an American roulette wheel (as shown below) is spun, a ball eventually comes to rest in one of its 38 numbered slots.  The slots have colors: 18 red, 18 black, and 2 green.

The simplest version of the game is that you can bet on either a number or a color:

  • If you bet $1 on a color (red or black) and the ball lands in a slot of that color, then you get $2 back for a net profit of $1.  Otherwise, your net profit is -$1.
  • If you bet $1 on a number and the ball lands in that number’s slot, then you get $36 back for a net profit of $35.  Otherwise, your net profit is -$1.

I ask students to work through the following questions in groups, and then we discuss the answers:

  • a) List the possible values of your net profit from a $1 bet on a color, and also report their associated probabilities.  The possible values for net profit are +1 (if the ball lands on your color) and -1 (if it lands on a different color).  The wheel contains 18 slots of your color, so the probability that your net profit is +1 is 18/38, which is about 0.474.  The probability that your net profit is -1 is therefore 20/38, which is about 0.526.  Not surprisingly, it’s a little more likely that you’ll lose than win.
  • b) Determine the expected value of the net profit from betting $1 on a color.  The expected value is $1×(18/38) + (-$1)×(20/38) = -$2/38, which is about -$0.053.
  • c) Interpret what this expected value means.  If you were to bet $1 on a color for a large number of spins of the wheel, then your average net profit would be very close to a loss of $0.053 (about a nickel) per spin.
  • d) Repeat (a)-(c) for betting $1 on a number.  The possible values of net profit are now +35 (if the ball lands on your number) and -1 (otherwise).  The respective probabilities are 1/38 (about 0.026) and 37/38 (about 0.974).  The expected value of net profit is $35×(1/38) + (-$1)×(37/38) = -$2/38, which is about -$0.053.  If you were to bet $1 on a number for a large number of spins of the wheel, then your average net profit would be very close to a loss of $0.053 (about a nickel) per spin.
  • e) How do the expected values of the two types of bets compare?  Explain what this means.  The two expected values are identical.  This means that if you bet for a large number of spins, your average net profit will be to lose about a nickel per spin, regardless of whether you bet on a color or number.
  • f) Are the two types of bets identical?  (Would you get the same experience by betting on a color all evening vs. betting on a number all evening?)  If not, explain their primary difference.  No, the bets are certainly not identical, even though their expected values are the same.  If you bet on a number, you will win much less often than if you bet on a color, but your winning amount will be much larger when you do win.
  • g) The expected value from a $1 bet might seem too small to form the basis for the huge gambling industry.  Explain how casinos can make substantial profits based on this expected value.  Remember that the expected value is the average net profit per dollar bet per spin.  Casinos rely on attracting many customers and keeping them gambling for a large number of spins.  For example, if 1000 gamblers make $1 bets on 1000 spins each, then the expected value* of the casino’s income would be 1000×1000×($2/38) ≈ $52,631.58.  (A short code sketch checking these calculations appears below.)

* I have resisted the temptation to use a shorthand term such as expected income or expected profit throughout this example.  I believe that saying expected value every time might help students to avoid thinking of “expected” in the everyday sense of the word when we intend its technical meaning.
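
Here is a minimal Python sketch that checks the arithmetic in parts (b), (d), and (g) and also simulates one gambler’s long-run average net profit from color bets.  The function and variable names are mine, and the simulation result will vary slightly from run to run.

```python
import random

def expected_value(distribution):
    """Weighted average of net profits, using probabilities as weights."""
    return sum(x * p for x, p in distribution.items())

color_bet  = {+1: 18/38, -1: 20/38}      # $1 on red or black
number_bet = {+35: 1/38, -1: 37/38}      # $1 on a single number

print(expected_value(color_bet))         # about -0.0526 dollars per spin
print(expected_value(number_bet))        # identical, about -0.0526

# Part (g): 1000 gamblers each making $1 bets on 1000 spins.
print(1000 * 1000 * 2 / 38)              # casino's expected income, about 52631.58

# A quick simulation of one gambler's average net profit per spin from color bets.
rng = random.Random(0)
results = [1 if rng.random() < 18/38 else -1 for _ in range(100_000)]
print(sum(results) / len(results))       # should land near -0.053
```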


4. I like to use this question on exams to assess students’ understanding of expected value: At her birthday party, Sofia swings at a piñata repeatedly until she breaks it.  Her mother tells Sofia that she has determined the probabilities associated with the possible number of swings that could be needed for Sofia to break the piñata, and she has calculated the expected value to be 2.4.  Interpret what this expected value means.

A good answer is: If Sofia were to repeat this random process (of swinging until she breaks a piñata) for a very large number of piñatas, then the long-run average number of swings that she needs would be very close to 2.4 swings per piñata.

I look for three components when grading students’ interpretations: 1) long-run, 2) average, and 3) context.  Let’s consider each of these:

  1. The phrase long-run does not need to appear, but the idea of repeating the random process over and over for a large number of repetitions is essential.  I strongly prefer that the interpretation describe what “long run” means by indicating what would be repeated over and over (in this case, the process of swinging at a piñata until it breaks).  
  2. The idea of “average” is absolutely crucial to interpreting expected value, but it’s not uncommon for students to omit this word from their interpretations.   The interpretation makes no sense if it says that Sofia will take 2.4 swings in the long run.
  3. As is so often the case in statistics, context is key.  If a student interprets the expected value as “long-run average” with no other words provided, then the student has not demonstrated an ability to apply the concept to this situation.  In fact, a student could respond “long-run average” without bothering to read a single word about the context.

I also think it’s helpful to ask students, especially those who are studying to become teachers themselves, to critique hypothetical responses to interpreting the expected value, such as:

  • A. The long-run average is 2.4 swings.
  • B. The average number of swings that Sofia needs to break the piñata is 2.4 swings.
  • C. If Sofia were to repeat this random process (of swinging until she breaks a piñata) for a very large number of piñatas, then she would need very close to 2.4 swings in the long run.

I would assign partial credit to all three of these responses. Response A is certainly succinct, and it includes the all-important long-run average.  But the only mention of context in response A is the word “swings,” which I do not consider sufficient for describing the process of Sofia swinging at a piñata until it breaks.  Response B sounds pretty good, as it mentions average and describes the context well, but it is missing the idea of long-run.  Adding “if she were to repeat this process with a large number of piñatas” to response B would make it worthy of full credit.  Response C is so long and generally on-point that it might be hard to see what’s missing.  But response C makes no mention of the word or idea of average.  All that’s needed for response C to deserve full credit is to add “on average” at the end or insert “an average of” before “2.4 swings.”


Can we expect students to understand what expected value means?  Sure, but the unfortunate name makes this more of a challenge than it should be, as it practically begs students to confuse expected value with the value that we expect to occur.  As much as I would like to replace this nettlesome term with long-run average and its abbreviation LRA, I don’t expect* this alternative to catch on in the short term.  But I do hope that this change catches on before the long run arrives.

* Sorry, I can’t stop using this word!

P.S. I borrowed the scenario of Sofia swinging at a piñata from my colleague John Walker, who proposed this context in an exam question with more involved probability calculations.

#17 Random babies

Be forewarned that what you are about to read is highly objectionable. The topic is an introduction to basic ideas of randomness and probability, but that’s not the offensive part.  No, the despicable aspect is the context of the example, which I ask you to accept in the spirit of silliness intended.

One of the classic problems in probability is the matching problem.  When I first studied probability, this was presented in the context of a group of men at a party who throw their hats into the middle of a room and later retrieve their hats at random.  As I prepared to present this problem at the start of my teaching career, I wanted to use a context that would better capture students’ attention.  I described a hospital that returns newborn babies to their mothers at random.  Of course I realized that this context is horrific, but I thought it might be memorable, and I was hoping that it’s so far beyond the pale as to be laughable.  On the end-of-course student evaluations, one question asked what should be changed about the course, and another asked what should be retained.  For the latter question, several of my students wrote: Keep the random babies!  I have followed this advice for thirty years.

If you’d prefer to present this activity with a context that is value-neutral and perhaps even realistic, you could say that a group of people in a crowded elevator drop their cell phones, which then get jostled around so much that the people pick them up at random.  It’s also been suggested to me that the context could be a veterinarian who gives cats back to their owners at random*!

* In case you missed post #16 (here), I like cats.


After I describe this scenario to students, for the case with four babies and mothers, I ask: Use your intuition to arrange the following events in order, from least likely to most likely:

  • None of the four mothers gets the correct baby.
  • At least one of the four mothers gets the correct baby.
  • All four of the mothers get the correct baby.

At this point I don’t care how good the students’ intuitions are, but I do want them to think about these events before we begin to investigate how likely they are.  How will we conduct this investigation?  Simulate!

Before we proceed to use technology, we start with a by-hand simulation using index cards.  I give four index cards to each student and ask them to write a baby’s first name on each card.  Then I ask students to take a sheet of scratch paper and divide it into four sections, writing a mother’s last name in each section*.  You know what comes next: Students shuffle the cards (babies) and randomly distribute them to the sections of the sheet (mothers).  I ask students to keep track of the number of mothers who get the correct baby, which we call the number of matches.  Then I point out that just doing this once does not tell us much of anything. We need to repeat simulating this random process for a large number of repetitions.  I usually ask each student to repeat this three times.

* I used to provide students with names, but I think it’s more fun to let them choose names for themselves.  I emphasize that they must know which baby goes with which mother.  I recommend that they use alliteration, for example with names such as Brian Bahmanyar and Hector Herrera and Jacob Jaffe and Sean Silva**, to help with this.

** These are the names of four graduates from the Statistics program at Cal Poly. Check out their (and others’) alumni updates to our department newsletter (here) to learn about careers that are available to those with a degree in statistics.

Once the students have completed their three repetitions, each goes to the board, where I have written the numbers 0, 1, 2, 3, 4 across the top*, and students put tally marks to indicate their number of matches for each of their repetitions.  Then we count the tallies for each possible value, and finally convert these counts to proportions.  Here are some sample results:

* I make the column for exactly 3 matches very skinny, because students should realize that it’s impossible to obtain this result (because if 3 mothers get the right baby, then the remaining baby must go to the correct mother also).

At this point I tell students that these proportions are approximate probabilities.  I add that the term probability refers to the long-run proportion of times that the event would occur, if the random process were repeated for a very large number of repetitions.  Based on the by-hand simulation with 96 repetitions shown above, our best guesses are that nobody would receive the correct baby in 40.6% of all repetitions and that all four mothers would get the correct baby in 3.1% of all repetitions.


How could we produce better approximations for these probabilities?  Many students realize that more repetitions should produce better approximations.  At this point we turn to an applet (here) to conduct many more repetitions quickly and efficiently.  The screen shots below show how the applet generates the babies (!) and then distributes them at random to waddle to homes, with the colors of diapers and houses indicating which babies belong where.  The sun comes out to shine gloriously at houses with correct matches, while clouds and rain fall drearily on houses that get the wrong baby.

We repeat this for 1 repetition (trial) at a time until we finally tire of seeing the stork and the cute babies, and then we ask the applet to conduct 1000 repetitions.  Here are some sample results:

These are still approximate probabilities, but these are probably closer to the truth (meaning, closer to the theoretical long-run proportions) than our by-hand approximations, because they are based on many more repetitions (1000 instead of 96).  By clicking on the bar in the graph corresponding to 0 matches, we obtain the following graph, which shows the proportion (relative frequency) of occurrences of 0 matches as a function of the number of repetitions (trials):

I point out that this proportion bounces around quite a bit when there are a small number of trials, but the proportion seems to be settling down as the number of repetitions increases.  In fact, it’s not too much of a stretch to believe that the proportion might be approaching some limiting value in the long run.  This limiting value is what the term probability means.

Determine the approximate probability that at least one mother gets the correct baby.  Indicate two different ways to determine this.  Also interpret this (approximate) probability.  One way is to add up the number of repetitions with at least one match: (344 + 241 + 46) / 1000 = 0.631.  Another way is to subtract the estimate for 0 matches from one: 1 – 0.369 = 0.631.  Based on our simulation analysis, we estimate that at least one mother would get the correct baby in 63.1% of all repetitions, if this random process of distributing four babies to mothers at random were repeated a very large number of times.
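
If you would like to mirror the applet’s simulation in code, here is a minimal Python sketch.  The function simulate_matches and the choice of 10,000 repetitions are my own; the resulting proportions are approximations that will vary from run to run.

```python
import random
from collections import Counter

def simulate_matches(n_babies=4, reps=10_000, seed=0):
    """Repeatedly shuffle the babies and record how many mothers get the right one."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(reps):
        babies = list(range(n_babies))
        rng.shuffle(babies)
        counts[sum(1 for mother, baby in enumerate(babies) if mother == baby)] += 1
    return {k: counts[k] / reps for k in range(n_babies + 1)}

print(simulate_matches())   # proportions should land near 9/24, 8/24, 6/24, 0, 1/24
```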


Can we calculate the exact, theoretical probabilities here?  In other words, can we figure out the long-run limiting values for these proportions?  Yes, we can, and it’s not terribly hard.  But I don’t do this in “Stat 101” courses because I consider this to be a mathematical topic that can distract students’ attention from statistical thinking.  The essential point for statistical thinking is to think of probability as the long-run proportion of times that an event would happen if the random process were repeated a very large number of times, and I think the simulation analysis achieves this goal.

I do present the calculation of exact probabilities in introductory courses for mathematically inclined students and also in a statistical literacy course that includes a unit on randomness and probability.  The first step is to list all possible outcomes of the random process, called a sample space.  In other words, we need to list all ways to distribute four babies to their mothers at random.  This can be quite challenging and time-consuming for students who are not strong mathematically, so I present the sample space to them:

How is this list to be understood?  I demonstrate this for students by analyzing entries in the first column.  The outcome 1234 in the upper left means that all four mothers get the correct baby.  The outcome 2134 below that means that mothers 3 and 4 got the correct baby, but mothers 1 and 2 had their babies swapped.  The outcome 3124 (below the previous one) means that mother 4 got the correct baby, but mother 1 got baby 3 and mother 2 got baby 1 and mother 3 got baby 2.  The outcome 4123 in the bottom left means that all four mothers got the wrong baby: mother 1 got baby 4, and mother 2 got baby 1, and mother 3 got baby 2, and mother 4 got baby 3.

How does this list lead us to probabilities?  We take the phrase “at random” to mean that all 24 of these possible outcomes are equally likely.  Therefore, we can calculate the probability of an event by counting how many outcomes comprise the event and dividing by 24, the total number of outcomes.

Determine the number of matches for each outcome.  Then count how many outcomes produce 0 matches, 1 match, and so on.  Finally, divide by the total number of outcomes to determine the exact probabilities.  Express these probabilities as fractions and also as decimals, with three decimal places of accuracy. I ask students to work together on this and compare their answers with nearby students.  The correct answers are:
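
For readers who prefer not to count matches across all 24 outcomes by hand, here is a minimal Python sketch that enumerates the sample space and tallies the matches; it is a shortcut of my own, not part of the classroom activity.

```python
from collections import Counter
from itertools import permutations

# All 24 equally likely ways to return four babies (numbered 0-3) to their mothers.
counts = Counter()
for outcome in permutations(range(4)):
    counts[sum(1 for mother, baby in enumerate(outcome) if mother == baby)] += 1

total = sum(counts.values())       # 24 outcomes in the sample space
for matches in range(5):
    print(f"{matches} matches: {counts[matches]}/{total} = {counts[matches]/total:.3f}")
```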

Compare these (exact) probabilities to the approximate ones from the by-hand and applet simulations.  Students notice that the simulation analyses, particularly the applet one based on a larger number of repetitions, produced reasonable approximations.

Determine and interpret the probability that at least one mother gets the correct baby.  This probability is (8+6+1)/24 = 15/24 = .625.  We could also calculate this as 1 – 9/24 = 15/24 = .625.  If this random process were repeated a very large number of times, then at least one mother would get the correct baby in about 62.5% of the repetitions.

Determine and interpret the probability that at least half of the four mothers get the correct baby.  This probability is (6+1)/24 = 7/24 ≈ .292.  This means that if this random process were repeated a very large number of times, then at least half of the mothers would get the correct baby in about 29.2% of the repetitions.

Finally, we return to the question of ordering the three events listed above, from least likely to most likely.  The correct ordering is:

  • All four of the mothers get the correct baby (probability .042).
  • None of the four mothers gets the correct baby (probability .375).
  • At least one of the four mothers gets the correct baby (probability .625).

Here are some follow-up questions that I have asked on a quiz or exam:

For parts (a) – (c), suppose that three people (Alisha, Beth, Camille) drop their cell phones in a crowded elevator.  The phones get jostled so much that each person picks up a phone at random.  The six possible outcomes can be listed (using initials) as: ABC, ACB, BAC, BCA, CAB, CBA.

  • a) The probability that all three of them pick up the correct phone can be shown to be 1/6 ≈ .167.  Does this mean that if they repeat this random process (of dropping their three phones and picking them up at random) for a total of 6 repetitions, you can be sure that all three will get the correct phone exactly once?  Answer yes or no; also explain your answer.
  • b) Determine the probability that at least one of them picks up the correct phone.  Express this probability as a fraction and a decimal.  Show your work.
  • c) Interpret what this probability means by finishing this sentence: If the random process (of three people picking up cell phones at random) were repeated a very large number of times, then …

For parts (d) – (f), suppose instead that six people in a crowded elevator drop their cell phones and pick them up at random.

  • d) Would the probability that all of the people pick up the correct phone be smaller, the same, or larger than with three people?
  • e) Which word or phrase – impossible, very unlikely, or somewhat unlikely – best describes the event that exactly five of the six people pick up the correct phone?
  • f) Which word or phrase – impossible, very unlikely, or somewhat unlikely – best describes the event that all six people pick up the correct phone?

Answers: a) No. The 1/6 probability refers to the proportion of times that all three would get the correct phone in the long run, not in a small number (such as six) of repetitions. b) There are four outcomes in which at least one person gets the correct phone (ABC, ACB, BAC, CBA), so this probability is 4/6 = 2/3 ≈ .667. c) … at least one of the three people would pick up the correct phone in about 2/3 (or about 66.7%) of the repetitions. d) Smaller e) Impossible f) Very unlikely
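
A quick enumeration sketch (my own, not part of the exam) confirms the answers to parts (d) through (f) for six people:

```python
from itertools import permutations

def match_probabilities(n):
    """Exact probabilities for the number of people who pick up their own phone."""
    outcomes = list(permutations(range(n)))
    probs = {}
    for k in range(n + 1):
        count = sum(1 for out in outcomes
                    if sum(1 for i, phone in enumerate(out) if i == phone) == k)
        probs[k] = count / len(outcomes)
    return probs

p6 = match_probabilities(6)
print(p6[6])    # all six correct: 1/720, about 0.0014, far smaller than 1/6 (very unlikely)
print(p6[5])    # exactly five correct: 0.0 (impossible)
```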


I like to think that this memorable context forms the basis for an effective activity that helps students to develop a basic understanding of probability as the long-run proportion of times that an event occurs.

P.S. As I’ve said before, Beth Chance deserves the lion’s share (and then some) of the credit for the applet collection that I refer to often. Carlos Lima, a former student of Beth’s for an introductory statistics course, designed and implemented the animation features in the “random babies” applet.

#16 Questions about cats

I like cats*.  I also notice that it’s simply impossible to spell STATISTICS without the letters C, A, T, and S. These two facts provide more than enough justification for me to ask many questions in class that pertain to cats in one way or another.  I believe that the upcoming questions about felines (and their human friends) can help students to learn important concepts in descriptive statistics, probability, and statistical inference**.

* This is one of the shortest sentences that I’ve ever written, even shorter than: Ask good questions.

** If you are more interested in cats than in these statistical concepts, I invite you to skip down to the P.P.S. at the end of this post to see photos of my cats.


I heard Jay Lehmann present the following question at a conference.  I liked it so much (not only because it mentions cats) that I began using it on my own final exams:

1a) Which would be larger – the average weight of 10 randomly selected people, or the average weight of 1000 randomly selected cats (ordinary domestic housecats)?

Jay mentioned that some of his students struggle with this question, because they don’t think proportionally.  They believe that the weights of 1000 cats must be larger than the weight of 10 people.  This would be true, of course, if we were talking about combined weight, but the question asks about average weight, which requires thinking on a per individual (person or cat) basis. There’s no doubt that people weigh more on average than cats.

I’m pleased to say that my students had no difficulty with this question.  But I decided to ask a second question:

1b) Which would be larger – the standard deviation of the weights of 1000 randomly selected people, or the standard deviation of the weights of 10 randomly selected cats (ordinary domestic housecats)?

The correct answer, of course, is that the standard deviation would be much larger for people than for cats, because weights of people range from just a few pounds for newborns to hundreds and hundreds of pounds for overweight adults.  Cats’ weights range only from a pound or less in kittens to a few dozen pounds for overweight cats.

My students did very poorly on this question.  Why?  I think they believe that a larger sample size produces a smaller standard deviation, period.  I never said that, of course.  What I did say, and what we investigated with simulation, is that the standard deviation of a sample mean decreases as the sample size increases.  We also explored how the standard deviation of a sample proportion decreases as the sample size increases.  We then looked at some formulas that make this more explicit, such as:

SD(x̄) = σ/sqrt(n) and SD(p̂) = sqrt(π(1-π)/n)

I’m afraid that many students came away from these discussions believing that “larger sample sizes produce smaller standard deviations” without paying attention to the crucial “of a sample statistic” part.  In an effort to curb this misunderstanding, I now try never to say or write standard deviation without adding of what for more clarity.

My students’ performance on this question is especially disheartening because I fear that a higher percentage get this wrong on the final exam than would have at the beginning of the course.  In other words, I worry that my teaching on this topic is violating the fundamental principle of “first do no harm.”

Oh dear, after a light-hearted introduction, this post has taken a discouraging turn!  Let’s move on to happier thoughts about cats (and even dogs) …


The following questions address some basic ideas of working with percentages.  You could use these to introduce, or assess students’ understanding of, probabilities of unions of events.

2. The 2018 General Social Survey (GSS) interviewed a national sample of American adults and found that 47% have a pet dog and 25% have a pet cat.

a) Does it necessarily follow that 72% (which is 47% + 25%) of those surveyed had a pet dog or a pet cat?  If not, is it even possible (in principle anyway) for this to be true?  Under what circumstance (however unrealistic) would this be true?

This conclusion does not follow, because some people have both a pet dog and a pet cat.  In other words, having a dog and having a cat are not mutually exclusive.  It’s theoretically possible that 72% of those surveyed have a pet dog or a pet cat, but this would only be true if absolutely nobody in the survey had both a dog and a cat.

b) The 2018 GSS also found that 14% of survey respondents had both a dog and a cat.  What can you conclude about the percentage who had a dog or a cat?

By adding 47% and 25%, we double-count the people who had both a dog and a cat.  We can compensate for this double-counting by subtracting off the percentage who had both.  The percentage of those surveyed who had a dog or a cat is therefore 47% + 25% – 14% = 58%.

This can be seen by putting the given percentages into the 2×2 table on the left below and then filling in the remaining percentages to produce the table on the right.  The filled-in percentages are:

               Cat: yes   Cat: no   Total
Dog: yes          14%       33%      47%
Dog: no           11%       42%      53%
Total             25%       75%     100%

The filled-in table shows that you can calculate the percentage who had a dog or a cat by adding the three percentages in red, or else (as I did above) by adding the (marginal) percentages for each pet and then subtracting off the (joint) percentage with both pets in order to compensate for double-counting.

c) If we only knew the percentages in part (a) and not the percentage in part (b), what would be the smallest possible percentage of respondents who owned a pet dog or a pet cat?  Describe the (unrealistic) situation in which this extreme case would occur.

This question is very challenging for many students.  One way to tackle this is to start with the 2×2 table on the left below.  Then realize that to make the percentage with a dog or cat as small as possible, we need to make the percentage in the upper-left cell (with both a dog and a cat) as large as possible.  How large can that percentage be?  No larger than 25%, the percentage with a cat.  The completed table on the right shows that this extreme situation occurs only if none of the respondents had a cat but not a dog.  In other words, the most extreme case is that every person with a cat also had a dog, which gives 47% with a dog or a cat, the same as the percentage with a dog.
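
A tiny arithmetic sketch (the variable names are mine; percentages are written as whole numbers) summarizes parts (a) through (c):

```python
dog, cat = 47, 25                  # percentages of GSS respondents with each pet

both = 14                          # reported percentage with both a dog and a cat
print(dog + cat - both)            # 58: percentage with a dog or a cat
print(dog + cat - 0)               # 72: largest possible union, only if nobody had both
print(dog + cat - min(dog, cat))   # 47: smallest possible union, if every cat owner also had a dog
```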


The following set of questions is one of my favorites (again, not only because it concerns cats).  I have long used this example to introduce students to two important ideas in statistical inference: the fundamental distinction between statistical significance and practical importance, and the consistency between confidence intervals and hypothesis tests.

3. The 2012 Statistical Abstract of the United States gives information from a national survey of 47,000 U.S. households in 2006, which found that 32.4% of the households sampled had a pet cat.  Consider this as a random sample of American households in 2006.

a) What are the observational units and variable?  What type of variable is this?

The observational units are households, not people and not cats.  The variable is whether or not the household has a cat, which is … (get ready for it) … a CATegorical variable.

b) Conduct a hypothesis test of whether the sample data provide strong evidence that the population proportion of all American households that had a pet cat in 2006 differed from one-third. Summarize your conclusion.

The z-test statistic is calculated as:

z = (.324 - 1/3) / sqrt((1/3)(2/3)/47000) ≈ -4.29

With such a large (in absolute value) z-test statistic, the p-value is very small (about 0.00002).  The sample data provide extremely strong evidence that the proportion of all American households that had a pet cat in 2006 was not one-third.

c) Produce and interpret a 99.9% confidence interval for the population proportion of all American households that own a pet cat.

This confidence interval is calculated as:

.324 ± 3.291×sqrt(.324×.676/47000)

This becomes .324 ± .007, which is the interval (.317 → .331).  We can be 99.9% confident that the population proportion of American households that had a pet cat in 2006 was between .317 and .331.
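
Here is a minimal Python sketch that reproduces these calculations using only the standard library.  The two-sided p-value uses the normal approximation via math.erfc, and the 3.291 critical value for 99.9% confidence is hard-coded rather than looked up.

```python
import math

p_hat, n, p0 = 0.324, 47000, 1/3

# Part (b): z statistic for H0: pi = 1/3, with the hypothesized value in the SE.
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal p-value
print(z, p_value)                              # about -4.29 and 0.00002

# Part (c): 99.9% confidence interval, with the sample proportion in the SE.
z_star = 3.291                                 # critical value for 99.9% confidence
margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - margin, p_hat + margin)          # about (0.317, 0.331)
```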

Parts (a) – (c) provide fairly routine practice. The following parts introduce students to important ideas.  I encourage students to think through these questions in groups before I lead a discussion about the answers and what they’re supposed to learn from them.  I also caution students to read parts (e) and (f) very carefully to notice the small but important difference in these questions.

d) Are the test decision and confidence interval consistent with each other?  Explain.

Yes, these results are consistent.  The hypothesis test provided extremely strong evidence that the population proportion is not one-third, and the confidence interval does not include the value one-third (roughly .3333).

e) Do the sample data provide very strong evidence that the population proportion who own a pet cat is not one-third?  Explain whether the p-value or confidence interval helps you to decide.

Yes.  The p-value is extremely small (approximately .00002), so the sample data provide very strong evidence that the population proportion is not one-third.  Whatever this population proportion might equal, we have very strong evidence that it’s not one-third.

f) Do the sample data provide strong evidence that the population proportion who own a pet cat is very different from one-third?  Explain whether the p-value or confidence interval helps you to decide.

No.  The confidence interval shows us that we can be very confident that the population proportion who had a cat in 2006 is between about .317 and .331.  In other words, we can be very confident that between 31.7% and 33.1% of all American households had a pet cat in 2006.  In practical terms, this is quite close to one-third, or 33.33%.

g) What aspect of this study is responsible for the somewhat surprising pair of findings that we have very strong evidence that: (1) the population proportion is not one-third, and (2) the population proportion is quite close to one-third?

The driving factor is the very large sample size of 47,000 households. With such a large sample size, even the small difference between the sample percentage (32.4%) and the hypothesized percentage (33.33%) is enough to be statistically significant, meaning that a difference that large would be very unlikely to occur by chance alone.  The large sample size also produces a very narrow confidence interval (even with a very high confidence level), so we can be very confident that the population percentage is very close to 32.4%, which in turn is quite close to one-third in practical terms.

The bottom line here is very important for students to understand about statistical inference: With a large sample size, a small difference can be statistically significant but not practically important.


Next comes a series of questions for showing how confidence intervals and hypothesis tests relate when comparing two groups and highlighting the important role of sample size in statistical inference.

4. A national survey of pet owners in the U.S. found that 53% of cat owners and 63% of dog owners said that they would perform CPR on their pets in the event of a medical emergency.

a) Are these numbers parameters or statistics?  Explain.

These numbers are statistics, because they describe the sample of dog and cat owners who were surveyed, not all dog and cat owners in the U.S.

b) State the appropriate null and alternative hypotheses for testing whether the difference between 53% and 63% is statistically significant in this context.

The null hypothesis is that the population proportions who would perform CPR on their pet are the same for dog owners and cat owners.  The alternative hypothesis is that these population proportions are different.  We could represent these hypotheses in symbols as H0: π_dog = π_cat, Ha: π_dog ≠ π_cat.

c) What additional information would you need in order to conduct a test of these hypotheses?

We need to know the sample sizes: how many dog owners and how many cat owners were surveyed?  I have to admit that I am incredibly picky when I grade student responses on this question.  If a student responds with “sample size,” that only gets partial credit. The response needs to use the plural, because learning the combined sample size is not sufficient information for conducting the test.

d) Suppose for now that the sample sizes had been 100 in each group.  Determine the z-score and p-value of the test.  Would you reject the null hypothesis at the .05 significance level?

I ask students to use technology to perform the calculations here, so they can focus on the more important concept to be addressed after part (e).  A free online tool is available here.  The test statistic turns out to be z ≈ 1.43, with a two-sided p-value of 0.1520.  This p-value is greater than .05, so the observed difference in sample proportions is not statistically significant at the .05 level.

e) Determine and interpret a 95% confidence interval for the difference in the two population proportions.

Again I ask students to use technology for the calculation, which produces a 95% CI of (-0.036 → 0.236).  We can be 95% confident that the proportion of all dog owners who would perform CPR is anywhere from .036 smaller to .236 larger than the proportion of all cat owners who would perform CPR.

f) Are the test decision and confidence interval consistent with each other?  Explain how you can tell.

Yes, these results are consistent.  We did not conclude that the two groups differ, and the confidence interval (for the difference in population proportions) includes the value zero.

g) Now suppose that the sample sizes had been 500 in each group.  Determine the z-score and p-value and confidence interval.  Summarize your conclusions.

The test statistic becomes z ≈ 3.20, with a two-sided p-value of 0.0014.  The 95% CI becomes (0.039 → 0.161).  Now we do have strong evidence that dog owners and cat owners differ with regard to the population proportion who would perform CPR on their pets.  We can be 95% confident the proportion of all dog owners who would perform CPR is somewhere from .039 to .161 larger than the proportion of all cat owners who would perform CPR.

h) Describe how the p-value and confidence interval changed with the larger sample sizes.

The p-value became much smaller, enough to indicate that the difference in the observed sample proportions was unlikely to have occurred by chance alone.  The confidence interval became much narrower, enough that it contains only positive values, indicating that a higher proportion of dog owners than cat owners would perform CPR on their pet in an emergency.
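
Here is a minimal Python sketch that reproduces the calculations for both sample sizes.  It assumes a pooled standard error for the test and an unpooled (Wald) standard error for the confidence interval, which is consistent with the numbers above.

```python
import math

def two_proportion_test(p1, p2, n1, n2):
    """z statistic (pooled SE) and two-sided normal p-value for comparing two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, math.erfc(abs(z) / math.sqrt(2))

def two_proportion_ci(p1, p2, n1, n2, z_star=1.96):
    """95% Wald interval (unpooled SE) for the difference in population proportions."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - z_star * se, (p1 - p2) + z_star * se

for n in (100, 500):
    print(n, two_proportion_test(0.63, 0.53, n, n), two_proportion_ci(0.63, 0.53, n, n))
# n = 100: z about 1.43, p about 0.15,   CI about (-0.036, 0.236)
# n = 500: z about 3.20, p about 0.0014, CI about (0.039, 0.161)
```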

The point here is to help students recognize once again the substantial role that sample size plays in statistical inference.


I promised back in post #6 (here) that I would devote a future post to nothing but questions about cats.  I am happy to check this off as a promise kept.  I hope that cat-lovers and dog-lovers alike have found something worthwhile in this post. Among their many other benefits to society, cats can help students to learn statistics!


P.S. The percentages from the GSS in question #2 came from a Washington Post article (here).  An earlier Washington Post article (here) summarized discrepancies in pet ownership estimates from different sources. The data in question #3 can be found in Table 1241 of the 2012 Statistical Abstract of the United States (here).  The survey about performing CPR on pets was summarized in a Los Angeles Times article (here).

P.P.S. I dedicate this post to the three cats who have provided so much happiness to my wife and me.  Our first cat Eponine was a classic scaredy-cat, afraid of her own shadow.  She decided early in life that she would never do anything daring but would try to live as long as possible.  She succeeded quite well, making it to 23 years and 3 months.  On the other hand, Cosette sought adventure and lived every day to the fullest.  As a self-respecting calico cat, she became the undisputed, benevolent head of our household from the moment she joined it.  Our current cat Puti is a very good-natured boy who loves to purr, sit on laps, and complain that his 6am breakfast is served much too late in the day.

My three cats: Eponine (top left), Cosette (top right), Puti (bottom)

#15 How confident are you? part 2

How confident are you that your students can interpret a 95% confidence interval (CI) correctly?  This post continues the previous one (here) by considering numerical data and highlighting a common misconception about interpreting a CI for a population mean.

Here is the formula for a one-sample t-interval for a population mean μ, using conventional notation:

x̄ ± t*×(s/sqrt(n))

It’s worth making sure that students understand this notation.  Two quiz questions that I often ask are: 1. Remind me: what’s the difference between μ and x-bar?  2. Remind me of what the symbol s stands for, and be sure to use three words in your response.  Of course, I want students to say that μ is the symbol for a population mean and x-bar for a sample mean.  I also hope they’ll say that s stands for a sample standard deviation.  If they respond only with standard deviation, I tell them that this response is too vague and does not earn full credit.


Let’s dive in to an example that we’ll use throughout this post: I’d like to estimate the average runtime of a feature film in the thriller genre.  I selected a simple random sample of 50 thriller films from the population of 28,369* thrillers listed at IMDb (here).

* There are actually 41,774 feature films in the thriller genre listed at IMDb on October 13, 2019, but runtimes are provided for only 28,369 of them.

Consider the following (Minitab) output of the sample data:

My questions for students are:

  • (a) What are the observational units and variable?  What type of variable is this?
  • (b) Describe the relevant population and parameter.  Also indicate an appropriate symbol for this parameter.
  • (c) Identify the appropriate confidence interval procedure.
  • (d) Are the technical conditions for this procedure satisfied?  Explain.
  • (e) Calculate a 95% confidence interval for the population mean.
  • (f) Interpret this interval.
  • (g) What percentage of the films in the sample have times that fall within this interval?
  • (h) Is this percentage close to 95%?  Should it be?  Explain what went wrong, or explain that nothing went wrong.

Here are my answers:

  • (a) The observational units are the films.  The variable is the runtime of the film, measured in minutes, which is a numerical variable.
  • (b) The population is all feature films in the thriller genre listed at IMDb for which runtimes are provided.  The parameter is the mean (average) runtime among these films, denoted by μ.
  • (c) We will use a one-sample t-interval procedure to estimate the population mean μ.
  • (d) The dotplot of the sample data reveals that the distribution of runtimes is skewed to the right.  But the skewness is not extreme, so the sample size of 50 films should be large enough for the t-interval procedure to be valid.
  • (e) The 95% CI for μ is calculated as: 101.70 ± 2.010×25.30/sqrt(50), which is 101.70 ± 7.19, which is the interval (94.51 → 108.89) minutes.
  • (f) We are 95% confident that the population mean runtime of a feature film in the thriller genre in IMDb is between 94.51 and 108.89 minutes.
  • (g) Only 7 of the 50 films (14%) run for more than 94.51 minutes and less than 108.89 minutes, as shown in red in this dotplot:
  • (h) This percentage (14%) is nowhere close to 95%.  Moreover, there’s no reason to expect this percentage to be close to 95%.  Nothing went wrong here.  Remember that the CI is estimating the population mean (average), not individual values.  We do not expect 95% of the individual films’ runtimes to be within this CI.  Rather, we are 95% confident that the population mean of the runtimes is within this CI.

Question (h) indicates a very common and troublesome student misconception.  Many students mistakenly believe that a 95% CI for a population mean is supposed to contain 95% of the data values.  These students are confusing confidence about a parameter with prediction about an individual.  How can we help them to see the mistake here?  I hope that questions (g) and (h) help with this, as students should see for themselves that only 7 of the 50 films (14%) in this sample fall within the CI.  You might also point out that as the sample size increases, the CI for μ will continue to get narrower, so the interval will include fewer and fewer data values.  We can also be sure to ask students to identify parameters in words as often as possible, because I think this misconception goes back to not paying enough attention to what a parameter is in the first place.

Something else we could consider doing* to help students to distinguish between confidence and prediction is to teach them about prediction intervals, which estimate individual values rather than the population mean.  In many situations the relevant question is one of prediction.  For example, you might be much more interested in predicting how long the next thriller film that you watch will take, as opposed to wanting to estimate how long a thriller film lasts on average.

* I confess that I do not typically do this, except in courses for mathematically inclined students such as those majoring in statistics, mathematics, or economics.

Here is the formula for a prediction interval:
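In symbols, with sample mean x̄, sample standard deviation s, sample size n, and critical value t*, the prediction interval is:

$$\bar{x} \pm t^* \sqrt{s^2 + \frac{s^2}{n}} \;=\; \bar{x} \pm t^*\, s \sqrt{1 + \frac{1}{n}}$$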

Comparing this to the confidence interval formula above, we see that the prediction interval formula has an extra s (sample standard deviation) term. This accounts for variability from individual to individual, which makes the prediction interval much wider than the confidence interval.  For the sample data on runtimes of thriller films, the 95% prediction interval is: 101.70 ± 2.010×25.30×sqrt(1+1/50), which is 101.70 ± 51.36, which is the interval (50.34 → 153.06) minutes.  Notice how wide this interval is: Its half-width is 51.36 minutes (nearly an hour), compared to a half-width of just 7.19 minutes for the confidence interval above.  This prediction interval captures 45 of the 50 runtimes in this sample (90%).

An important caveat is that unlike the t-confidence interval procedure for a population mean, this prediction interval procedure relies heavily on the assumption of a normally distributed population, regardless of sample size.  The runtime distribution is skewed to the right, so this t-prediction interval procedure is probably not valid.  A simpler alternative is to produce a prediction interval by using the (approximate) 2.5th and 97.5th percentiles of the sample data.  For this sample, we could use the second-smallest and second-largest runtime values, which gives a prediction interval of (60 → 163) minutes.  This interval contains 48/50 (96%) of the runtimes in the sample.
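For instructors who like to let software handle the arithmetic, here is a minimal Python sketch of all three intervals.  The `runtimes` array and the file name in the usage comment are stand-ins for the actual sample data, and the function name is my own:

```python
import numpy as np
from scipy import stats

def mean_ci_and_prediction_intervals(x, conf=0.95):
    """Return the t confidence interval for the population mean, the t prediction
    interval for a single new observation, and a simple percentile-based
    prediction interval taken directly from the sample."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar, s = x.mean(), x.std(ddof=1)
    tstar = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)
    ci = (xbar - tstar * s / np.sqrt(n), xbar + tstar * s / np.sqrt(n))
    pi = (xbar - tstar * s * np.sqrt(1 + 1 / n),
          xbar + tstar * s * np.sqrt(1 + 1 / n))
    # Distribution-free alternative: approximate 2.5th and 97.5th percentiles of the sample.
    lo, hi = np.percentile(x, [100 * (1 - conf) / 2, 100 * (1 + conf) / 2])
    return ci, pi, (lo, hi)

# Usage (file name is hypothetical):
# runtimes = np.loadtxt("thriller_runtimes.txt")
# print(mean_ci_and_prediction_intervals(runtimes))
```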


Now let’s re-consider question (f), which asked for an interpretation of the confidence interval.  Below are four possible student answers.  As you read these, please think about whether or not you would award full credit for that interpretation:

  • 1. We are 95% confident that μ is between 94.5 and 108.9.
  • 2. We are 95% confident that the population mean is between 94.5 and 108.9 minutes.
  • 3. We are 95% confident that the population mean runtime of a thriller film in the IMDb list is between 94.5 and 108.9 minutes.
  • 4. We are 95% confident that the population mean runtime of a thriller film in the IMDb list is between 94.5 and 108.9 minutes.  This confidence stems from knowing that 95% of all confidence intervals generated by this procedure would succeed in capturing the actual value of the population mean.

I hope we agree that none of these interpretations is flat-out wrong, and that they improve steadily from #1 through #4.  Where would you draw the line about deserving full credit?  I would regard #3 as good enough.  I think #1 and #2 fall short by not providing context.  I view #4 as going beyond what’s needed because the question asked only for an interpretation of the interval, not for the meaning of the 95% confidence level.  I suggest asking a separate question specifically about interpreting confidence level*, in order to assess students’ understanding of that concept.

* I have asked: Explain what the phrase “95% confidence” means in this interpretation. This is a challenging question for most students.


Continuing this deep dive into interpreting a confidence interval for a population mean, please consider the following incorrect answers.  Think about which you consider to be more or less serious than others, and also reflect on which interpretations deserve full credit, partial credit, or no credit.

  • A. We are 95% confident that a thriller film in the IMDb list runs for between 94.5 and 108.9 minutes.
  • B. There’s a 95% chance that a thriller film in the IMDb list runs for between 94.5 and 108.9 minutes.
  • C. About 95% of all thriller films in the IMDb list run for between 94.5 and 108.9 minutes.
  • D. We are 95% confident that the mean runtime of a thriller film in this sample from the IMDb list was between 94.5 and 108.9 minutes.
  • E. We are 95% confident that the mean runtime of a thriller film in a new random sample from the IMDb list would be between 94.5 and 108.9 minutes.
  • F. There’s a 95% chance (or a 0.95 probability) that the population mean runtime of a thriller film in the IMDb list is between 94.5 and 108.9 minutes.

I contend that A, B, and C are all egregiously wrong.  They all make the same mistake of thinking that the interval predicts the runtime of individual films rather than estimating a mean.  I suppose you could say that A is better than B and C because it uses the word “confident.” In fact, simply inserting “on average” at the end of the sentence would be sufficient to fix A.  But the idea of “on average” is a crucial one to have omitted!

I believe that D and E are slightly less wrong than A, B, and C, because they do include the idea of mean.  But they refer to a sample mean instead of the population mean.  This is also a serious error and so would receive no credit in my class.  I might say that D is worse than E, because we know for sure that the mean runtime in this sample is the midpoint of the confidence interval.

What about F?  It’s not quite correct, because it uses the language of chance and probability rather than confidence.  The population mean μ is a fixed value, so it’s not technically correct* to refer to the probability or chance that μ falls in a particular interval.  What’s random is the confidence interval itself, because the interval obtained from this procedure would vary from sample to sample if we were to take repeated random samples from the population**.  But I consider this distinction between confidence and probability to be fairly minor, especially compared to the much more substantive distinction between confidence and prediction.  I would nudge a student who produced F toward more appropriate language but would award full credit for this interpretation.

* Unless we take a Bayesian approach, which I will discuss in a future post.

** As we saw in the previous post (here) by using the Simulating Confidence Intervals applet (here).


I ask a version of the “do you expect 95% of the data to fall within the CI” question almost every time I ask about interpreting a confidence interval.  I remember one student from many years ago who seemed to be either tickled or annoyed by my repeating this question so often.  In response to such a question on the final exam, he wrote something like: “Boy, some students must get this wrong a lot because you keep asking about it.  Okay, once again, my answer is …”  You might be expecting me to conclude this post on an ironic note by saying that the student then proceeded to give a wrong answer.  But no, he nailed it.  He knew that we do not expect anywhere near 95% of the data values to fall within a 95% confidence interval for the population mean.  I hope that this student would be tickled, and not annoyed, to see that I have now devoted most of a blog post to this misconception.

P.S. The sample data on runtimes can be found in the file below.

#14 How confident are you? part 1

How confident are you that your students understand what “95% confidence” means?  Or that they realize why we don’t always use 99.99% confidence?  That they can explain the sense in which larger samples produce “better” confidence intervals than smaller samples?  For that matter, how confident are you that your students know what a confidence interval is trying to estimate in the first place?  This blog post, and the next one as well, will focus on helping students to understand basic concepts of confidence intervals. (As always, my questions to students appear in italics below.)


I introduce confidence intervals (CIs) to my students with a CI for a population proportion, using the conventional method given by:
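$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

where $\hat{p}$ is the sample proportion, $n$ is the sample size, and $z^*$ is the critical value from the normal distribution.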

Let’s apply this to a survey that we encountered in post #8 (here) about whether the global rate of extreme poverty has doubled, halved, or remained about the same over the past twenty years.  The correct answer is that the rate has halved, but 59% of a random sample of 1005 adult Americans gave the (very) wrong answer that they thought the rate had doubled (here). 

Use this sample result to calculate a 95% confidence interval.  This interval turns out to be:
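$$0.59 \pm 1.96\sqrt{\frac{(0.59)(0.41)}{1005}}$$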

This calculation becomes .59 ± .03, which is the interval (.56, .62)*.  Interpret what this confidence interval means.  Most students are comfortable with concluding that we are 95% confident that something is between .56 and .62.  The tricky part is articulating what that something is.  Some students mistakenly say that we’re 95% confident that this interval includes the sample proportion who believe that the global poverty rate has doubled.  This is wrong, in part because we know that the sample proportion is the exact midpoint of this interval.  Other students mistakenly say that if researchers were to select a new sample of 1005 adult Americans, then we’re 95% confident that between 56% and 62% of those people would answer “doubled” to this question.  This is incorrect because it is again trying to interpret the confidence interval in terms of a sample proportion.  The correct interpretation needs to make clear what the population and parameter are: We can be 95% confident that between 56% and 62% of all adult Americans would answer “doubled” to the question about how the global rate of extreme poverty has changed over the past twenty years.

* How are students supposed to know that this (.56, .62) notation represents an interval?  I wonder if we should use notation such as (.56 → .62) instead?

Now comes a much harder question: What do we mean by the phrase “95% confident” in this interpretation?  Understanding this concept requires thinking about how well the confidence interval procedure would perform if it were applied for a very large number of samples.  I think the best way to explore this is with … (recall from the previous post here that I hope for students to complete this sentence with a joyful chorus of a single word) … simulation!


To conduct this simulation, we use one of my favorite applets*.  The Simulating Confidence Intervals applet (here) does what its name suggests:

  • simulates selecting random samples from a probability distribution,
  • generates a confidence interval (CI) for the parameter from each simulated sample,
  • keeps track of whether or not the CI successfully captures the value of the population parameter, and
  • calculates a running count of how many (and what percentage of) intervals succeed.

* Even though this applet is one of my favorites, it only helps students to learn if you … (wait for it) … ask good questions!

The first step in using the applet is to specify that we are dealing with a proportion, sampling from a binomial model, and using the conventional z-interval, also known as the Wald method:

The next step is to specify the value of the population proportion.  The applet needs this information in order to produce simulated samples, but it’s crucial to emphasize to students that you would not know the value of the population proportion in a real study.  Indeed, the whole point of selecting a random sample and calculating a sample proportion is to learn something about the unknown value of the population proportion.  But in order to study properties of the CI procedure, we need to specify the value of the population proportion.  Let’s use the value 0.40; in other words we’ll assume that 40% of the population has the characteristic of interest.  Let’s make this somewhat more concrete and less boring: Suppose that we are sampling college students and that 40% of college students have a tattoo.  We also need to enter the sample size; let’s start with samples of n = 75 students.  Let’s generate just 1 interval at first, and let’s use 95% confidence:

Here’s what we might observe* when we click the “Sample” button in the applet:

* Your results will vary, of course, because that’s the nature of randomness and simulation.

The vertical line above the value 0.4 indicates that the parameter value is fixed.  The black dot is the value of the simulated sample proportion, which is also the midpoint of the interval (0.413* in this case).  The confidence interval is shown in green, and the endpoint values (0.302 → 0.525) appear when you click on the interval.  You might ask students to use the sample proportion and sample size to confirm the calculation of the interval’s endpoints.  You might also ask students to suggest why the interval was colored green, or you might ask more directly: Does this interval succeed in capturing the value of the population proportion (which, you will recall, we stipulated to be 0.4)?  Yes, the interval from 0.302 to 0.525 does include the value 0.4, which is why the interval was colored green.

* This simulated sample of 75 students must have included 31 successes (with a tattoo) and 44 failures, producing a sample proportion of 31/75 ≈ 0.413.

At this point I click on “Sample” several times and ask students: Does the value of the population proportion change as the applet generates new samples?  The answer is no, the population proportion is still fixed at 0.4, where we told the applet to put it.  What does vary from sample to sample?  This is a key question.  The answer is that the intervals vary from sample to sample.  Why do the intervals vary from sample to sample?  Because the sample proportion, which is the midpoint of the interval, varies from sample to sample.  That’s what the concept of sampling variability is all about.

I continue to click on “Sample” until the applet produces an interval that appears in red, such as:

Why is this interval red?  Because it fails to capture the value of the population proportion.  Why does this interval fail when most succeed?  Because random chance produced an unusually small value of the sample proportion (0.253), which led to a confidence interval (0.155 → 0.352) that falls entirely below the value of the population proportion 0.40.

Now comes the fun part and a pretty picture.  Instead of generating one random sample at a time, let’s use the applet to generate 100 samples/intervals all at once.  We obtain something like:

This picture captures what the phrase “95% confidence” means.  But it still takes some time and thought for students to understand what this shows.  Let’s review:

  • The applet has generated 100 random samples from a population with a proportion value of 0.4.
  • For each of the 100 samples, the applet has used the usual method to calculate a 95% confidence interval.
  • These 100 intervals are displayed with horizontal line segments.
  • The 100 sample proportions are represented by the black dots at the midpoints of the intervals.
  • The population proportion remains fixed at 0.4, as shown by the vertical line. 
  • The confidence intervals that are colored green succeed in capturing the value 0.4.
  • The red confidence intervals fail to include the value 0.4.

Now, here’s the key question: What percentage of the 100 confidence intervals succeed in capturing the value of the population proportion?  It’s a lot easier to count the red ones that fail: 5 out of 100.  Lo and behold, 95% of the confidence intervals succeed in capturing the value of the population proportion.  That is what “95% confidence” means.

The applet also has an option to sort the intervals, which produces:

This picture illustrates why some confidence intervals fail: The red intervals were the unlucky ones with an unusually small or large value of the sample proportion, which leads to a confidence interval that falls entirely below or above the population proportion value of 0.4.

A picture like this appears in many statistics textbooks, but the applet makes this process interactive and dynamic.  Next I keep pressing the “Sample” button in order to generate many thousands of samples and intervals.  The running total across thousands of samples should reveal that close to 95% of confidence intervals succeed in capturing the value of the population parameter.
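For readers who want to replicate this long-run process outside the applet, here is a minimal Python sketch (this is my own code, not the applet’s; the function name and defaults are mine):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng()

def wald_coverage(pop_prop=0.40, n=75, conf=0.95, reps=10_000):
    """Simulate `reps` random samples of size n from a population with the given
    proportion, form a Wald z-interval from each sample, and return the
    proportion of intervals that capture the population proportion."""
    z = norm.ppf(1 - (1 - conf) / 2)
    p_hat = rng.binomial(n, pop_prop, size=reps) / n       # simulated sample proportions
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)          # Wald margin-of-error
    captured = (p_hat - margin <= pop_prop) & (pop_prop <= p_hat + margin)
    return captured.mean()                                 # long-run success rate

print(wald_coverage())              # close to 0.95
print(wald_coverage(n=300))         # narrower intervals, but still close to 0.95
print(wald_coverage(conf=0.80))     # close to 0.80
```

The last two calls preview the questions about confidence level and sample size that come next.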

An important question to ask next brings this idea back to statistical practice: Survey researchers typically select only one random sample from a population, and then they produce a confidence interval based on that sample.  How do we know whether the resulting confidence interval is successful in capturing the unknown value of the population parameter?  The answer is that we do not know.  This answer is deeply unsatisfying to many students, who are uncomfortable with this lack of certainty.  But that’s the unavoidable nature of the discipline of statistics.  Some are comforted by this follow-up question: If we can’t know for sure whether the confidence interval contains the value of the population parameter, on what grounds can we be confident about this?  Our 95% confidence stems from knowing that the procedure produces confidence intervals that succeed 95% of the time in the long run.  That’s what the preponderance of green intervals over red ones tells us.  In practice we don’t know where the vertical line for the population value is, so we don’t know whether our one confidence interval deserves to be colored green or red, but we do know that 95% of all intervals would be green, so we can be 95% confident that our interval deserves to be green.


Whew, that’s a lot to take in!  But I must confess that I’m not sure that this long-run interpretation of confidence level is quite as important as we instructors often make it out to be.  I think it’s far more important that students be able to describe what they are 95% confident of: that the interval captures the unknown value of the population parameter.  Both of those words are important – population parameter – and students should be able to describe both clearly in the context of the study.

I can think of at least three other aspects of confidence intervals that I think are more important (than the long-run interpretation of confidence level) for students to understand well.


1. Effect of confidence level – why don’t we always use 99.99% confidence?

Let’s go back to the applet, again with a sample size of 75.  Let’s consider changing the confidence level from 95% to 99% and then to 80%.  I strongly encourage asking students to think about this and make a prediction in advance: How do you expect the intervals to change with a larger confidence level?  Be sure to cite two things that will change about the intervals.  Once students have made their predictions, we use the applet to explore what happens:

99% confidence on the left, 80% confidence on the right

The results for 99% confidence are on the left, with 80% confidence on the right.  A larger confidence level produces wider intervals and a larger percentage of intervals that succeed in capturing the parameter value.  Why do we not always use 99.99% confidence?  Because those intervals would typically be so wide as to provide very little useful information*.

* Granted, there might be some contexts for which this level of confidence is necessary.  A very large sample size could prevent the confidence interval from becoming too wide, as the next point shows.


2. Effect of sample size – in what sense do larger samples produce better confidence intervals than smaller samples?

Let’s return to the applet with a confidence level of 95%.  Now I ask: Predict what will change about the intervals if we change the sample size from 75 to 300.  Comment on both the intervals’ widths and the percentage of intervals that are successful.  Most students correctly predict that the larger sample size will produce narrower intervals.  But many students mistakenly predict that the larger sample size will result in a higher percentage of successful intervals.  Results such as the following (n = 75 on the left, n = 300 on the right) convince them that they are correct about narrower intervals, but the percentage of successful ones remains close to 95%, because that is controlled by the confidence level:

n = 75 on the left, n = 300 on the right

This graph (and remember that students using the applet would see many such graphs dynamically, rather than simply seeing this static image) confirms students’ intuition that a larger sample size produces narrower intervals.  That’s the sense in which larger sample sizes produce better confidence intervals, because narrower intervals indicate a more precise (i.e., better) estimate of the population parameter for a given confidence level.

Many students are surprised, though, to see that the larger sample size does not affect the green/red breakdown.  We should still expect about 95% of confidence intervals to succeed in capturing the population proportion, for any sample size, because we kept the confidence level at 95%.


3. Limitations of confidence intervals – when should we refuse to calculate a confidence interval?

Suppose that an alien lands on earth and wants to estimate the proportion of human beings who are female*.  Fortunately, the alien took a good statistics course on its home planet, so it knows to take a sample of human beings and produce a confidence interval for this proportion.  Unfortunately, the alien happens upon the 2019 U.S. Senate as its sample of human beings.  The U.S. Senate has 25 women senators (its most ever!) among its 100 members in 2019.

* I realize that this context is ridiculous, but it’s one of my favorites.  In my defense, the example does make use of real data.

a) Calculate the alien’s 95% confidence interval.  This interval is:
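$$0.25 \pm 1.96\sqrt{\frac{(0.25)(0.75)}{100}}$$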

This calculation becomes .25 ± .085, which is the interval (.165 → .335).

b) Interpret the interval.  The alien would be 95% confident that the proportion of all humans on earth who are female is between .165 and .335.

c) Is this consistent with your experience living on this planet?  No, the actual proportion of humans who are female is much larger than this interval, close to 0.5.

d) What went wrong?  The alien did not select a random sample of humans.  In fact, the alien’s sampling method was very biased toward under-representing females.

e) As we saw with the applet, about 5% of all 95% confidence intervals fail to capture the actual value of the population parameter.  Is that the explanation for what went wrong here?  No!  Many students are tempted to answer yes, but this explanation about 5% of all intervals failing is only relevant when you have selected random samples over and over again.  The lack of random sampling is the problem here.

f) Would it be reasonable for the alien to conclude, with 95% confidence, that between 16.5% and 33.5% of U.S. senators in the year 2019 are female?  No.  We know (for sure, with 100% confidence) that exactly 25% of U.S. senators in 2019 are female.  If that’s the entire population of interest, there’s no reason to calculate a confidence interval.  This question is a very challenging one, for which most students need a nudge in the right direction.

The lessons of this example are:

  • Confidence intervals are not appropriate when the data were collected with a biased sampling method.  A confidence interval calculated from such a sample can provide very dubious and misleading information.
  • Confidence intervals are not appropriate when you have access to the entire population of interest.  In this unusual and happy circumstance, you should simply describe the population.

I feel a bit conflicted as I conclude this post.  I have tried to convince you that the Simulating Confidence Intervals applet provides a great tool for leading students to explore and understand what the challenging concept of “95% confidence” really means.  But I have also aimed to persuade you that many instructors over-emphasize this concept at the expense of more important things for students to learn about confidence intervals.

I will continue this discussion of confidence intervals in the next post, moving on to numerical variables and estimating a population mean.

#13 A question of trust

Which do you trust more: a simulation-based* or normal-based analysis of an inference question?  In other words, if a simulation analysis and normal approximation give noticeably different p-values, which would you believe to be closer to the correct p-value?  Please think about this question in the abstract for a moment.  Soon we’ll come back to it in a specific example.

* If you’re not familiar with simulation-based inference, I recommend reading post #12 (here) first.


Here’s the example that we’ll consider throughout this post: Stemming from concern over childhood obesity, researchers investigated whether children might be as tempted by toys as by candy for Halloween treats (see abstract of article here).  Test households in five Connecticut neighborhoods offered two bowls to trick-or-treating children: one with candy and one with small toys.  For each child, researchers kept track of whether the child selected the candy or the toy.  The research question was whether trick-or-treaters are equally likely to select the candy or toy.  More specifically, we will investigate whether the sample data provide strong evidence that trick-or-treaters have a tendency to select either the candy or toy more than the other.

In my previous post (here) I argued against using terminology and formalism when first introducing the reasoning process of statistical inference.  In this post I’ll assume that students have now been introduced to the structure of hypothesis tests, so we’ll start with a series of background questions before we analyze the data (my questions to students appear in italics):

  • What are the observational units?  The trick-or-treaters are the observational units.
  • What is the variable, and what type of variable is it?  The variable is the kind of treat selected by the child: candy or toy.  This is a binary, categorical variable.
  • What is the population of interest?  The population is all* trick-or-treaters in the U.S.  Or perhaps we should restrict the population to all trick-or-treaters in Connecticut, or in this particular community.
  • What is the sample?  The sample is the trick-or-treaters in these Connecticut neighborhoods whose selections were recorded by the researchers.
  • Was the sample selected randomly from the population?  No, it would be very difficult to obtain a list of trick-or-treaters from which one could select a random sample.  Instead this is a convenience sample of trick-or-treaters who came to the homes that agreed to participate in the study.  We can hope that these trick-or-treaters are nevertheless representative of a larger population, but they were not randomly selected from a population.
  • What is the parameter of interest?  The parameter is the population proportion of all* trick-or-treaters who would select the candy if presented with this choice between candy and toy.  Alternatively, we could define the parameter to be the population proportion who would select the toy.  It really doesn’t matter which of the two options we designate as the “success,” but we do need to be consistent throughout our analysis.  Let’s stick with candy as success.
  • What is the null hypothesis, in words?  The null hypothesis is that trick-or-treaters are equally likely to select the candy or toy.  In other words, the null hypothesis is that 50% of all trick-or-treaters would select the candy.
  • What is the alternative hypothesis, in words?  The alternative hypothesis is that trick-or-treaters are not equally likely to select the candy or toy.  In other words, the alternative hypothesis is that the proportion of all trick-or-treaters who would select the candy is not 0.5.  Notice that this is a two-sided hypothesis.
  • What is the null hypothesis, in symbols?  First we have to decide what symbol to use for a population proportion.  Most teachers and textbooks use p, but I prefer to use π.  I like the convention of using Greek letters for parameters (such as μ for a population mean and σ for a population standard deviation), and I see no reason to abandon that convention for a population proportion.  Some teachers worry that students will immediately think of the mathematical constant 3.14159265… when they see the symbol π, but I have not found this to be a problem.  The null hypothesis is H0: π = 0.5.
  • What is the alternative hypothesis, in symbols?  The two-sided alternative hypothesis is Ha: π ≠ 0.5.

* I advise students that it’s always a nice touch to insert the word “all” when describing a population and parameter.

Whew, that was a lot of background questions!  Notice that I have not yet told you how the sample data turned out.   I think it’s worth showing students that the issues above can and should be considered before looking at the data.  So, how did the data turn out?  The researchers found that 148 children selected the candy and 135 selected the toy.  The value of the sample proportion who selected the candy is therefore 148/283 ≈ 0.523.

Let’s not lose sight of the research question here: Do the sample data provide strong evidence that trick-or-treaters have a tendency to select either the candy or toy more than the other?  To pursue this I ask: How can we investigate whether the observed value of the sample statistic (.523 who selected the candy) would be very surprising under the null hypothesis that trick-or-treaters are equally likely to select the candy or toy?  I hope that my students will erupt in a chorus of, “Simulate!”*

* I tell my students that if they ever drift off to sleep in class and are startled awake to find that I have called on them with a question, they should immediately respond with: Simulate!  So many of my questions are about simulation that there’s a reasonable chance that this will be the correct answer.  Even if it’s not correct, I’ll be impressed.


Here is a graph of the distribution of sample proportions resulting from 10,000 repetitions of 283 coin flips (using the One Proportion applet here):

I ask students: Describe the shape, center, and variability of the distribution of these simulated sample proportions.  The shape is very symmetric and normal-looking.  The center appears to be near 0.5, which makes sense because our simulation assumed that 50% of all children would choose the candy.  Almost all of the sample proportions fall between 0.4 and 0.6, and it looks like about 90% of them fall between 0.45 and 0.55.

But asking about shape, center, and variability ignores the key issue.  Next I ask this series of questions:

  • What do we look for in the graph, in order to assess the strength of evidence about the research question?  We need to see whether the observed value of the sample statistic (0.523) is very unusual.
  • Well, does it appear that 0.523 is unusual?  Not unusual at all.  The simulation produced sample proportions as far from 0.5 as 0.523 fairly frequently.
  • So, what do we conclude about the research question, and why?  The sample data (0.523 selecting the candy) would not be surprising if children were equally likely to choose the candy or toy, so the data do not provide enough evidence to reject the (null) hypothesis that children are equally likely to choose the candy or toy.

We could stop there, absolutely.  We don’t need to calculate a p-value or anything else in order to draw this conclusion.  We can see all we need from the graph of simulation results.  But let’s go ahead and calculate the (approximate) p-value from the simulation.  Because we have a two-sided alternative, a sample proportion will be considered as “extreme” as the observed one if it’s at least as far from 0.5 as 0.523 is.  In other words, the p-value is the probability of obtaining a sample proportion of 0.477 or less, or 0.523 or more, if the null hypothesis were true.  The applet reveals that 4775 of the 10,000 simulated sample proportions are that extreme, as shown in red below:

The approximate p-value from the simulation analysis is therefore 0.4775.  This p-value is nowhere near being less than 0.05 or 0.10 or any reasonable significance level, so we conclude that the sample data do not provide sufficient evidence to reject the null hypothesis that children are equally likely to choose the candy or toy.
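If you want to reproduce this simulation in code rather than with the applet, a minimal Python sketch might look like this (results will vary slightly from run to run):

```python
import numpy as np

rng = np.random.default_rng()

n, reps = 283, 10_000

# Simulate 10,000 samples of 283 "coin flips," assuming the null hypothesis pi = 0.5.
sim_counts = rng.binomial(n, 0.5, size=reps)

# Two-sided p-value: the proportion of simulated samples at least as extreme as the
# observed result of 148 candy / 135 toy, i.e. counts of 135 or fewer, or 148 or more.
p_value = np.mean((sim_counts <= 135) | (sim_counts >= 148))
print(round(p_value, 4))    # roughly 0.47-0.48
```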


When I first asked about how to investigate the research question, you might have been thinking that we could use a normal approximation, also known as a one-proportion z-test.  Let’s do that now: Apply a one-proportion z-test to these data, after checking the sample size condition.  The condition is certainly satisfied: 283(.5) = 141.5 is far larger than 10.  The z-test statistic can be calculated as:
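$$z = \frac{0.523 - 0.5}{\sqrt{\dfrac{(0.5)(0.5)}{283}}} \approx 0.773$$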

This z-score tells us that the observed sample proportion who selected candy (0.523) is less than one standard deviation away from the hypothesized value of 0.5.  The two-sided p-value from the normal distribution turns out to be ≈ 2×0.2198 = 0.4396.  Again, of course, the p-value is not small and so we conclude that the sample data do not provide sufficient evidence to reject the null hypothesis of equal likeliness.


But look at the two p-values we have generated: 0.4775 and 0.4396.  Sure, they’re in the same ballpark, but they’re noticeably different.  On a percentage basis, they differ by 8-9%, which is non-trivial.  Which p-value is correct?  This one is easy: Neither is correct!  These are both approximations.

Finally, we are back to the key question of the day, alluded to in the title of this post and posed in the first paragraph: Which do you trust more: the (approximate) p-value based on simulation, or the (approximate) p-value based on the normal distribution?  Now that we have a specific example with two competing p-values to compare, please think some more about your answer before you read on.


Many students (and instructors) place more trust in the normal approximation.  One reason for this is that the normal distribution is based on a complicated formula and sophisticated mathematics.  Take a look at the probability density function* of a normal distribution:
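$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$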

* Oh dear, I must admit that in this expression the symbol π does represent the mathematical constant 3.14159265….

How could such a fancy-looking formula possibly go wrong?  More to the point, how could this sophisticated mathematical expression possibly do worse than simulation, which amounts to just flipping a coin a whole bunch of times?

An even more persuasive argument for trusting the normal approximation, in many students’ minds, is that everyone gets the same answer if they perform the normal-based method correctly.  But different people get different answers from a simulation analysis.  Even a single person gets different answers if they conduct a simulation analysis a second time.  This lack of exact replicability feels untrustworthy, doesn’t it?


So, how can we figure out which approximation is better?  Well, what does “better” mean here?  It means closer to the actual, exact, correct p-value.  Can we calculate that exact, correct p-value for this Halloween example? If so, how? Yes, by using the binomial distribution.

If we let X represent a binomial distribution with parameters n = 283 and π = 0.5, the exact p-value is calculated as Pr(X ≤ 135) + Pr(X ≥ 148)*.  This probability turns out (to four decimal places) to be 0.4757.  This is the exact p-value, to which we can compare the approximate p-values.

* Notice that the values 135 and 148 are simply the observed number who selected toy and candy, respectively, in the sample.
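This exact calculation is easy to verify with software.  Here is a short Python check using scipy’s binomial distribution (the code is my own sketch):

```python
from scipy.stats import binom

n = 283
# Pr(X <= 135) + Pr(X >= 148) under the null hypothesis pi = 0.5.
# Note that sf(147) = Pr(X > 147) = Pr(X >= 148).
exact_p = binom.cdf(135, n, 0.5) + binom.sf(147, n, 0.5)
print(round(exact_p, 4))    # 0.4757
```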

So, which approximation method does better?  Simulation-based wins in a landslide over normal-based:

This is not a fluke.  With 10,000 repetitions, it’s not surprising that the simulation-based p-value* came so close to the exact binomial p-value.  The real question is why the normal approximation did so poorly, especially in this example where the validity conditions were easily satisfied, thanks to a large sample size of 283 and a population proportion of 0.5.

* I promise that I only ran the simulation analysis once; I did not go searching for a p-value close to the exact one. We could also calculate a rough margin-of-error for the simulation-based p-value to be about 1/sqrt(10,000) ≈ .01.


The problem with the normal approximation, and a method for improving it, go beyond the scope of a typical Stat 101 course, but I do present this in courses for mathematically inclined students.  First think about it: Why did the normal approximation do somewhat poorly here, and how might you improve the normal approximation?

The problem lies in approximating a discrete probability distribution (binomial) with a continuous one (normal).  The exact binomial probability is the sum of the heights of the red segments in the graph below, whereas the normal approximation calculates the area under the normal curve to the left of 135 and the right of 148:

The normal approximation can be improved with a continuity correction, which means using 135.5 and 147.5, rather than 135 and 148, as the endpoints for the area under the curve.  This small adjustment leads to including a bit more of the area under the normal curve.  The continuity-corrected z-score becomes 0.713 (compared to 0.773 without the correction) and the two-sided normal-based p-value (to four decimal places) becomes 0.4756, which differs from the exact binomial p-value by only 0.0001.  This seemingly minor continuity correction greatly improves the normal approximation to the binomial distribution.
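A short Python sketch (again my own, not part of the original analysis) reproduces both versions of the normal approximation:

```python
from math import sqrt
from scipy.stats import norm

n, pi0 = 283, 0.5
se = sqrt(pi0 * (1 - pi0) / n)      # standard error under the null hypothesis

# Normal approximation without the continuity correction (observed count 148 of 283):
z_plain = (148 / n - pi0) / se      # about 0.773
p_plain = 2 * norm.sf(z_plain)      # about 0.4396

# With the continuity correction, use 147.5 rather than 148:
z_cc = (147.5 / n - pi0) / se       # about 0.713
p_cc = 2 * norm.sf(z_cc)            # about 0.4756, very close to the exact binomial p-value

print(round(p_plain, 4), round(p_cc, 4))
```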


My take-away message is not that normal-based methods are bad, and also not that we should teach the continuity correction to introductory students.  My point is that simulation-based inference is good!  I think many teachers regard simulation as an effective tool for studying concepts such as sampling distributions and for justifying the use of normal approximations.  I agree with this use of simulation wholeheartedly, as far as it goes.  But we can help our students to go further, recognizing that simulation-based inference is very valuable (and trustworthy!) in its own right.