
#83 Better, not necessarily good

In this post I will describe a five-question, auto-graded quiz that I recently gave my students.  I modified a longer assignment that I have used in the past to create this quiz.  I will describe my thinking behind each of these questions, and I will also mention some questions that I decided not to ask.  My favorite question on this quiz is the last one, because it tries to assess whether students can take a step back and think about the big picture rather than apply a particular procedure.  As always, questions that I pose to students appear in italics.


The quiz begins with this background information:

Helping smokers to quit is a very important and challenging public health goal.  In a study of the effectiveness of a nicotine lozenge, smokers who wanted to quit were randomly assigned to one of two groups: one group received nicotine lozenges and the other group received placebo lozenges.  At the end of the 52-week study, 17.9% of the 459 subjects in the nicotine group had successfully abstained from smoking, compared to 9.6% of the 458 subjects in the placebo group. 

I realize that I have not presented a question yet, but I want to interrupt myself already.  I usually present students with a 2×2 table of counts for questions like this about comparing proportions between two groups. But this time I provided the sample sizes and sample proportions (presented as percentages) instead.  I think it’s worthwhile to present information in different ways at different times, in the hope that students learn to recognize the underlying structure without becoming too attached to a particular presentation. 

To help with recognizing the underlying structure, I repeatedly advise students to start with identifying the observational units and variables*.  In this case the observational units are smokers who want to quit, the explanatory variable is whether the person receives a nicotine lozenge or placebo lozenge, and the response variable is whether or not the person successfully abstains from smoking for a year.  Both variables are categorical and binary, so that lets students know that they are analyzing a scenario that involves comparing two proportions.

* See post #11, titled Repeat after me, here.


1. How many times more likely was a smoker who received the nicotine lozenge to quit, as compared to a smoker who received a placebo lozenge?  Report your answer with two decimal places of accuracy.

This question asks for the ratio of the success proportions, also known as the relative risk.  The answer is: 0.179 / 0.096 ≈ 1.86.  I specified an error tolerance of 0.02, so students could receive full credit even with some rounding error.

I often ask this question in an alternate form, asking students to fill in the blank in the following sentence: Those who received the nicotine lozenge were ___ % more likely to quit, compared to those who received a placebo lozenge.  The answer is 86%, which can be determined by subtracting 1 from the ratio and then multiplying by 100%.
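For readers who like to check this arithmetic with software, here is a minimal sketch in R, using the sample proportions given above:

    p.nicotine <- 0.179                    # sample proportion of quitters, nicotine group
    p.placebo <- 0.096                     # sample proportion of quitters, placebo group
    p.nicotine / p.placebo                 # relative risk, approximately 1.86
    100 * (p.nicotine / p.placebo - 1)     # percentage more likely, approximately 86%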

I realize that working with the difference in success proportions is more conventional than calculating the ratio, and the next two questions will ask about this difference.  But I think working with percentage differences is an important and challenging skill for students, so I ask this type of question often*.

* See post #28, titled A pervasive pet peeve, here.


2. Calculate the value of the z-test statistic for comparing the success proportions.  Report your answer with two decimal places of accuracy.

Students had the option of calculating this by hand or using software, such as the applet (here) that I have used in class.  If they are calculating this by hand, they need to use:
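In the usual notation, where p̂1 and p̂2 denote the two sample proportions of successes, n1 and n2 the two sample sizes, and p̂c the combined proportion of successes, this statistic has the form:

    z = (p̂1 – p̂2) / sqrt[ p̂c (1 – p̂c) (1/n1 + 1/n2) ]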

The only component of this calculation that has not been provided is the value of the combined proportion of successes for the two groups.  To determine this, students could first produce the 2×2 table of counts:
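Working from the stated percentages, 17.9% of 459 is about 82 quitters in the nicotine group, and 9.6% of 458 is about 44 quitters in the placebo group, so the table looks approximately like this:

                          Quit     Did not quit     Total
    Nicotine lozenge        82          377           459
    Placebo lozenge         44          414           458
    Total                  126          791           917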

When some students asked about how to determine the combined sample proportion of successes, I mentioned that they could first determine the 2×2 table of counts, but I also pointed them to a shortcut by asking: In what situation does the combined proportion equal the average of the two group proportions?  I wanted them to realize that using the average works when the two groups have the same sample size.  In this case the two sample sizes are so similar that the average gives a very close approximation for the combined proportion of successes. 

The test statistic turns out to be z = 3.63.  I allowed an error tolerance of 0.05 to account for rounding errors.
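Here is a minimal sketch of this calculation in R, using the approximate counts implied by the stated percentages; small rounding differences explain why a by-hand answer might differ slightly:

    x <- c(82, 44)                    # approximate numbers of successful quitters
    n <- c(459, 458)                  # group sample sizes
    p.hat <- x / n                    # group proportions of successes
    p.combined <- sum(x) / sum(n)     # combined proportion of successes
    se <- sqrt(p.combined * (1 - p.combined) * (1/n[1] + 1/n[2]))
    (p.hat[1] - p.hat[2]) / se        # z-statistic, approximately 3.63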

I often ask students to interpret the value of a z-test statistic.  Here they could say that the success proportion in the nicotine group is 3.63 standard errors larger than the success proportion in the placebo group.  But with only five questions on my quizzes, I omitted that question this time.


3. What kinds of values are in a 95% confidence interval for the difference in population proportions (nicotine minus placebo) who would successfully abstain from smoking for a year?  [Options: A. Both negative and positive values; B. Only negative values; C. Only positive values]

Students could answer this question by calculating the confidence interval, either by hand or with software.  But they could also answer this based on the z-statistic.  A z-statistic of 3.63 is quite far out in the tail of a standard normal distribution, so the p-value is quite close to zero, so the null hypothesis should be resoundingly rejected at the 0.05 significance level.  That means that a 95% confidence interval for the difference in population proportions will not include the value zero.  Because I stipulated the order of subtraction to be (nicotine minus placebo), and because the nicotine group had a larger proportion of successes than the placebo group, this 95% confidence interval will include only positive values.
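For those who do want to calculate, here is a rough sketch of the (Wald-style) interval in R, again from the summary values given above:

    diff <- 0.179 - 0.096
    se.diff <- sqrt(0.179 * 0.821 / 459 + 0.096 * 0.904 / 458)
    diff + c(-1, 1) * 1.96 * se.diff     # roughly 0.04 to 0.13, all positive values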

I was tempted to ask about a confidence interval for the relative risk, to see whether students would realize that such a confidence interval would include only values larger than one.  In hindsight I wish that I had turned question #3 into a two-part matching-style question, with one part asking about a confidence interval for the difference in population proportions and the other asking about a confidence interval for the ratio of population proportions.


4. What conclusion would you draw from this study?  [Options: A. Not enough evidence to conclude that the nicotine lozenge is more effective than the placebo lozenge; B. Very strong evidence that the nicotine lozenge is equally effective as the placebo lozenge; C. Very strong evidence that the nicotine lozenge is more effective than the placebo lozenge]

This is the bottom-line question about this study, so I did not consider leaving this out.  The correct answer is C. I have been including a version of option B on many questions, to see whether students might be tempted into (inappropriately) accepting the null hypothesis for cases with a larger p-value.


Finally, we arrive at the question that I’ve been intending to highlight all along:

5. If we consider the smokers in this study to be representative of the population of all smokers who want to quit, what would you conclude about the percentage of smokers who would successfully abstain from smoking for one year if they were to take nicotine lozenges?  [Options: A. Substantially greater than 50%, B. Substantially less than 50%, C. Very close to 50%]

It’s very tempting to conclude that more than half of all smokers would successfully quit with the nicotine lozenge.  After all, the previous questions have led to concluding that the nicotine lozenge is substantially better than the placebo.  The z-statistic is very large, so the p-value is very small, so the data provide very strong evidence that the nicotine lozenge is better than the placebo. 

It’s also reasonable to conclude that the nicotine lozenge is a good bit better than the placebo.  Our point estimate from question #1 is that smokers are 1.86 times more likely to quit if they use the nicotine lozenge rather than the placebo.  A 95% confidence interval for the population relative risk extends from 1.32 to 2.62, well above the value 1.
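One common way to obtain such an interval is to work on the log scale; here is a rough R sketch using the approximate counts from earlier, which gives endpoints close to (though not exactly matching) those reported above:

    x <- c(82, 44)                                # approximate numbers of successful quitters
    n <- c(459, 458)                              # group sample sizes
    log.rr <- log((x[1]/n[1]) / (x[2]/n[2]))      # log of the sample relative risk
    se.log.rr <- sqrt(1/x[1] - 1/n[1] + 1/x[2] - 1/n[2])
    exp(log.rr + c(-1, 1) * 1.96 * se.log.rr)     # roughly 1.3 to 2.6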

Several students asked me for help with this question.  My advice to them was:

  1. Don’t calculate anything.
  2. Re-read the sentence about how effective the nicotine lozenge was for the smokers in this study. 
  3. Re-read the question and options. 
  4. Think.

I wanted students to realize that only 17.9% of the smokers who received the nicotine lozenge succeeded in quitting for a year.  That’s a lot less than 50%.  A 95% confidence interval runs from 14.4% to 21.4%, which is still a lot less than 50%.  So, the take-away message is: Being a lot better than a placebo does not necessarily mean being good in an absolute sense.  This study indicates that smokers who want to quit have a better chance of success with a nicotine lozenge than with a placebo.  But those smokers should not be fooled into thinking that nicotine lozenges move the odds in favor of their successfully quitting. Unfortunately, even with nicotine lozenges, the odds appear to be substantially against successful quitting.
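That one-sample interval can be confirmed with a quick calculation (a standard one-proportion z-interval, so the endpoints are approximate):

    p.hat <- 0.179                            # success proportion in the nicotine group
    se <- sqrt(p.hat * (1 - p.hat) / 459)
    p.hat + c(-1, 1) * 1.96 * se              # roughly 0.144 to 0.214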

When I used this context on longer, non-auto-graded assignments in the past, I asked students to determine the confidence interval for the population proportion who would successfully quit if given a nicotine lozenge.  I think it’s worthwhile to ask students to conduct one-sample inference as part of an assignment that’s primarily about two-sample inference.  But I like question #5 on this quiz version better, because this just asks students to think without calculating anything.  I also like that this quiz version presents the sample percentages of success rather than a table of counts at the outset, so students do not even need to calculate the relevant percentage (17.9%) for themselves.


There you have it: a five-question, auto-graded quiz that I hope leads students to think and not just apply a procedure mindlessly. 

Speaking of thinking, may I be so bold as to ask what you think of this quiz?  Is it better than asking nothing at all?  I hope you do not hesitate to answer yes.  Ah, but of course you can truthfully say that this quiz is better than nothing without necessarily thinking that the quiz is good.  I’ll be disappointed if you think this, but I’ll also appreciate that you not only grasped my point but also seized an immediate opportunity to apply it.

P.S. The article about the nicotine lozenge study can be found here.

#82 Power, part 3

This post continues and completes my discussion of introducing students to the concept of power.  Let me remind you of the scenario that I presented in the first post of this series (here):

Suppose that Tamika is a basketball player whose probability of successfully making a free throw has been 0.6.  During one off-season, she works hard to improve her probability of success.  Of course, her coach wants to see evidence of her improvement, so he asks her to shoot some free throws.  If Tamika really has improved, how likely is she to convince the coach that she has improved?

The first post in this series described using an applet (here) to conduct simulation analyses to lead students to the concepts of rejection region and power, and then to consider factors that affect power.  In this post I will make three points about teaching these concepts in courses for mathematically inclined students, such as those majoring in statistics or mathematics or engineering or economics.


1. Start with simulation analyses, even for mathematically inclined students.

I suspect that some statistics teachers regard simulation as a valuable tool for students who are uncomfortable with math but not necessarily for mathematically inclined students.  I disagree: I think simulation can be a very enlightening and powerful tool even with students who enjoy and excel at the mathematical aspects of statistics.  I recommend introducing students to the concept of power through simulation analyses, regardless of how well prepared or comfortable the students are with mathematics.

You could ask students to write their own code to conduct these simulations, but I typically stick with the applet because it’s so convenient and produces nice visuals such as:


2. Proceed to ask mathematically inclined students to perform probability calculations to confirm what the simulations reveal.

Tamika’s example provides a great opportunity for students to practice working with the binomial distribution:

  • a) Let the random variable X represent the number of shots that Tamika would successfully make out of 25 shots, assuming that she has not improved.  What probability distribution would X have?
  • b) Determine the smallest value of k for which Pr(X ≥ k) ≤ 0.05.
  • c) Does this agree with your finding from the simulation analysis?
  • d) Explain what this number has to do with Tamika’s efforts to convince the coach that she has improved.

The random variable X would have a binomial distribution with n = 25 and p = 0.6.  To answer part (b), students could work with a cumulative distribution function by realizing that Pr(X ≥ k) = 1 – Pr(X ≤ k – 1) in this case.  Then they can use software or a graphing calculator to determine that the smallest value of k that satisfies this criterion is k = 20, for which Pr(X ≥ 20) ≈ 0.0294.  This means that Tamika must successfully make 20 or more of the 25 shots to convince her coach that she has improved, when the coach gives her 25 shots and uses 0.05 as his standard for being convinced.

Instead of using the cumulative distribution function, students could use the inverse cumulative distribution function built into many software programs.  For example, this command in R is:
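In R, the inverse cumulative distribution function for the binomial distribution is qbinom, so the command looks like:

    qbinom(0.95, size = 25, prob = 0.6)     # returns 19, which equals k - 1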

Some students get tripped up by the need for the first input to be 0.95 rather than 0.05.  Students also need to be careful to realize that the output value of 19 = k – 1, so the value of k = 20.  As some students struggle with this, I remind them of two things: First, they should return to their simulation results to make sure that their binomial calculations agree.  Second, when they’re not sure whether 19 or 20 is the answer they’re looking for, they can check that by calculating Pr(X ≥ 19) and Pr(X ≥ 20) to see which one meets the criterion.

  • e) Let the random variable Y represent the number of shots that Tamika would successfully make if she has improved her success probability to 0.7.  What probability distribution would Y have?
  • f) Determine Pr(Y ≥ k) for the value of k that you determined in part (b).
  • g) Does this agree with your finding from the simulation analysis?
  • h) Explain what this number has to do with Tamika’s efforts to convince the coach that she has improved.

The random variable Y would have a binomial distribution with n = 25 and p = 0.7.  Once they realize this, students can use software to calculate Pr(Y ≥ 20) ≈ 0.1935.  For example, the R command to calculate this is:
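In R, one way to compute this from the cumulative distribution function is:

    1 - pbinom(19, size = 25, prob = 0.7)     # Pr(Y >= 20), approximately 0.1935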

This probability is very close to the approximation from the simulation analysis.  Tamika has slightly less than a 20% chance of convincing her coach that she has improved, if she is given a sample of 25 shots, the coach uses a significance level of 0.05, and her improved probability of success is 0.7 rather than 0.6.

Students can then use software to produce exact power calculations, using the binomial probability distribution, for different values of the sample size, significance level, and improved success probability.

A drawback of using software such as R or Excel to calculate these probabilities is that they do not automatically provide a visual representation of the probability distribution.  The applet that I used for the simulation analyses does have an option to calculate and display exact binomial probabilities:


3. Ask mathematically inclined students to write code to produce graphs of power as a function of sample size, of significance level, and of the alternative value for the parameter.

Recall that the pair of R commands for calculating the rejection region and power for Tamika’s first situation is:
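Putting the two calculations above together, that pair looks like:

    qbinom(0.95, size = 25, prob = 0.6)       # returns 19, so the rejection region is 20 or more successes
    1 - pbinom(19, size = 25, prob = 0.7)     # power, approximately 0.1935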

Then I like to ask mathematically inclined students: Re-write this power calculation to use just one line of code.  For students who need a hint: Where did the 19 value in the second line come from?  This leads to:
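In a single nested call:

    1 - pbinom(qbinom(0.95, size = 25, prob = 0.6), size = 25, prob = 0.7)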

A follow-up is: How can you alter this command to calculate power as a function of sample size, for values from n = 1 through n = 300?  The key is to replace the value 25 with a vector (call it n) containing integer values from 1 through 300.  The resulting graph (with α = 0.05 and palt = 0.7) is:
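A sketch of one way to do this (the vector name and plotting details are my choices):

    n <- 1:300                                            # candidate sample sizes
    power <- 1 - pbinom(qbinom(0.95, n, 0.6), n, 0.7)     # power for each sample size
    plot(n, power, type = "l", xlab = "Sample size", ylab = "Power")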

Does this graph behave as you expected?  Mostly yes, but there’s an oddity.  This graph shows that power generally increases as sample size increases, as we expected.  But I say generally because there are lots of short-run exceptions, because of the discreteness of the binomial distribution.  The pattern is more noticeable if we restrict our attention to sample sizes from n = 1 through 100:

Students can then proceed to produce and describe graphs of power as a function of significance level and of the improved probability value (for n = 25 in both graphs, palt = 0.7 on the left, and α = 0.05 on the right), as shown here:
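Sketches of those two calculations (the ranges of values and plotting details are my choices):

    # Power as a function of significance level, with n = 25 and improved probability 0.7
    alpha <- seq(0.01, 0.25, by = 0.005)
    power.alpha <- 1 - pbinom(qbinom(1 - alpha, 25, 0.6), 25, 0.7)
    plot(alpha, power.alpha, type = "s", xlab = "Significance level", ylab = "Power")

    # Power as a function of the improved success probability, with n = 25 and significance level 0.05
    p.alt <- seq(0.6, 1, by = 0.01)
    power.alt <- 1 - pbinom(qbinom(0.95, 25, 0.6), 25, p.alt)
    plot(p.alt, power.alt, type = "l", xlab = "Improved probability of success", ylab = "Power")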

Do these graphs behave as you expected?  Power increases as the significance level increases, as expected, but this is a step function due to the discreteness.  Power does increase as the improved probability value increases, as expected.


The concept of power is a challenging one for many students to grasp.  I recommend starting with a simple scenario involving a single proportion, such as Tamika trying to convince her coach of her improvement as a free throw shooter.  I think simulation analyses and visualizations can help students to understand the key ideas*.  With mathematically inclined students, I suggest following up the simulations with probability calculations and simple coding as described in this post. My hope is that these activities deepen their understanding of power and also their facility with probability distributions.

* As long as the simulation analyses are accompanied by asking good questions!

#81 Power, part 2

In last week’s post (here), I presented an extended series of questions that used a simulation analysis to introduce students to the concept of statistical power.  The simulation analysis used an applet (here) to produce graphs like this:

In the context of last week’s example, this pair of graphs shows that there’s about a 63% chance that Tamika will perform well enough to convince her coach that she has improved, based on a sample size of 100 shots, a significance level of 0.05, and an improvement in her probability of success from 0.6 to 0.7.

Needless to say, I believe that the questions I presented can be helpful for developing students’ understanding of the concept of power.  I hasten* to add that this activity is far from fool-proof.  In this post, I will discuss a few common difficulties and misunderstandings.

* Well, it took me a week, so perhaps I have not really hastened to add this, but I like the sound of the word.


A big part of my point in last week’s post was that we can help students to focus on the concept of power by postponing the use of statistical terminology.  I waited until we had completed the simulation activity before defining the terms power and also type I error and type II error.  Of course, this required me to move beyond simply talking about whether Tamika had improved or not, and whether her coach was convinced or not.  At this point I mentioned terms such as null hypothesis and alternative hypothesis, rejecting the null hypothesis and failing to reject the null hypothesis.  Then I asked my students to state the null and alternative hypotheses that Tamika’s coach was testing, both in words and in terms of a parameter. 

Most students seemed to realize quickly that the null hypothesis was that Tamika had not improved, and the alternative was that she had improved.  But they struggled with expressing these hypotheses in terms of a parameter.  To point them in the right direction, I asked whether the parameter is a proportion or a mean, but this did not seem to help.  I took a conceptual step back and asked whether the variable is categorical or numerical.  This time several students answered quickly but incorrectly in the zoom chat that the variable was numerical. 

This is a very understandable mistake, because graphs such as the ones above display the distribution of a numerical variable.  But I pointed out that the variable for Tamika is whether or not she successfully makes a shot, which is categorical.  The parameter is therefore the long-run proportion of shots that she would make, which my students know to represent with the symbol π.  The hypotheses are therefore H0: π = 0.6 (no improvement) versus Ha: π > 0.6 (improvement).

This difficulty reveals a common problem when using simulation to introduce students to concepts of statistical inference.  To understand what the simulation analysis and resulting graphs reveal, it’s crucial to realize that such graphs are not displaying the results of a single sample, which is what we would observe in practice.  Rather, the graphs are showing results for a large number of made-up samples, under certain assumptions, in order to investigate how the procedure would perform in the long run.  This is a big conceptual leap.  I strongly recommend using physical devices such as coins and cards for students’ first encounters with simulation (see posts #12 and #27, here and here), in order to help them with recognizing this step and taking it gradually.  When you rely on technology to conduct simulations later, students must follow this step in their minds to make sense of the results.


As I presented the activity for my students via zoom, I also encouraged them to use the applet to carry out simulation analyses themselves.  I should not have been surprised by the most common question I received from my students, but I was surprised at the time.  Several students expressed concern about getting slightly different values than I did.  For example, they might have gotten 0.6271 or 0.6343 rather than the 0.6317 that I obtained in the graphs above.  I responded that this was a good question but nothing to worry about.  Those differences, I said, were due to the random nature of simulation and therefore to be expected.  I added that using a large number of repetitions for the simulation analysis, such as 10,000, should ensure that we all obtain approximately the same value.

Some students followed up by asking how such responses will be graded on assignments and exams.  I had been thinking that some students resist a simulation-based approach because they are uncomfortable with approximate answers rather than a single correct answer.  But this question made me realize that some students may be skeptical of simulation analyses not for intellectual or psychological reasons but rather out of concern about their grades.

I tried to assure my students that with simulation analyses, reasonable values in the right ballpark would earn full credit, for both open-ended and auto-graded responses.  I should have also thought to respond that many questions will instead ask about the simulation process and the interpretation of results.


My pledge that students would receive full credit for reasonable approximations was called into question less than half an hour after class ended.  Here are the questions that I asked in the (auto-graded) follow-up quiz:

Suppose that I have regularly played Solitaire on my computer with a 20% chance of winning any one game.  But I have been trying hard lately to improve my probability of winning, and now I will play a series of (independent) games to gather data for testing whether I have truly improved.

1. What is the alternative hypothesis to be tested?  [Options: That I have improved; That I have not improved; That I have doubled my probability of winning a game]

2. Suppose that I have not improved, but the data provide enough evidence to conclude that I have improved.  What type of error would this represent?  [Options: Type I error; Type II error; Type III error; Standard error]

Now suppose that I really have improved, and my success probability is now 25% rather than 20%.  Also suppose that I plan to play 40 independent games and that my test will use a significance level of 0.05.  Use the Power Simulation applet to conduct a simulation analysis of this situation.

3. What is the rejection region of the test?  [Options: Winning 13 or more times in the 40 games; Winning 20 or more times in the 40 games; Winning 8 or more times in the 40 games; Winning 10 or more times in the 40 games]

4. Which of the following comes closest to the probability that these 40 games will provide convincing evidence of my improvement?  [Options: 0.18; 0.25; 0.40; 0.75; 0.99]

5. Continue to assume that my success probability is now 25% rather than 20% and that the test uses a significance level of 0.05.  About how many games would I have to play in order to have a 50% chance that the games will provide convincing evidence of my improvement?  Enter your answer as an integer.  (Hint: Use the applet, and feel free to use trial-and-error.) 

For questions #3 and #4, I spaced the options far enough apart to leave no doubt about the correct answers, as long as the student conducted the simulation correctly and used a reasonably large number of repetitions.  Question #5 is the most interesting and problematic one.  Asking students to determine a sample size that would achieve a particular value of power went a bit beyond what we had done in the class activity.  Students were supposed to realize that increasing sample size generally increases power, and I gave them the hint to feel free to use trial-and-error.  I thought I had allowed for a reasonably large interval of answers to receive full (auto-graded) credit, but a student came to my virtual office hours to ask why her answer had not received credit.  She showed me that her sample size did indeed produce a power value close to 0.5, so I expanded the interval of values to receive full credit*. I also let her know that I greatly appreciate students who begin assignments early and draw concerns to my attention quickly.

* The discreteness of the binomial distribution is more of an issue here than variability of simulation results.  I will discuss this further in part 3 of this series, but for now I’ll show a graph of power (calculated from the binomial distribution) as a function of sample size for the values that I decided to accept as reasonable.  This graph shows that power does generally increase with sample size, but the discreteness here makes the function more interesting and non-monotonic:
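For the curious, here is a rough R sketch of how such a graph can be produced from the binomial distribution; the range of sample sizes shown is my own choice:

    n <- 100:300                                            # candidate numbers of games
    power <- 1 - pbinom(qbinom(0.95, n, 0.20), n, 0.25)     # power if the improvement is to 25%
    plot(n, power, type = "l", xlab = "Number of games", ylab = "Power")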


I believe that the simulation activity that I presented last week is effective for introducing students to the concept of power.  But I also acknowledge that this is a challenging topic, so in this post I have tried to point out some difficulties that students encounter.

#80 Power, part 1

I wish I had a better title for this post.  This one-word title reminds me of my two-part post about confounding variables, which I simply titled Confounding (see posts #43 and #44, here and here).  I tried to be clever with that title by arguing that the concept of confounding is one of the most confounding topics that students encounter in introductory statistics.  I suppose I could argue that the concept of power is one of the most powerful topics that students encounter, but my point is really that power is another topic that students find to be especially confounding.  I will abandon my search for cleverness and stick with this boring (but not misleading!) title.

I think we can help students to understand the concept of power by eliminating unnecessary terminology and calculations for our first pass at the topic.  We don’t need to mention null and alternative hypotheses, or rejection regions, or Type I and Type II errors, or p-values, or binomial or normal distributions, or expected value or standard deviation or z-score.  Don’t get me wrong: We’ll use most of those ideas, but we don’t need to let the terminology get in the way.

Instead we can present students with a scenario and an overarching question that you and I recognize as a question of power.  Then we can lead students to answer that big question by asking a series of smaller questions.  Questions that I pose to students appear in italics below.


Here’s the scenario that I use with my students: Suppose that Tamika is a basketball player whose probability of successfully making a free throw has been 0.6.  During one off-season, she works hard to improve her probability of success.  Of course, her coach wants to see evidence of her improvement, so he asks her to shoot some free throws. 

Here’s the overarching question: If Tamika really has improved, how likely is she to convince the coach that she has improved?  The other big question is: What factors affect how likely she is to convince the coach that she has improved?

I try not to over-do sports examples with my students, but I think the context here is very helpful and easy to follow, even for students who are not sports fans.

You won’t be surprised to see that we’ll use simulation as our tool to address these questions.


Let’s say that the coach gives Tamika 25 shots with which to demonstrate her improvement.

a) Suppose that she successfully makes 23 of the 25 shots.  Would you be reasonably convinced that she has improved?  Why or why not?

b) What if she makes 16 of the 25 shots – would you be reasonably convinced that she has improved?  Why or why not?

Most students realize that 60% of 25 is 15*, so both 16 and 23 are more successes than we would expect (for the long-run average) if she had not improved.  Their intuition suggests that 23 successes would provide very strong evidence of improvement, because it seems unlikely that a 60% shooter would achieve that many successes.  On the other hand, 16 successes does not provide strong evidence of improvement, because it seems that a 60% shooter could easily get a bit lucky and obtain 16 successes.

* You’re welcome to call this the expected value if you’d like.

c) What does your intuition suggest about how many shots Tamika would have to make successfully in order to be convincing? 

When I asked my students to type their answer to this question into the zoom chat during class a few days ago, nearly every student typed 20.  I said that this seemed reasonable and that we would proceed to use simulation to investigate this question a bit more carefully.  We used an applet (here) to conduct the simulation analysis.  The applet inputs required are:

d) Which input values can you specify already?

The hypothesized probability of success is 0.6, and the sample size is 25.  Later we’ll assume that Tamika has improved to have a 70% chance of success, so we’ll enter 0.7 for the alternative probability of success.  I like to start with simulating just one sample at a time, so we’ll enter 1 for number of samples at first; later we’ll enter a large number such as 10,000 for the number of samples.

e) Click on “draw samples” five times, using 1 for the number of samples each time.  Did each of the simulated samples produce the same number of successful shots?

Part e) would be easy to skip, but I think it’s important.  This question forces students to acknowledge randomness, or sampling variability.  I don’t think any students struggle to answer this correctly, but I think it’s worth drawing their attention to this point.

f) Now enter 9995 for the number of samples, and click on “draw samples” to produce a total of 10,000 simulated sample results.  Describe the resulting distribution for the number of successes.  Comment on shape, center, and variability.

Here are some typical results:

My students are quick to say that the shape of this distribution is symmetric, unimodal, normal-ish.  The center is near 15, which is what we expected because 60% of 25 is 15.  There’s a good bit of variability here: The simulated results show that Tamika sometimes made as few as 7 or 8 shots out of 25, and she also made as many as 23 or 24 shots out of 25.

g) Has this simulation analysis assumed that Tamika has improved, or that Tamika has not improved?

This is also a key question that is easy for students to miss: This simulation analysis has assumed that Tamika has not improved*.  We use the distribution of the number of successes, assuming that she has not improved, to decide how many successes she needs to provide convincing evidence of improvement.  I try to reinforce this point with the next question:

* You’re welcome to call this the null hypothesis.

h) Based on these simulation results, do you feel justified in your earlier answers about whether 23 successes, or 16 successes, would provide convincing evidence of improvement?  Explain.

Students who thought that 23 successes in 25 attempts provides very strong evidence of improvement should feel justified, because this simulation reveals that such an extreme result would happen only about 4 times in 10,000* (see graph on the left).  Similarly, students were correct to believe that 16 successes does not provide much evidence of improvement, because it’s not at all unlikely (better than a 40% chance*) for a 60% shooter to do that well (or better) by random chance (see graph on the right).

* You’re welcome to refer to these percentages as approximate p-values.  See post #12 (here) for an introduction to simulation-based inference.
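For anyone who prefers code to the applet, here is a rough R analogue of this simulation; the results will vary slightly from run to run:

    sims <- rbinom(10000, size = 25, prob = 0.6)     # 10,000 simulated samples, assuming no improvement
    mean(sims >= 23)                                 # approximately 0.0004, about 4 in 10,000
    mean(sims >= 16)                                 # approximately 0.42, better than a 40% chance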


Now we come to one of the harder questions:

i) Suppose that the coach decides on the following criterion for his decision rule: He’ll decide that Tamika has improved if her number of successes is large enough that such an extreme result would happen less than 5% of the time with a 60% shooter.  According to this rule, how many shots does Tamika need to make successfully to convince her coach?

I encourage students to answer this at first with trial-and-error.  Enter 17, and then 18, and so on into the “rejection region” box until you find the smallest number for which less than 5% of the simulated samples produce such a large number (or more) of successes.  The answer turns out to be that Tamika needs to make 20 or more of the 25 shots* to be convincing, as shown here:

* You’re welcome to call this the rejection region of the test, especially as the applet uses that term.

I was quick to point out to my students how good their intuition was.  As I mentioned earlier, nearly all of my students who responded in the zoom chat predicted that Tamika would need to make 20 shots to be convincing.

Now, finally, we address the big picture question:

j) Make a guess for how likely Tamika is to make 20 or more shots successfully out of 25 attempts, if she has improved to a 0.7 probability of successfully making a single shot.

I don’t really care how well students guess here.  My point is to remind them of the big question, the reason we’re going through all of this.  Next we use the applet to conduct another simulation to answer this question:

k) Check the “show alternative” box in the applet, which displays the distribution of number of successes, assuming that Tamika has improved to a 0.7 probability of success.  Do you see much overlap in the two distributions?  Is this good news or bad news for Tamika?  Explain.

There is considerable overlap in the two distributions, as shown here:

l) According to the applet’s simulation results, in what percentage of the 10,000 simulated samples does Tamika, with a 70% chance of making a single shot, do well enough to convince the coach of her improvement by successfully making 20 or more shots?  Would you say that Tamika has a good chance of demonstrating her improvement in this case?

Unfortunately for Tamika, she does not have a good chance of demonstrating her improvement.  In my simulation result shown here, she only does so about 19.5% of the time:

Here’s where we introduce the term of the day: We have approximated the power of this test.  Power in this case represents the probability that Tamika convinces her coach that she has improved, when she truly has improved.
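For those following along in code rather than the applet, here is a rough analogue of this power approximation, again with results that vary from run to run:

    sims.alt <- rbinom(10000, size = 25, prob = 0.7)     # simulated samples, assuming improvement to 0.7
    mean(sims.alt >= 20)                                 # approximate power, around 0.19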


Now we’ll begin to consider factors that affect power, first by asking:

m) What would you encourage Tamika to request, in order to have a better chance of convincing the coach that she has improved?

Several of my students responded very quickly in the zoom chat to say: more shots*.

* You’re welcome to call this a larger sample size.

n) Now suppose that the coach offers 100 shots for Tamika to show her improvement.  Re-run the simulation analysis.  Is there more, less, or the same amount of overlap in the two distributions?  Is this good news or bad news for Tamika?  Explain.

The simulation results reveal that the larger sample size leads to much less overlap between these two distributions:

This is very good news for Tamika, because this shows that it’s easier to distinguish a 70% shooter from a 60% shooter when she takes 100 shots than with only 25 shots.

o) How many shots must she now make successfully in order to convince the coach?  How does this compare to the percentage of 25 shots that she needs to make in order to be convincing?

The following graph shows that making 69 or more shots is sufficient to convince the coach that she has improved from a 60% shooter:

Recall that with 25 shots, Tamika had to make 20 of them to be convincing, so the percentage that she needs to make has decreased from 80% to 69% with the increase in sample size.

p) What is the (approximate) probability that Tamika will be able to convince the coach of her improvement, based on a sample of 100 shots?  How has this changed from the earlier case in which she could only take 25 shots?

This output shows that she has about a 63% chance of convincing the coach now:

This probability is more than three times larger than the previous case with only 25 shots.

q) What else could Tamika ask the coach to change about his decision process, in order to have a better chance to convince him of her improvement?

This one is much harder for students to suggest than sample size, but someone eventually proposes to change the 5% cut-off value, the significance level.  Making that larger would mean that the coach is requiring less strong evidence to be convincing, so that will increase Tamika’s chances of convincing the coach.

r) Change the coach’s significance level from 0.05 to 0.10.  How does this change the number of shots that Tamika must make to convince the coach of her improvement?  How does this change the probability that she convinces the coach of her improvement?

As shown in the following output, Tamika now only needs to make 67 shots, rather than 69, in order to convince the coach.  The probability that she achieves this as a 70% shooter is approximately 0.777, which is considerably larger than the previous probability of approximately 0.632.

s) Identify one more factor that affects how likely Tamika is to convince the coach that she has improved.

I sometimes give a hint by suggesting that students think about the applet’s input values. Then someone will suggest that Tamika could try to improve more.

t) Now suppose that Tamika improves so much that she has a 0.8 probability of successfully making a single shot.  How does this change the number of shots that Tamika must make to convince the coach of her improvement?  How does this change the probability that she convinces the coach of her improvement?

I tell students that they do not need to use the applet to answer the first of these questions.  This change does not affect how many shots she must make to convince the coach.  That value depends only on her previous probability of success, not her new and improved probability of success.  But her new success probability will produce even greater separation between the two distributions and will increase her probability of convincing the coach.  The following output reveals that the new probability is approximately 0.999:


This activity can introduce students to the concept of power without burdening them with too much terminology or too many calculations.  I grant that it’s very convenient to use terms such as significance level and rejection region and power, but I prefer to introduce those after students have first explored the basic ideas.

In the second post in this series, I will discuss some common questions from students, describe some assessment questions that I used for this topic, including some that I now regret, and present extensions of this activity for introducing the concept of power to more mathematically inclined students.

#79 No blog post today

I’m sorry to report that I do not have a blog post for today.

Hold on – doesn’t that sentence itself, especially when combined with this one, constitute a blog post?!  After all, you can see for yourself that I am posting this on my blog.

But while this may constitute a logical puzzle, or conundrum, or paradox, it’s not really related to teaching introductory statistics, and it does not ask good questions, so I don’t know that you should consider this a legitimate blog post.  I am only sending this out because some of you have been very nice to say that you look forward to hearing from me on Monday mornings, so I thought I should announce and explain that I have no blog post this week.

The explanation is simply that I’ve fallen behind, and today (Mon Jan 4) is the first day of classes for the Winter quarter at Cal Poly, and I’ve needed to spend my time on preparing for class rather than writing a blog post.

I will be teaching the second course in our two-course sequence for Business majors.  The first course included basic ideas of data collection and analysis, a good bit of probability, and statistical inference for one variable.  After a brief review, this second course entails inference for comparing groups, chi-square tests, analysis of variance, simple linear regression, and multiple regression.  For good measure, the course concludes with brief introductions to time series and quality control.

This will be my second experience teaching an entirely online course.  I’m feeling less nervous than I was at the start of the Fall quarter, but I also feel that this time I should set my sights higher than simply surviving the experience.

I have mentioned before that I give lots of quizzes to my students.  I have already posted three quizzes before my class even begins.  One is about a welcome video that I prepared, another about a video that introduces myself to students (which I wrote about in post #63, titled My first video, here), and a third about the course syllabus.  I tell students that these quizzes essentially offer free points, because students are welcome to re-watch the videos and re-read the syllabus after they have seen the quiz questions. 

Just for fun, here are some of the questions on these quizzes*:

  1. Which of the following is NOT a reasonable description of statistics?  [Options: Confusing people with nonsense; Gaining insights from data; Making decisions under uncertainty]
  2. How many quizzes will we have?  [Options: A few; Lots; None]
  3. How should you interact with your classmates and instructor?  [Options: With arrogance and condescension; With courtesy and respect; With rudeness and sarcasm]
  4. Are you responsible for practicing academic honesty?  [Options: No; Only on odd-numbered dates; Only on Sundays; Yes]
  5. What is my teaching philosophy?  [Options: Ask good questions; Insist on perfection; Learn by viewing; Rely on luck]

* I usually report the answers to my quiz questions, but this time I think I’ll trust you to figure them out for yourself.

I apologize in advance in case I need to return to “no blog post today” a few times in the coming weeks, but I will definitely return.  Some topics that I have in mind include introducing students to the concept of power and questioning whether two-sided tests are always preferable to one-sided ones.  I will also write about one of my favorite topics to teach, one that does not often come up in statistics courses but is especially relevant when teaching Business students: adjusting for inflation.

Best wishes for asking good questions in 2021!

P.S. Hey, wait a minute – did I just write a blog post today after all? [Options: Absolutely; Not even close; Well, kind of]

#78 Two fun (and brief) items

Thanks for reading this, my final blog post for the infamous year 2020.  In contrast to this seemingly unending year*, I will keep this post very brief.  I will conclude this decidedly not-fun year by presenting two fun items that I recently encountered.

* Even though today is December 28th, it feels more like March 303rd.  (I can’t take credit for this joke, but I regret that I cannot remember where I first saw a version of it.)


The first fun item is a quote from American educator Alice Wellington Rollins.  Even though I just learned of this quote within the past two weeks, it’s actually 122 years old, having appeared in the Journal of Education in 1898 (volume 47, issue 22, page 339, available here).  Stacey Hancock brought this to my attention, as she cites this quote in an article about teaching statistics that she has written for the March 2021 issue of the Notices of the American Mathematical Society.  I think this quote offers a valuable perspective on my “ask good questions” refrain:

The test of a good teacher is not how many questions he can ask his pupils that they will answer readily, but how many questions he inspires them to ask him which he finds it hard to answer.

Alice Wellington Rollins, Journal of Education, 1898

The second fun item is a very recent addition to the brilliant* collection of xkcd comics. 

* I like to think that I do not use the adjective brilliant casually.  If you have not seen these comics, consider taking a look.  Some particularly clever ones that address statistical ideas include: Convincing (here), Correlation (here), and Significant (here).

When I look back on this horrible but memorable year, I hope to think of this image and advice from a recent xkcd comic (available here):


Many thanks and best wishes to all who have read this blog in 2019 and 2020.  I hope that you have found something that helps you to ask good questions of your students.  My aspiration remains to write essays about teaching introductory statistics that are practical, thought-provoking, and fun*.

* And, perhaps just this once, brief.

#77 Discussing data ethics

This guest post has been contributed by Soma Roy.  You can contact her at soroy@calpoly.edu.

Soma Roy is a colleague of mine in the Statistics Department at Cal Poly – San Luis Obispo. Soma is an excellent teacher and has been so recognized with Cal Poly’s Distinguished Teaching Award.  She also served as editor of the Journal of Statistics Education.  I recently learned about some of Soma’s ideas for generating student discussions in online statistics courses, and I am delighted that she agreed to write this guest blog post about one such idea, which introduced students to data ethics.


The GAISE (Guidelines for Assessment and Instruction in Statistics Education) College Report (available here) recommends the use of real data with a context and purpose in statistics classes*. One of the ways I achieve this throughout the course, regardless of what statistics topic we are studying at the time, is by always using data (either in raw or summarized form) from research studies published in peer-reviewed journals.

* Just because the recommendation comes in the college report doesn’t mean that the advice couldn’t apply to K-12 classes.

For example, a study I use to motivate the comparison of means between two groups was conducted by Gendreau et al. and published in the Journal of Abnormal Psychology in 1972 (here). In this study, 20 inmates at a Canadian prison were randomly assigned either to be in solitary confinement or to remain non-confined (that is, have contact with others around them) for seven days. Researchers measured each inmate’s EEG alpha frequency on several days* in order to investigate the effect that sensory deprivation can have on one’s EEG alpha frequency**.

* The article provides data for the 20 inmates at three different time periods, but my students only analyze the data from the final (seventh) day of the experiment.

** Alpha waves are brain waves, the predominance of which is believed to indicate that the individual is in a relaxed but aware state. High frequency of alpha waves is considered to be better than low frequency of alpha waves (Wikipedia).

Without fail, one of the first things that students do when they read about this study is ask: How could they just put someone in solitary confinement? That becomes a jumping off point for our discussion on data ethics. This discussion covers the ethics of study design, data collection, data analyses, and publication of findings.


When the COVID-19 pandemic turned my in-person class into an online class, I decided to turn our brief, in-class discussion into an asynchronous, week-long discussion in our learning management system, Canvas. Borrowing from Allan’s style, the questions that I posted appear in italics, below, accompanied by short blurbs on what I was hoping to address with each of the questions, as well as some student responses and comments.

You have read about an experiment conducted on inmates of a Canadian prison, where 20 inmates were randomly split into two groups. One group of 10 inmates was placed in solitary confinement, and the other group was allowed to remain non-confined. 

Are you as struck as I was the first time I read about this experiment, by how unethical and cruel this experiment was, in that people were randomly assigned to be placed in solitary confinement!? 

Unfortunately, there have been many, many experiments in the past that violated human rights.  That realization has brought about the requirement for all research projects involving human subjects to be reviewed before any data can be collected. 

This discussion is about the ethics to be considered when one decides to carry out a study with human subjects (specifically an experiment that involves manipulating treatment conditions), collect data, or analyze data and publish results from any study. The first few questions below focus on historical studies, while the next few questions in this discussion look into the process for proposing and carrying out human subjects studies, as well as the ethical practices involved in data analysis and publication of study results. 

I hope that, going forward, this discussion helps you think critically about any studies that you may be involved in as a researcher, and keep in mind that (to borrow from the great American poet Maya Angelou) when we “know better, (we should) do better.” 

For this discussion, you need to make two (2) posts:

Part 1: First, you will post a response to one of the questions (1) – (10) below. Be sure to copy and paste the question that you are responding to. 

1. Google “Tuskegee Syphilis Study” – describe the study (year(s), methods, participants, objective, etc.). Why is it considered unethical? Cite your source(s). (e.g., Wikipedia link)

2. Google “US apologizes to Guatemalans, 1940s” – describe the study or studies conducted in the 1940s (year(s), methods, participants, objective, etc.). Why are the studies considered unethical? Cite your source(s). (e.g., Wikipedia link)

3. Google “Human Radiation Experiments in the US, 1940s” – describe the study or studies conducted in the 1940s and even later (year(s), methods, participants, objective, etc.). Why are the studies considered unethical? Cite your source(s). (e.g., Wikipedia link) 

4. Google “Project Bluebird, Project Artichoke” – describe the study or studies (year(s), methods, participants, objective, etc.). Why are the studies considered unethical? Cite your source(s). (e.g., Wikipedia link) 

5. Google “The Monster Study” – describe the study (year(s), methods, participants, objective, etc.). Why is the study considered unethical? Cite your source(s). (e.g., Wikipedia link) 

6. Google “Brown eyes, Blue eyes experiment, Jane Elliot” – describe the study (year(s), methods, participants, objective, etc.). What was the objective of the study? Why do some people consider the study to be unethical? Cite your source(s). (e.g., Wikipedia link) 

This first part of my discussion assignment requires students to read up about a particular historical study, identify some of the key elements such as what was the objective of the study, on whom was the study conducted, when it was conducted, how it was conducted, and why the study is considered unethical. Students are required to cite their sources.

All six of these studies have a plethora of information available from multiple reliable sources on the internet. My hope is that as students read about these studies, they will recognize the shortcomings in the study design – where the researchers went wrong in how they treated their subjects or how they recruited their subjects, or just who their subjects were. I also hope that students will recognize the need for an institutional review board (IRB), the need for informed consent, and the need to protect vulnerable populations.

The Tuskegee study, understandably the most infamous of the lot, draws the most outrage from students. Students find the experiment “crazy and insane,” “a great example of raging biases and racism,” and “lacking in decency.” Students are appalled that little to no information was shared with the participants, that a study that was supposed to last only 6 months lasted 40 years, and that even after penicillin was established to be a standard treatment for syphilis, it wasn’t administered to the participants. Students are saddened by the fact that the researchers abused the knowledge that the participants were impoverished by offering incentives such as free meals and free treatment for other ailments in return for their participation in the study.

Students have similar reactions to the other studies as well. Some of their common responses include:

  • Subjects in any study should be told whether any negative outcomes were to be expected.
  • Participation should be voluntary; leaving the study should be easy and come at no cost to the participant.
  • Children should not be experimented on, at least not without permission from a parent or guardian who can make decisions in the child’s best interest.
  • People who are vulnerable, such as children, prisoners, pregnant women, and people from racial and ethnic minorities, should be protected, and not taken advantage of.

The “Brown eyes, blue eyes” experiment draws some interesting responses*. Some of my students write that while the experiment was well meaning, and was trying to teach students about discrimination on the basis of color, conducting an experiment on impressionable children, especially without the consent of their parents, was unethical. 

* For anyone unfamiliar with this experiment: On the day after the assassination of Dr. Martin Luther King, Jr., teacher Jane Elliot repeatedly told students in her all-white third-grade class that brown-eyed people were better than blue-eyed people.  On the next day, she switched to saying that blue-eyed people were better than brown-eyed people. She observed her students’ behaviors toward each other on both days.


Through their answers to the questions above, sometimes directly and sometimes indirectly, students arrive at recognizing the need for an institutional review board, the need for informed consent, and the need to protect vulnerable populations. This leads to the next set of questions in my discussion assignment:    

7. When you conduct research on human subjects, your research protocol needs to be reviewed by an institutional review board, and you need to obtain informed consent from your subjects. Explain what the bold terms mean, when did these procedures start getting enforced in the U.S., and why you need the review or informed consent. Cite your source(s). (e.g., Wikipedia link) 

8. When you conduct research on human subjects, certain sections of the population are referred to as "vulnerable populations" or "protected groups."  What are these groups, and why do they need to be protected? Give one or two historical examples of studies that were performed unethically on vulnerable populations. Cite your sources (e.g. link from National Institutes of Health) 

For the question about the IRB and informed consent, students are required to describe the terms, explain why they are needed, and report what year these procedures were put in place in the U.S. Again, they are required to provide references. Students discover that concerns about many of the studies referred to in (1) – (6), specifically the Tuskegee Syphilis study and the human radiation experiments, led to the creation of IRBs.

In the wrap-up of this discussion, we revisit the study about the Canadian prisoners, in which some inmates were assigned to solitary confinement to study the effect of sensory deprivation on brain function. The research article mentions that the subjects volunteered to participate and were told that there would be no incentives (e.g., money or a parole recommendation) and that their status in prison would remain unchanged, except for a note in their files mentioning their cooperation. Students discuss whether this is enough protection, or enough informed consent.


The next two questions touch upon what happens to data after they have been collected. Should the person analyzing the data get to pick and choose which data to include in the analysis, based on what creates a more sensational story? Should studies be published only if they show statistically significant findings? Who stands to lose from violations of the ethics of data analysis? Who stands to lose from publication bias*?

* For class examples, I intentionally use studies that showed statistically significant results as well as studies that didn’t. I also have a separate week-long discussion topic in which students read article abstracts from various peer-reviewed journals, where they see both statistically significant and non-significant study results; that discussion touches on one more aspect of data ethics: who funded the study, and why it is important to disclose and to know that.

9. What is publication bias? When does it arise? Who stands to benefit from it? More importantly, who stands to lose from it? Give an example of any study or studies where publication bias was present. Cite your source(s). (e.g., Wikipedia link)

10. What is data manipulation (including “selective reporting” and “data fabrication”)? How is it done? Who stands to benefit from it? More importantly, who stands to lose from it? Give an example of any study or studies where the researchers were accused of wrongful data manipulation. Cite your source(s). (e.g., Wikipedia link)


To earn full credit for the discussion assignment, students must also reply to another student’s post.  This is just my way of encouraging them to read and reflect on what other students posted. Students can only reply after they have first submitted their own initial post:

Part 2: Second, respond/reply to a post by another student – adding more detail/insight to their post. (Note: You will need to first post an answer to part 1 before you can see anybody else’s posts.)


I grade these student discussions very generously. Students almost always get full credit as long as they follow the instructions and make reasonable posts, cite their sources, and don’t just copy-and-paste a Wikipedia article.

On my end-of-quarter optional survey about the class this term, students noted this ethics discussion as the discussion they liked the most. Some students said that this discussion topic was the topic from the course that made the biggest impression on them – describing it as “thought-provoking,” “interesting,” and “eye opening.”

In the past I have used this discussion assignment only in introductory classes. But now that I have the online discussion set up in Canvas, I will also use it in my upper-level courses on design of experiments.

Even though I have used these questions as a discussion topic, I can also see using them as a homework assignment, mini-project, or student presentation. For now, I will stick with the online discussion format because my students said they liked reading what other students wrote. While the pandemic keeps us in remote online classrooms, this format provides one more way for students to connect with their peers, as well as learn about some ethical issues associated with collecting and analyzing data.

This guest post has been contributed by Soma Roy.  You can contact her at soroy@calpoly.edu.

#76 Strolling into serendipity

This post is going to meander.  I’ll get to the point right away, but then I’m going to take a long detour before I return to the point.

The point of this post is to let you know about the 2021 U.S. Conference on Teaching Statistics (USCOTS), encourage you to attend and participate in this conference, and urge you to help with spreading the word.  The conference theme is Expanding Opportunities.  It will be held virtually on June 28 – July 1, with pre-conference workshops beginning on June 24.  The conference sessions will be thought-provoking, directly relevant to teaching statistics, and fun!  See the conference website here for more information.

Now I’m going to indulge in a stroll down memory lane before I return to the point.  If you’re in a hurry or don’t feel like accompanying me on this journey, I understand completely and encourage you to skip ahead past the next several sections.  You can search for “And then 2020 happened” to find the spot where I conclude my reminiscences and return to discussing the 2021 USCOTS.


I like conferences.  Even though I’m an introvert who feels much more comfortable in a small town than in a big city, I have greatly enjoyed and learned a lot from attending conferences across the country and around the world.  The best part has been meeting, learning from, and befriending people with similar professional goals and interests.

My first conference was the Joint Mathematics Meetings (JMM) held in San Francisco in 1991.  I had never been to San Francisco, and I had only been to California when I was nine years old.  I was in my second year of teaching at Dickinson College in Pennsylvania.  I roomed with my good friend from graduate school Tom Short, who was on the academic job market.  We walked around the city, taking in the sights and remarking that San Francisco is an even hillier city to walk than Pittsburgh, where we had attended Carnegie Mellon University together.  A conference highlight for me was attending a presentation by Tom Moore, whom I had never met.  Tom had written an article with Rosemary Roberts, titled “Statistics at Liberal Arts Colleges” (here), which had inspired me as I finished graduate school and before I started teaching at Dickinson.  I also gave a presentation at the conference, titled “Using HyperCard to teach statistics.”  I remember being extremely nervous before my presentation.  As I refresh my memory by checking the conference program here, I am surprised at not remembering that my presentation was apparently given at 7:05 on a Saturday morning!*

Another memorable conference from early in my career was the ASA’s 1992 Winter Conference, held in Louisville, Kentucky.  I was amazed and delighted to find an entire conference devoted to the theme of Teaching Statistics.  By this time Tom Short was teaching at Villanova University, so he and I drove to Louisville together.  I gave my first conference talk about an early version of Workshop Statistics.  Two presentations had a huge impact on my teaching and stand out in my mind to this day.  Bob Wardrop described his highly innovative introductory course, which reimagined the sequencing of topics by using simulation-based inference to present statistical inference from the beginning of the course.  Joan Garfield gave the plenary address, invited and introduced by David Moore, on educational research findings about how students learn statistics.  Joan later wrote an article based on this presentation titled “How Students Learn Statistics” (available here), the general principles of which hold up very well more than 25 years later.

Returning to San Francisco for the Joint Statistical Meetings (JSM) in 1993, I met and chatted with Jeff Witmer, convener of the “isolated statisticians” group and editor of Stats magazine, to which I had recently submitted an article.  I also interacted with Robin Lock for the first time at that conference; he and I have presented in the same sessions of conferences, sometimes with a joint presentation, many times over the years.  The 1993 JSM was also the occasion in which I met a graduate student from Cornell University who was studying both statistics and education, and who had a perfect name for a statistics teacher*.

* Of course, I had no clue at the time that Beth Chance and I would write articles and textbooks together, give conference presentations and conduct workshops together, coordinate the grading of AP Statistics exams, become colleagues in the same department, and eat ice cream together more times than I could count.

In 1994 I traveled outside of North America for the first time, to attend the International Conference on Teaching Statistics (ICOTS) in Marrakech.  Despite tremendously troublesome travel travails*, I greatly enjoyed the exotic locale and the eye-opening experience of meeting and hearing from statistics teachers and education researchers from around the world.  I gave another presentation about Workshop Statistics.  Some specific memories include George Cobb’s talk about workshops for mathematicians who teach statistics and Dick Scheaffer’s presentation about Activity-Based Statistics.

* Try saying (or typing) that ten times fast.

Oh dear, I really could keep writing a full paragraph (or more) about every conference that I’ve attended over the past thirty years.  But I need to remember that I’m writing a blog post, not a memoir.  I hope I’ve made my point that I benefitted greatly from attending and presenting at conferences as I embarked on my career as a teacher of statistics.  Especially for a small-town introvert, these conferences greatly expanded my horizons.  I’m incredibly fortunate and grateful that some of the people I met at these conferences, whose work I admired and which had a big impact on me, went on to become lifelong friends and valued collaborators.

I hasten to add that I have continued to enjoy and benefit from conferences throughout my career.  Since 1995, the only JSM that I have missed was in 2016 due to illness.  It took me a few months to recover from my surgery that year, and I considered myself fully recovered when I was able to attend the AMATYC conference in Denver in November of 2016.  I remember feeling very happy to be well enough to walk around a conference hotel and be able to participate in a conference again.  I also recall feeling somewhat silly to consider conference attendance as an important marker of my recovery.


As I continue this stroll down memory lane, I now turn toward USCOTS.  I have attended all eight USCOTS conferences*, which have been held in odd-numbered years since 2005, and I have come to regard USCOTS as my favorite conference. 

* I realize that the word “conference” here is redundant with the C in USCOTS, but I fear that “USCOTSes” looks and sounds ridiculous.

The organizers of the first USCOTS, Dennis Pearl and Deb Rumsey and Jack Miller, did a terrific job of establishing a very welcoming and supportive environment.  Conference sessions were designed to engage participants, and the conference provided many opportunities for interaction among attendees, outside of sessions as well as during them.

The inaugural USCOTS in 2005 was the most influential conference of my career.  The lineup of plenary speakers was star-studded: Dick Scheaffer and Ann Watkins, Roxy Peck, Cliff Konold, Robin Lock and Roger Woodard, and George Cobb (see the program here).  Roxy’s talk was memorable not only for its enticing title (How did teaching introductory statistics get to be so complicated?) but also for the insights about teaching statistics that Roxy garnered from a famous video of a selective attention test (here).  George’s banquet presentation at this conference, which also featured a provocative title* (Introductory statistics: A saber tooth curriculum?), has achieved legendary status for inspiring a generation of statistics teachers to pursue simulation-based inference**.

* Of course, I admire that both of these titles ask good questions.

** See here for a journal article that George wrote, based on this presentation, in which he subtly revised the title to ask: A Ptolemaic curriculum?

The next three USCOTS were also very engaging and informative.  I will mention just one highlight from each:

  • In 2007 Dick De Veaux gave a terrific banquet presentation, titled “Math is music; statistics is literature,” that was almost the equal of George’s for its cleverness and thought-provoking-ness. 
  • Chris Wild inspired us in 2009, and provided a glimpse of even more impressive things to come, with his demonstration of dynamic software that introduces young students to statistics, and excites them about the topic, through data visualization.
  • Rob Gould challenged us in 2011 to think about how best to prepare students to be “citizen statisticians,” arguing that they come to our classes having already had immersive experiences with data.

My point here is that USCOTS was designed from the outset as a very engaging and interactive conference, ideal for statistics teachers looking to meet like-minded peers and exchange ideas for improving their teaching.


Following the 2011 USCOTS, I was quite surprised and honored when Deb and Dennis asked me to take on the role of USCOTS program chair.  I have now served in this capacity for four conferences, from 2013 – 2019.  I have tried to maintain the distinctive features that make USCOTS so valuable and worthwhile.  My primary addition to the program has been a series of five-minute talks that comprise opening and closing sessions.  I have been thrilled that so many top-notch statistics educators have accepted my invitations to give these presentations.

If you’ve never given a five-minute presentation, let me assure you that it can be very challenging and nerve-wracking.  Condensing all that you want to say into five minutes forces you to focus on a single message and also to organize your thoughts to communicate that message in the brief time allotted.  

For my first year as program chair in 2013, I went so far as to insist on the “Ignite” format, which requires each presenter to use 20 slides that automatically advance every 15 seconds.  I have loosened this restriction in subsequent years.  The opening five-minute talks have launched the conferences with energy and fun.  They have generated thought-provoking discussions among attendees.  The closing talks have recapped the conference experience and inspired participants to depart with enthusiasm for implementing some of what they’ve learned with their own students*.

* You can find slides and recordings for these five-minute talks, along with other conference presentations and materials, by going here, clicking on “years” on the right side, going to the year of interest, then clicking on “program,” and finally clicking on the session link within the program page.  As you peruse the lists of presenters for an opening or closing session, you may notice that I like to arrange the order of presentation alphabetically by first name.

My point in this section is that since I have been entrusted with the keys to the USCOTS program, I have tried to maintain USCOTS as a welcoming, engaging, and valuable conference.  Serving as program chair for the past four incarnations of USCOTS has provided me with considerable helpings of both professional pride and enjoyment.


After the 2019 USCOTS, I decided to pass the program chair baton to someone of the next generation who would infuse the conference with new ideas and vitality.

I asked Kelly McConville to take on this role.  Even though Kelly is early in her career as a statistics professor*, she already has considerable experience as a successful program chair.  She has served as program chair for ASA’s Statistics and Data Science Education section at JSM, for the Electronic Undergraduate Statistics Research Conference, and for the Symposium on Data Science and Statistics (see here).  Kelly has attended several USCOTS conferences and gave one of the five-minute talks at the closing session for USCOTS in 2017.

* Congratulations are in order, because Kelly was informed just last week that she has earned tenure in her faculty position at Reed College.

Kelly replied by asking if I would consider co-chairing USCOTS with her in 2021, and I happily agreed.


And then 2020 happened*.

* There’s obviously no need for me to describe how horrible 2020 has been in myriad ways.  But I can’t resist noting that a vaccine has been developed, tested, and approved in less than one year.  This is an incredible achievement, one in which the field of statistics has played an important role. The vaccine is being administered for the first time in the U.S. (outside of trials) on the day that this post appears.

The pandemic required Dennis (who continues to serve as director of CAUSE, the organization that puts on USCOTS) and Kelly and me to decide whether to plan for an in-person, virtual, or hybrid USCOTS.  Spurred on by Camille Fairbourne, Michigan State University had agreed to host USCOTS in late June of 2021.  In August of 2020, we asked statistics teachers to answer survey questions about planning for USCOTS.  Among 372 responses, 50.3% recommended a virtual conference and only 11.8% recommended in-person, with the remaining 37.9% preferring a hybrid.  Mindful of drastic cuts to many schools’ budgets as well as continuing uncertainty about public health, we made the difficult decision to forego an in-person conference and hold USCOTS virtually.

We quickly selected a conference theme: Expanding Opportunities.  Aspects of this timely theme that conference sessions will explore include:

  • How can we increase participation and achievement in the study of statistics by students from under-represented groups?
    • What classroom practices can help with this goal?
    • How can curriculum design increase such participation and achievement?
    • What role can extra-curricular programs play?
    • How can remote learning and new technologies help?
    • How can we collaborate more effectively with colleagues and students in other disciplines to achieve this goal?
  • How can we support and encourage students and colleagues who are beginning, or contemplating, careers in statistics education?
  • Can the emerging discipline of data science help to democratize opportunities for students from under-represented groups?
  • What does educational research reveal about the effectiveness of efforts to expand opportunities?

The conference will feature thought-provoking plenary sessions, interactive breakout sessions, informative posters-and-beyond sessions, and opening and closing sessions with inspiring and lively five-minute presentations. Other highlights include birds-of-a-feather discussions, a speed mentoring session, an awards ceremony*, extensive pre-conference workshops, and sponsor technology demonstrations.

* The USCOTS Lifetime Achievement Award has been renamed the George Cobb Lifetime Achievement Award in Statistics Education, in honor of George, the first recipient of the USCOTS Award, who passed away on May 6, 2020.

One of the plenary sessions will be a panel discussion about fostering diversity in our discipline.  Kelly and I plan to ask the panelists questions such as:

  • What are some barriers to pursuing study of statistics, and succeeding in study of statistics, for students from under-represented groups?
  • What are some strategies for eliminating barriers and expanding opportunities for students from under-represented groups in the following areas?
    • Recruitment
    • Curriculum
    • Individual courses
    • Program/department culture
    • Other?
  • How (if at all) does the emerging discipline of data science offer potential solutions for expanding opportunities and fostering diversity?
  • What are some strategies for encouraging and supporting people from diverse backgrounds to pursue and succeed in careers as statistics teachers and statistics education researchers?

We are determined to reproduce the welcoming, engaging, interactive, and fun aspects of USCOTS as much as possible in a virtual setting.  We also hope that the virtual format will encourage participation from statistics teachers who might not be able to invest the time that it takes to travel to an in-person conference.


One of my favorite words is serendipity.  I like the definition from Google’s dictionary almost as much as the word itself: the occurrence or development of events by chance in a happy or beneficial way.  The benefits that I gained from attending conferences early in my career resulted from chance encounters more than from planned meetings.  Serendipity is one of the best aspects of any conference*. 

* Heck, serendipity is one of the best things in life.  Sadly, serendipity has also been one of the biggest casualties of the pandemic.

By definition, serendipity is impossible to plan in advance.  Serendipity is especially challenging to arrange with a virtual conference that people can attend without leaving their homes.  But we’re going to do everything we can to infuse the 2021 USCOTS with opportunities for serendipity, and we welcome suggestions about how to create such opportunities.  I hope that all USCOTS participants in 2021 make new acquaintances and renew friendships with colleagues who are united by a common desire to teach statistics effectively to the next generation of citizens and scholars.

How can you help?  First, mark the dates June 28 – July 1, 2021 on your calendar and plan to attend USCOTS.  Second, consider submitting a proposal to conduct a workshop, lead a breakout session, present a virtual poster, or facilitate a birds-of-a-feather discussion.  Third, please let others know about USCOTS and encourage them to participate.  Spreading the word broadly can expand opportunities to participate in USCOTS, where we can share ideas about expanding opportunities for others to engage in our profession. 

Once again, more information is available at the conference website here.

#75 More final exam questions

I gave my first asynchronous online final exam this past week.  I find writing online exams to be much more time-consuming and stressful than writing good, old-fashioned in-person exams*.  I’ve identified five aspects of writing online exams that take considerable time and effort:

  1. Writing good multiple-choice questions and answer options;
  2. Creating multiple versions of most questions in an effort to reduce cheating;
  3. Thinking of questions where googling does not provide much of an advantage;
  4. Entering all of the questions into the format required by the learning management system;
  5. Double- and triple- and quadruple-checking everything**

* I’m finding it hard to remember the days of photocopying exams and handing them to students on paper.

** I became obsessed with this last one, because typos and other errors are so much more problematic now than they used to be.  I may not remember photocopying, but I fondly recall the good old days when a student would point out a mistake and I simply had to say: Excuse me, class, please look on the board to see a correction for part c) of question #3.  I really stressed and lost sleep about this.  And somehow I still managed to mess up!  I’m embarrassed to report that despite my efforts, students found an error on both the Wednesday and Friday versions of my final exams.  I was especially grateful to the student who started the exam at 7am on Wednesday and let me know about the error as soon as she finished, so I was able to make the correction before most students began the exam.

Now I’m in the throes of grading.  You may know that when it comes to grading, I enjoy procrastination*.  But the timeline is tight because grades are due on Tuesday.  Without further preamble, I will now discuss some of the multiple-choice questions that I asked my students on this exam.  I will provide answers at the end.

* See post #66, First step of grading exams, here.


1. Suppose that you want to investigate whether Cal Poly students tend to watch more movies than Cal Poly faculty.  Would you collect data to investigate this question using random sampling, random assignment, or both? [Options: A) Random sampling only; B) Random assignment only; C) Both random sampling and random assignment]

I like this question because I try to emphasize the distinction between random sampling and random assignment.  This is meant to be an easy question.  Students should realize that it’s not reasonable to randomly assign people to the roles of faculty or student.

2. Suppose that the nine current members of the U.S. Supreme Court are still the same nine members of the Supreme Court two years from now. Indicate how the following values will change from now until then (two years from now). a) Mean of ages; b) Standard deviation of ages; c) Median of ages; d) Inter-quartile range of ages [Options: A) Increase; B) Decrease; C) Remain the same]

This is also intended as an easy question.  The mean and median will increase by two years.  But as measures of variability, the standard deviation and inter-quartile range will not change when everyone becomes two years older.
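
For instructors who like to verify such claims numerically, here is a minimal sketch in Python; the nine ages below are placeholder values for illustration, not the justices’ actual ages.

```python
# A quick numerical check of question 2, using nine placeholder ages
# (not the justices' actual ages).
import numpy as np

ages_now = np.array([48, 53, 56, 62, 66, 70, 72, 82, 83])
ages_later = ages_now + 2  # everyone is two years older

for label, ages in [("now", ages_now), ("in two years", ages_later)]:
    iqr = np.percentile(ages, 75) - np.percentile(ages, 25)
    print(f"{label}: mean = {ages.mean():.1f}, median = {np.median(ages):.1f}, "
          f"sd = {ages.std(ddof=1):.2f}, iqr = {iqr:.1f}")
# The mean and median increase by exactly 2; the sd and iqr stay the same.
```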

3. a) Which would be larger – the mean weight of 10 randomly selected people, or the mean weight of 1000 randomly selected cats (ordinary domestic housecats)?  b) Which would be larger – the standard deviation of the weights of 1000 randomly selected people, or the standard deviation of the weights of 10 randomly selected cats (ordinary domestic housecats)? [Options: A) Cats; B) People]

I have written about this question before*.  Part (b) is very challenging for students.  Unfortunately, many students come to believe that a larger sample size produces a smaller standard deviation, without realizing that this result applies to the variability of a sample statistic, such as a sample mean, not to variability in the original measurements, such as weights of people and cats.

* See post #16, Questions about cats, here.

4. Suppose that a fair coin is flipped 10 times.  Which is more likely – that the flips result in 5 heads and 5 tails, or that the flips result in 6 of one outcome and 4 of the other? [Options: A) 5 of each; B) 6-4 split; C) These are equally likely.]

Students could answer this by calculating the relevant binomial probabilities.  But they might also realize the key point that a 6-4 split can happen in two different ways.  Even though a particular 6-4 split is less likely than a 5-5 result, a 6-4 split in either direction is more likely than a 5-5 result.  These probabilities turn out to be 0.246 for obtaining 5 heads and 5 tails, and 0.410 for achieving a 6-4 split.
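
For readers who want to check these values, here is a minimal sketch of the binomial calculation in Python:

```python
# Checking the binomial probabilities for 10 flips of a fair coin.
from math import comb

def binom_prob(n, k, p=0.5):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_5_5 = binom_prob(10, 5)                      # exactly 5 heads and 5 tails
p_6_4 = binom_prob(10, 6) + binom_prob(10, 4)  # a 6-4 split in either direction

print(round(p_5_5, 3), round(p_6_4, 3))  # 0.246 0.41
```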

5. Suppose that Chiara has a 10% chance of making an error when she conducts a test. If she conducts 10 independent tests, which of the following is closest to the probability that she makes at least one error? [Options: A) 0.10; B) 0.25; C) 0.50; D) 0.65; E) 0.99]

I intend for students to perform the calculation: Pr(at least one error) = 1 – Pr(no errors) = 1 – (0.9)^10 ≈ 0.651.  I chose options far enough apart that some students might use their intuition to determine the correct answer, if they realize that making at least one error would be more likely than not without being extremely likely.
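
Here is a quick check of that complement calculation, as a minimal sketch in Python:

```python
# Probability of at least one error in 10 independent tests,
# each with a 10% chance of error.
p_at_least_one = 1 - 0.9 ** 10
print(round(p_at_least_one, 3))  # 0.651
```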


6. The United States has about 330 million residents.  Suppose that you want to estimate the proportion of Americans who wore socks yesterday to within a margin-of-error of 3.5 percentage points with 95% confidence.  Which of the following is closest to the number of people that you would need to randomly sample? [Options: A) 30; B) 1000; C) 30,000; D) 1,000,000]

I also discussed this question, which I ask on every final exam, in post #21 here.  Influenced by the 330 million number, many students mistakenly believe that a sample size of 1 million, or at least 30 thousand, people is required.
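
For anyone curious about the arithmetic behind this, here is a minimal sketch in Python using the conventional conservative sample-size formula (with p = 0.5 as the worst case); students were not asked to carry out this calculation on the exam.

```python
# Conservative sample size for estimating a proportion with 95% confidence:
# n = (z / m)^2 * p * (1 - p), using p = 0.5 as the worst case.
from math import ceil

z = 1.96    # critical value for 95% confidence
m = 0.035   # desired margin of error (3.5 percentage points)
n = ceil((z / m) ** 2 * 0.5 * 0.5)
print(n)  # 784, so the closest option is 1000; the 330 million population size plays no role
```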

7. Suppose that Carlos, Dwayne, and Elsa select separate and independent random samples of 50 Cal Poly students each.  They ask each student in the sample how much sleep they got last night, in minutes.  Then they calculate the average amount of sleep for the students in their sample.  How likely is it that Carlos, Dwayne, and Elsa obtain the same value for their sample average? [Options: A) This is very likely. B) There’s about a 50% chance of this. C) There’s a 1 in 3 chance of this. D) This is very unlikely.]

This question addresses the concept of sampling variability, which is even more fundamental than that of sampling distribution.  This is meant to be an easy question that students can answer based on their intuition or by remembering what we discovered when simulating the drawing of random samples with an applet such as this one (here) that randomly samples words from the Gettysburg Address.

8. Suppose that Yasmin and Jade want to select a random sample of San Luis Obispo county residents and ask each person whether or not they spent Thanksgiving in their own home.  Suppose also that Yasmin wants to estimate the population proportion to within ± 0.04 with 95% confidence, and Jade wants to estimate the population proportion to within ± 0.02 with 95% confidence.  Who would need to use a larger sample size?  (You need not calculate any sample sizes to answer this question.)  [Options: A) Jade; B) Yasmin; C) They would both need the same sample size.]

Here is another question for which students could spend a good bit of time performing calculations, but they’re better served by thinking this through.  They need only realize that obtaining a smaller margin-of-error requires a larger sample size.
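
If you do want to see the numbers, here is a minimal sketch using the same conservative formula as above; halving the margin of error roughly quadruples the required sample size.

```python
# Same conservative formula as above, applied to both desired margins of error.
from math import ceil

def sample_size(m, z=1.96, p=0.5):
    """Conservative sample size for a 95% CI with margin of error m."""
    return ceil((z / m) ** 2 * p * (1 - p))

print(sample_size(0.04))  # 601  (Yasmin)
print(sample_size(0.02))  # 2401 (Jade) -- halving the margin roughly quadruples n
```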

9. Suppose that you conduct a hypothesis test about a population mean and calculate the t-test statistic to equal 0.68.  Which of the following is the best interpretation of this value?  [Options: A) If the null hypothesis were true, the probability would be 0.68 of obtaining a sample mean as far as observed from the hypothesized value of the population mean. B) The probability is 0.68 that the null hypothesis is true. C) The sample mean is 0.68 standard errors greater than the hypothesized value of the population mean. D) The sample mean is equal to 0.68 times the standard error.]

Students’ ability to interpret the value of a test statistic is worth assessing.  You no doubt realize that I purposefully chose a value less than 1 for the t-test statistic here, partly to see whether students might confuse the interpretation of a test statistic and a p-value.

10. Suppose that you take a random sample of 100 books from a large library.  For each of the following questions, indicate the appropriate inference procedure. a) How old, on average, is a book from this library? b) Are 75% of books in this library less than 20 years old? c) What percentage of books in this library contain fewer than 300 pages? d) How many pages, on average, are contained in a book from this library? e) What percentage of books in this library have been borrowed at least once in the past 10 years? [Options: A) z-interval for proportion; B) z-test for proportion; C) t-interval for mean; D) t-test for mean]

This series of questions is very similar to the questions that I discussed in last week’s post (A sneaky quiz, here), so my students should have expected questions of this type.  I think these questions are a bit harder than the ones I presented in class and on that quiz, though.  Parts (b) and (c) involve a categorical variable, but students might be tempted to think of a numerical variable because the context also refers to a book’s age and number of pages.


I’m selfishly glad that the time I invested into writing multiple-choice questions for my final exam has now served double-duty by providing me with the basis for this blog post.  But I really do need to get back to grading the open-ended questions …

P.S. The correct answers are: 1. A; 2. A, C, A, C; 3. B, B; 4. B; 5. D; 6. B; 7. D; 8. A; 9. C; 10. C, B, A, C, A.

#74 A sneaky quiz

Last summer I participated as a student in an online course for the first time.  The topic was how to teach an online course.  The course was delivered asynchronously, but it was not self-paced because there were regular due dates on assignments.  Somewhat to my embarrassment, I found that I was highly motivated by the assignments and those due dates.

As I have been teaching my own students online this term, I decided to give even more quizzes than usual to motivate my students to keep up.  For each topic that we have studied, I have given a handout quiz and an application quiz.  The handout quizzes have often asked the same questions that we answered as we worked through the handout, while the application quizzes have asked students to apply what they learned to a new study or situation.  As long as a student attended one of my live zoom sessions or watched the videos that I prepared, and paid a modest amount of attention, they should have done very well on the handout quizzes.  I even allowed two attempts on these handout quizzes, recording the average score.


My final class meeting of the term occurred on Monday of Thanksgiving week.  I told my students in advance that we would not study new material on that day.  Instead I provided them with practice questions about identifying which inference procedure to apply for a particular question.  As this is the first course in a two-course sequence, and we spent about half of the term studying probability, we have only studied inference for a single mean or a single proportion.  Here’s how I summarized things at the start of the handout for this class:

  • Statistical inference draws a conclusion (i.e., infers something) about a population parameter based on a sample statistic. 
    • A confidence interval estimates the value of a parameter with a range of values.
      • A population proportion π can be estimated with a z-interval.
      • A population mean μ can be estimated with a t-interval.
    • A hypothesis test assesses the plausibility of a particular claim about the parameter.
      • A claim about a population proportion π can be tested with a z-test.
      • A claim about a population mean μ can be tested with a t-test.

The instructions that I provided for the task were: For each of the following research questions, identify which of these four inference procedures would be appropriate.  Furthermore, if the research question calls for a hypothesis test, state the appropriate null and alternative hypotheses.  If the research question calls for a confidence interval, clearly identify the parameter to be estimated.

The ten questions that we analyzed were:

  • a) How many hours does a full-time Cal Poly student spend studying, per week, on average?
  • b) Does a full-time Cal Poly student spend an average of more than 25 hours per week studying?
  • c) Does the percentage of full-time Cal Poly students who were born in California differ from 80%?
  • d) What proportion of full-time Cal Poly students were born in California?
  • e) What proportion of people with a driver’s license in California have indicated a willingness to be an organ donor?
  • f) Have less than two-thirds of all people with a driver’s license in California indicated a willingness to be an organ donor?
  • g) What is the average price of a transaction at the Subway on campus?
  • h) What proportion of transactions at the Subway on campus include a soft drink?
  • i) Do most transactions at the Subway on campus include a soft drink?
  • j) Do weekday employees at a company take sick days disproportionately often on Mondays and Fridays?

We worked through the first four of these together.  I advised students to start by identifying the observational units, variable, and type of variable for each question.  I emphasized that deciding whether the parameter is a mean or a proportion boils down to determining whether the variable is numerical or categorical.  I also admitted that the question itself often contains a key (giveaway) word, such as average in parts (a) and (b), percentage in (c), and proportion in (d).

Next I asked students to discuss parts (e)-(j) together in zoom breakout rooms of 4-5 students per group.  Then we came back together to discuss these.  I pointed out that questions (f) and (i) do not use a giveaway word, so they require more careful thought.  Students need to realize that the variable in (f) is whether or not the person has indicated a willingness to be an organ donor, which is categorical, so the parameter is the proportion of all people with a California driver’s license who have indicated such a willingness.  The word most carries a lot of weight in (i), revealing that the alternative hypothesis is that the proportion of all Subway transactions that include a soft drink is greater than one-half.

Question (j) is a favorite of mine. Its impetus is an old* Dilbert cartoon, available here.  The joke is that the pointy-haired boss expresses outrage upon learning that two-fifths of all sick days at his company are taken on Mondays and Fridays.  The observational units are sick days, and the variable is whether or not the sick day was taken on Monday or Friday.  The null hypothesis asserts that two-fifths of all sick days are taken on Monday or Friday, which is what would be expected if sick days were not being misused to produce long weekends.  The alternative hypothesis is that more than two-fifths of all sick days are taken on Monday or Friday.

* I just realized that very few, if any, of my students were alive when this particular cartoon appeared in 1996.  Hmm, I wonder if my university’s special incentive to take early retirement is still available.


The title of this post promised something sneaky.  You might be thinking that unless sneaky has been redefined as boring, what you’ve read so far does not even come close.  Please keep reading …

I have mentioned before that my course this term is asynchronous, even though I strongly encourage students to attend my optional live zoom sessions on MWF mornings.  Because of the asynchronous listing, I feel obligated to make videos to accompany the handouts for the students who cannot, or choose not to, attend the live sessions.  These videos show me working through the examples in the handout.  I always begin by saying something like: I strongly encourage you to pause the video, answer the handout questions on your own first, and then resume the video to watch my discussion of the questions.  The videos usually show me writing answers directly in the handout file, especially when performing calculations. 

But this time I purposefully did not write on the handout for the video recording.  Instead I only talked about the ten questions (a) – (j).  For the students who ignored my advice to answer the questions for themselves before watching the video, I wanted them at a minimum to take their own notes based on what I was saying.  I hope that active listening and writing might have activated their learning to some extent.

That’s a bit of sneakiness on my part, but that does not constitute the sneaky quiz mentioned in the title of this post.


Most of my handout quiz questions throughout this term have repeated questions that were asked directly in the handout.  But this time students could not answer questions on the handout quiz merely by copying answers from their notes.  Here are the quiz questions:

  1. For how many of the ten questions in this handout are Cal Poly students the observational units?
  2. How many of the ten questions in this handout involve a categorical variable?
  3. How many of the ten questions in this handout involve inference for a population mean?
  4. How many of the ten questions in this handout ask for a confidence interval?
  5. How many of the ten questions in this handout ask for a hypothesis test with a two-sided alternative hypothesis?

I hope that this sneaky approach of mine forced students to review their notes and also reinforced some ideas about how to decide on an inference method.  I hope that these quiz questions reminded students, perhaps sub-consciously, to think about the observational units (question #1), the type of variable (#2), the parameter (#3), whether the question calls for a confidence interval or hypothesis test (#4), and whether an alternative hypothesis is one- or two-sided (#5).

My writing teachers from college might be disappointed that my previous two sentences both began with “I hope …”  Nevertheless, I return to that construction once more for my conclusion: I hope you agree that sneakiness is forgivable, perhaps even desirable, as a pedagogical strategy when the intent is to prompt students to think without their realizing it.