# Archive for

## #43 Confounding, part 1

The topic of confounding is high on the list of most confounding topics in introductory statistics.  Dictionary.com provides these definitions of confound (here):

1. to perplex or amaze, especially by a sudden disturbance or surprise; bewilder; confuse: The complicated directions confounded him.
2. to throw into confusion or disorder: The revolution confounded the people.
3. to throw into increased confusion or disorder
4. to treat or regard erroneously as identical; mix or associate by mistake: Truth confounded with error.
5. to mingle so that the elements cannot be distinguished or separated
6. to damn (used in mild imprecations): Confound it!

Definition #5 comes closest to how we use the term in statistics.  Unfortunately, definitions #1, #2, and #3 describe what the topic does to many students, some of whom respond in a manner that illustrates definition #6.

In this post I will present two activities that introduce students to this important but difficult concept, along with some follow-up questions for assessing their understanding.  One example will involve two categorical variables, and the other will feature two numerical variables.  As always, questions that I pose to students appear in italics.

I have used a variation of the following example, which I updated for this post, for many years.  I hold off on defining the term confounding until students have anticipated the idea for themselves.  Even students who do not care about sports and know nothing about basketball can follow along.

1. During the 2018-19 National Basketball Association season, the Sacramento Kings won 13 home games and lost 16 when they had a sell-out crowd, compared to 11 home wins and 1 loss when they had a smaller crowd.

a) Identify the observational units, explanatory variable, and response variable in this study.  Also classify each variable as categorical or numerical.

As I argued in post #11 (Repeat after me, here), I think these questions are important to ask at the start of nearly every activity, to orient students to the context and the type of analysis required.  The observational units are games, more specifically home games of the Sacramento Kings in the 2018-19 season.  The explanatory variable is crowd size, and the response variable is game outcome.  As presented here, both variables are categorical (and binary).  Crowd size could be studied as a numerical variable, but that information is presented here only as whether the crowd was a sell-out or smaller.

b) Organize the data into a table of counts, with the explanatory variable groups in columns.

First we set up the table as follows:

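Then I suggest to students that we work with each number as we encounter it in the sentence above, so I first ask where the number 2018 should go in the table.  This usually produces more groans than laughs, and then we proceed to fill in the table as follows:

| | Sell-out crowd | Smaller crowd | Total |
|---|---|---|---|
| Win | 13 | 11 | 24 |
| Loss | 16 | 1 | 17 |
| Total | 29 | 12 | 41 |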

Some optional questions for sports fans: Does the number 41 make sense in this context?  Basketball fans nod their heads, knowing that an NBA team plays an 82-game season, with half of the games played at home.  Did the Kings win more than half of their home games?  Yes, they won 24 of 41 home games, which is 58.5%.  Does this mean that the Kings were an above-average team in that season?  No.  In fact, after including data from their games away from home, they won only 39 of 82 games (47.6%) overall.

c) Calculate the proportion of wins for each crowd size group.  Do these proportions suggest an association (relationship) between the explanatory and response variables?  Explain.

The Kings won 11/12 (.917, or 91.7%) of games with a smaller crowd.  They won 13/29 (.448, or 44.8%) of games with a sell-out crowd.  This seems like a substantial difference (almost 47 percentage points), which suggests that there is an association between crowd size and game outcome.  The Kings had a much higher winning percentage with a smaller crowd than with a sell-out crowd.
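For readers who want to check the arithmetic in part (c), here is a minimal sketch in Python. The language and the variable names are choices made for this illustration, not something the activity requires; the counts come straight from the sentence in question 1.

```python
# Counts from question 1: home games of the 2018-19 Sacramento Kings.
counts = {
    "sell-out crowd": {"win": 13, "loss": 16},
    "smaller crowd": {"win": 11, "loss": 1},
}


def win_proportion(group):
    """Proportion of games won within one crowd-size group."""
    return group["win"] / (group["win"] + group["loss"])


proportions = {name: win_proportion(group) for name, group in counts.items()}
difference = proportions["smaller crowd"] - proportions["sell-out crowd"]

for name, p in proportions.items():
    print(f"{name}: {p:.3f}")
print(f"difference: {difference * 100:.1f} percentage points")
```

Comparing conditional proportions within each explanatory group, rather than raw counts, is what makes the two groups comparable even though they contain different numbers of games.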

d) Produce a well-labeled segmented bar graph to display these proportions.

Here’s a graph generated by Excel:

e) Is it reasonable to conclude that a sell-out crowd caused the team to play worse?  If not, provide an alternative explanation that plausibly explains the observed association.

This is the key question of the entire activity.  I always find that some students have been anticipating this question and are eager to respond: Of course not!  These students explain that the Kings are more likely to have a sell-out crowd when they’re playing against a good team with superstar players, such as the Golden State Warriors with Steph Curry.  I often have to prod students to supply the rest of the explanation: What else is true about the good teams that they play against?  The Kings are naturally less likely to win against such strong teams.

At this point I introduce the term confounding variable as one whose potential effects on a response variable cannot be distinguished from those of the explanatory variable.  I also point out that a confounding variable must be related to both the explanatory and response variable.  Finally, I emphasize that because of the potential for confounding variables, one cannot legitimately draw cause-and-effect conclusions from observational studies.

f) Identify a confounding variable in this study, and explain how this confounding variable is related to both the explanatory and response variable.

This is very similar to question (e), now asking students to express their explanation with this new terminology.  Some students who provide the alternative explanation well nevertheless struggle to specify a confounding variable clearly.  A good description of the proposed confounding variable is: strength of opponent.  It seems reasonable to think that a stronger opponent is more likely to generate a sell-out crowd, and a stronger opponent also makes the game less likely to result in a win for the home team.

I usually stop this in-class activity there, but you could ask students to dig deeper in a homework assignment or quiz.  For example, we can look at more data to explore whether our conjectures about strength of opponent hold true.

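It seems reasonable to use the opposing team’s percentage of games won in that season as a measure of its strength.  Let’s continue to work with categorical variables by classifying teams with a winning percentage of 40% and below as weak, between 40% and 60% as moderate, 60% and above as strong.  This leads to the following tables of counts:

| | Weak | Moderate | Strong | Total |
|---|---|---|---|---|
| Sell-out crowd | 6 | 13 | 10 | 29 |
| Smaller crowd | 4 | 8 | 0 | 12 |
| Total | 10 | 21 | 10 | 41 |

| | Weak | Moderate | Strong | Total |
|---|---|---|---|---|
| Win | 10 | 12 | 2 | 24 |
| Loss | 0 | 9 | 8 | 17 |
| Total | 10 | 21 | 10 | 41 |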

Do these data support the two conjectures about how strength of opponent relates to crowd size and to game outcome?  Support your answer with appropriate calculations and graphs.

The first conjecture was that stronger opponents are more likely to generate a sell-out crowd.  This is supported by the data, as we see that 100% (10/10) of strong opponents produced a sell-out crowd, compared to 61.9% (13/21) of moderate opponents and 60% (6/10) of weak opponents.  These percentages are shown in this segmented bar graph:

The second conjecture was that stronger opponents are less likely to produce a win by the home team.  This is clearly supported by the data.  The home team won 100% (10/10) of games against weak opponents, which falls to 57.1% (12/21) of games against moderate teams, and only 20% (2/10) of games against strong teams.  These percentages are shown in this segmented bar graph:
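Both conjectures come down to comparing conditional proportions across the three strength categories. A minimal Python sketch of that calculation follows; the dictionary layout and names are hypothetical, and the counts are the ones quoted in the two paragraphs above.

```python
# Counts by opponent strength, as quoted in the text.
by_strength = {
    "weak": {"sell_outs": 6, "wins": 10, "games": 10},
    "moderate": {"sell_outs": 13, "wins": 12, "games": 21},
    "strong": {"sell_outs": 10, "wins": 2, "games": 10},
}

sell_out_rate = {k: v["sell_outs"] / v["games"] for k, v in by_strength.items()}
win_rate = {k: v["wins"] / v["games"] for k, v in by_strength.items()}

for strength in by_strength:
    print(
        f"{strength:>8}: sell-out {sell_out_rate[strength]:.1%}, "
        f"win {win_rate[strength]:.1%}"
    )

# Sanity check: the three categories should add back up to the season totals.
total_games = sum(v["games"] for v in by_strength.values())
total_sell_outs = sum(v["sell_outs"] for v in by_strength.values())
total_wins = sum(v["wins"] for v in by_strength.values())
print(total_games, total_sell_outs, total_wins)
```

The totals check is worth keeping: it confirms that the breakdown by strength is consistent with the 41 home games, 29 sell-outs, and 24 wins from the original table.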

Here’s a quiz question based on a different candidate for a confounding variable.  It also seems reasonable to think that games played on weekends (let’s include Fridays with Saturdays and Sundays) are more likely to attract a sell-out crowd than games played on weekdays.  What else would have to be true about the weekend/weekday breakdown in order for that to be a confounding variable for the observed association between crowd size and game outcome?  What remains is for students to mention a connection with the response variable: Weekend games would need to be less likely to produce a win for the home team, as compared to weekday games.

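Again we can look at the data on this question.  Consider the following tables of counts:

| | Weekend | Weekday | Total |
|---|---|---|---|
| Sell-out crowd | 13 | 16 | 29 |
| Smaller crowd | 4 | 8 | 12 |
| Total | 17 | 24 | 41 |

| | Weekend | Weekday | Total |
|---|---|---|---|
| Win | 11 | 13 | 24 |
| Loss | 6 | 11 | 17 |
| Total | 17 | 24 | 41 |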

Do the data support the argument for the weekday vs. weekend variable as a confounding variable?  Cite relevant calculations to support your response.  Only half of the argument is supported by the data.  Weekend games were slightly more likely to produce a sell-out crowd than weekday games (13/17 ≈ 0.765 vs. 16/24 ≈ 0.667).  But weekend games were not less likely to produce a home team win than weekday games (11/17 ≈ 0.647 vs. 13/24 ≈ 0.542).  Therefore, the day of week variable does not provide an alternative explanation for why sell-out crowds are less likely to see a win by the home team than smaller crowds.

Students could explore much more with these data*.  For example, they could analyze opponent’s strength as a numerical variable rather than collapsing it into three categories as I did above.

* I provide a link to the datafile at the end of this post.

The second example is based on an activity that I have used for more than 25 years.  My first contribution to the Journal of Statistics Education, from 1994 (here), presented an example for distinguishing association from causation based on the relationship between a country’s life expectancy and its number of people per television.  In updating the example for this post, I chose a different variable and used data as of 2017 and 2018 from the World Bank (here and here)*.

* Again, a link to the datafile appears at the end of this post.

2. The following table lists the life expectancy (in years) and the number of automatic teller machines (ATMs per 100,000 adults) in 24 countries around the world:

a) Identify the observational units and variables.  What types of variables are these?  Which is explanatory and which is response?

Yes, I start with these fundamental questions yet again.  The observational units are countries, the explanatory variable is number of ATMs per 100,000 adults, and the response is life expectancy.  Both variables are numerical.

b) Which of the countries listed has the fewest ATMs per 100,000 adults?  Which has the most?

This question is unnecessary, I suppose, but I think it helps students to engage with the data and context.  Haiti has the fewest ATMs: about 2 per 100,000 adults.  The United States has the most: about 174 ATMs per 100,000 adults.

c) Produce a scatterplot of the data, with the response variable on the vertical axis.

Here’s the scatterplot:

d) Does the scatterplot indicate an association between life expectancy and number of ATMs?  Describe its direction, strength, and form.

Yes, the scatterplot reveals a positive association between a country’s life expectancy and its number of ATMs per 100,000 adults.  This association is moderately strong but not linear.  The form follows a curved pattern.

e) Do you believe that installing more ATMs in countries such as Haiti, Bangladesh, Algeria, and Kenya would cause their inhabitants to live longer?  If not, provide a more plausible, alternative (to cause-and-effect) explanation for the observed association.

This is the key question in the activity, just as with the question in the previous activity about whether sell-out crowds cause the home team to play worse.  Students realize that the answer here is a resounding no.  It’s ridiculous to think that installing more ATMs would cause Haitians to live longer.  Students can tell you the principle that association is not causation.

Students can also suggest a more plausible explanation for the observed association.  They talk about how life expectancy and number of ATMs are both related to the overall wealth, or technological sophistication, of a country.

f) Identify a (potential) confounding variable, and explain how it might relate to the explanatory and response variables.

This is very similar to the previous question.  Here I want students to use the term confounding variable and to express their suggestion as a variable.  Reasonable answers include measures of a country’s wealth or technological sophistication.

This completes the main goal for this activity.  At the risk of detracting from this goal, I often ask an additional question:

g) Would knowing a country’s number of ATMs per 100,000 adults be helpful information for predicting the life expectancy of the country?  Explain.

The point of this question is much harder for students to grasp than the points of the preceding questions.  I often follow up with this hint: Would you make different life expectancy predictions depending on whether a country has 10 vs. 100 ATMs per 100,000 adults?  Students confidently answer yes to this one, so they gradually come to realize that they should also answer yes to the larger question: Knowing a country’s number of ATMs per 100,000 adults is helpful for predicting life expectancy.  I try to convince them that the association is real despite the lack of a cause-and-effect connection.  Therefore, predictions can be enhanced from additional data even without a causal* relationship.

* I greatly regret that the word causal looks so much like the word casual.  To avoid this potential confusion, I say cause-and-effect much more than causal.  But I had just used cause-and-effect in the previous sentence, so that caused me to switch to causal in the last sentence of the paragraph.

This example also leads to extensions that work well on assignments.  For example, I ask students to:

• take a log transformation of the number of ATMs per 100,000 adults,
• describe the resulting scatterplot of life expectancy vs. this transformed variable,
• fit a least squares line to the (transformed) data,
• interpret the value of r^2,
• interpret the slope coefficient, and
• use the line to predict the life expectancy of a country that was not included in the original list.

Here is a scatterplot of life expectancy vs. log (base 10) of number of ATMs per 100,000 adults, with the least squares line:

The relationship between life expectancy and this transformed variable is positive, moderately strong, and fairly linear.  With this log transformation, knowing a country’s number of ATMs per 100,000 adults explains 46.7% of the variability in countries’ life expectancy values.  The slope coefficient of 9.356 means that the model predicts an increase of 9.356 years in life expectancy for a tenfold increase in number of ATMs per 100,000 adults.  Using this line to predict the life expectancy of Costa Rica, which has 74.41 ATMs per 100,000 adults, produces: predicted life expectancy = 60.51 + 9.356×log(74.41) ≈ 60.51 + 9.356×1.87 ≈ 78.02 years.  The actual life expectancy reported for Costa Rica in 2018 is 80.10, so the prediction underestimated by only 2.08 years.
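The prediction and the slope interpretation can be reproduced with a short Python sketch. The intercept, slope, and Costa Rica values are the ones reported above; the function and variable names are hypothetical, chosen only for this illustration.

```python
import math

# Intercept and slope of the least squares line reported in the text,
# fitted to life expectancy vs. log (base 10) of ATMs per 100,000 adults.
INTERCEPT = 60.51
SLOPE = 9.356


def predict_life_expectancy(atms_per_100k):
    """Predicted life expectancy (years) from ATMs per 100,000 adults."""
    return INTERCEPT + SLOPE * math.log10(atms_per_100k)


costa_rica_predicted = predict_life_expectancy(74.41)
costa_rica_actual = 80.10  # 2018 value quoted in the text
residual = costa_rica_actual - costa_rica_predicted

print(f"predicted: {costa_rica_predicted:.2f} years")
print(f"residual (actual - predicted): {residual:.2f} years")

# A tenfold increase in ATMs changes the prediction by the slope,
# because log10(10 * x) = 1 + log10(x).
tenfold_change = predict_life_expectancy(100) - predict_life_expectancy(10)
print(f"change for a tenfold increase: {tenfold_change:.3f} years")
```

The last line makes the slope interpretation concrete: on a log (base 10) scale, one unit of the explanatory variable corresponds to multiplying the number of ATMs by ten.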

Two earlier posts that focused on multivariable thinking also concerned confounding variables.  In post #3 (here), the graduate program was a confounding variable between an applicant’s gender and the admission decision.  Similarly, in post #35 (here), age was a confounding variable between a person’s smoking status and their lung capacity.

In next week’s second part of this two-part series, I will address more fully the issue of drawing causal conclusions.  Along the way I will present two more examples that involve confounding variables, with connections to data exploration and statistical inference.  I hope these questions can lead students to be less confounded by this occasionally vexing* and perplexing topic.

* I doubt that the term vexing variable will catch on, but it does have a nice ring to it!

## #42 Hardest topic, part 2

In last week’s post (here), I suggested that sampling distributions constitute the hardest topic to teach in introductory statistics.  I provided five recommendations for teaching this challenging topic, including an exhortation to hold off on using the term sampling distribution until students understand the basic idea.  I also gave many examples of questions that can help students to develop their understanding of this concept.

In this post I present five more suggestions for teaching the topic of sampling distributions, along with many more examples of questions for posing to students.  As always, such questions appear in italics.  Let’s continue the list …

6. Pay attention to the center of a sampling distribution as well as its shape and variability.

We teachers understandably devote a lot of attention to the shape and variability of a sampling distribution*.  I think we may neglect to emphasize center as much as we should.  With a sample proportion or a sample mean, the mean of its sampling distribution is the population proportion or population mean.  Maybe we do not make a big deal of this result because it comes as no surprise.  But this is the very definition of unbiasedness, which is worth our drawing students’ attention to.

* I’ll say more about these aspects in upcoming suggestions.

We can express the unbiasedness of a sample mean mathematically as:

E(X-bar) = μ

As I have argued before (in post #19, Lincoln and Mandela, part 1, here), this seemingly simple equation is much more challenging to understand than it appears.  The three symbols in this equation all stand for a different mean.  Ask students: Express what this equation says in a sentence.  This is not easy, so I lead my students through this one symbol at a time: The mean of the sample means is the population mean.  A fuller explanation requires some more words: If we repeatedly take random samples from the population, then the mean of the sample means equals the population mean.  This is what it means* to say that the sample mean is an unbiased estimator of the population mean.

* Oops, sorry for throwing another mean at you!

I emphasize to students that this result is true regardless of the population distribution and also for any sample size.  The result is straightforward to derive from properties of expected values.  I show students this derivation in courses for mathematically inclined students but not in a typical Stat 101 course, where I rely on simulations to convince students that the result is believable.

I suspect that we take unbiasedness of a sample proportion and sample mean for granted, but you don’t have to study obscure statistics in order to discover one that is not unbiased.  For example, the sample standard deviation is not unbiased when sampling from a normal distribution*.

* The sample variance is unbiased in this case, but the unbiasedness does not survive taking the square root.

The following graph of sample standard deviations came from simulating 1,000,000 random samples of size 10 from a normal distribution with mean 100 and standard deviation 25:

What aspect of this distribution reveals that the sample standard deviation is not an unbiased estimator of the population standard deviation?  Many students are tempted to point out the slight skew to the right in this distribution.  That’s worth noting, but shape is not relevant to bias.  We need to notice that the mean of these sample standard deviations (≈ 24.32) is not equal to the value that we used for the population standard deviation (σ = 25). Granted, this is not a large amount of bias, but this difference (24.32 vs. 25) is much more than you would expect from simulation variability with one million repetitions*.

* Here’s an extra credit question for students: Use the simulation results to determine a 95% confidence interval for the expected value of the sample standard deviation, E(S).  This confidence interval turns out to be approximately (24.31, 24.33), an extremely narrow interval thanks to the very large number of repetitions.
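A simulation like this one can be sketched in a few lines of Python. This is not the code behind the graph above: it uses far fewer repetitions so that it runs quickly, an arbitrary seed, and hypothetical names. The last lines also compute the theoretical value of E(S) from the standard normal-theory constant usually called c4, a result that the post does not derive.

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed, only so that reruns give the same numbers

MU, SIGMA, N = 100, 25, 10
REPS = 20_000  # the post used 1,000,000; fewer here so the sketch runs quickly


def sample_sd(values):
    """Sample standard deviation with the usual n - 1 divisor."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


sample_sds = [
    sample_sd([random.gauss(MU, SIGMA) for _ in range(N)]) for _ in range(REPS)
]
mean_sd = statistics.fmean(sample_sds)

# Approximate 95% confidence interval for E(S) from the simulated values.
std_error = statistics.stdev(sample_sds) / math.sqrt(REPS)
interval = (mean_sd - 1.96 * std_error, mean_sd + 1.96 * std_error)

# Standard result for normal populations (not derived in the post):
# E(S) = c4 * sigma, where c4 depends only on the sample size.
c4 = math.sqrt(2 / (N - 1)) * math.gamma(N / 2) / math.gamma((N - 1) / 2)
expected_sd = c4 * SIGMA

print(f"mean of simulated sample SDs: {mean_sd:.2f}")
print(f"approximate 95% CI for E(S): ({interval[0]:.2f}, {interval[1]:.2f})")
print(f"theoretical E(S): {expected_sd:.2f}")
```

With fewer repetitions the confidence interval is wider than the one quoted in the footnote, but it should still fall clearly below the population standard deviation of 25.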

7. Emphasize the impact of sample size on sampling variability.

Under suggestion #1 in the previous post (here), I emphasized the key idea that averages vary less than individual values.  The corollary to this is that averages based on larger samples vary less than averages based on smaller samples.  You don’t need to tell students this; you can lead them to tell you by asking them to … (wait for it) … simulate!  Returning to the context of sampling Reese’s Pieces candies, consider these two graphs from simulation analyses (using the applet here), based on a sample size of 25 candies on the left, 100 candies on the right:

What’s the most striking difference between these two distributions?  Some students comment that the distribution on the right is more “filled in” than the one on the left.  I respond that this is a good observation, but I think there’s a more important difference.  Then I encourage students to focus on the different axis scales between the graphs.  Most students recognize that the graph on the right has much less variability in sample proportions than the one on the left.  How do the standard deviations (of the sample proportions) compare between the two graphs?  Students respond that the standard deviation is smaller on the right.  How many times larger is the standard deviation on the left than the one on the right?  Students reply that the standard deviation is about twice as big on the left as the right.  By how many times must the sample size increase in order to cut the standard deviation of the sample proportion in half?  Recalling that the sample sizes were 25 and 100, students realize that they need to quadruple the sample size in order to cut this standard deviation in half.

I lead students through a similar set of questions based on simulating the sampling distribution of a sample mean.  Students again come to realize that the standard deviation of a sample mean decreases as the sample size increases, and also that a four-fold increase in sample size cuts this standard deviation in half.  This leads us to the result:

SD(X-bar) = σ / √n

I follow up by asking: Explain the difference between SD(X-bar) and σ.  Even students who somewhat understand the idea can have difficulty with expressing this well.  The key is that σ represents the standard deviation of the individual values in the population (penny ages, or word lengths, or weights, or whatever), but SD(X-bar) is the standard deviation of the sample means (averages) that would result from repeatedly taking random samples from the population.
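The distinction between σ and SD(X-bar), and the halving effect of quadrupling the sample size, can be checked with a small simulation. The sketch below is in Python and assumes a normal population with mean 100 and standard deviation 25 purely for illustration; the sample sizes of 25 and 100 echo the Reese’s Pieces example, and all names are hypothetical.

```python
import math
import random
import statistics

random.seed(2)  # arbitrary seed for reproducibility

MU, SIGMA = 100, 25  # assumed population, chosen only for this illustration
REPS = 5_000


def sd_of_sample_means(n):
    """Simulate REPS random samples of size n; return the SD of their means."""
    means = [
        statistics.fmean([random.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(REPS)
    ]
    return statistics.stdev(means)


results = {n: sd_of_sample_means(n) for n in (25, 100)}

for n, simulated in results.items():
    theoretical = SIGMA / math.sqrt(n)  # sigma divided by the square root of n
    print(f"n = {n:>3}: simulated SD = {simulated:.2f}, theory = {theoretical:.2f}")

ratio = results[25] / results[100]
print(f"ratio of the two SDs: {ratio:.2f}")
```

Here σ stays at 25 throughout, because it describes individual values in the population, while the standard deviation of the simulated sample means shrinks as the sample size grows.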

Here’s an assessment question* about the impact of sample size on a sampling distribution: Suppose that a region has two hospitals.  Hospital A has about 10 births per day, and hospital B has about 50 births per day.  About 50% of all babies are boys, but the percentage who are boys varies at each hospital from day to day.  Over the course of a year, which hospital will have more days on which 60% or more of the births are boys – A, B, or negligible difference between A and B?

* This is a variation of a classic question posed by psychologists Kahneman and Tversky, described here.

Selecting the correct answer requires thinking about sampling variability.  The smaller hospital will have more variability in the percentage of boys born on a day, so Hospital A will have more days on which 60% or more of the births are boys.  Many students struggle with this question, not recognizing the important role of sample size on sampling variability.
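The size of the difference between the two hospitals can be computed from the binomial distribution. This Python sketch assumes exactly 10 and 50 births every day (the question says "about") and independent births with a 50% chance of a boy; the function name is hypothetical.

```python
from math import comb


def prob_at_least_60_percent_boys(births, p_boy=0.5):
    """Binomial probability that 60% or more of a day's births are boys."""
    # Smallest whole number of boys that is at least 60% of the births,
    # found with integer arithmetic to avoid rounding trouble at the boundary.
    threshold = (6 * births + 9) // 10
    return sum(
        comb(births, k) * p_boy**k * (1 - p_boy) ** (births - k)
        for k in range(threshold, births + 1)
    )


prob_a = prob_at_least_60_percent_boys(10)  # Hospital A
prob_b = prob_at_least_60_percent_boys(50)  # Hospital B

for name, prob in (("A", prob_a), ("B", prob_b)):
    print(f"Hospital {name}: probability {prob:.3f}, "
          f"about {365 * prob:.0f} days per year")
```

Under these assumptions the smaller hospital has such a day several times as often as the larger one, which is the sample-size effect that the question is designed to expose.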

This principle that the variability of a sample statistic decreases as sample size increases applies to many other statistics, as well.  For example, I ask students to think about the sampling distribution of the inter-quartile range (IQR), comparing sample sizes of 10 and 40, under random sampling from a normally distributed population.  How could you investigate this sampling distribution?  Duh, with simulation!  Describe how you would conduct this simulation.  Generate a random sample of 10 values from a normal distribution.  Calculate the IQR of the 10 sample values.  Repeat this for a large number of repetitions.  Produce a graph and summary statistics of the simulated sample IQR values.  Then repeat all these steps with a sample size of 40 instead of 10.

I used R to conduct such a simulation analysis with 1,000,000 repetitions. Using a normally distributed population with mean 100 and standard deviation 25, I obtained the following graphs (sample size of 10 on the left, 40 on the right):

Compare the variability of the sample IQR with these two sample sizes.  Just as with a sample mean, the variability of the sample IQR is smaller with the larger sample size.  Does the sampling variability of the sample IQR decrease as much by quadrupling the sample size as with the sample mean?  No.  We know that the SD of the sample mean is cut in half by quadrupling the sample size.  But the SD of the sample IQR decreases from about 10.57 to 5.96, which is a decrease of 43.6%, a bit less than 50%.
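The simulation steps described above translate almost directly into code. The sketch below uses Python rather than R, with far fewer repetitions and an arbitrary seed. Software packages define sample quartiles in slightly different ways, so the choice of the "inclusive" method here is an assumption, and the resulting standard deviations need not match the values quoted above exactly.

```python
import random
import statistics

random.seed(3)  # arbitrary seed for reproducibility

MU, SIGMA = 100, 25
REPS = 20_000  # the post used 1,000,000 repetitions


def sample_iqr(values):
    """Inter-quartile range, using one common convention for quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def sd_of_sample_iqrs(n):
    """Simulate REPS samples of size n; return the SD of the sample IQRs."""
    iqrs = [
        sample_iqr([random.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(REPS)
    ]
    return statistics.stdev(iqrs)


sd_10 = sd_of_sample_iqrs(10)
sd_40 = sd_of_sample_iqrs(40)
percent_decrease = 100 * (sd_10 - sd_40) / sd_10

print(f"SD of sample IQR, n = 10: {sd_10:.2f}")
print(f"SD of sample IQR, n = 40: {sd_40:.2f}")
print(f"decrease: {percent_decrease:.1f}%")
```

Swapping `sample_iqr` for another statistic, such as the median or the range, gives students a quick way to explore how other statistics respond to sample size.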

8. Note that population size does not matter (much).

As long as the population size is considerably larger than the sample size, the population size has a negligible impact on the sampling distribution.  This revelation runs counter to most students’ intuition, so I think it fails to sink in for many students.  This minimal role of population size also stands in stark contrast to the important role of sample size described under the previous suggestion.

How can we help students to appreciate this point?  Simulation, of course.  In post #19 (Lincoln and Mandela, part 1, here), I described a sampling activity using the 268 words in the Gettysburg Address as the population.  The graph on the left below displays the distribution of word lengths (number of letters) in this population (obtained from the applet here).  For the graph on the right, the population has been expanded to include 40 copies of the Gettysburg Address, producing a population size of 268×40 = 10,720 words.

How do these two population distributions compare?  These distributions are identical, except for the population sizes.  The proportions of words at each length value are the same, so the population means and standard deviations are also the same.  The counts on the vertical axis are the only difference in the two graphs.

Now let’s use the applet to select 10,000 samples, with a sample size of 10 words per sample, from each of these two populations.   The graphs below display the resulting distributions of sample means, on the left from the original population and the right from the 40-times-larger-population:

How do these two distributions of sample means compare?  These two sampling distributions are essentially the same.  They both have a very slight skew to the right.  Both means are very close to the population mean of 4.295 letters per word.  The standard deviations of the sample means are very similar in the two sampling distributions, with a slightly smaller standard deviation from the smaller population.  Here’s the bottom-line question: Did the very different population sizes have much impact on the distribution of the sample means?   No, not much impact at all.

Would the variability in a sample mean or a sample proportion differ considerably, depending on whether you were selecting a random sample of 1000 people in California (about 40 million residents) or Montana (about 1 million residents)?  Once again, the population size barely matters, so the (probably surprising) answer is no.

Speaking of large populations, you might also let students know that sampling from a probability distribution is equivalent to sampling from an infinite population.  This is a subtle point, tricky for many students to follow.  You could introduce this idea of sampling from an infinite process with the Reese’s Pieces applet (here).

Depending on your student audience, you could use this as an opening to discuss the finite population correction factor, given by the following expression, where n represents sample size and N population size:

√((N − n) / (N − 1))

This is the factor by which the standard deviation of the sampling distribution should be adjusted when sampling from a finite population, rather than from an infinite process represented by a probability distribution.  When the population size N is considerably larger than the sample size n, this factor is very close to 1, so the adjustment is typically ignored.  A common guideline is that the population size should be at least 20 (some say 10) times larger than the sample size in order to ignore this adjustment.
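To see how close to 1 this factor is in the examples from this section, here is a minimal Python sketch. It uses the standard form of the finite population correction, the square root of (N − n)/(N − 1); the function name is hypothetical, and the population sizes are the ones mentioned above (268 and 10,720 words, about 1 million and 40 million residents).

```python
import math


def finite_population_correction(n, N):
    """Factor that multiplies the SD of a sampling distribution when a
    sample of size n is drawn without replacement from N units."""
    return math.sqrt((N - n) / (N - 1))


examples = [
    ("Gettysburg Address, one copy", 10, 268),
    ("Gettysburg Address, 40 copies", 10, 10_720),
    ("Montana", 1000, 1_000_000),
    ("California", 1000, 40_000_000),
]

for label, n, N in examples:
    factor = finite_population_correction(n, N)
    print(f"{label}: n = {n}, N = {N:,}, factor = {factor:.5f}")
```

Even in the smallest population the factor stays close to 1, which is consistent with the slightly smaller standard deviation observed when sampling from the single copy of the Gettysburg Address.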

9. Celebrate the wonder!

Sampling variability means that the value of a sample statistic varies from sample to sample.  But a sampling distribution reveals a very predictable pattern to that variation.  We should not be shy about conveying to students how remarkable this is!

Consider three populations represented by the following probability distributions:

Are these three probability distributions similar?  Certainly not.  On the left is a normal distribution, in the middle a shifted exponential distribution, and on the right a discrete distribution with five equally spaced values.  These distributions are not similar in the least, except that I selected these populations to have two characteristics in common: They all have mean 100 and standard deviation 20.

Now let’s use software (R, in this case) to select 100,000 random samples of n = 40 from each population, calculating the sample mean for each sample.  Here are the resulting distributions of 100,000 sample means:
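A simulation along these lines can be sketched in Python instead of R. The specific populations are assumptions chosen to match the description: a normal distribution, an exponential distribution with mean 20 shifted up by 80, and five equally likely values spaced √200 apart (the equal probabilities are an assumption). Each has mean 100 and standard deviation 20. The sketch uses 10,000 repetitions to keep the running time short.

```python
import math
import random
import statistics

random.seed(4)  # arbitrary seed for reproducibility

N = 40
REPS = 10_000  # fewer than the 100,000 used in the post

# Five equally spaced, equally likely values with mean 100 and SD 20.
SPACING = math.sqrt(200)
DISCRETE_VALUES = [100 + k * SPACING for k in (-2, -1, 0, 1, 2)]

populations = {
    "normal": lambda: random.gauss(100, 20),
    "shifted exponential": lambda: 80 + random.expovariate(1 / 20),
    "discrete": lambda: random.choice(DISCRETE_VALUES),
}

summary = {}
for name, draw in populations.items():
    means = [statistics.fmean([draw() for _ in range(N)]) for _ in range(REPS)]
    summary[name] = (statistics.fmean(means), statistics.stdev(means))

print(f"theory: mean 100, SD {20 / math.sqrt(N):.3f}")
for name, (mean, sd) in summary.items():
    print(f"{name}: mean of sample means = {mean:.2f}, SD = {sd:.3f}")
```

Despite the very different population shapes, the three sets of sample means should center near 100 with a standard deviation near 20/√40, and histograms of `means` would show the bell shape that the Central Limit Theorem predicts.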

That example is very abstract, though, so many students do not share my enthusiasm for how remarkable that result is.  Here’s a more specific example: In post #36 (Nearly normal, here), I mentioned that birthweights of babies in the U.S. can be modelled by a normal distribution with mean 3300 grams and standard deviation 500 grams.  Consider selecting a random sample of 400 newborns from this population.  Which is larger: the probability that a single randomly selected newborn weighs between 3200 and 3400 grams, or the probability that the sample mean birthweight in the random sample of 400 newborns is between 3200 and 3400 grams?  Explain your answer.

The second probability is much larger than the first.  The distribution of sample means is much less variable than the distribution of individual birthweights.  Therefore, a sample mean birthweight is much more likely to be within ±100 grams of the mean than an individual birthweight.  These probabilities turn out to be about 0.1585 (based on z-scores of ±0.2) for an individual baby, compared to 0.9999 (based on z-scores of ±4.0) for the sample mean birthweight.
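These two probabilities can be verified with Python’s standard library. The sketch below takes the normal model from the text (mean 3300 grams, standard deviation 500 grams, sample size 400); the helper function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

MU, SIGMA, N = 3300, 500, 400

individual = NormalDist(MU, SIGMA)            # one newborn's birthweight
sample_mean = NormalDist(MU, SIGMA / sqrt(N)) # mean of 400 birthweights


def prob_between(dist, low, high):
    """Probability that a value from dist falls between low and high."""
    return dist.cdf(high) - dist.cdf(low)


p_individual = prob_between(individual, 3200, 3400)
p_sample_mean = prob_between(sample_mean, 3200, 3400)

print(f"single newborn: {p_individual:.4f}")
print(f"sample mean of {N}: {p_sample_mean:.4f}")
```

The only difference between the two calculations is the standard deviation: 500 grams for an individual birthweight versus 500/√400 = 25 grams for the sample mean.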

I think this is remarkable too: Even when we cannot predict an individual value well at all, we can nevertheless predict a sample average very accurately.

Now let’s work with a categorical variable.  Here is the distribution of sample proportions that results from simulating 1,000,000 samples of sample size 1000 per sample, assuming that the population proportion with the characteristic is 0.4 (using Minitab software this time):

What’s remarkable here?  Well, for one thing, this does look amazingly like a bell-shaped curve.  More importantly, let me ask: About what percentage of the sample proportions are within ±0.03 of the assumed population proportion?  The answer is very close to 95%.  So what, why is this remarkable?  Well, let’s make the context the proportion of eligible voters in the United States who prefer a particular candidate in an election.  There’s about a 95% chance that the sample proportion preferring that candidate would be within ±0.03 of the population proportion with that preference.  Even though there are more than 250 million eligible voters in the U.S., we can estimate the proportion who prefer a particular candidate very accurately (to within ±0.03 with 95% confidence) based on a random* sample of only 1000 people!  Isn’t this remarkable?!

* I hasten to add that random is a very important word in this statement. Selecting a random sample of people is much harder to achieve than many people believe.
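The claim that about 95% of sample proportions fall within ±0.03 of 0.4 can be checked without simulation. This Python sketch computes the probability two ways: with a normal approximation, and exactly from the binomial distribution, which treats the population as infinite. The names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

N, P, MARGIN = 1000, 0.4, 0.03

# Normal approximation: SD of the sample proportion is sqrt(p(1 - p) / n).
sd_proportion = sqrt(P * (1 - P) / N)
normal_approx = 2 * NormalDist().cdf(MARGIN / sd_proportion) - 1

# Exact binomial: a sample proportion within 0.03 of 0.4 means
# between 370 and 430 successes (inclusive) out of 1000.
LOW, HIGH = 370, 430
exact = sum(
    comb(N, k) * P**k * (1 - P) ** (N - k) for k in range(LOW, HIGH + 1)
)

print(f"SD of sample proportion: {sd_proportion:.4f}")
print(f"normal approximation: {normal_approx:.4f}")
print(f"exact binomial: {exact:.4f}")
```

Neither calculation involves the population size at all, which is one more way to see that 1000 randomly selected people suffice whether the population has 1 million or 250 million members.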

10. Don’t overdo it.

I stated at the outset of this two-part series that sampling distributions comprise the hardest topic to teach in introductory statistics.  But I’m not saying that this is the most important topic to teach.  I think many teachers succumb to the temptation to spend more time on this topic than is necessary*.

* No doubt I have overdone it myself in this long, two-part series.

Sampling distributions lie at the heart of fundamental concepts of statistical inference, namely p-values and confidence intervals.  But we can lead students to explore and understand these concepts* without teaching sampling distributions for their own sake, and without dwelling on mathematical aspects of sampling distributions.

* Please see previous posts for ideas and examples. Posts #12, #13, and #27 (here, here, and here) use simulation-based inference to introduce p-values. Posts #14 and #15 (here and here) discuss properties of confidence intervals.

This lengthy pair of posts began when I answered a student’s question about the hardest topic to teach in introductory statistics by saying: how the value of a sample statistic varies from sample to sample, if we were to repeatedly take random samples from a population. I conclude by restating my ten suggestions for teaching this challenging topic:

2. Hold off on using the term sampling distribution, and then always add of what.
3. Simulate!
4. Start with the sampling distribution of a sample proportion, then a sample mean.
5. Emphasize the distinctions among three different distributions: population distribution, sample distribution, sampling distribution.
6. Pay attention to the center of a sampling distribution as well as its shape and variability.
7. Emphasize the impact of sample size on sampling variability.
8. Note that population size does not matter (much).
9. Celebrate the wonder!
10. Don’t overdo it.

## #41 Hardest topic, part 1

As I recounted in post #38 (here), a student recently asked what I think is the hardest topic to teach in an introductory statistics course.  My response was: how the value of a sample statistic varies from sample to sample, if we were to repeatedly take random samples from a population.  As you no doubt realize, I could have answered much more succinctly: sampling distributions.

Now I will offer suggestions for helping students to learn about this most challenging topic.  Along the way, in keeping with the name and spirit of this blog, I will sprinkle in many questions for posing to students, as always in italics.

Just as you can’t run before you can walk, you also can’t understand the long-run pattern of variation in a statistic until you first realize that the value of a statistic varies from sample to sample.  I think many teachers consider sampling variability to be so obvious that it does not warrant mentioning.  But have you heard the expression, widely but mistakenly attributed to Einstein*, that “the definition of insanity is doing the same thing over and over and expecting different results”?  Well, if you take a random sample of 10 Reese’s Pieces candies from a large bag, and then do that over and over again, is it crazy to expect to obtain different values for the sample proportions of candies that are orange?  Of course not!  In fact, you would be quite mistaken to expect to see the same result every time.

I think this is a key idea worth emphasizing.  One way to do that is to give students samples of Reese’s Pieces candies*, ask them to calculate the proportion that are orange in their sample, and produce a dotplot on the board to display the variability in these sample proportions.

* Just for fun, I often ask my students: In what famous movie from the 1980s did Reese’s Pieces play a role in the plot?  Apparently the Mars company that makes M&Ms passed on this opportunity, and Hershey Foods jumped at the chance to showcase its lesser-known Reese’s Pieces**.  The answer is E.T. the Extra-Terrestrial.

** See here for a discussion of this famous product-placement story.

As we study sampling variability, I also ask students: Which do you suspect varies less: averages or individual values?  This question is vague and abstract, so I proceed to make it more concrete: Suppose that every class on campus calculates the average height of students in the class.  Which would vary less: the heights of individual students on campus, or the average heights in these classes?  Explain your answer.

I encourage students to discuss this in groups, and they usually arrive at the correct answer: Averages vary less than individual values.  I want students to understand this fundamental property of sampling variability before we embark on the study of sampling distributions.

2. Hold off on using the term sampling distribution, and then always add of what.

The term sampling distribution is handy shorthand for people who already understand the idea*.  But I fear that using this term when students first begin to study the concept is unhelpful, quite possibly harmful to their learning.

* For this reason, I will not hesitate to use the term throughout this post.

I suggest that we keep students’ attention on the big idea: how the value of a sample statistic would vary from sample to sample, if random samples were randomly selected over and over from a population.  That’s quite a mouthful, consisting of 25 words with a total of 117 letters.  It’s a lot easier to say sampling distribution, with only 2 words and 20 letters.  But the two-word phrase does not convey meaning unless you already understand, whereas the 25-word description reveals what we’re studying.  I’ll also point out that the 25 words are mostly short, with an average length of only 4.68 letters per word, compared to an average length of 10.0 letters per word in the two-word phrase*.

* I’m going to resist the urge to determine the number of Scrabble points in these words.  See post #37 (What’s in a name, here) if that appeals to you.

I don’t recommend withholding the term sampling distribution from students forever.  But for additional clarity, I do suggest that we always add of what.  For example, we should say sampling distribution of the sample mean, or of the sample proportion, or of the chi-square test statistic, rather than expecting students to figure out what we intend from the context.

3. Simulate!

Sampling distributions address a hypothetical question: what would happen if …  This hypothetical-ness is what makes the topic so challenging to understand.  I realize, of course, that the mathematics of random variables provides one approach to studying sampling distributions, but I think the core idea of what would happen if … comes alive for students with simulation.  We can simulate taking thousands of samples from a population to see what the resulting distribution of the sample statistic looks like.

What do I recommend next, after you and your students have performed such a simulation?  That’s easy: Simulate again.  What next?  Simulate again, this time perhaps by changing a parameter value, asking students to predict what will change, and then running the simulation to see what does change in the distribution of the sample statistics.  Then what?  Simulate some more!  Now change the sample size, ask students to predict what will change in the sampling distribution, and then examine the results.

I hope that students eventually see so many common features in simulation results that they start to wonder if there’s a way to predict the distribution of a sample statistic in advance, without needing to run the simulation.  At this point, we teachers can play the hero’s role by presenting the mathematical results about approximate normality.  This is also a good time, after students have explored lots of simulation analyses of how a sample statistic varies from sample to sample, to introduce the term sampling distribution.

I think simulation is our best vehicle for helping students to visualize the very challenging concept of what would happen if …  But I hasten to add that simulation is not a panacea.  Even extensive use of simulation does not alter my belief that sampling distributions are the hardest topic in Stat 101.

How can we maximize the effectiveness of simulation for student learning of this topic?  One answer is to make the simulation as visual as possible.  For example, my colleague Beth Chance designed an applet (here) that simulates random selection of Reese’s Pieces by showing candies emerging from a machine:

Students see the candies coming out of the machine and the resulting value of the sample proportion that are orange.  Then they see the graph of sample proportions on the right being generated sample-by-sample as the candy machine dispenses more and more samples.

Another way to make sure that simulation is effective for student learning is to ask (good) questions that help students to understand what’s going on with the simulation.  For example, about the Reese’s Pieces applet: What are the observational units in a single sample?  What is the variable, and what kind of variable is it?  What are the observational units in the graph on the right?  What is the variable, and what kind of variable is it?  In a single sample, the observational units are the individual pieces of candy, and the variable is color, which is categorical.  About the graph on the right, I used only 100 samples in the simulation above so we can see individual dots.  For a student who has trouble identifying the observational units, I give a hint by asking: What does each of the 100 dots represent?  The observational units are the samples of 25 candies, and the variable is the sample proportion that are orange, which is numerical.  These questions can help students to focus on this important distinction between a single sample and a sampling distribution of a statistic.

What do you expect to change in the graph when we change the population proportion (probability) from 0.4 to 0.7?  Most students correctly predict that the entire distribution of sample proportions will shift to the right, centering around 0.7.  Then changing the input value and clicking on “Draw Samples” confirms this prediction.  What do you expect to change in the graph when we change the sample size from 25 to 100?  This is a harder question, but many students have the correct intuition that this change reduces the variability in the distribution of sample proportions.

Here’s another question that tries to draw students’ attention to how simulation works: Which of the inputs has changed between the graph on the left and the graph on the right below – probability, sample size, or number of samples?  What is the impact of that change?

A hint for students who do not spot the correct answer immediately: Do these distributions differ much in their centers or their variability?  The answer here is no, based on both the graph and the means and standard deviations.  (Some students need to be convinced that the difference between the standard deviations here – 0.100 vs. 0.098 – is negligible and unimportant.)  This suggests that the population proportion (probability) and sample size did not change.  The only input value that remains is the correct answer: number of samples.  The scale on the vertical axis makes clear that the graph on the right was based on a larger number of samples than the graph on the left.  This is a subtle issue, the point being that the number of samples, or repetitions, in a simulation analysis is not very important.  It simply needs to be a large number in order to display the long-run pattern as clearly as possible.  The graph on the right is based on 10,000 samples, compared to 1000 samples for the graph on the left.
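The predictions in the last few paragraphs can also be checked outside the applet.  Here is a minimal Python sketch, assuming a simple model in which each candy is orange with a fixed probability; the function name `simulate_proportions` and the seed are hypothetical choices, not part of the applet.

```python
import random
from statistics import mean, stdev

random.seed(2)  # arbitrary seed so the run is repeatable


def simulate_proportions(p, n, num_samples):
    """Simulate num_samples random samples of n candies each and
    return the list of sample proportions that are orange."""
    return [sum(random.random() < p for _ in range(n)) / n
            for _ in range(num_samples)]


# (probability, sample size, number of samples)
settings = [
    (0.4, 25, 1000),    # baseline
    (0.7, 25, 1000),    # change the probability
    (0.4, 100, 1000),   # change the sample size
    (0.4, 25, 10000),   # change the number of samples
]

results = {}
for p, n, num_samples in settings:
    props = simulate_proportions(p, n, num_samples)
    results[(p, n, num_samples)] = (mean(props), stdev(props))
    print(f"p = {p}, n = {n:>3}, samples = {num_samples:>5}: "
          f"mean = {mean(props):.3f}, SD = {stdev(props):.3f}")
```

Comparing the printed means and standard deviations across the four rows shows which input affects the center, which affects the variability, and which affects neither very much.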

4. Start with the sampling distribution of a sample proportion, then a sample mean.

Simulating a sampling distribution requires specifying the population from which the random samples are to be selected.  This need to specify the population is a very difficult idea for students to understand.  In practice, we do not know the population.  In fact, the reason for taking a sample is to learn about the population.  But we need to specify a population to sample from in order to examine the crucial question of what would happen if … When studying a yes/no variable and therefore a sample proportion, you only need to specify one number in order to describe the entire population: the population proportion.  Specifying the population is more complicated when studying a sample mean of a numerical variable, because you need to think about the shape and variability of the distribution for that population.  This relative simplicity is why I prefer to study the sampling distribution of a sample proportion before moving to the sampling distribution of a sample mean.

5. Emphasize the distinctions among three different distributions: population distribution, sample distribution, sampling distribution*.

* It’s very unfortunate that those last two sound so similar, but that’s one of the reasons for suggestion #2, that we avoid using the term sampling distribution until students understand the basic idea.

The best way to emphasize these distinctions is to display graphs of these three distributions side-by-side-by-side.  For example, the following graphs, generated from the applet here, show three distributions:

• ages (in years) in a population of 1000 pennies
• ages in a random sample of 25 pennies
• sample mean ages for 10,000 random samples of 25 pennies each

Which of these graphs has different observational units and variables from the other two graphs?  The graph on the right is the odd one out.  The observational units on the right are not pennies but samples of 25 pennies.  The variable on the right is sample mean age, not individual age.  Identify the number of observational units in each of these graphs.  I admit that this is not a particularly important question, but I want students to notice that the population (on the left) consists of 1000 pennies, the sample (in the middle) has 25 pennies, and the distribution of sample means (on the right) is based on 10,000 samples of 25 pennies each.

Which of the following aspects of a distribution do the three graphs have in common – shape, center, or variability?  The similar mean values indicate that the three graphs have center in common.  Describe how the graphs differ on the other two aspects.  The distribution of sample means on the right has much less variability than the distributions of penny ages on the left and in the middle, again illustrating the principle that averages vary less than individual values.  The distribution of sample means on the right is also quite symmetric and bell-shaped, as compared to the skewed-right distributions of penny ages in the other two graphs.
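The same side-by-side comparison can be sketched in Python.  The penny data from the applet are not reproduced here, so this sketch assumes a made-up, skewed-right population of 1000 ages generated from an exponential distribution; only the sizes (1000 pennies, samples of 25, 10,000 samples) come from the example above.

```python
import random
from statistics import mean, pstdev, stdev

random.seed(3)  # arbitrary seed so the run is repeatable

# Hypothetical skewed-right population: ages (in years) of 1000 pennies
population = [int(random.expovariate(1 / 12)) for _ in range(1000)]

# One random sample of 25 pennies (selected without replacement)
one_sample = random.sample(population, 25)

# Sample mean ages for 10,000 random samples of 25 pennies each
sample_means = [sum(random.sample(population, 25)) / 25
                for _ in range(10_000)]

print(f"population:   n = {len(population):>6}, "
      f"mean = {mean(population):.2f}, SD = {pstdev(population):.2f}")
print(f"one sample:   n = {len(one_sample):>6}, "
      f"mean = {mean(one_sample):.2f}, SD = {stdev(one_sample):.2f}")
print(f"sample means: n = {len(sample_means):>6}, "
      f"mean = {mean(sample_means):.2f}, SD = {stdev(sample_means):.2f}")
```

The third line of output describes samples rather than pennies, which is why its standard deviation is so much smaller than the other two even though its mean is nearly the same as the population mean.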

This issue reminds me of an assessment question that I discussed in post #16 (Questions about cats, here): Which is larger – the standard deviation of the weights of 1000 randomly selected people, or the standard deviation of the weights of 10 randomly selected cats?  This question is not asking about the mean weight of a sample.  It’s simply asking about the standard deviation of individual weights, so the sample size is not relevant.  Nevertheless, many students mistakenly respond that cats’ weights have a larger standard deviation than people’s weights.

Here’s a two-part assessment question that addresses this issue: Suppose that body lengths of domestic housecats (not including the tail) have mean 18 inches and standard deviation 3 inches.  a) Which would be larger – the probability that the length of a randomly selected cat is longer than 20 inches, or the probability that the average length in a random sample of 50 cats is longer than 20 inches, or are these probabilities the same?  b) Which would be larger – the probability that the length of a randomly selected cat is between 17 and 19 inches, or the probability that the average length in a random sample of 50 cats is between 17 and 19 inches, or are these probabilities the same?  To answer these questions correctly, students need to remember that averages vary less than individual values.  So, because a length of 20 inches is greater than the mean, the probability of exceeding 20 inches is greater for an individual cat than for a sample average.  Similarly, the probability of being between 17 and 19 inches is greater for a sample average than for an individual cat, because this interval is centered on the population mean.
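One way to confirm these answers numerically is the following Python sketch.  It assumes that individual body lengths are approximately normal, which the question itself does not state, so that the average length of 50 cats is normal with standard deviation 3/√50.  The variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

MEAN_LENGTH = 18   # inches
SD_LENGTH = 3      # inches
SAMPLE_SIZE = 50

# Assumption: individual lengths are approximately normal
individual = NormalDist(MEAN_LENGTH, SD_LENGTH)
# Sample mean: same center, standard deviation divided by sqrt(n)
sample_mean = NormalDist(MEAN_LENGTH, SD_LENGTH / sqrt(SAMPLE_SIZE))

# a) probability of exceeding 20 inches
a_individual = 1 - individual.cdf(20)
a_sample_mean = 1 - sample_mean.cdf(20)

# b) probability of falling between 17 and 19 inches
b_individual = individual.cdf(19) - individual.cdf(17)
b_sample_mean = sample_mean.cdf(19) - sample_mean.cdf(17)

print(f"a) individual: {a_individual:.4f}   sample mean: {a_sample_mean:.6f}")
print(f"b) individual: {b_individual:.4f}   sample mean: {b_sample_mean:.4f}")
```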

I find that I have more to say about teaching what I consider to be the hardest topic in an introductory statistics course, but this post is already on the long side.  I will provide five more suggestions and several more examples about teaching sampling distributions next week.

## #40 Back to normal

I presented some questions for helping students to understand concepts related to normal distributions in post #36 (here).  I return to normal distributions* in this post by presenting an extended activity (or assignment) that introduces the topic of classification and the concept of trade-offs in error probabilities.  This activity also gives students additional practice with calculating probabilities and percentiles from normal distributions.  As always, questions that I pose to students appear in italics.

* I came up with the “back to normal” title of this post many weeks ago, before so much of daily life was turned upside down by the coronavirus pandemic.  I realize that everyday life will not return to normal soon, but I decided to continue with the title and topic for this post.

Suppose that a bank uses an applicant’s score on some criteria to decide whether or not to approve a loan for the applicant.  Suppose for now that these scores follow normal distributions, both for people who would repay the loan and for those who would not.  Those who would repay the loan have a mean of 70 and standard deviation of 8; those who would not repay the loan have a mean of 30 and standard deviation of 8.

• a) Draw sketches of these two normal curves on the same axis.
• b) Write a sentence or two comparing and contrasting these distributions.
• c) Suggest a decision rule, based on an applicant’s score, for deciding whether or not to give a loan to the applicant.
• d) Describe the two kinds of classification errors that could be made in this situation.
• e) Determine the probabilities of the two kinds of error with this rule.

a) Below is a graph, generated with R, of these two normal distributions.  The red curve on the left pertains to people who would not repay the loan; the green curve on the right is for those who would repay the loan:

b) The two distributions have the same shape and variability.  But their centers differ considerably, with a much larger center for those who would repay the loan.  The scores show very little overlap between the two groups.

c) Most students have the reasonable thought to use the midpoint of the two means (namely, 50) as the cutoff value for a decision rule.  Some students need some help to understand how to express the decision rule: Approve the loan for those with a score of 50 or higher, and deny the loan to those with a score below 50.

d) This is the key question that sets up the entire activity.  Students need to recognize and remember that there are two distinct issues (variables) here: 1) whether or not the applicant would in fact repay the loan, and 2) whether the loan application is approved or denied.  Keeping these straight in one’s mind is crucial to understanding and completing this activity.  I find myself reminding students of this distinction often.

With these two variables in mind, the two kinds of errors are:

• Denying the loan to an applicant who would repay
• Approving the loan for an applicant who would not repay

e) The z-scores are (50 – 70) / 8 = -2.50 for one kind of error and (50 – 30) / 8 = 2.50 for the other.  Both probabilities are approximately 0.006.  At this point I prefer that students use software* for these calculations, so they can focus on the concepts of classification and error probability trade-offs.  These probabilities are shown (but hard to see, because they are so small) in the shaded areas of the following graph, with cyan for the first kind of error and pink for the other:

* Software options include applets (such as here), R, Minitab, Excel, …
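As one more software option, here is a minimal Python sketch of the calculation in part (e), using the standard library’s `NormalDist`.  The variable names are hypothetical; the means, standard deviation, and cutoff are those given above.

```python
from statistics import NormalDist

CUTOFF = 50  # approve at 50 or higher, deny below 50

repay = NormalDist(mu=70, sigma=8)      # scores of those who would repay
no_repay = NormalDist(mu=30, sigma=8)   # scores of those who would not repay

# z-scores of the cutoff within each distribution
z_repay = (CUTOFF - repay.mean) / repay.stdev
z_no_repay = (CUTOFF - no_repay.mean) / no_repay.stdev

# Error 1: an applicant who would repay scores below the cutoff (denied)
p_deny_repayer = repay.cdf(CUTOFF)
# Error 2: an applicant who would not repay scores above the cutoff (approved)
p_approve_non_repayer = 1 - no_repay.cdf(CUTOFF)

print(f"z-scores: {z_repay:.2f} and {z_no_repay:.2f}")
print(f"P(deny an applicant who would repay)        = {p_deny_repayer:.4f}")
print(f"P(approve an applicant who would not repay) = {p_approve_non_repayer:.4f}")
```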

More interesting questions arise when the two score distributions are not separated so clearly.

Now suppose that the credit scores are normally distributed with mean 60 and standard deviation 8 among those who would repay the loan, as compared to mean 40 and standard deviation 12 among those who would not repay the loan.

• f) Draw sketches of these two normal curves on the same axis.
• g) Describe how this scenario differs from the previous one.
• h) Determine the probabilities of the two kinds of error (using the decision rule based on a cutoff value of 50).
• i) Write a sentence or two to interpret the two error probabilities in context.

f) Here is the new graph:

g) The primary change is that the centers of these score distributions are much closer than before, which means that the distributions have much more overlap than before.  This will make it harder to distinguish between people who would repay their loan and those who would not.  A smaller difference is that the variability now differs in the two score distributions, with slightly less variability in the scores of those who would repay the loan.

h) These error probabilities turn out to be approximately 0.106 for the probability that an applicant who would repay the loan is denied (shown in cyan in the graph below), and 0.202 for the probability that an applicant who would not repay is approved (shown in pink):

i) I think this question is important for assessing whether students truly understand, and can successfully communicate, what they have calculated.  There’s a 10.6% chance that an applicant who would repay the loan is denied the loan.  There’s a 20.2% chance that an applicant who would not repay the loan is approved.

Now let’s change the cutoff value in order to decrease one of the error probabilities to a more acceptable level.

• j) In which direction – smaller or larger – would you need to change the decision rule’s cutoff value in order to decrease the probability that an applicant who would repay the loan is denied?
• k) How would the probability of the other kind of error – approving a loan for an applicant who would not repay it – change with this new cutoff value?
• l) Determine the cutoff value needed to decrease the error probability in (j) to 0.05.  Does this confirm your answer to (j)?
• m) Determine the other error probability with this new cutoff rule.  Does this confirm your answer to (k)?
• n) Write a sentence or two to interpret the two error probabilities in context.

j) This question prompts students to think about the goal before doing the calculation.  This kind of error occurs when the score is less than the cutoff value, and we need the error probability to decrease from 0.106 to 0.050.  Therefore, we need a smaller cutoff value, less than the previous cutoff of 50.  Here is a graph of the situation, with the cyan-colored area reduced to 0.05:

k) Using a smaller cutoff value will produce a larger area above that value under the curve for people who would not repay the loan, as shown in pink in the graph above.  Therefore, the second error probability will increase as the first one decreases.

l) Students need to calculate a percentile here.  Specifically, they need to determine the 5th percentile of a normal distribution with mean 60 and standard deviation 8.  They could use software to determine this, or they could realize that the z-score for the 5th percentile is -1.645.  The new cutoff value needs to be 1.645 standard deviations below the mean: 60 – 1.645×8 = 46.84.  This is indeed smaller than the previous cutoff value of 50.  When students mistakenly add 1.645 standard deviations to the mean, I hope that they realize their error by recalling their correct intuition that the cutoff value should be smaller than before.

m) This probability turns out to be approximately 0.284, which is indeed larger than with the previous cutoff (0.202).

n) Now there’s a 5% chance that an applicant who would repay the loan is denied, because that’s how we determined the cutoff value for the decision rule.  This rule produces a 28.4% chance that an applicant who would not repay the loan is approved.
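Parts (l) through (n) can be checked with a short Python sketch, in which `inv_cdf` plays the role of the percentile calculation.  The variable names are hypothetical; the distributions are the ones specified for this second scenario.

```python
from statistics import NormalDist

repay = NormalDist(mu=60, sigma=8)       # scores of those who would repay
no_repay = NormalDist(mu=40, sigma=12)   # scores of those who would not repay

# l) 5th percentile of the scores of those who would repay
new_cutoff = repay.inv_cdf(0.05)

# Error probabilities with the new cutoff
p_deny_repayer = repay.cdf(new_cutoff)                # 0.05 by construction
p_approve_non_repayer = 1 - no_repay.cdf(new_cutoff)  # m) the other error

print(f"new cutoff value: {new_cutoff:.2f}")
print(f"P(deny an applicant who would repay)        = {p_deny_repayer:.3f}")
print(f"P(approve an applicant who would not repay) = {p_approve_non_repayer:.3f}")
```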

Now let’s reduce the probability of the other kind of error.

• o) Repeat parts (j) – (n) with the goal of decreasing the probability that an applicant who would not repay the loan is approved to 0.05.

o) For this goal, the cutoff value needs to become larger than 50, which increases the probability that an applicant who would repay the loan is denied.  The cutoff value is now 1.645 standard deviations above the mean: 40 + 1.645×12 = 59.74.  This increases the other error probability to approximately 0.487.  This means that 48.7% of those who would repay the loan are denied, and 5% of those who would not repay are approved, as depicted in the following graph:

Now that we have come up with three different decision rules, I ask students to think about how we might compare them.

• p) If you consider the two kinds of errors to be equally serious, how might you decide which of the three decision rules considered thus far is the best?

This open-ended question is a tough one for students.  I give them a hint to think about the “equally serious” suggestion, and some suggest looking at the average (or sum) of the two error probabilities.

• q) Calculate the average of the two error probabilities for the three cutoff values that we have considered.
• r) Which cutoff value is the best, according to this criterion, among these three options?

We can organize our previous calculations in a table:
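Because the table is a direct summary of parts (h) through (o), it can be regenerated with a few lines of code.  The following Python sketch is one possible way to do so; the function name `error_probabilities` and the layout of the printed table are assumptions, not the table from the original post.

```python
from statistics import NormalDist

repay = NormalDist(mu=60, sigma=8)       # scores of those who would repay
no_repay = NormalDist(mu=40, sigma=12)   # scores of those who would not repay


def error_probabilities(cutoff):
    """Return (P(deny a repayer), P(approve a non-repayer), their average)."""
    deny_repayer = repay.cdf(cutoff)
    approve_non_repayer = 1 - no_repay.cdf(cutoff)
    return deny_repayer, approve_non_repayer, (deny_repayer + approve_non_repayer) / 2


# The three decision rules considered so far
cutoffs = [
    repay.inv_cdf(0.05),     # makes P(deny a repayer) equal to 0.05
    50,                      # midpoint rule
    no_repay.inv_cdf(0.95),  # makes P(approve a non-repayer) equal to 0.05
]

table = [(cutoff, *error_probabilities(cutoff)) for cutoff in cutoffs]

print(f"{'cutoff':>8} {'deny repayer':>14} {'approve non-repayer':>20} {'average':>9}")
for cutoff, deny_repayer, approve_non_repayer, average in table:
    print(f"{cutoff:8.2f} {deny_repayer:14.3f} {approve_non_repayer:20.3f} {average:9.3f}")

# Rule with the smallest average error probability among these three
best_cutoff = min(table, key=lambda row: row[3])[0]
print(f"smallest average error probability at cutoff {best_cutoff:.2f}")
```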

According to this criterion, the best cutoff value among these three options is 50, because that produces the smallest average error probability.  But of course, these three values are not the only possible choices for the cutoff criterion.  I suggest to students that we could write some code to calculate the two error probabilities, and their average, for a large number of possible cutoff values.  In some courses, I ask them to write this code for themselves; in other courses I provide them with the following R code:

• s) Explain what each line of code does.
• t) Run the code and describe the resulting graph.
• u) Report the optimal cutoff value and its error probabilities.
• v) Write a sentence describing the optimal decision rule.

Asking students to explain what code does is no substitute for asking them to write their own code, but it can assess some of their understanding:

• The first line creates a vector of cutoff values from 30 to 70.
• The second line calculates the probability that an applicant who would repay the loan has a score below the cutoff value and so would mistakenly be denied.
• The third line calculates the probability that an applicant who would not repay the loan has a score above the cutoff value and so would mistakenly be approved.
• The fourth line calculates the average of these two error probabilities.
• The fifth line produces a graph of average error probability as a function of cutoff value.
• The sixth line determines the optimal cutoff value by identifying which minimizes the average error probability.
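For readers who prefer Python, the six steps just described might be sketched as follows.  This is an analogue of the approach, not the R code referred to above: the grid step of 0.01 is an assumption, and the graphing step is left as a comment because plotting would require a third-party library.

```python
from statistics import NormalDist

repay = NormalDist(mu=60, sigma=8)       # scores of those who would repay
no_repay = NormalDist(mu=40, sigma=12)   # scores of those who would not repay

# 1) Cutoff values from 30 to 70 (step of 0.01 is an assumption)
cutoffs = [30 + i / 100 for i in range(4001)]
# 2) P(an applicant who would repay scores below the cutoff and is denied)
deny_repayer = [repay.cdf(c) for c in cutoffs]
# 3) P(an applicant who would not repay scores above the cutoff and is approved)
approve_non_repayer = [1 - no_repay.cdf(c) for c in cutoffs]
# 4) Average of the two error probabilities
average_error = [(d + a) / 2 for d, a in zip(deny_repayer, approve_non_repayer)]
# 5) Graph of average_error against cutoffs would go here (omitted)
# 6) Cutoff value that minimizes the average error probability
best_error, best_cutoff = min(zip(average_error, cutoffs))

print(f"optimal cutoff value: {best_cutoff:.2f}")
print(f"P(deny an applicant who would repay)        = {repay.cdf(best_cutoff):.3f}")
print(f"P(approve an applicant who would not repay) = {1 - no_repay.cdf(best_cutoff):.3f}")
print(f"average error probability                   = {best_error:.3f}")
```

The cutoff found this way depends slightly on the spacing of the grid, so it may differ in the second decimal place from a value found with a different grid.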

Here is the resulting graph:

This graph shows that cutoff values in the neighborhood of 50 are much better (in terms of minimizing average error probability) than cutoff values less than 40 or greater than 60.  The minimum value of average error probability appears to be close to 0.15, achieved at a cutoff value slightly above 50.

The R output reveals that the optimal cutoff value is 50.14, very close to the first cutoff value that we analyzed.  With this cutoff value, the probability of denying an applicant who would repay the loan is 0.109, and the probability of approving an applicant who would not repay is 0.199.  The average error probability with this cutoff value is 0.154.

The optimal decision rule, for minimizing the average of the two error probabilities, is to approve a loan for those with a score of 50.14 or greater, and deny a loan to those with a score of less than 50.14.

• w) Now suppose that you consider denying an applicant who would repay the loan to be three times worse than approving an applicant who would not repay the loan.  What criterion might you minimize in this case?
• x) With this new criterion, would you expect the optimal cutoff value to be larger or smaller than before?  Explain.
• y) Describe how you would modify the code to minimize the appropriate weighted average of the error probabilities.
• z) Run the modified code.  Report the optimal cutoff value and its error probabilities.  Also write a sentence describing the optimal decision rule.

We can take the relative importance of the two kinds of errors into account by choosing the cutoff value that minimizes a weighted average of the two error probabilities.  Because we consider denying an applicant who would repay to be the more serious error, we want to reduce the probability of that error, which means using a smaller cutoff value.

We do not need to change the first three lines of code.  The key change comes in the fourth line, where we must calculate a weighted average instead of an ordinary average.  Then we need to remember to use the weighted average vector in the fifth and sixth lines.  Here is the modified R code:

The graph produced by this code follows:

We see from the graph that the weighted average of error probabilities is minimized with a cutoff value near 45.  The R output reveals the optimal cutoff value to be 45.62.  The associated error probabilities are 0.036 for denying an applicant who would repay, 0.320 for approving an applicant who would not repay, and 0.107 for the weighted average.  The optimal decision rule for this situation is to approve applicants with a score of 45.62 or higher, and deny applicants with a score of less than 45.62.
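A Python sketch of the weighted version follows, again as an analogue rather than the modified R code itself.  It assumes weights of 3/4 and 1/4, which express the “three times worse” judgment and are consistent with the reported weighted average of about 0.107; the grid step of 0.01 is also an assumption.

```python
from statistics import NormalDist

repay = NormalDist(mu=60, sigma=8)       # scores of those who would repay
no_repay = NormalDist(mu=40, sigma=12)   # scores of those who would not repay

# Assumed weights: denying a repayer counts three times as much
WEIGHT_DENY = 3 / 4
WEIGHT_APPROVE = 1 / 4

# The first three steps are unchanged from the unweighted analysis
cutoffs = [30 + i / 100 for i in range(4001)]
deny_repayer = [repay.cdf(c) for c in cutoffs]
approve_non_repayer = [1 - no_repay.cdf(c) for c in cutoffs]

# Fourth step: weighted average instead of ordinary average
weighted_error = [WEIGHT_DENY * d + WEIGHT_APPROVE * a
                  for d, a in zip(deny_repayer, approve_non_repayer)]

# Cutoff value that minimizes the weighted average error probability
best_error, best_cutoff = min(zip(weighted_error, cutoffs))
best_deny = repay.cdf(best_cutoff)
best_approve = 1 - no_repay.cdf(best_cutoff)

print(f"optimal cutoff value: {best_cutoff:.2f}")
print(f"P(deny an applicant who would repay)        = {best_deny:.3f}")
print(f"P(approve an applicant who would not repay) = {best_approve:.3f}")
print(f"weighted average error probability          = {best_error:.3f}")
```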

Whew, I have reached the end of the alphabet*, so I’d better stop there!

* You may have noticed that I had to squeeze a few questions into part (z) to keep from running out of letters.

Most teachers like to give their students an opportunity for lots of practice with normal distribution calculations.  With this activity, I have tried to show that you can provide such practice opportunities while also introducing students to ideas such as classification and error probability trade-offs.

P.S. I have used a version of this activity for many years, but I modified the context for this blog post after watching a session at the RStudio conference held in San Francisco at the end of January.  Martin Wattenberg and Fernanda Viegas gave a very compelling presentation (a recording of which is available here) in which they described an interactive visualization tool (available here) that allows students to explore how different cutoff values affect error probabilities.  Their tool addresses issues of algorithmic fairness vs. bias by examining the impact of different criteria on two populations – labeled as blue and orange people.

P.P.S. I was also motivated to develop this activity into a blog post by a presentation that I saw from Chris Franklin in Atlanta in early February.  Chris presented some activities described in the revised GAISE report for PreK-12 (the updated 2020 version will appear here later this year), including one that introduces the topic of classification.