#19 Lincoln and Mandela, part 1

Two great leaders will be featured in this post and the next: Abraham Lincoln and Nelson Mandela.  Well, to be honest, featured is too strong, but these men provide the background for in-class activities that help students to understand two very important concepts in statistics: random sampling and random assignment.

When I first mention these two terms in class, I suspect that many students only hear random and don’t pay much attention to sampling versus assignment.  I admit that I did not make a big deal of this distinction myself when I started teaching.  But now I try to emphasize that random sampling and random assignment are very different ideas with very different goals.  In a nutshell:

  • Random sampling concerns how to select observational units for a sample.  Random sampling allows for generalizing the results of a sample to the larger population.
  • Random assignment pertains to how observational units come to be in groups to be compared.  Random assignment allows for the possibility of drawing a cause-and-effect conclusion.

This post will discuss random sampling with reference to Lincoln, and the next will concern random assignment while mentioning Mandela.  Along the way we’ll sneak in a touch of history and also some psychology.  As always, questions for students appear in italics.

I begin this activity by asking students to consider the 268 words in this speech as the population of interest:

The natural first question is: What speech is this, and who wrote it?  I’m glad that most students recognize this as Lincoln’s Gettysburg Address.  Then I give these instructions:

  • Circle ten words as a representative sample from this population.
  • For each word in your sample, record how many letters are in the word.
  • Calculate the average (mean) number of letters per word in your sample.
  • Plot your sample average on a dotplot on the board, along with the sample averages of your classmates.

Those who remember post #11 (here) will not be surprised that I next ask students: Identify the observational units and variable, first in your sample and then for the graph on the board.  For the students’ samples of ten words, the observational units are words, and the variable is the length of the word, as measured by number of letters.  But for the dotplot that students produce on the board, the observational units are samples of ten words, and the variable is the average length of a word in the sample.

All of this is prelude to the important question: How can we use the dotplot on the board to tell whether this sampling method (my telling students to circle ten words) is any good?  Before a student will respond, I often have to add: What additional information would you like to know to help you decide whether this sampling method was good?  At this point a student usually responds that they would like to know the average word length in the entire population of 268 words.  I reply: Great idea, and before class I calculated this population average to be 4.295 letters per word.  Then I draw a vertical line through the dotplot at this value.  Here are results from a recent class:

At this point I define sampling bias as a systematic tendency for a sampling method to over-represent some observational units and under-represent others.  Then I ask: Would you say that this sampling method (my asking students to circle ten words) is biased?  If so, in which direction?  How can you tell from the dotplot?

Students recognize that a large majority of the sample averages are greater than the population average.  This means that there’s a systematic tendency for this sampling method to over-represent large words and under-represent small words.  In other words, this sampling method is biased toward over-estimating the average length of a word in the Gettysburg Address.

I emphasize to students that sampling bias is a property of the sampling method, not of any one sample generated by the method.  One illustration of this is to ask: Whose idea was it to select a sample by circling ten words based solely on human judgment?  Students reply, somewhat sheepishly, that it was my idea.  I respond that this is absolutely right: The sampling bias here is my fault, not theirs, because the sampling method was my idea.

Then I ask: Suggest some reasons for why this sampling method turned out to be biased in this way.  Students are quick to suggest good explanations for this sampling bias.  They mention that longer words (such as government, battlefield, and consecrate) convey the meaning of the speech better than smaller words (such as a, by, and for).  Students also suggest that longer words are more likely to be selected because they are just more interesting than smaller words.

Next I ask whether sample size is the problem: Would asking people to circle twenty words (rather than ten) eliminate, or at least reduce, the sampling bias?  Most students realize that taking a larger sample of words would not help with this problem, because people would still be prone to select larger words rather than smaller ones.

Before we conclude this discussion of biased sampling, I ask students to give me a chance to redeem myself by proposing a new sampling method: Suppose that I ask you to close your eyes and point at the page ten times in order to select words for your sample.  Would this sampling method be unbiased?  (After all, doesn’t closing your eyes guarantee a lack of bias?)  Explain.  Most students correctly realize that this sampling method is still biased toward longer words.  You would be more likely to select longer words than shorter ones, because longer words take up more space on the page.
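To see why pointing with closed eyes does not remove the bias, here is a minimal simulation sketch.  It assumes a made-up table of word-length counts (LENGTH_COUNTS, a hypothetical stand-in for the 268 words, not the actual speech), and it assumes that the chance of landing on a word is proportional to its number of letters.  Words are drawn with replacement, since a finger could land on the same word twice.

```python
import random
from statistics import mean

# Hypothetical word-length counts (length -> number of words). These are made
# up for illustration and are NOT the actual Gettysburg Address; they total 268.
LENGTH_COUNTS = {1: 7, 2: 50, 3: 55, 4: 58, 5: 33, 6: 25,
                 7: 15, 8: 6, 9: 10, 10: 4, 11: 5}
population = [length for length, count in LENGTH_COUNTS.items()
              for _ in range(count)]
population_mean = mean(population)


def point_at_page(rng, sample_size=10):
    """Select words with probability proportional to their length (assumption)."""
    return rng.choices(population, weights=population, k=sample_size)


rng = random.Random(19)  # fixed seed so that the run is repeatable
sample_means = [mean(point_at_page(rng)) for _ in range(1000)]
share_above = sum(m > population_mean for m in sample_means) / len(sample_means)

print(f"population mean:      {population_mean:.3f}")
print(f"mean of sample means: {mean(sample_means):.3f}")
print(f"share of sample means above the population mean: {share_above:.1%}")
```

Under these assumptions, most of the simulated sample averages should land above the population average, the same pattern that appears on the classroom dotplot.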

Finally, I ask: Suggest a different sampling method that would be unbiased.  Some students immediately respond with a magic word: random!  So I follow up with: What does it mean to select a random sample of words in this situation?  This question is harder, but eventually a student says that random sampling gives every word, whether it is an interesting word such as dedicate or a boring word like of, the same chance of being selected.

We then proceed to examine properties of random sampling.  Sometimes I ask students to generate their own random samples of words from this population.  One option for doing this is to give them a numbered list of the 268 words and then have them use a random number generator to select their sample.  They can then calculate their sample mean word length and put a dot on a new dotplot on the board, using the same scale as the original dotplot.

Another option is to move directly to using an applet (available here) to select random samples of words.  This applet starts by showing the distribution of word lengths in the population, which is skewed to the right:

You can select random samples by first clicking on Show Sampling Options.  I ask students to start by selecting one random sample of 5 words, which produces a result such as:

The applet calculates the sample mean word length for this sample and plots that on a graph.  Then asking the applet to select 999 more samples results in a graph of sample means that looks like:

Now we’re ready for the key questions: Does this distribution of sample means indicate sampling bias or unbiasedness of this random sampling method?  What aspect of the distribution leads you to this conclusion?  The shape and variability in this distribution are completely irrelevant to the issue of sampling bias.  To address this issue, we focus on the center of the distribution.  We see that the center of the distribution of sample means is very close to the population mean.  We can quantify this by noting that the mean of the 1000 sample means is 4.336 letters/word, which is quite close to the population mean of 4.295 letters/word. Therefore, this random sampling method appears to be unbiased.

Before moving on, I want to point out how challenging the following statement can be for students:

The mean of the sample means is the population mean.

This sentence contains only ten words, but three of them are the word mean(s)!  We can rewrite this statement mathematically, using common notation, as:

$$\mu_{\bar{x}} = \mu$$

Notice that this equation contains only three symbols (in addition to the equals sign), but all three of them describe a mean!  It takes considerable time and careful thought for students to recognize and understand what these three means are and how they relate to each other:

  • The population mean.  For the population of 268 words in the Gettysburg Address, the value of the population mean is 4.295 letters/word.
  • The sample mean, which varies from sample to sample.  Each student calculated his/her own sample mean and represented it with a dot on the board. The first random sample generated by the applet above had a sample mean of 3.6 letters/word.  The applet then generated 999 more random samples and calculated the sample mean number of letters/word for each one.
  • The mean of the sample means.  We could have calculated this for the students’ sample means in class; we did not bother, but we know from the graph that the mean of the sample means would have been much greater than 4.295.  The applet did calculate the mean of the 1000 sample means that it generated; the mean of these sample means turned out to be 4.336 letters/word.  If we went on to generate all possible random samples, in the long run the mean of the sample means would be 4.295, the same value as the population mean.
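The claim about generating all possible random samples can be checked exactly on a population small enough to enumerate.  This sketch assumes a hypothetical population of six word lengths (made up for illustration) and samples of size 2; every name in it is illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Tiny hypothetical population of six word lengths (made up for illustration).
population = [1, 2, 3, 4, 6, 10]
sample_size = 2

population_mean = Fraction(sum(population), len(population))

# Every possible sample of 2 distinct words; random sampling makes each one
# equally likely, so a plain average over all of them is the long-run average.
all_samples = list(combinations(population, sample_size))
sample_means = [Fraction(sum(sample), sample_size) for sample in all_samples]
mean_of_sample_means = sum(sample_means) / len(sample_means)

print(f"number of possible samples: {len(all_samples)}")
print(f"population mean:            {population_mean}")
print(f"mean of all sample means:   {mean_of_sample_means}")
```

Exact fractions are used so that the two means can be compared for equality rather than for approximate closeness.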

My next question for students: Consider taking random samples of size 20 words per sample, rather than 5 words per sample.  How (if at all) would you expect the distribution of sample means to change, in terms of center, variability, and shape?  After students think about this, discuss it among themselves, and record their predictions, we use the applet to make this change, which produces a result such as:

We see that the center of this distribution is still close to the population mean of 4.295 letters/word.  Most students expect this, because this simply shows that random sampling is still unbiased with a larger sample size.  The key finding is that the variability of sample means is smaller with a larger sample size.  How can we tell?  One way is that the sample means now range from about 3 to 6 letters/word, whereas before (with a smaller sample size of 5) they ranged from about 2 to 8 letters/word.  Even better, we can note that the standard deviation of the sample means is now about 0.463, which is much less than its value of 0.945 with the smaller sample size.  The shape of the distribution of sample means is a bit more symmetric and normal-looking with the larger sample size than with the smaller sample size, much less skewed than the distribution of the population.
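The comparison between sample sizes can be sketched without the applet.  The population below is a made-up table of word-length counts (a hypothetical stand-in for the 268 words, not the actual speech), so the numbers it prints will not match the applet's values; only the pattern is of interest.

```python
import random
from statistics import mean, pstdev

# Hypothetical word-length counts (length -> number of words). These are made
# up for illustration and are NOT the actual Gettysburg Address; they total 268.
LENGTH_COUNTS = {1: 7, 2: 50, 3: 55, 4: 58, 5: 33, 6: 25,
                 7: 15, 8: 6, 9: 10, 10: 4, 11: 5}
population = [length for length, count in LENGTH_COUNTS.items()
              for _ in range(count)]
population_mean = mean(population)


def simulate(rng, sample_size, repetitions=1000):
    """Return (center, spread) of the sample means from many random samples."""
    means = [mean(rng.sample(population, sample_size))  # without replacement
             for _ in range(repetitions)]
    return mean(means), pstdev(means)


rng = random.Random(268)  # fixed seed so that the run is repeatable
results = {n: simulate(rng, n) for n in (5, 20)}

print(f"population mean: {population_mean:.3f}")
for n, (center, spread) in results.items():
    print(f"n = {n:2d}: mean of sample means = {center:.3f}, SD = {spread:.3f}")
```

With either sample size the center should sit near the population mean, while the standard deviation of the sample means should be noticeably smaller for samples of 20 words.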

This last point foreshadows the concept of a sampling distribution of a sample mean and the Central Limit Theorem.  I think this context and applet provide a great opportunity to study those ideas*, but at this point I prefer to keep the focus on the topics of sampling bias and random sampling.

* One feature that I particularly like about this applet is that it displays three distributions at once, which are crucial (and challenging) for students to keep in mind when studying sampling distributions:

  • Population distribution (of word lengths)
  • Sample distribution (of word lengths)
  • Sampling** distribution (of average word lengths in a sample)

** It’s very unfortunate that the modifier words sample and sampling are so similar, because the difference between the distributions they describe is precisely what students need to understand.  Perhaps we should avoid using the term sampling distribution and instead say distribution of sample averages.  It’s nice to be able to use shorthand when speaking with colleagues who understand the ideas, but in this case the extra words provide clarity for students who are just beginning to consider the ideas.

Before leaving the topic of sampling bias and random sampling, I ask a few more questions of my students, all in the context of selecting a sample of students at our university to complete a survey:

  • Would it be easy or hard to select a random sample of 50 Cal Poly students?

It takes a while for some students to realize that selecting such a random sample would be very hard to achieve.  It’s unlikely that university administrators would provide a list of all students at the university.  Having access to such a list would enable us to select a random sample of students’ names, but we would still face the challenges of contacting them successfully and then, even more problematic, convincing them to respond to our survey.

  • Suppose that you select a sample of Cal Poly students by standing in front of the library and approaching 50 students who pass by.  Would this constitute a random sample of Cal Poly students?  What if you stand in front of the recreation center instead and approach 50 students who pass by?

Most students realize that this sampling method (standing in one location and recruiting passersby) does not constitute random sampling.  Some students would be more likely to be selected than others, in part because they are out-and-about on campus more often.  You would also probably be more inclined to approach students who appear to be …, well, …, approachable, as opposed to students who look more intimidating or less friendly.  Even though the word random is used in an everyday sense to mean anything that is unplanned or unstructured, random sampling has a technical meaning.

  • Even though the convenience sampling described above is not random, could it nevertheless result in a sample that is representative of the population of Cal Poly students?  Identify a variable for which you would not be willing to consider such a convenience sample (as described above) to be representative of the population of Cal Poly students.  Also identify a variable for which you would be willing to consider such a sample (as described above) to be representative of the population of Cal Poly students.

We should certainly not consider a convenience sample, selected from students who pass by the library or recreation center, to be representative of the population for most variables, such as how often a student uses the recreation center per week, and whether or not a student knows where the library is on campus.  We should also be wary for variables about the student’s major, or how many hours they study per week, or how much sleep they get per night.  But there’s probably no reason to doubt that such a sample is representative of the population for a variable such as blood type.

I have used far more than 268 words to write this post.  Clearly I am much less economical with words than Abraham Lincoln in his Gettysburg Address.  I look forward to name-dropping Nelson Mandela into the next post, which will feature random assignment and discuss how that is quite different from random sampling.

P.S. Beth Chance and I developed the Gettysburg Address activity based on the famous “random rectangles” activity developed by Dick Scheaffer and others.  As I told Dick when I interviewed him for the Journal of Statistics Education (here), I suspect that random rectangles is the most widely used activity for teaching statistics of all time, at least among activities that do not involve M&M candies.  You can read more about the genesis of the random rectangles activity in this JSE article (here).

P.P.S. This website (here) provides six different versions of the Gettysburg Address, with minor variations (and slightly different numbers of words) among them.  The one used above is the Hay copy.

#18 What do you expect?

I argued in post #6 (here) that the most dreaded two-word term in statistics is standard deviation.  In this post I discuss the most misleading two-word term in statistics.  There’s no doubt in my mind about which term holds this distinction.  What do you expect me to say?

If you expect me to say expected value, then your expectation is correct.

Below are four examples for helping students to understand the concept of expected value and avoid being misled by its regrettable name.  You’ll notice that I do not even use that misleading name until the end of the second example.  As always, questions that I pose to students appear in italics.

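1. Let’s return to the random babies activity from post #17 (here).  I used the applet (here) to generate one million repetitions of distributing four babies to their mothers at random, with the following results:

  • 0 matches: 375,124 repetitions
  • 1 match: 332,938 repetitions
  • 2 matches: 250,014 repetitions
  • 4 matches: 41,924 repetitions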

I ask students: Calculate the average number of matches per repetition.  I usually get some blank stares, so I ask: Remind me how to calculate an average.  A student says to add up the values and then divide by the number of values.  I respond: Yes, that’s all there is to it, so please do that with these one million values.  At this point the blank stares resume, along with mutterings that they can’t possibly be expected* to add a million values on their own.

* There’s that word again.

But of course adding these one million values is not so hard at all: Adding the 375,124 zeroes takes no time, and then adding the 332,938 ones takes barely a moment.  Then you can make use of a wonderful process known as multiplication to calculate the entire sum: 0×(375,124) + 1×(332,938) + 2×(250,014) + 4×(41,924) = 1,000,662.  Dividing by 1,000,000 just involves moving the decimal point six places to the left.  This gives 1.000662 as the average number of matches in the one million simulated repetitions of this random process of distributing four babies to their mothers at random.
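The same arithmetic can be written as a short sketch that uses only the four counts quoted above; the variable names are illustrative.

```python
# Simulation counts quoted in the text: number of matches -> number of repetitions.
counts = {0: 375_124, 1: 332_938, 2: 250_014, 4: 41_924}

repetitions = sum(counts.values())
total_matches = sum(matches * count for matches, count in counts.items())
average = total_matches / repetitions

# The same average, written as a weighted average of the possible values,
# with each value weighted by the proportion of repetitions that produced it.
weighted_average = sum(matches * (count / repetitions)
                       for matches, count in counts.items())

print(f"repetitions:      {repetitions:,}")
print(f"total matches:    {total_matches:,}")
print(f"average:          {average}")
print(f"weighted average: {weighted_average:.6f}")
```

Writing the calculation both ways makes it easier to see that dividing the grand total by the number of repetitions and weighting each value by its proportion are the same computation.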

Then I ask: What do you think the long-run average (number of matches per repetition) will be if we continue to repeat this random process forever and ever?   Most students predict that the long-run average will be 1.0, and I tell them that this is exactly right.  I also show the applet’s graph of the average number of matches as a function of number of repetitions (for the first 1000 repetitions), which shows considerable variation at first but then gradual convergence toward a long-run value:

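At this point we discuss how to calculate the theoretical long-run average based on exact probabilities rather than simulation results.  To derive the formula, let’s rewrite the calculation of the average number of matches in one million repetitions from above:

(0×375,124 + 1×332,938 + 2×250,014 + 4×41,924) / 1,000,000 = 0×(375,124/1,000,000) + 1×(332,938/1,000,000) + 2×(250,014/1,000,000) + 4×(41,924/1,000,000) = 1.000662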

Notice that this calculation is a weighted average, where each possible value (0, 1, 2, 4) is weighted by the proportion of repetitions that produced the value.  Now recall the exact probabilities that we calculated in post #17 (here) for this random process:

  • 0 matches: probability 9/24
  • 1 match: probability 8/24
  • 2 matches: probability 6/24
  • 4 matches: probability 1/24

and then replace the proportions in the weighted average calculation with the exact, theoretical probabilities:

0×(9/24) + 1×(8/24) + 2×(6/24) + 4×(1/24)

This expression works out to be 24/24, which is better known as the value 1.0.  This is the theoretical long-run average number of matches that would result from repeating this random process forever and ever.  In general, a theoretical long-run average is the weighted average of the possible values of the random process, using probabilities as weights.  We can express this in a formula as follows, where LRA represents long-run average, x represents the possible values, and p(x) represents their probabilities:

LRA = Σ x × p(x), with the sum taken over all possible values x

Back to the random babies context, next I ask:

  • Is this long-run average the most likely value to occur?  Students recognize that the answer is no, because we are slightly more likely to obtain 0 matches than 1 match (because probability 9/24 is greater than 8/24).
  • How likely is the long-run average value to occur?  We would obtain exactly 1 match one-third (about 33.33%) of the time, if we were to repeat the random process over and over.
  • Do you expect the long-run average value to occur if you conduct the random babies process once?  Not really, because we are twice as likely not to obtain 1 match (probability 16/24) as to obtain 1 match (probability 8/24).
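The exact probabilities and the long-run average for random babies can also be checked by brute force.  This sketch lists all 24 ways to hand four babies to four mothers and counts the matches in each; it is one possible check, not necessarily how the probabilities were derived in post #17.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

# Each permutation is one way to distribute babies 0-3 to mothers 0-3;
# a match occurs when mother i receives baby i.
match_counts = Counter(
    sum(baby == mother for mother, baby in enumerate(assignment))
    for assignment in permutations(range(4))
)
total = sum(match_counts.values())

probabilities = {matches: Fraction(count, total)
                 for matches, count in sorted(match_counts.items())}
long_run_average = sum(matches * p for matches, p in probabilities.items())

for matches, count in sorted(match_counts.items()):
    print(f"{matches} matches: {count}/{total}")
print(f"long-run average: {long_run_average}")
print(f"probability of exactly 1 match: {probabilities[1]}")
```

Because all 24 arrangements are equally likely, counting arrangements gives the probabilities directly, and the weighted average then follows from the formula for the long-run average.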

2. Now a very generic example: Consider rolling a fair, ordinary, six-sided die (or number cube), and then observing the number of dots on the side that lands up.  Calculate and interpret the long-run average value from this random process.

Saying that the die is fair means that the six possible outcomes should be equally likely, so the possible values and their probabilities are:

  • 1, 2, 3, 4, 5, or 6 dots, each with probability 1/6

We can calculate the long-run average to be: LRA = 1×(1/6) + 2×(1/6) + 3×(1/6) + 4×(1/6) + 5×(1/6) + 6×(1/6) = 21/6 = 3.5.  This means that if we were to roll the die for a very large number of rolls, the average number of dots appearing on the side that lands up would be very close to 3.5.
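Here is a minimal sketch of the die calculation, along with a simulation of many rolls for comparison.  The number of rolls (100,000) and the seed are arbitrary choices made for illustration.

```python
import random
from fractions import Fraction
from statistics import mean

faces = range(1, 7)

# Weighted average of the possible values, using probabilities as weights.
long_run_average = sum(face * Fraction(1, 6) for face in faces)

rng = random.Random(18)  # fixed seed so that the run is repeatable
rolls = [rng.randint(1, 6) for _ in range(100_000)]

print(f"theoretical long-run average: {long_run_average} = {float(long_run_average)}")
print(f"average of simulated rolls:   {mean(rolls):.4f}")
print(f"rolls equal to the long-run average: {rolls.count(long_run_average)}")
```

The simulated average should land close to the theoretical value, even though no single roll can ever equal it.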

Now I ask the same three questions from the end of the previous example:

  • Is this long-run average the most likely value to occur in the die-rolling process?  Of course not, because it’s downright impossible to obtain 3.5 dots when rolling a die. 
  • How likely is the long-run average value to occur?  Duh, like I just said, it’s impossible!  The probability is zero.
  • Do you expect the long-run average value to occur if you roll a die once?  Once more, with feeling: Of course not!

Students naturally wonder why I asked these seemingly pointless questions for the die-rolling example.  Here’s where things get a bit dicey (pun intended).  I sheepishly reveal to students that the common term for this quantity that we have been calculating and interpreting is expected value, abbreviated as EV or E(X).

Let’s ask those questions again about the die-rolling process, but now using standard terminology:

  • Is the expected value the most likely value to occur in the die-rolling process? 
  • How likely is the expected value to occur? 
  • Do you expect the expected value to occur if you conduct the die rolling process once? 

The answers to these questions are the same as before: No, of course not, the expected value (3.5 dots) is certainly not expected, because it’s impossible!

Isn’t this ridiculous?  Can we blame students for getting confused between the expected value and what we expect to happen?  As long as we’re stuck with this horribly misleading term, it’s incumbent on us to help students understand that the expected value of a random process does not in any way, shape, or form mean the value that we expect to occur when we conduct the random process.  How can we do this?  You already know my answer: Ask good questions!

3. Now let’s consider the gambling game of roulette.  When an American roulette wheel (as shown below) is spun, a ball eventually comes to rest in one of its 38 numbered slots.  The slots have colors: 18 red, 18 black, and 2 green.

The simplest version of the game is that you can bet on either a number or a color:

  • If you bet $1 on a color (red or black) and the ball lands in a slot of that color, then you get $2 back for a net profit of $1.  Otherwise, your net profit is -$1.
  • If you bet $1 on a number and the ball lands in that number’s slot, then you get $36 back for a net profit of $35.  Otherwise, your net profit is -$1.

I ask students to work through the following questions in groups, and then we discuss the answers:

  • a) List the possible values of your net profit from a $1 bet on a color, and also report their associated probabilities.  The possible values for net profit are +1 (if the ball lands on your color) and -1 (if it lands on a different color).  The wheel contains 18 slots of your color, so the probability that your net profit is +1 is 18/38, which is about 0.474.  The probability that your net profit is -1 is therefore 20/38, which is about 0.526.  Not surprisingly, it’s a little more likely that you’ll lose than win.
  • b) Determine the expected value of the net profit from betting $1 on a color.  The expected value is $1×(18/38) + (-$1)×(20/38) = -$2/38, which is about -$0.053.
  • c) Interpret what this expected value means.  If you were to bet $1 on a color for a large number of spins of the wheel, then your average net profit would be very close to a loss of $0.053 (about a nickel) per spin.
  • d) Repeat (a)-(c) for betting $1 on a number.  The possible values of net profit are now +35 (if the ball lands on your number) and -1 (otherwise).  The respective probabilities are 1/38 (about 0.026) and 37/38 (about 0.974).  The expected value of net profit is $35×(1/38) + (-$1)×(37/38) = -$2/38, which is about -$0.053.  If you were to bet $1 on a number for a large number of spins of the wheel, then your average net profit would be very close to a loss of $0.053 (about a nickel) per spin.
  • e) How do the expected values of the two types of bets compare?  Explain what this means.  The two expected values are identical.  This means that if you bet for a large number of spins, your average net profit will be to lose about a nickel per spin, regardless of whether you bet on a color or number.
  • f) Are the two types of bets identical?  (Would you get the same experience by betting on a color all evening vs. betting on a number all evening?)  If not, explain their primary difference.  No, the bets are certainly not identical, even though their expected values are the same.  If you bet on a number, you will win much less often than if you bet on a color, but your winning amount will be much larger when you do win.
  • g) The expected value from a $1 bet might seem too small to form the basis for the huge gambling industry.  Explain how casinos can make substantial profits based on this expected value.  Remember that the expected value is the average net profit per dollar bet per spin.  Casinos rely on attracting many customers and keeping them gambling for a large number of spins.  For example, if 1000 gamblers make $1 bets on 1000 spins each, then the expected value* of the casino’s income would be 1000×1000×($2/38) ≈ $52,631.58.

* I have resisted the temptation to use a shorthand term such as expected income or expected profit throughout this example.  I believe that saying expected value every time might help students to avoid thinking of “expected” in the everyday sense of the word when we intend its technical meaning.
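The roulette calculations in (a) through (g) can be sketched with exact fractions.  The helper name expected_value and the list-of-pairs representation are illustrative choices; the payoffs and probabilities are the ones stated above.

```python
from fractions import Fraction


def expected_value(outcomes):
    """Weighted average of net profits, using probabilities as weights."""
    return sum(net_profit * probability for net_profit, probability in outcomes)


# (net profit in dollars, probability) for a $1 bet on an American wheel.
color_bet = [(1, Fraction(18, 38)), (-1, Fraction(20, 38))]
number_bet = [(35, Fraction(1, 38)), (-1, Fraction(37, 38))]

ev_color = expected_value(color_bet)
ev_number = expected_value(number_bet)

# Expected value of the casino's income if 1000 gamblers each make 1000 $1 bets:
# the casino gains what the gamblers lose on average.
casino_income = 1000 * 1000 * -ev_color

print(f"color bet:  {ev_color} = {float(ev_color):.4f} dollars per spin")
print(f"number bet: {ev_number} = {float(ev_number):.4f} dollars per spin")
print(f"chance of winning: color {float(color_bet[0][1]):.3f}, "
      f"number {float(number_bet[0][1]):.3f}")
print(f"casino's expected income: ${float(casino_income):,.2f}")
```

The two bets share the same expected value, while the printed chances of winning show how different the experience of an evening spent on each bet would be.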

4. I like to use this question on exams to assess students’ understanding of expected value: At her birthday party, Sofia swings at a piñata repeatedly until she breaks it.  Her mother tells Sofia that she has determined the probabilities associated with the possible number of swings that could be needed for Sofia to break the piñata, and she has calculated the expected value to be 2.4.  Interpret what this expected value means.

A good answer is: If Sofia were to repeat this random process (of swinging until she breaks a piñata) for a very large number of piñatas, then the long-run average number of swings that she would need will be very close to 2.4 swings per piñata.

I look for three components when grading students’ interpretations: 1) long-run, 2) average, and 3) context.  Let’s consider each of these:

  1. The phrase long-run does not need to appear, but the idea of repeating the random process over and over for a large number of repetitions is essential.  I strongly prefer that the interpretation describe what “long run” means by indicating what would be repeated over and over (in this case, the process of swinging at a piñata until it breaks).  
  2. The idea of “average” is absolutely crucial to interpreting expected value, but it’s not uncommon for students to omit this word from their interpretations.   The interpretation makes no sense if it says that Sofia will take 2.4 swings in the long run.
  3. As is so often the case in statistics, context is key.  If a student interprets the expected value as “long-run average” with no other words provided, then the student has not demonstrated an ability to apply the concept to this situation.  In fact, a student could respond “long-run average” without bothering to read a single word about the context.

I also think it’s helpful to ask students, especially those who are studying to become teachers themselves, to critique hypothetical responses to interpreting the expected value, such as:

  • A. The long-run average is 2.4 swings.
  • B. The average number of swings that Sofia needs to break the piñata is 2.4 swings.
  • C. If Sofia were to repeat this random process (of swinging until she breaks a piñata) for a very large number of piñatas, then she would need very close to 2.4 swings in the long run.

I would assign partial credit to all three of these responses. Response A is certainly succinct, and it includes the all-important long-run average.  But the only mention of context in response A is the word “swings,” which I do not consider sufficient for describing the process of Sofia swinging at a piñata until it breaks.  Response B sounds pretty good, as it mentions average and describes the context well, but it is missing the idea of long-run.  Adding “if she were to repeat this process with a large number of piñatas” to response B would make it worthy of full credit.  Response C is so long and generally on-point that it might be hard to see what’s missing.  But response C makes no mention of the word or idea of average.  All that’s needed for response C to deserve full credit is to add “on average” at the end or insert “an average of” before “2.4 swings.”

Can we expect students to understand what expected value means?  Sure, but the unfortunate name makes this more of a challenge than it should be, as it practically begs students to confuse expected value with the value that we expect to occur.  As much as I would like to replace this nettlesome term with long-run average and its abbreviation LRA, I don’t expect* this alternative to catch on in the short term.  But I do hope that this change catches on before the long run arrives.

* Sorry, I can’t stop using this word!

P.S. I borrowed the scenario of Sofia swinging at a piñata from my colleague John Walker, who proposed this context in an exam question with more involved probability calculations.