Two great leaders will be featured in this post and the next: Abraham Lincoln and Nelson Mandela. Well, to be honest, featured is too strong, but these men provide the background for in-class activities that help students to understand two very important concepts in statistics: random sampling and random assignment.
When I first mention these two terms in class, I suspect that many students only hear random and don’t pay much attention to sampling versus assignment. I admit that I did not make a big deal of this distinction myself when I started teaching. But now I try to emphasize that random sampling and random assignment are very different ideas with very different goals. In a nutshell:
- Random sampling concerns how to select observational units for a sample. Random sampling allows for generalizing the results of a sample to the larger population.
- Random assignment pertains to how observational units come to be in groups to be compared. Random assignment allows for the possibility of drawing a cause-and-effect conclusion.
This post will discuss random sampling with reference to Lincoln, and the next will concern random assignment while mentioning Mandela. Along the way we’ll sneak in a touch of history and also some psychology. As always, questions for students appear in italics.
I begin this activity by asking students to consider the 268 words in this speech as the population of interest:
The natural first question is: What speech is this, and who wrote it? I’m glad that most students recognize this as Lincoln’s Gettysburg Address. Then I give these instructions:
- Circle ten words as a representative sample from this population.
- For each word in your sample, record how many letters are in the word.
- Calculate the average (mean) number of letters per word in your sample.
- Plot your sample average on a dotplot on the board, along with the sample averages of your classmates.
Those who remember post #11 (here) will not be surprised that I next ask students: Identify the observational units and variable, first in your sample and then for the graph on the board. For the students’ samples of ten words, the observational units are words, and the variable is the length of the word, as measured by number of letters. But for the dotplot that students produce on the board, the observational units are samples of 10 words, and the variable is the average length of a word in the sample.
All of this is prelude to the important question: How can we use the dotplot on the board to tell whether this sampling method (my telling students to circle ten words) is any good? Before a student will respond, I often have to add: What additional information would you like to know to help you decide whether this sampling method was good? At this point a student usually responds that they would like to know the average word length in the entire population of 268 words. I reply: Great idea, and before class I calculated this population average to be 4.295 letters per word. Then I draw a vertical line through the dotplot at this value. Here are results from a recent class:
At this point I define sampling bias as a systematic tendency for a sampling method to over-represent some observational units and under-represent others. Then I ask: Would you say that this sampling method (my asking students to circle ten words) is biased? If so, in which direction? How can you tell from the dotplot?
Students recognize that a large majority of the sample averages are greater than the population average. This means that there’s a systematic tendency for this sampling method to over-represent large words and under-represent small words. In other words, this sampling method is biased toward over-estimating the average length of a word in the Gettysburg Address.
I emphasize to students that sampling bias is a property of the sampling method, not of any one sample generated by the method. One illustration of this is to ask: Whose idea was it to select a sample by circling ten words based solely on human judgment? Students reply, somewhat sheepishly, that it was my idea. I respond that this is absolutely right: The sampling bias here is my fault, not theirs, because the sampling method was my idea.
Then I ask: Suggest some reasons for why this sampling method turned out to be biased in this way. Students are quick to suggest good explanations for this sampling bias. They mention that longer words (such as government, battlefield, and consecrate) convey the meaning of the speech better than smaller words (such as a, by, and for). Students also suggest that longer words are more likely to be selected because they are just more interesting than smaller words.
Next I ask whether sample size is the problem: Would asking people to circle twenty words (rather than ten) eliminate, or at least reduce, the sampling bias? Most students realize that taking a larger sample of words would not help with this problem, because people would still be prone to select larger words rather than smaller ones.
Before we conclude this discussion of biased sampling, I ask students to give me a chance to redeem myself by proposing a new sampling method: Suppose that I ask you to close your eyes and point at the page ten times in order to select words for your sample. Would this sampling method be unbiased? (After all, doesn’t closing your eyes guarantee a lack of bias?) Explain. Most students correctly realize that this sampling method is still biased toward longer words. You would be more likely to select longer words than shorter ones, because longer words take up more space on the page.
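For readers who want to experiment outside of class, here is a minimal Python sketch of why pointing with closed eyes is still biased. It is an illustration rather than part of the activity, and it rests on two assumptions: the population is only the opening sentence of the address (30 words, punctuation removed) rather than all 268 words, and the chance of pointing at a word is taken to be proportional to its number of letters, which simplifies "space on the page."

```python
import random
from statistics import mean

# Stand-in population (assumption): only the opening sentence of the
# address, punctuation removed, instead of the full 268 words.
WORDS = (
    "Four score and seven years ago our fathers brought forth upon this "
    "continent a new nation conceived in Liberty and dedicated to the "
    "proposition that all men are created equal"
).split()
LENGTHS = [len(w) for w in WORDS]

def pointing_sample_mean(rng, n=10):
    """Mean length of n words picked with probability proportional to length.

    Assumption: a word's chance of being pointed at is proportional to its
    letter count; the same word may be hit more than once.
    """
    return mean(rng.choices(LENGTHS, weights=LENGTHS, k=n))

def random_sample_mean(rng, n=10):
    """Mean length of n distinct words, each word equally likely."""
    return mean(rng.sample(LENGTHS, n))

rng = random.Random(2019)  # arbitrary seed, only for repeatability
reps = 5000
pointing_avg = mean(pointing_sample_mean(rng) for _ in range(reps))
random_avg = mean(random_sample_mean(rng) for _ in range(reps))
population_mean = mean(LENGTHS)

print(f"population mean:               {population_mean:.3f}")
print(f"average of 'pointing' means:   {pointing_avg:.3f}")
print(f"average of random-sample means: {random_avg:.3f}")
```

Under these assumptions, the "pointing" averages settle well above the population mean, while the equal-chance averages settle close to it.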
Finally, I ask: Suggest a different sampling method that would be unbiased. Some students immediately respond with a magic word: random! So I follow up with: What does it mean to select a random sample of words in this situation? This question is harder, but eventually a student says that random sampling gives every word, whether it is an interesting word such as dedicate or a boring word like of, the same chance of being selected.
We then proceed to examine properties of random sampling. Sometimes I ask students to generate their own random samples of words from this population. One option for doing this is to give them a numbered list of the 268 words and then use a random number generator (such as the one at random.org) to select their sample. They can then calculate their sample mean word length and put a dot on a new dotplot on the board, using the same scale as the original dotplot.
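The numbered-list approach can also be mimicked in a few lines of Python. This is only a sketch: it numbers the words of the opening sentence (a stand-in for the full numbered list of 268 words) and uses Python's `random` module in place of random.org. Drawing ID numbers without repeats is an assumption about how the class selects its numbers.

```python
import random

# Stand-in population (assumption): the opening sentence only,
# punctuation removed, rather than the full 268-word list.
WORDS = (
    "Four score and seven years ago our fathers brought forth upon this "
    "continent a new nation conceived in Liberty and dedicated to the "
    "proposition that all men are created equal"
).split()

# Numbered list, as on a handout: ID 1 is the first word, and so on.
numbered = dict(enumerate(WORDS, start=1))

rng = random.Random()  # unseeded, so each run gives a different sample
ids = sorted(rng.sample(range(1, len(numbered) + 1), k=5))  # 5 distinct IDs
sample = [numbered[i] for i in ids]
sample_mean = sum(len(w) for w in sample) / len(sample)

print("selected IDs:  ", ids)
print("selected words:", sample)
print("sample mean word length:", sample_mean)
```

Because every ID number has the same chance of being drawn, a word like "a" is exactly as likely to end up in the sample as "proposition."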
Another option is to move directly to using an applet (available here) to select random samples of words. This applet starts by showing the distribution of word lengths in the population, which is skewed to the right:
You can select random samples by first clicking on Show Sampling Options. I ask students to start by selecting one random sample of 5 words, which produces a result such as:
The applet calculates the sample mean word length for this sample and plots that on a graph. Then asking the applet to select 999 more samples results in a graph of sample means that looks like:
Now we’re ready for the key questions: Does this distribution of sample means indicate sampling bias or unbiasedness of this random sampling method? What aspect of the distribution leads you to this conclusion? The shape and variability in this distribution are completely irrelevant to the issue of sampling bias. To address this issue, we focus on the center of the distribution. We see that the center of the distribution of sample means is very close to the population mean. We can quantify this by noting that the mean of the 1000 sample means is 4.336 letters/word, which is quite close to the population mean of 4.295 letters/word. Therefore, this random sampling method appears to be unbiased.
Before moving on, I want to point out how challenging the following statement can be for students:
The mean of the sample means is the population mean.
This sentence contains only ten words, but three of them are the word mean(s)! We can rewrite this statement mathematically, using common notation, as:
$$\mu_{\bar{x}} = \mu$$
Notice that this equation contains only three symbols (in addition to the equals sign), but all three of them describe a mean! It takes considerable time and careful thought for students to recognize and understand what these three means are and how they relate to each other:
- The population mean. For the population of 268 words in the Gettysburg Address, the value of the population mean is 4.295 letters/word.
- The sample mean, which varies from sample to sample. Each student calculated his/her own sample mean and represented it with a dot on the board. The first random sample generated by the applet above had a sample mean of 3.6 letters/word. The applet then generated 999 more random samples and calculated the sample mean number of letters/word for each one.
- The mean of the sample means. We could have calculated this for the students’ sample means in class; we did not bother, but we know from the graph that the mean of the sample means would have been much greater than 4.295. The applet did calculate the mean of the 1000 sample means that it generated; the mean of these sample means turned out to be 4.336 letters/word. If we went on to generate all possible random samples, in the long run the mean of the sample means would be 4.295, the same value as the population mean.
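The claim about "all possible random samples" can be checked exactly on a population small enough to enumerate. The sketch below is an illustration under stated assumptions: the population is just the opening sentence of the address (30 words, punctuation removed), and the sample size is 3, so that every possible sample can be listed. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

# Stand-in population (assumption): the opening sentence only,
# punctuation removed, rather than the full 268 words.
WORDS = (
    "Four score and seven years ago our fathers brought forth upon this "
    "continent a new nation conceived in Liberty and dedicated to the "
    "proposition that all men are created equal"
).split()
LENGTHS = [len(w) for w in WORDS]
n = 3  # small sample size, so all possible samples can be enumerated

# Mean 1: the population mean.
population_mean = Fraction(sum(LENGTHS), len(LENGTHS))

# Mean 2: one sample mean for every possible sample of n distinct words.
sample_means = [Fraction(sum(s), n) for s in combinations(LENGTHS, n)]

# Mean 3: the mean of all of those sample means.
mean_of_sample_means = sum(sample_means, Fraction(0)) / len(sample_means)

print(len(sample_means), "possible samples")
print("population mean:     ", float(population_mean))
print("mean of sample means:", float(mean_of_sample_means))
print("exactly equal:", mean_of_sample_means == population_mean)
```

Individual sample means vary from sample to sample, but averaged over every possible sample they match the population mean exactly, which is what unbiasedness means here.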
My next question for students: Consider taking random samples of size 20 words per sample, rather than 5 words per sample. How (if at all) would you expect the distribution of sample means to change, in terms of center, variability, and shape? After students think about this, discuss it among themselves, and record their predictions, we use the applet to make this change, which produces a result such as:
We see that the center of this distribution is still close to the population mean of 4.295 letters/word. Most students expect this, because this simply shows that random sampling is still unbiased with a larger sample size. The key finding is that the variability of sample means is smaller with a larger sample size. How can we tell? One way is that the sample means now range from about 3 to 6 letters/word, whereas before (with a smaller sample size of 5) they ranged from about 2 to 8 letters/word. Even better, we can note that the standard deviation of the sample means is now about 0.463, which is much less than its value of 0.945 with the smaller sample size. The shape of the distribution of sample means is also a bit more symmetric and normal-looking with the larger sample size than with the smaller one, and much less skewed than the distribution of the population.
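The size of this drop in variability is roughly what standard theory predicts. As a sketch, the formula below is the textbook result for the standard deviation of a sample mean under simple random sampling without replacement from a finite population of size N with population standard deviation σ. Whether the applet samples with or without replacement is an assumption on my part; with replacement, the second factor disappears and the predicted ratio is exactly 2.

```latex
\mathrm{SD}(\bar{x}) = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}
\qquad\Longrightarrow\qquad
\frac{\mathrm{SD}(\bar{x}) \text{ with } n = 5}{\mathrm{SD}(\bar{x}) \text{ with } n = 20}
  = \sqrt{\frac{20}{5}} \, \sqrt{\frac{268 - 5}{268 - 20}}
  \approx 2.06
```

The simulated standard deviations above give a ratio of 0.945 / 0.463 ≈ 2.04, which is in line with this prediction: quadrupling the sample size cuts the variability of sample means roughly in half.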
This last point foreshadows the concept of a sampling distribution of a sample mean and the Central Limit Theorem. I think this context and applet provide a great opportunity to study those ideas*, but at this point I prefer to keep the focus on the topics of sampling bias and random sampling.
* One feature that I particularly like about this applet is that it displays three distributions at once, which are crucial (and challenging) for students to keep in mind when studying sampling distributions:
- Population distribution (of word lengths)
- Sample distribution (of word lengths)
- Sampling** distribution (of average word lengths in a sample)
** It’s very unfortunate that the modifier words sample and sampling are so similar, because distinguishing between the distributions they describe is crucial. Perhaps we should avoid using the term sampling distribution and instead say distribution of sample averages. It’s nice to be able to use shorthand when speaking with colleagues who understand the ideas, but in this case the extra words provide clarity for students who are just beginning to consider the ideas.
Before leaving the topic of sampling bias and random sampling, I ask a few more questions of my students, all in the context of selecting a sample of students at our university to complete a survey:
- Would it be easy or hard to select a random sample of 50 Cal Poly students?
It takes a while for some students to realize that selecting such a random sample would be very hard to achieve. It’s unlikely that university administrators would provide a list of all students at the university. Having access to such a list would enable us to select a random sample of students’ names, but we would still face the challenges of contacting them successfully and then, even more problematic, convincing them to respond to our survey.
- Suppose that you select a sample of Cal Poly students by standing in front of the library and approaching 50 students who pass by. Would this constitute a random sample of Cal Poly students? What if you stand in front of the recreation center and approach 50 students who pass by?
Most students realize that this sampling method (standing in one location and recruiting passersby) does not constitute random sampling. Some students would be more likely to be selected than others, in part because they are out-and-about on campus more often. You would also probably be more inclined to approach students who appear to be …, well, …, approachable, as opposed to students who look more intimidating or less friendly. Even though the word random is used in an everyday sense to mean anything that is unplanned or unstructured, random sampling has a technical meaning.
- Even though the convenience sampling described above is not random, could it nevertheless result in a sample that is representative of the population of Cal Poly students? Identify a variable for which you would not be willing to consider such a convenience sample (as described above) to be representative of the population of Cal Poly students. Also identify a variable for which you would be willing to consider such a sample (as described above) to be representative of the population of Cal Poly students.
We should certainly not consider a convenience sample, selected from students who pass by the library or recreation center, to be representative of the population for most variables, such as how often a student uses the recreation center per week, and whether or not a student knows where the library is on campus. We should also be wary about variables such as a student’s major, how many hours they study per week, or how much sleep they get per night. But there’s probably no reason to doubt that such a sample is representative of the population for a variable such as blood type.
I have used far more than 268 words to write this post. Clearly I am much less economical with words than Abraham Lincoln in his Gettysburg Address. I look forward to name-dropping Nelson Mandela into the next post, which will feature random assignment and discuss how that is quite different from random sampling.
P.S. Beth Chance and I developed the Gettysburg Address activity based on the famous “random rectangles” activity developed by Dick Scheaffer and others. As I told Dick when I interviewed him for the Journal of Statistics Education (here), I suspect that random rectangles is the most widely used activity for teaching statistics of all time, at least among activities that do not involve M&M candies. You can read more about the genesis of the random rectangles activity in this JSE article (here).
P.P.S. This website (here) provides six different versions of the Gettysburg Address, with minor variations (and slightly different numbers of words) among them. The one used above is the Hay copy.