
#9 Statistics of illumination, part 3

I started a series of posts a few weeks ago (here and here) with examples to demonstrate that statistics can shed light on important questions without requiring sophisticated mathematics.  I use these examples on the first day of class in a statistical literacy course and also in presentations for high school students.  A third example that I use for this purpose is the well-known 1970 draft lottery.

Almost none of my students were alive when the draft lottery was conducted on December 1, 1969.  I tell them that I was alive but not old enough to remember the event, which was televised live.  The purpose was to determine which young men would be drafted to serve in the U.S. armed forces, perhaps to end up in combat in Vietnam.  The draft lottery was based on birthdays, so as not to give any advantage or disadvantage to certain groups of people.  Three hundred and sixty-six capsules were put into a bin, with each capsule containing one of the 366 dates of the year.  The capsules were drawn one at a time, with draft number 1 being assigned to the birthday drawn first (which turned out to be September 14), meaning that young men born on that date were the first to be drafted.

Let’s look at the results:

Students are naturally tempted to find the draft number assigned to their own birthday, and I encourage them to do this first.  Then we see who has the smallest draft number in the class.  I always look up the draft number for today’s date before class begins, and then in class I ask if anyone has that draft number.  Students always look perplexed about why that draft number is noteworthy, until I wish a happy birthday to anyone with that draft number*.

* If you are reading this blog entry on the day that it is first posted, and your draft number is 161: Happy birthday!

Then I show students the following scatterplot, which has sequential date on the horizontal axis (e.g., January 1 has date #1, February 1 has date #32, and so on through December 31 with date #366) and draft number on the vertical axis.  I ask students: What would you expect this graph to look like with a truly fair, random lottery process?  They quickly respond that the graph should display nothing but random scatter.  Then I ask: Does this graph appear to display random scatter, as you would expect from a fair, random lottery?  Students almost always respond in the affirmative.

I suggest to students that we dig a little deeper, just to be thorough because the stakes in this lottery were so high.  I propose that we proceed month-by-month, calculating the median draft number for each month.  Students agree that this sounds reasonable, and then I ask: What do we first need to do with the table of draft numbers in order to calculate medians?  Many will respond immediately that we need to put the draft numbers in order for each month.  Then I offer a silly follow-up question: Would the process of doing that by hand be quick and easy, or time-consuming and tedious?  After they answer that, I provide them with the following table, where the draft numbers have been sorted from smallest to largest within each month:

Just to get warmed up, we calculate January’s median draft number together as a class.  Of course, this requires finding the (31+1)/2 = 16th value in order, which is 211.  Then I ask each student to determine the median draft number for their own birth month.  I point out that those born in a 30-day month have more work to do, because they must calculate the average of the 15th and 16th ordered values.  I write the medians on the board as students call them out.  Here they are:

Now I ask: Do you see any pattern in these medians, or do they look like random scatter?  Students are quick to respond that, to their surprise, they do see a pattern!  There’s a tendency for larger medians in earlier months, smaller medians in later months.  In fact, every median in the first six months is larger than every median in the second six months.  Then I present the same scatterplot as before, but with the medians superimposed:

Now that we have the medians to help guide us, students are quick to see an abundance of dots in the top left and bottom right (high draft numbers early in the year, low draft numbers late in the year) of the graph.  They also point out a shortage of dots in the bottom left and top right.  At this point I recommend showing students portions of this video of how the lottery was conducted: link.  You might then explain that the problem was inadequate mixing of the capsules.  For example, the January and February capsules were added to the bin first and so settled near the bottom and tended to be drawn later.  The November and December capsules were added to the bin last and so remained near the top and tended to be drawn earlier.

On the first day of class I end this example there, but you could ask more questions.  For example:  We now think we see a pattern in the scatterplot, but how can we investigate how unlikely such a pattern would be with a truly fair, random lottery?  The approach to answering this is quite straightforward, at least in principle: Use software to conduct a large number of random lotteries and see how often we get a result as extreme as the actual 1970 draft lottery.  But this leads to another question: How can we measure this extremeness, how different the actual lottery results are from what would be expected with a fair, random lottery?  One answer: Use the correlation coefficient between sequential date and draft number.  What would this correlation value be for a truly fair, random lottery?  Zero.  With the actual 1970 draft lottery results, this correlation equaled -0.226.  How often would a random lottery produce a correlation coefficient with an absolute value of 0.226 or higher?  To answer this I simulated 10,000 random lotteries, calculated the correlation coefficient for each one, and produced the following graph of the 10,000 correlation values:

What does this graph reveal about our question of the fairness of the 1970 draft lottery?  First notice what is not relevant: the approximate normality of the sampling distribution of the correlation coefficient.  That this graph is centered at 0 is also not relevant, although that does indicate that the simulation was performed correctly.  What matters is that none of the 10,000 simulated random lotteries produces a correlation coefficient of 0.226 or higher in absolute value.  This indicates that the 1970 draft lottery result would be extremely unlikely to happen from a truly fair, random lottery.  Therefore, we have extremely strong evidence that the process underlying the 1970 results was not a fair, random lottery.
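For readers who would like to try the simulation without the applet, here is a minimal Python sketch of the same idea.  It is not the applet's code: the function names, the seed, and the choice of 2,000 lotteries (rather than 10,000, to keep pure Python quick) are all assumptions made for illustration.  Each simulated lottery is a random shuffle of the draft numbers 1 through 366, and the only value taken from the actual lottery is the correlation of -0.226 reported above.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_lotteries(num_lotteries, num_dates=366, seed=None):
    """Return the date/draft-number correlation for each simulated fair lottery."""
    rng = random.Random(seed)
    dates = list(range(1, num_dates + 1))
    draft_numbers = dates[:]
    correlations = []
    for _ in range(num_lotteries):
        # A fair lottery: every ordering of the draft numbers is equally likely.
        rng.shuffle(draft_numbers)
        correlations.append(pearson_r(dates, draft_numbers))
    return correlations


OBSERVED_R = -0.226  # correlation reported for the actual 1970 lottery

# 2,000 lotteries and the seed are illustrative choices, not the post's settings.
correlations = simulate_lotteries(num_lotteries=2000, seed=1970)
extreme = sum(abs(r) >= abs(OBSERVED_R) for r in correlations)
print(f"{extreme} of {len(correlations)} simulated lotteries "
      f"had |r| >= {abs(OBSERVED_R)}")
```

Raising `num_lotteries` to 10,000 matches the number of repetitions described above, at the cost of a longer run.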

Fortunately, many improvements were made in the process for the following year’s lottery.  The capsules were mixed much more thoroughly, and the process included random selection of draft numbers as well as random drawing of birthdates.  In other words, a birthdate pulled out of one bin was matched up with a draft number drawn from another bin.  The correlation coefficient for that lottery’s results turned out to be 0.014.  Looking at the simulation results, we see that such a correlation value is not at all surprising from a fair, random lottery.

Another extension of this example is to classify the birthdates and draft numbers into three categories and then summarize the 1970 draft lottery results in a 3×3 table of counts as follows:

You could then ask students to produce and describe a segmented bar graph of these results.  You could also ask them to conduct a chi-square test and summarize their conclusion.  The graph below gives another view of the association between birthdate and draft number.  The chi-square test results in a test statistic of 25.18 and a p-value of 0.00005.

I think this draft lottery example fits nicely with the “statistics of illumination” theme.  The context here is extremely important, and the straightforward calculation of medians sheds considerable light on a problem that could easily have gone unnoticed.  I recommend discussing this example in conjunction with the earlier one about readability of cancer pamphlets (link).  With the cancer pamphlets, calculating medians was an unhelpful distraction that diverted attention from the more pressing issue of comparing distributions.  But with the draft lottery, it’s very hard to see much in the scatterplot until you calculate medians, which are quite helpful for discerning a pattern amidst the noise. I also emphasize to students that achieving true randomness can be much more difficult than you might expect.

P.S. The simulation analysis above was performed with the Corr/Regression applet available at:  Even though my name appears first in the name of this applet collection, Beth Chance deserves the vast majority* of the credit for imagining and designing and programming these applets.  I’ll have much more to say about simulation-based inference in future posts.

* Whatever percentage of the credit you may think “vast majority” means here, your thought is almost surely an underestimate.

P.P.S. You can read more about the 1970 draft lottery in many places, including here.

#8 End of the alphabet

As you might imagine, considering the first letter of my first name, I am rather partial to the first letter of the alphabet.  Students also seem to be quite fond of this letter, perhaps because it represents the grade that they are working toward.  Nevertheless, despite the attractions of the letter A, I often draw my students’ attention to the very end of the alphabet, because I think z-scores represent an important and underappreciated concept in introductory statistics.

Some believe that the sole purpose of a z-score is to provide an intermediate step in a normal probability calculation.  Moreover, this step has been rendered obsolete by technology.  But the idea of measuring distance in terms of number of standard deviations is very useful and relevant in many situations.  This is what z-scores do, and this enables us to compare proverbial apples and oranges.  Four examples follow. As always, my questions to students appear in italics.

1. I introduce students to this concept in a context that they are quite familiar with: standardized exams such as the SAT and ACT.  Suppose that Bob achieves a score of 1250 on the SAT, and his sister Kathy scores 29 on the ACT.  Who did better, relative to their peers?  What more information do you need?

Students realize that it’s meaningless to compare scores of 1250 and 29, because the two exams are scored on completely different scales.  I provide some more information:

  • SAT scores have a mound-shaped distribution with a mean of about 1050 and a standard deviation (SD) of about 200.
  • ACT scores have a mound-shaped distribution with a mean of about 21 and an SD of about 5.5.

Now what can you say about who did better relative to their peers – Bob or Kathy?

At this point some students come up with the key insight: compare the two siblings in terms of how many standard deviations above the mean their test scores are.  It’s fairly easy to see that Bob’s score is exactly 1 SD above the mean on the SAT.  We can also see that Kathy’s score is more than 1 SD above the mean on the ACT, because 21 + 5.5 = 26.5 is less than Kathy’s score of 29.  With a little more thought and effort, we can calculate that Kathy’s score is (29 – 21) / 5.5 ≈ 1.45 SDs above the mean.  Therefore, it’s reasonable to conclude that Kathy did better than Bob relative to their peers.

Next I introduce the term z-score (also known as a standard score or standardized score) for what we have calculated here: the number of standard deviations above or below the mean a value is.  I’m tempted not to give a formula for calculating a z-score, but then I succumb to orthodoxy and present: z = (x – mean) / SD.

Now let’s consider two more siblings, Peter and Kellen.  Peter scores 650 on the SAT, and Kellen scores 13 on the ACT.  Who did better, relative to their peers?  Explain.

Having figured out a reasonable approach with Bob and Kathy, students are on much firmer ground now.  Peter’s score is exactly 2 SDs below the mean on the SAT, and Kellen’s score is between 1 and 2 SDs below the mean on the ACT.  In fact, Kellen’s z-score can be calculated to be (13 – 21) / 5.5 ≈ -1.45, so his ACT score is 1.45 SDs below average.  Because Kellen’s score is closer to average than Peter’s, and because both scored below average, Kellen did somewhat better relative to his peers than Peter.
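The four sibling comparisons can be checked with a few lines of Python.  This is only a sketch: the function and variable names are made up for illustration, while the means, SDs and scores are the ones given above.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# Approximate exam summaries quoted above.
SAT = {"mean": 1050, "sd": 200}
ACT = {"mean": 21, "sd": 5.5}

scores = {
    "Bob": (1250, SAT),
    "Kathy": (29, ACT),
    "Peter": (650, SAT),
    "Kellen": (13, ACT),
}

z_scores = {
    name: z_score(value, exam["mean"], exam["sd"])
    for name, (value, exam) in scores.items()
}

for name, z in z_scores.items():
    print(f"{name}: z = {z:+.2f}")
```

A larger z-score means a better result relative to peers, so Kathy edges out Bob and Kellen edges out Peter.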

We could produce graphs to compare these distributions and siblings:

The graphs help to make clear that Kathy’s score is farther out than Bob’s in the right tail of their distributions and that Peter’s score is farther out in the left tail than Kellen’s.  You could take the natural next step here and calculate percentiles from normal distributions for each sibling, but I usually stop short of that step to keep the focus on z-scores.

2. Next I’d like to show you one of my favorite* questions from an AP Statistics exam.  This question, taken from the 2011 exam, is about evaluating players based on speed and strength.  Even though the question mentions no particular sport or position, I’ll always think of this as the “linebacker” question.

* I discussed my all-time favorite question in post #2 (link).

Here’s the first part of the question:

This is a very challenging question to start the exam.  Rather than ask for a routine calculation or interpretation, this question requires students to think things through.  Students need to realize that the minimum time is too close to the mean for the normal distribution to provide a reasonable model.  What does “too close” mean here?  That’s the whole point: Let’s measure that closeness in number of standard deviations.  In other words, let’s look to the end of the alphabet and calculate a z-score for the minimum time: z = (4.40 – 4.60) / 0.15 ≈ -1.33.  What does this z-score tell us?  The minimum time is only 1.33 standard deviations below the mean.  Why does this indicate that a normal distribution would be unreasonable?  As shown in the graph below, with a normal distribution, about 9% of the data would fall more than 1.33 SDs below the mean, so we would expect the minimum time to be farther below the mean than that.
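The "about 9%" figure can be reproduced with the standard normal cumulative distribution function.  The sketch below uses only Python's `math` module; the helper name `normal_cdf` is my own, and the mean, SD and minimum time are the values used in the z-score calculation above.

```python
from math import erf, sqrt


def normal_cdf(z):
    """P(Z <= z) for a standard normal random variable Z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


mean_time = 4.60  # seconds
sd_time = 0.15    # seconds
min_time = 4.40   # fastest observed time, in seconds

z_min = (min_time - mean_time) / sd_time
prop_below = normal_cdf(z_min)

print(f"z-score of the minimum time: {z_min:.2f}")
print(f"Proportion a normal model puts below the minimum: {prop_below:.3f}")
```

A normal model would place roughly one observation in eleven below the fastest time actually recorded, which is why the model is a poor fit here.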

On to the second part of the question:

This part is very straightforward, allowing students to do something routine that they might have practiced repeatedly, and there’s certainly nothing wrong with that.  This z-score can be calculated to be: z = (370 – 310) / 25 = 2.40.   Notice that the interpretation is as important as the calculation: This z-score tells us that a player who can lift 370 pounds is lifting 2.4 SDs more than average.  Saying that this weight is 2.4 SDs away from the average would leave out important information about direction; students who gave this response did not receive full credit.

Here’s the final part of the question:

Most students saw that Player A was faster but less strong than Player B.  Students then needed to realize that z-scores would be an effective way to compare the players on the two criteria.  Some students had the intuition that a 5-pound difference in weightlifting amount (B’s advantage over A) is less impressive than a 0.15-second difference in running time (A’s advantage over B), but they needed to justify this conclusion by looking at SDs.  A savvy student might have recognized that part (b) pointed them in a helpful direction by asking explicitly for a z-score calculation and interpretation.

The z-scores for speed turn out to be -1.2 for Player A, -0.2 for Player B.  (Smaller values for time are better, indicating faster speed.)  The z-scores for strength turn out to be 2.4 for Player A, 2.6 for Player B.  Comparing these allows us to say that Player B is only slightly stronger than Player A, but Player A is considerably faster than Player B.  Because the question advised us to consider both criteria as equally valuable, Player A is the better choice.
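As a quick check on this comparison, the sketch below starts from the four z-scores just quoted and converts each player's advantage back into raw units, using the SDs from parts (a) and (b) (0.15 seconds and 25 pounds).  The variable names are assumptions for illustration.

```python
SD_TIME = 0.15   # seconds, from part (a)
SD_WEIGHT = 25   # pounds, from part (b)

# z-scores quoted above; for time, a lower (more negative) z-score is better.
z_time = {"A": -1.2, "B": -0.2}
z_weight = {"A": 2.4, "B": 2.6}

# Each player's advantage over the other, measured in SDs.
speed_gap_sd = z_time["B"] - z_time["A"]            # Player A's edge in speed
strength_gap_sd = z_weight["B"] - z_weight["A"]     # Player B's edge in strength

# The same advantages converted back to raw units.
speed_gap_seconds = speed_gap_sd * SD_TIME
strength_gap_pounds = strength_gap_sd * SD_WEIGHT

# With both criteria weighted equally, the larger advantage in SD units wins.
better_player = "A" if speed_gap_sd > strength_gap_sd else "B"

print(f"A is faster by {speed_gap_sd:.1f} SDs ({speed_gap_seconds:.2f} seconds)")
print(f"B is stronger by {strength_gap_sd:.1f} SDs ({strength_gap_pounds:.0f} pounds)")
print(f"Better overall choice: Player {better_player}")
```

The raw gaps come out to 0.15 seconds and 5 pounds, matching the differences mentioned earlier, but only the SD units make them comparable.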

3. I also want students to have a sense for what constitutes a large z-score.  For example, z-scores larger than 3 in absolute value do not come along very often.  This is especially relevant when conducting significance tests for population proportions.  It’s easy for students (and instructors) to get so caught up in blindly following the steps of a significance test that they lose sight of interpreting and drawing a conclusion from a z-score.  A favorite example of mine concerns Hans Rosling, who dedicated his life to increasing public awareness of global health issues and achieved some internet fame for his entertaining and informative TED talks (link).  Rosling and his colleagues liked to ask groups of people: Has the percentage of the world’s population who live in extreme poverty doubled, halved, or remained about the same over the past twenty years?  The correct answer is that this percentage has halved, but only 5% of a sample of 1005 U.S. adults in 2017 got this right.  Rosling liked to say that chimpanzees would do better than people: With only three options, we would expect 33.33% of chimpanzees to answer correctly.

I ask students: How far apart are these proportions: .05 for a sample of U.S. adults versus .3333 for blind guessing? What conclusion about Rosling’s hypothesis can you draw?  Explain how your conclusion follows from that calculation.

This is exactly what a z-score can tell us.  First we need to know the standard deviation of the sample proportion, assuming that people are guessing among the three options.  We could use a simulation analysis to estimate this standard deviation, or we could use the familiar formula √(p(1 – p)/n), which gives an SD of √((1/3)(2/3)/1005) ≈ 0.01487 and therefore a z-score of (.05 – .3333) / 0.01487 ≈ -19.1.

At this point many students would not pause for a moment before proceeding to use software or a graphing calculator or a normal probability table to determine the p-value, but I strongly encourage pausing to think about that enormous z-score!   The observed value of the sample proportion (5% who answered correctly) is 19.1 standard deviations below the value one-third that would be expected from random guessers such as chimpanzees!!*  We don’t need statistical software or an applet or a normal probability table to tell us that this is a HUGE discrepancy.  This means that there’s (essentially) no way in the world that as few as 5% of a random sample would have answered correctly in a population where everyone blindly guesses.  We have overwhelming evidence in support of Rosling’s claim that humans (at least U.S. adults) do worse than guessing (like chimpanzees would) on this question.

* With a z-score of -19.1, I joke with students that writing a correct interpretation with no exclamation points is only worth half-credit.
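The arithmetic behind that z-score takes only a few lines of Python.  This is a sketch with made-up variable names; the sample size, the observed proportion and the guessing proportion are the ones given above.

```python
from math import sqrt

n = 1005        # sample size of U.S. adults
p_hat = 0.05    # observed proportion who answered correctly
p_guess = 1 / 3 # proportion expected from blind guessing among three options

# Standard deviation of the sample proportion if everyone guesses blindly.
sd_p_hat = sqrt(p_guess * (1 - p_guess) / n)

# How many SDs the observed proportion lies from the guessing value.
z = (p_hat - p_guess) / sd_p_hat

print(f"SD of sample proportion under guessing: {sd_p_hat:.5f}")
print(f"z-score: {z:.1f}")
```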

Some normal curve graphs might help to put this in perspective.  The graph on the left below shows the distribution of sample proportions with a sample size of 1005, assuming that the population proportion equals one-third.  We can see that a sample proportion of .05 is extremely far out in the tail.  Equivalently, the graph on the right shows a z-score of -19.1 with a standard normal distribution:

4. Suppose that Arturo and Bella take an exam for which the mean score is 70 and standard deviation of scores is 8.  Arturo’s score on the exam is 75, and Bella’s score is 1.5 standard deviations above Arturo’s score.  What is Bella’s score on the exam?  Show your work.

Notice that this question is not asking for a z-score calculation.  I have recently started to ask this question on exams, because I began to worry that students were simply memorizing the mechanics of calculating a z-score and interpreting the result by rote.  I figured that they might be able to do that without really understanding the concept of “number of standard deviations” away.  By asking for a value that is 1.5 standard deviations away from a value that is not the mean, I think this question assesses student understanding.  I’m happy to say that most of my students were able to answer this question correctly: Bella’s score is 75 + 1.5×8 = 75 + 12 = 87.

Where does this leave us?  Whether your first name is Abel or Allison, Zachary or Zoya, or (most likely) something in between, I hope we can agree that when it comes to teaching introductory statistics, the last letter of the alphabet is not least important.

#7 Two dreaded words, part 2

Last week (here) I suggested that the term standard deviation inspires the most dread in introductory statistics students.  I argued that this is very unfortunate because standard deviation (SD) is simply a measure of variability, the most fundamental concept in statistics.  I proposed some questions that might help students to develop a better understanding of what SD measures.  But none of my questions made use of real data, so I will fix that deficiency here in part two.

But first I want to mention a textbook exercise with real data that I encountered near the beginning of my teaching career in the early 1990s.  The question gave seat belt usage percentages for each of the fifty states.  The data were real, and the context was interesting.  The textbook question asked: Calculate the standard deviation of these 50 seat belt usage percentages.  That’s it.  Students were not asked to interpret the value of the SD.  They were not asked to graph the data and describe the distribution.  They were just asked to calculate the value of an SD for some unspecified reason.  I joked to myself that a student would deserve full credit for answering this textbook exercise not with the requested calculation but with a one-word response: Why?

This reminds me that I used to collect data from my students by asking how many keys they were carrying.  My students would then analyze the data and describe the distribution.  A friend eventually worked up the nerve to ask me that devastating one-word question: Why?  I had to admit that I had no motivating question in mind behind collecting these data, so I realized that it was a pointless exercise and abandoned it.

What’s my point behind these digressions?  First, as I tried to demonstrate in the previous post, I think you can ask good questions without real data.  Second, my point now is that you can ask bad questions with real data. In mathematicians’ words, I believe that using real data is neither necessary nor sufficient for asking good* questions.

* I have not forgotten that I promised in my first post (link) to consider the issue of what makes a question good.  I still promise to come back to that eventually.

Don’t get me wrong: I’m a big proponent of using real data.  Let’s get to some good questions using real data that address the dreaded topic of standard deviation (SD).  As always, questions that I pose to students appear in italics.

People love to talk about the weather as one of the most variable, changeable aspects of the world around us.  In many locations around the world one can hear local residents advise: “If you don’t like the weather right now, just wait an hour.”  But how variable is the weather in different locations?  I looked up the high temperatures for the 28 days of February 2019 for four locations: San Luis Obispo, California; Juneau, Alaska; Princeville, Hawaii; and Sedona, Arizona.  Before we look at the data, make a prediction for the ordering of these four locations in terms of the SD of their February high temperatures.

I don’t really care whether students know much about the weather in these places, although I encourage them to discuss this with each other.  Then I show them this graph of the data:

Based on what this graph reveals, put these locations in order from smallest to largest SD of their daily February temperatures.

Now I do care that students can read the graphs properly and understand the relative amounts of variability in these distributions.  San Luis Obispo (SLO) and Princeville clearly have less variability in temperatures than Juneau and Sedona, despite having higher temperatures on average.  Princeville has slightly less variability than SLO.  If you do not like high temperatures between 70 and 80 degrees, then do not spend February in Princeville!  On the other hand, if you want to experience a variety of temperatures, consider spending February in Sedona, where high temperatures ranged from 35 to 66 degrees, giving Sedona the most variability among these locations.  Here are the SDs, in order from smallest to largest:

What if we looked at daily high temperatures in July instead of February?  Would we see a very different story or a similar one?  More specifically: Which of these four locations would you predict to have a larger SD in July than in February?  Which would you predict to have a smaller SD?  What is your predicted ordering of the four locations by SD of daily July temperatures?

Let’s look at a graph of the data:

With the benefit of seeing this graph, re-answer the previous questions.

Now you’d better avoid Princeville if you cannot abide high temperatures between 82 and 85 degrees!  In fact, the variability in Princeville’s high temperatures has gone from small in February to minuscule in July, with an SD of less than 1 degree.  Sedona’s variability has also decreased considerably, even as its temperatures have soared into the 90s and low 100s.  On the flip side, temperatures in San Luis Obispo and Juneau are more variable (substantially for SLO, slightly for Juneau) in July than they were in February.

Here are the new SDs, which reveal that Sedona’s temperature variability has dropped from the highest to second lowest:

One final observation and question for students before we leave these data behind: Sedona’s smallest high temperature in July 2019 was 82 degrees, which occurred on the very last day of the month.  What would happen to Sedona’s SD of July temperatures – decrease or increase or remain pretty much the same – if we were to remove the 82 value?

Removing that outlier would decrease the variability considerably and so would decrease the SD.  It turns out that the decrease in SD would be from 4.69 degrees to 3.73 degrees, less than half of the SD value from February in Sedona.

Students can be encouraged to extend this analysis outside of class.  They could collect their own data to analyze, perhaps from other locations across the country or other parts of the world.  They could also analyze temperatures for different months, or for the same month across different years.  They could also analyze weather data other than daily high temperatures.

Now let’s move to sports, where a popular word is parity.  This means that teams in a league are roughly evenly matched, which allows fans of many teams to entertain somewhat realistic hopes of winning a championship.  Do some sports and some leagues/conferences have more parity than others?

I investigated this question by recording the win proportion* for each team in the 2018 Major League Baseball season.  Then I grouped the teams according to whether they are in the National League (NL) or American League (AL).  The purpose is to see whether one league or the other has substantially more parity among its teams’ win proportions.

* For example, the Boston Red Sox achieved the best win-loss record of 108-54, for a win proportion of 108/162 ≈ 0.667.  On the other extreme, the Baltimore Orioles had the worst win-loss record of 47-115, for a win proportion of 47/162 ≈ 0.290.

The first question for students is: Would a smaller or larger SD of win proportions indicate greater parity in a sports league?  Most students realize that greater parity means less variability, which requires a smaller SD.

The next question is for baseball fans only: Make an educated guess for which league – NL or AL – had a smaller SD of its teams’ win proportions in the 2018 season.

Let’s look at a graph of the distributions of win proportions for the two leagues:

Now we arrive at the key question for all students, not just baseball fans: Based on the graph, which league had a smaller SD of win proportions?

Most students correctly realize that the National League (NL) has a considerably smaller SD of win proportions, and therefore more parity, than the American League (AL).  Teams were generally much closer in their win proportions, and therefore more middle-of-the-pack in quality, in the NL.  In fact, 14 of 15 teams in the NL won between 40% and 60% of their games.  (Only the Miami Marlins missed out, and just barely, with their 0.391 win proportion.)  In contrast, only 8 of 15 AL teams won between 40% and 60% of their games.  The AL was more of a feast-or-famine league, as it was home to the four teams with the best win proportions (Boston Red Sox, Houston Astros, New York Yankees, Oakland Athletics) and the three teams with the worst win proportions (Baltimore Orioles, Kansas City Royals, Chicago White Sox).

The SDs of the win proportions turn out to be .065 in the NL, .113 in the AL.  Then I ask students: Calculate the percentage difference in these SD values.  This is not trivial for most introductory students, and I plan to devote a future post to this under-appreciated topic.  The calculation of this percentage difference is: (.113 – .065) / .065 ≈ .738 × 100% = 73.8%.  Interpret this value in this context.  The SD of win proportions in the AL was 73.8% larger than the SD of win proportions in the NL.  This seems like a very large discrepancy in parity between the two leagues.
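For students who struggle with the percentage difference, the calculation can be spelled out as a short Python sketch.  The variable names are made up, and the two SDs are the values reported above.

```python
sd_nl = 0.065  # SD of win proportions, National League
sd_al = 0.113  # SD of win proportions, American League

# Percentage difference relative to the smaller (NL) value.
pct_larger = (sd_al - sd_nl) / sd_nl * 100

print(f"The AL's SD is {pct_larger:.1f}% larger than the NL's SD")
```

Dividing by the NL value is what makes the sentence "larger than the NL" correct; dividing by the AL value instead would answer a different question, namely how much smaller the NL's SD is.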

Let’s look at two other sports: football and basketball.  I gathered the same kind of data on win proportions for the 2018 National Football League (NFL) season and the 2018-19 National Basketball Association (NBA) season.  Let’s start with some questions for sports fans: How do you expect the SDs of win proportions to compare across the three sports: baseball, football, basketball?  Within football and basketball, do you expect to see much of a difference in the SDs of win proportions between the two conferences?  In which of the three sports do you expect to see the largest discrepancy in SDs of win proportions between the two leagues/conferences?

Then I ask all students to answer these questions based on the following graph of the data (where the top two dotplots are for the two baseball leagues, the next two are for the two football conferences, and the bottom two are for the two basketball conferences):

What do we see here?  First, baseball has the least variability in win proportions, and football has the most variability. This makes sense in terms of the number of games in each sport’s season: Baseball has the longest season (162 games), followed by basketball (82 games) and finally football (16 games).  Second, football and basketball do not have the large discrepancy in SDs between their conferences that baseball has between its leagues.  The AFC appears to have a slightly smaller SD than the NFC in football, and the Western conference appears to have a slightly smaller SD than the Eastern conference in basketball.  The following SD values bear this out:

As with the weather data, students could be encouraged to extend this analysis to other sports data that they could collect.  They could analyze other sports, perhaps at college or high school levels, or other years, or use a different variable than win proportions.
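The season-length explanation above can be made concrete with a rough benchmark.  The sketch below assumes, purely for illustration, that every team is exactly average and that each game is an independent 50/50 coin flip.  This is a simplification, not a description of the actual leagues.  Under that assumption the SD of a team's win proportion is √(0.5 × 0.5 / number of games), so shorter seasons produce more spread even with perfect parity.

```python
from math import sqrt


def sd_win_proportion_by_chance(num_games, win_prob=0.5):
    """SD of a team's win proportion if every game is an independent coin flip."""
    return sqrt(win_prob * (1 - win_prob) / num_games)


# Regular-season lengths mentioned above.
season_lengths = {"baseball": 162, "basketball": 82, "football": 16}

chance_sds = {
    sport: sd_win_proportion_by_chance(games)
    for sport, games in season_lengths.items()
}

for sport, sd in chance_sds.items():
    print(f"{sport}: chance-alone SD of win proportion = {sd:.3f}")
```

Comparing each league's actual SD to this chance-alone value is one way to judge how much of the observed spread reflects real differences in team quality.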

Almost all of these questions, for both weather and sports, have boiled down to asking: Look at the graph of the distributions, and arrange the groups in order according to magnitude of their SDs.  So, if that’s all the questions asked, why is the length of this post surpassing 2000 words?  I like to think that this illustrates one of the great things about using real data: The context can pull you in, inspiring you* to explore more aspects and analyze additional data in the hope of gaining more insights.

* And your students!

P.S. The datasets that I analyzed in this post can be downloaded from the links below.

#6 Two dreaded words, part 1

Which two-word term produces more anxiety and negative feelings among introductory statistics students than any other?

I don’t think correlation coefficient produces the most negative feelings, or confidence interval, or even hypothesis test.  I don’t think random sample achieves maximum anxiety, or observational study, or expected value, or even confounding variable.  No, my bet is that standard deviation makes students shiver with fear and cringe with distaste more than any other two-word term, perhaps even long after they have completed their statistics course*.

Why is this so unfortunate?  Because variability is the single most fundamental concept in statistics, and the most common measure of variability is … (brace yourself) … standard deviation.

* If you would vote for sampling distribution, I see your point.  But I don’t think sampling distribution comes up outside of a statistics classroom nearly as much as standard deviation.  Trust me: I’ll have lots to say about teaching sampling distributions in later posts.

The formula for standard deviation (allow me to abbreviate this as SD for the rest of this post) can certainly look intimidating:

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$

Expressing this as an algorithm does not make it any more palatable:

  1. Subtract the mean from each data value.
  2. Square each difference.
  3. Add them up.
  4. Divide by one less than the sample size.
  5. Take the square root.
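For readers who would rather let software do the arithmetic, the five steps above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_sd` and the four data values are my own hypothetical choices, not anything from a real dataset.

```python
import math

def sample_sd(values):
    """Sample standard deviation, following the five steps listed above."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data values")
    mean = sum(values) / n
    deviations = [x - mean for x in values]   # step 1: subtract the mean
    squared = [d ** 2 for d in deviations]    # step 2: square each difference
    total = sum(squared)                      # step 3: add them up
    variance = total / (n - 1)                # step 4: divide by n - 1
    return math.sqrt(variance)                # step 5: take the square root

# Hypothetical data values, chosen only to illustrate the steps.
example = [2, 4, 4, 6]
print(round(sample_sd(example), 3))
```

Python's built-in `statistics.stdev` performs the same calculation, so students never need to write this themselves; the point is only to show that each step is a single, simple operation.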

What to do?  I recommend asking questions that help students to understand what SD is all about, rather than wasting their time with calculating SD by hand.  Here are ten questions that try to do this:

1. Suppose that Samira records the ages of customers at the Snack Bar on a college campus from 12-2pm tomorrow, while Mary records the ages of customers at the McDonald’s near the highway at the same time.  Who will have the larger SD of their ages – Samira or Mary?  Explain why.

Mary is likely to encounter people of all ages at McDonald’s – from toddlers to senior citizens and every age in between.  Samira might run into some toddlers and senior citizens at the on-campus snack bar, but she’ll mostly find a lot of people in the 18-25-year-old age group.  Because the ages of McDonald’s customers will vary more than ages of Snack Bar customers, Mary will have a larger SD of ages than Samira will.

2. Suppose that Carlos and Hector visit their local humane society animal shelter.  Carlos records the weights of the 25 cats that they find there, and Hector records the weights of the 25 human beings that they encounter.  Who will have the larger SD of their weights – Carlos or Hector?

This question is getting at the same understanding as the previous one*. Most students are quick to realize that the weights of human beings vary much more than the weights of ordinary domestic cats, so Hector will have a larger SD than Carlos.

* But this question involves cats, and I like cats!  I plan to devote a future post to nothing but questions that involve cats in one way or another.

3. Draw four rectangles so that the SD of their widths is greater than the SD of their heights.  This question was sent to me by Camille Fairbourn and John Keane in their proposal to conduct a breakout session at the 2019 U.S. Conference on Teaching Statistics* (link).  They later told me that the original source for the question is the Illustrative Mathematics project (link). I especially like this question because if you understand the concept of SD, you can answer this question correctly with a moment’s thought and less than a minute of time to draw the rectangles.  But if you do not understand the concept, you’re not going to succeed by (accidentally) drawing the rectangles correctly by random chance.

* If you want to impress me with a proposal for a session in a conference that I am chairing: Ask good questions!

4. Consider the following dotplots of customer ratings (on a scale of 1 – 5) of four restaurants (A – D).  Arrange the four restaurants in order from smallest SD to largest SD, without performing any calculations.

First notice that all four restaurants have an average (mean) rating of 3, right in the middle of the scale.  I hope that this helps students to focus on variability as the key idea here.

Let’s start with ratings of restaurant B, which display no variability whatsoever, because all 10 customers gave a mediocre rating of 3.  On the other extreme, customers disagree very strongly about restaurant C, with half giving a stellar rating of 5 and the other half giving a lousy rating of 1.  These extreme cases reveal that the SD is smallest for B and largest for C.

What about restaurants A and D?  Remember that the average (mean) rating is 3 for both, and notice that A has more ratings near the middle while D has more ratings on the ends.  In fact, you could produce the distribution for A by starting with D and moving one of the 1s and one of the 5s to 3.  Therefore, the SD is smaller for A than for D.   The correct ordering, from smallest SD to largest SD, is: B – A – D – C.

Many students struggle with this question, even when I encourage them to discuss it in groups.  I think one of their primary difficulties is appreciating that I am asking about the variability in the ratings.  Some students seem to think that the question is about variability in the frequencies (counts), so they think D shows no variability because the frequency is the same for each rating score (2 customers gave each possible rating score from 1 – 5).  Other students seem to think that variability refers to the number of different rating scores used, so they think that A has more variability than C because A’s customers used all five rating scores whereas C’s customers only used the values 1 and 5.

If you’re really intent on having students calculate an SD or two by hand, you might do that for the ratings of restaurants B and C here.  The calculation is very easy for B, because every customer gave a rating of 3, so the mean is 3, so all of the deviations from the mean are 0, so the SD = 0.  For restaurant C, the five ratings of 5 all have a squared deviation of 2² = 4, and the five ratings of 1 all have a squared deviation of (-2)² = 4.  The sum of squared deviations is therefore 10×4 = 40.  Dividing by one less than the sample size gives 40/9 ≈ 4.444.  Taking the square root gives SD ≈ 2.108.  We often say the SD “sorta-kinda” represents a typical deviation from the mean, so close to 2 seems about right for the SD of ratings in restaurant C.

The numerical values of these SDs turn out to be 0.000 for B, 1.155 for A, 1.491 for D, and 2.108 for C.
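If you want to check the ordering with software, here is a minimal Python sketch. The dotplots are not reproduced in this text, so the rating lists below are my reconstruction from the descriptions above (10 customers per restaurant, mean rating of 3); please treat them as assumptions rather than the original data file.

```python
from statistics import mean, stdev

# Ratings reconstructed from the verbal descriptions (an assumption):
# B is all 3s, C is half 1s and half 5s, D has two of each rating,
# and A is D with one 1 and one 5 moved to 3.
ratings = {
    "A": [1, 2, 2, 3, 3, 3, 3, 4, 4, 5],
    "B": [3] * 10,
    "C": [1] * 5 + [5] * 5,
    "D": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
}

for name, values in ratings.items():
    print(name, "mean:", mean(values), "SD:", round(stdev(values), 3))

# Sort restaurant names from smallest SD to largest SD.
ordering = sorted(ratings, key=lambda name: stdev(ratings[name]))
print(" - ".join(ordering))
```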

5. Now let’s throw a fifth restaurant into the mix.  What about the SD of ratings for restaurant E below – where does that fall in the ordering among restaurants A-D?

Some students are fooled by the “bumpiness” of the distribution of ratings for restaurant E, because the frequencies/counts bounce up from a rating of 1 to a rating of 2, and then down to a rating of 3, and then back up to 4 and back down to 5.  But as we noted above, we need to focus on the variability of the ratings, not the variability of the frequencies.  Restaurant E’s ratings have more variability than B’s and less than C’s, but how do they compare to A and D?  Notice that you could create E’s distribution from D’s by moving a rating of 1 to a rating of 2 and a rating of 5 to a rating of 4.  So, E has less variability than D.  But E has more variability than A, because you could also create E’s distribution from A’s by moving one rating of 3 to 2 and another rating of 3 to 4.  The SD of the ratings for restaurant E turns out to be 1.247.

6. Can SD ever equal zero?  Under what circumstances?

Sure.  All that’s needed for an SD to equal zero is for the data to display no variability whatsoever.  In other words, SD = 0 when all of the data values equal the same value, as we saw with ratings of restaurant B above.

7. Can SD ever be negative?  Under what circumstances?

No, an SD value can never be negative.  Data cannot have less than no variability, so 0 is the smallest possible value for an SD.  Mathematically, the formula for SD involves squaring deviations from the mean; those squared values can never be negative.

8. If I were to add 5 points to the exam score of every student in my class, would the SD of the exam scores increase, decrease, or remain the same?  Explain why.

Adding 5 points to every exam score would shift the distribution of scores to the right by 5 points, and it would increase the average (mean) score by 5 points.  But the amount of variability in the exam scores would not change, so the SD would not change.

9. If I were to double the exam score of every student in my class, would the SD of the exam scores increase, decrease, or remain the same?  Explain why.

Doubling the exam scores would increase their variability, so the SD would increase*.  To be more precise, the SD would double.  If you’re teaching a course for mathematically inclined students, you could ask them to derive this result from the formula, but I don’t recommend that for students in a typical “Stat 101” course.

* Some of you may be thinking that if every student earned identical exam scores in the first place, then doubling the scores would not increase the SD, because the SD would still equal zero.

10. If I were to add 500 points to the exam score for one lucky student in my class, would the SD of the exam scores change very much?  Explain your answer.

Yes, such an incredibly extreme outlier would have a massive impact on the SD.  How can you tell?  Because the mean would be greatly affected by the enormous outlier, and so deviations from the mean would also be affected, and so squared deviations would be all the more affected.  In other words, SD is not at all resistant to outliers.
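Questions 8 through 10 can also be explored numerically. The sketch below uses a handful of hypothetical exam scores that I invented purely for illustration; any list of scores that are not all identical would show the same three behaviors.

```python
from statistics import stdev

# Hypothetical exam scores, invented purely for illustration.
scores = [62, 70, 75, 81, 88, 94]

shifted = [x + 5 for x in scores]                # question 8: add 5 to everyone
doubled = [2 * x for x in scores]                # question 9: double everyone
with_outlier = scores[:-1] + [scores[-1] + 500]  # question 10: one lucky student

print("original SD:    ", round(stdev(scores), 2))
print("after adding 5: ", round(stdev(shifted), 2))
print("after doubling: ", round(stdev(doubled), 2))
print("with outlier:   ", round(stdev(with_outlier), 2))
```

Adding a constant leaves the SD unchanged, doubling the scores doubles it, and the single 500-point bonus inflates it dramatically.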

There you have it – ten questions to help students make sense of standard deviation.  But wait a minute – there’s no real data in any of these examples!  That’s a fair criticism, but I think these questions can nevertheless be effective for developing conceptual understanding (recommendation #2 in the GAISE report, link).  Of course, we can ask good questions that develop conceptual understanding and use real data (GAISE recommendation #3).  But this post has already gotten pretty long.  Please stay tuned for next week’s installment, which will feature questions with real data that seek to develop students’ understanding of the dreaded standard deviation.

#5 A below-average joke

I sometimes joke that when I moved from Pennsylvania to California, the average IQ dropped in both states*.  Then I follow up in class with my students (after the uproarious laughter subsides) by asking:

Is it actually possible, in principle anyway, for the average IQ to decrease in both states when a person moves from Pennsylvania to California?  If so, what would have to be true for this to happen?

I encourage students to think about this for a minute and then discuss it with someone nearby.  I want students to realize that this phenomenon could in fact happen, provided that my IQ was greater than the average IQ in Pennsylvania and less than the average IQ in California.

* For my friends and blog followers in Pennsylvania, please forgive me!  You are quite welcome to tell your students that the average IQ increased in both states when I moved.  You might begin the story by saying “a statistics-teacher-who-seems-to-think-he’s-funny-but-actually-isn’t sometimes jokes that when he moved …”.

Here’s a follow-up question:

Let’s say that my average class size in Pennsylvania was 24 students per class, and my average class size in California has been 36 students per class.  Does it therefore follow that the average class size across my teaching career has been 30 students, the average of 24 and 36?  What additional information (if any) is needed to determine the overall average?

This question is more challenging than the previous one.  Again I urge students to discuss this with each other, and most come to recognize that calculating an overall average by taking the average of two group averages will only work if the two groups have the same sample size.

Then I tell my students that I have taught for more years in California (18) than in Pennsylvania (12), and I also teach more classes per year in California than I did in Pennsylvania.  This leads to my asking:

Is the average class size for my career closer to 24 or 36 students per class?

Because I have taught more classes in California than I did in Pennsylvania, the average class size for my career will be closer to the California average of 36 students per class than the Pennsylvania average of 24.

I realize that these questions might not work as well for you, lacking the autobiographical element.  My point is that even though students begin to learn about averages in early grades, we can ask good questions that lead students to understand more subtle properties of how averages work.  Here are some more questions along these lines:

Suppose that the average (mean) house price in a community is $425,000.  Does it follow that half of the houses cost more than $425,000?

No, this does not follow.  Half of the houses would cost more than the median, but the mean could be quite different from the median.

Do you suspect that more or less than 50% of the houses cost more than the average (mean), and why?  (Hint: Think about the likely shape for the distribution of house prices.)

This question requires a lot more thought than the previous one. The distribution of house prices is likely to be skewed to the right, because some houses cost millions of dollars whereas there’s a lower bound on how little a house can cost (zero, for an extreme lower bound).  With a skewed right distribution, the mean is typically greater than the median, so less than 50% of the houses would cost more than the mean.

Is it possible for 90% of a company’s employees to earn less than the average (mean) salary in the company?

Yes.  A few outliers on the high end, or a strong enough right skew of the salary distribution, could pull up the mean so much that a large majority of employees earn less than the average salary.

Is this possibility only a mathematical curiosity, or is it actually quite plausible?

This is quite plausible, because the salaries of a few highly paid executives could make this happen.

Construct a hypothetical example for a company with ten employees to illustrate this property.

Some students find this to be challenging, and I again encourage them to discuss it in groups.  We can construct a fairly simple example by supposing that the company president earns a salary of $500K (K = thousand) per year and her nine employees each earn a salary of $100K per year.  Then the sum of these salaries (in thousands of dollars) is 500K + 9×100K = 1400K, which produces an average of 1400K/10 = 140K dollars.  All nine of the employees earning $100K earn less than the average, so 90% of the employees earn less than the average.
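Here is a small Python sketch of that hypothetical company, in case students want to verify the claim or experiment with other salary values. The variable names are my own.

```python
from statistics import mean, median

# Salaries in thousands of dollars: one president and nine employees,
# matching the hypothetical example described above.
salaries_k = [500] + [100] * 9

avg = mean(salaries_k)
below = sum(1 for s in salaries_k if s < avg)
share_below = below / len(salaries_k)

print("mean salary (K):  ", avg)
print("median salary (K):", median(salaries_k))
print("share below mean: ", share_below)
```

Notice that the median stays at the typical employee's salary while the mean is pulled up by the single large value.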

Now let’s consider a more challenging set of questions in a context with which students have considerable experience: class sizes.  Suppose that a total of 150 students at a college are taking a statistics course.  The college offers five sections of the course, each taught by a different instructor.  The class sizes are shown in the following table and graph:

Number of students | 15 | 15 | 15 | 15 | 90

Think about the average class size (number of students in a class) at the college.

What is the average class size, per instructor, across the five instructors?

This one is easy: The total number of students is 90 + 15 + 15 + 15 + 15 = 150, so the average class size per instructor (or per section) is 150/5 = 30 students.  The college can truthfully advertise 30 as its average class size.

Now look at class size from the students’ perspective.  Across these 150 students, what is the average size, per student, of their statistics class?

This is much harder. We now need to consider 150 data values, with each student reporting how many students are in their statistics class.  The graph of this distribution is very different from the previous one:

Whereas the previous graph (with sections/instructors as the observational units) was heavily weighted toward the low end of 15 students in a class, this new graph (with individual students as the observational units) is weighted more toward the high end with 90 students in a class.   We can anticipate that the average class size will now be much larger than it was before.  You might want to remind students at this point that the mean/average can be thought of as the balance point of the distribution.

Calculating the average requires adding up the 150 values, which sounds quite daunting, but let’s think it through:

  • The 90 students in Class A all have 90 students in their class, so all 90 of these students will report the value 90.  Adding up these 90 values of 90 gives 90×90  = 8100.
  • The other 60 students (in Classes B-E) all have 15 students in their class, so all 60 of these students will report the value 15.  Adding up these 60 values of 15 gives 60×15 = 900.
  • The sum of these class size values for the entire group of 150 students is therefore 8100 + 900 = 9000.
  • Dividing by 150 to calculate the average gives 9000/150 = 60 students in an average class, from the students’ perspective.
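The same two averages can be reproduced with a short Python sketch. The only inputs are the five section sizes given above; the variable names are my own, and building a list with one entry per student is just one convenient way to switch observational units.

```python
from statistics import mean

section_sizes = [90, 15, 15, 15, 15]

# Observational units = sections: one data value per section.
per_section_mean = mean(section_sizes)

# Observational units = students: each student reports the size of
# their own class, so a class of size s contributes s copies of s.
per_student = [size for size in section_sizes for _ in range(size)]
per_student_mean = mean(per_student)

# Equivalent shortcut: a weighted average with the sizes as weights.
weighted = sum(s * s for s in section_sizes) / sum(section_sizes)

print(per_section_mean, per_student_mean, weighted)
```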

Comment on how these two average class sizes compare, and explain why they differ as they do.  Which average would you say is more relevant?

From the students’ perspective, the average class size is not even close to 30!  In fact, it’s twice that size.  Some students (60 of them, which is 40% of the total of 150 students) have the benefit of a small class size of 15.  But most students (90 of them, which is 60% of the students) are stuck in a large class of 90 students.  In other words, most students would be pretty seriously misled if they chose this college based on the advertised average class size of 30 students per class.  From the students’ perspective, it seems that 60 is the more relevant average to report.

How many of the 150 students are in a class with the average class size, for either version of the average?

Zero.  None of the 150 students are in a 30-student class or a 60-student class.  They are all in either a 90-student class or a 15-student class.  This again illustrates, as we emphasized in the previous post (here), that averages are of limited value because they say nothing about variability and nothing about the distribution.

Calculate the median class size from each perspective.  Comment on how the medians compare to each other and to the averages (means).

From the perspective of the five instructors, the median class size is 15.  From the perspective of the 150 students, the median class size is 90.  The median is considerably less than the mean from the instructors’ perspective, and the median is much greater than the mean from the students’ perspective.
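A quick check of the medians from both perspectives, again as a hedged Python sketch that uses only the five section sizes from the example (variable names are mine):

```python
from statistics import median

section_sizes = [90, 15, 15, 15, 15]

# One value per section (the instructors' perspective).
median_by_section = median(section_sizes)

# One value per student (the students' perspective).
per_student = [size for size in section_sizes for _ in range(size)]
median_by_student = median(per_student)

print(median_by_section, median_by_student)
```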

What’s the point of my asking these questions about class size?

  1. We can ask questions about averages that get beyond basic calculations, leading students to explore and develop an understanding of the concept of averages.  Focusing on conceptual understanding is the second GAISE recommendation (here), after the recommendation to teach statistical thinking that was emphasized in posts #3 and #4.
  2. We always need to be careful with our language, especially to make clear what the observational units in a statistical study are.  In this case, we need to be clear about whether the observational units are the five instructors/sections or the 150 students.  The analysis of average class size differs considerably depending on which perspective we take.
  3. As I emphasized in the previous post about cancer pamphlets (here), measures of center don’t tell the whole story.  Variability is crucial to consider, and distribution also matters a lot.

P.S. I first read about calculating average class sizes from both students’ and instructors’ perspectives in a review of introductory textbooks by George Cobb that appeared in JASA in 1987.  George praised a textbook by Wonnacott and Wonnacott for including questions about this.  George’s review made a big impact on my teaching career, particularly his mantra, repeated throughout the review: Judge a book by its exercises, and you cannot go far wrong.  This sage advice strongly influenced the development of my teaching philosophy*.

* In case you missed it: Ask good questions.

#4 Statistics of illumination, part 2

This is the second post in my “Statistics of illumination” series, in which I present examples to illustrate that statistics can shed light on important questions.  I use these examples on the first day of a statistical literacy course and also in presentations to high school students.  The methods used are quite simple, but the ideas involved are fairly sophisticated.  Click here for the first post in this series.  Questions that I pose to students appear in italics below.

This example is based on a consulting project undertaken by my friend Tom Short, which he and his collaborators described in a JSE article (here).  The research question is whether cancer pamphlets are written at the appropriate level to be understood by cancer patients.  The data collection involved two aspects.  A sample of cancer patients were given a reading test, and a sample of cancer pamphlets were analyzed for their readability.  The data for both patients and pamphlets are on the same grade level scale; for example, 8 means that a patient could read at the 8th grade level, or that a pamphlet’s readability was at the 8th grade level.  The data are summarized in the following two tables:

How might you go about measuring these characteristics: reading level of patients and readability of pamphlets?

Your students have probably taken lots of reading tests, but the idea of measuring readability of text might be new to them.  Measuring readability can involve aspects such as the average length of a sentence, average length of a word, and rate of polysyllabic words.*

* I often joke that my favorite polysyllabic word is “polysyllabic.”  That one word has more syllables than my entire teaching philosophy: Ask good questions.

Explain why it’s not possible to calculate the mean reading level of a patient, based on how the data are presented here.

Most students correctly note that this is not possible because we need numerical values to calculate a mean, so the “below 3” and “above 12” categories prevent that.

Is it possible to calculate the median reading level of a patient?

Many students correctly respond that this is indeed possible, because we simply need ordered data to determine the middle value.  The non-numeric categories are on the two extremes and not in the middle, so they do not impede calculating the median.

Calculate the median reading level of a patient and the median readability level of a pamphlet.

For the patients, students need to determine the (63+1)/2 = 32nd value in order.  This gives some students pause, until they realize that they need to add up the counts for each grade level until they reach the 32nd value.  I point out that they can count from the top or bottom; that’s the whole point that the median is in the middle!  Because there are an even number of pamphlets (30), students need to calculate the average of the 15th and 16th values in order.  The median values turn out to be grade level 9 for both the patients’ reading levels and the pamphlets’ readability levels.
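The counting process described above can be sketched in Python. Because the two data tables are not reproduced in this text, the counts below are hypothetical values that I made up to illustrate the method; they are not the patients' or pamphlets' actual counts. The function name `median_categories` is also my own.

```python
def median_categories(counts):
    """Locate the median in a frequency table.

    `counts` is a list of (category, count) pairs sorted from lowest to
    highest.  Returns a one-item list for an odd total, or the two middle
    categories for an even total.
    """
    n = sum(count for _, count in counts)
    if n == 0:
        raise ValueError("the table is empty")
    # 1-based positions of the middle value(s) in the ordered data.
    positions = [(n + 1) // 2] if n % 2 == 1 else [n // 2, n // 2 + 1]
    found = []
    for pos in positions:
        running = 0
        for category, count in counts:
            running += count          # add up counts until we reach pos
            if running >= pos:
                found.append(category)
                break
    return found

# HYPOTHETICAL counts (not the real data): 11 values, open-ended extremes.
odd_table = [("below 3", 2), ("3", 1), ("4", 3), ("5", 2), ("above 5", 3)]
# HYPOTHETICAL counts (not the real data): 6 values.
even_table = [("6", 2), ("7", 3), ("8", 1)]

print(median_categories(odd_table))
print(median_categories(even_table))
```

Notice that the open-ended categories at the extremes cause no trouble, because the function only needs the ordering and the counts.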

Now we arrive at the key question:

Does the two medians being the same indicate that the patients’ reading levels and the pamphlets’ readability levels are well matched up with each other?

Many students have an initial reaction of yes.  I suppose I’m guilty of leading them to think this by asking them to calculate the medians before getting to this question.  For these students, I offer a substantial hint to point in a helpful direction:

What proportion of patients have a reading level below that of the simplest pamphlet?

The answer is 17/63 ≈ .270.  This means that 27% of the patients have a reading level below that of the simplest pamphlet.  This is clearly a problem, indicating that the patients’ reading levels and pamphlets’ readability levels are not well aligned.  Granted, it might not be possible to present cancer pamphlet information below sixth grade readability, but something has to be done to help the 27% of patients who cannot read at the sixth grade level or higher.

As I argued with the first example in this series, I think this small example illustrates statistical thinking.  More specifically, I emphasize four take-home messages for students from this example:

1. Start with a graph.  In this activity I purposefully led students into a bad habit: calculating summary statistics (medians, in this case) without first making a graph to look at the data.  The graph below clearly shows that the patients and pamphlets are not well aligned, because there’s a lot of red (representing patients) and no green (representing pamphlets) at the bottom of the scale.  In other words, the left side of the graph makes clear that many patients do not have a pamphlet that they can read.

2. Think about variability.  The discipline of statistics is all about variability, which is an obvious but easy to overlook fact of life.  Not everyone is the same: different people read at different levels.  Not all pamphlets are the same: some are easier to read than others.  Whereas measures of center, such as mean and median, are often very helpful, they tell us nothing about variability.

3. Think in terms of distribution.   It’s easy for us instructors to under-estimate how hard this is for students.  It’s much easier for students to see 63 reading levels, but understanding the distribution of those 63 values as a single entity, rather than as 63 separate things, is a very challenging cognitive step for many students.  Why is this so important here?  The question of whether pamphlets are well-aligned to patients is about how well the two distributions (of patients’ reading levels, of pamphlets’ readability levels) match up with each other across all reading levels.

4. Address the research question in context.  I sometimes use this example as a very open-ended question on a final exam, if I have not presented it earlier in the course.  Some students try to apply a two-sample t-test to these data.  I suspect this is because they think it’s somehow better to use more sophisticated methods rather than simpler ones.  But even if a two-sample t-test could be performed here (it can’t because the sample mean reading level of the patients cannot be calculated), such a test would not address the motivating question, just like calculating medians above did nothing to address the question.  Always make sure that your analysis of data addresses the reason that the data were collected.

P.S. I mentioned at the outset that Tom Short introduced me to these data, early in his career when he was teaching at Villanova University.  Tom and I became friends in graduate school.  From there we embarked on careers focused on teaching statistics.  He and I frequently attended conferences and workshops together, and we collaborated on several education projects.  I learned a lot about both statistics and teaching from Tom, often through conversations that we had while walking in places as varied as the streets of San Francisco and the beaches of Daytona.  We also enjoyed talking about sports and competing in a fantasy baseball league, in which Tom managed the wonderfully-named Markov Fielders.  Tom died unexpectedly in November of 2018, a great loss to the many friends, colleagues, and students who were enriched by knowing him.  On Wednesday of this week I will be chairing a memorial session for Tom at the Joint Statistical Meetings.  I dedicate this blog post to him.

#3 Statistics of illumination, part 1

Perhaps you’ve heard this expression:

Most people use statistics as a drunk uses a lamppost – more for support than for illumination.

I do not doubt that there is some truth in this clever saying, but I try to convince my students that statistics can shed light on important questions.  I have given talks to high school students with the title “Statistics of Illumination,” in which I present several examples to make this point.  In this post I will present one of the examples, and I will present other examples in later posts.  Questions that I pose to students will appear in italics.

Consider the following table of counts, based on data from the University of California at Berkeley’s graduate admissions process in 1973:


Why is it not reasonable to simply consider the counts 533 and 113 in order to compare admissions decisions of men and women?

This question leads students to consider the importance of proportional reasoning.  Because many more men than women applied to these programs, we need to calculate proportions (or percentages, or rates).

Calculate the proportion of male applicants who were accepted.  Also calculate the proportion of female applicants who were accepted.

These proportions can be calculated as:

  • Men: 533/1198 ≈ .445 were accepted
  • Women: 113/449 ≈ .252 were accepted

Comment on how these proportions compare. Does this difference appear to be substantial?

The acceptance rate* for men is almost 20 percentage points higher than the acceptance rate for women (44.5% vs. 25.2%).  This certainly seems like a big enough difference to indicate that something worth investigating further is going on here.

* Saying “acceptance rate” is much simpler language than saying “proportion who were accepted”  or even “proportion of acceptance,” but I suggest taking the time to explain to students that the term “acceptance rate” refers to a proportion here.

Let’s proceed to dig a little deeper.  The counts in the table above came from combining data from two programs that we’ll call A and F.  The following tables show the counts for these two programs separately:

Program | Men accepted | Men denied | Women accepted | Women denied
Prog A | 511 | 314 | 89 | 19
Prog F | 22 | 351 | 24 | 317
Total | 533 | 665 | 113 | 336

Before analyzing these data, first convince yourself that there’s no cheating here: The bottom row reveals that counts for programs A and F really do add up to the counts given earlier.

Within each program, calculate the proportion of male applicants who were accepted and the proportion of female applicants who were accepted.  Comment on how the proportions compare within each program.

This requires students to think a bit harder than the earlier calculation of proportions did, because they need to calculate for themselves the total number of applicants for each (program, sex) pair.  These acceptance proportions can be calculated as:

  • Program A, men: 511/(511+314) = 511/825 ≈ .619
  • Program A, women: 89/(89+19) = 89/108 ≈ .824
  • Program F, men: 22/(22+351) = 22/373 ≈ .059
  • Program F, women: 24/(24+317) = 24/341 ≈ .070
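These calculations, along with the combined acceptance rates from earlier, can be reproduced with a brief Python sketch. The counts are copied from the table above; the dictionary layout and the helper name `rate` are my own choices.

```python
# counts[program][sex] = (accepted, denied), copied from the table above.
counts = {
    "A": {"men": (511, 314), "women": (89, 19)},
    "F": {"men": (22, 351), "women": (24, 317)},
}

def rate(accepted, denied):
    """Proportion accepted among all applicants."""
    return accepted / (accepted + denied)

# Acceptance rate within each program.
for program, by_sex in counts.items():
    for sex, (accepted, denied) in by_sex.items():
        print(program, sex, round(rate(accepted, denied), 3))

# Acceptance rate with the two programs combined.
combined = {}
for sex in ("men", "women"):
    accepted = sum(counts[p][sex][0] for p in counts)
    denied = sum(counts[p][sex][1] for p in counts)
    combined[sex] = rate(accepted, denied)
    print("combined", sex, round(combined[sex], 3))
```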

Now when we compare the acceptance rates between men and women, we see a very different picture than before: Women have a higher acceptance rate than men in both programs!  The difference is slight in program F (7.0% vs. 5.9%) and considerable in program A (82.4% vs. 61.9%).

Based on this more in-depth analysis, is there evidence of discrimination against women in the graduate admissions process?

No.  At the program level, where admissions decisions are made, there’s no evidence to suggest that men have a higher acceptance rate than women. If anything, the acceptance rates appear to favor women in both programs.  Perhaps program A should have done more to recruit women applicants (only 108 of their 933 applicants were women), but they accepted a substantially higher proportion of women applicants than men.

Some students will comment on the weird thing that has happened here.  If not, you can ask them what’s odd about their calculations.  If that prompt does not work, go ahead and point out the oddity: Women have a higher acceptance rate than men in both programs, but men have a higher acceptance rate than women when you combine the two programs together.

Explain, based on the data provided, how this oddity occurs.

This is the hard part.  This question requires students to think through what’s happening here.  Typically, the first response I hear is: More men than women applied.  To which I respond: Yes, but that’s why we calculated proportions in the first place.  I hasten to add: That’s not completely off-track, but there’s more to it.  I often need to give a hint, so I ask students:

Think about two ways in which programs A and F differ from each other, with regard to applicants’ sex and acceptance rates.

Many students still struggle to discern what’s going on at this point.  But I resist telling them the explanation, because I think their struggle is worthwhile.  I also encourage them to work with nearby students to figure this out together.  Eventually students come to realize that:

  • Most men applied to program A, and most women applied to program F.
  • Program A had much higher acceptance rates than program F.

These two points, taken together, explain the oddity.  We can summarize this explanation more succinctly in one sentence: Men applied mostly to the program that’s easy to get into, whereas women applied mostly to the program that’s very hard to get into. This explains how it happens that women have a higher acceptance rate than men in both programs but a lower acceptance rate than men when the programs are combined.
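One way to make this explanation concrete is to write each combined acceptance rate as a weighted average of the two program rates, where the weights are the shares of applicants who applied to each program.  The sketch below is only an illustration of that idea, using the applicant totals calculated earlier (825, 108, 373, 341); the function names are hypothetical.

```python
# Applicants and acceptances by sex and program, from the program-level counts.
applicants = {"men": {"A": 825, "F": 373}, "women": {"A": 108, "F": 341}}
accepted = {"men": {"A": 511, "F": 22}, "women": {"A": 89, "F": 24}}


def share_applying_to(sex, program):
    """Proportion of this sex's applicants who applied to the given program."""
    return applicants[sex][program] / sum(applicants[sex].values())


def combined_rate(sex):
    """Combined acceptance rate as a weighted average of the program rates."""
    rate = 0.0
    for program, n in applicants[sex].items():
        weight = share_applying_to(sex, program)   # where this group applied
        program_rate = accepted[sex][program] / n  # how hard that program is
        rate += weight * program_rate
    return rate


for sex in ("men", "women"):
    print(f"{sex}: {share_applying_to(sex, 'A'):.1%} applied to program A, "
          f"combined acceptance rate {combined_rate(sex):.3f}")
```

Men put most of their weight on the program with the high acceptance rate, and women put most of theirs on the program with the low one, which is how the combined comparison can reverse.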

I believe that this example/activity illustrates statistical thinking, which is the first recommendation in the GAISE report.  The math/arithmetic involved here is quite straightforward, but the thinking required to explain this phenomenon is fairly sophisticated.  Moreover, this example/activity illustrates the new emphasis in the 2016 GAISE report about giving students experience with multivariable thinking.  The observational units here are the applicants, and the three variables are sex, admission decision, and program.  All three of these variables are related to each other, and understanding the oddity* requires understanding those relationships. You might refer to sex as the explanatory variable, admission decision as the response variable, and program as the confounding variable.

* This oddity is often known as Simpson’s paradox. When I started teaching 30 years ago, I joked with my students that it’s unclear whether Simpson’s paradox was named for Lisa or Bart.  I would not have guessed that Lisa and Bart would still be appearing in new episodes 30 years later!

I have used this example on the first day of class in a statistical literacy course.  In that setting I do not bother to introduce any terminology but instead focus on the statistical thinking involved.  You could also use it when discussing the analysis of two-way tables, or really at any point in a course.  The key idea to emphasize is that a relationship between two variables might be explained by considering how both could be related to a third variable.  And that statistics can be illuminating!

P.S. For a fun but difficult follow-up challenge, ask students to create their own made-up example for which Simpson’s paradox occurs.  For example, ask them to create an example with two softball players (call them Amy and Barb), where Amy has a higher proportion of successes (hits) than Barb in June and also in July, but Barb has a higher proportion of hits than Amy when June and July are combined.  This sounds weird, perhaps impossible, but it could happen. To succeed in creating such an example, you need to think through the two conditions needed to make this paradox happen.  Encourage students to think of how this example could be analogous to the admissions example, because if they just start making up some numbers and hope that the paradox will occur, they will be at it for a very long time!

The key here is that Amy has to get most of her attempts when it’s hard to get a hit, and Barb must get most of her attempts when it’s easy to get a hit.  Here’s one way to make this happen:

  • June: Amy gets 9 hits in 10 attempts (90%), Barb gets 80 hits in 100 attempts (80%)
  • July: Amy gets 20 hits in 100 attempts (20%), Barb gets 1 hit in 10 attempts (10%)

Sure enough, Amy does better than Barb in both months.  But when we combine the two months:

  • June and July combined: Amy gets 29 hits in 110 attempts (26.4%), Barb gets 81 hits in 110 attempts (73.6%)

Why does Barb do (much) better than Amy overall despite doing worse in each month?  Because she was lucky enough to get most of her attempts when it was easy to get a hit.  (The pitching was really lousy in June!!)
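Students who build their own example may want a quick way to check whether the paradox really occurs.  Here is a small Python sketch that does this for the Amy and Barb numbers; the helper `is_simpson_reversal` is a name I made up, and exact fractions are used only to avoid rounding questions.

```python
from fractions import Fraction

# (hits, attempts) for each player and month, taken from the example.
record = {
    "Amy": {"June": (9, 10), "July": (20, 100)},
    "Barb": {"June": (80, 100), "July": (1, 10)},
}


def rate(hits, attempts):
    """Proportion of successes as an exact fraction."""
    return Fraction(hits, attempts)


def combined(player):
    """Proportion of hits with both months pooled together."""
    hits = sum(h for h, _ in record[player].values())
    attempts = sum(a for _, a in record[player].values())
    return rate(hits, attempts)


def is_simpson_reversal(better, worse):
    """True if `better` wins every month but loses when months are combined."""
    wins_each_month = all(
        rate(*record[better][month]) > rate(*record[worse][month])
        for month in record[better]
    )
    return wins_each_month and combined(better) < combined(worse)


print(is_simpson_reversal("Amy", "Barb"))
print(float(combined("Amy")), float(combined("Barb")))
```

Changing the attempt counts in `record` (say, giving both players 50 attempts each month) is a quick way to see that the reversal disappears when the two players' attempts are spread the same way across the months.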

If you’d rather avoid the sports context, you could say that Amy and Barb are college roommates, with the oddity that Amy has a higher proportion of A grades than Barb in humanities courses and also in science courses, but Barb has a higher proportion of A grades than Amy when these two kinds of courses are combined.

Further reading: The Berkeley graduate admissions data are from a well-known example that has been presented in many textbooks.  I’ve used only programs A and F in order to keep things simpler.  The original article is here. An interesting follow-up is here. The Berkeley data, more examples, and more information about Simpson’s paradox are also presented in a Wikipedia entry here.

#2 My favorite question

This blog is about asking good questions to teach introductory statistics, so let me tell you about my all-time favorite question.  I want to emphasize from the outset that I had nothing to do with writing it. I’m just a big fan.

I am referring to question #6, called an investigative task, on the 2009 AP Statistics exam.  I’ll show you the question piece-by-piece, snipped from the College Board website. You can find this question and many other released AP Statistics exams here.

Here’s how the question begins:

Oh dear, I have to admit that this is an inauspicious start.  Frankly, I think this is a boring, generic context for a statistics question.  Even worse, there’s no mention of real data. What’s so great about this? Nothing at all, but please read on …

I think this is a fine question, but I admit that it’s a fairly routine one. Describing the parameter in a study is an important step, and I suspect that students find this much more challenging than many instructors realize.  I would call this an adequate question, perhaps a good question, certainly not a great question. So, I don’t blame you if you’re wondering why this is my all-time favorite question.  Please read on …

Now we’re getting somewhere.  I think this is pretty clever: presenting students with a statistic that they have almost certainly never encountered before, and asking them to figure out something about the unknown statistic based on what they know.  The question is not particularly hard, but it does ask students to apply something they know to a new situation. Students should realize that right-skewed distributions tend to have a larger mean than median, so the ratio mean/median should be greater than 1 with these data.

Part (b) also helps students to prepare for what comes next …

Now we’re talking!  I think part (c) makes this a great question.  To answer this part well, students have to understand the reasoning process of statistical significance, and they have to apply that reasoning process in a situation that they have almost surely never encountered or even thought about: making an inference about the symmetry or skewness of a population distribution.  This is extremely challenging, but I think this assesses something very important: whether students can apply what they have learned to a novel situation that goes a bit beyond what they studied.

Notice that this question does not use words such as hypothesis or test or reject or strength of evidence or p-value.  The key word in the question is plausible.  Students have to realize that the simulation analysis presented allows them to assess the plausibility of the assumption underlying the simulation: that the population follows a normal distribution.  Then they need to recognize that they can assess plausibility by seeing whether the observed value of the sample statistic is unusual in the simulated (null) distribution of that statistic. It turns out that the observed value of the mean/median ratio (1.03) is not very unusual in the simulated (null) distribution, because 14/100 of the simulated samples produced a statistic more extreme than the observed sample value.  Therefore, students should conclude that the simulation analysis reveals that a normally distributed population could plausibly have produced the observed sample.
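For readers who want to see the reasoning in code, here is a rough Python sketch of this kind of simulation.  It is not the exam’s simulation: the sample size, population mean, and standard deviation below are hypothetical values that I picked for illustration, so the proportion it prints should not be expected to match the 14/100 from the exam.  I also count simulated ratios at least as large as the observed one, which is one reasonable reading of “more extreme” here.

```python
import random
import statistics


def simulate_ratios(n, mu, sigma, reps, seed=0):
    """Mean/median ratios for `reps` samples of size n from a normal population."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    ratios = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ratios.append(statistics.mean(sample) / statistics.median(sample))
    return ratios


def proportion_at_least(ratios, observed):
    """Share of simulated ratios at least as large as the observed ratio."""
    return sum(r >= observed for r in ratios) / len(ratios)


# Hypothetical settings: n, mu, and sigma are illustrative guesses, not exam values.
ratios = simulate_ratios(n=10, mu=27.0, sigma=3.0, reps=100)
p = proportion_at_least(ratios, observed=1.03)
print(f"Proportion of simulated ratios >= 1.03: {p:.2f}")
```

The logic mirrors the question: the simulation assumes a normal population, and the observed ratio is judged by where it falls among the simulated ratios, not by the shape or center of the simulated distribution.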

A common student error is not recognizing the crucial role that the observed value (1.03) of the statistic plays.  More specifically, two common student errors are:

  • Commenting that the simulated distribution is roughly symmetric, and concluding that it’s plausible that the population distribution is normal.  Students who make this error are failing to notice the distinction between the simulated distribution of sample statistics and the population distribution of mpg values.
  • Commenting that the simulated distribution of sample statistics is centered around the value 1, which is the expected value of the statistic from a normal population, and concluding that it’s plausible that the population distribution is normal.  Students who make this error are failing to realize that the simulation assumed a normal population in the first place, which is why the distribution of simulated sample statistics is centered around the value 1.

If this question ended here, it would be one of my all-time favorites.  But it doesn’t end here. There’s a fourth part, which catapults this question into the exalted status of my all-time favorite.  Once again (and for the last time!), please read on…

Wow, look at what’s happening here!  Students are being told that they don’t have to restrict their attention to common statistics that they have been taught.  Rather, this question asks students to exercise their intellectual power to create their own statistic! Moreover, they should know enough to predict how their statistic will behave in a certain situation (namely, a right-skewed distribution).  This part of the question not only asks students to synthesize and apply what they have learned, but it also invites students to exercise an intellectual capability that they probably did not even realize they possess. Some common (good) answers from students include the following statistics, both of which should take a value greater than 1 with a right-skewed distribution:

  • (maximum – median) / (median – minimum)
  • (upper quartile – median) / (median – lower quartile)
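To see how these two statistics behave, here is a short Python sketch that computes both for a made-up right-skewed sample and for a symmetric one.  The data values are hypothetical, and the quartiles use the “inclusive” method of Python’s `statistics.quantiles`, which is just one of several common quartile conventions.

```python
import statistics


def range_ratio(data):
    """(maximum - median) / (median - minimum)"""
    med = statistics.median(data)
    return (max(data) - med) / (med - min(data))


def quartile_ratio(data):
    """(upper quartile - median) / (median - lower quartile)"""
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (q3 - med) / (med - q1)


# Hypothetical right-skewed sample, made up for illustration.
right_skewed = [1, 2, 2, 3, 3, 4, 5, 7, 10, 18]
# A symmetric sample for comparison.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]

for name, data in [("right-skewed", right_skewed), ("symmetric", symmetric)]:
    print(name, round(range_ratio(data), 2), round(quartile_ratio(data), 2))
```

Both ratios compare the distance above the median to the distance below it, so a long right tail pushes them above 1 while a symmetric sample leaves them at 1.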

There you have it: my all-time favorite question from an introductory statistics exam.  I encourage you to ask this question, or some variation of it*, of your students. I suggest asking this in a low-stakes setting and then discussing it with students afterward.  Encourage them to realize that the reasoning processes they learn in class can be applied to new situations that they have not explicitly studied, and also help them to recognize that they are developing the intellectual power to create new analyses of their own.

* Even though this is my all-time favorite question, I suggest three revisions related to part (c).  First, I would provide students with sample values of the mean and median and ask them to calculate the value of the ratio for themselves.  I think this small extra step might help some students to realize the importance of seeing where the observed value of the statistic falls in the simulated distribution.  Second, I recommend altering the sample data a bit to make the observed value of the sample statistic fall quite far out in the tail of the simulated (null) distribution of the statistic.  This would lead to rejecting the plausibility of a normally distributed population in favor of concluding that the population distribution was right-skewed.  I think this conclusion might be a bit easier for students to recognize, while still assessing whether students understand how to draw an appropriate conclusion from the simulation analysis.  Third, I would prefer to use 1000 or 10,000 repetitions for the simulation, which would require using a histogram rather than a dotplot for the display.

P.S. I mentioned at the top that I had nothing to do with writing this question.  Three people who played a large role in writing it and developing a rubric for grading it were Bob Taylor, Chris Franklin, and Josh Tabor.  They all served on the Test Development Committee for AP Statistics at the time. Bob chaired the committee, Chris served as Chief Reader, and Josh was the Question Leader for the grading of this question. Josh also wrote a JSE article (here) that analyzed various choices for the skewness statistic in part (d).

#1 Ten questions about this blog

Legend has it that when asked for advice about becoming a successful singer, Frank Sinatra replied: Sing good songs.  My advice for becoming an effective teacher of statistics is similarly succinct: Ask good questions.

In this blog I will present ideas for teaching statistics, based on this three-word teaching philosophy: ask good questions.  Let me pose and answer ten “w” questions about this blog:

  1. Who is the audience?  I am writing this for teachers of statistics, particularly introductory statistics, at all levels.  I hope to present ideas that will be relevant and helpful to high school teachers and two- and four-year college teachers of statistics.  I especially hope that relatively new teachers, and those whose primary training was not in statistics, will find this blog to be beneficial.
  2. What is the content?  This blog will provide ideas for teaching statistics, especially introductory statistics.  Most of the posts will describe activities that you can use in class, or questions that you can use on assessments, to help students learn statistics.  Other posts will feature discussions of what and how to teach in statistics courses.
  3. Who am I?  (Why, thanks for asking!)  I have been teaching statistics, primarily at the introductory level to undergraduate students, for thirty years.  I have taught both at Dickinson College, a liberal arts college in Pennsylvania, and in the Statistics Department at Cal Poly in San Luis Obispo.  I have given more than 100 conference presentations and conducted more than 100 faculty development workshops about teaching statistics.  Click on “about me” if you’d like to read more.
  4. What is the style?  I am aiming for a very conversational style in this blog.  This will not be academic writing by any stretch.  I hope to convey in writing a style that I try to adopt in workshops – speaking directly to other teachers and proposing concrete suggestions for teaching statistics.  I may even try for humor on occasion.
  5. Why should you read this?  Because you’re interested in picking up ideas and activities and questions that you can use for teaching statistics.  I’d like to think that some of my suggestions are worthwhile, and I’m counting on you to identify which ones they are.
  6. Was that one of my attempts at humor?  Well, I suppose so, but please stick around – I can do better.
  7. Where and when will posts appear?  For the next year I plan to post one blog entry per week, most likely on Monday mornings, right here.  You can use the button on the right to sign up as a follower and receive email notifications about new posts.
  8. Will this blog gradually build suspense, like a great novel or movie or television series, until a climactic moment at which the secret to successful teaching of statistics will be revealed?  No.  The secret has already been revealed.  In case you missed it, please see the name of this blog and re-read the first and second paragraphs.  In fact, why not, I’ll repeat it again: Ask good questions.  That’s the take-home message.  If I have any wisdom to impart, that’s it.  The rest is just details.  But I’ll do my best to make the details relevant, thought-provoking, and fun.
  9. What makes a question good?  Now that’s an excellent question!  But I am going to respectfully decline to answer at first.  Instead I will provide lots of examples of what I think are good questions for teaching/learning statistics in coming posts.  Then I will try to address this by discerning some general characteristics of good questions from those examples.
  10. Whoa, wait a minute!  Why should you waste your time with such obvious advice as “ask good questions”?  I admit that you’ve probably never been advised to ask bad questions, or not to ask questions at all.  Nevertheless, I don’t think my advice is unhelpful.  I suspect that we teachers spend too much of our most precious commodity – time – on creating presentations for students to hear and writing exposition for them to read.  I think we serve our students’ learning much better by investing our time into crafting good questions that lead students to develop and deepen their understanding of what we want them to learn.  Please join me as I try to persuade you of this in coming posts.