Posts tagged ‘descriptive statistics’

#5 A below-average joke

I sometimes joke that when I moved from Pennsylvania to California, the average IQ dropped in both states*.  Then I follow up in class with my students (after the uproarious laughter subsides) by asking:

Is it actually possible, in principle anyway, for the average IQ to decrease in both states when a person moves from Pennsylvania to California?  If so, what would have to be true for this to happen?

I encourage students to think about this for a minute and then discuss it with someone nearby.  I want students to realize that this phenomenon could in fact happen, provided that my IQ was greater than the average IQ in Pennsylvania and less than the average IQ in California.
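A minimal sketch in Python can make this concrete. The IQ scores below are invented purely for illustration; the only thing that matters is that the mover's score sits above the Pennsylvania average and below the California average.

```python
from statistics import mean

# Hypothetical IQ scores, invented for illustration only.
pennsylvania = [90, 95, 100, 105, 110]   # average 100; the mover is the 110
california = [115, 120, 125]             # average 120
mover_iq = 110                           # above 100, below 120

pa_before = mean(pennsylvania)
ca_before = mean(california)

pennsylvania.remove(mover_iq)   # the mover leaves Pennsylvania
california.append(mover_iq)     # and arrives in California

pa_after = mean(pennsylvania)
ca_after = mean(california)

print(f"Pennsylvania: {pa_before} -> {pa_after}")
print(f"California:   {ca_before} -> {ca_after}")
```

Both averages go down, even though no individual's score changed.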

* For my friends and blog followers in Pennsylvania, please forgive me!  You are quite welcome to tell your students that the average IQ increased in both states when I moved.  You might begin the story by saying “a statistics-teacher-who-seems-to-think-he’s-funny-but-actually-isn’t sometimes jokes that when he moved …”.

Here’s a follow-up question:

Let’s say that my average class size in Pennsylvania was 24 students per class, and my average class size in California has been 36 students per class.  Does it therefore follow that the average class size across my teaching career has been 30 students, the average of 24 and 36?  What additional information (if any) is needed to determine the overall average?

This question is more challenging than the previous one.  Again I urge students to discuss this with each other, and most come to recognize that calculating an overall average by taking the average of two group averages will only work if the two groups have the same sample size.

Then I tell my students that I have taught for more years in California (18) than in Pennsylvania (12), and I also teach more classes per year in California than I did in Pennsylvania.  This leads to my asking:

Is the average class size for my career closer to 24 or 36 students per class?

Because I have taught more classes in California than I did in Pennsylvania, the average class size for my career will be closer to the California average of 36 students per class than the Pennsylvania average of 24.
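The overall average is a weighted average, with each state's average weighted by the number of classes taught there. The sketch below assumes 6 classes per year in Pennsylvania and 8 per year in California; those two numbers are made up for illustration, since only the years (12 and 18) and the two averages (24 and 36) are given above.

```python
# Years and average class sizes come from the text above.
pa_years, ca_years = 12, 18
pa_avg, ca_avg = 24, 36

# ASSUMPTION: classes taught per year (hypothetical values, not from the post).
pa_classes = pa_years * 6
ca_classes = ca_years * 8

# Weighted average: total students divided by total classes.
total_students = pa_classes * pa_avg + ca_classes * ca_avg
career_avg = total_students / (pa_classes + ca_classes)

# Unweighted average of the two group averages, for comparison.
naive_avg = (pa_avg + ca_avg) / 2

print(career_avg, naive_avg)
```

Under these assumed counts the career average lands above 30, pulled toward the California figure because more classes were taught there.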

I realize that these questions might not work as well for you, lacking the autobiographical element.  My point is that even though students begin to learn about averages in early grades, we can ask good questions that lead students to understand more subtle properties of how averages work.  Here are some more questions along these lines:

Suppose that the average (mean) house price in a community is $425,000.  Does it follow that half of the houses cost more than $425,000?

No, this does not follow.  Half of the houses would cost more than the median, but the mean could be quite different from the median.

Do you suspect that more or less than 50% of the houses cost more than the average (mean), and why?  (Hint: Think about the likely shape for the distribution of house prices.)

This question requires a lot more thought than the previous one. The distribution of house prices is likely to be skewed to the right, because some houses cost millions of dollars whereas there’s a lower bound on how little a house can cost (zero, for an extreme lower bound).  With a skewed right distribution, the mean is typically greater than the median, so less than 50% of the houses would cost more than the mean.

Is it possible for 90% of a company’s employees to earn less than the average (mean) salary in the company?

Yes.  A few outliers on the high end, or a strong enough right skew of the salary distribution, could pull up the mean so much that a large majority of employees earn less than the average salary.

Is this possibility only a mathematical curiosity, or is it actually quite plausible?

This is quite plausible, because the salaries of a few highly paid executives could make this happen.

Construct a hypothetical example for a company with ten employees to illustrate this property.

Some students find this to be challenging, and I again encourage them to discuss it in groups.  We can construct a fairly simple example by supposing that the company president earns a salary of $500K (K = thousand) per year and her nine employees each earn a salary of $100K per year.  Then the sum of these salaries (in thousands of dollars) is 500K + 9×100K = 1400K, which produces an average of 1400K/10 = 140K dollars.  All nine of the employees earning $100K earn less than the average, so 90% of the employees earn less than the average.
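The arithmetic for this ten-person company can be checked with a short Python sketch, using the salaries from the example above (in thousands of dollars):

```python
from statistics import mean, median

# Salaries in thousands of dollars: one president and nine employees.
salaries = [500] + [100] * 9

avg = mean(salaries)
mid = median(salaries)

# Proportion of people earning less than the mean salary.
share_below_mean = sum(s < avg for s in salaries) / len(salaries)

print(avg, mid, share_below_mean)
```

The median salary stays at 100K, which is why the mean of 140K describes nobody in the company very well.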

Now let’s consider a more challenging set of questions in a context with which students have considerable experience: class sizes.  Suppose that a total of 150 students at a college are taking a statistics course.  The college offers five sections of the course, each taught by a different instructor.  The class sizes are shown in the following table and graph:

Number of students:  15,  15,  15,  15,  90

Think about the average class size (number of students in a class) at the college.

What is the average class size, per instructor, across the five instructors?

This one is easy: The total number of students is 90 + 15 + 15 + 15 + 15 = 150, so the average class size per instructor (or per section) is 150/5 = 30 students.  The college can truthfully advertise 30 as its average class size.

Now look at class size from the students’ perspective.  Across these 150 students, what is the average size, per student, of their statistics class?

This is much harder. We now need to consider 150 data values, with each student reporting how many students are in their statistics class.  The graph of this distribution is very different from the previous one:

Whereas the previous graph (with sections/instructors as the observational units) was heavily weighted toward the low end of 15 students in a class, this new graph (with individual students as the observational units) is weighted more toward the high end with 90 students in a class.   We can anticipate that the average class size will now be much larger than it was before.  You might want to remind students at this point that the mean/average can be thought of as the balance point of the distribution.

Calculating the average requires adding up the 150 values, which sounds quite daunting, but let’s think it through:

  • The 90 students in Class A all have 90 students in their class, so all 90 of these students will report the value 90.  Adding up these 90 values of 90 gives 90×90  = 8100.
  • The other 60 students (in Classes B-E) all have 15 students in their class, so all 60 of these students will report the value 15.  Adding up these 60 values of 15 gives 60×15 = 900.
  • The sum of these class size values for the entire group of 150 students is therefore 8100 + 900 = 9000.
  • Dividing by 150 to calculate the average gives 9000/150 = 60 students in an average class, from the students’ perspective.
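The steps above amount to weighting each class size by itself, because a class of size s contributes s students who each report the value s. A short Python sketch of that shortcut, using the five class sizes from the table:

```python
class_sizes = [15, 15, 15, 15, 90]

# Per-instructor average: each section counts once.
per_instructor = sum(class_sizes) / len(class_sizes)

# Per-student average: each section of size s contributes s reports of s,
# so the sum of all reports is the sum of s * s.
total_reported = sum(s * s for s in class_sizes)
per_student = total_reported / sum(class_sizes)

print(per_instructor, total_reported, per_student)
```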

Comment on how these two average class sizes compare, and explain why they differ as they do.  Which average would you say is more relevant?

From the students’ perspective, the average class size is not even close to 30!  In fact, it’s twice that size.  Some students (60 of them, which is 40% of the total of 150 students) have the benefit of a small class size of 15.  But most students (90 of them, which is 60% of the students) are stuck in a large class of 90 students.  In other words, most students would be pretty seriously misled if they chose this college based on the advertised average class size of 30 students per class.  From the students’ perspective, it seems that 60 is the more relevant average to report.

How many of the 150 students are in a class with the average class size, for either version of the average?

Zero.  None of the 150 students are in a 30-student class or a 60-student class.  They are all in either a 90-student class or a 15-student class.  This again illustrates, as we emphasized in the previous post (here), that averages are of limited value because they say nothing about variability and nothing about the distribution.

Calculate the median class size from each perspective.  Comment on how the medians compare to each other and to the averages (means).

From the perspective of the five instructors, the median class size is 15.  From the perspective of the 150 students, the median class size is 90.  The median is considerably less than the mean from the instructors’ perspective, and the median is much greater than the mean from the students’ perspective.
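One way to check these medians is to build the list of 150 student reports explicitly and let Python find the middle of each list. This is only a sketch of the calculation described above:

```python
from statistics import median

class_sizes = [15, 15, 15, 15, 90]

# Students' perspective: every student reports the size of their own class.
student_reports = [size for size in class_sizes for _ in range(size)]

median_per_instructor = median(class_sizes)
median_per_student = median(student_reports)

print(len(student_reports), median_per_instructor, median_per_student)
```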

What’s the point of my asking these questions about class size?

  1. We can ask questions about averages that get beyond basic calculations, leading students to explore and develop an understanding of the concept of averages.  Focusing on conceptual understanding is the second GAISE recommendation (here), after the recommendation to teach statistical thinking that was emphasized in posts #3 and #4.
  2. We always need to be careful with our language, especially to make clear what the observational units in a statistical study are.  In this case, we need to be clear about whether the observational units are the five instructors/sections or the 150 students.  The analysis of average class size differs considerably depending on which perspective we take.
  3. As I emphasized in the previous post about cancer pamphlets (here), measures of center don’t tell the whole story.  Variability is crucial to consider, and distribution also matters a lot.

P.S. I first read about calculating average class sizes from both students’ and instructors’ perspectives in a review of introductory textbooks by George Cobb that appeared in JASA in 1987.  George praised a textbook by Wonnacott and Wonnacott for including questions about this.  George’s review made a big impact on my teaching career, particularly his mantra, repeated throughout the review: Judge a book by its exercises, and you cannot go far wrong.  This sage advice strongly influenced the development of my teaching philosophy*.

* In case you missed it: Ask good questions.

#4 Statistics of illumination, part 2

This is the second post in my “Statistics of illumination” series, in which I present examples to illustrate that statistics can shed light on important questions.  I use these examples on the first day of a statistical literacy course and also in presentations to high school students.  The methods used are quite simple, but the ideas involved are fairly sophisticated.  Click here for the first post in this series.  Questions that I pose to students appear in italics below.

This example is based on a consulting project undertaken by my friend Tom Short, which he and his collaborators described in a JSE article (here).  The research question is whether cancer pamphlets are written at the appropriate level to be understood by cancer patients.  The data collection involved two aspects.  A sample of cancer patients were given a reading test, and a sample of cancer pamphlets were analyzed for their readability.  The data for both patients and pamphlets are on the same grade level scale; for example, 8 means that a patient could read at the 8th grade level, or that a pamphlet’s readability was at the 8th grade level.  The data are summarized in the following two tables:

How might you go about measuring these characteristics: reading level of patients and readability of pamphlets?

Your students have probably taken lots of reading tests, but the idea of measuring readability of text might be new to them.  Measuring readability can involve aspects such as the average length of a sentence, average length of a word, and rate of polysyllabic words.*

* I often joke that my favorite polysyllabic word is “polysyllabic.”  That one word has more syllables than my entire teaching philosophy: Ask good questions.

Explain why it’s not possible to calculate the mean reading level of a patient, based on how the data are presented here.

Most students correctly note that this is not possible because we need numerical values to calculate a mean, so the “below 3” and “above 12” categories prevent that.

Is it possible to calculate the median reading level of a patient?

Many students correctly respond that this is indeed possible, because we simply need ordered data to determine the middle value.  The non-numeric categories are on the two extremes and not in the middle, so they do not impede calculating the median.

Calculate the median reading level of a patient and the median readability level of a pamphlet.

For the patients, students need to determine the (63+1)/2 = 32nd value in order.  This gives some students pause, until they realize that they need to add up the counts for each grade level until they reach the 32nd value.  I point out that they can count from the top or bottom; that’s the whole point that the median is in the middle!  Because there are an even number of pamphlets (30), students need to calculate the average of the 15th and 16th values in order.  The median values turn out to be grade level 9 for both the patients’ reading levels and the pamphlets’ readability levels.
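The counting procedure can be sketched in Python. The category labels and counts below are invented placeholders, not the study's data; they only illustrate how running totals locate the middle value even when the end categories are open-ended.

```python
from itertools import accumulate

def category_at(levels, counts, position):
    """Return the category holding the value at a 1-based ordered position."""
    for level, running_total in zip(levels, accumulate(counts)):
        if position <= running_total:
            return level
    raise ValueError("position exceeds the number of observations")

def median_categories(levels, counts):
    """Return the category (or two categories) containing the middle value(s)."""
    n = sum(counts)
    if n % 2 == 1:
        middle = [(n + 1) // 2]        # odd n: a single middle value
    else:
        middle = [n // 2, n // 2 + 1]  # even n: the two middle values
    return [category_at(levels, counts, p) for p in middle]

# HYPOTHETICAL counts, invented for illustration (not the patient data).
levels = ["below 3", "3", "4", "5", "6", "above 6"]
odd_counts = [2, 3, 4, 5, 3, 2]    # 19 observations
even_counts = [3, 3, 8, 4, 1, 1]   # 20 observations

print(median_categories(levels, odd_counts))
print(median_categories(levels, even_counts))
```

The open-ended categories never need a numerical value, because the running totals only have to pass through them on the way to the middle.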

Now we arrive at the key question:

Does the fact that the two medians are the same indicate that the patients’ reading levels and the pamphlets’ readability levels are well matched up with each other?

Many students have an initial reaction of yes.  I suppose I’m guilty of leading them to think this by asking them to calculate the medians before getting to this question.  For these students, I offer a substantial hint to point in a helpful direction:

What proportion of patients have a reading level below that of the simplest pamphlet?

The answer is 17/63 ≈ .270.  This means that 27% of the patients have a reading level below that of the simplest pamphlet.  This is clearly a problem, indicating that the patients’ reading levels and pamphlets’ readability levels are not well aligned.  Granted, it might not be possible to present cancer pamphlet information below sixth grade readability, but something has to be done to help the 27% of patients who cannot read at the sixth grade level or higher.

As I argued with the first example in this series, I think this small example illustrates statistical thinking.  More specifically, I emphasize four take-home messages for students from this example:

1. Start with a graph.  In this activity I purposefully led students into a bad habit: calculating summary statistics (medians, in this case) without first making a graph to look at the data.  The graph below clearly shows that the patients and pamphlets are not well aligned, because there’s a lot of red (representing patients) and no green (representing pamphlets) at the bottom of the scale.  In other words, the left side of the graph makes clear that many patients do not have a pamphlet that they can read.

2. Think about variability.  The discipline of statistics is all about variability, which is an obvious but easy to overlook fact of life.  Not everyone is the same: different people read at different levels.  Not all pamphlets are the same: some are easier to read than others.  Whereas measures of center, such as mean and median, are often very helpful, they tell us nothing about variability.

3. Think in terms of distribution.   It’s easy for us instructors to under-estimate how hard this is for students.  It’s much easier for students to see 63 reading levels, but understanding the distribution of those 63 values as a single entity, rather than as 63 separate things, is a very challenging cognitive step for many students.  Why is this so important here?  The question of whether pamphlets are well-aligned to patients is about how well the two distributions (of patients’ reading levels, of pamphlets’ readability levels) match up with each other across all reading levels.

4. Address the research question in context.  I sometimes use this example as a very open-ended question on a final exam, if I have not presented it earlier in the course.  Some students try to apply a two-sample t-test to these data.  I suspect this is because they think it’s somehow better to use more sophisticated methods rather than simpler ones.  But even if a two-sample t-test could be performed here (it can’t because the sample mean reading level of the patients cannot be calculated), such a test would not address the motivating question, just like calculating medians above did nothing to address the question.  Always make sure that your analysis of data addresses the reason that the data were collected.

P.S. I mentioned at the outset that Tom Short introduced me to these data, early in his career when he was teaching at Villanova University.  Tom and I became friends in graduate school.  From there we embarked on careers focused on teaching statistics.  He and I frequently attended conferences and workshops together, and we collaborated on several education projects.  I learned a lot about both statistics and teaching from Tom, often through conversations that we had while walking in places as varied as the streets of San Francisco and the beaches of Daytona.  We also enjoyed talking about sports and competing in a fantasy baseball league, in which Tom managed the wonderfully-named Markov Fielders.  Tom died unexpectedly in November of 2018, a great loss to the many friends, colleagues, and students who were enriched by knowing him.  On Wednesday of this week I will be chairing a memorial session for Tom at the Joint Statistical Meetings.  I dedicate this blog post to him.

#2 My favorite question

This blog is about asking good questions to teach introductory statistics, so let me tell you about my all-time favorite question.  I want to emphasize from the outset that I had nothing to do with writing it. I’m just a big fan.

I am referring to question #6, called an investigative task, on the 2009 AP Statistics exam.  I’ll show you the question piece-by-piece, snipped from the College Board website. You can find this question and many other released AP Statistics exams here.

Here’s how the question begins:

Oh dear, I have to admit that this is an inauspicious start.  Frankly, I think this is a boring, generic context for a statistics question.  Even worse, there’s no mention of real data. What’s so great about this? Nothing at all, but please read on …

I think this is a fine question, but I admit that it’s a fairly routine one. Describing the parameter in a study is an important step, and I suspect that students find this much more challenging than many instructors realize.  I would call this an adequate question, perhaps a good question, certainly not a great question. So, I don’t blame you if you’re wondering why this is my all-time favorite question.  Please read on …

Now we’re getting somewhere.  I think this is pretty clever: presenting students with a statistic that they have almost certainly never encountered before, and asking them to figure out something about the unknown statistic based on what they know.  The question is not particularly hard, but it does ask students to apply something they know to a new situation. Students should realize that right-skewed distributions tend to have a larger mean than median, so the ratio mean/median should be greater than 1 with these data.

Part (b) also helps students to prepare for what comes next …

Now we’re talking!  I think part (c) makes this a great question.  To answer this part well, students have to understand the reasoning process of statistical significance, and they have to apply that reasoning process in a situation that they have almost surely never encountered or even thought about: making an inference about the symmetry or skewness of a population distribution.  This is extremely challenging, but I think this assesses something very important: whether students can apply what they have learned to a novel situation that goes a bit beyond what they studied.

Notice that this question does not use words such as hypothesis or test or reject or strength of evidence or p-value.  The key word in the question is plausible.  Students have to realize that the simulation analysis presented allows them to assess the plausibility of the assumption underlying the simulation: that the population follows a normal distribution.  Then they need to recognize that they can assess plausibility by seeing whether the observed value of the sample statistic is unusual in the simulated (null) distribution of that statistic. It turns out that the observed value of the mean/median ratio (1.03) is not very unusual in the simulated (null) distribution, because 14/100 of the simulated samples produced a statistic more extreme than the observed sample value.  Therefore, students should conclude that the simulation analysis reveals that a normally distributed population could plausibly have produced the observed sample.
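The reasoning can be sketched as a small simulation in Python. Only the observed ratio (1.03) and the number of repetitions (100) come from the discussion above; the sample size, population mean, and population standard deviation are assumptions chosen for illustration, so the proportion this code prints should not be expected to match the 14/100 from the exam.

```python
import random
from statistics import mean, median

random.seed(2009)  # arbitrary seed so the run is repeatable

OBSERVED_RATIO = 1.03   # from the question
REPETITIONS = 100       # from the question

# ASSUMPTIONS for illustration only (not given in this post):
SAMPLE_SIZE = 10
POP_MEAN, POP_SD = 27.0, 4.0

def simulated_ratio():
    """Draw one sample from a normal population and return mean/median."""
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    return mean(sample) / median(sample)

# Simulated (null) distribution of the statistic under a normal population.
ratios = [simulated_ratio() for _ in range(REPETITIONS)]

# Proportion of simulated ratios at least as large as the observed one.
share_as_extreme = sum(r >= OBSERVED_RATIO for r in ratios) / REPETITIONS

print(share_as_extreme)
```

A proportion that is not small means the observed ratio is not unusual for a normal population, which is the sense in which normality remains plausible.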

A common student error is not recognizing the crucial role that the observed value (1.03) of the statistic plays.  More specifically, two common student errors are:

  • Commenting that the simulated distribution is roughly symmetric, and concluding that it’s plausible that the population distribution is normal.  Students who make this error are failing to notice the distinction between the simulated distribution of sample statistics and the population distribution of mpg values.
  • Commenting that the simulated distribution of sample statistics is centered around the value 1, which is the expected value of the statistic from a normal population, and concluding that it’s plausible that the population distribution is normal.  Students who make this error are failing to realize that the simulation assumed a normal population in the first place, which is why the distribution of simulated sample statistics is centered around the value 1.

If this question ended here, it would be one of my all-time favorites.  But it doesn’t end here. There’s a fourth part, which catapults this question into the exalted status of my all-time favorite.  Once again (and for the last time!), please read on…

Wow, look at what’s happening here!  Students are being told that they don’t have to restrict their attention to common statistics that they have been taught.  Rather, this question asks students to exercise their intellectual power to create their own statistic! Moreover, they should know enough to predict how their statistic will behave in a certain situation (namely, a right-skewed distribution).  This part of the question not only asks students to synthesize and apply what they have learned, but it also invites students to exercise an intellectual capability that they probably did not even realize they possess. Some common (good) answers from students include the following statistics, both of which should take a value greater than 1 with a right-skewed distribution:

  • (maximum – median) / (median – minimum)
  • (upper quartile – median) / (median – lower quartile)
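Both statistics are easy to try out in Python. The sample below is invented to be right-skewed, and the quartiles use the default convention of `statistics.quantiles`, which is one of several conventions; a different quartile rule would give a somewhat different second ratio.

```python
from statistics import median, quantiles

# HYPOTHETICAL right-skewed sample, invented for illustration.
sample = [20, 21, 21, 22, 23, 24, 25, 27, 30, 38]

med = median(sample)

# (maximum - median) / (median - minimum)
range_ratio = (max(sample) - med) / (med - min(sample))

# (upper quartile - median) / (median - lower quartile)
lower_q, _, upper_q = quantiles(sample, n=4)
quartile_ratio = (upper_q - med) / (med - lower_q)

print(range_ratio, quartile_ratio)
```

For this sample both ratios exceed 1, as expected when the upper half of the data is more spread out than the lower half.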

There you have it: my all-time favorite question from an introductory statistics exam.  I encourage you to ask this question, or some variation of it*, of your students. I suggest asking this in a low-stakes setting and then discussing it with students afterward.  Encourage them to realize that the reasoning processes they learn in class can be applied to new situations that they have not explicitly studied, and also help them to recognize that they are developing the intellectual power to create new analyses of their own.

* Even though this is my all-time favorite question, I suggest three revisions related to part (c).  First, I would provide students with sample values of the mean and median and ask them to calculate the value of the ratio for themselves.  I think this small extra step might help some students to realize the importance of seeing where the observed value of the statistic falls in the simulated distribution.  Second, I recommend altering the sample data a bit to make the observed value of the sample statistic fall quite far out in the tail of the simulated (null) distribution of the statistic.  This would lead to rejecting the plausibility of a normally distributed population in favor of concluding that the population distribution was right-skewed.  I think this conclusion might be a bit easier for students to recognize, while still assessing whether students understand how to draw an appropriate conclusion from the simulation analysis.  Third, I would prefer to use 1000 or 10,000 repetitions for the simulation, which would require using a histogram rather than a dotplot for the display.

P.S. I mentioned at the top that I had nothing to do with writing this question.  Three people who played a large role in writing it and developing a rubric for grading it were Bob Taylor, Chris Franklin, and Josh Tabor.  They all served on the Test Development Committee for AP Statistics at the time. Bob chaired the committee, Chris served as Chief Reader, and Josh was the Question Leader for the grading of this question. Josh also wrote a JSE article (here) that analyzed various choices for the skewness statistic in part (d).