
#44 Confounding, part 2

Many introductory statistics students find the topic of confounding to be one of the most confounding topics in the course.  In the previous post (here), I presented two extended examples that introduce students to this concept and the related principle that association does not imply causation.  Here I will present two more examples that highlight confounding and scope of conclusions.  As always, this post presents many questions for posing to students, which appear in italics.


3. A psychology professor at a liberal arts college recruited undergraduate students to participate in a study (here).  Students indicated whether they had engaged in a single night of total sleep deprivation (i.e., “pulling an all-nighter”) during the term.  The professor then compared the grade point averages (GPAs) of students who had and who had not pulled an all-nighter.  She calculated the following statistics and determined that the difference in the group means is statistically significant (p-value < 0.025):

a) Identify the observational units and variables.  What kinds of variables are these?  Which is explanatory, and which is response?

My students know to expect these questions at the outset of every example, to the point that they sometimes groan.  The observational units are the 120 students.  The explanatory variable is whether or not the student pulled at least one all-nighter in the term, which is categorical.  The response variable is the student’s grade point average (GPA), which is numerical.

b) Is this a randomized experiment or an observational study?  Explain how you can tell.

My students realize that this is an observational study, because the students decided for themselves whether to pull an all-nighter.  They were not assigned, randomly or otherwise, to pull an all-nighter or not.

c) Is it appropriate to draw a cause-and-effect conclusion between pulling an all-nighter and having a lower GPA?  Explain why or why not.

Most students give a two-letter answer followed by a two-word explanation here.  The correct answer is no, and the two-word follow-up is usually observational study or confounding variables.  I respond that this explanation is a good start but would be much stronger if it went on to describe a potential confounding variable, ideally with a description of how that variable provides an alternative explanation for the observed association.  The following question asks for this specifically.

d) Identify a (potential) confounding variable in this study.  Describe how it could provide an alternative explanation for why students who pulled an all-nighter have a smaller mean GPA than students who have not.

Students know this context very well, so they are quick to propose many good explanations.  The most common suggestion is that the student’s study skills constitute a confounding variable.  Perhaps students with poor study skills resort to all-nighters, and their low grades are a consequence of their poor study skills rather than the all-nighters.  Another common response is coursework difficulty, the argument being that more difficult coursework forces students to pull all-nighters and also leads to lower grades.  Despite having many good ideas here, some students struggle to express the confounding variable as a variable.  Another common error is to describe the link between their proposed confounding variable and the explanatory variable, neglecting to describe a link with the response.

e) Is it appropriate to rule out a cause-and-effect relationship between pulling an all-nighter and having a lower GPA?  Explain why or why not.

This may seem like a silly question, but I think it’s worth asking.  Some students go too far and think that not drawing a cause-and-effect conclusion is equivalent to drawing a no-cause-and-effect conclusion.  The answer to this question is: Of course not!  It’s quite possible that pulling an all-nighter is harmful to a student’s academic performance, even though we cannot conclude that from this study.

f) Describe how (in principle) you could design a new study to examine whether pulling an all-nighter has a negative impact on academic performance (as measured by grades).

Many students give the answer I’m looking for: Conduct a randomized experiment.  Then I press for more details: What would a randomized experiment involve?  The students in the study would need to be randomly assigned to pull an all-nighter or not. 

g) How would your proposed study control for potential confounding variables? 

I often need to expand on this question to prompt students to respond: How would a randomized experiment account for the fact that some students have better study skills than others, or are more organized than others, or have more time for studying than others?  Some students realize that this is what random assignment achieves.  The purpose of random assignment is to balance out potential confounding variables between the groups.  In principle, students with very good study skills should be balanced out between the all-nighter and no-all-nighter groups, just as students with poor study skills should be similarly balanced out.  The explanatory variable imposed by the researcher should then constitute the only difference between the groups.  Therefore, if the experiment ends up with a significant difference in mean GPAs between the groups, we can attribute that difference to the explanatory variable: whether or not the student pulled an all-nighter.
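For students comfortable with a little programming, the balancing effect of random assignment can be demonstrated with a quick simulation.  This sketch uses made-up study-skill scores (purely hypothetical, not data from the study) and shows that randomly splitting 120 students into two groups tends to equalize the groups on the confounder:

```python
import random
import statistics

random.seed(1)  # for reproducibility

# Hypothetical study-skill scores for 120 students, on a 0-100 scale.
# These are invented for illustration only.
skills = [random.gauss(70, 10) for _ in range(120)]

# Randomly assign half of the students to each group.
shuffled = skills[:]
random.shuffle(shuffled)
all_nighter, control = shuffled[:60], shuffled[60:]

# Random assignment tends to balance the confounder between groups,
# so the two group means should come out close to each other.
print(round(statistics.mean(all_nighter), 1))
print(round(statistics.mean(control), 1))
```

Running this repeatedly (with different seeds) shows the group means landing close together almost every time, which is exactly the point of random assignment.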

I end this example there, but you could return to this study later in the course.  You could ask students to conduct a significance test to compare the two groups and calculate a confidence interval for the difference in population means.  At that point, I strongly recommend asking about causation once again.  Some students seem to think that inference procedures overcome concerns from earlier in the course about confounding variables.  I think we do our students a valuable service by reminding them* about issues such as confounding even after they have moved on to study statistical inference.

* Even better than reminding them is asking questions that prompt students to remind you about these issues.


4. Researchers interviewed parents of 479 children who were seen at a university pediatric ophthalmology clinic.  They asked parents whether the child slept primarily in room light, darkness, or with a night light before age 2.  They also asked about the child’s eyesight diagnosis (near-sighted, far-sighted, or normal vision) from their most recent examination. 

a) What are the observational units and variables in this study?  Which is explanatory, and which is response?  What kind of variables are they?

You knew this question was coming first, right?  The observational units are the 479 children.  The explanatory variable is the amount of lighting in the child’s room before age 2.  The response variable is the child’s eyesight diagnosis.  Both variables are categorical, but neither is binary.

b) Is this an observational study or a randomized experiment?  Explain how you can tell.

Students also know to expect this question at this point.  This is an observational study.  Researchers did not assign the children to the amount of light in their rooms.  They merely recorded this information.

The article describing this study (here) included a graph similar to this:

c) Does the graph reveal an association between amount of lighting and eyesight diagnosis?  If so, describe the association.

Yes, the percentage of children who are near-sighted increases as the amount of lighting increases.  Among children who slept in darkness, about 10% were near-sighted, compared to about 34% among those who slept with a night light and about 55% among those who slept with room light.  On the other hand, the percentage with normal vision decreases as the amount of light increases, from approximately 65% to 50% to 30%.

Here is the two-way table of counts:

d) Were most children who slept in room light near-sighted?  Did most near-sighted children sleep in room light?  For each of these questions, provide a calculation to support your answer. 

Some students struggle to recognize how these questions differ.  The answer is yes to the first question, because 41/75 ≈ 0.547 of those who slept in room light were near-sighted.  For the second question, the answer is no, because only 41/137 ≈ 0.299 of those who were near-sighted slept in room light.
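For those who like to see the arithmetic spelled out, here is a minimal sketch of the two conditional proportions, using only the counts cited above; note how each question conditions on a different total:

```python
# Counts taken from the two-way table discussed above.
room_light_total = 75         # children who slept in room light
near_sighted_total = 137      # near-sighted children overall
near_sighted_room_light = 41  # children in both categories

# "Were most children who slept in room light near-sighted?"
# conditions on the room-light total:
p_ns_given_light = near_sighted_room_light / room_light_total
print(round(p_ns_given_light, 3))  # 0.547 -> yes, a majority

# "Did most near-sighted children sleep in room light?"
# conditions on the near-sighted total:
p_light_given_ns = near_sighted_room_light / near_sighted_total
print(round(p_light_given_ns, 3))  # 0.299 -> no, well under half
```

The same count (41) appears in both numerators; only the denominators, and therefore the questions, differ.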

e) Is it appropriate to conclude that light in a child’s room causes near-sightedness?  Explain your answer. 

No.  Some students reflexively say observational study for their explanation.  Others simply say confounding variables.  These responses are fine, as far as they go, but the next question prompts students to think harder and explain more fully.

f) Some have proposed that parents’ eyesight might be a confounding variable in this study.  How would that explain the observed association between the bedroom lighting condition and the child’s eyesight? 

Asking about this specific confounding variable frees students to concentrate on how to explain the confounding.  Most students point out that eyesight is hereditary, so near-sighted parents tend to have near-sighted children.  Unfortunately, many students stop there.  But this falls short of explaining the observed association, because it says nothing about the lighting in the child’s room.  Completing the explanation requires adding that near-sighted parents may tend to use more light in the child’s room than other parents, perhaps so they can more easily check on the child during the night.


The next set of questions continues this example by asking about how one could (potentially) draw a cause-and-effect conclusion on this topic.

g) What would conducting a randomized experiment to study this issue entail?

Children would need to be randomly assigned to have a certain amount of light (none, night light, or full room light) in their bedroom before the age of 2.

h) How would a randomized experiment control for parents’ eyesight? 

This question tries to help students focus on the goal of random assignment: to balance out all other characteristics of the children among the three groups.  For example, children with near-sighted parents should be (approximately) distributed equally among the three groups, as should children of far-sighted parents and children of parents with normal vision.  Even better, we also expect random assignment to balance out factors that we might not think of in advance, or might not be able to observe or measure, that might be related to the child’s eyesight.

i) What would be the advantage of conducting a randomized experiment to study this issue?

If data from a randomized experiment show strong evidence of an association between a child’s bedroom light and near-sightedness, then we can legitimately conclude that the light causes an increased likelihood of near-sightedness.  This cause-and-effect conclusion would be warranted because random assignment would (in principle) account for other potential explanations.

j) Would conducting such a randomized experiment be feasible in this situation?  Would it be ethical?

To make this feasible, parents would need to be recruited who would agree to allow random assignment to determine how much light (if any) to use in their child’s bedroom.  It might be hard to recruit parents who would give up this control over their child’s environment.  This experiment would be ethical as long as parents were fully informed and consented to this agreement.


You can return to this example, and the observational data from above, later in the course to give students practice with conducting a chi-square test.  This provides another opportunity to ask them about the scope of conclusions they can draw.

l) Conduct a chi-square test.  Report the test statistic and p-value.  Summarize your conclusion.

The test statistic turns out to be approximately 56.5.  With 4 degrees of freedom, the p-value is extremely close to zero, about 7.6×10^(-12).  The data provide overwhelming evidence against the null hypothesis of no association, in favor of the alternative that there is an association between amount of light in the child’s room before age 2 and eyesight diagnosis later in childhood.
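If students want to verify that such a p-value really is vanishingly small without statistical software, the right-tail chi-square probability has a closed form whenever the degrees of freedom are even, as here (df = (3−1)×(3−1) = 4).  This sketch uses only Python’s standard library:

```python
import math

def chi2_sf_even_df(x, df):
    """Right-tail probability P(X > x) for a chi-square distribution,
    valid only when df is even, via the closed-form series
    P(X > x) = exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!."""
    assert df % 2 == 0 and df > 0
    m = df // 2
    return math.exp(-x / 2) * sum((x / 2) ** j / math.factorial(j) for j in range(m))

# Test statistic and degrees of freedom from the analysis above:
p_value = chi2_sf_even_df(56.5, 4)
print(p_value)  # effectively zero: overwhelming evidence of association
```

For odd degrees of freedom one would need the incomplete gamma function instead, so in practice software (or a chi-square table) is the usual route.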

m) In light of the very large test statistic and extremely small p-value, is it reasonable to conclude that light in a child’s room causes an increased chance of the child becoming near-sighted?

I think it’s very important to ask this again after conducting the hypothesis test.  Some students mistakenly think that hypothesis tests are so advanced that they can override what they learned earlier in the course.  The extremely small p-value in no way compensates for the observational nature of these data and the possibility of confounding variables.  A cause-and-effect conclusion between bedroom light and near-sightedness still cannot be drawn.

n) Why do you think the researchers bothered to collect and analyze these data, considering that no causal conclusion can be drawn?

Some students believe that a cause-and-effect conclusion is the only kind worth drawing. I ask this question to help them realize that establishing evidence of association can be a worthy goal in its own right, apart from the question of causation.

o) Is it reasonable to generalize this study’s finding about an association between room lighting and near-sightedness to the population of all children in the United States?  Explain.

Most students realize that the correct answer is no, but many mistakenly attribute this to the observational nature of the data.  With regard to generalizability, the key point is that the children in this study were not randomly selected from any population.  They were all patients at a university pediatric ophthalmology clinic, so they are not likely to be representative of all U.S. children with regard to issues involving eyesight.  The finding of an association between increased bedroom light and near-sightedness may or may not hold in the larger population of U.S. children in general.

Asking this question can help students who confuse bias and confounding, or who believe that bias and confounding are the same idea.  This can also remind students of the important distinction between random sampling and random assignment, which I discussed in posts #19 and #20 (Lincoln and Mandela, here and here).


Observational studies abound in many fields.  They often produce intriguing results that are discussed in news media.  Accordingly, it’s important for students to understand the topic of confounding and especially how confounding affects the scope of conclusions that can be drawn from observational studies.  The four examples in this two-part series introduce students to these ideas.  They also provide an opportunity to make connections among different parts of the course, spanning topics of data exploration and statistical inference as well as design of studies and scope of conclusions.

P.S. The topic of drawing cause-and-effect conclusions legitimately from observational studies has become widely studied.  I confess that I do not address this topic in my introductory statistics courses, but some argue strongly that I am doing my students a disservice in this regard.  After all, the most important causal conclusion of the twentieth century may have been that smoking causes cancer, which was not determined by randomly assigning humans to smoke or not.

One of the most prominent advocates for causal inference is Judea Pearl, who has co-authored a general-audience book titled The Book of Why: The New Science of Cause and Effect (information and excerpts can be found here).  Statistics educators who argue for including this topic prominently include Milo Schield (here), Danny Kaplan (here), and Jeff Witmer (here).  A recent article in the Journal of Statistics Education by Cummiskey et al. (here) also makes this case.

P.P.S. for teachers of AP Statistics: I’ll be conducting some one-hour sessions via zoom in which I lead students through the first five questions on the 2011 exam, discussing what graders looked for and highlighting common student errors.  I hope this provides some helpful practice and preparation for the upcoming 2020 AP Statistics exam.  Please contact me (allanjrossman@gmail.com) if you would like to invite your students to attend one of these sessions.

#43 Confounding, part 1

The topic of confounding is high on the list of most confounding topics in introductory statistics.  Dictionary.com provides these definitions of confound (here):

  1. to perplex or amaze, especially by a sudden disturbance or surprise; bewilder; confuse: The complicated directions confounded him.
  2. to throw into confusion or disorder: The revolution confounded the people.
  3. to throw into increased confusion or disorder
  4. to treat or regard erroneously as identical; mix or associate by mistake: Truth confounded with error.
  5. to mingle so that the elements cannot be distinguished or separated
  6. to damn (used in mild imprecations): Confound it!

Definition #5 comes closest to how we use the term in statistics.  Unfortunately, definitions #1, #2, and #3 describe what the topic does to many students, some of whom respond in a manner that illustrates definition #6.

In this post I will present two activities that introduce students to this important but difficult concept, along with some follow-up questions for assessing their understanding.  One example will involve two categorical variables, and the other will feature two numerical variables.  As always, questions that I pose to students appear in italics.


I have used a variation of the following example, which I updated for this post, for many years.  I hold off on defining the term confounding until students have anticipated the idea for themselves.  Even students who do not care about sports and know nothing about basketball can follow along.

1. During the 2018-19 National Basketball Association season, the Sacramento Kings won 13 home games and lost 16 when they had a sell-out crowd, compared to 11 home wins and 1 loss when they had a smaller crowd.

a) Identify the observational units, explanatory variable, and response variable in this study.  Also classify each variable as categorical or numerical.

As I argued in post #11 (Repeat after me, here), I think these questions are important to ask at the start of nearly every activity, to orient students to the context and the type of analysis required.  The observational units are games, more specifically home games of the Sacramento Kings in the 2018-19 season.  The explanatory variable is crowd size, and the response variable is game outcome.  As presented here, both variables are categorical (and binary).  Crowd size could be studied as a numerical variable, but the information is presented here simply as whether the crowd was a sell-out or smaller.

b) Organize the data into a table of counts, with the explanatory variable groups in columns.

First we set up the table as follows:

Then I suggest to students that we work with each number as we encounter it in the sentence above, so I first ask where the number 2018 should go in the table.  This usually produces more groans than laughs, and then we proceed to fill in the table as follows:

Some optional questions for sports fans: Does the number 41 make sense in this context?  Basketball fans nod their heads, knowing that an NBA team plays an 82-game season, with half of the games played at home.  Did the Kings win more than half of their home games?  Yes, they won 24 of 41 home games, which is 58.5%.  Does this mean that the Kings were an above-average team in that season?  No.  In fact, after including data from their games away from home, they won only 39 of 82 games (47.6%) overall.

c) Calculate the proportion of wins for each crowd size group.  Do these proportions suggest an association (relationship) between the explanatory and response variables?  Explain.

The Kings won 11/12 (.917, or 91.7%) of games with a smaller crowd.  They won 13/29 (.448, or 44.8%) of games with a sell-out crowd.  This seems like a substantial difference (almost 48 percentage points), which suggests that there is an association between crowd size and game outcome.  The Kings had a much higher winning percentage with a smaller crowd than with a sell-out crowd.
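These conditional proportions are simple enough to confirm with a few lines of code; here is a sketch using the counts given above:

```python
# Home-game counts for the 2018-19 Sacramento Kings, from the text above.
wins = {"sell-out": 13, "smaller": 11}
losses = {"sell-out": 16, "smaller": 1}

# Proportion of wins within each crowd-size group.
win_prop = {}
for crowd in wins:
    win_prop[crowd] = wins[crowd] / (wins[crowd] + losses[crowd])

print(round(win_prop["smaller"], 3))   # 0.917
print(round(win_prop["sell-out"], 3))  # 0.448
# The gap is nearly 48 percentage points, suggesting an association.
print(round(win_prop["smaller"] - win_prop["sell-out"], 3))  # 0.468
```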

d) Produce a well-labeled segmented bar graph to display these proportions.

Here’s a graph generated by Excel:

e) Is it reasonable to conclude that a sell-out crowd caused the team to play worse?  If not, provide an alternative explanation that plausibly explains the observed association.

This is the key question of the entire activity.  I always find that some students have been anticipating this question and are eager to respond: Of course not!  These students explain that the Kings are more likely to have a sell-out crowd when they’re playing against a good team with superstar players, such as the Golden State Warriors with Steph Curry.  I often have to prod students to supply the rest of the explanation: What else is true about the good teams that they play against?  The Kings are naturally less likely to win against such strong teams.

At this point I introduce the term confounding variable as one whose potential effects on a response variable cannot be distinguished from those of the explanatory variable.  I also point out that a confounding variable must be related to both the explanatory and response variable.  Finally, I emphasize that because of the potential for confounding variables, one cannot legitimately draw cause-and-effect conclusions from observational studies.

f) Identify a confounding variable in this study, and explain how this confounding variable is related to both the explanatory and response variable.

This is very similar to question (e), now asking students to express their explanation with this new terminology.  Some students who provide the alternative explanation well nevertheless struggle to specify a confounding variable clearly.  A good description of the proposed confounding variable is: strength of opponent.  It seems reasonable to think that a stronger opponent is more likely to generate a sell-out crowd, and a stronger opponent also makes the game less likely to result in a win for the home team.


I usually stop this in-class activity there, but you could ask students to dig deeper in a homework assignment or quiz.  For example, we can look at more data to explore whether our conjectures about strength of opponent hold true.

It seems reasonable to use the opposing team’s percentage of games won in that season as a measure of its strength.  Let’s continue to work with categorical variables by classifying teams with a winning percentage of 40% or below as weak, between 40% and 60% as moderate, and 60% or above as strong.  This leads to the following tables of counts:

Do these data support the two conjectures about how strength of opponent relates to crowd size and to game outcome?  Support your answer with appropriate calculations and graphs.

The first conjecture was that stronger opponents are more likely to generate a sell-out crowd.  This is supported by the data, as we see that 100% (10/10) of strong opponents produced a sell-out crowd, compared to 61.9% (13/21) of moderate opponents and 60% (6/10) of weak opponents.  These percentages are shown in this segmented bar graph:

The second conjecture was that stronger opponents are less likely to produce a win by the home team.  This is clearly supported by the data.  The home team won 100% (10/10) of games against weak opponents, which falls to 57.1% (12/21) of games against moderate teams, and only 20% (2/10) of games against strong teams.  These percentages are shown in this segmented bar graph:

Here’s a quiz question based on a different candidate for a confounding variable.  It also seems reasonable to think that games played on weekends (let’s count Fridays along with Saturdays and Sundays) are more likely to attract a sell-out crowd than games played on weekdays.  What else would have to be true about the weekend/weekday breakdown in order for day of week to be a confounding variable for the observed association between crowd size and game outcome?  What remains is for students to mention a connection with the response variable: weekend games would need to be less likely than weekday games to produce a win for the home team.

Again we can look at the data on this question.  Consider the following tables of counts:

Do the data support the argument for the weekday vs. weekend variable as a confounding variable?  Cite relevant calculations to support your response.

Only half of the argument is supported by the data.  Weekend games were slightly more likely than weekday games to produce a sell-out crowd (13/17 ≈ 0.765 vs. 16/24 ≈ 0.667).  But weekend games were not less likely than weekday games to produce a home team win (11/17 ≈ 0.647 vs. 13/24 ≈ 0.542).  Therefore, the day-of-week variable does not provide an alternative explanation for why sell-out crowds are less likely than smaller crowds to see a win by the home team.
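A short sketch can make the two-part logic explicit: a confounding variable must be linked to both the explanatory variable and the response, and the day-of-week data satisfy only the first condition.  Using the counts from the tables above:

```python
# Counts from the tables above: 17 weekend games, 24 weekday games.
weekend = {"games": 17, "sellouts": 13, "wins": 11}
weekday = {"games": 24, "sellouts": 16, "wins": 13}

# Condition 1: weekend games more likely to draw a sell-out crowd?
linked_to_crowd = (weekend["sellouts"] / weekend["games"]
                   > weekday["sellouts"] / weekday["games"])

# Condition 2: weekend games less likely to produce a home win?
linked_to_outcome = (weekend["wins"] / weekend["games"]
                     < weekday["wins"] / weekday["games"])

print(linked_to_crowd)    # True: weekends drew sell-outs more often
print(linked_to_outcome)  # False: weekend games were NOT less likely wins
```

Because the second link fails, day of week cannot explain away the observed association between crowd size and game outcome.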

Students could explore much more with these data*.  For example, they could analyze opponent’s strength as a numerical variable rather than collapsing it into three categories as I did above. 

* I provide a link to the datafile at the end of this post.


The second example is based on an activity that I have used for more than 25 years.  My first contribution to the Journal of Statistics Education, from 1994 (here), presented an example for distinguishing association from causation based on the relationship between a country’s life expectancy and its number of people per television.  In updating the example for this post, I chose a different variable and used data as of 2017 and 2018 from the World Bank (here and here)*.

* Again, a link to the datafile appears at the end of this post.

2. The following table lists the life expectancy (in years) and the number of automated teller machines (ATMs) per 100,000 adults in 24 countries around the world:

a) Identify the observational units and variables.  What types of variables are these?  Which is explanatory and which is response? 

Yes, I start with these fundamental questions yet again.  The observational units are countries, the explanatory variable is number of ATMs per 100,000 adults, and the response is life expectancy.  Both variables are numerical.

b) Which of the countries listed has the fewest ATMs per 100,000 adults?  Which has the most?

This question is unnecessary, I suppose, but I think it helps students to engage with the data and context.  Haiti has the fewest ATMs: about 2 per 100,000 adults.  The United States has the most: about 174 ATMs per 100,000 adults.

c) Produce a scatterplot of the data, with the response variable on the vertical axis.

Here’s the scatterplot:

d) Does the scatterplot indicate an association between life expectancy and number of ATMs?  Describe its direction, strength, and form.

Yes, the scatterplot reveals a positive association between a country’s life expectancy and its number of ATMs per 100,000 adults.  This association is moderately strong but not linear.  The form follows a curved pattern.

e) Do you believe that installing more ATMs in countries such as Haiti, Bangladesh, Algeria, and Kenya would cause their inhabitants to live longer?  If not, provide a more plausible, alternative (to cause-and-effect) explanation for the observed association.

This is the key question in the activity, just as with the question in the previous activity about whether sell-out crowds cause the home team to play worse.  Students realize that the answer here is a resounding no.  It’s ridiculous to think that installing more ATMs would cause Haitians to live longer.  Students can tell you the principle that association is not causation.

Students can also suggest a more plausible explanation for the observed association.  They talk about how life expectancy and number of ATMs are both related to the overall wealth, or technological sophistication, of a country.

f) Identify a (potential) confounding variable, and explain how it might relate to the explanatory and response variables.

This is very similar to the previous question.  Here I want students to use the term confounding variable and to express their suggestion as a variable.  Reasonable answers include measures of a country’s wealth or technological sophistication.

This completes the main goal for this activity.  At the risk of detracting from this goal, I often ask an additional question:

g) Would knowing a country’s number of ATMs per 100,000 adults be helpful information for predicting the life expectancy of the country?  Explain.

The point of this question is much harder for students to grasp than that of the preceding questions.  I often follow up with this hint: Would you make different life expectancy predictions depending on whether a country has 10 vs. 100 ATMs per 100,000 adults?  Students confidently answer yes to this one, so they gradually come to realize that they should also answer yes to the larger question: Knowing a country’s number of ATMs per 100,000 adults is helpful for predicting life expectancy.  I try to convince them that the association is real despite the lack of a cause-and-effect connection.  Therefore, predictions can be enhanced from additional data even without a causal* relationship.

* I greatly regret that the word causal looks so much like the word casual.  To avoid this potential confusion, I say cause-and-effect much more than causal.  But I had just used cause-and-effect in the previous sentence, so that caused me to switch to causal in the last sentence of the paragraph.


This example also leads to extensions that work well on assignments.  For example, I ask students to:

  • take a log transformation of the number of ATMs per 100,000 adults,
  • describe the resulting scatterplot of life expectancy vs. this transformed variable,
  • fit a least squares line to the (transformed) data,
  • interpret the value of r^2,
  • interpret the slope coefficient, and
  • use the line to predict the life expectancy of a country that was not included in the original list.

Here is a scatterplot of life expectancy vs. log (base 10) of number of ATMs per 100,000 adults, with the least squares line:

The relationship between life expectancy and this transformed variable is positive, moderately strong, and fairly linear.  With this log transformation, knowing a country’s number of ATMs per 100,000 adults explains 46.7% of the variability in countries’ life expectancy values.  The slope coefficient of 9.356 means that the model predicts an increase of 9.356 years in life expectancy for a tenfold increase in number of ATMs per 100,000 adults.  Using this line to predict the life expectancy of Costa Rica, which has 74.41 ATMs per 100,000 adults, produces: predicted life expectancy = 60.51 + 9.356×log(74.41) ≈ 60.51 + 9.356×1.87 ≈ 78.02 years.  The actual life expectancy reported for Costa Rica in 2018 is 80.10, so the prediction underestimated by only 2.08 years.
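Readers who want to check the prediction can reproduce it directly from the reported intercept and slope.  A minimal sketch (the function name is my own, not from the original assignment):

```python
import math

# Least squares fit reported above: life expectancy vs. log10 of
# ATMs per 100,000 adults, with intercept 60.51 and slope 9.356.
intercept, slope = 60.51, 9.356

def predict_life_expectancy(atms_per_100k):
    """Predicted life expectancy (years) from the fitted line."""
    return intercept + slope * math.log10(atms_per_100k)

# Costa Rica: 74.41 ATMs per 100,000 adults; actual value 80.10 years.
pred = predict_life_expectancy(74.41)
print(round(pred, 2))          # 78.02
print(round(80.10 - pred, 2))  # underestimate of 2.08 years
```

Note that because the slope multiplies log base 10, adding 1 to the log corresponds to a tenfold increase in ATMs, which is why the slope is interpreted per tenfold increase.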


Two earlier posts that focused on multivariable thinking also concerned confounding variables.  In post #3 (here), the graduate program was a confounding variable between an applicant’s gender and the admission decision.  Similarly, in post #35 (here), age was a confounding variable between a person’s smoking status and their lung capacity.

In next week’s second part of this two-part series, I will address more fully the issue of drawing causal conclusions.  Along the way I will present two more examples that involve confounding variables, with connections to data exploration and statistical inference.  I hope these questions can lead students to be less confounded by this occasionally vexing* and perplexing topic.

* I doubt that the term vexing variable will catch on, but it does have a nice ring to it!


P.S. The two datafiles used in this post can be downloaded from the links below: