# Posts from the ‘Uncategorized’ Category

## #87 It’s about time, part 1

Today’s post is about a fun topic that I teach only occasionally.  I’ll be introducing my students to this topic a few minutes after this post appears on Monday morning.

Let me dispense with any more of a preamble and jump right in with the first of 3.5 examples.  As always, questions that I pose to students appear in italics.

This post will feature many graphs that I consider to be interesting and informative, but this is not one of them:

This histogram displays the distribution of number of vehicles (in thousands) crossing the Peace Bridge, a scenic bridge connecting the United States and Canada near Niagara Falls, for each month from January 2003 through December 2019.

Describe what this histogram reveals about this distribution. This distribution is skewed to the right, with a center near 500,000 vehicles per month.  Some months had as few as 300,000 vehicles making the crossing; at the other extreme, one month had about 850,000 vehicles making the crossing.

But none of that is very interesting.  Remember that I said this is monthly data over many years, so it would be much more informative to look for patterns in month-to-month and year-to-year variation of the crossing numbers over time:

Describe and explain the recurring pattern that this graph reveals.  The most obvious feature of this graph is the consistent pattern of increasing and then decreasing numbers of bridge crossings, over and over.  Looking a bit more closely reveals that each of these cycles occurs over a one-year period.  The increase occurs every spring, culminating in a peak in the summer.  The decrease occurs every fall, reaching a nadir in the winter.  Examining the actual data (available at the end of this post) indicates that the maximum occurs in August for most years, the minimum in February.  This pattern makes sense, of course, because people tend to travel more in summer months than winter months.

After taking the recurring pattern into account, has the number of bridge crossings been increasing or decreasing over time? The number of bridge crossings has been decreasing slightly but steadily over these years.  For example, the peak number of crossings exceeded 800,000 in the summer of 2003 but fell short of 600,000 in the summer of 2019.  This is more than a 25% decrease over this 16-year period.  The number of crossings seems to have levelled off in the five most recent years.

In which year does the decrease appear to be most pronounced?  Can you offer an explanation based on what was happening in the world then?  The biggest drop occurred between 2008 and 2009, during the global financial crisis that followed the bursting of the U.S. housing bubble.

When introducing a new topic, I typically start a class session with an example like this before I define terms for my students.  At this point I tell them that data such as these are called time series data, which is a fairly self-explanatory term.  The two most important aspects to look for in a graph of time series data are trend and seasonality.  These data on monthly numbers of vehicles crossing the Peace Bridge provide a good example of both features.

I have mentioned before that I am currently teaching the second course in a two-course introductory sequence for first-year business students.  This course includes a brief introduction to time series data.  I enjoy teaching this topic, because it gives rise to interesting examples like this, and I think students see the relevance.

A downside of this topic for me is that it requires more preparation time, partly because I only teach time series every few years and so have to re-learn things each time.  Another reason is that I feel a stronger obligation to find and present current data when teaching this topic.

Speaking of keeping time series examples up-to-date: What do you expect to see when the Peace Bridge crossing data are updated to include the year 2020?  I don’t think any of my students will be surprised to see this:

A slightly harder question is: What do you expect the histogram to look like, when data for the year 2020 are included?  Here is the updated histogram, with a cluster of values on the low end:

A moral here is that even long-established and consistent trends may not continue forever.  Extraordinary events can and do occur.  We and our students have lived through one such event for the past year.

Another of my favorite examples is this graph of a country’s life expectancy for the years 1960 – 2018:

Describe what the graph reveals.  There are three distinct patterns here.  Life expectancy increased steadily, from about 48 to 64 years, between 1960 and 1990.  Then life expectancy decreased dramatically until 2005, falling back to about 53 years.  The years since 2005 have seen another increase, steeper than the gradual increase from the 1960s through 1980s, although the rate of increase has levelled off a bit since 2015.  Life expectancy in 2018 slightly surpassed the previous high from 1990.

Make a guess for which country this is.  It usually takes a few guesses before a student thinks of the African continent, and then a few more guesses until they arrive at the correct country: South Africa.

What might explain the dramatic decrease in life expectancy for this country between 1990 and 2005?  Why do you think the trend has reversed again since then?  Some students guess that apartheid is the cause, but then someone suggests the more pertinent explanation: Sub-Saharan Africa experienced an enormous and catastrophic outbreak of HIV/AIDS in the 1990s.  Things have improved considerably, in large part because of effective and inexpensive treatments.

When I present this example to my students this week, I plan to point out three morals that are quite relevant to our current situation:

1. Trends don’t always continue indefinitely.
2. Bad things happen.  (This includes devastating viruses.)
3. Good things happen.  (Medical innovations can help a lot.)

Especially because I am teaching business students, I like to include some time series examples of stock prices, which are easy to download from many sites including the Yahoo finance site (here).  Let’s make this another guessing game: The daily closing prices of what company’s stock are represented in the following graph?  I’ll give you a hint: I know that all of my students use this company’s product.  I’m also willing to bet that all of your students have heard of this company, even if they have not used its product.

Would you have liked to have owned stock in this company in 2020?  Duh!  By what percentage did the closing price change from the last day of 2019 (closing price: \$68.04) to the last day of 2020 (closing price: \$337.32)?  I really like my students to become comfortable working with percentage changes*.  This example provides another good opportunity.  The percentage increase in this company’s stock price during 2020 works out to be a (337.32 – 68.04) / 68.04 × 100% ≈ 395.77% increase!  Make a guess for what company this is.  I bet you guessed correctly: Zoom**.

* See posts #28 (A pervasive pet peeve, here) and #83 (Better, not necessarily good, here).

** When I published my first blog post on July 8, 2019, I meant to include a postscript advising all of my readers to invest in Zoom.  I just re-read that post (here) and am dismayed to realize that I forgot to include that stock tip.  Oh well, I also forgot to invest in Zoom myself.
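For readers who want to check the percentage-change arithmetic for the stock price above, here is a minimal Python sketch.  The function name `percent_change` is my own choice for illustration; the two closing prices are the ones quoted in the question.

```python
def percent_change(old: float, new: float) -> float:
    """Return the change from old to new, expressed as a percentage of old."""
    return (new - old) / old * 100


# Closing prices quoted above for the last trading days of 2019 and 2020.
close_2019 = 68.04
close_2020 = 337.32

increase = percent_change(close_2019, close_2020)
print(f"Percentage increase during 2020: {increase:.2f}%")
```

The key point for students is that the denominator is the starting value, not the ending value.  A result of roughly 396% means the price nearly quintupled; it did not nearly quadruple.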

Data on consumer prices obtained by the Bureau of Labor Statistics (BLS) can make for interesting time series data and are also easy to download (for example, from here).  Below is a graph from the BLS website, displaying the national average price for a gallon of unleaded gasoline, by month, starting with January of my first year in college* and ending with January of my students’ first year in college:

* Another downside to teaching time series is that it draws attention to how much time has gone by in your life!

The national average price of a gallon of unleaded gasoline increased from \$1.298 in January of 1981 to \$2.326 forty years later.  Calculate the percentage increase*.  This works out to be a (2.326 – 1.298) / 1.298 × 100% ≈ 79.2% increase.  Does gasoline really cost this much more now than in my first year of college?

For now, the answer I’d like for that last question is: Hold on, not so fast.  This leads to another of my favorite topics to teach, but I am going to stop now and pick up here next week in part 2 of this post.

P.S. Data on crossings of the Peace Bridge can be found here.  Data on life expectancy in South Africa were obtained here. Links to all of the datafiles in this post appear below.

P.P.S. Many thanks to Robin Lock for giving me a personalized crash course on the fundamentals of time series analysis when I first started to teach this topic. Robin also introduced me to the Peace Bridge data, a more thorough analysis of which can be found in the chapter on time series that he wrote for the Stat2 textbook (here).

## #86 Cars, dogs, tweets

Once again I have not found time to write a full essay for this week’s blog post*.  I’m behind on preparing for my classes for Monday, and I have several other items on my “to do” list, and I’d rather not think about my “should have been done by now” list.  I’ll also be giving an exam at the end of this week, and I’ve learned that I need to devote several days to prepare for giving an exam online.

* I almost titled this No blog post today, as I did with post #79 (here).

But I really like to have something to read on Monday mornings for everyone who has been so kind as to sign up to have this delivered to your inbox.  So, please allow me to ramble on for a bit* about two datasets that I have gathered in the past couple of weeks, related to topics of correlation, regression, and prediction.  The first one is very straightforward but has some appealing aspects.  The second one might introduce you to a fun website, especially if you’re a dog person.

* Please remember that I do not have the time to strive for a coherent, well-argued essay this week.

In last week’s post (here), I described an assignment that I recently gave to my students, asking them to perform chi-square, ANOVA, and correlation analyses.  My students are currently working on a follow-up assignment in which they apply one-predictor regression analysis.  I grew tired of using the same dataset for years, so I collected some new data for them to analyze.  I went to cars.com and recorded the price, age (in years), and miles driven for a sample of pre-owned Toyota Prius cars*.

* Notice how deftly I avoided using a plural word for Prius in that sentence.  I read that Toyota sponsored a survey for people to vote on the appropriate plural term (here).  Apparently, “Prii” was the most popular of five options presented, but with only 25% of the vote.

Here are graphs and output from JMP for predicting price from miles and from age:

My students needed to produce this output and then answer a series of fairly straightforward questions.  These included identifying the better predictor of price and then:

• identifying and interpreting the value of r²;
• identifying and interpreting the residual standard error;
• conducting and drawing a conclusion from a t-test about the slope coefficient;
• determining and interpreting a confidence interval for the population slope coefficient;
• predicting the price of a pre-owned Prius with 100,000 miles;
• producing and commenting on a confidence interval for the mean price in the population of all pre-owned Prius cars with 100,000 miles, and a prediction interval for the price of an individual pre-owned Prius with 100,000 miles;
• describing how the midpoints and widths of these two intervals compare.

Then I asked whether residual plots reveal any patterns that suggest a non-linear relationship (miles is the predictor on the left, age on the right):

I followed by directing students to apply a log transformation to price and re-conduct their analysis:

Notice that age is more strongly correlated with log(price) than miles, even though miles was more strongly correlated with price than age.  I asked students to predict the price of a pre-owned five-year-old Prius, both with a point estimate and a prediction interval, which required them to back-transform in their final step.
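The back-transformation step in that last question can be sketched in a few lines of Python.  Everything numerical here is hypothetical: the ages and prices are synthetic values invented for illustration (they are not the cars.com sample), and I am assuming a natural log, so the inverse is `exp`.

```python
import math

# Synthetic (age in years, price in dollars) data, invented for illustration only.
ages = [1, 2, 4, 6, 8, 10]
prices = [30000 * math.exp(-0.1 * age) for age in ages]

# Fit a least-squares line for log(price) as a function of age.
log_prices = [math.log(price) for price in prices]
n = len(ages)
mean_age = sum(ages) / n
mean_log_price = sum(log_prices) / n
sxy = sum((x - mean_age) * (y - mean_log_price) for x, y in zip(ages, log_prices))
sxx = sum((x - mean_age) ** 2 for x in ages)
slope = sxy / sxx
intercept = mean_log_price - slope * mean_age

# Predict on the log scale, then back-transform to dollars.
predicted_log_price = intercept + slope * 5
predicted_price = math.exp(predicted_log_price)
print(f"Predicted price of a five-year-old car: ${predicted_price:,.2f}")
```

The endpoints of a prediction interval are back-transformed the same way: compute the interval for log(price), then exponentiate each endpoint.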

I plan to return to this dataset as we study multiple regression.  It’s not surprising that miles driven is strongly correlated with age, as shown in the graph on the left below.  Considering that, it is a bit surprising that both predictors (age and miles) are useful in a multiple regression model for predicting log(price), even after controlling for the other, as shown in the output on the right:

As I began preparing to write my exam for this coming Friday, I started looking for data and contexts that I have not used before.  I went to the website for one of my favorite twitter accounts, Thoughts of Dog* (here).  I recorded the number of likes and number of retweets for a sample of 20 tweets from this account.

* Even though I am most assuredly a cat person (see a post filled with statistics questions about cats here), I have nothing against dogs and their people.  This particular dog tweeter even cites statistics on occasion.  For example, the dog recently reported (here) that cuddles with their human have increased by 147% during the pandemic.

Here is a scatterplot of the data, with both variables measured in thousands:

I plan to ask my students fairly straightforward questions about these data, using both open-ended and auto-graded formats.  For example, I want to assess whether students can write their own interpretations for quantities such as r² and a slope coefficient, and also whether they can pick out correct versus incorrect interpretations from a list of multiple choice options.  I also want to ask a question or two about residuals, which is a fundamental concept that I often neglect to ask about.  I might write multiple versions of questions from this dataset simply by switching which variable to treat as explanatory and which as response.

I have to admit that I probably re-use datasets in my classes more than I should.  Sometimes I feel a bit guilty for using examples that still seem fairly recent to me but are more than half a lifetime ago for most of my students.  The two datasets presented here have the benefit of being from February of 2021.  There’s nothing especially distinctive about them, but I think they can be useful for developing and assessing students’ understanding of correlation, regression, and prediction.  They have also provided me with a (brief) blog post when I thought I might have to do without for this week.

## #85 Three assignments in one

I give a lot of quizzes in my classes.  I have been giving even more than usual this year while teaching remotely, and I’ve revised them to an auto-graded format.  I’ve written about such quizzes in many posts*.

* My most recent post of this type was #83, titled Better, not necessarily good, here.

I also give longer assignments that ask students to analyze data and submit a report that answers a series of questions.  This post discusses my most recent assignment of this type.

At the risk of prompting you to stop reading now, I confess that the questions in this assignment are quite straightforward.  I think this assignment is worthwhile for my students, because it asks them to use JMP software to analyze raw data for themselves, as compared to quizzes and exams on which I provide students with output and summaries.

Although there’s nothing particularly original or clever about this assignment, I do like two aspects.  One is that it covers several topics with one dataset.  Students apply a chi-square test, analysis of variance, and a test about correlation on different pairs of variables.  They also produce appropriate graphs and summary statistics prior to conducting those tests.  Students need to select the correct procedure to address a particular question, although JMP software provides a big assist with that.

I also like that the results of two of the three tests do not come close to achieving statistical significance.  I sometimes worry that I present too many examples with very small p-values, so this assignment can remind students that not all studies discover significant differences among groups.

As always, questions that I pose to students appear in italics.

Here’s background information about the study that I provided to students:

An article* reported on a study in which 160 volunteers were randomly assigned to one of four popular diet plans: Atkins, Ornish, Weight Watchers, and Zone (40 subjects per diet).  These subjects were recruited through newspaper and television advertisements in the greater Boston area; all were overweight or obese with body mass index values between 27 and 42. Among the variables recorded were:

• which diet the subject was assigned to
• whether or not the subject completed the twelve-month study
• the subject’s initial weight (in kilograms)
• the degree to which the subject adhered to the assigned diet, taken as the average of 12 monthly ratings, each on a 1-10 scale (with 1 indicating complete non-adherence and 10 indicating full adherence)
• the subject’s weight after 12 months (in kilograms)
• the subject’s weight loss after twelve months (in kilograms, with a negative value indicating weight gain)

* You can find the JAMA article about this study here.  A link to the dataset appears at the end of this post.

• a) For each of the six variables (in the bullet points above), indicate whether the variable is categorical (also binary?) or numerical.

My students are used to this question, as are regular readers of this blog*.  I’m trying to set students up in good position to decide which technique to apply for each of the three sets of questions to come.  Frankly, I doubt that many students think that through.  When they ask questions in office hours about how to proceed with a given question, they often seem to be surprised when I point out that earlier questions in an assignment often prepare them to answer later questions.

* See post #11, titled Repeat after me, here, in which I argue for asking these questions at the beginning of (almost) every example in the course.

The first two variables listed here are categorical, with the second one binary and the first one not.  The other four variables are numerical.

I probably should have asked two additional questions, with which my students are also very familiar, at this point: Was this an observational study or an experiment?  Did this study make use of random sampling, random assignment, both, or neither?  I decided not to ask these questions in this assignment only because it grew to be quite long.

First we will investigate whether the sample data provide strong evidence that different diets produce different amounts of weight loss, on average.

• b) Use JMP to produce dotplots and boxplots of the distributions of weight loss for the four diet groups.
• c) Use JMP to calculate means and standard deviations of weight loss for the four diet groups.
• d) Do the technical conditions for the ANOVA F-test appear to be satisfied?  Explain.
• e) Use JMP to produce the ANOVA table.
• f) Report the null hypothesis being tested, using appropriate symbols.  Report the value of the F-test statistic and p-value.  Would you reject the null hypothesis at the α = 0.10 significance level?  Summarize your conclusion from the ANOVA F-test.

This is the first course in which I have used JMP, which I am learning for the first time myself.  I provided my students with a data file and fairly detailed instructions about how to use JMP to generate the requested output.  Here are the graphs and summary statistics:

For the technical conditions of the ANOVA F-test, I want students to check three things: 1) This experiment made use of random assignment. 2) The dotplots and boxplots do not suggest strong skewness or outliers, so assuming that the weight loss amounts follow normal distributions is reasonable. 3) The ratio of the largest group SD to the smallest group SD is 9.29/5.39 ≈ 1.72, which is less than 2, so it’s reasonable to assume that the standard deviations of weight loss among the groups are the same.
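The third check above is easy to automate.  Here is a minimal Python sketch; the function name `sd_ratio` is hypothetical, and only the largest and smallest group SDs are included because those are the only two values quoted in the text.

```python
def sd_ratio(group_sds):
    """Return the ratio of the largest group SD to the smallest group SD."""
    return max(group_sds) / min(group_sds)


# Largest and smallest group standard deviations of weight loss (in kilograms).
largest_sd = 9.29
smallest_sd = 5.39

ratio = sd_ratio([largest_sd, smallest_sd])
print(f"Ratio of largest to smallest SD: {ratio:.2f}")
print("Equal-SD condition reasonable (ratio < 2)?", ratio < 2)
```

Passing all four group SDs to `sd_ratio` would give the same answer, because only the maximum and minimum matter.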

JMP produces the following ANOVA table:

The null hypothesis to be tested is that all four diets have the same population mean weight loss: μA = μO = μW = μZ.  The value of the test statistic is F = 0.5361, and the p-value is 0.6587.  This p-value is not small in the least, so the sample data are not at all inconsistent with the null hypothesis that the four diets have the same population mean weight loss.  We would not reject the null hypothesis at the α = 0.10 level, or at any other reasonable significance level.  The sample data from this experiment provide no evidence that the four diets differ with regard to population mean weight loss.

Next we will investigate whether subjects were more or less likely to complete the study depending on which diet they had been assigned.

• g) Identify the name of the appropriate test to investigate this question.
• h) Use JMP to produce an appropriate graph and table to investigate this question.
• i) Which diet group(s) had the largest percentage of subjects who completed the study?  What was the value for that percentage?
• j) Report the null hypothesis being tested.  Also report the value of the test statistic and p-value.  Would you reject the null hypothesis at the α = 0.10 significance level?  Summarize your conclusion from this test.

I received more questions in office hours about part (g) than about any other part.  I always responded by asking about the types of variables involved.  When students told me that both variables are categorical, and only one variable is binary, I asked what test is appropriate for such data.  For the students who answered that they did not know, I directed them to the appropriate section of their notes.  The answer I’m looking for is a chi-square test for comparing proportions between multiple groups*.

* I really dislike the phrase “homogeneity of proportions.”  I don’t see the value of asking students to use a six-syllable word that they might not even understand the meaning of.  I like “chi-square test of homogeneity” even less, because that leaves open the question: homogeneity of what?

Here are a graph and table of counts:

Once again I think it’s a bit unfortunate that JMP automatically selects an appropriate graph after the user indicates the two variables of interest.  The answer to (i) is that the Weight Watchers and Zone diets both had the largest completion percentages: 26/40 = 0.65, so 65% of those assigned to one of these diets completed the study.

JMP produces the following output for the chi-square test:

The usual chi-square statistic is the Pearson value 3.158, with a p-value of 0.3678.  Once again the sample data do not provide evidence to conclude that the four diets differ, this time with regard to likelihood of completion.
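Here is a rough Python sketch of how the Pearson statistic can be computed from a table of counts.  One loud caveat: the text above gives only the 26 completers for each of Weight Watchers and Zone, plus the group size of 40.  The counts of 21 and 20 for the other two diets are my assumption, chosen because they reproduce the reported statistic, and which of those two diets gets which count does not affect the result.  The p-value uses a closed-form expression that holds only for 3 degrees of freedom.

```python
import math

group_size = 40
# Completers per diet. The counts for Atkins and Ornish are ASSUMED for illustration.
completed = {"Atkins": 21, "Ornish": 20, "Weight Watchers": 26, "Zone": 26}

total_completed = sum(completed.values())
# With equal group sizes, every group has the same expected counts under the null.
expected_completed = total_completed / len(completed)
expected_dropped = group_size - expected_completed

chi_square = 0.0
for count in completed.values():
    dropped = group_size - count
    chi_square += (count - expected_completed) ** 2 / expected_completed
    chi_square += (dropped - expected_dropped) ** 2 / expected_dropped


def chi2_upper_tail_df3(x):
    """Upper-tail probability of a chi-square distribution with 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# df = (4 diets - 1) * (2 outcomes - 1) = 3
p_value = chi2_upper_tail_df3(chi_square)
print(f"Pearson chi-square = {chi_square:.3f}, p-value = {p_value:.4f}")
```

Under these assumed counts the sketch agrees with the Pearson value and p-value quoted above, which is some reassurance that the assumed table is close to the real one.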

Finally, we will investigate whether the data reveal a significant positive association between degree of adherence to the diet and weight loss.

• k) Use JMP to produce an appropriate graph to investigate this question.
• l) Use JMP to calculate the value of the correlation coefficient between these variables.
• m) Calculate the value of the appropriate t-test statistic by hand.  Also report the p-value from the JMP output.  Summarize your conclusion.

Here’s a scatterplot of weight loss vs. adherence score, with a correlation coefficient of r = 0.533:

Calculating the value of this t-statistic is the only test statistic calculation that students complete by hand in this assignment.  I could have asked them to produce regression output with this test statistic, but we had not yet studied regression when I gave this assignment.  The calculation is:

This test statistic reveals that the sample correlation coefficient of 0.533 is about six standard errors away from zero, so the p-value is extremely small, very close to zero.  (The very small p-value can also be seen in the output above.)  The sample data provide very strong evidence that weight loss is positively associated with adherence level in the population of all overweight people looking to lose weight with a popular diet.

Several students asked in office hours about the sample size to use in this calculation.  They noted that the overall sample size for the chi-square test was 160, but they realized that using 160 for n in the correlation test statistic calculation did not seem right.  I simply asked how many people have values of weight loss and adherence level that went into calculating the correlation coefficient.  Students quickly realized that this calculation restricts attention to the 93 subjects who completed the study.
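To see how much the choice of sample size matters, here is a small Python sketch.  It uses the standard t-statistic for testing whether a population correlation is zero, which I am assuming is the by-hand calculation intended here because it gives a value close to six; the function name is hypothetical.

```python
import math


def correlation_t_statistic(r, n):
    """Return t = r * sqrt(n - 2) / sqrt(1 - r^2) for a sample correlation r."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r = 0.533

# Correct: only the 93 subjects who completed the study have a weight loss value.
t_completers = correlation_t_statistic(r, 93)

# Incorrect: using all 160 subjects who were randomly assigned to a diet.
t_all_subjects = correlation_t_statistic(r, 160)

print(f"t with n = 93:  {t_completers:.2f}")
print(f"t with n = 160: {t_all_subjects:.2f}")
```

With n = 93 the statistic is about 6.01, and with n = 160 it would be about 7.92.  The conclusion is the same either way in this case, but the larger value overstates the evidence.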

In case you might be wondering about how this assignment is graded, I will now show the grading guidelines that I provided to my grader*.  I encouraged my students to work in groups of 2-3 students on this assignment, but many students opted to work alone.  This assignment generated 75 submissions among my 131 students, so there’s a lot of grading to be done.  I tried to make the guidelines clear and specific, but I also tried to avoid making them so detailed that they would take a lot of time to apply.

* My grader Melissa is a third-year business major who is minoring in statistics.  She has been extremely helpful to me, including catching an error in my solution to this assignment as she started her grading.

Here are my grading guidelines, for a total of 20 points:

• a) 1.5 pts.  Take -.5 if 1-2 of the 6 answers are incorrect.  Take -1 if 3-5 of the 6 answers are incorrect.
• b) 1 pt.  Give .5 pt for the dotplots, .5 pt for the boxplots.  It’s ok if the graphs are horizontal rather than vertical.  Take -1 if the graphs are not separated by diet.
• c) 1 pt.  Take -.5 if any values are missing or incorrect.  Do not bother to check their values closely.
• d) 2 pts.  Give .5 pt for overall answer of yes. Give .5 pt for mentioning random assignment (if they say random sampling but not random assignment, take -.5). Give .5 pt for mentioning normality; it’s fine if they say that the data do not look close enough to normal. Give .5 pt for comparing ratio of largest/smallest SD to 2; if they just say “SD condition ok” without checking this ratio, take -.5.
• e) 1 pt.
• f) 2.5 pts.  Give .5 pt for null (either symbols or words are ok, don’t need both), .5 pt for F and p, .5 pt for “do not reject null,” 1 pt for conclusion in context.
• g) 1 pt.  It’s ok to say just “chi-square test” or “chi-square test of independence” or “chi-square test of equal proportions.”  Take -.5 pt for saying “chi-square test of goodness-of-fit.”  Take -1 for not mentioning “chi-square” at all.
• h) 2 pts.  Give 1 pt for graph, 1 pt for table.  Take -.5 pt if variables are switched in graph.
• i) 1 pt.  Give .5 pt for identifying WW and Zone as the two diets.  Give .5 pt for correct proportion or percentage.
• j) 2.5 pts.  Give .5 pt for null (either symbols or words are ok, don’t need both), .5 pt for X² and p (it’s ok if these are off a bit due to rounding), .5 pt for “do not reject null,” 1 pt for conclusion in context.
• k) 1 pt.
• l) 1 pt.  It’s ok to have some rounding discrepancy.
• m) 2.5 pts.  Give 1 pt for test stat (take -.5 for right idea but a mistake somewhere, such as using the wrong sample size), .5 pt for p-value (ok to say approx. zero), 1 pt for conclusion in context

## #84 Giving oral exams

This guest post has been contributed by Paul Roback and Kelly McConville.  Paul and Kelly both teach statistics at top-notch liberal arts colleges – St. Olaf College for Paul and Reed College for Kelly.  In fact, Kelly was a student of Paul’s at St. Olaf.  Paul and Kelly are both exceptional teachers who are making substantial contributions to statistics and data science education.  I am very pleased that they agreed to write a guest blog post about their experiences with giving oral exams to their students while teaching online in the fall term.  You can contact them at roback@stolaf.edu and mcconville@reed.edu.

What was your motivation for giving oral exams/quizzes?

Paul: For years I’ve had the conversation with other statistics teachers that “you can often tell within a few minutes of talking with a student how well they understand the material.”  In these conversations, we’ve often fantasized about administering oral exams to get a more accurate read on students in a shorter amount of time.  But when assessment time came, I always retreated to the tried-and-true written exam, usually in-person but sometimes take-home.  This fall, since I was teaching fully online due to the pandemic and things were already pretty different, I decided to take the plunge to oral exams, both to see how effective they could be, and to build in an opportunity for one-on-one connections with my (virtual) students.  Of course, when I say “take the plunge,” you’ll see it’s more like getting wet up to my knees in the shallow end rather than a cannonball off the high dive into the deep end, but it was a start!

Kelly: Teaching online gave me the push I needed to really rethink my forms of assessment, especially my exams.  In the past, I would give in-person exams that were mostly short-answer questions with a strong focus on conceptual understanding and on drawing conclusions from real data*.

* If you are looking for good conceptual questions, they are all over Allan’s blog, such as post #21 (here).  I have borrowed many a question from Allan!

I didn’t feel that these exams would translate well to a take-home structure, partly because now students could just read Allan’s blog to find the correct answers!  I also figured an assessment shake-up would help me fix some of the weaknesses of my in-person exams.  For example, I struggled to assess a student’s ability to determine which methods to use. I didn’t give them access to a computer, so I had to do most of the analysis heavy-lifting and then leave them to explain my work and draw sound conclusions.

Another strong motivator was the one-on-one interaction component of the oral exam.  During my in-person class, I make an effort to learn all students’ names by the start of Week 2, and I try to interact with every student regularly.  I struggled to translate these practices to the online environment, so I appreciated that the oral exam allowed the lab instructors and me to check in and engage with each student.

In which course did you use an oral exam, and at what stage?

Kelly: This fall I was teaching introductory statistics and for the first time ever, I was teaching it online.  Across the two sections, my two lab instructors and I had a total of 74 students. We administered two exams, each of which included two parts: a two-hour, open-book, open-notes, take-home exam followed by a ten-minute oral exam.  During the take-home part, students were presented with a dataset and asked questions that required them to draw insights from the data.  This part required them to complete all their computations in R and document their work using R Markdown. The oral exam built from the data context on the take-home and focused more on their conceptual understanding of relevant statistical ideas.

Paul: I used an oral quiz in our Statistical Modeling course.  This course has an intro stats prerequisite, and it mostly covers multiple linear and logistic regression.  In addition to the usual assessments from weekly homework, exams, and a project, I added two “mini-tests” this semester, each worth 10% of the total grade.  The first allowed me to give extensive feedback to early written interpretations of modeling results; the second was an oral quiz following students’ first assignment (available here) on logistic regression.

Describe your oral quiz/exam in more detail.

Kelly: For the midterm exam, the students analyzed data related to the Flint water crisis (here). The oral exam questions asked about identifying the type of study, interpreting coefficients in a linear model they built for the take-home component, and drawing conclusions from the “How blood lead levels changed in Flint’s children” graph in the FiveThirtyEight article by Anna Maria Barry-Jester (here).

For the final exam, the students explored the police stops data presented in the Significance article “Racial disparities in police stops in US cities” by Roberto Rivera and Janet Rosenbaum.  The original data can be grabbed from the Stanford Open Policing Project (here), and wrangled data can be found in their github repository (here).  My exam focused on traffic stops in June of 2016 in San Francisco. For the take-home component, students explored the relationship between a driver’s race and whether or not they were searched.  Then, the oral component focused on assessing students’ conceptual understanding of key statistical inference ideas.  This included interpreting a p-value in their own words, grappling with types of errors, and explaining how the accuracy and precision of a confidence interval are affected as the sample size or confidence level is increased.

Kelly: Even though I only had ten precious minutes with each student, I used two of those minutes to combat student unease.  At the beginning of the oral exam, I talked through what to expect and reassured students that: a) brief pauses to consider the question were completely allowed, and b) they could think out loud and I would take the answer they ended on, not where they began.  I spent the last minute of the exam (if we still had time) with light-hearted pleasantries.  Throughout the exam, I was very mindful to maintain a cheerful expression and to nod (regardless of the quality of their answer) so that they felt comfortable and like I was “cheering them on.”

How did you assess student responses?

Paul: I created a scoring rubric based on one used by Allison Theobold at Cal Poly:

• 4 = Outstanding ability to articulate and connect logistic regression concepts, with comprehensive and thoughtful understanding of topics.
• 3 = Good ability to articulate and connect logistic regression concepts, with clear understanding of most topics.
• 2 = Limited ability to articulate and connect logistic regression concepts, with an understanding of some big ideas but also some misconceptions.
• 1 = Little to no ability to articulate and connect logistic regression concepts, with a limited understanding of big ideas and many misconceptions.
• 0 = Wait, have we talked about logistic regression??

I assigned scores in half-steps from 4.0 down to 2.0.  Because we were on zoom, I recorded every discussion (with student permission), just in case I needed to go back and review my assigned score.  As it turns out, I didn’t go back and review a single conversation!  I was able to assign a score to each student immediately after our conversation.  I received no complaints from students and did not second-guess myself.

Kelly: The lab instructors and I did all the 10-minute oral exams via Zoom over the course of two days.  I recorded my sessions (with student permission), in case I wanted to review them afterward, though I didn’t end up needing to.  During the oral exam, I typed terse notes.  While likely indecipherable to anyone else, these were enough for me to be able to go back and fill in later.  I didn’t want my notetaking to get in the way of our statistical conversation or to cause additional anxiety for the student.

Between sets of 6-9 oral exams, I gave myself 30-minute breaks to fill in my feedback on Gradescope, assign a score, and take a breather so that I could start the next set with a high level of engagement. (I didn’t want any of the students to realize I felt like Bill Murray’s character did when he experienced Groundhog Day for the 27th time.)

My assessment rubric was pretty simple and reflected the accuracy and completeness of the student’s answer for each question.  As I stated earlier, I gave each student feedback on the components they got wrong, along with encouraging feedback about what they got right.  I definitely didn’t give points for eloquence.  Overall, the oral exam represented about 25% of each student’s exam grade.

What would you do differently in the future, and what aspects would you maintain?

Kelly: In the future, I will consider having a question bank instead of asking each student the same set of questions. I like to think there wasn’t much cheating on the oral exams, but a student definitely could have shared the questions they were asked with a friend who took the exam at a later time.  I will also increase the testing slots to 15 minutes to allow for a bit more in-depth discussion of a concept.

I think I need to develop a clearer idea upfront of how much the instructors should lead students who are missing the mark.  I firmly believe that learning can happen during an exam, and an instructor’s leading questions can help a student who has strayed off the path to get back on and connect ideas.  For consistency, the lab instructors and I did very little leading this first time around.  When a student didn’t have much of an answer to a question, we just moved on to the next question.  I think that led to some missed learning opportunities.

In terms of what I’ll keep, I liked that the exam built off a data context that the students had already explored, so we didn’t have to spend time setting up the problem.  I will also continue asking questions that require explanations, requiring them to verbalize their thought process.

Paul: Although I plan to keep learning from others’ experiences and from researchers who have systematically studied oral exams, aspects that I’d like to keep include:

• Basing the exam on a recently completed assignment.  To me, this provided a good base from which to launch into discussions of concepts and connections.
• Allowing students to choose ahead of time the first question they’ll answer.  More than one student admitted how nervous they were when we were just starting, but they seemed to calm down after successfully going through their prepared response. Several admitted at the end that the oral exam went much faster and was not nearly as scary as they feared.
• Having an adaptive and loose script.  I believe I was able to fairly evaluate students even without a fixed set of questions (and there’s no risk that a fixed script can get out), and the conversation felt more genuine, authentic, and personal, adapted to a student’s level of understanding.
• Conducting it over zoom.  Even though this is less personal than meeting in person, it’s great for sharing screens back and forth, for maintaining a tight timeline and extending into evening hours, and for recording the conversation.
• Keeping the length at 15 minutes.  Anything less seems too rushed and not conversational enough, but anything more seems unnecessary for establishing a proper assessment.
• Grading with the 4-point rubric.  I’m convinced that the total time spent developing, administering, and grading the exam was significantly less than with a conventional written test, and the grades were just as reflective of students’ learning.

Aspects that I’d likely change up include:

• I would not include the “non-stats” ice-breaker question.  I think a little friendly chit-chat, followed by an initial question that the student has prepared, suffices to alleviate a lot of oral-exam anxiety.
• I might stretch 44 15-minute exams over three days instead of just two days, but I felt pretty energized throughout, and I preferred to bite the bullet and keep things to a short timeframe.
• Give students a chance to practice talking aloud through their thought processes beforehand, not just for an oral exam in my class, but for future technical interviews.
• Keep thinking about effective questions.  For example, I could give students data with a context and ask them to talk me through an analysis, from EDA to ultimate conclusions.
• I really didn’t provide students with much feedback other than comments during the exam and their final score.  I would love to find a way to provide a little more feedback, but I would not want to sacrifice being fully present during the conversation.

Did the oral exam/quiz meet your aspirations?  Once you return to face-to-face classes, will you continue to give oral exams/quizzes?

Paul: Yes!  This spring my challenge is to adapt this idea to courses in Statistical Theory (where I’ve always wanted to do oral exams) and Intro to Data Science (where I haven’t previously imagined oral exams).

Kelly: I really feel like I was better able to assess a student’s comprehension of statistical concepts with the oral exam than I have been with my in-class exams.  On a paper exam, you often just see the final answer, not the potentially winding road that got the student there and, for incorrect answers, where faulty logic seeped into the student’s thought process.

However, at the same time, I didn’t get to ask nearly as many conceptual questions this way.  I could see using both types of exams when I am back to the in-person classroom, which I am looking forward to!

## #83 Better, not necessarily good

In this post I will describe a five-question, auto-graded quiz that I recently gave my students.  I modified a longer assignment that I have used in the past to create this quiz.  I will describe my thinking behind each of these questions, and I will also mention some questions that I decided not to ask.  My favorite question on this quiz is the last one, because it tries to assess whether students can take a step back and think about the big picture rather than apply a particular procedure.  As always, questions that I pose to students appear in italics.

The quiz begins with this background information:

Helping smokers to quit is a very important and challenging public health goal.  In a study of the effectiveness of a nicotine lozenge, smokers who wanted to quit were randomly assigned to one of two groups: one group received nicotine lozenges and the other group received placebo lozenges.  At the end of the 52-week study, 17.9% of the 459 subjects in the nicotine group had successfully abstained from smoking, compared to 9.6% of the 458 subjects in the placebo group.

I realize that I have not presented a question yet, but I want to interrupt myself already.  I usually present students with a 2×2 table of counts for questions like this about comparing proportions between two groups. But this time I provided the sample sizes and sample proportions (presented as percentages) instead.  I think it’s worthwhile to present information in different ways at different times, in the hope that students learn to recognize the underlying structure without becoming too attached to a particular presentation.

To help with recognizing the underlying structure, I repeatedly advise students to start with identifying the observational units and variables*.  In this case the observational units are smokers who want to quit, the explanatory variable is whether the person receives a nicotine lozenge or placebo lozenge, and the response variable is whether or not the person successfully abstains from smoking for a year.  Both variables are categorical and binary, so that lets students know that they are analyzing a scenario that involves comparing two proportions.

* See post #11, titled Repeat after me, here.

1. How many times more likely was a smoker who received the nicotine lozenge to quit, as compared to a smoker who received a placebo lozenge?  Report your answer with two decimal places of accuracy.

This question asks for the ratio of the success proportions, also known as the relative risk.  The answer is: 0.179 / 0.096 ≈ 1.86.  I specified an error tolerance of 0.02, so students could receive full credit even with some rounding error.

I often ask this question in an alternate form, asking students to fill in the blank in the following sentence: Those who received the nicotine lozenge were ___ % more likely to quit, compared to those who received a placebo lozenge.  The answer is 86%, which can be determined by subtracting 1 from the ratio and then multiplying by 100%.

I realize that working with the difference in success proportions is more conventional than calculating the ratio, and the next two questions will ask about this difference.  But I think working with percentage differences is an important and challenging skill for students, so I ask this type of question often*.

* See post #28, titled A pervasive pet peeve, here.

2. Calculate the value of the z-test statistic for comparing the success proportions.  Report your answer with two decimal places of accuracy.

Students had the option of calculating this by hand or using software, such as the applet (here) that I have used in class.  If they are calculating this by hand, they need to use:

The only component of this calculation that has not been provided is the value of the combined proportion of successes for the two groups.  To determine this, students could first produce the 2×2 table of counts:

When some students asked about how to determine the combined sample proportion of successes, I mentioned that they could first determine the 2×2 table of counts, but I also pointed them to a shortcut by asking: In what situation does the combined proportion equal the average of the two group proportions?  I wanted them to realize that using the average works when the two groups have the same sample size.  In this case the two sample sizes are so similar that the average gives a very close approximation for the combined proportion of successes.

The test statistic turns out to be z = 3.63.  I allowed an error tolerance of 0.05 to account for rounding errors.
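
For readers who want to verify the hand calculation, here is a minimal Python sketch. It assumes the standard pooled two-proportion statistic, z = (p1 − p2) / sqrt(p(1 − p)(1/n1 + 1/n2)), where p is the combined proportion of successes, and it reconstructs the success counts by rounding the reported percentages. The variable names are illustrative and not part of the quiz.

```python
from math import sqrt

n_nic, n_pla = 459, 458
p_nic, p_pla = 0.179, 0.096   # sample proportions as reported

# Recover the counts of successes from the reported percentages (an assumption).
x_nic = round(p_nic * n_nic)
x_pla = round(p_pla * n_pla)

# Combined (pooled) proportion of successes across both groups.
p_pooled = (x_nic + x_pla) / (n_nic + n_pla)

# Shortcut: the plain average of the two proportions, which works well
# here because the two group sizes are nearly equal.
p_average = (p_nic + p_pla) / 2

# z = (difference in sample proportions) / (standard error under the null).
se_null = sqrt(p_pooled * (1 - p_pooled) * (1 / n_nic + 1 / n_pla))
z = (x_nic / n_nic - x_pla / n_pla) / se_null

print(f"pooled = {p_pooled:.4f}, average = {p_average:.4f}, z = {z:.2f}")
```

With these inputs the pooled proportion and the simple average differ by only about 0.0001, and the z-statistic falls within the quiz’s tolerance around 3.63.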

I often ask students to interpret the value of a z-test statistic.  Here they could say that the success proportion in the nicotine group is 3.63 standard errors larger than the success proportion in the placebo group.  But with only five questions on my quizzes, I omitted that question this time.

3. What kinds of values are in a 95% confidence interval for the difference in population proportions (nicotine minus placebo) who would successfully abstain from smoking for a year?  [Options: A. Both negative and positive values; B. Only negative values; C. Only positive values]

Students could answer this question by calculating the confidence interval, either by hand or with software.  But they could also answer this based on the z-statistic.  A z-statistic of 3.63 is quite far out in the tail of a standard normal distribution, so the p-value is quite close to zero, so the null hypothesis should be resoundingly rejected at the 0.05 significance level.  That means that a 95% confidence interval for the difference in population proportions will not include the value zero.  Because I stipulated the order of subtraction to be (nicotine minus placebo), and because the nicotine group had a larger proportion of successes than the placebo group, this 95% confidence interval will include only positive values.
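
The direct calculation can be sketched in a few lines of Python. This assumes the conventional unpooled standard error for a confidence interval and the approximate critical value 1.96. The counts 82 and 44 are reconstructed from the reported percentages rather than quoted from the study.

```python
from math import sqrt

n_nic, n_pla = 459, 458
p_nic, p_pla = 82 / n_nic, 44 / n_pla   # counts reconstructed from 17.9% and 9.6%

# Unpooled standard error of the difference in sample proportions.
se = sqrt(p_nic * (1 - p_nic) / n_nic + p_pla * (1 - p_pla) / n_pla)

z_star = 1.96   # approximate critical value for 95% confidence
diff = p_nic - p_pla
lower, upper = diff - z_star * se, diff + z_star * se

print(f"95% CI for (nicotine - placebo): ({lower:.3f}, {upper:.3f})")
print("only positive values" if lower > 0 else "includes zero or negative values")
```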

I was tempted to ask about a confidence interval for the relative risk, to see whether students would realize that such a confidence interval would include only values larger than one.  In hindsight I wish that I had turned question #3 into a two-part matching-style question, with one part asking about a confidence interval for the difference in population proportions and the other asking about a confidence interval for the ratio of population proportions.

4. What conclusion would you draw from this study?  [Options: A. Not enough evidence to conclude that the nicotine lozenge is more effective than the placebo lozenge; B. Very strong evidence that the nicotine lozenge is equally effective as the placebo lozenge; C. Very strong evidence that the nicotine lozenge is more effective than the placebo lozenge]

This is the bottom-line question about this study, so I did not consider leaving this out.  The correct answer is C. I have been including a version of option B on many questions, to see whether students might be tempted into (inappropriately) accepting the null hypothesis for cases with a larger p-value.

Finally, we arrive at the question that I’ve been intending to highlight all along:

5. If we consider the smokers in this study to be representative of the population of all smokers who want to quit, what would you conclude about the percentage of smokers who would successfully abstain from smoking for one year if they were to take nicotine lozenges?  [Options: A. Substantially greater than 50%, B. Substantially less than 50%, C. Very close to 50%]

It’s very tempting to conclude that more than half of all smokers would successfully quit with the nicotine lozenge.  After all, the previous questions have led to concluding that the nicotine lozenge is substantially better than the placebo.  The z-statistic is very large, so the p-value is very small, so the data provide very strong evidence that the nicotine lozenge is better than the placebo.

It’s also reasonable to conclude that the nicotine lozenge is a good bit better than the placebo.  Our point estimate from question #1 is that smokers are 1.86 times more likely to quit if they use the nicotine lozenge rather than the placebo.  A 95% confidence interval for the population relative risk extends from 1.32 to 2.62, well above the value 1.
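
The post does not say how this interval was computed. The Python sketch below assumes the common large-sample interval built on the log scale, with counts reconstructed from the reported percentages; under those assumptions it reproduces endpoints of about 1.32 and 2.62.

```python
from math import exp, log, sqrt

x_nic, n_nic = 82, 459   # successes and sample size, nicotine group (reconstructed count)
x_pla, n_pla = 44, 458   # successes and sample size, placebo group (reconstructed count)

# Point estimate of the relative risk.
rr = (x_nic / n_nic) / (x_pla / n_pla)

# Large-sample standard error of log(relative risk).
se_log_rr = sqrt(1 / x_nic - 1 / n_nic + 1 / x_pla - 1 / n_pla)

z_star = 1.96   # approximate critical value for 95% confidence
lower = exp(log(rr) - z_star * se_log_rr)
upper = exp(log(rr) + z_star * se_log_rr)

print(f"relative risk = {rr:.2f}, 95% CI: ({lower:.2f}, {upper:.2f})")
```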

Several students asked me for help with this question.  My advice to them was:

1. Don’t calculate anything.
2. Re-read the sentence about how effective the nicotine lozenge was for the smokers in this study.
3. Re-read the question and options.
4. Think

I wanted students to realize that only 17.9% of the smokers who received the nicotine lozenge succeeded in quitting for a year.  That’s a lot less than 50%.  A 95% confidence interval runs from 14.4% to 21.4%, which is still a lot less than 50%.  So, the take-away message is: Being a lot better than a placebo does not necessarily mean being good in an absolute sense.  This study indicates that smokers who want to quit have a better chance of success with a nicotine lozenge than with a placebo.  But those smokers should not be fooled into thinking that nicotine lozenges move the odds in favor of their successfully quitting. Unfortunately, even with nicotine lozenges, the odds appear to be substantially against successful quitting.
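
For anyone who does want to check the one-sample interval, here is a short Python sketch. It assumes the usual large-sample (Wald) interval and a count of 82 successes reconstructed from the reported 17.9%.

```python
from math import sqrt

x, n = 82, 459          # successes (reconstructed) and sample size, nicotine group
p_hat = x / n

# Standard error of a single sample proportion.
se = sqrt(p_hat * (1 - p_hat) / n)

z_star = 1.96           # approximate critical value for 95% confidence
lower, upper = p_hat - z_star * se, p_hat + z_star * se

print(f"95% CI: {100 * lower:.1f}% to {100 * upper:.1f}%")
print("entire interval below 50%:", upper < 0.5)
```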

When I used this context on longer, non-auto-graded assignments in the past, I asked students to determine the confidence interval for the population proportion who would successfully quit if given a nicotine lozenge.  I think it’s worthwhile to ask students to conduct one-sample inference as part of an assignment that’s primarily about two-sample inference.  But I like question #5 on this quiz version better, because this just asks students to think without calculating anything.  I also like that this quiz version presents the sample percentages of success rather than a table of counts at the outset, so students do not even need to calculate the relevant percentage (17.9%) for themselves.

There you have it: a five-question, auto-graded quiz that I hope leads students to think and not just apply a procedure mindlessly.

Speaking of thinking, may I be so bold as to ask what you think of this quiz?  Is it better than asking nothing at all?  I hope you do not hesitate to answer yes.  Ah, but of course you can truthfully say that this quiz is better than nothing without necessarily thinking that the quiz is good.  I’ll be disappointed if you think this, but I’ll also appreciate that you not only grasped my point but also seized an immediate opportunity to apply it.

P.S. The article about the nicotine lozenge study can be found here.

## #82 Power, part 3

This post continues and completes my discussion of introducing students to the concept of power.  Let me remind you of the scenario that I presented in the first post of this series (here):

Suppose that Tamika is a basketball player whose probability of successfully making a free throw has been 0.6.  During one off-season, she works hard to improve her probability of success.  Of course, her coach wants to see evidence of her improvement, so he asks her to shoot some free throws.  If Tamika really has improved, how likely is she to convince the coach that she has improved?

The first post in this series described using an applet (here) to conduct simulation analyses to lead students to the concepts of rejection region and power, and then to consider factors that affect power.  In this post I will make three points about teaching these concepts in courses for mathematically inclined students, such as those majoring in statistics or mathematics or engineering or economics.

1. Start with simulation analyses, even with mathematically inclined students.

I suspect that some statistics teachers regard simulation as a valuable tool for students who are uncomfortable with math but not necessarily for mathematically inclined students.  I disagree: I think simulation can be a very enlightening and powerful tool even with students who enjoy and excel at mathematical aspects of statistics.  I recommend introducing students to the concept of power through simulation analyses, regardless of how well prepared or comfortable the students are with mathematics.

You could ask students to write their own code to conduct these simulations, but I typically stick with the applet because it’s so convenient and produces nice visuals such as:

2. Proceed to ask mathematically inclined students to perform probability calculations to confirm what the simulations reveal.

Tamika’s example provides a great opportunity for students to practice working with the binomial distribution:

• a) Let the random variable X represent the number of shots that Tamika would successfully make out of 25 shots, assuming that she has not improved.  What probability distribution would X have?
• b) Determine the smallest value of k for which Pr(X ≥ k) ≤ 0.05.
• c) Does this agree with your finding from the simulation analysis?
• d) Explain what this number has to do with Tamika’s efforts to convince the coach that she has improved.

The random variable X would have a binomial distribution with n = 25 and p = 0.6.  To answer part (b), students could work with a cumulative distribution function by realizing that Pr(X ≥ k) = 1 – Pr(X ≤ k – 1) in this case.  Then they can use software or a graphing calculator to determine that the smallest value of k that satisfies this criterion is k = 20, for which Pr(X ≥ 20) ≈ 0.0294.  This means that Tamika must successfully make 20 or more of the 25 shots to convince her coach that she has improved, when the coach gives her 25 shots and uses 0.05 as his standard to be convinced.

Instead of using the cumulative distribution function, students could use the inverse cumulative distribution function built into many software programs.  For example, this command in R is:

Some students get tripped up by the need for the first input to be 0.95 rather than 0.05.  Students also need to be careful to realize that the output value of 19 = k – 1, so the value of k = 20.  As some students struggle with this, I remind them of two things: First, they should return to their simulation results to make sure that their binomial calculations agree.  Second, when they’re not sure whether 19 or 20 is the answer they’re looking for, they can check that by calculating Pr(X ≥ 19) and Pr(X ≥ 20) to see which one meets the criterion.
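
R’s built-in inverse CDF for the binomial distribution is `qbinom`. As a language-neutral cross-check, the same search can be sketched in plain Python using only the standard library. The function names below are hypothetical helpers, and the sketch searches for k directly rather than going through k – 1, which sidesteps the off-by-one trap.

```python
from math import comb

def upper_tail(k, n, p):
    """Pr(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def smallest_cutoff(n, p_null, alpha):
    """Smallest k with Pr(X >= k) <= alpha under the null distribution."""
    # k = n + 1 always qualifies (the tail is empty), so the search terminates.
    return next(k for k in range(n + 2) if upper_tail(k, n, p_null) <= alpha)

k = smallest_cutoff(25, 0.6, 0.05)

# When unsure between 19 and 20, compare both tail probabilities with 0.05.
tail_19 = upper_tail(19, 25, 0.6)
tail_20 = upper_tail(20, 25, 0.6)

print(f"k = {k}")
print(f"Pr(X >= 19) = {tail_19:.4f}, Pr(X >= 20) = {tail_20:.4f}")
```

Only the second tail probability falls at or below 0.05, which confirms that 20, not 19, is the cutoff.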

• e) Let the random variable Y represent the number of shots that Tamika would successfully make if she has improved her success probability to 0.7.  What probability distribution would Y have?
• f) Determine Pr(Y ≥ k) for the value of k that you determined in part (b).
• g) Does this agree with your finding from the simulation analysis?
• h) Explain what this number has to do with Tamika’s efforts to convince the coach that she has improved.

The random variable Y would have a binomial distribution with n = 25 and p = 0.7.  Once they realize this, students can use software to calculate Pr(Y ≥ 20) ≈ 0.1935.  For example, the R command to calculate this is:

This probability is very close to the approximation from the simulation analysis.  Tamika has slightly less than a 20% chance of convincing her coach that she has improved, if she is given a sample of 25 shots, the coach uses a significance level of 0.05, and her improved probability of success is 0.7 rather than 0.6.
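
The same tail probability can be checked without R. The Python sketch below uses only the standard library; `upper_tail` is a hypothetical helper name, and the cutoff of 20 is taken from part (b).

```python
from math import comb

def upper_tail(k, n, p):
    """Pr(Y >= k) for Y ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

n, cutoff = 25, 20   # cutoff from part (b)

# Chance of convincing the coach if Tamika has not improved (at most 0.05 by design).
size = upper_tail(cutoff, n, 0.6)

# Chance of convincing the coach if her success probability is now 0.7: the power.
power = upper_tail(cutoff, n, 0.7)

print(f"Pr(X >= 20 | p = 0.6) = {size:.4f}")
print(f"Pr(Y >= 20 | p = 0.7) = {power:.4f}")
```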

Students can then use software to produce exact power calculations, using the binomial probability distribution, for different values of the sample size, significance level, and improved success probability.

A drawback of using software such as R or Excel to calculate these probabilities is that they do not automatically provide a visual representation of the probability distribution.  The applet that I used for the simulation analyses does have an option to calculate and display exact binomial probabilities:

3. Ask mathematically inclined students to write code to produce graphs of power as a function of sample size, of significance level, and of the alternative value for the parameter.

Recall that the pair of R commands for calculating the rejection region and power for Tamika’s first situation is:

Then I like to ask mathematically inclined students: Re-write this power calculation to use just one line of code.  For students who need a hint: Where did the 19 value in the second line come from?  This leads to:

A follow-up is: How can you alter this command to calculate power as a function of sample size, for values from n = 1 through n = 300?  The key is to replace the value 25 with a vector (call it n) containing integer values from 1 through 300.  The resulting graph (with α = 0.05 and palt = 0.7) is:

Does this graph behave as you expected?  Mostly yes, but there’s an oddity.  This graph shows that power generally increases as sample size increases, as we expected.  But I say generally because there are lots of short-run exceptions, because of the discreteness of the binomial distribution.  The pattern is more noticeable if we restrict our attention to sample size values from n = 1 through 100:
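
The values behind such a power curve can be reproduced outside R. Here is a minimal Python sketch, assuming the one-sided test described above (null value 0.6, alternative 0.7, α = 0.05); `binom_pmf` and `exact_power` are hypothetical helper names. It computes exact power for n = 1 through 300 and lists the sample sizes at which one extra shot lowers the power.

```python
from math import comb

def binom_pmf(n, p):
    """List of Pr(X = j) for j = 0..n, where X ~ Binomial(n, p)."""
    return [comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n + 1)]

def exact_power(n, p_null=0.6, p_alt=0.7, alpha=0.05):
    """Power of the one-sided test that rejects when X >= k."""
    null_pmf = binom_pmf(n, p_null)
    # Walk down from the top to find the smallest k with Pr(X >= k) <= alpha.
    k, tail = n + 1, 0.0
    while k > 0 and tail + null_pmf[k - 1] <= alpha:
        k -= 1
        tail += null_pmf[k]
    # Power is the same upper tail, evaluated under the alternative.
    return sum(binom_pmf(n, p_alt)[k:])

sample_sizes = range(1, 301)
power_by_n = {n: exact_power(n) for n in sample_sizes}

# Sample sizes at which one extra shot lowers the power (the short-run exceptions).
dips = [n for n in sample_sizes if n > 1 and power_by_n[n] < power_by_n[n - 1]]

print(f"power at n = 25: {power_by_n[25]:.4f}")
print(f"number of dips between n = 2 and n = 300: {len(dips)}")
```

Passing `power_by_n` to any plotting library gives the sawtooth shape: the cutoff k can only move in whole shots, so the actual significance level jumps around below 0.05 and the power jumps with it.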

Students can then proceed to produce and describe graphs of power as a function of significance level and of the improved probability value (for n = 25 in both graphs, palt = 0.7 on the left, and α = 0.05 on the right), as shown here:

Do these graphs behave as you expected?  Power increases as the significance level increases, as expected, but this is a step function due to the discreteness.  Power does increase as the improved probability value increases, as expected.
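
Both relationships can be tabulated with a short Python sketch. It assumes the same one-sided binomial test with n = 25, and the grids of α values and alternative probabilities are arbitrary choices made for illustration; `exact_power` is a hypothetical helper name.

```python
from math import comb

def binom_pmf(n, p):
    """List of Pr(X = j) for j = 0..n, where X ~ Binomial(n, p)."""
    return [comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n + 1)]

def exact_power(n, p_null=0.6, p_alt=0.7, alpha=0.05):
    """Power of the one-sided test that rejects when X >= k."""
    null_pmf = binom_pmf(n, p_null)
    # Smallest k with Pr(X >= k) <= alpha under the null.
    k, tail = n + 1, 0.0
    while k > 0 and tail + null_pmf[k - 1] <= alpha:
        k -= 1
        tail += null_pmf[k]
    return sum(binom_pmf(n, p_alt)[k:])

# Power as a function of significance level (p_alt fixed at 0.7).
alphas = [a / 100 for a in range(1, 21)]            # 0.01, 0.02, ..., 0.20
power_by_alpha = [exact_power(25, alpha=a) for a in alphas]

# Power as a function of the improved probability (alpha fixed at 0.05).
p_alts = [0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
power_by_p_alt = [exact_power(25, p_alt=p) for p in p_alts]

for a, pw in zip(alphas, power_by_alpha):
    print(f"alpha = {a:.2f}  power = {pw:.4f}")
for p, pw in zip(p_alts, power_by_p_alt):
    print(f"p_alt = {p:.2f}  power = {pw:.4f}")
```

The first table shows the step function: many neighboring values of α share the same cutoff k and therefore the same power. The second table rises steadily because the cutoff stays fixed while the alternative distribution shifts upward.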

The concept of power is a challenging one for many students to grasp.  I recommend starting with a simple scenario involving a single proportion, such as Tamika trying to convince her coach of her improvement as a free throw shooter.  I think simulation analyses and visualizations can help students to understand the key ideas*.  With mathematically inclined students, I suggest following up the simulations with probability calculations and simple coding as described in this post. My hope is that these activities deepen their understanding of power and also their facility with probability distributions.

* As long as the simulation analyses are accompanied by asking good questions!

## #81 Power, part 2

In last week’s post (here), I presented an extended series of questions that used a simulation analysis to introduce students to the concept of statistical power.  The simulation analysis used an applet (here) to produce graphs like this:

In the context of last week’s example, this pair of graphs shows that there’s about a 63% chance that Tamika will perform well enough to convince her coach that she has improved, based on a sample size of 100 shots, a significance level of 0.05, and an improvement in her probability of success from 0.6 to 0.7.

Needless to say, I believe that the questions I presented can be helpful for developing students’ understanding of the concept of power.  I hasten* to add that this activity is far from fool-proof.  In this post, I will discuss a few common difficulties and misunderstandings.

* Well, it took me a week, so perhaps I have not really hastened to add this, but I like the sound of the word.

A big part of my point in last week’s post was that we can help students to focus on the concept of power by postponing the use of statistical terminology.  I waited until we had completed the simulation activity before defining the terms power and also type I error and type II error.  Of course, this required me to move beyond simply talking about whether Tamika had improved or not, and whether her coach was convinced or not.  At this point I mentioned terms such as null hypothesis and alternative hypothesis, rejecting the null hypothesis and failing to reject the null hypothesis.  Then I asked my students to state the null and alternative hypotheses that Tamika’s coach was testing, both in words and in terms of a parameter.

Most students seemed to realize quickly that the null hypothesis was that Tamika had not improved, and the alternative was that she had improved.  But they struggled with expressing these hypotheses in terms of a parameter.  To point them in the right direction, I asked whether the parameter is a proportion or a mean, but this did not seem to help.  I took a conceptual step back and asked whether the variable is categorical or numerical.  This time several students answered quickly but incorrectly in the zoom chat that the variable was numerical.

This is a very understandable mistake, because graphs such as the ones above display the distribution of a numerical variable.  But I pointed out that the variable for Tamika is whether or not she successfully makes a shot, which is categorical.  The parameter is therefore the long-run proportion of shots that she would make, which my students know to represent with the symbol π.  The hypotheses are therefore H0: π = 0.6 (no improvement) versus Ha: π > 0.6 (improvement).

This difficulty reveals a common problem when using simulation to introduce students to concepts of statistical inference.  To understand what the simulation analysis and resulting graphs reveal, it’s crucial to realize that such graphs are displaying the results not of a single sample, which is what we would observe in practice.  Rather, the graphs are showing results for a large number of made-up samples, under certain assumptions, in order to investigate how the procedure would perform in the long run.  This is a big conceptual leap.  I strongly recommend using physical devices such as coins and cards for students’ first encounters with simulation (see posts #12 and #27, here and here), in order to help them with recognizing this step and taking it gradually.  When you rely on technology to conduct simulations later, students must follow this step in their minds to make sense of the results.

As I presented the activity for my students via zoom, I also encouraged them to use the applet to carry out simulation analyses themselves.  I should not have been surprised by the most common question I received from my students, but I was surprised at the time.  Several students expressed concern about getting slightly different values than I did.  For example, they might have gotten 0.6271 or 0.6343 rather than the 0.6317 that I obtained in the graphs above.  I responded that this was a good question but nothing to worry about.  Those differences, I said, were due to the random nature of simulation and therefore to be expected.  I added that using a large number of repetitions for the simulation analysis, such as 10,000, should ensure that we all obtain approximately the same value.
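
The size of these run-to-run differences can be anticipated. With 10,000 repetitions and a true power near 0.63, the standard error of the simulated proportion is roughly sqrt(0.63 × 0.37 / 10,000) ≈ 0.005, so estimates such as 0.6271 and 0.6343 are well within what chance alone produces. The Python sketch below illustrates this with three independent simulation runs. It is not the applet’s code: the seeds are arbitrary and the function names are hypothetical.

```python
import random
from math import comb

def rejection_cutoff(n, p_null, alpha):
    """Smallest k with Pr(X >= k) <= alpha for X ~ Binomial(n, p_null)."""
    pmf = [comb(n, j) * p_null**j * (1 - p_null)**(n - j) for j in range(n + 1)]
    k, tail = n + 1, 0.0
    while k > 0 and tail + pmf[k - 1] <= alpha:
        k -= 1
        tail += pmf[k]
    return k

def simulated_power(n, p_alt, cutoff, reps, seed):
    """Fraction of simulated samples of n shots with at least `cutoff` makes."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        made = sum(rng.random() < p_alt for _ in range(n))
        rejections += made >= cutoff
    return rejections / reps

cutoff = rejection_cutoff(100, 0.6, 0.05)

# Three runs with different (arbitrary) seeds stand in for three students.
estimates = [simulated_power(100, 0.7, cutoff, reps=10_000, seed=s) for s in range(3)]

print(f"rejection region: {cutoff} or more makes out of 100")
print("simulated power estimates:", estimates)
```

Changing `reps` shows the effect of the number of repetitions: with only 100 repetitions the estimates scatter far more widely than they do with 10,000.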

Some students followed up by asking how such responses will be graded on assignments and exams.  I had been thinking that some students resist a simulation-based approach because they are uncomfortable with approximate answers rather than a single correct answer.  But this question made me realize that some students may be skeptical of simulation analyses not for intellectual or psychological reasons but rather out of concern about their grades.

I tried to assure my students that with simulation analyses, reasonable values in the right ballpark would earn full credit, for both open-ended and auto-graded responses.  I should have also thought to respond that many questions will instead ask about the simulation process and the interpretation of results.

My pledge that students would receive full credit for reasonable approximations was called into question less than half an hour after class ended.  Here are the questions that I asked in the (auto-graded) follow-up quiz:

Suppose that I have regularly played Solitaire on my computer with a 20% chance of winning any one game.  But I have been trying hard lately to improve my probability of winning, and now I will play a series of (independent) games to gather data for testing whether I have truly improved.

1. What is the alternative hypothesis to be tested?  [Options: That I have improved; That I have not improved; That I have doubled my probability of winning a game]

2. Suppose that I have not improved, but the data provide enough evidence to conclude that I have improved.  What type of error would this represent?  [Options: Type I error; Type II error; Type III error; Standard error]

Now suppose that I really have improved, and my success probability is now 25% rather than 20%.  Also suppose that I plan to play 40 independent games and that my test will use a significance level of 0.05.  Use the Power Simulation applet to conduct a simulation analysis of this situation.

3. What is the rejection region of the test?  [Options: Winning 13 or more times in the 40 games; Winning 20 or more times in the 40 games; Winning 8 or more times in the 40 games; Winning 10 or more times in the 40 games]

4. Which of the following comes closest to the probability that these 40 games will provide convincing evidence of my improvement?  [Options: 0.18; 0.25; 0.40; 0.75; 0.99]

5. Continue to assume that my success probability is now 25% rather than 20% and that the test uses a significance level of 0.05.  About how many games would I have to play in order to have a 50% chance that the games will provide convincing evidence of my improvement?  Enter your answer as an integer.  (Hint: Use the applet, and feel free to use trial-and-error.)

* The discrete-ness of the binomial distribution is more of an issue here than variability of simulation results.  I will discuss this further in part 3 of this series, but for now I’ll show a graph of power (calculated from the binomial distribution) as a function of sample size for the values that I decided to accept as reasonable.  This graph shows that power does generally increase with sample size, but the discrete-ness here makes the function more interesting and non-monotonic:
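A curve of this kind can be recomputed directly from the binomial distribution.  The following Python sketch does so for the Solitaire scenario (20% versus 25%, significance level 0.05).  The helper names and the range of sample sizes (150 to 200) are my own illustrative choices, not the values accepted on the quiz.

```python
from math import comb

def upper_tail(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def rejection_cutoff(n, p0, alpha):
    """Smallest number of wins k with P(X >= k | p0) < alpha."""
    return next(k for k in range(n + 2) if upper_tail(n, k, p0) < alpha)

def exact_power(n, p0, p_alt, alpha):
    """Probability of landing in the rejection region when the truth is p_alt."""
    return upper_tail(n, rejection_cutoff(n, p0, alpha), p_alt)

P0, P_ALT, ALPHA = 0.20, 0.25, 0.05

# Power for every sample size in an (assumed) range of interest.
powers = {n: exact_power(n, P0, P_ALT, ALPHA) for n in range(150, 201)}
for n in range(150, 201, 5):
    print(f"n = {n}: cutoff = {rejection_cutoff(n, P0, ALPHA)}, power = {powers[n]:.3f}")

# Sample sizes at which adding one more game makes power go down.
drops = [n for n in range(150, 200) if powers[n + 1] < powers[n]]
print("power decreases when moving from n to n + 1 for n in:", drops)
```

The drops occur exactly where the cutoff moves up by one: the extra game does not make up for the stricter requirement, so power dips before resuming its climb.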

I believe that the simulation activity that I presented last week is effective for introducing students to the concept of power.  But I also acknowledge that this is a challenging topic, so in this post I have tried to point out some difficulties that students encounter.

## #80 Power, part 1

I wish I had a better title for this post.  This one-word title reminds me of my two-part post about confounding variables, which I simply titled Confounding (see posts #43 and #44, here and here).  I tried to be clever with that title by arguing that the concept of confounding is one of the most confounding topics that students encounter in introductory statistics.  I suppose I could argue that the concept of power is one of the most powerful topics that students encounter, but my point is really that power is another topic that students find to be especially confounding.  I will abandon my search for cleverness and stick with this boring (but not misleading!) title.

I think we can help students to understand the concept of power by eliminating unnecessary terminology and calculations for our first pass at the topic.  We don’t need to mention null and alternative hypotheses, or rejection regions, or Type I and Type II errors, or p-values, or binomial or normal distributions, or expected value or standard deviation or z-score.  Don’t get me wrong: We’ll use most of those ideas, but we don’t need to let the terminology get in the way.

Instead we can present students with a scenario and an overarching question that you and I recognize as a question of power.  Then we can lead students to answer that big question by asking a series of smaller questions.  Questions that I pose to students appear in italics below.

Here’s the scenario that I use with my students: Suppose that Tamika is a basketball player whose probability of successfully making a free throw has been 0.6.  During one off-season, she works hard to improve her probability of success.  Of course, her coach wants to see evidence of her improvement, so he asks her to shoot some free throws.

Here’s the overarching question: If Tamika really has improved, how likely is she to convince the coach that she has improved?  The other big question is: What factors affect how likely she is to convince the coach that she has improved?

I try not to over-do sports examples with my students, but I think the context here is very helpful and easy to follow, even for students who are not sports fans.

You won’t be surprised to see that we’ll use simulation as our tool to address these questions.

Let’s say that the coach gives Tamika 25 shots with which to demonstrate her improvement.

a) Suppose that she successfully makes 23 of the 25 shots.  Would you be reasonably convinced that she has improved?  Why or why not?

b) What if she makes 16 of the 25 shots – would you be reasonably convinced that she has improved?  Why or why not?

Most students realize that 60% of 25 is 15*, so both 16 and 23 are more successes than we would expect (for the long-run average) if she had not improved.  Their intuition suggests that 23 successes would provide very strong evidence of improvement, because it seems unlikely that a 60% shooter would achieve that many successes.  On the other hand, 16 successes does not provide strong evidence of improvement, because it seems that a 60% shooter could easily get a bit lucky and obtain 16 successes.

* You’re welcome to call this the expected value if you’d like.

c) What does your intuition suggest about how many shots Tamika would have to make successfully in order to be convincing?

When I asked my students to type their answer to this question into the zoom chat during class a few days ago, nearly every student typed 20.  I said that this seemed reasonable and that we would proceed to use simulation to investigate this question a bit more carefully.  We used an applet (here) to conduct the simulation analysis.  The applet inputs required are:

d) Which input values can you specify already?

The hypothesized probability of success is 0.6, and the sample size is 25.  Later we’ll assume that Tamika has improved to have a 70% chance of success, so we’ll enter 0.7 for the alternative probability of success.  I like to start with simulating just one sample at a time, so we’ll enter 1 for number of samples at first; later we’ll enter a large number such as 10,000 for the number of samples.

e) Click on “draw samples” five times, using 1 for the number of samples each time.  Did each of the simulated samples produce the same number of successful shots?

Part e) would be easy to skip, but I think it’s important.  This question forces students to acknowledge randomness, or sampling variability.  I don’t think any students struggle to answer this correctly, but I think it’s worth drawing their attention to this point.

f) Now enter 9995 for the number of samples, and click on “draw samples” to produce a total of 10,000 simulated sample results.  Describe the resulting distribution for the number of successes.  Comment on shape, center, and variability.

Here are some typical results:

My students are quick to say that the shape of this distribution is symmetric, unimodal, normal-ish.  The center is near 15, which is what we expected because 60% of 25 is 15.  There’s a good bit of variability here: The simulated results show that Tamika sometimes made as few as 7 or 8 shots out of 25, and she also made as many as 23 or 24 shots out of 25.
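For readers without the applet at hand, a simulation along these lines can be sketched in a few lines of Python.  This is only an approximation of what the applet does; the seed and the text-based histogram are illustrative assumptions.

```python
import random
from collections import Counter

rng = random.Random(2021)               # arbitrary seed, for reproducibility
N_SHOTS, P_OLD, REPS = 25, 0.6, 10_000

# Each simulated sample: the number of makes in 25 shots by a 60% shooter.
counts = [sum(rng.random() < P_OLD for _ in range(N_SHOTS)) for _ in range(REPS)]
tally = Counter(counts)
mean = sum(counts) / REPS

print(f"mean = {mean:.2f}, min = {min(counts)}, max = {max(counts)}")
for k in range(min(counts), max(counts) + 1):
    print(f"{k:2d} {'#' * (tally[k] // 50)}")   # one '#' per 50 samples
```

The printed bars should look roughly symmetric and mound-shaped around 15, with the occasional sample far out in either tail.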

g) Has this simulation analysis assumed that Tamika has improved, or that Tamika has not improved?

This is also a key question that is easy for students to miss: This simulation analysis has assumed that Tamika has not improved*.  We use the distribution of the number of successes, assuming that she has not improved, to decide how many successes she needs to provide convincing evidence of improvement.  I try to reinforce this point with the next question:

* You’re welcome to call this the null hypothesis.

h) Based on these simulation results, do you feel justified in your earlier answers about whether 23 successes, or 16 successes, would provide convincing evidence of improvement?  Explain.

Students who thought that 23 successes in 25 attempts provides very strong evidence of improvement should feel justified, because this simulation reveals that such an extreme result would happen only about 4 times in 10,000* (see graph on the left).  Similarly, students were correct to believe that 16 successes does not provide much evidence of improvement, because it’s not at all unlikely (better than a 40% chance*) for a 60% shooter to do that well (or better) by random chance (see graph on the right).

* You’re welcome to refer to these percentages as approximate p-values.  See post #12 (here) for an introduction to simulation-based inference.
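The simulated percentages can be checked against exact binomial probabilities.  Here is a small sketch, assuming (as the simulation does) that Tamika is still a 60% shooter; the helper name `upper_tail` is my own.

```python
from math import comb

def upper_tail(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

p_23 = upper_tail(25, 23, 0.6)   # chance a 60% shooter makes 23 or more of 25
p_16 = upper_tail(25, 16, 0.6)   # chance a 60% shooter makes 16 or more of 25

print(f"P(23 or more) = {p_23:.5f}")
print(f"P(16 or more) = {p_16:.4f}")
```

These exact values line up with the simulation: roughly 4 in 10,000 for 23 or more successes, and a bit more than 40% for 16 or more.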

Now we come to one of the harder questions:

i) Suppose that the coach decides on the following criterion for his decision rule: He’ll decide that Tamika has improved if her number of successes is large enough that such an extreme result would happen less than 5% of the time with a 60% shooter.  According to this rule, how many shots does Tamika need to make successfully to convince her coach?

I encourage students to answer this at first with trial-and-error.  Enter 17, and then 18, and so on into the “rejection region” box until you find the smallest number for which less than 5% of the simulated samples produce such a large number (or more) of successes.  The answer turns out to be that Tamika needs to make 20 or more of the 25 shots* to be convincing, as shown here:

* You’re welcome to call this the rejection region of the test, especially as the applet uses that term.
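The trial-and-error search can be mimicked with exact binomial probabilities rather than simulated ones.  This is a minimal sketch under the coach's rule as stated (less than 5% for a 60% shooter); the helper name is an illustrative choice.

```python
from math import comb

def upper_tail(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

N_SHOTS, P_OLD, ALPHA = 25, 0.6, 0.05

# Try 17, 18, ... just as students do with the "rejection region" box.
for k in range(17, 26):
    print(f"{k} or more: {upper_tail(N_SHOTS, k, P_OLD):.4f}")

# Smallest number of makes whose tail probability falls below 5%.
cutoff = next(k for k in range(N_SHOTS + 1) if upper_tail(N_SHOTS, k, P_OLD) < ALPHA)
print("cutoff:", cutoff)
```

Making 19 or more is not quite rare enough for a 60% shooter (about 7%), while 20 or more is (about 3%), which is why the cutoff lands at 20.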

I was quick to point out to my students how good their intuition was.  As I mentioned earlier, nearly all of my students who responded in the zoom chat predicted that Tamika would need to make 20 shots to be convincing.

Now, finally, we address the big picture question:

j) Make a guess for how likely Tamika is to make 20 or more shots successfully out of 25 attempts, if she has improved to a 0.7 probability of successfully making a single shot.

I don’t really care how well students guess here.  My point is to remind them of the big question, the reason we’re going through all of this.  Next we use the applet to conduct another simulation to answer this question:

k) Check the “show alternative” box in the applet, which displays the distribution of number of successes, assuming that Tamika has improved to a 0.7 probability of success.  Do you see much overlap in the two distributions?  Is this good news or bad news for Tamika?  Explain.

There is considerable overlap in the two distributions, as shown here:

l) According to the applet’s simulation results, in what percentage of the 10,000 simulated samples does Tamika, with a 70% chance of making a single shot, do well enough to convince the coach of her improvement by successfully making 20 or more shots?  Would you say that Tamika has a good chance of demonstrating her improvement in this case?

Unfortunately for Tamika, she does not have a good chance of demonstrating her improvement.  In my simulation result shown here, she only does so about 19.5% of the time:

Here’s where we introduce the term of the day: We have approximated the power of this test.  Power in this case represents the probability that Tamika convinces her coach that she has improved, when she truly has improved.
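The same approximation can be sketched outside the applet.  The code below simulates 10,000 samples of 25 shots from a 70% shooter and counts how often she reaches the cutoff of 20; it also computes the exact binomial probability for comparison.  The seed and variable names are assumptions made for illustration.

```python
import random
from math import comb

N_SHOTS, P_NEW, CUTOFF, REPS = 25, 0.7, 20, 10_000
rng = random.Random(80)                 # arbitrary seed

# Simulated power: share of samples in which the improved shooter makes 20+.
convincing = 0
for _ in range(REPS):
    made = sum(rng.random() < P_NEW for _ in range(N_SHOTS))
    if made >= CUTOFF:
        convincing += 1
simulated_power = convincing / REPS

# Exact power: P(X >= 20) for X ~ Binomial(25, 0.7).
exact_power = sum(
    comb(N_SHOTS, j) * P_NEW**j * (1 - P_NEW)**(N_SHOTS - j)
    for j in range(CUTOFF, N_SHOTS + 1)
)

print(f"simulated power = {simulated_power:.4f}")
print(f"exact power     = {exact_power:.4f}")
```

The simulated value will differ a little from run to run, but it should stay close to the exact value of about 0.19.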

Now we’ll begin to consider factors that affect power, first by asking:

m) What would you encourage Tamika to request, in order to have a better chance of convincing the coach that she has improved?

Several of my students responded very quickly in the zoom chat to say: more shots*.

* You’re welcome to call this a larger sample size.

n) Now suppose that the coach offers 100 shots for Tamika to show her improvement.  Re-run the simulation analysis.  Is there more, less, or the same amount of overlap in the two distributions?  Is this good news or bad news for Tamika?  Explain.

The simulation results reveal that the larger sample size leads to much less overlap between these two distributions:

This is very good news for Tamika, because this shows that it’s easier to distinguish a 70% shooter from a 60% shooter when she takes 100 shots than with only 25 shots.

o) How many shots must she now make successfully in order to convince the coach?  How does this compare to the percentage of 25 shots that she needs to make in order to be convincing?

The following graph shows that making 69 or more shots is sufficient to convince the coach that she has improved from a 60% shooter:

Recall that with 25 shots, Tamika had to make 20 of them to be convincing, so the percentage that she needs to make has decreased from 80% to 69% with the increase in sample size.

p) What is the (approximate) probability that Tamika will be able to convince the coach of her improvement, based on a sample of 100 shots?  How has this changed from the earlier case in which she could only take 25 shots?

This output shows that she has about a 63% chance of convincing the coach now:

This probability is more than three times larger than the previous case with only 25 shots.
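To see the effect of sample size more generally, the sketch below computes the cutoff and the exact power for a few sample sizes.  Only 25 and 100 come from the activity; the values 50 and 200 are added here purely for illustration, as are the helper names.

```python
from math import comb

def upper_tail(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def rejection_cutoff(n, p0, alpha):
    """Smallest k with P(X >= k | p0) < alpha."""
    return next(k for k in range(n + 2) if upper_tail(n, k, p0) < alpha)

P_OLD, P_NEW, ALPHA = 0.6, 0.7, 0.05

results = {}
for n in (25, 50, 100, 200):
    cutoff = rejection_cutoff(n, P_OLD, ALPHA)
    results[n] = (cutoff, upper_tail(n, cutoff, P_NEW))
    print(f"n = {n:3d}: need {cutoff} makes ({cutoff / n:.0%}), "
          f"power = {results[n][1]:.3f}")
```

As the number of shots grows, the required percentage of makes moves down toward 60% and the power climbs.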

q) What else could Tamika ask the coach to change about his decision process, in order to have a better chance to convince him of her improvement?

This one is much harder for students to suggest than sample size, but someone eventually proposes changing the 5% cut-off value, the significance level.  Making that value larger means that the coach requires weaker evidence to be convinced, which increases Tamika’s chances of convincing him.

r) Change the coach’s significance level from 0.05 to 0.10.  How does this change the number of shots that Tamika must make to convince the coach of her improvement?  How does this change the probability that she convinces the coach of her improvement?

As shown in the following output, Tamika now only needs to make 67 shots, rather than 69, in order to convince the coach.  The probability that she achieves this as a 70% shooter is approximately 0.777, which is considerably larger than the previous probability of approximately 0.632.
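The trade-off can be tabulated with exact binomial calculations.  In this sketch the levels 0.05 and 0.10 come from the activity, while 0.01 is added only for contrast; the helper names are illustrative.

```python
from math import comb

def upper_tail(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def rejection_cutoff(n, p0, alpha):
    """Smallest k with P(X >= k | p0) < alpha."""
    return next(k for k in range(n + 2) if upper_tail(n, k, p0) < alpha)

N_SHOTS, P_OLD, P_NEW = 100, 0.6, 0.7

by_alpha = {}
for alpha in (0.01, 0.05, 0.10):
    cutoff = rejection_cutoff(N_SHOTS, P_OLD, alpha)
    by_alpha[alpha] = (cutoff, upper_tail(N_SHOTS, cutoff, P_NEW))
    print(f"alpha = {alpha:.2f}: need {cutoff} makes, power = {by_alpha[alpha][1]:.3f}")
```

A more lenient coach (larger significance level) sets a lower bar, so the improved shooter clears it more often.  The price is a greater chance of being fooled by a 60% shooter who gets lucky.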

s) Identify one more factor that affects how likely Tamika is to convince the coach that she has improved.

I sometimes give a hint by suggesting that students think about the applet’s input values. Then someone will suggest that Tamika could try to improve more.

t) Now suppose that Tamika improves so much that she has a 0.8 probability of successfully making a single shot.  How does this change the number of shots that Tamika must make to convince the coach of her improvement?  How does this change the probability that she convinces the coach of her improvement?

I tell students that they do not need to use the applet to answer the first of these questions.  This change does not affect how many shots she must make to convince the coach.  That value depends only on her previous probability of success, not her new and improved probability of success.  But her new success probability will produce even greater separation between the two distributions and will increase her probability of convincing the coach.  The following output reveals that the new probability is approximately 0.999:
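Both points can be checked with a short exact calculation: the cutoff is computed from the old probability alone, and power rises with the size of the improvement.  In this sketch, the alternatives 0.65 and 0.75 are added for illustration.  Which significance level the applet had at this stage is an assumption: a value near 0.999 is reproduced with the level still at 0.10 from part r), whereas with 0.05 the exact power comes out nearer 0.997.

```python
from math import comb

def upper_tail(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def rejection_cutoff(n, p0, alpha):
    """Smallest k with P(X >= k | p0) < alpha; note p_new plays no role here."""
    return next(k for k in range(n + 2) if upper_tail(n, k, p0) < alpha)

N_SHOTS, P_OLD = 100, 0.6

power_at = {}
for alpha in (0.05, 0.10):
    cutoff = rejection_cutoff(N_SHOTS, P_OLD, alpha)
    print(f"alpha = {alpha:.2f}: cutoff = {cutoff} for every alternative")
    for p_new in (0.65, 0.70, 0.75, 0.80):
        power_at[(alpha, p_new)] = upper_tail(N_SHOTS, cutoff, p_new)
        print(f"    improved to {p_new:.2f}: power = {power_at[(alpha, p_new)]:.4f}")
```

A small improvement is hard to detect even with 100 shots, while an improvement to 0.8 is detected almost every time.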

This activity can introduce students to the concept of power without burdening them with too much terminology or too many calculations.  I grant that it’s very convenient to use terms such as significance level and rejection region and power, but I prefer to introduce those after students have first explored the basic ideas.

In the second post in this series, I will discuss some common questions from students, describe some assessment questions that I used for this topic, including some that I now regret, and present extensions of this activity for introducing the concept of power to more mathematically inclined students.

## #79 No blog post today

I’m sorry to report that I do not have a blog post for today.

Hold on – doesn’t that sentence itself, especially when combined with this one, constitute a blog post?!  After all, you can see for yourself that I am posting this on my blog.

But while this may constitute a logical puzzle, or conundrum, or paradox, it’s not really related to teaching introductory statistics, and it does not ask good questions, so I don’t know that you should consider this a legitimate blog post.  I am only sending this out because some of you have been very nice to say that you look forward to hearing from me on Monday mornings, so I thought I should announce and explain that I have no blog post this week.

The explanation is simply that I’ve fallen behind, and today (Mon Jan 4) is the first day of classes for the Winter quarter at Cal Poly, and I’ve needed to spend my time on preparing for class rather than writing a blog post.

I will be teaching the second course in our two-course sequence for Business majors.  The first course included basic ideas of data collection and analysis, a good bit of probability, and statistical inference for one variable.  After a brief review, this second course entails inference for comparing groups, chi-square tests, analysis of variance, simple linear regression, and multiple regression.  For good measure, the course concludes with brief introductions to time series and quality control.

This will be my second experience teaching an entirely online course.  I’m feeling less nervous than I was at the start of the Fall quarter, but I also feel that this time I should set my sights higher than simply surviving the experience.

I have mentioned before that I give lots of quizzes to my students.  I have already posted three quizzes before my class even begins.  One is about a welcome video that I prepared, another about a video that introduces myself to students (which I wrote about in post #63, titled My first video, here), and a third about the course syllabus.  I tell students that these quizzes essentially offer free points, because students are welcome to re-watch the videos and re-read the syllabus after they have seen the quiz questions.

Just for fun, here are some of the questions on these quizzes*:

1. Which of the following is NOT a reasonable description of statistics?  [Options: Confusing people with nonsense; Gaining insights from data; Making decisions under uncertainty]
2. How many quizzes will we have?  [Options: A few; Lots; None]
3. How should you interact with your classmates and instructor?  [Options: With arrogance and condescension; With courtesy and respect; With rudeness and sarcasm]
4. Are you responsible for practicing academic honesty?  [Options: No; Only on odd-numbered dates; Only on Sundays; Yes]
5. What is my teaching philosophy?  [Options: Ask good questions; Insist on perfection; Learn by viewing; Rely on luck]

* I usually report the answers to my quiz questions, but this time I think I’ll trust you to figure them out for yourself.

I apologize in advance in case I need to return to “no blog post today” a few times in the coming weeks, but I will definitely return.  Some topics that I have in mind include introducing students to the concept of power and questioning whether two-sided tests are always preferable to one-sided ones.  I will also write about one of my favorite topics to teach, one that does not often come up in statistics courses but is especially relevant when teaching Business students: adjusting for inflation.

Best wishes for asking good questions in 2021!

P.S. Hey, wait a minute – did I just write a blog post today after all? [Options: Absolutely; Not even close; Well, kind of]

## #78 Two fun (and brief) items

Thanks for reading this, my final blog post for the infamous year 2020.  In contrast to this seemingly unending year*, I will keep this post very brief.  I will conclude this decidedly not-fun year by presenting two fun items that I recently encountered.

* Even though today is December 28th, it feels more like March 303rd.  (I can’t take credit for this joke, but I regret that I cannot remember where I first saw a version of it.)

The first fun item is a quote from American educator Alice Wellington Rollins.  Even though I just learned of this quote within the past two weeks, it’s actually 122 years old, having appeared in the Journal of Education in 1898 (volume 47, issue 22, page 339, available here).  Stacey Hancock brought this to my attention, as she cites this quote in an article about teaching statistics that she has written for the March 2021 issue of the Notices of the American Mathematical Society.  I think this quote offers a valuable perspective on my “ask good questions” refrain:

The test of a good teacher is not how many questions he can ask his pupils that they will answer readily, but how many questions he inspires them to ask him which he finds it hard to answer.

Alice Wellington Rollins, Journal of Education, 1898

The second fun item is a very recent addition to the brilliant* collection of xkcd comics.

* I like to think that I do not use the adjective brilliant casually.  If you have not seen these comics, consider taking a look.  Some particularly clever ones that address statistical ideas include: Convincing (here), Correlation (here), and Significant (here).

When I look back on this horrible but memorable year, I hope to think of this image and advice from a recent xkcd comic (available here):

Many thanks and best wishes to all who have read this blog in 2019 and 2020.  I hope that you have found something that helps you to ask good questions of your students.  My aspiration remains to write essays about teaching introductory statistics that are practical, thought-provoking, and fun*.

* And, perhaps just this once, brief.