
#68 Knowing or guessing?

I told my students at the beginning of our last class session that I was especially excited about class that day for several reasons:

  1. It was a Friday.
  2. We were about to work through our thirteenth handout of the term, a lucky number.
  3. The date was October 16, the median day for the month of October.
  4. We had reached the end of week 5 of our 10-week quarter, the halfway point.
  5. The topic for the day was my favorite probability rule, in fact my favorite mathematical theorem: Bayes’ rule.

The first two examples that we worked through concerned Twitter use and HIV testing, as described in post #10, My favorite theorem, here.

The third and final example of the day presented this scenario: Suppose that Jasmine* has a 70%** chance of knowing (with absolute certainty) the answer to a randomly selected question on a multiple-choice exam.  When she does not know the answer, she guesses randomly among the five options. 

* I had always used the name Brad with this made-up example.  But I realized that I had used an example earlier in the week with the names Allan, Beth, Chuck, Donna, and Ellen, so I thought that I should introduce a bit of diversity into the names of fictional people in my made-up probability examples.  I did a google search for “popular African-American names” and selected Jasmine from the list that appeared.

** When I first rewrote this example with Jasmine in place of Brad, my first thought was to make Jasmine a stronger student than Brad, so I wrote that she has an 80% rather than a 70% chance of knowing the answer for sure.  Later I realized that this change meant that the value 20% was being used for the probability of her guessing and also for the probability of her answering correctly given that she is guessing.  I wanted to avoid this potential confusion, so I changed back to a 70% chance of Jasmine knowing the answer.

a) Before we determine the solution, make a prediction for the probability that Jasmine answers a randomly selected question correctly.  In other words, make a guess for the long-run proportion of questions that she would answer correctly.

I hope students realize that this probability should be a bit larger than 0.7.  I want them to reason that she’s going to answer 70% correctly based on her certain knowledge, and she’s also going to answer some correctly when she’s guessing just from blind luck.  I certainly do not expect students to guess the right answer, but it’s not inconceivable that some could reason that she’ll answer correctly on 20% of the 30% that she guesses on, which is another 6% in addition to the 70% that she knows for sure, so her overall probability of answering correctly is 0.76.

Next I ask students to solve this with a table of counts for a hypothetical population, just as we did for the previous two examples (again see post #10, here).  This time I only provide them with the outline of the table rather than giving row and column labels.  b) Fill in the row and column labels for the table below:

To figure out what labels to put on the rows and columns, I remind students that the observational units here are 100 multiple choice questions, and they need to think about the two variables that we record about each question.  It takes most students a little while to realize that the two variables are: 1) whether Jasmine knows the answer or guesses, and 2) whether Jasmine answers the question correctly or not.  This leads to:

c) Fill in the table of counts for a hypothetical population of 100 questions.  We proceed through the following calculations:

  1. Jasmine will know the answer for 70% of the 100 questions, which is 70.
  2. She will guess at the answer for 100 – 70 = 30 questions.
  3. For the 70 questions where she knows the answer, she will correctly answer all 70, leaving 0 that she will answer incorrectly.
  4. For the 30 questions on which she guesses, we expect her to answer correctly on one-fifth of them, which is 6.
  5. That leaves 30 – 6 = 24 questions for which she will guess and answer incorrectly.
  6. The column totals are therefore 76 correctly answered questions and 24 incorrect.

The completed table is shown here:

d) Use the table to report the probability that Jasmine answers a randomly selected question correctly.  This can be read from the table to be: Pr(correct) = 76/100 = 0.76.
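For readers who like to check such tables with code, here is a minimal Python sketch of the hypothetical counts from steps 1 through 6.  The function name table_of_counts and its parameters are illustrative assumptions, not anything used in class; exact fractions are used only to keep the counts free of rounding noise.

```python
from fractions import Fraction

def table_of_counts(n_questions, p_know, n_options):
    """Build a 2x2 table of hypothetical counts: knowledge state by outcome.

    Hypothetical helper for illustration only.
    """
    know = n_questions * p_know            # questions she knows for sure
    guess = n_questions - know             # questions she guesses on
    guess_correct = guess / n_options      # lucky guesses among those
    return {
        "know": {"correct": know, "incorrect": know - know},
        "guess": {"correct": guess_correct, "incorrect": guess - guess_correct},
    }

table = table_of_counts(n_questions=100, p_know=Fraction(7, 10), n_options=5)

# Column total for correct answers, then the unconditional probability.
total_correct = table["know"]["correct"] + table["guess"]["correct"]
pr_correct = total_correct / 100

for row, cells in table.items():
    print(row, "correct:", cells["correct"], "incorrect:", cells["incorrect"])
print("Pr(correct) =", float(pr_correct))
```

Changing p_know to Fraction(8, 10) reproduces the situation described in the second footnote above, where 20% plays two different roles.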

e) Show how this unconditional probability (of answering a randomly selected question correctly) can be calculated directly as a weighted average of two conditional probabilities.  This is more challenging for students, but I think the idea of weighted average is an important one.  I want them to realize that the two conditional probabilities are: Pr(correct | know) = 1.0 and Pr(correct | guess) = 0.2.  The weights attached to these are the probabilities of knowing and of guessing in the first place: Pr(know) = 0.7 and Pr(guess) = 0.3.  The unconditional probability of answering correctly can be expressed as the weighted average 0.7×1.0 + 0.3×0.2 = 0.76.

f) Determine the conditional probability, given that Jasmine answers a question correctly, that she actually knows the answer.  Some students think at first that this conditional probability should equal one, but they realize their error when they are asked whether it’s possible to answer correctly even when guessing.  Returning to the table, this conditional probability is calculated to be: 70/76 ≈ 0.921. 

g) Interpret this conditional probability in a sentence.  Jasmine actually knows the answer to about 92.1% of all questions that she answers correctly in the long run.

h) Show how to calculate this conditional probability directly from Bayes’ rule.  The calculation is: Pr(know | correct) = [Pr(correct | know) × Pr(know)] / [Pr(correct | know) × Pr(know) + Pr(correct | guess) × Pr(guess)] = (1×0.7) / (1×0.7 + 0.2×0.3) = 0.70 / 0.76 ≈ 0.921.  I try to impress upon students that even though this calculation looks more daunting with the formula than from filling in the table, the calculations are exactly the same, as seen by our ending up with 0.70/0.76 from the formula and 70/76 from the table.  I also emphasize that I think the table provides an effective and understandable way to organize the calculations.
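The Bayes’ rule calculation in part h can also be sketched in a few lines of Python.  This is only an illustration under the probabilities stated in the example; the function name bayes_posterior and the dictionary layout are assumptions made for this sketch.

```python
def bayes_posterior(priors, likelihoods, state):
    """Return Pr(state | evidence) given Pr(state) and Pr(evidence | state).

    Hypothetical helper: priors and likelihoods are dicts keyed by state.
    """
    # Denominator: the unconditional probability of the evidence,
    # i.e. the weighted average of the conditional probabilities.
    pr_evidence = sum(priors[s] * likelihoods[s] for s in priors)
    return priors[state] * likelihoods[state] / pr_evidence

priors = {"know": 0.7, "guess": 0.3}        # Pr(know), Pr(guess)
likelihoods = {"know": 1.0, "guess": 0.2}   # Pr(correct | know), Pr(correct | guess)

posterior = bayes_posterior(priors, likelihoods, "know")
print(round(posterior, 3))
```

The denominator inside the function is the same weighted average as in part e, which is one way to see that the formula and the table perform identical arithmetic.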

Here’s a fun extension of this example: Continue to suppose that Jasmine has a 70% chance of knowing (with absolute certainty) the answer to a randomly selected question on a multiple-choice exam.  But now there’s also a 20% chance that she can eliminate three incorrect options, and then she guesses randomly between the remaining two options, one of which is correct.  For the remaining 10% chance, she has no clue and so guesses randomly among all five options.

i) Before conducting the analysis, do you expect the probability that she answers a question correctly to increase, decrease, or remain the same?  Explain.  Then do the same for the conditional probability that she knows the answer given that she answers correctly.

Most students have correct intuition for the first of these questions: If Jasmine can eliminate some incorrect options, then her probability of answering correctly must increase.  The second question is more challenging to think through: Now that she has a better chance of guessing the correct answer, the conditional probability that she knows the answer, given that she answers correctly, decreases.

j) Modify the table of hypothetical counts to determine these two probabilities.  Students must first realize that the table now needs three rows to account for Jasmine’s three levels of knowledge.  The completed table becomes:

The requested probabilities are: Pr(correct) = 82/100 = 0.82 and Pr(know | correct) = 70/82 ≈ 0.854.  Jasmine’s ability to eliminate some incorrect options has increased her probability of answering correctly by six percentage points from 76% to 82%.  But our degree of belief that she genuinely knew the answer, given that she answered correctly, has decreased by a bit more than six percentage points, from 92.1% to 85.4%.
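The three-row table for this extension can be rebuilt with a short Python sketch, again for a hypothetical population of 100 questions.  The level names and the variable names are assumptions made for this illustration; the probabilities come straight from the scenario.

```python
from fractions import Fraction

# (knowledge level, share of questions, chance of answering correctly at that level)
levels = [
    ("knows answer", Fraction(70, 100), Fraction(1)),
    ("narrows to two options", Fraction(20, 100), Fraction(1, 2)),
    ("no clue", Fraction(10, 100), Fraction(1, 5)),
]

n_questions = 100
rows = {}
for name, share, p_correct in levels:
    n = n_questions * share  # questions at this knowledge level
    rows[name] = {"correct": n * p_correct, "incorrect": n * (1 - p_correct)}

total_correct = sum(r["correct"] for r in rows.values())
pr_correct = total_correct / n_questions
pr_know_given_correct = rows["knows answer"]["correct"] / total_correct

for name, r in rows.items():
    print(f"{name:>24}: correct={r['correct']}, incorrect={r['incorrect']}")
print(float(pr_correct), round(float(pr_know_given_correct), 3))
```

Adding a fourth knowledge level (say, eliminating only one or two options) would just mean adding another tuple to the list.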

I confess that I did not have time to ask students to work through this extension during Friday’s class.  I may give it as an assignment, or as a practice question for the next exam, or perhaps as a question on the next exam itself.

I have mentioned before that I give lots and lots of quizzes in my courses (see posts #25 and 26, here and here).  This is even more true in my first-ever online course.  I generally assign three handout quizzes and three application quizzes per week.  The handout quiz aims to motivate students to work through the handout, either by attending a live zoom session with me, or on their own, or by watching a video that I prepare for each handout.  The application quiz asks students to apply some of the topics from the handout to a new situation.  I also occasionally assign miscellaneous quizzes.  With regard to Bayes’ rule, I have asked my students to watch a video (here) that presents the idea behind Bayes’ rule in an intuitive and visually appealing way.  I wrote a miscellaneous quiz to motivate students to watch and learn from this video.

The author of this video, Grant Sanderson, argues that the main idea behind Bayes’ rule is that “evidence should not determine beliefs but update them.”  I think the Jasmine example of knowing versus guessing can help students to appreciate this viewpoint.  We start with a prior probability that Jasmine knows the answer to a question, and then we update that belief based on the evidence that she answers a question correctly.  Do I know with absolute certainty that this example helps students to understand Bayes’ rule?  Of course not, but I like the example anyway.  More to the point, the evidence of my students’ reactions and performances on assessments has not persuaded me to update my belief in a pessimistic direction.

#67 Interviews

One of my favorite professional activities has been interviewing statistics teachers and statistics education researchers for the Journal of Statistics Education.  I have conducted 26 such interviews for JSE over the past ten years.  I have been very fortunate to chat with some of the leaders in statistics education from the past few decades, including many people who have inspired me throughout my career.  I encourage you to take a look at the list and follow links (here) to read some of these interviews.

Needless to say, I have endeavored to ask good questions in these interviews.  Asking interview questions is much easier than answering them, so I greatly appreciate the considerable time and thoughtfulness that my interview subjects have invested in these interviews.  I hope that my questions have provided an opportunity to:

1. Illuminate the history of statistics education, both in recent years and a few decades back.  A few examples:

  • Dick Scheaffer describes how the AP Statistics program began. 
  • Mike Shaughnessy talks about how NCTM helped to make statistics more prominent in K-12 education. 
  • Chris Franklin and Joan Garfield discuss how ASA developed its GAISE recommendations for K-12 and introductory college courses. 
  • Jackie Dietz describes the founding of the Journal of Statistics Education.
  • Dennis Pearl explains how CAUSE (Consortium for the Advancement of Undergraduate Statistics Education) came to be.
  • George Cobb describes his thought processes behind his highly influential writings about statistics education.
  • Nick Horton shares information about the process through which ASA developed guidelines for undergraduate programs in statistical science.
  • David Moore, Roxy Peck, Jessica Utts, Ann Watkins, and Dick De Veaux talk about how their successful textbooks for introductory statistics came about.

2. Illustrate different pathways into the field of statistics education.  Many of these folks began their careers with statistics and/or teaching in mind, but others started or took a detour into engineering or physics or psychology or economics.  Some even studied fields such as dance and Russian literature.

3. Indicate a variety of ways to contribute to statistics education.  Some interviewees teach in high schools, others in two-year colleges.  Some teach at liberal arts colleges, others in research universities.  Some specialize in teaching, others in educational research.  All have made important contributions to their students and colleagues.

4. Provide advice about teaching statistics and for pursuing careers in statistics education.  My last question of every interview asks specifically for advice toward those just starting out in their careers.  Many of my other questions throughout the interviews solicit suggestions on a wide variety of issues related to teaching statistics well.

5. Reveal fun personal touches.  I have been delighted that my interviewees have shared fun personal tidbits about their lives and careers.  Once again, a few examples:

  • George Cobb describes his experience as the victim of an attempted robbery, which ended with his parting company on good terms with his would-be assailant.
  • David Moore tells of losing an annual bet for 18 consecutive years, which required him to treat his friend to dinner at a restaurant of the friend’s choosing, anywhere in the world.
  • Ron Wasserstein shares that after he and his wife raised their nine children, they adopted two ten-year-old boys from Haiti.
  • Deb Nolan mentions a dramatic career change that resulted from her abandoning plans for a New Year’s Eve celebration.
  • Joan Garfield reveals that she wrote a memoir/cookbook about her life and love of food.
  • Dennis Pearl mentions a challenge that he offers to his students, which once ended with his delivering a lecture while riding a unicycle.
  • Chris Franklin relates that her favorite way to relax is to keep a personal scorebook at a baseball game.
  • Larry Lesser shares an account of his epic contest on a basketball court with Charles Barkley.

My most recent interview (here) is with Prince Afriyie, a recent Cal Poly colleague of mine who now teaches at the University of Virginia.  Prince is near the beginning of his teaching career as a college professor, and his path has been remarkable.  He started in Ghana, where he was inspired to study mathematics by a teacher whom he referred to as Mr. Silence.  While attending college in Ghana, Prince came to the United States on a summer work program; one of his roles was a paintball target at an amusement park in New Jersey.  Serendipity and initiative enabled Prince to stay in the United States to complete his education, with stops in Kentucky, Indiana, and Pennsylvania on his way to earning a doctorate in statistics.  Throughout his education and now into his own career, Prince has taught and inspired students, as he was first inspired by Mr. Silence in his home country.  Prince supplies many fascinating details about his inspiring journey in the interview.  I also asked Prince for his perspective on the two world-changing events of 2020 – the COVID-19 pandemic and the widespread protests for racial justice.

As I mentioned earlier, I conclude every interview with a request for advice aimed at those just beginning their career in statistics education.  Jessica Utts paid me a very nice compliment when she responded that teachers who read these interviews might benefit from asking themselves some of the more general questions that I ask of my interviewees.  Here are some questions that I often ask, which may lead to productive self-reflection:

  • Which came first – your interest in statistics or your interest in education?
  • What were your career aspirations at age 18?
  • What have you not changed about your teaching of statistics over the years?
  • On what do you pride yourself in your teaching?
  • What do you regard as the most challenging topic for students to learn, and how do you approach this topic?
  • What is your favorite course to teach, and why?
  • In this time of data science, are you optimistic or pessimistic about the future of statistics?
  • What do you predict as the next big thing in statistics education?
  • What advice do you offer for those just beginning their career in statistics education?

You might also think about how you would answer two fanciful questions that I often ask for fun:

  • If time travel were possible, and you could travel to the past or future without influencing the course of events, what point in time would you choose?  Why?
  • If I offer to treat you and three others to dinner anywhere in the world, with the condition that the dinner conversation would focus on statistics education, whom would you invite, and where would you dine?

P.S. If you have a chance to read some of these interviews, I would appreciate hearing your feedback (here) on questions such as:

  • Who would you like me to interview in the near future?
  • What questions would you like me to ask?
  • Would you prefer shorter interviews?
  • Would you prefer to listen to interviews on a podcast?

P.P.S. For those wondering if I graded my exams last week after finally concluding the all-important procrastination step (see post #66, First step of grading exams, here): Thanks for asking, and I happily report that I did.

#66 First step of grading exams

I gave my first exam of the term, my first online exam ever, this past Friday.  As I sat down to grade my students’ responses for the first time in almost sixteen months, I realized that I had almost forgotten the crucial first step of grading exams: Procrastinate!

I have bemoaned the fact that I have so much less time available to concentrate on this blog now that I have returned to full-time teaching, as compared to last year while I was on leave.  So, what better way to procrastinate from my grading task than by engaging in the much more enjoyable activity of writing a blog post? 

What should I write about?  That’s easy: I will tell you a bit about the exam whose grading I am neglecting at this very moment.

Students took this exam through Canvas, our learning management system*.  This is a first for me, as my students in previous years took exams with paper and pencil.  I included a mix of questions that were auto-graded (multiple choice and numerical answer) and free-response questions that I will grade after I finish the all-important first step of procrastinating.  Roughly two-thirds of the points on the exam were auto-graded.  I wrote several different versions of many questions in an effort to discourage cheating.  Students had 90 minutes to complete the exam, and they were welcome to select any continuous 90-minute period of time between 7am and 7pm.  Students were allowed to use their notes.  Topics tested on the exam included basic ideas of designing studies and descriptive statistics.

* In post #63 (My first video, here), I referred to Canvas as a course management system.  Since then I realized that I was using an antiquated term, and I have been looking for an opportunity to show that I know the preferred term is now learning management system.

Some of the questions that I asked on this exam appear below (in italics):

1. Suppose that the observational units in a study are the national parks of the United States.  For each of the following, indicate whether it is a categorical variable, a numerical variable, or not a variable.

  • the area (in square miles) of the national park
  • whether or not the national park is in California
  • the number of national parks that are to the east of the Mississippi River
  • whether there are more national parks to the east of the Mississippi River than to the west of the Mississippi River
  • the number of people who visited the national park in September of 2020

I give my students lots of practice with this kind of question (see post #11, Repeat after me, here), but some continue to struggle with this.  Especially challenging is noticing the ones that are not variables for these observational units (parts c and d).  Each student saw one of four variations on this question.  The observational units in the other versions were patients who visited the emergency room at the local hospital last week, the commercial flights that left the San Luis Obispo airport last month, and customers at a local In-n-Out fast food restaurant on a particular day.  I posed this as a “matching” question in Canvas, where each of the five parts had the same three options available.

2. Suppose that the ten players on basketball team A have an average height of 76 inches, and the ten players on basketball team B have an average height of 80 inches.  Now suppose that one player leaves team A to join team B, and one player leaves team B to join team A.  How would the average heights of the two teams change?  The options that I presented were: No change, Both averages would increase, Both averages would decrease, The average would increase for A and decrease for B, The average would decrease for A and increase for B, It is impossible to say without more information.

The correct option is the last one: It is impossible to say without more information.  My goal here was for students to understand that players’ heights vary on both teams, so we cannot state any conclusions about how the averages would change without knowing more about the heights of the individual players who changed teams. 
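A small Python sketch can make this concrete.  The rosters below are hypothetical heights, invented only so that the team averages come out to 76 and 80 inches; the function name swap_players is likewise an assumption for this illustration.  Depending on which two players trade places, the averages can move in either direction or not at all.

```python
from statistics import mean

def swap_players(team_a, team_b, i, j):
    """Return new rosters after team_a[i] and team_b[j] trade places."""
    new_a = team_a[:i] + [team_b[j]] + team_a[i + 1:]
    new_b = team_b[:j] + [team_a[i]] + team_b[j + 1:]
    return new_a, new_b

# Hypothetical heights in inches (not real data); averages are 76 and 80.
team_a = [70, 72, 74, 75, 76, 76, 77, 78, 80, 82]
team_b = [74, 76, 78, 79, 80, 80, 81, 82, 84, 86]

scenarios = {
    "A's tallest for B's shortest": (9, 0),
    "A's shortest for B's tallest": (0, 9),
    "two 80-inch players": (8, 4),
}
for label, (i, j) in scenarios.items():
    new_a, new_b = swap_players(team_a, team_b, i, j)
    print(f"{label}: A averages {mean(new_a)}, B averages {mean(new_b)}")
```

In the first scenario team A's average falls and team B's rises, in the second the reverse happens, and in the third nothing changes, so no single answer choice other than the last can be right.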

3. San Diego State’s admission rate for Fall 2019 was 34.13%, compared to 28.42% for Cal Poly – SLO’s.  Determine the percentage difference between these admission rates.  In other words, San Diego State’s admission rate was higher than Cal Poly – SLO’s by ___ %.  Enter your answer as a number, with two decimal places of accuracy.  Do not enter the % symbol.

As I mentioned throughout post #28 (A pervasive pet peeve, here), I emphasize how a difference in proportions is not equivalent to a percentage difference.  This question assessed whether students took my emphasis to heart.  Each student answered one of four versions of this question, with different campuses being compared.  I obtained the data on admission rates from the dashboard here.
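The distinction between the two quantities can be sketched in Python with the admission rates from the question.  The function names are assumptions made for this illustration, and the percentage difference is taken relative to Cal Poly's rate, as the wording of the question suggests.

```python
def percentage_point_difference(rate_a, rate_b):
    """Simple difference between two rates that are already percentages."""
    return rate_a - rate_b

def percentage_difference(rate_a, rate_b):
    """How much larger rate_a is than rate_b, as a percentage of rate_b."""
    return (rate_a - rate_b) / rate_b * 100

sdsu, cal_poly = 34.13, 28.42  # admission rates (%) from the exam question

points = percentage_point_difference(sdsu, cal_poly)
percent = percentage_difference(sdsu, cal_poly)
print(round(points, 2))   # difference in percentage points
print(round(percent, 2))  # percentage difference relative to Cal Poly
```

The first number is the difference in percentage points; the second, noticeably larger, is the percentage difference.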

4. A series of questions referred to the following graph from a recent Gallup study (here):

The most challenging question in this series was a very basic one: How many variables are represented in this graph?  The correct answer is 2, race and preference for how much time police spend in the area.  The other options that I presented were 1, 3, 4, and 12.

5. Another series of questions was based on this study (available here): Researchers surveyed 120 students at Saint Lawrence University, a liberal arts college with about 2500 students in upstate New York.  They asked students whether or not they have ever pulled an all-nighter (stayed up all night studying).  Researchers found that students who claimed to have never pulled an all-nighter had an average GPA (grade point average) of 3.1, compared to 2.9 for students who claimed to have pulled an all-nighter.  Some basic questions included identifying the type of study, explanatory variable, and response variable.  These led to a question about whether a cause-and-effect conclusion can legitimately be drawn from this study, with a follow-up free-response question* asking students to explain why or why not.

* Oh dear, I just reminded myself of the grading that I still need to do.  This procrastination step is fun but not entirely guilt-free.

Some other free-response questions waiting for me to grade asked students to:

6. Create a hypothetical example in which IQR = 0 and the mean is greater than the median.  I think this kind of question works well on an online exam.  Different students should give different responses, so I hope this question encourages independent thinking and discourages cheating.  (See post #31, Create your own example, part 1, here, for many more questions of this type.)
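As one illustration of what a valid response could look like, here is a Python sketch that checks a made-up data set.  The values are hypothetical, and the quartiles use the default method of Python's statistics.quantiles, which may differ slightly from the convention in a given textbook; with this many repeated values the choice of method should not matter.

```python
from statistics import mean, median, quantiles

# Hypothetical data: nine identical values and one large one.
data = [5, 5, 5, 5, 5, 5, 5, 5, 5, 50]

q1, _, q3 = quantiles(data, n=4)  # the three quartile cut points
iqr = q3 - q1

print("IQR:", iqr)
print("mean:", mean(data), "median:", median(data))
```

The single large value pulls the mean above the median while leaving both quartiles at the repeated value.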

7. Write a paragraph comparing and contrasting the distribution of money winnings in 2019 on three professional golf tours – men’s, women’s, and senior men’s, as displayed in the boxplots:

I am looking for students to compare center, variability, and shape across the three distributions.  They should also comment on outliers and relate their comments to the context.

8. Describe and explain the oddity concerning which hospital performed better, in terms of patients experiencing a complete recovery, for the data shown in the following tables of counts:

I expect this to be one of the more challenging questions on the exam.  Students need to calculate correct proportions, comment on the oddity that Hospital A does worse overall despite doing better for each condition, and explain that Hospital A sees most of the patients in poor condition, who are less likely to experience a full recovery than those in fair condition.
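The kind of reversal described here can be reproduced with a Python sketch.  The counts below are entirely hypothetical, invented for this illustration, and are not the counts from the exam; they are chosen so that Hospital A treats mostly patients in poor condition.

```python
# Hypothetical counts (recovered, total) by hospital and patient condition.
# These are made-up numbers, NOT the data from the exam question.
counts = {
    "A": {"fair": (18, 20), "poor": (40, 80)},
    "B": {"fair": (68, 80), "poor": (8, 20)},
}

def recovery_rate(recovered, total):
    """Proportion of patients who experienced a complete recovery."""
    return recovered / total

def overall_rate(hospital):
    """Recovery proportion for a hospital, pooling both conditions."""
    recovered = sum(r for r, _ in counts[hospital].values())
    total = sum(t for _, t in counts[hospital].values())
    return recovered / total

for condition in ("fair", "poor"):
    a = recovery_rate(*counts["A"][condition])
    b = recovery_rate(*counts["B"][condition])
    print(f"{condition}: A={a:.2f}, B={b:.2f}")
print(f"overall: A={overall_rate('A'):.2f}, B={overall_rate('B'):.2f}")
```

With these made-up counts, Hospital A has the higher recovery proportion within each condition but the lower proportion overall, because 80 of its 100 patients arrive in poor condition.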

Writing my exam questions in Canvas, and preparing several versions for many questions, took considerably more time than my exam writing in the past.  But of course Canvas has already saved me some time by auto-grading many of the questions.  I should also be pleased that Canvas will add up students’ scores for me, but I always enjoyed that aspect of grading, largely because it was the last part and provided a sense of completion and accomplishment.

Hmm, I probably should not be imagining an upcoming sense of completion and accomplishment while I am still happily immersed in the procrastination step of the exam-grading process.  I must grudgingly accept that it’s time for me to proceed to step two.  If only I could remember what the second step is …

#65 Matching variables to graphs

On Friday of last week I asked my students to engage with an activity in which I presented them with these seven graphs:

I’m sure you’ve noticed that these graphs include no labels or scales on the axes.  But you can still discern some things about these seven distributions even without that crucial information.  I told my students the seven variables whose distributions are displayed in these graphs:

  • (A) point values of letters in the board game Scrabble
  • (B) prices of properties on the Monopoly game board
  • (C) jersey numbers of Cal Poly football players
  • (D) weights of rowers on the U.S. men’s Olympic team
  • (E) blood pressure measurements for a sample of healthy adults
  • (F) quiz percentages for a class of students (quizzes were quite straightforward)
  • (G) annual snowfall amounts for a sample of cities taken from around the U.S.

But I did not tell students which variable goes with which graph.  Instead I asked them to work in groups* with these instructions: Make educated guesses for which variable goes with which graph.  Be prepared to explain the reasoning behind your selections.

* This being the year 2020, the students’ groups were breakout rooms in Zoom.

Before I invited the students to join breakout rooms, I emphasized that it’s perfectly fine if they know nothing about Scrabble or Monopoly or rowing or even snowfall*.  For one thing, that’s why they’re working with a group.  Maybe they know about some of these things and a teammate knows about others.  For another thing, I do not expect every group to match all seven pairs perfectly, and this activity is not graded.

* Most of my students are natives of California, and some have never seen snowfall.

I think you can anticipate the next sentence of this blog post: Please take a few minutes to match up the graphs and variables for yourself before you read on*.

* Don’t worry, I do not expect you to get them all right, and remember – this is not for a grade!

Also before I continue, I want to acknowledge that I adapted this activity from Activity-Based Statistics, a wonderful collection based on an NSF-funded project led by Dick Scheaffer in the 1990s.  This variation is also strongly influenced by Beth Chance’s earlier adaptations of this activity, which included generating the graphs from data collected from her students on various variables.

I only gave my students 5-6 minutes to discuss this in their breakout rooms.  When they came back to the main Zoom session, I asked for a volunteer to suggest one graph/variable pair that they were nearly certain about, maybe even enough to wager tuition money.  The response is always the same: Graph #4 displays the distribution of football players’ jersey numbers.  I said this is a great answer, and it’s also the correct answer, but then I asked: What’s your reasoning for that?  One student pointed out that there are no repeated values, which is important because every player has a distinct jersey number.  Another student noted that there are a lot of dots, which is appropriate because college football teams have a lot of players.

Next I asked for another volunteer to indicate a pairing for which they are quite confident, perhaps enough to wager lunch money.  I received two different answers to this.  In one session, a student offered that graph #1 represents the quiz percentages.  What’s your reasoning for that?  The student argued that quizzes were generally straightforward, so there should be predominantly high scores.  The right side of graph #1 could be quiz percentages in the 80s and 90s, with just a few low values on the left side.

In the other session, a student suggested that graph #2 goes with point values of letters in Scrabble.  What’s your reasoning for that?  The student noticed that the spacing between dots on the graph is very consistent, so the values could very well be integers.  It also makes sense that the leftmost value on the graph could be 1, because many letters are worth just 1 point in Scrabble.  This scale would mean that the large values on the right side of the graph are 8 (for 2 letters) and 10 (also for 2 letters).  Another student even noted that there are 26 dots in graph #2, which matches up with 26 letters in the alphabet.

When I asked for another volunteer, a student suggested that graph #7 corresponds to Monopoly prices.  What’s your reasoning for that?  The student commented that Monopoly properties often come in pairs, and this graph includes many instances of two dots at the same value.  Also, the distance between the dots is mostly uniform, suggesting a common increment between property prices.  I asked about the largest value on this graph, which is separated a good bit from the others, and a student responded that this dot represents Boardwalk.

After those four variables and graphs were matched up, students got much quieter when I asked for another volunteer.  I wish that I had set up a Zoom poll in advance to ask them to express their guesses for the rest, but I did not think of that before class.  Instead I asked for a description of graph #3.  A student said that there are a lot of identical values on the low end, and then a lot of different values through the high end.  When I asked about which variable that pattern of variation might make sense for, a student suggested snowfall amounts.  What’s your reasoning for that?  The student wisely pointed out that I had said that the cities were taken from around the U.S., so that should include cities such as Los Angeles and Miami that see no snow whatsoever.

Then I noted that the only graphs left were #5 and #6, and the only variables remaining were blood pressure measurements and rower weights.  I asked for a student to describe some differences between these graphs to help us decide which is which.  This is a hard question, so I pointed out that the smallest value in graph #6 is considerably smaller than all of the others, and there’s also a cluster of six dots fairly well separated from the rest in graph #6.  One student correctly guessed that graph #6 displays the distribution of rower weights.  What’s your reasoning for that?  The student knew enough about rowing to say that one member of the team calls out the instructions to help the others row in synch, without actually rowing himself.  Why does the team want that person to be very light?  Because he’s adding weight to the boat but not helping to row!

That leaves graph #5 for the blood pressure measurements.  I suggested that graph #5 is fairly unremarkable and that points are clustered near the center more than on the extremes.

You might be wondering why I avoided using the terms skewness, symmetry, and even outlier in my descriptions above.  That’s because I introduced students to these terms at the conclusion of this activity.  Then I asked students to look back over the graphs and: Identify which distributions are skewed to the left, which are skewed to the right, and which are roughly symmetric.  I gave them just three minutes to do this in the same breakout rooms as before.  Some students understandably confused skewed to the left and skewed to the right at first, but they quickly caught on.  We reached a consensus as follows:

  • Skewed to the left: quiz percentages (sharply skewed), rower weights (#1, #6)
  • Skewed to the right: Scrabble points, snowfall amounts (#2, #3)
  • Symmetric (roughly): jersey numbers, blood pressure measurements, Monopoly prices (#4, #5, #7)

I admitted to my students that while I think this activity is very worthwhile, it’s somewhat contrived in that we don’t actually start a data analysis project by making guesses about what information a graph displays.  In practice we know the context of the data that we are studying, and we produce well-labelled graphs that convey the context to others.  Then we examine the graphs to see what insights they provide about the data in context.

With that in mind, I followed the matching activity with a brief example based on the following graph of predicted high temperatures for cities around California, as I found them in my local newspaper (San Luis Obispo Tribune) on July 8, 2012:

I started with some basic questions about reading a histogram, such as what temperatures are contained in the rightmost bin and how many cities had such temperatures on that date.  Then I posed three questions that get to the heart of what this graph reveals:

  • What is the shape of this distribution?
  • What does this shape reveal about high temperatures in California in July?
  • Suggest an explanation for the shape of this distribution, using what you know about the context.

Students responded that the temperature distribution displays a bimodal shape, with one cluster of cities around 65-80 degrees and another cluster from about 90-100 degrees.  This reveals that California has at least two distinct kinds of locations with regard to high temperatures in July. 

For the explanation of this phenomenon, a student suggested that there’s a split between northern California and southern California.  I replied that this was a good observation, but I questioned how this split would produce the two clusters of temperature values that we see in the graph.  The student quickly followed up with a different explanation that is spot-on: California has many cities near the coast and many that are inland.  How would this explain the bimodality in the graph?  The student elaborated that cities near the coast stay fairly cool even in July, while inland and desert cities are extremely hot.

My students and I then worked through three more examples to complete the one-hour session.  Next I showed them the following boxplots of daily high temperatures in February and July of 2019 for four cities*:

* I discuss these data in more detail in post #7, Two dreaded words, part 2, here.

The students went back to their breakout rooms with their task to: Arrange these four cities from smallest to largest in terms of:

  • center of February temperature distributions;
  • center of July temperature distributions;
  • variability of February temperature distributions; and
  • variability of July temperature distributions.

After we discussed their answers and reached a consensus, I then briefly introduced the idea of a log transformation in the context of closing prices of Nasdaq-100 stocks on September 15, 2020:

Finally, we discussed the example of cancer pamphlets’ readability that I described in post #4, Statistics of illumination, part 2, here.

As you can tell, the topic of the class session that I have described here was graphing numerical data.  I think the matching activity set the stage well, providing an opportunity for students to talk with each other about data in a fun way.  I also hope that this activity helped to instill in students a mindset that they should always think about context when examining graphs and analyzing data.

#64 My first week

Many thanks to all who sent encouragement in response to last week’s post (here) about my harrowing experience with creating my first video for my students.  I’m happy to report that my first-ever week of remote teaching went well.  I promise not to turn this blog into a personal diary, but I’d like to share some reflections based on this past week.

I woke up last Monday excited and nervous for the first day of the school year.  That was a good and familiar, even comforting, feeling.  Some unfamiliar feelings followed for the rest of the day.  It was very strange not to leave my house for the first day of school, and it was also weird to realize at the end of the day that I had not changed out of my sweat pants.

I was very glad that many students showed up for my first live Zoom session at 8am on Monday.  I also appreciated that many of them turned their cameras on, so I could see their faces on the screen.  A large majority of my students are beginning their first term at Cal Poly, and they seemed eager to get started.  I was excited that these students were beginning the academic coursework of their college experience with me.

One fun thing is that the very first student to join the Zoom session turned out to have her birthday on that day.  I know this because we worked through the infamous draft lottery example (see post #9, here), so I asked students to find their own birthday’s draft number, and it turned out that this student’s birthday had draft number 1, which meant that she was born on September 14, last Monday.

I have used three different Zoom tools to interact with students:

  1. Breakout rooms provide an opportunity for students to discuss questions with each other.  For example, we used breakout rooms at the beginning of the first session for groups of 5-6 students to introduce themselves to each other.  Then we used the same breakout rooms later for students to discuss possible explanations for the apparent paradox with the famous Berkeley graduate admissions data (see post #3 here).
  2. Polls provide immediate feedback on students’ understanding (see Roxy Peck’s guest post #55 about clicker questions here).  For example, I used polls to ask students to identify variables as categorical or numerical and to indicate whether a number was a parameter or a statistic.
  3. Chat allows students to ask questions of me, and I’ve also asked them to type in responses to some questions in the chat window.  For example, students determined the median draft number for their birth month and typed their finding into the chat.

During Friday’s live Zoom session, we studied ideas related to sampling, and we worked through the Gettysburg Address activity (see post #19, Lincoln and Mandela, part 1, here).  I was apprehensive about how this activity would work remotely, but I was pleasantly surprised that it went smoothly.  I prepared a Google form in advance and pasted a link in the chat window, through which students entered the average word length in their self-selected sample of ten words from the speech.  This allowed me to see their responses in real time and paste the results into an applet (here), so we could examine a dotplot of the distribution of their sample averages.  Because a large majority of the students’ sample averages exceeded the population average of 4.3 letters per word, the resulting graph illustrated sampling bias:

I also created videos for students who could not attend the optional live sessions.  I’m even getting slightly more comfortable with making videos.  But making corrections to the auto-captioning takes a while, perhaps because the software has trouble translating words from my peculiar voice.  Some unfortunate mis-translations of what I have said include:

  • “grandmother” for “parameter”
  • “in America” for “a numerical variable”
  • “selected a tree” for “selected at random”
  • “once upon a time” for “one sample at a time”
  • “sample beans” for “sample means”

I have already given many quizzes to my students, even after just one week.  I give a quiz based on each handout, just to make sure that they were paying attention as they worked through the examples, either in a live session with me or on their own or by watching a video.  I also assign an application quiz for each handout, in which students apply what they have learned to a new context.  I have also asked students to complete several miscellaneous quizzes, for example by answering questions about a Hans Rosling video on visualizing human progress (here) that I asked them to watch.  I regard these quizzes as low-stakes assessments, and I encourage students to work together on them.

I conclude this brief post by offering five take-aways from my first week of remote teaching.  I realize that none of these is the least bit original, and I suspect that none will provide any insights for those who taught remotely in the spring or started remote teaching in the fall earlier than I did.

  1. Remote teaching can be gratifying.  Rather than thinking about how much I would prefer to be in a classroom with my students and down the hall from my colleagues, I hope to concentrate on my happy discovery that interacting with students virtually can be fun.
  2. Remote teaching can be engaging.  I greatly appreciate my students’ being such good sports about answering my questions and participating in activities.  (See Kelly Spoon’s guest post #60, here, for several ideas about connecting with students online.)
  3. Asking good questions is central to helping students learn*, remotely as well as in-person.
  4. Remote teaching requires considerable preparation**.  For me, some of this preparation has involved planning when to use breakout rooms and polls and chat.  Collecting data from students also requires more preparation than simply asking students to put their result on the board.  Writing quizzes also requires entering the questions into the learning management system after crafting the questions in the first place.
  5. Remote teaching is very tiring.***  I have found the combination of having to prepare so extensively, integrate different technologies at the same time, and stare at a screen for many hours per day to be exhausting!

* You did not see this one coming, did you?

** But on the positive side of the ledger, my commute time has been reduced by nearly 100%.

*** Of course, perhaps age is a confounding variable that explains my fatigue.  Never before have I been as old to start a new school year as I am now.

Here’s one more takeaway, one that I regret: I have much less time and thought to devote to this blog than I had last year.  That’s why this post is so brief and perhaps unhelpful.  As always, thanks for reading and bearing with me.

#63 My first video

I recently endured a harrowing, horrifying, humbling, even humiliating experience.  That’s right: I recorded my first video.

My first-ever online teaching experience begins today, September 14*.  In preparation, I thought I’d record a brief video to introduce myself to my students, hoping to begin the process of establishing a bit of a connection even though I’ll probably never meet these students in person.  I wanted the video to be brief, about five minutes or so.  I’ve never followed a script in class, so I did not write a script for the video, hoping that non-scripted spontaneity would make it more appealing.  But I did prepare some PowerPoint slides, partly to remember what I wanted to say, and also so the slides would occupy most of the screen with my face appearing only in a small corner.  I wanted to use Zoom to make the video, just because I like to keep things simple.  I’ve already used Zoom a bit, and I’ll be using Zoom for live sessions with my students this fall.

* This is the same date that was selected first and received draft number 1 in the infamous 1970 draft lottery.  In post #9 (here), I describe a class activity that illustrates statistical thinking by analyzing those lottery results.

So, I entered the room that now serves as my home office, started my computer, opened Zoom, launched a new meeting, shared my screen, put my PowerPoint file in presentation mode, looked into the camera, pressed the record button, and started talking to myself …

I finished about seven-and-a-half minutes later, only 50% beyond my target time of five minutes*.  I waited for Zoom to produce the recording, and then I eagerly pressed the play button.  This is when the experience turned harrowing.

* Post #28 (here) pertains to my pervasive pet-peeve involving student misunderstandings of percentage differences.

I really don’t like watching myself on a screen, but I understand that many people feel this way about themselves, and Zoom use over the past six months has somewhat inured me to this unpleasant feeling.  That wasn’t the harrowing part.

Those of you who know me, or have heard me give presentations, can probably anticipate that I found the horrifying part to be listening to my voice.  For those of you who have never heard me: I have a very unusual and peculiar* speaking voice.  It doesn’t sound nearly as odd to me in real life as it does on a recording.  After listening to just the first few seconds of the Zoom recording, I was overcome by a desire to apologize to everyone who’s ever had to listen to me – students, colleagues, friends, wife, cats, …  I only hope that this is something that you get used to and barely notice after a while.

* Friends use the word distinctive here to spare my feelings.

To be more specific, my voice tends to rise rather than fall at the end of sentences.  This vocal pattern is sometimes referred to as “upspeak.”  This is apparently a serious topic of research inquiry among linguists, and a Google search will provide a lot of information, references, and advice about upspeak.  My favorite anecdote about this phenomenon is that novelist Richard Russo invested one of his characters with upspeak in his delightful satire of academic life Straight Man.  Russo’s main character, the reluctant chairman of a college’s English department, describes the speaking voice of the department secretary as follows: Most of Rachel’s statements sound like questions.  Her inability to let her voice fall is related to her own terrible insecurity and lack of self-esteem.  To emphasize this aspect of her speaking voice, Russo uses a question mark at the end of Rachel’s sentences throughout the book?*

* Yes, I used that punctuation on purpose to demonstrate Russo’s technique.

In case you’re wondering whether I’m exaggerating about my own upspeaking, I’ll point out that during conference and workshop presentations, I often ask those in attendance to guess where I’m from.  Just asking the question is usually good for a laugh, as people realize that I am acknowledging my unusual vocal inflections, and they’re often curious to know the answer.  Common guesses include Ireland, Scotland, Scandinavia, Canada, and the upper Midwest.  None of those is correct*.  I believe that my peculiar voice is more of an individual quirk than a regional dialect.

* I will reveal the answer later in this post.

After I overcame my revulsion at hearing my own voice enough to get back to work on my first video, I made and discarded several attempts due to mis-speakings and awkward pauses and the like.  Then as I went through the fifth take, I thought I had a keeper.  I successfully avoided the mis-speaking and pauses.  I was saying what I wanted to say in a reasonable manner.  As I got to the end, I was almost looking forward to playing it back to confirm that this would be the final take, the one to be posted for my students.  It probably would have been, except for one flaw: I realized to my horror that I had been sharing and recording the wrong screen!  I was sharing and recording my laptop screen rather than my monitor screen*, which was the one with the PowerPoint presentation!

* I’ve actually used just a laptop for the past 20 years until recently.  Seeing that I would need to teach online in the fall, my wife very kindly bought me a new monitor a few months ago.  As this story reveals, I’m still getting used to it.

A few takes later, I again thought I had a keeper, and I was certain that I had shared and recorded the correct screen this time.  I was feeling very proud of myself, downright excited as I got to the last slide, in which I thanked students for taking the time to watch my first video.  But then …  My brain completely froze, and I couldn’t find the button to stop the recording!  I don’t know whether the Zoom control bar was hidden behind the PowerPoint presentation or behind some other application or what, but I flailed about for a full 30 seconds, muttering to myself (and, of course, to the microphone) the whole time.  I know this should be no big deal; it can’t be hard to edit out those last 30 seconds, but I didn’t know how to do that*!

* Now I wish that I had kept all of these outtakes.  But I didn’t realize at the time that there would be so many, or that the experience would make such an impact on me that I would write a full, self-indulgent blog post about it.

I know that none of this was Zoom’s fault, but at this point I decided to learn the basics and record the next few takes with Screencast-o-matic.  These actually went fairly well, and it only took a few more takes to end up with the final version that I posted for my students.  Altogether, I spent many, many hours making a 7.5-minute video.

Just for fun, let me show you some of the slides from my first video presentation.  I start by telling students where I’m from and pointing out that I slowly ventured a bit farther from home as I went to college and then graduate school and then my first teaching position:

I also wanted to let students know that while I am a very experienced teacher of statistics, I am a complete novice when it comes to teaching online courses:

To reveal a more personal side, I told students about some of my hobbies, along with some photos:

I have mentioned before (see posts #25 and #26 here and here) that I give lots of quizzes to my students.  I plan to do that again with my online course this fall.  In fact, I suspect that very frequent quizzes will be all the more useful in an online setting for helping to keep students on task, indicating what they should be learning, and providing them with feedback on their progress.  I even decided to give them a quiz based on my self-introduction video.  This is an auto-graded, multiple-choice quiz administered in our course management system Canvas.  I expect this quiz to provide students with easy points to earn, because all of the answers appear in the video, and they can re-watch the video after they see the quiz questions.  Here are the questions:

  1. In which state did I live for the first 39 years of my life? [Options: Arizona, California, Hawaii, Mississippi, Pennsylvania]
  2. How many states have I been in? [Options: 1, 13, 47, 50]
  3. What kind of pets have I had? [Options: Birds, Cats, Dogs, Fish, Snakes]
  4. Which of the following is NOT the name of one of my pets? [Options: Cosette, Eponine, Punxsutawney Phil, Puti]
  5. What is the name of my fantasy sports teams? [Options: Cache Cows, Domestic Shorthairs, Markov Fielders, Netminders, Sun Cats]
  6. For how many years have I been at Cal Poly? [Options: 2, 19, 31, 58]
  7. How much experience do I have with online teaching? [Options: None, A little, A lot]
  8. What was my primary project while on leave from Cal Poly for the past academic year? [Options: Playing online games, Proving mathematical theorems, Reading mystery novels, Starting a business, Writing a blog]
  9. What is my teaching philosophy? [Options: Ask good questions, Insist on perfection, Learn by viewing, Rely on luck]
  10. Am I funny? [Option: Well I try to be but I may not succeed often]

So, how did you do?  The correct answers are: Pennsylvania, 47 (all but Arkansas, Mississippi, North Dakota), Cats, Punxsutawney Phil, Domestic Shorthairs, 19, None, Writing a blog, Ask good questions, Well I try to be but I may not succeed often.

P.S. If you would like to watch my first video for yourself, please bear in mind my warning about the peculiarity of my speaking voice.  But if that does not dissuade you, the video can be found here.

#62 Moral of a silly old joke

I have always liked this silly old joke, which I first heard decades ago:

A man takes his dog to see a talent scout, proudly claiming that his dog can talk.  Of course, the talent scout is very skeptical. To convince her, the man asks the dog: What’s on top of a house?  The dog eagerly responds: “Roof, roof!”  The unimpressed talent scout rolls her eyes and tells the man to leave.  The man seizes a second chance and asks the dog: How does sandpaper feel?  The dog gleefully responds: “Rough, rough!”  The scout gets out of her chair and moves to escort the man out of her office.  Begging for one last chance, the man asks the dog: Who was the greatest baseball player of all time?  The dog enthusiastically responds: “Ruth, Ruth!”  The fed-up talent scout removes the man and dog from her office.  Out in the hallway, looking up at the man with a confused and crestfallen expression on his face, the dog says: “DiMaggio?”

Part of why I like this joke is that “DiMaggio?” strikes me as the perfect punch line.  I have seen versions of the joke in which the dog says: “Maybe I should have said DiMaggio?”  I don’t think that’s as funny as the single-word response.  I also don’t think the joke would work nearly as well with Mays* or Aaron or Williams or Trout as the punch line, because those names are so much easier to pronounce than DiMaggio**.

* Joe Posnanski, from whom I have copied this footnoting technique that he calls a Pos-terisk, ranks Willie Mays as the only baseball player better than Babe Ruth (here).

** A name that works nearly as well is Clemente.  Having grown up in western Pennsylvania in the 1960s and 1970s, my favorite baseball player will always be Roberto Clemente.

What in the world does this have to do with teaching statistics, which is the whole point of this blog?!

Please forgive me, as I’m a bit out of practice with writing blog posts*.  Now I will try to connect this silly old joke to the whole point of this blog.

* I again thank the nine guest bloggers who contributed posts during my hiatus in July and August.  If you missed any of these posts, please check them out from the list here.

Please consider: What is the moral of this joke?  Let me rephrase that: What did the man do wrong?  Or, to put this in a more positive light: What should the man have done differently?

I’ll give you a hint, as I often do with my own students: The answer that I’m fishing for contains three words.  Want another hint?  Those three words contain a total of 16 letters.  One more hint? The first word has the fewest letters (3), and the last word has the most letters (9).

All right, I’ve dragged this on long enough.  I suspect that you’ve figured out what I think the moral of this silly old joke is.  In order to achieve his (and his dog’s) lifelong dream, all the man needed to do was: Ask good questions.

That’s where the man messed up, right?  His obvious mistake was asking questions for which the answers correspond so well with sounds that an ordinary dog makes.  The man’s incredibly poor choice of questions prevented the dog from demonstrating his remarkable ability.

I repeat: What does this have to do with teaching statistics?!  I suspect that my moral is abundantly clear at this point, but please allow me to summarize:

  • To help our students learn, we need to ask good questions. 
  • To enable our students to demonstrate what they can do, we need to ask good questions. 
  • To empower our students to achieve their potential, we need to ask good questions.

I said in my very first post (see question #8 here) that these three words capture whatever wisdom I may have to offer for teachers of statistics: Ask good questions.  I tried to provide many specific examples over the next 51 posts (here).  That is the whole point of this blog.  I think that’s how we teachers should focus most of our time, effort, and creativity.  Whenever I start to forget this, for example when I momentarily succumb to the temptation to believe that it’s more important to master intricacies of Canvas or Zoom or PowerPoint or Camtasia or Flipgrid or Discord or LockDown Browser or GitHub or even R, I remember the moral of a silly old joke.

P.S. My professional leave for the 2019-2020 academic year has come to an end, and I am preparing to return to my full-time teaching role*.  I’m hoping to find time to resume writing weekly blog posts, because I greatly enjoy this and hope that these essays have some value.  But I won’t have nearly as much time to devote to blogging for the next nine months, so I’ll need to make the essays shorter or fewer.  Please stick around, and we’ll see how it goes.  For the month of September, I ask for your indulgence as I write some short and unusual blog posts that are less directly applicable to teaching statistics than my typical essays.  As always, thanks very much for reading!

* Our fall classes at Cal Poly will begin on Monday, September 14.  I’ll be teaching online for the first time in my 30+-year career.  Wish me luck!

P.P.S. Thanks to Julie Clark for providing a photo of her dog Tukey. As far as I know, this Tukey cannot talk, but I would not bet against him being able to draw boxplots.

#61 Text as data

This guest post has been contributed by Dennis Sun.  You can contact him at

Dennis Sun is a colleague of mine in the Statistics Department at Cal Poly. He teaches courses in our undergraduate program in data science* as well as statistics. Dennis also works part-time as a data scientist for Google. Dennis is a terrific and creative teacher with many thought-provoking ideas. I am very glad that he agreed to write this guest post about one aspect of his introductory course in data science that distinguishes it from most introductory courses in statistics.

* My other department colleague who has taught for our data science program is Hunter Glanz, who has teamed with Jo Hardin and Nick Horton to write a blog about teaching data science (here).

I teach an “Introduction to Data Science” class at Cal Poly for statistics and computer science majors. Students in my class are typically sophomores who have at least one statistics course and one computer science course under their belt. In other words, my students arrive in my class with some idea of what statistics can do and the programming chops to execute those ideas. However, many of them have never written code to analyze data. My course tries to bring these two strands of their education together.

Of course, many statisticians write code to analyze data. What makes data science different? In my opinion, one of the most important aspects is the variety of data. Most statistics textbooks start by assuming that the data is already in tabular form, where each row is an observation and each column is a variable. However, data in the real world comes in all shapes and sizes. For example, an audio file of someone speaking is data. So is a photograph or the text of a book. These types of data are not in the ready-made tabular form that is often assumed in statistics textbooks. In my experience, there is too much overhead involved to teach students how to work with audio or image data in an introductory course, so most of my non-standard data examples come from the world of textual data.

I like to surprise students with my first example of textual data: Dr. Seuss books. Observations in this “dataset” include:

  1. “I am Sam. I am Sam. Sam I am….”
  2. “One fish, two fish, red fish, blue fish….”
  3. “Every Who down in Whoville liked Christmas a lot….”

and so on. To analyze this data using techniques they learned in statistics class, it first must be converted into tabular form. But how?

One simple approach is a bag of words. In the bag of words representation, each row is a book (or, more generally, a “document”), and each column is a word (or, more generally, a “term”). Each entry in the table is a frequency representing the number of times a term appears in a document. This table, called the “term-frequency matrix,” is illustrated below:

The resulting table is very wide, with many more columns than rows and most entries equal to 0. Can we use this representation of the data to figure out which documents are most similar? This sparks a class discussion about how and why a data scientist would do this.
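To make the bag of words representation concrete, here is a minimal sketch in Python that builds a small term-frequency matrix from two of the excerpts quoted above. The `tokenize` helper and the variable names are assumptions made for illustration, not the code used in the course, and a real analysis would read in the full text of each book.

```python
import re
from collections import Counter

# Short excerpts quoted in the post; a real analysis would use the full texts.
documents = [
    "I am Sam. I am Sam. Sam I am.",
    "One fish, two fish, red fish, blue fish.",
]

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

# One Counter per document: term -> number of occurrences.
counts = [Counter(tokenize(doc)) for doc in documents]

# The columns are every term that appears in at least one document.
vocabulary = sorted(set().union(*counts))

# Each row is a document; each entry is a term frequency (0 if absent).
term_frequency = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in term_frequency:
    print(row)
```

Even with two short documents, most entries are 0, which hints at how wide and sparse the table becomes with whole books. Note that lowercasing turns "I" into "i"; whether to fold case is a design choice that depends on the application.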

How might we quantify how similar two documents are? Students usually first propose calculating some variation of Euclidean distance. If x_i represents the vector of counts in document i, then the Euclidean distance between two documents i and j is defined as:

$$d(x_i, x_j) = \sqrt{\sum_k (x_{ik} - x_{jk})^2}$$

where x_{ik} denotes the count of term k in document i.

This is just the formula for the distance between two points that students learn in their algebra class (and is essentially the Pythagorean theorem), but the formula is intimidating to some students, so I try to explain what is going on using pictures. If we think of x_i and x_j as vectors, then d(x_i, x_j) measures the distance between the tips of the arrows.

For example, suppose that the two documents are:

  1. “I am Sam. I am Sam. Sam I am.”
  2. “Why do I like to hop, hop, hop? I do not know. Go ask your Pop.”

and the words of interest are “Sam” and “I.” Then the two vectors are x_1 = (3,3) and x_2 = (0,2), because the first document contains 3 of each word, and the second includes no “Sam”s and two “I”s.  These two vectors, and the distance between them, are shown here:
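As a numerical check on this example, the distance between the two vectors can be computed directly. This is only a sketch; the function name `euclidean_distance` is chosen here for illustration.

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x1 = (3, 3)  # counts of ("Sam", "I") in the first document
x2 = (0, 2)  # counts of ("Sam", "I") in the second document

d = euclidean_distance(x1, x2)  # sqrt(3^2 + 1^2) = sqrt(10), about 3.16
print(d)
```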

At this point, a student will usually observe that the frequencies scale in proportion to the length of the document. For example, the following documents are qualitatively similar:

  1. “I am Sam.”
  2. “I am Sam. I am Sam. Sam I am.”

yet their vectors are not particularly close, since one vector is three times the length of the other:

How could we fix this problem?  There are several ways. Some students propose making the vectors the same length before comparing them, while others suggest measuring the angles between the vectors. What I like about this discussion is that students are essentially invoking ideas from linear algebra without realizing it or using any of the jargon. In fact, many of my students have not taken linear algebra yet at this point in their education. It is helpful for them to see vectors, norms, and dot products in a concrete application, where they arise naturally.
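Both student proposals can be tried on the “I am Sam” example with a few lines of Python. The sketch below uses the same two words of interest (“Sam” and “I”) and assumes that neither vector is all zeros; the helper names are made up for illustration.

```python
import math

def norm(x):
    """Length of the vector x."""
    return math.sqrt(sum(a * a for a in x))

def normalize(x):
    """Rescale x to have length 1 (assumes x is not all zeros)."""
    length = norm(x)
    return tuple(a / length for a in x)

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y, computed from the dot product."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (norm(x) * norm(y))

short_doc = (1, 1)  # counts of ("Sam", "I") in "I am Sam."
long_doc = (3, 3)   # counts in "I am Sam. I am Sam. Sam I am."

raw = euclidean_distance(short_doc, long_doc)  # sqrt(8), about 2.83
rescaled = euclidean_distance(normalize(short_doc), normalize(long_doc))
cosine = cosine_similarity(short_doc, long_doc)
print(raw, rescaled, cosine)
```

After rescaling, the two vectors coincide (up to rounding), so the distance is essentially 0, and the cosine of the angle between them is essentially 1. Either fix captures the idea that the two documents are qualitatively the same.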

Why would anyone want to know how similar two documents are? Students usually see that such a system could be used to recommend books: “If you liked this, you might also like….”* Students also suggest that it might be used to cluster documents into groups**. However, rarely does anyone suggest the application that I assign as a lab.

* This is called a “recommender system” in commercial applications.

** Indeed, a method of clustering called “hierarchical clustering” is based on distances between observations.

We can use similarity between documents to resolve authorship disputes. The most celebrated example concerns the Federalist Papers, first analyzed by statisticians Frederick Mosteller and David Wallace in the early 1960s (see here). Yes, even though the term “data science” has only become popular in the last 10 years, many of the ideas and methods are not new, dating back over 50 years. However, whereas Mosteller and Wallace did quite a bit of probability modeling, our approach is simpler and more direct.

The Federalist Papers are a collection of 85 essays penned by three Founding Fathers (Alexander Hamilton, John Jay, and James Madison) to drum up support for the new U.S. Constitution.* However, the essays were published under a pseudonym “Publius.” The authors of 70 of the essays have since been conclusively identified, but there are still 15 papers whose authorship is disputed.

* When I first started using this example in my class, few students were familiar with the Federalist Papers. However, the situation has greatly improved with the immense popularity of the musical Hamilton.

I give my students the texts of all 85 Federalist papers (here), along with the authors of the 70 undisputed essays:

Their task is to determine, for each of the 15 disputed essays, the most similar undisputed essays. The known authorships of these essays are then used to “vote” on the authorship of the disputed essay.

After writing some boilerplate code to read in and clean up the texts of the 85 papers, we split each document into a list of words and count up the number of times each word appears in each document. My students would implement this in the programming language Python, which is a general-purpose language that is particularly convenient for text processing, but the task could be carried out in any language, including R.
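The counting step might look like the sketch below. The tokenization rule (lowercase everything, keep runs of letters and straight apostrophes) is an assumption of mine, and the lab's actual clean-up code may differ.

```python
import re
from collections import Counter

def count_words(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

counts = count_words("I am Sam. I am Sam. Sam I am.")
print(counts["sam"], counts["i"], counts["am"])
```

A `Counter` behaves like a dictionary that returns 0 for words it has never seen, which is convenient when a word appears in one document but not another.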

Rare context-specific words, like “trembling,” are less likely to be a marker of a writer’s style than general words like “which” or “as.” We restrict to the 30 most common words. We also normalize the vectors to be the same length so that distances are invariant to the length of the document. We end up with a table like the following:

Now, let’s look at one of the disputed papers: Federalist Paper #18. We calculate the Euclidean distance between this document and every other document:

Of course, the paper that is most similar to Paper #18 is … itself. But the next few papers should give us some useful information. Let’s grab the authors of these most similar papers:

Although the second closest paper, Paper #19, is also disputed (which is why its author is given as the missing value NaN), the third closest paper was definitively written by Madison. If we look at the 3 closest papers with known authorship, 2 were written by Madison. This suggests attributing Paper #18 to Madison.
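The steps from restricting to common words through the final vote can be sketched in plain Python. This is an illustration under stated assumptions, not the lab's actual code: the word counts and paper labels below are invented, the vocabulary is cut to 2 words instead of 30 to keep the example small, and the common words are chosen by total count across all documents.

```python
import math
from collections import Counter

def top_words(doc_counts, n=30):
    """Return the n most frequent words across all documents."""
    total = Counter()
    for counts in doc_counts.values():
        total.update(counts)
    return [word for word, _ in total.most_common(n)]

def unit_vector(counts, vocabulary):
    """Counts of the vocabulary words, rescaled to length 1."""
    vec = [counts.get(word, 0) for word in vocabulary]
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length else vec

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def predict_author(target, vectors, authors, k=3):
    """Vote among the k closest papers whose author is known.

    Papers with author None (disputed) and the target itself are skipped.
    If the vote is tied, the author of the nearer paper wins.
    """
    candidates = sorted(
        (euclidean(vectors[target], vec), paper)
        for paper, vec in vectors.items()
        if paper != target and authors[paper] is not None
    )
    votes = Counter(authors[paper] for _, paper in candidates[:k])
    return votes.most_common(1)[0][0]

# Hypothetical word counts; these are NOT the real Federalist Papers data.
doc_counts = {
    "disputed": Counter({"which": 6, "as": 8, "trembling": 1}),
    "p1": Counter({"which": 3, "as": 4}),
    "p2": Counter({"which": 7, "as": 9}),
    "p3": Counter({"which": 9, "as": 2}),
    "p4": Counter({"which": 10, "as": 1}),
}
authors = {"disputed": None, "p1": "Madison", "p2": "Madison",
           "p3": "Hamilton", "p4": "Jay"}

vocabulary = top_words(doc_counts, n=2)
vectors = {p: unit_vector(c, vocabulary) for p, c in doc_counts.items()}
prediction = predict_author("disputed", vectors, authors)
print(vocabulary, prediction)
```

In this toy data the rare word “trembling” is dropped from the vocabulary, and the three closest papers with known authors give two votes for Madison and one for Hamilton.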

What the students just did is machine learning (training a K=3-nearest neighbors classifier on the 70 undisputed essays to predict the authorship of Paper #18), although we do not use any of that terminology. I find that students rarely have trouble understanding conceptually what needs to be done in this concrete problem, even if they struggle to grasp more abstract machine learning ideas such as training and test sets. Thus, I have started using this lab as a teaser for machine learning, which we study later in the course.

Next I ask students: How could you validate whether these predictions are any good? Of course, we have no way of knowing who actually wrote the disputed Federalist Papers, so any validation method has to be based on the 70 papers whose authorship is known.

After a few iterations, students come up with some variant of the following: for each of these 70 papers, we can find the 3 closest papers among the other 69 papers. Then, we can validate the prediction using these 3 closest papers against the known author of the paper, producing a table like the following:

In machine learning, this table is known as a “confusion matrix.” From the confusion matrix, we try to answer questions like:

  1. How accurate is this method overall?
  2. How accurate is this method for predicting documents written by Madison?

Most students assess the method overall by calculating the percentage of correct (or incorrect) predictions, obtaining an accuracy of 67/70 ≈ 96%.

However, I usually get two different answers to the second question:

  • The method predicted 15 documents to be written by Madison, but only 13 were. So the “accuracy for predicting Madison” is 13/15 ≈ 87%.
  • Madison actually wrote 14 of the documents, of which 13 were identified correctly. So the “accuracy for predicting Madison” is 13/14 ≈ 93%.

Which answer is right? Of course, both are perfectly valid answers to the question. These two different interpretations of the question are called “precision” and “recall” in machine learning, and both are important considerations.
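The three calculations can be reproduced from the counts quoted above. In this sketch I collapse Hamilton and Jay into a single “Other” label, which is my simplification and not the layout of the original table; it loses nothing here because the 2 false alarms and 1 miss for Madison account for all 3 errors among the 70 papers.

```python
def accuracy(pairs):
    """Fraction of (actual, predicted) pairs that agree."""
    return sum(actual == predicted for actual, predicted in pairs) / len(pairs)

def precision(pairs, label):
    """Of the papers predicted as `label`, the fraction that truly are."""
    actuals = [actual for actual, predicted in pairs if predicted == label]
    return actuals.count(label) / len(actuals)

def recall(pairs, label):
    """Of the papers truly written by `label`, the fraction predicted so."""
    predictions = [predicted for actual, predicted in pairs if actual == label]
    return predictions.count(label) / len(predictions)

# (actual, predicted) pairs rebuilt from the counts in the text.
pairs = (
    [("Madison", "Madison")] * 13
    + [("Other", "Madison")] * 2
    + [("Madison", "Other")] * 1
    + [("Other", "Other")] * 54
)

print(f"accuracy:  {accuracy(pairs):.0%}")
print(f"precision: {precision(pairs, 'Madison'):.0%}")
print(f"recall:    {recall(pairs, 'Madison'):.0%}")
```

The first student answer (13/15) is the precision for Madison, and the second (13/14) is the recall.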

One common mistake that students make is that they will include paper i itself as one of the three closest papers to paper i. They realize immediately why this is wrong when this is pointed out. If we think of our validation process as an exam, it is like giving away the answer key on an exam! This provides an opportunity to discuss ideas such as overfitting and cross-validation, again at an intuitive level, without using jargon.*

* The approach of finding the closest papers among the other 69 papers is formally known as “leave-one-out cross validation.”
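The validation loop described above can be sketched as follows. The vectors and author labels are made up for illustration, and the function names are my own; the key detail is that each paper is excluded from its own list of neighbors.

```python
import math
from collections import Counter

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_vote(target, vectors, authors, k=3):
    """Predict the author of `target` from its k nearest OTHER papers."""
    others = sorted(
        (euclidean(vectors[target], vectors[p]), p)
        for p in vectors
        if p != target  # leave the paper itself out
    )
    votes = Counter(authors[p] for _, p in others[:k])
    return votes.most_common(1)[0][0]

def leave_one_out(vectors, authors, k=3):
    """Count (actual, predicted) pairs over all papers with known authors."""
    confusion = Counter()
    for paper in vectors:
        predicted = nearest_vote(paper, vectors, authors, k)
        confusion[(authors[paper], predicted)] += 1
    return confusion

# Hypothetical two-word vectors for six papers by two made-up authors.
vectors = {
    "a1": (1.00, 0.00), "a2": (0.90, 0.10), "a3": (0.95, 0.05),
    "b1": (0.00, 1.00), "b2": (0.10, 0.90), "b3": (0.05, 0.95),
}
authors = {"a1": "Author A", "a2": "Author A", "a3": "Author A",
           "b1": "Author B", "b2": "Author B", "b3": "Author B"}

confusion = leave_one_out(vectors, authors)
for (actual, predicted), count in sorted(confusion.items()):
    print(f"actual={actual}, predicted={predicted}: {count}")
```

Removing the `p != target` condition reproduces the common mistake: every paper then finds itself at distance zero and contributes a vote for its own author.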

I have several more labs in my data science class involving textual data. For example, I have students verify Zipf’s Law (learn about this from the video here) for different documents. A student favorite, which I adapted from my colleague Brian Granger (follow him on twitter here) is the “Song Lyrics Generator” lab, where students scrape song lyrics from their favorite artist from the web, train a Markov chain on the lyrics, and use the Markov chain to generate new songs by that artist. One of my students even wrote a Medium post (here) about this lab.

Although I am not an expert in natural language processing, I use textual data often in my data science class, because it is both rich and concrete. It has just enough complexity to stretch students’ imaginations about what data is and can do, but not so much that it is overwhelming to students with limited programming experience. The Federalist Papers lab in particular intersects with many technical aspects of data science, including linear algebra and machine learning, but the concreteness of the task allows us to discuss key ideas (such as vector norms and cross-validation) at an intuitive level, without using jargon. It also touches upon non-technical aspects of data science, including the emphasis on prediction (note the conspicuous absence of probability in this blog post) and the need for computing (the texts are long enough that the term frequencies are not feasible to count by hand). For students who know a bit of programming, this provides them with an end-to-end example of how to use data to solve real problems.

This guest post has been contributed by Dennis Sun.  You can contact him at

#60 Reaching students online

This guest post has been contributed by Kelly Spoon.  You can contact her at

Kelly Spoon teaches statistics in San Diego, at a two-year college (San Diego Mesa College) and for an AP Statistics class (Torah High School of San Diego).  I met Kelly through twitter (@KellyMSpoon), where she shares lots of ideas about teaching statistics and mathematics, and at the AMATYC conference in Milwaukee last fall.  Kelly has since hosted me to give a workshop for her colleagues in San Diego and to conduct a review session for her AP Statistics students via zoom.  Kelly is very passionate about teaching statistics, dedicated to helping all students succeed, and knowledgeable about content and pedagogy.  I am very glad that she agreed to contribute this guest blog post on the very timely topic of teaching statistics (and reaching students) online*.

* Speaking of timely, my first taste of online teaching will begin three weeks from today.

When Allan asked if I would write a guest blog post, I didn’t hesitate to email back with an emphatic yes. Not only because I owe him for presenting to faculty at my college AND doing a review for my AP Statistics students, but because I’m always excited to share my passion for teaching statistics.

Then the actual writing started, and I immediately regretted this decision. There’s just too much to share in such a short space. In the end, I wrote an entirely too long blog post for which Allan suggested some minor edits to fit a theme of fearlessness. I asked myself: What does it mean to teach fearlessly?

To me, the broadest definition is a willingness to thoughtfully try new things – whether tools, policies, assessments, or formats. And at this point, most of us fit that definition by the circumstances of distance learning that have been thrust upon us. Now that I’m a week into a new completely online semester, my previous draft felt like it was missing what most of us want to know right now: How do we teach statistics online?

After having a successful first week of the new fall term that mostly gave me energy rather than leaving me feeling drained (as most of last spring’s emergency remote classes did), I thought I’d share some insights as to how I made that first week work for me. To keep with the theme of this blog, these insights are presented as answers to questions that you might ask yourself as you’re designing your online statistics course. I hope these questions are generic enough to stand the test of time to remain relevant when we’re back in a classroom.

1. Cultivating curiosity

Knowing where you want to end up (your desired outcomes) is crucial when designing a course or individual lesson, but the starting point is sometimes overlooked. As you think about your course, whether you’re meeting in person, on Zoom, or you don’t have scheduled meetings, ask yourself: Does my lesson plan make students want to learn more?

This is where Allan’s blog comes in handy. He has many great examples of good questions that truly spark curiosity, often without requiring a deep understanding of the subject matter to start. However, simply including good questions in a lecture allows students to opt out and wait for the professor or another student to do the thinking for them. Simulation-based inference and the many awesome applets that exist in that same vein are one great way to build curiosity for theory-based inference. Regardless of class modality, one of my favorite tools for sparking curiosity is the activity builder in Desmos.

If you haven’t tried out the Desmos Activity Builder (here), you’re missing out. This one tool can answer questions such as: How do I do activities if I’m teaching online? What if I want to assign activities as homework? What if I don’t want to buy Hershey’s Kisses to make confidence intervals for the proportion that land on their base? The Desmos activity builder allows you to add math, graphs, tables, video, images, text to slides for students to work through. You can have students input math, graphs, tables, text, answer multiple choice, multiple selection, reorder selections, even doodle on a graph or image. That was quite the list. See the image below for a visual of all the things you can add to an activity in Desmos:

On the instructor end, you can see exactly where students are (so it’s great to use if you’re meeting with students at a particular time which we all now know is called synchronous) – I use this to pause the activity and debrief when most students have reached a particular point or nudge those students who seem to be stalled. You can also see student work in real-time and provide them feedback directly in the activity. And many activities have been designed to combine data from across the entire class, allowing you to recreate some favorite in-person activities in an online space.

Here are a few Desmos activities that I’ve created, used, or plan to use to build curiosity:

a) Reading Graphs (here)

This activity was inspired by a workshop on culturally responsive teaching. These graphs and questions appear in my lecture notes before we discuss displays for data. Typically, I have students work in groups of four to answer all of the questions for their graph. Then we do a numbered-head protocol (they number themselves 1-4, and I use a random number generator on the projector to choose a victim to report out) to debrief the activity.  I show them that they already know most everything in that section of the lecture notes, with the added bonus of being able to bring in topical graphs*, including ones on social justice issues. For my asynchronous classes, students go through this activity on their own but can see other student responses once they share. For my synchronous class, I occasionally “pause” the activity to discuss some of the responses to a particular graph.  For instance, the following bar chart of children in poor families leads to so many more questions than answers: What defines a family as poor? Are the observational units the children or the families? Does it matter? What if the parents have different education levels? Where are the other 8%?!

* Please ignore the titanic mosaic plot; I really haven’t found better.

b) Skew the Script – Lesson 1.1 (here)

I just found this activity, despite being a longtime fangirl of @AnkerMath on twitter. Skew the Script (here) has a great curriculum created for AP Statistics with student and instructor resources that strive to make the content relevant. It focuses on using real-world examples and creating equity-driven lessons. This particular exercise has students analyze and describe the problems with a lot of bad graphs. I plan on starting off the 2nd week with this one! I’ll tweet how it goes.

c) Does Beyoncé Write Her Own Songs? (here)

This activity is taken entirely from StatsMedic (here) and adapted for Desmos by Richard Hung (find him on twitter here). StatsMedic is built on a framework of “experience first, formalize later” (EFFL), so their entire curriculum – which they provide for free on their site – is inherently designed to build curiosity. For this particular activity, I’ve edited it a bit to bring in some Allan-inspired questions, like identifying observational units and variables (see post #11, Repeat after me, here). This activity is a variation of Allan’s Gettysburg Address activity (see post #19, Lincoln and Mandela, part 1, here) or the Random Rectangles* activity, and is great for building understanding of sampling bias, random sampling, and sampling distributions.

* I first did the Random Rectangles activity in a workshop conducted by Roxy Peck; it apparently originated in Activity-Based Statistics by Scheaffer et al.

I believe lectures inherently kill curiosity – even a lecture with questions interspersed for this purpose. Students know that eventually you will tell them the answer, and many will sit and wait until someone else does the work. At least in my flipped classroom, these types of activities incentivize my students to go watch those lectures by making them curious enough to want to know more. As a bonus, I can keep referring back to that tangible activity: Remember when you placed your magnet on the class dotplot in the Random Rectangles activity?

2. Building a collaborative and safe learning environment

So, we can present good questions or well-designed activities to ignite that sense of wonder in our students, but we also need the students to feel connected to each other and to us as educators, especially in an online environment. That brings me to my next question: Am I providing opportunities for students to connect with and learn from one another?

In a traditional classroom, these opportunities may happen organically. Students may chat before class or set up study groups, even if our classes don’t explicitly carve out time for collaboration. In an online class, these moments need to be constructed and provided for students. 

Using Google Slides with breakout rooms in Zoom is my go-to for collaboration between students in an online environment. For those of you unfamiliar with Google Slides, they are essentially Google’s version of PowerPoint. The bonus is that you can get a shareable link that allows anyone to edit the slides – even if they don’t have a Google account! They just have to click the link, and then they are editing simultaneously. My typical setup is to create a slide for each group within one shared presentation. The slides contain the instructions about what the students should add to the slide to complete the activity. Here are a few of the activities I’ve already used in class:

a) Personality Coordinates

This activity is an ice-breaker – before you roll your eyes, let me finish! – where students put their names on four points and then have to work together to label the X and Y axes. I personally can tolerate this particular ice-breaker because it serves as a needed review of a coordinate plane that I can reference again when we start looking at scatterplots. You can read more about this activity where I originally found out about it on Dan Meyer’s blog (here).

In the image below, you’ll see the circles representing students on slides of the presentation, and the highlighted areas are what students are working on. Slides make it easy at a glance to check that students are making progress and let you know which groups you should check in on. There’s even a comment feature so you can provide quick feedback without being too intrusive. If you want to know more about how I ran this activity, check out this twitter thread (here), where I provide the links to the slide deck and instructions I presented before putting students in breakout rooms.

b) Sampling IRL

This particular activity is a discussion board in my fully online asynchronous class. However, in my synchronous class that meets on Zoom, I saved myself a lot of grading by creating a slide deck in the same vein. On day 1, students worked with a group to fill in a slide with how they would attempt to collect a sample from a given population (students at my college, students at all area community colleges, Starbucks customers, adults in our city).

Based on timing, the second half of this activity happened on the following day, which also allowed me to reformat the slides and add new questions. On Day 2, I moved each breakout room to a new slide and they had to answer two questions:

  1. Could you follow the sampling scheme that the group laid out? If not, what is unclear?
  2. Are there any groups of people who might be left out based on their sampling scheme? Who are they? What type of people from the population will be under/over represented?

In this particular example, I didn’t reinvent anything; I just took an existing prompt and turned it into a collaborative activity by having students answer these questions in groups. And again, the added bonus was that I only needed to grade 8 slides as opposed to 32 discussion posts!

I have loved using this type of activity in my classes. Previously I did a lot of similar activities in face-to-face classes utilizing giant post-its or just the whiteboards around class. I do like that Google Slides allows these contributions to be saved so that we can come back to them. Here are some things I’ve found that help this run smoothly:

  • Provide roles for the breakout rooms – students don’t have to use them, but it sets expectations. You can see my slide with roles below:
  • Emphasize that someone must share their screen in the breakout rooms. I say this at least three times before opening breakout rooms and then broadcast it to all breakout rooms a few minutes in.
  • Aim for twenty minutes as the sweet spot in terms of length.
  • Monitor progress on the slides, and use the comments to give quick feedback.
  • Join each breakout room to check that all members are contributing.
  • Make your instructions the background image, so students don’t accidentally delete the stuff they need.
  • Know how to access version history, in case a student deletes a slide or encounters an equally devastating problem.
  • If you want to run an activity that requires more than one slide per group, use a slide as a landing page (shared as view only) with the edit links to all the group slides:
  • If you’re using Canvas, you can create a Google Cloud assignment (see a video here) to assign the slides to students who missed class. 

3. Connecting with students

Another key to student success is that students feel a connection to you. That brings us to my third question: How can I ensure that students feel connected to me?

For me, it’s about sharing things I’m interested in. I tried a “liquid syllabus” (see here) this semester rather than my traditional welcome letter, but they both contain the same information that is missing from a traditional syllabus:

  • A section about me and my extracurricular interests – which I try to keep varied so that each student might see some small thing we have in common.
  • My teaching philosophy.
  • What a typical week looks like in our course.

I also respond to each student’s introduction in my asynchronous classes. On our first quiz of the semester, I ask all of my students to ask one question about the course, statistics, or myself and tell me something about themselves. I make sure to respond to each and every one. Yes, my first week of classes is a challenge, but I find that connection pays off later. And it never hurts to interject something you’re passionate about into your lectures and examples – much like Allan, most of my examples are about cats (see blog post #16, Questions about cats, here), and my Canvas pages are adorned with cats too.

4. Creating a safe place for mistakes

If you creep on my welcome site for students, you would see this section: “My course is built on the idea that we aren’t perfect the first time we do something and those mistakes are how we improve and learn. Every assignment (with the exception of exams) can be redone after you receive some guidance from me on how to improve it. There are multiple ways for you to demonstrate your understanding – discussions, projects, exams, creative assignments… If you’ve struggled in a traditional classroom, I hope we’ll find a way to get through this together.” This brings me to my next question: How am I demonstrating to students the value in making mistakes?

I don’t know about you, but I have countless students who are frozen into inaction by their fear of failure. Students that I know understood the material will turn in tests with blank pages. When I ask them about it, they profess that they just weren’t sure they were on the right track. I try to demonstrate how useful mistakes are with my policies (see above), as well as in how I highlight student work and respond to students. I try to bring up “good mistakes” in class or in video debriefs, focusing on the thinking that led the student to that answer and all the understanding that their work shows. I hope that by applauding those efforts and working hard to build those connections with and between students, they will be more willing to share their thinking without fear.*

* This letter from a former student shows that I’m on the right track, but I need to add a question about this to my end-of-semester survey to make sure all students feel this way.

5. Assessing understanding

Online assessments are a tricky beast. It’s nearly impossible to be sure our students are the ones taking our assessments and that they are doing so without some outside aid. I feel like I have to include this section because it’s the most common question I get from faculty – how can I make sure my students aren’t cheating? Short answer, you can’t. So here’s the question to ask yourself: Are exams the best way to assess student knowledge?

Consider projects or other tasks where students can demonstrate that they understand the course content. Projects have the added bonus of letting students see how statistics is actually used to answer questions, relevant to what they are interested in, and connected to the other courses they are taking. I personally do a variation on the ASA Project Competition (here), where students can either submit a written report or record a presentation.

I still have exams, too. I’ve just lessened their weight so that students don’t have any real incentive to cheat. And I have embraced open-ended questions. For years, I avoided these types of questions because they were harder to grade and truly required students to have better understanding and communication skills than the same question cleverly written as a multiple-choice question. On my latest exam, here’s one of the options for a particular question pool:

Many colleges were scrambling to provide resources for students with the switch to remote learning. They surveyed students by reaching out via the students’ listed email addresses to see what resources they would need to continue to attend classes in the switch to online. Do you believe this is a good survey technique? Explain why or why not. What are some issues that may arise from this survey option?

Four years of reading the AP Statistics exam has trained me not to fear reading free response questions like the one above. Even three years ago, I’d probably be shaking in my boots at the prospect of grading over a hundred free response questions on a given exam. I cannot emphasize enough how useful participating in the AP reading has been for me as an educator. Empowered by that experience, my “complete” student response to the question has four components:

  1. States that the voluntary response method described is not a good technique.
  2. Notes and provides a reason students may not be included in the survey responses – such as they choose not to take it, don’t check their email, or …
  3. Notes that students without resources are less likely to respond to the survey.
  4. Concludes that the schools will underestimate the resources needed as a result of (3).

Much like an AP scoring rubric, students must get component 1 in order to earn any points for the problem. And for full credit, they must include all four components. If you’re looking for some great questions, beyond those that Allan has provided us here over the past year, previous AP Statistics free response questions are a great place to get inspiration as you write assessments and corresponding rubrics*.

* StatsMedic has very helpfully categorized all of these questions by topic here.

6. The Real Question

All of the questions I’ve asked you to reflect on throughout this post come down to a common theme: Am I reaching ALL of my students?

I’m lucky enough to work at a campus that has provided me with data on my classes’ success rates disaggregated by gender, age, and ethnicity. I know what groups I need to work harder to reach. If possible, get these data from your school. If not, have students self-report and then see if you notice any trends throughout the semester/year. If you’re new to the idea of culturally responsive teaching, I strongly recommend Zaretta Hammond’s Culturally Responsive Teaching and the Brain – it’s a great mix of research, practical tips, and reflection.

I hope you found something you can use in your classrooms in this post. Take what works for you, leave what doesn’t. And keep continuously reflecting on your own teaching practices.

Here are Allan’s own words (from post #52, Top thirteen topics, here), because I think they bear repeating: “I know that if I ever feel like I’ve got this teaching thing figured out, it will be time for me to retire, both from teaching and from writing this blog.”

This is my mantra*. Keep reflecting on your choices. Keep trying new things. Keep being fearless. Hopefully along the way, we’ll do better for all of our students.

* Minus the blog part, because I have no idea how he did this for 52 weeks!


#59 Popularity contest

This guest post has been contributed by Anna Fergusson. You can contact her at

Anna Fergusson is a Professional Teaching Fellow in the Department of Statistics at the University of Auckland.  I met Anna at the 2019 Joint Statistical Meetings, where she gave a terrific talk about introducing statistics students to data science, which is the topic of her Ph.D. research.  I admit that part of the appeal of Anna’s presentation was that her activity involved photos of cats.  But more impressive is that Anna described a fascinating activity through which she introduces introductory students to modern computational tools while emphasizing statistical thinking throughout.  I am delighted that Anna agreed to write this guest post about her activity, which also highlights her admirable and effective “sneaky” approach to student learning.  I also encourage you to follow Anna’s blog, which has the not-so-subtle title of Teaching Statistics is Awesome and has become one of my favourites*, here.

* I am using this non-conventional (for Americans) spelling in appreciation for Anna’s becoming my first guest contributor from outside the U.S.

I am thrilled to write this week’s guest post, not just because I get to add another activity to Allan’s examples of “stats with cats” (see post #16 here), but also because I strongly believe in asking good questions to guide students to discover “new-to-them” ideas or methods.

A current focus for my teaching and research is the design of accessible and engaging learning activities that introduce statistics students to new computational ideas or tools.  For these “first exposure” type learning tasks, I use What if..? style questions to encourage curiosity-driven learning. I also use the “changing stuff and seeing what happens” approach for introducing computational concepts, rather than starting the task with formal definitions and examples.

It’s an approach that has been described by both students and teachers as “sneaky,” but I think that it is a pretty good strategy for designing tasks that support the participation of a wide range of students. To pull off this undercover approach, you need a good cover story – something that is engaging, interesting and fun! A really “popular” task I have used to introduce APIs (Application Programming Interfaces) for accessing data involves searching for photos of cats and dogs online. I’ve tried out several versions of this task over the last few years with a range of school-level students and teachers, but this particular version of the task is from the introductory-level university course I’ve designed for students who have not completed Grade 12 mathematics or statistics. The overall question for the exploration is: What is more popular on Pixabay – photos of cats or photos of dogs?

I usually start the activity by asking students: What is your favourite type of animal, cats or dogs? I would like to say that there is a deeper learning point being made here, for example getting students to acknowledge their own personal biases before they attempt to learn from data, but really I ask this question so I can pretend to be offended when more students state that they prefer dogs than cats! And also so I can use this meme:


I then ask students to go to the Pixabay website and explore what they can find out about whether photos of cats or dogs are more popular on this website. The only direction I give students is to make sure they have selected “photos” when they search and to point out that the first row of photos are sponsored ones. I encourage students to work in pairs or small groups for this activity.

While finding pretty adorable photos of cats and dogs, students are familiarising themselves with the website and what data might be available for analysis, which will come in handy later in the task. It also helps that popularity metrics such as likes and views are already familiar to students thanks to social media. I generally give students about five minutes to explore and then ask groups to share with the class what they have learned about the popularity of cat and dog photos, including what their “hunch” is about which animal is more popular on Pixabay.

There are a lot of approaches that students can take to explore and compare popularity, and it’s helpful to have some questions up your sleeve to ask each group as they share what they learned. For example, one approach is to determine how many photos are returned when you search for “cat” and compare this to the number of photos that are returned when you search for “dog”. You can ask students who use this approach: What happens when you search for “cat” compared to “CAT” compared to “cats”? Students may or may not have noticed that their search terms are being “manipulated” in some way by the website.

Another good question is: Were all the photos returned the kind of “cat” that you expected? This can lead into a discussion about how photos are uploaded and given “tags” by the photographer, and whether the website checks whether the tags are appropriate or correct. Most students discover that if you hover over a photo returned in the search query, you can see some metrics associated with the photo, such as its top three tags and the number of likes, favourites and comments the photo has (see an example below).

To encourage students to think about how the photos are ordered in the search results, I ask students: What photos are being shown to you first when you search for “cat”? Can you spot a pattern to the order of the photos? Initially, students might think that it is just the number of likes (the thumbs-up count) that is determining the order, but if they look across the first 20 or so photos, they should notice that the pattern of decreasing like counts as you move “down the rank” doesn’t always hold.

I also prompt discussion about the nature of the “metrics” by asking: What is another reason why one photo might have more likes than another photo? Clearly, you can’t like a photo if you’ve never viewed it! Additionally, some photos may have been on the website for longer than others, and some of these variables require more effort on the part of the “searcher” than others, e.g. viewing a photo versus liking a photo.

This phase of the task works well because students are exploring data, generating questions, and integrating statistical and computational thinking, all without any requirements to perform calculations or write precise statistical statements. However, there is only so much you can learn from the website before needing a way to access more of the data faster than viewing each photo individually. Fortunately, Pixabay offers an API service to access photos and data related to the photos (you can find the documentation about the API here).

Don’t know anything about APIs? Don’t worry, neither do my students, and in keeping with my sneaky approach, we’re not going to jump into the API documentation. Instead, I ask students to pay attention to the URL when they search for different photos. I then use a sequence of questions to guide students towards structuring an API request for a particular search:

  • What do you notice changes about the URL each time you try a new search?
  • Can you change the photos searched for and displayed on the page by changing the URL directly?
  • Can you work out how to search for “dog costume” by changing the URL rather than using the search box?

For example, the screenshot below shows that the URL contains fixed information like “photos” and “search” but the last part changes depending on what you search for:

Through this sequence of questions, students start to notice the structure of the URL, and they also learn just a little bit about URL encoding when they try a search based on two words. For example, a search for “cat costume” will result in (1) cute photos of cats, but also (2) a URL where the spaces have been replaced with “%20”:
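For readers who want to see the same encoding outside the browser, here is a minimal sketch using Python's standard library. It is only an illustration of percent-encoding in general, not a description of how Pixabay builds its URLs.

```python
from urllib.parse import quote, unquote

search_term = "cat costume"

# Percent-encoding replaces the space with %20
encoded = quote(search_term)
print(encoded)

# Decoding recovers the original two-word search term
decoded = unquote(encoded)
print(decoded)
```

Students who try other characters in the search box (for example an ampersand or a question mark) could use the same two functions to check what they see in the URL.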

I then ask students to find a photo of a cat or a dog that they really like and to click on this photo to open its webpage. I then use a sequence of questions to guide students towards structuring an API request for a particular photo:

  • What do you notice about the URL for a specific photo?
  • How is it different from the URL when we were searching for photos?
  • Which part do you think is the ID for the photo?
  • What happens if you delete all the words describing the photo and leave just the ID number?
  • Is there a photo that has an ID based on your birth date?
  • What was the first photo uploaded to the website?
  • How could we randomly select one photo from all the photos on Pixabay?

That last question is a sneaky way to bring in a little bit of discussion about sampling frames, which will be important later in the task if/when we discuss inference.
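One way to make the sampling-frame idea concrete is a small simulation. The sketch below assumes, purely for illustration, that IDs are whole numbers from 1 up to some maximum and that only some of those numbers belong to a photo that still exists. The set of existing IDs and the maximum are made up, and the function name is hypothetical.

```python
import random

def draw_random_photo_id(existing_ids, max_id, rng):
    """Draw IDs uniformly from 1..max_id until one matches an existing photo."""
    attempts = 0
    while True:
        attempts += 1
        candidate = rng.randint(1, max_id)
        if candidate in existing_ids:
            return candidate, attempts

# Made-up frame: IDs run from 1 to 1000, but only a few belong to photos
existing_ids = {3, 57, 195, 420, 777, 893}
rng = random.Random(2020)

photo_id, attempts = draw_random_photo_id(existing_ids, 1000, rng)
print(photo_id, attempts)
```

The number of attempts is a nice prompt for discussion: the list of possible ID numbers is not the same thing as the list of photos.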

Once students have played around with changing the URL to change what is displayed on the webpage, I congratulate them on becoming “URL hackers.” Now it’s time to look more closely at what data about the photo is available on its webpage. I typically ask students to write down all the variables they could “measure” about their chosen photo. Depending on time, we can play a quick round of “Variable Boggle,” where each pair of students tries to describe another variable that no other pair has already described before them.

I then tell the students that the Pixabay API is basically a way to grab data about each photo digitally rather than us copying and pasting the data ourselves into a spreadsheet, and that to get data from the API we have to send a request. I then introduce them to an app that I have developed that allows students to: (1) play around with constructing and testing out Pixabay API requests, and (2) obtain samples of photos as datasets.

The app is available here. Clicking on the top-left button that says “API explorer” takes you to the screen shown below:

The API explorer is set up to show a request for an individual photo/image, and students only need to change the number to match the id of the photo they have selected. When they send the request, they will get data back about their photo as JSON (JavaScript Object Notation). As students have already recorded the data about their photo earlier in the task, they don’t seem to be intimidated by this new data structure. I then ask students to compare what we could view about the photo on its webpage with the data we can access about each photo from the API, asking: What is the same? What is missing? What is new?

For example, a comparison of the information available for a photo on the webpage and the JSON returned for an individual photo reveals that only the first three tags about a photo are provided by the API, that the date the photo was created is not provided, and that a new variable called imageSize is provided by the API:
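For readers who have not met JSON before, the sketch below reads a made-up response in Python. The values and the example.com URL are invented. The field names tags, previewURL and imageSize come from the discussion above, while the surrounding layout (a "hits" list holding one record per photo, plus fields such as likes and views) is my assumption about what a response looks like.

```python
import json

# Made-up response text; every value here is invented for illustration
response_text = """
{
  "total": 1,
  "totalHits": 1,
  "hits": [
    {
      "id": 12345,
      "tags": "cat, kitten, pet",
      "previewURL": "https://example.com/photo/2019/03/07/10/15/cat-12345_150.jpg",
      "imageSize": 2048576,
      "likes": 42,
      "views": 1800
    }
  ]
}
"""

data = json.loads(response_text)

# Each entry in "hits" is one photo; its keys are the available variables
photo = data["hits"][0]
variables = sorted(photo.keys())

# The tags arrive as one comma-separated string, so split them apart
tags = [tag.strip() for tag in photo["tags"].split(",")]

print(variables)
print(tags)
```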

Reminding them of our earlier discussion about how long a photo has been online, I point out that the date the image was uploaded is not directly available from the API (if students have not already identified that this is missing when sharing the similarities and differences between data on the webpage and data from the API). I ask them: Is there another variable about the photo that we could use to estimate how long the photo has been online? Do any of these variables appear to contain date information? Once we’ve narrowed it down to two potential candidates – previewURL and userImageURL – I ask students to compare the dates shown in the URL to the date uploaded on the webpage for the photo. This mini-exploration leads to a discussion that we could use the date from the previewURL to estimate the date the photo was uploaded, and that while the dates don’t always match up, the date from previewURL appears to be a reasonable proxy.
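A rough sketch of this proxy in Python is shown below. It assumes the date sits in the URL path as /YYYY/MM/DD/. The example URL, the reference date and the function name are all made up for illustration.

```python
import re
from datetime import date

def date_from_preview_url(url):
    """Return the first /YYYY/MM/DD/ date found in the URL, or None."""
    match = re.search(r"/(\d{4})/(\d{2})/(\d{2})/", url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)

# Invented URL that follows the assumed pattern
example_url = "https://example.com/photo/2019/03/07/10/15/cat-12345_150.jpg"
uploaded_estimate = date_from_preview_url(example_url)

# Days online relative to a hypothetical "today"
reference_day = date(2020, 10, 19)
days_online = (reference_day - uploaded_estimate).days

print(uploaded_estimate, days_online)
```

Because the URL date is only a proxy, any "days online" value computed this way is an estimate rather than an exact count.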

One of the limitations of the Pixabay API is that you only get a maximum of 500 results for any request. You do have a choice of ordering the results in terms of popularity or date uploaded, and for my app I have chosen to return the results in terms of popularity (hence the title of the activity!). To help students discover this and also a little more about how JSON is structured, we can use the API explorer to search photos based on a keyword. To connect back to our initial search for “cat” or “dog”, I tell students they can change the API request from “id=” to “q=” to search for photos based on a keyword or keywords. I ask them to use the API explorer to search for photos of cats, and to compare the first three results from their API request (q=cat) to the first three results from searching for “cat” on the Pixabay website (see screenshots below).
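The difference between an “id=” request and a “q=” request can be sketched in Python without sending anything over the network. The base address, the "key" parameter and the "order" parameter below reflect my understanding of the Pixabay API and should be checked against its documentation. The key is a placeholder, the photo ID is invented, and build_request is a hypothetical helper, not part of my app.

```python
from urllib.parse import urlencode

# Assumed base address for API requests; check the API documentation
BASE_URL = "https://pixabay.com/api/"

def build_request(api_key, **params):
    """Combine the key and any other parameters into one request URL."""
    query = {"key": api_key, **params}
    return BASE_URL + "?" + urlencode(query)

# Request data about one photo by its ID (the ID is invented)
by_id = build_request("YOUR_API_KEY", id=12345)

# Search by keywords instead, asking for popular results first
by_keyword = build_request("YOUR_API_KEY", q="cat costume", order="popular")

print(by_id)
print(by_keyword)
```

Note that urlencode writes the space in “cat costume” as “+” rather than “%20”. Both are common ways of encoding a space in the query part of a URL.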

Now that we’ve learned a little about how we can use the Pixabay API to access data about photos, it’s time to refocus on our overall question: What is more popular on Pixabay – photos of cats or photos of dogs? To do this, we’ll use another feature of the app that allows students to obtain random samples of the most popular photos. I direct students to use the app to take a random sample of 100 cats and 100 dogs from the most popular photos on Pixabay, and the app then displays all the photos in the sample on the left side of the screen:
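The sampling step itself is simple to sketch. The code below is not the code behind my app. It just shows, with stand-in ID numbers, what taking a random sample of 100 from a list of up to 500 popular results looks like.

```python
import random

# Stand-in for the (up to) 500 most popular results for one search term
popular_cat_ids = list(range(1, 501))

# Fixed seed so the same sample is drawn each time the sketch is run
rng = random.Random(68)

# Sample 100 photos without replacement
sample_ids = rng.sample(popular_cat_ids, 100)

print(len(sample_ids))
print(sample_ids[:5])
```

Sampling without replacement means no photo can appear twice in the sample, and every one of the 500 has the same chance of being chosen, not just the ones shown first.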

The interface is designed to allow for a new categorical variable to be created, based on dragging the photos across the page in two groups (see later for examples of explorations of this nature). For this exploration, we don’t need a new categorical variable because we searched for photos of dogs and cats, and the search term used is one of the variables. To use all the photos under “No group” students need to re-label the “No group” heading to something else like “All.” Clicking the “Show data table” button allows students to see the data about each photo as a rectangular data structure (each row is a different photo):

Clicking the “Get links to data” button gives students a quick way to “jump with the data” into an online tool for exploring the data, as well as the option to download the data as a CSV file. I use this task with students after they have already used a tool like iNZight lite (here) to explore data. This means I can just ask my students to use the data to check their hunch about whether photos of cats or dogs are more popular on Pixabay, and give them time to explore their data with their partner/group. Similar to earlier in the task, after about 10 minutes I ask the different pairs/groups of students to share what they have learned. Most groups make plots comparing likes by the search term, as shown here:

Some students create a new variable, for example the number of likes per days online, and compare this for the cat and dog photos in the sample, as below:
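My students do this in a point-and-click tool, but the same idea can be sketched in a few lines of Python. The six photos below are made up, as are the variable names search_term, likes and days_online.

```python
from statistics import median

# Made-up sample: each dictionary is one photo
photos = [
    {"search_term": "cat", "likes": 120, "days_online": 400},
    {"search_term": "cat", "likes": 50, "days_online": 100},
    {"search_term": "cat", "likes": 90, "days_online": 900},
    {"search_term": "dog", "likes": 200, "days_online": 500},
    {"search_term": "dog", "likes": 30, "days_online": 300},
    {"search_term": "dog", "likes": 80, "days_online": 400},
]

# New variable: likes per day online, grouped by search term
rates = {}
for photo in photos:
    rate = photo["likes"] / photo["days_online"]
    rates.setdefault(photo["search_term"], []).append(rate)

# Compare the groups using the median rate
medians = {term: median(values) for term, values in rates.items()}
print(medians)
```

Dividing by days online is one attempt to adjust for the fact that older photos have had more time to collect likes.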

Depending on where the class is at in terms of learning about sample-to-population inference, we can talk about more formal approaches for comparing the popularity of cat and dog photos on Pixabay. An important aspect of that discussion is that the population is not all photos on Pixabay, but the most popular photos as determined by Pixabay using some sort of algorithm unknown to us.

The activity ends with asking students to carry out their own exploration to compare the popularity of two types of photos on Pixabay. The huge advantage we have with introducing an API as a source of data to students, and providing an app that allows easy access to that API, is that students get to choose what they want to explore. By using an API connected to a photo-sharing website with search capabilities, students also have a way of building familiarity with the data before accessing the data set. Beyond comparisons of popularity, other interesting investigations involve using what is shown in the photo to create a new categorical variable. For example, I’ve had students explore whether most photos of dogs are outside shots (see earlier discussion and screenshot of creating new categorical variables using the popularity contest app). Other interesting research questions from students have included: Are most of the popular Pixabay photos tagged as “cat,” photos of domestic cats?

Often my students form their “hunch” for a research question based on viewing the first 20 or so photos from the website search. Then they are surprised not to find a similar result when taking a random sample of popular photos. I think there’s something nice in this idea of not jumping to conclusions from searches generated by an algorithm designed to give prominence to some photos over others! My students have also written about how the task helps expand their ideas of where they can get data from and makes them more aware of how much data is being collected from them as they interact with websites.

I commented at the beginning of this post that tasks like these have been described by others as “sneaky.” I’ve also been accused of tricking students into learning because I made the activities so much fun. In fact, my students’ enjoyment continues even when I extend this task to introduce them to using R code to interact with Pixabay photos and the API. I say “even” because so many of my students have pre-determined negative views about learning computer programming, so they really are genuinely surprised to find that the experience of “coding with data” can be fun. Especially if you use a “cover story” of creating memes, using Pixabay photos as a sneaky way to learn about arguments for functions!

When we design activities that introduce students to new computational ideas or tools, it’s only natural to make the “new thing” the star of the show. Although the overall learning goal of this task is to introduce students to some new ideas related to APIs, the immersive experience of searching for photos to find out whether cats are more popular than dogs is the real star of every act of this show. By structuring and asking good questions to drive learning rather than focusing on formal definitions initially, I believe a wide range of students are supported to engage with the many statistical and computational ideas that they discover along the way. What else makes this task successfully sneaky? Cats, of course, lots and lots of photos of cats!

This guest post has been contributed by Anna Fergusson. You can contact her at