
#69 More probability questions – correction

I often tell my students that I make mistakes in class on purpose as a teaching strategy, to encourage them to pay close attention, check my work regularly rather than simply copy what I say into their notes, and speak up when they notice something that they question.

This is partially true, but most of the mistakes that I make in class are, of course, genuine ones rather than purposeful.  I admit that I sometimes try to bluff my way through, with tongue firmly planted in cheek, claiming that my mistake had been intentional, an application of that teaching strategy.

Thanks very much to the careful blog reader who spotted a mistake of mine in today’s post.  In a follow-up discussion to the first example, I wrote: If the marginal percentages had been 28% and 43%, then the largest possible value for the intersection percentage would have been 28% + 43% = 71%.  This is not true, because the intersection percentage can never exceed either of the marginal percentages.  With marginal percentages of 28% and 43%, the largest possible value for the intersection percentage would be 28%. 

Perhaps I was thinking of the largest possible percentage for the union of the two events, which would indeed be 28% + 43% = 71%.  Or perhaps I was not thinking much at all when I wrote that sentence.  Or perhaps, just possibly, you might be so kind as to entertain the notion that I made this mistake on purpose, as an example of a teaching strategy, which I am now drawing to your attention?

#69 More probability questions

My students and I have spent the last three weeks studying probability*.  At the end of Friday’s class session, one of the students asked a great question.  Paraphrasing a bit, she asked: We can answer some of these questions by thinking rather than calculating, right?  I was delighted by her question and immediately replied: Yes, absolutely!  I elaborated that some questions call for calculations, so it’s important to know how to use probability rules and tools.  Those questions usually require some thinking as well as calculating.  But other questions ask you to think things through without performing calculations. Let me show you some of the questions that I have asked in this unit on probability**.

* This course is the first of a two-course introductory sequence for business students. 

** Kevin Ross’s guest post (#54, here) provided many examples of probability questions that do not require calculations.

My students saw the following questions on a quiz near the beginning of the probability unit:

1. Suppose that 78% of the students at a particular college have a Facebook account and 43% have a Twitter account.

  • a) Using only this information, what is the largest possible value for the percentage who have both a Facebook account and a Twitter account?  Describe the (unrealistic) situation in which this occurs.
  • b) Using only this information, what is the smallest possible value for the percentage who have both a Facebook account and a Twitter account?  Describe the (unrealistic) situation in which this occurs.

Even though these questions call for a numerical response, and can therefore be auto-graded, they mostly require thinking rather than plugging into a rule.  We had worked through a similar example in class, in which I encouraged students to set up a probability table to think through such questions.  The marginal probabilities given here produce the following table:

For part (a), students need to realize that the percentage of students with both kinds of accounts cannot be larger than the percentage with either account individually.  The largest possible value for that intersection probability is therefore 0.43, so at most 43% of students could have had both kinds of accounts.  If this were not an auto-graded quiz, I would have also asked for a description of this (unrealistic) scenario: that every student with a Twitter account also has a Facebook account.

Part (b) is more challenging.  A reasonable first thought is that the smallest possible probability could be 0.  But then Pr(Facebook or Twitter) would equal 0.78 + 0.43, and 1.21 is certainly not a legitimate probability.  That calculation points to the correct answer: Because Pr(Facebook or Twitter) cannot exceed 1, the smallest possible value for Pr(Facebook and Twitter) is 1.21 – 1 = 0.21.  At least 21% of students must have both kinds of accounts.  This unrealistic scenario requires that every student have a Facebook account or a Twitter account.

Notice that if the two given probabilities had not added up to more than 1, then the smallest possible value for the intersection probability would have been 0%.
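Both bounds can be captured in a short Python sketch (the helper name and code are mine, not part of the quiz):

```python
# Bounds on Pr(A and B) when only the marginal probabilities are known.
def intersection_bounds(p_a, p_b):
    largest = min(p_a, p_b)                # intersection cannot exceed either marginal
    smallest = max(0.0, p_a + p_b - 1.0)   # union cannot exceed 1
    return smallest, largest

print(intersection_bounds(0.78, 0.43))   # smallest ≈ 0.21, largest = 0.43
print(intersection_bounds(0.28, 0.43))   # smallest = 0.0, largest = 0.28
```

The second call illustrates the case where the marginals sum to less than 1, so the smallest possible intersection is 0.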

The remaining three parts of the quiz provided students with a specific value (36%) for the percentage of students with both a Facebook account and a Twitter account and then asked for routine calculations:

  • c) What percentage of students have at least one of these accounts?
  • d) What percentage of students have neither of these accounts?
  • e) What percentage of students have one of these accounts but not both?

These percentages turn out to be 85%, 15%, and 49%, respectively.  The easiest way to determine these is to complete the probability table begun above:
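For readers who want to check these answers with code, here is a minimal sketch of mine, applying the addition and complement rules:

```python
p_fb, p_tw, p_both = 0.78, 0.43, 0.36

p_union = p_fb + p_tw - p_both        # addition rule: at least one account
p_neither = 1 - p_union               # complement rule
p_exactly_one = p_union - p_both      # in the union but not the intersection

print(round(p_union, 2), round(p_neither, 2), round(p_exactly_one, 2))  # 0.85 0.15 0.49
```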

The following questions appear on a practice exam that I gave my students to prepare for this coming Friday’s exam:

2. Suppose that a Cal Poly student is selected at random.  Define the events E = {randomly selected student is an Engineering major} and T = {randomly selected student is taking STAT 321 this term}.  For each of the following pairs of probabilities, indicate which probability is larger, or if the two probabilities are the same value.  You might want to consider the following information: A few thousand students at Cal Poly are Engineering majors. A few dozen students are taking STAT 321 this term.  Less than a handful of current STAT 321 students are not Engineering majors.

  • a) Pr(E), Pr(T)
  • b) Pr(E), Pr(E and T)
  • c) Pr(T), Pr(E or T)
  • d) Pr(E), Pr(E | T)
  • e) Pr(T | E), Pr(E | T)

These questions require only thinking, no calculations.  I purposefully avoided giving specific numbers at the end of this question.

Part (a) is an easy one, because there are a lot more Engineering majors than there are STAT 321 students.  For part (b), students are to realize that (E and T) is a subset of E, so Pr(E) must be larger than Pr(E and T).  Similarly in part (c), T is a subset of (E or T), so Pr(E or T) must be larger than Pr(T).  For part (d), most STAT 321 students are Engineering majors, so Pr(E | T) is larger than Pr(E).  Finally, relatively few Engineering majors take STAT 321 in any one term, so Pr(E | T) is also larger than Pr(T | E).

My students completed a fairly long assignment that asked them to apply the multiplication rule for independent events to calculate various probabilities that a system functions successfully, depending on whether components are connected in series (which requires all components to function successfully) or in parallel (which requires at least one component to function successfully).  The final two parts of this assignment were:

3. Suppose that three components are connected in a system.  Two of the components form a sub-system that is connected in parallel, which means that at least one of these two components must function successfully in order for the sub-system to function successfully.  This sub-system is connected in series with the third component, which means that both the sub-system and the third component must function successfully in order for the entire system to function successfully.  Suppose that the three components function independently and that the probabilities of functioning successfully for the three components are 0.7, 0.8, and 0.9.  Your goal is to connect the system to maximize the probability that the system functions successfully.

  • i) Which two components would you select to form the sub-system, and which would you select to be connected in series with the sub-system?  Explain your choice.
  • j) Determine the probability that the system functions successfully with your choice.  Justify the steps of your calculation with the appropriate probability rules.

The first of these questions can be answered without performing calculations.  Because the component connected in series must function successfully in order for the system to function successfully, that component should be the most reliable one: the one with probability 0.9 of functioning successfully.  The remaining two components, with success probabilities 0.8 and 0.7, should be connected in parallel.

The calculation for part (j) certainly does require applying probability rules correctly.  The probability that this system functions successfully can be written as*: Pr[(C7 or C8) and C9].  The multiplication rule for independent events allows us to write this as: Pr(C7 or C8) × Pr(C9).  Applying the addition rule on the first term gives: [Pr(C7) + Pr(C8) – Pr(C7 and C8)] × Pr(C9).  Then one more application of the multiplication rule gives: [Pr(C7) + Pr(C8) – Pr(C7) × Pr(C8)] × Pr(C9).  Plugging in the probability values gives: [0.7 + 0.8 – 0.7×0.8] × 0.9, which is 0.846. 
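The same chain of rules can be written as a short Python computation (a sketch of mine; the post's C7/C8/C9 notation becomes plain probability values):

```python
def parallel(p, q):
    # At least one of two independent components works:
    # addition rule, with the multiplication rule for the intersection term.
    return p + q - p * q

def series(p, q):
    # Both independent components must work: multiplication rule.
    return p * q

p_system = series(parallel(0.7, 0.8), 0.9)
print(round(p_system, 3))  # 0.846
```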

* I’m hoping that my notation here will be clear without my having to define it.  I consider this laxness on my part a perk of blog writing as opposed to more formal writing.

Notice that a student could have avoided thinking through the answer to (i) by proceeding directly to (j) and calculating probabilities for all possible arrangements of the components.  I do not condone that strategy, but I do encourage students to answer probability questions in multiple ways to check their work.  The other two probabilities (for the system functioning successfully) turn out to be 0.776 if the 0.8 probability component is connected in series and 0.686 if the 0.7 probability component is connected in series.
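That brute-force check can be sketched by enumerating which component goes in series (again, illustrative code of my own, not part of the assignment):

```python
components = [0.7, 0.8, 0.9]

results = {}
for i, in_series in enumerate(components):
    a, b = [c for j, c in enumerate(components) if j != i]
    # parallel sub-system of the other two, then series with this component
    results[in_series] = (a + b - a * b) * in_series

print({k: round(v, 3) for k, v in results.items()})
# {0.7: 0.686, 0.8: 0.776, 0.9: 0.846}
```

As expected, placing the most reliable component in series maximizes the system's probability of functioning successfully.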

Finally, here’s the in-class example that prompted my student’s question at the top of this blog post:

4. Suppose that Zane has a 20% chance of earning a score of 0 and an 80% chance of earning a score of 5 when he takes a quiz.  Suppose also that Zane must choose between two options for calculating an overall quiz score: Option A is to take one quiz and multiply the score by 10, Option B is to take ten (independent) quizzes and add their scores.

  • a) Which option would you encourage Zane to take?  Explain.
  • b) Which option do you suspect has a larger expected value, or do you suspect that the expected values will be the same?
  • c) Use properties of expected value to determine the expected value of his overall score with each option.  Comment on how they compare.
  • d) Which option do you suspect has a larger standard deviation, or do you suspect that the standard deviations will be the same?
  • e) Use properties of variance to determine the standard deviation of his overall score with each option.  Comment on how they compare.
  • f) If Zane’s goal is to maximize his probability of obtaining an overall score of 50 points, which option should he select?  Explain.
  • g) Calculate the probability, for each option, that Zane scores 50 points.  Comment on how they compare.
  • h) The following graphs display the probability distributions of Zane’s overall quiz score with these two options.  Which graph goes with which option?  Explain.

The key idea here is that multiplying a single quiz score by 10 is a much riskier, all-or-nothing proposition than adding scores for 10 independent quizzes.  A secondary goal is for students to learn how to apply rules of expected values and variances to multiples and sums of random variables.

The expected value of Zane’s score on a single quiz is 4.0, and the standard deviation of his score on a single quiz is 2.0.  The expected value of the random variable (10×Z) is the same as for the random variable (Z1 + Z2 + … + Z10), namely 40.0 quiz points.  This means that neither option is better for Zane in terms of long-run average. 

But this certainly does not mean that the two options yield identical distributions of results.  The variance of (10×Z) is 10² × 4.0 = 400, so the standard deviation is 20.0.  The variance for (Z1 + Z2 + … + Z10) is much smaller: 10 × 4.0 = 40, so the standard deviation is approximately 6.325. 

Zane has an 80% chance of obtaining an overall quiz score of 50 with option A, because he simply needs to score a 5 on one quiz.  With option B, he only achieves a perfect overall score of 50 if he earns a 5 on all 10 quizzes, which has probability (0.8)¹⁰ ≈ 0.107.
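The expected values, standard deviations, and perfect-score probabilities above can be verified with a few lines of Python (a sketch, not part of the in-class example):

```python
import math

# Zane's single-quiz distribution: 20% chance of 0, 80% chance of 5
ev = 0.2 * 0 + 0.8 * 5                    # E(Z) = 4.0
var = 0.2 * 0**2 + 0.8 * 5**2 - ev**2     # E(Z^2) - [E(Z)]^2 = 4.0

ev_a, sd_a = 10 * ev, 10 * math.sqrt(var)   # Option A: E(10Z) = 10 E(Z), SD(10Z) = 10 SD(Z)
ev_b, sd_b = 10 * ev, math.sqrt(10 * var)   # Option B: variances add for independent quizzes

p50_a = 0.8          # Option A: one score of 5
p50_b = 0.8 ** 10    # Option B: ten scores of 5 in a row

print(ev_a, ev_b, sd_a, round(sd_b, 3), p50_a, round(p50_b, 3))
```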

The graph on the left above shows the probability distribution for option B, and the graph on the right corresponds to option A.  The graphs reveal the key idea that option A is all-or-nothing, while option B provides more consistency in Zane’s overall quiz score.

The great mathematician Laplace reportedly said that “probability theory is nothing but common sense reduced to calculations.”  I wish I had thought quickly enough on my feet to mention this in response to my student’s comment in class on Friday.  I’ll have to settle for hoping that my probability questions lead students to develop a habit of mind to think clearly and carefully about randomness and uncertainty, along with the ability to perform probability calculations.

#68 Knowing or guessing?

I told my students at the beginning of our last class session that I was especially excited about class that day for several reasons:

  1. It was a Friday.
  2. We were about to work through our thirteenth handout of the term, a lucky number.
  3. The date was October 16, the median day for the month of October.
  4. We had reached the end of week 5 of our 10-week quarter, the halfway point.
  5. The topic for the day was my favorite probability rule, in fact my favorite mathematical theorem: Bayes’ rule.

The first two examples that we worked through concerned twitter use and HIV testing, as described in post #10, My favorite theorem, here.

The third and final example of the day presented this scenario: Suppose that Jasmine* has a 70%** chance of knowing (with absolute certainty) the answer to a randomly selected question on a multiple-choice exam.  When she does not know the answer, she guesses randomly among the five options. 

* I had always used the name Brad with this made-up example.  But I realized that I had used an example earlier in the week with the names Allan, Beth, Chuck, Donna, and Ellen, so I thought that I should introduce a bit of diversity into the names of fictional people in my made-up probability examples.  I did a google search for “popular African-American names” and selected Jasmine from the list that appeared.

** When I first rewrote this example with Jasmine in place of Brad, my first thought was to make Jasmine a stronger student than Brad, so I wrote that she has an 80% rather than a 70% chance of knowing the answer for sure.  Later I realized that this change meant that the value 20% was being used for the probability of her guessing and also for the probability of her answering correctly given that she is guessing.  I wanted to avoid this potential confusion, so I changed back to a 70% chance of Jasmine knowing the answer.

a) Before we determine the solution, make a prediction for the probability that Jasmine answers a randomly selected question correctly.  In other words, make a guess for the long-run proportion of questions that she would answer correctly.

I hope students realize that this probability should be a bit larger than 0.7.  I want them to reason that she’s going to answer 70% correctly based on her certain knowledge, and she’s also going to answer some correctly when she’s guessing just from blind luck.  I certainly do not expect students to guess the right answer, but it’s not inconceivable that some could reason that she’ll answer correctly on 20% of the 30% that she guesses on, which is another 6% in addition to the 70% that she knows for sure, so her overall probability of answering correctly is 0.76.

Next I ask students to solve this with a table of counts for a hypothetical population, just as we did for the previous two examples (again see post #10, here).  This time I only provide them with the outline of the table rather than giving row and column labels.  b) Fill in the row and column labels for the table below:

To figure out what labels to put on the rows and columns, I remind students that the observational units here are 100 multiple choice questions, and they need to think about the two variables that we record about each question.  It takes most students a little while to realize that the two variables are: 1) whether Jasmine knows the answer or guesses, and 2) whether Jasmine answers the question correctly or not.  This leads to:

c) Fill in the table of counts for a hypothetical population of 100 questions.  We proceed through the following calculations:

  1. Jasmine will know the answer for 70% of the 100 questions, which is 70.
  2. She will guess at the answer for 100 – 70 = 30 questions.
  3. For the 70 questions where she knows the answer, she will correctly answer all 70, leaving 0 that she will answer incorrectly.
  4. For the 30 questions on which she guesses, we expect her to answer correctly on one-fifth of them, which is 6.
  5. That leaves 30 – 6 = 24 questions for which she will guess and answer incorrectly.
  6. The column totals are therefore 76 correctly answered questions and 24 incorrect.

The completed table is shown here:

d) Use the table to report the probability that Jasmine answers a randomly selected question correctly.  This can be read from the table as: Pr(correct) = 76/100 = 0.76.
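The hypothetical-count construction in steps 1 through 6 translates directly into code (a small sketch of mine):

```python
total = 100
know = round(0.70 * total)       # questions she knows for certain
guess = total - know             # questions she guesses on
correct_when_know = know         # certain knowledge: all answered correctly
correct_when_guess = guess // 5  # random guess among 5 options: one-fifth correct

correct = correct_when_know + correct_when_guess
print(correct, correct / total)  # 76 questions, Pr(correct) = 0.76
```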

e) Show how this unconditional probability (of answering a randomly selected question correctly) can be calculated directly as a weighted average of two conditional probabilities.  This is more challenging for students, but I think the idea of weighted average is an important one.  I want them to realize that the two conditional probabilities are: Pr(correct | know) = 1.0 and Pr(correct | guess) = 0.2.  The weights attached to these are the probabilities of knowing and of guessing in the first place: Pr(know) = 0.7 and Pr(guess) = 0.3.  The unconditional probability of answering correctly can be expressed as the weighted average 0.7×1.0 + 0.3×0.2 = 0.76.

f) Determine the conditional probability, given that Jasmine answers a question correctly, that she actually knows the answer.  Some students think at first that this conditional probability should equal one, but they realize their error when they are asked whether it’s possible to answer correctly even when guessing.  Returning to the table, this conditional probability is calculated to be: 70/76 ≈ 0.921. 

g) Interpret this conditional probability in a sentence.  Jasmine actually knows the answer to about 92.1% of all questions that she answers correctly in the long run.

h) Show how to calculate this conditional probability directly from Bayes’ rule.  The calculation is: Pr(know | correct) = [Pr(correct | know) × Pr(know)] / [Pr(correct | know) × Pr(know) + Pr(correct | guess) × Pr(guess)] = (1×0.7) / (1×0.7 + 0.2×0.3) = 0.70 / 0.76 ≈ 0.921.  I try to impress upon students that even though this calculation looks more daunting with the formula than from filling in the table, the calculations are exactly the same, as seen by our ending up with 0.70/0.76 from the formula and 70/76 from the table.  I also emphasize that I think the table provides an effective and understandable way to organize the calculations.
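The Bayes' rule calculation can also be expressed as code (a sketch; the variable names are mine):

```python
p_know, p_guess = 0.7, 0.3
p_correct_given_know, p_correct_given_guess = 1.0, 0.2

# Denominator: unconditional Pr(correct), the weighted average from part (e)
p_correct = p_correct_given_know * p_know + p_correct_given_guess * p_guess

# Bayes' rule: Pr(know | correct)
p_know_given_correct = p_correct_given_know * p_know / p_correct

print(round(p_correct, 2), round(p_know_given_correct, 3))  # 0.76 0.921
```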

Here’s a fun extension of this example: Continue to suppose that Jasmine has a 70% chance of knowing (with absolute certainty) the answer to a randomly selected question on a multiple-choice exam.  But now there’s also a 20% chance that she can eliminate three incorrect options, and then she guesses randomly between the remaining two options, one of which is correct.  For the remaining 10% chance, she has no clue and so guesses randomly among all five options.

i) Before conducting the analysis, do you expect the probability that she answers a question correctly to increase, decrease, or remain the same?  Explain.  Then do the same for the conditional probability that she knows the answer given that she answers correctly.

Most students have correct intuition for the first of these questions: If Jasmine can eliminate some incorrect options, then her probability of answering correctly must increase.  The second question is more challenging to think through: Now that she has a better chance of guessing the correct answer, the conditional probability that she knows the answer, given that she answers correctly, decreases.

j) Modify the table of hypothetical counts to determine these two probabilities.  Students must first realize that the table now needs three rows to account for Jasmine’s three levels of knowledge.  The completed table becomes:

The requested probabilities are: Pr(correct) = 82/100 = 0.82 and Pr(know | correct) = 70/82 ≈ 0.854.  Jasmine’s ability to eliminate some incorrect options has increased her probability of answering correctly by six percentage points from 76% to 82%.  But our degree of belief that she genuinely knew the answer, given that she answered correctly, has decreased by a bit more than six percentage points, from 92.1% to 85.4%.
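This extension generalizes to any number of knowledge levels; here is a sketch of mine with the priors and conditional probabilities taken from the scenario above:

```python
# (prior probability, Pr(correct | level)) for each knowledge level
levels = {
    "knows for certain": (0.7, 1.0),
    "eliminates three":  (0.2, 0.5),   # guesses between the 2 remaining options
    "no clue":           (0.1, 0.2),   # guesses among all 5 options
}

p_correct = sum(prior * p for prior, p in levels.values())
# Bayes' rule numerator is 0.7 * 1.0, the "knows for certain" row
p_know_given_correct = levels["knows for certain"][0] * levels["knows for certain"][1] / p_correct

print(round(p_correct, 2), round(p_know_given_correct, 3))  # 0.82 0.854
```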

I confess that I did not have time to ask students to work through this extension during Friday’s class.  I may give it as an assignment, or as a practice question for the next exam, or perhaps as a question on the next exam itself.

I have mentioned before that I give lots and lots of quizzes in my courses (see posts #25 and 26, here and here).  This is even more true in my first-ever online course.  I generally assign three handout quizzes and three application quizzes per week.  The handout quiz aims to motivate students to work through the handout, either by attending a live zoom session with me, or on their own, or by watching a video that I prepare for each handout.  The application quiz asks students to apply some of the topics from the handout to a new situation.  I also occasionally assign miscellaneous quizzes.  With regard to Bayes’ rule, I have asked my students to watch a video (here) that presents the idea behind Bayes’ rule in an intuitive and visually appealing way.  I wrote a miscellaneous quiz to motivate students to watch and learn from this video.

The author of this video, Grant Sanderson, argues that the main idea behind Bayes’ rule is that “evidence should not determine beliefs but update them.”  I think the Jasmine example of knowing versus guessing can help students to appreciate this viewpoint.  We start with a prior probability that Jasmine knows the answer to a question, and then we update that belief based on the evidence that she answers a question correctly.  Do I know with absolute certainty that this example helps students to understand Bayes’ rule?  Of course not, but I like the example anyway.  More to the point, the evidence of my students’ reactions and performances on assessments has not persuaded me to update my belief in a pessimistic direction.

#67 Interviews

One of my favorite professional activities has been interviewing statistics teachers and statistics education researchers for the Journal of Statistics Education.  I have conducted 26 such interviews for JSE over the past ten years.  I have been very fortunate to chat with some of the leaders in statistics education from the past few decades, including many people who have inspired me throughout my career.  I encourage you to take a look at the list and follow links (here) to read some of these interviews.

Needless to say, I have endeavored to ask good questions in these interviews.  Asking interview questions is much easier than answering them, so I greatly appreciate the considerable time and thoughtfulness that my interview subjects have invested in these interviews.  I hope that my questions have provided an opportunity to:

1. Illuminate the history of statistics education, both in years recent and back a few decades.  A few examples:

  • Dick Scheaffer describes how the AP Statistics program began. 
  • Mike Shaughnessy talks about how NCTM helped to make statistics more prominent in K-12 education. 
  • Chris Franklin and Joan Garfield discuss how ASA developed its GAISE recommendations for K-12 and introductory college courses. 
  • Jackie Dietz describes the founding of the Journal of Statistics Education.
  • Dennis Pearl explains how CAUSE (Consortium for the Advancement of Undergraduate Statistics Education) came to be.
  • George Cobb describes his thought processes behind his highly influential writings about statistics education.
  • Nick Horton shares information about the process through which ASA developed guidelines for undergraduate programs in statistical science.
  • David Moore, Roxy Peck, Jessica Utts, Ann Watkins, and Dick De Veaux talk about how their successful textbooks for introductory statistics came about.

2. Illustrate different pathways into the field of statistics education.  Many of these folks began their careers with statistics and/or teaching in mind, but others started or took a detour into engineering or physics or psychology or economics.  Some even studied fields such as dance and Russian literature.

3. Indicate a variety of ways to contribute to statistics education.  Some interviewees teach in high schools, others in two-year colleges.  Some teach at liberal arts colleges, others in research universities.  Some specialize in teaching, others in educational research.  All have made important contributions to their students and colleagues.

4. Provide advice about teaching statistics and for pursuing careers in statistics education.  My last question of every interview asks specifically for advice toward those just starting out in their careers.  Many of my other questions throughout the interviews solicit suggestions on a wide variety of issues related to teaching statistics well.

5. Reveal fun personal touches.  I have been delighted that my interviewees have shared fun personal tidbits about their lives and careers.  Once again, a few examples:

  • George Cobb describes his experience as the victim of an attempted robbery, which ended with his parting company on good terms with his would-be assailant.
  • David Moore tells of losing an annual bet for 18 consecutive years, which required him to treat his friend to dinner at a restaurant of the friend’s choosing, anywhere in the world.
  • Ron Wasserstein shares that after he and his wife raised their nine children, they adopted two ten-year-old boys from Haiti.
  • Deb Nolan mentions a dramatic career change that resulted from her abandoning plans for a New Year’s Eve celebration.
  • Joan Garfield reveals that she wrote a memoir/cookbook about her life and love of food.
  • Dennis Pearl mentions a challenge that he offers to his students, which once ended with his delivering a lecture while riding a unicycle.
  • Chris Franklin relates that her favorite way to relax is to keep a personal scorebook at a baseball game.
  • Larry Lesser shares an account of his epic contest on a basketball court with Charles Barkley.

My most recent interview (here) is with Prince Afriyie, a recent Cal Poly colleague of mine who now teaches at the University of Virginia.  Prince is near the beginning of his teaching career as a college professor, and his path has been remarkable.  He started in Ghana, where he was inspired to study mathematics by a teacher whom he referred to as Mr. Silence.  While attending college in Ghana, Prince came to the United States on a summer work program; one of his roles was a paintball target at an amusement park in New Jersey.  Serendipity and initiative enabled Prince to stay in the United States to complete his education, with stops in Kentucky, Indiana, and Pennsylvania on his way to earning a doctorate in statistics.  Throughout his education and now into his own career, Prince has taught and inspired students, as he was first inspired by Mr. Silence in his home country.  Prince supplies many fascinating details about his inspiring journey in the interview.  I also asked Prince for his perspective on the two world-changing events of 2020 – the COVID-19 pandemic and the widespread protests for racial justice.

As I mentioned earlier, I conclude every interview with a request for advice aimed at those just beginning their career in statistics education.  Jessica Utts paid me a very nice compliment when she responded that teachers who read these interviews might benefit from asking themselves some of the more general questions that I ask of my interviewees.  Here are some questions that I often ask, which may lead to productive self-reflection:

  • Which came first – your interest in statistics or your interest in education?
  • What were your career aspirations at age 18?
  • What have you not changed about your teaching of statistics over the years?
  • On what do you pride yourself in your teaching?
  • What do you regard as the most challenging topic for students to learn, and how do you approach this topic?
  • What is your favorite course to teach, and why?
  • In this time of data science, are you optimistic or pessimistic about the future of statistics?
  • What do you predict as the next big thing in statistics education?
  • What advice do you offer for those just beginning their career in statistics education?

You might also think about how you would answer two fanciful questions that I often ask for fun:

  • If time travel were possible, and you could travel to the past or future without influencing the course of events, what point in time would you choose?  Why?
  • If I offer to treat you and three others to dinner anywhere in the world, with the condition that the dinner conversation would focus on statistics education, whom would you invite, and where would you dine?

P.S. If you have a chance to read some of these interviews, I would appreciate hearing your feedback (here) on questions such as:

  • Who would you like me to interview in the near future?
  • What questions would you like me to ask?
  • Would you prefer shorter interviews?
  • Would you prefer to listen to interviews on a podcast?

P.P.S. For those wondering if I graded my exams last week after finally concluding the all-important procrastination step (see post #66, First step of grading exams, here): Thanks for asking, and I happily report that I did.

#66 First step of grading exams

I gave my first exam of the term, my first online exam ever, this past Friday.  As I sat down to grade my students’ responses for the first time in almost sixteen months, I realized that I had almost forgotten the crucial first step of grading exams: Procrastinate!

I have bemoaned the fact that I have so much less time available to concentrate on this blog now that I have returned to full-time teaching, as compared to last year while I was on leave.  So, what better way to procrastinate from my grading task than by engaging in the much more enjoyable activity of writing a blog post? 

What should I write about?  That’s easy: I will tell you a bit about the exam whose grading I am neglecting at this very moment.

Students took this exam through Canvas, our learning management system*.  This is a first for me, as my students in previous years took exams with paper and pencil.  I included a mix of questions that were auto-graded (multiple choice and numerical answer) and free-response questions that I will grade after I finish the all-important first step of procrastinating.  Roughly two-thirds of the points on the exam were auto-graded.  I wrote several different versions of many questions in an effort to discourage cheating.  Students had 90 minutes to complete the exam, and they were welcome to select any continuous 90-minute period of time between 7am and 7pm.  Students were allowed to use their notes.  Topics tested on the exam included basic ideas of designing studies and descriptive statistics. 

* In post #63 (My first video, here), I referred to Canvas as a course management system.  Since then I realized that I was using an antiquated term, and I have been looking for an opportunity to show that I know the preferred term is now learning management system.

Some of the questions that I asked on this exam appear below (in italics):

1. Suppose that the observational units in a study are the national parks of the United States.  For each of the following, indicate whether it is a categorical variable, a numerical variable, or not a variable.

  a. the area (in square miles) of the national park
  b. whether or not the national park is in California
  c. the number of national parks that are to the east of the Mississippi River
  d. whether there are more national parks to the east of the Mississippi River than to the west of the Mississippi River
  e. the number of people who visited the national park in September of 2020

I give my students lots of practice with this kind of question (see post #11, Repeat after me, here), but some continue to struggle with this.  Especially challenging is noticing the ones that are not variables for these observational units (parts c and d).  Each student saw one of four variations on this question.  The observational units in the other versions were patients who visited the emergency room at the local hospital last week, the commercial flights that left the San Luis Obispo airport last month, and customers at a local In-n-Out fast food restaurant on a particular day.  I posed this as a “matching” question in Canvas, where each of the five parts had the same three options available.

2. Suppose that the ten players on basketball team A have an average height of 76 inches, and the ten players on basketball team B have an average height of 80 inches.  Now suppose that one player leaves team A to join team B, and one player leaves team B to join team A.  How would the average heights of the two teams change?  The options that I presented were: No change, Both averages would increase, Both averages would decrease, The average would increase for A and decrease for B, The average would decrease for A and increase for B, It is impossible to say without more information.

The correct option is the last one: It is impossible to say without more information.  My goal here was for students to understand that players’ heights vary on both teams, so we cannot draw any conclusions about how the averages would change without knowing more about the heights of the individual players who changed teams. 
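To see why, here is a quick numerical sketch in Python.  The rosters below are entirely made up; only the two team averages (76 and 80 inches) match the question.

```python
# Hypothetical rosters (heights in inches, made up for illustration):
# team A averages 76 inches, team B averages 80 inches.
team_a = [68, 84] + [76] * 8   # sum 760, average 76
team_b = [72, 88] + [80] * 8   # sum 800, average 80

def avg(heights):
    return sum(heights) / len(heights)

# Scenario 1: A's shortest player (68) swaps with B's tallest (88).
a1 = avg([84] + [76] * 8 + [88])   # 78.0 -- A's average increases
b1 = avg([72] + [80] * 8 + [68])   # 78.0 -- B's average decreases

# Scenario 2: A's tallest player (84) swaps with B's shortest (72).
a2 = avg([68] + [76] * 8 + [72])   # 74.8 -- A's average decreases
b2 = avg([88] + [80] * 8 + [84])   # 81.2 -- B's average increases

print(a1, b1, a2, b2)
```

The same starting averages lead to opposite changes in the two scenarios, so no single answer can be correct without more information about the individual players.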

3. San Diego State’s admission rate for Fall 2019 was 34.13%, compared to 28.42% for Cal Poly – SLO’s.  Determine the percentage difference between these admission rates.  In other words, San Diego State’s admission rate was higher than Cal Poly – SLO’s by ___ %.  Enter your answer as a number, with two decimal places of accuracy.  Do not enter the % symbol.

As I mentioned throughout post #28 (A pervasive pet peeve, here), I emphasize how a difference in proportions is not equivalent to a percentage difference.  This question assessed whether students took my emphasis to heart.  Each student answered one of four versions of this question, with different campuses being compared.  I obtained the data on admission rates from the dashboard here.
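For what it’s worth, the arithmetic distinction can be sketched in a few lines of Python, using the admission rates quoted above:

```python
sdsu = 34.13     # San Diego State admission rate (percent)
calpoly = 28.42  # Cal Poly - SLO admission rate (percent)

# Difference in percentage points (NOT the answer to the question):
point_diff = sdsu - calpoly              # 5.71 percentage points

# Percentage difference: how much higher, relative to Cal Poly's rate:
pct_diff = (sdsu - calpoly) / calpoly * 100

print(round(point_diff, 2))  # 5.71
print(round(pct_diff, 2))    # 20.09
```

So San Diego State’s rate is 5.71 percentage points higher, but 20.09% higher, than Cal Poly’s; the question asks for the latter.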

4. A series of questions referred to the following graph from a recent Gallup study (here):

The most challenging question in this series was a very basic one: How many variables are represented in this graph?  The correct answer is 2, race and preference for how much time police spend in the area.  The other options that I presented were 1, 3, 4, and 12.

5. Another series of questions was based on this study (available here): Researchers surveyed 120 students at Saint Lawrence University, a liberal arts college with about 2500 students in upstate New York.  They asked students whether or not they have ever pulled an all-nighter (stayed up all night studying).  Researchers found that students who claimed to have never pulled an all-nighter had an average GPA (grade point average) of 3.1, compared to 2.9 for students who claimed to have pulled an all-nighter.  Some basic questions included identifying the type of study, explanatory variable, and response variable.  These led to a question about whether a cause-and-effect conclusion can legitimately be drawn from this study, with a follow-up free-response question* asking students to explain why or why not.

* Oh dear, I just reminded myself of the grading that I still need to do.  This procrastination step is fun but not entirely guilt-free.

Some other free-response questions waiting for me to grade asked students to:

6. Create a hypothetical example in which IQR = 0 and the mean is greater than the median.  I think this kind of question works well on an online exam.  Different students should give different responses, so I hope this question encourages independent thinking and discourages cheating.  (See post #31, Create your own example, part 1, here, for many more questions of this type.)
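One possible response (this example is my own, not an actual student submission) can be checked with Python’s statistics module:

```python
import statistics

# Hypothetical data: seven students score 5 and one scores 100.
data = [5, 5, 5, 5, 5, 5, 5, 100]

q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles (exclusive method)
iqr = q3 - q1

print(iqr)                      # 0.0 -- the middle half of the data is constant
print(statistics.median(data))  # 5.0
print(statistics.mean(data))    # 16.875 -- pulled up by the single outlier
```

The IQR is 0 because the middle half of the data shows no variability at all, while the one large outlier pulls the mean well above the median.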

7. Write a paragraph comparing and contrasting the distribution of money winnings in 2019 on three professional golf tours – men’s, women’s, and senior men’s, as displayed in the boxplots:

I am looking for students to compare center, variability, and shape across the three distributions.  They should also comment on outliers and relate their comments to the context.

8. Describe and explain the oddity concerning which hospital performed better, in terms of patients experiencing a complete recovery, for the data shown in the following tables of counts:

I expect this to be one of the more challenging questions on the exam.  Students need to calculate correct proportions, comment on the oddity that Hospital A does worse overall despite doing better for each condition, and explain that Hospital A sees most of the patients in poor condition, who are less likely to experience a full recovery than those in fair condition.
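The exam’s tables of counts appear only as images above, so the sketch below uses hypothetical counts that produce the same oddity (Simpson’s paradox):

```python
# Hypothetical counts of (recovered, total), by hospital and patient condition.
counts = {
    "A": {"fair": (90, 100), "poor": (270, 900)},
    "B": {"fair": (720, 900), "poor": (20, 100)},
}

def overall(conds):
    recovered = sum(r for r, t in conds.values())
    total = sum(t for r, t in conds.values())
    return recovered / total

# Within each condition, Hospital A has the higher recovery rate ...
for cond in ["fair", "poor"]:
    rate_a = counts["A"][cond][0] / counts["A"][cond][1]
    rate_b = counts["B"][cond][0] / counts["B"][cond][1]
    print(cond, rate_a, rate_b)   # fair: 0.9 vs 0.8; poor: 0.3 vs 0.2

# ... yet A does worse overall, because most of A's patients arrive in
# poor condition, and poor-condition patients recover less often.
print(overall(counts["A"]))  # 0.36
print(overall(counts["B"]))  # 0.74
```

With these made-up counts, Hospital A beats B within both conditions yet trails badly overall, which is exactly the oddity that students must describe and explain.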

Writing my exam questions in Canvas, and preparing several versions for many questions, took considerably more time than my exam writing in the past.  But of course Canvas has already saved me some time by auto-grading many of the questions.  I should also be pleased that Canvas will add up students’ scores for me, but I always enjoyed that aspect of grading, largely because it was the last part and provided a sense of completion and accomplishment.

Hmm, I probably should not be imagining an upcoming sense of completion and accomplishment while I am still happily immersed in the procrastination step of the exam-grading process.  I must grudgingly accept that it’s time for me to proceed to step two.  If only I could remember what the second step is …

#65 Matching variables to graphs

On Friday of last week I asked my students to engage with an activity in which I presented them with these seven graphs:

I’m sure you’ve noticed that these graphs include no labels or scales on the axes.  But you can still discern some things about these seven distributions even without that crucial information.  I told my students the seven variables whose distributions are displayed in these graphs:

  • (A) point values of letters in the board game Scrabble
  • (B) prices of properties on the Monopoly game board
  • (C) jersey numbers of Cal Poly football players
  • (D) weights of rowers on the U.S. men’s Olympic team
  • (E) blood pressure measurements for a sample of healthy adults
  • (F) quiz percentages for a class of students (quizzes were quite straight-forward)
  • (G) annual snowfall amounts for a sample of cities taken from around the U.S.

But I did not tell students which variable goes with which graph.  Instead I asked them to work in groups* with these instructions: Make educated guesses for which variable goes with which graph.  Be prepared to explain the reasoning behind your selections.

* This being the year 2020, the students’ groups were breakout rooms in Zoom.

Before I invited the students to join breakout rooms, I emphasized that it’s perfectly fine if they know nothing about Scrabble or Monopoly or rowing or even snowfall*.  For one thing, that’s why they’re working with a group.  Maybe they know about some of these things and a teammate knows about others.  For another thing, I do not expect every group to match all seven pairs perfectly, and this activity is not graded.

* Most of my students are natives of California, and some have never seen snowfall.

I think you can anticipate the next sentence of this blog post: Please take a few minutes to match up the graphs and variables for yourself before you read on*.

* Don’t worry, I do not expect you to get them all right, and remember – this is not for a grade!

Also before I continue, I want to acknowledge that I adapted this activity from Activity-Based Statistics, a wonderful collection based on an NSF-funded project led by Dick Scheaffer in the 1990s.  This variation is also strongly influenced by Beth Chance’s earlier adaptations of this activity, which included generating the graphs from data collected from her students on various variables.

I only gave my students 5-6 minutes to discuss this in their breakout rooms.  When they came back to the main Zoom session, I asked for a volunteer to suggest one graph/variable pair that they were nearly certain about, maybe even enough to wager tuition money.  The response is always the same: Graph #4 displays the distribution of football players’ jersey numbers.  I said this is a great answer, and it’s also the correct answer, but then I asked: What’s your reasoning for that?  One student pointed out that there are no repeated values, which is important because every player has a distinct jersey number.  Another student noted that there are a lot of dots, which is appropriate because college football teams have a lot of players.

Next I asked for another volunteer to indicate a pairing for which they are quite confident, perhaps enough to wager lunch money.  I received two different answers to this.  In one session, a student offered that graph #1 represents the quiz percentages.  What’s your reasoning for that?  The student argued that quizzes were generally straight-forward, so there should be predominantly high scores.  The right side of graph #1 could be quiz percentages in the 80s and 90s, with just a few low values on the left side. 

In the other session, a student suggested that graph #2 goes with point values of letters in Scrabble.  What’s your reasoning for that?  The student noticed that the spacing between dots on the graph is very consistent, so the values could very well be integers.  It also makes sense that the leftmost value on the graph could be 1, because many letters are worth just 1 point in Scrabble.  This scale would mean that the large values on the right side of the graph are 8 (for 2 letters) and 10 (also for 2 letters).  Another student even noted that there are 26 dots in graph #2, which matches up with 26 letters in the alphabet.

When I asked for another volunteer, a student suggested that graph #7 corresponds to Monopoly prices.  What’s your reasoning for that?  The student commented that Monopoly properties often come in pairs, and this graph includes many instances of two dots at the same value.  Also, the distance between the dots is mostly uniform, suggesting a common increment between property prices.  I asked about the largest value on this graph, which is separated a good bit from the others, and a student responded that this dot represents Boardwalk.

After those four variables and graphs were matched up, students got much quieter when I asked for another volunteer.  I wish that I had set up a Zoom poll in advance to ask them to express their guesses for the rest, but I did not think of that before class.  Instead I asked for a description of graph #3.  A student said that there are a lot of identical values on the low end, and then a lot of different values through the high end.  When I asked about which variable that pattern of variation might make sense for, a student suggested snowfall amounts.  What’s your reasoning for that?  The student wisely pointed out that I had said that the cities were taken from around the U.S., so that should include cities such as Los Angeles and Miami that see no snow whatsoever.

Then I noted that the only graphs left were #5 and #6, and the only variables remaining were blood pressure measurements and rower weights.  I asked for a student to describe some differences between these graphs to help us decide which is which.  This is a hard question, so I pointed out that the smallest value in graph #6 is considerably smaller than all of the others, and there’s also a cluster of six dots fairly well separated from the rest in graph #6.  One student correctly guessed that graph #6 displays the distribution of rower weights.  What’s your reasoning for that?  The student knew enough about rowing to say that one member of the team calls out the instructions to help the others row in sync, without actually rowing himself.  Why does the team want that person to be very light?  Because he’s adding weight to the boat but not helping to row!

That leaves graph #5 for the blood pressure measurements.  I suggested that graph #5 is fairly unremarkable and that points are clustered near the center more than on the extremes.

You might be wondering why I avoided using the terms skewness, symmetry, and even outlier in my descriptions above.  That’s because I introduced students to these terms at the conclusion of this activity.  Then I asked students to look back over the graphs and: Identify which distributions are skewed to the left, which are skewed to the right, and which are roughly symmetric.  I gave them just three minutes to do this in the same breakout rooms as before.  Some students understandably confused skewed to the left and skewed to the right at first, but they quickly caught on.  We reached a consensus as follows:

  • Skewed to the left: quiz percentages (sharply skewed), rower weights (#1, #6)
  • Skewed to the right: Scrabble points, snowfall amounts (#2, #3)
  • Symmetric (roughly): jersey numbers, blood pressure measurements, Monopoly prices (#4, #5, #7)

I admitted to my students that while I think this activity is very worthwhile, it’s somewhat contrived in that we don’t actually start a data analysis project by making guesses about what information a graph displays.  In practice we know the context of the data that we are studying, and we produce well-labelled graphs that convey the context to others.  Then we examine the graphs to see what insights they provide about the data in context.

With that in mind, I followed the matching activity with a brief example based on the following graph of predicted high temperatures for cities around California, as I found them in my local newspaper (San Luis Obispo Tribune) on July 8, 2012:

I started with some basic questions about reading a histogram, such as what temperatures are contained in the rightmost bin and how many cities had such temperatures on that date.  Then I posed three questions that get to the heart of what this graph reveals:

  • What is the shape of this distribution?
  • What does this shape reveal about high temperatures in California in July?
  • Suggest an explanation for the shape of this distribution, using what you know about the context.

Students responded that the temperature distribution displays a bimodal shape, with one cluster of cities around 65-80 degrees and another cluster from about 90-100 degrees.  This reveals that California has at least two distinct kinds of locations with regard to high temperatures in July. 

For the explanation of this phenomenon, a student suggested that there’s a split between northern California and southern California.  I replied that this was a good observation, but I questioned how this split would produce the two clusters of temperature values that we see in the graph.  The student quickly followed up with a different explanation that is spot-on: California has many cities near the coast and many that are inland.  How would this explain the bimodality in the graph?  The student elaborated that cities near the coast stay fairly cool even in July, while inland and desert cities are extremely hot.

My students and I then worked through three more examples to complete the one-hour session.  Next I showed them the following boxplots of daily high temperatures in February and July of 2019 for four cities*:

* I discuss these data in more detail in post #7, Two dreaded words, part 2, here.

The students went back to their breakout rooms with this task: Arrange these four cities from smallest to largest in terms of:

  • center of February temperature distributions;
  • center of July temperature distributions;
  • variability of February temperature distributions; and
  • variability of July temperature distributions

After we discussed their answers and reached a consensus, I then briefly introduced the idea of a log transformation in the context of closing prices of Nasdaq-100 stocks on September 15, 2020:

Finally, we discussed the example of cancer pamphlets’ readability that I described in post #4, Statistics of illumination, part 2, here.

As you can tell, the topic of the class session that I have described here was graphing numerical data.  I think the matching activity set the stage well, providing an opportunity for students to talk with each other about data in a fun way.  I also hope that this activity helped to instill in students a mindset that they should always think about context when examining graphs and analyzing data.

#64 My first week

Many thanks to all who sent encouragement in response to last week’s post (here) about my harrowing experience with creating my first video for my students.  I’m happy to report that my first-ever week of remote teaching went well.  I promise not to turn this blog into a personal diary, but I’d like to share some reflections based on this past week.

I woke up last Monday excited and nervous for the first day of the school year.  That was a good and familiar, even comforting, feeling.  Some unfamiliar feelings followed for the rest of the day.  It was very strange not to leave my house for the first day of school, and it was also weird to realize at the end of the day that I had not changed out of my sweat pants.

I was very glad that many students showed up for my first live Zoom session at 8am on Monday.  I also appreciated that many of them turned their cameras on, so I could see their faces on the screen.  A large majority of my students are beginning their first term at Cal Poly, and they seemed eager to get started.  I was excited that these students were beginning the academic coursework of their college experience with me.

One fun thing is that the very first student to join the Zoom session turned out to have her birthday on that day.  I know this because we worked through the infamous draft lottery example (see post #9, here), in which I asked students to find their own birthday’s draft number.  It turned out that this student’s birthday had draft number 1, which meant that she was born on September 14, last Monday.

I have used three different Zoom tools to interact with students:

  1. Breakout rooms provide an opportunity for students to discuss questions with each other.  For example, we used breakout rooms at the beginning of the first session for groups of 5-6 students to introduce themselves to each other.  Then we used the same breakout rooms later for students to discuss possible explanations for the apparent paradox with the famous Berkeley graduate admissions data (see post #3 here).
  2. Polls provide immediate feedback on students’ understanding (see Roxy Peck’s guest post #55 about clicker questions here).  For example, I used polls to ask students to identify variables as categorical or numerical and to indicate whether a number was a parameter or a statistic.
  3. Chat allows students to ask questions of me, and I’ve also asked them to type in responses to some questions in the chat window.  For example, students determined the median draft number for their birth month and typed their finding into the chat.

During Friday’s live Zoom session, we studied ideas related to sampling, and we worked through the Gettysburg Address activity (see post #19, Lincoln and Mandela, part 1, here).  I was apprehensive about how this activity would work remotely, but I was pleasantly surprised that it went smoothly.  I prepared a Google form in advance and pasted a link in the chat window, through which students entered the average word length in their self-selected sample of ten words from the speech.  This allowed me to see their responses in real time and paste the results into an applet (here), so we could examine a dotplot of the distribution of their sample averages.  Because a large majority of the students’ sample averages exceeded the population average of 4.3 letters per word, the resulting graph illustrated sampling bias:
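As a contrast with the students’ self-selected samples, a small simulation illustrates that truly random samples show no such bias.  The population of word lengths below is made up (it is not the actual speech, whose average was 4.3 letters per word):

```python
import random
import statistics

random.seed(1)  # for reproducibility

# A made-up, right-skewed population of 1000 word lengths,
# standing in for the words of the speech.
population = ([2] * 200 + [3] * 250 + [4] * 250
              + [5] * 150 + [7] * 100 + [11] * 50)

pop_mean = statistics.mean(population)  # 4.15 for this fake population

# Take 2000 truly random samples of 10 "words" and record each average.
sample_means = [statistics.mean(random.sample(population, 10))
                for _ in range(2000)]

# Random samples overestimate about as often as they underestimate,
# so their averages center on the population average: no sampling bias.
print(round(pop_mean, 2), round(statistics.mean(sample_means), 2))
```

The bias in class arises from human judgment, which tends to favor longer, more memorable words; random selection has no such preference, which is the moral of the activity.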

I also created videos for students who could not attend the optional live sessions.  I’m even getting slightly more comfortable with making videos.  But making corrections to the auto-captioning takes a while, perhaps because the software has trouble translating words from my peculiar voice.  Some unfortunate mis-translations of what I have said include:

  • “grandmother” for “parameter”
  • “in America” for “a numerical variable”
  • “selected a tree” for “selected at random”
  • “once upon a time” for “one sample at a time”
  • “sample beans” for “sample means”

I have already given many quizzes to my students, even after just one week.  I give a quiz based on each handout, just to make sure that they were paying attention as they worked through the examples, either in a live session with me or on their own or by watching a video.  I also assign an application quiz for each handout, in which students apply what they have learned to a new context.  I have also asked students to complete several miscellaneous quizzes, for example by answering questions about a Hans Rosling video on visualizing human progress (here) that I asked them to watch.   I regard these quizzes as low-stakes assessments, and I encourage students to work together on them.

I conclude this brief post by offering five take-aways from my first week of remote teaching.  I realize that none of these is the least bit original, and I suspect that none will provide any insights for those who taught remotely in the spring or started remote teaching in the fall earlier than I did.

  1. Remote teaching can be gratifying.  Rather than thinking about how much I would prefer to be in a classroom with my students and down the hall from my colleagues, I hope to concentrate on my happy discovery that interacting with students virtually can be fun.
  2. Remote teaching can be engaging.  I greatly appreciate my students’ being such good sports about answering my questions and participating in activities.  (See Kelly Spoon’s guest post #60, here, for several ideas about connecting with students online.)
  3. Asking good questions is central to helping students learn*, remotely as well as in-person.
  4. Remote teaching requires considerable preparation**.  For me, some of this preparation has involved planning when to use breakout rooms and polls and chat.  Collecting data from students also requires more preparation than simply asking students to put their result on the board.  Writing quizzes also requires entering the questions into the learning management system after crafting the questions in the first place.
  5. Remote teaching is very tiring.***  I have found the combination of having to prepare so extensively, integrate different technologies at the same time, and stare at a screen for many hours per day to be exhausting!

* You did not see this one coming, did you?

** But on the positive side of the ledger, my commute time has been reduced by nearly 100%.

*** Of course, perhaps age is a confounding variable that explains my fatigue.  Never before have I been as old to start a new school year as I am now.

Here’s one more takeaway, one that I regret: I have much less time and thought to devote to this blog than I had last year.  That’s why this post is so brief and perhaps unhelpful.  As always, thanks for reading and bearing with me.

#63 My first video

I recently endured a harrowing, horrifying, humbling, even humiliating experience.  That’s right: I recorded my first video.

My first-ever online teaching experience begins today, September 14*.  In preparation, I thought I’d record a brief video to introduce myself to my students, hoping to begin the process of establishing a bit of a connection even though I’ll probably never meet these students in person.  I wanted the video to be brief, about five minutes or so.  I’ve never followed a script in class, so I did not write a script for the video, hoping that non-scripted spontaneity would make it more appealing.  But I did prepare some PowerPoint slides, partly to remember what I wanted to say, and also so the slides would occupy most of the screen with my face appearing only in a small corner.  I wanted to use Zoom to make the video, just because I like to keep things simple.  I’ve already used Zoom a bit, and I’ll be using Zoom for live sessions with my students this fall.

* This is the same date that was selected first and received draft number 1 in the infamous 1970 draft lottery.  In post #9 (here), I describe a class activity that illustrates statistical thinking by analyzing those lottery results.

So, I entered the room that now serves as my home office, started my computer, opened Zoom, launched a new meeting, shared my screen, put my PowerPoint file in presentation mode, looked into the camera, pressed the record button, and started talking to myself …

I finished about seven-and-a-half minutes later, only 50% beyond my target time of five minutes*.  I waited for Zoom to produce the recording, and then I eagerly pressed the play button.  This is when the experience turned harrowing.

* Post #28 (here) pertains to my pervasive pet-peeve involving student misunderstandings of percentage differences.

I really don’t like watching myself on a screen, but I understand that many people feel this way about themselves, and Zoom use over the past six months has somewhat inured me to this unpleasant feeling.  That wasn’t the harrowing part.

Those of you who know me, or have heard me give presentations, can probably anticipate that I found the horrifying part to be listening to my voice.  For those of you who have never heard me: I have a very unusual and peculiar* speaking voice.  It doesn’t sound nearly as odd to me in real life as it does on a recording.  After listening to just the first few seconds of the Zoom recording, I was overcome by a desire to apologize to everyone who’s ever had to listen to me – students, colleagues, friends, wife, cats, …  I only hope that this is something that you get used to and barely notice after a while.

* Friends use the word distinctive here to spare my feelings.

To be more specific, my voice tends to rise rather than fall at the end of sentences.  This vocal pattern is sometimes referred to as “upspeak.”  This is apparently a serious topic of research inquiry among linguists, and a Google search will provide a lot of information, references, and advice about upspeak.  My favorite anecdote about this phenomenon is that novelist Richard Russo invested one of his characters with upspeak in his delightful satire of academic life Straight Man.  Russo’s main character, the reluctant chairman of a college’s English department, describes the speaking voice of the department secretary as follows: Most of Rachel’s statements sound like questions.  Her inability to let her voice fall is related to her own terrible insecurity and lack of self-esteem.  To emphasize this aspect of her speaking voice, Russo uses a question mark at the end of Rachel’s sentences throughout the book?*

* Yes, I used that punctuation on purpose to demonstrate Russo’s technique.

In case you’re wondering whether I’m exaggerating about my own upspeaking, I’ll point out that during conference and workshop presentations, I often ask those in attendance to guess where I’m from.  Just asking the question is usually good for a laugh, as people realize that I am acknowledging my unusual vocal inflections, and they’re often curious to know the answer.  Common guesses often include Ireland, Scotland, Scandinavia, Canada, and the upper Midwest.  None of those is correct*.  I believe that my peculiar voice is more of an individual quirk than a regional dialect.

* I will reveal the answer later in this post.

After I overcame my revulsion at hearing my own voice enough to get back to work on my first video, I made and discarded several attempts due to mis-speakings and awkward pauses and the like.  Then as I went through the fifth take, I thought I had a keeper.  I successfully avoided the mis-speaking and pauses.  I was saying what I wanted to say in a reasonable manner.  As I got to the end, I was almost looking forward to playing it back to confirm that this would be the final take, the one to be posted for my students.  It probably would have been, except for one flaw: I realized to my horror that I had been sharing and recording the wrong screen!  I was sharing and recording my laptop screen rather than my monitor screen*, which was the one with the PowerPoint presentation! 

* I’ve actually used just a laptop for the past 20 years until recently.  Seeing that I would need to teach online in the fall, my wife very kindly bought me a new monitor a few months ago.  As this story reveals, I’m still getting used to it.

A few takes later, I again thought I had a keeper, and I was certain that I had shared and recorded the correct screen this time.  I was feeling very proud of myself, downright excited as I got to the last slide, in which I thanked students for taking the time to watch my first video.  But then …  My brain completely froze, and I couldn’t find the button to stop the recording!  I don’t know whether the Zoom control bar was hidden behind the PowerPoint presentation or behind some other application or what, but I flailed about for a full 30 seconds, muttering to myself (and, of course, to the microphone) the whole time.  I know this should be no big deal; it can’t be hard to edit out those last 30 seconds, but I didn’t know how to do that*!

* Now I wish that I had kept all of these outtakes.  But I didn’t realize at the time that there would be so many, or that the experience would make such an impact on me that I would write a full, self-indulgent blog post about it.

I know that none of this was Zoom’s fault, but at this point I decided to learn the basics and record the next few takes with Screencast-o-matic.  These actually went fairly well, and it only took a few more takes to end up with the final version that I posted for my students.  Altogether, I spent many, many hours making a 7.5-minute video.

Just for fun, let me show you some of the slides from my first video presentation.  I start by telling students where I’m from and pointing out that I slowly ventured a bit farther from home as I went to college and then graduate school and then my first teaching position:

I also wanted to let students know that while I am a very experienced teacher of statistics, I am a complete novice when it comes to teaching online courses:

To reveal a more personal side, I told students about some of my hobbies, along with some photos:

I have mentioned before (see posts #25 and #26 here and here) that I give lots of quizzes to my students.  I plan to do that again with my online course this fall.  In fact, I suspect that very frequent quizzes will be all the more useful in an online setting for helping to keep students on task, indicating what they should be learning, and providing them with feedback on their progress.  I even decided to give them a quiz based on my self-introduction video.  This is an auto-graded, multiple-choice quiz administered in our course management system Canvas.  I expect this quiz to provide students with easy points to earn, because all of the answers appear in the video, and they can re-watch the video after they see the quiz questions.  Here are the questions:

  1. In which state did I live for the first 39 years of my life? [Options: Arizona, California, Hawaii, Mississippi, Pennsylvania]
  2. How many states have I been in? [Options: 1, 13, 47, 50]
  3. What kind of pets have I had? [Options: Birds, Cats, Dogs, Fish, Snakes]
  4. Which of the following is NOT the name of one of my pets? [Options: Cosette, Eponine, Punxsutawney Phil, Puti]
  5. What is the name of my fantasy sports teams? [Options: Cache Cows, Domestic Shorthairs, Markov Fielders, Netminders, Sun Cats]
  6. For how many years have I been at Cal Poly? [Options: 2, 19, 31, 58]
  7. How much experience do I have with online teaching? [Options: None, A little, A lot]
  8. What was my primary project while on leave from Cal Poly for the past academic year? [Options: Playing online games, Proving mathematical theorems, Reading mystery novels, Starting a business, Writing a blog]
  9. What is my teaching philosophy? [Options: Ask good questions, Insist on perfection, Learn by viewing, Rely on luck]
  10. Am I funny? [Option: Well I try to be but I may not succeed often]

So, how did you do?  The correct answers are: Pennsylvania, 47 (all but Arkansas, Mississippi, North Dakota), Cats, Punxsutawney Phil, Domestic Shorthairs, 19, None, Writing a blog, Ask good questions, Well I try to be but I may not succeed often.

P.S. If you would like to watch my first video for yourself, please bear in mind my warning about the peculiarity of my speaking voice.  But if that does not dissuade you, the video can be found here.

#62 Moral of a silly old joke

I have always liked this silly old joke, which I first heard decades ago:

A man takes his dog to see a talent scout, proudly claiming that his dog can talk.  Of course, the talent scout is very skeptical. To convince her, the man asks the dog: What’s on top of a house?  The dog eagerly responds: “Roof, roof!”  The unimpressed talent scout rolls her eyes and tells the man to leave.  The man seizes a second chance and asks the dog: How does sandpaper feel?  The dog gleefully responds: “Rough, rough!”  The scout gets out of her chair and moves to escort the man out of her office.  Begging for one last chance, the man asks the dog: Who was the greatest baseball player of all time?  The dog enthusiastically responds: “Ruth, Ruth!”  The fed-up talent scout removes the man and dog from her office.  Out in the hallway, looking up at the man with a confused and crestfallen expression on his face, the dog says: “DiMaggio?”

Part of why I like this joke is that “DiMaggio?” strikes me as the perfect punch line.  I have seen versions of the joke in which the dog says: “Maybe I should have said DiMaggio?”  I don’t think that’s as funny as the single-word response.  I also don’t think the joke would work nearly as well with Mays* or Aaron or Williams or Trout as the punch line, because those names are so much easier to pronounce than DiMaggio**.

* Joe Posnanski, from whom I have copied this footnoting technique that he calls a Pos-terisk, ranks Willie Mays as the only baseball player better than Babe Ruth (here).

** A name that works nearly as well is Clemente.  Having grown up in western Pennsylvania in the 1960s and 1970s, my favorite baseball player will always be Roberto Clemente.

What in the world does this have to do with teaching statistics, which is the whole point of this blog?!

Please forgive me, as I’m a bit out of practice with writing blog posts*.  Now I will try to connect this silly old joke to the whole point of this blog.

* I again thank the nine guest bloggers who contributed posts during my hiatus in July and August.  If you missed any of these posts, please check them out from the list here.

Please consider: What is the moral of this joke?  Let me rephrase that: What did the man do wrong?  Or, to put this in a more positive light: What should the man have done differently?

I’ll give you a hint, as I often do with my own students: The answer that I’m fishing for contains three words.  Want another hint?  Those three words contain a total of 16 letters.  One more hint? The first word has the fewest letters (3), and the last word has the most letters (9).

All right, I’ve dragged this on long enough.  I suspect that you’ve figured out what I think the moral of this silly old joke is.  In order to achieve his (and his dog’s) lifelong dream, all the man needed to do was: Ask good questions.

That’s where the man messed up, right?  His obvious mistake was asking questions for which the answers correspond so well with sounds that an ordinary dog makes.  The man’s incredibly poor choice of questions prevented the dog from demonstrating his remarkable ability.

I repeat: What does this have to do with teaching statistics?!  I suspect that my moral is abundantly clear at this point, but please allow me to summarize:

  • To help our students learn, we need to ask good questions. 
  • To enable our students to demonstrate what they can do, we need to ask good questions. 
  • To empower our students to achieve their potential, we need to ask good questions.

I said in my very first post (see question #8 here) that these three words capture whatever wisdom I may have to offer for teachers of statistics: Ask good questions.  I tried to provide many specific examples over the next 51 posts (here).  That is the whole point of this blog.  I think that’s how we teachers should focus most of our time, effort, and creativity.  Whenever I start to forget this, for example when I momentarily succumb to the temptation to believe that it’s more important to master intricacies of Canvas or Zoom or PowerPoint or Camtasia or Flipgrid or Discord or LockDown Browser or GitHub or even R, I remember the moral of a silly old joke.

P.S. My professional leave for the 2019-2020 academic year has come to an end, and I am preparing to return to my full-time teaching role*.  I’m hoping to find time to resume writing weekly blog posts, because I greatly enjoy this and hope that these essays have some value.  But I won’t have nearly as much time to devote to blogging for the next nine months, so I’ll need to make the essays shorter or fewer.  Please stick around, and we’ll see how it goes.  For the month of September, I ask for your indulgence as I write some short and unusual blog posts that are less directly applicable to teaching statistics than my typical essays.  As always, thanks very much for reading!

* Our fall classes at Cal Poly will begin on Monday, September 14.  I’ll be teaching online for the first time in my 30+-year career.  Wish me luck!

P.P.S. Thanks to Julie Clark for providing a photo of her dog Tukey. As far as I know, this Tukey cannot talk, but I would not bet against him being able to draw boxplots.

#61 Text as data

This guest post has been contributed by Dennis Sun.  You can contact Dennis at

Dennis Sun is a colleague of mine in the Statistics Department at Cal Poly. He teaches courses in our undergraduate program in data science* as well as statistics. Dennis also works part-time as a data scientist for Google. Dennis is a terrific and creative teacher with many thought-provoking ideas. I am very glad that he agreed to write this guest post about one aspect of his introductory course in data science that distinguishes it from most introductory courses in statistics.

* My other department colleague who has taught for our data science program is Hunter Glanz, who has teamed with Jo Hardin and Nick Horton to write a blog about teaching data science (here).

I teach an “Introduction to Data Science” class at Cal Poly for statistics and computer science majors. Students in my class are typically sophomores who have at least one statistics course and one computer science course under their belt. In other words, my students arrive in my class with some idea of what statistics can do and the programming chops to execute those ideas. However, many of them have never written code to analyze data. My course tries to bring these two strands of their education together.

Of course, many statisticians write code to analyze data. What makes data science different? In my opinion, one of the most important aspects is the variety of data. Most statistics textbooks start by assuming that the data is already in tabular form, where each row is an observation and each column is a variable. However, data in the real world comes in all shapes and sizes. For example, an audio file of someone speaking is data. So is a photograph or the text of a book. These types of data are not in the ready-made tabular form that is often assumed in statistics textbooks. In my experience, there is too much overhead involved to teach students how to work with audio or image data in an introductory course, so most of my non-standard data examples come from the world of textual data.

I like to surprise students with my first example of textual data: Dr. Seuss books. Observations in this “dataset” include:

  1. “I am Sam. I am Sam. Sam I am….”
  2. “One fish, two fish, red fish, blue fish….”
  3. “Every Who down in Whoville liked Christmas a lot….”

and so on. To analyze this data using techniques they learned in statistics class, it first must be converted into tabular form. But how?

One simple approach is a bag of words. In the bag of words representation, each row is a book (or, more generally, a “document”), and each column is a word (or, more generally, a “term”). Each entry in the table is a frequency representing the number of times a term appears in a document. This table, called the “term-frequency matrix,” is illustrated below:

The resulting table is very wide, with many more columns than rows and most entries equal to 0. Can we use this representation of the data to figure out which documents are most similar? This sparks a class discussion about how and why a data scientist would do this.
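As a rough sketch of how such a term-frequency matrix might be built (the document titles and the `bag_of_words` helper here are my own illustrations, not the lab's actual code):

```python
from collections import Counter

# Hypothetical mini-corpus of Dr. Seuss openings
documents = {
    "Green Eggs and Ham": "I am Sam. I am Sam. Sam I am.",
    "One Fish Two Fish": "One fish, two fish, red fish, blue fish.",
}

def bag_of_words(text):
    """Lowercase the text, strip punctuation, and count word frequencies."""
    words = "".join(ch if ch.isalpha() else " " for ch in text.lower()).split()
    return Counter(words)

counts = {title: bag_of_words(text) for title, text in documents.items()}

# The union of all words gives the columns of the term-frequency matrix;
# words missing from a document get a count of 0, so the matrix is wide
# and mostly zeros.
vocabulary = sorted(set().union(*counts.values()))
matrix = {title: [c[w] for w in vocabulary] for title, c in counts.items()}
```

Even on this toy corpus, most entries in each row are 0, which previews the sparsity discussion above.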

How might we quantify how similar two documents are? Students usually first propose calculating some variation of Euclidean distance. If x_i represents the vector of counts in document i, then the Euclidean distance between two documents i and j is defined as:

d(x_i, x_j) = √( Σ_k (x_ik − x_jk)² )

This is just the formula for the distance between two points that students learn in their algebra class (and is essentially the Pythagorean theorem), but the formula is intimidating to some students, so I try to explain what is going on using pictures. If we think of x_i and x_j as vectors, then d(x_i, x_j) measures the distance between the tips of the arrows.

For example, suppose that the two documents are:

  1. “I am Sam. I am Sam. Sam I am.”
  2. “Why do I like to hop, hop, hop? I do not know. Go ask your Pop.”

and the words of interest are “Sam” and “I.” Then the two vectors are x_1 = (3,3) and x_2 = (0,2), because the first document contains 3 of each word, and the second includes no “Sam”s and two “I”s.  These two vectors, and the distance between them, are shown here:
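A minimal sketch of this distance calculation, applied to the two count vectors above:

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two term-count vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x1 = (3, 3)  # counts of ("Sam", "I") in "I am Sam. I am Sam. Sam I am."
x2 = (0, 2)  # counts of ("Sam", "I") in the "hop, hop, hop" document

d = euclidean_distance(x1, x2)  # sqrt(3^2 + 1^2) = sqrt(10), about 3.16
```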

At this point, a student will usually observe that the frequencies scale in proportion to the length of the document. For example, the following documents are qualitatively similar:

  1. “I am Sam.”
  2. “I am Sam. I am Sam. Sam I am.”

yet their vectors are not particularly close, since one vector is three times the length of the other:

How could we fix this problem?  There are several ways. Some students propose making the vectors the same length before comparing them, while others suggest measuring the angles between the vectors. What I like about this discussion is that students are essentially invoking ideas from linear algebra without realizing it or using any of the jargon. In fact, many of my students have not taken linear algebra yet at this point in their education. It is helpful for them to see vectors, norms, and dot products in a concrete application, where they arise naturally.
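Both student proposals can be sketched in a few lines; the function names are mine, and the vectors correspond to the “I am Sam” example above:

```python
import math

def normalize(x):
    """Scale a vector to unit length, so distances ignore document length."""
    norm = math.sqrt(sum(a * a for a in x))
    return tuple(a / norm for a in x)

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

x1 = (1, 1)  # counts of ("Sam", "I") in "I am Sam."
x2 = (3, 3)  # the tripled document: same direction, three times as long

# After normalizing, the two vectors coincide, and their cosine
# similarity is exactly 1, even though their Euclidean distance is not 0.
```

The two fixes are really the same idea: vectors pointing in the same direction have angle 0, and unit-length vectors at angle 0 are identical.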

Why would anyone want to know how similar two documents are? Students usually see that such a system could be used to recommend books: “If you liked this, you might also like….”* Students also suggest that it might be used to cluster documents into groups**. However, rarely does anyone suggest the application that I assign as a lab.

* This is called a “recommender system” in commercial applications.

** Indeed, a method of clustering called “hierarchical clustering” is based on distances between observations.

We can use similarity between documents to resolve authorship disputes. The most celebrated example concerns the Federalist Papers, first analyzed by statisticians Frederick Mosteller and David Wallace in the early 1960s (see here). Yes, even though the term “data science” has only become popular in the last 10 years, many of the ideas and methods are not new, dating back over 50 years. However, whereas Mosteller and Wallace did quite a bit of probability modeling, our approach is simpler and more direct.

The Federalist Papers are a collection of 85 essays penned by three Founding Fathers (Alexander Hamilton, John Jay, and James Madison) to drum up support for the new U.S. Constitution.* However, the essays were published under a pseudonym “Publius.” The authors of 70 of the essays have since been conclusively identified, but there are still 15 papers whose authorship is disputed.

* When I first started using this example in my class, few students were familiar with the Federalist Papers. However, the situation has greatly improved with the immense popularity of the musical Hamilton.

I give my students the texts of all 85 Federalist papers (here), along with the authors of the 70 undisputed essays:

Their task is to determine, for each of the 15 disputed essays, the most similar undisputed essays. The known authorships of these essays are then used to “vote” on the authorship of the disputed essay.

After writing some boilerplate code to read in and clean up the texts of the 85 papers, we split each document into a list of words and count up the number of times each word appears in each document. My students would implement this in the programming language Python, which is a general-purpose language that is particularly convenient for text processing, but the task could be carried out in any language, including R.

Rare context-specific words, like “trembling,” are less likely to be a marker of a writer’s style than general words like “which” or “as.” Accordingly, we restrict attention to the 30 most common words. We also normalize the vectors to be the same length so that distances are invariant to the length of the document. We end up with a table like the following:
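One possible way to combine these two preprocessing steps (keeping the most common words, then normalizing each row to unit length); the `top_terms_matrix` helper and the toy word counts are hypothetical stand-ins for the lab's actual code and data:

```python
import math
from collections import Counter

def top_terms_matrix(word_counts, k=30):
    """Keep the k most common words across the whole corpus and return
    unit-length frequency vectors, so distances are invariant to length."""
    overall = Counter()
    for counts in word_counts.values():
        overall.update(counts)
    vocab = [word for word, _ in overall.most_common(k)]

    matrix = {}
    for name, counts in word_counts.items():
        row = [counts[word] for word in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        matrix[name] = [v / norm for v in row]
    return vocab, matrix

# Toy corpus standing in for the 85 papers (hypothetical counts)
corpus = {
    "paper_1": Counter({"which": 4, "as": 3, "trembling": 1}),
    "paper_2": Counter({"which": 2, "as": 5}),
}
vocab, matrix = top_terms_matrix(corpus, k=2)
```

Note that the rare word “trembling” is dropped by the vocabulary cut, exactly as the stylistic argument above suggests it should be.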

Now, let’s look at one of the disputed papers: Federalist Paper #18. We calculate the Euclidean distance between this document and every other document:

Of course, the paper that is most similar to Paper #18 is … itself. But the next few papers should give us some useful information. Let’s grab the authors of these most similar papers:

Although the second closest paper, Paper #19, is also disputed (which is why its author is given as the missing value NaN), the third closest paper was definitively written by Madison. If we look at the 3 closest papers with known authorship, 2 were written by Madison. This suggests attributing Paper #18 to Madison.

What the students just did is machine learning: training a K=3-nearest-neighbors classifier on the 70 undisputed essays to predict the authorship of Paper #18, although we do not use any of that terminology. I find that students rarely have trouble understanding conceptually what needs to be done in this concrete problem, even if they struggle to grasp more abstract machine learning ideas such as training and test sets. Thus, I have started using this lab as a teaser for machine learning, which we study later in the course.
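A sketch of the nearest-neighbor vote just described; the paper ids, the two-dimensional vectors, and the `predict_author` helper are all hypothetical stand-ins for the real 30-dimensional data:

```python
import math
from collections import Counter

def predict_author(target, vectors, authors, k=3):
    """Majority vote among the k nearest documents with known authors.
    `vectors` maps paper id -> term-frequency vector; `authors` maps
    paper id -> author, with disputed papers absent (so they never vote)."""
    def dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    # Exclude the target itself and any paper whose author is unknown
    known = [pid for pid in vectors if pid != target and pid in authors]
    known.sort(key=lambda pid: dist(vectors[target], vectors[pid]))
    votes = Counter(authors[pid] for pid in known[:k])
    return votes.most_common(1)[0][0]

# Hypothetical data: paper 19 is disputed, so it has no author entry
vectors = {18: (0, 0), 19: (0.1, 0), 37: (0, 1), 48: (1, 0), 1: (5, 5)}
authors = {37: "Madison", 48: "Madison", 1: "Hamilton"}
prediction = predict_author(18, vectors, authors)  # "Madison", by a 2-1 vote
```

Skipping papers without an author entry mirrors how the disputed Paper #19, despite being the second-closest document, contributes nothing to the vote.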

Next I ask students: How could you validate whether these predictions are any good? Of course, we have no way of knowing who actually wrote the disputed Federalist Papers, so any validation method has to be based on the 70 papers whose authorship is known.

After a few iterations, students come up with some variant of the following: for each of these 70 papers, find the 3 closest papers among the other 69 papers, then compare the prediction from those 3 closest papers against the paper’s known author, producing a table like the following:

In machine learning, this table is known as a “confusion matrix.” From the confusion matrix, we try to answer questions like:

  1. How accurate is this method overall?
  2. How accurate is this method for predicting documents written by Madison?

Most students assess the method overall by calculating the percentage of correct (or incorrect) predictions, obtaining an accuracy of 67/70 ≈ 96%.

However, I usually get two different answers to the second question:

  • The method predicted 15 documents to be written by Madison, but only 13 were. So the “accuracy for predicting Madison” is 13/15 ≈ 87%.
  • Madison actually wrote 14 of the documents, of which 13 were identified correctly. So the “accuracy for predicting Madison” is 13/14 ≈ 93%.

Which answer is right? Of course, both are perfectly valid answers to the question. These two different interpretations of the question are called “precision” and “recall” in machine learning, and both are important considerations.
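The two calculations can be made concrete with a small sketch. Note that the confusion-matrix counts below are hypothetical, chosen only so that the Madison row and column reproduce the 13/15 and 13/14 ratios quoted above:

```python
def precision_recall(confusion, author):
    """confusion[actual][predicted] = number of papers.
    Precision: among papers *predicted* to be by `author`, the fraction
    actually written by him.  Recall: among papers actually *written*
    by `author`, the fraction predicted correctly."""
    predicted = sum(row.get(author, 0) for row in confusion.values())
    actual = sum(confusion[author].values())
    correct = confusion[author].get(author, 0)
    return correct / predicted, correct / actual

# Hypothetical counts consistent with the totals quoted in the text:
# 15 papers predicted as Madison, 14 actually his, 13 correct
confusion = {
    "Madison":  {"Madison": 13, "Hamilton": 1},
    "Hamilton": {"Madison": 2, "Hamilton": 52},
    "Jay":      {"Jay": 2},
}
precision, recall = precision_recall(confusion, "Madison")  # 13/15 and 13/14
```

The denominators differ because precision sums down the predicted-Madison column while recall sums across the actual-Madison row.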

One common mistake that students make is to include paper i itself as one of the three closest papers to paper i. They realize immediately why this is wrong when it is pointed out: if we think of our validation process as an exam, it is like giving away the answer key on the exam! This provides an opportunity to discuss ideas such as overfitting and cross-validation, again at an intuitive level, without using jargon.*

* The approach of finding the closest papers among the other 69 papers is formally known as “leave-one-out cross validation.”

I have several more labs in my data science class involving textual data. For example, I have students verify Zipf’s Law (learn about this from the video here) for different documents. A student favorite, which I adapted from my colleague Brian Granger (follow him on twitter here) is the “Song Lyrics Generator” lab, where students scrape song lyrics from their favorite artist from the web, train a Markov chain on the lyrics, and use the Markov chain to generate new songs by that artist. One of my students even wrote a Medium post (here) about this lab.

Although I am not an expert in natural language processing, I use textual data often in my data science class, because it is both rich and concrete. It has just enough complexity to stretch students’ imaginations about what data is and can do, but not so much that it is overwhelming to students with limited programming experience. The Federalist Papers lab in particular intersects with many technical aspects of data science, including linear algebra and machine learning, but the concreteness of the task allows us to discuss key ideas (such as vector norms and cross-validation) at an intuitive level, without using jargon. It also touches upon non-technical aspects of data science, including the emphasis on prediction (note the conspicuous absence of probability in this blog post) and the need for computing (the texts are long enough that the term frequencies are not feasible to count by hand). For students who know a bit of programming, this provides them with an end-to-end example of how to use data to solve real problems.