
#60 Reaching students online

This guest post has been contributed by Kelly Spoon.  You can contact her at

Kelly Spoon teaches statistics in San Diego, at a two-year college (San Diego Mesa College) and for an AP Statistics class (Torah High School of San Diego).  I met Kelly through Twitter (@KellyMSpoon), where she shares lots of ideas about teaching statistics and mathematics, and at the AMATYC conference in Milwaukee last fall.  Kelly has since hosted me to give a workshop for her colleagues in San Diego and to conduct a review session for her AP Statistics students via Zoom.  Kelly is very passionate about teaching statistics, dedicated to helping all students succeed, and knowledgeable about content and pedagogy.  I am very glad that she agreed to contribute this guest blog post on the very timely topic of teaching statistics (and reaching students) online*.

* Speaking of timely, my first taste of online teaching will begin three weeks from today.

When Allan asked if I would write a guest blog post, I didn’t hesitate to email back with an emphatic yes. Not only because I owe him for presenting to faculty at my college AND doing a review for my AP Statistics students, but because I’m always excited to share my passion for teaching statistics.

Then the actual writing started, and I immediately regretted this decision. There’s just too much to share in such a short space. In the end, I wrote an entirely too long blog post for which Allan suggested some minor edits to fit a theme of fearlessness. I asked myself: What does it mean to teach fearlessly?

To me, the broadest definition is a willingness to thoughtfully try new things – whether tools, policies, assessments, or formats. And at this point, most of us fit that definition by the circumstances of distance learning that have been thrust upon us. Now that I’m a week into a new completely online semester, my previous draft felt like it was missing what most of us want to know right now: How do we teach statistics online?

After having a successful first week of the new fall term that mostly gave me energy rather than leaving me feeling drained (as most of last spring’s emergency remote classes did), I thought I’d share some insights as to how I made that first week work for me. To keep with the theme of this blog, these insights are presented as answers to questions that you might ask yourself as you’re designing your online statistics course. I hope these questions are generic enough to stand the test of time and remain relevant when we’re back in a classroom.

1. Cultivating curiosity

Knowing where you want to end up (your desired outcomes) is crucial when designing a course or individual lesson, but the starting point is sometimes overlooked. As you think about your course, whether you’re meeting in person, on Zoom, or you don’t have scheduled meetings, ask yourself: Does my lesson plan make students want to learn more?

This is where Allan’s blog comes in handy. He has many great examples of good questions that truly spark curiosity, often without requiring a deep understanding of the subject matter to start. However, simply including good questions in a lecture allows students to opt out and wait for the professor or another student to do the thinking for them. Simulation-based inference and the many awesome applets that exist in that same vein are one great way to build curiosity for theory-based inference. Regardless of class modality, one of my favorite tools for sparking curiosity is the activity builder in Desmos.

If you haven’t tried out the Desmos Activity Builder (here), you’re missing out. This one tool can answer questions such as: How do I do activities if I’m teaching online? What if I want to assign activities as homework? What if I don’t want to buy Hershey’s Kisses to make confidence intervals for the proportion that land on their base? The Desmos Activity Builder allows you to add math, graphs, tables, videos, images, and text to slides for students to work through. Students can input math, graphs, tables, and text; answer multiple-choice and multiple-selection questions; reorder selections; and even doodle on a graph or image. That was quite the list. See the image below for a visual of all the things you can add to an activity in Desmos:

On the instructor end, you can see exactly where students are (so it’s great to use if you’re meeting with students at a particular time, which we all now know is called synchronous). I use this to pause the activity and debrief when most students have reached a particular point, or to nudge students who seem to be stalled. You can also see student work in real time and provide them feedback directly in the activity. And many activities have been designed to combine data from across the entire class, allowing you to recreate some favorite in-person activities in an online space.

Here are a few Desmos activities that I’ve created, used, or plan to use to build curiosity:

a) Reading Graphs (here)

This activity was inspired by a workshop on culturally responsive teaching. These graphs and questions appear in my lecture notes before we discuss displays for data. Typically, I have students work in groups of four to answer all of the questions for their graph. Then we do a numbered-head protocol (they number themselves 1-4, and I use a random number generator on the projector to choose a victim to report out) to debrief the activity.  I show them that they already know almost everything in that section of the lecture notes, with the added bonus of being able to bring in topical graphs*, including ones on social justice issues. For my asynchronous classes, students go through this activity on their own but can see other student responses once they share. For my synchronous class, I occasionally “pause” the activity to discuss some of the responses to a particular graph.  For instance, the following bar chart of children in poor families leads to so many more questions than answers: What defines a family as poor? Are the observational units the children or the families? Does it matter? What if the parents have different education levels? Where are the other 8%?!
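The random-number step of the numbered-head protocol is trivial to reproduce with any tool; here is a minimal sketch in Python (the on-screen generator itself could of course be anything):

```python
import random

def pick_reporter(group_size=4):
    """Choose which numbered student (1..group_size) reports out for the group."""
    return random.randint(1, group_size)

# choose a "victim" for one group of four
print(f"Student #{pick_reporter()} reports out!")
```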

* Please ignore the titanic mosaic plot; I really haven’t found better.

b) Skew the Script – Lesson 1.1 (here)

I just found this activity, despite being a longtime fangirl of @AnkerMath on Twitter. Skew the Script (here) has a great curriculum created for AP Statistics, with student and instructor resources that strive to make the content relevant. It focuses on using real-world examples and creating equity-driven lessons. This particular exercise has students analyze and describe the problems with a lot of bad graphs. I plan on starting off the second week with this one! I’ll tweet how it goes.

c) Does Beyoncé Write Her Own Songs? (here)

This activity is taken entirely from StatsMedic (here) and adapted for Desmos by Richard Hung (find him on Twitter here). StatsMedic is built on a framework of “experience first, formalize later” (EFFL), so their entire curriculum – which they provide for free on their site – is inherently designed to build curiosity. For this particular activity, I’ve edited it a bit to bring in some Allan-inspired questions, like identifying observational units and variables (see post #11, Repeat after me, here). This activity is a variation of Allan’s Gettysburg Address activity (see post #19, Lincoln and Mandela, part 1, here) or the Random Rectangles* activity, and is great for building understanding of sampling bias, random sampling, and sampling distributions.

*I first did the Random Rectangles activity in a workshop conducted by Roxy Peck; it apparently originated in Activity-Based Statistics by Scheaffer et al.

I believe lectures inherently kill curiosity – even a lecture with questions interspersed for this purpose. Students know that eventually you will tell them the answer, and many will sit and wait until someone else does the work. At least in my flipped classroom, these types of activities incentivize my students to go watch those lectures by making them curious enough to want to know more. As a bonus, I can keep referring back to that tangible activity: Remember when you placed your magnet on the class dotplot in the Random Rectangles activity?

2. Building a collaborative and safe learning environment

So, we can present good questions or well-designed activities to ignite that sense of wonder in our students, but we also need the students to feel connected to each other and to us as educators, especially in an online environment. That brings me to my next question: Am I providing opportunities for students to connect with and learn from one another?

In a traditional classroom, these opportunities may happen organically. Students may chat before class or set up study groups, even if our classes don’t explicitly carve out time for collaboration. In an online class, these moments need to be constructed and provided for students. 

Using Google Slides with breakout rooms in Zoom is my go-to for collaboration between students in an online environment. For those of you unfamiliar with Google Slides, it is essentially Google’s version of PowerPoint. The bonus is that you can get a shareable link that allows anyone to edit the slides – even if they don’t have a Google account! They just have to click the link, and then they are editing simultaneously. My typical setup is to create a slide for each group within one shared presentation. The slides contain the instructions about what the students should add to the slide to complete the activity. Here are a few of the activities I’ve already used in class:

a) Personality Coordinates

This activity is an ice-breaker – before you roll your eyes, let me finish! – where students put their names on four points and then have to work together to label the X and Y axes. I personally can tolerate this particular ice-breaker because it serves as a needed review of a coordinate plane that I can reference again when we start looking at scatterplots. You can read more about this activity where I originally found out about it on Dan Meyer’s blog (here).

In the image below, you’ll see circles representing students on slides of the presentation, and the highlighted areas are what students are working on. Slides make it easy to check at a glance that students are making progress and let you know which groups you should check in on. There’s even a comment feature, so you can provide quick feedback without being too intrusive. If you want to know more about how I ran this activity, check out this Twitter thread (here), where I provide the links to the slide deck and instructions I presented before putting students in breakout rooms.

b) Sampling IRL

This particular activity is a discussion board in my fully online asynchronous class. However, in my synchronous class that meets on Zoom, I saved myself a lot of grading by creating a slide deck in the same vein. On day 1, students worked with a group to fill in a slide with how they would attempt to collect a sample from a given population (students at my college, students at all area community colleges, Starbucks customers, adults in our city).

Based on timing, the second half of this activity happened on the following day, which also allowed me to reformat the slides and add new questions. On Day 2, I moved each breakout room to a new slide and they had to answer two questions:

  1. Could you follow the sampling scheme that the group laid out? If not, what is unclear?
  2. Are there any groups of people who might be left out based on their sampling scheme? Who are they? What type of people from the population will be under/over represented?

In this particular example, I didn’t reinvent anything; I just took an existing prompt and turned it into a collaborative activity by having students answer these questions in groups. And again, the added bonus was that I only needed to grade 8 slides as opposed to 32 discussion posts!

I have loved using this type of activity in my classes. Previously I did a lot of similar activities in face-to-face classes, utilizing giant post-its or just the whiteboards around class. I do like that Google Slides allows these contributions to be saved so we can come back to them later. Here are some things I’ve found that help this run smoothly:

  • Provide roles for the breakout rooms – students don’t have to use them, but it sets expectations. You can see my slide with roles below:
  • Emphasize that someone must share their screen in the breakout rooms. I say this at least three times before opening breakout rooms and then broadcast it to all breakout rooms a few minutes in.
  • Aim for twenty minutes as the sweet spot in terms of length.
  • Monitor progress on the slides, and use the comments to give quick feedback.
  • Join each breakout room to check that all members are contributing.
  • Make your instructions the background image, so students don’t accidentally delete the stuff they need.
  • Know how to access version history, in case a student deletes a slide or encounters an equally devastating problem.
  • If you want to run an activity that requires more than one slide per group, use a slide as a landing page (shared as view only) with the edit links to all the group slides:
  • If you’re using Canvas, you can create a Google Cloud assignment (see a video here) to assign the slides to students who missed class. 

3. Connecting with students

Another key to student success is that students feel a connection to you. That brings us to my third question: How can I ensure that students feel connected to me?

For me, it’s about sharing things I’m interested in. I tried a “liquid syllabus” (see here) this semester rather than my traditional welcome letter, but they both contain the same information that is missing from a traditional syllabus:

  • A section about me and my extracurricular interests – which I try to keep varied so that each student might see some small thing we have in common.
  • My teaching philosophy.
  • What a typical week looks like in our course.

I also respond to each student’s introduction in my asynchronous classes. On our first quiz of the semester, I ask all of my students to ask one question about the course, statistics, or myself and tell me something about themselves. I make sure to respond to each and every one. Yes, my first week of classes is a challenge, but I find that connection pays off later. And it never hurts to interject something you’re passionate about into your lectures and examples – much like Allan, most of my examples are about cats (see blog post #16, Questions about cats, here), and my Canvas pages are adorned with cats too.

4. Creating a safe place for mistakes

If you creep on my welcome site for students, you would see this section: “My course is built on the idea that we aren’t perfect the first time we do something and those mistakes are how we improve and learn. Every assignment (with the exception of exams) can be redone after you receive some guidance from me on how to improve it. There are multiple ways for you to demonstrate your understanding – discussions, projects, exams, creative assignments… If you’ve struggled in a traditional classroom, I hope we’ll find a way to get through this together.” This brings me to my next question: How am I demonstrating to students the value in making mistakes?

I don’t know about you, but I have countless students who are frozen into inaction by their fear of failure. Students that I know understood the material will turn in tests with blank pages. When I ask them about it, they profess that they just weren’t sure they were on the right track. I try to demonstrate how useful mistakes are with my policies (see above), as well as in how I highlight student work and respond to students. I try to bring up “good mistakes” in class or in video debriefs, focusing on the thinking that led the student to that answer and all the understanding that their work shows. I hope that by applauding those efforts and working hard to build those connections with and between students, they will be more willing to share their thinking without fear.*

* This letter from a former student shows that I’m on the right track, but I need to add a question about this to my end-of-semester survey to make sure all students feel this way.

5. Assessing understanding

Online assessments are a tricky beast. It’s nearly impossible to be sure our students are the ones taking our assessments and that they are doing so without some outside aid. I feel like I have to include this section because it’s the most common question I get from faculty: How can I make sure my students aren’t cheating? Short answer: you can’t. So here’s the question to ask yourself: Are exams the best way to assess student knowledge?

Consider projects or other tasks where students can demonstrate that they understand the course content. Projects have the added bonus of letting students see how statistics is actually used to answer questions, how it is relevant to what they are interested in, and how it connects to the other courses they are taking. I personally do a variation on the ASA Project Competition (here), where students can either submit a written report or record a presentation.

I still have exams, too. I’ve just lessened their weight so that students don’t have any real incentive to cheat. And I have embraced open-ended questions. For years, I avoided these types of questions because they were harder to grade and truly required students to have better understanding and communication skills than the same question cleverly written as multiple choice. On my latest exam, here’s one of the options for a particular question pool:

Many colleges were scrambling to provide resources for students with the switch to remote learning. They surveyed students by reaching out via the students’ listed email addresses to see what resources they would need to continue to attend classes in the switch to online. Do you believe this is a good survey technique? Explain why or why not. What are some issues that may arise from this survey option?

Four years of reading the AP Statistics exam has trained me not to fear reading free response questions like the one above. Even three years ago, I’d probably be shaking in my boots at the prospect of grading over a hundred free response questions on a given exam. I cannot emphasize enough how useful participating in the AP reading has been for me as an educator. Empowered by that experience, my “complete” student response to the question has four components:

  1. States that the voluntary response method described is not a good technique.
  2. Notes and provides a reason students may not be included in the survey responses – such as they choose not to take it, don’t check their email, or …
  3. Notes that students without resources are less likely to respond to the survey.
  4. Concludes that the schools will underestimate the resources needed as a result of (3).

Much like an AP scoring rubric, students must get component 1 in order to earn any points for the problem. And for full credit, they must include all four components. If you’re looking for some great questions, beyond those that Allan has provided us here over the past year, previous AP Statistics free response questions are a great place to get inspiration as you write assessments and corresponding rubrics*.

* StatsMedic has very helpfully categorized all of these questions by topic here.
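The gating logic of a rubric like the one above (component 1 required for any credit, all four components for full credit) can be sketched as a tiny scoring function. This is just an illustrative assumption about the partial-credit scheme in between, not the author’s actual rubric:

```python
def score_response(components):
    """Score a four-component free-response answer, AP-rubric style.

    components: the set of component numbers (1-4) the student earned.
    Component 1 gates all credit; all four components earn full credit.
    Partial credit in between (one point per component) is an assumption.
    """
    if 1 not in components:
        return 0  # no credit without the essential first component
    return len(components)  # 4 = full credit

# a student who states the flaw (1) and gives one reason (2):
print(score_response({1, 2}))  # prints 2
```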

6. The Real Question

All of the questions I’ve asked you to reflect on throughout this post come down to a common theme: Am I reaching ALL of my students?

I’m lucky enough to work at a campus that has provided me with data on my classes’ success rates disaggregated by gender, age, and ethnicity. I know which groups I need to work harder to reach. If possible, get these data from your school. If not, have students self-report and then see if you notice any trends throughout the semester/year. If you’re new to the idea of culturally responsive teaching, I strongly recommend Zaretta Hammond’s Culturally Responsive Teaching and the Brain – it’s a great mix of research, practical tips, and reflection.

I hope you found something you can use in your classrooms in this post. Take what works for you, leave what doesn’t. And keep continuously reflecting on your own teaching practices.

Here are Allan’s own words (from post #52, Top thirteen topics, here), because I think they bear repeating: “I know that if I ever feel like I’ve got this teaching thing figured out, it will be time for me to retire, both from teaching and from writing this blog.”

This is my mantra*. Keep reflecting on your choices. Keep trying new things. Keep being fearless. Hopefully along the way, we’ll do better for all of our students.

* Minus the blog part, because I have no idea how he did this for 52 weeks!

#59 Popularity contest

This guest post has been contributed by Anna Fergusson. You can contact Anna at

Anna Fergusson is a Professional Teaching Fellow in the Department of Statistics at the University of Auckland.  I met Anna at the 2019 Joint Statistical Meetings, where she gave a terrific talk about introducing statistics students to data science, which is the topic of her Ph.D. research.  I admit that part of the appeal of Anna’s presentation was that her activity involved photos of cats.  But more impressive is that Anna described a fascinating activity through which she introduces introductory students to modern computational tools while emphasizing statistical thinking throughout.  I am delighted that Anna agreed to write this guest post about her activity, which also highlights her admirable and effective “sneaky” approach to student learning.  I also encourage you to follow Anna’s blog, with the not-so-subtle title of Teaching Statistics is Awesome and which has become one of my favourites*, here.

* I am using this non-conventional (for Americans) spelling in appreciation for Anna’s becoming my first guest contributor from outside the U.S.

I am thrilled to write this week’s guest post, not just because I get to add another activity to Allan’s examples of “stats with cats” (see post #16 here), but also because I strongly believe in asking good questions to guide students to discover “new-to-them” ideas or methods.

A current focus for my teaching and research is the design of accessible and engaging learning activities that introduce statistics students to new computational ideas or tools.  For these “first exposure” type learning tasks, I use What if..? style questions to encourage curiosity-driven learning. I also use the “changing stuff and seeing what happens” approach for introducing computational concepts, rather than starting the task with formal definitions and examples.

It’s an approach that has been described by both students and teachers as “sneaky,” but I think that it is a pretty good strategy for designing tasks that support the participation of a wide range of students. To pull off this undercover approach, you need a good cover story – something that is engaging, interesting and fun! A really “popular” task I have used to introduce APIs (Application Programming Interfaces) for accessing data involves searching for photos of cats and dogs online. I’ve tried out several versions of this task over the last few years with a range of school-level students and teachers, but this particular version of the task is from the introductory-level university course I’ve designed for students who have not completed Grade 12 mathematics or statistics. The overall question for the exploration is: What is more popular on Pixabay – photos of cats or photos of dogs?

I usually start the activity by asking students: What is your favourite type of animal, cats or dogs? I would like to say that there is a deeper learning point being made here, for example getting students to acknowledge their own personal biases before they attempt to learn from data, but really I ask this question so I can pretend to be offended when more students state that they prefer dogs to cats! And also so I can use this meme:


I then ask students to go to Pixabay and explore what they can find out about whether photos of cats or dogs are more popular on this website. The only direction I give students is to make sure they have selected “photos” when they search and to point out that the first row of photos is sponsored. I encourage students to work in pairs or small groups for this activity.

While finding pretty adorable photos of cats and dogs, students are familiarising themselves with the website and what data might be available for analysis, which will come in handy later in the task. It also helps that popularity metrics such as likes and views are already familiar to students thanks to social media. I generally give students about five minutes to explore and then ask groups to share with the class what they have learned about the popularity of cat and dog photos, including what their “hunch” is about which animal is more popular on Pixabay.

There are a lot of approaches that students can take to explore and compare popularity, and it’s helpful to have some questions up your sleeve to ask each group as they share what they learned. For example, one approach is to determine how many photos are returned when you search for “cat” and compare this to the number of photos that are returned when you search for “dog”. You can ask students who use this approach What happens when you search for “cat” compared to “CAT” compared to “cats”? Students may or may not have noticed that their search terms are being “manipulated” in some way by the website.

Another good question is: Were all the photos returned the kind of “cat” that you expected? This can lead into a discussion about how photos are uploaded and given “tags” by the photographer, and whether the website checks whether the tags are appropriate or correct. Most students discover that if you hover over a photo returned in the search query, you can see some metrics associated with the photo, such as its top three tags and the number of likes, favourites and comments the photo has (see an example below).

To encourage students to think about how the photos are ordered in the search results, I ask students: What photos are being shown to you first when you search for “cat”? Can you spot a pattern to the order of the photos? Initially, students might think that it is just the number of likes (the thumbs-up count) that is determining the order, but if they look across the first 20 or so photos, they should notice that the pattern of decreasing like counts as you move “down the rank” doesn’t always hold.

I also prompt discussion about the nature of the “metrics” by asking: What is another reason why one photo might have more likes than another photo? Clearly, you can’t like a photo if you’ve never viewed it! Additionally, some photos may have been on the website for longer than others, and some of these variables require more effort on the part of the “searcher” than others, e.g., viewing a photo versus liking a photo.

This phase of the task works well because students are exploring data, generating questions, and integrating statistical and computational thinking, all without any requirements to perform calculations or write precise statistical statements. However, there is only so much you can learn from the website before needing a way to access more of the data faster than viewing each photo individually. Fortunately, Pixabay offers an API service to access photos and data related to the photos (you can find the documentation about the API here).

Don’t know anything about APIs? Don’t worry, neither do my students, and in keeping with my sneaky approach, we’re not going to jump into the API documentation. Instead, I ask students to pay attention to the URL when they search for different photos. I then use a sequence of questions to guide students towards structuring an API request for a particular search:

  • What do you notice changes about the URL each time you try a new search?
  • Can you change the photos searched for and displayed on the page by changing the URL directly?
  • Can you work out how to search for “dog costume” by changing the URL rather than using the search box?

For example, the screenshot below shows that the URL contains fixed information like “photos” and “search” but the last part changes depending on what you search for:

Through this sequence of questions, students start to notice the structure of the URL, and they also learn just a little bit about URL encoding when they try a search based on two words. For example, a search for “cat costume” will result in (1) cute photos of cats, but also (2) a URL where the spaces have been replaced with “%20”:
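The URL hacking students do by hand can also be mimicked in code. Here is a sketch using Python’s standard library; the `pixabay.com/photos/search/` path follows the URL pattern described above:

```python
from urllib.parse import quote

def search_url(term):
    """Build a Pixabay photo-search URL, percent-encoding the search term."""
    return f"https://pixabay.com/photos/search/{quote(term)}/"

print(search_url("cat"))          # https://pixabay.com/photos/search/cat/
print(search_url("cat costume"))  # the space becomes %20: .../cat%20costume/
```

Running `search_url("cat costume")` shows exactly the “%20” substitution students notice in the browser’s address bar.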

I then ask students to find a photo of a cat or a dog that they really like and to click on this photo to open its webpage. I then use a sequence of questions to guide students towards structuring an API request for a particular photo:

  • What do you notice about the URL for a specific photo?
  • How is it different from the URL when we were searching for photos?
  • Which part do you think is the ID for the photo?
  • What happens if you delete all the words describing the photo and leave just the ID number, such as:
  • Is there a photo that has an ID based on your birth date?
  • What was the first photo uploaded to the website?
  • How could we randomly select one photo from all the photos on Pixabay?

That last question is a sneaky way to bring in a little bit of discussion about sampling frames, which will be important later in the task if/when we discuss inference.

Once students have played around with changing the URL to change what is displayed on the webpage, I congratulate them on becoming “URL hackers.” Now it’s time to look more closely at what data about the photo is available on its webpage. I typically ask students to write down all the variables they could “measure” about their chosen photo. Depending on time, we can play a quick round of “Variable Boggle,” where each pair of students tries to describe another variable that no other pair has already described before them.

I then tell the students that the Pixabay API is basically a way to grab data about each photo digitally rather than us copying and pasting the data ourselves into a spreadsheet, and that to get data from the API we have to send a request. I then introduce them to an app that I have developed that allows students to: (1) play around with constructing and testing out Pixabay API requests, and (2) obtain samples of photos as datasets.
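To make the “send a request” step concrete, here is a sketch of how such API requests could be assembled with Python’s standard library. Pixabay’s API requires a personal key, so `PIXABAY_KEY` below is a placeholder, and no request is actually sent:

```python
from urllib.parse import urlencode

API_BASE = "https://pixabay.com/api/"
PIXABAY_KEY = "YOUR_API_KEY"  # placeholder: register with Pixabay to get one

def photo_request(photo_id):
    """Build an API request URL for a single photo by its id."""
    return API_BASE + "?" + urlencode({"key": PIXABAY_KEY, "id": photo_id})

def search_request(query):
    """Build an API request URL for a keyword search."""
    return API_BASE + "?" + urlencode({"key": PIXABAY_KEY, "q": query})

print(photo_request(2536662))  # .../api/?key=YOUR_API_KEY&id=2536662
print(search_request("cat"))   # .../api/?key=YOUR_API_KEY&q=cat
```

Changing `id=` to `q=` is a one-character edit, which is exactly the move students make in the API explorer later in the task.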

The app is available here.  Clicking on the top left button that says “API explorer” takes you to the screen shown below:

The API explorer is set up to show a request for an individual photo/image, and students only need to change the number to match the id of the photo they have selected. When they send the request, they will get data back about their photo as JSON (JavaScript Object Notation). As students have already recorded the data about their photo earlier in the task, they don’t seem to be intimidated by this new data structure. I then ask students to compare what we could view about the photo on its webpage with the data we can access about each photo from the API, asking: What is the same? What is missing? What is new?

For example, a comparison of the information available for a photo on the webpage and the JSON returned for an individual photo reveals that only the first three tags about a photo are provided by the API, that the date the photo was created is not provided, and that a new variable called imageSize is provided by the API:

Reminding them of earlier discussion about how long a photo has been online for, I point out that the date the image was uploaded is not directly available from the API (if students have not already identified this is missing when sharing the similarities and differences between data on the webpage and data from the API). I ask them: Is there another variable about the photo that we could use to estimate how long the photo has been online? Do any of these variables appear to contain date information? Once we’ve narrowed it down to two potential candidates – previewURL and userImageURL – I ask students to compare the dates shown in the URL to the date uploaded on the webpage for the photo. This mini-exploration leads to a discussion that we could use the date from the previewURL to estimate the date the photo was uploaded, and that while the dates don’t always match up, the date from previewURL appears to be a reasonable proxy.
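The previewURL-as-proxy idea is straightforward to automate once students spot the date in the path. A sketch, using a hypothetical previewURL that follows the usual `cdn.pixabay.com/photo/yyyy/mm/dd/...` shape:

```python
import re
from datetime import date

def upload_date_proxy(preview_url):
    """Extract the yyyy/mm/dd path segment from a previewURL as a proxy
    for the photo's upload date (the two don't always match exactly)."""
    m = re.search(r"/photo/(\d{4})/(\d{2})/(\d{2})/", preview_url)
    if m is None:
        return None
    year, month, day = (int(g) for g in m.groups())
    return date(year, month, day)

# hypothetical previewURL following the usual CDN pattern:
url = "https://cdn.pixabay.com/photo/2017/07/25/01/22/cat-2536662_150.jpg"
print(upload_date_proxy(url))  # prints 2017-07-25
```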

One of the limitations of the Pixabay API is that you only get a maximum of 500 results for any request. You do have a choice of ordering the results in terms of popularity or date uploaded, and for my app I have chosen to return the results in terms of popularity (hence the title of the activity!). To help students discover this and also a little more about how JSON is structured, we can use the API explorer to search photos based on a keyword. To connect back to our initial search for “cat” or “dog”, I tell students they can change the API request from “id=” to “q=” to search for photos based on a key word or words. I ask them to use the API explorer to search for photos of cats, and to compare the first three results from their API request (q=cat) to the first three results from searching for “cat” on the Pixabay website (see screenshots below).

Now that we’ve learned a little about how we can use the Pixabay API to access data about photos, it’s time to refocus on our overall question: What is more popular on Pixabay – photos of cats or photos of dogs? To do this, we’ll use another feature of the app that allows students to obtain random samples of the most popular photos. I direct students to use the app to take a random sample of 100 cats and 100 dogs from the most popular photos on Pixabay, and the app then displays all the photos in the sample on the left side of the screen:

The interface is designed to allow for a new categorical variable to be created, based on dragging the photos across the page in two groups (see later for examples of explorations of this nature). For this exploration, we don’t need a new categorical variable because we searched for photos of dogs and cats, and the search term used is one of the variables. To use all the photos under “No group” students need to re-label the “No group” heading to something else like “All.” Clicking the “Show data table” button allows students to see the data about each photo as a rectangular data structure (each row is a different photo):

Clicking the “Get links to data” button allows students a quick way to “jump with the data” into an online tool for exploring the data, as well as the option to download the data as a CSV file. I use this task with students after they have already used a tool like iNZight lite (here) to explore data. This means I can just ask my students to use the data to check their hunch about whether photos of cats or dogs are more popular on Pixabay, and give them time to explore their data with their partner/group. Similar to earlier in the task, after about 10 minutes I ask the different pairs/groups of students to share what they have learned. Most groups make plots comparing likes by the search term, as shown here:

Some students create a new variable, for example the number of likes per days online, and compare this for the cat and dog photos in the sample, as below:
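This kind of derived variable is easy to compute once the data are downloaded as a CSV. A Python sketch: the column names and all values below are assumptions for illustration, not the app's actual export format (days online would itself be derived from the previewURL date).

```python
import csv
import io
from statistics import median

# Hypothetical CSV export; column names and values are made up for illustration.
raw = """search_term,likes,days_online
cat,1432,1800
cat,961,900
cat,205,350
dog,1210,2100
dog,844,700
dog,95,120
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Create the derived variable: likes per day online.
for row in rows:
    row["likes_per_day"] = int(row["likes"]) / int(row["days_online"])

# Compare the two groups with a simple summary statistic.
for term in ("cat", "dog"):
    values = [r["likes_per_day"] for r in rows if r["search_term"] == term]
    print(term, round(median(values), 3))
```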

Depending on where the class is at in terms of learning about sample-to-population inference, we can talk about more formal approaches for comparing the popularity of cat and dog photos on Pixabay. An important aspect to that discussion is that the population is not all photos on Pixabay, but the most popular photos as determined by Pixabay using some sort of algorithm unknown to us.

The activity ends with asking students to carry out their own exploration to compare the popularity of two types of photos on Pixabay. The huge advantage we have with introducing an API as a source of data to students, and providing an app that allows easy access to that API, is that students get to choose what they want to explore. By using an API connected to a photo-sharing website with search capabilities, students also have a way of building familiarity with the data before accessing the data set. Beyond comparisons of popularity, other interesting investigations involve using what is shown in the photo to create a new categorical variable. For example, I’ve had students explore whether most photos of dogs are outside shots (see earlier discussion and screenshot of creating new categorical variables using the popularity contest app). Other interesting research questions from students have included: Are most of the popular Pixabay photos tagged as “cat,” photos of domestic cats?

Often my students form their “hunch” for a research question based on viewing the first 20 or so photos from the website search. Then they are surprised not to find a similar result when taking a random sample of popular photos. I think there’s something nice in this idea of not jumping to conclusions from searches generated by an algorithm designed to give prominence to some photos over others! My students have also written about how the task helps expand their ideas of where they can get data from and makes them more aware of how much data is being collected from them as they interact with websites.

I commented at the beginning of this post that tasks like these have been described by others as “sneaky.” I’ve also been accused of tricking students into learning because I made the activities so much fun. In fact, my students’ enjoyment continues even when I extend this task to introduce them to using R code to interact with Pixabay photos and the API. I say “even” because so many of my students have pre-determined negative views about learning computer programming, so they really are genuinely surprised to find that the experience of “coding with data” can be fun. Especially if you use a “cover story” of creating memes, using Pixabay photos as a sneaky way to learn about arguments for functions!

When we design activities that introduce students to new computational ideas or tools, it’s only natural to make the “new thing” the star of the show. Although the overall learning goal of this task is to introduce students to some new ideas related to APIs, the immersive experience of searching for photos to find out whether cats are more popular than dogs is the real star of every act of this show. By structuring and asking good questions to drive learning rather than focusing on formal definitions initially, I believe a wide range of students are supported to engage with the many statistical and computational ideas that they discover along the way. What else makes this task successfully sneaky? Cats, of course, lots and lots of photos of cats!

#58 Lizards and ladybugs: illustrating the role of questioning

This guest post has been contributed by Christine Franklin.  You can contact Chris at

Chris Franklin has been one of the strongest advocates for statistics education at the K-12 level for the past 25 years.  She has made a tremendous impact in this area through her writings and presentations, and also with her mentorship and leadership on individual levels.  Her work includes the PreK-12 GAISE report (here), the Statistical Education of Teachers report (here), and a college-level textbook (here).  Chris also served as Chief Reader of the AP Statistics program.  Chris is retired from the Statistics Department at the University of Georgia, and she currently serves as the inaugural K-12 Statistical Ambassador for the American Statistical Association (read more about this here).  I am very pleased that Chris agreed to write this guest blog post about the role of questioning described in the forthcoming revision of the PreK-12 GAISE report.

It has been my great fortune to be part of the writing teams for both the Pre-K-12 GAISE Framework published in 2005 (here) and the soon-to-be published Pre-K-12 GAISE II (tentatively planned for autumn release 2020)*. The GAISE Framework of essential concepts is built around the four-step statistical problem-solving process: formulate statistical investigative question, collect/consider data, analyze the data, and interpret the results.  This framework involves three levels of statistical experience, with Level A roughly equivalent to elementary, B to middle, and C to high school. Question-posing throughout the statistical problem-solving process and at each of the progressive levels is essential:

* The GAISE II writing team, which also developed the examples presented in this post, includes Anna Bargagliotti (co-chair), Pip Arnold, Rob Gould, Sheri Johnson, Leticia Perez, and Denise Spangler.

This four-step statistical problem-solving process typically begins with formulating a statistical investigative question. When analyzing secondary data from an available source, the process might start with considering the data. The problem-solving process is not linear, and it is important to interrogate continuously throughout analyzing the data and interpreting the results. Posing good questions and knowing when to question is a skill that we must constantly hone. The GAISE II report presents 22 examples across the three levels to illustrate the necessity of being able to reason statistically and to make sense of data. Key within all these examples is the role of questioning. I will present two of my favorite examples from GAISE II to illustrate the crucial role of questioning.

Example 1: Those Adorable Ladybugs

1. Formulate Statistical Investigative Questions

One of the new, more science-focused investigations presented at Level A in GAISE II is about ladybugs. With beginning students, teachers might provide guidance when coming up with a statistical investigative question, the overarching question that begins the investigation. As students advance from Level A to Level B, they take more ownership in posing questions throughout the process. A statistical investigative question a student might pose asks for a summary, such as: What does a ladybug usually look like? or How many spots do ladybugs typically have? The statistical investigative question the student poses might also be comparative, such as: Do red ladybugs tend to have more spots than black ladybugs? Questions for this step of the process are shown here:

To answer these questions, we need to observe some ladybugs. Students might collect them outdoors. Teachers can also mail-order live ladybugs. An alternative is to use photo cards that allow students to observe a variety of ladybugs:

2. Collect/Consider Data – Data Collection Questions

To answer the statistical investigative questions posed by the students, data collection questions are developed. Some examples are given in the figure below:

These questions collect data for one numerical variable (number of spots) and two categorical variables (color of body and color of spots).  Collecting data requires careful measurement and even at this level, students will have to wrestle with questions such as: What is a spot versus a blemish? The class needs to agree upon some criteria as to what constitutes a spot. For example, they might decide not to count spots that are on the margins of the elytra, which is the hard wing cover.

How might young students organize the data? They could use data cards to organize the variable values for each ladybug, where each data card represents a case (the ladybug), as shown above. These physical data cards can help beginning students develop an understanding of what a ‘case’ is, a challenging concept even for advanced students. The students might next create a table, also as shown above.
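The data-card idea maps directly onto how data are stored computationally: each card is one record, each label on the card is a variable. A sketch with made-up ladybug values:

```python
# Each data card is one case (one ladybug); each key is a variable.
# The values below are made up for illustration.
ladybugs = [
    {"ladybug": 1, "spots": 7, "body_color": "red",   "spot_color": "black"},
    {"ladybug": 2, "spots": 2, "body_color": "black", "spot_color": "red"},
    {"ladybug": 3, "spots": 9, "body_color": "red",   "spot_color": "black"},
]

# The table view: one row per case, one column per variable.
for card in ladybugs:
    print(card["ladybug"], card["spots"], card["body_color"], card["spot_color"])

# A summary toward "Do red ladybugs tend to have more spots than black ones?"
red_spots = [card["spots"] for card in ladybugs if card["body_color"] == "red"]
print(sum(red_spots) / len(red_spots))  # 8.0
```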

3. Analyze the Data – Analysis Questions

How do the students now make sense of the data?  Beginning Level A students might use a picture graph that allows each ladybug to be identified. As students advance to Level B, they can use a dotplot. Teachers should support Level A students in thinking about the distribution and asking analysis questions. Analysis questions might prompt different representations or prompt the need for different data collection questions.  This step is depicted here:

4. Interpret – Connecting to the Statistical Investigative Question

As the analysis questions are answered, the results of the data analysis aid in answering the statistical investigative question(s). Level A students are not expected to reason beyond the sample, and the teacher should encourage the students to state their conclusion in terms of the sample. Some possible student responses are shown here:

The ladybug investigation allows students at a young age to experience the statistical problem-solving process, recognize the necessity of always questioning throughout the process, and learn how to make sense of data by developing understanding of cases, variables, data types, and a distribution. These young students can also begin to experience that questioning through the statistical problem-solving process is not necessarily linear – a typical upper-end Level B and Level C experience, as illustrated with the following example.

Example 2: Those Cute Lizards

As students transition from Level B to Level C, they are becoming more advanced with the types of questions posed throughout the statistical problem-solving process, considering datasets that are larger and not necessarily clean for analysis, and using more tools and methods for analyzing the data.

1. Formulate Statistical Investigative Questions

Suppose students in a science class are investigating the impact of human development on wildlife. In an earlier analysis of a small pilot dataset, the students concluded that lizards in “disturbed” habitats (those with human development) tended to have greater mass than lizards in natural habitats. This led the students to pose and investigate the following question: Can a lizard’s mass be used to predict whether it came from a disturbed or a natural habitat?

2. Collect/Consider Data – Data Collection Questions

The students searched for available data that might help answer this statistical investigative question. They found a dataset where a biologist randomly captured individual lizards of one species across these two different habitats on four islands in the Bahamas (see research article here). The biologist found 81 lizards from natural habitats and 78 from disturbed habitats and recorded measurements on several different variables, as shown here:

Students should explore and interrogate the dataset, asking what variables are included, what unit of measurement was used for each variable, and whether the variables will be useful and appropriate for answering the statistical investigative question. If the data are reasonable for investigating the posed statistical question, then the students will move to the analysis stage. If the data are not reasonable, they need to search for other data.

3. Analyze Data – Analysis Questions and Interpret

Recall the initial statistical investigative question:  Can a lizard’s mass be used to predict whether it came from a disturbed or a natural habitat?

Students at Level B/C might first consider the distribution of mass for each of the two groups, asking appropriate analysis questions to compare the characteristics of those groups with respect to shape, center, variability, and possible unusual observations. The dotplots below, created in the Common Online Data Analysis Platform (CODAP, available here), display the distributions of mass (in grams) for the two types of lizards:

Students see considerable overlap in the two distributions but some separation. We want students to recognize that the more separation in the distributions, the better we can predict lizard habitat from mass. In thinking about how they can use these distributions to predict lizard habitat from mass, a student can consider a classification approach by asking: Where would you draw a cutoff line for the two distributions of mass to predict type of habitat?

Students might see a separation of the two distributions at around 6.25 grams, thus proposing the classification rule: If the lizard’s mass is less than 6.25 grams, then classify the lizard as from a natural habitat; otherwise, classify the lizard as from a disturbed habitat. Because of the considerable overlap, many lizards would be mis-classified with this rule. Students can then count the number of mis-classifications with this rule, as shown here:

Students can then create a table/matrix and calculate the mis-classification rate to be 55/159 ≈ 0.346, or 34.6%:

Should we be satisfied with a mis-classification rate of 35%, or can we improve with a different classification rule? We want students to revisit the two distributions of mass and consider finding a different cutoff point that will lower the mistakes made and reduce the mis-classification rate. Students may notice that if the cutoff point is lowered to 5 grams, we will mis-classify a few “natural” lizards but will correctly classify many more “disturbed” lizards:

The mis-classification rate becomes (32+11)/159 = 43/159 ≈ 0.270, or 27.0%, so this new classification rule reduces the mis-classification rate from 35% to 27%. Students can continue to develop other rules that further reduce the mis-classification rate.
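The counting behind these rates is easy to automate, which lets students try many cutoffs quickly. A sketch of the cutoff rule in Python, using tiny made-up samples (the rates quoted above, 55/159 ≈ 0.346 and 43/159 ≈ 0.270, come from the full dataset of 81 natural and 78 disturbed lizards):

```python
def misclassification_rate(masses_natural, masses_disturbed, cutoff):
    """Apply the rule: mass < cutoff -> 'natural', otherwise 'disturbed'.

    Returns (number of errors, total lizards, mis-classification rate).
    """
    errors = sum(1 for m in masses_natural if m >= cutoff)    # natural called disturbed
    errors += sum(1 for m in masses_disturbed if m < cutoff)  # disturbed called natural
    total = len(masses_natural) + len(masses_disturbed)
    return errors, total, errors / total

# Tiny made-up samples (in grams) just to exercise the function; the real
# dataset has 81 natural and 78 disturbed lizards.
natural = [3.1, 4.0, 4.8, 5.5, 6.6]
disturbed = [4.2, 5.8, 6.9, 7.4, 8.0]

for cutoff in (6.25, 5.0):
    errors, total, rate = misclassification_rate(natural, disturbed, cutoff)
    print(cutoff, errors, total, round(rate, 3))
```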

Encourage students to be inventive as they develop more classification rules. They may soon be asking whether other variables in the dataset, in addition to mass, might help in predicting the type of habitat. Thus, they now return to posing another possible statistical investigative question: Can a lizard’s mass and head depth be used to predict whether it came from a natural or disturbed habitat?

Now back at the analysis component of the statistical problem-solving process, a student at Level B/C may first explore the bivariate relationship between the two numerical variables, mass and head depth, by examining a scatterplot. Utilizing output from a web applet in ArtofStat (here), we notice a moderate positive linear relationship between mass and head depth.  A line of best fit to the data yields the equation: predicted mass (grams) = -5.27 + 2.01×head depth (centimeters):

An analysis question at this stage could be: What is the interpretation of the slope 2.01?  Since this is a probabilistic rather than deterministic model, we want students to say: “For each one-centimeter increase in head depth, the mass of the lizard is predicted to increase by 2.01 grams, on average.”
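The slope interpretation can be checked numerically straight from the quoted equation:

```python
def predicted_mass(head_depth_cm):
    """Line of best fit quoted above: mass (g) = -5.27 + 2.01 * head depth (cm)."""
    return -5.27 + 2.01 * head_depth_cm

# The slope is the predicted change in mass for a one-centimeter
# increase in head depth, on average:
print(round(predicted_mass(6.0) - predicted_mass(5.0), 2))  # 2.01
```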

This analysis provides useful information, but it does not allow us to address our statistical investigative question to use mass and head depth to predict whether a randomly chosen lizard is from a natural or a disturbed habitat. How might we refine our analysis to incorporate type of habitat?

Instead of displaying the lizards in the scatterplot together ignoring their type of habitat, we can display the lizards using different symbols for natural and disturbed habitat. This provides a multivariate analysis where we have incorporated a third variable. The following graph displays the output from this analysis with separate lines of best fit for the two habitats:

Now suppose a randomly chosen lizard has mass 3.6 grams and head depth 5.5 centimeters. Would you predict this lizard to be from a natural or disturbed habitat? How would you use the multivariate analysis to make this prediction?

Again, let students explore and try different approaches, asking students to justify their approach statistically. Some student approaches might be:

  1. A graphical approach: Plot the point (5.5, 3.6) on the scatterplot. This point lies closer to the prediction line for natural habitat than to the prediction line for disturbed habitat. This point also falls more within the cluster of points for lizards from a natural habitat than for those from a disturbed habitat.
  2. A computational approach: Evaluate the predicted mass based on a head depth of 5.5 cm for each of the two lines. The predictions turn out to be 5.05 grams for the “disturbed” line and 4.575 grams for the “natural” line. The residuals for these predictions are (3.6 – 5.05) = -1.45 for the “disturbed” group and (3.6 – 4.575) = -0.975 for the “natural” group. Because the residual for “natural” is closer to zero than the residual for “disturbed,” we predict that this lizard came from a natural habitat.
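The computational approach can be sketched in code. The intercepts and slopes below are invented (the actual fitted lines come from the applet output, which does not appear here); they were chosen only so that each line reproduces the predictions quoted in the text at a head depth of 5.5 cm.

```python
# Hypothetical coefficients (intercept, slope) for the two habitat-specific
# lines. These values are invented so that each line reproduces the quoted
# predictions at head depth 5.5 cm (natural: 4.575 g, disturbed: 5.05 g).
lines = {
    "natural":   (-5.875, 1.9),
    "disturbed": (-6.5, 2.1),
}

def classify_by_residual(head_depth_cm, mass_g):
    """Predict habitat by choosing the line whose residual is closest to zero."""
    residuals = {
        habitat: mass_g - (intercept + slope * head_depth_cm)
        for habitat, (intercept, slope) in lines.items()
    }
    return min(residuals, key=lambda h: abs(residuals[h])), residuals

habitat, residuals = classify_by_residual(5.5, 3.6)
print(habitat)  # natural
print({h: round(r, 3) for h, r in residuals.items()})
```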

All of these analyses will result in some mis-classifications. Our goal is to minimize the mis-classification rate. Looking back at the dataset of variables measured on the lizards by the biologist, students might consider whether more variables could be included to improve classification accuracy. Again, students might return in the process to posing a new statistical investigative question: How can different features of a lizard (e.g., head depth, hind limb length, mass) be best used to predict whether it came from a natural or a disturbed habitat?

The analyses we have explored thus far can be generalized to more than two predictor variables, but developing classification rules becomes tedious without the use of computer technology. An algorithm known as Classification and Regression Trees (CART) produces a series of rules for making classifications based on a number of predictor variables. Below is a CART using mass, head depth, and hind limb length to predict type of habitat. The goal is that Level C students understand how to interpret output from the CART algorithm, not learn the details of how the algorithm works.
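Interpreting CART output amounts to reading off a sequence of yes/no splits. A sketch of what such a rule sequence looks like in code; the thresholds below are invented for illustration (the actual tree's split values are not reproduced here):

```python
# A fitted CART is read as a sequence of yes/no splits on the predictors.
# The thresholds below are invented; the actual tree has its own split values.
def predict_habitat(mass_g, head_depth_cm, hind_limb_mm):
    if mass_g < 5.0:
        return "natural"
    if hind_limb_mm < 38.0:
        return "natural" if head_depth_cm < 6.0 else "disturbed"
    return "disturbed"

print(predict_habitat(3.6, 5.5, 35.0))  # natural
```

Level C students are asked to trace a lizard's measurements down exactly this kind of branching structure, not to derive the thresholds themselves.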

Whether you are working with small samples from a population, experimental data, or vast datasets such as those found on public data repositories, questioning throughout the statistical problem-solving process is essential. This process typically starts with a statistical investigative question, followed by a study designed to collect data that aligns with answering the question. Analysis of the data is also guided by asking analysis questions. Constant interrogation of the data throughout the statistical problem-solving process can lead to the posing of new statistical investigative questions. When considering secondary data, the data first need to be interrogated.

The ladybug and lizard examples attempted to illustrate the essential role of questioning throughout the statistical process.  Notice that the ladybug example involves summary and comparative investigative questions, while the investigative questions posed and explored in the lizard example are associative – looking for relationships among two or more variables to aid in making predictions.

Now more than ever, questioning is a vital part of being able to reason statistically. In carrying out the statistical problem-solving process, we want students and adults to always be asking good questions. The Pre-K-12 GAISE II document advocates that this role of questioning begin at a very young age and gain maturity with age and experience.

To conclude with a quote from the GAISE II document: “It is critical that statisticians, or anyone who uses data, be more than just data crunchers. They should be data problem solvers who interrogate the data and utilize questioning throughout the statistical problem-solving process to make decisions with confidence, understanding that the art of communication with data is essential.”

P.S. A file containing the lizard data is available from the link here:

#57 Some well-meaning but misguided questions

This guest post has been contributed by Emily Tietjen. You can contact Emily at

Emily was a student of mine as a statistics major at Cal Poly.  She was an invaluable help to me as an exceptional teaching assistant for several years*.  I was delighted when Emily decided to pursue a teaching career.  She has taught AP Statistics and other math courses at high schools in and near Merced in the central valley of California.  Since the beginning of her teaching career, I have very much enjoyed visiting Emily and her students every spring.  Emily has quickly moved into an administrative role, as she now serves as one of two math coordinators for the county of Merced. In this role she helps teachers throughout the county to teach mathematics (and statistics!) well.  I greatly appreciate Emily’s writing this guest blog post about some questions that she encounters in her position.

* In addition to very helpfully supporting students’ learning, Emily also displayed an indispensable but unteachable quality for a TA: She laughed at my jokes no matter how many times she heard me tell them in different classes over many terms**.

** I’ll be curious to know whether she laughs at this one as she reads for the first time.

The first thing that you should know about me is that I could easily be referred to as a fangirl of both Allan Rossman and Jo Boaler.  I had the distinct privilege of sitting through six years worth of Dr. Rossman’s courses as both a student and as his TA.  Six years included both a statistics degree and a math credential, but honestly, who doesn’t want to spend as much time as they can in San Luis Obispo?

I can confidently say that I gleaned more from sitting through repeated classes from Dr. Rossman than I ever got from any professional development.  Ideas that were intrinsic to his style of teaching, although we never directly discussed his philosophy, are concepts that as a new math coordinator I’m only beginning to have a name for.  Ask good questions?  I used to think that had more to do with the person asking the question: Were they articulate and educated and thoughtful enough to ask a really good question?  What I’ve come to understand is that asking a good question means to give the learner the authority to come to an understanding of a concept through their own intuition.

But asking a good question is intimidating for someone (yes, me) who regularly harbors the feelings of imposter syndrome.  In this post I will pose some well-intentioned but ultimately misguided questions about how students, educators, and adults view mathematics and primary and secondary mathematics education.  I will also discuss why I consider these well-meaning questions to be problematic.

1. Are you a math person?

Many people ask this question of each other and of children.  I have been asked this question often. 

I grew up in what you might describe as a humanities family.  My mom studied English and German and taught both, but primarily German.  My dad and brother both majored in history, read voraciously, and after teaching the subject both went into administration.  I’m like them.  I was a teacher, and I’m now an administrator.  But I was also never quite like them.  It shows in the directness I expect in an answer to a question and in the long (albeit interesting) stories my mom tells before finally getting to the point.  It shows in my ability to remember numbers and solve problems quickly, and in their ability to remember historical events and how those events interweave with one another.  Math always came easily to me, and reading was always quick and comprehensible for them.  Clearly, I’m a math person, right? Wrong.  As a child, I enjoyed puzzles.  My parents praised my efforts.  In school, I liked math, and they constantly reinforced my abilities.  Despite that, each of my elementary teachers (female, for the record) would talk about their favorite subject, which never included math, while I rolled my eyes at the thought that girls could not be as good at math as boys.

Over the years, thanks to many privileges I had, none more powerful than my parents’ faith in me, I took honors and AP math courses with many inspiring teachers.  Even more incredibly, I had two particularly wonderful math teachers, both women, for geometry and AP Statistics.  Both teachers brought math to life.  They made our classes collaborative and relevant to the world around us.  In both, I was asked to collect data from the outside world and apply meaning to what I had gathered.  They gave me manipulatives and visuals and allowed my classmates and me to formulate our own understandings of the math.  They provided context that made the math meaningful to me.  Most of all, I had fun.

On the other hand, in most of my language, literature, and social science classes, teachers overwhelmed me with reading and taught history by having us read chapters aloud from a textbook, each student reading one paragraph at a time, followed by movies that partnered with the time period (yes, Mulan was shown with our unit on Chinese history).  I had a much more meaningful experience in school with math.  And I realize that others have stories like mine but completely in reverse.

The work of Jo Boaler (see her book Mathematical Mindsets and her website here) has brought forth research about how brains learn and grow. Her work demonstrates that there is no research supporting the notion of a “math brain.” Additionally, everyone has the capacity to continue learning any subject. A combination of factors led to my positive experiences with math.  My parents reinforced my ability.  I had teachers who empowered me and my learning. There’s no need for the question of whether or not you’re a math person, because there is no such thing. All students can learn math.

2. What class best meets the needs of the student?

This question is often considered as a student is being placed with a particular teacher or in a particular course.  Will it be “grade level” or “honors” or “remedial” or …? This one is so hard for me.  We want to do the best thing for our students, right? We want to make sure that students who are exceeding expectations are given enrichment and opportunities to accelerate their learning, and that students who are struggling are provided with support and remediation.  That sounds good, right?

I have classified this as a problematic question because, even though it sounds innocent, it’s really about a practice called tracking. The problem is that research doesn’t back this up.  Ability grouping and tracking lead to differential outcomes for students.  At the secondary level, trying to meet students where they are means that teachers spend barely over a third of the year on grade-level material.  When students are given grade-level material, they succeed more often than not, yet most of the time they aren’t given the opportunity.  By tracking a student below grade-level content, a district ensures that those students will never be able to fill the gap between where they are and the grade-level content they deserve to see.  Students can be provided opportunities for advancement without needing to create specialized courses, and they should demonstrate that they have mastered the material before they advance rather than skipping concepts.  (You can read more about tracking issues in the reports here and here and also in the NCTM’s Catalyzing Change book series here.)

Another area where we suffer with this question is our undying race to Calculus in high school.  Too often we focus on how to prepare students to study calculus rather than consider what courses and skills would best serve their overall education and potential career.  The vast majority of jobs in this country will depend on data literacy or statistics, yet statistical topics are typically found in the last chapter of textbooks and treated as content to get to if there is time, which there very rarely is.  Many of the links above also discuss the need for statistics and data literacy in the TK-12* educational system, as well as the problematic nature of tracking.  Understanding data and statistics offers students relevance both to their current lives, through contexts inherent to the subjects they are studying, and to their future careers.  When I was teaching math, students constantly asked when they would use the subject in their “real lives.”  When I was teaching statistics, students never asked that question.

* TK stands for Transitional Kindergarten, a preliminary class to Kindergarten offered to children born in September – December.

Fortunately, efforts are being made to prioritize statistics and data literacy at the TK-12 level.  For example, Jo Boaler and her team have released a set of lessons on data science for grades 6-10 (here), along with an online teacher course (here) on data science and 21st-century teaching and learning.  California university systems have considered adding quantitative reasoning courses to their subject requirements (see here) for applying to their schools (the minimum course requirements to be accepted to public universities in California).  High school courses have been designed to offer a more relevant, equitable math experience that highlights the use of data and statistics.  School districts and states have restructured their pathways to remove the tracking that is prevalent within our educational systems, including the specific addition of a statistics pathway, which leads to more equitable outcomes.

This work must continue, because data literacy is crucial for our society.  We need to understand data in multiple ways: our own personal data are collected on a daily basis, and we mostly have no practical way of knowing how they are used, for good or for bad. On top of that, more often than not, the careers our students will enter will require using and analyzing data.

3. Why do we have to do word problems?

Students often ask this of their math teachers.  I’m imagining my former students’ voices as I consider this topic. Heck, I hear my teenage self still wondering this.

Assigning word problems is sure to create anxiety, at least with the typical way that we approach them.  However, students often struggle with word problems for the wrong reasons. The very prospect of a word problem ignites so much fear that students are hesitant even to read it in the first place. Speed is all too often valued in the classroom and struggle is not, so confronting a word problem asks students to work on a concept they're likely still grappling with, while adding an additional complicating layer. Students must read, decode, create an image or model, and transform that into something they can solve, often under the traditional pressures of the math classroom or, even worse, at home with little to no support. I'd argue that we teachers haven't done a sufficient job of preparing students for these situations.  It doesn't have to feel this way.

For example, we can expose students to a context and help them make sense of it before they even know what the question is. By initially excluding the question, students are relieved of the solution-finding urge that we all too quickly jump to.  One of my favorite routines (see here) asks students to suggest questions that would be mathematically reasonable for the given context before they are presented with the intended question. After students have engaged with the context, free of the time pressure that typical math problems bring, they're able to intuit what could and should be asked. This process gives the problem meaning and helps students understand its value.

When students have had this opportunity, word problems don’t feel so hard.  Word problems should pique interest and provide opportunities to make connections to the world around us.  They give us a reason to do math in the first place.  My assumption is that they feel hard because we feel rushed to solution finding.  Students are infrequently challenged to think slowly about a problem.  The pace of the class is often at the speed by which the first correct answer is given.  Word problems can instill fear and yet I think they’re truly key to making math feel relevant for our students as long as they aren’t arbitrary for the grade level.

For an example of what this might look like, consider the following background information from free-response question #3 on the 2018 AP Statistics exam (here): Approximately 3.5 percent of all children born in a certain region are from multiple births (that is, twins, triplets, etc.). Of the children born in the region who are from multiple births, 22 percent are left-handed.  Of the children born in the region who are from single births, 11 percent are left-handed.

At this point, a class might have a conversation about clarifications they may need for accessing the language used or understanding the context.  Then, the teacher could ask students to come up with a question for the context.  Depending on age (or maturity level), students may ask questions like, “Where do they live?” or “How old are the kids?” Those questions need to be redirected, because we are looking for mathematical questions.  For this context, students may ask, “What is the probability that a student born in the region is right-handed?”  This isn’t the ultimate question asked of students on the AP exam, but having students consider their own questions engages them in the context and gives them ownership of the question.  A class of students will often come up with the intended question after only a few suggestions*.  Pausing to consider other questions will also be helpful to give students insight into other aspects that may be important for solving the problem.  These aspects include what types of variables are present, how the information may be organized or depicted graphically, and what given information may be useful in determining the solution. 

* The first part of this particular AP question asked: What is the probability that a randomly selected child born in the region is left-handed?
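As a quick check of that intended question, the computation is a direct application of the law of total probability. Here's a sketch (the variable names are mine; the percentages come straight from the problem statement):

```python
# Percentages stated in the problem:
p_multiple = 0.035            # P(child is from a multiple birth)
p_left_given_multiple = 0.22  # P(left-handed | multiple birth)
p_left_given_single = 0.11    # P(left-handed | single birth)

# Law of total probability:
p_left = (p_multiple * p_left_given_multiple
          + (1 - p_multiple) * p_left_given_single)
# 0.035*0.22 + 0.965*0.11 = 0.11385
```

The student-suggested question about right-handedness is then just the complement, 1 - 0.11385 = 0.88615.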

4. What does good teaching sound and look like?

Okay, this isn't technically a bad question. Teachers and administrators ponder it year after year, and it continues throughout the career of everyone involved.  It comes into consideration when hiring, when deciding whether a teacher should receive permanent status, and as the years pass, the field evolves, and we learn more about equity and which methods work best.

The problem is that it's very common for people to think of good teaching the way Trunchbull, from the film adaptation of Matilda, thought of an ideal school: as "one in which there are no children at all." Sadly, many teachers and administrators still consider a well-run class to be one filled with students who are silent, speaking only when spoken to, and who sit down and stay there, almost as if they don't get to exist there as people.

We should instead nurture classrooms where students are given the authority to take ownership of their learning because students’ learning is more important than the teaching of lessons.  Teachers should be talking no more than half of the time.  Students should be talking.  Mostly to each other.  They should be positioned in a way where collaboration is convenient and encouraged. 

In my current role, I support mathematics teaching and learning for school districts within Merced county.  My office serves students from twenty school districts as well as our internal programs.  This accounts for about 60,000 students, of whom more than three-quarters are eligible for free and reduced lunch. Relative to the state, we have high rates of poverty and of students who are classified as English learners. My office uses the following framework, developed by my colleague Duane Habecker based on Maslow's hierarchy of needs, to advocate for an effective mathematics program for all students:

  • Material Needs: Every student has a teacher with appropriate mathematics content knowledge and knowledge for teaching mathematics.  Math lessons are rooted in a solid understanding of the standards through rigorous, high-quality curriculum and meaningful tools. 
  • Mindset & Culture: Every student is immersed in a mindset and culture that intentionally communicates all students can learn math at high levels while being responsive culturally and personally in a learning environment that considers each and every student’s unique background, experiences, cultural perspectives, traditions, and knowledge.  Mistakes in mathematics are normalized. Students regularly experience high-quality, grade-appropriate lessons and assignments.
  • Student-Centered Instruction: Every student regularly experiences instruction that is student-centered and designed to maximize students’ use of language. Lessons create space for students to participate in discourse to promote conceptual understanding, which then leads to procedural fluency, problem-solving, and application. 
  • Equitable Assessment: Every student is regularly and humanely assessed in order to understand their own growth and to receive productive feedback for next steps in learning.  Students use the feedback to know where they are in their learning, assess any misconceptions that need to be addressed, and then use the results to drive the next level of learning.

I hope that these well-meaning but misguided questions have illustrated the misdirected focus that many have about how best to support our students in their mathematics education.  When we pigeonhole students into our own fixed beliefs, it’s no wonder that we consistently turn out students who underperform in mathematics as compared with other countries.  I believe we will see incredible growth by making mathematics more relevant to students at all ages, discontinuing the use of ability grouping and tracking, and offering more equitable pathways for college and career readiness.  Focusing on statistics and data science is a necessary and important part of the solution, as this leads to productive and supportive classroom environments and helps students to acquire essential skills for a modern workplace and world.

#56 Two questions to ask before making causal conclusions

This guest post has been contributed by Kari Lock Morgan.  You can contact Kari at

Kari Lock Morgan teaches statistics at Penn State. Along with other members of her family, she is co-author of Statistics: Unlocking the Power of Data, an introductory textbook that emphasizes simulation-based inference. Kari is an excellent and dynamic presenter of statistical ideas for both students and teachers. She gave a terrific presentation about evaluating causal evidence at the 2019 U.S. Conference on Teaching Statistics (a recording of which is available here), and I greatly appreciate Kari’s sharing some of her ideas as a guest blog post.

* I always implore students to read carefully to notice that causal is not casual.

How do we get introductory students to start thinking critically about evaluating causal evidence?  I think we can start by teaching them to ask good questions about potential explanations competing with the true causal explanation.

Let’s start with a generic example. (Don’t worry, we’ll add context soon, but for now just fill in your favorite two group comparison!).  Suppose we are comparing group A versus group B (A and B could be two treatments, two levels of an explanatory variable, etc.).  Suppose that in our sample, the A group has better outcomes than the B group.  I ask my students to brainstorm about: What are some possible explanations for this?  As we discuss their ideas, I look for (and try to tease out) three possible explanations:

  1. Just random chance (no real association)
  2. The A group differed from the B group to begin with (association, but due to confounding)
  3. A causes better outcomes than B (causal association)

This framework then leads naturally into what I think are the two key questions students should ask and answer when evaluating causal evidence:

  • Key question 1: Do we have convincing evidence against “just random chance”?  Why or why not?
  • Key question 2: Do we have convincing evidence against the groups differing to begin with?  Why or why not?

If the answers to both of the above questions are “yes,” then we can effectively eliminate the first two alternatives in favor of the true causal explanation.  If the answer to either of the above questions is “no,” then we are left with competing explanations and cannot determine whether a true causal association exists.   

As teachers of introductory statistics, where do we come in? 

  • Step 1: We have to help students understand why each of these questions is important to ask.
  • Step 2: We have to help students learn how to answer these questions intelligently.

As a concrete example, let’s look at the health benefits of eating organic.  We’ll investigate this question with two different datasets:

The first dataset comes from the National Health and Nutrition Examination Survey (NHANES), a large national random sample.  Our explanatory variable is whether or not the respondent bought anything with the word organic on the label in the past 30 days, and the response variable is a dichotomized version of self-reported health status: poor/fair/good versus very good/excellent.  The sample data are visualized below:

In the sample, 45.9% of organic buyers had very good or excellent health, as compared to only 33% of people who hadn’t bought organic, for a difference in proportions of 0.459 – 0.33 = 0.129. 

In the second dataset, fruit flies were randomly divided into two groups of 1000 each; one group was fed organic food and the other group was fed conventional (non-organic) food*. The longevity of each fly by group is visualized below:

* Fun fact: This study was conducted by a high school student!  The research article is available here.

Organic-fed flies lived an average of 20.31 days, as compared to an average of 17.06 days for conventional-fed flies, giving a difference in means of 3.25 days (which is long in the lifespan of a fruit fly!).

In both of these datasets, the organic group had better outcomes than the non-organic group.  What are the possible explanations?

  1. Just random chance (no real association)
  2. The organic group differed from the non-organic group to begin with (association, but due to confounding)
  3. Eating organic causes better health status/longevity than not eating organic (causal association)

Do we have convincing evidence against alternative explanations (1) and (2)? How can we decide?

As I mentioned above, we teachers of introductory statistics have two jobs for each of these questions: first helping students understand why the question needs to be asked, and then helping students learn how to answer the question.  I’ll address these in that order:

STEP 1: Help students understand why each of the key questions is important to ask – why it’s important to consider them as potential competing explanations for why outcomes may be higher in one group than another.  (This is non-trivial!)

Key question 1: Do we have convincing evidence against “just random chance”?  Why or why not?

Why is this question needed?  We have to take the time to help students understand – deeply understand – the idea of statistical inference, at its most fundamental level.   Results vary from sample to sample.  Just because a sample statistic is above 0 (for example) doesn’t necessarily imply the same for the population parameter or the underlying truth.   This is NOT about illustrating the Central Limit Theorem and deriving the theoretical distribution for a sample mean – it is about illustrating to students the inherent variability in sample statistics.  While this can be illustrated directly from sample data, I think this is best conveyed when we actually have a population to sample from and know the underlying truth (which isn’t true for either of the datasets examined here).

Key question 2: Do we have convincing evidence against the groups differing to begin with?  Why or why not?

Why is this question needed?  We have to take the time to help students understand – deeply understand – the idea of confounding, and why it's dangerous to jump straight to the causal explanation if the groups differ to begin with. If the groups differ to begin with, we have no way of knowing whether this baseline difference or the A versus B distinction is causing the better outcomes.  I think that talking through intuitive examples* and showing real examples with measured data on the confounding variable are both important to help students grapple with this concept.  This reasoning is inherently multivariable, and examples must go beyond the bivariate context.

* See posts #43 and #44 (here and here) for several examples.

In our NHANES organic example, I ask students to brainstorm: How might people who buy organic differ from the non-organic buyers?  Intuition is easy here, and students are good at this!  A common student answer is income, because organic food is more expensive. I respond by showing a real-data visualization of the relationship between eating organic and income, and between income and health status:

The sample data reveal that people who buy organic are richer, and richer people are healthier, so we would expect organic buyers to be healthier, even if buying organic food provided no real health benefit.  This is a concrete example of confounding, one that students can grasp.  Of course, income is not the only difference between people who buy organic and those who don’t, as students are quick to point out.  Given all of the differences, it is impossible to determine whether the better health statuses among organic buyers are actually due to buying organic food, or simply to other ways in which the groups differ. 

The key takeaway is that directly comparing non-comparable groups cannot yield causal conclusions; thus it is essential to think about whether the groups are comparable to begin with.

STEP 2: Help students learn how to reason intelligently about each of the key questions.

Key question 1: Do we have convincing evidence against “just random chance”?  Why or why not?

While we can assess this with any hypothesis test, I strongly believe that the most natural and intuitive way to help students learn to reason intelligently about this question is via simulation-based inference*.  We can directly simulate the values of statistics we would expect to see, just by random chance.  Once we have this collection of statistics, it’s relatively straightforward to assess whether we would expect to see the observed value of the sample statistic, just by random chance. 

* See posts #12, #27, and #45 (here, here, and here) for more on simulation-based inference.

I suggest that we can help students to initially reason about this in very extreme examples where a visual assessment is sufficient:

  • either the value of the sample statistic is close to the middle of the distribution of simulated statistics: could easily see such a statistic just by chance, so no, we don’t have convincing evidence against just random chance; or
  • the value of the sample statistic is way out in the tail: it would be very unlikely to see such a statistic just by chance, so yes, we have convincing evidence against just random chance.

In the case of the organic fruit flies dataset, we can use StatKey (here) to obtain the following distribution of simulated differences in sample means:

We notice that the observed difference in sample means of 3.25 days is nowhere to be seen on this distribution, and hence very unlikely to occur just by random chance.  (The sample statistic is even farther out in the tail for the NHANES dataset.)  We have convincing evidence against just random chance! 

Of course, not all examples are extreme one way or another, so eventually we quantify this extremity with the p-value (a natural concept once we have students thinking this way!), but this quantification can follow after developing the intuition of “would I expect a sample statistic this extreme just by chance?”.    
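For teachers who want students to see this simulation without specialized software, a randomization test takes only a few lines of code. Here is a minimal sketch in plain Python, using small made-up samples rather than the actual fruit-fly data (which had 1000 flies per group):

```python
import random
import statistics

def randomization_test(group_a, group_b, reps=5000, seed=1):
    """Estimate how often random relabeling alone produces a difference
    in sample means at least as large as the observed one."""
    observed = statistics.mean(group_a) - statistics.mean(group_b)
    combined = list(group_a) + list(group_b)
    n_a = len(group_a)
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        rng.shuffle(combined)  # simulate a new random assignment
        sim = statistics.mean(combined[:n_a]) - statistics.mean(combined[n_a:])
        if sim >= observed:
            count += 1
    return count / reps  # approximate one-sided p-value

# Made-up longevity values (days), NOT the actual study data:
organic = [22, 19, 21, 24, 20, 23, 18, 22, 25, 21]
conventional = [17, 16, 18, 15, 19, 16, 17, 14, 18, 16]
p_value = randomization_test(organic, conventional)
```

With samples this cleanly separated, essentially no reshuffling reproduces the observed difference, which is exactly the "way out in the tail" situation described above; StatKey performs the same kind of simulation with a point-and-click interface.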

Key question 2: Do we have convincing evidence against the groups differing to begin with?  Why or why not?

The best evidence against the groups differing to begin with is the use of random assignment to groups.  If the groups are randomly assigned, they should be similar with regard to both observed and unobserved variables!  Although some differences may persist, any such differences are purely random (by definition!).  You can simulate random assignment to convince students of this, which also makes a nice precursor to simulation-based inference!
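That simulation takes only a few lines. The sketch below invents an "income" value for 200 people (the numbers are purely illustrative), repeatedly splits them at random into two groups of 100, and records the difference in mean income; the differences center on zero, which is the point of the demonstration:

```python
import random
import statistics

rng = random.Random(42)

# Invented incomes (in $1000s) for 200 people awaiting assignment:
incomes = [rng.gauss(50, 15) for _ in range(200)]

diffs = []
for _ in range(1000):
    shuffled = incomes[:]
    rng.shuffle(shuffled)  # one random assignment into two groups of 100
    treatment, control = shuffled[:100], shuffled[100:]
    diffs.append(statistics.mean(treatment) - statistics.mean(control))

avg_diff = statistics.mean(diffs)  # close to zero: randomization balances income
```

Any single assignment may show a small income gap, but there is no systematic tilt toward either group, and the same logic applies to variables we never measured.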

Random assignment is not just an important part of study design, but a key feature to check for when evaluating causal evidence.  If my introductory students take only one thing away from my course, I want them to know to check for random assignment when evaluating causal evidence, and to know that random assignment is the best evidence against groups differing to begin with. 

Because the fruit flies were randomly assigned to receive either organic or non-organic food, we have convincing evidence against groups differing to begin with!   For the fruit flies we’ve now ruled out both competing explanations, and are left with the causal explanation – we have convincing evidence that eating organic really does cause fruit flies to live longer!!  Time to go buy some organic food*!!

* If you’re a fruit fly.

Because the NHANES respondents were not randomly assigned to buy organic food or not, it’s not surprising that we do observe substantial differences between the groups, and we would suspect differences even if we could not observe them directly.  This doesn’t mean that buying organic food doesn’t improve health status*, but this does mean that we cannot jump to the causal conclusion from these data alone.  We have no way of knowing whether the observed differences in reported health were due to a causal effect of buying organic food or due to the fact that the organic buyers differed from non-organic buyers to begin with.

* Make sure that students notice the double negative there.

Now I’ll offer some extra tidbits for those who want to know more about questioning causal conclusions.

When thinking about key question #2 about the groups differing to begin with, I want introductory students to understand (a) why we can't make causal conclusions when comparing groups that differ to begin with, (b) that without random assignment, groups will almost always naturally differ to begin with, and (c) that with random assignment, groups will probably look pretty similar.  These concepts are important enough that I try not to muddy them in an introductory course, but in reality it's possible (in some situations) to create similar groups without randomization, and it's also possible to obtain groups that differ even after randomization, just by chance.

Random assignment is not the only way to rule out groups differing to begin with; one could also collect data on all possible confounding variables (hard!) and force balance on them such as with propensity score matching or subclassification, but this is beyond the scope of an introductory course.  If you want to move towards this idea, you could compare units within similar values of an observed confounder (stratification).  For example, in the NHANES example, the organic buyers were healthier even compared to non-organic buyers within the same income bracket:

However, while this means the observed difference is not solely due to income, we still cannot rule out the countless other ways in which organic eaters differ from non-organic eaters.   We could extend this to balance multiple variables by stratifying by the propensity score, the probability of being in one group given all measured baseline variables (it can be estimated by logistic regression).  While this is a very powerful tool for making groups similar regarding all observed variables, it still can’t do anything to balance unobserved variables, leaving random assignment as the vastly superior option whenever possible.
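To make the stratification idea concrete in code, here is a toy sketch with invented records (not the NHANES data): we compare the proportion reporting good health between organic and non-organic buyers separately within each income bracket.

```python
from collections import defaultdict
from statistics import mean

# Invented records (NOT the NHANES data): (income bracket, bought organic?, healthy?)
records = [
    ("low", 1, 1), ("low", 1, 0), ("low", 0, 0), ("low", 0, 0), ("low", 0, 1),
    ("high", 1, 1), ("high", 1, 1), ("high", 1, 0), ("high", 0, 1), ("high", 0, 0),
]

# Collect outcomes by (stratum, exposure):
outcomes = defaultdict(list)
for bracket, organic, healthy in records:
    outcomes[(bracket, organic)].append(healthy)

# Within-stratum differences in the proportion reporting good health:
within = {
    bracket: mean(outcomes[(bracket, 1)]) - mean(outcomes[(bracket, 0)])
    for bracket in ("low", "high")
}
```

If a positive difference persists within every bracket, income alone cannot explain the association, although other unmeasured differences still can.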

While random assignment creates groups that are similar on average, in any particular randomization groups may differ just due to random variation.  In fact, my Ph.D. dissertation was on rerandomization – the idea that you can, and should, rerandomize (if you do it in a principled way) if randomization alone does not yield adequate balance between the groups.  In an introductory course, we can touch on some classical experimental designs aimed to help create groups even more similar than pure randomization, for example, by randomizing within similar blocks or pairs.  One classic example is identical twin studies, which I can’t resist closing with because I can show a picture of my identical twin sons Cal and Axel in their treatment and control shirts!

Questioning causal evidence involves evaluating evidence against competing explanations by asking the following key questions:

  1. Do we have convincing evidence against “just random chance”?  Why or why not?
  2. Do we have convincing evidence against the groups differing to begin with?  Why or why not?

By the time students finish my introductory course, I hope that they have internalized both of these key questions – both why the questions need to be asked when evaluating causal evidence, and also how to answer them.

P.S. Below are links to datafiles for the examples in this post:

#55 Classroom assessment with clicker questions

This guest post has been contributed by Roxy Peck.  You can contact Roxy at

I consider Roxy Peck to be one of the most influential statistics educators of the past 30 years.  Her contributions extend beyond her widely used and highly regarded textbooks, encompassing the teaching and learning of statistics at secondary and undergraduate levels throughout California, the United States, and beyond.  Roxy has been an inspiration and role model throughout my career (and for many others, I’m sure). I greatly appreciate Roxy’s taking the time to write this guest post about the use of clicker questions for classroom assessment.

Asking good questions is key to effective and informative assessment. Faculty use tests and quizzes to help them assess student learning, often for the purposes of assigning course grades. In post #25 of this blog (Group quizzes, part 1, here), Allan says he uses lots of quizzes in his classes because they also provide students with the opportunity to improve their understanding of the material and to assess how well they understand the material, and no one would argue with the importance of those assessment goals. But in this blog post, I want to talk about another form of assessment – classroom assessment. Classroom assessment is the systematic collection and analysis of information for the purpose of improving instruction. The more you know about what your students know and understand, the better you can plan and adjust your classroom practice.

I think that the best types of classroom assessments are timely and inform teaching practice, sometimes in real time. For me, the worst-case scenario is to find out when I am grading projects or final exams that students didn’t get something important. That’s too late for me to intervene or to do anything about it but hang my head and pout. That’s why I think good classroom assessment is something worth thinking carefully about.

My favorite tool for classroom assessment is the use of “clicker questions.” These are quick, usually multiple choice, questions that students can respond to in real time. The responses are then summarized and displayed immediately to provide quick feedback to both students and the instructor. There are many ways to implement the use of clicker questions, ranging from low tech to high tech. I will talk a little about the options toward the end of this post, but first I want to get to the main point, and that’s what I think makes for a good clicker question.

Clicker questions can be used for real-time quizzes, and also as a way to create and maintain student engagement and to keep students involved during class, even in situations where class sizes are large. But if the goal is to also use them to inform instruction, they need to be written to reveal more than just whether a student knows or understands a particular topic. They need to be written in a way that will help in the decision of what to do next, especially if more than a few students answer incorrectly. That means that if I am writing a clicker question, I need to write "wrong" answers that capture common student errors and misconceptions.

Clicker questions can be quick and simple. For example, consider the following question:

Seventy-five (75) college students were asked how many units of coursework they were enrolled in during the current semester. The resulting data are summarized in the following frequency table:

What is the median for this dataset?  Options: A) 10; B) 11; C) 12

For this question, the correct answer is 12. What are students who answer 10 or 11 thinking? A common student error is to confuse the frequencies with the actual data; a student who makes this error finds the median of the frequencies, which is 10. Another common error is to confuse the possible values for number of units given in the frequency table with the actual data; a student who makes this error finds the median of the possible values (the numbers in the "Number of Units" column) and answers 11. The main things to think about when putting a question like this together are these common student errors. That's not a new idea when writing good multiple choice questions for student assessment, but the goal in writing for classroom assessment is also to think about what I am going to do if more than a few students pick one of the incorrect options. With this question, if almost all students answer correctly, I can move on. But if more than a few students select an incorrect answer, I can immediately adapt instruction to go back and address the particular misunderstanding that leads to that answer. And I can do that in real time, not two weeks later after I have graded the first midterm exam.
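A nice side benefit of this question is that all three answer choices can be verified in code. The frequency table below is hypothetical (the actual table isn't reproduced in this post); I chose it only so that it is consistent with the three answer choices:

```python
import statistics

# Hypothetical frequency table, consistent with the answer choices:
units = [6, 9, 11, 12, 15]   # "Number of Units" column
freqs = [5, 8, 10, 32, 20]   # frequencies, totaling 75 students

# Correct approach: expand the table back into the 75 individual values.
data = [u for u, f in zip(units, freqs) for _ in range(f)]
correct = statistics.median(data)    # 12

# Error A: taking the median of the frequencies themselves.
error_a = statistics.median(freqs)   # 10

# Error B: taking the median of the "Number of Units" column.
error_b = statistics.median(units)   # 11
```

The correct approach expands the table back into the 75 individual values; the two errors compute the median of the wrong column.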

Another example of a good clicker question that is related to the same student misunderstanding where frequencies are mistaken for data values is the following:

Which of the three histograms summarizes the dataset with the smallest standard deviation?

Students choosing either answer (A) or answer (C) are focusing on variability in the frequencies rather than variability in the data values. If I see students going for those answers, I can address that immediately, either through classroom discussion or by having students talk in small groups about the possibilities and come to an understanding of why answer choice (B) is the correct one.

Here is another example of a simple question that gets at understanding what is being measured by the interquartile range:

Which of the two dotplots displays the dataset with the smaller IQR?

What is the error in thinking for the students who choose answer (B)? What would you do next if you asked this question in class and more than a few students selected this incorrect option?

I will only use a clicker question if I have a plan for what I will do as an immediate reaction to how students respond. Often, I can see that it is safe to move on, knowing that students are with me and that further discussion is not needed. In other cases, I find that I have some work to do!

So what is the difference between a clicker question and a multiple choice question? I think that pretty much any well-written multiple choice question can be used as a clicker question, so strategies for writing good multiple choice questions apply here as well. But I think of a good clicker question as a good multiple choice question that I can deliver in real time AND that is paired with a plan for how student responses will inform and change what I do next in class. I have used multiple choice questions from sources like the LOCUS and ARTIST projects (described at the end of this post) as clicker questions.

Consider the following question from the ARTIST question bank:

A newspaper article claims that the average age for people who receive food stamps is 40 years. You believe that the average age is less than that. You take a random sample of 100 people who receive food stamps, and find their average age to be 39.2 years. You find that this is significantly lower than the age of 40 stated in the article (p < 0.05). What would be an appropriate interpretation of this result?

  • (A) The statistically significant result indicates that the majority of people who receive food stamps is younger than 40.
  • (B) Although the result is statistically significant, the difference in age is not of practical importance.
  • (C) An error must have been made. This difference is too small to be statistically significant.

This is a multiple choice question that makes a great clicker question because students who choose answer (A) or answer (C) have misconceptions (different ones) that can be addressed in subsequent instruction.

The same is true for the following clicker question:

In order to investigate a claim that the average time required for the county fire department to respond to a reported fire is greater than 5 minutes, county staff determined the response times for 40 randomly selected fire reports.  The data was used to test H0:  μ = 5 versus Ha:  μ > 5 and the computed p-value was 0.12.  If a 0.05 level of significance is used, what conclusions can be drawn?

  • (A) There is convincing evidence that the mean response time is 5 minutes (or less).
  • (B) There is convincing evidence that the mean response time is greater than 5 minutes.
  • (C) There is not convincing evidence that the mean response time is greater than 5 minutes.

If many students choose response (A), I need to revisit the meaning of “fail to reject the null hypothesis.” If many go for (B), I need to revisit how to reach a conclusion based on a given p-value and significance level. And if everyone chooses (C), I am happy and can move on. Notice that I deliberately placed the incorrect answer choice (A) before the correct answer choice (C): I want to know that students recognize (A) as wrong. If the correct choice (C) came first, they might simply select it because it sounds good, without understanding the difference between what is being said in (A) – convincing evidence for the null hypothesis – and what is being said in (C) – not convincing evidence against the null hypothesis.

I have given some thought to whether clicker question responses should count toward students’ grades and have experimented a bit with different strategies. Some teachers award participation points for answering a clicker question, whether or not the answer is correct. But because the value of clicker questions to me is classroom assessment, I want students to try to answer correctly, not just click a random response. I need to know that students are making a sincere effort if I am going to adapt instruction based on their responses. At the same time, I don’t want to impose a heavy penalty for an incorrect answer: if students are making an effort to answer correctly, then I share partial responsibility for incorrect answers and may need to declare a classroom “do-over” when many students answer incorrectly.

I usually include 3 to 4 clicker questions in a class period, so what I settled on is that students can earn up to 2 points for correct responses in each class period where I use clicker questions. While I use them in most class meetings, some meetings are primarily activity-based and may not incorporate clicker questions (although a clicker question can sometimes be useful in the closure part of a classroom activity, as a way to make sure that students gained the understanding that the activity was designed to develop). Of course, giving students credit for correct answers assumes that you are not using the low-tech version of clicker questions described below, because that version doesn’t keep track of individual student responses to particular questions.

Teachers can implement clicker questions in many ways. For example, ABCD cards can be used for clicker questions if you are teaching in a low tech or no tech environment:

With ABCD cards, each student has a set of cards (colored cards make it easier to get a quick read on the responses). The instructor poses a question, provides time to think, and then has each student hold up the card corresponding to the answer. By doing a quick look around the classroom, the instructor gets a general idea of how the students responded.

The downside of ABCD cards is that there is no way to collect and display the responses or to record the responses for the purpose of awarding credit for correct responses. Students can also see which students chose which answers, so the responses are not anonymous to other students. In a big lecture class, it is also difficult for the instructor to “read” the class responses.

Physical clickers are small devices that students purchase. Student responses are picked up by a receiver, and once polling is closed, responses can be summarized and displayed immediately to provide quick feedback to both students and instructor. Several companies market clickers with educational discounts, such as TurningPoint (here) and iClickers (here).

There are also several web apps for polling that can be used for clicker questions if your students have smartphones or web access. A free app that is popular with teachers is Kahoot! (free for multiple choice; more question types, tools, and reports for $3 or $6 per month, here). Another possibility is Poll Everywhere (free up to 25 students, then $120 per year for up to 700 students, here).

And finally, Zoom and some learning management systems have built-in polling. I have used Zoom polls now that I am delivering some instruction online, and Zoom polls allow you to summarize and share the results of polling questions. Zoom also has a setting that tracks individual responses if you want to use it for the purpose of assigning credit for correct answers.

I think incorporating good clicker questions has several benefits. It provides immediate feedback to students (they can see the correct answer and how other students answered), and it has changed the way that I interact with students and how students interact with the course. Students are more engaged and enjoy using this technology in class. They pay more attention because they never know when a clicker question is coming, and they want to get it right. And if they get it wrong, they want to see how other students answered.

But one important final note: If you are going to use clicker questions, it is really important to respond to them and be willing to modify instruction based on the responses. If students see that many did not get the right answer and you just say “Oh wow. Lots of you got that wrong, the right answer is C” and then move on as if you had never asked the question, students will be frustrated. On the other hand, if you respond and adjust instruction, students see that you are making student understanding a top priority!

P.S. LOCUS (Levels of Conceptual Understanding in Statistics, here) is a collection of multiple-choice and free-response assessment items that assess conceptual understanding of statistics. The items have all been tested with a large group of students, and the items on the website include commentary on student performance and common student errors. Designed to align with the Common Core State Standards, the items follow the K-12 statistics curriculum. Because the high school standards overlap substantially with the current college intro statistics course, many items (those for level B/C) are usable at the college level.

ARTIST (Assessment Resource Tools for Improving Statistical Thinking, here) is a large bank of multiple-choice and free-response assessment items, which also includes several scales that measure understanding at the course level and at a topic level. At the course level, the CAOS test (Comprehensive Assessment of Outcomes for a First Course in Statistics) consists of 40 conceptual multiple-choice questions. The topic scales are shorter collections of multiple-choice questions on a particular topic. There are more than 1000 items in the item bank, and you can search by topic and by question type, select items to use in a test, and download them as a Word document that you can edit to suit your own needs. You must register to use the item bank, but there is no cost.

#54 Probability without calculus or computation

This guest post has been contributed by Kevin Ross.  You can contact Kevin at

Kevin Ross is a faculty colleague of mine in the Statistics Department at Cal Poly – San Luis Obispo.  Kevin is a probabilist who excels at teaching introductory statistics as well as courses in probability and theoretical statistics.  Kevin is a co-developer of a Python package called Symbulate (here) that uses the language of probability to conduct simulations involving probability models (described in a JSE article here).  I have borrowed examples and exam questions from Kevin on many occasions, so I am very glad that he agreed to write this guest post describing some of his ideas for assessing students’ knowledge of probability concepts without asking for calculations or derivations.

Allan still hasn’t officially defined what a “good question” is (see the very end of post #52, Top thirteen topics, here), but he’s certainly given many examples.  I’ll try to add to the collection by presenting four types of questions for assessing knowledge of probability:

  1. Which is greater?
  2. How would you simulate?
  3. Sketch a plot
  4. “Don’t do what Donny Don’t does”

I frequently use each type of question in class, on homework assignments, on quizzes, and on exams. I use questions like the ones throughout this post in introductory statistics courses and in upper division probability courses typically taken by majors in statistics, mathematics, engineering, and economics. One common theme is that the questions require no probability calculations.  I think these questions facilitate and assess understanding of probability concepts much better than questions that require calculus derivations or formulaic computations.

1. Which is greater?

This type of multiple choice question was first inspired by “Linda is a bank teller” and other studies of Daniel Kahneman and Amos Tversky that Allan mentioned in post #51 (Randomness is hard, here).  The following example illustrates the basic structure:

a) Which of the following – A or B – is greater? Or are they equal? Or is there not enough information to decide? (A) The probability that a randomly selected Californian likes to surf; (B) The probability that a randomly selected American is a Californian who likes to surf; (C) A and B are exactly the same; (D) Not enough information to determine which of A or B is greater

The structure is simple – two quantities A and B and the same four answer choices – but this framework can be used to assess a wide variety of concepts in probability. In all of the following examples, the prompt is: Which of the following – A or B – is greater? Or are they equal? Or is there not enough information to decide?

b) Randomly select a U.S. resident. Let R be the event that the person is a California resident, and let G be the event that the person is a Cal Poly graduate. (A) P(G|R); (B) P(R|G); (C) A and B are exactly the same; (D) Not enough information to determine which of A or B is greater

The answer to (a) is A because the sample space for A (Californians) is a subset of the sample space for B (Americans). The answer to (b) is B because although the two conditional probabilities have the same numerator, the denominator is smaller for the conditional probability in B than for the one in A.

I ask many versions of “what is the denominator?” questions like (a) and (b). Symbols can easily be interchanged with words. Also, “probability” can be replaced with “proportion” to assess proportional reasoning in introductory courses.

c) A fair coin is flipped 10 times. (A) The probability that the results are, in order, HHHHHHTTTT; (B) The probability that the results are, in order, HHTHTHHTHT; (C) A and B are exactly the same; (D) Not enough information to determine which of A or B is greater

d) A fair coin is flipped 10 times. (A) The probability that the flips result in 6 Hs and 4 Ts; (B) The probability that the results are, in order, HHTHTHHTHT; (C) A and B are exactly the same; (D) Not enough information to determine which of A or B is greater

Questions like (c) and (d) can assess the ability to differentiate between specific outcomes (six Hs followed by four Ts) and general events (six Hs in ten flips). Many students select B in (c) because the sequence “looks more random,” but any two specific sequences of ten flips are equally likely, so the answer is C. The answer to (d) is A because the sequence in B is only one of the many outcomes that satisfy the event in A.

e) Shuffle a standard deck of 52 playing cards (of which 4 are aces) and deal 5 cards, without replacement. (A) The probability that the first card dealt is an ace; (B) The probability that the fifth card dealt is an ace; (C) A and B are exactly the same; (D) Not enough information to determine which of A or B is greater

Students find this question very tricky, but it gets at an important distinction between conditional versus unconditional probability (or independence versus “identically distributed”).  The correct answer is C, because in the absence of any information about the first 4 cards dealt, the unconditional probability that the fifth card is an ace is 4/52. (I like to use five cards rather than just two or three to discourage students from enumerating the results of the draws.)
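
A quick simulation sketch (mine, not from the post) can make this concrete: shuffle a deck repeatedly and track how often the first and fifth cards are aces. Both relative frequencies settle near 4/52 ≈ 0.077.

```python
import random

# Sketch: estimate the unconditional probability that the first card
# dealt is an ace and that the fifth card dealt is an ace.
def estimate_ace_probs(num_reps=20_000, seed=1):
    rng = random.Random(seed)
    deck = ["ace"] * 4 + ["other"] * 48
    first = fifth = 0
    for _ in range(num_reps):
        rng.shuffle(deck)
        first += deck[0] == "ace"
        fifth += deck[4] == "ace"
    return first / num_reps, fifth / num_reps

p1, p5 = estimate_ace_probs()  # both estimates land near 4/52 ≈ 0.077
```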

f) A box contains 30 marbles, about half of which are green and the rest gold.  A sample of 5 marbles is selected at random with replacement.  X is the number of green marbles in the sample and Y is the number of gold marbles in the sample. (A) Cov(X, Y); (B) 0; (C) A and B are exactly the same; (D) Not enough information to determine which of A or B is greater

Many students select C, thinking that “with replacement” implies independence.  But while the individual draws are independent, the random variables X and Y are negatively correlated: because X + Y = 5 on every repetition, a large number of green marbles in the sample necessarily means a small number of gold ones.
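
A simulation sketch (mine, not from the post; I assume exactly 15 green and 15 gold marbles, which the problem only gives as “about half”) shows the negative covariance directly:

```python
import random

# Sketch: sample 5 marbles with replacement from 15 green / 15 gold,
# and estimate Cov(X, Y). Since X + Y = 5 on every repetition,
# Cov(X, Y) = -Var(X) = -5(0.5)(0.5) = -1.25 here.
def estimate_cov(num_reps=50_000, seed=1):
    rng = random.Random(seed)
    xs = [sum(rng.random() < 0.5 for _ in range(5)) for _ in range(num_reps)]
    ys = [5 - x for x in xs]  # gold count is determined by green count
    mx, my = sum(xs) / num_reps, sum(ys) / num_reps
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / num_reps

cov = estimate_cov()  # comes out near -1.25
```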

g) E and F are events (defined on the same probability space) with P(E) = 0.7 and P(F) = 0.6. (A) 0.42; (B) P(E ∩ F); (C) A and B are exactly the same; (D) Not enough information to determine which of A or B is greater

The answer would be C if the events E and F were independent. But that is not necessarily true, and without further information all we can say is that P(E ∩ F) is between 0.3 and 0.6, so the correct answer is D.  I frequently remind students to be careful about assuming independence.

h) X, Y, and Z are random variables, each following a Normal(100, 10) distribution. (A) P(X + Y > 200); (B) P(X + Z > 200); (C) A and B are exactly the same; (D) Not enough information to determine which of A or B is greater

Some students select C, thinking that because Y and Z have the same distribution, then so do X + Y and X + Z.  However, X and Y do not necessarily have the same joint distribution as X and Z, and the joint distribution affects the distribution of the sum.  If X, Y, and Z were independent, then the answer would be C, but without that information (remember to be careful about assuming independence!) the answer is D.

i) X and Y are independent random variables, each following a Normal(100, 10) distribution. (A) X; (B) Y; (C) A and B are exactly the same; (D) Not enough information to determine which of A or B is greater

Some students select C, because X and Y have the same distribution.  But there are (infinitely) many potential values these random variables can take, so it’s impossible to know which one will be greater.  The following is a more difficult version of this idea; again, students often choose C but the correct answer is D.

j) X, Y, and Z are independent random variables, with X ~ Poisson(1), Y ~ Poisson(2), and Z ~ Poisson(3). (A) X + Y; (B) Z; (C) A and B are exactly the same; (D) Not enough information to determine which of A or B is greater

The last four examples illustrate two major themes behind many of the questions I ask in probability courses:

  • Marginal distributions alone are not enough to determine joint distributions.
  • Do not confuse a random variable with its distribution.

Many common mistakes in probability result from not heeding these two principles, so I think it’s important to give students lots of practice with these ideas and assess them frequently.
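
A short simulation (my sketch, not from the post, using exponential inter-arrival gaps to generate Poisson values with only the standard library) illustrates the second principle: X + Y and Z share the Poisson(3) distribution, yet as random variables they rarely take the same value on any given repetition.

```python
import random

# Sketch: generate a Poisson(mu) value by counting arrivals of a
# rate-1 Poisson process in the interval [0, mu].
def poisson(rng, mu):
    total, count = rng.expovariate(1.0), 0
    while total < mu:
        count += 1
        total += rng.expovariate(1.0)
    return count

rng = random.Random(1)
reps = 50_000
sums = [poisson(rng, 1) + poisson(rng, 2) for _ in range(reps)]  # X + Y
zs = [poisson(rng, 3) for _ in range(reps)]                      # Z
mean_sum = sum(sums) / reps  # both averages should be near 3
mean_z = sum(zs) / reps
# Same distribution, but the two random variables usually disagree:
equal_frac = sum(s == z for s, z in zip(sums, zs)) / reps
```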

2. How would you simulate?

In virtually every probability problem I introduce, one of the first questions I ask is “how would you simulate?” Such questions are a great way to assess student understanding of probability distributions and their properties, and concepts like expected value or conditional probability, without doing any calculations.

a) Describe in detail how you could, in principle, perform by hand a simulation involving physical objects (coins, dice, spinners, cards, boxes, etc.) to estimate P(X = 5 | X > 2), where X has a Binomial distribution with parameters n=5 and p=2/7.  Be sure to describe (1) what one repetition of the simulation entails, and (2) how you would use the results of many repetitions.  Note: You do NOT need to compute any numerical values.

Here is a detailed response:

  1. To simulate a single value of X, we can use the “story” for a Binomial distribution and think of X as counting the number of successes in 5 Bernoulli trials with probability of success 2/7.  To simulate a single trial, construct a spinner with 2/7 of the area shaded as success*.  To simulate a single value of X, spin the spinner 5 times and count the number of successes. If X > 2, record the value of X.  Otherwise, discard it and try again to complete step (1)**.
  2. Repeat step (1) 10,000 times to obtain 10,000 values of X with X > 2.  Count the number of simulated values of X that are equal to 5 and divide by 10,000 to approximate P(X = 5 | X > 2).

* There are many possible randomization devices, including a seven-sided die or a deck of seven cards with two labeled as success.  However, it’s important that students implement independent trials, so they must indicate that cards are drawn with replacement.

** I also accept an answer that omits the “discard” part of step (1) and replaces step (2) with: Repeat step (1) 10,000 times to obtain 10,000 values of X.  Divide the number of simulated values of X that are equal to 5 by the number of simulated values of X that are greater than 2 to approximate P(X = 5 | X > 2).  Each method provides a point estimate of the conditional probability, but they differ with respect to simulation margin of error.  I discuss in class how the method that includes the “discard” part of step (1) is less computationally efficient but results in a smaller margin of error.
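
For comparison with the by-hand description, here is a computer version (my sketch, with Python's random module standing in for the physical spinner), retaining the “discard” logic from step (1):

```python
import random

# Sketch: estimate P(X = 5 | X > 2) for X ~ Binomial(5, 2/7), keeping
# only simulated values of X that satisfy X > 2.
def estimate_conditional(num_kept=20_000, seed=1):
    rng = random.Random(seed)
    kept = fives = 0
    while kept < num_kept:
        x = sum(rng.random() < 2 / 7 for _ in range(5))  # one value of X
        if x > 2:          # keep this repetition
            kept += 1
            fives += x == 5
        # otherwise discard it and try again
    return fives / num_kept

p = estimate_conditional()  # exact answer is (2/7)**5 / P(X > 2) = 1/76 ≈ 0.013
```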

Students often write vague statements like “repeat this many times.”  But “this” could be a single spin of the spinner or the generation of a single value of X. Therefore, it’s important that students’ responses clearly distinguish between (1) one repetition and (2) many repetitions.

(b) Repeat (a) for the goal of estimating Cov(V, W), where V = X + Y, W = max(X, Y), and X, Y are i.i.d. Normal(100, 15). Assume that you have access to a Normal(0, 1) spinner.

Part (b) illustrates how tactile simulation can be used even with more advanced concepts like continuous or joint distributions.  I repeatedly use the analogy that every probability distribution can be represented by a spinner, like the following picture corresponding to a Normal(0, 1) distribution:

Notice how the values on the spinner are not evenly spaced; the sector corresponding to the range [0, 1] comprises 34.1% of the area while [1, 2] comprises 13.6%. (With more mathematically inclined students I discuss how to create such spinners by inverting cumulative distribution functions.) I have many clear plastic spinners that can be overlaid upon pictures like the above so students can simulate by hand values from a variety of distributions.

Here is a detailed response to part (b):

  1. To simulate a single (V, W) pair: Spin the Normal(0, 1) spinner to obtain Z1, and let X = 100 + 15 × Z1. Spin the Normal(0, 1) spinner again to obtain Z2, and let Y = 100 + 15 × Z2. Add the X and Y values to obtain V = X + Y, and take the larger of X and Y to obtain W = max(X, Y). Record the values of V, W, and their product VW.
  2. Repeat step (1) 10,000 times to obtain 10,000 values each of V, W, and VW.  Average the values of VW and subtract the product of the average of the V values and the average of the W values to approximate Cov(V, W).
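
The two steps above can be sketched in code (mine, not from the post; random.gauss plays the role of the Normal(0, 1) spinner):

```python
import random

# Sketch: estimate Cov(V, W) for V = X + Y and W = max(X, Y),
# with X, Y i.i.d. Normal(100, 15).
def estimate_cov_vw(num_reps=50_000, seed=1):
    rng = random.Random(seed)
    vs, ws = [], []
    for _ in range(num_reps):
        x = 100 + 15 * rng.gauss(0, 1)  # spin the Normal(0, 1) spinner
        y = 100 + 15 * rng.gauss(0, 1)  # spin it again
        vs.append(x + y)
        ws.append(max(x, y))
    mean_vw = sum(v * w for v, w in zip(vs, ws)) / num_reps
    return mean_vw - (sum(vs) / num_reps) * (sum(ws) / num_reps)

cov_vw = estimate_cov_vw()
# For i.i.d. normals, W = (V + |X - Y|)/2 and X + Y is independent of
# X - Y, so Cov(V, W) = Var(V)/2 = 225; the estimate lands nearby.
```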

I do think it’s important that students can write their own code to implement simulations.  But I generally prefer “describe in words” questions to “write the code” to avoid syntax issues, especially during timed exams.  When I want to assess student understanding of actual code on an exam, I provide the code and ask what the output would be. Of course, after discussing how to simulate and simulating a few repetitions by hand, we then carry out a computer simulation.  But before looking at the results, I often ask students to sketch a plot, as described in the next section.

3. Sketch a plot

As students progress in probability and statistics courses, they encounter many probability distributions but often have difficulty understanding just what all these distributions are.  Asking students to sketch plots, as in the following example, helps solidify understanding of random variables and distributions without any difficult calculus.

Suppose that X has a Normal(0, 1) distribution, U has a Uniform(-2, 2) distribution, X and U are independent, and Y = UX. For each of the following, sketch a plot representing the distribution.  The sketch does not have to be exact, but it should explicitly illustrate the most important features.  Be sure to clearly label any axes with appropriate values.  Explain the important features your plot illustrates and your reasoning*. (a) the conditional distribution of Y given U = -0.5; (b) the joint distribution of X and Y.

* I usually give full credit to well-drawn and carefully labeled plots regardless of the quality of explanation.  But “explaining in words” can help students who have trouble translating ideas into pictures.

Part (a) is not too hard once students realize they should draw a Normal(0, 0.5) density curve*, but it does take some thought to get to that point.  Even though the answer is just a normal curve, the question still assesses understanding of conditioning (treating U as constant) and the effect of a linear transformation.  The question also assesses the important difference between operations on random variables versus operations on distributions; it is X that is multiplied by -0.5, not its density. (Unfortunately, some students forget this and draw an upside-down normal curve.)

* However, I do deduct points if the variable axis isn’t labeled, or if the inflection points are not located at -0.5 and 0.5.  (The values on the density axis are irrelevant.)

Part (b) is much harder. Here is an excellent student solution:

Students tend to find this type of question challenging, even after encountering examples in class activities and assignments. Here are some questions that I pose during class examples, which I hope students ask of themselves on assessments, to help them unpack these problems:

  1. What is one possible plot point?  A few possible points? Students often have trouble even starting these problems, so just identifying a few possibilities can help.
  2. What type of plot is appropriate? Since X and Y are two continuous random variables, a scatterplot or joint density plot is appropriate.
  3. What are the possible values of the random variable(s)? After identifying a few possible values, I ask students to identify all the possible values and start labeling axes. Since X ~ Normal(0, 1), 99.7% of the values of X will fall between -3 and 3, so we can label the X-axis from -3 to 3.  (Remember, it doesn’t have to be perfect.) The value of Y depends on both X and U; identifying a few examples in step 1 helps students see how.  Given X = x, Y has a Uniform(-2|x|, 2|x|) distribution, so larger values of |x| correspond to more extreme values of Y.  Since most values of X lie between -3 and 3, most values of Y lie between -6 and 6, so we can label the Y-axis from -6 to 6.  But not all (X, Y) pairs are possible; only pairs within the region bounded by the lines y = 2x and y = -2x have nonzero density.  If students can make it to this point, drawing a plot with well-labeled axes and the “X-shaped” region of possible values, then they’ve made great progress.
  4. What ranges of values are more likely?  Less likely? Values of X near 0 are more likely, and far from 0 are less likely.  Within each vertical strip corresponding to an x value, the Y values are distributed uniformly, so the density is stretched thinner over longer vertical strips. These observations help us shade the plot as in the example.
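
A simulation (my sketch, not from the post) is a good way to check a hand-drawn answer: every simulated (X, Y) pair falls inside the region bounded by y = 2x and y = -2x, with points concentrated near the origin.

```python
import random

# Sketch: simulate (X, Y) pairs with X ~ Normal(0, 1), U ~ Uniform(-2, 2)
# independent of X, and Y = UX.
def simulate_xy(num_reps=10_000, seed=1):
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_reps):
        x = rng.gauss(0, 1)
        u = rng.uniform(-2, 2)
        pairs.append((x, u * x))  # Y = UX
    return pairs

pairs = simulate_xy()
# Every pair satisfies |y| <= 2|x|: the "X-shaped" region described above.
```

Plotting these pairs as a scatterplot reproduces the shading pattern students are asked to sketch.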

Determining an expression for the joint density in part (b) is a difficult calculus problem involving Jacobians.  Even students who are able to do the calculus to obtain the correct density might not be able to interpret what it means for two random variables to have this joint density.  Furthermore, even if students are provided the joint density function, they might not be able to sketch a plot or understand what it means. But I’m pretty confident that students who draw plots like the above have a solid understanding of concepts including normal distributions, uniform distributions, joint distributions, and transformations.

4. “Don’t do what Donny Don’t does”

This title is an old Simpsons reference (see here). In these questions, Donny Don’t represents a student who makes many common mistakes. Students can learn from the mistakes Donny makes by identifying what is wrong and why, and also by helping Donny understand and correct his errors.

At various points in his homework, Donny Don’t writes the following expressions. Using simple examples, explain to Donny which of his statements are nonsense, and why. (A represents an event, X a random variable, P a probability measure, and E an expected value.) a) P(A = 0.5); b) P(A) ∪ P(B); c) P(X); d) P(X = E(X)).

I’ll respond to Donny using tomorrow’s weather as an example, with A representing the event that it rains tomorrow, X tomorrow’s high temperature (in degrees F), and B the event that tomorrow’s high temperature is above 80 degrees.

(a) It doesn’t make sense to say “it rains tomorrow equals 0.5.” If Donny wants to say “the probability that it rains tomorrow equals 0.5” he should write P(A) = 0.5. (Mathematically, A is a set and 0.5 is a number, so it doesn’t make sense to equate them.)

(b) What Donny has written reads as “the probability that it rains tomorrow or the probability that tomorrow’s high temperature is above 80 degrees F,” which doesn’t make much sense.  Donny probably means “the probability that (it rains tomorrow) or (tomorrow’s high temperature is above 80 degrees),” which he should write as P(A ∪ B). (Mathematically, P(A) and P(B) are numbers while union is an operation on sets, so it doesn’t make mathematical sense to take a union of numbers.) Donny might have meant to write P(A) + P(B), which is a valid expression since P(A) and P(B) are numbers. However, he should keep in mind that P(A) + P(B) is not necessarily a probability of anything; this sum could even be greater than one.  In particular, since there are some rainy days with high temperatures above 80 degrees, P(A) + P(B) is greater than P(A ∪ B).

(c) Donny has written “the probability that tomorrow’s high temperature,” which is a subject in need of a predicate.  We assign probabilities to things that could happen (events) like “tomorrow’s high temperature is above 80 degrees,” which has probability P(X > 80).

(d) Donny’s notation is actually correct!  Students often find this expression strange at first, but since E(X) represents a single number, P(X = E(X)) makes just as much sense as P(X = 80). Even if we don’t know the value of E(X), it still makes sense to consider “the probability that tomorrow’s high temperature is equal to the average high temperature.” Some students might object that X is continuous and so P(X = E(X)) = 0, but P(X = E(X)) is still a valid expression even when it equals 0.

Questions like this do more than encourage and assess proper use of notation.  Explaining to Donny why he is wrong helps students better understand the probabilistic objects that symbols represent and how they connect to real-world contexts.

I hope these examples demonstrate that even in advanced courses in probability or theoretical statistics, instructors can ask a variety of probability questions that don’t require any computation or calculus.  Such questions can not only assess students’ understanding of probability concepts but also help them to develop that understanding in the first place.  I have many more examples that I’d be happy to share, so please feel free to contact me!

P.S. Many thanks to Allan for having me as a guest, and thanks to you for reading!

#53 Random champions

This guest post has been contributed by Josh Tabor. You can contact Josh at

Josh Tabor teaches AP Statistics at Canyon del Oro High School in Oro Valley, Arizona, near Tucson*.  He is a co-author of a widely used textbook for AP Statistics, titled The Practice of Statistics.  He also co-wrote Statistical Reasoning in Sports, a textbook that uses simulation-based inference from the very first chapter.  Josh and I have worked together for many years at the AP Statistics Reading, and we have also presented at some of the same workshops and conferences.  Even more fun, we have attended some pre-season baseball games together in Arizona.  Josh is a terrific presenter and expositor of statistical ideas, so I am delighted that he agreed to bat lead-off for this series of guest bloggers.  Sticking with the baseball theme, he has written a post about randomness, simulation, World Series champions, teaching statistical inference, and asking good questions.

* Doesn’t it seem like the letters c and s are batting out of order in Tucson?

I am a big believer in the value of simulation-based inference, particularly for introducing the logic of significance testing. I start my AP Statistics class with a simulation-based inference activity, and try to incorporate several more before introducing traditional inference. Many of these activities foreshadow specific inference procedures like a two-sample z-test for a difference in proportions, but that isn’t my primary goal. Instead, my goal is to highlight how all significance tests follow the same logic, regardless of the type of data being collected. The example that follows doesn’t align with any of the tests in a typical introductory statistics class, but it is a fun context and helps achieve my goal of developing conceptual understanding of significance testing.

In a 2014 article in Sports Illustrated (here), author Michael Rosenberg addresses “America’s Wait Problem.” That is, he discusses how fans of some teams have to wait many, many years for their team to win a championship. In Major League Baseball, which has 30 teams, fans should expect to wait an average of 30 years for a championship—assuming all 30 teams are equally likely to win a championship each season. But is it reasonable to believe that all teams are equally likely to win a championship?

Rosenberg doesn’t think so. As evidence, he points out that in the previous 18 seasons, only 10 different teams won the World Series. Does having only 10 different champions in 18 seasons provide convincing evidence that the 30 teams are not equally likely to win a championship?

Before addressing whether the evidence is convincing, I start my students off with a (perhaps) simpler question:

  • Rosenberg suggests that having 10 different champions in 18 seasons is evidence that teams are not equally likely to win a championship. How does this evidence support Rosenberg’s claim?

This isn’t the first time I have asked such a question to my students. From the beginning of the year, we have done a variety of informal significance tests, like the ones Allan describes in posts #12, #27, and #45 (here, here, and here). In most previous cases, it has been easy for students to identify how the given evidence supports a claim. For example, if we are testing the claim that a population proportion p > 0.50 and obtain a sample proportion of p-hat = 0.63, then recognizing that p-hat = 0.63 > 0.50 is very straightforward.

In this case, the statistic presented as evidence is quite different from a simple proportion or mean or even a correlation coefficient. Here the statistic is the number of different champions in an 18-year period of time. Some students will naively suggest that if teams are equally likely to win a championship, there should be 18 different champions in 18 seasons. And because 10 < 18, these data provide the evidence we are looking for. If students go down this path, you might ask a follow-up question: If you were to roll a die 6 times, would you expect to get 6 different results? If you have the time, you might even pull out a die and give it 6 rolls. (If you are nervous, there is less than a 2% chance of getting 6 different outcomes in 6 rolls of a fair die*.)

* This calculation is: (6/6) × (5/6) × (4/6) × (3/6) × (2/6) × (1/6) = 6!/6⁶ = 720/46,656 ≈ 0.0154.

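A quick simulation confirms how small this probability is. Here is a sketch in Python (the original post contains no code; this is just a check on the footnote):

```python
import random

random.seed(1)

# Estimate the probability that 6 rolls of a fair die give 6 different outcomes.
trials = 100_000
hits = sum(len({random.randint(1, 6) for _ in range(6)}) == 6
           for _ in range(trials))
print(hits / trials)  # the exact value is 6!/6^6, about 0.0154
```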
Once students are convinced that 18 is the wrong number to compare to, I pose a new question:

  • If all 30 teams are equally likely to win a championship, what is the expected value of the number of different champions in 18 seasons?

There is no formula that I know of that addresses this question. Which leads to another question:

  • What numbers of different champions (in 18 seasons) are likely to happen by chance alone, assuming all 30 teams are equally likely to win a championship?

Upon hearing the words “by chance alone,” my students know how to determine an answer: Simulation! Now for more questions:

  • How can you simulate the selection of a World Series champion, assuming all teams are equally likely to win the championship?
  • How do you conduct 1 repetition of your simulation?
  • What do you record after each repetition of your simulation?

If we have time, I like students to work in groups and discuss their ideas. There are a variety of different approaches that students take to answer the first question: rolling a 30-sided die, with each side representing a different team; putting the names of the 30 teams in a hat, mixing them up, and choosing a team; or spinning a 30-section spinner, with each section having the same area and representing one of the teams. I am happy when students think of physical ways to do the simulation, as that is what I have modeled since the beginning of the year. But I am also happy when they figure out a way to use technology: Generate a random integer from 1–30, where each integer represents a different team.

Assuming that students settle on the random integer approach, they still need to figure out how to complete one repetition of the simulation. In this case, they would need to generate 18* integers from 1–30, one integer (champion) for each season, allowing for repeated integers**. To complete the repetition, they must determine the value of the simulated statistic by recording the number of different integers in the set of 18. For example, there are 14 different champions in the following set of 18 random integers (repeat champions underlined): 22, 24, 17, 14, 8, 1, 11, 9, 25, 17, 17, 24, 16, 7, 18, 16, 30, 19.
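One repetition takes only a couple of lines of code. Here is a sketch in Python (the post itself uses Fathom; the variable names are my own):

```python
import random

random.seed(2024)

# One repetition: 18 seasons, each won by one of 30 equally likely teams.
# Repeated integers are allowed, since teams can win more than once.
champions = [random.randint(1, 30) for _ in range(18)]

# The simulated statistic: the number of *different* champions in 18 seasons.
num_different = len(set(champions))
print(champions, num_different)
```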

* As I was brainstorming for this post, I started by counting the number of champions in the previous 30 MLB seasons, rather than the 18 seasons mentioned in the article. I didn’t want to be guilty of cherry-picking a boundary to help make my case. And 30 seemed like a nice number because it would allow for the (very unlikely) possibility of each team winning the championship once (not because of the central limit theorem!). But, using the same number in two different ways (30 teams, 30 seasons) is sure to create confusion for students. So I stuck with the 18-season window from the article.  Also, I realized that an 18-season window captures an entire lifetime for my students.

** Early in my teaching career (2001 to be precise), there was a simulation question on the AP Statistics exam that required students to account for sampling without replacement. Until then, we had always done examples where this wasn’t an issue. After 2001, I made a big deal about “ignoring repeats” until I realized that students were now including this phrase all the time, even when it wasn’t appropriate. I now try to include a variety of examples, with only some requiring students to “ignore repeats.”  In this context of sports champions, of course, repeats are at the very heart of the issue we’re studying.

Once students have had the opportunity to share their ideas, we turn to technology to run the simulation. My software of choice for simulation is Fathom (here), but there are many alternatives. Here are the results of 10,000 repetitions of the simulation. That is, the results of 10,000 simulated sets of 18 seasons, assuming all 30 teams are equally likely to win the championship each year:

In this simulation of 10,000 repetitions, the mean number of different champions is 13.71, and the standard deviation is 1.39. The minimum value is 9, and the maximum is 18, which indicates that at least one of the 10,000 simulated 18-season periods produced a different champion every season.
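For readers who prefer code to Fathom, the full simulation is a short Python sketch. (Your mean, standard deviation, and tail proportions will differ slightly from the values above, since each run produces a fresh set of random repetitions.)

```python
import random
from statistics import mean, stdev

random.seed(42)

def num_different_champions(num_seasons=18, num_teams=30):
    """Simulate one 18-season stretch with equally likely champions."""
    return len({random.randint(1, num_teams) for _ in range(num_seasons)})

results = [num_different_champions() for _ in range(10_000)]

print(mean(results), stdev(results))            # close to 13.71 and 1.39
print(sum(r <= 10 for r in results) / 10_000)   # close to 0.0098
print(sum(r <= 12 for r in results) / 10_000)   # close to 0.1916
```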

Back to the questions:

  • There is a dot at 9. What does this dot represent?

This is one of my very favorite questions to ask anytime we do a simulation. In this case, the dot at 9 represents one simulated 18-year period where there were 9 different champions.

  • Using the results of the simulation, explain how having 10 champions in 18 seasons is evidence for Rosenberg’s claim that teams are not equally likely to win a championship.

Note that I am not asking whether the evidence is convincing. Yet. For now, I want students to notice that the expected number of different champions is 13 or 14 (expected value ≈ 13.71) when each team is equally likely to win the championship over an 18-year period. And most importantly, 10 is less than 13 or 14. So, Rosenberg’s intuition was correct when he cited the value of this statistic as evidence for his claim. Now that we have identified the evidence, I ask the following:

  • What are some explanations for the evidence? In other words, what are some plausible explanations for why we got a value less than 14?

My students have already been through this routine several times, so they are pretty good about answering this question. And if they can provide the explanations in my preferred order*, I am especially happy.

  • Explanation #1: All teams are equally likely to win the championship each year, and the results in our study happened by chance alone. Note that both clauses of this sentence are very important. My students always get the second half (“it happened by chance!”), but they also need the first part to have a complete explanation.
  • Explanation #2: Teams aren’t equally likely to win the championship. In other words, some teams are more likely to win championships than others (sorry, Seattle Mariners fans!).

* This is my preferred order because it parallels the null and alternative hypotheses that we will discuss later in the year.

Once these two explanations are identified, we return to the original question:

  • Does having 10 different champions in 18 seasons provide convincing evidence that all teams are not equally likely to win a championship?

For evidence to be convincing, we must be able to essentially rule out Explanation #1. Can we? To rule out Explanation #1, we need to know how likely it is to get evidence as strong or stronger than the evidence we found in our study, assuming that all teams are equally likely to win the championship each year.

  • How can you use the dotplot to determine if the evidence is convincing?

When I am leading students through this discussion, there are usually a few who correctly respond “See how often we got a result of 10 or fewer by chance alone.” But when I ask similar questions on exams, many students don’t provide the correct answer. Instead, they give some version of the following: “Because nearly half of the dots are less than the mean, it is possible that this happened by chance alone.”* The use of the word “this” in the previous sentence points to the problem: students aren’t clear about what event they are supposed to consider. Since I started asking students to state the evidence at the beginning of an example, this error has occurred less often.

* This is even more common when there is a clearly stated null hypothesis like H0: p1 − p2 = 0 and students are tempted to say “because about half of the dots are positive…”

  • In the simulation, 98 of the 10,000 simulated seasons resulted in 10 or fewer different champions, as highlighted in the graph below.  Based on this result, what conclusion would you make?

In the simulation, getting a result of 10 or fewer different champions was pretty rare, occurring only 98 times in 10,000 repetitions* (probability ≈ 0.0098). Because it is unlikely to get 10 or fewer different champions by chance alone when all 30 teams are equally likely to win the championship, there is convincing evidence that teams in this 18-year period were not equally likely to win the championship.

* Of course, this describes a p-value. I don’t call it a p-value until later in the year, but I am careful to use correct language, including the assumption that the null hypothesis is true.

As always, the scope of inference is important to consider. I remind students that the conclusion above was about “this 18-year period,” that is, the 18 years prior to the article’s publication in November 2014. I also like to give students experience with raw data that allow them to determine the value of the statistic for themselves. Here are the World Series champions for the 18-year period from 2002–2019*:

* In addition to matching the 18-year period length from the article, this allows me to include my favorite team in the list of World Series champions: Go Angels! It also makes me feel old as most of my current students weren’t even alive in 2002!

  • What are the observational units for these sample data?  What is the variable?  What statistic will we determine from this sample?  What is the value of that statistic for this sample?

The observational units are the 18 seasons, and the variable is the World Series champion for that season. The statistic is the number of different champions in these 18 seasons. There were 12 different champions in this 18-year period. The repeat champions were the Boston Red Sox (4 times), San Francisco Giants (3 times), and St. Louis Cardinals (twice).

  • To determine if these data provide convincing evidence that all teams are not equally likely to win a championship in 2002–2019, do we need to conduct a different simulation?

No. Because the number of seasons (18) and the number of teams (30) are still the same, we can use the results of the previous simulation to answer the question about 2002–2019.

  • For the 18-year period from 2002–2019, is there convincing evidence that all teams are not equally likely to win a championship?

No. The graph of simulation results shows that a result of 12 or fewer different champions in 18 seasons is not unusual (probability ≈ 0.1916). Because it is not unlikely to get 12 or fewer different champions by chance alone, when all 30 teams are equally likely to win the championship each season, the data do not provide convincing evidence that teams in this 18-year period were not equally likely to win the championship. In other words, it is plausible that all 30 teams were equally likely to win the championship in the period from 2002–2019*.

* To avoid the awkward double negative in their conclusions, it is very tempting for students to include statements like the final sentence in the preceding paragraph. Unfortunately, they usually leave out wiggle phrases like “it is plausible that” or “it is believable that.” Once your students have had some experience making conclusions, it is important to caution them to never “accept the null hypothesis” by suggesting that there is convincing evidence for the null hypothesis.  In this context, no sports fan really believes that all teams are equally likely to win the championship each season, but the small sample size does not provide convincing evidence to reject that claim.

If you have the time and students seem interested in this topic, you can expand into other sports. Here are some questions you might ask about the National Football League:

  • Do you think there would be stronger or weaker evidence that NFL teams from the previous 18 seasons aren’t equally likely to win a championship?

Most people expect the evidence to be stronger for the NFL. Even though the NFL tries to encourage parity, the New England Patriots seem to hog lots of Super Bowl titles.

  • If we were to simulate the number of different champions in an 18-year period for the NFL, assuming all 32 teams are equally likely to win a championship, how would conducting the simulation differ from the earlier baseball simulation?

Instead of generating 18 integers from 1–30, we would generate 18 integers from 1–32.

  • How do you think the results of the simulation would differ?

With more teams available to win the championship, the expected value of the number of different champions should increase.

  • It just so happens that 12 different NFL teams have won a championship in the previous 18 seasons, the same as the number of MLB teams that have won a championship in the previous 18 seasons. (The Patriots won 5 of these championships.) Based on your answer to the previous question, would the probability of getting 12 or fewer NFL champions by chance alone be larger, smaller, or about the same as the probability in the MLB simulation (0.1916)?

This probability will be smaller, as the expected number of different champions in the NFL is greater than in MLB, so values of 12 or fewer will be less likely in the NFL simulation.

Here are the results of 10,000 simulated 18-season periods for the NFL:

The most common outcome is still 14 different champions, but the mean number of different champions increases from about 13.71 with MLB to about 13.94 with the NFL. (The standard deviation also increases, from 1.39 to 1.41.)

The p-value for the NFL data is about 0.1495, smaller (as expected) than the p-value of 0.1916 for the MLB data. However, because the p-value is not small, these data do not provide convincing evidence that the 32 NFL teams are not equally likely to win the championship each season.

Each time we do an informal significance test like this one, I rehearse the logic with my students:

  1. Identify the statistic to be used as evidence, and explain why it counts as evidence for the claim being tested.
  2. Describe the two explanations for the evidence.
  3. Use simulation to explore what is likely to happen by chance alone.
  4. Compare the evidence to what is likely to happen by chance alone. If it is unlikely to get evidence as strong as or stronger than the observed evidence, then the evidence is convincing.

P.S. Thanks to Allan for letting me share some thoughts in this post. And thanks for each of the 52 entries that precede this one!

#52 Top thirteen topics

After I present an activity on a particular statistical topic while conducting a workshop for teachers, I often say something like: I think this is one of the top ten things for students to learn in introductory statistics.  Naturally enough, a workshop participant always asks me to provide my complete “top ten” list.  My sheepish response has always been to beg off, admitting that I have never taken the time to sit down and compile such a list*.

* Workshop participants have always been too polite to ask how, in that case, I can be so sure that the topic in question is actually on that imaginary list of mine.

To mark the 52nd post and one-year milestone for this weekly blog, I have finally persuaded myself to produce my list of most important topics for students to learn in introductory statistics.  I hope you will forgive me for expanding the number of topics to a lucky thirteen*.  Commenting on this list also provides an opportunity for me to reflect on several earlier posts from my year of blogging.

* I also recommend the “top seven” list produced by Jessica Utts in an article for The American Statistician in 2003 (here), to which she added an additional four topics at an ICOTS presentation in 2010 (here).

Unlike previous posts, this one poses no questions in italics for students to answer.  Instead I focus on the question that has often been asked of me: What are the most important topics for students to learn in introductory statistics?

1. Identifying observational units and variables points the way.

In post #11 (Repeat after me, here), I repeated over and over again that I ask students to identify observational units and variables for almost every example that we study throughout the entire course.  Along with identifying the variables, I ask students to classify them as categorical or numerical, explanatory or response.  Thinking through these aspects of a statistical study helps students to understand how the study was conducted and what its research questions were.  These questions also point the way to knowing what kind of graph to produce, what kind of statistic to calculate, and what kind of inference procedure to conduct.  I have found that identifying observational units and variables is more challenging for students than I used to think.

One of my favorite examples to illustrate this concerns the murder trial of Kristen Gilbert, a nurse accused of being a serial killer of patients.  The following data were presented at her trial:

The observational units here are hospital shifts, not patients.  The explanatory variable is whether or not Gilbert was working on the shift, which is categorical and binary.  The response variable is whether or not a patient died on the shift, which is also categorical and binary.  Students need to understand these basic ideas before they can analyze and draw conclusions from these data.

2. Proportional reasoning, and working with percentages, can be tricky but are crucial.

I suspect that at least two-thirds of my blog posts have included proportions or percentages*.  Proportions and percentages abound in everyday life.  Helping students to work with percentages, and to recognize the need for proportional reasoning, is a worthy goal for introductory statistics courses.

* This very sentence contains a proportion, even if it is only a guess.

Look back at the table of counts from the Kristen Gilbert trial.  Students who do not think proportionally simply compare the counts 40 and 34, which suggests a small difference between the groups.  But engaging in proportional reasoning reveals a huge discrepancy: 40/257 ≈ 0.156 and 34/1384 ≈ 0.025.  In other words, 15.6% of shifts on which Gilbert worked saw a patient death, compared to 2.5% of shifts on which Gilbert did not work.  These percentages are displayed in the segmented bar graph:

What’s so tricky about this?  Well, converting the proportions to statements involving percentages is non-trivial, particularly as these are conditional percentages.  More challenging is that many students are tempted to conclude that the death rate on Gilbert shifts is 13.1% higher than the death rate on non-Gilbert shifts, because 0.156 – 0.025 = 0.131.  But that’s not how percentage difference works, as I ranted about at length in post #28 (A pervasive pet peeve, here).  The actual percentage difference in the death rates between these groups is (0.156 – 0.025) / 0.025 × 100% ≈ 533.6%.  Yes, that’s right: The death rate on a Gilbert shift was 533.6% higher than the death rate on a non-Gilbert shift!  This gives quite a different impression than the incorrect claim of a 13.1% difference.

The importance of proportional reasoning also arises when working with probabilities.  I strongly recommend producing a table of hypothetical counts to help students work with conditional probabilities.  For example, I used that technique in post #10 (My favorite theorem, here) to lead students to distinguish between two conditional probabilities: (1) the probability that a person with a positive test result has the disease, and (2) the probability that the test result is positive among people who have the disease, as shown in the table:

3. Averages reveal statistical tendencies.

The concept of a statistical tendency is a fundamental one that arises in all aspects of life.  What do we mean when we say that dogs are larger than cats?  We certainly do not mean that every dog is larger than every cat.  We mean that dogs tend to be larger than cats.  We also express this idea by saying that dogs are larger than cats on average.  We can further explain that if you encounter a dog and a cat at random, it’s more likely than not that the dog will be larger than the cat*.

Understanding statements of statistical tendencies, and learning to write such statements clearly, is an important goal for introductory statistics students.  Is this an easy goal to achieve?  Not at all.  I mentioned in post #37 (What’s in a name? here) that psychologist Keith Stanovich has described this skill, and probabilistic reasoning more generally, as the “Achilles Heel” of human cognition.

The dogs and cats example is an obvious one, but averages can also help us to see a signal in the midst of considerable noise.  Post #9 (Statistics of illumination, part 3, here) about the infamous 1970 draft lottery illustrates this point.  The scatterplot on the left, displaying draft number versus day of the year, reveals nothing but random scatter (noise) on first glance.  But calculating the median draft number for each month reveals a clear pattern (signal), as shown on the right:

You might be thinking that students study averages beginning in middle school or even sooner, so do we really need to spend time on averages in high school or college courses?  In post #5 (A below-average joke, here), I argued that we can help students to develop a deeper understanding of how averages work by asking questions such as: How could it happen that the average IQ dropped in both states when I moved from Pennsylvania to California?

4. Variability, and distributional thinking, are fundamental.

Averages are important, but variability is at the core of statistical thinking.  Helping students to regard a distribution of data as a single entity is important but challenging.  For example, post #4 (Statistics of illumination, part 2, here) described an activity based on data about readability of cancer pamphlets.  I ask students to calculate medians for a dataset on pamphlet readability and also for a dataset on patient reading levels.  The medians turn out to be identical, but that only obscures the more important point about variability and distribution.  Examining a simple graph reveals the underlying problem that many patients lack the skill to read the simplest pamphlet:

In posts #6 and #7 (Two dreaded words, here and here), I suggested that we can help students to overcome their dread of the words standard deviation by focusing on the concept of variability rather than dreary calculations that are better performed by technology.  I also argued in post #8 (End of the alphabet, here) that z-scores are an underappreciated idea that enable us to compare proverbial apples and oranges by taking variability into account.

5. Visual displays of data can be very illuminating.

In light of the graphs presented above, I trust that this point needs no explanation.

6. Association is not causation; always look for other sources of variability.

Distinguishing causation from association often prompts my workshop comment that I mentioned in the first sentence of this post.  I want students to emerge from their introductory statistics course knowing that inferring a cause-and-effect relationship from an observed association is often unwarranted.  Posts #43 and #44 (Confounding, here and here) provide many examples.

The idea of confounding leads naturally to studying multivariable thinking.  Post #3 (Statistics of illumination, part 1, here) introduced this topic in the context of graduate admission decisions.  Male applicants had a much higher acceptance rate than female applicants, but the discrepancy disappeared, and even reversed a bit, after controlling for the program to which they applied.  For whatever reason, most men applied to the program with a high acceptance rate, while most women applied to the program with a very low acceptance rate.

Post #35 (Statistics of illumination, part 4, here) continued this theme in the context of comparing lung capacities between smokers and non-smokers.  Surprisingly enough, smokers in that study tended to have larger lung capacities than non-smokers.  This perplexing result was explained by considering the ages of the people, who were all teenagers and younger.  Smokers were much more likely to be older than younger, and older kids tended to have larger lung capacities than younger ones.  The following graph reveals the relationships among all three variables:

7. Randomized experiments, featuring random assignment to groups, allow for cause-and-effect conclusions.

Some students take the previous point too far, leaving their course convinced that they should never draw cause-and-effect conclusions.  I try to impress upon them that well-designed randomized experiments do permit drawing cause-and-effect conclusions, as long as the difference between the groups turns out to be larger than can plausibly be explained by random chance.  Why are the possible effects of confounding variables less of a concern with randomized experiments?  Because random assignment of observational units to explanatory variable groups controls for other variables by balancing them out among the groups.

Post #20 (Lincoln and Mandela, part 2, here) describes a class activity that investigates the psychological phenomenon known as anchoring by collecting data from students with a randomized experiment.  Students are asked to guess the age at which Nelson Mandela died, but some students first see the number 16 while others see the number 160.  The following graph displays the responses for one of my classes. These data strongly suggest that those primed with 160 tend to make larger guesses than those primed with 16:

Posts #27 and #45 (Simulation-based inference, parts 2 and 3, here and here) also featured randomized experiments.  We used simulation-based inference to analyze and draw conclusions from experiments that investigated effects of metal bands on penguin survival and of fish oil supplements on weight loss.

8. Random sampling allows for generalizing, but it’s very hard to achieve.

Random sampling is very different from random assignment.  These two techniques share an important word, but they have different goals and consequences.  Random sampling aims to select a representative sample from a population, so results from the sample can be generalized to the population.

I described how I introduce random sampling to my students in post #19 (Lincoln and Mandela, part 1, here).  In this activity, students select samples of words from the Gettysburg Address.  First students select their sample simply by circling ten words that appeal to them.  They come to realize that this sampling method is biased toward longer words.  Then they use genuine random sampling to select their sample of words, finding that this process is truly unbiased.  The following graphs (from the applet here) help students to recognize the difference between three distributions: 1) the distribution of word lengths in the population, 2) the distribution of word lengths in a random sample from that population, and 3) the distribution of sample mean word lengths in 1000 random samples selected from the population:

I emphasize to students that while selecting a random sample of words from a speech is straightforward, selecting a random sample of human beings is anything but.  Standing in front of the campus library or recreation center and selecting students in a haphazard manner does not constitute random sampling.  Even if you are fortunate enough to have a list of all people in the population of interest from which to select a random sample, some people may choose not to participate, which leaves you with a non-random sample of people for your study.

9. Analyzing random phenomena requires studying long-run behavior.

There’s no getting around the fact that much of statistics, and all of probability, depends on asking: What would happen in the long run?  Such “long run” concepts are hard to learn because they are, well, conceptual, rather than concrete.  Fortunately, we can make these concepts more tangible by employing the most powerful tool in our pedagogical toolbox: simulation!

Post #17 (Random babies, here) presents an activity for introducing students to basic ideas of randomness and probability.  Students use index cards to simulate the random process of distributing four newborn babies to their mothers at random.  Then they use an applet (here) to conduct this simulation much more quickly and efficiently.  Post #18 (What do you expect? here) follows up by introducing the concept of expected value.  The following graph shows how the average number of correct matches (of babies to mothers) changes for the first 1000 repetitions of simulating the random process, gradually approaching the long-run average of 1.0:

The usefulness of simulation for studying and visualizing randomness permeates all of the posts about probability.  For example, post #23 (Random rendezvous, part 1, here) presents the following graph of simulation results to display the probability that two people successfully meet for lunch, when their arrival times are independent uniform distributions and they agree to wait fifteen minutes for each other:

10. Sampling distributions lay the foundation for statistical inference.

One of the questions posed by prospective teachers in post #38 (here) asked me to identify the most challenging topic for introductory statistics students.  My response was: how the value of a sample statistic varies from sample to sample, if we were to repeatedly take random samples from a population.  Of course, for those who know the terminology*, I could have answered with just two words: sampling distributions.  I expanded on this answer in posts #41 and #42 (Hardest topic, here and here).

* Dare I say jargon?

Understanding how a sample statistic varies from sample to sample is crucial for understanding statistical inference. I would add that the topic of randomization distributions deserves equal status with sampling distributions, even though that term is much less widely used.  The difference is simply that whereas the sampling distribution of a statistic results from repeated random sampling, the randomization distribution of a statistic results from repeated random assignment.  In his classic article titled The Introductory Statistics Course: A Ptolemaic Curriculum? (here), George Cobb argued that statistics teachers have done a disservice to students by using the same term (sampling distributions) to refer to both types, which has obscured the important distinction between random sampling and random assignment.

You will not be surprised that I consider the key to studying both sampling distributions and randomization distributions to be … drumroll, please … simulation!

11. Confidence intervals estimate parameters with a margin of error.

The need for interval estimation arises from the fundamental idea of sampling variability, and the concept of sampling distributions provides the underpinning on which confidence interval procedures lie.  I described activities and questions for investigating confidence intervals in a three-part series of posts #14, #15, and #46 (How confident are you? here, here, and here).

In post #15 (here), I argued that many students fail to interpret confidence intervals correctly because they do not think carefully about the parameter being estimated.  Instead, many students mistakenly interpret a confidence interval as a prediction interval for an individual observation.  Helping students to recognize and define parameters clearly is often overlooked but time well spent.

As with many other topics, interactive applets can lead students to explore properties of confidence intervals.  The following graph, taken from post #14 (here) using the applet here, illustrates the impact of confidence level, while revealing that confidence level refers to the proportion of intervals (in the long run, under repeated random sampling) that successfully capture the value of the population parameter:
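For readers who prefer code to applets, the same long-run behavior can be checked with a short simulation.  Here is a sketch in Python (the function name, sample size of 100, and population proportion of 0.5 are my own illustrative choices), which repeatedly draws random samples and records how often a 95% confidence interval for a proportion captures the true parameter value:

```python
import math
import random

def coverage_rate(p=0.5, n=100, conf_z=1.96, reps=10000, seed=1):
    """Estimate the long-run proportion of confidence intervals that capture p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Draw a random sample of size n from a population with proportion p
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        # Margin-of-error for a confidence interval for a proportion
        moe = conf_z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - moe <= p <= p_hat + moe:
            hits += 1
    return hits / reps

print(coverage_rate())              # close to 0.95
print(coverage_rate(conf_z=1.645))  # narrower 90% intervals capture p less often
```

Lowering the multiplier to 1.645 produces narrower 90% intervals, which capture the parameter value less often in the long run, mirroring what the applet reveals about confidence level.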

12. P-values indicate how surprising the sample result would be if a hypothesized model were true.

The p-value has been the subject of much criticism and controversy in recent years (see the 2019 special issue of The American Statistician here).  Some have called for eliminating the use of p-values from scientific inquiry and statistical inference.  I believe that p-values are still essential to teach in introductory statistics, along with the logic of hypothesis testing.  I think the controversy makes clear the importance of helping students to understand the concept of p-value in order to avoid misuse and misinterpretation.

Yet again I advocate for using simulation as a tool for introducing students to p-values.  Many posts have tackled this topic, primarily the three-part series on simulation-based inference in posts #12, #27, and #45 (here, here, and here).  This topic also featured in posts #2 (My favorite question, here), #9 (Statistics of illumination, part 3, here), and #13 (A question of trust, here).

The basic idea behind a p-value is to ask how likely an observed sample result would be if a particular hypothesis about a parameter were true.  For example, post #12 (here) described a study that investigated whether people are more likely to attach the name Tim (rather than Bob) to the face on the left below:

When I asked this question of my students in a recent class, 36 of 46 students associated Tim with the face on the left.  A simulation analysis of 10,000 sets of 46 coin flips (using the applet here) reveals that such an extreme result would happen very rarely with a 50-50 random process, as shown in the graph below.  Therefore, we conclude that the sample result provides strong evidence against the 50-50 hypothesis in favor of the theory that people are more likely to attach the name Tim to the face on the left.
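The same simulation analysis can be sketched in Python (the 36-of-46 counts come from the class result above; the function name and number of repetitions are my own illustrative choices):

```python
import random

def simulate_pvalue(n=46, observed=36, reps=100000, seed=2):
    """Estimate how often a 50-50 process produces 36 or more 'successes' in 46 trials."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(reps):
        # One repetition: flip a fair coin n times
        heads = sum(rng.random() < 0.5 for _ in range(n))
        if heads >= observed:
            extreme += 1
    return extreme / reps

print(simulate_pvalue())  # very small: the observed result rarely occurs by chance
```

The tiny approximate p-value captures what the graph shows: a result as extreme as 36 of 46 would be very surprising if the 50-50 hypothesis were true.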

13. Statistical inference does not reveal, or account for, everything of interest.

It’s imperative that we statistics teachers help students realize that statistical inference has many, many limitations.  This final topic on my list is a catch-all for many sub-topics, of which I describe a few here.

I mentioned the importance of interval estimates earlier, but margin-of-error does not account for many things that can go wrong with surveys.  Margin-of-error pertains to variability that arises from random sampling, and that’s all.  For example, margin-of-error does not take into account the possibility of a biased sampling method.  I described one of my favorite questions for addressing this, with an admittedly ridiculous context, in post #14 (How confident are you? Part 1, here).  If an alien lands on earth, sets out to estimate the proportion of humans who identify as female, and happens upon the U.S. Senate as its sample, then the resulting confidence interval will drastically underestimate the parameter of interest.

Margin-of-error also fails to account for other difficulties of conducting surveys, such as the difficulty of wording questions in a manner that does not influence responses, and the distinct possibility that some people may exaggerate or lie outright in their response.

The distinction between statistical significance and practical importance is also worth emphasizing to students.  One of my favorite questions for addressing this is a somewhat silly one from post #16 (Questions about cats, here).  Based on a large survey of households in the U.S., the proportion of households with a pet cat differs significantly from one-third but is actually quite close to one-third.  Reporting a confidence interval is much more informative than simply producing a p-value in this context and many others.

Another misuse of p-values is to mindlessly compare them to 0.05 as a “bright line” that distinguishes significant results from insignificant ones.  In fact, the editorial (here) in the special issue of The American Statistician mentioned above calls for eliminating use of the term statistical significance in order to combat such “bright line” thinking.

A related and unfortunately common misuse is the practice of p-hacking, which means to conduct a very large number of hypothesis tests on the same dataset and then conclude that those with a p-value less than 0.05 are noteworthy.  A terrific illustration of p-hacking is provided in the xkcd comic here (with explanation here).

Writing this blog for the past year and compiling this list have helped me to realize that my own teaching is lacking in many respects.  I know that if I ever feel like I’ve got this teaching thing figured out, it will be time for me to retire, both from teaching and from writing this blog.

But I am far from that point.  I look forward to returning to full-time teaching this fall after my year-long leave*.  I also look forward to continuing to write blog posts that encourage statistics teachers to ask good questions.

* I picked a very, shall we say, eventful academic year in which to take a leave, didn’t I?

In the short term, though, I am going to take a hiatus in order to catch my breath and recharge my batteries.  I am delighted to announce that this blog will continue uninterrupted, featuring weekly posts by a series of guest bloggers over the next couple of months.

Oh wait, I just realized that I still have not answered a question that I posed in post #1 (here) and promised to answer later: What makes a question good?  I hope that I have illustrated what I think makes a question good with lots and lots and lots of examples through the first 51 posts.  But other than providing examples, I don’t think I have a good answer to this question yet.  This provides another motivation for me to continue writing this blog.  I will provide many, many more examples of what I think constitute good questions for teaching and learning introductory statistics.  I will also continue to reflect on this thorny question (what makes a question good?), and I vow once again to answer the question in a later* post.

* Possibly much later

P.S. I greatly appreciate Cal Poly’s extending a professional leave to me for the past year, which has afforded me the time to write this blog.  

I extend a huge thanks to Beth Chance and Tom Moore, who have read draft posts and offered helpful comments every week*.

* Well, except for the weeks in which I was unable to produce a draft in time.

My final and sincere thanks go to all of you who have read this blog and encouraged me over the past year.

#51 Randomness is hard

I enjoy three-word sentences, such as: Ask good questions. I like cats*. What about lunch**?  Here’s another one: Randomness is hard.

* See post #16 (Questions about cats, here).

** I borrowed this one from Winnie-the-Pooh (see here).

What do I mean when I say that randomness is hard? I mean several things: Randomness is hard to work with, hard to achieve, hard to study.  For the purpose of this post, I mean primarily that randomness is hard to predict, and also that it’s hard to appreciate just how hard randomness is to understand.

Psychologists have studied people’s misconceptions about randomness for decades, and I find these studies fascinating.  I try not to overuse class examples that emphasize misunderstandings, but I do think there’s value in helping students to realize that they can’t always trust their intuition when it comes to randomness.  Applying careful study and thought to the topic of randomness can be worthwhile.

In this post, I discuss some examples that reveal surprising aspects of how randomness behaves and lead students to recognize some flaws in most people’s intuition about randomness.  As always, questions that I pose to students appear in italics.

I ask my students to imagine a light that flashes every few seconds.  The light randomly flashes a green color with probability 0.75 and red with probability 0.25, independently from flash to flash.  Then I ask: Write down a sequence of G’s (for green) and R’s (for red) to predict the colors for the next 40 flashes of this light.  Before you read on, please take a minute to think about how you would generate such a sequence yourself.

Most students produce a sequence that has 30 G’s and 10 R’s, or close to those proportions, because they are trying to generate a sequence for which each outcome has a 75% chance for G and a 25% chance for R.  After we discuss this tendency, I ask: Determine the probability of a correct prediction (for one of the outcomes in the sequence) with this strategy.

We’ll figure this out using a table of hypothetical counts*.  Suppose that we make 400 predictions with this strategy, and that the probabilities hold exactly in the table:

* For more applications of this method, see post #10 (My favorite theorem, here).

First determine the number of times that the light flashes green and the number of times that the light flashes red:

Now fill in the counts for the interior cells of the table.  To do this, remember that the strategy is to predict green 75% of the time and to predict red 25% of the time, which gives:

Fill in the remaining totals.  This gives:

How many times is your prediction correct?  You correctly predict a green light 225 times (top left cell of the table), and you correctly predict a red light 25 times (bottom right), so you are correct 250 times.  These counts are shown in bold here:

For what proportion of the 400 repetitions is your prediction correct?  You are correct for 250 of the 400 repetitions, which is 250/400 = 5/8 = 0.625, or 62.5% of the time. 

Here’s the key question: This is more than half the time, so that’s pretty good, right?  Students are tempted to answer yes, so I have to delicately let students know that this percentage is actually, well, not so great. 

Describe a method for making predictions that would be correct much more than 62.5% of the time.  After a few seconds, I give a hint: Don’t overthink.  And then: In fact, try a much more simple-minded approach.  For students who have not yet experienced the aha moment, I offer another hint: How could you be right 75% of the time?

This last question prompts most students to realize that they could have just predicted green for all 40 flashes.  How often will your prediction be correct with this simple-minded strategy?  You’ll be correct whenever the light flashes green, which is 75% of the time.  Fill in the table to analyze this strategy.  The resulting table is below, with correct predictions again shown in bold.  Notice that your prediction from this simple-minded strategy is correct for 300 of the 400 repetitions:
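This comparison of strategies is also easy to check by simulation.  The following Python sketch (the strategy functions and number of flashes are my own illustrative choices) plays both strategies against a light that flashes green with probability 0.75:

```python
import random

def match_rate(strategy, p_green=0.75, flashes=100000, seed=3):
    """Fraction of flashes predicted correctly by the given strategy."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(flashes):
        color = 'G' if rng.random() < p_green else 'R'
        if strategy(rng) == color:
            correct += 1
    return correct / flashes

def human(rng):
    # "Human" strategy: predict green 75% of the time, red 25% of the time
    return 'G' if rng.random() < 0.75 else 'R'

def rat(rng):
    # "Rat" strategy: always predict green
    return 'G'

print(match_rate(human))  # close to 0.625
print(match_rate(rat))    # close to 0.750
```

The guess-the-pattern strategy lands near 0.625, while always predicting green lands near 0.75, matching the tables of hypothetical counts.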

I learned of this example from Leonard Mlodinow’s book The Drunkard’s Walk: How Randomness Rules Our Lives.  I recount for my students the summary that Mlodinow provides: “Humans usually try to guess the pattern, and in the process we allow ourselves to be outperformed by a rat.”  Then I add: Randomness is hard*.

* At least for humans!

What percent better does the simple-minded (rat) strategy do than the guess-the-pattern (human) strategy?  Well, we have determined these probabilities to be 0.750 for rats and 0.625 for humans, so some students respond that rats do 12.5% better.  Of course, that’s not how percentage change works*.  The correct percentage difference is [(0.750 – 0.625) / 0.625] × 100% = 20.0%.  Rats do 20% better at this game than humans.

* I discussed this at length in post #28 (A persistent pet peeve, here).

For more mathematically inclined students taking a probability course, I often ask a series of questions that generalizes this example: Now let p represent the probability that the light flashes green.  Let’s stipulate that the light flashes green more often than red, so 0.5 < p < 1.  The usual (human) strategy is to guess green with probability p and red with probability (1 – p).  Determine the probability of guessing correctly with this strategy, as a function of p.

We could use a table of hypothetical counts again to solve this, but instead let’s directly use the law of total probability, as follows: Pr(correct) = Pr(predict green)×Pr(green) + Pr(predict red)×Pr(red) = p×p + (1 – p)×(1 – p) = p² + (1 – p)².

Graph this function.  Here’s the graph:

Describe the behavior of this function.  This function is increasing, which makes sense, because your probability of guessing correctly increases as the lop-sidedness of the green-red breakdown increases.  The function equals 0.5 when p = 0.5 and increases to 1 when p = 1.  But the increase is more gradual for smaller values of p than for larger values of p, so the curve is concave up.

Determine the probability of a correct guess for our rat friends, as a function of p.  This one is easy, right?  Pr(correct) = p.  That’s all there is to it.  Rats will always guess green, so they guess correctly at whatever probability green appears.

Graph these two functions (probability of guessing correctly for humans and rats) on the same scale.  Here goes, with the human graph in black and the rat graph in blue:

For what values of p does the rat do better (i.e., have a higher probability of success) than the human?  That’s also easy: All of them!*  Randomness is hard.

* Well, okay, if you want to be technical: Rats and humans tie at the extremes of p = 0.5 and p = 1.0, in case that provides any consolation for your human pride.

Where is the difference between the human and rat probabilities maximized?  Examining the graph that presents both functions together, it certainly looks like the difference is maximized when p = 0.75.  We can confirm this with calculus, by taking the derivative of p² + (1 – p)² – p, setting the derivative equal to zero (4p – 3 = 0), and solving to obtain p = 3/4.
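A quick numerical check of this calculus result can be sketched in Python (a simple grid search over p, rather than symbolic differentiation; the function names are my own):

```python
# Grid search over p in (0.5, 1) for where the rat's edge over the human is largest.
def human_prob(p):
    # Probability the guess-the-pattern strategy is correct
    return p * p + (1 - p) * (1 - p)

def rat_prob(p):
    # Probability the always-guess-green strategy is correct
    return p

best_p = max((0.5 + 0.5 * i / 10000 for i in range(1, 10000)),
             key=lambda p: rat_prob(p) - human_prob(p))
print(round(best_p, 3))                                  # 0.75
print(round(rat_prob(best_p) - human_prob(best_p), 3))   # 0.125
```

The maximum gap of 0.125 matches the 0.750 – 0.625 difference found earlier.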

The “rats beat humans” example reminds me of a classic activity that asks students: Produce a sequence of 100 H’s and T’s (for Heads and Tails) that you think could represent the results of 100 flips of a fair coin. 

With a fair coin, any prediction is correct 50% of the time no matter how you write your sequence, so this activity focuses on a different aspect of randomness, namely the consequence of the independence of the coin flips.  Only after students have completed their sequence do I reveal what comes next: Determine the longest run of consecutive heads in your sequence.  Then I have students construct a dotplot on the board of the distribution of their values for longest run of heads.

How can we investigate how well students performed their task of producing a sequence of coin flip outcomes?  Yet again the answer I am fishing for is: Simulate!  The following graph displays the resulting distribution of longest runs of heads from simulating one million repetitions of 100 flips of a fair coin:

The mean of these one million results is 5.99 flips, and the standard deviation is 1.79 flips.  The maximum value is 25.  The proportion of repetitions that produced a longest run of 5 or more flips is 0.810, and the proportion that produced a longest run of 8 or more flips is 0.170.
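A sketch of this simulation in Python (with fewer repetitions than the one million above; the function name and repetition count are my own illustrative choices):

```python
import random
from itertools import groupby

def longest_heads_run(flips):
    """Length of the longest run of heads (True) in a sequence of flips."""
    return max((sum(1 for _ in grp) for heads, grp in groupby(flips) if heads),
               default=0)

rng = random.Random(4)
reps = 20000
# Each repetition: 100 flips of a fair coin, recording the longest run of heads
runs = [longest_heads_run([rng.random() < 0.5 for _ in range(100)])
        for _ in range(reps)]

print(sum(runs) / reps)                  # mean near 6 flips
print(sum(r >= 5 for r in runs) / reps)  # proportion near 0.81
print(sum(r >= 8 for r in runs) / reps)  # proportion near 0.17
```

Even this smaller simulation reproduces the key features: a longest run of 5 or more heads is very common, and a run of 8 or more is far from rare.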

How do you anticipate students’ results to differ from simulation results?  Student-generated sequences almost always have a smaller mean, a smaller standard deviation, and a smaller proportion for (5 or more) and for (8 or more).  Why?  Because people tend to overestimate how often the coin alternates between heads and tails, so they tend to underestimate the average length for the longest consecutive run of heads.  In other words, people generally do a poor job of producing a plausible sequence of heads and tails.  Randomness is hard.

As a class activity, this is sometimes conducted by having half the class generate a sequence of coin flips in their head and the other half use a real coin, or a table of random digits, or a calculator, or a computer.  The instructor leaves the room as both groups put a dotplot of their distributions for longest runs of heads on the board.  When the instructor returns to the room, not knowing which graph is which, they can usually identify the student-generated graph as the one with a smaller average and less variability.

As another example that illustrates the theme of this post, I ask my students the “Linda” question made famous by cognitive psychologists Daniel Kahneman and Amos Tversky: Linda is 31 years old, single, outspoken, and very bright.  She majored in philosophy.  As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.  Which is more probable? (1) Linda is a bank teller. (2) Linda is a bank teller and is active in the feminist movement.

Kahneman and Tversky found that most people answer that (2) is more probable than (1), and my students are no exceptions.  This is a classic example of the conjunction fallacy: It’s impossible for the conjunction (intersection) of two events to be more probable than either of the events on its own.  In other words, there can’t be more feminist bank tellers in the world than there are bank tellers overall, feminist or otherwise.  In more mathematical terms, event (2) is a subset of event (1), so (2) cannot be more likely than (1).  But most people respond with the impossibility that (2) is more likely than (1).  Randomness is hard.

When I present these examples for students, I always hasten to emphasize that I am certainly not trying to make them feel dumb or duped.  I point out repeatedly that most people are fooled by these questions.  I try to persuade students that cognitive biases such as these are precisely why it’s important to study randomness and probability carefully. 

I also like to think that these examples help students to recognize the importance of humility when confronting randomness and uncertainty.  Moreover, because randomness and uncertainty abound in all aspects of human affairs, I humbly suggest that a dose of humility might be helpful at all times. That thought gives me another three-word sentence to end with: Let’s embrace humility.

P.S. I learned of the activity about longest run of heads from Activity-Based Statistics (described here) and also an article by Mark Schilling (here).

I highly recommend Daniel Kahneman’s book Thinking, Fast and Slow and also Michael Lewis’s book about Kahneman and Tversky’s collaboration and friendship, The Undoing Project: A Friendship That Changed Our Minds.