In this lesson, we'll learn about the chi-squared test for categorical data. This is a statistical hypothesis test whose null hypothesis is that the observed frequencies for a categorical variable match the expected frequencies for that variable.

When comparing two distributions, we might sense that something looks off without knowing how to quantify how different the observed and expected values are. We also have no way yet to determine whether the difference between them is statistically significant and whether we need to investigate further.

This is where a chi-squared test can help. The chi-squared test enables us to quantify the difference between sets of observed and expected categorical values.
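As a rough preview of that idea, here is a minimal Python sketch of the standard chi-squared statistic: for each category, square the difference between the observed and expected counts, divide by the expected count, and add up the results. The counts and the function name `chi_squared_statistic` are made up for illustration and do not come from the lesson's dataset.

```python
# Hypothetical counts, invented purely for illustration.
observed = [60, 40]  # counts actually seen in each of two categories
expected = [50, 50]  # counts we would expect in each category


def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


chi_sq = chi_squared_statistic(observed, expected)
print(chi_sq)  # (10^2 / 50) + ((-10)^2 / 50) = 4.0
```

Squaring each difference keeps positive and negative deviations from cancelling each other out, and dividing by the expected count puts every category on a comparable scale.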

In this lesson, you will discover the formula for the chi-squared test statistic and build intuition around why and how it quantifies the difference between observed and expected categorical values. You will also learn about p-values, a key metric that helps us judge whether that difference is plausibly due to chance or reflects a deeper, meaningful difference.

We’ll also cover what degrees of freedom are and how they play a role in statistics.
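To make the link between the test statistic, degrees of freedom, and a p-value concrete, here is a small sketch under stated assumptions. It uses hypothetical counts for two categories, so there is one degree of freedom (number of categories minus one). It also relies on a shortcut that holds only for one degree of freedom: the upper tail probability of the chi-squared distribution equals `erfc(sqrt(x / 2))`. The helper name `chi_squared_p_value_df1` is invented for this example.

```python
import math


def chi_squared_p_value_df1(statistic):
    """Upper-tail probability of a chi-squared distribution with
    exactly 1 degree of freedom. Not valid for other degrees of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical counts, invented purely for illustration.
observed = [60, 40]
expected = [50, 50]

# For this kind of test, degrees of freedom = number of categories - 1.
degrees_of_freedom = len(observed) - 1

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_value = chi_squared_p_value_df1(statistic)

print(degrees_of_freedom, statistic, round(p_value, 4))
```

With more than two categories the shortcut above no longer applies. In that case a library routine such as `scipy.stats.chisquare` can compute both the statistic and the p-value.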

As you work through each concept, you'll apply what you've learned in our interactive Python environment with answer-checking, so you get practice writing Python and feedback on your new statistics skills as you learn.

Objectives

  • Learn to determine the statistical significance of observing a set of categorical values.
  • Learn to generate and work with the chi-squared distribution.
  • Learn to use Python functions for the chi-squared distribution.

Lesson Outline

1. Looking At Categorical Data
2. Observed Data vs Expected Data
3. Dealing With Cancellation
4. Some Statistical Insights
5. Developing a Null Hypothesis
6. Importance of Sample Size
7. Considering More Categories
8. Adjusting the Distribution Under The Null
9. Next Steps
10. Takeaways