Chi-Square Test: Definition, Formula, Applications & Python Guide

The Chi-Square Test

The Chi-Square (χ²) test of independence is a non-parametric statistical test used to determine whether there is a relationship between two categorical variables. It assesses how closely the observed frequencies match the frequencies that would be expected if the variables were independent.

Table of Contents

Understanding the Chi-Square Test
Applications of the Chi-Square Test
Importance of the Chi-Square Test
Implementing the Chi-Square Test in Python
Conclusion

Understanding the Chi-Square Test

The Chi-Square test of independence is a type of Pearson’s Chi-Square test, which is commonly used to analyze categorical data. This test helps determine whether there is a significant association between two categorical variables.

If two variables are independent, the probability of one variable taking a particular value should not be influenced by the value of the other variable. The test compares the observed frequency of occurrences in different categories with the expected frequency, which assumes no association between the variables.

Contingency Tables

To conduct a Chi-Square test of independence, data is typically organized into a contingency table, also known as a crosstabulation. This table displays the frequency distribution of variables across different categories and includes row and column totals.

Example: Contingency Table

A city wants to determine whether different intervention methods influence household recycling behavior. They collect data from 300 households and organize it as follows:

Intervention Type	Recycles	Does Not Recycle	Row Totals
Flyer (Pamphlet)	89	9	98
Phone Call	84	8	92
Control (No Intervention)	86	24	110
Column Totals	259	41	N = 300

This table helps visualize how frequently each outcome occurs under different intervention methods.
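As a minimal sketch, assuming NumPy is available, the same table can be stored as an array and its margins recomputed. The variable names here are illustrative and not part of the study itself.

```python
import numpy as np

# Observed counts from the recycling table.
# Rows: Flyer, Phone Call, Control. Columns: Recycles, Does Not Recycle.
observed = np.array([
    [89, 9],
    [84, 8],
    [86, 24],
])

row_totals = observed.sum(axis=1)     # one total per intervention type
column_totals = observed.sum(axis=0)  # one total per outcome
grand_total = observed.sum()          # N, the number of households

print("Row totals:   ", row_totals)
print("Column totals:", column_totals)
print("N =", grand_total)
```

The recomputed totals should match the Row Totals and Column Totals shown in the table, which is a quick way to catch data-entry mistakes.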

Hypotheses for the Chi-Square Test of Independence

The Chi-Square test of independence is an inferential statistical method, meaning it allows conclusions to be drawn about an entire population based on sample data.

Like other hypothesis tests, it involves two competing hypotheses:

Null Hypothesis (H₀): The two variables are not related; their distributions are independent of each other.
Alternative Hypothesis (Hₐ): The two variables are related; their distributions are not independent.

Example: Hypotheses for Recycling Study

H₀: There is no relationship between the type of intervention received and whether a household recycles. The proportion of recycling remains the same across all intervention types.
Hₐ: There is a relationship between the type of intervention received and whether a household recycles. The proportion of recycling differs based on the intervention type.

Expected Frequencies in the Chi-Square Test

The test compares observed frequencies with expected frequencies. Expected frequencies assume that the proportion of one variable is consistent across the values of the other variable.

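The formula to compute the expected frequency for any cell in a contingency table is:

$$E_{r,c} = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$

Where:

$E_{r,c}$ is the expected frequency for the cell in row $r$ and column $c$.
Row Total and Column Total are the totals of that cell's row and column.
Grand Total is the overall sample size (N).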
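As a worked example using the recycling table: the expected frequency of a cell is its row total multiplied by its column total, divided by the grand total. For the Flyer / Recycles cell (row total 98, column total 259, N = 300) this gives:

```latex
E_{\text{Flyer},\,\text{Recycles}} = \frac{98 \times 259}{300} = \frac{25382}{300} \approx 84.61
```

The observed count in that cell is 89, so slightly more flyer households recycle than independence alone would predict.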

When to Use the Chi-Square Test of Independence

The chi-square test of independence is applicable under the following conditions:

You aim to evaluate the relationship between two categorical variables (binary, nominal, or ordinal).
- Typically, this test is conducted on binary or nominal variables.
- It may also be used for ordinal variables, especially when there are fewer than five categories.
The sample should be randomly drawn from the target population.
Each cell of the contingency table should have an expected frequency of at least five.
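The expected-count condition can be checked before running the test. The sketch below assumes SciPy is installed and uses its `scipy.stats.contingency.expected_freq` helper on the recycling table. The threshold of 5 is the rule of thumb from the list above, not something SciPy enforces.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Recycling table (rows: Flyer, Phone Call, Control).
observed = np.array([[89, 9], [84, 8], [86, 24]])

# Expected counts under independence, one per cell.
expected = expected_freq(observed)

# Rule of thumb: every expected count should be at least 5.
smallest = expected.min()
assumption_met = bool(smallest >= 5)

print(f"Smallest expected count: {smallest:.2f}")
print("Expected-count condition met:", assumption_met)
```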

Formula for the Chi-Square Test Statistic

The chi-square (χ²) test statistic is computed using Pearson’s formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where:

χ² is the chi-square test statistic.
∑ denotes the summation across all categories.
O represents the observed frequency in each category.
E denotes the expected frequency for each category.

The chi-square test quantifies how much the observed frequencies deviate from the expected frequencies under the assumption that the variables are independent. Larger values of χ² indicate greater discrepancies between observed and expected distributions.

Steps to Perform a Chi-Square Test of Independence

Step 1: Compute Expected Frequencies

Use the contingency table and apply the following formula to determine expected values:

$$E_{r,c} = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$

Where:

$E_{r,c}$ represents the expected frequency for the cell at row $r$ and column $c$.
Row Total is the sum of all observations in that row.
Column Total is the sum of all observations in that column.
Grand Total is the overall sample size.
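Step 1 can be sketched in a few lines of NumPy for the recycling table. The use of `np.outer` is one convenient choice among several: it multiplies every row total by every column total, and dividing by the grand total then yields the expected count for every cell at once.

```python
import numpy as np

# Recycling table (rows: Flyer, Phone Call, Control).
observed = np.array([[89, 9], [84, 8], [86, 24]])

row_totals = observed.sum(axis=1)
column_totals = observed.sum(axis=0)
grand_total = observed.sum()

# expected[r, c] = row_totals[r] * column_totals[c] / grand_total
expected = np.outer(row_totals, column_totals) / grand_total

print(np.round(expected, 2))
```

The expected counts always add up to the grand total, which is a useful sanity check on the calculation.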

Step 2: Compute the Chi-Square Test Statistic

Once expected frequencies are obtained, substitute them into the chi-square formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
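A minimal sketch of this step for the recycling table, assuming NumPy. Each cell contributes its squared difference between observed and expected counts, divided by the expected count, and the statistic is the sum of those contributions.

```python
import numpy as np

# Recycling table (rows: Flyer, Phone Call, Control).
observed = np.array([[89, 9], [84, 8], [86, 24]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Per-cell contributions (O - E)^2 / E, then their sum.
contributions = (observed - expected) ** 2 / expected
chi2_stat = contributions.sum()

print(np.round(contributions, 3))
print(f"Chi-square statistic: {chi2_stat:.2f}")
```

Printing the per-cell contributions shows which cells drive the statistic. Here the largest contribution comes from control households that do not recycle.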

Step 3: Determine the Critical Chi-Square Value

The critical chi-square value can be found using a chi-square distribution table or statistical software. Two values are required:

Degrees of freedom (df):

$$df = (r - 1)(c - 1)$$

where $r$ is the number of rows and $c$ is the number of columns in the contingency table (excluding the totals).

Significance level (α): Typically set at 0.05 (5%).
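Instead of a printed table, the critical value can be looked up with SciPy. This sketch assumes the 3 x 2 recycling table and α = 0.05, computes the degrees of freedom as (rows − 1) × (columns − 1), and uses the inverse CDF `chi2.ppf`.

```python
from scipy.stats import chi2

n_rows, n_cols = 3, 2  # shape of the recycling table, totals excluded
alpha = 0.05           # significance level

df = (n_rows - 1) * (n_cols - 1)

# Value that a chi-square variable with `df` degrees of freedom
# exceeds with probability alpha.
critical_value = chi2.ppf(1 - alpha, df)

print(f"df = {df}, critical value = {critical_value:.3f}")
```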

Step 4: Compare Test Statistic to Critical Value

If χ² exceeds the critical value, the difference between observed and expected distributions is statistically significant (p < α). This allows rejection of the null hypothesis, indicating a relationship between the variables.
If χ² is smaller than the critical value, the difference is not significant (p > α), meaning there is insufficient evidence to conclude that the variables are related.
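The comparison can be sketched for the recycling table with SciPy's `chi2_contingency`, which returns the statistic, p-value, degrees of freedom, and expected counts in one call. The choice of α = 0.05 is an assumption carried over from Step 3.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Recycling table (rows: Flyer, Phone Call, Control).
observed = np.array([[89, 9], [84, 8], [86, 24]])
alpha = 0.05

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
critical_value = chi2.ppf(1 - alpha, dof)

# Both rules lead to the same decision:
# statistic above the critical value <=> p-value below alpha.
reject_by_critical_value = chi2_stat > critical_value
reject_by_p_value = p_value < alpha

print(f"chi2 = {chi2_stat:.2f}, critical value = {critical_value:.3f}")
print(f"p-value = {p_value:.4f}")
print("Reject H0:", bool(reject_by_critical_value))
```

For this table the statistic exceeds the critical value, so the null hypothesis of no relationship between intervention type and recycling is rejected at the 5% level.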

Applications of the Chi-Square Test

The Chi-squared test is extensively used across various fields to evaluate associations between categorical variables and assess goodness of fit. Below are some of its key applications:

Medical Research:
- Risk Factor Analysis: Used to determine relationships between lifestyle factors (e.g., smoking, diet) and disease occurrence.
Biology and Genetics:
- Genetic Predictions: Helps in verifying whether observed genetic distributions align with expected Mendelian inheritance ratios.
Market Research:
- Consumer Behavior: Analyzes survey data to identify if demographic factors influence product preferences.
Quality Control:
- Product Defects: Assesses whether the frequency of defective products in a batch matches expected proportions.
Education:
- Academic Performance: Evaluates if test performance is significantly associated with variables like teaching methods or study habits.

These examples highlight the versatility of the Chi-squared test in analyzing categorical data and identifying meaningful relationships in research and business applications.
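The genetics application above is a goodness-of-fit test rather than a test of independence. As a hedged sketch, SciPy's `scipy.stats.chisquare` can compare observed counts against a 9:3:3:1 Mendelian ratio. The counts below are the ones commonly quoted from Mendel's dihybrid pea cross and are used here only as an illustration.

```python
from scipy.stats import chisquare

# Illustrative counts (commonly quoted from Mendel's dihybrid pea cross):
# round-yellow, round-green, wrinkled-yellow, wrinkled-green.
observed = [315, 108, 101, 32]

# Expected counts under a 9:3:3:1 ratio, scaled to the same total.
ratio = [9, 3, 3, 1]
total = sum(observed)
expected = [total * part / sum(ratio) for part in ratio]

result = chisquare(f_obs=observed, f_exp=expected)

print(f"Chi-square statistic: {result.statistic:.3f}")
print(f"P-value: {result.pvalue:.3f}")
```

A small statistic with a large p-value means the observed counts are consistent with the hypothesized ratio.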

Importance of the Chi-Square Test

The Chi-squared test is a fundamental tool in statistical analysis due to its broad applicability and effectiveness. Here’s why it holds significance:

Evaluating Relationships Between Variables:
- The test determines whether two categorical variables are statistically associated, which is valuable for understanding dependencies in data.
Goodness of Fit Assessment:
- It checks whether observed data distribution aligns with an expected theoretical model, making it useful in validating assumptions.
Non-Parametric Nature:
- Unlike parametric tests, the Chi-squared test does not assume a normal data distribution, making it applicable to a wide range of datasets.
Suitability for Categorical Data:
- Designed specifically for categorical variables, the test is ideal for organizing and analyzing contingency tables.
Ease of Interpretation:
- The test results provide a p-value that helps determine statistical significance. If the p-value is below a chosen threshold (commonly 0.05), it indicates enough evidence to reject the null hypothesis.

Overall, the Chi-squared test is a powerful statistical tool that aids in data-driven decision-making across disciplines.

Implementing the Chi-Square Test in Python

Problem Statement:

A researcher wants to determine if there is a significant association between two categorical variables: "Treatment Type" and "Outcome." The data collected from a clinical trial are as follows:

Treatment Type: Two types of treatments, A and B.
Outcome: Two possible outcomes, "Success" and "Failure."

The observed frequencies of each combination of treatment type and outcome are recorded in a contingency table. The researcher aims to test whether the treatment type is independent of the outcome or whether there is a significant association between them.

import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies in a 2x2 contingency table
observed_frequencies = np.array([[30, 10], [20, 40]])

# Perform the Chi-Square Test of Independence.
# For a 2x2 table (1 degree of freedom), SciPy applies Yates' continuity
# correction by default; pass correction=False for the uncorrected statistic.
chi2_stat, p_value, dof, expected = chi2_contingency(observed_frequencies)

# Print results
print(f'Chi-squared statistic: {chi2_stat:.2f}')
print(f'P-value: {p_value:.4f}')
print(f'Degrees of freedom: {dof}')
print('Expected frequencies:')
print(expected)

# Interpretation at the 5% significance level
alpha = 0.05
if p_value < alpha:
    print('There is a significant association between the variables.')
else:
    print('There is no significant association between the variables.')

Output:

Chi-squared statistic: 15.04
P-value: 0.0001
Degrees of freedom: 1
Expected frequencies:
[[20. 20.]
 [30. 30.]]
There is a significant association between the variables.
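The statistic of 15.04 differs from what Pearson's formula gives by hand because `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default. The sketch below runs the same data with and without the correction. Which setting is appropriate depends on the analysis, so it is shown as a comparison rather than a recommendation.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed_frequencies = np.array([[30, 10], [20, 40]])

# Default behaviour: Yates' continuity correction is applied when dof == 1.
chi2_corrected, p_corrected, _, _ = chi2_contingency(observed_frequencies)

# correction=False gives the plain Pearson statistic, sum((O - E)^2 / E).
chi2_plain, p_plain, _, _ = chi2_contingency(observed_frequencies, correction=False)

print(f"With Yates' correction: chi2 = {chi2_corrected:.2f}, p = {p_corrected:.5f}")
print(f"Without correction:     chi2 = {chi2_plain:.2f}, p = {p_plain:.5f}")
```

Both versions lead to the same conclusion for this table, but the corrected statistic is always somewhat smaller and therefore more conservative.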

Conclusion

The Chi-squared test is an essential statistical method used to analyze relationships between categorical variables and evaluate how well observed data align with expected distributions. Below is a summary of its key aspects:

Assessing Associations:
- The Chi-squared test for independence determines whether two categorical variables are statistically related, helping researchers identify patterns and dependencies in data.
Evaluating Goodness of Fit:
- This test is used to check if observed data conform to an expected distribution, making it useful for validating hypotheses about categorical data patterns.
Analyzing Categorical Data:
- Specifically designed for categorical variables, the Chi-squared test is widely applied in organizing and interpreting data structured in contingency tables.
Non-Parametric Nature:
- Since the test does not require assumptions about the underlying data distribution, it is highly versatile and applicable across different types of datasets.
Use of Contingency Tables:
- The test is commonly applied to contingency tables, comparing observed and expected frequencies to determine whether significant differences exist.

In conclusion, the Chi-squared test is a powerful and widely used statistical tool that provides valuable insights into categorical data. Its simplicity, flexibility, and broad applicability make it an indispensable method for hypothesis testing and data analysis across various fields.