This chapter extends the methods of the previous chapter to apply confidence intervals and hypothesis tests to differences in population proportions: \(p_1 - p_2\).

Sampling distribution of the difference of two proportions

[!def] Conditions for the sampling distribution of the difference of two proportions to be normal

\(\hat{p}_1 - \hat{p}_2\) can be modeled using a normal distribution when:

- Independence, extended: The data are independent within and between the two groups. Generally satisfied if the data come from two independent random samples or if the data come from a randomized experiment.
- Success-failure condition: The success-failure condition holds for both groups (checked separately).

When these conditions are satisfied, the standard error of \(\hat{p}_1 - \hat{p}_2\) is \(SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\) where \(p_1\) and \(p_2\) represent the population proportions, and \(n_1\) and \(n_2\) represent the sample sizes.
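The two checks and the standard error formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the input values are hypothetical, and the threshold of 10 expected successes and failures is the one used later in this chapter.

```python
import math


def success_failure_ok(p, n, threshold=10):
    """Return True when expected successes and failures both reach the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold


def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of p1_hat - p2_hat for two independent groups."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Hypothetical proportions and sample sizes, chosen only for illustration.
p1, n1 = 0.30, 200
p2, n2 = 0.25, 150

# The success-failure condition is checked separately for each group.
both_ok = success_failure_ok(p1, n1) and success_failure_ok(p2, n2)
se = se_diff_proportions(p1, n1, p2, n2)
print(both_ok, round(se, 4))
```

The independence condition cannot be verified from the numbers alone; it depends on how the data were collected.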

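We can apply the generic confidence interval formula to a difference of two proportions, using \(\hat{p}_1 - \hat{p}_2\) as the point estimate and substituting the SE formula:

\[\hat{p}_1 - \hat{p}_2 \pm z^* \times \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\]

Because \(p_1\) and \(p_2\) are unknown, the sample proportions \(\hat{p}_1\) and \(\hat{p}_2\) are plugged in for them when computing the standard error.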
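As a rough sketch of the interval calculation, the snippet below plugs the sample proportions into the standard error and takes \(z^*\) from the standard normal distribution. The function name and the counts are hypothetical and serve only to show the arithmetic.

```python
import math
from statistics import NormalDist


def ci_diff_proportions(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 from success counts x1, x2 and sizes n1, n2."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Critical value z* that leaves (1 - confidence) / 2 in each tail.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample proportions stand in for the unknown population proportions.
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z_star * se, diff + z_star * se


# Hypothetical counts: 60 successes out of 200 versus 30 out of 150.
low, high = ci_diff_proportions(60, 200, 30, 150)
print(round(low, 4), round(high, 4))
```

The interval is only meaningful if the independence and success-failure conditions hold for both groups.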

Hypothesis testing for a difference of proportions

When the null hypothesis is that the two proportions are equal (for example, that one treatment is no more effective than the other), use the pooled proportion to verify the success-failure condition and to estimate the standard error:

\(\large \hat{p}_{pooled}=\frac{\text{number of successes}}{\text{number of cases}}=\frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1+n_2}\)
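To make the role of the pooled proportion concrete, here is how it would typically enter the calculation, assuming the usual construction in which \(\hat{p}_{pooled}\) replaces both \(p_1\) and \(p_2\) in the standard error formula. The symbol \(Z\) for the test statistic is introduced here for illustration.

\[SE_{pooled} = \sqrt{\frac{\hat{p}_{pooled}(1-\hat{p}_{pooled})}{n_1} + \frac{\hat{p}_{pooled}(1-\hat{p}_{pooled})}{n_2}}\]

\[Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE_{pooled}}\]

The 0 in the numerator is the null value of \(p_1 - p_2\).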

Are mammograms effective?

| Group | Died | Lived |
| --- | --- | --- |
| Mammogram | 500 | 44425 |
| Control | 505 | 44405 |

The null hypothesis here (that mammograms are not effective) is \(p_1 - p_2 = 0\), where \(p_1\) and \(p_2\) are the breast cancer death rates in the mammogram and control groups. We use the pooled proportion to check the success-failure condition:

\[\large \hat{p}_{pooled} = \frac{\text{total \# of patients who died}}{\text{total \# of patients in study}} = \frac{500+505}{500+44425+505+44405} \approx 0.0112\]

This is an estimate of the breast cancer death rate across the entire study, and it is our best estimate of both \(p_1\) and \(p_2\) if the null hypothesis \(p_1 = p_2\) is true.

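Is it reasonable to model the difference in proportions using a normal distribution? With the pooled proportion, the expected counts \(\hat{p}_{pooled} \times n\) and \((1-\hat{p}_{pooled}) \times n\) are roughly 500 deaths and 44,400 survivors in each group. Since all four values are at least 10, the success-failure condition is satisfied.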
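The check can be reproduced from the counts in the table. The sketch below also carries the calculation one step further, to a pooled standard error, a test statistic, and a two-sided p-value; that last part is an assumed continuation that the text above does not work through, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Counts taken from the table in this section.
died_mgm, lived_mgm = 500, 44425
died_ctrl, lived_ctrl = 505, 44405

n_mgm = died_mgm + lived_mgm
n_ctrl = died_ctrl + lived_ctrl

# Pooled proportion: all deaths divided by all patients in the study.
p_pooled = (died_mgm + died_ctrl) / (n_mgm + n_ctrl)

# Expected successes and failures in each group under the null hypothesis.
expected = {
    "mammogram, died": p_pooled * n_mgm,
    "mammogram, lived": (1 - p_pooled) * n_mgm,
    "control, died": p_pooled * n_ctrl,
    "control, lived": (1 - p_pooled) * n_ctrl,
}
condition_met = all(count >= 10 for count in expected.values())

# Assumed next step (not worked through in the text): pooled SE and z statistic.
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_mgm + 1 / n_ctrl))
z = (died_mgm / n_mgm - died_ctrl / n_ctrl) / se_pooled
p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided p-value

print(round(p_pooled, 4), condition_met)
print(round(z, 2), round(p_value, 2))
```

A test statistic close to zero, paired with a large p-value, would mean the data give little evidence of a difference between the two death rates.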