Point estimates and error¶

Suppose a poll suggested the US President's approval rating is 45%. We would consider 45% to be a point estimate of the approval rating we might see if we collected responses from the entire population. This entire-population response proportion is generally referred to as the parameter of interest.

When the parameter is a proportion, it is often denoted by \(p\), and we often refer to the sample proportion as \(\hat{p}\) (p-hat). Unless we collect responses from every person in the population, \(p\) is unknown and we use \(\hat{p}\) as the estimate of \(p\). The difference in the poll vs. the parameter is called the error in the estimate.

Generally, the error consists of two aspects:

Sampling error (aka sampling uncertainty: how much an estimate will tend to vary from one sample to the next
- e.g. estimate from one sample might be 1% too low while in another it may be 3% too high
- sample size is often represented by \(n\)
Bias: a systematic tendency to over/underestimate the true population value.
- e.g. wording questions in certain ways can provoke certain responses that don’t indicate the true value

Understanding the variability of a point estimate¶

Suppose the proportion of a population who supports a topic is \(p = 0.88\), which is the parameter of interest. If we took a poll of X number of people from the population, how close would the sample proportion be to the parameter of interest? In other words, how does the sample proportion \(\hat{p}\) behave when the true population proportion is 0.88?

We can simulate situations like this with computer code. Multiple simulations need to be run to get an accurate estimate. The distribution of sample proportions is called a sampling distribution. Sampling distributions have three areas of interest: the center (same as the parameter), the spread (standard deviation - aka standard error), and the shape.

Central Limit Theorem¶

The distribution above looks very similar to a normal distribution. This is due to the principle known as the Central Limit Theorem.

[!def] Central Limit Theorem When observations are independent and the sample size is sufficiently large, the sample proportion \(\hat{p}\) will tend to follow a normal distribution with the following mean \(\mu_{\hat{p}}=p\) and standard error \(SE_{\hat{p}}=\sqrt{\frac{p(1-p)}{n}}\) If we do not know \(p\) we can use \(\hat{p}\), which is known as a substitution approximation.

The sample size is typically considered sufficiently large when \(np \ge 10\) and \(n(1-p)\ge10\), which is called the success-failure condition

Sometimes another condition added is that samples from a population must not exceed 10% of the population. When the sample exceeds 10% of the population size, some simpler methods (that we use in this book) tend to overestimate the sampling error. v

Verifying that sample observations are independent¶

Subjects in an experiment are considered independent if they undergo random assignment to treatment groups.

If observations are from a simple random sample, they are independent.

If a sample is from a seemingly random process, e.g. an occasional error on an assembly line, checking independence is more difficult.