- Formal name for a row in a table is a case or observational unit
- Columns represent characteristics, aka variables
- A data matrix is a table (like an Excel spreadsheet) in which each row is a case and each column is a variable
- Types of variables
- Numerical
- Continuous or Discrete
- Categorical
- Values are called the variable's levels
- Ordinal variables have a natural order, e.g. 'education level': high school below university, undergrad below postgrad, etc.
- Nominal variables have no natural order, e.g. types of fruit
- When two variables show a connection with each other, they are said to be associated or dependent
- When one variable is suspected of affecting another, the first is called the explanatory variable and the second the response variable, e.g. study time (explanatory) and grade results (response).
- An observational study collects data in a way that does not directly interfere with how the data arise
- prospective study identifies individuals and collects information as events unfold
- retrospective study collects data after events have taken place
- Simple random sample - equivalent to using a raffle to select cases; every case in the population has an equal chance of being included
- Non-response bias is introduced when there is a high rate of non-response - certain groups may not be represented because they choose not to respond to surveys
- Convenience sample - picking the cases that are most easily accessible, so the sample may not be representative of the population
- Making causal conclusions based on observational data is not recommended - generally, observational data only show associations, which can be used to form hypotheses.
- A confounding variable is correlated with both the explanatory and response variables
- e.g. a study shows that people who use more sunscreen get skin cancer more often
- the confounding variable is sun exposure time; people who are outside more are more likely to use sunscreen and get skin cancer
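The sunscreen example can be illustrated with a small simulation. This is only a sketch: every probability and name below is invented for illustration and does not come from any study. In the simulated data, sun exposure raises both the chance of using sunscreen and the chance of skin cancer, while sunscreen itself has no effect on cancer at all.

```python
import random

def simulate(n=100_000, seed=0):
    """Simulate people where sun exposure drives both sunscreen use and cancer.

    All probabilities are made up for illustration only.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        high_exposure = rng.random() < 0.5
        # Confounder -> explanatory variable
        uses_sunscreen = rng.random() < (0.8 if high_exposure else 0.2)
        # Confounder -> response; sunscreen plays no role here
        cancer = rng.random() < (0.10 if high_exposure else 0.02)
        people.append((high_exposure, uses_sunscreen, cancer))
    return people

def cancer_rate(people, sunscreen, exposure=None):
    """Cancer rate among people with the given sunscreen use (and exposure, if given)."""
    group = [c for e, s, c in people
             if s == sunscreen and (exposure is None or e == exposure)]
    return sum(group) / len(group)

people = simulate()
# Ignoring the confounder: sunscreen users appear to get cancer more often.
naive_users = cancer_rate(people, sunscreen=True)
naive_nonusers = cancer_rate(people, sunscreen=False)
# Holding exposure fixed (high-exposure people only): the gap largely disappears.
high_users = cancer_rate(people, sunscreen=True, exposure=True)
high_nonusers = cancer_rate(people, sunscreen=False, exposure=True)
print(f"all people:    users {naive_users:.3f}, non-users {naive_nonusers:.3f}")
print(f"high exposure: users {high_users:.3f}, non-users {high_nonusers:.3f}")
```

Comparing the two printed lines shows how an association in the full data can come entirely from the confounder rather than from the explanatory variable.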
Four sampling methods
- Simple random sampling
- raffle; all cases in population are equally likely to be included
- e.g. randomly choosing NBA players to compare salary
- Stratified sampling
- population divided into groups called strata, chosen so that similar cases are grouped
- second sampling method, usually simple random, is employed within each stratum
- e.g. randomly picking 5 members of each NBA team to compare salary
- Cluster sampling
- Break population into groups (clusters), then sample a fixed number of clusters and include all observations from each of those clusters in the sample
- Most useful when there is a lot of variation within each cluster but the clusters themselves look similar to one another
- Multistage sampling
- like cluster sampling, but instead of keeping all observations from each selected cluster, we collect a random sample within each one
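The sampling methods above can be sketched with Python's standard `random` module. The population here is hypothetical (30 teams of 15 players with made-up identifiers), and the sample sizes are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)

# Hypothetical population: 30 teams of 15 players each.
teams = {f"team_{t:02d}": [f"team_{t:02d}_player_{p:02d}" for p in range(15)]
         for t in range(30)}
population = [player for roster in teams.values() for player in roster]

# Simple random sample: every case is equally likely to be drawn.
simple = rng.sample(population, 60)

# Stratified sample: teams are the strata; draw a simple random sample within each.
stratified = [player for roster in teams.values() for player in rng.sample(roster, 5)]

# Cluster sample: pick a few teams at random and keep everyone on them.
cluster_teams = rng.sample(sorted(teams), 4)
cluster = [player for team in cluster_teams for player in teams[team]]

# Multistage sample: pick teams at random, then sample players within each picked team.
stage_one = rng.sample(sorted(teams), 10)
multistage = [player for team in stage_one for player in rng.sample(teams[team], 6)]

print(len(simple), len(stratified), len(cluster), len(multistage))  # 60 150 60 60
```

Note that the stratified sample touches every team, while the cluster and multistage samples only ever include players from the teams selected in the first step.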
Experiments
Any study where the researchers assign treatments is an experiment. Randomized experiments generally follow four principles:
- Controlling - researchers control differences in groups.
- e.g. some people take pills with a small amount of water, others a large amount. Doctors may ask all patients to drink the same volume of water with the pill instead
- Randomization - randomize patients into treatment groups to account for variables that cannot be controlled
- e.g. some patients may be more susceptible to a disease because of their diet; randomization helps even out these differences
- Replication - the more cases you observe, the more accurately you can estimate the effect of the explanatory variable on the response. Replicate the experiment until you collect a large enough sample. Additionally, studies can be replicated by others to verify their results
- Blocking - sometimes variables other than treatment influence the response. When this happens, group individuals based on this variable into blocks, then randomize cases within each block
- e.g. looking at the effect of a drug on heart attacks, we split the patients into low-risk and high-risk, then randomly assign half the patients from each block to the control group and the other half to the treatment group
- When a patient doesn't know whether they are in the treatment group or control group, the study is said to be blind
- A double-blind study is when the researcher also doesn't know which group is which - if they know, they can introduce bias
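The blocking example (low-risk and high-risk heart patients) could be sketched as follows. This is a minimal illustration, not a clinical randomization procedure: the function name `blocked_assignment`, the patient ids, and the block sizes are all hypothetical.

```python
import random

def blocked_assignment(patients, seed=0):
    """Randomly assign half of each risk block to treatment and half to control.

    `patients` maps a patient id to its block label (e.g. "low" or "high").
    Returns a dict mapping patient id to "treatment" or "control".
    """
    rng = random.Random(seed)
    # Group patients into blocks by risk level.
    blocks = {}
    for patient, risk in patients.items():
        blocks.setdefault(risk, []).append(patient)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)  # randomize within the block
        half = len(shuffled) // 2
        for patient in shuffled[:half]:
            assignment[patient] = "treatment"
        for patient in shuffled[half:]:
            assignment[patient] = "control"
    return assignment

# Hypothetical patients: 8 low-risk and 4 high-risk.
patients = {f"p{i:02d}": ("low" if i < 8 else "high") for i in range(12)}
assignment = blocked_assignment(patients)
for patient in sorted(assignment):
    print(patient, patients[patient], assignment[patient])
```

Because the shuffle happens separately inside each block, both groups end up with the same mix of low-risk and high-risk patients, which plain randomization over all patients would only achieve on average.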