  • Formal name for a row in a table is a case or observational unit
    • Columns represent characteristics, aka variables
  • A data matrix is a table, like an Excel spreadsheet, with one row per case and one column per variable
  • Types of variables
    • Numerical
      • Continuous or Discrete
    • Categorical
      • Values are called the variable's levels
      • Ordinal variables have a natural order, e.g. 'education level' - high school below university, undergrad below postgrad, etc.
      • Nominal variables have no natural order, e.g. types of fruit
  • When two variables show a connection with each other, they are said to be associated or dependent
  • An explanatory variable has an effect on a response variable, e.g. study time affects grade results.

  • An observational study is when you collect data in a way that does not directly interfere with how the data arises

    • prospective study identifies individuals and collects information as events unfold
    • retrospective study collects data after events have taken place
  • Simple random sample - equivalent to using a raffle to select cases; all cases in the population have an equal chance of being included

  • Non-response bias is introduced when there is a high rate of non-response - certain groups may not be represented because they choose not to respond to surveys
  • Convenience sample - picking the most easily accessible cases to be included in the sample

  • Making causal conclusions based on observational data is not recommended - generally it only shows associations that can be used to form hypotheses.

  • A confounding variable is correlated with both the explanatory and response variables
    • e.g. a study shows that people who use more sunscreen get skin cancer more often
    • the confounding variable is sun exposure time; people who are outside more are more likely both to use sunscreen and to get skin cancer
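
A small worked example of the sunscreen case, as a sketch only. The counts are invented purely for illustration (they are not from any study), and `cancer_rate` is a hypothetical helper. Overall, sunscreen users appear to have a higher cancer rate, but within each sun-exposure group the rates are identical, so the apparent association comes entirely from the confounder.

```python
# Made-up counts for illustration only:
# (sun exposure, uses sunscreen) -> (number of people, skin cancer cases)
counts = {
    ("high", True): (800, 80),
    ("high", False): (200, 20),
    ("low", True): (200, 4),
    ("low", False): (800, 16),
}


def cancer_rate(uses_sunscreen, exposure=None):
    """Cancer rate for sunscreen users or non-users, optionally within one exposure group."""
    people = cases = 0
    for (exp, sunscreen), (n, k) in counts.items():
        if sunscreen == uses_sunscreen and (exposure is None or exp == exposure):
            people += n
            cases += k
    return cases / people


# Ignoring sun exposure, sunscreen looks harmful.
overall_users = cancer_rate(True)
overall_non_users = cancer_rate(False)

# Comparing like with like (same exposure), the difference disappears.
high_users = cancer_rate(True, "high")
high_non_users = cancer_rate(False, "high")
low_users = cancer_rate(True, "low")
low_non_users = cancer_rate(False, "low")

print(f"overall: users {overall_users:.3f} vs non-users {overall_non_users:.3f}")
print(f"high exposure: users {high_users:.3f} vs non-users {high_non_users:.3f}")
print(f"low exposure: users {low_users:.3f} vs non-users {low_non_users:.3f}")
```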

Four sampling methods

  • Simple random sampling
    • raffle; all cases in population are equally likely to be included
    • e.g. randomly choosing NBA players to compare salary
  • Stratified sampling
    • population divided into groups called strata, chosen so that similar cases are grouped
    • a second sampling method, usually simple random sampling, is then applied within each stratum
    • e.g. randomly picking 5 members of each NBA team to compare salary
  • Cluster sampling
    • Break the population into groups (clusters), then sample a fixed number of clusters and include all observations from each of those clusters in the sample
    • Most useful when cases within a cluster vary a lot, but the clusters themselves don't look very different from one another (the opposite of stratified sampling)
  • Multistage sampling
    • Similar to cluster sampling, but instead of keeping all observations from each selected cluster, we collect a random sample within each selected cluster
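
The sampling methods above can be sketched in Python with the standard `random` module. This is a minimal illustration, not a definitive implementation: the function names, the `group_by` helper, and the six-team league of `(team, number)` players are all hypothetical, made up to mirror the NBA examples.

```python
import random
from collections import defaultdict


def group_by(cases, key):
    """Group cases into lists keyed by key(case)."""
    groups = defaultdict(list)
    for case in cases:
        groups[key(case)].append(case)
    return groups


def simple_random_sample(cases, n, rng):
    # The "raffle": every case is equally likely to be drawn.
    return rng.sample(cases, n)


def stratified_sample(cases, stratum_of, n_per_stratum, rng):
    # Take a simple random sample inside every stratum.
    sample = []
    for members in group_by(cases, stratum_of).values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(cases, cluster_of, n_clusters, rng):
    # Pick whole clusters at random and keep every case in them.
    clusters = group_by(cases, cluster_of)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [case for name in chosen for case in clusters[name]]


def multistage_sample(cases, cluster_of, n_clusters, n_per_cluster, rng):
    # Pick clusters at random, then take a random sample within each one.
    clusters = group_by(cases, cluster_of)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(rng.sample(clusters[name], n_per_cluster))
    return sample


def team_of(player):
    return player[0]


# Hypothetical league: 6 teams of 10 players, each player a (team, number) pair.
players = [(f"team{t}", p) for t in range(6) for p in range(10)]
rng = random.Random(0)  # fixed seed only so the run is repeatable

srs = simple_random_sample(players, 12, rng)        # 12 players from the whole league
stratified = stratified_sample(players, team_of, 5, rng)  # 5 players from every team
clustered = cluster_sample(players, team_of, 2, rng)      # everyone on 2 random teams
multistage = multistage_sample(players, team_of, 2, 3, rng)  # 3 players from each of 2 random teams
```

Note that the stratified sample touches every team, while the cluster and multistage samples only ever touch the selected teams.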

Experiments

Any study where the researchers assign treatments is an experiment. Randomized experiments generally follow four principles:

  • Controlling - researchers control differences in groups.
    • e.g. some people take pills with a small amount of water, others a large amount. Doctors may ask all patients to drink the same volume of water with the pill instead
  • Randomization - randomize patients into treatment groups to account for variables that cannot be controlled
    • e.g. some patients may be more susceptible to a disease because of their diet; randomization helps even out these differences
  • Replication - the more cases you observe, the more accurately you can estimate the effect of the explanatory variable on the response. Replicate the experiment until you collect a large enough sample. Additionally, studies can be replicated by others to verify their results
  • Blocking - sometimes variables other than treatment influence the response. When this happens, group individuals based on this variable into blocks, then randomize cases within each block
    • e.g. looking at the effect of a drug on heart attacks, we split the patients into low-risk and high-risk, then randomly assign half the patients from each block to the control group and the other half to the treatment group
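
Blocking followed by randomization can be sketched as below, mirroring the low-risk/high-risk heart attack example. This is an illustrative sketch under assumptions: `blocked_assignment` is a hypothetical helper, and the 20 patients with their risk labels are made up.

```python
import random


def blocked_assignment(patients, block_of, rng):
    """Randomly split each block in half: one half treatment, the other control."""
    # Group patients into blocks (e.g. by risk level).
    blocks = {}
    for patient in patients:
        blocks.setdefault(block_of(patient), []).append(patient)

    assignment = {}
    for members in blocks.values():
        shuffled = list(members)
        rng.shuffle(shuffled)  # randomize the order within the block
        half = len(shuffled) // 2
        for patient in shuffled[:half]:
            assignment[patient] = "treatment"
        for patient in shuffled[half:]:
            assignment[patient] = "control"
    return assignment


def risk_of(patient):
    return patient[1]


# Made-up patients: ids 0-7 are high-risk, ids 8-19 are low-risk.
patients = [(i, "high" if i < 8 else "low") for i in range(20)]
rng = random.Random(1)  # fixed seed only so the run is repeatable
groups = blocked_assignment(patients, risk_of, rng)

for patient, group in sorted(groups.items()):
    print(patient, group)
```

Because the split happens inside each block, both the treatment and control groups end up with the same mix of high-risk and low-risk patients, which plain randomization does not guarantee.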

  • When a patient doesn't know whether they are in the treatment group or control group, the study is said to be blind
  • A double-blind study is when the researchers interacting with patients also don't know which group each patient is in - if they know, they can introduce bias