2.1 Examining numerical data

The formal name for a row is a case or observational unit. Columns represent characteristics/variables

Case 1
Variable 1
Variable 2

Box Plot example

  • The distance of an observation from its mean is its deviation
    • e.g. if the mean is 5 and the observation is 3, the deviation is 2
  • The average of the square of deviations is called the variance and is denoted by \(s^2\) or \(\sigma^2\)
  • The standard deviation is the square root of the variance and is denoted by \(s\) or \(\sigma\)
  • Usually ~70% of data is within one standard deviation of the mean

  • Robust Statistics

    • The median and IQR are called robust statistics because outliers have little influence on their values
    • Conversely, the mean and standard deviation are more heavily influenced by outliers
  • When data is strongly skewed, we sometimes transform them so they are easier to model
    • Consider a histogram of the population of countries, with an x-axis scaling by 100 million.
      • Almost all data will be in the first bin, and this doesn’t tell us much
      • We could instead take the \(\log\) of the x-axis, allowing us to see much more information about the data
        • Other common transformations are the square root (\(\sqrt{\text{original observation}}\)) or inverse \(\frac{1}{\text{original observation}}\)
    • A transformation is a rescaling of data using a function
    • Common goals of transformation are
      • Seeing the data structure differently
      • Reducing skew
      • Assisting in modelling
      • Straightening a nonlinear relationship in a scatter plot

Intensity maps should be used for geographical data; these are maps that use colour to indicate the value of a variable. They are not very helpful for seeing precise values, but they are very helpful for identifying geographic trends.

Contingency table - Useful for summarizing data for two or more categorical variables.