2.1 Examining numerical data

The formal name for a row is a case or observational unit. Columns represent characteristics, called variables.

	Case 1
Variable 1
Variable 2

Box Plot example

The deviation of an observation is its distance from the mean, computed as the observation minus the mean (so it carries a sign)
- e.g. if the mean is 5 and the observation is 3, the deviation is \(3 - 5 = -2\)
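The average of the squared deviations is called the variance. The sample variance is denoted by \(s^2\) and divides by \(n - 1\) rather than \(n\); the population variance is denoted by \(\sigma^2\)
The standard deviation is the square root of the variance and is denoted by \(s\) (sample) or \(\sigma\) (population)
Usually about 70% of the data are within one standard deviation of the mean; this is a rule of thumb, not a strict rule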
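
The definitions above can be checked on a small example. This is a minimal sketch using only Python's standard library; the five data values are hypothetical and chosen so the arithmetic is easy to follow, and each deviation is taken as observation minus mean.

```python
import statistics

# Hypothetical sample, chosen only to keep the arithmetic simple.
data = [3, 5, 7, 4, 6]

mean = statistics.mean(data)
# Signed deviations: observation minus mean.
deviations = [x - mean for x in data]

# Sample variance: sum of squared deviations divided by n - 1.
sample_variance = sum(d ** 2 for d in deviations) / (len(data) - 1)
# Standard deviation: square root of the variance.
sample_sd = sample_variance ** 0.5

# Fraction of observations within one standard deviation of the mean.
within_one_sd = sum(abs(d) <= sample_sd for d in deviations) / len(data)

print(mean, deviations, sample_variance, sample_sd, within_one_sd)
```

With so few observations the fraction within one standard deviation need not be close to 70%; the rule of thumb is only a rough guide.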
Robust Statistics
- The median and IQR are called robust statistics because outliers have little influence on their values
- Conversely, the mean and standard deviation are more heavily influenced by outliers
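
To illustrate the difference, the sketch below compares the four summaries before and after one value is replaced by an extreme outlier. The data are hypothetical, and the quartiles come from `statistics.quantiles` with its default method; other quartile conventions can give slightly different IQR values.

```python
import statistics

def summarize(values):
    """Return mean, standard deviation, median and IQR of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default "exclusive" method
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

# Hypothetical data: the second list swaps the largest value for an outlier.
base = [2, 3, 4, 5, 6, 7, 8]
with_outlier = base[:-1] + [80]

before = summarize(base)
after = summarize(with_outlier)

for name in ("mean", "sd", "median", "iqr"):
    print(name, before[name], after[name])
```

In this example the median and IQR do not move at all, while the mean and standard deviation change substantially.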
When data are strongly skewed, we sometimes transform them so they are easier to model
- Consider a histogram of the populations of countries, with x-axis bins 100 million wide.
  - Almost all countries fall in the first bin, which tells us little about the distribution
  - We could instead plot the \(\log\) of each population, which spreads the observations out and reveals much more about the data
    - Other common transformations are the square root (\(\sqrt{\text{original observation}}\)) and the inverse (\(\frac{1}{\text{original observation}}\))
- A transformation is a rescaling of data using a function
- Common goals of transformation are
  - Seeing the data structure differently
  - Reducing skew
  - Assisting in modelling
  - Straightening a nonlinear relationship in a scatter plot
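
The population example can be sketched in a few lines. The figures below are hypothetical round numbers, not real census data, and the base-10 logarithm is an assumption made for readability; any base would show the same effect.

```python
import math
from collections import Counter

# Hypothetical populations (not real figures), strongly right-skewed.
populations = [50_000, 400_000, 2_000_000, 9_000_000,
               30_000_000, 70_000_000, 200_000_000, 1_400_000_000]

# Histogram counts with bins 100 million wide on the original scale.
raw_bins = Counter(p // 100_000_000 for p in populations)

# Histogram counts with bins one unit wide on the log10 scale.
log_bins = Counter(math.floor(math.log10(p)) for p in populations)

print("original scale:", sorted(raw_bins.items()))
print("log10 scale:   ", sorted(log_bins.items()))
```

On the original scale most observations pile into the first bin; after the log transformation they are spread across several bins.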

Intensity maps should be used for geographical data; these are maps that use colour to indicate the value of a variable. They are not very helpful for seeing precise values, but they are very helpful for identifying geographic trends.

Contingency table - Useful for summarizing data for two or more categorical variables.
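
As a rough illustration, a contingency table for two categorical variables can be built by counting each combination of values. The records and the variable names (`homeownership`, `application_type`) below are hypothetical.

```python
from collections import Counter

# Hypothetical records: (homeownership, application_type) for each case.
records = [
    ("rent", "individual"), ("own", "joint"), ("rent", "individual"),
    ("mortgage", "joint"), ("own", "individual"), ("rent", "joint"),
]

counts = Counter(records)
rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})

# table[row][col] is the number of cases with that combination of values.
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
row_totals = {r: sum(table[r].values()) for r in rows}
col_totals = {c: sum(table[r][c] for r in rows) for c in cols}

for r in rows:
    print(r, [table[r][c] for c in cols], row_totals[r])
print("total", [col_totals[c] for c in cols], len(records))
```

Each cell holds a count of cases, and the row and column totals summarize each variable on its own.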