2.1 Examining numerical data
The formal name for a row is a case or observational unit. Columns represent characteristics, called variables.
| | Variable 1 | Variable 2 |
|---|---|---|
| Case 1 | | |

Box Plot example
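- The deviation of an observation is its signed distance from the mean, i.e. the observation minus the mean
  - e.g. if the mean is 5 and the observation is 3, the deviation is \(3 - 5 = -2\)
- The variance is roughly the average of the squared deviations (for a sample, the sum is divided by \(n - 1\) rather than \(n\)); it is denoted by \(s^2\) for a sample or \(\sigma^2\) for a population
- The standard deviation is the square root of the variance and is denoted by \(s\) (sample) or \(\sigma\) (population)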
- Usually ~70% of data is within one standard deviation of the mean
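The definitions above can be checked by hand on a tiny sample. The following is a minimal sketch, not a reference implementation: the data values and the helper names `sample_variance` and `sample_sd` are made up for illustration, and it assumes the sample version of the variance (dividing by \(n - 1\)).

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]  # observation minus mean
    return sum(d ** 2 for d in deviations) / (n - 1)

def sample_sd(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(sample_variance(values))

# Made-up observations whose mean is 5 (illustrative only).
data = [3, 5, 7, 4, 6]
mean = sum(data) / len(data)
deviations = [x - mean for x in data]   # [-2.0, 0.0, 2.0, -1.0, 1.0]

s2 = sample_variance(data)              # 10 / 4 = 2.5
s = sample_sd(data)

# Fraction of observations within one standard deviation of the mean.
within_one_sd = sum(1 for x in data if abs(x - mean) <= s) / len(data)

print(f"variance = {s2}, sd = {s:.3f}, within one sd = {within_one_sd}")
```

With only five observations the fraction within one standard deviation (here 3 of 5) will not match the ~70% rule of thumb closely; that rule describes what is typical for larger data sets.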
Robust Statistics
- The median and IQR are called robust statistics because outliers have little influence on their values
- Conversely, the mean and standard deviation are more heavily influenced by outliers
- When data is strongly skewed, we sometimes transform it so it is easier to model
  - Consider a histogram of the population of countries, with x-axis bins 100 million wide
  - Almost all of the data will be in the first bin, and this doesn’t tell us much
  - We could instead plot the \(\log\) of each population, which spreads the observations out and lets us see much more information about the data
- Other common transformations are the square root (\(\sqrt{\text{original observation}}\)) and the inverse (\(\frac{1}{\text{original observation}}\))
- A transformation is a rescaling of data using a function
- Common goals of transformation are
  - Seeing the data structure differently
  - Reducing skew
  - Assisting in modelling
  - Straightening a nonlinear relationship in a scatter plot
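The points above on robust statistics and on log transformations can both be seen on one strongly right-skewed sample. This is a rough sketch under stated assumptions: the population figures are made up (in millions), `summarize`, `iqr`, `bin_counts` and `mean_median_gap` are hypothetical helper names, and the quartiles use the default method of Python's `statistics.quantiles`, which may differ slightly from a textbook's convention.

```python
import math
import statistics
from collections import Counter

# Made-up country populations in millions; a few very large values create right skew.
populations = [0.4, 1.2, 3.5, 5.1, 8.7, 10.3, 17.6, 33.0, 67.0, 126.0, 331.0, 1400.0]

def iqr(values):
    """Interquartile range Q3 - Q1, using statistics.quantiles' default method."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def summarize(values):
    """Mean and SD (outlier-sensitive) next to median and IQR (robust)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": iqr(values),
    }

def bin_counts(values, width):
    """Histogram counts keyed by bin index, for bins of the given width."""
    return Counter(math.floor(v / width) for v in values)

def mean_median_gap(summary):
    """Ad hoc skew indicator: (mean - median) / sd. Not a standard coefficient."""
    return (summary["mean"] - summary["median"]) / summary["sd"]

# Robustness: drop the largest observation and compare how much each summary moves.
raw = summarize(populations)
without_largest = summarize(sorted(populations)[:-1])
print("mean moved by  ", abs(raw["mean"] - without_largest["mean"]))
print("median moved by", abs(raw["median"] - without_largest["median"]))

# Transformation: bins 100 million wide versus bins one unit wide on the log10 scale.
logged = [math.log10(p) for p in populations]
raw_bins = bin_counts(populations, 100)   # 9 of the 12 values land in bin 0
log_bins = bin_counts(logged, 1)          # no bin holds more than 4 values
log_summary = summarize(logged)
print("raw bins:", dict(raw_bins))
print("log bins:", dict(log_bins))
print("gap raw vs log:", mean_median_gap(raw), mean_median_gap(log_summary))
```

Removing the single largest value shifts the mean far more than the median, and on the log scale the observations are spread across the bins instead of piling into the first one.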
Intensity maps should be used for geographical data; these are maps that use colour to indicate the value of a variable. They are not very helpful for seeing precise values, but they are very helpful for identifying geographic trends.
Contingency table - Useful for summarizing data for two or more categorical variables.
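As an illustration of a contingency table, the sketch below counts how often each combination of two categorical variables occurs. The records, the variable names (home ownership and application type) and the helper `contingency_table` are all made up for this example.

```python
from collections import Counter

# Hypothetical records: (home ownership, application type). Values are made up.
records = [
    ("rent", "individual"), ("rent", "joint"), ("own", "individual"),
    ("mortgage", "individual"), ("mortgage", "joint"), ("rent", "individual"),
    ("mortgage", "individual"), ("own", "joint"),
]

def contingency_table(pairs):
    """Return (table, row_levels, col_levels), where table[row][col] is a count."""
    counts = Counter(pairs)
    row_levels = sorted({r for r, _ in pairs})
    col_levels = sorted({c for _, c in pairs})
    table = {r: {c: counts[(r, c)] for c in col_levels} for r in row_levels}
    return table, row_levels, col_levels

table, row_levels, col_levels = contingency_table(records)

# Row and column totals (the table's margins).
row_totals = {r: sum(table[r].values()) for r in row_levels}
col_totals = {c: sum(table[r][c] for r in row_levels) for c in col_levels}

# Print the table with its margins.
print(f"{'':10}" + "".join(f"{c:>12}" for c in col_levels) + f"{'total':>8}")
for r in row_levels:
    cells = "".join(f"{table[r][c]:>12}" for c in col_levels)
    print(f"{r:10}{cells}{row_totals[r]:>8}")
print(f"{'total':10}" + "".join(f"{col_totals[c]:>12}" for c in col_levels)
      + f"{len(records):>8}")
```

Each cell counts the cases that share one level of the row variable and one level of the column variable; the margins give the totals for each variable on its own.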