Skip to content

Fitting a line to data

Linear variables can be perfectly modeled with a straight line, but this is rarely the case in the real world.

Linear regression is the statistical method for fitting a line to data where the relationship between two variables can be modeled by a straight line with some error: \(\(\large y = \beta_0 + \beta_1 x + \epsilon\)\) The \(\beta\) values represent the model's parameters, and the error is represented by \(\epsilon\). When we use x to predict y, we usually call x the explanatory or predictor variable and y the response variable.

Residuals are the leftover variation in the data after accounting for the model fit. If an observation is above the regression line, its residual is positive, and if it is below, it is negative.

[!def] Residual: The difference between observed and expected The residual of the \(i^{th}\) observation \((x_i, y_i)\) is the difference of the observed response and the response we would predict based on the model fit \(\(e_i=y_i-\hat{y}_i\)\)

Describing linear relationships with correlation

Correlation always takes values between -1 and 1, and describes the strength of the linear relationship between two variables. Correlation is denoted by \(R\).

Formula for correlation (beyond scope of this course): \(\(R = \frac{1}{n-1} \sum_{i=1}^{n} \frac{x_i-\bar{x}}{s_x} \frac{y_i - \bar{y}}{s_y}\)\) where \(\bar{x}\), \(\bar{y}\) are the sample means and \(s_x\) and \(s_y\) are the standard deviations of each variable.

The correlation is intended to quantify the strength of a linear trend. Nonlinear trends (even when strong) sometimes produce correlations that do not reflect the strength of the relationship.