Linear Regression, Training and Loss

Software Methodologies


Linear regression

Linear Regression
A method for finding the straight line or hyperplane that best fits a set of points
$$\hat{y}=b+w_1x_1$$
  • $\hat{y}$ - the predicted label
  • b - the bias, sometimes referred to as $w_0$
  • $w_1$ - the weight of feature 1
  • $x_1$ - a feature
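
As a rough illustration (not from the source), the sketch below evaluates this one-feature model in Python; the bias and weight values are arbitrary, chosen only for the example.

```python
# Minimal sketch: a one-feature linear model y_hat = b + w1 * x1.
# The values of b and w1 are illustrative assumptions, not learned values.

def predict(x1, b=0.5, w1=2.0):
    """Return the predicted label for a single feature value x1."""
    return b + w1 * x1

print(predict(3.0))  # 0.5 + 2.0 * 3.0 = 6.5
```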

Training and loss

Training a model
Learning good values for all weights and the bias from labelled examples
Loss
The penalty for a bad prediction
Empirical Risk Minimisation
The process of examining many examples and attempting to find a model that minimises loss

Squared loss

The square of the difference between the label and the prediction

$$(\text{observation}-\text{prediction}(x))^2$$
$$(y-\hat{y})^2$$

Mean square error

$$MSE=\dfrac{1}{N}\sum_{(x,y)\in D}(y-\text{prediction}(x))^2$$

(x,y) is an example where

  • x is the set of features used by the model to make predictions

  • y is the example’s label

prediction(x) is a function of the weights and bias in combination with the set of features x

D is the dataset containing many labelled examples

N is the number of examples in D
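
The sketch below is one way the mean square error could be computed in Python; the dataset, bias, and weight values are made up for illustration.

```python
# Minimal sketch of mean square error over a labelled dataset D.
# The example (x, y) pairs and parameter values are illustrative assumptions.

def predict(x, b, w1):
    return b + w1 * x

def mse(dataset, b, w1):
    """Mean of squared (y - prediction(x)) over all N examples in D."""
    return sum((y - predict(x, b, w1)) ** 2 for x, y in dataset) / len(dataset)

D = [(1.0, 2.9), (2.0, 5.1), (3.0, 7.2)]  # (x, y) examples
print(mse(D, b=1.0, w1=2.0))  # 0.02
```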

Reducing loss

  • Hyperparameters are the configuration settings used to tune how the model is trained

  • The derivative of the loss with respect to the weights and bias tells us how the loss changes for a given example

  • So we repeatedly take small steps in the direction that minimises loss; we call these gradient steps (see the sketch below)
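
A minimal sketch of a single gradient step on the squared loss for one example; the learning rate, starting parameters, and data point are illustrative assumptions.

```python
# One gradient step on the squared loss (y - (b + w1*x))**2 for one example.

def gradient_step(x, y, b, w1, learning_rate=0.01):
    """Update b and w1 by one small step against the loss gradient."""
    y_hat = b + w1 * x
    error = y - y_hat
    # Partial derivatives of (y - (b + w1*x))**2 with respect to b and w1.
    grad_b = -2.0 * error
    grad_w1 = -2.0 * error * x
    return b - learning_rate * grad_b, w1 - learning_rate * grad_w1

b, w1 = 0.0, 0.0
for _ in range(100):
    b, w1 = gradient_step(x=2.0, y=5.0, b=b, w1=w1)
print(b, w1)  # moves towards values where b + 2*w1 is close to 5
```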

Gradient steps

Weight initialisation

For a convex problem, the weights can start anywhere: the loss surface is bowl-shaped, like the graph of $x^2$, with a single minimum

Foreshadowing: not true for neural networks

  • More than one minimum

  • Strong dependency on initial values

Efficiency of reducing loss

  • We could compute the gradient over the entire dataset on each step, but this turns out to be unnecessary

  • Computing the gradient on small samples of the data works well

  • Stochastic Gradient Descent - one example at a time

  • Mini-batch Gradient Descent - batches of 10 to 1,000 examples per step (see the sketch below)
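
The sketch below is one possible mini-batch loop on the same squared loss; setting the batch size to 1 recovers stochastic gradient descent. The synthetic dataset and hyperparameter values are assumptions chosen only for illustration.

```python
# Minimal sketch of mini-batch gradient descent on squared loss.
# batch_size = 1 gives stochastic gradient descent; larger values give mini-batch.
import random

def minibatch_step(batch, b, w1, learning_rate):
    """Average the squared-loss gradient over one batch and take one step."""
    grad_b = grad_w1 = 0.0
    for x, y in batch:
        error = y - (b + w1 * x)
        grad_b += -2.0 * error
        grad_w1 += -2.0 * error * x
    n = len(batch)
    return b - learning_rate * grad_b / n, w1 - learning_rate * grad_w1 / n

# Synthetic data drawn from y = 1 + 2x plus a little noise (an assumption).
D = [(x, 1.0 + 2.0 * x + random.gauss(0, 0.05))
     for x in (random.uniform(0, 1) for _ in range(200))]

b, w1, batch_size = 0.0, 0.0, 32
for _ in range(2000):
    batch = random.sample(D, batch_size)
    b, w1 = minibatch_step(batch, b, w1, learning_rate=0.1)
print(b, w1)  # should drift towards b close to 1.0 and w1 close to 2.0
```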

Learning rate

The ideal learning rate in one dimension is the inverse of the second derivative of $f(x)$ at $x$:

$$\dfrac{1}{f''(x)}$$

The ideal learning rate for 2 or more dimensions is the inverse of the Hessian (matrix of second partial derivatives)
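
As a hedged illustration of the one-dimensional case: for a quadratic loss, stepping with a learning rate of $1/f''(x)$ lands on the minimum in a single step. The function and starting point below are made up for the example.

```python
# For f(x) = (x - 3)**2, f''(x) = 2 everywhere, so the ideal learning rate is 1/2.

def f_prime(x):
    return 2.0 * (x - 3.0)   # derivative of (x - 3)**2

x = 10.0
ideal_lr = 1.0 / 2.0         # 1 / f''(x)
x = x - ideal_lr * f_prime(x)
print(x)                     # 3.0, the minimum, reached in one step
```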