Linear Regression

A method for finding the straight line or hyperplane that best fits a set of points

$$y=b+w_1x_1$$

- $y$ - the predicted label
- $b$ - the bias, sometimes referred to as $w_0$
- $w_1$ - the weight of feature 1
- $x_1$ - a feature
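A minimal sketch of this model in Python (the function name `predict` and the example values are illustrative, not from any library):

```python
def predict(x1, b, w1):
    """Linear model: y' = b + w1 * x1."""
    return b + w1 * x1

# e.g. bias 0.5, weight 2.0, feature 3.0 -> 0.5 + 2.0 * 3.0 = 6.5
print(predict(3.0, b=0.5, w1=2.0))  # 6.5
```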

Training a model

Learning good values for all weights and the bias from labelled examples

Loss

The penalty for a bad prediction

Empirical Risk Minimisation

The process of examining many examples and attempting to find a model that minimises loss

Squared Loss ($L_2$ loss)

The square of the difference between the label and the prediction

$$(\text{observation}-\text{prediction}(x))^2$$

$$(y-\hat{y})^2$$

$$MSE=\dfrac{1}{N}\sum_{(x,y)\in D}(y-\text{prediction}(x))^2$$

where:

- $(x,y)$ is an example in which
    - $x$ is the set of features used by the model to make predictions
    - $y$ is the example's label
- $\text{prediction}(x)$ is a function of the weights and bias in combination with the set of features $x$
- $D$ is the dataset containing many labelled examples
- $N$ is the number of examples in $D$
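A minimal Python sketch of MSE, assuming `dataset` is a list of $(x, y)$ pairs and the model is the linear one above (all names are illustrative):

```python
def mse(dataset, b, w1):
    """Mean squared error over a dataset of (x, y) examples."""
    return sum((y - (b + w1 * x)) ** 2 for x, y in dataset) / len(dataset)

data = [(1.0, 2.5), (2.0, 4.5)]  # generated by y = 0.5 + 2x
print(mse(data, b=0.5, w1=2.0))  # 0.0  -- a perfect fit has zero loss
print(mse(data, b=0.0, w1=2.0))  # 0.25 -- each prediction is off by 0.5
```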

Hyperparameters are the configuration settings used to tune how the model is trained
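For example, the knobs used in the sketches below might be collected as (names and values are illustrative):

```python
# Illustrative hyperparameter settings; the values are arbitrary defaults
hyperparameters = {
    "learning_rate": 0.01,  # size of each gradient step
    "batch_size": 32,       # examples per gradient computation
    "epochs": 100,          # passes over the training data
}
```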

The derivative of the loss with respect to the weights and the bias tells us how the loss changes for a given example

So we repeatedly take small steps in the direction that minimises loss; we call these

**Gradient steps**
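A sketch of one gradient step for this model in Python. The gradients of MSE with respect to $b$ and $w_1$ are $\frac{-2}{N}\sum(y-\hat{y})$ and $\frac{-2}{N}\sum x(y-\hat{y})$; the function name and `learning_rate` are illustrative:

```python
def gradient_step(dataset, b, w1, learning_rate):
    """Update b and w1 by one small step against the MSE gradient."""
    n = len(dataset)
    grad_b = sum(-2 * (y - (b + w1 * x)) for x, y in dataset) / n
    grad_w1 = sum(-2 * x * (y - (b + w1 * x)) for x, y in dataset) / n
    return b - learning_rate * grad_b, w1 - learning_rate * grad_w1
```

Calling this repeatedly walks the weights downhill; the learning rate controls the step size.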

For a convex problem, the weights can start anywhere: the loss forms a bowl-shaped graph that looks like $x^2$, with a single minimum

Foreshadowing: not true for neural networks

- More than one minimum
- Strong dependency on initial values

We could compute the gradient over the entire dataset on each step, but this turns out to be unnecessary

Computing the gradient on small samples of the data works well

- **Stochastic Gradient Descent** - one example at a time
- **Mini-batch Gradient Descent** - batches of 10-1000 examples
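A sketch of mini-batch gradient descent in Python, reusing the `gradient_step` sketch above; `epochs`, `batch_size`, and `learning_rate` are illustrative defaults:

```python
import random

def minibatch_gd(dataset, epochs=100, batch_size=32, learning_rate=0.01):
    """Mini-batch gradient descent; batch_size=1 is stochastic GD."""
    b, w1 = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(dataset)                # visit examples in random order
        for i in range(0, len(dataset), batch_size):
            batch = dataset[i:i + batch_size]  # gradient over a small batch only
            b, w1 = gradient_step(batch, b, w1, learning_rate)
    return b, w1
```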

The ideal learning rate in one dimension is

$$\dfrac{1}{f''(x)}$$

The ideal learning rate for 2 or more dimensions is the inverse of the Hessian (matrix of second partial derivatives)
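A worked one-dimensional check, assuming an illustrative quadratic loss $f(w) = a(w - w^*)^2$: then $f''(w) = 2a$, and a single step of size $1/f''(w)$ lands exactly on the minimum.

```python
a, w_star = 3.0, 5.0         # illustrative loss: f(w) = a * (w - w_star)**2
lr = 1 / (2 * a)             # ideal learning rate = 1 / f''(w), and f''(w) = 2a

w = 0.0                      # arbitrary starting point
grad = 2 * a * (w - w_star)  # f'(w)
w -= lr * grad
print(w)                     # 5.0 -- the minimum, reached in one step
```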