Regression Models

Introduction

While AI, and deep learning techniques in particular, have advanced tremendously, fundamental data analysis methods still provide a good and, in most cases, efficient way of solving many data analysis problems. Linear regression is one of those methods, and it provides at least a good starting point for gaining an informative and insightful understanding of the data. Linear regression models are relatively simple and in most cases do not require significant computing power, which is why they are widely applied in different contexts. The term “regression” towards a mean value of a population was widely promoted by Francis Galton, who also introduced the term “correlation” in modern statistics ^[1] ^[2] ^[3].

Linear regression model

Linear regression is an algorithm that computes the linear relationship between a dependent variable and one or more independent features by fitting a linear equation to observed data. In essence, linear regression builds a linear function, a model that approximates a set of numerical data in a way that minimises the sum of squared errors between the model predictions and the actual data. The data consists of at least one independent variable (usually denoted by x) and the function or dependent variable (usually denoted by y). If there is just one independent variable, the method is known as Simple Linear Regression, while with more than one independent variable it is called Multiple Linear Regression. In the same way, with a single dependent variable it is called Univariate Linear Regression, while with many dependent variables it is known as Multivariate Linear Regression. For illustration, the figure below shows a simple data set that F. Galton used while studying the relationship between parents' and their children's heights. The data set is available at ^[4].

If the fathers' heights are X and their children's heights are Y, the linear regression algorithm looks for a linear function that, in the ideal case, maps every father's height to the corresponding child's height. So, the function would look like the following equation:

$$Y_i = \beta_0 + \beta_1 X_i$$

Where:

Yi – ith child height
Xi – ith father height
β0 and β1 – y-axis intercept and slope coefficients of the linear function, respectively

Unfortunately, in the context of the given example, no such function can fit all x-y pairs at once, since the x and y values differ from pair to pair. However, it is possible to find a linear function that minimises the distance between each given y and the y' produced by the function, or model, over all x-y pairs. Here, y' is an estimated or forecasted y value, and the distance between y and y' in each pair is called an error. Since the error might be positive or negative, the squared error is used to measure it. This means that the following equation might describe the model:

$$Y'_i = \beta'_0 + \beta'_1 X_i$$

where

Y'i – ith child height estimated by the model
Xi – ith father height
β'0 and β'1 – estimates of the y-axis intercept and slope coefficients of the linear function, respectively, which minimise the error term:

$$\sum_{i} (Y_i - Y'_i)^2$$

The estimated beta values might be calculated as follows:

$$\beta'_1 = Cor(X,Y) \cdot \frac{\sigma_y}{\sigma_x}, \qquad \beta'_0 = \mu_y - \beta'_1 \mu_x$$

Where:

Cor(X,Y) – correlation between X and Y (capital letters denote the vectors of the individual x and y values)
σx and σy – standard deviations of vectors X and Y
µx and µy – mean values of the vectors X and Y
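The calculation of the estimates from the correlation, standard deviations and means can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the figures on this page: the height values are made up for the example (they are not taken from Galton's data set), and the function name fit_simple_linear_regression is hypothetical. Population standard deviations are used; the same normalisation appears in the correlation and in the standard deviations, so it cancels out in the slope.

```python
from statistics import mean, pstdev

# Hypothetical heights in inches, invented for illustration (not Galton's data).
fathers = [65.0, 67.0, 68.0, 70.0, 72.0, 74.0]   # X
children = [66.0, 67.5, 67.0, 69.5, 70.0, 72.0]  # Y


def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) estimated from Cor(X,Y), std devs and means."""
    mu_x, mu_y = mean(x), mean(y)
    sigma_x, sigma_y = pstdev(x), pstdev(y)
    # Pearson correlation: covariance divided by the product of the std devs.
    cov_xy = mean([(xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y)])
    cor_xy = cov_xy / (sigma_x * sigma_y)
    slope = cor_xy * sigma_y / sigma_x      # estimate of beta1
    intercept = mu_y - slope * mu_x         # estimate of beta0
    return intercept, slope


beta0, beta1 = fit_simple_linear_regression(fathers, children)
print(f"intercept = {beta0:.3f}, slope = {beta1:.3f}")
```

One consequence of the intercept formula is that the fitted line always passes through the point (µx, µy).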

Most modern data processing packages provide dedicated functions for building linear regression models in a few lines of code. The result is illustrated in the following figure:

Errors and their meaning

As discussed previously, an error in the context of the linear regression model represents the distance between the true value of the dependent variable and the estimate provided by the model, which the following equation might represent:

$$e_i = y_i - y'_i$$

where

y'i – ith child height estimated by the model
yi – ith child's true height
ei – error of the model's ith output

Since the error for a given yi might be positive or negative and the model itself minimises the overall squared error, one might expect the errors to be approximately normally distributed around the model, with a mean value of 0 and a sum close to or equal to 0. Examples of the error for a few randomly selected data points are depicted in red in the following figure:
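The behaviour of the errors can be checked numerically. The sketch below fits a least squares line to a few made-up height pairs (not Galton's data), computes the error of each point as the true value minus the estimate, and sums the errors. The helper name least_squares_line is hypothetical. For a least squares line with an intercept, the sum is zero up to floating-point rounding.

```python
from statistics import mean

# Hypothetical heights in inches, invented for illustration (not Galton's data).
fathers = [65.0, 67.0, 68.0, 70.0, 72.0, 74.0]   # x
children = [66.0, 67.5, 67.0, 69.5, 70.0, 72.0]  # y (true values)


def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line through the points."""
    mu_x, mu_y = mean(x), mean(y)
    s_xy = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mu_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    return mu_y - slope * mu_x, slope


beta0, beta1 = least_squares_line(fathers, children)

# Error of each point: true value minus the model's estimate.
errors = [yi - (beta0 + beta1 * xi) for xi, yi in zip(fathers, children)]

for xi, ei in zip(fathers, errors):
    print(f"father height {xi:.1f}: error {ei:+.3f}")
print(f"sum of errors: {sum(errors):.2e}")
```

Individual errors are both positive and negative, which is why they cancel in the sum and why the squared error is used when fitting the model.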

^[1] Everitt, B. S. (August 12, 2002). The Cambridge Dictionary of Statistics (2 ed.). Cambridge University Press. ISBN 978-0521810999.

^[2] Upton, Graham; Cook, Ian (21 August 2008). Oxford Dictionary of Statistics. Oxford University Press. ISBN 978-0-19-954145-4.

^[3] Stigler, Stephen M. (1997). “Regression toward the mean, historically considered”. Statistical Methods in Medical Research. 6 (2): 103-114. doi:10.1191/096228097676361431. PMID 9261910.

^[4] josephsalmon.eu/enseignement/TELECOM/MDI720/datasets/Galton.txt - Cited on 03.08.2024.

en/iot-reloaded/regression_models.1722687966.txt.gz · Last modified: 2024/08/03 12:26 by agrisnik