Linear Regression
Author: Dr. Hannah Volk-Jesussek
What is a Linear Regression Analysis?
Linear regression analysis models the relationship between a dependent variable and one or more independent variables.
In short: it explains or predicts one variable by assuming it changes linearly with another.
The model estimates how much the dependent variable changes for a one-unit change in each independent variable and assesses how well the model fits the data.
Types of Regression
- If there is one independent variable, it is called simple linear regression.
- If there are multiple independent variables, it is called multiple linear regression.
Example: Simple Linear Regression
Does height influence a person's weight?
Example: Multiple Linear Regression
Do height and gender influence a person's weight?
In both examples, weight is the dependent variable (the variable being predicted); height and gender are the independent variables (the predictors).
Note: In linear regression, the dependent variable must be metric (interval or ratio scaled). If it is categorical (nominal or ordinal), logistic regression is used instead.
Simple Linear Regression
The goal of simple linear regression is to predict the value of a dependent variable based on a single independent variable.
- The stronger the linear relationship between the two variables, the more accurate the prediction.
- The strength of this relationship determines how much of the variance in the dependent variable can be explained by the independent variable.
The relationship between the variables can be illustrated in a scatter plot.
- A strong linear relationship is indicated when data points closely align along a straight line.
- A weak relationship is indicated when data points are more widely scattered.
- To determine that line, linear regression uses the method of least squares.
Calculation of a Simple Linear Regression
The regression line can be described by the following equation:

y-hat = a + b · x

Definition of the regression coefficients:
- a: point of intersection with the y-axis (y-intercept)
- b: gradient of the straight line (slope)
- y-hat: estimated value of the dependent variable
In short: By using the equation above, for each x value, the corresponding y value is estimated.
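The estimation can be sketched in a few lines of Python. The formulas for a and b below are the standard least-squares estimates (derived in the next sections); the data points are made up for illustration.

```python
# Simple linear regression "by hand": the least-squares estimates are
# b = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2) and
# a = y_mean - b * x_mean.
def simple_ols(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
    a = mean_y - b * mean_x
    return a, b

# Points that lie exactly on the line y = 1 + 2x are recovered exactly.
a, b = simple_ols([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```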
Example Simple Linear Regression
In our example the height of people is used to estimate their weight.
- Independent variable: height
- Dependent variable: weight
- "Perfect" estimation: All points (measured values) lay exactly on a single straight line.
- In practice: This is almost never the case.
- Therefore: Straight line must be found, that lies as close as possible to the data points.
- Goal: Keep estimation error (epsilon) as small as possible.
Estimation error
- The goal is to keep the estimation error as small as possible.
- This means that the distance between the estimated value and the true value should be as small as possible.
- This distance (error) is called the residual.
- The residual is abbreviated as e (error) and can be represented by the Greek letter ε.
When calculating the regression line, the regression coefficients (a and b) are determined so that the sum of the squared residuals is minimal. (Method of Ordinary Least Squares)
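The least-squares criterion can be checked directly: the fitted line has a smaller sum of squared residuals than any other line. A small sketch with made-up height and weight values:

```python
# Sum of squared residuals (SSR) for a candidate line y = a + b*x.
x = [1.60, 1.65, 1.70, 1.80, 1.90]   # made-up heights in m
y = [55, 62, 66, 74, 85]             # made-up weights in kg

def ssr(a, b):
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Ordinary least squares estimates for a and b.
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

# Moving away from (a, b) can only increase the SSR.
assert ssr(a, b) <= ssr(a + 1, b)
assert ssr(a, b) <= ssr(a, b + 1)
```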
Regression coefficient b
The sign of the regression coefficient b indicates the direction of the relationship:
- b > 0: there is a positive correlation between x and y (the greater x, the greater y)
- b < 0: there is a negative correlation between x and y (the greater x, the smaller y)
- b = 0: there is no correlation between x and y
Standardized regression coefficients are labeled beta. These values are comparable across predictors because the units are removed.
Multiple Linear Regression
Unlike simple linear regression, multiple linear regression allows more than one independent variable to be considered.
The goal is to estimate a variable based on several other variables. Like in simple linear regression, the variable to be estimated is called the dependent variable (criterion). The variables that are used for the prediction are called independent variables (predictors).
Multiple linear regression is used in many different fields, such as market research, health & medicine, economics & finance, or data science & machine learning.
In all areas it is of interest to find out what influence different factors have on a variable.
Marketing example:
For a video streaming service, you want to predict how many times a month a person streams videos. For this you get a record of user data (age, income, gender, ...).
Medical example:
You want to find out which factors have an influence on the cholesterol level of patients. For this purpose, you analyze a patient data set with cholesterol level, age, hours of sport per week and so on.
Calculation Multiple Linear Regression
The equation for multiple regression with k independent variables is:

y-hat = a + b1 · x1 + b2 · x2 + … + bk · xk

where:
- y-hat is the estimated value of the dependent (response) variable
- x1, x2, …, xk are the independent (explanatory) variables
- b1, b2, …, bk are the regression coefficients
- a is the intersection with the y-axis (y-intercept)
The coefficients can be interpreted similarly to those of simple linear regression.
If all independent variables are 0, the resulting value is a.
If the independent variable xi increases by one unit, the dependent variable y changes by bi, holding all other independent variables constant.
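How this works can be sketched with NumPy's least-squares solver; the data below are constructed so that the true coefficients a = 1, b1 = 2, b2 = 3 are known in advance.

```python
import numpy as np

x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 1 + 2 * x1 + 3 * x2                  # exact linear relationship

# Design matrix: a column of ones for the intercept a, then x1 and x2.
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [1. 2. 3.]
```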
Multiple Regression vs. Multivariate Regression
Multiple regression should not be confused with multivariate regression. Multiple regression examines the effect of several independent variables on one dependent variable, whereas multivariate regression analyzes several dependent variables simultaneously.
Coefficient of Determination R2
To evaluate how well the regression model can predict or explain the dependent variable, two main measures are used: the coefficient of determination R2 and the standard estimation error. The coefficient of determination R2, also known as explained variance, indicates how much of the variance in the dependent variable is explained by the independent variables. The more variance that can be explained, the better the model. To calculate R2, the variance of the estimated values is related to the variance of the observed values:

R2 = variance of the estimated values / variance of the observed values
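The calculation can be sketched on a tiny made-up data set; for OLS with an intercept, the explained-variance ratio equals 1 − SSR/SST, which is the form used below.

```python
x = [0, 1, 2, 3]
y = [0, 1, 2, 4]   # made-up values

# Fit the least-squares line.
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
y_hat = [a + b * xi for xi in x]

# R^2 = 1 - SSR/SST (equivalent to explained/total variance for OLS
# with an intercept).
ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained
sst = sum((yi - my) ** 2 for yi in y)                   # total
r2 = 1 - ssr / sst
print(round(r2, 3))  # 0.966
```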
Adjusted R2
The coefficient of determination R2 is influenced by the number of independent variables used. The more independent variables are included in the model, the greater the explained variance R2. To take this into account, the adjusted R2 is used.
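The adjustment can be sketched as a one-line function (n observations, k predictors); the R2 of 0.754 below is the value from the worked example later in this article.

```python
def adjusted_r2(r2, n, k):
    # R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With the same R2 and n = 10, the adjusted value drops as more
# predictors are added.
print(round(adjusted_r2(0.754, n=10, k=3), 3))  # 0.631
print(round(adjusted_r2(0.754, n=10, k=5), 3))
```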
Standard estimation error
The standard estimation error is the standard deviation of the estimation error. This gives an impression of how much the prediction differs from the correct value. Graphically interpreted, the standard estimation error is the dispersion of the observed values around the regression line.
The coefficient of determination and the standard estimation error are used for simple and multiple linear regression.
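A minimal sketch of the standard estimation error, using made-up residuals from a simple regression (k = 1):

```python
import math

residuals = [0.2, -0.1, -0.4, 0.3]   # observed minus predicted values (made up)
n, k = len(residuals), 1             # 4 observations, 1 predictor

# Standard error of the estimate: sqrt(SSR / (n - k - 1)).
ssr = sum(e ** 2 for e in residuals)
se = math.sqrt(ssr / (n - k - 1))
print(round(se, 3))  # 0.387
```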
Standardized and unstandardized regression coefficient
Regression coefficients can be reported in unstandardized or standardized form. The unstandardized coefficients are the ones used in the regression equation and are abbreviated b.
The standardized regression coefficients are obtained by multiplying the regression coefficient bi by the standard deviation of the corresponding independent variable Sxi and dividing by the standard deviation of the dependent variable Sy: beta_i = bi · Sxi / Sy.
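The relationship can be verified numerically: multiplying b by the ratio of standard deviations gives the same value as refitting the regression on z-standardized variables. The data here are simulated purely for the check.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=50)
y = 3.0 * x + rng.normal(size=50)            # simulated data

b = np.polyfit(x, y, 1)[0]                   # unstandardized slope
beta = b * x.std(ddof=1) / y.std(ddof=1)     # beta = b * s_x / s_y

# Same value from regressing z-scores on z-scores:
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta_z = np.polyfit(zx, zy, 1)[0]
print(np.isclose(beta, beta_z))  # True
```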
Assumptions of Linear Regression
In order to interpret the results of the regression analysis meaningfully, certain conditions must be met.
- Linearity: There must be a linear relationship between the dependent and independent variables.
- Homoscedasticity: The residuals must have a constant variance.
- Normality: Errors are normally distributed.
- No multicollinearity: No high correlation between the independent variables
- No autocorrelation: Errors should be independent.
Linearity
In linear regression, a straight line is drawn through the data. This line should represent all points as well as possible. If the points are distributed in a non-linear way, a straight line cannot fulfill this task.
In the upper left graph there is a linear relationship between the dependent and independent variable, so the regression line is meaningful. In the right graph the relationship is clearly non-linear, so fitting a straight line is not meaningful. In that case, the coefficients cannot be interpreted reliably and prediction errors can be larger than expected.
Therefore it is important to check beforehand whether a linear relationship between the dependent variable and each independent variable exists. This is usually checked graphically.
Homoscedasticity
Since in practice the regression model never predicts the dependent variable exactly, there is always an error. This error must have a constant variance over the predicted range.
To test homoscedasticity, i.e. the constant variance of the residuals, the predicted values are plotted on the x-axis and the residuals on the y-axis. The residuals should scatter evenly over the entire range; if they do, homoscedasticity is present. If not, heteroscedasticity is present: the error has different variances depending on the value range of the dependent variable.
Normal distribution of the error
The next requirement of linear regression is that the error epsilon must be normally distributed. This can be checked in two ways: analytically or graphically. For the analytical check, you can use either the Kolmogorov-Smirnov test or the Shapiro-Wilk test. If the p-value is greater than 0.05, there is no evidence that the data deviate from a normal distribution, and normality can be assumed.
However, these analytical tests are used less and less: they have little power to detect deviations in small samples, and in large samples they become significant very quickly, rejecting the null hypothesis that the data are normally distributed. Therefore, the graphical approach is increasingly used.
In the graphical approach, you can inspect a histogram or, even better, a QQ-plot (quantile-quantile plot). The more the data lie on the line, the closer they are to a normal distribution.
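Both checks can be run with SciPy; the residuals here are simulated from a normal distribution purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(size=100)     # simulated residuals

# Analytical check: Shapiro-Wilk test (values above 0.05 indicate
# no evidence against normality).
stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p = {p:.3f}")

# Graphical check: QQ-plot coordinates (theoretical normal quantiles
# vs. sorted residuals); these are the points that should lie on a line.
(osm, osr), _ = stats.probplot(residuals, dist="norm")
```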
Multicollinearity
Multicollinearity means that two or more independent variables are strongly correlated with one another. The problem with multicollinearity is that the effects of each independent variable cannot be clearly separated from one another.
If, for example, there is a high correlation between x1 and x2, it becomes difficult to determine b1 and b2. If the two variables were completely identical, the model could not tell how large b1 and b2 should be, and the estimates become unstable.
This is not critical if the model is only used for prediction, where you care about the prediction rather than the individual effects. However, if the model is used to measure the influence of the independent variables on the dependent variable, multicollinearity makes the coefficients difficult to interpret.
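A common diagnostic is the variance inflation factor (VIF): each predictor is regressed on all the others, and VIF_j = 1 / (1 − R2_j); values well above roughly 10 are usually taken as a warning sign. A NumPy sketch with simulated predictors:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X."""
    n, k = X.shape
    out = []
    for j in range(k):
        # Regress column j on all other columns (plus an intercept).
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                    # independent of x1
x3 = x1 + rng.normal(scale=0.01, size=200)   # nearly a copy of x1
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # x1 and x3 get huge VIFs, x2 stays near 1
```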
Significance test and Regression
Regression analysis is often carried out in order to make statements about the population based on a sample; the regression coefficients are therefore calculated from sample data. To rule out the possibility that the coefficients are merely random and would take completely different values in another sample, the results are checked with a significance test. This test takes place at two levels:
- Significance test for the whole regression model
- Significance test for the regression coefficients
It should be noted, however, that the assumptions in the previous section must be met.
Significance test for the whole regression model
Here it is checked whether the coefficient of determination R2 in the population differs from zero. The null hypothesis is therefore that R2 in the population is zero. To test this null hypothesis, the following F statistic is calculated:

F = (R2 / k) / ((1 - R2) / (n - k - 1))
The calculated F-value is then compared with the critical F-value. If the calculated F-value is greater than the critical F-value, the null hypothesis is rejected and R2 deviates from zero in the population. The critical F-value can be read from the F-distribution table; the numerator degrees of freedom are k and the denominator degrees of freedom are n - k - 1.
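A small check of this test with SciPy, using the R2 of 0.754 and the n = 10, k = 3 setup from the worked example later in this article:

```python
from scipy import stats

r2, n, k = 0.754, 10, 3
f_value = (r2 / k) / ((1 - r2) / (n - k - 1))
f_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)   # critical value at alpha = 0.05

print(round(f_value, 2))   # 6.13
print(f_value > f_crit)    # True: reject H0, R2 differs from zero
```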
Significance test for the regression coefficients
The next step is to check which variables contribute significantly to the prediction of the dependent variable. This is done by testing whether the slopes (regression coefficients) differ from zero in the population. The following test statistic is calculated:

t = bj / s_bj

where bj is the jth regression coefficient and s_bj is the standard error of bj. This test statistic is t-distributed with n - k - 1 degrees of freedom. The critical t-value can be read from the t-distribution table.
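The coefficient test can be sketched with SciPy. Note that both the coefficient and its standard error below are hypothetical values chosen for illustration.

```python
from scipy import stats

b_j, se_bj = 0.297, 0.119    # hypothetical coefficient and its standard error
n, k = 10, 3                 # 10 observations, 3 predictors

t_value = b_j / se_bj
# Two-sided p-value from the t distribution with n - k - 1 df.
p_value = 2 * stats.t.sf(abs(t_value), df=n - k - 1)
print(round(t_value, 2), round(p_value, 3))
```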
Calculate with numiqo
As an example of linear regression, a model is set up that predicts the body weight of a person. The dependent variable is thus the body weight, while height, age, and gender are chosen as independent variables. The following example data set is available:
| Weight | Height | Age | Gender |
|---|---|---|---|
| 79 | 1.80 | 35 | Male |
| 69 | 1.68 | 39 | Male |
| 73 | 1.82 | 25 | Male |
| 95 | 1.70 | 60 | Male |
| 82 | 1.87 | 27 | Male |
| 55 | 1.55 | 18 | Female |
| 69 | 1.50 | 89 | Female |
| 71 | 1.78 | 42 | Female |
| 64 | 1.67 | 16 | Female |
| 69 | 1.64 | 52 | Female |
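The same model can also be fitted outside the calculator, e.g. with NumPy (gender dummy-coded as is_male: 1 = male, 0 = female); the resulting coefficients can then be compared with the calculator's output.

```python
import numpy as np

# Data from the table above.
weight = np.array([79, 69, 73, 95, 82, 55, 69, 71, 64, 69], dtype=float)
height = np.array([1.80, 1.68, 1.82, 1.70, 1.87, 1.55, 1.50, 1.78, 1.67, 1.64])
age = np.array([35, 39, 25, 60, 27, 18, 89, 42, 16, 52], dtype=float)
is_male = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)

X = np.column_stack([np.ones(len(weight)), height, age, is_male])
coef, *_ = np.linalg.lstsq(X, weight, rcond=None)

y_hat = X @ coef
r2 = 1 - ((weight - y_hat) ** 2).sum() / ((weight - weight.mean()) ** 2).sum()
print("coefficients (intercept, height, age, is_male):", coef)
print("R2 =", round(r2, 3))
```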
After you have copied your data into the statistics calculator, you must select the variables that are relevant for you. Then you receive the results in table form.
Interpretation of the results
This table shows that 75.4% of the variation in weight can be explained by height, age, and gender. When predicting a person's weight, the model is off by about 6.587 kg on average; this is the standard error of the estimate.
Weight = 47.379 · Height + 0.297 · Age + 8.922 · is_male - 24.41
The equation shows, for example, that if age increases by one year, weight increases by 0.297 kg according to the model. For the dichotomous variable gender, the coefficient is interpreted as a difference: according to the model, a man weighs 8.922 kg more than a woman. If all independent variables are zero, the result is the intercept of -24.41 kg, a purely mathematical value with no practical interpretation here.
The standardized coefficients beta are unit-free and typically lie between -1 and +1. The larger the absolute value of beta, the greater the contribution of that independent variable to explaining the dependent variable. In this regression analysis, age has the greatest influence on weight.
The calculated coefficients refer to the sample used in the regression analysis, so it is of interest whether the b-values deviate from zero only by chance or whether they also differ from zero in the population. For this purpose, the null hypothesis is formulated that the respective b-value is zero in the population. If this is true, the respective independent variable has no influence on the dependent variable.
The p-value indicates whether a variable has a significant influence. P-values smaller than 0.05 are considered significant. In this example, only age can be considered a significant predictor of a person's weight.
Presenting the results of the regression
When presenting your results, you should include the estimated effect, that is, the regression coefficient, the standard error of the estimate, and the p-value. Of course, it is also useful to interpret the regression results so that everyone knows what the regression coefficients mean.
For example: a significant relationship (p = .041) was found between a person's weight and their age.
If a simple linear regression was calculated, the result can also be displayed using a scatter plot.
Lasso and Ridge Regression
Lasso regression adds an L₁ penalty that can shrink some coefficients exactly to zero. Use it for automatic feature selection and to prevent overfitting. You can use the Lasso Regression Calculator.
Ridge regression applies an L₂ penalty that shrinks all coefficients towards zero without eliminating any of them. It keeps all features but reduces their impact to improve stability. You can use the Ridge Regression Calculator.
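Under the hood, ridge has a simple closed form; a minimal NumPy sketch on centered data (no intercept term), with the penalty value chosen arbitrarily:

```python
import numpy as np

def ridge(X, y, lam):
    # Closed-form ridge solution: b = (X'X + lam * I)^-1 X'y.
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))
X -= X.mean(axis=0)                       # center predictors
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=50)
y -= y.mean()                             # center response

b_ols = ridge(X, y, lam=0.0)              # lam = 0 reduces to OLS
b_ridge = ridge(X, y, lam=10.0)           # larger lam shrinks coefficients
print(np.linalg.norm(b_ridge) < np.linalg.norm(b_ols))  # True
```

Lasso has no closed-form solution; in practice both models are usually fitted with a library such as scikit-learn.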