# Linear regression and correlation

## Introduction

Learning objectives: You will learn about the concepts of correlation, linear regression, and multiple regression.

This section looks at the relationship between two continuous variables. Usually one variable is the causal or input variable, and the other is the outcome variable. We may want to know whether the two variables are related and to predict one from the other; for this we use linear regression. If we have two or more predictor variables, we use multiple linear regression. If we are interested in the strength of the relationship, we measure it using a correlation coefficient.

## Resource text

### Correlation coefficient

The correlation coefficient, r, measures the strength of the linear association between two continuous variables, i.e. how closely the points lie along the regression line. It lies between -1 and +1:

• if r = 1 or -1 it is a perfect linear relationship
• if r = 0 there is no linear relationship between x & y

When calculated from the observed data, it is commonly known as Pearson's correlation coefficient (after K Pearson, who first defined it). When calculated from the ranks of the data instead, it is known as Spearman's rank correlation.
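The distinction between the two coefficients can be illustrated with a minimal sketch. The function names and the data below are made up for illustration; the only assumption is the usual convention that tied values share their average rank.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Made-up data: y = x**3 increases with x, but not in a straight line
x = [1, 2, 3, 4, 5]
y = [xi ** 3 for xi in x]
r = pearson_r(x, y)
rho = spearman_rho(x, y)
```

For these data the rank correlation is exactly 1, because the ordering of y matches the ordering of x, while Pearson's r is about 0.94, because the points do not lie on a straight line.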

Conventionally:

• |r| ≥ 0.8: very strong relationship
• 0.6 ≤ |r| < 0.8: strong relationship
• 0.4 ≤ |r| < 0.6: moderate relationship
• 0.2 ≤ |r| < 0.4: weak relationship
• |r| < 0.2: very weak relationship

Note, however, that the statistical significance depends on the sample size (see below).

You can test whether r is statistically significantly different from zero. The larger the sample, the smaller the value of r that reaches significance. For example, at the conventional two-sided 5% level, with n=10 pairs r is significant if its absolute value is greater than 0.63; with n=100 pairs, r is significant if its absolute value is greater than 0.20.
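The thresholds quoted above can be reproduced with a short sketch. It assumes the standard t test for a correlation coefficient, with n - 2 degrees of freedom, and takes the two-sided 5% critical t values from standard t tables rather than computing them; the function names are illustrative.

```python
import math

def t_statistic_for_r(r, n):
    """t statistic (n - 2 degrees of freedom) for testing whether r differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def critical_r(t_crit, n):
    """Smallest |r| that reaches significance, given the critical t value for n - 2 df."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)

# Two-sided 5% critical t values from standard tables (assumed, not computed here):
# 2.306 for 8 degrees of freedom, about 1.984 for 98 degrees of freedom.
r_crit_10 = critical_r(2.306, 10)     # roughly 0.63
r_crit_100 = critical_r(1.984, 100)   # roughly 0.20
```

The same value of r therefore gives a larger t statistic, and a smaller p-value, as n grows.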

Important points:

• Only measures linear association. A U-shaped relationship may have a correlation of zero
• Is symmetric in x and y: the correlation of (x and y) is the same as the correlation of (y and x)
• A significant correlation between two variables does not necessarily mean they are causally related
• For large samples very weak relationships can be detected
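The first two points can be checked on made-up data. In this sketch y is completely determined by x, yet the correlation is zero because the relationship is U-shaped rather than linear; the data and names are purely illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# U-shaped relationship: y depends entirely on x, but not linearly
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]
r_u_shaped = pearson_r(x, y)

# Symmetry: swapping the two variables leaves r unchanged
u = [1, 2, 3, 4, 5]
v = [2, 1, 4, 3, 5]
r_uv = pearson_r(u, v)
r_vu = pearson_r(v, u)
```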

Video 1: A video giving an introduction to correlation. (This video footage is taken from an external site. The content is optional and not necessary to answer the questions.)

### Regression

Linear Regression

This technique describes the relationship between two variables, where one variable (the dependent variable, denoted by y) is expected to change as the other (the independent, explanatory or predictor variable, denoted by x) changes.

Linear regression is the statistical technique of fitting a straight line to data, where the regression line is:

y = a + bx

where a is the constant (y intercept) and b is the gradient (regression coefficient).

The value y is the predicted value and the difference between y and the observed value is the error.

The model is fitted by choosing a and b to minimize the sum of the squares of the prediction errors (method of least squares). The method produces an estimate for b, together with a standard error and confidence interval. From this, we can test the statistical significance of b.
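The least squares fit can be sketched in a few lines using the usual closed-form solution. The function name and the data are made up, and the standard error formula assumes the model conditions listed in the important points below (constant error variance, independent observations).

```python
import math

def fit_line(x, y):
    """Least-squares estimates a, b for y = a + bx, plus the standard error of b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                      # gradient
    a = mean_y - b * mean_x            # y intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    rss = sum(e ** 2 for e in residuals)       # the quantity being minimised
    se_b = math.sqrt(rss / (n - 2) / sxx)      # standard error of b
    return a, b, se_b

# Made-up data lying close to the line y = 2x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b, se_b = fit_line(x, y)

# A 95% confidence interval is b +/- t * se_b, where t is the two-sided 5%
# critical value with n - 2 degrees of freedom (3.182 for 3 df, from t tables).
ci_low, ci_high = b - 3.182 * se_b, b + 3.182 * se_b
```

Dividing b by its standard error gives the t statistic used to test whether b differs from zero.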

The regression coefficient (b) tells us that for a unit change in x (the explanatory variable), y (the response variable) changes by an average of b units.

Important points:

• Relationship is assumed linear, which means that as x increases by a unit amount, y changes by a fixed amount b, irrespective of the initial value of x
• The variability of the error is assumed to be constant
• The error term is normally distributed with mean zero
• The observations are independent (for example they do not arise by sampling some subjects repeatedly)
• Unlike correlation, the relationship is not symmetric, so one would get a different equation if one exchanged the dependent and independent variables, unless all the observations fell on a perfect straight line
• The significance test for b yields the same p-value as the significance test for the correlation coefficient r
• A statistically significant regression coefficient does not imply a causal relationship between y and x
• Only the error term is assumed to be Normally distributed, not the dependent variable itself: the distribution of y also depends on the distribution of the independent variable, which can take any form
• There is no requirement for the independent variable to be Normally distributed (e.g. it could be just 0/1)
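Two of the points above can be checked numerically: exchanging the variables gives a different line, and the t statistic for b equals the t statistic for r. This is a sketch on made-up data with illustrative function names, using the standard formulas for the slope, its standard error and the test of r.

```python
import math

def sums_of_squares(x, y):
    """Return n and the corrected sums Sxx, Syy, Sxy."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return n, sxx, syy, sxy

def slope_and_t(x, y):
    """Slope of the regression of y on x, and its t statistic b / SE(b)."""
    n, sxx, syy, sxy = sums_of_squares(x, y)
    b = sxy / sxx
    rss = syy - b * sxy                       # residual sum of squares
    se_b = math.sqrt(rss / (n - 2) / sxx)
    return b, b / se_b

def r_and_t(x, y):
    """Pearson's r and its t statistic, both with n - 2 degrees of freedom."""
    n, sxx, syy, sxy = sums_of_squares(x, y)
    r = sxy / math.sqrt(sxx * syy)
    return r, r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Made-up data
x = [1, 2, 3, 4, 5, 6]
y = [1.2, 1.9, 3.6, 3.8, 5.4, 5.6]

b_yx, t_b = slope_and_t(x, y)        # regression of y on x
b_xy, t_b_swapped = slope_and_t(y, x)  # regression of x on y
r, t_r = r_and_t(x, y)
```

Here the line for x on y, rearranged to give y in terms of x, has gradient 1 / b_xy, which differs from b_yx. The product b_yx * b_xy equals r squared, so the two lines coincide only when r is +1 or -1.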

### Multiple linear regression

This allows analysis of several explanatory variables with one response variable. The formula is:

Y = a + b1x1 + b2x2 + ... + bkxk

where Y is the predicted value. The main use for multiple regression is to adjust for confounding.

The observed outcome y is assumed to be continuous and the x variables are either continuous or binary. The coefficients b1, b2..., bk are again chosen to minimise the sum of squares of the difference y-Y.

When x1 is a categorical variable such as treatment group and x2 is a continuous variable such as age (a potential confounder) this is known as analysis of covariance.
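As a rough sketch of how such a model is fitted, the least-squares coefficients can be found by solving the normal equations. The data below are hypothetical and noise-free (a continuous age and a 0/1 group indicator), and the helper names are made up; in practice a statistical package would be used and would also report standard errors.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution

def fit_multiple(xs, y):
    """Least-squares coefficients [a, b1, ..., bk] for rows xs = [[x1, ..., xk], ...]."""
    design = [[1.0] + list(row) for row in xs]   # leading 1 gives the constant a
    k = len(design[0])
    # Normal equations: (X'X) coefficients = X'y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data: x1 = age (continuous), x2 = group (0 = control, 1 = treated)
xs = [[60, 0], [65, 1], [70, 0], [75, 1], [80, 0], [85, 1]]
y = [10 + 0.2 * age - 3 * group for age, group in xs]   # noise-free outcome
a, b1, b2 = fit_multiple(xs, y)
```

Because the outcome was generated without noise, the fit recovers the generating values (a = 10, b1 = 0.2, b2 = -3) up to rounding error. The coefficient b2 is the difference between groups at any fixed age, which is what adjusting for a covariate means.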

For further details see Campbell MJ (2006) Chapter 2.

Example

Consider the results of Llewellyn-Jones et al. (BMJ 1999), parts of which are given in table 1. This study was a randomised controlled trial of the effectiveness of a shared care intervention for depression in 220 subjects over the age of 65. Depression was measured using the Geriatric Depression Scale, taken at baseline and after 9.5 months of blinded follow-up. Here y is the depression scale after 9.5 months of treatment (continuous), x1 is the value of the same scale at baseline and x2 is the group variable, taking the value 1 for intervention and 0 for control. The purpose of this analysis is to examine the effect of a shared care intervention allowing for baseline depression score.

One can see from Table 1 that the baseline values are highly correlated with the follow-up values of the score. On average, patients in the intervention group scored 1.87 units lower (95% CI 0.76 to 2.97) than those in the control group, throughout the range of the baseline values. This difference was highly statistically significant (p=0.0011).

Table 1: Factors affecting Geriatric Depression Scale score at follow up.

| Variable | Regression coefficient (95% CI) | P value |
|---|---|---|
| Baseline score | 0.73 (0.56 to 0.91) | |
| Treatment group | -1.87 (-2.97 to -0.76) | 0.0011 |
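The link between the confidence interval and the p-value for the treatment effect can be sketched as follows. This uses the reported reduction of 1.87 units (95% CI 0.76 to 2.97) and a Normal approximation, which is an assumption made here for illustration; it is not necessarily how the published p-value was calculated.

```python
import math

# Reported size of the reduction in score and its 95% confidence limits
effect = 1.87
ci_low, ci_high = 0.76, 2.97

# A 95% CI spans roughly 1.96 standard errors either side of the estimate
se = (ci_high - ci_low) / (2 * 1.96)

# Ratio of estimate to standard error, treated as a standard Normal deviate
z = effect / se

# Two-sided p-value under the Normal approximation
p_approx = math.erfc(z / math.sqrt(2))
```

The approximate p-value comes out at about 0.001, in line with the reported 0.0011; a small difference is expected because the published analysis would be based on the t distribution.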

This analysis assumes that the treatment effect is the same for all subjects and is not related to their baseline scores. Whether the effect in fact varies with baseline score could be checked by the methods discussed earlier. When two groups are balanced with respect to the baseline value, one might assume that including the baseline value in the analysis will not affect the comparison of treatment groups. However, it is often worthwhile including it because it can improve the precision of the estimate of the treatment effect; i.e. the standard errors of the treatment effects may be smaller when the baseline covariate is included.

Video 2: A video summarising linear regression. (This video footage is taken from an external site. The content is optional and not necessary to answer the questions.)

## References

Campbell MJ. Statistics at Square Two. 2nd ed. Blackwell BMJ Books, 2006.
Llewellyn-Jones RH, Baikie KA, Smithers H, Cohen J, Snowdon J, Tennant CC. Multifaceted shared care intervention for late life depression in residential care: randomised controlled trial. BMJ 1999; 319: 676-82.