

Statistics: Correlation and Regression

 

This section covers:

  •        Correlation coefficient
  •        Simple linear regression

 

Correlation Coefficient

The correlation coefficient is a statistical measure of the strength of linear association between two continuous variables, i.e. the closeness with which points lie along the regression line (see below). It is denoted by r and lies between -1 and +1 (inclusive).

  •        If r = 1 or -1, there is a perfect positive (+1) or negative (-1) linear relationship
  •        If r = 0, there is no linear relationship between the two variables

When calculated using the observed data, it is commonly known as Pearson's correlation coefficient (after Karl Pearson, who first defined it). When calculated using the ranks of the data instead of the observed values, it is known as Spearman's rank correlation coefficient.
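
As an illustration, the following minimal sketch computes both coefficients in Python using scipy.stats (the paired data here are hypothetical):

      from scipy import stats

      # Hypothetical paired observations on two continuous variables
      x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
      y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

      r_pearson, p_pearson = stats.pearsonr(x, y)     # uses the observed values
      r_spearman, p_spearman = stats.spearmanr(x, y)  # uses the ranks of the values

      print(r_pearson, r_spearman)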

Conventionally:

       0.8 ≤ |r| ≤ 1.0            => very strong relationship
       0.6 ≤ |r| < 0.8            => strong relationship
       0.4 ≤ |r| < 0.6            => moderate relationship
       0.2 ≤ |r| < 0.4            => weak relationship
       0.0 ≤ |r| < 0.2            => very weak relationship

…where |r| (read “the modulus of r”) is the absolute (non-negative) value of r.

One can test whether r is statistically significantly different from zero (the value corresponding to no correlation). Note that the larger the sample, the smaller the value of r that reaches statistical significance. For example, with n = 10 paired observations, r is significant (at the conventional 5% level) if it is greater than 0.63; with n = 100 pairs, r is significant if it is greater than 0.20.
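
This test can be sketched in Python as follows. The helper function below is illustrative (not from any particular package) and uses the standard test statistic t = r√(n − 2) / √(1 − r²) on n − 2 degrees of freedom:

      import math
      from scipy import stats

      def r_is_significant(r, n, alpha=0.05):
          # Two-sided test of the null hypothesis of zero correlation,
          # using t = r * sqrt(n - 2) / sqrt(1 - r^2) on n - 2 degrees of freedom
          t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
          p = 2 * stats.t.sf(abs(t), df=n - 2)
          return p < alpha

      print(r_is_significant(0.63, 10))    # False: just below the critical value (~0.632) at n = 10
      print(r_is_significant(0.20, 100))   # True: just above the critical value (~0.197) at n = 100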

The square of the correlation coefficient (r²) indicates how much of the variation in variable y is accounted for (or “explained”) by the variable x. For example, if r = 0.7, then r² = 0.49, which suggests that 49% of the variation in y is explained by x.

 

Important Points:

  •        Correlation only measures linear association. A U-shaped relationship may have a correlation of zero.
  •        It is symmetric in the variables x and y: the correlation of x with y is the same as the correlation of y with x.
  •        A significant correlation between two variables does not necessarily mean they are causally related.
  •        In large samples, even very weak relationships can be detected as statistically significant.

 

Simple Linear Regression

Simple linear regression is used to describe the relationship between two variables where one variable (the dependent variable, denoted by y) is expected to change as the other (the independent, explanatory or predictor variable, denoted by x) changes.

This technique fits a straight line to data, where this so-called “regression line” has an equation of the form:

      y = a + bx

      a = constant (y intercept)
      b = gradient (regression coefficient)

The model is fitted by choosing a and b such that the sum of the squares of the prediction errors (the differences between the observed y values and the values predicted by the regression equation) is minimised. This is known as the method of least squares.

The method produces an estimate for b, together with a standard error and confidence interval. From this, one can test the statistical significance of b. In this case, the null hypothesis is that b = 0, i.e. that the variation in y is not predicted by x.

The regression coefficient b tells us that, for every 1-unit change in x (the explanatory variable), y (the response variable) changes by an average of b units.

Note that the constant value a gives the predicted value of y when x = 0.
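
As a concrete sketch of the method (again with hypothetical data), the estimates can be computed directly from the least-squares formulae; scipy.stats.linregress gives the same fit, along with the standard error and P value for b:

      import numpy as np
      from scipy import stats

      # Hypothetical paired observations
      x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
      y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

      # Least-squares estimates:
      # b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
      b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
      a = y.mean() - b * x.mean()   # the fitted line passes through (mean(x), mean(y))

      fit = stats.linregress(x, y)
      print(a, b)                       # manual estimates
      print(fit.intercept, fit.slope)   # the same values from scipy
      print(fit.stderr, fit.pvalue)     # standard error and P value for b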

 

Important Points:

  •        The relationship is assumed to be linear, which means that as x increases by a unit amount, y increases by a fixed amount, irrespective of the initial value of x.
  •        The variability of the error is assumed not to vary with x (homoscedasticity).
  •        Unlike correlation, the relationship is not symmetric: exchanging the dependent and independent variables yields a different fitted equation, unless all the observations fall exactly on a straight line (i.e. |r| = 1).
  •        The significance test for b yields the same P value as the significance test for the correlation coefficient r (see the sketch after this list).
  •        A statistically significant regression coefficient does not imply a causal relationship between y and x.
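
This equivalence of P values is easy to check numerically, using the same hypothetical data as in the regression sketch above:

      from scipy import stats

      x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # same hypothetical data as above
      y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

      r, p_r = stats.pearsonr(x, y)           # P value for the test of r = 0
      p_b = stats.linregress(x, y).pvalue     # P value for the test of b = 0

      print(p_r, p_b)                         # the two P values are identical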

 


© MJ Campbell 2016, S Shantikumar 2016