Logistic regression

Introduction

Learning objectives: You will learn about the use of logistic regression.

Logistic regression is used when the outcome variable is binary and the input variables are binary or continuous. In the simplest case, with a single binary input variable, it gives the same result as a chi-squared test.

Please now read the resource text below.

Resource text

In logistic regression the outcome variable is binary, with either an event (e.g. death, or cure) or no event (e.g. survival or not cured).

The model is:

logit(π) = a + b1X1 + ... + bpXp,

where logit(π) = loge{π/(1 - π)},

π is the expected probability of an event, and loge is the natural logarithm.
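
As a minimal sketch of the link function, the Python helpers below convert between an expected probability and its logit. The function names logit and inv_logit, the variable p for the expected probability, and the coefficient values are illustrative assumptions and are not taken from the resource text.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p); p must lie strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of logit: map a value on the log odds scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for a model with one covariate (not from the text).
a, b1 = -2.0, 0.5
x1 = 3.0

linear_predictor = a + b1 * x1              # value on the log odds scale
probability = inv_logit(linear_predictor)   # expected probability of an event
```

Whatever the value of the linear predictor, inv_logit returns a number between 0 and 1, which is why the model is written on the logit scale rather than directly on the probability scale.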

Points

The expected probabilities are linked to the observed ones by the binomial distribution.
The coefficient bi is the log odds ratio of an event for an increase of one unit in Xi. Thus if Xi is binary, bi is the log odds ratio for Xi=1 relative to Xi=0. The X's are known as the covariates.
The model is usually fitted using maximum likelihood (described in Campbell, 2006).
When an event is rare, the odds ratio approximates the relative risk of the event.
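
The last point can be checked numerically. The sketch below compares the odds ratio with the relative risk for a rare and a common event; the probabilities used are hypothetical values chosen only for illustration.

```python
def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

def odds_ratio(p1, p0):
    """Odds ratio of an event for probability p1 relative to p0."""
    return odds(p1) / odds(p0)

def relative_risk(p1, p0):
    """Relative risk of an event for probability p1 relative to p0."""
    return p1 / p0

# Hypothetical event probabilities (exposed group, unexposed group).
rare = (0.02, 0.01)      # rare event
common = (0.60, 0.30)    # common event

rare_or, rare_rr = odds_ratio(*rare), relative_risk(*rare)
common_or, common_rr = odds_ratio(*common), relative_risk(*common)

print(f"rare event:   OR = {rare_or:.2f}, RR = {rare_rr:.2f}")
print(f"common event: OR = {common_or:.2f}, RR = {common_rr:.2f}")
```

With these assumed values the relative risk is 2 in both cases, but the odds ratio stays close to 2 only when the event is rare.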

The main assumption for logistic regression is that the events are independent. An example of dependent events would be decayed, missing or filled teeth (DMF) where the probability of having a DMF tooth is higher if there is another DMF tooth in the mouth.

Logistic regression is particularly useful where we want to compare the number of events in two groups, but where there is an imbalance in a potential confounder that we wish to control for. Suppose X1=0 for group A and X1=1 for group B and the confounder is a continuous variable X2. We fit a model that contains both X1 and X2. Then the interpretation of b1 is that it is the log odds ratio of an event in group B relative to group A, given that X2 is held constant. This is the effect of X1 on the outcome allowing for X2.
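
To illustrate what "given that X2 is held constant" means, the sketch below uses hypothetical coefficient values (not from any fitted model in the text) and shows that the odds ratio for group B relative to group A equals exp(b1) at every fixed value of X2.

```python
import math

# Hypothetical coefficients: intercept, group effect, confounder effect.
a, b1, b2 = -3.0, 0.7, 0.04

def odds_of_event(x1, x2):
    """Odds implied by the model logit(p) = a + b1*x1 + b2*x2."""
    return math.exp(a + b1 * x1 + b2 * x2)

# Compare group B (x1 = 1) with group A (x1 = 0) at several fixed values of x2.
ratios = [odds_of_event(1, x2) / odds_of_event(0, x2) for x2 in (20.0, 45.0, 70.0)]

for r in ratios:
    print(f"odds ratio B vs A = {r:.4f}, exp(b1) = {math.exp(b1):.4f}")
```

The odds themselves change with X2, but their ratio between the two groups does not, which is the sense in which b1 is adjusted for X2.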

The other main use of logistic regression is in case-control studies. Here, cases are subjects with a disease and controls are subjects from the same population but without the disease. Under some general assumptions, the odds ratio from a logistic regression, in which the outcome is case/control status, will approximate the relative risk that would have been obtained from the relevant cohort study.
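
The sketch below illustrates why this works, using hypothetical counts. It assumes that all cases are kept and that controls are sampled at random from the non-cases, independently of exposure; under that assumption the expected odds ratio is unchanged by the sampling, and for a rare disease it is close to the cohort relative risk.

```python
def cross_product_odds_ratio(exposed_cases, unexposed_cases,
                             exposed_noncases, unexposed_noncases):
    """Odds ratio from a 2x2 table of exposure by disease status."""
    return (exposed_cases * unexposed_noncases) / (unexposed_cases * exposed_noncases)

# Hypothetical cohort counts (illustrative only).
exposed_cases, unexposed_cases = 60, 40
exposed_noncases, unexposed_noncases = 20000, 40000

# Relative risk and odds ratio in the full cohort.
risk_exposed = exposed_cases / (exposed_cases + exposed_noncases)
risk_unexposed = unexposed_cases / (unexposed_cases + unexposed_noncases)
cohort_rr = risk_exposed / risk_unexposed
cohort_or = cross_product_odds_ratio(exposed_cases, unexposed_cases,
                                     exposed_noncases, unexposed_noncases)

# Case-control design: keep every case, take 1 in 100 non-cases as controls
# (expected counts under sampling that ignores exposure).
exposed_controls = exposed_noncases // 100
unexposed_controls = unexposed_noncases // 100
case_control_or = cross_product_odds_ratio(exposed_cases, unexposed_cases,
                                           exposed_controls, unexposed_controls)

print(f"cohort RR = {cohort_rr:.3f}, cohort OR = {cohort_or:.3f}, "
      f"case-control OR = {case_control_or:.3f}")
```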

Example of logistic regression

Lavie et al (BMJ, 2000) surveyed 2,677 adults referred to a sleep clinic with suspected sleep apnoea. They developed an apnoea severity index, and related this to the presence or absence of hypertension.

They wished to answer two questions:

i) Is the apnoea index predictive of hypertension, allowing for age, sex and body mass index?

ii) Is sex a predictor of hypertension, allowing for the other covariates?

The results are given in table 1 below.

Table 1: Risk factors for hypertension

Risk factor	Estimate (log odds)	(95% CI)	Odds ratio
Age (10 years)	0.805	(0.718 to 0.892)	2.24
Sex (male)	0.161	(-0.061 to 0.383)	1.17
BMI (5 kg/m2)	0.332	(0.256 to 0.409)	1.39
Apnoea index (10 units)	0.116	(0.075 to 0.156)	1.12

The coefficient associated with the dummy variable Sex is 0.161, so in this study the odds of having hypertension for a man are exp(0.161)=1.17 times those of a woman of the same age, BMI and apnoea index.

On the odds ratio scale, the 95% confidence interval is exp(-0.061) to exp(0.383) = 0.94 to 1.47. Note that this includes one (as we would expect, since the confidence interval for the regression coefficient includes zero), and so we cannot say that sex is a significant predictor of hypertension in this study. We interpret the age coefficient by saying that if two people had the same sex, BMI and apnoea index, but one was 10 years older than the other, then we would predict the odds of hypertension for the older subject to be 2.24 times those of the younger subject.
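
The conversion from the log odds scale to the odds ratio scale can be reproduced with a few lines of Python. This is only a sketch: the estimates and confidence limits are those reported in Table 1, while the variable names and the layout of the dictionary are assumptions made for illustration.

```python
import math

# (estimate, lower 95% limit, upper 95% limit) on the log odds scale, from Table 1.
table1 = {
    "Age (10 years)": (0.805, 0.718, 0.892),
    "Sex (male)": (0.161, -0.061, 0.383),
    "BMI (5 kg/m2)": (0.332, 0.256, 0.409),
    "Apnoea index (10 units)": (0.116, 0.075, 0.156),
}

results = {}
for factor, (estimate, lower, upper) in table1.items():
    odds_ratio = math.exp(estimate)               # point estimate on the odds ratio scale
    ci = (math.exp(lower), math.exp(upper))       # exponentiate each confidence limit
    includes_one = ci[0] <= 1.0 <= ci[1]          # True when the log-scale interval includes 0
    results[factor] = (odds_ratio, ci, includes_one)
    print(f"{factor}: OR {odds_ratio:.2f} "
          f"(95% CI {ci[0]:.2f} to {ci[1]:.2f}), includes 1: {includes_one}")
```

Only the interval for sex includes one, which matches the conclusion drawn in the text.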

The choice of 10 years reflects how age was scaled. Note that effects that are additive on the log odds scale are multiplicative on the odds scale. Thus a man who is ten years older than a woman with the same BMI and apnoea index is predicted to have odds of hypertension 2.24 × 1.17 = 2.62 times hers. The model therefore assumes that age and sex act independently on hypertension, so that their odds ratios multiply.
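
A short check of this arithmetic, using the age and sex coefficients from Table 1 (the variable names are illustrative):

```python
import math

b_age, b_sex = 0.805, 0.161   # log odds ratios for age (10 years) and sex (male)

# Adding on the log odds scale and then exponentiating...
combined_from_sum = math.exp(b_age + b_sex)

# ...is the same as multiplying the two odds ratios.
combined_from_product = math.exp(b_age) * math.exp(b_sex)

# Product of the rounded odds ratios quoted in the text.
rounded_product = 2.24 * 1.17

print(f"{combined_from_sum:.3f} {combined_from_product:.3f} {rounded_product:.3f}")
```

The small difference between the exact value and the product of the rounded odds ratios is due to rounding only.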

For further details see Campbell MJ (2006) Chapter 3.

Video 1: A video summary of regression analysis. (This footage is taken from an external site. The content is optional and not necessary to answer the questions.)

Reference

Campbell MJ. Statistics at Square Two. 2nd ed. Blackwell BMJ Books, 2006.
