Summarising binary data


Learning objectives: You will learn about risks, odds, relative risks, odds ratios, absolute risk differences, and number needed to treat.

This section covers methods of summarising binary data for one variable, and when we wish to look at the relationship between two variables.

Please now read the resource text below.

Resource text

Summarising one binary variable

Binary data only take one of two values such as 'alive' or 'dead', 'male' or 'female'. We assign values 0 and 1 to the two states. For a single variable there are two ways of summarising the information: proportions and odds. Proportions can be classified as risks or rates.

Consider 10 observations:

1 1 1 1 1 0 0 0 0 0

We could say that 5 out of 10 observations were 1, i.e. a proportion, 0.5, or a percentage, 50%, were 1. A proportion that is common in medicine is prevalence. This is defined as the number of people in a population with a particular condition divided by the number of people in the population. This is sometimes multiplied by a round number such as 1000, so we have the rate per thousand, which is easier to understand. For example, the prevalence of type II diabetes is currently 0.003 or 3 per thousand.

A proportion is a special type of ratio in that it must lie between 0 and 1. Another type of ratio that is worth mentioning is a rate. This is the proportion of events that occur within a given time period. For example, the population of the UK is approximately 60,000,000. Every year about 600,000 people die. Thus the crude mortality rate for the UK is 600,000/60,000,000=1/100 or 0.01. This is often expressed per 1000, so that we say the crude mortality rate for the UK is about 10 per thousand per year.

If the data referred to earlier arose because we followed up a group of 10 people for, say, a year and 5 developed a disease, then we often refer to the proportion as a risk of developing the disease. Strictly speaking, epidemiologists would call this an incidence rate and would require a time period to be specified. When you hear a risk quoted, always ask over what period of time. After all, in the long run the risk of death is one.

An alternative way of looking at the 10 observations, particularly if they arise prospectively, is to say that out of the 10 observations, 5 observations are 1 and 5 are 0 i.e. a ratio of 5:5 or what is known as an odds of 1 to 1. Statisticians drop the 'to 1' and take it as being understood. We might say something has a fifty-fifty chance, meaning a probability of 0.5.

Odds are commonly used amongst the horse racing fraternity, where odds of 10 to 1 mean that out of 11 races they would expect a horse to win only once. Usually in betting, odds are bigger than one, since gamblers would not quote you odds on something they thought is likely to happen. However, odds can be less than one, and unlike proportions their only restriction is that they must be positive.

In general, if you have x events and y non-events, the odds of an event are x/y and the proportion is x/(x+y). It is a simple matter to relate odds (o) to proportions (p). The odds of an event are o = p/(1-p). Thus, the odds are the ratio of the proportion of 1's to the proportion of 0's. Rewriting the equation we find that p = o/(1+o). Thus an odds of 1 implies a proportion of 0.5.
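The two conversions can be sketched in a few lines of Python. The function names are illustrative choices, not part of the original text.

```python
def odds_from_proportion(p: float) -> float:
    """Convert a proportion p (0 <= p < 1) to odds, o = p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def proportion_from_odds(o: float) -> float:
    """Convert odds o (o >= 0) back to a proportion, p = o / (1 + o)."""
    if o < 0:
        raise ValueError("odds must be non-negative")
    return o / (1 + o)


# The 10 observations from the text: x = 5 events and y = 5 non-events.
x, y = 5, 5
proportion = x / (x + y)   # x/(x+y)
odds = x / y               # x/y

print(proportion, odds)
print(odds_from_proportion(proportion))   # odds implied by a proportion of 0.5
print(proportion_from_odds(1 / 10))       # one win per ten losses, i.e. 1 win in 11 races
```

Betting odds of 10 to 1 are odds against winning, so the odds of a win are 1/10 and the corresponding proportion is 1/11.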

Summarising the relationship between two variables

Statistics become more interesting when we have two groups, and we will start with what is known as a 2x2 contingency table, such as the following. It has 3 columns but the totals column is not included in the "2x2" description.

2x2 contingency table for comparison of two groups

           Positive   Negative   Total
Group 1    a          b          a+b
Group 2    c          d          c+d

We express the risk in group 1 as p1 = a/(a+b) and the risk in group 2 as p2 = c/(c+d).

As an example, consider Zar et al (2007), who report on a trial of isoniazid for the treatment of tuberculosis in children with HIV [1]. The outcome was death after 6 months follow-up. The results are given in the following table:

Results from the Isoniazid trial after 6 months follow-up

             Dead   Alive   Total
Placebo      21     110     131
Isoniazid    11     121     132

With placebo, there was a risk of p1 = a/(a+b) = 21/131 = 0.16 of dying 6 months after randomisation. In the isoniazid group the risk was p2 = c/(c+d) = 11/132 = 0.083.

To compare two proportions, what we really require is to look at the contrast between different therapies. We can do this by looking at either the difference in risks or the ratio of risks.

Consider the difference in risk first. If we ignore the sign (+ or -) this is sometimes known as the absolute risk difference (ARD) or, if the risk in the intervention group is lower than the control, as the absolute risk reduction.

Thus ARD = |p2-p1| (where '||' means ignore the sign)

The difference in risks in this case is 0.160-0.083 = 0.077 or 7.7%. One way of thinking about this is that if 100 patients were treated with placebo and 100 treated with isoniazid, we would expect 16 to have died on placebo and 8.3 on isoniazid. Thus an extra 7.7 died with placebo.

Another way of looking at this is to ask: how many patients would need to be treated for one extra person to be saved by isoniazid? If 7.7 extra deaths resulted from treating 100 patients per group, then 100/7.7 = 13 patients per group would be treated for 1 extra death. Thus, roughly, if 13 patients were treated with placebo and 13 with isoniazid, we would expect 1 fewer patient to die on isoniazid. This is known as the number needed to treat (NNT), or if the treatment is beneficial, the number needed to treat for benefit (NNTB). It is simply expressed as the inverse of the absolute risk difference.

Thus NNT = 1/|p2-p1|

When the new therapy is harmful, it is known as the number needed to treat for harm (NNTH) [2].
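As a minimal sketch, the absolute risk difference and number needed to treat for the isoniazid trial can be reproduced in Python from the counts in the table. The variable names are illustrative, and rounding the NNT up to a whole patient is a common convention assumed here rather than something stated in the text.

```python
import math

# Counts from the isoniazid trial table.
placebo_dead, placebo_total = 21, 131
isoniazid_dead, isoniazid_total = 11, 132

p1 = placebo_dead / placebo_total        # risk of death on placebo
p2 = isoniazid_dead / isoniazid_total    # risk of death on isoniazid

ard = abs(p2 - p1)   # absolute risk difference, ignoring the sign
nnt = 1 / ard        # number needed to treat

print(f"p1 = {p1:.3f}, p2 = {p2:.3f}")
print(f"ARD = {ard:.3f}")
print(f"NNT = {nnt:.1f}, rounded up to {math.ceil(nnt)} patients")
```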

The number needed to treat has been suggested by Sackett et al [3] as a useful and clinically intuitive way of thinking about the outcome of a clinical trial. For example, in a clinical trial of pravastatin against usual therapy to prevent coronary events in men with moderate hypercholesterolaemia and no history of MI, the NNT is 42. Thus you would have to treat 42 men with pravastatin to prevent one extra coronary event, compared with usual therapy. It is claimed that this is easier to understand than the relative risk reduction, or other summary statistics, and can be used to decide whether an effect is 'large' by comparing the NNT for different therapies.

However, it is important to realise that comparison between NNTs can only be made if the baseline risks are similar. Thus, suppose a new therapy managed to reduce 5 year mortality of Creutzfeldt-Jakob disease from 100% on standard therapy to 90% on the new treatment. This would be a major breakthrough and has an NNT of 1/(1-0.9) = 10. In contrast, a drug that reduced mortality from 50% to 40% would also have an NNT of 10, but would have much less impact.

We can also express the outcome as a risk ratio or relative risk (RR), which is the ratio of the two risks, experimental risk divided by control risk.

Thus RR = p2/p1

This is also sometimes called the incidence rate ratio (IRR). In the isoniazid trial RR = 0.083/0.16 = 0.52. With a relative risk less than one, we can also consider the relative risk reduction (RRR).

RRR = (p1-p2)/p1

This is easily shown to be 1-RR and is often expressed as a percentage. Thus, a child in the isoniazid trial has a reduced risk of death in 6 months compared with placebo of 1-0.52 = 0.48 = 48%.
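The relative risk and relative risk reduction for the isoniazid trial can be checked with a short Python sketch. The variable names are illustrative.

```python
# Risks from the isoniazid trial: p1 is placebo (control), p2 is isoniazid.
p1 = 21 / 131
p2 = 11 / 132

relative_risk = p2 / p1                    # RR = p2/p1
relative_risk_reduction = (p1 - p2) / p1   # RRR = (p1-p2)/p1, equal to 1 - RR

print(f"RR  = {relative_risk:.2f}")
print(f"RRR = {relative_risk_reduction:.2f} ({relative_risk_reduction:.0%})")
```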

We can also summarise the trial in terms of odds. The odds of death on placebo are (21/131)/(110/131) = 21/110 = 0.191 and on isoniazid they are 11/121 = 0.091.

The odds ratio (OR) = {p2/(1-p2)}/{p1/(1-p1)}

In this case OR = 0.091/0.191 = 0.48. In this case the odds ratio has almost the same value as the relative risk.
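A short Python sketch of the same calculation, working directly from the counts in the isoniazid table, is given below. The variable names are illustrative.

```python
# Counts from the isoniazid trial table (dead, alive).
placebo_dead, placebo_alive = 21, 110
isoniazid_dead, isoniazid_alive = 11, 121

# Odds are events divided by non-events within each group.
odds_placebo = placebo_dead / placebo_alive
odds_isoniazid = isoniazid_dead / isoniazid_alive
odds_ratio = odds_isoniazid / odds_placebo

# Relative risk for comparison: events divided by group totals.
p1 = placebo_dead / (placebo_dead + placebo_alive)
p2 = isoniazid_dead / (isoniazid_dead + isoniazid_alive)
relative_risk = p2 / p1

print(f"odds: placebo {odds_placebo:.3f}, isoniazid {odds_isoniazid:.3f}")
print(f"OR = {odds_ratio:.2f}, RR = {relative_risk:.2f}")
```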

The odds ratio and the relative risk are related by

OR = RR × (1-p1)/(1-p2)

which follows from writing OR = {p2/(1-p2)}/{p1/(1-p1)} and RR = p2/p1.

We illustrate this relationship with some sample values given in the table below. This demonstrates an important fact: the odds ratio is a close approximation to the relative risk when the baseline risk is low, but is a poor approximation if the baseline risk is high.

Odds ratios and relative risks for different values of absolute risks
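The pattern can also be reproduced with a short Python sketch. The relative risk of 0.5 and the baseline risks used below are arbitrary illustrative choices and are not the values from the original table; the function names are likewise assumptions.

```python
def odds(p: float) -> float:
    """Odds corresponding to a risk p, o = p / (1 - p)."""
    return p / (1 - p)


def odds_ratio_for(rr: float, baseline_risk: float) -> float:
    """Odds ratio implied by a relative risk rr at baseline risk p1."""
    p1 = baseline_risk
    p2 = rr * p1   # risk in the comparison group, since RR = p2/p1
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("both risks must lie strictly between 0 and 1")
    return odds(p2) / odds(p1)


rr = 0.5  # arbitrary illustrative relative risk
for p1 in (0.01, 0.05, 0.2, 0.5, 0.8):
    print(f"baseline risk {p1:.2f}: RR = {rr:.2f}, OR = {odds_ratio_for(rr, p1):.2f}")
```

At a baseline risk of 0.01 the odds ratio is very close to the relative risk of 0.5, whereas at a baseline risk of 0.5 it is 1/3.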

Treatments can do harm as well as good. As an example, consider Kennedy et al (1998), a report on the study of acetazolamide and furosemide versus standard therapy for the treatment of post haemorrhagic ventricular dilatation (PHVD) in premature babies [4]. The outcome was death or a shunt placement by 1 year of age. The results are given in the following table:

Results from the PHVD trial [4]

                              Death/Shunt   No Death/Shunt   Total
Standard Therapy              35            41               76
Drug plus standard therapy    49            26               75

Here the risk in the control group is 0.46 (35/76) and in the intervention group it is 0.65 (49/75). The relative risk of death or shunt in the intervention group compared with standard therapy is 1.42. This can be expressed as a 42% increased risk, but not an increase of 142% as is sometimes suggested. The NNTH is 1/|0.46-0.65| = 5.3, which is rounded up to 6. Hence only 6 children need be treated in each group for one extra to experience harm.
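The PHVD calculations can be sketched in Python as follows. The variable names are illustrative. Note that working from unrounded risks gives an NNTH of about 5.2 rather than the 5.3 obtained from the risks rounded to two decimal places; both round up to 6.

```python
import math

# Counts from the PHVD trial table.
control_events, control_total = 35, 76    # standard therapy
treated_events, treated_total = 49, 75    # drug plus standard therapy

p_control = control_events / control_total
p_treated = treated_events / treated_total

relative_risk = p_treated / p_control
percent_increase = (relative_risk - 1) * 100   # increase relative to control
ard = abs(p_treated - p_control)
nnth = 1 / ard                                 # number needed to treat for harm

print(f"risks: control {p_control:.2f}, intervention {p_treated:.2f}")
print(f"RR = {relative_risk:.2f}, i.e. a {percent_increase:.0f}% increased risk")
print(f"NNTH = {nnth:.1f}, rounded up to {math.ceil(nnth)}")
```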

As a ratio of two numbers, the relative risk hides the actual size of the numbers. Thus a relative risk of 2 could be 8 people out of 10 having an event compared with 4 people out of 10. Or it could be 2 people out of 1000 having an event compared with 1 person out of 1000. These two examples have completely different interpretations. When a relative risk is quoted, always ask about the absolute risk as well, so that a proper interpretation can be made.
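The two scenarios above can be compared directly in a short Python sketch. The helper function and its name are illustrative assumptions; exact fractions are used so that the results are not affected by floating-point rounding.

```python
from fractions import Fraction


def summarise(events_1, total_1, events_2, total_2):
    """Return (relative risk, absolute risk difference, 1/ARD) for two groups."""
    risk_1 = Fraction(events_1, total_1)
    risk_2 = Fraction(events_2, total_2)
    rr = risk_1 / risk_2
    ard = abs(risk_1 - risk_2)
    return rr, ard, 1 / ard


common = summarise(8, 10, 4, 10)       # 8 out of 10 versus 4 out of 10
rare = summarise(2, 1000, 1, 1000)     # 2 out of 1000 versus 1 out of 1000

for label, (rr, ard, inverse_ard) in (("common", common), ("rare", rare)):
    print(f"{label}: RR = {float(rr)}, ARD = {float(ard)}, 1/ARD = {float(inverse_ard)}")
```

Both scenarios have a relative risk of 2, but the absolute risk differences are 0.4 and 0.001, so the inverse of the absolute risk difference is 2.5 in one case and 1000 in the other.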

For example, the risk of deep vein thrombosis in women on a new type of contraceptive is 30 per 100,000 women years, compared with 15 per 100,000 women years on the older type. Thus the relative risk is 2, which shows that the new type of contraceptive carries quite a high risk of deep vein thrombosis. However, an individual woman need not be unduly concerned, since she has a probability of 0.0003 of getting a deep vein thrombosis in one year on the new drug, which is much less than if she were pregnant.

Relative risks versus odds ratios

The odds ratio may not seem like an intuitively obvious statistic, but it has some useful properties. Consider an exposure that in a low risk group has been found to double the risk of disease. What would happen in a high risk population where the risk of the disease in the unexposed group is already over 50%? Clearly, you cannot simply multiply the incidence in the unexposed group by the relative risk to get the risk in the exposed group, since you would get a risk greater than one. However, there are no such problems with the odds ratio. For example, doubling the odds at a baseline risk of 0.6 (odds of 1.5) gives odds of 3, which corresponds to a risk of 0.75.

A further use for the odds ratio arises when data come from a cross-sectional study or a case-control study. In a case-control study it is not possible to calculate a relative risk directly but you can use the odds ratio to estimate the relative risk.

Suppose there are two conditions A and B, which are present or absent, and we wish to see if there is an association between the two as in the table below.

2x2 table for association studies

                        Condition A
                        Present   Absent   Total
Condition B   Present   a         b        a+b
              Absent    c         d        c+d
              Total     a+c       b+d

We can argue across the rows:

  • Given Condition B is present, the odds for A being present are a/b.
  • Given Condition B is absent, the odds for A being present are c/d.
  • Thus, the odds ratio for A given B is (a/b)/(c/d) = ad/bc.

However, we can also argue across the columns:

  • Given Condition A is present, the odds for B being present are a/c.
  • Given Condition A is absent, the odds for B being present are b/d.
  • Thus, the odds ratio for B given A is (a/c)/(b/d) = ad/bc.

Thus, the odds ratio for A given B is the same as the odds ratio for B given A.

To illustrate this, consider the table below, showing the prevalence of hay fever and eczema in a cross-sectional survey of 11-year-old children [5,6].

Association between hay fever and eczema in 11-year-old children [5,6]

                  Hay fever present   Hay fever absent   Total
Eczema present    141                 420                561
Eczema absent     928                 13525              14453
Total             1069                13945              15014

If you have hay fever, the risk of eczema is 141/1069 = 0.132 and the odds are 141/928 = 0.152. If you do not have hay fever, the risk of eczema is 420/13945 which is 0.030, and the odds are 420/13525 which is 0.031. Thus the relative risk of having eczema, given that you have hay fever is 0.132/0.030 which is 4.4. We can also find the odds ratio of having eczema given that you have hay fever as 0.152/0.031 = 4.90.

We can consider the table the other way around, and ask what the risk of hay fever is, given that a child has eczema. In this case the two risks are 141/561 = 0.251 and 928/14453 = 0.064, and the relative risk is 0.251/0.064 = 3.92. Thus, the relative risk of hay fever, given that a child has eczema, is 3.92, which is not the same as the relative risk of eczema, given that a child has hay fever. However, the two respective odds are 141/420 = 0.336 and 928/13525 = 0.069, and the odds ratio is 0.336/0.069 = 4.87, which to the limits of rounding is the same as the odds ratio for eczema, given a child has hay fever.
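The calculations in both directions can be checked with a short Python sketch using the counts from the hay fever and eczema table. The variable names are illustrative. Because the sketch works from unrounded values, its results differ slightly in the last digit from the figures in the text, which are computed from rounded risks and odds.

```python
from fractions import Fraction

# Counts from the hay fever and eczema table.
a, b = 141, 420      # eczema present: hay fever present, hay fever absent
c, d = 928, 13525    # eczema absent:  hay fever present, hay fever absent

# Eczema given hay fever status (comparing the two columns).
rr_eczema = Fraction(a, a + c) / Fraction(b, b + d)
or_eczema = Fraction(a, c) / Fraction(b, d)

# Hay fever given eczema status (comparing the two rows).
rr_hay_fever = Fraction(a, a + b) / Fraction(c, c + d)
or_hay_fever = Fraction(a, b) / Fraction(c, d)

print(f"eczema given hay fever:  RR = {float(rr_eczema):.2f}, OR = {float(or_eczema):.2f}")
print(f"hay fever given eczema:  RR = {float(rr_hay_fever):.2f}, OR = {float(or_hay_fever):.2f}")
print("odds ratios identical:", or_eczema == or_hay_fever)
```

The two relative risks differ, while the two odds ratios are exactly equal when no intermediate rounding is applied.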

The fact that the two odds ratios are the same can be seen from the fact that

OR = ad/bc = (141 × 13525)/(420 × 928) = 4.89

and this remains the same if we switch rows and columns.

Thus, we can either say that children with hay fever have five times the odds of getting eczema, or that children with eczema have five times the odds of getting hay fever. This will be approximately true for risks because hay fever and eczema are quite rare in the population, but would not be true if the incidence was higher.

Another useful property of the odds ratio is that the odds ratio for an event not happening is just the inverse of the odds ratio for it happening. Thus, the odds ratio for not getting eczema, given that a child has hay fever, is just 1/4.90 = 0.204. This is not true of the relative risk: the relative risk for not getting hay fever, given that a child has eczema, is (420/561)/(13525/14453) = 0.80, which is not 1/3.92 = 0.26.
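A brief Python sketch of this property, again using the counts from the hay fever and eczema table, is given below. The variable names are illustrative, and the comparison is made for hay fever given eczema status (comparing the two rows of the table).

```python
from fractions import Fraction

# Counts from the hay fever and eczema table.
a, b = 141, 420      # eczema present: hay fever present, hay fever absent
c, d = 928, 13525    # eczema absent:  hay fever present, hay fever absent

# Event: having hay fever, compared between children with and without eczema.
or_event = Fraction(a, b) / Fraction(c, d)
rr_event = Fraction(a, a + b) / Fraction(c, c + d)

# Non-event: not having hay fever, same comparison.
or_non_event = Fraction(b, a) / Fraction(d, c)
rr_non_event = Fraction(b, a + b) / Fraction(d, c + d)

print("OR for non-event equals 1/OR:", or_non_event == 1 / or_event)
print(f"RR for non-event = {float(rr_non_event):.2f}, 1/RR = {float(1 / rr_event):.2f}")
```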

Odds ratios and case control studies

Case control and cohort studies relate exposure to some hazard to outcome in the form of disease or death. A cohort study measures exposure and then observes events to answer the question: if one is exposed to a hazard (E) what is the probability of disease D (i.e. Prob(D|E))? A case control study argues the other way around. It measures events and looks backwards for exposure (i.e. Prob(E|D)).

The outcome from a case control study can be expressed as a 2x2 table as shown in the following example:

2x2 table for a case control study

               Case   Control   Total
Not exposed    a      b         a+b
Exposed        c      d         c+d
Total          a+c    b+d       n

Notice that the 'outcome' in the first table of the module has been replaced by whether a subject is a case or a control. We cannot think of (a+c)/n as the prevalence of the disease since in case control studies the relative number of cases to controls can be decided by the investigator. It is common to use the same number of cases and controls, and yet one would not think that the prevalence was 50%. Similarly, the usual measure of the relative risk no longer holds. Suppose the investigator decided to double the number of controls as shown in the following table:

2x2 table for a case control study with double the number of controls

               Case   Control   Total
Not exposed    a      2b        a+2b
Exposed        c      2d        c+2d
Total          a+c    2b+2d

It is a simple matter to show that the estimate of the relative risk is changed from {a/(a+b)}/{c/(c+d)} to {a/(a+2b)}/{c/(c+2d)}. However, the estimate of the odds ratio remains as ad/bc.
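This can be checked numerically with a short Python sketch. The counts below are hypothetical, chosen only for illustration, and the function names are assumptions. The "relative risk" follows the formula {a/(a+b)}/{c/(c+d)} used in the text.

```python
from fractions import Fraction


def relative_risk(a, b, c, d):
    """Apparent relative risk {a/(a+b)}/{c/(c+d)} from a 2x2 table."""
    return Fraction(a, a + b) / Fraction(c, c + d)


def odds_ratio(a, b, c, d):
    """Odds ratio ad/bc from a 2x2 table."""
    return Fraction(a * d, b * c)


# Hypothetical counts: rows are not exposed (a, b) and exposed (c, d),
# columns are cases and controls.
a, b, c, d = 30, 70, 10, 90

rr_before = relative_risk(a, b, c, d)
rr_after = relative_risk(a, 2 * b, c, 2 * d)   # controls doubled
or_before = odds_ratio(a, b, c, d)
or_after = odds_ratio(a, 2 * b, c, 2 * d)      # controls doubled

print(f"relative risk: {float(rr_before):.2f} -> {float(rr_after):.2f}")
print(f"odds ratio:    {float(or_before):.2f} -> {float(or_after):.2f}")
```

Doubling the controls changes the apparent relative risk but leaves the odds ratio untouched, because the factor of 2 cancels in ad/bc.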

When the assumption of a low absolute risk holds true (which is usually the situation for case control studies), then the odds ratio is assumed to approximate the relative risk that would have been obtained if a cohort study had been conducted.

Choice of summary statistics for binary data

The table below gives a summary of the different methods of summarising a binary outcome for a prospective study such as a clinical trial (Campbell and Swinscow 2009 [7]), using the isoniazid trial as the worked example. In that trial the odds ratio and relative risk are close because the event rates are low. In the PHVD trial, where the event rates are quite high, the odds ratio, (49/26)/(35/41) = 2.21, and the relative risk, 1.42, differ markedly.

Methods of summarising a binary outcome in a two group prospective study. Risk in group 1 (control) is p1, risk in group 2 is p2.

Term                                           Formula                    Observed in isoniazid trial
Absolute Risk Difference (ARD)                 |p1-p2|                    0.160-0.083 = 0.077
Relative Risk (RR)                             p2/p1                      0.083/0.160 = 0.52
Relative Risk Reduction (RRR)                  (p1-p2)/p1                 (0.160-0.083)/0.160 = 0.48
Number Needed to Treat or Harm (NNT or NNH)    1/|p1-p2|                  1/0.077 ≈ 13
Odds Ratio (OR)                                {p2/(1-p2)}/{p1/(1-p1)}    (0.083/0.917)/(0.160/0.840) = 0.48