
Validity, reliability and generalisability


Validity

Validity is the extent to which an instrument, such as a survey or test, measures what it is intended to measure (also known as internal validity). This is important if the results of a study are to be meaningful and relevant to the wider population. There are four main types of validity:

  • Construct validity
    Construct validity is the extent to which the instrument specifically measures what it is intended to measure, and avoids measuring other things. For example, a measure of intelligence should only assess factors relevant to intelligence and not, for instance, whether someone is a hard worker. Construct validity subsumes the other types of validity.
     
  • Content validity
    Content validity describes whether an instrument is systematically and comprehensively representative of the trait it is measuring. For example, a questionnaire aiming to score anxiety should include questions aimed at a broad range of features of anxiety.
     
  • Face validity
    Face validity is the degree to which a test is subjectively judged to measure what it is intended to measure. In other words, does the instrument appear, on the face of it, to measure what it should? The subjective opinion for face validity can come from experts, from those administering the instrument, or from those using it.
     
  • Criterion validity
    Criterion validity involves comparing the instrument in question with another criterion which is taken to be representative of the measure. This can take the form of concurrent validity (where the instrument results are correlated with those of an established, or gold standard, instrument), or predictive validity (where the instrument results are correlated with future outcomes, whether they be measured by the same instrument or a different one).

 

Reliability 

Reliability is the overall consistency of a measure. A highly reliable measure produces similar results under similar conditions so, all things being equal, repeated testing should produce similar results. Reliability is also known as reproducibility or repeatability. There are different means for testing the reliability of an instrument:

  • Inter-rater (or inter-observer) reliability
    The degree of agreement between the results when two or more observers administer the instrument on the same subject under the same conditions.
     
  • Intra-rater (or intra-observer) reliability
    Also known as test-retest reliability, this describes the agreement between results when the instrument is used by the same observer on two or more occasions (under the same conditions and in the same test population).
     
  • Inter-method reliability
    This is the degree to which two or more instruments, that are used to measure the same thing, agree on the result. This is also known as equivalence.
     
  • Internal consistency reliability
    This is the degree of agreement, or consistency, between different parts of a single instrument.
     

Internal consistency can be measured using Cronbach’s alpha (α) – a statistic derived from the pairwise correlations between items that are intended to measure the same trait and should therefore produce similar results. Alpha usually ranges from zero to one, with values above 0.7 generally deemed acceptable and a value of one indicating perfect internal consistency. A negative value can occur if the items are poorly chosen and inconsistent with one another, or if the sampling method is faulty; in such cases the items, and possibly the sampling methods used, should be reviewed.
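
As a rough illustration, the sketch below computes Cronbach’s alpha from a respondents-by-items score matrix using the standard formula α = (k / (k − 1)) × (1 − Σ item variances / variance of total scores). The function name and the example responses are hypothetical and are not taken from the original text.

    import numpy as np

    def cronbach_alpha(scores):
        """Cronbach's alpha for a (respondents x items) score matrix."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                          # number of items
        item_vars = scores.var(axis=0, ddof=1)       # variance of each item
        total_var = scores.sum(axis=1).var(ddof=1)   # variance of respondents' total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Hypothetical example: 5 respondents answering 3 questionnaire items
    responses = np.array([
        [2, 3, 3],
        [4, 4, 5],
        [3, 3, 4],
        [1, 2, 2],
        [5, 4, 5],
    ])
    print(round(cronbach_alpha(responses), 2))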

Inter-rater reliability can be measured using Cohen’s kappa (κ) statistic. Kappa indicates how well two sets of categorical measurements agree. It is more robust than simple percentage agreement because it accounts for the possibility that the raters agree by chance. Kappa values range from -1 to 1, where values ≤0 indicate no agreement beyond that expected by chance, and 1 indicates perfect agreement. Values above 0.6 are generally deemed to represent moderate agreement. Limitations of Cohen’s kappa are that it can underestimate agreement for rare outcomes, and that it requires the two raters to be independent.
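
As an illustration, the minimal sketch below computes Cohen’s kappa for two raters classifying the same set of subjects, using observed agreement and the agreement expected by chance from each rater’s marginal frequencies. The example ratings are hypothetical.

    from collections import Counter

    def cohens_kappa(rater1, rater2):
        """Cohen's kappa for two raters assigning categorical labels to the same subjects."""
        n = len(rater1)
        # Observed agreement: proportion of subjects given the same label by both raters
        p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
        # Expected chance agreement, from each rater's marginal label frequencies
        freq1, freq2 = Counter(rater1), Counter(rater2)
        p_e = sum((freq1[c] / n) * (freq2[c] / n) for c in set(rater1) | set(rater2))
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical example: two observers classifying 10 X-rays as 'normal' or 'abnormal'
    r1 = ['normal', 'normal', 'abnormal', 'normal', 'abnormal',
          'normal', 'normal', 'abnormal', 'normal', 'normal']
    r2 = ['normal', 'abnormal', 'abnormal', 'normal', 'abnormal',
          'normal', 'normal', 'normal', 'normal', 'normal']
    print(round(cohens_kappa(r1, r2), 2))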

 

Generalisability

Generalisability is the extent to which the findings of a study can be applied to other settings. It is also known as external validity. Generalisability requires internal validity as well as a judgement on whether the findings are applicable to a particular group. In making such a judgement, you can consider factors such as the characteristics of the participants (including demographic and clinical characteristics, as affected by the source population, response rate, inclusion criteria, etc.), the setting of the study, and the interventions or exposures studied. Threats to external validity, which may result in incorrect generalisation, include restrictions within the original study (eligibility criteria) and pre-test/post-test effects (where cause-effect relationships within a study are found only when pre-tests or post-tests are also carried out).

 

 

© Saran Shantikumar 2018