Difference Scores | Are They Okay to Use?

A difference score is a variable formed by subtracting one variable from another, i.e., DIFFSCORE = VAR1 - VAR2.
Some researchers have heard that difference scores are 'bad'. This post discusses some of the issues, provides some additional references, and discusses calculating reliability of difference scores.

Scenarios

The following are some scenarios where I have thought about difference scores or where researchers have asked me about them:

Examining change on a variable over two time points
Comparing scores in two conditions in a repeated measures experiment (e.g., conscientiousness in an honest versus a job applicant role-play condition)
Comparing scores before and after an intervention

General References

Jeffrey Edwards (2001) provides a good starting point for learning about difference scores: Ten Difference Score Myths
For a discussion in the longitudinal data analysis context, check out Singer and Willett (2003).

My casual observations

The appropriateness of difference scores depends on the general concepts of validity and reliability.
A difference score is valid to the extent to which it actually measures what you intend to measure.
A difference score is reliable if it estimates whatever it estimates with little error. Reliability can be defined in terms of accuracy (the difference between observed and true scores) or in terms of correlation (the correlation between observed and true scores).
If you are interested in the effect of time on a variable, then you should try to measure the dependent variable at more than two time points. The aim of the research should usually be to describe the functional form of the relationship between time and the dependent variable. Thus, designs with just two time points are often inadequate. In short, the difference score is not the best summary of the change process.
If you are interested in the differences in scores between two conditions (e.g., honest versus role play), the difference score is a natural measure. An important strategy for increasing the reliability of the difference score is to increase the reliability of the two variables used to form it. For example, I have typically used the 20-items-per-scale version of the IPIP instead of the 10-items-per-scale version when looking at difference scores in personality across experimental conditions.
Reliability of a difference score also depends on there being actual variability in the difference. If there are no real differences or if the differences are the same for all individuals, it makes no sense to use individual differences in the difference score.
If the difference score is not reliable, sample correlations between the difference score and other variables will be attenuated (biased toward zero).
At first, the behaviour of difference scores can seem a little strange. For example, low conscientiousness scores in an honest condition tend to be correlated with response distortion (a difference score between conscientiousness in an honest condition and in a job applicant role-play condition). My interpretation of this correlation is mainly that low-conscientiousness respondents have greater scope for increasing their score. Another example can be seen in training research. A participant who already knows the content of a training program may learn less (where learning is defined as the difference between pre-training and post-training scores) than someone who does not know the material at the start.
If you have a multilevel dataset in which each person rates a series of objects on two dimensions, and you form difference scores from the differences between dimensions, the reliability of the difference score is likely to vary between individuals.
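
Two of the observations above can be illustrated with standard classical test theory formulas: the Spearman-Brown formula for how reliability changes when a scale is lengthened, and the attenuation formula for how unreliability shrinks an observed correlation. The Python sketch below is only illustrative. The function names and all numeric values (e.g., a reliability of .80 for a 10-item scale) are assumptions made up for the example, not figures from the IPIP or from any study, and Spearman-Brown assumes that the added items are parallel to the existing ones.

```python
import math


def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length by length_factor.

    Assumes the added items are parallel to the existing ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def attenuated_correlation(true_r, rel_a, rel_b):
    """Expected observed correlation given the true-score correlation and
    the reliabilities of the two measures (classical attenuation formula)."""
    return true_r * math.sqrt(rel_a * rel_b)


# Hypothetical values, for illustration only.
rel_10_items = 0.80
rel_20_items = spearman_brown(rel_10_items, 2)  # doubling from 10 to 20 items

# A true correlation of .50 between a difference score and another variable
# (reliability .90), observed with a difference score whose reliability is
# low (.30) versus higher (.60).
r_low = attenuated_correlation(0.50, 0.30, 0.90)
r_high = attenuated_correlation(0.50, 0.60, 0.90)

print(f"20-item reliability: {rel_20_items:.3f}")
print(f"observed r, unreliable difference score: {r_low:.3f}")
print(f"observed r, more reliable difference score: {r_high:.3f}")
```

Under these assumptions, lengthening the component scales raises their reliability, and a more reliable difference score loses less of its true correlation with other variables.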

Calculating reliability

Page 995 of McFarland and Ryan (2006) lists a formula for calculating the reliability of a difference score, citing Rogosa, Brandt, and Zimowski (1982). Marley Watkins has some software to calculate the reliability of a difference score, but I have not used it.
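
For readers who want to compute this by hand, a commonly cited classical test theory expression for the reliability of a difference score D = X - Y is

rel(D) = (sd_x^2 * rel_x + sd_y^2 * rel_y - 2 * r_xy * sd_x * sd_y) / (sd_x^2 + sd_y^2 - 2 * r_xy * sd_x * sd_y)

This is the textbook form, which assumes uncorrelated measurement errors. Its notation may differ from the formula printed in McFarland and Ryan (2006). The Python function below is a minimal sketch of that expression; the function name, argument names, and example values are hypothetical.

```python
def difference_score_reliability(sd_x, sd_y, rel_x, rel_y, r_xy):
    """Reliability of D = X - Y under classical test theory.

    Assumes measurement errors in X and Y are uncorrelated.
    sd_x, sd_y: observed standard deviations of X and Y
    rel_x, rel_y: reliabilities of X and Y
    r_xy: observed correlation between X and Y
    """
    cov_xy = r_xy * sd_x * sd_y
    # True-score variance of the difference.
    true_var = sd_x**2 * rel_x + sd_y**2 * rel_y - 2 * cov_xy
    # Observed variance of the difference.
    observed_var = sd_x**2 + sd_y**2 - 2 * cov_xy
    if observed_var <= 0:
        raise ValueError("the difference score has no observed variance")
    return true_var / observed_var


# Hypothetical example: two scales with reliability .80 and equal variances,
# evaluated at several values of the correlation between them.
for r_xy in (0.3, 0.5, 0.7):
    rel_d = difference_score_reliability(1.0, 1.0, 0.80, 0.80, r_xy)
    print(f"r_xy = {r_xy:.1f} -> reliability of difference = {rel_d:.3f}")
```

With equal variances the expression reduces to (average reliability - r_xy) / (1 - r_xy), so for fixed component reliabilities, the more strongly the two components correlate, the less reliable their difference.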

References

Edwards, J. R. (2001). Ten difference score myths. Organizational Research Methods, 4, 264-286.
McFarland, L., & Ryan, A. (2006). Toward an integrated model of applicant faking behavior. Journal of Applied Social Psychology, 36(4), 979-1016.
Rogosa, D., Brandt, D., & Zimowski, M. (1982). A growth curve approach to the measurement of change. Psychological Bulletin, 92(3), 726-748.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York: Oxford University Press.

Jeromy Anglim's Blog: Psychology and Statistics

Tuesday, September 29, 2009