Worrying developments in NHS England mental health outcomes monitoring

Mental health service users hope that the therapeutic interventions they receive will help them feel better. Randomised controlled trials are one important way to test whether a therapy “works”; however, they don’t reveal how interventions are experienced in routine care. This has led to routine outcomes monitoring, which uses questionnaires to ask service users and clinicians to rate symptoms and other relevant information before, during, and after treatment. Outcomes monitoring has been used by NHS services for some years, for example through the Child Outcomes Research Consortium and Improving Access to Psychological Therapies (IAPT). It is, however, controversial. Ros Mayo (2010, p. 63), for example, argues that:

“The application of oversimplified questions requiring tick-box answers … are driven by short-term and superficial policies and management techniques, largely incorporated from industry and the financial sector and primarily concerned with speed, change, results, cost effectiveness – turnover and minimising human contact and time involvement… They have nothing to do with human engagement…”

And yet, outcomes monitoring could be more than bureaucracy. There is emerging evidence that providing regular progress feedback to clinicians improves outcomes, especially when questionnaires are completed by service users. Intuitively this makes sense: people can sometimes reveal more about how they feel on paper than they can face-to-face. Items used in IAPT include:

  • “How often have you been bothered by… not being able to stop or control worrying?”
  • “Heart palpitations bother me when I am around people”

They also ask directly about the care received, for example:

  • “Did staff listen to you and treat your concerns seriously?”
  • “Did you feel involved in making choices about your treatment and care?”

Responses to these items could help clinicians understand the extent to which services are helping service users. Also, using standardised questionnaires means that expected progress curves can be developed (see, for example, work by Lambert and colleagues), so clinicians can tell if progress is slower than would be expected given the initial assessment and, if warranted, try a different approach.
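To illustrate the idea of an expected progress curve, here is a minimal sketch. It is not Lambert’s actual algorithm: real systems fit curves to large routine datasets, whereas the decay rate and tolerance below are invented for illustration.

```python
import math

def expected_score(initial_score, session, rate=0.15):
    """Hypothetical expected symptom score: exponential decay from the
    initial assessment score. (Illustrative only; real expected-progress
    curves are fitted to large clinical datasets.)"""
    return initial_score * math.exp(-rate * session)

def on_track(initial_score, session, observed, tolerance=1.25):
    """Flag progress as off-track if the observed score sits well above
    the expected trajectory (higher score = worse symptoms)."""
    return observed <= tolerance * expected_score(initial_score, session)

# A service user starting at 20 who still scores 19 by session 6
print(on_track(20, 6, 19))  # False: slower progress than expected
print(on_track(20, 6, 9))   # True: roughly on the expected curve
```

A clinician seeing the first result flagged could then discuss with the service user whether a different approach is warranted, which is exactly the use the text describes.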

It’s early days for outcomes monitoring but the above examples suggest that it could be a promising approach. However, closer examination shows that there are clear problems with how questionnaires are being used in practice, and I think NHS services in England are being asked to implement actively damaging approaches to outcomes monitoring.

Problem 1. Use of an unreliable measure

Suppose you wish to develop a rating scale for quality of life or distress so that you can monitor progress over time with a summary “score”. As a bare minimum requirement, all of the questions on one topic should be related to each other: the items should be “internally consistent”. This questionnaire (call it the General Stuff Questionnaire) probably wouldn’t do very well:


  1. How often do you sing in the shower?
  2. What height are you?
  3. How far do you live from the nearest park?
  4. What’s your favourite number?
  5. How often do you go dancing?

You might learn interesting things from some of the individual answers, but summing all the answers together is unlikely to be revealing. This questionnaire (call it the Reliable Feelings Questionnaire) would fare better:


  1. How do you feel? (0 is terrible, 10 fantastic)
  2. How do you feel? (0 is terrible, 10 fantastic)
  3. How do you feel? (0 is terrible, 10 fantastic)
  4. How do you feel? (0 is terrible, 10 fantastic)
  5. And finally… how do you feel? (0 is terrible, 10 fantastic)

However, you might wonder whether questions 2 to 5 add anything. There are many ways to test the internal consistency of a questionnaire using the answers people give. One is Cronbach’s alpha, a formula that typically gives values between 0 and 1. Higher, say around 0.8, is better; too close to 1 suggests redundancy among the questions, as would be likely for the Reliable Feelings Questionnaire above.
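Cronbach’s alpha can be computed directly from its definition. The sketch below simulates stand-ins for the two questionnaires above (the response data are invented: one set of unrelated items, one set of items all driven by a single underlying feeling):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (respondents x items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
n = 500

# "General Stuff": five unrelated items -> alpha should be near 0
general = rng.normal(size=(n, 5))

# "Reliable Feelings": five items all measuring one underlying feeling,
# plus a little noise -> alpha should be close to 1
feeling = rng.normal(size=(n, 1))
reliable = feeling + 0.3 * rng.normal(size=(n, 5))

print(round(cronbach_alpha(general), 2))   # near 0
print(round(cronbach_alpha(reliable), 2))  # near 1 (suggests redundancy)
```

This mirrors the point in the text: a mixed bag of independent questions yields low alpha, and five near-identical questions yield an alpha so high that most of them are redundant.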

In England, it is now recommended to use a “Mental Health Clustering Tool” to evaluate outcomes (see section 7.1 of recent guidelines). This is a questionnaire completed by clinicians covering areas such as hallucinations, delusions, depression, and relationship difficulties. The questionnaire suffers from a very basic problem: it’s not internally consistent. This has been discovered by the very people who proposed the approach (see p.30 of their report): “As a general guideline, alpha values of 0.70 or above are indicative of a reasonable level of consistency”. Their results are: 0.44, 0.58, 0.63, 0.57 – conspicuously smaller than 0.70. The authors also refer to previous studies explaining that this would always be the case, due to “its original intended purpose of being a scale with independent items” (p. 30). So, by design, it’s closer to the General Stuff Questionnaire above: a mixed bag of independent questions with low reliability.

Problem 2. Proposals to link outcomes to payment

Given evidence that collecting regular feedback might improve the quality of care people receive, it seems sensible that the IAPT programme includes regular progress monitoring. IAPT uses service-user-completed questionnaires, which could in principle provide information clinicians might not otherwise have learned. There is, however, another potential difficulty over and above the quality of the questionnaires used: external influences such as “Payment by Results” (PbR) initiatives can change for the worse how data are gathered and used. And PbR initiatives are beginning to be used in practice. The IAPT webpage notes, “An outcome based payment and pricing system is being developed for IAPT services. This is unique as other systems of PbR are activity or needs based.” Initial pilot results were “encouraging,” says the web page, and another pilot is currently running.

The idea behind this proposal is that the more improvement service users show, as partly determined by outcome scores, the more money service providers receive. This is a worry, as attaching targets to a measure tends to stop it measuring what it was intended to measure. For instance, targets on ambulance response times have led to statistically unlikely peaks in recorded times at exactly the target, suggesting that times have been altered. A national phonics screen shows a statistically unlikely peak just at the cutoff score, suggesting that teachers have rounded marks up where they fell just below the cutoff. The effect has been recognised for so long that it has a name, Goodhart’s law:

“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”

Faced with funding cuts, how many NHS managers in overstretched services will be forced to “game” performance-based payment systems to ensure their service survives? It’s not hard to do: for example, people who drop out of therapy tend to do so because they didn’t think it was helping, and it can be easy to justify not asking people who leave therapy to complete questionnaires. Those who stay in therapy may be the ones with higher scores (see Clark, 2011, p. 321). The missing data from those whom therapy was not helping could therefore give a falsely rosy picture of how well a service is working for service users. It is difficult to see how data subject to these difficulties could tell clinicians or service providers anything helpful about their services or the wellbeing of those who use them.
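A small simulation shows how this kind of missing data distorts results. The dropout rule and all the numbers below are entirely hypothetical; the point is only the direction of the bias:

```python
import random

random.seed(1)

# Hypothetical cohort: each person has a true change in symptom score
# (negative = improvement). Assume no real average improvement at all.
true_changes = [random.gauss(0, 5) for _ in range(10_000)]

# Suppose people who are not improving tend to drop out, and dropouts
# are never asked to complete the final questionnaire: anyone whose
# score worsened notably is only recorded 30% of the time.
completers = [c for c in true_changes if c < 2 or random.random() < 0.3]

true_mean = sum(true_changes) / len(true_changes)
recorded_mean = sum(completers) / len(completers)

print(round(true_mean, 2))      # near 0: no real average change
print(round(recorded_mean, 2))  # negative: apparent "improvement"
```

Even though the cohort as a whole did not improve on average, the recorded data show an apparent improvement, simply because the people therapy was not helping are missing from the final measurements.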

Concluding thoughts

I think it is possible that statistically reliable, high-quality questionnaires could be helpful in clinical practice, if thoughtfully used and explained to service users. Perhaps such questionnaires could be thought of as a bit like blood pressure readings: few patients feel reduced to numbers when told their results, and it’s obvious that more information is required to formulate a treatment if something is awry. However, using unreliable measures – especially when the developers know they are unreliable – is unacceptable for a national mental health programme. Moreover, linking questionnaire scores to payment raises even more complex ethical issues: there is a risk that the bureaucratic burden of questionnaires on service users would increase, and poorer treatment and financial decisions could result, since those decisions would be based on low-quality, unreliable data. Mental health services need to do much better than that, for the sake of everyone’s wellbeing.

Thanks very much to Martha Pollard and Justine McMahon for helpful comments.

Andy Fugard is a Lecturer in Research Methods and Statistics at the Department of Clinical, Educational and Health Psychology, University College London. His research investigates psychological therapies for children and young people: are they effective in routine practice, and what moderates their effectiveness? He is also interested in policy around practice-based evidence and recently became a member of the NHS England/Monitor Quality and Cost Benchmarking Advisory Group.