Challenges of Generalizing Clinical Prediction Models in Psychiatry

Precision psychiatry presents an exciting and compelling vision for the future.

Imagine a patient struggling with their mental health. They drop into their nearest doctor's office and request an assessment. The assessment is standardized across every doctor's office in the country — or even a CVS or Walmart — like a blood test at your local Quest or an eye exam. Maybe it can even be done at home, like a Covid test. An AI-assisted clinician or trained technician reviews the results of the assessment — ah! You have schizophrenia type 13c. We'd recommend two treatment options — Drug A, which will work quickly with fewer side effects and has an 85% chance of reducing hallucinations and a 65% chance of improving negative symptoms, or Drug B, which will have more side effects like weight gain and sleepiness, but has a 95% chance of reducing hallucinations and a 75% chance of improving negative symptoms. The patient and clinician discuss the trade-offs, and with a click of a button, the preferred prescription is ordered.

Today, this vision for precision psychiatry is closer to science fiction than reality. A new Science paper by Chekroud et al. notes that, in schizophrenia specifically, 20-30% of first-episode patients and more than 50% of relapsed patients do not respond sufficiently to existing antipsychotic medications, continuing to have residual hallucinations and delusions. Critically, patients with residual symptoms fare less well with talk therapies and are at higher risk for suicide (here, here, here).

We can interpret this in two ways — either (a) clinicians' existing treatment algorithms and heuristics are not precise enough to properly match patients with the right medications, or (b) there aren't good enough medications available to adequately treat the range of patient presentations. We expect both are at play.

The field of precision psychiatry is working to solve problem (a) — leveraging technology to refine and add nuance to the treatment algorithms clinicians use. This approach uses machine learning to incorporate more patient information into the treatment selection process (phenotyping) than clinicians can currently hold in their heads, and to leverage historical outcome data from similar patients to systematically learn and improve recommendations over time.

This paper is an important caution to counterbalance the excitement around these approaches: leveraging a ~1,500-person data set drawn from multiple antipsychotic randomized controlled trials around the world, the authors examine the base-rate accuracy of treatment prediction models, and then apply two important validation checks to predictive psychiatric models:

  1. Cross validation – validating that a model can predict outcomes for patients it hasn’t seen as well as it does on the sample the model was trained on. Cross validation is considered table stakes for all machine learning work – testing to see if a model is overfitted to the training population. Unfortunately, it is not always used in standard practice. 

  2. External validation – validating that a model trained on one trial can predict outcomes on data from a separate site or trial (a minimal sketch of both checks follows this list).
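
To make the distinction between these two checks concrete, here is a minimal sketch of how they might look in code. This is not the authors' pipeline; the scikit-learn classifier, the pooled_antipsychotic_trials.csv file, and the column names (trial_id, responded) are hypothetical stand-ins.

```python
# Minimal sketch: within-sample cross validation vs. leave-one-trial-out validation.
# Assumes a pooled data set with numeric baseline predictors, a binary response
# label ("responded"), and a column recording which trial each patient came from.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, LeaveOneGroupOut

pooled = pd.read_csv("pooled_antipsychotic_trials.csv")   # hypothetical pooled RCT data
X = pooled.drop(columns=["responded", "trial_id"])        # baseline predictor variables
y = pooled["responded"]                                   # treatment response label
groups = pooled["trial_id"]                               # trial membership for each patient

model = GradientBoostingClassifier()

# Check 1: k-fold cross validation within the pooled sample. This guards against
# overfitting to the training patients, but train and test folds still share trials.
within_sample = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Within-sample CV accuracy:", within_sample.mean())

# Check 2: leave-one-trial-out validation. Each test fold is an entire trial the
# model never saw, so this score reflects generalization to new sites and settings.
external = cross_val_score(model, X, y, groups=groups, cv=LeaveOneGroupOut(),
                           scoring="accuracy")
print("Leave-one-trial-out accuracy:", external.mean())
```

In a sketch like this, a large gap between the two scores is the signature of a model that has learned trial- or site-specific patterns rather than generalizable predictors, which is exactly the pattern the paper reports.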

They find that while base-rate model accuracy is significantly above random (averaging 72% accuracy in predicting treatment outcomes), this accuracy falls rapidly under the validation scenarios. Models trained on one set of patients perform worse when tested on other patients from the same trial, and when validated against separate trials, most perform no better than random (~50% accuracy); they simply do not work under different clinical conditions. That throws a real monkey wrench into our vision for a precision psychiatry future!

So why do these models fail? This is an open question, but the authors suggest a few important theories:

  • Insufficiently extensive features – the data used to profile patients may not be sufficiently nuanced in multiple ways:

    • Missing features: In schizophrenia, socio-economic factors and treatment history are significant predictors of treatment outcomes, but weren't included as features here. 

    • Limited mechanistic information: drugs work through a specific “mechanism of action” – they target something in the body or brain and change it. The problem is that there are likely multiple distinct underlying mechanisms of psychosis – more analogous to a fever or cough than to a specific virus (e.g. COVID-19) that causes the fever or cough. In other words, psychosis is a non-specific symptom with heterogeneous causes. Subjective symptom ratings and participant demographics are the primary inputs to these models, and they carry limited information for distinguishing the underlying mechanisms of psychosis. If there are multiple biological routes to the same symptom, then we might not expect the presence or absence of that symptom to be a particularly good predictor of treatment response. More objective and precise measures of exactly why people have psychosis might really improve prediction accuracy – we could even select participants for trials based on these measures and, ideally, address their underlying pathophysiology.

  • Treatment settings matter — the same drug delivered by two different doctors in two different hospitals will affect the patient differently. Features of the treatment setting may be as important as features of the patient in predicting outcomes, and were not included here.

  • Insufficient data — when we think about where predictive models have been most effective, in ad models at Google and Facebook or content recommendations on the likes of Spotify and Netflix, those models are trained on millions of new data points a day; the ~1,500 patients in this data set are likely not sufficient, particularly when the model is trying to predict from 217 predictor variables. In machine learning, there’s a “Rule of 50” rule of thumb: roughly 50 unique observations are required for each feature in the model. That would suggest this study would need about 7x as many patients to meet the threshold (a quick back-of-the-envelope check follows this list). This volume of data is a tall order in a medical context — data is much harder to acquire in healthcare than on social media.
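
For readers who want the arithmetic behind the 7x figure, here is the back-of-the-envelope check, using the 217 predictors and ~1,500 patients cited above (the Rule of 50 is a heuristic, not a hard statistical requirement):

```python
# Back-of-the-envelope check of the "Rule of 50" claim above.
n_predictors = 217                 # predictor variables reported in the study
observations_per_predictor = 50    # rule-of-thumb observations needed per feature
patients_available = 1500          # approximate pooled sample size

patients_needed = n_predictors * observations_per_predictor   # 10,850
shortfall = patients_needed / patients_available              # ~7.2

print(f"Patients needed: {patients_needed:,}")          # 10,850
print(f"Multiple of current sample: {shortfall:.1f}x")  # ~7.2x
```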

Notably — despite these clear challenges, the authors of this paper are all techno-optimists, including our own co-founder, Dr. Phil Corlett, and the co-founder of Spring Health, Dr. Adam Chekroud. We draw two conclusions from the paper:

  1. We must remain skeptical of any model without proper cross validation.

  2. If the predictors of outcomes do not, in fact, generalize across sites, we’re required to think deeply about and potentially challenge our constructs of disease and treatment – is schizophrenia a robust enough construct to define treatment algorithms? We suspect not, for schizophrenia as well as for other psychiatric conditions.

This level of skepticism, rigor, and openness, we believe, is absolutely critical to developing tools that actually work for all patients and that meaningfully improve the standard of care. Understanding the problems we face is the first step to solving them.
