arnoldtk

Thinking Differently About Health

Mean Spend by Percentile vs Health by Z Score

Two Different Ways to Look at Health

In my previous post, I discussed a rising health risk model with a 39% R Squared, which is higher than the R Squared for the best commercial health risk models. Here I want to explain the logic behind this model. Its exceptional accuracy comes from transforming the annual increase in spend with a rank based inverse normal transformation. The theoretical basis for this transformation is that health is normally distributed, and this post presents the evidence I have found for that claim.

Assuming that I am right, and health really is normally distributed, it would mean that there are two different ways to think about health. One can look at the distribution of dollars spent per patient per year, or one might consider that health is normally distributed. These two versions of health are plotted above.

Note that the x axes for the two plots are percentiles and Z scores. The two are related through the standard normal distribution: a percentile is the area to the left of the corresponding Z score under the standard normal curve, which is a negative squared exponential function.

The distribution of annual healthcare spend per patient, by percentile, is well known in the healthcare literature. A small fraction of the patients accounts for a large fraction of the money spent on healthcare each year. In one healthcare system, the mean cost per patient in the 99th percentile was $576,000 per year. The mean cost per patient in the 98th percentile was $94,000 per year, and only $34,000 per year in the 97th percentile. The median cost per patient (the 50th percentile) was only $94.00 per year. About 45% of the patients seen in the previous three years spent no money on healthcare at all.

The possibility that health is normally distributed seems to have been almost completely ignored in discussions about the nature of health. I searched Google for this topic and found no substantial mention of the possibility. Part of the problem seems to be that health spend does not look normal. Another problem is probably that health is a latent trait, which means that it can’t be observed directly. That is part of the reason that there are few global measures of health.

I was able to use the total number of chronic conditions per patient as a global measure of health. When I examined the cumulative distribution of chronic conditions in a healthcare population, the results indicated that health is probably normally distributed. The R-Squared of the straight-line fit on the Q-Q plot, a standard check for normality, was 99.7%.

This suggested that we need to think differently about health. Perhaps it is time to move beyond thinking about the cost distribution as a measure of health and start thinking about the normal distribution as a measure of health. This seems to have important implications for measurement and efforts to change the mean health of populations.

The Evidence that Health is Normal

As mentioned, health is a latent trait, which is unobservable. However, chronic conditions are observable and well defined. The Epic EHR provides a count of chronic conditions per patient for patients diagnosed with a chronic condition in the past two years. Chronic conditions include diagnoses such as diabetes and obesity. There are about 45 different chronic conditions a patient might have.

I counted the patients with zero, one, two, three, and so on, up to the twenty-one chronic conditions recorded for one patient. These counts are summarized in the table below. I calculated the percentage of patients with each number of chronic conditions and then calculated the cumulative percentages. I used the NORM.S.INV(p) function in Excel to compute the Z scores for the cumulative percentages. Since the Z score for a 100% probability is undefined, I set the last line to Z = 5.

Chronic Conditions

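| # of Chronic Conditions | % | Cumulative % | Z Score |
|---|---|---|---|
| 0 | 69.4% | 69.45% | 0.51 |
| 1 | 10.4% | 79.85% | 0.84 |
| 2 | 6.28% | 86.13% | 1.09 |
| 3 | 4.30% | 90.43% | 1.31 |
| 4 | 3.22% | 93.65% | 1.53 |
| 5 | 2.21% | 95.85% | 1.73 |
| 6 | 1.50% | 97.35% | 1.93 |
| 7 | 1.00% | 98.35% | 2.13 |
| 8 | 0.65% | 99.00% | 2.33 |
| 9 | 0.41% | 99.41% | 2.52 |
| 10 | 0.24% | 99.65% | 2.70 |
| 11 | 0.15% | 99.80% | 2.88 |
| 12 | 0.091% | 99.89% | 3.07 |
| 13 | 0.051% | 99.94% | 3.25 |
| 14 | 0.026% | 99.969% | 3.42 |
| 15 | 0.019% | 99.987% | 3.66 |
| 16 | 0.0053% | 99.9928% | 3.80 |
| 17 | 0.0050% | 99.9978% | 4.09 |
| 18 | 0.00094% | 99.9987% | 4.21 |
| 19 | 0.00094% | 99.9997% | 4.52 |
| 20 | 0% | 99.9997% | 4.52 |
| 21 | 0.00031% | 100% | 5.00 |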
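
The cumulative percentage and Z score columns can be reproduced outside Excel. The following is a minimal Python sketch, not the original workbook: it assumes the rounded percentages from the first nine rows of the table as input, uses `NormalDist().inv_cdf` from the standard library in place of NORM.S.INV, and the function name `cumulative_z_scores` is made up for illustration.

```python
from itertools import accumulate
from statistics import NormalDist

# Percent of patients with 0, 1, 2, ... 8 chronic conditions (rounded table values).
percent = [69.4, 10.4, 6.28, 4.30, 3.22, 2.21, 1.50, 1.00, 0.65]


def cumulative_z_scores(percentages, cap=5.0):
    """Return (cumulative proportion, Z score) pairs for a list of percentages."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    rows = []
    for cum in accumulate(p / 100 for p in percentages):
        # The inverse CDF is undefined at 1.0, so cap the Z score there
        # (the table above uses Z = 5 for the 100% row).
        z = cap if cum >= 1.0 else std_normal.inv_cdf(cum)
        rows.append((cum, z))
    return rows


for n_conditions, (cum, z) in enumerate(cumulative_z_scores(percent)):
    print(f"{n_conditions:2d}  {cum:8.2%}  {z:5.2f}")
```

Because the inputs are rounded, the results can differ from the table in the second decimal place.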

I plotted the Z scores for the cumulative percentages in Excel and fit a linear model to the points. This is the equivalent of the Q-Q plot provided by many statistical packages. The Z scores are quantiles of the standard normal distribution and should fall on a straight line if the distribution is normal. The R-Squared for the fit between the Z scores and a straight line was 99.7%. This suggested that health risk is normally distributed.
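
The straight-line check can also be sketched in Python. This is an illustrative sketch rather than the original Excel fit: it takes the Z scores as tabulated (including the Z = 5 placeholder for the last row), regresses them on the number of chronic conditions by ordinary least squares, and reports R-Squared as the squared correlation. The function name `linear_fit_r_squared` is an assumption made for this example.

```python
# Z scores from the table, indexed by number of chronic conditions (0 through 21).
z_scores = [0.51, 0.84, 1.09, 1.31, 1.53, 1.73, 1.93, 2.13, 2.33, 2.52, 2.70,
            2.88, 3.07, 3.25, 3.42, 3.66, 3.80, 4.09, 4.21, 4.52, 4.52, 5.00]
conditions = list(range(len(z_scores)))


def linear_fit_r_squared(x, y):
    """Least-squares slope, intercept, and R-squared for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-predictor linear fit, R-squared equals the squared correlation.
    return slope, intercept, sxy ** 2 / (sxx * syy)


slope, intercept, r_squared = linear_fit_r_squared(conditions, z_scores)
print(f"slope={slope:.3f}  intercept={intercept:.3f}  R-squared={r_squared:.3f}")
```

On the tabulated values the R-Squared should come out close to 0.997, in line with the 99.7% reported here. The slope of roughly 0.2 means each additional chronic condition corresponds to about a fifth of a standard deviation on the Z scale.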

Z Score by Chronic Conditions

The Normal Cumulative Distribution Function (CDF) is the integral of a negative squared exponential function. This produces a sigmoid (S shaped) curve.
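
For reference, the standard normal CDF can be written out explicitly. The symbols here are the conventional ones and do not come from the original analysis: \Phi is the CDF, z is the Z score, and t is the variable of integration.

```latex
\[
\Phi(z) \;=\; \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^{2}/2}\, dt
\]
```

The integrand is the negative squared exponential (the bell curve), and \Phi(z) is the cumulative proportion of the population at or below a given Z score. Excel's NORM.S.INV(p) returns the inverse, the z for which \Phi(z) = p.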

As another check, I plotted the relationship between the cumulative percentage of chronic conditions and a normal cumulative distribution (CDF). There seemed to be a fairly close fit, also around a 99% R Squared.

Normal CDF vs Cumulative Percentage of Chronic Conditions

Why is this Important?

While this one set of statistical analyses does not prove that health is normal, the evidence strongly suggests that this is the case. Why might this be important?

This seems to be important because it suggests that the annual healthcare spend is generated from a population where health is normally distributed. I reasoned that it should be theoretically justified to convert the annual patient spend to Z scores using a rank based inverse normal distribution. See the article “Rank-Based Inverse Normal Transformations are Increasingly Used, But are They Merited?” for a good overview.

When I did, the R Squared for the risk prediction model I had developed jumped from 26% to 39%. This was a 50% increase in explained variation. This result provides additional evidence that health is normally distributed. Assuming that health is normally distributed provides a substantial increase in the ability to predict health problems.
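
A rank based inverse normal transformation of spend can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used for the model: it assumes the Blom offset (c = 3/8) and average ranks for tied values, neither of which is specified in this post, and the spend values are illustrative only. Ties matter here because a large share of patients have zero spend.

```python
from statistics import NormalDist


def rank_inverse_normal(values, c=3 / 8):
    """Map values to Z scores via ranks: z = inv_cdf((rank - c) / (n - 2c + 1)).

    Tied values receive the average of their ranks (an assumption of this sketch).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j to the end of the current run of tied values.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]


# Illustrative annual spend per patient, in dollars (not real patient data).
spend = [0, 0, 94, 34_000, 94_000, 576_000]
for dollars, z in zip(spend, rank_inverse_normal(spend)):
    print(f"{dollars:>8,}  {z:6.2f}")
```

The transformation keeps only the rank order of the spend values, so the extreme right tail of the dollar distribution no longer dominates a regression fit on the transformed scores.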

Posted by arnoldtk

Looking for a Rising Health Risk Model?

Looking for Collaborators

I am trying to find an opportunity to keep working on some health risk models that I have developed. These models seem to show promise, and could provide new directions for healthcare. One model that seems to have a high potential is a rising risk ranking model that predicts which patients will have an increase in health problems in the next year.

This rising risk model has a high degree of accuracy and has the potential to help lower health costs if suitable interventions can be developed. The model was developed in a healthcare system using Epic electronic health records. If anyone is interested in partnering on this project, please contact me.

Some Background

In 2017, I developed a Probit Health Risk Ranking model that seems to provide a substantial breakthrough in health risk prediction. This model predicts population health rank in the next year with a 39% R-Squared, which seems to be exceptional accuracy.

The best health risk prediction model that I have seen documented anywhere is a commercial health risk model developed by Milliman that has an R-Squared of 37.7%. At 39%, the Probit model's R-Squared is 1.3 percentage points, or about 3%, higher than the Milliman model. This appears to be an improvement over the best documented commercial model.

Some Problems

There were some problems when the Probit Health Risk Ranking model was rolled out to providers.

  • First, when the model worked well, providers would remark that it was obvious that high risk score patients were unhealthy. (We don’t need this.)
  • Second, when the model provided a risk score that did not conform to their opinion, providers assumed that the model was not working. (We don’t trust this.)
  • Finally, since the total score was a black box (the model accuracy was due in part to a regression model with over two hundred predictor variables) providers were uncomfortable with not knowing how the scores were calculated. (We don’t understand this.)

A Proposed Solution (Rising Risk Prediction)

One way to overcome some of these problems is to use “rising risk” as the dependent variable. Rising risk is a measure of the likelihood of future health problems. A person with a high rising risk is a seemingly healthy patient who is likely to have substantial health problems in the next year. In tests, it appears that rising risk can also be predicted with the same 39% R-Squared that is obtained with the Probit Health Risk Ranking model.

In theory, if we could identify the patients with hidden health problems who are most likely to have substantial health problems in the next year, something might be done to change their health trajectory. Further work will need to be done to see what factors cause high levels of rising risk and whether the potential risk is treatable.

Interested?

If this seems interesting, please contact me. I would love to show you how the model is built and discuss the possibility of some type of collaboration.

Thomas Arnold, (320) 252-4993.


Theory Driven Data Science

This is a personal web site that is intended to promote a concept that I have developed called “theory driven data science.”

Theory driven data science is based on the premise that a theoretical understanding of the things one is trying to predict will produce more accurate predictions and enhance the ability to develop prescriptive solutions. The theory driven data science model is based on three core functions: theory, method, and practice.

  1. Theory: A Physics of Living Systems
  2. Method: Data Pattern Analysis
  3. Practice: Improved prediction and enhanced prescription

The theoretical foundation of theory driven data science is based on the “Physics of Living Systems,” which is a set of core principles related to the structure of data related to living systems. By understanding the “nature of nature” we can build better predictive models and choose better options for fixing the problems we are trying to address. The Physics of Living Systems has three core propositions which are referred to as “The Facts of Life.”

  1. Living systems have infinite variety
  2. Living systems are constantly changing
  3. Living systems are subject to selection processes

For an example of improved prediction, I developed a theory that health is normally distributed. This theory arose after examining the percentages of chronic conditions in the healthcare population. Using this normal health theory, I was able to boost the explanatory power of a population health ranking model by 50%. Theory driven data science.

A theory driven approach to data science focuses on the “six honest serving men” that Kipling wrote about in his poem from “The Elephant’s Child.”

I KEEP six honest serving-men
(They taught me all I knew);
Their names are What and Why and When 
And How and Where and Who.
