Data science has been described as the Fourth Paradigm. The essence of the fourth paradigm model is that data science can provide answers without scientific theory. If one has “big data,” data science can provide accurate predictions without any theoretical input. The fourth paradigm is seen as an advance over the “third paradigm,” which holds that science should be driven by scientific theory and theory testing.
I suggest that throwing out theory completely was a mistake. A “Fifth Paradigm,” called “Theory Driven Data Science,” is needed. Theory driven data science provides a theoretical framework that is designed to guide data science. The goal is to produce better predictions from “big data.” Theory driven data science is also better able to provide prescriptions as well as predictions.
I’m a theory-driven data scientist. My data science solutions are driven by a theoretical model I call the “Physics of Living Systems.” I suggest that data science can be dramatically improved by understanding the physics of living systems. For example, using the physics of living systems, I have developed a population health ranking system that predicts current patient health or future health problems with an R-squared of 39%. That explained variation is 50% higher than that of the best commercial health risk models. This is only one example of the potential for theory driven data science.
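To make the R-squared comparison concrete, here is a minimal sketch of how R-squared is computed and what the 50% figure implies. It assumes that “50% higher” means a relative improvement, which would put the commercial benchmark at roughly 26%. The function name `r_squared` and the toy numbers are illustrative assumptions, not the ranking model or its data.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: variation the predictions fail to explain.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: variation around the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy check: perfect predictions explain all variation, and predicting
# the mean for everyone explains none of it.
toy_actual = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(toy_actual, toy_actual)
mean_only = r_squared(toy_actual, [2.5] * 4)

# Assumption: "50% higher" is relative, so 0.39 = 1.5 * baseline.
reported_r2 = 0.39
implied_baseline = reported_r2 / 1.5  # about 0.26

print(perfect, mean_only, round(implied_baseline, 2))
```

Under that reading, the gap between the two models is about 13 percentage points of explained variation.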
While the physics of living systems is relatively easy to grasp, its paradigms run counter to many of the existing scientific paradigms. Existing scientific paradigms were developed using assumptions designed to simplify science. These assumptions don’t do a very good job of reflecting reality. A new set of scientific paradigms is needed that begins with reality.
One can understand the premises behind the physics of living systems by thinking about the physics of gases. In the physics of gases, we understand that there are two levels we can study: 1) the physics of individual particles, and 2) the physics of gas volumes. Individual particles have position, speed, and direction. Gas volumes have size, temperature, and pressure.
The physics of living systems postulates that there are three “facts of life” shared by all individual living systems. These facts of life are found from observing natural phenomena. Living organisms are “complex adaptive systems” and we should begin by acknowledging their characteristics. The facts of life are as follows.
- Individual living systems have infinite variation.
- The characteristics of individual living systems are constantly fluctuating through a process of “high dimensional chaos.”
- Living systems are constantly developing over their life course.
The next three postulates are not immediately obvious. They are not taught in schools. I discovered the importance of these issues by studying the age crime curve and the exponential rise in healthcare costs as health problems increase. There are “asymmetric selection” processes at work that are not well understood.
Because of the “facts of life,” the population properties of living systems are as follows.
- The population probability density function (PDF) for individual characteristics tends to be normal.
- The population cumulative distribution function (CDF) when there is “selection without replacement” is a sigmoid curve.
- The population cumulative distribution function (CDF) when there is “selection with replacement” is an exponential curve.
The physics of living systems predicts that health is normally distributed and that chronic condition percentages, which involve “selection without replacement,” will follow a sigmoid curve. This theory is supported by the data. The physics of living systems also predicts that health cost distributions, which involve “selection with replacement,” will rise exponentially. Again, this is supported by the data.
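The link between a normal distribution and a sigmoid curve can be sketched with a short simulation. The sketch assumes a hypothetical standardized “health” score and counts the share of the population at or below each threshold. That share traces the normal CDF, which is S-shaped. The variable names and the threshold rule are illustrative assumptions, not the rising health risk ranking model, and the exponential “selection with replacement” case is not covered here.

```python
import random
from statistics import NormalDist


def empirical_cdf(samples, threshold):
    """Fraction of samples at or below the threshold."""
    return sum(1 for s in samples if s <= threshold) / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical standardized scores: mean 0, standard deviation 1.
scores = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

thresholds = [-3, -2, -1, 0, 1, 2, 3]
fractions = [empirical_cdf(scores, t) for t in thresholds]
theory = [NormalDist().cdf(t) for t in thresholds]

# The cumulative share rises slowly in the tails and fastest near the
# mean, which is the S shape of a sigmoid curve.
for t, observed, expected in zip(thresholds, fractions, theory):
    print(f"{t:+d}  simulated={observed:.3f}  normal_cdf={expected:.3f}")
```

Each person is counted once when the threshold passes their score, so the curve must flatten as it approaches 100% of the population.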
It is probably most instructive to show how this works. Therefore, I will demonstrate how to build a highly accurate “rising health risk ranking model” using theory driven data science. I have already built, tested and validated this model, so I know it works.
Writing up the explanation of how to create this model is an ongoing process, so check back periodically for updates.