Andrew Gelman’s blog (andrewgelman.com) often has interesting comments on different aspects of statistics. I particularly like this comment on the difference between p-values of 0.20 and 0.01 (copied from his blog entry “Ride a Crooked Mile” posted on 14 Jun 2017):
3. In their article, Krueger and Heck write, “Finding p = 0.055 after having found p = 0.045 does not mean that a bold substantive claim has been refuted (Gelman and Stern, 2006).” Actually, our point was much bigger than that. Everybody knows that 0.05 is arbitrary and there’s no real difference between 0.045 and 0.055. Our point was that apparent huge differences in p-values are not actually stable (“statistically significant”). For example, a p-value of 0.20 is considered to be useless (indeed, it’s often taken, erroneously, as evidence of no effect), and a p-value of 0.01 is considered to be strong evidence. But a p-value of 0.20 corresponds to a z-score of 1.28, and a p-value of 0.01 corresponds to a z-score of 2.58. The difference is 1.3, which is not close to statistically significant. (The difference between two independent estimates, each with standard error 1, has a standard error of sqrt(2); thus a difference in z-scores of 1.3 is actually less than 1 standard error away from zero!) So I fear that, by comparing 0.055 to 0.045, they are minimizing the main point of our paper.
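The arithmetic in the quoted passage can be checked with a short sketch. It assumes two-sided p-values and a standard normal test statistic, which is what the quoted z-scores imply; the function and variable names are mine, not Gelman's.

```python
import math
from statistics import NormalDist


def z_from_two_sided_p(p: float) -> float:
    """Return the z-score whose two-sided tail probability is p."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


z_weak = z_from_two_sided_p(0.20)    # about 1.28
z_strong = z_from_two_sided_p(0.01)  # about 2.58

# Difference between the two z-scores.
difference = z_strong - z_weak

# Two independent estimates, each with standard error 1,
# have a difference with standard error sqrt(1 + 1).
se_difference = math.sqrt(2.0)

# How many standard errors the difference is from zero.
standardized = difference / se_difference

print(f"z(0.20) = {z_weak:.2f}, z(0.01) = {z_strong:.2f}")
print(f"difference = {difference:.2f}, in SE units = {standardized:.2f}")
```

The standardized difference comes out below 1, which is the point of the quote: the gap between a "useless" and a "strong" p-value is itself well within noise.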
This post is indirectly related to things Bayesian. One of the nodes in a Bayes net I have been working on is the tumor control probability (TCP) for oropharyngeal cancer. Where do we get the TCP values? One place is from Kaplan-Meier (KM) curves. If you do a KM curve for each of several doses and then focus on a particular time point, the TCP values at that time can be obtained. Now KM curves have a confidence interval which is +/- 1.96*sqrt(Variance).
Well, what is the variance? According to the textbook, it depends only on the survival probability at time t and on a sum, over the prior failure times, of the conditional risk at each of those times weighted by 1/number_at_risk. In other words, it does not take into account the biological variability.
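To make the variance concrete, here is a minimal sketch assuming the textbook formula meant here is Greenwood's formula, Var[S(t)] = S(t)^2 * sum of d_i / (n_i * (n_i - d_i)) over failure times up to t. The life table below is made-up illustration data, not from any study.

```python
import math


def km_with_greenwood(life_table):
    """Kaplan-Meier estimate with Greenwood variance.

    life_table: list of (n_at_risk, n_failures), one entry per distinct
    failure time, in time order. Returns a list of
    (survival, variance, ci_low, ci_high) tuples.
    """
    survival = 1.0
    greenwood_sum = 0.0
    results = []
    for n_at_risk, n_failures in life_table:
        survival *= 1.0 - n_failures / n_at_risk
        if n_at_risk > n_failures:  # term is undefined when everyone fails
            greenwood_sum += n_failures / (n_at_risk * (n_at_risk - n_failures))
        variance = survival ** 2 * greenwood_sum
        half_width = 1.96 * math.sqrt(variance)
        results.append((survival, variance,
                        survival - half_width, survival + half_width))
    return results


# Hypothetical data: (number at risk, number of failures) at three failure times.
table = [(10, 1), (8, 2), (5, 1)]
curve = km_with_greenwood(table)
final_survival, final_variance, ci_low, ci_high = curve[-1]
print(f"S = {final_survival:.3f}, Var = {final_variance:.5f}, "
      f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

Only the counts at risk and the counts failing enter the calculation; nothing about the patients' stage or biology appears anywhere in it.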
Take the example of two experiments. The first (the “naive”) experiment does not stratify patients by cancer stage, i.e. you draw from a patient pool that includes all patients with that type of cancer regardless of whether it is Stage I or IV. Pick some number “n” of patients and perform the KM analysis. Note that there is no measurement uncertainty: at any given time you know how many patients you have and how many fail. The variance is a measure of dispersion based on the number at risk. The second (“stratified”) experiment only chooses patients with one particular stage, e.g. II. Choose the same number “n” of patients. It is not unlikely that both experiments will give you the same KM curve since the naive experiment is an average effect over all stages. Now if you were to repeat these two experiments a number of times in order to measure the variance at different times, you should get different variances. In the naive experiment, the distribution of stages within any selected sample will vary somewhat, with a resultant difference in survival. In the stratified experiment, the population distribution will be narrower. (If you don’t think that is true, then figure out how they defined “stage” to begin with.) In this thought experiment, there are two very different variances based on the biology. Now, since we are measuring a mean, which arguably is normally distributed, the variance of the mean survival is relatively narrow, but there should still be a difference that is mathematically, if not clinically, significant.
My conclusion is that either the KM confidence interval doesn’t contain the whole story or I don’t understand the statistics as well as I think I do. In any case, it has helped sharpen my thinking about the confidence limits on TCP curves.
Contemplating a project of some colleagues regarding decision making under the uncertainty of where a tumor will be with respect to the radiation field during breathing led me to wonder about the whole range of uncertainties that should be considered. Traditionally in radiation oncology we are concerned whether the tumor will always be in the radiation field when you set up the patient on a daily basis for weeks. When the tumor position is affected by respiration or bowel gas, then it is even harder to know this. Recently, on-board imaging has helped us to understand (and sometimes manage) the motion. Increasing the size of the radiation field (a.k.a. using a PTV) is one approach to reducing uncertainty.
But what about other uncertainties? Take any cubic millimeter. What is the number of tumor cells? Classic radiation biology uses Poisson statistics to calculate the uncertainty in radiation’s ability to sterilize the tumor. So uncertainty exists and can be accounted for. What about the more recent realization that tumors do not contain a single clonogen but, rather, many genetically different cells? Hopefully, genetic characterization would give us some insight as to how the differences affect the cells’ radiation sensitivity. Epigenetic factors, too, play a role in establishing a phenotypic radiation response. However, even in this optimistic case where we have some mechanistic understanding, we can only alter the probabilities. So here we have an understanding that uncertainty exists, and in some cases we may be able to characterize it, but at this point even accurate estimates of the probabilities are hard to come by. Near the margins of the tumor, we talk about the clinical target volume, which consists of “microscopic disease”, by which we mean possible tumor cells whose existence we have no solid knowledge of. What we know comes from surgical/biopsy specimens or from clinical outcomes with regard to treating such a region in other patients. Here our uncertainty is complete with regard to the particular patient and our only knowledge comes from population averages.
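As a rough illustration of the Poisson argument, the sketch below assumes the standard Poisson TCP model, TCP = exp(-N * SF), where N is the initial number of clonogens and SF is the fraction surviving the full course of treatment. The clonogen counts and surviving fraction are made-up numbers chosen only to show the sensitivity.

```python
import math


def poisson_tcp(n_clonogens: float, surviving_fraction: float) -> float:
    """Probability that zero clonogens survive, under Poisson statistics.

    The expected number of survivors is n_clonogens * surviving_fraction,
    and the Poisson probability of observing zero is exp(-expected).
    """
    expected_survivors = n_clonogens * surviving_fraction
    return math.exp(-expected_survivors)


# Hypothetical values: same treatment (same surviving fraction),
# but different assumptions about how many clonogens are present.
surviving_fraction = 1e-7
for n_clonogens in (1e5, 1e6, 1e7):
    tcp = poisson_tcp(n_clonogens, surviving_fraction)
    print(f"N = {n_clonogens:.0e}: TCP = {tcp:.3f}")
```

A factor of 100 in the assumed clonogen number moves the TCP from about 0.99 to about 0.37, so uncertainty in the cell count alone translates into a large uncertainty in outcome.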
Much of radiation oncology (and medicine in general) is devoted to reducing the uncertainty by techniques such as recursive partitioning analysis and classification algorithms, e.g. support vector machines and logistic regression. Concepts such as stage, grade, TNM classification are all ways of predicting outcomes as a function of therapies, thereby reducing our uncertainty. Such musing leads us to consider the confluence of medical decision making and uncertainty. On one side, we can say that the minimum uncertainty is when we know for sure that the treatment will effect a cure or will surely fail. Then we have a probability of 1.0 or 0.0 and, hence, no uncertainty. The most uncertainty we have is when there is a 50% chance of cure. Surely it is better in the decision making realm to have no uncertainty. However in the real world–that is, the world of the patient and doctor–a 50% chance of cure is better than 0%. So we can conclude that uncertainty in these types of decisions is not necessarily a bad thing. Therefore, we are left to continue our quest for better strategies for making decisions under uncertainty. The question of the day is: do we want to continue understanding the biology to the point that we know exactly what will happen to a person when we know that in some fraction of the cases we will be depriving the patient of hope?
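The statement that uncertainty is zero at probabilities of 0.0 and 1.0 and greatest at 50% can be made quantitative. The sketch below assumes Shannon entropy (in bits) as the measure of uncertainty; the text above does not name a specific measure, so this is one choice among several.

```python
import math


def binary_entropy(p_cure: float) -> float:
    """Shannon entropy, in bits, of a cure / no-cure outcome."""
    if p_cure in (0.0, 1.0):
        return 0.0  # outcome is certain, so there is no uncertainty
    p_fail = 1.0 - p_cure
    return -(p_cure * math.log2(p_cure) + p_fail * math.log2(p_fail))


for p_cure in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"P(cure) = {p_cure:.1f}: uncertainty = {binary_entropy(p_cure):.3f} bits")
```

Note that a certain failure and a certain cure have the same (zero) entropy, which is exactly why minimizing uncertainty cannot by itself be the goal for the patient and doctor.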
I have written before (here) about the influence prior beliefs have on how people act. Two recent events (one lecture, one book) have led me to further contemplate this.
Daniela Witten, a biostatistician at UW, gave a bioethics lecture today on the Duke saga (her words) regarding genomics research and related clinical trials, and the consequent scandal when it turned out that the genomics science was not correct. Although statisticians led the discovery, they themselves did not use any Bayesian arguments–what follows are my own ramblings. One of the points that I got from Witten’s talk was that the very persistent reluctance of authorities at Duke (and elsewhere) to realize that the published papers were wrong came, at least partly, from their belief that the papers had to be correct. There had been such hype about the potential of genomics to guide cancer therapies, and the PI was a good scientist, and good journals accepted the papers. It took a lot of evidence before their prior probabilities were modified by data to the more correct posteriors.
This tale seems to me to be a good example of the type of thinking described by Jonathan Haidt in “The Righteous Mind: Why Good People Are Divided by Politics and Religion.” He uses the metaphor of an elephant with a rider on its back. The rider is the rational mind whereas the elephant is everything else about us. More often than not, the rider doesn’t guide the elephant, but rather spends a lot of time justifying the direction the elephant is going in of its own volition. To stretch the analogy, I think that part of the elephant’s momentum (in the vector sense) has to do with prior beliefs, without saying where they came from. The rider is new data, and the inability of our rational mind to change the elephant’s course reflects the difficulty we have in overcoming prior beliefs. One way of thinking of it is that the processes that guide our “elephant” multiply the importance or frequency of a few early data so that once that prior has been established, it takes an awful lot to influence it later.
A look at whether medical physicists are Bayesians, through the example of Maintenance of Certification (MOC).
Abstract: Though few will admit it, many physicists are Bayesians in at least some situations. This post discusses how the world looks through a Bayesian eye. This is accomplished through a concrete example, the Maintenance of Certification (MOC) by the American Board of Radiology. It is shown that a priori acceptance of the value of MOC relies on a Bayesian attitude towards the meaning of probabilities. Applying Bayesian statistics, it is shown that a reasonable prior greatly reduces any possible gain in information by going through the MOC, as well as providing some numbers on the possible error rate of the MOC. It is hoped that this concrete example will result in a greater understanding of the Bayesian approach in the medical physics environment.
For several decades, a debate has raged regarding the nature of probabilities. On one side of the debate are “frequentists”. They hold that probabilities are obtained by repeated identical observations, with the probability of any given outcome being the ratio of the number of events with that outcome to the total number of events. The classical example is the probability of observing a “heads” or “tails” when flipping a coin. On the other side of the debate are “Bayesians” (more on that name in a bit). They hold that probabilities can also represent the degree of belief in the relative frequency of a given outcome. While there are many paths by which one can reach this point, there are several common ones. The influence of prior knowledge on one’s belief that a certain event will happen is certainly one ingredient. Another path by which people reach the Bayesian viewpoint is recognition of the fact that probabilities are often useful even when it is impossible to reproduce the situation precisely enough that multiple measurements can be made, such as in the field of medicine.
For those of us in the medical field, randomized controlled trials (RCT) are our effort to achieve the frequentist goal of measuring outcomes in identical situations. However, we are usually more interested in discovering the differences in probabilities for different situations, namely when an element of a therapeutic procedure has been changed. The frequentist approach lies behind the statistical tests that are used to determine whether our observations warrant the conclusion that the therapeutic modification has resulted in a true difference or not. In other words, the frequentist view is one in which one seeks to determine whether the data observed are consistent with a given hypothesis. This is to be contrasted with the Bayesian view, in which one seeks to determine the probability of a certain hypothesis given the data.
All of this still leaves us with the question: Why do we care whether medical physicists are Bayesians or frequentists? One good reason has been in the news recently, namely, personalized medicine. How will we ever obtain the required numbers of patients if everything is personal? Even if we take “personal” to mean harboring one or several (nearly) identical genes, recent developments are demonstrating that biological processes are nearly always the result of a large set of genes. In addition, the role of epigenetic factors reduces the homogeneity in any group selected for their genetic homogeneity.
In general, medical physicists tend to be a bit under-educated with respect to probabilities and statistics, especially in a medical environment. A very good reference for the Bayesian statistical approach is “Bayesian Approaches to Clinical Trials and Health-Care Evaluation” by DJ Spiegelhalter et al. This post is a brief attempt to highlight some of the issues, but should be considered a very faint ghost of a complete discussion. To make it more concrete, I have looked at a specific situation.
An article in the NY Times highlights an interesting development in which a theoretical construct is becoming an actual reality. The article describes how immunotherapy is being tested in patients with melanoma. Several drugs seem to result in some improvement in survival and combinations of drugs result in more drastic improvements. However, these combinations can be lethal (in one test 3/46 died from the drugs), as well as causing myriad side-effects.
In decision theory, there is a concept of “utility” which is basically a quantitative measure of how much one values something. In economic terms, this value is relatively easy to assess since you are usually dealing with either actual money or something with monetary value. In health care, it is not so straightforward. Would you rather live 10 years with pain or 5 years without? One way of trying to assess these utilities when the outcome is uncertain (as it always is in medicine) is called the Standard Gamble. Imagine trying to assess how someone values a given health state, for example living with “dry mouth” which may limit how well you can swallow and eat certain foods. In this test, the person is given a choice: (a) live the rest of your life with dry mouth, or (b) take a pill which has a probability, P, of curing you but also a probability, 1-P, of killing you instantly. Let’s start off with P = 90%. That is, if you take the pill, there is a 90% chance you will be cured of dry mouth for the rest of your life. Do you take the pill with its 10% chance of dying or do you live the rest of your life with your condition? The test consists of varying the probability, P, until the person cannot choose between them. That value of P is then called the “utility for the condition of dry mouth.”
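The Standard Gamble procedure described above can be sketched as a search for the indifference point. The text only says that P is varied until the person cannot choose; the bisection strategy here is my assumption, and the simulated respondent (with a true indifference point of 0.85) is purely hypothetical.

```python
from typing import Callable


def standard_gamble_utility(prefers_pill: Callable[[float], bool],
                            tolerance: float = 0.001) -> float:
    """Estimate the utility of a health state by bisection on P.

    prefers_pill(p) answers: would the person take a pill that cures
    with probability p and kills with probability 1 - p?
    """
    low, high = 0.0, 1.0  # P = 0: certain death (refuse); P = 1: certain cure (accept)
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if prefers_pill(mid):
            high = mid  # still accepts the gamble, so indifference lies lower
        else:
            low = mid   # refuses the gamble, so indifference lies higher
    return (low + high) / 2.0


# Hypothetical respondent: takes the pill only if the cure
# probability exceeds 0.85.
def simulated_respondent(p_cure: float) -> bool:
    return p_cure > 0.85


utility_dry_mouth = standard_gamble_utility(simulated_respondent)
print(f"Estimated utility for dry mouth: {utility_dry_mouth:.3f}")
```

In a real interview the respondent's answers are noisy and the questions are usually asked in a ping-pong order rather than by strict bisection, so this is only a model of the logic, not of the elicitation protocol.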
We now have the situation where patients (and physicians) can choose between a situation where the outcome is pretty well-known or trying a new therapy which has the promise of making things much better but can also kill you. Now the situation is not exactly the same as the Standard Gamble since the new therapy brings with it some new complications even if they are not fatal. But it does make one think of whether we can use data from these real life situations to study how utilities are measured.
From Think Bayes–Bayesian statistics made simple (Allen Downey, Green Tea Press)
“Also, notice that in a Bayesian update, we multiply each prior probability
by a likelihood, so if p(H) is 0, p(H|D) is also 0, regardless of D. In the
Euro problem, if you are convinced that x is less than 50%, and you assign
probability 0 to all other hypotheses, no amount of data will convince you otherwise.
This observation is the basis of Cromwell’s rule, which is the recommendation
that you should avoid giving a prior probability of 0 to any hypothesis
that is even remotely possible (see http://en.wikipedia.org/wiki/
Cromwell’s rule is named after Oliver Cromwell, who wrote, “I beseech
you, in the bowels of Christ, think it possible that you may be mistaken.”
For Bayesians, this turns out to be good advice (even if it’s a little overwrought).”
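The point of the quoted passage can be seen in a few lines. This is a minimal sketch, not Downey's code: the hypothesis grid, the two priors, and the data (140 heads and 110 tails) are illustrative assumptions, with x standing for the probability of heads.

```python
def bayes_update(prior, likelihood):
    """Multiply each prior probability by its likelihood, then normalize."""
    unnormalized = {h: p * likelihood(h) for h, p in prior.items()}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


# Hypotheses: x = 0.00, 0.01, ..., 1.00
hypotheses = [i / 100 for i in range(101)]

# Dogmatic prior: uniform over x < 0.5, exactly zero everywhere else.
n_below = sum(1 for x in hypotheses if x < 0.5)
dogmatic_prior = {x: (1.0 / n_below if x < 0.5 else 0.0) for x in hypotheses}

# Open-minded prior: uniform over all hypotheses.
uniform_prior = {x: 1.0 / len(hypotheses) for x in hypotheses}

# Illustrative data favoring x > 0.5.
heads, tails = 140, 110


def likelihood(x):
    return x ** heads * (1.0 - x) ** tails


def mass_at_or_above_half(posterior):
    return sum(p for x, p in posterior.items() if x >= 0.5)


dogmatic_mass = mass_at_or_above_half(bayes_update(dogmatic_prior, likelihood))
uniform_mass = mass_at_or_above_half(bayes_update(uniform_prior, likelihood))
print(f"Posterior P(x >= 0.5), dogmatic prior: {dogmatic_mass}")
print(f"Posterior P(x >= 0.5), uniform prior:  {uniform_mass:.3f}")
```

With the dogmatic prior the posterior mass at or above 0.5 stays exactly zero no matter how lopsided the data, while the uniform prior shifts most of its mass there.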