# p values

Andrew Gelman’s blog (andrewgelman.com) often has interesting comments on different aspects of statistics.  I particularly like this comment on the difference between p-values of 0.20 and 0.01 (copied from his blog entry “Ride a Crooked Mile” posted on 14 Jun 2017):

3. In their article, Krueger and Heck write, “Finding p = 0.055 after having found p = 0.045 does not mean that a bold substantive claim has been refuted (Gelman and Stern, 2006).” Actually, our point was much bigger than that. Everybody knows that 0.05 is arbitrary and there’s no real difference between 0.045 and 0.055. Our point was that apparent huge differences in p-values are not actually stable (“statistically significant”). For example, a p-value of 0.20 is considered to be useless (indeed, it’s often taken, erroneously, as evidence of no effect), and a p-value of 0.01 is considered to be strong evidence. But a p-value of 0.20 corresponds to a z-score of 1.28, and a p-value of 0.01 corresponds to a z-score of 2.58. The difference is 1.3, which is not close to statistically significant. (The difference between two independent estimates, each with standard error 1, has a standard error of sqrt(2); thus a difference in z-scores of 1.3 is actually less than 1 standard error away from zero!) So I fear that, by comparing 0.055 to 0.045, they are minimizing the main point of our paper.
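The arithmetic in the quote is easy to check. The following is a minimal Python sketch, assuming two-sided p-values and a standard normal reference distribution; the function name `z_from_two_sided_p` is my own and not from Gelman's post.

```python
from math import sqrt
from statistics import NormalDist


def z_from_two_sided_p(p: float) -> float:
    """Return the |z| whose two-sided standard normal tail probability is p."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


z_weak = z_from_two_sided_p(0.20)    # the "useless" p-value
z_strong = z_from_two_sided_p(0.01)  # the "strong evidence" p-value

# Difference between two independent estimates, each with standard error 1.
diff = z_strong - z_weak
se_diff = sqrt(1.0**2 + 1.0**2)
ratio = diff / se_diff  # the difference, in units of its own standard error

print(f"z(p=0.20) = {z_weak:.2f}")
print(f"z(p=0.01) = {z_strong:.2f}")
print(f"difference = {diff:.2f}, standard error = {se_diff:.2f}, ratio = {ratio:.2f}")
```

The ratio comes out below 1, which is the point of the quote: the gap between the two z-scores is less than one standard error of the difference.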

# Uncertainty in Kaplan-Meier curves

This post is indirectly related to things Bayesian.  One of the nodes in a Bayes net I have been working on is the tumor control probability (TCP) for oropharyngeal cancer.  Where do we get the TCP values?  One place is from Kaplan-Meier (KM) curves.  If you do a KM curve for each of several doses and then focus on a particular time point, the TCP values at that time can be obtained.  Now KM curves have a confidence interval which is +/- 1.96*sqrt(Variance).

Well, what is the variance?  According to the textbook (Greenwood's formula), it is the square of the survival probability at time t, multiplied by a sum over the prior failure times of the conditional failure odds at each time weighted by 1/number_at_risk.  In other words, it depends only on the numbers at risk and the numbers failing; it does not take into account the biological variability.
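To make the variance concrete, here is a minimal sketch of a KM estimate with a confidence interval. I am assuming the textbook formula is Greenwood's formula, Var[S(t)] = S(t)^2 * sum of d_i / (n_i * (n_i - d_i)) over failure times up to t, where d_i is the number of failures and n_i the number at risk. The function name `kaplan_meier` and the four-patient data set are hypothetical, for illustration only.

```python
from math import sqrt


def kaplan_meier(times, events):
    """Kaplan-Meier estimate with Greenwood variance.

    times:  follow-up time for each patient
    events: 1 if the patient failed at that time, 0 if censored
    Returns a list of (time, survival, variance, lower, upper) tuples,
    one per distinct failure time, with a +/- 1.96*sqrt(variance) interval.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    greenwood_sum = 0.0
    rows = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        leaving = 0
        # Group all patients who fail or are censored at the same time.
        while i < len(data) and data[i][0] == t:
            failures += data[i][1]
            leaving += 1
            i += 1
        if failures > 0:
            survival *= 1.0 - failures / n_at_risk
            if n_at_risk > failures:  # avoid dividing by zero when all fail
                greenwood_sum += failures / (n_at_risk * (n_at_risk - failures))
            variance = survival**2 * greenwood_sum
            half_width = 1.96 * sqrt(variance)
            rows.append((t, survival, variance,
                         max(0.0, survival - half_width),
                         min(1.0, survival + half_width)))
        n_at_risk -= leaving
    return rows


# Hypothetical example: four patients, all fail, no censoring.
rows = kaplan_meier([1, 2, 3, 4], [1, 1, 1, 1])
for t, s, var, lo, hi in rows:
    print(f"t={t}: S={s:.3f}, var={var:.4f}, CI=({lo:.3f}, {hi:.3f})")
```

Without censoring, the Greenwood variance reduces to the binomial S(1 - S)/n, so the only inputs are counts of patients at risk and patients failing. Nothing in the formula knows about stage or any other biological variable.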

Take the example of two experiments.  The first (the “naive”) experiment does not stratify patients by cancer stage, i.e. you draw from a patient pool that includes all patients with that type of cancer regardless of whether it is Stage I or IV.  Pick some number “n” of patients and perform the KM analysis.  Note that there is no measurement uncertainty: at any given time you know how many patients you have and how many fail.  The variance is a measure of dispersion based on the number at risk.  The second (“stratified”) experiment only chooses patients with one particular stage, e.g. II.  Choose the same number “n” of patients.  It is not unlikely that both experiments will give you the same KM curve, since the naive experiment is an average effect over all stages.  Now if you were to repeat these two experiments a number of times in order to measure the variance at different times, you should get different variances.  In the naive experiment, the distribution of stages within any selected sample will vary somewhat, with a resultant difference in survival.  In the stratified experiment, the population distribution will be narrower.  (If you don’t think that is true, then figure out how they defined “stage” to begin with.)  In this thought experiment, there are two very different variances based on the biology.  Now, since we are measuring a mean, which arguably is normally distributed, the variance of the mean survival is relatively small, but there should still be a difference that is mathematically, if not clinically, significant.

My conclusion is that either the KM confidence interval doesn’t contain the whole story or I don’t understand the statistics as well as I think I do.  In any case, it has helped sharpen my thinking about the confidence limits on TCP curves.

# When prior beliefs interfere…

I have written before (here) about the influence prior beliefs have on how people act.  Two recent events (one lecture, one book) have led me to further contemplate this.

Daniela Witten, a biostatistician at UW, gave a bioethics lecture today on the Duke saga (her words) regarding genomics research and related clinical trials, and the consequent scandal when it turned out that the genomics science was not correct.  Although statisticians led the discovery, they themselves did not use any Bayesian arguments–what follows are my own ramblings.  One of the points that I got from Witten’s talk was that the very persistent reluctance of authorities at Duke (and elsewhere) to realize that the published papers were wrong came, at least partly, from their belief that the work had to be correct.  There had been such hype about the potential of genomics to guide cancer therapies, and the PI was a good scientist, and good journals accepted the papers.  It took a lot of evidence before their prior probabilities were modified by data to the more correct posteriors.

This tale seems to me to be a good example of the type of thinking described by Jonathan Haidt in “The Righteous Mind: Why Good People Are Divided by Politics and Religion.”  He uses the metaphor of an elephant with a rider on its back.  The rider is the rational mind whereas the elephant is everything else about us.  More often than not, the rider doesn’t guide the elephant, but rather spends a lot of time justifying the direction the elephant is going in of its own volition.  To stretch the analogy, I think that part of the elephant’s momentum (in the vector sense) has to do with prior beliefs, without saying where they came from.  The rider is new data, and the inability of our rational mind to change the elephant’s course reflects the difficulty we have in overcoming prior beliefs.  One way of thinking of it is that the processes that guide our “elephant” multiply the importance or frequency of a few early data so that once that prior has been established, it takes an awful lot to influence it later.

# Cromwell’s rule

From Think Bayes: Bayesian Statistics Made Simple (Allen Downey, Green Tea Press):

“Also, notice that in a Bayesian update, we multiply each prior probability by a likelihood, so if p(H) is 0, p(H|D) is also 0, regardless of D. In the Euro problem, if you are convinced that x is less than 50%, and you assign probability 0 to all other hypotheses, no amount of data will convince you otherwise.

This observation is the basis of Cromwell’s rule, which is the recommendation that you should avoid giving a prior probability of 0 to any hypothesis that is even remotely possible (see http://en.wikipedia.org/wiki/Cromwell's_rule). Cromwell’s rule is named after Oliver Cromwell, who wrote, “I beseech you, in the bowels of Christ, think it possible that you may be mistaken.” For Bayesians, this turns out to be good advice (even if it’s a little overwrought).”
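The effect of a zero prior is easy to see numerically. The following is a minimal sketch rather than Downey's own code: the hypotheses are x = 0, 1, ..., 100 (the percent chance of heads), and I am assuming the Euro problem data of 140 heads and 110 tails in 250 spins. The names `bayes_update`, `open_prior`, and `dogmatic_prior` are my own.

```python
def bayes_update(prior, likelihood):
    """Multiply each prior probability by its likelihood, then normalize."""
    unnormalized = {h: p * likelihood(h) for h, p in prior.items()}
    total = sum(unnormalized.values())
    return {h: u / total for h, u in unnormalized.items()}


# Assumed data for the Euro problem: 140 heads and 110 tails.
HEADS, TAILS = 140, 110


def likelihood(x):
    """Probability of the data if the coin lands heads x percent of the time."""
    p = x / 100.0
    return p**HEADS * (1.0 - p) ** TAILS


hypotheses = range(101)

# Open-minded prior: every value of x gets some probability.
open_prior = {x: 1.0 / 101 for x in hypotheses}

# Dogmatic prior: convinced that x < 50, probability 0 for everything else.
dogmatic_prior = {x: (1.0 / 50 if x < 50 else 0.0) for x in hypotheses}

open_post = bayes_update(open_prior, likelihood)
dogmatic_post = bayes_update(dogmatic_prior, likelihood)

mass_above = sum(p for x, p in dogmatic_post.items() if x >= 50)
print("open prior, most probable x:    ", max(open_post, key=open_post.get))
print("dogmatic prior, most probable x:", max(dogmatic_post, key=dogmatic_post.get))
print("dogmatic posterior mass on x >= 50:", mass_above)
```

With the open prior the posterior peaks near the observed fraction of heads. With the dogmatic prior the posterior piles up against the edge of the allowed region, and the mass on x >= 50 stays at exactly zero no matter how many spins are added.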

# An interesting chain of connections

As a great example of the inter-relatedness of science and math, I would like to lay out a chain of connections I made the other day.  One cool point is that the last link in the chain is a memory of an experience I had when in graduate school (over 30 years ago).  It started when I was researching a new project that was discussed with some colleagues.  Each item in the following list indicates a different, but related, concept.

• How to compare the cost effectiveness of the treatment of lung cancer with x-rays or protons.
• Markov models are one approach.
• In reading a little more about Markov models, I found an intriguing reference to a connection between Markov models and Monte Carlo methods.
• A quick Google search on the above-mentioned concepts kept pointing me to Markov Chain Monte Carlo (MCMC).
• Having seen the topic of MCMC in books and articles, but never having had the time or interest to follow up and understand it, I looked for articles on this.
• Found a great article that caught my attention in the first page or two with some anecdotal history of Monte Carlo methods in physics: “The Evolution of Markov Chain Monte Carlo Methods”, Matthew Richey, The American Mathematical Monthly, Vol. 117, No. 5 (May 2010), pp. 383-413.
• The article explains the Metropolis algorithm, which was first used in nuclear physics, but quickly applied to topics like Ising glasses.  The Metropolis algorithm was a way of searching a very large state space very efficiently by sampling the most likely states most often.  It used the energy of an ensemble of particles in thermodynamics.
• From there, the Metropolis algorithm was found to be useful in problems of combinatorial optimization, where, again, many states had to be searched to find the optimal solution.
• One of the consequences was the development of the simulated annealing algorithm.  This is one personal connection since I remember very well the day I sat in the library at PSI (Switzerland) and read Steve Webb’s paper on simulated annealing for IMRT.  This was pretty early in the development of this algorithm and I give Steve a lot of credit for understanding and applying it so well and so quickly.
• One of the classic combinatorial optimization problems is the traveling salesman problem.
• At the time that this algorithm was being applied to optimization, solid state circuits were starting their incredible rise in number of components and density.  A big problem was how best to add new ones. There is a cost associated with the distance between related components so the traveling salesman problem was relevant.
• A physicist who had studied spin glasses–Scott Kirkpatrick–went to work for IBM and started working on the circuit board problem.  He recognized that the objectives that needed to be met were very similar to the equation for the energy of spin glasses, the configuration of which was being solved using the Metropolis algorithm.
• Final step: I have always remembered–and often cited as an example of cross-field intellectual fertilization–a talk I heard as a graduate student at the University of Wisconsin in which a physicist described the exact scenario I just recounted above.
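
The Metropolis acceptance rule and its use in simulated annealing, both mentioned in the list above, can be sketched on a toy traveling salesman problem. This is a minimal illustration under my own assumptions (random city positions, a segment-reversal move, a geometric cooling schedule, and the function names `tour_length` and `anneal`). It is not the method of Kirkpatrick, Webb, or the Richey article.

```python
import math
import random


def tour_length(points, order):
    """Total length of the closed tour that visits points in the given order."""
    n = len(order)
    return sum(math.dist(points[order[i]], points[order[(i + 1) % n]])
               for i in range(n))


def anneal(points, steps=5000, t_start=1.0, t_end=1e-3, seed=0):
    """Simulated annealing: the tour length plays the role of the energy."""
    rng = random.Random(seed)
    n = len(points)
    order = list(range(n))
    current = tour_length(points, order)
    best_order, best = order[:], current
    for step in range(steps):
        # Geometric cooling from t_start toward t_end (an assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / steps)
        # Propose a new state by reversing a randomly chosen segment.
        i, j = sorted(rng.sample(range(n), 2))
        candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
        delta = tour_length(points, candidate) - current
        # Metropolis rule: always accept a downhill move, and accept an
        # uphill move with probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            order, current = candidate, current + delta
            if current < best:
                best_order, best = order[:], current
    return best_order, best


# Hypothetical cities: 12 random points in the unit square.
city_rng = random.Random(1)
points = [(city_rng.random(), city_rng.random()) for _ in range(12)]
initial = tour_length(points, list(range(len(points))))
best_order, best = anneal(points)
print(f"initial tour length: {initial:.3f}")
print(f"best tour found:     {best:.3f}")
```

At high temperature the search accepts many uphill moves and wanders widely; as the temperature drops it behaves more and more like a greedy descent. That is the sense in which the algorithm samples the most likely (lowest energy) states most often.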

So–loop closed.  I am glad (a) to find out who it was, (b) that the point I always took away from it was correct, and (c) to learn more about the actual problem.

P.S. The Bayesian connection of this story will be told in a forthcoming post.

# Models versus data

From the geographer Strabo:

“And whenever we have not been able to learn by the evidence of sense, there reason points the way.”

He was speaking about knowing the edges of the inhabitable world, but it fits just as well for whichever world we are interested in.

# Virtual trials

Physicists are fond of conducting “virtual trials” by which is meant that they select a number of random or representative cases, compute treatment plans for them using two different methods and then compare the results.  Usually these are done to show the differences (or lack thereof) between two different methods of radiation delivery, or sometimes, of optimization.

In general, this is a reasonable and cost effective means of coming to some conclusion about the appropriate uses of new technology. However, as they are most often conducted, these trials do little to answer any relevant questions.  In general, they meet few, if any, of the criteria for a clinical trial.  Instead, it seems as though physicists have defined their own standards for a virtual trial.  What are these standards? How do they compare with the norms in clinical medicine?

Clinical trials are grouped into four phases, ranging from determination of the intervention’s safety, to its efficacy in a controlled group, to its efficacy in the population at large.  Do our physics-oriented virtual trials fall into any of these categories?  At one end of the spectrum, physicists are concerned with safety, namely a Phase I trial.  They wish to avoid initiating a technology or procedure that will lead to patient harm.  At the other end, one could argue (though physicists never do) that they are also conducting Phase IV-like trials since the cases are selected with little regard for the biological and physiological variables that can mediate the response to the intervention.  Most often, cases are selected because they are dosimetrically “interesting” or, on the other hand, “tractable”.  The latter characteristic underpins the continued popularity of virtual trials of prostate cancer with its two significant organs-at-risk.  Once those cases are dealt with and the technology has been shown to handle the simple situations, then interesting cases are selected based on their dosimetric complexity.

Does this way of viewing the issue lead to any worthwhile considerations?  In the sense that clinical trials are now the gold standard for progress in medical practice, the answer is yes.  If we, as physicists, wish to lead the field forward by definitively answering questions, then we need to meet the same standards as others in the field.  So what are the characteristics of clinical trials that translate directly to a medical physics approach?

First, the endpoint of the trial must be described and justified at the beginning.  Too often, physicists merely pile up metrics at the end of the project, calculate statistical significance of differences, and then make some pronouncement based thereon.  Clinical trials do not have the luxury of waiting until the end of the trial to define their endpoints for several reasons, chief among them the ethics of human research.  Physicists are free of that limitation, but then suffer the possibility of being accused of cherry-picking the results.  More importantly, however, the failure to declare and justify the endpoints at the beginning vitiates the impact of the results since others are less likely to be convinced by this method of conducting the trial.  Providing a convincing rationale at the beginning of the work puts the results on a firm footing and helps structure the entire virtual trial.

Elucidation of a clear set of metrics by which to judge the trial’s efficacy must be done in conjunction with the relevant clinicians.  It is sometimes the case in current comparisons that dose metrics are tested for statistical significance, with the somewhat absurd result that dose differences of less than 1 Gy are reported as significant.  Statistically, maybe (although it can certainly be argued that any set of cases in which dosimetric parameters are so close in value, yet statistically significantly different, exhibits a homogeneity that hardly reflects clinical practice); clinically, no. [e.g. DC Weber, et al, Int J Radiat Oncol Biol Phys, 75(5): 1578-86, 2009]  It is important to determine up front what is going to be conclusive evidence of improvement for the application being studied.  It is at this point that determination of the phase of the trial is important.  Evaluating safety is likely to result in a different set of trial metrics than would be used in a Phase III trial.
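
The gap between statistical and clinical significance can be illustrated with a paired comparison. The sketch below uses entirely hypothetical numbers: twenty made-up paired dose differences (in Gy) between two planning methods, and an assumed clinical relevance threshold of 1 Gy that would in practice be agreed with the clinicians beforehand. It computes a paired t statistic and compares it with the two-sided 5% critical value for 19 degrees of freedom (2.093).

```python
from math import sqrt
from statistics import mean, stdev

# HYPOTHETICAL paired differences (method A minus method B) in Gy for
# 20 cases. These are invented for illustration and are not trial data.
dose_diff_gy = [0.4, 0.6, 0.5, 0.3, 0.7, 0.5, 0.4, 0.6, 0.5, 0.5,
                0.6, 0.4, 0.5, 0.7, 0.3, 0.5, 0.6, 0.4, 0.5, 0.5]

# Assumed threshold for clinical relevance, declared before the trial.
CLINICAL_THRESHOLD_GY = 1.0

# Two-sided 5% critical value of Student's t with 19 degrees of freedom.
T_CRITICAL_DF19 = 2.093

n = len(dose_diff_gy)
mean_diff = mean(dose_diff_gy)
std_error = stdev(dose_diff_gy) / sqrt(n)
t_stat = mean_diff / std_error

statistically_significant = abs(t_stat) > T_CRITICAL_DF19
clinically_relevant = abs(mean_diff) >= CLINICAL_THRESHOLD_GY

print(f"mean difference = {mean_diff:.2f} Gy, t = {t_stat:.1f}")
print("statistically significant:", statistically_significant)
print("clinically relevant:      ", clinically_relevant)
```

Because the made-up cases are so homogeneous, the standard error is tiny and the t statistic is enormous, yet the mean difference is only half of the assumed threshold. The test answers "is the difference nonzero?", not "does the difference matter?".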

Rigor in methods is also an important component of a virtual trial.  Given the complexities of modern treatment plans, optimization algorithms are often used in virtual trials.  However, the algorithms in the current generation of treatment planning software are very operator dependent.  In other cases, such as comparisons of protons and x-rays, different planning systems and dose calculation algorithms must be used.  Great care must be taken in designing methods that provide a fair comparison for plans.  In the case of optimization algorithms, user options must be constrained.  When different planning systems are used, some effort at judging their relative differences (outside the parameters of the trial) must be made.

There is an additional burden that the use of optimization places on virtual trials that is not usually a part of clinical trials.  That is, clinical trials do not usually look at normal tissue outcomes (complications) in conjunction with tumor response.  In some cases, there may be reason to believe that tumor response is related to or coupled with normal tissue response, which would justify reporting the correlation, but this is rarely done.  In inverse planning, the algorithm searches for some ideal solution and, when it cannot find one that meets all the objectives, finds a plan that incorporates a trade-off between the competing (tumor vs normal tissue) objectives.  For this reason, it is imperative that virtual trials incorporating inverse planning report results for each individual, not just aggregate measures such as averages of single metrics.

Finally, to make the connection between clinical trials and virtual trials, it is interesting to consider Phase III and IV trials.  One definition is: “Phase IV studies are conducted after the intervention has been marketed. These studies are designed to monitor effectiveness of the approved intervention in the general population and to collect information about any adverse effects associated with widespread use.” [Gates Foundation]  If we replace the word “marketed” with the words “clinically implemented”, then we have a good description of the introduction of new technologies and methods into clinical use including the performance of a virtual trial at the beginning of the process.  To those who argue that such trials are likely to not achieve statistical significance because of a lack of sufficient numbers of patients, one may ask whether there is justification in spending the money to purchase the new technology.

For those institutions that conduct virtual trials and, based (at least partly) on the results, take the next step of clinical use, it would be very worthwhile to collect data and report back on the correspondence between the trial and the clinical outcomes.  In many cases, e.g. IMRT and VMAT, the differences are so small that it is certainly not unethical to randomize patients between the two and measure the differences (if any) in outcomes, thereby conducting a Phase III trial.  For those who are convinced that using the old technology is not justified, reporting on the outcomes and comparing them to the historical results and the conclusions of the virtual trial would certainly be of great value.

In conclusion, it behooves the medical physics community to meet the standards that we and society in general (particularly given the Patient Protection and Affordable Care Act) expect of medical research.  These changes will enhance the usefulness of medical physics research, provide comfort to the public knowing that careful measures are being taken to ensure the safe and efficacious introduction of new therapies, and hopefully also lead to the more rational use of our health care dollars.