Why can’t we replicate studies?

Interesting article in the NY Times today by Sally Satel (here) lamenting the lack of reproducibility in psychology experiments. She starts by talking in particular about the concept of “goal priming,” which has gotten a lot of attention. However, she expands her discussion to the lack of reproducibility in science (or at least health science) in general. For example,

“To be sure, a failure to replicate is not confined to psychology, as the Stanford biostatistician John P. A. Ioannidis documented in his much-discussed 2005 article “Why Most Published Research Findings Are False.” The cancer researchers C. Glenn Begley and Lee M. Ellis could replicate the findings of only 6 of 53 seminal publications from reputable oncology labs.”

Which, of course, leads me to wonder about models built on studies. If we look at QUANTEC again, there are many studies that give widely divergent results. Clearly, the frequentist view treats a clinical study as one realization out of the many possible trials, an assumption inherent in the very concept of statistical significance. I think it is pretty clear that the lack of reproducibility doesn’t reflect poor scientific technique but rather indicates that we know so little about the relevant variables that we are unable to repeat the study. So if the model is built under the belief that the study is reproducible (in fact, that belief underlies all the analysis justifying the study), what happens when that belief is proven false? The Bayesian approach would say that there was some prior probability that all the variables were accounted for, but under the evidence, that prior belief needs to be reassessed. If we look at Campbell-Ricketts’ blog entry about theology, a similar argument might apply here: the domain of relevant variables is essentially infinite for such a trial, rendering the probability of its repeatability very nearly zero.


Model testing from a Bayesian perspective

There is an excellent post by Tom Campbell-Ricketts in response to a paper by Gelman and Shalizi (“Philosophy and the practice of Bayesian statistics“). He talks about some of their statements regarding the testing of models and whether it needs to be deductive or inductive. I particularly like his parting comments about Popper and his ostensible views of science.

As a side note, the journal in which this paper is published, “British Journal of Mathematical and Statistical Psychology,” makes me wonder if perhaps we have gone a bit too far off the analytic deep end when it comes to some of these disciplines involving human beings. I am reminded of the QUANTEC effort to use statistical methods to model complications from radiation therapy. In general, the effort is hampered by the paucity of consistent data, so the models tend not to be useful, and I have the (unproven) belief that physicians can basically guess the results of these models. The next question that comes to mind is whether even an accurate model, at the usual low levels of complications, would change any treatment decisions. Much of this radiation therapy modeling seems to be based on a need to come up with a mathematical ranking of different actions, e.g. treatment plans, more than on any clinical need. In other words, physicians, given their current levels of knowledge of the important variables and their effects on clinical outcomes, are willing to accept plans that are far from mathematically optimal.

Utility theory applied to moral decisions (a slight digression)

A continuing interest of mine has been the use of Bayesian networks, which, when coupled with utilities, allow one to calculate expected utilities. As any good decision theorist knows, classical decision theory (von Neumann and Morgenstern) rests on finding the maximum expected utility (MEU). One of the standard ways of measuring utilities, albeit excluding one’s attitude toward risk, is the time-tradeoff method. In essence, one is asked a series of questions about a given situation that elicit the respondent’s values regarding the outcomes of that situation. In medicine, a classic time-tradeoff study asks how much life you are willing to give up to avoid a bad outcome. For example, would you rather live the rest of your life after prostate cancer therapy (say, 20 years) being impotent, or would you rather give up 2 years (and live 18 years) but be normally potent? What if the choice were living 2 years of potency, then death? Typically the questioner ping-pongs back and forth in time until you conclude that the two states are indistinguishable, e.g. 20 years of impotence is the same as 14 years of potency. In classic utility theory, your utility for the state of impotence is then (14/20) = 0.7.
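The elicitation above reduces to a simple ratio once the indifference point is found. Here is a minimal sketch (my own, not from any cited source; the function name is hypothetical) of the calculation:

```python
# Hypothetical sketch of the time-tradeoff utility described above.
# The "ping-pong" questioning narrows in on the indifference point:
# the number of full-health years judged equivalent to `full_years`
# spent in the impaired state.

def time_tradeoff_utility(full_years: float, indifference_years: float) -> float:
    """Classic time-tradeoff utility: the indifference point expressed
    as a fraction of the full time horizon (full health = 1.0)."""
    if not 0 <= indifference_years <= full_years:
        raise ValueError("indifference point must lie in [0, full_years]")
    return indifference_years / full_years

# The worked example from the text: 20 years impotent ~ 14 years potent.
u = time_tradeoff_utility(full_years=20, indifference_years=14)
print(u)  # 0.7
```

The same ratio applies to the Boss Tweed question below: it is just a tradeoff between good years and bad years rather than between full health and early death.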

In reading a short history of Boss Tweed of New York, in which he lived a riotously good life for many years but spent  the  last few years in ignominy, eventually dying alone in a bleak cell, I got to thinking of how many years of riotous good living I would demand in order to balance the humiliation, loneliness and misery of some number of years after being caught.  In analogy to the above example, the tradeoff is not between full health and an early death, but rather, the ratio of good years to bad years.

Can one construct a scale of morality based on such considerations?

“What Nate Silver gets wrong”, article in New Yorker (25 Jan 2013)

This is an interesting discussion of Nate Silver’s book, “The Signal and the Noise”. While generally in agreement with Silver, the authors, Gary Marcus and Ernest Davis, take issue with what they see as Silver’s emphasis (insistence?) on Bayesian inference. They are a bit satirical in noting that this is based on a theorem that has been around a while (several centuries). They seem particularly miffed at the perceived slight of Fisherian statistics, which, as they correctly note, is what is commonly used in scientific hypothesis testing.

Without having read “The Signal and the Noise”, I hesitate to delve too deeply into this, but it is pretty clear that Marcus and Davis are missing two boats. The first has to do with the fact that there has been a pretty stark divide between the frequentists (Fisher statistics) and Bayesians. It has only been in the last few decades that classic academic Fisherians have acknowledged the Bayesian approach, and I believe that Silver, in his book, is reacting to that rift. The rift is most clearly seen when it comes to applying probability to most real-life situations, which is what Silver emphasizes, and for which he is rapidly becoming famous.

This leads to their second missed boat. The problem that frequentists face is exactly with the situations that Marcus and Davis highlight as the weakness of Bayesian statistics, namely the case of sparse data. And while sparse data is certainly a problem for Bayesians, it is an even greater problem for frequentists. Because they define probabilities strictly in an enumerative sense, a situation with little or no prior data leaves them with no probabilities, which is all right if you don’t actually have to do anything. But in real life, lack of data is not a good excuse. Your widget may not have been used often enough to fail, but that doesn’t mean you don’t have to make contingency plans for when it does (even though a strict frequentist would have to say the observed probability of failure is zero).

Bayesians, on the other hand, view probabilities as degrees of belief (which may or may not be bolstered by counting). Thus, they can operate in situations where there is little actual data. And each time they do get data, Bayes’ theorem provides a way to update that probability.
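To make the widget example concrete, here is a small sketch (my own illustration, not anything from Silver or the reviewers) using the standard Beta-Bernoulli conjugate update. The prior encodes a degree of belief before any failure has ever been observed, and each round of data revises it via Bayes’ theorem:

```python
# A minimal Bayesian sketch of the widget-failure example, using the
# standard Beta-Bernoulli conjugate model. The prior expresses a degree
# of belief about the failure probability even with zero observations.

def beta_update(alpha: float, beta: float, failures: int, successes: int):
    """Conjugate Bayes update: each failure adds to alpha, each
    non-failure adds to beta."""
    return alpha + failures, beta + successes

def beta_mean(alpha: float, beta: float) -> float:
    """Expected failure probability under a Beta(alpha, beta) belief."""
    return alpha / (alpha + beta)

# Uniform prior Beta(1, 1): no data yet, but the failure probability is
# not declared to be zero -- its expected value is 0.5, very uncertain.
a, b = 1.0, 1.0
print(beta_mean(a, b))  # 0.5

# After 50 uses with no failures, the belief shifts toward reliability,
# but it never reaches exactly zero -- so contingency planning survives.
a, b = beta_update(a, b, failures=0, successes=50)
print(beta_mean(a, b))  # 1/52, about 0.019
```

The contrast with the enumerative view is the last line: fifty failure-free trials lower the believed failure rate, but a strict counting definition would have reported zero all along.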

In scientific modeling, Fisherians assume a model and see whether the data fit it. Bayesians take the data and use it to find the model that best fits it. Fisher’s tests are all about providing degrees of confidence as to whether two models are actually different given the data.

So, in reading this article, it seems to me that the authors are missing a major point about the two approaches; but as I said, since I haven’t read Silver’s book, I can’t comment on whether or not he did a poor job of describing it.