D. G. Mayo's slides for her presentation given March 17, 2017 at the Boston Colloquium for Philosophy of Science, Alfred I. Taub forum: "Understanding Reproducibility & Error Correction in Science"
1. Severe Testing: The Key to Error Correction
Deborah G. Mayo
Virginia Tech
March 17, 2017
"Understanding Reproducibility and Error Correction in Science"
2. Statistical Crisis of Replication
O Statistical "findings" disappear when others look for them
O Beyond the social sciences to genomics, bioinformatics, and medicine (Big Data)
O Methodological reforms (some welcome, others radical)
O Need to understand philosophical, statistical, historical issues
3. American Statistical Association (ASA): Statement on P-values
"The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. …. much confusion and even doubt about the validity of science is arising. Such doubt can lead to radical choices such as…to ban p-values" (ASA, Wasserstein & Lazar 2016, p. 129)
4. I was a philosophical observer at the ASA P-value "pow wow"
5. "Don't throw out the error control baby with the bad statistics bathwater"
The American Statistician
6. O The most used methods are the most criticized
O Statistical significance tests are a small part of a rich set of "techniques for systematically appraising and bounding the probabilities … of seriously misleading interpretations of data" (Birnbaum 1970, p. 1033)
O These I call error statistical methods (or sampling theory)
7. Error Statistics
O Statistics: collection, modeling, drawing inferences from data to claims about aspects of processes
O The inference may be in error
O It's qualified by a claim about the method's capabilities to control and alert us to erroneous interpretations (error probabilities)
8. "p-value. …to test the conformity of the particular data under analysis with H0 in some respect:
…we find a function T = t(y) of the data, to be called the test statistic, such that
• the larger the value of T the more inconsistent are the data with H0;
• the random variable T = t(Y) has a (numerically) known probability distribution when H0 is true.
…the p-value corresponding to any t_obs is
p = p(t_obs) = Pr(T ≥ t_obs; H0)"
(Mayo and Cox 2006, p. 81)
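As a minimal sketch of this definition, the following assumes a one-sided Normal test with σ known; the function name and the numbers are illustrative, not from the slides.

```python
# Sketch: the p-value definition above for a one-sided Normal test.
# Assumes X_1,...,X_n ~ N(mu, sigma^2) with sigma known, H0: mu = mu0,
# and test statistic T = sqrt(n)*(mean - mu0)/sigma, so that larger T
# means data more inconsistent with H0, and T ~ N(0,1) under H0.
from math import sqrt
from statistics import NormalDist

def p_value(xbar, mu0, sigma, n):
    """Pr(T >= t_obs; H0), with T standard Normal under H0."""
    t_obs = sqrt(n) * (xbar - mu0) / sigma
    return 1 - NormalDist().cdf(t_obs)

# Example: n = 100, sigma = 1, observed mean 0.2, H0: mu = 0
print(round(p_value(0.2, 0.0, 1.0, 100), 4))  # ~0.0228
```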
9. Testing Reasoning
O If even larger differences than t_obs occur fairly frequently under H0 (i.e., the P-value is not small), there's scarcely evidence of incompatibility with H0
O A small P-value indicates some underlying discrepancy from H0, because very probably you would have seen a less impressive difference than t_obs were H0 true
O This indication isn't evidence of a genuine statistical effect H, let alone a scientific conclusion H*
Stat-Sub fallacy: H => H*
10. O I'm not keen to defend many long-lampooned uses of significance tests
O I introduce a reformulation of tests in terms of discrepancies (effect sizes) that are and are not severely tested
O The criticisms are often based on misunderstandings; consequently, so are many "reforms"
11. Replication Paradox (for Significance Test Critics)
Critic: It's much too easy to get a small P-value
You: Why do they find it so difficult to replicate the small P-values others found?
Is it easy or is it hard?
12. Only 36 of 100 psychology experiments yielded small P-values in the Open Science Collaboration on replication in psychology
OSC: Reproducibility Project: Psychology, 2011-15 (Science 2015): crowd-sourced effort to replicate 100 articles (led by Brian Nosek, U. VA)
13. O R. A. Fisher: it's easy to lie with statistics by selective reporting; that's not the test's fault
O Sufficient finagling (cherry-picking, P-hacking, significance seeking, multiple testing, look-elsewhere effects) may practically guarantee a preferred claim H gets support, even if it's unwarranted by the evidence (biasing selection effects; the need to adjust P-values)
Note: support for some preferred claim H comes by rejecting a null hypothesis
O H hasn't passed a severe test
14. Severity Requirement:
If the test procedure had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H ("too cheap to be worth having", Popper)
O Such a test fails a minimal requirement for a stringent or severe test
O My account: severe testing based on error statistics (requires reinterpreting tests)
15. This alters the role of probability: typically just 2
O Probabilism. To assign a degree of probability, confirmation, support, or belief in a hypothesis, given data x0 (e.g., Bayesian, likelihoodist), with regard for inner coherency
O Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson)
16. What happened to using probability to assess error-probing capacity, as the severity requirement demands?
O Neither "probabilism" nor "performance" directly captures it
O Good long-run performance is a necessary, not a sufficient, condition for severity
17. O Problems with selective reporting, cherry picking, stopping when the data look good, and P-hacking are not problems about long runs
O It's that we cannot say the case at hand has done a good job of avoiding the sources of misinterpreting data
Key to revising the role of error probabilities
18. A claim C is not warranted _______
O Probabilism: unless C is true or probable (gets a probability boost, is made comparatively firmer)
O Performance: unless it stems from a method with low long-run error
O Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C
19. O If you assume probabilism is required for inference, error probabilities are relevant for inference only by misinterpretation. False!
O I claim error probabilities play a crucial role in appraising well-testedness
O It's crucial to be able to say that C is highly believable or plausible but poorly tested
20. Biasing selection effects:
O One function of severity is to identify problematic selection effects (not all are problematic)
O Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified) in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed
O Picking up on these alterations is precisely what enables error statistics to be self-correcting
21. Nominal vs. Actual Significance Levels
"Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be 'significant at the 5 percent level.' …. The actual level of significance is not 5 percent, but 64 percent!" (Selvin 1970, p. 104)
From Morrison & Henkel's The Significance Test Controversy (1970!); the arithmetic is sketched below
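Selvin's 64 percent can be reproduced directly, on the assumption that the twenty tests are independent:

```python
# Sketch: the "actual" significance level when you hunt through 20
# independent differences and report the one that reaches the nominal
# .05 level (independence is an assumption made for illustration).
alpha, k = 0.05, 20
actual = 1 - (1 - alpha) ** k  # Pr(at least one "significant"; all H0 true)
print(f"nominal: {alpha}, actual: {actual:.2f}")  # actual: 0.64
```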
22. O Morrison and Henkel were clear on the fallacy: blurring the "computed" or "nominal" significance level with the "actual" level
O There are many more ways you can be wrong with hunting (a different sample space)
23. Spurious P-Value
You report: such results would be difficult to achieve under the assumption of H0
When in fact such results are common under the assumption of H0
(Formally):
O You say Pr(P-value ≤ p_obs; H0) ≈ p_obs, small
O But in fact Pr(P-value ≤ p_obs; H0) = high
24. Scapegoating
O Nowadays, we're likely to see the tests blamed
O My view: tests don't kill inferences, people do
O Even worse are those statistical accounts where the abuse vanishes!
25. On some views, taking account of biasing selection effects "defies scientific sense"
"Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of 'objectivity' that is often made for the P-value" (Goodman 1999, p. 1010)
(To his credit, he's open about this; he heads the Meta-Research Innovation Center at Stanford)
26. Technical activism isn't free of philosophy
Ben Goldacre (of Bad Science), in a 2016 Nature article, is puzzled that bad statistical practices continue even in the face of the new "technical activism":
The editors at Annals of Internal Medicine… repeatedly (but confusedly) argue that it is acceptable to identify "prespecified outcomes" [from results] produced after a trial began; …they say that their expertise allows them to permit, and even solicit, undeclared outcome-switching
27. His paper: "Make journals report clinical trials properly"
O He shouldn't close his eyes to the possibility that some of the pushback he's seeing has a basis in statistical philosophy!
28. Likelihood Principle (LP)
The vanishing act links to a pivotal disagreement in the philosophy of statistics battles
In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses:
P(x0; H1)/P(x0; H0)
The data x0 are fixed, while the hypotheses vary
29. All error probabilities violate the LP (even without selection effects):
"Sampling distributions, significance levels, power, all depend on something more [than the likelihood function], something that is irrelevant in Bayesian inference, namely the sample space" (Lindley 1971, p. 436)
"The LP implies… the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects" (Rosenkrantz 1977, p. 122)
30. Paradox of Optional Stopping:
Error-probing capacities are altered not just by cherry picking and data dredging, but also via data-dependent stopping rules:
Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0.
Instead of fixing the sample size n in advance, in some tests n is determined by a stopping rule:
31. "Trying and trying again"
O Keep sampling until H0 is rejected at the 0.05 level, i.e., keep sampling until |M| ≥ 1.96 σ/√n
O Trying and trying again: having failed to rack up a 1.96σ difference after 10 trials, go to 20, 30, and so on until obtaining a 1.96σ difference
32. Nominal vs. Actual Significance Levels Again:
O With n fixed, the Type 1 error probability is 0.05
O With this stopping rule, the actual significance level differs from, and will be greater than, 0.05 (see the simulation sketch below)
O Violates Cox and Hinkley's (1974) "weak repeated sampling principle"
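A rough simulation of the stopping rule, assuming H0 is true and checking after each observation from n = 10 onward (all parameters illustrative), shows the inflation:

```python
# Sketch: simulating "try and try again". Draw N(0,1) observations
# (so H0: mu = 0 is true) and stop as soon as |mean| >= 1.96/sqrt(n),
# up to n_max. The rejection rate far exceeds the nominal 0.05.
import random
from math import sqrt

def rejects(n_max, rng):
    total, n = 0.0, 0
    while n < n_max:
        total += rng.gauss(0, 1)
        n += 1
        if n >= 10 and abs(total / n) >= 1.96 / sqrt(n):
            return True  # "significant" at nominal .05
    return False

rng = random.Random(1)
trials = 2000
rate = sum(rejects(1000, rng) for _ in range(trials)) / trials
print(f"actual Type 1 error rate ~ {rate:.2f} (nominal 0.05)")
```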
33. O The ASA (p. 131) correctly warns that "[c]onducting multiple analyses of the data and reporting only those with certain p-values" leads to spurious p-values (Principle 4)
O They don't mention that the same p-hacked hypothesis can occur in Bayes factors, credibility intervals, likelihood ratios
34. With One Big Difference:
O The direct grounds to criticize inferences as flouting error-statistical control is lost
O They condition on the actual data
O Error probabilities take into account other outcomes that could have occurred but did not (the sampling distribution)
35. How might probabilists block intuitively unwarranted inferences (without error probabilities)?
A subjective Bayesian might say: if our beliefs were mixed into the interpretation of the evidence, we wouldn't declare there's statistical evidence of some unbelievable claim (distinguishing shades of grey and being politically moderate; ovulation and voting preferences)
36. Rescued by beliefs
O That could work in some cases (it still wouldn't show what researchers had done wrong): a battle of beliefs
O Besides, researchers sincerely believe their hypotheses
O So now you've got two sources of flexibility: priors and biasing selection effects
37. No help with our most important problem
O How to distinguish the warrant for a single hypothesis H arrived at with different methods (e.g., one has biasing selection effects; another, pre-registered results and precautions)?
38. Most Bayesians use "default" priors
O Eliciting subjective priors is too difficult, and scientists are reluctant to allow subjective beliefs to overshadow data
O Default, or reference, priors are supposed to prevent prior beliefs from influencing the posteriors (O-Bayesians, 2006)
39. O "The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Default priors may not even be probabilities…" (Cox and Mayo 2010, p. 299)
O Default Bayesian reforms are touted as free of selection effects
O "…Bayes factors can be used in the complete absence of a sampling plan…" (Bayarri, Benjamin, Berger, Sellke 2016, p. 100)
40. Granted, some are prepared to abandon the LP for model testing
In an attempted meeting of the minds, Andrew Gelman and Cosma Shalizi say:
O "[C]rucial parts of Bayesian data analysis, such as model checking, can be understood as 'error probes' in Mayo's sense … which might be seen as using modern statistics to implement the Popperian criteria of severe tests." (2013, p. 10)
O An open question
41. The ASA doc highlights classic foibles that block replication
"In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result" (Fisher 1935, p. 14)
"isolated" low P-value → H: statistical effect
42. Statistical → substantive (H → H*)
"[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter... requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions" (Gigerenzer et al. 1989, pp. 95-6)
43. The Problem is with so-called NHST ("null hypothesis significance testing")
O NHSTs supposedly allow moving from statistical to substantive hypotheses
O If defined that way, they exist only as abuses of tests
O The ASA doc ignores Neyman-Pearson (N-P) tests
44. Neyman-Pearson (N-P) tests:
A null and an alternative hypothesis, H0 and H1, that are exhaustive:
H0: μ ≤ 12 vs. H1: μ > 12
O So this fallacy of rejection, H → H*, is impossible
O Rejecting the null only indicates statistical alternatives (how discrepant from the null)
45. P-values Don't Report Effect Sizes (Principle 5)
Who ever said to just report a P-value?
O "Tests should be accompanied by interpretive tools that avoid the fallacies of rejection and non-rejection. These correctives can be articulated in either Fisherian or Neyman-Pearson terms" (Mayo and Cox 2006, Mayo and Spanos 2006)
46. To Avoid Inferring a Discrepancy Beyond What's Warranted: the large-n problem
O Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)
47. O What's more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or one so insensitive that it doesn't go off unless the house is fully ablaze?
O [The larger sample size is like the alarm that goes off with burnt toast; a numeric sketch follows]
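To put the large-n point in numbers (an illustrative computation, not from the slides): with σ = 1, the observed mean that just reaches the 1.96 cutoff shrinks as n grows, so the same significance level licenses a smaller discrepancy.

```python
# Sketch: a just-significant result at the 1.96 cutoff corresponds to
# a much smaller observed discrepancy from mu0 = 0 when n is large
# (sigma = 1; values illustrative).
from math import sqrt

for n in (100, 10_000):
    m_cutoff = 1.96 * 1.0 / sqrt(n)  # observed mean just significant
    print(f"n = {n:>6}: just-significant mean = {m_cutoff:.4f}")
# n = 100 needs M ~ 0.196; n = 10,000 only M ~ 0.0196
```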
48. What About Fallacies of Non-Significant Results?
O They don't warrant a 0 discrepancy
O Use the same severity reasoning to rule out discrepancies that very probably would have resulted in a larger difference than observed: set upper bounds
O If you very probably would have observed a more impressive (smaller) p-value than you did, if μ > μ1 (μ1 = μ0 + γ), then the data are good evidence that μ < μ1
O Akin to power analysis (Cohen, Neyman) but sensitive to x0; a sketch follows
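A minimal sketch of this severity computation for the one-sided test T+ (σ known; the numbers are illustrative assumptions, not from the slides):

```python
# Sketch: severity for ruling out discrepancies after a non-significant
# result in T+ (H0: mu <= mu0 vs. H1: mu > mu0, sigma known).
# SEV(mu < mu1) = Pr(d(X) > d_obs; mu = mu1): the probability of a
# larger difference than observed, were the discrepancy really mu1.
from math import sqrt
from statistics import NormalDist

def severity_upper(xbar, mu1, sigma, n):
    """Severity for the claim mu < mu1, given observed mean xbar."""
    return NormalDist().cdf(sqrt(n) * (mu1 - xbar) / sigma)

# mu0 = 0, sigma = 1, n = 100, observed mean 0.1 (not significant)
for mu1 in (0.15, 0.2, 0.3):
    print(f"SEV(mu < {mu1}) = {severity_upper(0.1, mu1, 1.0, 100):.3f}")
# mu < 0.3 is severely indicated (~0.977); mu < 0.15 is not (~0.691)
```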
49. O There's another kind of fallacy behind a move that's supposed to improve replication, but it confuses notions from significance testing and leads to "most findings are false"
O Fake replication crisis.
50. Diagnostic Screening Model of Tests: urn of nulls ("most findings are false")
O Imagine randomly selecting a hypothesis from an urn of nulls, 90% of which are true
O Consider just 2 possibilities: H0: no effect, H1: meaningful effect; all else ignored
O Take the prevalence of 90% as Pr(the H0 you picked) = .9, Pr(H1) = .1
O Reject H0 with a single (just) .05-significant result, cherry-picking to boot
51. The unsurprising result is that most "findings" are false: Pr(H0 | findings with a P-value of .05) > .5
But Pr(H0 | findings with a P-value of .05) ≠ Pr(P-value of .05 | H0)
Only the second is a Type 1 error probability (the arithmetic is sketched below)
Major source of confusion…. (Berger and Sellke 1987, Ioannidis 2005, Colquhoun 2014)
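The urn-of-nulls arithmetic can be made explicit. In this sketch the power of .8, and the use of Selvin's .64 as the cherry-picked actual level, are assumptions for illustration:

```python
# Sketch: Bayes' theorem on the "urn of nulls". With Pr(H0) = .9,
# Pr(H0 | reject) can exceed .5 once selection effects inflate the
# actual level - and it is not the Type 1 error probability
# Pr(reject | H0), which stays at alpha.
def pr_h0_given_reject(prior_h0, alpha, power):
    p_reject = alpha * prior_h0 + power * (1 - prior_h0)
    return alpha * prior_h0 / p_reject

for label, alpha in (("nominal .05", 0.05), ("cherry-picked .64", 0.64)):
    val = pr_h0_given_reject(0.9, alpha, 0.8)
    print(f"{label}: Pr(H0 | reject) = {val:.2f}")
# nominal .05 gives ~0.36; the cherry-picked level gives ~0.88 > .5
```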
52. O A: announce a finding (a P-value of .05)
O Not properly Bayesian (not even empirical Bayes), not properly frequentist
O Where does the high prevalence come from?
53. Concluding Remark
O If replication research and reforms are to lead to error correction, they must correct errors: they don't always do that
O They do when they encourage preregistration, control error probabilities, and require good design (RCTs, checking model assumptions)
O They don't when they permit tools that lack error control
54. Don't Throw Out the Error Control Baby
O The main source of hand-wringing behind the statistical crisis in science stems from cherry-picking, hunting for significance, and multiple testing
O These biasing selection effects are picked up by tools that assess error control (performance or severity)
O Reforms based on "probabilisms" enable, rather than check, unreliable results due to biasing selection effects
55. Repligate
O Replication research has pushback: some call it methodological terrorism (enforcing good science or bullying?)
O My gripe is that replications, at least in social psychology, should go beyond the statistical criticism
56. Non-replications construed as simply weaker effects
O One of the non-replications: cleanliness and morality. Does unscrambling soap words make you less judgmental?
"Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it's cool to eat your dog after it gets run over by a car. …" (Chronicle of Higher Education)
57. "…Turns out, it did. Subjects who had unscrambled clean words weren't as harsh on the guy who chows down on his chow." (Chronicle of Higher Education)
O Focusing on the P-values ignores larger questions of measurement in psych and the leap from the statistical to the substantive: H → H*
O Increasingly the basis for experimental philosophy; needs philosophical scrutiny
58. The ASA's Six Principles
O (1) P-values can indicate how incompatible the data are with a specified statistical model
O (2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone
O (3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold
O (4) Proper inference requires full reporting and transparency
O (5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result
O (6) By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis
59. Test T+: Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0 (σ known)
(FEV/SEV): If d(x) is not statistically significant, then μ ≤ M0 + k_ε σ/√n passes the test T+ with severity (1 − ε)
(FEV/SEV): If d(x) is statistically significant, then μ > M0 + k_ε σ/√n passes the test T+ with severity (1 − ε)
where P(d(X) > k_ε) = ε
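These bounds can be computed directly; this sketch inverts the severity computation above, with illustrative numbers (μ0 = 0, σ = 1, n = 100, observed mean M0 = 0.1, none of which are from the slides):

```python
# Sketch: FEV/SEV bounds for test T+ in numbers. A non-significant
# d(x) lets us infer mu <= M0 + k_eps*sigma/sqrt(n) with severity
# 1 - eps, where Pr(d(X) > k_eps) = eps under a Normal model.
from math import sqrt
from statistics import NormalDist

M0, sigma, n = 0.1, 1.0, 100
for eps in (0.16, 0.05, 0.025):
    k_eps = NormalDist().inv_cdf(1 - eps)   # Pr(Z > k_eps) = eps
    bound = M0 + k_eps * sigma / sqrt(n)
    print(f"severity {1 - eps:.3f}: mu <= {bound:.3f} passes T+")
```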
60. References
O Armitage, P. 1962. "Contribution to Discussion." In The Foundations of Statistical Inference: A Discussion, edited by L. J. Savage. London: Methuen.
O Berger, J. O. 2003. "Could Fisher, Jeffreys and Neyman Have Agreed on Testing?" and "Rejoinder," Statistical Science 18(1): 1-12; 28-32.
O Berger, J. O. 2006. "The Case for Objective Bayesian Analysis." Bayesian Analysis 1(3): 385-402.
O Birnbaum, A. 1970. "Statistical Methods in Scientific Inference (letter to the Editor)." Nature 225(5237) (March 14): 1033.
O Efron, B. 2013. "A 250-Year Argument: Belief, Behavior, and the Bootstrap," Bulletin of the American Mathematical Society 50(1): 126-46.
O Box, G. 1983. "An Apology for Ecumenism in Statistics," in Box, G. E. P., Leonard, T. and Wu, D. F. J. (eds.), Scientific Inference, Data Analysis, and Robustness, pp. 51-84. New York: Academic Press.
O Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and Hall.
O Cox, D. R., and Mayo, D. G. 2010. "Objectivity and Conditionality in Frequentist Inference." In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 276-304. Cambridge: Cambridge University Press.
O Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.
O Fisher, R. A. 1955. "Statistical Methods and Scientific Induction." Journal of the Royal Statistical Society, Series B (Methodological) 17(1) (January 1): 69-78.
61. O Gelman, A. and Shalizi, C. 2013. "Philosophy and the Practice of Bayesian Statistics" and "Rejoinder," British Journal of Mathematical and Statistical Psychology 66(1): 8-38; 76-80.
O Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Krüger, L. 1989. The Empire of Chance. Cambridge: Cambridge University Press.
O Goldacre, B. 2008. Bad Science. HarperCollins Publishers.
O Goldacre, B. 2016. "Make journals report clinical trials properly," Nature 530(7588); online 02 Feb 2016.
O Goodman, S. N. 1999. "Toward evidence-based medical statistics. 2: The Bayes factor," Annals of Internal Medicine 130: 1005-1013.
O Lindley, D. V. 1971. "The Estimation of Many Parameters." In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435-455. Toronto: Holt, Rinehart and Winston.
O Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press.
O Mayo, D. G. and Cox, D. R. 2010. "Frequentist Statistics as a Theory of Inductive Inference," in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. G. Mayo and A. Spanos, eds.), Cambridge: Cambridge University Press: 1-27. This paper first appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
O Mayo, D. G., and A. Spanos. 2006. "Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction." British Journal for the Philosophy of Science 57(2) (June 1): 323-357.
62. O Mayo, D. G., and A. Spanos. 2011. "Error Statistics." In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7: 152-198. Handbook of the Philosophy of Science. The Netherlands: Elsevier.
O Meehl, P. E., and N. G. Waller. 2002. "The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude." Psychological Methods 7(3): 283-300.
O Morrison, D. E., and R. E. Henkel, eds. 1970. The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.
O Pearson, E. S. and Neyman, J. 1930. "On the Problem of Two Samples." Joint Statistical Papers by J. Neyman & E. S. Pearson, 99-115 (Berkeley: U. of Calif. Press). First published in Bul. Acad. Pol. Sci. 73-96.
O Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
O Selvin, H. 1970. "A Critique of Tests of Significance in Survey Research." In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter.
O Simonsohn, U. 2013. "Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone," Psychological Science 24(10): 1875-1888.
O Trafimow, D. and Marks, M. 2015. "Editorial," Basic and Applied Social Psychology 37(1): 1-2.
O Wasserstein, R. and Lazar, N. 2016. "The ASA's Statement on p-Values: Context, Process, and Purpose," The American Statistician 70(2): 129-133.
63. Abstract
If a statistical methodology is to be adequate, it needs to register how "questionable research practices" (QRPs) alter a method's error-probing capacities. If little has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a stringent or severe test. The goal of severe testing is the linchpin for (re)interpreting frequentist methods so as to avoid long-standing fallacies at the heart of today's statistics wars. A contrasting philosophy views statistical inference in terms of posterior probabilities in hypotheses: probabilism. Presupposing probabilism, critics mistakenly argue that significance and confidence levels are misinterpreted, exaggerate evidence, or are irrelevant for inference. Recommended replacements (Bayesian updating, Bayes factors, likelihood ratios) fail to control severity.