Severe Testing: The Key to Error
Correction
Deborah G Mayo
Virginia Tech
March 17, 2017
“Understanding Reproducibility and Error Correction in
Science”
Statistical Crisis of Replication
O Statistical ‘findings’ disappear when
others look for them.
O Beyond the social sciences to genomics,
bioinformatics, and medicine (Big Data)
O Methodological reforms (some welcome,
others radical)
O Need to understand philosophical,
statistical, historical issues
2
American Statistical Association
(ASA):Statement on P-values
“The statistical community has been deeply
concerned about issues of reproducibility and
replicability of scientific conclusions. 
 much
confusion and even doubt about the validity of
science is arising. Such doubt can lead to
radical choices such as 
 to ban p-values”
(ASA, Wasserstein & Lazar 2016, p. 129)
3
I was a philosophical observer at the
ASA P-value “pow wow”
4
“Don’t throw out the error control baby
with the bad statistics bathwater”
The American Statistician
5
O The most used methods are most
criticized
O Statistical significance tests are a small
part of a rich set of “techniques for
systematically appraising and bounding the
probabilities 
 of seriously misleading
interpretations of data” (Birnbaum 1970, p. 1033)
O These I call error statistical methods (or
sampling theory)
6
Error Statistics
O Statistics: Collection, modeling, drawing
inferences from data to claims about
aspects of processes
O The inference may be in error
O It’s qualified by a claim about the
method’s capabilities to control and alert
us to erroneous interpretations (error
probabilities)
7
“p-value 
 to test the conformity of the
particular data under analysis with H0 in
some respect: 
 we find a function T = t(y)
of the data, to be called the test statistic,
such that
‱ the larger the value of T the more
inconsistent are the data with H0;
‱ the random variable T = t(Y) has a
(numerically) known probability
distribution when H0 is true.
‱ the p-value corresponding to any t_obs is
p = p(t) = Pr(T ≄ t_obs; H0)”
(Mayo and Cox 2006, p. 81)
8
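A minimal numerical sketch of this definition, assuming T is a standard Normal Z statistic under H0 (the function name and numbers are illustrative):

```python
from math import erf, sqrt

def p_value_z(t_obs: float) -> float:
    """p = Pr(T >= t_obs; H0), where T ~ N(0, 1) under H0."""
    # Survival function of the standard Normal: 1 - CDF.
    return 1 - 0.5 * (1 + erf(t_obs / sqrt(2)))

# The larger t_obs, the more inconsistent the data with H0,
# and the smaller the p-value:
print(round(p_value_z(1.96), 3))  # 0.025
print(round(p_value_z(0.5), 3))   # 0.309
```

Both clauses of the definition are visible here: T's distribution under H0 is known (standard Normal), and p falls monotonically as t_obs grows.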
Testing Reasoning
O If even larger differences than t_obs occur fairly
frequently under H0 (i.e., the P-value is not
small), there’s scarcely evidence of
incompatibility with H0
O A small P-value indicates some underlying
discrepancy from H0, because very probably you
would have seen a less impressive difference
than t_obs were H0 true.
O This indication isn’t evidence of a genuine
statistical effect H, let alone a scientific
conclusion H*
Stat-Sub fallacy: H => H*
9
O I’m not keen to defend many uses of
significance tests long lampooned
O I introduce a reformulation of tests in
terms of discrepancies (effect sizes) that
are and are not severely-tested
O The criticisms are often based on
misunderstandings; consequently so are
many “reforms”
10
Replication Paradox
(for Significance Test Critics)
Critic: It’s much too easy to get a small
P-value
You: Why do they find it so difficult to
replicate the small P-values others found?
Is it easy or is it hard?
11
Only 36 of 100 psychology experiments yielded
small P-values in Open Science Collaboration on
replication in psychology
OSC: Reproducibility Project: Psychology:
2011-15 (Science 2015): Crowd-sourced effort to
replicate 100 articles (led by Brian Nosek, U. VA)
12
O R.A. Fisher: it’s easy to lie with statistics by
selective reporting, not the test’s fault
O Sufficient finagling—cherry-picking,
P-hacking, significance seeking, multiple
testing, look elsewhere—may practically
guarantee a preferred claim H gets support,
even if it’s unwarranted by evidence
(biasing selection effects, need to adjust P-values)
Note: Support for some preferred claim H is by
rejecting a null hypothesis
O H hasn’t passed a severe test
13
Severity Requirement:
If the test procedure had little or no capability
of finding flaws with H (even if H is incorrect),
then agreement between data x0 and H
provides poor (or no) evidence for H
(“too cheap to be worth having” Popper)
O Such a test fails a minimal requirement for a
stringent or severe test
O My account: severe testing based on error
statistics (requires reinterpreting tests)
14
This alters the role of probability:
typically just 2
O Probabilism. To assign a degree of probability,
confirmation, support or belief in a hypothesis,
given data x0
(e.g., Bayesian, likelihoodist)—with regard for
inner coherency
O Performance. Ensure long-run reliability of
methods, coverage probabilities (frequentist,
behavioristic Neyman-Pearson)
15
What happened to using probability to
assess the error probing capacity by
the severity requirement?
O Neither “probabilism” nor
“performance” directly captures it
O Good long-run performance is a
necessary, not a sufficient, condition
for severity
16
O Problems with selective reporting, cherry
picking, stopping when the data look
good, P-hacking, are not problems about
long-runs—
O It’s that we cannot say the case at hand
has done a good job of avoiding the
sources of misinterpreting data
Key to revising the role of error probabilities
17
A claim C is not warranted _______
O Probabilism: unless C is true or probable
(gets a probability boost, is made
comparatively firmer)
O Performance: unless it stems from a
method with low long-run error
O Probativism (severe testing) unless
something (a fair amount) has been done
to probe ways we can be wrong about C
18
O If you assume probabilism is required for
inference, error probabilities are relevant
for inference only by misinterpretation
False!
O I claim, error probabilities play a crucial
role in appraising well-testedness
O It’s crucial to be able to say, C is highly
believable or plausible but poorly tested
19
Biasing selection effects:
O One function of severity is to identify
problematic selection effects (not all are)
O Biasing selection effects: when data or
hypotheses are selected or generated (or
a test criterion is specified), in such a way
that the minimal severity requirement is
violated, seriously altered or incapable
of being assessed
O Picking up on these alterations is
precisely what enables error statistics to
be self-correcting—
20
Nominal vs actual Significance levels
“Suppose that twenty sets of differences
have been examined, that one difference
seems large enough to test and that this
difference turns out to be ‘significant at
the 5 percent level.’ 
 The actual level of
significance is not 5 percent, but 64
percent!” (Selvin, 1970, p. 104)
From Morrison & Henkel’s The Significance
Test Controversy (1970!)
21
O Morrison and Henkel were clear on the
fallacy: blurring the “computed” or
“nominal” significance level, and the
“actual” level
O There are many more ways you can be
wrong with hunting (different sample
space)
22
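Selvin's 64 percent can be reproduced directly: if twenty independent differences are examined, each at a nominal .05 level, the probability that at least one comes out "significant" when every null is true is 1 − (1 − .05)^20. A sketch (assuming independence of the twenty tests):

```python
def actual_level(nominal: float, n_tests: int) -> float:
    """Probability that at least one of n_tests independent tests
    reaches the nominal level when all the nulls are true."""
    return 1 - (1 - nominal) ** n_tests

print(round(actual_level(0.05, 20), 2))  # 0.64
```

This is the "different sample space" of the hunting procedure: the relevant error probability attaches to the whole search, not to the one test reported.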
Spurious P-Value
You report: Such results would be difficult to
achieve under the assumption of H0
When in fact such results are common under the
assumption of H0
(Formally):
O You say Pr(P-value ≀ p_obs; H0) ≈ p_obs, small
O But in fact Pr(P-value ≀ p_obs; H0) = high
23
Scapegoating
O Nowadays, we’re likely to see the tests
blamed
O My view: Tests don’t kill inferences,
people do
O Even worse are those statistical accounts
where the abuse vanishes!
24
On some views, taking account of biasing
selection effects “defies scientific sense”
“Two problems that plague frequentist inference:
multiple comparisons and multiple looks, or, as
they are more commonly called, data dredging
and peeking at the data. The frequentist solution
to both problems involves adjusting the
P-value 
 But adjusting the measure of evidence
because of considerations that have nothing
to do with the data defies scientific sense,
belies the claim of ‘objectivity’ that is often made
for the P-value” (Goodman 1999, p. 1010)
(To his credit, he’s open about this; heads the
Meta-Research Innovation Center at Stanford)
25
Technical activism isn’t free of philosophy
Ben Goldacre (of Bad Science), in a 2016 Nature
article, is puzzled that bad statistical practices
continue even in the face of the new “technical
activism”:
The editors at Annals of Internal Medicine 

repeatedly (but confusedly) argue that it is
acceptable to identify “prespecified outcomes”
[from results] produced after a trial
began; 
 they say that their expertise allows
them to permit – and even solicit –
undeclared outcome-switching
26
His paper: “Make journals report
clinical trials properly”
O He shouldn’t close his eyes to the
possibility that some of the pushback he’s
seeing has a basis in statistical
philosophy!
27
Likelihood Principle (LP)
The vanishing act links to a pivotal
disagreement in the philosophy of statistics
battles
In probabilisms, the import of the data is via
the ratios of likelihoods of hypotheses
P(x0;H1)/P(x0;H0)
The data x0 are fixed, while the hypotheses
vary
28
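A small sketch of the point, assuming Normal likelihoods (the values are illustrative): the import of the fixed data x0 is carried entirely by the likelihood ratio, and nothing about the sample space or sampling plan enters it.

```python
from math import exp, pi, sqrt

def normal_likelihood(x: float, mu: float, sigma: float = 1.0) -> float:
    """Density of x under N(mu, sigma^2): the likelihood of mu given x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# The data x0 are fixed, while the hypotheses (values of mu) vary:
x0 = 1.5
lr = normal_likelihood(x0, mu=1.5) / normal_likelihood(x0, mu=0.0)
print(round(lr, 2))  # 3.08
```

However the value x0 = 1.5 was arrived at (fixed n, optional stopping, hunting), this ratio is the same; that invariance is precisely what the next slide's quotations from Lindley and Rosenkrantz describe.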
All error probabilities violate the LP
(even without selection effects):
Sampling distributions, significance levels,
power, all depend on something more [than
the likelihood function]–something that is
irrelevant in Bayesian inference–namely the
sample space (Lindley 1971, p. 436)
The LP implies 
 the irrelevance of
predesignation, of whether a hypothesis
was thought of beforehand or was
introduced to explain known effects
(Rosenkrantz, 1977, p. 122)
29
Paradox of Optional Stopping:
Error probing capacities are altered not just
by cherry picking and data dredging, but
also via data dependent stopping rules:
Xi ~ N(ÎŒ, σÂČ), 2-sided H0: ÎŒ = 0 vs. H1: ÎŒ ≠ 0.
Instead of fixing the sample size n in
advance, in some tests, n is determined by
a stopping rule:
30
“Trying and trying again”
O Keep sampling until H0 is rejected at the 0.05
level, i.e., keep sampling until |M| ≄ 1.96 σ/√n
O Trying and trying again: having failed to
rack up a 1.96 σ difference after 10 trials,
go to 20, 30 and so on until obtaining a
1.96 σ difference
31
Nominal vs. Actual
significance levels again:
O With n fixed the Type 1 error probability is
0.05
O With this stopping rule the actual
significance level differs from, and will be
greater than 0.05
O Violates Cox and Hinkley’s (1974) “weak
repeated sampling principle”
32
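A Monte Carlo sketch of this inflation (the cap of 100 draws and the trial count are arbitrary illustrative choices): sample from N(0, 1), so H0 is true, and "reject" as soon as |M| ≄ 1.96/√n.

```python
import random
from math import sqrt

def rejects_under_stopping(n_max: int, rng: random.Random) -> bool:
    """Check |mean| >= 1.96/sqrt(n) after every draw from N(0, 1);
    stop and reject as soon as the bound is crossed."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total / n) >= 1.96 / sqrt(n):
            return True
    return False

rng = random.Random(1)
trials = 2000
rate = sum(rejects_under_stopping(100, rng) for _ in range(trials)) / trials
print(rate)  # well above the nominal 0.05
```

With the sample size fixed in advance the rejection rate would hover near 0.05; peeking after every draw pushes it several times higher, which is the nominal-vs-actual gap the slide describes.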
O The ASA (p. 131) correctly warns that
“[c]onducting multiple analyses of the data
and reporting only those with certain
p-values” leads to spurious p-values
(Principle 4)
O They don’t mention that the same
p-hacked hypothesis can occur in Bayes
factors, credibility intervals, likelihood
ratios
33
With One Big Difference:
O The direct grounds to criticize inferences
as flouting error statistical control are lost
O They condition on the actual data
O Error probabilities take into account other
outcomes that could have occurred but
did not (sampling distribution)
34
How might probabilists block intuitively
unwarranted inferences
(without error probabilities)?
A subjective Bayesian might say:
If our beliefs were mixed into the interpretation of
the evidence, we wouldn’t declare there’s
statistical evidence of some unbelievable claim
(distinguishing shades of grey and being
politically moderate, ovulation and voting
preferences)
35
Rescued by beliefs
O That could work in some cases (it still
wouldn’t show what researchers had done
wrong)—battle of beliefs
O Besides, researchers sincerely believe
their hypotheses
O So now you’ve got two sources of
flexibility, priors and biasing selection
effects
36
No help with our most important
problem
O How to distinguish the warrant for a
single hypothesis H with different methods
(e.g., one has biasing selection effects,
another, pre-registered results and
precautions)?
37
Most Bayesians use “default” priors
O Eliciting subjective priors too difficult,
scientists reluctant to allow subjective
beliefs to overshadow data
O Default, or reference priors are supposed
to prevent prior beliefs from influencing
the posteriors (O-Bayesians, 2006)
38
O “The priors are not to be considered expressions
of uncertainty, ignorance, or degree of belief.
Default priors may not even be probabilities 
”
(Cox and Mayo 2010, p. 299)
O Default Bayesian reforms are touted as free of
selection effects
O “
 Bayes factors can be used in the complete
absence of a sampling plan 
” (Bayarri,
Benjamin, Berger, Sellke 2016, p. 100)
39
Granted, some are prepared to abandon
the LP for model testing
In an attempted meeting of the minds Andrew
Gelman and Cosma Shalizi say:
O “[C]rucial parts of Bayesian data analysis, such
as model checking, can be understood as ‘error
probes’ in Mayo’s sense”, which “might be seen
as using modern statistics to implement the
Popperian criteria of severe tests” (2013, p. 10).
O An open question
40
The ASA doc highlights classic foibles
that block replication
“In relation to the test of significance, we
may say that a phenomenon is
experimentally demonstrable when we
know how to conduct an experiment
which will rarely fail to give us a
statistically significant result”
(Fisher 1935, p. 14)
“isolated” low P-value ≠> H: statistical effect
41
Statistical ≠> substantive (H ≠> H*)
“[A]ccording to Fisher, rejecting the null
hypothesis is not equivalent to
accepting the efficacy of the cause in
question. The latter...requires obtaining
more significant results when the
experiment, or an improvement of it, is
repeated at other laboratories or under
other conditions” (Gigerenzer et al. 1989, pp.
95-6)
42
The Problem is with so-called NHST
(“null hypothesis significance testing”)
O NHSTs supposedly allow moving from
statistical to substantive hypotheses
O If defined that way, they exist only as
abuses of tests
O ASA doc ignores Neyman-Pearson (N-P)
tests
43
Neyman-Pearson (N-P) tests:
A null and alternative hypotheses
H0, H1 that are exhaustive
H0: ÎŒ ≀ 12 vs H1: ÎŒ > 12
O So this fallacy of rejection (H => H*) is
impossible
O Rejecting the null only indicates statistical
alternatives (how discrepant from null)
44
P-values Don’t Report Effect Sizes
(Principle 5)
Who ever said to just report a P-value?
O “Tests should be accompanied by
interpretive tools that avoid the
fallacies of rejection and non-rejection.
These correctives can be articulated in
either Fisherian or Neyman-Pearson
terms” (Mayo and Cox 2006, Mayo and
Spanos 2006)
45
To Avoid Inferring a Discrepancy
Beyond What’s Warranted:
large n problem.
O Severity tells us: an α-significant
difference is indicative of less of a
discrepancy from the null if it results from
a larger (n1) rather than a smaller (n2)
sample size (n1 > n2)
46
O What’s more indicative of a large effect
(fire), a fire alarm that goes off with burnt
toast or one so insensitive that it doesn’t
go off unless the house is fully ablaze?
O [The larger sample size is like the one
that goes off with burnt toast]
47
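The burnt-toast point can be made concrete for test T+ (taking H0: ÎŒ ≀ 0 with σ = 1; the numbers are illustrative): the observed mean that is just .05-significant shrinks with √n, so the same nominal significance indicates a smaller discrepancy at the larger sample size.

```python
from math import sqrt

def just_significant_mean(z_alpha: float, sigma: float, n: int) -> float:
    """Observed mean that is just alpha-significant in test T+ of
    H0: mu <= 0, i.e. M = z_alpha * sigma / sqrt(n)."""
    return z_alpha * sigma / sqrt(n)

# The same nominal .05 result (z = 1.645) at two sample sizes:
print(round(just_significant_mean(1.645, 1.0, 25), 4))    # 0.329
print(round(just_significant_mean(1.645, 1.0, 2500), 4))  # 0.0329
```

The n = 2500 alarm goes off at a tenth of the observed discrepancy: it is the one triggered by burnt toast.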
What About Fallacies of
Non-Significant Results?
O They don’t warrant a 0 discrepancy
O Use the same severity reasoning to rule out
discrepancies that very probably would have
resulted in a larger difference than observed: set
upper bounds
O If you very probably would have observed a more
impressive (smaller) p-value than you did, if ÎŒ >
ÎŒ1 (ÎŒ1 = ÎŒ0 + Îł), then the data are good evidence
that ÎŒ < ÎŒ1
O Akin to power analysis (Cohen, Neyman) but
sensitive to x0
48
O There’s another kind of fallacy behind a
move that’s supposed to improve replication,
but it confuses notions from significance
testing and leads to “most findings are
false”
O Fake replication crisis.
49
Diagnostic Screening Model of Tests:
urn of nulls
(“most findings are false”)
O If we imagine randomly selecting a hypothesis
from an urn of nulls, 90% of which are true
O Consider just 2 possibilities: H0: no effect,
H1: meaningful effect, all else ignored
O Take the prevalence of 90% as
Pr(H0 you picked) = .9, Pr(H1) = .1
O Reject H0 with a single (just) .05 significant
result, cherry-picking to boot
50
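The screening computation on the next slide is just Bayes' theorem applied to the urn, under the slide's stated assumptions (the .8 power figure and the inflated "actual alpha" standing in for cherry-picking are my illustrative choices):

```python
def prob_null_given_finding(prev_null: float, alpha: float, power: float) -> float:
    """Pr(H0 | test rejects), treating the urn prevalence as a prior."""
    p_reject = prev_null * alpha + (1 - prev_null) * power
    return prev_null * alpha / p_reject

# 90% true nulls, nominal alpha .05, assumed power .8:
print(round(prob_null_given_finding(0.9, 0.05, 0.8), 2))  # 0.36
# Cherry-picking inflates the actual alpha, say to .36:
print(round(prob_null_given_finding(0.9, 0.36, 0.8), 2))  # 0.8
```

At the nominal level the "findings false" rate is well under .5; it is the selection-inflated actual alpha, combined with the assumed 90% prevalence, that delivers the "most findings are false" headline.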
51
The unsurprising result is that most “findings” are
false: Pr(H0 | findings with a P-value of .05) > .5
Pr(H0 | findings with a P-value of .05) ≠
Pr(P-value of .05 | H0)
(Only the second one is a Type 1 error probability)
Major source of confusion
(Berger and Sellke 1987, Ioannidis 2005,
Colquhoun 2014)
O A: Announce a finding (a P-value of .05)
O Not properly Bayesian (not even
empirical Bayes), not properly frequentist
O Where does the high prevalence come
from?
52
Concluding Remark
O If replication research and reforms are to lead to
error correction, they must correct errors: they
don’t always do that
O They do when they encourage preregistration,
control error probabilities & require good design
(RCTs, checking model assumptions)
O They don’t when they permit tools that lack error
control
53
Don’t Throw Out the Error Control Baby
O The main source of hand-wringing behind the
statistical crisis in science is cherry-picking,
hunting for significance, and multiple testing
O These biasing selection effects are picked up
by tools that assess error control (performance
or severity)
O Reforms based on “probabilisms” enable rather
than check unreliable results due to biasing
selection effects
54
Repligate
O Replication research has pushback:
some call it methodological terrorism
(enforcing good science or bullying?)
O My gripe is that replications, at least in
social psychology, should go beyond the
statistical criticism
55
Non-replications construed as
simply weaker effects
O One of the non-replications: cleanliness and
morality: Does unscrambling soap words make
you less judgmental?
“Ms. Schnall had 40 undergraduates unscramble
some words. One group unscrambled words
that suggested cleanliness (pure, immaculate,
pristine), while the other group unscrambled
neutral words. They were then presented with
a number of moral dilemmas, like whether it’s
cool to eat your dog after it gets run over by a
car. 

56
 Turns out, it did. Subjects who had
unscrambled clean words weren’t as harsh on the
guy who chows down on his chow.” (Chronicle of
Higher Education)
O Focusing on the P-values ignores larger
questions of measurement in psych & the leap
from the statistical to the substantive
(H => H*)
O Increasingly the basis for experimental
philosophy; needs philosophical scrutiny
57
The ASA’s Six Principles
O (1) P-values can indicate how incompatible the data are
with a specified statistical model
O (2) P-values do not measure the probability that the
studied hypothesis is true, or the probability that the data
were produced by random chance alone
O (3) Scientific conclusions and business or policy
decisions should not be based only on whether a p-value
passes a specific threshold
O (4) Proper inference requires full reporting and
transparency
O (5) A p-value, or statistical significance, does not
measure the size of an effect or the importance of a result
O (6) By itself, a p-value does not provide a good measure
of evidence regarding a model or hypothesis
58
Test T+: Normal testing: H0: ÎŒ ≀ ÎŒ0 vs. H1: ÎŒ > ÎŒ0,
σ known, M0 the observed sample mean
(FEV/SEV): If d(x) is not statistically significant,
then ÎŒ < M0 + kΔσ/√n passes the test T+ with
severity (1 – Δ)
(FEV/SEV): If d(x) is statistically significant, then
ÎŒ > M0 + kΔσ/√n passes the test T+ with
severity (1 – Δ)
where P(d(X) > kΔ) = Δ
59
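A sketch of the severity computation behind the first rule, for test T+ with σ known (symbols follow the slide; the specific numbers are illustrative). For a non-significant observed mean M, SEV(ÎŒ < ÎŒ1) is the probability of a larger difference than the one observed, were ÎŒ = ÎŒ1:

```python
from math import erf, sqrt

def phi(z: float) -> float:
    """Standard Normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def sev_upper(M: float, mu1: float, sigma: float, n: int) -> float:
    """SEV(mu < mu1) after observing mean M in test T+:
    Pr(sample mean > M; mu = mu1) = 1 - Phi((M - mu1) / (sigma/sqrt(n)))."""
    return 1 - phi((M - mu1) / (sigma / sqrt(n)))

# Non-significant M = 0 with sigma = 1, n = 100; taking
# mu1 = M + 1.645 * sigma/sqrt(n) (so k_eps = 1.645, eps = .05):
print(round(sev_upper(0.0, 0.1645, 1.0, 100), 2))  # 0.95
```

This reproduces the rule: the bound M0 + kΔσ/√n passes with severity 1 − Δ, here .95.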
References
O Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical
Inference: A Discussion, edited by L. J. Savage. London: Methuen.
O Berger, J. O. 2003 'Could Fisher, Jeffreys and Neyman Have Agreed on Testing?'
and 'Rejoinder,', Statistical Science 18(1): 1-12; 28-32.
O Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis
1 (3): 385–402.
O Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the
Editor).” Nature 225 (5237) (March 14): 1033.
O Box, G. 1983. “An Apology for Ecumenism in Statistics,” in Box, G.E.P., Leonard, T.
and Wu, D. F. J. (eds.), pp. 51-84, Scientific Inference, Data Analysis, and
Robustness. New York: Academic Press.
O Efron, B. 2013. “A 250-Year Argument: Belief, Behavior, and the Bootstrap”, Bulletin
of the American Mathematical Society 50(1): 126-46.
O Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and
Hall.
O Cox, D. R., and Deborah G. Mayo. 2010. “Objectivity and Conditionality in
Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental
Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by
Deborah G. Mayo and Aris Spanos, 276–304. Cambridge: Cambridge University
Press.
O Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.
O Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the
Royal Statistical Society, Series B (Methodological) 17 (1) (January 1): 69–78.
60
O Gelman, A. and Shalizi, C. 2013. 'Philosophy and the Practice of Bayesian
Statistics' and 'Rejoinder', British Journal of Mathematical and Statistical
Psychology 66(1): 8–38; 76-80.
O Gigerenzer, G., Swijtink, Porter, T. Daston, L. Beatty, J, and Kruger, L. 1989. The
Empire of Chance. Cambridge: Cambridge University Press.
O Goldacre, B. 2008. Bad Science. HarperCollins Publishers.
O Goldacre, B. 2016. “Make journals report clinical trials properly”, Nature
530(7588);online 02Feb2016.
O Goodman SN. 1999. “Toward evidence-based medical statistics. 2: The Bayes
factor,” Annals of Internal Medicine 1999; 130:1005 –1013.
O Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of
Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto:
Holt, Rinehart and Winston.
O Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and
Its Conceptual Foundation. Chicago: University of Chicago Press.
O Mayo, D. G. and Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive
Inference" in Error and Inference: Recent Exchanges on Experimental Reasoning,
Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos
eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The
Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-
Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
O Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a
Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of
Science 57 (2) (June 1): 323–357.
61
O Mayo, D. G., and A. Spanos. 2011. “Error Statistics.” In Philosophy of Statistics,
edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook
of the Philosophy of Science. The Netherlands: Elsevier.
O Meehl, P. E., and N. G. Waller. 2002. “The Path Analysis Controversy: A New
Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7
(3): 283–300.
O Morrison, D. E., and R. E. Henkel, ed. 1970. The Significance Test Controversy: A
Reader. Chicago: Aldine De Gruyter.
O Pearson, E. S. & Neyman, J. (1930). On the problem of two samples. Joint Statistical
Papers by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press). First
published in Bul. Acad. Pol.Sci. 73-96.
O Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian
Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
O Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London:
Methuen.
O Selvin, H. 1970. “A critique of tests of significance in survey research. In The
significance test controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago:
Aldine De Gruyter.
O Simonsohn, U. 2013, "Just Post It: The Lesson From Two Cases of Fabricated Data
Detected by Statistics Alone", Psychological Science, vol. 24, no. 10, pp. 1875-1888.
O Trafimow D. and Marks, M. 2015. “Editorial”, Basic and Applied Social Psychology
37(1): pp. 1-2.
O Wasserstein, R. and Lazar, N. 2016. “The ASA’s statement on p-values: context,
process, and purpose”, The American Statistician 70(2): 129–133.
62
Abstract
If a statistical methodology is to be adequate, it needs
to register how “questionable research practices”
(QRPs) alter a method’s error probing capacities. If
little has been done to rule out flaws in taking data as
evidence for a claim, then that claim has not passed a
stringent or severe test. The goal of severe testing is
the linchpin for (re)interpreting frequentist methods so
as to avoid long-standing fallacies at the heart of
today’s statistics wars. A contrasting philosophy views
statistical inference in terms of posterior probabilities
in hypotheses: probabilism. Presupposing probabilism,
critics mistakenly argue that significance and
confidence levels are misinterpreted, exaggerate
evidence, or are irrelevant for inference.
Recommended replacements–Bayesian updating,
Bayes factors, likelihood ratios–fail to control severity.
63

Weitere Àhnliche Inhalte

Was ist angesagt?

Statistical skepticism: How to use significance tests effectively
Statistical skepticism: How to use significance tests effectively Statistical skepticism: How to use significance tests effectively
Statistical skepticism: How to use significance tests effectively jemille6
 
"The Statistical Replication Crisis: Paradoxes and Scapegoats”
"The Statistical Replication Crisis: Paradoxes and Scapegoats”"The Statistical Replication Crisis: Paradoxes and Scapegoats”
"The Statistical Replication Crisis: Paradoxes and Scapegoats”jemille6
 
Gelman psych crisis_2
Gelman psych crisis_2Gelman psych crisis_2
Gelman psych crisis_2jemille6
 
Senn repligate
Senn repligateSenn repligate
Senn repligatejemille6
 
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...jemille6
 
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...jemille6
 
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...jemille6
 
April 3 2014 slides mayo
April 3 2014 slides mayoApril 3 2014 slides mayo
April 3 2014 slides mayojemille6
 
Byrd statistical considerations of the histomorphometric test protocol (1)
Byrd statistical considerations of the histomorphometric test protocol (1)Byrd statistical considerations of the histomorphometric test protocol (1)
Byrd statistical considerations of the histomorphometric test protocol (1)jemille6
 
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in ScienceD. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in Sciencejemille6
 
Controversy Over the Significance Test Controversy
Controversy Over the Significance Test ControversyControversy Over the Significance Test Controversy
Controversy Over the Significance Test Controversyjemille6
 
Fusion Confusion? Comments on Nancy Reid: "BFF Four-Are we Converging?"
Fusion Confusion? Comments on Nancy Reid: "BFF Four-Are we Converging?"Fusion Confusion? Comments on Nancy Reid: "BFF Four-Are we Converging?"
Fusion Confusion? Comments on Nancy Reid: "BFF Four-Are we Converging?"jemille6
 
Statistical Flukes, the Higgs Discovery, and 5 Sigma
Statistical Flukes, the Higgs Discovery, and 5 Sigma Statistical Flukes, the Higgs Discovery, and 5 Sigma
Statistical Flukes, the Higgs Discovery, and 5 Sigma jemille6
 
Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."
Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."
Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."jemille6
 
Mayo: Day #2 slides
Mayo: Day #2 slidesMayo: Day #2 slides
Mayo: Day #2 slidesjemille6
 
beyond objectivity and subjectivity; a discussion paper
beyond objectivity and subjectivity; a discussion paperbeyond objectivity and subjectivity; a discussion paper
beyond objectivity and subjectivity; a discussion paperChristian Robert
 
Discussion a 4th BFFF Harvard
Discussion a 4th BFFF HarvardDiscussion a 4th BFFF Harvard
Discussion a 4th BFFF HarvardChristian Robert
 
Phil 6334 Mayo slides Day 1
Phil 6334 Mayo slides Day 1Phil 6334 Mayo slides Day 1
Phil 6334 Mayo slides Day 1jemille6
 
Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Meeting #1 Slides Phil 6334/Econ 6614 SP2019Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Meeting #1 Slides Phil 6334/Econ 6614 SP2019jemille6
 
D.G. Mayo Slides LSE PH500 Meeting #1
D.G. Mayo Slides LSE PH500 Meeting #1D.G. Mayo Slides LSE PH500 Meeting #1
D.G. Mayo Slides LSE PH500 Meeting #1jemille6
 

Was ist angesagt? (20)

Statistical skepticism: How to use significance tests effectively
Statistical skepticism: How to use significance tests effectively Statistical skepticism: How to use significance tests effectively
Statistical skepticism: How to use significance tests effectively
 
"The Statistical Replication Crisis: Paradoxes and Scapegoats”
"The Statistical Replication Crisis: Paradoxes and Scapegoats”"The Statistical Replication Crisis: Paradoxes and Scapegoats”
"The Statistical Replication Crisis: Paradoxes and Scapegoats”
 
Gelman psych crisis_2
Gelman psych crisis_2Gelman psych crisis_2
Gelman psych crisis_2
 
Senn repligate
Senn repligateSenn repligate
Senn repligate
 
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...

KĂŒrzlich hochgeladen

Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxChelloAnnAsuncion2
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
call girls in Kamla Market (DELHI) 🔝 >àŒ’9953330565🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïž
call girls in Kamla Market (DELHI) 🔝 >àŒ’9953330565🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïžcall girls in Kamla Market (DELHI) 🔝 >àŒ’9953330565🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïž
call girls in Kamla Market (DELHI) 🔝 >àŒ’9953330565🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïž9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
USPSÂź Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPSÂź Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPSÂź Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPSÂź Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 

KĂŒrzlich hochgeladen (20)

Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
call girls in Kamla Market (DELHI) 🔝 >àŒ’9953330565🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïž
call girls in Kamla Market (DELHI) 🔝 >àŒ’9953330565🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïžcall girls in Kamla Market (DELHI) 🔝 >àŒ’9953330565🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïž
call girls in Kamla Market (DELHI) 🔝 >àŒ’9953330565🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïž
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
USPSÂź Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPSÂź Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPSÂź Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPSÂź Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 

Severe Testing: The Key to Error Correction

  • 1. Severe Testing: The Key to Error Correction Deborah G Mayo Virginia Tech March 17, 2017 “Understanding Reproducibility and Error Correction in Science”
  • 2. Statistical Crisis of Replication O Statistical ‘findings’ disappear when others look for them. O Beyond the social sciences to genomics, bioinformatics, and medicine (Big Data) O Methodological reforms (some welcome, others radical) O Need to understand philosophical, statistical, historical issues 2
  • 3. American Statistical Association (ASA): Statement on P-values “The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. 
 much confusion and even doubt about the validity of science is arising. Such doubt can lead to radical choices such as 
 to ban p-values” (ASA, Wasserstein & Lazar 2016, p. 129) 3
  • 4. I was a philosophical observer at the ASA P-value “pow wow” 4
  • 5. “Don’t throw out the error control baby with the bad statistics bathwater” The American Statistician 5
  • 6. O The most used methods are the most criticized O Statistical significance tests are a small part of a rich set of “techniques for systematically appraising and bounding the probabilities 
 of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033) O These I call error statistical methods (or sampling theory). 6
  • 7. Error Statistics O Statistics: Collection, modeling, drawing inferences from data to claims about aspects of processes O The inference may be in error O It’s qualified by a claim about the method’s capabilities to control and alert us to erroneous interpretations (error probabilities) 7
  • 8. “p-value. 
 to test the conformity of the particular data under analysis with H0 in some respect: 
 we find a function T = t(y) of the data, to be called the test statistic, such that ‱ the larger the value of T the more inconsistent are the data with H0; ‱ the random variable T = t(Y) has a (numerically) known probability distribution when H0 is true. 
 the p-value corresponding to any tobs as p = p(t) = Pr(T ≄ tobs; H0)” (Mayo and Cox 2006, p. 81) 8
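The definition above can be sketched numerically. A minimal Python illustration (taking a standard-normal test statistic as an assumed special case, which the slide does not specify):

```python
import math

def p_value(t_obs: float) -> float:
    """One-sided p-value p = Pr(T >= t_obs; H0), assuming the test
    statistic T is standard normal under H0 (survival function via erfc)."""
    return 0.5 * math.erfc(t_obs / math.sqrt(2))

# The larger t_obs, the more inconsistent the data with H0,
# and the smaller the p-value:
print(round(p_value(0.0), 3))   # 0.5
print(round(p_value(1.96), 3))  # 0.025
print(round(p_value(3.0), 4))   # 0.0013
```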
  • 9. Testing Reasoning O If even larger differences than tobs occur fairly frequently under H0 (i.e., the P-value is not small), there’s scarcely evidence of incompatibility with H0 O A small P-value indicates some underlying discrepancy from H0, because very probably you would have seen a less impressive difference than tobs were H0 true O This indication isn’t evidence of a genuine statistical effect H, let alone a scientific conclusion H* (Stat-Sub fallacy: H => H*) 9
  • 10. O I’m not keen to defend many uses of significance tests long lampooned O I introduce a reformulation of tests in terms of discrepancies (effect sizes) that are and are not severely-tested O The criticisms are often based on misunderstandings; consequently so are many “reforms” 10
  • 11. Replication Paradox (for Significance Test Critics) Critic: It’s much too easy to get a small P-value You: Why do they find it so difficult to replicate the small P-values others found? Is it easy or is it hard? 11
  • 12. Only 36 of 100 psychology experiments yielded small P-values in the Open Science Collaboration replication in psychology OSC: Reproducibility Project: Psychology: 2011-15 (Science 2015): Crowd-sourced effort to replicate 100 articles (Led by Brian Nosek, U. Va.) 12
  • 13. O R.A. Fisher: it’s easy to lie with statistics by selective reporting, not the test’s fault O Sufficient finagling—cherry-picking, P-hacking, significance seeking, multiple testing, look elsewhere—may practically guarantee a preferred claim H gets support, even if it’s unwarranted by evidence (biasing selection effects, need to adjust P-values) Note: Support for some preferred claim H is by rejecting a null hypothesis O H hasn’t passed a severe test 13
  • 14. Severity Requirement: If the test procedure had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H (“too cheap to be worth having” Popper) O Such a test fails a minimal requirement for a stringent or severe test O My account: severe testing based on error statistics (requires reinterpreting tests) 14
  • 15. This alters the role of probability: typically just 2 O Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0 (e.g., Bayesian, likelihoodist)—with regard for inner coherency O Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson) 15
  • 16. What happened to using probability to assess the error probing capacity by the severity requirement? O Neither “probabilism” nor “performance” directly captures it O Good long-run performance is a necessary, not a sufficient, condition for severity 16
  • 17. O Problems with selective reporting, cherry picking, stopping when the data look good, P-hacking, are not problems about long-runs— O It’s that we cannot say the case at hand has done a good job of avoiding the sources of misinterpreting data Key to revising the role of error probabilities 17
  • 18. A claim C is not warranted _______ O Probabilism: unless C is true or probable (gets a probability boost, is made comparatively firmer) O Performance: unless it stems from a method with low long-run error O Probativism (severe testing) unless something (a fair amount) has been done to probe ways we can be wrong about C 18
  • 19. O If you assume probabilism is required for inference, error probabilities are relevant for inference only by misinterpretation False! O I claim, error probabilities play a crucial role in appraising well-testedness O It’s crucial to be able to say, C is highly believable or plausible but poorly tested 19
  • 20. Biasing selection effects: O One function of severity is to identify problematic selection effects (not all are) O Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified), in such a way that the minimal severity requirement is violated, seriously altered or incapable of being assessed O Picking up on these alterations is precisely what enables error statistics to be self-correcting— 20
  • 21. Nominal vs actual Significance levels Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ 
 The actual level of significance is not 5 percent, but 64 percent! (Selvin, 1970, p. 104) From Morrison & Henkel’s The Significance Test Controversy (1970!) 21
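Selvin’s 64 percent is just the arithmetic of hunting. A sketch of the calculation, assuming the twenty tests of true nulls are independent:

```python
# Probability that at least one of k independent tests of true null
# hypotheses reaches nominal significance level alpha:
alpha, k = 0.05, 20
actual_level = 1 - (1 - alpha) ** k
print(round(actual_level, 2))  # 0.64
```

Reporting only the one "significant" difference thus converts a nominal 5 percent level into an actual level of roughly 64 percent.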
  • 22. O Morrison and Henkel were clear on the fallacy: blurring the “computed” or “nominal” significance level, and the “actual” level O There are many more ways you can be wrong with hunting (different sample space) 22
  • 23. Spurious P-Value You report: Such results would be difficult to achieve under the assumption of H0 When in fact such results are common under the assumption of H0 (Formally): O You say Pr(P-value ≀ Pobs; H0) ~ Pobs small O But in fact Pr(P-value ≀ Pobs; H0) = high 23
  • 24. Scapegoating O Nowadays, we’re likely to see the tests blamed O My view: Tests don’t kill inferences, people do O Even worse are those statistical accounts where the abuse vanishes! 24
  • 25. On some views, taking account of biasing selection effects “defies scientific sense” “Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value
 But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value” (Goodman 1999, p. 1010) (To his credit, he’s open about this; heads the Meta-Research Innovation Center at Stanford) 25
  • 26. Technical activism isn’t free of philosophy Ben Goldacre (of Bad Science) in a 2016 Nature article, is puzzled that bad statistical practices continue even in the face of the new “technical activism”: The editors at Annals of Internal Medicine 
 repeatedly (but confusedly) argue that it is acceptable to identify “prespecified outcomes” [from results] produced after a trial began; 
 they say that their expertise allows them to permit – and even solicit – undeclared outcome-switching 26
  • 27. His paper: “Make journals report clinical trials properly” O He shouldn’t close his eyes to the possibility that some of the pushback he’s seeing has a basis in statistical philosophy! 27
  • 28. Likelihood Principle (LP) The vanishing act links to a pivotal disagreement in the philosophy of statistics battles In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses P(x0;H1)/P(x0;H0) The data x0 are fixed, while the hypotheses vary 28
  • 29. All error probabilities violate the LP (even without selection effects): Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space (Lindley 1971, p. 436) The LP implies 
 the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects (Rosenkrantz, 1977, p. 122) 29
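The sample-space dependence can be made concrete with a standard textbook illustration (not from the slides): the same data, 9 successes and 3 failures, analyzed under a fixed sample size versus a data-dependent stopping rule. The likelihood function, and hence any likelihood ratio, is identical under both designs, but the p-values differ:

```python
from math import comb

theta0 = 0.5  # H0: Pr(success) = 0.5

# Design 1: n = 12 fixed in advance (binomial). p = Pr(X >= 9; H0).
p_binomial = sum(comb(12, k) * theta0**12 for k in range(9, 13))

# Design 2: sample until the 3rd failure (negative binomial).
# p = Pr(N >= 12; H0) = Pr(at most 2 failures in the first 11 trials).
p_negbinomial = sum(comb(11, j) * theta0**11 for j in range(3))

# Both designs give likelihoods proportional to theta^9 (1-theta)^3, so
# the LP says the evidential import is the same; error statistics disagrees.
print(round(p_binomial, 3), round(p_negbinomial, 3))  # 0.073 0.033
```

One design reaches the 0.05 level and the other does not, on literally the same data — exactly the sample-space sensitivity Lindley calls irrelevant.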
  • 30. Paradox of Optional Stopping: Error probing capacities are altered not just by cherry picking and data dredging, but also via data dependent stopping rules: Xi ~ N(ÎŒ, σÂČ), 2-sided H0: ÎŒ = 0 vs. H1: ÎŒ ≠ 0. Instead of fixing the sample size n in advance, in some tests, n is determined by a stopping rule: 30
  • 31. “Trying and trying again” O Keep sampling until H0 is rejected at the 0.05 level, i.e., keep sampling until M ≄ 1.96 s/√n O Trying and trying again: Having failed to rack up a 1.96 s difference after 10 trials, go to 20, 30 and so on until obtaining a 1.96 s difference 31
  • 32. Nominal vs. Actual significance levels again: O With n fixed the Type 1 error probability is 0.05 O With this stopping rule the actual significance level differs from, and will be greater than 0.05 O Violates Cox and Hinkley’s (1974) “weak repeated sampling principle” 32
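A Monte Carlo sketch of the "trying and trying again" rule (the cap of 100 observations and the number of simulated experiments are arbitrary choices, not from the slides) shows the actual Type 1 error rate climbing well past the nominal 0.05:

```python
import random

def rejection_rate(n_max: int = 100, experiments: int = 2000, seed: int = 1) -> float:
    """Estimate Pr(reject H0) when H0: mu = 0 is true and we keep
    sampling until |mean| >= 1.96*sigma/sqrt(n), up to n_max draws."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(experiments):
        total, n = 0.0, 0
        while n < n_max:
            total += rng.gauss(0.0, 1.0)  # sigma = 1, H0 true
            n += 1
            if abs(total / n) >= 1.96 / n**0.5:
                rejections += 1
                break
    return rejections / experiments

print(rejection_rate())  # noticeably larger than the nominal 0.05
```

With enough looks at the data, the rejection rate under a true null keeps growing — the weak repeated sampling principle’s complaint in simulation form.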
  • 33. O The ASA (p. 131) correctly warns that “[c]onducting multiple analyses of the data and reporting only those with certain p-values” leads to spurious p-values (Principle 4) O They don’t mention that the same p-hacked hypothesis can occur in Bayes factors, credibility intervals, likelihood ratios 33
  • 34. With One Big Difference: O “The direct grounds to criticize inferences as flouting error statistical control is lost O They condition on the actual data, O Error probabilities take into account other outcomes that could have occurred but did not (sampling distribution)” 34
  • 35. How might probabilists block intuitively unwarranted inferences (without error probabilities)? A subjective Bayesian might say: If our beliefs were mixed into the interpretation of the evidence, we wouldn’t declare there’s statistical evidence of some unbelievable claim (distinguishing shades of grey and being politically moderate, ovulation and voting preferences) 35
  • 36. Rescued by beliefs O That could work in some cases (it still wouldn’t show what researchers had done wrong)—battle of beliefs O Besides, researchers sincerely believe their hypotheses O So now you’ve got two sources of flexibility, priors and biasing selection effects 36
  • 37. No help with our most important problem O How to distinguish the warrant for a single hypothesis H with different methods (e.g., one has biasing selection effects, another, pre-registered results and precautions)? 37
  • 38. Most Bayesians use “default” priors O Eliciting subjective priors too difficult, scientists reluctant to allow subjective beliefs to overshadow data O Default, or reference priors are supposed to prevent prior beliefs from influencing the posteriors (O-Bayesians, 2006) 38
  • 39. O “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Default priors may not even be probabilities 
” (Cox and Mayo 2010, p. 299) O Default Bayesian Reforms are touted as free of selection effects O “
 Bayes factors can be used in the complete absence of a sampling plan 
” (Bayarri, Benjamin, Berger, Sellke 2016, p. 100) 39
  • 40. Granted, some are prepared to abandon the LP for model testing In an attempted meeting of the minds Andrew Gelman and Cosma Shalizi say: O “[C]rucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense, which might be seen as using modern statistics to implement the Popperian criteria of severe tests” (2013, p. 10). O An open question 40
  • 41. The ASA doc highlights classic foibles that block replication “In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result” (Fisher 1935, p. 14) “isolated” low P-value ≠> H: statistical effect 41
  • 42. Statistical ≠> substantive (H ≠> H*) “[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter...requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions” (Gigerenzer 1989, pp. 95-6) 42
  • 43. The Problem is with so-called NHST (“null hypothesis significance testing”) O NHSTs supposedly allow moving from statistical to substantive hypotheses O If defined that way, they exist only as abuses of tests O ASA doc ignores Neyman-Pearson (N-P) tests 43
  • 44. Neyman-Pearson (N-P) tests: A null and an alternative hypothesis H0, H1 that are exhaustive H0: ÎŒ ≀ 12 vs H1: ÎŒ > 12 O So this fallacy of rejection (H => H*) is impossible O Rejecting the null only indicates statistical alternatives (how discrepant from null) 44
  • 45. P-values Don’t Report Effect Sizes (Principle 5) Who ever said to just report a P-value? O “Tests should be accompanied by interpretive tools that avoid the fallacies of rejection and non-rejection. These correctives can be articulated in either Fisherian or Neyman-Pearson terms” (Mayo and Cox 2006, Mayo and Spanos 2006) 45
  • 46. To Avoid Inferring a Discrepancy Beyond What’s Warranted: the large n problem. O Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2) 46
  • 47. O What’s more indicative of a large effect (fire), a fire alarm that goes off with burnt toast or one so insensitive that it doesn’t go off unless the house is fully ablaze? O [The larger sample size is like the one that goes off with burnt toast] 47
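The burnt-toast point can be put in numbers — a sketch applying the severity formula from slide 59, where the sample sizes, σ = 1, and the 0.9 severity level are illustrative assumptions. For a just 0.05-significant one-sided result, compute the discrepancy Îł from ÎŒ0 = 0 such that ÎŒ > Îł passes with severity 0.9:

```python
from math import sqrt

def warranted_discrepancy(n: int, sigma: float = 1.0) -> float:
    """Discrepancy gamma such that 'mu > gamma' passes with severity 0.9,
    given a just-significant one-sided result xbar = 1.645*sigma/sqrt(n).
    SEV(mu > gamma) = Pr(Xbar < xbar; mu = gamma) = 0.9  =>
    gamma = xbar - 1.2816*sigma/sqrt(n)."""
    xbar = 1.645 * sigma / sqrt(n)
    return xbar - 1.2816 * sigma / sqrt(n)

# The same alpha-significant result warrants a 10x smaller discrepancy
# when n is 100x larger: the burnt-toast alarm in numbers.
print(round(warranted_discrepancy(100), 4))     # 0.0363
print(round(warranted_discrepancy(10_000), 4))  # 0.0036
```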
  • 48. What About Fallacies of Non-Significant Results? O They don’t warrant a 0 discrepancy O Use the same severity reasoning to rule out discrepancies that very probably would have resulted in a larger difference than observed; set upper bounds O If you very probably would have observed a more impressive (smaller) p-value than you did, if ÎŒ > ÎŒ1 (ÎŒ1 = ÎŒ0 + Îł), then the data are good evidence that ÎŒ < ÎŒ1 O Akin to power analysis (Cohen, Neyman) but sensitive to x0 48
  • 49. O There’s another kind of fallacy behind a move that’s supposed to improve replication, but it confuses notions from significance testing and leads to “Most findings are false” O Fake replication crisis. 49
  • 50. Diagnostic Screening Model of Tests: urn of nulls (“most findings are false”) O If we imagine randomly selecting a hypothesis from an urn of nulls, 90% of which are true O Consider just 2 possibilities: H0: no effect H1: meaningful effect, all else ignored O Take the prevalence of 90% as Pr(H0 you picked) = .9, Pr(H1) = .1 O Rejecting H0 with a single (just) .05 significant result, cherry-picking to boot 50
  • 51. The unsurprising result is that most “findings” are false: Pr(H0 | findings with a P-value of .05) > .5 Pr(H0 | findings with a P-value of .05) ≠ Pr(P-value of .05 | H0) (Only the second is a Type 1 error probability) Major source of confusion 
 (Berger and Sellke 1987, Ioannidis 2005, Colquhoun 2014) 51
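The screening arithmetic behind “most findings are false” can be sketched with Bayes’ theorem on the urn model. The power value 0.8 is an illustrative assumption, and 0.64 is the cherry-picked actual level from the Selvin slide:

```python
prior_h0, prior_h1 = 0.9, 0.1  # urn of nulls, 90% true
power = 0.8                    # Pr(reject; H1) -- an assumed value

def pr_h0_given_finding(alpha: float) -> float:
    """Pr(H0 | 'finding'), i.e., the chance an announced rejection
    is of a true null, by Bayes' theorem on the urn model."""
    return prior_h0 * alpha / (prior_h0 * alpha + prior_h1 * power)

print(round(pr_h0_given_finding(0.05), 2))  # 0.36 at the nominal level
print(round(pr_h0_given_finding(0.64), 2))  # 0.88 after hunting through 20 tests
```

Note what drives the “most findings are false” headline in this arithmetic: with error control intact, a rejection is more likely of a false null than a true one; it is the hunting-inflated actual level (together with the assumed urn prevalence) that pushes Pr(H0 | finding) above .5.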
  • 52. O A: Announce a finding (a P-value of .05) O Not properly Bayesian (not even empirical Bayes), not properly frequentist O Where does the high prevalence come from? 52
  • 53. Concluding Remark O If replication research and reforms are to lead to error correction, they must correct errors: they don’t always do that O They do when they encourage preregistration, control error probabilities & require good design (RCTs, checking model assumptions) O They don’t when they permit tools that lack error control 53
  • 54. Don’t Throw Out the Error Control Baby O Main source of hand-wringing behind the statistical crisis in science stems from cherry-picking, hunting for significance, multiple testing O These biasing selection effects are picked up by tools that assess error control (performance or severity) O Reforms based on “probabilisms” enable rather than check unreliable results due to biasing selection effects 54
  • 55. Repligate O Replication research has pushback: some call it methodological terrorism (enforcing good science or bullying?) O My gripe is that replications, at least in social psychology, should go beyond the statistical criticism 55
  • 56. Non-replications construed as simply weaker effects O One of the non-replications: cleanliness and morality: Does unscrambling soap words make you less judgmental? “Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. 
” (Chronicle of Higher Ed.) 56
  • 57. “
 Turns out, it did. Subjects who had unscrambled clean words weren’t as harsh on the guy who chows down on his chow.” (Chronicle of Higher Education) O Focusing on the P-values ignores larger questions of measurement in psych & the leap from the statistical to the substantive (H => H*) O Increasingly the basis for experimental philosophy; needs philosophical scrutiny 57
  • 58. The ASA’s Six Principles O (1) P-values can indicate how incompatible the data are with a specified statistical model O (2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone O (3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold O (4) Proper inference requires full reporting and transparency O (5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result O (6) By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis 58
  • 59. Test T+: Normal testing: H0: ÎŒ ≀ ÎŒ0 vs. H1: ÎŒ > ÎŒ0 σ known (FEV/SEV): If d(x) is not statistically significant, then ÎŒ < M0 + kΔσ/√n passes the test T+ with severity (1 – Δ) (FEV/SEV): If d(x) is statistically significant, then ÎŒ > M0 + kΔσ/√n passes the test T+ with severity (1 – Δ) where P(d(X) > kΔ) = Δ 59
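The FEV/SEV bound for a non-significant result can be computed directly. A sketch in which ÎŒ0 = 0, σ = 1, n = 25, and the observed mean 0.2 are illustrative numbers:

```python
from math import erf, sqrt

def Phi(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def severity_upper(mu0, xbar, sigma, n, gamma):
    """After a non-significant d(x), severity for the claim mu < mu0 + gamma:
    SEV = Pr(Xbar > xbar_obs; mu = mu0 + gamma)."""
    return 1 - Phi((xbar - (mu0 + gamma)) * sqrt(n) / sigma)

# Observed mean 0.2 with n = 25, sigma = 1: d(x) = 1.0, not significant.
for gamma in (0.2, 0.4, 0.6):
    print(gamma, round(severity_upper(0.0, 0.2, 1.0, 25, gamma), 3))
# Larger hypothesized discrepancies are ruled out with higher severity.
```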
  • 60. References O Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical Inference: A Discussion, edited by L. J. Savage. London: Methuen. O Berger, J. O. 2003 'Could Fisher, Jeffreys and Neyman Have Agreed on Testing?' and 'Rejoinder,', Statistical Science 18(1): 1-12; 28-32. O Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402. O Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225 (5237) (March 14): 1033. O Efron, B. 2013. 'A 250-Year Argument: Belief, Behavior, and the Bootstrap', Bulletin of the American Mathematical Society 50(1): 126-46. O Box, G. 1983. “An Apology for Ecumenism in Statistics,” in Box, G.E.P., Leonard, T. and Wu, D. F. J. (eds.), pp. 51-84, Scientific Inference, Data Analysis, and Robustness. New York: Academic Press. O Cox, D. R. and Hinkley, D. 1974. Theoretical Statistics. London: Chapman and Hall. O Cox, D. R., and Deborah G. Mayo. 2010. “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo and Aris Spanos, 276–304. Cambridge: Cambridge University Press. O Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd. O Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the Royal Statistical Society, Series B (Methodological) 17 (1) (January 1): 69–78. 60
  • 61. O Gelman, A. and Shalizi, C. 2013. 'Philosophy and the Practice of Bayesian Statistics' and 'Rejoinder', British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76-80. O Gigerenzer, G., Swijtink, Porter, T. Daston, L. Beatty, J, and Kruger, L. 1989. The Empire of Chance. Cambridge: Cambridge University Press. O Goldacre, B. 2008. Bad Science. HarperCollins Publishers. O Goldacre, B. 2016. “Make journals report clinical trials properly”, Nature 530(7588);online 02Feb2016. O Goodman SN. 1999. “Toward evidence-based medical statistics. 2: The Bayes factor,” Annals of Internal Medicine 1999; 130:1005 –1013. O Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston. O Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press. O Mayo, D. G. and Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive Inference" in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes- Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275. O Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2) (June 1): 323–357. 61
  • 62. O Mayo, D. G., and A. Spanos. 2011. “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook of the Philosophy of Science. The Netherlands: Elsevier. O Meehl, P. E., and N. G. Waller. 2002. “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7 (3): 283–300. O Morrison, D. E., and R. E. Henkel, ed. 1970. The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter. O Pearson, E. S. & Neyman, J. (1930). On the problem of two samples. Joint Statistical Papers by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press). First published in Bul. Acad. Pol.Sci. 73-96. O Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel. O Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen. O Selvin, H. 1970. “A critique of tests of significance in survey research. In The significance test controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine De Gruyter. O Simonsohn, U. 2013, "Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone", Psychological Science, vol. 24, no. 10, pp. 1875-1888. O Trafimow D. and Marks, M. 2015. “Editorial”, Basic and Applied Social Psychology 37(1): pp. 1-2. O Wasserstein, R. and Lazar, N. 2016. “The ASA’s statement on p-values: context, process, and purpose”, The American Statistician 62
  • 63. Abstract If a statistical methodology is to be adequate, it needs to register how “questionable research practices” (QRPs) alter a method’s error probing capacities. If little has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a stringent or severe test. The goal of severe testing is the linchpin for (re)interpreting frequentist methods so as to avoid long-standing fallacies at the heart of today’s statistics wars. A contrasting philosophy views statistical inference in terms of posterior probabilities in hypotheses: probabilism. Presupposing probabilism, critics mistakenly argue that significance and confidence levels are misinterpreted, exaggerate evidence, or are irrelevant for inference. Recommended replacements–Bayesian updating, Bayes factors, likelihood ratios–fail to control severity. 63