# Significance Testing'

Understand concept of significance and testing of significance

Significance testing is:

- The process of determining whether a difference between groups in a study is due to a real difference, or chance alone
- Performed using p-values
- Does not imply clinical significance

For a result to be statistically significant, there must be a 'real' difference between groups.- This difference does not have to be clinically meaningful
- e.g. A drug may reliably cause a 5mmHg decrease in SBP - this is unlikely to cause a meaningful drop in cardiovascular mortality but may be statistically significant

- This difference does not have to be clinically meaningful

*P* Values

The ** p-value** is the

**probability of obtaining a summary statistic**(e.g. a mean)

**equal to or more extreme than the observed result,**.

*provided the null hypothesis is true*The *p*-value is commonly (mis)used in frequentist significance testing.

- Prior to performing an experiment, a significance threshold (α) is selected
- Traditionally 0.05 (5%) or 0.01 (1%)

These values define the "false-positive rate".- When multiple tests are being performed on one set of data, the chance of a false-positive will increase
- To reduce the chance of a false positive occurring, the significance threshold for each test can be reduced. One method of this is the Bonferroni correction, where α is divided by the number of tests being performed.

- When multiple tests are being performed on one set of data, the chance of a false-positive will increase

- Traditionally 0.05 (5%) or 0.01 (1%)
- Then the experiment is performed, and a value for
*p*is calculated

If*p*< α, it suggests that the results are inconsistent with the null hypothesis (at that significance level), and it should be rejected.

### Problems with *P*-values

*P*-values are, when employed correctly, are useful. However, they do have several weaknesses:

- Assume the null hypothesis is true

The*p*-value assumes that there is no real difference between groups.- This may not be the case
- Not all hypotheses are created equal

There may be significant prior evidence supporting (or refuting) H_{A}- this will be ignored when interpreting a*p*-value.- Any study with significant results must therefore be interpreted in the context of:
- Biological plausibility of those results
- The previous evidence on the topic

- Any study with significant results must therefore be interpreted in the context of:
- It is a common misconception that the
*p*-value estimates the chance that the result is true

This is not the case. The p-value measures*how inconsistent the observed results are with the null hypothesis*.

- A threshold of 0.05 is not always appropriate

The cost of being wrong must be included when interpreting a p-value. If this is a true result, what are the potential benefits? If this is a false positive, what are the potential harms?

- Vulnerable to multiple comparisons

Conducting repeated analyses will eventually find a 'significant' result. At an α of 0.05, we would expect 1/20 analyses to be a false positive. Conducting 20 analyses would therefore generate one false positive result.

- Does not quantify effect size

A significant*p*-value simply suggests a difference exists, it does not measure how big this difference is.- A result may be statistically significant but clinically unimportant, e.g. an antihypertensive medication causing a decrease in SBP by 2mmHg may be statistically significant, but clinically unimportant.

- Related to sample size
*p*-values are affected by sample size:- A large effect size may be hidden by an insignificant
*p*-value if sample size is small - Similarly, a tiny effect size may be detected (i.e. a significant
*p*-value) if sample size is large

- A large effect size may be hidden by an insignificant

- Does not account for bias

Like other statistical test, the*p*-value cannot account for bias or confounding.

## References

- Wasserstein RL, Lazar NA. The ASA's Statement on p-Values: Context, Process, and Purpose. The American Statistician. 2016 vol: 70 (2) pp: 129-133.