Which Statistical Test Do I Use?

An interactive wizard with immediate feedback, a live decision tree, and maternal-fetal medicine (MFM) examples.

Interactive Test Selector

Answer the prompts. The recommended test updates instantly.

1. What type of outcome are you analyzing?

2. Are the groups independent or paired?

3. How many groups or levels are being compared?

4. Are the continuous data approximately normal, or is the sample size small?

5. Do you need to adjust for confounders or model predictors?

Choose the outcome type first. The tool will narrow to the best statistical test and explain why.

Common

Many questions hinge on paired vs independent data or small expected counts.

TIP: If you first identify the outcome type correctly, most questions become straightforward.

Decision Tree

The highlighted path shows how the tool is reasoning.

Outcome type → Pairing → Groups → Distribution → Modeling → Test

Side-by-Side Examples

Independent categorical

Compare the rate of preeclampsia in patients randomized to aspirin versus placebo.

Usually Chi-square

Paired categorical

Same patients classified as abnormal vs normal before and after an intervention.

McNemar test

Continuous, 2 groups

Mean birthweight in diabetic versus non-diabetic pregnancies.

t-test if roughly normal

Continuous, skewed

Bile acid levels compared between two groups with strong right skew.

Mann-Whitney U

Quick Reference

Categorical outcomes

Independent groups → Chi-square

Small expected counts → Fisher exact

Paired yes/no → McNemar

Continuous outcomes

2 groups, normal → t-test

3+ groups, normal → ANOVA

Paired, normal → paired t-test

Nonparametric + models

2 groups, skewed → Mann-Whitney

3+ groups, skewed → Kruskal-Wallis

Binary outcome model → Logistic regression

Time-to-event → Cox proportional hazards
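
The quick-reference rules can be sketched as a small lookup function. This is only an illustration of the decision order, not the page's actual wizard logic: the function name, argument names, and string labels are hypothetical, categorical outcomes are treated as yes/no, and the Wilcoxon signed-rank and linear regression branches are taken from the worked examples later on this page.

```python
def select_test(outcome, paired=False, groups=2, normal=True,
                small_expected=False, adjust=False):
    """Suggest a test from the quick-reference rules (hypothetical helper)."""
    if outcome == "time-to-event":
        return "Cox proportional hazards"
    if outcome == "categorical":
        if adjust:
            return "Logistic regression"   # binary outcome with covariates
        if paired:
            return "McNemar"               # paired yes/no
        return "Fisher exact" if small_expected else "Chi-square"
    if outcome == "continuous":
        if adjust:
            return "Linear regression"
        if paired:
            return "Paired t-test" if normal else "Wilcoxon signed-rank"
        if groups >= 3:
            return "ANOVA" if normal else "Kruskal-Wallis"
        return "t-test" if normal else "Mann-Whitney U"
    raise ValueError(f"unknown outcome type: {outcome!r}")


print(select_test("categorical", small_expected=True))
print(select_test("continuous", groups=3, normal=False))
```

The outcome type is checked first, mirroring the tip that identifying the outcome type correctly settles most questions.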

Diagnostic Test Interpretation Mini-Module

1) Start with the 2×2 table

Every diagnostic test question becomes easier if you first identify true positives, false positives, false negatives, and true negatives.

Sensitivity = true positive rate

Specificity = true negative rate

TIP: sensitivity and specificity describe the test itself and, unlike predictive values, do not change directly with disease prevalence.
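
As a minimal sketch of the 2×2 definitions, the following computes sensitivity and specificity from cell counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sensitivity(tp, fn):
    """True positive rate: diseased patients who test positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: non-diseased patients who test negative."""
    return tn / (tn + fp)


# Hypothetical 2x2 counts (not from any study)
tp, fp, fn, tn = 90, 50, 10, 850

sens = sensitivity(tp, fn)   # 90 / 100
spec = specificity(tn, fp)   # 850 / 900
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```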

2) Rule-out vs rule-in

High sensitivity helps rule out disease when the test is negative. High specificity helps rule in disease when the test is positive.

SnNout = Sensitive test, Negative result, rule OUT

SpPin = Specific test, Positive result, rule IN

3) Predictive values depend on prevalence

PPV and NPV change when disease prevalence changes, which is why they can vary across screening populations.

PPV rises as prevalence rises

NPV rises as prevalence falls

Board trap: do not treat PPV/NPV as fixed properties of the test.
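
To see the prevalence effect numerically, here is a short sketch that holds sensitivity and specificity fixed and varies prevalence. The 90% sensitivity, 90% specificity, and the two prevalence values are assumptions chosen for illustration.

```python
def predictive_values(sens, spec, prevalence):
    """Return (PPV, NPV) for a test applied at a given prevalence."""
    tp = sens * prevalence
    fn = (1 - sens) * prevalence
    fp = (1 - spec) * (1 - prevalence)
    tn = spec * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Same hypothetical test, two different screening populations
ppv_low, npv_low = predictive_values(0.90, 0.90, prevalence=0.01)
ppv_high, npv_high = predictive_values(0.90, 0.90, prevalence=0.20)

print(f"prevalence  1%: PPV = {ppv_low:.3f}, NPV = {npv_low:.3f}")
print(f"prevalence 20%: PPV = {ppv_high:.3f}, NPV = {npv_high:.3f}")
```

With these assumed inputs the PPV is roughly 8% at 1% prevalence and roughly 69% at 20% prevalence, while the NPV moves in the opposite direction.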

4) Likelihood ratios are very high yield

Likelihood ratios help you move from pretest probability to post-test probability and are usually more portable across populations than predictive values.

LR+ = sensitivity / (1 − specificity)

LR− = (1 − sensitivity) / specificity

Use LR+ when the test is positive and LR− when the test is negative.
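
The move from pretest to post-test probability goes through odds. The sketch below applies the two formulas above; the test characteristics (90% sensitivity and specificity) and the 10% pretest probability are hypothetical.

```python
def likelihood_ratios(sens, spec):
    """Return (LR+, LR-) from sensitivity and specificity."""
    return sens / (1 - spec), (1 - sens) / spec


def post_test_probability(pretest, lr):
    """Convert probability to odds, apply the LR, convert back."""
    pre_odds = pretest / (1 - pretest)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


lr_pos, lr_neg = likelihood_ratios(sens=0.90, spec=0.90)
after_positive = post_test_probability(0.10, lr_pos)
after_negative = post_test_probability(0.10, lr_neg)

print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")
print(f"post-test probability after a positive result: {after_positive:.3f}")
print(f"post-test probability after a negative result: {after_negative:.3f}")
```

Under these assumptions a positive result raises a 10% pretest probability to about 50%, and a negative result lowers it to about 1%.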

5) Quick LR interpretation

LR+ > 10 strongly increases probability

LR− < 0.1 strongly decreases probability

LR near 1 changes little

6) ROC curve and AUC

ROC curves plot sensitivity against 1 − specificity across thresholds. The AUC summarizes overall discrimination.

AUC = 0.5 no discrimination

AUC = 1.0 perfect discrimination

TIP: ROC is useful when comparing overall performance of competing tests or thresholds.
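
One way to make the AUC concrete: it equals the probability that a randomly chosen diseased patient has a higher test score than a randomly chosen non-diseased patient, with ties counted as one half. The sketch below uses that pairwise definition rather than tracing the curve; the scores are made up for illustration.

```python
def auc_pairwise(diseased_scores, healthy_scores):
    """AUC as the fraction of diseased/healthy pairs ranked correctly."""
    wins = 0.0
    for d in diseased_scores:
        for h in healthy_scores:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5   # ties count as half
    return wins / (len(diseased_scores) * len(healthy_scores))


# Hypothetical continuous test scores
diseased = [0.9, 0.8, 0.6, 0.4]
healthy = [0.7, 0.5, 0.3, 0.2]

auc = auc_pairwise(diseased, healthy)
print(f"AUC = {auc}")   # 13 of 16 pairs are ordered correctly
```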

7) Example: screening test

A screening test for fetal anemia is chosen to have very high sensitivity so that a negative result is reassuring.

Best interpretation: a negative test helps rule out the condition.

8) Example: positive screen in low prevalence setting

Even a good screening test can have a modest PPV when the underlying condition is uncommon.

Best interpretation: prevalence affects PPV, so positive screens often need confirmatory testing.

Fast algorithm: First decide whether the question is asking about intrinsic test performance (sensitivity, specificity, LR, ROC) or population-dependent performance (PPV, NPV). Then decide whether the result is being used to rule out, rule in, or compare thresholds.

Free Review Links

Focused external review pages for the tests most commonly used in MFM research and board questions.

Choosing the right test

UCLA OARC: What statistical analysis should I use?

Excellent quick decision guide for picking among common tests.

General free textbook

OpenIntro Statistics

Free textbook covering t-tests, ANOVA, regression, confidence intervals, and more.

Choosing the correct statistical test

NIH/PMC overview article

Good clinical review article on selecting the correct test.

Likelihood ratios

BMJ Statistics Notes: Likelihood ratios

Very high-yield for diagnostic test interpretation.

ROC curves

Critical Care review: ROC curves

Clear free review of ROC curves and AUC interpretation.

Diagnostic test characteristics

NCBI Bookshelf: Sensitivity, Specificity, Predictive Values and Likelihood Ratios

Strong overview of the 2×2 table and core diagnostic test metrics.

Quiz Mode

Use these vignette-style questions after the wizard to lock in the pattern recognition.

How it works

Select the best statistical test for each scenario. Your score updates instantly, and the explanation appears after each answer.

Difficulty: Basic = core patterns · MFM = most likely exam · Tricky = common mistakes and edge cases

Reference Page: Side-by-Side Calculations

More detailed worked examples for the most commonly tested biostatistics methods in MFM board-style questions.

How to use this section: Read the stem, identify the outcome type, decide whether the groups are independent or paired, and then compare your reasoning with the worked calculation.

Chi-square

Use: independent categorical data, adequate expected counts.

Example: aspirin group 10/100 with preeclampsia vs placebo group 20/100.

	Preeclampsia	No Preeclampsia	Total
Aspirin	10	90	100
Placebo	20	80	100
Total	30	170	200

Expected count formula: E = (row total × column total) / grand total

Aspirin + preeclampsia = (100 × 30) / 200 = 15
Aspirin + no preeclampsia = (100 × 170) / 200 = 85
Placebo + preeclampsia = 15
Placebo + no preeclampsia = 85

Chi-square formula: χ² = Σ (O−E)²/E

χ² = (10−15)²/15 + (90−85)²/85 + (20−15)²/15 + (80−85)²/85

= 25/15 + 25/85 + 25/15 + 25/85

≈ 1.67 + 0.29 + 1.67 + 0.29 = 3.92

Key point: uses all 4 cells of the table.
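
The calculation above can be reproduced in a few lines. This sketch uses only the standard library; the p-value line relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so it applies to 2×2 tables only. No continuity correction is applied, matching the hand calculation.

```python
import math

# Observed counts: rows = aspirin, placebo; columns = preeclampsia, none
observed = [[10, 90], [20, 80]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (o - expected) ** 2 / expected

# Upper-tail probability for 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")
```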

Fisher exact

Use: independent categorical data, small expected counts.

Example: rare outcome, exposure 1/10 vs no exposure 8/10.

	Rare Outcome	No Rare Outcome	Total
Exposure	1	9	10
No Exposure	8	2	10
Total	9	11	20

Expected exposed-event count = (10 × 9) / 20 = 4.5

Because an expected count is < 5, the Chi-square approximation is unreliable.

What Fisher exact does: it calculates the exact probability of the observed 2×2 table, and tables more extreme, given the fixed row and column totals.

Key point: best for sparse 2×2 tables.
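
A sketch of the exact calculation for this table, using the hypergeometric distribution with the margins held fixed. The two-sided p-value here follows one common convention (sum the probabilities of all tables no more likely than the observed one); other conventions exist, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(k):
        # Probability that the top-left cell equals k given fixed margins
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small tolerance guards against floating-point ties
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_observed * (1 + 1e-9))


p_value = fisher_exact_two_sided(1, 9, 8, 2)
print(f"two-sided p = {p_value:.4f}")
```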

McNemar

Use: paired yes/no data.

Example: before positive/after negative = 20, before negative/after positive = 10.

	After Positive	After Negative	Total
Before Positive	30	20	50
Before Negative	10	40	50
Total	40	60	100

Only discordant cells matter: b = 20 and c = 10.
The concordant cells (30 and 40) do not drive the test.

McNemar formula: χ² = (b−c)² / (b+c)

= (20−10)² / (20+10)

= 100 / 30 ≈ 3.33

With continuity correction: χ² = (|b−c|−1)² / (b+c) = 9²/30 = 2.7

Key point: uses only discordant pairs.
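
The same arithmetic as a short sketch, with and without the continuity correction. The p-value uses the 1-degree-of-freedom chi-square tail; the function name is hypothetical.

```python
import math


def mcnemar(b, c, continuity=False):
    """McNemar chi-square and p-value from the two discordant counts."""
    diff = abs(b - c) - (1 if continuity else 0)
    chi2 = diff ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))   # 1 degree of freedom
    return chi2, p_value


chi2, p = mcnemar(20, 10)
chi2_cc, p_cc = mcnemar(20, 10, continuity=True)

print(f"uncorrected: chi-square = {chi2:.2f}, p = {p:.3f}")
print(f"corrected:   chi-square = {chi2_cc:.2f}, p = {p_cc:.3f}")
```

Note that the concordant counts (30 and 40) never enter the function.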

t-test

Use: 2 independent groups, continuous normal data.

Example: mean birthweight 3200 g vs 3000 g.

Group	n	Mean	SD
Diabetic	50	3200	400
Non-diabetic	50	3000	350

Difference in means = 3200 − 3000 = 200 g

Conceptual formula: t = (mean₁ − mean₂) / standard error of the difference

Here the question is whether the 200 g difference is large relative to the variability within groups.

Key point: compare means when data are roughly normal.
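
The conceptual formula can be filled in from the summary table. This sketch uses the unpooled (Welch) standard error, which is an assumption about which variant is wanted; with equal group sizes, as here, the pooled standard error gives the same value.

```python
import math


def t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic from group means, SDs, and sizes (unpooled SE)."""
    se_diff = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se_diff, se_diff


t_stat, se = t_from_summary(3200, 400, 50, 3000, 350, 50)
print(f"SE of difference = {se:.1f} g, t = {t_stat:.2f}")
```

Here the 200 g difference is about 2.7 standard errors from zero.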

Paired t-test

Use: paired continuous normal data.

Example: hemoglobin before and after treatment in the same patients.

Patient	Before	After	Difference
1	9.0	10.1	+1.1
2	9.4	10.2	+0.8
3	8.9	10.3	+1.4
4	9.5	10.1	+0.6
5	9.1	10.1	+1.0

Mean paired difference = (1.1 + 0.8 + 1.4 + 0.6 + 1.0) / 5 = 0.98

Instead of comparing two separate means, calculate a within-person difference for each subject and test whether the mean difference is 0.

Key point: same patients measured twice.
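
A minimal sketch of the paired calculation on the five patients above: take the within-person differences, then divide their mean by its standard error.

```python
import math
import statistics

before = [9.0, 9.4, 8.9, 9.5, 9.1]
after = [10.1, 10.2, 10.3, 10.1, 10.1]

# One difference per patient
differences = [a - b for a, b in zip(after, before)]

mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)          # sample SD (n - 1)
se_diff = sd_diff / math.sqrt(len(differences))
t_stat = mean_diff / se_diff                     # df = n - 1 = 4

print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.2f}")
```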

ANOVA

Use: 3 or more independent groups, continuous normal data.

Example: mean birthweight across BMI class I, II, and III.

Group	n	Mean birthweight
BMI I	40	3000 g
BMI II	40	3150 g
BMI III	40	3350 g

Overall question: are these three means different beyond what would be expected from within-group variation alone?

ANOVA asks whether the between-group variability is larger than the within-group variability.

Conceptually: F = variance between groups / variance within groups

Key point: use instead of repeated t-tests across 3+ groups.

Mann-Whitney U

Use: 2 independent groups, skewed continuous data.

Example: bile acids reported as medians and IQRs.

Group A	Group B
4	3
6	5
8	7
30	9

Pool and rank all values: 3(1), 4(2), 5(3), 6(4), 7(5), 8(6), 9(7), 30(8)

Group A ranks = 2, 4, 6, 8 → rank sum = 20
Group B ranks = 1, 3, 5, 7 → rank sum = 16

The test asks whether one group tends to have systematically higher ranks than the other.

Key point: rank-based, robust to skew and outliers.
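
The ranking step and the U statistic can be sketched as follows. The helper name is hypothetical; it assigns tied values the average of their ranks, although this example has no ties.

```python
def average_ranks(values):
    """Rank from 1; tied values share the mean of their ranks."""
    return [sum(x < v for x in values) + (sum(x == v for x in values) + 1) / 2
            for v in values]


group_a = [4, 6, 8, 30]
group_b = [3, 5, 7, 9]

ranks = average_ranks(group_a + group_b)
rank_sum_a = sum(ranks[:len(group_a)])
rank_sum_b = sum(ranks[len(group_a):])

# U = rank sum minus the smallest rank sum the group could have
u_a = rank_sum_a - len(group_a) * (len(group_a) + 1) / 2
u_b = rank_sum_b - len(group_b) * (len(group_b) + 1) / 2

print(f"rank sums: A = {rank_sum_a}, B = {rank_sum_b}")
print(f"U_A = {u_a}, U_B = {u_b}, U = {min(u_a, u_b)}")
```

The outlier of 30 contributes only its rank (8), which is why the test is robust to extreme values.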

Wilcoxon signed-rank

Use: paired continuous non-normal data.

Example: paired sodium values before and after therapy, skewed distribution.

Patient	Difference	Absolute difference	Rank
1	+2	2	3.5
2	+1	1	1.5
3	−1	1	1.5
4	+3	3	5
5	+2	2	3.5

Positive signed ranks = 3.5 + 1.5 + 5 + 3.5 = 13.5
Negative signed ranks = 1.5

The test asks whether the signed ranks are centered around 0, rather than whether the mean paired difference is 0.

Key point: paired analogue of rank-based testing.
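
The signed-rank sums in the table can be reproduced with a short sketch. The helper name is hypothetical; zero differences, which this example does not contain, would normally be dropped before ranking.

```python
def average_ranks(values):
    """Rank from 1; tied values share the mean of their ranks."""
    return [sum(x < v for x in values) + (sum(x == v for x in values) + 1) / 2
            for v in values]


differences = [2, 1, -1, 3, 2]

# Rank the absolute differences, then split the ranks by sign
ranks = average_ranks([abs(d) for d in differences])
w_plus = sum(r for d, r in zip(differences, ranks) if d > 0)
w_minus = sum(r for d, r in zip(differences, ranks) if d < 0)

print(f"ranks = {ranks}")
print(f"W+ = {w_plus}, W- = {w_minus}")
```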

Kruskal-Wallis

Use: 3 or more independent groups, skewed continuous data.

Example: skewed length of stay across three treatment groups.

Group A	Group B	Group C
1	2	4
2	3	5
10	12	15

Rank all values together: 1(1), 2(2.5), 2(2.5), 3(4), 4(5), 5(6), 10(7), 12(8), 15(9)

Average ranks differ across groups if one group tends to have systematically higher values.

This is the nonparametric analogue of asking whether the group means differ in ANOVA.

Key point: nonparametric analogue of ANOVA.
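
A sketch of the H statistic for the three groups above. It uses the standard formula H = 12 / (N(N+1)) × Σ(Rᵢ² / nᵢ) − 3(N+1) without the tie correction, which is a simplification given the one tied pair; the helper name is hypothetical.

```python
def average_ranks(values):
    """Rank from 1; tied values share the mean of their ranks."""
    return [sum(x < v for x in values) + (sum(x == v for x in values) + 1) / 2
            for v in values]


groups = {"A": [1, 2, 10], "B": [2, 3, 12], "C": [4, 5, 15]}

pooled = [v for values in groups.values() for v in values]
ranks = average_ranks(pooled)
n_total = len(pooled)

# Sum the pooled ranks within each group
rank_sums = {}
start = 0
for name, values in groups.items():
    rank_sums[name] = sum(ranks[start:start + len(values)])
    start += len(values)

h_stat = (12 / (n_total * (n_total + 1))
          * sum(rank_sums[g] ** 2 / len(groups[g]) for g in groups)
          - 3 * (n_total + 1))

print(f"rank sums = {rank_sums}")
print(f"H = {h_stat:.2f} (uncorrected for ties)")
```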

Logistic regression

Use: binary outcome with adjustment.

Example: odds of preeclampsia adjusted for age, BMI, parity.

Predictor	β coefficient	Approximate interpretation
BMI	0.08	Higher BMI raises log odds
Age	0.03	Older age slightly raises log odds
Prior PE	1.10	Much higher odds

Model form: log odds(outcome) = β₀ + β₁x₁ + β₂x₂ + …

Example: if β for BMI = 0.08, then OR = e^0.08 ≈ 1.08

Interpretation: for each 1-unit increase in BMI, the odds of preeclampsia increase by about 8%, holding the other covariates constant.

Key point: think adjusted odds ratio for yes/no outcomes.
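
Converting the coefficients in the table to odds ratios is a single exponentiation. The sketch below also shows that a multi-unit change multiplies on the odds scale; the 5-unit BMI contrast is an added illustration, not part of the original example.

```python
import math

# Coefficients from the table above (log-odds scale)
betas = {"BMI": 0.08, "Age": 0.03, "Prior PE": 1.10}

odds_ratios = {name: math.exp(beta) for name, beta in betas.items()}
for name, odds_ratio in odds_ratios.items():
    print(f"{name}: OR = {odds_ratio:.2f}")

# A 5-unit BMI increase: multiply beta by 5 before exponentiating
or_bmi_5 = math.exp(5 * betas["BMI"])
print(f"OR for a 5-unit BMI increase = {or_bmi_5:.2f}")
```

Note that the 5-unit odds ratio (about 1.49) is 1.08 raised to the fifth power, not 5 × 8%.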

Linear regression

Use: continuous outcome with adjustment or prediction.

Example: predict birthweight from gestational age and maternal covariates.

Predictor	β coefficient	Interpretation
Gestational age (weeks)	180	+180 g per week
Smoking	−120	120 g lower if smoker
Male fetus	+90	90 g higher

Model form: y = β₀ + β₁x₁ + β₂x₂ + …

If β for gestational age is 180, then each additional week is associated with an average increase of 180 g in birthweight, holding the other predictors constant.

Key point: continuous modeled outcome.
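
Because the model is additive, the expected difference between two patients is the sum of each coefficient times the difference in that predictor, and the intercept cancels. The sketch below uses the coefficients from the table; the two patient profiles and the key names are hypothetical.

```python
# Coefficients from the table above (grams)
betas = {"gestational_age_weeks": 180, "smoking": -120, "male_fetus": 90}


def expected_difference(profile_a, profile_b):
    """Expected birthweight of A minus B; the intercept cancels out."""
    return sum(beta * (profile_a[name] - profile_b[name])
               for name, beta in betas.items())


# Hypothetical profiles for illustration
patient_a = {"gestational_age_weeks": 39, "smoking": 0, "male_fetus": 1}
patient_b = {"gestational_age_weeks": 37, "smoking": 1, "male_fetus": 0}

diff = expected_difference(patient_a, patient_b)
print(f"expected difference = {diff} g")   # 2*180 + 120 + 90
```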

Cox proportional hazards

Use: time-to-event outcome.

Example: time from FGR diagnosis to delivery.

Group	Median time to delivery	Hazard ratio
Abnormal UA Doppler	10 days	1.8
Normal UA Doppler	18 days	Reference

Focus is not simply whether delivery occurred, but when it occurred.

A hazard ratio of 1.8 means the instantaneous rate of delivery is 80% higher in the abnormal UA Doppler group at a given time point, assuming proportional hazards.

Key point: use when timing matters.