Which Statistical Test Do I Use?

This is an interactive clickable wizard with immediate feedback, a live decision tree, and MFM-specific examples.

Interactive Test Selector

Answer the prompts. The recommended test updates instantly.

1. What type of outcome are you analyzing?

2. Are the groups independent or paired?

3. How many groups or levels are being compared?

4. Is the continuous data approximately normal, or is sample size small?

5. Do you need to adjust for confounders or model predictors?

Start the wizard

Choose the outcome type first. The tool will narrow to the best statistical test and explain why.

When to use it

Select a path to see the correct scenario.

Common

Many questions hinge on paired vs independent data or small expected counts.
TIP: If you first identify the outcome type correctly, most questions become straightforward.

Decision Tree

The highlighted path shows how the tool is reasoning.

Outcome type → Pairing → Groups → Distribution → Modeling → Test

Side-by-Side Examples

Independent categorical

Compare the rate of preeclampsia in patients randomized to aspirin versus placebo.
Usually Chi-square

Paired categorical

Same patients classified as abnormal vs normal before and after an intervention.
McNemar test

Continuous, 2 groups

Mean birthweight in diabetic versus non-diabetic pregnancies.
t-test if roughly normal

Continuous, skewed

Bile acid levels compared between two groups with strong right skew.
Mann-Whitney U

Quick Reference

Categorical outcomes

Independent groups → Chi-square
Small expected counts → Fisher exact
Paired yes/no → McNemar

Continuous outcomes

2 groups, normal → t-test
3+ groups, normal → ANOVA
Paired, normal → paired t-test

Nonparametric + models

2 groups, skewed → Mann-Whitney
3+ groups, skewed → Kruskal-Wallis
Binary outcome model → Logistic regression
Time-to-event → Cox proportional hazards
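The quick reference above can be sketched as a small lookup function. This is only an illustration of the decision logic: the function name `suggest_test`, its parameters, and the outcome labels ("binary", "continuous", "time-to-event") are assumptions, not part of the wizard itself, and combinations the reference does not list raise an error.

```python
def suggest_test(outcome, paired=False, groups=2, normal=True,
                 small_expected_counts=False, adjust=False):
    """Map wizard-style answers to a test name (hypothetical helper)."""
    if outcome == "time-to-event":
        return "Cox proportional hazards"
    if outcome == "binary":
        if adjust:
            return "Logistic regression"
        if paired:
            return "McNemar"
        return "Fisher exact" if small_expected_counts else "Chi-square"
    if outcome == "continuous":
        if adjust:
            return "Linear regression"
        if groups >= 3:
            if paired:
                raise ValueError("paired 3+ group designs are not covered here")
            return "ANOVA" if normal else "Kruskal-Wallis"
        if paired:
            return "Paired t-test" if normal else "Wilcoxon signed-rank"
        return "t-test" if normal else "Mann-Whitney U"
    raise ValueError(f"unknown outcome type: {outcome}")


print(suggest_test("binary", paired=True))
print(suggest_test("continuous", groups=3, normal=False))
```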

Diagnostic Test Interpretation Mini-Module

1) Start with the 2×2 table

Every diagnostic test question becomes easier if you first identify true positives, false positives, false negatives, and true negatives.
Sensitivity = true positive rate
Specificity = true negative rate
TIP: Sensitivity and specificity describe the test itself and, unlike predictive values, do not change directly with disease prevalence.
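As a minimal sketch, sensitivity and specificity can be read straight off the four cells of the 2×2 table. The counts below are made up purely for illustration and do not come from any study.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from 2x2 table counts."""
    sensitivity = tp / (tp + fn)  # true positives among all diseased
    specificity = tn / (tn + fp)  # true negatives among all non-diseased
    return sensitivity, specificity


# Hypothetical counts: 100 diseased, 200 non-diseased
sens, spec = sensitivity_specificity(tp=90, fp=30, fn=10, tn=170)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```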

2) Rule-out vs rule-in

High sensitivity helps rule out disease when the test is negative. High specificity helps rule in disease when the test is positive.
SnNout = Sensitive test, Negative result, rule OUT
SpPin = Specific test, Positive result, rule IN

3) Predictive values depend on prevalence

PPV and NPV change when disease prevalence changes, which is why they can vary across screening populations.
PPV rises as prevalence rises
NPV rises as prevalence falls
Board trap: do not treat PPV/NPV as fixed properties of the test.
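The prevalence dependence can be checked numerically with Bayes' rule. The sketch below assumes a hypothetical test with 90% sensitivity and 90% specificity and compares a 1% prevalence setting with a 20% prevalence setting; the numbers are illustrative only.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given prevalence via Bayes' rule."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Same hypothetical test, two different populations
ppv_low, npv_low = predictive_values(0.90, 0.90, prevalence=0.01)
ppv_high, npv_high = predictive_values(0.90, 0.90, prevalence=0.20)
print(f"prevalence 1%:  PPV={ppv_low:.3f}, NPV={npv_low:.3f}")
print(f"prevalence 20%: PPV={ppv_high:.3f}, NPV={npv_high:.3f}")
```

With the test held fixed, PPV is far lower in the low-prevalence population while NPV is slightly higher.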

4) Likelihood ratios are very high yield

Likelihood ratios help you move from pretest probability to post-test probability and are usually more portable across populations than predictive values.
LR+ = sensitivity / (1 − specificity)
LR− = (1 − sensitivity) / specificity
Use LR+ when the test is positive and LR− when the test is negative.
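A short sketch of how a likelihood ratio moves pretest probability to post-test probability: convert probability to odds, multiply by the LR, and convert back. The sensitivity, specificity, and pretest probability used here are assumed values for illustration.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pretest_probability, lr):
    """Pretest probability -> odds -> multiply by LR -> probability."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_odds = pretest_odds * lr
    return post_odds / (1 + post_odds)


# Hypothetical test: sensitivity 0.90, specificity 0.80, pretest probability 20%
lr_pos, lr_neg = likelihood_ratios(0.90, 0.80)
after_positive = post_test_probability(0.20, lr_pos)
after_negative = post_test_probability(0.20, lr_neg)
print(f"LR+={lr_pos:.2f}, LR-={lr_neg:.3f}")
print(f"post-test after positive={after_positive:.3f}, after negative={after_negative:.3f}")
```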

5) Quick LR interpretation

LR+ > 10 strongly increases probability
LR− < 0.1 strongly decreases probability
LR near 1 changes little

6) ROC curve and AUC

ROC curves plot sensitivity against 1 − specificity across thresholds. The AUC summarizes overall discrimination.
AUC = 0.5 no discrimination
AUC = 1.0 perfect discrimination
TIP: ROC is useful when comparing overall performance of competing tests or thresholds.
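One way to see what the AUC measures: it equals the probability that a randomly chosen diseased patient scores higher than a randomly chosen non-diseased patient, with ties counted as one half. The sketch below computes it by brute-force pair comparison; the scores are invented for illustration.

```python
def auc(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical test scores
diseased = [0.9, 0.8, 0.4]
healthy = [0.7, 0.3, 0.2]
print(f"AUC = {auc(diseased, healthy):.3f}")
```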

7) Example: screening test

A screening test for fetal anemia is chosen to have very high sensitivity so that a negative result is reassuring.
Best interpretation: a negative test helps rule out the condition.

8) Example: positive screen in low prevalence setting

Even a good screening test can have a modest PPV when the underlying condition is uncommon.
Best interpretation: prevalence affects PPV, so positive screens often need confirmatory testing.
Fast algorithm: First decide whether the question is asking about intrinsic test performance (sensitivity, specificity, LR, ROC) or population-dependent performance (PPV, NPV). Then decide whether the result is being used to rule out, rule in, or compare thresholds.

Interactive Ultimate Test Selection Flowchart

Click through the decision tree below. It highlights the path and can auto-fill the wizard above.

Quick interactive map
1. Choose outcome type

Free Review Links

Focused external review pages for the tests most commonly used in MFM research and board questions.

Choosing the right test

Excellent quick decision guide for picking among common tests.

General free textbook

Free textbook covering t-tests, ANOVA, regression, confidence intervals, and more.

Choosing the correct statistical test

Good clinical review article on selecting the correct test.

Likelihood ratios

Very high-yield for diagnostic test interpretation.

ROC curves

Clear free review of ROC curves and AUC interpretation.

Diagnostic test characteristics

Strong overview of the 2×2 table and core diagnostic test metrics.

Quiz Mode

Use these vignette-style questions after the wizard to lock in the pattern recognition.

How it works

Select the best statistical test for each scenario. Your score updates instantly, and the explanation appears after each answer.
Basic = core patterns · MFM = most likely on the exam · Tricky = common mistakes and edge cases

Reference Page: Side-by-Side Calculations

More detailed worked examples for the most commonly tested biostatistics methods in MFM board-style questions.

How to use this section: Read the stem, identify the outcome type, decide whether the groups are independent or paired, and then compare your reasoning with the worked calculation.

Chi-square

Use: independent categorical data, adequate expected counts.
Example: aspirin group 10/100 with preeclampsia vs placebo group 20/100.
| | Preeclampsia | No preeclampsia | Total |
|---|---|---|---|
| Aspirin | 10 | 90 | 100 |
| Placebo | 20 | 80 | 100 |
| Total | 30 | 170 | 200 |
Expected count formula: E = (row total × column total) / grand total

Aspirin + preeclampsia = (100 × 30) / 200 = 15
Aspirin + no preeclampsia = (100 × 170) / 200 = 85
Placebo + preeclampsia = 15
Placebo + no preeclampsia = 85
Chi-square formula: χ² = Σ (O−E)²/E

χ² = (10−15)²/15 + (90−85)²/85 + (20−15)²/15 + (80−85)²/85

= 25/15 + 25/85 + 25/15 + 25/85

≈ 1.67 + 0.29 + 1.67 + 0.29 = 3.92
Key point: uses all 4 cells of the table.
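The hand calculation above can be reproduced in a few lines. This sketch computes the Pearson chi-square without a continuity correction (matching the worked example) and, as an added step not shown in the text, a p-value that is valid only for 1 degree of freedom, using the identity that a 1-df chi-square is a squared standard normal.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) and 1-df p-value."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    # For 1 df only: P(chi2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: aspirin, placebo; columns: preeclampsia, no preeclampsia
chi2, p = chi_square_2x2([[10, 90], [20, 80]])
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
```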

Fisher exact

Use: independent categorical data, small expected counts.
Example: rare outcome, exposure 1/10 vs no exposure 8/10.
| | Rare outcome | No rare outcome | Total |
|---|---|---|---|
| Exposure | 1 | 9 | 10 |
| No exposure | 8 | 2 | 10 |
| Total | 9 | 11 | 20 |
Expected exposed-event count = (10 × 9) / 20 = 4.5

Because an expected count is < 5, the Chi-square approximation is unreliable.
What Fisher exact does: it calculates the exact probability of the observed 2×2 table, and tables more extreme, given the fixed row and column totals.
Key point: best for sparse 2×2 tables.
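A minimal sketch of the exact calculation for the table above, using the hypergeometric distribution with the margins held fixed. The two-sided p-value here sums every table that is no more probable than the observed one, which is one common convention; other definitions of "more extreme" exist.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1 = a + b
    col1 = a + c
    n = a + b + c + d

    def ways(x):
        # Number of tables with x in the top-left cell, margins fixed
        return comb(col1, x) * comb(n - col1, row1 - x)

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    observed = ways(a)
    # Integer comparison avoids floating-point ties
    extreme = sum(ways(x) for x in range(lowest, highest + 1)
                  if ways(x) <= observed)
    return extreme / comb(n, row1)


p_fisher = fisher_exact_two_sided(1, 9, 8, 2)
print(f"Fisher exact two-sided p = {p_fisher:.4f}")
```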

McNemar

Use: paired yes/no data.
Example: before positive/after negative = 20, before negative/after positive = 10.
| | After positive | After negative | Total |
|---|---|---|---|
| Before positive | 30 | 20 | 50 |
| Before negative | 10 | 40 | 50 |
| Total | 40 | 60 | 100 |
Only discordant cells matter: b = 20 and c = 10.
The concordant cells (30 and 40) do not drive the test.
McNemar formula: χ² = (b−c)² / (b+c)

= (20−10)² / (20+10)

= 100 / 30 ≈ 3.33

With continuity correction: χ² = (|b−c|−1)² / (b+c) = 9²/30 = 2.7
Key point: uses only discordant pairs.
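Both versions of the statistic above can be checked with a short function. The p-value line is an addition not shown in the text and relies on the 1-degree-of-freedom chi-square approximation.

```python
import math


def mcnemar(b, c, continuity=False):
    """McNemar chi-square from the two discordant cell counts."""
    numerator = (abs(b - c) - 1) ** 2 if continuity else (b - c) ** 2
    statistic = numerator / (b + c)
    # 1-df chi-square tail probability
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


stat_plain, p_plain = mcnemar(20, 10)
stat_corrected, p_corrected = mcnemar(20, 10, continuity=True)
print(f"uncorrected: chi-square = {stat_plain:.2f}, p = {p_plain:.3f}")
print(f"corrected:   chi-square = {stat_corrected:.2f}, p = {p_corrected:.3f}")
```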

t-test

Use: 2 independent groups, continuous normal data.
Example: mean birthweight 3200 g vs 3000 g.
| Group | n | Mean (g) | SD (g) |
|---|---|---|---|
| Diabetic | 50 | 3200 | 400 |
| Non-diabetic | 50 | 3000 | 350 |
Difference in means = 3200 − 3000 = 200 g
Conceptual formula: t = (mean₁ − mean₂) / standard error of the difference

Here the question is whether the 200 g difference is large relative to the variability within groups.
Key point: compare means when data are roughly normal.
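To make the conceptual formula concrete, the t statistic can be computed from the summary table above. This sketch uses the unpooled standard error sqrt(s₁²/n₁ + s₂²/n₂); because the two groups have equal n, the pooled-variance form gives the same value here.

```python
import math


def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, standard error) from summary statistics (unpooled SE)."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t = (mean1 - mean2) / standard_error
    return t, standard_error


t_stat, se_diff = two_sample_t(3200, 400, 50, 3000, 350, 50)
print(f"SE of difference = {se_diff:.1f} g, t = {t_stat:.2f}")
```

The 200 g difference is about 2.7 standard errors from zero.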

Paired t-test

Use: paired continuous normal data.
Example: hemoglobin before and after treatment in the same patients.
| Patient | Before | After | Difference |
|---|---|---|---|
| 1 | 9.0 | 10.1 | +1.1 |
| 2 | 9.4 | 10.2 | +0.8 |
| 3 | 8.9 | 10.3 | +1.4 |
| 4 | 9.5 | 10.1 | +0.6 |
| 5 | 9.1 | 10.1 | +1.0 |
Mean paired difference = (1.1 + 0.8 + 1.4 + 0.6 + 1.0) / 5 = 0.98
Instead of comparing two separate means, calculate a within-person difference for each subject and test whether the mean difference is 0.
Key point: same patients measured twice.
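The paired calculation can be carried one step further to the t statistic, which is the mean difference divided by its standard error (SD of the differences over √n). A minimal sketch using the five patients above:

```python
import math
import statistics

before = [9.0, 9.4, 8.9, 9.5, 9.1]
after = [10.1, 10.2, 10.3, 10.1, 10.1]

# Within-person differences, then a one-sample t-test against 0
diffs = [a - b for a, b in zip(after, before)]
mean_diff = statistics.mean(diffs)
se_mean_diff = statistics.stdev(diffs) / math.sqrt(len(diffs))
t_paired = mean_diff / se_mean_diff  # compare with t on n - 1 = 4 df

print(f"mean difference = {mean_diff:.2f}, t = {t_paired:.2f}")
```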

ANOVA

Use: 3 or more independent groups, continuous normal data.
Example: mean birthweight across BMI class I, II, and III.
| Group | n | Mean birthweight |
|---|---|---|
| BMI I | 40 | 3000 g |
| BMI II | 40 | 3150 g |
| BMI III | 40 | 3350 g |
Overall question: are these three means different beyond what would be expected from within-group variation alone?
ANOVA asks whether the between-group variability is larger than the within-group variability.

Conceptually: F = variance between groups / variance within groups
Key point: use instead of repeated t-tests across 3+ groups.
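The F ratio needs within-group variability, which the summary table above does not provide, so the sketch below uses tiny made-up numbers (not birthweights) simply to show how the between-group and within-group mean squares are formed.

```python
def one_way_anova_f(groups):
    """F = mean square between groups / mean square within groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (n_total - len(groups))
    return ms_between / ms_within


# Toy data, purely illustrative
toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat = one_way_anova_f(toy_groups)
print(f"F = {f_stat:.2f}")
```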

Mann-Whitney U

Use: 2 independent groups, skewed continuous data.
Example: bile acids reported as medians and IQRs.
| Group A | Group B |
|---|---|
| 4 | 3 |
| 6 | 5 |
| 8 | 7 |
| 30 | 9 |
Pool and rank all values: 3(1), 4(2), 5(3), 6(4), 7(5), 8(6), 9(7), 30(8)
Group A ranks = 2, 4, 6, 8 → rank sum = 20
Group B ranks = 1, 3, 5, 7 → rank sum = 16

The test asks whether one group tends to have systematically higher ranks than the other.
Key point: rank-based, robust to skew and outliers.
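The ranking step above can be reproduced in code, along with the U statistic, which is each rank sum minus n(n+1)/2 for that group. The helper name `average_ranks` is an illustrative choice; it assigns tied values the average of the ranks they occupy.

```python
def average_ranks(values):
    """Map each value to its 1-based rank, averaging ranks across ties."""
    ordered = sorted(values)
    return {v: ordered.index(v) + 1 + (ordered.count(v) - 1) / 2
            for v in set(ordered)}


group_a = [4, 6, 8, 30]
group_b = [3, 5, 7, 9]

ranks = average_ranks(group_a + group_b)
rank_sum_a = sum(ranks[v] for v in group_a)
rank_sum_b = sum(ranks[v] for v in group_b)

# U for each group: rank sum minus the smallest possible rank sum
u_a = rank_sum_a - len(group_a) * (len(group_a) + 1) / 2
u_b = rank_sum_b - len(group_b) * (len(group_b) + 1) / 2

print(f"rank sums: A={rank_sum_a}, B={rank_sum_b}; U_A={u_a}, U_B={u_b}")
```

Note that the outlier of 30 contributes only its rank (8), which is why the test is robust to extreme values.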

Wilcoxon signed-rank

Use: paired continuous non-normal data.
Example: paired sodium values before and after therapy, skewed distribution.
| Patient | Difference | Absolute difference | Rank |
|---|---|---|---|
| 1 | +2 | 2 | 3.5 |
| 2 | +1 | 1 | 1.5 |
| 3 | −1 | 1 | 1.5 |
| 4 | +3 | 3 | 5 |
| 5 | +2 | 2 | 3.5 |
Positive signed ranks = 3.5 + 1.5 + 5 + 3.5 = 13.5
Negative signed ranks = 1.5
The test asks whether the signed ranks are centered around 0, rather than whether the mean paired difference is 0.
Key point: paired analogue of rank-based testing.
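A short sketch of the signed-rank bookkeeping for the five differences above: rank the absolute differences (averaging ties), then sum the ranks separately for positive and negative differences. Dropping zero differences before ranking is a common convention assumed here; none occur in this example.

```python
def average_ranks(values):
    """Map each value to its 1-based rank, averaging ranks across ties."""
    ordered = sorted(values)
    return {v: ordered.index(v) + 1 + (ordered.count(v) - 1) / 2
            for v in set(ordered)}


differences = [2, 1, -1, 3, 2]
nonzero = [d for d in differences if d != 0]

ranks = average_ranks([abs(d) for d in nonzero])
w_plus = sum(ranks[abs(d)] for d in nonzero if d > 0)
w_minus = sum(ranks[abs(d)] for d in nonzero if d < 0)

print(f"W+ = {w_plus}, W- = {w_minus}")
```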

Kruskal-Wallis

Use: 3 or more independent groups, skewed continuous data.
Example: skewed length of stay across three treatment groups.
| Group A | Group B | Group C |
|---|---|---|
| 1 | 2 | 4 |
| 2 | 3 | 5 |
| 10 | 12 | 15 |
Rank all values together: 1(1), 2(2.5), 2(2.5), 3(4), 4(5), 5(6), 10(7), 12(8), 15(9)
Average ranks differ across groups if one group tends to have systematically higher values.

This is the nonparametric analogue of asking whether the group means differ in ANOVA.
Key point: nonparametric analogue of ANOVA.
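Continuing from the pooled ranks above, the H statistic can be sketched as H = 12 / (N(N+1)) × Σ(Rᵢ²/nᵢ) − 3(N+1), where Rᵢ is the rank sum of group i. This version omits the tie correction, so software that applies it will report a slightly larger value for these data.

```python
def average_ranks(values):
    """Map each value to its 1-based rank, averaging ranks across ties."""
    ordered = sorted(values)
    return {v: ordered.index(v) + 1 + (ordered.count(v) - 1) / 2
            for v in set(ordered)}


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H without tie correction."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    rank_term = sum(sum(ranks[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)


stay_groups = [[1, 2, 10], [2, 3, 12], [4, 5, 15]]
h_stat = kruskal_wallis_h(stay_groups)
print(f"H = {h_stat:.2f}")  # compare with chi-square on k - 1 = 2 df
```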

Logistic regression

Use: binary outcome with adjustment.
Example: odds of preeclampsia adjusted for age, BMI, parity.
| Predictor | β coefficient | Approximate interpretation |
|---|---|---|
| BMI | 0.08 | Higher BMI raises log odds |
| Age | 0.03 | Older age slightly raises log odds |
| Prior PE | 1.10 | Much higher odds |
Model form: log odds(outcome) = β₀ + β₁x₁ + β₂x₂ + …
Example: if β for BMI = 0.08, then OR = e^0.08 ≈ 1.08

Interpretation: for each 1-unit increase in BMI, the odds of preeclampsia increase by about 8%, holding the other covariates constant.
Key point: think adjusted odds ratio for yes/no outcomes.
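Exponentiating each coefficient in the table above gives its adjusted odds ratio. The sketch also shows that a multi-unit change multiplies on the log-odds scale, so a 5-unit BMI increase corresponds to e^(5 × 0.08), not 5 × 8%.

```python
import math

# Coefficients from the example table
betas = {"BMI": 0.08, "Age": 0.03, "Prior PE": 1.10}

# Odds ratio per 1-unit increase in each predictor
odds_ratios = {name: math.exp(beta) for name, beta in betas.items()}
for name, odds_ratio in odds_ratios.items():
    print(f"{name}: OR = {odds_ratio:.2f}")

# Odds ratio for a 5-unit increase in BMI
or_bmi_5_units = math.exp(5 * betas["BMI"])
print(f"BMI +5 units: OR = {or_bmi_5_units:.2f}")
```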

Linear regression

Use: continuous outcome with adjustment or prediction.
Example: predict birthweight from gestational age and maternal covariates.
| Predictor | β coefficient | Interpretation |
|---|---|---|
| Gestational age (weeks) | 180 | +180 g per week |
| Smoking | −120 | 120 g lower if smoker |
| Male fetus | +90 | 90 g higher |
Model form: y = β₀ + β₁x₁ + β₂x₂ + …
If β for gestational age is 180, then each additional week is associated with an average increase of 180 g in birthweight, holding the other predictors constant.
Key point: continuous modeled outcome.
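Because the example gives no intercept, an absolute prediction is not possible, but the expected difference between two patients can still be computed since β₀ cancels. In this sketch the two covariate profiles and the key names are hypothetical.

```python
# Coefficients from the example table (grams)
betas = {"gestational_age_weeks": 180, "smoker": -120, "male_fetus": 90}


def predicted_difference(profile_a, profile_b):
    """Expected birthweight difference (g), profile A minus profile B."""
    return sum(beta * (profile_a[name] - profile_b[name])
               for name, beta in betas.items())


# Hypothetical patients: A is a smoker delivering at 39 weeks,
# B is a non-smoker delivering at 37 weeks; both carry female fetuses
patient_a = {"gestational_age_weeks": 39, "smoker": 1, "male_fetus": 0}
patient_b = {"gestational_age_weeks": 37, "smoker": 0, "male_fetus": 0}

difference_g = predicted_difference(patient_a, patient_b)
print(f"expected difference = {difference_g} g")
```

Two extra weeks add 360 g and smoking subtracts 120 g, for a net expected difference of 240 g.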

Cox proportional hazards

Use: time-to-event outcome.
Example: time from FGR diagnosis to delivery.
| Group | Median time to delivery | Hazard ratio |
|---|---|---|
| Abnormal UA Doppler | 10 days | 1.8 |
| Normal UA Doppler | 18 days | Reference |
Focus is not simply whether delivery occurred, but when it occurred.
A hazard ratio of 1.8 means the instantaneous rate of delivery is 80% higher in the abnormal UA Doppler group at a given time point, assuming proportional hazards.
Key point: use when timing matters.