Statistics (inference & tests)

Confidence intervals, hypothesis tests, regression, and uncertainty.

Sample size, Power analysis, Effect size, Confusion matrix, Specificity & sensitivity, Likelihood ratio, Bayes theorem, Odds converter, Post-test odds, Pre/post-test probability, NPV & PPV, Youden's J, F-beta score, Precision & recall, MCC, Balanced accuracy, ROC AUC, Diagnostic odds ratio, ARR / NNT, NNS, NNH, Risk difference, RD with CI, Attributable risk, Attributable risk %, Population attributable risk, Population attributable fraction, Population attributable risk %, AF among exposed, Risk ratio, RR with CI, Odds ratio, OR with CI, Fisher exact test, McNemar test, Binomial test, Sign test, Wilcoxon signed-rank, Mann-Whitney U, Kruskal-Wallis, Friedman test, RR vs OR, Cohen's kappa, Entropy & KL, CI & hypothesis tests, ANOVA, Correlation, Linear regression, Normal distribution, Permutation test, Quick charts

Quick guide

  1. Start with sample size when the question is still about planning precision before data collection.
  2. Use power analysis when the planning question is required n, achieved power, or minimum detectable effect.
  3. Use CI & hypothesis tests when you need to quantify uncertainty and compare groups.
  4. Use effect size after the inferential step when the next question is practical magnitude rather than only statistical significance.
  5. Use confusion matrix metrics after a classifier run when accuracy alone is not enough and you need precision, recall, specificity, or F1.
  6. Use specificity and sensitivity when the stakeholder language is already about true positive rate versus true negative rate and you need FPR or FNR beside them.
  7. Use likelihood ratio when you want an odds-style explanation for how strongly a positive or negative result changes the evidence after the threshold is fixed.
  8. Use Bayes theorem when you need to explain the full prior → evidence → posterior update rather than only a threshold metric.
  9. Use the odds/probability converter when the missing step is moving between raw probability and the odds that likelihood ratios actually multiply.
  10. Use post-test odds when you want to show the explicit odds step between a prior and a posterior instead of jumping straight to the final probability.
  11. Use pre/post-test probability when LR+ or LR− is already known and the next question is the updated probability for one case.
  12. Use NPV and PPV when the question is how reliable positive or negative calls remain after prevalence changes between training and production.
  13. Use Youden's J when you want one threshold score that rewards both sensitivity and specificity while you compare nearby cutoffs.
  14. Use F-beta when precision and recall are both important but one side should count more than the other, such as recall-heavy screening or precision-heavy review queues.
  15. Use precision and recall when one operating threshold is already fixed and the real trade-off is false alarms versus missed positives on the positive class.
  16. Use MCC when you want one summary metric that reacts to all four confusion-matrix cells and stays useful under class imbalance.
  17. Use balanced accuracy when plain accuracy may flatter the dominant class and you need equal weight on recall and specificity.
  18. Use ROC AUC when the model outputs scores and the next question is how threshold choice changes sensitivity and specificity across the full sweep.
  19. Use diagnostic odds ratio when you want one ratio that summarizes how strongly the threshold separates positive and negative classes. Keep LR+ and LR− beside it for interpretation.
  20. Use ARR / NNT when you are past diagnostic metrics and need absolute effect plus people-needed interpretation for an intervention comparison.
  21. Use number needed to screen when the question is how many people must be screened to detect one target case under a prevalence and detection-yield assumption.
  22. Use NNH when the next question is harm-focused people-needed interpretation from an absolute risk increase rather than a ratio or a broader ARR / NNT summary.
  23. Use risk difference when you want the signed absolute gap in percentage points before converting that gap into NNH or comparing it with ratio metrics.
  24. Use risk difference with confidence interval when the same 2x2 table should report both the signed absolute gap and its 90%, 95%, or 99% confidence interval before you move to ARR / NNT or RR comparison pages.
  25. Use attributable risk when the question is how much of exposed-group risk can be attributed to exposure, not just the raw signed gap.
  26. Use attributable risk percent when the exposed-group question is the share of exposed risk attributable to exposure, not the population burden.
  27. Use population attributable risk when you need the population-level burden after combining attributable risk with exposure prevalence.
  28. Use population attributable fraction when the same population burden should be explained as a fraction or percent of total population risk rather than only an absolute risk value.
  29. Use population attributable risk percent when the reporting language should stay in percent form and you want the same population burden stated directly as a percentage of total population risk.
  30. Use attributable fraction among exposed when the interpretation must stay inside the exposed group and you want the attributable share stated directly in fraction or percent form.
  31. Use risk ratio from table when a simple 2x2 count table is already available and the fastest question is the direct risk multiplier plus the absolute gap.
  32. Use risk ratio with confidence interval when the same 2x2 table should report both the RR point estimate and its 90%, 95%, or 99% confidence interval before you move to broader comparison pages.
  33. Use odds ratio from table when the same 2x2 counts should be reported in odds terms, such as case-control summaries or rare-event interpretations.
  34. Use odds ratio with confidence interval when the same 2x2 table should report both the OR point estimate and its 90%, 95%, or 99% confidence interval before you move to broader comparison pages.
  35. Use Fisher exact test when the 2x2 table is small or sparse and the first question is an exact p-value before you decide whether RR, OR, or chi-square-style summaries are worth reading next.
  36. Use McNemar test when the same items are measured twice and the paired 2x2 question is whether discordant pairs tilt in one direction. It is the paired-table counterpart to the Fisher exact test.
  37. Use binomial test when one success count must be compared against one null proportion and the question is an exact one-proportion p-value before you move to sample-size or CI pages.
  38. Use the sign test when paired before/after data should be reduced to direction only and the next question is an exact p-value for more positives than negatives before you move to Wilcoxon-style rank weighting.
  39. Use the Wilcoxon signed-rank test when paired before/after data are ordinal or skewed continuous values and you want a paired nonparametric alternative to the paired t-test.
  40. Use Mann-Whitney U when 2 groups are independent but ordinal or skewed enough that you want a rank-based alternative to the two-sample t-test.
  41. Use Kruskal-Wallis when 3 or more independent groups should be compared with ranks instead of the mean-based assumptions behind one-way ANOVA. It is the natural 3-or-more-group extension after Mann-Whitney U.
  42. Use the Friedman test when the same items are measured across 3 or more conditions and you want a repeated-measures nonparametric alternative to repeated-measures ANOVA. It is the natural 3-or-more-condition extension after the Wilcoxon signed-rank test.
  43. Use relative risk versus odds ratio when two event-rate groups must be compared directly and the team needs to avoid mixing risk ratios with odds-based reporting.
  44. Use Cohen's kappa when 2 raters classify the same items and percent agreement alone would hide how much of the match is expected from chance.
  45. Use entropy and divergence when the question is uncertainty inside one distribution or mismatch between P and Q rather than a mean difference or classifier score.
  46. Use ANOVA when you compare the mean outcome across 3 or more groups.
  47. Use correlation when the first question is strength and direction of association.
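  48. Use linear regression when you need to fit a line and describe how the outcome changes with a predictor.
  49. Use a permutation test when you need a non-parametric check that does not rely on a distributional assumption.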
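
Several of the diagnostic entries above (items 5 to 17) read the same four confusion-matrix cells in different ways, and items 9 to 11 describe the odds step that turns a likelihood ratio into an updated probability. The following Python sketch illustrates those relationships. The names `Confusion`, `diagnostic_metrics`, and `post_test_probability` are hypothetical helpers written for this example, not the calculators' own code, and the counts are made-up illustration data. It assumes every denominator is non-zero.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from one classifier run at one fixed threshold (hypothetical helper)."""
    tp: int
    fp: int
    fn: int
    tn: int


def diagnostic_metrics(c: Confusion) -> dict:
    """Standard threshold metrics; assumes no denominator is zero."""
    sens = c.tp / (c.tp + c.fn)   # sensitivity = recall = true positive rate
    spec = c.tn / (c.tn + c.fp)   # specificity = true negative rate
    ppv = c.tp / (c.tp + c.fp)    # positive predictive value = precision
    npv = c.tn / (c.tn + c.fn)    # negative predictive value
    mcc_den = math.sqrt(
        (c.tp + c.fp) * (c.tp + c.fn) * (c.tn + c.fp) * (c.tn + c.fn)
    )
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": ppv,
        "npv": npv,
        "f1": 2 * ppv * sens / (ppv + sens),
        "youden_j": sens + spec - 1,
        "balanced_accuracy": (sens + spec) / 2,
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
        "mcc": (c.tp * c.tn - c.fp * c.fn) / mcc_den,
    }


def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Probability -> odds, multiply by the likelihood ratio, odds -> probability."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Made-up counts for illustration only.
table = Confusion(tp=90, fp=20, fn=10, tn=180)
metrics = diagnostic_metrics(table)
prevalence = (table.tp + table.fn) / (table.tp + table.fp + table.fn + table.tn)

# At the table's own prevalence the odds update reproduces PPV;
# at a lower prevalence the same LR+ gives a much smaller posterior.
same_prevalence = post_test_probability(prevalence, metrics["lr_pos"])
low_prevalence = post_test_probability(0.01, metrics["lr_pos"])
print(metrics["ppv"], same_prevalence, low_prevalence)
```

Running the update at two prevalences is one way to see why item 12 treats PPV and NPV as prevalence-dependent while sensitivity, specificity, and the likelihood ratios stay fixed for a given threshold.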

Before you run a test

Write your question in one line. Define the metric and the groups.

Check units, sample size, and missing values before testing.

Report confidence intervals alongside p-values for clearer reporting.
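
As a rough illustration of reporting an interval and a p-value together, the sketch below compares two proportions using only the Python standard library. The function name `risk_difference_report` and the counts are hypothetical. It assumes a large-sample setting where a Wald interval and a pooled two-proportion z-test are acceptable; for small or sparse tables an exact method such as the Fisher exact test is the safer choice.

```python
import math
from statistics import NormalDist


def risk_difference_report(events_a, n_a, events_b, n_b, level=0.95):
    """Risk difference (A minus B) with a Wald CI and a two-sided z-test p-value.

    Hypothetical helper for illustration; assumes large samples.
    """
    p_a = events_a / n_a
    p_b = events_b / n_b
    diff = p_a - p_b

    # Wald interval: unpooled standard error of the difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    # z-test: pooled standard error under the null of equal risks.
    pooled = (events_a + events_b) / (n_a + n_b)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    return {"risk_difference": diff, "ci": ci, "z": z, "p_value": p_value}


# Made-up counts: 30 events in 100 versus 20 events in 100.
report = risk_difference_report(30, 100, 20, 100)
print(report)
```

With these illustrative counts the interval spans zero and the p-value is above 0.05, so the two numbers tell a consistent story; the interval additionally shows how large a gap the data are still compatible with.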

Share the calculator URL so others can reproduce the same setup.

Keep the analysis plan simple and fixed before you look at results.

Write the null and alternative hypotheses before opening any calculator.

Report effect sizes with uncertainty, not only a single p-value.
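
One possible way to attach uncertainty to an effect size is sketched below for Cohen's d between two independent groups. The function name `cohens_d_with_ci` and the sample values are hypothetical, and the interval uses a common large-sample normal approximation for the standard error of d rather than an exact noncentral-t interval, so treat it as a rough guide for small samples.

```python
import math
from statistics import NormalDist, mean, variance


def cohens_d_with_ci(group_a, group_b, level=0.95):
    """Cohen's d (A minus B, pooled SD) with an approximate normal-theory CI.

    Hypothetical helper for illustration; needs at least 2 values per group.
    """
    n_a, n_b = len(group_a), len(group_b)

    # Pooled variance from the two sample variances (n - 1 denominators).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    d = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

    # Large-sample approximation to the standard error of d.
    se = math.sqrt((n_a + n_b) / (n_a * n_b) + d * d / (2 * (n_a + n_b)))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return d, (d - z_crit * se, d + z_crit * se)


# Made-up measurements for illustration only.
treated = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]
control = [4.8, 4.7, 5.0, 5.2, 4.9, 5.1]
d, (low, high) = cohens_d_with_ci(treated, control)
print(f"d = {d:.2f}, approx. 95% CI [{low:.2f}, {high:.2f}]")
```

Reporting the pair (d, interval) keeps the practical magnitude and its precision in the same line, which is harder to misread than a p-value alone.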
