Statistics (inference & tests)

Confidence intervals, hypothesis tests, regression, and uncertainty.

Tools in this topic: Sample size · Power analysis · Effect size · Confusion matrix · Specificity & sensitivity · Likelihood ratio · Bayes theorem · Odds converter · Post-test odds · Pre/post-test probability · NPV & PPV · Youden's J · F-beta score · Precision & recall · MCC · Balanced accuracy · ROC AUC · Diagnostic odds ratio · ARR / NNT · NNS · NNH · Risk difference · RD with CI · Attributable risk · Attributable risk % · Population attributable risk · Population attributable fraction · Population attributable risk % · AF among exposed · Risk ratio · RR with CI · Odds ratio · OR with CI · Fisher exact test · McNemar test · Binomial test · Sign test · Wilcoxon signed-rank · Mann-Whitney U · Kruskal-Wallis · Friedman test · RR vs OR · Cohen's kappa · Entropy & KL · CI & hypothesis tests · ANOVA · Correlation · Linear regression · Normal distribution · Permutation test · Quick charts

Quick guide

  1. Start with sample size when you are still planning for precision before any data collection.
  2. Use power analysis when the planning question is required n, achieved power, or minimum detectable effect.
  3. Use CI/tests to quantify uncertainty and compare groups.
  4. Use effect size after the inferential step when the next question is practical magnitude rather than only statistical significance.
  5. Use confusion matrix metrics after a classifier run when accuracy alone is not enough and you need precision, recall, specificity, or F1.
  6. Use specificity and sensitivity when the stakeholder language is already about true positive rate versus true negative rate and you need FPR or FNR beside them.
  7. Use likelihood ratio when you want an odds-style explanation for how strongly a positive or negative result changes the evidence after the threshold is fixed.
  8. Use Bayes theorem when you need to explain the full prior → evidence → posterior update rather than only a threshold metric.
  9. Use the odds/probability converter when the missing step is moving between raw probability and the odds that likelihood ratios actually multiply.
  10. Use post-test odds when you want to show the explicit odds step between a prior and a posterior instead of jumping straight to the final probability.
  11. Use pre/post-test probability when LR+ or LR− is already known and the next question is the updated probability for one case.
  12. Use NPV and PPV when the question is how reliable positive or negative calls remain after prevalence changes between training and production.
  13. Use Youden's J when you want one threshold score that rewards both sensitivity and specificity while you compare nearby cutoffs.
  14. Use F-beta when precision and recall are both important but one side should count more than the other, such as recall-heavy screening or precision-heavy review queues.
  15. Use precision and recall when one operating threshold is already fixed and the real trade-off is false alarms versus missed positives on the positive class.
  16. Use MCC when you want one summary metric that reacts to all four confusion-matrix cells and stays useful under class imbalance.
  17. Use balanced accuracy when plain accuracy may flatter the dominant class and you need equal weight on recall and specificity.
  18. Use ROC AUC when the model outputs scores and the next question is how threshold choice changes sensitivity and specificity across the full sweep.
  19. Use diagnostic odds ratio when you want one ratio that summarizes how strongly the threshold separates positive and negative classes. Keep LR+ and LR- beside it for interpretation.
  20. Use ARR / NNT when you are past diagnostic metrics and need absolute effect plus people-needed interpretation for an intervention comparison.
  21. Use number needed to screen when the question is how many people must be screened to detect one target case under a prevalence and detection-yield assumption.
  22. Use NNH when the next question is harm-focused people-needed interpretation from an absolute risk increase rather than a ratio or a broader ARR / NNT summary.
  23. Use risk difference when you want the signed absolute gap in percentage points before converting that gap into NNH or comparing it with ratio metrics.
  24. Use risk difference with confidence interval when the same 2x2 table should report both the signed absolute gap and its 90 / 95 / 99% interval width before you move to ARR / NNT or RR comparison pages.
  25. Use attributable risk when the question is how much of exposed-group risk can be attributed to exposure, not just the raw signed gap.
  26. Use attributable risk percent when the exposed-group question is the share of exposed risk attributable to exposure, not the population burden.
  27. Use population attributable risk when you need the population-level burden after combining attributable risk with exposure prevalence.
  28. Use population attributable fraction when the same population burden should be explained as a fraction or percent of total population risk rather than only an absolute risk value.
  29. Use population attributable risk percent when the reporting language should stay in percent form and you want the same population burden stated directly as a percentage of total population risk.
  30. Use attributable fraction among exposed when the interpretation must stay inside the exposed group and you want the attributable share stated directly in fraction or percent form.
  31. Use risk ratio from table when a simple 2x2 count table is already available and the fastest question is the direct risk multiplier plus the absolute gap.
  32. Use risk ratio with confidence interval when the same 2x2 table should report both the RR point estimate and its 90 / 95 / 99% interval width before you move to broader comparison pages.
  33. Use odds ratio from table when the same 2x2 counts should be reported in odds terms, such as case-control summaries or rare-event interpretations.
  34. Use odds ratio with confidence interval when the same 2x2 table should report both the OR point estimate and its 90 / 95 / 99% interval width before you move to broader comparison pages.
  35. Use Fisher exact test when the 2x2 table is small or sparse and the first question is an exact p-value before you decide whether RR, OR, or chi-square-style summaries are worth reading next.
  36. Use McNemar test when the same items are measured twice and the paired 2x2 question is whether discordant pairs tilt in one direction. It is the small paired-table counterpart to Fisher-style exact reads.
  37. Use binomial test when one success count must be compared against one null proportion and the question is an exact one-proportion p-value before you move to sample-size or CI pages.
  38. Use the sign test when paired before/after data should be reduced to direction only and the next question is an exact p-value for more positives than negatives before you move to Wilcoxon-style rank weighting.
  39. Use the Wilcoxon signed-rank test when paired before/after data are ordinal or skewed continuous values and you want a paired nonparametric alternative to the paired t-test.
  40. Use Mann-Whitney U when two independent groups have ordinal or skewed data and you want a rank-based alternative to the two-sample t-test.
  41. Use Kruskal-Wallis when 3 or more independent groups should be compared with ranks instead of the mean-based assumptions behind one-way ANOVA. It is the natural 3-or-more-group extension after Mann-Whitney U.
  42. Use the Friedman test when the same items are measured across 3 or more conditions and you want a repeated-measures nonparametric alternative to repeated-measures ANOVA. It is the natural 3-or-more-condition extension after the Wilcoxon signed-rank test.
  43. Use relative risk versus odds ratio when two event-rate groups must be compared directly and the team needs to avoid mixing risk ratios with odds-based reporting.
  44. Use Cohen's kappa when two raters classify the same items and percent agreement alone would hide how much of the match is expected from chance.
  45. Use entropy and divergence when the question is uncertainty inside one distribution or mismatch between P and Q rather than a mean difference or classifier score.
  46. Use ANOVA when you compare the mean outcome across 3 or more groups.
  47. Use correlation when the first question is strength and direction of association.
  48. Use regression to fit lines and explain relationships.
  49. Need a non-parametric check? Try a permutation test.
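Several of the steps above (specificity and sensitivity, likelihood ratios, the odds converter, and post-test probability) chain into one calculation. This is a minimal sketch of that chain; the sensitivity, specificity, and prevalence values are illustrative assumptions, not outputs of any calculator on this page.

```python
# Chain: sensitivity/specificity -> likelihood ratios -> odds -> post-test probability.

def likelihood_ratios(sensitivity, specificity):
    """LR+ = sens / (1 - spec); LR- = (1 - sens) / spec."""
    return sensitivity / (1 - specificity), (1 - sensitivity) / specificity

def prob_to_odds(p):
    return p / (1 - p)

def odds_to_prob(o):
    return o / (1 + o)

def post_test_probability(pre_test_prob, lr):
    """Multiply pre-test odds by the likelihood ratio, then convert back."""
    return odds_to_prob(prob_to_odds(pre_test_prob) * lr)

# Illustrative inputs: sensitivity 0.90, specificity 0.80, pre-test probability 0.10.
lr_pos, lr_neg = likelihood_ratios(0.90, 0.80)   # LR+ = 4.5, LR- = 0.125
p_after_positive = post_test_probability(0.10, lr_pos)
p_after_negative = post_test_probability(0.10, lr_neg)
```

A positive result lifts the probability from 10% to about 33%, while a negative result drops it to about 1.4% — the same numbers the odds converter and post-test odds pages make explicit step by step.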

Statistics (inference & tests): how to choose the right calculator path

This topic page works best when you treat it as a decision map rather than a flat list of tools. Start by writing the exact decision you need to make, then pick calculators in sequence so each output becomes an input to the next step. In practice, teams get faster and make fewer errors when they run a baseline model first, pressure-test assumptions second, and only then export a final number. For many workflows in this topic, a reliable sequence is to begin with Confidence Interval & Hypothesis Test Wizard, cross-check with Linear Regression & Correlation — Scatter, OLS/WLS, R², and finalize with Error Propagation Calculator with Steps when you need a publishable result.
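Many quick-guide entries read different summaries off one shared 2x2 table, which is exactly the "each output becomes an input to the next step" pattern. This sketch computes RR, OR, ARR, and NNT from one table with Wald log-scale 95% intervals; the counts and the cell labeling convention are illustrative assumptions.

```python
import math

def two_by_two_metrics(a, b, c, d, z=1.96):
    """a = exposed events, b = exposed non-events,
    c = unexposed events, d = unexposed non-events."""
    risk_exp, risk_unexp = a / (a + b), c / (c + d)
    rr = risk_exp / risk_unexp
    odds_ratio = (a * d) / (b * c)
    arr = risk_unexp - risk_exp                  # absolute risk reduction
    nnt = 1 / abs(arr) if arr != 0 else float("inf")
    # Wald intervals on the log scale, exponentiated back.
    se_log_rr = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
    rr_ci = (rr * math.exp(-z * se_log_rr), rr * math.exp(z * se_log_rr))
    se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
    or_ci = (odds_ratio * math.exp(-z * se_log_or), odds_ratio * math.exp(z * se_log_or))
    return {"RR": rr, "RR_CI": rr_ci, "OR": odds_ratio, "OR_CI": or_ci,
            "ARR": arr, "NNT": nnt}

m = two_by_two_metrics(a=10, b=90, c=20, d=80)
# Risks 0.10 vs 0.20, so RR = 0.5, OR = 4/9, ARR = 0.10, NNT = 10.
```

Running RR, OR, and ARR/NNT off the same table, as above, is what keeps the ratio and absolute-effect pages on this topic consistent with each other.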


Practical workflow example

Suppose your team must deliver a recommendation by end of day. Use the first 10 minutes to define scope, constraints, and acceptance criteria in plain language. Run a baseline calculation, then a conservative and an optimistic case using the same structure. If outputs diverge materially, capture the sensitivity driver and decide which assumption needs escalation. Only after this pass should you export or share numbers. This process keeps the topic useful for real decisions, not just one-off calculations.
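The baseline / conservative / optimistic pass described above can be run with the same structure each time. This sketch assumes the metric is a simple proportion and uses a Wilson 95% interval; the scenario counts are invented for illustration.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for one proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Same structure, three assumptions (hypothetical counts out of n = 400).
scenarios = {
    "baseline":     (48, 400),
    "conservative": (40, 400),   # assume some events were overcounted
    "optimistic":   (56, 400),   # assume some events were missed
}
for name, (k, n) in scenarios.items():
    lo, hi = wilson_ci(k, n)
    print(f"{name:12s} p={k/n:.3f}  95% CI [{lo:.3f}, {hi:.3f}]")
```

If the three intervals overlap heavily, the sensitivity driver is weak and the baseline can stand; if they diverge materially, that assumption is the one to escalate.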

When results will influence spending, policy, or operations, keep a short note beside each output that records source data date, assumptions, and rounding policy. That one step dramatically reduces rework when someone asks for a rerun next week.

Before you run a test

Write your question in one line. Define the metric and the groups.

Check units, sample size, and missing values before testing.

Use confidence intervals with p-values for clearer reporting.

Share the calculator URL so others can reproduce the same setup.

Keep the analysis plan simple and fixed before you look at results.

Write the null and alternative hypotheses before opening any calculator.

Report effect sizes with uncertainty, not only a single p-value.

How to use this calculator effectively

This guide helps you use Statistics (inference & tests) in a repeatable way: define a baseline, change one variable at a time, and interpret outputs with explicit assumptions before you share or act on results.

How it works

The page applies deterministic logic to your inputs and shows rounded output for readability. Treat it as a comparison workflow: run one baseline case, adjust a single parameter, and measure both absolute and percentage deltas. If a result seems off, verify units, time basis, and sign conventions before drawing conclusions. This approach keeps your analysis reproducible across teammates and sessions.
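The habit of reporting both absolute and percentage deltas is easy to encode. In this sketch the baseline and variant values are placeholders, not outputs from any specific calculator.

```python
def deltas(baseline, variant):
    """Return (absolute change, percent change vs baseline)."""
    abs_delta = variant - baseline
    pct_delta = abs_delta / baseline * 100 if baseline != 0 else float("nan")
    return abs_delta, pct_delta

# Hypothetical: a rate moves from 4.2% to 5.0% after changing one parameter.
abs_d, pct_d = deltas(baseline=0.042, variant=0.050)
# abs_d is +0.008 in the metric's own units; pct_d is roughly +19% relative.
```

Keeping both numbers side by side prevents the common trap of quoting a large relative change when the absolute movement is negligible, or vice versa.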

When to use

Use this page when you need a fast estimate, a classroom check, or a practical what-if comparison. It works best for planning and prioritization steps where you need direction and magnitude quickly before investing in deeper modeling, manual spreadsheets, or formal external review.

Interpretation and worked example

Run a baseline scenario and keep that result visible. Next, modify one assumption to reflect your realistic alternative and compare direction plus size of change. If the direction matches your domain expectation and the size is plausible, your setup is usually coherent. If not, check hidden defaults, boundary conditions, and interpretation notes before deciding which scenario to adopt.

FAQ

What should I do first on this page?

Start with the minimum required inputs or the first action shown near the primary button. Keep optional settings at defaults for a baseline run, then change one setting at a time so you can explain what caused each output change.

Why does this page differ from another tool?

Different pages often use different defaults, units, rounding rules, or assumptions. Align those settings before comparing outputs. If differences remain, compare each intermediate step rather than only the final number.

How reliable are the displayed values?

Values are computed in the browser and rounded for display. They are good for planning and educational checks, but for regulated or high-stakes decisions you should validate assumptions with official guidance or professional review.

Can I share and reproduce this result?

Yes. Use the share or URL controls when available. Keep a baseline case and one changed case so others can reproduce your reasoning and verify that the direction and scale of change are consistent.

Is my input uploaded somewhere?

Core calculations run locally in your browser. Some pages encode parameters in a shareable URL, but no automatic upload is performed unless you explicitly share that link.

How to use Statistics (inference & tests) effectively

Topic overview

This topic page connects related methods, calculators, and practical contexts. Use it as an entry point: identify your objective first, then jump to the tool that isolates one decision variable and compare outputs with your constraints.

Recommended reading order

Start with the conceptual entry, then open one calculator page, then return here for alternatives. Reusing this loop avoids jumping directly into advanced inputs before you have enough context to interpret outputs correctly.

Cross-page consistency

Keep terminology aligned across calculators in the same topic family. If names or units drift between pages, the same user intent can produce conflicting interpretations even when numerical outcomes appear similar.

Quality checks

After each calculation, confirm input units, baseline assumptions, and edge-case handling. Topic-level consistency checks help your team retain interpretability over time and reduce accidental misinterpretation in future edits.