The Forecaster Test

Measure the cognitive skills that actually predict forecasting ability

Intelligence matters for many life outcomes, but surprisingly, IQ only weakly predicts forecasting ability (r ≈ .2). Even more counterintuitively, Tetlock's 20-year study of 82,361 predictions found that specialists performed no better in their own field than outside it—and "hedgehog" experts who relied heavily on domain expertise actually did worse when predicting within their specialty. Smart experts get caught off guard because expertise breeds overconfidence, not accuracy.

What matters more is what Mellers et al. called "good judgment": a combination of probabilistic thinking, calibration, and willingness to update beliefs. The Forecaster Test is not an IQ test. It measures the cognitive skills that predict whether you will be good at anticipating the future.

How it works

1. Take the Assessment — Measure your baseline across Bayesian reasoning, diagnostic thinking, cognitive reflection, and open-minded thinking.

2. Make Predictions — Forecast real-world events drawn from prediction markets. Your predictions are stored and scored when events resolve.

3. Track Your Accuracy — See how your judgment score correlates with actual forecasting performance over time.

Participants
Predictions Made
Questions Resolved

Assessment

Based on research from Tetlock, Mellers, and Baron

This assessment measures five dimensions of judgment quality that predict forecasting accuracy: Bayesian updating, regression awareness, calibration, cognitive reflection, and open-minded thinking.

Time: ~8-10 minutes

Progress: Section 1 of 7, 0 / 32 items
Section 1: Bayesian Update Tests
These problems test how you update probability estimates when new evidence arrives. First commit to an initial estimate, then update based on new information.
Problem 1 of 3
Stage 1: Initial Estimate

A company implements mandatory drug testing. The test has:

  • 95% sensitivity (correctly detects 95% of drug users)
  • 92% specificity (correctly clears 92% of non-users)

In this industry, approximately 4% of employees use drugs.

An employee tests positive. What is the probability they actually use drugs?
%
Stage 2: Update with Evidence

New evidence: You learn this employee works in a warehouse division. Internal audits have found that 11% of warehouse employees use drugs (vs 4% company-wide).

Recalculate: What is the probability this employee uses drugs?
%
Most people spend about 30 seconds on each stage
Problem 2 of 3
Stage 1: Initial Estimate

A retailer sources products from two factories:

  • Factory A supplies 70% of inventory, with a 3% defect rate
  • Factory B supplies 30% of inventory, with a 9% defect rate

A customer returns a defective product. What is the probability it came from Factory A?
%
Stage 2: Update with Disconfirming Evidence

New evidence: The defect is identified as a coating flaw. Historical data shows:

  • 15% of Factory A's defects are coating flaws
  • 60% of Factory B's defects are coating flaws

Given the defect is a coating flaw, update your estimate for Factory A.
%
Most people spend about 30 seconds on each stage
Problem 3 of 3
Stage 1: Initial Estimate

A venture fund screens startup founders. Historically:

  • 6% of applicants are "high-potential" (will return 10x+)
  • The screening committee correctly advances 85% of high-potential founders
  • The committee also advances 20% of ordinary founders

A founder passes the screening committee. What is the probability they are high-potential?
%
Stage 2: Update with Mixed Evidence

New evidence: You learn two additional facts about this founder:

Signal A (positive): The committee ranked them in their top 3 picks of the quarter. Among advanced founders:

  • 40% of high-potential founders receive top-3 ranking
  • 8% of ordinary founders receive top-3 ranking

Signal B (negative): The founder has no prior startup experience. Among advanced founders:

  • 30% of high-potential founders lack prior experience
  • 65% of ordinary founders lack prior experience

Integrating both signals, update your estimate for this founder being high-potential.
%
Most people spend about 45 seconds on each stage
Section 2: Intuition Check
This tests whether you recognize common statistical patterns that often mislead people.
Performance Prediction

A regional sales team had an exceptional Q3, beating their quarterly target by 40%. This was their best quarter in 3 years.

The company's leadership is now projecting Q4 performance for this team. The team composition and market conditions are expected to remain similar.

What's most likely for Q4?
Most people answer in about 20 seconds
Section 3: Calibration Check
For each statement, indicate whether you think it's true or false, then rate your confidence in that answer. These test how well your confidence matches your accuracy.
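One common way to quantify how well confidence matches accuracy is the gap between average stated confidence and the fraction of answers that were correct. The sketch below illustrates that idea only; it is not necessarily how this assessment scores calibration, and the function name `overconfidence` and the sample responses are hypothetical.

```python
def overconfidence(responses):
    """Return mean confidence minus accuracy.

    responses: list of (confidence, was_correct) pairs, with confidence
    expressed as a probability between 0.5 and 1.0.
    Positive values suggest overconfidence, negative values underconfidence.
    """
    if not responses:
        raise ValueError("need at least one response")
    mean_confidence = sum(conf for conf, _ in responses) / len(responses)
    accuracy = sum(1 for _, ok in responses if ok) / len(responses)
    return mean_confidence - accuracy


# Hypothetical answers to four true/false statements.
responses = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
gap = overconfidence(responses)
print(f"confidence minus accuracy: {gap:.2f}")
```

In this made-up example the average confidence is 75% but only half the answers are right, so the gap is about 0.25. A well-calibrated respondent would have a gap near zero over many statements; four items give only a rough signal.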
Statement 1 of 4
More than half of major corporate mergers fail to create shareholder value (as measured by stock performance vs. industry benchmarks 3 years post-merger).
75%
Most people answer in about 15 seconds
Statement 2 of 4
Most startups (more than 50%) that raise Series A funding eventually go on to raise a Series B round.
75%
Most people answer in about 15 seconds
Statement 3 of 4
Clinical trials that show positive results are published at higher rates than those showing null or negative results.
75%
Most people answer in about 15 seconds
Statement 4 of 4
Professional economic forecasters accurately predict the direction (up or down) of annual GDP growth more than 75% of the time when forecasting 2 years ahead.
75%
Most people answer in about 15 seconds
Section 4: Cognitive Reflection
Take a moment to verify your answer.
Problem 1 of 2
A bat and ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?
$
Problem 2 of 2
A lily pad patch doubles daily. It covers the lake on day 48. When did it cover half the lake?
days
Section 5: Thinking Style
Rate your agreement with each statement.
Statement 1 of 11
I tend to make decisions quickly rather than deliberating for a long time.
Strongly Disagree to Strongly Agree
Statement 2 of 11
People should take into consideration evidence that goes against their beliefs.
Strongly Disagree to Strongly Agree
Statement 3 of 11
I prefer explanations that tie everything together with one big idea.
Strongly Disagree to Strongly Agree
Statement 4 of 11
Changing your mind is a sign of weakness.
Strongly Disagree to Strongly Agree
Statement 5 of 11
I often seek advice from others before making important decisions.
Strongly Disagree to Strongly Agree
Statement 6 of 11
People should search actively for reasons why they might be wrong.
Strongly Disagree to Strongly Agree
Statement 7 of 11
Most important problems have multiple partial causes rather than one root cause.
Strongly Disagree to Strongly Agree
Statement 8 of 11
I find it energizing to discuss controversial topics.
Strongly Disagree to Strongly Agree
Statement 9 of 11
It is important to be loyal to your beliefs even when evidence is brought against them.
Strongly Disagree to Strongly Agree
Statement 10 of 11
Specialists generally make better predictions in their field than generalists.
Strongly Disagree to Strongly Agree
Statement 11 of 11
I enjoy debates and arguments.
Strongly Disagree to Strongly Agree
Section 6: About You
Optional but helps us analyze what predicts good judgment.
Highest Education Completed
Primary Field
Political Orientation
Very Liberal to Very Conservative
Prediction Market Experience
Familiar with Tetlock's Superforecasting Research?
Section 7: Scientific Calibration
Each study below was later tested in a large, pre-registered replication attempt. Estimate the probability that the original finding successfully replicated.
Study 1 of 6

Ego Depletion (1998)

Participants who first resisted eating cookies (exerting self-control) gave up faster on a subsequent puzzle task than those who hadn't resisted temptation. The researchers concluded that willpower is a limited resource that gets depleted with use.

Probability this replicated?
50%
Study 2 of 6

Facial Feedback (1988)

Participants who held a pen in their teeth (forcing a smile-like expression) rated cartoons as funnier than those who held the pen with their lips (preventing smiling). The researchers concluded that facial expressions can directly influence emotional experience.

Probability this replicated?
50%
Study 3 of 6

Anchoring Effect (1974)

Participants who first saw a random number (e.g., spinning a wheel showing "65") gave higher estimates to unrelated questions (e.g., "What percentage of African nations are in the UN?") than those who saw lower random numbers. The researchers concluded that arbitrary initial values bias subsequent numerical judgments.

Probability this replicated?
50%
Study 4 of 6

Power Posing (2010)

Participants who held "expansive" poses (arms spread, taking up space) for two minutes showed increased testosterone and decreased cortisol compared to those in "contractive" poses. The researchers concluded that body posture directly affects hormone levels and feelings of power.

Probability this replicated?
50%
Study 5 of 6

Loss Aversion (1979)

When choosing between gambles, people required potential gains to be roughly twice as large as potential losses before they'd accept a 50/50 bet. The researchers concluded that losses loom larger than equivalent gains in decision-making.

Probability this replicated?
50%
Study 6 of 6

Elderly Priming (1996)

Participants who unscrambled sentences containing words related to old age (e.g., "Florida," "wrinkle," "gray") walked more slowly down the hallway afterward than those exposed to neutral words. The researchers concluded that subtle exposure to concepts can unconsciously influence behavior.

Probability this replicated?
50%

Your Results

Composite Judgment Score: 0
Bayesian Update: 0/3 two-stage tests
Regression: 0/1 mean awareness
Reflection: 0/2 correct
Open-Minded: 0 of 24

Bayesian Reasoning

Diagnostic Reasoning

Cognitive Reflection

Open-Minded Thinking

Your Forecasting Profile

Leaderboard Name

Choose a display name for the leaderboard. This is optional—you can stay anonymous if you prefer.

Retake the assessment with a fresh start

Predictions

Forecast real events. Your accuracy will be tracked and compared to your judgment score.

Instructions: For each question, drag the slider to your probability estimate. The market price is shown for reference—you're welcome to agree or disagree with it.

Predictions are scored using the Brier score when events resolve. Lower is better.
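For yes/no questions, the Brier score is the mean squared difference between the forecast probability and the outcome (1 if the event happened, 0 if not). The sketch below assumes that binary form; the function name `brier_score` and the sample forecasts are illustrative and not taken from this site's code.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes.

    forecasts: probabilities between 0 and 1
    outcomes:  1 if the event occurred, 0 otherwise
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty lists")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical forecaster with three resolved questions.
forecasts = [0.9, 0.3, 0.5]
outcomes = [1, 0, 1]
print(round(brier_score(forecasts, outcomes), 4))  # 0.1167
```

Always answering 50% yields 0.25 under this form, so a score below 0.25 means the forecasts carried some information.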

Your Predictions

No predictions yet.

Leaderboard

Tracking the correlation between judgment scores and forecasting accuracy

The Research Question

Mellers et al. (2017) found that a composite judgment score (combining Bayesian reasoning, diagnostic thinking, and other measures) correlated r ≈ .46–.60 with superforecaster status.

We're testing whether this holds in the wild. As predictions resolve, we'll report the correlation between assessment scores and Brier scores.
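The reported statistic can be sketched as a Pearson correlation between each participant's assessment score and their Brier score. This is an assumption about the method, since the page does not say which correlation coefficient is used; the function name `pearson_r` and the data are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical participants: judgment score and Brier score per person.
judgment_scores = [42, 55, 61, 70, 78]
brier_scores = [0.31, 0.27, 0.24, 0.20, 0.15]
print(round(pearson_r(judgment_scores, brier_scores), 2))
```

Because lower Brier scores are better, better judgment going with better accuracy shows up as a negative correlation here. Flipping the sign (or correlating against 1 minus the Brier score) gives a figure comparable to a positive r.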

Score ↔ Accuracy Correlation
Forecasters with Resolved Predictions
Questions Resolved

Top Forecasters (by Brier Score)

Lower Brier scores indicate better prediction accuracy. Scores range from 0 (perfect) to 1 (worst).

Rank | User | Brier Score | Judgment Score | Predictions
Waiting for predictions to resolve...

About

The science behind the assessment

This project tests whether laboratory measures of judgment quality predict real-world forecasting accuracy. The assessment is based on two major research programs.

The Good Judgment Project

From 2011-2015, Philip Tetlock and Barbara Mellers ran an IARPA-sponsored forecasting tournament with 5,000+ participants. "Superforecasters"—the top 2%—outperformed professional intelligence analysts by roughly 30%, even though the analysts had access to classified information.

Mellers, B. et al. (2015). Identifying and cultivating superforecasters as a method of improving probabilistic predictions. Perspectives on Psychological Science, 10(3), 267-281.

Generalizable Judgment

A follow-up study asked whether superforecasters' skills generalized to other judgment tasks. They did: superforecasters scored higher on Bayesian reasoning (40-78% vs 5-28% for undergraduates) and on diagnostic test selection (77% vs 54% on congruence bias), and they showed better calibration.

Mellers, B. et al. (2017). How generalizable is good judgment? A multi-task, multi-benchmark study. Judgment and Decision Making, 12(4), 369-381.

r = .46–.60: correlation between composite judgment score and superforecaster status

This Project

We're testing whether that correlation replicates outside the lab. Participants take the assessment, make predictions on real events, and we track how judgment scores relate to forecasting accuracy as events resolve.

What We Measure

  • Bayesian Reasoning: Updating probabilities given evidence (Eddy 1982, Gigerenzer & Hoffrage 1995)
  • Diagnostic Reasoning: Pseudodiagnosticity, congruence bias, information bias (Doherty et al. 1979, Baron et al. 1988)
  • Cognitive Reflection: Frederick's CRT (2005)
  • Actively Open-Minded Thinking: Baron's AOT scale (1993, 2019)
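
As an illustration of the first skill in the list above, Bayesian updating combines a base rate with how likely the evidence is under each hypothesis. The sketch below uses deliberately different numbers from the assessment items; the function name `posterior` is hypothetical, and reusing the result as a new prior assumes the two pieces of evidence are independent given the hypothesis.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule for a yes/no hypothesis.

    prior:               P(hypothesis) before seeing the evidence
    p_evidence_if_true:  P(evidence | hypothesis is true)
    p_evidence_if_false: P(evidence | hypothesis is false)
    """
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1 - prior) * p_evidence_if_false
    if denominator == 0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / denominator


# Illustrative screening test: 1% base rate, 90% hit rate, 10% false-positive rate.
after_first = posterior(0.01, 0.90, 0.10)
print(round(after_first, 3))  # 0.083

# A second, independent positive result: the old posterior becomes the new prior.
after_second = posterior(after_first, 0.90, 0.10)
print(round(after_second, 2))  # 0.45
```

Even with a fairly accurate test, one positive result only raises a 1% base rate to about 8%, which is the kind of base-rate effect these problems probe.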

Can Judgment Improve?

Yes. Unlike IQ, these skills appear trainable. GJP found that brief training improved accuracy by 10-15%, and superforecasters themselves improved over time. The key skills: calibration, base rate thinking, scope sensitivity, and systematic updating.

Privacy

Your data is stored with an anonymous ID. We don't collect names or email addresses. You can bookmark your ID to return and track your predictions.