Benchmarks

We Published Our Numbers. Now Ask Your Vendor to Publish Theirs.

Every metric on this page was validated against human annotation—not self-reported, not cherry-picked, not measured on a dataset we designed to make ourselves look good. Two independent annotators reviewed the same material and agreed with each other 96.3% of the time. Then we measured Fabula against their consensus. That’s how you build a benchmark you can defend in a room full of people who read scripts for a living and engineers who read architecture docs the same way.

All benchmarks measured on The West Wing seasons 1–4, validated by independent human annotators with 96.3% inter-annotator agreement.

Accuracy

Does It Actually Work? (We Brought Receipts.)

The first question everyone asks, and the one most vendors answer with a demo on their best example. Here are validated numbers across the full dataset—the ambiguous scenes, the walk-on characters, the cold opens that don’t name anyone for three pages. Framed for both the people who make television and the people who evaluate infrastructure.

Entity Extraction Accuracy: 90.2%
For Production: Nine out of ten characters, locations, and objects extracted automatically. You review the genuinely ambiguous tenth, not cleanup for the nine we already got right.
For Investors: Hallucination rate below noise floor. Human-in-loop adjudication handles genuine ambiguity, not systemic extraction failures.

Relationship Precision: 94.7%
For Production: Who talked to whom, who works for whom, who betrayed whom—correctly mapped across four seasons without manual curation.
For Investors: Typed edges maintain semantic accuracy across temporal boundaries. Graph connectivity is high-signal, low-noise at production scale.

Temporal Ordering Accuracy: 98.1%
For Production: Flashbacks, cold opens, non-linear timelines—correctly sequenced. Near-perfect timeline construction from scripts alone.
For Investors: Dual temporal index architecture validates at research-grade accuracy. Fabula/syuzhet separation handles narrative complexity the way narratologists intended.
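The fabula/syuzhet separation behind that last number can be illustrated in a few lines. This is a minimal sketch, not Fabula's implementation; the field names are invented. Each scene carries two independent positions, one for presentation order (syuzhet, the order the script shows things) and one for story-time order (fabula), so a flashback sorts correctly in both views.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    """A scene with two orderings (illustrative field names)."""
    scene_id: str
    syuzhet_pos: int  # order of presentation in the script
    fabula_pos: int   # order of events in story time

def presentation_order(scenes):
    """Timeline as the audience experiences it: flashbacks stay in place."""
    return sorted(scenes, key=lambda s: s.syuzhet_pos)

def story_order(scenes):
    """Reconstructed chronological timeline."""
    return sorted(scenes, key=lambda s: s.fabula_pos)

# A flashback: shown second, but chronologically first.
scenes = [
    Scene("cold_open", syuzhet_pos=1, fabula_pos=3),
    Scene("flashback", syuzhet_pos=2, fabula_pos=1),
    Scene("act_one",   syuzhet_pos=3, fabula_pos=2),
]
```

Keeping the two indices separate is the whole point: neither ordering is ever overwritten by the other, so non-linear scripts never lose either timeline.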
Performance

87 Milliseconds. You Won’t Finish the Thought.

Speed matters in a writers’ room where momentum is currency and every pause to look something up costs the pitch that was forming in someone’s head. Speed also matters when you’re evaluating whether infrastructure can scale past a single show.

Query Latency (Avg): 87ms
For Production: Ask any question about your show—who was in that scene, what happened after, where it took place—and the answer arrives before the room moves on.
For Investors: Sub-100ms P50 on PostgreSQL + Neo4j hybrid. Production-ready, horizontally scalable. No caching tricks.

Processing Throughput: 83 min/episode
For Production: Upload scripts Friday evening. Complete knowledge graph by Monday morning. Overnight batch processing, not a weekend project.
For Investors: 3.6x faster than baseline after async pipeline optimisation. Linear cost scaling per episode—no superlinear blowup at series length.

Entity Resolution Speed: 120ms P95
For Production: "Is this the same character?" resolved instantly—even across seasons, even after name changes, even when the script just says "the bartender."
For Investors: ChromaDB + Neo4j hybrid scales in constant time per entity pair. Contrastive entity sharpening eliminates the N² trap.
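The escape from the N² trap is nearest-neighbour lookup over embeddings rather than pairwise comparison of every mention against every other mention. Here is a toy sketch of that decision shape in plain Python, with invented vectors and a hypothetical threshold; the production system uses ChromaDB for the vector search, which this does not attempt to reproduce.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def resolve(mention_vec, canonical, threshold=0.85):
    """Return the best-matching canonical entity, or None to flag for review.

    One lookup per mention against the canonical set, not a comparison of
    every mention pair. The threshold is illustrative, not tuned.
    """
    best_id, best_score = None, threshold
    for entity_id, vec in canonical.items():
        score = cosine(mention_vec, vec)
        if score >= best_score:
            best_id, best_score = entity_id, score
    return best_id

# Toy canonical entities with made-up 3-dimensional embeddings.
canonical = {
    "char:bartender": [0.9, 0.1, 0.2],
    "char:president": [0.1, 0.95, 0.3],
}
```

A mention close to a stored embedding resolves to that entity; a mention near nothing returns None, which is the "flag it, you decide" path described below.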
Scale

We Didn’t Benchmark on Toy Examples

We don’t extract everything.

Ninety percent. Validated against human annotation across the full dataset.

The other ten percent is genuinely ambiguous—characters the script doesn’t name, relationships implied but never stated, moments where two professional annotators looked at the same scene and disagreed about what was happening. We flag those. You decide. That is not a limitation. That is the design.

What that ninety percent covers: 726 canonical entities across four seasons of The West Wing—from heads of state to walk-on characters who appear once and might return three seasons later. 80+ episodes spanning different narrative structures, different temporal patterns, different ratios of dialogue to action. 15,000+ graph relationships with dense connectivity across causal, thematic, temporal, and emotional edges.

And zero entity drift. Process the same episode twice, get the same graph. Process it a third time six months later, get the same graph again. The same character mentioned in Episode 1 and Episode 40 resolves to the same canonical entity with the same UUID. Deterministic, reproducible, and verifiable—which is another way of saying: these aren’t benchmarks we ran once and then stopped looking. They hold.
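Same-input-same-UUID behaviour is what name-based UUIDs provide. A sketch of the idea using Python's standard uuid5; the namespace seed and key format here are illustrative, not Fabula's actual scheme.

```python
import uuid

# Illustrative namespace; a real pipeline would fix its own seed internally.
ENTITY_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "entities.example.invalid")

def canonical_uuid(entity_type: str, canonical_name: str) -> uuid.UUID:
    """Derive a UUID from the canonical identity, not from run state.

    Same canonical entity -> same UUID, on any machine, in any run,
    six months apart. Lowercasing makes the key case-insensitive.
    """
    return uuid.uuid5(ENTITY_NS, f"{entity_type}:{canonical_name.lower()}")

a = canonical_uuid("character", "Andrea Wyatt")
b = canonical_uuid("character", "Andrea Wyatt")
```

Because the UUID is a pure function of the canonical identity, reprocessing an episode cannot mint a second ID for a character it has already seen.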
Methodology

Two Annotators. 96.3% Agreement. And We Still Made Them Argue.

Human Annotation Baseline. Two independent annotators reviewed the same material without seeing each other’s work or Fabula’s output. They agreed 96.3% of the time. Where they disagreed, they adjudicated—discussed the edge case, examined the source text, reached consensus. Fabula’s accuracy is measured against that consensus, not against either individual annotator and certainly not against itself. This is how you build a benchmark that survives peer review.
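Raw inter-annotator agreement of the kind quoted above is a simple ratio, and the disagreements are exactly the adjudication queue. A minimal sketch with hypothetical labels; real annotation schemes usually also report a chance-corrected statistic such as Cohen's kappa, which this omits.

```python
def percent_agreement(ann_a, ann_b):
    """Fraction of items where two annotators assigned the same label."""
    assert len(ann_a) == len(ann_b), "annotators must label the same items"
    matches = sum(1 for x, y in zip(ann_a, ann_b) if x == y)
    return matches / len(ann_a)

# Hypothetical entity-type labels from two independent annotators.
a_labels = ["PER", "PER", "LOC", "OBJ", "PER"]
b_labels = ["PER", "PER", "LOC", "PER", "PER"]

agreement = percent_agreement(a_labels, b_labels)
# Items where the annotators disagreed go to adjudication.
to_adjudicate = [i for i, (x, y) in enumerate(zip(a_labels, b_labels)) if x != y]
```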

Cross-Episode Consistency. The hardest problem in narrative extraction isn’t identifying a character in one scene. It’s knowing that the “Andy” in Episode 1 and the “Andrea Wyatt” in Episode 40 are the same person. We measure entity UUID stability across the full series run: 98.7%. Manual adjudication is only triggered for genuine edge cases—name changes after marriage, promotions that change how characters are addressed, aliases that the writers invented to mislead the audience.

Reproducibility Is Not a Feature. It’s an Obligation. Process the same input twice, get the same output. No stochastic drift, no accumulated error, no results that depend on which server handled the request. Deterministic extraction with schema-enforced output validation. If a benchmark can’t be reproduced, it isn’t a benchmark. It’s a press release.
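Schema-enforced output validation can be as simple as a record checker that collects every violation instead of stopping at the first. A hand-rolled sketch with illustrative field names; Fabula's actual schema is not shown here.

```python
def validate_entity(record: dict) -> list:
    """Return a list of schema violations; an empty list means the record passes.

    Field names and allowed types are illustrative, not Fabula's schema.
    """
    errors = []
    required = {"entity_id": str, "entity_type": str, "canonical_name": str}
    for name, expected_type in required.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(f"wrong type for {name}")
    # None covers the already-reported missing-field case.
    if record.get("entity_type") not in {"character", "location", "object", None}:
        errors.append("unknown entity_type")
    return errors
```

Rejecting every malformed record at the boundary is what makes "same input, same output" checkable: a run either emits schema-valid records or fails loudly, never a silently different graph.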

Open Dataset (Planned). We intend to open-source our West Wing annotations for academic validation—a reproducible benchmark for narrative extraction research. The field needs shared evaluation standards. Publishing ours is the smallest useful thing we can do about that.

You’ve Read the Benchmarks. Now Walk the Graph.

Numbers describe. The graph demonstrates. Browse the live catalog to see what these metrics feel like when you’re tracing a character through four seasons of television. Or read the architecture behind them.