We Published Our Numbers. Now Ask Your Vendor to Publish Theirs.
Every metric on this page was validated against human annotation—not self-reported, not cherry-picked, not measured on a dataset we designed to make ourselves look good. Two independent annotators reviewed the same material and agreed with each other 96.3% of the time. Then we measured Fabula against their consensus. That’s how you build a benchmark you can defend in a room full of people who read scripts for a living and engineers who read architecture docs the same way.
All benchmarks measured on The West Wing seasons 1–4, validated by independent human annotators with 96.3% inter-annotator agreement.
Does It Actually Work? (We Brought Receipts.)
The first question everyone asks, and the one most vendors answer with a demo on their best example. Here are validated numbers across the full dataset—the ambiguous scenes, the walk-on characters, the cold opens that don’t name anyone for three pages. Framed for both the people who make television and the people who evaluate infrastructure.
| Metric | Result | For Production | For Investors |
|---|---|---|---|
| Entity Extraction Accuracy | 90.2% | Nine out of ten characters, locations, and objects extracted automatically. You review the genuinely ambiguous tenth—not the nine we missed. | Hallucination rate below noise floor. Human-in-loop adjudication handles genuine ambiguity, not systemic extraction failures. |
| Relationship Precision | 94.7% | Who talked to whom, who works for whom, who betrayed whom—correctly mapped across four seasons without manual curation. | Typed edges maintain semantic accuracy across temporal boundaries. Graph connectivity is high-signal, low-noise at production scale. |
| Temporal Ordering Accuracy | 98.1% | Flashbacks, cold opens, non-linear timelines—correctly sequenced. Near-perfect timeline construction from scripts alone. | Dual temporal index architecture validates at research-grade accuracy. Fabula/syuzhet separation handles narrative complexity the way narratologists intended. |
87 Milliseconds. You Won’t Finish the Thought.
Speed matters in a writers’ room where momentum is currency and every pause to look something up costs the pitch that was forming in someone’s head. Speed also matters when you’re evaluating whether infrastructure can scale past a single show.
| Metric | Result | For Production | For Investors |
|---|---|---|---|
| Query Latency (Avg) | 87ms | Ask any question about your show—who was in that scene, what happened after, where it took place—and the answer arrives before the room moves on. | Sub-100ms P50 on PostgreSQL + Neo4j hybrid. Production-ready, horizontally scalable. No caching tricks. |
| Processing Throughput | 83 min/episode | Upload scripts Friday evening. Complete knowledge graph by Monday morning. Overnight batch processing, not a weekend project. | 3.6x faster than baseline after async pipeline optimisation. Linear cost scaling per episode—no superlinear blowup at series length. |
| Entity Resolution Speed | 120ms P95 | “Is this the same character?” resolved instantly—even across seasons, even after name changes, even when the script just says ‘the bartender.’ | ChromaDB + Neo4j hybrid scales in constant time per entity pair. Contrastive entity sharpening eliminates the N² trap. |
We Didn’t Benchmark on Toy Examples
Ninety percent. Validated against human annotation across the full dataset.
The other ten percent is genuinely ambiguous—characters the script doesn’t name, relationships implied but never stated, moments where two professional annotators looked at the same scene and disagreed about what was happening. We flag those. You decide. That is not a limitation. That is the design.
What that ninety percent covers: 726 canonical entities across four seasons of The West Wing—from heads of state to walk-on characters who appear once and might return three seasons later. 80+ episodes across five series with different narrative structures, different temporal patterns, different ratios of dialogue to action. 15,000+ graph relationships with dense connectivity across causal, thematic, temporal, and emotional edges.
And zero entity drift. Process the same episode twice, get the same graph. Process it a third time six months later, get the same graph again. The same character mentioned in Episode 1 and Episode 40 resolves to the same canonical entity with the same UUID. Deterministic, reproducible, and verifiable—which is another way of saying: these aren’t benchmarks we ran once and then stopped looking. They hold.
Two Annotators. 96.3% Agreement. And We Still Made Them Argue.
Cross-Episode Consistency. The hardest problem in narrative extraction isn’t identifying a character in one scene. It’s knowing that the “Andy” in Episode 1 and the “Andrea Wyatt” in Episode 40 are the same person. We measure entity UUID stability across the full series run: 98.7%. Manual adjudication is only triggered for genuine edge cases—name changes after marriage, promotions that change how characters are addressed, aliases that the writers invented to mislead the audience.
Reproducibility Is Not a Feature. It’s an Obligation. Process the same input twice, get the same output. No stochastic drift, no accumulated error, no results that depend on which server handled the request. Deterministic extraction with schema-enforced output validation. If a benchmark can’t be reproduced, it isn’t a benchmark. It’s a press release.
Open Dataset (Planned). We intend to open-source our West Wing annotations for academic validation—a reproducible benchmark for narrative extraction research. The field needs shared evaluation standards. Publishing ours is the smallest useful thing we can do about that.
You’ve Read the Benchmarks. Now Walk the Graph.
Numbers describe. The graph demonstrates. Browse the live catalog to see what these metrics feel like when you’re tracing a character through four seasons of television. Or read the architecture behind them.