Research · 2026-06-01
Cross-document reconciliation on the DEXPI reference P&ID.
Engineering rework lives in the gap between documents. A datasheet says PT-101 reads 0–150 psi, the P&ID says 0–100 psi, the cable schedule terminates at a tag that isn’t in the index. The Tagsight cross-document reconciliation benchmark plants seven contradictions across a six-document bundle and asks fourteen engineer questions against it. On the DEXPI Public Reference P&ID (C01V04-VER.EX01, Methanol synthesis, CC BY 4.0), Tagsight detects 7 of 7 contradictions with zero false positives and scores 14 of 14 questions at 1.00. Methodology, scorers, question bank, contradiction list, and truth file are all public.
Why publish a cross-document benchmark
Per-page extraction accuracy is the easy benchmark to publish. A vendor takes a single P&ID, counts how many instrument bubbles the system reads, and reports a recall percentage. That number answers a narrow question: can the system read one drawing.
The harder question, and the one that actually moves rework hours, is whether the system can find the contradictions that get caught at site acceptance testing (SAT) and ripped out in a punch list: range disagreements between the P&ID and the datasheet binder; loop numbers that don’t match across the index, the loop sheet, and the JB schedule; a cable that terminates at a tag nobody else has heard of. These are the failure modes a working engineer recognises immediately.
Per-page extraction does not measure them. A system can read every bubble on a single sheet correctly and still ship a bundle that contradicts itself. The reverse is also true: a system with mediocre per-page recall can still catch every contradiction if it joins documents well. The two skills are independent.
What the benchmark measures
Two metrics, both scored against a frozen truth file:
- Contradiction detection. Seven planted contradictions across the bundle, distributed across the six document classes. The system must surface each one and must not surface any false positives. Reported as N caught / 7 planted + F false positives.
- Engineer question accuracy. Fourteen questions covering eleven primitives, including lookup, list, count, find_missing, find_inconsistencies, trace, impact, and standards_check. Each question is scored by the scorer that fits the answer shape: exact match for lookups, set equality for unordered lists, subset for partial-credit lists, entity-attributed F1 for question-answering tasks per MEBench (arXiv 2502.18993). Overall score is the mean across the fourteen questions.
Which source data the benchmark uses
The benchmark runs against the DEXPI Public Reference P&ID C01V04-VER.EX01, the Methanol synthesis case from the DEXPI working group’s training test cases. The drawing is real engineering, not synthetic. It carries 5,216 elements, including 21 equipment items, 6 process instrumentation functions, 23 piping network segments, 7 actuating system components, 6 information flows, and 29 connections.
The source is gitlab.com/dexpi/TrainingTestCases under CC BY 4.0. The bundle build script reads the ProteusXML, populates five Excel sibling documents with engineering data (instrument index, datasheet binder, cable schedule, loop sheets, JB schedule), and plants the seven contradictions into the populated docs. Attribution is preserved on every artifact.
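As a rough illustration of the planting step, here is a minimal sketch, assuming the populated documents can be modelled as nested dictionaries keyed by document name and tag. The function name `plant_range_mismatch`, the document keys, the record fields, and the contradiction ID are all hypothetical and are not taken from the actual build script.

```python
import copy


def plant_range_mismatch(bundle, truth, tag, new_range, contradiction_id):
    """Return a copy of the bundle whose datasheet range for `tag` disagrees
    with the P&ID, and append a matching record to the truth list.

    Hypothetical layout: bundle[document_name][tag][field] -> value.
    """
    planted = copy.deepcopy(bundle)  # leave the clean bundle untouched
    original = planted["datasheet_binder"][tag]["range"]
    planted["datasheet_binder"][tag]["range"] = new_range
    truth.append({
        "id": contradiction_id,
        "kind": "range_mismatch",
        "tag": tag,
        "documents": ["pid", "datasheet_binder"],
        "original": original,
        "planted": new_range,
    })
    return planted


# Illustrative data only; the real bundle is built from the ProteusXML.
clean_bundle = {
    "pid": {"PT-101": {"range": "0-100 psi"}},
    "datasheet_binder": {"PT-101": {"range": "0-100 psi"}},
}
truth_records = []
planted_bundle = plant_range_mismatch(
    clean_bundle, truth_records, "PT-101", "0-150 psi", "C-RANGE-01"
)
```

Recording the truth entry in the same call that mutates the document keeps the truth list and the planted bundle from drifting apart.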
Which contradictions the benchmark plants
The seven planted contradictions:
- Range mismatch. The P&ID and the datasheet binder disagree on a transmitter range.
- Service mismatch. The instrument index and the P&ID name a different service for the same tag.
- Signal class mismatch. The loop sheet and the I/O list disagree on AI / AO / DI / DO for one channel.
- Wrong loop number. The JB schedule lists a loop number that contradicts the loop sheet.
- Orphan in index. A tag appears in the instrument index but has no datasheet in the binder.
- Missing datasheet. A tag on the P&ID is missing from the datasheet binder entirely.
- Cable to phantom tag. The cable schedule terminates at a tag that does not exist on the P&ID or in the index.
Each contradiction is listed by ID in the truth file shipped with the benchmark. A third party rerunning the benchmark can verify which contradictions were caught and confirm the absence of false positives.
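The verification a third party would run can be sketched as a set comparison between the IDs in the truth file and the IDs a system reports. This is a minimal sketch, not the published scorer: the function names and the ID strings are hypothetical, and the real truth file format may differ.

```python
def score_detection(truth_ids, reported_ids):
    """Compare reported contradiction IDs against the planted ones."""
    truth = set(truth_ids)
    reported = set(reported_ids)
    return {
        "caught": len(truth & reported),
        "planted": len(truth),
        "false_positives": len(reported - truth),
        "missed": sorted(truth - reported),
    }


def format_report(result):
    """Render the 'N caught / 7 planted + F false positives' summary line."""
    return (
        f"{result['caught']} caught / {result['planted']} planted"
        f" + {result['false_positives']} false positives"
    )


# Hypothetical IDs standing in for the seven entries of the truth file.
planted_ids = [f"C{i}" for i in range(1, 8)]
# A hypothetical run that misses C7 and reports one spurious finding.
reported_ids = ["C1", "C2", "C3", "C4", "C5", "C6", "X9"]

result = score_detection(planted_ids, reported_ids)
summary = format_report(result)
```

A perfect run is one where `missed` is empty and `false_positives` is zero, which corresponds to the 7 of 7 result with no false positives.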
How the questions are scored
Four scorers from the engineering question literature:
- Exact match. Lookup answers (“What is the design range of PT-101?”). The candidate string must equal the truth string after normalisation.
- Set equality. List answers with no meaningful order (“List every instrument on the vapour overhead line.”). The candidate set must equal the truth set.
- Subset. List answers with partial credit (“Which instruments are SIS-classified?”). Score is the size of the intersection divided by the size of the truth set.
- Entity-attributed F1. The precision / recall measure from MEBench (arXiv 2502.18993) for question-answering tasks where the answer is a set of entities each carrying attributes.
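The four scorers can be sketched in a few lines each. This is a simplified illustration, not the published scorer code: the normalisation rules are an assumption, and `entity_f1` is a plain F1 over (entity, attribute, value) triples that may differ in detail from the MEBench definition.

```python
def normalise(text):
    """Assumed normalisation: unify en dashes, lowercase, collapse whitespace."""
    return " ".join(text.replace("\u2013", "-").lower().split())


def exact_match(candidate, truth):
    return 1.0 if normalise(candidate) == normalise(truth) else 0.0


def set_equality(candidate, truth):
    return 1.0 if set(candidate) == set(truth) else 0.0


def subset_score(candidate, truth):
    """Size of the intersection divided by the size of the truth set."""
    truth_set = set(truth)
    if not truth_set:
        return 1.0 if not set(candidate) else 0.0
    return len(set(candidate) & truth_set) / len(truth_set)


def entity_f1(candidate, truth):
    """F1 over (entity, attribute, value) triples.

    Both arguments map entity -> {attribute: value}; values must be hashable.
    """
    cand = {(e, a, v) for e, attrs in candidate.items() for a, v in attrs.items()}
    gold = {(e, a, v) for e, attrs in truth.items() for a, v in attrs.items()}
    if not cand or not gold:
        return 1.0 if cand == gold else 0.0
    true_positives = len(cand & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cand)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def overall_score(per_question_scores):
    """Overall score is the mean across all questions."""
    return sum(per_question_scores) / len(per_question_scores)
```

Because the subset scorer divides by the size of the truth set, it rewards recall only; extra items in the candidate list are not penalised under this reading.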
How the join is done
The reconciliation follows the cross-modality KG-linkage pattern documented in UniDoc-Bench (arXiv 2510.03663). Each tag becomes a node in a knowledge graph; each document becomes a set of edges carrying field values; field-level agreement is computed at query time across the document set. Orphan detection gates by entity kind so that equipment and lines (which legitimately don’t have a datasheet) are exempt from the orphan rule.
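The join described above can be approximated with a nested dictionary standing in for the knowledge graph. This is a minimal sketch under stated assumptions, not the Tagsight implementation: the record shape `(document, tag, field, value)`, the document names, the entity kinds, and every function name are hypothetical.

```python
from collections import defaultdict

# Assumption: only instruments are expected to have a datasheet.
KINDS_REQUIRING_DATASHEET = {"instrument"}


def build_graph(records):
    """Index records as tag -> field -> {document: value}."""
    graph = defaultdict(lambda: defaultdict(dict))
    for document, tag, field, value in records:
        graph[tag][field][document] = value
    return graph


def field_conflicts(graph):
    """Return (tag, field, values_by_document) wherever documents disagree."""
    conflicts = []
    for tag, fields in graph.items():
        for field, by_document in fields.items():
            if len(set(by_document.values())) > 1:
                conflicts.append((tag, field, dict(by_document)))
    return conflicts


def find_orphans(graph, kinds, required_document="datasheet_binder"):
    """Tags of a datasheet-requiring kind that never appear in the binder."""
    orphans = []
    for tag, fields in graph.items():
        if kinds.get(tag) not in KINDS_REQUIRING_DATASHEET:
            continue  # equipment and lines are exempt from the orphan rule
        documents = {doc for by_document in fields.values() for doc in by_document}
        if required_document not in documents:
            orphans.append(tag)
    return sorted(orphans)


# Illustrative records only.
records = [
    ("pid", "PT-101", "range", "0-100 psi"),
    ("datasheet_binder", "PT-101", "range", "0-150 psi"),
    ("pid", "P-201", "service", "feed pump"),
    ("instrument_index", "FT-102", "service", "reflux flow"),
]
kinds = {"PT-101": "instrument", "P-201": "equipment", "FT-102": "instrument"}

graph = build_graph(records)
conflicts = field_conflicts(graph)
orphans = find_orphans(graph, kinds)
```

In this toy data, PT-101 surfaces as a range conflict, FT-102 surfaces as an orphan, and the pump P-201 is skipped because its kind is exempt. Without that gate, every equipment item and line would register as a false positive.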
Results on the DEXPI bundle
- Contradiction detection: 7 of 7 planted contradictions caught. 0 false positives.
- Engineer question accuracy: 14 of 14 questions at the maximum scorer output for their shape. Overall mean score: 1.00.
- Bundle size: six documents, fifteen reconciled entities, eleven question primitives covered.
- Reproducibility: methodology, question bank, contradiction list, scorer code, and truth file are all public; only the candidate-generation pipeline (the Tagsight extractor itself) is closed.
What the benchmark does not measure
Per-page extraction recall and bbox precision on scanned P&IDs are evaluated against a separate ground-truth corpus and are not in scope here. Vendor-house dialects outside the DEXPI bundle (KKS, IEC 81346, NORSOK, JIS, operator-house) are exercised in the wider standards regression suite, not in this benchmark. Real customer bundles are confidential and never become public benchmarks.
Citations
- DEXPI Working Group. DEXPI Public Training and Test Cases. gitlab.com/dexpi/TrainingTestCases. CC BY 4.0. Accessed 2026-06-01.
- Wang et al. MEBench: A Comprehensive Benchmark for Multi-Entity Question Answering. arXiv:2502.18993, February 2025. (Source of the entity-attributed F1 scorer.)
- Wang et al. UniDoc-Bench: A Unified Benchmark for Document Understanding. arXiv:2510.03663, October 2025. (Source of the cross-modality KG-linkage pattern.)
- DEXPI 1.3 specification. ISO 15926-based exchange format for P&ID information. (Source format for the bundle.)
What we’re adding next
Three extensions on the roadmap. First: two to three more DEXPI bundles (one DEXPI XML produces one full eval bundle) so plant-shape variety is exercised, not just one Methanol synthesis case. Second: a safety-primitive question bank covering PSV verification, SIS coverage, alarm response, and HAZOP node assembly. Third: piping the same question bank through the live Intelligence runner with mocked workspace data so the language-model side of the system gets a deterministic score against the same truth file.