What contradictions does the benchmark plant?

Seven contradictions across the bundle: range mismatch (P&ID says 0-100 psi, datasheet says 0-150 psi), service mismatch, signal class mismatch, wrong loop number, orphan in index (tag with no datasheet), missing datasheet, and cable schedule reference to a phantom tag. Each contradiction is verifiable in the truth file shipped with the benchmark.

How are questions scored?

Four scorers from the engineering question literature: exact match (lookup answers), set equality (list answers with no order), subset (list with partial credit), and entity-attributed F1 (the precision/recall measure from MEBench, arxiv 2502.18993). Scoring rules per question are committed to the truth file so a third party can rerun the benchmark and compute the same numbers.

Why DEXPI and not a real customer drawing?

Customer drawings are confidential. A public benchmark needs a permissively licensed source so anyone can reproduce the numbers. DEXPI was the only public reference P&ID with enough element diversity (5,216 elements, 21 equipment items, real instrument loops and piping networks) to make cross-document reconciliation meaningful. Synthetic drawings would not exercise the same failure modes that customer bundles do.

Is the benchmark code open?

The methodology, scorers, question bank, contradiction list, and truth file are all open and reproducible. The bundle build script reads the DEXPI ProteusXML and populates the five Excel sibling documents with real engineering data; the scorer runs against the truth file and returns numbers a third party can reproduce. The pipeline that produces the candidate answers (the Tagsight extraction itself) is the proprietary part.

Why does this matter for engineering teams?

Per-page extraction accuracy answers "can the system read a single P&ID." Cross-document reconciliation answers "can the system find the contradictions that get caught at SAT and ripped out in a punch list." The two are different problems and the second is where most engineering rework lives. Publishing the benchmark openly is the only way to make accuracy claims that a procurement team can verify.

Research · 2026-06-01

Cross-document reconciliation on the DEXPI reference P&ID.

Name: DEXPI Public Reference P&ID, Methanol synthesis (C01V04-VER.EX01)
Creator: Tagsight
License: https://tagsight.io/terms
Keywords: DEXPI, ProteusXML, P&ID reference, Methanol synthesis, cross-document reconciliation, engineering benchmark, CC BY 4.0

Engineering rework lives in the gap between documents. A datasheet says PT-101 reads 0–150 psi, the P&ID says 0–100 psi, and the cable schedule terminates at a tag that isn’t in the index. The Tagsight cross-document reconciliation benchmark plants seven contradictions across a six-document bundle and asks fourteen engineer questions against it. On the DEXPI Public Reference P&ID (C01V04-VER.EX01, Methanol synthesis, CC BY 4.0), Tagsight detects 7 of 7 contradictions with zero false positives and scores 14 of 14 questions at 1.00. Methodology, scorers, question bank, contradiction list, and truth file are all public.

By Tagsight Team · Published Jun 1, 2026 · Reviewed Jun 5, 2026

Why publish a cross-document benchmark

Per-page extraction accuracy is the easy benchmark to publish. A vendor takes a single P&ID, counts how many instrument bubbles the system reads, and reports a recall percentage. That number answers a narrow question: can the system read one drawing.

The harder question, and the one that actually moves rework hours, is: can the system find the contradictions that get caught at SAT and ripped out in a punch list. Range disagreements between the P&ID and the datasheet binder. Loop numbers that don’t match across the index, the loop sheet, and the JB schedule. A cable that terminates at a tag nobody else has heard of. These are the failure modes a working engineer recognises immediately.

Per-page extraction does not measure them. A system can read every bubble on a single sheet correctly and still ship a bundle that contradicts itself. The reverse is also true: a system with mediocre per-page recall can still catch every contradiction if it joins documents well. The two skills are independent.

What the benchmark measures

Two metrics, both scored against a frozen truth file:

Contradiction detection. Seven planted contradictions across the bundle, distributed across the six document classes. The system must surface each one and must not surface any false positives. Reported as N caught / 7 planted + F false positives.
Engineer question accuracy. Fourteen questions covering eleven primitives, including lookup, list, count, find_missing, find_inconsistencies, trace, impact, and standards_check. Each question is scored by the scorer that fits the answer shape: exact match for lookups, set equality for unordered lists, subset for partial-credit lists, entity-attributed F1 for question-answering tasks per MEBench (arxiv 2502.18993). Overall score is the mean across the fourteen questions.
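The two reported numbers can be sketched in a few lines of Python. This is a minimal illustration, not the published scorer: the function names and the contradiction IDs (`C1` to `C7`, `X9`) are hypothetical, and the sample run deliberately shows a miss and a false positive so both parts of the report are visible.

```python
from statistics import mean


def contradiction_report(planted_ids, surfaced_ids):
    """Return (caught, planted, false_positives) for one benchmark run."""
    planted = set(planted_ids)
    surfaced = set(surfaced_ids)
    caught = len(planted & surfaced)           # planted contradictions the system surfaced
    false_positives = len(surfaced - planted)  # surfaced items that were never planted
    return caught, len(planted), false_positives


def overall_score(per_question_scores):
    """Mean of the per-question scores, each expected to lie in [0, 1]."""
    return mean(per_question_scores)


# Hypothetical IDs, for illustration only.
planted = [f"C{i}" for i in range(1, 8)]
surfaced = ["C1", "C2", "C3", "C4", "C5", "C6", "X9"]

caught, total, fp = contradiction_report(planted, surfaced)
print(f"{caught} caught / {total} planted + {fp} false positives")
print(overall_score([1.0] * 14))
```

A run that surfaces exactly the seven planted IDs and nothing else yields the 7 caught / 7 planted + 0 false positives shape used in the results.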

Which source data the benchmark uses

The benchmark runs against the DEXPI Public Reference P&ID C01V04-VER.EX01, the Methanol synthesis case from the DEXPI working group’s training test cases. The drawing is real engineering, not synthetic. It carries 5,216 elements, including 21 equipment items, 6 process instrumentation functions, 23 piping network segments, 7 actuating system components, 6 information flows, and 29 connections.

The source is gitlab.com/dexpi/TrainingTestCases under CC BY 4.0. The bundle build script reads the ProteusXML, populates five Excel sibling documents with engineering data (instrument index, datasheet binder, cable schedule, loop sheets, JB schedule), and plants the seven contradictions into the populated docs. Attribution is preserved on every artifact.
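As a rough sketch of the first step of such a build script, the snippet below counts elements by class in a ProteusXML-style file using only the Python standard library. The inline sample is a toy, not an excerpt of C01V04-VER.EX01, and the element names (`Equipment`, `PipingNetworkSegment`, `ProcessInstrumentationFunction`) are assumptions about the schema that should be checked against the actual file; a real file may also use XML namespaces, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy document for illustration; element names are assumed, not verified
# against the DEXPI reference file.
SAMPLE = """<PlantModel>
  <Equipment ID="E1" TagName="T-101"/>
  <Equipment ID="E2" TagName="P-101"/>
  <PipingNetworkSystem ID="S1">
    <PipingNetworkSegment ID="L1"/>
  </PipingNetworkSystem>
  <ProcessInstrumentationFunction ID="I1"/>
</PlantModel>"""


def count_elements(xml_text, tags):
    """Count how many elements of each requested tag appear anywhere in the tree."""
    root = ET.fromstring(xml_text)
    counts = Counter(element.tag for element in root.iter())
    return {tag: counts.get(tag, 0) for tag in tags}


inventory = count_elements(
    SAMPLE,
    ["Equipment", "PipingNetworkSegment", "ProcessInstrumentationFunction"],
)
print(inventory)
```

An inventory like this is what a build script would walk to populate the sibling documents before planting contradictions.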

Which contradictions the benchmark plants

The seven planted contradictions:

Range mismatch. The P&ID and the datasheet binder disagree on a transmitter range.
Service mismatch. The instrument index and the P&ID name a different service for the same tag.
Signal class mismatch. The loop sheet and the I/O list disagree on AI / AO / DI / DO for one channel.
Wrong loop number. The JB schedule lists a loop number that contradicts the loop sheet.
Orphan in index. A tag appears in the instrument index but has no datasheet in the binder.
Missing datasheet. A tag on the P&ID is missing from the datasheet binder entirely.
Cable to phantom tag. The cable schedule terminates at a tag that does not exist on the P&ID or in the index.

Each contradiction is listed by ID in the truth file shipped with the benchmark. A third party rerunning the benchmark can verify which contradictions were caught and confirm the absence of false positives.
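Two of the seven classes, range mismatch and cable to phantom tag, reduce to simple comparisons once each document is loaded as a mapping keyed by tag. The sketch below assumes that shape; the function names, tags, and values are hypothetical and are not taken from the truth file.

```python
def find_range_mismatches(pid_ranges, datasheet_ranges):
    """Tags present in both documents whose ranges differ."""
    shared = pid_ranges.keys() & datasheet_ranges.keys()
    return sorted(tag for tag in shared if pid_ranges[tag] != datasheet_ranges[tag])


def find_phantom_cable_tags(cable_tags, pid_tags, index_tags):
    """Cable terminations that exist neither on the P&ID nor in the index."""
    known = set(pid_tags) | set(index_tags)
    return sorted(set(cable_tags) - known)


# Illustrative data only: (low, high) ranges in psi.
pid_ranges = {"PT-101": (0, 100), "TT-102": (0, 200)}
datasheet_ranges = {"PT-101": (0, 150), "TT-102": (0, 200)}

mismatches = find_range_mismatches(pid_ranges, datasheet_ranges)
phantoms = find_phantom_cable_tags(
    cable_tags=["PT-101", "FT-999"],
    pid_tags=pid_ranges,
    index_tags=["PT-101", "TT-102"],
)
print(mismatches, phantoms)
```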

How the questions are scored

Four scorers from the engineering question literature:

Exact match. Lookup answers (“What is the design range of PT-101?”). The candidate string must equal the truth string after normalisation.
Set equality. List answers with no meaningful order (“List every instrument on the vapour overhead line.”). The candidate set must equal the truth set.
Subset. List answers with partial credit (“Which instruments are SIS-classified?”). Score is the size of the intersection divided by the size of the truth set.
Entity-attributed F1. The precision / recall measure from MEBench (arxiv 2502.18993) for question-answering tasks where the answer is a set of entities each carrying attributes.
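The four scorers can be sketched as follows. Several details are assumptions rather than the published definitions: the normalisation step (lowercasing and collapsing whitespace), the handling of empty truth sets, and in particular the entity-attributed F1, which is approximated here as F1 over (entity, attribute, value) triples. The MEBench paper and the benchmark's scorer code are the authority on the exact formulation.

```python
def normalise(text):
    """Assumed normalisation: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def exact_match(candidate, truth):
    return 1.0 if normalise(candidate) == normalise(truth) else 0.0


def set_equality(candidate, truth):
    return 1.0 if set(candidate) == set(truth) else 0.0


def subset_score(candidate, truth):
    """Size of the intersection divided by the size of the truth set."""
    truth_set = set(truth)
    if not truth_set:  # assumed convention for an empty truth set
        return 1.0 if not set(candidate) else 0.0
    return len(set(candidate) & truth_set) / len(truth_set)


def entity_attributed_f1(candidate, truth):
    """F1 over (entity, attribute, value) triples; inputs map entity -> {attribute: value}."""
    cand = {(e, a, v) for e, attrs in candidate.items() for a, v in attrs.items()}
    gold = {(e, a, v) for e, attrs in truth.items() for a, v in attrs.items()}
    if not cand and not gold:
        return 1.0
    true_positives = len(cand & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cand)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative values only.
print(exact_match("0-100  PSI", "0-100 psi"))
print(subset_score(["PT-101", "LT-201"], ["PT-101", "LT-201", "FT-301", "TT-401"]))
```

Note that the subset scorer as described rewards recall only: extra candidate items outside the truth set do not lower the score, which is why set equality is the stricter choice for unordered lists.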

How the join is done

The reconciliation follows the cross-modality KG-linkage pattern documented in UniDoc-Bench (arxiv 2510.03663). Each tag becomes a node in a knowledge graph; each document becomes a set of edges carrying field values; field-level agreement is computed at query time across the document set. Orphan detection gates by entity kind so that equipment and lines (which legitimately don’t have a datasheet) are exempt from the orphan rule.
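A minimal sketch of that join, assuming each document has already been parsed into a mapping of tag to field values. This is neither the UniDoc-Bench code nor the Tagsight pipeline: the graph layout, the document names, the `DATASHEET_REQUIRED_KINDS` set, and the sample tags are all hypothetical.

```python
from collections import defaultdict

# Assumption: only instruments are required to have a datasheet.
DATASHEET_REQUIRED_KINDS = {"instrument"}


def build_graph(documents):
    """documents: {doc_name: {tag: {field: value}}} -> one node per tag."""
    graph = {}
    for doc_name, rows in documents.items():
        for tag, fields in rows.items():
            node = graph.setdefault(tag, {"docs": set(), "fields": defaultdict(dict)})
            node["docs"].add(doc_name)  # the tag appears in this document
            for field, value in fields.items():
                node["fields"][field][doc_name] = value  # edge carrying a field value
    return graph


def disagreements(graph):
    """(tag, field) pairs where documents carry different values for the same field."""
    found = []
    for tag in sorted(graph):
        for field, by_doc in sorted(graph[tag]["fields"].items()):
            if len(set(by_doc.values())) > 1:
                found.append((tag, field))
    return found


def orphans(graph, kinds):
    """Tags lacking a datasheet, gated so exempt entity kinds are never flagged."""
    return sorted(
        tag for tag, node in graph.items()
        if kinds.get(tag) in DATASHEET_REQUIRED_KINDS and "datasheet" not in node["docs"]
    )


# Illustrative data only.
documents = {
    "pid": {"PT-101": {"range": "0-100 psi"}, "T-100": {}, "FT-103": {}},
    "datasheet": {"PT-101": {"range": "0-150 psi"}},
    "index": {"PT-101": {"service": "feed"}, "FT-103": {"service": "reflux"}},
}
kinds = {"PT-101": "instrument", "FT-103": "instrument", "T-100": "equipment"}

graph = build_graph(documents)
print(disagreements(graph), orphans(graph, kinds))
```

In this toy bundle the vessel T-100 has no datasheet either, but the kind gate keeps it out of the orphan list, which is the behaviour the paragraph above describes.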

Results on the DEXPI bundle

Contradiction detection: 7 of 7 planted contradictions caught. 0 false positives.
Engineer question accuracy: 14 of 14 questions at the maximum scorer output for their shape. Overall mean score: 1.00.
Bundle size: six documents, fifteen reconciled entities, eleven question primitives covered.
Reproducibility: methodology, question bank, contradiction list, scorer code, and truth file are all public; only the candidate-generation pipeline (the Tagsight extractor itself) is closed.

What the benchmark does not measure

Per-page extraction recall and bbox precision on scanned P&IDs are evaluated against a separate ground-truth corpus and are not in scope here. Vendor-house dialects outside the DEXPI bundle (KKS, IEC 81346, NORSOK, JIS, operator-house) are exercised in the wider standards regression suite, not in this benchmark. Real customer bundles are confidential and never become public benchmarks.

Citations

DEXPI Working Group. DEXPI Public Training and Test Cases. gitlab.com/dexpi/TrainingTestCases. CC BY 4.0. Accessed 2026-06-01.
Wang et al. MEBench: A Comprehensive Benchmark for Multi-Entity Question Answering. arxiv:2502.18993, February 2025. (Source of the entity-attributed F1 scorer.)
Wang et al. UniDoc-Bench: A Unified Benchmark for Document Understanding. arxiv:2510.03663, October 2025. (Source of the cross-modality KG-linkage pattern.)
DEXPI 1.3 specification. ISO 15926-based exchange format for P&ID information. (Source format for the bundle.)

What we’re adding next

Three extensions on the roadmap. First: two to three more DEXPI bundles (one DEXPI XML produces one full eval bundle) so plant-shape variety is exercised, not just one Methanol synthesis case. Second: a safety-primitive question bank covering PSV verification, SIS coverage, alarm response, and HAZOP node assembly. Third: piping the same question bank through the live Intelligence runner with mocked workspace data so the language-model side of the system gets a deterministic score against the same truth file.