IQAI · Risk

Human / LLM / IQAI Risk Pilot Study

A 10-item claim-support comparison across five human reviewers, multiple LLMs, and IQAI Risk's rule-governed scoring layer.

This is an early calibration study, not a publication-grade benchmark. Its purpose is to show where humans, LLMs, and IQAI Risk agree, where severity policy differs, and where rule-governed repeatable scoring adds review value.

Executive summary

This pilot tested the same 10 claim/evidence items across three review surfaces: human reviewers, LLMs, and IQAI Risk. The purpose was not to declare one reviewer type automatically correct, but to measure agreement, disagreement, and calibration targets.

Main finding: humans and LLMs strongly agreed on short, clean evidence gaps. IQAI Risk aligned directionally while applying a different severity policy: five forward-looking claims with no direct evidence (Q1 and Q3 through Q6) were kept in WEAK / NEEDS REVIEW, while the human and LLM majorities scored them as UNSUPPORTED. The stronger product conclusion is that judgment alone is not a control record: IQAI Risk adds repeatability, verification routing, calibration policy, and receipts.

Study metrics

| Metric | Value | Note |
|---|---|---|
| Human reviewers | 5 | MTurk participants, anonymized |
| Questions | 10 | Claim/evidence pairs |
| Human support κ | 0.664 | Substantial pilot agreement |
| Human issue κ | 0.402 | Moderate; issue labels are more nuanced |
| Average time | 22:55 | Average human completion time |
| Human vs LLM | Near | Near-complete majority alignment |
| Directional match | High | After label collapse |
| Primary calibration | Boundary | Weak vs Unsupported |
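
The report does not name the κ statistic behind the two agreement figures. Fleiss' kappa, a common choice for more than two raters, computed over the five reviewers' labels in the per-question response tables, reproduces both values, so the sketch below assumes that statistic. The function name `fleiss_kappa`, the short label codes, and the count-row layout are illustrative choices, not part of the study tooling.

```python
from collections import Counter

def fleiss_kappa(count_rows):
    """Fleiss' kappa; each row maps label -> number of raters who chose it."""
    n_items = len(count_rows)
    n_raters = sum(count_rows[0].values())
    totals = Counter()
    agreement = 0.0
    for counts in count_rows:
        totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreement += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar = agreement / n_items
    total_ratings = n_items * n_raters
    # Chance agreement from the overall label proportions.
    p_exp = sum((c / total_ratings) ** 2 for c in totals.values())
    return (p_bar - p_exp) / (1 - p_exp)

# Human support labels for Q1..Q10, tallied from the response tables.
# S = supported, U = unsupported, P = partially supported, N = needs outside verification.
SUPPORT = [
    {"U": 5}, {"S": 5}, {"U": 5}, {"U": 5}, {"U": 5}, {"U": 5}, {"N": 5},
    {"U": 4, "P": 1}, {"N": 2, "P": 2, "U": 1}, {"P": 3, "U": 1, "N": 1},
]

# Human main-issue labels for Q1..Q10.
# F = forward-looking, U = unsupported, O = overstated, NI = no issue,
# N = needs outside verification, C = cause not proven.
ISSUE = [
    {"F": 3, "U": 1, "O": 1}, {"NI": 4, "U": 1}, {"F": 4, "U": 1}, {"F": 4, "U": 1},
    {"F": 4, "U": 1}, {"F": 4, "U": 1}, {"N": 4, "C": 1}, {"O": 3, "F": 1, "U": 1},
    {"C": 5}, {"O": 3, "F": 1, "C": 1},
]

print(round(fleiss_kappa(SUPPORT), 3))  # 0.664
print(round(fleiss_kappa(ISSUE), 3))    # 0.402
```

With only 10 items and 5 raters, a single changed label moves either figure noticeably, which is consistent with the report treating them as pilot-level agreement.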

What the pilot showed

The pilot is small, but it clarifies the product thesis. Humans and LLMs can identify obvious evidence gaps in short controlled examples. The enterprise problem is what happens after that judgment: whether the review is repeatable, routed, calibrated, and preserved as a record.

Finding 1 (Agree): Humans and LLMs aligned strongly on obvious evidence gaps.
Finding 2 (Vary): Individual LLMs still varied on nuance, verification posture, and severity.
Finding 3 (Route): Registry, filing, date, price, and math checks need explicit verification lanes.
Finding 4 (Record): IQAI Risk turns review into a governed scoring record with receipts.

Product conclusion: the pilot does not show that human reviewers or LLMs are useless. It shows that simple judgment is not enough for enterprise reliance. IQAI Risk adds the control layer: rule-governed scoring, verification routing, calibration boundaries, and review receipts.

Top-level comparison

The table summarizes the human majority, LLM majority, and IQAI Risk result for each item.

| Q | Claim type | Human majority | LLM majority | IQAI Risk result | Match / note |
|---|---|---|---|---|---|
| Q1 | Forward-looking demand claim | Unsupported | Unsupported | WEAK | Directional alignment; IQAI Risk less severe. |
| Q2 | Directly supported factual claim | Supported | Supported | SUPPORTED | Exact alignment. |
| Q3 | Forward-looking financial claim | Unsupported | Unsupported | WEAK | Directional alignment; IQAI Risk less severe. |
| Q4 | Forward-looking compliance/control claim | Unsupported | Unsupported | WEAK | Directional alignment; IQAI Risk less severe. |
| Q5 | Forward-looking business claim | Unsupported | Unsupported | WEAK | Directional alignment; IQAI Risk less severe. |
| Q6 | Forward-looking cybersecurity/control claim | Unsupported | Unsupported | WEAK | Directional alignment; IQAI Risk less severe. |
| Q7 | Registry / record lookup | Needs outside verification | Needs outside verification | Record lookup / needs verification | Functional alignment; IQAI Risk routes the claim to record lookup. |
| Q8 | Direct scope contradiction | Unsupported | Unsupported | UNSUPPORTED / HOLD | Strong alignment on direct contradiction. |
| Q9 | Causal attribution / full-cause claim | Split support label | Unsupported | UNSUPPORTED / HOLD | Issue-level alignment: all five humans selected cause not proven. |
| Q10 | Durability / permanence overstatement | Partially supported / overstated | Partially supported / overstated | UNSUPPORTED / HOLD | Directional alignment; IQAI Risk more severe. |
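
The study metrics mention a directional match "after label collapse" without defining the collapse. One plausible reading, offered here only as an assumption, groups the labels into three bands: supported, evidence gap (weak, partially supported, unsupported), and needs verification. The sketch below applies that hypothetical mapping to the rows of the comparison table; the `BANDS` mapping and the variable names are illustrative, not the study's definitions.

```python
# (question, human majority, LLM majority, IQAI Risk result), copied from the comparison table.
ROWS = [
    ("Q1", "Unsupported", "Unsupported", "WEAK"),
    ("Q2", "Supported", "Supported", "SUPPORTED"),
    ("Q3", "Unsupported", "Unsupported", "WEAK"),
    ("Q4", "Unsupported", "Unsupported", "WEAK"),
    ("Q5", "Unsupported", "Unsupported", "WEAK"),
    ("Q6", "Unsupported", "Unsupported", "WEAK"),
    ("Q7", "Needs outside verification", "Needs outside verification",
     "Record lookup / needs verification"),
    ("Q8", "Unsupported", "Unsupported", "UNSUPPORTED / HOLD"),
    ("Q9", "Split support label", "Unsupported", "UNSUPPORTED / HOLD"),
    ("Q10", "Partially supported / overstated", "Partially supported / overstated",
     "UNSUPPORTED / HOLD"),
]

# Assumed three-band collapse; illustrative only.
BANDS = {
    "supported": "supported",
    "weak": "gap",
    "unsupported": "gap",
    "unsupported / hold": "gap",
    "partially supported / overstated": "gap",
    "needs outside verification": "verify",
    "record lookup / needs verification": "verify",
}

def band(label):
    """Return the collapsed band, or None when there is no single majority label."""
    return BANDS.get(label.lower())

# Items where the human and LLM majority labels are identical.
human_llm_same = sum(h == l for _q, h, l, _r in ROWS)
# Items where IQAI Risk falls in the same band as the LLM majority.
iqai_llm_directional = sum(band(l) == band(r) for _q, _h, l, r in ROWS)
# Same against the human majority; a split label (Q9) counts as no match.
iqai_human_directional = sum(
    band(h) is not None and band(h) == band(r) for _q, h, _l, r in ROWS
)

print(human_llm_same, iqai_llm_directional, iqai_human_directional)  # 9 10 9
```

Under this assumed collapse the only non-match is Q9, where the human support labels split. A stricter mapping that keeps WEAK and UNSUPPORTED apart would lower the IQAI Risk counts, which is the Weak vs Unsupported boundary the pilot flags for calibration.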

Questions, human responses, LLM responses, and IQAI Risk result

Each question below lists the claim, evidence, anonymized human responses, LLM responses, and IQAI Risk's scoring record.

Question 1: Demand outlook / forward-looking support gap
Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK
Unsupported
Claim: Management expects demand to remain stable next year.
Evidence: The memo does not include customer orders, signed contracts, backlog data, or forecast support for next year.

Human majority: Unsupported (5/5 Unsupported)
LLM majority: Unsupported (5/5 valid LLM responses)
IQAI Risk: WEAK (Run ID: SP-279636571A)

Human responses, anonymized

| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Unsupported | Forward-looking without enough support | High | Yes | The memo provides no data or forecast to support stable demand next year. |
| Reviewer 2 | Unsupported | Unsupported | Medium | Yes | The evidence provides no customer demand data or forecasts to support the claim of stability. |
| Reviewer 3 | Unsupported | Overstated as written | High | Yes | The claim predicts stable demand, but the evidence explicitly states there is no supporting data. |
| Reviewer 4 | Unsupported | Forward-looking without enough support | High | Yes | No supporting data is provided for a forward-looking demand claim. |
| Reviewer 5 | Unsupported | Forward-looking without enough support | High | Yes | No supporting data is provided for a forward-looking demand claim. |

LLM responses

| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| DeepSeek | Unsupported | No basis in evidence | High | No evidence in the memo supports the claim of stable demand. |
| OpenAI | Unsupported | Lack of orders, contracts, or forecasts | High | The absence of supporting data makes the claim unsubstantiated. |
| Anthropic | Unsupported | No forecast, backlog, orders, or contracts | High | The evidence confirms the memo contains none of the data types needed to substantiate the claim. |
| Grok | Unsupported | Lack of data or documentation | High | The evidence provides no data or documentation to support stable demand next year. |
| Gemini | Unsupported | Evidence lists missing support | High | The evidence lists absent orders, contracts, backlog, and forecast support. |

IQAI Risk scoring record

Run ID: SP-279636571A
Result: WEAK
Category: Forward-looking claim without enough support.
Scoring note: IQAI Risk labeled the future demand conclusion weak because explicit guidance, forecast language, contracts, backlog, or quantified support were missing.

Question 2: Supported operating-context claim
Human: Supported · LLM: Supported · IQAI Risk: SUPPORTED
Supported
Claim: Procurement savings occurred during the quarter, and a one-time supplier rebate also occurred during the quarter.
Evidence: The memo states that procurement savings occurred during the quarter, and a one-time supplier rebate also occurred during the quarter.

Human majority: Supported (5/5 Supported)
LLM majority: Supported (4/5 Supported)
IQAI Risk: SUPPORTED (Run ID: SP-453F36A49E)

Human responses, anonymized

| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Supported | No issue | High | Yes | The memo directly confirms procurement savings and a supplier rebate. |
| Reviewer 2 | Supported | No issue | High | Yes | The memo directly states that procurement savings and a rebate occurred during the quarter. |
| Reviewer 3 | Supported | Unsupported | High | Yes | The evidence directly restates the claim with no contradiction or missing support, so it is fully supported. |
| Reviewer 4 | Supported | No issue | High | Yes | The evidence directly restates and supports the claim. |
| Reviewer 5 | Supported | No issue | High | Yes | The evidence directly restates and supports the claim. |

LLM responses

| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| DeepSeek | Supported | No issue | High | The evidence directly states both elements of the claim. |
| OpenAI | Supported | No issue | High | The claim is directly supported by the evidence. |
| Anthropic | Supported | No issue | High | The evidence directly states procurement savings and the one-time supplier rebate. |
| Grok | Supported | No issue | High | The evidence directly supports the claim. |
| Gemini | Partially supported / overstated | Source verification caution | High | Gemini distinguished between the memo stating an event and independently proving the event. |

IQAI Risk scoring record

Run ID: SP-453F36A49E
Result: SUPPORTED
Category: Directly supported internal factual claim.
Scoring note: IQAI Risk labeled the claim supported because the source directly stated both procurement savings and the supplier rebate occurred.

Question 3: EBITDA outlook / forward-looking support gap
Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK
Unsupported
Claim: Management expects EBITDA to increase next year.
Evidence: The memo does not include a budget, forecast model, signed customer contracts, cost-savings schedule, or quantified EBITDA bridge for next year.

Human majority: Unsupported (5/5 Unsupported)
LLM majority: Unsupported (5/5 valid LLM responses)
IQAI Risk: WEAK (Run ID: SP-9562EBBF2A)

Human responses, anonymized

| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Unsupported | Forward-looking without enough support | High | Yes | No budget, contracts, or quantified EBITDA bridge are provided to support the earnings increase claim. |
| Reviewer 2 | Unsupported | Unsupported | Medium | Yes | There is no forecast model or detailed EBITDA analysis in the evidence to confirm the expected increase. |
| Reviewer 3 | Unsupported | Forward-looking without enough support | Medium | Yes | No data is provided to justify the expected EBITDA increase. |
| Reviewer 4 | Unsupported | Forward-looking without enough support | High | Yes | EBITDA growth is forward-looking with no supporting forecast or bridge. |
| Reviewer 5 | Unsupported | Forward-looking without enough support | High | Yes | No forecast or bridge supports the expected EBITDA increase. |

LLM responses

| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| DeepSeek | Unsupported | Missing financial support | High | The memo provides no evidence supporting the claim that EBITDA will increase next year. |
| OpenAI | Unsupported | No quantitative evidence | High | The memo lacks financial documents and analysis to support the claim. |
| Grok | Unsupported | Lack of specific evidence | High | No data substantiates the expected EBITDA increase. |
| Anthropic | Unsupported | No budget, forecast, contracts, or bridge | High | There is no documentary evidence to support the expectation. |
| Gemini | Unsupported | Evidence describes what is missing | High | The evidence provides no positive support for the EBITDA claim. |

IQAI Risk scoring record

Run ID: SP-9562EBBF2A
Result: WEAK
Category: Forward-looking financial claim without enough support.
Scoring note: IQAI Risk labeled the future EBITDA statement as weak because quantified forward-looking support was absent.

Question 4: Sanctions control outlook / forward-looking support gap
Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK
Unsupported
Claim: Management expects sanctions compliance controls to remain adequate next year.
Evidence: The memo does not include a sanctions-screening report, audit trail, regulator confirmation, third-party compliance review, or forward control-testing plan for next year.

Human majority: Unsupported (5/5 Unsupported)
LLM majority: Unsupported (5/5 valid LLM responses)
IQAI Risk: WEAK (Run ID: SP-7A4CC94B93)

Human responses, anonymized

| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Unsupported | Forward-looking without enough support | High | Yes | No sanctions reports or reviews are included to support adequacy of controls next year. |
| Reviewer 2 | Unsupported | Unsupported | Medium | Yes | Absence of reports or plans means the claim of maintained sanctions controls is not supported. |
| Reviewer 3 | Unsupported | Forward-looking without enough support | High | Yes | There is no evidence showing future compliance control adequacy. |
| Reviewer 4 | Unsupported | Forward-looking without enough support | High | Yes | Future compliance adequacy is asserted without supporting validation or testing. |
| Reviewer 5 | Unsupported | Forward-looking without enough support | High | Yes | No evidence supports the forward-looking compliance adequacy claim. |

LLM responses

| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| DeepSeek | Unsupported | No documentation or plan | High | The evidence shows absence of compliance reports and forward planning. |
| OpenAI | Unsupported | Lack of sanctions-control evidence | High | The evidence lacks details on sanctions compliance controls. |
| Anthropic | Unsupported | No screening report, audit trail, review, or plan | High | No documentation supports continued adequacy. |
| Grok | Unsupported | Lack of evidence for future adequacy | High | No documentation or plans support the expectation. |
| Gemini | Unsupported | Evidence details what is absent | High | The evidence states what is missing, not what supports the claim. |

IQAI Risk scoring record

Run ID: SP-7A4CC94B93
Result: WEAK
Category: Forward-looking compliance/control claim without enough support.
Scoring note: IQAI Risk labeled the future adequacy claim weak because it lacked forward testing plans or guidance.

Question 5: Customer churn outlook / forward-looking support gap
Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK
Unsupported
Claim: Management expects customer churn to decline next year.
Evidence: The memo does not include renewal data, customer retention metrics, signed contract extensions, churn analysis, or a customer-success plan for next year.

Human majority: Unsupported (5/5 Unsupported)
LLM majority: Unsupported (5/5 valid LLM responses)
IQAI Risk: WEAK (Run ID: SP-8A6F1C8A22)

Human responses, anonymized

| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Unsupported | Forward-looking without enough support | High | Yes | No customer renewal or churn data are provided to support expected decline. |
| Reviewer 2 | Unsupported | Unsupported | Medium | Yes | Lacking renewal or retention data, the claim of declining churn is not substantiated. |
| Reviewer 3 | Unsupported | Forward-looking without enough support | High | Yes | The memo provides no metrics to support the churn decline claim. |
| Reviewer 4 | Unsupported | Forward-looking without enough support | High | Yes | Churn improvement is forward-looking with no retention or renewal evidence. |
| Reviewer 5 | Unsupported | Forward-looking without enough support | High | Yes | No retention or churn data supports the expectation. |

LLM responses

| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| DeepSeek | Unsupported | No churn-related data or plan | High | No supporting data exists in the evidence. |
| OpenAI | Unsupported | Lack of relevant data | High | Evidence lacks renewal, retention, contracts, churn analysis, or planning support. |
| Anthropic | Unsupported | No retention metrics or churn analysis | High | No evidence supports expected churn decline. |
| Grok | Unsupported | Lack of relevant data | High | No data or metrics support the claim. |
| Gemini | Unsupported | Evidence describes absence | High | The evidence provides no positive evidence for declining churn. |

IQAI Risk scoring record

Run ID: SP-8A6F1C8A22
Result: WEAK
Category: Forward-looking business claim without enough support.
Scoring note: IQAI Risk labeled the churn forecast weak because renewal data, retention metrics, and forward support were missing.

Question 6: Cyber incident response readiness / forward-looking support gap
Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK
Unsupported
Claim: Management expects cyber incident response readiness to improve next year.
Evidence: The memo does not include tabletop exercise results, incident-response test evidence, security audit findings, remediation milestones, or a cyber readiness plan for next year.

Human majority: Unsupported (5/5 Unsupported)
LLM majority: Unsupported (5/5 valid LLM responses)
IQAI Risk: WEAK (Run ID: SP-C7AE326A30)

Human responses, anonymized

| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Unsupported | Forward-looking without enough support | High | Yes | No evidence of tests, exercises, or plans support improvement in cyber readiness. |
| Reviewer 2 | Unsupported | Unsupported | Medium | Yes | No evidence such as test results or plans supports the claim of improved cyber readiness. |
| Reviewer 3 | Unsupported | Forward-looking without enough support | High | Yes | No evidence is provided for improved cyber incident response readiness. |
| Reviewer 4 | Unsupported | Forward-looking without enough support | High | Yes | No testing or planning evidence supports improved cyber readiness. |
| Reviewer 5 | Unsupported | Forward-looking without enough support | High | Yes | No testing or plan supports improved cyber readiness. |

LLM responses

| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| Grok | Unsupported | No evidence for expected improvement | High | The evidence lacks specific data or plans. |
| DeepSeek | Unsupported | No documentation indicating readiness improvement | High | The evidence directly shows missing support. |
| OpenAI | Unsupported | Lack of evidence for future readiness improvement | High | No plans or actions support the claim. |
| Anthropic | Unsupported | No cyber readiness plan or test evidence | High | No basis exists for expected improvement. |
| Gemini | Unsupported | Evidence describes what is absent | High | No positive evidence supports the claim. |

IQAI Risk scoring record

Run ID: SP-C7AE326A30
Result: WEAK
Category: Forward-looking cybersecurity/control claim without enough support.
Scoring note: IQAI Risk labeled the readiness improvement claim weak because no test evidence, milestones, or readiness plan were provided.

Question 7: Registry verification / outside verification
Human: Needs outside verification · LLM: Needs outside verification · IQAI Risk: RECORD LOOKUP
Needs verification
Claim: 9230-2553 Québec inc. has NEQ 1166988999.
Evidence: The memo states that 9230-2553 Québec inc. has NEQ 1166988999, but no official Québec registry record, search result, or verification report is provided.

Human majority: Needs outside verification (5/5 Needs outside verification)
LLM majority: Needs outside verification (4/5 Needs outside verification)
IQAI Risk: Record lookup (Run ID: SP-E0078EDB98)

Human responses, anonymized

| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Needs outside verification | Needs outside verification | High | Unsure | NEQ number is stated but not verified against an official registry. |
| Reviewer 2 | Needs outside verification | Cause not proven | Low | Unsure | The claim relies on a registry number without official verification or documentation. |
| Reviewer 3 | Needs outside verification | Needs outside verification | High | No | The NEQ claim requires verification from the Québec registry. |
| Reviewer 4 | Needs outside verification | Needs outside verification | High | Yes | The NEQ requires external registry verification. |
| Reviewer 5 | Needs outside verification | Needs outside verification | High | Yes | The claim requires external registry verification not provided. |

LLM responses

| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| OpenAI | Needs outside verification | No official registry record | High | The memo makes the claim but lacks an official source. |
| Grok | Needs outside verification | Lack of official verification | High | The claim is based only on a memo. |
| Anthropic | Needs outside verification | Memo repeats claim without authoritative source | High | An official Québec registry record is needed. |
| Gemini | Needs outside verification | No official external verification | High | The claim lacks an official Québec registry record. |
| DeepSeek | Partially supported / overstated | Stated but not officially verified | High | The memo asserts the NEQ but lacks verification. |

IQAI Risk scoring record

Run ID: SP-E0078EDB98
Result: RECORD LOOKUP / NEEDS VERIFICATION
Category: Needs outside verification / registry lookup.
Scoring note: IQAI Risk routed the NEQ claim to the record-lookup lane instead of treating it as normal claim support.
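
The report does not describe how routing is implemented. As a rough illustration of what sending an identifier claim to a record-lookup lane could look like, the sketch below matches a registry-style identifier and returns a lane instead of a support label. The pattern (the letters "NEQ" followed by a 10-digit number, as in this question's claim), the lane names, and the `route_claim` function are all hypothetical and are not IQAI Risk's rules.

```python
import re

# Hypothetical pattern: "NEQ" followed by a 10-digit number, as in the Q7 claim.
NEQ_PATTERN = re.compile(r"\bNEQ\s+(\d{10})\b")

def route_claim(claim: str) -> dict:
    """Pick a review lane for a claim; lane names are illustrative."""
    match = NEQ_PATTERN.search(claim)
    if match:
        # Identifier claims go to lookup rather than evidence-based support scoring.
        return {
            "lane": "record_lookup",
            "identifier": match.group(1),
            "status": "NEEDS VERIFICATION",
        }
    return {"lane": "claim_support", "identifier": None, "status": None}

print(route_claim("9230-2553 Québec inc. has NEQ 1166988999."))
print(route_claim("Management expects demand to remain stable next year."))
```

The point of the separate lane is that no amount of memo text can confirm a registry number; only the official record can, so the claim is held for lookup rather than scored.
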
Question 8: Overstated pilot coverage claim
Human: Unsupported · LLM: Unsupported · IQAI Risk: UNSUPPORTED / HOLD
Unsupported
Claim: Management states that the Q4 pricing pilot covered all customers.
Evidence: The memo states that the Q4 pricing pilot covered 12% of customers and does not provide evidence that all customers were included.

Human majority: Unsupported (4/5 Unsupported; 1/5 partially supported / overstated)
LLM majority: Unsupported (all valid models detected overstatement)
IQAI Risk: UNSUPPORTED / HOLD (Run ID: SP-B90957F085)

Human responses, anonymized

| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Unsupported | Overstated as written | High | Yes | The pilot only covered 12% of customers, so the claim that it covered all is overstated. |
| Reviewer 2 | Partially supported / overstated | Forward-looking without enough support | Medium | Yes | The claim that all customers were covered is overstated; only 12% were included. |
| Reviewer 3 | Unsupported | Overstated as written | High | Yes | Only 12% of customers were covered, so claiming all customers is overstated. |
| Reviewer 4 | Unsupported | Unsupported | High | Yes | Evidence contradicts the claim: 12% versus all customers. |
| Reviewer 5 | Unsupported | Overstated as written | High | Yes | Evidence shows only 12%, contradicting the all-customers claim. |

LLM responses

| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| Grok | Unsupported | Overstated as written | High | The claim is contradicted by evidence stating only 12% of customers were included. |
| DeepSeek | Partially supported / overstated | Overstated as written | High | The evidence shows only 12% were covered, not all customers. |
| Claude | Unsupported | Overstated as written | High | The memo directly contradicts the claim. |
| OpenAI | Unsupported | Overstated as written | High | The evidence indicates only 12% of customers were covered. |
| Gemini | No valid response | Quota / 429 error | N/A | No valid response due to quota / 429 error. |

IQAI Risk scoring record

Run ID: SP-B90957F085
Result: UNSUPPORTED / HOLD
Category: Direct scope contradiction / overstated coverage claim.
Scoring note: After semantic-gap active overlay, IQAI Risk applied DIRECT_SCOPE_CONTRADICTION and scored the claim unsupported.

Question 9: Cause not proven / full attribution claim
Human: split support label but 5/5 cause not proven · LLM: Unsupported · IQAI Risk: UNSUPPORTED / HOLD
Cause not proven
Claim: Procurement savings caused the full adjusted operating margin increase from 12% to 15%.
Evidence: Adjusted operating margin increased from 12% to 15%. Procurement savings occurred during the quarter. A one-time supplier rebate also occurred during the quarter. The memo does not quantify the impact of procurement savings or the one-time supplier rebate.

Human support label: Split / no single majority (2 needs verification; 2 partially supported; 1 unsupported)
Human issue label: Cause not proven (5/5 selected cause not proven)
IQAI Risk: UNSUPPORTED / HOLD (Run ID: SP-68B7ED3F2E)

Human responses, anonymized

| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Needs outside verification | Cause not proven | High | Yes | Procurement savings occurred, but their exact impact on margin increase is not quantified. |
| Reviewer 2 | Partially supported / overstated | Cause not proven | Medium | Yes | The margin increase coincides with procurement savings and rebates, but no direct quantification links the full increase to procurement. |
| Reviewer 3 | Needs outside verification | Cause not proven | High | Yes | Procurement savings occurred, but their impact on margin is not quantified. |
| Reviewer 4 | Unsupported | Cause not proven | High | Yes | Causation is not proven and impacts are not quantified. |
| Reviewer 5 | Partially supported / overstated | Cause not proven | High | Yes | Savings occurred but causation for the full margin increase is unproven. |

LLM responses

| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| DeepSeek | Unsupported | Cause not proven | High | Both procurement savings and a rebate occurred, but neither impact was quantified. |
| Grok | Unsupported | Cause not proven | High | The evidence does not quantify procurement savings or other factors. |
| Claude | Unsupported | Cause not proven | High | The claim attributes the full increase to procurement savings without ruling out the rebate. |
| OpenAI | Partially supported / overstated | Cause not proven | Medium | The evidence shows related facts but does not prove causation. |
| Gemini | No valid response | Quota / high-demand errors | N/A | Gemini 2.0 and 2.5 did not return valid responses. |

IQAI Risk scoring record

Run ID: SP-68B7ED3F2E
Result: UNSUPPORTED / HOLD
Category: Cause not proven / full attribution not supported.
Scoring note: After semantic-gap active overlay, IQAI Risk applied CAUSE_NOT_PROVEN and adjusted the causal claim to unsupported.
Question 10: Permanence / durability not proven
Human: Partially supported / overstated · LLM: Partially supported / overstated · IQAI Risk: UNSUPPORTED / HOLD
Partially supported / overstated
Claim: Management stated that the quarter demonstrates that the pricing program has permanently changed the earnings quality of the business.
Evidence: The disclosure supports that pricing contributed to current-quarter profitability, but it does not isolate the durability or permanence of that change. A one-time supplier rebate also contributed to Q4 profitability, and the memo does not provide a recurring-margin bridge.

Human majority: Partially supported / overstated (3/5 partially supported / overstated)
LLM majority: Partially supported / overstated (4/5 valid LLMs)
IQAI Risk: UNSUPPORTED / HOLD (Run ID: SP-67BD5425BB)

Human responses, anonymized

| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Unsupported | Overstated as written | High | Yes | Evidence shows profitability contribution but not permanence or reliability of earnings improvement. |
| Reviewer 2 | Partially supported / overstated | Forward-looking without enough support | Medium | Yes | The evidence indicates current profitability contributed by pricing but does not prove permanence. |
| Reviewer 3 | Needs outside verification | Cause not proven | High | Yes | The memo shows contribution to Q4 profitability but does not prove permanent earnings-quality change. |
| Reviewer 4 | Partially supported / overstated | Overstated as written | High | Yes | Some support exists, but permanence of earnings change is overstated. |
| Reviewer 5 | Partially supported / overstated | Overstated as written | High | Yes | Evidence shows pricing helped current results, but the claim overstates permanence without proof. |

LLM responses

| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| DeepSeek | Partially supported / overstated | Overstated as written | High | Pricing contributed to current profitability but does not prove permanence. |
| Grok | Partially supported / overstated | Forward-looking without enough support | High | The evidence lacks durability or permanence support. |
| Claude | Partially supported / overstated | Overstated as written | High | A one-time rebate and no recurring bridge undermine permanence. |
| OpenAI | Unsupported | Cause not proven | High | The evidence does not demonstrate durable or permanent change. |
| Gemini 2.5 Flash | Partially supported / overstated | Overstated as written | High | Current profitability support does not isolate permanence. |

IQAI Risk scoring record

Run ID: SP-67BD5425BB
Result: UNSUPPORTED / HOLD
Category: Permanence not proven / durability not supported.
Scoring note: IQAI Risk labeled the permanence claim unsupported through the existing inference judgment pathway; no active overlay was needed.

Interpretation

The pilot does not show that IQAI Risk “beats AI” on simple claim/evidence questions. The LLMs and humans performed well on the short, controlled items.

The value of IQAI Risk is different: it turns claim-support review into a governed, repeatable, reviewable scoring record. It also covers verification surfaces that humans and LLMs cannot consistently route at scale, including record lookups, filing identifiers, external price/date checks, math reconciliation, and receipt-level traceability.

Calibration insight: should unsupported forward-looking claims with no direct evidence remain WEAK, or should some be promoted to UNSUPPORTED?

Limitations and next steps

  • This is a 10-item calibration pilot, not a publication-grade benchmark.
  • The human sample was small: five reviewers.
  • The final reference labels are provisional and should be adjudicated by expert reviewers in a larger study.
  • The next study should use 30–100 items, balanced across supported, weak, unsupported, external-verification, causal, numerical, and record-check categories.
  • Future versions should measure humans, LLMs, and IQAI Risk separately against adjudicated reference labels.