IQAI — IQAI Risk Pilot Study

Executive summary

This pilot tested the same 10 claim/evidence items across three review surfaces: human reviewers, LLMs, and IQAI Risk. The purpose was not to declare one reviewer type automatically correct, but to measure agreement, disagreement, and calibration targets.

Main finding: humans and LLMs strongly agreed on short, clean evidence gaps. IQAI Risk aligned directionally, while applying a different severity policy: many forward-looking claims with no direct evidence were kept in WEAK / NEEDS REVIEW, while humans and LLMs often scored them as UNSUPPORTED. The stronger product conclusion is that judgment alone is not a control record: IQAI Risk adds repeatability, verification routing, calibration policy, and receipts.

Study metrics

Metric	Value	Note
Human reviewers	5	MTurk participants, anonymized
Questions	10	Claim/evidence pairs
Human support κ	0.664	Substantial pilot agreement
Human issue κ	0.402	Moderate; issue labels are more nuanced
Average time	22:55	Average human completion time
Human vs LLM	Near	Near-complete majority alignment
Directional match	High	After label collapse
Primary calibration	Boundary	Weak vs Unsupported
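
The report does not say which κ statistic produced the human support and issue figures. With five raters per item, Fleiss' κ is one common choice, so the sketch below shows that calculation. The ratings are made up for illustration (they are not the pilot data and do not reproduce 0.664 or 0.402), and the function name `fleiss_kappa` and the toy labels are assumptions.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of category labels.

    Every item must carry the same number of ratings (at least 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of ordered rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    observed = sum(per_item_agreement) / n_items
    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        # Only one category was ever used; treated here as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)

# Toy ratings: 3 items, 5 raters each (illustrative only).
toy = [
    ["Unsupported"] * 5,
    ["Supported"] * 5,
    ["Unsupported", "Unsupported", "Partially", "Partially", "Needs verification"],
]
print(round(fleiss_kappa(toy), 3))
```

A κ near 1 means raters agree far more often than chance would predict from the overall label frequencies; a value near 0 means agreement is about what chance alone would give.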

What the pilot showed

The pilot is small, but it clarifies the product thesis. Humans and LLMs can identify obvious evidence gaps in short controlled examples. The enterprise problem is what happens after that judgment: whether the review is repeatable, routed, calibrated, and preserved as a record.

Finding	Theme	Summary
Finding 1	Agree	Humans and LLMs aligned strongly on obvious evidence gaps.
Finding 2	Vary	Individual LLMs still varied on nuance, verification posture, and severity.
Finding 3	Route	Registry, filing, date, price, and math checks need explicit verification lanes.
Finding 4	Record	IQAI Risk turns review into a governed scoring record with receipts.

Product conclusion: the pilot does not show that human reviewers or LLMs are useless. It shows that simple judgment is not enough for enterprise reliance. IQAI Risk adds the control layer: rule-governed scoring, verification routing, calibration boundaries, and review receipts.

Top-level comparison

The table summarizes the human majority, LLM majority, and IQAI Risk result for each item.

Q	Claim type	Human majority	LLM majority	IQAI Risk result	Match / note
Q1	Forward-looking demand claim	Unsupported	Unsupported	WEAK	Directional alignment; IQAI Risk less severe.
Q2	Directly supported factual claim	Supported	Supported	SUPPORTED	Exact alignment.
Q3	Forward-looking financial claim	Unsupported	Unsupported	WEAK	Directional alignment; IQAI Risk less severe.
Q4	Forward-looking compliance/control claim	Unsupported	Unsupported	WEAK	Directional alignment; IQAI Risk less severe.
Q5	Forward-looking business claim	Unsupported	Unsupported	WEAK	Directional alignment; IQAI Risk less severe.
Q6	Forward-looking cybersecurity/control claim	Unsupported	Unsupported	WEAK	Directional alignment; IQAI Risk less severe.
Q7	Registry / record lookup	Needs outside verification	Needs outside verification	Record lookup / needs verification	Functional alignment; IQAI Risk routes the claim to record lookup.
Q8	Direct scope contradiction	Unsupported	Unsupported	UNSUPPORTED / HOLD	Strong alignment on direct contradiction.
Q9	Causal attribution / full-cause claim	Split support label	Unsupported	UNSUPPORTED / HOLD	Issue-level alignment: all five humans selected cause not proven.
Q10	Durability / permanence overstatement	Partially supported / overstated	Partially supported / overstated	UNSUPPORTED / HOLD	Directional alignment; IQAI Risk more severe.
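
The "directional alignment" notes in the table rest on collapsing fine-grained labels into broader directions before comparing. The report does not publish its collapse rule, so the mapping below is a hypothetical one (`COLLAPSE`, `direction`, and `directional_match` are illustrative names). It compares the human majority with the IQAI Risk result, using the rows of the table above.

```python
# Hypothetical collapse of the report's labels into three directions.
COLLAPSE = {
    "supported": "supported",
    "weak": "gap",
    "unsupported": "gap",
    "unsupported / hold": "gap",
    "partially supported / overstated": "gap",
    "needs outside verification": "verify",
    "record lookup / needs verification": "verify",
}

# (item, human majority, IQAI Risk result) copied from the comparison table.
ROWS = [
    ("Q1", "Unsupported", "WEAK"),
    ("Q2", "Supported", "SUPPORTED"),
    ("Q3", "Unsupported", "WEAK"),
    ("Q4", "Unsupported", "WEAK"),
    ("Q5", "Unsupported", "WEAK"),
    ("Q6", "Unsupported", "WEAK"),
    ("Q7", "Needs outside verification", "Record lookup / needs verification"),
    ("Q8", "Unsupported", "UNSUPPORTED / HOLD"),
    ("Q9", "Split support label", "UNSUPPORTED / HOLD"),
    ("Q10", "Partially supported / overstated", "UNSUPPORTED / HOLD"),
]

def direction(label):
    """Return the collapsed direction, or None when the label has no mapping."""
    return COLLAPSE.get(label.strip().lower())

def directional_match(rows):
    """Count items whose two labels collapse to the same direction.

    Items where either side has no single label (such as a split vote)
    are left out of the comparison.
    """
    scored = [(direction(human), direction(risk)) for _, human, risk in rows]
    comparable = [(h, r) for h, r in scored if h is not None and r is not None]
    matches = sum(1 for h, r in comparable if h == r)
    return matches, len(comparable)

matches, comparable = directional_match(ROWS)
print(f"{matches} of {comparable} comparable items match directionally")
```

Under this particular mapping, Q9 drops out because the human support label was split, and the remaining nine items all match. A different collapse rule (for example, keeping WEAK separate from UNSUPPORTED) would give a different count, which is exactly the calibration boundary the pilot highlights.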

Questions, human responses, LLM responses, and IQAI Risk result

Each question below lists the claim, the evidence, the anonymized human responses, the LLM responses, and IQAI Risk's scoring record.

Question 1: Demand outlook / forward-looking support gap

Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK

Unsupported

Claim: Management expects demand to remain stable next year.

Evidence: The memo does not include customer orders, signed contracts, backlog data, or forecast support for next year.

Review surface	Result	Detail
Human majority	Unsupported	5/5 Unsupported
LLM majority	Unsupported	5/5 valid LLM responses
IQAI Risk	WEAK	Run ID: SP-279636571A

Human responses, anonymized

Reviewer	Support	Main issue	Confidence	Impact	Explanation
Reviewer 1	Unsupported	Forward-looking without enough support	High	Yes	The memo provides no data or forecast to support stable demand next year.
Reviewer 2	Unsupported	Unsupported	Medium	Yes	The evidence provides no customer demand data or forecasts to support the claim of stability.
Reviewer 3	Unsupported	Overstated as written	High	Yes	The claim predicts stable demand, but the evidence explicitly states there is no supporting data.
Reviewer 4	Unsupported	Forward-looking without enough support	High	Yes	No supporting data is provided for a forward-looking demand claim.
Reviewer 5	Unsupported	Forward-looking without enough support	High	Yes	No supporting data is provided for a forward-looking demand claim.

LLM responses

Model	Support	Main issue	Confidence	Reason
DeepSeek	Unsupported	No basis in evidence	High	No evidence in the memo supports the claim of stable demand.
OpenAI	Unsupported	Lack of orders, contracts, or forecasts	High	The absence of supporting data makes the claim unsubstantiated.
Anthropic	Unsupported	No forecast, backlog, orders, or contracts	High	The evidence confirms the memo contains none of the data types needed to substantiate the claim.
Grok	Unsupported	Lack of data or documentation	High	The evidence provides no data or documentation to support stable demand next year.
Gemini	Unsupported	Evidence lists missing support	High	The evidence lists absent orders, contracts, backlog, and forecast support.

IQAI Risk scoring record

Run ID	SP-279636571A
Result	WEAK
Category	Forward-looking claim without enough support.
Scoring note	IQAI Risk labeled the future demand conclusion weak because explicit guidance, forecast language, contracts, backlog, or quantified support were missing.
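
The report says IQAI Risk preserves each review as a scoring record with a receipt, but it does not describe how receipts are produced. As a minimal sketch, assuming a receipt is simply a content digest of the record's four fields, the example below builds the Q1 record and derives a SHA-256 receipt. The `ScoringRecord` class and its `receipt` method are hypothetical and are not IQAI Risk's actual format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class ScoringRecord:
    """Hypothetical container for the four fields shown in a scoring record."""
    run_id: str
    result: str
    category: str
    scoring_note: str

    def receipt(self) -> str:
        """SHA-256 digest of the record's canonical JSON form (an assumption)."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

record = ScoringRecord(
    run_id="SP-279636571A",
    result="WEAK",
    category="Forward-looking claim without enough support.",
    scoring_note=(
        "IQAI Risk labeled the future demand conclusion weak because explicit "
        "guidance, forecast language, contracts, backlog, or quantified support "
        "were missing."
    ),
)
print(record.run_id, record.receipt()[:16])
```

Because the digest covers every field, any later change to the result, category, or note yields a different receipt, which is one simple way to make a stored judgment tamper-evident.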

Question 2: Supported operating-context claim

Human: Supported · LLM: Supported · IQAI Risk: SUPPORTED

Supported

Claim: Procurement savings occurred during the quarter, and a one-time supplier rebate also occurred during the quarter.

Evidence: The memo states that procurement savings occurred during the quarter, and a one-time supplier rebate also occurred during the quarter.

Review surface	Result	Detail
Human majority	Supported	5/5 Supported
LLM majority	Supported	4/5 Supported
IQAI Risk	SUPPORTED	Run ID: SP-453F36A49E

Human responses, anonymized

Reviewer	Support	Main issue	Confidence	Impact	Explanation
Reviewer 1	Supported	No issue	High	Yes	The memo directly confirms procurement savings and a supplier rebate.
Reviewer 2	Supported	No issue	High	Yes	The memo directly states that procurement savings and a rebate occurred during the quarter.
Reviewer 3	Supported	Unsupported	High	Yes	The evidence directly restates the claim with no contradiction or missing support, so it is fully supported.
Reviewer 4	Supported	No issue	High	Yes	The evidence directly restates and supports the claim.
Reviewer 5	Supported	No issue	High	Yes	The evidence directly restates and supports the claim.

LLM responses

Model	Support	Main issue	Confidence	Reason
DeepSeek	Supported	No issue	High	The evidence directly states both elements of the claim.
OpenAI	Supported	No issue	High	The claim is directly supported by the evidence.
Anthropic	Supported	No issue	High	The evidence directly states procurement savings and the one-time supplier rebate.
Grok	Supported	No issue	High	The evidence directly supports the claim.
Gemini	Partially supported / overstated	Source verification caution	High	Gemini distinguished between the memo stating an event and independently proving the event.

IQAI Risk scoring record

Run ID	SP-453F36A49E
Result	SUPPORTED
Category	Directly supported internal factual claim.
Scoring note	IQAI Risk labeled the claim supported because the source directly stated both procurement savings and the supplier rebate occurred.

Question 3: EBITDA outlook / forward-looking support gap

Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK

Unsupported

Claim: Management expects EBITDA to increase next year.

Evidence: The memo does not include a budget, forecast model, signed customer contracts, cost-savings schedule, or quantified EBITDA bridge for next year.

Review surface	Result	Detail
Human majority	Unsupported	5/5 Unsupported
LLM majority	Unsupported	5/5 valid LLM responses
IQAI Risk	WEAK	Run ID: SP-9562EBBF2A

Human responses, anonymized

Reviewer	Support	Main issue	Confidence	Impact	Explanation
Reviewer 1	Unsupported	Forward-looking without enough support	High	Yes	No budget, contracts, or quantified EBITDA bridge are provided to support the earnings increase claim.
Reviewer 2	Unsupported	Unsupported	Medium	Yes	There is no forecast model or detailed EBITDA analysis in the evidence to confirm the expected increase.
Reviewer 3	Unsupported	Forward-looking without enough support	Medium	Yes	No data is provided to justify the expected EBITDA increase.
Reviewer 4	Unsupported	Forward-looking without enough support	High	Yes	EBITDA growth is forward-looking with no supporting forecast or bridge.
Reviewer 5	Unsupported	Forward-looking without enough support	High	Yes	No forecast or bridge supports the expected EBITDA increase.

LLM responses

Model	Support	Main issue	Confidence	Reason
DeepSeek	Unsupported	Missing financial support	High	The memo provides no evidence supporting the claim that EBITDA will increase next year.
OpenAI	Unsupported	No quantitative evidence	High	The memo lacks financial documents and analysis to support the claim.
Grok	Unsupported	Lack of specific evidence	High	No data substantiates the expected EBITDA increase.
Anthropic	Unsupported	No budget, forecast, contracts, or bridge	High	There is no documentary evidence to support the expectation.
Gemini	Unsupported	Evidence describes what is missing	High	The evidence provides no positive support for the EBITDA claim.

IQAI Risk scoring record

Run ID	SP-9562EBBF2A
Result	WEAK
Category	Forward-looking financial claim without enough support.
Scoring note	IQAI Risk labeled the future EBITDA statement as weak because quantified forward-looking support was absent.

Question 4: Sanctions control outlook / forward-looking support gap

Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK

Unsupported

Claim: Management expects sanctions compliance controls to remain adequate next year.

Evidence: The memo does not include a sanctions-screening report, audit trail, regulator confirmation, third-party compliance review, or forward control-testing plan for next year.

Review surface	Result	Detail
Human majority	Unsupported	5/5 Unsupported
LLM majority	Unsupported	5/5 valid LLM responses
IQAI Risk	WEAK	Run ID: SP-7A4CC94B93

Human responses, anonymized

Reviewer	Support	Main issue	Confidence	Impact	Explanation
Reviewer 1	Unsupported	Forward-looking without enough support	High	Yes	No sanctions reports or reviews are included to support adequacy of controls next year.
Reviewer 2	Unsupported	Unsupported	Medium	Yes	Absence of reports or plans means the claim of maintained sanctions controls is not supported.
Reviewer 3	Unsupported	Forward-looking without enough support	High	Yes	There is no evidence showing future compliance control adequacy.
Reviewer 4	Unsupported	Forward-looking without enough support	High	Yes	Future compliance adequacy is asserted without supporting validation or testing.
Reviewer 5	Unsupported	Forward-looking without enough support	High	Yes	No evidence supports the forward-looking compliance adequacy claim.

LLM responses

Model	Support	Main issue	Confidence	Reason
DeepSeek	Unsupported	No documentation or plan	High	The evidence shows absence of compliance reports and forward planning.
OpenAI	Unsupported	Lack of sanctions-control evidence	High	The evidence lacks details on sanctions compliance controls.
Anthropic	Unsupported	No screening report, audit trail, review, or plan	High	No documentation supports continued adequacy.
Grok	Unsupported	Lack of evidence for future adequacy	High	No documentation or plans support the expectation.
Gemini	Unsupported	Evidence details what is absent	High	The evidence states what is missing, not what supports the claim.

IQAI Risk scoring record

Run ID	SP-7A4CC94B93
Result	WEAK
Category	Forward-looking compliance/control claim without enough support.
Scoring note	IQAI Risk labeled the future adequacy claim weak because it lacked forward testing plans or guidance.

Question 5: Customer churn outlook / forward-looking support gap

Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK

Unsupported

Claim: Management expects customer churn to decline next year.

Evidence: The memo does not include renewal data, customer retention metrics, signed contract extensions, churn analysis, or a customer-success plan for next year.

Review surface	Result	Detail
Human majority	Unsupported	5/5 Unsupported
LLM majority	Unsupported	5/5 valid LLM responses
IQAI Risk	WEAK	Run ID: SP-8A6F1C8A22

Human responses, anonymized

Reviewer	Support	Main issue	Confidence	Impact	Explanation
Reviewer 1	Unsupported	Forward-looking without enough support	High	Yes	No customer renewal or churn data are provided to support expected decline.
Reviewer 2	Unsupported	Unsupported	Medium	Yes	Lacking renewal or retention data, the claim of declining churn is not substantiated.
Reviewer 3	Unsupported	Forward-looking without enough support	High	Yes	The memo provides no metrics to support the churn decline claim.
Reviewer 4	Unsupported	Forward-looking without enough support	High	Yes	Churn improvement is forward-looking with no retention or renewal evidence.
Reviewer 5	Unsupported	Forward-looking without enough support	High	Yes	No retention or churn data supports the expectation.

LLM responses

Model	Support	Main issue	Confidence	Reason
DeepSeek	Unsupported	No churn-related data or plan	High	No supporting data exists in the evidence.
OpenAI	Unsupported	Lack of relevant data	High	Evidence lacks renewal, retention, contracts, churn analysis, or planning support.
Anthropic	Unsupported	No retention metrics or churn analysis	High	No evidence supports expected churn decline.
Grok	Unsupported	Lack of relevant data	High	No data or metrics support the claim.
Gemini	Unsupported	Evidence describes absence	High	The evidence provides no positive evidence for declining churn.

IQAI Risk scoring record

Run ID	SP-8A6F1C8A22
Result	WEAK
Category	Forward-looking business claim without enough support.
Scoring note	IQAI Risk labeled the churn forecast weak because renewal data, retention metrics, and forward support were missing.

Question 6: Cyber incident response readiness / forward-looking support gap

Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK

Unsupported

Claim: Management expects cyber incident response readiness to improve next year.

Evidence: The memo does not include tabletop exercise results, incident-response test evidence, security audit findings, remediation milestones, or a cyber readiness plan for next year.

Review surface	Result	Detail
Human majority	Unsupported	5/5 Unsupported
LLM majority	Unsupported	5/5 valid LLM responses
IQAI Risk	WEAK	Run ID: SP-C7AE326A30

Human responses, anonymized

Reviewer	Support	Main issue	Confidence	Impact	Explanation
Reviewer 1	Unsupported	Forward-looking without enough support	High	Yes	No evidence of tests, exercises, or plans support improvement in cyber readiness.
Reviewer 2	Unsupported	Unsupported	Medium	Yes	No evidence such as test results or plans supports the claim of improved cyber readiness.
Reviewer 3	Unsupported	Forward-looking without enough support	High	Yes	No evidence is provided for improved cyber incident response readiness.
Reviewer 4	Unsupported	Forward-looking without enough support	High	Yes	No testing or planning evidence supports improved cyber readiness.
Reviewer 5	Unsupported	Forward-looking without enough support	High	Yes	No testing or plan supports improved cyber readiness.

LLM responses

Model	Support	Main issue	Confidence	Reason
Grok	Unsupported	No evidence for expected improvement	High	The evidence lacks specific data or plans.
DeepSeek	Unsupported	No documentation indicating readiness improvement	High	The evidence directly shows missing support.
OpenAI	Unsupported	Lack of evidence for future readiness improvement	High	No plans or actions support the claim.
Anthropic	Unsupported	No cyber readiness plan or test evidence	High	No basis exists for expected improvement.
Gemini	Unsupported	Evidence describes what is absent	High	No positive evidence supports the claim.

IQAI Risk scoring record

Run ID	SP-C7AE326A30
Result	WEAK
Category	Forward-looking cybersecurity/control claim without enough support.
Scoring note	IQAI Risk labeled the readiness improvement claim weak because no test evidence, milestones, or readiness plan were provided.

Question 7: Registry verification / outside verification

Human: Needs outside verification · LLM: Needs outside verification · IQAI Risk: RECORD LOOKUP

Needs verification

Claim: 9230-2553 Québec inc. has NEQ 1166988999.

Evidence: The memo states that 9230-2553 Québec inc. has NEQ 1166988999, but no official Québec registry record, search result, or verification report is provided.

Review surface	Result	Detail
Human majority	Needs outside verification	5/5 Needs outside verification
LLM majority	Needs outside verification	4/5 Needs outside verification
IQAI Risk	Record lookup	Run ID: SP-E0078EDB98

Human responses, anonymized

Reviewer	Support	Main issue	Confidence	Impact	Explanation
Reviewer 1	Needs outside verification	Needs outside verification	High	Unsure	NEQ number is stated but not verified against an official registry.
Reviewer 2	Needs outside verification	Cause not proven	Low	Unsure	The claim relies on a registry number without official verification or documentation.
Reviewer 3	Needs outside verification	Needs outside verification	High	No	The NEQ claim requires verification from the Québec registry.
Reviewer 4	Needs outside verification	Needs outside verification	High	Yes	The NEQ requires external registry verification.
Reviewer 5	Needs outside verification	Needs outside verification	High	Yes	The claim requires external registry verification not provided.

LLM responses

Model	Support	Main issue	Confidence	Reason
OpenAI	Needs outside verification	No official registry record	High	The memo makes the claim but lacks an official source.
Grok	Needs outside verification	Lack of official verification	High	The claim is based only on a memo.
Anthropic	Needs outside verification	Memo repeats claim without authoritative source	High	An official Québec registry record is needed.
Gemini	Needs outside verification	No official external verification	High	The claim lacks an official Québec registry record.
DeepSeek	Partially supported / overstated	Stated but not officially verified	High	The memo asserts the NEQ but lacks verification.

IQAI Risk scoring record

Run ID	SP-E0078EDB98
Result	RECORD LOOKUP / NEEDS VERIFICATION
Category	Needs outside verification / registry lookup.
Scoring note	IQAI Risk routed the NEQ claim to the record-lookup lane instead of treating it as normal claim support.
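
Routing a claim to a verification lane before scoring it can be pictured as a small rule table. The report does not document IQAI Risk's routing rules, so the sketch below is purely illustrative: the lane names, the `route` function, and the pattern are assumptions. It relies only on the fact that a Québec enterprise number (NEQ) is a 10-digit identifier.

```python
import re

# Hypothetical lane rules; the pattern is illustrative, not IQAI Risk's.
# Further lanes (filing, date, price, math) would add rules in the same form.
LANE_RULES = [
    ("record_lookup", re.compile(r"\bNEQ\s+\d{10}\b")),
]

def route(claim: str) -> str:
    """Return the first lane whose pattern matches, else the default lane."""
    for lane, pattern in LANE_RULES:
        if pattern.search(claim):
            return lane
    return "claim_support"

print(route("9230-2553 Québec inc. has NEQ 1166988999."))
print(route("Management expects demand to remain stable next year."))
```

The design point is that a registry identifier is not judged against the memo text at all: it is sent to a lane where an official record can confirm or refute it.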

Question 8: Overstated pilot coverage claim

Human: Unsupported · LLM: Unsupported · IQAI Risk: UNSUPPORTED / HOLD

Unsupported

Claim: Management states that the Q4 pricing pilot covered all customers.

Evidence: The memo states that the Q4 pricing pilot covered 12% of customers and does not provide evidence that all customers were included.

Review surface	Result	Detail
Human majority	Unsupported	4/5 Unsupported; 1/5 partially supported / overstated
LLM majority	Unsupported	All valid models detected overstatement
IQAI Risk	UNSUPPORTED / HOLD	Run ID: SP-B90957F085

Human responses, anonymized

Reviewer	Support	Main issue	Confidence	Impact	Explanation
Reviewer 1	Unsupported	Overstated as written	High	Yes	The pilot only covered 12% of customers, so the claim that it covered all is overstated.
Reviewer 2	Partially supported / overstated	Forward-looking without enough support	Medium	Yes	The claim that all customers were covered is overstated; only 12% were included.
Reviewer 3	Unsupported	Overstated as written	High	Yes	Only 12% of customers were covered, so claiming all customers is overstated.
Reviewer 4	Unsupported	Unsupported	High	Yes	Evidence contradicts the claim: 12% versus all customers.
Reviewer 5	Unsupported	Overstated as written	High	Yes	Evidence shows only 12%, contradicting the all-customers claim.

LLM responses

Model	Support	Main issue	Confidence	Reason
Grok	Unsupported	Overstated as written	High	The claim is contradicted by evidence stating only 12% of customers were included.
DeepSeek	Partially supported / overstated	Overstated as written	High	The evidence shows only 12% were covered, not all customers.
Claude	Unsupported	Overstated as written	High	The memo directly contradicts the claim.
OpenAI	Unsupported	Overstated as written	High	The evidence indicates only 12% of customers were covered.
Gemini	No valid response	Quota / 429 error	N/A	No valid response due to quota / 429 error.

IQAI Risk scoring record

Run ID	SP-B90957F085
Result	UNSUPPORTED / HOLD
Category	Direct scope contradiction / overstated coverage claim.
Scoring note	After semantic-gap active overlay, IQAI Risk applied DIRECT_SCOPE_CONTRADICTION and scored the claim unsupported.
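
The scoring note names a DIRECT_SCOPE_CONTRADICTION rule without describing how it is detected. A toy version of such a check, assuming it compares a universal quantifier in the claim with a stated percentage in the evidence, might look like the sketch below. The function `scope_contradiction` and its regular expressions are hypothetical and far simpler than a production rule would need to be.

```python
import re

def scope_contradiction(claim: str, evidence: str) -> bool:
    """True when the claim says 'all <noun>' and the evidence states a
    share below 100% of the same noun."""
    universal = re.search(r"\ball\s+(\w+)", claim, flags=re.IGNORECASE)
    if not universal:
        return False
    noun = re.escape(universal.group(1))
    share = re.search(
        rf"(\d+(?:\.\d+)?)%\s+of\s+{noun}", evidence, flags=re.IGNORECASE
    )
    return bool(share) and float(share.group(1)) < 100.0

claim = "Management states that the Q4 pricing pilot covered all customers."
evidence = (
    "The memo states that the Q4 pricing pilot covered 12% of customers and "
    "does not provide evidence that all customers were included."
)
print(scope_contradiction(claim, evidence))
```

For the Q8 pair the check fires because "all customers" in the claim meets "12% of customers" in the evidence; a claim with no universal quantifier is left for the ordinary support scoring.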

Question 9: Cause not proven / full attribution claim

Human: split support label but 5/5 cause not proven · LLM: Unsupported · IQAI Risk: UNSUPPORTED / HOLD

Cause not proven

Claim: Procurement savings caused the full adjusted operating margin increase from 12% to 15%.

Evidence: Adjusted operating margin increased from 12% to 15%. Procurement savings occurred during the quarter. A one-time supplier rebate also occurred during the quarter. The memo does not quantify the impact of procurement savings or the one-time supplier rebate.

Review surface	Result	Detail
Human support label	Split / no single majority	2 needs verification; 2 partially supported; 1 unsupported
Human issue label	Cause not proven	5/5 selected cause not proven
IQAI Risk	UNSUPPORTED / HOLD	Run ID: SP-68B7ED3F2E

Human responses, anonymized

Reviewer	Support	Main issue	Confidence	Impact	Explanation
Reviewer 1	Needs outside verification	Cause not proven	High	Yes	Procurement savings occurred, but their exact impact on margin increase is not quantified.
Reviewer 2	Partially supported / overstated	Cause not proven	Medium	Yes	The margin increase coincides with procurement savings and rebates, but no direct quantification links the full increase to procurement.
Reviewer 3	Needs outside verification	Cause not proven	High	Yes	Procurement savings occurred, but their impact on margin is not quantified.
Reviewer 4	Unsupported	Cause not proven	High	Yes	Causation is not proven and impacts are not quantified.
Reviewer 5	Partially supported / overstated	Cause not proven	High	Yes	Savings occurred but causation for the full margin increase is unproven.
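
Question 9 is the one item where the human support labels produce no majority while the issue labels are unanimous. The report does not define its majority threshold; the sketch below assumes a strict majority (more than half of the reviewers) and applies it to the five responses in the table above. The function name `majority_label` is illustrative.

```python
from collections import Counter

def majority_label(labels):
    """Return the label chosen by more than half of the reviewers, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

# Support and issue labels for Q9, in reviewer order.
q9_support = [
    "Needs outside verification",
    "Partially supported / overstated",
    "Needs outside verification",
    "Unsupported",
    "Partially supported / overstated",
]
q9_issue = ["Cause not proven"] * 5

print(majority_label(q9_support))
print(majority_label(q9_issue))
```

With a 2/2/1 split no support label clears the threshold, so the function returns `None`, while the issue label is returned because all five reviewers chose it.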

LLM responses

Model	Support	Main issue	Confidence	Reason
DeepSeek	Unsupported	Cause not proven	High	Both procurement savings and a rebate occurred, but neither impact was quantified.
Grok	Unsupported	Cause not proven	High	The evidence does not quantify procurement savings or other factors.
Claude	Unsupported	Cause not proven	High	The claim attributes the full increase to procurement savings without ruling out the rebate.
OpenAI	Partially supported / overstated	Cause not proven	Medium	The evidence shows related facts but does not prove causation.
Gemini	No valid response	Quota / high-demand errors	N/A	Gemini 2.0 and 2.5 did not return valid responses.

IQAI Risk scoring record

Run ID	SP-68B7ED3F2E
Result	UNSUPPORTED / HOLD
Category	Cause not proven / full attribution not supported.
Scoring note	After semantic-gap active overlay, IQAI Risk applied CAUSE_NOT_PROVEN and adjusted the causal claim to unsupported.

Question 10: Permanence / durability not proven

Human: Partially supported / overstated · LLM: Partially supported / overstated · IQAI Risk: UNSUPPORTED / HOLD

Partially supported / overstated

Claim: Management stated that the quarter demonstrates that the pricing program has permanently changed the earnings quality of the business.

Evidence: The disclosure supports that pricing contributed to current-quarter profitability, but it does not isolate the durability or permanence of that change. A one-time supplier rebate also contributed to Q4 profitability, and the memo does not provide a recurring-margin bridge.

Review surface	Result	Detail
Human majority	Partially supported / overstated	3/5 partially supported / overstated
LLM majority	Partially supported / overstated	4/5 valid LLMs
IQAI Risk	UNSUPPORTED / HOLD	Run ID: SP-67BD5425BB

Human responses, anonymized

Reviewer	Support	Main issue	Confidence	Impact	Explanation
Reviewer 1	Unsupported	Overstated as written	High	Yes	Evidence shows profitability contribution but not permanence or reliability of earnings improvement.
Reviewer 2	Partially supported / overstated	Forward-looking without enough support	Medium	Yes	The evidence indicates current profitability contributed by pricing but does not prove permanence.
Reviewer 3	Needs outside verification	Cause not proven	High	Yes	The memo shows contribution to Q4 profitability but does not prove permanent earnings-quality change.
Reviewer 4	Partially supported / overstated	Overstated as written	High	Yes	Some support exists, but permanence of earnings change is overstated.
Reviewer 5	Partially supported / overstated	Overstated as written	High	Yes	Evidence shows pricing helped current results, but the claim overstates permanence without proof.

LLM responses

Model	Support	Main issue	Confidence	Reason
DeepSeek	Partially supported / overstated	Overstated as written	High	Pricing contributed to current profitability but does not prove permanence.
Grok	Partially supported / overstated	Forward-looking without enough support	High	The evidence lacks durability or permanence support.
Claude	Partially supported / overstated	Overstated as written	High	A one-time rebate and no recurring bridge undermine permanence.
OpenAI	Unsupported	Cause not proven	High	The evidence does not demonstrate durable or permanent change.
Gemini 2.5 Flash	Partially supported / overstated	Overstated as written	High	Current profitability support does not isolate permanence.

IQAI Risk scoring record

Run ID	SP-67BD5425BB
Result	UNSUPPORTED / HOLD
Category	Permanence not proven / durability not supported.
Scoring note	IQAI Risk labeled the permanence claim unsupported through the existing inference judgment pathway; no active overlay was needed.

Interpretation

The pilot does not show that IQAI Risk “beats AI” on simple claim/evidence questions. The LLMs and humans performed well on the short, controlled items.

The value of IQAI Risk is different: it turns claim-support review into a governed, repeatable, reviewable scoring record. It also covers verification surfaces that humans and LLMs cannot consistently route at scale, including record lookups, filing identifiers, external price/date checks, math reconciliation, and receipt-level traceability.

Calibration question: should forward-looking claims with no direct evidence remain WEAK, or should some be escalated to UNSUPPORTED?
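
The effect of that boundary choice on exact-label agreement can be shown with the five forward-looking items from the comparison table (Q1 and Q3 to Q6), where the human majority was Unsupported. The sketch below is a toy illustration, not IQAI Risk's calibration procedure; `exact_agreement` and the dictionary name are assumptions.

```python
# Human majority labels for the forward-looking items with no direct
# evidence, taken from the top-level comparison table.
FORWARD_LOOKING_NO_EVIDENCE = {
    "Q1": "UNSUPPORTED",
    "Q3": "UNSUPPORTED",
    "Q4": "UNSUPPORTED",
    "Q5": "UNSUPPORTED",
    "Q6": "UNSUPPORTED",
}

def exact_agreement(policy_label, human_majority=FORWARD_LOOKING_NO_EVIDENCE):
    """Count items where the policy's label equals the human majority label."""
    return sum(1 for label in human_majority.values() if label == policy_label)

for policy_label in ("WEAK", "UNSUPPORTED"):
    agreed = exact_agreement(policy_label)
    print(f"{policy_label}: {agreed} of {len(FORWARD_LOOKING_NO_EVIDENCE)} exact matches")
```

On these five items the choice is all or nothing: keeping WEAK gives no exact matches with the human majority, and escalating to UNSUPPORTED gives five. Whether exact agreement is the right target is a policy decision the pilot leaves open.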

Limitations and next steps

This is a 10-item calibration pilot, not a publication-grade benchmark.
The human sample was small: five reviewers.
The final reference labels are provisional and should be adjudicated by expert reviewers in a larger study.
The next study should use 30–100 items, balanced across supported, weak, unsupported, external-verification, causal, numerical, and record-check categories.
Future versions should measure humans, LLMs, and IQAI Risk separately against adjudicated reference labels.
