Executive summary
This pilot tested the same 10 claim/evidence items across three review surfaces: human reviewers, LLMs, and IQAI Risk. The purpose was not to declare one reviewer type automatically correct, but to measure where the three agree and disagree and to identify calibration targets.
What the pilot showed
The pilot is small, but it clarifies the product thesis. Humans and LLMs can identify obvious evidence gaps in short controlled examples. The enterprise problem is what happens after that judgment: whether the review is repeatable, routed, calibrated, and preserved as a record.
Top-level comparison
The table summarizes the human majority, LLM majority, and IQAI Risk result for each item.
| Q | Claim type | Human majority | LLM majority | IQAI Risk result | Match / note |
|---|---|---|---|---|---|
| Q1 | Forward-looking demand claim | Unsupported | Unsupported | WEAK | Directional alignment; IQAI Risk less severe. |
| Q2 | Directly supported factual claim | Supported | Supported | SUPPORTED | Exact alignment. |
| Q3 | Forward-looking financial claim | Unsupported | Unsupported | WEAK | Directional alignment; IQAI Risk less severe. |
| Q4 | Forward-looking compliance/control claim | Unsupported | Unsupported | WEAK | Directional alignment; IQAI Risk less severe. |
| Q5 | Forward-looking business claim | Unsupported | Unsupported | WEAK | Directional alignment; IQAI Risk less severe. |
| Q6 | Forward-looking cybersecurity/control claim | Unsupported | Unsupported | WEAK | Directional alignment; IQAI Risk less severe. |
| Q7 | Registry / record lookup | Needs outside verification | Needs outside verification | RECORD LOOKUP / NEEDS VERIFICATION | Functional alignment; IQAI Risk routes the claim to record lookup. |
| Q8 | Direct scope contradiction | Unsupported | Unsupported | UNSUPPORTED / HOLD | Strong alignment on direct contradiction. |
| Q9 | Causal attribution / full-cause claim | Split support label | Unsupported | UNSUPPORTED / HOLD | Issue-level alignment: all five humans selected cause not proven. |
| Q10 | Durability / permanence overstatement | Partially supported / overstated | Partially supported / overstated | UNSUPPORTED / HOLD | Directional alignment; IQAI Risk more severe. |
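The "less severe" and "more severe" notes in the table can be reproduced with a simple ordinal scale. The sketch below is illustrative only: the numeric ranks in `SEVERITY` and the `compare` helper are assumptions introduced here, not part of IQAI Risk, and the scale treats WEAK and "Partially supported / overstated" as sitting at the same level between supported and unsupported. Q7 (a routing result) and Q9 (a split human label) are left out because they do not reduce to a single comparable label.

```python
# Hypothetical ordinal severity ranks (an assumption for illustration only).
SEVERITY = {
    "supported": 0,
    "partially supported / overstated": 1,
    "weak": 1,
    "unsupported": 2,
    "unsupported / hold": 2,
}

# (human majority, IQAI Risk result) copied from the comparison table.
ITEMS = {
    "Q1": ("Unsupported", "WEAK"),
    "Q2": ("Supported", "SUPPORTED"),
    "Q3": ("Unsupported", "WEAK"),
    "Q4": ("Unsupported", "WEAK"),
    "Q5": ("Unsupported", "WEAK"),
    "Q6": ("Unsupported", "WEAK"),
    "Q8": ("Unsupported", "UNSUPPORTED / HOLD"),
    "Q10": ("Partially supported / overstated", "UNSUPPORTED / HOLD"),
}


def compare(human: str, iqai: str) -> str:
    """Describe the IQAI Risk result relative to the human majority label."""
    h = SEVERITY[human.lower()]
    i = SEVERITY[iqai.lower()]
    if i == h:
        return "same severity"
    return "IQAI Risk less severe" if i < h else "IQAI Risk more severe"


notes = {q: compare(human, iqai) for q, (human, iqai) in ITEMS.items()}
for q, note in notes.items():
    print(q, note)
```

Under these assumed ranks, Q1 and Q3 to Q6 come out as less severe, Q2 and Q8 as the same severity, and Q10 as more severe, which matches the notes column.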
Questions, human responses, LLM responses, and IQAI Risk result
For each question, the sections below give the anonymized human responses, the LLM responses, and IQAI Risk's scoring record.
Question 1: Demand outlook / forward-looking support gap
Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK
Consensus: Unsupported · Humans: 5/5 Unsupported · LLMs: 5/5 valid responses · Run ID: SP-279636571A
Human responses, anonymized
| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Unsupported | Forward-looking without enough support | High | Yes | The memo provides no data or forecast to support stable demand next year. |
| Reviewer 2 | Unsupported | Unsupported | Medium | Yes | The evidence provides no customer demand data or forecasts to support the claim of stability. |
| Reviewer 3 | Unsupported | Overstated as written | High | Yes | The claim predicts stable demand, but the evidence explicitly states there is no supporting data. |
| Reviewer 4 | Unsupported | Forward-looking without enough support | High | Yes | No supporting data is provided for a forward-looking demand claim. |
| Reviewer 5 | Unsupported | Forward-looking without enough support | High | Yes | No supporting data is provided for a forward-looking demand claim. |
LLM responses
| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| DeepSeek | Unsupported | No basis in evidence | High | No evidence in the memo supports the claim of stable demand. |
| OpenAI | Unsupported | Lack of orders, contracts, or forecasts | High | The absence of supporting data makes the claim unsubstantiated. |
| Anthropic | Unsupported | No forecast, backlog, orders, or contracts | High | The evidence confirms the memo contains none of the data types needed to substantiate the claim. |
| Grok | Unsupported | Lack of data or documentation | High | The evidence provides no data or documentation to support stable demand next year. |
| Gemini | Unsupported | Evidence lists missing support | High | The evidence lists absent orders, contracts, backlog, and forecast support. |
IQAI Risk scoring record
| Run ID | SP-279636571A |
|---|---|
| Result | WEAK |
| Category | Forward-looking claim without enough support. |
| Scoring note | IQAI Risk labeled the future demand conclusion weak because explicit guidance, forecast language, contracts, backlog, or quantified support were missing. |
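A scoring record like the one above is easiest to preserve and compare across runs when it has a fixed shape. The following is a minimal sketch of one possible representation, assuming an immutable record with the four fields shown in the table; the `ScoringRecord` class and its field names are hypothetical and do not describe IQAI Risk's actual storage format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: a stored record cannot be edited in place
class ScoringRecord:
    run_id: str
    result: str
    category: str
    scoring_note: str

    def to_json(self) -> str:
        """Serialize with sorted keys so identical records give identical text."""
        return json.dumps(asdict(self), sort_keys=True)


# Values copied from the Question 1 scoring record.
record = ScoringRecord(
    run_id="SP-279636571A",
    result="WEAK",
    category="Forward-looking claim without enough support.",
    scoring_note=(
        "IQAI Risk labeled the future demand conclusion weak because explicit "
        "guidance, forecast language, contracts, backlog, or quantified "
        "support were missing."
    ),
)

print(record.to_json())
```

Freezing the dataclass is a design choice for the sketch: it makes accidental edits raise an error, which fits the idea of a review preserved as a record.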
Question 2: Supported operating-context claim
Human: Supported · LLM: Supported · IQAI Risk: SUPPORTED
Consensus: Supported · Humans: 5/5 Supported · LLMs: 4/5 Supported · Run ID: SP-453F36A49E
Human responses, anonymized
| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Supported | No issue | High | Yes | The memo directly confirms procurement savings and a supplier rebate. |
| Reviewer 2 | Supported | No issue | High | Yes | The memo directly states that procurement savings and a rebate occurred during the quarter. |
| Reviewer 3 | Supported | Unsupported | High | Yes | The evidence directly restates the claim with no contradiction or missing support, so it is fully supported. |
| Reviewer 4 | Supported | No issue | High | Yes | The evidence directly restates and supports the claim. |
| Reviewer 5 | Supported | No issue | High | Yes | The evidence directly restates and supports the claim. |
LLM responses
| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| DeepSeek | Supported | No issue | High | The evidence directly states both elements of the claim. |
| OpenAI | Supported | No issue | High | The claim is directly supported by the evidence. |
| Anthropic | Supported | No issue | High | The evidence directly states procurement savings and the one-time supplier rebate. |
| Grok | Supported | No issue | High | The evidence directly supports the claim. |
| Gemini | Partially supported / overstated | Source verification caution | High | Gemini distinguished between the memo stating an event and independently proving the event. |
IQAI Risk scoring record
| Run ID | SP-453F36A49E |
|---|---|
| Result | SUPPORTED |
| Category | Directly supported internal factual claim. |
| Scoring note | IQAI Risk labeled the claim supported because the source directly stated both procurement savings and the supplier rebate occurred. |
Question 3: EBITDA outlook / forward-looking support gap
Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK
Consensus: Unsupported · Humans: 5/5 Unsupported · LLMs: 5/5 valid responses · Run ID: SP-9562EBBF2A
Human responses, anonymized
| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Unsupported | Forward-looking without enough support | High | Yes | No budget, contracts, or quantified EBITDA bridge are provided to support the earnings increase claim. |
| Reviewer 2 | Unsupported | Unsupported | Medium | Yes | There is no forecast model or detailed EBITDA analysis in the evidence to confirm the expected increase. |
| Reviewer 3 | Unsupported | Forward-looking without enough support | Medium | Yes | No data is provided to justify the expected EBITDA increase. |
| Reviewer 4 | Unsupported | Forward-looking without enough support | High | Yes | EBITDA growth is forward-looking with no supporting forecast or bridge. |
| Reviewer 5 | Unsupported | Forward-looking without enough support | High | Yes | No forecast or bridge supports the expected EBITDA increase. |
LLM responses
| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| DeepSeek | Unsupported | Missing financial support | High | The memo provides no evidence supporting the claim that EBITDA will increase next year. |
| OpenAI | Unsupported | No quantitative evidence | High | The memo lacks financial documents and analysis to support the claim. |
| Grok | Unsupported | Lack of specific evidence | High | No data substantiates the expected EBITDA increase. |
| Anthropic | Unsupported | No budget, forecast, contracts, or bridge | High | There is no documentary evidence to support the expectation. |
| Gemini | Unsupported | Evidence describes what is missing | High | The evidence provides no positive support for the EBITDA claim. |
IQAI Risk scoring record
| Run ID | SP-9562EBBF2A |
|---|---|
| Result | WEAK |
| Category | Forward-looking financial claim without enough support. |
| Scoring note | IQAI Risk labeled the future EBITDA statement as weak because quantified forward-looking support was absent. |
Question 4: Sanctions control outlook / forward-looking support gap
Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK
Consensus: Unsupported · Humans: 5/5 Unsupported · LLMs: 5/5 valid responses · Run ID: SP-7A4CC94B93
Human responses, anonymized
| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Unsupported | Forward-looking without enough support | High | Yes | No sanctions reports or reviews are included to support adequacy of controls next year. |
| Reviewer 2 | Unsupported | Unsupported | Medium | Yes | Absence of reports or plans means the claim of maintained sanctions controls is not supported. |
| Reviewer 3 | Unsupported | Forward-looking without enough support | High | Yes | There is no evidence showing future compliance control adequacy. |
| Reviewer 4 | Unsupported | Forward-looking without enough support | High | Yes | Future compliance adequacy is asserted without supporting validation or testing. |
| Reviewer 5 | Unsupported | Forward-looking without enough support | High | Yes | No evidence supports the forward-looking compliance adequacy claim. |
LLM responses
| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| DeepSeek | Unsupported | No documentation or plan | High | The evidence shows absence of compliance reports and forward planning. |
| OpenAI | Unsupported | Lack of sanctions-control evidence | High | The evidence lacks details on sanctions compliance controls. |
| Anthropic | Unsupported | No screening report, audit trail, review, or plan | High | No documentation supports continued adequacy. |
| Grok | Unsupported | Lack of evidence for future adequacy | High | No documentation or plans support the expectation. |
| Gemini | Unsupported | Evidence details what is absent | High | The evidence states what is missing, not what supports the claim. |
IQAI Risk scoring record
| Run ID | SP-7A4CC94B93 |
|---|---|
| Result | WEAK |
| Category | Forward-looking compliance/control claim without enough support. |
| Scoring note | IQAI Risk labeled the future adequacy claim weak because it lacked forward testing plans or guidance. |
Question 5: Customer churn outlook / forward-looking support gap
Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK
Consensus: Unsupported · Humans: 5/5 Unsupported · LLMs: 5/5 valid responses · Run ID: SP-8A6F1C8A22
Human responses, anonymized
| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Unsupported | Forward-looking without enough support | High | Yes | No customer renewal or churn data are provided to support expected decline. |
| Reviewer 2 | Unsupported | Unsupported | Medium | Yes | Lacking renewal or retention data, the claim of declining churn is not substantiated. |
| Reviewer 3 | Unsupported | Forward-looking without enough support | High | Yes | The memo provides no metrics to support the churn decline claim. |
| Reviewer 4 | Unsupported | Forward-looking without enough support | High | Yes | Churn improvement is forward-looking with no retention or renewal evidence. |
| Reviewer 5 | Unsupported | Forward-looking without enough support | High | Yes | No retention or churn data supports the expectation. |
LLM responses
| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| DeepSeek | Unsupported | No churn-related data or plan | High | No supporting data exists in the evidence. |
| OpenAI | Unsupported | Lack of relevant data | High | Evidence lacks renewal, retention, contracts, churn analysis, or planning support. |
| Anthropic | Unsupported | No retention metrics or churn analysis | High | No evidence supports expected churn decline. |
| Grok | Unsupported | Lack of relevant data | High | No data or metrics support the claim. |
| Gemini | Unsupported | Evidence describes absence | High | The evidence provides no positive evidence for declining churn. |
IQAI Risk scoring record
| Run ID | SP-8A6F1C8A22 |
|---|---|
| Result | WEAK |
| Category | Forward-looking business claim without enough support. |
| Scoring note | IQAI Risk labeled the churn forecast weak because renewal data, retention metrics, and forward support were missing. |
Question 6: Cyber incident response readiness / forward-looking support gap
Human: Unsupported · LLM: Unsupported · IQAI Risk: WEAK
Consensus: Unsupported · Humans: 5/5 Unsupported · LLMs: 5/5 valid responses · Run ID: SP-C7AE326A30
Human responses, anonymized
| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Unsupported | Forward-looking without enough support | High | Yes | No evidence of tests, exercises, or plans support improvement in cyber readiness. |
| Reviewer 2 | Unsupported | Unsupported | Medium | Yes | No evidence such as test results or plans supports the claim of improved cyber readiness. |
| Reviewer 3 | Unsupported | Forward-looking without enough support | High | Yes | No evidence is provided for improved cyber incident response readiness. |
| Reviewer 4 | Unsupported | Forward-looking without enough support | High | Yes | No testing or planning evidence supports improved cyber readiness. |
| Reviewer 5 | Unsupported | Forward-looking without enough support | High | Yes | No testing or plan supports improved cyber readiness. |
LLM responses
| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| Grok | Unsupported | No evidence for expected improvement | High | The evidence lacks specific data or plans. |
| DeepSeek | Unsupported | No documentation indicating readiness improvement | High | The evidence directly shows missing support. |
| OpenAI | Unsupported | Lack of evidence for future readiness improvement | High | No plans or actions support the claim. |
| Anthropic | Unsupported | No cyber readiness plan or test evidence | High | No basis exists for expected improvement. |
| Gemini | Unsupported | Evidence describes what is absent | High | No positive evidence supports the claim. |
IQAI Risk scoring record
| Run ID | SP-C7AE326A30 |
|---|---|
| Result | WEAK |
| Category | Forward-looking cybersecurity/control claim without enough support. |
| Scoring note | IQAI Risk labeled the readiness improvement claim weak because no test evidence, milestones, or readiness plan were provided. |
Question 7: Registry verification / outside verification
Human: Needs outside verification · LLM: Needs outside verification · IQAI Risk: RECORD LOOKUP / NEEDS VERIFICATION
Consensus: Needs verification · Humans: 5/5 Needs outside verification · LLMs: 4/5 Needs outside verification · Run ID: SP-E0078EDB98
Human responses, anonymized
| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Needs outside verification | Needs outside verification | High | Unsure | NEQ number is stated but not verified against an official registry. |
| Reviewer 2 | Needs outside verification | Cause not proven | Low | Unsure | The claim relies on a registry number without official verification or documentation. |
| Reviewer 3 | Needs outside verification | Needs outside verification | High | No | The NEQ claim requires verification from the Québec registry. |
| Reviewer 4 | Needs outside verification | Needs outside verification | High | Yes | The NEQ requires external registry verification. |
| Reviewer 5 | Needs outside verification | Needs outside verification | High | Yes | The claim requires external registry verification not provided. |
LLM responses
| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| OpenAI | Needs outside verification | No official registry record | High | The memo makes the claim but lacks an official source. |
| Grok | Needs outside verification | Lack of official verification | High | The claim is based only on a memo. |
| Anthropic | Needs outside verification | Memo repeats claim without authoritative source | High | An official Québec registry record is needed. |
| Gemini | Needs outside verification | No official external verification | High | The claim lacks an official Québec registry record. |
| DeepSeek | Partially supported / overstated | Stated but not officially verified | High | The memo asserts the NEQ but lacks verification. |
IQAI Risk scoring record
| Run ID | SP-E0078EDB98 |
|---|---|
| Result | RECORD LOOKUP / NEEDS VERIFICATION |
| Category | Needs outside verification / registry lookup. |
| Scoring note | IQAI Risk routed the NEQ claim to the record-lookup lane instead of treating it as normal claim support. |
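The routing decision described in the scoring note can be pictured as a check that runs before any support scoring. The sketch below is not IQAI Risk's implementation: the lane names, the `route_claim` helper, and the example claim text are all hypothetical, and the pattern assumes an NEQ is written as a 10-digit number near the token "NEQ".

```python
import re

# Assumption: an NEQ appears as a 10-digit number shortly after the token "NEQ".
NEQ_PATTERN = re.compile(r"\bNEQ\b\D{0,20}(\d{10})\b", re.IGNORECASE)


def route_claim(claim: str) -> str:
    """Return a hypothetical lane name for a claim.

    Claims that cite a registry identifier go to a record-lookup lane,
    because the memo alone cannot confirm what an official registry says.
    Everything else goes to ordinary claim-support review.
    """
    if NEQ_PATTERN.search(claim):
        return "record_lookup"
    return "claim_support"


# Hypothetical claim texts for illustration (not the pilot's actual items).
print(route_claim("The company's NEQ is 1234567890."))
print(route_claim("Demand will remain stable next year."))
```

The point of routing first is that a registry claim is neither supported nor unsupported by the memo; it is simply unverified until the external record is checked.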
Question 8: Overstated pilot coverage claim
Human: Unsupported · LLM: Unsupported · IQAI Risk: UNSUPPORTED / HOLD
Consensus: Unsupported · Humans: 4/5 Unsupported; 1/5 Partially supported / overstated · LLMs: all valid models detected overstatement · Run ID: SP-B90957F085
Human responses, anonymized
| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Unsupported | Overstated as written | High | Yes | The pilot only covered 12% of customers, so the claim that it covered all is overstated. |
| Reviewer 2 | Partially supported / overstated | Forward-looking without enough support | Medium | Yes | The claim that all customers were covered is overstated; only 12% were included. |
| Reviewer 3 | Unsupported | Overstated as written | High | Yes | Only 12% of customers were covered, so claiming all customers is overstated. |
| Reviewer 4 | Unsupported | Unsupported | High | Yes | Evidence contradicts the claim: 12% versus all customers. |
| Reviewer 5 | Unsupported | Overstated as written | High | Yes | Evidence shows only 12%, contradicting the all-customers claim. |
LLM responses
| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| Grok | Unsupported | Overstated as written | High | The claim is contradicted by evidence stating only 12% of customers were included. |
| DeepSeek | Partially supported / overstated | Overstated as written | High | The evidence shows only 12% were covered, not all customers. |
| Anthropic | Unsupported | Overstated as written | High | The memo directly contradicts the claim. |
| OpenAI | Unsupported | Overstated as written | High | The evidence indicates only 12% of customers were covered. |
| Gemini | No valid response | Quota / 429 error | N/A | No valid response due to quota / 429 error. |
IQAI Risk scoring record
| Run ID | SP-B90957F085 |
|---|---|
| Result | UNSUPPORTED / HOLD |
| Category | Direct scope contradiction / overstated coverage claim. |
| Scoring note | After the semantic-gap active overlay, IQAI Risk applied DIRECT_SCOPE_CONTRADICTION and scored the claim unsupported. |
Question 9: Cause not proven / full attribution claim
Human: split support label but 5/5 cause not proven · LLM: Unsupported · IQAI Risk: UNSUPPORTED / HOLD
Consensus: Cause not proven · Human support labels: 2 Needs outside verification; 2 Partially supported / overstated; 1 Unsupported · Human main issue: 5/5 Cause not proven · Run ID: SP-68B7ED3F2E
Human responses, anonymized
| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Needs outside verification | Cause not proven | High | Yes | Procurement savings occurred, but their exact impact on margin increase is not quantified. |
| Reviewer 2 | Partially supported / overstated | Cause not proven | Medium | Yes | The margin increase coincides with procurement savings and rebates, but no direct quantification links the full increase to procurement. |
| Reviewer 3 | Needs outside verification | Cause not proven | High | Yes | Procurement savings occurred, but their impact on margin is not quantified. |
| Reviewer 4 | Unsupported | Cause not proven | High | Yes | Causation is not proven and impacts are not quantified. |
| Reviewer 5 | Partially supported / overstated | Cause not proven | High | Yes | Savings occurred but causation for the full margin increase is unproven. |
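The table above shows why Question 9 is reported as a split support label with issue-level agreement. A minimal sketch of that tally follows, assuming a strict majority rule (more than half of reviewers); the `majority` helper is introduced here for illustration and is not the pilot's documented procedure.

```python
from collections import Counter

# (support label, main issue) for Reviewers 1-5 on Question 9, from the table.
Q9_HUMAN = [
    ("Needs outside verification", "Cause not proven"),
    ("Partially supported / overstated", "Cause not proven"),
    ("Needs outside verification", "Cause not proven"),
    ("Unsupported", "Cause not proven"),
    ("Partially supported / overstated", "Cause not proven"),
]


def majority(labels):
    """Return the label chosen by more than half of reviewers, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


support_majority = majority([support for support, _ in Q9_HUMAN])
issue_majority = majority([issue for _, issue in Q9_HUMAN])

print(support_majority)  # None: no support label reaches 3 of 5
print(issue_majority)    # Cause not proven: chosen by 5 of 5
```

Tallying the support label and the main issue separately is what lets the comparison table report agreement on the issue even though the reviewers chose three different support labels.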
LLM responses
| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| DeepSeek | Unsupported | Cause not proven | High | Both procurement savings and a rebate occurred, but neither impact was quantified. |
| Grok | Unsupported | Cause not proven | High | The evidence does not quantify procurement savings or other factors. |
| Anthropic | Unsupported | Cause not proven | High | The claim attributes the full increase to procurement savings without ruling out the rebate. |
| OpenAI | Partially supported / overstated | Cause not proven | Medium | The evidence shows related facts but does not prove causation. |
| Gemini | No valid response | Quota / high-demand errors | N/A | Gemini 2.0 and 2.5 did not return valid responses. |
IQAI Risk scoring record
| Run ID | SP-68B7ED3F2E |
|---|---|
| Result | UNSUPPORTED / HOLD |
| Category | Cause not proven / full attribution not supported. |
| Scoring note | After the semantic-gap active overlay, IQAI Risk applied CAUSE_NOT_PROVEN and adjusted the causal claim to unsupported. |
Question 10: Permanence / durability not proven
Human: Partially supported / overstated · LLM: Partially supported / overstated · IQAI Risk: UNSUPPORTED / HOLD
Consensus: Partially supported / overstated · Humans: 3/5 Partially supported / overstated · LLMs: 4/5 Partially supported / overstated · Run ID: SP-67BD5425BB
Human responses, anonymized
| Reviewer | Support | Main issue | Confidence | Impact | Explanation |
|---|---|---|---|---|---|
| Reviewer 1 | Unsupported | Overstated as written | High | Yes | Evidence shows profitability contribution but not permanence or reliability of earnings improvement. |
| Reviewer 2 | Partially supported / overstated | Forward-looking without enough support | Medium | Yes | The evidence indicates current profitability contributed by pricing but does not prove permanence. |
| Reviewer 3 | Needs outside verification | Cause not proven | High | Yes | The memo shows contribution to Q4 profitability but does not prove permanent earnings-quality change. |
| Reviewer 4 | Partially supported / overstated | Overstated as written | High | Yes | Some support exists, but permanence of earnings change is overstated. |
| Reviewer 5 | Partially supported / overstated | Overstated as written | High | Yes | Evidence shows pricing helped current results, but the claim overstates permanence without proof. |
LLM responses
| Model | Support | Main issue | Confidence | Reason |
|---|---|---|---|---|
| DeepSeek | Partially supported / overstated | Overstated as written | High | Pricing contributed to current profitability but does not prove permanence. |
| Grok | Partially supported / overstated | Forward-looking without enough support | High | The evidence lacks durability or permanence support. |
| Anthropic | Partially supported / overstated | Overstated as written | High | A one-time rebate and no recurring bridge undermine permanence. |
| OpenAI | Unsupported | Cause not proven | High | The evidence does not demonstrate durable or permanent change. |
| Gemini 2.5 Flash | Partially supported / overstated | Overstated as written | High | Current profitability support does not isolate permanence. |
IQAI Risk scoring record
| Run ID | SP-67BD5425BB |
|---|---|
| Result | UNSUPPORTED / HOLD |
| Category | Permanence not proven / durability not supported. |
| Scoring note | IQAI Risk labeled the permanence claim unsupported through the existing inference judgment pathway; no active overlay was needed. |
Interpretation
The pilot does not show that IQAI Risk “beats AI” on simple claim/evidence questions. The LLMs and humans performed well on the short, controlled items.
The value of IQAI Risk is different: it turns claim-support review into a governed, repeatable, reviewable scoring record. It also covers verification surfaces that humans and LLMs cannot consistently route at scale, including record lookups, filing identifiers, external price/date checks, math reconciliation, and receipt-level traceability.
Limitations and next steps
- This is a 10-item calibration pilot, not a publication-grade benchmark.
- The human sample was small: five reviewers.
- The final reference labels are provisional and should be adjudicated by expert reviewers in a larger study.
- The next study should use 30–100 items, balanced across supported, weak, unsupported, external-verification, causal, numerical, and record-check categories.
- Future versions should measure humans, LLMs, and IQAI Risk separately against adjudicated reference labels.
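Measuring each surface against reference labels could use raw agreement together with a chance-corrected statistic such as Cohen's kappa. The sketch below is a hedged illustration: no adjudicated labels exist yet, so it compares the human majority and LLM majority columns from the top-level comparison table as stand-ins, with Q9's human label recorded as "Split". The helper names are assumptions, and with only 10 items the resulting numbers are illustrative rather than reliable estimates.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which the two label lists match exactly."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa for two raters over the same items.

    Undefined (division by zero) if both raters always use one identical label.
    """
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if each rater labeled at random with their own rates.
    p_e = sum(count_a[lab] * count_b[lab] for lab in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Majority labels for Q1-Q10 from the comparison table (stand-ins only).
HUMAN = ["Unsupported", "Supported", "Unsupported", "Unsupported", "Unsupported",
         "Unsupported", "Needs outside verification", "Unsupported", "Split",
         "Partially supported / overstated"]
LLM = ["Unsupported", "Supported", "Unsupported", "Unsupported", "Unsupported",
       "Unsupported", "Needs outside verification", "Unsupported", "Unsupported",
       "Partially supported / overstated"]

agreement = percent_agreement(HUMAN, LLM)
kappa = cohen_kappa(HUMAN, LLM)
print(f"raw agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.3f}")
```

In a larger study the same two functions would be applied three times, once per surface, with the adjudicated labels as the second argument.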