Methods
Study question
In a reproducible sample of AI-generated outputs, how often are cited references verifiable using public bibliographic sources?
Sampling
- Target design: N = 100 prompts from a fixed prompt bank.
- Current published sample: N = 100 (source: ChatGPT).
- One response per prompt from the chosen model (v1 uses a single model as a baseline).
- Each prompt instructs the model to include exactly 5 references in a strict one-line schema.
- Collected outputs are stored as JSONL rows (prompt + full answer text).
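To make the storage format concrete, here is a minimal sketch of one JSONL row. The field names ("prompt", "answer") are illustrative assumptions, not necessarily the study's exact schema; see the template file in the repo for the authoritative shape.

```javascript
// One collected output becomes one JSON object serialized onto a single line.
// NOTE: field names here are assumptions for illustration.
const row = {
  prompt: "Give an overview of citation practices in NLP, with exactly 5 references.",
  answer: "Citation practices vary widely... [full answer text ending in 5 one-line references]"
};

// Serialize to a single line; append a newline per row when writing ai-outputs.jsonl.
const line = JSON.stringify(row);
console.log(line);
```

Because `JSON.stringify` never emits raw newlines inside a value, each row stays on exactly one line, which is what makes the file valid JSONL.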
What is “verifiable”?
We run each full AI output through Verifing’s Citation Verification tool, which attempts to resolve citations via public bibliographic sources (e.g., Crossref/DataCite/PubMed/OpenAlex/Open Library) using conservative matching.
- VERIFIED: citation metadata matches a known record with sufficient confidence.
- RETRACTED: the resolved record is known to be retracted (when detectable).
- HALLUCINATED: the identifier/citation could not be found in queried sources.
- AMBIGUOUS: plausible candidates exist but there isn’t enough information to confirm safely.
- ERROR: transient/system failure (timeouts, upstream issues).
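The decision logic behind these five statuses can be sketched as follows. This is a hypothetical illustration of conservative matching, not Verifing's actual implementation; the `score` threshold, field names, and candidate structure are all assumptions.

```javascript
// Hypothetical classifier mapping a bibliographic lookup result to the
// five statuses above. Thresholds and fields are illustrative assumptions.
function classify(result) {
  if (result.error) return "ERROR";                 // transient/system failure
  if (result.candidates.length === 0) return "HALLUCINATED"; // nothing found in queried sources
  const best = result.candidates[0];                // highest-confidence candidate
  if (best.score < 0.9) return "AMBIGUOUS";         // plausible but not safe to confirm
  return best.retracted ? "RETRACTED" : "VERIFIED"; // confident match, retraction checked
}

console.log(classify({ error: false, candidates: [] }));
console.log(classify({ error: false, candidates: [{ score: 0.95, retracted: false }] }));
```

Note how the order of checks encodes the conservatism: a citation is only VERIFIED after the error, not-found, and low-confidence cases have all been ruled out.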
Important limitations
- “HALLUCINATED” in this study means “not found in the queried sources.” It is not a claim about intent.
- Public sources can be incomplete, rate-limited, or delayed; some real citations may be marked AMBIGUOUS or HALLUCINATED.
- This v1 study uses a single model and a single run per prompt; results may differ across models and runs.
Reproduction steps
- Use the prompt bank at apps/web/src/data/studies/citation-verifiability-jan-2026/prompt-bank.md.
- Save outputs to apps/web/src/data/studies/citation-verifiability-jan-2026/ai-outputs.jsonl, following the template file.
- Run:
  node scripts/study-citation-verifiability/run-study.mjs --api https://api.verifing.com \
    --input apps/web/src/data/studies/citation-verifiability-jan-2026/ai-outputs.jsonl
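Before running the script, it can help to sanity-check that every non-empty line of the outputs file parses as a JSON object. This validator is a sketch under that assumption only; it does not check the study's reference schema.

```javascript
// Sketch: verify that every non-empty line of a JSONL string is a JSON object.
// This checks JSONL well-formedness only, not the study's reference schema.
function validateJsonl(text) {
  const rows = text.split("\n").filter((l) => l.trim().length > 0);
  return rows.every((l) => {
    try {
      return typeof JSON.parse(l) === "object"; // each line must be valid JSON
    } catch {
      return false;                             // any unparsable line fails the file
    }
  });
}

console.log(validateJsonl('{"prompt":"p","answer":"a"}\n{"prompt":"q","answer":"b"}'));
```

In practice you would read ai-outputs.jsonl with fs.readFileSync and pass its contents to this function before invoking run-study.mjs.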
Dataset download (current published sample): /study/citation-verifiability-jan-2026/dataset