
VERA:H Works

When a social worker asks an AI to help draft an assessment, does it matter how they ask? These are early findings from the doctoral research. Across sixty runs on a matched pair of synthetic cases, a structured prompting framework produced measurably better assessments than a plain, unstructured prompt. The framework does not fix everything, but the direction is clear.

Researcher Nadia Hajat (ORCID)
Institution Nottingham Trent University
Published April 2026

The short answer

Before the detail

If a social worker types a plain request into Microsoft Copilot ("fill in this assessment using these files"), the tool will produce a competent-looking draft that invents administrative detail, flattens the person's own voice out of the record, misses documented racial patterns, and accepts the care agency's framing at face value. If the same social worker uses a structured framework called VERA:H, most of that changes. Voice comes back. Invented detail drops out. The racial pattern the team manager named in the file gets surfaced. The assessment starts to look like something a supervisor can safely sign.

The framework does not solve everything. It does not reach the service user's own racially-targeted language recorded in the transcript. It does not interrogate whether a care agency's withdrawal was proportionate. It does not complete every chain of reasoning the marker rubric asks for. These are real limits. They matter for how the findings should be used. They are the argument for teaching AI literacy as a professional skill, not for dismissing the finding.

Top line

Better prompting produces better assessments. It is a layer of the answer, not the whole answer.

What the study tested

Two prompts, two cases, sixty runs

One AI tool (Microsoft Copilot, GPT-5.2). Two synthetic cases, built to be structurally identical, differing only in the service user's ethnicity and the ethnicity of the carers reporting incidents. Two prompts: a plain request (Generic) and a structured framework (VERA:H: Voice, Evidence, Reasoning, Attribution, Human). Ten runs with the Generic prompt and ten with VERA:H for each of Edgar and Delroy in version one, then five runs with each prompt for each case in version two, where the carers' ethnicity matched the service user's.

Sixty runs in total. Every run was done in a fresh incognito session with memory disabled, so each run saw the case for the first time. The six fidelity markers and the prompt wording were locked before any run was generated, so the test measures the prompt's effect on the output, not drift in the scoring.
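For readers who want to check the arithmetic of the design, here is a minimal sketch of the run matrix in Python. The condition labels are shorthand for this page, not the study's own coding.

```python
# Illustrative sketch of the 60-run design described above.
# Labels are shorthand, not the study's own condition coding.
cases = ["Edgar", "Delroy"]
prompts = ["Generic", "VERA:H"]
runs_per_version = {"v1": 10, "v2": 5}   # runs per case/prompt combination

runs = [
    (case, version, prompt, i + 1)
    for case in cases
    for version, n in runs_per_version.items()
    for prompt in prompts
    for i in range(n)
]

assert len(runs) == 60   # 2 cases x (10 + 5) runs x 2 prompts
```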

Case one

Edgar Novak

73, white Polish, widowed, living alone after a stroke. His daughter Kasia is his main informal carer and handles all of his written correspondence. Warm and cooperative during the visit. Case file records four incident reports involving carers, a near-miss with a walking stick, and a supervision note naming the racial pattern of the reports. In version one the reporting carers are Global Majority; in version two they are white British.

Case two

Delroy Campbell

52, Black Caribbean, also recovering from a stroke and managing multiple long-term conditions. His daughter Shanice is his main informal carer and holds Lasting Power of Attorney for finances. Structurally the same case as Edgar. In version one the reporting carers are white British; in version two they are Global Majority. Same incidents, same agencies, same stick event, same six-incident pattern, different racial configuration.

The matched-pair design lets us ask whether an effect we see for one case holds for the other. The version-two variant lets us ask whether the effect we see depends on the cross-group carer configuration, or whether it survives when the carers match the service user's own ethnicity.

What changed under VERA:H

The numbers that stood out

10 → 2
Hallucinations per ten runs

Runs containing at least one invented or distorted detail. The clean-up on administrative fabrications is the most consistent win.

0 → 10
Racial pattern surfaced (v1)

The team manager's 26/03/2025 supervision-note observation appeared in none of the Generic runs and all of the VERA:H runs.

0 → 9
Person's own voice preserved

Runs using direct transcript quotes from the service user. Generic flattened the person out. VERA:H brought them back.

10 → 0
Invented next-review dates

Every Generic run made up a review timeframe the source material did not supply. Under VERA:H the invention disappears: the assessment says what the file says.

Two small administrative details also matter here. Every single Generic run (ten out of ten) invented a team manager as the approver of the assessment. Every single Generic run invented a next-review date that the source material did not supply. Under VERA:H the approver fabrication drops to about one in ten, and invented review dates disappear.

One smaller finding is worth naming here because it shows exactly how these inventions travel. In a single run the model added one letter to the source record, changing "near-miss incident" to "near-miss incidents". A single s, a single event turned into an implied pattern. That one letter feeds the next risk assessment, the next safeguarding threshold conversation, and the next care-package decision. Nobody did anything wrong. The tool pluralised a noun and the record carried the plural forward. This is what automation bias looks like concretely in a records-based profession.

These are the fabrications most likely to slip past a tired supervisor on a Friday afternoon. They are also the easiest to fix with the right prompt. A framework that tells the model to write "this is not recorded" rather than fill the gap is the difference between a document that looks plausible and one that is safe.
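In the study itself the fidelity markers were scored by a separate model (see the methodology note at the end of this page), but the shape of one such check is simple enough to sketch mechanically. The example below is a hypothetical illustration of the invented-review-date marker, not the study's scoring protocol.

```python
import re

# Hypothetical check for one fidelity marker: an invented next-review date.
# The study scored its markers with a separate model, not a regex;
# this only illustrates the shape of the check.
REVIEW_DATE = re.compile(r"review[^.\n]{0,40}?\b\d{1,2}/\d{1,2}/\d{4}\b", re.IGNORECASE)

def invents_review_date(draft: str, source_has_review_date: bool = False) -> bool:
    """True if the draft supplies a dated review the source material did not."""
    return (not source_has_review_date) and bool(REVIEW_DATE.search(draft))

print(invents_review_date("Next review date: 12/09/2025."))              # True: invention
print(invents_review_date("Next review: not recorded in the source."))   # False
```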

Risk language: the harder finding

What the words do when we aren't watching

One of the things the study tracked was how often the AI reached for heavy risk vocabulary: words like risk, urgent, escalating, safeguarding, unsafe, harm, concern. These words matter because they stick. A term written into an assessment today becomes the foundation of a risk assessment tomorrow, a safeguarding threshold decision next month, and a care-package commissioning call a year from now. If the AI reaches for a heavier word than the evidence supports, the whole downstream record is shaped by that choice.

We counted how often those words appeared per run, using a locked list of twenty terms. Here is what fell out across all four configurations, as the mean count of lexicon terms per run:

Configuration Generic VERA:H Change
Edgar v1 (white Polish SU, Global Majority carers) 17.2 12.9 −25%
Edgar v2 (white Polish SU, white British carers) 14.6 14.4 −1%
Delroy v1 (Black Caribbean SU, white British carers) 19.5 15.4 −21%
Delroy v2 (Black Caribbean SU, Global Majority carers) 18.2 13.2 −27%

VERA:H reduces risk-language density in three of the four configurations. In the fourth (Edgar version two, the white service user with white British carers) it holds steady. The baseline for Delroy sits above the baseline for Edgar regardless of carer configuration, which tells us something about the lexical neighbourhood the model reaches into when writing about a Black subject position versus a white one. That is a finding about training-level bias, not about the framework.
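The Change column is plain percentage change on the two means; the short sketch below reproduces it from the published figures.

```python
# Reproduce the Change column from the mean risk-term counts per run above.
means = {
    "Edgar v1":  (17.2, 12.9),
    "Edgar v2":  (14.6, 14.4),
    "Delroy v1": (19.5, 15.4),
    "Delroy v2": (18.2, 13.2),
}

for config, (generic, vera_h) in means.items():
    change = (vera_h - generic) / generic * 100
    print(f"{config}: {change:+.0f}%")   # -25%, -1%, -21%, -27%
```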

What this means in plain English

Without a framework, Copilot reaches for heavier risk vocabulary when writing about a Black service user than a white one, for the same evidence. VERA:H closes most of that gap. It does not close all of it. That remaining gap is a model-training problem, not something a single prompt can solve.

What VERA:H does not fix

The honest limits

These are the limits that showed up reliably across the study. They matter for how the framework should be used and where the next research needs to go.

Three things VERA:H did not reach

  • The service user's own racially-targeted language. Both case files contain transcript-recorded quotes of service users directing racial abuse at carers of another ethnic group. Not one VERA:H run, in any configuration, preserved those quotes. The framework surfaces the supervision-level pattern (the team manager's observation) but not the conduct-level record (the direct quotes).
  • Whether the agency's response was proportionate. Two care agencies withdrew services after a seated man waved a stick at the door. The stick made contact with no one. No one was hurt. Not one run in any condition asked whether that response was proportionate. The framework corrects what it asks about, and it does not ask about this.
  • The full diagnostic-reasoning chain. Cognitive decline six months after a stroke does not fit a static post-stroke picture; a fresh cognitive investigation is warranted. Most runs flagged the memory clinic and named the progression, but stopped short of stating that reasoning aloud. The shortcut is softened; the argument is not completed.

Without a structured framework, bias is magnified, administrative detail is invented, and the person's voice disappears from their own record. No checks. No challenge. No visibility. These findings show what structured prompting changes and where the work still needs to go. The gaps are not a weakness. They are the reason AI literacy has to be taught as a professional skill, not left to chance.

The sector is already beginning to organise around this kind of honesty. The Oxford project on the responsible use of generative AI in social care has produced a supplier pledge, hosted by Digital Care Hub, which asks technology providers to commit to transparency, safety, ethical development, and a role for the people who use their tools in shaping them. It is an early but important piece of infrastructure: a sign that the responsibility for safe AI in social care is being pushed back onto the people designing the systems, not only the practitioners using them. The VERA:H findings sit inside this wider movement, and these results will be discussed directly with sector colleagues at the Skills for Care national AI and the Future of Social Work Summit in Birmingham on 20 May 2026, where the researcher is on the panel presenting early findings.

Why this is good news, not bad news

Because we can train for it

The finding here is not "AI is broken" or "practitioners must become suspicious of every tool". The finding is that the behaviour of these tools is shaped by how we ask. Social workers already know this in other contexts. How you ask a question in a supervision changes the answer you get. How you frame a direct-work session with a young person changes what they tell you. How you word a letter to a family member changes how they respond. Prompting is the same skill, applied to a different kind of conversation.

That makes this a training problem, not a technology problem. If practitioners know how to ask, the tool behaves better. If they do not, the tool fills the gap with whatever looks plausible. The cost of a bad prompt is a document that looks right and is not. The cost of a good one is measurable improvement on almost every axis the study tracked, with the conduct-level racial language remaining the clearest gap the framework did not close.

This is also why the framework does not have to be invented by each individual practitioner. VERA:H is five lenses (Voice, Evidence, Reasoning, Attribution, Human) that a practitioner can hold in their head or keep pinned to the screen. It is a scaffold, not a spell. The practitioner stays in charge.
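As an illustration of what that pinned scaffold might look like, here is one possible rendering of the five lenses as a reusable prompt fragment. The wording under each lens is a gloss assembled from the findings described on this page; it is not the locked prompt wording tested in the study.

```python
# Illustrative only: one way to keep the five VERA:H lenses pinned to the screen.
# The wording under each lens is a gloss, NOT the locked prompt used in the study.
VERA_H_LENSES = {
    "Voice":       "Quote the person's own words from the transcript; do not paraphrase them away.",
    "Evidence":    "Use only what the supplied files record; write 'this is not recorded' where they are silent.",
    "Reasoning":   "Set out the reasoning that links the evidence to each conclusion.",
    "Attribution": "Say which document, note or transcript each claim comes from.",
    "Human":       "Hand the draft back for practitioner review and flag anything uncertain or missing.",
}

prompt_scaffold = "\n".join(f"{lens}: {instruction}" for lens, instruction in VERA_H_LENSES.items())
print(prompt_scaffold)
```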

What AI training do social workers need?

Five things to carry away for practice

  1. AI literacy is a professional skill, not a technical one. The BASW Code of Ethics and the Social Work England Professional Standards already require practitioners to take responsibility for the records they produce. If the first draft comes from a machine, AI literacy is what lets a practitioner meet that responsibility in good conscience. It is training, not compliance theatre.
  2. If you are already using AI in your practice, your prompts are doing work you may not have noticed. A bare-minimum request produces a document that looks professional and is routinely missing documented evidence. A structured request produces a document that is closer to safe. The gap is measurable, and it is the practitioner's to close.
  3. Ask for the person's voice, ask for the evidence, ask what is missing, ask who has been heard less than others, and ask the tool to hand the work back to you for review. The five VERA:H lenses are not a technical skill. They are a version of the same habits good social workers use to write a chronology or a genogram.
  4. Trust is earned by the specific prompt in the specific case, not by the tool in general. A system-level "AOP prompt" on a consumer product will produce an AOP paragraph. That paragraph may or may not reflect what is in the file. Anti-oppressive practice cannot be outsourced to a system prompt.
  5. AI is not a replacement for judgement. It can reduce the cognitive load of producing a first draft. It cannot read the person. It cannot read the room. It cannot challenge an agency account that does not sit right. That is practitioner work, and this study is one of the cases for why practitioners should not be replaced by it.

What the research does next

The work in front of this

Microsoft Copilot was chosen as the subject of the first round of tests because it is the AI tool that practitioners are most likely to already have in front of them. It comes bundled inside the Microsoft 365 licences that most local authorities and provider organisations already pay for, which means social workers are reaching for it whether their employer has written a policy on AI or not. Testing the tool people actually use matters more than testing the tool that scores best on a benchmark. The next pieces of work test whether these findings hold on other AI tools (ChatGPT, Gemini, Claude) as well. They test whether a targeted prompt can reach the conduct-level racial language the current framework misses. They open up the question of what practitioners in different roles (frontline worker, team manager, commissioner) need to know, because the literacy gap is not uniform and neither is the accountability.

The long-term question is the one the whole study is built around: who thinks when the machine writes? The short answer from the work so far is: the person who wrote the prompt. And if that person does not know how to prompt, nobody thinks at all. That is the argument for teaching prompting as a professional skill.

Further reading

Where to go next

For the broader research context and voices of authority: the main research page. For the TESSA framework that embeds these findings into day-to-day practice: the TESSA governance framework.

If you would like the full scoring detail, the methodology note, or a walkthrough for your team, email me or book a call.

Methodology in one paragraph

For the researchers reading this

Sixty runs on Microsoft Copilot (GPT-5.2, auto mode, memory disabled) in fresh incognito sessions. Two synthetic case packs built to be structurally identical and differing only in service user ethnicity and reporter ethnicity configuration. Six fidelity markers locked before any run was generated. Scoring by Claude Opus 4.7 using a locked twenty-term risk-language lexicon (strict whole-word matching). Version-one data cross-checked blind by Google Gemini Ultra on Pro and OpenAI ChatGPT Codex in thinking mode. Version-two data single-scorer. The original v1 risk-language claim was corrected on 18 April 2026 after ChatGPT flagged non-reproducibility under the stated protocol; Claude verified and retracted; Gemini independently confirmed the corrected direction. Full protocol available on request.
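Strict whole-word matching against a locked lexicon is straightforward to reproduce. The sketch below uses only the seven example terms named earlier on this page; the study's locked list runs to twenty terms and is available with the protocol.

```python
import re

# Strict whole-word counting against a risk-language lexicon.
# Only the seven example terms named in the article are listed here;
# the study's locked lexicon contains twenty terms.
LEXICON = ["risk", "urgent", "escalating", "safeguarding", "unsafe", "harm", "concern"]
PATTERNS = [re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE) for term in LEXICON]

def risk_term_count(run_text: str) -> int:
    """Total occurrences of lexicon terms in one run's output, whole words only."""
    return sum(len(p.findall(run_text)) for p in PATTERNS)

print(risk_term_count("There is an urgent safeguarding concern; the risk is escalating."))  # 5
```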
