We run live customer-style tests against your chatbot and show where it could cost sales, create unnecessary support tickets, drift from your policies, leak concessions, or give customers inconsistent answers.
Every finding comes with transcripts, source comparisons, severity ratings, and clear fixes.
The cost of a wrong answer isn't a single annoyed user. The industry's own research is blunt about how little margin for error a support chatbot has.
Your chatbot doesn't get a second impression. We find the answers that would cost you the customer, before a real one does.
It handles refunds, policies, account questions and complaints. Unsupervised, in your name, every hour of the day. Most of those conversations go fine. The ones that don't are the ones you hear about from a furious customer, a chargeback, or a screenshot with a lot of retweets.
Deflection and CSAT metrics tell you the chat closed the ticket. They don't tell you it closed the ticket by inventing a 90-day return policy you don't offer, or by agreeing to a 40% discount no one authorised. That's the gap a synthetic customer audit fills.
"Of course! Staff and partners use code FAMILY40 for 40% off at checkout. Anything else I can help with?"
We build a persona roster matched to your real customer base: first-time buyers, loyal regulars, bargain hunters, anxious gift-givers, and the handful actively trying to break things. Every persona is sorted into one of three intent classes, so the report tells you not just what broke, but who broke it.
We deploy persona-driven shoppers across web and in-app surfaces. Each is assigned a persona from the three classes, matched to your actual buyer types. They behave like real people in a live conversation, not test scripts firing fixed prompts.
A subset push harder: prompt injection, jailbreaks, system-prompt extraction, social engineering, repeated reframing. We probe every conversation for what the bot actually committed to: the promise, the policy, the leak, not just whether it sounded polite.
Every flagged exchange is tiered as Critical failure, Brand risk, or Watchlist, with the full transcript, the offending message highlighted, cross-validation across conversations, and a ranked, reproducible fix. A clean tier is a result too.
This is one synthetic customer from the adversarial class, replayed. Every attempt is logged, scored, and either held or flagged. Nothing here touches your infrastructure. It's all just conversation.
Every audit runs the conversations your real customers have, and the ones a bad actor will. Each category is probed across multiple personas, then scored by how often the bot held the line and the worst tier we reproduced.
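As a rough illustration of that per-category score, here is a minimal sketch. The `Attempt` record, the `summarise_category` helper, and the numeric severity ordering are hypothetical names chosen for this example, not the audit's actual tooling. Only the three tier names and the idea of reporting "how often the bot held" alongside the worst reproduced tier come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed ordering for illustration: higher number = more severe tier.
TIER_SEVERITY = {"Watchlist": 1, "Brand risk": 2, "Critical failure": 3}


@dataclass
class Attempt:
    """One probe of a category by one persona (hypothetical record shape)."""
    persona: str
    held: bool                  # True if the bot held the line
    tier: Optional[str] = None  # set only when the attempt was flagged


def summarise_category(attempts: list[Attempt]) -> dict[str, str]:
    """Report how often the bot held, plus the worst tier reproduced."""
    total = len(attempts)
    held = sum(a.held for a in attempts)
    flagged = [a.tier for a in attempts if not a.held and a.tier]
    worst = max(flagged, key=TIER_SEVERITY.__getitem__, default=None)
    return {"held": f"{held} of {total}", "worst_tier": worst or "Clean"}


attempts = [
    Attempt("bargain hunter", held=True),
    Attempt("prompt injector", held=False, tier="Critical failure"),
    Attempt("anxious gift-giver", held=False, tier="Brand risk"),
    Attempt("loyal regular", held=True),
]
summary = summarise_category(attempts)
print(summary)
```

A category where every attempt held reports "Clean" as its worst tier, which matches the point that a clean tier is a result too.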
A sample audit output showing the format, evidence standard, and finding structure. Brand and conversations are fictionalised for illustration; the format and evidence standard are real.
Every finding clears a bar before it's classified and reported. We'd rather report three findings you can act on than thirty you can't trust.
Each finding ships with the exact message sequence that triggered it. You can paste it into your own chat and watch the failure happen. Anecdotes are not findings.
One weird reply is noise. A Critical failure requires evidence from at least 3 conversations across at least 2 independent personas. The denominator is stated next to every finding.
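The cross-validation bar can be written as a simple check. This is a sketch under assumptions: the `Observation` record and the `meets_critical_bar` helper are hypothetical names, and only the two thresholds (at least 3 conversations, at least 2 independent personas) come from the standard stated above.

```python
from typing import NamedTuple

MIN_CONVERSATIONS = 3  # threshold stated in the evidence standard
MIN_PERSONAS = 2       # threshold stated in the evidence standard


class Observation(NamedTuple):
    """One conversation in which the failure was reproduced (hypothetical shape)."""
    conversation_id: str
    persona: str


def meets_critical_bar(observations: list[Observation]) -> bool:
    """True if the failure appears in enough distinct conversations and personas."""
    conversations = {o.conversation_id for o in observations}
    personas = {o.persona for o in observations}
    return len(conversations) >= MIN_CONVERSATIONS and len(personas) >= MIN_PERSONAS


evidence = [
    Observation("c-01", "bargain hunter"),
    Observation("c-07", "bargain hunter"),
    Observation("c-12", "prompt injector"),
]
print(meets_critical_bar(evidence))      # three conversations, two personas
print(meets_critical_bar(evidence[:2]))  # two conversations, one persona
```

Counting distinct conversation IDs and distinct personas, rather than raw observations, keeps one persona repeating the same trick from clearing the bar on its own.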
Critical failure: off-policy, unsafe, or leaks. Brand risk: off-tone or over-promising. Watchlist: directional signal needing more sessions before you act.
A frustrated returner and a prompt injector defeat your bot in different ways. Stratifying by persona surfaces friction that aggregate logs flatten, and tells you who triggered it.
Every finding carries its denominator: "4 of 30 conversations". Never "13% failure rate" with no base. Watchlist findings are explicitly marked directional, not conclusive.
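A small sketch of that reporting rule, assuming a hypothetical `format_finding_rate` helper: the count and base always appear together, and a percentage is shown only alongside them, never on its own.

```python
def format_finding_rate(failures: int, total: int) -> str:
    """Render a finding's frequency with its denominator always visible."""
    if total <= 0:
        raise ValueError("total must be a positive number of conversations")
    if not 0 <= failures <= total:
        raise ValueError("failures must be between 0 and total")
    # The percentage is secondary and only ever shown next to its base.
    return f"{failures} of {total} conversations ({failures / total:.0%})"


label = format_finding_rate(4, 30)
print(label)
```

Rejecting a zero or negative base up front means a rate can never be printed without the number of conversations it was measured against.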
We identify where the bot fails and exactly how to reproduce it, with a ranked recommendation. Editing the prompt, guardrails, or retrieval is yours to own.
A structured PDF research document. Not a slide deck, not a dashboard export. Shareable with your team or your chatbot vendor as-is.
Executive summary, tiered findings, methodology notes, and appendix.
Every conversation, end to end, across all personas and both surfaces.
The offending message highlighted in context. Shown, not just described.
Every adversarial attempt and its outcome: injection, jailbreak, extraction, held or breached.
How consistently the bot held tone, accuracy, and policy across the run.
Where the bot should have handed off to a human, and whether it did.
Where the bot strayed toward regulated medical, legal, or financial advice.
Every finding paired with a fix, ranked by estimated impact and effort. Prioritised, not listed.
Not a replacement for guardrails or eval tooling, but the thing that tells you whether they're working, against real and hostile customers, in your live chat.
| | UserSimulations | Reading transcripts | Generic LLM eval | Red-team consultancy |
|---|---|---|---|---|
| Time to insight | 5–7 days | Continuous, no synthesis | Hours, on benchmarks | 3–6 weeks |
| Cost | £1,500 per audit | Analyst time | £1,000+ / month | £15,000–£40,000 |
| Tests your live chat | Yes, web + in-app | Yes, after the fact | No, model outputs only | Usually |
| Behaves like real & hostile customers | Yes, full persona roster | No | No, fixed prompts | Yes |
| Knows your policies | Yes, provided per audit | n/a | No | Yes |
| Evidence type | Transcripts + cross-validation | Raw logs | Benchmark scores | Findings deck |
| Repeatability | Re-run after every release | Continuous, undirected | Continuous | Expensive to repeat |
| Best for | Pattern-level failure diagnosis | Spot checks | Regression on known prompts | One-off deep security review |
CX and support leaders who've shipped a customer-facing chatbot and can't possibly read every conversation it has, but are accountable for all of them.
Your chat URL (live or staging), your refund, returns and shipping policies, and 2–3 customer types.
5–7 days from kickoff. Share it with your team or hand it straight to your chatbot vendor.
No. We run the conversations your real customers run, plus the ones a bad actor will. Every failure is reproducible, cross-validated across at least two personas, and tiered by severity. A clean audit is a genuine result. We'll tell you if your bot holds the line.
No. We use your chat exactly like a customer does, through the live interface. No API keys, no integration, no access to your prompt or infrastructure. We surface what's reachable through conversation, which is precisely what a real attacker has.
Eval tools score model outputs on benchmark prompts in isolation. We open your live chat, behave like your real and hostile customers, and test what your bot commits to in your brand's voice with your real policies in play. It's the difference between unit tests and a penetration test.
No. Everything we do is conversation, the same surface any customer can reach. We don't exploit infrastructure, don't touch data, and don't leave anything behind. We document what's possible through chat so you can close it.
Your chat URL, your key policies (refunds, returns, shipping, anything the bot speaks to), and your top 2–3 customer types, or a brief description we can infer personas from. That's it. The report lands in 5–7 days.
Yes. The Northwind Outdoor sample report shows the exact format, evidence standard, and finding structure. Read the evidence standard and the fix recommendations; if it's right for what you need, get in touch.
Send us your chat URL and a brief description of your brand. We'll follow up within one business day.
Takes 3 minutes to request. No prep work, no model access, no developer involvement. Commission by email, receive by email, act with your team.