
Why most AI pilots fail (and how to be the exception)

04 March 2026 By LiverpoolAI Editorial 4 min read

Most pilots stall not because the model is weak, but because nobody decided what success looked like first.

Most AI pilots fail to make it to production. We have walked into enough of those projects at the post-mortem stage to see the patterns clearly, and they are usually the same ones rather than new ones.

This piece is the version of the conversation we have with Liverpool clients who have already had one or two failed AI pilots and want to know what to do differently. Almost everything below has been learned the hard way by someone — not always us.

Pattern 1: The unscoped chat window

This is the single most common failure mode in the city right now. Someone wires an LLM to "all our internal documents" via a vector store and puts a chat window in front of it. There is no eval set, no source citation, no refusal behaviour. The system works for the first thirty queries and then hallucinates badly in front of a chief executive or a regulator. Trust is permanently damaged. The project gets shelved for eighteen months and the technology gets blamed for what was a project-management failure.

What works instead: scope the assistant to a defined corpus, build an eval set against real queries, require source citations on every answer, and design explicit refusal behaviour for out-of-scope questions. We covered the technical side of this in RAG vs fine-tuning.
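To make that concrete, here is a minimal sketch of what an eval set with citation and refusal checks might look like. Everything in it is an assumption for illustration: the Answer and EvalCase shapes, the document ID, and the toy_assistant stand-in are hypothetical, not a description of any particular product or of our own tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Answer:
    text: str
    sources: List[str] = field(default_factory=list)  # IDs of cited documents
    refused: bool = False


@dataclass
class EvalCase:
    query: str
    in_scope: bool
    expected_source: Optional[str] = None  # document that should be cited


def score_case(case: EvalCase, answer: Answer) -> bool:
    """Pass/fail for one case: refuse out-of-scope, cite everything else."""
    if not case.in_scope:
        # Out-of-scope questions must be refused, not answered.
        return answer.refused
    if answer.refused or not answer.sources:
        # In-scope questions need an answer, and the answer needs a citation.
        return False
    return case.expected_source is None or case.expected_source in answer.sources


def pass_rate(cases: List[EvalCase], assistant: Callable[[str], Answer]) -> float:
    results = [score_case(c, assistant(c.query)) for c in cases]
    return sum(results) / len(results)


# Hypothetical stand-in for the real assistant, for illustration only.
def toy_assistant(query: str) -> Answer:
    if "holiday" in query.lower():
        return Answer("Staff get 25 days.", sources=["hr-policy-2025"])
    return Answer("I can only answer HR policy questions.", refused=True)


# Hypothetical eval cases; a real set would be drawn from real user queries.
CASES = [
    EvalCase("How much holiday do staff get?", True, "hr-policy-2025"),
    EvalCase("What is our Q3 revenue forecast?", False),
    EvalCase("What is the sick pay policy?", True, "hr-policy-2025"),
]

rate = pass_rate(CASES, toy_assistant)
print(f"Pass rate: {rate:.0%}")
```

The third case fails on purpose: the toy assistant refuses a question it should have answered, which is exactly the kind of gap an eval set is meant to surface before a chief executive does.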

Pattern 2: The vendor demo trap

A model vendor or platform reseller runs an impressive demo on curated data. The buyer signs an annual contract on the strength of the demo. Six months in, the system performs at half the demo level on real data. There is no internal team capable of debugging it. The vendor is contractually unreachable for anything outside their roadmap.

What works instead: run the proof of concept on your data, with your eval set, before any commitment. If the vendor will not run the demo on your data, the demo does not count. We covered this in our buyer's checklist.

Pattern 3: No metric, no measurement

The project starts with a vague goal — "explore AI use cases", "transform our customer experience", "be more efficient". There is no single named metric the project has to move. Six months in, no one can say whether the project worked because no one defined what working meant.

What works instead: pick a metric on day one. Time saved per case. Tickets deflected per week. Accuracy at threshold. Hours reclaimed. Measure the baseline before the project starts; measure again after launch. We covered the scoping discipline in how to scope an AI project in a week.
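The before-and-after comparison is simple arithmetic, and a few lines are enough to sketch it. The figures below are made up for illustration (minutes of handling time per case) and the function name is our own; a real project would also want enough cases to rule out noise.

```python
from statistics import mean
from typing import List


def relative_change(baseline: List[float], after: List[float]) -> float:
    """Fractional change in the mean of the metric; negative means it fell."""
    base = mean(baseline)
    return (mean(after) - base) / base


# Hypothetical figures: minutes of handling time per case.
baseline_minutes = [42, 38, 45, 40, 35]      # measured before the project starts
post_launch_minutes = [30, 28, 33, 31, 28]   # measured after launch

change = relative_change(baseline_minutes, post_launch_minutes)
print(f"Handling time changed by {change:.0%}")
```

The point of the sketch is the order of operations: the baseline list has to exist before the project does, or there is nothing to compare against.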

Pattern 4: The prototype that never crosses the production line

A working proof-of-concept gets demoed. Everyone agrees it is impressive. The project then dies in the gap between "working demo" and "live system the business depends on". The reason is almost always the same — the prototype was built without thinking about evaluation, observability, cost or integration, and the work to retrofit all four exceeds the will to do it.

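What works instead: build the prototype as a thin version of the production system, with evaluation, observability, cost tracking and integration at least stubbed in, even if the code itself is throwaway. Plan the production work alongside the prototype. Assume from day one that you will have to ship.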
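One way to keep observability in the prototype from the start is to wrap the model call so every request is recorded. This is a sketch under assumptions: InstrumentedModel and CallRecord are hypothetical names, the model is any function that takes a prompt and returns text, and character counts stand in for whatever token and cost figures your provider reports.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CallRecord:
    prompt_chars: int
    response_chars: int
    latency_s: float


class InstrumentedModel:
    """Wraps any prompt -> text callable and records one CallRecord per call."""

    def __init__(self, model: Callable[[str], str]) -> None:
        self.model = model
        self.records: List[CallRecord] = []

    def __call__(self, prompt: str) -> str:
        start = time.perf_counter()
        response = self.model(prompt)
        elapsed = time.perf_counter() - start
        self.records.append(CallRecord(len(prompt), len(response), elapsed))
        return response


# Hypothetical stand-in model: upper-cases the prompt.
model = InstrumentedModel(lambda prompt: prompt.upper())
out = model("hello")
print(out, len(model.records))
```

Because the wrapper has the same call shape as the thing it wraps, swapping the stand-in for a real model later does not change the code around it.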

Pattern 5: The agent that should have been a rule

A surprising number of projects we are called in to rescue have been built as "agents" when they should have been deterministic workflows with three LLM calls inside them. Agency is expensive, both in tokens and in failure surface. Use it when you actually need it (multi-step, branching, context-dependent decisions); otherwise stick to a deterministic workflow or rule engine.

The diagnostic question: if you can describe the workflow as a flowchart with a fixed handful of AI calls in it (two or three, say), you do not need an agent. Build the flowchart.
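As an illustration, a flowchart of that kind can be written as ordinary code with the model calls at fixed points. The sketch below assumes a hypothetical support-ticket workflow; handle_ticket, the label set and fake_llm are all invented for the example, and llm is any function that takes a prompt and returns text.

```python
from typing import Callable, Dict, Optional

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def handle_ticket(ticket: str, llm: LLM) -> Dict[str, Optional[str]]:
    # AI call 1: classify the ticket into a fixed label set.
    label = llm(f"Classify as billing, technical or other: {ticket}").strip().lower()

    # Plain branch, no model involved: anything unrecognised goes to a person.
    if label not in {"billing", "technical"}:
        return {"route": "human", "reply": None}

    # AI call 2: draft a reply for the chosen queue.
    reply = llm(f"Draft a short {label} support reply to: {ticket}")
    return {"route": label, "reply": reply}


# Hypothetical stand-in for a real model, so the sketch runs on its own.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("Classify"):
        return "billing" if "invoice" in prompt else "other"
    return "Thanks, we are looking into your invoice."


result = handle_ticket("My invoice is wrong", fake_llm)
fallback = handle_ticket("Where is your office?", fake_llm)
print(result["route"], fallback["route"])
```

The model never chooses what happens next; the code does. That keeps the number of calls, the cost and the failure surface fixed and easy to test.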

Pattern 6: Tooling chosen before the problem is understood

The technology decision happens before the problem decision. A team buys a vector database, signs up for a model provider, picks an orchestration framework — then goes looking for a problem the stack can solve. The work that follows is bent to fit the tooling.

What works instead: pick the smallest, most-measurable problem first. Choose the tooling that fits the problem, not the other way around. For most problems we ship, the tooling is boring and the technique is what matters.

Pattern 7: No human in the loop where one was needed

The system makes decisions about individual customers or cases without a human review step. The reviewer was an afterthought, or the project owner thought the model was confident enough. The first time the model is wrong about a real customer in a way that has real consequences, the project is over.

What works instead: design the human-in-the-loop boundary on day one. Identify what the model decides, what the human reviews, where escalations go, and how often the eval set is re-run in production. This is non-negotiable for regulated work and almost always the right choice for non-regulated work too.
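The boundary itself can be a few lines of explicit routing logic. The sketch below is illustrative only: the threshold value, the label names and the assumption that the model exposes a usable confidence score are all hypothetical, and in practice the threshold should come out of eval results rather than a guess.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    case_id: str
    label: str
    confidence: float  # model score in [0, 1]; assumed to be available


AUTO_THRESHOLD = 0.9  # hypothetical value; derive it from the eval set
HIGH_IMPACT = {"reject_claim", "close_account"}  # hypothetical labels, always reviewed


def route(decision: Decision) -> str:
    """Return 'auto' or 'human_review' for a single model decision."""
    if decision.label in HIGH_IMPACT:
        # Consequential outcomes go to a person regardless of confidence.
        return "human_review"
    if decision.confidence < AUTO_THRESHOLD:
        return "human_review"
    return "auto"


decisions = [
    Decision("c1", "approve_refund", 0.97),
    Decision("c2", "approve_refund", 0.62),
    Decision("c3", "close_account", 0.99),
]
routes = [route(d) for d in decisions]
print(routes)
```

Writing the rule down this plainly is the point: anyone on the project can read which decisions the model makes alone and which ones a person sees first.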

How to be the exception

If we had to compress this into a single rule, it would be: pick one named metric, scope tightly to one workflow, build the smallest version that ships, and design the human in the loop from day one. Every pattern above is some flavour of failing to do one of those things.

The good news is that AI pilots done in this style ship reliably. We have shipped systems in six weeks for clients in Liverpool, the North West and across the UK that move the metric they were built to move. The discipline is doable; it just is not glamorous.

For the broader picture, read our field guide to AI in Liverpool, 2026. If you would like an honest view on whether the pilot you are planning is at risk of one of the patterns above, book a 30-minute discovery call.

Related reading

The state of AI in Liverpool, 2026: broader field guide to where AI is shipping (and failing) in the city.
Hiring an AI consultancy in Liverpool: a buyer's checklist — twelve questions to ask before any first call.
How to scope an AI project in a week — the scoping discipline that prevents the failure modes in this post.

More from insights

The Liverpool AI ecosystem in 2026: a who's who

AI for retail in Liverpool: what is working in 2026

AI for financial services in Liverpool: practical use cases for 2026
