Generality Is The Enemy Of Precision: Why Enterprise AI Is Stuck In Pilot Purgatory

James Duez

Walk into any large financial institution today and you’ll find the same scene: dozens, sometimes hundreds, of AI pilots and almost nothing in production. The business case is obvious, the ROI is overwhelming and the technology works in the demo. And yet the projects stall at the same gate, every time, when someone in risk or compliance asks a deceptively simple question, “Show me how it made that decision”. If the answer is “we can’t”, the project doesn’t graduate from proof of concept and quietly dies.

In my experience there are really only two paths out of that meeting. The first is the quiet death I’ve just described, and it accounts for the overwhelming majority of stalled initiatives. The second is that the project limps forward by bolting a human onto the end of the process, on the basis that if a person reviews every output then the decision is, technically, a human one. In Europe this approach has the comfort of regulation behind it, because Article 14 of the EU AI Act explicitly requires effective human oversight of high-risk systems, and similar expectations are emerging from supervisors in most major markets. It sounds responsible. The problem is that it rests on an assumption about human beings that the evidence simply doesn’t support, and I’ll come back to why.

From use cases to architectures

It’s worth understanding how we got here. Two years ago most enterprises were busy switching AI experiments off, reining in the hundreds of ungoverned use cases that bloomed when generative AI first arrived. What has emerged since is more interesting. Rather than approving individual use cases one committee meeting at a time, the leading institutions have started pre-approving architectures. If you can get the architecture right, meaning you know where the probabilistic components sit, where the deterministic controls sit and where the audit trail comes from, then you can repeat that pattern across hundreds of use cases. If you get it wrong, every project becomes a fresh fight with the governance committee.
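To make the pattern concrete, here is a minimal Python sketch of one way such an architecture could be laid out. Everything in it is an illustrative assumption rather than any institution's actual design: the `extract_facts` stub stands in for a probabilistic component, the `RULES` list and its thresholds are invented, and the lending example is hypothetical. The point is only the shape: the model proposes facts, explicit rules decide, and every step lands in an audit trail.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Decision:
    outcome: str
    audit_trail: list[str] = field(default_factory=list)


def extract_facts(document: str) -> dict[str, float]:
    """Stand-in for the probabilistic component (for example, an LLM extractor).

    Hypothetical: returns fixed values so the sketch runs offline.
    """
    return {"income": 52000.0, "requested_loan": 180000.0}


# Deterministic controls: explicit, ordered rules.
# Each rule is (name, predicate over the facts, outcome if the rule fires).
# The thresholds are invented for illustration.
Rule = tuple[str, Callable[[dict[str, float]], bool], str]

RULES: list[Rule] = [
    ("loan_to_income_above_4x", lambda f: f["requested_loan"] > 4 * f["income"], "refer"),
    ("income_below_minimum", lambda f: f["income"] < 20000, "decline"),
]


def decide(facts: dict[str, float], rules: list[Rule] = RULES) -> Decision:
    """Apply the rules in order and record every fact and rule evaluation."""
    decision = Decision(outcome="approve")
    for key, value in sorted(facts.items()):
        decision.audit_trail.append(f"fact {key}={value}")
    for name, predicate, outcome in rules:
        fired = predicate(facts)
        decision.audit_trail.append(f"rule {name} fired={fired}")
        if fired:
            decision.outcome = outcome
            break
    else:
        # Runs only when no rule fired, so the default outcome stands.
        decision.audit_trail.append("no rule fired; default outcome applied")
    return decision


if __name__ == "__main__":
    result = decide(extract_facts("loan application text"))
    print(result.outcome)
    for line in result.audit_trail:
        print(line)
```

Because the decision logic lives in the rule list rather than in the model, the same skeleton could in principle be reused across use cases by swapping the extractor and the rules, which is what makes pre-approving the architecture plausible.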

This is a profound shift, and it cuts against the narrative coming out of the frontier labs, which amounts to a promise that you shouldn’t worry about today’s shortcomings because a better model is coming next month. Enterprises have stopped waiting for the risks to evaporate. They have been through the trough of disillusionment and come out the other side with a pragmatic conclusion: for the meaningful proportion of use cases where precision, determinism and explainability are non-negotiable, the answer isn’t a bigger model, it’s a different architecture.

Humans are terrible guardrails

Which brings me back to the second path, the human in the loop. Automation bias is one of the deepest cognitive biases we have, and it doesn’t take long to assert itself. Put a person in front of a stream of AI-generated outputs and ask them to challenge each one, and within weeks they stop reading properly. They get tired, they get comfortable, and they approve. Worse, the very skills they would need in order to challenge the machine begin to decay through disuse, so automation bias slides quietly into de-skilling. A human checkbox at the end of a pipeline doesn’t transform an AI output into a human decision; it launders accountability while judgement atrophies.

This matters enormously for the agentic wave, because agentic AI, properly understood, is not a product category called “AI agents” but AI with genuine agency: the ability to take action autonomously. Autonomy at scale and human review of every output are mathematically incompatible, because review capacity grows only as fast as you can hire while decision volume does not wait. You cannot have straight-through processing and a person reading everything, so something else has to provide the guarantee, and that something has to be engineered into the stack in the form of deterministic logic, explicit policy and causal audit trails, rather than bolted on as a tired human at the end of the process.
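The arithmetic behind that incompatibility is easy to sketch. The figures below are purely hypothetical assumptions chosen for illustration (two minutes per review, six productive hours per reviewer per day), not measurements from any institution, and the function name `reviewers_needed` is my own.

```python
import math


def reviewers_needed(
    decisions_per_day: int,
    seconds_per_review: float,
    productive_hours_per_day: float = 6.0,
) -> int:
    """Headcount required if a person genuinely reads every output.

    All inputs are illustrative assumptions, not empirical figures.
    """
    reviews_per_person = productive_hours_per_day * 3600 / seconds_per_review
    return math.ceil(decisions_per_day / reviews_per_person)


if __name__ == "__main__":
    # A pilot-sized workload versus a production-sized one (hypothetical volumes).
    print(reviewers_needed(500, seconds_per_review=120))        # 3
    print(reviewers_needed(1_000_000, seconds_per_review=120))  # 5556
```

Under these assumptions a pilot needs a handful of reviewers while a production workload needs thousands, and that is before accounting for the attention decay described above.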

I believe regulators broadly underestimate this. Article 14 was written with the right intent, but the implicit assumption running through much supervisory thinking, in Europe and elsewhere, is that human review is a sufficient control. The institutions deploying at any real volume already know that it isn’t.

The systemic risk nobody is pricing

There is also a second-order problem brewing. When everyone in a market uses the same handful of foundation models, trained on substantially the same data, the only thing differentiating one institution from another is the context and institutional knowledge they bring to those models. Strip that away and you get convergence: similar signals, similar decisions and increasingly synchronised behaviour. Humans have historically been the market’s shock absorbers, slow and inconsistent but gloriously diverse in their judgement, and replacing them with a monoculture of models builds a system that is brilliant right up until it encounters something its training data never contained. Machine learning is predicated on the assumption that the future will resemble the past, and the most expensive moments in financial history are precisely the ones where it didn’t.

Layer on concentration risk, with a handful of model providers whose demand growth outstrips their supply of compute, and you have operational dependencies that would never pass muster if we called them what they are: single points of failure in the supply chain of critical financial infrastructure. One pragmatic principle deserves much wider adoption, which is to cut the tether at runtime. Use large models where they genuinely excel, in the build process, in drafting and in synthesis, but don’t allow the uptime of a production decision system to depend on someone else’s GPU availability.
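One way to picture cutting the tether is to treat the rules as a frozen artifact that is produced at build time and evaluated locally at runtime. The sketch below assumes a hypothetical JSON rules artifact (which a large model might have helped draft before people reviewed and approved it) and a hypothetical policy field, `max_loan_to_income`. The runtime path makes no network call, and it refuses to load anything that does not match the approved hash.

```python
import hashlib
import json

# Build time: the approved rules artifact and its hash are fixed before deployment.
# The content and the threshold are invented for illustration.
RULES_JSON = json.dumps(
    {"version": "v1", "max_loan_to_income": 4.0},
    sort_keys=True,
)
APPROVED_SHA256 = hashlib.sha256(RULES_JSON.encode("utf-8")).hexdigest()


def load_rules(artifact: str, approved_sha256: str) -> dict:
    """Load the rules only if they match the version that was approved."""
    digest = hashlib.sha256(artifact.encode("utf-8")).hexdigest()
    if digest != approved_sha256:
        raise ValueError("rules artifact does not match the approved version")
    return json.loads(artifact)


def within_policy(rules: dict, income: float, requested_loan: float) -> bool:
    """Runtime check: pure local computation, no model provider involved."""
    return requested_loan <= rules["max_loan_to_income"] * income


if __name__ == "__main__":
    rules = load_rules(RULES_JSON, APPROVED_SHA256)
    print(within_policy(rules, income=50000, requested_loan=200000))
    print(within_policy(rules, income=50000, requested_loan=200001))
```

In this arrangement an outage at a model provider can slow down how quickly new rules get drafted, but it cannot stop the production system from making decisions.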

Generality is the enemy of precision

The deeper issue is a mindset we imported from the consumer internet. The original machine learning successes paired extremely rich data with extremely simple decisions, such as which advert to show you next. We then spent a decade porting that “data is the answer” mindset into domains with far worse data and vastly more complex decisions, and we are now compounding the error with general-purpose models trained, to all intents and purposes, on everything.

A system designed to be good at everything cannot be precise at your thing. Regulated decisions don’t live in the statistical haze of internet text; they live in regulation, policy, procedure and the hard-won institutional knowledge sitting in the heads of experienced people. The organisations that win the next phase won’t be the ones with the biggest model bill, but the ones that treat their own knowledge as a first-class citizen in the AI stack, explicitly represented, reasoned over and auditable end to end, with probabilistic components deployed where flexibility helps and deterministic components deployed where guarantees are required.

That hybrid approach, whether you call it neurosymbolic, governed AI or simply good engineering, is what gets agentic AI out of pilot purgatory. The future of enterprise AI is not a larger language model. It is an architecture worthy of the decisions we are asking it to make.
