
How To Build Agentic AI That Ships

By Brian Friedman


This is the first of two parts.

Recent reports of AI doom highlight a 95% failure rate in enterprise rollouts. Given the nascent state of AI, high failure rates are expected, but some common-sense approaches can help align your efforts with the 5% of winners in this space.

Defining Failure

Before we dig in, let’s address what we mean by “failure” in the context of these initiatives. The technology itself isn’t failing at a rate of 95%; rather, it is the integration of new technology with existing people and processes that is proving most challenging. As a result, only 5% are achieving their target of new revenue by leveraging AI. A one in 20 chance seems rough, but there’s likely a lot of learning falling out of the 19 shots that missed the target; these are early experiments that may produce new product categories. Technical complexities aside, the C-suite tends to raise the fail-flag on a project if:

It’s not generating the savings projected.
It’s not generating the new revenues projected.

The two scenarios above are not new to AI. Remember Kubernetes? How about refactoring to microservices? What about tossing everything into “the cloud”? Failure in our space is common and rarely comes down to strictly technology-related outcomes. Typically, it stems from underestimating collaborative risk, which directly inflates any cost estimate. Failure in our space is often the first step; we learn by failing. In the right environment, failing first is the first sign of being in the lead.

Why Things Fail

AI’s failure modes are not so different from those of other new technology initiatives. First, we assume the technology is “done,” that it already holds an answer for whatever problem we throw at it.

Second, just as a fish has no idea it’s wet, we’re bathed in complexity specific to our organization’s way of doing things. And, much like a fish, we don’t understand the criticality of our environment until we’re pulled out of it.

Third, like the fish, we are drawn to shiny objects. For the fish, chasing the shiny object burns calories and the meal has no nutritional value. For us, picking a shiny object as a use case can end the same way; we might reach the finish line with nothing of real value to show for our efforts.

And finally, any time we introduce new technology, we introduce new problems to be solved. Model deployments, access tokens, policies, permissions — all these things are needed to operationalize a solution. Whether we choose to build or to buy these management capabilities, they’re not optional. They’ll have to exist.

Ninety-five percent of AI initiatives should be expected to fail as long as we ignore these pitfalls:

Models are generic.
Enterprises are unique.
Ideal use cases aren’t flashy.
Coordination across teams is hard.

Models Are Built for the Masses

Asking a large language model (LLM) how to dress for the weather is much more effective than asking that same LLM how to dress for work. If it’s cold outside, generally accepted advice is to dress warmly. And, while an LLM could venture a solid guess at what you might wear to work, the “how to dress for work” question is answered very differently for a lab technician than for, let’s say, a firefighter. If an LLM responds to either of these personas with a recommendation of “business casual,” it has officially put their safety at risk.

Models are not built to snap into the workplace in the same way they do daily life. The workplace is all about context, and many (if not most) workplaces make a concerted effort to prevent the outside world from knowing anything about their inner workings. The models are being trained on the outside world. Their perspective on corporate culture is formed by pop culture, academia and history.

Enterprises Are Snowflakes

When you get a new job at a new company, everything feels very different. We take for granted how all-encompassing that context switch is; in some cases, the ambient feeling of “different” follows you even after you’ve gone home for the day. You’re part of a new culture, there’s new branding everywhere and maybe some decisions you’re making at home now can be attributed to things you’ve picked up at the office.

We know intuitively that each workplace is unique, and we know that models are built for the masses, yet we continue to apply the tools built for the masses in support of our unique needs. High failure rates should not be shocking in this framing if we haven’t budgeted the time and expertise needed for customization.

Model Tuning Is a Myth

Many have been told that if they find a model that works well for their needs, they can fine-tune the model on their data. This may involve additional training to augment what the model already knows with proprietary data about the company. The problem is that training a model is not an additive process; it is a destructive one. Fine-tuning adjusts weights throughout the network, and knowledge the model held before can be overwritten along the way (researchers call this catastrophic forgetting). We’re not adding to the model; we’re fundamentally rewiring it. Things it knew for certain yesterday might be questionable today, and vice versa.

This is why retrieval-augmented generation (RAG) underpins all consumer-accessible customization solutions for AI; it’s the only solution we have that is not destructive to the underlying model(s) and doesn’t require PhD-level knowledge to implement (distillation, for example, is powerful but impractical for most). Let’s also remember that even “small” open models have around 7 billion tunable parameters, and the leading models have tens to hundreds of billions. Given the current state of the art, model tuning is a dead end for businesses whose mission is anything other than data science.
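To make the contrast concrete, here is a minimal sketch of the RAG pattern: proprietary context is retrieved at question time and placed in the prompt, so the model’s weights are never touched. The keyword-overlap retriever and the example documents are purely illustrative; production systems typically use vector embeddings and a vector store, and the assembled prompt would go to whatever model your organization has approved.

```python
# Minimal illustration of retrieval-augmented generation (RAG):
# internal context is retrieved at query time and placed in the prompt,
# so the underlying model's weights are never modified.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query.
    Real systems use vector embeddings and a vector database instead."""
    q_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Stuff the retrieved context into the prompt sent to the model."""
    context = retrieve(query, documents)
    return (
        "Answer using only the company context below.\n\n"
        "Context:\n" + "\n".join(f"- {c}" for c in context) +
        f"\n\nQuestion: {query}"
    )

# Proprietary knowledge the base model was never trained on.
docs = [
    "Lab technicians must wear flame-resistant coats and safety glasses on the floor.",
    "Firefighters wear station uniforms at the firehouse and turnout gear on calls.",
    "Office staff dress code is business casual Monday through Thursday.",
]

prompt = build_prompt("How should a lab technician dress for work?", docs)
# Send `prompt` to whichever approved model endpoint you use; no vendor API is assumed here.
print(prompt)
```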

The High-Value Use Cases Aren’t Always Exciting

AI’s biggest impact can be made most readily on back-office concerns, helping us slowly chip away at inefficiencies that have haunted us for decades. With each inefficiency we can mitigate, we add more wind to our sails and the effect, over time, is cumulative. The biggest problem with the focus on back-office inefficiencies is they don’t make for good storytelling. The shinier, more public-facing uses of AI get all the attention. If you choose the public-facing path, be sure to solve an actual problem.

AI in itself is an entirely new collection of problems to be solved. Make sure your initial adoption is targeted at eliminating something taking up your time, as more is about to be taken by your AI rollout. Much will depend on how far ahead (or behind) you are on your digital transformation efforts, if any are in play. Companies with closer-to-modern systems are having a much easier time integrating AI with their internal data and tooling.

All the AI Cats Must Be Herded

The moving parts associated with AI go far beyond models and GPUs. Just to scratch the surface, we can start asking questions like: Which models can be used within your organization? Or within a given business unit? What kinds of internal data can be sent to these models? Is there any masking or obfuscation needed before the data heads downstream? What about throttling? Each request to a model has a cost. Should we trust our developers to configure throttling correctly? What if a team needs to change the model backing their solution? Do we have all the things we need in place to answer these basic questions? If not, we’re going to have to acquire those capabilities either through building or through buying.

According to MIT, using a vendor-provided solution for managing the cross-cutting concerns of an AI ecosystem more than doubles the odds of getting an AI-integrated solution deployed and operationalized. Given the complexity of the ecosystem, this should not be surprising. An independent study commissioned by Broadcom’s Tanzu division had similar findings: 82% of the 255 CIOs surveyed said they believe AI app platforms are necessary for scaling AI, and even more said they plan to invest in one.

Platform engineering is the organizational tool for solving these problems and has gained attention as a critical discipline for enterprise IT organizations. Your platform team can be a critical enabler of new product categories. Your AI policy leaders, likely data scientists by training, will need the developer-enablement expertise found in a platform engineering team to create reusable patterns, robust documentation and clear policies, ideally expressed as policy as code.
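As a hypothetical illustration of what policy as code might look like, the sketch below encodes two of the questions above, which models a business unit may call and whether data must be masked before it heads downstream, as a small, testable check. Every name in it (business units, model identifiers, the masking rule) is invented for the example and not taken from any particular product.

```python
# Hypothetical policy-as-code sketch: a platform team encodes which models each
# business unit may call and whether outbound data must be masked first.
# All unit names, model names and rules here are illustrative only.
import re

POLICY = {
    "claims":    {"allowed_models": {"internal-llm-small"}, "mask_pii": True},
    "marketing": {"allowed_models": {"internal-llm-small", "vendor-llm-large"}, "mask_pii": False},
}

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # crude example of a PII pattern

def check_request(unit: str, model: str, payload: str) -> str:
    """Reject disallowed model calls and mask obvious PII before data leaves."""
    rules = POLICY.get(unit)
    if rules is None or model not in rules["allowed_models"]:
        raise PermissionError(f"{unit!r} is not permitted to call {model!r}")
    return EMAIL.sub("[REDACTED]", payload) if rules["mask_pii"] else payload

print(check_request("claims", "internal-llm-small",
                    "Customer jane@example.com filed a claim."))
```

The point is less the specific checks than where they live: in versioned, reviewable code owned by the platform team rather than in each developer’s head.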

Back-Office vs. Front-of-House Example

Let’s say, for example, you own a national auto-shop chain. Customers come for all kinds of reasons, but every customer schedules an appointment to drop off their car, and every customer responds to an oral survey from a service rep when they pick up their car. The survey response is recorded in the service ticket’s notes field upon closure.

Scheduling requires two full-time employees per store who never interact with customers. Service reps handle drop-off and pick-up with as many as 10 reps per location. Every week, the store manager at each site spends about four hours turning customer pick-up survey responses into customer satisfaction metrics and trends for corporate management.

If we boil the above down to two ideas, we land in a place familiar to many: one idea so ambitious it could be its own business, and another so small in scope and practical that the technology will be a hidden implementation detail.

Reduce headcount at every store by two by using AI to handle appointment scheduling.
Save a specific user (store managers) a measurable number of hours per week by using AI to do a specific task (convert survey responses to reportable data).

Both options save money. The “back of the napkin” math shows option one offers roughly 20 times the savings of option two: two full-time salaries per store dwarf four manager-hours per week. And 20x is a hard number to ignore when expressed in dollars.

Unfortunately, the back of the napkin isn’t big enough to capture the collaborative risk of tackling a domain as core to daily operations as scheduling. By collaborative risk, I mean the number of teams and personas affected by, or supporting, the effort. Customers, store managers, service techs, mechanics, service reps, IT: all of these folks, some of them on a daily basis, will be affected by a change in the scheduling process. Employees will need training, and some of the systems that interact with scheduling may need modification; those systems also count as collaborators.

Another red flag on option one, scheduling, is that scheduling in itself is a workflow. Workflows, as of the writing of this article, are not a skill in the agentic AI toolkit. In other words, we’d still have to orchestrate or choreograph scheduling, even if we had agents handling the individual steps. This would certainly not be the desired outcome if we were investing millions in AI tech to handle a workflow-heavy domain.

The final and most compelling warning sign associated with option one is that failure is not an option. If we can’t schedule, we can’t make money. This sounds like an “all or nothing” proposition where the only thing more monumental than success would be the crushing weight of failure should things go sideways. It’s enough to end our fictional national auto-shop chain, or at the very least, to waste millions in valuable company time and treasure.

All this complexity is very expensive and, when combined with mission criticality, very, very risky. With an overly optimistic estimate of 12 to 18 months for the project, we won’t see a dime back for at least a year. The longer the project, the more opportunities for delay, so we should plan on at least 24 months, given we’re adopting an entirely new technology.

We Need a New Napkin

Despite the mouthwatering savings associated with replacing scheduling, and the career boost sure to accompany such a feat, a new napkin that includes the cost of delivery and quantifies the risk of failure reveals that option two, while offering only 5% of the savings of option one, begins saving money in just 12 to 16 weeks. We calculate a rudimentary risk-of-failure score by multiplying the number of collaborators by the length of the project in months:

Risk = collaborators * months

Option two has only one collaborator; option one has at least six. Option two runs for 12 to 16 weeks (call it four months); option one is projected to last at least 18 months. If you’re following along on a napkin of your own, you should see that option one has a risk score of 108, while option two has a score of four. It’s a primitive tool, but this formula shows that option one carries roughly 27 times the risk, or, put another way, option two carries about 96% less risk. We can validate this by noting that if option two fails, we lose very little on the bet, versus the millions risked on option one. If option two succeeds, we save over 200 hours per year per store.
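For anyone who prefers their napkin executable, the same arithmetic looks like this; the collaborator counts and durations are the ones from the scenario above.

```python
# Napkin risk score from the article: risk = collaborators * months.
def risk(collaborators: int, months: int) -> int:
    return collaborators * months

option_one = risk(collaborators=6, months=18)  # replace scheduling outright
option_two = risk(collaborators=1, months=4)   # automate survey reporting (12-16 weeks)

print(option_one, option_two)                  # 108 4
print(f"{1 - option_two / option_one:.0%}")    # option two carries ~96% less risk
```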

Instead of a moonshot ripe for failure, we should focus on freeing up 10% of every store manager’s week by doing something AI is already good at, something that can be implemented in weeks, not years. We could use a language model to convert customer survey responses into reportable data. That is a project that might take 12 weeks of high-caliber workforce effort and will free up four hours at every store, every week. The small efficiency gains stack up, and someday, headcount might adjust to match. But the key is shipping something measurable, not aiming for wild transformations on our first deployment of a new technology.
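To give a feel for how small that project really is, here is an illustrative sketch of the core step: turning one free-text survey note into a structured record. It assumes the OpenAI Python client purely as an example; in practice you would call whichever model and endpoint your platform team has approved, behind the policy checks discussed earlier, and the field names are just one reasonable choice.

```python
# Illustrative sketch: convert a free-text pick-up survey note into reportable data.
# The model choice and field names are assumptions for the example, not prescriptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract a JSON object with keys: sentiment (positive|neutral|negative), "
    "score (1-5), and topics (list of short strings) from this customer survey note."
)

def survey_to_record(note: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # example model; swap in your approved one
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": note},
        ],
        response_format={"type": "json_object"},  # ask for machine-readable output
    )
    return json.loads(resp.choices[0].message.content)

record = survey_to_record("Brakes feel great, but I waited 40 minutes past my pickup time.")
print(record)  # e.g. {"sentiment": "neutral", "score": 3, "topics": ["brakes", "wait time"]}
```

Batch this over each store’s closed tickets once a week and the store manager’s four hours of manual collation become a report that writes itself.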

The quiet process fixes found in the back office may not be exciting, but that’s where the immediate, impactful wins lie. And, again, these wins are cumulative. We’re solving real problems, creating more budget through savings as we go and building organizational expertise in a new tech stack. The actual moonshot landed on the 11th Apollo mission.

So there we have it: Failure in the AI space is explainable and looks quite a bit like the failure in every other rapidly adopted tech wave we’ve been crushed by in the past few decades. The sky, it seems, hasn’t dropped an inch. We frequently pick the wrong use case and dive headlong into unfamiliar waters, chasing things that look shiny but yield little value when caught. Now that we know why things fail and where to look for successful project candidates, we’ll take some time to explore the “how.” Stay tuned for Part 2, where I’ll break down five factors for predictable autonomy with agentic AI.