We’ve been doing a lot of work recently on agentic AI workflows, so it’s time to share some of our thinking on how to assess their risks and benefits. The terms “AI Agent” and “Agentic AI” are used to describe several different concepts, many of which are actually routine uses of GenAI models with basic software tools, for which we have found numerous high-value/low-risk use cases. But here, we are using the term Agentic AI Workflow (AAW) to mean a series of tasks currently done partially or fully by humans that would instead be accomplished in a workflow where GenAI performs all of the tasks (either in series or in parallel) without any human review, involvement, or intervention between them.

Our view is that the current frontier GenAI models are generally not able to reliably and consistently handle a series of complex tasks entirely on their own, as AAWs require. Indeed, we were not surprised to read the latest Remote Labor Index by Scale AI, which found that the highest-performing agent (Manus) achieved only a 2.5% success rate for full automation. The vast majority of the fully automated workflows failed to deliver human-quality outputs consistently, although many of them were able to reliably automate certain aspects of a complex project.

Many of these AAW projects fail because, as various tasks are stitched together without any human review in between them, errors compound, and any time saved in the process is lost to the human intervention needed to fix the end product. For these workflows, at least for the foreseeable future, it will be more effective to use AI to automate only parts of the AAW, and to include humans at various points in the workflow to fix errors and course-correct.
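A rough back-of-the-envelope illustration shows why compounding is so costly (the per-task success rate and task count below are assumed for the example, not drawn from any benchmark, and task failures are treated as independent for simplicity): if each task in the chain succeeds with probability p and there are n tasks with no human checkpoint between them, the whole workflow succeeds only with probability p^n.

$$P(\text{workflow succeeds}) = p^{\,n}, \qquad \text{e.g., } 0.95^{10} \approx 0.60$$

So even if the model handles each individual task correctly 95% of the time, a ten-task chain fails end to end roughly 40% of the time, which is why removing every human checkpoint tends to erase the expected time savings.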

One interesting, and somewhat paradoxical, observation is that many AAW failures arise because the AI is actually following the rules it has been provided, but the rules themselves are not sufficiently nuanced to allow the AI to succeed over time. Indeed, the reality of most job-related tasks is that unforeseen problems arise frequently. And, as every good employee knows, the solutions to unanticipated problems sometimes require pragmatic, situational rule-bending.

One of the enduring challenges for self-driving cars is that humans don’t always follow the rules of the road. We speed, roll through stop signs, double‑park to drop someone off, or edge briefly into the bus lane to get around a stalled car—all without consequence.

Southwest Airlines has earned a reputation for exceptional customer service in large part because of its policy of allowing employees to break rules when appropriate. Hospital nurses are routinely permitted to bypass electronic‑health‑record or barcode‑scanning steps when a patient’s condition demands immediate action. For time-sensitive legal problems, some lawyers get started on the project before all the conflicts are cleared. And here in Manhattan, getting from point A to point B in a reasonable period of time almost always requires violating several traffic rules.

The rules that we set up for agentic AI systems usually cannot capture the nuanced social and cultural contexts that experienced employees rely on when deciding that not following a policy is actually the right course of action because the policy was drafted without that particular situation in mind. And employees learn from their own experiences, as well as from watching other employees, which minor violations are forgiven and, in many cases, rewarded.

Many professionals working in AI believe that AAWs can replace large categories of human work. But many of those professionals are coders, software engineers, or mathematicians, working in fields where formal structures and hard rules predominate. For them, going outside the rules often results in errors or failure.

But in many other professional settings, the rules come from policies that are based on less formal structures, including non-binding regulatory guidance or industry best practices. These kinds of rules are rough approximations of optimal codes of conduct that are constantly being adjusted to accommodate new and unanticipated situations. If people decide to violate the policy, they may be punished. But the conduct may also be tolerated as a mere technical violation that should not be discouraged. And in some cases, employees may have identified a non-compliant way of achieving the desired outcome that is consistent with the intent of the policy, revealing that the policy itself is not properly calibrated and needs to be revised.

Efficient, pragmatic, and creative rule-bending is one of the ways that we innovate and make progress without the help of AI. So, at least for now, while AI can be very helpful in automating certain discrete aspects of professional services, it cannot replace entire jobs because, for the successful completion of most complex workflows, the AI agents still need us more than we need them.

The Debevoise STAAR (Suite of Tools for Assessing AI Risk) is a monthly subscription service that provides Debevoise clients with an online suite of tools to help them fast-track their AI adoption. Please contact us at STAARinfo@debevoise.com for more information.

The cover art for this blog was generated by ChatGPT-5.

Author

Charu A. Chandrasekhar is a litigation partner based in the New York office and a member of the firm’s White Collar & Regulatory Defense and Data Strategy & Security Groups. Her practice focuses on securities enforcement and government investigations defense, as well as artificial intelligence and cybersecurity regulatory counseling and defense. Charu can be reached at cchandra@debevoise.com.

Author

Avi Gesser is Co-Chair of the Debevoise Data Strategy & Security Group. His practice focuses on advising major companies on a wide range of cybersecurity, privacy and artificial intelligence matters. He can be reached at agesser@debevoise.com.

Author

Karen Levy is the Chief Information Officer at Debevoise and serves on the firm's AI Governance Committee.

Author

Matthew Kelly is a litigation counsel based in the firm’s New York office and a member of the Data Strategy & Security Group. His practice focuses on advising the firm’s growing number of clients on matters related to AI governance, compliance and risk management, and on data privacy. He can be reached at makelly@debevoise.com.

Author

Jeremy Liss is an associate in the Litigation Department. He can be reached at jiliss@debevoise.com.

Author

Diane C. Bernabei is an associate in the Litigation Department. She can be reached at dcbernabei@debevoise.com.

Author

Patty is a virtual AI specialist in the Debevoise Data Strategy and Security Group. She was created on May 3, 2025, using OpenAI's o3 model.