Case Study

Devin AI and Goldman Sachs: Independent Analysis of AI Agents in Banking

NERVICO's independent analysis of Goldman Sachs's evaluation of the AI agent Devin: what they found, what it means for enterprises, and lessons for adopting AI coding agents.

Investment Banking / AI | AI Agent Analysis

~26%

Completion Rate

Tasks successfully completed by Devin according to the Goldman Sachs report

$28-150

Cost per Task

Cost range per automated task versus junior developer cost

20-30%

Potential Savings

Estimated cost reduction on repetitive engineering tasks

Important note: this case study is an independent analysis conducted by NERVICO based on public information. NERVICO has not worked directly with Goldman Sachs or with Cognition (the company behind Devin). The goal is to provide a technical and practical perspective on what Goldman’s evaluation means for the industry.

In mid-2025, Goldman Sachs published one of the most rigorous market evaluations of Devin, the AI software development agent created by Cognition. The report, directed at institutional investors, analyzed Devin’s actual ability to execute software engineering tasks autonomously.

When a top-tier investment bank dedicates resources to evaluating an AI development tool, the entire industry should pay attention. Not because Goldman holds the absolute truth, but because their analysis applies a level of financial and methodological rigor rarely seen in typical technology evaluations.

At NERVICO, we analyzed the report in depth to extract lessons applicable to any company considering incorporating AI agents into their development processes.

The Challenge

The market for AI agents in software development is experiencing a moment of outsized expectations. Promises range from “replacing 50% of developers” to “multiplying productivity by 10x.” Amid all the noise, companies need real data to make informed decisions.

Promises Versus Reality

Cognition presented Devin as “the world’s first AI software engineer,” capable of planning, executing, and debugging complex tasks autonomously. The initial demonstration generated a $2 billion valuation. But controlled demos are one thing. Execution in real-world environments, with legacy code, ambiguous requirements, and complex dependencies, is something else entirely.

Lack of Independent Evaluations

Until Goldman’s report, most evaluations of Devin came from Cognition itself or anecdotal assessments on social media. There was no systematic analysis measuring performance under realistic conditions, with clear metrics and a reproducible methodology.

Confusion in Enterprise Decision-Making

CTOs and engineering directors were fielding constant questions from their boards: “If AI can write code, why are we still hiring expensive developers?” The absence of reliable data made it impossible to answer with substance.

The Solution

NERVICO conducted an independent analysis of the Goldman Sachs report, supplementing it with our direct experience implementing AI agents in real development teams.

What Goldman Found

Goldman Sachs’s team evaluated Devin across a diverse set of software engineering tasks, from simple bug fixes to complete feature implementations. The key findings were revealing.

The successful completion rate hovered around 26%. This means that out of every four assigned tasks, Devin correctly completed one. For simple, well-defined tasks (bug fixes with clear tests, boilerplate code generation), the rate was significantly higher. For tasks requiring business context understanding or architectural decisions, the rate dropped below 15%.

The cost per task ranged from $28 to $150, depending on complexity and the number of retries needed. Compared with the hourly cost of a junior developer in the United States ($40 to $80), the economics favor automation only for repetitive, low-complexity tasks.
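The break-even logic behind that comparison can be sketched in a few lines. The per-task and hourly figures come from the report as cited above; the task-hour estimates are illustrative assumptions, not report data.

```python
# Break-even sketch: agent cost per task vs. the cost of a junior developer
# doing the task directly. Dollar figures echo the ranges cited above;
# the hours-per-task values are assumptions for illustration only.

def human_cost(hours: float, hourly_rate: float) -> float:
    """Cost of a developer completing the task directly."""
    return hours * hourly_rate

def automation_favored(agent_cost: float, hours: float, hourly_rate: float) -> bool:
    """True when the agent's per-task cost undercuts the human cost."""
    return agent_cost < human_cost(hours, hourly_rate)

# Simple, well-defined task: ~1 hour for a $60/h junior developer.
print(automation_favored(agent_cost=28, hours=1.0, hourly_rate=60))   # True

# Complex task: $150 agent cost vs. ~2 hours at $60/h.
print(automation_favored(agent_cost=150, hours=2.0, hourly_rate=60))  # False
```

The asymmetry is the whole point: at the cheap end of the agent's cost range, automation wins easily; at the expensive end, a developer doing the work directly is often cheaper.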

NERVICO Analysis: Missing Context from the Report

Goldman analyzed Devin as a standalone product. In our experience, AI agents work best as part of an integrated workflow, not as a developer replacement. These are the nuances we consider essential.

AI agents do not replace, they amplify. A team of five developers with a well-designed workflow incorporating AI agents can generate the output of a ten to twelve person team. But it needs five developers. Not zero.

Task type changes everything. The aggregate metrics (26% completion rate) mask a very uneven distribution. On mechanical tasks (syntax migration, unit test generation, dependency updates), the rate exceeds 70%. On tasks requiring technical judgment, the agent needs constant human oversight.

Prompt quality determines results. We observed that teams with prompting experience consistently achieve better results with the same agents. Investing in training the team on how to interact with AI agents has an immediate return.
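The point about task type above can be made concrete with a weighted average: per-type rates far apart from each other can still aggregate to a number near 26%. The task-mix weights below are hypothetical assumptions for illustration; the per-type rates echo the ranges discussed above.

```python
# Hypothetical task mix showing how very different per-type completion rates
# can aggregate to an overall figure near 26%. Weights are illustrative
# assumptions, not data from the Goldman report.
task_mix = {
    # task type: (share of workload, completion rate)
    "mechanical (migrations, test generation)": (0.20, 0.70),
    "well-defined bug fixes":                   (0.30, 0.28),
    "context/architecture-heavy":               (0.50, 0.08),
}

overall = sum(share * rate for share, rate in task_mix.values())
print(f"aggregate completion rate: {overall:.0%}")  # aggregate completion rate: 26%
```

This is why the aggregate number is a poor planning input: shifting the agent toward the mechanical slice of the mix changes the effective rate dramatically.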

Evaluation Framework for Enterprises

We developed a five-question framework that any company should answer before investing in AI agents for development:

  1. Task inventory: What percentage of your team’s current tasks are repetitive and well-defined?
  2. Actual base cost: What is the real cost per completed task with your current team (including overhead, meetings, and context switching)?
  3. Error tolerance: What are the consequences of a bug in production in your context? An internal app is very different from a payment system.
  4. Supervision capacity: Do you have senior developers available to review agent output, or are they already overloaded?
  5. Time horizon: Are you looking for results in weeks, or can you invest months in optimizing the workflow with agents?
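The five questions above can be folded into a rough readiness check. The scoring thresholds and weights below are hypothetical assumptions for illustration, not a validated model.

```python
# Hypothetical scoring sketch for the five-question framework. All thresholds
# and weights are illustrative assumptions, not a validated methodology.
from dataclasses import dataclass

@dataclass
class ReadinessInput:
    repetitive_task_share: float   # Q1: fraction of tasks that are repetitive/well-defined
    cost_per_task_known: bool      # Q2: do you know your real baseline cost per task?
    low_error_tolerance: bool      # Q3: is a production bug catastrophic in your context?
    seniors_available: bool        # Q4: senior capacity to review agent output
    horizon_months: int            # Q5: how long can you invest in workflow optimization?

def readiness_score(r: ReadinessInput) -> str:
    score = 0
    score += 2 if r.repetitive_task_share >= 0.3 else 0  # enough automatable work
    score += 1 if r.cost_per_task_known else 0           # baseline to measure against
    score += 0 if r.low_error_tolerance else 1           # room for agent mistakes
    score += 2 if r.seniors_available else 0             # supervision is non-negotiable
    score += 1 if r.horizon_months >= 3 else 0           # time to iterate on workflow
    return "pilot-ready" if score >= 5 else "prepare first"

example = ReadinessInput(0.4, True, False, True, 6)
print(readiness_score(example))  # pilot-ready
```

The weighting reflects a judgment from the analysis above: automatable task volume and senior review capacity matter more than budget or enthusiasm.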

Results

Key Findings from the Combined Analysis

After cross-referencing Goldman’s data with our direct experience, the conclusions are as follows:

  • AI agents are cost-effective today for specific tasks: test generation, code migration, documentation, and mechanical refactoring. In these cases, savings can reach 20-30% of engineering costs.

  • They are not cost-effective as “developer replacements.” The 26% completion rate on general tasks means you need a developer reviewing and fixing the remaining 74%. The net cost can be higher than doing the task directly.
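The expected-cost arithmetic behind that bullet is worth making explicit. The 26% rate comes from the report; the mid-range agent cost and the fix-time assumption are illustrative.

```python
# Expected-cost sketch for the "developer replacement" scenario: the agent runs
# on every task, and a developer reviews and fixes the 74% that fail. The
# assumption that a failed task costs as much to fix as to do directly is
# illustrative, not from the report.

def expected_agent_cost(agent_cost: float, success_rate: float,
                        human_fix_cost: float) -> float:
    """Agent cost is always paid; human fix cost is paid on failures."""
    return agent_cost + (1 - success_rate) * human_fix_cost

direct_cost = 2.0 * 60        # developer does the task directly: 2h at $60/h
with_agent = expected_agent_cost(
    agent_cost=90,            # mid-range of the $28-150 per-task cost
    success_rate=0.26,
    human_fix_cost=2.0 * 60,  # assumed fix cost for a failed task
)
print(f"direct: ${direct_cost:.0f}, with agent: ${with_agent:.0f}")
# direct: $120, with agent: $179
```

Under these assumptions, routing general-purpose tasks through the agent costs roughly 50% more than doing them directly, which is exactly the "net cost can be higher" effect described above.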

  • Return on investment depends on implementation, not the tool. We have seen teams achieve 3x ROI with Claude Code and Cursor, and others end up with negative ROI using the same tools. The difference is how the workflow is designed.

  • Goldman underestimates the continuous improvement factor. AI agents improve with every model iteration. What has a 26% rate today will likely reach 50% within 12-18 months. Companies that invest now in integrating agents will have an advantage when that happens.

Implications for CTOs

Goldman’s report reinforces a position we have held for some time: adopting AI agents in development must be pragmatic, incremental, and measured. It is not a binary decision of “adopt or not adopt.” It is a decision of “where, how, and with what expectations.”

Lessons Learned

Financial Data Provides Clarity That Demos Cannot

Goldman’s main contribution is not technical but economic. By translating Devin’s performance into financial metrics (cost per task, ROI by activity type), the report enables executives to make decisions based on numbers, not marketing promises.

Raw Completion Rate Is a Misleading Metric

A global 26% says little if it is not broken down by task type. Companies should measure the completion rate for their specific tasks, not assume the aggregate number applies to their context.

Investing in Workflow Beats Investing in Tools

We have seen six-figure budgets on AI tool licenses with teams that do not change how they work. The result is predictable: tools are underutilized and ROI never materializes. The correct investment is first in process, then in tools.

The Time to Start Is Now, but With Realistic Expectations

Companies that wait for AI agents to be “perfect” will arrive late. Those that adopt them with outsized expectations will be frustrated. The right path is to adopt with bounded pilot projects, measure real results, and scale progressively.


If you are evaluating how to incorporate AI agents into your development team, we can help you design a realistic plan based on data, not promises. Request a free audit and we will analyze your specific situation.

Does Your Company Need Similar Results?

Tell us about your case in a free 30-minute session. We evaluate your situation and propose a concrete plan.