I'd like to introduce two new projects that are part of the Spring AI Community GitHub organization: Spring AI Agents and Spring AI Bench. These two projects focus on using the agentic coding tools you likely already have in your enterprise.
In 2025, AI coding agents have matured to the point that they need to be seriously considered for enterprise Java development and general SDLC tasks. CLI tools like Claude Code, Google's Gemini CLI, Amazon Q Developer, and OpenAI's assistants are examples from the leading AI labs, but there are also smaller startups and open-source options. These agentic coding tools can reason about architecture, grok large codebases, and hold great promise for helping developers ship software faster. They are often used in a human-in-the-loop style, but they can also be instructed to execute autonomously until they determine the goal has been completed.
Spring AI Agents defines a lightweight but powerful portable abstraction: the AgentClient. It acts as a consistent interface for invoking autonomous CLI-based agents, letting developers use the agentic tools they already have while avoiding lock-in to a single provider.
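To make the idea concrete, here is a minimal sketch of what invoking a CLI agent through a portable client could look like. The fluent style mirrors other Spring AI clients, but the interfaces and method names (goal, workingDirectory, run) are illustrative assumptions of mine, not the published Spring AI Agents API; consult the project for the real surface.

```java
// Illustrative sketch only: the types and method names below are assumptions,
// not the published Spring AI Agents API.
import java.nio.file.Path;

public class AgentClientSketch {

    // A hypothetical portable client wrapping a CLI agent (e.g. Claude Code or Gemini CLI).
    interface AgentClient {
        AgentRun goal(String goal);              // what the agent should accomplish
    }

    interface AgentRun {
        AgentRun workingDirectory(Path dir);     // repository the agent is allowed to modify
        AgentResult run();                       // block until the agent decides it is done
    }

    record AgentResult(boolean successful, String summary) {}

    void example(AgentClient agentClient) {
        AgentResult result = agentClient
                .goal("Add unit tests for OrderService until line coverage exceeds 80%")
                .workingDirectory(Path.of("/work/my-service"))
                .run();

        System.out.println(result.successful() + ": " + result.summary());
    }
}
```

The point of the sketch is the shape, not the names: the same goal-plus-workspace call should work regardless of which underlying CLI agent carries it out.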
However, the AgentClient is only one piece of the developer toolbox you need to use agentic tools effectively. Spring AI Agents provides the following abstractions, which, when combined, can produce the most effective results:
The companion project, Spring AI Bench, is a benchmark suite for evaluating agents on goal-directed enterprise workflows. It evaluates how effectively different agents complete their goals and can be thought of as the test harness that runs any agent via Spring AI Agents.
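Conceptually, a goal-directed benchmark run pairs an agent invocation with an independent success check. The sketch below shows that shape under assumed names (BenchTask, BenchResult, and the Maven-based check are mine, not Spring AI Bench's API): hand the agent a goal in a throwaway workspace, then judge success with a deterministic verification step such as a passing build.

```java
// Conceptual sketch of a goal-directed benchmark run; the names are assumptions,
// not the Spring AI Bench API.
import java.nio.file.Path;

public class BenchSketch {

    record BenchTask(String goal, Path workspace) {}

    record BenchResult(boolean goalMet, long durationMillis) {}

    interface Agent {
        void execute(String goal, Path workspace);   // any CLI agent, invoked via the AgentClient abstraction
    }

    BenchResult run(BenchTask task, Agent agent) throws Exception {
        long start = System.currentTimeMillis();
        agent.execute(task.goal(), task.workspace());

        // Success is judged by an independent, deterministic check,
        // e.g. does the project still build and do its tests pass?
        Process build = new ProcessBuilder("mvn", "-q", "verify")
                .directory(task.workspace().toFile())
                .inheritIO()
                .start();
        boolean goalMet = build.waitFor() == 0;

        return new BenchResult(goalMet, System.currentTimeMillis() - start);
    }
}
```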
The need for this project came from my investigation of existing agentic benchmarks. I discovered they focused primarily on Python and addressed only the use case of producing a code patch for a GitHub issue. You see the following pattern in the literature: SWE-bench posts strong numbers on its static, curated Python issue set, yet when a new set of curated issues is introduced, the numbers drop dramatically. On SWE-bench Verified, agents score 60-75% on static Python sets; on SWE-bench-Live, the same runs fall to 19%, roughly a 3× drop. On SWE-bench-Java, Java tasks land around 7-10% compared to Python's ~75% in the same benchmark family, an order-of-magnitude gap. For engineering leaders, volatile scores translate into volatile decisions.
None of this implies agents are weak; it implies the yardsticks are dated. SWE-agent is thousands of lines of Python, yet the roughly 100-line mini-SWE-agent (a simple agentic loop with chat memory and a single tool, bash) achieves competitive SWE-bench results. It turns out there are no benchmarks that judge the capabilities of today's and tomorrow's agentic CLI tools.
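For readers who haven't seen one, the core of such a minimal agent is just a loop: keep a message history, ask the model for the next shell command, run it, feed the output back, and stop when the model says it is done. The sketch below captures that shape in plain Java; the LlmClient interface is a stand-in assumption for whatever chat-model client you actually use, not a real API.

```java
// Minimal agentic loop sketch: one tool (bash), chat memory, stop on "DONE".
// LlmClient is a stand-in assumption for a real chat-model client.
import java.util.ArrayList;
import java.util.List;

public class MiniAgentLoop {

    interface LlmClient {
        String complete(List<String> messages);  // returns the model's next message
    }

    static String runBash(String command) throws Exception {
        Process p = new ProcessBuilder("bash", "-c", command)
                .redirectErrorStream(true)
                .start();
        String output = new String(p.getInputStream().readAllBytes());
        p.waitFor();
        return output;
    }

    static void run(LlmClient llm, String goal, int maxTurns) throws Exception {
        List<String> history = new ArrayList<>();
        history.add("system: You are a coding agent. Reply with a single bash command, or DONE when finished.");
        history.add("user: " + goal);

        for (int turn = 0; turn < maxTurns; turn++) {
            String reply = llm.complete(history);    // the model proposes the next command
            history.add("assistant: " + reply);

            if (reply.strip().startsWith("DONE")) {
                return;                              // the agent decided the goal is met
            }

            String observation = runBash(reply);     // the single tool: bash
            history.add("user: " + observation);     // feed the result back as chat memory
        }
    }
}
```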
Early runs are promising. Multi-label issue classification across more than a hundred domain-specific labels matched or exceeded published F1 scores. PR-merge agents have processed hundreds of pull requests on the Spring AI code base, generating structured reports: risk assessments, architecture notes, and backport analysis. This significantly reduced review time while improving consistency. Simple code-coverage benchmarks revealed that while leading models can reach the same coverage number, they differ in code quality and in how closely they follow instructions.
What's next: Both projects are incubating in the Spring AI Community organization. Snapshot builds are available in Maven Central. We're also working with leaders of the Developer Productivity AI Arena (DPAIA) initiative, which was created to address the issues I raised here.
The Spring AI Community looks forward to your feedback as we move from the year of agents to a new era of using agents effectively.