Large language model API costs can accumulate quickly when applications repeatedly send the same prompt content. A typical scenario: you're building a document analyzer that includes a 3,000-token document in every request. Asking five questions about that document means processing 15,000 tokens of identical content at full price.
Anthropic's prompt caching addresses this by allowing you to reuse previously processed prompt segments. Spring AI provides comprehensive support through strategic caching patterns that handle cache breakpoint placement and management automatically.
In this blog post, we explain how prompt caching works, when to use it, and how Spring AI simplifies its implementation for you.
Prompt caching allows you to mark portions of your prompt for reuse across multiple API requests. When you enable it, Anthropic caches the specified content and charges reduced rates for cached segments in subsequent requests.
The cache operates on exact prefix matching. Consider this sequence:
Request 1: [System Prompt] + [User: "Question 1"]
           └─── Cached ───┘
Request 2: [System Prompt] + [User: "Question 2"]
           └── Cache Hit ─┘  └─ Only this part incurs full cost ─┘
The system generates cache keys using cryptographic hashes of prompt content up to designated cache control points. Only requests with identical content achieve cache hits—even a single character change creates a new cache entry.
Request 1: "You are a helpful assistant." → Cache created
Request 2: "You are a helpful assistant." → Cache hit
Request 3: "You are a helpful assistant " → Cache miss (extra space)
Request 4: "You are a Helpful assistant." → Cache miss (capitalization)
On a cache miss, the system processes content at standard rates and creates a new cache entry. This is why maintaining consistent prompt templates becomes important for effective caching.
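One way to keep the prefix stable is to hold the cacheable instructions in a single constant and pass per-request data in the user message rather than formatting it into the system prompt. Here is a minimal sketch of that pattern; the service, prompt text, and the SYSTEM_ONLY strategy it uses (covered in detail later in this post) are illustrative:
@Service
public class SupportAssistant {

    // The cacheable prefix lives in one constant, so every request sends byte-identical content.
    private static final String SUPPORT_SYSTEM_PROMPT = """
            You are a support assistant for the ACME platform.
            Follow the troubleshooting checklist before suggesting escalation.
            [... several thousand tokens of stable product documentation ...]
            """;

    private final AnthropicChatModel chatModel;

    public SupportAssistant(AnthropicChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String answer(String customerName, String question) {
        // Per-request data goes into the user message, after the cached prefix,
        // so it never breaks the exact prefix match on the system prompt.
        String userMessage = "Customer: %s%nQuestion: %s".formatted(customerName, question);

        ChatResponse response = chatModel.call(new Prompt(
                List.of(new SystemMessage(SUPPORT_SYSTEM_PROMPT), new UserMessage(userMessage)),
                AnthropicChatOptions.builder()
                        .cacheOptions(AnthropicCacheOptions.builder()
                                .strategy(AnthropicCacheStrategy.SYSTEM_ONLY)
                                .build())
                        .build()));

        return response.getResult().getOutput().getText();
    }
}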
Performance Impact: Beyond cost savings, Anthropic reports latency reductions of up to 85% for long prompts. In their announcement, a 100K-token book example showed response time dropping from 11.5s to 2.4s with caching enabled. Note that actual latency improvements depend on how much content you're caching and your cache hit rate.
Cache Lifecycle: Entries refresh on each use within the TTL window (5 minutes default, 1 hour optional). After TTL expiration, the next request creates a new cache entry.
Pricing varies significantly by model tier:
| Model | Base Input | Cache Write | Cache Read | Savings |
|---|---|---|---|---|
| Claude Sonnet 4.5 | $3/MTok | $3.75/MTok (+25%) | $0.30/MTok | 90% |
| Claude Sonnet 4 | $3/MTok | $3.75/MTok (+25%) | $0.30/MTok | 90% |
| Claude Opus 4.1 | $15/MTok | $18.75/MTok (+25%) | $1.50/MTok | 90% |
| Claude Opus 4 | $15/MTok | $18.75/MTok (+25%) | $1.50/MTok | 90% |
| Claude Haiku 4.5 | $1/MTok | $1.25/MTok (+25%) | $0.10/MTok | 90% |
| Claude Haiku 3.5 | $0.80/MTok | $1/MTok (+25%) | $0.08/MTok | 90% |
| Claude Haiku 3 | $0.25/MTok | $0.30/MTok (+25%) | $0.03/MTok | 90% |
All code examples in this blog use Claude Sonnet 4.5 pricing unless otherwise specified.
For more on pricing, see Anthropic's pricing documentation for prompt caching.
Example calculation (Claude 3.5 Sonnet, 5,000 tokens): a regular request costs 5,000 x $3/MTok = $0.015 in input tokens. Writing those tokens to the cache costs $0.01875 (a 25% premium), while every subsequent cache read costs just $0.0015, a 90% saving on the cached portion.
Minimum token thresholds vary by model:
| Model | Minimum Cacheable Tokens |
|---|---|
| Claude Sonnet 4.5, Claude Sonnet 4, Claude Opus 4.1, Claude Opus 4 | 1,024 |
| Claude Haiku 3.5, Claude Haiku 3 | 2,048 |
| Claude Haiku 4.5 | 4,096 |
Prompts below these thresholds cannot be cached, even when you mark them with cache control directives. Note that Claude Haiku 4.5 requires significantly more tokens (4,096) compared to other models.
For more on this, see Anthropic's cacheable prompt length limitations.
Additional constraints:
- A request can include at most 4 cache breakpoints.
- Cache hits require an exact prefix match up to each breakpoint, so any change upstream of a breakpoint invalidates it.
Anthropic processes request components in a specific order, and this order determines how cache invalidation works.
┌─────────────────────────────────────────┐
│ Request Processing Order: │
│ │
│ 1. Tools │
│ ↓ │
│ 2. System Message │
│ ↓ │
│ 3. Message History │
└─────────────────────────────────────────┘
Changes cascade downward through this hierarchy:
┌──────────────────────────────────────────────────────┐
│ Cascade Invalidation: │
│ │
│ Change Tools → Invalidates: Tools, System, Msgs │
│ Change System → Invalidates: System, Msgs │
│ Change Messages → Invalidates: Msgs only │
└──────────────────────────────────────────────────────┘
Understanding this behavior is essential when choosing your caching strategy. Changes to components higher in the hierarchy invalidate all downstream caches, which directly impacts your cache hit rates.
Rather than requiring you to manually place cache breakpoints (which can be error-prone and tedious), Spring AI provides five strategic patterns through the AnthropicCacheStrategy enum. Each strategy handles cache control directive placement automatically while respecting Anthropic's 4-breakpoint limit.
| Strategy | Breakpoints | Cached Content | Typical Use Case |
|---|---|---|---|
| NONE | 0 | Nothing | One-off requests, testing |
| SYSTEM_ONLY | 1 | System message | Stable system prompts, <20 tools |
| TOOLS_ONLY | 1 | Tool definitions | Large tools, dynamic system prompts |
| SYSTEM_AND_TOOLS | 2 | Tools + System | 20+ tools, both stable |
| CONVERSATION_HISTORY | 1-4 | Full conversation | Multi-turn conversations |
Let's take a look at each strategy with practical examples.
The SYSTEM_ONLY strategy caches the system message content. Since tools appear before the system message in Anthropic's request hierarchy (Tools → System → Messages), they automatically become part of the cache prefix when you place a cache breakpoint on the system message.
Important: Changing any tool definition will invalidate the system cache due to the cache hierarchy.
String systemPrompt = """
You are an expert software architect specializing in distributed systems.
You have deep knowledge of microservices, event-driven architecture, and cloud-native patterns.
When analyzing systems, consider scalability, resilience, and maintainability.
[... additional context ...]
""";
ChatResponse response = chatModel.call(
new Prompt(
List.of(
new SystemMessage(systemPrompt),
new UserMessage("What is microservices architecture?")
),
AnthropicChatOptions.builder()
.model(AnthropicApi.ChatModel.CLAUDE_3_5_SONNET)
.cacheOptions(AnthropicCacheOptions.builder()
.strategy(AnthropicCacheStrategy.SYSTEM_ONLY)
.build())
.maxTokens(500)
.build()
)
);
// Access cache metrics
AnthropicApi.Usage usage = (AnthropicApi.Usage) response.getMetadata()
.getUsage().getNativeUsage();
if (usage != null) {
System.out.println("Cache creation: " + usage.cacheCreationInputTokens());
System.out.println("Cache read: " + usage.cacheReadInputTokens());
}
On the first request, cacheCreationInputTokens will be greater than zero and cacheReadInputTokens will be zero. Subsequent requests with the same system prompt will show zero for cacheCreationInputTokens and a positive value for cacheReadInputTokens. This is how you can verify that caching is working as expected in your application.
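If you want to verify this automatically, a small integration test that compares the usage metadata of two consecutive calls is enough. A sketch using JUnit and AssertJ, assuming a chatModel field, the systemPrompt from the example above, and a valid ANTHROPIC_API_KEY:
@Test
void systemPromptIsCachedAcrossRequests() {
    Prompt prompt = new Prompt(
            List.of(new SystemMessage(systemPrompt), new UserMessage("What is microservices architecture?")),
            AnthropicChatOptions.builder()
                    .cacheOptions(AnthropicCacheOptions.builder()
                            .strategy(AnthropicCacheStrategy.SYSTEM_ONLY)
                            .build())
                    .build());

    AnthropicApi.Usage first = (AnthropicApi.Usage) chatModel.call(prompt)
            .getMetadata().getUsage().getNativeUsage();
    AnthropicApi.Usage second = (AnthropicApi.Usage) chatModel.call(prompt)
            .getMetadata().getUsage().getNativeUsage();

    // The first call writes the cache; the second call reads it.
    assertThat(first.cacheCreationInputTokens()).isPositive();
    assertThat(second.cacheReadInputTokens()).isPositive();
}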
Use this strategy when your system prompt is large (meeting the minimum token threshold) and stable, but user questions vary.
The TOOLS_ONLY strategy caches tool definitions while processing the system message fresh on each request. Its value becomes clear in multi-tenant scenarios where tools are shared but system prompts need customization.
Consider a SaaS application serving multiple organizations:
@Service
public class MultiTenantAIService {
private final List<ToolCallback> sharedTools = List.of(
weatherTool, // 500 tokens
calendarTool, // 800 tokens
emailTool, // 700 tokens
analyticsTool, // 600 tokens
reportingTool, // 900 tokens
// ... 15 more tools totaling 5,000+ tokens
);
public String handleRequest(String tenantId, String userQuery) {
TenantConfig config = tenantRepository.findById(tenantId);
// Each tenant requires a unique system prompt
String systemPrompt = """
        You are %s's AI assistant.
        Company values: [custom data]
        Brand voice: [custom data]
        Compliance requirements: [custom data]
        """.formatted(config.getName()); // hypothetical accessor; fills %s with the tenant's name
ChatResponse response = chatModel.call(
new Prompt(
List.of(
new SystemMessage(systemPrompt),
new UserMessage(userQuery)
),
AnthropicChatOptions.builder()
.model(AnthropicApi.ChatModel.CLAUDE_3_5_SONNET)
.cacheOptions(AnthropicCacheOptions.builder()
.strategy(AnthropicCacheStrategy.TOOLS_ONLY)
.build())
.toolCallbacks(sharedTools)
.maxTokens(800)
.build()
)
);
return response.getResult().getOutput().getText();
}
}
Here's what happens:
All tenants share the same cached tool definitions, while each receives their customized system prompt. For the 5,000-token tool set, this means paying $0.01875 once to create the cache, then $0.0015 per request for cache reads—regardless of which tenant is making the request.
The SYSTEM_AND_TOOLS strategy creates two independent cache breakpoints: one for tools (breakpoint 1) and one for the system message (breakpoint 2). This separation matters when you have more than 20 tools or when you need deterministic caching of both components.
The key advantage: changing the system message does not invalidate the tool cache.
ChatResponse response = chatModel.call(
new Prompt(
List.of(
new SystemMessage(systemPrompt),
new UserMessage(userQuery)
),
AnthropicChatOptions.builder()
.model(AnthropicApi.ChatModel.CLAUDE_3_5_SONNET)
.cacheOptions(AnthropicCacheOptions.builder()
.strategy(AnthropicCacheStrategy.SYSTEM_AND_TOOLS)
.build())
.toolCallbacks(toolCallbacks)
.maxTokens(500)
.build()
)
);
The cache keys work as follows:
- Breakpoint 1 (tools): hash(tools)
- Breakpoint 2 (system): hash(tools + system)

Cascade behavior: changing the system message invalidates only the second breakpoint; the tool cache remains valid and is read at the discounted rate. Changing the tools invalidates both breakpoints.
This makes SYSTEM_AND_TOOLS ideal when your system prompt changes more frequently than your tools, allowing efficient reuse of the tool cache.
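To make that concrete, here is the earlier multi-tenant handler sketched again with SYSTEM_AND_TOOLS: the shared tools sit at their own breakpoint, so switching tenants (and therefore system prompts) only creates a new system-level cache entry while the tool prefix keeps being read from cache. The method shape mirrors the earlier illustrative example:
public String handleRequest(String tenantSystemPrompt, String userQuery) {
    ChatResponse response = chatModel.call(new Prompt(
            List.of(new SystemMessage(tenantSystemPrompt), new UserMessage(userQuery)),
            AnthropicChatOptions.builder()
                    .model(AnthropicApi.ChatModel.CLAUDE_3_5_SONNET)
                    .cacheOptions(AnthropicCacheOptions.builder()
                            .strategy(AnthropicCacheStrategy.SYSTEM_AND_TOOLS)
                            .build())
                    .toolCallbacks(sharedTools)   // breakpoint 1: hash(tools), shared by every tenant
                    .maxTokens(800)
                    .build()));
    // Breakpoint 2 covers hash(tools + system): a new entry per distinct tenant prompt,
    // while the tool prefix is still served from the existing cache.
    return response.getResult().getOutput().getText();
}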
For multi-turn conversations, the CONVERSATION_HISTORY strategy caches the entire conversation history incrementally. Spring AI places a cache breakpoint on the last user message in the conversation history. This is particularly useful when building conversational AI applications such as chatbots, virtual assistants, and customer support systems.
ChatClient chatClient = ChatClient.builder(chatModel)
.defaultSystem("You are a personalized career counselor with 20 years of experience...")
.defaultAdvisors(MessageChatMemoryAdvisor.builder(chatMemory)
.conversationId(conversationId)
.build())
.build();
String response = chatClient.prompt()
.user("What career advice would you give me based on our conversation?")
.options(AnthropicChatOptions.builder()
.model(AnthropicApi.ChatModel.CLAUDE_3_5_SONNET)
.cacheOptions(AnthropicCacheOptions.builder()
.strategy(AnthropicCacheStrategy.CONVERSATION_HISTORY)
.build())
.maxTokens(500)
.build())
.call()
.content();
The cache grows incrementally as the conversation progresses:
Turn 1: Cache [U1]
Turn 2: Reuse [U1], cache [U1 + A1 + U2]
Turn 3: Reuse [U1 + A1 + U2], cache [U1 + A1 + U2 + A2 + U3]
Turn 4: Reuse [U1 + A1 + U2 + A2 + U3], cache [full history + U4]
By turn 10 in a conversation, you might have 20,000 tokens of history cached, paying just 10% of the normal cost for that context on each subsequent turn.
Critical Requirement: Tool and System Stability
When using CONVERSATION_HISTORY, both tools and system prompts must remain stable throughout the conversation. Changes to either invalidate the entire conversation cache.
Here's a complete example showing cache efficiency for multi-question document analysis:
@Service
public class ContractAnalyzer {
private final AnthropicChatModel chatModel;
private final PdfExtractor pdfExtractor;
public AnalysisReport analyzeAgreement(String agreementPdf) {
// Extract text from PDF (typically 2,000-5,000 tokens)
String agreementText = pdfExtractor.extract(agreementPdf);
String systemPrompt = """
You are an expert business analyst specializing in partnership agreements.
Analyze the following partnership agreement and provide insights about
collaboration terms, value propositions, and strategic opportunities.
AGREEMENT:
%s
""".formatted(agreementText);
String[] questions = {
"What are the key collaboration opportunities outlined in this agreement?",
"Summarize the revenue sharing and financial arrangements.",
"What intellectual property rights and licensing terms are defined?",
"Identify the strategic value propositions for both parties.",
"What are the performance milestones and success metrics?"
};
AnalysisReport report = new AnalysisReport();
AnthropicChatOptions options = AnthropicChatOptions.builder()
.model(AnthropicApi.ChatModel.CLAUDE_3_5_SONNET)
.cacheOptions(AnthropicCacheOptions.builder()
.strategy(AnthropicCacheStrategy.SYSTEM_ONLY)
.messageTypeTtl(MessageType.SYSTEM, AnthropicCacheTtl.ONE_HOUR)
.build())
.maxTokens(1000)
.build();
for (int i = 0; i < questions.length; i++) {
ChatResponse response = chatModel.call(
new Prompt(
List.of(
new SystemMessage(systemPrompt),
new UserMessage(questions[i])
),
options
)
);
String answer = response.getResult().getOutput().getText();
report.addSection(questions[i], answer);
// Log cache performance
AnthropicApi.Usage usage = (AnthropicApi.Usage)
response.getMetadata().getUsage().getNativeUsage();
if (usage != null) {
if (i == 0) {
logger.info("First question - Cache created: {} tokens",
usage.cacheCreationInputTokens());
} else {
logger.info("Question {} - Cache read: {} tokens",
i + 1, usage.cacheReadInputTokens());
}
}
}
return report;
}
}
With a 3,500-token system prompt (agreement + instructions) using Claude 3.5 Sonnet:
- Without caching: 5 questions x 3,500 tokens x $3/MTok ≈ $0.0525 for the system prompt alone
- With caching: one cache write (≈ $0.013) plus four cache reads (≈ $0.001 each), roughly $0.017 in total
This represents a 68% cost reduction for the cached system prompt portion. The actual total savings will be lower when you factor in the user question tokens and output tokens (which are not cached), but the reduction becomes more significant with more questions or larger documents.
Note: Prompt caching support is available in Spring AI 1.1.0 and later. Try it with the latest 1.1.0-SNAPSHOT version.
Add the Spring AI Anthropic starter to your project:
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-model-anthropic</artifactId>
</dependency>
Configure your API key in your application properties:
spring.ai.anthropic.api-key=${ANTHROPIC_API_KEY}
Inject and use in your application:
@Autowired
private AnthropicChatModel chatModel;
// Start using caching strategies immediately
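Putting the pieces together, a minimal service might look like the following sketch (constructor injection instead of field injection; the class name and prompt text are illustrative):
@Service
public class AssistantService {

    private final AnthropicChatModel chatModel;

    public AssistantService(AnthropicChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String ask(String question) {
        ChatResponse response = chatModel.call(new Prompt(
                List.of(
                        new SystemMessage("You are a concise technical assistant. [... large, stable instructions ...]"),
                        new UserMessage(question)),
                AnthropicChatOptions.builder()
                        .cacheOptions(AnthropicCacheOptions.builder()
                                .strategy(AnthropicCacheStrategy.SYSTEM_ONLY)
                                .build())
                        .build()));
        return response.getResult().getOutput().getText();
    }
}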
The default cache TTL is 5 minutes. For scenarios where requests arrive less frequently, you can configure a 1-hour cache:
AnthropicChatOptions options = AnthropicChatOptions.builder()
.model(AnthropicApi.ChatModel.CLAUDE_3_5_SONNET)
.cacheOptions(AnthropicCacheOptions.builder()
.strategy(AnthropicCacheStrategy.SYSTEM_ONLY)
.messageTypeTtl(MessageType.SYSTEM, AnthropicCacheTtl.ONE_HOUR)
.build())
.maxTokens(500)
.build();
Spring AI automatically adds the required beta header (anthropic-beta: extended-cache-ttl-2025-04-11) when you configure 1-hour TTL.
When to use each TTL:
- 5 minutes (default): requests arrive frequently enough that each use refreshes the entry before it expires, for example interactive chat sessions and high-traffic endpoints.
- 1 hour: requests are spread out and would routinely miss the 5-minute window, for example periodic batch analysis or low-traffic assistants.

Trade-off: 1-hour cache writes cost 2x the base input price versus 1.25x for 5-minute writes ($6/MTok vs. $3.75/MTok for Claude 3.5 Sonnet). Evaluate your application's request patterns and stability requirements to determine which TTL makes sense.
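As a rough way to reason about that trade-off, compare what each TTL costs for your actual traffic. A back-of-the-envelope sketch for a 10,000-token cached prefix on Claude 3.5 Sonnet pricing, assuming requests arrive roughly every 20 minutes (so a 5-minute cache always expires between requests):
// Prices from the table above; figures cover the cached prefix only.
double tokens = 10_000;
double fiveMinuteWrite = tokens / 1_000_000 * 3.75; // $0.0375 per request: the 5-minute entry has expired, so every call re-writes it
double oneHourWrite    = tokens / 1_000_000 * 6.00; // $0.0600 once: the 1-hour entry stays warm because each read refreshes it
double cacheRead       = tokens / 1_000_000 * 0.30; // $0.0030 for every later request within the TTL

// First three requests: 5-minute TTL costs 3 x $0.0375 = $0.1125,
// 1-hour TTL costs $0.06 + 2 x $0.003 = $0.066. The longer TTL wins as soon
// as more than one request lands inside the hour.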
You can set minimum content lengths per message type to optimize breakpoint usage:
AnthropicCacheOptions cache = AnthropicCacheOptions.builder()
.strategy(AnthropicCacheStrategy.CONVERSATION_HISTORY)
.messageTypeMinContentLength(MessageType.SYSTEM, 1024)
.messageTypeMinContentLength(MessageType.USER, 512)
.messageTypeMinContentLength(MessageType.ASSISTANT, 512)
.build();
For precise token counting, you can provide a custom length function:
AnthropicCacheOptions cache = AnthropicCacheOptions.builder()
.strategy(AnthropicCacheStrategy.CONVERSATION_HISTORY)
.contentLengthFunction(text -> customTokenCounter.count(text))
.build();
By default, Spring AI uses string length as a proxy for token count, but for production scenarios, you might want to consider injecting a proper token counter.
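Even a rough characters-per-token heuristic gets you closer than raw string length. A minimal sketch, assuming the length function accepts an integer estimate as in the snippet above (the four-characters-per-token ratio is a common rule of thumb, not an Anthropic-published constant); a real tokenizer library can be plugged in the same way:
AnthropicCacheOptions cache = AnthropicCacheOptions.builder()
        .strategy(AnthropicCacheStrategy.CONVERSATION_HISTORY)
        // Approximate tokens as ceil(length / 4) so the minimum-length thresholds
        // line up roughly with real token counts instead of character counts.
        .contentLengthFunction(text -> (text.length() + 3) / 4)
        .build();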
For those interested in internals, here's how Spring AI handles cache management:
┌─────────────────────────────────────────────────────────┐
│ Request Flow: │
│ │
│ Application │
│ ↓ │
│ AnthropicChatModel │
│ ↓ │
│ CacheEligibilityResolver │
│ (created from strategy) │
│ ↓ │
│ For each component: │
│ - Check strategy eligibility │
│ - Verify minimum content length │
│ - Confirm breakpoint availability (<4) │
│ - Add cache_control if eligible │
│ ↓ │
│ Build request with cache markers │
│ ↓ │
│ Add beta headers if needed (1h TTL) │
│ ↓ │
│ Anthropic API │
│ ↓ │
│ Response with cache metrics │
└─────────────────────────────────────────────────────────┘
The CacheEligibilityResolver determines whether each message or tool qualifies for caching based on the chosen strategy, message type eligibility, content length requirements, and available breakpoints. The CacheBreakpointTracker enforces Anthropic's 4-breakpoint limit with thread-safe tracking per request.
For CONVERSATION_HISTORY, Spring AI uses aggregate eligibility checking—it considers the combined content of all message types (user, assistant, tool) within the last ~20 content blocks when determining cache eligibility. This prevents short user questions (such as "Tell me more") from blocking cache creation when there are substantial assistant responses in the conversation history.
Avoid caching when:
- Your prompts fall below the model's minimum cacheable token thresholds, since the content will not be cached anyway.
- Content changes on every request, so no exact prefix is ever repeated and you only pay the write premium.
- You're making one-off requests (the NONE strategy exists for exactly this case).

Avoid these common mistakes:
- Letting small formatting differences (whitespace, capitalization, reordered sections) creep into otherwise identical prompts; every character affects the prefix hash.
- Changing tool definitions or system prompts mid-conversation while using CONVERSATION_HISTORY, which invalidates the entire conversation cache.
- Formatting per-request data into the system prompt instead of the user message, which guarantees a cache miss on every call.
Caching works seamlessly with both streaming and non-streaming responses. Cache metrics appear in the final response chunk when using streaming. There is no difference in cache behavior between the two modes.
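For example, with the reactive API you can read the cache counters off the last emitted ChatResponse. A minimal sketch reusing the earlier SYSTEM_ONLY setup (blockLast() is used only to keep the example short):
Flux<ChatResponse> stream = chatModel.stream(new Prompt(
        List.of(new SystemMessage(systemPrompt), new UserMessage("Summarize the key points.")),
        AnthropicChatOptions.builder()
                .cacheOptions(AnthropicCacheOptions.builder()
                        .strategy(AnthropicCacheStrategy.SYSTEM_ONLY)
                        .build())
                .build()));

ChatResponse last = stream.blockLast();
if (last != null && last.getMetadata().getUsage().getNativeUsage() instanceof AnthropicApi.Usage usage) {
    System.out.println("Cache read tokens (reported in the final chunk): " + usage.cacheReadInputTokens());
}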
Prompt caching in Anthropic Claude provides significant cost and latency benefits for applications with reusable prompt content. Spring AI's strategic approach simplifies implementation by automatically handling cache breakpoint placement, breakpoint limits, and TTL configuration.
The five caching strategies cover common usage patterns, from simple system prompt caching to complex multi-turn conversations. For most applications, selecting the appropriate strategy based on content stability patterns is sufficient—Spring AI handles the implementation details.
For additional information, see the Spring AI Anthropic documentation and Anthropic's prompt caching guide.