When developing and testing LLM applications, you often make the same requests repeatedly during debugging and iteration. Helicone caching stores complete responses on Cloudflare’s edge network, eliminating redundant API calls and reducing both latency and costs.
Looking for provider-level caching? Learn about Prompt Caching to cache prompts directly on provider servers (OpenAI, Anthropic, etc.) for reduced token costs.
Number of different responses to store for the same request. Useful for non-deterministic prompts.

- Default: `"1"` (a single response is cached)
- Example: `"3"` caches up to 3 different responses
```typescript
const response = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Analyze this large document with cached context..." }],
    prompt_cache_key: `doc-analysis-${documentId}` // Different per document
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",
      "Helicone-Cache-Ignore-Keys": "prompt_cache_key", // Ignore this for Helicone cache
      "Cache-Control": "max-age=3600" // Cache for 1 hour
    }
  }
);

// Requests with the same messages but different prompt_cache_key values
// will hit Helicone's cache, while still leveraging OpenAI's prompt caching
// for improved performance and cost savings on both sides
```
This approach:

- Uses OpenAI's prompt caching for faster processing of repeated context
- Uses Helicone's caching for instant responses to identical requests
- Ignores `prompt_cache_key` so the Helicone cache works across different OpenAI cache entries
- Maximizes cost savings by combining both caching strategies
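To make the ignore-keys behavior concrete, here is an illustrative sketch (our own simplification, not Helicone's actual implementation) of how dropping ignored body keys before hashing lets two requests with different `prompt_cache_key` values map to the same cache entry:

```typescript
// Sketch only: Helicone's real cache-key derivation is internal to its edge
// workers; this just demonstrates the effect of Helicone-Cache-Ignore-Keys.
type Json = Record<string, unknown>;

// Build a cache key from the request body, skipping any ignored keys.
function cacheKey(body: Json, ignoreKeys: string[]): string {
  const filtered: Json = {};
  for (const k of Object.keys(body).sort()) {
    if (!ignoreKeys.includes(k)) filtered[k] = body[k];
  }
  return JSON.stringify(filtered);
}

// Two requests identical except for prompt_cache_key (hypothetical values)
const reqA: Json = {
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Analyze this large document..." }],
  prompt_cache_key: "doc-analysis-1",
};
const reqB: Json = { ...reqA, prompt_cache_key: "doc-analysis-2" };

// With prompt_cache_key ignored, both requests share one cache entry;
// without ignoring it, they would miss each other's cache.
const sameEntry =
  cacheKey(reqA, ["prompt_cache_key"]) === cacheKey(reqB, ["prompt_cache_key"]);
```

Here `sameEntry` is `true`, while `cacheKey(reqA, [])` and `cacheKey(reqB, [])` differ.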
Avoid repeated charges while debugging and iterating on prompts:
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-Cache-Enabled": "true",
    "Cache-Control": "max-age=86400" // Cache for 1 day during development
  },
});

// This request will be cached - works with any model
const response = await client.chat.completions.create({
  model: "gpt-4o-mini", // or "claude-3.5-sonnet", "gemini-2.5-flash", etc.
  messages: [{ role: "user", content: "Explain quantum computing" }]
});

// Subsequent identical requests return the cached response instantly
```
Cache responses separately for different users or contexts:
```typescript
const userId = "user-123";

const response = await client.chat.completions.create(
  {
    model: "claude-3.5-sonnet",
    messages: [{ role: "user", content: "What are my account settings?" }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",
      "Helicone-Cache-Seed": userId, // User-specific cache
      "Cache-Control": "max-age=3600" // Cache for 1 hour
    }
  }
);

// Each user gets their own cached responses
```
Control how many different responses are stored for the same request:
```json
{
  "Helicone-Cache-Bucket-Max-Size": "3"
}
```
With bucket size 3, the same request can return one of 3 different cached responses randomly:
```
openai.completion("give me a random number") -> "42"                # Cache Miss
openai.completion("give me a random number") -> "47"                # Cache Miss
openai.completion("give me a random number") -> "17"                # Cache Miss
openai.completion("give me a random number") -> "42" | "47" | "17"  # Cache Hit
```
Behavior by bucket size:

- Size 1 (default): the same request always returns the same cached response (deterministic)
- Size > 1: the same request can return different cached responses (useful for creative prompts)
- The response is chosen randomly from the bucket
Maximum bucket size is 20. Enterprise plans support larger buckets.
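As a convenience, the caching headers above can be assembled in one place. The following is a hypothetical helper (the function name and options object are ours, not part of any Helicone SDK) that builds the header map and enforces the bucket-size limit:

```typescript
// Hypothetical helper for assembling Helicone caching headers.
// The header names match the Helicone docs; everything else is an assumption.
interface CacheOptions {
  maxAgeSeconds?: number; // mapped to Cache-Control: max-age=...
  bucketSize?: number;    // mapped to Helicone-Cache-Bucket-Max-Size
  seed?: string;          // mapped to Helicone-Cache-Seed
}

function heliconeCacheHeaders(opts: CacheOptions = {}): Record<string, string> {
  const headers: Record<string, string> = { "Helicone-Cache-Enabled": "true" };
  if (opts.maxAgeSeconds !== undefined) {
    headers["Cache-Control"] = `max-age=${opts.maxAgeSeconds}`;
  }
  if (opts.bucketSize !== undefined) {
    // Buckets above 20 require an Enterprise plan, so fail fast here.
    if (opts.bucketSize > 20) {
      throw new Error("Bucket sizes above 20 require an Enterprise plan");
    }
    headers["Helicone-Cache-Bucket-Max-Size"] = String(opts.bucketSize);
  }
  if (opts.seed !== undefined) {
    headers["Helicone-Cache-Seed"] = opts.seed;
  }
  return headers;
}
```

A call site could then pass `{ headers: heliconeCacheHeaders({ maxAgeSeconds: 3600, bucketSize: 3 }) }` as the request options, matching the examples above.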