Efficient token usage is one of the most important factors when building applications with large language models. Whether you’re running a chatbot, an AI assistant, or a retrieval-augmented system, poor token management can quickly lead to unnecessary costs and latency.
This guide walks through the most effective techniques—ranked from highest to lowest impact—and shows how to apply each one with clear examples.
1. Prompt Caching (Highest Impact)
What it is
Prompt caching allows the API to reuse previously processed tokens when the beginning of your prompt remains identical across requests.
Why it matters
- Cached tokens are billed at a significantly reduced rate
- Eliminates repeated processing of large static prompts
- Ideal for systems with consistent instructions or context
When to use it
- Large system prompts (policies, instructions, formatting rules)
- Reusable RAG prefixes (documents, knowledge base excerpts)
- Template-based applications
How to apply it
Keep the first part of your request identical across calls.
{
  "model": "gpt-5.3",
  "input": [
    {
      "role": "system",
      "content": "You are a financial assistant. Follow these rules strictly: (long static instructions...)"
    },
    {
      "role": "user",
      "content": "Explain inflation in simple terms"
    }
  ]
}
If the system message stays byte-for-byte identical and remains at the start of the request, later requests can reuse the cached prefix.
Best practices
- Place static content at the very beginning
- Avoid, in the static prefix:
  - Changing wording
  - Adding timestamps
  - Injecting dynamic variables
- Move dynamic content (user input, session data) to the end
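The ordering rules above can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the helper `build_input`, the constant `STATIC_INSTRUCTIONS`, and the `session_note` parameter are hypothetical names, and the message shape simply mirrors the request shown earlier.

```python
# Static instructions live in one constant so every request starts with
# exactly the same text (placeholder content, assumed for illustration).
STATIC_INSTRUCTIONS = (
    "You are a financial assistant. Follow these rules strictly: ..."
)


def build_input(user_message: str, session_note: str = "") -> list[dict]:
    """Return messages with the static prefix first and dynamic parts last."""
    messages = [{"role": "system", "content": STATIC_INSTRUCTIONS}]
    if session_note:
        # Dynamic session data goes after the static prefix, never inside it.
        messages.append({"role": "system", "content": session_note})
    messages.append({"role": "user", "content": user_message})
    return messages


first = build_input("Explain inflation in simple terms")
second = build_input("What is a bond?", session_note="User locale: en-GB")

# The leading message is the same object content in both requests.
print(first[0] == second[0])
```

Keeping the static text in a single constant makes it harder to introduce accidental per-request variation.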
Common mistake
- "Follow these rules:" vs. "Follow these rules: " (note the trailing space)
Even a small whitespace difference can invalidate caching.
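One way to catch this kind of drift in tests or logs is to fingerprint the static prefix and compare fingerprints between deployments. The sketch below uses a hypothetical helper, `prefix_fingerprint`; it is a local check only and makes no claim about how the API keys its cache.

```python
import hashlib


def prefix_fingerprint(prefix: str) -> str:
    """Hash the exact bytes of the static prefix."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


a = prefix_fingerprint("Follow these rules:")
b = prefix_fingerprint("Follow these rules: ")  # trailing space

# The fingerprints differ because the prefixes are not byte-identical.
print(a == b)  # False
```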
2. Conversation State Management (High Impact)
What it is
Instead of resending the entire conversation with every request, you maintain conversation state using:
- previous_response_id
- Or your own external memory system
Why it matters
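- Keeps request size from growing with every turn (resending the full history makes cumulative token usage grow roughly quadratically with the number of turns)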
- Essential for chat-based applications
- Reduces both cost and latency
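To see why resending history is costly, consider a rough back-of-the-envelope calculation. The sketch assumes every turn adds the same number of tokens and counts only what the client sends; the function name and the figures (20 turns, 200 tokens each) are illustrative assumptions, and how a provider bills server-stored context may differ.

```python
def cumulative_input_tokens(
    turns: int, tokens_per_turn: int, resend_history: bool
) -> int:
    """Total tokens sent by the client over a whole conversation."""
    total = 0
    for turn in range(1, turns + 1):
        if resend_history:
            # Request N carries all N turns accumulated so far.
            total += turn * tokens_per_turn
        else:
            # Only the new turn is sent.
            total += tokens_per_turn
    return total


full = cumulative_input_tokens(20, 200, resend_history=True)
incremental = cumulative_input_tokens(20, 200, resend_history=False)
print(full, incremental)  # 42000 4000
```

Under these assumptions the full-history approach sends about ten times more tokens over 20 turns, and the gap widens as the conversation gets longer.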
Approach A: Using previous_response_id
Example:
{
  "model": "gpt-5.3",
  "previous_response_id": "resp_abc123",
  "input": "Can you expand on that?"
}
How it works:
- The server remembers prior context
- You only send the new input
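On the client side, this amounts to remembering the last response ID and attaching it to the next request. The sketch below is a minimal illustration: `ResponseChain` and `fake_send` are hypothetical, the transport is injected so the example runs without network access, and it assumes each response carries an `id` field.

```python
from typing import Callable, Optional


class ResponseChain:
    """Remember the last response id and attach it to the next request."""

    def __init__(self, send: Callable[[dict], dict], model: str):
        self._send = send  # hypothetical transport: payload -> response dict
        self._model = model
        self.last_response_id: Optional[str] = None

    def ask(self, text: str) -> dict:
        payload = {"model": self._model, "input": text}
        if self.last_response_id is not None:
            payload["previous_response_id"] = self.last_response_id
        response = self._send(payload)
        self.last_response_id = response["id"]  # assumed response field
        return response


# Stand-in transport that records payloads instead of calling an API.
sent = []


def fake_send(payload: dict) -> dict:
    sent.append(payload)
    return {"id": f"resp_{len(sent)}", "output_text": "(stub)"}


chain = ResponseChain(fake_send, model="gpt-5.3")
chain.ask("Explain inflation in simple terms")
chain.ask("Can you expand on that?")
print(sent[1])
```

The second payload contains only the new input plus the previous response ID, not the earlier turn.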
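Approach B: Using the Conversations API and passing the conversation ID
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Create the conversation once, then reference it on each request.
conversation = client.conversations.create()
response = client.responses.create(
    model="gpt-4.1",
    input=[{"role": "user", "content": "What are the 5 Ds of dodgeball?"}],
    conversation=conversation.id,
)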
Best practices:
- Keep only:
  - User preferences
  - Key facts
  - Relevant prior steps
- Drop irrelevant conversation turns
- Avoid sending full chat logs
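If you maintain your own memory instead, these practices could look roughly like the sketch below. `SessionMemory` and its fields are hypothetical names chosen for illustration; the cap on stored steps is an assumed policy, not a recommendation from any API.

```python
from dataclasses import dataclass, field


@dataclass
class SessionMemory:
    """Keep only preferences, key facts, and a few recent steps."""

    preferences: dict = field(default_factory=dict)
    key_facts: list = field(default_factory=list)
    recent_steps: list = field(default_factory=list)
    max_steps: int = 3  # assumed cap; must be at least 1

    def add_step(self, step: str) -> None:
        self.recent_steps.append(step)
        # Drop the oldest steps once the cap is exceeded.
        self.recent_steps = self.recent_steps[-self.max_steps:]

    def render(self) -> str:
        """Produce a compact context string to send instead of a full log."""
        lines = [f"Preference: {k} = {v}" for k, v in self.preferences.items()]
        lines += [f"Fact: {fact}" for fact in self.key_facts]
        lines += [f"Prior step: {step}" for step in self.recent_steps]
        return "\n".join(lines)


memory = SessionMemory(
    preferences={"language": "Python"},
    key_facts=["Building a fintech app"],
)
for i in range(1, 6):
    memory.add_step(f"step {i}")
print(memory.render())
```

Only the three most recent steps survive, so the rendered context stays small no matter how long the session runs.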
3. Context Compaction (Medium Impact)
What it is
Context compaction reduces token usage by summarizing or compressing earlier conversation history.
Why it matters
- Prevents context windows from growing indefinitely
- Maintains essential information in fewer tokens
- Works well alongside state management
How to apply it
Step 1: Detect large context
Trigger compaction when the conversation grows beyond a threshold (e.g., 2–4k tokens).
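A threshold check of this kind might be sketched as follows. The estimate of roughly 4 characters per token is a crude heuristic assumed here for illustration, not a real tokenizer, and the function names and the 3000-token default are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)


def needs_compaction(messages: list[dict], threshold: int = 3000) -> bool:
    """Return True once the estimated history size exceeds the threshold."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    return total > threshold


short_history = [{"role": "user", "content": "Hi"}]
long_history = [{"role": "user", "content": "x" * 20000}]
print(needs_compaction(short_history), needs_compaction(long_history))
```

For production use, a tokenizer matched to the model would give more accurate counts than a character-based estimate.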
Step 2: Summarize
Before:
User: (long message)...
Assistant: (long reply)...
User: ...
Assistant: ...
After:
System: Summary: User is building a fintech app, discussed APIs, prefers Python.
User: New question
Example:
{
  "model": "gpt-5.3",
  "input": [
    {
      "role": "system",
      "content": "Summary: User is building a fintech API, cares about latency, uses Python."
    },
    {
      "role": "user",
      "content": "How do I optimize response time?"
    }
  ]
}
Best practices
- Keep summaries:
  - Short
  - Fact-based
  - Structured if possible
- Periodically refresh summaries
- Avoid over-compressing critical details
Advanced Tip
Use a separate model call to generate summaries automatically.
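A minimal sketch of automated compaction, assuming the summarizer is passed in as a callable. In a real system `summarize` would wrap a separate model call; here it is a stub so the example runs offline. The names `compact_history`, `summarize`, and `keep_recent` are hypothetical.

```python
from typing import Callable


def compact_history(
    messages: list[dict],
    summarize: Callable[[list[dict]], str],
    keep_recent: int = 2,
) -> list[dict]:
    """Replace all but the last `keep_recent` messages with one summary message."""
    if len(messages) <= keep_recent:
        return list(messages)
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    summary = summarize(older)  # in production: a separate model call
    return [{"role": "system", "content": f"Summary: {summary}"}] + recent


def stub_summarize(older: list[dict]) -> str:
    # Stand-in for a model-generated summary.
    return f"{len(older)} earlier messages about a fintech API"


history = [
    {"role": "user", "content": "I am building a fintech API."},
    {"role": "assistant", "content": "Great, which language?"},
    {"role": "user", "content": "Python."},
    {"role": "assistant", "content": "Noted."},
    {"role": "user", "content": "How do I optimize response time?"},
]
compacted = compact_history(history, stub_summarize)
print(len(history), "->", len(compacted))
```

Keeping the most recent turns verbatim helps avoid over-compressing details the next answer depends on.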
Putting It All Together: A Production Strategy
The most efficient systems combine all techniques.
Recommended architecture
- Prompt caching: Static system instructions at the top
- State management: Maintain only relevant context
- Context compaction: Summarize older interactions
Example:
{
  "model": "gpt-5.3",
  "previous_response_id": "resp_456def",
  "input": [
    {
      "role": "system",
      "content": "You are a backend engineering expert. Follow best practices... (static, cached)"
    },
    {
      "role": "system",
      "content": "Summary: User is building a high-performance Node.js API, focusing on latency."
    },
    {
      "role": "user",
      "content": "How do I reduce API response time under heavy load?"
    }
  ]
}
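A request like the one above could be assembled with a small helper. This is a sketch under stated assumptions: `build_request` and its parameters are hypothetical names, and the payload shape simply follows the example in this section.

```python
from typing import Optional


def build_request(
    model: str,
    static_instructions: str,
    summary: str,
    question: str,
    previous_response_id: Optional[str] = None,
) -> dict:
    """Assemble a request: static prefix, then summary, then the new question."""
    request = {
        "model": model,
        "input": [
            # Static, cacheable prefix always comes first.
            {"role": "system", "content": static_instructions},
            # Compacted history replaces older turns.
            {"role": "system", "content": f"Summary: {summary}"},
            # Dynamic user content goes last.
            {"role": "user", "content": question},
        ],
    }
    if previous_response_id:
        request["previous_response_id"] = previous_response_id
    return request


request = build_request(
    model="gpt-5.3",
    static_instructions="You are a backend engineering expert. Follow best practices...",
    summary="User is building a high-performance Node.js API, focusing on latency.",
    question="How do I reduce API response time under heavy load?",
    previous_response_id="resp_456def",
)
print([m["role"] for m in request["input"]])
```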
Final Comparison
| Technique | Token Savings | Complexity | Best Use Case |
|---|---|---|---|
| Prompt Caching | Very High | Medium | Repeated static prompts |
| State Management | High | Medium | Chat applications |
| Context Compaction | Medium | Medium | Long sessions |
| previous_response_id | Low | Low | Simple continuation |
Key Takeaways
- The biggest savings come from not reprocessing the same tokens
- Structure your prompts to maximize reuse
- Never send unnecessary context
- Combine techniques for maximum efficiency


