Optimizing AI Token Usage in the OpenAI Responses API

Efficient token usage is one of the most important factors when building applications with large language models. Whether you’re running a chatbot, an AI assistant, or a retrieval-augmented system, poor token management can quickly lead to unnecessary costs and latency.

 

 

 

This guide walks through the most effective techniques—ranked from highest to lowest impact—and shows how to apply each one with clear examples.

 

1. Prompt Caching (Highest Impact)

What it is

Prompt caching allows the API to reuse previously processed tokens when the beginning of your prompt remains identical across requests.

Why it matters

  • Cached tokens are billed at a significantly reduced rate
  • Eliminates repeated processing of large static prompts
  • Ideal for systems with consistent instructions or context

When to use it

  • Large system prompts (policies, instructions, formatting rules)
  • Reusable RAG prefixes (documents, knowledge base excerpts)
  • Template-based applications

 

How to apply it

Keep the first part of your request identical across calls.

{
  "model": "gpt-5.3",
  "input": [
    {
      "role": "system",
      "content": "You are a financial assistant. Follow these rules strictly: (long static instructions...)"
    },
    {
      "role": "user",
      "content": "Explain inflation in simple terms"
    }
  ]
}

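If the system message stays byte-for-byte identical across requests, that prefix becomes eligible for caching. OpenAI's documentation states that caching applies automatically to prompts of 1,024 tokens or more, so very short prompts will not benefit.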
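One way to confirm that the prefix is actually being reused is to inspect the usage data returned with each response. The sketch below assumes the usage object looks like `response.usage` in the Responses API, with `input_tokens` and `input_tokens_details.cached_tokens`. The helper name `cached_fraction` is illustrative only, and a stand-in object replaces the real API response so the example runs offline.

```python
from types import SimpleNamespace


def cached_fraction(usage) -> float:
    """Return the share of input tokens that were served from the cache.

    `usage` is assumed to expose `input_tokens` and
    `input_tokens_details.cached_tokens`, like `response.usage`.
    """
    total = usage.input_tokens
    if not total:
        return 0.0
    details = getattr(usage, "input_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return cached / total


# Stand-in for response.usage so the sketch runs without an API call.
example_usage = SimpleNamespace(
    input_tokens=2000,
    input_tokens_details=SimpleNamespace(cached_tokens=1536),
)
print(f"{cached_fraction(example_usage):.0%} of input tokens were cached")
```

Logging this value per request makes it easier to notice when a prompt change has silently broken the shared prefix.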


Best practices

  • Place static content at the very beginning
  • Avoid changing the wording, adding timestamps, or injecting dynamic variables in the static prefix
  • Move dynamic content (user input, session data) to the end
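The practices above can be sketched as a small request builder. This is a minimal illustration, not a prescribed pattern: the names `STATIC_INSTRUCTIONS` and `build_input` and the `session_note` parameter are hypothetical. The point is that the timestamp and session data travel in the final message, so the leading system message never changes.

```python
from datetime import datetime, timezone

# Static prefix: defined once and never formatted with per-request values.
STATIC_INSTRUCTIONS = (
    "You are a financial assistant. Follow these rules strictly: "
    "(long static instructions...)"
)


def build_input(user_text: str, session_note: str = "") -> list:
    """Put the unchanging system message first and all dynamic data last."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    parts = [f"Current time: {now}", session_note, user_text]
    dynamic = "\n".join(part for part in parts if part)
    return [
        {"role": "system", "content": STATIC_INSTRUCTIONS},
        {"role": "user", "content": dynamic},
    ]


first = build_input("Explain inflation in simple terms")
second = build_input("What is a bond?", session_note="tier=premium")
# The leading message is identical across calls, so the prefix can be reused.
print(first[0] == second[0])
```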

Common mistake

- "Follow these rules:"
+ "Follow these rules: "

Even a small whitespace difference can invalidate caching.
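A cheap guard against this kind of drift is to fingerprint the static prefix and compare it across deployments. The sketch below is one possible approach using a SHA-256 hash. The function name `prefix_fingerprint` is hypothetical, and the text is deliberately not normalized, because the goal is to detect byte-level differences.

```python
import hashlib


def prefix_fingerprint(static_prefix: str) -> str:
    """Hash the static prefix exactly as sent, without normalizing it."""
    return hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()[:12]


deployed = "Follow these rules:"
edited = "Follow these rules: "  # trailing space added by accident

# Different fingerprints mean the two prefixes are not byte-identical.
print(prefix_fingerprint(deployed) == prefix_fingerprint(edited))
```

Storing the fingerprint alongside request logs gives a quick way to correlate a drop in cached tokens with a prompt edit.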

 

2. Conversation State Management (High Impact)

What it is

Instead of resending the entire conversation with every request, you maintain conversation state using:

  • previous_response_id
  • Or your own external memory system

Why it matters

  • Avoids resending an ever-growing transcript, whose cumulative token count grows roughly quadratically with the number of turns
  • Essential for chat-based applications
  • Reduces both cost and latency

 

Approach A: Using previous_response_id

Example:

{
  "model": "gpt-5.3",
  "previous_response_id": "resp_abc123",
  "input": "Can you expand on that?"
}

How it works:

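  • The server stores the prior turns and rebuilds the context for you
  • According to OpenAI's documentation, earlier input tokens in the chain are still billed as input tokens, so the saving is mainly in request payload and client-side bookkeeping (the stable history can still benefit from prompt caching)
  • You only send the new input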
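In Python, the same chaining can be sketched as a helper that sends only the new turn. This assumes `client` is an `openai.OpenAI()` instance. The function name `continue_chat` is hypothetical, and the default model name is simply carried over from the JSON example.

```python
def continue_chat(client, previous_id, user_text, model="gpt-5.3"):
    """Send only the new user turn and chain it to the prior response.

    `client` is assumed to be an `openai.OpenAI()` instance.
    """
    kwargs = {"model": model, "input": user_text}
    if previous_id is not None:
        kwargs["previous_response_id"] = previous_id
    response = client.responses.create(**kwargs)
    # Return the new ID so the caller can chain the next turn to it.
    return response.id, response.output_text


# Usage (requires the `openai` package and an API key):
#   from openai import OpenAI
#   client = OpenAI()
#   rid, text = continue_chat(client, None, "Explain inflation in simple terms")
#   rid, text = continue_chat(client, rid, "Can you expand on that?")
```

The only state the application has to keep between turns is the last response ID.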

Approach B: Using the Conversations API and passing the conversation ID

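from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Create the conversation once, then pass its ID on every turn
conversation = client.conversations.create()

response = client.responses.create(
    model="gpt-4.1",
    input=[{"role": "user", "content": "What are the 5 Ds of dodgeball?"}],
    conversation=conversation.id,
)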

Best practices:

  • Keep only user preferences, key facts, and relevant prior steps
  • Drop irrelevant conversation turns
  • Avoid sending full chat logs
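If you manage state yourself, these practices amount to storing a compact record rather than a transcript. The sketch below is one hypothetical shape for such an external memory. The class `SessionMemory`, its fields, and the `max_steps` limit are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SessionMemory:
    """Hypothetical external memory: compact state instead of a chat log."""

    preferences: dict = field(default_factory=dict)
    facts: list = field(default_factory=list)
    recent_steps: list = field(default_factory=list)
    max_steps: int = 3  # keep only the most recent relevant steps

    def add_step(self, step: str) -> None:
        """Record a step and drop the oldest ones beyond the limit."""
        self.recent_steps.append(step)
        self.recent_steps = self.recent_steps[-self.max_steps:]

    def render(self) -> str:
        """Render the memory as a short context block for the next request."""
        prefs = ", ".join(f"{k}={v}" for k, v in self.preferences.items())
        lines = [
            f"Preferences: {prefs}",
            "Key facts: " + "; ".join(self.facts),
            "Recent steps: " + "; ".join(self.recent_steps),
        ]
        return "\n".join(lines)


memory = SessionMemory(preferences={"language": "Python"})
memory.facts.append("Building a fintech app")
for step in ["chose REST", "added auth", "set up CI", "profiled endpoints"]:
    memory.add_step(step)
print(memory.render())
```

The rendered block can be sent as a short context message in place of the full chat log.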

 

3. Context Compaction (Medium Impact)

What it is

Context compaction reduces token usage by summarizing or compressing earlier conversation history. 

Why it matters

  • Prevents context windows from growing indefinitely
  • Maintains essential information in fewer tokens
  • Works well alongside state management
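A rough calculation shows why bounding the history matters. The sketch below assumes every turn adds the same number of tokens and ignores the cost of the summarization calls themselves. The function names and the numbers used are illustrative, not measurements.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when every request carries the full history."""
    # Request k carries k turns, so the total is t * (1 + 2 + ... + n).
    return tokens_per_turn * turns * (turns + 1) // 2


def cumulative_with_cap(turns: int, tokens_per_turn: int, cap: int) -> int:
    """Total input tokens when the history is compacted to at most `cap`."""
    total = 0
    for k in range(1, turns + 1):
        history = min((k - 1) * tokens_per_turn, cap)
        total += history + tokens_per_turn
    return total


# 50 turns of 200 tokens each, with and without a 2,000-token history cap.
print(cumulative_input_tokens(50, 200))
print(cumulative_with_cap(50, 200, cap=2000))
```

Under these assumptions the uncapped total grows with the square of the number of turns, while the capped total grows linearly once the cap is reached.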

How to apply it

Step 1: Detect large context

Trigger compaction when the conversation grows beyond a threshold (e.g., 2–4k tokens).
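A threshold check can be sketched as follows. The token count here is only a rough estimate based on the common rule of thumb of about four characters per token for English text; a real tokenizer would be more accurate. The function names and the default threshold of 3,000 are assumptions for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def needs_compaction(turns: list, threshold: int = 3000) -> bool:
    """True when the estimated size of the history exceeds the threshold."""
    total = sum(estimate_tokens(turn["content"]) for turn in turns)
    return total > threshold


history = [
    {"role": "user", "content": "x" * 8000},
    {"role": "assistant", "content": "y" * 6000},
]
print(needs_compaction(history))
```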

Step 2: Summarize

Before:

User: long...
Assistant: long...
User: ...
Assistant: ...

After:

System: Summary: User is building a fintech app, discussed APIs, prefers Python.
User: New question

Example:

{
  "model": "gpt-5.3",
  "input": [
    {
      "role": "system",
      "content": "Summary: User is building a fintech API, cares about latency, uses Python."
    },
    {
      "role": "user",
      "content": "How do I optimize response time?"
    }
  ]
}

Best practices

  • Keep summaries short, fact-based, and structured if possible
  • Periodically refresh summaries
  • Avoid over-compressing critical details

Advanced Tip

Use a separate model call to generate summaries automatically.
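A minimal sketch of that idea follows, assuming `client` is an `openai.OpenAI()` instance and that the summary is requested through `client.responses.create` with an `instructions` string. The helper names, the summary prompt, the `keep_last` parameter, and the default model name are all assumptions made for illustration.

```python
SUMMARY_INSTRUCTIONS = (
    "Summarize the conversation in at most five short, factual bullet points. "
    "Keep user preferences, key facts, and open questions."
)


def summarize_history(client, turns, model="gpt-5.3"):
    """Ask the model for a compact summary of earlier turns.

    `client` is assumed to be an `openai.OpenAI()` instance; `turns` is a
    list of {"role": ..., "content": ...} dicts.
    """
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in turns)
    response = client.responses.create(
        model=model,
        instructions=SUMMARY_INSTRUCTIONS,
        input=transcript,
    )
    return response.output_text.strip()


def compact(client, turns, keep_last=2):
    """Replace all but the last `keep_last` turns with one summary message."""
    if len(turns) <= keep_last:
        return list(turns)
    split = len(turns) - keep_last
    old, recent = turns[:split], turns[split:]
    summary = summarize_history(client, old)
    return [{"role": "system", "content": f"Summary: {summary}"}] + recent
```

Keeping the most recent turns verbatim preserves the detail the next answer is most likely to depend on, while the older turns collapse into a single message.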


 

Putting It All Together: A Production Strategy

The most efficient systems combine all techniques.

 

 

Recommended architecture

  1. Prompt caching: Static system instructions at the top
  2. State management: Maintain only relevant context
  3. Context compaction: Summarize older interactions

Example:

{
  "model": "gpt-5.3",
  "previous_response_id": "resp_456def",
  "input": [
    {
      "role": "system",
      "content": "You are a backend engineering expert. Follow best practices... (static, cached)"
    },
    {
      "role": "system",
      "content": "Summary: User is building a high-performance Node.js API, focusing on latency."
    },
    {
      "role": "user",
      "content": "How do I reduce API response time under heavy load?"
    }
  ]
}
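One possible way to assemble such a request in Python is sketched below. It rests on a design assumption not stated in the article: when a chain is continued with `previous_response_id`, only the new turn is sent, because the server already holds the earlier messages. When a chain starts fresh (first turn or after compaction), the static prefix, the summary, and the new turn are sent in that order. The function name `build_request` and its parameters are hypothetical.

```python
STATIC_INSTRUCTIONS = (
    "You are a backend engineering expert. Follow best practices..."
)


def build_request(user_text, summary=None, previous_id=None, model="gpt-5.3"):
    """Assemble a Responses API payload for one turn.

    Continuing a chain: send only the new turn plus `previous_response_id`.
    Fresh chain: static prefix first, then the summary, then the new turn.
    """
    if previous_id is not None:
        return {
            "model": model,
            "previous_response_id": previous_id,
            "input": [{"role": "user", "content": user_text}],
        }
    messages = [{"role": "system", "content": STATIC_INSTRUCTIONS}]
    if summary:
        messages.append({"role": "system", "content": f"Summary: {summary}"})
    messages.append({"role": "user", "content": user_text})
    return {"model": model, "input": messages}


# Usage (requires the `openai` package and an API key):
#   response = client.responses.create(**build_request("How do I ...?"))
```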

Final Comparison

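| Technique            | Token Savings | Complexity | Best Use Case           |
|----------------------|---------------|------------|-------------------------|
| Prompt Caching       | Very High     | Medium     | Repeated static prompts |
| State Management     | High          | Medium     | Chat applications       |
| Context Compaction   | Medium        | Medium     | Long sessions           |
| previous_response_id | Low           | Low        | Simple continuation     |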

Key Takeaways

  • The biggest savings come from not reprocessing the same tokens
  • Structure your prompts to maximize reuse
  • Never send unnecessary context
  • Combine techniques for maximum efficiency

 
