Optimizing AI Token Usage in the OpenAI Responses API

Efficient token usage is one of the most important factors when building applications with large language models. Whether you’re running a chatbot, an AI assistant, or a retrieval-augmented system, poor token management can quickly lead to unnecessary costs and latency.

This guide walks through the most effective techniques, ranked from highest to lowest impact, and shows how to apply each one with examples.

1. Prompt Caching (Highest Impact)

What it is

Prompt caching allows the API to reuse previously processed tokens when the beginning of your prompt remains identical across requests.

Why it matters

Cached tokens are billed at a significantly reduced rate
Eliminates repeated processing of large static prompts
Ideal for systems with consistent instructions or context
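To make the billing effect concrete, here is a rough cost sketch. The price and discount values are placeholders chosen for illustration, not actual OpenAI rates, and the function name is hypothetical; check the current pricing page for real numbers.

```python
def estimate_input_cost(input_tokens, cached_tokens, price_per_million, cached_discount):
    """Estimate input cost in dollars when part of the prompt is served from cache.

    price_per_million: hypothetical price for 1M uncached input tokens.
    cached_discount: hypothetical fraction taken off cached tokens (0.5 = half price).
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    cached_price = price_per_million * (1 - cached_discount)
    total = uncached_tokens * price_per_million + cached_tokens * cached_price
    return total / 1_000_000


# Illustrative numbers only: 10,000 input tokens, 8,000 of them cached.
without_cache = estimate_input_cost(10_000, 0, price_per_million=2.0, cached_discount=0.5)
with_cache = estimate_input_cost(10_000, 8_000, price_per_million=2.0, cached_discount=0.5)
print(f"without cache: ${without_cache:.4f}, with cache: ${with_cache:.4f}")
```

With these made-up rates, the cached request costs 60% of the uncached one; the larger the static prefix relative to the dynamic part, the bigger the gap.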

When to use it

Large system prompts (policies, instructions, formatting rules)
Reusable RAG prefixes (documents, knowledge base excerpts)
Template-based applications

How to apply it

Keep the first part of your request identical across calls.

{
  "model": "gpt-5.3",
  "input": [
    {
      "role": "system",
      "content": "You are a financial assistant. Follow these rules strictly: (long static instructions...)"
    },
    {
      "role": "user",
      "content": "Explain inflation in simple terms"
    }
  ]
}

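If the system message remains unchanged, the API can serve it from cache on later requests. OpenAI's documentation states that caching applies automatically to prompts of 1,024 tokens or more, so short prompts see no benefit.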
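One way to keep the prefix identical is to define it once as a constant and build every request through a single function. The sketch below only constructs the payload; the function name is hypothetical, and the instruction text is the placeholder from the example above.

```python
# Static prefix: defined once so every request sends identical text.
# The wording is a placeholder, not a real policy document.
STATIC_INSTRUCTIONS = (
    "You are a financial assistant. Follow these rules strictly: "
    "(long static instructions...)"
)


def build_request(user_text, model="gpt-5.3"):
    """Return a Responses API payload with the static part first, dynamic part last."""
    return {
        "model": model,
        "input": [
            {"role": "system", "content": STATIC_INSTRUCTIONS},  # cacheable prefix
            {"role": "user", "content": user_text},              # changes per request
        ],
    }


first = build_request("Explain inflation in simple terms")
second = build_request("What is a bond?")
# Both payloads start with the same system message, so the prefix can be reused.
print(first["input"][0] == second["input"][0])  # prints True
```

With the official `openai` Python client, a payload like this could be sent as `client.responses.create(**build_request(...))`. At the time of writing, the response's usage object reports cache hits under `input_tokens_details.cached_tokens`, which is a convenient way to confirm caching is actually happening.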

Best practices

Place static content at the very beginning
Avoid changing wording, adding timestamps, or injecting dynamic variables into the static prefix
Move dynamic content (user input, session data) to the end

Common mistake

- "Follow these rules:"
+ "Follow these rules: "

Even a small whitespace difference can invalidate caching.
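A cheap guard against this kind of drift is to fingerprint the static prompt and compare it between deployments or in a unit test. This is a minimal sketch using the standard library; the helper name is hypothetical and the check runs entirely on your side, independent of the API.

```python
import hashlib


def prefix_fingerprint(static_prompt):
    """Return a short SHA-256 fingerprint of the static prompt text."""
    return hashlib.sha256(static_prompt.encode("utf-8")).hexdigest()[:12]


original = "Follow these rules:"
changed = "Follow these rules: "  # trailing space added by accident

# Different fingerprints mean the prefix is no longer byte-identical.
print(prefix_fingerprint(original) == prefix_fingerprint(changed))  # prints False
```

Logging the fingerprint alongside each request makes it easy to spot when an innocent-looking template edit silently changed the prefix.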

2. Conversation State Management (High Impact)

What it is

Instead of resending the entire conversation with every request, you maintain conversation state using:

previous_response_id
The Conversations API
Or your own external memory system

Why it matters

Prevents runaway growth in token usage (resending full history makes cumulative usage grow roughly quadratically with the number of turns)
Essential for chat-based applications
Reduces both cost and latency
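A quick calculation shows how fast cumulative usage climbs when the full history is resent on every request. The sketch assumes, as a simplification, that each turn adds a fixed number of tokens; the function name and the 200-token figure are illustrative.

```python
def cumulative_input_tokens(turns, tokens_per_turn):
    """Total input tokens sent over a chat when every request resends all prior turns.

    Simplification: each turn adds tokens_per_turn tokens, and request n
    carries n turns' worth of input.
    """
    return sum(n * tokens_per_turn for n in range(1, turns + 1))


for turns in (5, 10, 20):
    print(turns, cumulative_input_tokens(turns, tokens_per_turn=200))
```

Under these assumptions, going from 10 to 20 turns roughly quadruples the total (11,000 to 42,000 tokens), which is why long chats get expensive faster than their length suggests.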

Approach A: Using previous_response_id

Example:

{
  "model": "gpt-5.3",
  "previous_response_id": "resp_abc123",
  "input": "Can you expand on that?"
}

How it works:

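The server remembers prior context
You only send the new input
Earlier turns in the chain still count as input tokens for billing, so the main savings are in payload size and client-side bookkeeping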
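In Python, the same request can be wrapped in a small helper. This is a sketch assuming the official `openai` client; the helper name `ask_follow_up` is hypothetical, and the model name is the one used in this article's examples.

```python
def ask_follow_up(client, previous_response_id, text, model="gpt-5.3"):
    """Send only the new user input, linking it to the prior response by ID.

    `client` is expected to be an `openai.OpenAI()` instance (or any object
    exposing the same `responses.create` method).
    """
    return client.responses.create(
        model=model,
        previous_response_id=previous_response_id,
        input=text,
    )


# Usage with the official client (requires OPENAI_API_KEY and network access):
#   from openai import OpenAI
#   client = OpenAI()
#   first = client.responses.create(model="gpt-5.3", input="Explain inflation")
#   follow_up = ask_follow_up(client, first.id, "Can you expand on that?")
#   print(follow_up.output_text)
```

Passing the client in as an argument keeps the helper easy to test with a stub and avoids creating a new client per call.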

Approach B: Using the Conversations API and passing the conversation ID

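from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Create the conversation once, then pass its ID with each request
conversation = client.conversations.create()

response = client.responses.create(
  model="gpt-4.1",
  input=[{"role": "user", "content": "What are the 5 Ds of dodgeball?"}],
  conversation=conversation.id
)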

Best practices:

Keep only user preferences, key facts, and relevant prior steps
Drop irrelevant conversation turns
Avoid sending full chat logs
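If you manage history yourself, these practices can be sketched as a small pruning step before each request. The function name, the "Known facts" wording, and the default of four retained messages are all illustrative choices, not part of the API.

```python
def prune_history(messages, pinned_facts, keep_last=4):
    """Keep a short list of pinned facts plus the most recent messages.

    messages: list of {"role": ..., "content": ...} dicts, oldest first.
    pinned_facts: strings worth keeping all session (preferences, key facts).
    keep_last: how many recent messages to retain; 4 is an arbitrary default.
    """
    recent = messages[-keep_last:] if keep_last > 0 else []
    if not pinned_facts:
        return list(recent)
    memory = {"role": "system", "content": "Known facts: " + "; ".join(pinned_facts)}
    return [memory] + recent


history = [{"role": "user", "content": f"message {i}"} for i in range(10)]
trimmed = prune_history(history, ["prefers Python", "building a fintech app"], keep_last=2)
print(len(trimmed))  # prints 3: one memory message plus the last two turns
```

Deciding which facts to pin is application-specific; a simple starting point is to pin anything the user states as a preference or constraint.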

3. Context Compaction (Medium Impact)

What it is

Context compaction reduces token usage by summarizing or compressing earlier conversation history.

Why it matters

Prevents context windows from growing indefinitely
Maintains essential information in fewer tokens
Works well alongside state management

How to apply it

Step 1: Detect large context

Trigger compaction when the conversation grows beyond a threshold (e.g., 2–4k tokens).
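A threshold check can be as simple as the sketch below. It uses a rough heuristic of about four characters per token for English text, which is an assumption rather than a real tokenizer; swap in an actual tokenizer if you need exact counts. The function names and the 3,000-token default are illustrative.

```python
def estimate_tokens(text):
    """Very rough token estimate: about 4 characters per token for English text.

    This is a heuristic, not a tokenizer.
    """
    return max(1, len(text) // 4)


def needs_compaction(messages, threshold_tokens=3000):
    """Return True when the estimated size of the history exceeds the threshold."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    return total > threshold_tokens


short_chat = [{"role": "user", "content": "Hi"}]
long_chat = [{"role": "user", "content": "x" * 20_000}]
print(needs_compaction(short_chat), needs_compaction(long_chat))  # prints False True
```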

Step 2: Summarize

Before:

User: (long message...)
Assistant: (long reply...)
User: ...
Assistant: ...

After:

System: Summary: User is building a fintech app, discussed APIs, prefers Python.
User: New question

Example:

{
  "model": "gpt-5.3",
  "input": [
    {
      "role": "system",
      "content": "Summary: User is building a fintech API, cares about latency, uses Python."
    },
    {
      "role": "user",
      "content": "How do I optimize response time?"
    }
  ]
}

Best practices

Keep summaries short, fact-based, and structured if possible
Periodically refresh summaries
Avoid over-compressing critical details

Advanced Tip

Use a separate model call to generate summaries automatically.
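A minimal sketch of that idea follows, assuming the official `openai` Python client is passed in as `client`. The helper names, the summarization instructions, and the 60-word limit are illustrative assumptions; tune them for your application.

```python
def summarize_history(client, messages, model="gpt-5.3"):
    """Ask the model for a compact, fact-based summary of earlier messages."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    response = client.responses.create(
        model=model,
        instructions=(
            "Summarize the conversation in under 60 words. "
            "Keep only user goals, preferences, and decisions."
        ),
        input=transcript,
    )
    return response.output_text


def compact(client, messages, keep_last=2):
    """Replace all but the last `keep_last` messages with one summary message."""
    if len(messages) <= keep_last:
        return list(messages)
    split = len(messages) - keep_last
    older, recent = messages[:split], messages[split:]
    summary = summarize_history(client, older)
    return [{"role": "system", "content": f"Summary: {summary}"}] + recent
```

The summarization call itself costs tokens, so it pays off only when the compacted history is reused across several later requests. A smaller, cheaper model is often a reasonable choice for this step.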

Putting It All Together: A Production Strategy

The most efficient systems combine all techniques.

Recommended architecture

Prompt caching: Static system instructions at the top
State management: Maintain only relevant context
Context compaction: Summarize older interactions

Example:

{
  "model": "gpt-5.3",
  "previous_response_id": "resp_456def",
  "input": [
    {
      "role": "system",
      "content": "You are a backend engineering expert. Follow best practices... (static, cached)"
    },
    {
      "role": "system",
      "content": "Summary: User is building a high-performance Node.js API, focusing on latency."
    },
    {
      "role": "user",
      "content": "How do I reduce API response time under heavy load?"
    }
  ]
}
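For applications that manage history on their own side rather than through `previous_response_id`, the layering can be sketched as a single assembly function. The function name and argument layout are hypothetical; the point is the ordering, from most stable to most volatile.

```python
def assemble_input(static_instructions, summary, recent_messages, user_text):
    """Order the input so the most stable content comes first.

    1. static instructions (identical across requests, so the prefix can be cached)
    2. rolling summary of older turns (changes occasionally)
    3. recent messages kept verbatim
    4. the new user input (changes every request)
    """
    items = [{"role": "system", "content": static_instructions}]
    if summary:
        items.append({"role": "system", "content": f"Summary: {summary}"})
    items.extend(recent_messages)
    items.append({"role": "user", "content": user_text})
    return items


payload_input = assemble_input(
    "You are a backend engineering expert. Follow best practices...",
    "User is building a high-performance Node.js API, focusing on latency.",
    [],
    "How do I reduce API response time under heavy load?",
)
print([item["role"] for item in payload_input])  # prints ['system', 'system', 'user']
```

Because the summary sits after the static instructions, refreshing it changes only the tail of the prompt and leaves the cacheable prefix untouched.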

Final Comparison

Technique	Token Savings	Complexity	Best Use Case
Prompt Caching	Very High	Medium	Repeated static prompts
State Management	High	Medium	Chat applications
Context Compaction	Medium	Medium	Long sessions
previous_response_id	Low	Low	Simple continuation

Key Takeaways

The biggest savings come from not reprocessing the same tokens
Structure your prompts to maximize reuse
Never send unnecessary context
Combine techniques for maximum efficiency
