Understanding Token Usage in Conversation History with LLMs
Question: Where is this held? As tokens in the LLM? (Referring to the code that maintains a conversation history list with message roles and content.)
Answer: The conversation history is held in memory as a Python list variable, not as tokens in the LLM.
In the code, you can see this implementation:
# Store conversation history
conversation_history = [
{"role": "system", "content": "You are a helpful assistant."}
]
This is a regular Python list that contains dictionaries with “role” and “content” keys. Throughout the code, this list is updated by appending new messages:
# Update conversation history
conversation_history.append({"role": "user", "content": user_content})
When sending requests to the OpenAI API, this entire conversation history is passed as the “messages” parameter:
response = await openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=conversation_history
)
The LLM (OpenAI’s GPT-3.5-turbo in this case) is stateless between requests: it only sees the conversation history that is sent with each API call. The tokens are processed on OpenAI’s servers during the call, but the history itself is maintained in your Python application’s memory.
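For completeness, the await in the snippet above implies the asynchronous client from the official openai package, and the assistant’s reply is typically appended back onto the history after each call. A minimal sketch, assuming the v1 SDK’s response shape:
from openai import AsyncOpenAI

# Reads the API key from the OPENAI_API_KEY environment variable by default
openai_client = AsyncOpenAI()

# After the create() call, append the reply so it is part of the next request's context
assistant_content = response.choices[0].message.content
conversation_history.append({"role": "assistant", "content": assistant_content})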
This approach allows the client application to:
- Maintain context between multiple interactions
- Reset or modify the conversation history when needed, such as when switching between different assistant types (see the sketch after this list)
- Control exactly what context is sent to the LLM
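As an illustration of the reset case, switching assistant types can be as simple as rebuilding the list around a new system message (the prompt text here is hypothetical):
# Discard prior context and start over with a different persona
conversation_history = [
    {"role": "system", "content": "You are a concise code reviewer."}
]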
Question: So with each question, the message history gets longer? Does that mean more tokens are used as the conversation progresses?
Answer: Yes, exactly! With each question and response, the message history gets longer, and more tokens are used as the conversation progresses.
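For example, after two exchanges the list might look like this (the message contents are hypothetical):
conversation_history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a Python list?"},
    {"role": "assistant", "content": "A list is a mutable, ordered sequence of values..."},
    {"role": "user", "content": "How do I append to one?"},
    {"role": "assistant", "content": "Call its append() method, e.g. my_list.append(42)."},
]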
Each time the code makes a request to the OpenAI API, it sends the entire conversation history as the “messages” parameter. This includes (see the token-counting sketch after this list):
- The initial system message
- All previous user messages
- All previous assistant responses
- The new user message
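You can measure this growth with the tiktoken library. A rough sketch (the exact per-message framing overhead varies by model, so treat the result as an estimate):
import tiktoken

def estimate_tokens(messages, model="gpt-3.5-turbo"):
    """Approximate token count for a chat history; exact overhead varies by model."""
    encoding = tiktoken.encoding_for_model(model)
    total = 0
    for message in messages:
        total += 4  # rough per-message framing overhead
        total += len(encoding.encode(message["content"]))
    return total

print(estimate_tokens(conversation_history))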
As this history grows, the token count increases for each API call. This has two important implications:
- Cost implications: Since OpenAI charges per token, and the ever-growing history counts as input tokens on every call, longer conversations become progressively more expensive.
- Token limit constraints: Each model has a maximum context window (for example, the original gpt-3.5-turbo accepts 4,096 tokens, while later variants such as gpt-3.5-turbo-0125 accept 16,385). If the conversation history plus the requested completion exceeds this limit, the API returns an error; it is up to your application to truncate or summarize earlier messages.
This is why many production applications implement conversation management strategies like:
- Summarizing conversation history
- Truncating older messages
- Keeping only the most relevant context
- Implementing windowing approaches (keeping only the last N messages; see the sketch below)
The code shown doesn’t have these management strategies built in, so it will indeed use more tokens (and potentially cost more) as the conversation continues.
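A minimal sketch of the windowing approach, keeping the system message plus the most recent N messages (N here is arbitrary):
def windowed_history(history, max_messages=10):
    """Keep system messages plus the last max_messages non-system entries."""
    system_messages = [m for m in history if m["role"] == "system"]
    recent = [m for m in history if m["role"] != "system"][-max_messages:]
    return system_messages + recent

response = await openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=windowed_history(conversation_history)
)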
