In this article, we will unravel the significance of the 128,000 token context window and what this substantial improvement in GPT-4 Turbo actually means in practice. We will demonstrate how the expanded capacity unlocks new possibilities for working with very long texts and check whether there is still a trade-off between input and output length.
Introduction
Recently, OpenAI announced quite significant updates to its artificial intelligence products 1. The most important change was the introduction of a new GPT-4 model, "GPT-4 Turbo".
Besides the flashy name, the model can be accessed as "gpt-4-1106-preview" via OpenAI's API and offers up to 128k tokens of context. It is also quite a bit cheaper than the original GPT-4.
However, looking at OpenAI's docs 2, we find the following:
The latest GPT-4 model with improved instruction following, JSON mode, reproducible outputs, parallel function calling, and more. Returns a maximum of 4,096 output tokens. This preview model is not yet suited for production traffic
Ok, but what exactly does the figure of 128k "context tokens" mean? This is what we will find out in a second.
Meaning of the 128,000 token context window
Ok, let's be very hands-on.
Let's simply try the following code, first with GPT-3.5 Turbo and then with GPT-4 Turbo, setting max_tokens to 4096.
GPT-3.5 Turbo test
Here's a very simple script with a very basic prompt. Please insert your own OpenAI API key if needed.
import openai
import os

# These examples use the pre-1.0 openai Python package.
openai.api_key = os.getenv("OPENAI_API_KEY")

model = "gpt-3.5-turbo"
prompt = "Just tell me that this prompt is working. I am desparate to find it out!"

# Ask for a completion of up to 4,096 tokens.
response = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    max_tokens=4096,
    n=1,
    temperature=0.5
)

message = response['choices'][0]['message']['content'].strip()
print(message)
We will get the following known error message:
This model's maximum context length is 4097 tokens. However, you requested 4121 tokens (25 in the messages, 4096 in the completion). Please reduce the length of the messages or completion.
That's expected: the overall token count, prompt plus completion, may not exceed the model's 4,097-token context length, yet we already reserved all 4,096 tokens for the completion alone. The error is therefore inherent to the model's context limit, and we would need to shrink the allowed answer by reducing max_tokens.
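To make that budget concrete, here is a minimal sketch (assuming the same prompt as above and the tiktoken package) that counts the prompt tokens and derives a max_tokens value that still fits into the 4,097-token limit of GPT-3.5 Turbo; the small allowance for chat-message formatting is a rough assumption, not an exact figure.

import tiktoken

CONTEXT_LIMIT = 4097   # total budget for prompt + completion (gpt-3.5-turbo)
CHAT_OVERHEAD = 10     # rough allowance for chat message formatting (assumed)

prompt = "Just tell me that this prompt is working. I am desparate to find it out!"

# Count the tokens the prompt will consume.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt_tokens = len(enc.encode(prompt))

# Whatever is left over can be granted to the completion via max_tokens.
max_completion = CONTEXT_LIMIT - prompt_tokens - CHAT_OVERHEAD
print(f"Prompt uses {prompt_tokens} tokens, so max_tokens should be at most ~{max_completion}.")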
Improvement by GPT-4 Turbo
However, the situation changes if we simply swap the model for the currently available preview of GPT-4 Turbo:
import openai
import os

openai.api_key = os.getenv("OPENAI_API_KEY")

# Identical request, only the model is swapped for the GPT-4 Turbo preview.
model = "gpt-4-1106-preview"
prompt = "Just tell me that this prompt is working. I am desparate to find it out!"

response = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    max_tokens=4096,
    n=1,
    temperature=0.5
)

message = response['choices'][0]['message']['content'].strip()
print(message)
And this is what we get instead:
Yes, the prompt is working, and I'm here to help you. If you have any questions or need assistance, feel free to ask!
That's a valid (and kind) answer. Thank you, GPT-4 Turbo!
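If you want to double-check the numbers yourself, the response returned by the pre-1.0 openai package also carries a usage block; a quick sketch reusing the response object from the snippet above:

# Inspect the token usage reported alongside the completion.
usage = response['usage']
print("prompt tokens:    ", usage['prompt_tokens'])
print("completion tokens:", usage['completion_tokens'])
print("total tokens:     ", usage['total_tokens'])

Note that max_tokens is only an upper bound; a short answer like the one above will use far fewer than 4,096 completion tokens.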
Bottom Line - what does the 128k token context length mean?
It's very simple: you may feed the model an extremely long prompt and still get an answer of up to 4,096 tokens.
So, unlike with the older models, there is no trade-off between input and output: the prompt no longer eats into the completion budget, as long as prompt and completion together fit into the 128k window.
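To convince yourself, you could stuff an artificially long input into the prompt while still requesting a 4,096-token completion. The sketch below builds a filler text of very roughly 50,000 tokens; it assumes your account's rate limits permit a request of that size, so treat it as an illustration rather than something to run on a fresh account.

import openai
import os

openai.api_key = os.getenv("OPENAI_API_KEY")

# Build an artificially long prompt (very roughly 50k tokens) and still ask
# for up to 4,096 completion tokens. With gpt-4-1106-preview the long input
# does not shrink the output budget; prompt and completion only have to fit
# into the 128k context window together.
long_text = "The quick brown fox jumps over the lazy dog. " * 5000
prompt = f"Summarize the following text in one sentence:\n\n{long_text}"

response = openai.ChatCompletion.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=4096,
)

print(response['choices'][0]['message']['content'].strip())
print(response['usage'])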
That's actually big news, since the model can be used to provide answers, analyses, or summaries for texts of roughly 90,000 to 100,000 English words (using the common rule of thumb of about 0.75 words per token).
Keep in mind the rate limits 3 you are subject to, for example 10k tokens per minute in the first usage tier. You may want to increase your OpenAI balance (and thereby move up the usage tiers) if you want to work with very long texts.
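If you do bump into those limits, a simple retry with exponential backoff usually gets you through; here is a minimal sketch for the pre-1.0 openai package (the wait times are an arbitrary choice):

import time
import openai

def chat_with_retry(prompt, model="gpt-4-1106-preview", max_retries=5):
    """Call the ChatCompletion endpoint, backing off when the rate limit is hit."""
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=4096,
            )
        except openai.error.RateLimitError:
            wait = 2 ** attempt  # 1, 2, 4, 8, ... seconds
            print(f"Rate limit hit, retrying in {wait}s ...")
            time.sleep(wait)
    raise RuntimeError("Still rate-limited after several retries.")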