Understanding Language Model Continuations: A Comprehensive Guide
In the realm of natural language processing (NLP), language models have become indispensable tools for generating text continuations based on given prompts. When we denote a prompt as \(x_{0:m}\) and a possible continuation as \(x_{m:n_i}\), it becomes crucial to understand how to rank these continuations effectively. Since language models primarily provide log probabilities for the next token given the preceding context, scoring whole continuations introduces certain complexities. Let's explore the different methodologies for evaluating these continuations.
Unnormalized Score Method
The unnormalized score for a continuation \(i\) is the sum of the log probabilities of each token in the continuation. Formally, it can be written as:
\[
\sum_{j=m}^{n_i - 1} \log \mathbb{P}(x_j \mid x_{0:j})
\]
This quantity is the log likelihood of the continuation given the prompt, i.e., how likely a generation starting from the prompt is to produce exactly this continuation. While it is the most straightforward method, it has its pitfalls. Notably, longer continuations accumulate more negative log-probability terms and therefore tend to score lower, which skews the comparison towards shorter outputs. This method is utilized in various multiple-choice tasks and is referred to as acc in the evaluation harness.
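Concretely, given the per-token log probabilities for a continuation, the score is a plain sum. The following is a minimal Python sketch, not the harness's actual implementation, and the numbers are invented for illustration:

```python
def unnormalized_score(token_logprobs):
    """Sum the log probabilities of the continuation's tokens.

    token_logprobs holds log P(x_j | x_{0:j}) for j = m .. n_i - 1,
    as returned by a language model scoring the continuation.
    """
    return sum(token_logprobs)

# Two hypothetical continuations: a short one and a longer one.
short = [-0.5, -0.75]                # 2 tokens
long = [-0.5, -0.75, -0.5, -0.5]     # 4 tokens

# The longer continuation scores lower even though its per-token
# probabilities are comparable -- the length bias described above.
print(unnormalized_score(short))  # -1.25
print(unnormalized_score(long))   # -2.25
```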
Token-Length Normalized Score
To counter the length bias introduced above, the token-length normalized score divides by the number of tokens in the continuation. The score for continuation \(i\) is calculated as:
\[
\frac{1}{n_i - m} \sum_{j=m}^{n_i - 1} \log \mathbb{P}(x_j \mid x_{0:j})
\]
This formula calculates the average log probability per token, attempting to equalize the impact of length. However, a significant limitation arises: this method is not tokenization agnostic. If two models use different tokenization strategies but yield identical log likelihoods for given strings, their normalized scores could differ. While GPT-3 employs this approach in various tasks, the evaluation harness refrains from reporting it due to its tokenization dependence.
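To make the tokenization dependence concrete, here is a minimal Python sketch; the per-token log probabilities are made up for illustration:

```python
def token_length_normalized_score(token_logprobs):
    """Average log probability per token of the continuation."""
    return sum(token_logprobs) / len(token_logprobs)

# Two hypothetical models assign the same total log likelihood (-2.0)
# to the same string, but tokenize it into different numbers of tokens.
model_a = [-1.0, -1.0]        # 2 tokens
model_b = [-0.5, -0.5, -1.0]  # 3 tokens

# The averages disagree even though the underlying likelihoods match.
print(token_length_normalized_score(model_a))  # -1.0
print(token_length_normalized_score(model_b))  # roughly -0.667
```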
Byte-Length Normalized Score
To address the limitations posed by token-length normalization, the byte-length normalized score offers a more robust solution. The score for continuation \(i\) is given by:
\[
\frac{\sum_{j=m}^{n_i - 1} \log \mathbb{P}(x_j \mid x_{0:j})}{\sum_{j=m}^{n_i - 1} L_{x_j}}
\]
In this context, \(L_{x_j}\) denotes the byte count of each token \(x_j\). By normalizing by byte length rather than token count, this method is tokenization agnostic: two models that assign the same log likelihood to the same string receive the same score regardless of how they split it into tokens. Like the unnormalized score, this method is also employed in multiple-choice tasks and is labeled as acc_norm.
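A sketch of the byte-length variant (the token strings and log probabilities are invented for illustration) shows that the denominator depends only on the underlying string, not on the tokenizer:

```python
def byte_length_normalized_score(tokens, token_logprobs):
    """Sum of token log probabilities divided by the continuation's
    length in UTF-8 bytes, which is independent of the tokenizer."""
    total_bytes = sum(len(t.encode("utf-8")) for t in tokens)
    return sum(token_logprobs) / total_bytes

# The same 8-byte string " the cat", tokenized two different ways but
# assigned the same total log likelihood (-2.0), gets the same score.
score_a = byte_length_normalized_score([" the", " cat"], [-1.2, -0.8])
score_b = byte_length_normalized_score([" the", " c", "at"], [-1.0, -0.5, -0.5])
print(score_a, score_b)  # -0.25 -0.25
```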
Unconditional Likelihood Normalized Score
An alternative approach to scoring continuations is the unconditional likelihood normalized method. The score for continuation \(i\) is computed as follows:
\[
\sum_{j=m}^{n_i - 1} \log \mathbb{P}(x_j \mid x_{0:j}) - \log \mathbb{P}(x_{m:n_i})
\]
This method measures how much the prompt increases the model's likelihood of producing a given continuation relative to the continuation's unconditional probability, i.e., its likelihood when the model is given no prompt at all. Although this approach demonstrated improved performance on tasks such as ARC, OpenBookQA, and RACE when used with GPT-3, it was applied only selectively, and the reason it was not used across all tasks remains unclear. Nevertheless, it offers a nuanced perspective on continuation evaluation.
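A minimal sketch, assuming both the conditional and the unconditional per-token log probabilities have already been obtained (the latter requires a second model call; the numbers are invented):

```python
def unconditional_normalized_score(conditional_logprobs, unconditional_logprobs):
    """Log likelihood of the continuation given the prompt minus its
    log likelihood with no prompt. Positive values mean the prompt
    makes the continuation more likely."""
    return sum(conditional_logprobs) - sum(unconditional_logprobs)

# Here the prompt raises the continuation's likelihood substantially.
conditional = [-0.5, -0.25]   # log P(x_j | prompt + earlier tokens)
unconditional = [-2.0, -1.5]  # log P(x_j | earlier tokens only)
print(unconditional_normalized_score(conditional, unconditional))  # 2.75
```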
Efficiency of Scoring Methods
When considering the practicality of these scoring methods, it’s essential to note that the unnormalized, token-length normalized, and byte-length normalized metrics can be computed without additional calls to the language model. This efficiency is particularly beneficial in real-world applications where computational resources are often constrained. Conversely, the unconditional likelihood normalized metric necessitates an additional call to the language model to obtain the unconditional likelihood, which could impact performance in resource-sensitive environments.
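The call pattern above can be sketched as follows. The model call here is a toy stand-in (its outputs are fabricated purely to make the example runnable), but the structure reflects the point: one call supports the first three metrics, and only the unconditional metric needs a second.

```python
def call_model(context_tokens, continuation_tokens):
    """Stand-in for a language model call: returns one log probability
    per continuation token. A real implementation would run a forward
    pass over context + continuation. Toy behaviour: tokens become
    easier to predict as the context grows."""
    return [-1.0 / (len(context_tokens) + 1)] * len(continuation_tokens)

def all_scores(prompt_tokens, continuation_tokens):
    cond = call_model(prompt_tokens, continuation_tokens)  # call 1
    total = sum(cond)
    n_bytes = sum(len(t.encode("utf-8")) for t in continuation_tokens)
    scores = {
        # These three reuse the same per-token log probabilities.
        "unnormalized": total,
        "token_normalized": total / len(cond),
        "byte_normalized": total / n_bytes,
    }
    # Only this metric needs a second, prompt-free model call.
    uncond = call_model([], continuation_tokens)           # call 2
    scores["unconditional_normalized"] = total - sum(uncond)
    return scores
```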
Conclusion
In the dynamic landscape of NLP, understanding how to rank continuations from language models is pivotal. Each scoring method—whether unnormalized, token-length normalized, byte-length normalized, or unconditional likelihood normalized—offers unique advantages and drawbacks. By exploring these methodologies, researchers and practitioners can make informed decisions that enhance the performance and reliability of language model applications.
Inspired by: Source

