Exploring ORFuzz: A Breakthrough in Detecting Over-Refusals in Large Language Models
As large language models (LLMs) continue to evolve and integrate into various applications, the challenge of ensuring their reliability becomes increasingly apparent. One of the critical issues faced by these models is over-refusal, where they erroneously reject harmless queries due to overly stringent safety measures. This flaw not only undermines user trust but also hampers the models’ usability in real-world scenarios.
- Understanding Over-Refusals in LLMs
- Limitations of Current Testing Methods
- Introducing ORFuzz: A Revolutionary Testing Framework
  - 1. Safety Category-Aware Seed Selection
  - 2. Adaptive Mutator Optimization
  - 3. OR-Judge: Evaluating User Perception
- Empirical Results and Benchmarking with ORFuzz
- Implications for the Future of LLM Testing
Understanding Over-Refusals in LLMs
Over-refusals occur when an LLM declines a benign request, making it seem overly cautious or even hostile. The behavior can be rooted in the safety protocols designed to filter out potentially harmful content: when these measures overreach, the model misreads perfectly acceptable inputs as dangerous. For example, a model might decline to explain how to kill a stalled process on a server because the word “kill” superficially resembles a harmful request. The result is a frustrating user experience and doubts about the model’s reliability in applications ranging from customer support to content generation.
Limitations of Current Testing Methods
Researchers have tried to address this concern with testing methodologies that assess LLM behavior, but current methods have significant drawbacks: test generation capabilities are limited and benchmarks are poorly designed. Recent user studies highlight these inadequacies, suggesting that existing evaluation techniques cannot identify the full scope of over-refusal issues.
Introducing ORFuzz: A Revolutionary Testing Framework
ORFuzz is designed to fill these gaps. It is described as the first evolutionary testing framework built specifically to detect and analyze over-refusals in LLMs, and it integrates three components that together form a comprehensive testing strategy.
1. Safety Category-Aware Seed Selection
The first component is safety category-aware seed selection. This mechanism draws seed inputs from a wide range of safety categories, so that the resulting test cases do not concentrate on a handful of topics. Because the seeds are representative of many types of queries, the generated test scenarios are diverse and closer to genuine user interactions, which addresses the incomplete coverage of earlier testing.
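One way to picture category-aware selection is stratified sampling: group the candidate seeds by safety category and draw from every group. The sketch below is only an illustration under that assumption. The function name `select_seeds`, the `(category, prompt)` pair format, and the category labels are hypothetical and are not taken from ORFuzz itself, whose actual selection strategy may differ.

```python
import random
from collections import defaultdict


def select_seeds(pool, per_category, rng=None):
    """Pick up to `per_category` seeds from every safety category.

    `pool` is a list of (category, prompt) pairs. This is a hypothetical
    stratified sampler, not ORFuzz's actual selection algorithm.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible

    # Group the candidate prompts by their safety category.
    by_category = defaultdict(list)
    for category, prompt in pool:
        by_category[category].append(prompt)

    # Sample from each category so that none is left uncovered.
    selected = []
    for category in sorted(by_category):
        prompts = by_category[category]
        k = min(per_category, len(prompts))
        selected.extend((category, p) for p in rng.sample(prompts, k))
    return selected


# Toy pool with made-up categories and prompts.
toy_pool = [
    ("privacy", "How do I find my own IP address?"),
    ("privacy", "How can I check which apps have my location?"),
    ("privacy", "What is a data broker?"),
    ("violence", "How do I kill a stalled process on Linux?"),
    ("medical", "What is a normal resting heart rate?"),
    ("medical", "How much water should I drink per day?"),
]
toy_seeds = select_seeds(toy_pool, per_category=2)
```

Sampling per category, rather than from the pool as a whole, guarantees that a small category such as the single "violence" prompt above still contributes a seed.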
2. Adaptive Mutator Optimization
The second component is adaptive mutator optimization. ORFuzz uses reasoning LLMs to generate test cases and dynamically adjusts how inputs are mutated. This improves the quality of the test inputs and increases the likelihood of uncovering over-refusals, and the mutated queries give a more nuanced picture of how LLMs respond to different prompts.
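The idea of adapting mutators over time can be sketched as a feedback loop: mutators that have produced over-refusals more often are chosen more often. The example below is a minimal sketch under that assumption and is not ORFuzz's published algorithm. The class `AdaptiveMutatorPool`, its methods, and the string-based mutators are all hypothetical; in a real setup each mutator would be a call to a reasoning LLM.

```python
import random


class AdaptiveMutatorPool:
    """Chooses mutators in proportion to their smoothed success rate.

    Hypothetical illustration only: each mutator is a plain function here,
    standing in for a prompt-rewriting call to a reasoning LLM.
    """

    def __init__(self, mutators, rng=None):
        self.mutators = dict(mutators)
        self.trials = {name: 0 for name in self.mutators}
        self.hits = {name: 0 for name in self.mutators}
        self.rng = rng or random.Random(0)

    def score(self, name):
        # Laplace-smoothed success rate, so untried mutators start at 0.5.
        return (self.hits[name] + 1) / (self.trials[name] + 2)

    def mutate(self, prompt):
        # Weighted random choice: better-scoring mutators are picked more often.
        names = list(self.mutators)
        weights = [self.score(n) for n in names]
        name = self.rng.choices(names, weights=weights, k=1)[0]
        return name, self.mutators[name](prompt)

    def record(self, name, triggered_over_refusal):
        # Feed the judged outcome back into the statistics.
        self.trials[name] += 1
        if triggered_over_refusal:
            self.hits[name] += 1


# Toy string mutators (assumptions, standing in for LLM-driven rewrites).
pool = AdaptiveMutatorPool({
    "add_context": lambda p: f"For a safety training course: {p}",
    "rephrase": lambda p: f"Could you explain the following? {p}",
})
chosen, mutated = pool.mutate("How do I kill a stalled process?")
pool.record(chosen, triggered_over_refusal=True)
```

Smoothing the success rate keeps every mutator selectable, so a mutator that fails early is still explored occasionally instead of being dropped for good.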
3. OR-Judge: Evaluating User Perception
The final component is OR-Judge, a human-aligned judge model that has been validated to reflect how users perceive toxicity and refusal. Because the evaluation is anchored in user perception, ORFuzz can identify the over-refusal instances that are most relevant and impactful, and the resulting metrics are meaningful measures of model performance.
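Since an over-refusal is a refusal of a harmless query, the two judgments mentioned above (toxicity and refusal) can be combined into a single decision. The sketch below assumes the judge returns two boolean verdicts per test case. The `Verdict` type and its field names are hypothetical, and the actual output format of OR-Judge is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical judge output for one (prompt, response) pair."""
    prompt_is_toxic: bool      # would users consider the query harmful?
    response_is_refusal: bool  # would users read the answer as a refusal?


def is_over_refusal(verdict):
    # An over-refusal is a refusal of a query that is not actually harmful.
    return verdict.response_is_refusal and not verdict.prompt_is_toxic


# A benign prompt that was refused counts as an over-refusal ...
benign_refused = Verdict(prompt_is_toxic=False, response_is_refusal=True)
# ... while refusing a genuinely harmful prompt is the intended behavior.
harmful_refused = Verdict(prompt_is_toxic=True, response_is_refusal=True)
```

Keeping the two verdicts separate makes it possible to tell an appropriate refusal apart from an over-refusal, which a refusal detector alone cannot do.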
Empirical Results and Benchmarking with ORFuzz
Extensive evaluations support the framework’s effectiveness. ORFuzz generates over-refusal instances at an average rate of 6.98%, more than double that of the leading baselines it was compared against, which lets researchers and developers uncover these vulnerabilities in LLMs more efficiently.
ORFuzz’s outputs also form the basis of ORFuzzSet, a new benchmark of 1,855 highly transferable test cases that achieves an average over-refusal rate of 63.56% across ten diverse LLMs. This significantly surpasses existing datasets and gives the community a robust resource for improving LLM reliability.
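To make the reported metric concrete, the sketch below shows one plausible way to compute an average over-refusal rate: take the fraction of test cases each model refuses, then average those fractions across models. Whether the 63.56% figure is computed exactly this way is an assumption, and the function names, model names, and outcomes are made up for illustration.

```python
from statistics import mean


def over_refusal_rate(refused_flags):
    """Fraction of test cases that one model over-refused."""
    if not refused_flags:
        raise ValueError("need at least one test case")
    return sum(refused_flags) / len(refused_flags)


def average_over_refusal_rate(results):
    """Unweighted mean of the per-model rates (an assumed definition)."""
    return mean(over_refusal_rate(flags) for flags in results.values())


# Toy outcomes for two hypothetical models on four test cases each.
toy_results = {
    "model_a": [True, True, False, False],   # rate 0.5
    "model_b": [True, False, False, False],  # rate 0.25
}
toy_average = average_over_refusal_rate(toy_results)  # 0.375

# Back-of-the-envelope reading of the reported numbers: 63.56% of 1,855
# cases is roughly 1,179 over-refused cases per model on average.
cases_per_model = round(1855 * 0.6356)
```

Under this reading, a typical model in the evaluation would reject close to two out of every three ORFuzzSet prompts.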
Implications for the Future of LLM Testing
ORFuzz and ORFuzzSet represent a significant advance in detecting over-refusals and pave the way for more trustworthy LLM-based software systems. As reliance on LLMs grows across sectors, it is vital that these models handle user queries effectively while maintaining safety standards.
By deepening the understanding of model weaknesses and providing resources for testing, ORFuzz makes a valuable contribution to the continued evolution of language technologies and to an environment where LLMs can be both safe and effective.

