SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair
Maintaining code quality remains one of the harder problems in software development. A persistent challenge is the presence of architectural code smells: design-level issues that erode maintainability and are often costly to repair by hand. Localized bugs can usually be fixed with a narrow, local change, but architectural smells demand cross-module reasoning about design intent, which makes automated repair considerably harder.
In this article, we look at SmellBench, a framework developed by Ion George Dinu and collaborators to evaluate how well large language model (LLM) agents can repair architectural code smells.
Understanding Architectural Code Smells
Architectural code smells indicate deficiencies in code structure and design that hinder long-term maintainability. Unlike simple bugs, they require an understanding of inter-module relationships and overall design principles, which makes them difficult for both developers and automated tools to address. Some common types include:
- God Objects: Classes that control too much behavior, leading to high coupling and low cohesion.
- Spaghetti Code: Code that is tangled and difficult to follow, making it hard to manage and maintain.
- Feature Envy: Situations where one class is overly interested in another’s data or functionality, indicating a potential design flaw.
Addressing these smells is critical for creating maintainable and scalable software systems, but the challenge lies in their complexity.
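To make one of these concrete, here is a minimal, hypothetical Python sketch of Feature Envy and a common way it is repaired. The classes are invented for illustration and do not come from the SmellBench corpus or the scikit-learn codebase.

```python
# Hypothetical illustration of Feature Envy; not taken from the benchmark.
# ReportPrinter spends most of its time reading Order's data, which suggests
# the computation belongs on Order itself.

class Order:
    def __init__(self, items, tax_rate):
        self.items = items        # list of (name, price) tuples
        self.tax_rate = tax_rate


class ReportPrinter:
    def print_total(self, order):
        # Feature Envy: this method uses another object's data far more
        # than its own state.
        subtotal = sum(price for _, price in order.items)
        print(f"Total: {subtotal * (1 + order.tax_rate):.2f}")


# A common repair moves the computation onto the class that owns the data.
class OrderWithTotal(Order):
    def total(self):
        subtotal = sum(price for _, price in self.items)
        return subtotal * (1 + self.tax_rate)
```

The same pattern scales up: architectural repairs are largely about relocating behavior and responsibilities so that coupling goes down and cohesion goes up, which is exactly the cross-module judgment that is hard to automate.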
The Role of LLM Agents
Large language model agents have shown strong performance on code-level tasks such as bug fixing and localized refactoring, but their ability to repair architectural code smells is largely unexplored. SmellBench addresses this gap with a structured evaluation of agent configurations built on four model families: GPT, Claude, Gemini, and Mistral.
Task Orchestration Framework
At the heart of SmellBench is its task orchestration framework. The framework uses prompts optimized for each smell type to guide the LLM agents as they attempt repairs, and it supports iterative, multi-step execution so agents can refine their approach based on intermediate outcomes.
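The paper describes this orchestration at a high level; the sketch below is only an assumed outline of such a loop. The prompt templates and the `run_agent`, `detect_smells`, and `codebase.apply` helpers are hypothetical stand-ins, not SmellBench's actual API.

```python
# Minimal sketch of an iterative, smell-type-specific repair loop.
# Prompt templates, run_agent(), detect_smells(), and codebase.apply()
# are hypothetical stand-ins, not SmellBench's real implementation.

PROMPT_TEMPLATES = {
    "god_object": "Split the responsibilities of {symbol} into cohesive classes...",
    "feature_envy": "Move the logic in {symbol} closer to the data it uses...",
}

def repair_smell(smell, codebase, run_agent, detect_smells, max_steps=3):
    template = PROMPT_TEMPLATES.get(smell.kind, "Refactor {symbol} to remove the reported smell.")
    prompt = template.format(symbol=smell.symbol)
    for step in range(max_steps):
        patch = run_agent(prompt, codebase)        # agent proposes an edit
        candidate = codebase.apply(patch)          # apply it to a working copy
        remaining = detect_smells(candidate)       # re-run the smell detector
        if smell not in remaining:
            return candidate                       # repair succeeded
        # Otherwise feed the outcome back so the agent can refine its attempt.
        prompt += f"\nPrevious attempt (step {step}) did not remove the smell."
    return codebase                                # give up, keep the original
```

The key ideas this loop is meant to capture are the two features the article names: prompts specialized per smell type, and multiple execution steps with feedback between them.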
Evaluation Methodology
SmellBench's evaluation methodology is built around a scoring system that measures:
- Repair Effectiveness: How well the agents manage to fix the identified architectural smells.
- False Positive Identification: The ability of agents to distinguish actual smells from those erroneously flagged by the detector.
- Net Codebase Impact: The broader effects of the repairs on the overall codebase quality.
By using these criteria, SmellBench can paint a more nuanced picture of LLM agent performance in relation to architectural code smell repair.
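As an illustration of how these three criteria might be turned into numbers, here is a small sketch with assumed field names and metric definitions. The paper defines the actual scoring system, so treat this only as a reading aid.

```python
# Illustrative scoring of a single agent run against the three criteria.
# Field names and the way metrics are combined are assumptions made for
# this example, not SmellBench's actual formulas.

from dataclasses import dataclass

@dataclass
class RunResult:
    true_smells: int        # expert-confirmed smells handed to the agent
    resolved: int           # confirmed smells the agent actually repaired
    false_positives: int    # flagged items the expert judged not to be smells
    fp_rejected: int        # false positives the agent correctly left alone
    new_smells: int         # smells introduced by the agent's edits

def score(run: RunResult) -> dict:
    repair_effectiveness = run.resolved / run.true_smells if run.true_smells else 0.0
    fp_identification = run.fp_rejected / run.false_positives if run.false_positives else 0.0
    net_codebase_impact = run.resolved - run.new_smells   # positive = net improvement
    return {
        "repair_effectiveness": repair_effectiveness,
        "false_positive_identification": fp_identification,
        "net_codebase_impact": net_codebase_impact,
    }
```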
Empirical Findings
The empirical evaluation of 11 agent configurations gives a clear picture of current LLM agent capabilities. The study focused on 65 hard-severity architectural smells detected by PyExamine in the widely used Python project scikit-learn, with expert judgments used to validate the detections.
Notably, expert validation found that 63.1% of the detected smells were false positives. Despite this high false-positive rate, the best-performing LLM agent resolved 47.7% of the genuine architectural smells. LLMs are making progress on this task, but their architectural understanding still needs substantial development.
The evaluation also exposed a tension between repair aggressiveness and net codebase quality: some agents achieved high repair rates but introduced up to 140 new smells in the process, a clear sign that aggressive repairs do not automatically improve overall quality.
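A toy calculation makes the trade-off concrete. Only the figure of 140 newly introduced smells comes from the study; the remaining numbers below are invented to show how a high repair rate can still produce a negative net result.

```python
# Toy numbers illustrating why a high repair rate can still hurt net quality.
# Only the 140-new-smells figure comes from the study; the rest are invented.
aggressive = {"resolved": 20, "introduced": 140}    # many fixes, many regressions
conservative = {"resolved": 10, "introduced": 2}    # fewer fixes, few regressions

for name, r in (("aggressive", aggressive), ("conservative", conservative)):
    net = r["resolved"] - r["introduced"]
    print(f"{name}: net smell change = {net:+d}")
# aggressive: net smell change = -120  -> the codebase ends up worse off
# conservative: net smell change = +8  -> a modest but genuine improvement
```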
Implications for Automated Software Engineering
The findings from SmellBench underscore a significant gap between the current capabilities of LLMs in performing localized code transformations and the architectural awareness essential for effective cross-module refactoring. As developers increasingly rely on automated tools to maintain code, understanding these limitations becomes crucial for informed decision-making.
Beyond individual agent performance, SmellBench is positioned to serve as a reusable infrastructure that tracks progress in this critical yet underexplored domain of automated software engineering. By focusing on architectural code smells, it opens avenues for further research and development aimed at enhancing LLM capabilities.
The framework is intended not only to drive improvements in LLM behavior but also to inform the broader discussion of automated software engineering practice and the future of code maintenance and quality assurance.
For the full findings and methodology, researchers and developers can consult the paper, “SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair,” available as a PDF from the authors.

