cs.SE
arXiv:2503.22512v1 Announce Type: new Abstract: Recent advances in leveraging large language models (LLMs) for automated program repair (APR) have demonstrated impressive capabilities in fixing software defects. However, current LLM-based approaches predominantly focus on mainstream programming languages such as Java and Python, neglecting less prevalent but emerging languages such as Rust, owing to expensive training resources, limited datasets, and insufficient community support. This narrow focus creates a significant gap in repair capabilities across the programming language spectrum, and the full potential of LLMs for comprehensive multilingual program repair remains largely unexplored. To address this limitation, we introduce LANTERN, a novel cross-language program repair approach that exploits LLMs' differential proficiency across languages through a multi-agent iterative repair paradigm. Our technique strategically translates defective code from languages where LLMs exhibit weaker repair capabilities to languages where they demonstrate stronger performance, without requiring additional training. A key innovation of our approach is an LLM-based decision-making system that dynamically selects optimal target languages based on bug characteristics and continuously incorporates feedback from previous repair attempts. We evaluate our method on xCodeEval, a comprehensive multilingual benchmark comprising 5,068 bugs across 11 programming languages. Results demonstrate significant gains in repair effectiveness, particularly for underrepresented languages, with Rust showing a 22.09% improvement in Pass@10. Our research provides the first empirical evidence that cross-language translation significantly expands the repair capabilities of LLMs and effectively bridges the performance gap between programming languages of differing popularity, opening new avenues for truly language-agnostic automated program repair.
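To make the multi-agent iterative repair paradigm concrete, here is a minimal Python sketch of a translate-repair-verify loop in the spirit of LANTERN; the function names, prompts, and control flow are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of a translate-repair-verify loop in the spirit of LANTERN.
# `llm` stands in for any chat-completion function: prompt str -> response str.
from typing import Callable

def cross_language_repair(buggy_code: str, source_lang: str,
                          run_tests: Callable[[str, str], bool],
                          llm: Callable[[str], str],
                          max_rounds: int = 5) -> str | None:
    history: list[str] = []  # feedback from previous failed repair attempts
    for _ in range(max_rounds):
        # 1. Let the LLM pick a target language, given bug traits and past failures.
        target = llm(
            f"Bug in {source_lang}:\n{buggy_code}\n"
            f"Previous failed attempts: {history}\n"
            "Name the single best language to repair this bug in:"
        ).strip()
        # 2. Translate, repair in the stronger language, translate back.
        translated = llm(f"Translate this {source_lang} code to {target}:\n{buggy_code}")
        repaired = llm(f"Fix the bug in this {target} code:\n{translated}")
        candidate = llm(f"Translate this {target} code back to {source_lang}:\n{repaired}")
        # 3. Validate against the test suite; feed failures into the next round.
        if run_tests(candidate, source_lang):
            return candidate
        history.append(f"repair via {target} failed")
    return None
```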
arXiv:2406.08198v3 Announce Type: replace Abstract: Many software products are composed of components integrated from other teams or external parties. Each additional link in a software product's supply chain increases the risk of the injection of malicious behavior. To improve supply chain provenance, many cybersecurity frameworks, standards, and regulations recommend the use of software signing. However, recent surveys and measurement studies have found that the adoption rate and quality of software signatures are low. We lack in-depth industry perspectives on the challenges and practices of software signing. To understand software signing in practice, we interviewed 18 experienced security practitioners across 13 organizations. We study the challenges that affect the effective implementation of software signing in practice. We also examine the possible impacts of experienced software supply chain failures, security standards, and regulations on software signing adoption. To summarize our findings: (1) We present a refined model of the software supply chain factory model highlighting practitioners' signing practices; (2) We highlight the different challenges (technical, organizational, and human) that hamper software signing implementation; (3) We report that experts disagree on the importance of signing; and (4) We describe how internal and external events affect the adoption of software signing. Our work describes the considerations for adopting software signing as one aspect of the broader goal of improved software supply chain security.
arXiv:2503.22141v1 Announce Type: new Abstract: Context: This paper provides an in-depth examination of the generation and evaluation of Metamorphic Relations (MRs) using GPT models developed by OpenAI, with a particular focus on the capabilities of GPT-4 in software testing environments. Objective: The aim is to examine the quality of MRs produced by GPT-3.5 and GPT-4 for a specific System Under Test (SUT) adopted from an earlier study, and to introduce and apply an improved set of evaluation criteria for a diverse range of SUTs. Method: The initial phase evaluates MRs generated by GPT-3.5 and GPT-4 using criteria from a prior study, followed by an application of an enhanced evaluation framework on MRs created by GPT-4 for a diverse range of nine SUTs, varying from simple programs to complex systems incorporating AI/ML components. A custom-built GPT evaluator, alongside human evaluators, assessed the MRs, enabling a direct comparison between automated and human evaluation methods. Results: The study finds that GPT-4 outperforms GPT-3.5 in generating accurate and useful MRs. With the advanced evaluation criteria, GPT-4 demonstrates a significant ability to produce high-quality MRs across a wide range of SUTs, including complex systems incorporating AI/ML components. Conclusions: GPT-4 exhibits advanced capabilities in generating MRs suitable for various applications. The research underscores the growing potential of AI in software testing, particularly in the generation and evaluation of MRs, and points towards the complementarity of human and AI skills in this domain.
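For readers unfamiliar with metamorphic relations, a minimal self-contained illustration (not drawn from the paper): an MR asserts a relation between the outputs of related inputs, so it can serve as a test oracle even when the exact expected output is unknown.

```python
import math
import random

# Two classic metamorphic relations, written as executable checks.

def mr_sine(x: float) -> bool:
    # MR: sin(x) == sin(pi - x) for all x.
    return math.isclose(math.sin(x), math.sin(math.pi - x), abs_tol=1e-12)

def mr_sort_permutation(xs: list[int]) -> bool:
    # MR: sorting is invariant under any permutation of the input.
    shuffled = random.sample(xs, len(xs))
    return sorted(xs) == sorted(shuffled)

# Evaluate each MR on many generated follow-up inputs.
for _ in range(1000):
    assert mr_sine(random.uniform(-10, 10))
    assert mr_sort_permutation([random.randint(0, 99) for _ in range(20)])
```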
arXiv:2503.22424v1 Announce Type: new Abstract: Large language models (LLMs) have significantly advanced autonomous software engineering, leading to a growing number of software engineering agents that assist developers in automatic program repair. Issue localization forms the basis for accurate patch generation. However, because of the context window limitations of LLMs, existing issue localization methods struggle to balance concise yet effective contexts with adequately comprehensive search spaces. In this paper, we introduce CoSIL, an LLM-driven, simple yet powerful function-level issue localization method that requires neither training nor indexing. CoSIL reduces the search space through module call graphs, iteratively searches the function call graph to obtain relevant contexts, and uses context pruning to control the search direction and manage contexts effectively. Importantly, the call graph is dynamically constructed by the LLM during search, eliminating the need for pre-parsing. Experimental results demonstrate that CoSIL achieves a Top-1 localization success rate of 43% and 44.6% on SWE-bench Lite and SWE-bench Verified, respectively, using Qwen2.5-Coder-32B, outperforming existing methods by 8.6% to 98.2%. When CoSIL is applied to guide the patch generation stage, the resolved rate further improves by 9.3% to 31.5%.
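A minimal sketch of the iterative call-graph search with context pruning described above; the interface names and the breadth-first strategy are assumptions for illustration, and CoSIL's actual search policy may differ:

```python
# Illustrative sketch of LLM-driven, function-level issue localization.
from collections import deque
from typing import Callable

def localize(issue: str, entry_funcs: list[str],
             expand: Callable[[str], list[str]],    # LLM-built call graph: callees of a function
             relevant: Callable[[str, str], bool],  # LLM judge: is this function worth exploring?
             budget: int = 50) -> list[str]:
    """Iteratively search the function call graph, pruning irrelevant branches."""
    suspects: list[str] = []
    seen = set(entry_funcs)
    frontier = deque(entry_funcs)
    while frontier and budget > 0:
        fn = frontier.popleft()
        budget -= 1
        if not relevant(issue, fn):   # context pruning controls the search direction
            continue
        suspects.append(fn)
        for callee in expand(fn):     # call graph built on the fly, no pre-parsing
            if callee not in seen:
                seen.add(callee)
                frontier.append(callee)
    return suspects
```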
arXiv:2503.22587v1 Announce Type: new Abstract: In the domain of model-based engineering, models are essential components that enable system design and analysis. Traditionally, the creation of these models has been a manual process requiring not only deep modeling expertise but also substantial domain knowledge of target systems. With the rapid advancement of generative artificial intelligence, large language models (LLMs) show potential for automating model generation. This work explores the generation of instance models using LLMs, focusing specifically on producing XMI-based instance models from Ecore metamodels and natural language specifications. We observe that current LLMs struggle to directly generate valid XMI models. To address this, we propose a two-step approach: first, using LLMs to produce a simplified structured output containing all necessary instance model information, namely a conceptual instance model, and then compiling this intermediate representation into a valid XMI file. The conceptual instance model is format-independent, allowing it to be transformed into various modeling formats via different compilers. The feasibility of the proposed method has been demonstrated using several LLMs, including GPT-4o, o1-preview, and Llama 3.1 (8B and 70B). Results show that the proposed method significantly improves the usability of LLMs for instance model generation tasks. Notably, the smaller open-source model, Llama 3.1 70B, demonstrated performance comparable to the proprietary GPT models within the proposed framework.
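A toy sketch of the two-step idea: the LLM emits a simple, format-independent structure, and a small compiler turns it into XMI. The schema and element names below are hypothetical, not the paper's actual intermediate representation:

```python
import xml.etree.ElementTree as ET

# Hypothetical "conceptual instance model": a plain JSON-like structure an LLM can
# emit far more reliably than raw XMI (shape and names are illustrative only).
conceptual = {
    "root_type": "Library",
    "objects": [
        {"tag": "books", "type": "Book", "attrs": {"title": "Dune", "pages": "412"}},
        {"tag": "books", "type": "Book", "attrs": {"title": "Emma", "pages": "336"}},
    ],
}

def compile_to_xmi(model: dict) -> str:
    """Compile the intermediate representation into a minimal XMI document."""
    ET.register_namespace("xmi", "http://www.omg.org/XMI")
    root = ET.Element(model["root_type"],
                      {"{http://www.omg.org/XMI}version": "2.0"})
    for obj in model["objects"]:
        child = ET.SubElement(root, obj["tag"], obj["attrs"])
        child.set("{http://www.omg.org/XMI}type", obj["type"])
    return ET.tostring(root, encoding="unicode", xml_declaration=True)

print(compile_to_xmi(conceptual))
```

Because the intermediate structure is format-independent, a second compiler targeting, say, JSON-based model exchange could consume the same `conceptual` dict unchanged.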
arXiv:2503.22641v1 Announce Type: cross Abstract: Property-based testing has been previously proposed for quantum programs in Q# with QSharpCheck; however, this implementation was limited in functionality, lacked extensibility, and was evaluated on a narrow range of programs using a single property. To address these limitations, we propose QuCheck, an enhanced property-based testing framework in Qiskit. By leveraging Qiskit and the broader Python ecosystem, QuCheck facilitates property construction, introduces flexible input generators and assertions, and supports expressive preconditions. We assessed its effectiveness through mutation analysis on five quantum programs (2-10 qubits), varying the number of properties, inputs, and measurement shots to assess their impact on fault detection and to demonstrate the effectiveness of property-based testing across a range of conditions. Results show a strong positive correlation between the mutation score (a measure of fault detection) and the number of properties evaluated, and a moderate negative correlation between the false positive rate and the number of measurement shots. Among the most thorough test configurations, those evaluating three properties achieved a mean mutation score ranging from 0.90 to 0.92 across all five algorithms, with a false positive rate between 0 and 0.04. QuCheck identified 36.0% more faults than QSharpCheck, with execution time reduced by 81.1%, despite one false positive. These findings underscore the viability of property-based testing for verifying quantum systems.
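In the same spirit, here is a hand-rolled property check in plain Qiskit, pairing an input generator with a property; this is illustrative only, and QuCheck's actual generator and assertion API is not shown here:

```python
import random
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

def random_single_qubit_circuit(depth: int = 5) -> QuantumCircuit:
    """Input generator: a random single-qubit circuit."""
    qc = QuantumCircuit(1)
    for _ in range(depth):
        gate = random.choice(["h", "x", "s", "t"])
        getattr(qc, gate)(0)
    return qc

def prop_double_x_is_identity(qc: QuantumCircuit) -> bool:
    """Property: appending X twice leaves the output state unchanged."""
    doubled = qc.copy()
    doubled.x(0)
    doubled.x(0)
    return Statevector.from_instruction(qc).equiv(Statevector.from_instruction(doubled))

# Evaluate the property over many generated inputs, as a PBT framework would.
for _ in range(100):
    assert prop_double_x_is_identity(random_single_qubit_circuit())
```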
arXiv:2503.21965v1 Announce Type: new Abstract: Safety and reliability are crucial in industrial drive systems, where hazardous failures can have severe consequences. Detecting and mitigating dangerous faults in time is challenging due to the stochastic and unpredictable nature of fault occurrences, which can limit diagnostic efficiency and compromise safety. This paper optimizes the safety and diagnostic performance of a real-world industrial Basic Drive Module (BDM) using Uppaal Stratego. We model the functional safety architecture of the BDM with timed automata and formally verify its key functional and safety requirements through model checking to eliminate unwanted behaviors. Taking the formally verified model as a baseline, we leverage the reinforcement learning facility in Uppaal Stratego to optimize the safe failure fraction toward the 90% threshold, improving fault detection ability. The promising results highlight strong potential for broader safety applications in industrial automation.
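For reference, the safe failure fraction is the standard IEC 61508 ratio of safe plus dangerous-detected failure rates to the total failure rate; a small sketch with made-up rates (the paper's actual rates and optimization live inside Uppaal Stratego, not in code like this):

```python
# Safe failure fraction (SFF) per IEC 61508: the share of all failures that are
# either safe or dangerous-but-detected. The rates below are illustrative values.

def safe_failure_fraction(lambda_safe: float, lambda_dd: float, lambda_du: float) -> float:
    """lambda_safe: safe failures; lambda_dd: dangerous detected; lambda_du: dangerous undetected."""
    total = lambda_safe + lambda_dd + lambda_du
    return (lambda_safe + lambda_dd) / total

# Improving diagnostics converts dangerous-undetected failures into detected ones:
print(f"{safe_failure_fraction(5e-7, 4e-7, 1e-7):.0%}")  # 90%, meeting the threshold
```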
arXiv:2503.22134v1 Announce Type: new Abstract: This study investigates vulnerabilities within the Maven ecosystem by analyzing a comprehensive dataset of 14,459,139 releases. Our analysis reveals the most critical weaknesses that pose significant threats to developers and their projects as they look to streamline their development tasks through code reuse. We identify the riskiest weaknesses, those unique to Maven, and those becoming increasingly dangerous over time. Furthermore, we reveal how vulnerabilities subtly propagate, impacting 31.39% of the 635,003 latest releases through direct dependencies and 62.89% through transitive dependencies. Our findings suggest that improper handling of input and mismanagement of resources pose the most risk. Additionally, insufficient session-ID length in J2EE configuration and the absence of throttling during resource allocation uniquely threaten the Maven ecosystem. We also find that weaknesses related to improper authentication and the handling of sensitive data without encryption have quickly gained prominence in recent years. These findings emphasize the need for proactive strategies to mitigate security risks in the Maven ecosystem.
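The direct-versus-transitive distinction can be made concrete with a small dependency-graph sketch (toy data, not the study's Maven corpus):

```python
# Toy dependency graph: release -> direct dependencies (names are illustrative).
deps = {
    "app": ["lib-a", "lib-b"],
    "lib-a": ["lib-c"],
    "lib-b": [],
    "lib-c": [],
}
vulnerable = {"lib-c"}

def affected_directly(release: str) -> bool:
    return any(d in vulnerable for d in deps[release])

def affected_transitively(release: str, seen=None) -> bool:
    """Walk the full dependency closure, guarding against cycles."""
    seen = seen or set()
    for d in deps[release]:
        if d in seen:
            continue
        seen.add(d)
        if d in vulnerable or affected_transitively(d, seen):
            return True
    return False

print(affected_directly("app"))      # False: no direct dependency is vulnerable
print(affected_transitively("app"))  # True: app -> lib-a -> lib-c
```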
arXiv:2503.22238v1 Announce Type: new Abstract: The increasing adoption of Large Language Models (LLMs) in software engineering education presents both opportunities and challenges. While LLMs offer benefits such as enhanced learning experiences, automated assessments, and personalized tutoring, their integration also raises concerns about academic integrity, student over-reliance, and ethical considerations. In this study, we conducted a preliminary literature review to identify motivators and demotivators for using LLMs in software engineering education. We applied a thematic mapping process to categorize and structure these factors (motivators and demotivators), offering a comprehensive view of their impact. In total, we identified 25 motivators and 30 demotivators, which are further organized into four high-level themes. This mapping provides a structured framework for understanding the factors that influence the integration of LLMs in software engineering education, both positively and negatively. As part of a larger research project, this study serves as a feasibility assessment, laying the groundwork for a future systematic literature review and empirical studies. Ultimately, this project aims to develop a framework to assist Finnish higher education institutions in effectively integrating LLMs into software engineering education while addressing potential risks and challenges.
arXiv:2503.22391v1 Announce Type: new Abstract: Vulnerabilities in software libraries and reusable components cause major security challenges, particularly in dependency-heavy ecosystems such as Maven. This paper presents a large-scale analysis of vulnerabilities in the Maven ecosystem using the Goblin framework, focusing on vulnerability types, documentation delays, and resolution timelines, and on their implications. We identify 77,393 vulnerable releases with 226 unique CWEs. On average, vulnerabilities take nearly half a decade to be documented and 4.4 years to be resolved, with some remaining unresolved for over a decade. These delays in documenting and fixing vulnerabilities incur security risks for library users, emphasizing the need for more careful and efficient vulnerability management in the Maven ecosystem.
arXiv:2503.22575v1 Announce Type: new Abstract: Deep Reinforcement Learning (DRL) is a paradigm of artificial intelligence in which an agent uses a neural network to learn which actions to take in a given environment. DRL has recently gained traction for its ability to solve complex environments such as driving simulators, 3D robotic control, and multiplayer-online-battle-arena video games. Numerous implementations of the state-of-the-art algorithms responsible for training these agents, such as the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms, currently exist. However, studies often assume implementations of the same algorithm to be consistent and thus interchangeable. In this paper, through a differential testing lens, we present the results of studying the extent of implementation inconsistencies, their effect on the implementations' performance, and their impact on the conclusions of prior studies that assume interchangeable implementations. The outcomes of our differential tests showed significant discrepancies between the tested algorithm implementations, indicating that they are not interchangeable. In particular, out of the five PPO implementations tested on 56 games, three achieved superhuman performance in 50% of their total trials while the other two did so in less than 15% of their total trials. Through a meticulous manual analysis of the implementations' source code, we determined that code-level inconsistencies primarily caused these discrepancies. Lastly, we replicated a prior study and showed that the assumption of implementation interchangeability was enough to flip its experimental outcomes. These findings therefore call for a shift in how implementations are used.
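The core measurement of such a differential test can be sketched as follows; `train` stands in for any implementation's training entry point, and the superhuman criterion mirrors the trial-level comparison described above (all names are illustrative):

```python
import statistics
from typing import Callable

# Differential-testing sketch: run several implementations of the "same" algorithm
# under identical conditions and compare their outcome distributions.

def differential_test(impls: dict[str, Callable[[str, int], float]],
                      env_id: str, seeds: range,
                      human_baseline: float) -> dict[str, float]:
    """Fraction of trials per implementation that exceed the human baseline."""
    results = {}
    for name, train in impls.items():
        # Same environment and same seeds for every implementation.
        returns = [train(env_id, seed) for seed in seeds]
        superhuman = sum(r > human_baseline for r in returns) / len(returns)
        results[name] = superhuman
        print(f"{name}: median return {statistics.median(returns):.1f}, "
              f"superhuman in {superhuman:.0%} of trials")
    return results
```

If the implementations were truly interchangeable, the per-implementation fractions would agree up to seed noise; the study's 50% versus under-15% gap is exactly the kind of discrepancy this harness surfaces.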
arXiv:2503.22576v1 Announce Type: new Abstract: The prevalent use of third-party libraries (TPLs) in modern software development introduces significant security and compliance risks, necessitating Software Composition Analysis (SCA) to manage these threats. However, the accuracy of SCA tools heavily relies on the quality of the integrated feature database that is cross-referenced against user projects. With the exponential growth of open-source ecosystems and the integration of large models into software development, maintaining a comprehensive feature database for potential TPLs becomes even more challenging. To this end, drawing on the evolution of LLM applications with respect to external data interaction, we propose the first DB-Less SCA framework, which discards the traditional heavyweight database and instead exploits the flexibility of LLMs to mimic the manual analysis of security analysts: retrieving evidence and confirming the identity of TPLs through supportive information from the open Internet. Our experiments on two typical scenarios, native library identification for Android and copy-based TPL reuse for C/C++, especially on artifacts that are often underappreciated, demonstrate a favorable future for implementing database-less strategies in SCA.
arXiv:2503.22612v1 Announce Type: new Abstract: This study evaluates the adoption of DevSecOps among small and medium-sized enterprises (SMEs), identifying key challenges, best practices, and future trends. Through a mixed-methods approach grounded in the Technology Acceptance Model (TAM) and Diffusion of Innovations (DOI) theory, we analyzed survey data from 405 SME professionals, revealing that while 68% have implemented DevSecOps, adoption is hindered by technical complexity (41%), resource constraints (35%), and cultural resistance (38%). Despite strong leadership prioritization of security (73%), automation gaps persist, with only 12% of organizations conducting security scans per commit. Our findings highlight a growing integration of security tools, particularly API security (63%) and software composition analysis (62%), although container security adoption remains low (34%). Looking ahead, SMEs anticipate that artificial intelligence and machine learning will significantly influence DevSecOps, underscoring the need for proactive adoption of AI-driven security enhancements. Based on our findings, this research proposes strategic best practices to enhance CI/CD pipeline security, including automation, a leadership-driven security culture, and cross-team collaboration.
arXiv:2503.22625v1 Announce Type: new Abstract: AI for software engineering has made remarkable progress recently, becoming a notable success within generative AI. Despite this, there are still many challenges that need to be addressed before automated software engineering reaches its full potential. It should be possible to reach high levels of automation where humans can focus on the critical decisions of what to build and how to balance difficult tradeoffs while most routine development effort is automated away. Reaching this level of automation will require substantial research and engineering efforts across academia and industry. In this paper, we aim to discuss progress towards this in a threefold manner. First, we provide a structured taxonomy of concrete tasks in AI for software engineering, emphasizing the many other tasks in software engineering beyond code generation and completion. Second, we outline several key bottlenecks that limit current approaches. Finally, we provide an opinionated list of promising research directions toward making progress on these bottlenecks, hoping to inspire future research in this rapidly maturing field.
arXiv:2503.21947v1 Announce Type: new Abstract: Modern development methodologies, such as Kanban and continuous integration and continuous deployment (CI/CD), are critical for web application development -- as software products must adapt to changing requirements and deploy products to users quickly. As web application attacks and exploited vulnerabilities are rising, it is increasingly crucial to integrate security into modern development practices. Yet, the iterative and incremental nature of these processes can clash with the sequential nature of security engineering. Thus, it is challenging to adopt security practices and activities in modern development practices. Dynamic Application Security Testing (DAST) is a security practice within software development frameworks that bolsters system security. This study delves into the intersection of Agile development and DAST, exploring how a software organization attempted to integrate DAST into their Kanban workflows and CI/CD pipelines to identify and mitigate security vulnerabilities within the development process. Through an action research case study incorporating interviews among team members, this research elucidates the challenges, mitigation techniques, and best practices associated with incorporating DAST into Agile methodologies from developers' perspectives. We provide insights into integrating security practices with modern development, ensuring both speed and security in software delivery.
arXiv:2503.21960v1 Announce Type: new Abstract: This study explores how Scrum practices were adjusted for remote and hybrid work during and after the COVID-19 pandemic, using a Delphi study with Scrum Masters to gather expert insights. Preliminary key findings highlight communication as the primary challenge, leading to adjustments in meeting structures, information-sharing practices, and collaboration tools. Teams restructured ceremonies, introduced new meetings, and implemented persistent information-sharing mechanisms to improve their work.
arXiv:2503.21971v1 Announce Type: new Abstract: Large language models have recently transformed hardware design, yet bridging the gap between code synthesis and PPA (power, performance, and area) estimation remains a challenge. In this work, we introduce a novel framework that leverages a 21k dataset of thoroughly cleaned and synthesizable Verilog modules, each annotated with detailed power, delay, and area metrics. By employing chain-of-thought techniques, we automatically debug and curate this dataset to ensure high fidelity in downstream applications. We then fine-tune CodeLlama using LoRA-based parameter-efficient methods, framing the task as a regression problem to accurately predict PPA metrics from Verilog code. Furthermore, we augment our approach with a mixture-of-experts architecture, integrating both LoRA and an additional MLP expert layer, to further refine predictions. Experimental results demonstrate significant improvements: power estimation accuracy is enhanced by 5.9% at a 20% error threshold and by 7.2% at a 10% threshold, delay estimation improves by 5.1% and 3.9%, and area estimation sees gains of 4% and 7.9% for the 20% and 10% thresholds, respectively. Notably, the incorporation of the mixture-of-experts module contributes an additional 3-4% improvement across these tasks. Our results establish a new benchmark for PPA-aware Verilog generation, highlighting the effectiveness of our integrated dataset and modeling strategies for next-generation EDA workflows.
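A minimal sketch of framing PPA prediction as regression with a LoRA-adapted code model, using Hugging Face transformers and peft; the hyperparameters, target modules, and three-output head are illustrative assumptions, and the paper's mixture-of-experts layer is not reproduced here:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Illustrative setup: a regression head on a code LLM, adapted with LoRA.
base = "codellama/CodeLlama-7b-hf"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=3, problem_type="regression")  # predict power, delay, area

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="SEQ_CLS")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the low-rank adapters (and head) train

inputs = tok("module and2(input a, b, output y); assign y = a & b; endmodule",
             return_tensors="pt")
print(model(**inputs).logits)  # [power, delay, area] predictions (untrained here)
```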
arXiv:2503.22066v1 Announce Type: new Abstract: Open-source software communities thrive on global collaboration and contributions from diverse participants. This study explores the Rust programming language ecosystem to understand its contributors' demographic composition and interaction patterns. Our objective is to investigate the phenomenon of participation inequality in key Rust projects and the diversity present among their contributors. We studied GitHub pull request data from the year leading up to the release of the latest completed Rust community annual survey in 2023. Specifically, we extracted information from three leading repositories: Rust, Rust Analyzer, and Cargo, and used social network graphs to visualize the interactions and identify central contributors and sub-communities. Social network analysis reveals concerning disparities in gender and geographic representation among contributors who play pivotal roles in collaboration networks, as well as varying levels of diversity across the sub-communities formed. These results suggest that while the Rust community is globally active, the contributor base does not fully reflect the diversity of the wider user community. We conclude that more inclusive practices are needed to encourage broader participation and to ensure that the contributor base aligns more closely with the diverse global community that uses Rust.
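A small networkx sketch of this analysis pipeline: build an interaction graph from mined (author, reviewer) pull-request pairs, then compute centrality and detect sub-communities; the pairs below are placeholders for real GitHub data:

```python
import networkx as nx

# Placeholder (author, reviewer) pairs standing in for mined pull-request data.
interactions = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
                ("dave", "alice"), ("erin", "alice"), ("dave", "erin")]

G = nx.Graph()
for author, reviewer in interactions:
    if G.has_edge(author, reviewer):
        G[author][reviewer]["weight"] += 1  # repeated collaboration counts more
    else:
        G.add_edge(author, reviewer, weight=1)

# Central contributors and sub-communities in the collaboration network.
central = sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])
print(central[:3])                                        # most connected contributors
print(list(nx.community.louvain_communities(G, seed=0)))  # detected sub-communities
```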
arXiv:2503.22228v1 Announce Type: new Abstract: Many verification algorithms are currently available to improve the reliability of software systems, but selecting the appropriate algorithm typically demands domain expertise and non-trivial manpower. An automated algorithm selector is thus desired. However, existing selectors, which depend either on machine-learned strategies or on manually designed heuristics, suffer from issues such as reliance on high-quality samples with algorithm labels and limited scalability. In this paper, an automated algorithm selection approach, MFH, is proposed for software verification. Our approach leverages the heuristic that verifiers producing correct results typically implement appropriate algorithms, so the algorithms supported by these verifiers indirectly indicate which ones are potentially applicable. Specifically, MFH embeds the code property graph (CPG) of a semantics-preserving transformed program to enhance the robustness of the prediction model. Furthermore, our approach decomposes the selection task into the sub-tasks of predicting potentially applicable algorithms and matching the most appropriate verifiers. Additionally, MFH introduces a feedback loop on incorrect predictions to improve prediction accuracy. We evaluate MFH on 20 verifiers and over 15,000 verification tasks. Experimental results demonstrate the effectiveness of MFH, which achieves a prediction accuracy of 91.47% even without ground-truth algorithm labels provided during the training phase. Moreover, the prediction accuracy decreases by only 0.84% when 10 new verifiers are introduced, indicating the strong scalability of the proposed approach.
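The two-stage decomposition can be sketched as follows; the interfaces and the toy verifier data are illustrative stand-ins, not MFH's code: stage 1 predicts which algorithms may apply to a program, and stage 2 matches the verifier whose supported algorithms best cover that prediction.

```python
# Supported algorithms per verifier (toy data in place of the real 20-verifier setup).
verifier_algorithms = {
    "cpachecker": {"predicate-abstraction", "bmc"},
    "esbmc": {"bmc", "k-induction"},
    "ultimate": {"automata", "predicate-abstraction"},
}

def predict_algorithms(cpg_embedding: list[float]) -> set[str]:
    """Stage 1 stand-in: a trained model would map a CPG embedding to algorithms."""
    return {"bmc", "k-induction"}  # placeholder prediction

def select_verifier(cpg_embedding: list[float]) -> str:
    """Stage 2: rank verifiers by overlap with the predicted algorithm set."""
    predicted = predict_algorithms(cpg_embedding)
    return max(verifier_algorithms,
               key=lambda v: len(verifier_algorithms[v] & predicted))

print(select_verifier([0.1, -0.3, 0.7]))  # -> "esbmc" (covers both predicted algorithms)
```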
arXiv:2409.09464v3 Announce Type: replace Abstract: It is natural to suppose that a Large Language Model is more likely to generate correct test cases when prompted with correct code under test than with incorrect code under test. However, the size of this effect has never been measured, despite its obvious importance for both practicing software engineers and researchers. To answer this question, we conducted a comprehensive empirical study of 5 open-source and 6 closed-source language models, using 3 widely used benchmark data sets together with 41 repo-level real-world examples drawn from two different real-world data sets. Our results reveal that, compared with incorrect code under test, LLMs prompted with correct code achieve improvements in test accuracy, code coverage, and bug detection of 57%, 12%, and 24%, respectively. We further show that these conclusions carry over from the three benchmark data sets to real-world code, where tests generated for incorrect code suffer a 47% worse bug detection rate. Finally, we report that improvements of +18% in accuracy, +4% in coverage, and +34% in bug detection can be achieved by providing natural language code descriptions. These findings have actionable implications. For example, the 47% reduction in real-world bug detection is a clear concern. Fortunately, it is a concern for which our findings about the added value of descriptions offer an immediately actionable remedy.
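The core phenomenon can be illustrated with a toy system under test; the "generated" suites below stand in for LLM output and show how incorrect code under test can poison the test oracle:

```python
# Toy SUT: absolute value, in a correct and a buggy variant.

def correct_abs(x: int) -> int:
    return x if x >= 0 else -x

def buggy_abs(x: int) -> int:  # the incorrect code under test
    return x                   # bug: negative inputs pass through unchanged

# Tests "generated" from the correct code encode the right oracle...
tests_from_correct = [lambda f: f(-3) == 3, lambda f: f(4) == 4]
# ...while tests generated from buggy code may encode the bug as expected behavior.
tests_from_buggy = [lambda f: f(-3) == -3, lambda f: f(4) == 4]

def detects_bug(tests, bug) -> bool:
    """A suite detects the bug if it passes on the correct code but fails on the mutant."""
    return all(t(correct_abs) for t in tests) and not all(t(bug) for t in tests)

print(detects_bug(tests_from_correct, buggy_abs))  # True
print(detects_bug(tests_from_buggy, buggy_abs))    # False: the buggy oracle hides the bug
```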