CMU-S3D-25-100
Software and Societal Systems Department
School of Computer Science, Carnegie Mellon University



Exploiting Test Structure to Enhance
Language Models for Software Testing

Kush Jain

April 2025

Ph.D. Thesis
Software Engineering



Keywords: Software testing; machine learning; large language models

Software testing is an integral part of software development. However, writing high-quality tests is time-consuming and difficult, which leads to poorly maintained test suites and lower overall software quality. Prior automated test generation tools, such as EvoSuite and Randoop, can generate high-coverage tests, but these tests are often hard to read, unrealistic, or incorrect, requiring additional verification effort from developers. In contrast, language models have shown promise in generating human-like, high-quality code, powering code generation tools such as Copilot.

However, language models are less successful at generating tests, struggling both with hallucination and with correctly invoking methods defined in the code under test. This is because code generation language models are typically trained primarily for code generation and code completion, and existing benchmarks do not resemble real-world development, consisting largely of simple programming or LeetCode-style problems. To help overcome these limitations, I focus on incorporating domain-specific properties of testing, such as the strong coupling between source and test files and test execution data, to improve the evaluation and application of language models to software testing. I also examine how to better evaluate test generation approaches, using metrics that are more meaningful to developers and evaluating on larger codebases that more closely resemble real-world development. My thesis statement is: We exploit the structure of test code and the close relationship between code and test files to improve the evaluation and application of language models to software testing in both pretraining and fine-tuning. This insight can be used to (a) generate useful unit test cases, (b) identify weaknesses in existing test suites, (c) build more realistic test generation benchmarks, and (d) generate test suites for large-scale projects.

My thesis will make the following contributions:

  1. It presents a new method for pretraining test generation models that considers the relationship between source code and test code.
  2. It provides an approach to automatically classify mutants as detected or undetected without executing the test suite by leveraging additional test context.
  3. It evaluates all of the proposed techniques with metrics and experiments that are practically meaningful to developers but were not considered in prior work.
  4. It introduces a benchmark for evaluating test generation approaches that is sourced from large-scale open-source repositories and thus more closely resembles real-world test generation.
  5. It demonstrates the effectiveness of adding execution context to test generation models, enabling the generation of high-quality test suites for large-scale projects.

My work (ASE 2023) demonstrated that pretraining language models on the dual objectives of code and test generation significantly improves unit test generation. I also leveraged the joint relationship between code and tests (FSE 2023) to improve predictive mutation testing techniques, modeling mutants at the token level and incorporating both source and test methods during fine-tuning. I improved test generation evaluation (ICLR 2025) by introducing TestGenEval, a large test generation benchmark sourced from large-scale open-source repositories. Finally, I built a test generation agent (submitted to ICSE 2026) that incorporates execution feedback while scaling to the large open-source repositories in TestGenEval.
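
To make the dual-objective idea concrete, the sketch below shows one way paired code and test files could be concatenated in both directions during pretraining, so that a causal language model learns both code-to-test and test-to-code generation. This is a minimal illustration under stated assumptions, not the thesis's actual training recipe: the GPT-2 base model, the separator string, the 50/50 direction sampling, and the toy data pair are all placeholders chosen for brevity.

    # Minimal sketch of dual-objective pretraining on paired code/test files.
    # Assumptions (not the thesis's recipe): GPT-2 as the base model, a plain
    # string separator between the paired files, and random direction sampling.
    import random
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    SEP = "\n# === paired file ===\n"  # hypothetical code/test boundary marker

    def pair_to_ids(code: str, test: str) -> torch.Tensor:
        """Concatenate the pair in a random direction so the model sees both
        code -> test and test -> code orderings during pretraining."""
        first, second = (code, test) if random.random() < 0.5 else (test, code)
        return tokenizer(first + SEP + second, return_tensors="pt",
                         truncation=True, max_length=1024).input_ids

    # Toy stand-in for a corpus of (source file, matching test file) pairs.
    pairs = [("def add(a, b):\n    return a + b\n",
              "def test_add():\n    assert add(2, 3) == 5\n")]

    model.train()
    for code, test in pairs:
        input_ids = pair_to_ids(code, test)
        # Standard next-token objective over the concatenated pair.
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

In practice the direction could instead be signaled with distinct prompt tokens; random ordering is simply the most compact way to expose both objectives in a sketch.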

119 pages

Thesis Committee:
Claire Le Goues (Chair)
Christian Kaestner
Daniel Fried
Alex Groce (Northern Arizona University)

Nicolas Christin, Head, Software and Societal Systems Department
Martial Hebert, Dean, School of Computer Science

