CMU-S3D-25-101
Software and Societal Systems Department
School of Computer Science, Carnegie Mellon University



CMU-S3D-25-101

Navigating Challenges with LLM-based Code
Generation using Software-specific Insights

Nikitha Rao

April 2025

Ph.D. Thesis
Software Engineering

CMU-S3D-25-101.pdf


Keywords: Large Language Models for Code, Generative AI, Verification, Reliability

The software development process is rapidly evolving with the advancement of Large Language Models (LLMs). LLMs are not only transforming the way code is written but are also increasingly integrated into AI programming tools, such as ChatGPT and GitHub Copilot, to enhance developer productivity by generating programs from natural language instructions, identifying and fixing bugs, and generating documentation, among other tasks.

These LLMs are pretrained on large volumes of natural language and code data. They are trained with cross-entropy and preference losses that include no term for correctness and optimize only for matching the ground truth. Therefore, despite their proficiency in learning code syntax, they fall short in capturing semantic signals. To date, efforts to improve these models have focused mainly on training larger models and collecting more human preference data. However, user studies have found notable usability issues with these larger models, including difficulty in understanding the generated code, the presence of subtle bugs that are hard to find, and a lack of verification of the generated code.
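To make this concrete, the standard next-token cross-entropy objective (shown below in its usual form) rewards reproducing the reference tokens and contains no term measuring whether the generated code compiles, passes tests, or is otherwise semantically correct:

    \mathcal{L}_{\mathrm{CE}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})

where x_1, ..., x_T is the ground-truth token sequence from the training data.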

This dissertation demonstrates that integrating domain insights from software engineering into AI-based code generation can enhance reliability and utility for developers. This is done by empowering the model to take on a more active role in building valid and usable code, instilling greater trust among users in the capabilities of the model. I focus on three main challenges identified by prior work and propose solutions using software-specific insights.

     (1) The generated code can be difficult to understand and manipulate, especially for non-expert programmers. To address this, I contribute LOWCODER, a tool that abstracts away the syntactic complexity associated with traditional code and provides a more user-friendly interface using drag-and-drop functionality. As a result, LOWCODER provides a trusted environment where users can leverage the capabilities of AI without the need for extensive coding knowledge.
     (2) Verifying the correctness of the generated code is hard. While LLMs excel at generating code, they fall short when it comes to generating tests. This is largely because current models are trained on individual files and therefore cannot take the code under test into account as context. To overcome this, I contribute CAT-LM, an LLM trained to explicitly consider the mapping between code and test files. CAT-LM can therefore help users verify code that they or other models generate, by generating tests that align more coherently with the underlying code (a simplified sketch of this code-test pairing appears after this list).
     (3) The generated code often has subtle bugs that are hard to find. To address this, I contribute DIFFSPEC, a framework for generating differential tests with LLMs using prompt chaining to verify code correctness. DIFFSPEC draws on various software artifacts, such as natural language specification documents, source code, existing tests, and previous bug reports, to generate tests that not only verify code correctness but also check conformance to the specification (a minimal differential-testing sketch also appears after this list). By highlighting meaningful behavioral differences between implementations, DIFFSPEC can enhance the overall reliability of even extensively tested software systems.
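As a simplified illustration of the idea behind training on code-test mappings, a code file and its matching test file can be concatenated into a single training sequence so the model sees the code under test as context. The file paths and separator token below are hypothetical placeholders, not CAT-LM's actual data format.

    # Sketch: build one training example from a code file and its matching test file.
    # Paths and the separator token are illustrative placeholders.
    from pathlib import Path

    SEPARATOR = "<|code_test_sep|>"  # hypothetical separator token

    def build_training_example(code_path: str, test_path: str) -> str:
        """Concatenate a code file with its test file into one sequence."""
        code = Path(code_path).read_text()
        tests = Path(test_path).read_text()
        return f"{code}\n{SEPARATOR}\n{tests}"

    if __name__ == "__main__":
        # Hypothetical repository layout pairing a source file with its test file.
        example = build_training_example("src/calculator.py", "tests/test_calculator.py")
        print(example[:200])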
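The differential-testing principle that DIFFSPEC builds on can be sketched minimally as follows: run the same inputs through two independent implementations of one specification and flag any disagreement. The two sort routines and hard-coded inputs below are hypothetical stand-ins; DIFFSPEC itself derives tests from specification documents, existing tests, and bug reports via prompt chaining.

    # Minimal differential-testing sketch (illustrative, not the DIFFSPEC implementation).

    def implementation_a(xs):
        # Implementation A: Python's built-in sort.
        return sorted(xs)

    def implementation_b(xs):
        # Implementation B: a hand-written insertion sort of the same specification.
        out = []
        for x in xs:
            i = 0
            while i < len(out) and out[i] <= x:
                i += 1
            out.insert(i, x)
        return out

    def differential_test(inputs):
        # Report every input on which the two implementations disagree.
        differences = []
        for xs in inputs:
            a, b = implementation_a(list(xs)), implementation_b(list(xs))
            if a != b:
                differences.append((xs, a, b))
        return differences

    if __name__ == "__main__":
        test_inputs = [[3, 1, 2], [], [5, 5, 1], [2, -1, 0, 2]]
        for inp, out_a, out_b in differential_test(test_inputs):
            print(f"behavioral difference on {inp}: A -> {out_a}, B -> {out_b}")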

The goal of my dissertation is to demonstrate the significance of integrating software-specific insights when training models, in order to make code generation more reliable and useful for developers. My dissertation work contributes several artifacts, including datasets, evaluation frameworks, and models trained with software-specific insights to improve the quality of generated code. Importantly, these models are all quite small relative to cutting-edge general-purpose models like GPT-4. While large, general models can also be very useful for these tasks, they have their own limitations: few companies can afford the immense resources required to train them, and most are closed-source, offering the community only limited free access that can be unreliable. In contrast, my work produces smaller open-source models that are specialized for various programming-related tasks, resulting in tools that make code generation more reliable and useful for developers.

134 pages

Thesis Committee:
Vincent J. Hellendoorn (Co-Chair)
Claire Le Goues (Co-Chair)
Daniel Fried
Andrew Begel
Thomas Zimmermann (University of California, Irvine)

Nicolas Christin, Head, Software and Societal Systems Department
Martial Hebert, Dean, School of Computer Science

