arXiv

Measuring and Mitigating Bias in Code Generated by Large Language Models

June 2, 2026 · Yuxi Chen, Yutian Tang, Timothy Storer

Title: Assessing and Reducing Bias in Code Produced by Large Language Models

Abstract:

Large Language Models (LLMs) are widely recognized for their proficiency in natural language generation, and their adoption for code generation is growing rapidly. Despite this progress, significant concerns persist about bias in the code they produce. This study focuses on two prominent code generation tools, GPT-4o and Gemini, and introduces a novel framework for evaluating bias in LLM-generated code. Our analysis examines how protected attributes, prompt variations, and web-search functionality affect the generated code.

To measure these effects, we employ two metrics: the Code Bias Score (CBS), which quantifies the overall prevalence of bias, and the Attribute Change Ratio (ACR), which assesses the extent to which specific attributes influence the results. Furthermore, we test four lightweight mitigation techniques (Few-Shot, Chain-of-Thought, Few-Shot Chain-of-Thought, and Multi-agent) aimed at reducing bias in the generated code. Our results indicate that bias remains widespread across protected attributes and datasets, even after these mitigation strategies are applied. These findings underscore the urgent need for more robust methods to minimize bias in AI-driven code generation systems.
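
The abstract names the two metrics but does not give their formulas. As a rough illustration only, the sketch below assumes one plausible reading: CBS as the share of generated samples flagged as biased, and ACR as the share of samples for a given attribute whose output changes when only that attribute's value is altered. The `Sample` record, its fields, and both function definitions are hypothetical and may differ from the definitions used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One generated function with hypothetical evaluation outcomes."""
    attribute: str        # protected attribute under test, e.g. "age"
    is_biased: bool       # assumed flag: the code's logic depends on the attribute
    output_changed: bool  # assumed flag: result differs when only the attribute value changes


def code_bias_score(samples):
    """Assumed definition: fraction of all samples flagged as biased."""
    if not samples:
        return 0.0
    return sum(s.is_biased for s in samples) / len(samples)


def attribute_change_ratio(samples, attribute):
    """Assumed definition: fraction of one attribute's samples whose output changed."""
    relevant = [s for s in samples if s.attribute == attribute]
    if not relevant:
        return 0.0
    return sum(s.output_changed for s in relevant) / len(relevant)


# Illustrative, made-up evaluation records (not data from the paper).
samples = [
    Sample("age", True, True),
    Sample("age", False, False),
    Sample("gender", True, False),
    Sample("gender", False, False),
]

print(code_bias_score(samples))
print(attribute_change_ratio(samples, "age"))
print(attribute_change_ratio(samples, "gender"))
```

Under these assumed definitions, CBS summarizes prevalence across the whole sample set, while ACR is computed per attribute, so the two can diverge: an attribute may appear in many flagged functions yet rarely change the final output.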
