Descriptions:
Prasenjit Sarkar from Sonar presents an enterprise-focused LLM code quality evaluation that goes substantially beyond standard SWE-bench or HumanEval pass rates. Sonar ran 53+ models — including multiple versions of Gemini, Claude, and GPT — against 4,444 Java programming assignments and analyzed the outputs with SonarQube Enterprise, measuring security vulnerabilities, cyclomatic complexity, cognitive complexity, bug density, and verbosity alongside functional correctness.
Key findings are striking: Gemini 3.1 Pro High leads on functional correctness at 84.17% pass rate and produces a relatively concise 307,000 lines of code for the full assignment set, with 614 bugs per million lines. Claude Sonnet 4.6 registers the highest security risk in the evaluation at 300 security issues per million lines of code and generates 627,000 lines. GPT-5.4 is by far the most verbose, generating 1.2 million lines for the same 4,444 assignments, nearly four times Gemini's output (about 3.9x). The presenter attributes growing verbosity in newer models to mixed-quality training data and inherited security flaws from open-source code used during pretraining.
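To put the quoted figures on a common scale, the arithmetic can be sketched as below. This is a rough illustration, not Sonar's methodology: the line counts are the rounded totals quoted above, the helper names (`lines_per_assignment`, `implied_count`) are made up for this example, and the implied absolute counts assume each per-million density was measured over that model's own generated line total.

```python
ASSIGNMENTS = 4444  # Java assignments in the evaluation, as quoted above

# Figures as quoted in the summary; line totals are rounded, so results are approximate.
models = {
    "Gemini 3.1 Pro High": {"lines": 307_000, "bugs_per_mloc": 614},
    "Claude Sonnet 4.6": {"lines": 627_000, "security_per_mloc": 300},
    "GPT-5.4": {"lines": 1_200_000},
}


def lines_per_assignment(total_lines, assignments=ASSIGNMENTS):
    """Average generated lines per assignment (hypothetical helper)."""
    return total_lines / assignments


def implied_count(density_per_mloc, total_lines):
    """Absolute issue count implied by a per-million-lines density.

    Assumption: the density applies to the model's own generated line total.
    """
    return density_per_mloc * total_lines / 1_000_000


for name, figures in models.items():
    avg = lines_per_assignment(figures["lines"])
    print(f"{name}: about {avg:.0f} lines per assignment")

# Verbosity of GPT-5.4 relative to Gemini 3.1 Pro High.
ratio = models["GPT-5.4"]["lines"] / models["Gemini 3.1 Pro High"]["lines"]
print(f"GPT-5.4 / Gemini line ratio: {ratio:.2f}")

# Implied absolute counts under the assumption stated above.
gemini_bugs = implied_count(614, 307_000)
claude_security = implied_count(300, 627_000)
print(f"Gemini implied bugs: {gemini_bugs:.0f}")
print(f"Claude implied security issues: {claude_security:.0f}")
```

Normalizing this way separates verbosity from defect density: a model with a low per-million rate can still ship a comparable absolute number of issues if it writes far more code for the same task.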
Sonar has published the full dataset and a continuously updated leaderboard at sonar.com/leaderboard, with per-model drill-downs that include complexity breakdowns and issue type distributions. For engineering teams evaluating which LLM to integrate into their development pipeline, this is one of the most granular publicly available comparisons focused specifically on production code quality.
📺 Source: AI Engineer · Published May 31, 2026
🏷️ Format: Benchmark Test







