Can LLMs generate Enterprise Quality Code? — Prasenjit Sarkar, Sonar



Description:

Prasenjit Sarkar from Sonar presents an enterprise-focused LLM code quality evaluation that goes substantially beyond standard SWE-bench or HumanEval pass rates. Sonar ran 53+ models — including multiple versions of Gemini, Claude, and GPT — against 4,444 Java programming assignments and analyzed the outputs with SonarQube Enterprise, measuring security vulnerabilities, cyclomatic complexity, cognitive complexity, bug density, and verbosity alongside functional correctness.

Key findings: Gemini 3.1 Pro High leads on functional correctness with an 84.17% pass rate and produces a relatively concise 307,000 lines of code for the full assignment set, with 614 bugs per million lines. Claude Sonnet 4.6 registers the highest security risk in the evaluation, at 300 security issues per million lines of code, and generates 627,000 lines. GPT-5.4 is by far the most verbose, generating 1.2 million lines for the same 4,444 assignments, roughly four times Gemini’s output (1.2 million / 307,000 ≈ 3.9). The presenter attributes growing verbosity in newer models to mixed-quality training data and inherited security flaws from open-source code used during pretraining.
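To make the quoted figures easier to compare, here is a minimal Python sketch that derives lines per assignment, verbosity relative to Gemini, and the absolute issue counts implied by the per-million-line densities. It uses only the rounded numbers from the summary above; the function names are hypothetical, and the derived values are plain arithmetic, not additional data from Sonar or the leaderboard.

```python
# Rounded figures quoted in the summary; nothing here comes from Sonar's dataset.
ASSIGNMENTS = 4_444

LINES_GENERATED = {
    "Gemini 3.1 Pro High": 307_000,
    "Claude Sonnet 4.6": 627_000,
    "GPT-5.4": 1_200_000,
}


def lines_per_assignment(total_lines, assignments=ASSIGNMENTS):
    """Average number of generated lines per assignment."""
    return total_lines / assignments


def implied_count(per_million_lines, total_lines):
    """Absolute issue count implied by a density quoted per million lines."""
    return per_million_lines * total_lines / 1_000_000


def verbosity_ratio(total_lines, baseline_lines):
    """How many times more lines a model produced than the baseline model."""
    return total_lines / baseline_lines


if __name__ == "__main__":
    baseline = LINES_GENERATED["Gemini 3.1 Pro High"]
    for model, loc in LINES_GENERATED.items():
        print(
            f"{model}: {lines_per_assignment(loc):.1f} lines/assignment, "
            f"{verbosity_ratio(loc, baseline):.2f}x Gemini's output"
        )
    # Densities quoted in the summary: 614 bugs/MLOC (Gemini), 300 security issues/MLOC (Claude).
    print(f"Gemini implied bugs: {implied_count(614, 307_000):.0f}")
    print(f"Claude implied security issues: {implied_count(300, 627_000):.0f}")
```

On these rounded inputs the GPT-5.4 to Gemini ratio comes out at about 3.9, and both quoted densities imply roughly 188 issues in absolute terms. That shows why a per-million-line rate and a raw count can rank models differently when their output volume differs.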

Sonar has published the full dataset and a continuously updated leaderboard at sonar.com/leaderboard, with per-model drill-downs that include complexity breakdowns and issue type distributions. For engineering teams evaluating which LLM to integrate into their development pipeline, this is one of the most granular publicly available comparisons focused specifically on production code quality.

📺 Source: AI Engineer · Published May 31, 2026
🏷️ Format: Benchmark Test


