Hermes Agent is INSANE…


Description:

Wes Roth builds and runs a custom model benchmark using a physics-based gravity well ship simulation — a game where AI models must iteratively write code to pilot ships around suns while managing fuel, momentum, and collision avoidance. The entire simulation, including the website, game logic, and scoring system, was built using large language models. Each model receives 20 iterations to improve its piloting scripts, with scores logged over time to measure learning curves rather than single-shot performance.
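
The simulation itself isn't shown in code in the description, but the core loop it describes — inverse-square gravity toward a sun, fuel-limited thrust, and collision checking — can be sketched as follows. Everything here (the `Ship` class, `SUN_MU`, the crash radius) is a hypothetical illustration of that kind of physics step, not the video's actual implementation.

```python
import math

SUN_POS = (0.0, 0.0)   # sun at the origin (illustrative)
SUN_MU = 1000.0        # gravitational parameter G*M (hypothetical value)
DT = 0.1               # integration timestep
CRASH_RADIUS = 5.0     # inside this distance, the ship hits the sun

class Ship:
    def __init__(self, x, y, vx, vy, fuel):
        self.x, self.y = x, y
        self.vx, self.vy = vx, vy
        self.fuel = fuel

    def step(self, thrust=(0.0, 0.0)):
        """Advance one timestep; return False if the ship crashed."""
        # Inverse-square pull toward the sun.
        dx, dy = SUN_POS[0] - self.x, SUN_POS[1] - self.y
        r = math.hypot(dx, dy)
        ax = SUN_MU * dx / r**3
        ay = SUN_MU * dy / r**3
        # Thrust costs fuel proportional to its magnitude; no fuel, no burn.
        tx, ty = thrust
        cost = math.hypot(tx, ty) * DT
        if cost <= self.fuel:
            self.fuel -= cost
            ax += tx
            ay += ty
        # Semi-implicit Euler integration: update velocity, then position.
        self.vx += ax * DT
        self.vy += ay * DT
        self.x += self.vx * DT
        self.y += self.vy * DT
        return math.hypot(self.x - SUN_POS[0], self.y - SUN_POS[1]) > CRASH_RADIUS

# Coast with no thrust from a near-circular starting velocity.
ship = Ship(x=100.0, y=0.0, vx=0.0, vy=3.0, fuel=10.0)
alive = all(ship.step() for _ in range(50))
```

A model's piloting script would choose the `thrust` vector each step; the benchmark's difficulty comes from balancing fuel spend against orbital momentum.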

Benchmark results across frontier models are the core of the video: Claude Opus 4.5 achieves a high score of 276 with a clear upward learning trajectory; Claude Sonnet 4.6 tops out around 78; Claude Sonnet 4.5 scores as low as 1 on its first attempt. The overnight automated test run — using the Anthropic API from 2:17 a.m. to 5:32 a.m. — also covers GPT-5.4, GPT-5.5, GPT-5.5 Pro, Grok 420, Deepseek V4 Pro, and Gemini 3.1 Pro Preview, with a PvP leaderboard bracket emerging from the results. Roth notes that some models train directly on published benchmarks, making independent custom evaluations like this more meaningful for assessing real reasoning and code-generation capability.
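
The scoring harness described above — fixed attempts per model, full score logs, learning trajectory rather than single best run — might look something like this minimal sketch. The function names (`run_benchmark`, `pilot_attempt`, `summarize`) are illustrative assumptions, not the video's actual code.

```python
def run_benchmark(pilot_attempt, iterations=20):
    """Run a model for a fixed number of attempts; return the full score log."""
    scores = []
    for i in range(iterations):
        # The model sees its own score history and tries to improve on it.
        scores.append(pilot_attempt(i, list(scores)))
    return scores

def summarize(scores):
    # High score plus a crude learning signal: does the second half
    # of the run outscore the first half on average?
    half = len(scores) // 2
    improving = sum(scores[half:]) / half > sum(scores[:half]) / half
    return {"best": max(scores), "improving": improving}

# Toy stand-in "model" whose score climbs with each iteration.
log = run_benchmark(lambda i, history: 10 + 5 * i)
summary = summarize(log)  # best=105, improving=True
```

Logging every attempt is what lets a leaderboard distinguish a model that stumbles into one lucky high score from one with a genuine upward learning curve.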

The second half of the video covers Hermes Agent installation on a fresh Ubuntu VPS through Hostinger, walking through SSH login, terminal setup, and running the Hermes installer — using an AI assistant throughout to explain each command in context. The tutorial positions AI-guided technical setup as a replacement for traditional step-by-step documentation.


📺 Source: Wes Roth · Published April 27, 2026
🏷️ Format: Benchmark Test
