Hermes Agent is INSANE…


Description:

Wes Roth builds and runs a custom model benchmark using a physics-based gravity well ship simulation — a game where AI models must iteratively write code to pilot ships around suns while managing fuel, momentum, and collision avoidance. The entire simulation, including the website, game logic, and scoring system, was built using large language models. Each model receives 20 iterations to improve its piloting scripts, with scores logged over time to measure learning curves rather than single-shot performance.
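
The simulation itself isn't shown in code in the description, but the core loop it describes — inverse-square gravity toward a sun, fuel-limited thrust, and collision checking — can be sketched as follows. Everything here (the `Ship` class, `SUN_MU`, the crash radius) is a hypothetical illustration of that kind of physics step, not the video's actual implementation.

```python
import math

SUN_POS = (0.0, 0.0)   # sun at the origin (illustrative)
SUN_MU = 1000.0        # gravitational parameter G*M (hypothetical value)
DT = 0.1               # integration timestep
CRASH_RADIUS = 5.0     # inside this distance, the ship hits the sun

class Ship:
    def __init__(self, x, y, vx, vy, fuel):
        self.x, self.y = x, y
        self.vx, self.vy = vx, vy
        self.fuel = fuel

    def step(self, thrust=(0.0, 0.0)):
        """Advance one timestep; return False if the ship crashed."""
        # Inverse-square pull toward the sun.
        dx, dy = SUN_POS[0] - self.x, SUN_POS[1] - self.y
        r = math.hypot(dx, dy)
        ax = SUN_MU * dx / r**3
        ay = SUN_MU * dy / r**3
        # Thrust costs fuel proportional to its magnitude; no fuel, no burn.
        tx, ty = thrust
        cost = math.hypot(tx, ty) * DT
        if cost <= self.fuel:
            self.fuel -= cost
            ax += tx
            ay += ty
        # Semi-implicit Euler integration: update velocity, then position.
        self.vx += ax * DT
        self.vy += ay * DT
        self.x += self.vx * DT
        self.y += self.vy * DT
        return math.hypot(self.x - SUN_POS[0], self.y - SUN_POS[1]) > CRASH_RADIUS

# Coast with no thrust from a near-circular starting velocity.
ship = Ship(x=100.0, y=0.0, vx=0.0, vy=3.0, fuel=10.0)
alive = all(ship.step() for _ in range(50))
```

A model's piloting script would choose the `thrust` vector each step; the benchmark's difficulty comes from balancing fuel spend against orbital momentum.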

Benchmark results across frontier models are the core of the video: Claude Opus 4.5 achieves a high score of 276 with a clear upward learning trajectory; Claude Sonnet 4.6 tops out around 78; Claude Sonnet 4.5 scores as low as 1 on its first attempt. The overnight automated test run — using the Anthropic API from 2:17 a.m. to 5:32 a.m. — also covers GPT-5.4, GPT-5.5, GPT-5.5 Pro, Grok 420, Deepseek V4 Pro, and Gemini 3.1 Pro Preview, with a PvP leaderboard bracket emerging from the results. Roth notes that some models train directly on published benchmarks, making independent custom evaluations like this more meaningful for assessing real reasoning and code-generation capability.
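
The scoring harness described above — fixed attempts per model, full score logs, learning trajectory rather than single best run — might look something like this minimal sketch. The function names (`run_benchmark`, `pilot_attempt`, `summarize`) are illustrative assumptions, not the video's actual code.

```python
def run_benchmark(pilot_attempt, iterations=20):
    """Run a model for a fixed number of attempts; return the full score log."""
    scores = []
    for i in range(iterations):
        # The model sees its own score history and tries to improve on it.
        scores.append(pilot_attempt(i, list(scores)))
    return scores

def summarize(scores):
    # High score plus a crude learning signal: does the second half
    # of the run outscore the first half on average?
    half = len(scores) // 2
    improving = sum(scores[half:]) / half > sum(scores[:half]) / half
    return {"best": max(scores), "improving": improving}

# Toy stand-in "model" whose score climbs with each iteration.
log = run_benchmark(lambda i, history: 10 + 5 * i)
summary = summarize(log)  # best=105, improving=True
```

Logging every attempt is what lets a leaderboard distinguish a model that stumbles into one lucky high score from one with a genuine upward learning curve.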

The second half of the video covers Hermes Agent installation on a fresh Ubuntu VPS through Hostinger, walking through SSH login, terminal setup, and running the Hermes installer — using an AI assistant throughout to explain each command in context. The tutorial positions AI-guided technical setup as a replacement for traditional step-by-step documentation.


📺 Source: Wes Roth · Published April 27, 2026
🏷️ Format: Benchmark Test
