OpenAI GPT-5.5 Spud Tops Benchmarks in Agentic Coding

April 27, 2026
Wayne Lowry

10+ years in Digital Marketing & SEO

Imagine handing your AI a vague, messy task—like "fix this repo's bugs and deploy it overnight"—and walking away, only to find it done right the next morning. No hand-holding, no endless back-and-forth. That's not sci-fi anymore. On April 23, 2026, OpenAI dropped GPT-5.5 "Spud", their first fully retrained base model since GPT-4.5, and it's rewriting the rules for GPT-5.5 Spud benchmarks in agentic coding and beyond.[1][2]

Developers are losing their minds on X today—threads exploding with Codex demos building full apps autonomously, terminal agents crushing multi-hour workflows, and viral clips of Spud outpacing Claude Opus 4.7 in real-time coding battles.[3] This isn't just hype. With an 82.7% score on Terminal-Bench 2.0—a 13-point leap over Opus 4.7's 69.4%—Spud signals rapid AI advancement, proving we're in an era where models don't just chat; they execute.[1][4]

In this deep dive, we'll unpack the benchmarks, what makes Spud tick, how it stacks up against rivals, and why it's a game-changer for devs building with tools like OpenAI Codex or ChatGPT Pro. Buckle up—this potato's no dud.

What is GPT-5.5 "Spud"? A Fully Retrained Powerhouse

Codename "Spud" (yes, like the potato—get it? Ground-up growth?), GPT-5.5 is OpenAI's boldest leap yet: the first fully retrained base model since GPT-4.5, not just an RLHF tweak or distillation of priors.[2] Trained on Stargate-scale compute (100K+ H100s), it matches GPT-5.4's per-token latency but crushes intelligence metrics, using fewer tokens overall for the same tasks—up to 72% less output than Opus 4.7 on identical coding jobs.[5]

Why does this matter? Previous ".5" releases were incremental. Spud's a ground-up rebuild, optimized for agentic workflows: planning, tool coordination, iteration, and persistence without constant babysitting. Rollout started April 23 in ChatGPT (Plus/Pro/Business/Enterprise) and Codex, with API access "very soon" after safeguards.[1]

Key specs at a glance:

| Feature | Details |
| --- | --- |
| Context window | 1M tokens |
| Pricing (API) | $5 / $30 per 1M input/output tokens (medium); up to $30 / $180 for Pro/xhigh |
| Variants | GPT-5.5, Thinking, Pro; reasoning levels: xhigh/high/medium/low/none |
| Modalities | Text primary; no native image/audio/video output (yet) |
| Strengths | Agentic coding, computer use, knowledge work, math/science[3] |

It's not bigger for bloat's sake—it's smarter per token, making it economical for high-volume agent runs. See our guide on AI model economics for how this shifts ROI.
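To see why tokens-per-task matters, here's a quick back-of-the-envelope in Python using the API prices above. The 200K-in/50K-out token counts are invented for illustration, and the 72% output reduction is OpenAI's own claim, not an independent measurement:

    # Rough per-task cost at GPT-5.5's listed medium-tier API prices.
    # Token counts are illustrative assumptions, not benchmark data.
    INPUT_PRICE = 5.00 / 1_000_000    # dollars per input token
    OUTPUT_PRICE = 30.00 / 1_000_000  # dollars per output token

    def task_cost(input_tokens: int, output_tokens: int) -> float:
        """Dollar cost of one agent run at medium-tier prices."""
        return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

    # Hypothetical run: 200K tokens of repo context in, 50K tokens of
    # patches/tool calls out, vs. the same run at ~72% fewer output tokens.
    baseline_out = 50_000
    lean_out = int(baseline_out * (1 - 0.72))
    print(f"Baseline output: ${task_cost(200_000, baseline_out):.2f}")
    print(f"72% leaner output: ${task_cost(200_000, lean_out):.2f}")

At high agent volume, that per-run delta compounds fast.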

GPT-5.5 Spud Benchmarks: Dominating Agentic Coding

Benchmarks aren't perfect, but Spud's numbers scream progress. OpenAI claims state-of-the-art (SOTA) results on 14 evals, and independent verification from Artificial Analysis crowns it #1 on their Intelligence Index (60, vs. 57 for both Opus 4.7 and Gemini 3.1 Pro).[3]

Terminal-Bench 2.0: The Agentic Coding Crown Jewel

82.7%—that's Spud's score on Terminal-Bench 2.0, testing complex CLI workflows: planning, iteration, tool calls, error recovery in a sandboxed terminal.[1] Breakdown:

| Model | Score | Delta from Spud |
| --- | --- | --- |
| GPT-5.5 | 82.7% | - |
| GPT-5.4 | 75.1% | -7.6 pts |
| Claude Opus 4.7 | 69.4% | -13.3 pts |
| Gemini 3.1 Pro | 68.5% | -14.2 pts |
| Claude Mythos Preview | 82.0% | -0.7 pts[6] |

This isn't gaming a leaderboard—it's real DevOps: "fail test → read logs → fix → retest → deploy." Spud uses fewer tokens to get there, slashing costs for unattended agents.[1]
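That loop is simple enough to sketch. Below is a minimal Python version of the "fail test → read logs → fix → retest" cycle, assuming the OpenAI Python SDK and a hypothetical gpt-5.5 model id; the real Terminal-Bench harness is far more elaborate, so treat this as the shape of the idea, not an implementation:

    # Minimal "fail test -> read logs -> fix -> retest" loop, sketched with
    # the OpenAI Python SDK. The model id "gpt-5.5" is assumed, not confirmed.
    import subprocess
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def run_tests() -> subprocess.CompletedProcess:
        """Run the suite and capture logs for the model to read."""
        return subprocess.run(["npm", "test"], capture_output=True, text=True)

    for attempt in range(5):  # bounded retries: persistence, not babysitting
        result = run_tests()
        if result.returncode == 0:
            print("Tests pass; safe to deploy.")
            break
        # Feed the failing logs back and ask for a unified diff.
        response = client.chat.completions.create(
            model="gpt-5.5",  # hypothetical model id
            messages=[
                {"role": "system",
                 "content": "Return only a unified diff that fixes the failing tests."},
                {"role": "user", "content": result.stdout + result.stderr},
            ],
        )
        patch = response.choices[0].message.content
        # `git apply -` reads the diff from stdin.
        subprocess.run(["git", "apply", "-"], input=patch, text=True)
    else:
        print("Gave up after 5 attempts; escalate to a human.")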

Other Agentic & Coding Wins

  • SWE-Bench Pro: 58.6% (single-pass GitHub issue resolution; Opus leads at 64.3%, but Spud solves more end-to-end)[1]
  • Expert-SWE (Internal): 73.1% (20-hour human-equivalent tasks; up from GPT-5.4's 68.5%)[1]
  • OSWorld-Verified: 78.7% (autonomous computer use; edges Opus 4.7's 78.0%)[2]

Caveats? The hallucination rate hit 86% on AA-Omniscience (vs. Opus's 36%), even as accuracy soared to 57%: it knows more, but it's overconfident when it doesn't.[3]

Head-to-Head: GPT-5.5 Spud vs. Claude Opus 4.7

Opus 4.7 dropped a week earlier, owning SWE-Bench. Spud fired back, reclaiming agentic leads. Side-by-side:

| Benchmark | GPT-5.5 Spud | Claude Opus 4.7 | Winner |
| --- | --- | --- | --- |
| Terminal-Bench 2.0 | 82.7% | 69.4% | Spud +13.3 pts[1] |
| SWE-Bench Pro | 58.6% | 64.3% | Opus +5.7 pts |
| OSWorld-Verified | 78.7% | 78.0% | Spud (narrow) |
| FrontierMath T1-3 | 51.7% | 43.8% | Spud +7.9 pts |
| CyberGym | 81.8% | 73.1% | Spud +8.7 pts |
| BrowseComp | 84.4% | 79.3% | Spud +5.1 pts[3] |

Spud excels in execution: unattended terminals, long-horizon agents, token efficiency. Opus shines in precision coding (e.g., PR patches).[7] Real-world tests? Devs report Spud finishing Codex tasks 2x faster/cheaper (20min/$3 vs. Opus's 40min/$6).[8]

For agentic coding, pick Spud. For architected refactors, Opus. Check our Claude vs. GPT showdown.

Real-World Impact: How Spud Powers Agentic Workflows

Forget benchmarks—here's Spud in action:

  1. Codex Overnight Builds: Prompt: "Build a Mac app from spec." Spud spins agents for UI/code/tests/deploy. Users report full prototypes while they sleep.[9]

    Example code snippet from a viral demo:

    # Spud autonomously ran this in Terminal-Bench style
    git clone repo && cd repo
    npm install && npm test  # Fixed 3 failing suites
    docker build -t app . && docker run -p 3000:3000 app
    # Deployed to Vercel: success!
    
  2. DevOps Automation: CI/CD loops without humans—read logs, patch, retest. Terminal-Bench 82.7% translates to 2x fewer interventions.

  3. Computer Use: OSWorld 78.7% means navigating browsers, spreadsheets, emails autonomously. Think: "Research Q1 earnings, build slide deck." (A toy version of this loop is sketched below.)
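For a feel of what "computer use" means mechanically, here's a toy screenshot → decide → act loop. The model id, the JSON action schema, and the pyautogui actuator are all illustrative assumptions; this is not OpenAI's actual computer-use stack:

    # Toy screenshot -> decide -> act loop for OSWorld-style computer use.
    # Model id, prompt format, and action schema are illustrative assumptions.
    import base64, io, json

    import pyautogui  # pip install pyautogui
    from openai import OpenAI

    client = OpenAI()

    def screenshot_b64() -> str:
        """Capture the screen and encode it for the vision input."""
        buf = io.BytesIO()
        pyautogui.screenshot().save(buf, format="PNG")
        return base64.b64encode(buf.getvalue()).decode()

    for step in range(10):  # bounded steps for safety
        response = client.chat.completions.create(
            model="gpt-5.5",  # hypothetical model id
            messages=[
                {"role": "system",
                 "content": 'Reply ONLY with JSON: {"action": "click"|"done", "x": int, "y": int}.'},
                {"role": "user", "content": [
                    {"type": "text", "text": "Open the spreadsheet and select cell A1."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{screenshot_b64()}"}},
                ]},
            ],
        )
        action = json.loads(response.choices[0].message.content)
        if action["action"] == "done":
            break
        pyautogui.click(action["x"], action["y"])  # execute the model's click

Production stacks replace the fragile JSON-parsing step with structured tool calls, but the screenshot-in, action-out loop is the core pattern.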

X buzz? "Spud just built my landing page in 12min" (Nate Herkelman); "Codex on 5.5 > Opus for agents" (dev threads).[8]

Products to try: OpenAI Codex for agents, ChatGPT Pro ($100/mo) for xhigh reasoning. Pair with Cursor or Replit for turbocharged dev.
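Once the API opens up, calls will presumably look like today's SDK usage. A minimal sketch, assuming the model ships under the id gpt-5.5 and exposes its reasoning levels through the SDK's existing reasoning_effort parameter (both unconfirmed):

    # Hypothetical GPT-5.5 call via the OpenAI Python SDK. The model id and
    # the reasoning_effort mapping are assumptions, not confirmed details.
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-5.5",          # assumed model id
        reasoning_effort="high",  # article lists xhigh/high/medium/low/none
        messages=[{"role": "user",
                   "content": "Plan a CI pipeline for a Node app."}],
    )
    print(response.choices[0].message.content)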

Why This Signals Rapid AI Advancement

Spud's launch, just weeks after GPT-5.4 and only a week after Opus 4.7, proves the pace is exploding. Sam Altman and Greg Brockman hint at "significantly more releases," fueled by Stargate compute.[10] Token efficiency + agentic gains = scalable autonomy.

Risks? Higher hallucination rates demand monitors. The safety card notes stronger cyber/bio guardrails, but resampling evals show slight misalignment upticks, so watch for long-horizon drift.[11]
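A cheap mitigation: never let the agent grade its own homework. Here's a minimal sketch of a deploy gate that re-runs verification independently of the model's transcript; the npm commands are stand-ins for whatever your pipeline actually uses:

    # Minimal hallucination guard for production agents: trust the exit
    # code of an independent re-run, never the agent's claim of success.
    import subprocess

    def verified(agent_claims_success: bool) -> bool:
        """Gate deploys on a fresh test run, not the agent's say-so."""
        if not agent_claims_success:
            return False
        check = subprocess.run(["npm", "test"], capture_output=True)
        return check.returncode == 0

    if verified(agent_claims_success=True):
        subprocess.run(["npm", "run", "deploy"])  # stand-in deploy step
    else:
        print("Claim not verified; holding deploy and flagging for review.")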

This is the shift: from "chatty assistants" to "workhorses." Dive into our agentic AI roadmap.

FAQ

What makes GPT-5.5 Spud's Terminal-Bench 2.0 score of 82.7% a big deal?

It's SOTA for complex CLI agents—planning, tools, recovery. Beats Opus 4.7 by 13 pts, enabling unattended DevOps/CI. Real win: fewer tokens = lower costs.[1]

How does GPT-5.5 compare to Claude Opus 4.7 for coding?

Spud owns agentic/terminal (82.7% vs 69.4%); Opus leads precision (SWE-Bench Pro 64.3% vs 58.6%). Use Spud for loops, Opus for patches.[7]

Is GPT-5.5 available now, and what's the pricing?

Yes—in ChatGPT/Codex for paid tiers; API soon. $5/$30 per 1M (medium); Pro xhigh up to $30/$180. 1M context.[3]

Any downsides to GPT-5.5 Spud?

86% hallucination rate (high confidence in unknowns); Opus better on some precision evals. Needs safeguards for production agents.

Spud's here, devs—are you deploying agents yet, or still prompting manually? Drop your wildest Codex story below! [1][2]

Affiliate Disclosure: As an Amazon Associate I earn from qualifying purchases. This site contains affiliate links.
