Imagine handing your AI a vague, messy task, like "fix this repo's bugs and deploy it overnight," and walking away, only to find it done right the next morning. No hand-holding, no endless back-and-forth. That's not sci-fi anymore. On April 23, 2026, OpenAI dropped GPT-5.5 "Spud", their first fully retrained base model since GPT-4.5, and its benchmark results are rewriting the rules for agentic coding and beyond.[1][2]
Developers are losing their minds on X today: threads exploding with Codex demos building full apps autonomously, terminal agents crushing multi-hour workflows, and viral clips of Spud outpacing Claude Opus 4.7 in real-time coding battles.[3] This isn't just hype. With an 82.7% score on Terminal-Bench 2.0, a 13.3-point leap over Opus 4.7's 69.4%, Spud signals rapid AI advancement, proving we're in an era where models don't just chat; they execute.[1][4]
In this deep dive, we'll unpack the benchmarks, what makes Spud tick, how it stacks up against rivals, and why it's a game-changer for devs building with tools like OpenAI Codex or ChatGPT Pro. Buckle up—this potato's no dud.
## What is GPT-5.5 "Spud"? A Fully Retrained Powerhouse
Codename "Spud" (yes, like the potato—get it? Ground-up growth?), GPT-5.5 is OpenAI's boldest leap yet: the first fully retrained base model since GPT-4.5, not just an RLHF tweak or distillation of priors.[2] Trained on Stargate-scale compute (100K+ H100s), it matches GPT-5.4's per-token latency but crushes intelligence metrics, using fewer tokens overall for the same tasks—up to 72% less output than Opus 4.7 on identical coding jobs.[5]
Why does this matter? Previous ".5" releases were incremental tweaks. Spud is a ground-up rebuild, optimized for agentic workflows: planning, tool coordination, iteration, and persistence without constant babysitting. Rollout started April 23 in ChatGPT (Plus/Pro/Business/Enterprise) and Codex, with API access promised "very soon" once safeguards clear.[1]
Key specs at a glance:
| Feature | Details |
|---|---|
| Context Window | 1M tokens |
| Pricing (API) | $5 input / $30 output per 1M tokens (medium tier); up to $30/$180 for Pro/xhigh |
| Variants | GPT-5.5, Thinking, Pro; reasoning levels: xhigh/high/medium/low/none |
| Modalities | Text primary; no native image/audio/video output (yet) |
| Strengths | Agentic coding, computer use, knowledge work, math/science[3] |
It's not bigger for bloat's sake—it's smarter per token, making it economical for high-volume agent runs. See our guide on AI model economics for how this shifts ROI.
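To make the economics concrete, here's a back-of-the-envelope sketch in Python at the published medium-tier prices; the per-run token counts are illustrative assumptions, not measured values:

```python
# Rough cost math at GPT-5.5 medium-tier prices ($ per 1M tokens).
# The token counts below are illustrative assumptions, not measurements.
SPUD_IN, SPUD_OUT = 5.00, 30.00

def task_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float) -> float:
    """Dollar cost of one agent run at the given per-1M-token prices."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical run: 200K input tokens; Spud emits ~72% fewer output
# tokens than a 50K-token baseline, per the efficiency claim above.
baseline_out = 50_000
spud_cost = task_cost(200_000, int(baseline_out * 0.28), SPUD_IN, SPUD_OUT)
print(f"Spud run: ${spud_cost:.2f}")  # ~$1.42 under these assumptions
```

Under those assumptions, the 72% output reduction alone cuts the output-side bill from $1.50 to $0.42 per run, a gap that compounds quickly across thousands of unattended agent runs.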
## GPT-5.5 Spud Benchmarks: Dominating Agentic Coding
Benchmarks aren't perfect, but Spud's numbers scream progress. OpenAI claims state-of-the-art (SOTA) on 14 evals, with independent verification from Artificial Analysis crowning it #1 on their Intelligence Index (60 vs. 57 for both Opus 4.7 and Gemini 3.1 Pro).[3]
### Terminal-Bench 2.0: The Agentic Coding Crown Jewel
82.7% is Spud's score on Terminal-Bench 2.0, which tests complex CLI workflows: planning, iteration, tool calls, and error recovery in a sandboxed terminal.[1] The breakdown:
| Model | Score | Delta from Spud |
|---|---|---|
| GPT-5.5 | 82.7% | - |
| GPT-5.4 | 75.1% | -7.6 pts |
| Claude Opus 4.7 | 69.4% | -13.3 pts |
| Gemini 3.1 Pro | 68.5% | -14.2 pts |
| Claude Mythos Preview | 82.0% | -0.7 pts[6] |
This isn't gaming a leaderboard—it's real DevOps: "fail test → read logs → fix → retest → deploy." Spud uses fewer tokens to get there, slashing costs for unattended agents.[1]
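For intuition, here's a minimal sketch of that loop in Python. The `propose_fix` stub stands in for a model call, and the `npm test` command is an assumption borrowed from a typical Node project, not anything OpenAI has published:

```python
# Minimal agent loop in the Terminal-Bench spirit:
# fail test -> read logs -> fix -> retest.
import subprocess

def run_tests() -> subprocess.CompletedProcess:
    # Run the project's test suite and capture its output as text.
    return subprocess.run(["npm", "test"], capture_output=True, text=True)

def propose_fix(logs: str) -> None:
    # Placeholder: send the failure logs to your model and apply its patch.
    raise NotImplementedError("wire this to your agent/model of choice")

for attempt in range(5):  # bounded retries so the agent can't loop forever
    result = run_tests()
    if result.returncode == 0:
        print("tests green, ready to deploy")
        break
    propose_fix(result.stdout + result.stderr)  # read logs, patch, retest
```

The bounded retry count is the important design choice: unattended agents need a hard stop so a stubborn failure surfaces to a human instead of burning tokens indefinitely.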
### Other Agentic & Coding Wins
- SWE-Bench Pro: 58.6% (single-pass GitHub issue resolution; Opus leads at 64.3%, but Spud completes more issues end-to-end without intervention)[1]
- Expert-SWE (Internal): 73.1% (20-hour human-equivalent tasks; up from GPT-5.4's 68.5%)[1]
- OSWorld-Verified: 78.7% (autonomous computer use; edges Opus 4.7's 78.0%)[2]
Caveats? Spud's hallucination rate hit 86% on AA-Omniscience (vs. Opus's 36%), even as its accuracy rose to 57%: it knows more, but it rarely abstains when it doesn't.[3]
## Head-to-Head: GPT-5.5 Spud vs. Claude Opus 4.7
Opus 4.7 dropped a week earlier, owning SWE-Bench. Spud fired back, reclaiming agentic leads. Side-by-side:
| Benchmark | GPT-5.5 Spud | Claude Opus 4.7 | Winner |
|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 69.4% | Spud +13.3 pts[1] |
| SWE-Bench Pro | 58.6% | 64.3% | Opus +5.7 pts |
| OSWorld-Verified | 78.7% | 78.0% | Spud (narrow) |
| FrontierMath T1-3 | 51.7% | 43.8% | Spud |
| CyberGym | 81.8% | 73.1% | Spud +8.7 pts |
| BrowseComp | 84.4% | 79.3% | Spud +5.1 pts[3] |
Spud excels in execution: unattended terminals, long-horizon agents, token efficiency. Opus shines in precision coding (e.g., PR patches).[7] Real-world tests? Devs report Spud finishing Codex tasks twice as fast at half the cost (20 min/$3 vs. Opus's 40 min/$6).[8]
For agentic coding, pick Spud. For architected refactors, Opus. Check our Claude vs. GPT showdown.
## Real-World Impact: How Spud Powers Agentic Workflows
Forget benchmarks—here's Spud in action:
- Codex Overnight Builds: Prompt: "Build a Mac app from spec." Spud spins up agents for UI, code, tests, and deploy. Users report waking to full prototypes.[9] Example code snippet from a viral demo:

  ```bash
  # Spud autonomously ran this in Terminal-Bench style
  git clone repo && cd repo
  npm install && npm test          # fixed 3 failing suites
  docker build -t app . && docker run -p 3000:3000 app
  # Deployed to Vercel: success!
  ```

- DevOps Automation: CI/CD loops without humans: read logs, patch, retest. Terminal-Bench's 82.7% translates to roughly half as many interventions.
- Computer Use: OSWorld's 78.7% means navigating browsers, spreadsheets, and email autonomously. Think: "Research Q1 earnings, build a slide deck." (See the API sketch after this list.)
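Here's a hedged sketch of kicking off such a task programmatically once API access lands, using the existing OpenAI Python SDK's Responses API. Treat "gpt-5.5" as an assumption: OpenAI hasn't published the final API identifier yet.

```python
# Sketch: launching a long-horizon task via the OpenAI Responses API.
# "gpt-5.5" is an assumed model id; the final identifier may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.5",
    input="Research Q1 earnings for ACME Corp and draft a slide outline.",
)
print(response.output_text)
```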
X buzz? "Spud just built my landing page in 12min" (Nate Herkelman); "Codex on 5.5 > Opus for agents" (dev threads).[8]
Products to try: OpenAI Codex for agents, ChatGPT Pro ($100/mo) for xhigh reasoning. Pair with Cursor or Replit for turbocharged dev.
## Why This Signals Rapid AI Advancement
Spud's launch, just weeks after GPT-5.4 and a week after Opus 4.7, shows how fast the pace is accelerating. Sam Altman and Greg Brockman hint at "significantly more releases," fueled by Stargate compute.[10] Token efficiency plus agentic gains equals scalable autonomy.
Risks? Higher hallucination rates demand monitors. The safety card notes stronger cyber/bio guardrails, but resampling evals show slight upticks in misalignment; watch for long-horizon drift.[11]
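One pragmatic monitor, sketched below under illustrative assumptions: never trust the agent's self-reported "done," and gate anything irreversible on checks the model can't influence. The specific commands are placeholders, not a prescribed setup.

```python
# Guardrail sketch: gate deploys on independent checks, not the model's
# own claim of success. The commands are illustrative placeholders.
import subprocess

def checks_pass() -> bool:
    tests_ok = subprocess.run(["npm", "test"]).returncode == 0
    lint_ok = subprocess.run(["npm", "run", "lint"]).returncode == 0
    return tests_ok and lint_ok

if checks_pass():
    print("verified; safe to deploy")  # swap in your real deploy step
else:
    print("agent output failed verification; holding for human review")
```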
This is the shift: from "chatty assistants" to "workhorses." Dive into our agentic AI roadmap.
## FAQ
### What makes GPT-5.5 Spud's Terminal-Bench 2.0 score of 82.7% a big deal?
It's SOTA for complex CLI agents (planning, tool use, error recovery). It beats Opus 4.7 by 13.3 points, enabling unattended DevOps/CI. The real win: fewer tokens means lower costs.[1]
### How does GPT-5.5 compare to Claude Opus 4.7 for coding?
Spud owns agentic/terminal (82.7% vs 69.4%); Opus leads precision (SWE-Bench Pro 64.3% vs 58.6%). Use Spud for loops, Opus for patches.[7]
### Is GPT-5.5 available now, and what's the pricing?
Yes, in ChatGPT/Codex for paid tiers, with API access coming soon. $5 input/$30 output per 1M tokens (medium); up to $30/$180 for Pro/xhigh. 1M-token context.[3]
### Any downsides to GPT-5.5 Spud?
An 86% hallucination rate on AA-Omniscience (it answers confidently even when it doesn't know), and Opus stays ahead on some precision evals. Production agents need safeguards and monitoring.
Spud's here, devs—are you deploying agents yet, or still prompting manually? Drop your wildest Codex story below! [1][2]
