AI News•Dec 28, 2025•6 min

Agents Are Growing Up - And So Are the Ways They Break

This week: MCP security pitfalls, Claude Skills, agent reasoning benchmarks, a leaner 162B SMoE, and a world model that predicts video futures.

The most important AI story this week isn't a new benchmark score. It's the slow realization that "agentic" AI is basically becoming software. And software has an entire genre of failure modes we already know too well: supply-chain attacks, dependency confusion, privilege escalation, and sneaky prompt-shaped injections that look like harmless config.

That's the vibe running through a bunch of updates right now. The tooling is getting more modular. People are packaging "skills" and "tools" like plugins. Models are being optimized specifically for long-context coding agents. And researchers are finally doing the unsexy work of measuring which reasoning patterns actually hold up under latency and tool-use constraints.

If you're building anything agent-shaped in 2026, this week's items are a pretty clean map of what will make you money-and what will wake you up at 3 a.m.

MCP security: welcome to the prompt supply chain

What caught my attention in the Model Context Protocol (MCP) security write-up is how familiar the risks feel. MCP is all about letting models talk to tools through a standardized interface. That's great. It also means you've created a "tool supply chain," and attackers love supply chains.

The big bucket of issues can be described like this: the model thinks it's calling a trusted tool, but the tool (or the tool's context) is lying. Tool poisoning is the obvious one. If an attacker can change a tool's behavior or its returned text, they can push hidden instructions back into the model. The model treats that output as authoritative because, hey, it came from the database tool or the "safe" retrieval layer.

Then you get the nastier variants that feel very 2025: "rug pulls" and "hijacks." Rug pull, in this framing, is when a tool that used to be trustworthy updates and becomes malicious-or gets acquired, compromised, or just quietly changes its behavior. Hijacking is when an attacker can intercept or redirect which tool is being invoked, or slip in hidden instructions that override the developer's intent.

Here's the part that matters for builders: we're past the era where prompt injection is just a clever demo. In an agent stack, hidden instructions aren't a parlor trick. They're a control plane attack. If your agent can deploy code, move money, email customers, or write to production systems, the "text" channel is effectively an admin interface unless you harden it.

My take: if you're adopting MCP (or anything MCP-like), treat tool outputs as untrusted input by default. Sanitize them. Gate them. Add allowlists for operations. Separate "data return" from "instruction return." And log everything, because you'll need forensics the first time a tool quietly convinces your agent that "rotating API keys" means "exfiltrate them."

The deeper trend is that agents are pushing security left. Not in the buzzword sense. In the brutal sense that your product's security now depends on prompt-layer decisions and tool protocol design, not just network policies.

Anthropic Claude Skills: packaging behavior, not just prompts

Anthropic's new push around Claude Skills is interesting because it's basically an admission that raw prompts don't scale. Skills are positioned as a more structured way to define reusable behaviors-something between "a prompt snippet" and "a full agent framework."

When I read their guidance, what stood out is the product philosophy: Skills are meant to be designed, documented, and reused. That sounds obvious, but it's a big deal culturally. Most teams still treat prompts like magic spells in a Notion doc. Skills are an attempt to make them more like software artifacts: scoped, versioned, tested, and shareable across projects.

Anthropic also spends time differentiating Skills from other building blocks like Projects, MCP, and subagents. That taxonomy matters. If you're a developer or PM, the fastest way to drown is to adopt every new abstraction layer without deciding what each one is for. Skills are "how Claude should do a class of tasks." MCP is "how Claude talks to external capabilities." Subagents are "how you decompose work." Those are different axes, and mixing them blindly is how you end up with an agent that is impossible to debug.

The catch: once Skills become a distribution mechanism, they inherit the same trust problems as MCP tools. A "Skill" can be a dependency. Dependencies can be compromised. And because Skills are about behavior, a malicious or sloppy Skill doesn't just leak data-it can normalize dangerous actions. If your org starts building an internal marketplace of Skills, you'll need governance and review like you would for shared libraries.

Still, I like this direction. It nudges teams toward repeatability. It also hints at the next platform battle: not "who has the best base model," but "who has the best ecosystem for composing model behavior safely."

Benchmarking agent reasoning: less vibes, more numbers

One of the most quietly important items this week is the empirical framework for benchmarking reasoning strategies in agentic systems. This is the kind of work that doesn't trend on social media and absolutely should.

We've spent years arguing about Direct prompting versus Chain-of-Thought (CoT), and then ReAct, Reflexion, self-consistency, tool-use loops, planner/executor splits, and so on. But in production, the question isn't "which one feels smarter." It's "which one hits the accuracy bar under latency constraints, tool costs, and failure recovery rules."

That's why I like that the framework looks at things like efficiency, latency, and tool use-not just end-task accuracy. Agents don't live in single-shot eval land. They live in retry loops, timeouts, rate limits, partial tool outages, and messy user input. A reasoning strategy that's 2% more accurate but 4× slower might be a net loss if it blows your SLA or makes your product feel sluggish.

Here's what I noticed: as soon as you measure tool calls and latency, you stop fetishizing "longer thinking" as a universal good. Sometimes a Direct approach plus a lightweight verifier is the right trade. Sometimes ReAct wins because it externalizes intermediate steps into tool interactions. Sometimes Reflexion helps because it catches systematic failure modes. The point is: it's contextual, and we need benchmarks that reflect that context.

If you're building agents, this should push you toward A/B testing reasoning patterns the same way you A/B test UI flows. Treat reasoning strategies as configurable policies. Measure cost per successful task, not just pass/fail. And keep a tight loop between evals and incident reports, because real-world failures are the best dataset you have.

Cerebras MiniMax-M2-REAP: the long-context agent tax is real

Cerebras releasing a memory-efficient version of a large SMoE model (MiniMax-M2-REAP) is a very practical signal: long-context coding agents are expensive, and everyone is trying to cut the bill without losing capability.

The idea here-pruning experts with something like REAP while preserving accuracy-fits the moment. Sparse Mixture-of-Experts gives you a huge parameter count with only a subset active per token. But deployment still gets gnarly when you crank context length and want stable throughput. Memory becomes the constraint, not just FLOPs. If you're building a coding agent that slurps repos, issues, logs, and docs into context, that memory pressure shows up fast.

So pruning ~30% of experts while keeping performance is basically saying: we can make these giant agent-friendly models more shippable. This threatens anyone betting that "only the biggest, densest models can code well." It also benefits teams trying to run serious agents without defaulting to the most expensive hosted APIs.

My opinion: the winning stack for many companies will be "pretty strong model + ruthless systems optimization + good tools." Not "the single smartest model." This kind of release is a brick in that wall.

PAN world model: video futures as a product primitive

MBZUAI's PAN, a general world model that predicts future world states as video conditioned on actions, is the most "peek at the future" item in the set. The pitch is long-horizon, interactive simulation with high fidelity. If it works as advertised, it's a step toward agents that can rehearse.

World models are having a moment because they promise something LLMs struggle with: consistent dynamics over time. If an agent can simulate "what happens if I do X," you get a new kind of planning loop. And video as the predicted medium is a clue: the interface for many agents won't be text. It'll be embodied, spatial, and time-based, even if the first use cases are still virtual (games, robotics sim, digital twins, UI automation).

The so-what for entrepreneurs is pretty spicy: once you can cheaply generate plausible futures, you can build planning products that feel like magic. The so-what for developers is more sobering: evaluating these systems is hard. "Looks right" is not the same as "is right." And when you connect a world model to real actions-robots, vehicles, or even just high-stakes automation-you'll need tight calibration and robust uncertainty handling.

Still, this is the direction. Agents that can't simulate will be at a disadvantage against agents that can.

Quick hits

OpenAI's GPT-5.1 prompting guide is a reminder that "prompting" is evolving into an engineering discipline. The best practices are less about clever phrasing and more about structuring tasks, controlling tool use, and reducing ambiguity-basically, writing specs the model can execute.

The broader industry roundup-Google/DeepMind upgrades, OpenAI updates, Anthropic's agent push, and even timeline noise like model delays-keeps pointing to the same reality: the big labs are optimizing for agent ecosystems now. Models are table stakes. Distribution happens through tools, workflows, and developer ergonomics.

Closing thought: I keep seeing the same shape across all these stories. We're turning language models into operators. Operators need protocols, packaging, benchmarking, and security. The teams that win won't just have a smart model. They'll have a disciplined way to compose behavior-and a paranoid way to defend it.