AI News · Dec 29, 2025 · 6 min

Skills are the new plugins: IBM's open agent, Hugging Face's "do stuff" tutorials, and Claude's enterprise push

Agents are getting less mystical and more modular: skills, hooks, and domain workflows are turning LLMs into real products.


The AI agent conversation is finally getting less cringe.

For most of 2024 and early 2025, "agents" meant demos that looked magical until you tried them on a real codebase or a real business process. This week's batch of updates shifts the vibe. IBM put out an open, configurable agent that's designed to be assembled, not worshipped. Hugging Face is leaning hard into "skills" as the unit of work. And Anthropic is basically saying the quiet part out loud: the next wave of LLM value isn't the chat window. It's the integration surface: hooks, skills, org controls, and repeatable workflows that survive contact with enterprise reality.

Here's what I noticed across all of it: everyone is converging on the same product idea. Don't just make models smarter. Make them operable.


The big theme: agents are becoming packaging problems

If you build software for a living, you already know the punchline. Intelligence isn't the bottleneck. Reliability is. Repeatability is. Permissioning is. "How do I make this thing do the same correct action tomorrow?" is the whole game.

That's why the most interesting news here isn't a benchmark jump. It's the boring-sounding stuff: configurable agents, skills registries, MCP servers, CLAUDE.md conventions, and workflow case studies. It's all scaffolding. And scaffolding is what turns a model into a product.


IBM's CUGA: open-source agents that look like software, not a science project

IBM Research released CUGA, a configurable open-source agent that plugs into Hugging Face Spaces and Langflow. On paper, that sounds like "yet another agent framework." In practice, I think it's IBM making a very specific bet: the agent layer will be standardized by composition, not by one vendor's monolithic runtime.

CUGA's core idea, configuration over invention, matters. Most teams don't want a bespoke agent architecture. They want something they can understand, tweak, and deploy. They want knobs. They want guardrails. They want a thing that behaves like a system, not a prompt that behaves like a mood.
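To make "configuration over invention" concrete, here's a minimal sketch of what that kind of agent config can look like. The field names are mine, not CUGA's actual schema; the point is that behavior lives in inspectable settings rather than in a prompt.

```python
from dataclasses import dataclass, field

# Hypothetical agent configuration: illustrates "knobs and guardrails" in general,
# not CUGA's actual schema.
@dataclass
class AgentConfig:
    model: str = "any-llm-you-like"           # swappable model backend
    allowed_tools: list[str] = field(default_factory=lambda: ["search", "read_file"])
    max_steps: int = 10                        # hard stop on runaway loops
    require_approval_for: list[str] = field(default_factory=lambda: ["write_file", "deploy"])

def is_tool_allowed(cfg: AgentConfig, tool: str) -> bool:
    """Guardrail check the runtime applies before every tool call."""
    return tool in cfg.allowed_tools

cfg = AgentConfig()
print(is_tool_allowed(cfg, "deploy"))  # False: not in the allowlist
```

The appeal is that this file can be diffed, reviewed, and rolled back, which is exactly the "behaves like a system" property teams are asking for.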

The Hugging Face Spaces integration is also telling. HF has become the public square for model experimentation, and increasingly, the place where "agent UX" gets normalized. If CUGA becomes a default template people remix inside Spaces, IBM effectively ships distribution without owning the platform. That's a very modern enterprise move.

Who benefits? Developers who want a starting point that's inspectable and swappable. Product teams who want to demo fast but still keep an upgrade path to "real." Anyone who has been burned by an agent that only works in the notebook.

Who's threatened? Closed agent runtimes that rely on mystery. If open tooling gets good enough, the premium shifts from "we have agents" to "we have the best enterprise controls, integrations, and economics." Which is a tougher, more competitive business.

The "so what" for builders: CUGA is another sign that the agent layer is becoming a commodity interface. If you're building a product, differentiation probably won't come from "we have an agent." It'll come from what your agent can safely do, in your domain, with your data, under your constraints.


Hugging Face Skills tutorials: the job is the product, not the model

Hugging Face also published tutorials showing Claude and OpenAI Codex doing practical work via Hugging Face Skills: fine-tuning an open model, running experiments, and generally behaving like an operator instead of a chatbot.

This matters for one reason: it reframes what "model choice" even means.

The default debate is still "Which frontier model is best?" Meanwhile, these tutorials are quietly saying: pick the model you like, then wire it to a repeatable tool interface so it can actually accomplish tasks. Skills are basically an API contract for "do the thing." Once you have that contract, models become interchangeable labor.
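Here's a rough sketch of what that kind of contract can look like in code. The types and names are hypothetical, not Hugging Face Skills' actual interface; the point is that once inputs and outputs are explicit, whatever model sits behind `run()` is swappable.

```python
from typing import Protocol, TypedDict

# Hypothetical shape of a "skill" contract, used here only to illustrate why
# explicit inputs/outputs make models interchangeable labor.
class FineTuneInput(TypedDict):
    base_model: str
    dataset: str
    epochs: int

class FineTuneOutput(TypedDict):
    checkpoint_path: str
    eval_loss: float

class FineTuneSkill(Protocol):
    def run(self, params: FineTuneInput) -> FineTuneOutput: ...

def execute(skill: FineTuneSkill, params: FineTuneInput) -> FineTuneOutput:
    # The orchestrator depends only on the contract, never on which model
    # (Claude, Codex, anything else) is driving the skill underneath.
    return skill.run(params)
```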

And that's the part that caught my attention. If you're OpenAI, Anthropic, Google, or anyone selling premium tokens, the long-term risk is that you get abstracted. If Claude can drive HF Skills and Codex can drive HF Skills, then the buyer's attachment shifts to the workflow layer, not the model brand. The model still matters for latency, cost, and quality, but the lock-in moves up the stack.

For developers, this is pretty neat because it's a path out of prompt spaghetti. Skills-based design pushes you toward explicit inputs/outputs, structured tool calls, and reproducible actions. It's also a gentle push toward evals, because once you treat "fine-tune this model" as a skill, you can test it like software.
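And once a skill is just a function with a contract, you can test it like one. A toy eval, with a fake skill standing in for whatever model actually drives it (all names here are made up):

```python
# Treat "fine-tune this model" like any other function under test.
class FakeFineTuner:
    def run(self, params: dict) -> dict:
        return {"checkpoint_path": f"/tmp/{params['base_model']}-ft", "eval_loss": 0.42}

def test_finetune_skill_produces_checkpoint_and_loss():
    result = FakeFineTuner().run({"base_model": "my-open-model", "dataset": "my-data", "epochs": 1})
    assert result["checkpoint_path"].endswith("-ft")
    assert 0.0 <= result["eval_loss"] < 1.0  # sanity bound, not a quality bar

test_finetune_skill_produces_checkpoint_and_loss()
print("skill eval passed")
```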

For entrepreneurs, it's a hint about where new startups can wedge in: build the best skill packs for specific industries, or build the orchestration that makes skill execution auditable and safe. The model vendor doesn't have to be the platform.


Anthropic's Skills + Claude Code: they're productizing the "agentic dev" workflow

Anthropic dropped a whole cluster of posts around Skills, MCP/skills guidance, org management, a partner directory, and a bunch of Claude Code customization docs (hooks, CLAUDE.md, and even a Slack delegation beta).

My take: Anthropic is trying to win by being the most "operational" frontier model provider. Not just "Claude is smart," but "Claude fits into how teams already ship software."

The customization pieces are the real story. Hooks and CLAUDE.md are basically an admission that generic coding agents hit a wall on real repositories. Every codebase has local conventions, brittle scripts, private tooling, weird CI, and tribal knowledge. You don't fix that with a better model alone. You fix it with a mechanism to inject procedural knowledge and enforce workflow constraints.

CLAUDE.md is especially interesting because it creates a lightweight, repo-native control plane. Put the rules next to the code. Version them. Review them. Treat "how the agent should behave here" as part of the repository's source of truth. That's not sexy, but it's how you actually scale agent usage beyond one power user.

The Slack delegation beta is another signal: the agent isn't just in the IDE anymore. It's becoming an organizational actor. That's exciting, and also where things can go sideways fast. Once you let an agent operate in comms channels, ticket systems, and deploy pipelines, the question becomes less "can it code?" and more "can it be governed?"

That's why the org management and partner directory stuff matters. Anthropic is building an ecosystem where Skills are distributable artifacts. That's the same move app stores made, and the same move cloud providers made with marketplaces. If they pull it off, "Skills" become the unit of value exchange, not prompts.

The "so what" for teams: if you're evaluating coding agents, stop only testing "does it solve this LeetCode-ish bug." Start testing "can we constrain it, customize it, and integrate it without creating a security and compliance nightmare."


Claude in electrical engineering and legal: domain wins come from collaboration plus structure

Anthropic also shared two very concrete workflow stories: improving Claude's electrical engineering performance through domain collaboration and a PCB DSL called Zener, and using Claude internally in legal review workflows to cut turnaround from days to hours.

These are worth paying attention to because they show the playbook for "AI that actually works in regulated, high-stakes contexts."

In electrical engineering, the interesting move isn't "Claude learned circuits." It's that they introduced structure. A DSL for PCB design is a way of forcing the model to operate in a constrained language where outputs are checkable. Domain collaboration is a way of turning tacit expertise into explicit guidance and test cases. This is how you avoid the trap where the model sounds right but is subtly wrong.
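To see why a constrained language helps, here's a toy net-list checker. This is not Zener's actual syntax (the post doesn't show it); it's just an illustration of how DSL output becomes mechanically checkable instead of merely plausible.

```python
# Toy net-list "DSL" check: a hypothetical mini-language, not Zener.
# The value of a DSL is that statements like these can be validated automatically.
def check_netlist(lines: list[str]) -> list[str]:
    errors = []
    declared = set()
    for n, line in enumerate(lines, 1):
        parts = line.split()
        if parts[0] == "part":
            declared.add(parts[1])                  # e.g. "part U1 LM317"
        elif parts[0] == "net":
            for pin in parts[2:]:                   # e.g. "net VIN U1.3 C1.1"
                ref = pin.split(".")[0]
                if ref not in declared:
                    errors.append(f"line {n}: {ref} used before declaration")
        else:
            errors.append(f"line {n}: unknown statement '{parts[0]}'")
    return errors

design = ["part U1 LM317", "part C1 CAP_10uF", "net VIN U1.3 C1.1", "net VOUT U2.2"]
print(check_netlist(design))  # flags U2: the model referenced a part it never declared
```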

In legal, the story isn't "lawyers got replaced." It's that the workflow got re-chunked. Review is a pipeline: summarize, spot issues, compare clauses, propose revisions, track changes, escalate edge cases. Models are good at that when the task is framed as steps with clear artifacts. Cutting time from days to hours is plausible when you stop treating the model as a single magical reviewer and start treating it as a drafting and triage engine.
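Here's a sketch of what that re-chunking can look like, with a placeholder standing in for the actual model call. Every stage emits an artifact a human can inspect, and edge cases get escalated rather than looped back through the model.

```python
# Review-as-pipeline sketch: each stage produces a named artifact.
# draft_with_llm is a stand-in, not any vendor's real API.
def draft_with_llm(instruction: str, text: str) -> str:
    return f"[model output for: {instruction}]"

def review_contract(contract_text: str) -> dict:
    artifacts = {}
    artifacts["summary"] = draft_with_llm("Summarize the key terms", contract_text)
    artifacts["issues"] = draft_with_llm("List clauses that deviate from our playbook", contract_text)
    artifacts["redlines"] = draft_with_llm("Propose revisions for the flagged clauses", artifacts["issues"])
    # Edge cases go to a human, not back into the loop.
    artifacts["needs_escalation"] = "indemnification" in contract_text.lower()
    return artifacts

print(review_contract("Sample agreement with indemnification and auto-renewal clauses."))
```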

The pattern across both: domain performance comes from constraints, tooling, and human-in-the-loop process design. Not vibes.

If you're building in verticals, this is your clue. The moat isn't the base model. The moat is the structured interface (DSLs, schemas, templates), the evaluation suite, and the workflow that makes the model's behavior legible.


Quick hits

Anthropic's "enterprises building agents in 2026" framing is basically them planting a flag: agent adoption will be measured in governance maturity, not pilot counts. That aligns with all the org controls and ecosystem moves they're shipping.

Claude's "thinking partner" updates-memory and voice-are nice, but I see them as supporting features, not the main event. Memory makes skills and workflows stickier. Voice makes the interface broader. Neither matters if the underlying tool integration story is weak.

The "50+ customizable Claude skills" chatter and the Noumena write-up on building marketing agents both reinforce the same point: once skills are easy to author and share, you get an explosion of narrow, useful capabilities. The risk is fragmentation and low-quality skills. The opportunity is curation, testing, and lifecycle management-basically "package management," but for agent behaviors.


Closing thought

What's happening right now looks a lot like the early days of cloud-native. Everyone's arguing about raw compute, but the real winners are quietly standardizing the interfaces, the deployment patterns, and the governance primitives.

Models will keep getting better. That's table stakes. The real competition is shifting to who can make AI legible inside real systems: configurable agents, skill contracts, repo-level rules, constrained domain languages, and workflows that produce auditable artifacts.

If you're building in this space, I'd stop asking "Which model should we use?" and start asking "What's our skill surface, and how do we control it?" That question is where products-and moats-are going to come from in 2026.
