AI News · Dec 28, 2025 · 6 min read

Agents Are Growing Hands and Long-Term Memory - and the Data Behind Them Is Getting Serious

This week's AI news: better GUI grounding, smarter agent memory, a new agent framework, and foundation-model moves in tabular and recommender systems.


The most interesting AI progress right now isn't a new chatbot personality. It's the boring stuff. Clicks. Logs. Tool plumbing. Datasets that look like they came from real businesses instead of Kaggle nostalgia.

This week's batch made that trend hard to ignore. Agents are getting better at operating computers (literally clicking the right pixel). People are finally getting honest about memory systems (and how they fail). And we're seeing "foundation model" thinking spread into tabular ML and recommenders, where the money actually is.

If you're building product, this matters more than another benchmark win on a math test.


Agents that can actually use a computer: Gelato-30B-A3B

What caught my attention first was Gelato-30B-A3B, a model focused on GUI grounding for computer-use tasks. Translation: it helps an agent figure out where to click on a screen. Not "describe what you see," but "put the cursor here, on this exact UI element, and do it reliably."

This is the part of agent demos that usually falls apart. Planning is one thing. Acting is another. You can have a gorgeous GPT-5-style planner that writes an elegant multi-step plan, and then the whole run faceplants because the agent clicks one button to the left and ends up in settings hell.

Gelato's claim is simple and practical: train specifically for click localization using a dedicated dataset (Click 100k), and you beat prior grounding models. The more important claim is downstream: better grounding increases end-to-end agent success when paired with strong planners.
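
To make that concrete, here's roughly what a click-grounding supervision example looks like. This is a hypothetical record layout, not the actual Click 100k schema, but it captures the shape of the signal: a screenshot, an instruction, and the exact element you're supposed to hit.

```python
# Hypothetical shape of a click-grounding example - not the real Click 100k
# schema, just an illustration of the supervision and the metric that matters.
from dataclasses import dataclass

@dataclass
class ClickExample:
    screenshot_path: str   # rendered UI screenshot
    instruction: str       # e.g. "open the billing settings"
    target_x: float        # normalized click coordinates in [0, 1]
    target_y: float
    target_bbox: tuple     # (x0, y0, x1, y1) of the intended element

def click_hit(pred_x: float, pred_y: float, ex: ClickExample) -> bool:
    """A prediction counts as correct only if it lands inside the target element."""
    x0, y0, x1, y1 = ex.target_bbox
    return x0 <= pred_x <= x1 and y0 <= pred_y <= y1
```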

Here's my take. We're watching the agent stack split into specialists, and that's good. The "one giant model does everything" idea is convenient for papers, but product wants reliability. I'd rather have a planner that's great at reasoning, a grounding model that's great at pointing, and a set of tools that are deterministic where they can be.

If you're shipping agentic workflows, this pushes you toward an architecture where GUI interaction is treated like robotics. Perception and control loops. Narrow models trained on the right supervision. And evaluation that looks like "task completion rate," not "model seems confident."
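
In code, that architecture is a small loop. Here's a sketch with made-up interfaces (planner, grounder, executor are hypothetical, not anyone's actual API), but the shape is the point: perceive, plan, point, act, and score the run on whether the task finished.

```python
# Minimal sketch of a planner / grounder / executor split for GUI agents.
# All three interfaces are hypothetical placeholders, not a real library.

def run_task(task: str, planner, grounder, executor, max_steps: int = 20) -> bool:
    for _ in range(max_steps):
        screenshot = executor.capture_screen()            # perception
        step = planner.next_step(task, screenshot)        # e.g. "click the Save button"
        if step.done:
            return True
        x, y = grounder.locate(screenshot, step.target)   # narrow model: pixel coordinates
        executor.click(x, y)                              # control
    return False  # evaluate on task completion rate, not on model confidence
```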

The catch: GUI grounding is brittle across themes, screen resolutions, accessibility modes, and app updates. The long-term winners will be the ones who treat UI change as a first-class problem, maybe with continual data collection and automated regression tests. But I like the direction. It's less magical. More engineering. More shippable.


Agent memory is still messy, but we're getting patterns

The other item that felt unusually grounded was the comparison of memory systems for LLM agents: vector memory, graph memory, and event log memory.

I'm glad people are talking about this as "patterns" instead of "just add RAG." Because memory is not one thing. It's at least three things, and mixing them badly is how you get agents that hallucinate commitments, forget decisions, or loop forever.

Vector stores are great when you want fuzzy recall. "Have we seen something like this before?" They're also the easiest to bolt on, which is why everyone does it. But vector recall can be too eager. It surfaces semantically similar junk and the agent treats it as truth. Developers then respond by tightening thresholds, which turns "memory" into "occasionally it remembers a random fact."
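
A toy sketch of that tradeoff in practice (embed stands in for whatever embedding model you use):

```python
# Toy sketch of vector-memory recall gated by a similarity threshold.
import numpy as np

def recall(query: str, memory: list[tuple[str, np.ndarray]], embed, threshold: float = 0.8):
    q = embed(query)
    scored = [(text, float(q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec))))
              for text, vec in memory]
    # Too low a threshold surfaces semantically similar junk; too high and the
    # agent only "occasionally remembers a random fact".
    return [text for text, score in sorted(scored, key=lambda s: -s[1]) if score >= threshold]
```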

Graph memory is the opposite vibe. It's structured and explicit. Relationships matter. You can represent "customer X owns project Y, which is blocked by issue Z." That's powerful for multi-step planning and multi-agent coordination. The failure mode is obvious too: if the extraction into a graph is wrong, you've just created a confident, queryable lie. And graph construction is work. Real work.
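
Both the power and the failure mode fall out of the representation itself. A minimal sketch, assuming the extraction into triples has already happened:

```python
# Sketch of graph memory as explicit triples - powerful when the extraction is
# right, a confident queryable lie when it is wrong.
graph = [
    ("customer_x", "owns", "project_y"),
    ("project_y", "blocked_by", "issue_z"),
]

def neighbors(node: str, relation: str) -> list[str]:
    return [obj for subj, rel, obj in graph if subj == node and rel == relation]

project = neighbors("customer_x", "owns")[0]      # "project_y"
blockers = neighbors(project, "blocked_by")       # ["issue_z"]
```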

Event logs are my personal favorite for agent systems that need to behave like accountable software. Store what happened. Who said what. What tool was called. What output came back. Not an "interpretation," but a trace. This plays nicely with debugging, audits, and "why did you do that?" questions. The downside is it's not immediately useful for reasoning unless you build summaries and indexes on top.
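
A minimal sketch of the idea: an append-only log of raw records, nothing clever.

```python
# Sketch of an append-only event log: store the trace, not an interpretation.
import json
import time

def log_event(path: str, actor: str, kind: str, payload: dict) -> None:
    record = {
        "ts": time.time(),   # when it happened
        "actor": actor,      # which agent or tool
        "kind": kind,        # "message", "tool_call", "tool_result", "decision"
        "payload": payload,  # the raw content, verbatim
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# "What did the agent know at this moment?" becomes a filter over timestamps,
# but reasoning over the log still needs summaries and indexes built on top.
```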

The meta-point is the same as the Gelato story: agent stacks are becoming layered systems. Memory is not a single database. It's a set of representations with different tradeoffs, and you should pick based on the failure mode you can tolerate.

If you're a PM or founder, the "so what" is operational. If you can't answer "what did the agent know at the moment it made this decision?" you don't have a product. You have a demo. Event logs get you closer to that answer. Vector and graph layers can make it smarter, but logs make it governable.


Kosong (Moonshot AI): the agent framework wars continue, quietly

Moonshot AI releasing Kosong, the abstraction layer behind Kimi CLI, is another sign of something I've been expecting: the agent framework space is stabilizing into a few common primitives.

Kosong is positioned as a Python layer for unified messaging and tool orchestration. That sounds mundane until you've built one of these systems yourself. Then you realize messaging formats, tool schemas, retries, tool result normalization, and multi-model routing are where time goes to die.

Here's what I noticed. The market is shifting from "frameworks as ideology" to "frameworks as plumbing." Early frameworks tried to sell you a worldview. Now the winners will sell you fewer footguns.

Unified messaging matters because multi-agent setups are mostly a serialization problem. Tool orchestration matters because the real world is asynchronous and flaky. And "powers a CLI" matters because CLIs are unforgiving: you can't hide behind a pretty UI. If the tool call fails, users see it.

If you're deciding whether to adopt yet another abstraction layer, I'd ask one question: does it make failures legible? Not just "it supports tools," but "it produces traces, enforces schemas, and makes retries deterministic." If Kosong leans into that, it could be genuinely useful - especially for teams that don't want to marry a single model provider.
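
To be clear about what "legible" means to me, here's a sketch. This is not Kosong's API, just shorthand for the three properties I'd look for: schemas enforced before the call, deterministic retries, and a trace entry for every attempt.

```python
# Not Kosong's API - a generic sketch of legible tool orchestration:
# schema check up front, fixed retry schedule, trace entry per attempt.
import time

def call_tool(tool, args: dict, schema_keys: set, trace: list, retries: int = 3):
    missing = schema_keys - args.keys()
    if missing:
        raise ValueError(f"schema violation, missing: {missing}")  # fail loudly, before the call
    for attempt in range(1, retries + 1):
        try:
            result = tool(**args)
            trace.append({"tool": tool.__name__, "args": args, "attempt": attempt, "ok": True})
            return result
        except Exception as exc:
            trace.append({"tool": tool.__name__, "args": args, "attempt": attempt,
                          "ok": False, "error": repr(exc)})
            time.sleep(2 ** attempt)  # fixed backoff, no jitter: reruns stay comparable
    raise RuntimeError(f"{tool.__name__} failed after {retries} attempts; see trace")
```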

The competitive angle is also clear. Agent frameworks are becoming a wedge for ecosystems. If your framework becomes the default way developers wire tools, you influence which models and services get called. It's subtle distribution, not flashy model launches.


TabPFN-2.5: foundation model energy arrives in tabular ML (for real this time)

Tabular data is where businesses actually live. Payments, churn, risk, pricing, fraud, ops. And yet tabular ML has felt stuck in an eternal loop of gradient-boosted trees plus feature engineering.

TabPFN-2.5 is interesting because it keeps pushing "training-free" in-context learning for tabular tasks, now scaling to around 50k samples and 2k features while competing with tuned ensembles.

I'm not saying XGBoost is dead. It's not. But the promise here is huge for teams that don't have time or appetite for tuning pipelines. If you can drop in a foundation-style model and get near-ensemble performance with minimal fuss, you change the economics of shipping ML features.
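
The workflow I'm picturing looks something like this, assuming the tabpfn package keeps its scikit-learn-style fit/predict interface:

```python
# Sketch of the drop-in workflow, assuming the `tabpfn` package exposes a
# scikit-learn-compatible classifier (fit/predict, no tuning loop).
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = TabPFNClassifier()        # no hyperparameter search, no feature-engineering pass
clf.fit(X_train, y_train)       # "fit" is cheap; the heavy lifting happened in pretraining
print(accuracy_score(y_test, clf.predict(X_test)))
```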

The deeper significance is that "foundation model" no longer means "text model that we fine-tune for everything." It's becoming a broader design pattern: pretrain a model to be a strong general learner in a domain, then adapt with context instead of retraining.

For developers, the obvious win is iteration speed. Fewer training jobs. Faster baselines. Easier A/B tests because you can ship a decent model early and spend your time on data quality and product integration.

The catch is governance and predictability. Tabular use cases are often regulated or high-stakes. If your model behaves like an opaque in-context learner, you'll still need guardrails, monitoring, and maybe simpler fallback models for certain segments. But I love seeing this direction because it targets the ML problems that make or save companies - not just the ones that make cool demos.


Yandex's recommender push: ARGUS + Yambda is the "serious stack" signal

Recommenders are another area where AI isn't optional. They're the product. And Yandex showing up with both a large-scale transformer recommender framework (ARGUS, scaling to a billion parameters) and a massive event dataset (Yambda, 5B music-streaming events) feels like a coordinated message: modern recommenders are foundation-model territory now.

ARGUS is about training huge sequence models over long user histories. That matters because most real preference is temporal and contextual. What I listened to last week is more predictive than what I listened to three years ago. Transformers are a natural fit - if you can afford them and if you can train them reliably.
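
Strip away the scale and the training objective is familiar. A sketch (not ARGUS itself) of turning one user's event history into a next-item prediction problem:

```python
# Not ARGUS - just the basic "long user history in, next item out" shape
# that transformer recommenders train on.
def build_sequence(events: list[dict], max_len: int = 2048) -> tuple[list[int], int]:
    """events: [{"item_id": int, "ts": float, ...}, ...] for a single user."""
    ordered = sorted(events, key=lambda e: e["ts"])         # temporal order matters
    item_ids = [e["item_id"] for e in ordered][-max_len:]   # keep the most recent window
    return item_ids[:-1], item_ids[-1]                      # context sequence, next-item target
```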

Yambda is the other half: credible data and evaluation. A 5B-event dataset with temporal-aware evaluation is not just "big." It's closer to the messy, drifting reality production teams deal with. And that's what recommender research has been missing: realistic offline benchmarks that don't accidentally reward leakage or oversimplified splits.
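
The unglamorous part is the split. A sketch of the kind of temporal evaluation that keeps you honest: train strictly on the past, test strictly on the future.

```python
# Sketch of a temporal split: no test event predates the cutoff, so the
# benchmark can't reward leakage from events that hadn't happened yet.
def temporal_split(events: list[dict], cutoff_ts: float) -> tuple[list[dict], list[dict]]:
    train = [e for e in events if e["ts"] < cutoff_ts]
    test = [e for e in events if e["ts"] >= cutoff_ts]
    return train, test
```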

If you're building in e-commerce, media, marketplaces, or ads, the "so what" is twofold. First, expect the bar to rise. Simple models will still work, but the upside of moving to long-sequence architectures is getting harder to ignore. Second, data advantage is still the advantage. Frameworks help, but the teams that win are the ones who can collect, clean, and evaluate at scale without fooling themselves.

This also connects back to agents. Personalization and agents are converging. The more agents act on behalf of users, the more they need recommender-like models to predict intent and choose actions. Long sequences aren't just "songs you played." They're "tools you used," "tasks you completed," and "things you refused." The future agent stack will look suspiciously like a recommender stack with tools attached.


Quick hits

Yandex also released Alchemist, a compact supervised fine-tuning dataset for text-to-image. Small, curated datasets with replicable curation methods are underrated; brute-force scraping isn't the only path to better alignment and aesthetics, and tight datasets are easier to audit and iterate on.


Closing thought

The theme I can't shake is this: AI is getting less like a single model and more like a system you can debug. Grounding models that "click correctly." Memory designs that admit failure modes. Frameworks that standardize tool orchestration. Domain foundation models for tabular and recommendation with real datasets behind them.

That's not as headline-friendly as "AGI is near." But it's the stuff that turns AI from spectacle into infrastructure. And if you're building a company, infrastructure is where you want the story to go.
