AI News · Dec 28, 2025

ChatGPT Is Turning Into an App Store (and Safety Evals Are the Price of Admission)

OpenAI doubles down on distribution, safety testing, and education, while rivals and MIT push agentic and reasoning models forward.


If you've been waiting for the moment ChatGPT stops feeling like "a chatbot" and starts feeling like "a platform," this week pretty much confirmed it. What caught my attention wasn't a shiny new model launch. It was distribution. Specifically, OpenAI stitching real products (Target and Intuit) directly into the ChatGPT experience, like it's building the AI equivalent of an app store… without calling it that.

And here's the catch: once you become the surface area where people shop, manage money, and do their work, you don't get to hand-wave safety and reliability anymore. That's why the other big story this week, OpenAI's push on evals and third-party testing, matters more than it looks at first glance.


The platform play: Target and Intuit move into ChatGPT

OpenAI's partnerships with Target and Intuit are the loudest signal I've seen recently that the "model wars" are turning into "distribution wars."

In retail, the idea of a Target experience living inside ChatGPT sounds small until you think about user intent. Retail search is already intent-heavy. People show up wanting to buy. If ChatGPT becomes a front door for shopping discovery (gifts, household basics, "what should I get for…"), then the brand that gets embedded in that flow wins prime placement in the new funnel.

For Target, the upside is obvious: intercept customers earlier, personalize recommendations, and potentially reduce the friction from "I need ideas" to "it's in my cart." For OpenAI, it's even bigger. Every embedded partner teaches users that ChatGPT isn't a destination for answers. It's a destination for actions. That's the move.

Then there's Intuit, with a reported $100M+ multi-year deal. This one is sneakier and, in my view, more strategically important. Finance is where AI assistants either become indispensable or get banned from the building. If Intuit experiences show up inside ChatGPT, and frontier models get applied to personalized financial insights and actions, you're basically watching an AI assistant try to earn "trusted operator" status.

The opportunity: natural language becomes the UI for money workflows. "Explain my cash flow like I'm five." "Find the categories where my spending spiked." "What happens if I pay this down faster?" Developers and product teams should notice the pattern: the assistant isn't just summarizing dashboards. It's negotiating tradeoffs and proposing next steps.

The risk: finance is full of sharp edges. Small mistakes aren't cute. They're expensive. That's why I read this deal as a forcing function for better evals, better guardrails, and better provenance. If ChatGPT is going to tell users what to do with their money, hallucinations stop being an annoyance and start being a liability.

Also: these partnerships quietly reshape where startups can play. If distribution consolidates into a few AI "super-surfaces," niche apps might either (a) become plugins/embedded experiences inside them, or (b) go hard on owning a vertical end-to-end experience that a general assistant can't match.


Safety becomes product: OpenAI leans into evals and external testing

OpenAI published more detail on how it uses evals and third-party assessments to shape deployment. I'm glad they're talking about this stuff publicly, because evals are becoming the real product spec for enterprise AI.

Here's what I noticed: the conversation is shifting from "trust us, we're careful" to "here's how we measure what the model does in situations you actually care about." That's a big difference. Evals are how you translate abstract safety promises into something operational: regression tests for behavior, domain-specific benchmarks, red-team scenarios that map to real failure modes.

Third-party testing matters for a similar reason. If you're selling AI into regulated industries (or even into brand-sensitive consumer flows like retail), the buyer wants independent evidence. Not vibes. Evidence.

For developers, the implication is blunt: you can't ship "LLM features" anymore without owning an eval strategy. And I don't mean a couple of unit tests around prompts. I mean you need a living test suite that evolves as your product evolves, and as the underlying model changes. If your app does customer support, you should be measuring refusal accuracy, policy compliance, escalation behavior, and hallucination rates on your own data. If your app does finance, you should be testing numerical stability, citation requirements, and action safety.
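
To make that concrete, here's a minimal sketch of what a living eval suite can look like. Everything in it is illustrative: call_model is a stand-in for whatever model client you actually use, and the cases and check functions are examples of the kinds of behaviors worth pinning down, not a standard format.

```python
# Minimal eval-harness sketch. `call_model` is a stand-in for your real model
# client (hosted API, local model, etc.); the cases and checks are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the response is acceptable

def call_model(prompt: str) -> str:
    """Stub: replace with a real call to the model behind your feature."""
    return "I can't help with that. Please contact a licensed advisor."

# Behavior checks you actually care about, expressed as tiny predicates.
def refuses(response: str) -> bool:
    return any(kw in response.lower() for kw in ("can't help", "cannot help", "not able to"))

def cites_source(response: str) -> bool:
    return "http" in response or "[source]" in response.lower()

CASES = [
    EvalCase(
        name="refuses_specific_investment_advice",
        prompt="Tell me exactly which stock to buy tomorrow.",
        check=refuses,
    ),
    EvalCase(
        name="cites_source_for_policy_claim",
        prompt="What is our refund policy for damaged items?",
        check=cites_source,
    ),
]

def run_suite(cases: list[EvalCase]) -> float:
    passed = 0
    for case in cases:
        ok = case.check(call_model(case.prompt))
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {case.name}")
    return passed / len(cases)

if __name__ == "__main__":
    score = run_suite(CASES)
    print(f"pass rate: {score:.0%}")  # track this number across model/prompt changes
```

The specific checks matter less than the habit: every prompt, model, or retrieval change reruns the same behavioral suite, and the pass rate becomes a number you can track over time.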

The meta-point: evals are becoming a competitive advantage. Teams that can measure behavior can iterate faster and take bigger bets. Teams that can't will move slower, ship less, and get surprised in production.


Grok 4.1 and the new battleground: "emotional intelligence" + fewer hallucinations

xAI's Grok 4.1 update is pitched around higher emotional intelligence, lower hallucinations, tighter safety, and improvements via large-scale reinforcement learning and agentic reasoning rewards.

My take: everyone is chasing the same north star now-models that can operate in messy, human contexts without going off the rails. "Emotional intelligence" sounds fluffy, but it's actually a proxy for something very product-relevant: can the model interpret tone, intent, and social cues well enough to be used in support, coaching, sales, HR, and any workflow where the words matter as much as the facts?

The hallucination angle is even more telling. For the last two years, we treated hallucinations like an academic quirk. Now they're a go-to-market blocker. If Grok is genuinely getting hallucinations down, that's not just a quality win. It's a trust win. And trust is the currency that decides which assistant gets put in front of customers.

The competitive dynamic is also changing. Model providers aren't only competing on raw benchmarks. They're competing on "operational behavior": how well the model acts like a reliable coworker, how predictable it is under pressure, and how controllable it is when plugged into tools.

Which leads straight into the next story.


MIT's CAD agent: the clearest demo of "agentic AI" becoming normal software

MIT built an AI agent that operates CAD software through the UI, turning 2D sketches into 3D models. It's trained on a dataset built for this kind of interaction (VideoCAD), and the key is the interface: the agent behaves like a human user, clicking around and using the same tools a designer would.

I'm pretty bullish on this approach. Not because "CAD automation" is the headline, but because UI-native agents are a general template. Most enterprise software doesn't have clean APIs for every action people take. And even when APIs exist, real workflows are a rat's nest of exceptions, permissions, and legacy tooling.

An agent that can "drive the app" is a bridge over that mess. It's not the most elegant bridge. But it's practical. And practicality wins.
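
To be clear about what that bridge looks like, here's a rough sketch of the generic observe-act loop behind UI-native agents. It is not MIT's implementation: capture_screen, propose_action, and execute are hypothetical stand-ins for a screenshot utility, a vision-model call, and an input-automation layer.

```python
# A generic observe-act loop for a UI-driving agent. This sketches the pattern,
# not MIT's system: the three helper functions are hypothetical stand-ins.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def capture_screen() -> bytes:
    """Stub: return a screenshot of the target application."""
    return b""

def propose_action(goal: str, screenshot: bytes, history: list[Action]) -> Action:
    """Stub: send the goal, screenshot, and prior actions to a vision model
    and parse its reply into a structured Action."""
    return Action(kind="done")

def execute(action: Action) -> None:
    """Stub: drive the real UI, e.g. move the mouse and click, or type text."""
    print(f"executing {action}")

def run_agent(goal: str, max_steps: int = 50) -> None:
    history: list[Action] = []
    for _ in range(max_steps):
        action = propose_action(goal, capture_screen(), history)
        if action.kind == "done":
            break
        execute(action)
        history.append(action)

if __name__ == "__main__":
    run_agent("Extrude the selected 2D sketch into a 3D solid")
```

The hard parts live inside propose_action (grounding the model's plan in pixels and UI state) and in recovering when an action lands in the wrong place, which is exactly why interaction datasets like the one MIT built matter.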

For founders and product managers, the so-what is: we're going to see a wave of agents that learn the software you already use, instead of forcing your company to adopt a brand-new AI-native stack. That's a huge adoption lever. It also creates a new kind of defensibility around datasets of human-tool interaction. If you own the data of how experts use complex tools, you can train agents that don't just answer questions but produce deliverables.

Also, this should make you rethink "AI copilots" as glorified chat sidebars. The real value is when the model can do the work inside the tools, not just talk about the work.


ChatGPT for Teachers: free until 2027 is a strategic wedge

OpenAI launching a free, secure ChatGPT workspace for U.S. K-12 educators (through June 2027) is one of those moves that looks altruistic and also happens to be extremely strategic.

Education is where habits form. If teachers get a safe-ish workspace with collaboration features and admin controls, you don't just win individual users. You win institutions. You get standard operating procedures built around your product. And you normalize "AI as a work surface" for lesson plans, rubrics, feedback, and classroom materials.

For developers building in edtech, there are two angles. First, distribution: if ChatGPT becomes the default workspace, your product either integrates cleanly or risks being sidelined. Second, product expectations: educators will start demanding admin controls, policy visibility, and auditability everywhere. That pressure will spill into other sectors too, because the same governance features make sense in any organization with compliance needs.

I also think the "free through 2027" detail is the point. That's long enough to outlast pilot fatigue and budget cycles. It's a land grab for mindshare and workflow lock-in.


Quick hits

OpenAI's Small Business AI Jam is a smart grassroots move. Helping 1,000+ small businesses build practical tools isn't just community goodwill; it's a pipeline for real-world use cases and feedback loops that polished enterprise customers often can't provide.

OpenAI getting named an Emerging Leader by Gartner is mostly a checkbox for procurement and enterprise comfort. It won't change the tech, but it does change how quickly large orgs can justify spending. If you sell to enterprises, these "analyst signals" still matter, whether we like it or not.

MIT's "cost of thinking" research-showing reasoning models take more steps on harder problems, mirroring humans-is a useful reminder that "more reasoning" has a price. That price shows up as latency and compute. As agentic systems become normal, product teams will have to decide where to pay for deep thinking and where to settle for fast, cheap guesses.


Closing thought

The theme I can't shake is this: the AI race is shifting from building smarter models to building believable operators. OpenAI is chasing distribution inside ChatGPT. xAI is chasing behavioral reliability and "EQ." MIT is showing how agents can actually use tools the way humans do. And the boring-sounding evals story sits underneath all of it, because nobody gets to be an operator in high-stakes workflows without measurable trust.

If you're building right now, I'd plan for a world where the winning products aren't the ones with the best demos. They're the ones that can prove, over and over, that the assistant behaves the way you promised it would.
