The great AI debate: are skills the new agents?
Alright, I think we're live. Cool. We are live and back with another episode of the Breakeven Brothers podcast. This is probably the last episode for the year 2025, I think, Brad. I think we're planning to take a little bit of a hiatus for the holidays after this one and then probably won't release another one until 2026. So first of all, what a year it's been. It's pretty exciting.
Yeah, maybe in 2026 we need a new theme song. I know we have one right now, and when Ben created it, I was like, "That is a jingle jangle." For 2026, we need a little bit of a change-up, a kind of refresh on the Breakeven Brothers assets, so to speak.
That would be fun. Yeah, I like it. And it's funny because I made that with AI—no shocker, if you've heard it. It's from Suno AI, but I have a subscription I pay for; I think it's like ten bucks a month. But in order to use it for commercial applications, you have to keep your license active. Otherwise, I would just download the song and then cancel my subscription. So they got a nice little fishhook in me there.
Wait, so you're paying for it right now?
Paying for it right now, yeah. My kids like to make stupid songs about the dogs. It's actually a cool site. I don't know if we've ever talked about it on this podcast, but you can make different kinds of songs and stuff like that.
So it's kind of an honor system?
I think so. I don't think they have any way to test it. I mean, we could try and find out and see if I could cancel my subscription and keep the song on the pod.
There's a JavaScript charting library called Highcharts that is also a paid JavaScript package. And I used it when I was first building out Split My Expenses, and I thought, "This is so nice," but it's also that honor system of being free to use non-commercially. But if you use it commercially, you have to pay. And I thought, "How would they ever find me?" I used it, and then I thought, "I want my site to actually grow," so I'm taking that off. So then I went to Chart.js.
It kind of sounds similar; if you're using it in a commercial context, you have to pay money, which I assume a lot of people do, honestly.
Yeah, they settled some huge lawsuit too with, I think, Warner Bros. Music or something like that. I can't remember what it was. It was because you could basically make a copy of "Billie Jean," and it was too close to the real thing. So anywho, go for it.
This has been a big year. And I think when 2025 started, it was the, quote-unquote, "year of the agents." And honestly, I feel like that has pretty much been true. I don't know what 2026 will hold, but one thing that I've seen recently from Anthropic and other leading AI labs is the introduction of skills. I think Claude came out with skills a few months ago, but to be completely honest, it went under my radar. It wasn't super exciting when it came out. I didn't see a ton of people pick it up, and I think there was a little bit of rough support in the beginning, kind of akin to MCP servers. When MCP came out, it definitely had a lot of hype, but once that initial rough few-month period passed and things got polished, it was really good.
So, I think skills is a topic I want to talk to you about because I'm excited about it, but I'm also confused about it. I've used it a little bit, but for people who aren't aware, we have the MCP layer, which is tools you can call from an MCP server. These tools can be invoked on a remote server, a local server, etc. But skills are just Markdown files that describe a capability. So when you're writing a prompt, you're writing a ton of context. You can think of skills as individual text files that describe to an agent how to perform a task. Oftentimes in tutorial videos, they ask whether you'd rather have a really smart mathematician doing your taxes or a CPA doing your taxes. They bring that up because AI models are really smart now, but they don't know the context of that domain. So skills bring that domain knowledge to the agent. And the big lever that Anthropic has is creating their skills marketplace or plugin marketplace to allow these skills to be shared and reused within companies or publicly. So it creates this new agentic harness that's kind of similar to MCP, but kind of not. So we're in a new world now. I don't know if this will take over or fizzle out, but I am seeing a lot of attention in this area.
Yeah, it sounds a lot like the tool-calling agent paradigm, where you have a certain way that you want things done and it's not just free-form. There's a certain domain or context that the agent needs to accomplish that task. So you're basically preloading it with that context or giving it instructions on what you need. It seems to me, and again, I haven't used them extensively, but I have heard of people using them and they seem to have gotten popular within the Claude community. But to me, it seems like a tool-calling agent for the most part, which I've talked a lot about with LangGraph and LangChain. But it's cool that it's already built into the Claude ecosystem, and like you said, you have MCP capabilities there with that too. So, pretty cool. And then I think where you're going with this is that OpenAI also just recently released some skills.
Yeah, I think they put it inside their app almost secretly. I saw a tweet talking about if you ask ChatGPT what skills it knows, they had shipped a folder within the chat agent that said it can do these five or ten things. I think one of the Anthropic ones they talk about frequently is editing what they call "workspace files," so things like PowerPoint, Excel, and Word. One of the main differences between skills and MCP is that MCP can bloat the context window. For each tool an MCP server has, there can be a giant description of what that tool does. You want to add a bunch of MCPs, but then your context window of 200,000 tokens in the latest models fills up fast. I think the difference in skills is that you have this giant text file that describes how to do something, whether it's taxes, coding, or working with systems at your company. But the only exposure to Claude or any other AI agent that might use the skill is more or less a one- or two-line description. So just like MCP servers and tools, the agent decides when it needs a skill or a tool. If it needs a skill, it will look at the entire text file, but if it doesn't, it will ignore it. So I think the context window is a bit more preserved using a skill. But if it does invoke that skill, it will then read that entire text file. And oftentimes, what people are doing in setting up this skill is describing a process with raw text and including tools. For example, run this command to run this tool or go to this URL to get this data. It's not only a description of the process but also tools that the agent can use to pull it off. That's kind of why I see it as similar to MCP servers and tools. You could almost take an MCP server, break it down into a text file, and describe each tool as a few steps in the process. So it's very interesting to me, but Anthropic released an interesting video saying, "Don't build agents, build skills," and I'll talk more about it later. There is a paradigm shift. 
I'm unsure where it'll finally land. Their hypothesis is that coding agents are everything because they have good logic and good runtime environments to write a Python script, run it, and get output. With that comes the question of how we get this domain expertise into that coding environment or general agent. So I think skills are exciting and popular. The fact that OpenAI and Anthropic both support them means it's on a good trajectory. I'm curious where it'll land in 2026, though.
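To make that concrete, here's a rough sketch of what a skill file looks like. Anthropic's published format is a `SKILL.md` with YAML frontmatter, where only the `name` and `description` are loaded into context up front and the body is read only when the skill is invoked; the specific skill shown here is made up for illustration, and the exact fields may differ from the current spec.

```markdown
---
name: quarterly-close
description: Prepare the month-end close checklist and reconciliation summary from a trial balance export.
---

# Quarterly Close

1. Ask the user for the trial balance CSV export.
2. Reconcile each balance sheet account against the prior period.
3. Flag any variance over 5% for review.
4. Run `python scripts/build_summary.py <csv>` to produce the summary workbook.
```

This is the "progressive disclosure" idea from the episode: the two frontmatter lines are all the agent sees until it decides the skill is relevant, at which point it reads the numbered steps and any referenced scripts.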
Yeah, and I've seen on the accounting and finance side, people on Twitter will say that Claude with skills can now create a full three-statement model, with cash flow, P&L, and balance sheet, and a full financial analysis of a company with just the push of a button. And that's because now Claude can build Excel files and PowerPoints from the Microsoft suite, which accountants and finance people are heavily saturated with. So it looks really cool. People can do things like ask for a discounted cash flow analysis of a loan, and it'll pull it all together and make a presentation. I think there's a ton of utility there. And right now people are using it as a lead magnet: reply to their post with "financial model" and they'll send you the skill file. And of course, there's a lot of snake oil out there when anything new is exciting and has a lot of hype around it. But I think there is a lot of use to those. And I do wonder if at some point in the future, they'll start selling skills, maybe in a skills marketplace. So this could be the AICPA audit skill or whatever, and someone can buy it if they want to from the person who made it.
Yeah, that's a good point. It brings up the fact that skills are easier to share and keep up to date than MCP servers, because MCP servers take so much effort: you have to install the required software dependencies on your computer. And they've done a lot of work on one-click install to get an MCP server up and running in Cursor, but it's still a bit of a pain. So I think skills is a more user-friendly package that gets almost the same result as an MCP server at the end of the day, but way friendlier. Anyone can write a text file. It's also quite similar to when ChatGPT had what they called GPTs, where it's kind of like a profile for doing something. And those are easy to share as well. So it kind of meets in the middle of a GPT and an MCP server, but it's easier to share and use, and it can be public or private, totally up to you. It's a generic surface that's kept up to date, which I think is a big deal. Oftentimes these GPTs can't be used anymore or are hard to share within a company. It feels like Anthropic is doing a decent job packaging this up and saying to go to their plugin marketplace and add your own repository with your skills. Once you add those skills, I think when Claude boots up, it'll download the latest ones and have them ready for you. So pretty cool. I'm hoping to see some improvements there, but we're getting into a weird time where these agents are so smart, it's like we can talk less to them. They know our intent. I think there's less wiring up of things, which is what MCP servers are, and more just describing the process and letting it figure it out, which is pretty cool.
Yeah, for sure. Cool, so that's a nice little win from Anthropic and now OpenAI on skills. What else is going on with Anthropic these days?
Well, if your Twitter feed is the same as mine, and I think we do follow a similar crew of people, there has been controversy on Twitter about Opus 4.5. When Opus 4.5 came out, I was singing its praises, and I still am, to be fair. But there are a lot of people up in arms about how, quote-unquote, "stupid" Opus became. There's a long-running theory, because it's not Anthropic's first rodeo when it comes to accusations of degrading Opus over time. Last time we talked about it, they had changed the model limits because people were abusing it because it was so good. And in the process of getting their model deployed to Google and other providers, they had two or three minor bugs that affected the performance, but it wasn't significant. I think it really only impacted maybe 15 or 20% of users. So previously, there was a lot of hype about them downgrading their models. Because the whole theory is, "Get them hooked on it, and then downgrade the model." In a business sense, it makes a ton of sense. "Let's get them hooked, and then let's cut our costs, and now they're still going to pay us." But they came out previously and said they don't do any of that. They came out today and said they still don't do any of that. One of the big changes they've made is a `/feedback` or `/export` command within Claude Code that allows them to look at your session. There was a tweet that went viral saying, "Opus 4.5 got downgraded. It is so dumb today." The top response was someone who worked at Anthropic saying, "Hey, could you share that session? Use `/feedback` if you want to share it with me." I believe the original poster responded to that Anthropic employee and said, "Oh, this was a previous session." But the employee came back and said, "Hey, you can resume that old session within Claude and then do a `/feedback`." I thought it was kind of funny that the guy is basically saying, "Show me the receipts of where it went wrong."
I don't think that person ever responded, but we'll have to dig up the thread. But it feels like they're on top of it. I haven't felt a degradation in performance, but it's a hot topic. People are very sensitive around Opus and Anthropic's usage of the model. I don't know. I for one think they're not doing anything, but I'm curious to hear your thoughts.
Yeah, I mean, I could see it happening. It seems like Anthropic is very hype-driven. When it comes out, people like yourself and others are all about it and sign up. I see people on Twitter saying they'd pay $2,000 for this not to be throttled. So I could see it. To me, it's interesting because with my tinfoil hat on, I don't think Anthropic can compete in the long run with Google and OpenAI. So, why are they cutting costs? If it is true, which I think is not confirmed, why are they doing that? And look, cutting costs is a good thing to do as a business, but to do it this way is not super transparent. It's interesting why they would do that because can't they just do a pay-per-token plan? You have your base threshold and then a usage plan, so you could just be billed on that. So why feel the need to dumb it down? Maybe you just need to price it higher. It sounds like people would be willing to pay that, so I don't know what their strategy is. Truly, it's confusing to me.
Yeah, I think one hypothesis is that when the model comes out, it's a big leap from the previous model, and then you get used to that change. Over time, you expect a certain amount from Opus or whatever leading AI model it is. If it doesn't deliver, whether that's their downgrading or just your expectations being higher, that's what this poster was talking about. To me, when I'm using Opus 4.5 in Claude Code, it feels really strong and has continued to feel strong. So I haven't felt that personally. But there are people who use it less often and maybe feel the difference based on the problem or the prompt. So I hate to call it a skill issue because it has popped up a lot. I will give it that. However, Anthropic is eager and willing to say, "Show me the receipts." So I think that's a good sign. I don't think any of this can fly under the radar given their history. I'm just hoping that Anthropic continues pushing forward on their incremental release scale. If you look at the graph of intelligence over time, they are the only frontier AI model that really has a steady incline. And if they can continue to do that without bugs or downgrading their model, intentionally or unintentionally, I think they're in a really good spot. But for now, the craze isn't what it was last time. A bit too much attention, maybe negatively, but my guess is they're so open to feedback that it'll probably disappear within the next few weeks.
Yeah, and one thing I'll say on Anthropic, because I know it feels like I'm maybe harsh on them, is I think they have the best taste, if that's a thing. If you think about it, they're the first ones that did skills, at least to my knowledge, mainstream-wise. They introduced MCP, which has led to agent-to-agent protocol, which we haven't seen kick off largely, but I'm sure it's coming. So they seem to be the ones that push that stuff forward and pioneer it a little bit, and then the others play catch-up and, in my opinion, do it better.
And the coding CLIs too, Claude Code. That was a leader and pretty much the only good one. When that product first came out, it had a great user experience right out of the box, which a lot of other competitors like Gemini and OpenAI have had to catch up to. But they've had to catch up with a really good product they could just look at and say, "Hey, let's literally replicate this," while having somewhat different features. They led with a really good product, and it's impressive that Anthropic is not focused on the image models per se and is just driving home Claude Code and a really darn good coding agent. So yeah, hats off to them. Hopefully, nothing comes out in the future that says they downgraded. I think it would lose a lot of user trust because people are paying good money. But there's always a "what if" in the back of your head.
Yep, I agree. Cool, okay, let's move off of Anthropic and see what else we've got going on recently. Gemini 3 Flash. Your favorite company came out with their pretty fast and pretty smart model. I think the benchmarks showed that Flash was almost equal in a lot of regards to Gemini 3 Pro, and I don't know how they do it exactly. But they've definitely struck a balance of intelligence and speed. And I think most important is cost, where Google's really driving home the TPUs, which Anthropic is actually buying at large scale. So yeah, a pretty awesome model release.
Nice. Yeah, it's cool. And I saw the benchmarks look pretty good for being a cheaper model.
Yeah, one note that I was very interested in was the benchmark commentary. What I saw online was that the hallucination rate of Gemini 3 Flash is actually pretty high. What that means is the model is definitely coming up with its own facts, but it turns out it's actually scoring high in these benchmarks. The analysis from that benchmark was saying the model almost hallucinates its way to an answer in its reasoning. And when o3, the OpenAI model, was around, I think it also scored similarly, with high intelligence but high hallucinations, given that it also had a deep chain of thought and reasoning. So maybe in the middle of its chain of thought, it would be off course and making stuff up, but when it makes stuff up, it has good reasoning and is able to get to the right answer. So it's almost like an indirect cheat: it hallucinates a lot and makes up wrong statements, but it thinks so well that it's able to think itself out of those wrong statements and get to the right answer. So it's kind of interesting, but I think Google will probably make that better in the next model release. But it's a good callout, because all these benchmarks measure different things like recall, hallucinations, raw intelligence, and coding ability. There are so many benchmarks, and this one was unique in saying high hallucinations and high intelligence almost work together as a powerful combination. So, pretty interesting.
Yeah, no, that's cool. It's interesting to hallucinate your way into an answer. That's what gets people kind of iffy on AI in some ways. I don't know if I like the hallucinations.
Cool. Okay. What else we got shaking? We have a recent supply chain attack. We've talked about this for our bingo card: when a large-scale AI hack would happen. And this one, I believe, was the second wave of the "Shai-Hulud" worm, dubbed "The Second Coming." A pretty dramatic name, but there was a certain set of JavaScript packages on the NPM package registry that were infected. What that means is if you installed one of these packages locally on your developer machine, it would then try to extract secrets from your environment. So oftentimes in a coding project, you'll have a `.env` file. I believe what these malicious, hacked libraries would do is look at your computer, find any important credentials like a Gemini API key or an Amazon API key, and upload them to a public repository on your own account with a specific identifier. The attacker would then just go to github.com and search for that identifier, something like "Sha1-Hulud: The Second Coming." Anyone who was infected would pop up in the search. They would take your API keys or whatever popped up, save it, and use those as a malicious actor. So I think there were over 200 packages affected. The one that I was personally concerned about was PostHog. PostHog is an analytics company that a lot of indie or small to medium companies use for analytics. You can think of it like Google Analytics, but a little bit better, open-source, and more pricing-friendly. That one was affected because one of their dependencies was affected by this hack. They published a statement on Twitter saying they had published their libraries that day, realized there was malicious code, and had removed those versions. But if you installed them already, you're in trouble. As in, get those off your machine, stat. I was concerned because I upgrade my dependencies for both my React Native project and my web app decently often. And I thought, "Oh crap."
So I took the affected versions from the disclosure article, popped them into Claude Code, and said, "Look at my current installed packages and the ones flagged as malware. Do I have any of those?" Thankfully, it said no. But I think when the article came out, there were at least 20,000 repositories that were affected. So it was a big deal. And honestly, with a lot of these hacks, the impact comes later, downstream. It's not day one or two; it's about how they can find these keys and extract the most value out of them. Some people probably didn't even know they got hacked. So it was a pretty significant event, and I expect this to happen more and more. We're getting into the territory of AI doing a lot more, and it's smarter, but it also does a lot of work for you, so you can miss these things.
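For anyone who wants to run the same check by hand instead of asking an agent, the idea is just to diff your lockfile against the disclosed package list. Here's a rough sketch in Python, with a made-up affected list; the real package names and versions come from the disclosure article, and the ones below are purely illustrative.

```python
import json

# Hypothetical excerpt of the disclosure list: package name -> bad versions.
# Replace with the real entries from the security advisory.
AFFECTED = {
    "posthog-js": ["1.297.3"],
    "some-helper": ["2.0.1", "2.0.2"],
}

def find_compromised(lockfile_packages, affected=AFFECTED):
    """Return (name, version) pairs from an npm lockfile's "packages"
    section that match a known-bad version."""
    hits = []
    for path, meta in lockfile_packages.items():
        # npm v2/v3 lockfiles key entries by install path,
        # e.g. "node_modules/posthog-js"; the root entry is "".
        name = path.rsplit("node_modules/", 1)[-1]
        if meta.get("version") in affected.get(name, []):
            hits.append((name, meta["version"]))
    return hits
```

Against a real project you'd feed it the lockfile, something like `find_compromised(json.load(open("package-lock.json"))["packages"])`, and treat any hit as a machine to scrub and a set of keys to rotate.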
Yeah. I guess for the layperson, what is the benefit? When you get hacked, people often think of stealing someone's crypto wallet or money from your account. It doesn't sound like that's what this does. They can take your API keys, but is the benefit that they can use your API key for free while you get billed? What's the long-term play?
Yeah, I think the hack specifically took all the contents of your secrets. So it could be API keys, passwords, just a dump of any secret credentials. I think what they do is look at where it came from, whether that's an Amazon API key or a secret password to an admin account, and then extract value out of it. It's hard to say what they collected.
Yeah, okay. No, I was curious. Yeah, it's a scary world out there. You got to be careful.
Yeah. Cool, okay. So let's wrap it up here. The other news we wanted to chat about is Cursor, my favorite IDE these days. They released a really cool new feature that I haven't been able to use, full disclosure, but I'm excited to look at it and be more involved in it. They have a visual editor for editing your front end. They already do all the back-end code and agentic coding assistance, but now you can view and edit the front end within their system. Which is pretty cool. They also released a debug mode. So Brad, what are the details on these? Let's start with the visual editor.
Yeah, so you can open your website in Cursor, and it's a browser where you can drag and drop certain UI elements. It'll then figure out in the code how to update that. I believe it's based on Tailwind, which is the most popular CSS framework out there. I'm not sure if it works on non-Tailwind, but since it's AI, I assume it does. But basically, if you want to center text or move things around, it's now much easier. I think it enables both developers and non-developers to look at a marketing site and change a few things. You're changing the visual side, but under the hood, it's changing the coding side. Since AI is so good now, this makes a lot of sense. So hats off to them on that. If you're into front-end design or just want to change things without coding—like the font, positioning, or style—all of that is made much easier by this visual browser. So it's really cool. I'm glad Cursor is stepping into a different role here. Anthropic, OpenAI, and Gemini haven't shipped anything like this. I think Cursor exists in that sweet spot of building unique products that bring the industry forward, while other companies are catching up in the CLI or general model war. Cursor doesn't have to worry as much about that, although they do create their own models. They still create some cool stuff, so hats off to them.
Yeah, pretty cool. Awesome. All right, well, should we wrap it up there, Brad, with our bookmarks?
Yeah, I guess the last note is the debug mode, which I haven't really used personally. But when I read their marketing article, the quick spiel is that as an engineer, when things go wrong, you need to figure out what's happening and where to fix it. Oftentimes, that means adding print statements or log statements to see how far the code got and what the variable values are, so we can understand what happens at runtime. What the debug mode does is do that for you. It looks at your code, adds a bunch of log statements, and then allows you to run the debug server. For example, in Python or Node, you'd be running your web or back-end server. You'd recreate the issue, maybe by navigating to a webpage and filling out a form. Then it would see all the logs being produced while running the server for you. There's a terminal within Cursor that can run your web server and look at the logs. Based on the logs, it will figure out what's happening at runtime and then propose the right fix. To me, again, they exist in this mode where they can build cool stuff. This sounds pretty useful. I almost do this indirectly, so it's like packaging up what I would do in my day-to-day into a mode. I'm not sure how much I would use it. Different languages have different ease of debugging. I think Python and JavaScript are pretty easy, but for more compiled things like Swift or macOS apps, this wouldn't work at all. So it's a mixed bag, but I think this is the right direction. They introduced plan mode a few months ago, which was an excellent addition. They have the visual editor and debug mode. I'm not sure what's next. They've brought in full front-end reviewing, which is pretty cool. So they're clearly building a ton of stuff based on talking to people. This is an obvious one. I'm not sure how much I would use it personally, but if you're in the back-end infrastructure space, it could probably work out pretty well for you.
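The manual workflow that debug mode automates is just temporary instrumentation. A tiny illustrative sketch in Python, not Cursor's actual implementation, with a made-up function to show the idea:

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("debug-trace")

def trace(fn):
    """Log a function's inputs and output: the kind of temporary
    instrumentation a debug mode inserts, observes, and then removes."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.debug("calling %s args=%r kwargs=%r", fn.__name__, args, kwargs)
        result = fn(*args, **kwargs)
        log.debug("%s returned %r", fn.__name__, result)
        return result
    return wrapper

@trace
def apply_discount(price, pct):
    # Suspected bug site: does the caller pass a fraction or a percentage?
    return price * (1 - pct / 100)
```

You reproduce the issue, read the log lines, and a mismatch (say, a caller passing `0.5` where `50` was expected) shows up immediately in the logged arguments.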
Yeah, it's interesting. Maybe the next foray is into deployments or something like that.
Honestly, it could be. I'm sure they have a lot of stuff cooking up right now.
Yeah. Well cool, awesome. All right, now let's jump into bookmarks. I jumped ahead of you, forgot about the debug mode. Do you want to go first on the bookmark, Brad?
Yeah, sure. So, mine is one I talked about earlier. It's Anthropic's YouTube video on the AI Engineer channel, which I think is an excellent resource for people who aren't subscribed. The video is titled "Don't Build Agents, Build Skills Instead," by two of the Anthropic engineers. Again, 2025 has been the year of agents. They are proposing that the next form factor is skills, because these agents are so well-equipped that creating a better agent doesn't really make sense anymore. It's more about describing what you want and having that be shareable, like we talked about. It's a 16-minute video, very interesting. These are obviously smart people in the weeds of AI every day. So I'll link it in the show notes. Excellent video; please watch if you're interested.
Cool. My bookmark, I was trying to find this really funny one. It was a video on X that said, "Think about this before you use AI." It was someone screen-sharing, typing "center this div" into Claude or ChatGPT. But then it had an infographic showing the request going to the server, to the processor, doing all this calculation, and sending it back—this whole long cycle for just a little code tweak. Which is completely how I use AI as a layperson. I'm just lazy and I'm like, "How do I center this?" It's just funny. But I couldn't find the link. So what I did have bookmarked is another YouTube video from LangChain. They do really good deep-dive videos where their founders get on camera for an hour and talk through stuff. They did one on building and observing a deep agent for email triage, which I wanted to look at. In the accounting world, accountants often get a lot of flak for not responding to emails. I think some of that is behavior, but some of it is also volume and overwhelm. And I think that's one of those areas where AI could be super useful. I'd love to have a 24-hour response window. If a client sends me an email, I'll respond within 24 hours, even if it's just, "Hey, I received your email." But I don't want to respond to every email, like from vendors or non-clients that I don't need the same TLC for. So you probably need an agent to help figure that out. I thought it was an interesting video. I'm going to check it out and see what they say.
Awesome. All right, well, let's wrap it up there, Brad. Good stuff. Great year for 2025. I don't know how many episodes we did, but pretty awesome. I guess for our next one in 2026, we should revisit our 2025 bingo card for the last time and see how it shook out.
Yeah, and create a new one, too.
Yeah, and make a new one. I can't believe we're already saying that because I feel like not that long ago we made that bingo card. We need to create an infographic with Gemini or the Nano Banana Pro and just see how that ends up. It'd be fun to share.
It's a great name, by the way, Nano Banana. I think it's fun to say.
Cool. Alrighty, well, good stuff, Brad. And enjoy the holidays, everybody, and we'll see you next time.
Awesome. See ya.
See ya.