GPT-5 is here: so why are we sticking with Claude Code?
All righty, we are back with an exciting episode of the Breakeven Brothers podcast. Brad, how's it going?
Excellent. Excellent. We have all been waiting for this moment. The esteemed GPT-5 release is upon us, and we have lots and lots of thoughts. So I'm excited to talk about it. But yeah, just a little bit of news, I guess, since the last time we recorded our episode. Of course, GPT-5, the big announcement and live stream last Thursday, depending on when you're watching this. But I'd say mixed reactions, I guess. Where do you want to start? Do you want to just get into your own personal experience, or should we get into the live stream first?
Yeah, I unfortunately had a meeting at work, so I didn't catch the full live stream. But there were quite a few things that they released that day. GPT-5, the overarching model. But outside of that, they changed their pricing. They highlighted Cursor as the GPT-5 harness of choice during the live stream. They even brought Codex, which is their Claude Code competitor, a little more up to date, so you can use your ChatGPT Pro-style plan for their CLI tool. So there's a bunch of stuff there. One of the biggest things that I was excited about was they gave one week of GPT-5 free inside Cursor. I think they did this for GPT-4.1 too, which was kind of their more agentic model to compete with Anthropic's models. But I was really excited because I was like, I want to use this at work and outside of work, and I want to see what it has to offer. Because, you know, I don't know about you, but in the days leading up to that release, I was thinking, "Oh, I could just wait a little bit and it'll get that much better." And I even told people on Twitter and at work, "Oh, you need to wait a few days and it could be a big moment." And I'm here to report that it's good, with an asterisk on it: I don't think I would use it for day-to-day work, and we'll get into that later. But there is a big unlock for, I think, the large majority of people, which was OpenAI's positioning. And we can dive into that, but I'll pause there. I want to hear your thoughts.
Yeah, well, I thought about this a lot when the announcement was going on and then during subsequent uses, seeing people use it and give their thoughts. For starters, you know, I don't want to, I guess, poo-poo on a new model announcement, because all those teams worked really hard, and the people at OpenAI should be really proud of getting that across the finish line and putting it out there. I'm sure it's super tough, and we've talked a lot on this podcast about how competitive it is and what a super competitive grind it is, especially for those companies working to get all this done. So this isn't to take away from any of the hard work that they've done. But for me, watching the live stream, you know, usually when you go from, like, GPT-2 to GPT-3 to GPT-4, those are big jumps. And I feel like people were expecting a big jump out of GPT-5. I certainly was. And watching the live stream, I just felt like I wasn't getting that big jump. I didn't know if it was an unfair expectation I was having: like, what do we expect, for them to start being able to predict the future, or to just do everything with the push of a button? But it wasn't that much better for me to get super excited about. So that was kind of my takeaway from watching the live stream. The benchmarks and the LMArena numbers are cool, but for me, as a user who uses it for a specific task, I don't really care if it can do PhD-level English composition. I don't care if it can do PhD-level mathematics, you know, at least right now I don't. Those are cool, and again, that team should be proud of that and commended for that. But for me, it didn't really blow my socks off.
You know, I think the expectations—the marketing was absolutely incredible by OpenAI. Sam was posting a photo of the Death Star from Star Wars the night before. The expectations were just really, really high. And I think the good news about being kind of inside of "AI Twitter," as I like to put it, is there are the people who are following along with the marketing, saying GPT-5 is going to blow your mind, and then the other half of the spectrum saying, "No, GPT-5, we're hitting scaling limits, etc., etc., blah, blah, blah," almost playing devil's advocate, saying it's going to be good, but not blow-your-mind-crazy-good. And I had interacted with it the entire day it came out. I was giving it pretty hard problems at work, asking it to solve real problems. And my takeaway from using it within Cursor was that it was really confusing to choose the model initially. I don't know if you saw it, but there was GPT-5, GPT-5 Fast, GPT-5 High Fast... Yeah, so they came out with different thinking budgets, and then there was a fast variant on top of that. So I guess OpenAI is providing some sort of speed-up for extra cost. But essentially, the TLDR takeaway that I had from using GPT-5 High Fast, which was free for one week when it came out, was that it's really, really good at doing something in one shot. So I would ask it a detailed prompt to go fix a problem. It would spend so much time, almost painfully so much time, just iterating and thinking. When DeepSeek first came out, it had this "thinking" moment where you'd ask it to do something and see this large chain of thought. That exact experience was kind of replicated with this new model, at least on the high-thinking mode within Cursor. But it felt five times worse, not five times better, because I felt like it was just over-indexing on the thinking. It went really, really hard and came up with like a 10-minute plan for something where I'd rather see it make progress and then stop it, versus spending so much time on this initial plan. And so, yeah, it was really good: it one-shotted something I worked on. I was really impressed, and I thought, "Oh, I'm going to use this every day." And then I took a step back and thought, this speed is actually not what I want. It's not productive for what I'm working on. The iteration cycle doesn't make sense. So that was my takeaway: it's a really smart model, hits high on benchmarks. It was great to have a free week, but it feels really, really slow. I feel like when I use these AI models, I want the highest intelligence. But as I played with it, I realized my priority has actually shifted a bit. I want pretty good intelligence, but a little bit faster, because that speed really hit a wall for me. I said, I don't think I can do that.
Yeah. Well, let's unpack one of the bigger features, or drawbacks, I guess, depending on who you are and who you ask. We've already touched on it a little bit, but we didn't really call it out. If you're using it in the front-end interface, you know, chat.openai.com, you no longer choose between like a "thinking" model versus a regular model, right? You just query it, and you don't pick the model; it'll decide when to be "thinking" versus not. And I don't use ChatGPT that much from that standpoint; I use it more with an API. But to me, I like being able to choose the model, and it makes me wonder who was asking for that feature. Because online—and again, online is not the real world all the time—but online, it seemed like that was a big drawback. People were complaining that for certain tasks, "I want thinking," and for other tasks, "I don't need thinking. I can pick and choose which one fits my use case. I don't need it to be general and have the AI choose for me." And then sometimes it doesn't choose correctly or, like you said, it overthinks. One example that I saw just recently on LinkedIn—I can't remember the exact quote, I'll try and find it for the bookmarks—was someone who had asked it a question, and GPT-5 thought about it for like 30 seconds. Then they asked the same exact question again, and GPT-5 thought about it for like three minutes or whatever. So there's just an inconsistency that I think is maybe still being worked out by the OpenAI team, or it's just not sticking like they thought. But from what I've seen, that seems like a big negative so far in my circles: not being able to choose the model.
Yeah, they have this new router technology that had actually been teased for, I think, some months on Twitter, at least as a rumor: that GPT-5 would be this mega-model that would be routed internally to choose high thinking, low thinking, medium thinking, etc. And it kind of maps to the various models they had before. You can think of GPT-4o being their general chat model and o3 being their kind of deep-thinking model. And so when people talk about GPT-5 and its thinking levels, they basically map them to say, "Oh, low thinking is 4o, high thinking is o3," et cetera, and it kind of meets in the middle. So I think there's been a huge reaction of, "Why did they change it? Are they saving money? What is the end goal here?" Outside of that, the tools that use the OpenAI APIs are using the high thinking budget by default. So the transparency is there and the power is there. But for the everyday ChatGPT users who could usually choose between the models, you can't do that anymore. They have just recently exposed, I think, a "GPT-5 Thinking" mode, which gets you one class higher; I think it's like medium thinking. And they exposed a keyword, which is what Anthropic does as well, that you can use to engage the higher thinking mode. I can't remember what the exact keyword is, but I think Sam tweeted a day or two ago saying, "Hey, we're fixing it." They even had day-one hiccups of the router choosing the dumber thinking level when it shouldn't. So people came out of the gate saying, "Oh my God, they changed it so I can't choose, and the routing was bad, so the logic felt really, really dumb." So I think overall it was a bit of a bumpy rollout, but the expectations were really high, so I don't really blame them for much, and they've fixed it now. But the TLDR is they removed it. It's not as clear anymore. I think they've actually since brought GPT-4o back in the model selector, because people on Reddit were going crazy saying, "I talked to GPT-4o daily for therapy, for whatever, and now I lost a friend. My best friend is gone," which was a computer. That was mind-blowing to me, because ChatGPT has what, 700 million weekly active users, which is insane for a product. And for them to change this fundamental... I don't know, friend, or what some might think of as a friend on the other side, overnight. You know, for us, at least for the engineers, I'm like, "Oh, I want better intelligence." For people who want someone to talk to, that's a big deal. And I think they realized that a little bit too late, and so now they're kind of rolling back and trying to figure out how to appease everybody.
Yeah. I mean, that's not... that's not OpenAI's problem, to be honest with you. But yeah, you know, someone needs a friend. One thing that's also—I was actually experiencing this earlier today in my day job, and this is really nuanced, but I found it annoying, so I'm just going to say it, not to diminish the rollout, but just to add some things I've observed—is you used to be able to set a temperature for how you want it to respond. It was like a sliding scale where zero was very precise, pretty plain, stick-to-the-facts, whereas one is more creative, more of a creative writing style, basically. And I guess with GPT-5, they don't have that temperature scale anymore, at least in the API. So I had a process where I was using GPT-4 Turbo to write this report, and I had the temperature at zero because it's a formal, professional report; just stick to the facts. But when I switched to GPT-5, one, I got an error saying "temperature is no longer an accepted parameter" or whatever. I thought it was something I was doing wrong, but I looked into it, and it was like, "Oh yeah, we don't do that anymore with GPT-5." So then I just took it out, or I just set it to one. I think the default value is one, and it has to be exactly one, or if you just don't declare it, it will be one. And the writing was way, way too creative. For a professional report, it was just giving me way too much. I'm sure I can streamline it, I'm sure there are other things I need to do, but I just found that odd. It's almost like they took some of the flexibility out of the models and just said, "Here you go, here's just one model." And as you were talking earlier, I looked at chat.openai.com, and it was GPT-5, the flagship model, and they have "GPT-5 Thinking," and now there's "GPT-5 Pro," which is the paid, premium one, I guess. I don't have the option to pick 4o or the other ones on my account. So, yeah, it's an interesting rollout.
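For anyone hitting the same error, here's a minimal sketch of the workaround, assuming the OpenAI Python SDK. The `write_report` helper, the model names, and the retry-on-rejection pattern are illustrative assumptions, not something prescribed in the episode:

```python
# Minimal sketch (OpenAI Python SDK): GPT-5 reportedly only accepts the
# default temperature, so drop the parameter when the API rejects it.
from openai import OpenAI, BadRequestError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def write_report(prompt: str, model: str = "gpt-5") -> str:
    try:
        # Older models (e.g. gpt-4-turbo) accept temperature=0 for
        # precise, stick-to-the-facts output.
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
    except BadRequestError:
        # GPT-5 may reject a non-default temperature; retry without it,
        # which is equivalent to the default of 1.
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
    return resp.choices[0].message.content
```

Presumably the same pattern applies to other sampling parameters if they're similarly restricted: send the request you want, and fall back to defaults when the model rejects a knob.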
Yeah, they said it's available. I don't see it either anymore. Maybe they tucked it away in some menu. But it's funny, because I think Sam had tweeted that only like 7% of people were using a thinking model back when o3 was available, and these are paying users. Then with GPT-5, it jumped up to, I think, like 24% using a thinking mode, because now it thinks based on the user's query, not the user's selection. So I think they're trying to take the onus off the user saying, "I want thinking." Now it's, "I ask anything of ChatGPT, and ChatGPT is smart enough to figure out if I need thinking." But it blew my mind that as paying users, everyone was using 4o. So 4o got pretty good, you know, as a general chat assistant and for intelligence. It hits the marks across speed, intelligence, and price. But if I'm paying $20 a month for OpenAI's premium plan, I'm going to use their thinking model with web search. That was the bread and butter for me. Anytime I used ChatGPT, any query I asked, 99% of the time, I would go to o3, click the little web icon, both in the app and on the web, and then ask something. It's thinking and it's searching the web. That's what I would do if I was trying to figure out anything in my life. And 4o could maybe do that, but it just didn't think as much, and the answers just aren't as good. That's like a proven thing in LLMs. So I'm very, very surprised that the paying population was not aware of that. And I've seen so many people share screenshots with me from various places: "Hey, I just got this answer from ChatGPT," and in the top left corner, I see "4o," and I'm like, "Oh, 4o is bad. There's better." And I don't like being that guy that's like, "Oh, you gotta choose the better model," but at some point, I feel like we need a bit of intelligence on how to use LLMs. And I think GPT-5 is this moment where people are now using the thinking models without spending any effort. So the general level of intelligence for chats will rise. People will be happier overall. I think that was OpenAI's goal. It's not only that they got number one across all benchmarks, which they did; they also kept good API pricing. So hats off to that. But I think the major, major reason was taking all their models, some of which were experimental, like GPT-4.5, which I think gave you only so many queries per month and was very good at writing, taking various flavors of these models, baking them into one, and putting a router in front of it: this is GPT-5. And on average, it thinks more than 4o, and that's a win for most people. I think how they pitched it, that general intelligence will go up, thinking will happen, we have better advancements, that is where you give them hats off. It'll definitely be a lot better for everybody. For the people who are looking for maximum intelligence and already know how to use ChatGPT, it doesn't feel like as much of a leap as I was hoping for. The marketing hype was there, my expectations were high, and given all that, I think it falls short. But again, it's a great release. I just pulled up LMArena: number one on text, number one on web dev, number one on vision, and number two on text-to-image. So across the board, it does a fantastic job. And it's an easy upgrade from GPT-4o if you're using an API. But yeah, it's interesting times. The expectations were high, we got a lot of wins. I wouldn't say it's exactly what I wanted, but it's good. It's good.
The Death Star is too much. Like, come on, what are we doing, you know what I mean? I didn't watch the podcast, but I think I saw a clip of him on Theo Von's podcast, which is kind of just weird to say out loud, too. He was saying something about AGI, and I don't think he was saying GPT-5 was AGI. I don't think he was going that far. But he was really alluding to something big. Again, this was probably two or three weeks before the release of GPT-5, so he's doing this press tour, getting people excited. And you know, for me, that was a disappointment in the sense that this is just a slightly better version. The one interesting thing you said, though, about the thinking models is that from their side, they need to balance out the cost too, right? If people don't need thinking but they're getting thinking, there's a cost element to that that I'm sure they weighed into the decision-making. They're obviously very smart people, but that's interesting to me. Because one use case, and this is going to sound ridiculous, but one use case I do use ChatGPT for, just 'cause I'm lazy, is I'll get a grocery list of things I need to buy at the store for meal prep. And then I'll say, "Organize it by where I'd find these things in the grocery store." 'Cause what I used to do, and I used to hate grocery shopping, is I'd zigzag all throughout the store because my list wasn't organized in any way. It was just organized however the meal prep list had it, wherever I got it online. So then I use ChatGPT to organize it so I just stop in the produce section and get everything I need. Then I go to the next section and get everything I need. Super basic, obviously. I'm not blowing anyone's socks off, but I don't need thinking for that. I can just use 4o to organize this. Just little things like that. So I do wonder if they'll run into this, where GPT-5 thinks more than it needs to and it ends up costing them more than they really need. It's almost like a waste of... what's the right word? Like a waste of energy, for lack of a better term.
As part of the rollout, they also adjusted the limits for paying users. They had different models with different limits before; now it's centralized on GPT-5, and there was a different published limit, essentially, for GPT-5. And people were really not happy about it. I think they condensed, like, 2,000 requests per month to like 250. Maybe those aren't the exact numbers, but it was a large reduction in the number of messages you could have per month. So Sam actually tweeted today saying, you know, given the compute needed for GPT-5, we will have different priorities for what we're going to serve. Number one, making sure that paying ChatGPT users get more than they did before GPT-5 came out. Because, again, the issue was we had a lot of usage if you paid; once they changed it, we had less. So, one, they're bringing that back to where it should be. Two, we're going to prioritize API demand up to our capacity and the commitments we've already made. That's, I guess, using GPT-5 as a replacement for the other models at the capacity those models were being used. That's how I read it, anyway. Three, increase the quality of the free tier of ChatGPT, so maybe giving more thinking budget to free users. And then lastly, prioritize new API demand. And his final note is, "We're doubling our compute over the next five months," which sounds like a big deal. But essentially, exactly like you mentioned, they're using more thinking. Maybe they shouldn't be. Maybe they should be. Overall, they're just going to be using more computing resources for this new model. And because of that, they need to balance their server allocation, whereas before, there were these powerful models, but people didn't really use them as much as they probably should have, because they didn't know. Now it's being used by default, and now they've got to prioritize where to put requests and where to serve the API. It sounds a little complicated. Hopefully, they can figure it out and don't have any scalability issues, because they've been pretty stable, I think, over the past two years as an API provider.
Yeah. The other part that was interesting, too, is comparing the rollout of GPT-5 to some of the other bigger announcements that have happened this year. I think—and I'm curious what your take is on this—for me, Claude Code was kind of a sleepy announcement, I would say. Not many people caught on to how good it was at first; they got excited about it as they were using it. It wasn't like they heard the announcement and everyone dove in headfirst. That's just my take on it. It seems like people were like, "Hey, this is actually pretty cool. Oh, hey, this is actually really cool. I'm a die-hard now. I'm going to break the model and ruin the max plans for everybody." That was the Anthropic rollout. And then for me, the biggest one of the year is still, I think, the Google I/O conference with Veo, in terms of something that came out that blew people away. Veo is the video model where you can give it plain text and say, "Hey, make this movie scene for me." Prior to the Google I/O conference—I think that was in April or May, whenever you were in Japan, that was when that conference was going on—there was nothing that was even close to as good as Veo for making videos. I remember the hype and the reception of that was crazy. And now you see all the funny little yeti videos that are floating around, like influencer yetis and stuff like that. But people are using them in marketing for commercials and things like that. So that to me is still the biggest rollout. But yeah, I don't know. What would you say? Would you agree with some of that? Or do you see those things differently, as far as Anthropic's rollout and what's the biggest rollout in your opinion so far for 2025?
Yeah, I think Claude Code wins. And I think it's great that a thing wins based on user feedback and developers loving it. It's a very developer-focused product, so it's hard to compare a Claude Code to a ChatGPT. But if there's anything in the AI space, there's always a ton of hype, will it deliver, will it not, and part of that is stacking the expectations against reality. When Claude Code came out, I had like two people talk about it. My first reaction was, "Oh, a new AI tool." Like, I'm down to try new things, but it's a lot of friction; I don't really love that. Then I tried it and I was like, "Damn, this is really good." I knew instantly this was valuable and people were going to love it. And so I ran with that, really loved it. Since then: GPT-5, expectations too high; good, we already talked about it. Veo, I think it's really cool. I just don't have a use case for it myself. I'm not in the marketing space. I've seen cool stuff with it, but it feels more like a toy for where I sit. People who are in the marketing space, I'm sure, would be like, "Oh, I don't have to edit half the things," or "I can fill in the background." I haven't used it; I've just seen demo videos. So I think it's pretty damn cool. But we get to this point where, as these LLMs get better and better, it sometimes feels hard to measure them through benchmarks. GPT-5 hits number one, but does it actually feel that way? How much time did it take to become number one? Because again, when I used it in Cursor, it took 10 minutes to think about something that I think Claude Code could do in four. So there comes a time when benchmarks aren't that useful, and we get to vibes, which is a critical one. How does it respond? How does it react to you? We've seen Gemini do self-deprecating things like, "Oh, I'm dumb. I'm so sorry." I'm sure you've seen the screenshots of Gemini going off the rails and kind of shaming itself. We have OpenAI, who went through a phase of agreeing with you and being this person who is always there for you. I feel like Claude and Anthropic hit just the right nuance: not self-deprecating, not agreeing with you too much. They're definitely on the agreeing side, but not too much. And their tooling has really made their model shine. So I think Claude Code wins by far. I think there's a lot more to come. Google Gemini 3, supposedly, is coming soon. DeepSeek's new model is coming soon. And a ton of open-source models have dropped on top of that. Anthropic released Claude Opus 4.1 the same week that GPT-5 was released. So last week, a lot of exciting stuff.
Yeah. One thing we didn't touch on is the open-source models that OpenAI released. I haven't touched them myself, so I don't have much to say. But have you seen anything online about those or used those yourself? Like, seen any fanfare?
They did it just a couple of days before GPT-5, I want to say. And people were saying, if they're coming out with open-source models this good, GPT-5 is going to be incredible. I think that added to the hype. Their models are good. I had found that initially, when they were being served by open-source AI providers like Groq and Fireworks, all these companies that just host open-source models, gpt-oss, or whatever they called it, had a very specific configuration to run. So a lot of these providers ran it incorrectly, with the wrong configuration, and it actually felt really dumb. Over time, they fixed that, and I think it's gotten better. But the open-source community has really been on fire, with lots of releases from Chinese AI labs that have been extremely impressive. So I thought, you know, their open-source model came out, and it's pretty good. It's okay, like, it's decent. And then, a few drops later from China, I think Qwen came out with a few models, and other Chinese AI labs too, and they were better than OpenAI's open-source model. But, you know, it's hard to say. My TLDR is I wouldn't touch them, because I don't really use anything that's not super high-intelligence; I want the best bang for my buck on coding tasks. But I think it does enable different use cases. And I know people feel differently about using some of the Chinese models because they can't talk about certain topics. So it unlocks a certain use case if you're into that. But overall, the reaction online felt very mid-tier: "Hats off to them for open-sourcing it. It's not leading in any benchmark, but it's open-source from OpenAI, so for that we give you a few claps here and there, and we'll move on with our day."
Yeah, it's cool to put more stuff out there. If it doesn't fit your specific use case, I'm sure it'll serve someone else's. I'm sure there are people out there who only use the open-source, open-weight models. So, yeah, big news. As I'm saying this, I still think the jury's out on how GPT-5 will be received overall. Because one thing that's kind of cool about OpenAI, and I haven't seen this with the other providers, so I could be misspeaking here, is that OpenAI and Sam are very active and receptive to feedback. When things are happening that people don't like, it seems like they come in and change them. And if things are happening that people really like, they seem to boost and encourage that. We've already touched on some of the things people were complaining about; they might already be adding features or giving us some flexibility back. So I'm sure they'll continue to dial it in and get it right, so to speak. But yeah, it's cool. It's always good to have a new model to check out and use in your different tasks.
I think one of my biggest takeaways, for folks using GPT-5 today in Cursor, is that its biggest unlock is that it's really good at following your instructions, and it's really good at one-shotting things. That combination of, one, it trying really hard to understand what you want and, two, it being able to do most things in one try leads it to be a bit slow. Like I've talked about, it'll spend a lot of time thinking. So the onus is on you to spend a ton of effort on the initial prompt. As we've talked about, the planning phase is really critical for AI workflows. I'll probably talk a little more about that later, but GPT-5 is really agentic now. It thinks on its own. It can tool-call much better. And Cursor is a really good harness, as an AI tool, for it to reach its maximum productivity or effectiveness. If you're using Cursor and GPT-5, spend a ton of time writing that initial prompt. I'm talking 10 paragraphs. It'll one-shot it. You'll be really impressed. If you do anything less, I don't think it performs at the level it should for coding tasks. Again, for coding tasks, use the high thinking effort or reasoning budget. From there, spend a ton of time writing the prompt, and I promise you, it will one-shot things that will really surprise you. But you have to be tolerant of a, you know, 10-15 minute wait. There have been times where I was using it, and I would literally watch it just sit there thinking, thinking, thinking. I thought, "It's stuck. Like, it's definitely stuck. It doesn't take this long." I've sat here with Claude Code plenty of times thinking, "Oh, Claude might be stuck," and it gets through it. With Cursor, when I played with GPT-5 High Fast—Fast, mind you—it was so slow, but it did well on the output. So give it a query, give it a nice, juicy initial prompt, wait 10 minutes, and don't watch it, because you will think it's stuck multiple times. Once it's done, come check the output. Really, really impressive. I think they've done a good job at making it smart... whoops... making it smart, making it so that...
Smacking stuff, Brad. You're smacking stuff on your desk.
It follows instructions. So give really clear instructions. I think that's one of my biggest takeaways. It's slow, it's very intelligent, it will follow your instructions more so than I think any other model will. But it's super fricking slow, and it's like, you gotta be okay with that. So I would say, take it to high-complexity tasks or be mindful of what you use. I think I would still hands-down prefer Claude Code, just for now. At least we'll see if things change for better or for worse.
Yeah. And what I'd be curious about, maybe something listeners and viewers can fill us in on in the comments or shoot Brad a tweet about, is this: I haven't yet seen someone say, "I couldn't do X before, and now I can do it with GPT-5." I've seen cases to the contrary, actually. Our friend Justin, he puts out great content; he's in the accounting world, and we've talked about some of his posts before. He did a great one where he was having GPT-5 categorize expenses and do an expense audit. He has all these great screenshots and a ton of detail in his posts, and I think his takeaway was that Grok, the xAI Grok, G-R-O-K, was actually better than GPT-5 at categorizing new expenses. Maybe that's a vibe thing, like you were talking about. Some of them just have a certain flavor and are good at certain things. But I haven't seen someone say GPT-5 can do this and these other models can't. I haven't seen it yet. So I'd love it if people could find that or point me in that direction. I'm sure there are use cases where that is the case. But yeah, I'd love to be enlightened on that, because I haven't seen it yet.
I follow a few of the OpenAI tribe on Twitter: Sam Altman, I think their head of Go-to-Market, and some of their DevRel folks. And speaking of things that GPT-5 can do that previous models couldn't, those folks dive straight into those use cases. Like, one tweet out of the blue from someone with no followers will get retweeted by some of these people because it sings the praises of some random use case that is now possible. But in all seriousness, an excellent release. Good for OpenAI. I think it raises the competition, and I'm expecting Anthropic to probably take the crown in the next few weeks. When they released Claude Opus 4.1 last week, the same week that GPT-5 was released, they had put in a blurb saying, "And we have a lot more coming soon." Maybe that wasn't exact; it was something like "coming in the next few weeks." And when people see "weeks," that translates to like one week. So we're expecting a lot in the next week. Hopefully, Anthropic can come through. They've done a fantastic job historically. So I can't wait till Claude Code gets better. I'm going to be stoked.
So I think that's a good, interesting topic that we've touched on a little bit before: when is intelligence enough? Because for me, and this is just my own use case, I've said this to you privately: I don't really need it to be much smarter. For me personally, it can already do PhD-level math; it can already be a better doctor in certain situations. That's all great. So is it now about speed, or is it about intelligence, or is it a combination of both? At some point, we will hit a ceiling—not a ceiling in intelligence, but a ceiling where it doesn't really make a difference for me. You know what I mean? And I'm already kind of there for my own use cases right now. That could change in the future. You obviously use Claude Code; you're building much more complex applications than I am, since you are a full software engineer through and through. But for me, I don't know what they could come out with that would make me go crazy in that sense. So what are you looking for? When you are dreaming, when Brad has dreams about the model that comes out, what is he... is it about speed? Is it about intelligence? Like, we're already there, Brad.
If I'm dreaming of Opus 5—so not GPT-5, we're talking Opus 5, and not Opus 4.1 or any of that, we're talking Opus 5—I think, as I've used these tools, they're very, very good. Again, they get good when you write good prompts. So that means taking the context that the AI tool needs to know, bringing all that data front and center, and essentially doing a planning step on top of that. I dump a certain coding task into it. I say, "Create a plan to achieve these three goals," and then it responds saying, "I'll create a checklist of these three to-dos," and that'll be the plan. I spend a lot of time iterating on that plan, and then I tell it to go. And essentially, I think the top three things I'm looking for are, one, coding intelligence. So, how to use the latest frameworks. For example, when I work on my React Native app, I am essentially porting over my website code to mobile. Doing that is different. The paradigms are different. The navigation is different. If you use an iPhone and a desktop browser, you know that things are just different; I don't need to tell you as a user. For example, I can have multiple tabs in a browser. Can't do that with an app. Things are different. So when I bring over this web code and I say, "Hey, can you create an expense detail page? This is my web code. Please create this in React Native," there is a step in which the AI does a really good job of trying to map these platform paradigms from web to React Native. It does a pretty good job, but it's not perfect. And there are various details that expose that it's not very mobile-feeling. It feels like a website that's shoved into a native app. It does not feel like Airbnb coded me a "split my expenses" app. I think that part is really, really critical: that I can ask it to do something without as many design expectations, because the model has really good taste in design. The second part is taste in coding and writing code. Because again, the era of "vibe coding" is upon us. People are writing these apps and pushing them out to production without thinking twice. But as you use these AI tools in large codebases, they can choose to write new functions instead of reusing old functions. They can have really convoluted logic when it should be simpler. There's a bit of taste in programming where it's not just the output; it's how the output looks, how it feels, how you'd approach it if you were to do it yourself. And the AI is not as good as I'd want it to be in that category. Oftentimes when it creates code, I feel like it's a little more tech debt than I want, tech debt meaning it's not up to my standard, but I don't want to go spend 30 minutes trying to make it better. And then the last part is cost. So it's: one, does it have taste in design, does it understand platforms? Two, does it have good taste in writing raw code? Because code, functions, and naming things are really, really important for understanding. Three, is it fast? Is it cheap? Can I run it, and can I be effective with it? Claude Code is kind of the pinnacle of all those things together: it has a great model, it has a great harness, it does a really good job at understanding my intent, and it has a really good design aesthetic and writes clean code. But I think it could do a lot better. There are times when it writes code and I'm like, "That sucked." There are times I ask it to do something and it takes it in the wrong direction, where I think, if I asked a normal human, they would get that. The AI just did not get that.
Not sure why, but I'll just do it again. So I think there's a lot more that I want. Which category do I care about the most? A culmination of all of them, and I expect the next models to do well on all those fronts. GPT-5 was a good example of them actually highlighting that if you provide design guidance, it'll follow it, and if you don't, it'll use its own built-in taste. They trained the model with designers in mind, using color systems, Tailwind CSS for example. They literally said in their presentation, "We had outputs, they didn't look that good. We took them to designers, figured out what should change, brought that back into training, and made the design better." So it's just leveling up design understanding. There's a lot to be had there. Maybe we're at the 80% mark, where that last 20% is going to be: I ask it to code an app and it looks like Airbnb instead of some crappy-looking thing. But that's going to be really important, I think, for future generations of models, to just have something that looks good by default.
Yeah. The one thing you said about how it needs to try to understand React Native, I think I'll be truly impressed if it can really nail down a JavaScript framework, because they're always changing and there are always issues. There are. It's always overcomplicated. Oh, man. But yeah, to me, it's interesting because I'm not a vibe coder. And we've talked about that before. I don't know enough to catch BS when I see it. And I tend to want to make things myself, if that makes sense. It's not that you can't, but for me, for it to make sense, I kind of need to be the one making the calls, at least on the first go. And then afterwards, I can say, "Hey, can this be done any better?" But I use Cursor pretty much 100% now, pretty exclusively. And for the most part, I'm asking it design questions, not so much "write this codebase for me," just because, again, I have to know how it works, especially being a non-technical person. I can't take some of those things for granted. So I've been building out a workflow, something I've talked about here on the podcast before. It's a tool-calling agent that has access to all these different tools, and the tools are graphs built with LangGraph: structured workflows where you go to step A, step B, step C, step D, et cetera, to get to an end result. If you've seen the classic tool-calling example where they ask, "What's the weather in Tokyo?" and it goes and does a weather query, either through search or a weather API, it's like that, but with much more complex workflows, the things people actually do for their jobs. And you know, the questions I've had, Gemini has actually been really good at answering, as far as models go. Questions like, "What should my classes be if I want to capture this data, when I need to connect it to FastAPI to serve up a web interface?" That's the part where I'm not that familiar with FastAPI. I know what it is and I've dabbled in it, but not in connection with a tool-calling agent that's calling tools. So it's been really useful in that regard. But yeah, I don't have it write much code for me, except to clean things up, you know what I mean? That might change in the future, but I still need to control it.
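To make that architecture concrete, here's a minimal sketch of the shape being described, assuming current LangGraph, LangChain, and FastAPI APIs. Everything specific here (the `ReportState` workflow, the `run_report` tool, the `/ask` endpoint, the model choice) is hypothetical, made up for illustration rather than taken from the episode:

```python
from typing import TypedDict

from fastapi import FastAPI
from pydantic import BaseModel
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import create_react_agent


# --- One structured workflow (step A -> step B) built as a LangGraph graph ---
class ReportState(TypedDict):
    query: str
    draft: str

def gather(state: ReportState) -> dict:
    # Step A: collect whatever data the workflow needs (stubbed here).
    return {"draft": f"data for {state['query']}"}

def summarize(state: ReportState) -> dict:
    # Step B: turn the gathered data into a result.
    return {"draft": state["draft"] + " -> summarized"}

builder = StateGraph(ReportState)
builder.add_node("gather", gather)
builder.add_node("summarize", summarize)
builder.add_edge(START, "gather")
builder.add_edge("gather", "summarize")
builder.add_edge("summarize", END)
report_workflow = builder.compile()


# --- Expose the whole graph as a single tool the agent can call ---
@tool
def run_report(query: str) -> str:
    """Run the multi-step report workflow for a query."""
    return report_workflow.invoke({"query": query, "draft": ""})["draft"]

agent = create_react_agent(ChatOpenAI(model="gpt-4o"), [run_report])


# --- Serve the agent with FastAPI ---
app = FastAPI()

class AskRequest(BaseModel):
    question: str

@app.post("/ask")
def ask(req: AskRequest) -> dict:
    result = agent.invoke({"messages": [("user", req.question)]})
    return {"answer": result["messages"][-1].content}
```

The design choice this illustrates: each tool wraps an entire multi-step graph, so the agent only decides which workflow to run, while the steps inside stay deterministic.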
Yeah, control freak. It's funny that you bring that up, because oftentimes I bring up the fact that, "Hey, we have PhD-level reasoning and math," but I've never asked anything of that caliber, because I wouldn't know if it did it right. Coding is actually the piece where I have expertise, I have industry experience, and I can look at code and get a good feeling for whether I love it, hate it, or it's okay. And oftentimes, AI is on the side where it's okay or I love it; it rarely falls into the bucket where it's completely off, where I'd say, "I would never write it like that, I wouldn't expect others to write it like that, let's change this." And I think the problem with vibe coding is you build layer upon layer of code, and if that bottom layer ends up in the not-great category, the rest of your app builds off those patterns. Lots of code is copied and pasted, brought over from various screens, functions, et cetera. Boom, you end up with a pile of turd. And that's the part where, when I'm building my mobile app, I try to be very careful not to allow any of this bad code into my codebase, because when Claude goes and looks around and writes more code, it can find that bad code, and it self-replicates. That's the part where I'm going to be really excited once Opus 5 comes out, with better taste in design and better taste in writing general code. It'll be a time in which I can be more hands-off and just know it knows what it's doing. It writes great code; I just say, "Go." And yeah, it's going to be great.
Yeah, well, one thing before we close up on that: on that example of the graph and workflow I was building, one of the things I remember asking it—because I wasn't sure how to do this in FastAPI, in the context of that framework—was how to serve up these different models. Because I was using Pydantic, long story short, it created a custom serializer to take the class object that I have and display the results. And it was like 60 lines of code of just an if statement: if this, if this, if this, all these different types. And I was like, "That doesn't feel right." People love Pydantic, people love FastAPI. There's no way that can be the right way to do it. And then I just Googled it, old school, I guess, and Pydantic has a built-in function that'll serialize all those things it was trying to do. Like, it just does it, you know? It was again one of those things where, if I didn't know enough to call BS, I would have ended up with tech debt from that perspective, AI slop.
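For what it's worth, this is probably the built-in being referred to, shown here as a minimal sketch assuming Pydantic v2 (the `Expense` model is made up for illustration). The `model_dump()` / `model_dump_json()` methods replace the hand-rolled if/else serializer, and FastAPI applies the same machinery automatically when an endpoint returns a model:

```python
from pydantic import BaseModel

class Expense(BaseModel):
    vendor: str
    amount: float
    category: str

expense = Expense(vendor="Acme", amount=42.50, category="Office")

# No 60-line custom serializer needed: Pydantic models know how to
# serialize themselves, nested fields and all.
print(expense.model_dump())       # {'vendor': 'Acme', 'amount': 42.5, 'category': 'Office'}
print(expense.model_dump_json())  # '{"vendor":"Acme","amount":42.5,"category":"Office"}'
```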
Yeah, AI slop. And you know, it's funny because it can go the other way. I was using the Google ADK, the Agent Development Kit—no, wait, it wasn't Google ADK, it was LangGraph and LangChain—but they have a pre-built agent called `create_react_agent`. ReAct as in reasoning and acting, not React like the JavaScript framework. It's a pre-built thing that abstracts some of the details away from you to try and make it easier. But for me, I found it was too abstracted. I needed the granularity to say, "Okay, I'm going to put these parameters in when I make this LLM. I want it to confirm the tool calls and stuff like that before it executes them." All those things it was abstracting away from me, I was like, "No, I want that detail." You know? And so there's a taste that you have as a developer too, or as a builder, that may be at odds with the AI, and you've got to pick and choose. That's design, you know? You need to put your own stamp on it.
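As a rough illustration of that trade-off, here's a hedged sketch of what dropping below the prebuilt agent can look like with LangChain's lower-level primitives: binding tools to the model yourself and confirming each tool call before it runs. The `get_weather` tool and the confirmation prompt are hypothetical, just to show the control `create_react_agent` hides:

```python
from langchain_core.messages import HumanMessage, ToolMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def get_weather(city: str) -> str:
    """Look up the weather for a city (stubbed for the example)."""
    return f"Sunny and 75F in {city}"

# bind_tools exposes the tool schema to the model without any agent loop,
# so you keep full control over the model parameters and the loop itself.
llm = ChatOpenAI(model="gpt-4o").bind_tools([get_weather])

messages = [HumanMessage("What's the weather in Tokyo?")]
ai_msg = llm.invoke(messages)

# This is the granularity the prebuilt agent abstracts away: every
# proposed tool call is visible here, and nothing executes until you say so.
for call in ai_msg.tool_calls:
    print(f"Model wants to run {call['name']}({call['args']})")
    if input("Execute? [y/N] ").strip().lower() == "y":
        result = get_weather.invoke(call["args"])
        messages += [ai_msg, ToolMessage(result, tool_call_id=call["id"])]
        print(llm.invoke(messages).content)  # final answer using the tool output
```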
Yeah, I think that part flows into vibes too, where any modern LLM can write code, but in how it writes it there's a lot of nuance and flavor and flair. If it's the right style, you know it and you like it. If it's not, you recognize that too. I think Anthropic does a good job on that. One thing I wanted to call out, though: I've been off the rails, really going at Claude Code again for my mobile app, and I've made insane progress. I think the biggest unlock is using voice. I talked about this on the podcast maybe four or five months ago; I brought up that there are a few apps that can transcribe what you say into text. And this past week, even this weekend, I picked up Wispr Flow. So, W-I-S-P-R Flow. I've used it before, but never with Claude Code. I follow a few folks on Twitter who are very, very vocal about the best Claude Code workflow, and I trust what they have to say. They had mentioned, "I use Wispr Flow. I've been just yapping to my computer to write a prompt, and that has produced significantly better results." So I installed it sometime last week and just went ham explaining things, because, again, I know what I need, I just don't want to write it. It's very time-consuming for me to translate web UI into React Native UI. So what I do is explain the feature from top to bottom, navigating throughout the code, explaining all the nuance. The biggest thing I messed up last time I used transcription software like this is that there are two ways to record. One is you hold the bottom-left key on your keyboard; on Mac, it's the function key. While you hold it, it records. Once you let go, it transcribes and pastes. I was doing that before: I'd go into Claude Code, hold down the button, and have to think, "How do I describe what I want in one shot, knowing exactly what I want?" I couldn't figure out the right workflow. It sounds weird, but I couldn't get out the right thoughts and feelings, et cetera. What I realized, and I don't know why I didn't realize this last time, is that there's a different recording mode where you hold that same key and then press space. That's like an unlocked recording mode: I can navigate around my computer and talk, and it keeps recording. Then I press the bottom-left key again to end the recording. So now what I do, which I think is an immense unlock for people using Claude Code (I would 100% recommend this), is use Cursor, use Claude Code, use Wispr Flow: start the recording in that unlocked mode, talk about your feature, and actually navigate your codebase in Cursor while talking. Because again, I'm talking about three or four files at once: how to join these files together, what the architecture should look like, what it should look like on web, what it should look like on mobile. I'm jumping through the codebase talking about these things, and it's really hard to get all that without an unlocked recording. I've been doing this over the past week and I've made an insane amount of progress. I have been talking so much to my computer. It almost feels a little weird at times, honestly, but I have stats on it. I have a two-week streak. I'm at 145 words per minute, which is fast; again, it's faster to talk than to type for most people. And I'm at 20,000 words dictated.
And I looked that up: that's actually like 50 pages' worth. It's as if I'd written a 50-page essay of words typed or talked into Claude. And they told me I'm in the top 1% of all Flow users, Flow being the app. So it has been an absolute game-changer. When I pitched Claude Code in early June, I felt like I was early. This time, I don't know if I'm that early; some people have been using it. But I would highly, highly, highly recommend it. Now, even if I want a follow-up change in Claude Code—so I'll brain-dump a massive prompt, it'll then write code—once it's done writing code, if I want to change anything, I immediately go to the recording mode and just say it. It's just faster to talk and let go, even if it's one sentence. Seems dumb, but I would highly, highly recommend it. It's been a huge unlock for me. I've done more in the past week than I've done in the past three months. And it's a huge, huge level up. I don't think I'd want to go back to writing things out. Once you go with this approach, you don't go back. It is $15 a month, but so, so worth it.
Yeah. Yeah, that's cool. I've heard people talk about the speech-to-text quite a bit and like it a lot. So I've used the... not speech-to-text a lot, but just used the speech interface with ChatGPT or with Gemini. It does feel... it's a different feeling. It feels more natural to kind of just talk about your ideas. But I do find on the flip side that when you are writing something, I think you are having to think more about it, you know, and have to kind of really pause on your thoughts a little bit more.
It's funny you say that. What I've learned is that you can have straight-up gibberish, but the idea is there. Like, I've said things that the speech-to-text actually gets wrong: if I look at the text, it's not what I said, a few words are off, but the model gets it. The idea is there. And that was a big unlock for me, because I thought I had to craft the perfect prompt in text, and I spent a lot of freaking time typing out good prompts by hand. Then I realized, "Oh, I can talk to it." But when I first did the talking, I thought, "I can't navigate my codebase. Now I need to one-shot what I used to be able to perfect." Now what I do is just yap to it. Quite literally, just sit here and talk for a minute. "Here's my payment screen. Here it is on web. Here it is on mobile. This is how you do this. This is how you do that. Here's my expectation. Here are edge cases." I will just talk and talk and talk. It's not necessarily that I'm talking code, though sometimes I am, but more or less the idea is there. And for some reason, it really works well. So that barrier of writing a great prompt ends up being just spending time talking. And that is a much easier barrier for people to overcome than having the perfectly engineered initial prompt for Claude Code.
Yeah. Yeah, for sure. One of the things I like to do whenever we make the drive out to California from Arizona, if I'm doing it solo, which I've had to do a couple of times, is it's like a six-hour drive. So every now and then if I have something I'm thinking about, I'll just record it on a voice memo, you know, and just be like, just chat about it. One, 'cause I'm probably partly bored, but then two, because it's, you know, you have ideas you're kicking around, and sometimes almost getting them out in some kind of medium, whether it's writing or speaking, lets you think about it differently than if it's just in your head and you're talking to yourself, you know what I mean? Even though you are literally just talking out loud, it does get the ideas out, I think, differently. So yeah, pretty cool. Awesome. Okay, should we wrap this up with some bookmarks, Brad?
Yeah, one last thing, though. I was going to ask, did you see the news article that came out? You know the whole Windsurf drama about the acquisition, founders leaving, et cetera. The latest update is that Windsurf got acquired by Cognition, the creator of Devin. I think last week there was an announcement that Cognition had laid off around 30 of the Windsurf employees it acquired, and Cognition's founder came out and said, "Oh, like, Cognition is a super busy company and we work like 60-hour work weeks, and for the people who were acqui-hired from Windsurf, we told you that, but they weren't really bought into the culture, so they got laid off." And I just thought, "Damn, what a journey that would be." From, "Oh, you're going to be acquired by OpenAI." "Oh, never mind. Your founders are leaving and you get nothing." "Oh, never mind. You get bought out and actually you do get some money." And then the next week you get laid off.
Yeah. I wonder if there are any clawbacks on the money they might have received if they got bought out, you know?
Yeah. And one of the quotes the founder had used, I'm pulling it up right now, quote, "We don't believe in work-life balance. Building the future of software engineering is a mission we care so deeply about that we couldn't possibly separate the two." A little bit cuckoo to me.
Yeah, I don't know. I mean, he says 80-hour work weeks, actually. It wasn't 60, it was 80, now that I'm looking at the article. So, yeah, the AI race is absolutely bonkers. You know, I would never work at a place like that, to be completely honest with you. That's just not the stage of life I'm in. Maybe at a different stage, I would have been interested in that. I mean, I was a Big Four auditor, and it wasn't hours like that. Some people have crazy hours like that as an auditor; I didn't have it that bad. But I wasn't getting paid what they're getting paid over there, either. You know, I don't fully disagree with the sentiment. What people say about Steve Jobs is that he was incredibly difficult to work with and work for, but the people that were bought in were *bought in*. They believed in the mission, were unified on his vision, and were passionate about it. So then it kind of doesn't feel like work; it feels like you're building something really cool. But that only works if you're building something cool. And I don't know if the people coming over from Windsurf were just like, "You know, screw Devin, who cares? I'd rather go work for Meta with the billion-dollar offers," you know what I mean? But yeah, if you're bought in, great. If you're not, then don't waste your time, you know? And don't work 80-hour weeks, because it's usually not worth it. Go find somewhere else where you're going to be happy, and live your life how you want to live it.
Yeah. Yeah, I don't know if you saw that, so I wanted to bring it up. I just thought, "What a saga they had."
I have to think that, you know, and you're up there in the Bay Area, I have to think that a lot of times people that are at those companies, they know it's going to be a ride. They're not there for the, you know, for the Dunder Mifflin, like, nine-to-five-that's-it kind of thing. They're there to be part of that crazy, bizarre culture that's up there.
Yeah.
Yeah. Like the HBO show.
Yeah, right. That's a great show. Awesome. All right, well, let's wrap it up with bookmarks. Brad, I'll go first. Mine—and I want to expand on this more, actually, maybe in a different episode—is about the change that the Google AI Overview has had on people's search results and click-through rates. For a long time, and you'll know this really well from being an indie hacker, Google SEO was the bee's knees. If you could get placed high with SEO, you had a better chance of people clicking on your page, and with people clicking on your page, a better chance of getting signups and so on. But now, if you do a search in Google, in almost all cases that I've seen, you'll get an AI Overview first and foremost in the search results. And below that, they'll have the actual links, like we're used to, us millennials and whatever your generation is. I guess you're a millennial too. The article, and I'll try to find the link, is from a person quoting publicly traded companies that are all talking about, in their most recent earnings releases and 10-Qs, how the Google AI Overview has had a meaningful negative impact on their conversion rates, click-through rates, and impressions. And it's interesting because, again, with Google, you wanted to rank in Google SEO. You always tried to find a way to get your keywords ranked, and it was kind of like a little game; people charge money in consulting to do that. And Google has kind of upended that with this new AI Overview, which I think people are still catching up to. Like, "How do I get placed in the AI Overview?" It kind of goes into AI search in general: "How do I get my company featured in an AI response when someone looks up CPAs in the Phoenix, Arizona area? How do I get mine to show up in that result?" And then the other article that's in the same vein, so it's two different links, is one someone replied to or reposted, whatever the right term is, saying there are going to be ads in people's ChatGPT responses. And that's going to be kind of annoying. Just like Google searches have sponsored pages and paid links, you know? Now there are going to be paid ads in ChatGPT.
I've heard about that for so long. Once ChatGPT comes out with ads, I think people are going to go bananas, but I assume they need to make money somehow.
Yeah. So it's two different links that are kind of in the same vein. I'll find them and put them in the show notes. But yeah, I thought it was really interesting. And the speed at which that's shifting is causing leaders of publicly traded companies to call it out on their earnings releases. It's not a slow trickle. It's, "Google turned this thing on, and we're seeing meaningful impacts in our financial results because of it." So pretty crazy.
Good for Google actually changing things. I think there's been a history of, like, Gmail being around and no one changing a damn thing. Glad their AI models are good and they're actually changing search a little bit. So hats off to them. Glad they're iterating a bit more than you'd expect from a big company. But yeah, for my bookmark, a little bit non-technical here. I was browsing Twitter and I found this tweet from Brandon Butch, and it caught me: "Apple had no business going this hard with the new 'Dreamer' ringtone." I thought, "Oh, that sounds kind of interesting." I don't know how to describe it on the podcast, but I'll leave the bookmark link. Essentially, he screen-records playing the ringtone in the Settings app, and part of it reminds me of the traditional iOS alarm sound, which really triggers me when I hear it. But they have like a remix hypno-beat on it, and it sounds amazing. And I thought, "Wow, why the hell are they spending time in the latest iOS 26 beta making this kind of remix version of something that probably triggers a lot of people because it's the default alarm?" But you know, that's innovative. And it has 5K likes and 250K views on the tweet. So maybe Apple has a new marketing stunt here. But yeah, if you're on the iOS beta, definitely check out the "Dreamer" ringtone. I'll link it in the show notes. Once you play the first 10 seconds, you're like, "Oh, this is kind of a banger." It's a great ringtone, and I'm excited that Apple's spending some effort to make a better beat for when I wake up.
Yeah. Oh, well, they definitely got their priorities in order because, you know, iOS 18 was universally received so well.
Yeah, I was listening to that. It is a pretty cool ringtone, though.
It is good. It is good. But it's not better than the Breakeven Brothers intro song. So just for the record.
Yeah, yeah, yeah, can't beat that.
Tell us what you think about the intro. I think Ben spent a lot of time on that and, you know, he put together the band and it sounds excellent. I hope people really, really appreciate it. But yeah, it's been a jam-packed episode.
Yeah, yeah, of course. Yes, it's me singing, so just to be clear. So, sorry, continue.
Yeah, I was going to say, today was a jam-packed episode. I thought we would spend a lot of time on GPT-5, and we did. So I'm excited. But for anyone using it, you know, please let us know where you think it shines and, conversely, where you think it kind of falls short. I know there's been a lot of expectations here on GPT-5, and we both held a pretty high bar for it. I think there are clear areas where it does well and where it doesn't. And we'd love to hear from you, you know, in the YouTube comments, on Twitter, on LinkedIn. You can find us pretty much in every spot. We'd love to hear where you think GPT-5 lands in the spectrum and what you find it most useful for.
Yeah, absolutely. Well said. All righty. Well, until next time, Brad.
See you later.
Until next time.
Thank you for listening to the Breakeven Brothers podcast. If you enjoyed the episode, please leave us a five-star review on Spotify, Apple Podcasts, or wherever else you may be listening from. Also, be sure to subscribe to our show and YouTube channel so you never miss an episode. Thanks, take care.
All views and opinions expressed by Bradley and Ben are solely their own and unaffiliated with any external parties.