Claude Opus 4.5 is a POWERHOUSE, here's our vibe check...


[JINGLE]

Alright, we are live. Back at it again, the Breakeven Brothers podcast. Episode what? Thirty-three?

Thirty-three. Live and locked in. I'm pretty sure it's 33.

Yeah, it's nice. It's already December. Oh my gosh. How was your Thanksgiving?

It was good. I had some turkey, had some good ol' mashed potatoes, stuffing. Yeah, I went up to my sister's place. Had some fun games there, ate some great food, and it was pretty chill, honestly. I didn't do much travel, so it was nice to relax and take a longer break. Work was definitely slow too, so it was nice to kind of check in with that. And another big announcement is the Android app is out now for Split My Expenses, but we'll get to that later. How was your Thanksgiving?

Yeah, it was good. We did a Friendsgiving with some friends who live in the Atlanta, Georgia area, about 30 minutes outside of the city. So that was cool. I'd been to Georgia once before, but it was for a work trip, which means you just stay in the hotel and don't really go anywhere cool. So this was cool to actually experience Georgia. And our friends were nice enough to have us stay with them the whole time, which was great. I mean, I'm definitely not an East Coast person. I think I was realizing that after being there. Because you know, your mind always wanders. You and I are both from the West Coast and have never really left. I'm in Arizona now, but that's the furthest we've really gone. So I don't know about you, but I always kind of wonder if it would be cool to live on the East Coast, you know, in New York or in Florida.

It would be cool. It would be very cool, actually, but I don't think "live," I think "visit." I've visited a lot of those places and I think Georgia was nice. They have a pretty famous aquarium, right?

Yeah, I think so. We didn't do that. But yeah, it's really pretty, you know, and it was cold, which was kind of nice because it's never really that cold out here in Arizona. So yeah, it was a great time. Ate lots of food, watched some football. But yeah, afterwards, I was ready to be home, back to my desert.

Did you pick up anything for Black Friday?

No, I don't think so, because we flew back Friday. So I wasn't really sitting on my computer or on my phone looking at stuff. I was in the airport, on the plane. So no, I wasn't really looking for any deals. I did go to Home Depot and Lowe's Sunday evening, but there were no deals. It was just, yeah.

I'm surprised there were no deals, because I went shopping at the outlets in Livermore. I think it was Saturday or Sunday, and I thought the deals were pretty good. I picked up this shirt, new shirt. I like it. I think it was like 50 or 60% off. I was looking to expand, but I kind of go in waves when I shop for clothing. I don't buy anything for like a year, and then I'll just go and buy three or four new shirts or two or three new pairs of pants, and that's it for the year. And I was actually very successful shopping in the post-Black Friday period that still had the sale. I was worried that there wouldn't be a lot of inventory in stock for my sizes, but yeah, it seemed totally fine. So I picked up a few pieces, which was really nice.

Because it's so hard to shop for clothes. One day when they come out with AI try-ons, which I think they already kind of do, but I've never done them yet, I would love to sit at home and swipe and be like, "Does that look good? Does that feel good?" Because half of it's fit and color. It always looks good on the model, and then you try it on and it's like, you know. So I would love a day where that could be better. You're spending hours and hours in these stores trying things on. It's fun, but it gets kind of exhausting.

You have that much trouble finding clothes that fit you?

Maybe not fit, but I think I only buy something if I'm like, "Oh, I really, really like that." I don't find that a lot. Like when I look at things on the rack, I'm like, "I don't know if that would really be good." So I have to try it on, of course. But yeah, I wish I had a better intuition for what is a good clothing item.

Yeah. Cool. Well, I don't know how we got to clothing, but let's... you mentioned something that was really cool that you had released recently. And that is the Android app for Smee.

Oh my gosh, so much effort. Honestly, the iOS app was out in the middle of November. Right now we're recording on December 2nd, which honestly is insane; time is flying. But iOS came out in mid-November, and Android came out less than a week ago, so at the end of November. And for viewers who are looking at the podcast or watching on YouTube, I'll try to show it on my iPhone, but you can see there are two Split My Expenses icons. The production one is this one. I don't really have any data since it's a test account. It's probably hard to see anyway. But there it is, new expense page. You can type in your description, your amount. This is all Android. This is a testing device I actually picked up from Back Market, which sells second-hand devices.

And yeah, long story short for Android is, damn, it was confusing. Using the Google Play Store to get my app uploaded, approved, and released took way too long. I don't even know how to describe the pain and trouble that I went through trying to understand Google's interface. So if you use Google Cloud, or GCP, or AWS, you know how confusing these kinds of conglomerate software platforms are. When I used the Apple App Store software for iOS, I do think it's confusing, but I've been in that domain before, so I know what I'm doing. Take me to Android, and I'm expecting a similar iOS setup, but it's completely different. It sits in the realm of just bloated software, and it took me many days to figure out how to do things that should have taken me maybe 10 minutes.

So the app is out there now. I sent an email to all the splitmyexpenses.com registered users and said, "Hey, go download the mobile apps." I also put out a Cyber Monday sale. So if you're listening to this this week, and potentially a little bit longer into next week due to marketing delays, there's a Cyber Monday sale for, I think, 29% off the yearly plan. So a lot of things are in the works and have come out, and I'm super proud of it.

But yeah, I wrote an extensive blog post on my blog. So if you go to bradleybernard.com, click on the blog section at the top, the latest blog post you'll see is about the journey to shipping on iOS and Android. And I go into a ton of detail on the pain and suffering and the journey of me moving from a web app to a mobile app, because a lot of people asked me, like, "Hey, why'd you even take that approach?" "What influenced you to actually make the mobile app after so many months of just having a web app?" So I won't go into it too much here, but it's in the blog post. I kind of talked to my computer for 20 minutes and this is the result of it. I added a bunch of screenshots, fluffed it up, and so I'm pretty proud of that blog post. But yeah, if you're an Android user and you hate Splitwise or you're looking for something new, I'm here for you. It's splitmyexpenses.com/ios or splitmyexpenses.com/android. That'll take you directly to the right app store. And I literally added that redirect 10 minutes ago, so I was really proud of that one.
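For anyone curious what a store redirect like that looks like, here's a minimal sketch using only Python's standard library. This is not Split My Expenses' actual implementation, and the store URLs are placeholders, not the app's real listings:

```python
# Sketch of a /ios and /android redirect endpoint. The target URLs below
# are placeholders for illustration, not real store listings.
from http.server import BaseHTTPRequestHandler, HTTPServer

STORE_URLS = {
    "/ios": "https://apps.apple.com/app/id0000000000",  # placeholder App Store id
    "/android": "https://play.google.com/store/apps/details?id=com.example.app",  # placeholder
}

def redirect_for(path: str):
    """Return an (HTTP status, Location header) pair for a request path."""
    target = STORE_URLS.get(path)
    return (302, target) if target else (404, None)

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, location = redirect_for(self.path)
        self.send_response(status)  # 302 sends the browser on to the store page
        if location:
            self.send_header("Location", location)
        self.end_headers()
```

In practice the same thing can be a one-line rule in whatever web framework or reverse proxy the site already runs on; the point is just a server-side 302 from a memorable path to the store page.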

Nice, nice. I'll be curious to see what the relative downloads are per platform. Like, what's Android? How many people are on Android, how many people are on iPhone? I feel like iPhone is losing its, you know, its grip on the market a little bit in the last couple of years. I don't know if that's just a hot take or an anecdote. It's how I feel.

Dude, I've thought about switching, but I don't want the green text bubbles, you know. But if it wasn't for that, I'd be really...

Yeah, I can tell you. So far, the app was released on November 13th for iOS and has averaged probably 20 downloads a day. But I've had peak days where I think the first day was 45, and then the day I sent an email out was 126. So roughly 600 installs from the middle of November to December. I think my entire user base is sitting around 19,000 registered users. But the one unfortunate part, which kind of sucked, is I sent out this email describing that the mobile apps are out, dark mode is here (which is a big release on web), and the Cyber Monday deal. But it actually went to spam, which was super frustrating.

Oh, that stinks.

I sent it out, I reached out to people, and they said, "Hey, I didn't get anything." And I checked Postmark, the service I use for sending email. I have a Google account registered with my own service, and when I checked that Gmail inbox, the email had gone to junk. And I was like, "Damn." You know, I'd spent so much time working on everything behind the scenes, I thought, "Oh, my emails will be fine." I don't really know why, so I'm going to audit that this week or next and see if I'm missing some email authentication DNS settings, like SPF, DKIM, or DMARC. But clearly telling people about it will increase downloads, which is not a novel fact. But that's something I'm going to do a lot more in December: talk about the app, try to get it out there, do marketing, SEO, Twitter, you name it. But they're out there now. About 500 or 600 for iOS. For Android, since it's so new, I'm not even sure if it gives me any numbers, but my guess is much lower. I think I have a lot fewer users there. I'm pulling up the Play Store. It looks like 120 installs since November 28th, I think it is, or the 29th. So yeah, not many days.
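For context, the "security DNS settings" for email deliverability are usually three TXT records: SPF (who may send for your domain), DKIM (a signing key), and DMARC (policy for failures). The values below are illustrative placeholders only; a sending service like Postmark provides the exact records for your domain:

```
; Illustrative zone-file entries. Real values come from your email provider.
example.com.                       TXT  "v=spf1 include:spf.your-esp.example ~all"
20251201._domainkey.example.com.   TXT  "k=rsa; p=MIGfMA0GCSqGSIb3..."   ; DKIM public key
_dmarc.example.com.                TXT  "v=DMARC1; p=none; rua=mailto:dmarc@example.com"
```

Missing or misaligned versions of these are one of the most common reasons a legitimate bulk email lands in Gmail's junk folder.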

That's pretty good.

Yeah, right. Again, a lot of work to do to get that higher, but we're making progress.

It's new. Yeah. And it's your first—it's your first Android app, right? Because you've built other iPhone apps. So yeah, right.

Yeah, and I totally can relate to the Google Cloud Platform being confusing as all can be. Because for a while I was using the Google SDK, we've talked about that a little bit on the pod, and of course they want you to host things on Google Cloud. And so I've gone down that path, tried it out, and it's just way over my head.

Yeah, it's just too many options. If you know how to use it, it's easy. If you're a newcomer, which everyone is at one point, it's really overwhelming.

Yeah. Yeah, it's too much. Well, congrats on the release. Split My Expenses, you can find it everywhere now, not just on the web, but on iPhone and Android. And there's a free plan too. There's a competitor that doesn't have a free plan. So if you need a free plan just to dip your toes in, Split My Expenses has got you. Don't pay attention to the other one. We won't say what it is, but yeah, you gotta talk your stuff. You gotta talk your stuff.

Yeah, no, I appreciate it. But outside of that, there has been so much to catch up on in the AI world. I feel like Ben and I have been talking to each other like, "Oh my gosh, this, that, and the other." And I think last time we talked on the pod, it was expectations for Gemini 3. Literally two days after the podcast was posted, Gemini 3 was released, and this is Gemini 3 Pro. So there's no Gemini 3 Flash or Flash-Lite, which are kind of the cheaper, more cost-effective models; they came out with the big one first. Which I think they do normally; then they distill it and make it smaller. And they also came out with Nano Banana Pro, which was their pretty incredible image model. So if you take a look at the leaderboards for LM Arena, I believe Gemini 3 Pro is number one on text, number one on vision, number one on text-to-image, and number two on web dev.

So I think the big categories are text, you know, general chatting with your AI assistant; two, vision, so how well it understands images, OCR, etc.; and web dev, can it write code for you that's, you know, production-ready. And taking first place in two out of those three, potentially even three out of four if we count text-to-image, is really impressive stuff. There was a lot of hype around this release. People kept saying it changes the game. I think it delivered. I think it did a really good job. The price is pretty fair. They've had pretty good capacity to serve the model. I think it was even out for free for about a week through Cursor and people picked it up. So yeah, any reactions from you on Google's latest release?

Yeah, I mean it was funny that it came out right after we were talking about it. Because I think I texted you like "emergency pod sesh." Half joking, but yeah, I mean it's been pretty phenomenal. I'm trying to think of some use cases that I've used it for where I had to really kind of push it. Like I had some tax questions I was working through with someone. And it wasn't... you know, when I use it for accounting, I don't like to use AI to give me the answer, but it's really good at finding the general chapters of the guidance that is relevant to what you're talking about. Or it can be really good for looking up court cases where there's precedent. If it's like an IRS court case and the taxpayer is trying to do something, a lot of times the tax courts will rely on a precedent. So you can kind of go look at the court cases, understand the conclusion from that court case, and that kind of tells you what the position will be if you try to do something similar. So I like it from that standpoint, and it was really good. I had it do deep research on one of those issues.

I think for coding, I used it a little bit, but really basic. I wasn't doing anything crazy. I think I was just making some changes to my website for my business. And it was great because I think a couple of weeks before I was using ChatGPT or whatever to get my website looking the way it was. But then when I was doing some rebranding and kind of changing some things around, I was actually doing this over the Thanksgiving break a little bit. And I was using Gemini 3 Pro, and it was really good. It definitely, I noticed, took a little bit longer on some things and did a little bit more of the thinking. I had the thinking mode enabled. But it was really robust. Like one of the examples, and this is not a technical flex at all, but you know, I had a pricing page. And a lot of times with accounting, you have packages, right? Like, "Okay, this is my mid-tier package and these are the services you get. This is my premium tier package," right? And I told it, "Hey, these are the services I offer, this is the pricing. Can you put this onto my website? Can you just put it on my pricing page?" And I thought in my head it would just put those prices and those services in the static HTML, but it built a whole different FastAPI route that served the packages in a JSON format. So it kind of, I don't want to say over-complicated it, but it did it in a way that I'm sure is better from a technical perspective. But I was like, "Oh, okay, I didn't tell you to do that. I thought you were just gonna put it in static HTML." But yeah, I was impressed with it and I did see all the stats too. When it first came out, people were over the moon about it.
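The pattern Ben describes, moving pricing data out of static HTML into a JSON endpoint, looks roughly like this. This is a stdlib sketch, not his actual code; the tier names, prices, and route path are invented, and the FastAPI shape is noted in a comment:

```python
import json

# Hypothetical pricing data for illustration: the real tiers, services,
# and prices on his site are different.
PACKAGES = [
    {"name": "Mid-Tier", "price_usd": 500, "services": ["bookkeeping", "quarterly review"]},
    {"name": "Premium", "price_usd": 1200, "services": ["bookkeeping", "tax filing", "advisory"]},
]

def get_packages() -> str:
    """Serve the packages as JSON instead of hard-coding them in the page.

    In a FastAPI app, this body would sit under something like
    @app.get("/api/packages"), and the pricing page would render the
    response instead of duplicating the data in static HTML.
    """
    return json.dumps({"packages": PACKAGES})
```

The upside of this structure is that the prices live in one place, so a later price change touches the data, not the markup, which is presumably why the model chose it unprompted.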

Yeah, talking about the stats, I think they were top of nearly everything that was relevant, but there is an agentic coding benchmark that they weren't top of the list for. And that one was highlighted as, like, is that even a relevant benchmark? Should that be the sole measure of model quality? Because I don't know about you, but I think Gemini 3 Pro really showcased good design skill. I think the iteration from their previous models to this one creates a really, really good UI. And that means a lot for a lot of people, especially the "vibe coders" out there who want to create something but don't have that design skill. A lot of people have the idea, but they don't know how to write the code and they definitely don't know how to design it. So I think Gemini 3 Pro did a good job baking more design skill into the model. So yeah, a pretty impressive scorecard, topping a lot of the benchmarks. And even though they weren't top of the benchmark for agentic coding, so to speak, which I think is SWE-bench, they were very, very close to it at the time. And then a bunch of new models came out.

But before we move on from their exciting day, they also had Nano Banana Pro, which I actually used. I won't describe which thumbnail, but there might be a thumbnail out there that might have used it. And if you can put those pieces together, you might find it. My experience with Nano Banana Pro was I was trying to get this thumbnail. I imported said previous thumbnails and asked it to make a new one. The first image it generated from my prompt looked really good. Then I asked it, "Hey, can you change this facial expression to do this?" And the second image it generated looked different. The faces that came out were definitely different. It didn't look like the people that were originally there. So I had to reset and go back to the original chat where I'd just uploaded the sample thumbnails, and make my prompt more directed, folding that second message into the initial prompt. So I think it was pretty damn good, pretty fast, and I feel like it would unlock a lot of marketing things. Like if you're working with images and you need to create mockups or different product placements, it seems like Nano Banana Pro would handle 95% of the use cases that pop up. So my experience was cheap, fast, with a little bump due to it losing faces, but that was easy to fix, so not a big deal.

Yeah, no, I think Nano Banana Pro is what... I think that's what they call it, right?

I think so.

Yeah, was probably the most exciting thing that they've released recently in my opinion. Like Gemini 3 is obviously super impressive, but it's, you know, a step forward in model intelligence; there wasn't anything novel it could do that the last version couldn't, it just does those things better. But Nano Banana Pro... you know, it's funny, when I was in Atlanta for Thanksgiving, we were talking about AI and using ChatGPT to do photos, like text-to-image, right? And everyone's like, "Oh, well, you can tell those are so fake." Like, they're getting better, but it's easy to tell and spot a fake. But the people I was talking to hadn't used Nano Banana Pro yet or hadn't seen it. And so I said it's much more difficult than you think now with this model. And so we did a little game of "Is it AI or not?" And there were some that really fooled people, because in my opinion the realism is just night and day compared to their main competitor in this area, which is ChatGPT. It can make things look super realistic, and people are getting super good at the prompting and stuff like that.

That's a great idea. When you mentioned that, I thought we should create "AI or Not", or like imageaiornot.com, and just see what real people vote on, because I think it'd be funny and we'd just burn like $500 of Nano Banana Pro credits to create these side-by-side images.

Yeah. That'd be a good test. I think it has gotten sadly so realistic. The good news is that Google did come out with their, I was just looking it up, SynthID. So this is a watermark that Google puts inside the image such that they can tell if this image was AI-generated. So SynthID, a digital watermark, pretty, pretty cool software. I think it's going to be very useful in the future when you have images uploaded to various platforms, if they can determine if it's AI-generated or not. But at the end of the day, a lot of the times it's a human looking at it, and I can tell you I've been fooled many times by all these pretty robust image generation models and even the video ones, honestly, are getting to a point where I'm like, "I don't really like this." I'm so confused. I'm not even sure what to think.

Yeah, and then the other thing I've noticed with Nano Banana Pro, because you're talking about marketing, you know, it's actually going to be a foreshadow into my bookmark. But what I've noticed, and what other people online have noticed too, is that it's way better with text. Like you can create brochures, or a fake food menu, for example, if you have a restaurant business. Before, ChatGPT would always kind of get the text wrong; you could tell, it just looked like alien language almost. But now it's gotten really good at keeping the actual text intact and having it make sense. And so I've seen people say, and I haven't done this myself, that you can use it in marketing materials, like flyers and brochures. AI has recently been pretty good at making nice brochures and pamphlets and all that kind of stuff, but the text was always a problem. Now it's like, okay, the text is there, we have the design, we can actually use this in a marketing workflow kind of thing.

Yeah, the alien text is such a tell. If you've worked with any AI image generation, it's just not even real words, just some weird slanted font. And I didn't do much text. Well, I guess I did a little bit, but it was pretty straightforward; I didn't give it anything complicated. But I did see a bunch of examples on Twitter of that brochure style, pretty complicated text rendering, and it pulling it off in a pretty legible way. So hats off to Google for that one. I think those are two of the big unlocks for an AI image model: making sure inputs, faces for example, are preserved, and making text actually usable and legible. Having both of those at a cost-effective price is a really compelling argument. I wish I did more with marketing. Maybe I will with my apps, but I think it's a good time for developers too, because you can create a mockup of what you think you want using Nano Banana Pro, then send that to Claude, and then you're off to the races. So very, very impressive from Google.

Yeah. Yep, they keep cheffing, they keep cooking. Um, what else came out recently that we should chat about?

Yeah, Codex Max. So we won't go into this one in too much detail, but I'll give a high-level summary for those interested. GPT-5.1 came out. With that, they had this kind of flexible thinking budget we've talked about: GPT-5.1 low, GPT-5.1 medium, and GPT-5.1 high. So this is kind of the prescribed thinking budget to make good responses. Based on that, they created what's officially called GPT-5.1-Codex-Max. And the reason it's called Codex Max is that Codex is their CLI tool. This is their agentic tool to write and run code, interact with your computer, so to speak, similar to Cursor. And that "Max" at the end of the model name is like the maximum thinking budget optimized for an agentic harness. So what that means is it's built for long-running tasks, uses the context window much more effectively through compaction, and is able to just run, hopefully without any interruptions.
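Compaction, at a high level, means folding older conversation turns into a summary once the context budget gets tight, so the agent can keep running. Here's a toy sketch of the idea; the one-token-per-word estimate and the truncation "summary" are crude stand-ins for a real tokenizer and a model-written summary:

```python
def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly one token per word.
    return len(text.split())

def summarize(messages: list[str]) -> str:
    # Stand-in for asking the model to summarize older turns; here we
    # just keep the first few words of each message.
    return "SUMMARY: " + " | ".join(" ".join(m.split()[:3]) for m in messages)

def compact(messages: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """If the transcript exceeds the token budget, fold everything except
    the most recent turns into a single summary message."""
    total = sum(estimate_tokens(m) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(old)] + recent
```

A real harness does this with the model itself writing the summary, but the shape is the same: trade old detail for room to keep working.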

So the GPT-5.1-Codex-Max release came, I think, one or two days after Google's release, which I think I even called on our previous podcast: Google's going to ship, OpenAI's going to ship the day after, and that's exactly what happened. And I think Codex Max is a great model. It's not the top of the leaderboard; if I go take a look, it's maybe five or six on text, and on web dev it's also around five or six. So it's not top three, which is held by Anthropic and Google, but it's still really good. I think for the most part, it depends on the type of problem you throw at it. But this model is just an upgraded agentic model that's used mostly in Codex. It's trained on PRs, code review, front-end coding. A very coding-specific model that excels in their Codex CLI. That's the release, nothing crazy, I would say. I think it does a good job within Codex, but when you look at something like Cursor, which can use any model, it doesn't win out on the leaderboard. And that kind of shows you the raw model intelligence. Either you go for something really fast, like we talked about with Composer-1, or you go for top-of-the-line intelligence. That whole middle ground in between, where you're not fast and you're not smart, is the graveyard. And I think Codex exists on the very smart side because they're top five, but anything below that, yeah, it's pretty tough to get developers and everyday users to use your model.

Mmm-hmm. Cool. And then we have Claude Opus 4.5, and my goodness. Incredible release. Hands down, I think Anthropic's biggest model shipped in quite some time. So the story was Claude Sonnet 4.5 came out maybe a month or two ago. Before that they had Claude Opus 4.1. And the long history of Cursor was that Opus was so good, but people were just abusing Cursor, so they added rate limits. That means you couldn't really use Opus that much. Then they came out with Sonnet 4.5. It was what I'd describe as really fast and really smart, but not the ultimate intelligence. So I personally thought that the earlier Opus model was better than Sonnet 4.5. By the numbering you'd think Sonnet would be better, but when I was using it in coding, building my mobile apps specifically, I felt like it wasn't there.

Then, you know, two weeks ago, my Cursor subscription is coming up for renewal. I think, "Oh damn, Gemini 3 is out, Codex Max is out. Should I be switching from Anthropic to Google or OpenAI?" But as Anthropic always delivers, they came out with Claude Opus 4.5, and the first day I used it, I thought, "Holy shit, this is really good." And I've been singing Anthropic's praises since the end of May when I first touched Cursor. I truly think this model is the end-all be-all in terms of coding within Cursor. So I only use Claude Opus 4.5. I'm not touching other models anymore, because I already pay $200 a month to use the Anthropic models and I was this close to canceling and going on the Codex train. But this release, I think, was perfectly timed and delivered just the right amount of increase in intelligence and capabilities. It was extremely worth it, and it produces high-quality code.

And I could talk about Opus for a long time, but one of the major examples, kind of like how you described Gemini refactoring your code: I asked Opus to do a similar thing, where I said, "Hey, could you please implement this feature in React Native?" And as it implemented the feature, it said, "Hey, I also noticed all this unused code in this file. I'm going to implement the feature, clean up this file, and refactor it all at the same time." And I thought, "I didn't ask for that, but if you're so good at coding, you would recognize these things." Like if you're an engineer working in a file, maybe you're not modifying the entire file, but you'll take a look and say, "Hey, is there anything else I can do while I'm in this area?" And Opus, I know it's a small example, but it really delivers that kind of change. Cursor with Opus is just this incredible value beast that delivers really high-quality code pretty darn fast. And the $200 a month package is basically unlimited. So I used Opus extensively to get my iOS app refactored and polished up and ready to go, and the same for Android. It's the same code base, so I used it a ton to crack hard problems, and even problems that Sonnet 4.5 could not fix, Opus did fix.

So yeah, again, I could talk all day about it, but if you're listening, there's one takeaway at least from this episode: if you're on Cursor, use Claude Opus 4.5. The sad part is they've changed their pricing model, and Opus 4.5 is only, I believe, on the $200 a month plan. So they know it's good, they've paywalled it, and honestly, it probably works out for the people who are paying, because then they're not hitting the limits like they were previously. So yeah, if you want to give it a run, it has my full blessing. Claude Opus 4.5 is a stellar model and to me stands out across all three of the releases we just talked about.

Yeah, it seems like they got a good... I haven't used it. I think I've said over and over by now, I'm not an Anthropic person, but I did see a lot of the discourse on Twitter saying that Opus was good, like it was pretty solid. I mean, it seems like they really have their footing in the premium coding model, you know, above all else.

Yeah, that's their niche, you know. So, good for them.

Yeah, maybe, maybe Ben's Christmas present is a Cursor Max subscription.

Don't waste your money, dude. Don't waste your money. It's just... it's just not for me.

We need to convert him.

No, I had it. I had the $20 a month plan, never used it. You know, Cursor, I think at the time, I don't know if it's still the case, but it wasn't supported on Windows. And they wanted me to do the Windows Subsystem for Linux thing. I was like, I'm not doing that. Like, miss me with that.

It's worth it.

No, for my website, for my stuff, it's not. But I can, if it really, you know, if you have some complex application that you're building...

Yeah.

...more power to you. And I think that just goes to again, like their customer base is going to be those really die-hard hardcore programmers. Whereas, you know, your mom or your sister is going to be using ChatGPT for their more day-to-day stuff or Gemini for the more day-to-day stuff. And maybe that's just where the camps divide, you know. So they'll have the developers paying lots of money for all that stuff.

Yeah, I think they truly have a niche in that area. And the last thing I'll say on Claude Opus 4.5 is that oftentimes there's a lot of discourse about the graphs that show up when they release these new models. On the latest one from them, there's SWE-bench Verified, which is a benchmark for coding models, and Claude Sonnet 4.5 gets 77.2% on the graph while Claude Opus 4.5 gets 80.9%. So roughly a 4-point difference. Not a huge difference. But when we plot these graphs and create these charts, it looks like a massive difference, because everything is so close. Ben has talked about this multiple times on the podcast: "When is it too much?" or "When do we get to a point where coding is good enough that I don't need to think about some of these things?"

And based on these graphs, we're increasing on these benchmarks, which are never perfect, by about 4 points. But what that actually translates to is a lot more than that. The vibes are right on Claude Opus 4.5. The score on SWE-bench Verified is only about 4 points higher, but it feels like a lot more. And again, it could be partly vibes, could be partly benchmarking. Anthropic has a really strong position with developers and coding agents. But I think it goes to show that between these percentage increases, there's a lot more to it. And I don't love these graphs where Claude Opus 4.5 is at 80.9% and something like Gemini 3 Pro is at 76%. By the chart, Gemini 3 Pro is worse than Claude Sonnet 4.5, but people were loving Gemini 3 Pro more than Sonnet 4.5, so to speak. So there's a lot more that goes into it, and I think we've gotten to the point where it's hard to tell from a chart whether a model is good. That's why you use it. When I started using Claude Opus 4.5, within the first three hours I knew, "Okay, this is it." Same thing that happened with Cursor when it first came out. When I was using the original Opus model, I thought, "Damn, this is good." It knows exactly what I want to do, no more, no less, and it does a great job. So yeah, pretty interesting that we're getting to the point of very small gains on benchmarks actually equating to a lot in terms of model performance and overall vibes. So I'd love to see what Anthropic comes out with next, you know, Claude 5, but I imagine they'll do a pretty good job. And if there's anything about Anthropic, it's that they ship a very linear performance increase. There have been many graphs, I think we've talked about them on the pod too: Anthropic's steady releases follow a straight-line curve of model performance. So hopefully they ship their next model before the end of the year and, you know, maybe it'll make Cursor even better.
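One way to see why a roughly 4-point gap can feel bigger than the chart suggests is to flip the quoted scores into failure rates: going from 77.2% to 80.9% solved means about 16% of the previously unsolved tasks now pass. A quick sketch of that arithmetic:

```python
# SWE-bench Verified scores as quoted above.
sonnet_score, opus_score = 77.2, 80.9

absolute_gain = opus_score - sonnet_score             # ~3.7 points on the chart
failures_before = 100 - sonnet_score                  # 22.8% of tasks were unsolved
relative_reduction = absolute_gain / failures_before  # share of old failures now solved

print(f"{relative_reduction:.1%}")  # ~16.2% of previously failed tasks fixed
```

This is the usual reason small absolute gains near the top of a benchmark "feel" large in daily use: the remaining headroom is shrinking, so each point removed is a bigger slice of what was still failing.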

Yeah, I wonder if it's their focus. And I don't know this, to be clear, I could be talking out of my rear end, but they don't have a text-to-image model, right? They don't have video.

I don't think so.

Like they're not as multi-modal as Chat and Gemini are. So maybe that's part of their advantage too. When they release, they're just focused on, you know, again, their special area. They're not worried about having an image model and a video model and, you know, a browser tool and XYZ, right? They're just kind of like, this is what we're doing. We've got Claude Code, we're working on the model, this is for engineers, you know. I think that focus helps; sometimes it's better to have a few things you're focused on rather than 20 different things that you can't really tackle and execute on all at once. So maybe that's what has enabled them to be so successful.

I never thought about that, but yeah, that totally makes sense. And they released a work report. I think this was actually today, December 2nd, titled "How AI is Transforming Work at Anthropic." One caveat: they surveyed their own Anthropic employees. The TLDR is that their AI is extremely powerful at reducing headaches and burdens. I'll just read maybe three headlines from the key findings section of the report that I agree with.

So, one: employees self-report using Claude in 60% of their work, achieving a 50% productivity boost, which is a two-to-three-times increase from this same time last year. And they say this productivity looks like slightly less time per category, but considerably more output volume. So we're doing more with the same amount of hours we have, is how I read that. Two: 27% of Claude-assisted work consists of tasks that wouldn't have been done otherwise. I feel this all the time, where I'm working on my code base and I think, "I don't want to refactor that, but Claude can, and I will." I feel strongly about that; it just makes things better overall. And again, we talked about Claude being a beast of a coding model, and this is exactly what's being reported. And then three, on delegation: they say most employees use Claude frequently while reporting they can fully delegate 0 to 20% of their work. So they've done a good job of making Claude good at collaborating with you, while also letting you hand over the reins and trust that it knows you, understands your intent, and does what you expect. Pretty impressive. But it is Anthropic talking about their own software and interviewing their own engineers, so with that comes, I'm sure, a lot of bias. Still, I feel strongly about Claude Code doing well in the software world, and reading some of these headlines and stats does line up with how I feel personally.
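To make the "same hours, more output" framing concrete, here's a hypothetical back-of-the-envelope sketch. The 60% and 50% figures come from the survey as quoted above, but the 40-hour week and the way the numbers are combined are my own assumptions, not Anthropic's methodology:

```javascript
// Hypothetical sketch: same hours worked, more output produced.
const hoursPerWeek = 40;   // assumed, not from the report
const claudeShare = 0.6;   // survey: Claude used in 60% of work
const boost = 1.5;         // survey: 50% self-reported productivity boost

// Output measured in "baseline hours" of work produced per week:
// non-Claude work proceeds at 1x, Claude-assisted work at 1.5x.
const effectiveOutput =
  hoursPerWeek * (1 - claudeShare) +   // 16 "hours" at baseline speed
  hoursPerWeek * claudeShare * boost;  // 36 "hours" from the boosted share

console.log(effectiveOutput); // 52, i.e. ~30% more output from 40 hours
```

Under these assumed numbers, the same 40-hour week yields the equivalent of 52 baseline hours of output, which matches the "slightly less time per category, considerably more output volume" phrasing.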

Yeah, I think, and to be fair, somewhere in the article, they talked about some of the negative things that engineers or the survey responses kind of highlighted. They're not trying to just paint only a rosy picture. They talk about how they talk to their coworkers less because they just go to Claude for a question or two. That it changes the career trajectory a little bit; in the short term, it's great, but in the long term, what does it mean? So as much as it is obviously a biased piece, they do balance it a little bit with some things that are maybe contra to the AI hype train. But yeah, of course, it's still very much pro-AI.

The thing I was thinking about too was, so they put this out, and a couple of months ago we talked about an article from, gosh, MIT, I think it was, about how organizations weren't finding any use out of AI. And of course you would expect an AI company to have better success implementing AI at their own company. But to me there's still a big gap. When I talk to people anecdotally in the accounting community, a lot of people aren't using AI for meaningful delegation of work. Yeah, it can help you craft an email, or, like I did, you can chat through something and get some guidance. But as far as doing meaningful work, the infrastructure is hard to put into place, and there's probably a technical bar that's still too high for people to clear on their own. But yeah, if Anthropic is seeing success with their employees using it, that's great, you know.

I'm glad you called out that negative part, because as I was looking more at the article, or the survey they released, one part does stand out about reduced social interaction. There's a quote in the article saying, "I actually don't love that the common response is, 'Have you asked Claude?' I really enjoy working with people in person and highly value that." So it almost feels like the old "Have you Googled that?" It used to be, "Hey Ben, do you know how this works?" "Hey, have you Googled that?" Maybe now it's, "Have you asked chat? Have you asked Claude?" That's what happens when these tools are so good and, you know, better than the average person, or just have the tools and know-how to get things done effectively. So definitely a bit of a negative, and I think software is changing, which is where Anthropic is focused. Not all of it is positive. Clearly that's a negative, and people like working with people for the most part. And seeing something like that become the default answer, where they don't talk to you anymore, they talk to Claude... not the best, you know, for sure.

For sure. Yeah. Brad, should we hit the other two items that we want to talk about with Anthropic real fast, like as a speed round? Because I want to make sure that we're good on our time.

Okay. So let's do this one. So today, as of recording on December 2nd, Anthropic acquired Bun, which I don't know much about. Bun, I think, is a JavaScript runtime environment. Is it like Babel or Webpack, one of those things?

Yeah, you're pushing my technical abilities here, but I'll try my best. Bun is a JavaScript runtime similar to Node.js, but faster. That's the best way I can put it: you can run JavaScript anywhere Node would run, but it runs a hell of a lot faster. It's also not only a runtime but a bundler and package manager too. There are a lot of complex terms in JavaScript land, which I know you love. But it's basically an all-in-one ecosystem. It can run JavaScript, it can do bundling, and where you'd run `npm install`, you can run `bun install`. If you've used Vite, V-I-T-E, the dev server, there's a Bun dev server too. So essentially anything you do in JavaScript land, toolchain, dependencies, running code, Bun can do it all. And it's really performant. There's a famous chart comparing the performance of different runtimes, and Bun is up there with Go, which is a compiled language. So Bun is very, very efficient. And yeah, that's my segment of technical bits. I'll hand it off to you.
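To make the "drop-in Node replacement" idea concrete, here's a minimal sketch. The same script runs unchanged under either runtime, and the command mapping in the comments lists a couple of common equivalences for illustration, not an exhaustive feature comparison:

```javascript
// hello.js, runs under Node.js (`node hello.js`) or Bun (`bun hello.js`).
//
// Common tooling equivalences (illustrative):
//   npm install        ->  bun install
//   npm run <script>   ->  bun run <script>
//   npx <tool>         ->  bunx <tool>

// Bun exposes a global `Bun` object; Node.js does not, so this
// one check tells us which runtime is executing the file.
const runtime = typeof Bun !== "undefined" ? "Bun" : "Node.js";
console.log(`Hello from ${runtime}`);
```

The point of the sketch is the compatibility story: you don't rewrite your scripts to try Bun, you just swap the command you launch them with.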

Cool. Yeah, no, I mean, I was curious. I don't know, like I'm sure there's a reason for it. They know what they're doing, but it was a surprising move because I hadn't... like why would an AI company want to acquire a JavaScript runtime company? I'm just confused on that.

Yeah, I don't know what the reason is.

Yeah, I don't know either. I mean, there was a lot of discourse on Twitter about why they would do that, so there wasn't even good agreement there. But my thoughts are, as you mentioned, Claude and Anthropic are just going after developers. Bun is a faster way to do things as a JavaScript developer, and JavaScript and TypeScript are among the most-used languages across all software systems. So the best bang for their buck is to hire these very, very talented, smart engineers who work on Bun, and get them working on Bun and Claude Code to make things better and faster. I don't know the exact angle, but they're collecting the right kind of developer-experience folks, and I'd consider Bun part of that. So hopefully they build some cool stuff, make Bun faster. Who knows? I'm excited to see what they do with those folks.

Yeah, the top comment on the post on the Claude AI subreddit was, "Why couldn't they code a better one with Claude?" Which is a fair question.

That's good. I haven't seen that.

Yeah. So, but anyways, we'll move on for the sake of time. The other big news or rumor or what do you want to call it, is Anthropic might be preparing for something. What do you think on that?

An IPO for 2026. So I had seen articles about OpenAI prepping a one-trillion-dollar IPO for 2027. Is that actually happening? I don't know. I have no comments and no idea about how that whole process works, but one trillion is a lot of dollars and 2027 is not that far away. So I was surprised to hear Anthropic is aiming for a smaller valuation; I think it was targeted at $500 billion for 2026. So definitely speculative, rumor mill, you know, etc. But I don't know. I think all this AI money gets shifted from chipmaker to software company, back to chipmaker, to the clouds of the world. So I don't really know who has the value and who's here to stay. Anthropic has my money, and it probably will for some time, for Claude Code. But yeah, it's tough. I don't know the long-term hold on these things. They change so fast; every week there's a new model. So good for Anthropic for preparing for this. If there's anything I know about software company IPOs, it's that they are never on time. So if I hear 2026, I'm thinking 2028. If I hear 2027, same thing, probably 2029 is my guess.

Yeah, in the article they also mentioned that Claude Code, six months after becoming available to the public, has reached $1 billion in run-rate revenue, which I'm assuming is close to an ARR number from the way they're talking about it. So yeah, that is impressive, for sure. That's a lot of money in a space that's super competitive and expensive when it comes to the right talent, right? And all that kind of stuff. So...
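For what it's worth, "run-rate revenue" usually means the latest month's revenue annualized, which is close to, but not quite the same as, ARR (which counts only recurring subscription revenue). A hypothetical sketch with made-up numbers:

```javascript
// Hypothetical: annualizing one month of revenue into a "run rate".
// The $84M figure is invented for illustration, not from the article.
const latestMonthRevenue = 84_000_000;

// Run rate: assume the latest month repeats for twelve months.
const runRate = latestMonthRevenue * 12;

console.log(runRate); // 1008000000, i.e. a "~$1B run rate"
```

This is also why run-rate figures can flatter fast-growing products: one strong month gets multiplied by twelve.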

I mean, it's one of those subscriptions I pay for, and I only feel bad about it if a new model comes out that's not in the Anthropic family, and I'm like, "Oh no." But then Anthropic comes within the next, you know, two weeks, three weeks and delivers. So there have been times where I've considered, but as I look back on it, I never falter.

Mmm-hmm. Yeah. Cool. Well, we'll see. Okay, last item we should hit is what is going on with a code red? "Code red" generally means something bad's going on. What's going on with code red and who is it about?

Yeah, so OpenAI declared a code red, I think yesterday, December 1st, and this was after Gemini 3 came out. So Google came out with Gemini 3 Pro, and they also, I think, gave a free quota for it in the Gemini app, which is their AI chat app. And I think the article mentioned that ChatGPT was losing 6 or 7% of its monthly active users ever since Gemini 3 came out. So, as we've talked about, there's the AI flywheel of getting people on board with chat apps, having them talk to you, getting their data, building this whole data repository on people... Google already owns that in search. But now people are moving to AI. If Google has search and they have AI, they've cracked it. They'd have everyone's data, and, you know, we're in for a tough future.

But I think OpenAI is trying a lot of different things. We have Sora, we have ads, we have you name it; they're kind of working on everything. And declaring a code red, to me, is Sam saying, "We have a lot of stuff going on. We have a lot more competition than we expected. Let's go back to our roots and polish up ChatGPT, make sure it's the best-in-class chat experience." Because the majority of their money is ChatGPT revenue, which is expected to grow over time to justify that valuation. There's the API side, where they serve their GPT models, but you could argue that doing too much spreads you thin. And maybe, just maybe, with all this competition and so many projects, a code red can realign them to ship their core products, improve those, and retain more users over time. So that's my thought.

Mmm-hmm. Yeah, as I was looking at the article too as you were talking, they mention Google, and in the article they say that Google has had monthly active users increase from 450 million in July to 650 million in October, which is like, you know, two, three months, depending on how you slice it. And that's a pretty big jump. And yeah, it's interesting because I think with OpenAI, I feel like they were the first ones to kind of get a really enterprise kind of vibe going. Again, just from my own experience, but you know, Google Workspace is in a lot of companies now. And so as Gemini gets better and better, it just seems like it'll be a natural fit to kind of slide right in there.

Yeah. I think it's a good and a bad thing. The good thing is that OpenAI is recognizing they need to refocus. The bad thing, in my personal opinion, is if Google wins this AI race: Google has Search, Google has Maps, Google has YouTube, Google has Waymo, Google has AI. I'm not sure I love that. I love the competition in the AI model space, but I don't love one company holding everything in that basket. So that's my personal opinion: I love Google, I think they're doing a great job and they have great models, but I don't know if I want them crushing it so much that nobody can catch up. We're not at that spot yet, but I do have a slight fear about it.

Well, not even just that. And I'm less knowledgeable in this area, just to be super clear, but I know Google is really working on competing with Nvidia on chips too, right? They have their TPUs versus Nvidia's GPUs, and the progress seemed so strong that Nvidia put out some weird statement, like, "Oh, you know, we see what they're doing, but we're still the leader." And it was one of those statements everyone was clowning on, because why would you even feel the need to say that if that was really how you felt? So yeah, we haven't really talked about that, but if their TPUs are significantly better... because ultimately, and this is where I go back to Anthropic, I don't think Anthropic has a lot of staying power, at that valuation at least, because they don't have the infrastructure. They don't have the chips. They don't have these giant data center deals, right? To me, the players are so big now.

Being big doesn't mean you'll be successful, though. I don't know if Meta has really turned that into something successful. We talked about all the crazy compensation packages a couple of months ago, and I don't know if they've made significant progress yet. So it doesn't mean the big players will, by default, do better than Anthropic. I think Anthropic has made a real solid product and carved out a niche, but, you know, there's just so much more than a smart model. There's, again, the energy you need to run it, there's the chips you need to do it all, right? But Google... I think the real competition with Google is not from OpenAI as much as it's from, and this is my conspiracy theory, I'm putting my tin foil hat on, it's Grok, it's, you know, SpaceX, it's Tesla. I think that's going to be the competition a little bit, because Grok is pretty good. We don't talk about Grok a lot, but in the LM Arena they're pretty good in a couple of different areas, even though I don't use it much.

They're definitely not as enterprise-y. I think that's the problem is OpenAI, Google, Anthropic, they've all built their enterprise arm and wield that pretty effectively, but Grok is a little bit of the wild west so to speak.

But I saw Grok ranks best in search, which is interesting, because you'd think Google would be... I think, I could be wrong, but you'd think that Google would be the best in search, but you know... I don't know what the search category is exactly checking, but yeah, that's a little bit interesting.

Yeah, I mean Google's third in that category from my own... I think search is like searching real-time information, so "grounding" is what they usually call it. So I guess Twitter does have a lot of data and that's where it shines.

Yeah, so I think OpenAI, I still think OpenAI will be around, but I think I could see Grok coming in and kind of swooping up. And just again, from an overall ecosystem perspective, it's going to be in the Teslas. It might already be, I don't have a Tesla, but you know, you're going to have a model in the Tesla, you're going to have the Neuralink in the brain chip if you do the brain chip thing. You know, when you're going out to space in the SpaceX rocket, you're probably going to hear a Grok voice talking to you. You got the robots now. So like, yeah, I think, you know, that's where it's going.

A lot of conspiracy theories. We've got to have a separate podcast where we just grill Ben about his tin foil hat theories.

Yeah, and I'll say this and I'll shut up. None of them have really ventured into hardware yet, which is interesting. OpenAI will, probably. There's been all these articles.

Supposedly, yeah.

There's Google Home, right? But that's like Alexa. But none of them have done robotics yet, except for, you know, Optimus Prime.

Okay, I have to add one more topic since you brought it up. We are close to time, but Apple's, I think, chief of AI just left the company sometime this week or last week. So Apple is clearly behind on the AI game. I think their head of AI or someone pretty high up in the company, I don't have the exact details, just left, and I think they're replacing him with somebody else. So hopefully Apple will step up to the plate.

Yeah, I was going to say, if they hadn't filled that position, you can give me a call, but um, you know.

Next time.

Next time. Yeah, yeah. We could probably deep dive on Meta a little bit some other day, because with those big, big headlines, it's always easy to say something in the news, but executing and delivering is a lot harder.

Yeah, I haven't heard much from them sadly.

Yeah. Yeah, but all right, let's wrap it up there. Let's do some bookmarks real fast. Thirty-second bookmarks. You go first.

Okay. Alright. So my bookmark is a zoom tool that's built into macOS. So I've needed this myself, but essentially, if you've seen screenshots where it's a screenshot image and then one part of the UI essentially has a magnifying glass where it's blown up a bit. I guess, as I learned today, this is a built-in feature in the Preview app on macOS. So there's a toolbar that you can activate, you can click on this magnifying glass, you can hover over the UI and it'll show you a kind of blown-up view of the area that you're looking at. So pretty cool. Honestly, I might use this for my App Store screenshots because oftentimes you show an iPhone screenshot and then you zoom in on one element of the text or one element of the page. So we'll probably be using this. Pretty cool feature. Thanks, Julian, for shouting that out.

Cool. Okay, my bookmark is, I teased it a little bit with the Imagen talk, and it's from The Boring Marketer, who does a lot of really cool AI marketing type stuff, obviously. He actually reposted someone who has built a Claude AI skill that takes any input and converts it into different kinds of art with Imagen. And they showed the pictures the person has made, and they look really cool. It looks almost like a medical journal: it has a headline and a subtitle, and then it has an image, graphs, and stuff like that. So I definitely want to check it out for my own use cases too, because, again, we talked about generating content, and I'm not a creative, and I don't think you'd call yourself a creative either. That's the appeal of something like Imagen for us. And seeing this person, Daniel Miessler, put this out there is pretty cool. So I'm gonna check it out.

So would you recommend a follow for The Boring Marketer Twitter account?

Yeah. Yeah, you know, I think if you've probably heard of Greg Isenberg, he's pretty big in that AI space. The Boring Marketer, I found out about The Boring Marketer through Greg and I think Greg had him on his podcast at one point. They were talking about something else. It wasn't about this. It was, I can't remember what it was about, some marketing thing obviously, but I saw him through that and yeah, The Boring Marketer puts out a lot of good content too about specific AI stuff with, you know, marketing, of course. It's in the name. So yeah, no, he's a good follow for sure.

Nice. Cool. Well, we'll wrap it at this. We talked about so much this episode, but again, my takeaway for all those listening out there is Claude Opus 4.5. Game-changer model. Go download and use Claude Code. And secondly, a huge shout out to Michael B. He's a listener of the podcast who reached out to me on LinkedIn and said he loved the podcast. Appreciate it, Michael. We met at Laracon, really awesome guy. So hopefully you're doing well.

Nice, nice. Appreciate anyone listening for sure. Give us a review, tell us how you like it. And yeah, as always, I had a good time, Brad. We'll get it posted.

'Til next time.

Sounds good. See ya.

See ya.

[MUSIC]

Thank you for listening to the Breakeven Brothers podcast. If you enjoyed the episode, please leave us a five-star review on Spotify, Apple Podcasts, or wherever else you may be listening from. Also be sure to subscribe to our show and YouTube channel so you never miss an episode. Thanks and take care. All views and opinions by Bradley and Bennett are solely their own and unaffiliated with any external parties.

Creators and Guests

Bennett Bernard (Host)
Mortgage Accounting & Finance at Zillow. Tweets about Mortgage Banking and random thoughts. My views are my own and have not been reviewed/approved by Zillow.

Bradley Bernard (Host)
Coder, builder, mobile app developer, & aspiring creator. Software Engineer at @Snap working on the iOS app. Views expressed are my own.