Introduction
Last week, Anthropic released Claude 3.7 Sonnet, the world's first hybrid reasoning model. Its code generation capabilities have improved significantly, and many people have been amazed by the UIs it can produce.
On March 3, Anthropic officially announced a $3.5 billion Series E funding round, bringing its valuation to $61.5 billion.
In short, new milestones on both the product and business fronts.
Recently, Mike Krieger, Anthropic's Chief Product Officer and a co-founder of Instagram, was interviewed on the 20VC podcast. Krieger not only revealed a great deal about Anthropic's product strategy, but also shared his perspective on where AI startups should enter the market, where the field is heading, and what he makes of DeepSeek.
01
Startups should build
products for the models of the future
Moderator: I want to start with a very challenging question: as a venture capitalist, I have to judge where the value is in the future today. But frankly, looking around the world today, I’m really not sure.
So what I want to ask you is: looking ahead, where will the value be generated over the next decade of the AI era? I often hear different versions of this question from entrepreneurs. They ask me, “What can I build so that I don’t compete directly with Anthropic or a big lab like that?”
Mike Krieger: I don’t have a perfect answer, because it’s essentially predicting the future. But I feel the most valuable areas will be those where you have a differentiated go-to-market (GTM) strategy, unique knowledge of a particular industry, or unique data (ideally, two or three of those at once). For example, companies in the financial, legal, or healthcare sectors.
Healthcare, in particular, was extremely complex and messy when I first came into contact with it. The upfront work isn’t sexy, and it can’t be done at an accelerator or in a short period of time. But it is exactly this early accumulation and groundwork that produces lasting value in these areas. You can then take advantage of the underlying models and fine-tune or optimize for AI as needed. But what really keeps you competitive is the ability to sell in those areas, a unique understanding of them, and the ability to improve in them over time.
Moderator: You mentioned “upfront accumulation” and also talked about differentiated GTM and data sources. So, will the next wave of AI be more beneficial to existing vertical SaaS companies that already have those advantages and can apply AI, or to new companies built from scratch in these areas?
Mike Krieger: I think there’s an opportunity for both.
At a high level, the key to AI product design is striking a delicate balance between presenting a vision of the future and leveraging the model’s current capabilities. You need to design for what the model will be capable of in three months, because the technology is moving so fast. But at the same time, you can’t over-promise and under-deliver, as that wreaks havoc on trust.
If you’re a startup, you can afford to “over-promise” a little, because early adopters are more willing to experiment and more tolerant. But if you’re an existing vertical SaaS company and you say, “We’ve added AI capabilities,” and users try it and find it’s not that good, or think, “It should do more,” or “You say it can do 30 things, but it can only do two,” that’s bad.
I think these two types of companies face very different challenges. If you’re an existing SaaS company with a mature product and established user habits, you need to anticipate trends without alienating existing customers, and there are some good patterns for doing that. If you’re a startup, you may not have the data yet, or you’re fighting for an initial benchmark customer. Your differentiator isn’t the relationships you’ve built; it’s mapping out a vision of the future and finding ways to deliver value quickly, to give hope to the companies willing to bet on you.
Moderator: You mentioned that startups need to “build products for the models of the future.” This is a very challenging period, because the quality of a startup’s product depends largely on the quality of the model. Any change in the model can have a huge impact on the startup’s output, whether that’s code, software, legal platforms, or something else. So, should startups build on today’s models, or on our predictions about future models?
Mike Krieger: I’ve heard from a lot of people that their startups didn’t really take off until a breakthrough model like Claude 3.5 Sonnet came along. Some entrepreneurs have told me their companies weren’t really companies at all until a model broke through. For example, the model’s accuracy increased from 95% to 99%, which is close enough to perfect for some industries; or from 70% to 90%. That kind of generational leap is critical.
So, how can you tell when that leap will occur? Some entrepreneurs have been hitting walls in a particular field for years, whether it’s helping people write code, doing legal analysis, or working in areas like healthcare. They may have pieced together (“patchwork” may be a bit of an understatement; I should say elaborately assembled) a scheme involving multiple tools. But that solution often isn’t price-competitive, because it requires an Opus-level high-end model whose cost the underlying business can’t support.
But even so, these efforts are still valuable, because when more powerful models emerge, you’re not starting from scratch. Often, the companies that benefit from a generational upgrade of the model are not the ones that suddenly got off the ground the day the model was released, but the ones that have been working in the field for a long time. Take Cursor: someone showed me a list of posts the Cursor founders had submitted on Hacker News. They eventually made a breakthrough, but it wasn’t their first product or first iteration. They kept trying and trying, possibly for quite a long time. So their success isn’t just driven by the rapid progress of the model; it’s built on background knowledge, accumulated experience, and an understanding of the field’s pain points and successes that make the model really work.
So, to put it more succinctly: don’t wait for the model to become perfect. Actively explore the field, get frustrated with the limitations of the current model, and then eagerly try the next generation. That way, you reach the point of feeling you could finally achieve what you have in mind, if only the model were a little more powerful.
02
Future models will become more and more different,
not more and more similar
Moderator: You mentioned differentiated GTM and differentiated data. There are so many different models released now, and so fast. I wonder if the model layer itself is still valuable if it doesn’t have a differentiated data advantage, or a differentiated GTM advantage? What do you think about this?
Mike Krieger: In terms of model layers, and in particular the base model layer, I think there are three areas that are worth investing in over the long term:
The first is talent. I know talent is hard to quantify, and it’s hard to say exactly what talent density means. But talent attracts talent, right? You become a magnetic field, especially when talent coheres around a shared mission or vision. I’ve seen this at Anthropic. I love our research team, and it feels like every month we welcome important new members, whether from other labs or from academia. This is an advantage you must nurture and maintain, because talent is highly mobile and free to choose; you have to preserve whatever attracted them in the first place. And it matters enormously, because to stay ahead you need not just sheer volume of work but the right breakthroughs. That’s the first point.
The second point is that I think the models will become more and more different over time, not more and more similar. Of course, there are a lot of shared benchmarks that everyone looks at. But Claude is Claude and GPT is GPT; they each have their own strengths and weaknesses, not only in personality and tone, but in the areas where each model truly excels. Coding is obviously a very important vertical for us, and we’ve kept working on it. That is no accident; we don’t just settle for “the models are good at coding” and stop there. We’re seeing the demand for code models, and seeing so many companies now rely on Claude models for coding or agentic planning, and that inspires us to think about how the next generation of models should evolve and what to do from a reinforcement learning perspective. So the first is talent, and the second is focus and model strengths, which you develop in depth over time.
Thirdly, when DeepSeek was released, I was asked a lot of questions about it, such as “What does DeepSeek mean for you?” I think on a technical level, we can learn something from what they’re doing. But from the perspective of market strategy and market position, DeepSeek has little to no impact. The relationships we have with companies aren’t simple API calls; it’s not just sending input tokens in exchange for output tokens. It’s more like, “Hey, I want to be your long-term AI partner. I want to help you co-design products with your applied AI team, imagine the future with you, and think not only about your API but also about Claude for Work.” It’s more a company offering AI partnerships than just AI models.
It might be more helpful to look at it the other way around, through the failure modes. The failure modes are: settling for the status quo and not retaining the best talent; believing that incremental improvements in the model are enough; and treating the API as just a way to trade money for intelligence, without thinking about how to become a deeper AI partner. If you fall into those three traps, I think you’re in trouble.
Moderator: When we look at the impediments to progress, what do you think is the biggest obstacle today? I’ve heard very different perspectives on this from different people, whether Alex Wang or Groq’s Jonathan Ross. Is the obstacle computing power? Data? Algorithms? Or making the model training environment better match real-world challenges rather than single-interaction challenges?
Mike Krieger: I think it’s the latter: improving the model training environment to better reflect complex real-world tasks, rather than just standalone, one-off evaluations. I know Alex is thinking about this as well, because we’ve talked about evaluating agentic behavior, and that’s just one specific aspect of the broader problem I’m describing.
Even in software engineering, the job is not just writing code. It’s understanding what needs to be built, working with product managers to create timelines, gaining a deep understanding of requirements and user use cases, delivering results in a testable and iterative way, and, if you’re building a public-facing product, getting feedback from end users. That’s a complex workflow, and there is currently no good way to assess it. Interestingly, we call the most common software engineering benchmark “SWE-bench,” but being a good software engineer is much more than looking at a PR, submitting a PR, and waiting for approval. So it’s important to build evaluations and environments that better reflect the real working environment.
We’re also thinking a lot about the use cases of office professionals within Anthropic, which is probably one of the areas where future models can be hugely empowering. But no one has really evaluated this well yet. In research, we’re starting to get better at evaluations, such as Humanity’s Last Exam, which involves extremely complex, multi-step reasoning. But there is no way to simulate “I join a new company, quickly understand my role, my organization, my relationships, and where to find the information I need, and then integrate into the day-to-day running of the company.” That’s a difficult environment to capture. So for me, figuring out how to break this problem down better, or think about it as a whole, is the biggest obstacle to at least one aspect of model progress: how the model moves from being good at extremely narrow tasks to being a more general, useful collaborator.
Moderator: As we look at the future of data for models, will synthetic data account for more and more? Or will human data continue to be the primary driver of model progress? What do you think?
Mike Krieger: I think to improve the model you need a combination: maybe first bootstrap the model with raw human data, and then generate synthetic environments where the model can explore and find its way.
Claude has been playing Pokémon this week, which has been a fun but somewhat distracting pastime for our research and engineering teams; everyone is following the live stream of Claude playing Pokémon. I think the game is an interesting example, where you can imagine many different runs through the same game under a set of constraints and rules. But when the question space isn’t as clear-cut as “Did you make it out of Viridian Forest?” (I haven’t played Pokémon; I only know this from watching the stream), things get much more complicated. Even so, being able to capture the golden path and integrate the various approaches still matters, so you can think about how the model makes progress under uncertainty.
So I think this definitely requires a hybrid approach: the best models will come from a combination of excellent human data and synthetic data. For example, for a code model, you need a good code base and good examples, but also the ability to explore a wide variety of paths. Another part that is still underestimated is how the model’s personality is measured and evaluated, and how personality data is obtained. I use a very broad word: vibes. What exactly is the “feel” of the model? We don’t actually know until we sit down and experience it.
In a way, this is a good feature, because it means the model has a very subjective, human-like aspect. But it also means you can’t run good regression tests on it. For example, when we upgraded from Claude 3.5 to 3.7, people might say, “Claude seems friendlier, but also dumber,” or “Claude seems more willing to answer my questions, but I wish it were better at creative writing.” These things are hard to assess, and that comes back to the data. So I think it’s important to have both data on these softer skills and ways to evaluate them.
03
Model quality is strongly correlated with product experience,
and in the future users won't need to choose models themselves
Moderator: One of the strange things I’ve found is that we now choose which model to use. Of course, you might say that’s because they each have their own specialty. But when I look ahead three to five years, I don’t think you’ll have to choose a model anymore, just as you don’t choose which Google to use. Am I completely wrong, or am I missing something?
Mike Krieger: No, you’re not wrong. I like a concept from the world of human-computer interaction; you may have heard the term “leaky abstractions” (abstraction layers that fail to hide details and force the user to understand the underlying mechanics). As software builders, we try to encapsulate all the complexity perfectly, hidden under a little “shell,” so the user doesn’t have to worry about any of the underlying details. But the reality is that most AI products today suffer from leaky abstractions. For example, the user has to choose a model, which shouldn’t happen at all. Why should users choose between Opus, Haiku, and Sonnet? Most people simply don’t understand the difference. Or open OpenAI’s model selector: there are a lot of models in it, each with its raison d’être, but the overall experience is, why should I choose this over that? This feature is available here but not there. We’re plagued by this problem ourselves. Model selection is the first leaky abstraction.
The second is that once you understand how these models are built, you know they accumulate context, and each conversation replays the full context for the next inference. As a result, every conversation is separate. It strikes me that when you talk to a colleague, you may have different email threads, but behind all of them it’s still the same colleague. If you mention their favorite team, or a project you’ve worked on together, they won’t say, “I don’t know what you’re talking about,” or “I need to retrieve my memory.” There’s a base of knowledge you share between you. This is another leaky abstraction, where we force the user to understand how the model works, but I don’t think users should need to understand that.
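The statelessness Krieger is describing is easy to see in code: the "memory" lives entirely on the client side, which resends the whole conversation with every turn. A minimal sketch in Python (no real API involved; `fake_model` is a hypothetical stand-in for an inference endpoint):

```python
# Each turn replays the entire conversation: the model itself is stateless,
# so "memory" is just the client resending history with every request.

def fake_model(messages):
    # Stand-in for an inference call; a real endpoint would receive
    # the full message list as its prompt context every time.
    return f"reply #{sum(1 for m in messages if m['role'] == 'user')}"

history = []

def send(user_text):
    history.append({"role": "user", "content": user_text})
    reply = fake_model(history)  # the full history goes over the wire
    history.append({"role": "assistant", "content": reply})
    return reply

send("Hi, I'm working on project X.")
send("Remind me what I said I was working on?")
# The second call only "remembers" project X because the client replayed
# the first exchange; start a fresh history and that knowledge is gone.
print(len(history))  # prints 4: two turns, each replayed in full next time
```

This is the leaky abstraction: the user-visible behavior (a conversation that "forgets" across chats) is a direct consequence of a transport detail the user was never supposed to care about.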
The last one is prompting. Prompt engineering has evolved a lot, and we’ve done a lot of work to refine prompts and turn simple human prompts into model-optimal ones. But I want prompt engineering to be completely transparent to the user, not something the user needs to actively participate in. If the model lacks a clear understanding of the problem, or needs more help understanding it, it should clarify through conversation, rather than separating users into good prompt engineers and bad ones. Right now, the gap in prompt engineering is closing generation by generation, but I hope we can close it even further.
Moderator: How do you see the relationship between model quality and product user experience (UX)? And how do you weigh the two against each other?
Mike Krieger: You can’t look at the two separately anymore.
In my opinion, to be a good UX designer today, you have to consider the quality of the model as well. I think back to product design meetings at Instagram, where we would talk about pixels, and about mocking things up with synthetic or real data, like reflowing my actual feed data into a proposed UX. At that time, there wasn’t much uncertainty in product design: you put the product out there, and people would use it in fairly predictable ways. But these days, designers, product managers, and especially engineers need to think, “I’m actually designing scaffolding and products around a fundamentally uncertain system.” That means all the back-end concerns like model quality and prompt engineering become part of product design and have a direct impact on the product.
For example, you can prompt Claude to ask a follow-up question, which may be what you want in some parts of the product but not in others. You can also prompt Claude to spend more time thinking and reasoning. These are decisions you need to make early in the design of your product, and they are reflected in the actual product.
On the other hand, as we discussed earlier, whether you’re a startup founder or a traditional B2B SaaS company, you need to sort out where the model is headed, what it’s capable of, and what users want. The same applies to product design. You need to evaluate ahead of time whether what you want to do can be implemented with an existing model, or at least focus on what the model is likely to achieve. But the model changes over time, and so does the product. If you don’t have a good evaluation framework, or even regression-test evaluations, you might release a product and find, three months later, that users feel “the product used to be good, but now something seems off and it no longer meets my needs.” You’re not sure whether the model changed, the product design changed, a different feature was introduced, or the system prompt got longer. In many ways, this is the most complex product development work I’ve ever done.
Moderator: Sam Altman once said that one of the joys of being a startup is that you can release products faster and don’t have to strive for perfection. But as the company grows, each release comes under increasing pressure. What do you think of the idea that releases don’t have to be perfect and users can start using them right away? And how do you think about it as a product leader, now that Anthropic is a huge company with millions of users?
Mike Krieger: I’ve been thinking about this a lot, especially when we have different product interfaces and audiences that have different expectations for stability and different desires for cutting-edge technology.
For example, in API products, predictability and stability are valued, as is choosing technologies that will hold up over time. So API products can be opt-in. I remember when we introduced prompt caching, which was a huge cost saving for users: initially, we gave users an opt-in through a beta header. A lot of what we do with APIs takes this form. But if you use this approach for customer-facing or more consumer-grade products, getting users to opt in doesn’t work as well. You definitely want to be able to iterate and experiment, and you don’t want to ruin the user experience, but you can earn more latitude to experiment.
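The beta-header opt-in Krieger mentions worked roughly like this at the HTTP level: the caller attaches a beta header and marks which prompt blocks are cacheable. A hedged sketch of the request shape (the header value and `cache_control` field follow Anthropic's public prompt-caching beta as I recall it; the exact names and dates may have changed, so treat them as illustrative):

```python
import json

# Opt-in via a beta header: without it, the caching fields are not enabled.
headers = {
    "content-type": "application/json",
    "anthropic-beta": "prompt-caching-2024-07-31",  # illustrative beta name
}

# The large, reusable part of the prompt is marked cacheable so repeated
# requests can skip re-processing it (the main source of cost savings).
payload = {
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 256,
    "system": [
        {
            "type": "text",
            "text": "A long, reusable system prompt or document goes here...",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Summarize the document."}],
}

body = json.dumps(payload)  # this body plus the headers form the POST request
```

The design point Krieger is making: because the behavior only activates when the caller sends the header, existing API integrations are untouched, which is exactly the kind of opt-in that is easy for an API but awkward for a consumer product.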
And then we have enterprise customers who use Claude for Work in an enterprise environment. In my opinion, enterprise adoption of AI is still in its early stages, so you can be a bit more nimble than established companies like Salesforce. I don’t know exactly how often they ship, but many of those companies only release two or three times a year, usually around big events. We’re still far from that release cadence; we’re still releasing fast. But to be honest, we’re still finding the balance: release once a month? Release as often as possible? Or some sort of admin opt-in mechanism, though that adds complexity too.
So, that’s a great question. I think we’re still actively debating the intensity and speed of releases. We want to bring new features to market as quickly as possible, because you’re not sure how users will receive them, and you need to keep learning. But as visibility grows, and more and more people start relying on your product for their workflows, you can no longer treat releases as casually as you used to.
04
Lessons from DeepSeek:
learn to market yourself and ship products quickly
Moderator: I discussed this with Alex Wang, and he thinks we’re grossly underestimating China’s capabilities in AI. Do you agree that we underestimate China?
Mike Krieger: Yes. People were surprised by the emergence of DeepSeek, and it seems a lot of people didn’t expect such a cutting-edge research team in China. But if you’ve been following this area, that shouldn’t be surprising. We saw this early on with Instagram: it was blocked in China, and a parallel startup world emerged. What happens if Facebook and Instagram are blocked? What emerges? The resulting products were often high quality, showed a lot of creative thinking, and were used at massive scale. They solved technical challenges at a scale comparable to Facebook’s.
So, underestimating, or continuing to underestimate, China’s capabilities in AI is definitely a mistake. China has tremendous potential, both in cutting-edge model training (especially if they have access to computing power) and in continuous innovation. The idea that “they’re just copying something that worked elsewhere” is a very Western-centric perspective, and I’ve seen the same attitude in the traditional software space. It ignores the differentiated products that emerge within the Chinese market, some of which then go overseas. TikTok is an interesting example.
Moderator: Before we move on to the “ultimate product”: did the advent of DeepSeek make you rethink Anthropic’s direction, or change Anthropic’s strategy?
Mike Krieger: At the architectural level, there are some things worth thinking about. I can’t speak for the research team, because they’re the real experts, but they do think some of DeepSeek’s practices are worth considering, or worth re-evaluating: ideas they had thought about before but had since set aside. So I think there’s an impact there.
Interestingly, we had already planned to show the chain of thought when we released our reasoning model. So the advent of DeepSeek didn’t make us rethink that, but it was interesting to see others doing the same. On the user-interface side, there are also details to learn from; Grok is now adding a chain-of-thought display to their model too. So I’m curious to see how chain of thought develops. As for the distillation issue you mentioned earlier, that may be one reason more labs choose not to show the chain of thought, or to blur it.
On the other hand, from a product perspective, there are two things worth pondering about the emergence of DeepSeek. I think the most underrated thing about DeepSeek is that they went from obscurity to being more famous than Claude in a lot of circles, which is simply incredible. Even the partners at Greylock were asking me what I thought of DeepSeek, and that wasn’t a joke; it really happened.
I started to think: what on earth did DeepSeek do to make such a big breakthrough that Claude didn’t? I think it’s closely related to the current world situation and the “DeepSeek is cheaper” narrative. Whether or not that’s entirely true, or whether they actually found some kind of breakthrough, the story itself is fascinating. Frankly, I’ve talked to our marketing team about this too. I don’t think we’ve told Claude’s story well enough to show what makes it unique or interesting. Take Claude 3, which was trained by a much smaller team than at other labs, yet still reached a state-of-the-art level; we’ve always been very efficient in our use of compute. I don’t know whether DeepSeek told their story intentionally or the media shaped it for them, because it is a very compelling story. That uniqueness mattered a lot at that particular moment, and these factors created the perfect backdrop for DeepSeek’s rise. I think they did a good job.
The second point is that, in product terms, DeepSeek went from having no product to launching an iOS app and doing a great job on the details. For me, it was a good nudge, even a shove, reminding us to bring some ideas to market faster and not obsess over perfecting every detail as we did before, and instead be more willing to get the product out there and learn by doing. Sometimes the novelty of an experience is valuable in itself. This was the first time most people had seen a real-time presentation of the chain of thought, and it’s a lot of fun. I wish we had done it sooner, because it could have been a novel experience for our users.
Moderator: If you look at usage, you see high usage and high retention in emerging markets, but not in Western markets. Do you see DeepSeek as a sustainable, credible threat? They have reached a certain level of popularity; does that mean they can keep growing?
Mike Krieger: I think for all of these AI-lab-born products we’re building, if six months or a year from now the experience is still just “I can ask questions and occasionally get proactive suggestions,” that will be undifferentiated and unappealing. A truly valuable product should make you say, “Wow, I can do something unique now because I used Claude or DeepSeek or whatever, and it saves me hours of work, makes me smarter, makes me a better partner to the important people in my life.” The product must go beyond superficial utility. Of course, some people already find that deeper value, and they are your DAUs today.
But a lot of people are just trying it out, using it to generate a poem or write a letter to their son. Those features may provide some value in the moment, but I still think we’re on “day one” of AI being an integral part of most people’s work. For DeepSeek and for all of us, I think the key to staying competitive is who can get there first and grow sustainably over time, with the right product design, the right integrations, and the right deployment strategy. From an investor’s perspective, who will build these products is often the primary question: when will model providers transition to application providers?
05
Anthropic wants to transition from a model provider
into an application provider
Moderator: What attracted you to invest resources in becoming an application provider, not just a model provider?
Mike Krieger: I focus on two main criteria. The first is versatility. Despite the size of Anthropic overall, our product team is maybe a tenth of it. It’s already large compared to Instagram in its second year, but still small compared to large SaaS companies; we’re somewhere in between. Yet we support a lot of products, including Claude Code, the API, Claude, Claude for Work, and more.
So I think versatility is very important. Even if we choose a persona or a vertical to target, what we build should be generic; there may be some specialization at the user level, but not at the underlying architecture level. I don’t want us to build a lot of vertical, highly customized products that only work for specific workflows or use cases. Focusing on more general, horizontal areas such as translation, transcription, and customer service seems like the right direction.
Moderator: I agree, unless……
Mike Krieger: Unless you consider workflow knowledge, which means you can keep your product differentiator for the long term. For example, if you’re a professional translator, you may need some specific features for your translation workflow.
Moderator: If you’re an advanced user, that’s probably true. But if it’s not a translator but, say, your mother, she may only use the translation feature once a month for the odd task.
Mike Krieger: Yes. I think the basic feature of “we can translate this for you” is a tough basis for charging an individual user a $10 monthly subscription, because the current models are already pretty good at it. Maybe you’re right that there’s not much room for differentiation in foundational AI products. But if you look at consoles and workbenches like ElevenLabs, you’ll find that a lot of the features they’ve built are obviously designed for professionals who translate content for hours a day, or dub large amounts of content with a reliable voice.
The product design of Descript, an AI video editing tool, is one of the best I’ve seen in the AI space. They’ve obviously put a lot of time into the workflow. I once used Descript for a personal podcast, and it was clearly built by people who sit inside that workflow day in and day out and understand it. So I think we probably agree that specialized use cases, and the workflows they unlock, are valuable, and that on the consumer or even prosumer side, the model is already good enough from the perspective of a foundational AI product.
Moderator: When you look at the areas that Anthropic excels at today, like the code side that we mentioned earlier, you’ve done a great job. Does Anthropic have any plans to launch its own IDE or code agent? How do you approach this from a product perspective?
Mike Krieger: I think we have to be careful about where we want to go. Even Claude Code, which we just released, was originally built as an internal command-line coding tool, because we just wanted to speed up our own team. After a few months of observation, we thought it was pretty good. It’s not a solution to every coding problem, and it can’t replace an IDE, but it’s useful for us in a lot of situations, so we wanted to see people use it in the real world. But then you face the cost of shipping: you need to name it, find the right packaging, and develop a marketing strategy. So we’re very cautious about that.
I think that, at the current level of the model, you still need to operate the keyboard with your own hands, and you still need to communicate with the model, like, “I did this, did I do this?” , “Okay, let’s continue in this direction”, “Awesome, submit a PR”, “No, we’re going in the wrong direction, let’s backwind a bit”, and then iterate on it in practice. That’s why I think there’s a middle ground between IDEs and fully autonomous Devin (Cognition). Cognitive Devin can fully delegate tasks, but the current model can’t do that.
Claude Code works well for certain types of tasks, and our product engineers love it, because a lot of product engineering is about building end-to-end workflows: updating the backend, creating the frontend, submitting translations, or fixing minor problems. Claude Code is very good at tasks that require intelligent coordination across different parts of a system. I submitted two PRs last week, the first code I’ve written since joining Anthropic, which makes me a little sad. But I finally got the chance to use Claude Code. I had never opened our codebase before and knew nothing about its structure, but Claude Code is very good at finding the files with the right code and editing them. Of course, not everyone is in my situation, but Claude Code is really valuable for these kinds of use cases.
So when I think about the coding space and where we can play a role and add value, I think our focus should be on the agent side, not the IDE side. There are companies that think every day about how to build a great IDE, and that involves hard questions about low-latency autocomplete, the right integrations, how to work with the VS Code plugin ecosystem, and so on. It’s a lot of work, and it’s very different from what we’re doing. I think we can play an important role in talking to the model, using the model to do the real work, and building an intelligent collaboration loop. But we also recognize that the current models are not yet fully hands-off in many use cases and still require human intervention.
06
Model iterations are frequent,
But developers shouldn’t be anxious
Moderator: Are we in the midst of a “product marketing nightmare”? I mean, this week DeepSeek released new models, OpenAI also released new models, Anthropic also released new models, and Mistral also released new models 10 days ago. With new releases almost every day, the world can become numb. What do you think of this situation? How does this affect your thinking about product launch and messaging?
Mike Krieger: Yes, it’s a lot more complicated than it used to be. At Instagram, you needed to be aware of big events that were known in advance, like WWDC week, the September iOS launch, or the big holidays. From a product-marketing perspective, that’s much easier. The current situation reminds me of the game Crossy Road, where you watch the traffic as if you were crossing the street, looking for a “window” to release your product: “Okay, the car has passed, now there’s a gap, let’s release tomorrow, or release right now. But, oh, now I’m hearing rumors again…”
The situation now is much harder. I’ve heard from friends in other labs that everyone is trying to read the “tea” (the industry gossip): “Is it calm now? Can we release now? Or should we release next Tuesday?” This requires a completely different approach.
We released Claude 3.7 Sonnet on a Monday and finalized the blog post at 9pm on Sunday, which is not best practice from a marketing perspective. We also briefed journalists on Sunday. But that’s when everything was done and ready to go. So this requires quick reflexes and flexibility. It even affects the model cards, evaluation reports, and comparison tables, which may include data from a model released only the week before (like Grok-3).
Moderator: When Grok-3 was released, did everyone at Anthropic and OpenAI think “Oops, they overtook us again”, or “Awesome, we won”?
Mike Krieger: It takes a certain mindset, and I often remind the team that model releases happen all the time, and at any point you can go through a “lead, fall behind, lead again” cycle. You have to adapt to that rhythm in the AI space and not get too frustrated over a single release. Of course, sometimes you’ll be lucky and the product or model you release stays ahead for two or three months, and sometimes it’s only a week. You can’t overreact to either situation: you can’t rest on your laurels, and you can’t get too frustrated.
I think what really helps is a chart I show at almost every sales meeting: the milestones from Anthropic’s founding to today. At any point in time, you could say “Wow, Claude 2 already looks like it’s lagging” or “Claude 3 is state-of-the-art”, but it will soon be overtaken. You need to focus on the long-term trajectory and trust that you will keep improving. That’s the first point.
Second, remind yourself that it would be crazy if everyone switched models every day just because an evaluation metric changed. That would be crazy for your user base, and it would make the industry even crazier. Over time, you come to realize that when people deploy models, they don’t just use them as-is: they fine-tune, or do a lot of customization to make the model fit a specific use case. Switching models is not something done overnight. And you’re still one of the three or four options in the model selector; in a coding environment, for example, you still have a chance. But it does require a certain state of mind, and I don’t know if that means finding a meditative, detached angle, or just getting used to being leapfrogged, or both. But for sure, every time a model is released, I’d guess every lab watches the livestream, looks at the evaluation metrics, and concludes, “Okay, we’ve got work to do.”
Moderator: I think the brand is the most important thing. Like you said, people don’t switch models every day, they say “I’m a Claude user”, or “I’m a ChatGPT user” and they’ve developed a sense of identification with the model they’re using. Do you agree with this statement?
Mike Krieger: I agree with that, especially when it comes to consumer products.
I was recently reading Ben Thompson, who often has Nat Friedman and Daniel Gross on his show, and they also talked about some people being Claude users and some being ChatGPT users. I think the phenomenon is real: users come to like a model’s personality, an interface design, or the overall vibe. It reminds me of our years competing with Snapchat, and before that, of people launching new products like “Instagram, but only for high-end photographers,” or “Instagram with a few extra features,” or “Instagram with only one photo a day,” like BeReal.
I have a pseudo-formula (I’m obviously not one of Anthropic’s mathematicians): a social network is made up of product format, audience, and vibes. For Instagram, the product format included Stories, the Feed, and later video; the audience started as photographers who liked a retro style and later expanded to anyone interested in visual storytelling or visual media; and even though our product format was quite similar to Snapchat’s or even Facebook’s, Instagram had a very different vibe. I don’t know what the pseudo-formula for AI products is, but I think it has some similarity to the one for social networks. The personality of the model may be one factor, the prescriptiveness of the product’s scaffolding another, and then the vibes. Vibes are hard to measure, but they’re definitely there.
07
First-party products can better help iterate on models
Moderator: We talked earlier about model products and building them. When you think about building a product for consumers versus building out the API business, how do you strike the balance between the API business and the end-user consumer business?
Mike Krieger: I think we can learn faster with a first-party product. As a very specific example, a week after Claude Code was deployed internally, we found an issue where the model wasn’t taking full advantage of a certain tool it had access to. That finding fed directly into the improvements in Claude 3.7 Sonnet: in-house use of a first-party tool led directly to improvements in the next-generation model. We’ve found similar cases in a number of other places. With third-party products, it’s hard to get that kind of direct feedback. Third-party partners will tell you what went wrong, but the feedback is always separated by a layer. Even though we work closely with the coding startups you mentioned, the situation is still different. So first-party products have a lot of value in terms of learning.
On the other hand, it’s also easier to build user stickiness and brand loyalty with a first-party product. I think it’s easier to build a brand around a first-party product than around an API alone. We provide the underlying technology for a lot of coding products, which is obvious to people in the industry, since Claude is often the default option in the drop-down selector. But not everyone understands that, and the API isn’t a product users download or install, or tell their friends about. Still, the API gives us a huge distribution channel. We can’t invent every company ourselves, and with the API we can play a more investor-like role, see more possibilities, and pursue more than one goal.
So from a resource-allocation perspective, the investment in the API business and the first-party product business is fairly balanced. If anything, we’re under-investing in two things. One is accelerating the iteration of first-party products, which is what I’m most focused on right now. The second, on the API side, is building more advanced abstractions on top of the basic “tokens in, tokens out” pattern. Every time we do this, we get good feedback from users. Whether it’s helping models plan intelligently and work autonomously, having models build knowledge bases and knowledge graphs that reflect the inner workings of a company (if you need to build internal knowledge products), refining tool use, or understanding a lot of context and maintaining memory across conversations, these problems are worth solving at the API level. Because we can apply what we’ve learned in model training directly to the API, we can build great products around it. That’s how I see the two. But it’s a new problem for me: at Instagram it was simple, 95% product and 5% API.
Moderator: What can you do now, or what will you do in the future, to speed up the development of consumer products?
Mike Krieger: I think there are two things. First, it’s important to recognize that we’re still running on a startup model. Even though the company has momentum, the API business is doing well, and users are adopting Claude and upgrading to Claude Pro, we’re still early and still face a make-or-break situation. We need to operate with a startup mindset. That means bringing the right people together faster and ignoring organizational boundaries. I think we’ve been getting too rigid, putting too much emphasis on “this is that team’s responsibility” or “it can’t be done this quarter because it’s not in that team’s OKRs”.
I understand why organizations evolve the way they do; some rigidity is natural, but we can’t afford it right now. So the focus is to find the right people, get them together as quickly as possible, clear away other distractions, and move fast like a startup. Clearing my own schedule lets me spend more time on product reviews and design reviews rather than administration.
Moderator: Does the emergence of DeepSeek show the benefits of constraints? Are Western companies, especially you and OpenAI, over-funded?
Mike Krieger: I think it would be more accurate to say that our products get more recognition than their actual product-market fit warrants, because they’re still the best way to get access to the models. I don’t think that will last, and it’s no reason to rest on our laurels. And I don’t think we’re serving our users as well as we could, because I don’t think we’ve built the truly right product yet. It’s both something that stresses me when I wake up in the morning and something that motivates me, depending on the day. I think we still have a lot of work to do on the product side.
08
Regret not doing first-party products earlier
Moderator: What does OpenAI do better than you?
Mike Krieger: They released V1 much faster, sometimes even before the model was fully ready.
Moderator: In what ways are they worse than you?
Mike Krieger: Probably the coherence of the product’s persona and of the features they’re building.
Moderator: Of the other model providers you respect, which one do you respect the most?
Mike Krieger: OpenAI. I think they strike a good balance between first-party product development and the API, and their API is also used at scale. And I think they often “do the simple things first,” which was an Instagram principle.
Moderator: If you had to rebuild Anthropic’s product and technology stack from scratch, what would you do differently?
Mike Krieger: The very valuable things we built over the last year now feel like they’re straining the information architecture. This may sound nerdy, but basically, users shouldn’t need to think about Projects, Artifacts, and chats, and how they relate to each other.
On the product side, I think it’s time to get rid of these concepts altogether and think about what’s really important: Are you getting the right context in the right conversation? Do you always know what to do next? Can Anthropic and Claude themselves be helpful guides to guide you through the most important work? It’s different from the “I know how to create a project” paradigm. If you’re good at creating projects, the product will be great, but it takes a lot of steps.
On the tech-stack side, claude.ai started out as a demonstration of the models and in many ways wasn’t built as a foundation for a more complex, multi-product system. We’re actively tearing down some of the old architecture and rebuilding the core user experience to make it better. The user experience isn’t great right now; it feels like a product that was originally built for one purpose and is now being asked to do much more, so incremental improvements are becoming harder and slower.
Moderator: In what ways have you changed your mind over the past 12 months?
Mike Krieger: The importance of first-party products. I saw the growth of the API business and thought we should devote more time to the API. But I now think that if you don’t invest equally, or even more, in first-party products, you’re missing a great opportunity and failing to build a lasting moat.
Moderator: How much did being late here cost you?
Mike Krieger: I think the impact is big. Take DeepSeek: ideally, we should have been leading the narrative that there isn’t just one leading AI product or API available. I think we were undercut there.
09
Developers of the future need to learn:
Delegate tasks correctly
Moderator: You work with Cursor, Codium, and Stability AI. I want to ask: when you see developer behavior changing and, as you said, you wrote code for the first time since joining Anthropic, what do you think the role of a software developer will be in the next three to five years?
Mike Krieger: I think the role of the software developer is already starting to change. I was a big fan of GitHub Copilot early on, and my testimonial even appeared on their homepage at one point (I don’t know if it’s still there) because I saw its potential. When GPT-4 was released, I tried using it for Swift development: I’d draw ASCII art of the interface I wanted to build, have GPT-4 generate the code, go make a cup of coffee, and come back to find 80% of the code written. Now, with a model like Claude 3.7 Sonnet, code generation can be 95 to 99 percent complete.
In my opinion, the first skill future software developers need is to be interdisciplinary, or generalist: knowing what to build is just as important as knowing how to implement it precisely. I like this about our engineers; a lot of our good product ideas come from engineers and their prototypes. I think that’s the role many engineers will play in the future.
Second, when much of your work becomes evaluating AI-generated code, code review changes a lot too. I’ve experienced this myself: I submitted a PR and got comments like “Claude Code does this sometimes, but in this case we don’t actually use the default parameters”. I thought, “Okay, that’s bad.” If I had written the code myself, I might have caught those patterns. So we need to work on two fronts: on one hand, models and model infrastructure need to learn better from codebases and code reviews, so they produce code that better matches a company’s conventions; on the other, we need to figure out how to go from being primarily code writers to being primarily task delegators and code reviewers.
I think this is what software development will look like in the next three years: coming up with the right ideas, designing the right human-machine interactions, figuring out how to delegate tasks correctly, and then figuring out how to review code at scale. This may require a combination of static analysis or AI-driven code analysis tools to check the generated code for security vulnerabilities, defects, or bugs. Computer vision also comes into play, such as automated testing of UI.
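To make the static-analysis idea above concrete, here is a minimal, hypothetical sketch of what one small piece of machine-assisted review of generated code could look like, using Python's standard `ast` module. This is purely illustrative (the function name and the two checks are my own choices, not anything Anthropic or any vendor ships); a real pipeline would combine linters, security scanners, and test runs.

```python
import ast

def review_generated_code(source: str) -> list[str]:
    """Flag a couple of common defects in AI-generated Python code.

    A deliberately tiny sketch: it only detects bare `except:` clauses
    and calls to `eval`, two patterns a human reviewer would question.
    """
    findings = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # A bare `except:` swallows every error, including KeyboardInterrupt.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(f"line {node.lineno}: bare except clause")
        # `eval` on arbitrary strings is a classic injection risk.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "eval"):
            findings.append(f"line {node.lineno}: call to eval()")
    return findings

# A snippet standing in for model-generated code under review.
snippet = """
try:
    result = eval(user_input)
except:
    result = None
"""

for finding in review_generated_code(snippet):
    print(finding)
```

In the delegated-workflow vision described above, checks like these would run automatically on every generated patch, so the human reviewer only sees the flagged lines rather than the whole diff.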
In the future, ideally, you delegate a task to the AI, come back after a while, and it tells you: “I’ve done it. I evaluated three approaches and tested them in the browser; this is the one that works best. I ran a vulnerability scan with another agent and everything looks good; you just need to confirm that this key code snippet meets your expectations.” At that point you suddenly become a manager and delegator of tasks, rather than a hands-on partner in the workflow.
Moderator: You said that “three years is too long, one year is more realistic”, I agree with you. When we see the speed at which technology is evolving, do you think the acceleration of product launches will reach a plateau or asymptote? Or will this exponential growth continue?
Mike Krieger: That’s a question that I think about a lot. At the beginning of the year, I looked at our product development process and where we used Claude and where we didn’t. You’ll find that Claude is useful in many ways, such as generating PRDs (Product Requirements Documents) from initial ideas, but also in coding, and can also help synthesize people’s discussions about the product, find controversial issues, and drive consensus. But actually deciding what to build is still the hardest part. In fact, this can only be best addressed by getting together to discuss the pros and cons, or to explore the Figma prototype together.
So, as with any dynamic system, if you optimize one link, some other part soon becomes the bottleneck or critical path. In my opinion, reaching consensus, deciding what to build, solving real user problems, and developing a coherent product strategy are still very hard. It may take more than a year for models to help with that. That’s why I’m optimistic that small startup teams will be able to explore this space. I’ve learned from Instagram and Artifact that for a small team, reaching consensus might just be a conversation over tea, rather than steering a giant ship and making commitments to customers the way a large company does. Reaching consensus is still a very human problem, and I don’t think models will solve it at that level of abstraction for at least three years.
10
Distillation is not the key,
It’s the data that’s the key
Moderator: With so many different models and vendors, open source is a very viable option. Has distillation been demonized? If distillation can genuinely advance the field, wouldn’t it be very valuable, even within the labs, to transfer knowledge from high-end models to low-latency, more economical models, assuming every lab is using distillation?
Mike Krieger: I think what’s interesting about distillation is, first: do we want any country to be able to distill a model from another country’s model? My personal answer is no. As AI capabilities increase, there’s value in thinking about this from a national-security standpoint. Second, for progress to continue at the current pace and be sustainable long-term, labs need to be able to commercialize their training and innovation, so finding the right business model matters. Open-source models like Llama were built from Meta’s own research, data ingestion, and training. So I don’t think distillation is necessary to unlock these capabilities, and it comes with other problems, including terms-of-service issues.
Moderator: Does the release of Llama show that the model itself has no value and all the value is in the data? Is Facebook willing to release Llama for free because they know no one can copy the data they have?
Mike Krieger: That’s an interesting question, and it’s worth thinking about.
Does Llama owe its quality to the fact that they can (I don’t know if they admit it publicly, but apparently they can) train on data from Instagram and Facebook? Does Gemini perform better because it can train on data from YouTube? The Gemini case seems clearer to me. Whenever they show a great video-comprehension demo, I think: they probably have the largest video repository in the world and can train on a lot of video data. On Facebook’s side, it’s less clear. I’ve never heard anyone say “Llama is very good at generating content that performs well on social media”; Llama seems to be just a generic model. So this goes back to our earlier conversation: the value is in how good your team is, whether you have the underlying data you need, and how useful your model is in real-world use cases. The last one matters most.
I wish I had emphasized this at the outset: metrics are great for internal research and continuous improvement, but they don’t tell you whether the model is good, whether it’s up to a particular task, whether it only excels in very narrow scenarios, or whether entrepreneurs can rely on it as an agent inside a product. So for a lab, the value is in the team, and in the model’s ability to do the right thing in the real world while avoiding so much uncertainty that it becomes unreliable.
11
AI is a complement to human relationships,
But it doesn’t replace real interactions
Moderator: In the field of AI, what do you think are the most important technical or product challenges ahead, the ones no one is talking about yet but that you think are crucial?
Mike Krieger: As models become more powerful, one underrated challenge is “discernment”, alongside privacy. As the models get more capable, so does what you entrust to them. You might discuss all sorts of things with a model, from the very private to the commercially sensitive, or the model might have access to all of your company’s data. Everyone likes to talk about agent-to-agent interactions, but few people think about the intersection of these two factors: do you trust your “Mike agent” or “Harry agent” to operate in the outside world without being jailbroken or revealing the private or sensitive information it knows?
My analogy is my five-year-old daughter, who, with someone she has just met, can’t quite distinguish between family secrets or private matters and things you can talk about with a new friend or a cashier. Discernment is a skill people acquire over time, and I think this challenge is grossly underestimated for models; there may not be enough research on it from the model-capabilities perspective. Models are fundamentally built to be helpful, but that’s not always what you want. It’s not just a safety issue; it’s also a privacy and data-security issue.
Moderator: Are you worried that your five-year-old daughter will be more used to talking to models and agents than to humans?
Mike Krieger: I’ve had a lot of conversations with Alex Wang about this, because he thinks that in the future most of our friends will be AI friends. I don’t think he’s wrong; I think it’s starting to happen, like people spending a lot of time in online games where some of the characters are NPCs (non-player characters), and you might feel more comfortable in the virtual world. Even short of that extreme, it’s worth worrying about. My daughter is very outgoing, so I don’t worry about her personally.
But taking the question more broadly, there really is a lot to think about. Here’s an optimistic view: I was a rather awkward teenager, and it might have helped to have AI practice modes for improving my social skills. At the same time, that doesn’t fully capture the consequences of real human interaction. It’s like the difference between reading an article on “What it’s like to have your first heated argument with a high-school girlfriend” and actually having the argument; when you’re in the middle of one, you realize it’s completely different from reading about it. This reminds me of the classic “Chinese Room” thought experiment. Or another one: someone stays in a black-and-white room, only ever reading descriptions of the color red, and then one day walks out and sees red. Will he have an experience completely different from anything before? Absolutely. So is there a difference between talking to a model, even in an emotional role-play, and having the same interaction with a real person? Absolutely. AI may become a useful supplement to human interaction, but it is definitely not enough to replace real human interaction.
Moderator: Last question. Dario Amodei once said that our generation will probably live to be 150. I may be misquoting or over-summarizing him, but his point is that our generation is likely to live very long. I’m very optimistic about this; my mother has multiple sclerosis, and I hope AI will help find cures for diseases like MS. Do you agree with his optimistic prediction? What do you think about AI’s role in extending human lifespan?
Mike Krieger: I think the potential is huge. AI is already starting to play a role today, including in accelerating the loop between drug discovery and clinical trials. For example, Novo Nordisk used to take 15 weeks to complete a clinical trial report; using Claude, it now takes 20 minutes. That’s a huge step forward. Of course, there’s a lot of research behind that, and I’m not saying we’ve cut years down to weeks or minutes, but we really can speed up parts of the process. That’s what the current models can do.
Then you look at the Arc Institute, a research institute founded and funded by Patrick Collison and others, where they’re working on a foundation model of the cell. With a realistic model of the cell, you can run experiments, which would greatly speed up drug discovery because it shortens the experimental cycle time. So I’m very optimistic about that. I don’t think AI’s potential is being fully exploited yet in many areas. It used to be said that some of the brightest minds of my generation were working on better-targeted advertising, and that may have been true at some point. But today, many of them are working on building models that are extremely useful, valuable, and intelligent in every domain.
Author: AlLin师傅
Source: Anthropic CPO 万字专访:不再只做模型!后悔没有更早做第一方产品 (“Anthropic CPO in-depth interview: No longer just building models! Regretting not building first-party products sooner”)
The copyright belongs to the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.