Anthropic has just launched Claude 3.7 Sonnet, the first hybrid reasoning model and its strongest extended-thinking model to date. In the latest coding benchmarks, the new model beat o3-mini and DeepSeek R1, making it the new king of AI coding.
After holding back for more than half a year, Anthropic has finally released its big move: the first hybrid reasoning model, Claude 3.7 Sonnet, has made its debut!
This is the smartest model in the Claude series yet, capable of both near-instant responses and extended, step-by-step thinking.
In short, one model, two ways of thinking.
Suppose you want to solve a classic game-theory probability puzzle, the Monty Hall problem. Give it to Claude 3.7 Sonnet and select the "Extended" mode.
It will then show its detailed chain-of-thought process, finishing the problem in 52 seconds.
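For reference (this is not Claude's output), a minimal Python simulation of the Monty Hall problem shows why switching doors wins about two thirds of the time:

```python
import random

def monty_hall(trials=100_000):
    """Simulate the Monty Hall problem: compare win rates for 'stay' vs 'switch'."""
    stay_wins = switch_wins = 0
    for _ in range(trials):
        car = random.randrange(3)       # door hiding the car
        choice = random.randrange(3)    # contestant's first pick
        # Host opens a door that is neither the pick nor the car
        opened = next(d for d in range(3) if d != choice and d != car)
        # Switching means taking the remaining unopened door
        switched = next(d for d in range(3) if d != choice and d != opened)
        stay_wins += (choice == car)
        switch_wins += (switched == car)
    return stay_wins / trials, switch_wins / trials

print(monty_hall())  # roughly (0.33, 0.67): switching doubles the win rate
```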
Most importantly, Claude 3.7 Sonnet is currently available to everyone for free, although the "Extended Thinking" mode is not included in the free tier.
In multiple benchmark tests, Claude 3.7 Sonnet with "Extended Thinking" mode enabled set new SOTA records in mathematics, physics, instruction following, and programming.
Compared to the previous generation Claude 3.5 Sonnet, math and coding capabilities have increased by more than 10%.
Apart from mathematics, Claude 3.7 Sonnet (64K extended thinking) decisively beats o3-mini and DeepSeek R1, and is roughly comparable to Grok 3.
API users can precisely control the model's thinking time
Claude 3.7 Sonnet can fairly be called the strongest "software engineering AI": on SWE-bench, it achieved a high score of 70.3%.
At the same time, Anthropic's first agentic coding tool, Claude Code, was also released today as a preview.
It has already become an indispensable tool within Anthropic. In early tests, Claude completed, in a single pass, tasks that would take a human 45 minutes.
In other words, you act as the product manager while the AI works for you and writes the code.
Although this is not Claude 4, Anthropic's surprise move is another shock to the AI world.
This half month may well be the most eventful stretch for AI since the start of 2025.
Grok 3 was released just last week, DeepSeek ran five consecutive days of open-source releases this week, OpenAI's GPT-4.5 is rumored to be arriving soon, and now, with Claude 3.7 Sonnet, the melee among large models has begun again.
The world’s first “hybrid reasoning” model was born
In an official blog post, Anthropic stated that Claude 3.7 Sonnet is its smartest model to date and the first hybrid reasoning model on the market.
Claude 3.7 Sonnet can produce near-instant responses or show the user a detailed, step-by-step thought process. API users can also fine-tune how long the model thinks.
Claude 3.7 Sonnet has been significantly improved in terms of coding and front-end web development.
In addition, they also launched a command-line tool called Claude Code for agentic coding.
Currently available only as a limited research preview, Claude Code enables developers to delegate a wide range of engineering tasks to Claude directly from their terminal.
Reasoning is an integral LLM capability
The design philosophy of Claude 3.7 Sonnet differs from other reasoning models on the market.
Anthropic believes that just as humans use one brain to handle both quick reactions and deep thinking, reasoning should be the overall capability of the frontier model rather than a completely separate model. This unified approach provides users with a smoother experience.
Claude 3.7 Sonnet embodies this philosophy in several ways.
First, Claude 3.7 Sonnet is both a normal language model (LLM) and a reasoning model: you can choose when you want the model to answer normally and when you want it to think longer before answering.
In standard mode, Claude 3.7 Sonnet is an upgraded version of Claude 3.5 Sonnet.
In extended thinking mode, it reflects on itself before answering, which improves performance on math, physics, instruction following, coding, and many other tasks.
In general, prompting works similarly in both modes.
Secondly, when using Claude 3.7 Sonnet through the API, users can also control the thinking budget:
you can tell Claude to think for at most N tokens when answering, for any value of N up to its output limit of 128K tokens. This allows the user to trade off speed (and cost) against answer quality.
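A minimal sketch of what this looks like with the Anthropic Python SDK (the model ID and budget values here are illustrative; consult the API documentation for current parameter names and limits):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",   # illustrative model ID
    max_tokens=20000,                     # total output budget for the request
    thinking={
        "type": "enabled",
        "budget_tokens": 16000,           # "think for at most N tokens"
    },
    messages=[{"role": "user", "content": "Explain the Monty Hall problem."}],
)

# The response contains thinking blocks followed by the final text blocks
for block in response.content:
    print(block.type)
```

Raising `budget_tokens` generally buys more careful answers at the cost of latency and price, which is exactly the trade-off described above.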
Third, in developing its reasoning models, Anthropic optimized somewhat less for math and computer-science competition problems and instead shifted its focus to real-world tasks that better reflect how businesses actually use LLMs.
Claude 3.7 Sonnet achieved SOTA on SWE-bench Verified, a benchmark designed to evaluate the ability of AI models to solve real-world software problems.
Claude 3.7 Sonnet also sets a new SOTA on TAU-bench, a framework for testing AI agents' ability to interact with users and tools in complex real-world tasks.
As mentioned earlier, Claude 3.7 Sonnet has achieved significant performance improvements in almost all major benchmarks.
Compared with the latest Grok 3 Beta model, Claude 3.7 Sonnet (64k extended thinking) is almost on par in terms of reasoning, but slightly inferior to Grok 3 Beta in terms of mathematical and visual reasoning.
Compared with o3-mini and DeepSeek R1, except for mathematics, Claude 3.7 Sonnet with extended thinking mode scored the highest.
Claude 3.7 Sonnet excels in instruction following, general reasoning, multimodal ability, and agentic coding, and its extended thinking mode brings significant improvements in mathematics and science. Beyond traditional benchmarks, it even surpassed all previous models in the Pokémon game test.
AI coding agent completes a 45-minute task in one go
Since June 2024, the Sonnet series has been the model of choice for developers around the world.
Today, Anthropic’s first agent coding tool, Claude Code, was released as a limited research preview.
Claude Code actively collaborates with people, with the ability to search and read code, edit files, write and run tests, commit and push code to GitHub, and use command-line tools—all while ensuring that users can participate every step of the way.
Additionally, this update improves the coding experience on Claude.ai.
All Claude plans now support GitHub integration—developers can connect their code repositories directly to Claude.
As Anthropic’s most powerful coding model to date, Claude 3.7 Sonnet can provide a deeper understanding of personal projects, work projects, and open source projects, and become a powerful assistant for fixing bugs, developing new features, and writing GitHub documentation.
Claude Code is still in its early stages, but it has already become an indispensable tool for the Anthropic team, especially for test-driven development, debugging complex problems, and large-scale refactoring.
In early testing, it was able to complete a task that would normally take more than 45 minutes of manual work in a single go, significantly reducing development time and effort.
In the coming weeks, Anthropic plans to continue to improve it based on usage: improving the reliability of tool invocations, adding support for long-running commands, improving in-app rendering, and expanding Claude’s understanding of its own capabilities.
New Test-Time Scaling
Claude as an AI agent
Claude 3.7 Sonnet has a new feature called “action scaling” – this improvement enables it to iteratively call functions, respond to changes in the environment, and continue to operate until an open-ended task is completed.
For example, in computer use: Claude can complete tasks on behalf of users by issuing virtual mouse clicks and keystrokes. Compared with its predecessor, Claude 3.7 Sonnet can devote more interaction turns to computer-use tasks, and when given more time and compute it often achieves better results.
This progress was fully demonstrated in the OSWorld evaluation, a testbed for evaluating the capabilities of multimodal AI agents.
Claude 3.7 Sonnet showed better performance from the start, and its advantage keeps widening over time as it continues to interact with the virtual computer.
Claude's extended thinking, combined with AI-agent training, not only helped it achieve better performance on many standard evaluations such as OSWorld, but also enabled major breakthroughs on some other, unexpected tasks.
Take playing games, for example, specifically the classic Game Boy title Pokémon Red. They equipped Claude with basic memory, screen-pixel input, and function-call capabilities for keystrokes and screen navigation, allowing it to go beyond its normal context limits and keep playing, sustaining tens of thousands of consecutive interactions.
In the image below, they compare the progress of Claude 3.7 Sonnet with extended thinking against previous versions of Claude Sonnet in the Pokémon game.
As shown in the picture, the early version had difficulty advancing at the beginning of the game, and Claude 3.0 Sonnet could not even walk out of the initial hut in Pallet Town where the story started.
Claude 3.7 Sonnet, by contrast, made significant progress with its improved agent capabilities, successfully challenging and defeating three gym leaders and earning the corresponding badges.
Claude 3.7 Sonnet excels at trying multiple strategies and re-examining existing assumptions, which allows it to continually improve its capabilities during the game.
Scaling serial and parallel test-time compute
When Claude 3.7 Sonnet uses its extended thinking capabilities, it is taking advantage of "serial test-time compute".
Specifically, it performs multiple consecutive reasoning steps before generating the final output, while continuously increasing the input of computing resources in the process.
Overall, this mechanism improves performance in a predictable way: for example, in solving math problems, accuracy grows logarithmically with the number of “thinking tokens” allowed to be sampled.
Anthropic researchers are also exploring the use of parallel test-time compute to improve model performance.
The specific method is to sample multiple independent thought processes and select the best result without knowing the correct answer in advance. This can be achieved through a majority voting or consensus voting mechanism, that is, the answer that appears most frequently is selected as the “best” answer.
Alternatively, another LLM can be used to verify the answers, or a trained scoring model can select the best one.
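A minimal sketch of the majority-voting idea in Python (the sampling function here is a stand-in, not Anthropic's implementation):

```python
from collections import Counter
from typing import Callable

def majority_vote(sample_answer: Callable[[], str], n_samples: int = 16) -> str:
    """Sample several independent answers and return the most frequent one.

    `sample_answer` stands in for one full reasoning pass of the model
    (e.g. one API call with extended thinking enabled).
    """
    answers = [sample_answer() for _ in range(n_samples)]
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer

# Toy example: a sampler that is right 60% of the time still votes its way
# to the correct answer almost always.
import random
noisy = lambda: "42" if random.random() < 0.6 else str(random.randint(0, 9))
print(majority_vote(noisy, n_samples=64))  # almost always "42"
```

A scoring-model variant would replace the frequency count with a learned scorer that ranks the sampled answers and returns the top-scoring one.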
These optimization strategies (and related research work) have been verified in evaluation reports of multiple AI models.
In the GPQA evaluation, they achieved breakthrough results by scaling parallel test-time compute.
Specifically, by spending compute equivalent to 256 independent samples, combining it with a trained and optimized scoring model, and setting a maximum thinking budget of 64,000 tokens, Claude 3.7 Sonnet achieved an overall score of 84.8% on GPQA (including 96.5% on the physics section).
It is worth noting that even beyond the limits of regular majority voting, model performance continues to improve.
The following figure lists the detailed results of the scoring model method and the majority voting method.
These methods can improve the quality of Claude's answers, often without making the user wait longer: because the multiple deep-thinking passes run simultaneously, Claude can explore more problem-solving paths in parallel and significantly increase how often it arrives at the correct answer.
A three-step roadmap: the Claude collaborator is here
Claude 3.7 Sonnet and Claude Code mark an important step towards artificial intelligence systems that truly enhance human capabilities.
With their ability to reason deeply, work autonomously, and collaborate effectively, they bring us closer to a future where AI enriches what humans can achieve.
Now, Claude’s collaborator has arrived.
The latest version is free to use
It is worth mentioning that Claude 3.7 Sonnet is now available on the Claude.ai platform, and Web, iOS and Android users can experience it for free.
For developers looking to build custom AI solutions, access is available through the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI.
In both standard and extended thinking modes, Claude 3.7 Sonnet is priced the same as its predecessors: $3 per million input tokens and $15 per million output tokens, with thinking tokens billed as output tokens.
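As a quick illustration of the arithmetic (the token counts below are assumed for the example; the rates are the ones quoted above):

```python
# Cost of a single request at $3/M input and $15/M output tokens,
# where extended-thinking tokens are billed as output tokens.
input_tokens = 2_000        # assumed prompt size
thinking_tokens = 16_000    # assumed thinking tokens actually used
answer_tokens = 1_000       # assumed final answer length

cost = (input_tokens * 3 + (thinking_tokens + answer_tokens) * 15) / 1_000_000
print(f"${cost:.4f}")  # $0.2610 for this example
```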
Anthropic plan pricing
Hands-on testing by an AI heavyweight
Ethan Mollick, a professor at the Wharton School of the University of Pennsylvania, has been testing Claude 3.7 over the past few days.
Claude 3.7 often gives him the same feeling he had when first using ChatGPT-4: a mixture of wonder at its capabilities and a little trepidation. Take Claude's native coding ability as an example: it is now possible to get a working program from a natural conversation or a document, without any programming skills.
For example, he gave Claude a proposal for a new AI educational tool and, in the conversation, asked it to "show the proposed system architecture in 3D and make it interactive." It generated an error-free interactive visualization of the proposal's core design.
The graphics are neat, but that is not the most impressive part. What is really impressive is that Claude took the initiative to build a step-by-step walkthrough explaining the concept, which was not something he had asked it to do.
This anticipation of demand and thinking about new methods is a new breakthrough in the field of AI.
To give a more interesting example, Ethan Mollick told Claude: “Make me an interactive time machine that allows me to travel back in time and have interesting things happen. Pick some unusual time for me to go back to…” and “Add more images.”
Just those two prompts later, a fully functional interactive experience emerged, even complete with crude but charming pixelated images (which were actually surprisingly impressive — the AI had to “draw” these images using pure code, without being able to see what it was creating, like a blindfolded artist).
References:
- https://www.anthropic.com/news/claude-3-7-sonnet
- https://x.com/alexalbert__/status/1894093648121532546
- https://x.com/AnthropicAI/status/1894092430560965029
- https://www.oneusefulthing.org/p/a-new-generation-of-ais-claude-37
Author: 新智元 (Xinzhiyuan)
Source:https://mp.weixin.qq.com/s/JZ-HYbSj9R48UZwyvTQ1XQ
The copyright belongs to the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.