
DeepSeek secrets unveiled: engineers reveal science behind Chinese AI model

By Victoria Bela


Engineers behind the viral Chinese artificial intelligence (AI) reasoning model DeepSeek-R1 have unveiled the deep science behind its training.
Upon its release in January, the open-source model developed by Hangzhou-based AI start-up DeepSeek sent shock waves through the industry when it became a challenger to US-based OpenAI’s industry-leading o1 model.
Now, the DeepSeek AI team has revealed how they used rewards to train their R1 model to solve problems, allowing them to bypass some of the costly computational and scaling barriers to teaching AI models to reason like humans.
“General reasoning represents a long-standing and formidable challenge in artificial intelligence,” the team said in a paper published in the peer-reviewed journal Nature on Wednesday.
Reasoning, or the logical process of using existing knowledge and new information to form conclusions, is a cornerstone of human cognition.
It allows for the execution of complex cognitive tasks, including mathematical problem solving, making it a key element in developing more advanced, humanlike AI.
The development of large language models (LLMs), which can understand and generate humanlike language, has achieved notable success in foundational reasoning tasks.

“However, this success is heavily contingent on extensive human-annotated demonstrations, and the capabilities of models are still insufficient for more complex problems,” the DeepSeek AI team said.
This limits scalability and can introduce human biases into model training. It could also limit the exploration of superior reasoning pathways beyond the capabilities of the human mind.
To overcome this barrier, the team used reinforcement learning to allow their LLM to develop reasoning abilities through self-evolution.
Reinforcement learning is a process in which an agent, or model, learns by interacting with its environment through trial and error, receiving rewards or penalties for certain actions.
When applied to a language model like DeepSeek’s, the model generates an output in response to a prompt and then receives feedback in the form of reward signals, which it uses to improve its responses.
“Rather than explicitly teaching the model how to solve a problem, we simply provide it with the right incentives and it autonomously develops advanced problem-solving strategies,” the researchers said.
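To make that reward-driven loop concrete, the following is a minimal, self-contained sketch in plain Python, not DeepSeek's actual pipeline: a toy "policy" samples one of a few candidate answers to a maths prompt, receives a rule-based reward of 1 only when the answer is correct, and nudges its preference scores with a simple REINFORCE-style update. The prompt, candidate answers, learning rate and variable names are all invented for illustration.

    # Toy, pure-Python illustration of reward-driven learning (hypothetical,
    # not DeepSeek's pipeline): a "policy" holds one preference score per
    # candidate answer, samples an answer, and is rewarded only when correct.
    import math
    import random

    random.seed(0)

    PROMPT = "What is 17 * 24?"
    CANDIDATES = ["398", "408", "418"]   # answers the toy model can emit
    CORRECT = "408"

    # Policy parameters: one preference score (logit) per candidate answer.
    logits = [0.0, 0.0, 0.0]

    def sample(logits):
        """Sample a candidate index from the softmax over preference scores."""
        weights = [math.exp(l) for l in logits]
        total = sum(weights)
        r = random.random() * total
        for i, w in enumerate(weights):
            r -= w
            if r <= 0:
                return i
        return len(weights) - 1

    def reward(answer):
        """Rule-based reward signal: 1 for a correct answer, 0 otherwise."""
        return 1.0 if answer == CORRECT else 0.0

    LEARNING_RATE = 0.5
    for step in range(200):
        i = sample(logits)                 # the model produces an output
        r = reward(CANDIDATES[i])          # the environment scores it
        # REINFORCE-style update: push up the sampled answer's score in
        # proportion to (reward - baseline) and push the others down, where
        # the baseline is the expected reward under the current policy.
        probs = [math.exp(l) for l in logits]
        total = sum(probs)
        probs = [p / total for p in probs]
        baseline = sum(p * reward(c) for p, c in zip(probs, CANDIDATES))
        advantage = r - baseline
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += LEARNING_RATE * advantage * grad

    print("prompt:", PROMPT)
    print("final preference scores:", [round(l, 2) for l in logits])
    print("most preferred answer:",
          CANDIDATES[max(range(len(logits)), key=lambda j: logits[j])])

Run for enough steps, the update concentrates probability on the correct answer without anyone ever showing the model a worked solution, which is the same basic incentive mechanism, at toy scale, that the paper applies to a full language model.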
This incentive-based approach allowed the team to bypass a supervised fine-tuning stage in initial model training, in which a database of human-written prompts and ideal responses is used to tweak the model.
The resulting DeepSeek-R1 model, which still requires some human supervision to train, achieved “superior performance” in mathematics and coding competitions, surpassing conventionally trained counterparts.
“This design choice originates from our hypothesis that human-defined reasoning patterns may limit model exploration, whereas unrestricted [reinforcement learning] training can better incentivise the emergence of new reasoning capabilities in LLMs,” the team said.

The researchers started by using an efficient reinforcement learning pipeline applied to their DeepSeek-V3 model base, which allowed the resulting DeepSeek-R1-Zero model to naturally develop “diverse and sophisticated reasoning behaviours”.
They used several benchmarks to evaluate the model throughout the training process, including the 2024 American Invitational Mathematics Examination (AIME). The DeepSeek-R1-Zero model’s average pass score on the AIME rose from 15.6 per cent to 77.9 per cent over the course of training, surpassing the accuracy of human participants.
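For readers curious how an average pass score of this kind is typically computed, here is a hypothetical sketch, not the paper's evaluation code: several answers are sampled per problem and the fraction of correct samples is averaged across the benchmark. The function name and the toy data are invented for illustration.

    # Hypothetical sketch of an average pass rate on a maths benchmark such
    # as AIME: sample several answers per problem, score each sample as
    # correct or not, and average the per-problem pass fractions.
    def average_pass_rate(model_answers, reference_answers):
        """model_answers: {problem_id: [sampled answers]}
        reference_answers: {problem_id: correct answer}
        Returns the mean per-problem fraction of correct samples."""
        per_problem = []
        for pid, samples in model_answers.items():
            correct = sum(1 for a in samples if a == reference_answers[pid])
            per_problem.append(correct / len(samples))
        return sum(per_problem) / len(per_problem)

    # Toy usage with made-up data: two problems, four samples each.
    answers = {"p1": ["42", "42", "41", "42"], "p2": ["7", "8", "8", "8"]}
    refs = {"p1": "42", "p2": "8"}
    print(f"average pass rate: {average_pass_rate(answers, refs):.1%}")  # 75.0%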
Despite these gains, the DeepSeek-R1-Zero model still faced challenges such as language mixing, because it was trained on several languages, including Chinese and English.
This prompted the team to carry out further training to develop the DeepSeek-R1 model, which inherited the reasoning capabilities of its predecessor while aligning its behaviour more closely with human preferences.
This model reached an accuracy rate of 79.8 per cent in the AIME and saw improvements in some other coding and mathematics benchmarks.
“This serves as a reminder of the potential of [reinforcement learning] to unlock higher levels of capabilities in LLMs, paving the way for more autonomous and adaptive models in the future,” the team wrote.
The team noted that advancements in reasoning capabilities had potential ethical risks, including the possibility of jailbreak attacks, which attempt to “trick” the model into bypassing its safety features.
The model still faces limitations that the authors said they hoped to address in the future, including an inability to use external tools, such as calculators, to improve its performance, as well as its scores on software engineering benchmarks.