DeepSeek’s success shows why motivation is key to AI innovation


The AI landscape was shaken up in January 2025. The seemingly unstoppable OpenAI and the mighty American tech giants were shocked by what can certainly be called an underdog in the field of large language models (LLMs). DeepSeek, a Chinese company that was not on anyone’s radar, suddenly challenged OpenAI. DeepSeek-R1 was not superior to the top American models; it lagged slightly behind on benchmarks, but it suddenly made everyone think about efficiency in terms of hardware and energy use.

Given the lack of availability of the best high-end hardware, it seems that DeepSeek was motivated to innovate in the area of efficiency, which was a lesser concern for the larger players. OpenAI has claimed there is evidence suggesting DeepSeek may have used its model for training, but there is no concrete proof to support the claim. So, whether it is true or whether OpenAI is simply trying to appease its investors is a matter of debate. DeepSeek, however, has made its work public, and people have verified that the results are reproducible, at least on a much smaller scale.

But how could DeepSeek achieve such cost savings while American companies could not? The short answer is simple: they had more motivation. The longer answer requires a little more technical explanation.

DeepSeek used KV cache optimization

One important cost saving on GPU memory was the optimization of the key-value (KV) cache used in every attention layer of an LLM.

LLMs are made up of transformer blocks, each of which comprises an attention layer followed by a regular vanilla feed-forward network. The feed-forward network conceptually models arbitrary relationships, but in practice it cannot always determine the patterns in the data. The attention layer solves this problem for language modeling.
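
To make the structure concrete, here is a minimal sketch of one such block in Python, assuming made-up sizes and random stand-in weights (real blocks also add residual connections and normalization):

```python
import numpy as np

d = 16
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4 * d, d)) * 0.1   # feed-forward expansion weights (illustrative)
W2 = rng.normal(size=(d, 4 * d)) * 0.1   # feed-forward projection weights (illustrative)

def attention(x):
    # Placeholder for the attention layer explained in the following paragraphs.
    return x

def feed_forward(x):
    # Two linear layers with a ReLU non-linearity in between.
    return W2 @ np.maximum(W1 @ x, 0)

def transformer_block(x):
    # An attention layer followed by a plain feed-forward network.
    return feed_forward(attention(x))

out = transformer_block(rng.normal(size=d))
```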

The model processes text using tokens, but for simplicity we will refer to them as words. In an LLM, each word gets assigned a vector in a high dimension (say, a thousand dimensions). Conceptually, each dimension represents a concept, like being hot, being green, being soft, being a noun. A word’s vector representation is its meaning and its values along each dimension.
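
As a toy illustration only, with made-up dimension names and values (in a real LLM, the dimensions are learned and are not individually human-interpretable), a word vector might look like this:

```python
import numpy as np

# Hypothetical, human-readable dimensions for illustration.
dimensions = ["hot", "cold", "green-ness", "soft", "noun-ness"]

# A made-up embedding for the word "apple": somewhat green, somewhat soft,
# clearly a noun.
apple = np.array([0.1, 0.2, 0.6, 0.5, 0.9])

for name, value in zip(dimensions, apple):
    print(f"{name}: {value}")
```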

However, our language allows other words to modify the meaning of each word. For example, an apple has a meaning. But we can have a green apple as a modified version. A more extreme example of modification would be that an apple in an iPhone context differs a great deal from an apple in a meadow context. How do we let our system modify the vector meaning of a word based on another word? This is where attention comes in.

The attention model assigns two other vectors to each word: a key and a query. The query represents the qualities of a word’s meaning that can be modified, and the key represents the type of modifications it can provide to other words. For example, the word “green” can provide information about color and green-ness. So the key of the word “green” will have a high value on the “green-ness” dimension. On the other hand, the word “apple” can be green or not, so the query vector of “apple” will also have a high value for the green-ness dimension. If we take the dot product of the key of “green” with the query of “apple,” the product should be relatively large compared to the product of the key of “table” and the query of “apple.” The attention layer then adds a small fraction of the value of the word “green” to the value of the word “apple.” This way, the value of the word “apple” is modified to be a little greener.
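
A minimal sketch of this dot-product idea, using made-up three-dimensional vectors in which the first dimension loosely stands for green-ness, might look like this:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy vectors; real models learn these and use far more dimensions.
keys = {
    "green": np.array([0.9, 0.1, 0.0]),   # offers a modification about green-ness
    "table": np.array([0.0, 0.2, 0.8]),   # offers little about green-ness
}
values = {
    "green": np.array([1.0, 0.0, 0.0]),
    "table": np.array([0.0, 0.0, 1.0]),
}
query_apple = np.array([0.8, 0.1, 0.1])   # "apple" is open to modification in green-ness

# Dot products: the key of "green" matches the query of "apple" far better
# than the key of "table" does.
context = ["green", "table"]
scores = np.array([query_apple @ keys[w] for w in context])
weights = softmax(scores)

# A small fraction of the context values is added to the value of "apple",
# making it "a little greener".
value_apple = np.array([0.2, 0.5, 0.3])
value_apple = value_apple + 0.1 * sum(w * values[word] for w, word in zip(weights, context))
print(weights, value_apple)
```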

When the LLM generates text, it does so one word after another. When it generates a word, all the previously generated words become part of its context. However, the keys and values of those words have already been computed. When another word is added to the context, its value needs to be updated based on its own query and the keys and values of all the previous words. That’s why all those keys and values are stored in GPU memory. This is the KV cache.
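
A rough sketch of such a cache during generation, with random stand-in projection matrices and a simplified single-head attention step, could look like the following:

```python
import numpy as np

# Assumed sizes and random weights for illustration only.
d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []          # keys and values of all previous words

def attend(x):
    """Process one new word vector, reusing cached keys and values."""
    q = W_q @ x
    k_cache.append(W_k @ x)        # computed once, then kept in GPU memory
    v_cache.append(W_v @ x)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)    # query of the new word against all keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V             # updated value for the new word

for _ in range(5):                 # pretend we generate five words
    out = attend(rng.normal(size=d))
```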

DeepSeek determined that the key and the value of a word are related. So, the meaning of the word green and its ability to affect green-ness are obviously very closely related. It is therefore possible to compress both as a single (and maybe smaller) vector and decompress it very easily while processing. DeepSeek found that this does affect the performance on benchmarks, but it saves a lot of GPU memory.
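
The following is a loose sketch of the compression idea only, not DeepSeek’s actual architecture; the sizes and projection matrices are assumptions for illustration:

```python
import numpy as np

# Instead of caching a separate key and value per word, cache one smaller
# latent vector and decompress it into a key and a value when needed.
d, d_latent = 1024, 128            # assumed sizes
rng = np.random.default_rng(0)
W_down = rng.normal(size=(d_latent, d)) * 0.01   # compress
W_up_k = rng.normal(size=(d, d_latent)) * 0.01   # decompress into a key
W_up_v = rng.normal(size=(d, d_latent)) * 0.01   # decompress into a value

x = rng.normal(size=d)             # a word's hidden vector
latent = W_down @ x                # only this small vector is cached

k = W_up_k @ latent                # reconstructed on the fly during attention
v = W_up_v @ latent

print(f"cache per word: {latent.size} floats instead of {2 * d}")
```

Caching only the small latent vector per word is what shrinks the KV cache; the keys and values are reconstructed on demand when attention is computed.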

DeepSeek applied MoE

The nature of neural networks is that the entire network needs to be evaluated (or computed) for every query. However, not all of this is useful computation. Knowledge of the world sits in the weights or parameters of a network. Knowledge about the Eiffel Tower is not used to answer questions about the history of South American tribes. Knowing that an apple is a fruit is not useful while answering questions about the general theory of relativity. However, when the network is computed, all parts of it are processed regardless, which incurs huge computational costs during text generation that should ideally be avoided. This is where the idea of mixture of experts (MoE) comes in.

In an MoE model, the neural network is divided into multiple smaller networks called experts. Note that the “expert” in the subject matter is not explicitly defined; the network figures it out during training. However, the network assigns some relevance score to each query and only activates the parts with higher matching scores. This provides huge cost savings in computation. Note that some questions need expertise in multiple areas to be answered properly, and the performance of such queries will be degraded. However, because the areas are figured out from the data, the number of such questions is minimized.
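
A minimal sketch of this routing idea, with assumed sizes and random stand-in experts, might look like this:

```python
import numpy as np

# A small "router" scores every expert for the current token, and only the
# top-scoring experts are actually computed.
d, n_experts, top_k = 16, 8, 2     # assumed sizes for illustration
rng = np.random.default_rng(0)
router = rng.normal(size=(n_experts, d))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

def moe_layer(x):
    scores = router @ x                         # relevance score per expert
    chosen = np.argsort(scores)[-top_k:]        # activate only the best matches
    weights = np.exp(scores[chosen])
    weights /= weights.sum()
    # Only top_k of the n_experts sub-networks are computed for this token.
    return sum(w * (experts[i] @ x) for w, i in zip(weights, chosen))

out = moe_layer(rng.normal(size=d))
```

Because only the selected experts run, the per-token cost scales with top_k rather than with the total number of experts.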

The importance of reinforcement learning

An LLM is typically taught to think in a chain-of-thought model by fine-tuning it to imitate thinking before delivering the answer. The model is asked to verbalize its thought (generate the thought before generating the answer). The model is then evaluated on both the thought and the answer, and trained with reinforcement learning (rewarded for a correct match and penalized for an incorrect match with the training data).

This requires expensive training data with thought tokens. DeepSeek instead asked the system to generate its thoughts between the tags <think> and </think> and to generate its answers between the tags <answer> and </answer>. The model is rewarded or penalized purely based on the form (the use of the tags) and the match of the answers. This required much less expensive training data. During the early stages of RL, the model generated very little thought and the answers were often incorrect. Eventually, the model learned to generate both long and coherent thoughts, and from that point on, the quality of the answers improved significantly.
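
In the spirit of what is described above, a rule-based reward could be sketched roughly as follows; the scoring values and exact tag-checking details here are illustrative assumptions, not DeepSeek’s published recipe:

```python
import re

def reward(output: str, reference_answer: str) -> float:
    """Reward the tag format and the correctness of the final answer only,
    without needing labelled thoughts in the training data."""
    score = 0.0
    well_formed = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>",
                            output, re.DOTALL)
    if well_formed:
        score += 0.5                              # format reward
        answer = well_formed.group(1).strip()
        if answer == reference_answer.strip():
            score += 1.0                          # accuracy reward
    return score

print(reward("<think>2 + 2 is 4</think><answer>4</answer>", "4"))  # 1.5
print(reward("the answer is 4", "4"))                              # 0.0
```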

DeepSeek employs several additional optimization tricks. However, they are highly technical, so I will not delve into them here.

Final thoughts on DeepSeek and the bigger market

Every technological research effort first needs to explore what is possible before improving efficiency; this is a natural progression. DeepSeek’s contribution to the LLM landscape is phenomenal. Its academic contribution cannot be ignored, whether or not it was trained using OpenAI output. It can also transform the way startups operate. But there is no reason for OpenAI or the other American giants to despair. This is how research works: one group benefits from the research of other groups. DeepSeek certainly benefited from the earlier research performed by Google, OpenAI and numerous other researchers.

However, the idea that OpenAI will dominate the LLM world indefinitely is now very unlikely. No amount of regulatory lobbying or finger-pointing will preserve its monopoly. The technology is already in the hands of many and out in the open, making its progress unstoppable. Although this may be a bit of a headache for OpenAI’s investors, it is ultimately a win for the rest of us. While the future belongs to many, we will always remain thankful to early contributors like Google and OpenAI.

Debasish Ray Chawdhuri is a Senior Principal Engineer at Talentica Software.


