The breakthrough that powers every modern LLM — attention mechanisms, and why 2017 changed everything
The Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. It replaced recurrent neural networks (RNNs) as the dominant architecture for sequence modeling and is the foundation of every major LLM today.
RNNs processed sequences one token at a time: each step depended on the output of the previous step, so computation could not be parallelized across the sequence, and information from early tokens faded over long sequences.
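A minimal sketch of that sequential bottleneck, assuming a made-up `rnn_step` recurrence (the weights and update rule here are placeholders, not any particular RNN):

```python
import math

def rnn_step(prev_hidden, token_embedding):
    # Hypothetical recurrence: mix the previous state with the current input, squash with tanh.
    return [math.tanh(0.5 * h + 0.5 * x) for h, x in zip(prev_hidden, token_embedding)]

sequence = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy token embeddings
hidden = [0.0, 0.0]
for token in sequence:
    hidden = rnn_step(hidden, token)  # step t must wait for step t-1: no parallelism
print(hidden)
```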
Each token's embedding is projected into three vectors:
- Query (Q): what this token is looking for
- Key (K): what this token offers for other tokens to match against
- Value (V): the information this token contributes if it is attended to
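As a rough sketch of how those vectors might be produced (the tiny weight matrices below are arbitrary placeholders; in a real model W_Q, W_K, and W_V are learned):

```python
# Toy projection of one token embedding into query, key, and value vectors.
def project(embedding, weight_matrix):
    # Matrix-vector product: one output element per row of the weight matrix.
    return [sum(w * x for w, x in zip(row, embedding)) for row in weight_matrix]

embedding = [0.2, 0.7]          # toy 2-dimensional token embedding
W_Q = [[1.0, 0.0], [0.0, 1.0]]  # placeholder matrices; real models learn these
W_K = [[0.5, 0.5], [0.0, 1.0]]
W_V = [[1.0, 1.0], [1.0, 0.0]]

q = project(embedding, W_Q)
k = project(embedding, W_K)
v = project(embedding, W_V)
print(q, k, v)
```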
Attention computes how much each token should "attend" to every other token.
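In the notation of the original paper, this is scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the key dimension; the code further down implements exactly this for a single query vector.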
Transformers enabled:
- Training in parallel across the whole sequence, since no step waits for the previous one
- Direct connections between distant tokens, so long-range context is not lost
- Scaling to the very large models and datasets behind today's LLMs
| Misconception | Why it's wrong | Reality |
|---|---|---|
| Transformers are only for text | The architecture is general-purpose; attention works on any sequence | Also used for vision, audio, and proteins |
| Bigger is always better | Scaling brings diminishing returns | Data quality and training choices matter too |
| Attention weights = reasoning | Attention is a computational mechanism, not an explanation | Attention maps are not a reliable explanation tool |
Implement simplified attention:
```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    # Similarity of the query with each key (dot product).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Scale by sqrt(d_k) so larger dimensions don't push the softmax into saturation.
    d_k = len(query)
    scores = [s / math.sqrt(d_k) for s in scores]
    # Softmax turns the scores into weights that sum to 1.
    exp_s = [math.exp(s) for s in scores]
    total = sum(exp_s)
    weights = [e / total for e in exp_s]
    # Output is the weighted sum of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return output, weights

# Test: Does 'bank' attend more to 'river' or 'money'?
query = [1, 0]  # query representing a river-like context
keys = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
values = [[1, 0], [0, 1], [0.5, 0.5]]
tokens = ['river', 'money', 'the']

_, weights = attention(query, keys, values)
for token, weight in zip(tokens, weights):
    print(f'Attention to {token}: {weight:.3f}')
```
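With these toy vectors the weights come out to roughly 0.431 for 'river', 0.245 for 'money', and 0.325 for 'the': the river-context query attends most strongly to 'river', which is the behavior the test is checking for.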