* Famously:
+ $\text{Attention}(Q, K, V) = \text{softmax}( \frac{QK^T}{\sqrt{d_k}})V$
* Or, written for our current query $\hat{k}$ against the cached context:
+ $\text{Attention}(\hat{k}, \text{context}) = \text{softmax}\left( \frac{\hat{k} \cdot [k_0 \dots k_t]}{\sqrt{d_k}} \right) [v_0 \dots v_t]$
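* The single-query form above can be sketched in a few lines of numpy (names like `attend` are illustrative, not from any library):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(q, keys, values):
    # scaled dot-product of the current query against all cached keys
    d_k = q.shape[-1]
    scores = keys @ q / np.sqrt(d_k)   # shape (t+1,)
    weights = softmax(scores)          # attention distribution over the context
    return weights @ values            # weighted sum of the cached values

rng = np.random.default_rng(0)
d_k = 8
keys = rng.normal(size=(5, d_k))      # [k_0 ... k_t]
values = rng.normal(size=(5, d_k))    # [v_0 ... v_t]
q = rng.normal(size=d_k)              # current query
out = attend(q, keys, values)         # one d_k-dimensional output vector
```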

* Memory = Context
+ update the memory $M$ by inserting the new (key, value) pair at each step
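* One minimal sketch of this idea, assuming the memory is just an append-only list of (key, value) pairs that doubles as the attention context (the `KVMemory` name is hypothetical):

```python
import numpy as np

class KVMemory:
    """Memory = context: a growing store of (key, value) pairs."""

    def __init__(self):
        self.keys = []
        self.values = []

    def insert(self, k, v):
        # update M by appending the new (key, value) pair
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        # attend over everything inserted so far
        K = np.stack(self.keys)            # (t+1, d_k)
        V = np.stack(self.values)          # (t+1, d_v)
        d_k = q.shape[-1]
        scores = K @ q / np.sqrt(d_k)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V

mem = KVMemory()
rng = np.random.default_rng(1)
for _ in range(3):
    mem.insert(rng.normal(size=4), rng.normal(size=4))
out = mem.attend(rng.normal(size=4))
```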