* Famously:
+ $\text{Attention}(Q, K, V) = \text{softmax}( \frac{QK^T}{\sqrt{d_k}})V$
* Or, written for our current query $\hat{k}$ against the cached context:
+ $\text{Attention}(\hat{k}, \text{context}) = \text{softmax}\left( \frac{\hat{k} \cdot [k_0 \dots k_t]}{\sqrt{d_k}} \right) [v_0 \dots v_t]$
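* The single-query form above can be sketched in a few lines of numpy (names like `attend` are illustrative, not from any library):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(q, keys, values):
    # scaled dot-product of the current query against all cached keys
    d_k = q.shape[-1]
    scores = keys @ q / np.sqrt(d_k)   # shape (t+1,)
    weights = softmax(scores)          # attention distribution over the context
    return weights @ values            # weighted sum of the cached values

rng = np.random.default_rng(0)
d_k = 8
keys = rng.normal(size=(5, d_k))      # [k_0 ... k_t]
values = rng.normal(size=(5, d_k))    # [v_0 ... v_t]
q = rng.normal(size=d_k)              # current query
out = attend(q, keys, values)         # one d_k-dimensional output vector
```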

* Memory = Context
+ update the memory $M$ by inserting the new (key, value) pair at each step
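* One minimal sketch of this idea, assuming the memory is just an append-only list of (key, value) pairs that doubles as the attention context (the `KVMemory` name is hypothetical):

```python
import numpy as np

class KVMemory:
    """Memory = context: a growing store of (key, value) pairs."""

    def __init__(self):
        self.keys = []
        self.values = []

    def insert(self, k, v):
        # update M by appending the new (key, value) pair
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        # attend over everything inserted so far
        K = np.stack(self.keys)            # (t+1, d_k)
        V = np.stack(self.values)          # (t+1, d_v)
        d_k = q.shape[-1]
        scores = K @ q / np.sqrt(d_k)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V

mem = KVMemory()
rng = np.random.default_rng(1)
for _ in range(3):
    mem.insert(rng.normal(size=4), rng.normal(size=4))
out = mem.attend(rng.normal(size=4))
```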