# News from the Frontier
#### Machine Learning Singapore
[Martin Andrews](http://mdda.net) @ [reddragon.ai](http://reddragon.ai/)
[Sam Witteveen](http://samwitteveen.com) @ [reddragon.ai](http://reddragon.ai/)
25-March-2025
---

## Today's Line-up

* "Latent Space Reasoning"
  - _Martin Andrews_
* "Open-endedness and Interestingness"
  - _Jenny Zhang_
* "My ICLR Highlights"
  - _Raymond Chan_
* "Agent Announcements & Trends from Google Cloud Next 2025"
  - _Sam Witteveen_

---

# Reasoning in Latent Space
#### Machine Learning Singapore
[Martin Andrews](http://mdda.net) @ [reddragon.ai](http://reddragon.ai/)
25-March-2025
---

## About Me

* Machine Intelligence / Startups / Finance
  + Moved from NYC to Singapore in Sep-2013
* 2014 = 'fun' :
  + Machine Learning, Deep Learning, NLP
  + Robots, drones
* Since 2015 = 'serious' :: NLP + deep learning
  + Including Papers...
  + & GDE ML; ML-Singapore co-organiser...
  + & Red Dragon AI...

--

## About Red Dragon AI

* Deep Learning Consulting & Prototyping (Google Partner)
  - Education / Training
  - Research : NeurIPS / EMNLP / NAACL / ICML / ICLR
* Please contact us for :
  - Language model training (eg: on-prem)
  - Knowledgebase interaction & reasoning
  - Sales-oriented applications

---

## Outline

* Latent Space Reasoning
  + What's involved?
  + Three different techniques
* ICLR
  + Three paper takeaways
* Heads-Up!
* Wrap-up & QR-code

---

## GPT Models
;#### Training and Inference
**G**enerative **P**re-trained **T**ransformer = "Autoregressive" (OpenAI, 2018)
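--

## GPT Models
#### Autoregression, sketched

A toy numpy sketch of the autoregressive loop - the `model` stub here returns random logits, where a real GPT would run its Transformer layers:

```python
# Toy autoregressive decoding: sample a token, append it, repeat
import numpy as np

VOCAB = 262_144                       # ~262k dictionary = 2**18 = 18 bits/token
rng = np.random.default_rng(42)

def model(tokens: list[int]) -> np.ndarray:
    """Stub: a real GPT would run `tokens` through the Transformer here."""
    return rng.normal(size=VOCAB)     # logits for the next token

def generate(prompt: list[int], n_new: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        logits = model(tokens)
        p = np.exp(logits - logits.max())
        p /= p.sum()                                 # SoftMax over the vocab
        tokens.append(int(rng.choice(VOCAB, p=p)))   # feed the new token back in
    return tokens

print(generate([1, 2, 3], n_new=5))
```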
--

## Tokens in-and-out

* Tokens are the inputs
  + From a dictionary of ~262k - which is $2^{18}$
    - i.e. 18 bits of information
  + But 'hidden dim' is [3584d](https://github.com/google/gemma_pytorch/blob/main/gemma/config.py)
    - so there's "Tons of Space"
* and the last layer converts...
  + 3584d → 1-of-262k (SoftMax)
  + But *surely* there is rich information inside...
;* Why are we throwing away information?

--

## The Latent Space

* We only train these models on tokens
  + but they seem to have a *sense* of the topic
  + ... over a *longer timescale* than 1 token
* Can we 'unpack' this knowledge better
  + or use it more effectively?

---

## Latent Space Reasoning

* Look at three main approaches:
  + Recurrent Depth
  + Latent Tokens
  + Large Concept Models

---

## Recurrent Depth

* Has existed for a long time...
  + [Universal Transformers](https://arxiv.org/abs/1807.03819) - Dehghani _et al_ (2019)
    - Focus : Turing complete T5
    - Downside : Didn't seem to catch on
  + [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942) - Lan _et al_ (2019)
    - Focus : Save parameters
    - Downside : No time saving
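--

## Recurrent Depth
#### Sketch

A toy sketch of the weight-tied idea (illustrative only, not any paper's exact architecture): one shared block looped `r` times, so depth becomes a test-time knob:

```python
# Weight-tied "recurrent depth": reuse ONE block r times
# instead of stacking r distinct layers
import numpy as np

D = 64
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(D, D))   # shared weights,
W2 = rng.normal(scale=0.1, size=(D, D))   # reused at every iteration

def core(s: np.ndarray, x: np.ndarray) -> np.ndarray:
    """One shared 'layer': mix latent state s with the input injection x."""
    return np.tanh(s @ W1 + x @ W2)

def recurrent_depth(x: np.ndarray, r: int) -> np.ndarray:
    s = np.zeros(D)                       # latent state (Huginn starts from noise)
    for _ in range(r):                    # more loops = more test-time compute
        s = core(s, x)
    return s

x = rng.normal(size=D)
for r in (1, 4, 32):                      # pick the 'depth' at inference time
    print(r, recurrent_depth(x, r)[:3])
```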
--

## Recurrent Reasoning

* [Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach](https://arxiv.org/abs/2502.05171) - Geiping _et al_ (2025)
  + "Huginn" = More modern take on which layers to repeat
--

## Recurrent Reasoning
#### Summary

* Authors did three runs, last of which worked
  + Big guesses at 'fixes' between iterations
* [Code Repo](https://github.com/seal-rg/recurrent-pretraining) (Apache 2) and [Open Weights](https://huggingface.co/tomg-group-umd/huginn-0125)
* Huginn demonstrates that the idea ~works
  + OTOH : ["...but don't get too excited cuz we don't beat OLMo2"](https://x.com/tomgoldsteincs/status/1888980680790393085)

---

## Latent Tokens

* [COCONUT : Training Large Language Models to Reason in a Continuous Latent Space](https://arxiv.org/abs/2412.06769) - Hao _et al_ (2024)
  + Facebook [Code on GitHub](https://github.com/facebookresearch/coconut) (MIT)
  + [GDE Blog Post](https://gonzoml.substack.com/p/chain-of-continuous-thought-coconut)
* Instead of decoding the last hidden state into a token:
  - feed state directly as input to the decoder
  - ... as an embedding for the next step
  - ... in the autoregressive generation process
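--

## COCONUT
#### Sketch

A toy sketch of the latent-token loop (dimensions and the `forward` stub are made up - see the Facebook repo for the real code): latent steps skip the decode-to-token / re-embed round trip:

```python
# COCONUT-style generation: feed the hidden state straight back in
import numpy as np

D, VOCAB = 64, 1024
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(VOCAB, D))     # token embedding table
W = rng.normal(scale=0.1, size=(D, D))         # stand-in for the Transformer

def forward(x: np.ndarray) -> np.ndarray:
    return np.tanh(x @ W)                      # "last hidden state"

def generate(prompt: list[int], n_latent: int, n_tokens: int) -> list[int]:
    h = forward(E[prompt].mean(axis=0))        # absorb the prompt
    for _ in range(n_latent):                  # continuous 'thought' steps:
        h = forward(h)                         # the hidden state IS the next input
    out = []
    for _ in range(n_tokens):                  # then decode tokens as usual
        tok = int(np.argmax(E @ h))            # tied output projection
        out.append(tok)
        h = forward(E[tok])
    return out

print(generate([1, 2, 3], n_latent=4, n_tokens=3))
```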
--

## COCONUT
#### Latent Tokens
* Train by 'unlocking' tokens and training on 'true latents'

--

## Latent Tokens
#### Summary

* Results give so-so accuracy
  + while claiming fewer tokens
* But there's more scope for exploring:
  + Latent token tree-search
  + Planning, etc
* Later work claims better results:
  + [CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation](https://arxiv.org/abs/2502.21074) - Shen _et al_ (2025)

---

## Large Concept Models

* [Large Concept Models: Language Modeling in a Sentence Representation Space](https://arxiv.org/abs/2412.08821) - Barrault _et al_ (2024)
  + Facebook [Blog Post](https://ai.meta.com/blog/meta-fair-updates-agents-robustness-safety-architecture/) & [Repo on GitHub](https://github.com/facebookresearch/large_concept_model) (MIT) with Training code
    - 7B model trained (and released)
* Relies on [SONAR](https://github.com/facebookresearch/SONAR) for 'thoughts'
  + SONAR encoders and decoders:
    - ~200 languages and multiple modalities
    - Reconstruct text from SONAR embeddings
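--

## Large Concept Models
#### Sketch

A toy sketch of autoregression over *sentence* embeddings (tiny made-up dimensions - real LCMs drive 1024-d SONAR vectors through a 7B transition model):

```python
# LCM-style generation: predict the next 'concept' vector, not the next token
import numpy as np

D = 16                                         # real SONAR embeddings are 1024-d
rng = np.random.default_rng(0)
W = rng.normal(scale=0.2, size=(D, D))         # stand-in transition model

def next_concept(history: np.ndarray) -> np.ndarray:
    """Predict the next sentence embedding from those seen so far."""
    return np.tanh(history.mean(axis=0) @ W)   # regression: no SoftMax over a vocab

concepts = [rng.normal(size=(3, D))]           # "SONAR-encoded" prompt sentences
for _ in range(2):                             # autoregress two more concepts
    nxt = next_concept(np.vstack(concepts))
    concepts.append(nxt[None, :])              # a SONAR decoder would then render
                                               # each vector back into text
print(np.vstack(concepts).shape)               # (5, 16): five sentence vectors
```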
;* "Sentences" (text or audio) -- ## Large Concept Model #### Summary * 49-page paper explores several transition models: + Regular Transformer decoder + Diffusion-based (!) * and also how to 'quantise' the Concept vectors + Is 'regression' or 'classification' best? * Key issue: + Are the SONAR embeddings good for chaining thoughts? + ... an open question --- ## Latent Wrap-up * Is Latent Space Reasoning "a Thing"? * It requires extra engineering: + Is this in opposition to the Bitter Lesson? - [Hyung Won Chung YouTube video](https://www.youtube.com/watch?v=3gb-ZkVRemQ&t=1854s) + Or are we victims of history? - Sara Hooker's ["The Hardware Lottery"](https://arxiv.org/abs/2009.06489) - Hooker (2020) * Remains to be seen... --- ## ICLR take-aways * Conference = 3 days, >10k attendees + 6 Poster sessions of 2.5 hours + Talked to 30+ presenters per session + Walked *miles* * Aside: + Lots of Reasoning papers were OLD NEWS - Presenters had 'follow up' work to talk about * Then : 2 days of workshops + These are always more up-to-date + Includes focussed poster sessions ; https://x.com/iruletheworldmo/status/1915338995707359274 ; turns out the rl victory lap was premature. ; new tsinghua paper quietly shows the fancy reward loops just squeeze ; the same tired reasoning paths the base model already knew. ; pass@1 goes up, sure, but the model's world actually shrinks. ; feels like teaching a kid to ace flash cards and calling it wisdom. ; [Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?](https://arxiv.org/abs/2504.13837) - Yue _et al_ (2025) -- ## Workshop Poster!
* [Generating Code to Verify Cryptic Crossword Reasoning](https://openreview.net/forum?id=2nC7zy7adD)
  + Our Poster at DL4Code! & [SlidesLive video, etc](https://iclr.cc/virtual/2025/34846)

---

## ICLR Fun Papers

* Too many papers to choose between!
  + Only talked to ~30 presenters per session (of 600+)
;  + (also putting Open-endedness/Interestingness to one side)
* Selection emphasising novelty/variety:
  + Apple random compression
  + Memory Mosaics
  + Faces

---

## SeedLM

* [SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators](https://arxiv.org/abs/2410.10714) - Shafipour _et al_ (2025)
  + Apple Research : [ICLR link](https://iclr.cc/virtual/2025/poster/28000)
* Motivation : LLMs are slow due to Bandwidth
  + Compression can help on the edge
* Key idea: Generate the weights on-the-fly
  + Find a seed for RAND that can reproduce matrices (!)
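--

## SeedLM
#### Sketch

A toy version of the seed-search recipe (small search space and no quantisation, unlike the paper): for each weight block, find the PRNG seed whose random basis best reconstructs it:

```python
# SeedLM idea in miniature: weights -> (seed, a few coefficients)
import numpy as np

C, K = 8, 3                                    # block size, latent coefficients
w = np.random.default_rng(7).normal(size=C)    # one block of weights to compress

def basis(seed: int) -> np.ndarray:
    """Pseudo-random C x K matrix, fully determined by its seed."""
    return np.random.default_rng(seed).normal(size=(C, K))

def block_error(seed: int) -> float:
    U = basis(seed)
    t, *_ = np.linalg.lstsq(U, w, rcond=None)  # best 3 coefficients for this seed
    return float(np.linalg.norm(U @ t - w))

best = min(range(2**12), key=block_error)      # seed search (16-bit in the paper;
t, *_ = np.linalg.lstsq(basis(best), w, rcond=None)  # the coefficients also get
print("seed:", best, "error:", block_error(best))    # quantised to low-bit ints)
```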
--

## SeedLM

* Reduces 8 params → 3 params + 16-bit RAND seed

--

## SeedLM

* Does it work? : Apparently YES
  + Apple implemented it on FPGA hardware...
* Competitive accuracy with AWQ
  + ... but <<bandwidth

---

## Memory Mosaics

* [Memory Mosaics](https://arxiv.org/abs/2405.06394) - Zhang _et al_ (2025)
  + [ICLR Link](https://iclr.cc/virtual/2025/poster/30157)
  + FAIR @ Meta (in NYC) : Léon Bottou
  + "Return of the Associative Memory"
    - Has been around for DECADES
    - Have actual *theoretical* results !
    - Can scale _à la_ Transformer
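--

## Memory Mosaics
#### Sketch

A toy kernel-smoothing associative memory (a minimal rendering of the building block, not the paper's code): retrieval is a softmax-weighted average over stored values:

```python
# KV-lookup associative memory: store (key, value) pairs, recall by similarity
import numpy as np

D, N = 8, 32
rng = np.random.default_rng(0)
keys = rng.normal(size=(N, D))                 # stored key vectors
values = rng.normal(size=(N, D))               # associated value vectors

def retrieve(query: np.ndarray, beta: float = 4.0) -> np.ndarray:
    """Gaussian-kernel lookup: softmax over distances, then average the values."""
    scores = -beta * ((keys - query) ** 2).sum(axis=1)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ values

q = keys[5] + 0.01 * rng.normal(size=D)        # query near a stored key...
print(np.allclose(retrieve(q), values[5], atol=0.1))  # ...recalls its value: True
```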
--

## Memory Mosaics

* Each 'mem' is a KV-lookup Associative Memory
* [BabiStories dataset and code](https://github.com/facebookresearch/MemoryMosaics) (Apache 2)

---

## Faces

* Lots of papers about Faces/Portraits/Avatars
  + Mainly China-based companies
    - Bytedance, Alibaba, iFlytek, etc
* Different techniques
  + Diffusion models (+ distillation)?
  + Gaussian splatting
  + Audio models / motion planning
  + Mesh vs image-based constraints
* Safety concerns? Not so much...

--

## Loopy Architecture
* [Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency](https://arxiv.org/abs/2409.02634) - Jiang _et al_ (2025)
  + [Project Page](https://loopyavatar.github.io/) & [ICLR link](https://iclr.cc/virtual/2025/poster/32049)

---

## Heads-Up!

* Some quick things...
  + AMD GPU kernel contest
  + Llama 4 models
  + Qwen3 models launched
  + Shopping in ChatGPT

--

## AMD GPU kernels
[AMD Developer Challenge 2025](https://www.datamonsters.com/amd-developer-challenge-2025)

* Registration Deadline = midnight tonight (PST)
  + Competition deadlines ~ end May

; https://x.com/pavel_4_ai/status/1915039361655083223
; AMD software is improving rapidly
; Cuda isn't a moat forever, but Nvidia is building new ones with the Python DSL, Dynamo, and more
; Meanwhile Nvidia hardware advantage is huge this year, but perf/TCO of 355X has attracted some customers
; MI450X is actually competitive with Rubin

--

## Qwen3

* Alibaba Cloud [blog post for release](https://qwenlm.github.io/blog/qwen3/)
* Early vibes:
  + Excellent benchmarks
  + Nice sizes
    - e.g. Qwen3-30B-A3B MoE:
    - 128 experts / 8 active; 128K context
  + (Small Size too : Qwen3-0.6B)
  + Good 'thinkers'
  + ... but don't know many facts
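--

## Qwen3
#### MoE routing, sketched

A generic top-k routing sketch (not Qwen3's actual implementation) of what "128 experts / 8 active" means - each token only pays for the experts it selects:

```python
# Top-8-of-128 Mixture-of-Experts layer (toy dimensions)
import numpy as np

D, N_EXPERTS, TOP_K = 32, 128, 8
rng = np.random.default_rng(0)
router = rng.normal(scale=0.1, size=(D, N_EXPERTS))     # routing matrix
experts = rng.normal(scale=0.1, size=(N_EXPERTS, D, D)) # one FFN stub per expert

def moe_layer(x: np.ndarray) -> np.ndarray:
    gate = x @ router                           # score all 128 experts
    top = np.argsort(gate)[-TOP_K:]             # keep only the best 8
    w = np.exp(gate[top] - gate[top].max())
    w /= w.sum()                                # renormalised gate weights
    return sum(wi * np.tanh(x @ experts[i]) for wi, i in zip(w, top))

x = rng.normal(size=D)
print(moe_layer(x).shape)                       # (32,) - ran only 8 of 128 experts
```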
--

## Shopping in ChatGPT

[Announcement tweet](https://x.com/gdb/status/1917009041035038837)

---

## Wrap-Up

* Reasoning in Latent Space is an interesting idea
  + No clear direction (yet)
* ICLR was a lot of Fun
  + and visitors clearly had a good impression of Singapore
* Looking forward to AAAI in Jan-2026 (also in SG)!
NB: MLSG wants to feature Your Talk!
--

## Link to Slides

[https://bit.ly/MLSG_2025-04](https://bit.ly/MLSG_2025-04)

---

## Open-endedness and Interestingness
#### Jenny Zhang

* Personal experience at ICLR
* Highlights include:
  + keynotes
  + discussions and
  + other interesting things...

; https://x.com/jennyzhangzt/status/1917103091691958304

---

## My ICLR Highlights
#### Raymond Chan

* Highlights from the ICLR speaker presentations :
  + Language Model Alignment in Multilingual Trolley Problems
  + Exploring Prosocial Irrationality for LLM Agents: A Social Cognition View
  + Century: A Framework and Dataset for Evaluating Historical Contextualisation of Sensitive Images

---

## Agent Announcements & Trends from Google Cloud Next 2025
#### Sam Witteveen

* Google's Cloud Next '25 key announcements:
  + Agent Development Kit (ADK);
  + Agent Space; and
  + Agent2Agent protocol
* Trends for startups and companies demoing agentic products

---

## Further Study

* Field is growing very rapidly
* Lots of different things can be done
* Easy to find novel methods / applications

--

## Deep Learning Foundations

* 3 week-days + online content
* Play with real models & Pick-a-Project
* Held online, Live Coding, Certificates
* Last run : Early September

--

## NLP (Advanced)
### Advanced NLP and Sequence Processing

* NLP (eg: Named Entity Recognition)
* Transformers : Theory and Practice
* Generative AI
* Last run : Early October

--

## Vision (Advanced)
### Advanced Computer Vision with Deep Learning

* Advanced classification
* Other architectures (eg: U-Nets)
* Transformer-based vision
* Last run : Early November

--

## AI in Production
### Building Real World A.I. Applications

* DIY : node-server + task-queue + python-ml
* TensorFlow Serving / PyTorch Serve
* TF Lite + TF.js : edge device models
* Distillation, pruning, quantisation, etc...
* Last run : Early February

--

## Deep Learning for PMs
### ( `= Foundations - code + management` )

* Much more about 'big picture'
* Only a few code examples
* Project process standardised
* Last run : Late January

--

## Also...

* Unsupervised methods
* Time-series & Deep Learning
* Audio Processing (Sounds & Speech)

;--
;
;## QR code for Courses
;
---

## Machine Learning SG
#### MeetUp Group

* Next Meeting = 22-May-2025
* Topic : TBA
* Typical Contents :
  + Talk for people starting out
  + Something from the bleeding-edge
  + Lightning Talks
* [MeetUp.com / Machine-Learning-Singapore](https://www.meetup.com/Machine-Learning-Singapore/)

--

## Advanced Build With AI

* 17-May-2025 (Saturday)
  + Pizza
  + Afternoon session
  + Topics = Googley things
    - Gemini from A-Z
    - Agents
    - Vibe coding
  + Hands-on :: Laptops required!
* Sign-up page TBA

;--
;
;## Quick Poll
;#### Show of hands
;
;* What topic(s) would _compel_ you to come?
;  + Agents
;  + LLMs for Science
;  + Stable-diffusion++ / Video / Gaussian Splatting
;  + [Vibe Coding](https://x.com/MatthewBerman/status/1904039128611914144)
;  + LLMs with Retrieval (RAG)
;  + Robotics

---

# - Questions -