## Deep Agents, World Models and REFRAG
#### Machine Learning Singapore
[Martin Andrews](http://mdda.net) @ [reddragon.ai](http://reddragon.ai/)
[Sam Witteveen](http://samwitteveen.com) @ [reddragon.ai](http://reddragon.ai/)
15-October-2025
---

## Today's Line-up

* "Neural Assets and World Models"
  - _Martin Andrews_
* "REFRAG: Rethinking RAG based Decoding"
  - _Xiaoqiang Lin_
* "Building DeepAgents"
  - _Sam Witteveen_

---

## Neural Assets and World Models
#### Machine Learning Singapore
[Martin Andrews](http://mdda.net) @ [reddragon.ai](http://reddragon.ai/)
15-October-2025
---

## About Me

* Machine Intelligence / Startups / Finance
  + Moved from NYC to Singapore in Sep-2013
* 2014 = 'fun' :
  + Machine Learning, Deep Learning, NLP
  + Robots, drones
* Since 2015 = 'serious' : NLP + deep learning
  + Including Papers...
  + & GDE ML; ML-Singapore co-organiser...
  + & Red Dragon AI...

--

## About Red Dragon AI

* Deep Learning Consulting & Prototyping (Google Partner)
  - Education / Training
  - Research : NeurIPS / ICML / ICLR / NAACL / EMNLP
* Please contact us for :
  - Language model training (eg: on-prem)
  - Knowledgebase interaction & reasoning
  - Sales-oriented applications

---

## Modern Training

* Image models are getting _very_ good
  + Huge data clearly helps
  + Many tricks are not disclosed
* Video models are a 'natural extension'
  + Again, have improved hugely in ~2 years
* "World models" are taking off
  + May be a setting for robot training _in silico_

---

### Image Faithfulness

* Key issue facing image models
  + Are the captions any good?
* Approaches:
  + Filtering out nonsense
  + Checking vs (say) CLIP
  + Actually improving the captions

--

### Qwen-Image Data Filtering
* Filter out CLIP caption mismatches, etc
  + (minimal sketch on the next slide)
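--

### CLIP Filtering: Sketch

A minimal sketch of the general idea, not Qwen-Image's actual pipeline: embed each image and its caption with an off-the-shelf CLIP checkpoint, then drop pairs that score below a similarity threshold. The checkpoint name and the `0.25` cut-off are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.25) -> bool:
    """Keep an (image, caption) pair only if CLIP thinks they match."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Both embeddings are L2-normalised, so the dot product is cosine similarity
    sim = (out.image_embeds * out.text_embeds).sum().item()
    return sim >= threshold
```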
--

## DALL-E (v3)
#### OpenAI Image Generation

* [Improving Image Generation with Better Captions](https://cdn.openai.com/papers/dall-e-3.pdf)
  + "This _paper_ focuses on evaluating the improved prompt following of DALL-E 3 as a result of training on highly descriptive generated captions"
    - ("It does not cover training or implementation details of the DALL-E 3 model")
* [Stable Diffusion 3](https://arxiv.org/abs/2403.03206) used the same technique
  + [CogVLM: Visual Expert for Pretrained Language Models](https://arxiv.org/abs/2311.03079)

--

## DALL-E (v3)

* Synthetic captioning section (p6 of PDF)
  + (recaptioning sketch on the next slide)
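--

### Recaptioning: Sketch

The recaptioning idea in miniature: regenerate a dense synthetic caption for each training image, rather than trusting scraped alt-text. BLIP here is a stand-in choice for illustration; DALL-E 3 used its own bespoke captioner, and SD3 used CogVLM.

```python
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large")

def recaption(image: Image.Image, max_new_tokens: int = 60) -> str:
    """Replace a noisy scraped caption with a descriptive synthetic one."""
    inputs = processor(images=image, return_tensors="pt")
    ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(ids[0], skip_special_tokens=True)
```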
---

### Image-changing Models

* New breed of image models
  + Can update images, or incorporate images
* Allow for :
  + image token input + prompt + 'reasoning' (?)
  + image token output

--

### Qwen-Image Architecture

* Note the pretrained LLM and VAE elements

--

## Qwen-Image Editing
* Includes RL training too...

---

### Image models from Video

* [Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models](https://arxiv.org/abs/2406.09292)
  + [Project Page](https://neural-assets-paper.github.io/) *← look at this page*
    - used the object orientations in video data to train an image model
    - ... by making use of ~captioning models (again)
    - contents are parameterized as fully spatially-manipulable "neural assets"
* Hypothesis : Nano Banana used this
  + Evidence? : [Kubric Github Repo](https://github.com/google-research/kubric)

--

## Neural Assets
;#### Operations
* Video naturally gives us images with object changes

--

## Neural Assets
#### Conditioning
--

## Training Image Models

* Can 'easily' see that :
  + the representation conditioning could be tokens
    - put them inside a multimodal LLM
    - get ability to edit individual aspects
* Now, the model 'understands' the objects within the image
  + and can manipulate them...
  + (toy conditioning sketch on the next slide)
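--

### Neural Assets as Tokens: Sketch

A toy version of the conditioning idea: each object becomes one token built from an appearance embedding plus its pose, and those tokens form the conditioning context for the image model. All dimensions below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class AssetTokens(nn.Module):
    def __init__(self, app_dim=768, pose_dim=12, token_dim=1024):
        super().__init__()
        # pose_dim=12 ~ a flattened 3x4 object-to-camera transform
        self.proj = nn.Linear(app_dim + pose_dim, token_dim)

    def forward(self, appearance, pose):
        # appearance: (batch, n_objects, app_dim) from a frozen image encoder
        # pose:       (batch, n_objects, pose_dim) recovered from video frames
        return self.proj(torch.cat([appearance, pose], dim=-1))

tokens = AssetTokens()(torch.randn(2, 3, 768), torch.randn(2, 3, 12))
print(tokens.shape)  # (2, 3, 1024) -> one editable token per object
```

Editing then means swapping just the pose part of an object's token while keeping its appearance part fixed (or vice versa).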
---

## World Models

* Do models 'understand' the world?
* Some speculation about text LLMs
  + no actual observability
* But can we have something we can interact with?

---

## Minecraft

* 2022 : OpenAI released a 'foundation model' for Minecraft
  + and sponsored the [NeurIPS 2022 MineRL BASALT competition](https://www.aicrowd.com/challenges/neurips-2022-minerl-basalt-competition)
  + NB : MineRL used official (Java) game engine

--

## VOYAGER
* Made use of a Minecraft-aware tool-calling interface

--

## Adding Annotations
#### World Model First Steps

* [Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos](https://arxiv.org/abs/2206.11795)
  + [Open AI Blog Post](https://openai.com/index/vpt/)
  + [Code available on GitHub](https://github.com/openai/Video-Pre-Training)
* Action model = Inverse Dynamics Model (IDM)
  + guesses 'Actions' from video, trained on a small dataset
    - gathered from (paid) keyboard interactions
  + used to annotate (many) YouTube videos
  + (two-stage sketch on the next slide)
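--

### VPT's IDM: Sketch

The two-stage shape of VPT, heavily simplified: a *non-causal* Inverse Dynamics Model (it sees frames before and after the step it labels) is trained on the small contractor dataset, then used to pseudo-label YouTube video for behaviour cloning. The network size and the 23-way action space here are illustrative, not VPT's real ones.

```python
import torch
import torch.nn as nn

N_ACTIONS, WINDOW = 23, 16   # illustrative sizes

class IDM(nn.Module):
    """Predicts the action taken at the centre of a window of frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(5, 7, 7), stride=(1, 4, 4)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(32, N_ACTIONS),
        )

    def forward(self, frames):              # (batch, 3, WINDOW, H, W)
        return self.net(frames)

idm = IDM()
# Stage 1: supervised training on the small keyboard-labelled set (not shown)
# Stage 2: pseudo-label an unlabelled YouTube clip:
clip = torch.randn(1, 3, WINDOW, 64, 64)
pseudo_action = idm(clip).argmax(dim=-1)     # becomes a BC target for the agent
```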
--

## Dreamer v4

* [Training Agents Inside of Scalable World Models](https://www.arxiv.org/abs/2509.24527)
  + Google DeepMind [Project Page](https://danijar.com/project/dreamer4/)
  + Process:
    - Train 'world model' from video demonstrations
    - Then learn to play using 'world model' (not real play)
    - Finally : Test using actual play
    - (toy sketch of this loop on the next slide)
* Learns to play Minecraft from offline data only
  + Obtains diamonds (20k sequential actions)
  + Significantly outperforms OpenAI's VPT offline agent
    - while using 100 times less data
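--

### Imagination Training: Sketch

The control flow of "learn to play inside the world model", in toy form: during policy learning there are no environment steps at all, only rollouts dreamed by the (frozen) world model. Dreamer 4's real model is a large video transformer with its own gradient estimators; the soft-action trick below is just to keep this sketch differentiable.

```python
import torch
import torch.nn as nn

STATE, ACTION = 64, 8                      # illustrative sizes

world = nn.GRUCell(ACTION, STATE)          # stands in for the world model
reward_head = nn.Linear(STATE, 1)          # learned reward predictor
for p in list(world.parameters()) + list(reward_head.parameters()):
    p.requires_grad_(False)                # frozen after video pretraining

policy = nn.Sequential(nn.Linear(STATE, ACTION), nn.Softmax(dim=-1))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

opt.zero_grad()
state, ret = torch.zeros(1, STATE), 0.0
for _ in range(15):                        # imagined rollout, no real game
    action = policy(state)                 # soft action, keeps gradients
    state = world(action, state)           # dream the next latent state
    ret = ret + reward_head(state).sum()
(-ret).backward()                          # ascend the predicted imagined return
opt.step()
```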
---

## DeepMind GENIE

[https://x.com/jparkerholder/status/1976560277043937590](https://x.com/jparkerholder/status/1976560277043937590)

--

## Real-world dataset

* [A Large-Scale Video Dataset with Spatial Annotations](https://arxiv.org/abs/2509.09676)
  + [Project Page](https://nju-3dv.github.io/projects/SpatialVID/)

--

## DeepMind GENIE3
* Latest Version: [DeepMind GENIE 3](https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/)

---

## TinyWorlds
#### Open Source 'GENIE 3'

* ~3M param world model
  + "... capable of generating playable game environments"
  + [Author Thread](https://x.com/Almondgodd/status/1971314283184259336)
    - "I spent the past month reimplementing DeepMind's Genie 3 world model from scratch"
  + Each TinyWorlds dataset is created from YouTube gameplay videos for :
    - Pole Position: 3D pixel racing game
    - Doom: 3D first-person shooter (`picodoom`)
    - Pong: the 2-player Atari game
    - Sonic: a 2D platformer
    - Zelda: birds-eye-view adventure game
  + [Repo with graphics](https://github.com/AlmondGod/tinyworlds) (No license)
    - "For training I used ~1-4 H200s depending on complexity, each run around a day"
  + (toy dynamics sketch on the next slide)

; https://x.com/yohketi/status/1971581207638057415
; "Training can technically be done on a MacBook GPU, I did as well, the results are just much worse"
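--

### TinyWorlds-style Dynamics: Sketch

A toy version of the dynamics core: given the discrete tokens of the current frame plus a player action, predict the next frame's tokens. The real repo pairs a model like this with a VQ video tokeniser and an action tokeniser; all sizes below are illustrative.

```python
import torch
import torch.nn as nn

VOCAB, TOKENS_PER_FRAME, N_ACTIONS, DIM = 512, 64, 8, 128

class Dynamics(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, DIM)
        self.act_emb = nn.Embedding(N_ACTIONS, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, frame_tokens, action):
        # frame_tokens: (batch, TOKENS_PER_FRAME) ints; action: (batch,) ints
        x = self.tok_emb(frame_tokens) + self.act_emb(action)[:, None, :]
        return self.head(self.backbone(x))   # logits over next-frame tokens

model = Dynamics()
logits = model(torch.randint(0, VOCAB, (1, TOKENS_PER_FRAME)),
               torch.tensor([3]))
next_frame = logits.argmax(-1)   # decode + feed back in = a playable loop
```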
--

## TinyWorlds
;#### Architecture

--

## TinyWorlds
;#### Architecture
--

## DEMO TIME!
* Have a play with the [Shared Colab notebook](https://colab.research.google.com/drive/1AL5zi5ayVvv5_-qPg3DeDb6HBfIA4Ue8?usp=sharing)

---

## Wrap-Up

* Have covered some ideas for Image models
  + with some Nano-Banana speculation...
* Have outlined how explicit "World models" work
* Actually possible to do this at smaller scale!
NB: MLSG wants to feature Your Talk!
(Say "Hello"...)
--

## Link to Slides

[https://bit.ly/MLSG_2025-10](https://bit.ly/MLSG_2025-10)

---

## REFRAG: Rethinking RAG based Decoding
#### Xiaoqiang Lin

* Advanced pre-training / training
  + RAG-like decoding FTW!

---

### Building DeepAgents
#### Sam Witteveen

* Making Agents run effectively
  + long time-horizon tasks and trajectories

---

## THANK YOU!

* Venue:
  + Google
* MLSG Volunteers:
  + Shern; Nicholas; Geoffrey; Anthony; Leonard; Malik
* MLSG Helpers:
  + Jen; JF

---

## Further Study

* Field is growing very rapidly
* Lots of different things can be done
* Easy to find novel methods / applications

--

## Deep Learning Foundations

* 3 week-days + online content
* Play with real models & Pick-a-Project
* Held online, Live Coding, Certificates
* Next run : TBA

--

## NLP (Advanced)
### Advanced NLP and Sequence Processing

* NLP (eg: Named Entity Recognition)
* Transformers : Theory and Practice
* Generative AI
* Next run : TBA

--

## Vision (Advanced)
### Advanced Computer Vision with Deep Learning

* Advanced classification
* Other architectures (eg: U-Nets)
* Transformer-based vision
* Next run : TBA

--

## Deep Learning for PMs
### ( `= Foundations - code`
`+ management` )

* Much more about 'big picture'
* Only a few code examples
* Project process standardised
* Next run : 21, 22, 23 October

--

## AI in Production
### Building Real World A.I. Applications

* DIY : node-server + task-queue + python-ml
* TensorFlow Serving / PyTorch Serve
* TF Lite + TF.js : edge device models
* Distillation, pruning, quantisation, etc...
* Next run : 3, 4, 5 November

--

## Also...

* Unsupervised methods
* Time-series & Deep Learning
* Audio Processing (Sounds & Speech)

;--
;
;## QR code for Courses
;
---

## Machine Learning SG
MeetUp Group

* Next Meeting = {15,26,27}?-Nov-2025 @ Google
* Topic(s) : TBA
* Typical Contents :
  + Talk for people starting out
  + Something from the bleeding-edge
  + Lightning Talks
* [MeetUp.com / Machine-Learning-Singapore](https://www.meetup.com/Machine-Learning-Singapore/)

--

## Quick Poll
#### Show of hands

* How did you hear about THIS event?
  + MeetUp email
  + luma.com email
  + Messaging group
  + MLSG friends directly
  + Work colleagues

--

## Quick Poll
#### Show of hands

* How do you feel about MeetUp vs Luma?
  + luma is better
  + MeetUp is better
  + Don't really care

;--
;
;## Quick Poll
;#### Show of hands
;
;* What topic(s) would _compel_ you to come?
;  + Stable-diffusion++ / Video / Gaussian Splatting
;  + Robotics
;  + Reinforcement Learning
;  + AI for Education
;  + LLMs for Science
;  + Agents

---

# See You
Next Time!
Please add yourself to the
MLSG Calendar on Luma!

;`Handouts :` [`https://bit.ly/text-similarity-jan-2022`](https://bit.ly/text-similarity-jan-2022)