## Deep Agents, World Models and REFRAG
#### Machine Learning Singapore
[Martin Andrews](http://mdda.net) @ [reddragon.ai](http://reddragon.ai/)
[Sam Witteveen](http://samwitteveen.com) @ [reddragon.ai](http://reddragon.ai/)
15-October-2025
---

## Today's Line-up

* "Neural Assets and World Models"
  - _Martin Andrews_
* "REFRAG: Rethinking RAG based Decoding"
  - _Xiaoqiang Lin_
* "Building DeepAgents"
  - _Sam Witteveen_

---

## Neural Assets and World Models
#### Machine Learning Singapore
[Martin Andrews](http://mdda.net) @ [reddragon.ai](http://reddragon.ai/)
15-October-2025
---

## About Me

* Machine Intelligence / Startups / Finance
  + Moved from NYC to Singapore in Sep-2013
* 2014 = 'fun' :
  + Machine Learning, Deep Learning, NLP
  + Robots, drones
* Since 2015 = 'serious' : NLP + deep learning
  + Including Papers...
  + & GDE ML; ML-Singapore co-organiser...
  + & Red Dragon AI...

--

## About Red Dragon AI

* Deep Learning Consulting & Prototyping (Google Partner)
  - Education / Training
  - Research : NeurIPS / ICML / ICLR / NAACL / EMNLP
* Please contact us for :
  - Language model training (eg: on-prem)
  - Knowledgebase interaction & reasoning
  - Sales-oriented applications

---

## Modern Training

* Image models are getting _very_ good
  + Huge data clearly helps
  + Many tricks are not disclosed
* Video models are a 'natural extension'
  + Again, have improved hugely in ~2 years
* "World models" are taking off
  + May be a setting for robot training _in silico_

---

### Image Faithfulness

* Key issue facing image models
  + Are the captions any good?
* Approaches:
  + Filtering out nonsense
  + Checking vs (say) CLIP
  + Actually improving the captions

--

### Qwen-Image Data Filtering
* Filter out CLIP caption mismatches, etc
  + (minimal sketch on the next slide)
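--

### CLIP Filtering: Sketch

A minimal sketch of the general idea, not Qwen-Image's actual pipeline: embed each image and its caption with an off-the-shelf CLIP checkpoint, then drop pairs that score below a similarity threshold. The checkpoint name and the `0.25` cut-off are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.25) -> bool:
    """Keep an (image, caption) pair only if CLIP thinks they match."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Both embeddings are L2-normalised, so the dot product is cosine similarity
    sim = (out.image_embeds * out.text_embeds).sum().item()
    return sim >= threshold
```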
--

## DALL-E (v3)
#### OpenAI Image Generation

* [Improving Image Generation with Better Captions](https://cdn.openai.com/papers/dall-e-3.pdf)
  + "This _paper_ focuses on evaluating the improved prompt following of DALL-E 3 as a result of training on highly descriptive generated captions"
    - ("It does not cover training or implementation details of the DALL-E 3 model")
* [Stable Diffusion 3](https://arxiv.org/abs/2403.03206) used the same technique
  + [CogVLM: Visual Expert for Pretrained Language Models](https://arxiv.org/abs/2311.03079)

--

## DALL-E (v3)

* Synthetic captioning section (p6 of PDF)
  + (recaptioning sketch on the next slide)
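--

### Recaptioning: Sketch

The recaptioning idea in miniature: regenerate a dense synthetic caption for each training image, rather than trusting scraped alt-text. BLIP here is a stand-in choice for illustration; DALL-E 3 used its own bespoke captioner, and SD3 used CogVLM.

```python
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large")

def recaption(image: Image.Image, max_new_tokens: int = 60) -> str:
    """Replace a noisy scraped caption with a descriptive synthetic one."""
    inputs = processor(images=image, return_tensors="pt")
    ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(ids[0], skip_special_tokens=True)
```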
---

### Image-changing Models

* New breed of image models
  + Can update images, or incorporate images
* Allow for :
  + image token input + prompt + 'reasoning' (?)
  + image token output

--

### Qwen-Image Architecture

* Note the pretrained LLM and VAE elements

--

## Qwen-Image Editing
* Includes RL training too...

---

### Image models from Video

* [Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models](https://arxiv.org/abs/2406.09292)
  + [Project Page](https://neural-assets-paper.github.io/) *← look at this page*
    - used the object orientations in video data to train an image model
    - ... by making use of ~captioning models (again)
    - contents are parameterized as fully spatially-manipulable "neural assets"
* Hypothesis : Nano Banana used this
  + Evidence? : [Kubric Github Repo](https://github.com/google-research/kubric)

--

## Neural Assets
;#### Operations
* Video naturally gives us images with object changes

--

## Neural Assets
#### Conditioning
--

## Training Image Models

* Can 'easily' see that :
  + the representation conditioning could be tokens
    - put them inside a multimodal LLM
    - get ability to edit individual aspects
* Now, the model 'understands' the objects within the image
  + and can manipulate them...
  + (toy conditioning sketch on the next slide)
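--

### Neural Assets as Tokens: Sketch

A toy version of the conditioning idea: each object becomes one token built from an appearance embedding plus its pose, and those tokens form the conditioning context for the image model. All dimensions below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class AssetTokens(nn.Module):
    def __init__(self, app_dim=768, pose_dim=12, token_dim=1024):
        super().__init__()
        # pose_dim=12 ~ a flattened 3x4 object-to-camera transform
        self.proj = nn.Linear(app_dim + pose_dim, token_dim)

    def forward(self, appearance, pose):
        # appearance: (batch, n_objects, app_dim) from a frozen image encoder
        # pose:       (batch, n_objects, pose_dim) recovered from video frames
        return self.proj(torch.cat([appearance, pose], dim=-1))

tokens = AssetTokens()(torch.randn(2, 3, 768), torch.randn(2, 3, 12))
print(tokens.shape)  # (2, 3, 1024) -> one editable token per object
```

Editing then means swapping just the pose part of an object's token while keeping its appearance part fixed (or vice versa).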
---

## World Models

* Do models 'understand' the world?
* Some speculation about text LLMs
  + no actual observability
* But can we have something we can interact with?

---

## Minecraft

* 2022 : OpenAI released a 'foundation model' for Minecraft
  + and sponsored the [NeurIPS 2022 MineRL BASALT competition](https://www.aicrowd.com/challenges/neurips-2022-minerl-basalt-competition)
  + NB : MineRL used official (Java) game engine

--

## VOYAGER
* Made use of a Minecraft-aware tool-calling interface

--

## Adding Annotations
#### World Model First Steps

* [Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos](https://arxiv.org/abs/2206.11795)
  + [Open AI Blog Post](https://openai.com/index/vpt/)
  + [Code available on GitHub](https://github.com/openai/Video-Pre-Training)
* Action model = Inverse Dynamics Model (IDM)
  + guesses 'Actions' from video, trained on a small dataset
    - gathered from (paid) keyboard interactions
  + used to annotate (many) YouTube videos
  + (two-stage sketch on the next slide)
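--

### VPT's IDM: Sketch

The two-stage shape of VPT, heavily simplified: a *non-causal* Inverse Dynamics Model (it sees frames before and after the step it labels) is trained on the small contractor dataset, then used to pseudo-label YouTube video for behaviour cloning. The network size and the 23-way action space here are illustrative, not VPT's real ones.

```python
import torch
import torch.nn as nn

N_ACTIONS, WINDOW = 23, 16   # illustrative sizes

class IDM(nn.Module):
    """Predicts the action taken at the centre of a window of frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(5, 7, 7), stride=(1, 4, 4)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(32, N_ACTIONS),
        )

    def forward(self, frames):              # (batch, 3, WINDOW, H, W)
        return self.net(frames)

idm = IDM()
# Stage 1: supervised training on the small keyboard-labelled set (not shown)
# Stage 2: pseudo-label an unlabelled YouTube clip:
clip = torch.randn(1, 3, WINDOW, 64, 64)
pseudo_action = idm(clip).argmax(dim=-1)     # becomes a BC target for the agent
```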
--

## Dreamer v4

* [Training Agents Inside of Scalable World Models](https://www.arxiv.org/abs/2509.24527)
  + Google DeepMind [Project Page](https://danijar.com/project/dreamer4/)
  + Process:
    - Train 'world model' from video demonstrations
    - Then learn to play using 'world model' (not real play)
    - Finally : Test using actual play
    - (toy sketch of this loop on the next slide)
* Learns to play Minecraft from offline data only
  + Obtains diamonds (20k sequential actions)
  + Significantly outperforms OpenAI's VPT offline agent
    - while using 100 times less data
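--

### Imagination Training: Sketch

The control flow of "learn to play inside the world model", in toy form: during policy learning there are no environment steps at all, only rollouts dreamed by the (frozen) world model. Dreamer 4's real model is a large video transformer with its own gradient estimators; the soft-action trick below is just to keep this sketch differentiable.

```python
import torch
import torch.nn as nn

STATE, ACTION = 64, 8                      # illustrative sizes

world = nn.GRUCell(ACTION, STATE)          # stands in for the world model
reward_head = nn.Linear(STATE, 1)          # learned reward predictor
for p in list(world.parameters()) + list(reward_head.parameters()):
    p.requires_grad_(False)                # frozen after video pretraining

policy = nn.Sequential(nn.Linear(STATE, ACTION), nn.Softmax(dim=-1))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

opt.zero_grad()
state, ret = torch.zeros(1, STATE), 0.0
for _ in range(15):                        # imagined rollout, no real game
    action = policy(state)                 # soft action, keeps gradients
    state = world(action, state)           # dream the next latent state
    ret = ret + reward_head(state).sum()
(-ret).backward()                          # ascend the predicted imagined return
opt.step()
```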
---

## DeepMind GENIE

[https://x.com/jparkerholder/status/1976560277043937590](https://x.com/jparkerholder/status/1976560277043937590)

--

## Real-world dataset

* [A Large-Scale Video Dataset with Spatial Annotations](https://arxiv.org/abs/2509.09676)
  + [Project Page](https://nju-3dv.github.io/projects/SpatialVID/)

--

## DeepMind GENIE3
* Latest Version: [DeepMind GENIE 3](https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/)

---

## TinyWorlds
#### Open Source 'GENIE 3'

* ~3M param world model
  + "... capable of generating playable game environments"
  + [Author Thread](https://x.com/Almondgodd/status/1971314283184259336)
    - "I spent the past month reimplementing DeepMind's Genie 3 world model from scratch"
  + Each TinyWorlds dataset is created from YouTube gameplay videos for :
    - Pole Position: 3D pixel racing game
    - Doom: 3D first-person shooter (`picodoom`)
    - Pong: the 2-player Atari game
    - Sonic: a 2D platformer
    - Zelda: birds-eye-view adventure game
  + [Repo with graphics](https://github.com/AlmondGod/tinyworlds) (No license)
    - "For training I used ~1-4 H200s depending on complexity, each run around a day"
  + (toy dynamics sketch on the next slide)

; https://x.com/yohketi/status/1971581207638057415
; "Training can technically be done on a MacBook GPU, I did as well, the results are just much worse"
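--

### TinyWorlds-style Dynamics: Sketch

A toy version of the dynamics core: given the discrete tokens of the current frame plus a player action, predict the next frame's tokens. The real repo pairs a model like this with a VQ video tokeniser and an action tokeniser; all sizes below are illustrative.

```python
import torch
import torch.nn as nn

VOCAB, TOKENS_PER_FRAME, N_ACTIONS, DIM = 512, 64, 8, 128

class Dynamics(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, DIM)
        self.act_emb = nn.Embedding(N_ACTIONS, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, frame_tokens, action):
        # frame_tokens: (batch, TOKENS_PER_FRAME) ints; action: (batch,) ints
        x = self.tok_emb(frame_tokens) + self.act_emb(action)[:, None, :]
        return self.head(self.backbone(x))   # logits over next-frame tokens

model = Dynamics()
logits = model(torch.randint(0, VOCAB, (1, TOKENS_PER_FRAME)),
               torch.tensor([3]))
next_frame = logits.argmax(-1)   # decode + feed back in = a playable loop
```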
--

## TinyWorlds
;#### Architecture

--

## TinyWorlds
;#### Architecture
--

## DEMO TIME!
* Have a play with the [Shared Colab notebook](https://colab.research.google.com/drive/1AL5zi5ayVvv5_-qPg3DeDb6HBfIA4Ue8?usp=sharing)

---

## Wrap-Up

* Have covered some ideas for Image models
  + with some Nano-Banana speculation...
* Have outlined how explicit "World models" work
* Actually possible to do this at smaller scale!
NB: MLSG wants to feature Your Talk!
(Say "Hello"...)
--

## Link to Slides

[https://bit.ly/MLSG_2025-10](https://bit.ly/MLSG_2025-10)

---

## REFRAG: Rethinking RAG based Decoding
#### Xiaoqiang Lin

* Advanced pre-training / training
  + RAG-like decoding FTW!

---

### Building DeepAgents
#### Sam Witteveen

* Making Agents run effectively
  + long time-horizon tasks and trajectories

---

## THANK YOU!

* Venue:
  + Google
* MLSG Volunteers:
  + Shern; Nicholas; Geoffrey; Anthony; Leonard; Malik
* MLSG Helpers:
  + Jen; JF

---

## Further Study

* Field is growing very rapidly
* Lots of different things can be done
* Easy to find novel methods / applications

--

## Deep Learning Foundations

* 3 week-days + online content
* Play with real models & Pick-a-Project
* Held online, Live Coding, Certificates
* Next run : TBA

--

## NLP (Advanced)
### Advanced NLP and Sequence Processing

* NLP (eg: Named Entity Recognition)
* Transformers : Theory and Practice
* Generative AI
* Next run : TBA

--

## Vision (Advanced)
### Advanced Computer Vision with Deep Learning

* Advanced classification
* Other architectures (eg: U-Nets)
* Transformer-based vision
* Next run : TBA

--

## Deep Learning for PMs
### ( `= Foundations - code`
`+ management` )

* Much more about 'big picture'
* Only a few code examples
* Project process standardised
* Next run : 21, 22, 23 October

--

## AI in Production
### Building Real World A.I. Applications

* DIY : node-server + task-queue + python-ml
* TensorFlow Serving / PyTorch Serve
* TF Lite + TF.js : edge device models
* Distillation, pruning, quantisation, etc...
* Next run : 3, 4, 5 November

--

## Also...

* Unsupervised methods
* Time-series & Deep Learning
* Audio Processing (Sounds & Speech)

;--
;
;## QR code for Courses
;
---

## Machine Learning SG
MeetUp Group

* Next Meeting = {15,26,27}?-Nov-2025 @ Google
* Topic(s) : TBA
* Typical Contents :
  + Talk for people starting out
  + Something from the bleeding-edge
  + Lightning Talks
* [MeetUp.com / Machine-Learning-Singapore](https://www.meetup.com/Machine-Learning-Singapore/)

--

## Quick Poll
#### Show of hands

* How did you hear about THIS event?
  + MeetUp email
  + luma.com email
  + Messaging group
  + MLSG friends directly
  + Work colleagues

--

## Quick Poll
#### Show of hands

* How do you feel about MeetUp vs Luma?
  + luma is better
  + MeetUp is better
  + Don't really care

;--
;
;## Quick Poll
;#### Show of hands
;
;* What topic(s) would _compel_ you to come?
;  + Stable-diffusion++ / Video / Gaussian Splatting
;  + Robotics
;  + Reinforcement Learning
;  + AI for Education
;  + LLMs for Science
;  + Agents

---

# See You
Next Time!
Please add yourself to the
MLSG Calendar on Luma!

;`Handouts :` [`https://bit.ly/text-similarity-jan-2022`](https://bit.ly/text-similarity-jan-2022)