# Self-Improving Agents
#### Machine Learning Singapore
[Martin Andrews](http://mdda.net) @ [reddragon.ai](http://reddragon.ai/)
;[Sam Witteveen](http://samwitteveen.com) @ [reddragon.ai](http://reddragon.ai/)
19-June-2025
---

## Today's Line-up

* "RAG, Agents, RL"
  - _Vivek Kalyan_
* "Rethinking Superapps: Voice-Driven Multi-Agent Systems with Gradio MCP"
  - _Leonard Loo_
* "Self-Improving Agents"
  - _Martin Andrews_

---

## RAG, Agents, RL
#### Vivek Kalyan

* Issues with RAG
* Strong baselines
* Training with Reinforcement Learning

---

## Rethinking Superapps: Voice-Driven Multi-Agent Systems with Gradio MCP
#### Leonard Loo

* Superapps
* Building MCP servers
* Integrating LLMs with MCP and Voice

---

# Self-Improving Agents
#### Machine Learning Singapore
[Martin Andrews](http://mdda.net) @ [reddragon.ai](http://reddragon.ai/)
;[Sam Witteveen](http://samwitteveen.com) @ [reddragon.ai](http://reddragon.ai/)
19-June-2025
---

## About Me

* Machine Intelligence / Startups / Finance
  + Moved from NYC to Singapore in Sep-2013
* 2014 = 'fun':
  + Machine Learning, Deep Learning, NLP
  + Robots, drones
* Since 2015 = 'serious': NLP + deep learning
  + Including Papers...
  + & GDE ML; ML-Singapore co-organiser...
  + & Red Dragon AI...

--

## About Red Dragon AI

* Deep Learning Consulting & Prototyping (Google Partner)
  - Education / Training
  - Research: NeurIPS / EMNLP / NAACL / ICML / ICLR
* Please contact us for:
  - Language model training (eg: on-prem)
  - Knowledgebase interaction & reasoning
  - Sales-oriented applications

---

## Outline

* DSPy
* Darwin Gödel Machines
* GPU Kernel Scientist
* Wrap-up & QR-code
;* Heads-Up!

---

## DSPy
#### As Agentic Framework

* DSPy is an LLM Framework
  + Can call multiple backends
  + Orchestrate 'flows', etc
  + ... but has a very different 'feel'
* DSPy @ MLSG in [March-2024](https://mdda.net/blog/research/talks/DSPy-gemini-and-gemma)
---

## DSPy Signature

```py
import dspy
dspy.settings.configure(lm=dspy.LM("gemini/gemini-2.5-flash"))

class SentimentClassifier(dspy.Signature):
    """ Classify the sentiment of a text. """
    text: str = dspy.InputField(
        desc="input text to classify sentiment")
    sentiment: int = dspy.OutputField(
        desc="sentiment, the higher the more positive",
        ge=0, le=10)
```

--

## DSPy Predefined Module

```py
predict = dspy.Predict(SentimentClassifier)
# or: predict = dspy.ChainOfThought(SentimentClassifier)

output = predict(text="I am feeling pretty happy!")
print(output)
# Prediction(
#     sentiment=8
# )
```

--

## Behind-the-scenes 1/3

```log
System message:

Your input fields are:
1. `text` (str): input text to classify sentiment
Your output fields are:
1. `sentiment` (int): sentiment, the higher the more positive
        Constraints: greater than or equal to: 0, less than or equal to: 10
```

--

## Behind-the-scenes 2/3

```log
All interactions will be structured in the following way,
with the appropriate values filled in.

[[ ## text ## ]]
{text}

[[ ## sentiment ## ]]
{sentiment}   # note: the value must be a single int value

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        Classify the sentiment of a text.
```

--

## Behind-the-scenes 3/3

```log
User message:

[[ ## text ## ]]
I am feeling pretty happy!

Respond with the corresponding output fields, starting with the field
`[[ ## sentiment ## ]]` and then ending with `[[ ## completed ## ]]`.

Response:

[[ ## sentiment ## ]]
8

[[ ## completed ## ]]
```
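--

## Inspecting the Prompt

A minimal way to see these messages yourself (assuming the `SentimentClassifier` Signature above): DSPy records its LM calls, which can be printed after a prediction.

```py
# dspy.inspect_history() prints the most recent LM call(s):
# system message, user message and raw response
predict = dspy.Predict(SentimentClassifier)
predict(text="I am feeling pretty happy!")
dspy.inspect_history(n=1)
```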
---

## DSPy RAG

```py
class QueryGenerator(dspy.Signature):
    """ Generate a query based on question to fetch relevant context """
    question: str = dspy.InputField()
    query: str = dspy.OutputField()

def search_wikipedia(query: str) -> list[str]:
    """ Query ColBERT endpoint, which is a knowledge source based on wikipedia data """
    results = dspy.ColBERTv2(
        url='http://server:port/wiki17_abstracts')(query, k=1)
    return [x["text"] for x in results]
```

* `QueryGenerator` is a Signature
* `search_wikipedia` is a Python tool
--

## DSPy Custom Module

```py
class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.query_generator = dspy.Predict(QueryGenerator)
        self.answer_generator = dspy.ChainOfThought(
            "question,context->answer")

    def forward(self, question, **kwargs):
        query = self.query_generator(question=question).query
        context = search_wikipedia(query)[0]
        return self.answer_generator(
            question=question, context=context).answer

rag = RAG()
```

* Looks very much like PyTorch...
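Calling the module goes through `forward` (a minimal usage sketch; the question and answer here are illustrative):

```py
# dspy.Module.__call__ routes to forward(), like PyTorch's nn.Module
answer = rag(question="Who wrote the opera Carmen?")
print(answer)   # e.g. "Georges Bizet" -- depends on the retrieved context
```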
--

## DSPy Optimisation

```py
optimiser = dspy.MIPROv2(
    metric=dspy.evaluate.answer_exact_match,
    auto="light",
    num_threads=16)

optimised_rag = optimiser.compile(
    rag,
    trainset=trainset, valset=valset,
    requires_permission_to_run=False)
```

* `optimised_rag` is a new version of `rag` ...
  + ... with its prompts **optimised**!
  + (based on trainset/valset)
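--

## Data & Metric (sketch)

A minimal sketch of what the `trainset`/`valset` above could look like (the example question is illustrative): `dspy.Example` items, with `.with_inputs()` marking the model-input fields, and the same metric reusable for before/after evaluation.

```py
trainset = [
    dspy.Example(question="What is the capital of France?",
                 answer="Paris").with_inputs("question"),
    # ... more labelled examples ...
]
valset = trainset   # tiny illustration only; use a held-out split in practice

# Score a module with the same metric the optimiser uses
evaluate = dspy.evaluate.Evaluate(
    devset=valset, metric=dspy.evaluate.answer_exact_match, num_threads=16)
evaluate(optimised_rag)
```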
---

## DSPy Wrap-up

* DSPy has maintained focus on first principles...
  + Still elegant, and extensible
* NEW: [49-minute Databricks course on DeepLearning.AI](https://www.deeplearning.ai/short-courses/dspy-build-optimize-agentic-apps/)
  + Signatures and Modules
  + MLflow Tracing
  + Optimizing Agents with DSPy Optimizer

---

## Darwin Gödel Machines

* [DGM: Open-Ended Evolution of Self-Improving Agents](https://arxiv.org/abs/2505.22954)
  - Zhang _et al_ (2025)
  + Key authors (@UBC.CA) include:
    - Jenny Zhang (first author, spoke at MLSG in April-2025)
      * [Repo on GitHub](https://github.com/jennyzzt/dgm) (Apache 2)
    - Jeff Clune (see also: MAP-Elites, etc)
  + Sakana.ai blog: [DGM: AI that improves itself by rewriting its own code](https://sakana.ai/dgm/)

;* Gödel ~ Mathematician
;  + Famous for Incompleteness Theorem
;* Darwin ~

--

## DGM : Key Ideas
--

## Gödel ~ Mathematician

* Gödel machines: self-referential universal problem solvers making provably optimal self-improvements
  + Inspiration for [Schmidhuber's 2003 work](https://people.idsia.ch/~juergen/goedelmachine.html) ...
  + (Gödel = famous for the [Incompleteness Theorem](https://www.quantamagazine.org/how-godels-proof-works-20200714/))

--

## Darwin ~ evolution...
---

## Evolutionary Algorithms
#### A bit of history

* Back in the mid-1990s:
  + Neural Networks only 'kinda' worked
    - whereas HMMs and SVMs were on the horizon
  + But Genetic Algorithms / Programming actually worked
    - So: My PhD was in a NN lab, but I did GP
* Extensively covered in MLSG last month...

--

## Evolution Innovations

* [Novelty Search](https://www.semanticscholar.org/paper/NOVELTY-SEARCH-AND-THE-PROBLEM-WITH-OBJECTIVES-TO-Lehman-Stanley/e49d1ee1bddea0922faca358f3fd42474baad300?p2df) - Lehman & Stanley (2011)
  + "Why Greatness Cannot be Planned"
* [MAP-Elites](https://arxiv.org/abs/1504.04909) - Mouret & Clune (2015)
  + Also: Work by *MLSG speaker* Jenny Zhang
* Both help to solve "Population Collapse"

--

## Evolution with LLMs

;* Can use an LLM as the Mutation/Crossover operator
;  + ... and operate on text / prompts / code

* Evolving Prompts:
  + [Promptbreeder](https://arxiv.org/abs/2309.16797) - Fernando _et al_ (2023)
    - "Self-Referential Self-Improvement via Prompt Evolution"
  + [Self-Discover](https://arxiv.org/abs/2402.03620) - Zhou _et al_ (2024)
    - "Large Language Models Self-Compose Reasoning Structures"
* Evolving Programs:
  + [FunSearch](https://www.nature.com/articles/s41586-023-06924-6.pdf) - Romera-Paredes _et al_ (2024)
    - "Mathematical discoveries from program search with large language models"
  + [AlphaEvolve](https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/) - Novikov _et al_ (2025)
    - "A Gemini-powered coding agent for designing advanced algorithms"
* Common core move: the LLM acts as the mutation operator (sketch below)
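A minimal sketch of that pattern, evolving a prompt with an LLM as the mutator (the `llm` and `score` functions are placeholder stubs, not any real API):

```py
import random

def llm(prompt: str) -> str:
    """Placeholder: swap in a real model call (eg: Gemini) here."""
    return prompt.splitlines()[-1] + " Think step by step."

def score(candidate: str) -> float:
    """Placeholder fitness: real systems score against a benchmark."""
    return random.random()

population = ["You are a helpful assistant."]       # seed prompt(s)

for _ in range(20):
    # Tournament selection: fittest of a small random sample
    sample = random.sample(population, k=min(3, len(population)))
    parent = max(sample, key=score)
    # The LLM itself is the mutation operator, rewriting the parent
    population.append(llm(f"Rewrite this prompt so it works better:\n{parent}"))

print(max(population, key=score))
```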
---

## Code & Agent re-writes

* Key task: `SelfImprovement(Parent)🠚Child`
  + Measure effectiveness of improvement on programming tasks
  + (a minimal loop sketch follows)
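A hypothetical sketch of the DGM archive loop: each agent is a blob of code, and the self-improvement step asks the parent to rewrite it (`self_modify` and `swe_bench_score` are illustrative stubs, not the repo's API):

```py
import random

def self_modify(agent_code: str) -> str:
    """Stub: the parent agent + an LLM propose a patched child agent."""
    return agent_code + f"\n# patch-{random.randint(0, 999)}"

def swe_bench_score(agent_code: str) -> float:
    """Stub: DGM evaluates each child agent on coding benchmarks."""
    return random.random()

archive = {"agent-0": "# initial coding agent"}     # every agent is kept
scores = {name: swe_bench_score(code) for name, code in archive.items()}

for gen in range(1, 10):
    # Parent choice trades off score against novelty (here: uniform random)
    parent = random.choice(list(archive))
    child_code = self_modify(archive[parent])
    name = f"agent-{gen}"
    archive[name] = child_code                      # open-ended: no pruning
    scores[name] = swe_bench_score(child_code)
```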
--

## Family Tree

--

## Performance Steps
* _"Throughout this paper, the term SWE-bench refers by default to the SWE-bench Verified subset."_ --- ## Darwin Gödel Machines #### Wrap-up * Self-Improvement is a meta-level up from AlphaEvolve * Code repo contains all the prompts + (no agent framework used) * Also: [Andrej Karpathy's AI Startup School talk](https://www.youtube.com/watch?v=LCEmiRjPEtQ) + for Software 1.0, 2.0, 3.0 (+ ?) --- ## "GPU Kernel Scientist" * Needs a bit of background: + Much more detail at May MeetUp... * Ideas: + GPU Kernels & importance + Evolutionary methods * Actual method / paper --- ## GPU Kernels * Complexity of GPU normally hidden + PyTorch, Keras, JAX, TensorFlow * But sometimes the details matter + DeepSeek; FlashAttention; NeRFs; MAMBA + == Writing CUDA (or equivalent) -- ## Speed-ups Available * [Excellent Blog Post](https://siboehm.com/articles/22/CUDA-MMM) on DIY Matrix Multiply + NB: Doesn't use the [Tensor Cores...](https://github.com/andylolu2/simpleGEMM/blob/master/gemm.cuh) | Step | Method | GFLOPs/sec | | ---- | ------ | ---------: | | 1 | Naïve approach | 309 | | 5 | 2D Block Tiling | 15972 | | 10 | Warptiling | 21779 | | | cuBLAS library | 23250 | | | | | ;https://www.reddit.com/r/MachineLearning/comments/1cqhsln/p_simplegemm_fast_and_minimal_tensor_core_matrix/ -- ## AMD GPU kernels [
[AMD Developer Challenge 2025](https://www.datamonsters.com/amd-developer-challenge-2025)

* Deadlines:
  + Registration=2025-05-01, Competition=2025-06-02
* `FP8 GEMM` · `Fused MOE` · `MLA with ROPE`

;https://x.com/pavel_4_ai/status/1915039361655083223
;AMD software is improving rapidly
;CUDA isn't a moat forever, but Nvidia is building new ones with the Python DSL, Dynamo, and more
;Meanwhile Nvidia hardware advantage is huge this year, but perf/TCO of 355X has attracted some customers
;MI450X is actually competitive with Rubin

---

## LLM approach to competition

* Goals:
  + Write high-performance FP8 AMD kernels
  + Use only LLM code capabilities (not human brain-power)
* Obstacles:
  + Very limited AMD documentation
  + Very few working examples of AMD kernels
    - Particularly for low-precision
  + Can only run code via limited REST API:
    - Compilation & run-time errors
    - Short report about any numerical errors
    - Benchmark results = end-to-end timing (No profiling data)

--

## LLM flows

;* Idea:
;  + Get Gemini Pro to write GPU kernels
;    - ... based on known working ones
;    - bug-fixing if necessary
;  + Use benchmarks (+Flash) to plan next iteration
;  + Use Gemini Flash to suggest experiments
;    - Choose which experiments to do
;  + ... LOOP

--

## LLM 'Tricks'

* Gemini Pro is very effective
  + But relevant context is crucial
  + Evolution allows us to A/B test *everything*
* Gemini Flash decides which node to build on
* Gemini Flash then suggests experiments
  + And also estimates:
    - how they might perform
    - how 'innovative' they are
  + We pick: Best, Least-Bad & Creative experiments
  + (a minimal loop sketch follows)
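A hypothetical sketch of the three-stage loop (the Gemini model roles are from the talk; all function stubs are illustrative, not the paper's code):

```py
import random

def gemini_flash(prompt: str) -> str:
    """Stub: fast/cheap model used for selection + experiment ideas."""
    return "experiment: change the tile size"

def gemini_pro(prompt: str) -> str:
    """Stub: strong model that actually writes the kernel code."""
    return "// kernel source ..."

def run_benchmark(kernel_src: str) -> dict:
    """Stub: the competition REST API returned only compile/run-time errors,
    numerical-error summaries, and end-to-end timings (no profiler)."""
    return {"ok": True, "time_us": random.uniform(400, 900)}

history = [{"src": "// naive HIP kernel", "time_us": 5000.0}]

for _ in range(5):
    # Stage 1: decide which previous kernel to build on (here: fastest so far)
    base = min(history, key=lambda h: h["time_us"])
    # Stage 2: propose experiments, then pick best / least-bad / creative
    idea = gemini_flash(f"Suggest an experiment for:\n{base['src']}")
    # Stage 3: write the new kernel, guided only by benchmark feedback
    src = gemini_pro(f"Apply '{idea}' to:\n{base['src']}")
    result = run_benchmark(src)
    if result["ok"]:
        history.append({"src": src, "time_us": result["time_us"]})
```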
---

## Agentic Flows

--

## Stage 1 : Selection
--

## Stage 2 : Experiments

--

## Stage 3 : Coding
--

## End Results

* `amd-fp8-mm` competition timings:
  + Naïve HIP: ~5000μs
  + PyTorch base-case: 850μs
    - (uses optimised `fp16`)
  + Winning human entry: ~150μs
    - Code is now available 'in-context' *next time*
  + Final LLM-only entry: 450μs
    - Gemini may have over-complicated its solutions

---

## GPU Kernel Scientist
* Experiments done & written up ~5 days after MLSG
  + Accepted paper at the ES-FoMo Workshop at ICML 2025!

---

## Wrap-Up

* LLMs can gain super-powers
  + ... when employed in a purposeful system
* Building these systems is *wide open*
* "Just Do It!" can pay off
NB: MLSG wants to feature Your Talk!
--

## Link to Slides

[https://bit.ly/MLSG_2025-06](https://bit.ly/MLSG_2025-06)

---

## Further Study

* Field is growing very rapidly
* Lots of different things can be done
* Easy to find novel methods / applications

--

## Deep Learning Foundations

* 3 week-days + online content
* Play with real models & Pick-a-Project
* Held online, Live Coding, Certificates
* Next run: Late August

--

## Vision (Advanced)
### Advanced Computer Vision with Deep Learning

* Advanced classification
* Other architectures (eg: U-Nets)
* Transformer-based vision
* Next run: Early September

--

## NLP (Advanced)
### Advanced NLP and Sequence Processing

* NLP (eg: Named Entity Recognition)
* Transformers: Theory and Practice
* Generative AI
* Next run: Late September

--

## AI in Production
### Building Real World A.I. Applications

* DIY: node-server + task-queue + python-ml
* TensorFlow Serving / PyTorch Serve
* TF Lite + TF.js: edge device models
* Distillation, pruning, quantisation, etc...
* Next run: Late October

--

## Deep Learning for PMs
### ( `= Foundations - code` `+ management` )

* Much more about 'big picture'
* Only a few code examples
* Project process standardised
* Next run: September

--

## Also...

* Unsupervised methods
* Time-series & Deep Learning
* Audio Processing (Sounds & Speech)

;--
;
;## QR code for Courses
;
;
---

## Machine Learning SG MeetUp Group

* Next Meeting = ?July?-2025 (NB: ICML in Vancouver)
* Topic: TBA
* Typical Contents:
  + Talk for people starting out
  + Something from the bleeding-edge
  + Lightning Talks
* [MeetUp.com / Machine-Learning-Singapore](https://www.meetup.com/Machine-Learning-Singapore/)

--

## Quick Poll
#### Show of hands

* What topic(s) would _compel_ you to come?
  + Stable-diffusion++ / Video / Gaussian Splatting
  + Robotics
  + AI for Education
  + LLMs for Science
  + Agents

---

# - Questions -
;`Handouts :` [`https://bit.ly/text-similarity-jan-2022`](https://bit.ly/text-similarity-jan-2022)