# Evolution & Wetware #### Machine Learning Singapore
[Martin Andrews](http://mdda.net) @ [reddragon.ai](http://reddragon.ai/)
;[Sam Witteveen](http://samwitteveen.com) @ [reddragon.ai](http://reddragon.ai/)
22-May-2025
--- ## Today's Line-up * "Evolving GPU Kernels"
- _Martin Andrews_ * "A Deep Learning Approach for Nanomedicine Design"
- _Alvin Chan_ --- # Evolving GPU Kernels #### Machine Learning Singapore
[Martin Andrews](http://mdda.net) @ [reddragon.ai](http://reddragon.ai/)
22-May-2025
--- ## About Me * Machine Intelligence / Startups / Finance + Moved from NYC to Singapore in Sep-2013 * 2014 = 'fun' : + Machine Learning, Deep Learning, NLP + Robots, drones * Since 2015 = 'serious' :: NLP + deep learning + Including Papers... + & GDE ML; ML-Singapore co-organiser... + & Red Dragon AI... -- ## About Red Dragon AI * Deep Learning Consulting & Prototyping (Google Partner) - Education / Training - Research : NeurIPS / EMNLP / NAACL / ICML / ICLR * Please contact us for : - Language model training (eg: on-prem) - Knowledgebase interaction & reasoning - Sales-oriented applications --- ## Outline * GPU Kernels + What's involved? + AMD Challenge * Evolutionary Algorithms + Some History / Ideas * AlphaEvolve + What's new? * Wrap-up & QR-code ;* Head's Up! --- ## GPU Kernels * Complexity of GPU normally hidden + PyTorch, Keras, JAX, TensorFlow * But sometimes the details matter + DeepSeek; FlashAttention; NeRFs; MAMBA + == Writing CUDA (or equivalent) * So what's so difficult? -- ## CPUs vs GPUs
* CPU : Few complex, independent cores * GPU : Many simple cores, tied together in lock-step groups -- ## 2080Ti ( Turing ) [Turing SM diagram
](https://developer.nvidia.com/blog/nvidia-turing-architecture-in-depth/) * So are the boxes the cores? -- ## No... It goes deeper
-- ## ... Tensor Cores
--- ## Matrix Multiply
```java
// Naive GEMM :  C[M*N] = A[M*K] x B[K*N]  (row-major, flat arrays)
for (int m = 0; m < M; m++) {
  for (int n = 0; n < N; n++) {
    float acc = 0.0f;
    for (int k = 0; k < K; k++) {
      acc += A[m * K + k] * B[k * N + n];
    }
    C[m * N + n] = acc;
  }
}
```
[Why GEMM is at the heart of deep learning](https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/)
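A rough, CPU-side illustration (not from the slides) of why these three loops attract so much optimisation effort: the same multiply as a pure-Python loop and as an optimised BLAS call via `numpy`; exact timings depend on the machine.

```python
# Rough CPU-side comparison: pure-Python GEMM loop vs an optimised BLAS call.
# Illustrative only - actual speedups depend on the BLAS build and hardware.
import time
import numpy as np

M = N = K = 128
A = np.random.rand(M, K)
B = np.random.rand(K, N)

def naive_gemm(A, B):
    C = np.zeros((M, N))
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[m, k] * B[k, n]
            C[m, n] = acc
    return C

t0 = time.time(); C_slow = naive_gemm(A, B); t_slow = time.time() - t0
t0 = time.time(); C_fast = A @ B;            t_fast = time.time() - t0

assert np.allclose(C_slow, C_fast)
print(f"python loops : {t_slow:.3f}s    numpy / BLAS : {t_fast:.5f}s")
```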
-- ## Launching GPU kernels * Each (regular) core is assigned to an area of the output matrix ;Blog post : progressively more involved ;Quite a few slides… -- ## But there's more... * [Excellent Blog Post](https://siboehm.com/articles/22/CUDA-MMM) on DIY Matrix Multiply + NB: Doesn't use the [Tensor Cores...](https://github.com/andylolu2/simpleGEMM/blob/master/gemm.cuh)

| Step | Method          | GFLOPs/sec |
| ---- | --------------- | ---------: |
| 1    | Naïve approach  |        309 |
| 5    | 2D Block Tiling |      15972 |
| 10   | Warptiling      |      21779 |
|      | cuBLAS library  |      23250 |

;https://www.reddit.com/r/MachineLearning/comments/1cqhsln/p_simplegemm_fast_and_minimal_tensor_core_matrix/
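To give a flavour of what the 'Block Tiling' step in the table above means, here is a small pure-Python sketch (an illustrative assumption, not the CUDA from the blog post): the output matrix is cut into tiles, each computed from small blocks of A and B that a GPU thread-block would stage into fast shared memory.

```python
# Sketch of 2D block tiling: compute C one TILE x TILE output tile at a time,
# accumulating over TILE-wide slabs of A and B (the blocks a GPU thread-block
# would stage into shared memory). Pure Python/numpy, for illustration only.
import numpy as np

TILE = 32
M = N = K = 128          # assume multiples of TILE for simplicity
A = np.random.rand(M, K)
B = np.random.rand(K, N)
C = np.zeros((M, N))

for m0 in range(0, M, TILE):          # one iteration ~ one thread-block
    for n0 in range(0, N, TILE):
        acc = np.zeros((TILE, TILE))  # per-tile accumulator ("registers")
        for k0 in range(0, K, TILE):  # march along the K dimension
            a_blk = A[m0:m0+TILE, k0:k0+TILE]   # "shared memory" block of A
            b_blk = B[k0:k0+TILE, n0:n0+TILE]   # "shared memory" block of B
            acc += a_blk @ b_blk
        C[m0:m0+TILE, n0:n0+TILE] = acc

assert np.allclose(C, A @ B)
```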
--- ## AMD GPU kernels [AMD Developer Challenge 2025](https://www.datamonsters.com/amd-developer-challenge-2025) * Registration Deadline ~ last MeetUp + Competition deadlines : 27-May - 2-June + `FP8 GEMM` · `Fused MoE` · `MLA with RoPE` ; https://x.com/pavel_4_ai/status/1915039361655083223 ; AMD software is improving rapidly ; CUDA isn't a moat forever, but Nvidia is building new ones with the Python DSL, Dynamo, and more ; Meanwhile Nvidia hardware advantage is huge this year, but perf/TCO of MI355X has attracted some customers ; MI450X is actually competitive with Rubin -- ## Big Question * Can `Gemini 2.5 Pro` write AMD GPU kernels? + Short answer: YES + Longer answer: YES, but how competitive is it? * Next Question: + How can we automate code optimisation? --- ## Evolutionary Algorithms #### A bit of history * Back in the mid-1990s: + Neural Networks only 'kinda' worked - whereas HMMs and SVMs were on the horizon + But Genetic Algorithms / Programming actually worked - So: My PhD was in an NN lab, but I did GP ; - and did temporarily 'win' for 2000-2010 --- ## [Genetic Algorithms](https://en.wikipedia.org/wiki/Genetic_algorithm) * Basic Ideas: + Individual = Bit String + Population = 100s of Individuals + Fitness = evaluate each Individual + Selection = Choose 'good' individuals + Mutation & Crossover - to generate new individuals - which replace 'bad' individuals -- ## Genetic Algorithms
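As a concrete toy version of the ingredients listed above, a minimal bit-string GA sketch; the 'count the 1-bits' fitness is a stand-in assumption for a real objective.

```python
# Minimal bit-string GA: fitness = number of 1-bits (a toy stand-in objective).
import random

GENES, POP, GENERATIONS = 32, 100, 60
MUTATION_RATE = 1.0 / GENES

def fitness(ind):                 # evaluate each Individual
    return sum(ind)

def select(pop):                  # tournament selection of a 'good' Individual
    return max(random.sample(pop, 3), key=fitness)

def crossover(p1, p2):            # single-point crossover of two parents
    cut = random.randrange(1, GENES)
    return p1[:cut] + p2[cut:]

def mutate(ind):                  # occasional bit-flips (local changes)
    return [g ^ 1 if random.random() < MUTATION_RATE else g for g in ind]

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for gen in range(GENERATIONS):    # new individuals replace the old population
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]

print("best fitness :", fitness(max(pop, key=fitness)), "/", GENES)
```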
-- ## Genetic Algorithms * Widely believed / observed: + Mutation is a local operator :: Weak + Crossover powers the global search * Note strong parallels with Nature + Mutations are clearly a thing, BUT... + Practically all species have 2 parents --- ## [Genetic Programming](https://en.wikipedia.org/wiki/Genetic_programming)
* Each Individual is a program, represented as a tree -- ## GP Crossover
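In place of the usual crossover diagram, a minimal sketch of subtree crossover on expression trees stored as nested tuples (an illustrative toy, not the talk's example):

```python
# Subtree crossover for GP: each Individual is an expression tree (nested tuples).
# A random subtree of parent 1 is replaced by a random subtree of parent 2.
import random

def subtrees(tree, path=()):
    """Yield (path, subtree) for every node in the tree."""
    yield path, tree
    if isinstance(tree, tuple):                  # ('op', child, child, ...)
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace(tree, path, new):
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace(parts[path[0]], path[1:], new)
    return tuple(parts)

def crossover(p1, p2):
    path, _ = random.choice(list(subtrees(p1)))      # cut point in parent 1
    _, donor = random.choice(list(subtrees(p2)))     # donated subtree from parent 2
    return replace(p1, path, donor)

parent1 = ('add', ('mul', 'x', 'x'), ('add', 'x', 1))   # x*x + (x+1)
parent2 = ('mul', ('add', 'x', 3), 'x')                 # (x+3) * x
print(crossover(parent1, parent2))
```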
* Crossover operation : Appears to be MADNESS! -- ## GP Crossover Madness * Behaviour of Population != Individual x 100 * For Crossover to work at all: + Qualities that propagate need more robust Individuals + We can look at 'dead code' (for instance) - and draw an analogy with Junk DNA -- ## Genetic Programming #### The Field * Genetic Programming Bibliography ... + now surpasses 10k entries * In 2010, Koza listed 77 results ... + where GP was human-competitive + ... in all sorts of fields --- ## Evolution Innovations * [Novelty Search](https://www.semanticscholar.org/paper/NOVELTY-SEARCH-AND-THE-PROBLEM-WITH-OBJECTIVES-TO-Lehman-Stanley/e49d1ee1bddea0922faca358f3fd42474baad300?p2df) - Lehman & Stanley (2011) + "Why Greatness Cannot be Planned" * [MAP-Elites](https://arxiv.org/abs/1504.04909) - Mouret & Clune (2015) + Also : Work by *MLSG speaker* Jenny Zhang * Help to solve "Population Collapse" --- ## Evolution with LLMs * Can use an LLM as the Mutation/Crossover operator + ... and operate on text / prompts / code * Evolving Prompts: + [Promptbreeder](https://arxiv.org/abs/2309.16797) - Fernando _et al_ (2023) - "Self-Referential Self-Improvement via Prompt Evolution" + [Self-Discover](https://arxiv.org/abs/2402.03620) - Zhou _et al_ (2024) - "Large Language Models Self-Compose Reasoning Structures" * Evolving Programs: + [FunSearch](https://www.nature.com/articles/s41586-023-06924-6.pdf) - Romera-Paredes _et al_ (2024) - "Mathematical discoveries from program search with large language models" -- ## Applications ... * ... to GPU Kernel writing should be clear!
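One way to picture the idea (a sketch under assumptions, not AlphaEvolve's actual code): keep a population of candidate kernels as text, ask an LLM to combine or rewrite the most promising ones, and let an automated benchmark decide what survives. `ask_llm` and `benchmark_gflops` are hypothetical stand-ins for a real model API and a real compile-and-time harness.

```python
# Evolution loop with an LLM as the Mutation/Crossover operator (sketch only).
# `ask_llm` and `benchmark_gflops` are hypothetical stand-ins - wire them up to
# a real code-LLM API and a real compile/correctness/timing harness.
import random

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call a code-writing LLM here")

def benchmark_gflops(kernel_src: str) -> float:
    raise NotImplementedError("compile the kernel, check correctness, time it")

def evolve(seed_kernel: str, generations: int = 20, keep: int = 8) -> str:
    population = [(benchmark_gflops(seed_kernel), seed_kernel)]
    for _ in range(generations):
        population = sorted(population, reverse=True)[:keep]   # selection
        parent_a = random.choice(population)[1]    # 'crossover' via the prompt:
        parent_b = random.choice(population)[1]    # show the LLM two parents
        child = ask_llm(
            "Combine the best ideas from these two GPU kernels into one\n"
            "faster, still-correct kernel:\n"
            f"--- A ---\n{parent_a}\n--- B ---\n{parent_b}\n"
        )
        try:
            population.append((benchmark_gflops(child), child))
        except Exception:
            pass        # candidates that fail to compile/validate are dropped
    return max(population)[1]
```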
--- ## AlphaEvolve
* [_AlphaEvolve_: A coding agent for scientific and algorithmic discovery](https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/AlphaEvolve.pdf) - Novikov _et al_ (2025) + DeepMind [Blog Post](https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/) * The headline: + New AI agent evolves algorithms ... + ... for math and practical applications in computing ... + ... by combining the creativity of LLMs with automated evaluators
-- ## AlphaEvolve
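As a stand-in for the 'automated evaluators' half of the headline above, a toy sketch of how evolved code can be scored automatically (a correctness gate first, then reward speed); the test spec and scoring are illustrative assumptions, and real systems sandbox this step.

```python
# Sketch of an automated evaluator for evolved code: execute the candidate,
# check correctness against known cases, then score by speed. Illustrative only;
# a real system would run candidates in a sandboxed worker, not a bare exec().
import time

TEST_CASES = [((2, 3), 6), ((7, 8), 56), ((12, 12), 144)]   # toy spec: multiply

def evaluate(candidate_src: str) -> float:
    """Return a fitness score; 0.0 for anything broken or wrong."""
    namespace = {}
    try:
        exec(candidate_src, namespace)            # candidate must define solve(a, b)
        solve = namespace["solve"]
        for args, expected in TEST_CASES:         # correctness gate first
            if solve(*args) != expected:
                return 0.0
        start = time.perf_counter()               # then reward speed
        for _ in range(10_000):
            solve(123, 456)
        return 1.0 / (time.perf_counter() - start)
    except Exception:
        return 0.0

print(evaluate("def solve(a, b):\n    return a * b"))
print(evaluate("def solve(a, b):\n    return a + b"))   # wrong -> 0.0
```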
-- ## Open Implementation
* [OpenEvolve Repo on GitHub](https://github.com/codelion/openevolve) (Apache 2) + Super-fast follow-up to the AlphaEvolve announcement! + Developer [Twitter Thread](https://x.com/asankhaya/status/1925153525597982970) & [Reddit Posting](https://www.reddit.com/r/LocalLLaMA/comments/1kr9rvp/openevolve_open_source_implementation_of/)
-- ## AlphaEvolve Key Results
* Faster matrix multiplication + for 4x4 matrices (and others) * Discovering mathematical objects or constructions + that possess optimal (or near-optimal) properties * Optimizing Google's computing ecosystem + recovers ~0.7% of Google's fleet-wide compute resources * Optimizing Gemini kernel tiling strategy + 23% kernel speedup across all kernels + 1% reduction in Gemini's overall training time ;+ (Not in the GPU sense) -- ## Basic Matrix Multiplication
* Multiplying 2x2 matrices "clearly" requires 8 multiplies -- ## Strassen's Method
* Multiplying 2x2 matrices *needs* only 7 multiplies! * AlphaEvolve finds similar tricks for larger matrices
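As a quick check on the claims above: a runnable sketch of the naive 8-multiply product versus Strassen's 7-multiply version for 2x2 matrices (standard textbook identities, not AlphaEvolve's output).

```python
# Strassen's trick for 2x2 matrices: 7 multiplies instead of the naive 8.
import random

def naive_2x2(A, B):                       # 8 multiplies
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    return ((a11*b11 + a12*b21, a11*b12 + a12*b22),
            (a21*b11 + a22*b21, a21*b12 + a22*b22))

def strassen_2x2(A, B):                    # only 7 multiplies
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return ((m1 + m4 - m5 + m7, m3 + m5),
            (m2 + m4,           m1 - m2 + m3 + m6))

A = tuple(tuple(random.randint(-9, 9) for _ in range(2)) for _ in range(2))
B = tuple(tuple(random.randint(-9, 9) for _ in range(2)) for _ in range(2))
assert naive_2x2(A, B) == strassen_2x2(A, B)
print("7-multiply result matches the 8-multiply result")
```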
--- ## New Factor : Better LLMs * Things LLMs can do: + Write / change programs + Compare methods / Compare outcomes + Create / combine instructions + Adding : Human notions of elegance & novelty * Key question: + How do we achieve crossover *everywhere*? ; + Phenotypes Vs Genotypes --- ## Wrap-Up * Everything Old is New again! * LLMs can power larger systems + ... that have surprising capabilities * Experimentally : It's very early days!
NB: MLSG wants to feature Your Talk!
-- ## Link to Slides [QR code
](https://bit.ly/MLSG_2025-05) [https://bit.ly/MLSG_2025-05](https://bit.ly/MLSG_2025-05) --- ## A Deep Learning Approach for Nanomedicine Design #### Alvin Chan * Lipid nanoparticles (LNPs) * COMET : + Predict LNP efficacy and ... + ... accelerate the design of next-generation RNA medicines --- ## Further Study * Field is growing very rapidly * Lots of different things can be done * Easy to find novel methods / applications -- ## Deep Learning Foundations * 3 week-days + online content * Play with real models & Pick-a-Project * Held online, Live Coding, Certificates * Last run : Early September -- ## NLP (Advanced) ### Advanced NLP and Sequence Processing * NLP (eg: Named Entity Recognition) * Transformers : Theory and Practice * Generative AI * Last run : Early October -- ## Vision (Advanced) ### Advanced Computer Vision with Deep Learning * Advanced classification * Other architectures (eg: U-Nets) * Transformer-based vision * Last run : Early November -- ## AI in Production ### Building Real World A.I. Applications * DIY : node-server + task-queue + python-ml * TensorFlow Serving / PyTorch Serve * TF Lite + TF.js : edge device models * Distillation, pruning, quantisation, etc... * Last run : Early February -- ## Deep Learning for PMs ### ( `= Foundations - code`
`+ management` ) * Much more about 'big picture' * Only a few code examples * Project process standardised * Last run : Late January -- ## Also... * Unsupervised methods * Time-series & Deep Learning * Audio Processing (Sounds & Speech) ;-- ; ;## QR code for Courses ; ;
--- ## Machine Learning SG MeetUp Group
* Next Meeting = 19-June-2025 * Topic : TBA * Typical Contents : + Talk for people starting out + Something from the bleeding-edge + Lightning Talks * [MeetUp.com / Machine-Learning-Singapore](https://www.meetup.com/Machine-Learning-Singapore/) -- ## Quick Poll #### Show of hands * What topic(s) would _compel_ you to come? + [Vibe Coding](https://x.com/MatthewBerman/status/1904039128611914144) + LLMs for Science + Agents + Stable-diffusion++ / Video / Gaussian Splatting + Robotics + AI for Education --- # - Questions -
;`Handouts :` [`https://bit.ly/`
`text-similarity-jan-2022`](https://bit.ly/text-similarity-jan-2022)