## Agents, Experts and Extracting Structured Data

#### Machine Learning Singapore
[Martin Andrews](http://mdda.net) @ [reddragon.ai](http://reddragon.ai/)
[Sam Witteveen](http://samwitteveen.com) @ [reddragon.ai](http://reddragon.ai/)
27-August-2025
---

## Today's Line-up

* "LangExtract + New Gemma"
  - _Sam Witteveen_
* "Mixture of Experts Routing"
  - _Florian Kowarsch_
* "Agent Efficiency, Memory and Confidence"
  - _Martin Andrews_
;* "How to Spend your Tokens"
;  - _Martin Andrews_

---

## LangExtract + New Gemma

#### Sam Witteveen

* LangExtract overview
* Updates for the Gemma family

---

## Mixture of Experts Routing

#### Florian Kowarsch

* Mixture of Experts basics
* Intuitions
* The future...

---

;## How to Spend your Tokens

## Agent Efficiency, Memory and Confidence

#### Machine Learning Singapore
[Martin Andrews](http://mdda.net) @ [reddragon.ai](http://reddragon.ai/)
27-August-2025
---

## About Me

* Machine Intelligence / Startups / Finance
  + Moved from NYC to Singapore in Sep-2013
* 2014 = 'fun' :
  + Machine Learning, Deep Learning, NLP
  + Robots, drones
* Since 2015 = 'serious' : NLP + deep learning
  + Including Papers...
  + & GDE ML; ML-Singapore co-organiser...
  + & Red Dragon AI...

--

## About Red Dragon AI

* Deep Learning Consulting & Prototyping (Google Partner)
  - Education / Training
  - Research : NeurIPS / EMNLP / NAACL / ICML / ICLR
* Please contact us for :
  - Language model training (eg: on-prem)
  - Knowledgebase interaction & reasoning
  - Sales-oriented applications

---

## Outline

* Efficiency
* Memory
* Confidence
* Wrap-up & QR-code
;* Head's Up!

---

## Agent System Efficiency?

* Tokens cost us money
* Regular models:
  + Need to control tokens returned
* 'Thinking Models' (Proprietary):
  + Now, not all tokens are returned
* Agent Systems:
  + Agents decide on token spending...

---

## The Paper

* [Efficient Agents: Building Effective Agents While Reducing Cost](https://www.arxiv.org/abs/2508.02694) - Wang _et al._ (2025)
  + ['Oppo Team' Code Repo](https://github.com/OPPO-PersonalAI/OAgents) (Apache 2.0)
  + Demonstrates a lean agent stack that:
    - stays cheap and fast ...
    - ... without losing quality
;  + [Supporter Thread](https://x.com/rohanpaul_ai/status/1954683311248359512)
;    - Has summarised a lot of the key take-aways nicely

--

## Agentic System Diagram
---

## Benchmark

* GAIA (= General AI Assistants): a hard multi-step benchmark
  - has 3 difficulty levels (L1, L2, L3 in the paper)
  - human respondents obtain ~92% overall
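--

## Cost-of-Pass : Worked Example

A tiny sketch of the paper's efficiency metric (hypothetical numbers and helper name): divide the expected dollar cost per attempt by the success rate to get the expected cost of one correct answer.

```python
# cost-of-pass = expected $ spent per correct answer (hypothetical numbers)
def cost_of_pass(cost_per_attempt: float, success_rate: float) -> float:
    """Expected dollars to obtain one correct answer."""
    return cost_per_attempt / success_rate

lean  = cost_of_pass(cost_per_attempt=0.24, success_rate=0.50)  # ≈ $0.48
heavy = cost_of_pass(cost_per_attempt=0.42, success_rate=0.60)  # ≈ $0.70
print(f"lean=${lean:.2f}  heavy=${heavy:.2f}")
```

A config that raises accuracy can still lose on cost-of-pass if its per-attempt cost grows faster than its success rate.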
--

## Metrics

* Efficiency is 'cost-of-pass' :
  - the expected dollar cost for 1 correct answer
  - i.e. cost per try divided by success rate

--

## Improvements

* Biggest wins come from pruning overhead
* Backbone choice drives accuracy
  + Many 'extras' just add tokens

---

## Planning

;#### Agentic method

* Planning has a sweet spot:
  + `steps_max=8` works well (in accuracy terms)
  + However, cost-of-pass rises from \$0.48 to \$0.70
    - when moving from 4 steps to 8
  + Beyond 8 steps, cost keeps rising
  + Re-plan at every step (wider re-planning intervals give no saving)

--

## Best-of-$N$

;#### Agentic method
* Try $N$ candidates and keep the highest-scoring one
  + ... gives tiny gains
* Overall accuracy: 53.33% → 53.94% from N=1 to N=4
  + but token count jumps 243K → 325K
* This is a poor trade

--

## Tool use

;#### Agentic method
* Tool use should be wide and light
  + more search sources help
  + simple browser actions beat complex clicks
* Query expansion of around 5 to 10 rewrites is good
  + pulls in better evidence without bloating the context

--

## Memory

;#### Agentic method
* Having a simple history of observations and actions wins
  + leads to 56.36% accuracy and \$0.74 cost-of-pass
* Heavily *summarized* memory burns tokens and does not help
* Simple memory's effectiveness = biggest surprise in the paper

---

## Overall Results

* 'Efficient Agents' keeps 96.7% of OWL accuracy
  + while cutting per-success cost by 28.4%
* Final choices:
  + Backbone model = GPT-4.1 (!)
  + Planning: `steps_max=8`, interval=1
  + Multi-search sources, 5 query rewrites
  + Simple memory
  + No Best-of-N
* Each choice cuts waste while keeping capability.
  + NB: Gemini series not tested

---

## ~~AgentFly~~ Memento
;#### 'RL' using Memory

* [Memento: Fine-tuning LLM Agents without Fine-tuning LLMs](https://arxiv.org/abs/2508.16153) - Zhou _et al._ (2025)
  + Agent stores each solved attempt in its 'memory'
    - then picks similar cases to guide the next plan
      * using simple similarity or a small neural scorer
  + Policy learns online from task rewards
    - ⇒ case choice keeps improving
  + Because only memory and the retrieval policy update,
    - the base LLM stays frozen,
    - cost stays low, and
    - the agent adapts continuously

; Rename AgentFly -> Memento
; + Cast it as a memory augmented decision process,
;   - learned retrieval policy scores which past cases to reuse.
; + [Supporter Thread](https://x.com/rohanpaul_ai/status/1959892755393458207)
;   - AgentFly stores each solved attempt as a case in episodic memory, then picks similar cases to guide the next plan.
;   - They cast it as a memory augmented decision process, where a learned retrieval policy scores which past cases to reuse.
;   - That policy learns online from task rewards, using either simple similarity or a small neural scorer, so case choice keeps improving.
;   - Because only memory and the retrieval policy update, the base LLM stays frozen, cost stays low, and the agent adapts continuously.
; + [Supporter Thread](https://x.com/omarsar0/status/1960047046444085363)

--

## Memento Diagram
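--

## Memento : Retrieval Sketch

A minimal sketch of the case-memory idea (my toy code, not the authors' implementation; bag-of-words cosine stands in for their similarity scorer):

```python
from collections import Counter
from math import sqrt

memory: list[dict] = []   # each case: solved task, plan used, reward earned

def _vec(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def store(task: str, plan: str, reward: float) -> None:
    memory.append({"task": task, "plan": plan, "reward": reward})

def retrieve(task: str, k: int = 2) -> list[dict]:
    # rank past cases by similarity; break ties toward higher reward
    q = _vec(task)
    ranked = sorted(memory,
                    key=lambda c: (_cosine(q, _vec(c["task"])), c["reward"]),
                    reverse=True)
    return ranked[:k]

store("find the population of France in 2020", "search web; read stats page", 1.0)
store("compute 17 * 23", "use calculator tool", 1.0)
print(retrieve("find the population of Spain in 2020", k=1)[0]["plan"])
# -> search web; read stats page
```

Only `memory` (and, in the paper, a learned retrieval policy) updates; the base LLM stays frozen.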
--

## Memento Results
* These seem overall higher than Oppo was achieving...

--

## Memento Results
* HLE = "Humanity's Last Exam"

--

## Memento Summary

* Paper seems to be rather *Mathified*
  + "Memory-Based MDP with Case-based Reasoning Policy"
* Eventually:
  - "... the CBR planner can be simplified to a single-step setting instead of a multi-step M-MDP"
  - "This single-step setting collapses the TD target to the immediate reward, thereby simplifying the learning objective"
* Implementation has quite a few arbitrary, empirical choices
  + [Code Repo](https://github.com/Agent-on-the-Fly/Memento)
* Memento shows the effectiveness of a simple memory buffer
  + No LLM retraining was required...

; Open Source MCP LLM Memory : https://github.com/campfirein/cipher

---

## LLM Confidence

* [Physics of Neural Networks ICML 2024 Tutorial](https://www.youtube.com/watch?v=yBL7J0kgldU):
  + ⇒ LLMs *know* when they have made a mistake
* Discussed at MLSG [back in 2024-09 "System2"](https://mdda.net/blog/research/talks/rl-for-llms-o1)
  + Context : What is o1 *doing*?
* Also the "Entropix" meme-fest on Twitter ~ @doomslide
* NEWS: It now works!
  + Two papers worth mentioning...

; + Jeremy Howard : "IIUC, someone just got entropix to work and published it!..."
;   - [@doomslide tweet](https://x.com/doomslide/status/1959322444973645896) : "I'm of course very happy about the (impressive) result but more prominent is the soothing feeling of finally escaping the tiny box some of yall tried to fit me to. Here, it's done now. We can move on. We all want to."

---

## Agent Rollout Confidence

* [ARPO : Agentic Reinforced Policy Optimization](https://arxiv.org/abs/2507.19849) - Dong _et al._ (2025)
  + Idea:
    - LLMs show spikes in token entropy (uncertainty) right after using tools
      * a signal that reveals where the model is "thinking hard", but one that often gets ignored
    - So: use this entropy signal to guide exploration
      * dynamically branch sampling when uncertainty is high, instead of passively rolling out fixed trajectories
    - Also, add an Advantage Attribution Estimation step
      * helps the model learn which tool-use paths are truly valuable
  + [Code Repo](https://github.com/dongguanting/ARPO)
* Came out first - from `.cn`

; + Nice entropy forking diagram in paper p1
; + [Supporter Thread](https://x.com/DataScienceDojo/status/1950949575453278303)

--

## ARPO Rollout Entropy
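A minimal sketch of the entropy signal (my simplification, not the ARPO implementation): compute token entropy from the next-token distribution, and branch extra rollouts when a spike appears after a tool call.

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_branch(recent_entropies: list[float], threshold: float = 1.0) -> bool:
    # hypothetical rule: fork additional rollouts when uncertainty spikes
    return max(recent_entropies) > threshold

confident = token_entropy([0.97, 0.01, 0.01, 0.01])  # peaked -> low entropy
uncertain = token_entropy([0.25, 0.25, 0.25, 0.25])  # flat -> high (ln 4 ≈ 1.39)
print(should_branch([confident, uncertain]))  # -> True
```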
--

## ARPO Results
* Qwen3-14B + ARPO uses only half the tool-call budget required by trajectory-level RL methods

; - The results? On 13 tough benchmarks in math, knowledge reasoning, and deep search, ARPO outperforms mainstream trajectory-level RL algorithms — and does it with half the tool-call budget.

---

## LLM Rollout Confidence

* [Deep Think with Confidence](https://arxiv.org/abs/2508.15260) - Fu _et al._ (2025)
  + Meta's [Project Page](https://jiaweizzhao.github.io/deepconf/)
    - Confidence = fixed-window moving average of the negative mean candidate logprobs
      * Seems somewhat akin to Beam Search
  + Easy to deploy: "Just ~50 lines of code in vLLM"
  + Can be embedded within the vLLM inference engine:
    - [Example page](https://jiaweizzhao.github.io/deepconf/static/htmls/code_example.html)
    - [Live PR](https://github.com/vllm-project/vllm/pull/23201)
* Got more attention on Twitter - from Meta

; + Meta [Last Author Thread](https://x.com/jiawzhao/status/1958982524333678877)
;   - Code repo for paper is empty, but the vLLM PR exists

--

### Deep Think with Confidence

[](https://jiaweizzhao.github.io/deepconf/)

* [Click for animation](https://jiaweizzhao.github.io/deepconf/static/htmls/online_deep_thinking.html)

--

### Deep Think with Confidence

;#### Results
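--

### Deep Think with Confidence

A minimal sketch of the confidence measure (my reading of the slide's formula, not the paper's code): per-token confidence is the negative mean of the top-candidate logprobs, then a fixed-window moving average smooths it along the trace.

```python
def token_confidence(candidate_logprobs: list[float]) -> float:
    """Negative mean logprob of the top-k candidate tokens at one step."""
    return -sum(candidate_logprobs) / len(candidate_logprobs)

def windowed_confidence(confidences: list[float], window: int = 3) -> list[float]:
    """Fixed-window moving average over per-token confidences."""
    return [sum(confidences[i:i + window]) / window
            for i in range(len(confidences) - window + 1)]

# peaked distributions (mass concentrated on one candidate) score high
peaked  = token_confidence([-0.05, -4.0])   # ≈ 2.025
diffuse = token_confidence([-0.80, -0.90])  # ≈ 0.85
print(windowed_confidence([peaked, peaked, diffuse, diffuse], window=2))
```

Low-confidence windows can then gate early stopping of a rollout, which is where the token savings come from.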
--

### Deep Think with Confidence

#### Summary

* Yes, it works:
  + Performance boost: ~ +10% accuracy across models & datasets
  + "Ultra-efficient": up to 85% fewer tokens generated
* BUT ... "DeepConf@512 achieves up to 99.9% accuracy"
  - That's a lot of rollouts for "Real Reasoning" ...

---

## Wrap-Up

* Agentic Systems are improving
  + But we also need to be careful with costs
* Adding Memory is ~easy
  + eg: [cipher :: OS MCP LLM Memory](https://github.com/campfirein/cipher)
* Confidence (aka Entropix) now works!
  + in vLLM Real Soon(TM)
NB: MLSG wants to feature Your Talk!
(Say "Hello"...)
--

## Link to Slides

[](https://bit.ly/MLSG_2025-08)

[https://bit.ly/MLSG_2025-08](https://bit.ly/MLSG_2025-08)

---

## Further Study

* Field is growing very rapidly
* Lots of different things can be done
* Easy to find novel methods / applications

--

## Deep Learning Foundations

* 3 week-days + online content
* Play with real models & Pick-a-Project
* Held online, Live Coding, Certificates
* Next run : TBA

--

## NLP (Advanced)

### Advanced NLP and Sequence Processing

* NLP (eg: Named Entity Recognition)
* Transformers : Theory and Practice
* Generative AI
* Next run : 9, 10, 11 September

--

## Vision (Advanced)

### Advanced Computer Vision with Deep Learning

* Advanced classification
* Other architectures (eg: U-Nets)
* Transformer-based vision
* Next run : 7, 8, 9 October

--

## Deep Learning for PMs

### ( `= Foundations - code` `+ management` )

* Much more about 'big picture'
* Only a few code examples
* Project process standardised
* Next run : 21, 22, 23 October

--

## AI in Production

### Building Real World A.I. Applications

* DIY : node-server + task-queue + python-ml
* TensorFlow Serving / PyTorch Serve
* TF Lite + TF.js : edge device models
* Distillation, pruning, quantisation, etc...
* Next run : 3, 4, 5 November

--

## Also...

* Unsupervised methods
* Time-series & Deep Learning
* Audio Processing (Sounds & Speech)

;--
;
;## QR code for Courses
;
---

## Machine Learning SG MeetUp Group

* Next Meeting = 25-Sept-2025 (NB: Location=In Town!)
* Topic : TBA
* Typical Contents :
  + Talk for people starting out
  + Something from the bleeding-edge
  + Lightning Talks
* [MeetUp.com / Machine-Learning-Singapore](https://www.meetup.com/Machine-Learning-Singapore/)

--

## Quick Poll

#### Show of hands

* How did you hear about THIS event?
  + MeetUp email *last* week
  + MeetUp email *last* weekend
  + MeetUp email *this* week
  + Messaging group
  + Friends
  + Work colleagues

--

## Quick Poll

#### Show of hands

* What topic(s) would _compel_ you to come?
  + Stable-diffusion++ / Video / Gaussian Splatting
  + Robotics
  + Reinforcement Learning
  + AI for Education
  + LLMs for Science
  + Agents

---

# See You Next Time!