Single Path
NAS ++


TensorFlow & Deep Learning SG


Martin Andrews @ reddragon.ai

30 May 2019

About Me

  • Machine Intelligence / Startups / Finance
    • Moved from NYC to Singapore in Sep-2013
  • 2014 = 'fun' :
    • Machine Learning, Deep Learning, NLP
    • Robots, drones
  • Since 2015 = 'serious' :: NLP + deep learning
    • & GDE ML; TF&DL co-organiser
    • & Papers...
    • & Dev Course...

About Red Dragon AI

  • Google Partner : Deep Learning Consulting & Prototyping
  • SGInnovate/Govt : Education / Training
  • Products :
    • Conversational Computing
    • Natural Voice Generation - multiple languages
    • Knowledgebase interaction & reasoning

Outline

  • whoami = DONE
  • Neural Architectures
  • Neural Architecture Search
  • Single-Path NAS
  • ... different thing, but ...
  • Lottery Ticket Hypothesis
  • Wrap-up

Neural Architectures

  • Simplest viable CNN...
LeNet

Neural Architecture Design

  • Design problem : (#types) ^ (#layers) possible networks (see snippet below)
  • 5 layers & 5 types ⇒ 3,125 networks
  • 5 layers & 50 types ⇒ ~300 million networks
  • 50 layers & 5 types ⇒ ≈ 9 × 10^34 networks
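
A quick sanity check of these counts, just evaluating (#types) ^ (#layers) in plain Python:

```python
# Naive design-space size = (#layer types) ** (#layers)
for n_layers, n_types in [(5, 5), (5, 50), (50, 5)]:
    count = n_types ** n_layers
    print(f"{n_layers} layers x {n_types} types -> {count:.2e} networks")
# -> roughly 3.1e3, 3.1e8 and 8.9e34 networks respectively
```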

Neural Architecture Search

RL NAS

ENAS : Efficient Neural Architecture Search

RL eNAS

RL is not magic, though...

RL vs Random

NASnet idea

NASnet picture

NASnet

SOTA Results

  • All available as Keras pre-trained models
Performance : NASnet
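
For example, the NASNet models ship with tf.keras; a minimal usage snippet:

```python
import tensorflow as tf

# Pre-trained NASNet (ImageNet weights) straight from tf.keras.applications
model = tf.keras.applications.NASNetMobile(weights='imagenet')
model.summary()
```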

Single-Path NAS

Single-Path NAS idea

  • Allow each layer to have a different structure
  • Beat the scaling problem by :
    • Combining the different layer types into one "SuperKernel"
    • Making the search space differentiable
  • Can do the search ~5000x faster
  • Include a hardware-aware loss term

Single-Path picture

Single-Path NAS

Switching layers

  • $\mathbf{w}_k = \mathbf{w}_{3\times 3} + \mathbb{1}(\text{use } 5\times 5)\cdot\mathbf{w}_{5\times 5 \setminus 3\times 3}$
  • Change the indicator into a threshold, and let
  • $g(x, t) = \mathbb{1}(x > t)$ become
  • $\hat{g}(x, t) = \sigma(x > t)$ when computing gradients (sketched in code below)
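
A minimal TF2/Keras sketch of this SuperKernel switching (illustrative only : the names `SuperKernelConv` and `indicator` are made up here, and this is not the authors' code):

```python
import tensorflow as tf

# One 5x5 kernel stores all the weights; the inner 3x3 is always used, and the
# outer "ring" w_{5x5 \ 3x3} is switched on when its squared norm exceeds a
# learnable threshold t. The forward pass uses the hard indicator, while
# gradients flow through the sigmoid relaxation (straight-through style).

def indicator(x, t):
    hard = tf.cast(x > t, tf.float32)        # 1(x > t) in the forward pass
    soft = tf.sigmoid(x - t)                 # sigma(.) used for gradients
    return soft + tf.stop_gradient(hard - soft)

class SuperKernelConv(tf.keras.layers.Layer):
    def __init__(self, filters):
        super().__init__()
        self.filters = filters

    def build(self, input_shape):
        c_in = int(input_shape[-1])
        self.w5 = self.add_weight(name='w5', shape=(5, 5, c_in, self.filters))
        self.t  = self.add_weight(name='t',  shape=(), initializer='zeros')

    def call(self, x):
        # Mask selecting the inner 3x3 of the 5x5 kernel
        inner = tf.pad(tf.ones((3, 3, 1, 1)), [[1, 1], [1, 1], [0, 0], [0, 0]])
        w3, ring = self.w5 * inner, self.w5 * (1.0 - inner)
        use_5x5 = indicator(tf.reduce_sum(ring ** 2), self.t)   # scalar decision
        w_k = w3 + use_5x5 * ring    # w_k = w_3x3 + 1(use 5x5) . w_5x5\3x3
        return tf.nn.conv2d(x, w_k, strides=1, padding='SAME')
```

In the paper, the same trick also switches the expansion ratio and lets whole layers be skipped.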

Optimise Latency

  • Latency approximation :
Latency Approximation
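
A rough sketch of the idea (the exact functional form is in the paper; treat the expressions below as an approximation):

  • Per-layer runtime re-uses the same indicators, e.g. $R_l \approx R_l^{3\times 3} + \mathbb{1}(\text{use } 5\times 5)\,(R_l^{5\times 5} - R_l^{3\times 3})$, with the per-kernel runtimes profiled on the target device
  • The total estimated latency $R$ is added to the objective, roughly $CE(\mathbf{w}, t) + \lambda \cdot \log R(t)$, so the whole search stays differentiable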

Single-Path
Training Results

SinglePath results

Single-Path
Final Network

SinglePath network

Change Gears

  • Interesting resonance with the Lottery Ticket Hypothesis
  • No direct relationship, other than that both :
    • are interesting papers
    • rely on ~90% of the network being prunable
    • involve masking out the irrelevant parts of the network

Lottery Ticket Hypothesis

Lottery Ticket Start

  • Train a network from scratch
    • with random init = R
  • Find the important weights in the finished network
  • Create a mask, and set the other weights to zero
  • Performance of the pruned network ≈ the same (sketch below)
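
A minimal numpy sketch of that mask-building step (one-shot magnitude pruning; names are illustrative):

```python
import numpy as np

def make_mask(trained_weights, keep_fraction=0.10):
    """Keep the largest-magnitude fraction of weights; zero out the rest."""
    threshold = np.quantile(np.abs(trained_weights), 1.0 - keep_fraction)
    return (np.abs(trained_weights) >= threshold).astype(np.float32)

# e.g. for one trained layer's kernel (hypothetical 'trained_kernel' array):
# mask = make_mask(trained_kernel)
# pruned = trained_kernel * mask     # performance stays roughly the same
```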

Lottery Ticket Trick

  • Start a new, pre-pruned network from scratch
    • Initialise the weights from R ⊙ mask
    • i.e. the same random values as R, but only for the weights that mattered in the end
  • The network still trains to be good
  • Performance of the network ≈ the same (sketch below)
    • Even without the rest of the network to 'smooth the gradients'
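
And a sketch of the rewinding trick itself (again illustrative, shown for a single Dense layer):

```python
import numpy as np
import tensorflow as tf

layer = tf.keras.layers.Dense(50, activation='relu')
layer.build((None, 100))
R = layer.kernel.numpy()              # save the random init "R" BEFORE training

# ... train the full model as usual, then mask the largest-magnitude 10% ...
trained = layer.kernel.numpy()
mask = (np.abs(trained) >= np.quantile(np.abs(trained), 0.90)).astype(np.float32)

# Rewind : same random values as R, but only for the weights that mattered
layer.kernel.assign(R * mask)

# Re-train; after each optimiser step, re-apply the mask so that the
# pruned weights stay at zero :
#   layer.kernel.assign(layer.kernel * mask)
```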

Key Quote

The winning tickets we find have won the initialization lottery :
their connections have initial weights that make training particularly effective.

Lottery Tickets : Scale up

Lottery Tickets : Uber Investigation

Lottery Tickets : Uber Investigation

  • What's important about the final weights?
    • Magnitude?
    • Those that have changed most?
  • What's important to 'carry back' as the original mask?
    • Magnitude?
    • Sign?
  • Also 'supermasks'... Hmmm

Lottery Ticket Questions

  • Does this speed up training?
    • Not yet, since we don't know the mask until the network is already trained
  • Why is this interesting?
    • Because networks are effectively performing architecture search

Architecture Search?

  • Train a layer with 5 hidden units many times
    • It probably won't work
  • Train a layer with 50 hidden units once
    • Throw away 45 of the hidden units
    • It probably does work
    • Remembering combinations...
    • ${50 \choose 5} = \frac{50!}{5!\,(50-5)!} \approx 2.1$ million sub-networks have effectively been trained
    • The resulting best 5 units are '1 in a million' (see the snippet below)
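
For reference, the actual count is a one-liner:

```python
from math import comb
print(comb(50, 5))   # 2_118_760 distinct 5-unit sub-networks -> ~2 million 'tickets'
```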

EfficientNet

EfficientNet picture

EfficientNet idea

EfficientNet model

  • Reduce the search space by tying the scaling variables together
  • (having done some initial investigation first; sketch below)
EfficientNet equations
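
Roughly, the compound-scaling rule as given in the EfficientNet paper :

  • depth $d = \alpha^\phi$, width $w = \beta^\phi$, resolution $r = \gamma^\phi$
  • with $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ and $\alpha, \beta, \gamma \ge 1$ found by a small grid search
  • so each increment of $\phi$ roughly doubles the FLOPS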

EfficientNet model

EfficientNet model

EfficientNet Results

Performance : EfficientNet

Wrap-up

  • Neural Architecture Search still has room for experimentation
  • Switching layers on and off is still a thing
  • Maybe optimisation is not so delicate
GitHub - mdda

* Please add a star... *

Deep Learning
MeetUp Group

Deep Learning : Jump-Start Workshop

Deep Learning
Developer Course

RedDragon AI
Intern Hunt

  • Opportunity to do Deep Learning all day
  • Work on something cutting-edge
  • Location : Singapore
  • Status : Remote possible
  • Need to coordinate timing...

- QUESTIONS -


Martin @
RedDragon . AI


Slides & code : https://bit.ly/TFDL-AutoML