Single Path
NAS ++
TensorFlow & Deep Learning SG
30 May 2019
About Me
- Machine Intelligence / Startups / Finance
- Moved from NYC to Singapore in Sep-2013
- 2014 = 'fun' :
  - Machine Learning, Deep Learning, NLP
  - Robots, drones
- Since 2015 = 'serious' : NLP + deep learning
  - & GDE ML; TF&DL co-organiser
  - & Papers...
  - & Dev Course...
About Red Dragon AI
- Google Partner : Deep Learning Consulting & Prototyping
- SGInnovate/Govt : Education / Training
- Products :
  - Conversational Computing
  - Natural Voice Generation - multiple languages
  - Knowledgebase interaction & reasoning
Outline
- whoami = DONE
- Neural Architectures
- Neural Architecture Search
- Single-Path NAS
- ... different thing, but ...
- Lottery Ticket Hypothesis
- Wrap-up
Neural Architectures
Neural Architecture Design
- Design Problem : search space scales as (#types) ^ (#layers)
- 5 layers & 5 types ⇒ 3,125 networks
- 5 layers & 50 types ⇒ ~312 million networks
- 50 layers & 5 types ⇒ ~9 × 10^34 networks
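A quick sanity check of those counts (plain Python, no framework needed) :

```python
# Search space size = (#layer types) ** (#layers)
for layers, types in [(5, 5), (5, 50), (50, 5)]:
    print(f"{layers} layers, {types} types -> {types ** layers:.3g} networks")
```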
Neural Architecture Search
Neural Architecture Search
RL is not magic, though...
NASnet picture
SOTA Results
- All available as Keras pre-trained models
Single-Path NAS idea
- Allow each layer to have a different structure
- Beat the scaling problem by :
  - Combining different layer types into a "SuperKernel"
  - Making the search space differentiable
- Can do the search 5000x faster
- Include a hardware-aware loss
Single-Path picture
Switching layers
- $\mathbf{w}_k = \mathbf{w}_{3\times 3} + \mathbb{1}(\text{use } 5\times 5)\cdot\mathbf{w}_{5\times 5 \setminus 3\times 3}$
- Change the indicator into a threshold, and let
- $g(x, t) = \mathbb{1}(x > t)$ become
- $\hat{g}(x, t) = \sigma(x - t)$ when computing gradients
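A minimal sketch of that trick (TF2; shapes and names are my own, and the paper actually thresholds the squared norm of the extra 5×5 weights) : the forward pass uses the hard indicator, the backward pass sees the sigmoid.

```python
import tensorflow as tf

def hard_indicator(x, t):
    """Forward : 1(x > t). Backward : gradient of sigmoid(x - t)."""
    hard = tf.cast(x > t, tf.float32)
    soft = tf.sigmoid(x - t)
    return soft + tf.stop_gradient(hard - soft)  # straight-through estimator

def superkernel_conv(x, w3, w5_ring, t):
    """One 'SuperKernel' : a 3x3 kernel, plus a gated 5x5 outer ring."""
    use5 = hard_indicator(tf.reduce_sum(w5_ring ** 2), t)
    # Pad the 3x3 kernel out to 5x5, then add the (gated) outer ring
    w = tf.pad(w3, [[1, 1], [1, 1], [0, 0], [0, 0]]) + use5 * w5_ring
    return tf.nn.conv2d(x, w, strides=1, padding='SAME')
```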
Optimise Latency
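The hardware-aware part enters as an extra loss term. A hedged sketch of the shape of the idea (names hypothetical; per-op latencies would come from profiling the target device) : predicted runtime is a function of the same indicator variables, so it is differentiable too.

```python
def predicted_latency(use_large, lat_small, lat_large):
    """Per layer : pay the small-kernel cost, plus the extra large-kernel
    cost whenever that layer's indicator is 'on'."""
    total = 0.0
    for g, t_small, t_large in zip(use_large, lat_small, lat_large):
        total += t_small + g * (t_large - t_small)
    return total

# total_loss = cross_entropy + lam * predicted_latency(gates, lat3x3, lat5x5)
```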
Single-Path
Training Results
Single-Path
Final Network
Change Gears
- Interesting resonance with Lottery Ticket Hypothesis
- No direct relationship other than both :
  - are interesting papers
  - rely on 90% of the network being prunable
  - include the masking out of irrelevant layers
Lottery Ticket Hypothesis
Lottery Ticket Start
- Train a network from scratch
  - Find the important weights in the finished network
  - Create a mask and set the other weights to zero
- Performance of network ~ same
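A sketch of that masking step (NumPy; assuming magnitude pruning, as in the original paper) :

```python
import numpy as np

def magnitude_mask(weights, keep_fraction=0.1):
    """1 for the largest-magnitude weights, 0 for the rest."""
    threshold = np.quantile(np.abs(weights), 1.0 - keep_fraction)
    return (np.abs(weights) >= threshold).astype(weights.dtype)

# pruned = trained_weights * magnitude_mask(trained_weights)  # ~same performance
```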
Lottery Ticket Trick
- Start a new pre-pruned network from scratch
  - Start the weights from R ⊙ mask
  - i.e. the same random values as R, for just the weights that mattered in the end
- Network still trains to be good
- Performance of network ~ same
  - Even without the rest of the network to 'smooth the gradients'
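Putting the trick end-to-end (NumPy sketch; the training steps stay abstract in the comments since `train` is a stand-in — the one concrete step is the rewind) :

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(256, 256))            # the saved random initialisation

# 1. trained = train(R)  -- stand-in below : pretend training shifted the weights
trained = R + rng.normal(scale=0.5, size=R.shape)

# 2. Mask from the *trained* network (magnitude_mask from the sketch above)
mask = (np.abs(trained) >= np.quantile(np.abs(trained), 0.9)).astype(R.dtype)

# 3. The trick : survivors restart from their *original* random values
ticket = R * mask

# 4. retrain(ticket, frozen=mask) -> reported to reach ~the same performance
print(f"ticket keeps {mask.mean():.0%} of the weights")
```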
Key Quote
The winning tickets we find have won the initialization lottery :
their connections have initial weights that make training particularly effective.
Lottery Tickets : Scale up
Lottery Tickets : Uber Investigation
Lottery Tickets : Uber Investigation
- What's important about the final weights?
  - Magnitude?
  - Those that have changed most?
- What's important to 'carry back' as the original mask?
  - Also 'supermasks'... Hmmm
Lottery Ticket Questions
- Does this speed up training?
  - Not yet, since we don't know the mask until we've already trained
- Why is this interesting?
  - Because networks are performing architecture search
Architecture Search?
- Train a layer with 5 hidden units many times
- ... vs train a layer with 50 hidden units once :
  - Throw away 45 of the hidden units
  - Probably does work
- Remembering permutations and combinations...
  - ${50 \choose 5} = \frac{50!}{5!\,(50-5)!} \approx 2.1$ million sub-networks have trained
  - Resulting best 5 are '1 in a million'
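Checking that count :

```python
from math import comb

print(f"{comb(50, 5):,}")  # 2,118,760 possible 5-unit subsets -> roughly '1 in 2 million'
```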
EfficientNet picture
EfficientNet model
- Reduce search space by tying variables
- (having done some investigations first)
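From the paper, the tie is a single compound coefficient $\phi$ (with $\alpha, \beta, \gamma$ found by a small grid search first) :

- $\text{depth} = \alpha^{\phi}$, $\text{width} = \beta^{\phi}$, $\text{resolution} = \gamma^{\phi}$
- subject to $\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2$, so each increment of $\phi$ roughly doubles FLOPS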
EfficientNet model
EfficientNet Results
Wrap-up
- Neural Architecture Search still has room for experimentation
- Switching layers on-and-off is still a thing
- Maybe optimisation is not so delicate
* Please add a star... *
Deep Learning
MeetUp Group
Deep Learning : Jump-Start Workshop
Deep Learning
Developer Course
RedDragon AI
Intern Hunt
- Opportunity to do Deep Learning all day
- Work on something cutting-edge
- Location : Singapore
- Status : Remote possible
- Need to coordinate timing...