SOTA
( & where are we going? )
TensorFlow & Deep Learning SG
6 February 2020
About Me
 Machine Intelligence / Startups / Finance

 Moved from NYC to Singapore in Sep 2013
 2014 = 'fun' :

 Machine Learning, Deep Learning, NLP
 Robots, drones
 Since 2015 = 'serious' : NLP + deep learning

 GDE ML; TF&DL co-organiser
 Red Dragon AI...
About Red Dragon AI
 Google Partner : Deep Learning Consulting & Prototyping
 SGInnovate/Govt : Education / Training
 Research : NeurIPS / EMNLP
 Products :

 Conversational Computing
 Natural Voice Generation - multiple languages
 Knowledgebase interaction & reasoning
Outline
whoami = DONE
 The Debate : Bengio v. Marcus
 NeurIPS keynote :

 Deep Learning : From System 1 to System 2
 Topics :

 Attention, Representations and Symbols
 Causality
 Consciousness Prior and Losses
 Wrap-up
Deep Learning
 Layers of simple units
 Many parameters
 Lots of data
 Gradient descent
 ... works unfairly well (toy sketch below)
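As a toy illustration of this recipe (my example, not from the talk): a two-layer network whose parameters are all fit by plain gradient descent on synthetic data.

```python
# Minimal sketch: layers of simple units + gradient descent (numpy only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))              # 'lots of data' (toy scale)
y = (X[:, 0] * X[:, 1] > 0).astype(float)  # a non-linear target

# Layers of simple units: 2 -> 16 -> 1, so 'many parameters'
W1 = rng.normal(scale=0.5, size=(2, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)

lr = 0.5
for step in range(2000):
    h = np.tanh(X @ W1 + b1)                    # forward pass
    p = 1 / (1 + np.exp(-(h @ W2 + b2)[:, 0]))  # sigmoid output
    # Backward pass: binary cross-entropy gradients for all parameters
    dz2 = (p - y)[:, None] / len(y)
    dW2, db2 = h.T @ dz2, dz2.sum(0)
    dh = dz2 @ W2.T * (1 - h ** 2)
    dW1, db1 = X.T @ dh, dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("accuracy:", np.mean((p > 0.5) == y))  # ... works unfairly well
```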
GOFAI
 Symbols
 Logic
 Planning
 ... but clearly not fashionable now
Tweets... (screenshots of the Bengio v. Marcus exchange)
Debate Summary
 Mostly a storm in a teacup
Some Punches Landed
 Deep Learning faces an uphill battle...
 ... symbols are discrete (non-differentiable)
 So how does DL learn :

 Logic and Reasoning
 Planning
 Generalisation from small samples
 ?
Bengio outline
 Out-of-Distribution (OOD) Generalisation :

 Semantic representations
 Compositionality
 Higher-level Cognition :

 Consciousness Prior
 Causality
 Pointable objects
 Agent perspective:

 Better world models / Knowledge seeking
Condensed version
 Attention, Representations and Symbols
 Causality
 Consciousness Prior and Losses
Attention
 Focus on a few elements
 Can learn 'policy' using soft attention
New-Style Attention
 Neural Machine Translation revolution
 Memory-extended neural nets
 SOTA in NLP (self-attention, transformers)
 Operating on SETS of (key, value) pairs (sketch below)
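A minimal sketch of what "operating on sets of (key, value) pairs" means, in the scaled dot-product form used by transformers (numpy toy, not from the slides):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    """query (d,), keys (n, d), values (n, dv) -> weighted mix of values."""
    scores = keys @ query / np.sqrt(len(query))  # match query against each key
    weights = softmax(scores)                    # soft focus on a few elements
    return weights @ values, weights

rng = np.random.default_rng(0)
keys, values = rng.normal(size=(5, 8)), rng.normal(size=(5, 4))
query = keys[2] + 0.1 * rng.normal(size=8)       # query resembling element 2

out, w = attend(query, keys, values)
print(np.round(w, 2))  # attention mass concentrates on element 2
```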
Attention as Indirection
 Attention = dynamic connection
 Receiver gets the selected Value
 Can also pass 'name' (i.e. Key)
 Can keep track of names
 Manipulate sets of objects
Composable Representations
 Manipulate sets of objects
 Combine different concepts
 Composability becomes beneficial
 A major feature of Language
Causality in 10 minutes
 Basic statistical principle :

 Correlation is not Causation
 But what if we have questions about causality?
Three Stages of Causality
 Association = Seeing/observing
 Intervention = Doing
 Counterfactuals = Imagining/understanding
Simple Stats I
 Describe model using a graph
 Suppose X=ThumbSize, Y=ReadingAbility
 ... stats ⇒ large thumbs = better readers
Simple Stats II
 This is for primary-school children
 X=ThumbSize, Y=ReadingAbility, U=Age
 The Model lets us explain the correlation (simulation below)
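A quick simulation of this slide's story (all numbers invented): Age drives both ThumbSize and ReadingAbility, so the two correlate strongly overall, but the correlation vanishes once Age is held fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(6, 12, n)                      # U : primary-school ages
thumb = 2.0 + 0.3 * age + rng.normal(0, 0.2, n)  # X <- U (+ noise)
reading = 10 * age + rng.normal(0, 5, n)         # Y <- U (+ noise), no X!

print("corr(X, Y) overall :", round(np.corrcoef(thumb, reading)[0, 1], 2))

band = (age > 8.9) & (age < 9.1)                 # condition on Age ~ 9
print("corr(X, Y) | age=9 :",
      round(np.corrcoef(thumb[band], reading[band])[0, 1], 2))
```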
Interventions
 Describe model using a graph
 Question : "Does Smoking cause Cancer?"
Do-Calculus
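A sketch of the kind of computation do-calculus licenses: on invented data with a hypothetical confounder Z, the naive conditional P(cancer | smoke) is biased, while the backdoor adjustment recovers P(cancer | do(smoke)).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.random(n) < 0.3                                  # confounder (e.g. genotype)
smoke = rng.random(n) < np.where(z, 0.8, 0.2)            # Z -> Smoking
cancer = rng.random(n) < 0.05 + 0.10 * smoke + 0.20 * z  # Smoking, Z -> Cancer

# Naive observational estimate: inflated, because Z raises both
print("P(cancer | smoke)     :", round(cancer[smoke].mean(), 3))

# Backdoor adjustment: sum_z P(cancer | smoke, z) * P(z)
do_smoke = sum(cancer[smoke & (z == v)].mean() * (z == v).mean()
               for v in (False, True))
print("P(cancer | do(smoke)) :", round(do_smoke, 3))  # true value here is 0.21
```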
Counterfactuals
 Question : "What if I had taken the job?"
 Question : "What if Hillary had won?"
 Question : "What is the healthcare cost of obesity?"
Gender Discrimination I
 Question : "Does the data show there is Discrimination in Hiring?"
Gender Discrimination II
 Question : "Does the data show there is Discrimination in Hiring?"
 Legal Question : "What is the probability that the employer would have acted differently, had the employee been of a different sex and the qualifications been the same?"
Other Addressable Problems
 Cope with Missing Data
 Reconcile several datasets
 Find causal models compatible with the data
Causality Summary
 Progress has been made
 But this was decades in arriving
 SOTA software can deal with "up to 5 variables"
 Not yet tackled by Deep Learning
 Needs a "Model" provided externally
Data science is the science of interpreting reality, not of summarizing data.
Consciousness Prior
 High-level representation manipulation = just a few words
 "Joint distribution between highlevel variables is a sparse factor graph"
⇒ Pressure to learn representations (toy sketch below)
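One toy reading of "just a few variables at a time" (my sketch, not Bengio's formulation): a top-k bottleneck that broadcasts only a handful of high-level variables and masks the rest.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=32)       # high-level representation: 32 variables
scores = rng.normal(size=32)  # relevance scores (learned in practice, random here)

k = 3                         # only a few variables enter the 'conscious' state
top = np.argsort(scores)[-k:]
conscious = np.zeros_like(h)
conscious[top] = h[top]       # sparse: all other variables are masked out

print("active variables:", sorted(top.tolist()))
```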
Losses
 Can encourage meta behaviour
 Can tease out structure from unlabelled data
 Can be from a learned process
 ...
Meta-Learning
 Train for a single task
 - vs -
 Train for the ability to learn tasks
 Fast weights & Slow weights (sketch below)
 What information can be in DNA?
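A minimal sketch of slow vs fast weights in the spirit of Reptile (Nichol et al. 2018; my example, not from the talk): fast weights adapt to each task, then the slow weights are nudged toward them.

```python
import numpy as np

rng = np.random.default_rng(0)
slow_w = np.zeros(2)  # slow weights: shared starting point across tasks

def sample_task():
    """Each task: fit y = a*x + b for a random (a, b)."""
    a, b = rng.normal(size=2)
    x = rng.normal(size=32)
    return x, a * x + b

for meta_step in range(200):
    x, y = sample_task()
    fast_w = slow_w.copy()
    for _ in range(10):                # inner loop: fast weights adapt
        err = fast_w[0] * x + fast_w[1] - y
        fast_w -= 0.1 * np.array([(err * x).mean(), err.mean()])
    slow_w += 0.1 * (fast_w - slow_w)  # outer loop: slow weights follow

print("meta-learned init:", np.round(slow_w, 2))  # drifts toward the task mean
```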
Unlabelled Data
 BERT <MASK> training
 Word reordering
 Out-of-place losses

 Noise Contrastive Estimation (NCE)
 Temporal consistency

 Contrastive Predictive Coding (CPC) - sketch below
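A minimal sketch of the InfoNCE loss that CPC builds on: score the temporally consistent positive against in-batch negatives (numpy toy):

```python
import numpy as np

def info_nce(context, candidates, pos_index, temperature=0.1):
    """context (d,), candidates (n, d); candidates[pos_index] is the true future."""
    sims = candidates @ context / temperature      # similarity to each candidate
    sims -= sims.max()                             # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum())  # log-softmax over candidates
    return -log_probs[pos_index]                   # cross-entropy on the positive

rng = np.random.default_rng(0)
c = rng.normal(size=16)                   # context representation
negatives = rng.normal(size=(7, 16))      # unrelated samples
positive = c + 0.1 * rng.normal(size=16)  # consistent with the context
candidates = np.vstack([negatives, positive[None]])

print("loss:", round(float(info_nce(c, candidates, pos_index=7)), 3))  # near 0
```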
Learned Losses
 Examples created adversarially

 Examples created 'best efforts'

 ELECTRA (~BERT, but more efficient)
 Benefit : Loss mechanism only needed for training (toy sketch below)
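A toy sketch of ELECTRA-style replaced-token detection (my simplification, not the paper's code): a 'best efforts' generator fills in masked tokens, and the model we keep learns a dense per-token original-vs-replaced signal.

```python
import random

random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
sentence = ["the", "cat", "sat", "on", "the", "mat"]

# Generator: fill masked positions on a 'best efforts' basis (random here)
masked = random.sample(range(len(sentence)), k=2)
corrupted = list(sentence)
for i in masked:
    corrupted[i] = random.choice(vocab)

# Discriminator targets: was each token replaced? A signal on EVERY token,
# unlike BERT's loss on just the masked ones -- hence the efficiency gain.
# (A fill-in that happens to match the original counts as 'original'.)
labels = [int(c != o) for c, o in zip(corrupted, sentence)]
print(list(zip(corrupted, labels)))
```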
Deep Learning MeetUp Group
Deep Learning : Jump-Start Workshop
Deep Learning Developer Course
 QUESTIONS 
Martin @ RedDragon . AI