Learning Language
with BERT
TensorFlow & Deep Learning SG
21 November 2018
About Me
- Machine Intelligence / Startups / Finance
- Moved from NYC to Singapore in Sep-2013
- 2014 = 'fun' :
- Machine Learning, Deep Learning, NLP
- Robots, drones
- Since 2015 = 'serious' :: NLP + deep learning
- & GDE ML; TF&DL co-organiser
- & Papers...
- & Dev Course...
About Red Dragon AI
- Google Partner : Deep Learning Consulting & Prototyping
- SGInnovate/Govt : Education / Training
- Products :
- Conversational Computing
- Natural Voice Generation - multiple languages
- Knowledgebase interaction & reasoning
Outline
- whoami = DONE
- "Traditional" deep NLP
- Innovations (with references)
- New hotness : BERT
- ~ Code
- Wrap-up
Traditional Deep NLP
- Embeddings
- Bi-LSTM layer(s)
- Initialisation & Training
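For concreteness, a minimal sketch of this traditional stack in tf.keras (the vocabulary size, layer widths and 2-class output head are placeholder choices, not anything prescribed):

```python
import tensorflow as tf

VOCAB_SIZE = 20000   # placeholder vocabulary size
EMBED_DIM = 300      # typical word-vector dimensionality
NUM_CLASSES = 2      # e.g. binary sentiment

model = tf.keras.Sequential([
    # Word embeddings: one ~300d vector per vocabulary entry
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # Bi-LSTM reads the token sequence in both directions
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    # Task-specific head, trained 'from zero'
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```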
Traditional Model
Word Embeddings
- Words that are nearby in the text should have similar representations
- Assign a vector (~300d, initially random) to each word
- Slide a 'window' over the text (1Bn words?)
- Word vectors are nudged around to minimise surprise
- Keep iterating until 'good enough'
- The vector-space of words self-organises
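As a concrete (toy) version of the procedure above, word2vec-style training via gensim; the corpus here is two sentences rather than 1Bn words, and note that the 'size' parameter became 'vector_size' in gensim 4.x:

```python
from gensim.models import Word2Vec

# Toy corpus; in practice this would be ~1Bn words of running text
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# Skip-gram (sg=1): slide a 5-word window over the text and nudge
# each word's vector to better predict its context words
model = Word2Vec(sentences, size=300, window=5, min_count=1, sg=1)

# Words that occur in similar contexts end up with similar vectors
print(model.wv.most_similar("cat"))
```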
LSTM chain
One issue: Unrolling forces sequential calculation
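The sequential dependency is easy to see if you drive a single LSTM cell by hand (a TF 2-style sketch; sizes are arbitrary):

```python
import tensorflow as tf

cell = tf.keras.layers.LSTMCell(128)
inputs = tf.random.normal([10, 1, 300])   # 10 timesteps, batch of 1, 300d vectors

# Hidden state and cell state start at zero
state = [tf.zeros([1, 128]), tf.zeros([1, 128])]

# Step t needs the state produced by step t-1, so the timesteps
# cannot be computed in parallel
for t in range(10):
    output, state = cell(inputs[t], state)
```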
Initialisation, etc
- Pre-trained knowledge is in embeddings
- Everything else is 'from zero'
- Need lots of training examples
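A common way that pretrained knowledge gets in (a sketch: word_index is a hypothetical tokeniser mapping, and glove.6B.300d.txt is one of the standard files from the GloVe download page):

```python
import numpy as np

# Hypothetical tokeniser output: word -> integer row id
word_index = {"the": 1, "cat": 2, "sat": 3}
VOCAB_SIZE, EMBED_DIM = len(word_index) + 1, 300

# Everything starts random ('from zero')...
embedding_matrix = np.random.uniform(-0.05, 0.05, (VOCAB_SIZE, EMBED_DIM))

# ...then rows with pretrained GloVe vectors are overwritten
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        word, *coefs = line.rstrip().split(" ")
        if word in word_index:
            embedding_matrix[word_index[word]] = np.asarray(coefs, dtype="float32")
```

The matrix then seeds the embedding layer (e.g. weights=[embedding_matrix] in tf.keras); everything downstream still trains from scratch.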
Innovations
- BPE / SentencePiece
- Transformers
- Language Model tasks
- Fine-tuning
Byte-Pair Encodings
- Initial vocabulary with counts :
  {low: 5, lowest: 2, newer: 6, wider: 3}
- 4 steps of merging (words end with </w>) :
  r + </w>  : 9  →  r</w>   (= new symbol)
  e + r</w> : 9  →  er</w>
  l + o     : 7  →  lo
  lo + w    : 7  →  low
- Out-of-Vocab : "lower" → "low_er</w>"
Transformer Structure
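The structure diagram isn't reproduced here, but the core operation inside each Transformer block, scaled dot-product attention from "Attention Is All You Need" (Vaswani et al., 2017), is a few lines of numpy (shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

seq_len, d_model = 6, 64
x = np.random.randn(seq_len, d_model)    # one vector per token
out = attention(x, x, x)                 # self-attention: Q = K = V = x
```

Every position attends to every other position in one matrix multiply, so there is none of the sequential unrolling that slows the LSTM chain.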
Unsupervised Language Tasks
- Train whole network on large corpus of text :
- ~Embeddings, but context++
- Sample tasks :
- Predict next word ("Language Model")
- Predict missing word ("Cloze tasks")
- Detect sentence/phrase switching
- Obvious in retrospect...
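As a flavour of how Cloze-style pretraining data is built, a toy sketch of BERT's masked-word task (real BERT masks 15% of tokens, and of those replaces 80% with [MASK], 10% with a random word, and leaves 10% unchanged; this version simplifies to [MASK] only):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Hide a random subset of tokens; the model must predict the originals."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)     # training target: the hidden word
        else:
            inputs.append(tok)
            labels.append(None)    # no loss at unmasked positions
    return inputs, labels

print(mask_tokens("the man went to the store to buy milk".split()))
```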
Fine Tuning
- Take a model pretrained on huge corpus
- Do additional training on your (unlabelled) data
- Learn the actual task, using only a few labelled examples
New Hotness
BERT for Tasks
BERT Performance
For Your Problem
- Old way :
- Build model; GloVe embeddings; Train
- Needs lots of data
- New way :
- Use pretrained BERT; Fine-tune on unlabelled data; Train on labelled data (sketch below)
- Less data required
- Expect better results
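One possible shape of the 'new way', using the (then-new) pytorch-pretrained-bert package; API details are as of late 2018, so check the repo for current usage. Google's original TensorFlow release ships an equivalent run_classifier.py script:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

# One labelled example; a real run batches these and loops for ~3 epochs
tokens = ["[CLS]"] + tokenizer.tokenize("this movie was great") + ["[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
label = torch.tensor([1])

# Small learning rate, as recommended in the BERT README (5e-5 .. 2e-5)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loss = model(input_ids, labels=label)   # returns the loss when labels are given
loss.backward()
optimizer.step()
```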
Wrap-up
- BERT is the latest innovation in this NLP trend
- All-round SOTA performance, fully released
- ImageNet moment for NLP
* Please add a star... *
Deep Learning
MeetUp Group
Deep Learning : Jump-Start Workshop
Deep Learning
Developer Course
Red Dragon AI
Intern Hunt
- Opportunity to do Deep Learning all day
- Work on something cutting-edge
- Location : Singapore
- Status : SG/PR FTW
- Need to coordinate timing...