The Rise of the Language Model
TensorFlow & Deep Learning KL
28 February 2019
- Machine Intelligence / Startups / Finance
- Moved from NYC to Singapore in Sep-2013
- 2014 = 'fun' :
- Machine Learning, Deep Learning, NLP
- Robots, drones
- Since 2015 = 'serious' : NLP + deep learning
- & Google Developer Expert (ML); TF&DL co-organiser
- & Papers...
- & Dev Course...
About Red Dragon AI
- Google Partner : Deep Learning Consulting & Prototyping
- SGInnovate/Govt : Education / Training
- Products :
- Conversational Computing
- Natural Voice Generation - multiple languages
- Knowledgebase interaction & reasoning
whoami = DONE
- What is a Language Model?
- ELMo, ULMFit, OpenAI GPT, BERT
- New hotness : GPT2
- ~ Anxiety
Major New Trend
- Started at the beginning of 2018
- "The Rise of the Language Model"
- Also "The ImageNet Moment for NLP"
- Examples (see the sketch below) :
- The domestic cat is a small, typically _____
- There are more than seventy cat _____
- I want this talk to be _____
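A toy sketch of the idea (illustration only, not from the talk): a language model just assigns probabilities to the next word given the words so far. Here that is done with bigram counts over a made-up two-sentence corpus; GPT-2 does the same job with a Transformer trained on 40GB of text.

    from collections import Counter, defaultdict

    # Tiny made-up corpus, purely for illustration
    corpus = ("the domestic cat is a small typically furry mammal . "
              "the domestic dog is a small typically friendly mammal .").split()

    # Count bigrams : how often each word follows each other word
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def p_next(prev):
        # P(next word | previous word), estimated from the counts
        total = sum(counts[prev].values())
        return {w: c / total for w, c in counts[prev].items()}

    print(p_next("typically"))   # {'furry': 0.5, 'friendly': 0.5}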
Why Language Models?
- LMs are now back in vogue
- Benefits :
- Unsupervised training (lots of data)
- New attention techniques
- Fine-tuning works unfairly well
- Take an existing (pre-trained) LM :
- Add a classifier for your task
- Weights can be trained quickly (see the sketch below)
- Sudden breaking of multiple SoTA records
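A minimal sketch of the 'add a classifier on top' recipe, assuming TF 2.x, a pretrained text encoder from TF-Hub (the nnlm-en-dim128 module is just a stand-in), and arrays train_texts / train_labels for a binary task:

    import tensorflow as tf
    import tensorflow_hub as hub

    # Pretrained text encoder, loaded from TF-Hub and frozen
    encoder = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2",
                             input_shape=[], dtype=tf.string, trainable=False)

    # Small task-specific head on top of the pretrained weights
    model = tf.keras.Sequential([
        encoder,
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

    # Only the head trains, so this converges quickly even with modest data
    # model.fit(train_texts, train_labels, epochs=3)

Unfreezing the encoder afterwards (trainable=True) and re-fitting with a small learning rate is the actual fine-tuning step.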
What happens if you don't have a good diagram in your blog / paper?
Download and use in TensorFlow = 2 lines of Python
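For example (a sketch assuming TF 1.x and the ELMo module on TF-Hub; the talk does not say which module it used), the 'two lines' are the import and the hub.Module call:

    import tensorflow as tf
    import tensorflow_hub as hub

    # Download + instantiate the pretrained model in one call
    elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)

    # Contextual embeddings for a batch of sentences
    embeddings = elmo(["the domestic cat is a small mammal"],
                      signature="default", as_dict=True)["elmo"]

    with tf.Session() as sess:
        sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
        print(sess.run(embeddings).shape)   # (batch, tokens, 1024)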
Focus on fine-tuning (& practical tricks)
Basic idea for "Attention is all you need" = AIAYN
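The core operation in AIAYN is scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. A plain-NumPy sketch (shapes are illustrative only):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)     # how much each query attends to each key
        scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
        return weights @ V                  # weighted sum of the values

    # Toy shapes : 3 query positions, 4 key/value positions, dimension 8
    Q, K, V = np.random.randn(3, 8), np.random.randn(4, 8), np.random.randn(4, 8)
    print(scaled_dot_product_attention(Q, K, V).shape)    # (3, 8)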
OpenAI : Tricks
- Instead of fine-tuning the model ...
- ... possible to do 'black magic'
- i.e.: Just use the LM to rate problem statements
- Problem :
- Review : "I loved this movie."
- Question : Is the sentiment positive?
- Trick (sketch below) :
- R1: "I loved this movie. Very positive."
- R2: "I loved this movie. Very negative."
- Q : Which review is most likely?
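A sketch of the trick (assuming the Hugging Face transformers wrapper around the public GPT-2 weights, which post-dates the talk): score each candidate continuation with the LM and keep the more probable one.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def avg_log_likelihood(text):
        ids = tokenizer.encode(text, return_tensors="pt")
        with torch.no_grad():
            loss = model(ids, labels=ids)[0]   # mean negative log-likelihood per token
        return -loss.item()

    candidates = ["I loved this movie. Very positive.",
                  "I loved this movie. Very negative."]
    print(max(candidates, key=avg_log_likelihood))  # LM should prefer the 'positive' one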
Other Unsupervised Language Tasks
- Train whole network on large corpus of text :
- ~Embeddings, but context++
- Sample tasks (sketch after this list) :
- Predict next word ("Language Model")
- Predict missing word ("Cloze tasks")
- Detect sentence/phrase switching
- Obvious in retrospect...
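An illustration-only sketch of how raw text becomes training examples for the tasks listed above (next-word prediction and a cloze-style masked word):

    import random

    text = "the domestic cat is a small typically furry carnivorous mammal"
    tokens = text.split()

    # Language-model task : (prefix -> next word) pairs
    lm_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
    print(lm_pairs[2])                 # (['the', 'domestic', 'cat'], 'is')

    # Cloze task : hide one word and predict it from the rest
    i = random.randrange(len(tokens))
    cloze_input = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
    print(cloze_input, "->", tokens[i])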
BERT for Tasks
OpenAI GPT-2 : Training
- Apart from the large model sizes...
- Use of large (unsupervised) training set
- Links from Reddit :
- Posts that are >=3 karma
- ... and have a link (~45 million)
- Extract main text with scraper (~8 million docs)
- Net result : 40GB of text data
OpenAI GPT-2 : Experiments
- Now more interested in Zero-shot results
- Just ask the model for the answer (no task-specific training data; sketch below)
- Question Answering :
- Just add the question on the end and listen...
- Summarisation :
- Add 'TL;DR' on the end, and listen...
- Translation (in essence) :
- Add 'In French, "xyz" is "' and listen...
- NB: Only 10MB of the training text contained any French
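A sketch of the prompting recipes above (assuming the Hugging Face transformers wrapper and the small released GPT-2 checkpoint, neither of which is the talk's code): build the prompt, generate, and read off the continuation.

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    def continue_text(prompt, max_new=40):
        ids = tokenizer.encode(prompt, return_tensors="pt")
        out = model.generate(ids, max_length=ids.shape[1] + max_new,
                             do_sample=True, top_k=40)
        return tokenizer.decode(out[0][ids.shape[1]:])

    article = "Cats were first domesticated in the Near East ..."   # stand-in document

    # Question answering : append the question and listen
    print(continue_text(article + " Q: Where were cats first domesticated? A:"))
    # Summarisation : append 'TL;DR:' and listen
    print(continue_text(article + "\nTL;DR:"))
    # Translation, in essence
    print(continue_text('In French, "I like cats" is "'))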
OpenAI GPT-2 : Stories
- Cherry picked example (unicorn) is incredible
- Others are merely remarkable
- Look in the paper Appendices to see more...
OpenAI GPT-2 : Ethics
- They decided to wait 6 months ...
- ... before releasing their pre-trained model
- ... because it was potentially 'too good'
- Twitter storm ensued
- Just PR, or was the caution warranted?
For Your Problem
- Old way :
- Build model; GloVe embeddings; Train
- Needs lots of data
- New way :
- Use pretrained Language Model;
Fine-tune on unlabelled data; Train on labelled data
- Less data required
- Expect better results
- Hopefully don't release malevolent AI on world
- GPT-2 is the latest innovation in this NLP trend
- SOTA performance, but not fully released...
- ImageNet moment for NLP : Confirmed
* Please add a star... *
Deep Learning : Jump-Start Workshop
- Opportunity to do Deep Learning all day
- Work on something cutting-edge
- Location : Singapore
- Status : Remote possible
- Need to coordinate timing...