Learning Language
with BERT


TensorFlow & Deep Learning SG


Martin Andrews @ reddragon.ai

21 November 2018

About Me

  • Machine Intelligence / Startups / Finance
    • Moved from NYC to Singapore in Sep-2013
  • 2014 = 'fun' :
    • Machine Learning, Deep Learning, NLP
    • Robots, drones
  • Since 2015 = 'serious' :: NLP + deep learning
    • & GDE ML; TF&DL co-organiser
    • & Papers...
    • & Dev Course...

About Red Dragon AI

  • Google Partner : Deep Learning Consulting & Prototyping
  • SGInnovate/Govt : Education / Training
  • Products :
    • Conversational Computing
    • Natural Voice Generation - multiple languages
    • Knowledgebase interaction & reasoning

Outline

  • whoami = DONE
  • "Traditional" deep NLP
  • Innovations (with references)
  • New hotness : BERT
  • ~ Code
  • Wrap-up

Traditional Deep NLP

  • Embeddings
  • Bi-LSTM layer(s)
  • Initialisation & Training

Traditional Model

Traditional Deep NLP

Word Embeddings

  • Words that appear in similar contexts should have similar representations
  • Assign a vector (~300d, initially random) to each word
    • Slide a 'window' over the text (~1Bn words)
    • Word vectors are nudged around to minimise surprise
    • Keep iterating until 'good enough'
  • The vector-space of words self-organises (sketch below)
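
A minimal sketch of this procedure using gensim's Word2Vec (assuming gensim >= 4 for the vector_size/epochs parameter names; the toy corpus stands in for a ~1Bn-word one):

```python
from gensim.models import Word2Vec

# Toy corpus: in practice this would be ~1Bn words of tokenised text
sentences = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'sat', 'on', 'the', 'rug'],
    ['dogs', 'and', 'cats', 'are', 'pets'],
]

# Skip-gram (sg=1): slide a +/-5-word window and nudge vectors to predict the context
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1, epochs=50)

print(model.wv['cat'].shape)          # (300,) : one dense vector per word
print(model.wv.most_similar('cat'))   # words used in similar contexts end up nearby
```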

Embedding Visualisation

TensorBoard Embeddings

TensorBoard FTW!

LSTM chain

Traditional LSTM

One issue: Unrolling forces sequential calculation
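
For concreteness, a minimal tf.keras sketch of this traditional stack (the vocabulary size, LSTM width and 5-class output head are placeholder choices):

```python
import tensorflow as tf

# Embeddings -> Bi-LSTM -> dense head: the "traditional" deep NLP classifier
vocab_size, embed_dim, num_classes = 20000, 300, 5

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True),   # word vectors
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),           # reads the sequence in both directions
    tf.keras.layers.Dense(num_classes, activation='softmax'),           # task-specific head
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```

The Bi-LSTM still has to walk the sequence token by token, which is exactly the sequential bottleneck the Transformer removes.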

Initialisation, etc

  • The only pre-trained knowledge is in the embeddings (GloVe-loading sketch below)
  • Everything else starts 'from zero'
  • So lots of labelled training examples are needed
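
A sketch of how that pre-trained knowledge typically gets in: copying GloVe vectors into the embedding matrix (the glove.6B.300d.txt path and the tiny word_index are placeholders):

```python
import numpy as np
import tensorflow as tf

embed_dim = 300
word_index = {'the': 1, 'cat': 2, 'sat': 3}   # token -> integer id, e.g. from a Tokenizer

# Start from small random vectors, then overwrite rows for words found in the GloVe file
embedding_matrix = np.random.normal(scale=0.05, size=(len(word_index) + 1, embed_dim))
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        if parts[0] in word_index:
            embedding_matrix[word_index[parts[0]]] = np.asarray(parts[1:], dtype='float32')

# Only this layer starts with pre-trained knowledge; everything downstream trains from zero
embedding_layer = tf.keras.layers.Embedding(len(word_index) + 1, embed_dim,
                                            weights=[embedding_matrix], trainable=True)
```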

Innovations

  • BPE / SentencePiece
  • Transformers
  • Language Model tasks
  • Fine-tuning

Byte-Pair Encodings

  • Initial vocabulary with counts :
    • {low:5, lowest:2, newer:6, wider:3}
  • 4 steps of merging (words end with </w>) :
    • r + </w> (count 9) → r</w> (a new symbol)
    • e + r</w> (count 9) → er</w>
    • l + o (count 7) → lo
    • lo + w (count 7) → low
  • Out-of-Vocab : "lower" → "low_er</w>" (merge sketch below)
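
The merge procedure above can be reproduced with the reference algorithm from the BPE paper (Sennrich et al.); note that ties in the counts mean the exact merge order may differ slightly from the slide:

```python
import re, collections

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs across the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the chosen pair with one merged symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Words are written as space-separated symbols ending with the end-of-word marker </w>
vocab = {'l o w </w>': 5, 'l o w e s t </w>': 2,
         'n e w e r </w>': 6, 'w i d e r </w>': 3}

for step in range(4):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)      # most frequent pair (ties broken arbitrarily)
    vocab = merge_vocab(best, vocab)
    print(step + 1, best, '->', ''.join(best))
```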

Sentence-Piece Paper

SentencePiece paper

SentencePiece on GitHub
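
A minimal usage sketch of the sentencepiece Python package (corpus.txt is a placeholder for your own one-sentence-per-line text file):

```python
import sentencepiece as spm

# Learn an 8k-piece subword model directly from raw text
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=sp8k --vocab_size=8000 --model_type=unigram')

sp = spm.SentencePieceProcessor()
sp.Load('sp8k.model')

print(sp.EncodeAsPieces('This is a test.'))   # subword pieces, e.g. ['▁This', '▁is', ...]
print(sp.EncodeAsIds('This is a test.'))      # the same pieces as integer ids
```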

Transformer Structure

Single Transformer

Transformers

Many Transformers

Attention-is-all-you-need
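
The core of each Transformer block is scaled dot-product attention; a minimal NumPy sketch (single head, toy dimensions) shows why every position can be processed in parallel rather than unrolled sequentially:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # query/key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the keys
    return weights @ V                                     # weighted mix of the values

np.random.seed(0)
x = np.random.randn(4, 8)                                  # 4 token positions, 8-d head
Wq, Wk, Wv = (np.random.randn(8, 8) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)   # (4, 8): all positions attend to all others in one matrix multiply
```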

Unsupervised Language Tasks

  • Train whole network on large corpus of text :
    • Like word embeddings, but context-dependent
  • Sample tasks :
    • Predict next word ("Language Model")
    • Predict a missing word ("Cloze" task; masking sketch below)
    • Detect whether sentences/phrases have been switched
  • Obvious in retrospect...
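
A simplified sketch of how masked-word ("Cloze") training examples can be generated; the real BERT recipe masks 15% of tokens and, of those, replaces 80% with [MASK], 10% with a random word and keeps 10% unchanged:

```python
import random

def make_masked_lm_example(tokens, mask_prob=0.15, mask_token='[MASK]'):
    """Hide a random subset of tokens; the model must predict the hidden originals."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)        # prediction target: the word that was hidden
        else:
            inputs.append(tok)
            labels.append(None)       # nothing to predict at this position
    return inputs, labels

random.seed(1)
print(make_masked_lm_example('the cat sat on the mat'.split()))
```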

Fine Tuning

  • Take a model pre-trained on a huge corpus
  • Do additional (unsupervised) training on your own unlabelled data
  • Then learn the actual task using only a few labelled examples

Recent Progress

New Hotness

BERT the muppet

BERT for Tasks

BERT model configurations

BERT Performance

BERT performance

BERT on GitHub

For Your Problem

  • Old way :
    • Build model; GloVe embeddings; Train
    • Needs lots of data
  • New way :
    • Use a pretrained BERT;
      fine-tune on your unlabelled data;
      then train on your labelled data (sketch below)
    • Less data required
    • Expect better results
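
A sketch of the 'new way', based on modeling.py from the google-research/bert repo (TensorFlow 1.x; the checkpoint path, sequence length and 5-class head are placeholders, and the repo's run_classifier.py wraps all of this up for you):

```python
import tensorflow as tf
import modeling   # from the google-research/bert repository (assumed to be on PYTHONPATH)

bert_config = modeling.BertConfig.from_json_file('uncased_L-12_H-768_A-12/bert_config.json')

# A batch of already-tokenised examples, mapped to ids with the repo's FullTokenizer
input_ids   = tf.placeholder(tf.int32, [None, 128])
input_mask  = tf.placeholder(tf.int32, [None, 128])
segment_ids = tf.placeholder(tf.int32, [None, 128])
labels      = tf.placeholder(tf.int32, [None])

bert = modeling.BertModel(config=bert_config, is_training=True,
                          input_ids=input_ids, input_mask=input_mask,
                          token_type_ids=segment_ids)

# Small task-specific head on top of BERT's pooled [CLS] representation
logits = tf.layers.dense(bert.get_pooled_output(), units=5)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(2e-5).minimize(loss)   # small LR: fine-tuning, not training from scratch

# Restore the released pre-trained weights before any training steps
tvars = tf.trainable_variables()
assignment_map, _ = modeling.get_assignment_map_from_checkpoint(
    tvars, 'uncased_L-12_H-768_A-12/bert_model.ckpt')
tf.train.init_from_checkpoint('uncased_L-12_H-768_A-12/bert_model.ckpt', assignment_map)
```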

Wrap-up

  • BERT is the latest innovation in this NLP trend
  • All-round SOTA performance, fully released
  • An "ImageNet moment" for NLP

GitHub - mdda

* Please add a star... *

Deep Learning
MeetUp Group

Deep Learning : Jump-Start Workshop

Deep Learning
Developer Course

RedDragon AI
Intern Hunt

  • Opportunity to do Deep Learning all day
  • Work on something cutting-edge
  • Location : Singapore
  • Status : SG/PR FTW
  • Need to coordinate timing...

- QUESTIONS -


Martin @
RedDragon . AI


My blog : http://blog.mdda.net/

GitHub : mdda