XLNet


TensorFlow & Deep Learning SG


Martin Andrews @ reddragon.ai

6 July 2019

About Me

  • Machine Intelligence / Startups / Finance
    • Moved from NYC to Singapore in Sep-2013
  • 2014 = 'fun' :
    • Machine Learning, Deep Learning, NLP
    • Robots, drones
  • Since 2015 = 'serious' :: NLP + deep learning
    • & GDE ML; TF&DL co-organiser
    • & Papers...
    • & Dev Course...

About Red Dragon AI

  • Google Partner : Deep Learning Consulting & Prototyping
  • SGInnovate/Govt : Education / Training
  • Products :
    • Conversational Computing
    • Natural Voice Generation - multiple languages
    • Knowledgebase interaction & reasoning

Outline

  • whoami = DONE
  • Transformer Architectures
  • What's new about XLNet?
  • Other interesting stuff
  • Wrap-up

Transformer Architectures

Transformer Overview

"Attention Is All You Need" - Vaswani et al (2017)

Transformer Layer

Transformer Single

Let's focus on the Attention box

Attention (recap)

Attention Key-Value

Sequential Attention...

Some Intuition

  • "q" = Queries
    • What is needed here?
  • "k" = Keys
    • Why should you choose me?
  • "v" = Values
    • What you get when you choose me (see the sketch below)
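
A minimal sketch of this in plain NumPy (single head, no masking), just to make the q/k/v roles concrete:

import numpy as np

def scaled_dot_product_attention(q, k, v):
  """q: [n_q, d], k: [n_kv, d], v: [n_kv, d_v] -> [n_q, d_v]."""
  scores = q @ k.T / np.sqrt(q.shape[-1])      # how well each query matches each key
  scores = scores - scores.max(axis=-1, keepdims=True)
  weights = np.exp(scores)
  weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
  return weights @ v                           # weighted sum of the values

# Toy usage: 3 query positions attending over 4 key/value positions
q, k, v = np.random.randn(3, 8), np.random.randn(4, 8), np.random.randn(4, 16)
print(scaled_dot_product_attention(q, k, v).shape)   # (3, 16)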

Transformer Attention

Transformer Calc

Attention Illustration

Transformer Multihead

Visualization of words and multiple attention heads

Transformer Layer

Transformer Single

Let's focus on the Tokenization+Position box

Tokenization

Tokenization Example

Tokenization is also learned from data -
an effectively unlimited vocabulary from ~50k subword tokens
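
A toy sketch of the idea (not the real SentencePiece/BPE training): a fixed subword vocabulary still covers words it has never seen, because unknown words are greedily split into known pieces, down to single characters if needed. The vocabulary below is hand-made for illustration.

# Hand-made toy vocabulary: a few subword pieces plus all lowercase letters
vocab = {'token', 'ization', 'trans', 'former', 'un', 'believ', 'able', 's'}
vocab |= set('abcdefghijklmnopqrstuvwxyz')

def subword_split(word):
  """Greedy longest-match split of a lowercase word into vocabulary pieces."""
  pieces, i = [], 0
  while i < len(word):
    for j in range(len(word), i, -1):     # try the longest piece first
      if word[i:j] in vocab:
        pieces.append(word[i:j])
        i = j
        break
  return pieces

print(subword_split('tokenization'))   # ['token', 'ization']
print(subword_split('unbelievables'))  # ['un', 'believ', 'able', 's']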

Positional Encoding

Positional Encoding

Sine-waves define position (think : FFT)
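
A minimal NumPy sketch of the sinusoidal encoding from Vaswani et al: each position gets a vector of sines and cosines at geometrically spaced frequencies, which is simply added to the token embeddings.

import numpy as np

def positional_encoding(max_len, d_model):
  pos = np.arange(max_len)[:, None]                    # [max_len, 1]
  i = np.arange(d_model // 2)[None, :]                 # [1, d_model/2]
  angles = pos / np.power(10000.0, 2 * i / d_model)    # one frequency per pair of dims
  pe = np.zeros((max_len, d_model))
  pe[:, 0::2] = np.sin(angles)                         # even dims: sine
  pe[:, 1::2] = np.cos(angles)                         # odd dims: cosine
  return pe

print(positional_encoding(512, 1024).shape)   # (512, 1024) - same shape as the embeddings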

Training Objectives

  • Language Models
  • Longer models
  • Introspective Models

Language Models

GPT training

eg: OpenAI's GPT-1 and GPT-2
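
A hedged sketch of the autoregressive objective behind this family: score each token given only its left context. `model_logits` is a stand-in for a causally-masked Transformer's outputs; here it is just random numbers of the right shape.

import numpy as np

def lm_loss(token_ids, model_logits):
  """token_ids: [T]; model_logits[t] is the prediction made after seeing tokens 0..t."""
  total = 0.0
  for t in range(1, len(token_ids)):
    logits = model_logits[t - 1] - model_logits[t - 1].max()   # numerical stability
    log_prob = logits[token_ids[t]] - np.log(np.exp(logits).sum())
    total += -log_prob                     # negative log-likelihood of the next token
  return total / (len(token_ids) - 1)

vocab_size, T = 100, 6
print(lm_loss(np.random.randint(vocab_size, size=T), np.random.randn(T, vocab_size)))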

Long Language Models

TransformerXL training

"Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context" - Dai et al (2019)

Introspection

BERT training

"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" - Devlin et al (2018)

BERT Masking

BERT masking

MASKing tasks are still self-supervised...
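
A simplified sketch of the masking step (real BERT also swaps in random or unchanged tokens for some of the chosen positions): the labels come straight from the text itself, which is why no human annotation is needed. `MASK_ID` is a placeholder id.

import numpy as np

MASK_ID = 0   # placeholder id for the [MASK] token

def mask_tokens(token_ids, mask_prob=0.15, rng=np.random):
  token_ids = np.asarray(token_ids)
  chosen = rng.random(len(token_ids)) < mask_prob
  inputs = np.where(chosen, MASK_ID, token_ids)    # what the model sees
  targets = np.where(chosen, token_ids, -1)        # -1 = position not scored
  return inputs, targets

print(mask_tokens([17, 42, 7, 99, 3, 58, 23, 8]))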

Transformer Layer

Transformer Single

Let's focus on the Top boxes

Reconfigurable outputs

Transformer Outputs

Transformers :
Key Features

  • Attention is main element in processing
    • Some interpretability is possible
  • Pure feed-forward ⇒ Speed
  • Data + Compute → Better model

Timeline

  • GPT-1
  • BERT
  • TransformerXL
  • GPT-2
  • ...

XLNet

XLNet paper

"XLNet: Generalized Autoregressive Pretraining for Language Understanding" - Yang et al (2018)

Key Enhancements

  • Make maximum use of contexts
  • Two-stream attention (fixup)
  • Long memory (like TransformerXL)
  • Loads of compute ⇒ Results++

Fixing the MASK problem

  • Multiple MASK entries clash in BERT
  • Words in a sentence are not independent
    • ... but BERT predicts each MASK independently
  • MASK never appears in real-world (fine-tuning) data

Solution : Better Hiding

  • Choose tokens to omit
  • Rely on Positional Encoding to preserve order (sketch after the figure)

XLNet factorization
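
A hedged sketch of the factorisation idea: sample a random ordering over positions, predict only the last few positions in that ordering, and condition each prediction on the positions that come earlier in the permutation - no MASK symbol required. Real XLNet implements this via attention masks rather than by reordering the input.

import numpy as np

def sample_prediction_targets(seq_len, num_predict, rng=np.random):
  order = rng.permutation(seq_len)      # random factorisation order over positions
  targets = order[-num_predict:]        # only the tail of the order is predicted
  context = {int(t): order[:list(order).index(t)] for t in targets}
  return targets, context               # each target may condition on earlier positions only

targets, context = sample_prediction_targets(seq_len=8, num_predict=2)
for t in targets:
  print('predict position', int(t), 'given positions', sorted(context[int(t)].tolist()))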

Problem : Where is the slot we are filling?

  • Need to fix up position 'flow'
  • Content and Positional information are all mixed up in a regular Transformer
  • Solution : Split position and content into two streams (mask sketch below)

Solution : Split Streams

XLNet streams
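
A conceptual sketch of the resulting attention masks (not the full two-stream implementation): for a chosen factorisation order, the content stream at a position may attend to itself and to earlier-in-order positions, while the query stream may attend only to earlier-in-order positions - so it knows where it is, but never sees the token it has to predict.

import numpy as np

order = [2, 0, 3, 1]                      # example factorisation order over 4 positions
rank = {pos: r for r, pos in enumerate(order)}
n = len(order)

content_mask = np.array([[rank[j] <= rank[i] for j in range(n)] for i in range(n)])
query_mask   = np.array([[rank[j] <  rank[i] for j in range(n)] for i in range(n)])

print(content_mask.astype(int))   # row i: whose content position i's h-stream may use
print(query_mask.astype(int))     # row i: the g-stream never sees its own content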

XL Memory

  • Similar considerations to TransformerXL
  • Need to make sure that Positional Encoding 'joins up'
  • As in TransformerXL, use Relative Segment Encodings
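
A small sketch of the relative-segment idea: rather than absolute "segment A / segment B" embeddings, each pair of positions only needs to know whether they come from the same segment, which composes cleanly with the cached memory. The learned bias vectors attached to "same" / "different" are omitted here.

import numpy as np

def same_segment_matrix(segment_ids):
  """segment_ids: [T] -> [T, T] matrix, 1 where positions share a segment."""
  s = np.asarray(segment_ids)
  return (s[:, None] == s[None, :]).astype(int)

print(same_segment_matrix([0, 0, 0, 1, 1]))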

"Misc"

  • Train on whole words
    • (BERT now also updated for this)
  • Abandon 'next-sentence-or-not' task

Compute

  • XLNet-Large trained on :
    • 512 TPU v3 chips for 500K steps
    • batch_size = 2048
    • hidden_size = 1024
    • max_seq_len = 512
    • i.e. ~1 Billion sequences analysed (2048 × 500K steps)
  • Pretrained model(s) available on GitHub
    • ... similar n_params to BERT-large

Code

  • TPU-ready code

import tensorflow as tf   # TensorFlow 1.x style API (variable scopes, get_variable)

def embedding_lookup(x, n_token, d_embed, initializer, use_tpu=True,
                     scope='embedding', reuse=None, dtype=tf.float32):
  """TPU and GPU embedding_lookup function."""
  with tf.variable_scope(scope, reuse=reuse):
    lookup_table = tf.get_variable('lookup_table', [n_token, d_embed],
                                   dtype=dtype, initializer=initializer)
    if use_tpu:
      # TPUs prefer dense matmuls over sparse gathers, so the lookup is
      # expressed as one-hot indices multiplied into the embedding matrix
      one_hot_idx = tf.one_hot(x, n_token, dtype=dtype)
      if one_hot_idx.shape.ndims == 2:
        return tf.einsum('in,nd->id', one_hot_idx, lookup_table)
      else:
        return tf.einsum('ibn,nd->ibd', one_hot_idx, lookup_table)
    else:
      # On GPU/CPU a plain gather is fine (and cheaper in memory)
      return tf.nn.embedding_lookup(lookup_table, x)

Results

GLUE

XLNet results : GLUE

SQuAD

XLNet results : SQuAD

Summary

  • Not just a little bit better
  • Subtle MASK fix made a big difference
  • This is a heavy-compute activity...

Other Interesting Stuff

  • How do Transformers do NLP?
  • Distilling Language Models
  • Better fine-tuning
  • Making Graphs from text
  • Multi-modal Transformers
  • Wiki-scale SQuAD

NLP Pipeline Idea

  • How do Transformers do NLP ?
  • Create output embedding :
    • Weighted sum of layer outputs
    • The layer weights are trainable
  • Use this embedding to do task
  • See which layers the task needs
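
A hedged sketch of the probing setup described above: a trainable softmax-weighted sum ("scalar mix") over the frozen model's layer outputs; once a probe is trained for a task, the learned weights show which layers that task leans on. Shapes below are illustrative.

import numpy as np

def scalar_mix(layer_outputs, layer_logits):
  """layer_outputs: [n_layers, seq_len, d]; layer_logits: trainable [n_layers]."""
  w = np.exp(layer_logits - layer_logits.max())
  w = w / w.sum()                                # softmax over layers
  return np.tensordot(w, layer_outputs, axes=1)  # weighted sum -> [seq_len, d]

layers = np.random.randn(12, 128, 768)   # stand-in for 12 frozen Transformer layers
logits = np.zeros(12)                    # these weights are what gets trained
print(scalar_mix(layers, logits).shape)  # (128, 768)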

NLP Pipeline Results

BERT NLP pipeline attribution

"BERT Rediscovers the Classical NLP Pipeline"
- Tenney et al (2019)

Distilling Language Models

  • These models are 'hefty' :
    • Difficult to scale
    • Difficult for mobile
  • Solution : Distill model to CNN version
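
A hedged sketch of the general distillation recipe (not necessarily the exact loss in the paper below): the small student is trained to match the large teacher's temperature-softened output distribution, instead of (or alongside) the hard labels.

import numpy as np

def softmax(z, T=1.0):
  z = z / T
  e = np.exp(z - z.max(axis=-1, keepdims=True))
  return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, T=2.0):
  p_teacher = softmax(teacher_logits, T)             # soft targets from the big model
  log_p_student = np.log(softmax(student_logits, T))
  return -(p_teacher * log_p_student).sum(axis=-1).mean()

teacher = np.random.randn(4, 3)   # 4 examples, 3 classes
student = np.random.randn(4, 3)
print(distillation_loss(teacher, student))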

Distilling Transformers

Distilling Transformers

"Transformer to CNN: Label-scarce distillation for
efficient text classification"
- Chia et al (2018)

Distillation Results

  • Basically same accuracy, but :
    • 39x fewer parameters
    • 300x faster inference

Distillation Results

Better Fine Tuning

  • Normal fine-tuning :
    • Train all parameters in the network
    • ... very expensive (need to be careful)
  • 'Adapter' fine-tuning :
    • Don't update the original Transformer
    • Add in extra trainable layers
    • These 'fix up' enough to be effective
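
A rough sketch of an adapter block in the spirit of Houlsby et al: a small bottleneck (down-project, nonlinearity, up-project) with a residual connection, initialised near the identity so training starts from the frozen model's behaviour. The sizes and the ReLU are illustrative choices.

import numpy as np

class Adapter:
  def __init__(self, d_model=768, bottleneck=64, rng=np.random):
    self.W_down = rng.randn(d_model, bottleneck) * 0.01
    self.W_up   = rng.randn(bottleneck, d_model) * 0.01   # near-zero: block ~= identity

  def __call__(self, h):                   # h: [seq_len, d_model] from a frozen layer
    z = np.maximum(h @ self.W_down, 0.0)   # down-project + nonlinearity
    return h + z @ self.W_up               # up-project + residual (only these weights train)

print(Adapter()(np.random.randn(128, 768)).shape)   # (128, 768)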

Adding Adapter Layers

Transformer Adapter Layers

"Parameter-Efficient Transfer Learning for NLP"
- Houlsby et al (2019)

Graphs from Text

  • Knowledge Graphs are cool :
    • But can we learn them from text?
  • Solution : Last layer should 'write graph'

Task as a picture

ViGIL task

ViGIL NeurIPS paper

Graphs from Text poster

"Scene Graph Parsing by Attention Graph"
- Andrews et al (2018)

Multi-Modal Learning

  • Use the MASK technique :
    • To 'fill in' text
    • To 'fill in' video frames (toy sketch below)

VideoBERT
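
A toy sketch of the key VideoBERT trick: quantise video-clip features into "visual tokens" (cluster ids, from k-means over real features in the paper), so video and text become one token sequence that the same masked-prediction objective can handle. The features and codebook below are random, purely for shape.

import numpy as np

def quantise_clips(clip_features, codebook):
  """clip_features: [n_clips, d]; codebook: [k, d] -> visual token ids, shape [n_clips]."""
  dists = ((clip_features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
  return dists.argmin(axis=1)              # nearest centroid id = the "visual word"

clips = np.random.randn(5, 1024)           # stand-in for pretrained video features
codebook = np.random.randn(1000, 1024)     # stand-in codebook (paper uses k-means clusters)
print(quantise_clips(clips, codebook))     # ids are then treated like word ids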

Cooking Dataset

  • Massive dataset :
    • 312K videos
    • Total duration : 23,186 hours (966 days)
  • >100x the size of 'YouCook II'

Cooking Dataset

Cooking with Transformers

Cooking with Transformers

"VideoBERT: A Joint Model for Video and Language Representation Learning"
- Sun et al (2019)

Wrap-up

  • XLNet gives us (another) jump in NLP performance
  • Uses a reasonable-sized model
  • Other experimentation still accessible

GitHub - mdda

* Please add a star... *

Deep Learning
MeetUp Group

Deep Learning : Jump-Start Workshop

Deep Learning
Developer Course

RedDragon AI
Intern Hunt

  • Opportunity to do Deep Learning all day
  • Work on something cutting-edge
  • Location : Singapore
  • Status : Remote possible
  • Need to coordinate timing...

- QUESTIONS -


Martin @
RedDragon . AI