6 July 2019
whoami
= DONE"Attention Is All You Need" - Vaswani et al (2017)
Let's focus on the Attention box
Sequential Attention...
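A minimal numpy sketch of the scaled dot-product attention inside that box - the names and shapes are illustrative, not the paper's code:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Q, K, V: [seq_len, d_k] query / key / value matrices
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                            # pairwise similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
        return weights @ V                                         # weighted sum of values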
Let's focus on the Tokenization+Position box
Tokenization is also learned -
an effectively infinite vocabulary covered by ~50k subword tokens
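A toy greedy longest-match (WordPiece-style) sketch of how rare words fall back to subword pieces - the vocabulary below is invented for illustration; real models learn ~30k-50k pieces with BPE/WordPiece:

    # Toy greedy longest-match subword tokenizer (WordPiece-style).
    VOCAB = {"trans", "##former", "##s", "un", "##believ", "##able"}

    def wordpiece(word):
        pieces, start = [], 0
        while start < len(word):
            end = len(word)
            while end > start:
                piece = word[start:end] if start == 0 else "##" + word[start:end]
                if piece in VOCAB:
                    pieces.append(piece)
                    break
                end -= 1
            else:
                return ["[UNK]"]           # no piece matched at all
            start = end
        return pieces

    print(wordpiece("transformers"))       # ['trans', '##former', '##s']
    print(wordpiece("unbelievable"))       # ['un', '##believ', '##able']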
Sine-waves define position (think : FFT)
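The fixed encoding interleaves sines and cosines at geometrically spaced wavelengths (hence the FFT intuition); a small numpy sketch of the formula from the paper (d_model assumed even):

    import numpy as np

    def positional_encoding(max_len, d_model):
        # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
        # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
        pos = np.arange(max_len)[:, None]                    # [max_len, 1]
        i = np.arange(d_model // 2)[None, :]                 # [1, d_model/2]
        angles = pos / np.power(10000.0, (2 * i) / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe                                            # added to the token embeddings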
"Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context" - Dai et al (2019)
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" - Devlin et al (2018)
MASKing tasks are still self-supervised...
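Self-supervised means the labels come for free from raw text: hide some tokens and ask the model to restore them. A rough sketch of BERT-style corruption (the 80/10/10 split follows the BERT paper; the mask id and the -100 "ignore" convention are illustrative):

    import random

    MASK_ID = 103   # [MASK] id in the standard BERT vocab; illustrative here

    def mask_tokens(token_ids, vocab_size, mask_prob=0.15):
        """Return (corrupted_input, labels); label is -100 where no prediction is needed."""
        inputs, labels = list(token_ids), []
        for i, tok in enumerate(token_ids):
            if random.random() < mask_prob:
                labels.append(tok)                             # model must recover the original
                r = random.random()
                if r < 0.8:
                    inputs[i] = MASK_ID                        # 80%: replace with [MASK]
                elif r < 0.9:
                    inputs[i] = random.randrange(vocab_size)   # 10%: random token
                # remaining 10%: keep the original token unchanged
            else:
                labels.append(-100)                            # ignored by the loss
        return inputs, labels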
Let's focus on the Top boxes
"XLNet: Generalized Autoregressive Pretraining for Language Understanding" - Yang et al (2018)
The [MASK] problem :
[MASK] entries clash in BERT
[MASK] appearances are artificial - [MASK] never appears in real-world data
batch_size = 2048
hidden_size = 1024
max_seq_len = 512
n_params comparable to BERT-large
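A rough back-of-envelope count, assuming BERT-large-like shapes (24 layers, hidden 1024, FFN 4096, ~32k vocab) and ignoring biases, layer norms and the relative-attention extras:

    n_layer, d_model, d_ff, vocab = 24, 1024, 4096, 32000

    per_layer = 4 * d_model * d_model           # Q, K, V and output projections
    per_layer += 2 * d_model * d_ff             # feed-forward up- and down-projection
    embeddings = vocab * d_model

    total = n_layer * per_layer + embeddings
    print(f"~{total / 1e6:.0f}M parameters")    # ~335M, roughly BERT-large scale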
import tensorflow as tf   # TF 1.x API (variable_scope / get_variable), as in the XLNet codebase

def embedding_lookup(x, n_token, d_embed, initializer, use_tpu=True,
                     scope='embedding', reuse=None, dtype=tf.float32):
  """TPU and GPU embedding_lookup function."""
  with tf.variable_scope(scope, reuse=reuse):
    lookup_table = tf.get_variable('lookup_table', [n_token, d_embed],
                                   dtype=dtype, initializer=initializer)
    if use_tpu:
      # On TPU a dense one-hot matmul is faster than a sparse gather,
      # so the lookup is expressed as an einsum against the full table.
      one_hot_idx = tf.one_hot(x, n_token, dtype=dtype)
      if one_hot_idx.shape.ndims == 2:
        return tf.einsum('in,nd->id', one_hot_idx, lookup_table)
      else:
        return tf.einsum('ibn,nd->ibd', one_hot_idx, lookup_table)
    else:
      # On GPU/CPU the native gather-based lookup is fine.
      return tf.nn.embedding_lookup(lookup_table, x)
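The one-hot einsum above is mathematically just a table lookup; a quick numpy check of that equivalence, independent of the TF code:

    import numpy as np

    table = np.random.randn(8, 4)            # [n_token, d_embed]
    ids = np.array([3, 1, 7])                # token indices
    one_hot = np.eye(8)[ids]                 # [3, 8]

    assert np.allclose(one_hot @ table, table[ids])   # same embeddings either way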
GLUE
SQuAD
The [MASK] fix made a big difference
"BERT Rediscovers the Classical NLP Pipeline"
- Tenney et al (2019)
"Transformer to CNN: Label-scarce distillation for
efficient text classification" - Chia et al (2018)
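The recipe there is standard soft-target distillation: train the small CNN to match the big Transformer's temperature-softened outputs. A generic sketch of that loss, not the paper's code:

    import numpy as np

    def softmax(z, T=1.0):
        z = z / T
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # Cross-entropy between the teacher's softened distribution and the student's.
        p_teacher = softmax(teacher_logits, T)
        log_p_student = np.log(softmax(student_logits, T) + 1e-12)
        return -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T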
"Parameter-Efficient Transfer Learning for NLP"
- Houlsby et al (2019)
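The adapter idea: freeze the pretrained weights and insert tiny bottleneck layers with residual connections, so only a few percent of parameters are trained per task. A minimal sketch of the forward pass; the sizes here are illustrative:

    import numpy as np

    class Adapter:
        """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
        def __init__(self, d_model=768, d_bottleneck=64, seed=0):
            rng = np.random.default_rng(seed)
            self.W_down = rng.normal(0.0, 0.01, (d_model, d_bottleneck))
            self.W_up = np.zeros((d_bottleneck, d_model))   # near-identity at init

        def __call__(self, h):
            # h: [seq_len, d_model] hidden states from a frozen Transformer sub-layer
            return h + np.maximum(h @ self.W_down, 0.0) @ self.W_up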
"Scene Graph Parsing by Attention Graph"
- Andrews et al (2018)
The [MASK] technique :
"VideoBERT: A Joint Model for Video and Language Representation Learning"
- Sun et al (2019)