NLP Lessons Learned

Fifth Elephant 2016

Martin Andrews @ redcatlabs.com

29 July 2016

About Me

  • Machine Intelligence / Startups / Finance
    • After PhD, went in to Finance as a Quant
    • Moved from NYC to Singapore in Sep-2013
  • 2014 = 'fun' :
    • Machine Learning, Deep Learning, NLP
  • Since 2015 = 'serious' :: NLP + deep learning
    • & Papers...

The Project

  • Input : Company text
  • Output : Entities & Relationships

Relationships Graph

Example

Dr Willie Lee Leng Ghee has an MBBS from the University of Singapore and has been a medical practitioner for over 40 years.

Dr Lee was appointed as a non-executive independent director of LCH in February 2006. He is a member of the Audit and Remuneration Committees.

Theoretical Approach

  • ~ Stanford NLP suite;
  • MITIE Relationship Extraction; and
  • GUI to allow interactivity

It's going to be great!

Milestone

  • Better than human data entry
  • e.g. >70% recognition rate

The Real Project

  • Input : PDF version of Annual Reports
  • Output : Entities & Relationships
  • 80% F1 score on Relationships
  • ... without human intervention

What Changed?

  • Scored by Relationships only
  • Text is in PDFs
  • Standalone, rather than human-assist

Problem : Scoring

  • Let's look closer...

Scoring Relationships

  • Entities must be found
  • Entities must be disambiguated
  • 100 different relationships
  • ... all pieces correct :: score += 1
  • ... any mistake :: score −= 1

Example (revisit)

Dr Willie Lee Leng Ghee has an MBBS from the University of Singapore and has been a medical practitioner for over 40 years.

Dr Lee was appointed as a non-executive independent director of LCH in February 2006. He is a member of the Audit and Remuneration Committees.

Back-of-Envelope Scoring

  • Entities must be found : p=0.9*0.9
  • Entities must be disambiguated : p=0.9*0.9
  • 100 different relationships : p=0.9 (?)
  • Problem : 90%^5 << 80%

Problem: PDFs

  • Formatting isn't so simple
  • File format is unstructured :
    • Individual words
    • (x,y) coordinates

Example : Columns

PDF Layout

Example : Tables

PDF Text / Tables

Example : Hierarchy

PDF Hierarchy

Is it important?

  • Available libaries destroy essential information
  • Layout accounts for ~50% of relationships

Problem : Standalone

  • If system goes 'off the rails'
  • ... no user intervention available

Other Issues

  • Good Named Entity Extraction (NER)
  • External library licensing
  • Asian Names
  • Specialised NER
  • Disambiguation : Legal entities
  • Users and Noise

Good NER

  • Academically : ~90% is 'solved'
  • But it's not going to be sufficient in practice
  • (also : speed concerns)

Library Licensing

  • Milestone[+1] : Sell to external customer
  • Excludes using some academic libraries

Asian Names

  • Most corpora are US-centric
  • Many quirks uniquely asian
  • e.g.: "Dr Willie Lee Leng Ghee"

Specialised NER

  • Domain-specific NER ("DSNER") is important too
  • So system has to allow for multiple NER sources :
    • Build it as a collection of services
    • Allows for more experimentation

Users and Noise

  • How about using multiple NER to detect more?
  • Multiple NER won't overlap, so should increase recall
  • ... but multiplies noise too
  • ... and noisy NER looks glaringly bad

Practical Machine Learning

  • Needs some magic ...
  • ... which is in the company's existing data

Building New NER

  • Auto-label existing documents :
    • Learn from labels
    • Actual labels are not license-encumbered
  • Can incorporate Asian quirks

Coverage & Critical Mass

  • If database has 'critical mass' :
    • ... in/out base-assumption flips over
  • Can use DB facts more aggressively

Entity Fingerprinting

  • Groundtruth data can confirm
  • Groundtruth data can weight choices
  • Groundtruth data can create ideas

Ping Pong

  • Guess NER (noise gets attenuated...)
  • Bounce against DB
  • Bounce against document
  • Bounce against web

Net Result

  • Scores ++
  • Robustness ++
  • Data ++
  • Happiness ++

Wrap-up

  • Building NLP systems is not so simple
  • System becomes more dynamic as it grows
  • Very satisfying when it achieves 'lift-off'

- QUESTIONS -


Martin.Andrews @
RedCatLabs.com


My blog : http://mdda.net/

GitHub : mdda