About Me
- Machine Intelligence / Startups / Finance
-
- After PhD, went in to Finance as a Quant
- Moved from NYC to Singapore in Sep-2013
- 2014 = 'fun' :
-
- Machine Learning, Deep Learning, NLP
- Since 2015 = 'serious' :: NLP + deep learning
-
The Project
- Input : Company text
- Output : Entities & Relationships
Example
Dr Willie Lee Leng Ghee has an MBBS from the University of Singapore
and has been a medical practitioner for over 40 years.
Dr Lee was appointed as a non-executive independent director of LCH in February 2006.
He is a member of the Audit and Remuneration Committees.
Theoretical Approach
- ~ Stanford NLP suite;
- MITIE Relationship Extraction; and
- GUI to allow interactivity
It's going to be great!
Milestone
- Better than human data entry
- e.g. >70% recognition rate
The Real Project
- Input : PDF version of Annual Reports
- Output : Entities & Relationships
- 80% F1 score on Relationships
- ... without human intervention
What Changed?
- Scored by Relationships only
- Text is in PDFs
- Standalone, rather than human-assist
Scoring Relationships
- Entities must be found
- Entities must be disambiguated
- 100 different relationships
- ... all pieces correct :: score += 1
- ... any mistake :: score −= 1
Example (revisit)
Dr Willie Lee Leng Ghee has an MBBS from the University of Singapore
and has been a medical practitioner for over 40 years.
Dr Lee was appointed as a non-executive independent director of LCH in February 2006.
He is a member of the Audit and Remuneration Committees.
Back-of-Envelope Scoring
- Entities must be found : p=0.9*0.9
- Entities must be disambiguated : p=0.9*0.9
- 100 different relationships : p=0.9 (?)
- Problem : 90%^5 << 80%
Problem: PDFs
- Formatting isn't so simple
- File format is unstructured :
-
- Individual words
- (x,y) coordinates
Example : Columns
Example : Tables
Example : Hierarchy
Is it important?
- Available libaries destroy essential information
- Layout accounts for ~50% of relationships
Problem : Standalone
- If system goes 'off the rails'
- ... no user intervention available
Other Issues
- Good Named Entity Extraction (NER)
- External library licensing
- Asian Names
- Specialised NER
- Disambiguation : Legal entities
- Users and Noise
Good NER
- Academically : ~90% is 'solved'
- But it's not going to be sufficient in practice
- (also : speed concerns)
Library Licensing
- Milestone[+1] : Sell to external customer
- Excludes using some academic libraries
Asian Names
- Most corpora are US-centric
- Many quirks uniquely asian
- e.g.: "Dr Willie Lee Leng Ghee"
Specialised NER
- Domain-specific NER ("DSNER") is important too
- So system has to allow for multiple NER sources :
-
- Build it as a collection of services
- Allows for more experimentation
Users and Noise
- How about using multiple NER to detect more?
- Multiple NER won't overlap, so should increase recall
- ... but multiplies noise too
- ... and noisy NER looks glaringly bad
Practical Machine Learning
- Needs some magic ...
- ... which is in the company's existing data
Building New NER
- Auto-label existing documents :
-
- Learn from labels
- Actual labels are not license-encumbered
- Can incorporate Asian quirks
Coverage & Critical Mass
- If database has 'critical mass' :
-
- ... in/out base-assumption flips over
- Can use DB facts more aggressively
Entity Fingerprinting
- Groundtruth data can confirm
- Groundtruth data can weight choices
- Groundtruth data can create ideas
Ping Pong
- Guess NER (noise gets attenuated...)
- Bounce against DB
- Bounce against document
- Bounce against web
Net Result
- Scores ++
- Robustness ++
- Data ++
- Happiness ++
Wrap-up
- Building NLP systems is not so simple
- System becomes more dynamic as it grows
- Very satisfying when it achieves 'lift-off'