Deep Learning Voices


San Francisco GDG


Martin Andrews @ reddragon.ai

12 November 2018

About Me

  • Machine Intelligence / Startups / Finance
    • Moved from NYC to Singapore in Sep-2013
  • 2014 = 'fun' :
    • Machine Learning, Deep Learning, NLP
    • Robots, drones
  • Since 2015 = 'serious' :: NLP + deep learning
    • & GDE ML; TF&DL co-organiser
    • & Papers...
    • & Dev Course...

About Red Dragon AI

  • Google Partner : Deep Learning Consulting & Prototyping
  • SGInnovate/Govt : Education / Training
  • Products :
    • Conversational Computing
    • Natural Voice Generation - multiple languages
    • Knowledgebase interaction & reasoning

Outline

  • whoami = DONE
  • Problem : Generating Speech Audio
  • Background
  • WaveNet(s)
  • New directions
  • Wrap-up

Text to Speech

  • Typical structure :
    • Encode text to 'audio features'
    • Convert features to actual audio
TTS basic flow

Last Audio Step

  • Has been 'working' for a while :
    • Concatenative (eg: original-Siri)
    • Parametric (eg: WORLD/Merlin)
    • Spectrum inversion (eg: Griffin-Lim)
Waves and spectra

Concatenative

Concatenative TTS

Sounds lumpy / jumpy

Parametric

Parametric TTS

Sounds like robot / accordion

Spectrum Inversion

  • Problem is that a spectrum is lossy :
    • Phase information gets discarded
    • Introduces 'chorus' artifacts unless perfect
Phase difference from original
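
To make the 'phase gets discarded' point concrete, here is a minimal sketch (plain Python, a single DFT frame rather than a full STFT, with made-up test tones): two different waveforms share exactly the same magnitude spectrum. An inversion method that only sees magnitudes, such as Griffin-Lim, cannot tell them apart and has to estimate the phase.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform (fine for a tiny illustrative frame)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def magnitudes(signal):
    """Magnitude spectrum: the phase of every bin is thrown away here."""
    return [abs(c) for c in dft(signal)]

n = 16
# Two tones (bins 1 and 3). The signals differ only in the phase of the second tone.
original = [math.sin(2 * math.pi * t / n) + 0.5 * math.sin(2 * math.pi * 3 * t / n)
            for t in range(n)]
shifted = [math.sin(2 * math.pi * t / n) + 0.5 * math.cos(2 * math.pi * 3 * t / n)
           for t in range(n)]

max_magnitude_gap = max(abs(a - b) for a, b in zip(magnitudes(original), magnitudes(shifted)))
max_waveform_gap = max(abs(a - b) for a, b in zip(original, shifted))
print("largest magnitude difference:", max_magnitude_gap)
print("largest waveform difference :", max_waveform_gap)
```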

WaveNet v1

WaveNet v1 MOS

Key Elements

  • Produce audio samples from network
  • Output of distributions
  • CNN with dilation
  • Computational burden

Audio samples from network

  • Data output :
    • 16 kHz rate (now 24 kHz)
    • 8-bit μ-law (now 16-bit PCM)
  • Very long time-dependencies :
    • Normal RNNs are limited to ~50 steps
    • Word features are 1000s of steps
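
The 8-bit μ-law step can be sketched as below. This is a minimal illustration of standard μ-law companding with μ = 255 (256 bins), assuming samples are floats in [-1, 1]; the function names are hypothetical, not taken from any WaveNet code.

```python
import math

MU = 255  # 8-bit mu-law: mu = 255 gives 256 quantisation bins

def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer bin in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1.0) / 2.0 * mu))

def mu_law_decode(bin_index, mu=MU):
    """Map an integer bin in [0, mu] back to a sample in [-1, 1]."""
    compressed = 2.0 * bin_index / mu - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)

for sample in (-1.0, -0.01, 0.01, 0.5, 1.0):
    bin_index = mu_law_encode(sample)
    print(sample, "->", bin_index, "->", round(mu_law_decode(bin_index), 4))
```

The bins are spaced logarithmically, so quiet samples get finer resolution than loud ones.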

Output of distributions

  • Instead of raw audio :
    • Output a complete distribution for each timestep
    • Seems like 256x as much work
    • ... seems crazy, but the results speak for themselves ...
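
A rough sketch of what 'output a complete distribution' means in practice: the network emits 256 scores per timestep, a softmax turns them into probabilities over the 256 μ-law bins, and the next sample is drawn from that distribution. The random logits below are only a stand-in for real network output.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_bin(probs, rng):
    """Draw one bin index according to the predicted distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(256)]  # stand-in for one timestep of network output
probs = softmax(logits)
next_bin = sample_bin(probs, rng)
print("probabilities sum to", sum(probs), "sampled bin:", next_bin)
```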

WaveNet v1 bins

Regular CNNs

Regular CNN

Look at the 'linear footprint'

Dilated CNNs

Dilated CNN

Look at the 'exponential footprint'
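
The 'linear' versus 'exponential' footprint can be checked with a little arithmetic. A minimal sketch, assuming kernel size 2 and dilations that double at every layer (1, 2, 4, ...):

```python
def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

layers = 10
regular = receptive_field([1] * layers)                      # every layer has dilation 1
dilated = receptive_field([2 ** i for i in range(layers)])   # dilations 1, 2, 4, ..., 512

print("regular CNN, 10 layers:", regular, "samples")
print("dilated CNN, 10 layers:", dilated, "samples")
```

With the same depth, the regular stack sees 11 samples while the dilated stack sees 1024.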

CNNs Pro/Con

  • Advantages :
    • Can have very long 'look back'
    • Fast to train
  • Disadvantages :
    • No 'next sample' scheme

Computational burden

  • Training is QUICK :
    • All timesteps have known next training samples
  • Inference / Running is SLOW :
    • 1 sec of output audio = 1 minute of GPU time
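
Why inference is slow, as a back-of-envelope sketch using the figures from these slides (16 kHz output, 1 minute of GPU per second of audio). The toy 'model' is a hypothetical stand-in for a full network forward pass; the point is only that each sample has to wait for the previous one.

```python
SAMPLE_RATE_HZ = 16_000               # from the slides: 16 kHz output
GPU_SECONDS_PER_AUDIO_SECOND = 60.0   # from the slides: 1 sec of output = 1 minute of GPU

def generate(next_sample, n_samples):
    """Autoregressive loop: every new sample depends on the samples before it."""
    samples = []
    for _ in range(n_samples):
        samples.append(next_sample(samples))
    return samples

def toy_next_sample(history):
    # Hypothetical stand-in for a full network forward pass.
    return (history[-1] + 1) % 256 if history else 0

one_second = generate(toy_next_sample, SAMPLE_RATE_HZ)
sequential_steps = len(one_second)
ms_per_sample = GPU_SECONDS_PER_AUDIO_SECOND / sequential_steps * 1000.0
print(sequential_steps, "sequential steps,", ms_per_sample, "ms of GPU per sample")
```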

WaveNet Sequential

Parallel WaveNet

Goal = Parallel

Parallel WaveNet Noise to Waveform

New Training Process

Parallel WaveNet student-teacher

Noise → Distribution → Sample → Distribution
(optimise for distributions being the same)
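
'Optimise for distributions being the same' can be read as minimising a KL divergence between the student's and the teacher's predicted distributions. A toy sketch with small categorical distributions (an assumption for illustration: the real model does not use 3-bin categoricals, and all the numbers here are made up):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)

teacher = [0.10, 0.60, 0.30]        # made-up teacher prediction for one timestep
student_far = [0.60, 0.10, 0.30]    # student before training: disagrees with teacher
student_near = [0.12, 0.58, 0.30]   # student after training: close to teacher

loss_far = kl_divergence(student_far, teacher)
loss_near = kl_divergence(student_near, teacher)
print("loss when far :", loss_far)
print("loss when near:", loss_near)
```

The loss shrinks towards zero as the student's distribution approaches the teacher's.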

Using RNNs

WaveRNN

WaveRNN model diagram

Significant Google engineering effort...

'Flow' Techniques

WaveGlow

WaveGlow model

Based on 'Glow' from OpenAI
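
The core building block of Glow-style flows is the affine coupling layer, which is exactly invertible by construction. A minimal sketch in plain Python: `scale_and_shift` is a hypothetical stand-in for the learned network, and the input is assumed to have even length.

```python
import math

def scale_and_shift(x_a):
    # Hypothetical stand-in for the learned network that conditions on x_a.
    log_scale = [math.tanh(v) for v in x_a]
    shift = [0.5 * v for v in x_a]
    return log_scale, shift

def coupling_forward(x):
    """Leave the first half untouched; scale and shift the second half."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = scale_and_shift(x_a)
    y_b = [b * math.exp(ls) + t for b, ls, t in zip(x_b, log_scale, shift)]
    log_det = sum(log_scale)  # log-determinant of the Jacobian, used in the likelihood
    return x_a + y_b, log_det

def coupling_inverse(y):
    """Undo the forward pass exactly, using the untouched first half."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_scale, shift = scale_and_shift(y_a)
    x_b = [(b - t) * math.exp(-ls) for b, ls, t in zip(y_b, log_scale, shift)]
    return y_a + x_b

x = [0.3, -1.2, 0.7, 2.0]
y, log_det = coupling_forward(x)
x_back = coupling_inverse(y)
print("forward :", y)
print("inverse :", x_back)
```

Because both directions are a single pass, a whole waveform can be produced from noise in parallel rather than sample by sample.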

FloWaveNet

FloWaveNet reddit posting

Lots of activity right now...

Back to Google

Wrap-up

  • WaveNet started out as very good but very expensive
  • ... but that proved it was worth optimising
  • Lots of opportunity for innovation
GitHub - mdda

* Please add a star... *

Deep Learning MeetUp Group

8-week Deep Learning Developer Course

  • 25 September - 25 November (2017)
  • Twice-Weekly 3-hour sessions included :
    • Instruction
    • Individual Projects
    • Support by Singapore Government
  • Location : SGInnovate
  • Status : FINISHED

- QUESTIONS -


Martin @
RedDragon . AI


My blog : http://blog.mdda.net/

GitHub : mdda