1. Automatic Speech Recognition
   - Spoken words to machine-readable form
2. Natural language understanding
   - High level cognitive interpretation
     - Structure
     - Meaning
     - Intention

# Automatic Speech Recognition
## Applications
- Business/desktop apps
  - Dictation
  - Voice commands
- Voice enabled services/apps
  - Siri
  - Home automation
- Game & Entertainment
- Education
- Speech therapy/Rehab
- Hearing assistance
  - Live CC

## Challenges
- Speaker dependency
  - Accent
  - Emotion
- Vocab size
  - Slang
- Isolated words vs Continuous speech
  - Hard to segment continuous speech
- Language constraints & Knowledge sources
  - Training source is critical
- Acoustic ambiguity
  - Similar sounding speech
- Noise robustness
  - Background noise
  - Reverberation

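The isolated-word vs continuous-speech point above can be illustrated with a minimal sketch. It assumes a naive energy-threshold segmenter and synthetic signals (sine bursts standing in for words, low-level noise standing in for pauses); the function names, frame length, and threshold are all illustrative choices, not part of any real ASR toolkit.

```python
import math
import random


def frame_energies(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def find_segments(signal, frame_len=80, threshold=0.01):
    """Return (start_frame, end_frame) runs of frames whose energy exceeds threshold."""
    segments, start = [], None
    energies = frame_energies(signal, frame_len)
    for idx, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = idx                      # a loud run begins
        elif energy <= threshold and start is not None:
            segments.append((start, idx))    # the loud run ends
            start = None
    if start is not None:                    # signal ended while still loud
        segments.append((start, len(energies)))
    return segments


def tone(n, freq=440.0, rate=8000):
    """Synthetic stand-in for a spoken word (assumption: a plain sine burst)."""
    return [0.5 * math.sin(2 * math.pi * freq * t / rate) for t in range(n)]


def pause(n, rng):
    """Synthetic stand-in for a pause: very quiet Gaussian noise."""
    return [rng.gauss(0.0, 0.005) for _ in range(n)]


rng = random.Random(0)

# Three "words" with pauses between them (isolated-word style).
isolated = (pause(800, rng) + tone(1600) + pause(800, rng) + tone(1600)
            + pause(800, rng) + tone(1600) + pause(800, rng))

# The same three "words" run together with no pauses (continuous style).
continuous = pause(800, rng) + tone(4800) + pause(800, rng)

isolated_segments = find_segments(isolated)
continuous_segments = find_segments(continuous)
print(len(isolated_segments), len(continuous_segments))
```

With pauses, the threshold recovers one segment per word; without them, the three words collapse into a single segment, so word boundaries have to come from somewhere other than silence.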
# Speech Diarisation
- Who speaks when?
- Split stream into homogeneous segments according to speaker identity
  - Structure stream into speaker turns
  - Provide speaker identity
- Combination of
  - Speaker segmentation
    - Find speaker changes in stream
  - Speaker clustering
    - Group segments together on basis of speaker characteristics
- Gaussian mixture model (GMM)
- Hidden Markov model (HMM)
- Bottom-up
  - More popular
  - Start with a succession of clusters
  - Merge redundant clusters
  - Remaining clusters belong to speakers
- Top-down
  - Start with a single cluster
  - Iteratively split until each cluster corresponds to a speaker
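The bottom-up approach above can be sketched as plain agglomerative clustering. This is a minimal illustration under stated assumptions: each segment is represented by a made-up 2-D feature vector, clusters are compared by Euclidean distance between centroids, and merging stops at a hand-picked `stop_distance`. Practical systems typically model clusters with GMMs or HMMs and use a statistical merge criterion instead; `bottom_up_cluster` and its parameters are hypothetical names.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def bottom_up_cluster(features, stop_distance):
    """Start with one cluster per segment, then repeatedly merge the two
    closest clusters until no pair is closer than stop_distance."""
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([features[i] for i in clusters[a]])
                cb = centroid([features[i] for i in clusters[b]])
                d = math.dist(ca, cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > stop_distance:
            break  # remaining clusters are treated as distinct speakers
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])  # merge b into a
        del clusters[b]
    return clusters


# Hypothetical per-segment features (assumption: two speakers, five segments).
segment_features = [
    (1.0, 1.1),  # segment 0
    (5.0, 5.2),  # segment 1
    (0.9, 1.0),  # segment 2
    (5.1, 4.9),  # segment 3
    (1.2, 0.8),  # segment 4
]

clusters = bottom_up_cluster(segment_features, stop_distance=1.0)

# Map each segment index to a speaker label (the index of its cluster).
labels = {seg: speaker for speaker, members in enumerate(clusters) for seg in members}
print(clusters)
print(labels)
```

Here segments 0, 2, and 4 end up in one cluster and segments 1 and 3 in another, giving two speaker labels. A top-down variant would run the other way: begin with all segments in one cluster and split until each cluster looks like a single speaker.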