2023-12-22 16:39:03 +00:00

---
tags:
- ai
---
2023-06-06 17:01:49 +01:00

1. Automatic Speech Recognition
    - Spoken words to machine-readable form
2. Natural language understanding
    - High level cognitive interpretation
        - Structure
        - Meaning
        - Intention

# Automatic Speech Recognition
## Applications
- Business/desktop apps
    - Dictation
    - Voice commands
- Voice-enabled services/apps
    - Siri
    - Home automation
- Game & Entertainment
- Education
- Speech therapy/Rehab
- Hearing assistance
    - Live CC
## Challenges
- Speaker dependency
    - Accent
    - Emotion
- Vocabulary size
    - Slang
- Isolated words vs Continuous speech
    - Hard to segment continuous speech
- Language constraints & Knowledge sources
    - Training source is critical
- Acoustic ambiguity
    - Similar-sounding speech
- Noise robustness
    - Background noise
    - Reverberation

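
A minimal sketch of why continuous speech is hard to segment and why noise matters, assuming a naive energy-threshold detector. Everything here is hypothetical and for illustration only: the sine bursts stand in for spoken words, and `frame_len`, `threshold` and the noise level are arbitrary choices, not values from any real ASR system.

```python
import math
import random

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def count_segments(samples, frame_len=100, threshold=0.01):
    """Count runs of consecutive frames whose energy exceeds the threshold."""
    segments, in_speech = 0, False
    for energy in frame_energies(samples, frame_len):
        active = energy > threshold
        if active and not in_speech:
            segments += 1
        in_speech = active
    return segments

def tone(n, freq=0.05):
    """Stand-in for a spoken word: a unit-amplitude sine burst of n samples."""
    return [math.sin(2 * math.pi * freq * t) for t in range(n)]

def silence(n):
    return [0.0] * n

# Three "words" separated by pauses (isolated) versus run together (continuous).
isolated = (silence(200) + tone(400) + silence(300) + tone(400)
            + silence(300) + tone(400) + silence(200))
continuous = silence(200) + tone(1200) + silence(200)

# The same isolated words with uniform background noise added (fixed seed).
rng = random.Random(0)
noisy = [s + rng.uniform(-0.3, 0.3) for s in isolated]

print(count_segments(isolated))    # pauses give three separate segments
print(count_segments(continuous))  # no pauses, so the words fuse into one
print(count_segments(noisy))       # noise fills the pauses, so they fuse too
```

The detector only finds word boundaries when there is silence between words and the background is quiet, which is the situation isolated-word systems rely on.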
# Speech Diarisation
- Who speaks when?
- Split stream into segments that are homogeneous in speaker identity
    - Structure stream into speaker turns
    - Provide speaker identity
- Combination of
    - Speaker segmentation
        - Speaker changes in stream
    - Speaker clustering
        - Grouping segments together on basis of characteristics
- Gaussian mixture model
- Hidden Markov model (HMM)
- Bottom-up
    - More popular
    - Start by splitting the stream into a succession of clusters
    - Progressively merge redundant clusters
    - Remaining clusters belong to speakers
- Top-down
    - Start with a single cluster for the whole stream
    - Iteratively split until there is one cluster per speaker

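
The bottom-up approach above can be sketched as plain agglomerative clustering. This is an illustration under stated assumptions, not the GMM/HMM modelling named in the notes: each segment is represented by a hypothetical 2-D feature vector, clusters are compared by the Euclidean distance between their centroids, and `merge_threshold` is an arbitrary stopping value. The function names are made up for this example.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def bottom_up_diarise(segment_features, merge_threshold):
    """Start with one cluster per segment and keep merging the two closest
    clusters until no pair is closer than merge_threshold."""
    clusters = [[i] for i in range(len(segment_features))]
    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([segment_features[i] for i in clusters[a]])
                cb = centroid([segment_features[i] for i in clusters[b]])
                d = math.dist(ca, cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > merge_threshold:
            break  # remaining clusters are treated as distinct speakers
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    # Number speakers in order of first appearance in the stream.
    labels = [None] * len(segment_features)
    for speaker, members in enumerate(sorted(clusters, key=min)):
        for i in members:
            labels[i] = speaker
    return labels

def speaker_turns(times, labels):
    """Collapse adjacent same-speaker segments into (start, end, speaker) turns."""
    turns = []
    for (start, end), speaker in zip(times, labels):
        if turns and turns[-1][2] == speaker and turns[-1][1] == start:
            turns[-1] = (turns[-1][0], end, speaker)
        else:
            turns.append((start, end, speaker))
    return turns

# Hypothetical segment boundaries (seconds) and per-segment features.
times = [(0, 2), (2, 5), (5, 8), (8, 9), (9, 12)]
features = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.2), (5.1, 4.9), (0.1, 0.2)]

labels = bottom_up_diarise(features, merge_threshold=1.0)
print(labels)                        # [0, 0, 1, 1, 0]
print(speaker_turns(times, labels))  # [(0, 5, 0), (5, 9, 1), (9, 12, 0)]
```

A top-down variant would run the other way: start with every segment in one cluster and split it repeatedly until each cluster looks like a single speaker.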