stem/Speech/NLP/NLP.md
andy 5a94c5ff1a vault backup: 2023-06-06 17:01:49


# Text Normalisation
- Tokenisation
    - Labelling parts of a sentence
    - Tokens are usually words
    - A token can span multiple words
        - Proper nouns
            - New York
        - Emoticons
        - Hashtags
    - May need some named entity recognition
    - Penn Treebank tokenisation standard
    - Byte-pair encoding
        - Standard word-level tokenisers can't handle unseen words
        - Encode as subwords
            - -est, -er
- Lemmatisation
    - Determining roots of words
        - Verb infinitives
    - Find the lemma
    - Derived forms are inflections, or inflected
        - Word-forms
    - Critical for morphologically complex languages
        - Arabic
- Stemming
    - Simpler than lemmatisation
    - Just removing suffixes
- Normalising word formats
- Segmenting sentences
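
A minimal Python sketch of the steps in the notes above may help make them concrete. It is illustrative only, not the Penn Treebank standard or any library's API: the names `TOKEN_RE`, `segment_sentences`, `tokenise`, `stem` and `learn_bpe`, the suffix list, and the `</w>` end-of-word marker are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumed pattern: hashtags, a few emoticons, words, then any other symbol.
TOKEN_RE = re.compile(r"#\w+|[:;]-?[()DP]|\w+(?:'\w+)*|[^\w\s]")


def segment_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (naive: breaks on abbreviations)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenise(sentence):
    """Return word-level tokens, keeping hashtags and emoticons whole."""
    return TOKEN_RE.findall(sentence)


def stem(word, suffixes=("est", "ing", "er", "ed", "s")):
    """Crude stemming: strip the first matching suffix, keeping at least 3 characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def learn_bpe(word_counts, num_merges):
    """Learn byte-pair merges by repeatedly fusing the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): n for word, n in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, n in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += n
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, n in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = new_vocab.get(tuple(out), 0) + n
        vocab = new_vocab
    return merges


sentences = segment_sentences("New York is big. I love #NLP :)")
tokens = tokenise(sentences[1])
stems = [stem(w) for w in ("fastest", "faster", "walked")]
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
```

In this sketch `tokenise` still splits "New York" into two tokens, which is where named entity recognition would come in. The stemmer only strips suffixes, so it cannot map an irregular form such as "went" to its lemma "go"; that would need lemmatisation. On the toy counts, the learned merges build up the subword `est`, which lets an unseen word ending in "-est" be encoded from known pieces.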