# Text Normalisation

- Tokenisation
    - Labelling parts of a sentence
    - Usually words
    - Can be multiple words
        - Proper nouns
            - New York
    - Emoticons
    - Hashtags
    - May need some named entity recognition
    - Penn Treebank standard
    - Byte-pair encoding
        - Standard tokenisation can't handle unseen words
        - Encode as subwords
            - -est, -er
- Lemmatisation
    - Determining roots of words
        - Verb infinitives
    - Find lemma
        - Derived forms are inflections, or inflected word-forms
    - Critical for morphologically complex languages
        - Arabic
- Stemming
    - Simpler than lemmatisation
    - Just removing suffixes
- Normalising word formats
- Segmenting sentences
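
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the token pattern, the suffix list, the `_` end-of-word marker and the function names (`tokenise`, `segment_sentences`, `stem`, `learn_bpe`) are all assumptions made for the example. Lemmatisation is left out because it needs a dictionary or morphological analyser rather than simple rules.

```python
import re
from collections import Counter

# Assumed token pattern: hashtags, a few simple emoticons, words, then any other symbol.
TOKEN_RE = re.compile(r"#\w+|[:;=][-']?[()DPp]|\w+(?:'\w+)?|[^\w\s]")

def tokenise(text):
    """Split text into tokens, keeping hashtags and simple emoticons whole.
    Multi-word names such as "New York" stay split; joining them needs NER."""
    return TOKEN_RE.findall(text)

def segment_sentences(text):
    """Naive segmentation: split on whitespace that follows ., ! or ?.
    Abbreviations such as "Dr." will be split incorrectly."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Illustrative suffix list only; real stemmers use ordered rule sets.
SUFFIXES = ("est", "ing", "er", "ed", "s")

def stem(word):
    """Crude stemming: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def learn_bpe(words, num_merges):
    """Learn byte-pair-encoding merges: repeatedly join the most frequent
    adjacent symbol pair, so common endings become single subword units."""
    vocab = Counter(tuple(w) + ("_",) for w in words)  # "_" marks end of word
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite each word with the chosen pair merged into one symbol.
        merged = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return merges
```

With these definitions, `tokenise("Loving New York :) #travel")` gives `['Loving', 'New', 'York', ':)', '#travel']`, and both `stem("fastest")` and `stem("faster")` give `'fast'`. On the small corpus `["newest", "lowest", "widest", "yes"]`, the first merge learned by `learn_bpe` is `('e', 's')`, which is how a subword like `-est` starts to form.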