stem/Speech/NLP/NLP.md
Andy Pack f29c435494 vault backup: 2023-12-27 22:38:55
---
tags:
- ai
- speech
---
# Text Normalisation
- Tokenisation
    - Splitting a sentence into labelled parts (tokens)
    - Tokens are usually words
        - Can be multiple words
            - Proper nouns
                - New York
        - Emoticons
        - Hashtags
    - May need some named entity recognition
    - Penn Treebank tokenisation standard
    - Byte-pair encoding
        - Standard word-level tokenisers can't handle unseen words
        - Encode as subwords instead
            - e.g. -est, -er
- Lemmatisation
    - Determining the roots of words
        - e.g. verb infinitives
    - Find the lemma
        - Derived forms are inflections, or inflected word-forms
    - Critical for morphologically complex languages
        - e.g. Arabic
- Stemming
    - Simpler than lemmatisation
    - Just strips suffixes
- Normalising word formats
- Segmenting sentences
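
A minimal sketch of tokenisation and sentence segmentation using only Python's `re` module. The pattern, the `MULTIWORD` gazetteer and all function names are assumptions made for illustration; this is not the Penn Treebank standard, and the gazetteer lookup only stands in for real named entity recognition.

```python
import re

# Illustrative pattern: alternatives are ordered so special tokens win over words.
TOKEN_RE = re.compile(r"""
    [#@]\w+               # hashtags and mentions
  | [:;=][-']?[)(DPp]     # a few simple emoticons
  | \w+(?:'\w+)?          # words, optionally with an internal apostrophe
  | [^\w\s]               # any other single punctuation mark
""", re.VERBOSE)

# Hypothetical gazetteer standing in for named entity recognition.
MULTIWORD = {("New", "York")}


def tokenise(text):
    """Split text into word, emoticon, hashtag and punctuation tokens."""
    return TOKEN_RE.findall(text)


def merge_multiword(tokens):
    """Join adjacent tokens that form a known multi-word proper noun."""
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in MULTIWORD:
            out.append(" ".join(tokens[i:i + 2]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def split_sentences(text):
    """Naive segmentation: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


tokens = merge_multiword(tokenise("Loving New York :) #nlproc"))
print(tokens)
print(split_sentences("It works. Does it? Yes!"))
```

The sentence splitter will wrongly break on abbreviations such as "Dr. Smith", which is one reason practical segmenters use trained models or abbreviation lists.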
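
The byte-pair encoding idea can be sketched as follows: repeatedly merge the most frequent adjacent symbol pair in a training vocabulary, then replay those merges on a new word. The toy corpus counts, the `_` end-of-word marker and the function names are assumptions for illustration, not any particular library's API.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if tuple(symbols[i:i + 2]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_counts, num_merges):
    """Learn an ordered list of merges from a {word: count} dictionary."""
    # Each word starts as single characters plus an end-of-word marker.
    vocab = {tuple(w) + ("_",): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            key = merge_pair(symbols, best)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Encode a (possibly unseen) word by replaying the learned merges."""
    symbols = tuple(word) + ("_",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (assumed counts); "lower" never appears in it.
corpus = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
merges = learn_bpe(corpus, 3)
print(merges)
print(segment("lower", merges))
```

Even though "lower" is unseen, it is still encoded, with the `-er` suffix surfacing as its own subword once enough merges have been learned.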
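
To contrast stemming with lemmatisation, here is a rough sketch: the stemmer strips a suffix from a fixed list, while the lemmatiser looks the word up in a dictionary of known forms. The suffix list and the tiny `LEMMAS` table are made up for illustration; this is not the Porter stemmer, and a real lemmatiser needs a full lexicon or morphological analyser.

```python
# Illustrative suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ing", "est", "er", "ed", "es", "s")

# Hypothetical lookup table mapping inflected forms to their lemma.
LEMMAS = {"sang": "sing", "sung": "sing", "is": "be", "was": "be"}


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def lemmatise(word):
    """Return the dictionary lemma if known, otherwise the lowercased word."""
    return LEMMAS.get(word.lower(), word.lower())


for w in ["walking", "fastest", "running", "sang"]:
    print(w, crude_stem(w), lemmatise(w))
```

The stemmer is cheap but blind to irregular forms ("sang" is left unchanged) and can produce non-words ("running" becomes "runn"), whereas the lookup recovers "sing" at the cost of needing a lexicon.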