Andy Pack
f29c435494
Text Normalisation
- Tokenisation
    - Splitting a sentence into parts (tokens)
    - Usually words
    - Can be multiple words
        - Proper nouns
            - New York
    - Emoticons
    - Hashtags
    - May need some named entity recognition
    - Penn Treebank standard
    - Byte-pair encoding
        - Standard word-level tokenisation can't handle unseen words
        - Encode as subwords instead
            - e.g. -est, -er
- Lemmatisation
    - Determining roots of words
        - e.g. verb infinitives
    - Find the lemma
        - Derived forms are inflections, or inflected word-forms
    - Critical for morphologically complex languages
        - e.g. Arabic
- Stemming
    - Simpler than lemmatisation
    - Just removing suffixes
- Normalising word formats
- Segmenting sentences
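
The segmentation, tokenisation and stemming steps above can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the `SUFFIXES` list, the `MULTIWORD` dictionary and the emoticon pattern are all hypothetical stand-ins, and none of it follows the Penn Treebank rules or any real stemmer.

```python
import re

# Hypothetical suffix list for illustration; real stemmers use staged rules.
SUFFIXES = ("est", "ing", "er", "ed", "s")

# Multi-word expressions kept as one token; a crude stand-in for
# named entity recognition.
MULTIWORD = ("New York",)

TOKEN_RE = re.compile(
    r"#\w+"            # hashtags
    r"|[:;]-?[()DP]"   # a few simple emoticons such as :) ;-) :D
    r"|\w+(?:'\w+)?"   # words, optionally with an internal apostrophe
    r"|[^\w\s]"        # any other single punctuation mark
)


def segment_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (naive: breaks on abbreviations)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenise(sentence):
    """Return word, punctuation, emoticon and hashtag tokens."""
    # Protect multi-word expressions by joining them with an underscore first.
    for mwe in MULTIWORD:
        sentence = sentence.replace(mwe, mwe.replace(" ", "_"))
    return TOKEN_RE.findall(sentence)


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    lower = token.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


if __name__ == "__main__":
    text = "I love New York. It has the tallest buildings :) #nlp"
    for sentence in segment_sentences(text):
        tokens = tokenise(sentence)
        print(tokens, [stem(t) for t in tokens])
```

Blind suffix stripping over-stems some words (`never` becomes `nev`), which is why lemmatisation matters more for morphologically complex languages.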
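
The byte-pair encoding idea in the tokenisation notes (learn subwords so that unseen words can still be encoded) can be sketched as below. This is a minimal sketch of the merge-learning loop only; the function name `learn_bpe`, the `_` end-of-word marker and the toy corpus are assumptions for illustration, not taken from the notes.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges symbol-pair merges from a list of words."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("_",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair, first seen wins ties
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    corpus = ["low"] * 5 + ["lowest"] * 2 + ["newer"] * 6 + ["wider"] * 3 + ["new"] * 2
    print(learn_bpe(corpus, 2))
```

On this toy corpus the first merges build the suffix `er` and then `er_`, so a word that never appeared whole could still be encoded from a known stem-like piece plus a learned suffix.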