stem/Speech/NLP/NLP.md
Andy Pack f29c435494 vault backup: 2023-12-27 22:38:55
2023-12-27 22:38:56 +00:00


tags
ai
speech

Text Normalisation

  • Tokenisation
    • Splitting a sentence into labelled parts (tokens)
    • Usually words
    • Can span multiple words or need keeping whole
      • Proper nouns, e.g. New York
      • Emoticons
      • Hashtags
    • May need some named entity recognition
    • Penn Treebank tokenisation standard
    • Byte-pair encoding
      • Standard word-level tokenisation cannot handle unseen words
      • Encode words as subword units instead
        • e.g. -est, -er
  • Lemmatisation
    • Determining the roots of words
    • e.g. reducing verbs to their infinitives
    • Find the lemma
      • Derived forms are inflections, or inflected word-forms
    • Critical for morphologically complex languages
      • e.g. Arabic
  • Stemming
    • Simpler than lemmatisation
    • Just removing suffixes
  • Normalising word formats
  • Segmenting sentences
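
The normalisation steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the Penn Treebank tokeniser or a production BPE implementation: the regex, the suffix list, the `LEMMAS` table, and all function names are hypothetical choices made for the example.

```python
import re
from collections import Counter

# Toy tokeniser: hashtags and a few emoticons are kept as single tokens.
# The pattern is an assumption for illustration, not the Penn Treebank rules.
TOKEN_RE = re.compile(r"#\w+|[:;]-?[()DP]|\w+(?:'\w+)?|[^\w\s]")

def tokenise(text):
    return TOKEN_RE.findall(text)

def merge_multiword(tokens, phrases):
    """Join adjacent tokens forming a known two-word phrase (a stand-in for NER)."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            out.append(" ".join(pair))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Learn byte-pair-encoding merges: repeatedly fuse the most frequent symbol pair."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)  # "</w>" marks the word end
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if tuple(symbols[i:i + 2]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def stem(word, suffixes=("est", "ing", "er", "ed", "s")):
    """Crude stemming: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table; real lemmatisers use a full lexicon and morphology.
LEMMAS = {"was": "be", "were": "be", "went": "go", "better": "good"}

def lemmatise(word):
    return LEMMAS.get(word.lower(), word)

def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

if __name__ == "__main__":
    tokens = tokenise("Loving #NLP :) in New York!")
    print(merge_multiword(tokens, {("New", "York")}))
    print(learn_bpe(["lowest", "newest", "widest", "yes"], 3))
    print(stem("lowest"), lemmatise("went"))
    print(split_sentences("It works. Does it? Yes!"))
```

The contrast between `stem` and `lemmatise` mirrors the notes: stemming only strips suffixes, so it cannot map "went" to "go", while lemmatisation needs lexical knowledge. The sentence splitter would wrongly break on abbreviations such as "Dr.", which is one reason segmentation is listed as its own step.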