stem/Speech/NLP/NLP.md
andy 5a94c5ff1a vault backup: 2023-06-06 17:01:49
2023-06-06 17:01:49 +01:00


Text Normalisation

  • Tokenisation
    • Splitting a sentence into labelled parts (tokens)
    • Usually words
    • Can be multi-word or non-word
      • Proper nouns, e.g. New York
      • Emoticons
      • Hashtags
    • May need some named entity recognition
    • Penn Treebank standard
    • Byte-pair encoding
      • Standard tokenisers can't handle unseen words
      • Encode as subwords
        • -est, -er
  • Lemmatisation
    • Determining roots of words
    • Verb infinitives
    • Find lemma
      • Derived forms are inflections, or inflected word-forms
    • Critical for morphologically complex languages
      • Arabic
  • Stemming
    • Simpler than lemmatisation
    • Just removing suffixes
  • Normalising word formats
  • Segmenting sentences
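
The steps above can be sketched in a few lines of Python. This is a rough illustration, not a reference implementation: the token regex, the suffix list and every function name (`split_sentences`, `tokenise`, `stem`, `learn_bpe`) are assumptions made for the example. Real systems use far more careful rules, such as the Penn Treebank conventions or a trained subword model.

```python
import re
from collections import Counter

# Assumed token pattern: hashtags, a few simple emoticons, words
# (with an optional apostrophe part), then any other single symbol.
TOKEN_RE = re.compile(r"#\w+|[:;]-?[()DP]|\w+(?:'\w+)?|[^\w\s]")


def split_sentences(text):
    """Naive sentence segmentation: break after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenise(sentence):
    """Split a sentence into word, hashtag, emoticon and punctuation tokens."""
    return TOKEN_RE.findall(sentence)


def stem(word, suffixes=("est", "ing", "er", "ed", "s")):
    """Crude stemmer: strip the first matching suffix, keeping >= 3 letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def learn_bpe(words, num_merges):
    """Learn byte-pair merges, starting from single characters."""
    # Each word is a tuple of symbols with an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    for sentence in split_sentences("I love #NLP :) in New York. It is the tallest!"):
        tokens = tokenise(sentence)
        print(tokens, [stem(t.lower()) for t in tokens])
    print(learn_bpe(["low", "low", "lower", "newest", "widest"], 5))
```

The regex tokeniser still splits "New York" into two tokens, which is where named entity recognition would come in. The stemmer only strips suffixes, so unlike a lemmatiser it cannot map an irregular form such as "went" back to "go".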