---
tags:
    - ai
---

# Text Normalisation

- Tokenisation
    - Labelling the parts of a sentence
    - Usually words
    - Can be multiple words
        - Proper nouns
            - New York
        - Emoticons
        - Hashtags
    - May need some named entity recognition
    - Penn Treebank standard
    - Byte-pair encoding (see the first sketch after this list)
        - A fixed word-level standard can't handle unseen words
        - Encode as subwords
            - -est, -er
- Lemmatisation
    - Determining the roots of words
        - Verb infinitives
    - Find the lemma
        - Derived forms are inflections, or inflected word-forms
    - Critical for morphologically complex languages
        - Arabic
- Stemming (see the second sketch after this list)
    - Simpler than lemmatisation
    - Just removes suffixes
- Normalising word formats
- Segmenting sentences
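
A minimal sketch of how byte-pair encoding learns subword merges: start from characters, repeatedly merge the most frequent adjacent pair. The toy corpus, the `</w>` end-of-word marker, and the merge count are illustrative assumptions, not from these notes.

```python
from collections import Counter

def learn_bpe(corpus: str, num_merges: int = 10) -> list[tuple[str, str]]:
    # Represent each word as a tuple of symbols, ending with an end-of-word mark.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus.lower().split())
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with its merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe("low lower lowest newer newest", num_merges=5))
```

On this toy corpus the learned merges start to capture shared subwords such as "low" and the "-er"/"-est" suffixes, which is how unseen words can still be encoded.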
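
A small sketch contrasting stemming (crude suffix stripping) with lemmatisation (dictionary lookup of the root form). It assumes `nltk` is installed and its WordNet data has been downloaded; the example words and the verb part-of-speech tag are illustrative choices.

```python
# Requires: nltk.download("wordnet") and nltk.download("omw-1.4") beforehand.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "studies", "better", "mice"]:
    print(
        word,
        "| stem:", stemmer.stem(word),                     # suffix removal only
        "| lemma:", lemmatizer.lemmatize(word, pos="v"),   # dictionary root, treated as a verb
    )
```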