---
tags:
  - ai
---

# Text Normalisation

- Tokenisation
    - Labelling parts of a sentence
    - Tokens are usually words
    - A token can span multiple words
        - Proper nouns
            - New York
        - Emoticons
        - Hashtags
    - May need some named entity recognition
    - Penn Treebank standard (see the tokeniser sketch below)
    - Byte-pair encoding
        - Standard word-level tokenisation can't handle unseen words
        - Encode as subwords instead
            - -est, -er
        - (minimal BPE sketch below)
- Lemmatisation
    - Determining the roots of words
        - Verb infinitives
    - Find the lemma
    - Derived forms are inflections, or inflected word-forms
    - Critical for morphologically complex languages
        - Arabic
    - Stemming
        - Simpler than lemmatisation
        - Just removes suffixes
        - (contrasted with lemmatisation in the sketch below)
- Normalising word formats
- Segmenting sentences (see the sketch below)
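
A minimal tokenisation sketch; the library choice (NLTK) is an assumption, since the notes don't name one. `TreebankWordTokenizer` follows the Penn Treebank conventions, while `TweetTokenizer` keeps emoticons and hashtags intact:

```python
# Sketch: word tokenisation with NLTK (assumes `pip install nltk`).
from nltk.tokenize import TreebankWordTokenizer, TweetTokenizer

text = "Loving New York :) #nlp isn't it great?"

print(TreebankWordTokenizer().tokenize(text))
# Treebank style splits contractions ("isn't" -> 'is', "n't") and
# typically breaks ':)' and '#nlp' into separate symbols.

print(TweetTokenizer().tokenize(text))
# Here ':)' and '#nlp' should survive as single tokens.

# Neither tokeniser merges 'New' and 'York'; treating multi-word
# proper nouns as one unit needs named entity recognition on top.
```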
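
A minimal byte-pair encoding sketch, following the merge loop from Sennrich et al. (2016): repeatedly merge the most frequent adjacent symbol pair. The toy vocabulary and merge count here are illustrative assumptions:

```python
# Sketch: learning BPE merges over a toy corpus.
import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its merged symbol.
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq
            for word, freq in vocab.items()}

# Words are stored as space-separated symbols with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for _ in range(10):
    best = max(get_pair_counts(vocab), key=get_pair_counts(vocab).get)
    vocab = merge_pair(best, vocab)
    print(best)  # merges learn subwords such as 'es', then 'est'
```

The learned merges yield subword units like -est, so an unseen word such as "lowest" can still be encoded as `low` + `est` instead of falling out of the vocabulary.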
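
A sketch contrasting stemming with lemmatisation, again assuming NLTK; the lemmatiser needs the WordNet data (`nltk.download('wordnet')`):

```python
# Sketch: stemming vs lemmatisation with NLTK.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['running', 'studies', 'was']:
    print(word,
          stemmer.stem(word),                   # crude suffix stripping
          lemmatizer.lemmatize(word, pos='v'))  # look up the verb's lemma
# Stemming just strips suffixes ('was' -> 'wa'), while lemmatisation
# maps to the dictionary form ('was' -> 'be', the verb infinitive).
```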
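
A sentence segmentation sketch using NLTK's pre-trained punkt model (assumes the model data is installed, e.g. via `nltk.download('punkt')`, though the exact data package name varies by NLTK version):

```python
# Sketch: sentence segmentation with NLTK's punkt model.
from nltk.tokenize import sent_tokenize

text = "Dr. Smith moved to New York. It wasn't cheap! Was it worth it?"
print(sent_tokenize(text))
# punkt learns abbreviations such as 'Dr.', so this should give
# three sentences rather than splitting after 'Dr.'.
```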