# Text Normalisation

- Tokenisation
  - Labelling the parts of a sentence
    - Usually single words
    - Can be multiple words
      - Proper nouns
        - New York
      - Emoticons
      - Hashtags
  - May need some named entity recognition
  - Penn Treebank standard (see the sketch below)
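
A minimal tokenisation sketch using NLTK's Penn Treebank-style tokenizer (assuming NLTK is installed; the example sentence is made up):

```python
from nltk.tokenize import TreebankWordTokenizer

# Penn Treebank conventions: punctuation and clitics become separate tokens,
# but multi-word proper nouns like "New York" remain two word tokens,
# which is why named entity recognition may still be needed afterwards.
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("They don't live in New York."))
# ['They', 'do', "n't", 'live', 'in', 'New', 'York', '.']
```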
- Byte-pair encoding
  - The standard word-level approach can't handle unseen words
  - Encode words as subwords instead (see the merge sketch below)
    - e.g. -est, -er
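
A minimal sketch of how byte-pair encoding merges are learned, in plain Python (the toy vocabulary and number of merges are made up):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across a vocab of space-separated symbols."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: word frequencies, with single characters as the initial symbols
# and </w> marking the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(10):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(step, best)                 # early merges already build subwords like "es", "est"
```

With the merges learned above, a previously unseen word such as "lowest" can still be encoded from known subwords ("low" + "est") instead of becoming an out-of-vocabulary token.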
- Lemmatisation (see the sketch below)
  - Determining the roots of words
    - e.g. verb infinitives
  - Find the lemma (the base form)
    - Derived forms are inflections; the word is inflected
      - i.e. the word-forms
  - Critical for morphologically complex languages
    - e.g. Arabic
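
A minimal lemmatisation sketch using NLTK's WordNet lemmatizer (assumes the WordNet data has been downloaded via `nltk.download("wordnet")`; the example words are illustrative):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The part-of-speech tag matters: "v" asks for the verb's base (infinitive-like) form.
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("are", pos="v"))      # be
print(lemmatizer.lemmatize("mice", pos="n"))     # mouse
```

Unlike stemming, the output is always a real dictionary form (the lemma), which is what makes lemmatisation useful when one lemma has many inflected word-forms.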
- Stemming
  - Simpler than lemmatisation
  - Just removing suffixes (see the sketch below)
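
A minimal stemming sketch using the Porter stemmer from NLTK as one common suffix-stripping stemmer (the example words are illustrative):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Stemming strips suffixes by rule, with no dictionary lookup,
# so the result is not always a real word.
for word in ["caresses", "ponies", "running", "organization"]:
    print(word, "->", stemmer.stem(word))
# caresses -> caress, ponies -> poni, running -> run, organization -> organ
```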
- Normalising word formats (e.g. case folding)
- Segmenting sentences (see the sketch below)
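
A minimal sentence segmentation sketch using NLTK's pretrained Punkt model (the text is made up; newer NLTK releases may require the "punkt_tab" resource instead of "punkt"):

```python
import nltk

nltk.download("punkt", quiet=True)  # sentence-boundary model used by sent_tokenize
text = "Mr. Smith moved to New York. He works on text normalisation."
# Splitting on "." alone would break after "Mr.", so a trained model is used instead.
for sentence in nltk.sent_tokenize(text):
    print(sentence)
# Mr. Smith moved to New York.
# He works on text normalisation.
```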