# Text Normalisation

- Tokenisation
  - Labelling the parts of a sentence
    - Usually single words
    - Can be multiple words
      - Proper nouns
        - New York
      - Emoticons
      - Hashtags
  - May need some named entity recognition
  - Penn Treebank standard (see the sketch below)
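
A minimal tokenisation sketch using NLTK's Penn Treebank-style tokenizer (assuming NLTK is installed; the example sentence is made up):

```python
from nltk.tokenize import TreebankWordTokenizer

# Penn Treebank conventions: punctuation and clitics become separate tokens,
# but multi-word proper nouns like "New York" remain two word tokens,
# which is why named entity recognition may still be needed afterwards.
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("They don't live in New York."))
# ['They', 'do', "n't", 'live', 'in', 'New', 'York', '.']
```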
- Byte-pair encoding
  - The standard word-level approach can't handle unseen words
  - Encode words as subwords instead (see the merge sketch below)
    - e.g. -est, -er
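
A minimal sketch of how byte-pair encoding merges are learned, in plain Python (the toy vocabulary and number of merges are made up):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across a vocab of space-separated symbols."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: word frequencies, with single characters as the initial symbols
# and </w> marking the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(10):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(step, best)                 # early merges already build subwords like "es", "est"
```

With the merges learned above, a previously unseen word such as "lowest" can still be encoded from known subwords ("low" + "est") instead of becoming an out-of-vocabulary token.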
- Lemmatisation (see the sketch below)
  - Determining the roots of words
    - e.g. verb infinitives
  - Find the lemma (the base form)
    - Derived forms are inflections; the word is inflected
      - i.e. the word-forms
  - Critical for morphologically complex languages
    - e.g. Arabic
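
A minimal lemmatisation sketch using NLTK's WordNet lemmatizer (assumes the WordNet data has been downloaded via `nltk.download("wordnet")`; the example words are illustrative):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The part-of-speech tag matters: "v" asks for the verb's base (infinitive-like) form.
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("are", pos="v"))      # be
print(lemmatizer.lemmatize("mice", pos="n"))     # mouse
```

Unlike stemming, the output is always a real dictionary form (the lemma), which is what makes lemmatisation useful when one lemma has many inflected word-forms.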
- Stemming
  - Simpler than lemmatisation
  - Just removing suffixes (see the sketch below)
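
A minimal stemming sketch using the Porter stemmer from NLTK as one common suffix-stripping stemmer (the example words are illustrative):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Stemming strips suffixes by rule, with no dictionary lookup,
# so the result is not always a real word.
for word in ["caresses", "ponies", "running", "organization"]:
    print(word, "->", stemmer.stem(word))
# caresses -> caress, ponies -> poni, running -> run, organization -> organ
```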
- Normalising word formats (e.g. case folding)
- Segmenting sentences (see the sketch below)
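
A minimal sentence segmentation sketch using NLTK's pretrained Punkt model (the text is made up; newer NLTK releases may require the "punkt_tab" resource instead of "punkt"):

```python
import nltk

nltk.download("punkt", quiet=True)  # sentence-boundary model used by sent_tokenize
text = "Mr. Smith moved to New York. He works on text normalisation."
# Splitting on "." alone would break after "Mr.", so a trained model is used instead.
for sentence in nltk.sent_tokenize(text):
    print(sentence)
# Mr. Smith moved to New York.
# He works on text normalisation.
```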