vault backup: 2023-06-06 17:01:49
Affected files: STEM/AI/Kalman Filter.md STEM/Signal Proc/Convolution.md STEM/Signal Proc/Image/Tracking.md STEM/Signal Proc/Pole-Zero.md STEM/Signal Proc/Transfer Function.md STEM/Speech/Linguistics/Consonants.md STEM/Speech/Linguistics/Linguistics.md STEM/Speech/Linguistics/README.md STEM/Speech/Linguistics/Terms.md STEM/Speech/Linguistics/Vowels.md STEM/Speech/NLP/Jargon.md STEM/Speech/NLP/NLP.md STEM/Speech/NLP/README.md STEM/Speech/NLP/Recognition.md STEM/Speech/Perception/Perception.md STEM/Speech/Perception/README.md STEM/Speech/Speech Processing/Applications.md STEM/Speech/Speech Processing/README.md STEM/Speech/Speech Processing/Source-Filter.md STEM/Speech/Speech Processing/Vocal Tract.md STEM/img/english-phoneme-table.png STEM/img/formant.png STEM/img/pole-zero-attenuation.png STEM/img/pole-zero-feedback.png STEM/img/pole-zero-stable.png STEM/img/roc-right-left.png STEM/img/roc-two-sided.png STEM/img/spectrum-vocal-tract.png STEM/img/transfer-stable-unstable.png STEM/img/vowel-chart.png STEM/img/vowel-spaces.png
9
AI/Kalman Filter.md
Normal file
@ -0,0 +1,9 @@
|
||||
- Measure
|
||||
- Predict
|
||||
- Update
|
||||
|
||||
- Positions and confidences modelled as gaussians
|
||||
- Mean
|
||||
- Position
|
||||
- Variance
|
||||
- Confidence
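The measure/predict/update loop above can be sketched in one dimension, with each Gaussian stored as a (mean, variance) pair. This is a minimal sketch, not a full Kalman filter. It assumes a known motion per step, Gaussian motion and measurement noise, and scalar state, and the function names are hypothetical.

```python
def predict(mean, var, motion, motion_var):
    """Predict step: shift the Gaussian by the expected motion and widen it."""
    return mean + motion, var + motion_var


def update(mean, var, measurement, meas_var):
    """Update step: fuse the prediction with a noisy measurement (both Gaussian)."""
    gain = var / (var + meas_var)              # Kalman gain: how much to trust the measurement
    new_mean = mean + gain * (measurement - mean)
    new_var = (1.0 - gain) * var               # confidence improves after every update
    return new_mean, new_var


# Track a position assumed to move ~1 unit per step, from noisy measurements
mean, var = 0.0, 1000.0                        # start very unsure of the position
for z in [1.2, 1.9, 3.1, 4.0, 4.8]:
    mean, var = predict(mean, var, motion=1.0, motion_var=0.5)
    mean, var = update(mean, var, z, meas_var=0.25)
```

The variance after each update is always smaller than both the predicted variance and the measurement variance. This is the "confidence" gained by combining the two Gaussians.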
|
@ -14,13 +14,22 @@ $$x(t)=x_1(t)\circledast x_2(t)=\int_{-\infty}^\infty x_1(t-\tau)\cdot x_2(\tau)
|
||||
4. $Ax_1(t)\circledast Bx_2(t)=AB[x_1(t)\circledast x_2(t)]$
|
||||
- Associativity with Scalar
|
||||
5. Symmetrical graph about origin
|
||||
6. $y(t)=x_1(t-a)\circledast x_2(t-b)$
|
||||
- $x(t)=x_1(t)\circledast x_2(t)$
|
||||
- $y(t)=x(t-a-b)$
|
||||
7. $x(t)=x_1(t)\circledast x_2(t)$
|
||||
- $x_1$ between $a_1$ and $b_1$
|
||||
- $x_2$ between $a_2$ and $b_2$
|
||||
- Starting point of $x(t)$ is $a_1+a_2$
|
||||
- Ending point of $x(t)$ is $b_1+b_2$
|
||||
8. $\overline{x \circledast y}=\bar x \circledast \bar y$
|
||||
9. $(x \circledast y)'=x'\circledast y=x\circledast y'$
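Several of the properties above can be checked numerically on discrete sequences. This sketch uses NumPy's `np.convolve` as a discrete-time stand-in for the continuous integral. As a result, the width property becomes $N_1+N_2-1$ samples instead of running from $a_1+a_2$ to $b_1+b_2$. The example signals are arbitrary.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([0.5, -1.0, 2.0, 1.0])
y = np.convolve(x1, x2)

# Width (7): supports of N1 and N2 samples give N1 + N2 - 1 output samples
assert len(y) == len(x1) + len(x2) - 1

# Commutativity and the scalar property (4): A x1 * B x2 = AB (x1 * x2)
assert np.allclose(np.convolve(x2, x1), y)
assert np.allclose(np.convolve(3 * x1, 2 * x2), 6 * y)

# Shift (6): delaying x1 by a and x2 by b delays the result by a + b
a, b = 2, 1
y_shifted = np.convolve(np.concatenate([np.zeros(a), x1]), np.concatenate([np.zeros(b), x2]))
assert np.allclose(y_shifted, np.concatenate([np.zeros(a + b), y]))

# Conjugate (8): conj(x * y) = conj(x) * conj(y)
z1 = x1 + 1j * np.array([0.0, -1.0, 2.0])
z2 = x2 + 1j * np.array([1.0, 0.0, 0.5, -2.0])
assert np.allclose(np.conj(np.convolve(z1, z2)), np.convolve(np.conj(z1), np.conj(z2)))
```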
|
||||
|
||||
# Applications
|
||||
|
||||
1. Communications systems
|
||||
- Shift signal in frequency domain (modulation: multiplying by a carrier in time convolves the spectra in frequency)
|
||||
2. System analysis
|
||||
- Find system output given input and [transfer function](Transfer%20Function.md)
|
||||
|
||||
# Polynomial Multiplication
|
||||
- Convolving the coefficient sequences of two polynomials gives the coefficients of their product
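A small example of this, using NumPy's `np.convolve` on coefficient lists written from the highest power down. The polynomials are arbitrary.

```python
import numpy as np

# (x + 1)(x + 2), coefficients listed from highest power down
p = [1, 1]
q = [1, 2]
product = np.convolve(p, q)   # coefficients of x^2 + 3x + 2

# Cross-check against NumPy's polynomial multiplication
assert np.array_equal(product, np.polymul(p, q))
```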
|
||||
|
42
Signal Proc/Image/Tracking.md
Normal file
@ -0,0 +1,42 @@
|
||||
# Challenges
|
||||
- Clutter
|
||||
- Distractors
|
||||
- Occlusion
|
||||
- Hidden by other objects
|
||||
|
||||
- Object's appearance can evolve
|
||||
- Rotation, scale, camera viewpoint
|
||||
- Take new template every n frames
|
||||
- Take new template when confidence falls below threshold
|
||||
|
||||
# Background Tracking
|
||||
- Static camera
|
||||
- Capture clean shots of background
|
||||
- Object present
|
||||
- Average enough footage
|
||||
- Difference image = |background image - current video frame|
|
||||
- Threshold for binary mask
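The background steps above can be sketched for greyscale frames from a static camera. This is a minimal sketch. It assumes a plain mean over frames as the background model and a hand-picked threshold, and the function names are hypothetical.

```python
import numpy as np

def build_background(frames):
    """Average many greyscale frames from a static camera into a background image."""
    return np.mean(np.stack(frames).astype(float), axis=0)

def foreground_mask(background, frame, threshold):
    """Absolute difference against the background, thresholded into a binary mask."""
    difference = np.abs(background - frame.astype(float))  # float avoids uint8 wrap-around
    return difference > threshold

frames = [np.full((10, 10), 100, dtype=np.uint8) for _ in range(20)]
background = build_background(frames)
frame = frames[0].copy()
frame[2:5, 3:7] = 200                     # an object enters the scene
mask = foreground_mask(background, frame, threshold=30)
```

Averaging enough footage makes a moving object's contribution to any one background pixel small. A per-pixel median is a common, more robust alternative.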
|
||||
|
||||
# Nearest Neighbour Tracking
|
||||
- Choose the component whose centroid is closest to the previous centroid
|
||||
- Not good for occlusion
|
||||
- Will snap to next candidate
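The nearest-neighbour step can be sketched as a search over candidate centroids. It assumes the connected components and their centroids have already been found, and `nearest_component` is a hypothetical name. Because it always returns some candidate, it shows why the tracker snaps to another object during occlusion.

```python
import numpy as np

def nearest_component(previous_centroid, candidate_centroids):
    """Index of the candidate centroid closest (Euclidean) to the previous centroid."""
    candidates = np.asarray(candidate_centroids, dtype=float)
    previous = np.asarray(previous_centroid, dtype=float)
    distances = np.linalg.norm(candidates - previous, axis=1)
    return int(np.argmin(distances))

best = nearest_component((10, 10), [(50, 50), (12, 9), (0, 30)])
```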
|
||||
|
||||
# Blob Tracking
|
||||
- Build colour model of object
|
||||
- Eigenmodel
|
||||
- Mask of pixels that match object
|
||||
- Use centroid as location over time
|
||||
- Pick connected component with centroid closest to previous location
|
||||
- Good for distinctive colours
|
||||
- Not robust enough for most practical situations, though
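A possible version of the colour eigenmodel is a mean and covariance of the object's RGB samples. Pixels are masked by Mahalanobis distance, and the centroid of the mask gives the location. This sketch makes several assumptions. The distance threshold is hand-picked. The colour samples are synthetic. For brevity, it takes the centroid of the whole mask rather than first splitting it into connected components. All function names are hypothetical.

```python
import numpy as np

def fit_colour_model(samples):
    """Eigenmodel of object colour: mean and inverse covariance of RGB samples (N x 3)."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    inv_cov = np.linalg.inv(cov + 1e-6 * np.eye(samples.shape[1]))  # small ridge for safety
    return mean, inv_cov

def colour_mask(image, mean, inv_cov, max_distance):
    """Pixels whose Mahalanobis distance to the colour model is at most max_distance."""
    pixels = image.reshape(-1, image.shape[-1]).astype(float) - mean
    d2 = np.einsum("ij,jk,ik->i", pixels, inv_cov, pixels)  # squared Mahalanobis distance
    return (d2 <= max_distance ** 2).reshape(image.shape[:-1])

def mask_centroid(mask):
    rows, cols = np.nonzero(mask)
    return rows.mean(), cols.mean()

rng = np.random.default_rng(1)
samples = rng.normal([200, 10, 10], 5.0, size=(500, 3))   # red object colour samples
image = np.zeros((20, 20, 3))
image[..., 2] = 200                                        # blue background
image[5:9, 10:14] = [200, 10, 10]                          # red object
mean, inv_cov = fit_colour_model(samples)
mask = colour_mask(image, mean, inv_cov, max_distance=3.0)
centroid = mask_centroid(mask)
```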
|
||||
|
||||
# Template Tracking
|
||||
- Sample distinctive patch from image
|
||||
- Search all positions in each frame for the patch
|
||||
- Use cross-correlation
|
||||
- Illumination changes
|
||||
- A brightness change is a uniform shift of greyscale values up or down
|
||||
- It shows up as a change in the mean pixel value
|
||||
- Subtract means in template and frame to give invariance
|
||||
- Normalised cross-correlation
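Zero-mean normalised cross-correlation can be written as an exhaustive search. Subtracting the template mean and each patch's mean makes the score invariant to a uniform brightness shift. This is a slow reference sketch using explicit loops, and `ncc_match` is a hypothetical name. Libraries such as OpenCV provide optimised equivalents.

```python
import numpy as np

def ncc_match(frame, template):
    """Exhaustive zero-mean NCC; returns best (row, col) and the score map."""
    frame = np.asarray(frame, dtype=float)
    t = np.asarray(template, dtype=float)
    t = t - t.mean()                              # subtract template mean
    t_norm = np.sqrt(np.sum(t ** 2))
    th, tw = t.shape
    out_h, out_w = frame.shape[0] - th + 1, frame.shape[1] - tw + 1
    scores = np.full((out_h, out_w), -1.0)
    for r in range(out_h):
        for c in range(out_w):
            patch = frame[r:r + th, c:c + tw]
            p = patch - patch.mean()              # subtract local frame mean
            denom = np.sqrt(np.sum(p ** 2)) * t_norm
            if denom > 0:                         # flat patches keep a score of -1
                scores[r, c] = np.sum(p * t) / denom
    best = np.unravel_index(np.argmax(scores), scores.shape)
    return (int(best[0]), int(best[1])), scores

rng = np.random.default_rng(0)
frame = rng.integers(0, 200, size=(40, 40))
template = frame[8:14, 12:18].copy()
brighter = frame + 50                             # global illumination change
position, scores = ncc_match(brighter, template)
```

The template is still found at its original position with a score of about 1, despite the brightness shift.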
|
62
Signal Proc/Pole-Zero.md
Normal file
@ -0,0 +1,62 @@
|
||||
- Poles
|
||||
- **X**
|
||||
- Let $X(z) = \infty$
|
||||
- Let $1/X(z) = 0$
|
||||
- Roots of denominator
|
||||
- Zeros
|
||||
- **O**
|
||||
- Let $X(z) = 0$
|
||||
- Roots of numerator
|
||||
- Plotted in the complex plane (s-plane for continuous time, z-plane for discrete time, as used for speech)
|
||||
|
||||
[Magnitude Response From Pole/Zeros](https://www.youtube.com/watch?v=8jNjVkoZQCU)
|
||||
[MIT Pole Zero](https://web.mit.edu/2.14/www/Handouts/PoleZero.pdf)
|
||||
|
||||
A pole-zero plot represents a rational transfer function and identifies:
|
||||
- Stability
|
||||
- Causal/Anti-causal system
|
||||
- ROC
|
||||
- Minimum phase/Non minimum phase
|
||||
|
||||
![](../img/pole-zero-attenuation.png)
|
||||
![](../img/pole-zero-stable.png)
|
||||
![](../img/pole-zero-feedback.png)
|
||||
|
||||
# BIBO Stable
|
||||
- For a causal discrete-time system, all poles of $H(z)$ must lie inside the unit circle
|
||||
- Bounded input: the input magnitude always stays below some constant
|
||||
- Bounded output: the output magnitude then stays below some constant
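The causal BIBO test above reduces to checking pole magnitudes. This sketch assumes the denominator of $H(z)$ is given as coefficients `[1, a1, ..., ap]` in powers of $z^{-1}$. Under that convention, `np.roots` on the same list returns the poles. The function name is hypothetical.

```python
import numpy as np

def causal_bibo_stable(a):
    """a = [1, a1, ..., ap]: denominator of H(z) in powers of z^-1 (causal system)."""
    poles = np.roots(a)                        # roots of z^p + a1 z^(p-1) + ... + ap
    return bool(np.all(np.abs(poles) < 1.0))   # strictly inside the unit circle

stable = causal_bibo_stable([1.0, -1.0, 0.5])  # poles 0.5 +/- 0.5j, |p| ~ 0.707
unstable = causal_bibo_stable([1.0, -1.5])     # pole at 1.5
```

A pole exactly on the unit circle (for example `[1, -1]`) is reported as not stable, because a bounded step input then produces an unbounded output.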
|
||||
|
||||
# Region of Convergence
|
||||
- Depends on whether causal or anti-causal
|
||||
- Cannot contain poles
|
||||
- Extends out to infinity for right-sided (causal) systems
|
||||
|
||||
## Continuous
|
||||
1. If the ROC includes the imaginary axis
|
||||
- BIBO stable
|
||||
- For a causal system, all poles must lie left of the imaginary axis
|
||||
2. Rightwards from pole with largest real-part (not infinity)
|
||||
- Causal
|
||||
3. Leftward from pole with smallest real-part (not -infinity)
|
||||
- Anti-causal
|
||||
|
||||
## Discrete
|
||||
1. If the ROC includes the unit circle
|
||||
- BIBO stable
|
||||
2. Outward from pole with largest (not infinite) magnitude
|
||||
- Right-sided impulse response
|
||||
- Causal (if no pole at infinity)
|
||||
3. Inward from pole with smallest (nonzero) magnitude
|
||||
- Anti-causal
|
||||
|
||||
![](../img/roc-right-left.png)
|
||||
![](../img/roc-two-sided.png)
|
||||
|
||||
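Sinusoidal (oscillating) impulse response when poles form a complex-conjugate pair
|
||||
- $e^{-j\omega}$
|
||||
- Euler's formula turns the pair of complex exponentials into an oscillation
|
||||
Exponential impulse response when the pole lies on the real axis
|
||||
- No $j$ in the exponent, so no oscillation; decays if the pole is inside the unit circle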
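The two cases can be seen by computing impulse responses of all-pole systems directly from the difference equation. This sketch assumes a causal system with denominator coefficients `[1, a1, ...]` in powers of $z^{-1}$. The pole radius and angle are arbitrary choices.

```python
import numpy as np

def all_pole_impulse_response(a, n):
    """First n samples of the impulse response of H(z) = 1 / (1 + a1 z^-1 + ...)."""
    h = np.zeros(n)
    for k in range(n):
        acc = 1.0 if k == 0 else 0.0            # unit impulse input
        for i in range(1, len(a)):
            if k - i >= 0:
                acc -= a[i] * h[k - i]          # y[k] = x[k] - sum_i a_i y[k - i]
        h[k] = acc
    return h

n = 30
h_real = all_pole_impulse_response([1.0, -0.8], n)   # single real pole at 0.8
r, theta = 0.9, np.pi / 4                            # conjugate pair r e^{+/- j theta}
h_pair = all_pole_impulse_response([1.0, -2 * r * np.cos(theta), r ** 2], n)
```

`h_real` is the pure exponential $0.8^n$. `h_pair` equals $r^n\sin((n+1)\theta)/\sin\theta$, a decaying sinusoid that changes sign.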
|
||||
|
||||
![](../img/transfer-stable-unstable.png)
|
25
Signal Proc/Transfer Function.md
Normal file
@ -0,0 +1,25 @@
|
||||
$$Y(s)=H(s)\cdot X(s)$$
|
||||
- $H(s)=\frac{Y(s)}{X(s)}=\frac{\mathcal L\{y(t)\}}{\mathcal L\{x(t)\}}$
|
||||
|
||||
$$Y(z)=H(z)\cdot X(z)$$
|
||||
- $H(z)=\frac{Y(z)}{X(z)}=\frac{\mathcal Z\{y[n]\}}{\mathcal Z\{x[n]\}}$
|
||||
|
||||
$$G(\omega)=\frac{|Y|}{|X|}=|H(j\omega)|$$
|
||||
- $H(j\omega)$, Frequency response
|
||||
|
||||
$$\phi(\omega)=\arg(Y)-\arg(X)=\arg\left(H\left(j\omega\right)\right)$$
|
||||
- $\phi(\omega)$, Phase shift
|
||||
|
||||
$$\tau_\phi(\omega)=-\frac{\phi(\omega)}{\omega}$$
|
||||
- $\tau_\phi$, Phase delay
|
||||
- Frequency-dependent amount of delay introduced to the sinusoid by $H$
|
||||
|
||||
$$\tau_g(\omega)=-\frac{d\phi(\omega)}{d\omega}$$
|
||||
- $\tau_g$, Group delay
|
||||
- Frequency-dependent amount of delay introduced to the envelope of the sinusoid by $H$
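The quantities above can be evaluated numerically. This sketch uses the discrete-time analogue: it evaluates $H(e^{j\omega})$ on a frequency grid instead of $H(j\omega)$, so delays come out in samples. The example filter is an assumption chosen for illustration. It is a symmetric 3-tap FIR, whose phase is exactly $-\omega$, so both delays should equal one sample.

```python
import numpy as np

b = np.array([0.25, 0.5, 0.25])              # H(z) = 0.25 + 0.5 z^-1 + 0.25 z^-2
w = np.linspace(0.1, 3.0, 300)               # avoid w = 0, since phase delay divides by w
k = np.arange(len(b))
H = np.exp(-1j * np.outer(w, k)) @ b         # H(e^{jw}) = sum_k b[k] e^{-jwk}

gain = np.abs(H)                             # G(w) = |H|
phase = np.unwrap(np.angle(H))               # phi(w)
phase_delay = -phase / w                     # tau_phi(w) = -phi(w) / w
group_delay = -np.gradient(phase, w)         # tau_g(w) = -d phi / d w (finite differences)
```

For a filter whose phase is not linear, `phase_delay` and `group_delay` differ and vary with frequency.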
|
||||
|
||||
[Partial Fractions](https://lpsa.swarthmore.edu/BackGround/PartialFraction/PartialFraction.html#Order_of_numerator_polynomial_is_not_less_than_that_of_the_denominator)
|
||||
[Partial Fractions for Laplace](https://lpsa.swarthmore.edu/LaplaceXform/InvLaplace/InvLaplaceXformPFE.html)
|
||||
[Inverse Z Transform](https://lpsa.swarthmore.edu/ZXform/InvZXform/InvZXform.html)
|
||||
|
||||
[Discrete Time Systems: Impulse responses and convolution; An introduction to the Z-transform](https://homes.esat.kuleuven.be/~maapc/static/files/SYSTHEORY/Slides/Lecture5/Lecture5-Impulse%20responses%20and%20convolution%20layout.pdf)
|
61
Speech/Linguistics/Consonants.md
Normal file
@ -0,0 +1,61 @@
|
||||
- Complete or partial closure of vocal tract
|
||||
- Voiced/Unvoiced
|
||||
|
||||
# Nasal
|
||||
- Mouth closed and velum lowered
|
||||
- Vowel-like structure, weaker energy
|
||||
- Sound through nose
|
||||
- English has only voiced nasals
|
||||
- Can be identified by direction of formant movement
|
||||
- m, n, ng
|
||||
- Locations
|
||||
- Labial
|
||||
- Lips
|
||||
- m
|
||||
- Alveolar
|
||||
- Alveolar ridge (just behind the upper teeth)
|
||||
- n
|
||||
- Velar
|
||||
- Soft palate
|
||||
- ng
|
||||
|
||||
# Plosive
|
||||
- Closure which is released
|
||||
- Voiced or unvoiced
|
||||
- Distinguished by when voicing starts (voice onset time)
|
||||
- t, d
|
||||
- Trill
|
||||
- Tap or flap
|
||||
- Locations
|
||||
- Labial
|
||||
- Lips
|
||||
- p, b
|
||||
- Alveolar
|
||||
- Alveolar ridge (just behind the upper teeth)
|
||||
- t, d
|
||||
- Velar
|
||||
- Soft palate
|
||||
- k, g
|
||||
|
||||
# Fricative
|
||||
- Air forced through constriction
|
||||
- Jet of turbulence
|
||||
- Noisy
|
||||
- Voiced
|
||||
- v, z
|
||||
- Unvoiced
|
||||
- f, s
|
||||
- Frequency cut-off
|
||||
- Inversely proportional to length of cavity in front of constriction
|
||||
- Sibilants
|
||||
- Air forced over teeth
|
||||
- s, z
|
||||
- Affricate
|
||||
- Abrupt start to frication
|
||||
- ch, j
|
||||
|
||||
# Approximant
|
||||
- Articulators move close, some perturbations
|
||||
- Vowel-like structure, weaker energy
|
||||
- Upward trajectory of formants
|
||||
- r, y, l, w
|
32
Speech/Linguistics/Linguistics.md
Normal file
@ -0,0 +1,32 @@
|
||||
- Phonetics
|
||||
- Sounds of language
|
||||
- Acoustic result of speech articulation
|
||||
- Phonology
|
||||
- How languages or dialects organise sounds
|
||||
- Within and across languages
|
||||
- Function of sound units in language
|
||||
- How phonemes are used
|
||||
- Morphology
|
||||
- Word structure
|
||||
- Syntax
|
||||
- Sentence structure
|
||||
- Semantics
|
||||
- Meaning of words or sentences
|
||||
- Pragmatics
|
||||
- How context contributes to meaning
|
||||
- Speech act theory
|
||||
- Discourse analysis
|
||||
- How sentences form text
|
||||
|
||||
# Phonetics vs Phonology
|
||||
- [Phoneme](Terms.md#Phoneme)
|
||||
- Unit of sound structure
|
||||
- With linguistic content
|
||||
- Abstraction of set of sounds
|
||||
- Allophones
|
||||
- Phonology
|
||||
- Phone
|
||||
- Segment of speech recording
|
||||
- Single sound
|
||||
- Single phoneme
|
||||
- Phonetics
|
1
Speech/Linguistics/README.md
Symbolic link
@ -0,0 +1 @@
|
||||
Linguistics.md
|
31
Speech/Linguistics/Terms.md
Normal file
@ -0,0 +1,31 @@
|
||||
# Phoneme
|
||||
- Smallest unit of speech
|
||||
- Distinguish words
|
||||
- Continuous speech is a stream of different phonemes
|
||||
- English has about 44 (the exact count depends on accent)
|
||||
- Consonants
|
||||
- Voiced or unvoiced
|
||||
- Vowels
|
||||
- All voiced
|
||||
|
||||
# Voiced
|
||||
- Quasi-periodic vibration of vocal cords
|
||||
- Sound transmitted along vocal tract unimpeded
|
||||
- Air forced through glottis
|
||||
- Quasi-periodic oscillation
|
||||
|
||||
# Unvoiced
|
||||
- Air flow through vocal apparatus is either cut-off or impeded
|
||||
- Constriction using tongue or lips
|
||||
- Turbulence or noise
|
||||
- Fricatives
|
||||
- /s/, /f/
|
||||
- Plosives
|
||||
- /p/, /t/
|
||||
- Modelled as random noise
|
||||
|
||||
# Formant
|
||||
- Vocal tract can be modelled as resonant system
|
||||
- Modes or peaks in spectral response of resonant system
|
||||
- The lowest three (F1, F2, F3) are the most important in speech
|
||||
![](../../img/formant.png)
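For a rough idea of where formants fall, the vocal tract for a neutral vowel is often approximated as a uniform lossless tube. The tube is closed at the glottis and open at the lips, so it resonates at odd quarter-wavelength frequencies, $F_n=(2n-1)c/(4L)$. This sketch assumes that tube model, a tract length of about 17.5 cm, and $c\approx350$ m/s. Real vowels shift these resonances by changing the tract shape. The function name is hypothetical.

```python
def uniform_tube_formants(length_m, count=3, speed_of_sound=350.0):
    """Resonances of a uniform tube closed at one end: F_n = (2n - 1) c / (4 L)."""
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m) for n in range(1, count + 1)]

formants = uniform_tube_formants(0.175)   # approximately 500, 1500 and 2500 Hz
```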
|
65
Speech/Linguistics/Vowels.md
Normal file
26
Speech/NLP/Jargon.md
Normal file
@ -0,0 +1,26 @@
|
||||
- Types
|
||||
- Distinct words
|
||||
- |V|
|
||||
- Tokens
|
||||
- All words
|
||||
- N
|
||||
- Related quantities
|
||||
- Heaps' law estimates $|V|$ from $N$: $|V|=kN^\beta$, with $k$ and $0<\beta<1$ fitted to the corpus
|
||||
- Disfluencies
|
||||
- Fragments
|
||||
- Broken/half spoken words
|
||||
- Fillers
|
||||
- Oo
|
||||
- Uh
|
||||
- May strip
|
||||
- May be helpful
|
||||
- Speaker may start clause again
|
||||
- Clitic
|
||||
- Part of a word that can’t stand on its own
|
||||
- What’re
|
||||
- Contractions
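Types and tokens can be counted with a crude tokeniser, and Heaps' law gives one of the estimating equations relating the two. This sketch makes several assumptions. The regular expression is a deliberately naive word tokeniser. The Heaps' law parameters `k` and `beta` must be fitted to a real corpus and are not given here. Both function names are hypothetical.

```python
import re

def count_types_and_tokens(text):
    """Return (|V|, N): number of distinct word types and total word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())   # naive lower-cased word tokeniser
    return len(set(tokens)), len(tokens)

def heaps_law(n_tokens, k, beta):
    """Heaps' law estimate of vocabulary size: |V| = k * N**beta."""
    return k * n_tokens ** beta

types, tokens = count_types_and_tokens("The cat sat on the mat.")
```

In this example "the" occurs twice, so there are 6 tokens but only 5 types.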
|
||||
|
||||
# Edit Distance
|
||||
- Distance between words
|
||||
- Minimum number of insertions, deletions and substitutions needed to turn one string into another
|
||||
- Similarity
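The minimum edit distance can be computed with the standard dynamic-programming table, keeping one row at a time. This sketch assumes unit cost for insertion, deletion and substitution (Levenshtein distance). Some textbooks charge 2 for a substitution, which gives larger distances.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum insertions, deletions and substitutions (cost 1 each)."""
    prev = list(range(len(target) + 1))        # distance from "" to each prefix of target
    for i, s in enumerate(source, start=1):
        curr = [i]                             # distance from source[:i] to ""
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1,                   # deletion
                curr[j - 1] + 1,               # insertion
                prev[j - 1] + (s != t),        # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]

d = edit_distance("kitten", "sitting")
```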
|
29
Speech/NLP/NLP.md
Normal file
@ -0,0 +1,29 @@
|
||||
|
||||
# Text Normalisation
|
||||
- Tokenisation
|
||||
- Splitting text into tokens
|
||||
- Usually words
|
||||
- Can span multiple words
|
||||
- Proper nouns
|
||||
- New York
|
||||
- Emoticons
|
||||
- Hashtags
|
||||
- May need some named entity recognition
|
||||
- Penn Treebank standard
|
||||
- Byte-pair encoding
|
||||
- Standard word-level tokenisers can’t handle unseen words
|
||||
- Encode as subwords
|
||||
- -est, -er
|
||||
- Lemmatisation
|
||||
- Determining roots of words
|
||||
- Verb infinitives
|
||||
- Find lemma
|
||||
- Derived forms are inflections or inflected
|
||||
- Word-forms
|
||||
- Critical for morphologically complex languages
|
||||
- Arabic
|
||||
- Stemming
|
||||
- Simpler than lemmatisation
|
||||
- Just removing suffixes
|
||||
- Normalising word formats
|
||||
- Segmenting sentences
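The byte-pair encoding idea above can be sketched as a learning loop. Each word starts as a sequence of characters. The loop repeatedly merges the most frequent adjacent pair of symbols into a new subword. This sketch makes several assumptions. The `_` end-of-word marker is one common convention. Ties go to the pair seen first. The tiny corpus is made up for illustration.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of words; '_' marks the end of a word."""
    vocab = Counter(tuple(word) + ("_",) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent adjacent pair (first on ties)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])   # fuse the pair
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["newest", "widest", "lowest"], num_merges=3)
```

On this corpus the first three merges build up the suffix `est_`. This is how BPE comes to represent endings such as -est as single subwords, even for words it has never seen whole.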
|
1
Speech/NLP/README.md
Symbolic link
@ -0,0 +1 @@
|
||||
NLP.md
|
58
Speech/NLP/Recognition.md
Normal file
@ -0,0 +1,58 @@
|
||||
1. Automatic Speech Recognition
|
||||
- Spoken words to machine-readable form
|
||||
2. Natural language understanding
|
||||
- High level cognitive interpretation
|
||||
- Structure
|
||||
- Meaning
|
||||
- Intention
|
||||
|
||||
# Automatic Speech Recognition
|
||||
## Applications
|
||||
- Business/desktop apps
|
||||
- Dictation
|
||||
- Voice commands
|
||||
- Voice enabled services/apps
|
||||
- Siri
|
||||
- Home automation
|
||||
- Game & Entertainment
|
||||
- Education
|
||||
- Speech therapy/Rehab
|
||||
- Hearing assistance
|
||||
- Live closed captioning
|
||||
|
||||
## Challenges
|
||||
- Speaker dependency
|
||||
- Accent
|
||||
- Emotion
|
||||
- Vocab size
|
||||
- Slang
|
||||
- Isolated words vs Continuous speech
|
||||
- Hard to segment continuous speech
|
||||
- Language constraints & Knowledge sources
|
||||
- Training source is critical
|
||||
- Acoustic ambiguity
|
||||
- Similar sounding speech
|
||||
- Noise robustness
|
||||
- Background noise
|
||||
- Reverberation
|
||||
|
||||
# Speech Diarisation
|
||||
- Who speaks when?
|
||||
- Split stream into segments that are homogeneous in speaker identity
|
||||
- Structure stream into speaker turns
|
||||
- Provide speaker identity
|
||||
- Combination of
|
||||
- Speaker segmentation
|
||||
- Speaker changes in stream
|
||||
- Speaker clustering
|
||||
- Grouping segments together on basis of characteristics
|
||||
- Gaussian mixture model
|
||||
- HMM
|
||||
- Bottom-up
|
||||
- More popular
|
||||
- Start from many small clusters (over-clustered)
|
||||
- Merge redundant clusters
|
||||
- Remaining belong to speakers
|
||||
- Top-down
|
||||
- Start from a single cluster
|
||||
- Iteratively split until each cluster corresponds to one speaker
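Bottom-up clustering can be sketched as agglomerative merging of segment feature vectors. The loop starts with one cluster per segment. It repeatedly merges the closest pair of clusters and stops when the closest pair is further apart than a threshold, so the remaining clusters stand for speakers. Several parts of this are assumptions. The feature vectors are made up. Euclidean distance between cluster means stands in for the GMM/BIC-style distances that real diarisation systems typically use. The threshold is hand-picked. The function name is hypothetical.

```python
import numpy as np

def agglomerative_cluster(features, threshold):
    """Merge the closest clusters (by mean feature vector) until all are > threshold apart."""
    feats = np.asarray(features, dtype=float)
    clusters = [[i] for i in range(len(feats))]     # one cluster per segment to start
    while len(clusters) > 1:
        means = [feats[c].mean(axis=0) for c in clusters]
        best, best_d = None, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(means[i] - means[j])
                if d < best_d:
                    best, best_d = (i, j), d
        if best_d > threshold:                       # nothing left that looks redundant
            break
        i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                              # j > i, so index i is unaffected
    return clusters

segments = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.2]]
clusters = agglomerative_cluster(segments, threshold=1.0)
```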
|
8
Speech/Perception/Perception.md
Normal file
@ -0,0 +1,8 @@
|
||||
# Physiological
|
||||
- Physical/mechanical processing of sound
|
||||
- Mechanics of the outer, middle and inner ear
|
||||
|
||||
# Psychological
|
||||
- Brain function and processing
|
||||
|
||||
***Psychoacoustics incorporates both***
|
1
Speech/Perception/README.md
Symbolic link
@ -0,0 +1 @@
|
||||
Perception.md
|
25
Speech/Speech Processing/Applications.md
Normal file
@ -0,0 +1,25 @@
|
||||
- Speech telecommunications & Encoding
|
||||
- Preserving perceptibility and quality over the wire
|
||||
- Minimising bandwidth
|
||||
- Speech enhancement
|
||||
- Restoration of degraded speech
|
||||
- Additive noise
|
||||
- Reverberation
|
||||
- Echoes
|
||||
- Background sounds
|
||||
- Blind source separation
|
||||
- Adaptive filtering
|
||||
- Spectral subtraction
|
||||
- Speech & Speaker recognition
|
||||
- Automatic conversion of speech to text
|
||||
- Identifying speaker based on speech
|
||||
- Dictation, speaker recognition for security
|
||||
- Speaker diarisation
|
||||
- "Who speaks when"
|
||||
- Speaker segmentation
|
||||
- Speaker change point in stream
|
||||
- Speaker clustering
|
||||
- Grouping segments based on speaker identity
|
||||
- Speech synthesis
|
||||
- Speech analysis
|
||||
- Waveform & spectrum
|
1
Speech/Speech Processing/README.md
Symbolic link
@ -0,0 +1 @@
|
||||
Applications.md
|
19
Speech/Speech Processing/Source-Filter.md
Normal file
16
Speech/Speech Processing/Vocal Tract.md
Normal file
@ -0,0 +1,16 @@
|
||||
- Input and output signals are real
|
||||
- Filter coefficients are real for rational $H(z)$
|
||||
- Poles/zeros either real or complex conjugate pairs
|
||||
- BIBO stability important
|
||||
|
||||
# Frequency Response
|
||||
|
||||
- Evaluate $H(z)$ along the unit circle
|
||||
- $|z|=\left|e^{i\omega}\right|=1$
|
||||
|
||||
# Magnitude Response
|
||||
$$\left|H(e^{i\omega})\right|=\frac{|b_0|\,|e^{i\omega}-\beta_1|\cdots|e^{i\omega}-\beta_q|}{|e^{i\omega}-\alpha_1|\cdots|e^{i\omega}-\alpha_p|}$$
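The magnitude response is the gain times the product of distances from $e^{i\omega}$ to each zero $\beta_k$, divided by the product of distances to each pole $\alpha_k$. This sketch assumes $H(z)$ is given by its gain `b0`, zeros and poles. On the unit circle, $|z|=1$, so any extra powers of $z$ do not affect the magnitude. The pole and zero positions are made up, and the function name is hypothetical.

```python
import numpy as np

def magnitude_from_roots(b0, zeros, poles, w):
    """|H(e^{iw})| as |b0| * prod |e^{iw} - beta| / prod |e^{iw} - alpha|."""
    z = np.exp(1j * np.asarray(w))
    num = abs(b0) * np.prod([np.abs(z - beta) for beta in zeros], axis=0)
    den = np.prod([np.abs(z - alpha) for alpha in poles], axis=0)
    return num / den

w = np.linspace(0.0, np.pi, 50)
poles = [0.9 * np.exp(1j * 0.5), 0.9 * np.exp(-1j * 0.5)]  # conjugate pair: one resonance
zeros = [-1.0]
mag = magnitude_from_roots(0.5, zeros, poles, w)

# Cross-check by evaluating the factored polynomials directly
z = np.exp(1j * w)
direct = np.abs(0.5 * np.polyval(np.poly(zeros), z) / np.polyval(np.poly(poles), z))
```

The peak of `mag` sits near $\omega=0.5$, where $e^{i\omega}$ passes closest to the pole. This is how a conjugate pole pair near the unit circle produces a formant.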
|
||||
|
||||
![](../../img/spectrum-vocal-tract.png)
|
||||
- LPC & cepstral analysis to separate the source (excitation) from the filter (vocal tract)
|
||||
- Residual allows more accurate estimation of pitch period
|
BIN
img/english-phoneme-table.png
Normal file
After Width: | Height: | Size: 92 KiB |
BIN
img/formant.png
Normal file
After Width: | Height: | Size: 26 KiB |
BIN
img/pole-zero-attenuation.png
Normal file
After Width: | Height: | Size: 113 KiB |
BIN
img/pole-zero-feedback.png
Normal file
After Width: | Height: | Size: 112 KiB |
BIN
img/pole-zero-stable.png
Normal file
After Width: | Height: | Size: 111 KiB |
BIN
img/roc-right-left.png
Normal file
After Width: | Height: | Size: 134 KiB |
BIN
img/roc-two-sided.png
Normal file
After Width: | Height: | Size: 112 KiB |
BIN
img/spectrum-vocal-tract.png
Normal file
After Width: | Height: | Size: 320 KiB |
BIN
img/transfer-stable-unstable.png
Normal file
After Width: | Height: | Size: 36 KiB |
BIN
img/vowel-chart.png
Normal file
After Width: | Height: | Size: 22 KiB |
BIN
img/vowel-spaces.png
Normal file
After Width: | Height: | Size: 78 KiB |