Visual Question Answering

- Combine visual features with a text sequence
	- [[CNN]] + [[LSTM]]
	- Generate text from images
		- Automatic scene description
	- Cross-modal
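
A minimal sketch of the cross-modal step, assuming the CNN and the LSTM have each already produced a fixed-length vector. The fusion by concatenation, the hand-picked `weights`, and the names `fuse_and_classify`, `image_vec` and `question_vec` are illustrative assumptions, not something these notes specify.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(image_vec, question_vec, weights, answers):
    # Fuse the two modalities by concatenating their feature vectors (an assumption).
    fused = image_vec + question_vec
    # One linear score per candidate answer, then a softmax over the answers.
    scores = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs

# Hypothetical stand-ins for the CNN output and the LSTM's final state.
image_vec = [0.9, 0.1]
question_vec = [0.2, 0.8]
answers = ["cat", "dog"]
# Hand-picked weights; in a real model these would be learned.
weights = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
]

answer, probs = fuse_and_classify(image_vec, question_vec, weights, answers)
```

Treating the answer as a choice from a fixed list is one common simplification; a text-generating decoder would replace the final softmax over `answers`.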

![[cnn+lstm.png]]
- Uses word-level embeddings, not character-level
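
To illustrate the word-versus-character distinction, here is a small sketch of how a question might be turned into the word indices that an embedding layer would look up. The whitespace tokeniser, the `<unk>` token, and the helper names are assumptions made for the example.

```python
def word_tokens(text):
    # Lowercase, drop a trailing question mark, split on whitespace.
    return text.lower().rstrip("?").split()

def char_tokens(text):
    # Character-level alternative, shown only for comparison.
    return list(text.lower())

def build_vocab(questions):
    # Index 0 is reserved for words never seen while building the vocabulary.
    vocab = {"<unk>": 0}
    for q in questions:
        for w in word_tokens(q):
            vocab.setdefault(w, len(vocab))
    return vocab

def encode(text, vocab):
    # Map each word to its index; unknown words fall back to <unk>.
    return [vocab.get(w, vocab["<unk>"]) for w in word_tokens(text)]

questions = ["What colour is the cat?", "Is the cat sleeping?"]
vocab = build_vocab(questions)
ids = encode("What is the dog?", vocab)
```

Each index in `ids` would select one row of the embedding matrix, so the LSTM sees one vector per word instead of one per character. This keeps the sequence much shorter.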

# Freeform
- Encode facts with two text streams
![[vqa-block.png]]
# Limitations
- Repetitive answers
	- Not much variation
- No creativity
	- Won't generalise beyond taught concepts