---
tags:
  - ai
  - media
---
Visual Question Answering

- Combine visual features with a text sequence
	- [CNN](../CNN/CNN.md) + [LSTM](LSTM.md)
	- Generate text from images
		- Automatic scene description
	- Cross-modal
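
As a rough illustration of combining the two modalities, the sketch below fuses an image feature vector with a question feature vector and scores a few candidate answers. It is only a sketch under stated assumptions: the element-wise product is one possible fusion, not necessarily the one in the figure, and `fuse`, `answer_distribution`, the feature values and the answer weights are all hypothetical stand-ins for what a trained CNN and LSTM would produce.

```python
import math


def fuse(image_feat, question_feat):
    """Element-wise product of two equal-length vectors (one possible fusion)."""
    if len(image_feat) != len(question_feat):
        raise ValueError("feature vectors must have the same length")
    return [i * q for i, q in zip(image_feat, question_feat)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer_distribution(fused, answer_weights):
    """Score each candidate answer with a dot product, then normalise."""
    scores = [sum(w * f for w, f in zip(row, fused)) for row in answer_weights]
    return softmax(scores)


# Hypothetical values: a real model would get these from a CNN and an LSTM.
image_feat = [0.5, 1.0, -0.5]
question_feat = [2.0, 0.0, 1.0]

# One made-up weight row per candidate answer.
answers = ["yes", "no", "red"]
answer_weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

fused = fuse(image_feat, question_feat)
probs = answer_distribution(fused, answer_weights)
best = answers[probs.index(max(probs))]
print(best, probs)
```

Treating the answer as a choice from a fixed candidate list keeps the sketch short; generating free text would need a decoder instead.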

![cnn+lstm](../../../img/cnn+lstm.png)
- Uses word embeddings, not character-level input
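
To make the word-level choice concrete, here is a minimal sketch of turning a question into a sequence of word vectors. The helper names (`build_vocab`, `encode`, `make_embeddings`), the `<unk>` token and the random embedding table are assumptions for illustration; a real model would learn the table or load pretrained vectors.

```python
import random


def build_vocab(questions):
    """Assign an integer id to every distinct word; id 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for question in questions:
        for word in question.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(question, vocab):
    """Map a question to word ids, falling back to <unk> for unseen words."""
    return [vocab.get(word, vocab["<unk>"]) for word in question.lower().split()]


def make_embeddings(vocab_size, dim, seed=0):
    """Random embedding table, a placeholder for learned embeddings."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]


questions = ["what colour is the car", "is the car red"]
vocab = build_vocab(questions)

ids = encode("is the bus red", vocab)  # "bus" was never seen, so it maps to <unk>
table = make_embeddings(len(vocab), dim=4)
vectors = [table[i] for i in ids]  # one vector per word, ready for a sequence model

print(ids)
```

Each question becomes one vector per word rather than one per character, so the sequence passed to the LSTM is shorter and each step carries word-level meaning.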

# Freeform
- Encode facts with two text streams
![vqa-block](../../../img/vqa-block.png)
# Limitations
- Repetitive answers
	- Not much variation
- No creativity
	- Won't generalise beyond taught concepts