---
tags:
- ai
- media
---
# Visual Question Answering
- Combines visual features with a text sequence
- [CNN](../CNN/CNN.md) + [LSTM](LSTM.md)
- Generate text from images
- Automatic scene description
- Cross-modal
![cnn+lstm](../../../img/cnn+lstm.png)
- Uses word-level embeddings, not character-level
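
A minimal sketch of what word-level (rather than character-level) input looks like, using only the Python standard library. The helper names (`build_vocab`, `encode_words`, `make_embeddings`), the example question and the embedding size of 4 are all hypothetical and chosen for illustration. In a CNN + LSTM model the resulting sequence of word vectors would be the text input to the LSTM, alongside the CNN's image features; here the embedding table is just random numbers, not learned weights.

```python
import random


def build_vocab(sentences):
    """Map each distinct lowercase word to an integer id; id 0 is reserved for unknown words."""
    vocab = {"<unk>": 0}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_words(sentence, vocab):
    """Turn a sentence into a list of word ids (one id per word, not per character)."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]


def make_embeddings(vocab_size, dim, seed=0):
    """Random stand-in for a learned embedding table: one vector of length `dim` per word id."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


# Hypothetical example question and a tiny made-up training corpus.
question = "what colour is the cat"
vocab = build_vocab([question, "the dog is brown"])

word_ids = encode_words(question, vocab)      # one id per word
table = make_embeddings(len(vocab), dim=4)    # assumed embedding size of 4
word_vectors = [table[i] for i in word_ids]   # sequence that would feed the LSTM

print("word tokens:", len(word_ids))
print("character tokens:", len(question))
```

The word-level sequence is much shorter than the character-level one for the same question, so the recurrent network has fewer steps to process, at the cost of needing an `<unk>` fallback for words outside the vocabulary.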
# Freeform
- Encode facts with two text streams
![vqa-block](../../../img/vqa-block.png)
# Limitations
- Repetitive answers
- Not much variation
- No creativity
- Won't generalise beyond taught concepts