Visual Question Answering

- Combine visual input with a text sequence
- [[CNN]] + [[LSTM]]
- Generate text from images
- Automatic scene description
- Cross-modal

![[cnn+lstm.png]]

- Word embeddings, not character embeddings

# Freeform
- Encode facts with two text streams

![[vqa-block.png]]

# Limitations
- Repetitive answers
- Not much variation
- No creativity
- Won't generalise beyond taught concepts
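
A minimal sketch of the [[CNN]] + [[LSTM]] idea above, in plain Python with random, untrained weights. Everything here is an assumption for illustration, not the model in the figures: the toy vocabulary, the fixed answer list, the sizes, the element-wise fusion step, and the `fake_cnn_output` vector standing in for real CNN features. Choosing from a fixed answer list is one common design, and it gives one way to see why answers repeat and do not extend beyond taught concepts.

```python
import math
import random

random.seed(0)  # reproducible toy weights

EMBED, HIDDEN, IMG_DIM = 8, 6, 10
VOCAB = ["what", "colour", "is", "the", "ball", "<unk>"]  # toy word-level vocabulary
ANSWERS = ["red", "blue", "yes", "no"]                    # toy fixed answer set


def rand_matrix(rows, cols):
    """Small random weights standing in for trained parameters."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, W):
    """One LSTM step (biases omitted); W maps [x; h] to four stacked gates."""
    z = matvec(W, x + h)
    i = [sigmoid(v) for v in z[0:HIDDEN]]                  # input gate
    f = [sigmoid(v) for v in z[HIDDEN:2 * HIDDEN]]         # forget gate
    o = [sigmoid(v) for v in z[2 * HIDDEN:3 * HIDDEN]]     # output gate
    g = [math.tanh(v) for v in z[3 * HIDDEN:4 * HIDDEN]]   # candidate cell state
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy parameters (random, untrained).
word_embeddings = {w: [random.uniform(-0.5, 0.5) for _ in range(EMBED)] for w in VOCAB}
W_lstm = rand_matrix(4 * HIDDEN, EMBED + HIDDEN)
W_img = rand_matrix(HIDDEN, IMG_DIM)       # projects image features to HIDDEN
W_out = rand_matrix(len(ANSWERS), HIDDEN)  # scores each candidate answer


def answer(image_features, question):
    # Word-level tokens, not characters; unknown words map to <unk>.
    tokens = [w if w in word_embeddings else "<unk>" for w in question.lower().split()]
    h, c = [0.0] * HIDDEN, [0.0] * HIDDEN
    for tok in tokens:
        h, c = lstm_step(word_embeddings[tok], h, c, W_lstm)
    img = [math.tanh(v) for v in matvec(W_img, image_features)]
    fused = [a * b for a, b in zip(img, h)]  # element-wise fusion (an assumption)
    probs = softmax(matvec(W_out, fused))
    best = max(range(len(ANSWERS)), key=lambda k: probs[k])
    return ANSWERS[best], probs


fake_cnn_output = [random.random() for _ in range(IMG_DIM)]  # placeholder for CNN features
predicted, probs = answer(fake_cnn_output, "What colour is the ball")
print(predicted, [round(p, 3) for p in probs])
```

With untrained weights the printed answer is arbitrary. The point is the data flow: image features and the question's final LSTM state are fused into one vector, and the output can only ever be one of the entries in `ANSWERS`.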