Visual Question Answering

- Combine visual features with a text sequence
- [[CNN]] + [[LSTM]]
- Generate text from images
- Automatic scene description
- Cross-modal (image and text)

![[cnn+lstm.png]]
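
The CNN + LSTM combination noted above could be sketched roughly as follows. This is a minimal illustration in PyTorch, not the architecture in the diagram: the class name `TinyVQA`, the layer sizes, the element-wise fusion, and the treatment of answers as a fixed set of classes are all assumptions made for the example. A real system would more likely use a pretrained image encoder, and a text decoder if answers are generated word by word.

```python
import torch
import torch.nn as nn


class TinyVQA(nn.Module):
    """Toy CNN + LSTM model (hypothetical): fuses image and question features."""

    def __init__(self, vocab_size: int, num_answers: int,
                 embed_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        # Small CNN standing in for a pretrained image encoder (assumption).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch, 32, 1, 1)
            nn.Flatten(),             # -> (batch, 32)
        )
        self.img_proj = nn.Linear(32, hidden_dim)
        # Word-level embedding: one learned vector per word index.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        img_feat = torch.tanh(self.img_proj(self.cnn(image)))  # (batch, hidden)
        _, (h_n, _) = self.lstm(self.embed(question))          # h_n: (1, batch, hidden)
        txt_feat = torch.tanh(h_n[-1])                         # (batch, hidden)
        fused = img_feat * txt_feat                            # element-wise fusion (assumption)
        return self.classifier(fused)                          # (batch, num_answers)


model = TinyVQA(vocab_size=100, num_answers=10)
images = torch.randn(2, 3, 64, 64)         # two random RGB images
questions = torch.randint(1, 100, (2, 7))  # two questions of seven word indices
logits = model(images, questions)
print(logits.shape)  # torch.Size([2, 10])
```
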
- Word-level embeddings, not character-level
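
To make the word-versus-character distinction concrete, here is a small pure-Python sketch. The helper names (`word_tokens`, `char_tokens`, `build_vocab`), the naive whitespace tokenizer, and the sample question are all made up for illustration. Each word index would then select one row of an embedding table.

```python
def word_tokens(text: str) -> list[str]:
    """Split text into lowercase word tokens (naive whitespace split)."""
    return text.lower().replace("?", " ?").split()


def char_tokens(text: str) -> list[str]:
    """Split text into individual lowercase characters, spaces included."""
    return list(text.lower())


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Map each distinct token to an index; 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


question = "What colour is the cat?"  # hypothetical example question
words = word_tokens(question)
chars = char_tokens(question)
word_vocab = build_vocab(words)
word_ids = [word_vocab[w] for w in words]

print(words)                   # ['what', 'colour', 'is', 'the', 'cat', '?']
print(word_ids)                # [1, 2, 3, 4, 5, 6]
print(len(words), len(chars))  # 6 23
```

The word-level sequence is much shorter than the character-level one, so the LSTM has fewer steps to carry information across, at the cost of a larger vocabulary.
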
# Freeform
- Encode facts with two text streams
![[vqa-block.png]]
# Limitations
- Repetitive answers
- Not much variation
- No creativity
- Won't generalise beyond taught concepts