http://imatge-upc.github.io/vqa-2016-cvprw/
This thesis studies methods to solve Visual Question-Answering (VQA) tasks with a Deep Learning framework. As a preliminary step, we explore Long Short-Term Memory (LSTM) networks used in Natural Language Processing (NLP) to tackle text-based Question-Answering. We then modify this model to accept an image as an input in addition to the question. For this purpose, we explore the VGG-16 and K-CNN convolutional neural networks to extract visual features from the image. These features are merged with the word embedding or with a sentence embedding of the question to predict the answer. This work was submitted to the Visual Question Answering Challenge 2016, where it achieved 53.62% accuracy on the test dataset. The software follows good programming practices and Python code style, providing a consistent baseline in Keras for different configurations.
4. Visual Question-Answering
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).
6. Visual Question-Answering: Types
Real images Abstract scenes
Multi-Choice
Open-ended
Q: Does it appear to be rainy? A: no
Q: What is just under the tree? A: a ball
Q: How many slices of pizza are there? A: 1, 2, 3, 4
Q: What is for dessert? A: cake, ice cream, cheesecake, pie
9. Motivation: AI research
● Multidisciplinary tasks
● Models able to perform more complex activities
● Different sub-problems tackled at once
Computer Vision
Knowledge Representation and Reasoning
Natural Language Processing
13. Tools: Convolutional Neural Networks (CNN)
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
AlexNet
14. Tools: Word and Sentence embeddings
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
Experiments from: Socher et al. (2013b) and Collobert et al. (2011)
King - Man + Woman = Queen
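The analogy on this slide can be reproduced with simple vector arithmetic. The following is a minimal sketch, assuming toy 3-dimensional vectors invented for illustration (they are not learned embeddings), and a hypothetical helper named `analogy` that returns the nearest remaining word by cosine similarity.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real word embeddings are learned from text and have many more dimensions.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.0, 0.1, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z
              for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


result = analogy("king", "man", "woman", EMBEDDINGS)
print(result)
```

With these toy vectors the offset from "man" to "woman" equals the offset from "king" to "queen", which is the property the slide illustrates.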
15. Tools: Long Short-Term Memory networks (LSTM)
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
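For reference, one time step of an LSTM cell can be written as below. This is the commonly used formulation with a forget gate, which was added after the 1997 paper; the symbol names (input $x_t$, hidden state $h_t$, cell state $c_t$, weights $W$, $U$, biases $b$) are notation chosen here and do not come from the slides.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ is element-wise multiplication. The gates $i_t$, $f_t$ and $o_t$ control what is written to, kept in, and read from the cell state.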
18. Extending text-based QA for VQA
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
19. Substitute VGG-16 with KCNN
Liu, Z. (2015). Kernelized Deep Convolutional Neural Network for Describing Complex Images. arXiv preprint arXiv:1509.04581.
22. VQA Dataset: Real Images, Open-ended questions
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).
1 (image) x 3 (questions) x 10 (answers)
23. Evaluation
Metric: Script:
● Characters to lowercase
● Remove periods (unless decimal periods)
● Number words to digits
● Remove articles
● Add apostrophe to contractions
● Replace punctuation with space
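The normalization steps above, together with the accuracy metric, can be sketched as follows. This is a simplified illustration, not the official evaluation script: the lookup tables are small hypothetical subsets, the function names `normalize_answer` and `vqa_accuracy` are chosen here, and the metric is assumed to be the one published with the VQA dataset, where an answer is fully correct if at least 3 of the 10 human annotators gave it.

```python
import re

# Small illustrative subsets (assumption): the official script uses longer tables.
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                "five": "5", "six": "6", "seven": "7", "eight": "8",
                "nine": "9", "ten": "10"}
ARTICLES = {"a", "an", "the"}
CONTRACTIONS = {"dont": "don't", "isnt": "isn't", "cant": "can't"}


def normalize_answer(answer):
    """Apply the listed clean-up steps to a single answer string."""
    text = answer.lower()
    # Remove periods, keeping only those that sit between two digits.
    text = re.sub(r"(?<!\d)\.|\.(?!\d)", "", text)
    # Replace remaining punctuation (except apostrophes and periods) with a space.
    text = re.sub(r"[^\w\s'.]", " ", text)
    words = []
    for word in text.split():
        word = NUMBER_WORDS.get(word, word)        # number words to digits
        if word in ARTICLES:                       # remove articles
            continue
        words.append(CONTRACTIONS.get(word, word)) # add missing apostrophes
    return " ".join(words)


def vqa_accuracy(predicted, human_answers):
    """min(number of annotators who gave this answer / 3, 1)."""
    pred = normalize_answer(predicted)
    matches = sum(1 for h in human_answers if normalize_answer(h) == pred)
    return min(matches / 3.0, 1.0)


# Hypothetical example: 2 of 10 annotators agree with the prediction.
score = vqa_accuracy("A ball.", ["ball", "a ball"] + ["toy"] * 8)
print(normalize_answer("A Ball."), score)
```

Because both the prediction and the human answers go through the same normalization, "A ball." and "ball" are counted as the same answer.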
26. Results in detail
                    VALIDATION SET                       TEST SET
Model      Yes/No  Number  Other  Overall    Yes/No  Number  Other  Overall
Model 1    71.82   23.79   27.99  43.87      71.62   28.76   29.32  46.70
Model 3    75.02   28.60   29.30  46.32      -       -       -      -
Model 2    75.62   31.81   28.11  46.36      -       -       -      -
Model 5    78.15   32.79   33.91  50.32      78.15   36.20   35.26  53.03
Model 4    78.73   32.82   35.5   51.34      78.02   35.68   36.54  53.62
27. Results in context
Accuracy (scale from 0% to 100%):
Humans: 83.30%
UC Berkeley & Sony: 66.47%
Baseline LSTM&CNN: 54.06%
Baseline Nearest neighbor: 42.85%
Baseline Prior per question type: 37.47%
Baseline All yes: 29.88%
Ours: 53.62%
28. Comparison with the baseline
Our model
● Single-word answer
● Generate answers
Baseline
● Multi-word answers (hardcoded)
● Classify over the 1000 most common answers
38. Conclusion
Goals accomplished
✓ Present to VQA Challenge, CVPR 2016
✓ First GPI project using text processing techniques
✓ Create a scalable VQA model
✓ Build a modular and reusable software package
✓ Extended abstract accepted to VQA workshop CVPR 2016
39. Conclusion
Personal overview
● Submission to VQA Challenge
● VQA, hot topic at CVPR 2016
● Model designed to generate answers instead of classifying them
● Question-Answer pair generation proposal