8. Multimodal Tasks - Basic NN Components
● Long Short-Term Memory (LSTM)
● Convolutional Neural Network (CNN)
● Attention Mechanism
[Figure: an encoder-decoder LSTM translating "where is the giraffe" into "dónde está la jirafa" (ending in <EOS>), and an attention mechanism that, at each decoding step t, combines the input vectors v_1, ..., v_n into the context vector ∑_{i=1..n} α_i(t) v_i using attention weights α_i(t).]
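The attention context vector from the figure can be sketched in a few lines: scores are softmax-normalised into weights α_i(t), and the context is the weighted sum of the input vectors. This is a minimal standalone illustration; in the real model the scores are produced by the decoder LSTM at every step t, not passed in by hand.

```python
import math

def soft_attention(vectors, scores):
    """Compute an attention context vector: a convex combination of the
    input vectors v_1..v_n, weighted by softmax-normalised scores.
    Illustrative sketch only; the model learns the scores from the
    decoder state at each step t."""
    # Softmax over the relevance scores -> attention weights alpha_i(t).
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # context = sum_i alpha_i(t) * v_i
    dim = len(vectors[0])
    context = [sum(a * v[d] for a, v in zip(alphas, vectors))
               for d in range(dim)]
    return alphas, context

# With equal scores every alpha_i is 1/n and the context is the mean vector.
alphas, ctx = soft_attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```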
9. [Task 1] Video Description
Microsoft Video Description (MSVD) Dataset:
● 1,970 open-domain clips collected from YouTube.
● Annotated using a crowdsourcing platform.
● Variable number of captions per video.
● 80,000 different video-caption pairs.
[Figure: example videos with captions such as "Two men working on a high building" and "Two teams are playing soccer".]
Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney R, Saenko K. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729. 2014 Dec 15.
10. [Task 1] Video Description - Model
[Figure: encoder-decoder architecture. A CNN extracts a feature vector from each video frame; a bidirectional LSTM encoder (steps j=1..J) processes the frame features, a soft attention model weights them, and an LSTM decoder (steps t=1..T) generates the caption word by word, e.g. "Two elephants ... water".]
Álvaro Peris, Marc Bolaños, Petia Radeva, and Francisco Casacuberta. "Video Description using Bidirectional Recurrent Neural Networks." In Proceedings of the International Conference on Artificial Neural Networks (ICANN) (in press) (2016).
Sentence generation: argmax_y P(y | y_1, ..., y_{t-1}, x_1, ..., x_J)
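A common way to approximate the argmax search over sentences is greedy decoding: at each step pick the most probable next word given the history. The sketch below is a hypothetical illustration of that loop; `step_probs` stands in for the decoder LSTM + softmax, which is not reproduced here.

```python
def greedy_decode(step_probs, vocab, eos="<EOS>", max_len=10):
    """Greedy approximation of argmax_y P(y | y_1..y_{t-1}, x_1..x_J):
    at each step emit the most probable word given the history.
    `step_probs(history)` is a stand-in for the decoder network
    (hypothetical interface, for illustration only)."""
    history = []
    for _ in range(max_len):
        probs = step_probs(history)                  # P(word | history, video)
        word = max(vocab, key=lambda w: probs[w])    # pick the argmax word
        if word == eos:
            break
        history.append(word)
    return history

# Toy "decoder" that deterministically scores "Two elephants <EOS>".
script = [{"Two": 0.9, "elephants": 0.05, "<EOS>": 0.05},
          {"Two": 0.05, "elephants": 0.9, "<EOS>": 0.05},
          {"Two": 0.05, "elephants": 0.05, "<EOS>": 0.9}]
vocab = ["Two", "elephants", "<EOS>"]
caption = greedy_decode(lambda h: script[len(h)], vocab)
```

Beam search generalises this by keeping the k best partial sentences instead of only one.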
11. [Task 1] Video Description - Results
● Bidirectional temporal mechanism (BLSTM): allows extracting information in both past-to-future and future-to-past directions.
● Attention mechanisms: helpful for step-by-step sentence generation.
Future work:
● Applying CNNs at a higher, temporal level (3D CNNs).
* Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision 2015 (pp. 4507-4515).
12. [Task 2] Visual Question Answering
VQA Dataset (open-ended question answering task):
● 200,000 images
● 3 questions per image
● 10 (short) answers per question, annotated by different users
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence Zitnick C, Parikh D. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision 2015 (pp. 2425-2433).
13. [Task 2] VIBIKNet for VQA
[Figure: the question words ("where is the giraffe") pass through a text embedding (GloVe initialization) into forward and backward LSTMs (a bidirectional LSTM); the two directions are combined by element-wise summation. The image passes through a KCNN (L2-normalized) into a visual embedding. The question and visual embeddings are joined by vector concatenation and fed to a softmax that predicts the answer ("behind fence").]
Marc Bolaños, Álvaro Peris, Francisco Casacuberta and Petia Radeva. "VIBIKNet: Visual Bidirectional Kernelized Network for Visual Question Answering." Challenge on Visual Question Answering, CVPR (no proceedings) (2016).
Answer generation: argmax_a P(a | q_1, ..., q_n, x)
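The fusion step suggested by the VIBIKNet diagram can be sketched as follows. This is an assumed reading of the figure, not the authors' implementation: the forward and backward LSTM summaries of the question are merged by element-wise summation, then concatenated with the visual (KCNN) embedding before the softmax classifier.

```python
def vibiknet_fusion(fwd_state, bwd_state, visual_embedding):
    """Sketch of the fusion stage (assumed from the diagram):
    element-wise summation of the forward and backward LSTM states,
    then vector concatenation with the visual embedding. The result
    would be fed to the softmax answer classifier."""
    # Element-wise summation of the two LSTM directions.
    question = [f + b for f, b in zip(fwd_state, bwd_state)]
    # Vector concatenation with the KCNN visual embedding.
    return question + visual_embedding

fused = vibiknet_fusion([1.0, 2.0], [3.0, 4.0], [5.0])
```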
15. [Task 2] VIBIKNet for VQA - Results
Accuracy [%]:
Model | dev 2014: Yes/No, Number, Other, Overall | test 2015: Yes/No, Number, Other, Overall
LSTM | 79.00, 38.16, 33.68, 52.88 | -, -, -, -
BLSTM | 79.13, 38.26, 33.52, 52.96 | 78.30, 38.88, 38.97, 54.86
BLSTM train+dev | -, -, -, - | 78.88, 36.33, 40.27, 56.10
● Classification models work better than generative models on datasets with simple answers.
● Models that compact and jointly describe the information present in the images (KCNN) seem promising.
● The use of pre-trained but adaptable representations is crucial for small and medium-sized datasets.
16. [Task 3] Image Description
● Image description formulated as a translation problem:
Sentence generation: argmax_y P(y | y_1, ..., y_{t-1}, x)
[Figure: a CNN encodes the image into a feature vector x, which conditions an LSTM decoder that generates the caption word by word (e.g. "rally", "road", ...).]
Basic initial tests on Flickr8k obtained a result of BLEU = 20.2%.
17. [Task 4] Multimodal Translation
● Translation problem aided by image information:
Sentence translation: argmax_y P(y | y_1, ..., y_{t-1}, x, z_1, ..., z_J)
[Figure: the source words z_1, ..., z_J ("Two", "elephants", ..., "water") are embedded and encoded by a bidirectional LSTM, while a KCNN encodes the image into x. A soft attention model feeds an LSTM decoder (steps t=1..T) that generates the translation ("Dos", "elefantes", ..., "agua").]
Basic initial tests on the Flickr30k ACL Task 1 Challenge obtained a result of METEOR = 41.2%.
18. Future Directions
We are working on adding several state-of-the-art architectures and ideas:
● Highway Networks
● Compact Bilinear Pooling
● Class Activation Maps
Srivastava RK, Greff K, Schmidhuber J. Highway networks. arXiv preprint arXiv:1505.00387. 2015 May 3.
Gao Y, Beijbom O, Zhang N, Darrell T. Compact bilinear pooling. arXiv preprint arXiv:1511.06062. 2015 Nov 19.
Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. arXiv preprint arXiv:1512.04150. 2015 Dec 14.
19. Collaboration supported by the R-MIPRCV:
● Stay of Marc Bolaños (CVC-UB) at UPV, 2015.
● Stay of Álvaro Peris (UPV) at UB, 2016.
● To be extended with the incorporation of UGR (stay of a PhD student, October 2016).
Publications and challenges:
● ICANN'2016
● CVPR'2016
Summary
Download VIBIKNet: www.github.com/MarcBS/VIBIKNet
25. [Task 2] VIBIKNet for VQA - Kernelized CNN
[Figure: an object detector extracts image regions; GoogLeNet computes a CNN feature vector for each region; the per-region features are reduced with PCA, aggregated with a Gaussian Mixture Model (Fisher Vectors), and reduced again with PCA to obtain the final KCNN feature vector.]
Liu Z. Kernelized deep convolutional neural network for describing complex images. arXiv preprint arXiv:1509.04581. 2015 Sep 15.
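The core idea of the KCNN pipeline is aggregating many per-region CNN features into one fixed-size image descriptor. The sketch below is a deliberately simplified stand-in: instead of the PCA + GMM Fisher Vector encoding used in the paper, it pools mean and variance statistics over regions, which mimics the first- and second-order statistics a single-component Fisher Vector would capture (illustrative only, not the paper's method).

```python
def kcnn_sketch(region_features):
    """Simplified stand-in for KCNN aggregation: pool per-region CNN
    features into one descriptor via mean and variance statistics.
    The real pipeline uses PCA + Gaussian Mixture Model Fisher Vectors;
    this single-Gaussian version only illustrates the idea."""
    n = len(region_features)
    dim = len(region_features[0])
    # First-order statistic: per-dimension mean over regions.
    mean = [sum(f[d] for f in region_features) / n for d in range(dim)]
    # Second-order statistic: per-dimension variance over regions.
    var = [sum((f[d] - mean[d]) ** 2 for f in region_features) / n
           for d in range(dim)]
    # Concatenate both statistics into the image-level descriptor.
    return mean + var

descriptor = kcnn_sketch([[0.0, 2.0], [2.0, 2.0]])
```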
27. Deep Neural Networks for Multimodal Learning
Presented by: Marc Bolaños
[Figure: a CNN and a BLSTM process an image and the question "where is the giraffe", and an LSTM generates the answer "behind the fence".]