8. Multimodal Tasks - Basic NN Components
● Long Short-Term Memory (LSTM)
● Convolutional Neural Network (CNN)
● Attention Mechanism
[Figure: an encoder-decoder LSTM translating "where is the giraffe" into "dónde está la jirafa" (ending in <EOS>), and an attention mechanism that, at each decoding step t, combines the input vectors v_1, ..., v_n into the context vector ∑_{i=1..n} α_i(t) v_i using attention weights α_i(t).]
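The attention context vector from the figure can be sketched in a few lines: scores are softmax-normalised into weights α_i(t), and the context is the weighted sum of the input vectors. This is a minimal standalone illustration; in the real model the scores are produced by the decoder LSTM at every step t, not passed in by hand.

```python
import math

def soft_attention(vectors, scores):
    """Compute an attention context vector: a convex combination of the
    input vectors v_1..v_n, weighted by softmax-normalised scores.
    Illustrative sketch only; the model learns the scores from the
    decoder state at each step t."""
    # Softmax over the relevance scores -> attention weights alpha_i(t).
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # context = sum_i alpha_i(t) * v_i
    dim = len(vectors[0])
    context = [sum(a * v[d] for a, v in zip(alphas, vectors))
               for d in range(dim)]
    return alphas, context

# With equal scores every alpha_i is 1/n and the context is the mean vector.
alphas, ctx = soft_attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```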
9. [Task 1] Video Description
Microsoft Video Description (MSVD) Dataset:
● 1,970 open-domain clips collected from YouTube.
● Annotated using a crowdsourcing platform.
● Variable number of captions per video.
● 80,000 different video-caption pairs.
[Figure: example videos with captions such as "Two men working on a high building" and "Two teams are playing soccer".]
Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney R, Saenko K. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729. 2014 Dec 15.
10. [Task 1] Video Description - Model
[Figure: encoder-decoder architecture. A CNN extracts a feature vector from each video frame; a bidirectional LSTM encoder (steps j=1..J) processes the frame features, a soft attention model weights them, and an LSTM decoder (steps t=1..T) generates the caption word by word, e.g. "Two elephants ... water".]
Álvaro Peris, Marc Bolaños, Petia Radeva, and Francisco Casacuberta. "Video Description using Bidirectional Recurrent Neural Networks." In Proceedings of the International Conference on Artificial Neural Networks (ICANN) (in press) (2016).
Sentence generation: argmax_y P(y | y_1, ..., y_{t-1}, x_1, ..., x_J)
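A common way to approximate the argmax search over sentences is greedy decoding: at each step pick the most probable next word given the history. The sketch below is a hypothetical illustration of that loop; `step_probs` stands in for the decoder LSTM + softmax, which is not reproduced here.

```python
def greedy_decode(step_probs, vocab, eos="<EOS>", max_len=10):
    """Greedy approximation of argmax_y P(y | y_1..y_{t-1}, x_1..x_J):
    at each step emit the most probable word given the history.
    `step_probs(history)` is a stand-in for the decoder network
    (hypothetical interface, for illustration only)."""
    history = []
    for _ in range(max_len):
        probs = step_probs(history)                  # P(word | history, video)
        word = max(vocab, key=lambda w: probs[w])    # pick the argmax word
        if word == eos:
            break
        history.append(word)
    return history

# Toy "decoder" that deterministically scores "Two elephants <EOS>".
script = [{"Two": 0.9, "elephants": 0.05, "<EOS>": 0.05},
          {"Two": 0.05, "elephants": 0.9, "<EOS>": 0.05},
          {"Two": 0.05, "elephants": 0.05, "<EOS>": 0.9}]
vocab = ["Two", "elephants", "<EOS>"]
caption = greedy_decode(lambda h: script[len(h)], vocab)
```

Beam search generalises this by keeping the k best partial sentences instead of only one.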
11. [Task 1] Video Description - Results
● Bidirectional temporal mechanism (BLSTM): allows extracting information in both past-to-future and future-to-past directions.
● Attention mechanisms: helpful for step-by-step sentence generation.
Future work:
● Applying CNNs at a higher, temporal level (3D CNNs).
* Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision 2015 (pp. 4507-4515).
12. [Task 2] Visual Question Answering
VQA Dataset (open-ended question answering task):
● 200,000 images
● 3 questions per image
● 10 (short) answers per question, annotated by different users
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence Zitnick C, Parikh D. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision 2015 (pp. 2425-2433).
13. [Task 2] VIBIKNet for VQA
[Figure: the question words ("where is the giraffe") pass through a text embedding (GloVe initialization) into forward and backward LSTMs (a bidirectional LSTM); the two directions are combined by element-wise summation. The image passes through a KCNN (L2-normalized) into a visual embedding. The question and visual embeddings are joined by vector concatenation and fed to a softmax that predicts the answer ("behind fence").]
Marc Bolaños, Álvaro Peris, Francisco Casacuberta and Petia Radeva. "VIBIKNet: Visual Bidirectional Kernelized Network for Visual Question Answering." Challenge on Visual Question Answering, CVPR (no proceedings) (2016).
Answer generation: argmax_a P(a | q_1, ..., q_n, x)
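The fusion step suggested by the VIBIKNet diagram can be sketched as follows. This is an assumed reading of the figure, not the authors' implementation: the forward and backward LSTM summaries of the question are merged by element-wise summation, then concatenated with the visual (KCNN) embedding before the softmax classifier.

```python
def vibiknet_fusion(fwd_state, bwd_state, visual_embedding):
    """Sketch of the fusion stage (assumed from the diagram):
    element-wise summation of the forward and backward LSTM states,
    then vector concatenation with the visual embedding. The result
    would be fed to the softmax answer classifier."""
    # Element-wise summation of the two LSTM directions.
    question = [f + b for f, b in zip(fwd_state, bwd_state)]
    # Vector concatenation with the KCNN visual embedding.
    return question + visual_embedding

fused = vibiknet_fusion([1.0, 2.0], [3.0, 4.0], [5.0])
```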
15. [Task 2] VIBIKNet for VQA - Results
Accuracy [%]:
Model | dev 2014: Yes/No, Number, Other, Overall | test 2015: Yes/No, Number, Other, Overall
LSTM | 79.00, 38.16, 33.68, 52.88 | -, -, -, -
BLSTM | 79.13, 38.26, 33.52, 52.96 | 78.30, 38.88, 38.97, 54.86
BLSTM train+dev | -, -, -, - | 78.88, 36.33, 40.27, 56.10
● Classification models work better than generative models on datasets with simple answers.
● Models that compact and jointly describe the information present in the images (KCNN) seem promising.
● The use of pre-trained but adaptable representations is crucial for small and medium-sized datasets.
16. [Task 3] Image Description
● Image description formulated as a translation problem:
Sentence generation: argmax_y P(y | y_1, ..., y_{t-1}, x)
[Figure: a CNN encodes the image into a feature vector x, which conditions an LSTM decoder that generates the caption word by word (e.g. "rally", "road", ...).]
Basic initial tests on Flickr8k obtained a result of BLEU = 20.2%.
17. [Task 4] Multimodal Translation
● Translation problem aided by image information:
Sentence translation: argmax_y P(y | y_1, ..., y_{t-1}, x, z_1, ..., z_J)
[Figure: the source words z_1, ..., z_J ("Two", "elephants", ..., "water") are embedded and encoded by a bidirectional LSTM, while a KCNN encodes the image into x. A soft attention model feeds an LSTM decoder (steps t=1..T) that generates the translation ("Dos", "elefantes", ..., "agua").]
Basic initial tests on the Flickr30k ACL Task 1 Challenge obtained a result of METEOR = 41.2%.
18. Future Directions
We are working on adding several state-of-the-art architectures and ideas:
● Highway Networks
● Compact Bilinear Pooling
● Class Activation Maps
Srivastava RK, Greff K, Schmidhuber J. Highway networks. arXiv preprint arXiv:1505.00387. 2015 May 3.
Gao Y, Beijbom O, Zhang N, Darrell T. Compact bilinear pooling. arXiv preprint arXiv:1511.06062. 2015 Nov 19.
Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. arXiv preprint arXiv:1512.04150. 2015 Dec 14.
19. Collaboration supported by the R-MIPRCV:
● Stay of Marc Bolaños (CVC-UB) at UPV, 2015.
● Stay of Álvaro Peris (UPV) at UB, 2016.
● To be extended with the incorporation of UGR (stay of a PhD student, October 2016).
Publications and challenges:
● ICANN'2016
● CVPR'2016
Summary
Download VIBIKNet: www.github.com/MarcBS/VIBIKNet
25. [Task 2] VIBIKNet for VQA - Kernelized CNN
[Figure: an object detector extracts image regions; GoogLeNet computes a CNN feature vector for each region; the per-region features are reduced with PCA, aggregated with a Gaussian Mixture Model (Fisher Vectors), and reduced again with PCA to obtain the final KCNN feature vector.]
Liu Z. Kernelized deep convolutional neural network for describing complex images. arXiv preprint arXiv:1509.04581. 2015 Sep 15.
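The core idea of the KCNN pipeline is aggregating many per-region CNN features into one fixed-size image descriptor. The sketch below is a deliberately simplified stand-in: instead of the PCA + GMM Fisher Vector encoding used in the paper, it pools mean and variance statistics over regions, which mimics the first- and second-order statistics a single-component Fisher Vector would capture (illustrative only, not the paper's method).

```python
def kcnn_sketch(region_features):
    """Simplified stand-in for KCNN aggregation: pool per-region CNN
    features into one descriptor via mean and variance statistics.
    The real pipeline uses PCA + Gaussian Mixture Model Fisher Vectors;
    this single-Gaussian version only illustrates the idea."""
    n = len(region_features)
    dim = len(region_features[0])
    # First-order statistic: per-dimension mean over regions.
    mean = [sum(f[d] for f in region_features) / n for d in range(dim)]
    # Second-order statistic: per-dimension variance over regions.
    var = [sum((f[d] - mean[d]) ** 2 for f in region_features) / n
           for d in range(dim)]
    # Concatenate both statistics into the image-level descriptor.
    return mean + var

descriptor = kcnn_sketch([[0.0, 2.0], [2.0, 2.0]])
```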
27. Deep Neural Networks for Multimodal Learning
Presented by: Marc Bolaños
[Figure: a CNN and a BLSTM process an image and the question "where is the giraffe", and an LSTM generates the answer "behind the fence".]