Deep Neural Networks for Multimodal Learning
Presented by: Marc Bolaños
Álvaro Peris, Francisco Casacuberta, Marc Bolaños, Petia Radeva
Multimodal Tasks
● Video Description
● Multimodal Translation
● Multimodal Description
● Visual Question Answering
● Dense Captioning
● Image Description
Two young guys with shaggy hair look at their hands while hanging out in the yard.
Two young, White males are outside near many bushes.
Two men in green shirts are standing in a yard.
A man in a blue shirt standing in a garden.
Two friends enjoy time spent together.
Dos hombres están en el jardín.
A man is smiling at a stuffed lion.
Un hombre sonríe a un león de peluche.
Q: What kind of store is this?
A: bakery
Q: What number is the bus?
A: 48
Multimodal Tasks - Basic NN Components
Long Short-Term Memory (LSTM)
Convolutional Neural Network (CNN)
Attention Mechanism
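[Figure: LSTM encoder-decoder. Encoder LSTMs read "where is the giraffe"; decoder LSTMs emit "dónde está la jirafa <EOS>".]
[Figure: attention mechanism. Between LSTM steps t-1 and t, the vectors $v_1, \ldots, v_n$ are weighted by $\alpha_1^{(t)}, \ldots, \alpha_n^{(t)}$ and combined as $\sum_{i=1}^{n} \alpha_i^{(t)} v_i$.]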
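The weighted sum $\sum_{i=1}^{n} \alpha_i^{(t)} v_i$ from the attention slide can be sketched in a few lines of plain Python. This is only an illustration: the slides do not say how the weights are scored, so the dot-product scoring, the softmax normalisation, and the names `attend`, `prev_state` and `features` are assumptions made for this example.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(prev_state, features):
    """Return attention weights and the weighted sum of the feature vectors.

    ASSUMPTION: each vector v_i is scored by its dot product with the
    previous decoder state; the slides do not specify the scoring function.
    """
    scores = [sum(h * v for h, v in zip(prev_state, vec)) for vec in features]
    alphas = softmax(scores)
    dim = len(features[0])
    # context = sum_i alpha_i * v_i, computed per dimension
    context = [sum(a * vec[d] for a, vec in zip(alphas, features))
               for d in range(dim)]
    return alphas, context


# Toy values (hypothetical): three feature vectors v_1..v_3 of size 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_state = [0.5, -0.5]
alphas, context = attend(prev_state, features)
```

The decoder would receive `context` as an extra input at step t, and the weights are recomputed at every step because `prev_state` changes.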
[Task 1] Video Description
Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney R, Saenko K. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729. 2014 Dec 15.
Example captions: "Two men working on a high building"; "Two teams are playing soccer"
Microsoft Video Description (MSVD) Dataset
● 1,970 open-domain clips collected from YouTube.
● Annotated using a crowdsourcing platform.
● Variable number of captions per video.
● 80,000 different video-caption pairs.
[Task 1] Video Description - Model
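[Figure: encoder-decoder model. ENCODER: each video frame is passed through a CNN, and the frame features feed forward and backward LSTMs over frames j = 1, ..., J. DECODER: a soft attention model feeds an LSTM over time steps t = 1, ..., T, which generates the caption word by word ("Two elephants ... water").]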
Álvaro Peris, Marc Bolaños, Petia Radeva, and Francisco Casacuberta. "Video Description using Bidirectional Recurrent Neural Networks." In Proceedings of the International Conference on Artificial Neural Networks (ICANN), in press (2016).
Sentence generation: $\arg\max_{y} P(y \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_J)$
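One simple way to apply the sentence-generation rule is greedy decoding: at each step, pick the word with the highest conditional probability given the words generated so far and the inputs. The sketch below assumes this greedy strategy (the slides do not state the search procedure), and `toy_step` is a hypothetical stand-in for the trained decoder with made-up probabilities.

```python
def greedy_decode(step_fn, inputs, eos="<EOS>", max_len=20):
    """Repeatedly take argmax_y P(y | y_1..y_{t-1}, inputs) until <EOS>."""
    prefix = []
    for _ in range(max_len):
        probs = step_fn(prefix, inputs)      # dict: word -> probability
        word = max(probs, key=probs.get)     # argmax over the vocabulary
        if word == eos:
            break
        prefix.append(word)
    return prefix


def toy_step(prefix, inputs):
    """HYPOTHETICAL decoder: favours a fixed script, ignores the inputs."""
    script = ["Two", "elephants", "<EOS>"]
    target = script[min(len(prefix), len(script) - 1)]
    vocab = ["Two", "elephants", "water", "<EOS>"]
    return {w: (0.7 if w == target else 0.1) for w in vocab}


caption = greedy_decode(toy_step, inputs=["x1", "x2", "x3"])
```

Beam search is a common alternative that keeps several candidate prefixes instead of one; greedy decoding is used here only because it is the shortest correct instance of the argmax rule.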
[Task 1] Video Description - Results
* Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision 2015 (pp. 4507-4515).
● Bidirectional temporal mechanism (BLSTM): extracts information in both past-to-future and future-to-past directions.
● Attention mechanisms: helpful for step-by-step sentence generation.
Future work:
● CNNs at a higher and temporal level (3D CNNs).
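The bidirectional idea noted in the results can be illustrated with a deliberately simplified recurrence. The sketch below uses a scalar tanh recurrent unit rather than an LSTM, with fixed hypothetical weights `w_in` and `w_rec`; it only shows how a past-to-future pass and a future-to-past pass are paired per frame.

```python
import math


def run_rnn(frames, w_in=0.5, w_rec=0.5):
    """Simplified recurrent pass (plain tanh unit, NOT an LSTM)."""
    state = 0.0
    states = []
    for x in frames:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def bidirectional_encode(frames):
    """Pair each frame's past-to-future and future-to-past states."""
    forward = run_rnn(frames)
    # Run over the reversed sequence, then restore the original order.
    backward = run_rnn(frames[::-1])[::-1]
    return list(zip(forward, backward))


# Toy one-dimensional "frame features" (hypothetical values).
frames = [1.0, 0.0, -1.0]
encoded = bidirectional_encode(frames)
```

Each entry of `encoded` summarises everything before the frame (first component) and everything after it (second component), which is the property the BLSTM bullet refers to.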
[Task 2] Visual Question Answering
VQA Dataset: open-ended question answering task
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence Zitnick C, Parikh D. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision 2015 (pp. 2425-2433).
● 200,000 images
● 3 questions per image
● 10 (short) answers per question, annotated by different users
[Task 2] VIBIKNet for VQA
[Figure: VIBIKNet architecture. The question words ("where is the giraffe") pass through a text embedding (GloVe initialization) into a bidirectional LSTM (forward and backward LSTMs); the image passes through KCNN (L2 norm) and a visual embedding. Legend: element-wise summation, vector concatenation [ , ]. A softmax outputs the answer ("behind fence").]
Marc Bolaños, Álvaro Peris, Francisco Casacuberta and Petia Radeva. "VIBIKNet: Visual Bidirectional Kernelized Network for Visual Question Answering." Challenge on Visual Question Answering, CVPR (no proceedings) (2016).
Answer generation: $\arg\max_{a} P(a \mid q_1, \ldots, q_n, x)$
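The answer-generation rule treats VQA as classification over a fixed set of candidate answers. The sketch below is one possible reading of the diagram, not the released VIBIKNet code: it assumes the question vector and the visual embedding have the same size and are fused by element-wise summation before a linear layer and softmax. All vectors, weights and the three-answer vocabulary are hypothetical toy values.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer(question_vec, image_vec, weights, answers):
    """Return argmax_a P(a | question, image) and the full distribution.

    ASSUMPTION: fusion is an element-wise sum of the two embeddings,
    followed by one linear layer (one weight row per candidate answer).
    """
    fused = [q + v for q, v in zip(question_vec, image_vec)]
    scores = [sum(w * f for w, f in zip(row, fused)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(answers)), key=lambda i: probs[i])
    return answers[best], probs


# Toy setup (hypothetical): 3 candidate answers, 3-dimensional embeddings.
answers = ["behind fence", "bakery", "48"]
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
question_vec = [0.2, 0.1, 0.4]
image_vec = [0.5, 0.0, 0.3]
best_answer, probs = answer(question_vec, image_vec, weights, answers)
```

Restricting the output to a closed answer list is what makes this a classification model rather than a generative one, in contrast with the word-by-word decoders used for the description tasks.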
[Task 2] VIBIKNet for VQA - Results
                    Accuracy [%] on dev 2014            Accuracy [%] on test 2015
Model               Yes/No  Number  Other  Overall      Yes/No  Number  Other  Overall
LSTM                79.00   38.16   33.68  52.88        -       -       -      -
BLSTM               79.13   38.26   33.52  52.96        78.30   38.88   38.97  54.86
BLSTM (train+dev)   -       -       -      -            78.88   36.33   40.27  56.1
● Classification models work better than generative models on datasets with simple answers.
● Models for compacting and jointly describing the information present in the images (KCNN) seem promising.
● The use of pre-trained but adaptable representations is crucial for small and medium-sized datasets.
[Task 3] Image Description
● Image Description formulated as a translation problem:
Sentence generation: $\arg\max_{y} P(y \mid y_1, \ldots, y_{t-1}, x)$
[Figure: an image (img1) is encoded by a CNN, and an LSTM decoder generates the caption word by word (output words shown: "A", "rally", "road").]
Basic initial tests on Flickr8k obtained BLEU = 20.2%.
[Task 4] Multimodal Translation
● Translation problem aided by image information:
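[Figure: encoder-decoder model. ENCODER: the source words ("Two elephants ... water") are embedded and fed to a BLSTM that produces $z_1, \ldots, z_J$; image features come from KCNN and are shown with [ , ] blocks. DECODER: a soft attention model feeds an LSTM over time steps t = 1, ..., T, which generates the translation ("Dos elefantes ... agua").]
Sentence translation: $\arg\max_{y} P(y \mid y_1, \ldots, y_{t-1}, x, z_1, \ldots, z_J)$
Basic initial tests on the Flickr30k ACLTask1 Challenge obtained METEOR = 41.2%.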
Future Directions
We are working on adding several state-of-the-art architectures and ideas:
● Highway Networks
● Compact Bilinear Pooling
● Class Activation Maps
Srivastava RK, Greff K, Schmidhuber J. Highway networks. arXiv preprint arXiv:1505.00387. 2015 May 3.
Gao Y, Beijbom O, Zhang N, Darrell T. Compact bilinear pooling. arXiv preprint arXiv:1511.06062. 2015 Nov 19.
Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning Deep Features for Discriminative Localization. arXiv preprint arXiv:1512.04150. 2015 Dec 14.
Collaboration supported by the R-MIPRCV:
● Stay of Marc Bolaños (CVC-UB) at UPV, 2015.
● Stay of Álvaro Peris (UPV) at UB, 2016.
● To be extended with the incorporation of UGR (stay of a PhD student in October 2016).
Publications and challenges:
● ICANN’2016
● CVPR’2016
Summary
Download VIBIKNet: www.github.com/MarcBS/VIBIKNet
www.ub.edu/cvub/marcbolanos
marc.bolanos@ub.edu
[Task 2] VIBIKNet for VQA - Kernelized CNN
[Figure: KCNN pipeline. Object Detector → GoogLeNet on each detected region (CNN feature vector) → PCA → Gaussian Mixture Model (Fisher Vectors) → PCA → KCNN feature vector.]
Liu Z. Kernelized Deep Convolutional Neural Network for Describing Complex Images. arXiv preprint arXiv:1509.04581. 2015 Sep 15.
[Task 4] Multimodal Translation
● Translation problem aided by image information:
[Figure: encoder-decoder model. ENCODER: the source words ("Two elephants ... water") are embedded and fed to forward and backward LSTMs over j = 1, ..., J, producing $z_1, \ldots, z_J$; image features come from KCNN and are shown with [ , ] blocks. DECODER: a soft attention model feeds an LSTM over time steps t = 1, ..., T, which generates the translation ("Dos elefantes ... agua").]
Sentence translation: $\arg\max_{y} P(y \mid y_1, \ldots, y_{t-1}, x, z_1, \ldots, z_J)$
Basic initial tests on the Flickr30k ACLTask1 Challenge obtained BLEU = 20.2.

Weitere ähnliche Inhalte

Was ist angesagt?

Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018Universitat Politècnica de Catalunya
 
Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks
Temporal Activity Detection in Untrimmed Videos with Recurrent Neural NetworksTemporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks
Temporal Activity Detection in Untrimmed Videos with Recurrent Neural NetworksUniversitat Politècnica de Catalunya
 
Deep Learning
Deep LearningDeep Learning
Deep LearningJun Wang
 
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative ModelsMLReview
 
Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중
Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중
Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중datasciencekorea
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Universitat Politècnica de Catalunya
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Gaurav Mittal
 
Deep Generative Models
Deep Generative Models Deep Generative Models
Deep Generative Models Chia-Wen Cheng
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Universitat Politècnica de Catalunya
 
#4 Convolutional Neural Networks for Natural Language Processing
#4 Convolutional Neural Networks for Natural Language Processing#4 Convolutional Neural Networks for Natural Language Processing
#4 Convolutional Neural Networks for Natural Language ProcessingBerlin Language Technology
 
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
Convolutional Neural Network
Convolutional Neural NetworkConvolutional Neural Network
Convolutional Neural NetworkJunho Cho
 
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaUniversitat Politècnica de Catalunya
 
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018Universitat Politècnica de Catalunya
 

Was ist angesagt? (20)

Neural Architectures for Video Encoding
Neural Architectures for Video EncodingNeural Architectures for Video Encoding
Neural Architectures for Video Encoding
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
 
The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021
 
Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks
Temporal Activity Detection in Untrimmed Videos with Recurrent Neural NetworksTemporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks
Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative Models
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중
Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중
Deep Learning - 인공지능 기계학습의 새로운 트랜드 :김인중
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)
 
Deep Generative Models
Deep Generative Models Deep Generative Models
Deep Generative Models
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
 
The Perceptron - Xavier Giro-i-Nieto - UPC Barcelona 2018
The Perceptron - Xavier Giro-i-Nieto - UPC Barcelona 2018The Perceptron - Xavier Giro-i-Nieto - UPC Barcelona 2018
The Perceptron - Xavier Giro-i-Nieto - UPC Barcelona 2018
 
#4 Convolutional Neural Networks for Natural Language Processing
#4 Convolutional Neural Networks for Natural Language Processing#4 Convolutional Neural Networks for Natural Language Processing
#4 Convolutional Neural Networks for Natural Language Processing
 
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
 
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
 
Convolutional Neural Network
Convolutional Neural NetworkConvolutional Neural Network
Convolutional Neural Network
 
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
 
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
 

Andere mochten auch

Multimodal Residual Learning for Visual Question-Answering
Multimodal Residual Learning for Visual Question-AnsweringMultimodal Residual Learning for Visual Question-Answering
Multimodal Residual Learning for Visual Question-AnsweringNAVER D2
 
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Universitat Politècnica de Catalunya
 
Multi-modal embeddings: from discriminative to generative models and creative ai
Multi-modal embeddings: from discriminative to generative models and creative aiMulti-modal embeddings: from discriminative to generative models and creative ai
Multi-modal embeddings: from discriminative to generative models and creative aiRoelof Pieters
 
Lecture Note
Lecture NoteLecture Note
Lecture Notebutest
 
Eddl5131 assignment 1 march2013
Eddl5131 assignment 1 march2013Eddl5131 assignment 1 march2013
Eddl5131 assignment 1 march2013gmorong
 
REPRESENTATION LEARNING FOR STATE APPROXIMATION IN PLATFORM GAMES
REPRESENTATION LEARNING FOR STATE APPROXIMATION IN PLATFORM GAMESREPRESENTATION LEARNING FOR STATE APPROXIMATION IN PLATFORM GAMES
REPRESENTATION LEARNING FOR STATE APPROXIMATION IN PLATFORM GAMESRamnandan Krishnamurthy
 
Deep Learning for industrial Prognostics & Health Management (PHM)
Deep Learning for industrial Prognostics & Health Management (PHM) Deep Learning for industrial Prognostics & Health Management (PHM)
Deep Learning for industrial Prognostics & Health Management (PHM) Michael Giering
 
introduce to Multimodal Deep Learning for Robust RGB-D Object Recognition
introduce to Multimodal Deep Learning for Robust RGB-D Object Recognitionintroduce to Multimodal Deep Learning for Robust RGB-D Object Recognition
introduce to Multimodal Deep Learning for Robust RGB-D Object RecognitionWEBFARMER. ltd.
 
Multimodal deep learning
Multimodal deep learningMultimodal deep learning
Multimodal deep learninghoai_ln
 
Introduction to un supervised learning
Introduction to un supervised learningIntroduction to un supervised learning
Introduction to un supervised learningRishikesh .
 
Deep Learning Primer - a brief introduction
Deep Learning Primer - a brief introductionDeep Learning Primer - a brief introduction
Deep Learning Primer - a brief introductionananth
 
Multimodal man machine interaction
Multimodal man machine interactionMultimodal man machine interaction
Multimodal man machine interactionDr. Rajesh P Barnwal
 
Procedural modeling using autoencoder networks
Procedural modeling using autoencoder networksProcedural modeling using autoencoder networks
Procedural modeling using autoencoder networksShuhei Iitsuka
 
Multimedia data mining using deep learning
Multimedia data mining using deep learningMultimedia data mining using deep learning
Multimedia data mining using deep learningPeter Wlodarczak
 
Unsupervised Feature Learning
Unsupervised Feature LearningUnsupervised Feature Learning
Unsupervised Feature LearningAmgad Muhammad
 
Variational autoencoder talk
Variational autoencoder talkVariational autoencoder talk
Variational autoencoder talkShai Harel
 

Andere mochten auch (20)

Multimodal Residual Learning for Visual Question-Answering
Multimodal Residual Learning for Visual Question-AnsweringMultimodal Residual Learning for Visual Question-Answering
Multimodal Residual Learning for Visual Question-Answering
 
presentation
presentationpresentation
presentation
 
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
 
Multi-modal embeddings: from discriminative to generative models and creative ai
Multi-modal embeddings: from discriminative to generative models and creative aiMulti-modal embeddings: from discriminative to generative models and creative ai
Multi-modal embeddings: from discriminative to generative models and creative ai
 
Lecture Note
Lecture NoteLecture Note
Lecture Note
 
Eddl5131 assignment 1 march2013
Eddl5131 assignment 1 march2013Eddl5131 assignment 1 march2013
Eddl5131 assignment 1 march2013
 
ECML-2015 Presentation
ECML-2015 PresentationECML-2015 Presentation
ECML-2015 Presentation
 
REPRESENTATION LEARNING FOR STATE APPROXIMATION IN PLATFORM GAMES
REPRESENTATION LEARNING FOR STATE APPROXIMATION IN PLATFORM GAMESREPRESENTATION LEARNING FOR STATE APPROXIMATION IN PLATFORM GAMES
REPRESENTATION LEARNING FOR STATE APPROXIMATION IN PLATFORM GAMES
 
Deep Learning for industrial Prognostics & Health Management (PHM)
Deep Learning for industrial Prognostics & Health Management (PHM) Deep Learning for industrial Prognostics & Health Management (PHM)
Deep Learning for industrial Prognostics & Health Management (PHM)
 
introduce to Multimodal Deep Learning for Robust RGB-D Object Recognition
introduce to Multimodal Deep Learning for Robust RGB-D Object Recognitionintroduce to Multimodal Deep Learning for Robust RGB-D Object Recognition
introduce to Multimodal Deep Learning for Robust RGB-D Object Recognition
 
Multimodal deep learning
Multimodal deep learningMultimodal deep learning
Multimodal deep learning
 
Introduction to un supervised learning
Introduction to un supervised learningIntroduction to un supervised learning
Introduction to un supervised learning
 
CBIR by deep learning
CBIR by deep learningCBIR by deep learning
CBIR by deep learning
 
Deep Learning Primer - a brief introduction
Deep Learning Primer - a brief introductionDeep Learning Primer - a brief introduction
Deep Learning Primer - a brief introduction
 
Multimodal man machine interaction
Multimodal man machine interactionMultimodal man machine interaction
Multimodal man machine interaction
 
Neural Networks and Deep Learning
Neural Networks and Deep LearningNeural Networks and Deep Learning
Neural Networks and Deep Learning
 
Procedural modeling using autoencoder networks
Procedural modeling using autoencoder networksProcedural modeling using autoencoder networks
Procedural modeling using autoencoder networks
 
Multimedia data mining using deep learning
Multimedia data mining using deep learningMultimedia data mining using deep learning
Multimedia data mining using deep learning
 
Unsupervised Feature Learning
Unsupervised Feature LearningUnsupervised Feature Learning
Unsupervised Feature Learning
 
Variational autoencoder talk
Variational autoencoder talkVariational autoencoder talk
Variational autoencoder talk
 

Ähnlich wie Deep Neural Networks for Multimodal Learning

Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Universitat Politècnica de Catalunya
 
Deep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & FutureDeep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & FutureRouyun Pan
 
Real Time Sign Language Recognition Using Deep Learning
Real Time Sign Language Recognition Using Deep LearningReal Time Sign Language Recognition Using Deep Learning
Real Time Sign Language Recognition Using Deep LearningIRJET Journal
 
Image Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine LearningImage Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine Learningijtsrd
 
Machine Learning approaches at video compression
Machine Learning approaches at video compression Machine Learning approaches at video compression
Machine Learning approaches at video compression Roberto Iacoviello
 
Deep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistryDeep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistryKenta Oono
 
Research and activity report
Research and activity reportResearch and activity report
Research and activity reportMarco Cagnazzo
 
Video Manifold Feature Extraction Based on ISOMAP
Video Manifold Feature Extraction Based on ISOMAPVideo Manifold Feature Extraction Based on ISOMAP
Video Manifold Feature Extraction Based on ISOMAPinventionjournals
 
On the Influence Propagation of Web Videos
On the Influence Propagation of Web VideosOn the Influence Propagation of Web Videos
On the Influence Propagation of Web Videosabidhavp
 
IRJET - Visual Question Answering – Implementation using Keras
IRJET -  	  Visual Question Answering – Implementation using KerasIRJET -  	  Visual Question Answering – Implementation using Keras
IRJET - Visual Question Answering – Implementation using KerasIRJET Journal
 
Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)Ha Phuong
 
IMAGE CAPTION GENERATOR.pptx1.pptxxxxxxxxxx
IMAGE CAPTION GENERATOR.pptx1.pptxxxxxxxxxxIMAGE CAPTION GENERATOR.pptx1.pptxxxxxxxxxx
IMAGE CAPTION GENERATOR.pptx1.pptxxxxxxxxxxAtharvaTanawade
 
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...INFOGAIN PUBLICATION
 
IRJET-Image Question Answering: A Review
IRJET-Image Question Answering: A ReviewIRJET-Image Question Answering: A Review
IRJET-Image Question Answering: A ReviewIRJET Journal
 
Comparison Between Levenberg-Marquardt And Scaled Conjugate Gradient Training...
Comparison Between Levenberg-Marquardt And Scaled Conjugate Gradient Training...Comparison Between Levenberg-Marquardt And Scaled Conjugate Gradient Training...
Comparison Between Levenberg-Marquardt And Scaled Conjugate Gradient Training...CSCJournals
 
Automated Image Captioning – Model Based on CNN – GRU Architecture
Automated Image Captioning – Model Based on CNN – GRU ArchitectureAutomated Image Captioning – Model Based on CNN – GRU Architecture
Automated Image Captioning – Model Based on CNN – GRU ArchitectureIRJET Journal
 

Ähnlich wie Deep Neural Networks for Multimodal Learning (20)

Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
 
med_poster_spie
med_poster_spiemed_poster_spie
med_poster_spie
 
Use CNN for Sequence Modeling
Use CNN for Sequence ModelingUse CNN for Sequence Modeling
Use CNN for Sequence Modeling
 
Deep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & FutureDeep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & Future
 
Real Time Sign Language Recognition Using Deep Learning
Real Time Sign Language Recognition Using Deep LearningReal Time Sign Language Recognition Using Deep Learning
Real Time Sign Language Recognition Using Deep Learning
 
Image Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine LearningImage Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine Learning
 
Machine Learning approaches at video compression
Machine Learning approaches at video compression Machine Learning approaches at video compression
Machine Learning approaches at video compression
 
Deep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistryDeep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistry
 
Research and activity report
Research and activity reportResearch and activity report
Research and activity report
 
Video Manifold Feature Extraction Based on ISOMAP
Video Manifold Feature Extraction Based on ISOMAPVideo Manifold Feature Extraction Based on ISOMAP
Video Manifold Feature Extraction Based on ISOMAP
 
On the Influence Propagation of Web Videos
On the Influence Propagation of Web VideosOn the Influence Propagation of Web Videos
On the Influence Propagation of Web Videos
 
IRJET - Visual Question Answering – Implementation using Keras
IRJET -  	  Visual Question Answering – Implementation using KerasIRJET -  	  Visual Question Answering – Implementation using Keras
IRJET - Visual Question Answering – Implementation using Keras
 
Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018
Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018
Multiple Object Tracking - Laura Leal-Taixe - UPC Barcelona 2018
 
Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)
 
IPT.pdf
IPT.pdfIPT.pdf
IPT.pdf
 
IMAGE CAPTION GENERATOR.pptx1.pptxxxxxxxxxx
IMAGE CAPTION GENERATOR.pptx1.pptxxxxxxxxxxIMAGE CAPTION GENERATOR.pptx1.pptxxxxxxxxxx
IMAGE CAPTION GENERATOR.pptx1.pptxxxxxxxxxx
 
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...
 
IRJET-Image Question Answering: A Review
IRJET-Image Question Answering: A ReviewIRJET-Image Question Answering: A Review
IRJET-Image Question Answering: A Review
 
Comparison Between Levenberg-Marquardt And Scaled Conjugate Gradient Training...
Comparison Between Levenberg-Marquardt And Scaled Conjugate Gradient Training...Comparison Between Levenberg-Marquardt And Scaled Conjugate Gradient Training...
Comparison Between Levenberg-Marquardt And Scaled Conjugate Gradient Training...
 
Automated Image Captioning – Model Based on CNN – GRU Architecture
Automated Image Captioning – Model Based on CNN – GRU ArchitectureAutomated Image Captioning – Model Based on CNN – GRU Architecture
Automated Image Captioning – Model Based on CNN – GRU Architecture
 

Deep Neural Networks for Multimodal Learning

  • 1. Deep Neural Networks for Multimodal Learning
      Presented by: Marc Bolaños
      Álvaro Peris, Francisco Casacuberta, Marc Bolaños, Petia Radeva
  • 4. Multimodal Tasks
      Tasks: Video Description, Multimodal Translation, Multimodal Description, Visual Question Answering, Dense Captioning, Image Description.
      Example captions for one image:
      ● Two young guys with shaggy hair look at their hands while hanging out in the yard.
      ● Two young, White males are outside near many bushes.
      ● Two men in green shirts are standing in a yard.
      ● A man in a blue shirt standing in a garden.
      ● Two friends enjoy time spent together.
      ● Dos hombres están en el jardín. (Spanish: "Two men are in the garden.")
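  • 8. Multimodal Tasks - Basic NN Components
      ● Long Short-Term Memory (LSTM) [figure: a chain of LSTM units reads "where is the giraffe" and a second chain emits "dónde está la jirafa <EOS>"]
      ● Convolutional Neural Network (CNN)
      ● Attention Mechanism [figure: between LSTM steps t-1 and t, the vectors $v_1, v_2, \ldots, v_n$ are weighted by $\alpha_1^{(t)}, \alpha_2^{(t)}, \ldots, \alpha_n^{(t)}$ and combined as $\sum_{i=1}^{n} \alpha_i^{(t)} v_i$]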
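The weighted sum in the attention figure can be illustrated with a minimal sketch. It assumes the weights come from a softmax over per-vector relevance scores; the slide does not say how the scores are computed, so the `scores` list and the helper names `softmax` and `attend` are hypothetical.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Return (alphas, context) where context = sum_i alpha_i * v_i."""
    alphas = softmax(scores)
    dim = len(features[0])
    context = [sum(a * v[d] for a, v in zip(alphas, features)) for d in range(dim)]
    return alphas, context

# Three toy vectors v_1..v_3 and hypothetical relevance scores at step t.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, 0.0]
alphas, context = attend(features, scores)
```

In a full model the scores would typically depend on the previous LSTM state, so the weights change at every step t.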
  • 9. [Task 1] Video Description
      Microsoft Video Description (MSVD) Dataset:
      ● 1,970 open-domain clips collected from YouTube.
      ● Annotated using a crowdsourcing platform.
      ● Variable number of captions per video.
      ● 80,000 different video-caption pairs.
      Example captions: "Two men working on a high building"; "Two teams are playing soccer".
      Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney R, Saenko K. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729. 2014 Dec 15.
  • 10. [Task 1] Video Description - Model
      [figure: each video frame is passed through a CNN; an ENCODER made of forward and backward LSTMs over positions j = 1, ..., J reads the frame features; a SOFT ATTENTION MODEL feeds a DECODER LSTM over steps t = 1, ..., T, which outputs "Two elephants ... water"]
      Sentence generation: $\arg\max_{y} P(y \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_J)$
      Álvaro Peris, Marc Bolaños, Petia Radeva, and Francisco Casacuberta. "Video Description using Bidirectional Recurrent Neural Networks." In Proceedings of the International Conference on Artificial Neural Networks (ICANN) (IN PRESS) (2016)
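The sentence generation rule on this slide picks, at each step, the word that maximizes the conditional probability given the words generated so far and the input features. A minimal sketch of that loop follows. It assumes plain greedy search (the slides do not say whether greedy or beam search was used), and `TOY_TABLE` and `toy_next_word_probs` are hypothetical stand-ins for a trained decoder.

```python
def greedy_decode(next_word_probs, inputs, max_len=10, eos="<EOS>"):
    """Pick the most probable word at each step until <EOS> or max_len."""
    prefix = []
    for _ in range(max_len):
        probs = next_word_probs(prefix, inputs)  # P(y | y_1..y_{t-1}, x_1..x_J)
        word = max(probs, key=probs.get)         # argmax over the vocabulary
        if word == eos:
            break
        prefix.append(word)
    return prefix

# Hypothetical stand-in for the trained decoder: a fixed lookup table.
TOY_TABLE = {
    (): {"two": 0.7, "water": 0.2, "<EOS>": 0.1},
    ("two",): {"elephants": 0.8, "two": 0.1, "<EOS>": 0.1},
    ("two", "elephants"): {"<EOS>": 0.6, "water": 0.4},
}

def toy_next_word_probs(prefix, inputs):
    # The toy model ignores the inputs (the frame features).
    return TOY_TABLE.get(tuple(prefix), {"<EOS>": 1.0})

caption = greedy_decode(toy_next_word_probs, inputs=None)
```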
  • 11. [Task 1] Video Description - Results
      ● Bidirectional temporal mechanism (BLSTM): extracts information in both past-to-future and future-to-past directions.
      ● Attention mechanisms: helpful when generating the sentence step by step.
      Future work:
      ● CNNs at a higher and temporal level (3D CNNs).
      * Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision 2015 (pp. 4507-4515).
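The bidirectional idea can be sketched with a toy recurrent cell standing in for the LSTM. This is only an illustration under simplifying assumptions: the cell is a scalar tanh unit, both directions share the hypothetical weights `w_in` and `w_rec` (a real BLSTM learns separate parameters per direction), and the frame features are made-up numbers.

```python
import math

def run_rnn(xs, w_in=0.5, w_rec=0.5):
    """Toy scalar recurrent cell: h_j = tanh(w_in * x_j + w_rec * h_{j-1})."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional(xs):
    """Pair each position's past-to-future state with its future-to-past state."""
    forward = run_rnn(xs)
    backward = run_rnn(xs[::-1])[::-1]  # run on the reversed sequence, then realign
    return [(f, b) for f, b in zip(forward, backward)]

frames = [1.0, 0.0, -1.0, 2.0]  # hypothetical per-frame features (scalars)
states = bidirectional(frames)
```

The forward state at the first position has seen only the first frame, while its backward state summarizes every later frame, which is why the pair carries more context than either direction alone.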
  • 12. [Task 2] Visual Question Answering
      VQA Dataset, open-ended question answering task:
      ● 200,000 images
      ● 3 questions per image
      ● 10 (SHORT) answers per question annotated by different users
      Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence Zitnick C, Parikh D. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision 2015 (pp. 2425-2433).
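Because each question has 10 human answers, a prediction can be scored by its agreement with the annotators. The sketch below follows the accuracy rule described in the cited VQA paper, min(number of annotators who gave that answer / 3, 1). The slides themselves do not state the evaluation protocol, so treating this as the metric behind the reported numbers is an assumption, and the sketch skips the answer-string normalization that an official evaluation would apply. The function name `vqa_accuracy` and the example answers are hypothetical.

```python
def vqa_accuracy(predicted, human_answers):
    """Score one prediction: full credit if at least 3 annotators agree with it."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Ten hypothetical annotator answers to "What kind of store is this?"
answers = ["bakery"] * 8 + ["shop"] * 2

score_bakery = vqa_accuracy("bakery", answers)  # majority answer
score_shop = vqa_accuracy("shop", answers)      # minority answer, partial credit
score_wrong = vqa_accuracy("48", answers)       # given by no annotator
```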
  • 13. [Task 2] VIBIKNet for VQA
      [figure: the question words "where is the giraffe" pass through a text embedding (GloVe initialization) into a Bidirectional LSTM (LSTM forward and LSTM backward chains); the image passes through KCNN (L2 norm) into a visual embedding; the branches are combined by vector concatenation [ , ] and element-wise summation; a SOFTMAX outputs the answer "behind fence"]
      Answer generation: $\arg\max_{a} P(a \mid q_1, \ldots, q_n, x)$
      Marc Bolaños, Álvaro Peris, Francisco Casacuberta and Petia Radeva. "VIBIKNet: Visual Bidirectional Kernelized Network for Visual Question Answering" Challenge on Visual Question Answering CVPR (no proceedings) (2016)
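One way to read the diagram is sketched below: the forward and backward question states are concatenated, the L2-normalized KCNN features are projected to the same size, both are summed element-wise, and a softmax over candidate answers selects the output. The exact wiring is an assumption based on the figure labels, not a reproduction of the released VIBIKNet code, and all weights, vectors, and helper names here are hypothetical toy values.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def linear(weights, v):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def answer(fwd_state, bwd_state, kcnn_features, w_visual, w_out, answers):
    question = fwd_state + bwd_state                        # vector concatenation
    visual = linear(w_visual, l2_normalize(kcnn_features))  # visual embedding
    fused = [q + v for q, v in zip(question, visual)]       # element-wise summation
    probs = softmax(linear(w_out, fused))                   # distribution over answers
    best = max(range(len(answers)), key=probs.__getitem__)  # argmax_a P(a | q, x)
    return answers[best], probs

# Hypothetical toy values, chosen only so the example runs.
candidates = ["behind fence", "bakery", "48"]
w_visual = [[1, 0], [0, 1], [1, 1], [0, 0]]
w_out = [[1, 1, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1]]
best_answer, probs = answer([0.2, -0.1], [0.4, 0.3], [3.0, 4.0],
                            w_visual, w_out, candidates)
```

Treating answering as classification over a fixed answer list matches the argmax over a in the slide.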
  • 14. [Task 2] VIBIKNet for VQA - Results
  • 15. [Task 2] VIBIKNet for VQA - Results
      Accuracy [%]        dev 2014                           test 2015
      Model               Yes/No  Number  Other  Overall     Yes/No  Number  Other  Overall
      LSTM                79.00   38.16   33.68  52.88       -       -       -      -
      BLSTM               79.13   38.26   33.52  52.96       78.30   38.88   38.97  54.86
      BLSTM train+dev     -       -       -      -           78.88   36.33   40.27  56.1
      ● Classification models work better than generative models on datasets with simple answers.
      ● Models for compacting and jointly describing the information (KCNN) present in the images seem promising.
      ● The use of pre-trained but adaptable representations is crucial for small and medium-sized datasets.
  • 16. [Task 3] Image Description
      [figure: an image passes through a CNN into a chain of LSTM units that generates "A ... rally road"]
      ● Image Description formulated as a translation problem.
      Sentence generation: $\arg\max_{y} P(y \mid y_1, \ldots, y_{t-1}, x)$
      Basic initial tests on Flickr8k obtained BLEU = 20.2%.
  • 17. [Task 4] Multimodal Translation
      ● Translation problem aided by image information.
      [figure: the source words "Two elephants ... water" are embedded and read by a BLSTM ENCODER, giving $z_1, z_2, \ldots, z_J$; the image passes through KCNN; vectors are joined by concatenation [ , ]; a SOFT ATTENTION MODEL feeds a DECODER LSTM over steps t = 1, ..., T, which outputs "Dos elefantes ... agua"]
      Sentence translation: $\arg\max_{y} P(y \mid y_1, \ldots, y_{t-1}, x, z_1, \ldots, z_J)$
      Basic initial tests on the Flickr30k ACLTask1 Challenge obtained METEOR = 41.2%.
  • 18. Future Directions
      We are working on adding several state-of-the-art architectures and ideas:
      ● Highway Networks: Srivastava RK, Greff K, Schmidhuber J. Highway networks. arXiv preprint arXiv:1505.00387. 2015 May 3.
      ● Compact Bilinear Pooling: Gao Y, Beijbom O, Zhang N, Darrell T. Compact bilinear pooling. arXiv preprint arXiv:1511.06062. 2015 Nov 19.
      ● Class Activation Maps: Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning Deep Features for Discriminative Localization. arXiv preprint arXiv:1512.04150. 2015 Dec 14.
  • 19. Summary
      Collaboration supported by the R-MIPRCV:
      ● Stay of Marc Bolaños (CVC-UB) at UPV, 2015.
      ● Stay of Álvaro Peris (UPV) at UB, 2016.
      ● To be extended with the incorporation of UGR (stay of a PhD student, October 2016).
      Publications and challenges:
      ● ICANN'2016
      ● CVPR'2016
      Download VIBIKNet: www.github.com/MarcBS/VIBIKNet
  • 22. Multimodal Tasks - Basic NN Components: Convolutional Neural Network (CNN)
  • 23. Multimodal Tasks - Basic NN Components: Long Short-Term Memory (LSTM)
      [figure: a chain of LSTM units reads "where is the giraffe" and a second chain emits "dónde está la jirafa <EOS>"]
  • 24. Multimodal Tasks - Basic NN Components: Attention Mechanism
      [figure: between $\mathrm{LSTM}_{t-1}$ and $\mathrm{LSTM}_{t}$, the vectors $v_1, v_2, \ldots, v_n$ are weighted by $\alpha_1^{(t)}, \alpha_2^{(t)}, \ldots, \alpha_n^{(t)}$]
  • 25. [Task 2] VIBIKNet for VQA - Kernelized CNN
      [figure: an Object Detector selects image regions; each region is passed through GoogLeNet to obtain a CNN feature vector, followed by PCA; the region features are aggregated with a Gaussian Mixture Model (Fisher Vectors) and a final PCA into the KCNN feature vector]
      Liu Z. Kernelized Deep Convolutional Neural Network for Describing Complex Images. arXiv preprint arXiv:1509.04581. 2015 Sep 15.
  • 26. [Task 4] Multimodal Translation
      ● Translation problem aided by image information.
      [figure: the source words "Two elephants ... water" are embedded and read by an ENCODER made of forward and backward LSTMs over positions j = 1, ..., J, giving $z_1, z_2, \ldots, z_J$; the image passes through KCNN; vectors are joined by concatenation [ , ]; a SOFT ATTENTION MODEL feeds a DECODER LSTM over steps t = 1, ..., T, which outputs "Dos elefantes ... agua"]
      Sentence translation: $\arg\max_{y} P(y \mid y_1, \ldots, y_{t-1}, x, z_1, \ldots, z_J)$
      Basic initial tests on the Flickr30k ACLTask1 Challenge obtained BLEU = 20.2.
  • 27. Deep Neural Networks for Multimodal Learning
      Presented by: Marc Bolaños
      [figure: the question "where is the giraffe" enters a BLSTM, the image enters a CNN, and an LSTM outputs "behind the fence"]