Multimodal Affect Recognition at utterance-level with spatio-temporal
feature fusion by using Face, Audio, Text [, and Body] features
Carlos Toxtli
Index
● Basic concepts
● Architecture
● Experiments
● Results
● Conclusions
● Next steps
Paper
Multimodal Utterance-level Affect Analysis using Visual, Audio and Text
Features
Didan Deng, Yuqian Zhou, Jimin Pi, Bertram E. Shi
Department of Electronic and Computer Engineering, Hong Kong University of
Science and Technology
IFP, Beckman, University of Illinois at Urbana-Champaign
Highest score in the “Visual + Audio + Text” category of the OMG Emotion
Challenge 2018
Long-term (spatio-temporal) emotion recognition
● The integration of information across multiple modalities and across time
is a promising way to enhance the emotion recognition performance of
affective systems.
● Much previous work has focused on instantaneous emotion recognition.
● This work addresses long-term emotion recognition by integrating cues
from multiple modalities.
● Since emotions normally change gradually under the same context,
analyzing long-term dependency of emotions will stabilize the overall
predictions.
Utterance level
● Spoken statement
● It is a continuous piece of speech beginning and ending with a clear
pause.
● Utterances do not exist in written language
● This word does not exist in some languages.
Multimodal
● Humans perceive others’ emotional states by combining information
across multiple modalities simultaneously.
● Intuitively, a multi-modal inference network should be able to leverage
information from each modality and their correlations to improve recognition
over that achievable by a single modality network.
● This work uses multiple modalities including facial expression, audio and
language.
● The paper describes a multi-modal neural architecture that integrates visual
information over time using LSTMs, and combines it with utterance level
audio and text cues to recognize human sentiment from multimodal clips.
Affect (dimensional) vs Emotion (discrete) recognition
● Dimensional models aim to avoid the restrictiveness of discrete states,
and allow more flexible definition of affective states as points in a multi-
dimensional space spanned by concepts such as affect intensity and
positivity.
● For affect recognition, the dimensional space is commonly operationalized as
a regression task.
● The most commonly used dimensional model is Russell’s circumplex model,
which consists of two dimensions: valence and arousal.
Affect and emotion
Database - OMG - One Minute Gradual-Emotion
10 hours of data
497 videos
6422 utterances
Annotations:
Arousal: 0 Calm to +1 Alert
Valence: -1 Negative to +1 Positive
Emotions: "Anger","Disgust","Fear","Happy","Neutral","Sad","Surprise"
Video example, What emotion is represented?
Options: "Anger","Disgust","Fear","Happy","Neutral","Sad","Surprise"
Arousal? Valence? (numbers between -1 and 1)
https://youtu.be/EWRTue-AeSo
Result
Arousal: 0.3730994852
Valence: 0.2109641637
Emotion: "Surprise"
Face features
OpenFace (709 features): A facial behavior analysis tool that provides
accurate facial landmark detection, head pose estimation, facial action
unit recognition, and eye-gaze estimation. It yields points that
represent the face.
VGG16 FC6 (4096 features): The faces are cropped (224×224×3) and aligned,
the background is zeroed out, and the result is passed through a
pretrained VGG16 to take a 4096-dimensional feature vector from the
FC6 layer.
Audio features
OpenSMILE (1582 features): The audio is extracted from the videos and
processed by OpenSMILE, which extracts audio features such as
loudness, pitch, and jitter.
Text features
Opinion Lexicon (6 features): Based on the ratio of sentiment words
(adjectives, adverbs, verbs and nouns) that express positive or
negative sentiment.
Subjectivity Lexicon (4 features): The authors used the Subjectivity
Lexicon from MPQA (Multi-Perspective Question Answering), which models
sentiment by its type and intensity.
Feature fusion
The features from the same source were normalized and fused, giving
the following feature sizes:
Face fused (4096 + 709 = 4805 features)
Word fused (6 + 4 = 10 features)
Audio features came only from OpenSMILE, so they were not fused
(1582 features)
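The same-source fusion above can be sketched in plain Python. This is only an illustration: the slides do not say which normalization was used, so the per-vector z-score in `zscore` is an assumption, and `zscore`, `fuse`, and the placeholder vectors are hypothetical names and values.

```python
import math

def zscore(vec):
    """Normalize a vector to zero mean and unit variance (assumed scheme)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for constant vectors
    return [(v - mean) / std for v in vec]

def fuse(*feature_vectors):
    """Normalize each vector, then concatenate them into one fused vector."""
    fused = []
    for vec in feature_vectors:
        fused.extend(zscore(vec))
    return fused

# Placeholder vectors with the sizes from the slides (the values are made up).
vgg_fc6 = [float(i % 7) for i in range(4096)]
openface = [float(i % 5) for i in range(709)]
face_fused = fuse(vgg_fc6, openface)
print(len(face_fused))  # 4805
```

In practice the statistics would more likely be computed per feature across the training set; the per-vector version only keeps the sketch short.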
Early fusion
For early fusion, features from different modalities are projected into the same
joint feature space before being fed into the classifier.
Early fusion
Late fusion
For late fusion, classifications are made on each modality and their decisions or
predictions are later merged together.
Late fusion
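The difference between the two strategies can be sketched as follows. This is a simplified illustration, not the architecture from the paper: the joint space is approximated by plain concatenation, the predictor is a hypothetical linear function, late fusion merges by averaging (an assumption), and all weights and feature values are made up.

```python
def linear(weights, bias, features):
    """Stand-in predictor: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def early_fusion(modalities, weights, bias):
    """Concatenate all modality features, then apply a single predictor."""
    joint = [x for features in modalities for x in features]
    return linear(weights, bias, joint)

def late_fusion(modalities, per_modality_params):
    """Predict from each modality separately, then average the predictions."""
    predictions = [
        linear(weights, bias, features)
        for features, (weights, bias) in zip(modalities, per_modality_params)
    ]
    return sum(predictions) / len(predictions)

face, audio, text = [0.2, 0.4], [0.1], [0.5, 0.3]
early = early_fusion([face, audio, text], weights=[1.0] * 5, bias=0.0)
late = late_fusion(
    [face, audio, text],
    [([1.0, 1.0], 0.0), ([1.0], 0.0), ([1.0, 1.0], 0.0)],
)
```

Early fusion lets one model see correlations between modalities, while late fusion keeps each modality's model independent and only combines their outputs.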
LSTM - Long Short-Term Memory
An LSTM network is a recurrent
neural network that models time-
or sequence-dependent behaviour.
This is done by feeding back
the output of a neural network layer
at time t to the input of the same
network layer at time t + 1.
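The feedback loop can be made concrete with a scalar LSTM cell in plain Python. This is a minimal sketch using the standard LSTM gate equations; the parameter values and the names `lstm_step`, `run_lstm`, and `params` are illustrative assumptions, not taken from the authors' model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate names to (w_x, w_h, b)."""
    def gate(name, activation):
        w_x, w_h, b = p[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # candidate cell content
    o = gate("output", sigmoid)        # how much of the cell state to expose
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

def run_lstm(sequence, p):
    """Process a sequence, feeding the output at time t back in at t + 1."""
    h, c = 0.0, 0.0
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs

gate_names = ("forget", "input", "candidate", "output")
params = {name: (0.5, 0.5, 0.0) for name in gate_names}  # made-up weights
hidden_states = run_lstm([1.0, 0.5, -0.5], params)
```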
Metrics - Concordance Correlation Coefficients
CCC is an index of how well a new test or measurement (Y) reproduces a gold
standard test or measurement (X). It quantifies the agreement between these
two measures of the same variable. Like a correlation, ρc ranges from -1 to 1,
with perfect agreement at 1.
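ρc = 2ρσxσy / (σx² + σy² + (μx − μy)²), where μx and μy are the means,
σx² and σy² the variances, and ρ the correlation coefficient between the
two variables.
As a fine-tuning step they also used 1 - ρc as the loss function instead of MSE.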
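A small sketch of the metric, using the standard definition 2·cov(x, y) / (σx² + σy² + (μx − μy)²) with population (divide-by-n) statistics. The function names `ccc` and `ccc_loss` are illustrative, and this is not the authors' implementation.

```python
def ccc(x, y):
    """Concordance correlation coefficient of two equal-length sequences.

    Undefined (division by zero) when both inputs are constant and equal.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)

def ccc_loss(x, y):
    """Loss that is 0 at perfect agreement and grows as agreement drops."""
    return 1.0 - ccc(x, y)

gold = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]  # perfectly correlated with gold, but offset by 1
```

Unlike a plain correlation, CCC penalizes the offset: `ccc(gold, shifted)` is 4/7 even though the Pearson correlation of the two sequences is 1.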
Metrics - Accuracy and F1-score
Accuracy: percentage of correct predictions from all predictions made
F1-Score: conveys the balance between the precision and the recall
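Both metrics can be computed by hand as below. F1 is the harmonic mean of precision and recall; the slides do not say how per-class scores were averaged across the seven emotions, so this sketch only computes F1 for one class at a time, and the labels are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1_for_class(y_true, y_pred, label):
    """Harmonic mean of precision and recall for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = ["Happy", "Sad", "Happy", "Neutral"]
y_pred = ["Happy", "Happy", "Happy", "Neutral"]
```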
Limitations
The dataset was designed to be downloaded from YouTube.
Of the 497 videos, 111 were unavailable.
I trained with limited data, and the results differed from the ones that were
reported.
Their results
Results
                         CCC Arousal  CCC Valence  Accuracy  F1-score
Reported in their paper  0.400        0.353        -         -
Contest evaluation       0.359        0.276        -         -
My local environment     0.210        0.257        0.434     0.362
Mixed features
               CCC Arousal    CCC Arousal   CCC Valence    CCC Valence   Accuracy    F1-score
               (their value)  (my machine)  (their value)  (my machine)  (my value)  (my value)
Face Visual    0.109          0.075         0.237          0.193         0.405       0.396
Face Feature   0.046          0.007         0.080          0.012         0.204       0.204
Face Fusion    0.175          0.113         0.261          0.149         0.381       0.383
Audio Feature  0.273          0.207         0.266          0.015         0.418       0.420
Text Fusion    0.137          0.107         0.259          0.037         0.259       0.259
Body Features
OpenPose (BODY_25) (11 features): The normalized angles
between the joints. I did not use the calculated features
because they were 25x224x224.
VGG16 FC6 skeleton image (4096 features): I drew the
skeleton on a black background, fed it to a VGG16,
and extracted a feature vector from the FC6 layer.
Quad Model
The proposed model adds body
gesture features, both handcrafted
and deep, as a fused layer and is
evaluated through an LSTM.
My experiments
                         CCC Arousal  CCC Valence  Accuracy  F1-score
Body Feature             0.067        0.013        0.285     0.283
Body Visual              0.077        0.005        0.361     0.350
Body Fusion              0.002        0.049        0.136     0.191
Trimodal + Body Feature  0.267        0.283        0.185     0.274
Trimodal + Body Visual   0.006        0.244        0.411     0.407
Trimodal + Body Fusion   0.026        0.307        0.449     0.451
Experiments
After running 112
experiments with the
combinations of
features we found the
best models for each
metric.
NVIDIA GTX 1080 Ti
Other experiments
                                                                                 CCC Arousal  CCC Valence  Accuracy  F1-score
Fusion_late (Body_feature, Audio_feature)                                        0.272        0.064        0.380     0.380
Fusion_late (Face_fusion, Audio_feature, Word_fusion, Body_fusion, face_fusion)  0.173        0.359        0.411     0.358
Fusion_early (Face_fusion, Audio_feature, Word_fusion, body_fusion)              0.249        0.267        0.451     0.449
Trimodal + Body Fusion                                                           0.026        0.307        0.449     0.451
Final results
                              CCC Arousal  CCC Valence  Accuracy  F1-score
Authors' approach (Trimodal)  0.210        0.257        0.434     0.362
My approach (Quadmodal)       0.249        0.267        0.451     0.449
Mixed models                  0.272        0.359        0.451     0.451
Conclusions
● Multimodal models outperform the baseline methods.
● The results show that cross-modal information benefits the estimation of
long-term affective states.
● Early fusion performed better in general, but for some dimensional metrics
late fusion performed better.
Next steps
● I’m planning to explore 3D convolutions instead of LSTM, try ResNet instead
of VGG16, and use different network models for each feature.
● UPDATE: These are the evaluations on the validation and test datasets.
                CCC Arousal  CCC Valence  Accuracy  F1-score
Trimodal Val    0.298        0.428        0.440     0.455
Trimodal Test   0.180        0.405        0.455     0.455
Quadmodal Val   0.340        0.454        0.445     0.453
Quadmodal Test  0.235        0.413        0.453     0.453
Thanks
Trimodal
Quadmodal architecture
LSTM
Decision layers
The activation functions used for each output were:
Emotion (categorical): Softmax
Valence (dimensional): hyperbolic tangent function (tanh)
Arousal (dimensional): Sigmoid
Sigmoid as activation function
A sigmoid activation function turns an
activation into a value between 0 and
1. It is useful for binary classification
problems and is mostly used in the
final output layer of such problems.
Also, sigmoid activation leads to slow
gradient descent because the slope is
small for high and low values.
Hyperbolic tangent as activation function
A Tanh activation function turns an
activation into a value between -1 and
+1. The outputs are normalized. The
gradient is stronger for tanh than sigmoid
(derivatives are steeper)
SoftMax as activation function
The Softmax function is a
wonderful activation function that
turns numbers aka logits into
probabilities that sum to one.
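The three output activations can be written in a few lines of plain Python. This is a generic sketch of the functions themselves, not the authors' decision layers; the logits are made-up values for the seven emotion classes.

```python
import math

def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Turn a list of logits into probabilities that sum to one."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

arousal = sigmoid(0.0)     # sigmoid output lies in (0, 1)
valence = math.tanh(0.0)   # tanh output lies in (-1, 1)
emotion_probs = softmax([2.0, 1.0, 0.1, 0.0, 0.0, 0.0, 0.0])
predicted_class = emotion_probs.index(max(emotion_probs))
```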
MSE as loss function for linear regression
Linear regression uses mean squared
error as its loss function, which gives a
convex curve, so we can complete the
optimization by finding its vertex,
the global minimum.
SGD as Optimizer
Stochastic gradient descent (SGD)
computes the gradient for each
update using a single training data
point x_i (chosen at random). The
idea is that the gradient calculated
this way is a stochastic approximation
to the gradient calculated using the
entire training data. Each update is
now much faster to calculate than in
batch gradient descent, and over
many updates, we will head in the
same general direction
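A toy example ties MSE and SGD together: fitting y = 2x + 1 with one randomly chosen training point per update. The data, learning rate, step count, and seed are all illustrative assumptions; the slides do not give the training hyperparameters.

```python
import random

def mse(w, b, data):
    """Mean squared error of the line y = w*x + b over the data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def sgd(data, lr=0.05, steps=5000, seed=0):
    """Fit w and b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(data)    # a single randomly chosen training point
        error = w * x + b - y
        w -= lr * 2 * error * x    # gradient of error**2 with respect to w
        b -= lr * 2 * error        # gradient of error**2 with respect to b
    return w, b

# Noiseless points on the line y = 2x + 1, for x in 0.0, 0.1, ..., 1.0.
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(11)]
w, b = sgd(data)
```

Each update looks at one point only, so it is cheap, and over many updates the parameters still move toward the minimum of the full-data MSE.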
Layers
Early fusion - Hidden layer
Early fusion Fully connected
LSTM
Late fusion
1DConv Average Pooling
1D convolutional neural nets can be used for extracting local 1D patches
(subsequences) from sequences and can identify local patterns within the window
of convolution. A pattern learnt at one position can also be recognized at a
different position, making 1D conv nets translation invariant. Some sequences are
so long that they cannot realistically be processed by RNNs. In such cases,
1D conv nets can be used as a pre-processing step to make the sequence smaller
through downsampling by extracting higher-level features, which can then be
passed on to the RNN as input.
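A minimal sketch of the two operations on a toy sequence. The kernel and signal are made up for illustration, and the convolution is the cross-correlation form used by deep learning libraries; this is not the configuration of the model in the slides.

```python
def conv1d(sequence, kernel):
    """Valid 1D convolution (cross-correlation) of a sequence with a kernel."""
    k = len(kernel)
    return [
        sum(kernel[j] * sequence[i + j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]

def average_pool1d(sequence, size):
    """Non-overlapping average pooling; a trailing partial window is dropped."""
    return [
        sum(sequence[i:i + size]) / size
        for i in range(0, len(sequence) - size + 1, size)
    ]

# The same bump (1, 2, 1) appears at two different positions in the signal.
signal = [0, 0, 1, 2, 1, 0, 0, 0, 1, 2, 1, 0]
features = conv1d(signal, kernel=[1, 2, 1])
pooled = average_pool1d(features, size=2)
```

The kernel responds with the same peak value at both occurrences of the bump, and pooling shortens the 12-step signal to 5 steps before it would be handed to a recurrent layer.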
Batch Normalization
We normalize the input layer by
adjusting and scaling the activations
to speed up learning. The same is
done for the values in the hidden
layers, which change all the time.
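The normalization step can be sketched for a single feature across a mini-batch. This shows only the training-time computation; the running averages used at inference time are omitted, and `gamma`, `beta`, and `eps` are the usual learnable scale, shift, and stability constant with assumed default values.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift it."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [10.0, 12.0, 14.0, 16.0]  # made-up activations of one unit
normalized = batch_norm(activations)
```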
VGG16
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 

Multimodal emotion recognition at utterance level with spatio-temporal feature fusion by using face, body, audio, and text features. (1)

  • 1. Multimodal Affect Recognition at utterance-level with spatio-temporal feature fusion by using Face, Audio, Text [,and Body] features Carlos Toxtli
  • 2. Index ● Basic concepts ● Architecture ● Experiments ● Results ● Conclusions ● Next steps
  • 3. Paper Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features Didan Deng, Yuqian Zhou, Jimin Pi, Bertram E. Shi Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology IFP, Beckman, University of Illinois at Urbana-Champaign Highest score in the “Visual + Audio + Text” category of the OMG Emotion Challenge 2018
  • 4. Long-term (spatio-temporal) emotion recognition ● The integration of information across multiple modalities and across time is a promising way to enhance the emotion recognition performance of affective systems. ● Much previous work has focused on instantaneous emotion recognition. ● This work addresses long-term emotion recognition by integrating cues from multiple modalities. ● Since emotions normally change gradually under the same context, analyzing long-term dependency of emotions will stabilize the overall predictions.
  • 5. Utterance level ● Spoken statement ● It is a continuous piece of speech beginning and ending with a clear pause. ● Utterances do not exist in written language ● This word does not exist in some languages.
  • 6. Multimodal ● Humans perceive others’ emotional states by combining information across multiple modalities simultaneously. ● Intuitively, a multi-modal inference network should be able to leverage information from each modality and their correlations to improve recognition over that achievable by a single modality network. ● This work uses multiple modalities including facial expression, audio and language. ● The paper describes a multi-modal neural architecture that integrates visual information over time using LSTMs, and combines it with utterance level audio and text cues to recognize human sentiment from multimodal clips.
  • 7. Affect (dimensional) vs Emotion (discrete) recognition ● Dimensional models aim to avoid the restrictiveness of discrete states, and allow a more flexible definition of affective states as points in a multi-dimensional space spanned by concepts such as affect intensity and positivity. ● For affect recognition, the dimensional space is commonly operationalized as a regression task. ● The most commonly used dimensional model is Russell’s circumplex model, which consists of the two dimensions valence and arousal.
  • 10. Database: OMG (One-Minute Gradual-Emotion) ● 10 hours of data ● 497 videos ● 6422 utterances ● Annotations: Arousal: -1 (calm) to +1 (alert); Valence: -1 (negative) to +1 (positive); Emotions: "Anger", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"
  • 11. Video example: what emotion is represented? Options: "Anger", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise". Arousal? Valence? (numbers between -1 and 1) https://youtu.be/EWRTue-AeSo
  • 13. Face features OpenFace (709 features): A facial behavior analysis tool that provides facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. It gives us points that represent the face. VGG16 FC6 (4096 features): The faces are cropped (224×224×3) and aligned, the background is zeroed out, and the result is passed through a pretrained VGG16 to take a 4096-dimensional feature vector from the FC6 layer.
  • 14. Audio features OpenSMILE (1582 features): The audio is extracted from the videos and processed by OpenSMILE, which extracts audio features such as loudness, pitch, and jitter.
  • 15. Text features Opinion Lexicon (6 features): These depend on the ratio of sentiment words (adjectives, adverbs, verbs and nouns) that express positive or negative sentiment. Subjective Lexicon (4 features): The authors used the subjectivity lexicon from MPQA (Multi-Perspective Question Answering), which models sentiment by its type and intensity.
  • 16. Feature fusion The features from the same source were normalized and fused, giving the following feature sizes: Face fused (4096 + 709 = 4805 features); Word fused (6 + 4 = 10 features). Audio features came only from OpenSMILE, so they were not fused (1582 features).
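
The fusion step on slide 16 can be pictured with a small sketch. The slides do not say which normalization was used, so z-scoring each feature column is an assumption here, and the toy vectors merely stand in for the 4096 VGG16 and 709 OpenFace features. The function names `zscore_columns` and `fuse` are hypothetical.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each feature column to zero mean and unit variance (assumed normalization)."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def fuse(*feature_sets):
    """Normalize each feature set, then concatenate the vectors sample by sample."""
    normalized = [zscore_columns(fs) for fs in feature_sets]
    return [sum(parts, []) for parts in zip(*normalized)]

# Toy stand-ins: 3 samples, 4 "VGG16" features and 2 "OpenFace" features.
vgg = [[0.1, 0.5, 0.2, 0.9], [0.3, 0.1, 0.4, 0.7], [0.2, 0.3, 0.6, 0.8]]
openface = [[10.0, 2.0], [12.0, 4.0], [14.0, 6.0]]

face_fused = fuse(vgg, openface)
print(len(face_fused), len(face_fused[0]))  # 3 samples, 4 + 2 = 6 features each
```

With the real features the same concatenation would give 4096 + 709 = 4805 values per face sample.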
  • 17. Early fusion For early fusion, features from different modalities are projected into the same joint feature space before being fed into the classifier.
  • 19. Late fusion For late fusion, classifications are made on each modality and their decisions or predictions are later merged together.
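
The difference between the two fusion strategies can be sketched as follows. This is only an illustration under stated assumptions: plain linear models stand in for the neural networks, concatenation stands in for the projection into a joint feature space, and late fusion merges predictions by simple averaging, which the slides do not specify. All names and weights are hypothetical.

```python
def linear(weights, bias, x):
    """A linear model standing in for a trained network."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def early_fusion(face, audio, text, weights, bias):
    """Join the modalities into one feature vector, then predict once."""
    joint = face + audio + text
    return linear(weights, bias, joint)

def late_fusion(face, audio, text, models):
    """Predict per modality, then merge the predictions (here: average)."""
    inputs = (face, audio, text)
    preds = [linear(w, b, x) for (w, b), x in zip(models, inputs)]
    return sum(preds) / len(preds)

# Toy feature vectors for one utterance.
face, audio, text = [0.2, 0.4], [0.1], [0.5]

early = early_fusion(face, audio, text, weights=[1.0, 1.0, 1.0, 1.0], bias=0.0)
late = late_fusion(face, audio, text,
                   models=[([1.0, 1.0], 0.0), ([1.0], 0.0), ([1.0], 0.0)])
print(early, late)
```

In early fusion a single model sees all modalities together and can learn cross-modal correlations; in late fusion each modality is judged on its own and only the decisions are combined.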
  • 21. LSTM - Long Short-Term Memory An LSTM network is a recurrent neural network that models time- or sequence-dependent behaviour. This is done by feeding back the output of a neural network layer at time t to the input of the same network layer at time t + 1.
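
The feedback loop described on slide 21 can be made concrete with a scalar LSTM cell. This is a minimal sketch of the standard LSTM equations with one input and one hidden unit, not the model used in the paper; the parameter values are arbitrary and chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate cell value
    c = f * c_prev + i * g   # new cell state
    h = o * math.tanh(c)     # new hidden state (the output fed back at t + 1)
    return h, c

# Arbitrary illustrative parameters.
params = {k: 0.5 for k in ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug")}
params.update({"bf": 0.0, "bi": 0.0, "bo": 0.0, "bg": 0.0})

sequence = [0.1, 0.4, -0.2]
h, c = 0.0, 0.0
outputs = []
for x in sequence:
    h, c = lstm_step(x, h, c, params)  # output at time t becomes input at t + 1
    outputs.append(h)
print(outputs)
```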
  • 22. Metrics - Concordance Correlation Coefficients CCC is an index of how well a new test or measurement (Y) reproduces a gold standard test or measurement (X). It quantifies the agreement between these two measures of the same variable. Like a correlation, ρc ranges from -1 to 1, with perfect agreement at 1. It is computed from the means, the variances, and the correlation coefficient of the two variables. As a fine-tuning step, the authors also used 1 - ρc as the loss function instead of MSE.
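
For reference, the standard definition is ρc = 2ρσxσy / (σx² + σy² + (μx − μy)²), where ρσxσy is the covariance of the two variables. A minimal sketch of the metric and of a 1 - ρc loss follows; the function names are hypothetical and population (not sample) variances are assumed.

```python
from statistics import mean

def ccc(x, y):
    """Concordance correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    vx = mean([(a - mx) ** 2 for a in x])
    vy = mean([(b - my) ** 2 for b in y])
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def ccc_loss(x, y):
    """Loss that is 0 at perfect agreement and grows as agreement drops."""
    return 1.0 - ccc(x, y)

gold = [-1.0, 0.0, 1.0]
shifted = [0.0, 1.0, 2.0]  # perfectly correlated with gold, but offset by 1

print(ccc(gold, gold))     # perfect agreement
print(ccc(gold, shifted))  # lower than 1 because of the offset
```

The second call shows why CCC is stricter than Pearson correlation: the shifted predictions correlate perfectly with the gold values, yet the mean offset is penalized.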
  • 23. Metrics - Accuracy and F1-score Accuracy: the percentage of correct predictions out of all predictions made. F1-score: the harmonic mean of precision and recall, conveying the balance between the two.
  • 24. Limitations The dataset was designed to be downloaded from YouTube. Of the 497 videos, 111 were unavailable. I trained with limited data, and the results differed from the ones that were reported.
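
Both classification metrics can be computed in a few lines. The sketch below scores F1 for a single class; how the per-class scores were averaged in the experiments is not stated in the slides, so no averaging is shown. The labels are toy data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, label):
    """F1-score of one class: harmonic mean of its precision and recall."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = ["Happy", "Sad", "Happy", "Neutral"]
y_pred = ["Happy", "Happy", "Happy", "Neutral"]

acc = accuracy(y_true, y_pred)                  # 3 of 4 correct
f1_happy = f1_for_class(y_true, y_pred, "Happy")
print(acc, f1_happy)
```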
  • 26. Results
      Source                     CCC Arousal   CCC Valence   Accuracy   F1-score
      Reported in their paper    0.400         0.353
      Contest evaluation         0.359         0.276
      My local environment       0.210         0.257         0.434      0.362
  • 27. Mixed features
      Feature         CCC Arousal   CCC Arousal   CCC Valence   CCC Valence   Accuracy   F1-score
                      (theirs)      (mine)        (theirs)      (mine)        (mine)     (mine)
      Face Visual     0.109         0.075         0.237         0.193         0.405      0.396
      Face Feature    0.046         0.007         0.080         0.012         0.204      0.204
      Face Fusion     0.175         0.113         0.261         0.149         0.381      0.383
      Audio Feature   0.273         0.207         0.266         0.015         0.418      0.420
      Text Fusion     0.137         0.107         0.259         0.037         0.259      0.259
  • 28. Body Features OpenPose (BODY_25) (11 features): The normalized angles between the joints. I did not use the calculated features directly because they were 25x224x224. VGG16 FC6 skeleton image (4096 features): I drew the skeleton on a black background, fed it to a VGG16, and extracted a feature vector from the FC6 layer.
  • 29. Quad Model The proposed model adds body gesture features, both handcrafted and deep, as a fused layer and is evaluated through an LSTM.
  • 30. My experiments
      Model                     CCC Arousal   CCC Valence   Accuracy   F1-score
      Body Feature              0.067         0.013         0.285      0.283
      Body Visual               0.077         0.005         0.361      0.350
      Body Fusion               0.002         0.049         0.136      0.191
      Trimodal + Body Feature   0.267         0.283         0.185      0.274
      Trimodal + Body Visual    0.006         0.244         0.411      0.407
      Trimodal + Body Fusion    0.026         0.307         0.449      0.451
  • 31. Experiments After running 112 experiments with the combinations of features, we found the best models for each metric. Hardware: NVIDIA GTX 1080 Ti
  • 32. Other experiments
      CCC Arousal   CCC Valence   Accuracy   F1-score   Model
      0.272         0.064         0.380      0.380      Fusion_late Body_feature Audio_feature
      0.173         0.359         0.411      0.358      Fusion_late Face_fusion Audio_feature Word_fusion Body_fusion face_fusion
      0.249         0.267         0.451      0.449      Fusion_early Face_fusion Audio_feature Word_fusion body_fusion
      0.026         0.307         0.449      0.451      Trimodal + Body Fusion
  • 33. Final results
      Approach                       CCC Arousal   CCC Valence   Accuracy   F1-score
      Authors' approach (Trimodal)   0.210         0.257         0.434      0.362
      My approach (Quadmodal)        0.249         0.267         0.451      0.449
      Mixed models                   0.272         0.359         0.451      0.451
  • 34. Conclusions ● Multimodal models outperform the baseline methods. ● The results show that cross-modal information benefits the estimation of long-term affective states. ● Early fusion performed better in general, but for some dimensional metrics late fusion performed better.
  • 35. Next steps ● I’m planning to explore 3D convolutions instead of LSTM, try ResNet instead of VGG16, and use different network models for each feature. ● UPDATE: These are the evaluations on the validation and test sets.
      Model            CCC Arousal   CCC Valence   Accuracy   F1-score
      Trimodal Val     0.298         0.428         0.440      0.455
      Trimodal Test    0.180         0.405         0.455      0.455
      Quadmodal Val    0.340         0.454         0.445      0.453
      Quadmodal Test   0.235         0.413         0.453      0.453
  • 39. LSTM
  • 40. Decision layers The activation functions used for each output were: Emotion (categorical): Softmax; Valence (dimensional): hyperbolic tangent function (tanh); Arousal (dimensional): Sigmoid
  • 41. Sigmoid as activation function A sigmoid activation function turns an activation into a value between 0 and 1. It is useful for binary classification problems and is mostly used in the final output layer of such problems. Also, sigmoid activation leads to slow gradient descent because the slope is small for high and low values.
  • 42. Hyperbolic tangent as activation function A Tanh activation function turns an activation into a value between -1 and +1. The outputs are normalized. The gradient is stronger for tanh than sigmoid (derivatives are steeper)
  • 43. SoftMax as activation function The Softmax function is a wonderful activation function that turns numbers aka logits into probabilities that sum to one.
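
The three output activations from slides 40 to 43 can be written directly with the standard library. This is a minimal sketch of the textbook definitions; the logits are toy values and the max-subtraction in `softmax` is a common numerical-stability choice, not something taken from the slides.

```python
import math

def sigmoid(z):
    """Squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes any real value into the range (-1, +1)."""
    return math.tanh(z)

def softmax(logits):
    """Turns a list of logits into probabilities that sum to one."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # e.g. scores for three emotion classes
print(sigmoid(0.0), tanh(0.0), probs, sum(probs))
```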
  • 44. MSE as loss function for linear regression Linear regression uses Mean Squared Error as loss function that gives a convex graph and then we can complete the optimization by finding its vertex as global minimum.
  • 45. SGD as Optimizer Stochastic gradient descent (SGD) computes the gradient for each update using a single training data point x_i (chosen at random). The idea is that the gradient calculated this way is a stochastic approximation to the gradient calculated using the entire training data. Each update is now much faster to calculate than in batch gradient descent, and over many updates, we will head in the same general direction.
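
A small sketch ties the squared-error loss and the SGD update together. It fits a one-parameter model y = w * x on toy, noise-free data; the learning rate, step count, and seed are arbitrary assumptions chosen for illustration.

```python
import random

def sgd_fit(xs, ys, lr=0.01, steps=1000, seed=0):
    """Fit y = w * x by SGD on the squared error of one random sample per step."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))         # pick a single training point at random
        error = w * xs[i] - ys[i]
        grad = 2.0 * error * xs[i]         # d/dw of (w * x_i - y_i)^2
        w -= lr * grad                     # step against the gradient
    return w

def mse(xs, ys, w):
    """Mean squared error of the fitted model over the whole data set."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with the true slope w = 2

w = sgd_fit(xs, ys)
print(w, mse(xs, ys, w))
```

Each update uses only one sample, yet over many updates the estimate still moves toward the slope that minimizes the full-data MSE.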
  • 46. Layers Early fusion - Hidden layer Early fusion Fully connected LSTM Late fusion
  • 47. 1DConv Average Pooling 1D convolutional neural nets can be used for extracting local 1D patches (subsequences) from sequences and can identify local patterns within the window of convolution. A pattern learnt at one position can also be recognized at a different position, making 1D conv nets translation invariant. Some sequences are so long that they cannot realistically be processed by RNNs. In such cases, 1D conv nets can be used as a pre-processing step that makes the sequence shorter through downsampling by extracting higher-level features, which can then be passed on to the RNN as input.
  • 48. Batch Normalization We normalize the input layer by adjusting and scaling the activations to speed up learning. The same is done for the values in the hidden layers, which change all the time during training.
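
The downsampling idea on slide 47 can be sketched with a hand-written 1D convolution followed by average pooling. The kernel is fixed here for illustration, whereas a real network would learn it; as in most deep learning libraries, the "convolution" is implemented as a sliding dot product without flipping the kernel.

```python
def conv1d(seq, kernel):
    """Slide the kernel over the sequence and take a dot product at each position."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def avg_pool1d(seq, size):
    """Average non-overlapping windows; a trailing partial window is dropped."""
    return [sum(seq[i:i + size]) / size
            for i in range(0, len(seq) - size + 1, size)]

sequence = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
features = conv1d(sequence, kernel=[0.5, 0.5])  # 7 local features
pooled = avg_pool1d(features, size=2)           # 3 values passed on to the RNN
print(len(sequence), len(features), len(pooled))
```

The eight-step input shrinks to three values, which is the kind of shortening that makes a long sequence manageable for a recurrent layer.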
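
A minimal sketch of the batch normalization forward pass in training mode follows. The scale `gamma`, shift `beta`, and `eps` follow the usual formulation and are fixed here, whereas a real layer would learn `gamma` and `beta` and track running statistics for inference; the activations are toy values.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    columns = list(zip(*batch))  # one tuple per feature
    out_columns = []
    for col in columns:
        m = sum(col) / len(col)
        v = sum((x - m) ** 2 for x in col) / len(col)
        out_columns.append([gamma * (x - m) / math.sqrt(v + eps) + beta
                            for x in col])
    return [list(row) for row in zip(*out_columns)]

# Toy activations: 3 samples, 2 features on very different scales.
activations = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
normalized = batch_norm(activations)
print(normalized)
```

After normalization both features have roughly zero mean and unit variance across the batch, regardless of their original scale.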
  • 49. VGG16