Multimodal video action (bloopers) recognition and localization methods for spatio-temporal feature fusion using Face, Body, Audio, and Emotion features
3. Multimodal video action (bloopers) recognition and localization methods for spatio-temporal feature fusion using Face, Body, Audio, and Emotion features
4. Index
● Basic concepts
● Video bloopers dataset
● Feature extraction
● Blooper recognition
● Blooper localization
● System implementation
● Conclusions
5. Previous work on Automatic Video Editing (AVE)
● Previous work on automatic video editing focuses mostly on how to enhance existing videos by adding music, transitions, zooms, camera changes, and other improvements.
● Simple silence-detection mechanisms are just beginning to be implemented in commercial software; however, silences are easy to detect visually.
● Video summarization techniques involve editing, but they are content-based.
● Video action recognition is the area that studies behavioral patterns in videos.
● Video action recognition applied to blooper detection has not yet been studied in the literature.
6. Problem
● According to online sources, basic video editing can take 30 minutes to an
hour for each minute of finished video (a 4-minute video would take 4 hours
to edit). More advanced editing (adding in animations, VFX, and compositing)
can take much longer.
● The time it takes to edit a video discourages users from producing periodic content.
7. Solution: AutomEditor
A system that automates monologue video editing.
● AutomEditor is fed with example video clips (1-3 seconds each) of bloopers and non-bloopers (separated into folders)
● Extracts features and trains a model.
● Evaluates its performance.
● Localizes the blooper fragments in full-length videos
● Shows the results in a web interface.
9. Main contributions
● Creation of a video bloopers dataset (Blooper DB)
● Feature extraction methods for video blooper recognition
● Video blooper recognition models
● Video blooper localization techniques
● Web interface for automatic video editing
Problem: each contribution on its own is substantial enough for an individual publication. I could not cover all of them in depth.
10. Creation of a monologue video bloopers dataset
● ~600 videos
● Between 1 and 3 seconds per video
● Train, Validation, and Test batches
○ Train: 464
○ Test: 66
○ Validation: 66
● 2 categories
○ Blooper
○ No blooper
● Stratified data
11. Criteria
● I split long bloopers (more than 2 seconds) into a non-blooper (before the mistake) and a blooper (the clip that contains the mistake).
● For short bloopers (1 to 2 seconds) I took other clips of about the same length from the same video as non-bloopers.
● The clips do not contain truncated phrases.
● I tried to avoid green-screen vs. non-green-screen background differences as much as possible.
14. Feature extraction methods for blooper recognition
The main goal of this process is to extract features that are invariant to person descriptors (e.g. gender, age), scale, position, background, and language
● Audio
■ General (1) Audio handcrafted features per clip (OpenSMILE)
■ Temporal (20) Audio handcrafted features per clip (OpenSMILE)
● Images
○ Face
■ Temporal (20) Face handcrafted features (OpenFace)
■ Temporal (20) Face deep features (VGG16)
○ Body
■ Temporal (20) Body handcrafted features (OpenPose)
■ Temporal (20) Body deep features (VGG16)
○ Emotions
■ General (1) FER predictions (EmoPy and others)
■ Temporal (20) FER predictions (EmoPy and others)
15. Audio features
OpenSMILE (1582 features): The audio is extracted from the videos and processed by OpenSMILE, which extracts audio features such as loudness, pitch, jitter, etc.
It was tested on the whole video-clip length (general) and on 20 fragments (temporal).
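As an illustration, openSMILE can be driven from Python through its command-line tool. The config and file paths below are assumptions that depend on the local installation; the IS10 paralinguistics config is the one that yields a 1582-feature set.

```python
import subprocess

# Hedged sketch: extract the 1582-feature IS10 paralinguistic set with
# the SMILExtract CLI. All paths are placeholders for the local setup.
subprocess.run([
    "SMILExtract",
    "-C", "opensmile/config/IS10_paraling.conf",  # assumed config location
    "-I", "clip_audio.wav",                       # audio extracted from the clip
    "-O", "clip_features.csv",                    # functionals, one row per clip
], check=True)
```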
16. Face features
OpenFace (709 features): A facial behavior analysis tool that provides accurate facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. We get points that represent the face.
VGG16 FC6 (4096 features): The faces are cropped (224×224×3), aligned, and background-zeroed, then passed through a pretrained VGG16 to obtain a 4096-dimensional feature vector from the FC6 layer.
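A minimal sketch of the deep face feature step, assuming Keras' pretrained VGG16 (where the FC6 layer is named 'fc1'); the face crop below is a random placeholder for an aligned 224×224×3 crop.

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model

base = VGG16(weights='imagenet')                       # full network with FC layers
fc6 = Model(base.input, base.get_layer('fc1').output)  # FC6 is 'fc1' in Keras

face = np.random.rand(1, 224, 224, 3) * 255            # placeholder face crop
features = fc6.predict(preprocess_input(face))         # shape (1, 4096)
```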
17. Body features
OpenPose (BODY_25) (11 features): The normalized angles between the joints. I did not use the features computed by OpenPose itself because they were 25×224×224.
VGG16 FC6 skeleton image (4096 features): I drew the skeleton (neck at the center) on a black background, fed it to a VGG16, and extracted a feature vector from the FC6 layer.
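A hedged sketch of the "normalized angles between joints" idea: the angle at a middle joint b formed by the segments b→a and b→c, scaled to [0, 1]. Which 11 joint triples were used is not specified in the slides, so the example triple is an assumption.

```python
import numpy as np

def joint_angle(a, b, c):
    # Angle at joint b between segments b->a and b->c, normalized to [0, 1].
    v1, v2 = np.asarray(a) - np.asarray(b), np.asarray(c) - np.asarray(b)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

# Example: elbow angle from shoulder, elbow, and wrist keypoints.
print(joint_angle((0, 0), (1, 0), (1, 1)))  # 0.5 -> 90 degrees
```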
18. Emotion features
EmoPy (7 features): A deep neural net
toolkit for emotion analysis via Facial
Expression Recognition (FER).
Other (28 features): Four other models from different FER contest participants.
With 7 categories per model, that makes 35 features in total (EmoPy plus the four others).
20 samples per video clip were predicted (temporal); from these I computed their normalized sum (general).
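A small sketch of the temporal-to-general aggregation; the FER outputs below are random placeholders for the 20 per-sample, 7-category predictions.

```python
import numpy as np

temporal = np.random.rand(20, 7)   # placeholder FER outputs, one row per sample
general = temporal.sum(axis=0)
general /= general.sum()           # normalized sum, shape (7,)
```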
19. Feature fusion
The features from the same source were normalized and fused, producing the following feature sizes:
Face fused (4096 + 709 = 4805 features)
Body fused (4096 + 11 = 4107 features)
Emotion features (7 + 7 + 7 + 7 + 7 = 35 features)
Audio features came only from OpenSMILE, so these were not fused (1582 features)
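A sketch of the per-source fusion on placeholder vectors. The exact normalization is not specified in the slides, so the z-scoring below is an assumption; the vector sizes match this slide.

```python
import numpy as np

def normalize(v):
    # One plausible normalization; the slides do not state which was used.
    return (v - v.mean()) / (v.std() + 1e-8)

face_deep, face_hand = np.random.rand(4096), np.random.rand(709)
face_fused = np.concatenate([normalize(face_deep), normalize(face_hand)])
print(face_fused.shape)  # (4805,)
```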
20. Feature sequences
Extracted features are grouped into sequences to feed the RNNs
● Each video clip (a fragment of 1 to 3 seconds) is divided into 20 equally spaced samples (e.g. 20 face images; in a 60-frame video, frames 1, 4, 7, ..., 57, 60 are processed)
● The samples were extracted from the end to the beginning
● This produces a matrix of [20][feature_size] (see the sketch below)
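A sketch of the sampling rule under these constraints: 20 equally spaced frame indices, walked from the end of the clip back to the beginning.

```python
import numpy as np

def sample_indices(n_frames, n_samples=20):
    # Equally spaced indices from the last frame back to the first.
    return np.linspace(n_frames - 1, 0, n_samples).astype(int)

print(sample_indices(60))  # [59 55 52 ... 3 0]
```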
21. Fusions
Early fusion: For early fusion, features from different modalities are projected
into the same joint feature space before being fed into the classifier.
Late fusion: For late fusion, classifications are made on each modality and their
decisions or predictions are later merged together.
We used early fusion for our training cases.
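A toy contrast of the two strategies on placeholder vectors; the late-fusion merge rule shown (averaging) is an assumption, since decisions can also be merged by voting or a meta-classifier.

```python
import numpy as np

audio, face = np.random.rand(1582), np.random.rand(4805)

# Early fusion: concatenate modality features into one joint feature
# space before the classifier sees them.
early_input = np.concatenate([audio, face])   # shape (6387,)

# Late fusion: classify each modality separately, then merge predictions
# (simple averaging here, as one possible merge rule).
p_audio, p_face = np.array([0.2, 0.8]), np.array([0.4, 0.6])
late_prediction = (p_audio + p_face) / 2      # [0.3, 0.7]
```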
26. Evaluation
● The models were trained on an NVIDIA GTX 1080 Ti graphics card.
● Since there is no previous work in this field, we used the individual feature
models as baseline.
● 300 epochs
● Optimizer: Adam
● Loss: MSE
● Learning rate: 0.001
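A minimal sketch of this training setup. The optimizer, loss, learning rate, and epoch count come from the slide; the LSTM architecture and the random placeholder data are assumptions (shapes follow the deck: 20 timesteps × 1582 audio features, 464 train / 66 validation clips).

```python
import numpy as np
from tensorflow.keras import layers, models, optimizers

# Placeholder per-feature model: an RNN over the 20-sample sequences.
model = models.Sequential([
    layers.LSTM(128, input_shape=(20, 1582)),
    layers.Dense(2, activation='softmax'),    # blooper / no blooper
])
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss='mse', metrics=['accuracy'])

# Random stand-ins for the extracted feature sequences.
X_train, y_train = np.random.rand(464, 20, 1582), np.random.rand(464, 2)
X_val, y_val = np.random.rand(66, 20, 1582), np.random.rand(66, 2)
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=300)
```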
27. Emotion features: Global & Temporal
Model             acc_val  acc_train  acc_test  f1_score  f1_test  loss
Emotion Global    0.59     0.86       0.59      0.60      0.56     0.28
Emotion Temporal  0.62     0.99       0.69      0.66      0.63     0.32
[Plots for the Temporal and Global models]
28. Body Temporal Features: Handcrafted & Deep
Model      acc_val  acc_train  acc_test  f1_score  f1_test  loss
Body Hand  0.63     0.92       0.54      0.72      0.59     0.27
Body Deep  0.68     0.99       0.65      0.72      0.71     0.26
[Plots for the Handcrafted and Deep models]
29. Body fusion (handcrafted + deep features)
Model     acc_val  acc_train  acc_test  f1_score  f1_test  loss
Body Fus  0.66     0.98       0.66      0.74      0.69     0.22
30. Face Temporal Features: Handcrafted & Deep
Model      acc_val  acc_train  acc_test  f1_score  f1_test  loss
Face Hand  0.84     0.99       0.87      0.89      0.86     0.12
Face Deep  0.89     1.00       0.81      0.92      0.83     0.12
[Plots for the Handcrafted and Deep models]
31. Face fusion (handcrafted + deep features)
Model     acc_val  acc_train  acc_test  f1_score  f1_test  loss
Face Fus  0.89     1.00       0.89      0.92      0.84     0.09
32. Audio Features: Temporal & General
Model           acc_val  acc_train  acc_test  f1_score  f1_test  loss
Audio Temporal  0.86     1.00       0.84      0.89      0.83     0.11
Audio General   0.95     1.00       0.90      0.96      0.92     0.03
[Plots for the Temporal and General models]
33. Top 3: Face handcrafted + Face deep + Audio gen
Model     acc_val  acc_train  acc_test  f1_score  f1_test  loss
Aud+Face  0.96     1.00       0.90      0.98      0.92     0.03
36. Results
Model       acc_val  acc_train  acc_test  f1_score  f1_test  loss
Emotion Gl  0.59     0.86       0.59      0.60      0.56     0.28
Emotion Te  0.62     0.99       0.69      0.66      0.63     0.32
Body Feat   0.63     0.92       0.54      0.72      0.59     0.27
Body Fus    0.66     0.98       0.66      0.74      0.69     0.26
Body Vis    0.68     0.99       0.65      0.72      0.71     0.22
Face Feat   0.84     0.99       0.87      0.89      0.86     0.12
Audio Te    0.86     1.00       0.84      0.89      0.83     0.11
Face Vis    0.89     1.00       0.81      0.92      0.83     0.12
Face Fus    0.89     1.00       0.89      0.92      0.84     0.09
Audio       0.95     1.00       0.90      0.96      0.92     0.03
Aud+Face    0.96     1.00       0.90      0.98      0.92     0.03
Quadmodal   1.00     1.00       0.90      1.00      0.90     0.01
37. Early vs. Late fusion
Model            acc_val  acc_train  acc_test  f1_score  f1_test  loss
Quadmodal Early  1.00     1.00       0.90      1.00      0.90     0.01
Quadmodal Late   0.96     1.00       0.93      0.96      0.93     0.06
38. But how good is a model with 100% train, 100% validation, and 90% test accuracy?
Sometimes algorithmic research work ends after the computation of the performance metrics, but ...
Now that we have a model with good performance on small data, how can we test that it works for real-life applications?
Full-length videos will be provided by users, so the first step is to find the bloopers in a video. Localization techniques are needed.
39. Video blooper localization techniques
More challenges ...
● There are no existing localization techniques for video bloopers.
● Temporal action localization techniques in untrimmed videos work mostly for image-only processing.
● Localization in multimodal settings is mostly limited to video indexing.
● There are no localization methods for mixed temporal and non-temporal features.
● The videos must be analyzed in small fragments.
● The analysis of multiple video fragments is costly.
● Taking different frames of the same video clip can give different results.
● The output should be a time range.
40. Diagnosis of how predictions are distributed
To test how algorithms can find bloopers, I randomly inserted 6 clips (3 bloopers and 3 non-bloopers) into a 70-second video of the same person. I created two test videos.
I analyzed 2-second fragments spaced 500 milliseconds apart and plotted the results.
Then I compared my expectations vs. reality.
43. Defining an algorithm to find the bloopers
I defined the concept of blooper_score as the predicted value of the blooper category. Instead of using a continuous 0-to-1 scale, I used the discrete values 0, 1, and 2: 0 stands for blooper_score = 0, 1 for "almost 1" (intermediate values within a threshold range), and 2 for blooper_score = 1. The most important pattern that I found was contiguous runs of high values.
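A sketch of this discretization; the exact threshold range is not given in the slides, so the bounds below are assumptions.

```python
import numpy as np

def discretize(scores, low=0.5, high=0.99):
    # 0 for low blooper_score, 1 for "almost 1" (assumed range [0.5, 0.99)),
    # 2 for blooper_score ~ 1. Thresholds are illustrative placeholders.
    levels = np.zeros(len(scores), dtype=int)
    levels[(scores >= low) & (scores < high)] = 1
    levels[scores >= high] = 2
    return levels

print(discretize(np.array([0.1, 0.7, 0.995])))  # [0 1 2]
```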
45. Calculating the sequences of top 3 values
I defined a window size and calculated the percentage of elements in each window that fall among the top 3 values. I used a threshold on this percentage to decide whether to add the window to a range.
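A sketch of this windowing idea; the window size, threshold, and the "high value" test are assumptions, since the slides describe the method only at a high level.

```python
def find_blooper_ranges(levels, window=8, threshold=0.6, step_ms=500):
    # levels: one discrete 0/1/2 value per analyzed fragment.
    # A window qualifies when enough of its elements are high (>= 1);
    # consecutive qualifying windows are merged into one time range.
    ranges, start = [], None
    for i in range(len(levels) - window + 1):
        frac_high = sum(v >= 1 for v in levels[i:i + window]) / window
        if frac_high >= threshold and start is None:
            start = i
        elif frac_high < threshold and start is not None:
            ranges.append((start * step_ms, (i - 1 + window) * step_ms))
            start = None
    if start is not None:
        ranges.append((start * step_ms, (len(levels) - 1 + window) * step_ms))
    return ranges  # list of (start_ms, end_ms)
```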
46. Result of ranges
It returned the 3 ranges that contained the bloopers of the video.
Millisecond accuracy would be needed, but this approach is good enough to at least identify them.
48. But not everybody is familiar with the command line
Now we have a recognition model and a localization method, but our system is not user-friendly.
So there is another challenge ...
There are no automatic video editing interfaces on the web.
So I developed an open-source web interface for automatic video editing.
49. Web interface for automatic video editing
http://www.carlostoxtli.com/AutomEditor/frontend/
The tool helps users to analyze their videos and visualize their bloopers.
For developers, it provides a simple, easy-to-integrate platform for testing their algorithms.
51. Future work
● Explore one of the contributions in depth.
● Data augmentation methods for video bloopers
○ Generative video bloopers?
● Research temporal action localization techniques in untrimmed videos for mixed spatio-temporal modalities
● Detecting bloopers that involve multiple people.
● Study people's interaction with AVE interfaces (HCI)
52. Conclusions
● Video blooper recognition benefits from multimodal techniques.
● Results from small data are not generalizable enough.
● Models for localizing mixed spatio-temporal multimodal features are needed to reduce time and processing load.
● AutomEditor interface can
○ Help users to edit their videos automatically online
○ Help developers to test and publish their models to the public.
56. Decision layers
The activation functions used for each metric were:
Emotion (categorical): Softmax
Valence (dimensional): hyperbolic tangent function (tanh)
Arousal (dimensional): Sigmoid
57. Sigmoid as activation function
A sigmoid activation function turns an activation into a value between 0 and 1. It is useful for binary classification problems and is mostly used in the final output layer of such problems. Also, sigmoid activation leads to slow gradient descent because the slope is small for high and low values.
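For reference, the definition and its derivative, which vanishes for large |x| (the slow-gradient issue mentioned above):

\[ \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \]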
58. Hyperbolic tangent as activation function
A tanh activation function turns an activation into a value between -1 and +1. The outputs are normalized. The gradient is stronger for tanh than for sigmoid (the derivatives are steeper).
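For reference, the definition and its derivative, which peaks at 1 (versus 0.25 for sigmoid, hence the stronger gradient):

\[ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad \tanh'(x) = 1 - \tanh^{2}(x) \]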
59. SoftMax as activation function
The softmax function is a wonderful activation function that turns numbers (aka logits) into probabilities that sum to one.
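For reference, for a logit vector z:

\[ \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \]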
60. MSE as loss function for linear regression
Linear regression uses Mean Squared Error as its loss function, which gives a convex graph, so we can complete the optimization by finding its vertex as the global minimum.
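For reference, with n samples, targets y_i, and predictions ŷ_i:

\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]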
61. SGD as Optimizer
Stochastic gradient descent (SGD) computes the gradient for each update using a single training data point x_i (chosen at random). The idea is that the gradient calculated this way is a stochastic approximation of the gradient calculated using the entire training data. Each update is much faster to compute than in batch gradient descent, and over many updates we head in the same general direction.
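The update rule, with learning rate η, parameters θ, and a single randomly chosen sample (x_i, y_i):

\[ \theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t;\, x_i, y_i) \]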
63. 1DConv Average Pooling
1D convolutional neural nets can be used for extracting local 1D patches (subsequences) from sequences and can identify local patterns within the convolution window. A pattern learned at one position can also be recognized at a different position, making 1D conv nets translation invariant. Sometimes a sequence is so long that it cannot realistically be processed by RNNs. In such cases, 1D conv nets can be used as a preprocessing step to make the sequence smaller through downsampling by extracting higher-level features, which can then be passed on to the RNN as input.
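A minimal Keras sketch of this preprocessing pattern; the layer sizes and the 20×4805 face-fusion input shape are assumptions, not the deck's exact architecture.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv1D(256, kernel_size=3, activation='relu',
                  input_shape=(20, 4805)),  # local patterns over the sequence
    layers.AveragePooling1D(pool_size=2),   # downsample before the RNN
    layers.LSTM(128),                       # RNN over the shorter sequence
    layers.Dense(2, activation='softmax'),  # blooper / no blooper
])
model.summary()
```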
64. Batch Normalization
We normalize the input layer by adjusting and scaling the activations to speed up learning, and we do the same for the values in the hidden layers, which are changing all the time.
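For reference, with batch mean μ_B, batch variance σ²_B, and learned scale and shift parameters γ and β:

\[ \hat{x} = \frac{x - \mu_{\mathcal{B}}}{\sqrt{\sigma^{2}_{\mathcal{B}} + \epsilon}}, \qquad y = \gamma\,\hat{x} + \beta \]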