AutomEditor: Video blooper
recognition and localization
for automatic monologue
video editing
Carlos Toxtli
Multimodal video action
(blooper) recognition and
localization methods for
spatio-temporal feature
fusion using Face, Body,
Audio, and Emotion features
Index
● Basic concepts
● Video bloopers dataset
● Feature extraction
● Blooper recognition
● Blooper localization
● System implementation
● Conclusions
Previous work on Automatic Video Editing (AVE)
● Previous work on automatic video editing focuses mostly on enhancing
existing videos by adding music, transitions, zooms, camera changes,
and other improvements.
● Simple silence detection mechanisms are just beginning to be implemented
by commercial software. However, these are easy to detect visually.
● Video summarization techniques involve editing but are content based.
● Video action recognition is the area that studies the behavioral patterns in
videos.
● Video action recognition applied to blooper detection is an area that has
not yet been studied in the literature.
Problem
● According to online sources, basic video editing can take 30 minutes to an
hour for each minute of finished video (a 4-minute video would take 4 hours
to edit). More advanced editing (adding in animations, VFX, and compositing)
can take much longer.
● The time a video takes to edit discourages users from producing
periodic content.
Solution: AutomEditor
A system that automates monologue video editing.
● AutomEditor is fed with example video clips (1-3 seconds each) of bloopers
and non-bloopers (separated by folders)
● Extracts features and trains a model.
● Evaluates its performance.
● Localizes the blooper fragments in full-length videos
● Shows the results in a web interface.
End-to-end solution
From database creation to the web
application.
https://github.com/toxtli/AutomEditor
Main contributions
● Creation of a video bloopers dataset (Blooper DB)
● Feature extraction methods for video blooper recognition
● Video blooper recognition models
● Video blooper localization techniques
● Web interface for automatic video editing
Problem: each contribution on its own could merit an individual publication.
I could not cover all of them in depth.
Creation of a monologue video bloopers dataset
● ~600 videos
● Between 1 and 3 seconds per video
● Train, Validation, and Test batches
○ Train: 464
○ Test: 66
○ Validation: 66
● 2 categories
○ Blooper
○ No blooper
● Stratified data
Criteria
● I split long bloopers (more than 2 seconds) into a non-blooper clip (before
the mistake) and a blooper clip (containing the mistake)
● For short bloopers (1 to 2 seconds), I selected other clips of about the
same length from the same video as non-bloopers.
● The clips do not contain truncated phrases.
● I avoided, as much as possible, mixing green-screen and non-green-screen
backgrounds.
Examples
No blooper
Blooper
Examples
No blooper Blooper
Feature extraction methods for blooper recognition
The main goal of this process is to extract features that are invariant to person
descriptors (e.g., gender, age), scale, position, background, and language
● Audio
■ General (1) Audio handcrafted features per clip (OpenSMILE)
■ Temporal (20) Audio handcrafted features per clip (OpenSMILE)
● Images
○ Face
■ Temporal (20) Face handcrafted features (OpenFace)
■ Temporal (20) Face deep features (VGG16)
○ Body
■ Temporal (20) Body handcrafted features (OpenPose)
■ Temporal (20) Body deep features (VGG16)
○ Emotions
■ General (1) FER predictions (EmoPy and others)
■ Temporal (20) FER predictions (EmoPy and others)
Audio features
OpenSMILE (1582 features): The audio is
extracted from the videos and processed
by OpenSMILE, which extracts audio
features such as loudness, pitch, jitter, etc.
It was tested on the whole clip length
(general) and on 20 fragments (temporal).
OpenFace (709 features): A facial behavior
analysis tool that provides accurate facial
landmark detection, head pose
estimation, facial action unit recognition,
and eye-gaze estimation. It yields points
that represent the face.
VGG16 FC6 (4096 features): The faces are
cropped (224×224×3), aligned, background
zeroed out, and passed through a
pretrained VGG16 to extract a
4096-dimensional feature vector from the
FC6 layer.
Face features
Body Features
OpenPose (BODY_25) (11
features): The normalized angles
between the joints. I did not use the
computed feature maps because
they were 25×224×224.
VGG16 FC6 skeleton image (4096
features): I drew the skeleton (neck
at the center) on a black background,
fed it to a VGG16, and extracted a
feature vector from the FC6 layer.
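As a sketch of the angle computation (the exact joint triplets and normalization used by AutomEditor are not specified here), the angle at a joint can be derived from three 2D keypoints and scaled to [0, 1]:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (radians) at joint b, formed by segments b->a and b->c."""
    v1 = np.asarray(a, float) - np.asarray(b, float)
    v2 = np.asarray(c, float) - np.asarray(b, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def normalized_angle(a, b, c):
    """Scale the angle to [0, 1] by dividing by pi."""
    return joint_angle(a, b, c) / np.pi

# A right angle at the middle joint maps to 0.5
print(normalized_angle((1, 0), (0, 0), (0, 1)))  # → 0.5
```

Dividing by π makes the value independent of image scale and orientation conventions, which fits the invariance goal stated earlier.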
Emotion features
EmoPy (7 features): A deep neural net
toolkit for emotion analysis via Facial
Expression Recognition (FER).
Other (28 features): Four additional models
from different FER contest participants.
Seven categories per model, 35 features in
total.
20 samples per video clip were predicted
(temporal); from these, I computed their
normalized sum (general).
Feature fusion
The features from the same source were normalized and
fused, yielding the following feature sizes:
Face fused (4096 + 709 = 4805 features)
Body fused (4096 + 11 = 4107 features)
Emotion features (7 + 7 + 7 + 7 + 7 = 35 features)
Audio features came only from OpenSMILE, so they were
not fused (1582 features)
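The normalize-and-concatenate step can be sketched as follows (z-score normalization is an assumption; the slide does not name the normalization used):

```python
import numpy as np

def zscore(x, eps=1e-8):
    """Per-feature z-score normalization over the temporal samples."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

# Hypothetical per-clip feature matrices: [20 samples x feature_size]
face_deep = np.random.rand(20, 4096)  # VGG16 FC6 features
face_hand = np.random.rand(20, 709)   # OpenFace features

# Fuse by concatenating along the feature axis: 4096 + 709 = 4805
face_fused = np.concatenate([zscore(face_deep), zscore(face_hand)], axis=1)
print(face_fused.shape)  # (20, 4805)
```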
Feature sequences
The extracted features are grouped into sequences to feed the RNNs
● Each video clip (a fragment of 1 to 3 seconds) is divided into 20 equally
spaced samples (e.g., 20 face images; in a 60-frame video, frames
1, 4, 7, ..., 57, 60 are processed)
● The samples were extracted from the end to the beginning
● It produces a matrix of [20][feature_size]
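The equally spaced sampling can be sketched like this (the exact rounding AutomEditor uses may differ):

```python
import numpy as np

def sample_indices(n_frames, n_samples=20):
    """n_samples equally spaced 1-based frame indices covering the clip."""
    return np.linspace(1, n_frames, n_samples).round().astype(int).tolist()

idx = sample_indices(60)
print(idx[0], idx[-1], len(idx))  # 1 60 20
```

Stacking the features of these 20 frames yields the [20][feature_size] matrix mentioned above.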
Fusions
Early fusion: For early fusion, features from different modalities are projected
into the same joint feature space before being fed into the classifier.
Late fusion: For late fusion, classifications are made on each modality and their
decisions or predictions are later merged together.
We used early fusion for our training cases.
Early fusion
Late fusion
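A toy numeric illustration of the two strategies (all feature values and weights below are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two tiny made-up "modalities" with fixed toy weights
audio_feat = np.array([0.2, 0.9, 0.4])
face_feat = np.array([0.7, 0.1, 0.5])
w_audio = np.array([1.0, -0.5, 0.3])
w_face = np.array([0.4, 0.8, -0.2])

# Early fusion: one classifier over the concatenated feature vector
early = sigmoid(np.concatenate([w_audio, w_face])
                @ np.concatenate([audio_feat, face_feat]))

# Late fusion: one classifier per modality, decisions merged afterwards
late = np.mean([sigmoid(w_audio @ audio_feat), sigmoid(w_face @ face_feat)])

print(early, late)  # both are probabilities in (0, 1)
```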
Quad Model
The proposed model uses all the
extracted features, feeds each
temporal stream into an LSTM
(except the audio), and combines
them by early fusion.
LSTM
Evaluation
● The models were trained on an NVIDIA GTX 1080 Ti graphics card.
● Since there is no previous work in this field, we used the individual-feature
models as baselines.
● 300 epochs
● Optimizer: Adam
● Loss: MSE
● Learning rate: 0.001
Emotion features: Global & Temporal
acc_val acc_train acc_test f1_score f1_test loss
Emotion Global 0.59 0.86 0.59 0.60 0.56 0.28
Emotion Temporal 0.62 0.99 0.69 0.66 0.63 0.32
Temporal
Global
Body Temporal Features: Handcrafted & Deep
acc_val acc_train acc_test f1_score f1_test Loss
Body Hand 0.63 0.92 0.54 0.72 0.59 0.27
Body Deep 0.68 0.99 0.65 0.72 0.71 0.26
Handcrafted
Deep
Body fusion (handcrafted + deep features)
acc_val acc_train acc_test f1_score f1_test Loss
Body Fus 0.66 0.98 0.66 0.74 0.69 0.22
Face Temporal Features: Handcrafted & Deep
acc_val acc_train acc_test f1_score f1_test Loss
Face Hand 0.84 0.99 0.87 0.89 0.86 0.12
Face Deep 0.89 1.00 0.81 0.92 0.83 0.12
Handcrafted
Deep
Face fusion (handcrafted + deep features)
acc_val acc_train acc_test f1_score f1_test Loss
Face Fus 0.89 1.00 0.89 0.92 0.84 0.09
Audio Features: Temporal & General
acc_val acc_train acc_test f1_score f1_test Loss
Audio Temporal 0.86 1.00 0.84 0.89 0.83 0.11
Audio General 0.95 1.00 0.90 0.96 0.92 0.03
Temporal
General
Top 3: Face handcrafted + Face deep + Audio gen
acc_val acc_train acc_test f1_score f1_test Loss
Aud+Face 0.96 1.00 0.90 0.98 0.92 0.03
All (Quadmodal): BodyTF+FaceTF+AudioG+EmoT
acc_val acc_train acc_test f1_score f1_test Loss
All 1.00 1.00 0.90 1.00 0.90 0.01
Train Validation Test
Confusion matrices of the Quadmodal model
Results
Model acc_val acc_train acc_test f1_score f1_test Loss
Emotion Gl 0.59 0.86 0.59 0.60 0.56 0.28
Emotion Te 0.62 0.99 0.69 0.66 0.63 0.32
Body Feat 0.63 0.92 0.54 0.72 0.59 0.27
Body Fus 0.66 0.98 0.66 0.74 0.69 0.26
Body Vis 0.68 0.99 0.65 0.72 0.71 0.22
Face Feat 0.84 0.99 0.87 0.89 0.86 0.12
Audio Te 0.86 1.00 0.84 0.89 0.83 0.11
Face Vis 0.89 1.00 0.81 0.92 0.83 0.12
Face Fus 0.89 1.00 0.89 0.92 0.84 0.09
Audio 0.95 1.00 0.90 0.96 0.92 0.03
Aud+Face 0.96 1.00 0.90 0.98 0.92 0.03
Quadmodal 1.00 1.00 0.90 1.00 0.90 0.01
Early vs. late fusion
Model acc_val acc_train acc_test f1_score f1_test Loss
Quadmodal Early 1.00 1.00 0.90 1.00 0.90 0.01
Quadmodal Late 0.96 1.00 0.93 0.96 0.93 0.06
But how good is a model with 100% train, 100% validation, and 90% test accuracy?
Sometimes algorithmic research work ends after the computation of the
performance metrics, but …
Now that we have a model that performs well on small data, how can we test
that it works in real-life applications?
Users will provide full-length videos, so the first step is to find bloopers
within a video. Localization techniques are needed.
Video blooper localization techniques
More challenges ...
● There are no existing localization techniques for video bloopers.
● Temporal action localization techniques in untrimmed videos work mostly
for image-only processing.
● Multimodal localization is mostly limited to video indexing
● No localization methods for mixed temporal and non-temporal features
● The videos must be analyzed in small fragments.
● The analysis of multiple video fragments is costly
● Taking different frames of the same video clip can give different results.
● The output should be a time range.
Diagnosis of how predictions are distributed
To test how algorithms can find bloopers, I randomly inserted six clips
(three bloopers and three non-bloopers) into a 70-second video of the same
person. I created two test videos.
I analyzed 2-second fragments spaced 500 milliseconds apart and plotted
the results.
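The fragment schedule described above can be sketched as:

```python
def fragment_windows(duration_s, win_s=2.0, stride_s=0.5):
    """Start/end times (in seconds) of the overlapping fragments to score."""
    t, windows = 0.0, []
    while t + win_s <= duration_s:
        windows.append((t, t + win_s))
        t += stride_s
    return windows

wins = fragment_windows(70)
print(len(wins), wins[0], wins[-1])  # 137 (0.0, 2.0) (68.0, 70.0)
```

A 70-second video thus produces 137 overlapping fragments, each of which is scored by the recognition model.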
Then I compared my expectations with reality.
Expectations VS Reality
Expectation...
Expectations VS Reality
Reality...
Defining an algorithm to find the bloopers
I defined the concept of blooper_score as the predicted value of the blooper
category. Instead of using a continuous 0-to-1 scale, I used the discrete values
0, 1, and 2: 0 stands for blooper_score = 0, 1 for ‘almost 1’ (intermediate values
within a threshold range), and 2 for blooper_score = 1. The most important
pattern I found was runs of contiguous high values.
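A sketch of the discretization (the threshold values below are illustrative; the originals are not given):

```python
def discretize(score, low=0.1, high=0.9):
    """Map a blooper_score in [0, 1] to the discrete values {0, 1, 2}.
    `low` and `high` are assumed thresholds, not the originals."""
    if score <= low:
        return 0
    if score >= high:
        return 2
    return 1

print([discretize(s) for s in (0.05, 0.5, 0.97)])  # [0, 1, 2]
```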
Adding neighbors
To emphasize the values, I summed them in bins of neighboring elements.
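A minimal sketch of the neighbor binning (the bin radius is an assumption):

```python
def neighbor_sums(scores, radius=1):
    """Sum each element with its neighbors within `radius` positions."""
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - radius), min(len(scores), i + radius + 1)
        out.append(sum(scores[lo:hi]))
    return out

print(neighbor_sums([0, 2, 2, 1, 0, 0]))  # [2, 4, 5, 3, 1, 0]
```

Runs of contiguous high values get amplified, while isolated spikes stay comparatively low.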
Calculating the sequences of top 3 values
I defined a window size and calculated
the percentage of elements in each
window that are among the top 3 values.
Windows above a threshold are added to
a range.
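A sketch of the windowing step (window size, threshold, and merge rule are illustrative):

```python
def blooper_ranges(scores, win=3, threshold=0.66, top_k=3):
    """Windows whose share of top-k values exceeds the threshold,
    merged into contiguous (start, end) index ranges."""
    top_vals = sorted(set(scores), reverse=True)[:top_k]
    hits = [i for i in range(len(scores) - win + 1)
            if sum(s in top_vals for s in scores[i:i + win]) / win >= threshold]
    ranges = []
    for i in hits:
        if ranges and i <= ranges[-1][1] + 1:
            ranges[-1][1] = i + win - 1   # extend the current range
        else:
            ranges.append([i, i + win - 1])
    return [tuple(r) for r in ranges]

print(blooper_ranges([0, 1, 2, 6, 5, 6, 4, 2, 1, 0, 0, 0]))  # [(2, 7)]
```

Multiplying the returned indices by the fragment stride converts them back into time ranges.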
Result of ranges
It returned the three ranges that contained the bloopers in the video.
Millisecond accuracy would be ideal, but this approach is good enough to at
least identify them.
It also worked for the second video
But not everybody is familiar with the command line
Now we have a model that can recognize bloopers and a localization method,
but our system is not user friendly.
So there is another challenge ...
There are no automatic video editing interfaces on the web.
So I developed an open-source web interface for automatic video editing.
Web interface for automatic video editing
http://www.carlostoxtli.com/AutomEditor/frontend/
The tool helps users analyze their videos and visualize their bloopers.
For developers, it provides a simple, easy-to-integrate platform for testing
their algorithms.
Examples of the processed videos in the GUI
Future work
● Explore one of the contributions in depth.
● Data augmentation methods for video bloopers
○ Generative video bloopers ?
● Research about temporal action localization techniques in untrimmed
videos for mixed spatio-temporal modalities
● Detecting bloopers involving multiple people.
● Study the people’s interaction with AVE interfaces (HCI)
Conclusions
● Video blooper recognition benefits from multimodal techniques.
● Results from small data are not generalizable enough.
● Models for localization of mixed spatio-temporal multimodal features are
needed for reducing the time and processing load.
● AutomEditor interface can
○ Help users to edit their videos automatically online
○ Help developers to test and publish their models to the public.
Thanks
http://www.carlostoxtli.com
@ctoxtli
Back up
Link
http://bit.ly/2V3U3aS
Decision layers
The activation functions used for each output were:
Emotion (categorical): Softmax
Valence (dimensional): hyperbolic tangent function (tanh)
Arousal (dimensional): Sigmoid
Sigmoid as activation function
A sigmoid activation function turns an
activation into a value between 0 and
1. It is useful for binary classification
problems and is mostly used in the
final output layer of such problems.
Also, sigmoid activation leads to slow
gradient descent because the slope is
small for high and low values.
Hyperbolic tangent as activation function
A tanh activation function turns an
activation into a value between -1 and
+1. The outputs are normalized. The
gradient is stronger for tanh than for
sigmoid (the derivatives are steeper).
SoftMax as activation function
The softmax function is an
activation function that turns
numbers (logits) into
probabilities that sum to one.
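The three activations above can be written in a few lines of NumPy (tanh comes directly from NumPy):

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into (0, 1); suited to binary outputs."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turns a vector of logits into probabilities that sum to one."""
    e = np.exp(z - np.max(z))  # shift logits for numerical stability
    return e / e.sum()

print(sigmoid(0.0))                              # 0.5
print(np.tanh(0.0))                              # 0.0
print(softmax(np.array([2.0, 1.0, 0.1])).sum())  # 1.0
```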
MSE as loss function for linear regression
Linear regression uses Mean Squared
Error as its loss function, which gives a
convex graph, so we can complete the
optimization by finding its vertex as the
global minimum.
SGD as Optimizer
Stochastic gradient descent (SGD)
computes the gradient for each
update using a single training data
point x_i (chosen at random). The
idea is that the gradient calculated
this way is a stochastic approximation
to the gradient calculated using the
entire training data. Each update is
much faster to calculate than in
batch gradient descent, and over
many updates, we head in the
same general direction.
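A minimal SGD sketch on a toy linear-regression problem (the learning rate, step count, and data are arbitrary illustrations):

```python
import random

def sgd_linear(data, lr=0.02, steps=5000, seed=0):
    """Fit y = w*x + b by SGD: one randomly chosen point per update."""
    rng = random.Random(seed)
    w = b = 0.0
    for _ in range(steps):
        x, y = rng.choice(data)
        err = (w * x + b) - y  # derivative of 0.5 * err**2 w.r.t. prediction
        w -= lr * err * x
        b -= lr * err
    return w, b

data = [(x, 2 * x + 1) for x in range(-5, 6)]  # points on the line y = 2x + 1
w, b = sgd_linear(data)
print(round(w, 2), round(b, 2))  # converges toward 2.0 and 1.0
```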
Layers
Early fusion - Hidden layer
Early fusion Fully connected
LSTM
Late fusion
1DConv Average Pooling
1D convolutional neural nets can extract local 1D patches (subsequences) from
sequences and identify local patterns within the convolution window. A pattern
learned at one position can also be recognized at a different position, making
1D conv nets translation invariant. Some sequences are too long to be
realistically processed by RNNs; in such cases, 1D conv nets can be used as a
preprocessing step to shorten the sequence through downsampling, extracting
higher-level features that are then passed to the RNN as input.
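A sketch of the downsampling idea with a hand-rolled 1D convolution and average pooling (the kernel and pool size are illustrative):

```python
import numpy as np

def conv1d_valid(x, kernel):
    """'Valid' 1D convolution (cross-correlation, as in most DL libraries)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def avg_pool(x, size=2):
    """Non-overlapping average pooling; halves the length for size=2."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).mean(axis=1)

seq = np.arange(10, dtype=float)                       # input sequence
feat = conv1d_valid(seq, np.array([0.25, 0.5, 0.25]))  # local pattern detector
short = avg_pool(feat)                                 # downsampled for the RNN
print(len(seq), len(feat), len(short))  # 10 8 4
```

The pooled sequence is half the length of the convolution output, which is what makes the subsequent RNN tractable on long inputs.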
Batch Normalization
We normalize the input layer by
adjusting and scaling the activations
to speed up learning. The same
applies to the values in the hidden
layers, which change all the time.
VGG16

Education 3.0 - MegatendenciasEducation 3.0 - Megatendencias
Education 3.0 - Megatendencias
 
Understanding Political Manipulation and Botnets - RightsCon
Understanding Political Manipulation and Botnets - RightsConUnderstanding Political Manipulation and Botnets - RightsCon
Understanding Political Manipulation and Botnets - RightsCon
 
Understanding Chatbot-Mediated Task Management
Understanding Chatbot-Mediated Task ManagementUnderstanding Chatbot-Mediated Task Management
Understanding Chatbot-Mediated Task Management
 
Single sign on spanish - guía completa
Single sign on   spanish - guía completaSingle sign on   spanish - guía completa
Single sign on spanish - guía completa
 
Los empleos del futuro en Latinoamérica
Los empleos del futuro en LatinoaméricaLos empleos del futuro en Latinoamérica
Los empleos del futuro en Latinoamérica
 
Empleos que ya están siendo reemplazados por bots y el futuro del RPA (Roboti...
Empleos que ya están siendo reemplazados por bots y el futuro del RPA (Roboti...Empleos que ya están siendo reemplazados por bots y el futuro del RPA (Roboti...
Empleos que ya están siendo reemplazados por bots y el futuro del RPA (Roboti...
 
RPA (Robotic Process Automation)
RPA (Robotic Process Automation)RPA (Robotic Process Automation)
RPA (Robotic Process Automation)
 
Chatbots + rpa (robotic process automation)
Chatbots + rpa (robotic process automation)Chatbots + rpa (robotic process automation)
Chatbots + rpa (robotic process automation)
 
Estrategias tecnológicas de crecimiento acelerado para startups
Estrategias tecnológicas de crecimiento acelerado para startupsEstrategias tecnológicas de crecimiento acelerado para startups
Estrategias tecnológicas de crecimiento acelerado para startups
 
Tecnología del futuro, predicciones a 10 años - CiComp
Tecnología del futuro, predicciones a 10 años - CiCompTecnología del futuro, predicciones a 10 años - CiComp
Tecnología del futuro, predicciones a 10 años - CiComp
 

Kürzlich hochgeladen

SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
How To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROHow To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROmotivationalword821
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfYashikaSharma391629
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 

Kürzlich hochgeladen (20)

SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
How To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROHow To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTRO
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 

Automated Video Editing for Automatic Detection and Localization of Bloopers

  • 1.
  • 2. AutomEditor: Video blooper recognition and localization for automatic monologue video editing Carlos Toxtli
  • 3. Multimodal video action (bloopers) recognition and localization methods for spatio-temporal feature fusion by using Face, Body, Audio, and Emotion features
  • 4. Index ● Basic concepts ● Video bloopers dataset ● Features extraction ● Blooper recognition ● Blooper localization ● System implementation ● Conclusions
  • 5. Previous work on Automatic Video Editing (AVE) ● Previous work on automatic video editing focuses mostly on enhancing existing videos by adding music, transitions, zoom, camera changes, and other improvements. ● Simple silence-detection mechanisms are just beginning to appear in commercial software; however, the resulting cuts are easy to spot. ● Video summarization techniques involve editing but are content-based. ● Video action recognition is the area that studies behavioral patterns in videos. ● Video action recognition applied to blooper detection has not yet been studied in the literature.
  • 6. Problem ● According to online sources, basic video editing can take 30 minutes to an hour for each minute of finished video (a 4-minute video would take 4 hours to edit). More advanced editing (adding animations, VFX, and compositing) can take much longer. ● The time a video takes to edit discourages users from producing periodic content.
  • 7. Solution: AutomEditor A system that automates monologue video editing. ● AutomEditor is fed with example video clips (1–3 seconds each) of bloopers and non-bloopers (separated by folders). ● It extracts features and trains a model. ● It evaluates its performance. ● It localizes the blooper fragments in full-length videos. ● It shows the results in a web interface.
  • 8. End-to-end solution From database creation to the web application. https://github.com/toxtli/AutomEditor
  • 9. Main contributions ● Creation of a video bloopers dataset (Blooper DB) ● Feature extraction methods for video blooper recognition ● Video blooper recognition models ● Video blooper localization techniques ● Web interface for automatic video editing Problem: Every contribution by itself is enough for an individual publication. I could not cover all of them in-depth.
  • 10. Creation of a monologue video bloopers dataset ● ~600 videos ● Between 1 and 3 seconds per video ● Train, Validation, and Test batches ○ Train: 464 ○ Test: 66 ○ Validation: 66 ● 2 categories ○ Blooper ○ No blooper ● Stratified data
  • 11. Criteria ● I split long bloopers (more than 2 seconds) into a non-blooper clip (before the mistake) and a blooper clip (containing the mistake). ● For short bloopers (1 to 2 seconds) I found other clips of about the same length from the same video to use as non-bloopers. ● The clips do not contain truncated phrases. ● I avoided mixing green-screen and non-green-screen backgrounds as much as possible.
  • 14. Feature extraction methods for blooper recognition The main goal of this process is to extract features that are invariant to person descriptors (e.g., gender, age), scale, position, background, and language ● Audio ■ General (1) audio handcrafted features per clip (OpenSMILE) ■ Temporal (20) audio handcrafted features per clip (OpenSMILE) ● Images ○ Face ■ Temporal (20) face handcrafted features (OpenFace) ■ Temporal (20) face deep features (VGG16) ○ Body ■ Temporal (20) body handcrafted features (OpenPose) ■ Temporal (20) body deep features (VGG16) ○ Emotions ■ General (1) FER predictions (EmoPy and others) ■ Temporal (20) FER predictions (EmoPy and others)
  • 15. Audio features OpenSMILE (1582 features): The audio is extracted from the videos and processed by OpenSMILE, which extracts audio features such as loudness, pitch, jitter, etc. It was tested at video-clip length (general) and on 20 fragments (temporal).
  • 16. Face features OpenFace (709 features): A facial behavior analysis tool that provides accurate facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. We get points that represent the face. VGG16 FC6 (4096 features): The faces are cropped (224×224×3), aligned, and background-zeroed, then passed through a pretrained VGG16 to obtain a 4096-dimensional feature vector from the FC6 layer.
  • 17. Body features OpenPose (BODY_25) (11 features): The normalized angles between the joints. I did not use the raw computed feature maps because they were 25×224×224. VGG16 FC6 skeleton image (4096 features): I drew the skeleton (neck at the center) on a black background, fed it to a VGG16, and extracted the feature vector from the FC6 layer.
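The angle-based body features above can be sketched as follows. This is a minimal illustration, not code from the AutomEditor repository; `joint_angle` and `normalized_angle` are hypothetical helpers, assuming 2D OpenPose keypoints given as (x, y) tuples:

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b formed by segments b->a and b->c, in radians."""
    ang = math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(a[1] - b[1], a[0] - b[0])
    return abs(ang) % (2 * math.pi)

def normalized_angle(a, b, c):
    """Scale the angle to [0, 1] so the feature is resolution-invariant."""
    return joint_angle(a, b, c) / (2 * math.pi)
```

Because angles depend only on the relative positions of the joints, they are invariant to the person's position and scale in the frame, which matches the invariance goal stated earlier.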
  • 18. Emotion features EmoPy (7 features): A deep neural net toolkit for emotion analysis via Facial Expression Recognition (FER). Others (28 features): Four more models from different FER contest participants. With 7 categories per model, that is 35 features in total. 20 samples per video clip were predicted (temporal); from these I computed their normalized sum (general).
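The temporal-to-general step above (collapsing the 20 per-frame FER predictions into one clip-level vector via a normalized sum) can be sketched as below; `aggregate_emotions` is an illustrative name, not the repository's API:

```python
def aggregate_emotions(frame_preds):
    """Collapse per-frame emotion probability vectors (e.g. 20 x 7)
    into one clip-level vector whose entries sum to 1."""
    n_classes = len(frame_preds[0])
    sums = [sum(frame[i] for frame in frame_preds) for i in range(n_classes)]
    total = sum(sums)
    return [s / total for s in sums] if total else sums
```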
  • 19. Feature fusion The features from the same source were normalized and fused, giving the following feature sizes: Face fused (4096 + 709 = 4805 features). Body fused (4096 + 11 = 4107 features). Emotion features (7 + 7 + 7 + 7 + 7 = 35 features). Audio features came only from OpenSMILE, so they were not fused (1582 features).
  • 20. Feature sequences The extracted features are grouped into sequences to feed the RNNs. ● Each video clip (a fragment of 1 to 3 seconds) is divided into 20 equally spaced samples (i.e., 20 face images; in a 60-frame video, frames 1, 4, 7, ..., 57, 60 are processed). ● The samples are extracted from the end toward the beginning. ● This produces a matrix of [20][feature_size].
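The sampling step above can be sketched as follows, assuming the clip is addressed by frame index; `sample_frame_indices` is a hypothetical helper and the rounding scheme is one plausible choice:

```python
def sample_frame_indices(n_frames, n_samples=20):
    """Pick n_samples equally spaced frame indices, walking from the
    last frame back to the first (the clip end is sampled first)."""
    if n_frames <= n_samples:
        return list(range(n_frames - 1, -1, -1))
    step = (n_frames - 1) / (n_samples - 1)
    return [round((n_frames - 1) - i * step) for i in range(n_samples)]
```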
  • 21. Fusions Early fusion: For early fusion, features from different modalities are projected into the same joint feature space before being fed into the classifier. Late fusion: For late fusion, classifications are made on each modality and their decisions or predictions are later merged together. We used early fusion for our training cases.
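The two fusion strategies above can be illustrated with a toy sketch. Both helpers are hypothetical, and averaging is used as one possible late-fusion merge (the slide leaves the merge rule open):

```python
def early_fusion(feature_vectors):
    """Concatenate per-modality feature vectors into one joint vector
    before feeding the classifier."""
    return [x for vec in feature_vectors for x in vec]

def late_fusion(predictions):
    """Merge per-modality blooper probabilities into one decision,
    here by simple averaging."""
    return sum(predictions) / len(predictions)
```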
  • 24. Quad Model The proposed model uses all the extracted features, feeds each temporal stream into an LSTM (except the audio), and combines them by early fusion.
  • 25. LSTM
  • 26. Evaluation ● The models were trained on an NVIDIA GTX 1080 Ti graphics card. ● Since there is no previous work in this field, we used the individual feature models as baselines. ● 300 epochs ● Optimizer: Adam ● Loss: MSE ● Learning rate: 0.001
  • 27. Emotion features: Global & Temporal
  Model             acc_val  acc_train  acc_test  f1_score  f1_test  loss
  Emotion Global    0.59     0.86       0.59      0.60      0.56     0.28
  Emotion Temporal  0.62     0.99       0.69      0.66      0.63     0.32
  • 28. Body temporal features: Handcrafted & Deep
  Model      acc_val  acc_train  acc_test  f1_score  f1_test  Loss
  Body Hand  0.63     0.92       0.54      0.72      0.59     0.27
  Body Deep  0.68     0.99       0.65      0.72      0.71     0.26
  • 29. Body fusion (handcrafted + deep features)
  Model     acc_val  acc_train  acc_test  f1_score  f1_test  Loss
  Body Fus  0.66     0.98       0.66      0.74      0.69     0.22
  • 30. Face temporal features: Handcrafted & Deep
  Model      acc_val  acc_train  acc_test  f1_score  f1_test  Loss
  Face Hand  0.84     0.99       0.87      0.89      0.86     0.12
  Face Deep  0.89     1.00       0.81      0.92      0.83     0.12
  • 31. Face fusion (handcrafted + deep features)
  Model     acc_val  acc_train  acc_test  f1_score  f1_test  Loss
  Face Fus  0.89     1.00       0.89      0.92      0.84     0.09
  • 32. Audio features: Temporal & General
  Model           acc_val  acc_train  acc_test  f1_score  f1_test  Loss
  Audio Temporal  0.86     1.00       0.84      0.89      0.83     0.11
  Audio General   0.95     1.00       0.90      0.96      0.92     0.03
  • 33. Top 3: Face handcrafted + Face deep + Audio general
  Model     acc_val  acc_train  acc_test  f1_score  f1_test  Loss
  Aud+Face  0.96     1.00       0.90      0.98      0.92     0.03
  • 34. All (Quadmodal): BodyTF + FaceTF + AudioG + EmoT
  Model  acc_val  acc_train  acc_test  f1_score  f1_test  Loss
  All    1.00     1.00       0.90      1.00      0.90     0.01
  • 35. Confusion matrices of the Quadmodal model (train, validation, and test)
  • 36. Results
  Model       acc_val  acc_train  acc_test  f1_score  f1_test  Loss
  Emotion Gl  0.59     0.86       0.59      0.60      0.56     0.28
  Emotion Te  0.62     0.99       0.69      0.66      0.63     0.32
  Body Feat   0.63     0.92       0.54      0.72      0.59     0.27
  Body Fus    0.66     0.98       0.66      0.74      0.69     0.26
  Body Vis    0.68     0.99       0.65      0.72      0.71     0.22
  Face Feat   0.84     0.99       0.87      0.89      0.86     0.12
  Audio Te    0.86     1.00       0.84      0.89      0.83     0.11
  Face Vis    0.89     1.00       0.81      0.92      0.83     0.12
  Face Fus    0.89     1.00       0.89      0.92      0.84     0.09
  Audio       0.95     1.00       0.90      0.96      0.92     0.03
  Aud+Face    0.96     1.00       0.90      0.98      0.92     0.03
  Quadmodal   1.00     1.00       0.90      1.00      0.90     0.01
  • 37. Early vs. late fusion
  Model            acc_val  acc_train  acc_test  f1_score  f1_test  Loss
  Quadmodal Early  1.00     1.00       0.90      1.00      0.90     0.01
  Quadmodal Late   0.96     1.00       0.93      0.96      0.93     0.06
  • 38. But how good is a model with 100% train, 100% validation, and 90% test accuracy? Sometimes algorithmic research ends after computing the performance metrics, but... Now that we have a model with good performance on small data, how can we test that it works for real-life applications? Full-length videos will be provided by users, so the first step is to find the bloopers in a video. Localization techniques are needed.
  • 39. Video blooper localization techniques More challenges... ● There are no existing localization techniques for video bloopers. ● Temporal action localization techniques for untrimmed videos work mostly on image-only processing. ● Localization in multimodal settings is mostly limited to video indexing. ● There are no localization methods for mixed temporal and non-temporal features. ● The videos must be analyzed in small fragments. ● The analysis of multiple video fragments is costly. ● Taking different frames of the same video clip can give different results. ● The output should be a time range.
  • 40. Diagnosis of how predictions are distributed To test how the algorithms find bloopers, I randomly inserted 6 clips (3 bloopers and 3 non-bloopers) into a 70-second video of the same person, creating two test videos. I analyzed 2-second fragments spaced 500 milliseconds apart and plotted the results, then compared my expectations vs. reality.
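The fragment schedule described above (2-second windows starting every 500 ms) can be sketched as below; `fragment_windows` is a hypothetical helper:

```python
def fragment_windows(duration_s, win_s=2.0, stride_s=0.5):
    """(start, end) times of the overlapping analysis fragments."""
    windows, start = [], 0.0
    while start + win_s <= duration_s + 1e-9:
        windows.append((round(start, 3), round(start + win_s, 3)))
        start += stride_s
    return windows
```

For the 70-second diagnostic video this yields 137 overlapping fragments, which is why the slide notes that analyzing many fragments is costly.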
  • 43. Defining an algorithm to find the bloopers I defined the blooper_score as the predicted value of the blooper category. Instead of using a continuous 0-to-1 scale, I used the discrete values 0, 1, and 2: 0 stands for blooper_score = 0, 1 for 'almost 1' (intermediate values within a threshold range), and 2 for blooper_score = 1. The most important pattern I found was contiguous runs of high values.
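The discretization step can be sketched as follows. The slide fixes only the endpoints (0 and 1), so the 0.2/0.8 cutoffs below are illustrative assumptions, and `discretize_score` is a hypothetical name:

```python
def discretize_score(score, low=0.2, high=0.8):
    """Map a blooper probability to the discrete levels 0, 1, or 2.
    Thresholds are illustrative; the deck does not specify them."""
    if score <= low:
        return 0
    if score >= high:
        return 2
    return 1
```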
  • 44. Adding neighbors To emphasize the values, I summed them in bins of neighboring elements.
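The neighbor-binning step can be sketched as a sliding sum; `neighbor_sums` is a hypothetical helper and the radius of 1 is an assumption:

```python
def neighbor_sums(scores, radius=1):
    """Sum each discrete score with its neighbors to emphasize
    runs of contiguous high values."""
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - radius), min(len(scores), i + radius + 1)
        out.append(sum(scores[lo:hi]))
    return out
```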
  • 45. Calculating the sequences of top-3 values I defined a window size and calculated the percentage of elements in the window that are among the top 3 values. I used a threshold to decide whether to add the window to a range.
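One way to implement the windowed top-3 check is sketched below. The window size and ratio threshold are not given in the deck, so the values here are assumptions, and `top_value_ranges` is a hypothetical name:

```python
def top_value_ranges(sums, window=4, ratio=0.5):
    """Slide a window over the neighbor sums; wherever at least `ratio`
    of its elements are among the 3 highest distinct values, extend a
    candidate range. Window size and ratio are illustrative."""
    top3 = sorted(set(sums), reverse=True)[:3]
    ranges, start = [], None
    for i in range(len(sums) - window + 1):
        frac = sum(1 for v in sums[i:i + window] if v in top3) / window
        if frac >= ratio:
            start = i if start is None else start
        elif start is not None:
            ranges.append((start, i + window - 1))
            start = None
    if start is not None:
        ranges.append((start, len(sums) - 1))
    return ranges
```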
  • 46. Result of ranges It returned the 3 ranges that contained the bloopers in the video. Millisecond accuracy would be needed for final cuts, but this approach is good enough to at least identify them.
  • 47. It also worked for the second video
  • 48. But not everybody is familiar with the command line We now have a model that can recognize bloopers and a localization method, but the system is not user-friendly, so there is another challenge... There were no automatic video editing interfaces on the web, so I developed an open-source web interface for automatic video editing.
  • 49. Web interface for automatic video editing http://www.carlostoxtli.com/AutomEditor/frontend/ The tool helps users analyze their videos and visualize their bloopers. For developers, it provides a simple, easy-to-integrate platform for testing their algorithms.
  • 50. Examples of the processed videos in the GUI
  • 51. Future work ● Explore one of the contributions in depth. ● Data augmentation methods for video bloopers ○ Generative video bloopers? ● Research temporal action localization techniques in untrimmed videos for mixed spatio-temporal modalities. ● Detect bloopers in videos with multiple people. ● Study people's interaction with AVE interfaces (HCI).
  • 52. Conclusions ● Video blooper recognition benefits from multimodal techniques. ● Results from small data are not generalizable enough. ● Models for localizing mixed spatio-temporal multimodal features are needed to reduce time and processing load. ● The AutomEditor interface can ○ help users edit their videos automatically online ○ help developers test and publish their models to the public.
  • 56. Decision layers The activation functions used for each metric were: Emotion (categorical): softmax. Valence (dimensional): hyperbolic tangent (tanh). Arousal (dimensional): sigmoid.
  • 57. Sigmoid as activation function A sigmoid activation function turns an activation into a value between 0 and 1. It is useful for binary classification problems and is mostly used in the final output layer of such problems. Also, sigmoid activation leads to slow gradient descent because the slope is small for high and low values.
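The sigmoid described above is a one-liner; the near-zero slope for large |x| is the slow-gradient effect the slide mentions:

```python
import math

def sigmoid(x):
    """Squash an activation into (0, 1). The derivative
    sigmoid(x) * (1 - sigmoid(x)) is tiny when |x| is large."""
    return 1.0 / (1.0 + math.exp(-x))
```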
  • 58. Hyperbolic tangent as activation function A Tanh activation function turns an activation into a value between -1 and +1. The outputs are normalized. The gradient is stronger for tanh than sigmoid (derivatives are steeper)
  • 59. SoftMax as activation function The Softmax function is an activation function that turns numbers (logits) into probabilities that sum to one.
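A minimal softmax sketch follows. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick added here, not something the slide mentions:

```python
import math

def softmax(logits):
    """Turn logits into probabilities that sum to one; subtracting the
    max first keeps exp() from overflowing."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```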
  • 60. MSE as loss function for linear regression Linear regression uses Mean Squared Error as its loss function, which gives a convex graph, so we can complete the optimization by finding its vertex as the global minimum.
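The MSE loss is the average squared residual:

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```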
  • 61. SGD as Optimizer Stochastic gradient descent (SGD) computes the gradient for each update using a single training data point x_i (chosen at random). The idea is that the gradient calculated this way is a stochastic approximation of the gradient calculated using the entire training data. Each update is much faster to compute than in batch gradient descent, and over many updates we head in the same general direction.
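Single-sample SGD can be demonstrated on a toy linear regression with the MSE loss from the previous slide. Everything here (function name, learning rate, epoch count) is illustrative:

```python
import random

def sgd_linear(xs, ys, lr=0.05, epochs=300, seed=0):
    """Fit y = w*x + b with single-sample SGD updates on squared error."""
    rng = random.Random(seed)
    w = b = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)          # visit the data points in random order
        for i in idx:
            err = (w * xs[i] + b) - ys[i]   # gradient of 0.5 * err**2
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b
```

On noiseless data generated by y = 2x + 1, the updates converge to w ≈ 2 and b ≈ 1.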
  • 62. Layers Early fusion - Hidden layer Early fusion Fully connected LSTM Late fusion
  • 63. 1DConv Average Pooling 1D convolutional neural nets can extract local 1D patches (subsequences) from sequences and identify local patterns within the convolution window. A pattern learned at one position can also be recognized at a different position, making 1D conv nets translation-invariant. Some sequences are so long that they cannot realistically be processed by RNNs; in such cases, 1D conv nets can be used as a preprocessing step to shorten the sequence through downsampling, extracting higher-level features that can then be passed to the RNN as input.
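The shortening effect can be shown with a pure-Python sketch of a valid 1D convolution followed by average pooling; `conv1d_avgpool` is a hypothetical helper, not a library call:

```python
def conv1d_avgpool(seq, kernel, pool=2):
    """Valid 1D convolution followed by average pooling: a cheap way to
    downsample a sequence before handing it to an RNN."""
    k = len(kernel)
    conv = [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]
    return [sum(conv[i:i + pool]) / pool
            for i in range(0, len(conv) - pool + 1, pool)]
```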
  • 64. Batch Normalization We normalize the input layer by adjusting and scaling the activations to speed up learning, and do the same for the values in the hidden layers, which change all the time.
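The core of the normalization is zero-mean, unit-variance scaling per batch. The sketch below omits the learned scale and shift parameters (gamma and beta) that a full batch-norm layer also applies:

```python
def batch_norm(batch, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance.
    The learned gamma/beta parameters are omitted for brevity."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]
```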
  • 65. VGG16