SlideShare ist ein Scribd-Unternehmen logo
1 von 60
Downloaden Sie, um offline zu lesen
Video + Language: Where Does
Domain Knowledge Fit in?
Jiebo Luo
Department of Computer Science
July 10, 2016
Keynote@2016 IJCAI Workshop on Semantic Machine Learning
Domain Knowledge in Machine Learning
• Domain knowledge is used frequently in ML applications
(sometimes without knowing that you are actually doing it)
– A good example is feature extraction. What features to use?
– Other uses include objective function, parameter selection
– Even in deep learning (architecture, learning rate, etc.)
– Certainly probabilistic graphical models (including priors)
– Data cleaning (yes!)
• Context models encode domain knowledge
– Spatial context (e.g., in computer vision)
– Temporal context (e.g., in sequence analysis)
– Social context (e.g., in social media data mining)
• We will focus on some less obvious, more sophisticated forms
of domain knowledge, especially in the area of “vision and
language”, an emerging fertile ground in machine learning
IEEE Signal Processing, 2005
Introduction
• Video has become ubiquitous on the Internet, TV, as well as
personal devices.
• Recognition of video content has been a fundamental challenge
in computer vision for decades, where previous research
predominantly focused on understanding videos using a
predefined yet limited vocabulary.
• Thanks to the recent development of deep learning techniques,
researchers in both computer vision and multimedia
communities are now striving to bridge video with
natural language, which can be regarded as the ultimate goal of
video understanding.
• We present recent advances in exploring the synergy of video
understanding and language processing, including video-
language alignment, video captioning, and video emotion
analysis.
Video Growth
• CISCO Trends
6
SEMANTICS <> USER INTENT
SEMANTICS IN LARGE SCALE
VIDEO SEARCH & RETRIEVAL
SEMANTIC VIDEO ENTITY LINKING
Users
User
Modeling
User Content
Understanding
Learning
From User
Tags
Yuncheng Li, Xitong Yang, Jiebo Luo
Semantic Video Entity Linking
Semantic Video Entity Linking
• Motivations to use visual content
1. Video entity linking is very challenging with only
title & descriptions, especially for UGC
2. Video entity linking must be of high quality
3. The visual content of a video truly represents the
user intent for video watching and sharing
Semantic Video Entity Linking
Semantic Video Entity Linking
• Sequence-to-Set matching
Key Frame Sequence
…
Representative
Image Set
…
Match?
Semantic Video Entity Linking
• Train Data: Triplets (Keyframes, Images, Label)
Key Frame Sequence
…
Representative Image Set
Yes
Key Frame Sequence
No
Representative Image Set
Semantic Video Entity Linking
• Goal: Keyframe-Image distance (metric)
Key Frame Sequence
…
Representative
Image Set
…
Semantic Video Entity Linking
• Challenge: Not all pairs are true matches
Key Frame Sequence
…
Representative
Image Set
…
Yes
Semantic Video Entity Linking
• Solution: Multiple Instance Metric Learning
– EM like algorithm:
• E-Step: infer the true matches
• M-Step: Retrain the metric
• More Challenges
– Overwhelming noise (variability & ambiguity) in
both visual content
• Solution
– Structured constraints to suppress noise
Semantic Video Entity Linking
Title: “Jason Taylor Career Highlights”
Semantic Video Entity Linking
• Experiments
– Dataset: 1920 videos
– Labels: Amazon Mechanical Turk
Semantic Video Entity Linking
We propose two constraints to guide the video entity linking process:
1. Temporal Smoothness: the entity occurrence (matches with any of the representative
images) is smooth over time
2. Representativeness Smoothness: in order to reduce the significant irrelevant
information
Conclusions
• Metric Learning helps by adapting to different
topics and domains
• Structured constraints are important for
suppressing noises
• Future work includes integrating the video
metadata information and building entity
integrated applications, e.g., video spam
detection
Semantic Video Entity Linking.
An Application: Video Spam Removal
• “guardians of the galaxy full movie”
– Let’s watch the movie
Unsupervised Alignment of Actions
in Video with Text Descriptions
Y. Song, I. Naim, A. Mamun, K. Kulkarni, P. Singla
J. Luo, D. Gildea, H. Kautz
Overview
• Unsupervised alignment of video with text
• Motivations
– Generate labels from data (reduce burden of manual labeling)
– Learn new actions from only parallel video+text
– Extend noun/object matching to verbs and actions
Matching Verbs to Actions
The person takes out a knife
and cutting board
Matching Nouns to Objects
[Naim et al., 2015]
An overview of the text and
video alignment framework
Hyperfeatures for Actions
• High-level features required for alignment with text
→ Motion features are generally low-level
• Hyperfeatures, originally used for image recognition extended
for use with motion features
→ Use temporal domain instead of spatial domain for
vector quantization (clustering)
Originally described in “Hyperfeatures:
Multilevel Local Coding for Visual Recog-
nition” Agarwal, A. (ECCV 06), for images Hyperfeatures for actions
Hyperfeatures for Actions
• From low-level motion features, create high-level
representations that can easily align with verbs in text
Cluster 3
at time t
Accumulate over
frame at time t
& cluster
Conduct vector
quantization
of the histogram
at time t
Cluster 3, 5, …,5,20
= Hyperfeature 6
Each color code
is a vector
quantized
STIP point
Vector quantized
STIP point histogram at time t
Accumulate clusters over
window (t-w/2, t+w/2]
and conduct vector
quantization
→ first-level hyperfeatures
Align hyperfeatures
with verbs from text
(using LCRF)
Latent-variable CRF Alignment
• CRF where the latent variable is the alignment
– N pairs of video/text observations {(xi, yi)}i=1 (indexed by i)
– Xi,m represents nouns and verbs extracted from the mth sentence
– Yi,n represents blobs and actions in interval n in the video
• Conditional likelihood
– conditional probability of
• Learning weights w
– Stochastic gradient descent
where
feature function
More details in Naim et al. 2015 NAACL Paper -
Discriminative unsupervised alignment of natural language instructions with corresponding video segments
N
Experiments: Wetlab Dataset
• RGB-Depth video with lab protocols in text
– Compare addition of hyperfeatures generated from motion features to
previous results (Naim et al. 2015)
• Small improvement over previous results
– Activities already highly correlated with object-use
Detection of objects in 3D space
using color and point-cloud
Previous results
using object/noun
alignment only
Addition of different types
of motion features
2DTraj: Dense trajectories
*Using hyperfeature window size w=150
Experiments: TACoS Dataset
• RGB video with crowd-sourced text descriptions
– Activities such as “making a salad,” “baking a cake”
– No object recognition, alignment using actions only
– Uniform: Assume each sentence takes the same amount of time over the entire sequence
– Segmented LCRF: Assume the segmentation of actions is known, infer only the action labels
– Unsupervised LCRF: Both segmentation and alignment are unknown
• Effect of window size and number of clusters
– Consistent with average
action length: 150 frames
*Using hyperfeature
window size w=150
*d(2)=64
Experiments: TACoS Dataset
• Segmentation from a sequence in the dataset
Crowd-sourced descriptions
Example of text and video alignment generated
by the system on the TACoS corpus for sequence s13-d28
Image Captioning with Semantic
Attention (CVPR 2016)
Quanzeng You, Jiebo Luo
Hailin Jin, Zhaowen Wang and Chen Fang
Image Captioning
• Motivations
– Real-world Usability
• Help visually impaired people, learning-impaired
– Improving Image Understanding
• Classification, Objection detection
– Image Retrieval
1. a young girl inhales with the intent of blowing out
a candle
2. girl blowing out the candle on an ice cream
1. A shot from behind home
plate of children playing
baseball
2. A group of children playing
baseball in the rain
3. Group of baseball players
playing on a wet field
Introduction of Image Captioning
• Machine learning as an approach to solve the problem
Model sentence
1. A young girl inhales with the intent of blowing out a
candle.
2. A young girl is preparing to blow out her candle.
3. A kid is to blow out the single candle in the bowl of
birthday goodness.
4. Girl blowing out the candle on an ice-cream
5. A little girl is getting ready to blow out a candle on a small
dessert.
1. A shot from behind home plate of children playing
baseball
2. A group of children playing baseball in the rain
3. Group of baseball players playing on a wet field
4. A batter leaning back so they don’t get hit by a ball
5. A group of young boys playing baseball in the rain
1. A girl in a park area flies
a multi-colored kite.
2. A girl flying a kit in the
sky
3. A young woman flying a
rainbow colored kite.
4. A person in a large field
flying a kite in the sky.
5. A woman looks up at her
colorful sailing kite.
Overview
• Brief overview of current approaches
• Our main motivation
• The proposed semantic attention model
• Evaluation results
Brief Introduction of Recurrent Neural Network
• Different from CNN
11),( −− +== ttttt BhAxhxfh
tt Chy =
• Unfolding over time
 Feedforward network
 Backpropagation Through Time
Inputs
Hidden Units
Outputs
xt ht-1
ht
yt
...
Convolutional Neural Network
Inputs
Hidden Units
Outputs
C
yt
Inputs
Hidden Units
Inputs
Hidden Units
B
B
A
A
A B
t-1
t-2
Applications of Recurrent Neural Networks
• Machine Translation
• Reads input sentence “ABC” and produces
“WXYZ”
Decoder RNNEncoder RNN
Encoder-Decoder Framework for Captioning
• Inspired by neural network based machine translation
• Loss function
∑=
−−=
−=
N
t
tt wwIwp
IwpL
1
10 ),,,|(log
)|(log

...
Convolutional Neural Network
w1
wstart
w2
w1
wN
wN-1
wend
wN

Image
Recurrent Neural Network
...
Convolutional Neural
Network

#Start#
Some
riverSome
elephants .
Some elephants roaming around
on a river bank.
bank
bank
Our Motivation
• Additional textual information
– Own noisy title, tags or captions (Web)
– Visually similar nearest neighbor images
– Success of low-level tasks
• Visual attributes detection
Image Captioning with Semantic Attention
• Main idea
RNN ...
CNN
attention
wave
riding
man
surfboard
ocean
water
surfer
surfing
person
board
0
0.1
0.2
0.3
surfboard
wave
surfing
First Idea
• Provide additional knowledge at each input node
• Concatenate the input word and the extra attributes K
• Each image has a fixed keyword list
)],,([),( 11 −− +== tktttt hbKWwfhxfh
Visual Features: 1024 GoogleNet
LSTM Hidden states: 512
Training details:
1. 256 image/sentence pairs
2. RMS-Prob
...
Convolutional Neural Network
w1
wstart
w2
w1
wN
wN-1
wend
wN

Retrieve Tags, titles,
descriptions
from weak annotated
images
Feature
extraction
K
Keywords,
key-phrase
Image
Recurrent Neural Network
Using Attributes along with Visual Features
• Provide additional knowledge at each input node
• Concatenate the visual embedding and keywords for h0
];[),( 10 bKWvWhvfh kiv +== −
...
Convolutional Neural Network
w1
wstart
w2
w1
wN
wN-1
wend
wN

Retrieve Tags, titles,
descriptions
from weak annotated
images
Feature
extraction
K
Keywords,
key-phrase
Image
h0
Recurrent Neural Network
Attention Model on Attributes
• Instead of using the same set of attributes at every step
• At each step, select the attributes
∑= m mtmt kKwatt α),(
)softmax VK(wT
tt =α
))],,(;([),( 11 −− == tttttt hKwattxfhxfh
...
CNN
w1
wstart
w2
w1
wN
wN-1
wend
wN

Retrieve Tags, titles,
descriptions
from weak annotated
images
Feature
extraction
K
Keywords,
key-phrase
RNN
Overall Framework
• Training with a bilinear/bilateral attention model
ht
pt
xt
v
{Ai}
Yt~
𝜑
𝜙
RNN
Image
CNN
AttrDet 1
AttrDet 2
AttrDet 3
AttrDet N

RNN ...
CNN
attention
wave
riding
man
surfboard
ocean
water
surfer
surfing
person
board
0
0.1
0.2
0.3
surfboard
wave
surfing
Visual Attributes
• A secondary contribution
• We try different approaches
vase flowers bathroom table glass sink blue
small white clear
k-NN
sitting table small many little glass different
flowers vase shown
Multi-label Ranking
vase flowers table glass sitting kitchen water
room white filled
FCN
Performance
• Examples showing the impact of visual attributes on captions
Google NIC
Top-5 visual
attributes
ATT-FCN
a white plate
topped with a
variety of food.
a plate with a
sandwich and
french fries.
plate broccoli
fries food
french
a baby with a
toothbrush in
its mouth.
a baby is eating
a piece of
paper.
teeth brushing
toothbrush
holding baby
a traffic light is
on a city street.
a street with
cars and a
clock tower.
street sign cars
clock traffic
a yellow and
black train on a
track.
a train traveling
down tracks
next to a
building.
train tracks
clock tower
down
a close up of a
plate of food on
a table.
a table topped
with a cake
with candles on
it.
a teddy bear
sitting on top of
a chair .
a white teddy
bear sitting
next to a
stuffed animal .
a person is
holding colorful
umbrella.
a black
umbrella sitting
on top of a
sandy beach .
a woman is
holding a cell
phone in her
hand .
a woman
holding a pair
of scissors in
her hands .
cake table plate
sitting birthday
teddy cat bear
stuffed white
umbrella beach
water sitting
boat
woman
bathroom her
scissors man
Performance on the Testing Dataset
• Publicly available split
Performance
• MS-COCO Image Captioning Challenge
TGIF: A New Dataset and Benchmark
on Animated GIF Description
Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry
Goldberg, Jiebo Luo
Overview
Comparison with Existing Datasets
Examples
a skate boarder is doing
trick on his skate board. a gloved hand opens to
reveal a golden ring.
a sport car is swinging on the
race playground
the vehicle is moving fast
into the tunnel
Contributions
• A large scale animated GIF description dataset for
promoting image sequence modeling and
research
• Performing automatic validation to collect natural
language descriptions from crowd workers
• Establishing baseline image captioning methods
for future benchmarking
• Comparison with existing datasets, highlighting
the benefits with animated GIFs
In Comparison with Existing Datasets
• The language in our dataset is closer to common
language
• Our dataset has an emphasis on the verbs
• Animated GIFs are more coherent and self contained
• Our dataset can be used to solve more difficult
movie description problem
Machine Generated Sentence
Examples
Machine Generated Sentence
Examples
Machine Generated Sentence Examples
Comparing Professionals and Crowd-workers
Crowd worker: two people are kissing on a boat.
Professional: someone glances at a kissing couple then
steps to a railing overlooking the ocean an older man
and woman stand beside him.
Crowd worker: two men got into their car and
not able to go anywhere because the wheels
were locked.
Professional: someone slides over the
camaros hood then gets in with his partner he
starts the engine the revving vintage car starts
to backup then lurches to a halt.
Crowd worker: a man in a shirt and tie sits beside a person who is covered in a
sheet.
Professional: he makes eye contact with the woman for only a second.
More: http://beta-
web2.cloudapp.net/ls
mdc_sentence_compar
ison.html
Movie Descriptions versus TGIF
• Crowd workers are encouraged to describe the
major visual content directly, and not to use
overly descriptive language
• Because our animated GIFs are presented to
crowd workers without any context, the
sentences in our dataset are more self-contained
• Animated GIFs are perfectly segmented since
they are carefully curated by online users to
create a coherent visual story
Where is CV (AI) in 2016?
Winter 2002 Fall 2003 Summer 2008
Image/Video captioning
看图识字 看图说话
Thanks
Q & A
Google
***
Baidu
Sogou
Bing
XiaoIce
Are you smarter than a 5th grader?
What doe it take to go from a 5-
year old to a 5th grader?
1. Learning from “small data”
2. Unsupervised learning
3. Transfer learning
4. Integration of domain
knowledge or experience
Visual Intelligence & Social Multimedia Analytics
www.cs.rochester.edu/u/jluo
Questions?

Weitere ähnliche Inhalte

Ähnlich wie Video + Language: Where Does Domain Knowledge Fit in?

The Frontier of Deep Learning in 2020 and Beyond
The Frontier of Deep Learning in 2020 and BeyondThe Frontier of Deep Learning in 2020 and Beyond
The Frontier of Deep Learning in 2020 and BeyondNUS-ISS
 
Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Sudeep Das, Ph.D.
 
How to use transfer learning to bootstrap image classification and question a...
How to use transfer learning to bootstrap image classification and question a...How to use transfer learning to bootstrap image classification and question a...
How to use transfer learning to bootstrap image classification and question a...Wee Hyong Tok
 
Deep learning introduction
Deep learning introductionDeep learning introduction
Deep learning introductionAdwait Bhave
 
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Scale adaptive dictionary learning
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Scale adaptive dictionary learningIEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Scale adaptive dictionary learning
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Scale adaptive dictionary learningIEEEBEBTECHSTUDENTPROJECTS
 
Image Processing and Computer Vision in iOS
Image Processing and Computer Vision in iOSImage Processing and Computer Vision in iOS
Image Processing and Computer Vision in iOSOge Marques
 
Video analysis techniques
Video analysis techniquesVideo analysis techniques
Video analysis techniquesLiz FitzGerald
 
Long-term Face Tracking in the Wild using Deep Learning
Long-term Face Tracking in the Wild using Deep LearningLong-term Face Tracking in the Wild using Deep Learning
Long-term Face Tracking in the Wild using Deep LearningElaheh Rashedi
 
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f..."Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...Edge AI and Vision Alliance
 
Intro to oop.pptx
Intro to oop.pptxIntro to oop.pptx
Intro to oop.pptxUmerUmer25
 
Week1- Introduction.pptx
Week1- Introduction.pptxWeek1- Introduction.pptx
Week1- Introduction.pptxfahmi324663
 
Introduction To Design Patterns Class 4 Composition vs Inheritance
 Introduction To Design Patterns Class 4 Composition vs Inheritance Introduction To Design Patterns Class 4 Composition vs Inheritance
Introduction To Design Patterns Class 4 Composition vs InheritanceBlue Elephant Consulting
 
Searching Images: Recent research at Southampton
Searching Images: Recent research at SouthamptonSearching Images: Recent research at Southampton
Searching Images: Recent research at SouthamptonJonathon Hare
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Lucidworks
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsRoelof Pieters
 
Deep Learning: a birds eye view
Deep Learning: a birds eye viewDeep Learning: a birds eye view
Deep Learning: a birds eye viewRoelof Pieters
 
Deep learning with tensorflow
Deep learning with tensorflowDeep learning with tensorflow
Deep learning with tensorflowCharmi Chokshi
 
Week3-Deep Neural Network (DNN).pptx
Week3-Deep Neural Network (DNN).pptxWeek3-Deep Neural Network (DNN).pptx
Week3-Deep Neural Network (DNN).pptxfahmi324663
 

Ähnlich wie Video + Language: Where Does Domain Knowledge Fit in? (20)

Video + Language 2019
Video + Language 2019Video + Language 2019
Video + Language 2019
 
The Frontier of Deep Learning in 2020 and Beyond
The Frontier of Deep Learning in 2020 and BeyondThe Frontier of Deep Learning in 2020 and Beyond
The Frontier of Deep Learning in 2020 and Beyond
 
Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it!
 
How to use transfer learning to bootstrap image classification and question a...
How to use transfer learning to bootstrap image classification and question a...How to use transfer learning to bootstrap image classification and question a...
How to use transfer learning to bootstrap image classification and question a...
 
Deep learning introduction
Deep learning introductionDeep learning introduction
Deep learning introduction
 
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Scale adaptive dictionary learning
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Scale adaptive dictionary learningIEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Scale adaptive dictionary learning
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS Scale adaptive dictionary learning
 
Image Processing and Computer Vision in iOS
Image Processing and Computer Vision in iOSImage Processing and Computer Vision in iOS
Image Processing and Computer Vision in iOS
 
Video analysis techniques
Video analysis techniquesVideo analysis techniques
Video analysis techniques
 
Long-term Face Tracking in the Wild using Deep Learning
Long-term Face Tracking in the Wild using Deep LearningLong-term Face Tracking in the Wild using Deep Learning
Long-term Face Tracking in the Wild using Deep Learning
 
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f..."Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
 
Deep learning for NLP
Deep learning for NLPDeep learning for NLP
Deep learning for NLP
 
Intro to oop.pptx
Intro to oop.pptxIntro to oop.pptx
Intro to oop.pptx
 
Week1- Introduction.pptx
Week1- Introduction.pptxWeek1- Introduction.pptx
Week1- Introduction.pptx
 
Introduction To Design Patterns Class 4 Composition vs Inheritance
 Introduction To Design Patterns Class 4 Composition vs Inheritance Introduction To Design Patterns Class 4 Composition vs Inheritance
Introduction To Design Patterns Class 4 Composition vs Inheritance
 
Searching Images: Recent research at Southampton
Searching Images: Recent research at SouthamptonSearching Images: Recent research at Southampton
Searching Images: Recent research at Southampton
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word Embeddings
 
Deep Learning: a birds eye view
Deep Learning: a birds eye viewDeep Learning: a birds eye view
Deep Learning: a birds eye view
 
Deep learning with tensorflow
Deep learning with tensorflowDeep learning with tensorflow
Deep learning with tensorflow
 
Week3-Deep Neural Network (DNN).pptx
Week3-Deep Neural Network (DNN).pptxWeek3-Deep Neural Network (DNN).pptx
Week3-Deep Neural Network (DNN).pptx
 

Mehr von Goergen Institute for Data Science

When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...
When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...
When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...Goergen Institute for Data Science
 
A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...
A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...
A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...Goergen Institute for Data Science
 
When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...
When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...
When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...Goergen Institute for Data Science
 
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...Goergen Institute for Data Science
 

Mehr von Goergen Institute for Data Science (11)

Vision and Language: Past, Present and Future
Vision and Language: Past, Present and FutureVision and Language: Past, Present and Future
Vision and Language: Past, Present and Future
 
Learning with Unpaired Data
Learning with Unpaired DataLearning with Unpaired Data
Learning with Unpaired Data
 
How to properly review AI papers?
How to properly review AI papers?How to properly review AI papers?
How to properly review AI papers?
 
When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...
When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...
When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...
 
Computer Vision++: Where Do We Go from Here?
Computer Vision++: Where Do We Go from Here?Computer Vision++: Where Do We Go from Here?
Computer Vision++: Where Do We Go from Here?
 
A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...
A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...
A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...
 
When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...
When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...
When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...
 
Big Data Better Life
Big Data Better LifeBig Data Better Life
Big Data Better Life
 
Social Multimedia as Sensors
Social Multimedia as SensorsSocial Multimedia as Sensors
Social Multimedia as Sensors
 
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
 
Video + Language: Where Does Domain Knowledge Fit in?
Video + Language: Where Does Domain Knowledge Fit in?Video + Language: Where Does Domain Knowledge Fit in?
Video + Language: Where Does Domain Knowledge Fit in?
 

Kürzlich hochgeladen

9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 

Kürzlich hochgeladen (20)

9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 

Video + Language: Where Does Domain Knowledge Fit in?

  • 1. Video + Language: Where Does Domain Knowledge Fit in? Jiebo Luo Department of Computer Science July 10, 2016 Keynote@2016 IJCAI Workshop on Semantic Machine Learning
  • 2. Domain Knowledge in Machine Learning • Domain knowledge is used frequently in ML applications (sometimes without knowing that you are actually doing it) – A good example is feature extraction. What features to use? – Other uses include objective function, parameter selection – Even in deep learning (architecture, learning rate, etc.) – Certainly probabilistic graphical models (including priors) – Data cleaning (yes!) • Context models encode domain knowledge – Spatial context (e.g., in computer vision) – Temporal context (e.g., in sequence analysis) – Social context (e.g., in social media data mining) • We will focus on some less obvious, more sophisticated forms of domain knowledge, especially in the area of “vision and language”, an emerging fertile ground in machine learning
  • 4. Introduction • Video has become ubiquitous on the Internet, TV, as well as personal devices. • Recognition of video content has been a fundamental challenge in computer vision for decades, where previous research predominantly focused on understanding videos using a predefined yet limited vocabulary. • Thanks to the recent development of deep learning techniques, researchers in both computer vision and multimedia communities are now striving to bridge video with natural language, which can be regarded as the ultimate goal of video understanding. • We present recent advances in exploring the synergy of video understanding and language processing, including video- language alignment, video captioning, and video emotion analysis.
  • 7. SEMANTICS IN LARGE SCALE VIDEO SEARCH & RETRIEVAL
  • 8. SEMANTIC VIDEO ENTITY LINKING Users User Modeling User Content Understanding Learning From User Tags Yuncheng Li, Xitong Yang, Jiebo Luo
  • 10. Semantic Video Entity Linking • Motivations to use visual content 1. Video entity linking is very challenging with only title & descriptions, especially for UGC 2. Video entity linking must be of high quality 3. The visual content of a video truly represents the user intent for video watching and sharing
  • 12. Semantic Video Entity Linking • Sequence-to-Set matching Key Frame Sequence … Representative Image Set … Match?
  • 13. Semantic Video Entity Linking • Train Data: Triplets (Keyframes, Images, Label) Key Frame Sequence … Representative Image Set Yes Key Frame Sequence No Representative Image Set
  • 14. Semantic Video Entity Linking • Goal: Keyframe-Image distance (metric) Key Frame Sequence … Representative Image Set …
  • 15. Semantic Video Entity Linking • Challenge: Not all pairs are true matches Key Frame Sequence … Representative Image Set … Yes
  • 16. Semantic Video Entity Linking • Solution: Multiple Instance Metric Learning – EM like algorithm: • E-Step: infer the true matches • M-Step: Retrain the metric • More Challenges – Overwhelming noise (variability & ambiguity) in both visual content • Solution – Structured constraints to suppress noise
  • 17. Semantic Video Entity Linking Title: “Jason Taylor Career Highlights”
  • 18. Semantic Video Entity Linking • Experiments – Dataset: 1920 videos – Labels: Amazon Mechanical Turk
  • 19. Semantic Video Entity Linking We propose two constraints to guide the video entity linking process: 1. Temporal Smoothness: the entity occurrence (matches with any of the representative images) is smooth over time 2. Representativeness Smoothness: in order to reduce the significant irrelevant information
  • 20. Conclusions • Metric Learning helps by adapting to different topics and domains • Structured constraints are important for suppressing noises • Future work includes integrating the video metadata information and building entity integrated applications, e.g., video spam detection
  • 21. Semantic Video Entity Linking. An Application: Video Spam Removal • “guardians of the galaxy full movie” – Let’s watch the movie
  • 22. Unsupervised Alignment of Actions in Video with Text Descriptions Y. Song, I. Naim, A. Mamun, K. Kulkarni, P. Singla J. Luo, D. Gildea, H. Kautz
  • 23. Overview • Unsupervised alignment of video with text • Motivations – Generate labels from data (reduce burden of manual labeling) – Learn new actions from only parallel video+text – Extend noun/object matching to verbs and actions Matching Verbs to Actions The person takes out a knife and cutting board Matching Nouns to Objects [Naim et al., 2015] An overview of the text and video alignment framework
  • 24. Hyperfeatures for Actions • High-level features required for alignment with text → Motion features are generally low-level • Hyperfeatures, originally used for image recognition extended for use with motion features → Use temporal domain instead of spatial domain for vector quantization (clustering) Originally described in “Hyperfeatures: Multilevel Local Coding for Visual Recog- nition” Agarwal, A. (ECCV 06), for images Hyperfeatures for actions
  • 25. Hyperfeatures for Actions • From low-level motion features, create high-level representations that can easily align with verbs in text Cluster 3 at time t Accumulate over frame at time t & cluster Conduct vector quantization of the histogram at time t Cluster 3, 5, …,5,20 = Hyperfeature 6 Each color code is a vector quantized STIP point Vector quantized STIP point histogram at time t Accumulate clusters over window (t-w/2, t+w/2] and conduct vector quantization → first-level hyperfeatures Align hyperfeatures with verbs from text (using LCRF)
  • 26. Latent-variable CRF Alignment • CRF where the latent variable is the alignment – N pairs of video/text observations {(xi, yi)}i=1 (indexed by i) – Xi,m represents nouns and verbs extracted from the mth sentence – Yi,n represents blobs and actions in interval n in the video • Conditional likelihood – conditional probability of • Learning weights w – Stochastic gradient descent where feature function More details in Naim et al. 2015 NAACL Paper - Discriminative unsupervised alignment of natural language instructions with corresponding video segments N
  • 27. Experiments: Wetlab Dataset • RGB-Depth video with lab protocols in text – Compare addition of hyperfeatures generated from motion features to previous results (Naim et al. 2015) • Small improvement over previous results – Activities already highly correlated with object-use Detection of objects in 3D space using color and point-cloud Previous results using object/noun alignment only Addition of different types of motion features 2DTraj: Dense trajectories *Using hyperfeature window size w=150
  • 28. Experiments: TACoS Dataset • RGB video with crowd-sourced text descriptions – Activities such as “making a salad,” “baking a cake” – No object recognition, alignment using actions only – Uniform: Assume each sentence takes the same amount of time over the entire sequence – Segmented LCRF: Assume the segmentation of actions is known, infer only the action labels – Unsupervised LCRF: Both segmentation and alignment are unknown • Effect of window size and number of clusters – Consistent with average action length: 150 frames *Using hyperfeature window size w=150 *d(2)=64
  • 29. Experiments: TACoS Dataset • Segmentation from a sequence in the dataset Crowd-sourced descriptions Example of text and video alignment generated by the system on the TACoS corpus for sequence s13-d28
  • 30. Image Captioning with Semantic Attention (CVPR 2016) Quanzeng You, Jiebo Luo Hailin Jin, Zhaowen Wang and Chen Fang
  • 31. Image Captioning • Motivations – Real-world Usability • Help visually impaired people, learning-impaired – Improving Image Understanding • Classification, Objection detection – Image Retrieval 1. a young girl inhales with the intent of blowing out a candle 2. girl blowing out the candle on an ice cream 1. A shot from behind home plate of children playing baseball 2. A group of children playing baseball in the rain 3. Group of baseball players playing on a wet field
  • 32. Introduction of Image Captioning • Machine learning as an approach to solve the problem Model sentence 1. A young girl inhales with the intent of blowing out a candle. 2. A young girl is preparing to blow out her candle. 3. A kid is to blow out the single candle in the bowl of birthday goodness. 4. Girl blowing out the candle on an ice-cream 5. A little girl is getting ready to blow out a candle on a small dessert. 1. A shot from behind home plate of children playing baseball 2. A group of children playing baseball in the rain 3. Group of baseball players playing on a wet field 4. A batter leaning back so they don’t get hit by a ball 5. A group of young boys playing baseball in the rain 1. A girl in a park area flies a multi-colored kite. 2. A girl flying a kit in the sky 3. A young woman flying a rainbow colored kite. 4. A person in a large field flying a kite in the sky. 5. A woman looks up at her colorful sailing kite.
  • 33. Overview • Brief overview of current approaches • Our main motivation • The proposed semantic attention model • Evaluation results
  • 34. Brief Introduction of Recurrent Neural Network • Different from CNN 11),( −− +== ttttt BhAxhxfh tt Chy = • Unfolding over time  Feedforward network  Backpropagation Through Time Inputs Hidden Units Outputs xt ht-1 ht yt ... Convolutional Neural Network Inputs Hidden Units Outputs C yt Inputs Hidden Units Inputs Hidden Units B B A A A B t-1 t-2
  • 35. Applications of Recurrent Neural Networks • Machine Translation • Reads input sentence “ABC” and produces “WXYZ” Decoder RNNEncoder RNN
  • 36. Encoder-Decoder Framework for Captioning • Inspired by neural network based machine translation • Loss function ∑= −−= −= N t tt wwIwp IwpL 1 10 ),,,|(log )|(log  ... Convolutional Neural Network w1 wstart w2 w1 wN wN-1 wend wN  Image Recurrent Neural Network ... Convolutional Neural Network  #Start# Some riverSome elephants . Some elephants roaming around on a river bank. bank bank
  • 37. Our Motivation • Additional textual information – Own noisy title, tags or captions (Web) – Visually similar nearest neighbor images – Success of low-level tasks • Visual attributes detection
  • 38. Image Captioning with Semantic Attention • Main idea RNN ... CNN attention wave riding man surfboard ocean water surfer surfing person board 0 0.1 0.2 0.3 surfboard wave surfing
  • 39. First Idea • Provide additional knowledge at each input node • Concatenate the input word and the extra attributes K • Each image has a fixed keyword list )],,([),( 11 −− +== tktttt hbKWwfhxfh Visual Features: 1024 GoogleNet LSTM Hidden states: 512 Training details: 1. 256 image/sentence pairs 2. RMS-Prob ... Convolutional Neural Network w1 wstart w2 w1 wN wN-1 wend wN  Retrieve Tags, titles, descriptions from weak annotated images Feature extraction K Keywords, key-phrase Image Recurrent Neural Network
  • 40. Using Attributes along with Visual Features • Provide additional knowledge at each input node • Concatenate the visual embedding and keywords for h0 ];[),( 10 bKWvWhvfh kiv +== − ... Convolutional Neural Network w1 wstart w2 w1 wN wN-1 wend wN  Retrieve Tags, titles, descriptions from weak annotated images Feature extraction K Keywords, key-phrase Image h0 Recurrent Neural Network
  • 41. Attention Model on Attributes • Instead of using the same set of attributes at every step • At each step, select the attributes ∑= m mtmt kKwatt α),( )softmax VK(wT tt =α ))],,(;([),( 11 −− == tttttt hKwattxfhxfh ... CNN w1 wstart w2 w1 wN wN-1 wend wN  Retrieve Tags, titles, descriptions from weak annotated images Feature extraction K Keywords, key-phrase RNN
  • 42. Overall Framework • Training with a bilinear/bilateral attention model ht pt xt v {Ai} Yt~ 𝜑 𝜙 RNN Image CNN AttrDet 1 AttrDet 2 AttrDet 3 AttrDet N  RNN ... CNN attention wave riding man surfboard ocean water surfer surfing person board 0 0.1 0.2 0.3 surfboard wave surfing
  • 43. Visual Attributes • A secondary contribution • We try different approaches vase flowers bathroom table glass sink blue small white clear k-NN sitting table small many little glass different flowers vase shown Multi-label Ranking vase flowers table glass sitting kitchen water room white filled FCN
  • 44. Performance • Examples showing the impact of visual attributes on captions Google NIC Top-5 visual attributes ATT-FCN a white plate topped with a variety of food. a plate with a sandwich and french fries. plate broccoli fries food french a baby with a toothbrush in its mouth. a baby is eating a piece of paper. teeth brushing toothbrush holding baby a traffic light is on a city street. a street with cars and a clock tower. street sign cars clock traffic a yellow and black train on a track. a train traveling down tracks next to a building. train tracks clock tower down a close up of a plate of food on a table. a table topped with a cake with candles on it. a teddy bear sitting on top of a chair . a white teddy bear sitting next to a stuffed animal . a person is holding colorful umbrella. a black umbrella sitting on top of a sandy beach . a woman is holding a cell phone in her hand . a woman holding a pair of scissors in her hands . cake table plate sitting birthday teddy cat bear stuffed white umbrella beach water sitting boat woman bathroom her scissors man
  • 45. Performance on the Testing Dataset • Publicly available split
  • 46. Performance • MS-COCO Image Captioning Challenge
  • 47. TGIF: A New Dataset and Benchmark on Animated GIF Description Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Jiebo Luo
  • 50. Examples a skate boarder is doing trick on his skate board. a gloved hand opens to reveal a golden ring. a sport car is swinging on the race playground the vehicle is moving fast into the tunnel
  • 51. Contributions • A large scale animated GIF description dataset for promoting image sequence modeling and research • Performing automatic validation to collect natural language descriptions from crowd workers • Establishing baseline image captioning methods for future benchmarking • Comparison with existing datasets, highlighting the benefits with animated GIFs
  • 52. In Comparison with Existing Datasets • The language in our dataset is closer to common language • Our dataset has an emphasis on the verbs • Animated GIFs are more coherent and self contained • Our dataset can be used to solve more difficult movie description problem
  • 56. Comparing Professionals and Crowd-workers Crowd worker: two people are kissing on a boat. Professional: someone glances at a kissing couple then steps to a railing overlooking the ocean an older man and woman stand beside him. Crowd worker: two men got into their car and not able to go anywhere because the wheels were locked. Professional: someone slides over the camaros hood then gets in with his partner he starts the engine the revving vintage car starts to backup then lurches to a halt. Crowd worker: a man in a shirt and tie sits beside a person who is covered in a sheet. Professional: he makes eye contact with the woman for only a second. More: http://beta- web2.cloudapp.net/ls mdc_sentence_compar ison.html
  • 57. Movie Descriptions versus TGIF • Crowd workers are encouraged to describe the major visual content directly, and not to use overly descriptive language • Because our animated GIFs are presented to crowd workers without any context, the sentences in our dataset are more self-contained • Animated GIFs are perfectly segmented since they are carefully curated by online users to create a coherent visual story
  • 58. Where is CV (AI) in 2016? Winter 2002 Fall 2003 Summer 2008 Image/Video captioning 看图识字 看图说话
  • 59. Thanks Q & A Google *** Baidu Sogou Bing XiaoIce Are you smarter than a 5th grader? What doe it take to go from a 5- year old to a 5th grader? 1. Learning from “small data” 2. Unsupervised learning 3. Transfer learning 4. Integration of domain knowledge or experience
  • 60. Visual Intelligence & Social Multimedia Analytics www.cs.rochester.edu/u/jluo Questions?