Data-driven Generation of Image
Descriptions
Vicente Ordonez-Roman
Advisor: Tamara Berg

Previously:
The State University of New York
What most Computer Vision systems aim
to say about a picture

Computer Vision

sky
trees
water
building
bridge
river
tree
What we are able to say about a picture

An old bridge over dirty green water.

Our Goal
One of the many stone bridges in town
that carry the gravel carriage roads.
A stone bridge over a peaceful river.
Let’s just borrow captions from similar images!

Im2Text: Describing Images Using 1 Million Captioned Photographs.
Vicente Ordonez, Girish Kulkarni, Tamara L. Berg.
Advances in Neural Information Processing Systems. NIPS 2011.
Harness the Web!
Images + Captions
from the Web

Smallest house in paris
between red (on right)
and beige (on left).

Matching using Global
Image Features
(GIST + Color)

Bridge to temple in
Hoan Kiem lake.

A walk around the
lake near our house
with Abby.

Transfer Caption(s)
e.g. “The water is clear
enough to see fish
swimming around in it.”

The water is clear
enough to see
fish swimming
around in it.

Hangzhou bridge in
West lake.

...

The daintree river by
boat.
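A minimal sketch of the caption-transfer idea above, assuming every image has already been reduced to a fixed-length global descriptor. The slides name GIST + color as the features; the three-number toy descriptors, the Euclidean distance, and the function name `transfer_captions` are illustrative assumptions, not the authors' implementation.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def transfer_captions(query_descriptor, database, k=1):
    """Return the captions of the k database images closest to the query.

    database: list of (descriptor, caption) pairs.
    """
    ranked = sorted(database, key=lambda item: l2_distance(query_descriptor, item[0]))
    return [caption for _, caption in ranked[:k]]

# Toy descriptors (hypothetical values standing in for GIST + color features).
database = [
    ([0.9, 0.1, 0.3], "Smallest house in paris between red (on right) and beige (on left)."),
    ([0.2, 0.8, 0.7], "Bridge to temple in Hoan Kiem lake."),
    ([0.1, 0.9, 0.8], "The water is clear enough to see fish swimming around in it."),
]
query = [0.15, 0.85, 0.8]

for caption in transfer_captions(query, database, k=2):
    print(caption)
```

With a million captioned images an exhaustive sort like this would be slow; it is used here only to keep the sketch short.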
Use the web to collect
images + captions

Facebook: 90,000,000,000 pictures! (**)
A lot of them with captions
(a lot of them not publicly available)

Flickr: 6,000,000,000 photographs! (*)
A lot of them with captions
(lots of them publicly available)

(*) http://blog.flickr.net/en/2011/08/04/6000000000/
(**) http://www.quora.com/How-many-photos-are-uploaded-to-Facebook-each-day
Flickr images + captions
Dog with a ball in its mouth running around like
crazy on the green grass.

cat in a sink

A 10-kg cat called Hercules.. and got caught in a pet
door when trying to sneak into another house to steal
dog food. 'Nuff said
Solution:
Collect hundreds of millions of captions
Filter them out
We found “good captions” have visual concepts and
relation words “by”, “in”, “over”, “beside”, “on top of”
~1 “good caption” for every 1000 “bad captions”
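The filtering rule above can be sketched as a simple predicate. This is an assumption-heavy illustration: the slides only say that good captions contain visual concepts and relation words, so the tiny `VISUAL_CONCEPTS` vocabulary and the word-count limits below are hypothetical placeholders.

```python
import re

# Relation words quoted on the slide.
RELATION_WORDS = ["on top of", "by", "in", "over", "beside"]
# Hypothetical stand-in for a real vocabulary of visual concepts.
VISUAL_CONCEPTS = {"dog", "cat", "bridge", "river", "tree", "sky", "water", "grass", "sink"}

def is_good_caption(caption, min_words=3, max_words=25):
    """Keep captions that mention a visual concept and a relation word.

    The length limits are an assumption, not taken from the slides.
    """
    words = re.findall(r"[a-z']+", caption.lower())
    if not (min_words <= len(words) <= max_words):
        return False
    has_concept = any(word in VISUAL_CONCEPTS for word in words)
    # Pad with spaces so multi-word relations match on word boundaries.
    padded = " " + " ".join(words) + " "
    has_relation = any(" " + relation + " " in padded for relation in RELATION_WORDS)
    return has_concept and has_relation

captions = [
    "cat in a sink",
    "IMG_8783",
    "Emma looking super cute",
    "Dog with a ball in its mouth running around like crazy on the green grass.",
]
kept = [c for c in captions if is_good_caption(c)]
print(kept)
```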
SBU Captioned Photo Dataset

The Egyptian cat statue by the
floor clock and perpetual
motion machine in the
pantheon

Man sits in a rusted car buried
in the sand on Waitarere beach

Little girl and her dog in
northern Thailand. They both
seemed interested in what we
were doing

Our dog Zoe in her
bed

Interior design of modern white
and brown living room furniture
against white wall with a lamp
hanging.

Emma in her hat looking
super cute
Results

(1) while walking by the water
(2) plane flying over the sun
(3) shot this in a moving car at the nkve highway
(4) sunset over creve coeur lake and the page bridge
(5) sunset on 12th sep 2009 as seen from the field polder near my house
(6) window over yellow door
(7) sunset over capitol hill as seen from the roof of my building
(8) an orange sky over the irish sea
(9) beautiful golden sunset reflected in the waves of the ocean
(10) red sky probably caused by volcanic ash from iceland
(11) a view of sunset over river brahmaputa from koliyabhumura bridge
(12) red sky in the morning
Results

(1) burnt wooden door in derelict building portugal
(2) peterborough cathedral norman door in south wall
(3) amazing wooden door with wider light above
(4) door in wall
(5) girl looking in a classroom window
(6) a interesting cross in a window of an ancient city
(7) this mirror decorated with fruit painting was left behind by the previous owners
(8) unusual exterior wall postbox at st albans post office in st peters street al1
(9) door in oxford uk in black and white
(10) 19 plate behind glass in brass mat and preserver
(11) this is some of the window decoration external on the house just over the porch 0364
(12) cat in a window
Results

(1) img8783 ginger in the red chair
(2) red sky in the morning
(3) the cat is in the bag and the bag is in the river quot
(4) the light in the kitchen made everythin glow my little girl is growing up
(5) my cat in a box that is far too small for her
(6) one of the towel animals in the cabin edno ot jivotnite napraveno ot havlieni karpi v kabinata
(7) baby in her later years turned from green to red but she never went fully red all over
(8) if you take pictures through the hole in the bottom of a flower pot the whole of the eldritch world is revealed
(9) glazed ceramic poop form in orange wooden box
(10) rock garden in library
(11) it s funny to capture the preciousest cat in the house at his most devillicious
(12) the pink will get replaced by orange and blue in the fall
Results

(1) starfish from the book toys to knit dashing dachs superwash sock yarn in goldfish backing is orange fabric stuffing is pillow stuffing
(2) mural of birds and trees in the crypt of wat ratburana ayutthaya
(3) carvings in the rock wall
(4) acrylic on paper scarlet macaws communicate in the color red with yellow and blue as visual grammar
(5) epsom and table salt crystals growing in concentrated green tea solution
(6) the hops dried to a golden green in a matter of a few days almost too pretty to bag up
(7) after staring at the gorgeous colors of the leaves claes discovered that there were about 100 birds sleeping in the
(8) you know you re in wisconsin when the beach has pine needles in the sand
(9) i was walking down the sidewalk and i saw this glove craft dropped in the dirt it seemed really unusual
(10) made by fusing plastic bags
(11) bark pattern from a ponderosa pine tree in grand canyon national park
(12) the peasant that found a statue of the black virgin on a rock in a river
What to do next?
Use High Level Content to Rerank
(Objects, Stuff, People, Scenes, Captions)

The bridge over the
lake on Suzhou Street.

Iron bridge over the Duck
river.

Transfer Caption(s)
e.g. “The bridge over the
lake on Suzhou Street.”

The Daintree river by boat.

Bridge over Cacapon river.

...
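One way to picture the reranking step: each candidate retrieved by global matching also gets content-based scores, and a linear model combines them into a single ranking score. The feature names, the hand-picked weights, and the toy score values below are all hypothetical; the results table later in the deck mentions linear regression and a linear SVM for learning the combination, which this sketch does not attempt.

```python
def rerank(candidates, weights):
    """Sort candidates by a weighted sum of their feature scores, highest first."""
    def score(candidate):
        return sum(weights[name] * value for name, value in candidate["features"].items())
    return sorted(candidates, key=score, reverse=True)

# Hypothetical weights; in practice these would be learned from data.
weights = {"global": 1.0, "objects": 2.0, "stuff": 1.0, "scene": 0.5}

# Toy feature scores for three captions shown on the slide.
candidates = [
    {"caption": "The Daintree river by boat.",
     "features": {"global": 0.9, "objects": 0.1, "stuff": 0.6, "scene": 0.5}},
    {"caption": "The bridge over the lake on Suzhou Street.",
     "features": {"global": 0.8, "objects": 0.9, "stuff": 0.7, "scene": 0.6}},
    {"caption": "Iron bridge over the Duck river.",
     "features": {"global": 0.7, "objects": 0.8, "stuff": 0.5, "scene": 0.4}},
]

reranked = rerank(candidates, weights)
print(reranked[0]["caption"])
```

In this toy setup the best global match ("The Daintree river by boat.") drops to last place because its object score is low.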
Some success…

Amazing colours in
the sky at sunset
with the orange of
the cloud and the
blue of the sky
behind.

A female mallard duck in the
lake at Luukki Espoo

Strange cloud formation
literally flowing through the sky
like a river in relation to the
other clouds out there.

The sun was
coming through
the trees while I
was sitting in my
chair by the river

Fresh fruit and
vegetables at the market
in Port Louis Mauritius.

Tree with red leaves in the
field in autumn.

Under the sky of burning
clouds.

Stained glass
window in
Eusebius church.
Still far from perfect
Incorrect objects

Kentucky cows in a field.
The cat in the window.
Still far from perfect
Incorrect context

The sky is blue over the Gherkin.

Tree beside the river.

Completely wrong

The boat ended up a kilometre from
the water in the middle of the airstrip.

Water over the road.
How to Evaluate?
• “Ground truth”: The car is parked next to the
train station besides a building.
• Candidates:
“There is car parked in front of an office building”
“This is the building that hosted the ceremony”
“A vehicle stopped next to my house”

Similar to evaluation in Machine
Translation
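To make the comparison concrete, here is a simplified, single-reference BLEU-1 (clipped unigram precision times a brevity penalty) applied to the example above. This is a sketch only: standard BLEU combines n-gram precisions up to 4-grams, and the slides do not say which variant was used, so the numbers from this function are not comparable to the reported scores.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bleu1(candidate, reference):
    """Clipped unigram precision with brevity penalty, single reference."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate word is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in Counter(cand).items())
    precision = clipped / len(cand)
    # Penalize candidates shorter than the reference.
    brevity_penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * precision

reference = "The car is parked next to the train station besides a building."
candidates = [
    "There is car parked in front of an office building",
    "This is the building that hosted the ceremony",
    "A vehicle stopped next to my house",
]
for candidate in candidates:
    print(f"{bleu1(candidate, reference):.3f}  {candidate}")
```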
BLEU score evaluation against Human Captions
Method                                          BLEU score
Global matching (1k)                            0.0774
Global matching (10k)                           0.0909
Global matching (100k)                          0.0917
Global matching (1 million)                     0.1177
Global + Content matching (linear regression)   0.1215
Global + Content matching (linear SVM)          0.1259
Human Visual Verification
Please choose the image that better corresponds to the given caption:
“View overlooking Kuala Lumpur from my office building”
Choices: [Caption from Flickr] vs. [Random image]

Caption used                    Success rate
Original human caption          96.0%
Top caption                     66.7%
Best from our top 4 captions    92.7%
Human Visual Evaluation
Please choose the image that better corresponds to the given caption:
“The view from the 13th floor of an apartment building in Nakano awesome.”
Choices: [Caption produced by our system] vs. [Random image]
What to do next?
Let’s not borrow captions from other
images, let’s just borrow short phrases!
Collective Generation of Natural Image Descriptions.
Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, Yejin Choi.
Association for Computational Linguistics. ACL 2012.
Large Scale Retrieval for Image Description Generation
Vicente Ordonez, Xufeng Han, Polina Kuznetsova, Girish Kulkarni, Margaret Mitchell,
Kota Yamaguchi, Karl Stratos, Amit Goyal, Jesse Dodge, Alyssa Mensch, Hal Daume III,
Alexander C. Berg, Yejin Choi, Tamara L. Berg
In submission to the IJCV special issue on Big Data.
Retrieving noun phrases from similar object
detections
Retrieving verb
phrases from similar
object detections
Contented dog just laying
on the edge of the road in
front of a house..

Peruvian dog sleeping on
city street in the city of
Cusco, (Peru)

Detect: dog

Find matching
dog detections
by visual
similarity

this dog was laying in the
middle of the road on a
back street in jaco

Closeup of my dog sleeping
under my desk.
Retrieving prepositional
phrases from region +
detection matches

Find matching region
detections using
appearance +
arrangement

Object: car

Cordoba - lonely elephant
under an orange tree...

Comfy chair under a tree.

I positioned the chairs
around the lemon tree -it's like a shrine

Mini Nike soccer ball all
alone in the grass
Retrieving prepositional phrases from scene matches

Extract scene descriptor

Pedestrian street in the Old
Lyon with stairs to climb up
the hill of fourviere

Find matching
images by scene
similarity
View from our B&B in this
photo

I'm about to blow the building
across the street over with my
massive lung power.

Only in Paris will you find a
bottle of wine on a table
outside a bookstore
Data Processing
1 million images:
– Run object detectors
– Run region-based stuff detectors (e.g. grass, sky)
– Run global scene classifiers
– Parse captions associated with images and retrieve phrases referring to objects (NPs, VPs), region relationships (PPstuff), and general scene context (PPscene)
Recognition, aka Vision, is hard
Detecting one hundred objects
Sometimes you can make it (a little) better
Detecting “mentioned” objects

Look in the mountain for a lion face

Ecuador, amazon basin, near coca, rain forest,
passion fruit flower

The background is a vintage paint by number painting I have
and the fabulous forest dress is by candyjunky!

Kevin’s mom, so punxrawk in Kev’s black flag hat
Everything together
Objects: bird
Actions: looking for food
Stuff: in water
Scene: in Lincoln City Oregon coast
Everything together
Retrieved phrases: bird; looking for food; in water; on the beach; in Atlantic City; in Lincoln City Oregon coast
Binary Integer Linear Programming
[Figure: objective terms]
Phrase vision confidence: phrase s_ij at position k
Pairwise phrase cohesion: phrase s_ij at position k, phrase s_pq at position k+1
Pairwise phrase cohesion = Ngram cohesion + Head words co-occurrence
Composing Descriptions
Compose descriptions from phrases with ILP approach

• Linguistic constraints
– Allow only one phrase of each type
– Enforce plural/singular agreement between NP and VP

• Discourse constraints
– Prevent inclusion of repeated phrasing

• Phrase cohesion constraints
– n-gram statistics between phrases
– Co-occurrence statistics between head words of phrases (last
word or main verb) to encourage longer range cohesion
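The composition step can be sketched with exhaustive search in place of an ILP solver. The sketch below assumes exactly one phrase per type in a fixed NP, VP, PPstuff, PPscene order and scores a sentence as vision confidence plus cohesion between adjacent phrases. The confidence and cohesion numbers are invented for illustration, and the agreement and repetition constraints listed above are not modeled.

```python
from itertools import product

def compose(candidates, cohesion, order=("NP", "VP", "PPstuff", "PPscene")):
    """Pick one phrase per type, maximizing confidence plus adjacent-pair cohesion.

    candidates: dict mapping phrase type to a list of (phrase, confidence) pairs.
    cohesion: dict mapping (phrase, next_phrase) to a cohesion score (default 0).
    """
    best_score, best_phrases = float("-inf"), None
    for choice in product(*(candidates[phrase_type] for phrase_type in order)):
        phrases = [phrase for phrase, _ in choice]
        score = sum(confidence for _, confidence in choice)
        score += sum(cohesion.get(pair, 0.0) for pair in zip(phrases, phrases[1:]))
        if score > best_score:
            best_score, best_phrases = score, phrases
    return " ".join(best_phrases), best_score

# Phrases from the "Everything together" slide; all scores are hypothetical.
candidates = {
    "NP": [("bird", 0.9)],
    "VP": [("looking for food", 0.6)],
    "PPstuff": [("in water", 0.7), ("on the beach", 0.6)],
    "PPscene": [("in Atlantic City", 0.4), ("in Lincoln City Oregon coast", 0.5)],
}
cohesion = {
    ("bird", "looking for food"): 0.5,
    ("looking for food", "in water"): 0.4,
    ("looking for food", "on the beach"): 0.1,
}

sentence, score = compose(candidates, cohesion)
print(sentence)
```

Brute force is only workable for a handful of candidates per type; an ILP formulation lets a solver handle larger candidate sets and extra constraints.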
Good Results

This is a sporty little red convertible
made for a great day in Key West FL. This
car was in the 4th parade of the
apartment buildings.

Taken in front of my cat sitting in a shoe
box. Cat likes hanging around in my
recliner.

This is a brass viking
boat moored on
beach in Tobago by
the ocean.
Bad Results
Grammatically incorrect:
One of the most shirt in the wall of
the house.

Cognitive absurdity:
Here you can see a cross by the frog
in the sky.

Not relevant:
This is a shoulder bag with a blended
rainbow effect
BLEU score evaluation
Method                                   BLEU score
HMM (using cognitive phrases)            0.111
HMM (without using cognitive phrases)    0.114
ILP (using cognitive phrases)            0.114
ILP (without using cognitive phrases)    0.116
Human Forced Choice Evaluation
Comparison                                           ILP selected
ILP vs. HMM (no images, no cognitive phrases)        67.2%
ILP vs. HMM (no images, with cognitive phrases)      66.3%
ILP vs. HMM (with images, no cognitive phrases)      53.17%
ILP vs. HMM (with images, with cognitive phrases)    54.5%
ILP vs. NIPS 2011 (Global matching 1M)               71.8%
ILP vs. HUMAN                                        16%
Visual Turing Test
Us vs Original Human Written Caption

In some cases (16%), ILP
generated captions were
preferred over human
written ones!
What’s next?
To be presented at ICCV 2013

Meaning from large-scale computer vision
Images with the word “house”

Images recognized as more likely
to produce the word “house”
Images with the word “girl”

Images recognized as more likely
to produce the word “girl”
Weights learned to recognize images with “desk” in caption
[Figure: top weighted classifier outputs, grouped as Mammals, Birds, Instruments, Structures, Plants, Other]
Weights learned over outputs of ~8k classifiers
Weights learned to recognize images with “tree” in caption
[Figure: top weighted classifier outputs, grouped as Mammals, Birds, Instruments, Structures, Plants, Other]
Weights learned over outputs of ~8k classifiers
Questions?

Data-driven Generation of Image Descriptions

  • 1. Data-driven Generation of Image Descriptions Vicente Ordonez-Roman Advisor: Tamara Berg Previously: The State University of New York
  • 2. What most Computer Vision systems aim to say about a picture Computer Vision sky trees water building bridge river tree
  • 3. What we are able to say about a picture An old bridge over dirty green water. Our Goal One of the many stone bridges in town that carry the gravel carriage roads. A stone bridge over a peaceful river.
  • 4. Let’s just borrow captions from similar images! Im2Text: Describing Images Using 1 Million Captioned Photographs. Vicente Ordonez, Girish Kulkarni, Tamara L. Berg. Advances in Neural Information Processing Systems. NIPS 2011.
  • 5. Harness the Web! Images + Captions from the Web Smallest house in paris between red (on right) and beige (on left). Matching using Global Image Features (GIST + Color) Bridge to temple in Hoan Kiem lake. A walk around the lake near our house with Abby. Transfer Caption(s) e.g. “The water is clear enough to see fish swimming around in it.” The water is clear enough to see fish swimming around in it. Hangzhou bridge in West lake. ... The daintree river by boat.
  • 6. Use the web to collect images + captions 90, 000, 000, 000 pictures~!! (**) A lot of them with captions (a lot of them not publicy available ) 6, 000, 000, 000 photographs! (*) A lot of them with captions (lots of them publicly available ) (*) http://blog.flickr.net/en/2011/08/04/6000000000/ (**) http://www.quora.com/How-many-photos-are-uploaded-to-Facebook-each-day
  • 7. Flickr images + captions Dog with a ball in its mouth running around like crazy on the green grass. cat in a sink A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said
  • 8. Flickr images + captions Dog with a ball in its mouth running around like crazy on the green grass. Dog with a ball in its mouth running around like crazy on the green grass. cat in a sink A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said
  • 9. Flickr images + captions Dog with a ball in its mouth running around like crazy on the green grass. cat in a sink A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said
  • 10. Flickr images + captions Dog with a ball in its mouth running around like crazy on the green grass. cat catsink a in a in sink A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said
  • 11. Flickr images + captions Dog with a ball in its mouth running around like crazy on the green grass. cat in a sink A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said
  • 12. Flickr images + captions Dog with a ball in its mouth running around like crazy on the green grass. A 10-kg cat called Hercules.. and got caught in a pet cat in a sink A 10-kg cat called Hercules..sneak into another house to steal and got caught in a pet door when trying to door when trying to'Nuff saidinto another house to steal dog food. sneak dog food. 'Nuff said
  • 13. Solution: Collect hundreds of millions of captions Filter them out We found “good captions” have visual concepts and relation words “by”, “in”, “over”, “beside”, “on top of” ~1 “good caption” for every 1000 “bad captions” Im2Text: Describing Images Using 1 Million Captioned Photographs. Vicente Ordonez, Girish Kulkarni, Tamara L. Berg. Advances in Neural Information Processing Systems. NIPS 2011.
  • 14. SBU Captioned Photo Dataset The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon Man sits in a rusted car buried in the sand on Waitarere beach Little girl and her dog in northern Thailand. They both seemed interested in what we were doing Our dog Zoe in her bed Interior design of modern white and brown living room furniture against white wall with a lamp hanging. Emma in her hat looking super cute
  • 15. Results (1) while walking by the water (2) plane flying over the sun (3) shot this in a moving car at the nkve highway (4) sunset over creve coeur lake and the page bridge (5) sunset on 12th sep 2009 as seen from the field polder near my house (6) window over yellow door (7) sunset over capitol hill as seen from the roof of my building (8) an orange sky over the irish sea (9) beautiful golden sunset reflected in the waves of the ocean (10) red sky probably caused by volcanic ash from iceland (11) a view of sunset over river brahmaputa from koliyabhumura bridge (12) red sky in the morning
  • 16. Results (1) burnt wooden door in derelict building portugal (2) peterborough cathedral norman door in south wall (3) amazing wooden door with wider light above (4) door in wall (5) girl looking in a classroom window (6) a interesting cross in a window of an ancient city (7) this mirror decorated with fruit painting was left behind by theprevious owners (8) unusual exterior wall postbox at st albans post office in st peters street al1 (9) door in oxford uk in black and white (10) 19 plate behind glass in brass mat and preserver (11) this is some of the window decoration external on the house justover the porch 0364 (12) cat in a window
  • 17. Results (1) img8783 ginger in the red chair (2) red sky in the morning (3) the cat is in the bag and the bag is in the river quot (4) the light in the kitchen made everythin glow my little girl is growing up (5) my cat in a box that is far too small for her (6) one of the towel animals in the cabin edno ot jivotnite napraveno ot havlieni karpi v kabinata (7) baby in her later years turned from green to red but she never went fully red all over (8) if you take pictures through the hole in the bottom of a flower pot the whole of the eldritch world is revealed (9) glazed ceramic poop form in orange wooden box (10) rock garden in library (11) it s funny to capture the preciousest cat in the house at his most devillicious (12) the pink will get replaced by orange and blue in the fall
  • 18. Results (1) starfish from the book toys to knitdashing dachs superwash sock yarn in goldfishbacking is orange fabricstuffing is pillow stuffing (2) mural of birds and trees in the crypt of wat ratburana ayutthaya (3) carvings in the rock wall (4) acrylic on paper scarlet macaws communicate in the color red withyellow and blue as visual grammar (5) epsom and table salt crystals growing in concentrated green tea solution (6) the hops dried to a golden green in a matter of a few days almosttoo pretty to bag up (7) after staring at the gorgeous colors of the leaves claes discoveredthat there were about 100 birds sleeping in the (8) you know you re in wisconsin when the beach has pine needles inthe sand (9) i was walking down the sidewalk and i saw this glove craft droppedin the dirt it seemed really unusual (10) made by fusing plastic bags (11) bark pattern from a ponderosa pine tree in grand canyon national park (12) the peasant that found a statue of the black virgin on a rock in ariver
  • 19. What to do next?
  • 20. Use High Level Content to Rerank (Objects, Stuff, People, Scenes, Captions) The bridge over the lake on Suzhou Street. Iron bridge over the Duck river. Transfer Caption(s) e.g. “The bridge over the lake on Suzhou Street.” The Daintree river by boat. Bridge over Cacapon river. ...
  • 21. Some success… Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind. A female mallard duck in the lake at Luukki Espoo Strange cloud formation literally flowing through the sky like a river in relation to the other clouds out there. The sun was coming through the trees while I was sitting in my chair by the river Fresh fruit and vegetables at the market in Port Louis Mauritius. Tree with red leaves in the field in autumn. Under the sky of burning clouds. Stained glass window in Eusebius church.
  • 22. Still far from perfect Incorrect objects Kentucky cows in a field. The cat in the window.
  • 23. Still far from perfect Incorrect context The sky is blue over the Gherkin. Tree beside the river. Completely wrong The boat ended up a kilometre from the water in the middle of the airstrip. Water over the road.
  • 24. How to Evaluate? • “Ground truth”: The car is parked next to the train station besides a building. • Candidates: “There is car parked in front of an office building” “This is the building that hosted the ceremony” “A vehicle stopped next to my house” Similar to evaluation on Machine Translation
  • 25. BLEU score evaluation against Human Captions Method BLEU score Global matching (1k) 0.0774 Global matching (10k) 0.0909 Global matching (100k) 0.0917 Global matching (1million) 0.1177 Global + Content matching (linear regression) 0.1215 Global + Content matching (linear SVM) 0.1259
  • 26. Human Visual Verification View overlooking Kuala Lumpur from my office building Please choose the image that better corresponds to the given caption:
  • 27. Human Visual Verification Caption from Flickr Please choose the image that better corresponds to the given caption: Random image View overlooking Kuala Lumpur from my office building
  • 28. Human Visual Verification Caption from Flickr Random image View overlooking Kuala Lumpur from my office building Please choose the image that better corresponds to the given caption: Caption used Success rate Original human caption 96.0% Top caption 66.7% Best from our top 4 captions 92.7%
  • 29. Human Visual Evaluation Caption produced by our system Random image The view from the 13th floor of an apartment building in Nakano awesome. Please choose the image that better corresponds to the given caption: Caption used Success rate Original human caption 96.0% Top caption 66.7% Best from our top 4 captions 92.7%
  • 30. Human Visual Evaluation Caption produced by our system Random image The view from the 13th floor of an apartment building in Nakano awesome. Please choose the image that better corresponds to the given caption: Caption used Success rate Original human caption 96.0% Top caption 66.7% Best from our top 4 captions 92.7%
  • 31. What to do next?
  • 32. Let’s not borrow captions from other images, let’s just borrow short phrases! Collective Generation of Natural Image Descriptions. Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, Yejin Choi. Association for Computational Linguistics. ACL 2012. Large Scale Retrieval for Image Description Generation Vicente Ordonez, Xufeng Han, Polina Kuznetsova, Girish Kulkarni, Margaret Mitchell, Kota Yamaguchi, Karl Stratos, Amit Goyal, Jesse Dodge, Alyssa Mensch, Hal Daume III, Alexander C. Berg, Yejin Choi, Tamara L. Berg On Submission to IJCV special issue on Big Data.
  • 33. Retrieving noun phrases from similar object detections
  • 34. Retrieving verb phrases from similar object detections Contented dog just laying on the edge of the road in front of a house.. Peruvian dog sleeping on city street in the city of Cusco, (Peru) Detect: dog Find matching dog detections by visual similarity this dog was laying in the middle of the road on a back street in jaco Closeup of my dog sleeping under my desk.
  • 35. Retrieving prepositional phrases from region + detection matches Find matching region detections using appearance + arrangement Object: car Cordoba - lonely elephant under an orange tree... Comfy chair under a tree. I positioned the chairs around the lemon tree -it's like a shrine Mini Nike soccer ball all alone in the grass
  • 36. Retrieving prepositional phrases from scene matches Extract scene descriptor Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere Find matching images by scene similarity View from our B&B in this photo I'm about to blow the building across the street over with my massive lung power. Only in Paris will you find a bottle of wine on a table outside a bookstore
  • 37. Data Processing 1 million images: – Run object detectors – Run region based stuff detectors (e.g. grass, sky, etc) – Run global scene classifiers – Parse captions associated with images and retrieve phrases referring to objects (NPs, VPs), region relationships (PPstuff), and general scene context (PPscene).
  • 38. Recognition, aka Vision is hard Detecting one hundred objects
  • 39. Sometimes you can make it (a little) better Detecting “mentioned” objects Look in the mountain for a lion face Ecuador, amazon basin, near coca, rain forest, passion fruit flower The background is a vintage paint by number painting I have and the fabulous forest dress is by candyjunky! Kevin’s mom, so punxrawk in Kev’s black flag hat
  • 41. Everything together Retrieved phrases bird looking for food bird looking for food in Atlantic City in water on the beach bird in water in water looking for food in Lincoln City Oregon coast
  • 42. Binary Integer Linear Programming. Unary term: phrase vision confidence for placing phrase sij at position k. Pairwise term: pairwise phrase cohesion for placing phrase sij at position k and phrase spq at position k+1. Pairwise phrase cohesion = n-gram cohesion + head-word co-occurrence.
  • 43. Composing Descriptions Compose descriptions from phrases with ILP approach • Linguistic constraints – Allow only one phrase of each type – Enforce plural/singular agreement between NP and VP • Discourse constraints – Prevent inclusion of repeated phrasing • Phrase cohesion constraints – n-gram statistics between phrases – Co-occurrence statistics between head words of phrases (last word or main verb) to encourage longer range cohesion
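The constraints on this slide can be illustrated with a toy version of the selection problem. This is a minimal sketch, not the authors' formulation: it replaces the ILP solver with brute-force enumeration, fixes the phrase order to NP, VP, PPstuff, PPscene, and uses made-up phrases and scores. The names `CANDIDATES`, `COHESION`, `feasible`, `score`, and `compose` are hypothetical.

```python
from itertools import product

# Hypothetical candidates per phrase type: (text, vision confidence, number).
# All scores are invented for illustration only.
CANDIDATES = {
    "NP": [("the bird", 0.9, "sg"), ("birds", 0.4, "pl")],
    "VP": [("is looking for food", 0.7, "sg"), ("are looking for food", 0.6, "pl")],
    "PPstuff": [("in water", 0.8, None), ("on the beach", 0.5, None)],
    "PPscene": [("in water", 0.6, None), ("in Atlantic City", 0.3, None)],
}

# Hypothetical pairwise cohesion between adjacent phrases (default 0.0).
COHESION = {
    ("the bird", "is looking for food"): 0.5,
    ("is looking for food", "in water"): 0.4,
    ("in water", "in Atlantic City"): 0.2,
    ("is looking for food", "on the beach"): 0.1,
}

ORDER = ["NP", "VP", "PPstuff", "PPscene"]  # assumed fixed ordering


def feasible(chosen):
    """Check the discourse and agreement constraints on a selection."""
    texts = [p[0] for p in chosen]
    if len(set(texts)) != len(texts):  # discourse: no repeated phrasing
        return False
    numbers = {p[2] for p in chosen if p[2] is not None}
    return len(numbers) <= 1  # linguistic: NP and VP agree in number


def score(chosen):
    """Sum of vision confidences plus cohesion of adjacent phrase pairs."""
    unary = sum(p[1] for p in chosen)
    pairwise = sum(
        COHESION.get((a[0], b[0]), 0.0) for a, b in zip(chosen, chosen[1:])
    )
    return unary + pairwise


def compose():
    """Enumerate every selection with at most one phrase per type."""
    best, best_score = None, float("-inf")
    options = [CANDIDATES[t] + [None] for t in ORDER]  # None skips a type
    for combo in product(*options):
        if combo[0] is None:  # this sketch always requires a noun phrase
            continue
        chosen = [p for p in combo if p is not None]
        if not feasible(chosen):
            continue
        s = score(chosen)
        if s > best_score:
            best, best_score = chosen, s
    return " ".join(p[0] for p in best), best_score


sentence, total = compose()
print(sentence, round(total, 2))
```

Enumeration is only practical for a handful of candidates; an ILP solver handles the same objective and constraints at realistic sizes, and can also decide phrase positions rather than assuming them.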
  • 44. Good Results This is a sporty little red convertible made for a great day in Key West FL. This car was in the 4th parade of the apartment buildings. Taken in front of my cat sitting in a shoe box. Cat likes hanging around in my recliner. This is a brass viking boat moored on beach in Tobago by the ocean.
  • 45. Bad Results. Grammatically incorrect: “One of the most shirt in the wall of the house.” Cognitive absurdity: “Here you can see a cross by the frog in the sky.” Not relevant: “This is a shoulder bag with a blended rainbow effect”
  • 46. BLEU score evaluation. BLEU score by method: HMM (using cognitive phrases) 0.111; HMM (without using cognitive phrases) 0.114; ILP (using cognitive phrases) 0.114; ILP (without using cognitive phrases) 0.116.
  • 47. Human Forced Choice Evaluation. ILP selection rate by comparison: ILP vs. HMM (no images, no cognitive phrases) 67.2%; ILP vs. HMM (no images, with cognitive phrases) 66.3%; ILP vs. HMM (with images, no cognitive phrases) 53.17%; ILP vs. HMM (with images, with cognitive phrases) 54.5%; ILP vs. NIPS 2011 (Global matching 1M) 71.8%; ILP vs. HUMAN 16%.
  • 48. Visual Turing Test Us vs Original Human Written Caption In some cases (16%), ILP generated captions were preferred over human written ones!
  • 50. To be presented at ICCV 2013 Meaning from large-scale computer vision Images with the word “house” Images recognized as more likely to produce the word “house”
  • 51. To be presented at ICCV 2013 Meaning from large-scale computer vision Images with the word “girl” Images recognized as more likely to produce the word “girl”
  • 52. To be presented at ICCV 2013. Meaning from large-scale computer vision. Weights learned to recognize images with “desk” in caption. Top weighted classifier outputs, grouped as Mammals, Birds, Instruments, Structures, Plants, Other. Weights learned over outputs of ~8k classifiers.
  • 53. To be presented at ICCV 2013. Meaning from large-scale computer vision. Weights learned to recognize images with “tree” in caption. Top weighted classifier outputs, grouped as Mammals, Birds, Instruments, Structures, Plants, Other. Weights learned over outputs of ~8k classifiers.

Editor's notes

  1. Add previous affiliations
  2. Most computer vision methods identify individual pieces of information, but they do not produce the kind of output you would expect from a human. From this picture, a good computer vision system would identify sky, trees, water, building, perhaps even bridge, whereas a person would say something like “a stone bridge over a peaceful river”. So our goal in this paper is to generate image descriptions rather than the individual pieces of information that computer vision methods usually output.
  5. We approach this task in a data-driven manner by first building a dataset of 1 million images with visually relevant captions. We construct this dataset by collecting an enormous number of captions assigned to images by web users and filtering them so that we end up with captions that are more likely to refer to visual content. We use standard global image descriptors such as GIST and Tiny Images to retrieve similar images, from which we can directly transfer captions.
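The retrieve-and-transfer step in this note can be sketched as a nearest-neighbor lookup. This is a minimal illustration under stated assumptions: the three-dimensional vectors are invented stand-ins for real global descriptors, Euclidean distance is an assumed similarity measure, and `l2_distance` and `transfer_captions` are hypothetical names. The captions are taken from the slides.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_captions(query_descriptor, database, k=1):
    """Return the captions of the k database images closest to the query.

    database is a list of (descriptor, caption) pairs.
    """
    ranked = sorted(
        database, key=lambda item: l2_distance(query_descriptor, item[0])
    )
    return [caption for _, caption in ranked[:k]]


# Toy descriptors (made up); a real system would use global image features.
database = [
    ([0.9, 0.1, 0.3], "Bridge to temple in Hoan Kiem lake."),
    ([0.1, 0.8, 0.5], "Smallest house in paris between red (on right) and beige (on left)."),
    ([0.8, 0.2, 0.4], "Hangzhou bridge in West lake."),
]
query = [0.88, 0.12, 0.32]

print(transfer_captions(query, database, k=2))
```

A linear scan like this is fine for a toy database; at the scale of a million images an approximate nearest-neighbor index would be the more usual choice.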
  7. Again we make use of the 1 million image SBU Captioned Photo Dataset.
  8. Additionally, we incorporate high-level information to rerank the images retrieved by the previous baseline method: we run object detectors, scene classification, stuff detection, and people and action detection, and we compute text statistics. In this example we have bridge and water detections, which we match against similar detections in the retrieved set of images. As you can see here, we run an object detector on a retrieved image only if a relevant keyword is mentioned in its caption. Text statistics are also relevant: if many images in the retrieved set agree that there is a bridge, those images are rewarded in the final ranking as well. And then again we can transfer captions from this reranked set of images.
  9. Finally, here are some good and bad results obtained using our full approach. The first picture says “Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind.” The captions are very human-like because they were written by actual humans, and the approach works surprisingly well for some types of images. On the other hand, even with 1 million images we can’t generalize to all possible observable images, and our image matching methods can fail, leading to bad results. If you would like to check our quantitative results in more detail, please come to our poster. Thanks.
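The text-statistics part of this reranking can be sketched as follows. This is a simplified illustration that covers only caption agreement: the detector, scene, and stuff scores mentioned in the note are left out, the keyword vocabulary is invented, and `rerank_by_text_agreement` is a hypothetical name. The captions are taken from the slides.

```python
import re
from collections import Counter


def rerank_by_text_agreement(captions, vocabulary):
    """Order captions so that those mentioning widely shared keywords come first."""

    def mentioned(caption):
        # Exact lowercase word match only; no stemming or plural handling.
        return set(re.findall(r"[a-z]+", caption.lower())) & vocabulary

    # How many retrieved captions mention each keyword.
    counts = Counter(word for c in captions for word in mentioned(c))

    def agreement(caption):
        return sum(counts[w] for w in mentioned(caption))

    # sorted() is stable, so ties keep their original retrieval order.
    return sorted(captions, key=agreement, reverse=True)


retrieved = [
    "A walk around the lake near our house with Abby.",
    "Bridge to temple in Hoan Kiem lake.",
    "Hangzhou bridge in West lake.",
    "The daintree river by boat.",
]
keywords = {"bridge", "lake", "river", "house", "boat"}  # assumed vocabulary

for caption in rerank_by_text_agreement(retrieved, keywords):
    print(caption)
```

In the toy input, "lake" appears in three captions and "bridge" in two, so the two bridge captions move ahead of the others.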
  13. We can retrieve noun phrases referring to an object in a query using visual similarity between the query detection and detections from the database.
  14. Similarly, we can retrieve verb phrases based on similar matching poses. This gives us, for example, “laying on the edge of the road in front of a house”.
  15. For relationships between objects and stuff detections, we use a combination of matching appearance and similarity in spatial arrangement. So here, for these car, tree, and grass detections, we can retrieve phrases like “under a tree”, “in the grass”, and so on.
  16. Finally, we can use our scene detectors to find matching images by scene similarity. For this we use the output of all of our scene classifiers as a descriptor for the image scene and then find similar scenes according to the similarity between scene descriptors. This sometimes, but not always, produces quite pleasing results. Here we generally get similar European street scenes matching our query image. These phrases provide a sort of general scene context for a description.
  17. First we do some processing on the data, including running about 100 object detectors, regional stuff detectors, and global scene classifiers. Finally, we parse the captions using the Berkeley parser to get phrases referring to objects, spatial arrangements with background elements, and general scene descriptions.
  18. But one issue with running lots of detectors is that it produces really noisy results. If, for example, you try to run 100 object and pose detectors on even these fairly simple images, you get a big mess of detections. Here’s a bicycle in the mountain, a chair down here… The correct detections may be in there somewhere, but you can’t really see them amongst all the noisy false detections. So obviously we had to make these results better if we were going to be able to use them.
  19. So we decided to play some simple tricks to make our recognition problem a little easier. For example, if you have some prior on what you expect to be in the image, then you can guide recognition in the right direction. In our case, with our giant captioned data set, we have really good evidence for what might be in an image: we have some text telling us the likely objects. So for an image with a caption, we can just run the detectors for the objects mentioned in the caption. Woohoo, that produces recognition results that are still not perfect, but considerably better! Now we can use these for captioning.
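The trick in this note, running only the detectors for objects named in the caption, can be sketched as a keyword filter. This is a minimal illustration: `DETECTOR_VOCABULARY` is an invented subset of object categories, `detectors_to_run` is a hypothetical name, and matching is by exact lowercase word, so plurals and synonyms are not handled. The example captions are taken from the “mentioned objects” slide.

```python
import re

# Hypothetical set of categories for which a detector is available.
DETECTOR_VOCABULARY = {
    "dog", "cat", "car", "bridge", "tree", "lion", "flower", "hat", "dress",
}


def detectors_to_run(caption, vocabulary=DETECTOR_VOCABULARY):
    """Return the detector categories whose name appears in the caption."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return sorted(vocabulary & words)


print(detectors_to_run("Look in the mountain for a lion face"))
print(detectors_to_run(
    "Ecuador, amazon basin, near coca, rain forest, passion fruit flower"
))
```

Restricting the detector set this way trades recall for precision: an object that is present but not mentioned in the caption is never detected, which fits the goal of collecting phrases that are tied to visible, mentioned content.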
  20. We compose descriptions from retrieved phrases using an ILP approach with a number of constraints from the vision predictions, linguistic constraints, discourse constraints, and phrase cohesion constraints.
  21. The captions we produce are often quite reasonable, sometimes even preferred over the original human written ones!