1. Data-driven Generation of Image
Descriptions
Vicente Ordonez-Roman
Advisor: Tamara Berg
Previously:
The State University of New York
2. What most Computer Vision systems aim
to say about a picture
Computer Vision
sky
trees
water
building
bridge
river
tree
3. What we are able to say about a picture
An old bridge over dirty green water.
Our Goal
One of the many stone bridges in town
that carry the gravel carriage roads.
A stone bridge over a peaceful river.
4. Let’s just borrow captions from similar images!
Im2Text: Describing Images Using 1 Million Captioned Photographs.
Vicente Ordonez, Girish Kulkarni, Tamara L. Berg.
Advances in Neural Information Processing Systems. NIPS 2011.
5. Harness the Web!
Images + Captions
from the Web
Smallest house in paris
between red (on right)
and beige (on left).
Matching using Global
Image Features
(GIST + Color)
Bridge to temple in
Hoan Kiem lake.
A walk around the
lake near our house
with Abby.
Transfer Caption(s)
e.g. “The water is clear
enough to see fish
swimming around in it.”
The water is clear
enough to see
fish swimming
around in it.
Hangzhou bridge in
West lake.
...
The daintree river by
boat.
6. Use the web to collect
images + captions
90,000,000,000 pictures! (**)
A lot of them with captions
(a lot of them not publicly available)
6,000,000,000 photographs! (*)
A lot of them with captions
(lots of them publicly available)
(*) http://blog.flickr.net/en/2011/08/04/6000000000/
(**) http://www.quora.com/How-many-photos-are-uploaded-to-Facebook-each-day
7. Flickr images + captions
Dog with a ball in its mouth running around like
crazy on the green grass.
cat in a sink
A 10-kg cat called Hercules.. and got caught in a pet
door when trying to sneak into another house to steal
dog food. 'Nuff said
8. Flickr images + captions
Dog with a ball in its mouth running around like
crazy on the green grass.
Dog with a ball in its mouth
running around like crazy on the
green grass.
cat in a sink
A 10-kg cat called Hercules.. and got caught in a pet
door when trying to sneak into another house to steal
dog food. 'Nuff said
9. Flickr images + captions
Dog with a ball in its mouth running around like
crazy on the green grass.
cat in a sink
A 10-kg cat called Hercules.. and got caught in a pet
door when trying to sneak into another house to steal
dog food. 'Nuff said
10. Flickr images + captions
Dog with a ball in its mouth running around like
crazy on the green grass.
cat in a sink
A 10-kg cat called Hercules.. and got caught in a pet
door when trying to sneak into another house to steal
dog food. 'Nuff said
11. Flickr images + captions
Dog with a ball in its mouth running around like
crazy on the green grass.
cat in a sink
A 10-kg cat called Hercules.. and got caught in a pet
door when trying to sneak into another house to steal
dog food. 'Nuff said
12. Flickr images + captions
Dog with a ball in its mouth running around like
crazy on the green grass.
cat in a sink
A 10-kg cat called Hercules.. and got caught in a pet
door when trying to sneak into another house to steal
dog food. 'Nuff said
13. Solution:
Collect hundreds of millions of captions
Filter out the noisy ones
We found “good captions” contain visual concepts and
relation words such as “by”, “in”, “over”, “beside”, “on top of”
~1 “good caption” for every 1000 “bad captions”
Im2Text: Describing Images Using 1 Million Captioned Photographs.
Vicente Ordonez, Girish Kulkarni, Tamara L. Berg.
Advances in Neural Information Processing Systems. NIPS 2011.
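The filtering heuristic above can be sketched roughly as follows. The keyword lists here are tiny hypothetical stand-ins; the actual system used much larger vocabularies of visual concepts and relation words.

```python
# Sketch of the caption filter: keep a caption only if it mentions at least
# one visual concept AND one spatial relation word. Both word lists below
# are illustrative placeholders, not the real vocabularies.
VISUAL_CONCEPTS = {"dog", "cat", "bridge", "river", "tree", "sky", "house"}
RELATION_WORDS = {"by", "in", "over", "beside", "under"}

def is_good_caption(caption: str) -> bool:
    text = caption.lower()
    words = set(text.split())
    has_concept = bool(words & VISUAL_CONCEPTS)
    has_relation = bool(words & RELATION_WORDS) or "on top of" in text
    return has_concept and has_relation

captions = [
    "my vacation 2009",
    "A stone bridge over a peaceful river.",
]
good = [c for c in captions if is_good_caption(c)]
```

With the ~1-in-1000 yield quoted above, such a filter reduces hundreds of millions of raw captions to roughly the 1 million kept in the dataset.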
14. SBU Captioned Photo Dataset
The Egyptian cat statue by the
floor clock and perpetual
motion machine in the
pantheon
Man sits in a rusted car buried
in the sand on Waitarere beach
Little girl and her dog in
northern Thailand. They both
seemed interested in what we
were doing
Our dog Zoe in her
bed
Interior design of modern white
and brown living room furniture
against white wall with a lamp
hanging.
Emma in her hat looking
super cute
15. Results
(1) while walking by the water
(2) plane flying over the sun
(3) shot this in a moving car at the nkve highway
(4) sunset over creve coeur lake and the page bridge
(5) sunset on 12th sep 2009 as seen from the field polder near my house
(6) window over yellow door
(7) sunset over capitol hill as seen from the roof of my building
(8) an orange sky over the irish sea
(9) beautiful golden sunset reflected in the waves of the ocean
(10) red sky probably caused by volcanic ash from iceland
(11) a view of sunset over river brahmaputa from koliyabhumura bridge
(12) red sky in the morning
16. Results
(1) burnt wooden door in derelict building portugal
(2) peterborough cathedral norman door in south wall
(3) amazing wooden door with wider light above
(4) door in wall
(5) girl looking in a classroom window
(6) a interesting cross in a window of an ancient city
(7) this mirror decorated with fruit painting was left behind by the previous owners
(8) unusual exterior wall postbox at st albans post office in st peters street al1
(9) door in oxford uk in black and white
(10) 19 plate behind glass in brass mat and preserver
(11) this is some of the window decoration external on the house just over the porch 0364
(12) cat in a window
17. Results
(1) img8783 ginger in the red chair
(2) red sky in the morning
(3) the cat is in the bag and the bag is in the river
(4) the light in the kitchen made everythin glow my little girl is growing up
(5) my cat in a box that is far too small for her
(6) one of the towel animals in the cabin edno ot jivotnite napraveno ot havlieni karpi v kabinata
(7) baby in her later years turned from green to red but she never went fully red all over
(8) if you take pictures through the hole in the bottom of a flower pot the whole of the eldritch world is revealed
(9) glazed ceramic poop form in orange wooden box
(10) rock garden in library
(11) it s funny to capture the preciousest cat in the house at his most devillicious
(12) the pink will get replaced by orange and blue in the fall
18. Results
(1) starfish from the book toys to knit dashing dachs superwash sock yarn in goldfish backing is orange
fabric stuffing is pillow stuffing
(2) mural of birds and trees in the crypt of wat ratburana ayutthaya
(3) carvings in the rock wall
(4) acrylic on paper scarlet macaws communicate in the color red with yellow and blue as visual grammar
(5) epsom and table salt crystals growing in concentrated green tea solution
(6) the hops dried to a golden green in a matter of a few days almost too pretty to bag up
(7) after staring at the gorgeous colors of the leaves claes discovered that there were about 100 birds sleeping in the
(8) you know you re in wisconsin when the beach has pine needles in the sand
(9) i was walking down the sidewalk and i saw this glove craft dropped in the dirt it seemed really unusual
(10) made by fusing plastic bags
(11) bark pattern from a ponderosa pine tree in grand canyon national park
(12) the peasant that found a statue of the black virgin on a rock in a river
20. Use High Level Content to Rerank
(Objects, Stuff, People, Scenes, Captions)
The bridge over the
lake on Suzhou Street.
Iron bridge over the Duck
river.
Transfer Caption(s)
e.g. “The bridge over the
lake on Suzhou Street.”
The Daintree river by boat. Bridge over Cacapon river.
...
21. Some success…
Amazing colours in
the sky at sunset
with the orange of
the cloud and the
blue of the sky
behind.
A female mallard duck in the
lake at Luukki Espoo
Strange cloud formation
literally flowing through the sky
like a river in relation to the
other clouds out there.
The sun was
coming through
the trees while I
was sitting in my
chair by the river
Fresh fruit and
vegetables at the market
in Port Louis Mauritius.
Tree with red leaves in the
field in autumn.
Under the sky of burning
clouds.
Stained glass
window in
Eusebius church.
22. Still far from perfect
Incorrect objects
Kentucky cows in a field.
The cat in the window.
23. Still far from perfect
Incorrect context
The sky is blue over the Gherkin.
Tree beside the river.
Completely wrong
The boat ended up a kilometre from
the water in the middle of the airstrip.
Water over the road.
24. How to Evaluate?
• “Ground truth”: The car is parked next to the
train station besides a building.
• Candidates:
“There is car parked in front of an office building”
“This is the building that hosted the ceremony”
“A vehicle stopped next to my house”
Similar to evaluation in Machine
Translation
25. BLEU score evaluation against Human Captions
Method                                          BLEU score
Global matching (1k)                            0.0774
Global matching (10k)                           0.0909
Global matching (100k)                          0.0917
Global matching (1 million)                     0.1177
Global + Content matching (linear regression)   0.1215
Global + Content matching (linear SVM)          0.1259
26. Human Visual Verification
View overlooking Kuala Lumpur from my office
building
Please choose
the image that
better
corresponds to
the given
caption:
27. Human Visual Verification
Caption from
Flickr
Please choose
the image that
better
corresponds to
the given
caption:
Random image
View overlooking Kuala Lumpur from my office
building
28. Human Visual Verification
Caption from
Flickr
Random image
View overlooking Kuala Lumpur from my office
building
Please choose
the image that
better
corresponds to
the given
caption:
Caption used                   Success rate
Original human caption         96.0%
Top caption                    66.7%
Best from our top 4 captions   92.7%
29. Human Visual Evaluation
Caption
produced by
our system
Random image
The view from the 13th floor of an apartment building in
Nakano awesome.
Please choose
the image that
better
corresponds to
the given
caption:
Caption used                   Success rate
Original human caption         96.0%
Top caption                    66.7%
Best from our top 4 captions   92.7%
30. Human Visual Evaluation
Caption
produced by
our system
Random image
The view from the 13th floor of an apartment building in
Nakano awesome.
Please choose
the image that
better
corresponds to
the given
caption:
Caption used                   Success rate
Original human caption         96.0%
Top caption                    66.7%
Best from our top 4 captions   92.7%
32. Let’s not borrow captions from other
images, let’s just borrow short phrases!
Collective Generation of Natural Image Descriptions.
Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, Yejin Choi.
Association for Computational Linguistics. ACL 2012.
Large Scale Retrieval for Image Description Generation
Vicente Ordonez, Xufeng Han, Polina Kuznetsova, Girish Kulkarni, Margaret Mitchell,
Kota Yamaguchi, Karl Stratos, Amit Goyal, Jesse Dodge, Alyssa Mensch, Hal Daume III,
Alexander C. Berg, Yejin Choi, Tamara L. Berg
Under submission to IJCV special issue on Big Data.
34. Retrieving verb
phrases from similar
object detections
Contented dog just laying
on the edge of the road in
front of a house..
Peruvian dog sleeping on
city street in the city of
Cusco, (Peru)
Detect: dog
Find matching
dog detections
by visual
similarity
this dog was laying in the
middle of the road on a
back street in jaco
Closeup of my dog sleeping
under my desk.
35. Retrieving prepositional
phrases from region +
detection matches
Find matching region
detections using
appearance +
arrangement
Object: car
Cordoba - lonely elephant
under an orange tree...
Comfy chair under a tree.
I positioned the chairs
around the lemon tree -it's like a shrine
Mini Nike soccer ball all
alone in the grass
36. Retrieving prepositional phrases from scene matches
Extract scene descriptor
Pedestrian street in the Old
Lyon with stairs to climb up
the hill of fourviere
Find matching
images by scene
similarity
View from our B&B in this
photo
I'm about to blow the building
across the street over with my
massive lung power.
Only in Paris will you find a
bottle of wine on a table
outside a bookstore
37. Data Processing
1 million images:
– Run object detectors
– Run region based stuff detectors (e.g.
grass, sky, etc)
– Run global scene classifiers
– Parse captions associated with images
and retrieve phrases referring to objects
(NPs, VPs), region relationships (PP_stuff),
and general scene context (PP_scene).
39. Sometimes you can make it (a little) better
Detecting “mentioned” objects
Look in the mountain for a lion face
Ecuador, amazon basin, near coca, rain forest,
passion fruit flower
The background is a vintage paint by number painting I have
and the fabulous forest dress is by candyjunky!
Kevin’s mom, so punxrawk in Kev’s black flag hat
42. Binary Integer Linear Programming
[Diagram: each candidate phrase s_ij assigned to position k carries a
phrase vision-confidence score; pairwise phrase cohesion between phrase
s_ij at position k and phrase s_pq at position k+1 combines n-gram
co-occurrence and head-word cohesion.]
43. Composing Descriptions
Compose descriptions from phrases with ILP approach
• Linguistic constraints
– Allow only one phrase of each type
– Enforce plural/singular agreement between NP and VP
• Discourse constraints
– Prevent inclusion of repeated phrasing
• Phrase cohesion constraints
– n-gram statistics between phrases
– Co-occurrence statistics between head words of phrases (last
word or main verb) to encourage longer range cohesion
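The selection the ILP performs can be illustrated with a tiny brute-force stand-in over two phrase slots. The phrases, scores, cohesion function, and agreement rule below are all toy placeholders; the real system optimizes over many phrase types with an integer-programming solver.

```python
from itertools import product

# Candidate phrases per slot with hypothetical vision-confidence scores.
nps = [("a dog", 0.9), ("dogs", 0.6)]
vps = [("runs on the grass", 0.8), ("run around", 0.5)]

def cohesion(np_phrase: str, vp_phrase: str) -> float:
    # Stand-in for n-gram / head-word co-occurrence statistics.
    return 0.3 if np_phrase.split()[-1] == "dog" and vp_phrase.startswith("runs") else 0.1

def agree(np_phrase: str, vp_phrase: str) -> bool:
    # Crude singular/plural agreement check (toy rule for this example only).
    singular_np = not np_phrase.endswith("s")
    singular_vp = vp_phrase.split()[0].endswith("s")
    return singular_np == singular_vp

# Pick the agreeing NP+VP pair maximizing confidence plus cohesion.
best = max(
    ((np, vp, s1 + s2 + cohesion(np, vp))
     for (np, s1), (vp, s2) in product(nps, vps) if agree(np, vp)),
    key=lambda t: t[2],
)
sentence = f"{best[0]} {best[1]}"
```

The agreement constraint prunes "dogs runs …" and "a dog run …"; the cohesion term then breaks ties among the remaining combinations.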
44. Good Results
This is a sporty little red convertible
made for a great day in Key West FL. This
car was in the 4th parade of the
apartment buildings.
Taken in front of my cat sitting in a shoe
box. Cat likes hanging around in my
recliner.
This is a brass viking
boat moored on
beach in Tobago by
the ocean.
45. Bad Results
Grammatically incorrect.
Cognitive absurdity.
One of the most shirt in the wall of
the house.
Here you can see a cross by the frog
in the sky.
Not relevant
This is a shoulder bag with a blended
rainbow effect
47. Human Forced Choice Evaluation
Caption used                                          ILP Selection
ILP vs. HMM (no images, no cognitive phrases)         67.2%
ILP vs. HMM (no images, with cognitive phrases)       66.3%
ILP vs. HMM (with images, no cognitive phrases)       53.17%
ILP vs. HMM (with images, with cognitive phrases)     54.5%
ILP vs. NIPS 2011 (Global matching 1M)                71.8%
ILP vs. HUMAN                                         16%
48. Visual Turing Test
Us vs Original Human Written Caption
In some cases (16%), ILP-generated
captions were preferred over the
human-written ones!
50. To be presented at ICCV 2013
Meaning from large-scale computer vision
Images with the word “house”
Images recognized as more likely
to produce the word “house”
51. To be presented at ICCV 2013
Meaning from large-scale computer vision
Images with the word “girl”
Images recognized as more likely
to produce the word “girl”
52. To be presented at ICCV 2013
Meaning from large-scale computer vision
Weights learned to recognize
images with “desk” in caption
Mammals
Top weighted classifier outputs
Birds Instruments Structures Plants Other
Weights learned over outputs of ~8k classifiers
53. To be presented at ICCV 2013
Meaning from large-scale computer vision
Weights learned to recognize
images with “tree” in caption
Mammals
Top weighted classifier outputs
Birds Instruments Structures Plants Other
Weights learned over outputs of ~8k classifiers
54. Meaning from large-scale computer vision
Weights learned to recognize
images with “tree” in caption
Mammals
Top weighted classifier outputs
Birds Instruments Structures Plants Other
Weights learned over outputs of ~8k classifiers
Most computer vision methods identify individual pieces of information but do not produce the kind of output you would expect from a human. For this picture, a good computer vision system would identify sky, trees, water, building, perhaps even bridge; a person, on the other hand, would say something like “a stone bridge over a peaceful river.” Our goal in this work is to generate full image descriptions rather than the individual labels that computer vision methods usually output.
We approach this task in a data-driven manner by first building a dataset of 1 million images with visually relevant captions. We construct this dataset by collecting an enormous number of captions assigned to images by web users and filtering them so that we keep the captions most likely to refer to visual content. We then use standard global image descriptors such as GIST and Tiny Images to retrieve similar images from which we can directly transfer captions.
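The global-matching step described here can be sketched as nearest-neighbor retrieval over precomputed descriptors. The feature vectors and captions below are made up; the real system compares GIST and color descriptors over the million-image dataset.

```python
import math

def l2(a, b):
    # Euclidean distance between two global image descriptors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical precomputed global descriptors for captioned database images.
database = [
    ([0.9, 0.1, 0.4], "Hangzhou bridge in West lake."),
    ([0.2, 0.8, 0.7], "The daintree river by boat."),
]

def transfer_caption(query_descriptor):
    # Transfer the caption of the visually closest database image.
    _, caption = min(database, key=lambda item: l2(item[0], query_descriptor))
    return caption
```

The appeal of this baseline is that every output caption was written by a human, so fluency comes for free; only visual relevance is at risk.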
Again we make use of the million-image SBU Captioned Photo dataset.
Additionally we incorporate high level information to rerank the retrieved images used by the previous baseline method by running object detectors, scene classification, stuff detection, people and action detection and computing text statistics. So in this example we have a bridge and a water detections, we use those to match them with similar detections in the retrieved set of images. As you can see here we run object detectors in our retrieved images only if a relevant keyword is mentioned. Text statistics are also relevant because if in the retrieved set a lot of images agree that there is a bridge then those images are rewarded in the final ranking as well. And then again we can transfer captions from this reranked set of images.
Finally, here are some good and bad results obtained with our full approach. The first picture says “Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind.” The captions are very human-like because they were written by actual humans, and this works surprisingly well for some types of images. On the other hand, even with 1 million images we cannot generalize to all possible images, and our image-matching methods can fail, leading to bad results. If you would like to see our quantitative results in more detail, please come to our poster. Thanks.
We can retrieve noun phrases referring to an object in a query using visual similarity between the query detection and detections from the database.
Similarly, we can retrieve verb phrases based on matching poses, giving us, for example, “laying on the edge of the road in front of a house.”
For relationships between objects and stuff detections we use a combination of matching appearance and similarity in spatial arrangement. Here, for these car, tree, and grass detections, we can retrieve phrases like “under a tree,” “in the grass,” and so on.
Finally, we can use our scene classifiers to find matching images by scene similarity. We use the outputs of all of our scene classifiers as a descriptor for the image scene and then find similar scenes by comparing scene descriptors. This sometimes, though not always, produces quite pleasing results; here we generally get European street scenes similar to our query image. These phrases provide general scene context for a description.
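The scene-matching step can be sketched as cosine similarity between vectors of scene-classifier outputs. The scene categories and scores below are invented for illustration; the real descriptor spans the full bank of scene classifiers.

```python
import math

def cosine(a, b):
    # Cosine similarity between two scene descriptors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical scene-classifier outputs: [street, forest, beach].
query = [0.8, 0.1, 0.1]
db = {"lyon_street": [0.9, 0.05, 0.05], "beach_photo": [0.1, 0.1, 0.8]}

best_match = max(db, key=lambda name: cosine(db[name], query))
```

Cosine similarity is a natural choice here because it compares the shape of the classifier-response profile rather than its overall magnitude.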
First we process the data: we run about 100 object detectors, region-based stuff detectors, and global scene classifiers, and we parse the captions with the Berkeley parser to get phrases referring to objects, spatial arrangements with background elements, and general scene descriptions.
But one issue with running lots of detectors is that it produces very noisy results. If, for example, you run 100 object and pose detectors on even these fairly simple images, you get a big mess of detections: a bicycle in the mountain, a chair down here, and so on. The correct detections may be in there somewhere, but you can't really see them among all the noisy false positives. So we had to make these results better before we could use them.
So we play some simple tricks to make the recognition problem a little easier. If you have a prior on what you expect to be in the image, you can guide recognition in the right direction. In our case, our giant captioned dataset gives us very good evidence for what might be in an image: the caption tells us the likely objects. So for an image with a caption, we run only the detectors for the objects mentioned in the caption. This produces still imperfect, but considerably better, recognition results, which we can then use for captioning.
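The caption-guided detection trick can be sketched as filtering the detector bank by the nouns mentioned in the caption. The detector names and the simple word-matching below are illustrative placeholders.

```python
# Hypothetical bank of trained object detectors.
DETECTOR_BANK = ["dog", "cat", "bicycle", "chair", "car", "bridge"]

def detectors_to_run(caption: str):
    # Run only detectors whose object is mentioned in the caption.
    words = set(caption.lower().replace(".", "").split())
    return [name for name in DETECTOR_BANK if name in words]

active = detectors_to_run("Dog with a ball in its mouth running on the grass.")
```

Running one detector instead of a hundred removes most of the false positives before they are ever produced, at the cost of missing unmentioned objects.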
We compose descriptions from retrieved phrases using an ILP with constraints derived from the vision predictions, plus linguistic, discourse, and phrase-cohesion constraints.
The captions we produce are often quite reasonable, sometimes even preferred over the original human written ones!