Presentation of our top-scoring solution to the MediaEval 2023 NewsImages Task, "Cross-modal Networks, Fine-Tuning, Data Augmentation and Dual Softmax Operation for MediaEval NewsImages 2023", by A. Leventakis, D. Galanopoulos, V. Mezaris, delivered at the 2023 Multimedia Evaluation Workshop (MediaEval'23), Amsterdam, NL, Feb. 2024.
1. 1
Cross-modal Networks, Fine-Tuning, Data Augmentation and Dual
Softmax Operation for MediaEval NewsImages 2023
Antonios Leventakis, Damianos Galanopoulos, Vasileios Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
MediaEval 2023 Workshop
1-2 Feb. 2024
2. 2
Our takeaway message
• Our contributions
• Data augmentation: Generated one extra text for every training and test pair
• Used pre-trained CLIP models
• Also tested fine-tuning CLIP model
• Dual-softmax similarity revision
• Our observations
• Fine-tuning improves performance
• The official results contrast, in part, with our internal experiments; it is important to consider
the nature of the data when selecting a pre-trained/fine-tuned CLIP model
3. 3
Motivation
• CLIP’s proven capabilities in image-text association
• Fine-tuning’s potential in capturing unique relationships between
images and texts in the news domain
• Data Augmentation could enhance models’ robustness
• Dual softmax as a results re-ranking method can improve
performance (as also shown in last year’s findings [1])
[1] D. Galanopoulos, V. Mezaris, Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022, in: Working Notes
Proceedings of the MediaEval 2022 Workshop, volume 3583, CEUR Workshop Proceedings, 2023.
4. 4
Approach: CLIP fine-tuning
• Training data collection
• 4.8M image-caption pairs from public datasets in the news domain: NYTimes800K,
N24News, BreakingNews, Al Jazeera News (i), CNN News (ii), BBC UK News (iii), Huffpost
News (iv) and Bloomberg (v)
• Data augmentation
• One additional caption was generated for every image via the “T5” attention-based
transformer model [2]; 9.6M image-text pairs in total for training (a sketch of this step follows this slide’s references)
• Fine-tuning of pre-trained CLIP model
• The “ViT-L/14@336px” model was fine-tuned with the original and the augmented
data with a learning rate of 3e-7 for 1 epoch
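A minimal sketch of one such fine-tuning step, assuming the openai/CLIP package, a standard symmetric contrastive (cross-entropy) objective, and a hypothetical batch of preprocessed images and caption strings; data loading and mixed-precision details are omitted and are not confirmed by the slides.

```python
# Minimal sketch of one CLIP fine-tuning step (assumptions: openai/CLIP package,
# symmetric contrastive loss, hypothetical batch of preprocessed images/captions).
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)
model.float()  # train in fp32 for simplicity
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-7)  # learning rate from the slide

def finetune_step(images, captions):
    """One contrastive update on a batch of (image tensor, caption string) pairs."""
    texts = clip.tokenize(captions, truncate=True).to(device)
    img_feat = F.normalize(model.encode_image(images.to(device)), dim=-1)
    txt_feat = F.normalize(model.encode_text(texts), dim=-1)

    # Cosine-similarity logits scaled by CLIP's learned temperature
    logits = model.logit_scale.exp() * img_feat @ txt_feat.t()
    labels = torch.arange(len(captions), device=device)

    # Symmetric cross-entropy: images -> captions and captions -> images
    loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Iterating this step once over the 9.6M original-plus-augmented pairs corresponds to the single training epoch reported on the slide.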
(i) https://data.world/opensnippets/al-jazeera-news-dataset, (ii) https://data.world/opensnippets/cnn-news-dataset, (iii) https://data.world/opensnippets/bbc-uk-news-dataset,
(iv) https://data.world/crawlfeeds/huffspot-news-dataset, (v) https://data.world/crawlfeeds/bloomberg-quint-news-dataset
[2] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the Limits of Transfer Learning with a
Unified Text-to-Text Transformer, Journal of Machine Learning Research 21 (2020) 1–67.
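The slide only states that a T5 transformer [2] produced the additional caption for every image; the exact checkpoint and prompt are not given, so the sketch below assumes an instruction-tuned T5 variant (google/flan-t5-base) used as a paraphraser via Hugging Face transformers.

```python
# Sketch of the caption-augmentation step: one paraphrased caption per original caption.
# The checkpoint (google/flan-t5-base) and the prompt are assumptions; the slides only
# state that a T5 transformer [2] was used to generate the additional texts.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

CHECKPOINT = "google/flan-t5-base"  # assumed paraphrasing model, not confirmed by the slides
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

def augment_caption(caption: str) -> str:
    """Generate one additional caption by paraphrasing the original one."""
    prompt = "Paraphrase the following image caption: " + caption
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Every (image, caption) training pair gains an (image, augmented caption) pair,
# doubling the 4.8M pairs to the 9.6M pairs mentioned on the slide.
pairs = [("img_001.jpg", "Protesters gather outside the parliament building.")]
pairs += [(img, augment_caption(cap)) for img, cap in pairs]
```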
5. 5
Approach: using CLIP
• Pre-trained CLIP models (in addition to fine-tuned one)
• The “ViT-H/14” model of openCLIP and the “ViT-L/14@336px” model of CLIP were
used directly for retrieval
• Inference-stage score aggregation
• The same data augmentation was applied to the test data; the similarity scores from the original
and augmented pairs were aggregated via mean pooling to obtain the final predictions (see the
sketch below)
• Dual softmax similarity revision
• Dual softmax operations were applied at the inference stage to investigate their effect on
performance
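A sketch of this inference stage, assuming image and text embeddings have already been extracted with one of the CLIP models above; the placeholder tensors, the mean pooling over original/augmented text scores, and the dual-softmax step (elementwise product of the two axis-wise softmaxes, with an assumed temperature of 100) follow one common formulation of the operation rather than a confirmed implementation detail.

```python
# Sketch of inference-stage score aggregation and dual-softmax similarity revision.
# The embeddings below are random placeholders standing in for CLIP features.
import torch
import torch.nn.functional as F

num_images, num_texts, dim = 100, 40, 768            # placeholder sizes
image_feats = torch.randn(num_images, dim)           # CLIP image embeddings
text_feats_original = torch.randn(num_texts, dim)    # embeddings of the original article texts
text_feats_augmented = torch.randn(num_texts, dim)   # embeddings of the T5-augmented texts

def similarity_matrix(img, txt):
    """Cosine similarities between L2-normalized image and text embeddings."""
    return F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).t()   # (num_images, num_texts)

def dual_softmax(sim, temperature=100.0):
    """Dual softmax: softmax over the image axis times softmax over the text axis."""
    return torch.softmax(sim * temperature, dim=0) * torch.softmax(sim * temperature, dim=1)

# Mean-pool the scores obtained with the original and the augmented text of each article.
sim = 0.5 * (similarity_matrix(image_feats, text_feats_original)
             + similarity_matrix(image_feats, text_feats_augmented))

# Optional dual-softmax re-ranking, then rank candidate images for every article text.
sim = dual_softmax(sim)
ranked_images_per_text = sim.argsort(dim=0, descending=True)   # image indices, best first
```

Runs submitted without the dual-softmax revision would simply skip the `dual_softmax` call and rank on the mean-pooled similarities directly.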
6. 6
Submitted Runs
Model Fine-tuning Dual Softmax
Run #1 ViT-H/14 ✓
Run #2 ViT-L/14@336px
Run #3 ViT-L/14@336px ✓
Run #4 ViT-L/14@336px ✓
Run #5 ViT-L/14@336px ✓ ✓
9. 9
Results
• ViT-H/14 is more suitable for the
GDELT-P2 dataset
• Fine-tuning benefits performance
10. 10
Results
• ViT-H/14 is more suitable for the
GDELT-P2 dataset
• Fine-tuning benefits performance
• Different pre-trained CLIP versions
significantly affect the final
performance
11. 11
Results
• ViT-H/14 is more suitable for the
GDELT-P2 dataset
• The significant amount of
synthetic images in GDELT-P2 is
probably important to consider
when selecting a CLIP version
• Fine-tuning benefits performance
• Different pre-trained CLIP versions
significantly affect the final
performance
• Dual softmax results are mixed
12. 12
Results
• The official results contrast, in part, with our
internal findings, in which:
• Both fine-tuning and dual softmax
benefited performance
13. 13
Lessons Learned
• CLIP fine-tuning improves performance
• Utilizing different pre-trained CLIP/openCLIP versions could reveal
further possibilities
• Further exploration of fine-tuning strategies could lead to a deeper
understanding of how to effectively adapt pre-trained models to
specific domains and tasks
• Future research could delve into understanding the capabilities and
limitations of pre-trained models in processing synthetic data and
develop strategies to improve performance in such scenarios
14. 14
Thank you for your attention!
Questions?
Vasileios Mezaris, bmezaris@iti.gr
This work was supported by the EU’s Horizon Europe and Horizon 2020 research and innovation
programs under grant agreements 101070190 AI4Trust and 101021866 CRiTERIA, respectively.