Presentation of our top-scoring solution to the MediaEval 2023 NewsImages Task, "Cross-modal Networks, Fine-Tuning, Data Augmentation and Dual Softmax Operation for MediaEval NewsImages 2023", by A. Leventakis, D. Galanopoulos, V. Mezaris, delivered at the 2023 Multimedia Evaluation Workshop (MediaEval'23), Amsterdam, NL, Feb. 2024.
1. 1
Cross-modal Networks, Fine-Tuning, Data Augmentation and Dual
Softmax Operation for MediaEval NewsImages 2023
Antonios Leventakis, Damianos Galanopoulos, Vasileios Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
MediaEval 2023 Workshop
1-2 Feb. 2024
2. 2
Our takeaway message
• Our contributions
• Data augmentation: Generated one extra text for every training and test pair
• Used pre-trained CLIP models
• Also tested fine-tuning CLIP model
• Dual-softmax similarity revision
• Our observations
• Fine-tuning improves performance
• The official results contrast, in part, with our internal experiments; it is important to consider
the nature of the data when selecting a pre-trained/fine-tuned CLIP model
3. 3
Motivation
• CLIP’s proven capabilities in image-text association
• Fine-tuning’s potential in capturing unique relationships between
images and texts in the news domain
• Data Augmentation could enhance models’ robustness
• Dual softmax as a results re-ranking method can improve
performance (as also shown in last year’s findings [1])
[1] D. Galanopoulos, V. Mezaris, Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022, in: Working Notes
Proceedings of the MediaEval 2022 Workshop, volume 3583, CEUR Workshop Proceedings, 2023.
4. 4
Approach: CLIP fine-tuning
• Training data collection
• 4.8M image-caption pairs from public datasets in the news domain: NYTimes800K,
N24News, BreakingNews, Al Jazeera News (i), CNN News (ii), BBC UK News (iii), Huffpost
News (iv) and Bloomberg (v)
• Data augmentation
• One additional caption was generated for every image via the “T5” attention-based
transformer model [2]; 9.6M image-text pairs in total for training (a sketch of this step follows this slide’s references)
• Fine-tuning of pre-trained CLIP model
• The “ViT-L/14@336px” model was fine-tuned with the original and the augmented
data with a learning rate of 3e-7 for 1 epoch
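A minimal sketch of one such fine-tuning step, assuming the openai/CLIP package, a standard symmetric contrastive (cross-entropy) objective, and a hypothetical batch of preprocessed images and caption strings; data loading and mixed-precision details are omitted and are not confirmed by the slides.

```python
# Minimal sketch of one CLIP fine-tuning step (assumptions: openai/CLIP package,
# symmetric contrastive loss, hypothetical batch of preprocessed images/captions).
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)
model.float()  # train in fp32 for simplicity
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-7)  # learning rate from the slide

def finetune_step(images, captions):
    """One contrastive update on a batch of (image tensor, caption string) pairs."""
    texts = clip.tokenize(captions, truncate=True).to(device)
    img_feat = F.normalize(model.encode_image(images.to(device)), dim=-1)
    txt_feat = F.normalize(model.encode_text(texts), dim=-1)

    # Cosine-similarity logits scaled by CLIP's learned temperature
    logits = model.logit_scale.exp() * img_feat @ txt_feat.t()
    labels = torch.arange(len(captions), device=device)

    # Symmetric cross-entropy: images -> captions and captions -> images
    loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Iterating this step once over the 9.6M original-plus-augmented pairs corresponds to the single training epoch reported on the slide.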
(i) https://data.world/opensnippets/al-jazeera-news-dataset, (ii) https://data.world/opensnippets/cnn-news-dataset, (iii) https://data.world/opensnippets/bbc-uk-news-dataset,
(iv) https://data.world/crawlfeeds/huffspot-news-dataset, (v) https://data.world/crawlfeeds/bloomberg-quint-news-dataset
[2] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the Limits of Transfer Learning with a
Unified Text-to-Text Transformer, Journal of Machine Learning Research 21 (2020) 1–67.
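The slide only states that a T5 transformer [2] produced the additional caption for every image; the exact checkpoint and prompt are not given, so the sketch below assumes an instruction-tuned T5 variant (google/flan-t5-base) used as a paraphraser via Hugging Face transformers.

```python
# Sketch of the caption-augmentation step: one paraphrased caption per original caption.
# The checkpoint (google/flan-t5-base) and the prompt are assumptions; the slides only
# state that a T5 transformer [2] was used to generate the additional texts.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

CHECKPOINT = "google/flan-t5-base"  # assumed paraphrasing model, not confirmed by the slides
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

def augment_caption(caption: str) -> str:
    """Generate one additional caption by paraphrasing the original one."""
    prompt = "Paraphrase the following image caption: " + caption
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Every (image, caption) training pair gains an (image, augmented caption) pair,
# doubling the 4.8M pairs to the 9.6M pairs mentioned on the slide.
pairs = [("img_001.jpg", "Protesters gather outside the parliament building.")]
pairs += [(img, augment_caption(cap)) for img, cap in pairs]
```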
5. 5
Approach: using CLIP
• Pre-trained CLIP models (in addition to fine-tuned one)
• The “ViT-H/14” model of openCLIP and the “ViT-L/14@336px” model of CLIP were
used directly for retrieval
• Inference-stage score aggregation
• The same data augmentation was applied to the test data; the similarity scores from the original
and augmented pairs were aggregated via mean pooling to obtain the final predictions (see the
sketch below)
• Dual softmax similarity revision
• Dual softmax operations were applied at the inference stage to investigate their effect on
performance
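A sketch of this inference stage, assuming image and text embeddings have already been extracted with one of the CLIP models above; the placeholder tensors, the mean pooling over original/augmented text scores, and the dual-softmax step (elementwise product of the two axis-wise softmaxes, with an assumed temperature of 100) follow one common formulation of the operation rather than a confirmed implementation detail.

```python
# Sketch of inference-stage score aggregation and dual-softmax similarity revision.
# The embeddings below are random placeholders standing in for CLIP features.
import torch
import torch.nn.functional as F

num_images, num_texts, dim = 100, 40, 768            # placeholder sizes
image_feats = torch.randn(num_images, dim)           # CLIP image embeddings
text_feats_original = torch.randn(num_texts, dim)    # embeddings of the original article texts
text_feats_augmented = torch.randn(num_texts, dim)   # embeddings of the T5-augmented texts

def similarity_matrix(img, txt):
    """Cosine similarities between L2-normalized image and text embeddings."""
    return F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).t()   # (num_images, num_texts)

def dual_softmax(sim, temperature=100.0):
    """Dual softmax: softmax over the image axis times softmax over the text axis."""
    return torch.softmax(sim * temperature, dim=0) * torch.softmax(sim * temperature, dim=1)

# Mean-pool the scores obtained with the original and the augmented text of each article.
sim = 0.5 * (similarity_matrix(image_feats, text_feats_original)
             + similarity_matrix(image_feats, text_feats_augmented))

# Optional dual-softmax re-ranking, then rank candidate images for every article text.
sim = dual_softmax(sim)
ranked_images_per_text = sim.argsort(dim=0, descending=True)   # image indices, best first
```

Runs submitted without the dual-softmax revision would simply skip the `dual_softmax` call and rank on the mean-pooled similarities directly.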
6. 6
Submitted Runs
Model Fine-tuning Dual Softmax
Run #1 ViT-H/14 ✓
Run #2 ViT-L/14@336px
Run #3 ViT-L/14@336px ✓
Run #4 ViT-L/14@336px ✓
Run #5 ViT-L/14@336px ✓ ✓
9. 9
Results
• ViT-H/14 is more suitable for the
GDELT-P2 dataset
• Fine-tuning benefits performance
10. 10
Results
• ViT-H/14 is more suitable for the
GDELT-P2 dataset
• Fine-tuning benefits performance
• Different pre-trained CLIP versions
significantly affect the final
performance
11. 11
Results
• ViT-H/14 is more suitable for the
GDELT-P2 dataset
• The significant amount of
synthetic images in GDELT-P2 is
probably important to consider
when selecting a CLIP version
• Fine-tuning benefits performance
• Different pre-trained CLIP versions
significantly affect the final
performance
• Dual softmax results are mixed
12. 12
Results
• The official results contrast, in part, with our
internal findings, in which:
• Both fine-tuning and dual softmax
benefited performance
13. 13
Lessons Learned
• CLIP fine-tuning improves performance
• Utilizing different pre-trained CLIP/openCLIP versions could reveal
further possibilities
• Further exploration of fine-tuning strategies could lead to a deeper
understanding of how to effectively adapt pre-trained models to
specific domains and tasks
• Future research could delve into understanding the capabilities and
limitations of pre-trained models in processing synthetic data and
develop strategies to improve performance in such scenarios
14. 14
Thank you for your attention!
Questions?
Vasileios Mezaris, bmezaris@iti.gr
This work was supported by the EU’s Horizon Europe and Horizon 2020 research and innovation
programs under grant agreements 101070190 AI4Trust and 101021866 CRiTERIA, respectively.