The Covid-on-the-Web project aims to allow biomedical researchers to access, query and make sense of COVID-19 related literature. To do so, it adapts, combines and extends tools to process, analyze and enrich the "COVID-19 Open Research Dataset" (CORD-19), which gathers 50,000+ full-text scientific articles related to coronaviruses. We report on the RDF dataset and software resources produced in this project by leveraging skills in knowledge representation, text, data and argument mining, as well as data visualization and exploration. The dataset comprises two main knowledge graphs describing (1) named entities mentioned in the CORD-19 corpus and linked to DBpedia, Wikidata and other BioPortal vocabularies, and (2) arguments extracted using ACTA, a tool automating the extraction and visualization of argumentative graphs, meant to help clinicians analyze clinical trials and make decisions. On top of this dataset, we provide several visualization and exploration tools based on the Corese Semantic Web platform, the MGExplorer visualization library, and the Jupyter Notebook technology. Throughout this initiative, we have been engaged in discussions with healthcare and medical research institutes to align our approach with the actual needs of the biomedical community, and we have paid particular attention to complying with the open and reproducible science goals and the FAIR principles.
3. vs. use cases…
• Scenario 1: Help clinicians analyze clinical trials and make evidence-based decisions
• Scenario 3: Help mission heads from the Cancer Institute elaborate research programs to study the links between cancer and coronavirus
[Giboin, et al.]
6. identify what exists on the web
http://my-site.fr
identify, on the web, what exists
http://animals.org/this-zebra
7. URIs for…
• URI for Paris in DBpedia:
http://dbpedia.org/resource/Paris
• URI for the name of Victor Hugo in the Library of Congress:
http://id.loc.gov/authorities/names/n79091479
• The MUC18 protein at UniProt
http://www.uniprot.org/uniprot/P43121
• Xavier Dolan in Wikidata
http://www.wikidata.org/entity/Q551861
• The book with doi:10.1007/3-540-45741-0_18
http://dx.doi.org/10.1007/3-540-45741-0_18
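As a small illustration of how such identifiers can be handled programmatically, the sketch below splits some of the URIs listed above into the authority that mints them and a local identifier. It relies only on Python's standard `urllib.parse`; the helper name `split_uri` is hypothetical, and taking the last path segment as the local identifier is an assumption that fits these four URIs but not, for instance, the DOI, whose identifier spans two path segments.

```python
from urllib.parse import urlparse


def split_uri(uri: str) -> tuple[str, str]:
    """Return (authority, local identifier) for an HTTP URI."""
    parsed = urlparse(uri)
    # Assumption: the last path segment is the local identifier.
    local_id = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    return parsed.netloc, local_id


uris = [
    "http://dbpedia.org/resource/Paris",
    "http://id.loc.gov/authorities/names/n79091479",
    "http://www.uniprot.org/uniprot/P43121",
    "http://www.wikidata.org/entity/Q551861",
]
for uri in uris:
    print(split_uri(uri))
```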
9. URIs for… things mentioned in papers
AI methods: NLP and IR for named entity annotation in text
DBpedia Spotlight → DBpedia URIs
Entity-fishing → Wikidata URIs
BioPortal Annotator → BioPortal URIs
“Effects on QT interval of hydroxychloroquine associated with ritonavir/darunavir or azithromycin in patients with SARS-CoV-2 infection” [Danzi et al.]
http://dbpedia.org/resource/Severe_acute_respiratory_syndrome_coronavirus_2
http://dbpedia.org/resource/Hydroxychloroquine
http://dbpedia.org/resource/Azithromycin
http://dbpedia.org/resource/Ritonavir
http://dbpedia.org/resource/Darunavir
http://dbpedia.org/resource/Heart_arrhythmia
[Gazzotti, et al.]
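To make the idea of linking mentions to URIs concrete, here is a minimal sketch that annotates the title above with a hand-written lookup table. This is a toy gazetteer match, not how DBpedia Spotlight, Entity-fishing or the BioPortal Annotator work: the surface forms in `GAZETTEER` and the function name `annotate` are assumptions made for illustration, and only the URIs come from the slide. The Heart_arrhythmia URI is left out because the slide does not say which mention it was linked from.

```python
# Toy gazetteer: lowercase surface form -> DBpedia URI (URIs taken from the slide).
GAZETTEER = {
    "sars-cov-2": "http://dbpedia.org/resource/Severe_acute_respiratory_syndrome_coronavirus_2",
    "hydroxychloroquine": "http://dbpedia.org/resource/Hydroxychloroquine",
    "azithromycin": "http://dbpedia.org/resource/Azithromycin",
    "ritonavir": "http://dbpedia.org/resource/Ritonavir",
    "darunavir": "http://dbpedia.org/resource/Darunavir",
}


def annotate(text: str) -> list[tuple[int, str, str]]:
    """Return (offset, mention, URI) for every gazetteer entry found in text."""
    lowered = text.lower()
    hits = []
    for surface, uri in GAZETTEER.items():
        start = lowered.find(surface)
        while start != -1:
            hits.append((start, text[start:start + len(surface)], uri))
            start = lowered.find(surface, start + 1)
    return sorted(hits)  # ordered by character offset


title = ("Effects on QT interval of hydroxychloroquine associated with "
         "ritonavir/darunavir or azithromycin in patients with SARS-CoV-2 infection")
for offset, mention, uri in annotate(title):
    print(offset, mention, uri)
```

Keeping the character offset with each mention is what would let an annotation point back to the exact span of text it describes.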
10. Extracting arguments and PICO elements
AI methods: BERT+SciBERT, LSTM, Conditional Random Field
[Mayer, Cabrio, Villata, Marro, et al.]
Evidence-Based Practice (EBP)
• Patient Problem (or Population)
• Intervention
• Comparison or Control
• Outcome
[ECAI 2020]
11. COVID ON THE WEB
[Architecture diagram. Process data & derive “smarter” data: Covid-19 Open Research Dataset; Named Entities extractors; ACTA pipeline (extraction of argument graphs); RDF translator; LOD vocabularies & datasets; Covid-on-the-Web dataset]
[Michel, Gazzotti, Gandon, Cabrio, Villata, Mayer et al.]
[ISWC 2020, IC 2021]
16. linked open data(sets) cloud on the Web in RDF
[Chart: number of linked open datasets on the Web, from 01/05/2007 to 26/01/2017; vertical axis from 0 to 1400]
21. Integrate in RDF
• metadata
• named entities
• arguments
• PICO elements
• provenance
• + other data sources
Morph-xR2RML, Virtuoso [Michel, Gazzotti et al.]
22. Dataset description (No. RDF triples)
• dataset description + definition of a few properties: 170
• articles metadata (title, authors, DOIs, journal etc.): 3 722 381
• named entities identified by Entity-fishing in articles titles/abstracts: 35 049 832
• named entities identified by Entity-fishing in articles bodies: 1 156 611 321
• named entities identified by Bioportal Annotator in articles titles/abstracts: 104 430 547
• named entities identified by DBpedia Spotlight in articles titles/abstracts: 65 359 664
• argumentative components and PICO elements by ACTA from articles titles/abstracts: 7 469 234
• Total: 1 361 451 364
24. COVID ON THE WEB
[Architecture diagram. Process data & derive “smarter” data: Covid-19 Open Research Dataset; Named Entities extractors; ACTA pipeline (extraction of argument graphs); RDF translator; LOD vocabularies & datasets; Covid-on-the-Web dataset. Means to exploit data: Open Data publication (Zenodo, Github, Virtuoso); Corese engine (query & infer); Corese portal (data browsing); MGExplorer (data visualization); ACTA Web application (visualization of argument graphs); Jupyter Notebook (Python, R & analytics)]
[Corby, Michel, Gazzotti, Gandon et al.]
[ISWC 2020, IC 2021]
26. Articles about cancer in the COVID dataset:
• "Antipsychotic treatment effects on cardiovascular, cancer, infection, and intentional self-harm as cause of death in patients with Alzheimer's dementia"
• "Targeting cancer stem cell pathways for cancer therapy"
• "Deubiquitinases and cancer: A snapshot"
• "The Functional Properties of Preserved Eggs: From Anti-cancer and Anti-inflammatory Aspects"
• "The functional role of the novel biomarker karyopherin α 2 (KPNA2) in cancer"
• "Biochemical characterisation of lectin from Indian hyacinth plant bulbs with potential inhibitory action against human cancer cells"
• "Purification, identification and profiling of serum amyloid A proteins from sera of advanced-stage cancer patients"
• "Darwin, medicine and cancer"
• "Outcome of Oncology Patients Infected With Coronavirus"
• "Review and Meta-Analyses of TAAR1 Expression in the Immune System and Cancers"
• "Experimental Data-Mining Analyses Reveal New Roles of Low-Intensity Ultrasound in Differentiating Cell Death Regulatome in Cancer and Non-cancer Cells via Potential Modulation of Chromatin Long-Range Interactions"
• "Golgi anti-apoptotic protein: a tale of camels, calcium, channels and cancer"
• "Severe novel influenza A (H1N1) infection in cancer patients"
• "Molecular Profiling of Multiple Human Cancers Defines an Inflammatory Cancer-Associated Molecular Pattern and Uncovers KPNA2 as a Uniform Poor Prognostic Cancer Marker"
• "SARS-CoV-2 transmission in cancer patients of a tertiary hospital in Wuhan"
• "Creosote bush lignans for human disease treatment and prevention: Perspectives on combination therapy"
• "8 Electrospinning and microfluidics An integrated approach for tissue engineering and cancer"
• "Genomic and proteomic approaches for studying human cancer: Prospects for true patient-tailored therapy"
• "Community acquired respiratory virus infections in cancer patients—Guideline on diagnosis and management by the Infectious Diseases Working Party of the German Society for haematology and Medical Oncology"
• "RIG-I Enhanced Interferon Independent Apoptosis upon Junin Virus Infection"
• "Respiratory Viral Infections in Transplant and Oncology Patients" …
Query graph GQ: Cancer
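A list like the one above could be obtained by querying the dataset for articles annotated with a given entity. The sketch below only builds such a SPARQL query as a string. The modeling it assumes (named entities as `oa:Annotation` resources with `oa:hasBody` pointing to the entity and `schema:about` pointing to the article, titles given with `dct:title`) is not stated on the slides, and the function name `build_entity_query` and the example entity URI are hypothetical.

```python
def build_entity_query(entity_uri: str, limit: int = 20) -> str:
    """Build a SPARQL query listing titles of articles annotated with entity_uri."""
    # Assumed modeling (not taken from the slides): each named entity is an
    # oa:Annotation whose body is the entity URI and which is about an article.
    return f"""
PREFIX oa:     <http://www.w3.org/ns/oa#>
PREFIX schema: <http://schema.org/>
PREFIX dct:    <http://purl.org/dc/terms/>

SELECT DISTINCT ?article ?title WHERE {{
  ?annotation a oa:Annotation ;
              oa:hasBody <{entity_uri}> ;
              schema:about ?article .
  ?article dct:title ?title .
}}
LIMIT {limit}
""".strip()


# Example entity chosen for illustration only.
print(build_entity_query("http://dbpedia.org/resource/Cancer"))
```

The resulting string would still have to be sent to a SPARQL endpoint and adapted to the vocabulary the dataset actually uses.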
29. F ∧ O → R ⇔ GD ≤ GQ
mapping modulo an ontology
O: Lymphoma(x) ⇒ Cancer(x)
GD: Lymphoma
GQ: Cancer
[Corby et al.]
AI methods: knowledge graphs, ontology-based formalisms, querying, validating and reasoning
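The effect of mapping modulo an ontology can be sketched in a few lines: a query for Cancer should also return resources typed as Lymphoma, because the ontology states Lymphoma(x) ⇒ Cancer(x). The code below is a plain-Python illustration under simplifying assumptions (each class has at most one direct superclass), not the Corese implementation; the resource names and the `Influenza` class are hypothetical.

```python
# Ontology O: direct subclass links (child -> parent), from the slide.
SUBCLASS_OF = {"Lymphoma": "Cancer"}

# Data graph GD: (resource, class) typing statements; all names are hypothetical.
DATA = [("article_1", "Lymphoma"), ("article_2", "Cancer"), ("article_3", "Influenza")]


def superclasses(cls: str) -> set[str]:
    """Return cls together with all its ancestors in SUBCLASS_OF."""
    seen = {cls}
    while cls in SUBCLASS_OF and SUBCLASS_OF[cls] not in seen:
        cls = SUBCLASS_OF[cls]
        seen.add(cls)
    return seen


def instances_of(query_class: str) -> list[str]:
    """Answer the query 'x is a query_class' modulo the ontology."""
    return [res for res, cls in DATA if query_class in superclasses(cls)]


print(instances_of("Cancer"))    # resources typed Cancer or Lymphoma
print(instances_of("Lymphoma"))  # only resources typed Lymphoma
```

Without the ontology, a query for Cancer would miss `article_1`. With it, the more specific typing in the data still matches the more general query.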
33. COVID ON THE WEB
[Architecture diagram. Process data & derive “smarter” data: Covid-19 Open Research Dataset; Named Entities extractors; ACTA pipeline (extraction of argument graphs); RDF translator; LOD vocabularies & datasets; Covid-on-the-Web dataset. Means to exploit data: Open Data publication (Zenodo, Github, Virtuoso); Corese engine (query & infer); Corese portal (data browsing); MGExplorer (data visualization); ACTA Web application (visualization of argument graphs); Jupyter Notebook (Python, R & analytics). Biomedical research: Applications; Biomedical researchers & managers; Data analysts; …; interactions labeled “dereference, query, download” and “query, browse, analyze, make sense”]
[Menin, Winckler, Corby, Giboin, Faron, Cadorel, Tettamanzi et al. 2020]
[ISWC 2020, IC 2021]
34. Examples of questions from
• Can coronaviruses cause cancers?
• Do they belong to the family of oncogenic viruses?
• What are the sequelae of SARS-CoV-1, SARS-CoV-2 and MERS infections?
• Are the SARS-CoV-1 and MERS epidemics linked to the onset of cancers?
• Can the lesions caused by SARS-CoV-2 infection potentiate a tumoral transformation? (tissue evolution and susceptibility to developing cancers following fibrosis or other lesions)
• Which intracellular signaling pathways are activated by coronaviruses? During inflammation? What are the metabolic adaptations?
• Are there similarities with already known oncogenic viruses such as HPV, HBV, EBV, etc.? …
[Giboin, Faron et al. 2020]
41. mining interesting association rules
AI methods: clustering + community detection + dimensionality
reduction (auto-encoder) + Frequent Pattern Growth
• hidden patterns to enrich the dataset
• novel hypotheses for biomedical research
• error detection in the dataset
• relevant clusters & communities for navigation
[Cadorel, Tettamanzi]
[WI-IAT 2020]
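For readers unfamiliar with association rules, the sketch below shows what support and confidence mean on a toy example. It enumerates item pairs by brute force, which is much simpler than the Frequent Pattern Growth algorithm named above, and leaves out the clustering, community detection and auto-encoder steps. The transactions and the item names A to D are made up for illustration and carry no biomedical meaning.

```python
from itertools import combinations

# Hypothetical transactions: e.g. the set of entities annotated in each article.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "D"},
    {"B", "C"},
]


def support(itemset: frozenset) -> float:
    """Fraction of transactions containing every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def rules(min_support: float, min_confidence: float) -> list[tuple]:
    """Enumerate rules lhs -> rhs over single items by brute force."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            pair_support = support(frozenset({lhs, rhs}))
            if pair_support < min_support:
                continue
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support(frozenset({lhs}))
            if confidence >= min_confidence:
                found.append((lhs, rhs, pair_support, confidence))
    return found


print(rules(min_support=0.5, min_confidence=0.9))
```

On this toy data the only rule that passes both thresholds is C -> B: every transaction containing C also contains B, while the converse does not hold.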
44. CovidOnTheWeb access Stats
Period January-April 2021
• Total URI accesses: 48 735, i.e. an average of 393 URI accesses/day
• Total SPARQL queries: 49 036, i.e. an average of 395 queries/day
• 300 different agent types, the two major ones being Apache-Jena-ARQ (38 767) and Mozilla/4.0 (40 935)
Full dump of the dataset on Zenodo (end of April 2021)
• 61 unique downloads
• 954 unique views
[Michel, Gazzotti]
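The daily averages above can be re-derived from the totals with a short calculation. The slide does not state how many days of logs were counted; the value of 124 days used below is an assumption, chosen because it reproduces both stated averages.

```python
# Totals from the slide.
uri_accesses = 48_735
sparql_queries = 49_036

# Assumption: number of logged days (not given on the slide).
days = 124

avg_uri = round(uri_accesses / days)
avg_sparql = round(sparql_queries / days)
print(avg_uri, avg_sparql)  # 393 395
```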
47. MULTI-DISCIPLINARY TEAM
40~50 members
~15 nationalities
1 DR, 4 Professors
3 CR, 3 Assistant professors
DR/Professors:
Fabien GANDON, Inria, AI, KRR, Semantic Web, Social Web, K. Graphs
Nhan LE THANH, UCA, Logics, KR, Emotions, Workflows, K. Graphs
Peter SANDER, UCA, Web, Emotions
Andrea TETTAMANZI, UCA, AI, Logics, Evo, Learning, Agents, K. Graphs
Marco WINCKLER, UCA, Human-Computer Interaction, Web, K. Graphs
CR/Assistant Professors:
Michel BUFFA, UCA, Web, Social Media, K. Graphs
Elena CABRIO, UCA, NLP, KR, Linguistics, Q&A, Text Mining, K. Graphs
Olivier CORBY, Inria, KR, AI, Sem. Web, Programming, K. Graphs
Catherine FARON-ZUCKER, UCA, KR, AI, Semantic Web, K. Graphs
Damien GRAUX, Inria, Linked Data, Sem. Web, Querying, K. Graphs
Serena VILLATA, CNRS, AI, Argumentation, Licenses, Rights, K. Graphs
Research engineer: Franck MICHEL, CNRS, Linked Data, Integration, DB, K. Graphs
External:
Andrei CIORTEA (University of St. Gallen), Agents, WoT, Sem. Web, K. Graphs
Nicolas DELAFORGE (Mnemotix), Sem. Web, KM, Integration, K. Graphs
Alain GIBOIN (Retired CR Inria), Interaction Design, KE, User & Task, K. Graphs
Freddy LECUE (Thales, Montreal), AI, Logics, Mining, Big Data, S. Web, K. Graphs
48. DBPEDIA.FR
180 000 000 arcs in an encyclopedic knowledge graph
number of queries per day: 70 000 on average, 2.5 million max
185 377 686 RDF triples extracted and mapped
public dumps, endpoints, interfaces, APIs…
49. ONTOLOGY FOR AI ITSELF
AI4EU
ontology and metadata of AI resources
SHACL to validate these RDF graphs
online endpoint http://corese.inria.fr
predefined SPARQL queries, SHACL shapes, display
[Corby et al., 2019]
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement 825619.
50. PREDICT HOSPITALIZATION
Predict hospitalization from physician's records classification
Augment records data with Web knowledge graphs
Study impact on prediction
[Gazzotti, Faron, Gandon et al. 2020]
PRIMEGE
Sample record:
• Sex: H (male)
• Date: 25/04/2012
• Cause: vaccin-antitétanique (tetanus vaccination)
• CISP2: A44
• ...
• History: Appendicite (appendicitis)
• Observations: EN CP - Bon état général - auscult pulm libre; bdc rég sans souffle - tympans ok-
Element: Number
• Patients: 55 823
• Consultations: 364 684
• Past medical history: 187 290
• Biometric data: 293 908
• Semiotics: 250 669
• Diagnosis: 117 442
• Row of prescribed drugs: 847 422
• Symptoms: 23 488
• Health care procedures: 11 850
• Additional examination: 871 590
• Paramedical prescription: 17 222
• Observations/notes: 56 143
51. MonaLIA
Joconde database from French museums: 350 000 images of artworks, RDF metadata based on external thesauri
• reason & query on RDF to build training sets
• transfer learning & CNN classifiers on targeted categories (topics, techniques, etc.)
• reason & query RDF of results to address silence, noise and explain
Examples (Image, Metadata, Score):
• Reference: 50350012455
  Image: C:Jocondejoconde0138m503501_d0012455-000_p.jpg
  Metadata: portrait
  Score: cheval (horse): 0.999
• Reference: 000SC022652
  Image: C:/Joconde/joconde0355/m079806_bsa0030101_p.jpg
  Metadata: figure (saint Eloi de Noyon, évêque, en pied, bénédiction, vêtement liturgique, mitre, attribut, cheval, marteau, outil : ferronnerie) [in English: figure (Saint Eligius of Noyon, bishop, full-length, blessing, liturgical vestment, mitre, attribute, horse, hammer, tool: ironwork)]
  Score: cheval (horse): 0.006
[Bobasheva, Gandon, Precioso, 2021]