SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Downloaden Sie, um offline zu lesen
Web Archives are Big Data
• Processing requires computing clusters
• i.e., Hadoop, YARN, Spark, …
• Web archive data is heterogeneous, may include text, video, images, …
• Requires cleaning, filtering, selection, extraction and finally, processing
1
Source: Yahoo!
• Distributed computing
• MapReduce or variants
• Transform, aggregate, write
• Homogeneous data
• Well-defined format, clean
• Easy to split and handle
23/05/2016 Helge Holzmann (holzmann@L3S.de)
Web Archives as Scholarly Source
• Web archive data is heterogeneous, may include text, video, images, …
• Requires cleaning, filtering, selection, extraction and finally, processing
• Data is temporal, time differences within a website, among websites
• A resource may have been crawled and be included multiple times
• With no or only slight changes, layout changes, content changes, …
• Users are often from various disciplines, not only computer scientists
• Social scientists, political scientists, historians, …
• Want to express their needs, define sub-sets, create research corpora
• Require a format they know, which is readable and reusable
• Technical persons can support, but need the tools
2
23/05/2016 Helge Holzmann (holzmann@L3S.de)
ArchiveSpark:
Efficient Web Archive
Access, Extraction and Derivation
Helge Holzmann, Vinay Goel, Avishek Anand
To appear at JCDL 2016 (please cite)
https://github.com/helgeho/ArchiveSpark
23/05/2016 Helge Holzmann (holzmann@L3S.de) 3
Overview
• Goal: Easy research corpus building from Web archives
• Framework based on Apache Spark [1]
• Supports efficient filter / selection operation
• Supports derivations by applying third-party libraries
• Output in a pretty and machine readable format
• Identified six objectives based on practical requirements
• Benchmarked against pure Spark and HBase
• Both using WarcBase helpers [2]
4
[1] http://spark.apache.org
[2] https://github.com/lintool/warcbase
23/05/2016 Helge Holzmann (holzmann@L3S.de)
Example Use Case
• Political scientist wants to analyze sentiments and reactions on the
Web from a previous election cycle.
• Five decisions to be taken to narrow down the dataset:
1. Filter temporally to focus on the election period
2. Select text documents by filtering on MIME type
3. Only keep online captures with HTTP status code 200
4. Look for political signal terms in the content to get rid of unrelated pages
5. Choose a captured version, for instance the latest of each page
• Finally, extract relevant terms / snippets to analyze sentiments
• Document lineage, e.g., the title might have more value than the body text
5
23/05/2016 Helge Holzmann (holzmann@L3S.de)
Objectives
1. A simple and expressive interface
2. Compliance to and reuse of standard formats
3. An efficient selection and filtering process
4. An easily extensible architecture
5. Lineage support to comprehend and reconstruct derivations
6. Output in a standard, readable and reusable format
6
23/05/2016 Helge Holzmann (holzmann@L3S.de)
O1: Simple and expressive interface
• Based on Spark
• Expressing instructions in Scala
• Simple accessors provided
• Easy extraction / enrichment mechanisms
7
val rdd = ArchiveSpark.hdfs(cdxPath, warcPath)
val onlineHtml = rdd.filter(r => r.status == 200 && r.mime == "text/html")
val entities = onlineHtml.enrich(Entities)
entities.saveAsJson("entities.gz")
23/05/2016 Helge Holzmann (holzmann@L3S.de)
O2: Standard formats
• (W)ARC and CDX widely used among the big Web archives
• (Web) ARChive files (ISO 28500:2009)
• Crawl index
• No additional pre-processing / indexing required
8
23/05/2016 Helge Holzmann (holzmann@L3S.de)
O3: Efficient selection and filtering
• Initial filtering purely based on meta data (CDX)
• without touching the archive
• Seamless integration of contents / extractions / enrichments
9
val Title = HtmlText.of(Html.first("title"))
val filtered = rdd.filter(r => r.status == 200 && r.mime == "text/html")
val corpus = filtered.enrich(Title)
.filter(r => r.value(Title).get.contains("science"))
23/05/2016 Helge Holzmann (holzmann@L3S.de)
O4: Easily extensible architecture
• Enrich functions allow integrating any Java/Scala code/library
• with a unified interface
• Base classes provided
• Implementations need to define the following only:
1. Default dependency enrich function (e.g., StringContent)
2. Function body (Java/Scala code / library calls, e.g., Named Entity Extractor)
3. Result fields (e.g., persons, organizations, locations)
10
rdd.mapEnrich(StringContent, "length") {str => str.length}
Dependency Result field Body
23/05/2016 Helge Holzmann (holzmann@L3S.de)
O5: Lineage documentation
• Nested JSON objects represent enrichments / derivations
11
Record (meta data)
Payload
String
Length
23/05/2016 Helge Holzmann (holzmann@L3S.de)
O6: Readable / reusable output
• Filtered, cleaned, enriched JSON vs. raw (W)ARC
• widely used, structured, pretty printable, reusable
12
vs.
23/05/2016 Helge Holzmann (holzmann@L3S.de)
The ArchiveSpark Approach
• Filter as much as possible on meta data before touching the archive
• Enrich instead of mapping / transforming data
13
23/05/2016 Helge Holzmann (holzmann@L3S.de)
Benchmarks
• Three scenarios, from basic to more sophisticated:
a) Select one particular URL
b) Select all pages (MIME type text/html) under a specific domain
c) Select the latest successful capture (HTTP status 200) in a specific month
• Benchmarks do not include derivations
• Those are applied on top of all three methods and involve third-party libraries
14
23/05/2016 Helge Holzmann (holzmann@L3S.de)
Conclusion
• Expressive and efficient corpus creation from Web archives
• Easily extensible
• Please provide enrich functions of your own tools / third-party libraries
• Open source
• Fork us on GitHub: https://github.com/helgeho/ArchiveSpark
• Star, contribute, fix, spread, get involved!
• Learn more at our hands-on session, today at 3:30 pm.
15
23/05/2016 Helge Holzmann (holzmann@L3S.de)
Thank you!
23/05/2016 Helge Holzmann (holzmann@L3S.de)
16
https://github.com/helgeho/ArchiveSpark

Weitere ähnliche Inhalte

Was ist angesagt?

Building Linked Data Applications
Building Linked Data ApplicationsBuilding Linked Data Applications
Building Linked Data ApplicationsEUCLID project
 
How can library materials be ranked in the OPAC?
How can library materials be ranked in the OPAC?How can library materials be ranked in the OPAC?
How can library materials be ranked in the OPAC?Dirk Lewandowski
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageOntotext
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked DataEUCLID project
 
Annotating Scholarly Works - the W3C Open Annotation Model
Annotating Scholarly Works - the W3C Open Annotation ModelAnnotating Scholarly Works - the W3C Open Annotation Model
Annotating Scholarly Works - the W3C Open Annotation ModelRobert Sanderson
 
Union catalogandknowledge engineering for teldap
Union catalogandknowledge engineering for teldapUnion catalogandknowledge engineering for teldap
Union catalogandknowledge engineering for teldapAAT Taiwan
 
File organization
File organizationFile organization
File organizationGokul017
 
File Organization
File OrganizationFile Organization
File OrganizationManyi Man
 
Fundamental File Processing Operations
Fundamental File Processing OperationsFundamental File Processing Operations
Fundamental File Processing OperationsRico
 

Was ist angesagt? (13)

Building Linked Data Applications
Building Linked Data ApplicationsBuilding Linked Data Applications
Building Linked Data Applications
 
Thompson 6-jun15-final
Thompson 6-jun15-finalThompson 6-jun15-final
Thompson 6-jun15-final
 
How can library materials be ranked in the OPAC?
How can library materials be ranked in the OPAC?How can library materials be ranked in the OPAC?
How can library materials be ranked in the OPAC?
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked Data
 
File organisation
File organisationFile organisation
File organisation
 
Annotating Scholarly Works - the W3C Open Annotation Model
Annotating Scholarly Works - the W3C Open Annotation ModelAnnotating Scholarly Works - the W3C Open Annotation Model
Annotating Scholarly Works - the W3C Open Annotation Model
 
File organization
File organizationFile organization
File organization
 
Union catalogandknowledge engineering for teldap
Union catalogandknowledge engineering for teldapUnion catalogandknowledge engineering for teldap
Union catalogandknowledge engineering for teldap
 
File organization
File organizationFile organization
File organization
 
File Organization
File OrganizationFile Organization
File Organization
 
Introduction to HDF5 Data Model, Programming Model and Library APIs
Introduction to HDF5 Data Model, Programming Model and Library APIsIntroduction to HDF5 Data Model, Programming Model and Library APIs
Introduction to HDF5 Data Model, Programming Model and Library APIs
 
Fundamental File Processing Operations
Fundamental File Processing OperationsFundamental File Processing Operations
Fundamental File Processing Operations
 

Ähnlich wie ArchiveSpark Introduction @ WebSci' 2016 Hackathon

Medical Heritage Library (MHL) on ArchiveSpark
Medical Heritage Library (MHL) on ArchiveSparkMedical Heritage Library (MHL) on ArchiveSpark
Medical Heritage Library (MHL) on ArchiveSparkHelge Holzmann
 
Web Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web ArchivesWeb Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web ArchivesHelge Holzmann
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!gagravarr
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache StanbolAlkuvoima
 
Web Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web ArchivesWeb Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web ArchivesHelge Holzmann
 
ontology.ppt
ontology.pptontology.ppt
ontology.pptPrerak10
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldMilo Yip
 
ArchiveSpark at CEDWARC workshop 2019
ArchiveSpark at CEDWARC workshop 2019ArchiveSpark at CEDWARC workshop 2019
ArchiveSpark at CEDWARC workshop 2019Helge Holzmann
 
Who says you can't do records management in SharePoint?
Who says you can't do records management in SharePoint?Who says you can't do records management in SharePoint?
Who says you can't do records management in SharePoint?John F. Holliday
 
Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyRobert Viseur
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAnkur Biswas
 
Structured Strategy: How to Supercharge Your Content Analysis with XML and XPath
Structured Strategy: How to Supercharge Your Content Analysis with XML and XPathStructured Strategy: How to Supercharge Your Content Analysis with XML and XPath
Structured Strategy: How to Supercharge Your Content Analysis with XML and XPathJosh Anderson
 
Bringing semantic publishing into TEI: ideas and pointers
Bringing semantic publishing into TEI: ideas and pointersBringing semantic publishing into TEI: ideas and pointers
Bringing semantic publishing into TEI: ideas and pointersUniversity of Bologna
 
Reengineering PDF-Based Documents Targeting Complex Software Specifications
Reengineering PDF-Based Documents Targeting Complex Software SpecificationsReengineering PDF-Based Documents Targeting Complex Software Specifications
Reengineering PDF-Based Documents Targeting Complex Software SpecificationsMoutasm Tamimi
 
From Millennium ERMS to Proquest 360 Resource Manager
From Millennium ERMS to Proquest 360 Resource ManagerFrom Millennium ERMS to Proquest 360 Resource Manager
From Millennium ERMS to Proquest 360 Resource ManagerRindra Ramli
 
Putting Historical Data in Context: how to use DSpace-GLAM
Putting Historical Data in Context: how to use DSpace-GLAMPutting Historical Data in Context: how to use DSpace-GLAM
Putting Historical Data in Context: how to use DSpace-GLAM4Science
 
Data Archiving and Sharing
Data Archiving and SharingData Archiving and Sharing
Data Archiving and SharingC. Tobin Magle
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
IA& Taxonomy Planning for SharePoint Online & Office 365
IA& Taxonomy Planning for SharePoint Online & Office 365IA& Taxonomy Planning for SharePoint Online & Office 365
IA& Taxonomy Planning for SharePoint Online & Office 365DocFluix, LLC
 

Ähnlich wie ArchiveSpark Introduction @ WebSci' 2016 Hackathon (20)

Medical Heritage Library (MHL) on ArchiveSpark
Medical Heritage Library (MHL) on ArchiveSparkMedical Heritage Library (MHL) on ArchiveSpark
Medical Heritage Library (MHL) on ArchiveSpark
 
Web Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web ArchivesWeb Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web Archives
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
 
Web Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web ArchivesWeb Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web Archives
 
Ld4 l triannon
Ld4 l triannonLd4 l triannon
Ld4 l triannon
 
ontology.ppt
ontology.pptontology.ppt
ontology.ppt
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
 
ArchiveSpark at CEDWARC workshop 2019
ArchiveSpark at CEDWARC workshop 2019ArchiveSpark at CEDWARC workshop 2019
ArchiveSpark at CEDWARC workshop 2019
 
Who says you can't do records management in SharePoint?
Who says you can't do records management in SharePoint?Who says you can't do records management in SharePoint?
Who says you can't do records management in SharePoint?
 
Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technology
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web Technology
 
Structured Strategy: How to Supercharge Your Content Analysis with XML and XPath
Structured Strategy: How to Supercharge Your Content Analysis with XML and XPathStructured Strategy: How to Supercharge Your Content Analysis with XML and XPath
Structured Strategy: How to Supercharge Your Content Analysis with XML and XPath
 
Bringing semantic publishing into TEI: ideas and pointers
Bringing semantic publishing into TEI: ideas and pointersBringing semantic publishing into TEI: ideas and pointers
Bringing semantic publishing into TEI: ideas and pointers
 
Reengineering PDF-Based Documents Targeting Complex Software Specifications
Reengineering PDF-Based Documents Targeting Complex Software SpecificationsReengineering PDF-Based Documents Targeting Complex Software Specifications
Reengineering PDF-Based Documents Targeting Complex Software Specifications
 
From Millennium ERMS to Proquest 360 Resource Manager
From Millennium ERMS to Proquest 360 Resource ManagerFrom Millennium ERMS to Proquest 360 Resource Manager
From Millennium ERMS to Proquest 360 Resource Manager
 
Putting Historical Data in Context: how to use DSpace-GLAM
Putting Historical Data in Context: how to use DSpace-GLAMPutting Historical Data in Context: how to use DSpace-GLAM
Putting Historical Data in Context: how to use DSpace-GLAM
 
Data Archiving and Sharing
Data Archiving and SharingData Archiving and Sharing
Data Archiving and Sharing
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
IA& Taxonomy Planning for SharePoint Online & Office 365
IA& Taxonomy Planning for SharePoint Online & Office 365IA& Taxonomy Planning for SharePoint Online & Office 365
IA& Taxonomy Planning for SharePoint Online & Office 365
 

Kürzlich hochgeladen

Enterprise Document Management System - Qualityze Inc
Enterprise Document Management System - Qualityze IncEnterprise Document Management System - Qualityze Inc
Enterprise Document Management System - Qualityze Incrobinwilliams8624
 
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine HarmonyLeveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmonyelliciumsolutionspun
 
Watermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security ChallengesWatermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security ChallengesShyamsundar Das
 
Generative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilGenerative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilVICTOR MAESTRE RAMIREZ
 
Streamlining Your Application Builds with Cloud Native Buildpacks
Streamlining Your Application Builds  with Cloud Native BuildpacksStreamlining Your Application Builds  with Cloud Native Buildpacks
Streamlining Your Application Builds with Cloud Native BuildpacksVish Abrams
 
Cybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and BadCybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and BadIvo Andreev
 
Introduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptxIntroduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptxIntelliSource Technologies
 
Top Software Development Trends in 2024
Top Software Development Trends in  2024Top Software Development Trends in  2024
Top Software Development Trends in 2024Mind IT Systems
 
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdfARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdfTobias Schneck
 
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.Sharon Liu
 
Growing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesGrowing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesSoftwareMill
 
OpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorOpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorShane Coughlan
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLAlluxio, Inc.
 
Deep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - DatacampDeep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - DatacampVICTOR MAESTRE RAMIREZ
 
Your Vision, Our Expertise: TECUNIQUE's Tailored Software Teams
Your Vision, Our Expertise: TECUNIQUE's Tailored Software TeamsYour Vision, Our Expertise: TECUNIQUE's Tailored Software Teams
Your Vision, Our Expertise: TECUNIQUE's Tailored Software TeamsJaydeep Chhasatia
 
Fields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptxFields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptxJoão Esperancinha
 
eAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionseAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionsNirav Modi
 
Kawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies
 
JS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIJS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIIvo Andreev
 

Kürzlich hochgeladen (20)

Enterprise Document Management System - Qualityze Inc
Enterprise Document Management System - Qualityze IncEnterprise Document Management System - Qualityze Inc
Enterprise Document Management System - Qualityze Inc
 
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine HarmonyLeveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
 
Watermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security ChallengesWatermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security Challenges
 
Generative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilGenerative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-Council
 
Streamlining Your Application Builds with Cloud Native Buildpacks
Streamlining Your Application Builds  with Cloud Native BuildpacksStreamlining Your Application Builds  with Cloud Native Buildpacks
Streamlining Your Application Builds with Cloud Native Buildpacks
 
Cybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and BadCybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and Bad
 
Introduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptxIntroduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptx
 
Top Software Development Trends in 2024
Top Software Development Trends in  2024Top Software Development Trends in  2024
Top Software Development Trends in 2024
 
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdfARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
 
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
20240319 Car Simulator Plan.pptx . Plan for a JavaScript Car Driving Simulator.
 
Growing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesGrowing Oxen: channel operators and retries
Growing Oxen: channel operators and retries
 
OpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorOpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS Calculator
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Deep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - DatacampDeep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - Datacamp
 
Salesforce AI Associate Certification.pptx
Salesforce AI Associate Certification.pptxSalesforce AI Associate Certification.pptx
Salesforce AI Associate Certification.pptx
 
Your Vision, Our Expertise: TECUNIQUE's Tailored Software Teams
Your Vision, Our Expertise: TECUNIQUE's Tailored Software TeamsYour Vision, Our Expertise: TECUNIQUE's Tailored Software Teams
Your Vision, Our Expertise: TECUNIQUE's Tailored Software Teams
 
Fields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptxFields in Java and Kotlin and what to expect.pptx
Fields in Java and Kotlin and what to expect.pptx
 
eAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionseAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspections
 
Kawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in Trivandrum
 
JS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIJS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AI
 

ArchiveSpark Introduction @ WebSci' 2016 Hackathon

  • 1. Web Archives are Big Data • Processing requires computing clusters • i.e., Hadoop, YARN, Spark, … • Web archive data is heterogeneous, may include text, video, images, … • Requires cleaning, filtering, selection, extraction and finally, processing 1 Source: Yahoo! • Distributed computing • MapReduce or variants • Transform, aggregate, write • Homogeneous data • Well-defined format, clean • Easy to split and handle 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 2. Web Archives as Scholarly Source • Web archive data is heterogeneous, may include text, video, images, … • Requires cleaning, filtering, selection, extraction and finally, processing • Data is temporal, time differences within a website, among websites • A resource may have been crawled and be included multiple times • With no or only slight changes, layout changes, content changes, … • Users are often from various disciplines, not only computer scientists • Social scientists, political scientists, historians, … • Want to express their needs, define sub-sets, create research corpora • Require a format they know, which is readable and reusable • Technical persons can support, but need the tools 2 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 3. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation Helge Holzmann, Vinay Goel, Avishek Anand To appear at JCDL 2016 (please cite) https://github.com/helgeho/ArchiveSpark 23/05/2016 Helge Holzmann (holzmann@L3S.de) 3
  • 4. Overview • Goal: Easy research corpus building from Web archives • Framework based on Apache Spark [1] • Supports efficient filter / selection operation • Supports derivations by applying third-party libraries • Output in a pretty and machine readable format • Identified six objectives based on practical requirements • Benchmarked against pure Spark and HBase • Both using WarcBase helpers [2] 4 [1] http://spark.apache.org [2] https://github.com/lintool/warcbase 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 5. Example Use Case • Political scientist wants to analyze sentiments and reactions on the Web from a previous election cycle. • Five decisions to be taken to narrow down the dataset: 1. Filter temporally to focus on the election period 2. Select text documents by filtering on MIME type 3. Only keep online captures with HTTP status code 200 4. Look for political signal terms in the content to get rid of unrelated pages 5. Choose a captured version, for instance the latest of each page • Finally, extract relevant terms / snippets to analyze sentiments • Document lineage, e.g., the title might have more value than the body text 5 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 6. Objectives 1. A simple and expressive interface 2. Compliance to and reuse of standard formats 3. An efficient selection and filtering process 4. An easily extensible architecture 5. Lineage support to comprehend and reconstruct derivations 6. Output in a standard, readable and reusable format 6 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 7. O1: Simple and expressive interface • Based on Spark • Expressing instructions in Scala • Simple accessors provided • Easy extraction / enrichment mechanisms 7 val rdd = ArchiveSpark.hdfs(cdxPath, warcPath) val onlineHtml = rdd.filter(r => r.status == 200 && r.mime == "text/html") val entities = onlineHtml.enrich(Entities) entities.saveAsJson("entities.gz") 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 8. O2: Standard formats • (W)ARC and CDX widely used among the big Web archives • (Web) ARChive files (ISO 28500:2009) • Crawl index • No additional pre-processing / indexing required 8 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 9. O3: Efficient selection and filtering • Initial filtering purely based on meta data (CDX) • without touching the archive • Seamless integration of contents / extractions / enrichments 9 val Title = HtmlText.of(Html.first("title")) val filtered = rdd.filter(r => r.status == 200 && r.mime == "text/html") val corpus = filtered.enrich(Title) .filter(r => r.value(Title).get.contains("science")) 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 10. O4: Easily extensible architecture • Enrich functions allow integrating any Java/Scala code/library • with a unified interface • Base classes provided • Implementations need to define the following only: 1. Default dependency enrich function (e.g., StringContent) 2. Function body (Java/Scala code / library calls, e.g., Named Entity Extractor) 3. Result fields (e.g., persons, organizations, locations) 10 rdd.mapEnrich(StringContent, "length") {str => str.length} Dependency Result field Body 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 11. O5: Lineage documentation • Nested JSON objects represent enrichments / derivations 11 Record (meta data) Payload String Length 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 12. O6: Readable / reusable output • Filtered, cleaned, enriched JSON vs. raw (W)ARC • widely used, structured, pretty printable, reusable 12 vs. 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 13. The ArchiveSpark Approach • Filter as much as possible on meta data before touching the archive • Enrich instead of mapping / transforming data 13 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 14. Benchmarks • Three scenarios, from basic to more sophisticated: a) Select one particular URL b) Select all pages (MIME type text/html) under a specific domain c) Select the latest successful capture (HTTP status 200) in a specific month • Benchmarks do not include derivations • Those are applied on top of all three methods and involve third-party libraries 14 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 15. Conclusion • Expressive and efficient corpus creation from Web archives • Easily extensible • Please provide enrich functions of your own tools / third-party libraries • Open source • Fork us on GitHub: https://github.com/helgeho/ArchiveSpark • Star, contribute, fix, spread, get involved! • Learn more at our hands-on session, today at 3:30 pm. 15 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 16. Thank you! 23/05/2016 Helge Holzmann (holzmann@L3S.de) 16 https://github.com/helgeho/ArchiveSpark