Content Extraction with Apache Tika
     Jukka Zitting | Tika committer, co-author of Tika in Action




© 2012 Adobe Systems Incorporated. All Rights Reserved.
Content Extraction with Apache Tika

    Introduction to Apache Tika
    Full text extraction with Tika
    Tika and Solr - the ExtractingRequestHandler
    Tika and Lucene - direct feeding of the index
         forked parsing
         link extraction




Introduction to Apache Tika
                                                          section 1 / 4




Introduction to Apache Tika




                          The Apache Tika™ toolkit
                          - detects and extracts
                          - metadata and structured text content
                          - from various documents
                          - using existing parser libraries.

Problem domain




The Tika solution



    Document
         Content
             It is a truth universally acknowledged, that a single man in
             possession of a good fortune, must be in want of a wife...
         Metadata
             dc:title=Pride and Prejudice
             dc:creator=Jane Austen
             dc:date=1813




Project background

    Brief history
         2007   Tika started in the Apache Incubator
         2008   Tika graduates into a Lucene subproject
         2010   Tika becomes a standalone TLP
         2011   Tika 1.0 released
         2011   Tika in Action published
    Latest release is Apache Tika 1.2
         thousands of known media types
             most with associated type detection patterns
         dozens of supported document formats
             including all major office formats
         basic language detection
         etc.
    For more information: http://tika.apache.org/


Ohloh summary (http://www.ohloh.net/p/tika)




Full text extraction with Tika
                                                          section 2 / 4




Demo: tika-app-1.2.jar

    https://github.com/jukka/tika-demo
    java -jar tika-app-1.2.jar




tika-app as a command line tool

$ java -jar tika-app-1.2.jar --xhtml /path/to/document.doc

$ java -jar tika-app-1.2.jar --text http://example.com/document

$ java -jar tika-app-1.2.jar --metadata < document.doc

$ cat document.doc | java -jar tika-app-1.2.jar --text | grep foo

$ java -jar tika-app-1.2.jar --help



Tika’s Java API

    Divided into two layers
         The Tika facade: org.apache.tika.Tika
         Lower-level interfaces like Parser, Detector, etc.

    Use the Tika facade by default
         Provides simple support for most common use cases
         Example: new Tika().parseToString(new File("/path/to/document.doc"))

    Use the lower-level interfaces for more power or flexibility
         Allows fine-grained control of Tika functionality
         More complicated programming model
             Parsed content handled as XHTML SAX events
         Not all functionality is exposed through the Tika facade
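
The "XHTML SAX events" model can be tried out without Tika on the classpath. The sketch below is JDK-only: the JDK's own XML parser stands in for a Tika parser and feeds events to a handler that keeps just the text inside <body>. The class name BodyTextSketch and its extract() helper are made up for this illustration and are not part of Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the character data found inside <body>. */
final class BodyTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than 0 while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start counting at <body>, then track nesting of everything below it.
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    /** Parses a well-formed XHTML string and returns the body text. */
    static String extract(String xhtml) {
        BodyTextSketch handler = new BodyTextSketch();
        try {
            // Assumption: the JDK SAX parser plays the role of a Tika parser here.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.text.toString();
    }
}
```

With Tika, a handler like this would be passed to a parser's parse() call instead of the JDK parser; the event callbacks stay the same.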




Tika and Solr - the ExtractingRequestHandler
                                                          section 3 / 4




ExtractingRequestHandler

    aka Solr Cell
    http://wiki.apache.org/solr/ExtractingRequestHandler

     “Solr's ExtractingRequestHandler uses Tika
     to allow users to upload binary files to Solr and
     have Solr extract text from it and then index it.”

    For example:
     $ curl "http://localhost:8983/solr/update/extract?literal.id=document&commit=true" -F "file=@document.doc"

    Supports both text and metadata extraction
         with plenty of configurable options
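
When the request is assembled in code rather than typed into curl, the query string has to be built and encoded by hand. A minimal JDK-only sketch, assuming the same endpoint and parameters as the curl example; ExtractUrlSketch and build() are hypothetical names, and the multipart file upload itself is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

/** Hypothetical helper: appends URL-encoded request parameters to a base URL. */
final class ExtractUrlSketch {
    static String build(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(separator)
               .append(encode(param.getKey()))
               .append('=')
               .append(encode(param.getValue()));
            separator = '&'; // every parameter after the first is joined with '&'
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

Passing an insertion-ordered map (for example a LinkedHashMap) with literal.id=document and commit=true reproduces the URL used in the curl command.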




ExtractingRequestHandler parameters

    Helping Tika do its job
         resource.name=document.doc - Helps Tika’s automatic type detection
         resource.password=secret - Allows Tika to read encrypted documents
         passwordsFile=/path/to/password-file - Resource name to password mappings
             for example: .*\.pdf$ = pdf-secret
    Capturing special content
         xpath=//a - Capture only content inside elements that match the specified query
         capture=h1 - Capture content inside specific elements to a separate field
         captureAttr=true - Capture attributes into separate fields named after the element
    Mapping field names
         lowernames=true - Normalize metadata field names to “content_type”, etc.
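
The resource-name-to-password mapping can be pictured as an ordered list of regular expressions. The sketch below is only an illustration of that idea, not Solr's implementation: PasswordLookupSketch is a hypothetical class, and both "first matching rule wins" and "the pattern must match the whole name" are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical lookup table from resource-name patterns to passwords. */
final class PasswordLookupSketch {
    private final List<Pattern> patterns = new ArrayList<>();
    private final List<String> passwords = new ArrayList<>();

    /** Registers one "regex = password" rule; rules are tried in insertion order. */
    void addRule(String regex, String password) {
        patterns.add(Pattern.compile(regex));
        passwords.add(password);
    }

    /** Returns the password of the first rule matching the whole name, or null. */
    String lookup(String resourceName) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(resourceName).matches()) {
                return passwords.get(i);
            }
        }
        return null; // no rule applies: treat the document as unencrypted
    }
}
```

With the rule .*\.pdf$ = pdf-secret registered, lookup("report.pdf") yields pdf-secret while lookup("report.doc") yields null.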

Tika and Lucene - direct feeding of the index
                                                          section 4 / 4




Using the Tika facade to feed Lucene

// Needs org.apache.tika.Tika, org.apache.tika.metadata.Metadata, java.io.*
// and the Lucene 4.x field classes (TextField, StringField, Field)

// Index the first part of the document (parseToString() stops at the max string length)
String text = new Tika().parseToString(new File("/path/to/document.doc"));
document.add(new TextField("text", text, Field.Store.NO));


// Index the full document
Reader reader = new Tika().parse(new File("/path/to/document.doc"));
document.add(new TextField("text", reader));


// Also index some metadata
Metadata metadata = new Metadata();
Reader metadataReader = new Tika().parse(
   new FileInputStream("/path/to/document.doc"), metadata);
document.add(new TextField("text", metadataReader));
document.add(new StringField("type",
   metadata.get(Metadata.CONTENT_TYPE), Field.Store.YES));

Things to consider

    What if the document is larger than your memory?
         Index only first N bytes/characters?
         WriteOutContentHandler supports an explicit write limit
             Enabled by default in the Tika facade, see get/setMaxStringLength()

    What if the document is malformed or intentionally broken?
         Could cause denial of service problems
             Might even crash the entire JVM due to bugs in native libraries in the JDK!
         SecureContentHandler monitors parsing and terminates it if things look bad
             Enabled by default in the Tika facade

    Ultimate solution: forked parsing and the Tika server
         Parse documents in separate, sandboxed JVM processes
         A document could fail to parse, but your application won’t crash
         Code is already there, but still a bit tricky to set up
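
The write limit from the first point above can be sketched with a plain SAX handler that stops the parse once enough characters have been collected. This is a JDK-only illustration in the spirit of WriteOutContentHandler, not Tika's code: LimitedTextSketch and LimitReached are hypothetical names, and the JDK XML parser stands in for a Tika parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: keeps at most 'limit' characters, then aborts the parse. */
final class LimitedTextSketch extends DefaultHandler {
    /** Signals that the limit was hit; not an error from the caller's point of view. */
    static final class LimitReached extends SAXException {
        LimitReached(int limit) {
            super("write limit of " + limit + " characters reached");
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final int limit;

    LimitedTextSketch(int limit) {
        this.limit = limit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int room = limit - text.length();
        if (length > room) {
            text.append(ch, start, room); // keep what still fits
            throw new LimitReached(limit); // and stop the parser early
        }
        text.append(ch, start, length);
    }

    /** Returns at most 'limit' characters of text from a well-formed XML string. */
    static String extract(String xml, int limit) {
        LimitedTextSketch handler = new LimitedTextSketch(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (LimitReached e) {
            // Expected: the truncated text collected so far is the result.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.text.toString();
    }
}
```

Aborting with an exception means the rest of a huge document is never read, so memory use is bounded by the limit rather than by the document size.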

Link extraction for web crawlers

    The LinkContentHandler class can be used to extract all links from a document
         Also works with links in things like PDF, MS Word and email documents
         Use TeeContentHandler to combine with other ways of capturing content

// for example
LinkContentHandler lch = new LinkContentHandler();
BodyContentHandler bch = new BodyContentHandler();
new Tika().getParser().parse(..., new TeeContentHandler(lch, bch), ...);

System.out.println("Content: " + bch);
for (Link link : lch.getLinks()) {
   System.out.println("Link: " + link);
}
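
To see how the tee wiring behaves without Tika on the classpath, the same pattern can be rebuilt from JDK classes. Everything in this sketch is a stand-in: LinkCollector, TextCollector and Tee are hypothetical, simplified counterparts of LinkContentHandler, BodyContentHandler and TeeContentHandler (the real classes do more), and the JDK XML parser replaces the Tika parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class LinkTeeSketch {
    /** Collects the href value of every <a> element. */
    static final class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            String href = "a".equals(qName) ? atts.getValue("href") : null;
            if (href != null) {
                links.add(href);
            }
        }
    }

    /** Collects all character data. */
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Forwards element and character events to every wrapped handler. */
    static final class Tee extends DefaultHandler {
        private final ContentHandler[] handlers;

        Tee(ContentHandler... handlers) {
            this.handlers = handlers;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            for (ContentHandler h : handlers) h.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            for (ContentHandler h : handlers) h.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (ContentHandler h : handlers) h.characters(ch, start, length);
        }
    }

    /** Feeds a well-formed XHTML string to the handler (stand-in for a Tika parser). */
    static void parse(String xhtml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A single parse pass fills both collectors, which is the point of the tee: the document is read once even though several kinds of content are captured.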


Questions?
                                                          http://tika.apache.org/




