Content Extraction with Apache Tika
Jukka Zitting
1.
Content Extraction with Apache Tika
Jukka Zitting | Tika committer, co-author of Tika in Action
© 2012 Adobe Systems Incorporated. All Rights Reserved.
2.
Content Extraction with Apache Tika
- Introduction to Apache Tika
- Full text extraction with Tika
- Tika and Solr - the ExtractingRequestHandler
- Tika and Lucene - direct feeding of the index
  - forked parsing
  - link extraction
3.
Introduction to Apache Tika
section 1 / 4
4.
Introduction to Apache Tika
The Apache Tika™ toolkit
- detects and extracts
- metadata and structured text content
- from various documents
- using existing parser libraries.
5.
Problem domain
6.
The Tika solution
Document
- Content: It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife...
- Metadata: dc:title=Pride and Prejudice, dc:creator=Jane Austen, dc:date=1813
7.
Project background
Brief history
- 2007 Tika started in the Apache Incubator
- 2008 Tika graduates into a Lucene subproject
- 2010 Tika becomes a standalone TLP
- 2011 Tika 1.0 released
- 2011 Tika in Action published
Latest release is Apache Tika 1.2
- thousands of known media types
  - most with associated type detection patterns
- dozens of supported document formats
  - including all major office formats
- basic language detection
- etc.
For more information
- http://tika.apache.org/
8.
Ohloh summary (http://www.ohloh.net/p/tika)
9.
Full text extraction with Tika
section 2 / 4
10.
Demo: tika-app-1.2.jar
https://github.com/jukka/tika-demo
java -jar tika-app-1.2.jar
11.
tika-app as a command line tool
$ java -jar tika-app-1.2.jar --xhtml /path/to/document.doc
$ java -jar tika-app-1.2.jar --text http://example.com/document
$ java -jar tika-app-1.2.jar --metadata < document.doc
$ cat document.doc | java -jar tika-app-1.2.jar --text | grep foo
$ java -jar tika-app-1.2.jar --help
12.
Tika’s Java API
Divided into two layers
- The Tika facade: org.apache.tika.Tika
- Lower-level interfaces like Parser, Detector, etc.
Use the Tika facade by default
- Provides simple support for most common use cases
- Example: new Tika().parseToString(new File("/path/to/document.doc"))
Use the lower-level interfaces for more power or flexibility
- Allows fine-grained control of Tika functionality
- More complicated programming model
- Parsed content handled as XHTML SAX events
- Not all functionality is exposed through the Tika facade
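To make "parsed content handled as XHTML SAX events" concrete, here is a minimal sketch that uses only the JDK's own SAX parser as a stand-in for a Tika parser. The class and method names (SaxTextSketch, TextCollector, extractText) are hypothetical and not part of Tika; the point is only that a content handler receives callbacks and decides what to keep.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: the JDK SAX parser plays the role of a Tika parser here.
class SaxTextSketch {

    /** Appends the character data of every characters() event to one buffer. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        String text() {
            return text.toString();
        }
    }

    /** Parses a well-formed XHTML string and returns its text content. */
    static String extractText(String xhtml) {
        TextCollector collector = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (Exception e) {
            // Parser setup, I/O and well-formedness errors all end up here.
            throw new IllegalStateException("could not parse input", e);
        }
        return collector.text();
    }

    public static void main(String[] args) {
        System.out.println(extractText(
                "<html><body><h1>Title</h1><p>Hello, Tika</p></body></html>"));
    }
}
```

With the lower-level Tika interfaces the handler would be passed to a Parser instead of the JDK parser, which is where the extra control (and the extra complexity) comes from.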
13.
Tika and Solr - the ExtractingRequestHandler
section 3 / 4
14.
ExtractingRequestHandler aka Solr Cell
http://wiki.apache.org/solr/ExtractingRequestHandler
“Solr's ExtractingRequestHandler uses Tika to allow users to upload binary files to Solr and have Solr extract text from it and then index it.”
For example:
$ curl "http://localhost:8983/solr/update/extract?literal.id=document&commit=true" -F "file=@document.doc"
Supports both text and metadata extraction with plenty of configurable options
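The request URL in the curl example is just a base URL plus query parameters. A small sketch of assembling it in Java, assuming only the JDK: the class and method names (ExtractUrlSketch, buildExtractUrl) are hypothetical, and the multipart file upload that curl performs with -F is deliberately left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: builds the query string only, it does not upload the file.
class ExtractUrlSketch {

    /** Joins the parameters, in insertion order, into a URL-encoded query string. */
    static String buildExtractUrl(String baseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> param : params.entrySet()) {
            query.add(encode(param.getKey()) + "=" + encode(param.getValue()));
        }
        return baseUrl + "?" + query;
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "document");
        params.put("commit", "true");
        System.out.println(buildExtractUrl(
                "http://localhost:8983/solr/update/extract", params));
    }
}
```

Encoding each key and value separately matters once a parameter value contains spaces or other reserved characters, for example a file name with a space in it.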
15.
ExtractingRequestHandler parameters
Helping Tika do its job
- resource.name=document.doc - Helps Tika’s automatic type detection
- resource.password=secret - Allows Tika to read encrypted documents
- passwordsFile=/path/to/password-file - Resource name to password mappings
  - for example: .*.pdf$ = pdf-secret
Capturing special content
- xpath=//a - Capture only content inside elements that match the specified query
- capture=h1 - Capture content inside specific elements to a separate field
- captureAttr=true - Capture attributes into separate fields named after the element
Mapping field names
- lowernames=true - Normalize metadata field names to “content_type”, etc.
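The passwordsFile entry maps resource names to passwords through regular expressions. The sketch below shows one way such a mapping could be evaluated. It is not Solr's implementation: the class name PasswordRulesSketch, the '#' comment convention, the first-match-wins order and the use of a full match are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of "regex = password" rules; not Solr's actual code.
class PasswordRulesSketch {

    /** Parses "regex = password" lines, skipping blank lines and '#' comments (assumed). */
    static Map<Pattern, String> parseRules(List<String> lines) {
        Map<Pattern, String> rules = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Simplification: split at the first '=', so the regex itself cannot contain one.
            int eq = trimmed.indexOf('=');
            if (eq < 0) {
                continue; // not a mapping line
            }
            rules.put(Pattern.compile(trimmed.substring(0, eq).trim()),
                      trimmed.substring(eq + 1).trim());
        }
        return rules;
    }

    /** Returns the password of the first rule that matches the whole name, or null. */
    static String lookup(Map<Pattern, String> rules, String resourceName) {
        for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
            if (rule.getKey().matcher(resourceName).matches()) {
                return rule.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<Pattern, String> rules = parseRules(Arrays.asList(".*.pdf$ = pdf-secret"));
        System.out.println(lookup(rules, "report.pdf"));
    }
}
```

Keeping the rules in insertion order lets a more specific pattern be listed before a catch-all one.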
16.
Tika and Lucene - direct feeding of the index
section 4 / 4
17.
Using the Tika facade to feed Lucene
// Index first part of the document
String text = new Tika().parseToString(new File("/path/to/document.doc"));
document.add(new TextField("text", text, Field.Store.NO));

// Index the full document
Reader reader = new Tika().parse(new File("/path/to/document.doc"));
document.add(new TextField("text", reader));

// Index also some metadata (alternative to the previous snippet)
Metadata metadata = new Metadata();
Reader reader = new Tika().parse(
    new FileInputStream("/path/to/document.doc"), metadata);
document.add(new TextField("text", reader));
// StringField needs an explicit Field.Store value; YES is one possible choice
document.add(new StringField("type", metadata.get(Metadata.CONTENT_TYPE), Field.Store.YES));
18.
Things to consider
What if the document is larger than your memory?
- Index only first N bytes/characters?
- WriteOutContentHandler supports an explicit write limit
- Enabled by default in the Tika facade, see get/setMaxStringLength()
What if the document is malformed or intentionally broken?
- Could cause denial of service problems
- Might even crash the entire JVM due to bugs in native libraries in the JDK!
- SecureContentHandler monitors parsing and terminates it if things look bad
- Enabled by default in the Tika facade
Ultimate solution: forked parsing and the Tika server
- Parse documents in separate, sandboxed JVM processes
- A document could fail to parse, but your application won’t crash
- Code is already there, but still a bit tricky to set up
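The write-limit idea can be sketched without Tika: a SAX handler buffers characters up to a fixed budget and then aborts the parse by throwing an exception, and the caller keeps whatever was collected so far. This is only an illustration of the mechanism using the JDK SAX parser; WriteLimitSketch and LimitedTextHandler are hypothetical names, not Tika classes, and Tika's WriteOutContentHandler may differ in its details.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a write limit; the JDK SAX parser stands in for Tika.
class WriteLimitSketch {

    /** Buffers at most maxChars characters, then aborts the parse. */
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int maxChars;
        private boolean limitReached;

        LimitedTextHandler(int maxChars) {
            this.maxChars = maxChars;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = maxChars - text.length();
            if (length > room) {
                text.append(ch, start, room); // keep what still fits
                limitReached = true;
                throw new SAXException("write limit reached");
            }
            text.append(ch, start, length);
        }
    }

    /** Returns the text content of the XHTML, truncated to maxChars characters. */
    static String extractText(String xhtml, int maxChars) {
        LimitedTextHandler handler = new LimitedTextHandler(maxChars);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            // Only the handler's own abort is expected; anything else is a real failure.
            if (!handler.limitReached) {
                throw new IllegalStateException("could not parse input", e);
            }
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>Hello, world</p>", 5));
    }
}
```

Aborting through an exception means the rest of a very large document is never processed, which bounds both memory use and parsing time.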
19.
Link extraction for web crawlers
The LinkContentHandler class can be used to extract all links from a document
- Works also with links in things like PDF, MS Word and email documents
- Use TeeContentHandler to combine with other ways of capturing content
// for example
LinkContentHandler lch = new LinkContentHandler();
BodyContentHandler bch = new BodyContentHandler();
new Tika().getParser().parse(..., new TeeContentHandler(lch, bch), ...);
System.out.println("Content: " + bch);
for (Link link : lch.getLinks()) {
    System.out.println("Link: " + link);
}
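The tee pattern from this slide can be illustrated with plain JDK SAX classes: one handler collects link targets, another collects text, and a third forwards every event to both. The names below (LinkTeeSketch, LinkCollector, TextCollector, Tee) are hypothetical stand-ins for Tika's LinkContentHandler, BodyContentHandler and TeeContentHandler, and the sketch forwards only the two event types it needs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-ins for Tika's handlers, built on the JDK SAX parser.
class LinkTeeSketch {

    /** Records the href attribute of every <a> element. */
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            String href = atts.getValue("href");
            if ("a".equals(qName) && href != null) {
                links.add(href);
            }
        }
    }

    /** Accumulates all character data. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Forwards element-start and character events to every delegate, in order. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler[] delegates;

        Tee(DefaultHandler... delegates) {
            this.delegates = delegates;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler delegate : delegates) {
                delegate.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler delegate : delegates) {
                delegate.characters(ch, start, length);
            }
        }
    }

    /** Parses a well-formed XHTML string, sending events to the given handler. */
    static void parse(String xhtml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        LinkCollector links = new LinkCollector();
        TextCollector text = new TextCollector();
        parse("<p>See <a href=\"http://tika.apache.org/\">Tika</a></p>", new Tee(links, text));
        System.out.println("Content: " + text.text);
        for (String link : links.links) {
            System.out.println("Link: " + link);
        }
    }
}
```

Because the document is parsed once and the events are fanned out, links and body text are captured in a single pass.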
20.
Questions?
http://tika.apache.org/