SlideShare ist ein Scribd-Unternehmen logo
1 von 34
Downloaden Sie, um offline zu lesen
Lucene today, tomorrow and beyond
                Simon Willnauer
                Apache Lucene Core Committer & PMC Chair
                simonw@apache.org / simon.willnauer@searchworkings.org




Thursday, October 20, 2011
Who am I?



       • Lucene Core Committer
       • Project Management Committee Chair (PMC)
       • Apache Member
       • BerlinBuzzwords Co-Founder
       • Addicted to OpenSource
       • Apache Solr & Lucene User / Consultant / Promoter




                                                             2

Thursday, October 20, 2011
http://www.searchworkings.org

       • Community Portal targeting OpenSource Search




                                                        3

Thursday, October 20, 2011
What makes this talk different?




       • The most of the talks here are presenting what Lucene can do or what
          people do with Lucene, right?

       • This talk will show what Lucene can’t do today (trunk) but might be
          doing in the future.

       • I won’t talk about what people going to do in the future - maybe next
          time :)




                                                                                 4

Thursday, October 20, 2011
Lu
                                         2001           ce
                                                             ne
                                                                  jo
                                         2002                          in
                                                                            ed




Thursday, October 20, 2011
                                                                                 th
                                                                                    e
                                                   Lu                                   AS
                                         2003           ce                                 F
                                                             ne
                                                                  be
                                         2004      Lu                  co
                                                        ce                   m
                                                                                 es
                                                             ne                       Ap
                                                   Lu
                                         2005           ce
                                                                  1.
                                                                    2                      ac
                                                             ne                                he
                                                                  1.                                TL
                                         2006      Lu               4                                  P
                                                        ce
                                                                                                           Let’s go back in time a bit




                                                             ne
                                                                  2.
                                         2007      Lu               0
                                                        ce
                                                             ne
                                         2008      Lu       2.
                                                              1
                                                      ce
                                                   Lu    ne & 2
                                                      ce 2.           .2
                                         2009            ne 3
                                                            2.
                                                   Lu          4
                                         2010         ce
                                                         ne
                                                   Lu       2.
                                                      ce       9
                                         2011            ne      &
                                                                     3.
                                 Happy Birthday!   Lu       &          0
                                                      ce       So
                                          2012                    lr
                                                         ne          M
                                                            3.          er
                                                               1          ge
                                                   Lu            -3
                                                      ce             .4
                                                         ne
                                                            4.
                                                               0
                                                                 ?
                             5




                                          2014
And who did all the work?




                                                                                                            Created from Lucene core CHANGES.TXT




 Especially “via” is interesting since we use this for contributions from non-committers (FooBar via $committer_name)                              6

Thursday, October 20, 2011
Lets make this a fair game!




                             28 committers from 8 different countries

                                                                        7

Thursday, October 20, 2011
And the companies




                             8

Thursday, October 20, 2011
Where are we now - once 4.0 is out?



       • Lucene 4.0 contains a ton of smallish improvements
       • Lots of refined APIs
       • Large speed improvements
       • New modules
       • And lots of paths to explore for the future!




                                                              9

Thursday, October 20, 2011
Some random improvements

       • FuzzyQuery speedup by 20000% (yes 20k!)
       • Indexing throughput improvements 200% to 280%
       • Document Filtering speedup up to 480%
       • Loading term dictionaries up to 30x faster using 10% of the memory
          compared to 3.x

       • 600000 key-value lookups/second
       • Tremendous reduction of GC needs at runtime


                      Your mileage may vary!
                                                                              10

Thursday, October 20, 2011
Flexible Indexing & Codecs

       • Allows to customize low level index structure per field
       • Yields significant performance gains depending on the use-case
       • Highly optimized data-structures
       • Allows future improvements due to per codec Backwards Compatibility
       • Lets you decide on memory consumption




                                                                               11

Thursday, October 20, 2011
IndexDocValues

       • Value per field & document - similar to FieldCache
       • Type-safe and efficient on-disk & in-memory access
       • Soon update-able
       • More flexible than FieldCache
       • Fast loading times




                                                              12

Thursday, October 20, 2011
Flexible Scoring

       • New ranking models in addition to VSM
       • Adds key statistics to Lucene index to support other scoring models
       • Decoupled matching from ranking
       • Powerful Similarity API (can use IndexDocValues)




                                                                               13

Thursday, October 20, 2011
What else?

       • DocumentWriterPerThread
          • High throughput incremental indexing
          • Preparation for RT-Search
       • AutomatonQuery (FuzzyQuery)
          • Query as s Deterministic Finite Automata (DFA)
          • Levenshtein Automata for fast Fuzzy Queries (up to 20000%
                improvement over 3.x)

             • Flexible Automata concatenation



                                                                        14

Thursday, October 20, 2011
This was what we get with Lucene 4.0 (roughly)

       • What is missing in this picture?
       • Where are we going?
       • What comes after 4.0?
       • What is not going to make it into 4.0?


       All this boils down to: “What do WE & YOU want
       Lucene to become in the future?”




                                                        15

Thursday, October 20, 2011
Lucene - a Full Text Search Library




                     CORE SEARCH
                  FEATURES! - LIMITATIONS?



                                             16

Thursday, October 20, 2011
Positions - not a first class citizen

       • We have:
          • Spans (Near, First, MultiTerm...)
          • PhraseQuery (sloppy & strict)
       • The Problem:
          • Either use “common” query hierarchy or Spans
          • Score ALL or NOTHING
          • Scoring lots of documents takes ages




                                                           17

Thursday, October 20, 2011
Positions - not a first class citizen

       • Solutions?
          • Multi-Phase searches
             • Collect documents without positions
             • Re-score top N based on position data
          • Query hierarchy can be complex
             • We need an API with the same granularity as Scorer
          • Span semantics should not be bound to a query
             • Divorce scoring & matching for positions


                                                                    18

Thursday, October 20, 2011
Positions - not a first class citizen

       • What about highlighting?
          • The implementation is a mess
          • Tons of If (query instanceof FooQuery)
          • Hard to extend for custom queries
       • First steps are already taken!
          • http://svn.apache.org/repos/asf/lucene/dev/branches/positions/
          • Scorer allows to pull positions for any query - Help Wanted!




                                                                             19

Thursday, October 20, 2011
Updates - Huh? Incremental you know!

       • Everybody wants it, right?
          • Updating a field without reindexing the entire doc? Yeah!
          • Watch out, this comes not for free!
       • You can’t simply update a field - it’s a reverse index!
          • Term -> [ (docID, freq) ] ( how to update this )
          • Lucene is write once - no in-place updates (which is good!)
       • We have write per field per segment deltas and merge them on
          IndexReader open?! - seems tricky?

       • Lots of paths need to be explored - maybe “appending fields”?

                                                                          20

Thursday, October 20, 2011
Updates - Huh? Incremental you know!
                   term      fre   Posting list   1   The old night keeper keeps the keep in the town
                    and       q
                              1    6              2   In the big old house in the big old gown.
                     big      2    23
                                                  3   The house in the town had the big old keep
                   dark       1    6
                     did      1    4              4   Where the old night keeper never did sleep.

                   gown       1    2              5   The night keeper keeps the keep in the night
                    had       1    3              6   And keeps in the dark and sleeps in the light.
                  house       2    23
                      in      5    12356
                   keep
                  keeper
                              3
                              3
                                   135
                                   145
                                                           update freq & postings
                  keeps       3    156
                                                  2   In the small old house in the big old gown.
                    light     1    6
                   never      1    4
                   night      3    145                     insert new term
                     old      4    1234
                   sleep      1    4
                  sleeps      1    6
                     the      6    123456
                   town       2    13
                  where       1    4




                                                                                                        21

Thursday, October 20, 2011
Updates - Hu? Incremental you know!

       • Much easier (and closer) for not-indexed values
          • IndexDocValues
       • Assumption:
          • Document Title OR Body changes are low frequent
          • PageRank OR User-Ratings change very frequently
       • Maybe available in 4.0
       • Bottom Line: this is still far away but on the list!




                                                                22

Thursday, October 20, 2011
The JVM - or is it the JIT?

       • Unpredictable Mr. JIT




                             Grouping benchmark changes Spans? WTF?

                                                                  23

Thursday, October 20, 2011
The JVM - or is it the JIT?

       • The cost of a virtual method call




                                ConjunctionScorer Code Specialization
                                                                   24

Thursday, October 20, 2011
The JVM - or is it the JIT?

       • Lucene has a lot of HOT loops
          • Each TermScorer needs DocID & TermFreq for every possible hit
          • Calling DocsEnum#next() & #freq() adds up
          • Inlining seems unreliable

       • Solutions?




                                                                            25

Thursday, October 20, 2011
Possible Solutions / Paths to explore

       • Native Code / Generation (thats gonna be fun!)
       • Code Specialization
          • Can bring 50% to 100% performance improvements
       • ByteCode Generation & Query Compilation
          • Prototypes for FunctionQuery yields 300% speed improvements
       • Bulk Reading APIs - BulkPostings branch - watch out its hairy
          • Reading more than one DocID / TermFreq at a time
          • More than one step backwards - API wise


                                                                          26

Thursday, October 20, 2011
ByteCode generation

       • Specializing Queries at Runtime?
          • Might bring nice speed improvements per use-case
          • Problems arise with testing and correctness?
       • Could help tremendously with bulk postings
          • Some people say the API is unusable (Uwe?)
          • Maybe you don’t need to use it at all?
          • Would be nice if you could specify you query on a very high level and
                Lucene generates optimal code for you?




                                                                                27

Thursday, October 20, 2011
The Future beyond the core



       • Users have two options
          • Nothing - plain Lucene (well its a lot already - a lot to code)
          • All - Solr / ElasticSearch etc.


       •I’d like something in between, you?



                                                                              28

Thursday, October 20, 2011
<dream>Lucene 5.0</dream>

       • actually, XML is backwards: { “dream” : “Lucene 5.0” }
       • Solr has grown, grown large and is showing its age!
       • 95% of the time I only want one or two “services” Solr provides
          • still I got to use it - all or nothing!
          • I have to setup a (to me) heavy weight container (5 years ago Jetty /
                Tomcat was lightweight - times ‘r changing)

             • I got to figure out this documentation - fair enough!




                                                                                    29

Thursday, October 20, 2011
{“dream” : “Lucene 5.0”}

       • Can we get this more modular, lightweight & lean?
          • I rather do some coding than configure 2 lines of XML, you?




          Suggestions                                               Replication
                                              Faceting


                                                         Modules
                                                                                           CoreUtils
                                 Grouping



                 Spellchecking                                     Durability / Recovery


                                            Join

                                   today                           tomorrow
                                                                                                       30

Thursday, October 20, 2011
Isn’t this what Solr is?

       • Not quiet!
          • Lucene tries to provide APIs where you hardly can’t take anything
                away

             • When I think of Solr, you can hardly add anything
       • Everybody should be able to build their own $Solr
       • How hard will it be to draw the line?
       • Who is going to benefit?




                                                                                31

Thursday, October 20, 2011
Back to {“dream” : “Lucene 5.0”}

       • Can we go one step further?
                                                      Service - Module
                                          HTTP - Module




       • ElasticSearch did a great job making things dead simple!
          • we should follow this example and less might be more eventually!
       • Taking it as far as ElasticSearch (all or nothing again) seems not the
          right path for Lucene but simple is good, no?




                                                                                  32

Thursday, October 20, 2011
Disclaimer




       • This was my personal vision maybe not the one other people have.

       • Lets see what the community wants / needs - It’s all about the users!




                                                                                 33

Thursday, October 20, 2011
Questions




                             Thank you!



                                          34

Thursday, October 20, 2011

Weitere ähnliche Inhalte

Mehr von lucenerevolution

Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...lucenerevolution
 
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting PlatformHow Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting Platformlucenerevolution
 
Query Latency Optimization with Lucene
Query Latency Optimization with LuceneQuery Latency Optimization with Lucene
Query Latency Optimization with Lucenelucenerevolution
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 

Mehr von lucenerevolution (20)

Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...
 
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting PlatformHow Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
 
Query Latency Optimization with Lucene
Query Latency Optimization with LuceneQuery Latency Optimization with Lucene
Query Latency Optimization with Lucene
 
10 keys to Solr's Future
10 keys to Solr's Future10 keys to Solr's Future
10 keys to Solr's Future
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 

Kürzlich hochgeladen

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Kürzlich hochgeladen (20)

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

Lucene Today, Tomorrow & Beyond- Simon Willnauer

  • 1. Lucene today, tomorrow and beyond Simon Willnauer Apache Lucene Core Committer & PMC Chair simonw@apache.org / simon.willnauer@searchworkings.org Thursday, October 20, 2011
  • 2. Who am I? • Lucene Core Committer • Project Management Committee Chair (PMC) • Apache Member • BerlinBuzzwords Co-Founder • Addicted to OpenSource • Apache Solr & Lucene User / Consultant / Promoter 2 Thursday, October 20, 2011
  • 3. http://www.searchworkings.org • Community Portal targeting OpenSource Search 3 Thursday, October 20, 2011
  • 4. What makes this talk different? • The most of the talks here are presenting what Lucene can do or what people do with Lucene, right? • This talk will show what Lucene can’t do today (trunk) but might be doing in the future. • I won’t talk about what people going to do in the future - maybe next time :) 4 Thursday, October 20, 2011
  • 5. Lu 2001 ce ne jo 2002 in ed Thursday, October 20, 2011 th e Lu AS 2003 ce F ne be 2004 Lu co ce m es ne Ap Lu 2005 ce 1. 2 ac ne he 1. TL 2006 Lu 4 P ce Let’s go back in time a bit ne 2. 2007 Lu 0 ce ne 2008 Lu 2. 1 ce Lu ne & 2 ce 2. .2 2009 ne 3 2. Lu 4 2010 ce ne Lu 2. ce 9 2011 ne & 3. Happy Birthday! Lu & 0 ce So 2012 lr ne M 3. er 1 ge Lu -3 ce .4 ne 4. 0 ? 5 2014
  • 6. And who did all the work? Created from Lucene core CHANGES.TXT Especially “via” is interesting since we use this for contributions from non-committers (FooBar via $committer_name) 6 Thursday, October 20, 2011
  • 7. Lets make this a fair game! 28 committers from 8 different countries 7 Thursday, October 20, 2011
  • 8. And the companies 8 Thursday, October 20, 2011
  • 9. Where are we now - once 4.0 is out? • Lucene 4.0 contains a ton of smallish improvements • Lots of refined APIs • Large speed improvements • New modules • And lots of paths to explore for the future! 9 Thursday, October 20, 2011
  • 10. Some random improvements • FuzzyQuery speedup by 20000% (yes 20k!) • Indexing throughput improvements 200% to 280% • Document Filtering speedup up to 480% • Loading term dictionaries up to 30x faster using 10% of the memory compared to 3.x • 600000 key-value lookups/second • Tremendous reduction of GC needs at runtime Your mileage may vary! 10 Thursday, October 20, 2011
  • 11. Flexible Indexing & Codecs • Allows to customize low level index structure per field • Yields significant performance gains depending on the use-case • Highly optimized data-structures • Allows future improvements due to per codec Backwards Compatibility • Lets you decide on memory consumption 11 Thursday, October 20, 2011
  • 12. IndexDocValues • Value per field & document - similar to FieldCache • Type-safe and efficient on-disk & in-memory access • Soon update-able • More flexible than FieldCache • Fast loading times 12 Thursday, October 20, 2011
  • 13. Flexible Scoring • New ranking models in addition to VSM • Adds key statistics to Lucene index to support other scoring models • Decoupled matching from ranking • Powerful Similarity API (can use IndexDocValues) 13 Thursday, October 20, 2011
  • 14. What else? • DocumentWriterPerThread • High throughput incremental indexing • Preparation for RT-Search • AutomatonQuery (FuzzyQuery) • Query as s Deterministic Finite Automata (DFA) • Levenshtein Automata for fast Fuzzy Queries (up to 20000% improvement over 3.x) • Flexible Automata concatenation 14 Thursday, October 20, 2011
  • 15. This was what we get with Lucene 4.0 (roughly) • What is missing in this picture? • Where are we going? • What comes after 4.0? • What is not going to make it into 4.0? All this boils down to: “What do WE & YOU want Lucene to become in the future?” 15 Thursday, October 20, 2011
  • 16. Lucene - a Full Text Search Library CORE SEARCH FEATURES! - LIMITATIONS? 16 Thursday, October 20, 2011
  • 17. Positions - not a first class citizen • We have: • Spans (Near, First, MultiTerm...) • PhraseQuery (sloppy & strict) • The Problem: • Either use “common” query hierarchy or Spans • Score ALL or NOTHING • Scoring lots of documents takes ages 17 Thursday, October 20, 2011
  • 18. Positions - not a first class citizen • Solutions? • Multi-Phase searches • Collect documents without positions • Re-score top N based on position data • Query hierarchy can be complex • We need an API with the same granularity as Scorer • Span semantics should not be bound to a query • Divorce scoring & matching for positions 18 Thursday, October 20, 2011
  • 19. Positions - not a first class citizen • What about highlighting? • The implementation is a mess • Tons of If (query instanceof FooQuery) • Hard to extend for custom queries • First steps are already taken! • http://svn.apache.org/repos/asf/lucene/dev/branches/positions/ • Scorer allows to pull positions for any query - Help Wanted! 19 Thursday, October 20, 2011
  • 20. Updates - Huh? Incremental you know! • Everybody wants it, right? • Updating a field without reindexing the entire doc? Yeah! • Watch out, this comes not for free! • You can’t simply update a field - it’s a reverse index! • Term -> [ (docID, freq) ] ( how to update this ) • Lucene is write once - no in-place updates (which is good!) • We have write per field per segment deltas and merge them on IndexReader open?! - seems tricky? • Lots of paths need to be explored - maybe “appending fields”? 20 Thursday, October 20, 2011
  • 21. Updates - Huh? Incremental you know! term fre Posting list 1 The old night keeper keeps the keep in the town and q 1 6 2 In the big old house in the big old gown. big 2 23 3 The house in the town had the big old keep dark 1 6 did 1 4 4 Where the old night keeper never did sleep. gown 1 2 5 The night keeper keeps the keep in the night had 1 3 6 And keeps in the dark and sleeps in the light. house 2 23 in 5 12356 keep keeper 3 3 135 145 update freq & postings keeps 3 156 2 In the small old house in the big old gown. light 1 6 never 1 4 night 3 145 insert new term old 4 1234 sleep 1 4 sleeps 1 6 the 6 123456 town 2 13 where 1 4 21 Thursday, October 20, 2011
  • 22. Updates - Hu? Incremental you know! • Much easier (and closer) for not-indexed values • IndexDocValues • Assumption: • Document Title OR Body changes are low frequent • PageRank OR User-Ratings change very frequently • Maybe available in 4.0 • Bottom Line: this is still far away but on the list! 22 Thursday, October 20, 2011
  • 23. The JVM - or is it the JIT? • Unpredictable Mr. JIT Grouping benchmark changes Spans? WTF? 23 Thursday, October 20, 2011
  • 24. The JVM - or is it the JIT? • The cost of a virtual method call ConjunctionScorer Code Specialization 24 Thursday, October 20, 2011
  • 25. The JVM - or is it the JIT? • Lucene has a lot of HOT loops • Each TermScorer needs DocID & TermFreq for every possible hit • Calling DocsEnum#next() & #freq() adds up • Inlining seems unreliable • Solutions? 25 Thursday, October 20, 2011
  • 26. Possible Solutions / Paths to explore • Native Code / Generation (thats gonna be fun!) • Code Specialization • Can bring 50% to 100% performance improvements • ByteCode Generation & Query Compilation • Prototypes for FunctionQuery yields 300% speed improvements • Bulk Reading APIs - BulkPostings branch - watch out its hairy • Reading more than one DocID / TermFreq at a time • More than one step backwards - API wise 26 Thursday, October 20, 2011
  • 27. ByteCode generation • Specializing Queries at Runtime? • Might bring nice speed improvements per use-case • Problems arise with testing and correctness? • Could help tremendously with bulk postings • Some people say the API is unusable (Uwe?) • Maybe you don’t need to use it at all? • Would be nice if you could specify you query on a very high level and Lucene generates optimal code for you? 27 Thursday, October 20, 2011
  • 28. The Future beyond the core • Users have two options • Nothing - plain Lucene (well its a lot already - a lot to code) • All - Solr / ElasticSearch etc. •I’d like something in between, you? 28 Thursday, October 20, 2011
  • 29. <dream>Lucene 5.0</dream> • actually, XML is backwards: { “dream” : “Lucene 5.0” } • Solr has grown, grown large and is showing its age! • 95% of the time I only want one or two “services” Solr provides • still I got to use it - all or nothing! • I have to setup a (to me) heavy weight container (5 years ago Jetty / Tomcat was lightweight - times ‘r changing) • I got to figure out this documentation - fair enough! 29 Thursday, October 20, 2011
  • 30. {“dream” : “Lucene 5.0”} • Can we get this more modular, lightweight & lean? • I rather do some coding than configure 2 lines of XML, you? Suggestions Replication Faceting Modules CoreUtils Grouping Spellchecking Durability / Recovery Join today tomorrow 30 Thursday, October 20, 2011
  • 31. Isn’t this what Solr is? • Not quiet! • Lucene tries to provide APIs where you hardly can’t take anything away • When I think of Solr, you can hardly add anything • Everybody should be able to build their own $Solr • How hard will it be to draw the line? • Who is going to benefit? 31 Thursday, October 20, 2011
  • 32. Back to {“dream” : “Lucene 5.0”} • Can we go one step further? Service - Module HTTP - Module • ElasticSearch did a great job making things dead simple! • we should follow this example and less might be more eventually! • Taking it as far as ElasticSearch (all or nothing again) seems not the right path for Lucene but simple is good, no? 32 Thursday, October 20, 2011
  • 33. Disclaimer • This was my personal vision maybe not the one other people have. • Lets see what the community wants / needs - It’s all about the users! 33 Thursday, October 20, 2011
  • 34. Questions Thank you! 34 Thursday, October 20, 2011