Whilst passage indexing may seem like a small tweak to search ranking, it is potentially symptomatic of the beginning of a fundamental shift in the way search engines understand unstructured content, determine relevance in natural language, and rank efficiently and effectively.
It could also be a means of assessing overall content quality and of dynamic index pruning. We will look at the landscape, and also provide some takeaways for brands and business owners looking to improve the overall quality of unstructured content in this fast-changing landscape.
11. Plus… Google takes a ‘Hybrid Approach’ to research
https://research.google/pubs/pub38149/
12. Research & engineering are closely aligned
• Research designed to meet the needs of users
• Research often designed to make it to production in a relatively short space of time
• Engineers are often researchers and vice versa
• Able to research and deploy iteratively and quickly
13. Plus… There are research applications built to align IR research with ‘real life’ industry – e.g. Anserini (Yang, Fang & Lin, 2017)
14. Designed to build ‘reproducible’ code, tests & studies to avoid the below:
27. “A Breakthrough in Ranking”
• “We’ve recently made a breakthrough in ranking and are now able to not just index web pages, but individual passages from the pages. By better understanding the relevancy of specific passages, not just the overall page, we can find that needle-in-a-haystack information you’re looking for.” (Raghavan, P., 2020)
37. E.g. understanding when ‘bank’ in queries or content means ‘river bank’ versus ‘financial bank’
38. It’s a way to judge parts of a document (passages) individually for relevance to a query, rather than treating each part as just one very small piece of a full document
44. OR… the author just used more words than needed (verbosity)
45. Likely over a threshold of > ‘x’ unique ‘uncommon’ words
46. i.e. words other than … common English words:
a, able, about, across, after, all, almost, also, am, among, an, and, any, are, as, at,
be, because, been, but, by, can, cannot, could, dear, did, do, does, either, else, ever,
every, for, from, get, got, had, has, have, he, her, hers, him, his, how, however, i,
if, in, into, is, it, its, just, least, let, like, likely, may, me, might, most, must, my, neither,
no, nor, not, of, off, often, on, only, or, other, our, own, rather, said, say, says, she,
should, since, so, some, than, that, the, their, them, then, there, these, they, this, tis,
to, too, twas, us, wants, was, we, were, what, when, where, which, while, who, whom,
why, will, with, would, yet, you, your
47. ASIDE: These words nowadays actually add ‘contextual glue’ in natural language models, but were traditionally removed as ‘stop words’
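A minimal sketch of the counting idea above – tallying a passage’s unique ‘uncommon’ words against a stop list. The `STOP_WORDS` set here is a truncated subset of the common-words list on the previous slide, and the threshold logic is an assumption for illustration only:

```python
import re

# Truncated subset of the common English words listed above
STOP_WORDS = {
    "a", "able", "about", "across", "after", "all", "almost", "also", "am",
    "among", "an", "and", "any", "are", "as", "at", "be", "because", "been",
    "but", "by", "can", "cannot", "could", "for", "from", "had", "has",
    "have", "in", "into", "is", "it", "its", "of", "off", "on", "or", "the",
    "their", "them", "then", "there", "these", "they", "this", "to", "too",
    "was", "we", "were", "what", "when", "where", "which", "while", "who",
    "whom", "why", "will", "with", "would", "yet", "you", "your",
}

def unique_uncommon_words(text: str) -> set[str]:
    """Return the unique words in `text` that are not common stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

passage = "The river bank was flooded after the storm hit the valley"
print(sorted(unique_uncommon_words(passage)))
# ['bank', 'flooded', 'hit', 'river', 'storm', 'valley']
```

A hypothetical quality check could then simply compare `len(unique_uncommon_words(passage))` against the ‘x’ threshold mentioned above.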
55. Why not? – Transactional pages are different beasts to informational pages
• Not that many words on an ecommerce page (either product page or category page)
• Ecommerce pages tend to have an ‘out of the box’ organised content structure template
• Ecommerce pages will likely have lots of other ‘value add’ features too
• Ecommerce content is not conversational in nature
• Ecommerce pages tend to have a descending ‘importance’ hierarchy in the page content ‘out of the box’
56. Lots of clues ‘beyond the words’ as well
• Internal links with indicative anchors
• Strong categorization & subcategorization
• ‘Probably’ some product schema markup too
57. So, it’s probably not that difficult for search engines to understand ecommerce pages
58. Passage indexing / ranking is a ‘leg-up’ to pages which likely don’t rank for much currently
60. They don’t look like this:
<h1>Main theme</h1>
<p>paragraph</p>
<h2>Key point – Section – Subtopic</h2>
<p>paragraph</p>
<h3>Further point heading connected to h2 section</h3>
<p>paragraph</p>
<h2>Key point – Section – Subtopic</h2>
73. Two (or multi) stage ranking
• Retrieval (first stage)
• Re-ranking (refinement of the initial fetch for optimal search results)
74. Retrieval stage
• From all possible text (or synonym) matching candidates above an ‘x’ quality threshold, fetch e.g. 1,000
• Pass to the next stage for further consideration
75. Re-ranking stages (cascading stages?)
• From e.g. the initial fetched 1,000, fine-tune an optimal set of combined results to meet the query
• Present in response to the query in search
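The two-stage flow above can be sketched as follows. Both scoring functions are toy placeholders I am assuming for illustration – a real second stage would use a far more expensive (e.g. neural) model, not a re-used lexical score:

```python
def cheap_lexical_score(query: str, doc: str) -> float:
    # Toy first-stage score: fraction of document tokens matching the query
    q = set(query.lower().split())
    d = doc.lower().split()
    return sum(1 for t in d if t in q) / (len(d) or 1)

def expensive_semantic_score(query: str, doc: str) -> float:
    # Stand-in for an expensive neural scorer; reuses the lexical score here
    return cheap_lexical_score(query, doc)

def first_stage_retrieve(query: str, index: list[str], k: int = 1000) -> list[str]:
    """Cheap scoring over the whole index; keep only the top k candidates."""
    ranked = sorted(index, key=lambda doc: cheap_lexical_score(query, doc), reverse=True)
    return ranked[:k]

def re_rank(query: str, candidates: list[str], k: int = 10) -> list[str]:
    """Expensive scoring, but only over the small first-stage shortlist."""
    ranked = sorted(candidates, key=lambda doc: expensive_semantic_score(query, doc), reverse=True)
    return ranked[:k]

index = ["river bank fishing", "bank account fees", "apple pie recipe"]
shortlist = first_stage_retrieve("river bank", index, k=2)
print(re_rank("river bank", shortlist, k=1))  # ['river bank fishing']
```

The design point is cost: the cheap function touches every document, while the expensive one only ever sees the shortlist.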
92. Also, computers were (are) hopeless at ambiguity
“Computers are hopeless at disambiguation – at understanding which of multiple meanings is correct – because they don’t have our world knowledge.” Dr Stephen Clark, Cambridge University
https://www.cam.ac.uk/research/features/our-ambiguous-world-of-words
93. Different types of ambiguous queries
• Genuinely ambiguous queries – ‘bank’, ‘rose’, ‘bond’
• Underspecified queries – ‘Harry Potter’ (is it the book or the film?)
• Fully specified queries – ‘Harry Potter film’ (the aspect is fully defined)
98. Who is the user?
• What is the niche? (Do videos matter more?)
• Do more people search for the apple fruit there?
• Where is the user? (location)
• Are images or video a good fit?
• When is it? (query intent shift)
100. Perhaps it’s best to show a range of possible intent-matching results?
101. Relevance ranking & result diversification
• A list of documents ranked in order of descending relevance to a query (PRP – the Probability Ranking Principle)
• But… diversity matters in the results returned, to account for ambiguity and redundancy. Better to have broad diversity and novelty to cover all query aspects
105. In addition to the Probability Ranking Principle, several other approaches have been explored
• Portfolio Theory (PT) (Wang and Zhu, 2009)
• Quantum Probability Ranking Principle (QPRP) (Zuccon, Azzopardi and Van Rijsbergen, 2009)
• Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998)
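As a hedged sketch of the MMR idea listed above: greedily pick the document that balances relevance to the query against similarity to documents already selected. The word-overlap similarity and the λ value here are toy assumptions, not any engine’s actual scoring:

```python
def overlap(a: str, b: str) -> float:
    """Toy Jaccard word-overlap similarity between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mmr(query: str, docs: list[str], k: int = 3, lam: float = 0.7) -> list[str]:
    """Greedy Maximal Marginal Relevance selection of k diverse results."""
    selected: list[str] = []
    candidates = list(docs)
    while candidates and len(selected) < k:
        def score(d: str) -> float:
            # Penalise similarity to anything already chosen (redundancy)
            redundancy = max((overlap(d, s) for s in selected), default=0.0)
            return lam * overlap(query, d) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

docs = ["harry potter film review", "harry potter film trailer",
        "harry potter book summary"]
print(mmr("harry potter", docs, k=2))
# Picks the book doc second, not the near-duplicate film doc
```

Note how the second pick skips the redundant ‘trailer’ page in favour of the ‘book’ aspect – exactly the diversity-over-redundancy trade-off described above.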
109. Learning-to-Rank (learning ‘current relevance’)
• Feed a model with click data / queries (learn current relevance)
• Train & lab evaluation
• Send to human quality raters for scoring (human in the loop)
• Aggregate the scores from raters ‘on the whole evaluation’
• Use NDCG (Normalized Discounted Cumulative Gain) to adjust the whole test algorithm based on scores from human raters
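NDCG itself is simple to compute: rater relevance grades are discounted by rank position, then normalized against the ideal ordering so scores are comparable across queries. A minimal sketch:

```python
import math

def dcg(grades: list[float]) -> float:
    """Discounted cumulative gain: grade at rank i discounted by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(grades: list[float]) -> float:
    """Normalize DCG against the best possible ordering of the same grades."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal else 0.0

# A perfect ordering scores 1.0; misordering the results lowers the score
print(ndcg([3, 2, 1]))  # 1.0
print(ndcg([1, 2, 3]))  # < 1.0
```

This is the plain-vanilla formulation; production systems typically evaluate at a cutoff (NDCG@k) and may use a gain of 2^grade − 1 rather than the raw grade.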
111. So… SERP stages
• First stage – retrieve an initial fetch of e.g. 1,000 documents to get top-k (e.g. 20) candidates to re-order
• Final stage – refine to build an optimal page, dynamically refined for relevance & diversity
124. Precision refinement
• Novelty in results
• Subtopic retrieval
• Reduce redundancy in results
• Determine ‘the best mix of results for the query’
• Determine the optimal position for each candidate type
125. Re-ranking uses more expensive methods: computationally, financially, environmentally
126. Lexical matching – when the first stage of ranking is based on keyword (term) frequency matching relative to e.g. document length
• E.g. BM25 (the ‘Best Match 25’ algorithm), commonly used in information retrieval
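For the curious, the BM25 score mentioned above can be sketched in a few lines. `k1` and `b` are BM25’s usual free parameters; the IDF values passed in here are assumed toy numbers standing in for real corpus statistics:

```python
def bm25_score(query_terms: list[str], doc_terms: list[str],
               idf: dict[str, float], avg_doc_len: float,
               k1: float = 1.2, b: float = 0.75) -> float:
    """Toy BM25 score of one document against a list of query terms."""
    score = 0.0
    dl = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)  # raw term frequency in the document
        if tf == 0:
            continue
        # Saturating tf, normalized by document length relative to average
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_doc_len))
        score += idf.get(term, 0.0) * norm
    return score

idf = {"river": 2.0, "bank": 1.0}  # assumed corpus IDF values
doc = "river bank near the bank".split()
print(bm25_score(["river", "bank"], doc, idf, avg_doc_len=5.0))  # 3.375
```

Two properties worth noting: repeated terms saturate (the second ‘bank’ adds less than the first), and documents longer than average are penalised via `b` – exactly the relative-to-document-length behaviour described above.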
143. It looks like… several different approaches, including:
Ignore them
• Ignore large documents completely to save on efficiency, as long as there are other, smaller relevant documents to rank
Trim them
• Trim the document by selecting one good passage
Identify a passage
• Identify one passage only on large documents over ‘x’ size, and for everything else (smaller documents) use normal full-document ranking
Use TextTiling
• Try to identify subtopics at natural paragraph breaks
Use only the first ‘n’ terms
• Anything after ‘n’ terms is ignored
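Two of the approaches above are easy to sketch: keeping only the first ‘n’ terms, and splitting a long document into fixed-size overlapping passages (a crude stand-in for subtopic segmentation). The sizes and strides here are illustrative assumptions, not values used by any engine:

```python
def first_n_terms(text: str, n: int = 400) -> str:
    """Keep only the first n whitespace-separated terms of a document."""
    return " ".join(text.split()[:n])

def sliding_passages(text: str, size: int = 150, stride: int = 75) -> list[str]:
    """Split a document into overlapping fixed-size passages of `size` terms."""
    tokens = text.split()
    if len(tokens) <= size:
        return [" ".join(tokens)]  # short documents stay whole
    return [
        " ".join(tokens[i:i + size])
        for i in range(0, len(tokens) - stride, stride)
    ]

doc = " ".join(str(i) for i in range(300))
print(len(sliding_passages(doc)))  # 3 overlapping passages
```

Each passage could then be scored individually (e.g. with BM25), with the best passage’s score standing in for – or blended with – the full-document score.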
153. What is BERT (Bidirectional Encoder Representations from Transformers)?
• A PLM (pre-trained language model)
• Used for neural matching to understand a word’s meaning in ‘context’
• So, not just ‘keyword matching’ of queries to documents on a page
• But understanding the multiple meanings of a word in different contexts
• BERT is pre-trained on millions of words, so natural language researchers / engineers have a ‘starter for ten’ language model which already understands the contextual glue / nuance in words
154. BERT can dig out ‘meaning’ in informational / conversational content
170. An ensemble – taking the best parts of other language models & features
• RoBERTa (Facebook)
• ELECTRA (Google)
• BERT (Google)
• TensorFlow (Google)
• DeepCT (Dai)
• Learning to Rank (LeToR)
171. DeepCT (Dai, 2019) – ‘Deep Contextualized Term Weighting Framework’
• tfDeepCT – an alternative to term frequency which replaces tf with tfDeepCT
• DeepCT-Index – alternative weights added to an original index, with no additional postings. Weighting is carried out offline, and therefore does not add any latency to online search engine usage
• DeepCT-Query – an updated bag-of-words query which has been adapted using the deep contextual features from BERT to identify important terms in a given text context or query context
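Purely as an illustration of the indexing idea above (not DeepCT’s actual code): an inverted index that stores a predicted importance weight per term instead of a raw count, computed offline. `predict_importance` is a hypothetical placeholder standing in for the BERT-based regressor:

```python
def predict_importance(term: str, passage: str) -> float:
    """Placeholder for a learned term-importance model (tfDeepCT idea).

    A real system would run a BERT-based regressor here; this toy
    heuristic just scores longer terms as more important.
    """
    return min(1.0, len(term) / 10)

def deepct_style_index(passages: list[str]) -> dict[str, dict[int, float]]:
    """Build term -> {passage_id: weight} postings, weighted offline."""
    index: dict[str, dict[int, float]] = {}
    for pid, passage in enumerate(passages):
        for term in set(passage.lower().split()):
            # Store the predicted weight where plain tf would normally go
            index.setdefault(term, {})[pid] = predict_importance(term, passage)
    return index

idx = deepct_style_index(["river bank fishing", "bank fees"])
print(idx["bank"])  # weight per passage, not a raw count
```

Because the weights are written at indexing time, query-time scoring stays as cheap as ordinary tf-based retrieval – which is the latency point the slide makes.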
172. “A Breakthrough in Ranking”
• “We’ve recently made a breakthrough in ranking and are now able to not just index web pages, but individual passages from the pages. By better understanding the relevancy of specific passages, not just the overall page, we can find that needle-in-a-haystack information you’re looking for.” (Raghavan, P., 2020)
186. Break the tasks down
• Open-domain question answering
• Greater understanding of expert disorganized content
• Greater conversational search
• Greater understanding about how everything ‘fits together’
192. Move over BERT… very different models are appearing
• T5 (Google)
• DocTTTTTQuery (Nogueira and Lin, 2019)
• Expando-Mono-Duo (Pradeep, Nogueira and Lin, 2021)
• RocketQA + ERNIE (Qu et al., 2020 (Baidu))
• COIL (Gao, Dai and Callan, 2021)
• DeepCT (Dai and Callan, 2019)
197. Alternatives to BM25 & TF-IDF are being found in passages
• Dense Passage Retrieval
• ANCE
• DocTTTTTQuery
• COIL
• ColBERT
• DeepCT
198. Dense Passage Retrieval – ‘packing’ sparse initial-stage data with training passages
ANCE – nearest-neighbour approach
DocTTTTTQuery – query expansion in the first stage
COIL – scoring system to add ‘context’ to BM25 first-stage retrieval – adds semantic & lexical (hybrid)
ColBERT – late-stage BERT introduction
DeepCT – index term weighting with the DeepCT framework for ‘importance weights’
202. Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach (Kuzi, S., Zhang, M., Li, C., Bendersky, M. and Najork, M., 2020)
221. On all sections of the site
• Informational content (blogs, guides, FAQs etc.)
• Informational ‘collection’ pages (e.g. blog category pages, guide hub pages)
• Transactional content (e.g. ecommerce product pages, local listings)
• Transactional ‘collection’ pages (e.g. ecommerce category pages, local listing category pages)
222. Since meeting the needs of under-specified queries requires broad coverage
223. You will be judged on “How many of the intents of this under-specified query do you meet?”
224. It could be hard to rank for big ‘head’ terms without comprehensive stock (‘information gain’ (value)) within your document collection (site), covering all query aspects (intents), to meet an under-specified query
236. Think…
• Are we bringing a new perspective?
• Is the page substantively different to what we already have?
• Does this page meet another aspect of information need / a different intent?
• Does this page substantively bring another type of media into the mix?
• Don’t try to be the same as what is already ranking… do something more / different
237. Bring a new perspective
• Thought leadership
• New data-driven refreshing perspectives
• Interactive tools
• Additional features in your content which others don’t have?
238. Improving informational content
• Add semantic headings
• Connect up topics via ‘relatedness’
• Learn to structure content properly
• Take an inverted-pyramid approach to web content
• Avoid verbosity
• ‘Move’ and ‘prune’ with caution
• Don’t ‘prune’ away your semantic ‘relatedness’