Whilst passage indexing may seem like a small tweak to search ranking, it is potentially symptomatic of the beginning of a fundamental shift in the way search engines understand unstructured content, determine relevance in natural language, and rank efficiently and effectively.
It could also be a means of assessing overall content quality and of dynamic index pruning. We will look at the landscape, and also provide some takeaways for brands and business owners looking to improve the overall quality of unstructured content in this fast-changing landscape.
11. Plus… Google takes a ‘Hybrid Approach’ to research
https://research.google/pubs/pub38149/
12. Research & engineering are closely aligned
• Research designed to meet the needs of users
• Research often designed to make it to production in a relatively short space of time
• Engineers are often researchers and vice versa
• Able to research and deploy iteratively and quickly
13. Plus… There are research applications built to align IR research with ‘real life’ industry – e.g. Anserini (Yang, Fang & Lin, 2017)
14. Designed to build ‘reproducible’ code, tests & studies to avoid the below:
27. “A Breakthrough in Ranking”
• “We’ve recently made a breakthrough in ranking and are now able to not just index web pages, but individual passages from the pages. By better understanding the relevancy of specific passages, not just the overall page, we can find that needle-in-a-haystack information you’re looking for.” (Raghavan, P., 2020)
37. E.g. understanding when ‘bank’ in queries or content means ‘river bank’ versus ‘financial bank’
38. It’s a way to judge parts of a document (passages) individually for relevance to a query, rather than treating each part as just one very small piece of a full document
44. OR… the author just used more words than needed (verbosity)
45. Likely over a threshold of > ‘x’ unique ‘uncommon’ words
46. i.e. words other than … common English words:
a, able, about, across, after, all, almost, also, am, among, an, and, any, are, as, at,
be, because, been, but, by, can, cannot, could, dear, did, do, does, either, else, ever,
every, for, from, get, got, had, has, have, he, her, hers, him, his, how, however, i,
if, in, into, is, it, its, just, least, let, like, likely, may, me, might, most, must, my, neither,
no, nor, not, of, off, often, on, only, or, other, our, own, rather, said, say, says, she,
should, since, so, some, than, that, the, their, them, then, there, these, they, this, tis,
to, too, twas, us, wants, was, we, were, what, when, where, which, while, who, whom,
why, will, with, would, yet, you, your
47. ASIDE: These words nowadays actually add ‘contextual glue’ in natural language models, but were traditionally removed as ‘stop words’
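A minimal sketch of the counting idea above – tallying a passage’s unique ‘uncommon’ words against a stop list. The `STOP_WORDS` set here is a truncated subset of the common-words list on the previous slide, and the threshold logic is an assumption for illustration only:

```python
import re

# Truncated subset of the common English words listed above
STOP_WORDS = {
    "a", "able", "about", "across", "after", "all", "almost", "also", "am",
    "among", "an", "and", "any", "are", "as", "at", "be", "because", "been",
    "but", "by", "can", "cannot", "could", "for", "from", "had", "has",
    "have", "in", "into", "is", "it", "its", "of", "off", "on", "or", "the",
    "their", "them", "then", "there", "these", "they", "this", "to", "too",
    "was", "we", "were", "what", "when", "where", "which", "while", "who",
    "whom", "why", "will", "with", "would", "yet", "you", "your",
}

def unique_uncommon_words(text: str) -> set[str]:
    """Return the unique words in `text` that are not common stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

passage = "The river bank was flooded after the storm hit the valley"
print(sorted(unique_uncommon_words(passage)))
# ['bank', 'flooded', 'hit', 'river', 'storm', 'valley']
```

A hypothetical quality check could then simply compare `len(unique_uncommon_words(passage))` against the ‘x’ threshold mentioned above.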
55. Why not? – Transactional pages are different beasts to informational pages
• Not that many words on an ecommerce page (either product page or category page)
• Ecommerce pages tend to have an ‘out of the box’ organised content structure template
• Ecommerce pages will likely have lots of other ‘value add’ features too
• Ecommerce content is not conversational in nature
• Ecommerce pages tend to have a descending ‘importance’ hierarchy in the page content ‘out of the box’
56. Lots of clues ‘beyond the words’ as well
• Internal links with indicative anchors
• Strong categorization & subcategorization
• ‘Probably’ some product schema markup too
57. So, it’s probably not that difficult for search engines to understand ecommerce pages
58. Passage indexing / ranking is a ‘leg-up’ to pages which likely don’t rank for much currently
60. They don’t look like this:
<h1>Main theme</h1>
<p>paragraph</p>
<h2>Key point – Section – Subtopic</h2>
<p>paragraph</p>
<h3>Further point heading connected to h2 section</h3>
<p>paragraph</p>
<h2>Key point – Section – Subtopic</h2>
73. Two (or multi) stage ranking
• Retrieval (first stage)
• Re-ranking (refinement of the initial fetch for optimal search results)
74. Retrieval stage
• From all possible text (or synonym) matching candidates above an ‘x’ quality threshold, fetch e.g. 1,000
• Pass to the next stage for further consideration
75. Re-ranking stages (cascading stages?)
• From e.g. the initial fetched 1,000, fine-tune an optimal set of combined results to meet the query
• Present in response to the query in search
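The two-stage flow above can be sketched as follows. Both scoring functions are toy placeholders I am assuming for illustration – a real second stage would use a far more expensive (e.g. neural) model, not a re-used lexical score:

```python
def cheap_lexical_score(query: str, doc: str) -> float:
    # Toy first-stage score: fraction of document tokens matching the query
    q = set(query.lower().split())
    d = doc.lower().split()
    return sum(1 for t in d if t in q) / (len(d) or 1)

def expensive_semantic_score(query: str, doc: str) -> float:
    # Stand-in for an expensive neural scorer; reuses the lexical score here
    return cheap_lexical_score(query, doc)

def first_stage_retrieve(query: str, index: list[str], k: int = 1000) -> list[str]:
    """Cheap scoring over the whole index; keep only the top k candidates."""
    ranked = sorted(index, key=lambda doc: cheap_lexical_score(query, doc), reverse=True)
    return ranked[:k]

def re_rank(query: str, candidates: list[str], k: int = 10) -> list[str]:
    """Expensive scoring, but only over the small first-stage shortlist."""
    ranked = sorted(candidates, key=lambda doc: expensive_semantic_score(query, doc), reverse=True)
    return ranked[:k]

index = ["river bank fishing", "bank account fees", "apple pie recipe"]
shortlist = first_stage_retrieve("river bank", index, k=2)
print(re_rank("river bank", shortlist, k=1))  # ['river bank fishing']
```

The design point is cost: the cheap function touches every document, while the expensive one only ever sees the shortlist.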
92. Also, computers were (are) hopeless at ambiguity
“Computers are hopeless at disambiguation – at understanding which of multiple meanings is correct – because they don’t have our world knowledge.” Dr Stephen Clark, Cambridge University
https://www.cam.ac.uk/research/features/our-ambiguous-world-of-words
93. Different types of ambiguous queries
• Genuinely ambiguous queries – ‘bank’, ‘rose’, ‘bond’
• Underspecified queries – ‘Harry Potter’ (is it the book or the film?)
• Fully specified queries – ‘Harry Potter film’ (the aspect is fully defined)
98. Who is the user?
• What is the niche? (Do videos matter more?)
• Do more people search for the apple fruit there?
• Where is the user? (location)
• Are images or video a good fit?
• When is it? (query intent shift)
100. Perhaps it’s best to show a range of possible intent-matching results?
101. Relevance ranking & result diversification
• A list of documents ranked in order of descending relevance to a query (PRP – the Probability Ranking Principle)
• But… diversity matters in the results returned, to account for ambiguity and redundancy. Better to have broad diversity and novelty to cover all query aspects
105. In addition to the Probability Ranking Principle, several other approaches have been explored
• Portfolio Theory (PT) (Wang and Zhu, 2009)
• Quantum Probability Ranking Principle (QPRP) (Zuccon, Azzopardi and Van Rijsbergen, 2009)
• Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998)
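As a hedged sketch of the MMR idea listed above: greedily pick the document that balances relevance to the query against similarity to documents already selected. The word-overlap similarity and the λ value here are toy assumptions, not any engine’s actual scoring:

```python
def overlap(a: str, b: str) -> float:
    """Toy Jaccard word-overlap similarity between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mmr(query: str, docs: list[str], k: int = 3, lam: float = 0.7) -> list[str]:
    """Greedy Maximal Marginal Relevance selection of k diverse results."""
    selected: list[str] = []
    candidates = list(docs)
    while candidates and len(selected) < k:
        def score(d: str) -> float:
            # Penalise similarity to anything already chosen (redundancy)
            redundancy = max((overlap(d, s) for s in selected), default=0.0)
            return lam * overlap(query, d) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

docs = ["harry potter film review", "harry potter film trailer",
        "harry potter book summary"]
print(mmr("harry potter", docs, k=2))
# Picks the book doc second, not the near-duplicate film doc
```

Note how the second pick skips the redundant ‘trailer’ page in favour of the ‘book’ aspect – exactly the diversity-over-redundancy trade-off described above.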
109. Learning-to-Rank (learning ‘current relevance’)
• Feed a model with click data / queries (learn current relevance)
• Train & lab evaluation
• Send to human quality raters for scoring (human in the loop)
• Aggregate the scores from raters ‘on the whole evaluation’
• Use NDCG (Normalized Discounted Cumulative Gain) to adjust the whole test algorithm based on scores from human raters
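NDCG itself is simple to compute: rater relevance grades are discounted by rank position, then normalized against the ideal ordering so scores are comparable across queries. A minimal sketch:

```python
import math

def dcg(grades: list[float]) -> float:
    """Discounted cumulative gain: grade at rank i discounted by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(grades: list[float]) -> float:
    """Normalize DCG against the best possible ordering of the same grades."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal else 0.0

# A perfect ordering scores 1.0; misordering the results lowers the score
print(ndcg([3, 2, 1]))  # 1.0
print(ndcg([1, 2, 3]))  # < 1.0
```

This is the plain-vanilla formulation; production systems typically evaluate at a cutoff (NDCG@k) and may use a gain of 2^grade − 1 rather than the raw grade.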
111. So… SERP stages
• First stage – retrieve an initial fetch of e.g. 1,000 documents to get top-k (e.g. 20) candidates to re-order
• Final stage – refine to build an optimal page, dynamically refined for relevance & diversity
124. Precision refinement
• Novelty in results
• Subtopic retrieval
• Reduce redundancy in results
• Determine ‘the best mix of results for the query’
• Determine the optimal position for each candidate type
125. Re-ranking uses more expensive methods: computationally, financially, environmentally
126. Lexical matching – when the first stage of ranking is based on keyword (term) frequency matching relative to e.g. document length
• E.g. BM25 (the ‘Best Match 25’ algorithm), commonly used in information retrieval
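For the curious, the BM25 score mentioned above can be sketched in a few lines. `k1` and `b` are BM25’s usual free parameters; the IDF values passed in here are assumed toy numbers standing in for real corpus statistics:

```python
def bm25_score(query_terms: list[str], doc_terms: list[str],
               idf: dict[str, float], avg_doc_len: float,
               k1: float = 1.2, b: float = 0.75) -> float:
    """Toy BM25 score of one document against a list of query terms."""
    score = 0.0
    dl = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)  # raw term frequency in the document
        if tf == 0:
            continue
        # Saturating tf, normalized by document length relative to average
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_doc_len))
        score += idf.get(term, 0.0) * norm
    return score

idf = {"river": 2.0, "bank": 1.0}  # assumed corpus IDF values
doc = "river bank near the bank".split()
print(bm25_score(["river", "bank"], doc, idf, avg_doc_len=5.0))  # 3.375
```

Two properties worth noting: repeated terms saturate (the second ‘bank’ adds less than the first), and documents longer than average are penalised via `b` – exactly the relative-to-document-length behaviour described above.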
143. It looks like… several different approaches, including:
Ignore them
• Ignore large documents completely to save on efficiency, as long as there are other, smaller relevant documents to rank
Trim them
• Trim the document by selecting one good passage
Identify a passage
• Identify one passage only on large documents over ‘x’ size, and for everything else (smaller documents) use normal full-document ranking
Use TextTiling
• Try to identify subtopics at natural paragraph breaks
Use only the first ‘n’ terms
• Anything after ‘n’ terms is ignored
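Two of the approaches above are easy to sketch: keeping only the first ‘n’ terms, and splitting a long document into fixed-size overlapping passages (a crude stand-in for subtopic segmentation). The sizes and strides here are illustrative assumptions, not values used by any engine:

```python
def first_n_terms(text: str, n: int = 400) -> str:
    """Keep only the first n whitespace-separated terms of a document."""
    return " ".join(text.split()[:n])

def sliding_passages(text: str, size: int = 150, stride: int = 75) -> list[str]:
    """Split a document into overlapping fixed-size passages of `size` terms."""
    tokens = text.split()
    if len(tokens) <= size:
        return [" ".join(tokens)]  # short documents stay whole
    return [
        " ".join(tokens[i:i + size])
        for i in range(0, len(tokens) - stride, stride)
    ]

doc = " ".join(str(i) for i in range(300))
print(len(sliding_passages(doc)))  # 3 overlapping passages
```

Each passage could then be scored individually (e.g. with BM25), with the best passage’s score standing in for – or blended with – the full-document score.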
153. What is BERT (Bidirectional Encoder Representations from Transformers)?
• A PLM (pre-trained language model)
• Used for neural matching to understand a word’s meaning in ‘context’
• So, not just ‘keyword matching’ of queries to documents on a page
• But understanding the multiple meanings of a word in different contexts
• BERT is pre-trained on millions of words, so natural language researchers / engineers have a ‘starter for ten’ language model which already understands the contextual glue / nuance in words
154. BERT can dig out ‘meaning’ in informational / conversational content
170. An ensemble – taking the best parts of other language models & features
• RoBERTa (Facebook)
• ELECTRA (Google)
• BERT (Google)
• TensorFlow (Google)
• DeepCT (Dai)
• Learning to Rank (LeToR)
171. DeepCT (Dai, 2019) – ‘Deep Contextualized Term Weighting Framework’
• tfDeepCT – an alternative to term frequency which replaces tf with tfDeepCT
• DeepCT-Index – alternative weights added to an original index, with no additional postings. Weighting is carried out offline, and therefore does not add any latency to online search engine usage
• DeepCT-Query – an updated bag-of-words query which has been adapted using the deep contextual features from BERT to identify important terms in a given text context or query context
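Purely as an illustration of the indexing idea above (not DeepCT’s actual code): an inverted index that stores a predicted importance weight per term instead of a raw count, computed offline. `predict_importance` is a hypothetical placeholder standing in for the BERT-based regressor:

```python
def predict_importance(term: str, passage: str) -> float:
    """Placeholder for a learned term-importance model (tfDeepCT idea).

    A real system would run a BERT-based regressor here; this toy
    heuristic just scores longer terms as more important.
    """
    return min(1.0, len(term) / 10)

def deepct_style_index(passages: list[str]) -> dict[str, dict[int, float]]:
    """Build term -> {passage_id: weight} postings, weighted offline."""
    index: dict[str, dict[int, float]] = {}
    for pid, passage in enumerate(passages):
        for term in set(passage.lower().split()):
            # Store the predicted weight where plain tf would normally go
            index.setdefault(term, {})[pid] = predict_importance(term, passage)
    return index

idx = deepct_style_index(["river bank fishing", "bank fees"])
print(idx["bank"])  # weight per passage, not a raw count
```

Because the weights are written at indexing time, query-time scoring stays as cheap as ordinary tf-based retrieval – which is the latency point the slide makes.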
172. “A Breakthrough in Ranking”
• “We’ve recently made a breakthrough in ranking and are now able to not just index web pages, but individual passages from the pages. By better understanding the relevancy of specific passages, not just the overall page, we can find that needle-in-a-haystack information you’re looking for.” (Raghavan, P., 2020)
186. Break the tasks down
• Open-domain question answering
• Greater understanding of expert disorganized content
• Greater conversational search
• Greater understanding about how everything ‘fits together’
192. Move over BERT… very different models are appearing
• T5 (Google)
• DocTTTTTQuery (Nogueira and Lin, 2019)
• Expando-Mono-Duo (Pradeep, Nogueira and Lin, 2021)
• RocketQA + ERNIE (Qu et al., 2020 (Baidu))
• COIL (Gao, Dai and Callan, 2021)
• DeepCT (Dai and Callan, 2019)
197. Alternatives to BM25 & TF-IDF are being found in passages
• Dense Passage Retrieval
• ANCE
• DocTTTTTQuery
• COIL
• ColBERT
• DeepCT
198. Dense Passage Retrieval – ‘packing’ sparse initial-stage data with training passages
ANCE – nearest-neighbour approach
DocTTTTTQuery – query expansion in the first stage
COIL – scoring system to add ‘context’ to BM25 first-stage retrieval – adds semantic & lexical (hybrid)
ColBERT – late-stage BERT introduction
DeepCT – index term weighting with the DeepCT framework for ‘importance weights’
202. Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach (Kuzi, S., Zhang, M., Li, C., Bendersky, M. and Najork, M., 2020)
221. On all sections of the site
• Informational content (blogs, guides, FAQs etc.)
• Informational ‘collection’ pages (e.g. blog category pages, guide hub pages)
• Transactional content (e.g. ecommerce product pages, local listings)
• Transactional ‘collection’ pages (e.g. ecommerce category pages, local listing category pages)
222. Since meeting the needs of under-specified queries requires broad coverage
223. You will be judged on “How many of the intents of this under-specified query do you meet?”
224. It could be hard to rank for big ‘head’ terms without comprehensive stock (‘information gain’ (value)) within your document collection (site), covering all query aspects (intents), to meet an under-specified query
236. Think…
• Are we bringing a new perspective?
• Is the page substantively different to what we already have?
• Does this page meet another aspect of information need / a different intent?
• Does this page substantively bring another type of media into the mix?
• Don’t try to be the same as what is already ranking… do something more / different
237. Bring a new perspective
• Thought leadership
• New data-driven refreshing perspectives
• Interactive tools
• Additional features in your content which others don’t have?
238. Improving informational content
• Add semantic headings
• Connect up topics via ‘relatedness’
• Learn to structure content properly
• Take an inverted-pyramid approach to web content
• Avoid verbosity
• ‘Move’ and ‘prune’ with caution
• Don’t ‘prune’ away your semantic ‘relatedness’