The document discusses automating the assessment of search quality by analyzing search result properties. It proposes calculating a quality score from 0-1 for each result based on factors like keyword matching in titles/snippets, uniqueness of titles, result set size, and result age. The scores would be averaged to assess a search term's results. This automated approach aims to mimic human evaluation of search quality without requiring manual reviews. The analysis focuses on the first page of results and user-visible aspects to quickly gauge changes' impacts without an in-depth content review.
SIKM Leaders July 2012 - Understanding your Search Log
1. Search analytics –
Understanding the long tail
SIKM Leaders July 2012
Lee Romero
blog.leeromero.org
July 2012
2. About me
My background and early career are in software engineering.
I've worked in the knowledge management field for the last 12+
years – almost all of it on the technology side of KM.
I've worked with various search solutions for the last 7-8 years –
and spent most of that time trying to figure out how to measure
their usefulness and improve them in any way I can.
I've spoken twice at both Enterprise Search Summit and Taxonomy Boot
Camp.
My writings on search analytics have been featured by a number of
experts in the field, including Lou Rosenfeld and Avi Rappoport.
3. Search Analytics
Definition: Search analytics is the field of analyzing and
aggregating usage statistics of your search solution to
understand user behavior and to improve the experience.
Some search analytics work focuses on SEO / SEM activities (for
internet searches).
The focus here will be on enterprise search, so we will primarily
focus on improving the user experience.
Further, I will primarily focus on keyword search and on
understanding the user language found in search logs.
Always remember – analytics without action does not have much
value.
5. Understanding your search log
For enterprise search solutions¹, the “80-20” rule does not hold.
The language variability is very high in a couple of ways (covered
in the next few slides).
Yet having a good understanding of the language, frequency and
commonality in your search log is critical to being able to make
sustainable improvements to your search.
The remainder of this presentation first provides some evidence
supporting my claim and then covers some ideas and research
into this problem.
¹ This does not seem to apply equally to e-commerce solutions
6. Some facts about search terms
There’s an anecdote that goes something like, “80% of your
searches are from 20% of your search terms”
• Equivalently, some will say that you can make significant impact by paying
attention to a few of your most common terms (you can, but in limited ways)
Fact: in enterprise search solutions the curve is much shallower:
[Chart: the inverted power curve of term frequency for the two different solutions I'm currently working with]
In the second case, it takes 13% of terms to cover 50% of searches,
and that is over 7000 distinct terms in a typical month!
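The 13%-to-cover-50% figure can be reproduced from any raw search log. Below is a minimal Python sketch of that computation; the term names and counts are invented toy data, not the deck's:

```python
from collections import Counter

def coverage_fraction(term_counts, target=0.5):
    """Return the fraction of distinct terms (most frequent first) whose
    cumulative search volume first reaches `target` of all searches."""
    total = sum(term_counts.values())
    covered = 0
    for i, (_, count) in enumerate(term_counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return i / len(term_counts)
    return 1.0

# Toy log: one dominant term plus a long tail of one-off terms.
log = ["vpn"] * 5 + ["expenses", "timesheet", "badge", "payroll", "travel"]
counts = Counter(log)
print(coverage_fraction(counts))  # "vpn" alone covers 50% of searches
```

In an enterprise log the returned fraction tends to be large (a shallow curve); on a site where the 80-20 anecdote held, it would be small.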
7. Some facts about search terms: part 2
Another myth: a large percent of searches repeat over and over
again
Fact: on enterprise search solutions, there is surprisingly little
commonality month-to-month
Over a recent six month period, which saw a total of ~289K distinct
search terms, only 11% of terms occurred in more than 1 month!
# of months   # of terms   % of terms
1             257665       89.2%
2             17994        6.2%
3             5790         2.0%
4             2900         1.0%
5             2019         0.7%
6             2340         0.8%
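The table above can be produced by counting, per distinct term, how many months it appears in. A small sketch, using invented month sets:

```python
from collections import defaultdict

def months_per_term(monthly_logs):
    """Map each distinct term to the number of months it appears in.
    `monthly_logs` is a list of sets, one set of distinct terms per month."""
    seen_in = defaultdict(int)
    for terms in monthly_logs:
        for term in terms:
            seen_in[term] += 1
    return seen_in

months = [
    {"vpn", "payroll", "timesheet"},
    {"vpn", "expenses"},
    {"vpn", "badge"},
]
counts = months_per_term(months)
repeaters = sum(1 for n in counts.values() if n > 1)
print(repeaters / len(counts))  # only "vpn" recurs: 1 of 5 distinct terms
```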
8. Some facts about search terms: part 3
Another myth: a good percentage of your search terms will repeat in
sequential periods
Fact: There is much more churn even month-to-month than you
might expect – in the period covered below, only about 13% of
terms repeated from one month to the next (covering about 36%
of searches)
9. What to do with your search log?
The summary of the previous slides:
• It is hard to understand a decent percentage of terms within a
given time period (month)!
• If you could do that, the problem during the next time period isn’t
that much easier!
The next sections describe a couple of research projects I’ve been
working on to tackle these issues
11. Categorizing your users’ language
Given the challenges previously laid out, using the search log to
understand user needs seems very challenging
Beyond the first several dozen terms, it is hard to understand what
users are looking for
• And those several dozen terms cover a vanishingly small percentage of all
searches!
However, it would be very useful to understand your users’
information needs if we could somehow understand the entirety
of the search log
How do we handle this? Categorize the search terms!
12. Categorizing your users’ language, p2
So we need to categorize search terms to really be able to
understand our users’ information needs.
To do this, we face two challenges
1. What categorization scheme should we use?
2. How do we apply categorization in a repeatable, scalable and manageable
way?
For the first challenge, I would recommend you use your taxonomy
(you do have one, right?)
The second challenge is a bit more difficult but is addressed later in
this deck
13. Categories to use
Proposal: Start with your own taxonomy and its vocabularies as the
categories into which search terms are grouped
Some searches will not fit into any of these categories, so you can
anticipate the need to add further categories
As an aside, this exercise actually provides a great measurement
tool for your taxonomy
• You can quantitatively assess the percent of your users’ language that is
classifiable with your taxonomy
• A number you may wish to drive up over time (through evolution of your
taxonomy)
14. Automating categorization
Now we turn to the hairier challenge – how can we categorize
search terms?
To describe the problem, we have:
1. A set of categories, which may be hierarchically related (most taxonomies
are)
2. A set of search terms, as entered by users, that need to be assigned to
those categories
[Diagram: a column of search terms on one side, a set of (possibly hierarchical) categories on the other, with a “?” for the as-yet-unknown mapping between them]
15. Automating categorization, p2
The proposed solution is based on a couple of concepts:
1. You can think of this categorization problem as search!
2. You are taking each search term and searching in an index in which the
potential search results are categories!
Question: What is the “body” of what you are searching?
Answer: Previously-categorized search terms!
Using this approach, you can consider the set of previously-
categorized search terms as a corpus against which to search
• You can apply all of the same heuristics to this search as any search:
• Word matching (not string matching)
• Stemming
• Relevancy (word ordering, proximity, # of matches, etc.)
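As a rough illustration of “categorization as search”, the sketch below scores each category by token overlap between a new search term and its previously-categorized terms, with a deliberately crude suffix-stripper standing in for real stemming. All category names, terms, and the scoring formula are invented for illustration:

```python
def tokens(text):
    """Lower-case word tokens with a crude suffix-stripping 'stemmer'."""
    out = []
    for word in text.lower().split():
        for suffix in ("ing", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        out.append(word)
    return set(out)

def best_category(search_term, categorized_terms):
    """Score each category by token overlap between the new search term and
    its previously-categorized terms; return (category, score)."""
    query = tokens(search_term)
    best, best_score = None, 0.0
    for category, prior_terms in categorized_terms.items():
        corpus = set().union(*(tokens(t) for t in prior_terms))
        score = len(query & corpus) / len(query) if query else 0.0
        if score > best_score:
            best, best_score = category, score
    return best, best_score

corpus = {
    "Networking": ["vpn setup", "wireless network"],
    "Finance": ["expense report", "payroll dates"],
}
print(best_category("submitting expense reports", corpus))
```

A real implementation would swap in a proper stemmer and a relevancy score that rewards word ordering and proximity, as the slide suggests.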
16. Automating categorization, p3
Here’s a depiction of this solution
[Diagram: uncategorized search terms flow into a matching process (a red oval) that also draws on the categories and previously-categorized terms, producing category assignments]
The red oval represents the “matching” process – it takes as input
the search terms to be categorized and the set of categories along
with previously-matched search terms, and produces as output a set
of categories associated with the new search terms
17. Automating categorization, p4: Bootstrapping
This approach depends on matching to previously-categorized terms
• Every time you categorize a new search term, you expand the set of
categorized terms, enabling more matches in the future
Bootstrapping: You can take the names of the categories (the terms
in your taxonomy) as the first set of “categorized search terms”
• This allows you to start with no search terms having been categorized at all
• You run a first round of matching against the categories to find first-level
matches
• Take those that seem like “good” matches and pull those into the set of
categorized search terms for a second iteration, etc.
• Using this in initial testing resulted in 10% of distinct terms from a month
being associated with at least one category
Another aspect: Any manual categorization of common search terms
will add to the success of categorization
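The bootstrapping step could be sketched as seeding the corpus with the taxonomy's own category names, then pulling in any search term that clears a word-overlap threshold. The threshold value and all names here are illustrative assumptions:

```python
def bootstrap_pass(category_names, search_terms, threshold=0.5):
    """Seed the corpus with category names, then pull in any search term
    whose word overlap with a category name clears `threshold`."""
    corpus = {name: {name.lower()} for name in category_names}
    categorized = {}
    for term in search_terms:
        words = set(term.lower().split())
        for name in category_names:
            name_words = set(name.lower().split())
            overlap = len(words & name_words) / len(words)
            if overlap >= threshold:
                corpus[name].add(term.lower())  # grows the corpus for round 2
                categorized[term] = name
    return corpus, categorized

names = ["Expense Reporting", "Network Access"]
terms = ["expense policy", "vpn", "network access request"]
corpus, categorized = bootstrap_pass(names, terms)
print(categorized)
```

Terms that clear the bar (here, two of three) become part of the categorized corpus for the next iteration; "vpn" would wait for a human to associate it.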
18. Automating categorization, p5: Iterative
[Diagram: the iterative flow – each round's new categorizations are added to the previously-categorized terms that feed the next round of matching]
19. Automating categorization, p5: Iterative
This approach also needs to be applied iteratively
• You start with a set of categorized search terms and a new set of
(uncategorized) search terms
• You then apply this matching to the uncategorized search terms, getting a set
of newly-categorized search terms (with some measure of probability of
“correctness” of the match, i.e., relevancy)
• You pull in the newly-categorized search terms and run the matching process
again
• Each time, as you expand the set of categorized search terms (from a
previous match), you increase the possibility of more matches (in
subsequent matches)
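The iterative process above might be sketched as follows; the toy matcher, the 0.5 relevancy cut-off, and the data are invented stand-ins for whatever relevancy machinery an actual engine provides:

```python
def iterate(corpus, uncategorized, match, rounds=3):
    """Repeatedly match uncategorized terms against the growing corpus of
    categorized terms; each round's matches expand the corpus for the next.
    `match(term, corpus)` returns (category, relevancy) or (None, 0.0)."""
    for _ in range(rounds):
        newly = {}
        for term in list(uncategorized):
            category, relevancy = match(term, corpus)
            if category is not None and relevancy >= 0.5:
                newly[term] = category
        if not newly:
            break  # nothing new matched; further rounds won't change anything
        for term, category in newly.items():
            corpus.setdefault(category, set()).add(term)
            uncategorized.discard(term)
    return corpus, uncategorized

# Toy matcher: a category wins if any of its prior terms shares enough words.
def word_match(term, corpus):
    words = set(term.split())
    for category, prior in corpus.items():
        shared = max(len(words & set(p.split())) / len(words) for p in prior)
        if shared >= 0.5:
            return category, shared
    return None, 0.0

corpus = {"Finance": {"expense report"}}
pending = {"expense policy", "policy handbook", "vpn"}
corpus, pending = iterate(corpus, pending, word_match)
print(sorted(pending))  # only "vpn" remains uncategorized
```

Note the chaining: "policy handbook" only matches in round two, via "expense policy", which itself was categorized in round one.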
20. Automating categorization, p6: Iterative
It will be beneficial to have a human review the set of matches for
each iteration and determine if they are accurate enough
• The measurement of relevancy is intended to do this but would likely only be
partially successful
Over time, using this process, you build up a larger and larger set of
categorized search terms
• This makes it more likely in future iterations that more terms will be
categorizable
21. Automating categorization, p7: No matches
There will always be search terms that do not get matched.
• This may be because the terminology used does not match
• This may be because there are no categories in the global taxonomy that
would be useful for categorization
The first issue would require a human to recognize the association
(thus, categorizing the term and then enabling matches on future
uses of that term)
The second issue would require adding in new categories (not part
of the global taxonomy)
• And then categorizing the term into the newly-added category(ies)
22. Summary
With this approach, we can take a set of search terms at any time
and categorize them (partially) automatically
• Over time, the accuracy of the matching will improve through human
review and approval of matches
We then are able to relate these information needs to a variety of
other pieces of data:
• Volume of content available to users – significant mismatches can highlight
need for new content
• Rating of content in these categories – can highlight that a particular area of
interest has content but it isn’t quality content
• Downloads of content in these categories – could highlight navigational
issues (e.g., when a category is much more highly represented in search
than in downloads)
This does not require directly working with end-users and is scalable
23. Additional benefits: Measuring your taxonomy
As mentioned earlier, part of the challenge will be that there will be
terms that do not match the starting categories (i.e., the global
taxonomy)
This actually highlights some valuable insight obtainable from this:
• We can identify gaps in our taxonomy (terms requiring new categories)
• We can identify areas of our taxonomy where we have many search terms
associated with a taxonomy term and consider if we need to either add or
split search terms in order to better match our users’ real language
• We can identify areas of the taxonomy that are of little use in terms of the
language used by our users
24. Additional benefits: Linguistic statistics
Word counts – independent of term usage, what are the most common
individual words?

Word         Distinct Terms   Searches
management   3128             8283
sap          1931             3873
strategy     1414             3728
business     1558             3599
it           1343             2992
process      1515             2920
data         1264             2899
project      1249             2823
model        1296             2791
plan         987              2170

Word networks – we can understand the inter-relationships between
individual words (which pairs occur commonly together, which words
occur commonly for a given word)
These are not as much about information needs as about understanding the language
users use (so this insight can help shape categorization)
These are also very useful to prioritize your efforts in reviewing your search logs
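A sketch of how the word counts and word-pair co-occurrences could be derived from a raw list of search terms (toy data, not the log behind the table above):

```python
from collections import Counter
from itertools import combinations

def word_stats(search_terms):
    """Per-word frequency and word-pair co-occurrence counts over a log."""
    words, pairs = Counter(), Counter()
    for term in search_terms:
        seen = sorted(set(term.lower().split()))  # dedupe words within a term
        words.update(seen)
        pairs.update(combinations(seen, 2))  # pairs in sorted order
    return words, pairs

log = ["project plan", "project management", "data management", "plan"]
words, pairs = word_stats(log)
print(words.most_common(3))
print(pairs.most_common(2))
```

Weighting each word by its term's search count (rather than once per distinct term) would give the "Searches" column of the table.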
25. Additional benefits: Comparing to your content space
With the statistics described in the previous slide, you could
conceivably run the same analysis over your “content space” and
compare the two.
For example, derive the statistics for the titles of content available in
your search.
• Do you find significant differences? This could represent differences between the
names people apply to things and what they expect to use to find the content.
Another interesting angle is to use other controlled lists as the
matched terms in a category:
• People names (I applied this and found about 8% of terms match a person's
name)
• Client names
27. The Problem
Search sucks!
Yes, the common refrain from many users – “search doesn’t return
what I’m looking for” or “I can never find what I’m looking for”
There are many tools available to improve the users’ experience,
including:
• Improving the UI
• Improving the content included
• Manipulating settings in the engine to modify relevancy
calculations, possibly even the engine itself
The challenge for many of these is, once you make a change, how
do you know it has improved the results?
28. A solution?
One way to assess the impact is to have a set of users perform
either a set of pre-defined searches or a set of their own searches
and then evaluate the quality of results
The challenge with this is that it is very labor-intensive, takes a
long calendar time, and is hard to do iteratively.
An alternative could be to automate this evaluation!
It is important to keep in mind that this is not about the relevancy of
the results or determining whether the engine is returning the
“right” items
• It’s about assessing the user-perceived quality of a set of
results given a set of criteria for a search
29. Automating evaluation
The idea is to automate some of the analysis of the quality of the
result set by examining properties of the result set
This approach attempts to perform a simple test similar to what a
human user would do in scanning a set of search results
• It uses the data returned by the search engine and displayed on
the first page of results
• It does not do a “deep” review of content
30. The approach
The algorithm takes the following approach:
• For each search term, it executes the query against the search
engine and retrieves the results
‒For each individual result, it calculates a quality score from 0.0 to
1.0 (a higher score implies the result looks like a better result)
‒The individual scores for a search term’s set of results are
averaged to get a single score for that search term
• In addition, the current POC outputs data in a tabular format
including most of the individual elements returned by the search
engine along with the derived score
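The scoring loop described above can be sketched as follows; `fake_search` and `fake_score` are placeholders for a real engine call and a real per-result scorer, and the first-page size of 10 is an assumption:

```python
def term_quality(search, score_result, term, page_size=10):
    """Run one query and average per-result quality scores (0.0-1.0) over
    the first page of results, yielding one score for the search term.
    `search(term)` returns a list of result dicts; `score_result` is any
    per-result scoring function."""
    results = search(term)[:page_size]
    if not results:
        return 0.0  # no results at all is the worst outcome
    return sum(score_result(term, r) for r in results) / len(results)

# Stand-ins for a real engine and scorer.
def fake_search(term):
    return [{"title": f"{term} guide"}, {"title": "unrelated doc"}]

def fake_score(term, result):
    return 1.0 if term in result["title"] else 0.0

print(term_quality(fake_search, fake_score, "vpn"))  # (1.0 + 0.0) / 2 = 0.5
```

Running this over every term in a month's log and tabulating the scores gives exactly the kind of output the POC described here produces.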
31. What are we looking at in assessing quality?
Facets that influence quality
• Focusing primarily on user-visible aspects
[Diagram: the user-visible facets considered – first page, result set size, snippet, title, age, uniqueness of title]
32. What are we looking at in assessing quality?
Factors that influence quality
• Only examining the first page of results
• Similarity / dissimilarity of keywords to title
• Similarity / dissimilarity of keywords to excerpt
• Uniqueness of titles within the result set (just first page)
• Size of total result set
• Age of results
• Looking for specific “known” targets
• (one “cheat”) Presence of keywords in “concepts” identified by
engine
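One possible shape for the per-result scorer, blending a few of the factors listed above. The weights and the 10-result size cut-off are illustrative assumptions, not the values used in this work:

```python
def score_result(term, result, total_hits, page_titles,
                 weights=(0.4, 0.3, 0.2, 0.1)):
    """Blend several user-visible signals into one 0.0-1.0 score.
    Factors: keyword overlap with the title, overlap with the snippet,
    uniqueness of the title on the page, and result-set size."""
    words = set(term.lower().split())
    title_words = set(result["title"].lower().split())
    snippet_words = set(result.get("snippet", "").lower().split())
    title_match = len(words & title_words) / len(words)
    snippet_match = len(words & snippet_words) / len(words)
    unique = 1.0 if page_titles.count(result["title"]) == 1 else 0.0
    set_size = 1.0 if total_hits >= 10 else total_hits / 10.0
    factors = (title_match, snippet_match, unique, set_size)
    return sum(w * f for w, f in zip(weights, factors))

titles = ["VPN setup guide", "VPN setup guide", "Remote access"]
result = {"title": "VPN setup guide", "snippet": "how to set up the vpn"}
print(score_result("vpn setup", result, total_hits=250, page_titles=titles))
```

Because the weights are parameters, they can be tuned later against human judgments, which is exactly the adjustment discussed under validation.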
33. What are we looking at in assessing quality?
Others that may be explored
• Balance across sources of content (does it match overall ratio?)
• Ratings of individual results
• Web domain of content (following an internet expectation that “some sources are
better than others”)
• Match of terms could be altered to consider synonyms
• Examining taxonomy values
‒ Could apply matching to taxonomy values?
‒ Could be a “bonus” to items that have taxonomy?
• May want to make weights (e.g., impact of age) consider source or class of
content
• Currently, in our search engine, best bets are automatically included.
‒ Would prefer to have them not included to see where they end up organically.
• Also, in our search engine, the exact order on a page has not been replicated so
we can’t include the exact order as a factor
34. Validating the approach
Does this reflect how a human user would perceive the quality?
• This idea seems reasonable, but do we really have a way to
determine if it is valid?
‒Or, do we run the risk that this would lead to “local maxima” for
the factors measured without meaningfully improving the user's
experience?
• So far, I have 2 independent ways to assess this:
‒Comparing the results of this against a human assessment
‒Comparing the results of this against other factors that have been
used as indicators of quality in the past
35. Validating the approach, p2
Comparing against a human assessment
• One of our on-going operations in GCKM is to review the quality of
results for a very small number of terms
‒The chart below takes the output of the most recent such review for a
subset of our “super search terms” and compares it against the
programmatically calculated quality
‒There is at least a correlation between the automated score (the Y
axis) and the manual score (the X axis)
[Chart: automated score vs. manual score; trend line y = 0.2781x + 0.3826, R² = 0.5803]
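The R² values quoted on these validation charts are ordinary coefficients of determination for a least-squares fit. A self-contained sketch of the computation, using made-up score pairs rather than the data behind the chart above:

```python
def r_squared(xs, ys):
    """Coefficient of determination for a least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical (manual score, automated score) pairs.
manual = [0.2, 0.4, 0.5, 0.7, 0.9]
automated = [0.40, 0.52, 0.51, 0.58, 0.63]
print(round(r_squared(manual, automated), 3))
```

An R² near 0.58, as on the chart, means the automated score explains a bit over half of the variation in the manual judgments: aggregate agreement, with plenty of per-term scatter.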
36. Validating the approach, p3
Comparing against searches/term
• Within our search program, we use the ratio of searches per visit
for a term as an indicator of the quality of the results
‒The more pages of results a user looks at for a term, the harder it
is for the user to find what they are looking for
‒The following chart displays a comparison between searches/visit
(X-axis) and the automated quality score (Y-axis)
‒Again, we can see that there is a correlation, though perhaps not
as strong as compared to the manual review
[Chart: searches/visit vs. automated quality score; trend line y = -0.6857x + 55.234, R² = 0.5225]
37. Validating the approach, p4
Summing up
• At this point, I am confident that the quality assessment we are
producing automatically is reflecting the user’s general experience.
‒On individual items, it can vary significantly but in aggregate it
appears to be valid
‒I have not yet dug into this but the automation enables the
weights of each factor to be adjusted and it’s possible that we can
get the automated score closer still to the “real” quality of results
through adjusting weights
38. Additional benefits of this tool
Better analysis
• Given that this utility can output data in a spreadsheet format, this
presents some other capabilities
‒Estimate total “search impressions” for specific targets
• Analyze “search impressions” vs. usage
‒Analyze spread of returned results across sources
‒Analyze quality along a variety of dimensions (source,
taxonomy values, etc.)
‒Comparing results sets between terms that should show
similar results
• E.g., how similar are the results really for two synonyms?
‒Also, comparing result sets along a temporal dimension
• How much change is there from one month (week) to the next?
‒Analyzing factors by depth into the “long tail”
‒Evaluating the quality of results for auto-complete terms
39. Quality of results split by taxonomy on the content
Better analysis - examples
• Quality of results averaged over the service area assigned to
content
[Chart: “Quality by Service Area of content” – average quality for Enterprise Applications, Human Capital (Consulting), Outsourcing, Strategy & Operations, and Technology Integration, shown against the overall average]
40. Quality of results by depth into the “long tail”
Better analysis - examples
• A chart of the quality of the result pages by how far into the long
tail a search term is
[Chart: “Quality by Depth into the ‘long tail’” – quality by term rank from 0 out to about 17,500; trend line y = 55.685x^-0.14, R² = 0.5253]
41. Quality over time – comparing before and after an upgrade
Better analysis - examples
• This chart shows the # of terms by their change in quality through
an upgrade of our search engine – overall change was +2%!
[Chart: “Change in Quality through an upgrade” – histogram of the number of terms by their percentage change in quality, ranging from about -46% to +81%; terms right of zero labeled “Better”, left of zero “Worse”]
42. And, finally
For more about search analytics, I would highly recommend:
• “Search Analytics for your Site” by Lou Rosenfeld
• www.searchtools.com – edited by Avi Rappoport
Also, you can find my own writings on search analytics (along with a
variety of other KM topics) on my blog:
• blog.leeromero.org