As the volume of content continues to grow exponentially, helping search engines understand the context and topical themes within your site is increasingly important. This deck covers some of those concepts, along with ways to utilise them in your marketing strategy.
8. Enable gzip compression via caching plugins, .htaccess or via compression plugins
Polysemy in linguistics is problematic
11. But I will be talking about some concepts covering:
01 Data Science
02 Information Retrieval
03 Algorithms
04 Linguistics
05 Information Architecture
06 Library Science
38. There are at least as many niche areas of Information Retrieval:
• Mobile IR
• Contextual Search
• Natural Language Processing
• Conversational Search
• Similarity Search
• Recommender Systems
• Library Science
47. Structured versus unstructured data
• Structured data – high degree of organization
• Readily searchable by simple search engine algorithms or known search operators (e.g. SQL)
• Logically organized
• Often stored in a relational database
60. There are still many open challenges in natural language processing
62. Text cohesion
• Cohesion is the grammatical and lexical linking within a text or sentence that holds a text together and gives it meaning.
• It is related to the broader concept of coherence. (Wikipedia)
63. "Topic Modelling"
According to Wikipedia: "In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract 'topics' that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body".
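A minimal sketch of topic modelling with LDA in gensim; the toy documents and num_topics=2 are illustrative assumptions, not from the deck:

```python
from gensim import corpora, models

# Toy corpus: each document is already tokenized and lowercased.
docs = [
    ["search", "engine", "ranking", "query", "relevance"],
    ["word", "embedding", "vector", "semantic", "similarity"],
    ["query", "relevance", "ranking", "retrieval"],
    ["vector", "space", "embedding", "word"],
]

dictionary = corpora.Dictionary(docs)           # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words counts

# Discover 2 abstract "topics" across the collection.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```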
80. Typical window size might be 5
Source text: "Writing a list of random sentences is harder than I initially thought it would be"
(The slide shows the same sentence four times, with the context window sliding along it.)
11 words in total (5 left and 5 right of the moving target word) – a sketch follows.
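A small illustrative helper (an assumption, not from the deck) that extracts the ±5-word context window around a target word:

```python
def context_window(tokens, target_index, window=5):
    """Return up to `window` words either side of the target word."""
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left, right

tokens = ("Writing a list of random sentences is harder "
          "than I initially thought it would be").split()
left, right = context_window(tokens, tokens.index("harder"))
print(left, "->", "harder", "<-", right)
```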
87. Continuous Bag of Words (CBOW)
Takes a continuous bag of words (the order of words within the bag carries no context) and utilizes a context window of size n (n-gram) to ascertain which words are similar or related, using distances in the vector space (e.g. Euclidean distance) to create vector models and word embeddings
88. Skip-gram model – the opposite of CBOW (continuous bag of words): where CBOW predicts the target word from its surrounding context, skip-gram predicts the surrounding context from the target word.
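A minimal sketch of training both architectures with gensim's Word2Vec; the toy sentences and parameter values are illustrative assumptions:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["writing", "a", "list", "of", "random", "sentences", "is", "harder",
     "than", "i", "initially", "thought", "it", "would", "be"],
]

# sg=0 -> CBOW (predict target from context); sg=1 -> skip-gram (predict context from target)
cbow = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)

print(cbow.wv["fox"][:5])               # first 5 dimensions of an embedding
print(skipgram.wv.most_similar("fox"))  # nearest neighbours in vector space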
92. 01 Word2Vec – single words; word embeddings
02 Doc2Vec – words & meta data; word embeddings and document embeddings
03 Sentence2Vec – chunks of words; more context available
04 Paragraph2Vec – full paragraphs; even more context and semantics
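A minimal gensim Doc2Vec sketch showing a document embedding learned alongside word embeddings; the toy documents, tags and parameters are assumptions:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["cutlery", "forks", "knives", "spoons"], tags=["doc0"]),
    TaggedDocument(words=["search", "engine", "ranking", "query"], tags=["doc1"]),
]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
print(model.dv["doc0"][:5])   # document embedding
print(model.wv["forks"][:5])  # word embedding learned alongside
```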
93. In order to understand what words in documents constitute "relevance" to a query
95. GloVe: Global Vectors for Word Representation
• What is GloVe?
• "GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space."
• (https://nlp.stanford.edu/projects/glove/)
96. Linear Substructures in GloVe
• Sometimes more than one word pair is needed to understand meaning
• Particularly when the words are opposites of each other (e.g. man and woman)
• Adding additional word pairs provides further semantic hints and context for understanding the meaning of concepts (see the sketch below)
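A sketch of exploring those linear substructures with pretrained GloVe vectors via gensim's downloader. The model name "glove-wiki-gigaword-50" is one of gensim's published datasets; using king − man + woman ≈ queen as the example is an assumption about which substructure to illustrate:

```python
import gensim.downloader as api

# Downloads the pretrained GloVe vectors on first run (~66 MB).
glove = api.load("glove-wiki-gigaword-50")

# The classic linear substructure: king - man + woman ~ queen
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Opposite pairs like man/woman sit along a consistent direction in the space.
print(glove.similarity("man", "woman"))
```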
101. Wikipedia is a gold mine for IR researchers – each page is considered a concept
102. Similarity and Relatedness
Similarity – words that mean the same or nearly the same
Relatedness – words that live together within a topic / co-occur in the same corpora / collection / sub-section of a collection
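Similarity between embeddings is usually measured with cosine similarity; a minimal self-contained sketch, where the 3-dimensional "embeddings" for coast and shore (a pair used later in the deck) are toy values:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

coast = [0.7, 0.2, 0.5]   # illustrative toy vectors
shore = [0.6, 0.3, 0.5]
print(cosine_similarity(coast, shore))  # close to 1.0 -> highly similar
```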
113. Probably BM25F is used for web pages
BM25F allows for web pages which have structure (additional fields), compared with normal flat text output (e.g. from text files)
Takes into consideration elements such as page title, meta data, sections, footers, headers, anchor text
Adds weights for different elements on a page (a plain-BM25 sketch follows)
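For intuition, a sketch using the rank_bm25 package's plain BM25 scorer. Note the hedge: BM25F extends BM25 with per-field weights (title, anchors, etc.), which BM25Okapi does not model; the toy corpus is an assumption:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "bm25f weights page title and anchor text separately".split(),
    "flat text files have no field structure".split(),
    "the quick brown fox jumps over the lazy dog".split(),
]

bm25 = BM25Okapi(corpus)
query = "page title anchor text".split()
print(bm25.get_scores(query))  # one relevance score per document
```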
116. Semi-structured data
• Hierarchical nature of a website
• Tree structure
• Well sectioned and including clear containers and meta headings
• An ontology map between semi-structured and structured data
117. Lexical "nyms" (see the WordNet sketch below)
Antonym – the opposite meaning
Synonym – the same meaning
Meronym – part of something else (part/whole relations) – e.g. finger (hand)
Hyponym – a subset of something else – e.g. fork (cutlery)
Hypernym – a superset (superordinate) – e.g. colour is a hypernym
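These relations can be explored programmatically with NLTK's WordNet interface; a sketch using the slide's fork/cutlery and finger/hand examples (the exact synset names like 'fork.n.01' assume WordNet's standard inventory; run nltk.download('wordnet') once first):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

fork = wn.synset('fork.n.01')
print(fork.hypernyms())        # the superset (cutlery)
print(fork.hyponyms())         # more specific kinds of fork

finger = wn.synset('finger.n.01')
print(finger.part_holonyms())  # the whole that a finger is part of (hand)

good = wn.synset('good.a.01').lemmas()[0]
print(good.antonyms())         # the opposite meaning (bad)
```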
131. Cancel your noise
Unstructured data is voluminous
Filled with irrelevance
Lacks focus
Riddled with stopwords
Lots of meaningless text and further ambiguating jabber
134. Use well organised hyponyms (Hyponymy and Hypernymy)
• Cutlery — hypernym
  • Spoons — hyponym + hypernym
    • Dessert — (co-)hyponym
    • Tea — (co-)hyponym
    • Table — (co-)hyponym
  • Forks — hyponym + hypernym
  • Knives — hyponym + hypernym
    • Carving — (co-)hyponym
    • Steak — (co-)hyponym
    • Butchers — (co-)hyponym
Simple unordered list with children
135. Image alt tags (and image title tags) help with disambiguation too
136. Stemming & Lemmatization
Both aim to take a word back to its common base form
Avoid keyword stuffing… be aware of stemming and lemmatization
138. Tables are relational databases too – use liberally (with headers) (see the sqlite3 sketch below)

ID | Event Name         | Event Type | Event City | Event Country
1  | Ungagged Las Vegas | Conference | Las Vegas  | US
2  | Ungagged London    | Conference | London     | UK
3  | State of Digital   | Conference | London     | UK
4  | Brighton SEO       | Conference | Brighton   | UK
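To make the "tables are relational databases" point concrete, a sketch loading the slide's event table into sqlite3 and querying it; the schema and query are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    name TEXT, type TEXT, city TEXT, country TEXT)""")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [(1, "Ungagged Las Vegas", "Conference", "Las Vegas", "US"),
     (2, "Ungagged London", "Conference", "London", "UK"),
     (3, "State of Digital", "Conference", "London", "UK"),
     (4, "Brighton SEO", "Conference", "Brighton", "UK")])

# Structured data is readily searchable with known operators (SQL).
for (name,) in conn.execute("SELECT name FROM events WHERE city = 'London'"):
    print(name)
```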
163. Sources and References
• Kira Radinsky TEDx Talk – https://www.youtube.com/watch?v=gAifa_CVGCY
• Stop Word Library Example – https://sites.google.com/site/kevinbouge/stopwords-lists
• Image credit: Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O'Reilly Media Inc.
• The work of: Radinsky, K., 2012, December. Learning to predict the future using Web knowledge and dynamics. In ACM SIGIR Forum (Vol. 46, No. 2, pp. 114-115). ACM.
• http://9ol.es/porter_js_demo.html
164. Sources and References
• Lohar, P., Ganguly, D., Afli, H., Way, A. and Jones, G.J., 2016. FaDA: Fast document aligner using word embedding. The Prague Bulletin of Mathematical Linguistics, 106(1), pp.169-179.
• https://en.wikipedia.org/wiki/List_of_locations_in_Australia_with_an_English_name
• Barbara Plank | Keynote – Natural Language Processing: https://www.youtube.com/watch?v=Wl6c0OpF6Ho
165. Further Reading
• https://github.com/Hironsan/awesome-embedding-models
• https://nlp.stanford.edu/IR-book/html/htmledition/document-representations-and-measures-of-relatedness-in-vector-spaces-1.html
• https://www.youtube.com/watch?time_continue=790&v=wI5O-lYLBCw
• https://en.wikipedia.org/wiki/Euclidean_distance
• Ibrahim, O.A.S. and Landa-Silva, D., 2016. Term frequency with average term occurrences for textual information retrieval. Soft Computing, 20(8), pp.3045-3061.
166. Further Reading
• Lotfi, A., Bouchachia, H., Gegov, A., Langensiepen, C. and McGinnity, M., 2018. Advances in Computational Intelligence Systems.
• https://www.researchgate.net/post/What_is_the_difference_between_TFIDF_and_term_distribution_for_feature_selection
• https://radimrehurek.com/gensim/models/word2vec.html
• McDonald, R., Brokos, G.I. and Androutsopoulos, I., 2018. Deep relevance ranking using enhanced document-query interactions. arXiv preprint arXiv:1809.01682.
167. Further Reading
• Boyd-Graber, J., Hu, Y. and Mimno, D., 2017. Applications of topic models. Foundations and Trends® in Information Retrieval, 11(2-3), pp.143-296.
• https://nlp.stanford.edu/projects/glove/
• https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
• Sherkat, E. and Milios, E.E., 2017, June. Vector embedding of Wikipedia concepts and entities. In International conference on applications of natural language to information systems (pp. 418-428). Springer, Cham.
169. Precision and Recall
Judged against a GOLD STANDARD of truly relevant documents:
• Lots of results inaccurately deemed highly relevant retrieved, and lots of results inaccurately deemed irrelevant not retrieved (low precision and low recall)
• Maybe not enough relevant documents exist to fetch much here
• Lots of documents came back but not enough highly relevant – a wide net was cast but not many of the right type of fish were caught (high recall, low precision)
• Many highly relevant docs returned to meet the informational need (high precision, high recall)
• Results were highly relevant but not enough came back – maybe being "too picky" (high precision, low recall)
https://medium.com/@klintcho/explaining-precision-and-recall-c770eb9c69e9
Manning, C.D. and Schütze, H., 1999. Foundations of Statistical Natural Language Processing. MIT Press.
172. You should be aware "indexing" and "ranking" are two very separate things
173. Example context window size 3 (see the sketch after slide 181)

Source Text                                 | Training Samples
The quick brown fox jumps over the lazy dog | (the, quick), (the, brown), (the, fox)
The quick brown fox jumps over the lazy dog | (quick, the), (quick, brown), (quick, fox), (quick, jumps)
The quick brown fox jumps over the lazy dog | etc.
The quick brown fox jumps over the lazy dog | etc.
175. Stemming (Popular stemmer is PorterStemmer)
• https://github.com/johnpcarty/mysql-porter-stemmer/blob/master/porterstemmer.sql
Conditions Suffix Replacement Examples
-------------------------- ------- ------------- -----------------------
(m>1) al NULL revival -> reviv
(m>1) ance NULL allowance -> allow
(m>1) ence NULL inference -> infer
(m>1) er NULL airliner-> airlin
(m>1) ic NULL gyroscopic -> gyroscop
(m>1) able NULL adjustable -> adjust
(m>1) ible NULL defensible -> defens
(m>1) ant NULL irritant -> irrit
(m>1) ement NULL replacement -> replac
(m>1) ment NULL adjustment -> adjust
(m>1) ent NULL dependent -> depend
(m>1 and (*<S> or *<T>)) ion NULL adoption -> adopt
(m>1) ou NULL homologou-> homolog
(m>1) ism NULL communism-> commun
(m>1) ate NULL activate -> activ
(m>1) iti NULL angulariti -> angular
(m>1) ous NULL homologous -> homolog
(m>1) ive NULL effective -> effect
(m>1) ize NULL bowdlerize -> bowdler
176. WordSim353 Dataset – some word pairs with low similarity or relatedness
Word 1 Word 2 Human (mean)
king cabbage 0.23
professor cucumber 0.31
chord smile 0.54
noon string 0.54
rooster voyage 0.62
sugar approach 0.88
stock jaguar 0.92
stock life 0.92
monk slave 0.92
lad wizard 0.92
delay racism 1.19
stock CD 1.31
drink ear 1.31
stock phone 1.62
holy sex 1.62
production hike 1.75
precedent group 1.77
stock egg 1.81
energy secretary 1.81
month hotel 1.81
forest graveyard 1.85
cup substance 1.92
possibility girl 1.94
cemetery woodland 2.08
glass magician 2.08
cup entity 2.15
Wednesday news 2.22
direction combination 2.25
177. Coast and Shore Example
• Coast and shore have a similar meaning
• They co-occur in first and second level relatedness documents in a collection
• They would receive a high score in similarity
181. Context window (n = window size = 2) (2 words either side)

Source Text                                 | Training Samples
The quick brown fox jumps over the lazy dog | (the, quick), (the, brown)
The quick brown fox jumps over the lazy dog | (quick, the), (quick, brown), (quick, fox)
The quick brown fox jumps over the lazy dog | (brown, the), (brown, quick), (brown, fox), (brown, jumps)
The quick brown fox jumps over the lazy dog | (fox, quick), (fox, brown), (fox, jumps), (fox, over)

(A sketch that generates these pairs follows.)
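A self-contained sketch that reproduces the training samples in the table above; the function name is illustrative:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs, `window` words either side."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
for pair in skipgram_pairs(tokens, window=2)[:9]:
    print(pair)  # ('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...
```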
182. Other areas of IR, including… but not limited to:
• Mobile IR
• Contextual IR
• Natural language processing
• Conversational search
• Similarity search
• Recommender systems
• Image IR
• Music IR
183. Recall and Precision in IR
Precision – how many of the returned results are the best, most relevant results for the query
Recall – how many of all the relevant results were returned for the query
(A worked sketch follows.)
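A minimal worked sketch of the two measures over sets of document ids; the ids themselves are illustrative:

```python
relevant = {"d1", "d2", "d3", "d4"}  # gold standard: truly relevant docs
retrieved = {"d1", "d2", "d5"}       # what the engine returned

true_positives = relevant & retrieved
precision = len(true_positives) / len(retrieved)  # 2/3: how much of what came back was relevant
recall = len(true_positives) / len(relevant)      # 2/4: how much of the relevant set came back
print(precision, recall)
```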
185. Increase Recall – 2 Main Methods

Query Relaxation – take bits away from the query:
• Ignore stop words in the query
• Relax specificity (remove specific terms)
• Use a lexical database (a "knowledge graph", e.g. WordNet) to find a more general term (hypernym – superset)
• Use a Part of Speech tagger to identify the structure of the query & expand nouns (things)
• Preserve the head noun and strip modifiers (e.g. dog)
• Use Word2Vec to identify semantics from a vector space using word embeddings from the query – find related and similar terms

Query Expansion / Query Rewriting – add bits to the query:
• Broaden the query
• Expand abbreviations
• Stemming and lemmatization (in reverse)
• Use Word2Vec to find abbreviations in a vector space using semantic similarity
• Use synonyms (same / very similar meanings)
• Use a minimum cosine similarity from Word2Vec as a safety net

(A WordNet-based sketch of both directions follows.)
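A hedged sketch of both directions using NLTK's WordNet interface; the function names are illustrative, not a standard API:

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def expand_query(term):
    """Query expansion: add synonyms of the term."""
    synonyms = {lemma.name() for syn in wn.synsets(term) for lemma in syn.lemmas()}
    return synonyms - {term}

def relax_query(term):
    """Query relaxation: replace the term with more general terms (hypernyms)."""
    return {h.lemmas()[0].name() for syn in wn.synsets(term) for h in syn.hypernyms()}

print(expand_query("dog"))  # e.g. domestic_dog, Canis_familiaris, ...
print(relax_query("dog"))   # e.g. canine, domestic_animal, ...
```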
187. Most popular word embedding tool – probably Word2Vec (new ones are emerging)
188. Stemming & Lemmatization

Stemming
• Runs a series of rules to chop known "stems" off the end of words
• Suffix stripping algorithm
• Often leaves incorrect endings / crude performance (understemming)
• Popular – PorterStemmer (Martin Porter)
• Example: "alumnus" -> "alumnu"

Lemmatization
• Aims to do things properly
• A tool from natural language processing
• Needs a complete vocabulary and morphological analysis to work well
• Also not perfect
• Relies on a lexical knowledge base like WordNet to find the correct base form

(A side-by-side sketch follows.)
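A side-by-side NLTK sketch; the word list is illustrative (run nltk.download('wordnet') once for the lemmatizer):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["alumnus", "adjustable", "running", "geese"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))
# The stemmer crudely strips suffixes ("alumnus" -> "alumnu");
# the lemmatizer uses WordNet to return a real base form ("geese" -> "goose").
```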
190. Part of Speech Tags (Python NLTK Library) – a tagging sketch follows this list
• CC coordinating conjunction
• CD cardinal digit
• DT determiner
• EX existential there (like: "there is" … think of it like "there exists")
• FW foreign word
• IN preposition/subordinating conjunction
• JJ adjective "big"
• JJR adjective, comparative "bigger"
• JJS adjective, superlative "biggest"
• LS list marker 1)
• MD modal could, will
• NN noun, singular "desk"
• NNS noun plural "desks"
• NNP proper noun, singular "Harrison"
• NNPS proper noun, plural "Americans"
• PDT predeterminer "all the kids"
• POS possessive ending parent's
• PRP personal pronoun I, he, she
• PRP$ possessive pronoun my, his, hers
• RB adverb very, silently
• RBR adverb, comparative better
• RBS adverb, superlative best
• RP particle give up
• TO to go "to" the store
• UH interjection, errrrrrrrm
• VB verb, base form take
• VBD verb, past tense took
• VBG verb, gerund/present participle taking
• VBN verb, past participle taken
• VBP verb, sing. present, non-3d take
• VBZ verb, 3rd person sing. present takes
• WDT wh-determiner which
• WP wh-pronoun who, what
• WP$ possessive wh-pronoun whose
• WRB wh-adverb where, when
191. Stop Word Libraries are Huge
192. "A" words in an EN stop word list (a filtering sketch follows the list):
a a's able about above according accordingly across actually after
afterwards again against ain't all allow allows almost alone along
already also although always am among amongst an and another
any anybody anyhow anyone anything anyway anyways anywhere apart appear
appreciate appropriate are aren't around as aside ask asking associated
at available away awfully
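A minimal sketch of filtering stop words with NLTK's bundled English list; the sample sentence is from the deck:

```python
from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
tokens = "writing a list of random sentences is harder than i thought".split()
content_words = [t for t in tokens if t not in stop_words]
print(content_words)  # ['writing', 'list', 'random', 'sentences', 'harder', 'thought']
```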
193. Anaphora
• Anaphora resolution (AR), which most commonly appears as pronoun resolution, is the problem of resolving references to earlier or later items in the discourse.
• Example: "John found the love of his life", where 'his' refers to 'John'
• "His" refers to John (easily understood by humans but not so much by machines)
Example and definition from: https://nlp.stanford.edu/courses/cs224n/2003/fp/iqsayed/project_report.pdf
194. Cataphora – According to Wikipedia
In linguistics, cataphora is the use of an expression or word that co-refers with a later, more specific expression in the discourse.
Example: "When he arrived home, John went to sleep"
"He" is John, but John was not known when "he" was referred to – so this causes confusion regarding who "he" is
195. Anaphora and Coreference Resolution
• There are some algorithms in place to handle anaphora resolution
• In conversational search this still struggles after a few multi-turn questions
197. NLTK Toolkit
Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O'Reilly Media Inc.
203. How can all this help you as an SEO?
• Consider exploring your topical models by using noindex, admin-only tag clouds to visualise the topics you see
• Utilise relatedness, considering words likely in WordSim353 or other data sources
• Pass supporting topical hints to emphasise the meanings in 1st and 2nd level relatedness
• Consider query intent shift and mobile IR / contextual search as niche fields of IR
• Utilise semi-structured elements to strengthen unstructured pages in noisy (particularly longer) pages
204. How can all this help you as an SEO?
• Take further disambiguation measures on locations used in different countries with the same name
• Be consistent in your naming conventions. Refer to Wikipedia if in doubt – check their redirects on terms
• Provide other semantic clues for entities with the same name – e.g. gender / role / location / geographic clues
• Utilise anchors to emphasise further from 2nd level relatedness pages
• Utilise co-occurring terms from databases considered similar / connected / related